
Chat Models API

Every chat model backend in Neurosurfer shares a unified interface through BaseModel. Choose the backend that matches your deployment constraints: cloud APIs, local GPUs, llama.cpp, or remote vLLM servers.
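
Because every backend implements the same BaseModel contract, code written against ask() runs unchanged whichever class you instantiate. A minimal sketch of that interchangeability (the helper is illustrative; the ask() call and the OpenAI-style response shape match the examples below):

def summarise(model, text: str) -> str:
    # `model` can be any BaseModel subclass: OpenAIModel, TransformersModel,
    # UnslothModel, or LlamaCppModel. Each returns an OpenAI-style chat
    # completion object.
    completion = model.ask(f"Summarise in one sentence: {text}")
    return completion.choices[0].message.content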

Supported backends

| Model             | Runtime        | Deploy on                             | Ideal for                                     |
| ----------------- | -------------- | ------------------------------------- | --------------------------------------------- |
| OpenAIModel       | Hosted API     | OpenAI Cloud, LM Studio, Ollama, vLLM | Production-ready quality with minimal setup    |
| TransformersModel | PyTorch        | Local GPU/CPU                         | Running Hugging Face checkpoints directly      |
| UnslothModel      | CUDA (Unsloth) | Local GPU                             | Fast inference for LoRA/QLoRA finetunes        |
| LlamaCppModel     | llama.cpp      | CPU / lightweight GPU                 | GGUF quantised models with a tiny footprint    |

Quick start examples

1. Hosted API (OpenAI-compatible)

from neurosurfer.models.chat_models.openai import OpenAIModel

model = OpenAIModel(model_name="gpt-4o-mini")
completion = model.ask("Explain retrieval augmented generation in one sentence.")

print(completion.choices[0].message.content)
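
The same class can target any OpenAI-compatible server, such as LM Studio, Ollama, or vLLM (see the table above). A hedged sketch: the base_url and api_key parameters below mirror the official OpenAI client and are assumptions here; check your installed version for the exact names.

model = OpenAIModel(
    model_name="llama3.1:8b",               # model name as exposed by the local server
    base_url="http://localhost:11434/v1",   # assumed parameter: Ollama's OpenAI-compatible endpoint
    api_key="not-needed",                   # assumed parameter: local servers typically ignore the key
)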

2. Hugging Face checkpoint (Transformers)

from neurosurfer.models.chat_models.transformers import TransformersModel

model = TransformersModel(
    model_name="/weights/Qwen3-4B-unsloth-bnb-4bit",
    load_in_4bit=False,  # already quantised
)

response = model.ask("List three benefits of local inference.")
print(response.choices[0].message.content)
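
Conversely, a full-precision checkpoint can be quantised at load time, trading a little quality for a much smaller GPU footprint. A sketch reusing the same constructor; the checkpoint path is a placeholder:

model = TransformersModel(
    model_name="/weights/Llama-3.1-8B-Instruct",  # placeholder: full-precision checkpoint
    load_in_4bit=True,  # quantise while loading to fit smaller GPUs
)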

3. llama.cpp GGUF

from neurosurfer.models.chat_models.llamacpp import LlamaCppModel

model = LlamaCppModel(
    model_path="/weights/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=8,
)

answer = model.ask("What are GGUF files?")
print(answer.choices[0].message.content)
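
If the host's core count varies between deployments, deriving n_threads at runtime is a common pattern (same constructor arguments as above):

import os

model = LlamaCppModel(
    model_path="/weights/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=os.cpu_count() or 8,  # fall back to 8 when the count cannot be determined
)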

4. Unsloth accelerated inference

from neurosurfer.models.chat_models.unsloth import UnslothModel

model = UnslothModel(
    model_name="/weights/Qwen3-4B-unsloth-bnb-4bit",
    enable_thinking=False,
)

reply = model.ask("Describe Unsloth in two bullet points.")
print(reply.choices[0].message.content)
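
The enable_thinking flag controls whether Qwen3-style checkpoints emit a reasoning trace before the final answer; the example above disables it for terse output. Flipping it is a one-line change (how the trace surfaces in the response depends on the model's chat template, which is an assumption here):

thinking_model = UnslothModel(
    model_name="/weights/Qwen3-4B-unsloth-bnb-4bit",
    enable_thinking=True,  # let the model emit its reasoning trace before answering
)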

Choosing a backend

  • Use OpenAIModel when you need the highest quality models or want to point at any OpenAI-compatible API.
  • Use TransformersModel when you manage your own Hugging Face checkpoints and have GPU resources.
  • Use UnslothModel for LoRA/QLoRA finetunes optimised with Unsloth’s runtime.
  • Use LlamaCppModel on CPU-first deployments or when you rely on GGUF quantised weights.
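
When the backend is decided by deployment configuration rather than at import time, a small factory keeps call sites backend-agnostic. A sketch using only the constructors shown above:

from neurosurfer.models.chat_models.llamacpp import LlamaCppModel
from neurosurfer.models.chat_models.openai import OpenAIModel
from neurosurfer.models.chat_models.transformers import TransformersModel
from neurosurfer.models.chat_models.unsloth import UnslothModel

def build_model(backend: str, model_ref: str):
    # Map a config string to one of the four backends. model_ref is a hosted
    # model name, a checkpoint directory, or a GGUF path, depending on backend.
    if backend == "openai":
        return OpenAIModel(model_name=model_ref)
    if backend == "transformers":
        return TransformersModel(model_name=model_ref)
    if backend == "unsloth":
        return UnslothModel(model_name=model_ref)
    if backend == "llamacpp":
        return LlamaCppModel(model_path=model_ref)
    raise ValueError(f"Unknown backend: {backend!r}")

model = build_model("llamacpp", "/weights/llama-2-7b-chat.Q4_K_M.gguf")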

See also