UnslothModel¶

Module: neurosurfer.models.chat_models.unsloth
Inherits: BaseModel

Overview¶

UnslothModel integrates Unsloth's FastLanguageModel runtime, giving you accelerated inference for LoRA/QLoRA checkpoints without rewriting application code. It mirrors the TransformersModel interface but leans on Unsloth's optimisations for NVIDIA GPUs.

Highlights¶

Optimised CUDA kernels with optional 4-bit/8-bit quantisation
Thread-safe streaming with stop signal support
Optional <think> suppression for Qwen-style models
Compatible with checkpoints produced by the Unsloth finetuning workflow

Constructor¶

`UnslothModel.init`¶

from neurosurfer.config import config

UnslothModel(
    model_name: str,
    *,
    max_seq_length: int = config.base_model.max_seq_length,
    load_in_4bit: bool = config.base_model.load_in_4bit,
    load_in_8bit: bool = False,
    full_finetuning: bool = config.base_model.full_finetuning,
    enable_thinking: bool = config.base_model.enable_thinking,
    stop_words: list[str] | None = config.base_model.stop_words,
    verbose: bool = config.base_model.verbose,
    logger: logging.Logger = logging.getLogger(),
    **kwargs,
)

Parameter	Type	Default	Description
`model_name`	`str`	–	Local path or Hugging Face id recognised by `FastLanguageModel`.
`max_seq_length`	`int`	`config.base_model.max_seq_length`	Context window passed to Unsloth; must align with how the model was trained.
`load_in_4bit`	`bool`	`config.base_model.load_in_4bit`	Load weights in 4-bit mode for memory savings.
`load_in_8bit`	`bool`	`False`	Load weights in 8-bit mode instead of 4-bit.
`full_finetuning`	`bool`	`config.base_model.full_finetuning`	Enable full-parameter finetuning mode (not needed for inference).
`enable_thinking`	`bool`	`config.base_model.enable_thinking`	Keep `<think>` reasoning traces instead of stripping them.
`stop_words`	`list[str] or None`	`config.base_model.stop_words`	Optional stop sequence list enforced client-side.
`verbose`	`bool`	`config.base_model.verbose`	Emit additional logs from the wrapper.
`logger`	`logging.Logger`	`logging.getLogger()`	Logger shared across helper methods.

Additional keyword arguments are forwarded to FastLanguageModel.from_pretrained.

Tip

config is imported from neurosurfer.config. Please see the configuration section for more details.

Usage¶

Non-streaming reply¶

from neurosurfer.models.chat_models.unsloth import UnslothModel

model = UnslothModel(
    model_name="/weights/Qwen2.5-7B-Instruct-bnb-4bit",
    enable_thinking=False,
)

response = model.ask("Summarise the README in two bullet points.")
print(response.choices[0].message.content)

Streaming with stop control¶

stream = model.ask(
    "List three benefits of Unsloth.",
    stream=True,
    temperature=0.6,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="")
    if "<END>" in delta:
        model.stop_generation()

Updating stop words at runtime¶

model.set_stop_words(["\nObservation:"])
reply = model.ask("Respond using the format 'Answer: ...' and 'Observation: ...'.")
print(reply.choices[0].message.content)

Notes¶

Unsloth currently targets CUDA; running on CPU is not supported.
The wrapper uses a background thread plus TextIteratorStreamer to support streaming and stop signals—call stop_generation() to cancel long generations.
Token counts are approximated via the tokenizer when possible; failures fall back to a word-count heuristic.

TransformersModel – direct Hugging Face integration
LlamaCppModel – run GGUF weights via llama.cpp

mkdocstrings output is temporarily disabled while import hooks are updated.