UnslothModel¶
Module: neurosurfer.models.chat_models.unsloth
Inherits: BaseModel
Overview¶
UnslothModel integrates Unsloth's FastLanguageModel runtime, giving you accelerated inference for LoRA/QLoRA checkpoints without rewriting application code. It mirrors the TransformersModel interface but leans on Unsloth's optimisations for NVIDIA GPUs.
Highlights¶
- Optimised CUDA kernels with optional 4-bit/8-bit quantisation
- Thread-safe streaming with stop signal support
- Optional <think> suppression for Qwen-style models
- Compatible with checkpoints produced by the Unsloth finetuning workflow
Constructor¶
UnslothModel.__init__¶
import logging
from neurosurfer.config import config
UnslothModel(
    model_name: str,
    *,
    max_seq_length: int = config.base_model.max_seq_length,
    load_in_4bit: bool = config.base_model.load_in_4bit,
    load_in_8bit: bool = False,
    full_finetuning: bool = config.base_model.full_finetuning,
    enable_thinking: bool = config.base_model.enable_thinking,
    stop_words: list[str] | None = config.base_model.stop_words,
    verbose: bool = config.base_model.verbose,
    logger: logging.Logger = logging.getLogger(),
    **kwargs,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | – | Local path or Hugging Face id recognised by FastLanguageModel. |
| max_seq_length | int | config.base_model.max_seq_length | Context window passed to Unsloth; must align with how the model was trained. |
| load_in_4bit | bool | config.base_model.load_in_4bit | Load weights in 4-bit mode for memory savings. |
| load_in_8bit | bool | False | Load weights in 8-bit mode instead of 4-bit. |
| full_finetuning | bool | config.base_model.full_finetuning | Enable full-parameter finetuning mode (not needed for inference). |
| enable_thinking | bool | config.base_model.enable_thinking | Keep <think> reasoning traces instead of stripping them. |
| stop_words | list[str] or None | config.base_model.stop_words | Optional stop sequence list enforced client-side. |
| verbose | bool | config.base_model.verbose | Emit additional logs from the wrapper. |
| logger | logging.Logger | logging.getLogger() | Logger shared across helper methods. |
Additional keyword arguments are forwarded to FastLanguageModel.from_pretrained.
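To give a feel for what load_in_4bit and load_in_8bit buy you, the sketch below estimates the weight-only footprint from the parameter count and the bits stored per weight. The helper name weight_memory_gib and the 7B parameter count are illustrative assumptions, not part of the wrapper. The figure ignores quantisation metadata, activations and the KV cache, so treat it as a lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Rough weight-only footprint in GiB.

    Hypothetical helper: ignores quantisation metadata (scales, zero points),
    activations and the KV cache, so real usage will be higher.
    """
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    params = 7e9  # illustrative 7B-parameter model (assumption)
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weight_memory_gib(params, bits):.1f} GiB")
```

Halving the bits per weight halves this estimate, which is why 4-bit loading is the usual choice when GPU memory is the limiting factor.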
Tip
config is imported from neurosurfer.config. Please see the configuration section for more details.
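The enable_thinking parameter controls whether <think> reasoning traces are kept or stripped. The wrapper's own stripping logic is not shown on this page, so the following is only a minimal sketch of what such post-processing could look like, assuming the trace arrives as a complete <think>...</think> block. The function name strip_think is hypothetical.

```python
import re

# Matches a complete <think>...</think> block plus any trailing whitespace.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove complete <think> blocks and leave everything else untouched.

    Hypothetical helper: an unterminated <think> tag is not handled here.
    """
    return _THINK_BLOCK.sub("", text)


raw = "<think>\nThe user wants a greeting.\n</think>\n\nHello there!"
print(strip_think(raw))
```

Text without any <think> tags passes through unchanged, so applying a filter like this to every reply is safe.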
Usage¶
Non-streaming reply¶
from neurosurfer.models.chat_models.unsloth import UnslothModel
model = UnslothModel(
    model_name="/weights/Qwen2.5-7B-Instruct-bnb-4bit",
    enable_thinking=False,
)
response = model.ask("Summarise the README in two bullet points.")
print(response.choices[0].message.content)
Streaming with stop control¶
stream = model.ask(
    "List three benefits of Unsloth.",
    stream=True,
    temperature=0.6,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="")
    if "<END>" in delta:
        model.stop_generation()
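The loop above checks each delta on its own, so a marker such as <END> that arrives split across two chunks would go unnoticed. One possible way around this, sketched here on plain string chunks rather than real response objects, is to hold back the last few characters until it is clear they are not the start of the marker. The generator name stream_until is hypothetical; in real use you would call model.stop_generation() once it returns.

```python
from typing import Iterable, Iterator


def stream_until(chunks: Iterable[str], marker: str) -> Iterator[str]:
    """Yield text up to (not including) marker, even if it spans chunks."""
    held = ""  # trailing text that might still turn out to be a marker prefix
    for chunk in chunks:
        held += chunk
        idx = held.find(marker)
        if idx != -1:
            if idx:
                yield held[:idx]
            return  # marker found: stop consuming the stream
        # Hold back the last len(marker) - 1 characters; emit the rest.
        keep = len(marker) - 1
        safe = held[:-keep] if keep else held
        if safe:
            yield safe
        held = held[len(safe):]
    if held:
        yield held  # stream ended without the marker


if __name__ == "__main__":
    fake_stream = ["Fast. <E", "ND> this part is never printed"]
    for text in stream_until(fake_stream, "<END>"):
        print(text, end="")
    print()
```

The trade-off is a small output delay of at most len(marker) - 1 characters, in exchange for never printing part of the marker.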
Updating stop words at runtime¶
model.set_stop_words(["\nObservation:"])
reply = model.ask("Respond using the format 'Answer: ...' and 'Observation: ...'.")
print(reply.choices[0].message.content)
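Stop words are described as enforced client-side, which suggests the generated text is cut at the first stop sequence rather than the sequence being passed to the backend. The wrapper's actual logic is not shown here; the sketch below illustrates the general idea with a hypothetical truncate_at_stop helper operating on a plain string.

```python
from typing import Sequence


def truncate_at_stop(text: str, stop_words: Sequence[str]) -> str:
    """Cut text at the earliest occurrence of any non-empty stop sequence."""
    positions = [
        pos
        for pos in (text.find(stop) for stop in stop_words if stop)
        if pos != -1
    ]
    return text[: min(positions)] if positions else text


reply_text = "Answer: 42\nObservation: the user asked about life."
print(truncate_at_stop(reply_text, ["\nObservation:"]))
```

Taking the minimum position matters when several stop sequences are configured: the output ends at whichever one appears first in the text, not whichever is listed first.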
Notes¶
- Unsloth currently targets CUDA; running on CPU is not supported.
- The wrapper uses a background thread plus TextIteratorStreamer to support streaming and stop signals; call stop_generation() to cancel long generations.
- Token counts are approximated via the tokenizer when possible; failures fall back to a word-count heuristic.
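The last two notes describe general patterns that can be illustrated without a GPU. The sketch below is not the wrapper's code: StoppableStream stands in for the background thread plus streamer arrangement, using a queue and a threading.Event as the stop signal, and count_tokens shows a tokenizer-first count with a word-count fallback. Both names, and the assumption that a tokenizer exposes an encode method, are illustrative.

```python
import queue
import threading
import time
from typing import Iterator, Optional

_DONE = object()  # sentinel marking the end of the stream


class StoppableStream:
    """Producer thread feeds a queue; the consumer iterates and may stop it."""

    def __init__(self, pieces: list[str]) -> None:
        self._pieces = pieces
        self._queue: queue.Queue = queue.Queue()
        self._stop = threading.Event()

    def _produce(self) -> None:
        # Stand-in for a generate() call: emits pieces until told to stop.
        try:
            for piece in self._pieces:
                if self._stop.is_set():
                    break
                self._queue.put(piece)
                time.sleep(0.01)  # simulate per-token latency
        finally:
            self._queue.put(_DONE)  # always unblock the consumer

    def stop(self) -> None:
        self._stop.set()

    def __iter__(self) -> Iterator[str]:
        worker = threading.Thread(target=self._produce, daemon=True)
        worker.start()
        while (item := self._queue.get()) is not _DONE:
            yield item
        worker.join()


def count_tokens(text: str, tokenizer: Optional[object] = None) -> int:
    """Count tokens with the tokenizer if it works, else count words."""
    try:
        return len(tokenizer.encode(text))  # assumes an encode() method
    except Exception:
        return len(text.split())  # word-count heuristic


if __name__ == "__main__":
    stream = StoppableStream(["Unsloth ", "is ", "fast ", "and ", "lean."])
    for i, piece in enumerate(stream):
        print(piece, end="", flush=True)
        if i == 1:
            stream.stop()  # cancel after the second piece
    print()
    print(count_tokens("a word-count fallback example"))
```

Putting the sentinel in a finally block means the consumer loop always terminates, whether the producer finished, was stopped, or raised.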
Related models¶
- TransformersModel – direct Hugging Face integration
- LlamaCppModel – run GGUF weights via llama.cpp
mkdocstrings output is temporarily disabled while import hooks are updated.