TransformersModel¶
Module: `neurosurfer.models.chat_models.transformers`
Inherits: `BaseModel`
Overview¶
TransformersModel wraps any Hugging Face causal language model for local inference. It handles device placement, optional 4-bit quantisation via `bitsandbytes`, stop-word filtering, and streaming support through `TextIteratorStreamer`.
Highlights¶
- Works with local checkpoints or remote Hub ids (`trust_remote_code=True`)
- Auto-selects GPU (`cuda`) when available, otherwise falls back to CPU
- Detects pre-quantised models to avoid double-quantising
- Supports `<think>` suppression for Qwen/Qwen2-style models when `enable_thinking=False`
- Exposes graceful cancellation through `stop_generation`
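
Streaming relies on the standard `TextIteratorStreamer` pattern from `transformers`: `generate` runs in a background thread while decoded text is consumed from the streamer. Below is a minimal sketch of that pattern; the model id is illustrative, and this shows the underlying Hugging Face API rather than the wrapper's exact internals:

```python
# Sketch of the TextIteratorStreamer pattern: generate() blocks, so it runs
# in a worker thread while the main thread iterates decoded text chunks.
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "Qwen/Qwen2-0.5B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Explain attention briefly.", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

thread = Thread(
    target=model.generate,
    kwargs={**inputs, "streamer": streamer, "max_new_tokens": 64},
)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```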
Constructor¶
TransformersModel.__init__¶
```python
import logging

from neurosurfer.config import config

TransformersModel(
    model_name: str = "openai/gpt-oss-20b",
    *,
    max_seq_length: int = config.base_model.max_seq_length,
    load_in_4bit: bool = config.base_model.load_in_4bit,
    enable_thinking: bool = config.base_model.enable_thinking,
    stop_words: list[str] | None = config.base_model.stop_words,
    verbose: bool = config.base_model.verbose,
    logger: logging.Logger = logging.getLogger(),
    **kwargs,
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"openai/gpt-oss-20b"` | Hugging Face id or local path passed to `from_pretrained`. |
| `max_seq_length` | `int` | `config.base_model.max_seq_length` | Logical context window (recorded on the instance). |
| `load_in_4bit` | `bool` | `config.base_model.load_in_4bit` | Enable 4-bit quantisation via `BitsAndBytesConfig` if the model is not already quantised. |
| `enable_thinking` | `bool` | `config.base_model.enable_thinking` | When `False`, `<think>` sections are stripped and Qwen-style prompts append `/nothink`. |
| `stop_words` | `list[str] \| None` | `config.base_model.stop_words` | Optional stop-sequence list enforced client-side. |
| `verbose` | `bool` | `config.base_model.verbose` | Emit additional logging during model load and streaming. |
| `logger` | `logging.Logger` | `logging.getLogger()` | Logger shared with helper methods. |
| `**kwargs` | any | – | Forwarded to `AutoModelForCausalLM.generate`. |
During initialisation:

- Device (`cuda` vs `cpu`) and dtype are selected automatically.
- `AutoTokenizer.from_pretrained` and `AutoModelForCausalLM.from_pretrained` are called with `trust_remote_code=True`.
- If `load_in_4bit` is `True`, a `BitsAndBytesConfig` is attached unless the checkpoint already contains `quantization_config`.
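
Conceptually, the load path maps onto the following `transformers` calls. This is a sketch of the documented behaviour above, not the wrapper's exact source:

```python
# Sketch of the documented load path: device/dtype selection, pre-quantisation
# detection, and optional 4-bit quantisation. Not TransformersModel's source.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "openai/gpt-oss-20b"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

hf_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
already_quantised = getattr(hf_config, "quantization_config", None) is not None

load_in_4bit = True  # mirrors the constructor flag
quant_config = None
if load_in_4bit and not already_quantised:  # skip to avoid double quantisation
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=dtype)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=dtype,
    quantization_config=quant_config,
)
```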
!!! tip
    `config` is imported from `neurosurfer.config`. See the configuration section for more details.
Usage¶
Basic completion¶
```python
from neurosurfer.models.chat_models.transformers import TransformersModel

model = TransformersModel(
    model_name="/home/nomi/workspace/Model_Weights/Qwen3-4B-unsloth-bnb-4bit",
    load_in_4bit=False,  # checkpoint is already quantised
)

reply = model.ask("Summarise the features of Neurosurfer in two sentences.")
print(reply.choices[0].message.content)
```
Streaming completion¶
```python
stream = model.ask(
    "Provide three bullet points about vector databases.",
    system_prompt="Answer as a helpful engineer.",
    stream=True,
    temperature=0.3,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)  # flush so tokens appear as they stream
```
Cancelling generation¶
```python
stream = model.ask("Write a very long story about transformers.", stream=True)
for chunk in stream:
    if should_cancel_now():  # your own cancellation check (placeholder)
        model.stop_generation()
        break
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
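
A common way to implement this kind of graceful cancellation on top of `generate` is a `StoppingCriteria` that checks a flag between tokens. The sketch below shows that general pattern; it is an illustration, not necessarily `stop_generation`'s exact mechanism:

```python
# Flag-based cancellation using transformers' StoppingCriteria API.
import threading

from transformers import StoppingCriteria, StoppingCriteriaList

class CancelFlag(StoppingCriteria):
    """Stops generation after the current token once the event is set."""

    def __init__(self) -> None:
        super().__init__()
        self.event = threading.Event()

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        return self.event.is_set()

cancel = CancelFlag()
# Pass to generate(): model.generate(..., stopping_criteria=StoppingCriteriaList([cancel]))
# From any thread, request cancellation with: cancel.event.set()
```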
Tips¶
- Quantised checkpoints: For weights already saved in 4-bit/8-bit format, leave `load_in_4bit=False`. The loader inspects `config.json` to avoid double quantisation.
- Thinking tags: Qwen/Qwen2 models output `<think>...</think>` content. Set `enable_thinking=False` (the default) to strip it; set `True` to keep the reasoning trace.
- Custom kwargs: Pass `top_p`, `top_k`, `repetition_penalty`, etc. directly to `ask` and they propagate to `generate` (see the example after this list).
- Stop words: Use `model.set_stop_words(["Observation:", "\nResult:"])` to truncate outputs before those markers.
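
For example, sampling kwargs and stop words can be combined in a single call (the prompt and values here are illustrative):

```python
model.set_stop_words(["Observation:", "\nResult:"])

reply = model.ask(
    "Plan the next step.",
    temperature=0.7,
    top_p=0.9,               # nucleus sampling cutoff
    top_k=40,                # limit the sampling pool
    repetition_penalty=1.1,  # discourage repeated phrases
)
print(reply.choices[0].message.content)
```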
Related models¶
- `UnslothModel` – similar API built on Unsloth's `FastLanguageModel`
- `LlamaCppModel` – run GGUF weights via llama.cpp
!!! note
    mkdocstrings output is temporarily disabled while import hooks are updated.