BaseModel¶
Module: neurosurfer.models.chat_models.base
Type: Abstract base class
Overview¶
BaseModel is the foundation for every chat model shipped with Neurosurfer. It standardises request handling, streaming, token accounting, and OpenAI-compatible responses so that downstream agents can work with any backend without branching.
Core capabilities¶
- Creates OpenAI-style
ChatCompletionResponse/ChatCompletionChunkobjects viaask - Handles streaming vs. non-streaming in a single entry point
- Tracks prompt/completion token usage (when provided or estimated)
- Provides helper methods that subclasses can reuse to build deltas, stop chunks, and final responses
- Exposes
stop_generation,reset_stop_signal, andset_stop_wordshooks that concrete implementations respond to
Constructor¶
BaseModel.__init__¶
from neurosurfer.config import config
BaseModel(
*,
max_seq_length: int = config.base_model.max_seq_length,
verbose: bool = config.base_model.verbose,
logger: logging.Logger = logging.getLogger(),
**kwargs,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
max_seq_length | int | config.base_model.max_seq_length | Maximum context window supported by the backend. |
verbose | bool | config.base_model.verbose | When True, subclasses should emit additional debug logs. |
logger | logging.Logger | logging.getLogger() | Logger instance reused across helper methods. |
**kwargs | any | – | Forwarded to subclass initialisation; stored on subclasses as needed. |
Subclasses must call super().__init__ so shared state (model_name, lock, max_seq_length, etc.) is initialised before loading a backend in init_model().
Tip
config is imported from neurosurfer.config. Please see the configuration section for more details.
Public interface¶
ask¶
ask(
user_prompt: str,
*,
system_prompt: str = config.base_model.system_prompt,
chat_history: list[dict] = [],
temperature: float = config.base_model.temperature,
max_new_tokens: int = config.base_model.max_new_tokens,
stream: bool = False,
**kwargs,
) -> ChatCompletionResponse | Generator[ChatCompletionChunk, None, None]
Routes the request to _call (non-streaming) or _stream (streaming) and returns OpenAI-compatible Pydantic models. All subclasses must implement _call and _stream.
Usage:
response = model.ask("Explain RAG in one sentence")
print(response.choices[0].message.content)
for chunk in model.ask("Summarise the docs", stream=True):
delta = chunk.choices[0].delta.content or ""
print(delta, end="")
stop_generation¶
Sets an implementation-defined flag instructing the current generation to stop at the next opportunity. Concrete models should respect this flag inside their generation loops (see the built-in integrations for patterns).
reset_stop_signal¶
Clears any stop signal before a new request runs. Typically called internally by ask.
set_stop_words¶
Stores a stop word list which helpers such as _transformers_consume_stream use to truncate responses without emitting the terminator tokens.
Hooks for subclass authors¶
Concrete implementations must provide the following methods:
init_model(self) -> None: load the underlying model/client/tokenizer._call(...) -> ChatCompletionResponse: run a blocking generation and return the final response._stream(...) -> Generator[ChatCompletionChunk, None, None]: yield partial chunks followed by a terminal stop chunk.stop_generation(self) -> None: honour stop requests from callers.
Optional helpers supplied by BaseModel:
_delta_chunk(call_id, model, content)– build incremental streaming chunks._stop_chunk(call_id, model, finish_reason)– emit the terminating streaming chunk._final_nonstream_response(call_id, model, content, prompt_tokens, completion_tokens)– return a completeChatCompletionResponse._transformers_consume_stream(streamer)– consume tokens from aTextIteratorStreamer, applying stop words and<think>filtering.
Creating a custom model¶
from neurosurfer.models.chat_models.base import BaseModel
from neurosurfer.server.schemas import ChatCompletionResponse
class CustomModel(BaseModel):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.init_model()
def init_model(self):
self.model = load_backend_somehow()
self.tokenizer = load_tokenizer()
self.model_name = "my-custom-model"
def _call(self, **kwargs) -> ChatCompletionResponse:
# Run the backend, produce text, then build the response
content = self._run_backend(**kwargs)
return self._final_nonstream_response(
call_id=self.call_id,
model=self.model_name,
content=content,
prompt_tokens=0,
completion_tokens=0,
)
def _stream(self, **kwargs):
for token in self._backend_stream(**kwargs):
yield self._delta_chunk(self.call_id, self.model_name, token)
yield self._stop_chunk(self.call_id, self.model_name, finish_reason="stop")
def stop_generation(self):
self.backend.cancel_current_request()
See also¶
mkdocstrings output is temporarily disabled while import hooks are updated.