RAGAgent
Module: neurosurfer.agents.rag
Overview
RAGAgent is a lightweight, modular retrieval core for Retrieval‑Augmented Generation (RAG). It is:
- Vector-store agnostic: plugs into any store implementing the BaseVectorDB interface.
- Embedder agnostic: works with any model implementing BaseEmbedder.
- LLM/tokenizer agnostic: uses HuggingFace tokenizers when available, falls back to tiktoken if present, and otherwise applies a character-based heuristic to keep prompts under the model's context window.
It performs three primary steps:
1) Retrieve: embed the query and fetch the top‑K most similar chunks.
2) Build context: convert retrieved docs into a joined context string (customizable).
3) Budget & trim: fit the context into the LLM window while leaving room for generation; return a RetrieveResult with a safe max_new_tokens value.
Optionally, you can call run(...) to perform a full generation (retrieve → fill prompt → llm.ask(...), streaming or not).
Package layout
neurosurfer/agents/rag/
├─ __init__.py # exports RAGAgent, RAGAgentConfig, RetrieveResult
├─ config.py # dataclasses for config and results
├─ token_utils.py # tokenizer/tokens fallback + trimming utilities
├─ context_builder.py # format & join retrieved chunks
├─ picker.py # helper to pick top files by grouped hits
└─ agent.py # main agent implementation
Constructor
RAGAgent(
    llm: BaseModel,
    vectorstore: BaseVectorDB,
    embedder: BaseEmbedder,
    *,
    config: RAGAgentConfig | None = None,
    logger: logging.Logger | None = None,
    make_source: Callable[[Doc], str] | None = None,
)
| Parameter | Type | Description |
|---|---|---|
| llm | BaseModel | Any supported chat model. Must expose ask(...) and (ideally) max_seq_length. A tokenizer is optional. |
| vectorstore | BaseVectorDB | Must expose similarity_search(query_embedding, top_k, metadata_filter, similarity_threshold). |
| embedder | BaseEmbedder | Must expose embed(sequence[str], normalize_embeddings=bool) -> list[list[float]]. |
| config | RAGAgentConfig \| None | Retrieval, context-formatting, and budgeting knobs (defaults used when None). |
| logger | logging.Logger \| None | Optional logger. |
| make_source | Callable[[Doc], str] \| None | Overrides how each doc's "source" label is rendered in context (filename, URI, etc.). |
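Because the constructor relies on only a few methods, duck-typed stand-ins are enough to see the contracts in action. The sketch below is an illustration, not part of neurosurfer: `ToyEmbedder`, `ToyVectorStore`, and the `add` method are hypothetical names, the hashing "embedding" is a toy, and returning `(doc, score)` pairs from `similarity_search` is an assumption, since this page does not specify the return shape.

```python
from __future__ import annotations

import math
from typing import Sequence


class ToyEmbedder:
    """Hypothetical stand-in exposing the embed(...) shape described above."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def embed(self, texts: Sequence[str], normalize_embeddings: bool = True) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            # Toy bag-of-words hashing; a real embedder would call a model here.
            for word in text.lower().split():
                vec[sum(map(ord, word)) % self.dim] += 1.0
            if normalize_embeddings:
                norm = math.sqrt(sum(v * v for v in vec)) or 1.0
                vec = [v / norm for v in vec]
            vectors.append(vec)
        return vectors


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store; docs are plain dicts (an assumption)."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], dict]] = []

    def add(self, embedding: list[float], doc: dict) -> None:
        self._rows.append((embedding, doc))

    def similarity_search(self, query_embedding, top_k=5, metadata_filter=None, similarity_threshold=None):
        hits = []
        for embedding, doc in self._rows:
            meta = doc.get("metadata", {})
            # Skip docs whose metadata does not match every filter key.
            if metadata_filter and any(meta.get(k) != v for k, v in metadata_filter.items()):
                continue
            score = cosine(query_embedding, embedding)
            # Skip docs below the similarity floor, if one is set.
            if similarity_threshold is not None and score < similarity_threshold:
                continue
            hits.append((doc, score))
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return hits[:top_k]


embedder = ToyEmbedder()
store = ToyVectorStore()
texts = ["vector databases store embeddings", "bananas are yellow fruit"]
for text, emb in zip(texts, embedder.embed(texts)):
    store.add(emb, {"text": text, "metadata": {"collection": "docs"}})

query_vec = embedder.embed(["what do vector databases store"])[0]
top = store.similarity_search(query_vec, top_k=1, metadata_filter={"collection": "docs"})
print(top[0][0]["text"])
```

Real implementations would subclass BaseVectorDB and BaseEmbedder; the stand-ins only mirror the method shapes listed in the table.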
RAGAgentConfig
from dataclasses import dataclass


@dataclass
class RAGAgentConfig:
    # Retrieval
    top_k: int = 5
    similarity_threshold: float | None = None
    # Output budgeting
    fixed_max_new_tokens: int | None = None
    auto_output_ratio: float = 0.25
    min_output_tokens: int = 32
    safety_margin_tokens: int = 32
    # Context formatting
    include_metadata_in_context: bool = True
    context_separator: str = "\n\n---\n\n"
    context_item_header_fmt: str = "Source: {source}"
    normalize_embeddings: bool = True
    # Tokenizer fallbacks
    approx_chars_per_token: float = 4.0
Parameters
| Parameter | Type | Description |
|---|---|---|
| top_k | int | Default number of chunks to fetch when top_k is not supplied to retrieve(...). |
| similarity_threshold | float \| None | Global similarity floor for retrieval; can be overridden per call. |
| fixed_max_new_tokens | int \| None | Hard cap for generation. If None, the agent derives a cap dynamically after trimming. |
| auto_output_ratio | float | When no fixed cap is provided, the initial share of the remaining window reserved for output before trimming; the cap is refined after trimming. |
| min_output_tokens | int | Guarantees that at least this many tokens remain for generation. |
| safety_margin_tokens | int | Margin held back to avoid overrunning the model window due to tokenizer variance. |
| include_metadata_in_context | bool | When True, prefixes each chunk with context_item_header_fmt.format(source=...). |
| context_separator | str | Separator between chunks in the final context string. |
| context_item_header_fmt | str | Format string for the per-chunk header line when metadata is included. |
| normalize_embeddings | bool | Passed to embedder.embed(...); set to False if your embedder already returns normalized vectors. |
| approx_chars_per_token | float | Heuristic used when the model has no tokenizer and tiktoken is unavailable. About 4 chars/token is a practical default. |
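To make the budgeting knobs concrete, here is one plausible way they could combine. This is a sketch under stated assumptions rather than the agent's actual arithmetic: `plan_budget` is a hypothetical helper, and the order of operations (reserve output, give the rest to context, then hand unused context space back to generation) is inferred from the descriptions above.

```python
from __future__ import annotations


def plan_budget(
    token_budget: int,
    base_tokens: int,
    context_tokens: int,
    *,
    fixed_max_new_tokens: int | None = None,
    auto_output_ratio: float = 0.25,
    min_output_tokens: int = 32,
    safety_margin_tokens: int = 32,
) -> dict[str, int]:
    """Hypothetical budgeting helper; the real agent may differ in detail."""
    # Tokens left once the fixed prompt parts and the safety margin are paid for.
    remaining = max(token_budget - base_tokens - safety_margin_tokens, 0)

    # Reserve output space first: the fixed cap, or a ratio of what remains.
    if fixed_max_new_tokens is not None:
        reserved = fixed_max_new_tokens
    else:
        reserved = int(remaining * auto_output_ratio)
    reserved = min(max(reserved, min_output_tokens), remaining)

    # Context gets the rest and is trimmed down to it if necessary.
    available_for_context = remaining - reserved
    context_used = min(context_tokens, available_for_context)

    # Refine after trimming: space the context did not use goes to generation.
    generation_budget = remaining - context_used
    if fixed_max_new_tokens is not None:
        max_new_tokens = min(fixed_max_new_tokens, generation_budget)
    else:
        max_new_tokens = generation_budget

    return {
        "available_for_context": available_for_context,
        "context_tokens_used": context_used,
        "generation_budget": generation_budget,
        "max_new_tokens": max_new_tokens,
    }


# A 4096-token window, a 200-token base prompt, and 5000 tokens of raw context.
plan = plan_budget(4096, 200, 5000)
print(plan)
```

With these inputs the context is trimmed to 2898 tokens and 966 tokens are left for generation, so the base prompt, context, output, and margin add up to exactly the 4096-token window.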
Results object
RetrieveResult
Returned by retrieve(...) and used by run(...) to fill prompts and set generation limits safely.
| Field | Type | Description |
|---|---|---|
| base_system_prompt | str | System prompt provided to retrieve(...). |
| base_user_prompt | str | User prompt template (before inserting context). |
| context | str | Trimmed context actually used. |
| max_new_tokens | int | Recommended cap for generation after trimming. |
| base_tokens | int | Tokens for system + history + user (no context yet). |
| context_tokens_used | int | Tokens consumed by the trimmed context. |
| token_budget | int | Model's total context window (llm.max_seq_length or a default). |
| generation_budget | int | Remaining tokens for output. |
| docs | list[Doc] | Retrieved docs from the vector store. |
| distances | list[float \| None] | One per doc when the vector store returns distances. |
| meta | dict | Diagnostics (available_for_context, initial cap, margins, etc.). |
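The token fields lend themselves to a one-line budget summary for logs. The helper below is a hypothetical sketch: `summarize_budget` is not part of the library, and a `SimpleNamespace` with made-up numbers stands in for a real RetrieveResult so the example runs on its own.

```python
from types import SimpleNamespace


def summarize_budget(result) -> str:
    """Format the token fields of a RetrieveResult-like object for logging."""
    prompt_tokens = result.base_tokens + result.context_tokens_used
    used_pct = 100.0 * prompt_tokens / result.token_budget
    return (
        f"prompt={prompt_tokens}/{result.token_budget} tokens ({used_pct:.1f}%), "
        f"context={result.context_tokens_used}, "
        f"max_new_tokens={result.max_new_tokens}, "
        f"docs={len(result.docs)}"
    )


# Stand-in object carrying only the fields the helper reads (illustrative values).
fake_result = SimpleNamespace(
    base_tokens=200,
    context_tokens_used=2898,
    token_budget=4096,
    max_new_tokens=966,
    docs=["chunk-1", "chunk-2", "chunk-3"],
)
summary = summarize_budget(fake_result)
print(summary)
```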
Methods
retrieve(...) -> RetrieveResult
result = agent.retrieve(
    user_query="Explain vector databases in simple terms.",
    base_system_prompt="You are a helpful assistant.",
    base_user_prompt="Use the context to answer.\n\n{context}\n\nQuestion: {query}",
    chat_history=[{"role": "user", "content": "Hi!"}],
    top_k=8,
    metadata_filter={"collection": "docs"},
)
You can then use the result to build your own LLM call:
filled = result.base_user_prompt.format(context=result.context, query="Explain vector databases in simple terms.")
response = llm.ask(
    system_prompt=result.base_system_prompt,
    user_prompt=filled,
    temperature=0.3,
    max_new_tokens=result.max_new_tokens,
)
run(...) -> Iterator[str] | str
Runs retrieve → fill → generate for you. When stream=True, it yields generation chunks; otherwise it returns the full string.
for token in agent.run(
    "List the main components of a RAG system.",
    base_system_prompt="You are concise.",
    base_user_prompt="Context:\n{context}\n\nQ: {query}\nA:",
    stream=True,
    temperature=0.2,
):
    # Print each streamed chunk as it arrives, without adding newlines.
    print(token, end="")
Arguments of note (subset):
| Arg | Type | Description |
|---|---|---|
| stream | bool | If True, yields streaming tokens from llm.ask(..., stream=True). |
| top_k | int \| None | Overrides config.top_k for this call. |
| metadata_filter | dict \| None | Forwarded to the vector store. |
| similarity_threshold | float \| None | Per-call similarity floor. |
| temperature | float \| None | Per-call generation temperature (defaults to 0.3). |
| max_new_tokens | int \| None | Per-call cap; if omitted, uses RetrieveResult.max_new_tokens. |
| **llm_kwargs | Any | Forwarded to llm.ask(...) (e.g., stop sequences). |
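Since run(...) returns either a string or an iterator of chunks depending on stream, calling code that accepts both modes may want to normalize the result. A minimal sketch follows; `collect_output` and `fake_run` are hypothetical names, and `fake_run` only imitates the two return shapes so the example runs without a model.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


def collect_output(output: str | Iterable[str]) -> str:
    """Return the full text whether the call was streamed or not."""
    if isinstance(output, str):
        return output
    # Streaming mode: consume the iterator and join the chunks.
    return "".join(output)


def fake_run(stream: bool = False) -> Iterator[str] | str:
    """Hypothetical stand-in mimicking the two return shapes of run(...)."""
    chunks = ["Retriever, ", "context builder, ", "generator."]
    return iter(chunks) if stream else "".join(chunks)


streamed = collect_output(fake_run(stream=True))
plain = collect_output(fake_run(stream=False))
print(streamed == plain)
```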
Token handling & fallbacks
TokenCounter ensures prompts stay within the model window even without a tokenizer:
- HuggingFace path: if llm.tokenizer exists, we use it (apply_chat_template, fast counting, and exact trimming).
- tiktoken path: if no tokenizer is available but tiktoken is installed, we use cl100k_base for robust counting & trimming.
- Heuristic path: otherwise, we estimate tokens via len(text) / approx_chars_per_token and binary-search on character length to trim precisely enough.
This design keeps budgeting safe across OpenAI-style clients, HF models, and custom LLM wrappers; on the heuristic path, safety_margin_tokens absorbs the estimation error.
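The three-tier fallback and the character-level binary search can be sketched as follows. This is not the library's TokenCounter: `make_counter` and `trim_to_budget` are hypothetical names, the `allow_tiktoken` flag exists only to make the demo reproducible, and the trimming assumes the token count never decreases as the text prefix grows.

```python
from __future__ import annotations

import math
from typing import Callable


def make_counter(
    tokenizer=None,
    approx_chars_per_token: float = 4.0,
    allow_tiktoken: bool = True,
) -> Callable[[str], int]:
    """Pick the best available token counter (hypothetical helper)."""
    if tokenizer is not None:
        # HuggingFace-style path: encode(...) returns a list of token ids.
        return lambda text: len(tokenizer.encode(text))
    if allow_tiktoken:
        try:
            import tiktoken

            enc = tiktoken.get_encoding("cl100k_base")
            # disallowed_special=() treats special-token strings as plain text.
            return lambda text: len(enc.encode(text, disallowed_special=()))
        except Exception:
            pass  # tiktoken missing or the encoding could not be loaded
    # Heuristic path: estimate from character length.
    return lambda text: math.ceil(len(text) / approx_chars_per_token)


def trim_to_budget(text: str, max_tokens: int, count: Callable[[str], int]) -> str:
    """Return the longest prefix of text whose token count fits max_tokens."""
    if count(text) <= max_tokens:
        return text
    lo, hi = 0, len(text)  # invariant: text[:lo] fits, text[:hi] does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if count(text[:mid]) <= max_tokens:
            lo = mid
        else:
            hi = mid
    return text[:lo]


# Force the heuristic path so the result does not depend on installed packages.
count = make_counter(allow_tiktoken=False)
trimmed = trim_to_budget("x" * 100, max_tokens=10, count=count)
print(len(trimmed))  # 40 characters fit in 10 tokens at 4 chars/token
```

The binary search needs only O(log n) counting calls, which matters when each call runs a real tokenizer over a long context.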
Context formatting
Context serialization is handled by ContextBuilder. By default it:
- Adds a header line like Source: <label> (when include_metadata_in_context=True), where <label> is produced by make_source(doc).
- Joins chunks with context_separator (default: "\n\n---\n\n").
You can override make_source in the agent constructor or swap the builder if you need a different format (citations, bullet lists, etc.).
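The default formatting described above can be approximated in a few lines. In this sketch, `build_context` is a hypothetical function rather than the library's ContextBuilder, docs are plain dicts with `text` and `metadata` keys (an assumption about the Doc shape), and the fallback label `"unknown"` is invented for illustration.

```python
from __future__ import annotations

from typing import Callable


def build_context(
    docs: list[dict],
    *,
    make_source: Callable[[dict], str] | None = None,
    include_metadata_in_context: bool = True,
    context_separator: str = "\n\n---\n\n",
    context_item_header_fmt: str = "Source: {source}",
) -> str:
    """Join retrieved chunks into one context string (illustrative only)."""
    if make_source is None:
        # Assumed default: label each chunk by its filename metadata.
        make_source = lambda doc: doc.get("metadata", {}).get("filename", "unknown")

    parts = []
    for doc in docs:
        body = doc["text"].strip()
        if include_metadata_in_context:
            header = context_item_header_fmt.format(source=make_source(doc))
            body = f"{header}\n{body}"
        parts.append(body)
    return context_separator.join(parts)


docs = [
    {"text": "Alpha", "metadata": {"filename": "a.md", "uri": "s3://bucket/a.md"}},
    {"text": "Beta", "metadata": {"filename": "b.md", "uri": "s3://bucket/b.md"}},
]
context = build_context(docs)
print(context)

# Overriding make_source switches the label, here from filename to URI.
by_uri = build_context(docs, make_source=lambda doc: doc["metadata"]["uri"])
```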
File picking helper
When your index spans many files (e.g., a codebase), use the helper in picker.py to find the most promising files for deeper focus:
from neurosurfer.agents.rag.picker import pick_files_by_grouped_chunk_hits
files = pick_files_by_grouped_chunk_hits(
    embedder=embedder,
    vector_db=vectorstore,
    section_query="vector similarity threshold",
    candidate_pool_size=200,
    n_files=5,
    file_key="filename",
)
This performs a wide similarity search, aggregates scores per file, and returns the top‑N paths.
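The grouping step can be pictured with a small standalone function. This sketch is not the helper's implementation: `top_files_by_grouped_hits` is a hypothetical name, hits are assumed to arrive as `(metadata, score)` pairs with higher scores meaning more similar, and summing scores per file is one possible aggregation (the helper may count hits or use a maximum instead).

```python
from __future__ import annotations

from collections import defaultdict


def top_files_by_grouped_hits(
    hits: list[tuple[dict, float]],
    n_files: int = 5,
    file_key: str = "filename",
) -> list[str]:
    """Rank files by the summed similarity of their chunks (assumed aggregation)."""
    totals: dict[str, float] = defaultdict(float)
    for metadata, score in hits:
        path = metadata.get(file_key)
        if path is not None:  # ignore chunks that carry no file label
            totals[path] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [path for path, _ in ranked[:n_files]]


# Illustrative hits: b.py has two mid-scoring chunks that together outrank a.py.
hits = [
    ({"filename": "a.py"}, 0.9),
    ({"filename": "b.py"}, 0.6),
    ({"filename": "b.py"}, 0.5),
    ({"filename": "c.py"}, 0.2),
]
best = top_files_by_grouped_hits(hits, n_files=2)
print(best)
```

Summing rewards files with many relevant chunks, which suits the "deeper focus" use case; taking the maximum instead would favor files with a single strong match.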
Best practices
- Keep prompts short: your base_user_prompt is applied before context insertion, so excess preamble reduces the space left for retrieved evidence.
- Prefer normalized embeddings: set normalize_embeddings=True unless your embedder already normalizes vectors.
- Use fixed_max_new_tokens when needed: for a predictable generation length (e.g., latency budgeting), set an explicit cap.
- Log budgets: RetrieveResult.meta and generation_budget make it easy to visualize headroom and trim behavior.
- Multi-stage retrieval: you can run retrieve(...) multiple times (e.g., coarse → re-rank) and combine contexts before calling run(...) yourself.