RAG Examples¶
This page gives a practical, end‑to‑end tour of Neurosurfer’s RAG building blocks: the Chunker (structure‑aware document splitting) and the RAGIngestor (read → chunk → embed → store). You’ll find copy‑pasteable snippets for common workflows: chunking files, registering custom strategies, ingesting files/directories/raw text/ZIPs/URLs, retrieving top‑K matches, and wiring progress/cancellation.
🧩 Chunker — Overview¶
The Chunker intelligently splits content into semantically meaningful chunks while preserving structure. It supports multiple strategies out of the box and chooses an approach based on file type and heuristics:
- Python: AST‑aware chunking
- JS/TS/React: structure‑aware chunking
- Markdown/Text: header/section aware for docs, line/char for prose
- JSON: object/array aware chunking
- Comment‑aware filtering, configurable overlaps
- Extensible via custom strategies and handlers
Internally, it uses configuration from `config.chunker` (overlap, sizes, etc.).
Quick start¶
```python
from neurosurfer.rag.chunker import Chunker

chunker = Chunker()

code = '''
def area(r):
    # Compute circle area
    return 3.14159 * r * r
'''

chunks = chunker.chunk(code, file_path="utils.py")
for i, ch in enumerate(chunks, 1):
    print(f"[{i}] {ch[:80]}")
```
Register a custom strategy (by extension)¶
Use a simple function `(text, file_path) -> List[str]` and map it to one or more extensions.
```python
from neurosurfer.rag.chunker import Chunker

def my_double_newline(text: str, file_path: str | None = None):
    # Split on blank lines; trim tiny fragments
    parts = [p.strip() for p in text.split("\n\n")]
    return [p for p in parts if len(p) >= 20]

chunker = Chunker()
chunker.register({".custom", ".note"}, my_double_newline)

sample = "A block...\n\nAnother block...\n\nShort\n"
print(chunker.chunk(sample, file_path="notes.custom"))
```
Custom handler (full control)¶
Handlers can accept the chunker config and other metadata; they are useful for advanced routing or parameterized chunking.
```python
from typing import List, Optional

from neurosurfer.rag.chunker import Chunker, ChunkerConfig

def my_handler(
    text: str,
    *,
    file_path: Optional[str] = None,
    config: Optional[ChunkerConfig] = None,
) -> List[str]:
    # Example: fixed-size char windows with slight overlap from the config
    size = config.char_chunk_size if config else 800
    overlap = config.char_overlap if config else 60
    out = []
    i = 0
    while i < len(text):
        out.append(text[i:i + size])
        i += max(1, size - overlap)
    return out

chunker = Chunker()

# Register a named handler and bind it to an extension (optional)
chunker.register_handler("wide_chars", my_handler)
chunker.map_extension_to_handler(".log", "wide_chars")

log_text = "..."  # long logs
print(len(chunker.chunk(log_text, file_path="app.log")))
```
If you only need a simple extension→function mapping, `register({'.ext'}, fn)` is enough. Use handlers when you need access to the `ChunkerConfig`.
📥 RAG Ingestor — Overview¶
RAGIngestor is the production‑grade ingestion pipeline: read files/directories/raw text/URLs/ZIPs → chunk → embed (batch) → dedupe → persist to a vector DB.
Key features: multi‑source input, parallel processing, progress callbacks, cancellation, content‑hash dedupe, and metadata preservation.
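The exact dedupe scheme is internal, but conceptually it amounts to hashing each chunk's content and keeping only the first occurrence, along these lines:

```python
import hashlib

def dedupe(chunks: list[str]) -> list[str]:
    # Illustrative only: key each chunk by a content hash, keep first occurrences.
    seen: set[str] = set()
    unique: list[str] = []
    for ch in chunks:
        h = hashlib.sha256(ch.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(ch)
    return unique
```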
Typical setup:
```python
from neurosurfer.models.embedders.sentence_transformer import SentenceTransformerEmbedder
from neurosurfer.vectorstores import ChromaVectorStore  # or any BaseVectorDB implementation
from neurosurfer.rag.ingestor import RAGIngestor

embedder = SentenceTransformerEmbedder("intfloat/e5-small-v2")
vs = ChromaVectorStore(collection_name="docs")  # must implement the BaseVectorDB API

ingestor = RAGIngestor(
    embedder=embedder,
    vector_store=vs,
    batch_size=64,
    max_workers=4,
    deduplicate=True,
    normalize_embeddings=True,
)
```
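A note on `normalize_embeddings=True`: L2-normalized vectors make dot-product scores equivalent to cosine similarity, so keep the setting consistent between ingestion and querying. The idea in two lines:

```python
import numpy as np

v = np.array([3.0, 4.0])
v_unit = v / np.linalg.norm(v)  # unit length; for unit vectors, dot == cosine
```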
Add sources¶
```python
# 1) Add individual files
ingestor.add_files(["README.md", "guide.md", "src/app.py"])

# 2) Recursively add a directory (skips common junk like node_modules, .git, etc.)
ingestor.add_directory("./docs")

# 3) Raw texts (with optional per-item metadata)
ingestor.add_texts(
    ["Custom paragraph 1", "Custom paragraph 2"],
    base_id="manual",
    metadatas=[{"section": "intro"}, {"section": "notes"}],
)

# 4) ZIP archive (safe extraction to a temp folder, then indexed like a dir)
ingestor.add_zipfile("./handbook.zip")

# 5) URLs (requires a fetcher; here's a tiny example)
def fetch(url: str) -> str | None:
    # Return cleaned text from the URL (left as an exercise: requests + readability/bs4)
    return None

ingestor.add_urls(["https://example.com/page1"], fetcher=fetch)
```
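For a working fetcher, something like the following is enough for simple pages (this sketch assumes requests and beautifulsoup4 are installed; neither is a Neurosurfer dependency):

```python
import requests
from bs4 import BeautifulSoup

def fetch(url: str) -> str | None:
    # Download the page and reduce it to visible text;
    # return None on failure, matching the stub above.
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    return text or None
```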
Build / ingest¶
```python
stats = ingestor.build()
print(stats)  # {'status': 'ok', 'sources': ..., 'chunks': ..., 'unique_chunks': ..., 'added': ...}
```
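Since `build()` returns a plain dict, the summary is easy to log; assuming the numeric keys hold counts as their names suggest:

```python
if stats["status"] == "ok":
    skipped = stats["chunks"] - stats["unique_chunks"]
    print(f"Indexed {stats['added']} chunks; {skipped} duplicates skipped")
```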
With progress callback and cancellation¶
```python
import threading
import time

progress = []

def on_progress(p):
    progress.append(p)
    if p.get("stage") == "embedding":
        print(f"Embedding {p['embedded']}/{p['total']}...")

ingestor = RAGIngestor(
    embedder=embedder,
    vector_store=vs,
    batch_size=64,
    max_workers=4,
    progress_cb=on_progress,
)

# Run build in a background thread
th = threading.Thread(target=lambda: ingestor.build(), daemon=True)
th.start()

# Cancel after a moment (simulate user clicking 'Stop')
time.sleep(0.2)
ingestor.cancel_event.set()
th.join()
```
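The same pattern wraps neatly into a reusable helper. A sketch that relies only on the `build()` and `cancel_event` surface shown above:

```python
def build_with_timeout(ing: RAGIngestor, timeout_s: float) -> dict:
    # Run build() in a worker thread; request cancellation if it overruns.
    result: dict = {}
    worker = threading.Thread(target=lambda: result.update(ing.build()), daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():       # still running past the deadline
        ing.cancel_event.set()  # cooperative cancellation
        worker.join()
    return result
```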
Retrieve content (two ways)¶
A) Via RAGIngestor helpers
```python
hits = ingestor.search("what is the ingestion pipeline?", top_k=5)
for doc, score in hits:
    print(f"{score:.3f} | {doc.metadata.get('filename', doc.id)}")
    print(doc.text[:120], "...\n")
```
B) Directly via the vector store
```python
# Prepare a query embedding
q = ingestor.embed_query("how do we chunk python code?")

# Use the store's similarity_search with the query vector
matches = vs.similarity_search(q, top_k=5)
for doc, score in matches:
    print(doc.id, score)
```
🧪 End‑to‑end mini pipeline¶
```python
from neurosurfer.models.embedders.sentence_transformer import SentenceTransformerEmbedder
from neurosurfer.vectorstores import ChromaVectorStore
from neurosurfer.rag.ingestor import RAGIngestor

# 1) Components
emb = SentenceTransformerEmbedder("intfloat/e5-small-v2")
store = ChromaVectorStore(collection_name="proj-docs")
ing = RAGIngestor(embedder=emb, vector_store=store, batch_size=48, max_workers=4)

# 2) Intake
ing.add_directory("./docs")
ing.add_files(["README.md", "CHANGELOG.md"])
ing.add_texts(["This is a private note about deployment steps."], base_id="notes")

# 3) Build
summary = ing.build()
print("Added:", summary["added"])

# 4) Retrieve
for doc, score in ing.search("deployment steps", top_k=3):
    print(f"{score:.2f} | {doc.metadata.get('filename', doc.id)}")
    print(doc.text[:160], "...\n")
```
✅ Tips¶
- Prefer batch sizes of 32–128 for sentence‑transformers to balance throughput vs memory.
- Use deduplication (default on) when indexing mixed sources to avoid repeated chunks.
- Add `default_metadata` to `RAGIngestor(...)` to stamp common fields across all docs.
- Mind your overlaps in chunking — larger overlaps improve recall at a small cost in index size and embedding time (see the sketch below).
- For very large repos, start with directory filters and a tighter set of `include_exts`.
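To make the overlap tradeoff concrete, here is where character windows start for the fixed-size strategy sketched earlier (the size/overlap values are illustrative):

```python
size, overlap = 800, 60
starts = list(range(0, 2000, size - overlap))
print(starts)  # [0, 740, 1480] -- each window re-reads the last 60 chars of the previous one
```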