
RAG Examples

This page gives a practical, end‑to‑end tour of Neurosurfer’s RAG building blocks: the Chunker (structure‑aware document splitting) and the RAGIngestor (read → chunk → embed → store). You’ll find copy‑pasteable snippets for common workflows: chunking files, registering custom strategies, ingesting files/directories/raw text/ZIPs/URLs, retrieving top‑K matches, and wiring progress/cancellation.


🧩 Chunker — Overview

The Chunker intelligently splits content into semantically meaningful chunks while preserving structure. It supports multiple strategies out of the box and chooses an approach based on file type and heuristics:

  • Python: AST‑aware chunking
  • JS/TS/React: structure‑aware chunking
  • Markdown/Text: header/section aware for docs, line/char for prose
  • JSON: object/array aware chunking
  • Comment‑aware filtering, configurable overlaps
  • Extensible via custom strategies and handlers

Internally, it uses configuration from config.chunker (overlap, sizes, etc.).
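
If the defaults don't fit, you can construct the config yourself. A minimal sketch, assuming ChunkerConfig exposes the char-window fields used by the custom handlers further down this page and that Chunker accepts a config argument:

from neurosurfer.rag.chunker import Chunker, ChunkerConfig

# Assumed constructor: field names match those read by custom handlers below
# (char_chunk_size, char_overlap); check ChunkerConfig for the exact signature.
cfg = ChunkerConfig(char_chunk_size=800, char_overlap=60)
chunker = Chunker(config=cfg)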

Quick start

from neurosurfer.rag.chunker import Chunker

chunker = Chunker()

code = '''
def area(r):
    # Compute circle area
    return 3.14159 * r * r
'''

chunks = chunker.chunk(code, file_path="utils.py")
for i, ch in enumerate(chunks, 1):
    print(f"[{i}] {ch[:80]}")

Register a custom strategy (by extension)

Use a simple function (text, file_path) -> List[str] and map it to one or more extensions.

from neurosurfer.rag.chunker import Chunker

def my_double_newline(text: str, file_path: str | None = None) -> list[str]:
    # Split on blank lines; drop tiny fragments (here, under 10 chars)
    parts = [p.strip() for p in text.split("\n\n")]
    return [p for p in parts if len(p) >= 10]

chunker = Chunker()
chunker.register({".custom", ".note"}, my_double_newline)

sample = "A block...\n\nAnother block...\n\nShort\n"
print(chunker.chunk(sample, file_path="notes.custom"))

Custom handler (full control)

Handlers can accept config and other meta; useful for advanced routing or parameterized chunking.

from typing import Optional, List
from neurosurfer.rag.chunker import Chunker, ChunkerConfig

def my_handler(text: str, *, file_path: Optional[str] = None, config: Optional[ChunkerConfig] = None) -> List[str]:
    # Example: fixed-size char windows with slight overlap from cfg
    size = (config.char_chunk_size if config else 800)
    overlap = (config.char_overlap if config else 60)
    out = []
    i = 0
    while i < len(text):
        out.append(text[i:i+size])
        i += max(1, size - overlap)
    return out

chunker = Chunker()
# Register a named handler and bind it to an extension (optional)
chunker.register_handler("wide_chars", my_handler)
chunker.map_extension_to_handler(".log", "wide_chars")

log_text = "..."  # long logs
print(len(chunker.chunk(log_text, file_path="app.log")))

If you only need simple extension→function mapping, register({'.ext'}, fn) is enough. Use handlers when you need access to the ChunkerConfig.


📥 RAG Ingestor — Overview

RAGIngestor is the production‑grade ingestion pipeline: read files/directories/raw text/URLs/ZIPs → chunk → embed (batch) → dedupe → persist to a vector DB.

Key features: multi‑source input, parallel processing, progress callbacks, cancellation, content‑hash dedupe, and metadata preservation.
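
Dedupe is content‑hash based: identical chunk text is stored only once. Conceptually it behaves like this sketch (an illustration of the idea, not the library's internals):

import hashlib

def content_hash(chunk: str) -> str:
    # Stable fingerprint of the chunk text; identical chunks hash identically
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

seen: set[str] = set()
unique = []
for chunk in ["same text", "same text", "other text"]:
    if (h := content_hash(chunk)) not in seen:  # keep first occurrence only
        seen.add(h)
        unique.append(chunk)
print(unique)  # ['same text', 'other text']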

Typical setup:

from neurosurfer.models.embedders.sentence_transformer import SentenceTransformerEmbedder
from neurosurfer.vectorstores import ChromaVectorStore   # or any BaseVectorDB implementation
from neurosurfer.rag.ingestor import RAGIngestor

embedder = SentenceTransformerEmbedder("intfloat/e5-small-v2")
vs = ChromaVectorStore(collection_name="docs")  # must implement BaseVectorDB API

ingestor = RAGIngestor(
    embedder=embedder,
    vector_store=vs,
    batch_size=64,
    max_workers=4,
    deduplicate=True,
    normalize_embeddings=True,
)

Add sources

# 1) Add individual files
ingestor.add_files(["README.md", "guide.md", "src/app.py"])

# 2) Recursively add a directory (skips common junk like node_modules, .git, etc.)
ingestor.add_directory("./docs")

# 3) Raw texts (with optional per‑item metadata)
ingestor.add_texts(
    ["Custom paragraph 1", "Custom paragraph 2"],
    base_id="manual",
    metadatas=[{"section": "intro"}, {"section": "notes"}],
)

# 4) ZIP archive (safe extraction to a temp folder, then indexed like a dir)
ingestor.add_zipfile("./handbook.zip")

# 5) URLs (requires a fetcher; here’s a tiny example)
def fetch(url: str) -> str | None:
    # return cleaned text from URL (left as an exercise: requests + readability/bs4)
    return None

ingestor.add_urls(["https://example.com/page1"], fetcher=fetch)
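
If you want something runnable rather than a stub, here's a minimal fetcher sketch using requests and BeautifulSoup (both third‑party packages, not part of Neurosurfer); real pages usually benefit from readability‑style extraction on top:

import requests
from bs4 import BeautifulSoup

def fetch(url: str) -> str | None:
    # Fetch the page and strip tags; return None on any network/HTTP failure
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content noise
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)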

Build / ingest

stats = ingestor.build()
print(stats)  # {'status': 'ok', 'sources': ..., 'chunks': ..., 'unique_chunks': ..., 'added': ...}

With progress callback and cancellation

import threading, time

progress = []
def on_progress(p):
    progress.append(p)
    if p.get("stage") == "embedding":
        print(f"Embedding {p['embedded']}/{p['total']}...")

ingestor = RAGIngestor(
    embedder=embedder,
    vector_store=vs,
    batch_size=64,
    max_workers=4,
    progress_cb=on_progress,
)

# Run build in a background thread
th = threading.Thread(target=lambda: ingestor.build(), daemon=True)
th.start()

# Cancel after a moment (simulate user clicking 'Stop')
time.sleep(0.2)
ingestor.cancel_event.set()
th.join()

Retrieve content (two ways)

A) Via RAGIngestor helpers

hits = ingestor.search("what is the ingestion pipeline?", top_k=5)
for doc, score in hits:
    print(f"{score:.3f} | {doc.metadata.get('filename', doc.id)}")
    print(doc.text[:120], "...\n")

B) Directly via the vector store

# Prepare a query embedding
q = ingestor.embed_query("how do we chunk python code?")
# Use store's similarity_search with the query vector
matches = vs.similarity_search(q, top_k=5)
for doc, score in matches:
    print(doc.id, score)

🧪 End‑to‑end mini pipeline

from neurosurfer.models.embedders.sentence_transformer import SentenceTransformerEmbedder
from neurosurfer.vectorstores import ChromaVectorStore
from neurosurfer.rag.ingestor import RAGIngestor

# 1) Components
emb = SentenceTransformerEmbedder("intfloat/e5-small-v2")
store = ChromaVectorStore(collection_name="proj-docs")
ing = RAGIngestor(embedder=emb, vector_store=store, batch_size=48, max_workers=4)

# 2) Intake
ing.add_directory("./docs")
ing.add_files(["README.md", "CHANGELOG.md"])
ing.add_texts(["This is a private note about deployment steps."], base_id="notes")

# 3) Build
summary = ing.build()
print("Added:", summary["added"])

# 4) Retrieve
for doc, score in ing.search("deployment steps", top_k=3):
    print(f"{score:.2f} | {doc.metadata.get('filename', doc.id)}")
    print(doc.text[:160], "...\n")

✅ Tips

  • Prefer batch sizes of 32–128 for sentence‑transformers to balance throughput vs memory.
  • Use deduplication (default on) when indexing mixed sources to avoid repeated chunks.
  • Add default_metadata to RAGIngestor(...) to stamp common fields across all docs (see the sketch after this list).
  • Mind your overlaps in chunking: larger overlaps improve recall at a small storage and compute cost.
  • For very large repos, start with directory filters and a tighter set of include_exts.
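
A combined sketch for the default_metadata and include_exts tips. The default_metadata parameter is named above; the include_exts keyword on add_directory is an assumption here, so verify both against your RAGIngestor signature:

ingestor = RAGIngestor(
    embedder=embedder,
    vector_store=vs,
    default_metadata={"project": "neurosurfer", "source": "docs"},  # stamped on every doc
)
# Hypothetical keyword: limit a large repo to a tight extension set
ingestor.add_directory("./big-repo", include_exts={".md", ".py"})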