Chunker¶
Module:
neurosurfer.rag.chunker
A production‑grade, extensible document chunker purpose‑built for RAG (Retrieval‑Augmented Generation) systems. It supports AST‑aware Python chunking, structure‑aware JS/TS chunking, header‑aware Markdown chunking, JSON object/array chunking, and robust fallbacks (line‑based for code‑like, char‑based for prose). It also includes custom handler registration, a router for dynamic strategy selection, comment‑aware filtering, prompt‑block stripping, and safety caps to prevent pathological outputs from custom logic.
Key Capabilities¶
- Strategy registry by file extension (built‑in + custom)
- Built‑ins:
.py,.js,.ts,.tsx,.jsx,.json,.md,.txt - AST‑aware Python chunking: groups imports / defs / classes (with decorators) and then windows large blocks
- JS/TS structure‑aware chunking: coarse segmentation by
function/classfollowed by cleanup - Markdown/README header‑aware chunking: windows by headings/max lines
- JSON chunking: per top‑level object/array element with size caps; falls back to char windows if needed
- Fallback heuristics:
"auto"detection prefers line‑based windows for code‑like text, char‑based for prose - Comment‑aware filtering: collapses long comment blocks, skips fully‑commented chunks
- Prompt‑block filtering (triple‑quoted strings that look like prompts), reducing retrieval poisoning
- Custom handler system: register callable handlers by name and bind them to extensions or a router
- Safety limits: cap chunk counts and total output characters from custom handlers
- Blacklist skip: ignore common binary/config/infra paths early
- Optional logging hook
Configuration (ChunkerConfig)¶
ChunkerConfig centralizes chunk sizes, overlaps, fallbacks, and safety limits.
| Name | Type | Default | Description |
|---|---|---|---|
fallback_chunk_size | int | 25 | Line-based window size (lines) for generic code splitting and Python/JS sub‑chunking. |
overlap_lines | int | 3 | Line overlap between consecutive windows (context retention). |
max_chunk_lines | int | 1000 | Hard cap on lines per chunk to avoid giant blocks. |
comment_block_threshold | int | 4 | Consecutive comment‑only lines at/above this threshold form a “comment block” that can be dropped. |
char_chunk_size | int | 1000 | Char-based window size (chars) for prose/unknown text. |
char_overlap | int | 150 | Char overlap between consecutive char windows. |
readme_max_lines | int | 30 | Max lines per Markdown/README chunk. |
json_chunk_size | int | 1000 | Max chars per JSON sub‑chunk when pretty‑printing. |
fallback_mode | str | "char" | What to do when no strategy: "char", "line", or "auto" (code‑like → line; else char). |
max_returned_chunks | int | 500 | Hard limit on number of chunks returned by a custom handler (post‑sanitize). |
max_total_output_chars | int | 1_000_000 | Hard limit on total characters returned by a custom handler (post‑sanitize). |
min_chunk_non_ws_chars | int | 1 | Drop chunks that have fewer than this many non‑whitespace characters. |
Example
from neurosurfer.rag.chunker import Chunker, ChunkerConfig
config = ChunkerConfig(
fallback_chunk_size=30,
overlap_lines=5,
char_chunk_size=1000,
comment_block_threshold=4,
fallback_mode="auto"
)
chunker = Chunker(config)
Blacklist / Skip Rules¶
Chunker avoids ingesting common non‑text or infra files via _should_skip_file(file_path) which tests against these regular‑expression patterns (examples):
*.lock, .env*, .git*, node_modules, __pycache__, .DS_Store, Thumbs.db,
*.png, *.jpg, *.svg, *.ico, *.zip, *.tar.gz, *.mp4, *.mp3,
/dist/, /build/, .idea, .vscode, .eslintrc, .prettierrc, .editorconfig,
.gitignore, /LICENSE, /CODEOWNERS, /CONTRIBUTING.md, /CHANGELOG.md
If a path matches any blacklist regex,
chunk(...)returns[]immediately.
Public API¶
class Chunker(config: ChunkerConfig = ChunkerConfig())¶
Built‑in strategy registration¶
.register(exts: List[str], fn: StrategyFn) -> None
Map file extensions to a strategy function:fn(text: str, file_path: Optional[str]) -> List[str]
Built‑ins are pre‑registered for:.py→_chunk_python(AST‑aware + line windows).js,.ts,.tsx,.jsx→_chunk_javascript.json→_chunk_json.md,.txt→_chunk_readme
Custom handler registry¶
-
.register_custom(name: str, handler: CustomChunkHandler) -> None
Registers a named handler. TheCustomChunkHandlersignature is:
handler(text: str, *, file_path: Optional[str] = None, config: Optional[ChunkerConfig] = None) -> List[str] -
.unregister_custom(name: str) -> None
Removes the handler and any extension mappings pointing to it. -
.list_custom_handlers() -> List[str]
Names of all registered custom handlers. -
.use_custom_for_ext(exts: List[str], handler_name: str) -> None
Route specific extensions to a named custom handler. -
.clear_custom_for_ext(exts: List[str]) -> None
Remove previous extension mappings. -
.set_router(router: Optional[Callable[[Optional[str], str], Optional[str]]]) -> None
Install a router invoked asrouter(file_path, text) -> handler_name | None. If it returns the name of a registered handler, that handler is used. -
.list_ext_mappings() -> List[Tuple[str, str]]
Returns current extension → custom handler mappings.
Logging (optional)¶
.set_logger(logger_fn: Callable[[str], None]) -> None
Attach a logger callback. Internal helpers route info/warn/error strings to this function.
Main entry point¶
.chunk(text: str, *, source_id: str | None = None, file_path: str | None = None, k: int = 40, custom: str | CustomChunkHandler | None = None) -> List[str]
Priority order used by chunk(...):
- Explicit
customhandler (string name or callable) - Router result (registered handler name)
- Extension mapping (registered handler name)
- Built‑in strategy for the extension
- Heuristic fallback: line windows if
_looks_like_code(text)else char windows
Additional rules:
- If
file_pathmatches the blacklist, returns[]. - If
textword count<= k, returns the whole text as a single chunk (if it meetsmin_chunk_non_ws_chars).
Strategy Details¶
Python (_chunk_python)¶
- Strips prompt‑like triple‑quoted blocks first (
_filter_prompt_like_blocks). - Parses AST; for each top‑level node (imports, defs/classes incl. decorators), collects its line range.
- Sub‑chunks large blocks by line windows with cleanup:
_split_into_chunks()→clean_chunk_lines() - Skips fully‑commented windows.
- Appends any remaining lines (non‑AST or trailing comments) via the same windowing logic.
- Fallback to
_line_windowson parse failure.
JavaScript/TypeScript (_chunk_javascript)¶
- Coarse split by
function ... {orclass ... {occurrences. - Cleans each segment via
_clean_linesand skips fully‑commented blocks. - Falls back to
_line_windowsif no structural matches found.
JSON (_chunk_json)¶
- Attempts
json.loads(text): - dict: chunk each
{"key": value}pretty‑printed, truncating byjson_chunk_size. - list: chunk each element pretty‑printed, capped by
json_chunk_size. - On parse error, falls back to
_char_windows(text, json_chunk_size, json_chunk_size * 0.2).
Markdown/README (_chunk_readme)¶
- Windows by headings and
readme_max_lines, ensuring manageable slices for embeddings.
Fallbacks¶
-
_line_windows(text, window=fallback_chunk_size)
Removes empty lines, applies line overlap (overlap_lines), cleans long comment blocks, skips fully‑commented windows. -
_char_windows(text, size=char_chunk_size, overlap=char_overlap)
Sliding char windows with overlap; empties removed inside each chunk.
Heuristics & Filters¶
_looks_like_code(text)– detects code‑likeness with braces/semicolons/keywords/indent patterns & short lines ratio._is_comment_line(line)and_is_fully_commented(lines)– used to skip comment‑only chunks._clean_lines(chunk_lines, comment_block_threshold)– collapses long comment blocks._is_prompt_like(text)+_filter_prompt_like_blocks(code)– removes triple‑quoted strings that look like LLM prompts.
Safety & Sanitization (custom handlers)¶
_sanitize_chunks(chunks)enforces:max_returned_chunksandmax_total_output_chars- drop entries shorter than
min_chunk_non_ws_chars - trim trailing newlines
Usage Examples¶
Baseline usage¶
from pathlib import Path
from neurosurfer.rag.chunker import Chunker, ChunkerConfig
chunker = Chunker(ChunkerConfig(fallback_mode="auto"))
py_text = Path("app/main.py").read_text()
py_chunks = chunker.chunk(py_text, file_path="app/main.py")
md_text = Path("README.md").read_text()
md_chunks = chunker.chunk(md_text, file_path="README.md")
raw = "A lot of prose..."
txt_chunks = chunker.chunk(raw, file_path="notes.unknown") # auto → char windows
Register a custom handler (by name)¶
from typing import List, Optional
from neurosurfer.rag.chunker import Chunker, ChunkerConfig
def my_handler(text: str, *, file_path: Optional[str] = None, config: Optional[ChunkerConfig] = None) -> List[str]:
# Example: split on double newlines and clamp ~1200 chars
segs = [s.strip() for s in text.split("\n\n") if s.strip()]
out, buf = [], ""
for s in segs:
if len(buf) + len(s) + 2 <= 1200:
buf = f"{buf}\n\n{s}".strip()
else:
if buf: out.append(buf)
buf = s
if buf: out.append(buf)
return out
chunker = Chunker()
chunker.register_custom("my_handler", my_handler)
chunker.use_custom_for_ext([".rst", ".tex"], "my_handler")
rst_text = open("paper.rst").read()
chunks = chunker.chunk(rst_text, file_path="paper.rst")
Use an ad‑hoc callable for a single call¶
def once(text: str, *, file_path=None, config=None):
return [p for p in text.split("\n\n") if p.strip()]
chunks = chunker.chunk(big_text, custom=once)
Install a router to decide dynamically¶
def router(file_path, text):
if file_path and file_path.endswith(".proto"):
return "proto_chunks" # must be registered via register_custom
if "BEGIN:VCARD" in text:
return "vcard_chunks"
return None
chunker.set_router(router)
Attach a logger¶
Production Notes & Best Practices¶
- Prefer
fallback_mode="auto"in mixed repos; it chooses line windows for code‑like content. - Keep
overlap_lineslow (2–5) to reduce duplication in code while preserving local context. - For long prose, start with
char_chunk_size ~ 800–1200andchar_overlap ~ 100–200. - Treat JSON as structured: chunking at top‑level keys/elements often boosts retrieval precision.
- When adding custom handlers, rely on
_sanitize_chunks(already applied) and consider your own local guards. - The blacklist is conservative; extend/override upstream if needed before crawling a repo.
- Chunk after text normalization (decode, HTML → text, etc.) to keep windowing deterministic.