FileReader¶

Module: neurosurfer.rag.filereader

A unified, production‑grade file → text loader for RAG pipelines. FileReader auto‑detects the file type by extension and applies the appropriate extractor to return clean UTF‑8 text that’s ready for downstream chunking (Chunker) and ingestion (RAG Ingestor). It is defensive by design: optional dependencies are handled gracefully and errors are returned as descriptive strings instead of crashing your pipeline.

Overview¶

FileReader exposes a single high‑level method, read(path), which dispatches to format‑specific readers. It supports documents, data files, presentations, code/config/log formats, HTML pages, and more. When a format is not explicitly supported, it falls back to plain‑text reading with UTF‑8 (and errors="ignore").

Key Capabilities¶

Auto‑detection by extension via supported_types mapping.
Broad format coverage out of the box (see table below).
Graceful degradation when optional libraries are missing (returns helpful messages).
Consistent plaintext output suitable for embedding + retrieval.
Zero surprises: reader methods never raise; you get text (or an “Error reading …” string).

Supported Formats & Readers¶

Category	Extensions	Reader method	Dependencies
PDF	`.pdf`	`_read_pdf`	`fitz` (PyMuPDF)
HTML	`.html`, `.htm`	`_read_html`	`bs4` (BeautifulSoup)
DOCX	`.docx`	`_read_docx`	`python-docx`
CSV / TSV	`.csv`, `.tsv`	`_read_csv`	`pandas`
Excel	`.xls`, `.xlsx`	`_read_excel`	`pandas`
YAML	`.yaml`, `.yml`	`_read_yaml`	`pyyaml` (optional)
XML	`.xml`	`_read_xml`	`xml.etree.ElementTree` (stdlib)
PPTX	`.pptx`	`_read_pptx`	`python-pptx` (optional)
Plain‑text family	`.txt`, `.md`, `.rtf`, `.doc`, `.odt`, `.json`, `.ppt`, `.py`, `.ipynb`, `.java`, `.js`, `.ts`, `.jsx`, `.tsx`, `.cpp`, `.c`, `.h`, `.cs`, `.go`, `.rb`, `.rs`, `.php`, `.swift`, `.kt`, `.sh`, `.bat`, `.ps1`, `.scala`, `.lua`, `.r`, `.env`, `.ini`, `.toml`, `.cfg`, `.conf`, `.properties`, `.log`, `.tex`, `.srt`, `.vtt`	`_read_txt`	none

Anything not in supported_types also falls back to _read_txt (UTF‑8, errors="ignore").

Dependencies¶

Required core libs (used if format encountered):
fitz (PyMuPDF) for PDF
docx (python‑docx) for DOCX
pandas for CSV/TSV/Excel
bs4 (BeautifulSoup) for HTML
Optional:
pyyaml for YAML (yaml.safe_load)
python-pptx for PPTX
xml.etree.ElementTree is from the standard library

When an optional dependency is missing, the corresponding reader returns a clear message (e.g., "python-pptx not installed").

Public API¶

`class FileReader`¶

Attributes¶

supported_types: dict[str, Callable] – mapping from extension (lowercase, with dot) to the concrete reader method.

Methods¶

read(file_path: str) -> str
Auto‑detects by extension and dispatches to a concrete _read_* method. If no handler is registered, uses _read_txt. Never raises; errors are returned as readable strings.

Format‑specific readers¶

_read_pdf(path: str) -> str
Page‑wise text extraction using fitz. On error returns "Error reading PDF: <message>".
_read_txt(path: str) -> str
Reads UTF‑8 with errors="ignore". On error returns "Error reading TXT: <message>".
_read_html(path: str) -> str
Parses with BeautifulSoup(..., "html.parser") and returns visible text via .get_text(). On error returns "Error reading HTML: <message>".
_read_docx(path: str) -> str
Iterates doc.paragraphs, joins with newlines. On error returns "Error reading DOCX: <message>".
_read_excel(path: str) -> str
Uses pandas.read_excel(..., sheet_name=None) to load all sheets; renders with .astype(str).to_string(index=False) and sheet headers. On error returns "Error reading Excel: <message>".
_read_csv(path: str) -> str
Uses pandas.read_csv, stringifies and returns .to_string(index=False). On error returns "Error reading CSV/TSV: <message>".
_read_yaml(path: str) -> str
Uses yaml.safe_load if pyyaml is available; otherwise returns "PyYAML not installed". On error returns "Error reading YAML: <message>".
_read_xml(path: str) -> str
Uses xml.etree.ElementTree.parse(...).getroot() and ET.tostring(..., encoding="unicode"). If unavailable, returns "XML parser not available". On error returns "Error reading XML: <message>".
_read_pptx(path: str) -> str
Uses Presentation(path); concatenates shape.text for all shapes across slides. If library is missing, returns "python-pptx not installed". On error returns "Error reading PPTX: <message>".

Behavior & Error Model¶

Non‑throwing: all readers catch exceptions and return "Error reading <FORMAT>: <message>" to prevent ingestion crashes. You may choose to skip these records upstream.
Encoding: text reading uses UTF‑8 with errors="ignore" to maximize robustness.
Best‑effort structure: dataframes, DOCX paragraphs, and PPTX shape texts are stringified in a predictable, readable way.

Usage Examples¶

Basic¶

from neurosurfer.rag.filereader import FileReader

reader = FileReader()

pdf_text = reader.read("report.pdf")
excel_text = reader.read("dataset.xlsx")
code_text = reader.read("script.py")
html_text = reader.read("page.html")

With Chunker & Ingestor¶

from neurosurfer.rag.filereader import FileReader
from neurosurfer.rag.chunker import Chunker
from neurosurfer.rag.ingestor import RAGIngestor
from neurosurfer.models.embedders.sentence_transformer import SentenceTransformerEmbedder
from neurosurfer.vectorstores.chroma import ChromaDB

reader = FileReader()
chunker = Chunker()
ingestor = RAGIngestor(
    embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
    vector_store=ChromaDB(collection="neurosurfer")
)

text = reader.read("README.md")
chunks = chunker.chunk(text, file_path="README.md")
# or: ingestor.add_files(["README.md"]).build()

Handling Errors¶

txt = reader.read("possibly_corrupt.pdf")
if txt.startswith("Error reading"):
    # log and skip
    pass

Extension Mapping Reference (`supported_types`)¶

Below is the canonical mapping initialized in __init__ (extensions are lowercase). You can inspect it at runtime:

reader = FileReader()
print(sorted(reader.supported_types.keys()))

Registered as structured readers: .pdf, .html, .htm, .docx, .csv, .tsv, .xls, .xlsx, .xml, .yaml, .yml, .pptx
Registered to read as plain‑text: .txt, .md, .rtf, .doc, .odt, .json, .ppt, .py, .ipynb, .java, .js, .ts, .jsx, .tsx, .cpp, .c, .h, .cs, .go, .rb, .rs, .php, .swift, .kt, .sh, .bat, .ps1, .scala, .lua, .r, .env, .ini, .toml, .cfg, .conf, .properties, .log, .tex, .srt, .vtt
Anything else → _read_txt (fallback)

Production Notes & Best Practices¶

HTML: get_text() strips tags; if you need DOM‑aware extraction (tables, links), post‑process the HTML separately and feed structured text to the Chunker.
PDF: text extraction quality varies; consider adding an OCR fallback at a higher layer for scanned PDFs.
Excel/CSV: very wide tables can produce long lines—rely on the Chunker’s char windows to split into manageable pieces.
YAML/XML: readers return a serialized string; for semantic RAG over structured data, you may want to pre‑normalize into key/value lines.
Error strings: treat them as loggable noise—skip during ingestion rather than embedding error messages.
Encoding: if a file is known to be non‑UTF‑8, preconvert to UTF‑8 before invoking read() for best results.