SQLSchemaStore

Module: neurosurfer.db.sql.sql_schema_store.SQLSchemaStore

SQLSchemaStore extracts per-table schema strings from a live database and (optionally) asks an LLM to produce concise JSON summaries. It persists these to disk as a single JSON file you can reuse in agents/tools without re-querying the DB on every run.


What it stores

Each table is keyed by name and contains at least:

{
  "table_name": "users",
  "summary": "What data the table stores and its role in the system.",
  "schema": "CREATE TABLE ... /* sample rows ... */"
}

When summarize=True, the summary is generated by your provided LLM according to a strict prompt; otherwise only the raw schema string is stored (summaries can be added later).


Initialize

from neurosurfer.db.sql.sql_schema_store import SQLSchemaStore
from neurosurfer.models.chat_models.openai_like import OpenAILikeModel  # implements BaseModel

llm = OpenAILikeModel(model="gpt-4o-mini")  # optional if you won't summarize

store = SQLSchemaStore(
    db_uri="postgresql://user:pass@localhost:5432/analytics",
    llm=llm,                         # required if summarize=True
    sample_rows_in_table_info=3,     # how many rows to include in schema text
    storage_path="./"                # where to write sql_schema_store_<db>.json
)

Train (extract and optionally summarize)

train(...) is a generator that yields the number of processed tables (for progress bars).

for n in store.train(summarize=True, force=True):
    # e.g., update a progress UI
    pass

print("Saved to:", store.storage_path)
print("Tables processed:", store.get_tables_count())

  • summarize=False → store raw schema strings (you can add summaries later).
  • summarize=True → uses your llm to generate concise JSON summaries per table.
  • force=True → clears any existing file and retrains all tables.
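Because train(...) yields a running count, you can drive any progress UI by draining the generator. A minimal sketch of that pattern, with a hypothetical consume_progress helper and a stand-in generator in place of store.train(...) (neither is part of SQLSchemaStore):

```python
def consume_progress(gen, total, render):
    """Drain a count-yielding generator, calling render(done, total) after each step."""
    done = 0
    for done in gen:
        render(done, total)
    return done

# Stand-in for store.train(summarize=True, force=True):
def fake_train(n):
    for i in range(1, n + 1):
        yield i

updates = []
last = consume_progress(fake_train(3), total=3,
                        render=lambda d, t: updates.append(f"{d}/{t}"))
# updates == ["1/3", "2/3", "3/3"]; last == 3
```

With a real store you would pass store.train(summarize=True, force=True) as gen and store.get_tables_count() as total.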

The built-in prompt enforces a strict JSON response:

System: You are a master of SQL...
User: Given the schema of {table_name} ... Output this JSON object: {"table_name": "...", "summary": "..."}
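Even with a strict prompt, LLM replies sometimes arrive wrapped in code fences or extra prose. If you post-process responses yourself, a lenient parse like the following can help; parse_summary_response is a hypothetical helper, not part of the library:

```python
import json
import re

def parse_summary_response(text):
    """Extract the first JSON object from an LLM reply, tolerating code fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in response")
    obj = json.loads(match.group(0))
    if not {"table_name", "summary"} <= obj.keys():
        raise ValueError("response is missing required keys")
    return obj

reply = '```json\n{"table_name": "users", "summary": "Account records."}\n```'
print(parse_summary_response(reply)["summary"])  # Account records.
```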

Read & use

data = store.get_table_data("users")          # dict with keys: table_name, summary, schema
count = store.get_tables_count()              # number of usable tables
tables = store.db.get_usable_table_names()    # from the underlying SQLDatabase

These summaries are ideal inputs for relevant-table selection, SQL generation, or DB insights tools. See the built-in tools under Tools → Built-in Tools → SQL.
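For example, a prompt context for SQL generation can be assembled from the stored entries. A sketch, assuming the per-table dict shape shown above (build_schema_context is a hypothetical helper, not a library function):

```python
def build_schema_context(entries, tables):
    """Concatenate summary + schema blocks for a chosen subset of tables."""
    parts = []
    for t in tables:
        e = entries.get(t)
        if e is None:
            continue  # skip tables the store doesn't know about
        parts.append(f"-- {e['table_name']}: {e['summary']}\n{e['schema']}")
    return "\n\n".join(parts)

entries = {
    "users": {"table_name": "users",
              "summary": "Account records.",
              "schema": "CREATE TABLE users (id INT PRIMARY KEY);"},
}
print(build_schema_context(entries, ["users", "missing"]))
```

In practice, entries would be populated via store.get_table_data(t) for each usable table.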


API reference

__init__(db_uri, llm=None, sample_rows_in_table_info=3, storage_path=None, logger=None)

  • Creates an internal SQLDatabase with view_support=True and sets up a JSON file named sql_schema_store_<db>.json beneath storage_path (default ./).
  • If llm is provided, it must satisfy the BaseModel interface with .ask(...) returning an OpenAI-like response dict.

train(summarize: bool = False, force: bool = False) -> Generator[int, None, None]

  • Extracts schema strings for all usable tables.
  • If summarize=True, calls summarize_schema__(table, schema) per table.
  • Yields the count of processed tables for progress reporting.
  • Writes the final store to storage_path when complete.

get_all_table_schemas() -> dict[str, str]

Returns a mapping {table_name: schema_string} by delegating to the underlying SQLDatabase (get_table_info([t])).

get_table_data(table: str) -> dict | None

Fetches the stored JSON object for a single table.

save_to_file() / load_from_file()

Persist the store to, and restore it from, the JSON file at storage_path.

get_db_name() -> str | None

Extracts the DB name from the URI; used to build the default filename.
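If you need the same value outside the store (e.g., to predict the output filename), a standalone equivalent can be written with urllib.parse; this is a sketch, not necessarily the library's own implementation:

```python
from urllib.parse import urlparse

def db_name_from_uri(uri):
    """Return the database name (last path segment) of a SQLAlchemy-style URI."""
    path = urlparse(uri).path  # e.g. "/analytics"
    return path.lstrip("/").split("/")[-1] or None

print(db_name_from_uri("postgresql://user:pass@localhost:5432/analytics"))  # analytics
```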

get_tables_count() -> int

Returns the number of usable tables from SQLDatabase.


Tips & caveats

  • Ensure your DB user has privileges to read table metadata, plus read access to fetch the sample rows (LIMIT N).
  • You can postprocess or augment summaries (e.g., add row counts, PK/FK lists).
  • Treat generated JSON carefully in downstream code; validate before use in agents.
  • For very large schemas, consider chunking or filtering tables to keep context sizes manageable.
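A minimal shape check before feeding entries to an agent might look like this (validate_entry is a hypothetical helper, assuming the three-key entry shape documented above):

```python
def validate_entry(entry):
    """Return True if a stored table entry has the expected keys and types."""
    return (
        isinstance(entry, dict)
        and isinstance(entry.get("table_name"), str)
        and isinstance(entry.get("summary"), str)
        and isinstance(entry.get("schema"), str)
    )

good = {"table_name": "users", "summary": "Accounts.", "schema": "CREATE TABLE ..."}
bad = {"table_name": "users"}  # missing summary/schema
print(validate_entry(good), validate_entry(bad))  # True False
```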