SQLSchemaStore

Module: neurosurfer.db.sql.sql_schema_store.SQLSchemaStore

SQLSchemaStore extracts per-table schema strings from a live database and (optionally) asks an LLM to produce concise JSON summaries. It persists these to disk as a single JSON file you can reuse in agents/tools without re-querying the DB on every run.


What it stores

Each table is keyed by name and contains at least:

{
  "table_name": "users",
  "summary": "What data the table stores and its role in the system.",
  "schema": "CREATE TABLE ... /* sample rows ... */"
}

When summarize=True, the summary is generated by your provided LLM according to a strict prompt; otherwise only the raw schema string is stored (summaries can be added later).


Initialize

from neurosurfer.db.sql.sql_schema_store import SQLSchemaStore
from neurosurfer.models.chat_models.openai_like import OpenAILikeModel  # implements BaseModel

llm = OpenAILikeModel(model="gpt-4o-mini")  # optional if you won't summarize

store = SQLSchemaStore(
    db_uri="postgresql://user:pass@localhost:5432/analytics",
    llm=llm,                         # required if summarize=True
    sample_rows_in_table_info=3,     # how many rows to include in schema text
    storage_path="./"                # where to write sql_schema_store_<db>.json
)

Train (extract and optionally summarize)

train(...) is a generator that yields the number of processed tables (for progress bars).

for n in store.train(summarize=True, force=True):
    # e.g., update a progress UI
    pass

print("Saved to:", store.storage_path)
print("Tables processed:", store.get_tables_count())

  • summarize=False → store raw schema strings (you can add summaries later).
  • summarize=True → uses your llm to generate concise JSON summaries per table.
  • force=True → clears any existing file and retrains all tables.
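Because train(...) yields a running count, you can drive any progress UI by draining the generator. A minimal sketch of that pattern, with a hypothetical consume_progress helper and a stand-in generator in place of store.train(...) (neither is part of SQLSchemaStore):

```python
def consume_progress(gen, total, render):
    """Drain a count-yielding generator, calling render(done, total) after each step."""
    done = 0
    for done in gen:
        render(done, total)
    return done

# Stand-in for store.train(summarize=True, force=True):
def fake_train(n):
    for i in range(1, n + 1):
        yield i

updates = []
last = consume_progress(fake_train(3), total=3,
                        render=lambda d, t: updates.append(f"{d}/{t}"))
# updates == ["1/3", "2/3", "3/3"]; last == 3
```

With a real store you would pass store.train(summarize=True, force=True) as gen and store.get_tables_count() as total.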

The built-in prompt enforces a strict JSON response:

System: You are a master of SQL...
User: Given the schema of {table_name} ... Output this JSON object: {"table_name": "...", "summary": "..."}
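Even with a strict prompt, LLM replies sometimes arrive wrapped in code fences or extra prose. If you post-process responses yourself, a lenient parse like the following can help; parse_summary_response is a hypothetical helper, not part of the library:

```python
import json
import re

def parse_summary_response(text):
    """Extract the first JSON object from an LLM reply, tolerating code fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in response")
    obj = json.loads(match.group(0))
    if not {"table_name", "summary"} <= obj.keys():
        raise ValueError("response is missing required keys")
    return obj

reply = '```json\n{"table_name": "users", "summary": "Account records."}\n```'
print(parse_summary_response(reply)["summary"])  # Account records.
```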

Read & use

data = store.get_table_data("users")          # dict with keys: table_name, summary, schema
count = store.get_tables_count()              # number of usable tables
tables = store.db.get_usable_table_names()    # from the underlying SQLDatabase

These summaries are ideal inputs for relevant-table selection, SQL generation, or DB insights tools. See the built-in tools under Tools → Built-in Tools → SQL.
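For example, a prompt context for SQL generation can be assembled from the stored entries. A sketch, assuming the per-table dict shape shown above (build_schema_context is a hypothetical helper, not a library function):

```python
def build_schema_context(entries, tables):
    """Concatenate summary + schema blocks for a chosen subset of tables."""
    parts = []
    for t in tables:
        e = entries.get(t)
        if e is None:
            continue  # skip tables the store doesn't know about
        parts.append(f"-- {e['table_name']}: {e['summary']}\n{e['schema']}")
    return "\n\n".join(parts)

entries = {
    "users": {"table_name": "users",
              "summary": "Account records.",
              "schema": "CREATE TABLE users (id INT PRIMARY KEY);"},
}
print(build_schema_context(entries, ["users", "missing"]))
```

In practice, entries would be populated via store.get_table_data(t) for each usable table.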


API reference

__init__(db_uri, llm=None, sample_rows_in_table_info=3, storage_path=None, logger=None)

  • Creates an internal SQLDatabase with view_support=True and sets up a JSON file named sql_schema_store_<db>.json beneath storage_path (default ./).
  • If llm is provided, it must satisfy the BaseModel interface with .ask(...) returning an OpenAI-like response dict.

train(summarize: bool = False, force: bool = False) -> Generator[int, None, None]

  • Extracts schema strings for all usable tables.
  • If summarize=True, calls summarize_schema__(table, schema) per table.
  • Yields the count of processed tables for progress reporting.
  • Writes the final store to storage_path when complete.

get_all_table_schemas() -> dict[str, str]

Returns a mapping {table_name: schema_string} by delegating to the underlying SQLDatabase (get_table_info([t])).

get_table_data(table: str) -> dict | None

Fetches the stored JSON object for a single table.

save_to_file() / load_from_file()

Persist the store to, and restore it from, the JSON file at storage_path.

get_db_name() -> str | None

Extracts the DB name from the URI; used to build the default filename.
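If you need the same value outside the store (e.g., to predict the output filename), a standalone equivalent can be written with urllib.parse; this is a sketch, not necessarily the library's own implementation:

```python
from urllib.parse import urlparse

def db_name_from_uri(uri):
    """Return the database name (last path segment) of a SQLAlchemy-style URI."""
    path = urlparse(uri).path  # e.g. "/analytics"
    return path.lstrip("/").split("/")[-1] or None

print(db_name_from_uri("postgresql://user:pass@localhost:5432/analytics"))  # analytics
```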

get_tables_count() -> int

Returns the number of usable tables from SQLDatabase.


Tips & caveats

  • Ensure your DB user has privileges to read table metadata, plus read access to fetch the sample rows (LIMIT N).
  • You can postprocess or augment summaries (e.g., add row counts, PK/FK lists).
  • Treat generated JSON carefully in downstream code; validate before use in agents.
  • For very large schemas, consider chunking or filtering tables to keep context sizes manageable.
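A minimal shape check before feeding entries to an agent might look like this (validate_entry is a hypothetical helper, assuming the three-key entry shape documented above):

```python
def validate_entry(entry):
    """Return True if a stored table entry has the expected keys and types."""
    return (
        isinstance(entry, dict)
        and isinstance(entry.get("table_name"), str)
        and isinstance(entry.get("summary"), str)
        and isinstance(entry.get("schema"), str)
    )

good = {"table_name": "users", "summary": "Accounts.", "schema": "CREATE TABLE ..."}
bad = {"table_name": "users"}  # missing summary/schema
print(validate_entry(good), validate_entry(bad))  # True False
```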