Chroma Vector Store¶
Module: `neurosurfer.vectorstores.chroma.ChromaVectorStore`
Contract: `BaseVectorDB` · Data: `Doc`
Chroma-backed vector store built on `chromadb.PersistentClient`. Provides persistent storage, metadata filtering, and similarity search (scores are returned as cosine similarities).
Features¶
- Persistent collections on disk (`persist_directory`).
- `add_documents` with upsert (or `delete` + `add`, depending on Chroma version).
- `similarity_search` with optional `metadata_filter` and `similarity_threshold`.
- `list_all_documents` with embeddings, metadatas, and texts.
- `count`, `clear_collection`, `delete_documents`, `delete_collection`.
Usage¶
```python
from neurosurfer.vectorstores import ChromaVectorStore
from neurosurfer.vectorstores.base import Doc
from neurosurfer.models.embedders.sentence_transformer import SentenceTransformerEmbedder

vs = ChromaVectorStore(collection_name="my_docs", persist_directory="./chroma_db")
embedder = SentenceTransformerEmbedder(model_name="intfloat/e5-large-v2")

# Add docs (ensure embeddings have the same dimension across all docs)
docs = [
    Doc(id="a1", text="Neural search is fun",
        embedding=embedder.embed("Neural search is fun"),
        metadata={"topic": "search", "source": "notes.md"}),
    Doc(id="b2", text="Transformers are powerful",
        embedding=embedder.embed("Transformers are powerful"),
        metadata={"topic": "nlp"}),
]
vs.add_documents(docs)

# Query
q = embedder.embed("power of transformers")
hits = vs.similarity_search(q, top_k=3, metadata_filter={"topic": "nlp"}, similarity_threshold=0.55)
for doc, score in hits:
    print(f"{score:.3f} :: {doc.metadata.get('topic')} :: {doc.text[:60]}")
```
API Notes¶
__init__(collection_name: str, persist_directory: str = "chroma_storage")¶
Creates/opens a persistent collection. Internally uses chromadb.PersistentClient(path=...) and get_or_create_collection(name=...).
add_documents(docs: list[Doc]) -> None¶
- Builds parallel lists of `ids`, `documents`, `embeddings`, and `metadatas`.
- Uses `collection.upsert(...)` if available, else falls back to `delete(...)` + `add(...)` (supports older Chroma versions).
- IDs default to `Doc.id`, falling back to `BaseVectorDB._stable_id(doc)` if missing.
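The ID-defaulting and upsert fallback can be sketched as follows; `_stable_id` here is a hypothetical stand-in for `BaseVectorDB._stable_id` (a deterministic content hash), and docs are shown as plain dicts for brevity:

```python
import hashlib
import json

def _stable_id(text, metadata):
    # Hypothetical stand-in for BaseVectorDB._stable_id: a deterministic
    # hash of the document's text and metadata.
    payload = json.dumps({"text": text, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def resolve_ids(docs):
    # Keep explicit IDs; docs without one get a stable content hash.
    return [d.get("id") or _stable_id(d["text"], d.get("metadata")) for d in docs]

def add_with_upsert(collection, ids, documents, embeddings, metadatas):
    # Prefer upsert when the client supports it; older Chroma versions
    # need an explicit delete + add instead.
    if hasattr(collection, "upsert"):
        collection.upsert(ids=ids, documents=documents,
                          embeddings=embeddings, metadatas=metadatas)
    else:
        collection.delete(ids=ids)
        collection.add(ids=ids, documents=documents,
                       embeddings=embeddings, metadatas=metadatas)
```

Because the fallback IDs are deterministic, re-adding the same document overwrites the old record instead of duplicating it.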
similarity_search(query_embedding, top_k=20, metadata_filter=None, similarity_threshold=None) -> list[tuple[Doc, float]]¶
- Applies metadata filters via Chroma's `where` clause (exact values, or `$in` for lists).
- Fetches up to `top_k * 2` raw hits, then deduplicates and trims to `top_k`.
- Converts Chroma distance to cosine similarity as `1.0 - distance`.
- If `similarity_threshold` is provided, drops hits scoring below it.
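The post-processing steps above (dedupe, distance-to-similarity conversion, threshold, trim) can be sketched in isolation; `postprocess` is an illustrative helper operating on already-fetched raw hits, not a method of the store:

```python
def postprocess(raw_ids, raw_distances, top_k, similarity_threshold=None):
    # Deduplicate by ID, convert Chroma distance to cosine similarity,
    # apply the optional threshold, and trim to top_k.
    seen, hits = set(), []
    for doc_id, distance in zip(raw_ids, raw_distances):
        if doc_id in seen:
            continue
        seen.add(doc_id)
        score = 1.0 - distance  # cosine distance -> cosine similarity
        if similarity_threshold is not None and score < similarity_threshold:
            continue
        hits.append((doc_id, score))
    return hits[:top_k]
```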
list_all_documents(metadata_filter=None) -> list[Doc]¶
- Returns `Doc` objects with `text`, `embedding`, and `metadata`.
- Uses `collection.get(include=["documents", "metadatas", "embeddings"])`.
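A sketch of converting a `collection.get(...)` result into `Doc`-shaped records (plain dicts stand in for `Doc` here, and the fixture shape mirrors Chroma's get output; note that Chroma may return `None` for a record's metadata):

```python
def to_docs(result):
    # 'result' mimics collection.get(include=["documents", "metadatas", "embeddings"]).
    # Normalize None metadatas to {} so downstream code can call .get() safely.
    docs = []
    for i, doc_id in enumerate(result["ids"]):
        docs.append({
            "id": doc_id,
            "text": result["documents"][i],
            "embedding": result["embeddings"][i],
            "metadata": result["metadatas"][i] or {},
        })
    return docs
```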
clear_collection() / delete_collection() / delete_documents(ids)¶
- `clear_collection()` recreates the collection (handy for tests).
- `delete_collection()` deletes the collection and nulls the handle (recreate as needed).
- `delete_documents(ids)` removes specific records by ID.
Tips & Troubleshooting¶
- Embeddings: You are responsible for generating embeddings with a consistent dimension per collection.
- Thresholding: Tune `similarity_threshold` (a cosine similarity) to filter out noisy hits.
- Filters: For multi-value filters, pass lists (the store converts them to `$in`). Example: `{"topic": ["nlp", "search"]}`.
- Persistence: Ensure `persist_directory` is writable and not on a volatile mount if you expect durability.
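The list-to-`$in` conversion mentioned under Filters can be sketched as (an illustrative helper, not the store's actual internals):

```python
def build_where(metadata_filter):
    # Scalars become exact-match clauses; lists become Chroma's $in operator.
    if not metadata_filter:
        return None
    return {
        key: ({"$in": value} if isinstance(value, list) else value)
        for key, value in metadata_filter.items()
    }
```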