ADR-061: FAISS Phase 1 with Post-Filter Metadata Strategy
- Status: Accepted
- Date: 2026-04-03
- Authors: Podcast Scraper Team
- Related RFCs: RFC-061
- Related PRDs: PRD-021
Context & Problem Statement
The VectorStore protocol (ADR-060) needs a Phase 1 implementation. FAISS provides
fast in-process vector search but has no built-in metadata filtering. The system needs
to filter search results by type, feed, date, speaker, and grounding status. The
question is how to handle filtering without native support.
Decision
We adopt FAISS with post-filter metadata for Phase 1:
- FAISS as the default backend: `faiss-cpu` in `[project.dependencies]`. `faiss-gpu` as optional extra for CUDA users.
- Over-fetch then post-filter: FAISS returns `top_k * 3` candidates. A Python post-filter applies metadata predicates (type, feed, date, speaker, grounded). The top `k` from the filtered set are returned.
- Metadata sidecar: `metadata.json` alongside `vectors.faiss` maps doc_id to metadata dict. Start with JSON; add SQLite option when corpora exceed ~50K vectors.
- Auto index type selection: `IndexFlatIP` + `IndexIDMap` for < 100K vectors; `IndexIVFFlat` for 100K–1M; `IndexIVFPQ` for > 1M. Auto-selected at persist time.
- On-disk layout: `<output_dir>/search/vectors.faiss`, `metadata.json`, `index_meta.json`.
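The index-type thresholds in the list above can be sketched as a small pure function. This is only an illustration: the name `select_index_type` is hypothetical, the function returns the FAISS class names as strings rather than constructing indexes, and treating exactly 1M vectors as still belonging to the `IndexIVFFlat` band is an assumption about how the "100K–1M" range is bounded.

```python
def select_index_type(n_vectors: int) -> str:
    """Return the name of the FAISS index type for a corpus size.

    Hypothetical helper: thresholds follow the ADR, boundary handling
    at exactly 100K and 1M is an assumption.
    """
    if n_vectors < 0:
        raise ValueError("n_vectors must be non-negative")
    if n_vectors < 100_000:
        # Exact inner-product search, wrapped so vectors keep their own IDs.
        return "IndexFlatIP+IndexIDMap"
    if n_vectors <= 1_000_000:
        # Inverted-file index: approximate search over clustered vectors.
        return "IndexIVFFlat"
    # Inverted file plus product quantization for the largest corpora.
    return "IndexIVFPQ"
```

Because the choice is made at persist time, a helper like this would only need the vector count that is about to be written.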
Rationale
- CLI-first: FAISS is in-process, zero server overhead, ~20 MB dependency. No Docker, no external database for the default path.
- Sufficient at CLI scale: Post-filtering on ~1K candidates is sub-millisecond. The over-fetch ratio (3x) handles typical filter selectivity.
- Known upgrade path: Qdrant Phase 2 replaces post-filter with native payload filtering. The `VectorStore` protocol (ADR-060) makes the switch transparent.
- Simplicity: Post-filter is ~15 lines of Python. No custom FAISS index or external filtering library.
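To give a sense of how small the post-filter can be, here is a minimal sketch. It is not the project's implementation: `post_filter`, `OVERFETCH_RATIO`, the predicate style, and the sample metadata fields are all hypothetical, and the candidate list stands in for an over-fetched FAISS result that is already sorted best-first.

```python
import logging
from typing import Any, Callable, Iterable

logger = logging.getLogger(__name__)

OVERFETCH_RATIO = 3  # the 3x over-fetch from the ADR; assumed to be configurable

Metadata = dict[str, Any]
Predicate = Callable[[Metadata], bool]


def post_filter(
    candidates: Iterable[tuple[str, float]],
    metadata: dict[str, Metadata],
    predicates: list[Predicate],
    k: int,
) -> list[tuple[str, float]]:
    """Return up to k (doc_id, score) pairs whose metadata passes every predicate.

    `candidates` is assumed to hold k * OVERFETCH_RATIO hits, best first.
    """
    kept = [
        (doc_id, score)
        for doc_id, score in candidates
        if doc_id in metadata and all(p(metadata[doc_id]) for p in predicates)
    ][:k]
    if len(kept) < k:
        # Highly selective filters can exhaust the over-fetched candidates.
        logger.warning("post-filter returned %d of %d requested results", len(kept), k)
    return kept


# Illustrative data only; field names are assumptions.
metadata = {
    "ep1": {"type": "transcript", "feed": "a"},
    "ep2": {"type": "summary", "feed": "a"},
    "ep3": {"type": "transcript", "feed": "b"},
}
candidates = [("ep1", 0.9), ("ep2", 0.8), ("ep3", 0.7)]
results = post_filter(candidates, metadata, [lambda m: m["type"] == "transcript"], k=2)
```

In this example `ep2` is dropped by the type predicate, so the two transcript hits are returned in their original score order.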
Alternatives Considered
- Qdrant local mode for Phase 1: Rejected; heavier binary dependency, "for demos" per docs, Rust binary in Python package.
- FAISS with pre-filter (separate indexes per type): Rejected; multiplies index count, complicates incremental updates, marginal benefit at CLI scale.
- SQLite FTS5 for metadata + FAISS for vectors: Rejected; adds complexity without matching the simplicity of JSON sidecar at < 50K vectors.
Consequences
- Positive: Simple, fast, minimal dependencies. Works offline. Co-located with corpus outputs.
- Negative: Post-filter may return fewer than `k` results if the filter is highly selective. A warning message is emitted when this occurs. The over-fetch ratio is tunable.
- Neutral: Requires `faiss-cpu` as a new dependency. Auto index type selection adds minor complexity but is transparent to callers.
Implementation Notes
- Module: `src/podcast_scraper/search/faiss_store.py`
- Index files: `<output_dir>/search/vectors.faiss`, `metadata.json`, `index_meta.json`
- Config: `vector_backend: "faiss"` (default), `vector_index_path` (optional override)
- Upgrade trigger: When platform mode ships, switch to `vector_backend: "qdrant"`
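The on-disk layout and JSON sidecar described in these notes could be handled along the following lines. This is a sketch under assumptions: the helper names are hypothetical, and `vector_index_path` is treated here as an override for the directory that holds the three files, which the ADR does not spell out.

```python
import json
from pathlib import Path
from typing import Optional


def search_paths(output_dir: str, vector_index_path: Optional[str] = None) -> dict[str, Path]:
    """Resolve the three index files (hypothetical helper).

    Assumption: vector_index_path, when set, replaces <output_dir>/search.
    """
    base = Path(vector_index_path) if vector_index_path else Path(output_dir) / "search"
    return {
        "vectors": base / "vectors.faiss",
        "metadata": base / "metadata.json",
        "index_meta": base / "index_meta.json",
    }


def save_metadata(path: Path, metadata: dict[str, dict]) -> None:
    """Write the doc_id -> metadata mapping as the JSON sidecar."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")


def load_metadata(path: Path) -> dict[str, dict]:
    """Read the JSON sidecar back into a doc_id -> metadata mapping."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Keeping path resolution in one place means a later SQLite sidecar or a different backend would only need to change these helpers, not their callers.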