Executive Summary
A news publisher's RAG system collapsed at 5M articles, far below its 100M target. Engineers rebuilt the retrieval pipeline with FAISS sharding, incremental indexing, and query routing, reaching 100M-document capacity at 75ms query latency.
Key Outcomes
- ▹ 5M → 100M documents (20x scale)
- ▹ 75ms query latency (within SLA)
- ▹ Zero downtime during index rebuilds
Client Situation
The publisher wanted to build an AI research assistant over its entire 20-year archive (100M articles). The prototype worked at 1M articles but broke down as the corpus grew, hitting a hard ceiling at 5M.
Key Challenges
- ⚠ Uncompressed (flat-storage) vector index infeasible at 100M scale
- ⚠ Daily index rebuild taking 24+ hours
- ⚠ Query latency ballooning from 50ms to 2s
Existing Architecture
Single-node FAISS with HNSW index. All embeddings in memory. Daily full rebuild.
- Memory limit (256GB) maxed at 5M 768-dim vectors
- HNSW index rebuild O(n log n) too slow at scale
- Single point of failure for retrieval
Solution Design
Sharded FAISS with IVF index, incremental indexing via Kafka, and query-time routing.
Key Decisions
- ✓ IVFPQ index (4-bit) reducing memory 8x
- ✓ Document sharding by year (20 shards, 5M each)
- ✓ Incremental updates via Kafka stream
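The 8x memory figure can be checked with simple arithmetic. The sketch below assumes a float32 baseline and a code size of 4 bits per dimension, which is the configuration consistent with an 8x reduction; the actual PQ layout used is not stated here, and the totals cover vector codes only (no IDs, inverted-list overhead, or replicas).

```python
# Back-of-the-envelope memory estimate for 768-dim embeddings.
# Assumptions (not confirmed by the case study): float32 baseline,
# 4 bits per dimension after compression, codes only.

DIM = 768
N_VECTORS = 100_000_000
SHARD_SIZE = 5_000_000


def raw_bytes_per_vector(dim: int, bytes_per_component: int = 4) -> int:
    """Size of one uncompressed float32 vector."""
    return dim * bytes_per_component


def code_bytes_per_vector(dim: int, bits_per_component: int) -> int:
    """Size of one compressed code at the given bits per dimension."""
    return dim * bits_per_component // 8


raw = raw_bytes_per_vector(DIM)            # 3072 bytes
coded = code_bytes_per_vector(DIM, 4)      # 384 bytes
ratio = raw / coded                        # 8.0

raw_total_gb = raw * N_VECTORS / 1e9       # 307.2 GB, above a 256GB node
coded_total_gb = coded * N_VECTORS / 1e9   # 38.4 GB across all shards
coded_shard_gb = coded * SHARD_SIZE / 1e9  # 1.92 GB per 5M-vector shard
```

Under these assumptions the full-precision archive would not fit on the original 256GB node, while the compressed codes for a single yearly shard are small enough to replicate freely.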
Implementation
Pilot with 5 years of data before scaling to full 20-year archive.
Phase 1: Sharding Strategy
Sharded the 100M documents by year into 20 FAISS indexes (about 5M each).
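Sharding by year amounts to using the publication year as the shard key. A minimal sketch follows; `FIRST_YEAR`, `LAST_YEAR`, `shard_for`, and `partition` are hypothetical names chosen for illustration, and the year range is an assumed 20-year window rather than the publisher's actual one.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Assumed archive window (hypothetical): 20 yearly shards.
FIRST_YEAR = 2005
LAST_YEAR = 2024


def shard_for(published: date) -> int:
    """Return the shard key (the publication year) for an article."""
    if not FIRST_YEAR <= published.year <= LAST_YEAR:
        raise ValueError(f"no shard for year {published.year}")
    return published.year


def partition(articles: Iterable[Tuple[int, date]]) -> Dict[int, List[int]]:
    """Group (article_id, published) pairs into per-year lists of ids."""
    shards: Dict[int, List[int]] = defaultdict(list)
    for article_id, published in articles:
        shards[shard_for(published)].append(article_id)
    return dict(shards)


example = partition([
    (1, date(2005, 3, 14)),
    (2, date(2024, 11, 2)),
    (3, date(2005, 9, 30)),
])
```

Each per-year list would then feed its own FAISS index; the index construction itself is omitted here.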
Phase 2: Incremental Indexing
Kafka consumers update the shard indexes in near real time, applying new articles in 5-second micro-batches.
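The consumer-side batching can be sketched without any Kafka dependency. The `MicroBatcher` class below is a hypothetical illustration: it buffers incoming (doc_id, vector) pairs and hands them to a caller-supplied `flush` callback when either a time interval (5 seconds, the micro-batch size mentioned in the FAQ) or an assumed size limit is reached. In a real deployment the callback would add the vectors to the appropriate FAISS shard.

```python
import time
from typing import Callable, List, Tuple

Vector = List[float]
Item = Tuple[int, Vector]


class MicroBatcher:
    """Buffer index updates and flush them in small batches."""

    def __init__(
        self,
        flush: Callable[[List[Item]], None],
        interval_s: float = 5.0,
        max_items: int = 1000,  # assumed cap, not from the case study
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._interval_s = interval_s
        self._max_items = max_items
        self._clock = clock
        self._pending: List[Item] = []
        self._last_flush = clock()

    def add(self, doc_id: int, vector: Vector) -> None:
        """Queue one update; flush if the batch is full or overdue."""
        self._pending.append((doc_id, vector))
        overdue = self._clock() - self._last_flush >= self._interval_s
        if overdue or len(self._pending) >= self._max_items:
            self.flush_now()

    def flush_now(self) -> None:
        """Hand the pending batch to the callback and reset the timer."""
        if self._pending:
            self._flush(self._pending)
            self._pending = []
        self._last_flush = self._clock()
```

Note that this sketch only checks the timer when a new item arrives; a production consumer would also call `flush_now()` on a poll timeout so that a quiet stream does not leave articles unindexed.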
Phase 3: Query Routing
The query router fans each query out to the relevant shards based on its date filter.
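With yearly shards, routing reduces to intersecting the query's date range with the archive's year range. The following is a minimal sketch under that assumption; `route`, `FIRST_YEAR`, and `LAST_YEAR` are hypothetical names, and the year window is illustrative.

```python
from datetime import date
from typing import List, Optional

# Assumed archive window (hypothetical): 20 yearly shards.
FIRST_YEAR = 2005
LAST_YEAR = 2024


def route(start: Optional[date], end: Optional[date]) -> List[int]:
    """Return the shard keys (years) that overlap the date filter.

    A missing bound means the filter is open on that side, so a query
    with no filter at all fans out to every shard.
    """
    lo = max(start.year, FIRST_YEAR) if start else FIRST_YEAR
    hi = min(end.year, LAST_YEAR) if end else LAST_YEAR
    return list(range(lo, hi + 1))  # empty when the filter misses the archive


narrow = route(date(2019, 3, 1), date(2021, 6, 30))  # [2019, 2020, 2021]
unfiltered = route(None, None)                       # all 20 shards
```

The gap between `narrow` and `unfiltered` is the reason date filters matter so much for latency in this design.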
Technical Challenges
- Cross-shard query merging
Impact: Queries without date filters needed to search all 20 shards (20x latency)
Resolution: Required a date filter on queries and capped the number of shards searched for unconstrained queries
- IVF index training at scale
Impact: Training on 100M vectors took 1 week
Resolution: Trained on 10% sample + incremental fine-tuning
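For the cross-shard merging challenge above, the core step is combining each shard's local top-k into a global top-k. A minimal sketch using the standard library follows; `merge_top_k` is a hypothetical helper, and it assumes every shard returns (distance, doc_id) pairs whose distances are comparable across shards (same embedding model and metric, and quantizers similar enough that approximate distances line up).

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, int]  # (distance, doc_id); smaller distance is closer


def merge_top_k(per_shard_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard result lists into the k globally closest hits."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nsmallest(k, all_hits)


merged = merge_top_k(
    [
        [(0.2, 1), (0.5, 2)],  # hits from one shard
        [(0.1, 3), (0.9, 4)],  # hits from another shard
    ],
    k=3,
)
```

Because each shard contributes at most k candidates, the merge cost grows with the number of shards searched, which is one more reason to keep the fan-out small.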
Results
- Document capacity
- Before: 5M | After: 100M | Improvement: 20x increase
- Query latency (P99)
- Before: 2s (at 5M) | After: 75ms (at 100M) | Improvement: 96% reduction
- Index rebuild time
- Before: 24 hours | After: 15 minutes (incremental) | Improvement: 99% reduction
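The improvement figures follow directly from the before and after values. A short check, with `pct_reduction` as a hypothetical helper name:

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


capacity_gain = 100_000_000 / 5_000_000        # 20.0x documents
latency_reduction = pct_reduction(2000, 75)    # in ms: 96.25%
rebuild_reduction = pct_reduction(24 * 60, 15) # in minutes: about 98.96%
```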
Lessons Learned
- 📘 IVFPQ with 4-bit compression reduced memory 8x with <5% recall loss
- 📘 Sharding by natural partition (date) simplified query routing
- 📘 Incremental indexing essential for real-time news ingestion
What We Would Do Differently
- 💡 Use HNSW for hot shards, IVF for cold storage
- 💡 Implement adaptive query routing based on result count
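One possible reading of adaptive routing is to search the newest shards first and widen the search only while too few results have come back. The sketch below illustrates that idea only; it is not the team's design, and `adaptive_search`, `search_shard`, `min_results`, and `batch` are all hypothetical names and parameters. `search_shard` is assumed to return only hits that already pass a relevance threshold.

```python
from typing import Callable, List, Tuple

Hit = Tuple[float, int]  # (distance, doc_id); smaller distance is closer


def adaptive_search(
    shards_newest_first: List[int],
    search_shard: Callable[[int], List[Hit]],
    min_results: int,
    batch: int = 2,
) -> Tuple[List[Hit], int]:
    """Search shards in batches, stopping once enough hits are found.

    Returns the hits sorted by distance and the number of shards searched.
    """
    hits: List[Hit] = []
    searched = 0
    for start in range(0, len(shards_newest_first), batch):
        for shard in shards_newest_first[start:start + batch]:
            hits.extend(search_shard(shard))
            searched += 1
        if len(hits) >= min_results:
            break
    return sorted(hits), searched


def fake_search(year: int) -> List[Hit]:
    """Stand-in for a real shard search: one hit per shard from 2022 on."""
    return [(0.5, year)] if year >= 2022 else []


few, searched_few = adaptive_search([2024, 2023, 2022, 2021], fake_search, 2)
more, searched_more = adaptive_search([2024, 2023, 2022, 2021], fake_search, 3)
```

In this toy run, asking for two results touches two shards, while asking for three forces a second batch and touches all four.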
Role Relevance
RAG engineers designed the sharding and indexing strategy that scaled retrieval 20x, enabling AI-powered search over 20 years of content.
Frequently Asked Questions
- How did you handle real-time ingestion of breaking news?
- Kafka stream with 5-second micro-batch updates to FAISS shards.
- What was the cost of 100M vector storage?
- 120GB with 4-bit PQ, costing $300/month on SSD storage.