OFFLINEPIXEL
Media / Publishing

Scaling Document Retrieval Platforms

A media company scaled document retrieval from 1M to 100M documents using vector database sharding and incremental indexing.

Executive Summary

A news publisher's RAG system collapsed at 5M articles, far below its 100M target. Engineers rebuilt the retrieval pipeline with FAISS sharding, incremental indexing, and query routing, reaching 100M-document capacity with 75ms P99 query latency.

Key Outcomes

  • 1M → 100M documents (100x scale)
  • 75ms query latency (within SLA)
  • Zero downtime during index rebuilds

Client Situation

The publisher wanted to build an AI research assistant over its entire 20-year archive (100M articles). The prototype worked at 1M articles but did not hold up as the corpus grew.

Key Challenges

  • Flat vector index impossible at 100M scale
  • Daily index rebuild taking 24+ hours
  • Query latency ballooning from 50ms to 2s

Existing Architecture

Single-node FAISS with HNSW index. All embeddings in memory. Daily full rebuild.

  • Memory limit (256GB) maxed at 5M 768-dim vectors
  • HNSW index rebuild O(n log n) too slow at scale
  • Single point of failure for retrieval

Solution Design

Sharded FAISS with IVF index, incremental indexing via Kafka, and query-time routing.

Key Decisions

  • IVFPQ index (4-bit) reducing memory 8x
  • Document sharding by year (20 shards, 5M each)
  • Incremental updates via Kafka stream
Technologies: FAISS, Kafka, S3, Kubernetes, gRPC
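
The 8x memory figure is what you get when moving from 32-bit floats to 4 bits per dimension. The arithmetic below is a minimal sketch under that assumption; the source does not state the PQ sub-quantizer layout, and the helper name `bytes_per_vector` is illustrative only. It ignores IVF lists, document IDs, and other index overhead.

```python
def bytes_per_vector(dim: int, bits_per_dim: int) -> float:
    """Storage for one vector's codes, ignoring any index overhead."""
    return dim * bits_per_dim / 8


DIM = 768  # embedding width quoted in the case study

raw = bytes_per_vector(DIM, 32)  # float32 storage
pq4 = bytes_per_vector(DIM, 4)   # ASSUMPTION: one 4-bit code per dimension

print(f"float32:     {raw:.0f} B/vector")   # 3072 B/vector
print(f"4-bit codes: {pq4:.0f} B/vector")   # 384 B/vector
print(f"reduction:   {raw / pq4:.0f}x")     # 8x
```

Real footprints will be larger than the per-vector code size suggests, because the index also stores IDs and inverted-list structures.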

Implementation

Pilot with 5 years of data before scaling to full 20-year archive.

  1. Phase 1: Sharding Strategy

    Sharded the 100M-document archive by year into 20 FAISS indexes (about 5M documents each).

  2. Phase 2: Incremental Indexing

    Kafka consumers update the indexes in near real time (< 1 second latency).

  3. Phase 3: Query Routing

    The query router fans out each query to the relevant shards based on its date filter.
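
The year-based sharding and date-filtered routing in the phases above can be sketched as follows. This is an illustration, not the publisher's router: `FIRST_YEAR`, `shard_for`, and `shards_for_range` are hypothetical names, and the archive start year is an assumed value that the source does not give.

```python
from datetime import date

# ASSUMPTION: one shard per calendar year across a 20-year archive.
FIRST_YEAR = 2004  # hypothetical start year, not from the source
NUM_SHARDS = 20


def shard_for(published: date) -> int:
    """Map a publication date to its 0-based shard index."""
    idx = published.year - FIRST_YEAR
    if not 0 <= idx < NUM_SHARDS:
        raise ValueError(f"{published} is outside the archive range")
    return idx


def shards_for_range(start: date, end: date) -> list[int]:
    """Return every shard a date-filtered query has to touch."""
    if start > end:
        raise ValueError("start must not be after end")
    return list(range(shard_for(start), shard_for(end) + 1))


# A query filtered to March 2019 through June 2021 touches three shards.
print(shards_for_range(date(2019, 3, 1), date(2021, 6, 30)))  # [15, 16, 17]
```

Because the partition key is the same field users filter on, the router needs no lookup table: the date range alone determines the fan-out.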

Technical Challenges

Cross-shard query merging

Impact: Queries without date filters needed to search all 20 shards (20x latency)

Resolution: Required a date filter on queries and kept the shard count small for unconstrained queries
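
When a query does span several shards, the per-shard result lists have to be merged into one global top-k. A minimal sketch of that merge step, assuming each shard returns `(distance, document id)` pairs and that distances are comparable across shards (which may not hold exactly if shards are trained separately); `merge_top_k` is an illustrative name, not something from the source:

```python
import heapq

Hit = tuple[float, str]  # (distance, document id); smaller distance is better


def merge_top_k(per_shard_hits: list[list[Hit]], k: int) -> list[Hit]:
    """Merge per-shard result lists into one global top-k by distance."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nsmallest(k, all_hits)


shard_a = [(0.12, "a-7"), (0.40, "a-2")]
shard_b = [(0.05, "b-9"), (0.33, "b-1")]

print(merge_top_k([shard_a, shard_b], k=3))
# [(0.05, 'b-9'), (0.12, 'a-7'), (0.33, 'b-1')]
```

The merge itself is cheap; the cost of unconstrained queries comes from searching every shard, which is why limiting the fan-out matters more than optimizing this step.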

IVF index training at scale

Impact: Training on 100M vectors took 1 week

Resolution: Trained on a 10% sample, followed by incremental fine-tuning
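
Selecting the training subset can be as simple as a seeded random sample of vector IDs. The sketch below shows only that selection step; `training_sample` is a hypothetical helper, and the source does not describe how the sample was drawn or how the incremental fine-tuning worked. In FAISS, the sampled vectors would then be passed to the index's `train` call.

```python
import random


def training_sample(ids: list[int], fraction: float = 0.10, seed: int = 0) -> list[int]:
    """Pick a reproducible random subset of vector ids for index training."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    size = max(1, int(len(ids) * fraction))
    return rng.sample(ids, size)


sample = training_sample(list(range(1_000)))
print(len(sample))  # 100
```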

Results

Document capacity
Before: 5M
After: 100M
Improvement: 20x increase

Query latency (P99)
Before: 2s (at 5M)
After: 75ms (at 100M)
Improvement: 96% reduction

Index rebuild time
Before: 24 hours
After: 15 minutes (incremental)
Improvement: 99% reduction
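
The improvement figures follow directly from the before and after values. A short check, with `reduction_pct` as an illustrative helper name:

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


latency = reduction_pct(2000, 75)     # milliseconds: 2s -> 75ms
rebuild = reduction_pct(24 * 60, 15)  # minutes: 24 hours -> 15 minutes

print(f"latency reduction: {latency:.2f}%")  # 96.25%
print(f"rebuild reduction: {rebuild:.2f}%")  # 98.96%
print(f"capacity increase: {100 // 5}x")     # 20x
```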

Lessons Learned

  • 📘 IVFPQ with 4-bit compression reduced memory 8x with <5% recall loss
  • 📘 Sharding by natural partition (date) simplified query routing
  • 📘 Incremental indexing essential for real-time news ingestion

What We Would Do Differently

  • 💡 Use HNSW for hot shards, IVF for cold storage
  • 💡 Implement adaptive query routing based on result count

Role Relevance

RAG engineers designed the sharding and indexing strategy that scaled retrieval 20x, enabling AI-powered search over 20 years of content.

Critical Skills Demonstrated

  • Vector database scaling
  • IVF/PQ optimization
  • Incremental indexing
  • Query routing strategies

Frequently Asked Questions

How did you handle real-time ingestion of breaking news?
Kafka stream with 5-second micro-batch updates to FAISS shards.
What was the cost of 100M vector storage?
120GB with 4-bit PQ, costing $300/month on SSD storage.
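
The 5-second micro-batching mentioned in the first answer can be sketched as a small time-based buffer. This is an illustration under assumptions, not the production consumer: `MicroBatcher` is a hypothetical class, the Kafka consumer loop is omitted, and a real system would hand each released batch to the target shard's add call. The clock is injectable so the behaviour can be shown without waiting.

```python
import time
from typing import Callable, Optional


class MicroBatcher:
    """Buffer incoming items and release them as one batch every `interval` seconds."""

    def __init__(self, interval: float = 5.0, clock: Callable[[], float] = time.monotonic):
        self.interval = interval
        self.clock = clock
        self.buffer: list = []
        self.last_flush = clock()

    def add(self, item) -> Optional[list]:
        """Add one item; return the full batch once the interval has elapsed, else None."""
        self.buffer.append(item)
        if self.clock() - self.last_flush >= self.interval:
            batch, self.buffer = self.buffer, []
            self.last_flush = self.clock()
            return batch
        return None


# Demonstration with a fake clock instead of real time.
now = [0.0]
batcher = MicroBatcher(interval=5.0, clock=lambda: now[0])
print(batcher.add("doc-1"))  # None
now[0] = 5.0
print(batcher.add("doc-2"))  # ['doc-1', 'doc-2']
```

This version only flushes when a new item arrives; a production consumer would also need a timer so that a quiet stream does not leave documents stranded in the buffer.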