Executive Summary
A social media platform with 200M monthly active users needed to scale its visual search feature. Its existing FAISS deployment failed at 50M vectors. After rearchitecting around IVF+PQ indexing and GPU acceleration, the platform reached a capacity of 1.2B vectors at 87ms P99 query latency.
Key Outcomes
- ▹ Scaled from 10M to 1.2B vectors (120x growth)
- ▹ Reduced query latency from 2.5s to 87ms
- ▹ Decreased infrastructure costs by 65%
Client Situation
The platform's visual search feature was growing 20% month-over-month. Their FlatL2 index with 10M vectors was already at capacity, and engineering couldn't keep up with demand.
Key Challenges
- ⚠ Flat index search is an O(n) linear scan, which became impractical beyond 50M vectors
- ⚠ Memory constraints of 256GB RAM limiting vector count
- ⚠ QPS growth from 50 to 5,000 overwhelming existing infrastructure
Existing Architecture
Single-node FAISS with FlatL2 index storing 768-dim CLIP embeddings. All vectors in RAM with exhaustive search.
- O(n) search cost made scaling impossible
- 30ms per query at 10M vectors, ballooning to 2.5s at 50M
- No sharding or distributed search capability
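The memory pressure described above follows from simple arithmetic. The sketch below estimates the raw storage a flat index needs for 768-dim float32 vectors; it assumes 4 bytes per value, treats 256GB as decimal gigabytes, and ignores ID storage and allocator overhead, so real usage would be somewhat higher.

```python
# Raw float32 storage needed by a flat (uncompressed) index.
DIM = 768               # CLIP embedding size stated in the case study
BYTES_PER_FLOAT32 = 4
RAM_BYTES = 256e9       # 256 GB, treated as decimal gigabytes (assumption)


def flat_index_bytes(num_vectors: int, dim: int = DIM) -> int:
    """Bytes of raw vector data; excludes IDs and allocator overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32


for n in (10_000_000, 50_000_000, 1_200_000_000):
    size = flat_index_bytes(n)
    print(f"{n:>13,} vectors: {size / 1e9:8.1f} GB ({size / RAM_BYTES:.0%} of RAM)")
```

Under these assumptions, 10M vectors need about 31 GB and 50M about 154 GB, while 1.2B would need roughly 3.7 TB, far beyond a single 256GB node without compression.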
Solution Design
Implemented IVF (Inverted File Index) with PQ (Product Quantization) compression and GPU acceleration for training and search.
Key Decisions
- ✓ Use IVF16384 with nprobe=32 for accuracy/performance balance
- ✓ PQ64 compression reducing vector size from 3KB to 128 bytes
- ✓ Multi-GPU sharding with 4x A100s for parallel search
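To make the compression decision concrete, the sketch below computes the size of one PQ-encoded vector as the number of sub-quantizers times the bits per code. The helper name `pq_code_bytes` is hypothetical, and the 8-bit default reflects the common FAISS setting; the case study does not state which bit width was used.

```python
def pq_code_bytes(num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Size of one PQ-encoded vector: one code per sub-quantizer."""
    total_bits = num_subquantizers * bits_per_code
    return (total_bits + 7) // 8  # round up to whole bytes


DIM = 768
raw_bytes = DIM * 4                  # float32 storage: 3072 bytes (about 3KB)
code_bytes = pq_code_bytes(64, 8)    # 64 sub-quantizers at 8 bits each
dims_per_subvector = DIM // 64       # each sub-quantizer encodes 12 dimensions

print(raw_bytes, code_bytes, raw_bytes / code_bytes)
```

With 8-bit codes, 64 sub-quantizers give 64 bytes per vector. A 128-byte code would correspond to, for example, 128 sub-quantizers at 8 bits or 64 at 16 bits, so the exact configuration behind the quoted figure should be confirmed against the actual index settings.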
Implementation
Phased rollout starting with index rebuild, then query path migration, finally capacity expansion.
Phase 1: Index Migration
Reindexed all 10M vectors to IVF+PQ, validated recall stayed above 95%.
Phase 2: GPU Acceleration
Moved search to 4x A100 GPUs, achieving 5ms per query.
Phase 3: Horizontal Scaling
Implemented sharding across GPU nodes for billion-scale capacity.
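In a sharded setup each shard searches only its own vectors, and the partial results must be combined into one global top-k. The sketch below shows only that merge step in plain Python; the function name and the `(distance, vector_id)` tuple layout are illustrative assumptions, not the platform's actual code. FAISS also ships its own sharding helpers such as `IndexShards`.

```python
import heapq
from itertools import islice
from typing import List, Tuple

Hit = Tuple[float, int]  # (distance, vector_id); smaller distance is better


def merge_shard_results(per_shard: List[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard result lists, each sorted by ascending distance."""
    # heapq.merge lazily interleaves the sorted inputs, so only the
    # first k merged items are ever materialized.
    return list(islice(heapq.merge(*per_shard), k))


shard_a = [(0.10, 7), (0.42, 3), (0.90, 12)]
shard_b = [(0.05, 101), (0.50, 140), (0.70, 133)]
print(merge_shard_results([shard_a, shard_b], k=3))
# [(0.05, 101), (0.1, 7), (0.42, 3)]
```

Each shard must return at least k candidates of its own, since the global top-k could in principle come entirely from a single shard.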
Technical Challenges
- Trade-off between recall and latency
Impact: Aggressive compression dropped recall to 85%, breaking user experience
Resolution: Tuned nprobe from 16 to 32 and PQ from 96 to 64, achieving 97% recall at 87ms
- GPU memory fragmentation
Impact: Out-of-memory errors during index building at 500M vectors
Resolution: Implemented chunked training with memory pooling and gradient checkpointing
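One way to keep peak memory bounded during an index build is to feed vectors in fixed-size chunks sized to a staging buffer. The sketch below is a generic illustration of that idea, not the team's implementation; the 2 GiB buffer and both helper names are hypothetical.

```python
from typing import Iterator, Tuple


def chunk_size_for_budget(budget_bytes: int, dim: int = 768,
                          bytes_per_value: int = 4) -> int:
    """Largest number of float32 vectors whose raw data fits in the budget."""
    return max(1, budget_bytes // (dim * bytes_per_value))


def chunk_ranges(total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, stop) ranges covering `total` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


size = chunk_size_for_budget(2 * 1024**3)        # hypothetical 2 GiB staging buffer
ranges = list(chunk_ranges(500_000_000, size))   # 500M vectors, as in the failure case
print(size, len(ranges), ranges[-1])
```

Each range would then be loaded, added to the index, and released before the next one, so the staging buffer never holds more than one chunk.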
Results
- Maximum vector capacity: 50M before, 1.2B after (24x increase)
- P99 query latency: 2.5 seconds before, 87 milliseconds after (96% reduction)
- Infrastructure cost per million vectors: $42 before, $11 after (74% reduction)
Lessons Learned
- 📘 The IVF+PQ sweet spot requires empirical tuning for each dataset
- 📘 GPU training speed allows daily index rebuilds, enabling fresh embeddings
- 📘 Recall metrics must be measured on real query logs, not synthetic
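Recall on logged queries can be measured by comparing the approximate index's results with exact nearest neighbors for the same queries. The sketch below assumes both result sets are already available as lists of vector IDs per query; the function name and input layout are illustrative.

```python
from typing import Sequence


def recall_at_k(approx_ids: Sequence[Sequence[int]],
                exact_ids: Sequence[Sequence[int]], k: int) -> float:
    """Mean fraction of the exact top-k that the approximate index returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(approx_ids) != len(exact_ids) or not exact_ids:
        raise ValueError("need one non-empty result list per query on both sides")
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        truth = set(exact[:k])
        total += len(truth & set(approx[:k])) / len(truth)
    return total / len(exact_ids)


# Two logged queries: the first misses one true neighbor, the second is perfect.
approx = [[1, 2, 3, 9], [5, 6, 7, 8]]
exact = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(recall_at_k(approx, exact, k=4))  # 0.875
```

Running the same function over a sample of production query logs, rather than synthetic vectors, gives a recall figure that reflects the real query distribution.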
What We Would Do Differently
- 💡 Implement automated recall/latency A/B testing from day one
- 💡 Use RAFT library for faster IVF training on GPU
Role Relevance
FAISS experts were critical for understanding index types, compression trade-offs, and GPU memory optimization, knowledge that generalist engineers lacked.
Frequently Asked Questions
- Why FAISS over other vector databases like Pinecone or Milvus?
- FAISS offered 3x lower latency at similar recall for their GPU budget, plus full control over indexing parameters.
- What was the recall vs latency trade-off finally accepted?
- 97.2% recall at 87ms P99, compared with 99.5% recall at 450ms, a trade-off considered acceptable for visual search.