OFFLINEPIXEL
Social Media / Content Discovery

Scaling Billion-Vector Search with FAISS

A social media platform reduced image search latency from 2.5 seconds to 87ms while scaling from 10M to 1.2B vectors using FAISS optimization techniques.

Executive Summary

A social media platform with 200M monthly active users needed to scale its visual search feature. The existing FAISS implementation crashed at 50M vectors. After rearchitecting with IVF+PQ indexing and GPU acceleration, the team reached 1.2B-vector capacity with 87ms P99 latency.

Key Outcomes

  • Scaled from 10M to 1.2B vectors (120x growth)
  • Reduced query latency from 2.5s to 87ms
  • Decreased infrastructure costs by 65%

Client Situation

The platform's visual search feature was growing 20% month-over-month. Their FlatL2 index with 10M vectors was already at capacity, and engineering couldn't keep up with demand.

Key Challenges

  • Flat index search is an O(n) linear scan, which became impractical beyond 50M vectors
  • Memory constraints of 256GB RAM limiting vector count
  • QPS growth from 50 to 5,000 overwhelming existing infrastructure

Existing Architecture

Single-node FAISS with FlatL2 index storing 768-dim CLIP embeddings. All vectors in RAM with exhaustive search.

  • O(n) search cost made scaling impossible
  • 30ms per query at 10M vectors, ballooning to 2.5s at 50M
  • No sharding or distributed search capability
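The memory side of these limits can be checked with simple arithmetic. The sketch below assumes float32 storage (4 bytes per dimension) and decimal gigabytes, neither of which the case study states explicitly, and it ignores vector IDs and working memory.

```python
BYTES_PER_FLOAT32 = 4
GB = 10**9  # assumption: decimal gigabytes; the source does not say GB vs GiB


def flat_index_bytes(num_vectors: int, dim: int) -> int:
    """Raw storage for uncompressed float32 vectors (no IDs, no overhead)."""
    return num_vectors * dim * BYTES_PER_FLOAT32


def max_vectors_in_ram(ram_bytes: int, dim: int) -> int:
    """Upper bound on how many raw vectors fit in the given amount of RAM."""
    return ram_bytes // (dim * BYTES_PER_FLOAT32)


for n in (10_000_000, 50_000_000, 1_200_000_000):
    print(f"{n:>13,} vectors -> {flat_index_bytes(n, 768) / GB:,.1f} GB")
print(f"ceiling at 256 GB: {max_vectors_in_ram(256 * GB, 768):,} vectors")
```

Under these assumptions, 10M vectors need about 30.7 GB, 50M need about 153.6 GB, and 256 GB of RAM holds at most roughly 83M raw 768-dim vectors, so a flat index cannot reach billion scale on one such node.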

Solution Design

Implemented IVF (Inverted File Index) with PQ (Product Quantization) compression and GPU acceleration for training and search.

Key Decisions

  • Use IVF16384 with nprobe=32 for accuracy/performance balance
  • PQ64 compression reducing vector size from 3KB to 128 bytes
  • Multi-GPU sharding with 4x A100s for parallel search

Technologies: FAISS, CUDA, GPU, Redis, gRPC
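The numbers behind these decisions can be worked through in a few lines. In FAISS's index_factory notation this configuration would typically be written "IVF16384,PQ64", where PQ codes default to 8 bits each. The case study does not state the code width: 64 sub-quantizers yield 128 bytes per vector only with 16-bit codes, and 64 bytes with the 8-bit default, so the sketch shows both. The balanced-lists estimate is also an assumption.

```python
def pq_code_bytes(num_subquantizers: int, bits_per_code: int) -> int:
    """Size of one PQ-encoded vector, excluding its ID."""
    return num_subquantizers * bits_per_code // 8


def ivf_scan_fraction(nprobe: int, nlist: int) -> float:
    """Share of inverted lists visited per query."""
    return nprobe / nlist


RAW_BYTES = 768 * 4  # one 768-dim float32 embedding, about 3KB

for bits in (8, 16):
    code = pq_code_bytes(64, bits)
    print(f"PQ64 with {bits}-bit codes: {code} bytes/vector, "
          f"{RAW_BYTES / code:.0f}x smaller than raw float32")

frac = ivf_scan_fraction(32, 16384)
print(f"nprobe=32 of 16384 lists visits {frac:.2%} of the lists per query")
# Assumes vectors are spread evenly across lists, which real data rarely is.
print(f"about {int(1_200_000_000 * frac):,} of 1.2B vectors scanned per query")
```

With evenly filled lists, each query would compare against roughly 2.3M compressed codes instead of all 1.2B vectors, which is where most of the latency reduction comes from.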

Implementation

Phased rollout starting with index rebuild, then query path migration, finally capacity expansion.

  1. Phase 1: Index Migration

    Reindexed all 10M vectors to IVF+PQ, validated recall stayed above 95%.

  2. Phase 2: GPU Acceleration

    Moved search to 4x A100 GPUs, achieving 5ms per query.

  3. Phase 3: Horizontal Scaling

    Implemented sharding across GPU nodes for billion-scale capacity.
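Sharded search usually means each shard returns its own top-k and a coordinator merges them. The case study does not describe its merge logic, so the following is a minimal sketch with hypothetical names (`Hit`, `merge_shard_results`) and made-up distances. It relies on the fact that the global top-k is always contained in the union of the per-shard top-k lists.

```python
import heapq
from itertools import islice
from typing import List, Tuple

Hit = Tuple[float, int]  # (distance, vector_id); smaller distance is better


def merge_shard_results(per_shard_hits: List[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard results, each sorted by ascending distance, into a global top-k."""
    merged = heapq.merge(*per_shard_hits)  # lazy k-way merge of sorted inputs
    return list(islice(merged, k))


# Illustrative results from two shards (values are invented for the example).
shard_a = [(0.12, 7), (0.40, 3), (0.95, 9)]
shard_b = [(0.05, 1042), (0.51, 1001), (0.77, 1100)]
print(merge_shard_results([shard_a, shard_b], k=3))
```

Because the merge is lazy, the coordinator touches only as many entries as it needs, so its cost stays small relative to the per-shard searches.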

Technical Challenges

Trade-off between recall and latency

Impact: Aggressive compression dropped recall to 85%, breaking user experience

Resolution: Tuned nprobe from 16 to 32 and PQ from 96 to 64, achieving 97% recall at 87ms
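The effect of nprobe on recall can be illustrated with a toy inverted-file index in pure Python. This is a sketch, not FAISS: centroids are sampled from the data rather than trained with k-means, there is no PQ compression, and the dataset (2,000 random 8-dimensional vectors in 32 lists) is an assumption chosen so the example runs quickly.

```python
import random


def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, nlist, rng):
    """Assign every vector to its nearest centroid (centroids sampled from the data)."""
    centroids = rng.sample(vectors, nlist)
    lists = [[] for _ in range(nlist)]
    for vid, v in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(vid)
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:k]


def recall_at_k(queries, vectors, centroids, lists, nprobe, k):
    """Fraction of true k nearest neighbours that the IVF search returns."""
    hits = 0
    for q in queries:
        exact = sorted(range(len(vectors)), key=lambda vid: sq_dist(q, vectors[vid]))[:k]
        found = ivf_search(q, vectors, centroids, lists, nprobe, k)
        hits += len(set(exact) & set(found))
    return hits / (k * len(queries))


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
centroids, lists = build_ivf(vectors, nlist=32, rng=rng)

for nprobe in (1, 4, 16, 32):
    r = recall_at_k(queries, vectors, centroids, lists, nprobe, k=10)
    print(f"nprobe={nprobe:>2}  recall@10={r:.2f}")
```

Recall never decreases as nprobe grows, and it reaches 1.0 once every list is scanned, but each extra list adds distance computations. Tuning means finding the smallest nprobe that clears the recall target on real queries.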

GPU memory fragmentation

Impact: Out-of-memory errors during index building at 500M vectors

Resolution: Implemented chunked training and batched index building with GPU memory pooling
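One common way to bound memory during index construction is to train on a sample and then add vectors in fixed-size batches. The case study does not describe its chunking scheme, so this is a sketch under assumptions: `build_in_chunks` and `RecordingIndex` are hypothetical names, and the index only needs `train` and `add` methods, which FAISS indexes also expose (with float32 NumPy arrays rather than Python lists).

```python
import random
from typing import Iterator, Sequence


def chunks(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_in_chunks(index, vectors: Sequence, train_sample_size: int,
                    chunk_size: int, seed: int = 0):
    """Train on a random sample, then add vectors batch by batch."""
    rng = random.Random(seed)
    n_train = min(train_sample_size, len(vectors))
    sample_ids = rng.sample(range(len(vectors)), n_train)
    index.train([vectors[i] for i in sample_ids])
    for batch in chunks(vectors, chunk_size):
        index.add(batch)  # only one batch is handed to the index at a time
    return index


class RecordingIndex:
    """Hypothetical stand-in that records call sizes instead of indexing."""

    def __init__(self):
        self.trained_on = 0
        self.batch_sizes = []

    def train(self, xs):
        self.trained_on = len(xs)

    def add(self, xs):
        self.batch_sizes.append(len(xs))


idx = build_in_chunks(RecordingIndex(), list(range(10)),
                      train_sample_size=4, chunk_size=3)
print(idx.trained_on, idx.batch_sizes)
```

The peak amount of data passed to the index at once is set by `chunk_size` and `train_sample_size` rather than by the total collection size.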

Results

Maximum vector capacity
  Before: 50M
  After: 1.2B
  Improvement: 24x increase

P99 query latency
  Before: 2.5 seconds
  After: 87 milliseconds
  Improvement: 96% reduction

Infrastructure cost per million vectors
  Before: $42
  After: $11
  Improvement: 74% reduction
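The improvement figures follow directly from the before and after values. A quick check, using hypothetical helper names:

```python
def fold_increase(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


print(f"capacity: {fold_increase(50e6, 1.2e9):.0f}x")
print(f"latency:  {percent_reduction(2500, 87):.1f}% lower")  # 2.5 s = 2500 ms
print(f"cost:     {percent_reduction(42, 11):.1f}% lower")
```

This gives 24x, 96.5%, and 73.8%, in line with the rounded figures reported above.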

Lessons Learned

  • 📘 The IVF+PQ sweet spot requires empirical tuning for each dataset
  • 📘 GPU training speed allows daily index rebuilds, enabling fresh embeddings
  • 📘 Recall metrics must be measured on real query logs, not synthetic

What We Would Do Differently

  • 💡 Implement automated recall/latency A/B testing from day one
  • 💡 Use RAFT library for faster IVF training on GPU

Role Relevance

FAISS experts were critical for understanding index types, compression trade-offs, and GPU memory optimization—knowledge generalist engineers lacked.

Critical Skills Demonstrated

  • Index type selection (IVF, HNSW, PQ)
  • GPU memory management
  • Recall/latency benchmarking
  • Distributed FAISS deployment

Frequently Asked Questions

Why FAISS over other vector databases like Pinecone or Milvus?
FAISS offered 3x lower latency at similar recall for their GPU budget, plus full control over indexing parameters.

What recall vs. latency trade-off was ultimately accepted?
97.2% recall at 87ms P99, down from 99.5% recall at 450ms, which was judged acceptable for visual search.