Executive Summary
A large e-commerce platform's recommendation models were too slow: 450ms P99 latency was causing user drop-off. ML engineers applied ONNX export, FP16 quantization, GPU acceleration, and response caching, reducing latency to 35ms while maintaining 98% accuracy.
Key Outcomes
- ▹ 450ms → 35ms inference latency (92% reduction)
- ▹ 25% increase in click-through rate
- ▹ GPU cost reduced 60%
Client Situation
Recommendation models were accurate but slow, causing 15% of users to abandon before seeing personalized results.
Key Challenges
- ⚠ 450ms P99 latency exceeding SLAs
- ⚠ PyTorch models inefficient on CPU
- ⚠ Single-threaded, unbatched inference causing additional queuing delays
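For context, P99 is the latency that 99% of requests stay at or under, so a small share of slow requests is enough to breach an SLA. The sketch below uses the nearest-rank method; the sample latencies are made up for illustration and are not the client's measurements.

```python
import math


def percentile_nearest_rank(samples_ms, percent):
    """Return the nearest-rank percentile (percent is an integer 1..100)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank: smallest value with at least percent% of samples at or below it.
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 slow ones.
latencies_ms = [30] * 98 + [450] * 2
p50 = percentile_nearest_rank(latencies_ms, 50)
p99 = percentile_nearest_rank(latencies_ms, 99)
print(f"P50={p50}ms P99={p99}ms")
```

With only 2% of requests slow, the median stays at 30ms while P99 jumps to 450ms, which is why tail latency is tracked separately from the average.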
Existing Architecture
PyTorch models running on CPU instances with single-threaded inference. No batching or caching.
- CPU inference 10x slower than GPU potential
- No model optimization applied
- Duplicate predictions for popular items
Solution Design
Multi-stage optimization: ONNX export, FP16 quantization, TensorRT optimization, and response caching.
Key Decisions
- ✓ ONNX runtime with TensorRT backend
- ✓ FP16 quantization (2x speedup, 0.1% accuracy loss)
- ✓ Redis caching for popular item recommendations
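To give a feel for why FP16 tends to be safe for ranking tasks, the sketch below rounds a few scores to half precision and checks that the ordering survives. It uses only the standard library; the item names and scores are hypothetical, and it illustrates output rounding only, not the effect of running a whole network in FP16.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical relevance scores produced by a recommendation model.
scores = {"item_a": 0.9137, "item_b": 0.9121, "item_c": 0.5012, "item_d": 0.1003}
scores_fp16 = {item: to_fp16(s) for item, s in scores.items()}

max_rel_error = max(abs(scores_fp16[i] - s) / s for i, s in scores.items())
ranking_full = sorted(scores, key=scores.get, reverse=True)
ranking_fp16 = sorted(scores_fp16, key=scores_fp16.get, reverse=True)

print(f"max relative error: {max_rel_error:.2e}")
print("ranking preserved:", ranking_full == ranking_fp16)
```

Half precision keeps about three significant decimal digits, so rankings only change when two scores are closer together than that.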
Implementation
Optimized models incrementally, A/B testing each change in production.
Phase 1: ONNX Export
Converted PyTorch models to ONNX format—immediate 2x speedup.
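An export step of this kind might look roughly like the sketch below, which uses `torch.onnx.export` with a variable batch dimension. The tensor names, the opset version, and the helper functions are assumptions for illustration; the case study does not describe the actual export script.

```python
def build_dynamic_axes(input_names, output_names):
    """Mark dimension 0 of every tensor as a variable-size batch dimension."""
    return {name: {0: "batch"} for name in [*input_names, *output_names]}


def export_to_onnx(model, example_inputs, path, input_names, output_names):
    """Export a PyTorch module to ONNX (requires the torch package)."""
    import torch  # imported lazily so the helper above works without torch

    model.eval()  # disable dropout / batch-norm updates before tracing
    torch.onnx.export(
        model,
        example_inputs,          # tuple of example tensors used for tracing
        path,
        input_names=input_names,
        output_names=output_names,
        dynamic_axes=build_dynamic_axes(input_names, output_names),
        opset_version=17,        # assumed opset; pick one your runtime supports
    )


# Hypothetical tensor names for a two-tower style recommender.
print(build_dynamic_axes(["user_features", "item_features"], ["scores"]))
```

Declaring the batch dimension as dynamic matters later, because a model exported with a fixed batch size cannot be served with variable-size batches.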
Phase 2: GPU Migration
Moved inference to NVIDIA T4 GPUs with TensorRT—10x speedup.
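One common way to run an ONNX model on TensorRT is through ONNX Runtime's execution providers, falling back to CUDA and then CPU when TensorRT is unavailable. The sketch below assumes the `onnxruntime-gpu` package; the provider order and the helper names are illustrative choices, not the team's documented configuration.

```python
PREFERRED_PROVIDERS = [
    # TensorRT first, with FP16 enabled via provider options.
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    ("CUDAExecutionProvider", {}),
    ("CPUExecutionProvider", {}),
]


def select_providers(available):
    """Keep the preferred providers that this onnxruntime build offers."""
    return [(name, opts) for name, opts in PREFERRED_PROVIDERS if name in available]


def create_session(model_path):
    """Open an ONNX model with the best available execution provider."""
    import onnxruntime as ort  # requires onnxruntime-gpu for TensorRT/CUDA

    providers = select_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


# On a CPU-only machine only the CPU provider survives the filter.
print(select_providers(["CPUExecutionProvider"]))
```

Filtering against the available providers keeps the same code usable on developer laptops and on GPU hosts.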
Phase 3: Caching
Added Redis cache for top 100K items—90% cache hit rate.
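The caching layer can be pictured as a cache-aside lookup: check the cache first and run the model only on a miss. In the sketch below an in-memory class stands in for Redis, exposing only the `get` and `setex` methods that a redis-py client also provides; the key scheme, the TTL, and the fake model are assumptions for illustration.

```python
import json


class InMemoryCache:
    """Stand-in exposing the two redis-py methods the sketch uses (get, setex)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def setex(self, key, ttl_seconds, value):
        self._data[key] = value  # TTL is ignored in this stand-in


def cached_recommendations(cache, item_id, compute, ttl_seconds=300):
    """Cache-aside lookup: return cached results, or compute and store them."""
    key = f"recs:{item_id}"  # hypothetical key scheme
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute(item_id)  # falls through to model inference on a miss
    cache.setex(key, ttl_seconds, json.dumps(result))
    return result


calls = []


def fake_model(item_id):
    calls.append(item_id)
    return [item_id + 1, item_id + 2]  # placeholder "recommendations"


cache = InMemoryCache()
first = cached_recommendations(cache, 42, fake_model)
second = cached_recommendations(cache, 42, fake_model)
print(first, second, "model calls:", len(calls))
```

The second lookup is served from the cache, so the model runs once for two requests. A real deployment would pass a Redis client in place of `InMemoryCache` and tune the TTL to how quickly recommendations go stale.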
Technical Challenges
- ONNX operator compatibility
Impact: Some PyTorch ops not supported in ONNX
Resolution: Rewrote unsupported ops or used custom ONNX kernels
- Dynamic batching complexity
Impact: Varying request sizes reduced GPU utilization
Resolution: NVIDIA Triton's dynamic batching with configurable timeouts
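The idea behind dynamic batching is to hold each request briefly so that several can share one GPU call, while capping the wait so latency stays bounded. The sketch below illustrates that trade-off with a standard-library queue; it is a conceptual illustration, not Triton's implementation, and the batch size and delay values are arbitrary.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_delay_s):
    """Gather up to max_batch_size requests, waiting at most max_delay_s."""
    batch = [requests.get()]  # block until the first request arrives
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # timeout hit: run a partial batch rather than wait longer
    return batch


pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

first_batch = collect_batch(pending, max_batch_size=3, max_delay_s=0.01)
second_batch = collect_batch(pending, max_batch_size=3, max_delay_s=0.01)
print(first_batch, second_batch)
```

The first batch fills immediately, while the second is dispatched partially full once the delay expires. A longer delay raises GPU utilization at the cost of added per-request latency.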
Results
- P99 inference latency
Before: 450ms | After: 35ms | Improvement: 92% reduction
- Model throughput (requests/sec)
Before: 200 | After: 5,000 | Improvement: 25x increase
- Inference cost per million predictions
Before: $45 | After: $12 | Improvement: 73% reduction
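The improvement figures follow directly from the before and after values reported here. A short check of the arithmetic:

```python
def percent_reduction(before, after):
    """Relative decrease from before to after, as a percentage."""
    return (before - after) / before * 100


latency_cut = percent_reduction(450, 35)   # P99 latency, ms
throughput_gain = 5000 / 200               # requests/sec
cost_cut = percent_reduction(45, 12)       # USD per million predictions

print(f"latency: {latency_cut:.1f}% lower")
print(f"throughput: {throughput_gain:.0f}x higher")
print(f"cost: {cost_cut:.1f}% lower")
```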
Lessons Learned
- 📘 FP16 quantization cost only 0.1% accuracy for recommendation models
- 📘 ONNX + TensorRT provided 10x speedup vs PyTorch CPU
- 📘 Response caching eliminated 90% of inference work
What We Would Do Differently
- 💡 Implement knowledge distillation for smaller models earlier
- 💡 Use speculative execution for batch prediction
Role Relevance
ML engineers combined model optimization techniques—quantization, GPU acceleration, and caching—to achieve 25x throughput improvement without sacrificing accuracy.
Frequently Asked Questions
- Did quantization affect model accuracy?
Answer: FP16 had 0.1% accuracy loss; INT8 had 2% loss, which was acceptable for recommendations.
- What GPU instances did you use?
Answer: NVIDIA T4 on GCP ($300/month) replaced 10 CPU instances ($1,500/month).