OFFLINEPIXEL
E-commerce / Retail

Improving Model Inference Performance

An e-commerce company reduced ML model inference latency from 450ms to 35ms using ONNX export, FP16 quantization, TensorRT on GPUs, and response caching.

Executive Summary

A large e-commerce platform's recommendation models were too slow: 450ms P99 latency was causing user drop-off. ML engineers converted the models to ONNX, applied FP16 quantization, moved inference to GPUs with TensorRT, and cached responses for popular items, reducing latency to 35ms while maintaining 98% accuracy.

Key Outcomes

  • 450ms → 35ms inference latency (92% reduction)
  • 25% increase in click-through rate
  • Inference cost per million predictions reduced 73% ($45 → $12)

Client Situation

Recommendation models were accurate but slow, causing 15% of users to abandon before seeing personalized results.

Key Challenges

  • 450ms P99 latency exceeding SLAs
  • PyTorch models inefficient on CPU
  • Requests queuing behind single-threaded inference, adding further delay

Existing Architecture

PyTorch models running on CPU instances with single-threaded inference. No batching or caching.

  • CPU inference roughly 10x slower than the same models could run on GPU
  • No model optimization applied
  • Predictions for popular items recomputed on every request

Solution Design

Multi-stage optimization: ONNX export, FP16 quantization, TensorRT optimization, and response caching.

Key Decisions

  • ONNX Runtime with TensorRT backend
  • FP16 quantization (2x speedup, about 0.1% accuracy loss)
  • Redis caching for popular item recommendations
PyTorch, ONNX, TensorRT, NVIDIA Triton, Redis
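
The first two decisions can be sketched with ONNX Runtime's execution providers. This is a minimal illustration, not the team's actual serving code: the function names and the model path are hypothetical, and it assumes the `onnxruntime-gpu` package built with TensorRT support. `trt_fp16_enable` is the TensorRT provider option that requests FP16 engines.

```python
def provider_chain(use_fp16=True):
    """Preferred execution providers, fastest first, with CPU as the last resort."""
    return [
        # trt_fp16_enable asks the TensorRT provider to build FP16 engines.
        ("TensorrtExecutionProvider", {"trt_fp16_enable": use_fp16}),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


def provider_name(entry):
    """Entries are either a plain name or a (name, options) tuple."""
    return entry if isinstance(entry, str) else entry[0]


def load_session(model_path, use_fp16=True):
    """Open an ONNX model, keeping only providers this onnxruntime build offers."""
    # Imported lazily so the helpers above work without onnxruntime installed.
    import onnxruntime as ort

    available = set(ort.get_available_providers())
    providers = [p for p in provider_chain(use_fp16) if provider_name(p) in available]
    return ort.InferenceSession(model_path, providers=providers)
```

A call such as `load_session("recommender.onnx")` (a hypothetical path) would then run on TensorRT where available and fall back to CUDA or CPU otherwise, which keeps the same code usable on a development machine without a GPU.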

Implementation

Optimized models incrementally, A/B testing each change in production.

  1. Phase 1: ONNX Export

    Converted PyTorch models to ONNX format—immediate 2x speedup.

  2. Phase 2: GPU Migration

    Moved inference to NVIDIA T4 GPUs with TensorRT—10x speedup.

  3. Phase 3: Caching

    Added Redis cache for top 100K items—90% cache hit rate.
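
The caching phase follows a cache-aside pattern, which might look roughly like the sketch below. Everything here is an assumption made for illustration: the class names, the `rec:` key prefix, and the 300-second TTL are hypothetical, and `InMemoryStore` is a stand-in that mimics only the `get` and `setex` calls of a Redis client so the example runs without a server.

```python
import json
import time


class InMemoryStore:
    """Stand-in for a Redis client; implements only get() and setex()."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # entry has expired
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._data[key] = (value, time.monotonic() + ttl_seconds)


class CachedRecommender:
    """Cache-aside wrapper: check the store first, run the model only on a miss."""

    def __init__(self, store, predict_fn, ttl_seconds=300):
        self.store = store
        self.predict_fn = predict_fn
        self.ttl_seconds = ttl_seconds  # hypothetical TTL, not from the case study
        self.hits = 0
        self.misses = 0

    def recommend(self, item_id):
        key = f"rec:{item_id}"
        cached = self.store.get(key)
        if cached is not None:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        result = self.predict_fn(item_id)
        self.store.setex(key, self.ttl_seconds, json.dumps(result))
        return result

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because the wrapper only calls `get` and `setex`, a real `redis.Redis` client could be passed in place of `InMemoryStore`. The hit and miss counters give a direct way to measure the cache hit rate in production.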

Technical Challenges

ONNX operator compatibility

Impact: Some PyTorch ops not supported in ONNX

Resolution: Rewrote unsupported ops or used custom ONNX kernels

Dynamic batching complexity

Impact: Varying request sizes reduced GPU utilization

Resolution: NVIDIA Triton's dynamic batching with configurable timeouts
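
The idea behind dynamic batching with a timeout can be illustrated in a few lines of plain Python. This is a conceptual sketch only, not Triton's implementation: Triton handles this inside the server and exposes it through `dynamic_batching` settings in the model configuration (for example `max_queue_delay_microseconds`). The function name and default values below are hypothetical.

```python
import queue
import time


def collect_batch(requests, max_batch_size=8, max_delay_s=0.005):
    """Block for one request, then wait up to max_delay_s for more to join it."""
    batch = [requests.get()]  # wait until at least one request exists
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: send a partial batch rather than wait longer
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived before the deadline
    return batch
```

The timeout is the tuning knob: a longer delay fills batches more completely and raises GPU utilization, while a shorter one caps the latency added to the first request in each batch.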

Results

Metric                                   Before   After   Improvement
P99 inference latency                    450ms    35ms    92% reduction
Model throughput (requests/sec)          200      5,000   25x increase
Inference cost per million predictions   $45      $12     73% reduction
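
The improvement figures follow directly from the before and after values. A short check, with helper names chosen only for this example:

```python
def percent_reduction(before, after):
    """Relative drop from before to after, as a percentage."""
    return (before - after) / before * 100


def speedup(before, after):
    """How many times larger the after value is than the before value."""
    return after / before


latency = percent_reduction(450, 35)   # P99 latency in ms
throughput = speedup(200, 5000)        # requests per second
cost = percent_reduction(45, 12)       # dollars per million predictions

print(f"latency: {latency:.1f}% lower")
print(f"throughput: {throughput:.0f}x higher")
print(f"cost: {cost:.1f}% lower")
```

The unrounded values are 92.2%, 25x, and 73.3%, matching the rounded figures reported above.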

Lessons Learned

  • 📘 FP16 quantization had negligible accuracy loss (about 0.1%) for recommendation models
  • 📘 ONNX + TensorRT provided 10x speedup vs PyTorch CPU
  • 📘 Response caching eliminated 90% of inference work
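
One intuition for why FP16 costs so little accuracy here is that recommendations depend on the ordering of scores, and half-precision rounding errors are small relative to typical score gaps. The sketch below uses Python's built-in half-precision `struct` format to show this. The scores are made up for illustration, and the example only rounds final scores; it does not model error accumulated through a network's layers.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def ranking(scores):
    """Item indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical recommendation scores, chosen only for illustration.
scores_fp32 = [0.9132, 0.4071, 0.7718, 0.1205, 0.6643]
scores_fp16 = [to_fp16(s) for s in scores_fp32]

# Largest rounding error introduced by the FP16 conversion.
max_error = max(abs(a - b) for a, b in zip(scores_fp32, scores_fp16))
# True when both precisions rank the items identically.
same_order = ranking(scores_fp32) == ranking(scores_fp16)
```

For scores between 0 and 1 the rounding error stays below about 0.00025, so the ranking changes only when two items score almost identically. That is why a small measured loss is plausible, and also why accuracy should still be validated on real traffic rather than assumed.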

What We Would Do Differently

  • 💡 Implement knowledge distillation for smaller models earlier
  • 💡 Use speculative execution for batch prediction

Role Relevance

ML engineers combined model optimization techniques (quantization, GPU acceleration, and caching) to achieve a 25x throughput improvement without sacrificing accuracy.

Critical Skills Demonstrated

ONNX/TensorRT optimization, GPU inference optimization, Model quantization, Caching strategies

Frequently Asked Questions

Did quantization affect model accuracy?
FP16 had 0.1% accuracy loss; INT8 had 2% loss—acceptable for recommendations.
What GPU instances did you use?
NVIDIA T4 on GCP ($300/month) replaced 10 CPU instances ($1,500/month).