Executive Summary
A last-mile delivery platform's monolithic system couldn't scale to 1M daily deliveries. Microservices experts built a resilient distributed system with retry policies, circuit breakers, idempotency keys, and saga orchestration—achieving 99.999% uptime across 50 microservices.
Key Outcomes
- ▹ 1M deliveries processed daily
- ▹ 99.999% uptime (5 minutes downtime/year)
- ▹ Zero duplicate deliveries (idempotency)
Client Situation
The platform's monolith crashed during peak hours (5 PM-9 PM), delaying thousands of deliveries and angering customers.
Key Challenges
- ⚠ Monolith crashing at 100K daily deliveries
- ⚠ Duplicate delivery assignments costing $1M annually
- ⚠ No retry mechanism for transient failures
Existing Architecture
A monolithic Rails app backed by a single PostgreSQL instance, with synchronous API calls and no retry logic.
- No fault tolerance—any failure aborts the request
- Duplicate delivery assignments because write operations were not idempotent
- Unable to scale beyond 100K deliveries/day
Solution Design
50 microservices with retry policies, exponential backoff, idempotency keys, and saga orchestration for distributed transactions.
Key Decisions
- ✓ Idempotency keys for all write operations
- ✓ Retry with exponential backoff (max 5 retries)
- ✓ Saga orchestration for end-to-end delivery flow
Implementation
The team built a greenfield resilient system alongside the monolith and migrated traffic to it gradually.
Phase 1: Idempotency
Idempotency keys on all write APIs, which eliminated duplicate assignments.
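As a minimal sketch of the idempotency pattern in this phase: the client attaches a UUID to each write, and the server runs the operation only the first time it sees that key within the TTL. The class name `IdempotencyStore`, the method `run_once`, and the helper `new_idempotency_key` are hypothetical, and an in-memory dictionary stands in for the Redis store the case study describes. It is not the platform's actual implementation.

```python
import time
import uuid
from typing import Any, Callable, Dict, Tuple

SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60  # TTL taken from the case study


def new_idempotency_key() -> str:
    """Client side: generate a fresh UUID to send with a write request."""
    return str(uuid.uuid4())


class IdempotencyStore:
    """Hypothetical in-memory stand-in for a TTL-backed key store.

    Not thread-safe; a shared store such as Redis would need an atomic
    "set if absent" operation instead of the check-then-write used here.
    """

    def __init__(self, ttl_seconds: float = SEVEN_DAYS_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, result of the first execution)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def run_once(self, key: str, operation: Callable[[], Any]) -> Tuple[bool, Any]:
        """Run `operation` only if `key` is unseen or its TTL has expired.

        Returns (executed, result). For a duplicate key, `executed` is False
        and the result recorded by the first call is returned unchanged.
        """
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return False, entry[1]
        result = operation()
        self._entries[key] = (now + self._ttl, result)
        return True, result
```

A caller can treat `executed == False` as a rejected duplicate, or return the stored result so that a client retry receives the same answer as the original request.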
Phase 2: Retry Infrastructure
Retry with exponential backoff + dead letter queue for failed events.
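The retry behaviour in this phase can be sketched as follows, assuming a cap of 5 retries as stated in the design decisions. The names `TransientError` and `process_with_retry`, the delay values, and the use of a plain list as the dead letter queue are illustrative assumptions. The jitter is also an addition of this sketch, not something the case study specifies.

```python
import random
import time
from typing import Any, Callable, List, Optional


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def process_with_retry(event: dict,
                       handler: Callable[[dict], Any],
                       dead_letters: List[dict],
                       max_retries: int = 5,
                       base_delay: float = 0.1,
                       max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[Any]:
    """Call handler(event), retrying transient failures with exponential backoff.

    If the initial attempt and all `max_retries` retries fail, the event is
    appended to `dead_letters` and None is returned.
    """
    for attempt in range(max_retries + 1):
        try:
            return handler(event)
        except TransientError as exc:
            if attempt == max_retries:
                # Retries exhausted: park the event for manual investigation.
                dead_letters.append({"event": event, "error": str(exc)})
                return None
            # The delay ceiling doubles on each retry (capped at max_delay);
            # a random delay below the ceiling avoids synchronized retries.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    return None
```

Only `TransientError` is retried here; any other exception propagates immediately, on the assumption that retrying a permanent failure (for example, a validation error) would only add load.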
Phase 3: Saga Orchestration
Distributed transaction coordinator for delivery lifecycle (assign → pickup → deliver).
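A simplified view of what such a coordinator does: run each step of the delivery lifecycle in order and, if one fails, undo the steps that already completed in reverse order. `SagaStep` and `run_saga` are hypothetical names, and this in-process sketch omits the persistence and messaging a production orchestrator across services would need.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]       # forward operation
    compensate: Callable[[dict], None]   # undo operation for this step


def run_saga(steps: List[SagaStep], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns True if every step succeeded. On failure, records the failing
    step and error message in `ctx` and returns False.
    """
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["failed_step"] = step.name
            ctx["error"] = str(exc)
            # The failed step itself is not compensated, only earlier ones.
            for done in reversed(completed):
                done.compensate(ctx)
            return False
    return True
```

For the assign → pickup → deliver flow, a failure at pickup would call only the compensation for assign (for example, releasing the driver), which is what lets the flow stay consistent without a two-phase commit.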
Technical Challenges
- Idempotency key storage and cleanup
Impact: Storing 1M keys daily caused Redis memory pressure
Resolution: TTL-based cleanup (7 days) + Redis Cluster
- Saga compensating transactions
Impact: A failed delivery had to trigger automatic driver reassignment
Resolution: Compensation handler with retry + escalation to ops
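For the first challenge above, a back-of-the-envelope calculation shows why a 7-day TTL still leaves a sizeable working set. The key volume and TTL come from the case study; the per-entry size of 200 bytes is purely an assumption for illustration, and real Redis overhead depends on key length, value size, and encoding.

```python
KEYS_PER_DAY = 1_000_000   # from the case study
TTL_DAYS = 7               # from the case study
BYTES_PER_ENTRY = 200      # ASSUMPTION: key + small value + store overhead


def steady_state_keys(keys_per_day: int, ttl_days: int) -> int:
    """Keys alive at any time once the store has run longer than the TTL."""
    return keys_per_day * ttl_days


def estimated_memory_gb(keys: int, bytes_per_entry: int) -> float:
    """Rough memory footprint in decimal gigabytes."""
    return keys * bytes_per_entry / 1_000_000_000


live_keys = steady_state_keys(KEYS_PER_DAY, TTL_DAYS)           # 7,000,000 keys
memory_gb = estimated_memory_gb(live_keys, BYTES_PER_ENTRY)     # about 1.4 GB
```

Under these assumptions the store holds about 7 million live keys at steady state. Without a TTL the count would grow by 1M per day indefinitely.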
Results
- Daily deliveries processed
Before: 100,000
After: 1,000,000
Improvement: 10x increase
- System uptime
Before: 99.9% (about 8.76 hours of downtime/year)
After: 99.999% (about 5 minutes of downtime/year)
Improvement: 99% fewer incidents
- Duplicate delivery rate
Before: 0.1% ($1M/year loss)
After: 0%
Improvement: $1M saved annually
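The downtime figures quoted for each uptime level follow directly from the percentages. A short check, assuming a 365-day year (the function names are illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Allowed downtime per year, in minutes."""
    return downtime_hours_per_year(uptime_percent) * 60


before_hours = downtime_hours_per_year(99.9)        # about 8.76 hours
after_minutes = downtime_minutes_per_year(99.999)   # about 5.26 minutes
# Fractional reduction in downtime between the two levels: about 0.99
reduction = 1 - downtime_hours_per_year(99.999) / before_hours
```

Moving from three nines to five nines therefore cuts the yearly downtime budget by roughly 99%.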
Lessons Learned
- 📘 Idempotency keys eliminated 100% of duplicate operations
- 📘 Retry with backoff handled 95% of transient failures without user impact
- 📘 Saga pattern enabled distributed transactions without two-phase commit
What We Would Do Differently
- 💡 Use Kafka exactly-once semantics for critical events
- 💡 Implement chaos engineering earlier in development
Role Relevance
Microservices experts built the resilient patterns (idempotency, retries, sagas) enabling 1M daily deliveries with 99.999% uptime.
Frequently Asked Questions
- How do you generate idempotency keys?
- The client generates a UUID for each request; the server stores it with a 7-day TTL and rejects any duplicate that arrives within that window.
- What happens after 5 retries?
- The event goes to a dead letter queue, where the ops team investigates the failure and replays the event.