OFFLINEPIXEL
Logistics / Supply Chain

Building Resilient Distributed Systems at Scale

A logistics company built a resilient distributed system handling 1M deliveries daily with 99.999% uptime using retries, timeouts, and idempotency.

Executive Summary

A last-mile delivery platform's monolithic system couldn't scale to 1M daily deliveries. Microservices experts built a resilient distributed system with retry policies, circuit breakers, idempotency keys, and saga orchestration, achieving 99.999% uptime across 50 microservices.

Key Outcomes

  • 1M deliveries processed daily
  • 99.999% uptime (about 5 minutes of downtime per year)
  • Zero duplicate deliveries (idempotency)

Client Situation

The platform's monolith crashed during peak hours (5 PM-9 PM), delaying thousands of deliveries and angering customers.

Key Challenges

  • Monolith crashing at 100K daily deliveries
  • Duplicate delivery assignments costing $1M annually
  • No retry mechanism for transient failures

Existing Architecture

Monolithic Rails app, single PostgreSQL, synchronous API calls, no retry logic.

  • No fault tolerance: any failure aborted the request
  • Duplicate delivery assignments, because write operations were not idempotent
  • Unable to scale beyond 100K deliveries/day

Solution Design

50 microservices with retry policies, exponential backoff, idempotency keys, and saga orchestration for distributed transactions.

Key Decisions

  • Idempotency keys for all write operations
  • Retry with exponential backoff (max 5 retries)
  • Saga orchestration for end-to-end delivery flow
Stack: Go, gRPC, Kafka, Redis, PostgreSQL, Saga, Kubernetes
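
The retry decision above (exponential backoff, at most 5 retries) could look roughly like the following. This is a minimal sketch rather than the team's implementation: only the limit of 5 retries comes from the case study, while `RetryPolicy`, `ErrTransient`, the delay values, and the use of full jitter are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// ErrTransient marks failures worth retrying (hypothetical sentinel error).
var ErrTransient = errors.New("transient failure")

// RetryPolicy holds illustrative settings; only MaxRetries = 5 is from the case study.
type RetryPolicy struct {
	MaxRetries int
	BaseDelay  time.Duration
	MaxDelay   time.Duration
}

// Do runs op once and retries transient errors up to MaxRetries times.
// The wait before retry n is a random duration in [0, min(BaseDelay*2^n, MaxDelay)].
// It returns the number of attempts made and the final error (nil on success).
func (p RetryPolicy) Do(ctx context.Context, op func(context.Context) error) (int, error) {
	var err error
	for attempt := 0; attempt <= p.MaxRetries; attempt++ {
		if err = op(ctx); err == nil {
			return attempt + 1, nil
		}
		if !errors.Is(err, ErrTransient) {
			return attempt + 1, err // permanent errors are not retried
		}
		if attempt == p.MaxRetries {
			break // retry budget exhausted
		}
		delay := p.BaseDelay << uint(attempt)
		if delay <= 0 || delay > p.MaxDelay {
			delay = p.MaxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(delay) + 1)) // full jitter
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return attempt + 1, ctx.Err()
		}
	}
	return p.MaxRetries + 1, fmt.Errorf("gave up after %d retries: %w", p.MaxRetries, err)
}

// demoRetry simulates a call that fails `failures` times before succeeding.
func demoRetry(failures int) (int, error) {
	policy := RetryPolicy{MaxRetries: 5, BaseDelay: time.Millisecond, MaxDelay: 20 * time.Millisecond}
	calls := 0
	return policy.Do(context.Background(), func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return ErrTransient
		}
		return nil
	})
}

func main() {
	attempts, err := demoRetry(2)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Jitter is included so that many clients failing at the same moment do not all retry in lockstep; a fixed doubling delay would also satisfy the stated decision.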

Implementation

The team built the resilient system greenfield, alongside the monolith, and migrated traffic gradually.

  1. Phase 1: Idempotency

    Idempotency keys for all APIs, which eliminated duplicate assignments.

  2. Phase 2: Retry Infrastructure

    Retry with exponential backoff, plus a dead letter queue for failed events.

  3. Phase 3: Saga Orchestration

    Distributed transaction coordinator for the delivery lifecycle (assign → pickup → deliver).
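
To make the Phase 3 flow concrete, here is a small sketch of an orchestrated saga for assign → pickup → deliver. It assumes an in-process orchestrator with hypothetical `Step` and `Run` names; the real coordinator presumably drives remote services over gRPC or Kafka and persists its state, which this sketch leaves out.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a forward action with the compensation that undoes it (hypothetical type).
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// Run executes steps in order. If a step fails, the steps that already
// completed are compensated in reverse order and the failure is returned.
func Run(steps []Step) error {
	var done []Step
	for _, s := range steps {
		if err := s.Action(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				if cerr := done[i].Compensate(); cerr != nil {
					return fmt.Errorf("step %q failed (%v); compensating %q failed: %w",
						s.Name, err, done[i].Name, cerr)
				}
			}
			return fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s)
	}
	return nil
}

// demoSaga runs the delivery lifecycle with a failing final step and
// returns the ordered log of what was executed.
func demoSaga() ([]string, error) {
	var log []string
	record := func(msg string, err error) func() error {
		return func() error {
			log = append(log, msg)
			return err
		}
	}
	steps := []Step{
		{Name: "assign", Action: record("assign", nil), Compensate: record("undo assign", nil)},
		{Name: "pickup", Action: record("pickup", nil), Compensate: record("undo pickup", nil)},
		{Name: "deliver", Action: record("deliver", errors.New("customer unreachable")), Compensate: record("undo deliver", nil)},
	}
	err := Run(steps)
	return log, err
}

func main() {
	log, err := demoSaga()
	fmt.Println(log)
	fmt.Println("error:", err)
}
```

Because each step commits locally and is undone by an explicit compensation, no cross-service lock or two-phase commit is needed; the trade-off is that intermediate states are briefly visible to other services.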

Technical Challenges

Idempotency key storage and cleanup

Impact: Storing 1M keys daily caused Redis memory pressure

Resolution: TTL-based cleanup (7 days) + Redis Cluster
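
The TTL-based approach can be illustrated with an in-memory stand-in for the Redis store. This is a sketch under stated assumptions: `KeyStore`, `Claim`, and `Sweep` are hypothetical names, and only the 7-day TTL comes from the case study. In Redis itself the same check-and-store is commonly done atomically with `SET key value NX EX <seconds>`, and expired keys are evicted by the server rather than by an explicit sweep.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// KeyStore is an in-memory stand-in for a Redis-backed idempotency store.
type KeyStore struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock, so expiry can be tested
	expires map[string]time.Time
}

func NewKeyStore(ttl time.Duration, now func() time.Time) *KeyStore {
	return &KeyStore{ttl: ttl, now: now, expires: make(map[string]time.Time)}
}

// Claim records the key and reports true the first time it is seen within
// the TTL window. A repeat inside the window returns false (a duplicate).
func (s *KeyStore) Claim(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	t := s.now()
	if exp, ok := s.expires[key]; ok && t.Before(exp) {
		return false
	}
	s.expires[key] = t.Add(s.ttl)
	return true
}

// Sweep deletes expired keys and returns how many were removed.
func (s *KeyStore) Sweep() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	t := s.now()
	removed := 0
	for k, exp := range s.expires {
		if !t.Before(exp) {
			delete(s.expires, k)
			removed++
		}
	}
	return removed
}

// demoKeyStore claims the same key twice, then again after the TTL has passed.
// "req-123" is a placeholder for a client-generated UUID.
func demoKeyStore() (first, duplicate, afterExpiry bool) {
	clock := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	store := NewKeyStore(7*24*time.Hour, func() time.Time { return clock })
	first = store.Claim("req-123")
	duplicate = store.Claim("req-123")
	clock = clock.Add(8 * 24 * time.Hour) // move past the 7-day TTL
	afterExpiry = store.Claim("req-123")
	return first, duplicate, afterExpiry
}

func main() {
	first, duplicate, afterExpiry := demoKeyStore()
	fmt.Println(first, duplicate, afterExpiry)
}
```

With a 7-day TTL and 1M keys per day, the store holds on the order of 7M live keys at steady state, which is the memory pressure the resolution addresses.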

Saga compensating transactions

Impact: Failed delivery needed to reassign driver automatically

Resolution: Compensation handler with retry + escalation to ops
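
A compensation handler of this shape might be sketched as follows. The names (`CompensateWithEscalation`, `Escalator`), the limit of 3 attempts, and the delivery ID are all hypothetical; the source only states that the handler retries and then escalates to ops. Backoff between attempts is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Escalator is a hypothetical hook that hands a stuck delivery to the ops team.
type Escalator func(deliveryID string, cause error)

// CompensateWithEscalation tries to reassign a driver up to maxAttempts times.
// If every attempt fails it calls escalate with the last error and returns true.
// It assumes maxAttempts >= 1.
func CompensateWithEscalation(deliveryID string, maxAttempts int,
	reassign func(deliveryID string) error, escalate Escalator) (escalated bool) {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = reassign(deliveryID); err == nil {
			return false // compensation succeeded, no human needed
		}
	}
	escalate(deliveryID, err)
	return true
}

// demoCompensation simulates a reassignment that fails a given number of
// times before succeeding, with a budget of 3 attempts.
func demoCompensation(failuresBeforeSuccess int) (attempts int, escalated bool) {
	reassign := func(id string) error {
		attempts++
		if attempts <= failuresBeforeSuccess {
			return errors.New("no driver available")
		}
		return nil
	}
	notifyOps := func(id string, cause error) {
		fmt.Printf("escalating %s to ops: %v\n", id, cause)
	}
	escalated = CompensateWithEscalation("delivery-42", 3, reassign, notifyOps)
	return attempts, escalated
}

func main() {
	fmt.Println(demoCompensation(1)) // recovers automatically
	fmt.Println(demoCompensation(5)) // exhausts the budget and escalates
}
```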

Results

Daily deliveries processed
  Before: 100,000
  After: 1,000,000
  Improvement: 10x increase

System uptime
  Before: 99.9% (about 8.8 hours of downtime/year)
  After: 99.999% (about 5 minutes of downtime/year)
  Improvement: 99% less downtime

Duplicate delivery rate
  Before: 0.1% ($1M/year loss)
  After: 0%
  Improvement: $1M saved annually
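
The downtime figures quoted for each uptime level follow from simple arithmetic, shown here as a short check. `DowntimePerYear` is a hypothetical helper and assumes a 365-day year.

```go
package main

import (
	"fmt"
	"time"
)

// DowntimePerYear converts an availability percentage into the downtime
// it permits over a 365-day year.
func DowntimePerYear(availabilityPercent float64) time.Duration {
	year := 365 * 24 * time.Hour
	return time.Duration(float64(year) * (100 - availabilityPercent) / 100)
}

func main() {
	before := DowntimePerYear(99.9)
	after := DowntimePerYear(99.999)
	fmt.Printf("99.9%%   -> %.2f hours/year\n", before.Hours())
	fmt.Printf("99.999%% -> %.2f minutes/year\n", after.Minutes())
	fmt.Printf("reduction in downtime: %.0f%%\n", 100*(1-float64(after)/float64(before)))
}
```

Going from 99.9% to 99.999% cuts the unavailable fraction from 0.1% to 0.001%, roughly 8.76 hours down to 5.26 minutes per year, which is a 99% reduction in downtime.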

Lessons Learned

  • 📘 Idempotency keys eliminated 100% of duplicate operations
  • 📘 Retry with backoff handled 95% of transient failures without user impact
  • 📘 Saga pattern enabled distributed transactions without two-phase commit

What We Would Do Differently

  • 💡 Use Kafka exactly-once semantics for critical events
  • 💡 Implement chaos engineering earlier in development

Role Relevance

Microservices experts built the resilient patterns (idempotency, retries, sagas) enabling 1M daily deliveries with 99.999% uptime.

Critical Skills Demonstrated

Idempotent API design, Retry/backoff strategies, Saga orchestration, Distributed transactions

Frequently Asked Questions

How do you generate idempotency keys?
The client generates a UUID; the server stores it with a 7-day TTL and rejects any duplicate that arrives within that window.
What happens after 5 retries?
The event goes to a dead letter queue, where the ops team investigates and replays it.
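
The dead letter path described in the last answer can be sketched as below. This is an in-memory illustration, not the production Kafka setup: `Consumer`, `DeadLetter`, and `Replay` are hypothetical names, the queue is a plain slice, and backoff between attempts is left out to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a simplified message; the fields are illustrative.
type Event struct {
	ID      string
	Payload string
}

// DeadLetter records an event that exhausted its retries, with context for ops.
type DeadLetter struct {
	Event    Event
	Attempts int
	LastErr  string
}

// Consumer processes events and parks exhausted ones in an in-memory DLQ.
type Consumer struct {
	MaxRetries int
	Handle     func(Event) error
	DLQ        []DeadLetter
}

// Process makes one initial attempt plus up to MaxRetries retries.
// It returns true on success; on exhaustion the event is appended to the DLQ.
func (c *Consumer) Process(e Event) bool {
	var err error
	attempts := 0
	for attempts <= c.MaxRetries {
		attempts++
		if err = c.Handle(e); err == nil {
			return true
		}
	}
	c.DLQ = append(c.DLQ, DeadLetter{Event: e, Attempts: attempts, LastErr: fmt.Sprint(err)})
	return false
}

// Replay re-processes every parked event and returns how many succeeded.
// Events that fail again are parked once more by Process.
func (c *Consumer) Replay() int {
	pending := c.DLQ
	c.DLQ = nil
	replayed := 0
	for _, d := range pending {
		if c.Process(d.Event) {
			replayed++
		}
	}
	return replayed
}

// demoDLQ fails an event until the downstream is "fixed", then replays it.
func demoDLQ() (attempts, parked, replayed, remaining int) {
	broken := true
	c := &Consumer{MaxRetries: 5, Handle: func(Event) error {
		if broken {
			return errors.New("downstream unavailable")
		}
		return nil
	}}
	c.Process(Event{ID: "evt-1", Payload: "delivery assigned"})
	parked = len(c.DLQ)
	attempts = c.DLQ[0].Attempts
	broken = false // ops fixes the underlying problem
	replayed = c.Replay()
	remaining = len(c.DLQ)
	return attempts, parked, replayed, remaining
}

func main() {
	fmt.Println(demoDLQ())
}
```

Replaying is only safe because handlers are idempotent: an event that partially succeeded before being parked can be processed again without creating a duplicate assignment.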