Executive Summary
A multi-strategy fund with $50B AUM calculated portfolio analytics only once a day, so risk decisions were based on stale data. A real-time streaming architecture built on Rust and ClickHouse reduced analytics latency from 24 hours to 500 ms, enabling intraday risk management.
Key Outcomes
- ▹ 24 hours → 500ms analytics latency
- ▹ 10M position updates/sec processing
- ▹ 3 intraday risk events prevented ($20M saved)
Client Situation
The fund's risk team received P&L and exposure reports 12 hours after market close, too late for intraday position adjustments.
Key Challenges
- ⚠ Batch processing took 4+ hours for Greeks and VAR
- ⚠ Unable to monitor real-time exposure across 50k positions
- ⚠ Risk breaches detected only after market close
Existing Architecture
An end-of-day batch job in Python read from SQL Server and calculated Greeks and VAR using NumPy.
- Batch window impossible to reduce below 4 hours
- No support for intraday position changes
- Python single-threaded CPU bottleneck
Solution Design
A streaming architecture with Kafka for position updates, Rust for risk calculation, and ClickHouse for real-time storage.
Key Decisions
- ✓ Use Rust for parallel risk calculation across 50k positions
- ✓ ClickHouse with materialized views for pre-aggregated analytics
- ✓ WebSocket push to risk dashboard
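The pre-aggregation decision above can be illustrated outside ClickHouse. The following is a minimal in-memory sketch in Python, not ClickHouse syntax and not the fund's implementation: totals are maintained as each update arrives, so a dashboard read never scans all positions. The class name `ExposureRollup`, the grouping by desk, and the field names are all hypothetical.

```python
from collections import defaultdict


class ExposureRollup:
    """Keeps per-desk exposure totals current as position updates arrive.

    Hypothetical illustration of pre-aggregation; names are assumptions.
    """

    def __init__(self):
        self._by_position = {}               # position_id -> (desk, exposure)
        self._by_desk = defaultdict(float)   # desk -> summed exposure

    def apply(self, position_id, desk, exposure):
        # Remove the position's previous contribution, if any, then add the new one.
        previous = self._by_position.get(position_id)
        if previous is not None:
            old_desk, old_exposure = previous
            self._by_desk[old_desk] -= old_exposure
        self._by_position[position_id] = (desk, exposure)
        self._by_desk[desk] += exposure

    def desk_exposure(self, desk):
        # Reads hit the pre-aggregated total instead of scanning all positions.
        return self._by_desk.get(desk, 0.0)
```

The write path does slightly more work per update so that the read path is a single lookup, which is the same trade a materialized view makes.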
Implementation
The migration was phased, starting with P&L, then Greeks, and finally VAR, with 1 month of shadow mode.
Phase 1: Real-time P&L
Built streaming P&L calculator matching batch results within 0.01%.
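A reconciliation check of this kind might look like the sketch below. It assumes the 0.01% figure is a relative tolerance measured against the batch value and that results are keyed by some position or book identifier; both are assumptions, and the function names are hypothetical.

```python
def within_tolerance(streaming_pnl, batch_pnl, rel_tol=1e-4):
    """Return True if the streaming value is within rel_tol (0.01% = 1e-4) of batch."""
    if batch_pnl == 0:
        # A relative check is undefined at zero, so require an exact match here.
        return streaming_pnl == 0
    return abs(streaming_pnl - batch_pnl) <= rel_tol * abs(batch_pnl)


def reconcile(streaming, batch, rel_tol=1e-4):
    """Return the keys whose streaming P&L is missing or outside tolerance."""
    return sorted(
        key for key, batch_value in batch.items()
        if key not in streaming
        or not within_tolerance(streaming[key], batch_value, rel_tol)
    )
```

For example, `reconcile({"a": 100.0, "b": 205.0}, {"a": 100.0, "b": 200.0, "c": 1.0})` flags `"b"` (off by 2.5%) and `"c"` (missing from the streaming side).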
Phase 2: Greeks Engine
Implemented delta/gamma/vega calculations using the Rust ndarray crate.
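The case study does not say which pricing model the Greeks engine uses. As an illustration only, the sketch below computes delta, gamma, and vega for a European call under Black-Scholes with no dividends, written in Python for readability rather than in Rust. The model choice and the function name `call_greeks` are assumptions.

```python
import math


def _norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_greeks(spot, strike, rate, vol, expiry):
    """Black-Scholes delta, gamma, vega for a European call (no dividends).

    Assumed model for illustration; expiry is in years, vol is annualized.
    """
    sqrt_t = math.sqrt(expiry)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * expiry) / (vol * sqrt_t)
    delta = _norm_cdf(d1)
    gamma = _norm_pdf(d1) / (spot * vol * sqrt_t)
    vega = spot * _norm_pdf(d1) * sqrt_t   # per 1.00 change in vol, not per 1%
    return delta, gamma, vega
```

Each position's Greeks depend only on that position's inputs, which is what makes the calculation straightforward to spread across cores.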
Phase 3: VAR Dashboard
Built risk dashboard with historical simulation VAR at 5-minute intervals.
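Historical simulation VAR reduces to taking a quantile of portfolio P&L across past scenarios. The sketch below shows one common convention in Python; the confidence level, the scenario set, and the quantile convention used by the fund are not stated in the source, so the defaults here are assumptions and `historical_var` is a hypothetical name.

```python
import math


def historical_var(pnl_scenarios, confidence=0.99):
    """Loss level not exceeded in `confidence` of the historical scenarios.

    pnl_scenarios: portfolio P&L under each historical scenario (gains positive).
    Returns a loss as a positive number.
    """
    if not pnl_scenarios:
        raise ValueError("need at least one scenario")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    losses = sorted(-pnl for pnl in pnl_scenarios)   # positive = loss
    # Smallest loss L such that at least `confidence` of scenarios lose <= L.
    index = math.ceil(confidence * len(losses)) - 1
    return losses[index]
```

Rerunning this on a schedule only requires revaluing the current positions under the same stored scenarios, so the scenario history itself does not need to be rebuilt each time.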
Technical Challenges
- Memory pressure for 50k positions
Impact: Full covariance matrix too large for single server
Resolution: Implemented factor model reducing dimension from 50k to 200
- Backpressure during volatility
Impact: Kafka lag reaching hours during market stress
Resolution: Added priority queuing for high-touch positions
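The priority queuing resolution can be sketched with a two-tier heap, where updates for high-touch positions are always served before the rest and arrival order is preserved within each tier. How the fund classifies a position as high-touch is not described, so the `high_touch` flag and the class name below are hypothetical.

```python
import heapq
import itertools

HIGH, NORMAL = 0, 1   # lower number is served first


class PriorityUpdateQueue:
    """Two-tier queue for position updates (illustrative sketch only)."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()   # preserves arrival order within a tier

    def push(self, update, high_touch=False):
        priority = HIGH if high_touch else NORMAL
        heapq.heappush(self._heap, (priority, next(self._sequence), update))

    def pop(self):
        # Raises IndexError when empty, like heapq.heappop.
        _, _, update = heapq.heappop(self._heap)
        return update

    def __len__(self):
        return len(self._heap)
```

The sequence counter also keeps the heap from ever comparing two update payloads directly, so updates do not need to be orderable.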
Results
- Risk analytics latency
Before: 24 hours
After: 500 ms
Improvement: 99.9994% reduction
- Position updates processed/sec
Before: 500 (batch)
After: 10M
Improvement: 20,000x increase
- Risk breaches caught intraday
Before: 0
After: 3 (in first 6 months)
Improvement: Prevented $20M losses
Lessons Learned
- 📘 Rust's memory safety caught 12 concurrency bugs that would have corrupted risk state
- 📘 ClickHouse materialized views reduced query latency from 5s to 50ms
- 📘 Factor model essential for memory scalability
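To see why a factor model helps with memory, consider the usual decomposition of the covariance matrix into factor loadings B, a small factor covariance F, and a diagonal of specific variances D. The source only states that the dimension was reduced from 50k to 200, so this particular structure is an assumption, and the function names are hypothetical. The sketch computes portfolio variance without ever forming the full matrix and counts the entries each representation stores.

```python
def factor_portfolio_variance(weights, loadings, factor_cov, specific_var):
    """w' (B F B' + D) w computed without forming the n x n matrix.

    weights: n values; loadings: n rows of k values (B);
    factor_cov: k x k (F); specific_var: n values (diagonal of D).
    """
    n_factors = len(factor_cov)
    # Portfolio exposure to each factor: B' w (length k).
    exposure = [
        sum(w * row[j] for w, row in zip(weights, loadings))
        for j in range(n_factors)
    ]
    factor_part = sum(
        exposure[i] * factor_cov[i][j] * exposure[j]
        for i in range(n_factors)
        for j in range(n_factors)
    )
    specific_part = sum(w * w * d for w, d in zip(weights, specific_var))
    return factor_part + specific_part


def stored_entries(n_positions, n_factors):
    """Entry counts: (full covariance, factor representation B + F + D)."""
    full = n_positions * n_positions
    factored = n_positions * n_factors + n_factors * n_factors + n_positions
    return full, factored
```

With 50,000 positions and 200 factors, `stored_entries` gives 2.5 billion entries for the full matrix against roughly 10.1 million for the factored form.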
What We Would Do Differently
- 💡 Implement checkpoints for state recovery earlier
- 💡 Use DataFusion as an in-process query engine
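The checkpointing item above could take many forms; one simple version persists the last processed stream offset together with the risk state, writing to a temporary file and renaming it so a crash never leaves a half-written checkpoint. This is a hedged Python sketch under those assumptions, not the fund's design, and the file layout and function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, offset, state):
    """Atomically write the stream offset and risk state to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump({"offset": offset, "state": state}, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)   # atomic rename on the same filesystem
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return (offset, state), or (None, {}) when no checkpoint exists."""
    try:
        with open(path) as handle:
            data = json.load(handle)
    except FileNotFoundError:
        return None, {}
    return data["offset"], data["state"]
```

On restart, a consumer would load the checkpoint and resume reading from the stored offset, replaying only the updates that arrived after it.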
Role Relevance
Quant engineers built the high-performance risk engine, balancing numerical accuracy with real-time constraints at 10M updates/sec.
Frequently Asked Questions
- How accurate is real-time VAR compared to end-of-day?
99.9% correlation with daily VAR using identical parameters and 10-minute windows.
- What's the hardware footprint?
6 servers with 256GB RAM each, down from 20 servers in batch architecture.