Executive Summary
Quant researchers at a systematic fund were bottlenecked by manual experiment execution and limited compute. A distributed research platform with Ray and experiment tracking scaled throughput from 10 to 10,000 experiments daily, reducing alpha discovery time from weeks to hours.
Key Outcomes
- ▹ 10 → 10,000 experiments/day (1,000x increase)
- ▹ Alpha discovery time from 2 weeks to 4 hours
- ▹ Researcher satisfaction score 3.2 → 4.8/5
Client Situation
Fifteen quant researchers shared a single JupyterHub server. Each experiment took 2-4 hours, forcing prioritization and wasted time.
Key Challenges
- ⚠ GPU/CPU contention causing 4-hour wait times
- ⚠ No experiment tracking—duplicate work common
- ⚠ Results stored inconsistently across researchers
Existing Architecture
Jupyter notebooks on shared EC2 instances. Manual result logging to Excel. No version control for experiments.
- Sequential execution of experiments
- No reproducibility across research team
- Wasted compute on duplicate experiments
Solution Design
Distributed research platform with Ray for compute, MLflow for tracking, and standardized experiment templates.
Key Decisions
- ✓ Use Ray for distributed hyperparameter search
- ✓ MLflow for experiment tracking and model registry
- ✓ Preemptible EC2 spot instances for 70% cost reduction
Implementation
Pilot with 2 researchers, then full rollout to 15 researchers over 3 months with training and onboarding.
Phase 1: Phase 1: Infrastructure Setup
Deployed Ray cluster with autoscaling (10-200 nodes) and MLflow tracking server.
Phase 2: Phase 2: Experiment Templates
Standardized factor research, backtesting, and ML templates reducing boilerplate.
Phase 3: Phase 3: Full Adoption
All researchers migrated, with daily 10k experiments running in parallel.
Technical Challenges
- Ray task serialization overhead
Impact: 2-second per-task latency limiting throughput
Resolution: Batched experiment parameters (100 per task) reducing overhead by 99%
- MLflow scalability at 10k experiments/day
Impact: PostgreSQL backend saturated, query latency 30 seconds
Resolution: Added read replicas and partitioned metadata storage
Results
- Experiments per day
- Before10After10,000Improvement1,000x increase
- Alpha discovery time
- Before2 weeksAfter4 hoursImprovement98% reduction
- Compute cost per experiment
- Before$8.50After$0.35Improvement96% reduction
Lessons Learned
- 📘 Researcher adoption required standardized templates and documentation
- 📘 Spot instances saved 70% cost with 5% interruption rate (acceptable for research)
- 📘 Experiment tracking enabled post-hoc analysis of 100k+ past runs
What We Would Do Differently
- 💡 Implement automated experiment prioritization based on potential impact
- 💡 Add model performance monitoring for production-bound experiments
Role Relevance
Quant researchers with engineering mindset built the platform themselves, understanding exactly what features would accelerate their discovery process.
Critical Skills Demonstrated
Related Roles
Frequently Asked Questions
- How many experiments actually made it to production?
- ~5% of experiments showed alpha, 1% deployed—from 500 promising signals weekly.
- What's the cost of running 10k experiments daily?
- $3,500 per day ($0.35/experiment × 10k), generating 3-5 production strategies yearly worth $100M+.