Executive Summary
A systematic fund's ML models performed well in backtests but lost about half of their performance out-of-sample because of overfitting (backtest Sharpe 2.5 versus live Sharpe 1.2). Nested cross-validation, L1 regularization, and purged time series splits cut that backtest-to-live decay from roughly 50% to 8%, saving $15M in potential losses.
Key Outcomes
- ▹ Backtest-to-live performance decay reduced from roughly 50% to 8%
- ▹ Model feature count reduced 150 → 25 (83% reduction)
- ▹ $15M saved in avoided strategy failures
Client Situation
The fund's ML team built complex models with 150+ features that looked great in-sample but failed in live trading, a classic symptom of overfitting.
Key Challenges
- ⚠ 50% performance decay in live trading vs backtest
- ⚠ Feature engineering causing look-ahead bias
- ⚠ No rigorous out-of-sample validation framework
Existing Architecture
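The original pipeline used a random train/test split (which mixes past and future observations in time series data), with no cross-validation, manual feature selection, and no regularization.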
- In-sample Sharpe 2.5 → live Sharpe 1.2 (52% decay)
- Model retrained rarely (quarterly)
- No testing for feature stability
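The 52% decay figure follows directly from the two Sharpe ratios quoted above. A minimal sketch of that arithmetic, assuming decay is defined as the relative drop from in-sample to live Sharpe (the function name `sharpe_decay` is illustrative, not from the project):

```python
def sharpe_decay(in_sample: float, live: float) -> float:
    """Fractional drop from in-sample Sharpe to live Sharpe."""
    if in_sample <= 0:
        raise ValueError("in-sample Sharpe must be positive")
    return (in_sample - live) / in_sample


# Figures quoted in this case study: in-sample 2.5, live 1.2.
decay = sharpe_decay(2.5, 1.2)
print(f"{decay:.0%}")  # 52%
```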
Solution Design
Purged time series cross-validation, feature selection with L1 regularization, and walk-forward testing.
Key Decisions
- ✓ Nested cross-validation (5x5) for hyperparameter tuning
- ✓ Purged splits to prevent future data leakage
- ✓ Regularization (L1) reducing feature count 83%
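The nested cross-validation decision can be illustrated with a small, dependency-free sketch. It assumes a toy one-feature ridge model and plain contiguous folds without purging; the function names, the data, and the penalty grid are hypothetical. The point is only the structure: the inner loop chooses the hyperparameter using outer-training data alone, so the outer score is never used for tuning.

```python
from statistics import mean


def contiguous_folds(n, k):
    """Split range(n) into k contiguous, near-equal index blocks."""
    bounds = [i * n // k for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def fit_slope(xs, ys, lam):
    """Closed-form slope of y ~ b*x with an L2 penalty lam (toy model)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(b, xs, ys):
    return mean((y - b * x) ** 2 for x, y in zip(xs, ys))


def take(seq, idx):
    return [seq[i] for i in idx]


def cv_score(xs, ys, lam, k):
    """Mean validation MSE over k folds for one hyperparameter value."""
    scores = []
    for val in contiguous_folds(len(xs), k):
        held = set(val)
        tr = [i for i in range(len(xs)) if i not in held]
        b = fit_slope(take(xs, tr), take(ys, tr), lam)
        scores.append(mse(b, take(xs, val), take(ys, val)))
    return mean(scores)


def nested_cv(xs, ys, grid, outer_k=5, inner_k=5):
    """Outer loop estimates performance; inner loop tunes lam on outer-train only."""
    outer_scores, chosen = [], []
    for test in contiguous_folds(len(xs), outer_k):
        held = set(test)
        tr = [i for i in range(len(xs)) if i not in held]
        x_tr, y_tr = take(xs, tr), take(ys, tr)
        best = min(grid, key=lambda lam: cv_score(x_tr, y_tr, lam, inner_k))
        b = fit_slope(x_tr, y_tr, best)
        outer_scores.append(mse(b, take(xs, test), take(ys, test)))
        chosen.append(best)
    return outer_scores, chosen


# Deterministic toy data: y is roughly 2*x plus a small repeating disturbance.
xs = [i / 10 for i in range(1, 101)]
ys = [2.0 * x + ((i * 37) % 11 - 5) / 10 for i, x in enumerate(xs)]
scores, chosen = nested_cv(xs, ys, grid=[0.0, 1.0, 10.0, 100.0])
print(len(scores), chosen)
```

On real return series, the plain folds used here would be replaced by purged splits so that neither loop trains on data adjacent to its validation block.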
Implementation
Validated on historical data first, then paper traded for 3 months before live deployment.
Phase 1: Validation Framework
Built purged time series CV (200 splits, 6 years of data).
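A purged split generator of this kind might look like the sketch below. It is an illustration under assumptions, not the fund's implementation: samples are taken to be equally spaced periods indexed 0..n-1, the gap is applied on both sides of each validation block, and the sample count of 1512 (6 years of 252 daily bars) is a guess at the data frequency. The function name `purged_splits` is hypothetical.

```python
def purged_splits(n_samples, n_splits, gap):
    """Yield (train_idx, val_idx) pairs for contiguous validation blocks.

    Training indices within `gap` periods before or after the validation
    block are dropped, so no training sample is adjacent to validation data.
    """
    for k in range(n_splits):
        start = k * n_samples // n_splits
        stop = (k + 1) * n_samples // n_splits
        val = list(range(start, stop))
        train = [i for i in range(n_samples)
                 if i < start - gap or i >= stop + gap]
        yield train, val


# 200 splits and a 20-period gap as quoted in this case study;
# 1512 samples assumes daily bars over 6 years.
splits = list(purged_splits(n_samples=1512, n_splits=200, gap=20))
train, val = splits[100]
print(len(splits), val[0], val[-1], len(train))
```

Dropping samples on both sides matters when training data after the validation block is also used, because features built from lagged windows can otherwise carry validation-period information into training.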
Phase 2: Feature Reduction
L1 regularization reduced 150 features to 25, improved stability.
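To show why an L1 penalty removes features outright rather than merely shrinking them, here is a minimal lasso solver using cyclic coordinate descent. It is a sketch on synthetic data: the data, the penalty `alpha=0.5`, and the function names are assumptions for illustration, and the project's actual model and solver are not described in the source.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values inside [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residual y - Xw; equals y while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes w[j].
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Synthetic data: only features 0 and 1 drive y; features 2..5 are pure noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[1] + 0.1 * rng.gauss(0, 1) for row in X]
w = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print(selected)
```

The soft-threshold step sets a coefficient to exactly zero whenever its correlation with the residual falls below the penalty, which is the mechanism behind a reduction such as 150 features to 25.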
Phase 3: Live Deployment
Deployed 12 robust models with monthly retraining.
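Monthly retraining pairs naturally with walk-forward testing: fit on a trailing window, evaluate on the next month, then roll forward. The sketch below generates such windows. The 36-month training window and the 72-month history are assumptions chosen for illustration (the source states only 6 years of data and monthly retraining), and `walk_forward_windows` is a hypothetical name.

```python
def walk_forward_windows(n_periods, train_len, test_len):
    """Yield (train_range, test_range): fit on a trailing window, test on the next block."""
    start = 0
    while start + train_len + test_len <= n_periods:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len  # roll forward by one test block


# 72 monthly periods (6 years), retrain every month on the trailing 36 months.
windows = list(walk_forward_windows(n_periods=72, train_len=36, test_len=1))
first_train, first_test = windows[0]
print(len(windows), list(first_test))  # 36 [36]
```

Each test block lies strictly after its training window, so the out-of-sample record is built only from data the model had not seen when it was fitted.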
Technical Challenges
- Time series leakage in cross-validation
Impact: Future data leaking into training folds
Resolution: Purged splits with a 20-period gap between the train and validation sets
- Hyperparameter explosion
Impact: 5x5 nested CV means 25 train/validation combinations (5 outer × 5 inner folds) per model; across 20 models that is 500 training runs
Resolution: Bayesian optimization (Optuna) reduced iterations 90%
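The run count above is simple multiplication, sketched here for clarity. Two assumptions are made that the source does not confirm: that the figure of 25 is the product of 5 outer and 5 inner folds, and that the 90% reduction from Bayesian optimization applies directly to the 500-run figure. The Optuna study itself is not shown.

```python
def nested_cv_fits(outer_k: int, inner_k: int, n_models: int) -> int:
    """Number of model fits for one pass of outer_k x inner_k nested CV."""
    return outer_k * inner_k * n_models


baseline_runs = nested_cv_fits(outer_k=5, inner_k=5, n_models=20)
# Assumption: the quoted 90% reduction applies to the run count itself.
runs_after = baseline_runs - baseline_runs * 90 // 100
print(baseline_runs, runs_after)  # 500 50
```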
Results
- Live vs backtest Sharpe decay: before 52%, after 8% (84% reduction)
- Model features: before 150, after 25 (83% reduction)
- Monthly retraining time: before 8 hours, after 45 minutes (91% reduction)
Lessons Learned
- 📘 Purged cross-validation is essential for preventing look-ahead bias
- 📘 Regularization reduced overfitting more than adding more data did
- 📘 Fewer, more stable features outperformed complex models live
What We Would Do Differently
- 💡 Implement Shapley values for feature interpretability earlier
- 💡 Use model stacking for diversification
Role Relevance
Validation experts overhauled the model development process, cutting backtest-to-live performance decay from roughly 50% to 8% and saving $15M in avoided strategy failures.
Frequently Asked Questions
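- What is a purged time series split?
- A split that drops the observations between the training and validation sets (a 20-period gap in this project), so that overlapping labels or lagged features cannot leak information across the boundary.
- How do you measure overfitting?
- As the performance decay between in-sample cross-validation results and out-of-sample walk-forward results, for example the drop from backtest Sharpe to live Sharpe.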