Use machine learning models (gradient boosting, neural networks, transformers) to predict short-term stock returns from structured and alternative data. The modern evolution of statistical arbitrage.
History
Machine learning in finance is often traced to Renaissance Technologies, where Jim Simons's team applied statistical techniques borrowed from speech recognition in the 1990s, well before modern deep learning. The field accelerated after 2010 with the explosion of alternative data (satellite imagery, social media, credit card data) and advances in NLP and deep learning. Two Sigma, founded by David Siegel and John Overdeck, has been at the forefront of ML-driven investing, hiring hundreds of data scientists. WorldQuant (founded by Igor Tulchinsky) crowdsources alpha signals from thousands of quants globally. The central challenge remains overfitting: most ML signals that look good in backtests fail in live trading.
How It Works
Collect structured data (price, volume, fundamentals) and alternative data (NLP on news/filings, satellite imagery, web scraping, credit card data)
Engineer features: transform raw data into predictive signals (e.g., sentiment scores, supply chain indicators, earnings surprise momentum)
Train models (XGBoost, LightGBM, LSTM networks, or transformer architectures) to predict next-day or next-week stock returns
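Use walk-forward validation (train only on data that was available before each test window) with purging (dropping training samples whose label horizons overlap the test window) and an embargo (an extra buffer between training and test data) to avoid data leakage
Combine hundreds of weak signals into an ensemble prediction; each individual signal may improve directional accuracy by less than one percentage point over a coin flip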
Execute via a stat-arb framework: long stocks with positive predictions, short those with negative, maintaining sector and factor neutrality
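The validation step above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `walk_forward_splits` and its parameters are hypothetical, and it assumes that sample t carries a label built from the following `label_horizon` periods (for example, the next-day return when `label_horizon=1`). Training samples whose labels would not yet be known at the start of the test window are purged, and `embargo` widens the gap further.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int,
    train_size: int,
    test_size: int,
    label_horizon: int = 1,
    embargo: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs over time-ordered samples.

    Assumption: the label of sample t depends on data in (t, t + label_horizon].
    """
    if min(train_size, test_size, label_horizon) < 1 or embargo < 0:
        raise ValueError("sizes and label_horizon must be >= 1, embargo >= 0")

    # Purge: the last (label_horizon - 1) samples before the test window have
    # labels that overlap it, so they are excluded. The embargo adds to the gap.
    gap = (label_horizon - 1) + embargo

    test_start = train_size + gap
    while test_start + test_size <= n_samples:
        train_end = test_start - gap  # exclusive upper bound
        train_idx = list(range(train_end - train_size, train_end))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx
        test_start += test_size  # roll the window forward; never look ahead


if __name__ == "__main__":
    for train_idx, test_idx in walk_forward_splits(
        n_samples=20, train_size=8, test_size=4, label_horizon=3, embargo=1
    ):
        print("train", train_idx[0], "to", train_idx[-1],
              "| test", test_idx[0], "to", test_idx[-1])
```

A rolling window of fixed length is used here; an expanding window (always starting at index 0) is an equally common choice and only changes how `train_idx` is built.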
Example Trades
NLP model detects unusually positive sentiment shift in AMZN earnings call transcript (management tone, guidance language)
entry Long AMZN as part of sentiment-alpha basket, weighted by signal conviction
exit Signal decays after 3-5 days; position exits at next model update
result +0.8% contribution from this position over 4 days
Satellite imagery shows 15% increase in parking lot activity at Target stores vs seasonal baseline
entry Long TGT ahead of quarterly earnings with signal-proportional sizing
exit Earnings beat estimates; exit on the day-after-earnings gap-up
result +5.2% on the position; satellite signal confirmed by revenue beat
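The "weighted by signal conviction" and "signal-proportional sizing" in the trades above can be sketched as follows. This is one simple way to do it under stated assumptions, not the method of any particular firm: scores are demeaned within each sector so that longs and shorts cancel sector by sector, then scaled to a target gross exposure. The function name `neutral_weights`, the peer tickers `PEER1` and `PEER2`, the sector labels, and all scores are made up for illustration. Factor neutrality, which the strategy also calls for, would need a risk model and is not covered here.

```python
from collections import defaultdict
from typing import Dict


def neutral_weights(
    scores: Dict[str, float],
    sectors: Dict[str, str],
    gross: float = 1.0,
) -> Dict[str, float]:
    """Signal-proportional weights that sum to zero within every sector.

    The sum of absolute weights equals `gross` (unless all scores tie).
    """
    by_sector = defaultdict(list)
    for ticker, score in scores.items():
        by_sector[sectors[ticker]].append(score)
    sector_mean = {s: sum(v) / len(v) for s, v in by_sector.items()}

    # Demeaning within a sector makes the book sector-neutral and,
    # as a consequence, dollar-neutral overall.
    demeaned = {t: scores[t] - sector_mean[sectors[t]] for t in scores}

    total_abs = sum(abs(x) for x in demeaned.values())
    if total_abs == 0:
        return {t: 0.0 for t in scores}
    return {t: gross * x / total_abs for t, x in demeaned.items()}


if __name__ == "__main__":
    # Illustrative model scores only; not real predictions.
    scores = {"AMZN": 0.8, "PEER1": -0.2, "TGT": 0.5, "PEER2": 0.1}
    sectors = {"AMZN": "sector_a", "PEER1": "sector_a",
               "TGT": "sector_b", "PEER2": "sector_b"}
    for ticker, weight in neutral_weights(scores, sectors).items():
        print(f"{ticker}: {weight:+.3f}")
```

Stocks with the strongest score relative to their sector peers receive the largest long weight, and the weakest receive the largest short weight.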
When It Works vs. Fails
works
Markets with high cross-sectional dispersion where idiosyncratic factors drive returns. Data-rich environments with diverse information sources.
fails
Macro-dominated markets where all stocks move on the same factor. Black swan events with no training data. Markets where the signal-to-noise ratio is too low.
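Cross-sectional dispersion, the quantity that separates the two regimes above, can be measured in several ways. A common and simple choice, assumed here, is the standard deviation of returns across stocks within a single period. The function name and the return figures below are hypothetical and chosen only to contrast a macro-driven day with an idiosyncratic one.

```python
from statistics import pstdev
from typing import Dict


def cross_sectional_dispersion(returns: Dict[str, float]) -> float:
    """Population standard deviation of one period's returns across stocks."""
    if len(returns) < 2:
        raise ValueError("need at least two stocks")
    return pstdev(list(returns.values()))


if __name__ == "__main__":
    # Made-up daily returns for four hypothetical stocks.
    macro_day = {"A": -0.021, "B": -0.019, "C": -0.020, "D": -0.022}
    idiosyncratic_day = {"A": 0.031, "B": -0.018, "C": 0.004, "D": -0.025}

    print("macro-driven day: ", cross_sectional_dispersion(macro_day))
    print("idiosyncratic day:", cross_sectional_dispersion(idiosyncratic_day))
```

On the macro-driven day every stock moves by roughly the same amount, so there is little spread for a long-short book to capture even though the market moved sharply.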
Risks
01 Overfitting: the #1 risk. Models that capture noise rather than signal look great in backtests but fail live
02 Alpha decay: ML signals decay rapidly as competitors discover similar patterns
03 Data quality: alternative data sources can be noisy, sparse, or biased. Garbage in, garbage out
04 Regime changes: models trained on one market environment may fail completely in a new regime
05 Computational cost: training and inference at scale require significant GPU/infrastructure investment
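The overfitting risk listed first can be made concrete with a small simulation. The sketch below is an illustration under simple assumptions, not a formal test: it generates many strategies whose daily returns are pure Gaussian noise with zero mean, picks the one with the best in-sample Sharpe ratio, and reports how that same strategy does out of sample. All names and parameters (`best_of_n_backtests`, 200 strategies, 1% daily volatility) are hypothetical.

```python
import math
import random
from statistics import mean, stdev
from typing import List, Tuple


def annualized_sharpe(daily_returns: List[float], periods_per_year: int = 252) -> float:
    """Mean over standard deviation of daily returns, scaled to a yearly figure."""
    sd = stdev(daily_returns)
    if sd == 0:
        return 0.0
    return mean(daily_returns) / sd * math.sqrt(periods_per_year)


def best_of_n_backtests(
    n_strategies: int = 200, n_days: int = 252, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (best in-sample Sharpe, same strategy's out-of-sample Sharpe,
    average in-sample Sharpe) for strategies with no true edge."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_strategies):
        # Zero-mean noise: by construction, no strategy has any real signal.
        in_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        out_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        results.append((annualized_sharpe(in_sample), annualized_sharpe(out_sample)))

    # Selecting on in-sample performance is the step that manufactures
    # an impressive-looking backtest out of noise.
    best_is, best_oos = max(results, key=lambda r: r[0])
    avg_is = mean([r[0] for r in results])
    return best_is, best_oos, avg_is


if __name__ == "__main__":
    best_is, best_oos, avg_is = best_of_n_backtests()
    print(f"best in-sample Sharpe:        {best_is:.2f}")
    print(f"same strategy, out of sample: {best_oos:.2f}")
    print(f"average in-sample Sharpe:     {avg_is:.2f}")
```

Because the winner is chosen after seeing the in-sample results, its backtest Sharpe is biased upward, while its out-of-sample Sharpe is just another draw from noise. The more signals are tried, the stronger this selection effect becomes.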
Research
Gu, Kelly, Xiu, 2020
Deep Learning for Financial Applications
Heaton, Polson, Witte, 2017
The Virtue of Complexity in Return Prediction
Kelly, Pruitt, Su, 2022