RSM338: Applications of Machine Learning in Finance

Week 3: Financial Data II | January 21–22, 2026

Kevin Mott

Rotman School of Management

Today’s Roadmap

Last week we assumed log returns are normally distributed. Today we investigate that assumption and set up the prediction problem.

  1. Investigating the Assumption: Are returns actually normal? Skewness, kurtosis, and fat tails
  2. Autocorrelation: Does today’s return predict tomorrow’s?
  3. Predictive Regressions: Using other variables to forecast returns
  4. Out-of-Sample Testing: Why in-sample fit can be deceiving
  5. Certainty Equivalence: Connecting predictions to portfolio decisions

Theme: Moving from describing returns to predicting them—and understanding why prediction is hard.

Part I: Investigating the Assumption

The Normality Assumption

Everything in Week 2 relied on one key assumption:

\[r_t \sim N(\mu, \sigma^2)\]

Is this actually true?

To investigate, we need tools to measure how a distribution deviates from normality:

  • Skewness: Is the distribution symmetric?
  • Kurtosis: How heavy are the tails?

For a normal distribution: skewness \(= 0\) and excess kurtosis \(= 0\).

Let’s see what the data say.

Skewness

Skewness measures asymmetry of the distribution:

\[\gamma_1 = \frac{\mu_3}{\sigma^3}, \quad \text{where } \mu_3 = \mathbb{E}[(R - \mu)^3]\]

Sample skewness:

\[\hat{\gamma}_1 = \frac{\frac{1}{T}\sum_{t=1}^{T}(R_t - \bar{R})^3}{\hat{\sigma}^3}\]

  • \(\gamma_1 = 0\): Symmetric (like the normal distribution)
  • \(\gamma_1 > 0\): Right tail is longer (positive skew)
  • \(\gamma_1 < 0\): Left tail is longer (negative skew)

In finance: Stock returns often exhibit slight negative skewness—large losses are more common than large gains of the same magnitude.

Kurtosis

Kurtosis measures the “tailedness” of the distribution:

\[\gamma_2 = \frac{\mu_4}{\sigma^4} - 3, \quad \text{where } \mu_4 = \mathbb{E}[(R - \mu)^4]\]

The \(-3\) gives us excess kurtosis (relative to the normal distribution).

Sample excess kurtosis:

\[\hat{\gamma}_2 = \frac{\frac{1}{T}\sum_{t=1}^{T}(R_t - \bar{R})^4}{\hat{\sigma}^4} - 3\]

  • \(\gamma_2 = 0\): Normal distribution
  • \(\gamma_2 > 0\): “Fat tails”—extreme events more likely than normal predicts
  • \(\gamma_2 < 0\): “Thin tails”—extreme events less likely
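
As a quick check of these formulas, here is a minimal sketch on simulated data (the sample size, seed, and the Student-t degrees of freedom are arbitrary choices) that computes sample skewness and excess kurtosis for normal and fat-tailed draws:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T = 100_000

def sample_skew_kurt(x):
    """Sample skewness and excess kurtosis, using the formulas above."""
    x = np.asarray(x)
    dev = x - x.mean()
    sigma = x.std()                        # 1/T convention, matching the formulas
    skew = np.mean(dev**3) / sigma**3
    excess_kurt = np.mean(dev**4) / sigma**4 - 3
    return skew, excess_kurt

normal_draws = rng.standard_normal(T)      # normal: skew ≈ 0, excess kurtosis ≈ 0
t_draws = rng.standard_t(df=6, size=T)     # fat tails: theoretical excess kurtosis = 3

for name, x in [("Normal", normal_draws), ("Student-t (df=6)", t_draws)]:
    g1, g2 = sample_skew_kurt(x)
    print(f"{name:16s}  skew = {g1:+.3f}, excess kurtosis = {g2:+.3f}")

# Cross-check against scipy (same 1/T conventions by default)
print(stats.skew(t_draws), stats.kurtosis(t_draws))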

Why Kurtosis Matters

“Fat tails” means extreme events happen more often than a normal distribution predicts.

Practical implication: If you assume returns are normal, you will:

  • Underestimate the probability of market crashes
  • Underestimate the probability of extreme gains
  • Underestimate risk in general

This matters for:

  • Risk management (Value-at-Risk calculations)
  • Option pricing (Black-Scholes assumes normality)
  • Portfolio optimization

Visualizing Skewness and Kurtosis

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Skewness comparison
x = np.linspace(-4, 6, 1000)
ax1.plot(x, stats.norm.pdf(x), label='Normal (skew=0)')
ax1.plot(x, stats.skewnorm.pdf(x, 3), label='Positive skew')
ax1.plot(x, stats.skewnorm.pdf(x, -3), label='Negative skew')
ax1.set_title('Skewness')
ax1.legend()

# Kurtosis comparison
x = np.linspace(-5, 5, 1000)
ax2.plot(x, stats.norm.pdf(x), label='Normal (kurtosis=0)')
ax2.plot(x, stats.t.pdf(x, df=4), label='Fat tails (kurtosis>0)')
ax2.set_title('Kurtosis')
ax2.legend()

plt.tight_layout()
plt.show()

Stock returns: Typically exhibit slight negative skewness and positive excess kurtosis (fat tails).

Testing for Normality

Under the normal distribution:

  • Skewness \(\gamma_1 = 0\)
  • Excess kurtosis \(\gamma_2 = 0\)

We can test whether sample skewness and kurtosis are “too far” from zero.

Under normality, for large \(T\):

\[\sqrt{T} \cdot \hat{\gamma}_1 \sim N(0, 6) \quad \text{and} \quad \sqrt{T} \cdot \hat{\gamma}_2 \sim N(0, 24)\]

Test statistics:

\[z_1 = \frac{\hat{\gamma}_1}{\sqrt{6/T}}, \quad z_2 = \frac{\hat{\gamma}_2}{\sqrt{24/T}}\]

If \(|z_1| > 1.96\) or \(|z_2| > 1.96\), reject normality at 5% level.
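
Before turning to real data on the next slide, a quick sanity check (simulated, with arbitrary parameters): for truly normal returns, the z-statistics should usually stay inside \(\pm 1.96\) even with a sample as large as the S&P 500 one.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T = 19_000                                  # roughly the size of the S&P 500 sample
r_sim = rng.normal(loc=0.0003, scale=0.01, size=T)   # normal "returns" by construction

g1 = stats.skew(r_sim)
g2 = stats.kurtosis(r_sim)                  # excess kurtosis
z1 = g1 / np.sqrt(6 / T)
z2 = g2 / np.sqrt(24 / T)
print(f"z_skewness = {z1:.2f}, z_kurtosis = {z2:.2f}  (|z| > 1.96 rejects normality)")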

Empirical Evidence: S&P 500 Returns

import numpy as np
import pandas as pd
from scipy import stats
import os

# Load S&P 500 data (cache locally to avoid repeated downloads)
csv_path = 'sp500_yf.csv'
if os.path.exists(csv_path):
    sp500 = pd.read_csv(csv_path, index_col='Date', parse_dates=True)
else:
    import yfinance as yf
    sp500 = yf.download('^GSPC', start='1950-01-01', end='2025-12-31', progress=False)
    sp500.columns = sp500.columns.get_level_values(0)  # Flatten multi-level columns
    sp500.to_csv(csv_path)

# Compute daily log returns
sp500['log_return'] = np.log(sp500['Close'] / sp500['Close'].shift(1))
returns = sp500['log_return'].dropna()

# Date range
print(f"Data range: {returns.index.min().strftime('%Y-%m-%d')} to {returns.index.max().strftime('%Y-%m-%d')}")
print(f"Number of daily observations: T = {len(returns):,}")

# Summary statistics
mean_ret = returns.mean()
std_ret = returns.std()
skew = stats.skew(returns)
kurt = stats.kurtosis(returns)  # excess kurtosis

print(f"\nSample mean (daily): {mean_ret:.4%}")
print(f"Sample std dev (daily): {std_ret:.4%}")
print(f"Skewness: {skew:.3f}")
print(f"Excess kurtosis: {kurt:.2f}")

# Test statistics
T = len(returns)
z1 = skew / np.sqrt(6/T)
z2 = kurt / np.sqrt(24/T)

print(f"\nTest statistics:")
print(f"z_skewness = {z1:.2f}  (reject if |z| > 1.96)")
print(f"z_kurtosis = {z2:.2f}  (reject if |z| > 1.96)")
Data range: 1950-01-04 to 2025-12-17
Number of daily observations: T = 19,111

Sample mean (daily): 0.0314%
Sample std dev (daily): 0.9970%
Skewness: -0.952
Excess kurtosis: 25.28

Test statistics:
z_skewness = -53.76  (reject if |z| > 1.96)
z_kurtosis = 713.24  (reject if |z| > 1.96)

Conclusion: Skewness is significantly negative (left tail), and kurtosis is way too high. Stock returns have fat tails—they are not normally distributed.

S&P 500 Returns vs the Fitted Normal Distribution

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import os

# Load S&P 500 data (cache locally to avoid repeated downloads)
csv_path = 'sp500_yf.csv'
if os.path.exists(csv_path):
    sp500 = pd.read_csv(csv_path, index_col='Date', parse_dates=True)
else:
    import yfinance as yf
    sp500 = yf.download('^GSPC', start='1950-01-01', end='2025-12-31', progress=False)
    sp500.columns = sp500.columns.get_level_values(0)  # Flatten multi-level columns
    sp500.to_csv(csv_path)

sp500['log_return'] = np.log(sp500['Close'] / sp500['Close'].shift(1))
returns = sp500['log_return'].dropna()

# Annualize: multiply by 252 trading days
returns_annual = returns * 252

# Fit a normal distribution using sample mean and std dev
mu_hat = returns_annual.mean()
sigma_hat = returns_annual.std()

# Plot histogram of actual returns
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(returns_annual, bins=150, density=True, label='Actual S&P 500 returns')

# Overlay the FITTED normal distribution (same mean and std dev)
x = np.linspace(returns_annual.min(), returns_annual.max(), 500)
ax.plot(x, stats.norm.pdf(x, mu_hat, sigma_hat), label='Fitted Normal', linewidth=2)

ax.set_xlabel('Annualized Log Return')
ax.set_ylabel('Density')
ax.legend()
ax.set_title('S&P 500 Returns vs Fitted Normal')
plt.show()

# Count extreme events (using daily standardized returns)
standardized = (returns - returns.mean()) / returns.std()
n_extreme = (abs(standardized) > 4).sum()
expected_normal = len(returns) * 2 * stats.norm.sf(4)  # two-tailed
print(f"Observations beyond 4 std devs: {n_extreme}")
print(f"Normal distribution predicts: {expected_normal:.1f}")
print(f"Actual is {n_extreme/expected_normal:.0f}x more than expected!")

Observations beyond 4 std devs: 107
Normal distribution predicts: 1.2
Actual is 88x more than expected!

Box Plots: Actual vs Normal-Implied Returns

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

# Load S&P 500 data (cache locally to avoid repeated downloads)
csv_path = 'sp500_yf.csv'
if os.path.exists(csv_path):
    sp500 = pd.read_csv(csv_path, index_col='Date', parse_dates=True)
else:
    import yfinance as yf
    sp500 = yf.download('^GSPC', start='1950-01-01', end='2025-12-31', progress=False)
    sp500.columns = sp500.columns.get_level_values(0)  # Flatten multi-level columns
    sp500.to_csv(csv_path)

sp500['log_return'] = np.log(sp500['Close'] / sp500['Close'].shift(1))
returns = sp500['log_return'].dropna()

# Annualize
returns_annual = returns * 252

# Simulate from a normal with the same mean and std dev
np.random.seed(42)
normal_sim = np.random.normal(returns_annual.mean(), returns_annual.std(), len(returns_annual))

# Side-by-side box plots
fig, ax = plt.subplots(figsize=(6, 5))
ax.boxplot([returns_annual, normal_sim], labels=['Actual S&P 500', 'Normal (simulated)'])
ax.set_ylabel('Annualized Daily Log Return')
ax.set_title('Box Plot Comparison: Fat Tails in Action')
ax.axhline(0, linestyle='--', linewidth=0.5)
plt.show()

The actual returns have far more outliers (the dots beyond the whiskers) than a normal distribution with the same mean and variance would produce.

Quantifying the Fat Tails

import numpy as np
import pandas as pd
from scipy import stats
import os

# Load S&P 500 data
csv_path = 'sp500_yf.csv'
sp500 = pd.read_csv(csv_path, index_col='Date', parse_dates=True)
sp500['log_return'] = np.log(sp500['Close'] / sp500['Close'].shift(1))
returns = sp500['log_return'].dropna()

# Standardize returns
mu, sigma = returns.mean(), returns.std()
standardized = (returns - mu) / sigma

# Count extreme events at various thresholds
for k in [3, 4, 5, 6]:
    actual = (abs(standardized) > k).sum()
    expected = len(returns) * 2 * stats.norm.sf(k)  # two-tailed
    ratio = actual / expected if expected > 0 else np.inf
    print(f"{k}-sigma events: {actual} actual vs {expected:.2e} expected → {ratio:,.0f}× more")

# Find the worst day (Black Monday: October 19, 1987)
worst_day = returns.idxmin()
worst_return = returns.min()
worst_sigma = (worst_return - mu) / sigma

print(f"\nWorst day: {worst_day.strftime('%Y-%m-%d')}")
print(f"Return: {worst_return:.2%}")
print(f"Standard deviations from mean: {worst_sigma:.1f} sigma")

# Probability of this under normality
prob_normal = 2 * stats.norm.sf(abs(worst_sigma))
print(f"Probability under normality: {prob_normal:.2e}")
print(f"That's 1 in {1/prob_normal:.2e} days")
3-sigma events: 272 actual vs 5.16e+01 expected → 5× more
4-sigma events: 107 actual vs 1.21e+00 expected → 88× more
5-sigma events: 53 actual vs 1.10e-02 expected → 4,837× more
6-sigma events: 35 actual vs 3.77e-05 expected → 928,152× more

Worst day: 1987-10-19
Return: -22.90%
Standard deviations from mean: -23.0 sigma
Probability under normality: 4.53e-117
That's 1 in 2.21e+116 days

Summary: The Normality Assumption

Key findings from 75 years of S&P 500 daily returns:

  1. Mean and volatility are reasonable: ~8% annual return, ~16% annual volatility
  2. Negative skewness (\(\approx -1\)): Crashes are sharper than rallies
  3. Excess kurtosis \(\approx 25\): Extreme events are far more common than normal predicts

Why do the test statistics look so extreme?

With \(T \approx 19{,}000\) daily observations, the standard error is tiny (\(\approx 1/\sqrt{T}\)). Even modest departures from normality become statistically overwhelming. This is the point: we have decisive evidence against normality.

Practical implication:

The normal distribution works fine for “typical” days. But it dangerously underestimates tail risk—4-sigma events happen 88× more often than normality predicts. Black Monday (1987) was a 23-sigma event: probability \(\approx 10^{-117}\) under normality. Yet it happened.

Part II: Autocorrelation

Does Yesterday Predict Today?

So far we’ve implicitly assumed returns are independent over time—yesterday’s return tells us nothing about today’s.

Autocorrelation (also called serial correlation) measures the extent to which a variable is correlated with its own past values.

  • Positive autocorrelation: High returns tend to be followed by high returns (momentum)
  • Negative autocorrelation: High returns tend to be followed by low returns (reversal)
  • Zero autocorrelation: Past returns don’t predict future returns

Why it matters:

  • Autocorrelation implies predictability—a potential trading opportunity
  • If returns were truly unpredictable (EMH), autocorrelation should be zero

Sample Autocorrelation

The sample autocorrelation at lag \(k\) measures the correlation between returns \(k\) periods apart:

\[\hat{\rho}_k = \frac{\sum_{t=k+1}^{T} (r_t - \bar{r})(r_{t-k} - \bar{r})}{\sum_{t=1}^{T} (r_t - \bar{r})^2}\]

Interpretation:

  • \(\hat{\rho}_1\): Correlation between today’s return and yesterday’s (lag 1)
  • \(\hat{\rho}_5\): Correlation between today’s return and the return 5 days ago
  • \(\hat{\rho}_k \in [-1, 1]\) by construction

Under the null hypothesis of no autocorrelation:

\[\hat{\rho}_k \sim N\left(0, \frac{1}{T}\right) \quad \text{approximately, for large } T\]

A 95% confidence interval is roughly \(\pm 2/\sqrt{T}\).
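
To see the estimator at work, here is a minimal sketch on simulated AR(1) returns, where the true lag-1 autocorrelation is known (the persistence value 0.3, sample size, and seed are arbitrary illustrative choices):

import numpy as np
import pandas as pd

# Simulated AR(1) returns: the true lag-1 autocorrelation equals phi.
rng = np.random.default_rng(0)
T, phi = 5_000, 0.3
r = np.zeros(T)
for t in range(1, T):
    r[t] = phi * r[t-1] + rng.standard_normal()

# Sample autocorrelation at lag 1, straight from the formula above
r_bar = r.mean()
rho_1 = np.sum((r[1:] - r_bar) * (r[:-1] - r_bar)) / np.sum((r - r_bar) ** 2)

print(f"Sample rho_1 (formula):  {rho_1:.3f}")
# pandas uses a slightly different estimator, so the values agree closely, not exactly
print(f"pandas .autocorr(lag=1): {pd.Series(r).autocorr(lag=1):.3f}")
print(f"95% band under the null: ±{1.96 / np.sqrt(T):.3f}")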

S&P 500 Autocorrelation: Live Data

import numpy as np
import pandas as pd
import os

# Load S&P 500 data
csv_path = 'sp500_yf.csv'
sp500 = pd.read_csv(csv_path, index_col='Date', parse_dates=True)
sp500['log_return'] = np.log(sp500['Close'] / sp500['Close'].shift(1))
returns = sp500['log_return'].dropna()

T = len(returns)
print(f"Sample size: T = {T:,}")
print(f"Standard error under null: 1/√T = {1/np.sqrt(T):.4f}")
print(f"95% CI: ±{1.96/np.sqrt(T):.4f}")

# Compute autocorrelations at lags 1-10
print(f"\nAutocorrelations:")
for k in range(1, 11):
    rho_k = returns.autocorr(lag=k)
    z_stat = rho_k * np.sqrt(T)
    sig = "*" if abs(z_stat) > 1.96 else ""
    print(f"  Lag {k:2d}: ρ = {rho_k:+.4f}, z = {z_stat:+.2f} {sig}")
Sample size: T = 19,111
Standard error under null: 1/√T = 0.0072
95% CI: ±0.0142

Autocorrelations:
  Lag  1: ρ = -0.0016, z = -0.22 
  Lag  2: ρ = -0.0179, z = -2.48 *
  Lag  3: ρ = -0.0047, z = -0.65 
  Lag  4: ρ = -0.0155, z = -2.15 *
  Lag  5: ρ = -0.0027, z = -0.37 
  Lag  6: ρ = -0.0227, z = -3.13 *
  Lag  7: ρ = +0.0085, z = +1.18 
  Lag  8: ρ = -0.0123, z = -1.70 
  Lag  9: ρ = +0.0169, z = +2.33 *
  Lag 10: ρ = +0.0010, z = +0.14 

Most lags show tiny autocorrelations—but with \(T \approx 19{,}000\), even tiny correlations can be statistically significant. The economic significance is another matter…

The Ljung-Box Test

Testing autocorrelations one at a time is tedious. The Ljung-Box test checks whether any of the first \(K\) autocorrelations differ significantly from zero.

Test statistic:

\[Q(K) = T(T+2) \sum_{k=1}^{K} \frac{\hat{\rho}_k^2}{T-k}\]

Under the null hypothesis of no autocorrelation at any lag:

\[Q(K) \sim \chi^2_K\]

Interpretation:

  • Large \(Q\) \(\Rightarrow\) reject the null \(\Rightarrow\) evidence of autocorrelation
  • Common choice: \(K = 5\) for daily data, \(K = 4\) for quarterly
  • Joint test is more powerful than testing each lag separately
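
To make the formula concrete, here is a minimal sketch that computes \(Q(K)\) by hand on simulated white noise and cross-checks it against statsmodels' acorr_ljungbox (the sample size, seed, and \(K\) are arbitrary choices):

import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox

# White noise: the null of no autocorrelation is true by construction.
rng = np.random.default_rng(0)
T, K = 2_000, 5
r = rng.standard_normal(T)

# Sample autocorrelations and Q(K), straight from the formulas above
r_dev = r - r.mean()
denom = np.sum(r_dev ** 2)
rhos = np.array([np.sum(r_dev[k:] * r_dev[:-k]) / denom for k in range(1, K + 1)])
Q = T * (T + 2) * np.sum(rhos ** 2 / (T - np.arange(1, K + 1)))
p_value = stats.chi2.sf(Q, df=K)
print(f"Manual:      Q({K}) = {Q:.3f}, p-value = {p_value:.3f}")

# Cross-check against statsmodels (same estimator, so the numbers should match)
lb = acorr_ljungbox(r, lags=[K], return_df=True)
print(f"statsmodels: Q({K}) = {lb['lb_stat'].iloc[0]:.3f}, p-value = {lb['lb_pvalue'].iloc[0]:.3f}")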

Visualizing Autocorrelation: ACF Plot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load S&P 500 data
csv_path = 'sp500_yf.csv'
sp500 = pd.read_csv(csv_path, index_col='Date', parse_dates=True)
sp500['log_return'] = np.log(sp500['Close'] / sp500['Close'].shift(1))
returns = sp500['log_return'].dropna()

T = len(returns)
max_lag = 20

# Compute autocorrelations
acf = [returns.autocorr(lag=k) for k in range(1, max_lag + 1)]

# Plot
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(1, max_lag + 1), acf)
ax.axhline(1.96/np.sqrt(T), linestyle='--', label='95% CI')
ax.axhline(-1.96/np.sqrt(T), linestyle='--')
ax.axhline(0)
ax.set_xlabel('Lag (days)')
ax.set_ylabel('Autocorrelation')
ax.set_title('S&P 500 Daily Returns: Autocorrelation Function')
ax.legend()
plt.show()

The autocorrelations are tiny (note the y-axis scale) but some exceed the 95% confidence bands.

Ljung-Box Test: Live Results

import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import acorr_ljungbox

# Load S&P 500 data
csv_path = 'sp500_yf.csv'
sp500 = pd.read_csv(csv_path, index_col='Date', parse_dates=True)
sp500['log_return'] = np.log(sp500['Close'] / sp500['Close'].shift(1))
returns = sp500['log_return'].dropna()

# Ljung-Box test for lags 1, 5, 10, 20
lb_results = acorr_ljungbox(returns, lags=[1, 5, 10, 20], return_df=True)
print("Ljung-Box Test Results:")
print(lb_results.to_string())

print("\nInterpretation:")
print("- Small p-values → reject null of no autocorrelation")
print("- But remember: statistical ≠ economic significance!")
Ljung-Box Test Results:
      lb_stat     lb_pvalue
1    0.050399  8.223702e-01
5   11.382756  4.429755e-02
10  30.960720  5.955535e-04
20  90.801224  5.367238e-11

Interpretation:
- Small p-values → reject null of no autocorrelation
- But remember: statistical ≠ economic significance!

Autocorrelation: What the Data Tell Us

S&P 500 daily returns show:

  • Autocorrelations are tiny (typically \(|\hat{\rho}_k| < 0.05\))
  • But statistically significant due to large sample size
  • Ljung-Box test rejects “no autocorrelation” at conventional levels

Interpretation challenges:

  • Statistical significance \(\neq\) economic significance
  • A correlation of 0.02 means \(R^2 = 0.0004\) (0.04% of variance explained)
  • Transaction costs may eliminate apparent profit opportunities
  • Patterns may not persist out of sample

Autocorrelation: Key Takeaways

  • Autocorrelation measures whether past returns predict future returns

  • Sample autocorrelation \(\hat{\rho}_k\) can be tested against a null of zero (SE \(\approx 1/\sqrt{T}\))

  • The Ljung-Box test tests multiple lags jointly

  • S&P 500 findings: Autocorrelations are tiny (\(|\hat{\rho}_k| < 0.05\)) but statistically significant with 75 years of data

  • Caution: Statistical significance \(\neq\) economic significance

    • \(\rho = 0.02\) means \(R^2 = 0.04\%\)—nearly useless for prediction
    • Transaction costs likely exceed any trading profits

Part III: Predictive Regressions

Beyond Autocorrelation: Other Predictors

Autocorrelation uses past returns to predict future returns.

But we might have other information that helps predict returns. Common predictor variables include:

  • Dividend yield: \(D_t/P_t\) (dividends divided by price)
  • Earnings yield: \(E_t/P_t\) (earnings divided by price)
  • Term spread: Long-term minus short-term Treasury yields
  • Default spread: Corporate bond yields minus Treasury yields
  • Volatility: Recent realized or implied volatility
  • Sentiment indicators: Consumer confidence, investor surveys

Predictive regression: Use these variables to forecast returns.

The Predictive Regression Model

The basic predictive regression takes the form:

\[r_{t+1} = \alpha + \beta x_t + \varepsilon_{t+1}\]

where:

  • \(r_{t+1}\) = return in period \(t+1\) (what we want to predict)
  • \(x_t\) = predictor variable observed at time \(t\) (known when forecasting)
  • \(\alpha\) = intercept
  • \(\beta\) = slope (measures predictive power of \(x\))
  • \(\varepsilon_{t+1}\) = forecast error (unpredictable component)

Timing is crucial: We use \(x_t\) (observed today) to predict \(r_{t+1}\) (realized tomorrow). This avoids look-ahead bias: every forecast uses only information available at time \(t\).
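
A minimal sketch of the one-predictor case on simulated data (the persistence 0.95 and slope 0.05 are arbitrary illustrative values, not estimates), being careful to lag the predictor relative to the return:

import numpy as np
import statsmodels.api as sm

# Simulated example (not real data): a persistent predictor x_t with a small
# amount of genuine predictive power for r_{t+1}.
rng = np.random.default_rng(1)
T = 1_000
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.95 * x[t-1] + rng.normal(scale=0.1)       # persistent predictor
r_next = 0.05 * x[:-1] + rng.normal(size=T-1)          # r_{t+1} = alpha + beta*x_t + eps

# Regress r_{t+1} on x_t: the regressor is the predictor observed one period earlier
X = sm.add_constant(x[:-1])
model = sm.OLS(r_next, X).fit()
print(f"alpha-hat = {model.params[0]:+.4f}, beta-hat = {model.params[1]:+.4f}")
print(f"t-stat on beta: {model.tvalues[1]:.2f}, in-sample R^2: {model.rsquared:.4f}")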

Multiple Predictors

With multiple predictor variables, the regression becomes:

\[r_{t+1} = \alpha + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_p x_{p,t} + \varepsilon_{t+1}\]

Or in vector notation:

\[r_{t+1} = \alpha + \boldsymbol{\beta}' \mathbf{x}_t + \varepsilon_{t+1}\]

where \(\mathbf{x}_t = (x_{1,t}, x_{2,t}, \ldots, x_{p,t})'\) is the vector of predictors and \(\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p)'\) is the vector of slopes.

Question: How do we measure whether the predictors actually help?

Answer: The \(R^2\) statistic—but we need to be careful about which \(R^2\).

In-Sample \(R^2\)

The standard in-sample \(R^2\) measures how well the fitted model explains variation in the data used to estimate it:

\[R^2_{IS} = 1 - \frac{\sum_{t=1}^{T} (r_t - \hat{r}_t)^2}{\sum_{t=1}^{T} (r_t - \bar{r})^2}\]

where \(\hat{r}_t = \hat{\alpha} + \hat{\boldsymbol{\beta}}' \mathbf{x}_{t-1}\) is the fitted value.

Interpretation:

  • \(R^2_{IS} = 0\): Predictors explain none of the variation (useless)
  • \(R^2_{IS} = 1\): Predictors explain all variation (perfect fit)
  • For return prediction: \(R^2_{IS}\) is typically very small (1–5%)

Problem: In-sample \(R^2\) is overly optimistic. Adding more predictors always increases \(R^2_{IS}\), even if they have no true predictive power.
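
The following simulation (illustrative, with arbitrary sizes) makes the point: the returns here are pure noise, yet in-sample \(R^2\) climbs steadily as useless predictors are added.

import numpy as np
import statsmodels.api as sm

# Returns with no true predictability, plus a bank of pure-noise predictors.
rng = np.random.default_rng(2)
T = 120                                    # e.g. 10 years of monthly observations
r = rng.standard_normal(T)
X_noise = rng.standard_normal((T, 20))     # 20 predictors that are pure noise

for p in [1, 5, 10, 20]:
    X = sm.add_constant(X_noise[:, :p])    # use the first p noise predictors
    r2 = sm.OLS(r, X).fit().rsquared
    print(f"{p:2d} noise predictors: in-sample R^2 = {r2:.3f}")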

Part IV: Out-of-Sample Testing

Why In-Sample Performance Misleads

Overfitting: A model estimated on historical data will always fit that data better than it fits new data.

Sources of overfitting:

  • Coefficients are “tuned” to fit the specific noise in the estimation sample
  • With enough predictors, we can fit any dataset well (even random noise)
  • Data mining: Testing many predictors until we find one that “works”

Consequence: In-sample \(R^2\) overstates true predictive ability.

Solution: Evaluate predictive performance on data not used for estimation.

This leads us to out-of-sample (OOS) testing.

Out-of-Sample Testing: The Procedure

Expanding-window approach (a fixed-length rolling window is a common alternative):

  1. Use data from periods \(1\) to \(t\) to estimate the model
  2. Use the estimated model to forecast return in period \(t+1\)
  3. Record the forecast error: \(e_{t+1} = r_{t+1} - \hat{r}_{t+1}\)
  4. Move forward one period and repeat

Key feature: Each forecast uses only information available at the time. We never “peek ahead” at future data.

Out-of-Sample \(R^2\)

The out-of-sample \(R^2\) compares forecast errors to a naive benchmark (typically the historical mean):

\[R^2_{OOS} = 1 - \frac{\sum_{t=t_0}^{T-1} (r_{t+1} - \hat{r}_{t+1})^2}{\sum_{t=t_0}^{T-1} (r_{t+1} - \bar{r}_t)^2}\]

where:

  • \(\hat{r}_{t+1}\) = model forecast using data through time \(t\)
  • \(\bar{r}_t\) = historical mean return using data through time \(t\)
  • \(t_0\) = start of the out-of-sample period

Interpretation:

  • \(R^2_{OOS} > 0\): Model beats the historical mean forecast
  • \(R^2_{OOS} = 0\): Model does no better than the mean
  • \(R^2_{OOS} < 0\): Model does worse than the mean (yes, this can happen!)
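
A minimal expanding-window sketch on simulated data (all parameter values are arbitrary); it implements the \(R^2_{OOS}\) formula above with the historical mean as the benchmark:

import numpy as np
import statsmodels.api as sm

# Simulated predictor and returns, as in the earlier predictive-regression sketch
rng = np.random.default_rng(3)
T = 600
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.95 * x[t-1] + rng.normal(scale=0.1)
r = 0.02 * x[:-1] + rng.normal(size=T-1)        # r[t] is the return realized at time t+1

t0 = 200                                        # start of the out-of-sample period
sse_model, sse_mean = 0.0, 0.0
for t in range(t0, T-1):
    # Estimate the model using data through time t only
    beta = sm.OLS(r[:t], sm.add_constant(x[:t])).fit().params
    forecast = beta[0] + beta[1] * x[t]         # model forecast of r_{t+1}
    mean_forecast = r[:t].mean()                # historical-mean benchmark
    sse_model += (r[t] - forecast) ** 2
    sse_mean += (r[t] - mean_forecast) ** 2

r2_oos = 1 - sse_model / sse_mean
print(f"Out-of-sample R^2: {r2_oos:.4f}")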

When \(R^2_{OOS}\) Can Be Negative

Unlike in-sample \(R^2\), out-of-sample \(R^2\) can be negative. This happens when:

  • The model overfits the estimation sample
  • Parameter estimates are unstable over time
  • The predictor’s relationship with returns has changed
  • The true predictability was illusory (data mining)

Interpretation: A negative \(R^2_{OOS}\) means you would have been better off ignoring the predictor and just using the historical average return.

This is a valuable finding—it tells you the model doesn’t work in practice.

Goyal and Welch (2008): A Sobering Study

Goyal and Welch tested the out-of-sample performance of many popular return predictors.

Their findings:

  • Most predictors that work in-sample fail out-of-sample
  • Many have negative \(R^2_{OOS}\)
  • The simple historical mean is hard to beat

Example results:

Predictor        \(R^2_{IS}\)   \(R^2_{OOS}\)
Dividend yield   1.2%           −0.5%
Earnings yield   0.8%           −1.2%
Term spread      0.3%           −0.4%

Source: Goyal and Welch (2008), Review of Financial Studies

Lessons from the OOS Literature

  1. In-sample evidence is not enough. Any published predictor has passed in-sample tests; the real question is OOS performance.

  2. Return prediction is genuinely hard. Even small positive \(R^2_{OOS}\) (1–2%) is considered a success.

  3. Predictability may be time-varying. A predictor that worked historically may not work going forward.

  4. Combining predictors can help. Ensemble methods (averaging forecasts) often outperform individual predictors.

  5. Economic significance matters. Even if \(R^2_{OOS} > 0\), transaction costs may eliminate profits.

Out-of-Sample Testing: Key Takeaways

  • In-sample \(R^2\) overstates true predictive ability due to overfitting

  • Out-of-sample \(R^2\) provides an honest assessment of forecast performance

  • \(R^2_{OOS}\) can be negative, indicating the model is worse than a simple mean forecast

  • Most published return predictors fail OOS tests (Goyal-Welch 2008)

  • This is a preview of a central ML theme: We will spend much of this course on techniques to avoid overfitting and improve OOS performance

Part V: Certainty Equivalence

From Prediction to Portfolio Choice

Suppose we’ve forecast next period’s return \(\hat{r}_{t+1}\).

Question: How should this forecast affect our portfolio?

We need a framework that connects:

  • Expected return \(\mu\)
  • Risk (variance) \(\sigma^2\)
  • Investor preferences (risk aversion)

Mean-variance utility provides this connection.

Mean-Variance Utility

Recall from RSM332: A mean-variance investor evaluates portfolios using:

\[U = \mu_p - \frac{\gamma}{2} \sigma_p^2\]

where:

  • \(\mu_p\) = expected portfolio return
  • \(\sigma_p^2\) = portfolio variance
  • \(\gamma\) = coefficient of risk aversion (higher \(\gamma\) = more risk-averse)

Interpretation: The investor likes expected return but dislikes variance. They will accept lower expected return to reduce risk.

Typical values of \(\gamma\):

  • \(\gamma = 1\): Low risk aversion
  • \(\gamma = 3\)–\(5\): Moderate risk aversion (most common assumption)
  • \(\gamma = 10\): High risk aversion

Optimal Allocation: Risky vs Risk-Free

Consider an investor choosing between a risky asset and a risk-free asset:

  • Risky asset: expected excess return \(\mu\), variance \(\sigma^2\)
  • Risk-free asset: return \(r_f\) (certain)
  • \(w\) = fraction of wealth in risky asset

Portfolio expected excess return: \(\mu_p = w \mu\)

Portfolio variance: \(\sigma_p^2 = w^2 \sigma^2\)

Optimal allocation (maximize \(U\)):

\[w^* = \frac{\mu}{\gamma \sigma^2}\]

Intuition:

  • Higher expected return \(\mu\) \(\Rightarrow\) invest more in risky asset
  • Higher risk \(\sigma^2\) \(\Rightarrow\) invest less
  • Higher risk aversion \(\gamma\) \(\Rightarrow\) invest less

Certainty Equivalent Return

The certainty equivalent return (CER) is the risk-free return that would give the same utility as the optimal risky portfolio.

At the optimal allocation \(w^* = \mu/(\gamma\sigma^2)\), the utility gain from the risky position is:

\[U^* = \frac{\mu^2}{2\gamma\sigma^2}\]

Total utility is the risk-free rate plus this gain, so the certainty equivalent is:

\[\text{CER} = r_f + \frac{\mu^2}{2\gamma\sigma^2}\]

Interpretation: This is the “guaranteed” return the investor would accept instead of taking on the risk. The term \(\mu^2/(2\gamma\sigma^2)\) is the value of being able to invest in the risky asset.
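
A quick numeric illustration with assumed inputs (6% expected excess return, 16% volatility, \(\gamma = 4\), 3% risk-free rate; these are round numbers, not estimates):

# Assumed illustrative inputs
mu, sigma, gamma, rf = 0.06, 0.16, 4.0, 0.03

w_star = mu / (gamma * sigma**2)               # optimal weight in the risky asset
cer = rf + mu**2 / (2 * gamma * sigma**2)      # certainty equivalent return

print(f"Optimal risky-asset weight: w* = {w_star:.2f}")
print(f"Certainty equivalent return: {cer:.2%}")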

Certainty Equivalence and Predictability

Key insight: If we can predict returns (\(\mu\) varies over time), we can time the market.

  • When predicted return is high \(\Rightarrow\) increase \(w\) (invest more)
  • When predicted return is low \(\Rightarrow\) decrease \(w\) (invest less)

The value of predictability depends on:

  • How much \(\mu\) varies (more variation = more value)
  • How well we can forecast (higher \(R^2\) = more value)
  • Risk aversion \(\gamma\) (affects how aggressively we act on forecasts)

This framework will be central when we evaluate ML prediction models: even a small \(R^2\) improvement can translate into substantial utility gains.

Summary and Looking Ahead

Today’s Key Results

Part I: The Normality Assumption Is Approximate

  • Returns have fat tails (high kurtosis)
  • Normal models underestimate extreme events

Part II: Autocorrelation

  • Weak evidence that past returns predict future returns
  • Momentum at short horizons, mean reversion at long horizons

Parts III–IV: Predictive Regressions and OOS Testing

  • In-sample \(R^2\) overstates predictability
  • Most predictors fail out-of-sample (Goyal-Welch 2008)

Part V: Certainty Equivalence

  • Even small predictability has economic value
  • Framework for translating forecasts into portfolio decisions

What’s Next

Week 4: Introduction to Machine Learning

  • Supervised vs unsupervised learning
  • The prediction problem: given \(X\), predict \(Y\)
  • Bias-variance tradeoff
  • Cross-validation for time series

Preview: ML methods are designed to avoid overfitting and improve out-of-sample performance—exactly what we need for return prediction.

The rest of the course builds on today’s foundations:

  • Regularization (Weeks 6–7): Fighting overfitting in regression
  • Classification (Weeks 8–9): Predicting categories, not values
  • Ensembles & NNs (Weeks 10–11): Combining models for better OOS performance