
Lecture 9: Neural Networks & Deep Learning | March 18–19, 2026
Rotman School of Management
Today we introduce the last and most flexible model in the course: neural networks. To get there, we need to reconnect to a framework from Lecture 5 that has been quietly running behind every model we’ve studied since.
In Lecture 5 we generalized OLS into a “choose your ingredients” framework. Here was the slide:
\[\hat{\theta} = \arg\min_{\theta} \left\{ \underbrace{\sum_{i=1}^{n} \mathcal{L}(y_i, f_\theta(\mathbf{x}_i))}_{\text{loss function}} + \underbrace{\lambda \cdot \text{Penalty}(\theta)}_{\text{regularization}} \right\}\]
where \(f_\theta\) is the prediction function (parameterized by \(\theta\)), \(\mathcal{L}\) is the loss measuring how wrong our predictions are, and \(\lambda \cdot \text{Penalty}(\theta)\) is an optional regularization term that penalizes complex models.
| Component | OLS Choice | Alternatives |
|---|---|---|
| Function \(f_\theta\) | Linear: \(\mathbf{x}^\top \boldsymbol{\beta}\) | Polynomial, tree, neural network |
| Loss \(\mathcal{L}\) | Squared error | Absolute error, Huber, cross-entropy |
| Penalty | None (\(\lambda = 0\)) | Ridge (L2), Lasso (L1), Elastic Net |
At the time, we said “non-linear \(f\) comes in later lectures.” Every lecture since has been filling in a specific choice for \(f_\theta\) from the “Alternatives” column — we just didn’t always call it that.
Here’s where each model sits in that table:
| Lecture | Model | What \(f_\theta(\mathbf{x})\) is | Parameters \(\theta\) |
|---|---|---|---|
| 5 | Linear regression | \(\mathbf{x}^\top \boldsymbol{\beta}\) | Slopes and intercept |
| 5 | Ridge / Lasso | \(\mathbf{x}^\top \boldsymbol{\beta}\) (with penalty) | Slopes and intercept |
| 7 | Logistic regression | \(\sigma(\mathbf{x}^\top \mathbf{w})\) | Weights and bias |
| 8 | Decision trees / ensembles | Piecewise-constant regions | Split rules and leaf values |
In every case, the parameters \(\theta\) had clear interpretations. You could look at a regression coefficient and say “\(\beta_3 = 0.5\) means a one-unit increase in \(X_3\) is associated with a 0.5-unit increase in \(Y\).” Even tree splits are interpretable: “if momentum > 0.02, go left.”
Today we add one more row to the table:
| Lecture | Model | What \(f_\theta(\mathbf{x})\) is | Parameters \(\theta\) |
|---|---|---|---|
| 9 | Neural network | A composition of layers | Thousands (millions? billions?) of weights |
A neural network is just another choice of \(f_\theta\) — same loss functions, same regularization ideas. What’s different is the parameters: there are far more of them, and they no longer carry individual interpretations.
If we can’t interpret the parameters, why would we use a neural network?
Because the other models on our list have structural limitations: linear models capture only the relationships you specify by hand, and trees approximate smooth relationships with piecewise-constant steps. Neural networks trade interpretability for flexibility: we give up readable parameters in exchange for a function that can adapt to almost any shape in the data.
The building block of every neural network is a neuron (also called a unit or node). A single neuron does two things:
\[a = g\!\left(\sum_{j=1}^{p} w_j x_j + b\right) = g(\mathbf{w}^\top \mathbf{x} + b)\]
where \(x_1, \dots, x_p\) are the inputs, \(w_1, \dots, w_p\) are the weights, \(b\) is the bias (an intercept), and \(g\) is the activation function.
The weighted sum \(\mathbf{w}^\top \mathbf{x} + b\) should look familiar: it’s the same linear combination from logistic regression. The activation function \(g\) is what makes this more than a linear model.
If we choose the sigmoid as our activation function, a single neuron is exactly logistic regression:
\[a = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}\]
This is the same formula from the classification lecture. One neuron with a sigmoid activation = logistic regression. A neural network is what happens when we stack many of these neurons together.
If we choose no activation function at all (the “identity” activation, \(g(z) = z\)), a single neuron is just linear regression: \(\hat{y} = \mathbf{w}^\top \mathbf{x} + b\).
So neural networks are not a departure from what we’ve learned — they’re a generalization. The models we already know are special cases.
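To make the equivalence concrete, here is one sigmoid neuron computed by hand (a minimal NumPy sketch; the weights and inputs are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one observation with 3 features
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 0.5])

# One neuron: weighted sum, then activation -- exactly logistic regression
z = w @ x + b    # 0.5*1 - 1.0*2 + 2.0*0.5 + 0.1 = -0.4
a = sigmoid(z)   # interpreted as P(y = 1 | x)

print(round(float(z), 4))  # -0.4
print(round(float(a), 4))  # 0.4013
```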
The activation function \(g\) controls what kind of nonlinearity each neuron introduces. The most common choices:

- ReLU: \(g(z) = \max(0, z)\), the default for hidden layers
- Sigmoid: \(g(z) = \frac{1}{1 + e^{-z}}\), which squashes output into \((0, 1)\)
- Tanh: \(g(z) = \tanh(z)\), which squashes output into \((-1, 1)\)
Without activation functions, stacking layers doesn’t help. Suppose we have two layers of linear transformations:
\[\text{Layer 1: } \mathbf{h} = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1\] \[\text{Layer 2: } \hat{y} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2\]
Substituting:
\[\hat{y} = \mathbf{W}_2 (\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = \underbrace{(\mathbf{W}_2 \mathbf{W}_1)}_{\mathbf{W}'} \mathbf{x} + \underbrace{(\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2)}_{\mathbf{b}'}\]
This is still a linear function of \(\mathbf{x}\). No matter how many linear layers we stack, the result is always equivalent to a single linear transformation. The depth buys us nothing.
Activation functions break this collapse. When we insert \(g\) between layers:
\[\hat{y} = \mathbf{W}_2 \, g(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2\]
the \(g\) prevents us from simplifying. Each layer now genuinely transforms the representation. This is why depth matters — but only with nonlinear activations.
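The collapse is easy to verify numerically. A small NumPy sketch with randomly drawn (purely illustrative) weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical linear layers: 3 inputs -> 4 hidden -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Stacked linear layers (no activation between them)
y_stacked = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
y_single = W_prime @ x + b_prime

print(np.allclose(y_stacked, y_single))  # True: depth bought us nothing
```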
A layer is a collection of neurons that all receive the same inputs but have different weights. With \(d\) neurons and \(p\) input features, the whole layer is one matrix operation:
\[\mathbf{h} = g(\mathbf{W}\mathbf{x} + \mathbf{b})\]
where \(\mathbf{W}\) is a \(d \times p\) matrix of weights (one row per neuron), \(\mathbf{b}\) is a length-\(d\) vector of biases, and \(g\) is applied element-wise, so \(\mathbf{h}\) is a length-\(d\) vector of activations.

In a diagram of this layer (3 inputs feeding 4 neurons), each connecting line represents one weight. That is \(3 \times 4 = 12\) weights plus 4 biases, for 16 parameters: already more than the 4 parameters in a 3-variable linear regression.
A feed-forward neural network (or multilayer perceptron / MLP) stacks layers — each layer’s output feeds into the next:
\[\mathbf{h}^{(1)} = g\!\left(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}\right)\] \[\mathbf{h}^{(2)} = g\!\left(\mathbf{W}^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)}\right)\] \[\hat{y} = \mathbf{W}^{(3)} \mathbf{h}^{(2)} + \mathbf{b}^{(3)}\]

Let’s count parameters for a concrete example. Suppose we have 10 input features, two hidden layers with 64 and 32 neurons, and a single output neuron:
| Connection | Weights | Biases | Total |
|---|---|---|---|
| Input → Hidden 1 | \(10 \times 64 = 640\) | 64 | 704 |
| Hidden 1 → Hidden 2 | \(64 \times 32 = 2{,}048\) | 32 | 2,080 |
| Hidden 2 → Output | \(32 \times 1 = 32\) | 1 | 33 |
| Total | | | 2,817 |
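The counting rule generalizes: a fully-connected layer from \(p\) inputs to \(d\) neurons has \(p \times d\) weights plus \(d\) biases. A quick check of the table’s total:

```python
def layer_params(n_in, n_out):
    """Parameters in one fully-connected layer: weights plus biases."""
    return n_in * n_out + n_out

# Architecture from the table: 10 inputs -> 64 -> 32 -> 1 output
sizes = [10, 64, 32, 1]
total = sum(layer_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))
print(total)  # 2817
```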
The final layer of the network depends on the task:
Regression (predicting a continuous value like returns): a single output neuron with no activation (identity), so \(\hat{y}\) can be any real number.
Binary classification (predicting default / no default): a single output neuron with a sigmoid activation, so \(\hat{y} \in (0, 1)\) is a probability.
Multi-class classification (predicting which of \(K\) sectors): \(K\) output neurons with a softmax activation, producing a probability distribution over the classes.
Hidden layers learn a useful representation; the output layer translates it into the prediction format we need.
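For the multi-class head, softmax converts the \(K\) raw output scores (logits) into a probability distribution. A minimal sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    """Convert raw scores to probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 sectors
probs = softmax(logits)

print(np.round(probs, 3))  # [0.659 0.242 0.099]: largest logit, largest probability
```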
A neural network is just a function \(f_\theta(\mathbf{x})\) that we plug into the same regression framework from Lecture 5:
\[\hat{\theta} = \arg\min_{\theta} \left\{ \sum_{i=1}^{n} \mathcal{L}\!\left(y_i, \; f_\theta(\mathbf{x}_i)\right) + \lambda \cdot \text{Penalty}(\theta) \right\}\]
What makes it different: the optimization has no closed-form solution, the objective is non-convex, and \(\theta\) contains thousands of weights with no individual interpretation.
The loss \(\mathcal{L}\), the regularization penalty, the train/test split, cross-validation — all the machinery from earlier lectures carries over. The only new ingredient is the architecture of \(f_\theta\).
How flexible is a neural network, exactly? Could it represent any relationship?
Yes — with a caveat.
The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991):
A feed-forward neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy.
In plain language: given enough neurons, a one-hidden-layer network can get as close as you want to any smooth function.
The theorem says a wide enough network can represent extremely complex functions:

With 5 neurons, the network captures only the broad shape. With 100, it gets most of the detail. With 1,000 and 100,000 neurons, the fit becomes nearly perfect — the network has enough capacity to match every wiggle of the true function. The universal approximation theorem guarantees that we can get as close as we want — if we use enough neurons.
The theorem is an existence result, not a practical guarantee. It says a solution exists, not that we can find it.
It doesn’t promise:
That gradient descent will find the right weights. The loss landscape of a neural network is non-convex — it has many local minima and saddle points. The optimization could get stuck.
How many neurons you need. The theorem says “a sufficient number,” but that number could be impractically large.
That the network will generalize. A network with enough parameters can memorize the training data perfectly (just like a high-degree polynomial). The theorem says nothing about performance on new data.
That one hidden layer is the best architecture. The theorem proves one wide layer is enough, but in practice, deeper networks with fewer neurons per layer are more efficient. They can represent certain functions with far fewer total parameters than a single wide layer would need.
This last point is the practical motivation for deep learning. Depth is not about theoretical power (one layer is enough) — it’s about efficiency and learning useful representations at each level.
We need to find the parameters \(\theta\) (all weights and biases) that minimize the loss:
\[\theta^* = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\!\left(y_i, \; f_\theta(\mathbf{x}_i)\right)\]
We solve it with gradient descent: start with random weights, compute the gradient, step in the opposite direction:
\[\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}\]
where \(\eta\) (eta) is the learning rate — how large a step we take.
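The update rule is a few lines of plain Python. A sketch on the toy loss \(\mathcal{L}(\theta) = (\theta - 3)^2\), whose minimum we know is at \(\theta = 3\) (illustrative, not a real network):

```python
def grad(theta):
    """Gradient of L(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)

theta = 0.0   # arbitrary starting point
eta = 0.1     # learning rate

for _ in range(100):
    theta = theta - eta * grad(theta)  # step opposite the gradient

print(round(theta, 4))  # 3.0 -- converged to the minimum
```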
Same losses from earlier lectures — they carry over directly.
Regression:
\[\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
Mean squared error — same as OLS.
Binary classification:
\[\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]\]
Binary cross-entropy — same as logistic regression. Heavily penalizes confident wrong predictions.
Multi-class classification (\(K\) classes):
\[\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik})\]
where \(y_{ik} = 1\) if observation \(i\) belongs to class \(k\) and 0 otherwise.
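Each of these losses is one line of code. A quick NumPy sketch on hypothetical predictions, showing how cross-entropy punishes confident mistakes:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    return np.mean((y - y_hat) ** 2)

def bce(y, y_hat):
    """Binary cross-entropy (y_hat are predicted probabilities)."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = np.array([1.0, 0.0, 1.0])

# Confident and correct -> small loss; confident and wrong -> large loss
print(round(float(bce(y_true, np.array([0.9, 0.1, 0.9]))), 3))  # 0.105
print(round(float(bce(y_true, np.array([0.1, 0.9, 0.1]))), 3))  # 2.303

print(round(float(mse(np.array([1.0, 2.0]), np.array([1.5, 1.5]))), 3))  # 0.25
```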
With thousands of parameters, how do we compute the gradient \(\nabla_\theta \mathcal{L}\)?
Backpropagation (Rumelhart, Hinton & Williams, 1986) computes the gradient of the loss with respect to every weight. It works because a neural network is a nested composition of functions. For a 3-layer network:
\[\begin{align*} \mathbf{h}^{(1)} &= g\!\left(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}\right) \\ \mathbf{h}^{(2)} &= g\!\left(\mathbf{W}^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)}\right) \\ \hat{y} &= \mathbf{W}^{(3)} \mathbf{h}^{(2)} + \mathbf{b}^{(3)} \end{align*}\]
Substituting, the prediction is a nested composition:
\[\hat{y} = f^{(3)}\!\Big(\; f^{(2)}\!\Big(\; f^{(1)}(\mathbf{x})\;\Big)\;\Big)\]
where \(f^{(\ell)}(\cdot) = g(\mathbf{W}^{(\ell)}(\cdot) + \mathbf{b}^{(\ell)})\). The chain rule differentiates through each layer:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} = \underbrace{\frac{\partial \mathcal{L}}{\partial \hat{y}}}_{\text{output}} \cdot \underbrace{\frac{\partial \hat{y}}{\partial \mathbf{h}^{(2)}}}_{\text{layer 3}} \cdot \underbrace{\frac{\partial \mathbf{h}^{(2)}}{\partial \mathbf{h}^{(1)}}}_{\text{layer 2}} \cdot \underbrace{\frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{W}^{(1)}}}_{\text{layer 1}}\]
The chain rule multiplies many factors together. If each factor is small, the gradient shrinks exponentially with depth (the vanishing gradient problem):
Why this happens with sigmoid/tanh: Their gradients are always < 1 (sigmoid’s max is 0.25), so the chain rule multiplies many small numbers together.
Why ReLU helps: Its gradient is either 0 or 1. For positive inputs, the gradient passes through unchanged — no shrinkage.
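The shrinkage is easy to see numerically: the sigmoid’s derivative peaks at 0.25, so across \(L\) layers the product of these factors is at most \(0.25^L\):

```python
# Best-case gradient factor per sigmoid layer is 0.25 (at z = 0)
for depth in [2, 5, 10, 20]:
    print(depth, 0.25 ** depth)  # shrinks exponentially with depth

# ReLU's factor for active units is exactly 1 -> no shrinkage
print(1.0 ** 20)  # 1.0
```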
Standard gradient descent uses the entire training set at each step — slow with millions of observations.
Stochastic Gradient Descent (SGD) uses a small random subset (a mini-batch) instead:
\[\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}_{\text{batch}}\]
An epoch is one complete pass through the entire training set.
If you have 10,000 training observations and a batch size of 100, each epoch consists of 100 mini-batch updates. After one epoch, every observation has been used exactly once.
Training typically runs for many epochs — the optimizer sees the same data repeatedly, refining the weights each time.

Training loss decreases steadily. Validation loss decreases initially, then starts to increase — the model begins overfitting. The gap between the curves signals that the model is memorizing training data rather than learning generalizable patterns.
The learning rate \(\eta\) is usually the first hyperparameter to tune when training a neural network.

Too small: The model learns very slowly. You might need thousands of epochs to converge — wasteful and sometimes impractical.
Too large: The model overshoots and bounces around, never settling into a good minimum. The loss oscillates or even diverges.
Just right: The loss decreases steadily and converges to a low value.
Typical starting values: \(\eta = 0.001\) for Adam optimizer, \(\eta = 0.01\) for SGD. Often reduced during training (learning rate scheduling).
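The three regimes can be seen on the same toy loss \((\theta - 3)^2\), where the gradient-descent update is \(\theta \leftarrow \theta - \eta \cdot 2(\theta - 3)\) (illustrative sketch):

```python
def run(eta, steps=50):
    """Run gradient descent on (theta - 3)^2 from theta = 0."""
    theta = 0.0
    for _ in range(steps):
        theta = theta - eta * 2 * (theta - 3)
    return theta

print(run(0.001))  # too small: barely moved toward 3 after 50 steps
print(run(0.1))    # just right: essentially at the minimum
print(run(1.1))    # too large: overshoots and diverges
```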
Plain SGD uses the same learning rate for every parameter and doesn’t account for the history of past gradients. Modern optimizers improve on this.
SGD with Momentum adds a “velocity” term \(\mathbf{v}_t\) that accumulates past gradients:
\[\begin{align*} \mathbf{v}_t &= \gamma \, \mathbf{v}_{t-1} + \eta \, \nabla_\theta \mathcal{L} \\ \theta_{t+1} &= \theta_t - \mathbf{v}_t \end{align*}\]
Adam (Adaptive Moment Estimation; Kingma & Ba, 2015) tracks a running mean \(\mathbf{m}_t\) and uncentered variance \(\mathbf{v}_t\) of past gradients, adapting the effective learning rate per parameter:
\[\begin{align*} \mathbf{m}_t &= \beta_1 \, \mathbf{m}_{t-1} + (1 - \beta_1) \, \nabla_\theta \mathcal{L} && \text{(mean of gradients)} \\ \mathbf{v}_t &= \beta_2 \, \mathbf{v}_{t-1} + (1 - \beta_2) \, (\nabla_\theta \mathcal{L})^2 && \text{(variance of gradients)} \\ \theta_{t+1} &= \theta_t - \eta \, \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \end{align*}\]
where \(\hat{\mathbf{m}}_t = \mathbf{m}_t / (1 - \beta_1^t)\) and \(\hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_2^t)\) are bias-corrected estimates (the raw running averages are biased toward zero early in training).
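A minimal NumPy sketch of the Adam update on the toy quadratic \((\theta - 3)^2\), including the standard bias correction \(\hat{\mathbf{m}}_t = \mathbf{m}_t/(1-\beta_1^t)\); illustrative, not a library-grade optimizer:

```python
import numpy as np

def grad(theta):
    """Gradient of (theta - 3)^2."""
    return 2 * (theta - 3)

theta = 0.0
m, v = 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # running uncentered variance
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(round(theta, 3))  # converges near 3.0
```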
Putting it all together, training a neural network involves these steps:
1. Choose the architecture: How many hidden layers? How many neurons per layer? What activation functions?
2. Choose the loss function: MSE for regression, cross-entropy for classification.
3. Choose the optimizer: Adam is usually the default starting point.
4. Set hyperparameters: Learning rate, batch size, number of epochs.
5. Train: Feed mini-batches through the network, compute the loss, backpropagate, update weights. Repeat for many epochs.
6. Monitor: Track training and validation loss each epoch. Stop when validation loss stops improving.
7. Evaluate: Test on held-out data to estimate real-world performance.
This is the same workflow as fitting any ML model — define the model, choose the loss, optimize, evaluate. The difference is that neural networks have more architectural choices and take longer to train.
Neural networks typically have far more parameters than training observations (e.g., 50,000 parameters, 5,000 observations). Without regularization, such a network can memorize the training data and generalize poorly to new observations.
Unlike Ridge/Lasso (one knob: \(\lambda\)), neural networks use a toolkit of complementary strategies.
The simplest and most effective regularization technique: stop training before the model overfits.
Monitor validation loss during training. When it stops improving (or starts increasing), stop training and use the weights from the best epoch.

Early stopping acts as implicit regularization: by limiting the number of gradient descent steps, we prevent the model from fully fitting the training noise. It’s analogous to how reducing the number of boosting rounds controls overfitting in XGBoost.
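A common implementation tracks the best validation loss and stops after `patience` epochs without improvement. A sketch of the bookkeeping, run here on a hypothetical loss curve rather than a real model:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch to stop at: the best epoch, found by waiting
    `patience` epochs without improvement before giving up."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch  # weights from this epoch would be restored

# Hypothetical validation curve: improves, then starts overfitting
curve = [0.9, 0.7, 0.5, 0.45, 0.47, 0.5, 0.55, 0.6]
print(early_stop_epoch(curve))  # 3 -- the epoch with the lowest loss
```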
Dropout (Srivastava et al., 2014): at each mini-batch, randomly set a fraction of neurons to zero.
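The mechanics can be sketched in NumPy. This is the “inverted dropout” variant (survivors are rescaled by \(1/(1-p)\) so the expected activation is unchanged); purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(a, p=0.5):
    """Zero each activation with probability p; rescale survivors."""
    mask = rng.random(a.shape) >= p      # True = neuron kept
    return a * mask / (1 - p)

activations = np.ones(10_000)
dropped = dropout(activations, p=0.3)

print((dropped == 0).mean())     # roughly 0.3 of activations zeroed
print(round(dropped.mean(), 2))  # roughly 1.0: expected value preserved
```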

Weight decay applies the same L2 penalty as Ridge regression to the neural network weights:
\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_{\ell} \|\mathbf{W}^{(\ell)}\|_2^2\]
The gradient update now includes a term that pulls weights toward zero:
\[w_{t+1} = w_t - \eta \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_t} - \underbrace{\eta \lambda w_t}_{\text{decay}}\]
Two ways to add capacity to a network:
Wider: More neurons per layer (e.g., one hidden layer with 1,000 neurons)
Deeper: More layers with fewer neurons each (e.g., five hidden layers with 64 neurons each)
The universal approximation theorem says one wide layer is sufficient. In practice, deeper is often better: successive layers build on the representations learned by earlier ones, which is more parameter-efficient than a single very wide layer.
For tabular financial data, 2–4 hidden layers usually suffice; networks with 50+ layers are for images, text, and other unstructured data.
As preceding layers change during training, the distribution of each layer’s inputs shifts (internal covariate shift). Normalization techniques fix this.
Batch Normalization (Ioffe & Szegedy, 2015) — normalize across the mini-batch:
\[\hat{z} = \frac{z - \mu_{\text{batch}}}{\sigma_{\text{batch}}}\]
Layer Normalization (Ba, Kiros & Hinton, 2016) — normalize across the features within each observation instead: \(\hat{z} = (z - \mu_{\text{layer}})/\sigma_{\text{layer}}\)
Both stabilize training, allow higher learning rates, and act as mild regularizers.
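The core operation is just standardization of pre-activations over the mini-batch dimension. A sketch that omits the learned scale-and-shift parameters full batch norm adds:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations: batch of 32 observations, 4 neurons
z = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

# Batch norm: standardize each neuron across the batch
mu = z.mean(axis=0)
sigma = z.std(axis=0)
z_hat = (z - mu) / sigma

print(np.round(z_hat.mean(axis=0), 6))  # ~0 for every neuron
print(np.round(z_hat.std(axis=0), 6))   # 1 for every neuron
```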
| Technique | What it does | Analogy |
|---|---|---|
| Early stopping | Stop before the model overfits | Limiting boosting rounds in XGBoost |
| Dropout | Randomly disable neurons during training | Ensemble averaging (like Random Forests) |
| Weight decay | Penalize large weights (L2 penalty) | Ridge regression’s \(\lambda \|\boldsymbol{\beta}\|_2^2\) |
| Batch / Layer norm | Normalize intermediate values | Standardizing features before regression |
In practice, these techniques are often combined. A typical setup: ReLU activation + Adam optimizer + dropout (0.2–0.5) + early stopping + some weight decay.
PyTorch is the most widely used deep learning framework in research and increasingly in industry. It handles all the backpropagation, gradient computation, and optimization automatically — you define the architecture and the training loop.
We’ll build a simple classifier for the same synthetic data we used with Random Forests and XGBoost.
```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate data
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    random_state=42
)

# Split into train, validation, and test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.2, random_state=42
)

# Standardize features (neural networks are sensitive to feature scales)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

print(f"Training set: {X_train_t.shape[0]} observations")
print(f"Validation set: {X_val_t.shape[0]} observations")
print(f"Test set: {X_test_t.shape[0]} observations")
```

```
Training set: 640 observations
Validation set: 160 observations
Test set: 200 observations
```
```python
# Build a feed-forward neural network
model = nn.Sequential(
    nn.Linear(10, 64),   # hidden layer 1: 64 neurons
    nn.ReLU(),
    nn.Linear(64, 32),   # hidden layer 2: 32 neurons
    nn.ReLU(),
    nn.Linear(32, 1),    # output layer: 1 neuron
    nn.Sigmoid()         # sigmoid for binary classification
)

# Choose optimizer and loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()  # binary cross-entropy

print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
```

```
Sequential(
  (0): Linear(in_features=10, out_features=64, bias=True)
  (1): ReLU()
  (2): Linear(in_features=64, out_features=32, bias=True)
  (3): ReLU()
  (4): Linear(in_features=32, out_features=1, bias=True)
  (5): Sigmoid()
)

Total parameters: 2,817
```
Each Linear layer is a fully-connected layer (every neuron connects to every neuron in the previous layer). ReLU activations introduce nonlinearity between layers.
```python
from torch.utils.data import TensorDataset, DataLoader

# Create data loader for mini-batching
train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=32, shuffle=True)

# Training loop
train_losses = []
val_losses = []
torch.manual_seed(42)

for epoch in range(50):
    # Training phase
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()            # reset gradients
        y_pred = model(X_batch)          # forward pass
        loss = loss_fn(y_pred, y_batch)  # compute loss
        loss.backward()                  # backpropagation
        optimizer.step()                 # update weights

    # Record losses (no gradients needed for evaluation)
    with torch.no_grad():
        train_losses.append(loss_fn(model(X_train_t), y_train_t).item())
        val_losses.append(loss_fn(model(X_val_t), y_val_t).item())

print(f"Final training loss: {train_losses[-1]:.4f}")
print(f"Final validation loss: {val_losses[-1]:.4f}")
```

```
Final training loss: 0.0821
Final validation loss: 0.1823
```

Training loss is consistently below validation loss — the model fits the training data better than unseen data, as expected. The gap is small here, meaning the model generalizes reasonably well.
```python
# Evaluate on the test set: threshold the predicted probabilities at 0.5
with torch.no_grad():
    test_preds = (model(X_test_t) >= 0.5).float()
    test_accuracy = (test_preds == y_test_t).float().mean().item()
print(f"Test accuracy: {test_accuracy:.3f}")
```

```
Test accuracy: 0.950
```

```python
# Compare with Random Forest and XGBoost on the same data
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
xgb.fit(X_train, y_train)

print(f"\nModel comparison on test set:")
print(f"Neural Network: {test_accuracy:.3f}")
print(f"Random Forest: {rf.score(X_test, y_test):.3f}")
print(f"XGBoost: {xgb.score(X_test, y_test):.3f}")
```

```
Model comparison on test set:
Neural Network: 0.950
Random Forest: 0.950
XGBoost: 0.945
```
On small tabular datasets like this, neural networks typically perform comparably to tree-based methods — sometimes slightly better, sometimes slightly worse. The real advantage of neural networks emerges with large datasets and non-tabular data (images, text, sequences).
| Application | Why neural networks? |
|---|---|
| High-frequency trading | Large datasets, complex nonlinear patterns in order flow |
| Factor models | Gu, Kelly & Xiu (2020): NNs outperform linear/tree models for stock return prediction — but need large panels |
| Alternative data | Images, news, earnings calls, social media require specialized architectures (CNNs, transformers) |
| Risk management | Estimating loss distributions, stress testing, credit risk |
Where tree-based models still win: small-to-medium tabular datasets (most day-to-day finance). XGBoost/RF are competitive, faster to train, easier to tune, and interpretable.
| Hyperparameter | What it controls | Typical choices |
|---|---|---|
| Number of layers | Network depth | 2–4 for tabular data; dozens for images/text |
| Neurons per layer | Network width | 32, 64, 128, 256 |
| Activation function | Nonlinearity type | ReLU (hidden), sigmoid/softmax (output) |
| Learning rate | Step size in gradient descent | 0.001 (Adam), 0.01 (SGD) |
| Batch size | Observations per gradient update | 32–256 |
| Epochs | Passes through training data | Until validation loss stops improving |
| Dropout rate | Fraction of neurons dropped | 0.2–0.5 |
| Weight decay | L2 penalty strength | \(10^{-4}\) to \(10^{-2}\) |
| Optimizer | Gradient descent variant | Adam (default), SGD with momentum |
Start with a simple architecture (2 hidden layers, 64 neurons each, ReLU, Adam, dropout 0.3) and adjust based on validation performance.
A common pitfall: forgetting to call `model.eval()` before computing validation or test metrics. Dropout and batch norm behave differently during training vs. evaluation. Always switch modes.

A recurrent neural network (RNN) adds a connection from a layer back to itself across time steps:
\[h_t = g(\mathbf{W}_h \, h_{t-1} + \mathbf{W}_x \, x_t + \mathbf{b})\]
Plain RNNs struggle to carry information across many time steps: the vanishing-gradient problem again, now compounding through time. The LSTM (Hochreiter & Schmidhuber, 1997) fixes this by adding a cell state \(c_t\), a separate memory channel with three learned gates:
\[\begin{align*} f_t &= \sigma(\mathbf{W}_f [h_{t-1}, x_t] + \mathbf{b}_f) \\ i_t &= \sigma(\mathbf{W}_i [h_{t-1}, x_t] + \mathbf{b}_i) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ o_t &= \sigma(\mathbf{W}_o [h_{t-1}, x_t] + \mathbf{b}_o) \\ h_t &= o_t \odot \tanh(c_t) \end{align*}\]
where \(\odot\) is element-wise multiplication, \(\tilde{c}_t = \tanh(\mathbf{W}_c [h_{t-1}, x_t] + \mathbf{b}_c)\) is candidate new memory.
The problem with RNNs: they process the sequence one step at a time, so training cannot be parallelized across the sequence, and distant elements interact only through a long chain of hidden states.
The fix: process the entire sequence at once, letting any word attend to any other word directly.
The transformer (Vaswani et al., 2017) projects each word into three learned roles: a query \(\mathbf{q}\) (what am I looking for?), a key \(\mathbf{k}\) (what do I contain?), and a value \(\mathbf{v}\) (what do I pass along?).
Stacking into matrices \(\mathbf{Q}\), \(\mathbf{K}\), \(\mathbf{V}\):
\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right) \mathbf{V}\]
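The formula is a handful of matrix operations. A tiny NumPy sketch with a made-up 3-token sequence and \(d_k = 4\):

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V  # each output is a weighted mix of all values

# Hypothetical query, key, and value projections for 3 tokens
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)

print(out.shape)  # (3, 4): one output vector per token
```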
A GAN (Goodfellow et al., 2014) trains two networks against each other: a generator \(G\) that turns random noise \(\mathbf{z}\) into synthetic data, and a discriminator \(D\) that tries to tell real data from fakes.
The two networks play a minimax game:
\[\min_G \max_D \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}} [\log(1 - D(G(\mathbf{z})))]\]
- Discriminator maximizes: assign high \(D(\mathbf{x})\) to real data, low \(D(G(\mathbf{z}))\) to fakes
- Generator minimizes: produce \(G(\mathbf{z})\) that fools \(D\)
- Training alternates between updating \(D\) and \(G\)
- At convergence (in theory): the generator produces data indistinguishable from real, and \(D\) outputs 0.5 for everything
- In practice: GAN training is notoriously unstable — the two networks can oscillate rather than converge
Every model we’ve seen today uses data to train: we have \((x_i, y_i)\) pairs and minimize a loss like MSE. But what if we don’t have data — we have an equation we know the solution must satisfy?
This idea originated in physics (Raissi, Perdikaris & Karniadakis, 2019), where it’s called Physics-Informed Neural Networks (PINNs). My dissertation applied it to finance — Finance-Informed Neural Networks (FINNs) — in two settings.
The model has 60 overlapping generations of households (ages 20–79) and three household types (zero-, low-, and high-LTI), each choosing how much to consume vs. save. A household of age \(i\) maximizes lifetime utility — the discounted sum of utility from consumption over its remaining life:
\[\max \quad \mathbb{E}_t \sum_{s=0}^{59-i} \beta^s \; u(c_{i+s,\,t+s})\]
subject to a budget constraint each period:
\[c_{i,t} + k_{i+1,t+1} = (1-\tau_t)\,w_t\,\ell_i + (1 + r_t)\,k_{i,t} - a_{i,t}\]
where \(c_{i,t}\) is consumption, \(k_{i+1,t+1}\) is savings carried into next period, \(w_t\,\ell_i\) is after-tax labour income, \(r_t\) is the return on savings, and \(a_{i,t}\) is the student loan payment. This budget constraint is what links all the households together — everyone’s savings become the economy’s capital, which determines next period’s wages and returns for everyone else.
When you solve this optimization (take the first-order condition), you get a condition that must hold at every age: the marginal benefit of consuming one more dollar today must equal the expected marginal benefit of saving that dollar and consuming it tomorrow:
\[u'(c_{i,t}) = \beta \; \mathbb{E}_t\!\left[u'(c_{i+1,t+1}) \cdot (1 + r_{t+1})\right]\]
where \(u'(\cdot)\) is marginal utility, \(\beta\) is the discount factor (patience), and \(r_{t+1}\) is the return on savings. In words: if this condition is violated, the household can do better by shifting consumption between today and tomorrow.
That’s 59 \(\times\) 3 of these conditions (one per age per type) that must all hold simultaneously. The state space — tracking everyone’s wealth — is 178-dimensional. Traditional methods that use grids are hopeless here: a grid with just 10 points per dimension would have \(10^{178}\) grid points.
Instead, a neural network \(f_\theta\) learns the optimal saving decision for all 60 age groups at once, trained by penalizing violations of these optimality conditions. To avoid grids entirely, I use the neural network’s own policy to simulate the economy forward in time — generating a time series of states the economy actually visits — and evaluate the optimality conditions only at those points. No grid, no curse of dimensionality.
A caplet is an option on a future interest rate — it pays off if the rate exceeds a strike \(L_E\). From RSM332, the price of any derivative is the expected discounted payoff under no-arbitrage:
\[V(t) = \mathbb{E}\!\left[\exp\!\left(-\int_t^T r(s)\,ds\right) \cdot \text{payoff at } T\right]\]
where \(r(s)\) is the short-term interest rate along the path. The standard approach is Monte Carlo simulation: simulate thousands of random interest rate paths, compute the payoff on each, discount back, and average. This is slow — seconds per contract — and you need to re-simulate for every change in contract terms.
Instead, there is an equivalent way to express exactly the same price, not as an expectation over random paths, but as a deterministic equation that the price function \(V\) must satisfy:
\[-\frac{\partial V}{\partial \tau} + \mu' \nabla_f V + \frac{1}{2} \sum_{n=1}^{N} \sigma_n' \nabla_f^2 V \; \sigma_n - r \cdot V = 0\]
where \(\tau\) is time to maturity, \(f\) is the current forward rate curve, and \(\mu, \sigma\) come from the term structure model. You don’t need to understand every symbol here — the point is that this is an equation the price must satisfy, and it involves derivatives of \(V\) with respect to its inputs.
A neural network \(V_\theta\) approximates the price. The loss function squares the left-hand side of that equation: if \(V_\theta\) satisfies the equation, the loss is zero. Backpropagation — the same machinery used to train any neural network — computes all the partial derivatives of \(V_\theta\) that appear in the equation. No random simulation at all — the network learns the price by satisfying the equation directly.
The result: once trained, the network prices any contract instantly: 300,000–4,500,000\(\times\) faster than Monte Carlo simulation, with accuracy to 0.04 cents per dollar.
Deep learning = regression. A neural network is a choice of \(f_\theta\) in the same framework from Lecture 5. The loss functions, regularization, and evaluation methods all carry over.
The architecture. Neurons compute weighted sums plus nonlinear activations. Layers stack neurons together. Feed-forward networks chain layers sequentially. Activation functions (especially ReLU) are what make depth meaningful — without them, multiple layers collapse to a single linear transformation.
Universal approximation. A wide enough single-layer network can approximate any continuous function. But this is an existence theorem, not a practical recipe. Deep networks are more parameter-efficient and learn hierarchical representations.
Training. No closed-form solution — we use gradient descent. Backpropagation efficiently computes gradients by applying the chain rule backwards through the network. SGD with mini-batches makes training practical for large datasets. Adam is the default optimizer.
Regularization. Neural networks are massively overparameterized and prone to overfitting. Early stopping, dropout, weight decay, and batch normalization are the standard toolkit.
Modern architectures. RNNs and LSTMs handle sequential data by passing hidden states through time. Transformers replace recurrence with self-attention, processing entire sequences in parallel. GANs train two networks adversarially to generate synthetic data.
In practice. On tabular financial data, neural networks compete with but rarely dominate tree-based methods. Their real advantage is with large datasets and non-tabular data (images, text, sequences).
Neural networks unlock a new kind of data for finance: text.
Earnings calls, news articles, analyst reports, SEC filings, social media — vast amounts of financially relevant information is locked in natural language.
Lecture 10: Text & Natural Language Processing