
Lecture 9: Neural Networks & Deep Learning | March 18–19, 2026
Rotman School of Management
Today we introduce the last and most flexible model in the course: neural networks. To get there, we need to reconnect to a framework from Lecture 5 that has been quietly running behind every model we’ve studied since.
Today’s roadmap:
In Lecture 5 we generalized OLS into a “choose your ingredients” framework. Here was the slide:
\[\hat{\theta} = \arg\min_{\theta} \left\{ \underbrace{\sum_{i=1}^{n} \mathcal{L}(y_i, f_\theta(\mathbf{x}_i))}_{\text{loss function}} + \underbrace{\lambda \cdot \text{Penalty}(\theta)}_{\text{regularization}} \right\}\]
where \(f_\theta\) is the prediction function (parameterized by \(\theta\)), \(\mathcal{L}\) is the loss measuring how wrong our predictions are, and \(\lambda \cdot \text{Penalty}(\theta)\) is an optional regularization term that penalizes complex models.
| Component | OLS Choice | Alternatives |
|---|---|---|
| Function \(f_\theta\) | Linear: \(\mathbf{x}^\top \boldsymbol{\beta}\) | Polynomial, tree, neural network |
| Loss \(\mathcal{L}\) | Squared error | Absolute error, Huber, cross-entropy |
| Penalty | None (\(\lambda = 0\)) | Ridge (L2), Lasso (L1), Elastic Net |
At the time, we said “non-linear \(f\) comes in later lectures.” Every lecture since has been filling in a specific choice for \(f_\theta\) from the “Alternatives” column — we just didn’t always call it that.
Here’s where each model sits in that table:
| Lecture | Model | What \(f_\theta(\mathbf{x})\) is | Parameters \(\theta\) |
|---|---|---|---|
| 5 | Linear regression | \(\mathbf{x}^\top \boldsymbol{\beta}\) | Slopes and intercept |
| 5 | Ridge / Lasso | \(\mathbf{x}^\top \boldsymbol{\beta}\) (with penalty) | Slopes and intercept |
| 7 | Logistic regression | \(\sigma(\mathbf{x}^\top \mathbf{w})\) | Weights and bias |
| 8 | Decision trees / ensembles | Piecewise-constant regions | Split rules and leaf values |
In every case, the parameters \(\theta\) had clear interpretations. You could look at a regression coefficient and say “\(\beta_3 = 0.5\) means a one-unit increase in \(X_3\) is associated with a 0.5-unit increase in \(Y\).” Even tree splits are interpretable: “if momentum > 0.02, go left.”
Today we add one more row to the table:
| Lecture | Model | What \(f_\theta(\mathbf{x})\) is | Parameters \(\theta\) |
|---|---|---|---|
| 9 | Neural network | A composition of layers | Thousands (millions? billions?) of weights |
A neural network is just another choice of \(f_\theta\) — same loss functions, same regularization ideas. What’s different:
If we can’t interpret the parameters, why would we use a neural network?
Because the other models on our list have structural limitations:
Neural networks trade interpretability for flexibility:
The building block of every neural network is a neuron (also called a unit or node). A single neuron does two things:
\[a = g\!\left(\sum_{j=1}^{p} w_j x_j + b\right) = g(\mathbf{w}^\top \mathbf{x} + b)\]
where:
The weighted sum \(\mathbf{w}^\top \mathbf{x} + b\) should look familiar: it’s the same linear combination from logistic regression. The activation function \(g\) is what makes this more than a linear model.
If we choose the sigmoid as our activation function, a single neuron is exactly logistic regression:
\[a = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}\]
This is the same formula from the classification lecture. One neuron with a sigmoid activation = logistic regression. A neural network is what happens when we stack many of these neurons together.
If we choose no activation function at all (the “identity” activation, \(g(z) = z\)), a single neuron is just linear regression: \(\hat{y} = \mathbf{w}^\top \mathbf{x} + b\).
So neural networks are not a departure from what we’ve learned — they’re a generalization. The models we already know are special cases.
The activation function \(g\) controls what kind of nonlinearity each neuron introduces. Here are the most common choices:

Without activation functions, stacking layers doesn’t help. Suppose we have two layers of linear transformations:
\[\text{Layer 1: } \mathbf{h} = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1\] \[\text{Layer 2: } \hat{y} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2\]
Substituting:
\[\hat{y} = \mathbf{W}_2 (\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = \underbrace{(\mathbf{W}_2 \mathbf{W}_1)}_{\mathbf{W}'} \mathbf{x} + \underbrace{(\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2)}_{\mathbf{b}'}\]
This is still a linear function of \(\mathbf{x}\). No matter how many linear layers we stack, the result is always equivalent to a single linear transformation. The depth buys us nothing.
Activation functions break this collapse. When we insert \(g\) between layers:
\[\hat{y} = \mathbf{W}_2 \, g(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2\]
the \(g\) prevents us from simplifying. Each layer now genuinely transforms the representation. This is why depth matters — but only with nonlinear activations.
A layer is a collection of neurons that all receive the same inputs but have different weights. With \(d\) neurons and \(p\) input features, the whole layer is one matrix operation:
\[\mathbf{h} = g(\mathbf{W}\mathbf{x} + \mathbf{b})\]
where:

Each line in the diagram represents one weight. With 3 inputs and 4 neurons, that’s \(3 \times 4 = 12\) weights plus 4 biases = 16 parameters. Already more than the 4 parameters in a 3-variable linear regression.
A feed-forward neural network (or multilayer perceptron / MLP) stacks layers — each layer’s output feeds into the next:
\[\mathbf{h}^{(1)} = g\!\left(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}\right)\] \[\mathbf{h}^{(2)} = g\!\left(\mathbf{W}^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)}\right)\] \[\hat{y} = \mathbf{W}^{(3)} \mathbf{h}^{(2)} + \mathbf{b}^{(3)}\]

Let’s count parameters for a concrete example. Suppose we have:
| Connection | Weights | Biases | Total |
|---|---|---|---|
| Input → Hidden 1 | \(10 \times 64 = 640\) | 64 | 704 |
| Hidden 1 → Hidden 2 | \(64 \times 32 = 2{,}048\) | 32 | 2,080 |
| Hidden 2 → Output | \(32 \times 1 = 32\) | 1 | 33 |
| Total | 2,817 |
The final layer of the network depends on the task:
Regression (predicting a continuous value like returns):
Binary classification (predicting default / no default):
Multi-class classification (predicting which of \(K\) sectors):
Hidden layers learn a useful representation; the output layer translates it into the prediction format we need.
A neural network is just a function \(f_\theta(\mathbf{x})\) that we plug into the same regression framework from Lecture 5:
\[\hat{\theta} = \arg\min_{\theta} \left\{ \sum_{i=1}^{n} \mathcal{L}\!\left(y_i, \; f_\theta(\mathbf{x}_i)\right) + \lambda \cdot \text{Penalty}(\theta) \right\}\]
What makes it different:
The loss \(\mathcal{L}\), the regularization penalty, the train/test split, cross-validation — all the machinery from earlier lectures carries over. The only new ingredient is the architecture of \(f_\theta\).
How flexible is a neural network, exactly? Could it represent any relationship?
Yes — with a caveat.
The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991):
A feed-forward neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy.
In plain language: given enough neurons, a one-hidden-layer network can get as close as you want to any smooth function.
The theorem says a wide enough network can represent extremely complex functions:

With 5 neurons, the network captures only the broad shape. With 100, it gets most of the detail. With 1,000 and 100,000 neurons, the fit becomes nearly perfect — the network has enough capacity to match every wiggle of the true function. The universal approximation theorem guarantees that we can get as close as we want — if we use enough neurons.
The theorem is an existence result, not a practical guarantee. It says a solution exists, not that we can find it.
It doesn’t promise:
That gradient descent will find the right weights. The loss landscape of a neural network is non-convex — it has many local minima and saddle points. The optimization could get stuck.
How many neurons you need. The theorem says “a sufficient number,” but that number could be impractically large.
That the network will generalize. A network with enough parameters can memorize the training data perfectly (just like a high-degree polynomial). The theorem says nothing about performance on new data.
That one hidden layer is the best architecture. The theorem proves one wide layer is enough, but in practice, deeper networks with fewer neurons per layer are more efficient. They can represent certain functions with far fewer total parameters than a single wide layer would need.
This last point is the practical motivation for deep learning. Depth is not about theoretical power (one layer is enough) — it’s about efficiency and learning useful representations at each level.
We need to find the parameters \(\theta\) (all weights and biases) that minimize the loss:
\[\theta^* = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\!\left(y_i, \; f_\theta(\mathbf{x}_i)\right)\]
We solve it with gradient descent: start with random weights, compute the gradient, step in the opposite direction:
\[\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}\]
where \(\eta\) (eta) is the learning rate — how large a step we take.
Same losses from earlier lectures — they carry over directly.
Regression:
\[\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
Mean squared error — same as OLS.
Binary classification:
\[\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]\]
Binary cross-entropy — same as logistic regression. Heavily penalizes confident wrong predictions.
Multi-class classification (\(K\) classes):
\[\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik})\]
where \(y_{ik} = 1\) if observation \(i\) belongs to class \(k\) and 0 otherwise.
With thousands of parameters, how do we compute the gradient \(\nabla_\theta \mathcal{L}\)?
Backpropagation (Rumelhart, Hinton & Williams, 1986) computes the gradient of the loss with respect to every weight. It works because a neural network is a nested composition of functions. For a 3-layer network:
\[\begin{align*} \mathbf{h}^{(1)} &= g\!\left(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}\right) \\ \mathbf{h}^{(2)} &= g\!\left(\mathbf{W}^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)}\right) \\ \hat{y} &= \mathbf{W}^{(3)} \mathbf{h}^{(2)} + \mathbf{b}^{(3)} \end{align*}\]
Substituting, the prediction is a nested composition:
\[\hat{y} = f^{(3)}\!\Big(\; f^{(2)}\!\Big(\; f^{(1)}(\mathbf{x})\;\Big)\;\Big)\]
where \(f^{(\ell)}(\cdot) = g(\mathbf{W}^{(\ell)}(\cdot) + \mathbf{b}^{(\ell)})\). The chain rule differentiates through each layer:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} = \underbrace{\frac{\partial \mathcal{L}}{\partial \hat{y}}}_{\text{output}} \cdot \underbrace{\frac{\partial \hat{y}}{\partial \mathbf{h}^{(2)}}}_{\text{layer 3}} \cdot \underbrace{\frac{\partial \mathbf{h}^{(2)}}{\partial \mathbf{h}^{(1)}}}_{\text{layer 2}} \cdot \underbrace{\frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{W}^{(1)}}}_{\text{layer 1}}\]
The chain rule multiplies many factors together. If each is small, the gradient shrinks exponentially:
Why this happens with sigmoid/tanh: Their gradients are always < 1 (sigmoid’s max is 0.25), so the chain rule multiplies many small numbers together.
Why ReLU helps: Its gradient is either 0 or 1. For positive inputs, the gradient passes through unchanged — no shrinkage.
Standard gradient descent uses the entire training set at each step — slow with millions of observations.
Stochastic Gradient Descent (SGD) uses a small random subset (a mini-batch) instead:
\[\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}_{\text{batch}}\]
An epoch is one complete pass through the entire training set.
If you have 10,000 training observations and a batch size of 100, each epoch consists of 100 mini-batch updates. After one epoch, every observation has been used exactly once.
Training typically runs for many epochs — the optimizer sees the same data repeatedly, refining the weights each time.

Training loss decreases steadily. Validation loss decreases initially, then starts to increase — the model begins overfitting. The gap between the curves signals that the model is memorizing training data rather than learning generalizable patterns.
The learning rate \(\eta\) is usually the first hyperparameter to tune when training a neural network.

Too small: The model learns very slowly. You might need thousands of epochs to converge — wasteful and sometimes impractical.
Too large: The model overshoots and bounces around, never settling into a good minimum. The loss oscillates or even diverges.
Just right: The loss decreases steadily and converges to a low value.
Typical starting values: \(\eta = 0.001\) for Adam optimizer, \(\eta = 0.01\) for SGD. Often reduced during training (learning rate scheduling).
Plain SGD uses the same learning rate for every parameter and doesn’t account for the history of past gradients. Modern optimizers improve on this.
SGD with Momentum adds a “velocity” term \(\mathbf{v}_t\) that accumulates past gradients:
\[\begin{align*} \mathbf{v}_t &= \gamma \, \mathbf{v}_{t-1} + \eta \, \nabla_\theta \mathcal{L} \\ \theta_{t+1} &= \theta_t - \mathbf{v}_t \end{align*}\]
Adam (Adaptive Moment Estimation; Kingma & Ba, 2015) tracks the mean \(\mathbf{m}_t\) and variance \(\mathbf{v}_t\) of past gradients, adapting the learning rate per parameter:
\[\begin{align*} \mathbf{m}_t &= \beta_1 \, \mathbf{m}_{t-1} + (1 - \beta_1) \, \nabla_\theta \mathcal{L} && \text{(mean of gradients)} \\ \mathbf{v}_t &= \beta_2 \, \mathbf{v}_{t-1} + (1 - \beta_2) \, (\nabla_\theta \mathcal{L})^2 && \text{(variance of gradients)} \\ \theta_{t+1} &= \theta_t - \eta \, \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \end{align*}\]
Putting it all together, training a neural network involves these steps:
1. Choose the architecture: How many hidden layers? How many neurons per layer? What activation functions?
2. Choose the loss function: MSE for regression, cross-entropy for classification.
3. Choose the optimizer: Adam is usually the default starting point.
4. Set hyperparameters: Learning rate, batch size, number of epochs.
5. Train: Feed mini-batches through the network, compute the loss, backpropagate, update weights. Repeat for many epochs.
6. Monitor: Track training and validation loss each epoch. Stop when validation loss stops improving.
7. Evaluate: Test on held-out data to estimate real-world performance.
This is the same workflow as fitting any ML model — define the model, choose the loss, optimize, evaluate. The difference is that neural networks have more architectural choices and take longer to train.
Neural networks typically have far more parameters than training observations (e.g., 50,000 parameters, 5,000 observations). Without regularization:
Unlike Ridge/Lasso (one knob: \(\lambda\)), neural networks use a toolkit of complementary strategies.
The simplest and most effective regularization technique: stop training before the model overfits.
Monitor validation loss during training. When it stops improving (or starts increasing), stop training and use the weights from the best epoch.

Early stopping acts as implicit regularization: by limiting the number of gradient descent steps, we prevent the model from fully fitting the training noise. It’s analogous to how reducing the number of boosting rounds controls overfitting in XGBoost.
Dropout (Srivastava et al., 2014): at each mini-batch, randomly set a fraction of neurons to zero.

Same L2 penalty as Ridge regression, applied to neural network weights:
\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_{\ell} \|\mathbf{W}^{(\ell)}\|_2^2\]
The gradient update now includes a term that pulls weights toward zero:
\[w_{t+1} = w_t - \eta \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_t} - \underbrace{\eta \lambda w_t}_{\text{decay}}\]
Two ways to add capacity to a network:
Wider: More neurons per layer (e.g., one hidden layer with 1,000 neurons)
Deeper: More layers with fewer neurons each (e.g., five hidden layers with 64 neurons each)
The universal approximation theorem says one wide layer is sufficient. In practice, deeper is often better:
For tabular financial data: 2–4 hidden layers usually suffice. 50+ layers are for images, text, and structured data.
As preceding layers change during training, the distribution of each layer’s inputs shifts (internal covariate shift). Normalization techniques fix this.
Batch Normalization (Ioffe & Szegedy, 2015) — normalize across the mini-batch:
\[\hat{z} = \frac{z - \mu_{\text{batch}}}{\sigma_{\text{batch}}}\]
Layer Normalization (Ba, Lei & Hinton, 2016) — normalize across features within each observation:
Both stabilize training, allow higher learning rates, and act as mild regularizers.
| Technique | What it does | Analogy |
|---|---|---|
| Early stopping | Stop before the model overfits | Limiting boosting rounds in XGBoost |
| Dropout | Randomly disable neurons during training | Ensemble averaging (like Random Forests) |
| Weight decay | Penalize large weights (L2 penalty) | Ridge regression’s \(\lambda \|\boldsymbol{\beta}\|_2^2\) |
| Batch / Layer norm | Normalize intermediate values | Standardizing features before regression |
In practice, these techniques are often combined. A typical setup: ReLU activation + Adam optimizer + dropout (0.2–0.5) + early stopping + some weight decay.
PyTorch is the most widely used deep learning framework in research and increasingly in industry. It handles all the backpropagation, gradient computation, and optimization automatically — you define the architecture and the training loop.
We’ll build a simple classifier for the same synthetic data we used with Random Forests and XGBoost.
import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate data
X, y = make_classification(
n_samples=1000, n_features=10, n_informative=5,
random_state=42
)
# Split into train, validation, and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.2, random_state=42
)
# Standardize features (neural networks are sensitive to feature scales)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
# Convert to PyTorch tensors
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
print(f"Training set: {X_train_t.shape[0]} observations")
print(f"Validation set: {X_val_t.shape[0]} observations")
print(f"Test set: {X_test_t.shape[0]} observations")Training set: 640 observations
Validation set: 160 observations
Test set: 200 observations
# Build a feed-forward neural network
model = nn.Sequential(
nn.Linear(10, 64), # hidden layer 1: 64 neurons
nn.ReLU(),
nn.Linear(64, 32), # hidden layer 2: 32 neurons
nn.ReLU(),
nn.Linear(32, 1), # output layer: 1 neuron
nn.Sigmoid() # sigmoid for binary classification
)
# Choose optimizer and loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss() # binary cross-entropy
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")Sequential(
(0): Linear(in_features=10, out_features=64, bias=True)
(1): ReLU()
(2): Linear(in_features=64, out_features=32, bias=True)
(3): ReLU()
(4): Linear(in_features=32, out_features=1, bias=True)
(5): Sigmoid()
)
Total parameters: 2,817
Each Linear layer is a fully-connected layer (every neuron connects to every neuron in the previous layer). ReLU activations introduce nonlinearity between layers.
from torch.utils.data import TensorDataset, DataLoader
# Create data loader for mini-batching
train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=32, shuffle=True)
# Training loop
train_losses = []
val_losses = []
torch.manual_seed(42)
for epoch in range(50):
# Training phase
for X_batch, y_batch in train_loader:
optimizer.zero_grad() # reset gradients
y_pred = model(X_batch) # forward pass
loss = loss_fn(y_pred, y_batch) # compute loss
loss.backward() # backpropagation
optimizer.step() # update weights
# Record losses (no gradients needed for evaluation)
with torch.no_grad():
train_losses.append(loss_fn(model(X_train_t), y_train_t).item())
val_losses.append(loss_fn(model(X_val_t), y_val_t).item())
print(f"Final training loss: {train_losses[-1]:.4f}")
print(f"Final validation loss: {val_losses[-1]:.4f}")Final training loss: 0.0821
Final validation loss: 0.1823

Training loss is consistently below validation loss — the model fits the training data better than unseen data, as expected. The gap is small here, meaning the model generalizes reasonably well.
Test accuracy: 0.950
# Compare with Random Forest and XGBoost on the same data
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
xgb.fit(X_train, y_train)
print(f"\nModel comparison on test set:")
print(f"Neural Network: {test_accuracy:.3f}")
print(f"Random Forest: {rf.score(X_test, y_test):.3f}")
print(f"XGBoost: {xgb.score(X_test, y_test):.3f}")
Model comparison on test set:
Neural Network: 0.950
Random Forest: 0.950
XGBoost: 0.945
On small tabular datasets like this, neural networks typically perform comparably to tree-based methods — sometimes slightly better, sometimes slightly worse. The real advantage of neural networks emerges with large datasets and non-tabular data (images, text, sequences).
| Application | Why neural networks? |
|---|---|
| High-frequency trading | Large datasets, complex nonlinear patterns in order flow |
| Factor models | Gu, Kelly & Xiu (2020): NNs outperform linear/tree models for stock return prediction — but need large panels |
| Alternative data | Images, news, earnings calls, social media require specialized architectures (CNNs, transformers) |
| Risk management | Estimating loss distributions, stress testing, credit risk |
Where tree-based models still win: small-to-medium tabular datasets (most day-to-day finance). XGBoost/RF are competitive, faster to train, easier to tune, and interpretable.
| Hyperparameter | What it controls | Typical choices |
|---|---|---|
| Number of layers | Network depth | 2–4 for tabular data; dozens for images/text |
| Neurons per layer | Network width | 32, 64, 128, 256 |
| Activation function | Nonlinearity type | ReLU (hidden), sigmoid/softmax (output) |
| Learning rate | Step size in gradient descent | 0.001 (Adam), 0.01 (SGD) |
| Batch size | Observations per gradient update | 32–256 |
| Epochs | Passes through training data | Until validation loss stops improving |
| Dropout rate | Fraction of neurons dropped | 0.2–0.5 |
| Weight decay | L2 penalty strength | \(10^{-4}\) to \(10^{-2}\) |
| Optimizer | Gradient descent variant | Adam (default), SGD with momentum |
Start with a simple architecture (2 hidden layers, 64 neurons each, ReLU, Adam, dropout 0.3) and adjust based on validation performance.
model.eval(). Dropout and batch norm behave differently during training vs. evaluation. Always switch modes.The feed-forward networks we’ve been building treat each input as a flat vector of numbers — no notion of spatial structure, no notion of order, no notion of time. That’s fine for tabular data (rows of features), but many real-world problems have richer structure:
Each of the architectures in this section was designed to exploit a specific kind of structure. We won’t go deep into the math — the formulas get heavy — but you should understand what problem each one solves and why it was a breakthrough.
CNNs are the standard architecture for image data. They exploit a basic property of images: nearby pixels contain similar information — the same “edge detector” is useful everywhere in the image.
LeCun, Bottou, Bengio & Haffner (1998), Figure 2
LeNet-5 was designed to read handwritten cheques for banks — one of the first commercially deployed neural networks. Input image → convolutional filters → pooling (downsampling) → more filters → fully connected layer → output. The same architectural principles now power image recognition, medical imaging, and satellite analysis.
The problem: You have high-dimensional data (hundreds of features) but suspect that the variation is really driven by a smaller number of underlying factors. How do you find that low-dimensional structure?
A classical approach is Principal Component Analysis (PCA). PCA finds the directions of maximum variance in the data and projects onto them — compressing, say, 100 features into 5 “principal components” that capture most of the variation. It works well, but it can only find linear combinations of the original features. If the true low-dimensional structure is nonlinear (curved, folded, or otherwise complex), PCA will miss it.
The idea: Train a neural network to reconstruct its own input — but force the data through a narrow bottleneck in the middle. If the network can reconstruct the input well despite the bottleneck, the bottleneck must have learned a compressed representation that captures the essential structure. Because the encoder and decoder are neural networks, this compression can be nonlinear — it’s PCA without the linearity constraint.

Hinton & Salakhutdinov (2006) showed that deep autoencoders dramatically outperform PCA for dimensionality reduction. This Science paper was a catalyst for the modern deep learning resurgence. A later variant, the Variational Autoencoder (VAE) by Kingma & Welling (2014), adds a probabilistic twist that allows the decoder to generate new data by sampling from the bottleneck.
The problem: Feed-forward networks treat each input independently. But many financial signals are sequential — the meaning of today’s return depends on what happened yesterday, last week, and last month. An earnings call transcript is a sequence of words where order matters. A limit order book evolves through time.
The idea: Give the network a memory. At each time step, the network receives the current input and a summary of everything it has seen so far.
How it works:
The seminal paper: Rumelhart, Hinton & Williams (1986). The backpropagation framework from PDP laid the foundation for training networks with recurrent connections — hidden states that feed back into themselves across time steps.
The trouble is that for long sequences, the hidden state has to carry information through many steps. In practice, the gradient signal degrades over long chains — the vanishing gradient problem again. By the time the network reaches step 100, it has effectively forgotten what happened at step 1.
The problem with vanilla RNNs: they can’t remember things for more than a handful of time steps. For financial applications — where a quarterly earnings surprise might affect prices for months — this is a serious limitation.
The fix: add a separate memory channel (the “cell state”) with learned gates that control what to remember and what to forget.
Think of the cell state as a conveyor belt running above the network. At each time step, three gates decide:
Each gate is a small neural network that learns when to open and close from the data. The cell state can carry information forward across many time steps without degradation, because the forget gate can stay close to 1 (meaning “keep everything”) when the information is still relevant.
The seminal paper: Hochreiter & Schmidhuber (1997). LSTMs became the dominant architecture for sequence modelling for nearly two decades, until transformers overtook them.
The problem with RNNs/LSTMs: They process sequences one step at a time, left to right. Step 50 has to wait for steps 1–49 to finish — this is slow, and it means the model can’t be parallelized across time steps. Worse, even LSTMs struggle when related pieces of information are far apart in the sequence. Consider: “The executive who the board hired last quarter after a lengthy search resigned.” — subject and verb are nine words apart, and the RNN has to carry the subject through all the intervening words.
The idea: Throw away recurrence entirely. Instead, let every element in the sequence look at every other element directly, all at once. This mechanism is called self-attention.
How self-attention works:
Each element in the sequence (a word, a time step) gets projected into three roles:
For every pair of elements, the model computes a relevance score (how much does element A care about element B?) by comparing A’s query to B’s key. These scores become attention weights — and each element’s new representation is a weighted average of all the values in the sequence.
The critical advantage: all pairwise scores are computed simultaneously as a single matrix multiplication. This means the entire sequence is processed in parallel on a GPU, regardless of length. And any two elements can interact directly — no information needs to “survive” through intermediate steps.
The seminal paper: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin (2017). “Attention Is All You Need” — one of the most cited papers in machine learning history. Transformers are the architecture behind GPT, Claude, and every modern large language model.
The problem: Every model we’ve built so far is discriminative — it maps inputs to outputs (\(\mathbf{x} \to y\)). But what if you need to generate new data that looks realistic? Synthetic financial data for stress testing, privacy-preserving data sharing, augmenting small training sets.
The idea: Train two networks against each other in a game. One tries to generate realistic fakes; the other tries to spot them. The competition drives both to improve.
How it works:
In practice, GAN training is notoriously unstable — the two networks can oscillate rather than converge, and the generator can collapse to producing only a few types of output. Despite this, GANs have produced remarkable results in image generation and are increasingly used in finance.
The seminal paper: Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville & Bengio (2014).
The problem with GANs: They’re powerful but notoriously hard to train. The generator and discriminator can oscillate instead of converging, and the generator often suffers from mode collapse — it learns to produce only a narrow range of outputs rather than the full diversity of the training data. Researchers wanted a generative approach that was more stable and produced higher-quality, more diverse outputs.
The idea: Instead of learning to generate data in one shot (like a GAN), learn to reverse a gradual noising process. Start with real data and slowly add random noise over many steps until it becomes pure static. Then train a neural network to reverse each step — to take slightly noisy data and make it slightly less noisy. To generate new data, start from pure noise and apply the learned denoising steps repeatedly until a clean sample emerges.
How it works:
The key advantage over GANs: there is no adversarial training, so no mode collapse and no training instability. The trade-off is that generation is slower — you need many denoising steps rather than a single forward pass. Diffusion models are the architecture behind DALL-E, Stable Diffusion, and modern image/video generation systems, and have largely displaced GANs as the dominant generative approach.
The seminal paper: Ho, Jain & Abbeel (2020). “Denoising Diffusion Probabilistic Models.”
| Architecture | Problem it solves | How | Seminal paper |
|---|---|---|---|
| CNN | Spatial structure (images) | Local filters + weight sharing | LeCun et al. (1998) |
| Autoencoder | Dimensionality reduction | Compress through bottleneck, reconstruct | Hinton & Salakhutdinov (2006) |
| RNN | Sequential data | Hidden state carries memory through time | Rumelhart, Hinton & Williams (1986) |
| LSTM | Long-range dependencies in sequences | Cell state with learned gates | Hochreiter & Schmidhuber (1997) |
| Transformer | Long-range dependencies + parallelism | Self-attention over entire sequence at once | Vaswani et al. (2017) |
| GAN | Generating synthetic data | Two networks compete (generator vs. discriminator) | Goodfellow et al. (2014) |
| Diffusion | Stable, high-quality generation | Learn to reverse a gradual noising process | Ho, Jain & Abbeel (2020) |
Each of these uses the same building blocks from earlier in this lecture — neurons, layers, activations, backpropagation, gradient descent. The difference is how the layers are connected and what structure in the data they’re designed to exploit.
Every model we’ve seen today uses data to train: we have \((x_i, y_i)\) pairs and minimize a loss like MSE. But what if we don’t have data — we have an equation we know the solution must satisfy?
This idea originated in physics (Raissi, Perdikaris & Karniadakis, 2019), where it’s called Physics-Informed Neural Networks (PINNs). My dissertation applied it to finance — Finance-Informed Neural Networks (FINNs) — in two settings.
Model 60 overlapping generations of households (ages 20–79)–3 types: 0- ,low-, high-LTI–each choosing how much to consume vs. save. A household of age \(i\) maximizes lifetime utility — the discounted sum of utility from consumption over its remaining life:
\[\max \quad \mathbb{E}_t \sum_{s=0}^{59} \beta^s \; u(c_{i,t+i})\]
subject to a budget constraint each period:
\[c_{i,t} + k_{i+1,t+1} = (1-\tau_t)\,w_t\,\ell_i + (1 + r_t)\,k_{i,t} - a_{i,t}\]
where \(c_{i,t}\) is consumption, \(k_{i+1,t+1}\) is savings carried into next period, \(w_t\,\ell_i\) is after-tax labour income, \(r_t\) is the return on savings, and \(a_{i,t}\) is the student loan payment. This budget constraint is what links all the households together — everyone’s savings become the economy’s capital, which determines next period’s wages and returns for everyone else.
When you solve this optimization (take the first-order condition), you get a condition that must hold at every age: the marginal benefit of consuming one more dollar today must equal the expected marginal benefit of saving that dollar and consuming it tomorrow:
\[u'(c_{i,t}) = \beta \; \mathbb{E}_t\!\left[u'(c_{i+1,t+1}) \cdot (1 + r_{t+1})\right]\]
where \(u'(\cdot)\) is marginal utility, \(\beta\) is the discount factor (patience), and \(r_{t+1}\) is the return on savings. In words: if this condition is violated, the household can do better by shifting consumption between today and tomorrow.
That’s 59 \(\times\) 3 of these conditions (one per age per type) that must all hold simultaneously. The state space — tracking everyone’s wealth — is 178-dimensional. Traditional methods that use grids are hopeless here: a grid with just 10 points per dimension would have \(10^{178}\) grid points.
Instead, a neural network \(f_\theta\) learns the optimal saving decision for all 60 age groups at once, trained by penalizing violations of these optimality conditions. To avoid grids entirely, I use the neural network’s own policy to simulate the economy forward in time — generating a time series of states the economy actually visits — and evaluate the optimality conditions only at those points. No grid, no curse of dimensionality.
A caplet is an option on a future interest rate — it pays off if the rate exceeds a strike \(L_E\). From RSM332, the price of any derivative is the expected discounted payoff under no-arbitrage:
\[V(t) = \mathbb{E}\!\left[\exp\!\left(-\int_t^T r(s)\,ds\right) \cdot \text{payoff at } T\right]\]
where \(r(s)\) is the short-term interest rate along the path. The standard approach is Monte Carlo simulation: simulate thousands of random interest rate paths, compute the payoff on each, discount back, and average. This is slow — seconds per contract — and you need to re-simulate for every change in contract terms.
Instead, there is an equivalent way to express exactly the same price, not as an expectation over random paths, but as a deterministic equation that the price function \(V\) must satisfy:
\[-\frac{\partial V}{\partial \tau} + \mu' \nabla_f V + \frac{1}{2} \sum_{n=1}^{N} \sigma_n' \nabla_f^2 V \; \sigma_n - r \cdot V = 0\]
where \(\tau\) is time to maturity, \(f\) is the current forward rate curve, and \(\mu, \sigma\) come from the term structure model. You don’t need to understand every symbol here — the point is that this is an equation the price must satisfy, and it involves derivatives of \(V\) with respect to its inputs.
A neural network \(V_\theta\) approximates the price. The loss function squares the left-hand side of that equation: if \(V_\theta\) satisfies the equation, the loss is zero. Backpropagation — the same machinery used to train any neural network — computes all the partial derivatives of \(V_\theta\) that appear in the equation. No random simulation at all — the network learns the price by satisfying the equation directly.
The result: once trained, the network prices any contract instantly. Prices are 300,000–4,500,000\(\times\) faster than Monte Carlo simulation, with accuracy to 0.04 cents per dollar.
Deep learning = regression. A neural network is a choice of \(f_\theta\) in the same framework from Lecture 5. The loss functions, regularization, and evaluation methods all carry over.
The architecture. Neurons compute weighted sums plus nonlinear activations. Layers stack neurons together. Feed-forward networks chain layers sequentially. Activation functions (especially ReLU) are what make depth meaningful — without them, multiple layers collapse to a single linear transformation.
Universal approximation. A wide enough single-layer network can approximate any continuous function. But this is an existence theorem, not a practical recipe. Deep networks are more parameter-efficient and learn hierarchical representations.
Training. No closed-form solution — we use gradient descent. Backpropagation efficiently computes gradients by applying the chain rule backwards through the network. SGD with mini-batches makes training practical for large datasets. Adam is the default optimizer.
Regularization. Neural networks are massively overparameterized and prone to overfitting. Early stopping, dropout, weight decay, and batch normalization are the standard toolkit.
Specialized architectures. CNNs exploit spatial structure through local filters and weight sharing. Autoencoders learn compressed representations by reconstructing their own input through a bottleneck. RNNs and LSTMs handle sequential data by passing hidden states through time. Transformers replace recurrence with self-attention, processing entire sequences in parallel. GANs train two networks adversarially to generate synthetic data.
In practice. On tabular financial data, neural networks compete with but rarely dominate tree-based methods. Their real advantage is with large datasets and non-tabular data (images, text, sequences).
Neural networks unlock a new kind of data for finance: text.
Earnings calls, news articles, analyst reports, SEC filings, social media — vast amounts of financially relevant information is locked in natural language.
Lecture 10: Text & Natural Language Processing