
Week 3: Introduction to Machine Learning | January 21–22, 2026
Rotman School of Management
Last week, we studied the statistical properties of financial returns—how they’re distributed, why the normality assumption fails, and why prediction is hard. Today we step back to understand the broader framework: Machine Learning.
Traditional programming: You write explicit rules for the computer to follow.
Example: Building a spam filter the traditional way:
IF email contains "Nigerian prince" THEN spam
IF email contains "free money" THEN spam
IF sender is in contacts THEN not spam
IF email contains "urgent wire transfer" THEN spam
...
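In code, such a rule-based filter might look like the following minimal sketch; the function name, phrase list, and contact check are purely illustrative:

```python
# A hand-written, rule-based spam filter: every rule is explicit.
def is_spam(email_text, sender, contacts):
    if sender in contacts:
        return False  # rule: known senders are never spam
    spam_phrases = ["nigerian prince", "free money", "urgent wire transfer"]
    return any(phrase in email_text.lower() for phrase in spam_phrases)

print(is_spam("Claim your FREE MONEY now!", "unknown@example.com", contacts=set()))  # True
```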
Problems with this approach: the list of rules never ends, spammers adapt faster than you can update the rules, and many patterns are impossible to spell out explicitly.
Question: How would you write rules to recognize a cat in a photo? Or predict tomorrow’s stock return?
Machine learning: Instead of writing rules, you show the computer examples and let it learn the patterns.
The same spam filter, ML approach:
For many problems, it’s easier to collect examples than to write rules.
Machine Learning = building models that learn patterns directly from data, rather than being explicitly programmed.

Traditional programming: Human writes rules, computer applies them to data.
Machine learning: Human provides data and desired outputs, computer learns the rules.
Use ML when:
Finance examples where ML excels:
Week 2 set up the problem ML tries to solve:
Machine learning is a systematic framework for tackling that problem, and it comes in three main types:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
This course focuses on supervised and unsupervised learning. We won’t cover reinforcement learning.
Think of ML methods as tools in a toolbox.
Just as an experienced contractor knows which tool is right for each repair—hammer for nails, wrench for bolts, saw for cutting—you’ll learn which ML method is right for each problem.
The tools we’ll cover:
| Tool | What it does | When to use it |
|---|---|---|
| Linear regression | Predict a number from features | Simple relationships, interpretability matters |
| Regularized regression | Prevent overfitting | Many features, small samples |
| Logistic regression | Predict probabilities/classes | Binary outcomes (default/no default) |
| Decision trees | Capture nonlinear patterns | Complex interactions between features |
| Clustering | Group similar observations | No labels, want to find structure |
The goal of this course: Build your intuition so you recognize which tool fits which problem—and understand why.
The prediction problem:
Given input features \(\mathbf{x}\) (what we observe), predict an output \(y\) (what we want to know).
Notation: \(\mathbf{x} \in \mathbb{R}^p\) is a vector of \(p\) features, \(y \in \mathcal{Y}\) is the target, and we observe \(N\) labeled pairs \((\mathbf{x}_i, y_i)\), \(i = 1, \dots, N\).
The goal: Learn a function \(f: \mathbb{R}^p \to \mathcal{Y}\) such that \(f(\mathbf{x}) \approx y\).
Two main types of supervised learning:
| Type | Target \(y\) | Example |
|---|---|---|
| Regression | Continuous (real-valued) | Predict stock return, house price |
| Classification | Categorical (discrete) | Predict spam/not spam, buy/sell/hold |

Regression: The target \(y\) is a continuous number. We want to minimize how far off our predictions are.
Classification: The target \(y\) is a category (class). We want to predict the correct class as often as possible.
Linear regression assumes the relationship between features and target is linear:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon\]
In matrix form, for \(N\) observations:
\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
where \(\mathbf{y} \in \mathbb{R}^N\), \(\mathbf{X} \in \mathbb{R}^{N \times (p+1)}\), and \(\boldsymbol{\beta} \in \mathbb{R}^{p+1}\).
The OLS solution: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\)
This is the “learning algorithm” for linear regression—it finds the \(\boldsymbol{\beta}\) that minimizes squared error.
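To make the formula concrete, here is a minimal numpy sketch that computes \(\hat{\boldsymbol{\beta}}\) directly from simulated data; the data and coefficients below are made up for illustration:

```python
import numpy as np

# Simulate N = 200 observations with an intercept and two features
np.random.seed(0)
N = 200
X = np.column_stack([np.ones(N), np.random.randn(N, 2)])   # design matrix
true_beta = np.array([0.5, 1.2, -0.7])
y = X @ true_beta + 0.3 * np.random.randn(N)

# OLS: beta_hat = (X'X)^{-1} X'y  (solve is more stable than an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to [0.5, 1.2, -0.7]
```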
The ML perspective: We’re trying to approximate some unknown function \(f\):
\[y = f(\mathbf{x}) + \varepsilon\]
This function \(f\) could be linear: \(f(\mathbf{x}) = \mathbf{X}\boldsymbol{\beta}\). But it might not be.
We don’t know what \(f\) is. That’s what “learning” means—finding a good approximation \(\hat{f}\) from data, whether that turns out to be linear or not.
Different ML methods = different assumptions about \(f\):
| Method | Assumption about \(f\) |
|---|---|
| Linear regression | \(f\) is linear |
| Polynomial regression | \(f\) is a polynomial |
| Decision trees | \(f\) is piecewise constant |
| Deep neural networks | \(f\) is a composition of simple nonlinear functions (which can approximate any continuous function!) |
The tradeoff: More flexible models can fit complex patterns, but they risk overfitting (fitting noise instead of signal) and are more expensive to train.
Credit scoring:
Fraud detection:
Trading signals:
Ordinal classification (a hybrid):
The structure-discovery problem:
Given only input features \(\mathbf{x}\)—no labels—find interesting patterns or structure.
Notation:
Key difference from supervised learning:
In supervised learning, there’s a target variable \(y\) we’re trying to predict. In unsupervised learning, there’s no \(y\)—we’re just trying to understand the structure of \(\mathbf{X}\) itself. Which observations are similar? Are there natural groupings? What are the main patterns?
Main unsupervised tasks:
| Task | Goal | Example |
|---|---|---|
| Clustering | Group similar observations | Group stocks by return patterns |
| Dimensionality reduction | Find low-dimensional representation | Reduce 100 features to 5 factors |
| Density estimation | Estimate the data distribution | Model the joint distribution of returns |
| Anomaly detection | Find unusual observations | Detect outlier transactions |
Clustering stocks:
Factor models / PCA:
Anomaly detection:
We’ll study clustering (K-means) in detail in Week 4.
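As a preview, here is a minimal scikit-learn sketch of K-means applied to a synthetic "stocks × daily returns" matrix; the data is made up purely for illustration, and the details come in Week 4:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic example: 20 "stocks", each described by 60 days of returns,
# drawn from two groups with different average returns
np.random.seed(0)
returns = np.vstack([
     0.01 + 0.02 * np.random.randn(10, 60),
    -0.01 + 0.02 * np.random.randn(10, 60),
])

# No labels are provided: K-means groups the stocks by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(returns)
print(labels)   # cluster assignment for each stock
```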
The sequential decision problem:
An agent interacts with an environment over time, receiving rewards or penalties for its actions.
The setup:
Finance applications:
Why it’s different:
We won’t cover RL in depth, but it’s an active research area in quantitative finance.
Every ML algorithm has three components: a model (the form of \(f\)), a loss function (what counts as a good prediction), and a learning algorithm (how to find the best parameters).
Example: Linear regression combines a linear model, squared-error loss, and the OLS formula (or gradient descent) as its learning algorithm.
The ML framework gives us a systematic way to think about prediction problems.
The model is the function \(f(\mathbf{x})\) we’re trying to learn.
We have to decide what form \(f\) takes. This is a choice we make:
| Model type | Form | Parameters to learn |
|---|---|---|
| Linear regression | \(f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\) | \(\boldsymbol{\beta}\) |
| Polynomial | \(f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots\) | Coefficients |
| Decision tree | Piecewise constant regions | Split points, leaf values |
| Neural network | Compositions of nonlinear functions | Weights and biases |
The tradeoff:
Learning = finding the parameters. Once we choose a model form (e.g., linear), the learning algorithm finds the specific parameter values (e.g., \(\hat{\boldsymbol{\beta}}\)) that best fit the data.
The loss function measures how bad a prediction is.
Notation: \(y\) is the true value and \(\hat{y} = f(\mathbf{x})\) is the model's prediction.
Common loss functions for regression:
| Name | Formula | Properties |
|---|---|---|
| Squared error | \(L(y, \hat{y}) = (y - \hat{y})^2\) | Penalizes large errors heavily |
| Absolute error | \(L(y, \hat{y}) = \lvert y - \hat{y} \rvert\) | More robust to outliers |
Common loss functions for classification:
| Name | Formula | Properties |
|---|---|---|
| 0-1 loss | \(L(y, \hat{y}) = \mathbf{1}[y \neq \hat{y}]\) | 1 if wrong, 0 if correct |
| Cross-entropy | \(L(y, p) = -y \log p - (1-y)\log(1-p)\) | For probabilistic predictions |
The loss function defines what “good prediction” means.
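A small numpy sketch that evaluates these loss functions on made-up predictions; averaging them over the sample gives the empirical risk defined next:

```python
import numpy as np

# Regression: actual vs. predicted returns (numbers are made up)
y_true = np.array([0.02, -0.01, 0.03, 0.00])
y_pred = np.array([0.01,  0.01, 0.02, -0.01])
squared_error = (y_true - y_pred) ** 2        # penalizes large errors heavily
absolute_error = np.abs(y_true - y_pred)      # more robust to outliers
print(squared_error.mean(), absolute_error.mean())   # average loss over the sample

# Classification: true labels vs. predicted probabilities
y_cls = np.array([1, 0, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.4])
zero_one = (y_cls != (p_hat >= 0.5)).astype(float)    # 1 if wrong, 0 if correct
cross_entropy = -(y_cls * np.log(p_hat) + (1 - y_cls) * np.log(1 - p_hat))
print(zero_one.mean(), cross_entropy.mean())
```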
We want to minimize average loss across the training data.
Empirical risk (training error):
\[\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_{\boldsymbol{\theta}}(\mathbf{x}_i))\]
For squared error loss:
\[\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i))^2 = \text{MSE}\]
The learning problem becomes an optimization problem:
\[\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})\]
Find the parameters \(\boldsymbol{\theta}^*\) that minimize average loss on the training data.
The learning algorithm finds the optimal parameters \(\boldsymbol{\theta}^*\).
For some problems, there’s a closed-form solution:
Linear regression with squared error:
\[\boldsymbol{\beta}^* = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\]
This is the OLS formula from your statistics courses.
For most problems, we use iterative optimization:
Gradient descent: Move parameters in the direction that reduces loss.
\[\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})\]
where \(\eta > 0\) is the learning rate (how large a step to take) and \(\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})\) is the gradient of the loss evaluated at the current parameters.
Gradient descent is the workhorse of modern ML—it’s how neural networks are trained.
We want to minimize error. From calculus, you know that minima occur where the derivative equals zero:
\[\frac{d\mathcal{L}}{d\theta} = 0 \quad \Rightarrow \quad \theta^*\]
Problem: For complex models, we can’t solve this equation analytically.
Solution: Use the derivative to guide us toward the minimum.
The gradient \(\nabla \mathcal{L}\) points in the direction of steepest increase. So the negative gradient points toward steepest decrease.
\[\underbrace{-\nabla_{\boldsymbol{\theta}} \mathcal{L}}_{\text{direction of steepest decrease}}\]
Gradient descent: Take small steps in the direction of steepest decrease until we reach a point where \(\nabla \mathcal{L} \approx 0\) (a minimum).
This is like walking downhill in fog—you can’t see the bottom, but you can feel which direction is steepest and step that way.
Gradient descent iteratively moves “downhill” on the loss surface until it reaches a minimum.
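Here is a minimal sketch of gradient descent for linear regression with squared-error loss on simulated data; because a closed-form OLS solution also exists for this problem, you can verify that the iterations converge to the same answer:

```python
import numpy as np

# Simulated data: y = 0.5 + 1.2 * x + noise
np.random.seed(0)
N = 200
x = np.random.randn(N)
y = 0.5 + 1.2 * x + 0.3 * np.random.randn(N)
X = np.column_stack([np.ones(N), x])       # design matrix with intercept

def grad(beta):
    """Gradient of the MSE loss: (2/N) * X'(X beta - y)."""
    return 2.0 / N * X.T @ (X @ beta - y)

beta = np.zeros(2)   # arbitrary starting point
eta = 0.1            # learning rate (step size)
for t in range(500):
    beta = beta - eta * grad(beta)         # step in the direction of steepest decrease

print(beta)                                # gradient descent answer
print(np.linalg.solve(X.T @ X, X.T @ y))   # closed-form OLS answer (should match)
```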
Step 1: Choose a model
Step 2: Choose a loss function
Step 3: Fit the model (run the learning algorithm)
Step 4: Evaluate on new data
You don’t need to be a programmer to use ML. Most of the hard work is already done—you just need to know which tools to use.
The main packages:
| Package | What it does | You’ll use it to… |
|---|---|---|
| numpy | Fast math on arrays | Store data, do matrix operations |
| pandas | Data tables (like Excel) | Load CSV files, clean data, compute returns |
| matplotlib | Plotting | Visualize results |
| scikit-learn | ML algorithms | Fit models, make predictions, evaluate |
All of these are pre-installed in most Python environments (Anaconda, Google Colab, etc.).
The algorithms behind ML are genuinely complex—gradient descent, matrix decompositions, optimization routines. A production implementation of random forests is thousands of lines of code.
But Python is a language built on packages: someone else has already implemented, tested, and optimized these algorithms and wrapped them in simple functions.
So our code stays high-level and brief:
1. Load data → pandas.read_csv()
2. Prepare features → pandas/numpy operations
3. Split data → sklearn.model_selection.train_test_split()
4. Fit model → model.fit(X_train, y_train)
5. Make predictions → model.predict(X_test)
6. Evaluate → Compare predictions to truth
Every ML project follows this pattern. The hard work is understanding which model to use and how to interpret results—that’s what this course teaches.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# 1. Create some fake stock data
np.random.seed(42)
data = pd.DataFrame({
    'market_return': np.random.randn(100),
    'stock_return': np.random.randn(100)
})
# Overwrite stock_return so it depends on the market: alpha = 0.5, beta = 1.2, plus noise
data['stock_return'] = 0.5 + 1.2 * data['market_return'] + 0.3 * np.random.randn(100)
# Visualize the raw data
plt.scatter(data['market_return'], data['stock_return'])
plt.xlabel('Market Return')
plt.ylabel('Stock Return')
plt.title('Step 1: Look at your data')
plt.show()
# 2. Split into training and test sets
X = data[['market_return']] # Features (what we observe)
y = data['stock_return'] # Target (what we predict)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 3. Fit model
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Estimated beta: {model.coef_[0]:.2f}")
print(f"Estimated alpha: {model.intercept_:.2f}")
# Visualize the fitted model
plt.scatter(X_train, y_train, label='Training data')
x_line = np.linspace(-3, 3, 100)
plt.plot(x_line, model.intercept_ + model.coef_[0] * x_line, color='red', label='Fitted line')
plt.xlabel('Market Return')
plt.ylabel('Stock Return')
plt.title('Step 2: Fit the model')
plt.legend()
plt.show()

Estimated beta: 1.28
Estimated alpha: 0.54

# 4. Predict on test data and evaluate
predictions = model.predict(X_test)
# Visualize: predicted vs actual
plt.scatter(y_test, predictions)
plt.plot([-2, 3], [-2, 3], 'r--', label='Perfect predictions') # 45-degree line
plt.xlabel('Actual Stock Return')
plt.ylabel('Predicted Stock Return')
plt.title('Step 3: Evaluate predictions')
plt.legend()
plt.show()
Almost every ML model in scikit-learn uses the same interface: create the model object, call fit(X, y) to train it, and call predict(X) to make predictions.
Swapping models is easy:
# Linear regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# Ridge regression (with regularization)
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
# Random forest
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
# The rest of the code stays the same!

Once you learn one interface, you can use dozens of models.
When you see code in this course, don’t panic. Focus on:
1. What data goes in?
2. What model are we using?
3. What comes out?
You don’t need to memorize syntax. You need to understand what the code is doing.
Loading data:
Computing log returns:
Selecting columns:
Train/test split:
These patterns repeat constantly. After a few weeks, they’ll feel natural.
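A minimal sketch of these four patterns; the file name prices.csv and the column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Loading data (hypothetical CSV with 'date', 'price', and 'volume' columns)
data = pd.read_csv('prices.csv', parse_dates=['date'])

# Computing log returns from the price column
data['log_return'] = np.log(data['price'] / data['price'].shift(1))
data = data.dropna()

# Selecting columns: features X and target y
X = data[['volume']]
y = data['log_return']

# Train/test split: hold out 20% of observations for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```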
ML is not magic. It fails when its assumptions are violated.
1. Dependence on historical data
2. The stationarity assumption
3. Regime changes
Reality Check
“All models are wrong, but some are useful.” — George Box
Overfitting: The model learns noise in the training data rather than true patterns.
Symptoms: excellent performance on the training data, much worse performance on new data.

Prevention strategies (covered in later weeks):
Week 2 previewed this: most return predictors fail out-of-sample (Goyal-Welch 2008). Why?
Overfitting: The model learns patterns in the training data that don’t generalize.
The ML terminology:
| Term | Meaning |
|---|---|
| Training error | Performance on data used to fit the model |
| Test error | Performance on new, unseen data |
| Overfitting | Training error << Test error |
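A minimal sketch that makes the table concrete: fit a deliberately over-flexible model (a degree-15 polynomial, chosen purely for illustration) to noisy data whose true relationship is a straight line, and compare training error with test error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy data where the true relationship is a simple line
np.random.seed(0)
x = np.random.randn(60, 1)
y = 0.5 + 1.2 * x[:, 0] + 0.5 * np.random.randn(60)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in [1, 15]:
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(poly.transform(x_train)))
    test_mse = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The flexible model's training error will typically be far smaller than its test error, which is exactly the "Training error << Test error" signature in the table above.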
Much of this course is about avoiding overfitting:
What is Machine Learning?
Types of Learning:
The ML Formalism:
Python for ML:
fit(), predict()
Limitations:
| Week | Topic |
|---|---|
| 4 | Clustering |
| 5 | Regression (linear, ridge, lasso) |
| 6 | ML & Portfolio Theory |
| 7 | Linear Classification |
| 8 | Nonlinear Classification |
| 9 | Ensemble Methods |
| 10 | Neural Networks & Deep Learning |
| 11 | Text & NLP |
Each week adds new tools to your toolbox—and everything builds on the framework we introduced today.