COMPLETE MASTER GUIDE

Linear Regression

From historical origins to mathematical mastery. A beginner-friendly, visually interactive deep dive into the most fundamental predictive algorithm in statistics and machine learning.

Contents
  • Historical Context
  • Intuition
  • Math Formulation
  • Cost Function (MSE)
  • Normal Equation
  • Gradient Descent
  • Interactive Lab
  • Pros & Cons
  • Beginner to Advanced
01

Historical Context

The remarkable origin story of regression and the method of least squares.

The Birth of Regression

The word "regression" itself comes from Sir Francis Galton, a Victorian-era polymath who was studying hereditary traits in the 1880s. He observed that the children of exceptionally tall parents tended to be shorter than their parents, and children of exceptionally short parents tended to be taller. He called this phenomenon "regression toward mediocrity" (later rephrased as "regression to the mean").

In his 1886 paper, Galton plotted the heights of parents against the heights of their adult children and drew the best-fitting line through the data. This was one of the earliest uses of what we now call linear regression. The line he drew was not a 45-degree line of perfect inheritance, but a shallower line - showing that extreme values tend to "regress" toward the average.

The Method of Least Squares

The mathematical foundation for fitting lines to data was actually developed decades before Galton, through one of the most famous priority disputes in the history of mathematics.

In 1805, Adrien-Marie Legendre published the method of least squares in his work on determining the orbits of comets. He proposed minimizing the sum of squared residuals as the criterion for the best-fitting line.

However, Carl Friedrich Gauss claimed he had been using the method since 1795, and in 1809 he published a more rigorous treatment connecting least squares to the normal distribution of errors. This connection - that least squares is the maximum likelihood estimator when errors are normally distributed - gave the method deep theoretical justification.

Why Least Squares Won

Gauss proved that if the errors follow a normal distribution, then the least squares solution is the maximum likelihood estimator - it finds the parameters that make the observed data most probable. This gave least squares a probabilistic foundation that no competing method could match.

Timeline of Linear Regression

1805
Adrien-Marie Legendre publishes the method of least squares for comet orbit prediction
1809
Carl Friedrich Gauss publishes a rigorous treatment connecting least squares to the normal distribution
1886
Francis Galton coins "regression" while studying hereditary heights
1922
Ronald Fisher formalizes maximum likelihood estimation, strengthening the theoretical basis
Today
Linear regression remains the most widely used statistical technique across all sciences
02

Intuition

What does it mean to fit a line to data, and why is it useful?

The Simplest Prediction Problem

Imagine you are a real estate agent. You have data on houses: their sizes (in square feet) and their selling prices. You notice a pattern - bigger houses tend to sell for more. You want to use this pattern to predict the price of a new house based on its size.

Linear regression answers this question by finding the best-fitting straight line through the data points. Once you have this line, you can plug in any house size and get a predicted price.

$$\text{Price} = w \times \text{Size} + b$$

Where $w$ is the slope (price per square foot) and $b$ is the y-intercept (base price).

What Makes a Line "Best"?

There are infinitely many lines you could draw through a scatter plot. The question is: which line is the best one? Linear regression defines "best" as the line that minimizes the total squared vertical distance between the data points and the line.

For each data point, the residual is the vertical distance between the actual value $y_i$ and the predicted value $\hat{y}_i$:

$$e_i = y_i - \hat{y}_i$$
1

Compute Predictions

For each data point, calculate what the line predicts:

$$\hat{y}_i = wx_i + b$$
2

Compute Residuals

Measure the error - how far off the prediction is from the actual value:

$$e_i = y_i - \hat{y}_i$$
3

Square and Sum

Square each residual (so negatives do not cancel positives) and add them all up:

$$\text{Total Error} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
4

Minimize

Find the values of $w$ and $b$ that make this total as small as possible.
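
These four steps can be sketched in a few lines of NumPy (the house sizes and prices below are made up for illustration):

```python
import numpy as np

# Hypothetical data: house sizes (sq ft) and prices ($1000s)
x = np.array([1000, 1500, 2000, 2500, 3000])
y = np.array([200, 280, 370, 450, 540])

def total_squared_error(w, b):
    y_hat = w * x + b              # Step 1: predictions
    residuals = y - y_hat          # Step 2: residuals
    return np.sum(residuals ** 2)  # Step 3: square and sum

# Step 4 is the search for the (w, b) that minimizes this total;
# here we just compare two candidate lines by hand
assert total_squared_error(0.17, 20) < total_squared_error(0.10, 50)
```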

Why Squared Errors?

You might ask: why not just sum the absolute values of the residuals? There are several mathematical reasons why squaring is preferred:

Differentiable

The squared function is smooth everywhere, enabling calculus-based optimization

Penalizes Large Errors

Squaring amplifies large residuals, strongly discouraging outlier-level errors

Unique Solution

The sum of squared errors is strictly convex, guaranteeing a single global minimum

Sensitivity to Outliers

The flip side of penalizing large errors is that outliers have a disproportionate influence on the regression line. A single extreme data point can significantly shift the best-fit line. This is why checking for outliers is crucial before applying linear regression.

03

Mathematical Formulation

The complete mathematical framework for linear regression.

1. The Linear Model

The fundamental assumption of linear regression is that the target variable $y$ is a linear function of the input features $x$, plus some random noise:

$$y = w^Tx + b + \epsilon$$

Where:

  • $w = [w_1, w_2, \ldots, w_d]^T$ is the weight vector (one weight per feature)
  • $x = [x_1, x_2, \ldots, x_d]^T$ is the feature vector
  • $b$ is the bias (intercept) term
  • $\epsilon$ is the noise term, assumed to be normally distributed: $\epsilon \sim \mathcal{N}(0, \sigma^2)$

2. The Hypothesis Function

Given a set of learned parameters, the model predicts:

$$\hat{y} = h_{w,b}(x) = w^Tx + b = w_1x_1 + w_2x_2 + \cdots + w_dx_d + b$$

This is also called the hypothesis function. The hat notation $\hat{y}$ denotes a predicted value, as opposed to the true observed value $y$.

3. Matrix Notation

For a dataset with $n$ samples and $d$ features, we can express all predictions simultaneously using matrix notation. We augment the feature matrix with a column of ones to absorb the bias term:

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}, \quad w = \begin{bmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}$$

Then the predictions for all samples become a single matrix multiplication:

$$\hat{y} = Xw$$

Why Matrix Notation Matters

Matrix notation is not just a shorthand - it enables efficient computation using optimized linear algebra libraries (BLAS, LAPACK). A single matrix multiply can compute millions of predictions simultaneously, which is why linear regression scales beautifully to large datasets.
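
As a small sketch (with made-up numbers), here is the augmentation trick and the single matrix multiply that produces every prediction at once:

```python
import numpy as np

# Three samples, two features (hypothetical values)
X_raw = np.array([[2.0, 1.0],
                  [1.0, 3.0],
                  [4.0, 2.0]])

# Augment with a column of ones so the bias is absorbed into w
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.array([0.5, 2.0, -1.0])  # [b, w1, w2]

y_hat = X @ w  # one matrix multiply gives every prediction
# Row 0 by hand: 0.5 + 2*2.0 - 1*1.0 = 3.5
assert np.allclose(y_hat, [3.5, -0.5, 6.5])
```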

4. The Residual Vector

The vector of all residuals (errors) is:

$$e = y - \hat{y} = y - Xw$$

Each component $e_i = y_i - \hat{y}_i$ measures how far the prediction is from the actual value for sample $i$. Our goal is to make this residual vector as small as possible.

04

Cost Function (MSE)

Deriving and understanding the Mean Squared Error loss function.

The Mean Squared Error

The cost function measures how well our model fits the data. For linear regression, we use the Mean Squared Error (MSE):

$$J(w,b) = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

The factor of $\frac{1}{2}$ is a mathematical convenience - it cancels with the factor of 2 produced by differentiating the square, making the gradient formulas cleaner. The factor of $\frac{1}{n}$ normalizes by the number of samples so the cost does not grow with dataset size.

MSE in Matrix Form

Using matrix notation, the MSE becomes remarkably elegant:

$$J(w) = \frac{1}{2n}(y - Xw)^T(y - Xw) = \frac{1}{2n}\|y - Xw\|^2$$

This is the squared Euclidean norm of the residual vector, divided by $2n$. Geometrically, it measures the squared length of the error vector in $n$-dimensional space.
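
The equivalence of the element-wise and matrix forms is easy to check numerically on random data (a sketch, not a full implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)
w = rng.normal(size=d + 1)

# Element-wise definition of the cost
J_sum = np.sum((y - X @ w) ** 2) / (2 * n)

# Matrix form: squared norm of the residual vector
residual = y - X @ w
J_matrix = (residual @ residual) / (2 * n)

assert np.isclose(J_sum, J_matrix)
```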

Why MSE is Convex

A function is convex if its second derivative is non-negative (or, in multiple dimensions, if its Hessian matrix is positive semi-definite). Let us verify this for MSE:

Convexity Proof

Step 1: Expand the cost function:
$$J(w) = \frac{1}{2n}(y^Ty - 2w^TX^Ty + w^TX^TXw)$$
Step 2: Compute the gradient (first derivative):
$$\nabla_w J = \frac{1}{n}(X^TXw - X^Ty)$$
Step 3: Compute the Hessian (second derivative):
$$H = \frac{1}{n}X^TX$$
Conclusion: For any non-zero vector $v$:
$$v^THv = \frac{1}{n}v^TX^TXv = \frac{1}{n}\|Xv\|^2 \geq 0$$

Since the Hessian is positive semi-definite, $J(w)$ is convex. If $X$ has full column rank, then $\|Xv\|^2 > 0$ for all $v \neq 0$, making the Hessian positive definite and $J(w)$ strictly convex with a unique global minimum.

Interpreting MSE

Understanding the Loss Values

When $y = \hat{y}$ (perfect prediction):
$$J = \frac{1}{2n}\sum(0)^2 = 0$$

The loss is zero - the model perfectly fits all data points.

When errors exist:
$$J > 0$$

The loss is always positive. Larger errors are penalized quadratically - an error of 2 contributes 4 times as much as an error of 1.

MSE vs RMSE

MSE is in squared units of the target variable, making it hard to interpret directly. The Root Mean Squared Error (RMSE) $= \sqrt{\text{MSE}}$ is in the same units as the target, making it more interpretable. For example, if you predict house prices in dollars, RMSE tells you the typical error in dollars.

05

Closed-Form Solution (Normal Equation)

Deriving the exact analytical solution using calculus and linear algebra.

The Normal Equation

Unlike logistic regression, linear regression has a beautiful closed-form solution. We can find the optimal weights directly by setting the gradient of the cost function to zero and solving for $w$.

Full Derivation

Step 1: Start with the cost function in matrix form:
$$J(w) = \frac{1}{2n}(y - Xw)^T(y - Xw)$$
Step 2: Expand:
$$J(w) = \frac{1}{2n}(y^Ty - 2w^TX^Ty + w^TX^TXw)$$
Step 3: Take the gradient with respect to $w$ and set it to zero:
$$\nabla_w J = \frac{1}{n}(-X^Ty + X^TXw) = 0$$
Step 4: Solve for $w$:
$$X^TXw = X^Ty$$
Final Result:
$$w^* = (X^TX)^{-1}X^Ty$$

This is the Normal Equation (also called the Ordinary Least Squares, or OLS, solution). The name "normal" comes from the fact that the residual vector $y - Xw^*$ is orthogonal (normal) to the column space of $X$.

The Hat Matrix

The predicted values can be written as:

$$\hat{y} = Xw^* = X(X^TX)^{-1}X^Ty = Hy$$

Where $H = X(X^TX)^{-1}X^T$ is called the hat matrix (because it puts the "hat" on $y$). The hat matrix is an orthogonal projection matrix that projects $y$ onto the column space of $X$.

Geometric Interpretation

Linear regression finds the point in the column space of $X$ that is closest to the vector $y$. The residual vector $y - \hat{y}$ is perpendicular to every column of $X$, which is exactly the condition $X^T(y - Xw) = 0$, i.e., the normal equation.
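
Both the normal equation and the orthogonality condition can be verified on synthetic data. The sketch below uses np.linalg.solve on $X^TXw = X^Ty$, which is numerically preferable to forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Normal equation: solve X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The residual vector is orthogonal to every column of X
residual = y - X @ w_star
assert np.allclose(X.T @ residual, 0, atol=1e-8)

# With mild noise, the recovered weights are close to the true ones
assert np.allclose(w_star, true_w, atol=0.1)
```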

Normal Equation vs Gradient Descent

Normal Equation

Exact solution in one step

No iterations, no hyperparameters to tune

No learning rate needed

No risk of divergence or slow convergence

Complexity: $O(d^3)$

Matrix inversion is cubic in the number of features

Use when: Number of features $d < 10{,}000$

Gradient Descent

Iterative approximation

Converges to the optimal solution over many steps

Requires learning rate tuning

Too large diverges; too small is very slow

Complexity: $O(nd)$ per iteration

Scales linearly with both features and samples

Use when: Number of features $d > 10{,}000$ or data is too large to fit in memory

When the Normal Equation Fails

The normal equation requires $X^TX$ to be invertible. This fails when: (1) features are linearly dependent (multicollinearity), (2) there are more features than samples ($d > n$). In these cases, use regularization (Ridge/Lasso) or the pseudo-inverse $X^+$.

06

Gradient Descent

Step-by-step gradient derivation and the iterative optimization approach.

Gradient Derivation

To use gradient descent, we need the gradient of the MSE cost function with respect to the parameters. Let us derive it step by step.

Step-by-Step Derivation

Step 1: Start with the cost for one sample:
$$J_i = \frac{1}{2}(y_i - w^Tx_i - b)^2$$
Step 2: Apply the chain rule for the weight gradient:
$$\frac{\partial J_i}{\partial w} = -(y_i - w^Tx_i - b) \cdot x_i = -(y_i - \hat{y}_i) \cdot x_i$$
Step 3: Apply the chain rule for the bias gradient:
$$\frac{\partial J_i}{\partial b} = -(y_i - w^Tx_i - b) = -(y_i - \hat{y}_i)$$
Step 4: Average over all $n$ samples:
$$\frac{\partial J}{\partial w} = -\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)x_i = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)x_i$$
Matrix Form:
$$\frac{\partial J}{\partial w} = \frac{1}{n}X^T(Xw - y)$$

Interpreting the Gradient

The gradient $\frac{1}{n}X^T(Xw - y)$ has a beautiful interpretation: each component measures how strongly the corresponding feature co-varies with the prediction errors $(Xw - y)$. When a feature is still correlated with the errors, its weight needs adjustment. When the gradient is zero, the errors are uncorrelated with every feature - the model has extracted all the linear signal.

The Update Rule

Gradient descent iteratively updates the weights by taking steps in the direction opposite to the gradient:

$$w := w - \eta \cdot \frac{1}{n}X^T(Xw - y)$$
$$b := b - \eta \cdot \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$

Where $\eta$ is the learning rate, a hyperparameter that controls the step size.
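
Putting the gradients and update rule together, a minimal batch gradient descent loop for simple linear regression might look like this (synthetic data; the learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=n)  # true w=3, b=2

w, b = 0.0, 0.0
eta = 0.01  # learning rate

for _ in range(5000):
    y_hat = w * x + b
    grad_w = np.mean((y_hat - y) * x)  # dJ/dw
    grad_b = np.mean(y_hat - y)        # dJ/db
    w -= eta * grad_w
    b -= eta * grad_b

# The loop converges close to the true parameters
assert abs(w - 3.0) < 0.1 and abs(b - 2.0) < 0.2
```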

Learning Rate Effects

Gradient Descent Animation

Watch gradient descent optimize the loss. Adjust the learning rate to see how it affects convergence.

Too Small (0.001)

Very slow convergence. May take thousands of iterations to reach the minimum.

Just Right (0.01 - 0.1)

Smooth, efficient convergence. The sweet spot for most problems.

Too Large (1.0+)

Overshoots the minimum, may oscillate or diverge entirely.

Variants of Gradient Descent

Batch Gradient Descent

Uses the entire dataset for each update. Stable but slow for large datasets.

Stochastic GD (SGD)

Uses one random sample per update. Fast but noisy. Can escape shallow local minima.

Mini-Batch GD

Uses a small batch (32-256 samples) per update. Best of both worlds - fast and stable.
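
A mini-batch variant is a small change to the batch loop: shuffle each epoch, then update on slices (sketch with synthetic data; the batch size of 32 is just a common default):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(0, 5, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=n)  # true w=2, b=1

w, b, eta, batch = 0.0, 0.0, 0.05, 32

for epoch in range(200):
    idx = rng.permutation(n)  # shuffle each epoch
    for start in range(0, n, batch):
        j = idx[start:start + batch]
        err = (w * x[j] + b) - y[j]
        w -= eta * np.mean(err * x[j])  # gradient on this batch only
        b -= eta * np.mean(err)

# Final fit is near the noise floor (true noise variance is 0.09)
assert np.mean((w * x + b - y) ** 2) < 0.2
```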

07

Interactive Visualization Lab

Hands-on interactive components to build deep intuition.

Scatter Plot Playground

Adjust the slope and intercept sliders to manually fit a line through the data. Watch the MSE change in real-time as you move the line. Try to find the values that minimize the error.


3D Loss Surface

Visualize the MSE as a function of the slope $w$ and intercept $b$. The bowl shape confirms convexity - there is only one minimum. Drag to rotate.

Animated Training

Watch how gradient descent iteratively adjusts the regression line to fit the data. The loss curve on the right shows convergence.

08

Advantages & Disadvantages

When to use linear regression and when to consider alternatives.

Advantages

Simple and Interpretable

Each coefficient directly tells you how much the target changes for a one-unit change in that feature, holding all other features constant. This interpretability is invaluable in fields like medicine and economics.

Closed-Form Solution

Unlike most machine learning models, linear regression has an exact analytical solution. No iterations, no hyperparameter tuning for the optimization step.

Computationally Efficient

Training is extremely fast - a single matrix multiplication solves the problem. Prediction is a simple dot product. This makes it practical for real-time applications and embedded systems.

Statistical Guarantees

Under the OLS assumptions, linear regression provides the Best Linear Unbiased Estimator (BLUE) by the Gauss-Markov theorem. It comes with confidence intervals, hypothesis tests, and p-values.

Disadvantages

Assumes Linear Relationship

If the true relationship between features and target is nonlinear (quadratic, exponential, etc.), linear regression will systematically underfit the data.

Sensitive to Outliers

Because the loss function squares the errors, a single extreme outlier can dramatically shift the regression line and corrupt the learned coefficients.

Multicollinearity Issues

When features are highly correlated, the coefficient estimates become unstable and have high variance. The matrix $X^TX$ becomes nearly singular, making inversion numerically unreliable.

Cannot Model Interactions Automatically

Without manual feature engineering, linear regression cannot capture interaction effects or nonlinear patterns. You must explicitly create polynomial or interaction terms.

09

From Beginner to Advanced

Extending linear regression to more powerful frameworks.

Polynomial Regression

What if the relationship between the feature and target is curved? We can extend linear regression by adding polynomial features:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \cdots + w_p x^p$$

This is still "linear" regression because the model is linear in the parameters $w$, even though it is nonlinear in the input $x$. We simply create new features $x^2, x^3, \ldots$ and apply ordinary linear regression.
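
A sketch of this trick: np.vander builds the columns $1, x, x^2, x^3$, and ordinary least squares does the rest (synthetic cubic data):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=100)

degree = 3
# Columns [1, x, x^2, x^3]: new features, same linear machinery
X = np.vander(x, degree + 1, increasing=True)
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficients recover [1.0, -2.0, 0.0, 0.5] up to noise
assert np.allclose(w, [1.0, -2.0, 0.0, 0.5], atol=0.1)
```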

Overfitting Risk

High-degree polynomials can fit the training data perfectly but generalize poorly to new data. A degree-20 polynomial can pass through every training point but oscillate wildly between them. Always use cross-validation to select the polynomial degree.

Regularization

To prevent overfitting and improve generalization, we add a penalty term to the cost function that discourages large weights:

Ridge Regression (L2)

$$J_{\text{Ridge}} = \frac{1}{2n}\|y - Xw\|^2 + \frac{\lambda}{2}\|w\|^2$$

Shrinks all weights toward zero but never exactly to zero. The closed-form solution becomes:

$$w^* = (X^TX + \lambda I)^{-1}X^Ty$$

Adding $\lambda I$ to $X^TX$ guarantees invertibility, solving the multicollinearity problem.
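
A sketch of the ridge solution on two nearly identical features, where plain OLS would be numerically unreliable (bias term omitted for brevity; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)  # nearly duplicated feature
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

lam = 1.0
d = X.shape[1]
# Ridge: adding lam*I keeps X^T X + lam*I well-conditioned here
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The two correlated features share the weight roughly equally
assert np.allclose(w_ridge[0], w_ridge[1], atol=0.01)
```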

Lasso Regression (L1)

$$J_{\text{Lasso}} = \frac{1}{2n}\|y - Xw\|^2 + \lambda\|w\|_1$$

Can drive weights exactly to zero, performing automatic feature selection. No closed-form solution - must use iterative methods (coordinate descent).

Elastic Net

Elastic Net combines L1 and L2 regularization, getting the best of both worlds:

$$J_{\text{ElasticNet}} = \frac{1}{2n}\|y - Xw\|^2 + \lambda_1\|w\|_1 + \frac{\lambda_2}{2}\|w\|^2$$

It performs feature selection (like Lasso) while handling correlated features gracefully (like Ridge). The mixing ratio $\alpha = \frac{\lambda_1}{\lambda_1 + \lambda_2}$ controls the balance between L1 and L2.

Multiple Linear Regression

When we have multiple input features, the model extends naturally:

$$\hat{y} = w_1x_1 + w_2x_2 + \cdots + w_dx_d + b$$

The geometric interpretation changes: instead of fitting a line in 2D, we are fitting a hyperplane in $(d+1)$-dimensional space. All the same mathematics (normal equation, gradient descent, MSE) apply unchanged.

R-Squared ($R^2$) - Coefficient of Determination

How do we measure how well our model fits the data? The $R^2$ metric quantifies the proportion of variance in the target that is explained by the model:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

Where $SS_{\text{res}}$ is the residual sum of squares and $SS_{\text{tot}}$ is the total sum of squares around the mean $\bar{y}$.

$$R^2 = 1$$

Perfect fit - the model explains all variance

$$R^2 = 0$$

Model no better than predicting the mean

$$R^2 < 0$$

Model is worse than predicting the mean (impossible on training data for OLS with an intercept, but possible on held-out data when the model generalizes poorly)

$$R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-d-1}$$

Adjusted $R^2$ penalizes for adding irrelevant features
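
The $R^2$ and adjusted $R^2$ formulas translate directly into code (a minimal sketch):

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)            # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)       # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, d):
    # Penalizes R^2 for the number of features d given n samples
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)

y = np.array([3.0, 5.0, 7.0, 9.0])
assert r_squared(y, y) == 1.0                       # perfect fit
assert r_squared(y, np.full(4, np.mean(y))) == 0.0  # predicting the mean
assert adjusted_r_squared(0.9, n=100, d=5) < 0.9    # feature penalty
```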

Beware of High $R^2$

A high $R^2$ does not necessarily mean the model is good. Adding more features always increases $R^2$ (even random noise). Always use Adjusted $R^2$ or cross-validation to evaluate model quality honestly.

Connection to Machine Learning

Linear regression is the foundation upon which much of machine learning is built:

  • Logistic Regression wraps a sigmoid around linear regression for classification
  • Neural Networks are compositions of linear transformations and nonlinear activations
  • Support Vector Machines find the optimal separating hyperplane, a concept rooted in linear models
  • Generalized Linear Models (GLMs) extend linear regression to non-normal distributions via link functions

Mastering linear regression gives you the conceptual toolkit for understanding nearly every supervised learning algorithm.
