Linear Regression
From historical origins to mathematical mastery. A beginner-friendly, visually interactive deep dive into the most fundamental predictive algorithm in statistics and machine learning.
The remarkable origin story of regression and the method of least squares.
The word "regression" itself comes from Sir Francis Galton, a Victorian-era polymath who was studying hereditary traits in the 1880s. He observed that the children of exceptionally tall parents tended to be shorter than their parents, and children of exceptionally short parents tended to be taller. He called this phenomenon "regression toward mediocrity" (later rephrased as "regression to the mean").
In his 1886 paper, Galton plotted the heights of parents against the heights of their adult children and drew the best-fitting line through the data. This was one of the earliest uses of what we now call linear regression. The line he drew was not a 45-degree line of perfect inheritance, but a shallower line - showing that extreme values tend to "regress" toward the average.
The mathematical foundation for fitting lines to data was actually developed decades before Galton, through one of the most famous priority disputes in the history of mathematics.
In 1805, Adrien-Marie Legendre published the method of least squares in his work on determining the orbits of comets. He proposed minimizing the sum of squared residuals as the criterion for the best-fitting line.
However, Carl Friedrich Gauss claimed he had been using the method since 1795, and in 1809 he published a more rigorous treatment connecting least squares to the normal distribution of errors. This connection - that least squares is the maximum likelihood estimator when errors are normally distributed - gave the method deep theoretical justification.
Gauss proved that if the errors follow a normal distribution, then the least squares solution is the maximum likelihood estimator - it finds the parameters that make the observed data most probable. This gave least squares a probabilistic foundation that no competing method could match.
What does it mean to fit a line to data, and why is it useful?
Imagine you are a real estate agent. You have data on houses: their sizes (in square feet) and their selling prices. You notice a pattern - bigger houses tend to sell for more. You want to use this pattern to predict the price of a new house based on its size.
Linear regression answers this question by finding the best-fitting straight line through the data points. Once you have this line, you can plug in any house size and get a predicted price.
The fitted line has the form $\hat{y} = wx + b$, where $w$ is the slope (price per square foot) and $b$ is the y-intercept (base price).
There are infinitely many lines you could draw through a scatter plot. The question is: which line is the best one? Linear regression defines "best" by minimizing the total distance between the data points and the line.
For each data point, the residual is the vertical distance between the actual value $y_i$ and the predicted value $\hat{y}_i$: $r_i = y_i - \hat{y}_i$.
For each data point, calculate what the line predicts: $\hat{y}_i = w x_i + b$.
Measure the error - how far off the prediction is from the actual value: $r_i = y_i - \hat{y}_i$.
Square each residual (so negatives do not cancel positives) and add them all up: $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
Find the values of $w$ and $b$ that make this total as small as possible.
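The four steps above can be sketched in a few lines of NumPy (the data and the candidate line are made-up illustrative values):

```python
import numpy as np

# Toy data that follows y = 2x + 1 (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

w, b = 2.0, 0.0               # a candidate slope and intercept
y_hat = w * x + b             # step 1: what the line predicts
residuals = y - y_hat         # step 2: errors
sse = np.sum(residuals ** 2)  # step 3: sum of squared errors
print(sse)                    # step 4 would search for the w, b minimizing this
```

Here the candidate line is deliberately a little off (the intercept should be 1), so every residual is 1 and the SSE is 4.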
You might ask: why not just sum the absolute values of the residuals? There are several mathematical reasons why squaring is preferred:
The squared function is smooth everywhere, enabling calculus-based optimization
Squaring amplifies large residuals, strongly discouraging outlier-level errors
The sum of squared errors is strictly convex, guaranteeing a single global minimum
The flip side of penalizing large errors is that outliers have a disproportionate influence on the regression line. A single extreme data point can significantly shift the best-fit line. This is why checking for outliers is crucial before applying linear regression.
The complete mathematical framework for linear regression.
The fundamental assumption of linear regression is that the target variable $y$ is a linear function of the input features $x$, plus some random noise: $y = w^T x + b + \epsilon$. Where $x$ is the feature vector, $w$ is the weight vector, $b$ is the bias (intercept), and $\epsilon$ is a zero-mean noise term.
Given a set of learned parameters, the model predicts: $\hat{y} = w^T x + b$.
This is also called the hypothesis function. The hat notation $\hat{y}$ denotes a predicted value, as opposed to the true observed value $y$.
For a dataset with $n$ samples and $d$ features, we can express all predictions simultaneously using matrix notation. We augment the feature matrix with a column of ones to absorb the bias term, so each row of $X$ is $[1, x_{i1}, \ldots, x_{id}]$.
Then the predictions for all samples become a single matrix multiplication: $\hat{y} = Xw$.
Matrix notation is not just a shorthand - it enables efficient computation using optimized linear algebra libraries (BLAS, LAPACK). A single matrix multiply can compute millions of predictions simultaneously, which is why linear regression scales beautifully to large datasets.
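As a minimal sketch (shapes and numbers are illustrative), the augmented design matrix turns every prediction into one matrix multiply:

```python
import numpy as np

# Three samples, two features (illustrative numbers)
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# Augment with a column of ones so the bias is absorbed into the weight vector
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.array([0.5, 1.0, -1.0])   # [bias, w1, w2]
y_hat = X @ w                    # every prediction in a single matrix multiply
```

Under the hood, `@` dispatches to an optimized BLAS routine, which is what makes this scale to millions of rows.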
The vector of all residuals (errors) is: $e = y - \hat{y} = y - Xw$.
Each component $e_i = y_i - \hat{y}_i$ measures how far the prediction is from the actual value for sample $i$. Our goal is to make this residual vector as small as possible.
Deriving and understanding the Mean Squared Error loss function.
The cost function measures how well our model fits the data. For linear regression, we use the Mean Squared Error (MSE): $J(w, b) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$.
The factor of $\frac{1}{2}$ is a mathematical convenience - it cancels the 2 that comes down when differentiating the square, making the gradient formulas cleaner. The factor of $\frac{1}{n}$ normalizes by the number of samples so the cost does not grow with dataset size.
Using matrix notation, the MSE becomes remarkably elegant: $J(w) = \frac{1}{2n} \|Xw - y\|^2$.
This is the squared Euclidean norm of the residual vector, divided by $2n$. Geometrically, it measures the squared length of the error vector in $n$-dimensional space.
A function is convex if its second derivative is non-negative (or, in multiple dimensions, if its Hessian matrix is positive semi-definite). For MSE, the Hessian is $\nabla^2 J(w) = \frac{1}{n} X^T X$, which satisfies $v^T (X^T X) v = \|Xv\|^2 \geq 0$ for every vector $v$.
Since the Hessian is positive semi-definite, $J(w)$ is convex. If $X$ has full column rank, then $\|Xv\|^2 > 0$ for all $v \neq 0$, making the Hessian positive definite and $J(w)$ strictly convex with a unique global minimum.
The loss is never negative, and it is zero only when the model perfectly fits all data points. Larger errors are penalized quadratically - an error of 2 contributes 4 times as much as an error of 1.
MSE is in squared units of the target variable, making it hard to interpret directly. The Root Mean Squared Error (RMSE) $= \sqrt{\text{MSE}}$ is in the same units as the target, making it more interpretable. For example, if you predict house prices in dollars, RMSE tells you the typical error in dollars.
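For instance (the prices are made-up), converting MSE back into the target's units:

```python
import numpy as np

# Made-up house prices in thousands of dollars
y_true = np.array([250.0, 300.0, 400.0])
y_pred = np.array([240.0, 320.0, 390.0])

mse = np.mean((y_true - y_pred) ** 2)   # units: (thousand dollars)^2
rmse = np.sqrt(mse)                     # units: thousand dollars
print(rmse)  # the "typical" prediction error, in the target's own units
```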
Deriving the exact analytical solution using calculus and linear algebra.
Unlike logistic regression, linear regression has a beautiful closed-form solution. We can find the optimal weights directly by setting the gradient of the cost function to zero and solving for $w$: $w^* = (X^T X)^{-1} X^T y$.
This is the Normal Equation (also called the Ordinary Least Squares, or OLS, solution). The name "normal" comes from the fact that the residual vector $y - Xw^*$ is orthogonal (normal) to the column space of $X$.
The predicted values can be written as: $\hat{y} = X w^* = X(X^T X)^{-1} X^T y = Hy$.
Where $H = X(X^TX)^{-1}X^T$ is called the hat matrix (because it puts the "hat" on $y$). The hat matrix is an orthogonal projection matrix that projects $y$ onto the column space of $X$.
Linear regression finds the point in the column space of $X$ that is closest to the vector $y$. The residual vector $y - \hat{y}$ is perpendicular to every column of $X$, which is exactly the condition $X^T(y - Xw) = 0$, i.e., the normal equation.
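A sketch of the normal equation on synthetic data; solving the linear system $X^T X w = X^T y$ with `np.linalg.solve` is numerically preferable to forming the inverse explicitly, and the residual's orthogonality to the columns of $X$ can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # bias column + 2 features
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.01 * rng.normal(size=100)   # linear signal plus small noise

# Solve X^T X w = X^T y rather than computing (X^T X)^{-1} explicitly
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The residual vector is orthogonal (normal) to every column of X
residual = y - X @ w_star
print(X.T @ residual)   # numerically ~ the zero vector
```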
Normal equation: no iterations and no hyperparameters to tune; no risk of divergence or slow convergence. The drawback is that matrix inversion is cubic in the number of features. Use when the number of features $d < 10{,}000$.
Gradient descent: converges to the optimal solution over many steps and scales linearly with both features and samples. The drawback is that the learning rate must be tuned - too large diverges; too small is very slow. Use when the number of features $d > 10{,}000$ or the data is too large to fit in memory.
The normal equation requires $X^TX$ to be invertible. This fails when: (1) features are linearly dependent (multicollinearity), (2) there are more features than samples ($d > n$). In these cases, use regularization (Ridge/Lasso) or the pseudo-inverse $X^+$.
Step-by-step gradient derivation and the iterative optimization approach.
To use gradient descent, we need the gradient of the MSE cost function with respect to the parameters. Differentiating $J(w) = \frac{1}{2n}\|Xw - y\|^2$ via the chain rule gives $\nabla_w J = \frac{1}{n} X^T (Xw - y)$.
The gradient $\frac{1}{n}X^T(Xw - y)$ has a beautiful interpretation: each component measures how strongly the corresponding feature is correlated with the prediction errors $(Xw - y)$. When a feature is correlated with the errors, its weight needs adjustment. When the gradient is zero, the errors are uncorrelated with every feature - the model has extracted all the linear signal.
Gradient descent iteratively updates the weights by taking steps in the direction opposite to the gradient: $w \leftarrow w - \eta \, \nabla_w J(w)$.
Where $\eta$ is the learning rate, a hyperparameter that controls the step size.
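A minimal batch gradient descent loop on synthetic, noiseless data (the learning rate and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 1))])
true_w = np.array([0.5, 2.0])
y = X @ true_w                  # noiseless, so the optimum is exactly true_w

w = np.zeros(2)
eta = 0.1                       # learning rate (illustrative)
for _ in range(1000):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of the MSE cost
    w -= eta * grad                     # step against the gradient
print(w)   # approaches true_w
```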
Watch gradient descent optimize the loss. Adjust the learning rate to see how it affects convergence.
Learning rate too small: very slow convergence. May take thousands of iterations to reach the minimum.
Learning rate well chosen: smooth, efficient convergence. The sweet spot for most problems.
Learning rate too large: overshoots the minimum, may oscillate or diverge entirely.
Batch gradient descent: uses the entire dataset for each update. Stable but slow for large datasets.
Stochastic gradient descent (SGD): uses one random sample per update. Fast but noisy. Can escape shallow local minima.
Mini-batch gradient descent: uses a small batch (32-256 samples) per update. Best of both worlds - fast and stable.
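A sketch of the mini-batch variant (the batch size, learning rate, and epoch count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 1))])
true_w = np.array([1.0, -2.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(2)
eta, batch_size = 0.05, 32
for epoch in range(200):
    perm = rng.permutation(len(y))             # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= eta * grad                        # one noisy step per mini-batch
```

Each step uses only 32 rows, so updates are cheap; the shuffling keeps the gradient estimates unbiased.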
Hands-on interactive components to build deep intuition.
Adjust the slope and intercept sliders to manually fit a line through the data. Watch the MSE change in real-time as you move the line. Try to find the values that minimize the error.
Visualize the MSE as a function of the slope $w$ and intercept $b$. The bowl shape confirms convexity - there is only one minimum. Drag to rotate.
Watch how gradient descent iteratively adjusts the regression line to fit the data. The loss curve on the right shows convergence.
When to use linear regression and when to consider alternatives.
Each coefficient directly tells you how much the target changes for a one-unit change in that feature, holding all other features constant. This interpretability is invaluable in fields like medicine and economics.
Unlike most machine learning models, linear regression has an exact analytical solution. No iterations, no hyperparameter tuning for the optimization step.
Training is extremely fast - a single matrix multiplication solves the problem. Prediction is a simple dot product. This makes it practical for real-time applications and embedded systems.
Under the OLS assumptions, linear regression provides the Best Linear Unbiased Estimator (BLUE) by the Gauss-Markov theorem. It comes with confidence intervals, hypothesis tests, and p-values.
If the true relationship between features and target is nonlinear (quadratic, exponential, etc.), linear regression will systematically underfit the data.
Because the loss function squares the errors, a single extreme outlier can dramatically shift the regression line and corrupt the learned coefficients.
When features are highly correlated, the coefficient estimates become unstable and have high variance. The matrix $X^TX$ becomes nearly singular, making inversion numerically unreliable.
Without manual feature engineering, linear regression cannot capture interaction effects or nonlinear patterns. You must explicitly create polynomial or interaction terms.
Extending linear regression to more powerful frameworks.
What if the relationship between the feature and target is curved? We can extend linear regression by adding polynomial features: $\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k$.
This is still "linear" regression because the model is linear in the parameters $w$, even though it is nonlinear in the input $x$. We simply create new features $x^2, x^3, \ldots$ and apply ordinary linear regression.
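A sketch with a synthetic quadratic ground truth: build the columns $[1, x, x^2]$ and run ordinary least squares on them.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + 2.0 * x - 3.0 * x**2       # quadratic relationship, no noise

# Columns are [1, x, x^2]: nonlinear in x, but still linear in the weights
X = np.vander(x, N=3, increasing=True)
w = np.linalg.solve(X.T @ X, X.T @ y)   # plain least squares recovers the curve
print(w)   # approximately [1, 2, -3]
```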
High-degree polynomials can fit the training data perfectly but generalize poorly to new data. A degree-20 polynomial can pass through every training point but oscillate wildly between them. Always use cross-validation to select the polynomial degree.
To prevent overfitting and improve generalization, we add a penalty term to the cost function that discourages large weights: $J(w) = \text{MSE}(w) + \lambda \cdot \text{penalty}(w)$, where the regularization strength $\lambda$ is a hyperparameter.
Ridge regression (L2) shrinks all weights toward zero but never exactly to zero. The closed-form solution becomes: $w^* = (X^T X + \lambda I)^{-1} X^T y$.
Adding $\lambda I$ to $X^TX$ guarantees invertibility, solving the multicollinearity problem.
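A sketch of the ridge solution: with two perfectly collinear columns, $X^T X$ is singular and plain OLS fails, but adding $\lambda I$ makes the system solvable (the data and $\lambda$ are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x = np.arange(10, dtype=float)
X = np.column_stack([x, 2.0 * x])   # second column is exactly twice the first
y = 3.0 * x

w = ridge_fit(X, y, lam=1.0)
print(X @ w)   # close to y, despite X^T X being singular
```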
Lasso (L1) can drive weights exactly to zero, performing automatic feature selection. It has no closed-form solution - you must use iterative methods (coordinate descent).
Elastic Net combines L1 and L2 regularization, getting the best of both worlds: $J(w) = \text{MSE}(w) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$.
It performs feature selection (like Lasso) while handling correlated features gracefully (like Ridge). The mixing ratio $\alpha = \frac{\lambda_1}{\lambda_1 + \lambda_2}$ controls the balance between L1 and L2.
When we have multiple input features, the model extends naturally: $\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$.
The geometric interpretation changes: instead of fitting a line in 2D, we are fitting a hyperplane in $(d+1)$-dimensional space. All the same mathematics (normal equation, gradient descent, MSE) apply unchanged.
How do we measure how well our model fits the data? The $R^2$ metric quantifies the proportion of variance in the target that is explained by the model: $R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$.
Where $SS_{\text{res}}$ is the residual sum of squares and $SS_{\text{tot}}$ is the total sum of squares around the mean $\bar{y}$.
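The definition translates directly into a few lines of code:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                       # perfect fit -> 1.0
print(r_squared(y, np.full(4, y.mean())))    # predicting the mean -> 0.0
```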
$R^2 = 1$: perfect fit - the model explains all variance.
$R^2 = 0$: the model is no better than predicting the mean.
$R^2 < 0$: the model is worse than predicting the mean (rare, usually a bug).
Adjusted $R^2$: penalizes the score for adding irrelevant features.
A high $R^2$ does not necessarily mean the model is good. Adding more features never decreases $R^2$ on the training data (even random noise helps). Always use Adjusted $R^2$ or cross-validation to evaluate model quality honestly.
Linear regression is the foundation upon which much of machine learning is built.
Mastering linear regression gives you the conceptual toolkit for understanding nearly every supervised learning algorithm.