
Linear Regression Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Linear Model:
$$\hat{y} = w^Tx + b$$
Hypothesis:
$$h_{w,b}(x) = w_1x_1 + w_2x_2 + \cdots + w_dx_d + b$$
Prediction (matrix):
$$\hat{y} = Xw$$
Residual:
$$e_i = y_i - \hat{y}_i$$
Hat Matrix:
$$H = X(X^TX)^{-1}X^T$$
R-Squared:
$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
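The formulas above map directly to NumPy; here is a minimal sketch (the data, weights, and noise values are made up for illustration):

```python
import numpy as np

# Toy data: 5 samples, 2 features (illustrative values only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
w = np.array([0.5, 1.5])
b = 0.2

y_hat = X @ w + b  # linear model: y_hat = Xw + b
y = y_hat + np.array([0.1, -0.1, 0.05, -0.05, 0.0])  # targets with small noise

residuals = y - y_hat                     # e_i = y_i - y_hat_i
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y - y.mean())**2)
r2 = 1 - ss_res / ss_tot                  # R^2 = 1 - SSres / SStot
```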

Cost Function (MSE)

MSE:
$$J(w,b) = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Matrix Form:
$$J(w) = \frac{1}{2n}(y - Xw)^T(y - Xw)$$
When $y = \hat{y}$:
$$J = 0 \quad \text{(perfect fit)}$$
When error exists:
$$J > 0 \quad \text{(always positive, quadratic penalty)}$$

The MSE is strictly convex when $X$ has full column rank, which guarantees a unique global minimum.
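The matrix form of the cost translates to a one-line function; a sketch with made-up data showing $J = 0$ at a perfect fit and $J > 0$ otherwise:

```python
import numpy as np

def mse_cost(X, y, w):
    """J(w) = (1/2n) (y - Xw)^T (y - Xw), matching the matrix form above."""
    n = len(y)
    e = y - X @ w
    return (e @ e) / (2 * n)

# Illustrative data where y = 2x exactly
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
```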

Normal Equation

Closed-Form Solution:
$$w^* = (X^TX)^{-1}X^Ty$$
Normal Equation:
$$X^TXw = X^Ty$$
Hessian:
$$\nabla^2 J = \frac{1}{n}X^TX \quad \text{(positive semi-definite)}$$

Requires $X^TX$ to be invertible. Fails when features are linearly dependent or $d > n$. Complexity: $O(d^3)$.
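In code, solving the normal equation $X^TXw = X^Ty$ directly (via `np.linalg.solve`) is preferred over forming the explicit inverse. A sketch with synthetic data (the true weights and noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

# Solve X^T X w = X^T y without explicitly inverting X^T X
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```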

Gradient

Weight Gradient:
$$\frac{\partial J}{\partial w} = \frac{1}{n}X^T(Xw - y)$$
Bias Gradient:
$$\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$
Update Rule:
$$w := w - \eta \cdot \frac{1}{n}X^T(Xw - y)$$

The gradient is proportional to the correlation between each feature and the prediction errors. At the optimum, where the gradient is zero, the errors are uncorrelated with every feature.
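The gradient and update rule fit in a short loop; a sketch on noise-free synthetic data (learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(size=(n, d))
true_w = np.array([3.0, -1.0])
y = X @ true_w                      # noise-free, so the exact optimum is true_w

w = np.zeros(d)
eta = 0.1                           # learning rate (illustrative)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n    # (1/n) X^T (Xw - y)
    w -= eta * grad                 # w := w - eta * grad
```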

Assumptions

  • Linearity: The relationship between features and target is linear
  • Independence: Observations are independent of each other
  • Homoscedasticity: Constant variance of residuals across all levels of the features
  • Normality: Residuals are normally distributed (needed for inference, not prediction)
  • No Multicollinearity: Features are not highly correlated with each other
  • No Autocorrelation: Residuals are not correlated with each other (important for time series)
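Several of these assumptions admit quick numerical spot checks on the residuals. A sketch on synthetic data that satisfies the assumptions by construction (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 2 * x + 1 + rng.normal(scale=0.5, size=n)  # linear signal, constant noise

# OLS fit with intercept via least squares on [x, 1]
A = np.column_stack([x, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
resid = y - fitted

mean_resid = resid.mean()                           # ~0 with an intercept
corr_resid_fit = np.corrcoef(resid, fitted)[0, 1]   # ~0 if linearity holds
lag1_autocorr = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # ~0 if independent
```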

Common Mistakes

  • Using linear regression for inherently non-linear relationships without feature engineering
  • Ignoring outliers that can heavily skew the regression line
  • Not checking assumptions (residual plots, normality tests) before interpreting results
  • Overfitting with too many features relative to the number of samples
  • Using $R^2$ alone to evaluate model quality (always check Adjusted $R^2$ and residual plots)
  • Confusing correlation with causation based on regression coefficients
  • Extrapolating far beyond the range of training data
  • Ignoring multicollinearity - leads to unstable and uninterpretable coefficients

Regularization

Ridge (L2):
$$J + \frac{\lambda}{2}\|w\|^2 \quad \Rightarrow \quad w^* = (X^TX + \lambda I)^{-1}X^Ty$$
Lasso (L1):
$$J + \lambda\|w\|_1 \quad \text{(no closed form)}$$
Elastic Net:
$$J + \lambda_1\|w\|_1 + \frac{\lambda_2}{2}\|w\|^2$$

Ridge shrinks all weights toward zero but rarely to exactly zero. Lasso produces sparse weights (feature selection). Elastic Net combines both.
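The ridge closed form differs from OLS only by the $\lambda I$ term; a sketch comparing the two on synthetic data ($\lambda$ and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 10.0  # regularization strength (illustrative)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
# The lam * I term shrinks the solution toward zero
```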

Interview Questions

Q: What are the assumptions of linear regression?

A: Linearity, independence of observations, homoscedasticity (constant variance of errors), normality of residuals, and no multicollinearity among features.

Q: What is the difference between the normal equation and gradient descent?

A: The normal equation $w = (X^TX)^{-1}X^Ty$ gives the exact solution in one step but costs $O(d^3)$ to solve. Gradient descent is iterative at $O(nd)$ per step, which scales better when the feature dimension $d$ is large.

Q: How do you handle multicollinearity?

A: Use Ridge (L2) regularization, remove correlated features, apply PCA for dimensionality reduction, or compute Variance Inflation Factor (VIF) to identify and address problematic features.
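The VIF check mentioned above can be computed with plain NumPy: regress each feature on the others and report $1/(1 - R_j^2)$. A sketch with synthetic data where two features are nearly collinear (data and the usual VIF > 10 rule of thumb are illustrative):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R^2_j), where R^2_j
    comes from regressing column j on the remaining columns plus an intercept."""
    n, d = X.shape
    out = np.empty(d)
    for j in range(d):
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean())**2)
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)              # independent feature
X = np.column_stack([x1, x2, x3])
```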

Q: What does $R^2$ measure and what are its limitations?

A: $R^2$ measures the proportion of variance explained by the model. It never decreases when features are added, even irrelevant ones. Use Adjusted $R^2$ instead, which penalizes for adding irrelevant features.
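The standard adjusted formula is $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-d-1}$; a sketch showing how a tiny $R^2$ gain from extra features can still lower the adjusted score (the numbers are made up):

```python
def adjusted_r2(r2, n, d):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - d - 1),
    where n is the sample count and d the number of features."""
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)

# 5 extra features buy only +0.001 R^2, so the adjusted score drops:
base = adjusted_r2(0.900, n=100, d=5)
bloated = adjusted_r2(0.901, n=100, d=10)
```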

Q: Why is linear regression sensitive to outliers?

A: Because MSE squares the errors, a single extreme outlier contributes disproportionately to the loss. The squared term amplifies large residuals, pulling the regression line toward the outlier.
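This sensitivity is easy to demonstrate: corrupt one point and watch the fitted slope move. A sketch with synthetic data (the outlier magnitude and tolerances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=0.1, size=50)  # true slope is 2

def fit_slope(x, y):
    """OLS slope from a least-squares fit of y on [x, 1]."""
    A = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0]

slope_clean = fit_slope(x, y)

# A single extreme outlier drags the whole line toward it
y_out = y.copy()
y_out[-1] += 100.0
slope_out = fit_slope(x, y_out)
```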

Q: When would you use Ridge vs Lasso vs Elastic Net?

A: Use Ridge when all features are potentially relevant (shrinks but keeps all). Use Lasso when you want feature selection (drives some weights to zero). Use Elastic Net when features are correlated and you want selection.