Linear Regression Cheat Sheet
Everything you need on one page. Perfect for revision, interviews, and quick reference.
The MSE loss is convex, and strictly convex when $X$ has full column rank: a unique global minimum is guaranteed.
Requires $X^TX$ to be invertible, which fails when features are linearly dependent or when $d > n$. Complexity: $O(d^3)$.
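A minimal sketch of the normal equation with NumPy on synthetic data (the data and weights below are illustrative assumptions). `np.linalg.solve` is used instead of an explicit inverse, which is the numerically safer way to apply $w = (X^TX)^{-1}X^Ty$:

```python
import numpy as np

# Synthetic data (illustrative): n=100 samples, d=3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

# Normal equation: solve (X^T X) w = X^T y rather than inverting X^T X.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to true_w
```

With near-collinear columns, `solve` would fail or become unstable; `np.linalg.lstsq` or `np.linalg.pinv` are the standard fallbacks.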
The gradient of the MSE is proportional to $X^T(\hat{y} - y)$, i.e. the correlation between the features and the prediction errors. At the optimum the gradient is zero, so the errors are uncorrelated with every feature.
Ridge (L2) shrinks all weights toward zero but rarely to exactly zero. Lasso (L1) produces sparse weights (feature selection). Elastic Net combines both penalties.
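Ridge also has a closed form, $w = (X^TX + \lambda I)^{-1}X^Ty$. A sketch (the data and $\lambda$ are illustrative assumptions) comparing it with OLS shows the shrinkage:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=50)

lam = 1.0  # regularization strength (illustrative choice)
d = X.shape[1]
# Ridge closed form: w = (X^T X + lam*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge weights are shrunk toward zero relative to OLS, but none are exactly zero.
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```

Adding $\lambda I$ also makes the system invertible even when $X^TX$ is singular, which is why Ridge is a standard fix for the normal-equation failure mode above.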
Q: What are the key assumptions of linear regression?
A: Linearity, independence of observations, homoscedasticity (constant variance of errors), normality of residuals, and no multicollinearity among features.
Q: How does the normal equation compare to gradient descent?
A: The normal equation $w = (X^TX)^{-1}X^Ty$ gives the exact solution in one step but is $O(d^3)$. Gradient descent is iterative at $O(nd)$ per step, which scales better when the feature dimension is large.
Q: How do you handle multicollinearity?
A: Use Ridge (L2) regularization, remove correlated features, apply PCA for dimensionality reduction, or compute the Variance Inflation Factor (VIF) to identify and address problematic features.
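A minimal VIF sketch from first principles (the helper function and data are illustrative assumptions): $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. Values well above 10 conventionally flag problematic collinearity:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (illustrative sketch)."""
    n, d = X.shape
    out = []
    for j in range(d):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])
print(vif(X))  # large for x1 and x2, near 1 for x3
```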
Q: What does $R^2$ measure, and what is its limitation?
A: $R^2$ measures the proportion of variance explained by the model. It never decreases when features are added, even irrelevant ones. Use Adjusted $R^2$ instead, which penalizes for adding irrelevant features.
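The adjusted statistic is $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-d-1}$. A quick sketch (the numbers are illustrative assumptions) showing how a tiny $R^2$ gain from an extra feature can still lower adjusted $R^2$:

```python
def adjusted_r2(r2, n, d):
    """Adjusted R^2: penalizes the feature count d (n = number of samples)."""
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)

# Adding an irrelevant 6th feature nudges R^2 up but adjusted R^2 down.
print(adjusted_r2(0.900, n=100, d=5))   # ≈ 0.8947
print(adjusted_r2(0.901, n=100, d=6))   # ≈ 0.8946, lower despite higher R^2
```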
Q: Why is linear regression sensitive to outliers?
A: Because MSE squares the errors, a single extreme outlier contributes disproportionately to the loss. The squared term amplifies large residuals, pulling the regression line toward the outlier.
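This is easy to see numerically; a sketch on synthetic data (the data and outlier magnitude are illustrative assumptions) where corrupting one point visibly shifts the fitted slope:

```python
import numpy as np

def fit_slope(x, y):
    """Least-squares slope of y on x with an intercept."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=0.5, size=50)   # true slope = 2

slope_clean = fit_slope(x, y)
y_out = y.copy()
y_out[-1] += 100                             # one extreme outlier
slope_outlier = fit_slope(x, y_out)
print(slope_clean, slope_outlier)            # outlier drags the slope upward
```

Robust losses (e.g. MAE or Huber) grow only linearly in the residual, which is why they resist this effect.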
Q: When should you use Ridge vs. Lasso vs. Elastic Net?
A: Use Ridge when all features are potentially relevant (shrinks but keeps all). Use Lasso when you want feature selection (drives some weights to zero). Use Elastic Net when features are correlated and you still want selection.
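A sketch of the sparsity contrast, assuming scikit-learn is available (the data, `alpha`, and `l1_ratio` values are illustrative assumptions, not tuned choices):

```python
# Assumes scikit-learn; Ridge, Lasso, and ElasticNet are its linear estimators.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("nonzero ridge:", np.sum(np.abs(ridge.coef_) > 1e-6))  # keeps all 10
print("nonzero lasso:", np.sum(np.abs(lasso.coef_) > 1e-6))  # sparse
print("nonzero enet :", np.sum(np.abs(enet.coef_) > 1e-6))
```

In practice the strengths are chosen by cross-validation (`RidgeCV`, `LassoCV`, `ElasticNetCV`).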