
Linear Regression Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Linear Model:
$$\hat{y} = w^Tx + b$$
Hypothesis:
$$h_{w,b}(x) = w_1x_1 + w_2x_2 + \cdots + w_dx_d + b$$
Prediction (matrix):
$$\hat{y} = Xw$$
Residual:
$$e_i = y_i - \hat{y}_i$$
Hat Matrix:
$$H = X(X^TX)^{-1}X^T$$
R-Squared:
$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
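The formulas above map directly to NumPy; here is a minimal sketch (the data, weights, and noise values are made up for illustration):

```python
import numpy as np

# Toy data: 5 samples, 2 features (illustrative values only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
w = np.array([0.5, 1.5])
b = 0.2

y_hat = X @ w + b  # linear model: y_hat = Xw + b
y = y_hat + np.array([0.1, -0.1, 0.05, -0.05, 0.0])  # targets with small noise

residuals = y - y_hat                     # e_i = y_i - y_hat_i
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y - y.mean())**2)
r2 = 1 - ss_res / ss_tot                  # R^2 = 1 - SSres / SStot
```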

Cost Function (MSE)

MSE:
$$J(w,b) = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Matrix Form:
$$J(w) = \frac{1}{2n}(y - Xw)^T(y - Xw)$$
When $y = \hat{y}$:
$$J = 0 \quad \text{(perfect fit)}$$
When error exists:
$$J > 0 \quad \text{(always positive, quadratic penalty)}$$

The MSE is strictly convex when $X$ has full column rank, which guarantees a unique global minimum.
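The matrix form of the cost translates to a one-line function; a sketch with made-up data showing $J = 0$ at a perfect fit and $J > 0$ otherwise:

```python
import numpy as np

def mse_cost(X, y, w):
    """J(w) = (1/2n) (y - Xw)^T (y - Xw), matching the matrix form above."""
    n = len(y)
    e = y - X @ w
    return (e @ e) / (2 * n)

# Illustrative data where y = 2x exactly
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
```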

Normal Equation

Closed-Form Solution:
$$w^* = (X^TX)^{-1}X^Ty$$
Normal Equation:
$$X^TXw = X^Ty$$
Hessian:
$$\nabla^2 J = \frac{1}{n}X^TX \quad \text{(positive semi-definite)}$$

Requires $X^TX$ to be invertible. Fails when features are linearly dependent or $d > n$. Complexity: $O(d^3)$.
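In code, solving the normal equation $X^TXw = X^Ty$ directly (via `np.linalg.solve`) is preferred over forming the explicit inverse. A sketch with synthetic data (the true weights and noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

# Solve X^T X w = X^T y without explicitly inverting X^T X
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```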

Gradient

Weight Gradient:
$$\frac{\partial J}{\partial w} = \frac{1}{n}X^T(Xw - y)$$
Bias Gradient:
$$\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$
Update Rule:
$$w := w - \eta \cdot \frac{1}{n}X^T(Xw - y)$$

The gradient is proportional to the correlation between each feature and the prediction errors. At the optimum, where the gradient is zero, the errors are uncorrelated with every feature.
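The gradient and update rule fit in a short loop; a sketch on noise-free synthetic data (learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(size=(n, d))
true_w = np.array([3.0, -1.0])
y = X @ true_w                      # noise-free, so the exact optimum is true_w

w = np.zeros(d)
eta = 0.1                           # learning rate (illustrative)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n    # (1/n) X^T (Xw - y)
    w -= eta * grad                 # w := w - eta * grad
```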

Assumptions

  • Linearity: The relationship between features and target is linear
  • Independence: Observations are independent of each other
  • Homoscedasticity: Constant variance of residuals across all levels of the features
  • Normality: Residuals are normally distributed (needed for inference, not prediction)
  • No Multicollinearity: Features are not highly correlated with each other
  • No Autocorrelation: Residuals are not correlated with each other (important for time series)
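Several of these assumptions admit quick numerical spot checks on the residuals. A sketch on synthetic data that satisfies the assumptions by construction (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 2 * x + 1 + rng.normal(scale=0.5, size=n)  # linear signal, constant noise

# OLS fit with intercept via least squares on [x, 1]
A = np.column_stack([x, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
resid = y - fitted

mean_resid = resid.mean()                           # ~0 with an intercept
corr_resid_fit = np.corrcoef(resid, fitted)[0, 1]   # ~0 if linearity holds
lag1_autocorr = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # ~0 if independent
```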

Common Mistakes

  • Using linear regression for inherently non-linear relationships without feature engineering
  • Ignoring outliers that can heavily skew the regression line
  • Not checking assumptions (residual plots, normality tests) before interpreting results
  • Overfitting with too many features relative to the number of samples
  • Using $R^2$ alone to evaluate model quality (always check Adjusted $R^2$ and residual plots)
  • Confusing correlation with causation based on regression coefficients
  • Extrapolating far beyond the range of training data
  • Ignoring multicollinearity - leads to unstable and uninterpretable coefficients

Regularization

Ridge (L2):
$$J + \frac{\lambda}{2}\|w\|^2 \quad \Rightarrow \quad w^* = (X^TX + \lambda I)^{-1}X^Ty$$
Lasso (L1):
$$J + \lambda\|w\|_1 \quad \text{(no closed form)}$$
Elastic Net:
$$J + \lambda_1\|w\|_1 + \frac{\lambda_2}{2}\|w\|^2$$

Ridge shrinks all weights toward zero but rarely to exactly zero. Lasso produces sparse weights (feature selection). Elastic Net combines both.
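The ridge closed form differs from OLS only by the $\lambda I$ term; a sketch comparing the two on synthetic data ($\lambda$ and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 10.0  # regularization strength (illustrative)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
# The lam * I term shrinks the solution toward zero
```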

Interview Questions

Q: What are the assumptions of linear regression?

A: Linearity, independence of observations, homoscedasticity (constant variance of errors), normality of residuals, and no multicollinearity among features.

Q: What is the difference between the normal equation and gradient descent?

A: The normal equation $w = (X^TX)^{-1}X^Ty$ gives the exact solution in one step but costs $O(d^3)$ to solve. Gradient descent is iterative at $O(nd)$ per step, which scales better when the feature dimension $d$ is large.

Q: How do you handle multicollinearity?

A: Use Ridge (L2) regularization, remove correlated features, apply PCA for dimensionality reduction, or compute Variance Inflation Factor (VIF) to identify and address problematic features.
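The VIF check mentioned above can be computed with plain NumPy: regress each feature on the others and report $1/(1 - R_j^2)$. A sketch with synthetic data where two features are nearly collinear (data and the usual VIF > 10 rule of thumb are illustrative):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R^2_j), where R^2_j
    comes from regressing column j on the remaining columns plus an intercept."""
    n, d = X.shape
    out = np.empty(d)
    for j in range(d):
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean())**2)
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)              # independent feature
X = np.column_stack([x1, x2, x3])
```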

Q: What does $R^2$ measure and what are its limitations?

A: $R^2$ measures the proportion of variance explained by the model. It never decreases when features are added, even irrelevant ones. Use Adjusted $R^2$ instead, which penalizes for adding irrelevant features.
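The standard adjusted formula is $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-d-1}$; a sketch showing how a tiny $R^2$ gain from extra features can still lower the adjusted score (the numbers are made up):

```python
def adjusted_r2(r2, n, d):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - d - 1),
    where n is the sample count and d the number of features."""
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)

# 5 extra features buy only +0.001 R^2, so the adjusted score drops:
base = adjusted_r2(0.900, n=100, d=5)
bloated = adjusted_r2(0.901, n=100, d=10)
```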

Q: Why is linear regression sensitive to outliers?

A: Because MSE squares the errors, a single extreme outlier contributes disproportionately to the loss. The squared term amplifies large residuals, pulling the regression line toward the outlier.
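This sensitivity is easy to demonstrate: corrupt one point and watch the fitted slope move. A sketch with synthetic data (the outlier magnitude and tolerances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=0.1, size=50)  # true slope is 2

def fit_slope(x, y):
    """OLS slope from a least-squares fit of y on [x, 1]."""
    A = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0]

slope_clean = fit_slope(x, y)

# A single extreme outlier drags the whole line toward it
y_out = y.copy()
y_out[-1] += 100.0
slope_out = fit_slope(x, y_out)
```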

Q: When would you use Ridge vs Lasso vs Elastic Net?

A: Use Ridge when all features are potentially relevant (shrinks but keeps all). Use Lasso when you want feature selection (drives some weights to zero). Use Elastic Net when features are correlated and you want selection.