Logistic Regression Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Linear Model:
$$z = w^Tx + b$$
Sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Prediction:
$$P(y=1|x) = \sigma(w^Tx + b)$$
Odds:
$$\frac{P}{1-P} = e^{w^Tx+b}$$
Log-Odds:
$$\log\left(\frac{P}{1-P}\right) = w^Tx + b$$
Sigmoid Derivative:
$$\sigma'(z) = \sigma(z)(1-\sigma(z))$$
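The formulas above can be sketched in a few lines of NumPy (the weights, bias, and input below are illustrative values, not from any real model):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

# hypothetical parameters for a 2-feature model
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 1.0])

z = w @ x + b        # linear model: z = w^T x + b
p = sigmoid(z)       # P(y=1|x)
odds = p / (1 - p)   # equals e^(w^T x + b)

# sigmoid derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
deriv = sigmoid(z) * (1 - sigmoid(z))
```

Note that `odds` matches `np.exp(z)` exactly, which is the odds identity above.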

Loss Function

BCE Loss:
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}[y_i\log(p_i) + (1-y_i)\log(1-p_i)]$$
When y=1:
$$\mathcal{L} = -\log(p) \quad \text{(large when } p \text{ is near 0)}$$
When y=0:
$$\mathcal{L} = -\log(1-p) \quad \text{(large when } p \text{ is near 1)}$$

The BCE loss is convex in $w$ and $b$, so gradient descent converges to the global minimum.
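A minimal NumPy sketch of the BCE loss (the labels and probabilities below are made-up sample values):

```python
import numpy as np

def bce_loss(y, p, eps=1e-12):
    # clip probabilities so log(0) never occurs
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# sample labels and predicted probabilities
y = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.2, 0.6, 0.99])
loss = bce_loss(y, p)
```

Being confidently wrong (e.g. predicting p=0.01 when y=1) costs far more than being mildly unsure, which is exactly the penalty behavior described above.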

Optimization Steps

  1. Initialize weights $w$ and bias $b$ (zeros or random)
  2. Compute predictions: $p_i = \sigma(w^Tx_i + b)$
  3. Compute loss: $\mathcal{L}(w,b)$
  4. Compute gradients
  5. Update weights: $w := w - \eta \nabla_w \mathcal{L}$
  6. Repeat steps 2-5 until convergence
No closed-form solution exists, so iterative optimization is required.
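The six steps above can be sketched as a batch gradient descent loop in NumPy (the toy 1-D dataset and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=2000):
    n, d = X.shape
    w = np.zeros(d)                    # step 1: initialize weights
    b = 0.0                            # step 1: initialize bias
    for _ in range(epochs):            # step 6: repeat until convergence
        p = sigmoid(X @ w + b)         # step 2: predictions
        grad_w = X.T @ (p - y) / n     # step 4: weight gradient
        grad_b = np.mean(p - y)        # step 4: bias gradient
        w -= lr * grad_w               # step 5: update weights
        b -= lr * grad_b               # step 5: update bias
    return w, b

# toy linearly separable data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logreg(X, y)
```

(Step 3, computing the loss, is only needed for monitoring convergence and is omitted here.)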

Gradient

Weight Gradient:
$$\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)x_i$$
Bias Gradient:
$$\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)$$
Update Rule:
$$w := w - \eta \cdot \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)x_i$$

Gradient = (prediction error) × (input features)
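The analytic gradient above can be verified against a finite-difference approximation of the BCE loss (random data generated for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

# analytic gradient: (1/n) * X^T (p - y)
p = sigmoid(X @ w)
analytic = X.T @ (p - y) / len(y)

# numeric gradient via central differences
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    numeric[j] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * eps)
```

The two gradients agree to several decimal places, confirming the "(prediction error) × (input features)" form.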

Assumptions

  • Binary outcome variable (0 or 1)
  • Linear relationship between features and log-odds
  • Observations are independent
  • Little or no multicollinearity among features
  • Large enough sample size
  • No extreme outliers in continuous predictors

Common Mistakes

  • Using MSE loss instead of BCE (non-convex!)
  • Not scaling features before training
  • Ignoring multicollinearity
  • Expecting non-linear decision boundaries
  • Confusing logistic regression with linear regression
  • Forgetting to check class imbalance
  • Interpreting coefficients as probabilities (they are log-odds!)
  • Using accuracy as the only metric for imbalanced data

Regularization

L2 (Ridge):
$$\mathcal{L} + \frac{\lambda}{2}\|w\|^2$$
L1 (Lasso):
$$\mathcal{L} + \lambda\|w\|_1$$

L1 produces sparse weights (feature selection). L2 shrinks all weights toward zero but rarely makes any exactly zero.
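The sparsity difference can be seen with scikit-learn, assuming it is available (note that its `C` parameter is the inverse of $\lambda$; the synthetic data below is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first two features matter; the other eight are pure noise
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

# strong regularization: small C means large lambda
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
```

L1 drives the noise-feature coefficients to exactly zero, while L2 leaves them small but nonzero.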

Interview Questions

Q: Why not use MSE for logistic regression?

A: MSE with sigmoid creates a non-convex loss surface with local minima. BCE is convex, ensuring global optimum.

Q: Is there a closed-form solution?

A: No. The sigmoid makes the equation nonlinear in weights. We use iterative methods like gradient descent.

Q: What does the coefficient represent?

A: Each coefficient represents the change in log-odds for a one-unit increase in the corresponding feature.

Q: How is it related to neural networks?

A: Logistic regression is a single-layer neural network with sigmoid activation. It is the building block of deep learning.

Q: When would you choose logistic regression over a neural network?

A: When interpretability matters, data is limited, features are linearly separable, or you need calibrated probabilities.

Q: What is the decision boundary?

A: The hyperplane where $w^Tx + b = 0$, i.e., where the predicted probability is exactly 0.5.
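This can be checked directly: any point satisfying $w^Tx + b = 0$ maps to probability 0.5, since $\sigma(0) = 0.5$ (the parameters and point below are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hypothetical trained parameters
w = np.array([2.0, -1.0])
b = 0.5

# a point on the hyperplane: 2*0.25 - 1*1.0 + 0.5 = 0
x_on_boundary = np.array([0.25, 1.0])
p = sigmoid(w @ x_on_boundary + b)
```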