
Naive Bayes Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Bayes' Theorem:
$$P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)}$$
Prior:
$$P(y) = \frac{\text{count}(y)}{N}$$
Likelihood:
$$P(x|y) = \prod_{i=1}^{n} P(x_i|y)$$
Posterior (Naive):
$$P(y|x_1,...,x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y)$$
MAP Decision:
$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)$$
Log Form:
$$\hat{y} = \arg\max_y \left[\log P(y) + \sum_{i=1}^{n} \log P(x_i|y)\right]$$
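The log-form MAP rule above can be sketched directly. The priors and per-word likelihoods below are made-up toy numbers for a hypothetical spam filter, not values fitted from real data:

```python
import math

# Hypothetical two-class example: priors and per-word likelihoods
# are assumed numbers, just to exercise the log-form MAP rule.
priors = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham":  {"free": 0.02, "meeting": 0.20},
}

def map_decision(words):
    # argmax_y [ log P(y) + sum_i log P(x_i | y) ]
    scores = {
        y: math.log(priors[y]) + sum(math.log(likelihood[y][w]) for w in words)
        for y in priors
    }
    return max(scores, key=scores.get)
```

Here `map_decision(["free"])` returns `"spam"` because $0.4 \times 0.30 > 0.6 \times 0.02$, and the log form gives the same ranking without multiplying raw probabilities.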

Gaussian Naive Bayes

PDF:
$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
Mean:
$$\mu_y = \frac{1}{N_y} \sum_{i \in y} x_i$$
Variance:
$$\sigma_y^2 = \frac{1}{N_y} \sum_{i \in y} (x_i - \mu_y)^2$$

Used for continuous features. Assumes features follow a normal distribution within each class.
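A minimal sketch of fitting the per-class mean and variance and scoring a point, assuming equal class priors and an invented 1-D dataset:

```python
import math

def gaussian_pdf(x, mu, var):
    # P(x_i | y): normal density with class mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical 1-D training values per class (made-up numbers).
samples = {"A": [1.0, 1.2, 0.8], "B": [3.0, 3.4, 2.6]}

params = {}
for y, xs in samples.items():
    mu = sum(xs) / len(xs)                          # class mean
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # MLE variance
    params[y] = (mu, var)

def classify(x):
    # With equal priors, pick the class whose density is largest.
    return max(params, key=lambda y: gaussian_pdf(x, *params[y]))
```

A point near 1.0 lands in class A and one near 3.0 in class B, since each class's density dominates near its own mean.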

Multinomial Naive Bayes

Word Probability:
$$P(x_i|y) = \frac{N_{yi} + \alpha}{N_y + \alpha |V|}$$
Where:
$N_{yi}$ = count of word $i$ in class $y$, $N_y$ = total words in class $y$, $|V|$ = vocabulary size

Best for text classification with word counts (bag of words model).
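The smoothed word-probability estimate can be sketched as follows; the tiny corpus and class names are invented for illustration:

```python
from collections import Counter

# Hypothetical corpus: the words observed in each class.
words_in = {
    "sports": "ball goal ball team".split(),
    "tech":   "code ball code server".split(),
}
vocab = {w for ws in words_in.values() for w in ws}  # |V| = 5 here
alpha = 1.0  # Laplace smoothing

def word_prob(word, y):
    # (N_yi + alpha) / (N_y + alpha * |V|)
    counts = Counter(words_in[y])
    return (counts[word] + alpha) / (len(words_in[y]) + alpha * len(vocab))
```

For example `word_prob("ball", "sports")` is $(2+1)/(4+5) = 1/3$, and an unseen word like `"server"` in `"sports"` still gets $1/9$ instead of zero.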

Bernoulli Naive Bayes

Feature Probability:
$$P(x_i|y) = P(x_i=1|y)^{x_i} \cdot (1 - P(x_i=1|y))^{(1-x_i)}$$

For binary features. Models both presence AND absence of features (unlike Multinomial which ignores absence).
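A sketch of the Bernoulli likelihood for one feature vector; the probabilities used below are assumed values, not fitted ones:

```python
def bernoulli_likelihood(x, p):
    # x: binary feature vector, p[i] = P(x_i = 1 | y).
    # A present feature contributes p[i]; an absent one contributes 1 - p[i].
    out = 1.0
    for xi, pi in zip(x, p):
        out *= pi ** xi * (1 - pi) ** (1 - xi)
    return out
```

With `p = [0.8, 0.1]` and `x = [1, 0]`, the result is $0.8 \times 0.9 = 0.72$: the absent second feature still contributes a factor, which is exactly what Multinomial NB would ignore.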

Laplace Smoothing

Smoothed Probability:
$$P(x_i|y) = \frac{\text{count}(x_i, y) + \alpha}{\text{count}(y) + \alpha \cdot |V|}$$
Laplace:
$\alpha = 1$ (add-one smoothing)
Lidstone:
$0 < \alpha < 1$

Prevents zero probabilities from destroying predictions. Essential for real-world applications.
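A quick numeric check of why smoothing matters, using assumed counts (100 words in the class, vocabulary of 50, and one unseen word):

```python
def smoothed_prob(count_xy, count_y, vocab_size, alpha):
    # (count(x_i, y) + alpha) / (count(y) + alpha * |V|)
    return (count_xy + alpha) / (count_y + alpha * vocab_size)

# An unseen word (count 0) with no smoothing zeroes the whole product;
# with alpha = 1 it gets a small but nonzero probability instead.
no_smoothing = smoothed_prob(0, 100, 50, alpha=0.0)  # 0.0
laplace      = smoothed_prob(0, 100, 50, alpha=1.0)  # 1/150
```

One zero factor wipes out every other piece of evidence in the product; the smoothed $1/150$ merely dampens the score.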

Decision Rule

  1. Calculate prior $P(y)$ for each class from training data
  2. For each feature $x_i$, look up $P(x_i|y)$ from training
  3. Multiply: $P(y) \times P(x_1|y) \times P(x_2|y) \times ...$
  4. In practice, sum logs instead to avoid underflow: $\log P(y) + \sum \log P(x_i|y)$
  5. Pick the class with the highest score

No iterative optimization needed - just counting and multiplying.
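The five steps above are the entire training and prediction procedure. A minimal sketch over a made-up toy dataset, combining counting, Laplace smoothing, and the log-space decision rule:

```python
import math
from collections import Counter

# Hypothetical labeled word lists (toy data, not a real corpus).
train = [
    (["free", "win", "win"], "spam"),
    (["meeting", "notes"], "ham"),
    (["free", "meeting"], "ham"),
]

# Step 1: priors from class counts.
priors = {y: c / len(train) for y, c in Counter(y for _, y in train).items()}

# Step 2: word counts per class, plus the vocabulary.
words_by_class = {y: [] for y in priors}
for ws, y in train:
    words_by_class[y].extend(ws)
vocab = {w for ws, _ in train for w in ws}

def predict(words, alpha=1.0):
    # Steps 3-5: score each class in log space, return the argmax.
    best, best_score = None, -math.inf
    for y in priors:
        counts = Counter(words_by_class[y])
        n_y = len(words_by_class[y])
        score = math.log(priors[y]) + sum(
            math.log((counts[w] + alpha) / (n_y + alpha * len(vocab)))
            for w in words
        )
        if score > best_score:
            best, best_score = y, score
    return best
```

Training is literally the two counting passes; `predict(["win", "win"])` comes out `"spam"` and `predict(["meeting"])` comes out `"ham"` on this toy set.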

Assumptions

  • Features are conditionally independent given the class
  • All features contribute equally to the prediction
  • Feature distributions match the chosen model (Gaussian/Multinomial/Bernoulli)
  • Training data is representative of the true distribution
  • Classes are mutually exclusive and exhaustive

Common Mistakes

  • Forgetting Laplace smoothing (zero probabilities)
  • Using Gaussian NB on categorical features
  • Using Multinomial NB on negative-valued features
  • Not converting to log space (numerical underflow)
  • Assuming probability outputs are well-calibrated
  • Ignoring highly correlated features that violate independence
  • Not preprocessing text (stop words, stemming) for text NB
  • Not comparing raw counts against TF-IDF weighting for Multinomial NB (either can perform better in practice)

When to Use / Not Use

Use When:

  • Small training dataset
  • High-dimensional data (many features)
  • Text classification / NLP tasks
  • Fast training and prediction needed
  • Good baseline model required
  • Real-time classification

Avoid When:

  • Features are heavily correlated
  • Probability estimates must be accurate
  • Complex non-linear decision boundaries needed
  • Dataset has many numerical features with non-Gaussian distributions

Interview Quick-Fire

Q: Why is it called "naive"?

A: Because it assumes all features are conditionally independent given the class label - a simplification that rarely holds in practice but works surprisingly well.

Q: When would you prefer Naive Bayes over Logistic Regression?

A: When you have very little training data, need extremely fast training/prediction, or have a very high-dimensional feature space (like text with thousands of words).

Q: How does Naive Bayes handle missing features?

A: Simply omit the missing feature from the likelihood product. Since features are assumed independent, this is mathematically valid.

Q: What happens without Laplace smoothing?

A: If any feature has zero probability for a class, the entire posterior for that class becomes zero, regardless of all other evidence.

Q: Is Naive Bayes a discriminative or generative model?

A: Generative - it models the joint distribution P(x,y) by estimating P(x|y) and P(y), then uses Bayes' rule to get P(y|x).

Q: Why does Naive Bayes work well despite the independence assumption being wrong?

A: The classification decision only needs the correct ranking of class probabilities, not accurate probability values. Even biased probability estimates often produce the correct argmax.