
Naive Bayes Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Bayes' Theorem:
$$P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)}$$
Prior:
$$P(y) = \frac{\text{count}(y)}{N}$$
Likelihood:
$$P(x|y) = \prod_{i=1}^{n} P(x_i|y)$$
Posterior (Naive):
$$P(y|x_1,...,x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y)$$
MAP Decision:
$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)$$
Log Form:
$$\hat{y} = \arg\max_y \left[\log P(y) + \sum_{i=1}^{n} \log P(x_i|y)\right]$$
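The log-form MAP rule above can be sketched directly. The priors and per-word likelihoods below are made-up toy numbers for a hypothetical spam filter, not values fitted from real data:

```python
import math

# Hypothetical two-class example: priors and per-word likelihoods
# are assumed numbers, just to exercise the log-form MAP rule.
priors = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham":  {"free": 0.02, "meeting": 0.20},
}

def map_decision(words):
    # argmax_y [ log P(y) + sum_i log P(x_i | y) ]
    scores = {
        y: math.log(priors[y]) + sum(math.log(likelihood[y][w]) for w in words)
        for y in priors
    }
    return max(scores, key=scores.get)
```

Here `map_decision(["free"])` returns `"spam"` because $0.4 \times 0.30 > 0.6 \times 0.02$, and the log form gives the same ranking without multiplying raw probabilities.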

Gaussian Naive Bayes

PDF:
$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
Mean:
$$\mu_y = \frac{1}{N_y} \sum_{i \in y} x_i$$
Variance:
$$\sigma_y^2 = \frac{1}{N_y} \sum_{i \in y} (x_i - \mu_y)^2$$

Used for continuous features. Assumes features follow a normal distribution within each class.
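A minimal sketch of fitting the per-class mean and variance and scoring a point, assuming equal class priors and an invented 1-D dataset:

```python
import math

def gaussian_pdf(x, mu, var):
    # P(x_i | y): normal density with class mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical 1-D training values per class (made-up numbers).
samples = {"A": [1.0, 1.2, 0.8], "B": [3.0, 3.4, 2.6]}

params = {}
for y, xs in samples.items():
    mu = sum(xs) / len(xs)                          # class mean
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # MLE variance
    params[y] = (mu, var)

def classify(x):
    # With equal priors, pick the class whose density is largest.
    return max(params, key=lambda y: gaussian_pdf(x, *params[y]))
```

A point near 1.0 lands in class A and one near 3.0 in class B, since each class's density dominates near its own mean.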

Multinomial Naive Bayes

Word Probability:
$$P(x_i|y) = \frac{N_{yi} + \alpha}{N_y + \alpha |V|}$$
Where:
$N_{yi}$ = count of word $i$ in class $y$, $N_y$ = total words in class $y$, $|V|$ = vocabulary size

Best for text classification with word counts (bag of words model).
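The smoothed word-probability estimate can be sketched as follows; the tiny corpus and class names are invented for illustration:

```python
from collections import Counter

# Hypothetical corpus: the words observed in each class.
words_in = {
    "sports": "ball goal ball team".split(),
    "tech":   "code ball code server".split(),
}
vocab = {w for ws in words_in.values() for w in ws}  # |V| = 5 here
alpha = 1.0  # Laplace smoothing

def word_prob(word, y):
    # (N_yi + alpha) / (N_y + alpha * |V|)
    counts = Counter(words_in[y])
    return (counts[word] + alpha) / (len(words_in[y]) + alpha * len(vocab))
```

For example `word_prob("ball", "sports")` is $(2+1)/(4+5) = 1/3$, and an unseen word like `"server"` in `"sports"` still gets $1/9$ instead of zero.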

Bernoulli Naive Bayes

Feature Probability:
$$P(x_i|y) = P(x_i=1|y)^{x_i} \cdot (1 - P(x_i=1|y))^{(1-x_i)}$$

For binary features. Models both presence AND absence of features (unlike Multinomial which ignores absence).
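A sketch of the Bernoulli likelihood for one feature vector; the probabilities used below are assumed values, not fitted ones:

```python
def bernoulli_likelihood(x, p):
    # x: binary feature vector, p[i] = P(x_i = 1 | y).
    # A present feature contributes p[i]; an absent one contributes 1 - p[i].
    out = 1.0
    for xi, pi in zip(x, p):
        out *= pi ** xi * (1 - pi) ** (1 - xi)
    return out
```

With `p = [0.8, 0.1]` and `x = [1, 0]`, the result is $0.8 \times 0.9 = 0.72$: the absent second feature still contributes a factor, which is exactly what Multinomial NB would ignore.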

Laplace Smoothing

Smoothed Probability:
$$P(x_i|y) = \frac{\text{count}(x_i, y) + \alpha}{\text{count}(y) + \alpha \cdot |V|}$$
Laplace:
$\alpha = 1$ (add-one smoothing)
Lidstone:
$0 < \alpha < 1$

Prevents zero probabilities from destroying predictions. Essential for real-world applications.
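A quick numeric check of why smoothing matters, using assumed counts (100 words in the class, vocabulary of 50, and one unseen word):

```python
def smoothed_prob(count_xy, count_y, vocab_size, alpha):
    # (count(x_i, y) + alpha) / (count(y) + alpha * |V|)
    return (count_xy + alpha) / (count_y + alpha * vocab_size)

# An unseen word (count 0) with no smoothing zeroes the whole product;
# with alpha = 1 it gets a small but nonzero probability instead.
no_smoothing = smoothed_prob(0, 100, 50, alpha=0.0)  # 0.0
laplace      = smoothed_prob(0, 100, 50, alpha=1.0)  # 1/150
```

One zero factor wipes out every other piece of evidence in the product; the smoothed $1/150$ merely dampens the score.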

Decision Rule

  1. Calculate prior $P(y)$ for each class from training data
  2. For each feature $x_i$, look up $P(x_i|y)$ from training
  3. Multiply: $P(y) \times P(x_1|y) \times P(x_2|y) \times ...$
  4. In practice, sum logs instead to avoid underflow: $\log P(y) + \sum \log P(x_i|y)$
  5. Pick the class with the highest score

No iterative optimization needed - just counting and multiplying.
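The five steps above are the entire training and prediction procedure. A minimal sketch over a made-up toy dataset, combining counting, Laplace smoothing, and the log-space decision rule:

```python
import math
from collections import Counter

# Hypothetical labeled word lists (toy data, not a real corpus).
train = [
    (["free", "win", "win"], "spam"),
    (["meeting", "notes"], "ham"),
    (["free", "meeting"], "ham"),
]

# Step 1: priors from class counts.
priors = {y: c / len(train) for y, c in Counter(y for _, y in train).items()}

# Step 2: word counts per class, plus the vocabulary.
words_by_class = {y: [] for y in priors}
for ws, y in train:
    words_by_class[y].extend(ws)
vocab = {w for ws, _ in train for w in ws}

def predict(words, alpha=1.0):
    # Steps 3-5: score each class in log space, return the argmax.
    best, best_score = None, -math.inf
    for y in priors:
        counts = Counter(words_by_class[y])
        n_y = len(words_by_class[y])
        score = math.log(priors[y]) + sum(
            math.log((counts[w] + alpha) / (n_y + alpha * len(vocab)))
            for w in words
        )
        if score > best_score:
            best, best_score = y, score
    return best
```

Training is literally the two counting passes; `predict(["win", "win"])` comes out `"spam"` and `predict(["meeting"])` comes out `"ham"` on this toy set.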

Assumptions

  • Features are conditionally independent given the class
  • All features contribute equally to the prediction
  • Feature distributions match the chosen model (Gaussian/Multinomial/Bernoulli)
  • Training data is representative of the true distribution
  • Classes are mutually exclusive and exhaustive

Common Mistakes

  • Forgetting Laplace smoothing (zero probabilities)
  • Using Gaussian NB on categorical features
  • Using Multinomial NB on negative-valued features
  • Not converting to log space (numerical underflow)
  • Assuming probability outputs are well-calibrated
  • Ignoring highly correlated features that violate independence
  • Not preprocessing text (stop words, stemming) for text NB
  • Not comparing raw counts against TF-IDF weighting for Multinomial NB (either can perform better in practice)

When to Use / Not Use

Use When:

  • Small training dataset
  • High-dimensional data (many features)
  • Text classification / NLP tasks
  • Fast training and prediction needed
  • Good baseline model required
  • Real-time classification

Avoid When:

  • Features are heavily correlated
  • Probability estimates must be accurate
  • Complex non-linear decision boundaries needed
  • Dataset has many numerical features with non-Gaussian distributions

Interview Quick-Fire

Q: Why is it called "naive"?

A: Because it assumes all features are conditionally independent given the class label - a simplification that rarely holds in practice but works surprisingly well.

Q: When would you prefer Naive Bayes over Logistic Regression?

A: When you have very little training data, need extremely fast training/prediction, or have a very high-dimensional feature space (like text with thousands of words).

Q: How does Naive Bayes handle missing features?

A: Simply omit the missing feature from the likelihood product. Since features are assumed independent, this is mathematically valid.

Q: What happens without Laplace smoothing?

A: If any feature has zero probability for a class, the entire posterior for that class becomes zero, regardless of all other evidence.

Q: Is Naive Bayes a discriminative or generative model?

A: Generative - it models the joint distribution P(x,y) by estimating P(x|y) and P(y), then uses Bayes' rule to get P(y|x).

Q: Why does Naive Bayes work well despite the independence assumption being wrong?

A: The classification decision only needs the correct ranking of class probabilities, not accurate probability values. Even biased probability estimates often produce the correct argmax.