Naive Bayes
From Bayes' theorem to text classification. A beginner-friendly, visually interactive deep dive into the fastest and most elegant probabilistic classifier.
How a posthumous essay on probability changed the course of science and machine learning.
Thomas Bayes was a Presbyterian minister and mathematician in England. He spent much of his life studying logic and theology, but it was his work on probability that would immortalize his name. Bayes was interested in a fundamental question: given that an event has occurred, how should we update our belief about its underlying cause?
Bayes never published the work himself. After his death in 1761, his friend Richard Price discovered the manuscript and published it in 1763 as "An Essay towards solving a Problem in the Doctrine of Chances" in the Philosophical Transactions of the Royal Society.
Independently of Bayes, the French mathematician Pierre-Simon Laplace (1749-1827) arrived at the same theorem and developed it into a comprehensive framework. Laplace applied Bayesian reasoning to an extraordinary range of problems: celestial mechanics, population statistics, the reliability of witness testimony, and even the probability that the sun would rise tomorrow.
While Bayes planted the seed, it was Laplace who built the tree. Modern Bayesian statistics owes as much to Laplace as it does to Bayes himself.
Probability theory began with games of chance in the 1600s, when mathematicians like Pascal and Fermat studied dice and cards. Bayes and Laplace extended these ideas to inverse probability -- reasoning backwards from observations to causes. This idea lay dormant for over a century before experiencing a dramatic revival in the twentieth century.
Why the "naive" assumption is both technically wrong and practically brilliant.
Naive Bayes is a probabilistic classifier. Instead of learning a decision boundary directly, it learns the probability distributions of each class and uses Bayes' theorem to compute the probability that a new data point belongs to each class. The class with the highest probability wins.
The word "naive" comes from a single, powerful assumption: all features are conditionally independent given the class label. This means that knowing the value of one feature tells you nothing about another feature, as long as you already know the class.
Imagine you are building a spam detector. Each email has features: the presence of words like "free", "win", "click", "meeting", "project". The naive assumption says:
Given that an email is spam, the probability of seeing "free" is independent of the probability of seeing "win". Each word contributes its own evidence independently.
Each word votes independently for spam or ham, then votes are combined
We ignore that "free" and "click" often appear together in spam
No iterative optimization needed -- just count frequencies
In real data, features are almost never truly independent. Words in emails are correlated. Medical symptoms cluster together. So why does Naive Bayes still perform so well?
Research by Domingos and Pazzani (1997) showed that Naive Bayes can be optimal even when the independence assumption is strongly violated. The key insight is that dependence among features does not necessarily degrade classification accuracy. What matters is whether the dependencies change which class has the highest posterior probability, not whether the posterior probabilities themselves are accurate.
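The "independent votes" picture above amounts to multiplying per-feature likelihoods into each class score. A minimal sketch, with entirely made-up word probabilities:

```python
# Hypothetical per-word likelihoods for each class (illustrative numbers only)
p_word_given_spam = {"free": 0.20, "win": 0.15, "meeting": 0.01}
p_word_given_ham = {"free": 0.01, "win": 0.01, "meeting": 0.10}

p_spam, p_ham = 0.3, 0.7  # class priors

email_words = ["free", "win"]

# Each word contributes its evidence independently (the naive assumption)
score_spam, score_ham = p_spam, p_ham
for w in email_words:
    score_spam *= p_word_given_spam[w]
    score_ham *= p_word_given_ham[w]

print(score_spam > score_ham)  # True: the spam score wins
```

Note that any correlation between "free" and "win" is simply ignored: each word multiplies in its own factor.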
The elegant formula that allows us to reason from evidence to causes.
Bayes' theorem relates the conditional probability of a hypothesis given evidence to the reverse conditional:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$
Each part has a name and a role:
Prior $P(y)$: Our initial belief about the class before seeing any features. For example, if 30% of emails are spam, then $P(\text{spam}) = 0.3$.
Likelihood $P(x \mid y)$: The probability of observing the features given the class. How likely are these particular words given that the email is spam?
Evidence $P(x)$: The total probability of observing these features across all classes. Acts as a normalizing constant.
Posterior $P(y \mid x)$: Our updated belief about the class after seeing the evidence. This is what we want to compute.
The joint likelihood $P(x_1, x_2, \ldots, x_n \mid y)$ is extremely hard to estimate -- it requires exponentially many parameters. The naive conditional independence assumption simplifies it to a product of individual likelihoods:

$$P(x_1, x_2, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$
Substituting back into Bayes' theorem:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}$$
Since the evidence $P(x_1, \ldots, x_n)$ is the same for all classes, it is just a normalizing constant. For classification, we only need to compare posteriors across classes, so we can write: $P(y \mid x) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$. The class with the largest unnormalized posterior wins.
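Numerically, the normalizing constant only rescales the class scores; it can never change the winner. A small sketch with hypothetical numbers for a single observed word, "free":

```python
# Hypothetical prior and per-class likelihood of the word "free"
prior = {"spam": 0.3, "ham": 0.7}
likelihood_free = {"spam": 0.20, "ham": 0.01}

# Unnormalized posteriors: P(y) * P(x | y)
unnorm = {c: prior[c] * likelihood_free[c] for c in prior}

# The evidence P(x) is the same for every class ...
evidence = sum(unnorm.values())

# ... so dividing by it changes the scale, never the argmax
posterior = {c: unnorm[c] / evidence for c in unnorm}
print(round(posterior["spam"], 2))  # 0.9
```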
See how prior beliefs change after observing evidence. Before seeing any words, we assume 30% of emails are spam (prior). After observing the word "free", the posterior shifts dramatically toward spam.
Three variants for different data types -- continuous, count-based, and binary.
When features are continuous (like height, weight, or temperature), we assume each feature follows a Gaussian (normal) distribution within each class. The likelihood for feature $x_i$ given class $y$ is:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
Where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $x_i$ for class $y$, estimated from training data.
Use cases: Iris dataset classification, medical diagnosis with continuous measurements, sensor data classification.
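As a sketch, the Gaussian likelihood is just the normal density evaluated with per-class statistics (the means and variances below are hypothetical):

```python
import math

def gaussian_likelihood(x, mu, var):
    # Normal density: P(x | y) with class-specific mean and variance
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class statistics for one continuous feature
mu_a, var_a = 1.5, 0.1  # class A
mu_b, var_b = 4.5, 0.5  # class B

x = 2.0  # new observation, much closer to class A's distribution
print(gaussian_likelihood(x, mu_a, var_a) > gaussian_likelihood(x, mu_b, var_b))  # True
```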
When features represent discrete counts (like word frequencies in a document), we use the multinomial distribution. The likelihood of feature $x_i$ given class $y$ is based on relative frequency:

$$P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$
Where $N_{yi}$ is the count of feature $i$ in class $y$, $N_y$ is the total count of all features in class $y$, $n$ is the number of features, and $\alpha$ is the smoothing parameter.
Use cases: Text classification with bag-of-words, document categorization, topic modeling.
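A sketch of the smoothed frequency estimate, using hypothetical word counts for the spam class:

```python
# P(word | class) = (N_yi + alpha) / (N_y + alpha * n)
alpha = 1.0
counts_spam = {"free": 30, "win": 20, "meeting": 0}  # hypothetical counts

N_y = sum(counts_spam.values())  # total word occurrences in spam (= 50)
n = len(counts_spam)             # number of features in the vocabulary

p = {w: (c + alpha) / (N_y + alpha * n) for w, c in counts_spam.items()}

# Smoothing keeps the never-seen word's probability nonzero
print(round(p["meeting"], 4))  # 0.0189
```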
When features are binary (present or absent), each feature follows a Bernoulli distribution. The likelihood explicitly models both presence and absence of a feature:

$$P(x_i \mid y) = P(i \mid y)\, x_i + \bigl(1 - P(i \mid y)\bigr)(1 - x_i)$$
Unlike Multinomial NB, Bernoulli NB penalizes the absence of features that are typically present in a class. If the word "free" usually appears in spam but is missing from an email, that counts as evidence against spam.
Use cases: Document classification with binary term presence, short text classification, feature selection tasks.
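The presence/absence bookkeeping can be sketched in one line (the 0.75 below is a hypothetical estimate):

```python
def bernoulli_likelihood(present, p_feature_given_class):
    # Presence contributes p; absence contributes (1 - p)
    return p_feature_given_class if present else 1.0 - p_feature_given_class

p_free_given_spam = 0.75  # hypothetical: "free" appears in 75% of spam

# An email WITHOUT "free" still contributes a factor -- evidence against spam
print(bernoulli_likelihood(False, p_free_given_spam))  # 0.25
print(bernoulli_likelihood(True, p_free_given_spam))   # 0.75
```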
Continuous features. Assumes normal distribution per class. Fast and effective for numerical data.
Count features (word frequencies). Best choice for text classification with TF or TF-IDF vectors.
Binary features (0/1). Penalizes missing features. Good for short text or boolean data.
Below, two Gaussian curves represent the distribution of a single feature for two different classes. Where the curves overlap, classification is uncertain. The decision boundary falls at the crossover point.
No gradient descent, no iterations -- just counting and computing statistics.
Training a Naive Bayes classifier is remarkably simple compared to most machine learning algorithms. There is no loss function to minimize, no gradient descent, and no learning rate to tune. The entire process is a single pass through the data.
Count how many training samples belong to each class and divide by the total number of samples.
For each feature and each class, estimate the probability distribution. For Gaussian NB, compute the mean and variance. For Multinomial NB, compute word frequencies. For Bernoulli NB, compute the fraction of samples where the feature is present.
That is it. The model is trained. All the information needed for prediction is stored in the priors and the class-conditional distributions. There are no weights to optimize iteratively.
Most classifiers like logistic regression, SVMs, and neural networks require iterative optimization. They repeatedly pass through the data, compute gradients, and adjust weights over hundreds or thousands of iterations.
Naive Bayes skips all of this. Its parameters are derived directly from summary statistics of the data: counts, means, and variances. This makes it one of the fastest classifiers to train, with a time complexity of $O(n \cdot d)$ where $n$ is the number of samples and $d$ is the number of features.
On a dataset with 100,000 samples and 10,000 features, Naive Bayes can be trained in under a second. Logistic regression might take minutes. A deep neural network could take hours. This makes Naive Bayes the go-to choice when you need a quick baseline or when data arrives in real-time streams.
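The two training steps really are just counting. A from-scratch sketch on a toy dataset of three made-up emails:

```python
from collections import Counter

# Toy labeled dataset: (words in email, label)
data = [
    (["free", "win"], "spam"),
    (["free", "cash"], "spam"),
    (["meeting", "agenda"], "ham"),
]

# Step 1: priors from class frequencies
class_counts = Counter(label for _, label in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Step 2: per-class feature counts (the class-conditional statistics)
word_counts = {c: Counter() for c in class_counts}
for words, label in data:
    word_counts[label].update(words)

# That's the whole model: no loss function, no gradients
print(priors["spam"])               # 2/3 of the training emails are spam
print(word_counts["spam"]["free"])  # 2
```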
Maximum A Posteriori estimation and the log-trick for numerical stability.
To classify a new data point, we compute the posterior probability for each class and pick the class with the highest value. This is called the MAP decision rule:

$$\hat{y} = \underset{y}{\arg\max}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
We iterate over all possible classes, compute the unnormalized posterior for each, and select the winner. The denominator $P(x)$ is omitted because it is identical for all classes and does not affect the argmax.
In practice, multiplying many small probabilities together leads to numerical underflow -- the result becomes so small that computers round it to zero. The solution is to work in log-space:

$$\hat{y} = \underset{y}{\arg\max} \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \right]$$
Since the logarithm is a monotonically increasing function, the argmax is preserved. Products become sums, and small probabilities become manageable negative numbers.
Multiplying many small probabilities leads to numerical underflow. With just 1000 features each having probability 0.01, the product is $10^{-2000}$ -- far below the smallest number a 64-bit float can represent ($\approx 10^{-308}$). The log-trick prevents this entirely.
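The underflow is easy to reproduce, and the log-space fix with it:

```python
import math

probs = [0.01] * 1000  # 1000 features, each with likelihood 0.01

# Naive product: underflows past the smallest float to exactly 0.0
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Log-space sum: a perfectly representable negative number
log_total = sum(math.log(p) for p in probs)
print(round(log_total, 1))  # -4605.2
```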
Suppose we have two classes (A and B) and two features ($x_1$, $x_2$). Given a new point with $x_1 = 1$ and $x_2 = 0$, we compute the unnormalized posterior for each class and select the larger one.
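The concrete numbers for this example are not reproduced here, so the sketch below fills in hypothetical priors and likelihoods to show the mechanics:

```python
import math

# Hypothetical parameters (all numbers assumed for illustration)
prior = {"A": 0.6, "B": 0.4}
p_x1 = {"A": 0.7, "B": 0.2}  # P(x1 = 1 | class)
p_x2 = {"A": 0.1, "B": 0.5}  # P(x2 = 1 | class)

# Observation: x1 = 1, x2 = 0 (absence contributes 1 - p, as in Bernoulli NB)
log_post = {
    c: math.log(prior[c]) + math.log(p_x1[c]) + math.log(1.0 - p_x2[c])
    for c in prior
}
print(max(log_post, key=log_post.get))  # A
```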
Solving the zero-frequency problem that can silently destroy your classifier.
Consider a text classifier. If the word "cryptocurrency" never appeared in any spam email during training, then $P(\text{cryptocurrency} \mid \text{spam}) = 0$. Since we multiply all feature probabilities together, this single zero zeroes out the entire product, regardless of how much other evidence points to spam.
A single unseen feature can override thousands of other features. The model becomes 100% confident in the wrong direction because of a missing data point.
The fix is simple but essential. We add a small count $\alpha$ to every feature count:

$$\hat{P}(x_i \mid y) = \frac{\text{count}(x_i, y) + \alpha}{\text{count}(y) + \alpha |V|}$$
Where $|V|$ is the number of possible values for the feature (vocabulary size for text), and $\alpha$ is the smoothing parameter. Setting $\alpha = 1$ is known as Laplace smoothing; $0 < \alpha < 1$ is Lidstone smoothing.
Without smoothing, a single unseen feature can completely override all other evidence, making the classifier predict the wrong class with 100% confidence. This is one of the most common bugs in Naive Bayes implementations and can be extremely difficult to debug because the model appears to work correctly on most test cases.
Smoothing has an elegant Bayesian interpretation. Adding $\alpha$ to each count is equivalent to placing a Dirichlet prior on the multinomial probability parameters. With $\alpha = 1$, this is a uniform prior, saying we believe all feature values are equally likely before seeing the data.
As we observe more data, the influence of the prior diminishes and the estimates converge to the true frequencies. With small datasets, the prior has a larger stabilizing effect. With large datasets, it becomes negligible.
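The shrinking influence of the prior is easy to see numerically. With a hypothetical 1,000-word vocabulary and a word observed at a 10% rate:

```python
alpha, V = 1.0, 1000  # Laplace smoothing over a 1000-value vocabulary

def smoothed(count, total):
    # (count + alpha) / (total + alpha * |V|)
    return (count + alpha) / (total + alpha * V)

# Small sample: the estimate is pulled hard toward the uniform prior
print(smoothed(10, 100))  # 0.01, far below the raw frequency of 0.1
# Large sample: the prior barely matters any more
print(round(smoothed(10_000, 100_000), 3))  # 0.099
```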
The most famous application of Naive Bayes: filtering your inbox.
Let us walk through exactly how a Naive Bayes spam filter works, from training to prediction.
Collect all unique words from the training emails. Optionally remove stop words ("the", "is", "a") and apply stemming. This gives us our feature set.
Count spam and ham emails. If 300 out of 1000 emails are spam: $P(\text{spam}) = 0.3$ and $P(\text{ham}) = 0.7$.
For each word in the vocabulary, compute $P(\text{word} \mid \text{spam})$ and $P(\text{word} \mid \text{ham})$ using frequency counts with Laplace smoothing.
For a new email, compute the log-posterior for both spam and ham by summing log-priors and log-likelihoods for each word present. The class with the higher score wins.
Suppose our trained model has these probabilities (with smoothing applied):
The chart below shows how different words have different probabilities under the spam and ham classes. Words like "free" and "money" are strong spam indicators, while "meeting" and "project" are strong ham indicators.
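Putting the four steps together from scratch, with hypothetical smoothed probabilities in the spirit of the chart:

```python
import math

p_spam, p_ham = 0.3, 0.7  # priors from step 2

# Hypothetical smoothed likelihoods: (P(word|spam), P(word|ham))
p_word = {
    "free":    (0.050, 0.001),
    "money":   (0.040, 0.002),
    "meeting": (0.001, 0.030),
    "project": (0.002, 0.040),
}

def classify(words):
    # Step 4: sum log-prior and log-likelihoods; the higher score wins
    s = math.log(p_spam) + sum(math.log(p_word[w][0]) for w in words)
    h = math.log(p_ham) + sum(math.log(p_word[w][1]) for w in words)
    return "spam" if s > h else "ham"

print(classify(["free", "money"]))       # spam
print(classify(["meeting", "project"]))  # ham
```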
Despite its simplicity, Naive Bayes was the backbone of email spam filters for over a decade. Paul Graham's 2002 essay "A Plan for Spam" popularized the Bayesian approach, and SpamAssassin, one of the most widely deployed spam filters, uses a Bayesian classifier at its core. Even today, many production systems start with Naive Bayes as the first line of defense.
Understanding when Naive Bayes shines and when it struggles.
The fundamental assumption of Naive Bayes is that features are conditionally independent given the class. In mathematical terms:

$$P(x_1, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$
This assumption is almost never true in practice. In text, words are heavily correlated ("New" and "York" appear together). In medical data, symptoms cluster together. Yet the classifier often works well regardless.
The independence assumption causes the most damage when features are strongly correlated, so the same evidence is effectively counted multiple times, and when the predictive signal lies in interactions between features rather than in individual features.
Single pass through the data. No iterative optimization, no learning rate tuning. Trains in $O(n \cdot d)$ time.
Few parameters to estimate means it needs far less training data than complex models. Excellent when data is scarce.
Scales gracefully to thousands of features. Text classification with 50,000-word vocabularies is no problem.
Quick to implement and hard to beat for many problems. If Naive Bayes works, you may not need anything more complex.
The only parameter to tune is the smoothing factor alpha. No hidden layers, no regularization strength, no kernel choice.
Features are rarely independent in practice. Correlated features get their evidence double-counted.
While class rankings are often correct, the actual probability values can be wildly miscalibrated. Do not trust the confidence scores.
Cannot model interactions between features. If two features together are predictive but individually are not, Naive Bayes will miss this.
Performance heavily depends on how features are represented. Continuous features must follow the assumed distribution (Gaussian) for good results.
Real-world domains where Naive Bayes delivers production-grade results.
The classic application. Classifying emails as spam or ham based on word frequencies remains one of the most successful NB deployments.
Determining whether a product review, tweet, or comment is positive, negative, or neutral. Multinomial NB works especially well here.
Predicting disease presence from symptoms and test results. Gaussian NB is commonly used when features are continuous measurements.
Automatically sorting documents into categories (legal, financial, technical) based on their content. Powers many enterprise content management systems.
When predictions must be made in microseconds (fraud detection, ad serving), the speed of NB prediction makes it invaluable.
Collaborative filtering approaches can use NB to predict user preferences based on past behavior and item features.
From scikit-learn basics to a complete text classification pipeline.
The simplest way to get started with Naive Bayes in Python. Gaussian NB is ideal when features are continuous and approximately normally distributed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Train Gaussian Naive Bayes
model = GaussianNB()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
The go-to choice for text classification. Combine with CountVectorizer or TfidfVectorizer for a complete pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Sample training data
emails = [
    "Free money click here now",
    "Win a brand new iPhone today",
    "Cheap discount offer limited time",
    "Meeting scheduled for tomorrow morning",
    "Project deadline is next Friday",
    "Please review the attached report",
    "Get rich quick with this method",
    "Quarterly budget review meeting agenda",
]
labels = [1, 1, 1, 0, 0, 0, 1, 0] # 1=spam, 0=ham
# Build a pipeline: vectorize text then classify
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB(alpha=1.0))
])
# Train
pipeline.fit(emails, labels)
# Predict on new emails
new_emails = [
    "Free discount click now to win",
    "Team meeting about project deadline"
]
predictions = pipeline.predict(new_emails)
print(predictions) # [1, 0] = [spam, ham]
Best when features are binary indicators (word present or absent). It explicitly penalizes the absence of features that are typical for a class.
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
# Binarize the word counts (reusing emails, labels, and new_emails from above)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)
# Train Bernoulli NB
model = BernoulliNB(alpha=1.0)
model.fit(X, labels)
# Predict
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
print(predictions)
A production-ready example with proper train/test split, cross-validation, and detailed evaluation metrics.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
# Load the 20 Newsgroups dataset (subset)
categories = ['sci.med', 'sci.space', 'rec.sport.hockey']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
# Build pipeline with TF-IDF and Multinomial NB
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        stop_words='english',
        max_features=10000
    )),
    ('nb', MultinomialNB(alpha=0.1))
])
# Cross-validation on training set
cv_scores = cross_val_score(pipeline, train.data, train.target, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Train on full training set and evaluate on test set
pipeline.fit(train.data, train.target)
y_pred = pipeline.predict(test.data)
print(classification_report(
    test.target, y_pred,
    target_names=test.target_names
))
The alpha parameter controls the amount of Laplace smoothing. The default is 1.0, but values like 0.1 or 0.01 often work better in practice. Use cross-validation to find the optimal value for your dataset. Lower alpha values give more weight to the observed data, while higher values impose a stronger uniform prior.
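One way to tune alpha is a grid search over the whole pipeline. A sketch on a tiny made-up corpus so it runs end to end (in practice you would search on your real training data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus (hypothetical) to keep the example self-contained
emails = [
    "free money now", "win a prize today", "cheap offer click here",
    "discount deal act fast", "meeting tomorrow morning",
    "project report attached", "budget review agenda",
    "please schedule the meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1=spam, 0=ham

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# Search candidate smoothing values; cv=2 keeps both classes in each fold
grid = GridSearchCV(pipeline, param_grid={'nb__alpha': [0.01, 0.1, 1.0]}, cv=2)
grid.fit(emails, labels)
print(grid.best_params_['nb__alpha'])
```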