Naive Bayes
Interview Questions
15 commonly asked interview questions with detailed answers.
Naive Bayes is a probabilistic classifier based on Bayes' theorem. It is called "naive" because it makes the simplifying assumption that all features are conditionally independent given the class label. In other words, the algorithm assumes that knowing the value of one feature tells you nothing about the value of another feature, as long as you already know the class.
Despite this strong and often unrealistic assumption, Naive Bayes performs remarkably well in practice, especially for text classification tasks like spam detection and sentiment analysis. Its simplicity, speed, and surprising effectiveness have made it a go-to baseline classifier in machine learning.
Bayes' theorem describes how to update our beliefs about something (the hypothesis) after observing new evidence. Mathematically, it relates the posterior probability P(H|E) to the prior probability P(H), the likelihood P(E|H), and the evidence P(E). The formula is: P(H|E) = P(E|H) * P(H) / P(E).
In the context of classification, Bayes' theorem answers the question: "Given the features I observe, what is the probability of each class?" The prior captures how common each class is before seeing any features, the likelihood captures how probable the observed features are under each class, and the evidence acts as a normalizing constant to ensure probabilities sum to one.
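As a minimal sketch with made-up numbers (a 30% spam prior and hypothetical likelihoods for the word "free"), the prior, likelihood, evidence, and posterior fit together like this:

```python
# Worked Bayes' theorem example with illustrative, assumed numbers:
# P(spam) = 0.3, P("free" | spam) = 0.6, P("free" | not spam) = 0.05.
p_spam = 0.3
p_not_spam = 0.7
p_free_given_spam = 0.6
p_free_given_not_spam = 0.05

# Evidence P("free") via the law of total probability -- the normalizer.
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * p_not_spam

# Posterior P(spam | "free") = P("free" | spam) * P(spam) / P("free")
posterior = p_free_given_spam * p_spam / p_free
print(round(posterior, 3))  # 0.837
```

Observing "free" lifts the spam probability from the 0.3 prior to roughly 0.84 posterior, which is exactly the belief update Bayes' theorem formalizes.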
Gaussian Naive Bayes assumes that continuous features follow a normal (Gaussian) distribution within each class. For each feature and class, it estimates the mean and variance from training data, then uses the Gaussian probability density function to compute likelihoods. This variant is best suited for numerical data.
Multinomial Naive Bayes models feature counts or frequencies and is ideal for text classification where features represent word counts or TF-IDF values. Bernoulli Naive Bayes handles binary features indicating whether a feature is present or absent, making it useful for document classification with binary term occurrence indicators.
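The three variants differ only in how the class-conditional likelihood P(x_i|y) is computed. A sketch with illustrative parameters (all numbers assumed):

```python
import math

# Gaussian NB: likelihood from the normal pdf with per-class mean and variance.
def gaussian_likelihood(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Multinomial NB: smoothed word-frequency estimate from per-class counts.
def multinomial_likelihood(word_count, total_count, vocab_size, alpha=1.0):
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Bernoulli NB: presence contributes p, absence contributes (1 - p).
def bernoulli_likelihood(present, p):
    return p if present else 1.0 - p

print(round(gaussian_likelihood(1.0, 0.0, 1.0), 3))  # ~0.242, pdf one std dev out
```

The prediction rule (prior times product of likelihoods, argmax over classes) is identical across all three; only these likelihood functions change.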
The prior probability P(y) represents the probability of each class before observing any features. It is typically estimated from the training data as the proportion of samples belonging to each class. For example, if 30% of training emails are spam and 70% are not spam, then P(spam) = 0.3 and P(not spam) = 0.7.
The prior captures our baseline expectation about class frequencies. In cases where the dataset is imbalanced, the prior reflects that imbalance. Some implementations allow you to set custom priors to override the data-driven estimates, which can be useful when the training set distribution does not reflect the true population distribution.
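Estimating priors from label frequencies, and overriding them, can be sketched as follows (the 30/70 spam/ham split and the custom 50/50 prior are hypothetical):

```python
from collections import Counter

# Priors estimated from hypothetical training labels: 3 spam, 7 ham.
labels = ["spam"] * 3 + ["ham"] * 7
counts = Counter(labels)
n = len(labels)
priors = {c: counts[c] / n for c in counts}
print(priors)  # {'spam': 0.3, 'ham': 0.7}

# Custom priors override the data-driven estimate when the training set
# is not representative, e.g. if spam is believed to be half of real traffic.
custom_priors = {"spam": 0.5, "ham": 0.5}
```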
Training a Naive Bayes classifier only requires a single pass through the data to compute class priors and class-conditional feature statistics. For Gaussian NB, this means calculating means and variances per feature per class. For Multinomial NB, it means counting word occurrences per class. There is no iterative optimization loop like gradient descent.
The time complexity is O(n * d) where n is the number of samples and d is the number of features, making it one of the fastest classifiers available. This efficiency makes Naive Bayes an excellent choice for large-scale applications, high-dimensional text data, and situations where you need a quick baseline model before trying more complex algorithms.
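The single pass for Gaussian NB can be sketched as one loop over the n x d training matrix that accumulates per-class sums and sums of squares (toy data, assumed):

```python
# Single-pass Gaussian NB fit: O(n * d), no iterative optimization.
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 3.5], [3.2, 3.6]]
y = [0, 0, 1, 1]

stats = {}
for xi, yi in zip(X, y):
    if yi not in stats:
        stats[yi] = {"n": 0, "sum": [0.0] * len(xi), "sumsq": [0.0] * len(xi)}
    s = stats[yi]
    s["n"] += 1
    for j, v in enumerate(xi):
        s["sum"][j] += v
        s["sumsq"][j] += v * v

# Means and variances fall directly out of the accumulated sums.
means = {c: [s["sum"][j] / s["n"] for j in range(len(s["sum"]))]
         for c, s in stats.items()}
variances = {c: [s["sumsq"][j] / s["n"] - means[c][j] ** 2
                 for j in range(len(s["sum"]))]
             for c, s in stats.items()}
print(means[0])  # per-feature means for class 0
```

Class priors come from the same pass (each class's `n` divided by the total), so the entire fit is one sweep through the data.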
Laplace smoothing, also called additive smoothing, adds a small count alpha (typically 1) to every feature-class combination. Without smoothing, if a feature value was never observed with a particular class during training, the estimated probability would be zero. Since Naive Bayes computes the product of all feature probabilities, a single zero probability wipes out all other evidence and forces the entire posterior to zero, regardless of how strong the other features are.
The smoothed probability formula is: P(x_i|y) = (count(x_i, y) + alpha) / (count(y) + alpha * |V|), where |V| is the vocabulary size or number of possible feature values. When alpha equals 1, this is called Laplace smoothing. When alpha is less than 1, it is called Lidstone smoothing. This technique ensures that no probability is ever exactly zero and that the model can handle previously unseen feature-class combinations at prediction time.
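A quick numeric sketch of the formula, using an assumed vocabulary of 10,000 words and a word never seen in the spam class:

```python
# Laplace smoothing (alpha = 1) for an unseen feature-class combination.
alpha = 1.0
vocab_size = 10_000          # |V|, assumed vocabulary size
count_word_in_spam = 0       # word never observed with class "spam"
total_words_in_spam = 5_000  # total word count for the class, assumed

# Unsmoothed: exactly zero, which would wipe out the entire posterior product.
p_unsmoothed = count_word_in_spam / total_words_in_spam

# Smoothed: small but strictly positive.
p_smoothed = (count_word_in_spam + alpha) / (total_words_in_spam + alpha * vocab_size)
print(p_smoothed)  # 1/15000, roughly 6.67e-05
```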
Naive Bayes computes the product of many individual feature probabilities, each of which is less than 1. When you multiply hundreds or thousands of small numbers together -- for instance, in text classification with a large vocabulary -- the resulting product quickly approaches zero, causing numerical underflow where the computer can no longer represent the number accurately.
By taking logarithms, we convert the product into a sum: log(P(y)) + sum(log(P(x_i|y))). Sums of reasonable negative numbers are numerically stable and can be represented without precision issues. Importantly, the logarithm is a monotonically increasing function, so the class that maximizes the log-posterior is the same class that maximizes the original posterior. The argmax result is identical, but the computation is far more reliable.
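The underflow, and the log-space fix, can be demonstrated in a few lines (2,000 features with probability 0.01 each, chosen for illustration):

```python
import math

# 2,000 feature probabilities of 0.01 each: the raw product is 1e-4000,
# far below the smallest representable float, so it underflows to 0.0.
probs = [0.01] * 2000

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- numerical underflow

# The log-space sum is a perfectly representable negative number.
log_sum = sum(math.log(p) for p in probs)
print(round(log_sum, 2))  # about -9210.34
```

Since log is monotonic, comparing `log_sum` values across classes yields the same argmax as comparing the (unrepresentable) raw products.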
Naive Bayes is a generative model. It models the joint distribution P(x, y) by estimating the class-conditional likelihoods P(x|y) and class priors P(y), then uses Bayes' rule to compute the posterior P(y|x). In contrast, discriminative models like logistic regression model the posterior P(y|x) directly without ever modeling the feature distributions.
The generative approach has several practical advantages: it can handle missing features naturally by simply omitting them from the likelihood product, it can generate synthetic data by sampling from the learned distributions, and it works well with small training sets because it makes stronger assumptions. Discriminative models, on the other hand, tend to achieve higher classification accuracy with sufficient training data because they focus directly on the decision boundary rather than modeling the full data distribution.
Since Naive Bayes assumes all features are conditionally independent given the class, a missing feature can simply be omitted from the likelihood product during prediction. Mathematically, this is equivalent to marginalizing (integrating or summing) over all possible values of the missing feature. The classifier proceeds using only the features that are available.
This is a significant practical advantage over many other models that require complete feature vectors at prediction time. Models like SVMs, neural networks, or decision trees typically need imputation strategies (mean filling, zero filling, etc.) to handle missing data, which can introduce bias. Naive Bayes handles missingness naturally as a direct consequence of its generative formulation and independence assumption, without any additional preprocessing.
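A sketch of prediction with a missing feature, using assumed log-likelihoods (`None` marks the missing value):

```python
import math

# Toy log-priors and per-feature log-likelihoods (all numbers assumed).
log_prior = {"spam": math.log(0.3), "ham": math.log(0.7)}
log_likelihoods = {
    "spam": [math.log(0.6), None, math.log(0.2)],  # feature 1 is missing
    "ham":  [math.log(0.05), None, math.log(0.4)],
}

def score(cls):
    # Missing features are simply skipped -- equivalent to marginalizing
    # over their possible values under the independence assumption.
    s = log_prior[cls]
    for ll in log_likelihoods[cls]:
        if ll is not None:
            s += ll
    return s

prediction = max(log_prior, key=score)
print(prediction)  # spam
```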
Both Naive Bayes and logistic regression are linear classifiers that can be applied to the same types of problems. Naive Bayes is generative and estimates the class-conditional feature distributions P(x|y), while logistic regression is discriminative and directly models the decision boundary P(y|x). This fundamental difference drives their respective strengths and weaknesses.
Prefer Naive Bayes when training data is small, features are roughly independent, or you need extremely fast training and prediction. Prefer logistic regression when you have more training data, features are correlated, or you need well-calibrated probability outputs. Logistic regression generally achieves better accuracy as the dataset grows larger (lower asymptotic error), while Naive Bayes reaches its best performance faster with less data. In practice, Naive Bayes is often the first model you try as a baseline, and logistic regression is the next step up.
The key insight is that classification only requires the correct ranking of class posteriors, not accurate absolute probability values. Even when features are correlated and the individual probability estimates are biased, the relative ordering -- which class has the highest posterior -- often remains correct. The classifier can be a poor estimator of probabilities yet still be an excellent classifier.
Domingos and Pazzani (1997) showed theoretically that the optimality of the Naive Bayes classifier depends on feature dependencies distributing evenly across classes, not on strict independence. When correlations affect all classes similarly, they do not change which class wins the argmax. Additionally, in the log-sum formulation, individual biases from correlated features can partially cancel each other out, further preserving correct classification.
Naive Bayes and Maximum Entropy (logistic regression) are both special cases of linear classifiers that express the log-posterior as a linear function of the input features. They form a well-known generative-discriminative pair: Naive Bayes is the generative counterpart, and logistic regression is the discriminative counterpart. Both produce linear decision boundaries in feature space.
Ng and Jordan (2002) published an influential analysis showing that the generative model (Naive Bayes) converges faster -- it needs less training data to reach good performance -- but converges to a higher asymptotic error. The discriminative model (logistic regression) converges more slowly but ultimately achieves lower error with sufficient data. This means Naive Bayes is preferable in low-data regimes, while logistic regression wins when data is abundant. They are complementary approaches to the same underlying classification problem.
Several approaches exist to handle non-Gaussian continuous features. First, you can apply mathematical transformations such as log, square root, or Box-Cox to make the feature distributions more Gaussian before applying Gaussian NB. Second, you can discretize or bin the continuous features into categorical buckets and then use Multinomial NB, though this requires careful choice of bin boundaries.
Third, you can use kernel density estimation (KDE) instead of the Gaussian assumption for class-conditional distributions -- this is sometimes called Flexible Naive Bayes. KDE makes no parametric assumptions about the shape of the distribution. Fourth, you can use a mixed Naive Bayes approach that applies different distribution assumptions to different features based on their characteristics. In practice, Gaussian NB is surprisingly robust to moderate departures from normality, so these alternatives are mainly needed when features are heavily skewed or multimodal.
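The first approach can be sketched in a few lines: log-transforming a heavily right-skewed feature compresses its tail so a Gaussian fit is more reasonable (toy values, assumed; `log1p` handles zeros gracefully):

```python
import math

# A right-skewed feature spanning several orders of magnitude (assumed values).
skewed = [1.0, 2.0, 3.0, 10.0, 100.0, 1000.0]

# log1p(x) = log(1 + x): monotonic, so it preserves ordering while
# compressing the long right tail toward a more symmetric shape.
transformed = [math.log1p(x) for x in skewed]

# Gaussian NB statistics would then be estimated on the transformed values.
mean = sum(transformed) / len(transformed)
var = sum((t - mean) ** 2 for t in transformed) / len(transformed)
print(round(mean, 3), round(var, 3))
```

Because the transform is monotonic per feature, it changes the fitted likelihoods but not the meaning of the features themselves.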
Complement Naive Bayes, introduced by Rennie et al. in 2003, addresses the problem of imbalanced datasets where some classes have far more training examples than others. Instead of estimating the likelihood P(x|y) from examples belonging to class y, it estimates the complement: P(x|not y) -- the probability of features given all classes except y. The class with the lowest complement probability is selected as the prediction.
This approach is more robust to class imbalance because the complement set (all examples not in class y) is always larger than the class set itself, providing more stable and reliable probability estimates. The complement set smooths out the estimation noise that plagues minority classes with few training samples. Complement NB has been shown to consistently outperform standard Multinomial NB on text classification benchmarks, particularly when the class distribution is skewed.
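The complement estimate can be sketched directly from pooled counts. In this toy example (class names, word counts, and the two-word vocabulary are all assumed), the minority class has only one training occurrence of "free", but its complement estimate is pooled from the much larger majority class:

```python
# Toy per-class word counts (assumed): one rare class, one common class.
counts = {
    "rare_class":   {"free": 1,  "meeting": 0},
    "common_class": {"free": 10, "meeting": 90},
}
alpha, vocab_size = 1.0, 2  # Laplace smoothing over a 2-word vocabulary

def complement_prob(word, cls):
    # Pool smoothed counts from every class EXCEPT cls: P(word | not cls).
    num = sum(c[word] for y, c in counts.items() if y != cls) + alpha
    den = sum(sum(c.values()) for y, c in counts.items() if y != cls) + alpha * vocab_size
    return num / den

# Lower complement probability => word is more characteristic of the class.
print(complement_prob("free", "rare_class"))  # (10 + 1) / (100 + 2)
```

The denominator for the rare class's estimate is built from 100 pooled words rather than its own single observation, which is exactly the stability gain described above.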
Naive Bayes is ideal for real-time systems because prediction costs only O(c * d) for c classes and d features -- just d additions per class in log-probability space. For deployment, precompute and store all log-probabilities in a hash map for O(1) feature lookup. Use feature hashing (the hashing trick) to map an unbounded vocabulary into a fixed-size array, keeping memory usage constant regardless of how many unique tokens appear over time.
Implement online or incremental learning to update feature counts and class priors as new labeled data arrives, without requiring full retraining. This is straightforward because the model is based on counts: just increment the relevant counters. For performance-critical paths, use integer arithmetic with fixed-point log-probability representations instead of floating-point. Finally, add a user feedback loop where reported spam and false positives are fed back into the model as labeled examples, enabling continuous improvement and adaptation to evolving spam patterns.
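The incremental-update idea is straightforward to sketch, since the model is just counters (token lists and labels below are hypothetical examples):

```python
from collections import defaultdict

# Count-based state of a Multinomial NB spam filter: class counts and
# per-class word counts. Online learning = incrementing these counters.
class_counts = defaultdict(int)
word_counts = defaultdict(lambda: defaultdict(int))

def update(tokens, label):
    """Absorb one newly labeled message without retraining from scratch."""
    class_counts[label] += 1
    for tok in tokens:
        word_counts[label][tok] += 1

# New labeled messages arriving over time (hypothetical examples).
update(["win", "free", "prize"], "spam")
update(["meeting", "tomorrow"], "ham")
update(["free", "offer"], "spam")

print(class_counts["spam"], word_counts["spam"]["free"])  # 2 2
```

Priors and smoothed likelihoods are recomputed on the fly from these counters at prediction time, so user-reported spam and false positives can be folded in with a single `update` call each.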