Logistic Regression
Interview Questions
15 commonly asked interview questions on logistic regression with detailed answers. Master the sigmoid, log loss, and decision boundaries to ace your next ML interview.
Logistic regression is a supervised learning algorithm used for classification tasks. Despite the word "regression" in its name, it predicts discrete class labels rather than continuous values. It models the probability that a given input belongs to a particular class by applying the sigmoid function to a linear combination of the input features.
You use logistic regression when you need to classify observations into two or more categories. Common use cases include spam detection, disease diagnosis, customer churn prediction, and credit risk assessment. It is often the first classification algorithm tried because of its simplicity, interpretability, and strong performance on linearly separable data.
The sigmoid function, also called the logistic function, is defined as sigma(z) = 1 / (1 + e^(-z)), where z is the linear combination of weights and input features (z = w^T x + b). It maps any real-valued number to the range (0, 1), making its output interpretable as a probability.
The sigmoid is used in logistic regression because it provides a smooth, differentiable transformation that converts unbounded linear outputs into valid probabilities. Its S-shaped curve naturally squashes extreme values toward 0 or 1 while maintaining sensitivity near the decision boundary at z = 0 where sigma(z) = 0.5. This differentiability is essential for gradient-based optimization during training.
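The properties above are easy to verify numerically. A minimal sketch in NumPy:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# At the decision boundary z = 0, the output is exactly 0.5.
print(sigmoid(0.0))  # 0.5

# Extreme inputs saturate toward 0 and 1, keeping outputs valid probabilities.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```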
Linear regression predicts continuous numerical values and minimizes the mean squared error between predictions and actual values. Logistic regression predicts the probability of class membership and minimizes the log loss (binary cross-entropy). While both use a linear combination of features internally, logistic regression wraps that linear output with the sigmoid function to produce probabilities.
Another key difference lies in the output interpretation. Linear regression outputs can be any real number, whereas logistic regression outputs are bounded between 0 and 1. Linear regression assumes a Gaussian error distribution and a linear relationship between features and the target. Logistic regression models a linear relationship between features and the log-odds of the target class, making no assumption about the distribution of features themselves.
A decision boundary is the surface that separates the feature space into regions corresponding to different predicted classes. In binary logistic regression, it is the set of points where the predicted probability equals 0.5, which corresponds to the linear equation w^T x + b = 0. Points on one side are classified as class 1 and points on the other side as class 0.
Because the underlying model is linear, the decision boundary of standard logistic regression is always a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions). However, by engineering polynomial or interaction features, you can create non-linear decision boundaries while still using the logistic regression framework. The threshold of 0.5 is a default and can be adjusted to trade off between precision and recall.
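The feature-engineering trick can be demonstrated on a dataset that is not linearly separable. This sketch (using scikit-learn's `make_circles` as a hypothetical example dataset) compares a plain linear boundary against one built on degree-2 polynomial features:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric circles: no straight line can separate the two classes.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# Standard logistic regression: linear boundary, near-chance accuracy here.
linear = LogisticRegression().fit(X, y)

# Same model on squared/interaction features: a circular boundary emerges.
poly = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print("linear features accuracy   :", linear.score(X, y))
print("quadratic features accuracy:", poly.score(X, y))
```

The second model is still logistic regression; the non-linearity lives entirely in the engineered features.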
The most common evaluation metrics for logistic regression include accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC-ROC). Accuracy measures the overall percentage of correct predictions but can be misleading with imbalanced classes. Precision measures how many predicted positives are actually positive, while recall measures how many actual positives were correctly identified.
The F1 score is the harmonic mean of precision and recall, providing a balanced measure when both matter. AUC-ROC evaluates the model across all possible classification thresholds by plotting the true positive rate against the false positive rate. Log loss (binary cross-entropy) is also important because it penalizes confident wrong predictions more heavily than uncertain ones, and it is the actual objective function that logistic regression optimizes during training.
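All of these metrics are available in scikit-learn. A small sketch with made-up labels and predicted probabilities (the values are illustrative, not from a real model):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]               # ground-truth labels
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]  # model probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]    # default 0.5 threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
# AUC-ROC and log loss are computed from probabilities, not hard labels.
print("auc-roc  :", roc_auc_score(y_true, y_prob))
print("log loss :", log_loss(y_true, y_prob))
```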
The log loss, also called binary cross-entropy, is the cost function used to train logistic regression. For a single sample, it is defined as: L = -[y * log(p) + (1 - y) * log(1 - p)], where y is the true label (0 or 1) and p is the predicted probability. When y = 1, only the first term is active, and the loss increases as p approaches 0. When y = 0, only the second term is active, and the loss increases as p approaches 1.
The overall cost function is the average log loss across all training samples: J = -(1/n) * sum[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]. This function is convex for logistic regression, guaranteeing that gradient descent will find the global minimum. Mean squared error is not used because applying it with the sigmoid function creates a non-convex surface with many local minima, making optimization unreliable.
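The asymmetry of the penalty is easy to see numerically. A minimal implementation of the average log loss:

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Average log loss J = -(1/n) * sum[y*log(p) + (1-y)*log(1-p)]."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident wrong prediction (y=1 but p=0.01) is penalized far more
# heavily than an uncertain one (y=1, p=0.4).
print(binary_cross_entropy([1], [0.01]))  # ~4.61
print(binary_cross_entropy([1], [0.4]))   # ~0.92
```

Production implementations additionally clip `p` away from exactly 0 and 1 to avoid `log(0)`.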
Regularization adds a penalty term to the log loss cost function to prevent overfitting by discouraging large weight values. L2 regularization (Ridge) adds the sum of squared weights: J_reg = J + (lambda/2) * sum(w_i^2). This shrinks all coefficients toward zero but rarely sets any to exactly zero. L1 regularization (Lasso) adds the sum of absolute weights: J_reg = J + lambda * sum(|w_i|). This can drive some coefficients to exactly zero, performing automatic feature selection.
In scikit-learn, the regularization strength is controlled by the parameter C, which is the inverse of lambda (C = 1/lambda). A smaller C means stronger regularization. Elastic Net combines both L1 and L2 penalties, offering a balance between feature selection and coefficient shrinkage. Regularization is especially important when the number of features is large relative to the number of samples, or when features are highly correlated.
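The sparsity-inducing behavior of L1 can be checked directly in scikit-learn. A sketch on synthetic data where only a few features are informative (C=0.1, i.e. fairly strong regularization, chosen here for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, only 5 informative: L1 should zero out many of the rest.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

l2 = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)
# L1 requires a compatible solver such as liblinear or saga.
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print("exactly-zero coefficients (L2):", int(np.sum(l2.coef_ == 0)))
print("exactly-zero coefficients (L1):", int(np.sum(l1.coef_ == 0)))
```

L2 shrinks coefficients but leaves them non-zero; L1 performs implicit feature selection by zeroing some out.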
There are two main strategies for extending logistic regression to multiclass problems. The One-vs-Rest (OvR) approach trains K separate binary classifiers, one per class, where each classifier distinguishes one class from all others. At prediction time, the class whose classifier outputs the highest probability is chosen. This is simple and works well but can produce uncalibrated probabilities that do not sum to one.
The Multinomial (Softmax) approach extends logistic regression natively to K classes by replacing the sigmoid with the softmax function: P(y = k | x) = exp(w_k^T x) / sum_j(exp(w_j^T x)). This jointly trains all K sets of weights and produces properly calibrated probabilities that sum to one across all classes. Scikit-learn supports both strategies via the multi_class parameter. Softmax is generally preferred when classes are mutually exclusive and you need valid probability distributions.
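The softmax formula above can be sketched in a few lines. The max-subtraction step is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(scores):
    """Turn K linear scores (w_k^T x) into probabilities that sum to 1."""
    # Subtracting the max prevents overflow in exp(); the ratio is unchanged.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Three-class example: the largest score gets the largest probability.
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, "sum =", probs.sum())
```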
Logistic regression assumes a linear relationship between the independent features and the log-odds (logit) of the dependent variable. It does not require a linear relationship between features and the raw probability, but the log-odds must be a linear function of the features. It also assumes that observations are independent of each other, meaning the outcome for one sample does not influence another.
The model assumes little to no multicollinearity among the independent variables. High multicollinearity inflates the variance of coefficient estimates and makes them unstable. Logistic regression also requires a sufficiently large sample size, with a common guideline being at least 10 to 20 events per predictor variable. Unlike linear regression, it does not assume normally distributed errors, constant variance (homoscedasticity), or normally distributed features.
In logistic regression, the model predicts the log-odds: log(p / (1 - p)) = w_0 + w_1*x_1 + w_2*x_2 + ... A coefficient w_i represents the change in log-odds of the outcome for a one-unit increase in feature x_i, holding all other features constant. To convert this to an odds ratio, you exponentiate the coefficient: OR = exp(w_i).
An odds ratio greater than 1 means that a one-unit increase in the feature increases the odds of the positive class. An odds ratio less than 1 means it decreases the odds. An odds ratio of exactly 1 means the feature has no effect. For example, if the coefficient for "years of experience" is 0.3, the odds ratio is exp(0.3) = 1.35, meaning each additional year of experience multiplies the odds of the positive outcome by 1.35, or increases them by about 35%.
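The worked example above is a one-liner to verify:

```python
import numpy as np

coef = 0.3                    # fitted coefficient for "years of experience"
odds_ratio = np.exp(coef)     # OR = exp(w_i)
print(round(odds_ratio, 2))   # 1.35: each extra year multiplies the odds by ~1.35

# A negative coefficient gives an odds ratio below 1 (odds decrease).
print(round(float(np.exp(-0.3)), 2))
```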
Starting with the log loss for a single sample: L = -[y * log(sigma(z)) + (1 - y) * log(1 - sigma(z))], where z = w^T x + b and sigma(z) = 1 / (1 + e^(-z)). A crucial property of the sigmoid is that its derivative is sigma'(z) = sigma(z) * (1 - sigma(z)). Using the chain rule, dL/dz = sigma(z) - y, which is elegantly the difference between the prediction and the true label.
To find the gradient with respect to the weights, apply the chain rule again: dL/dw_j = (dL/dz) * (dz/dw_j) = (sigma(z) - y) * x_j. For the full dataset, the gradient of the cost function is: dJ/dw_j = (1/n) * sum_i[(sigma(z_i) - y_i) * x_ij]. This is the same form as the gradient in linear regression with MSE loss, except the prediction is sigma(z) instead of z. The gradient descent update rule becomes: w_j = w_j - alpha * (1/n) * sum_i[(sigma(z_i) - y_i) * x_ij].
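The derivation above translates directly into a from-scratch training loop. A minimal sketch (batch gradient descent on a toy 1-D separable dataset; the hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent on the log loss. Bias b is kept separate."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)           # predictions sigma(z_i)
        error = p - y                    # dL/dz = sigma(z) - y
        w -= alpha * (X.T @ error) / n   # dJ/dw_j = (1/n) sum (sigma(z_i)-y_i) x_ij
        b -= alpha * error.mean()        # same gradient with x_ij = 1
    return w, b

# Toy separable dataset: label is 1 exactly when the feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)

w, b = fit_logistic(X, y)
acc = ((sigmoid(X @ w + b) >= 0.5) == y).mean()
print("train accuracy:", acc)  # high accuracy on this separable data
```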
Maximum Likelihood Estimation (MLE) finds the weights that maximize the probability of observing the training data: w_MLE = argmax_w product_i P(y_i | x_i, w). Taking the negative log converts this to minimizing the log loss. MLE has no regularization and can overfit when the number of features is large or the data is linearly separable, because it will drive weights toward infinity to achieve perfect separation.
Maximum A Posteriori (MAP) estimation incorporates a prior distribution over the weights: w_MAP = argmax_w P(w | data) = argmax_w [P(data | w) * P(w)]. A Gaussian prior P(w) ~ N(0, 1/lambda) leads to L2 regularization, and a Laplace prior P(w) ~ Laplace(0, 1/lambda) leads to L1 regularization. MAP estimation is equivalent to adding the regularization penalty to the MLE objective. The regularization strength lambda corresponds to the precision (inverse variance) of the prior, providing a principled Bayesian interpretation of why regularization works.
When classes are imbalanced, standard logistic regression tends to be biased toward the majority class because the overall loss is dominated by the more frequent class. The primary technique to address this is class weighting, where each sample's contribution to the loss is scaled by the inverse class frequency. In scikit-learn, setting class_weight='balanced' automatically assigns weights proportional to n_samples / (n_classes * n_class_samples), giving minority class errors more influence during training.
Beyond class weighting, you can adjust the classification threshold. Instead of the default 0.5, you can lower the threshold to increase recall for the minority class at the cost of precision. Resampling techniques like SMOTE (oversampling the minority) or random undersampling (reducing the majority) can also help. Stratified cross-validation ensures that each fold maintains the original class distribution. Evaluating with metrics like AUC-ROC, precision-recall AUC, and F1 score is critical because accuracy alone is misleading with imbalanced data.
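Both remedies, class weighting and threshold tuning, can be sketched in scikit-learn on a synthetic 95/5 imbalanced dataset (the 0.2 threshold below is an arbitrary illustration, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Heavily imbalanced binary problem: ~5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

print("minority recall, unweighted:", recall_score(y, plain.predict(X)))
print("minority recall, balanced  :", recall_score(y, balanced.predict(X)))

# Alternative: keep the unweighted model but lower the decision threshold.
probs = plain.predict_proba(X)[:, 1]
print("minority recall, threshold 0.2:", recall_score(y, probs >= 0.2))
```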
Logistic regression is mathematically identical to a neural network with no hidden layers -- a single neuron with a sigmoid activation function. The neuron takes the weighted sum of inputs (z = w^T x + b), applies the sigmoid activation, and outputs a probability. The binary cross-entropy loss and gradient descent training procedure are exactly the same. This makes logistic regression the simplest possible neural network architecture.
When you stack multiple logistic regression units into layers and connect them, you get a feedforward neural network (multilayer perceptron). The key difference is that neural networks with hidden layers can learn non-linear decision boundaries through composed non-linear transformations, while a single logistic regression unit can only learn a linear boundary. The backpropagation algorithm used to train deep networks is a generalization of the gradient computation used in logistic regression. Understanding logistic regression deeply is therefore foundational to understanding all of deep learning.
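The equivalence is not just conceptual: a hand-written single sigmoid neuron, loaded with the weights from a fitted scikit-learn model, reproduces its predicted probabilities exactly. A sketch on the built-in breast cancer dataset (standardized so the linear scores stay in a numerically safe range):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def single_neuron(x, w, b):
    """One neuron: weighted sum z = w^T x + b, then sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

manual = single_neuron(X, clf.coef_.ravel(), clf.intercept_[0])
# Matches sklearn's probability for the positive class, sample by sample.
print(np.allclose(manual, clf.predict_proba(X)[:, 1]))
```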
Logistic regression often outperforms complex models when the true decision boundary is approximately linear, training data is limited, or the feature space is high-dimensional and sparse. In text classification with bag-of-words features, for example, logistic regression frequently matches or beats random forests and gradient boosting because the data lives in a very high-dimensional sparse space where linear models excel. With small datasets, complex models are prone to overfitting, while logistic regression's simplicity provides a strong inductive bias.
Logistic regression also wins when interpretability is a requirement. In healthcare, finance, and legal domains, stakeholders need to understand why a prediction was made. The clear coefficient-to-odds-ratio mapping in logistic regression provides transparent explanations that black-box models cannot offer. Additionally, in production systems where latency and computational cost matter, logistic regression's O(d) prediction time and minimal memory footprint make it preferable to ensemble methods. When the bias-variance tradeoff favors lower variance, simpler models like logistic regression are the better choice.