
Logistic Regression
Interview Questions

15 commonly asked interview questions on logistic regression with detailed answers. Master the sigmoid, log loss, and decision boundaries to ace your next ML interview.

EASY What is logistic regression and when do you use it?

Logistic regression is a supervised learning algorithm used for classification tasks. Despite the word "regression" in its name, it predicts discrete class labels rather than continuous values. It models the probability that a given input belongs to a particular class by applying the sigmoid function to a linear combination of the input features.

You use logistic regression when you need to classify observations into two or more categories. Common use cases include spam detection, disease diagnosis, customer churn prediction, and credit risk assessment. It is often the first classification algorithm tried because of its simplicity, interpretability, and strong performance on linearly separable data.

Key Points
  • Classification algorithm that predicts probabilities using the sigmoid function
  • Outputs values between 0 and 1 representing class membership probability
  • Used for binary and multiclass classification problems
  • Highly interpretable and often serves as a strong baseline model
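
As a quick illustration, here is a minimal sketch using scikit-learn (assumed installed); the toy data below is made up, and any small labeled dataset works the same way:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: one feature, with classes separated around x = 0.
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[3.0]]))        # predicted class label
print(model.predict_proba([[3.0]]))  # [P(class 0), P(class 1)]
```

The predict_proba method exposes the underlying probabilities, which is what distinguishes logistic regression from classifiers that only output hard labels.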
EASY What is the sigmoid function and why is it used?

The sigmoid function, also called the logistic function, is defined as sigma(z) = 1 / (1 + e^(-z)), where z is the linear combination of weights and input features (z = w^T x + b). It maps any real-valued number to the range (0, 1), making its output interpretable as a probability.

The sigmoid is used in logistic regression because it provides a smooth, differentiable transformation that converts unbounded linear outputs into valid probabilities. Its S-shaped curve naturally squashes extreme values toward 0 or 1 while maintaining sensitivity near the decision boundary at z = 0 where sigma(z) = 0.5. This differentiability is essential for gradient-based optimization during training.

Key Points
  • Formula: sigma(z) = 1 / (1 + e^(-z))
  • Maps any real number to the interval (0, 1)
  • Output at z = 0 is exactly 0.5, forming the default decision boundary
  • Smooth and differentiable, enabling gradient-based optimization
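
The formula above can be sketched directly in Python (output values in the comments are rounded):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5 -- the default decision boundary
print(sigmoid(4))   # ~0.982, squashed toward 1
print(sigmoid(-4))  # ~0.018, squashed toward 0
```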
EASY How is logistic regression different from linear regression?

Linear regression predicts continuous numerical values and minimizes the mean squared error between predictions and actual values. Logistic regression predicts the probability of class membership and minimizes the log loss (binary cross-entropy). While both use a linear combination of features internally, logistic regression wraps that linear output with the sigmoid function to produce probabilities.

Another key difference lies in the output interpretation. Linear regression outputs can be any real number, whereas logistic regression outputs are bounded between 0 and 1. Linear regression assumes a Gaussian error distribution and a linear relationship between features and the target. Logistic regression models a linear relationship between features and the log-odds of the target class, making no assumption about the distribution of features themselves.

Key Points
  • Linear regression: continuous output; logistic regression: probability output (0 to 1)
  • Different loss functions -- MSE vs. log loss (cross-entropy)
  • Logistic regression applies the sigmoid to the linear combination
  • Logistic regression models log-odds as a linear function of features
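
The "sigmoid of the linear output" relationship can be checked numerically: the logit (log-odds) transform exactly undoes the sigmoid, recovering the linear score (the value 1.7 below is an arbitrary toy score):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 1.7                            # any linear output w^T x + b
p = sigmoid(z)                     # bounded probability in (0, 1)
log_odds = math.log(p / (1 - p))   # the logit transform
print(round(log_odds, 10))         # recovers 1.7
```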
EASY What is a decision boundary in logistic regression?

A decision boundary is the surface that separates the feature space into regions corresponding to different predicted classes. In binary logistic regression, it is the set of points where the predicted probability equals 0.5, which corresponds to the linear equation w^T x + b = 0. Points on one side are classified as class 1 and points on the other side as class 0.

Because the underlying model is linear, the decision boundary of standard logistic regression is always a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions). However, by engineering polynomial or interaction features, you can create non-linear decision boundaries while still using the logistic regression framework. The threshold of 0.5 is a default and can be adjusted to trade off between precision and recall.

Key Points
  • Defined where predicted probability equals 0.5 (w^T x + b = 0)
  • Always linear (hyperplane) for standard logistic regression
  • Non-linear boundaries possible with polynomial feature engineering
  • The 0.5 threshold can be adjusted based on application needs
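
A small sketch with hypothetical fitted weights shows both the boundary condition w^T x + b = 0 and the effect of moving the threshold:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted weights for a two-feature model.
w = [2.0, -1.0]
b = 0.5

def predict(x, threshold=0.5):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1 if sigmoid(z) >= threshold else 0

# A point exactly on the boundary: 2*0.25 - 1*1.0 + 0.5 = 0.
x_boundary = [0.25, 1.0]
print(sigmoid(w[0] * x_boundary[0] + w[1] * x_boundary[1] + b))  # 0.5

print(predict([1.0, 0.0]))                  # z = 2.5 -> class 1
print(predict([1.0, 0.0], threshold=0.95))  # stricter threshold -> class 0
```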
EASY What metrics are used to evaluate logistic regression?

The most common evaluation metrics for logistic regression include accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC-ROC). Accuracy measures the overall percentage of correct predictions but can be misleading with imbalanced classes. Precision measures how many predicted positives are actually positive, while recall measures how many actual positives were correctly identified.

The F1 score is the harmonic mean of precision and recall, providing a balanced measure when both matter. AUC-ROC evaluates the model across all possible classification thresholds by plotting the true positive rate against the false positive rate. Log loss (binary cross-entropy) is also important because it penalizes confident wrong predictions more heavily than uncertain ones, and it is the actual objective function that logistic regression optimizes during training.

Key Points
  • Accuracy, precision, recall, and F1 score for classification performance
  • AUC-ROC evaluates performance across all thresholds
  • Log loss penalizes confident wrong predictions
  • Confusion matrix provides a detailed breakdown of prediction types
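
These metrics are simple enough to compute by hand from the confusion-matrix counts (the labels and predictions below are made up for illustration; scikit-learn's metrics module gives the same numbers):

```python
# Toy labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Confusion-matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # all 0.8 on this toy example
```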
MEDIUM Explain the log loss / binary cross-entropy cost function.

The log loss, also called binary cross-entropy, is the cost function used to train logistic regression. For a single sample, it is defined as: L = -[y * log(p) + (1 - y) * log(1 - p)], where y is the true label (0 or 1) and p is the predicted probability. When y = 1, only the first term is active, and the loss increases as p approaches 0. When y = 0, only the second term is active, and the loss increases as p approaches 1.

The overall cost function is the average log loss across all training samples: J = -(1/n) * sum[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]. This function is convex for logistic regression, guaranteeing that gradient descent will find the global minimum. Mean squared error is not used because applying it with the sigmoid function creates a non-convex surface with many local minima, making optimization unreliable.

Key Points
  • L = -[y * log(p) + (1 - y) * log(1 - p)]
  • Heavily penalizes confident but wrong predictions
  • Convex for logistic regression, ensuring a unique global minimum
  • Derived from maximum likelihood estimation of the Bernoulli distribution
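
The asymmetric penalty is easy to see by evaluating the per-sample formula at a few probabilities (output values rounded):

```python
import math

def log_loss(y, p):
    """Binary cross-entropy for a single sample."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(log_loss(1, 0.9))   # ~0.105: confident and right -> small loss
print(log_loss(1, 0.5))   # ~0.693: uncertain
print(log_loss(1, 0.01))  # ~4.605: confident and wrong -> large loss
```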
MEDIUM How does regularization work in logistic regression (L1 vs L2)?

Regularization adds a penalty term to the log loss cost function to prevent overfitting by discouraging large weight values. L2 regularization (Ridge) adds the sum of squared weights: J_reg = J + (lambda/2) * sum(w_i^2). This shrinks all coefficients toward zero but rarely sets any to exactly zero. L1 regularization (Lasso) adds the sum of absolute weights: J_reg = J + lambda * sum(|w_i|). This can drive some coefficients to exactly zero, performing automatic feature selection.

In scikit-learn, the regularization strength is controlled by the parameter C, which is the inverse of lambda (C = 1/lambda). A smaller C means stronger regularization. Elastic Net combines both L1 and L2 penalties, offering a balance between feature selection and coefficient shrinkage. Regularization is especially important when the number of features is large relative to the number of samples, or when features are highly correlated.

Key Points
  • L2 (Ridge) shrinks weights toward zero; prevents overfitting
  • L1 (Lasso) can zero out weights; performs feature selection
  • Elastic Net combines L1 and L2 penalties
  • Controlled by hyperparameter C (inverse of regularization strength) in scikit-learn
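
A hedged scikit-learn sketch of the L1/L2 contrast (assumes scikit-learn is installed; the synthetic data, with one informative feature and nine noise features, is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first feature carries signal; the other nine are noise.
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Smaller C = stronger regularization (C = 1/lambda).
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

# L2 shrinks weights but rarely zeroes them; L1 tends to zero the noise features.
print((np.abs(l2.coef_) < 1e-6).sum())
print((np.abs(l1.coef_) < 1e-6).sum())
```

Note that the L1 penalty requires a solver that supports it, such as liblinear or saga.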
MEDIUM How do you handle multiclass classification with logistic regression?

There are two main strategies for extending logistic regression to multiclass problems. The One-vs-Rest (OvR) approach trains K separate binary classifiers, one per class, where each classifier distinguishes one class from all others. At prediction time, the class whose classifier outputs the highest probability is chosen. This is simple and works well but can produce uncalibrated probabilities that do not sum to one.

The Multinomial (Softmax) approach extends logistic regression natively to K classes by replacing the sigmoid with the softmax function: P(y = k | x) = exp(w_k^T x) / sum_j(exp(w_j^T x)). This jointly trains all K sets of weights and produces properly calibrated probabilities that sum to one across all classes. Scikit-learn supports both strategies via the multi_class parameter. Softmax is generally preferred when classes are mutually exclusive and you need valid probability distributions.

Key Points
  • One-vs-Rest (OvR): trains K binary classifiers, one per class
  • Multinomial (Softmax): extends logistic regression natively to K classes
  • Softmax outputs calibrated probabilities summing to one
  • OvR is simpler; softmax is preferred for mutually exclusive classes
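
The softmax formula above can be sketched in a few lines (the scores are hypothetical linear outputs w_k^T x for three classes):

```python
import numpy as np

def softmax(z):
    """Map K real scores to a probability distribution over K classes."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical w_k^T x for K = 3 classes
probs = softmax(scores)
print(probs)        # roughly [0.659, 0.242, 0.099]
print(probs.sum())  # sums to one (up to float rounding)
```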
MEDIUM What are the assumptions of logistic regression?

Logistic regression assumes a linear relationship between the independent features and the log-odds (logit) of the dependent variable. It does not require a linear relationship between features and the raw probability, but the log-odds must be a linear function of the features. It also assumes that observations are independent of each other, meaning the outcome for one sample does not influence another.

The model assumes little to no multicollinearity among the independent variables. High multicollinearity inflates the variance of coefficient estimates and makes them unstable. Logistic regression also requires a sufficiently large sample size, with a common guideline being at least 10 to 20 events per predictor variable. Unlike linear regression, it does not assume normally distributed errors, constant variance (homoscedasticity), or normally distributed features.

Key Points
  • Linear relationship between features and the log-odds
  • Independence of observations
  • Little or no multicollinearity among predictors
  • Large sample size relative to the number of predictors
  • No assumption of normally distributed features or errors
MEDIUM How do you interpret logistic regression coefficients as odds ratios?

In logistic regression, the model predicts the log-odds: log(p / (1 - p)) = w_0 + w_1*x_1 + w_2*x_2 + ... A coefficient w_i represents the change in log-odds of the outcome for a one-unit increase in feature x_i, holding all other features constant. To convert this to an odds ratio, you exponentiate the coefficient: OR = exp(w_i).

An odds ratio greater than 1 means that a one-unit increase in the feature increases the odds of the positive class. An odds ratio less than 1 means it decreases the odds. An odds ratio of exactly 1 means the feature has no effect. For example, if the coefficient for "years of experience" is 0.3, the odds ratio is exp(0.3) = 1.35, meaning each additional year of experience multiplies the odds of the positive outcome by 1.35, or increases them by about 35%.

Key Points
  • Coefficients represent change in log-odds per unit increase in the feature
  • Odds ratio = exp(coefficient); intuitive multiplicative interpretation
  • OR > 1: feature increases odds; OR < 1: feature decreases odds
  • Interpretation assumes all other features are held constant
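
The worked example above in code (output values rounded):

```python
import math

w_experience = 0.3  # the coefficient from the example above
odds_ratio = math.exp(w_experience)
print(round(odds_ratio, 2))  # 1.35: each extra year multiplies the odds by ~1.35

# Odds ratios compose multiplicatively: three extra years scale the odds
# by exp(0.3)^3 = exp(0.9).
print(round(math.exp(0.3) ** 3, 2))  # ~2.46
```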
HARD Derive the gradient of the log loss function.

Starting with the log loss for a single sample: L = -[y * log(sigma(z)) + (1 - y) * log(1 - sigma(z))], where z = w^T x + b and sigma(z) = 1 / (1 + e^(-z)). A crucial property of the sigmoid is that its derivative is sigma'(z) = sigma(z) * (1 - sigma(z)). Applying the chain rule, dL/dz = sigma(z) - y, which reduces elegantly to the difference between the prediction and the true label.

To find the gradient with respect to the weights, apply the chain rule again: dL/dw_j = (dL/dz) * (dz/dw_j) = (sigma(z) - y) * x_j. For the full dataset, the gradient of the cost function is: dJ/dw_j = (1/n) * sum_i[(sigma(z_i) - y_i) * x_ij]. This is the same form as the gradient in linear regression with MSE loss, except the prediction is sigma(z) instead of z. The gradient descent update rule becomes: w_j = w_j - alpha * (1/n) * sum_i[(sigma(z_i) - y_i) * x_ij].

Key Points
  • Sigmoid derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
  • Gradient per sample: (sigma(z) - y) * x_j -- prediction minus truth times feature
  • Same functional form as linear regression gradient but with sigmoid prediction
  • Convexity of log loss guarantees convergence to the global minimum
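
The derivation above translates directly into a batch gradient-descent loop. Below is a from-scratch sketch on toy separable data; the learning rate and epoch count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, epochs=2000):
    """Batch gradient descent on the log loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        error = sigmoid(X @ w + b) - y   # dL/dz = sigma(z) - y, per sample
        w -= alpha * (X.T @ error) / n   # dJ/dw_j = (1/n) * sum[(sigma(z_i) - y_i) * x_ij]
        b -= alpha * error.mean()        # gradient step for the bias
    return w, b

# Toy separable data: the label is the sign of the single feature.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(preds)  # matches y: [0 0 0 1 1 1]
```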
HARD Compare MLE vs MAP estimation in logistic regression.

Maximum Likelihood Estimation (MLE) finds the weights that maximize the probability of observing the training data: w_MLE = argmax_w product_i P(y_i | x_i, w). Taking the negative log converts this to minimizing the log loss. MLE has no regularization and can overfit when the number of features is large or the data is linearly separable, because it will drive weights toward infinity to achieve perfect separation.

Maximum A Posteriori (MAP) estimation incorporates a prior distribution over the weights: w_MAP = argmax_w P(w | data) = argmax_w [P(data | w) * P(w)]. A Gaussian prior P(w) ~ N(0, 1/lambda) leads to L2 regularization, and a Laplace prior P(w) ~ Laplace(0, 1/lambda) leads to L1 regularization. MAP estimation is equivalent to adding the regularization penalty to the MLE objective. The regularization strength lambda corresponds to the precision (inverse variance) of the prior, providing a principled Bayesian interpretation of why regularization works.

Key Points
  • MLE maximizes likelihood; equivalent to unregularized log loss minimization
  • MAP adds a prior over weights; equivalent to regularized log loss
  • Gaussian prior yields L2 regularization; Laplace prior yields L1
  • MLE can overfit with separable data; MAP prevents this via the prior
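
The separability problem can be demonstrated numerically. In the sketch below (toy data, lambda chosen arbitrarily), the negative log-likelihood keeps improving as the weights blow up, while the MAP objective with a Gaussian prior does not:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def neg_log_posterior(w, X, y, lam):
    """MAP with a Gaussian N(0, 1/lam) prior = MLE objective + L2 penalty."""
    return neg_log_likelihood(w, X, y) + (lam / 2) * np.sum(w ** 2)

# Perfectly separable data: MLE alone rewards ever-larger weights.
X = np.array([[-1.0], [1.0]])
y = np.array([0, 1])
for scale in (1.0, 10.0, 100.0):
    w = np.array([scale])
    print(neg_log_likelihood(w, X, y), neg_log_posterior(w, X, y, lam=1.0))
# The likelihood keeps improving as |w| grows; the penalized (MAP)
# objective eventually gets worse, so the weights stay finite.
```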
HARD How does logistic regression handle imbalanced datasets?

When classes are imbalanced, standard logistic regression tends to be biased toward the majority class because the overall loss is dominated by the more frequent class. The primary technique to address this is class weighting, where each sample's contribution to the loss is scaled by the inverse class frequency. In scikit-learn, setting class_weight='balanced' automatically assigns weights proportional to n_samples / (n_classes * n_class_samples), giving minority class errors more influence during training.

Beyond class weighting, you can adjust the classification threshold. Instead of the default 0.5, you can lower the threshold to increase recall for the minority class at the cost of precision. Resampling techniques like SMOTE (oversampling the minority) or random undersampling (reducing the majority) can also help. Stratified cross-validation ensures that each fold maintains the original class distribution. Evaluating with metrics like AUC-ROC, precision-recall AUC, and F1 score is critical because accuracy alone is misleading with imbalanced data.

Key Points
  • Class weighting scales loss contribution by inverse class frequency
  • Threshold adjustment trades precision for recall on the minority class
  • SMOTE and undersampling address imbalance at the data level
  • Stratified cross-validation preserves class ratios in each fold
  • Avoid accuracy; use AUC-ROC, F1, or precision-recall AUC instead
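
A small sketch of the class_weight effect (assumes scikit-learn is installed; the synthetic imbalance ratio and class means are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Imbalanced toy data: 950 majority samples vs. 50 minority, overlapping classes.
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 1)),
               rng.normal(1.5, 1.0, size=(50, 1))])
y = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Recall on the minority class: fraction of true 1s predicted as 1.
plain_recall = (plain.predict(X)[y == 1] == 1).mean()
weighted_recall = (weighted.predict(X)[y == 1] == 1).mean()
print(plain_recall, weighted_recall)  # weighting should raise minority recall
```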
HARD Explain the connection between logistic regression and neural networks.

Logistic regression is mathematically identical to a neural network with no hidden layers -- a single neuron with a sigmoid activation function. The neuron takes the weighted sum of inputs (z = w^T x + b), applies the sigmoid activation, and outputs a probability. The binary cross-entropy loss and gradient descent training procedure are exactly the same. This makes logistic regression the simplest possible neural network architecture.

When you stack multiple logistic regression units into layers and connect them, you get a feedforward neural network (multilayer perceptron). The key difference is that neural networks with hidden layers can learn non-linear decision boundaries through composed non-linear transformations, while a single logistic regression unit can only learn a linear boundary. The backpropagation algorithm used to train deep networks is a generalization of the gradient computation used in logistic regression. Understanding logistic regression deeply is therefore foundational to understanding all of deep learning.

Key Points
  • Logistic regression is a neural network with zero hidden layers
  • Single sigmoid neuron with binary cross-entropy loss
  • Adding hidden layers enables non-linear decision boundaries
  • Backpropagation generalizes the logistic regression gradient computation
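
The equivalence is easy to see in code. In this sketch with made-up weights, the "network" is one sigmoid neuron, and stacking such units yields a one-hidden-layer MLP:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression = a zero-hidden-layer network: one sigmoid neuron.
w, b = np.array([0.8, -0.4]), 0.1  # hypothetical learned weights

def forward(x):
    return sigmoid(w @ x + b)  # weighted sum -> sigmoid -> probability

# Stacking logistic units gives a one-hidden-layer MLP (weights hypothetical).
W1, b1 = np.array([[1.0, 1.0], [-1.0, 1.0]]), np.array([0.0, 0.0])
w2, b2 = np.array([1.0, 1.0]), -1.0

def mlp(x):
    h = sigmoid(W1 @ x + b1)   # each hidden unit is itself a logistic unit
    return sigmoid(w2 @ h + b2)

x = np.array([1.0, 2.0])
print(forward(x))  # single-unit probability (plain logistic regression)
print(mlp(x))      # composed non-linear transformation
```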
HARD When would logistic regression outperform more complex models?

Logistic regression often outperforms complex models when the true decision boundary is approximately linear, training data is limited, or the feature space is high-dimensional and sparse. In text classification with bag-of-words features, for example, logistic regression frequently matches or beats random forests and gradient boosting because the data lives in a very high-dimensional sparse space where linear models excel. With small datasets, complex models are prone to overfitting, while logistic regression's simplicity provides a strong inductive bias.

Logistic regression also wins when interpretability is a requirement. In healthcare, finance, and legal domains, stakeholders need to understand why a prediction was made. The clear coefficient-to-odds-ratio mapping in logistic regression provides transparent explanations that black-box models cannot offer. Additionally, in production systems where latency and computational cost matter, logistic regression's O(d) prediction time and minimal memory footprint make it preferable to ensemble methods. When the bias-variance tradeoff favors lower variance, simpler models like logistic regression are the better choice.

Key Points
  • Approximately linear decision boundaries favor logistic regression
  • Small datasets and high-dimensional sparse features suit linear models
  • Interpretability requirements in regulated industries
  • Low-latency production environments with strict compute budgets
  • Strong baseline that complex models must demonstrably beat
