
Linear Regression
Interview Questions

15 commonly asked interview questions on linear regression, with detailed answers.

EASY What is linear regression?

Linear regression is a supervised learning algorithm that models the relationship between one or more independent variables (features) and a continuous dependent variable (target) by fitting a linear equation to the observed data. The goal is to find the best-fitting straight line (or hyperplane in multiple dimensions) that minimizes the difference between predicted and actual values.

The simplest form is y = mx + b, where m is the slope (coefficient) and b is the y-intercept. In the general form, the model predicts the target as a weighted sum of input features plus a bias term. Linear regression is one of the most fundamental algorithms in statistics and machine learning, widely used for prediction, forecasting, and understanding relationships between variables.

Key Points
  • Models a linear relationship between features and a continuous target
  • Minimizes the sum of squared differences between predicted and actual values
  • Output is a continuous numerical value, not a class label
  • Foundation of many more advanced regression techniques
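As a minimal sketch (numpy assumed; the toy data below is invented for illustration), the slope and intercept of y = mx + b can be computed directly from the closed-form least-squares formulas:

```python
import numpy as np

# Toy data that roughly follows y = 2x + 1 (synthetic, for illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least-squares estimates for slope m and intercept b
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(round(m, 3), round(b, 3))  # slope near 2, intercept near 1
```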
EASY What is the difference between simple and multiple linear regression?

Simple linear regression involves exactly one independent variable and one dependent variable. The model takes the form y = b0 + b1*x, where b0 is the intercept and b1 is the slope. It fits a straight line through a two-dimensional scatter plot. Simple linear regression is useful when you want to understand or quantify the relationship between a single predictor and the outcome.

Multiple linear regression extends this concept to two or more independent variables. The model becomes y = b0 + b1*x1 + b2*x2 + ... + bn*xn, fitting a hyperplane in a higher-dimensional space. Multiple regression allows you to control for confounding variables, assess the individual contribution of each predictor while holding others constant, and generally build more accurate models by incorporating more information. Most real-world regression problems involve multiple predictors.

Key Points
  • Simple regression has one predictor; multiple regression has two or more
  • Simple regression fits a line; multiple regression fits a hyperplane
  • Multiple regression can control for confounding variables
  • Both use the same underlying least-squares optimization
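A numpy sketch of multiple regression (synthetic, noise-free data so the coefficients are exactly recoverable): stacking an intercept column onto the feature matrix and solving the least-squares problem fits the hyperplane y = b0 + b1*x1 + b2*x2.

```python
import numpy as np

# Two predictors; target built as y = 3 + 2*x1 - 1*x2 (noise-free for clarity)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

# Add an intercept column, then solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [3., 2., -1.]
```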
EASY What is R-squared and what does it tell you?

R-squared, also known as the coefficient of determination, measures the proportion of variance in the dependent variable that is explained by the independent variables in the model. For OLS with an intercept, it ranges from 0 to 1 on the training data (it can be negative when a model is evaluated on unseen data or fit without an intercept). An R-squared of 0.85 means that 85% of the variability in the target is captured by the model, while the remaining 15% is unexplained.

However, R-squared has an important limitation: it always increases (or stays the same) when you add more features, even if those features are irrelevant. This is why adjusted R-squared was introduced, which penalizes the addition of unnecessary predictors by accounting for the number of features relative to the number of observations. When comparing models with different numbers of predictors, adjusted R-squared is the more reliable metric.

Key Points
  • Measures the proportion of variance explained by the model
  • Ranges from 0 (no explanatory power) to 1 (perfect fit)
  • Always increases with more features, even irrelevant ones
  • Adjusted R-squared penalizes unnecessary predictors
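Both metrics follow directly from their definitions; a small numpy sketch (toy numbers invented for illustration):

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_residual / SS_total
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    # n observations, p predictors (excluding the intercept)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9, 10.9])
r2 = r_squared(y_true, y_pred)
print(round(r2, 4), round(adjusted_r_squared(r2, n=5, p=1), 4))
```

Note that adjusted R-squared is always less than or equal to R-squared, and the gap widens as p approaches n.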
EASY What is the difference between correlation and regression?

Correlation measures the strength and direction of the linear relationship between two variables. It produces a single value (the correlation coefficient r) ranging from -1 to +1, where -1 indicates a perfect negative relationship, 0 indicates no linear relationship, and +1 indicates a perfect positive relationship. Correlation is symmetric -- the correlation of X with Y is the same as Y with X -- and it does not imply any causal or predictive direction.

Regression, on the other hand, goes further by establishing a predictive equation. It quantifies how much the dependent variable changes for a given change in the independent variable, provides an intercept and slope for making predictions, and distinguishes between the predictor and the response. Regression is directional, meaning that regressing Y on X gives a different equation than regressing X on Y. In short, correlation tells you that a relationship exists, while regression tells you the nature and magnitude of that relationship and enables prediction.

Key Points
  • Correlation measures strength and direction; regression models the relationship
  • Correlation is symmetric; regression is directional
  • Regression provides coefficients for prediction; correlation does not
  • Neither correlation nor regression implies causation
EASY What are residuals in linear regression?

Residuals are the differences between the observed (actual) values and the predicted values from the regression model. For each data point, the residual equals y_actual minus y_predicted. Positive residuals mean the model underestimated the actual value, while negative residuals mean it overestimated. In an ordinary least squares model that includes an intercept, the residuals always sum to zero.

Residuals play a critical role in diagnosing the quality and validity of a regression model. Plotting residuals against predicted values or against individual features can reveal violations of key assumptions. If residuals show a pattern (such as a curve or a funnel shape), this indicates that the model is missing something -- perhaps a non-linear relationship or heteroscedasticity. Ideally, residuals should be randomly scattered around zero with constant variance, forming no discernible pattern.

Key Points
  • Residual = actual value minus predicted value
  • Sum of residuals in OLS (with an intercept) is zero
  • Used to diagnose model fit and assumption violations
  • Should be randomly distributed with constant variance
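A quick numpy check of the zero-sum property (synthetic data for illustration): fitting with an intercept column and summing the residuals gives a value that is zero up to floating-point error.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8])

# Fit with an intercept via least squares
A = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - (b0 + b1 * x)
print(round(residuals.sum(), 10))  # sums to (numerically) zero when an intercept is fit
```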
MEDIUM Explain the assumptions of linear regression.

Linear regression relies on several key assumptions. First, linearity: the relationship between the independent and dependent variables must be linear. Second, independence: the residuals (errors) must be independent of each other, meaning one observation's error does not influence another's. Third, homoscedasticity: the variance of the residuals must be constant across all levels of the independent variables. Fourth, normality: the residuals should be approximately normally distributed, which is important for valid hypothesis tests and confidence intervals.

Additionally, in multiple regression, there should be no perfect multicollinearity among the predictors, meaning no independent variable should be a perfect linear combination of others. Violations of these assumptions do not necessarily make linear regression useless, but they can lead to biased coefficients, unreliable standard errors, and incorrect significance tests. Each assumption can be checked with specific diagnostic tools: residual plots for linearity and homoscedasticity, the Durbin-Watson test for independence, Q-Q plots for normality, and the Variance Inflation Factor (VIF) for multicollinearity.

Key Points
  • Linearity -- relationship between X and Y must be linear
  • Independence -- residuals must not be correlated
  • Homoscedasticity -- constant variance of residuals
  • Normality -- residuals should follow a normal distribution
  • No perfect multicollinearity among predictors
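As one concrete diagnostic, the Durbin-Watson statistic mentioned above has a simple formula that can be sketched directly in numpy (simulated independent residuals, for illustration only):

```python
import numpy as np

def durbin_watson(residuals):
    # DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest
    # little first-order autocorrelation in the residuals
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)
e = rng.normal(size=500)           # independent simulated residuals
print(round(durbin_watson(e), 2))  # should be close to 2
```

Values well below 2 suggest positive autocorrelation; values well above 2 suggest negative autocorrelation.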
MEDIUM What is multicollinearity and how do you detect it?

Multicollinearity occurs when two or more independent variables in a multiple regression model are highly correlated with each other. When this happens, the model has difficulty distinguishing the individual effect of each correlated predictor. The coefficients become unstable and can change dramatically with small changes in the data, standard errors become inflated, and individual predictor significance tests become unreliable -- even though the overall model may still predict well.

The primary tool for detecting multicollinearity is the Variance Inflation Factor (VIF). VIF measures how much the variance of a regression coefficient is inflated due to collinearity with other predictors. A VIF of 1 means no collinearity, values between 1 and 5 suggest moderate collinearity, and values above 10 indicate severe collinearity that should be addressed. You can also check the correlation matrix between features, though this only detects pairwise collinearity. Solutions include removing one of the correlated variables, combining them through PCA, or using regularization techniques like Ridge regression.

Key Points
  • High correlation between independent variables
  • Inflates standard errors and destabilizes coefficients
  • Detected using VIF (values above 10 are severe)
  • Solutions: remove features, use PCA, or apply Ridge regression
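VIF can be computed from its definition, 1 / (1 - R^2_j), where R^2_j comes from regressing feature j on the remaining features. A numpy sketch with deliberately collinear synthetic data:

```python
import numpy as np

def vif(X, j):
    # Regress column j on the remaining columns; VIF_j = 1 / (1 - R^2_j)
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                  # independent of the others
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 large, x3 near 1
```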
MEDIUM Compare Ridge, Lasso, and Elastic Net regularization.

Ridge regression (L2 regularization) adds a penalty equal to the sum of the squared coefficients multiplied by a tuning parameter lambda. This shrinks all coefficients toward zero but never sets them exactly to zero, so all features remain in the model. Ridge is particularly effective at handling multicollinearity because it distributes the coefficient weight among correlated features rather than assigning it arbitrarily to one of them.

Lasso regression (L1 regularization) adds a penalty equal to the sum of the absolute values of the coefficients multiplied by lambda. Unlike Ridge, Lasso can shrink coefficients all the way to exactly zero, effectively performing automatic feature selection. This makes the resulting model more interpretable. Elastic Net combines both penalties with a mixing parameter alpha that controls the balance between L1 and L2. Elastic Net inherits the feature selection capability of Lasso and the stability of Ridge, making it especially useful when there are many correlated features and you want both sparsity and grouping of correlated predictors.

Key Points
  • Ridge (L2) shrinks coefficients but keeps all features
  • Lasso (L1) can zero out coefficients for feature selection
  • Elastic Net combines L1 and L2 penalties
  • Ridge handles multicollinearity; Lasso provides sparsity
  • All three add bias to reduce variance and prevent overfitting
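Ridge has a closed-form solution (unlike Lasso and Elastic Net, which require iterative solvers such as coordinate descent), so its shrinkage effect is easy to sketch in numpy. The data below is synthetic, and the intercept is omitted for brevity:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form Ridge: beta = (X^T X + lam*I)^(-1) X^T y
    # (features assumed centered/scaled; intercept omitted for brevity)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

b_ols = ridge(X, y, 0.0)    # lambda = 0 reduces to OLS
b_reg = ridge(X, y, 100.0)  # larger lambda shrinks the coefficient norm
print(np.linalg.norm(b_ols) > np.linalg.norm(b_reg))
```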
MEDIUM How do you handle outliers in linear regression?

Outliers can disproportionately influence linear regression because the OLS cost function squares the errors, giving large residuals an outsized impact on the fitted line. The first step is always to investigate outliers to determine whether they are data entry errors, measurement issues, or genuinely unusual observations. Tools like Cook's distance measure how much each data point influences the overall regression -- a Cook's distance greater than 1 (or greater than 4/n) flags a highly influential point. Leverage values and studentized residuals also help identify problematic observations.

Once identified, you have several options. You can remove outliers if they are clearly erroneous, but you should document and justify the removal. You can apply a robust regression method such as RANSAC, Huber regression, or Theil-Sen estimation, which down-weight or ignore outliers. Transforming the target variable (log, square root) can also reduce the impact of extreme values. Another approach is to use regularization, which constrains the coefficients and reduces sensitivity to individual data points. The worst approach is blindly removing all points with large residuals without understanding why they are outliers.

Key Points
  • OLS squares errors, amplifying outlier influence
  • Use Cook's distance and leverage to identify influential points
  • Robust regression methods (Huber, RANSAC) down-weight outliers
  • Investigate before removing -- never blindly delete
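Cook's distance combines each point's residual and leverage; a numpy sketch with a deliberately injected outlier (all data synthetic):

```python
import numpy as np

def cooks_distance(X, y):
    # X must include an intercept column; p = number of fitted parameters
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix; diag gives leverage
    h = np.diag(H)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    mse = np.sum(e ** 2) / (n - p)
    # D_i = (e_i^2 / (p * MSE)) * (h_ii / (1 - h_ii)^2)
    return (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 2 * x + rng.normal(scale=0.2, size=30)
y[0] += 10.0                                # inject a gross outlier
X = np.column_stack([np.ones_like(x), x])
d = cooks_distance(X, y)
print(int(np.argmax(d)))  # the injected outlier has the largest distance
```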
MEDIUM Explain the bias-variance tradeoff in regression.

The bias-variance tradeoff is a fundamental concept that describes the tension between two sources of prediction error. Bias is the error introduced by approximating a complex real-world problem with a simplified model -- high bias means the model is too simple and systematically misses the true pattern (underfitting). Variance is the error introduced by the model's sensitivity to fluctuations in the training data -- high variance means the model fits the training data too closely and fails to generalize (overfitting).

In linear regression, a simple model with few features may have high bias but low variance. Adding more features or polynomial terms reduces bias but increases variance. The total expected error can be decomposed as: Error = Bias^2 + Variance + Irreducible Noise. The goal is to find the sweet spot where the combined error is minimized. Regularization (Ridge, Lasso) explicitly manages this tradeoff by introducing a controlled amount of bias (through the penalty term) in exchange for a significant reduction in variance, often resulting in better generalization performance on unseen data.

Key Points
  • Bias = systematic error from oversimplified models (underfitting)
  • Variance = sensitivity to training data fluctuations (overfitting)
  • Total error = Bias^2 + Variance + Irreducible Noise
  • Regularization adds bias to reduce variance for better generalization
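The decomposition can be estimated empirically by refitting a model on many resampled datasets. A numpy simulation sketch (synthetic quadratic ground truth, polynomial models standing in for "simple" and "complex"):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 20)
true_f = x ** 2                      # true relationship is quadratic

def bias_variance(degree, trials=200):
    # Refit a polynomial model on many noisy resamples of the same x grid
    preds = []
    for _ in range(trials):
        y = true_f + rng.normal(scale=0.3, size=x.size)
        preds.append(np.polyval(np.polyfit(x, y, degree), x))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

b1, v1 = bias_variance(1)   # underfit: high bias, low variance
b9, v9 = bias_variance(9)   # overfit: low bias, high variance
print(b1 > b9, v1 < v9)
```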
HARD Derive the OLS normal equation.

The Ordinary Least Squares (OLS) objective is to minimize the sum of squared residuals: L(beta) = (y - X*beta)^T * (y - X*beta). Expanding this gives L = y^T*y - 2*beta^T*X^T*y + beta^T*X^T*X*beta. To find the minimum, we take the derivative with respect to beta and set it equal to zero: dL/d(beta) = -2*X^T*y + 2*X^T*X*beta = 0.

Solving for beta gives the normal equation: beta = (X^T * X)^(-1) * X^T * y. This closed-form solution directly computes the optimal coefficients without any iterative process. For this solution to exist, the matrix X^T*X must be invertible, which requires that the columns of X are linearly independent (no perfect multicollinearity) and that there are at least as many observations as features (n >= p). When X^T*X is singular or nearly singular, the solution is numerically unstable, which is precisely the situation where regularization methods like Ridge regression become necessary -- Ridge adds a term lambda*I to X^T*X, guaranteeing invertibility.

Key Points
  • Minimize L = (y - X*beta)^T * (y - X*beta)
  • Take derivative, set to zero: X^T*X*beta = X^T*y
  • Solution: beta = (X^T*X)^(-1) * X^T*y
  • Requires X^T*X to be invertible (no perfect multicollinearity)
  • Ridge adds lambda*I to guarantee invertibility
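The closed-form solution is a one-liner in numpy (synthetic data; np.linalg.solve is used rather than an explicit inverse, which is numerically preferable but mathematically equivalent):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

# Normal equation: solve X^T*X*beta = X^T*y for beta
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 1))  # recovers approximately [1.0, 2.0, -0.5]
```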
HARD How would you handle non-linear relationships in linear regression?

Despite its name, linear regression can model non-linear relationships because "linear" refers to linearity in the parameters (coefficients), not in the features. The most common approach is polynomial regression, where you add polynomial terms such as x^2, x^3, or interaction terms like x1*x2 as new features. The model y = b0 + b1*x + b2*x^2 is still a linear regression because it is linear in the coefficients b0, b1, and b2, even though the relationship with x is quadratic.

Other approaches include applying mathematical transformations to the features or target (log, square root, reciprocal), using basis function expansion (splines, radial basis functions) to create flexible non-linear mappings while staying within the linear regression framework, and adding interaction terms to capture how the effect of one feature depends on the value of another. You can also use piecewise linear regression (splines) which fits different linear segments in different regions of the feature space. The key consideration is that adding more polynomial or basis function terms increases model complexity, so cross-validation should be used to select the right degree of complexity and avoid overfitting.

Key Points
  • Linear regression is linear in parameters, not necessarily in features
  • Polynomial features (x^2, x^3) capture curved relationships
  • Log/sqrt transformations can linearize certain relationships
  • Splines and basis functions provide flexible non-linear modeling
  • Use cross-validation to avoid overfitting with added complexity
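The "linear in the parameters" point can be made concrete with a polynomial-features sketch in numpy (noise-free synthetic data so the true coefficients are exactly recoverable):

```python
import numpy as np

x = np.linspace(-2, 2, 50)
y = 1.0 + 0.5 * x + 2.0 * x ** 2        # quadratic in x, noise-free for clarity

# Build polynomial features [1, x, x^2]; the model stays linear in coefficients
A = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 6))  # recovers [1.0, 0.5, 2.0]
```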
HARD Explain heteroscedasticity and its impact.

Heteroscedasticity occurs when the variance of the residuals is not constant across all levels of the independent variables. For example, in a model predicting income from years of experience, the spread of residuals might be small for junior employees but very large for senior employees. This violates the homoscedasticity assumption and appears as a funnel or fan shape in residual plots. Common causes include omitted variables, incorrect functional form, or the inherent nature of the data.

While OLS coefficient estimates remain unbiased under heteroscedasticity, the standard errors become incorrect, leading to invalid t-tests, F-tests, and confidence intervals. You might conclude a variable is significant when it is not, or vice versa. The Breusch-Pagan test and White test are formal statistical tests for detecting heteroscedasticity. Solutions include using Weighted Least Squares (WLS) where observations with higher variance receive less weight, applying heteroscedasticity-consistent standard errors (HC0-HC3, also known as White's robust standard errors), transforming the dependent variable (log transformation often stabilizes variance), or using generalized least squares (GLS) which accounts for the error structure.

Key Points
  • Non-constant variance of residuals across predictor levels
  • OLS coefficients remain unbiased but standard errors become wrong
  • Detected via Breusch-Pagan test, White test, or residual plots
  • Solutions: WLS, robust standard errors, log transformation, or GLS
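Weighted least squares is a small modification of the normal equation; a numpy sketch on synthetic data whose noise grows with x, using inverse-variance weights:

```python
import numpy as np

def wls(X, y, w):
    # Weighted least squares: solve (X^T W X) beta = X^T W y,
    # down-weighting the high-variance observations
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(6)
x = np.linspace(1, 10, 200)
sigma = 0.1 * x                              # noise grows with x (heteroscedastic)
y = 3.0 + 0.7 * x + rng.normal(scale=sigma)
X = np.column_stack([np.ones_like(x), x])

beta = wls(X, y, 1.0 / sigma ** 2)           # weights = inverse variance
print(np.round(beta, 1))  # close to the true [3.0, 0.7]
```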
HARD Compare gradient descent vs normal equation approach.

The normal equation provides a closed-form analytical solution: beta = (X^T*X)^(-1) * X^T*y. It computes the optimal coefficients directly in one step without any iterations or hyperparameters. However, it requires computing the matrix inverse of X^T*X, which has a time complexity of O(p^3) where p is the number of features. This makes it computationally infeasible when the number of features is very large (typically above 10,000), as the matrix inversion becomes prohibitively slow and memory-intensive.

Gradient descent is an iterative optimization algorithm that repeatedly updates the coefficients by moving in the direction of the steepest decrease in the cost function. Each iteration computes the gradient and takes a step proportional to the learning rate. The time complexity per iteration is O(n*p), and it typically requires many iterations to converge. However, it scales much better to large feature sets and can handle datasets that do not fit in memory (via stochastic or mini-batch variants). Gradient descent requires tuning the learning rate and choosing a stopping criterion, while the normal equation has no hyperparameters. In practice, gradient descent is preferred for large-scale problems, while the normal equation is preferred when the number of features is moderate and an exact solution is desired.

Key Points
  • Normal equation: closed-form, exact, O(p^3) -- impractical for large p
  • Gradient descent: iterative, approximate, O(n*p) per iteration
  • Normal equation has no hyperparameters; GD requires learning rate tuning
  • GD variants (SGD, mini-batch) scale to massive datasets
  • Normal equation preferred for p < 10,000; GD for larger problems
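Both approaches can be compared directly in numpy (synthetic data; learning rate and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Closed-form normal equation
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error
beta_gd = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = (2 / len(y)) * X.T @ (X @ beta_gd - y)
    beta_gd -= lr * grad

print(np.allclose(beta_ne, beta_gd, atol=1e-4))  # both reach the same solution
```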
HARD How do you validate a linear regression model properly?

Proper validation of a linear regression model involves both statistical diagnostics and predictive performance assessment. Start by checking the assumptions: examine residual plots for linearity and homoscedasticity, use Q-Q plots or the Shapiro-Wilk test for residual normality, compute VIF scores for multicollinearity, and apply the Durbin-Watson test for autocorrelation. Then evaluate the model's explanatory power using adjusted R-squared, and test individual coefficients with t-tests and the overall model with the F-test to ensure statistical significance.

For predictive validation, never evaluate on the training data alone. Use train-test splits (typically 70-30 or 80-20) to assess out-of-sample performance, and employ k-fold cross-validation (usually 5 or 10 folds) for more robust estimates. Key metrics include RMSE (root mean squared error) for understanding prediction magnitude, MAE (mean absolute error) for a more robust measure less sensitive to outliers, and MAPE (mean absolute percentage error) for relative accuracy. Compare your model's performance against a baseline such as predicting the mean. Finally, check for overfitting by comparing training and validation metrics -- a large gap signals overfitting. For time series data, use time-aware validation strategies like walk-forward validation rather than random splits.

Key Points
  • Check assumptions: residual plots, Q-Q plots, VIF, Durbin-Watson
  • Use train-test split and k-fold cross-validation
  • Evaluate with RMSE, MAE, adjusted R-squared, and F-test
  • Compare training vs validation metrics to detect overfitting
  • Use time-aware splits for time series data
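The k-fold procedure can be sketched by hand in numpy (synthetic data; in practice a library cross-validation utility would typically be used):

```python
import numpy as np

def kfold_rmse(X, y, k=5):
    # Manual k-fold cross-validation for an OLS model
    folds = np.array_split(np.arange(len(y)), k)
    rmses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[test] @ beta
        rmses.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(rmses))

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.3, size=100)
print(round(kfold_rmse(X, y), 2))  # cross-validated RMSE near the noise level
```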
