SVM Interview Questions
15 commonly asked interview questions with detailed answers.
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal separating hyperplane between classes. It works by identifying the decision boundary that maximizes the margin -- the distance between the boundary and the closest data points from each class. These closest points are called support vectors.
SVM can solve both classification and regression problems. For classification, it separates two classes with the widest possible gap; multiclass problems are handled by combining binary classifiers. For regression (SVR), it fits a tube of tolerance around the data. SVM is particularly effective in high-dimensional spaces, making it popular for text classification, image recognition, bioinformatics, and handwriting recognition.
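As a minimal sketch (toy two-class data and parameters assumed), fitting and querying a linear SVM with scikit-learn's SVC looks like this:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable two-class data (illustrative only).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

# A linear-kernel SVM finds the maximum-margin separating line.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Predict one point near each cluster.
print(clf.predict([[0.5, 0.5], [2.5, 2.8]]))
```

In practice, features should be standardized before fitting, since SVM is sensitive to feature scaling.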
A hyperplane is a flat subspace of one dimension less than its ambient space. In 2D, a hyperplane is a line; in 3D, it is a plane; and in higher dimensions, it generalizes accordingly. Mathematically, it is defined by the equation w·x + b = 0, where w is the weight vector (normal to the hyperplane), x is the input vector, and b is the bias term.
SVM uses the hyperplane as a decision boundary to separate data points belonging to different classes. Among all possible hyperplanes that correctly classify the training data, SVM selects the one that maximizes the margin -- the perpendicular distance from the hyperplane to the nearest data points on either side. This maximum-margin hyperplane provides the best generalization to unseen data by creating the largest buffer zone between classes.
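The decision rule follows directly from the hyperplane equation; the w and b values below are assumed for illustration, not learned:

```python
import numpy as np

# Assumed hyperplane parameters for illustration.
w = np.array([1.0, -1.0])   # normal vector to the hyperplane
b = 0.0                     # bias term

def decision(x):
    """Signed value of w.x + b: the sign gives the predicted side of the
    hyperplane; dividing the magnitude by ||w|| gives the perpendicular
    distance from x to the hyperplane."""
    return np.dot(w, x) + b

print(decision(np.array([2.0, 1.0])))   # positive side
print(decision(np.array([1.0, 3.0])))   # negative side
```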
Support vectors are the training data points that lie closest to the decision boundary (hyperplane). They sit exactly on the margin boundaries, at a distance of 1/||w|| from the hyperplane (under the canonical scaling in which the closest points satisfy y_i(w·x_i + b) = 1). These are the most difficult points to classify and the ones that directly determine the position and orientation of the optimal hyperplane.
Support vectors are critically important because the SVM solution depends entirely on them. If you removed all other training points and retrained, you would get the exact same hyperplane, as long as the support vectors remain. This property makes SVM memory-efficient at prediction time -- only the support vectors need to be stored. It also means that SVM is robust to outliers that are far from the decision boundary, since those points have no influence on the model.
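This property can be checked empirically. The sketch below (toy Gaussian clusters assumed) refits a linear SVM on its own support vectors and compares the result, up to solver tolerance, to the model trained on the full data:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy clusters (illustrative only).
rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - 2, rng.randn(20, 2) + 2]
y = np.r_[np.zeros(20), np.ones(20)]

full = SVC(kernel="linear", C=1.0).fit(X, y)

# Retrain using only the support vectors of the full model.
reduced = SVC(kernel="linear", C=1.0).fit(full.support_vectors_,
                                          y[full.support_])

# Same hyperplane (up to solver tolerance) and identical predictions.
print(np.allclose(full.coef_, reduced.coef_, atol=1e-2))
print((full.predict(X) == reduced.predict(X)).all())
```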
Hard margin SVM requires that all training data points be correctly classified with no violations of the margin. It works only when the data is perfectly linearly separable, meaning there exists a hyperplane that can separate the two classes without any misclassifications. If even one point falls on the wrong side, hard margin SVM will fail to find a solution.
Soft margin SVM, introduced by Cortes and Vapnik (1995), relaxes this constraint by allowing some data points to violate the margin or even be misclassified. It introduces slack variables (xi) that measure the degree of margin violation for each point, and a regularization parameter C that controls the trade-off between maximizing the margin and minimizing the total violation. A larger C penalizes violations more heavily, yielding a narrower margin. A smaller C allows more violations, producing a wider margin that is more tolerant of noise and outliers.
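A quick way to see the trade-off is to count support vectors at two extremes of C on overlapping data (toy data assumed); the wider margin of a small C typically touches many more points:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping toy clusters (illustrative only).
rng = np.random.RandomState(42)
X = np.r_[rng.randn(50, 2) - 1, rng.randn(50, 2) + 1]
y = np.r_[np.zeros(50), np.ones(50)]

loose = SVC(kernel="linear", C=0.01).fit(X, y)    # wide margin, tolerant
strict = SVC(kernel="linear", C=100.0).fit(X, y)  # narrow margin, strict

# The wide-margin model usually has (many) more support vectors.
print(len(loose.support_), len(strict.support_))
```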
SVM has been successfully applied across many domains. In text classification and NLP, it is used for spam detection, sentiment analysis, and document categorization. In bioinformatics, SVM classifies proteins, detects genes, and predicts disease from gene expression data. In image recognition, it powers handwriting recognition systems like those used in postal services and has been used for face detection and object classification.
In finance, SVM is applied to stock market prediction, credit scoring, and fraud detection. In medical diagnosis, it helps classify tumors as malignant or benign and assists in disease prediction from clinical data. SVM is also used in remote sensing for land-use classification from satellite imagery, and in intrusion detection systems for cybersecurity. Its strength in high-dimensional, small-sample-size problems makes it particularly well-suited to these scientific and engineering applications.
The kernel trick allows SVM to learn non-linear decision boundaries without explicitly transforming the data into a higher-dimensional feature space. The key insight is that the SVM optimization (in its dual form) and the final decision function depend only on dot products between pairs of data points, not on the individual feature vectors. A kernel function K(x_i, x_j) computes this dot product in the high-dimensional space directly from the original input vectors, without ever performing the actual transformation.
This is enormously useful because explicitly mapping data to a very high (or even infinite) dimensional space would be computationally prohibitive. For example, the RBF kernel implicitly maps data into an infinite-dimensional space, yet the kernel evaluation itself takes O(d) time where d is the original dimension. The kernel trick makes it feasible for SVM to capture complex non-linear relationships while maintaining computational tractability, which is why kernel SVM has been so successful on problems where linear methods fail.
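The trick is easy to verify in a small case. For the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)^2 in 2D, the explicit feature map is phi(x) = (x1^2, sqrt(2)·x1·x2, x2^2), and the kernel reproduces phi(x)·phi(z) without ever forming phi:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, z):
    """K(x, z) = (x.z)^2, computed directly in the original space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))  # explicit map, then dot product
print(kernel(x, z))            # kernel shortcut: (3 + 8)^2 = 121
```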
The linear kernel K(x_i, x_j) = x_i · x_j is the simplest and fastest. It produces a straight decision boundary and works best when data is linearly separable or nearly so. It is the preferred choice for high-dimensional data like text (where d is large), because in such spaces data tends to be linearly separable and a linear kernel avoids overfitting.
The polynomial kernel K(x_i, x_j) = (gamma * x_i · x_j + r)^d maps data into a feature space of polynomial combinations up to degree d. It captures feature interactions and is useful when the decision boundary has polynomial curvature. However, it can be computationally expensive for high degrees and suffers from numerical instability.

The RBF (Radial Basis Function) kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) is the most commonly used non-linear kernel. It maps data into an infinite-dimensional space and can model arbitrarily complex boundaries. The gamma parameter controls the radius of influence of each support vector: a small gamma means wide influence (smoother boundary), while a large gamma means narrow influence (more complex boundary).
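The RBF formula can be evaluated by hand and checked against scikit-learn's implementation (the vectors and gamma below are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
gamma = 0.5

# exp(-gamma * ||x - z||^2); here ||x - z||^2 = 1 + 4 = 5.
manual = np.exp(-gamma * np.sum((x - z) ** 2))

print(manual)
print(rbf_kernel(x, z, gamma=gamma)[0, 0])  # same value
```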
The C parameter controls the trade-off between a smooth decision boundary (large margin) and correctly classifying training points. A small C makes the margin wider, accepting more misclassifications (higher bias, lower variance). A large C makes the margin narrower, trying to classify all training points correctly (lower bias, higher variance), which can lead to overfitting. The gamma parameter (for RBF and polynomial kernels) controls how much influence a single training example has. A small gamma means far-reaching influence (smoother, simpler boundary), while a large gamma means only nearby points matter (more complex, wiggly boundary).
The standard approach to tuning is grid search with cross-validation. You define a grid of C and gamma values, typically on a logarithmic scale (e.g., C = [0.001, 0.01, 0.1, 1, 10, 100] and gamma = [0.001, 0.01, 0.1, 1]). For each combination, you train the SVM using k-fold cross-validation and evaluate performance. The combination with the best cross-validation score is selected. Randomized search and Bayesian optimization (e.g., using Optuna) are more efficient alternatives for larger search spaces, as they avoid evaluating every combination.
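A sketch of this procedure with scikit-learn's GridSearchCV (the grid values and the iris dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Logarithmic grid over C and gamma (illustrative values).
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}

# 5-fold cross-validation for every (C, gamma) combination.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)           # winning combination
print(round(search.best_score_, 3))  # its mean CV accuracy
```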
SVM is inherently a binary classifier, so multiclass problems require decomposition strategies. The two most common approaches are One-vs-Rest (OvR) and One-vs-One (OvO). In One-vs-Rest, you train K binary classifiers, one for each class, where each classifier separates one class from all the others. At prediction time, the class whose classifier produces the highest decision function value wins.
In One-vs-One, you train K*(K-1)/2 binary classifiers, one for every pair of classes. At prediction time, each classifier votes for one of its two classes, and the class with the most votes wins (majority voting). OvO is generally preferred for SVM because each subproblem involves fewer training points, making individual classifiers faster to train. Scikit-learn's SVC uses OvO internally. Some advanced approaches like Crammer-Singer formulate multiclass SVM as a single optimization problem, but these are less commonly used in practice.
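The classifier counts are easy to verify with scikit-learn's explicit wrappers (toy 4-class data assumed): OvO should train K*(K-1)/2 models and OvR should train K:

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Four toy clusters, one per class (illustrative only).
rng = np.random.RandomState(0)
K = 4
X = np.r_[tuple(rng.randn(10, 2) + 3 * k for k in range(K))]
y = np.repeat(np.arange(K), 10)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovo.estimators_))  # K*(K-1)/2 = 6 pairwise classifiers
print(len(ovr.estimators_))  # K = 4 one-vs-rest classifiers
```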
SVM has several notable advantages. It is highly effective in high-dimensional spaces, even when the number of dimensions exceeds the number of samples. It is memory-efficient because only support vectors are stored in the model. The kernel trick provides great flexibility to model non-linear boundaries. SVM is also robust to overfitting in high-dimensional spaces due to its maximum-margin principle, and it has strong theoretical guarantees through structural risk minimization.
However, SVM has significant disadvantages. Training time scales poorly with dataset size -- O(n^2) to O(n^3) for the standard QP solver -- making it impractical for very large datasets. It does not directly provide probability estimates (Platt scaling is used as a post-hoc workaround but requires additional cross-validation). Choosing the right kernel and tuning hyperparameters (C, gamma) can be challenging and computationally expensive. SVM is also sensitive to feature scaling, and it does not perform well when classes heavily overlap or when the dataset is very noisy.
The primal SVM objective for hard-margin is: minimize (1/2)||w||^2 subject to y_i(w·x_i + b) ≥ 1 for all i. We minimize ||w||^2 because the margin width is 2/||w||, so minimizing the norm maximizes the margin. The constraints ensure every point is on the correct side with at least unit functional margin. For soft margin, we add slack variables: minimize (1/2)||w||^2 + C * sum(xi_i) subject to y_i(w·x_i + b) ≥ 1 - xi_i and xi_i ≥ 0.
To derive the dual, we introduce Lagrange multipliers alpha_i ≥ 0 for each constraint and form the Lagrangian. Taking derivatives with respect to w and b, setting them to zero, and substituting back yields the dual problem: maximize sum(alpha_i) - (1/2) * sum_i sum_j (alpha_i * alpha_j * y_i * y_j * x_i·x_j) subject to 0 ≤ alpha_i ≤ C and sum(alpha_i * y_i) = 0. The dual is preferred because it depends only on dot products x_i·x_j, enabling the kernel trick. Non-zero alpha_i correspond to support vectors. The KKT complementarity conditions guarantee that alpha_i > 0 only for points on or violating the margin.
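The dual constraints can be spot-checked on a fitted model (toy data assumed). scikit-learn's dual_coef_ stores y_i * alpha_i for the support vectors; since alpha_i = 0 for all other points, the sum over support vectors alone should vanish, and each alpha_i should be bounded by C:

```python
import numpy as np
from sklearn.svm import SVC

# Two toy clusters with labels in {-1, +1} (illustrative only).
rng = np.random.RandomState(1)
X = np.r_[rng.randn(30, 2) - 2, rng.randn(30, 2) + 2]
y = np.r_[-np.ones(30), np.ones(30)]

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_[i] = y_i * alpha_i, so sum(alpha_i * y_i) = 0 ...
print(np.isclose(clf.dual_coef_.sum(), 0.0, atol=1e-6))
# ... and 0 <= alpha_i <= C for every support vector.
print(np.all(np.abs(clf.dual_coef_) <= C + 1e-9))
```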
The soft-margin SVM optimization can be reformulated as an unconstrained empirical risk minimization problem: minimize (1/n) * sum(max(0, 1 - y_i * f(x_i))) + lambda * ||w||^2, where f(x) = w·x + b. The first term is the hinge loss L(y, f(x)) = max(0, 1 - y*f(x)), which penalizes points that are misclassified or within the margin. The second term is the L2 regularizer that controls model complexity. The parameter lambda = 1/(2nC) connects this formulation to the constrained version.
The hinge loss has a characteristic shape: it is zero when a point is correctly classified with functional margin at least 1 (well outside the margin), and it increases linearly as the point moves into or through the margin to the wrong side. Unlike logistic loss, hinge loss is exactly zero for well-classified points, which is why SVM produces sparse solutions -- only the support vectors (points at or within the margin) contribute to the loss and influence the model. This sparsity is a defining characteristic of SVM and directly follows from the non-smooth nature of the hinge loss at the kink point.
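The three regimes of the hinge loss are easy to tabulate:

```python
import numpy as np

def hinge(y, fx):
    """Hinge loss max(0, 1 - y * f(x)) for label y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

# A y = +1 example at three different decision values f(x):
print(hinge(1, 2.0))   # correctly classified, outside the margin -> 0
print(hinge(1, 0.5))   # correct side but inside the margin -> 0.5
print(hinge(1, -1.0))  # misclassified -> 2.0
```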
Both SVM and logistic regression are linear classifiers that can be expressed as regularized empirical risk minimization: minimize (1/n) * sum(L(y_i, f(x_i))) + lambda * ||w||^2. The only difference is the loss function. SVM uses hinge loss: max(0, 1 - y*f(x)). Logistic regression uses logistic loss: log(1 + exp(-y*f(x))). Both losses are convex, ensuring a global optimum, but they have different properties.
Hinge loss is exactly zero for well-classified points (functional margin ≥ 1), making SVM sensitive only to support vectors near the boundary. Logistic loss is always positive and never exactly zero, so every data point contributes to the gradient and influences the model. This means logistic regression produces well-calibrated probability estimates (via the sigmoid function), while SVM does not natively output probabilities. SVM tends to produce sparser models and can generalize better with limited data, while logistic regression is preferred when probability calibration matters or when all data points carry useful signal. With kernel extensions, both can handle non-linear boundaries.
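Evaluating both losses on the same functional margins makes the difference concrete: hinge cuts off exactly at margin 1, while logistic loss stays strictly positive:

```python
import numpy as np

# Functional margins y * f(x), from badly wrong to comfortably right.
m = np.array([-2.0, 0.0, 1.0, 3.0])

hinge = np.maximum(0.0, 1.0 - m)   # SVM loss
logistic = np.log1p(np.exp(-m))    # logistic regression loss

print(hinge)     # zero once the margin reaches 1
print(logistic)  # strictly positive everywhere
```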
The Vapnik-Chervonenkis (VC) dimension measures the capacity (complexity) of a hypothesis class. It is defined as the largest number of data points that can be shattered (i.e., correctly classified under all possible label assignments) by the hypothesis class. For linear classifiers in d-dimensional space, the VC dimension is d + 1. A higher VC dimension means the model can fit more complex patterns but also has a greater risk of overfitting.
SVM is deeply connected to VC theory through structural risk minimization (SRM). The generalization error bound depends on the VC dimension of the hypothesis class, not just the training error. By maximizing the margin, SVM effectively restricts the hypothesis class to large-margin classifiers, which have a lower VC dimension than the full class of linear classifiers. Specifically, Vapnik showed that the VC dimension of the set of hyperplanes with margin at least rho is bounded by min(R^2/rho^2, d) + 1, where R is the radius of the smallest enclosing sphere. Thus, maximizing the margin minimizes the VC dimension bound, which tightens the generalization bound. This is the theoretical justification for why SVM generalizes well.
Standard SVM solvers like SMO have O(n^2) to O(n^3) time complexity and require storing the kernel matrix, which is O(n^2) in memory. For datasets with millions of samples, this is intractable. The first strategy is to solve the primal hinge loss objective for a linear SVM directly, which scales as O(n*d) per epoch and requires only O(d) memory. Libraries such as LIBLINEAR (which uses coordinate descent) and scikit-learn's SGDClassifier (which uses stochastic gradient descent) implement this efficiently and can handle millions of samples.
For non-linear SVM at scale, several approaches exist. Random Fourier Features (Rahimi and Recht, 2007) approximate the RBF kernel by mapping data into a finite-dimensional space where linear SVM can be applied, preserving most of the kernel's expressive power. The Nystrom method approximates the kernel matrix using a subset of columns. Budgeted or online SVM variants maintain a fixed-size set of support vectors and update them incrementally. You can also subsample the data, train on a representative subset, or use divide-and-conquer approaches that train local SVMs on data partitions and combine them. Finally, GPU-accelerated SVM implementations like ThunderSVM can provide significant speedups on moderately large datasets.
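A sketch of the Random Fourier Features approach with scikit-learn, where RBFSampler produces an approximate RBF feature map that feeds a hinge-loss SGDClassifier (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a large dataset (illustrative only).
X, y = make_classification(n_samples=5000, n_features=10,
                           n_informative=5, random_state=0)

model = make_pipeline(
    StandardScaler(),                                      # SVMs need scaling
    RBFSampler(gamma=0.1, n_components=300, random_state=0),  # approx RBF map
    SGDClassifier(loss="hinge", alpha=1e-4, random_state=0),  # linear SVM
)
model.fit(X, y)

print(round(model.score(X, y), 3))  # training accuracy
```

Each step is O(n) in the number of samples, so the same pipeline scales to data far beyond what a kernel-matrix solver can handle.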