SVM Interview Questions
15 commonly asked interview questions with detailed answers.
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal separating hyperplane between classes. It works by identifying the decision boundary that maximizes the margin -- the distance between the boundary and the closest data points from each class. These closest points are called support vectors.
SVM can solve both classification and regression problems. For classification, it separates two classes with the widest possible gap; multiclass problems are handled by combining binary classifiers. For regression (SVR), it fits a tube of tolerance around the data. SVM is particularly effective in high-dimensional spaces, making it popular for text classification, image recognition, bioinformatics, and handwriting recognition.
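As a minimal sketch (toy two-class data and parameters assumed), fitting and querying a linear SVM with scikit-learn's SVC looks like this:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable two-class data (illustrative only).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

# A linear-kernel SVM finds the maximum-margin separating line.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Predict one point near each cluster.
print(clf.predict([[0.5, 0.5], [2.5, 2.8]]))
```

In practice, features should be standardized before fitting, since SVM is sensitive to feature scaling.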
A hyperplane is a flat subspace of one dimension less than its ambient space. In 2D, a hyperplane is a line; in 3D, it is a plane; and in higher dimensions, it generalizes accordingly. Mathematically, it is defined by the equation w·x + b = 0, where w is the weight vector (normal to the hyperplane), x is the input vector, and b is the bias term.
SVM uses the hyperplane as a decision boundary to separate data points belonging to different classes. Among all possible hyperplanes that correctly classify the training data, SVM selects the one that maximizes the margin -- the perpendicular distance from the hyperplane to the nearest data points on either side. This maximum-margin hyperplane provides the best generalization to unseen data by creating the largest buffer zone between classes.
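The decision rule follows directly from the hyperplane equation; the w and b values below are assumed for illustration, not learned:

```python
import numpy as np

# Assumed hyperplane parameters for illustration.
w = np.array([1.0, -1.0])   # normal vector to the hyperplane
b = 0.0                     # bias term

def decision(x):
    """Signed value of w.x + b: the sign gives the predicted side of the
    hyperplane; dividing the magnitude by ||w|| gives the perpendicular
    distance from x to the hyperplane."""
    return np.dot(w, x) + b

print(decision(np.array([2.0, 1.0])))   # positive side
print(decision(np.array([1.0, 3.0])))   # negative side
```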
Support vectors are the training data points that lie closest to the decision boundary (hyperplane). They sit exactly on the margin boundaries, at a distance of 1/||w|| from the hyperplane (under the canonical scaling in which the closest points satisfy y_i(w·x_i + b) = 1). These are the most difficult points to classify and the ones that directly determine the position and orientation of the optimal hyperplane.
Support vectors are critically important because the SVM solution depends entirely on them. If you removed all other training points and retrained, you would get the exact same hyperplane, as long as the support vectors remain. This property makes SVM memory-efficient at prediction time -- only the support vectors need to be stored. It also means that SVM is robust to outliers that are far from the decision boundary, since those points have no influence on the model.
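This property can be checked empirically. The sketch below (toy Gaussian clusters assumed) refits a linear SVM on its own support vectors and compares the result, up to solver tolerance, to the model trained on the full data:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy clusters (illustrative only).
rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - 2, rng.randn(20, 2) + 2]
y = np.r_[np.zeros(20), np.ones(20)]

full = SVC(kernel="linear", C=1.0).fit(X, y)

# Retrain using only the support vectors of the full model.
reduced = SVC(kernel="linear", C=1.0).fit(full.support_vectors_,
                                          y[full.support_])

# Same hyperplane (up to solver tolerance) and identical predictions.
print(np.allclose(full.coef_, reduced.coef_, atol=1e-2))
print((full.predict(X) == reduced.predict(X)).all())
```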
Hard margin SVM requires that all training data points be correctly classified with no violations of the margin. It works only when the data is perfectly linearly separable, meaning there exists a hyperplane that can separate the two classes without any misclassifications. If even one point falls on the wrong side, hard margin SVM will fail to find a solution.
Soft margin SVM, introduced by Cortes and Vapnik (1995), relaxes this constraint by allowing some data points to violate the margin or even be misclassified. It introduces slack variables (xi) that measure the degree of margin violation for each point, and a regularization parameter C that controls the trade-off between maximizing the margin and minimizing the total violation. A larger C penalizes violations more heavily, yielding a narrower margin. A smaller C allows more violations, producing a wider margin that is more tolerant of noise and outliers.
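A quick way to see the trade-off is to count support vectors at two extremes of C on overlapping data (toy data assumed); the wider margin of a small C typically touches many more points:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping toy clusters (illustrative only).
rng = np.random.RandomState(42)
X = np.r_[rng.randn(50, 2) - 1, rng.randn(50, 2) + 1]
y = np.r_[np.zeros(50), np.ones(50)]

loose = SVC(kernel="linear", C=0.01).fit(X, y)    # wide margin, tolerant
strict = SVC(kernel="linear", C=100.0).fit(X, y)  # narrow margin, strict

# The wide-margin model usually has (many) more support vectors.
print(len(loose.support_), len(strict.support_))
```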
SVM has been successfully applied across many domains. In text classification and NLP, it is used for spam detection, sentiment analysis, and document categorization. In bioinformatics, SVM classifies proteins, detects genes, and predicts disease from gene expression data. In image recognition, it powers handwriting recognition systems like those used in postal services and has been used for face detection and object classification.
In finance, SVM is applied to stock market prediction, credit scoring, and fraud detection. In medical diagnosis, it helps classify tumors as malignant or benign and assists in disease prediction from clinical data. SVM is also used in remote sensing for land-use classification from satellite imagery, and in intrusion detection systems for cybersecurity. Its strength in high-dimensional, small-sample-size problems makes it particularly well-suited to these scientific and engineering applications.
The kernel trick allows SVM to learn non-linear decision boundaries without explicitly transforming the data into a higher-dimensional feature space. The key insight is that the SVM optimization (in its dual form) and the final decision function depend only on dot products between pairs of data points, not on the individual feature vectors. A kernel function K(x_i, x_j) computes this dot product in the high-dimensional space directly from the original input vectors, without ever performing the actual transformation.
This is enormously useful because explicitly mapping data to a very high (or even infinite) dimensional space would be computationally prohibitive. For example, the RBF kernel implicitly maps data into an infinite-dimensional space, yet the kernel evaluation itself takes O(d) time where d is the original dimension. The kernel trick makes it feasible for SVM to capture complex non-linear relationships while maintaining computational tractability, which is why kernel SVM has been so successful on problems where linear methods fail.
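The trick is easy to verify in a small case. For the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)^2 in 2D, the explicit feature map is phi(x) = (x1^2, sqrt(2)·x1·x2, x2^2), and the kernel reproduces phi(x)·phi(z) without ever forming phi:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, z):
    """K(x, z) = (x.z)^2, computed directly in the original space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))  # explicit map, then dot product
print(kernel(x, z))            # kernel shortcut: (3 + 8)^2 = 121
```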
The linear kernel K(x_i, x_j) = x_i · x_j is the simplest and fastest. It produces a straight decision boundary and works best when data is linearly separable or nearly so. It is the preferred choice for high-dimensional data like text (where d is large), because in such spaces data tends to be linearly separable and a linear kernel avoids overfitting.
The polynomial kernel K(x_i, x_j) = (gamma * x_i · x_j + r)^d maps data into a feature space of polynomial combinations up to degree d. It captures feature interactions and is useful when the decision boundary has polynomial curvature. However, it can be computationally expensive for high degrees and suffers from numerical instability.

The RBF (Radial Basis Function) kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) is the most commonly used non-linear kernel. It maps data into an infinite-dimensional space and can model arbitrarily complex boundaries. The gamma parameter controls the radius of influence of each support vector: a small gamma means wide influence (smoother boundary), while a large gamma means narrow influence (more complex boundary).
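The RBF formula can be evaluated by hand and checked against scikit-learn's implementation (the vectors and gamma below are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
gamma = 0.5

# exp(-gamma * ||x - z||^2); here ||x - z||^2 = 1 + 4 = 5.
manual = np.exp(-gamma * np.sum((x - z) ** 2))

print(manual)
print(rbf_kernel(x, z, gamma=gamma)[0, 0])  # same value
```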
The C parameter controls the trade-off between a smooth decision boundary (large margin) and correctly classifying training points. A small C makes the margin wider, accepting more misclassifications (higher bias, lower variance). A large C makes the margin narrower, trying to classify all training points correctly (lower bias, higher variance), which can lead to overfitting. The gamma parameter (for RBF and polynomial kernels) controls how much influence a single training example has. A small gamma means far-reaching influence (smoother, simpler boundary), while a large gamma means only nearby points matter (more complex, wiggly boundary).
The standard approach to tuning is grid search with cross-validation. You define a grid of C and gamma values, typically on a logarithmic scale (e.g., C = [0.001, 0.01, 0.1, 1, 10, 100] and gamma = [0.001, 0.01, 0.1, 1]). For each combination, you train the SVM using k-fold cross-validation and evaluate performance. The combination with the best cross-validation score is selected. Randomized search and Bayesian optimization (e.g., using Optuna) are more efficient alternatives for larger search spaces, as they avoid evaluating every combination.
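A sketch of this procedure with scikit-learn's GridSearchCV (the grid values and the iris dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Logarithmic grid over C and gamma (illustrative values).
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}

# 5-fold cross-validation for every (C, gamma) combination.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)           # winning combination
print(round(search.best_score_, 3))  # its mean CV accuracy
```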
SVM is inherently a binary classifier, so multiclass problems require decomposition strategies. The two most common approaches are One-vs-Rest (OvR) and One-vs-One (OvO). In One-vs-Rest, you train K binary classifiers, one for each class, where each classifier separates one class from all the others. At prediction time, the class whose classifier produces the highest decision function value wins.
In One-vs-One, you train K*(K-1)/2 binary classifiers, one for every pair of classes. At prediction time, each classifier votes for one of its two classes, and the class with the most votes wins (majority voting). OvO is generally preferred for SVM because each subproblem involves fewer training points, making individual classifiers faster to train. Scikit-learn's SVC uses OvO internally. Some advanced approaches like Crammer-Singer formulate multiclass SVM as a single optimization problem, but these are less commonly used in practice.
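The classifier counts are easy to verify with scikit-learn's explicit wrappers (toy 4-class data assumed): OvO should train K*(K-1)/2 models and OvR should train K:

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Four toy clusters, one per class (illustrative only).
rng = np.random.RandomState(0)
K = 4
X = np.r_[tuple(rng.randn(10, 2) + 3 * k for k in range(K))]
y = np.repeat(np.arange(K), 10)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovo.estimators_))  # K*(K-1)/2 = 6 pairwise classifiers
print(len(ovr.estimators_))  # K = 4 one-vs-rest classifiers
```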
SVM has several notable advantages. It is highly effective in high-dimensional spaces, even when the number of dimensions exceeds the number of samples. It is memory-efficient because only support vectors are stored in the model. The kernel trick provides great flexibility to model non-linear boundaries. SVM is also robust to overfitting in high-dimensional spaces due to its maximum-margin principle, and it has strong theoretical guarantees through structural risk minimization.
However, SVM has significant disadvantages. Training time scales poorly with dataset size -- O(n^2) to O(n^3) for the standard QP solver -- making it impractical for very large datasets. It does not directly provide probability estimates (Platt scaling is used as a post-hoc workaround but requires additional cross-validation). Choosing the right kernel and tuning hyperparameters (C, gamma) can be challenging and computationally expensive. SVM is also sensitive to feature scaling, and it does not perform well when classes heavily overlap or when the dataset is very noisy.
The primal SVM objective for hard-margin is: minimize (1/2)||w||^2 subject to y_i(w·x_i + b) ≥ 1 for all i. We minimize ||w||^2 because the margin width is 2/||w||, so minimizing the norm maximizes the margin. The constraints ensure every point is on the correct side with at least unit functional margin. For soft margin, we add slack variables: minimize (1/2)||w||^2 + C * sum(xi_i) subject to y_i(w·x_i + b) ≥ 1 - xi_i and xi_i ≥ 0.
To derive the dual, we introduce Lagrange multipliers alpha_i ≥ 0 for each constraint and form the Lagrangian. Taking derivatives with respect to w and b, setting them to zero, and substituting back yields the dual problem: maximize sum(alpha_i) - (1/2) * sum_i sum_j (alpha_i * alpha_j * y_i * y_j * x_i·x_j) subject to 0 ≤ alpha_i ≤ C and sum(alpha_i * y_i) = 0. The dual is preferred because it depends only on dot products x_i·x_j, enabling the kernel trick. Non-zero alpha_i correspond to support vectors. The KKT complementarity conditions guarantee that alpha_i > 0 only for points on or violating the margin.
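The dual constraints can be spot-checked on a fitted model (toy data assumed). scikit-learn's dual_coef_ stores y_i * alpha_i for the support vectors; since alpha_i = 0 for all other points, the sum over support vectors alone should vanish, and each alpha_i should be bounded by C:

```python
import numpy as np
from sklearn.svm import SVC

# Two toy clusters with labels in {-1, +1} (illustrative only).
rng = np.random.RandomState(1)
X = np.r_[rng.randn(30, 2) - 2, rng.randn(30, 2) + 2]
y = np.r_[-np.ones(30), np.ones(30)]

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_[i] = y_i * alpha_i, so sum(alpha_i * y_i) = 0 ...
print(np.isclose(clf.dual_coef_.sum(), 0.0, atol=1e-6))
# ... and 0 <= alpha_i <= C for every support vector.
print(np.all(np.abs(clf.dual_coef_) <= C + 1e-9))
```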
The soft-margin SVM optimization can be reformulated as an unconstrained empirical risk minimization problem: minimize (1/n) * sum(max(0, 1 - y_i * f(x_i))) + lambda * ||w||^2, where f(x) = w·x + b. The first term is the hinge loss L(y, f(x)) = max(0, 1 - y*f(x)), which penalizes points that are misclassified or within the margin. The second term is the L2 regularizer that controls model complexity. The parameter lambda = 1/(2nC) connects this formulation to the constrained version.
The hinge loss has a characteristic shape: it is zero when a point is correctly classified with functional margin at least 1 (well outside the margin), and it increases linearly as the point moves into or through the margin to the wrong side. Unlike logistic loss, hinge loss is exactly zero for well-classified points, which is why SVM produces sparse solutions -- only the support vectors (points at or within the margin) contribute to the loss and influence the model. This sparsity is a defining characteristic of SVM and directly follows from the non-smooth nature of the hinge loss at the kink point.
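The three regimes of the hinge loss are easy to tabulate:

```python
import numpy as np

def hinge(y, fx):
    """Hinge loss max(0, 1 - y * f(x)) for label y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

# A y = +1 example at three different decision values f(x):
print(hinge(1, 2.0))   # correctly classified, outside the margin -> 0
print(hinge(1, 0.5))   # correct side but inside the margin -> 0.5
print(hinge(1, -1.0))  # misclassified -> 2.0
```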
Both SVM and logistic regression are linear classifiers that can be expressed as regularized empirical risk minimization: minimize (1/n) * sum(L(y_i, f(x_i))) + lambda * ||w||^2. The only difference is the loss function. SVM uses hinge loss: max(0, 1 - y*f(x)). Logistic regression uses logistic loss: log(1 + exp(-y*f(x))). Both losses are convex, ensuring a global optimum, but they have different properties.
Hinge loss is exactly zero for well-classified points (functional margin ≥ 1), making SVM sensitive only to support vectors near the boundary. Logistic loss is always positive and never exactly zero, so every data point contributes to the gradient and influences the model. This means logistic regression produces well-calibrated probability estimates (via the sigmoid function), while SVM does not natively output probabilities. SVM tends to produce sparser models and can generalize better with limited data, while logistic regression is preferred when probability calibration matters or when all data points carry useful signal. With kernel extensions, both can handle non-linear boundaries.
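Evaluating both losses on the same functional margins makes the difference concrete: hinge cuts off exactly at margin 1, while logistic loss stays strictly positive:

```python
import numpy as np

# Functional margins y * f(x), from badly wrong to comfortably right.
m = np.array([-2.0, 0.0, 1.0, 3.0])

hinge = np.maximum(0.0, 1.0 - m)   # SVM loss
logistic = np.log1p(np.exp(-m))    # logistic regression loss

print(hinge)     # zero once the margin reaches 1
print(logistic)  # strictly positive everywhere
```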
The Vapnik-Chervonenkis (VC) dimension measures the capacity (complexity) of a hypothesis class. It is defined as the largest number of data points that can be shattered (i.e., correctly classified under all possible label assignments) by the hypothesis class. For linear classifiers in d-dimensional space, the VC dimension is d + 1. A higher VC dimension means the model can fit more complex patterns but also has a greater risk of overfitting.
SVM is deeply connected to VC theory through structural risk minimization (SRM). The generalization error bound depends on the VC dimension of the hypothesis class, not just the training error. By maximizing the margin, SVM effectively restricts the hypothesis class to large-margin classifiers, which have a lower VC dimension than the full class of linear classifiers. Specifically, Vapnik showed that the VC dimension of the set of hyperplanes with margin at least rho is bounded by min(R^2/rho^2, d) + 1, where R is the radius of the smallest enclosing sphere. Thus, maximizing the margin minimizes the VC dimension bound, which tightens the generalization bound. This is the theoretical justification for why SVM generalizes well.
Standard SVM solvers like SMO have O(n^2) to O(n^3) time complexity and require storing the kernel matrix, which is O(n^2) in memory. For datasets with millions of samples, this is intractable. The first strategy is to solve the primal hinge loss objective for a linear SVM directly, which scales as O(n*d) per epoch and requires only O(d) memory. Libraries such as LIBLINEAR (which uses coordinate descent) and scikit-learn's SGDClassifier (which uses stochastic gradient descent) implement this efficiently and can handle millions of samples.
For non-linear SVM at scale, several approaches exist. Random Fourier Features (Rahimi and Recht, 2007) approximate the RBF kernel by mapping data into a finite-dimensional space where linear SVM can be applied, preserving most of the kernel's expressive power. The Nystrom method approximates the kernel matrix using a subset of columns. Budgeted or online SVM variants maintain a fixed-size set of support vectors and update them incrementally. You can also subsample the data, train on a representative subset, or use divide-and-conquer approaches that train local SVMs on data partitions and combine them. Finally, GPU-accelerated SVM implementations like ThunderSVM can provide significant speedups on moderately large datasets.
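A sketch of the Random Fourier Features approach with scikit-learn, where RBFSampler produces an approximate RBF feature map that feeds a hinge-loss SGDClassifier (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a large dataset (illustrative only).
X, y = make_classification(n_samples=5000, n_features=10,
                           n_informative=5, random_state=0)

model = make_pipeline(
    StandardScaler(),                                      # SVMs need scaling
    RBFSampler(gamma=0.1, n_components=300, random_state=0),  # approx RBF map
    SGDClassifier(loss="hinge", alpha=1e-4, random_state=0),  # linear SVM
)
model.fit(X, y)

print(round(model.score(X, y), 3))  # training accuracy
```

Each step is O(n) in the number of samples, so the same pipeline scales to data far beyond what a kernel-matrix solver can handle.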