SVM
Cheat Sheet
Everything you need on one page. Perfect for revision, interviews, and quick reference.
Hard margin: requires perfectly linearly separable data. No misclassification is allowed. Use only when the classes have zero overlap.
$C$ controls the trade-off between maximizing the margin and minimizing classification errors. Large $C$ = less tolerance for violations.
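A minimal sketch of the $C$ trade-off, using scikit-learn and assumed toy blob data: a small $C$ widens the margin and tolerates violations, so more points sit on or inside the margin and become support vectors.

```python
# Sketch (assumed toy data): effect of C on the number of support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters so that margin violations actually occur.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

n_sv = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv[C] = int(clf.n_support_.sum())

# Small C = wide, tolerant margin = more support vectors;
# large C = narrow, strict margin = fewer support vectors.
print(n_sv)
```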
RBF is the most popular kernel. $\gamma = \frac{1}{2\sigma^2}$ controls the influence radius of each support vector.
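The RBF formula above can be evaluated directly; a small sketch with assumed example points, showing how $\gamma$ comes from $\sigma$:

```python
# Sketch: RBF kernel value for two points, with gamma = 1 / (2 * sigma^2).
import numpy as np

def rbf_kernel(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])          # squared distance ||x - z||^2 = 1 + 4 = 5

sigma = 1.0
gamma = 1.0 / (2 * sigma ** 2)    # gamma = 0.5

k = rbf_kernel(x, z, gamma)       # exp(-0.5 * 5) = exp(-2.5)
```

Identical points always give $K = 1$; the value decays toward 0 as points move apart, faster for larger $\gamma$.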
Mercer's theorem: $K$ is a valid kernel if and only if the kernel matrix is positive semi-definite for all inputs.
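Mercer's condition can be checked empirically on a sampled kernel matrix; a sketch with assumed random data, verifying that the RBF kernel matrix has no (meaningfully) negative eigenvalues:

```python
# Sketch: empirical PSD check of an RBF kernel matrix on sampled points.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

gamma = 0.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)       # 20x20 RBF kernel matrix

eigvals = np.linalg.eigvalsh(K)     # symmetric eigensolver
is_psd = eigvals.min() >= -1e-10    # tolerate tiny numerical error
```

A single negative eigenvalue on any input set would prove a candidate kernel invalid; passing on samples is only evidence, not proof.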
Always tune $C$ and $\gamma$ together using grid search or randomized search with cross-validation.
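A sketch of that joint tuning with scikit-learn's GridSearchCV; the toy dataset and grid values are illustrative assumptions:

```python
# Sketch: tuning C and gamma together with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

best = search.best_params_   # e.g. the (C, gamma) pair with best CV score
```

In practice, search on log-spaced grids (powers of 10) first, then refine around the best cell.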
SVR: the $\varepsilon$-tube defines a margin of tolerance where errors are not penalized. Support vectors lie on or outside the tube boundary.
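A sketch of that tube behavior with scikit-learn's SVR on an assumed 1-D toy problem: widening $\varepsilon$ leaves more points strictly inside the tube, so fewer of them become support vectors.

```python
# Sketch (assumed toy data): wider epsilon-tube => fewer support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 4, 60)[:, None]
y = np.sin(X).ravel() + 0.05 * rng.normal(size=60)

wide = SVR(kernel="rbf", epsilon=0.5).fit(X, y)     # wide tube
narrow = SVR(kernel="rbf", epsilon=0.01).fit(X, y)  # narrow tube

# Points fitted within the tube incur no loss and are not support vectors.
print(len(wide.support_), len(narrow.support_))
```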
For imbalanced datasets, set class_weight='balanced' so each class is weighted inversely proportional to its frequency.
Q: What are support vectors?
A: The data points closest to the decision boundary (on or within the margin). They are the only points that influence the position and orientation of the hyperplane; removing non-support vectors does not change the model.
Q: Why does SVM maximize the margin?
A: A larger margin provides better generalization to unseen data. By maximizing the margin, SVM finds the decision boundary with the greatest separation between classes, reducing overfitting risk.
Q: What does the $C$ parameter control?
A: $C$ balances margin width vs. classification errors. High $C$ penalizes misclassifications heavily (narrow margin, risk of overfitting). Low $C$ allows more errors for a wider margin (risk of underfitting).
Q: What is the kernel trick?
A: Instead of explicitly mapping data to a higher-dimensional space (expensive), the kernel trick computes the dot product in that space directly using a kernel function, giving non-linear decision boundaries efficiently.
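The answer above can be verified numerically; a sketch for $K(x, z) = (x \cdot z)^2$ in 2-D, whose explicit feature map is $\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2)$:

```python
# Sketch: kernel trick for K(x, z) = (x . z)^2 in two dimensions.
import numpy as np

def phi(v):
    # Explicit feature map into 3-D space.
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

explicit = phi(x) @ phi(z)   # dot product computed in the mapped space
via_kernel = (x @ z) ** 2    # same value, without ever mapping: (5)^2 = 25
```

Both routes give 25, but the kernel route never materializes the higher-dimensional vectors; with RBF, the mapped space is infinite-dimensional and the explicit route is impossible.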
Q: When would you choose SVM over Logistic Regression?
A: SVM excels with high-dimensional data, small-to-medium datasets, and when you need non-linear boundaries (via kernels). Logistic Regression is preferred when you need probability outputs and faster training on large datasets.
Q: How does SVM handle multi-class classification?
A: SVM is inherently binary. Multi-class is achieved via One-vs-One (OvO), which builds $\frac{k(k-1)}{2}$ classifiers, or One-vs-Rest (OvR), which builds $k$ classifiers. scikit-learn's SVC uses OvO internally by default.
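The classifier counts in the answer above are easy to sanity-check; a sketch with an assumed $k = 5$ classes:

```python
# Sketch: number of binary classifiers built by each multi-class strategy.
k = 5  # assumed number of classes

ovo = k * (k - 1) // 2   # One-vs-One: one classifier per class pair
ovr = k                  # One-vs-Rest: one classifier per class

print(ovo, ovr)          # for k = 5: 10 and 5
```

OvO trains more models, but each sees only two classes' data, so the individual fits are smaller and faster.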
Q: Why is feature scaling important for SVM?
A: SVM relies on distances between data points. Features on larger scales dominate the distance calculation, causing the model to ignore smaller-scale features. Standardization ensures all features contribute equally.
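A sketch of the standard remedy, assuming scikit-learn and toy data: put the scaler and the SVM in one pipeline so the scaler is fit only on the training folds during cross-validation.

```python
# Sketch: StandardScaler + SVC in a pipeline, with one deliberately
# mis-scaled feature (toy data is an assumption for illustration).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_informative=5, random_state=0)
X[:, 0] *= 1000.0  # one feature on a much larger scale than the rest

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling inside the pipeline also prevents data leakage: statistics from the held-out fold never inform the transform.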
Q: What does $\gamma$ control in the RBF kernel?
A: $\gamma$ defines how far the influence of a single training example reaches. High $\gamma$ means each point has close-range influence (complex boundary, overfitting). Low $\gamma$ means far-range influence (smooth boundary, underfitting).