
SVM Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Hyperplane:
$$\mathbf{w} \cdot \mathbf{x} + b = 0$$
Decision Function:
$$f(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \mathbf{x} + b)$$
Margin Width:
$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$
Optimization Objective:
$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$$
Hinge Loss:
$$L(y, f(x)) = \max(0, 1 - y \cdot f(x))$$
Distance to Hyperplane:
$$d = \frac{|\mathbf{w} \cdot \mathbf{x} + b|}{\|\mathbf{w}\|}$$
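The margin-width and distance formulas can be checked numerically. A minimal sketch with NumPy, assuming a toy hyperplane $\mathbf{w} = (3, 4)$, $b = -5$ (chosen only for easy arithmetic):

```python
import numpy as np

# Toy hyperplane: w·x + b = 0 with w = (3, 4), b = -5
w = np.array([3.0, 4.0])
b = -5.0

x = np.array([2.0, 1.0])

# Distance to hyperplane: |w·x + b| / ||w||
d = abs(w @ x + b) / np.linalg.norm(w)   # |6 + 4 - 5| / 5 = 1.0

# Margin width: 2 / ||w||
margin = 2 / np.linalg.norm(w)           # 2 / 5 = 0.4

print(d, margin)
```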

Hard Margin SVM

Objective:
$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$$
Constraint:
$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad \forall i$$
Dual Form:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{s.t.} \quad \alpha_i \geq 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Requires perfectly linearly separable data. No misclassification is allowed. Use only when data has zero overlap between classes.

Soft Margin SVM

Objective:
$$\min_{\mathbf{w}, b, \xi} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
Constraints:
$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Slack Variable $\xi_i$:
$\xi_i = 0$: on or beyond the margin, correctly classified; $0 < \xi_i < 1$: inside the margin but still correctly classified; $\xi_i \geq 1$: misclassified

$C$ controls the trade-off between maximizing the margin and minimizing classification errors. Large $C$ = less tolerance for violations.
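The effect of $C$ shows up directly in the number of support vectors. A sketch using scikit-learn's `SVC` on two synthetic Gaussian blobs (the data and $C$ values are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

# Low C tolerates margin violations -> wider margin, more support vectors.
loose = SVC(kernel="linear", C=0.01).fit(X, y)
strict = SVC(kernel="linear", C=100).fit(X, y)

print(loose.n_support_.sum(), strict.n_support_.sum())
```

The loose model keeps many points inside its wide margin (all of them become support vectors), while the strict model relies on only the few points nearest the boundary.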

Kernel Functions

Linear:
$$K(\mathbf{x}, \mathbf{x'}) = \mathbf{x} \cdot \mathbf{x'}$$
Polynomial:
$$K(\mathbf{x}, \mathbf{x'}) = (\mathbf{x} \cdot \mathbf{x'} + r)^d$$
RBF (Gaussian):
$$K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x'}\|^2\right)$$
Sigmoid:
$$K(\mathbf{x}, \mathbf{x'}) = \tanh(\alpha \, \mathbf{x} \cdot \mathbf{x'} + c)$$

RBF is the most popular kernel. $\gamma = \frac{1}{2\sigma^2}$ controls the influence radius of each support vector.

The Kernel Trick

Core Idea:
$$K(\mathbf{x}, \mathbf{x'}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{x'})$$
Decision (Dual):
$$f(\mathbf{x}) = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$$
  1. Map data to higher-dimensional space via $\phi(\mathbf{x})$ where it becomes linearly separable
  2. Compute $K(\mathbf{x}, \mathbf{x'})$ directly without ever calculating $\phi$ explicitly
  3. Massive computational savings: avoid working in potentially infinite-dimensional space

Mercer's theorem: $K$ is a valid kernel if and only if the kernel (Gram) matrix it produces is symmetric positive semi-definite for every finite set of inputs.
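The identity $K(\mathbf{x}, \mathbf{x'}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{x'})$ can be verified by hand for a small case. For the degree-2 polynomial kernel $(\mathbf{x} \cdot \mathbf{x'} + 1)^2$ on 2-D inputs, the explicit feature map is 6-dimensional; a sketch checking both sides agree:

```python
import numpy as np

def phi(v):
    # Explicit feature map for K(x, x') = (x·x' + 1)^2 with 2-D inputs
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])

lhs = (x @ xp + 1) ** 2   # kernel, computed in the original 2-D space
rhs = phi(x) @ phi(xp)    # same value via the 6-D explicit map

print(lhs, rhs)
```

The kernel side needs one 2-D dot product; the explicit side needs the 6-D map. For an RBF kernel the map is infinite-dimensional, so only the kernel side is computable at all.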

Hyperparameters

$C$ (Regularization):
High $C$ = narrow margin, fewer violations (risk overfitting). Low $C$ = wide margin, more violations (risk underfitting).
$\gamma$ (RBF width):
High $\gamma$ = tight decision boundary around points (overfit). Low $\gamma$ = smooth boundary (underfit).
$d$ (Polynomial degree):
Higher $d$ = more complex boundary. Common choices: $d = 2$ or $d = 3$.
Kernel Choice:
Start with linear. If underfitting, try RBF. Use polynomial for known interaction features.

Always tune $C$ and $\gamma$ together using grid search or randomized search with cross-validation.
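A sketch of such a joint search with scikit-learn's `GridSearchCV`, using the iris dataset as a stand-in and a conventional log-spaced grid (the specific grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling goes inside the pipeline so each CV fold scales on its own training split
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],        # log-spaced grids are conventional
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```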

SVM for Regression (SVR)

Objective:
$$\min_{\mathbf{w}, b, \xi, \xi^*} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$$
Constraints:
$$y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \leq \varepsilon + \xi_i, \quad (\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \leq \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \geq 0$$
ε-insensitive loss:
$$L_\varepsilon = \max(0, |y - f(x)| - \varepsilon)$$

The $\varepsilon$-tube defines a margin of tolerance where errors are not penalized. Support vectors lie on or outside the tube boundary.
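The tube's width directly controls model sparsity: widening $\varepsilon$ swallows more points and leaves fewer support vectors. A sketch with scikit-learn's `SVR` on a noisy sine curve (data and parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(80)   # noise std 0.1

# A tube wider than the noise contains most points -> few support vectors;
# a near-zero tube leaves almost every point on or outside it.
wide = SVR(kernel="rbf", C=10, epsilon=0.5).fit(X, y)
narrow = SVR(kernel="rbf", C=10, epsilon=0.01).fit(X, y)

print(len(wide.support_), len(narrow.support_))
```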

Advantages & Limitations

Advantages:

  • Effective in high-dimensional spaces (even when $d > n$)
  • Memory efficient: only stores support vectors
  • Versatile: different kernels for different problems
  • Robust to overfitting in high dimensions with proper $C$
  • Clear geometric intuition (maximum margin)

Limitations:

  • Not suitable for very large datasets ($O(n^2)$ to $O(n^3)$ training)
  • Sensitive to feature scaling - always standardize first
  • No direct probability estimates (use Platt scaling)
  • Poor performance with noisy data and overlapping classes
  • Kernel and hyperparameter choice requires careful tuning

Common Mistakes

  • Not scaling features before training (SVM is distance-based)
  • Using RBF kernel when linear is sufficient (simpler is better)
  • Ignoring class imbalance - use class_weight='balanced'
  • Not tuning $C$ and $\gamma$ together with cross-validation
  • Using default hyperparameters without any search
  • Applying SVM to very large datasets without considering alternatives like SGDClassifier
  • Forgetting to standardize test data with the same scaler as training data
  • Interpreting SVM decision values as probabilities
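Several of the mistakes above (scaling leakage, class imbalance) are avoided by one idiom: put the scaler and the SVM in a single `Pipeline`, so the test set is automatically transformed with the training scaler. A sketch on a synthetic imbalanced dataset (the data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced two-class problem (~90% / 10%)
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
clf.fit(X_tr, y_tr)            # scaler is fit on training data only
acc = clf.score(X_te, y_te)    # test data reuses the training mean/std
print(round(acc, 3))
```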

Interview Quick-Fire

Q: What are support vectors?

A: The data points closest to the decision boundary (on or within the margin). They are the only points that influence the position and orientation of the hyperplane. Removing non-support vectors does not change the model.

Q: Why maximize the margin?

A: A larger margin provides better generalization to unseen data. By maximizing the margin, SVM finds the decision boundary with the greatest separation between classes, reducing overfitting risk.

Q: What does the $C$ parameter control?

A: $C$ balances margin width vs. classification errors. High $C$ penalizes misclassifications heavily (narrow margin, risk overfitting). Low $C$ allows more errors for a wider margin (risk underfitting).

Q: Explain the kernel trick in simple terms.

A: Instead of explicitly mapping data to a higher-dimensional space (expensive), the kernel trick computes the dot product in that space directly using a kernel function, giving non-linear decision boundaries efficiently.

Q: SVM vs. Logistic Regression - when to pick SVM?

A: SVM excels with high-dimensional data, small-to-medium datasets, and when you need non-linear boundaries (via kernels). Logistic Regression is preferred when you need probability outputs and faster training on large datasets.

Q: How does SVM handle multi-class classification?

A: SVM is inherently binary. Multi-class is achieved via One-vs-One (OvO), which builds $\frac{k(k-1)}{2}$ classifiers, or One-vs-Rest (OvR), which builds $k$ classifiers. scikit-learn's SVC trains OvO internally, while LinearSVC uses OvR.

Q: Why is feature scaling important for SVM?

A: SVM relies on distances between data points. Features on larger scales dominate the distance calculation, causing the model to ignore smaller-scale features. Standardization ensures all features contribute equally.

Q: What is the role of $\gamma$ in RBF kernel?

A: $\gamma$ defines how far the influence of a single training example reaches. High $\gamma$ means each point has close-range influence (complex boundary, overfitting). Low $\gamma$ means far-range influence (smooth boundary, underfitting).
