
SVM Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Hyperplane:
$$\mathbf{w} \cdot \mathbf{x} + b = 0$$
Decision Function:
$$f(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \mathbf{x} + b)$$
Margin Width:
$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$
Optimization Objective:
$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$$
Hinge Loss:
$$L(y, f(x)) = \max(0, 1 - y \cdot f(x))$$
Distance to Hyperplane:
$$d = \frac{|\mathbf{w} \cdot \mathbf{x} + b|}{\|\mathbf{w}\|}$$
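The margin-width and distance formulas can be checked numerically. A minimal sketch with NumPy, assuming a toy hyperplane $\mathbf{w} = (3, 4)$, $b = -5$ (chosen only for easy arithmetic):

```python
import numpy as np

# Toy hyperplane: w·x + b = 0 with w = (3, 4), b = -5
w = np.array([3.0, 4.0])
b = -5.0

x = np.array([2.0, 1.0])

# Distance to hyperplane: |w·x + b| / ||w||
d = abs(w @ x + b) / np.linalg.norm(w)   # |6 + 4 - 5| / 5 = 1.0

# Margin width: 2 / ||w||
margin = 2 / np.linalg.norm(w)           # 2 / 5 = 0.4

print(d, margin)
```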

Hard Margin SVM

Objective:
$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$$
Constraint:
$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad \forall i$$
Dual Form:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{s.t.} \quad \alpha_i \geq 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Requires perfectly linearly separable data. No misclassification is allowed. Use only when data has zero overlap between classes.

Soft Margin SVM

Objective:
$$\min_{\mathbf{w}, b, \xi} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
Constraints:
$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Slack Variable $\xi_i$:
$\xi_i = 0$: on or beyond the margin, correctly classified; $0 < \xi_i < 1$: inside the margin but still correctly classified; $\xi_i \geq 1$: misclassified

$C$ controls the trade-off between maximizing the margin and minimizing classification errors. Large $C$ = less tolerance for violations.
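The effect of $C$ shows up directly in the number of support vectors. A sketch using scikit-learn's `SVC` on two synthetic Gaussian blobs (the data and $C$ values are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

# Low C tolerates margin violations -> wider margin, more support vectors.
loose = SVC(kernel="linear", C=0.01).fit(X, y)
strict = SVC(kernel="linear", C=100).fit(X, y)

print(loose.n_support_.sum(), strict.n_support_.sum())
```

The loose model keeps many points inside its wide margin (all of them become support vectors), while the strict model relies on only the few points nearest the boundary.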

Kernel Functions

Linear:
$$K(\mathbf{x}, \mathbf{x'}) = \mathbf{x} \cdot \mathbf{x'}$$
Polynomial:
$$K(\mathbf{x}, \mathbf{x'}) = (\mathbf{x} \cdot \mathbf{x'} + r)^d$$
RBF (Gaussian):
$$K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x'}\|^2\right)$$
Sigmoid:
$$K(\mathbf{x}, \mathbf{x'}) = \tanh(\alpha \, \mathbf{x} \cdot \mathbf{x'} + c)$$

RBF is the most popular kernel. $\gamma = \frac{1}{2\sigma^2}$ controls the influence radius of each support vector.

The Kernel Trick

Core Idea:
$$K(\mathbf{x}, \mathbf{x'}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{x'})$$
Decision (Dual):
$$f(\mathbf{x}) = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$$
  1. Map data to higher-dimensional space via $\phi(\mathbf{x})$ where it becomes linearly separable
  2. Compute $K(\mathbf{x}, \mathbf{x'})$ directly without ever calculating $\phi$ explicitly
  3. Massive computational savings: avoid working in potentially infinite-dimensional space

Mercer's theorem: $K$ is a valid kernel if and only if the kernel (Gram) matrix it produces is symmetric positive semi-definite for every finite set of inputs.
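The identity $K(\mathbf{x}, \mathbf{x'}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{x'})$ can be verified by hand for a small case. For the degree-2 polynomial kernel $(\mathbf{x} \cdot \mathbf{x'} + 1)^2$ on 2-D inputs, the explicit feature map is 6-dimensional; a sketch checking both sides agree:

```python
import numpy as np

def phi(v):
    # Explicit feature map for K(x, x') = (x·x' + 1)^2 with 2-D inputs
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])

lhs = (x @ xp + 1) ** 2   # kernel, computed in the original 2-D space
rhs = phi(x) @ phi(xp)    # same value via the 6-D explicit map

print(lhs, rhs)
```

The kernel side needs one 2-D dot product; the explicit side needs the 6-D map. For an RBF kernel the map is infinite-dimensional, so only the kernel side is computable at all.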

Hyperparameters

$C$ (Regularization):
High $C$ = narrow margin, fewer violations (risk overfitting). Low $C$ = wide margin, more violations (risk underfitting).
$\gamma$ (RBF width):
High $\gamma$ = tight decision boundary around points (overfit). Low $\gamma$ = smooth boundary (underfit).
$d$ (Polynomial degree):
Higher $d$ = more complex boundary. Common choices: $d = 2$ or $d = 3$.
Kernel Choice:
Start with linear. If underfitting, try RBF. Use polynomial for known interaction features.

Always tune $C$ and $\gamma$ together using grid search or randomized search with cross-validation.
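A sketch of such a joint search with scikit-learn's `GridSearchCV`, using the iris dataset as a stand-in and a conventional log-spaced grid (the specific grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling goes inside the pipeline so each CV fold scales on its own training split
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],        # log-spaced grids are conventional
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```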

SVM for Regression (SVR)

Objective:
$$\min_{\mathbf{w}, b, \xi, \xi^*} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$$
Constraints:
$$y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \leq \varepsilon + \xi_i, \quad (\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \leq \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \geq 0$$
ε-insensitive loss:
$$L_\varepsilon = \max(0, |y - f(x)| - \varepsilon)$$

The $\varepsilon$-tube defines a margin of tolerance where errors are not penalized. Support vectors lie on or outside the tube boundary.
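The tube's width directly controls model sparsity: widening $\varepsilon$ swallows more points and leaves fewer support vectors. A sketch with scikit-learn's `SVR` on a noisy sine curve (data and parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(80)   # noise std 0.1

# A tube wider than the noise contains most points -> few support vectors;
# a near-zero tube leaves almost every point on or outside it.
wide = SVR(kernel="rbf", C=10, epsilon=0.5).fit(X, y)
narrow = SVR(kernel="rbf", C=10, epsilon=0.01).fit(X, y)

print(len(wide.support_), len(narrow.support_))
```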

Advantages & Limitations

Advantages:

  • Effective in high-dimensional spaces (even when $d > n$)
  • Memory efficient: only stores support vectors
  • Versatile: different kernels for different problems
  • Robust to overfitting in high dimensions with proper $C$
  • Clear geometric intuition (maximum margin)

Limitations:

  • Not suitable for very large datasets ($O(n^2)$ to $O(n^3)$ training)
  • Sensitive to feature scaling - always standardize first
  • No direct probability estimates (use Platt scaling)
  • Poor performance with noisy data and overlapping classes
  • Kernel and hyperparameter choice requires careful tuning

Common Mistakes

  • Not scaling features before training (SVM is distance-based)
  • Using RBF kernel when linear is sufficient (simpler is better)
  • Ignoring class imbalance - use class_weight='balanced'
  • Not tuning $C$ and $\gamma$ together with cross-validation
  • Using default hyperparameters without any search
  • Applying SVM to very large datasets without considering alternatives like SGDClassifier
  • Forgetting to standardize test data with the same scaler as training data
  • Interpreting SVM decision values as probabilities
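Several of the mistakes above (scaling leakage, class imbalance) are avoided by one idiom: put the scaler and the SVM in a single `Pipeline`, so the test set is automatically transformed with the training scaler. A sketch on a synthetic imbalanced dataset (the data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced two-class problem (~90% / 10%)
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
clf.fit(X_tr, y_tr)            # scaler is fit on training data only
acc = clf.score(X_te, y_te)    # test data reuses the training mean/std
print(round(acc, 3))
```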

Interview Quick-Fire

Q: What are support vectors?

A: The data points closest to the decision boundary (on or within the margin). They are the only points that influence the position and orientation of the hyperplane. Removing non-support vectors does not change the model.

Q: Why maximize the margin?

A: A larger margin provides better generalization to unseen data. By maximizing the margin, SVM finds the decision boundary with the greatest separation between classes, reducing overfitting risk.

Q: What does the $C$ parameter control?

A: $C$ balances margin width vs. classification errors. High $C$ penalizes misclassifications heavily (narrow margin, risk overfitting). Low $C$ allows more errors for a wider margin (risk underfitting).

Q: Explain the kernel trick in simple terms.

A: Instead of explicitly mapping data to a higher-dimensional space (expensive), the kernel trick computes the dot product in that space directly using a kernel function, giving non-linear decision boundaries efficiently.

Q: SVM vs. Logistic Regression - when to pick SVM?

A: SVM excels with high-dimensional data, small-to-medium datasets, and when you need non-linear boundaries (via kernels). Logistic Regression is preferred when you need probability outputs and faster training on large datasets.

Q: How does SVM handle multi-class classification?

A: SVM is inherently binary. Multi-class is achieved via One-vs-One (OvO), which builds $\frac{k(k-1)}{2}$ classifiers, or One-vs-Rest (OvR), which builds $k$ classifiers. scikit-learn's SVC trains OvO internally, while LinearSVC uses OvR.

Q: Why is feature scaling important for SVM?

A: SVM relies on distances between data points. Features on larger scales dominate the distance calculation, causing the model to ignore smaller-scale features. Standardization ensures all features contribute equally.

Q: What is the role of $\gamma$ in RBF kernel?

A: $\gamma$ defines how far the influence of a single training example reaches. High $\gamma$ means each point has close-range influence (complex boundary, overfitting). Low $\gamma$ means far-range influence (smooth boundary, underfitting).
