Gradient Boosting Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Additive Model:
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x})$$
Pseudo-Residuals:
$$r_{im} = -\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} \Bigg|_{F=F_{m-1}}$$
XGBoost Objective:
$$\text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{M} \Omega(f_k)$$
Regularization Term:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
Optimal Leaf Weight:
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
Split Gain:
$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$
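The leaf-weight and split-gain formulas can be checked numerically. The gradients and Hessians below are made-up example values (as for squared-error loss, where each $h_i = 1$), not taken from a real model:

```python
# Numeric check of the optimal leaf weight w* and the split gain.
lam, gamma = 1.0, 0.0  # lambda (L2 strength), gamma (min split loss)

g_left,  h_left  = [-1.5, -0.8], [1.0, 1.0]   # samples in the left child
g_right, h_right = [0.9, 1.2],   [1.0, 1.0]   # samples in the right child

def leaf_weight(gs, hs, lam):
    # w* = -sum(g) / (sum(h) + lambda)
    return -sum(gs) / (sum(hs) + lam)

def score(gs, hs, lam):
    # G^2 / (H + lambda), the quality term appearing in the gain formula
    return sum(gs) ** 2 / (sum(hs) + lam)

GL, HL = sum(g_left), sum(h_left)
GR, HR = sum(g_right), sum(h_right)

gain = 0.5 * (score(g_left, h_left, lam)
              + score(g_right, h_right, lam)
              - (GL + GR) ** 2 / (HL + HR + lam)) - gamma

print(leaf_weight(g_left, h_left, lam))  # optimal weight of the left leaf
print(gain)                              # positive => the split is worth making
```

A positive gain means splitting beats keeping the parent leaf; $\gamma$ raises the bar a split must clear.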

Algorithm Steps

  1. Initialize $F_0(\mathbf{x}) = \arg\min_c \sum_{i=1}^{n} L(y_i, c)$ (e.g., mean for MSE, log-odds for log-loss)
  2. For $m = 1$ to $M$:
    1. Compute pseudo-residuals: $r_{im} = -\partial L / \partial F(\mathbf{x}_i)$
    2. Fit a decision tree $h_m$ to targets $\{r_{im}\}$
    3. Find optimal leaf values using gradient and Hessian
    4. Update: $F_m = F_{m-1} + \eta \cdot h_m$
  3. Output: $F_M(\mathbf{x}) = F_0 + \sum_{m=1}^{M} \eta \cdot h_m(\mathbf{x})$

For classification, apply sigmoid ($\sigma$) or softmax to $F_M(\mathbf{x})$ to get probabilities.
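The algorithm above can be sketched from scratch. This is an illustrative toy, not a production implementation: squared-error loss (so pseudo-residuals are simply $y_i - F(\mathbf{x}_i)$), depth-1 "trees" (stumps) on a single feature, and exhaustive threshold search:

```python
# Minimal from-scratch gradient boosting: MSE loss, one-feature stumps.

def fit_stump(x, r):
    """Fit a one-split stump to pseudo-residuals r: pick the threshold
    minimizing squared error, predict the mean residual in each half."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - (lm if xi <= t else rm)) ** 2
                  for xi, ri in zip(x, r))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=50, eta=0.1):
    f0 = sum(y) / len(y)                  # step 1: init with the mean (MSE)
    pred, trees = [f0] * len(y), []
    for _ in range(n_trees):              # step 2: boosting rounds
        r = [yi - pi for yi, pi in zip(y, pred)]   # 2a: pseudo-residuals
        h = fit_stump(x, r)               # 2b: fit a tree to the residuals
        trees.append(h)
        pred = [pi + eta * h(xi) for pi, xi in zip(pred, x)]  # 2d: update F
    return lambda xi: f0 + sum(eta * h(xi) for h in trees)    # step 3: F_M

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2]
model = boost(x, y)
print(model(3.0))  # prediction at x = 3.0
```

Each round fits the current residuals and nudges the ensemble by a fraction $\eta$ of the new tree, so training error shrinks gradually rather than in one greedy jump.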

Hyperparameters

n_estimators:
Number of boosting rounds (trees). Use early stopping to find optimal value. Typical: 100-5000.
learning_rate ($\eta$):
Shrinkage per tree. Lower = better generalization but more trees needed. Typical: 0.01-0.3.
max_depth:
Maximum tree depth. Controls interaction order. Typical: 3-8. Default XGBoost: 6.
subsample:
Fraction of rows per tree. Reduces variance. Typical: 0.5-1.0.
colsample_bytree:
Fraction of features per tree. Like Random Forest's feature bagging. Typical: 0.5-1.0.

Start with learning_rate=0.1, max_depth=4, subsample=0.8 and tune from there.
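One way to encode that starting point is an XGBoost-style parameter dict (the keys match the scikit-learn-style `xgboost.XGBRegressor` API; the `colsample_bytree` value of 0.8 is an assumption within the typical 0.5-1.0 range, not from the table above):

```python
# Suggested starting hyperparameters, expressed as keyword arguments
# for an XGBoost-style estimator, e.g. xgboost.XGBRegressor(**params).
params = {
    "n_estimators": 1000,     # set high; rely on early stopping to cut it off
    "learning_rate": 0.1,     # lower it if you can afford more trees
    "max_depth": 4,           # low interaction order; raise cautiously toward 6-8
    "subsample": 0.8,         # row sampling per tree reduces variance
    "colsample_bytree": 0.8,  # feature sampling, as in Random Forest bagging
}
print(params["learning_rate"], params["max_depth"])
```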

Regularization

L1 ($\alpha$ / reg_alpha):
$$\alpha \sum_{j=1}^{T} |w_j|$$
L2 ($\lambda$ / reg_lambda):
$$\frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
$\gamma$ (min_split_loss):
Minimum loss reduction for a split. Higher = more conservative trees.
Early Stopping:
Stop training when validation loss hasn't improved for $k$ rounds. Most important regularizer.

Early stopping + low learning rate is the single most effective regularization strategy for gradient boosting.
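The early-stopping rule is simple enough to sketch directly. Here `curve` stands in for per-round validation losses from any boosting trainer; the `patience` parameter name is illustrative (libraries call it e.g. `early_stopping_rounds`):

```python
# Stop when validation loss has not improved for `patience` rounds,
# and return the round with the best loss seen so far.
def best_round(val_losses, patience=5):
    best_i, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_i, best = i, loss
        elif i - best_i >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_i

# Toy validation curve: improves, bottoms out, then overfits.
curve = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.61]
print(best_round(curve, patience=3))  # -> 3, the round with loss 0.55
```

In practice the trainer truncates the ensemble at the best round, so `n_estimators` only needs to be "large enough".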

XGBoost vs LightGBM vs CatBoost

XGBoost:
Level-wise growth. Second-order optimization. Best all-rounder. Handles missing values.
LightGBM:
Leaf-wise growth. Histogram splitting. Fastest training. GOSS + EFB for speed.
CatBoost:
Symmetric trees. Native categorical features. Ordered boosting. Least tuning needed.

All three produce similar accuracy. Choose LightGBM for speed, CatBoost for categorical data, XGBoost for general reliability.

Feature Importance

Gain:
Average loss reduction when the feature is used for splitting. Most informative built-in metric.
Cover:
Average number of samples affected by splits on that feature.
Weight (Frequency):
Number of times feature is used across all trees. Simple but can be misleading.
SHAP Values:
$$f(\mathbf{x}) = \phi_0 + \sum_{j=1}^{d} \phi_j(\mathbf{x})$$

SHAP values are the gold standard -- they provide consistent, local+global, directional explanations.
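The SHAP decomposition above is additive: the base value $\phi_0$ plus the per-feature attributions $\phi_j$ recover the model's raw output exactly. The values below are hypothetical, not from a fitted model; in practice a library such as `shap` (e.g. its `TreeExplainer`) computes them for tree ensembles:

```python
# Additivity check for the SHAP decomposition f(x) = phi_0 + sum_j phi_j(x).
phi_0 = 0.25                                          # base value: mean model output
phi = {"age": 0.40, "income": -0.10, "tenure": 0.05}  # hypothetical attributions

prediction = phi_0 + sum(phi.values())
print(prediction)  # -> 0.6, the model's raw output for this row
```

This additivity is what makes SHAP explanations consistent between local (per-row) and global (aggregated) views.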

Common Pitfalls

  • Training without early stopping -- leads to severe overfitting
  • Using too high a learning rate (> 0.3) with many trees
  • Not using a validation set for hyperparameter tuning
  • Data leakage: using future information or target-correlated features
  • Ignoring class imbalance -- use scale_pos_weight or resampling
  • Scaling features unnecessarily -- tree-based models are invariant to monotonic transformations
  • Over-tuning on a single train/test split instead of cross-validation
  • Not setting a random seed for reproducibility

Interview Quick-Fire

Q: What is gradient boosting?

A: An ensemble method that builds models sequentially, where each new model corrects errors of the previous ensemble by fitting to the negative gradient of the loss function (pseudo-residuals).

Q: How does boosting differ from bagging?

A: Boosting builds models sequentially (each depends on the previous), focusing on hard examples. Bagging (e.g., Random Forest) builds models independently in parallel and averages them.

Q: What is the role of the learning rate?

A: It shrinks the contribution of each tree, requiring more trees but improving generalization. Lower learning rate + more trees = better performance but slower training.

Q: Why does XGBoost use second-order gradients?

A: The Hessian (second derivative) provides curvature information, enabling more accurate split decisions and optimal leaf weights -- similar to Newton's method vs. gradient descent.

Q: When should you choose gradient boosting?

A: For structured/tabular data with mixed feature types. It dominates when data is not images, text, or audio. Especially strong with moderate-sized datasets (1K-10M rows).
