Gradient Boosting Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Additive Model:
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x})$$
Pseudo-Residuals:
$$r_{im} = -\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} \Bigg|_{F=F_{m-1}}$$
XGBoost Objective:
$$\text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{M} \Omega(f_k)$$
Regularization Term:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
Optimal Leaf Weight:
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
Split Gain:
$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$
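The leaf-weight and split-gain formulas can be checked numerically. The gradients and Hessians below are made-up example values (as for squared-error loss, where each $h_i = 1$), not taken from a real model:

```python
# Numeric check of the optimal leaf weight w* and the split gain.
lam, gamma = 1.0, 0.0  # lambda (L2 strength), gamma (min split loss)

g_left,  h_left  = [-1.5, -0.8], [1.0, 1.0]   # samples in the left child
g_right, h_right = [0.9, 1.2],   [1.0, 1.0]   # samples in the right child

def leaf_weight(gs, hs, lam):
    # w* = -sum(g) / (sum(h) + lambda)
    return -sum(gs) / (sum(hs) + lam)

def score(gs, hs, lam):
    # G^2 / (H + lambda), the quality term appearing in the gain formula
    return sum(gs) ** 2 / (sum(hs) + lam)

GL, HL = sum(g_left), sum(h_left)
GR, HR = sum(g_right), sum(h_right)

gain = 0.5 * (score(g_left, h_left, lam)
              + score(g_right, h_right, lam)
              - (GL + GR) ** 2 / (HL + HR + lam)) - gamma

print(leaf_weight(g_left, h_left, lam))  # optimal weight of the left leaf
print(gain)                              # positive => the split is worth making
```

A positive gain means splitting beats keeping the parent leaf; $\gamma$ raises the bar a split must clear.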

Algorithm Steps

  1. Initialize $F_0(\mathbf{x}) = \arg\min_c \sum_{i=1}^{n} L(y_i, c)$ (e.g., mean for MSE, log-odds for log-loss)
  2. For $m = 1$ to $M$:
    1. Compute pseudo-residuals: $r_{im} = -\partial L / \partial F(\mathbf{x}_i)$
    2. Fit a decision tree $h_m$ to targets $\{r_{im}\}$
    3. Find optimal leaf values using gradient and Hessian
    4. Update: $F_m = F_{m-1} + \eta \cdot h_m$
  3. Output: $F_M(\mathbf{x}) = F_0 + \sum_{m=1}^{M} \eta \cdot h_m(\mathbf{x})$

For classification, apply sigmoid ($\sigma$) or softmax to $F_M(\mathbf{x})$ to get probabilities.
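The algorithm above can be sketched from scratch. This is an illustrative toy, not a production implementation: squared-error loss (so pseudo-residuals are simply $y_i - F(\mathbf{x}_i)$), depth-1 "trees" (stumps) on a single feature, and exhaustive threshold search:

```python
# Minimal from-scratch gradient boosting: MSE loss, one-feature stumps.

def fit_stump(x, r):
    """Fit a one-split stump to pseudo-residuals r: pick the threshold
    minimizing squared error, predict the mean residual in each half."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - (lm if xi <= t else rm)) ** 2
                  for xi, ri in zip(x, r))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=50, eta=0.1):
    f0 = sum(y) / len(y)                  # step 1: init with the mean (MSE)
    pred, trees = [f0] * len(y), []
    for _ in range(n_trees):              # step 2: boosting rounds
        r = [yi - pi for yi, pi in zip(y, pred)]   # 2a: pseudo-residuals
        h = fit_stump(x, r)               # 2b: fit a tree to the residuals
        trees.append(h)
        pred = [pi + eta * h(xi) for pi, xi in zip(pred, x)]  # 2d: update F
    return lambda xi: f0 + sum(eta * h(xi) for h in trees)    # step 3: F_M

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2]
model = boost(x, y)
print(model(3.0))  # prediction at x = 3.0
```

Each round fits the current residuals and nudges the ensemble by a fraction $\eta$ of the new tree, so training error shrinks gradually rather than in one greedy jump.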

Hyperparameters

n_estimators:
Number of boosting rounds (trees). Use early stopping to find optimal value. Typical: 100-5000.
learning_rate ($\eta$):
Shrinkage per tree. Lower = better generalization but more trees needed. Typical: 0.01-0.3.
max_depth:
Maximum tree depth. Controls interaction order. Typical: 3-8. Default XGBoost: 6.
subsample:
Fraction of rows per tree. Reduces variance. Typical: 0.5-1.0.
colsample_bytree:
Fraction of features per tree. Like Random Forest's feature bagging. Typical: 0.5-1.0.

Start with learning_rate=0.1, max_depth=4, subsample=0.8 and tune from there.
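One way to encode that starting point is an XGBoost-style parameter dict (the keys match the scikit-learn-style `xgboost.XGBRegressor` API; the `colsample_bytree` value of 0.8 is an assumption within the typical 0.5-1.0 range, not from the table above):

```python
# Suggested starting hyperparameters, expressed as keyword arguments
# for an XGBoost-style estimator, e.g. xgboost.XGBRegressor(**params).
params = {
    "n_estimators": 1000,     # set high; rely on early stopping to cut it off
    "learning_rate": 0.1,     # lower it if you can afford more trees
    "max_depth": 4,           # low interaction order; raise cautiously toward 6-8
    "subsample": 0.8,         # row sampling per tree reduces variance
    "colsample_bytree": 0.8,  # feature sampling, as in Random Forest bagging
}
print(params["learning_rate"], params["max_depth"])
```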

Regularization

L1 ($\alpha$ / reg_alpha):
$$\alpha \sum_{j=1}^{T} |w_j|$$
L2 ($\lambda$ / reg_lambda):
$$\frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
$\gamma$ (min_split_loss):
Minimum loss reduction for a split. Higher = more conservative trees.
Early Stopping:
Stop training when validation loss hasn't improved for $k$ rounds. Most important regularizer.

Early stopping + low learning rate is the single most effective regularization strategy for gradient boosting.
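The early-stopping rule is simple enough to sketch directly. Here `curve` stands in for per-round validation losses from any boosting trainer; the `patience` parameter name is illustrative (libraries call it e.g. `early_stopping_rounds`):

```python
# Stop when validation loss has not improved for `patience` rounds,
# and return the round with the best loss seen so far.
def best_round(val_losses, patience=5):
    best_i, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_i, best = i, loss
        elif i - best_i >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_i

# Toy validation curve: improves, bottoms out, then overfits.
curve = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.61]
print(best_round(curve, patience=3))  # -> 3, the round with loss 0.55
```

In practice the trainer truncates the ensemble at the best round, so `n_estimators` only needs to be "large enough".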

XGBoost vs LightGBM vs CatBoost

XGBoost:
Level-wise growth. Second-order optimization. Best all-rounder. Handles missing values.
LightGBM:
Leaf-wise growth. Histogram splitting. Fastest training. GOSS + EFB for speed.
CatBoost:
Symmetric trees. Native categorical features. Ordered boosting. Least tuning needed.

All three produce similar accuracy. Choose LightGBM for speed, CatBoost for categorical data, XGBoost for general reliability.

Feature Importance

Gain:
Average loss reduction when the feature is used for splitting. Most informative built-in metric.
Cover:
Average number of samples affected by splits on that feature.
Weight (Frequency):
Number of times feature is used across all trees. Simple but can be misleading.
SHAP Values:
$$f(\mathbf{x}) = \phi_0 + \sum_{j=1}^{d} \phi_j(\mathbf{x})$$

SHAP values are the gold standard -- they provide consistent, local+global, directional explanations.
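The SHAP decomposition above is additive: the base value $\phi_0$ plus the per-feature attributions $\phi_j$ recover the model's raw output exactly. The values below are hypothetical, not from a fitted model; in practice a library such as `shap` (e.g. its `TreeExplainer`) computes them for tree ensembles:

```python
# Additivity check for the SHAP decomposition f(x) = phi_0 + sum_j phi_j(x).
phi_0 = 0.25                                          # base value: mean model output
phi = {"age": 0.40, "income": -0.10, "tenure": 0.05}  # hypothetical attributions

prediction = phi_0 + sum(phi.values())
print(prediction)  # -> 0.6, the model's raw output for this row
```

This additivity is what makes SHAP explanations consistent between local (per-row) and global (aggregated) views.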

Common Pitfalls

  • Training without early stopping -- leads to severe overfitting
  • Using too high a learning rate (> 0.3) with many trees
  • Not using a validation set for hyperparameter tuning
  • Data leakage: using future information or target-correlated features
  • Ignoring class imbalance -- use scale_pos_weight or resampling
  • Scaling features unnecessarily -- tree-based models are invariant to monotonic transformations
  • Over-tuning on a single train/test split instead of cross-validation
  • Not setting a random seed for reproducibility

Interview Quick-Fire

Q: What is gradient boosting?

A: An ensemble method that builds models sequentially, where each new model corrects errors of the previous ensemble by fitting to the negative gradient of the loss function (pseudo-residuals).

Q: How does boosting differ from bagging?

A: Boosting builds models sequentially (each depends on the previous), focusing on hard examples. Bagging (e.g., Random Forest) builds models independently in parallel and averages them.

Q: What is the role of the learning rate?

A: It shrinks the contribution of each tree, requiring more trees but improving generalization. Lower learning rate + more trees = better performance but slower training.

Q: Why does XGBoost use second-order gradients?

A: The Hessian (second derivative) provides curvature information, enabling more accurate split decisions and optimal leaf weights -- similar to Newton's method vs. gradient descent.

Q: When should you choose gradient boosting?

A: For structured/tabular data with mixed feature types. It dominates when data is not images, text, or audio. Especially strong with moderate-sized datasets (1K-10M rows).
