Decision Trees &
Random Forests Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Splitting Criteria

Entropy:
$$H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$
Information Gain:
$$IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$$
Gini Impurity:
$$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$
Gain Ratio:
$$GR(S,A) = \frac{IG(S,A)}{SI(A)} \quad\text{where}\quad SI(A) = -\sum_{v}\frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$$
Variance Reduction (Regression):
$$\Delta\text{Var} = \text{Var}(S) - \sum_{v}\frac{|S_v|}{|S|}\text{Var}(S_v)$$

Entropy ranges [0, log₂(c)]. Gini ranges [0, 1-1/c]. For binary: entropy max = 1, Gini max = 0.5.
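The two impurity measures above are easy to compute directly; a minimal sketch in plain Python (function names are ours, not from any library):

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits: -sum p_i * log2(p_i)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def gini(labels):
    """Gini impurity of a label list: 1 - sum p_i^2."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in probs)

# A 50/50 binary node hits the maxima quoted above:
print(entropy([0, 1]))  # 1.0
print(gini([0, 1]))     # 0.5
```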

Tree Building Algorithm

  1. Start with all training data at the root node
  2. For each feature, evaluate all possible split points
  3. Select the feature + threshold that maximizes impurity reduction
  4. Split data into left child ($x_j \leq t$) and right child ($x_j > t$)
  5. Recurse on each child node
  6. Stop when a criterion is met: pure node, max depth, min samples, or min impurity decrease
  7. Assign leaf value: majority class (classification) or mean (regression)
Best Split:
$$\arg\max_{A,t}\left[\text{Impurity}(S) - \sum_{v}\frac{|S_v|}{|S|}\,\text{Impurity}(S_v)\right]$$

Complexity: O(n · m · log n) where n = samples, m = features.
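Steps 2-4 above, restricted to a single numeric feature, amount to an exhaustive threshold search. A toy sketch (illustration only, not a production implementation):

```python
def best_split(xs, ys):
    """Exhaustively search thresholds on one numeric feature for the split
    that maximizes Gini impurity reduction. Returns (threshold, gain)."""
    def gini(labels):
        n = len(labels)
        return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    parent = gini(ys)
    best_t, best_gain = None, 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # degenerate split, skip
        w = len(left) / len(ys)
        gain = parent - (w * gini(left) + (1 - w) * gini(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Perfectly separable toy data: splitting at x <= 2 removes all impurity.
print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # (2, 0.5)
```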

Pruning

Pre-Pruning (Early Stopping):
  • max_depth: Limit tree depth (e.g., 5-10)
  • min_samples_split: Min samples to justify a split
  • min_samples_leaf: Min samples in each leaf
  • max_leaf_nodes: Max total number of leaves
  • min_impurity_decrease: Min impurity reduction for split
Post-Pruning (Cost-Complexity):
$$R_\alpha(T) = R(T) + \alpha|T|$$

R(T) = misclassification rate, |T| = number of leaves, α found via cross-validation. Use ccp_alpha in scikit-learn.
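The cost-complexity trade-off is easy to see on toy numbers (the subtree error/leaf counts below are hypothetical): as α grows, the minimizer of R_α moves toward smaller trees.

```python
def r_alpha(error, n_leaves, alpha):
    """Cost-complexity score: R_alpha(T) = R(T) + alpha * |T|."""
    return error + alpha * n_leaves

# Hypothetical candidate subtrees: (training error R(T), leaf count |T|).
subtrees = [(0.02, 40), (0.05, 12), (0.10, 4)]

for alpha in (0.0, 0.005, 0.02):
    best = min(subtrees, key=lambda s: r_alpha(s[0], s[1], alpha))
    print(alpha, best)   # larger alpha -> fewer leaves win
```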

Random Forests

Prediction (Classification):
$$\hat{y} = \text{mode}(h_1(x), h_2(x), \ldots, h_B(x))$$
Prediction (Regression):
$$\hat{y} = \frac{1}{B}\sum_{b=1}^{B} h_b(x)$$
Features per Split:
$$m = \lfloor\sqrt{p}\rfloor \text{ (classification)} \qquad m = \lfloor p/3 \rfloor \text{ (regression)}$$
  • Bagging: Bootstrap sample (with replacement) for each tree
  • Random subspace: Random $m$ features at each split
  • OOB Error: ~36.8% of data left out per tree; free validation
  • Feature Importance: MDI (impurity-based) or MDA (permutation)

Each bootstrap sample contains ~63.2% of the unique training points; OOB error gives a built-in estimate of generalization error, comparable to cross-validation, with no extra holdout data.
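The 63.2% figure comes from $1 - (1 - 1/n)^n \to 1 - 1/e$; a quick simulation (plain Python, no libraries) confirms it:

```python
import random

random.seed(0)
n, trials = 1000, 200
unique_frac = 0.0
for _ in range(trials):
    sample = [random.randrange(n) for _ in range(n)]  # bootstrap: draw n with replacement
    unique_frac += len(set(sample)) / n               # fraction of points in-bag
unique_frac /= trials

print(round(unique_frac, 3))  # ~0.632 = 1 - 1/e; the remaining ~36.8% is out-of-bag
```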

Gradient Boosting

Sequential Update:
$$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$$
Pseudo-Residuals:
$$r_{im} = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\bigg|_{F=F_{m-1}}$$
XGBoost Objective:
$$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$
Regularization:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$
  • η (learning_rate): 0.01-0.3; lower = more trees needed
  • Trees are weak learners: depth 3-8 typical
  • Always use early stopping with validation set
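The sequential update above, for squared loss (where the pseudo-residuals reduce to ordinary residuals $y - F(x)$), can be sketched with 1-D regression stumps. A toy implementation, not a library API:

```python
def fit_stump(xs, rs):
    """Fit the 1-D regression stump minimizing squared error on residuals rs."""
    best = None
    for t in sorted(set(xs))[:-1]:  # last value would leave the right leaf empty
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, eta=0.3):
    """F_m = F_{m-1} + eta * h_m, each h_m a stump fit to current residuals."""
    f0 = sum(ys) / len(ys)          # initial constant model
    stumps = []
    def predict(x):
        return f0 + sum(eta * s(x) for s in stumps)
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs, ys = [1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0]
model = gradient_boost(xs, ys)
print([round(model(x), 2) for x in xs])  # close to [1.0, 1.0, 3.0, 3.0]
```

With eta = 0.3, each round shrinks the residuals by a factor of 0.7, illustrating why a lower learning rate needs more rounds.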

Common Mistakes

  • Growing trees without pruning or depth limits (severe overfitting)
  • Ignoring class imbalance - trees bias toward majority class without balancing
  • Not tuning hyperparameters - defaults rarely give best performance for boosting
  • Using trees for extrapolation - trees cannot predict outside the training data range
  • Relying on MDI feature importance for high-cardinality features (biased!)
  • Setting learning rate too high in boosting - causes overfitting and instability
  • Forgetting to set a random seed - tree ensembles are stochastic
  • Using a single tree when an ensemble would be more appropriate (high variance)

Key Hyperparameters

  • max_depth: Tree depth limit. Single tree: 4-10. Boosting: 3-8. RF: None (fully grown)
  • min_samples_split: Min samples to split a node. Higher = less overfitting. Try 2-20
  • n_estimators: Number of trees. RF: 100-500. Boosting: use early stopping
  • learning_rate (η): Boosting step size. 0.01-0.3. Lower pairs with more trees
  • max_features: Features per split. RF: sqrt(p) or p/3. Boosting: 0.5-0.8
  • subsample: Fraction of data per tree (boosting). 0.5-0.9 reduces overfitting
  • min_child_weight: XGBoost: min sum of instance weights in leaf. Higher = conservative
  • reg_alpha / reg_lambda: L1/L2 regularization on leaf weights (XGBoost)

Start with defaults, then tune learning_rate + n_estimators first, then max_depth, then regularization.
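A hedged starting configuration following the ranges above (assumes the xgboost package is installed; these values are a reasonable starting point, not a universal recipe):

```python
# Assumes xgboost; all identifiers below are real XGBoost parameters.
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=1000,         # generous cap; early stopping picks the real count
    learning_rate=0.05,        # low eta pairs with more trees
    max_depth=5,               # weak learners: depth 3-8
    subsample=0.8,             # row fraction per tree
    colsample_bytree=0.8,      # feature fraction per tree (max_features analogue)
    min_child_weight=5,        # higher = more conservative splits
    reg_alpha=0.0,             # L1 on leaf weights
    reg_lambda=1.0,            # L2 on leaf weights
    early_stopping_rounds=50,  # requires eval_set in fit()
)
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```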

Interview Questions

Q: What is the difference between Gini impurity and entropy?

A: Both measure node impurity. Entropy uses logarithms and ranges [0, log₂(c)]. Gini uses squared probabilities and ranges [0, 1-1/c]. In practice, they produce nearly identical trees. Gini is slightly faster (no log computation) and is the scikit-learn default.

Q: Why do single decision trees have high variance?

A: The greedy splitting algorithm is sensitive to the training data. Small changes in the data can lead to completely different split decisions at the root, cascading into entirely different tree structures. Ensembles fix this by averaging many trees.

Q: How does Random Forest reduce overfitting compared to a single tree?

A: Random Forest uses two sources of randomness: (1) bootstrap sampling gives each tree different training data, and (2) random feature selection at each split decorrelates the trees. Averaging many decorrelated trees reduces variance without increasing bias.

Q: What is the difference between bagging and boosting?

A: Bagging trains trees independently in parallel on bootstrap samples and averages them (reduces variance). Boosting trains trees sequentially, where each tree corrects the errors of the previous ones (reduces bias and variance). Boosting typically achieves higher accuracy but is more prone to overfitting.

Q: Why can't decision trees extrapolate?

A: Trees partition the feature space into rectangles and assign a constant prediction to each region. For values outside the training range, the tree assigns the prediction of the nearest known region. It cannot learn trends that extend beyond the data, unlike linear models.
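A hand-built two-leaf "tree" (leaf values set to mimic a fit on y = 2x over x in 0..3) makes the flat extrapolation concrete:

```python
def tree_predict(x):
    """Hypothetical two-leaf tree fit on y = 2x for x in {0, 1, 2, 3};
    each leaf stores the mean of y in its region."""
    if x <= 1.5:
        return 1.0   # mean of y at x = 0, 1
    return 5.0       # mean of y at x = 2, 3

# Far outside the training range, the prediction is just the nearest leaf value:
print(tree_predict(100))  # 5.0 -- the trend y = 2x is not extended (y would be 200)
```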

Q: What makes XGBoost faster and better than vanilla gradient boosting?

A: XGBoost uses second-order Taylor expansion for better split finding, built-in L1/L2 regularization to prevent overfitting, column subsampling for diversity, sparsity-aware splitting for missing values, and cache-optimized parallel computation for speed.