Decision Trees &
Random Forests Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Splitting Criteria

Entropy:
$$H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$
Information Gain:
$$IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$$
Gini Impurity:
$$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$
Gain Ratio:
$$GR(S,A) = \frac{IG(S,A)}{SI(A)} \quad\text{where}\quad SI(A) = -\sum_{v}\frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$$
Variance Reduction (Regression):
$$\Delta\text{Var} = \text{Var}(S) - \sum_{v}\frac{|S_v|}{|S|}\text{Var}(S_v)$$

Entropy ranges [0, log₂(c)]. Gini ranges [0, 1-1/c]. For binary: entropy max = 1, Gini max = 0.5.
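The two impurity measures above are easy to compute directly; a minimal sketch in plain Python (function names are ours, not from any library):

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits: -sum p_i * log2(p_i)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def gini(labels):
    """Gini impurity of a label list: 1 - sum p_i^2."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in probs)

# A 50/50 binary node hits the maxima quoted above:
print(entropy([0, 1]))  # 1.0
print(gini([0, 1]))     # 0.5
```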

Tree Building Algorithm

  1. Start with all training data at the root node
  2. For each feature, evaluate all possible split points
  3. Select the feature + threshold that maximizes impurity reduction
  4. Split data into left child ($x_j \leq t$) and right child ($x_j > t$)
  5. Recurse on each child node
  6. Stop when a criterion is met: pure node, max depth, min samples, or min impurity decrease
  7. Assign leaf value: majority class (classification) or mean (regression)
Best Split:
$$\arg\max_{A,t}\left[\text{Impurity}(S) - \sum_{v}\frac{|S_v|}{|S|}\,\text{Impurity}(S_v)\right]$$

Complexity: O(n · m · log n) where n = samples, m = features.
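Steps 2-4 above, restricted to a single numeric feature, amount to an exhaustive threshold search. A toy sketch (illustration only, not a production implementation):

```python
def best_split(xs, ys):
    """Exhaustively search thresholds on one numeric feature for the split
    that maximizes Gini impurity reduction. Returns (threshold, gain)."""
    def gini(labels):
        n = len(labels)
        return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    parent = gini(ys)
    best_t, best_gain = None, 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # degenerate split, skip
        w = len(left) / len(ys)
        gain = parent - (w * gini(left) + (1 - w) * gini(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Perfectly separable toy data: splitting at x <= 2 removes all impurity.
print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # (2, 0.5)
```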

Pruning

Pre-Pruning (Early Stopping):
  • max_depth: Limit tree depth (e.g., 5-10)
  • min_samples_split: Min samples to justify a split
  • min_samples_leaf: Min samples in each leaf
  • max_leaf_nodes: Max total number of leaves
  • min_impurity_decrease: Min impurity reduction for split
Post-Pruning (Cost-Complexity):
$$R_\alpha(T) = R(T) + \alpha|T|$$

R(T) = misclassification rate, |T| = number of leaves, α found via cross-validation. Use ccp_alpha in scikit-learn.
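The cost-complexity trade-off is easy to see on toy numbers (the subtree error/leaf counts below are hypothetical): as α grows, the minimizer of R_α moves toward smaller trees.

```python
def r_alpha(error, n_leaves, alpha):
    """Cost-complexity score: R_alpha(T) = R(T) + alpha * |T|."""
    return error + alpha * n_leaves

# Hypothetical candidate subtrees: (training error R(T), leaf count |T|).
subtrees = [(0.02, 40), (0.05, 12), (0.10, 4)]

for alpha in (0.0, 0.005, 0.02):
    best = min(subtrees, key=lambda s: r_alpha(s[0], s[1], alpha))
    print(alpha, best)   # larger alpha -> fewer leaves win
```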

Random Forests

Prediction (Classification):
$$\hat{y} = \text{mode}(h_1(x), h_2(x), \ldots, h_B(x))$$
Prediction (Regression):
$$\hat{y} = \frac{1}{B}\sum_{b=1}^{B} h_b(x)$$
Features per Split:
$$m = \lfloor\sqrt{p}\rfloor \text{ (classification)} \qquad m = \lfloor p/3 \rfloor \text{ (regression)}$$
  • Bagging: Bootstrap sample (with replacement) for each tree
  • Random subspace: Random $m$ features at each split
  • OOB Error: ~36.8% of data left out per tree; free validation
  • Feature Importance: MDI (impurity-based) or MDA (permutation)

Each bootstrap sample contains ~63.2% of the unique training points; OOB error gives a built-in estimate of generalization error, comparable to cross-validation, with no extra holdout data.
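The 63.2% figure comes from $1 - (1 - 1/n)^n \to 1 - 1/e$; a quick simulation (plain Python, no libraries) confirms it:

```python
import random

random.seed(0)
n, trials = 1000, 200
unique_frac = 0.0
for _ in range(trials):
    sample = [random.randrange(n) for _ in range(n)]  # bootstrap: draw n with replacement
    unique_frac += len(set(sample)) / n               # fraction of points in-bag
unique_frac /= trials

print(round(unique_frac, 3))  # ~0.632 = 1 - 1/e; the remaining ~36.8% is out-of-bag
```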

Gradient Boosting

Sequential Update:
$$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$$
Pseudo-Residuals:
$$r_{im} = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\bigg|_{F=F_{m-1}}$$
XGBoost Objective:
$$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$
Regularization:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$
  • η (learning_rate): 0.01-0.3; lower = more trees needed
  • Trees are weak learners: depth 3-8 typical
  • Always use early stopping with validation set
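The sequential update above, for squared loss (where the pseudo-residuals reduce to ordinary residuals $y - F(x)$), can be sketched with 1-D regression stumps. A toy implementation, not a library API:

```python
def fit_stump(xs, rs):
    """Fit the 1-D regression stump minimizing squared error on residuals rs."""
    best = None
    for t in sorted(set(xs))[:-1]:  # last value would leave the right leaf empty
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, eta=0.3):
    """F_m = F_{m-1} + eta * h_m, each h_m a stump fit to current residuals."""
    f0 = sum(ys) / len(ys)          # initial constant model
    stumps = []
    def predict(x):
        return f0 + sum(eta * s(x) for s in stumps)
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs, ys = [1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0]
model = gradient_boost(xs, ys)
print([round(model(x), 2) for x in xs])  # close to [1.0, 1.0, 3.0, 3.0]
```

With eta = 0.3, each round shrinks the residuals by a factor of 0.7, illustrating why a lower learning rate needs more rounds.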

Common Mistakes

  • Growing trees without pruning or depth limits (severe overfitting)
  • Ignoring class imbalance - trees bias toward majority class without balancing
  • Not tuning hyperparameters - defaults rarely give best performance for boosting
  • Using trees for extrapolation - trees cannot predict outside the training data range
  • Relying on MDI feature importance for high-cardinality features (biased!)
  • Setting learning rate too high in boosting - causes overfitting and instability
  • Forgetting to set a random seed - tree ensembles are stochastic
  • Using a single tree when an ensemble would be more appropriate (high variance)

Key Hyperparameters

  • max_depth: Tree depth limit. Single tree: 4-10. Boosting: 3-8. RF: None (fully grown)
  • min_samples_split: Min samples to split a node. Higher = less overfitting. Try 2-20
  • n_estimators: Number of trees. RF: 100-500. Boosting: use early stopping
  • learning_rate (η): Boosting step size. 0.01-0.3. Lower pairs with more trees
  • max_features: Features per split. RF: sqrt(p) or p/3. Boosting: 0.5-0.8
  • subsample: Fraction of data per tree (boosting). 0.5-0.9 reduces overfitting
  • min_child_weight: XGBoost: min sum of instance weights in leaf. Higher = conservative
  • reg_alpha / reg_lambda: L1/L2 regularization on leaf weights (XGBoost)

Start with defaults, then tune learning_rate + n_estimators first, then max_depth, then regularization.
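A hedged starting configuration following the ranges above (assumes the xgboost package is installed; these values are a reasonable starting point, not a universal recipe):

```python
# Assumes xgboost; all identifiers below are real XGBoost parameters.
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=1000,         # generous cap; early stopping picks the real count
    learning_rate=0.05,        # low eta pairs with more trees
    max_depth=5,               # weak learners: depth 3-8
    subsample=0.8,             # row fraction per tree
    colsample_bytree=0.8,      # feature fraction per tree (max_features analogue)
    min_child_weight=5,        # higher = more conservative splits
    reg_alpha=0.0,             # L1 on leaf weights
    reg_lambda=1.0,            # L2 on leaf weights
    early_stopping_rounds=50,  # requires eval_set in fit()
)
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```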

Interview Questions

Q: What is the difference between Gini impurity and entropy?

A: Both measure node impurity. Entropy uses logarithms and ranges [0, log₂(c)]. Gini uses squared probabilities and ranges [0, 1-1/c]. In practice, they produce nearly identical trees. Gini is slightly faster (no log computation) and is the scikit-learn default.

Q: Why do single decision trees have high variance?

A: The greedy splitting algorithm is sensitive to the training data. Small changes in the data can lead to completely different split decisions at the root, cascading into entirely different tree structures. Ensembles fix this by averaging many trees.

Q: How does Random Forest reduce overfitting compared to a single tree?

A: Random Forest uses two sources of randomness: (1) bootstrap sampling gives each tree different training data, and (2) random feature selection at each split decorrelates the trees. Averaging many decorrelated trees reduces variance without increasing bias.

Q: What is the difference between bagging and boosting?

A: Bagging trains trees independently in parallel on bootstrap samples and averages them (reduces variance). Boosting trains trees sequentially, where each tree corrects the errors of the previous ones (reduces bias and variance). Boosting typically achieves higher accuracy but is more prone to overfitting.

Q: Why can't decision trees extrapolate?

A: Trees partition the feature space into rectangles and assign a constant prediction to each region. For values outside the training range, the tree assigns the prediction of the nearest known region. It cannot learn trends that extend beyond the data, unlike linear models.
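A hand-built two-leaf "tree" (leaf values set to mimic a fit on y = 2x over x in 0..3) makes the flat extrapolation concrete:

```python
def tree_predict(x):
    """Hypothetical two-leaf tree fit on y = 2x for x in {0, 1, 2, 3};
    each leaf stores the mean of y in its region."""
    if x <= 1.5:
        return 1.0   # mean of y at x = 0, 1
    return 5.0       # mean of y at x = 2, 3

# Far outside the training range, the prediction is just the nearest leaf value:
print(tree_predict(100))  # 5.0 -- the trend y = 2x is not extended (y would be 200)
```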

Q: What makes XGBoost faster and better than vanilla gradient boosting?

A: XGBoost uses second-order Taylor expansion for better split finding, built-in L1/L2 regularization to prevent overfitting, column subsampling for diversity, sparsity-aware splitting for missing values, and cache-optimized parallel computation for speed.