
Decision Trees &
Random Forests

From intuitive flowcharts to powerful ensemble methods. A beginner-friendly, mathematically rigorous deep dive into tree-based learning algorithms that dominate modern machine learning.

Contents
  • Historical Context
  • Intuition - How Trees Think
  • Entropy & Information Gain
  • Gini Index
  • Building a Decision Tree
  • Pruning
  • Random Forests
  • Gradient Boosting & XGBoost
  • Advantages & Disadvantages
  • From Beginner to Advanced
01

Historical Context

Tracing the evolution of tree-based learning from early decision theory to modern ensemble methods.

The Origins of Decision Theory

The idea of making decisions through a sequence of questions is as old as human reasoning itself. From medical diagnosis to botanical classification, experts have always used hierarchical decision-making. The formalization of this process into computational algorithms began in the 1960s and has since become one of the most successful paradigms in machine learning.

Decision trees are unique among machine learning algorithms because they mirror the way humans naturally think. When a doctor diagnoses a patient, they ask a series of questions - Does the patient have a fever? Is the cough dry or productive? Each answer narrows down the possibilities. Decision tree algorithms automate this exact process.

Key Milestones

1966
Earl Hunt publishes the Concept Learning System (CLS), one of the earliest decision tree algorithms for pattern classification. It introduced the idea of recursively partitioning data based on feature values.
1979
J. Ross Quinlan develops ID3 (Iterative Dichotomiser 3), which uses information gain based on entropy to select splitting features. This brought information theory into the heart of decision tree learning.
1984
Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone publish CART (Classification and Regression Trees). CART introduced the Gini impurity criterion and binary splitting, creating a unified framework for both classification and regression.
1993
Quinlan introduces C4.5, the successor to ID3. C4.5 handles continuous attributes, missing values, and includes pruning to combat overfitting. It became the benchmark algorithm for years.
1996
Leo Breiman introduces Bagging (Bootstrap Aggregating), showing that combining multiple trees trained on bootstrap samples dramatically reduces variance and improves prediction accuracy.
2001
Breiman publishes the Random Forest algorithm, adding random feature selection at each split on top of bagging. Random Forests become one of the most widely used algorithms in machine learning.
2016
Tianqi Chen and Carlos Guestrin publish the XGBoost paper, describing an optimized gradient boosting framework. XGBoost dominates Kaggle competitions and becomes the go-to algorithm for structured/tabular data.
Today
Tree-based ensemble methods (XGBoost, LightGBM, CatBoost) remain the most successful algorithms for tabular data, consistently outperforming deep learning on structured datasets.

Why Trees Still Matter

Despite the deep learning revolution, tree-based models remain the dominant approach for structured (tabular) data. In a landmark 2022 study, researchers found that gradient-boosted trees consistently outperform neural networks on tabular datasets. The reasons are clear: trees handle mixed data types naturally, require minimal preprocessing, and provide interpretable results.

Industry Impact

Tree-based models power critical systems across industries: credit scoring in finance, fraud detection in banking, clinical decision support in healthcare, recommendation engines in e-commerce, and predictive maintenance in manufacturing. Their interpretability makes them the preferred choice when model decisions must be explained.

02

Intuition - How Trees Think

Understanding the fundamental logic behind decision trees through everyday analogies.

The 20 Questions Game

A decision tree works exactly like the game of 20 questions. You start with a broad question that divides the possibilities into groups, then ask progressively more specific questions to narrow down to the answer. The key is choosing the right question at each step - the one that gives you the most information.

Consider classifying animals. You might first ask: "Does it have fur?" This immediately splits the animal kingdom into two groups. Then you might ask: "Does it have four legs?" Each question narrows the possibilities until you reach a definitive classification.

How a Decision Tree Classifies

1

Start at the Root

The root node contains the entire dataset. The algorithm evaluates every feature and every possible split point to find the single best question that separates the classes most effectively.

2

Ask the Best Question

The "best" question is the one that produces the most homogeneous (pure) child nodes. We measure purity using entropy, Gini index, or other criteria. For example: "Is income > $50,000?" splits the data into two groups.

3

Split and Recurse

Data flows left or right based on the answer. Each child node becomes a new sub-problem. The algorithm repeats the process recursively on each branch, finding the best question for each subset of data.

4

Reach a Leaf Node

When a stopping criterion is met (pure node, maximum depth, minimum samples), we create a leaf node. The leaf assigns the majority class (classification) or the average value (regression) of the samples that reached it.
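The four steps above can be sketched as a recursive traversal of a hand-built tree. The feature names and thresholds below are purely illustrative, not from a real dataset:

```python
# A minimal sketch of how a fitted classification tree makes a prediction.
# The tree itself is hand-written for illustration; a real tree would be
# grown from data by recursive splitting.

def predict(node, sample):
    """Walk from the root to a leaf, answering one question per level."""
    if "label" in node:                          # leaf node: return its class
        return node["label"]
    if sample[node["feature"]] <= node["threshold"]:
        return predict(node["left"], sample)     # "yes" -> left branch
    return predict(node["right"], sample)        # "no"  -> right branch

# Hypothetical tree: "Is income <= 50000?" then "Is age <= 30?"
tree = {
    "feature": "income", "threshold": 50_000,
    "left":  {"label": "deny"},
    "right": {"feature": "age", "threshold": 30,
              "left":  {"label": "review"},
              "right": {"label": "approve"}},
}

print(predict(tree, {"income": 80_000, "age": 45}))  # approve
```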

Key Properties of Decision Trees

Interpretable

Decisions can be visualized and explained to non-technical stakeholders as simple if-then rules

Non-parametric

Makes no assumptions about the underlying data distribution, unlike linear models or naive Bayes

Handles Mixed Data

Naturally works with both numerical and categorical features without encoding or transformation

No Scaling Needed

Feature scaling (normalization/standardization) is unnecessary because splits are based on thresholds, not distances

Classification vs. Regression Trees

Decision trees handle both tasks elegantly:

Classification Tree

Predicts a discrete class label. Leaf nodes assign the majority class among the training samples that reached them. Splitting criteria use entropy or Gini impurity to maximize class purity.

$$\hat{y} = \text{mode}(y_i \text{ in leaf})$$

Regression Tree

Predicts a continuous value. Leaf nodes assign the mean (or median) of the target values. Splitting criteria use variance reduction or mean squared error to minimize prediction error.

$$\hat{y} = \frac{1}{|L|}\sum_{i \in L} y_i$$
03

Splitting Criteria - Entropy & Information Gain

The mathematical foundation for choosing the best feature to split on at each node.

What is Entropy?

Entropy, borrowed from information theory (Claude Shannon, 1948), measures the impurity or disorder in a set. In the context of decision trees, entropy quantifies how mixed the class labels are in a node. A node with all samples from one class has zero entropy (perfectly pure). A node with an equal mix of classes has maximum entropy (maximum disorder).

$$H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

Where $S$ is the set of samples, $c$ is the number of classes, and $p_i$ is the proportion of samples belonging to class $i$.

Entropy Derivation - Step by Step

Step 1: Define the Surprise

The surprise (self-information) of an event with probability $p$ is defined as $I(p) = -\log_2(p)$. A rare event (low probability) carries high surprise; a certain event carries zero surprise.

Step 2: Take the Expected Value

Entropy is the expected surprise across all classes. We weight each class's surprise by its probability of occurring:

$$H(S) = \mathbb{E}[I(p)] = \sum_{i=1}^{c} p_i \cdot (-\log_2(p_i)) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

Step 3: Interpret the Values

For binary classification: $H = 0$ means all samples belong to one class (pure). $H = 1$ means a 50/50 split (maximum impurity). The entropy curve is symmetric and concave.

Worked Example: Binary Entropy

Consider a dataset with 10 samples: 7 positive and 3 negative.

$$H(S) = -\frac{7}{10}\log_2\left(\frac{7}{10}\right) - \frac{3}{10}\log_2\left(\frac{3}{10}\right)$$
$$H(S) = -(0.7)(-0.515) - (0.3)(-1.737)$$
$$H(S) = 0.3605 + 0.5211 = 0.8813 \text{ bits}$$

This is fairly high entropy, indicating the node is not very pure. Splitting on a good feature should reduce this value in the child nodes.
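The worked example can be checked in a few lines of Python, a sketch using only the standard library:

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # 0 * log(0) is taken as 0
    return max(0.0, -sum(p * log2(p) for p in probs))  # max() clamps -0.0

print(round(entropy([7, 3]), 4))   # 0.8813, matching the worked example
print(entropy([10, 0]))            # 0.0 -> pure node
print(entropy([5, 5]))             # 1.0 -> maximum binary impurity
```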

Information Gain

Information Gain measures how much entropy is reduced after splitting on a particular feature. The feature with the highest information gain is chosen as the splitting criterion.

$$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

Where $A$ is the feature being evaluated, $\text{Values}(A)$ is the set of possible values for feature $A$, and $S_v$ is the subset of $S$ where feature $A$ has value $v$.
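As a sketch, the same formula in Python, applied to a hypothetical split of the 7/3 node from the earlier example into one pure child and one mixed child:

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """IG(S, A) = H(S) minus the size-weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split of a [7+, 3-] node into [4+, 0-] and [3+, 3-]
print(round(information_gain([7, 3], [[4, 0], [3, 3]]), 4))  # 0.2813
```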

The Greedy Strategy

At each node, the algorithm greedily selects the feature and split point that maximizes information gain. This is a locally optimal decision - the algorithm does not look ahead to see if a different split might lead to a better tree overall. Despite this greedy approach, decision trees perform remarkably well in practice.

Visualizing Entropy

The binary entropy curve shows how impurity changes as the class proportion varies from 0 (all one class) to 1 (all the other class). Notice how entropy peaks at 0.5 (maximum uncertainty) and reaches zero at the extremes (pure nodes).

Gain Ratio (C4.5 Improvement)

Information gain has a bias toward features with many distinct values (high cardinality). For example, a unique ID feature would perfectly split every sample but provide no generalization. Quinlan's C4.5 algorithm addresses this with the Gain Ratio:

$$GR(S, A) = \frac{IG(S, A)}{SI(A)}$$

Where the Split Information penalizes features with many values:

$$SI(A) = -\sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \log_2\left(\frac{|S_v|}{|S|}\right)$$

Watch Out for High-Cardinality Features

Features like user IDs, timestamps, or zip codes can artificially inflate information gain because they create many unique splits. Always use gain ratio or set a minimum number of samples per leaf to avoid this trap.

04

Gini Index

An alternative impurity measure used by CART that is computationally efficient and widely adopted.

Gini Impurity Formula

The Gini impurity measures the probability that a randomly chosen sample from the node would be incorrectly classified if it were randomly labeled according to the distribution of classes in that node.

$$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$

Where $p_i$ is the proportion of class $i$ in the node. For binary classification, this simplifies to:

$$Gini(S) = 1 - p_1^2 - p_0^2 = 2p_1(1 - p_1)$$

Interpreting Gini Values

Pure Node (Gini = 0)

All samples belong to one class. If $p_1 = 1$, then $Gini = 1 - 1^2 = 0$. No further splitting is needed.

Maximum Impurity (Gini = 0.5 for binary)

Equal class proportions. If $p_1 = 0.5$, then $Gini = 1 - 0.5^2 - 0.5^2 = 0.5$. This is the worst case - maximum uncertainty.

Weighted Gini for a Split

When evaluating a split, we compute the weighted average of Gini impurity in the child nodes:

$$Gini_{split} = \frac{|S_L|}{|S|}Gini(S_L) + \frac{|S_R|}{|S|}Gini(S_R)$$
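Both formulas are a few lines of standard-library Python (an illustrative sketch):

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left, right):
    """Size-weighted average Gini of the two children of a binary split."""
    n = sum(left) + sum(right)
    return sum(left) / n * gini(left) + sum(right) / n * gini(right)

print(gini([5, 5]))                           # 0.5 -> maximum binary impurity
print(gini([10, 0]))                          # 0.0 -> pure node
print(round(gini_split([4, 0], [3, 3]), 3))   # 0.3 -> impurity after the split
```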

Entropy vs. Gini: When to Use Each

Entropy (Information Gain)

  • Used by ID3 and C4.5 algorithms
  • Involves logarithmic computation (slightly slower)
  • Ranges from 0 to $\log_2(c)$
  • Tends to produce more balanced trees
  • Better theoretical foundation in information theory

Gini Impurity

  • Used by CART and scikit-learn (default)
  • No logarithm needed (computationally faster)
  • Ranges from 0 to 0.5 (binary) or $1 - 1/c$
  • Tends to isolate the most frequent class first
  • Slightly favors larger partitions

Gini vs Entropy Comparison

Both metrics behave similarly. This chart overlays Gini impurity and scaled entropy for binary classification. Notice they peak at the same point (p = 0.5) and agree on which splits are best in most cases.

In Practice, It Rarely Matters

Research has shown that the choice between entropy and Gini impurity affects only about 2% of cases. The resulting trees are usually very similar. Gini is the default in most implementations because it avoids the logarithm computation, making it marginally faster.

05

Building a Decision Tree

The recursive algorithm that grows a tree from data, one split at a time.

The Greedy Top-Down Algorithm

Decision trees are built using a recursive partitioning approach known as top-down induction of decision trees (TDIDT). The algorithm starts with all data at the root and greedily splits it into subsets.

1

Evaluate All Possible Splits

For each feature, evaluate every possible split point. For numerical features, consider all midpoints between consecutive sorted values. For categorical features, consider all possible subsets (or binary groupings for CART).

2

Select the Best Split

Choose the feature and split point that maximizes the impurity reduction (information gain, Gini reduction, or variance reduction for regression). This is the greedy choice.

$$\text{Best Split} = \arg\max_{A, t} \left[ \text{Impurity}(S) - \text{Impurity}_{split}(S, A, t) \right]$$
3

Create Child Nodes

Split the data into subsets based on the chosen feature and threshold. Data where $x_A \leq t$ goes to the left child, and data where $x_A > t$ goes to the right child.

4

Recurse on Each Child

Apply the same process recursively to each child node. Each child becomes a new root of a sub-tree. This continues until a stopping criterion is met.

5

Create Leaf Nodes

When a stopping criterion is satisfied, create a leaf node. Assign the majority class (classification) or the mean target value (regression) of the remaining samples.

Stopping Criteria

Without stopping criteria, a decision tree will keep splitting until every leaf is pure (contains samples of only one class). This leads to severe overfitting. Common stopping criteria include:

  • Maximum depth: Limit the number of levels in the tree. Deeper trees are more complex and prone to overfitting.
  • Minimum samples per node: Require a minimum number of samples to justify a split. Prevents creating nodes from tiny subsets of data.
  • Minimum samples per leaf: Ensure every leaf node has at least this many samples. Small leaves overfit to noise.
  • Maximum leaf nodes: Limit the total number of leaf nodes in the tree.
  • Minimum impurity decrease: Only split if the impurity reduction exceeds a threshold.
  • Pure node: Stop when all samples in a node belong to the same class (entropy = 0 or Gini = 0).
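In scikit-learn (assuming it is installed), these criteria map directly onto constructor arguments; a minimal sketch on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each argument corresponds to one stopping criterion from the list above.
clf = DecisionTreeClassifier(
    criterion="gini",            # or "entropy"
    max_depth=3,                 # maximum depth
    min_samples_split=10,        # minimum samples required to attempt a split
    min_samples_leaf=5,          # minimum samples required in each leaf
    min_impurity_decrease=0.01,  # minimum impurity reduction to split
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```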

Handling Numerical vs. Categorical Features

Numerical Features

Sort the feature values, then consider all midpoints between consecutive values as potential thresholds. For $n$ unique values, there are $n - 1$ possible split points. The split takes the form $x_j \leq t$.
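A brute-force sketch of that threshold search in pure Python, on a toy dataset (illustrative only; real implementations sort once and scan incrementally):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(x, y):
    """Try every midpoint between consecutive sorted unique values and
    return the threshold with the lowest weighted child impurity."""
    values = sorted(set(x))
    best_t, best_impurity = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2                        # candidate midpoint
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity

x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
print(best_threshold(x, y))   # (6.5, 0.0) -> a perfect split
```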

Categorical Features

For CART (binary splits), the algorithm considers all possible ways to partition categories into two groups. For $k$ categories, there are $2^{k-1} - 1$ possible partitions. ID3/C4.5 creates one branch per category value.

Interactive Decision Tree Visualization

See how a decision tree splits data step by step. This visualization shows a sample classification tree built on the classic "Play Tennis" dataset. Each node shows the splitting feature and threshold, with color indicating class purity.

Computational Complexity

Building a decision tree has time complexity $O(n \cdot m \cdot \log n)$ where $n$ is the number of samples and $m$ is the number of features. The $\log n$ comes from sorting. For high-cardinality categorical features, the exponential number of possible partitions can become a bottleneck.

06

Pruning

Combating overfitting by simplifying the tree structure without sacrificing generalization.

The Overfitting Problem

A fully grown decision tree perfectly fits the training data by creating a leaf for every unique combination of feature values. This is a textbook case of overfitting: the tree memorizes the training data, including noise and outliers, instead of learning the underlying patterns.

Consider a decision tree with one leaf per training sample. It achieves 100% training accuracy but will perform poorly on unseen data. The tree has learned the noise, not the signal. Pruning removes branches that provide little predictive power, resulting in a simpler, more generalizable tree.

Signs of Overfitting

  • Large gap between training accuracy and test accuracy
  • Very deep tree with many leaves
  • Leaves with very few samples
  • Complex decision boundaries that follow noise in the data

Pre-Pruning vs. Post-Pruning

Pre-Pruning (Early Stopping)

Stop growing the tree before it fully fits the training data. Controlled by hyperparameters set before training:

  • max_depth: Maximum tree depth (e.g., 5-10)
  • min_samples_split: Minimum samples required to split a node
  • min_samples_leaf: Minimum samples required in each leaf
  • max_leaf_nodes: Maximum total number of leaves
  • min_impurity_decrease: Minimum impurity reduction required to split

Pros: Fast, prevents unnecessary computation. Cons: Might stop too early (horizon effect) - a seemingly bad split might lead to excellent sub-splits that are never explored.

Post-Pruning (Backward Pruning)

Grow the full tree first, then remove branches that do not improve generalization. Two main approaches:

  • Reduced Error Pruning: Replace each subtree with a leaf and keep the change if validation accuracy does not decrease
  • Cost-Complexity Pruning (CCP): Add a penalty term for tree complexity (used by CART and scikit-learn)

Pros: Considers the full tree structure, avoids the horizon effect. Cons: Computationally more expensive since the full tree must be built first.

Cost-Complexity Pruning (Minimal Cost-Complexity)

The CART algorithm uses cost-complexity pruning, which balances tree accuracy against tree size. The cost-complexity measure for a subtree $T$ is:

$$R_\alpha(T) = R(T) + \alpha|T|$$

Where:

  • $R(T)$ is the total misclassification rate (or impurity) of the tree
  • $|T|$ is the number of leaf nodes (tree complexity)
  • $\alpha \geq 0$ is the complexity parameter that controls the trade-off

When $\alpha = 0$, the full tree is optimal. As $\alpha$ increases, simpler trees are preferred. The optimal $\alpha$ is found using cross-validation, producing a sequence of nested subtrees from which the best is selected.

Overfitting vs Underfitting

This chart shows how training and test accuracy change as tree depth increases. Shallow trees underfit (high bias), while very deep trees overfit (high variance). The optimal depth balances the two.

scikit-learn Implementation

In scikit-learn, use DecisionTreeClassifier(ccp_alpha=0.01) to apply cost-complexity pruning. The method cost_complexity_pruning_path() returns the effective alpha values and corresponding impurities, allowing you to plot the trade-off and choose the optimal pruning level.
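A sketch of that workflow (assuming scikit-learn is available; the mid-range alpha below is an arbitrary illustrative choice, in practice chosen by cross-validation):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow the full tree, then ask for the effective alphas along the pruning path.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# Refit with a mid-range alpha: larger alpha -> simpler (more pruned) tree.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)

# The pruned tree has at most as many leaves as the full tree.
print(full.get_n_leaves(), pruned.get_n_leaves())
```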

07

Random Forests

Harnessing the wisdom of crowds: combining many trees into a powerful ensemble.

The Problem with Single Trees

Individual decision trees suffer from high variance. Small changes in the training data can produce completely different tree structures. This instability is the primary weakness of decision trees. A single tree that overfits is unreliable, and a heavily pruned tree may underfit.

The key insight behind Random Forests is that while individual trees may be noisy and overfit, their errors are uncorrelated. By averaging many trees, the noise cancels out and the true signal emerges. This is the power of ensemble learning.

Bagging (Bootstrap Aggregating)

Bagging, introduced by Leo Breiman in 1996, is the foundation of Random Forests. The idea is simple but powerful:

1

Create Bootstrap Samples

From a dataset of $n$ samples, create $B$ bootstrap samples by sampling $n$ data points with replacement. Each bootstrap sample contains about 63.2% of the unique original samples (the rest are duplicates).

2

Train Independent Trees

Train one decision tree on each bootstrap sample. Each tree is grown fully (no pruning) to keep bias low. The diversity comes from the different training subsets.

3

Aggregate Predictions

For classification, take the majority vote across all trees. For regression, take the average prediction.

$$\hat{y} = \text{mode}(h_1(x), h_2(x), \ldots, h_B(x))$$
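Two pieces of the procedure above are easy to verify with the standard library: the 63.2% unique-sample figure from step 1 (it converges to $1 - 1/e \approx 0.632$ as $n$ grows), and the majority vote from step 3 (the tree predictions here are made-up placeholders):

```python
import random
from collections import Counter

random.seed(0)
n = 10_000

# Step 1: one bootstrap sample = n draws with replacement from range(n).
bootstrap = random.choices(range(n), k=n)
unique_fraction = len(set(bootstrap)) / n
print(round(unique_fraction, 3))   # close to 1 - 1/e = 0.632

# Step 3: aggregate by majority vote across hypothetical tree predictions.
votes = ["cat", "dog", "cat", "cat", "dog"]
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)                  # cat
```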

Random Feature Selection

Random Forests add a second layer of randomness on top of bagging: at each split, only a random subset of features is considered. This is the key innovation that makes Random Forests more powerful than simple bagging.

$$m = \lfloor\sqrt{p}\rfloor \text{ (classification)} \quad \text{or} \quad m = \lfloor p/3 \rfloor \text{ (regression)}$$

Where $p$ is the total number of features and $m$ is the number of features randomly selected at each split. This decorrelates the trees: without random feature selection, if one feature is very strong, all trees would split on it first, making them highly correlated. By forcing each tree to consider different feature subsets, the trees become diverse, and their average is more robust.

Out-of-Bag (OOB) Error

Because each bootstrap sample leaves out about 36.8% of the original data, we get a free validation set for each tree. For each sample, we can aggregate predictions only from trees that did not include that sample in their training data. This gives us the Out-of-Bag error - an unbiased estimate of the generalization error without needing a separate validation set.

$$\text{OOB Error} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[\hat{y}_i^{OOB} \neq y_i\right]$$
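In scikit-learn (assuming it is installed), the OOB estimate comes for free with `oob_score=True`; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True aggregates, for each sample, only the trees that did not
# see that sample during training -- a built-in validation estimate.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print(round(rf.oob_score_, 3))   # OOB accuracy; OOB error = 1 - oob_score_
```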

Random Forest Ensemble Visualization

Watch how individual tree predictions combine into a powerful ensemble. Each tree votes independently, and the majority vote produces a prediction that is more accurate and stable than any single tree.

No Cross-Validation Needed

The OOB error has been shown to be approximately equivalent to leave-one-out cross-validation. This means Random Forests effectively provide built-in validation, saving the computational cost of k-fold cross-validation.

Feature Importance

Random Forests provide two measures of feature importance:

Impurity-Based Importance (MDI)

For each feature, sum the total impurity decrease (weighted by the number of samples reaching that node) across all trees and all splits using that feature. Features that produce larger impurity reductions are more important.

$$\text{Imp}(x_j) = \frac{1}{B}\sum_{b=1}^{B}\sum_{t \in T_b} \Delta I(t) \cdot \mathbb{1}[v(t) = j]$$

Permutation Importance (MDA)

For each feature, randomly shuffle its values in the OOB samples and measure the increase in OOB error. Features that cause a large increase in error when permuted are important. This method is less biased toward high-cardinality features.

$$\text{Imp}(x_j) = \frac{1}{B}\sum_{b=1}^{B}\left[\text{OOB}_{perm}^{(b)} - \text{OOB}^{(b)}\right]$$
08

Gradient Boosting & XGBoost

Sequential ensemble methods that learn from mistakes and dominate machine learning competitions.

Boosting Intuition

While bagging creates trees in parallel and averages them, boosting builds trees sequentially. Each new tree focuses on the mistakes made by the previous ones. Think of it like a student who studies the questions they got wrong on a practice exam, progressively improving on their weaknesses.

The key difference from bagging: in boosting, trees are weak learners (shallow trees, typically depth 3-8) rather than full trees. Each tree makes a small correction to the overall prediction, and the cumulative effect of many small corrections produces a powerful model.

Gradient Boosting Framework

Gradient boosting frames the ensemble learning problem as gradient descent in function space. Instead of optimizing parameters, we optimize the function itself by iteratively adding trees that point in the direction of the steepest descent of the loss function.

Step 1: Initialize with a Constant

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

Start with the simplest prediction: for squared error, this is the mean of the target values.

Step 2: Compute Pseudo-Residuals

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}$$

The pseudo-residuals are the negative gradients of the loss function. For squared error, these are simply the residuals $y_i - F_{m-1}(x_i)$.

Step 3: Fit a Tree to the Residuals

Train a new decision tree $h_m(x)$ to predict the pseudo-residuals. This tree learns the pattern in the errors of the current model.

Step 4: Update the Model

$$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$$

Add the new tree's predictions (scaled by learning rate $\eta$) to the current model. The learning rate controls how much each tree contributes, preventing overshooting.
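The four steps can be sketched as a manual boosting loop for squared error, using scikit-learn's `DecisionTreeRegressor` as the weak learner on made-up data (illustrative, not production code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)   # noisy synthetic target

lr, n_rounds = 0.1, 100
F = np.full_like(y, y.mean())          # Step 1: constant initial model
trees = []
for _ in range(n_rounds):
    residuals = y - F                  # Step 2: pseudo-residuals (squared error)
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # Step 3
    F += lr * h.predict(X)             # Step 4: F_m = F_{m-1} + eta * h_m
    trees.append(h)

print(round(float(np.mean((y - F) ** 2)), 4))  # training MSE after boosting
```

Each shallow tree removes a little more of the remaining error, which is why the learning rate and the number of rounds must be tuned together.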

XGBoost: Extreme Gradient Boosting

XGBoost, created by Tianqi Chen, adds several engineering and algorithmic innovations to gradient boosting that make it faster, more accurate, and more robust:

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

Where the regularization term penalizes tree complexity:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$

Here $T$ is the number of leaves, $w_j$ are the leaf weights, $\gamma$ penalizes the number of leaves, and $\lambda$ penalizes large leaf weights (L2 regularization).
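From this objective, the XGBoost paper derives closed forms via a second-order Taylor expansion of the loss. With $g_i$ and $h_i$ the first and second derivatives of the loss for sample $i$, and $G_j$, $H_j$ their sums over the samples in leaf $j$, the optimal leaf weight and objective are:

$$w_j^* = -\frac{G_j}{H_j + \lambda} \qquad \mathcal{L}^* = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$$

The gain of a candidate split follows directly, and is how XGBoost scores splits:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

A split is only kept if its gain is positive, which is where the $\gamma$ penalty on leaves acts as built-in pruning.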

Gradient Boosting Loss Curve

Observe how the loss decreases with each boosting iteration. The training loss steadily decreases, while the validation loss eventually plateaus or increases - signaling the optimal stopping point.

Bagging vs. Boosting Comparison

Bagging (Random Forests)

  • Trees built in parallel
  • Each tree is a strong learner (fully grown)
  • Reduces variance
  • Less prone to overfitting
  • Robust to noisy data
  • Fewer hyperparameters to tune

Boosting (XGBoost, LightGBM)

  • Trees built sequentially
  • Each tree is a weak learner (shallow)
  • Reduces bias and variance
  • Can overfit with too many iterations
  • More sensitive to noisy data and outliers
  • More hyperparameters but often higher accuracy

XGBoost Key Innovations

Second-Order Approximation

Uses both first and second derivatives (Newton's method) for faster convergence

Built-in Regularization

L1 and L2 regularization on leaf weights prevent overfitting

Sparsity Awareness

Efficient handling of missing values and sparse data natively

Column Subsampling

Random feature selection at tree and split level, similar to Random Forests

09

Advantages & Disadvantages

Understanding when to use tree-based models and when to look elsewhere.

Advantages

Highly Interpretable

Decisions can be visualized as flowcharts and explained as simple if-then rules. This is critical in regulated industries like healthcare and finance.

No Feature Scaling Required

Trees make decisions based on thresholds, not distances. Normalization or standardization is unnecessary.

Handles Non-Linear Relationships

Trees can capture complex, non-linear decision boundaries through recursive partitioning without needing polynomial features.

Works with Mixed Feature Types

Naturally handles both numerical and categorical features without one-hot encoding.

Handles Missing Values

Some implementations (like XGBoost) handle missing values natively by learning the optimal direction for missing data at each split.

Fast Prediction

Prediction requires only traversing the tree from root to leaf, which is $O(\log n)$ in a balanced tree.

Feature Importance

Trees naturally provide a ranking of feature importance based on their contribution to impurity reduction.

Disadvantages

Prone to Overfitting

Without pruning or ensemble methods, single trees easily memorize training data, including noise and outliers.

High Variance (Instability)

Small changes in training data can produce completely different tree structures. This makes single trees unreliable.

Biased Toward High-Cardinality Features

Information gain favors features with many unique values. Using the gain ratio or Gini impurity partially mitigates this.


Greedy Algorithm

The top-down greedy approach finds locally optimal splits, not globally optimal trees. The best first split may not lead to the best overall tree.

Cannot Extrapolate

Trees can only predict values within the range seen during training. They cannot extrapolate beyond the training data, unlike linear models.

Axis-Aligned Boundaries

Standard trees can only create decision boundaries parallel to feature axes. Diagonal or curved boundaries require many splits to approximate.

Ensemble Trade-off

While ensembles (Random Forests, XGBoost) fix many weaknesses, they sacrifice the interpretability that makes single trees attractive.

When to Use Tree-Based Models

Best For

  • Tabular/structured data with mixed feature types
  • Problems requiring interpretable models (single trees)
  • Kaggle-style competitions with structured data (XGBoost)
  • When feature interactions are important
  • Quick baseline models with minimal preprocessing
  • Problems with missing values

Avoid For

  • Image, audio, or video data (use deep learning)
  • Sequential data like text or time series (use RNNs/Transformers)
  • Extrapolation tasks (predictions outside training range)
  • Very high-dimensional sparse data (use linear models)
  • Problems requiring smooth decision boundaries
  • Online learning scenarios with streaming data
10

From Beginner to Advanced

Advanced topics, modern frameworks, and the cutting edge of tree-based learning.

Feature Importance Methods

Understanding which features drive predictions is crucial for model interpretability and feature engineering. Tree-based models offer several approaches:

1

Mean Decrease in Impurity (MDI)

Sum the impurity decrease from all splits using a given feature across all trees. Fast to compute but biased toward high-cardinality and continuous features. Available as feature_importances_ in scikit-learn.

2

Permutation Importance

Shuffle a feature's values and measure the drop in model performance. Model-agnostic and unbiased, but can be slow for large datasets. Use sklearn.inspection.permutation_importance().

3

SHAP Values (SHapley Additive exPlanations)

Based on game theory, SHAP values provide the exact contribution of each feature to each individual prediction. TreeSHAP is an efficient algorithm for computing SHAP values for tree-based models in polynomial time.

$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!(|F|-|S|-1)!}{|F|!}[f(S \cup \{j\}) - f(S)]$$
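The first two methods are a few lines in scikit-learn (assuming it is installed); a sketch on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# MDI: fast, computed during training, biased toward high-cardinality features.
print(rf.feature_importances_)

# Permutation importance: shuffle one feature at a time and measure the score
# drop (ideally computed on a held-out set rather than the training data).
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```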

Feature Importance Comparison

A typical feature importance ranking from a Random Forest model. Bars show the relative importance of each feature based on mean decrease in impurity (MDI). This helps identify which features drive predictions the most.

Best Practice

Use SHAP values for detailed, per-prediction explanations. Use permutation importance for overall feature ranking. Avoid relying solely on MDI importance as it can be misleading for features with different cardinalities.

Modern Gradient Boosting Frameworks

LightGBM

Developed by Microsoft. Uses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) for dramatic speedups. Grows trees leaf-wise (best-first) rather than level-wise, which often produces better accuracy with fewer leaves. Ideal for large datasets.

CatBoost

Developed by Yandex. Specializes in categorical feature handling using ordered target statistics to prevent target leakage. Uses ordered boosting to reduce prediction shift. Requires minimal hyperparameter tuning and handles categorical features natively without encoding.

Tree-Based Model Selection Guide

1

Need Interpretability?

Use a single decision tree with pruning. Maximum depth of 4-6 keeps the tree human-readable. Consider SHAP values for ensemble model explanations if higher accuracy is needed.

2

Need Robust Predictions with Minimal Tuning?

Use Random Forests. They are hard to overfit, require few hyperparameters, and provide reliable OOB error estimates. Start with 100-500 trees and $m = \sqrt{p}$ features per split.

3

Need Maximum Accuracy on Structured Data?

Use XGBoost, LightGBM, or CatBoost. These gradient boosting frameworks consistently achieve state-of-the-art results on tabular data. They require more careful hyperparameter tuning but reward the effort with superior performance.

4

Have Many Categorical Features?

Use CatBoost. Its native categorical feature handling avoids the pitfalls of one-hot encoding and target encoding. It also requires less hyperparameter tuning than XGBoost.

Key Hyperparameter Tuning Tips

The most impactful hyperparameters for gradient boosting models, ranked by importance:

  • n_estimators / num_boost_round: Number of trees. Use early stopping with a validation set rather than setting a fixed number. More trees reduce bias but increase training time.
  • learning_rate (eta): Step size for each tree's contribution. Lower values (0.01-0.1) require more trees but generalize better. Always pair with n_estimators.
  • max_depth: Maximum tree depth. For boosting, shallow trees (3-8) work best. Deeper trees increase the risk of overfitting.
  • subsample: Fraction of training data used per tree (0.5-0.9). Reduces overfitting through stochastic gradient boosting.
  • colsample_bytree / max_features: Fraction of features considered per tree. Similar to Random Forest's feature subsampling. Values of 0.5-0.8 often work well.
  • min_child_weight / min_samples_leaf: Minimum sum of instance weight in a leaf. Higher values prevent the model from learning noise in small subgroups.
  • reg_alpha (L1) and reg_lambda (L2): Regularization parameters. Start with small values and increase if overfitting is observed.

Always Use Early Stopping

Instead of tuning n_estimators manually, set it to a large number (e.g., 10000) and use early stopping with a validation set. The model will automatically stop adding trees when the validation score stops improving. This is faster and more reliable than grid search over the number of trees.
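A sketch of the same idea with scikit-learn's `GradientBoostingClassifier` on synthetic data (XGBoost and LightGBM expose their own early-stopping options; the parameters here are scikit-learn's names):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Set n_estimators high and let validation-based early stopping pick the
# actual number of trees: stop after 10 rounds without validation improvement.
clf = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
)
clf.fit(X, y)

print(clf.n_estimators_)   # trees actually fit, typically far fewer than 1000
```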
