Random Forest Cheat Sheet

Your quick reference for Random Forest -- from bootstrap sampling and feature randomness to OOB error and hyperparameter tuning.

Key Formulas

Bootstrap Probability:
$$P(\text{not selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$$
Ensemble (Classification):
$$\hat{y} = \text{mode}\left(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_B\right)$$
Ensemble (Regression):
$$\hat{y} = \frac{1}{B}\sum_{b=1}^{B} f_b(\mathbf{x})$$
Ensemble Variance:
$$\text{Var}(\bar{f}) = \rho\sigma^2 + \frac{1 - \rho}{B}\sigma^2$$
Feature Subset Size:
$$m = \lfloor\sqrt{p}\rfloor \text{ (classification)}, \quad m = \left\lfloor \frac{p}{3} \right\rfloor \text{ (regression)}$$

Bootstrap Aggregating (Bagging)

Sampling:
Draw $n$ samples with replacement from original dataset of size $n$
Unique Samples:
$$\approx 63.2\%\text{ unique samples per bootstrap}$$
OOB Samples:
$$\approx 36.8\%\text{ left out per tree (Out-of-Bag)}$$
Aggregation:
Majority vote (classification) or average (regression) across all $B$ trees

Each tree sees a different random subset of the training data. This diversity among trees is what reduces variance and prevents overfitting compared to a single decision tree.
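The 63.2% / 36.8% split above can be verified empirically. A minimal NumPy sketch: draw one bootstrap sample of size $n$ and count the fraction of unique indices.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # dataset size

# One bootstrap sample: n indices drawn with replacement
bootstrap_idx = rng.integers(0, n, size=n)

unique_frac = len(np.unique(bootstrap_idx)) / n
oob_frac = 1 - unique_frac

print(f"unique in bootstrap: {unique_frac:.3f}")  # close to 1 - e^-1 = 0.632
print(f"out-of-bag:          {oob_frac:.3f}")     # close to e^-1 = 0.368
```

The out-of-bag fraction converges to $e^{-1} \approx 0.368$ as $n$ grows, matching the bootstrap probability formula above.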

Feature Randomness

Random Subset:
At each split, only $m$ random features out of $p$ total are considered
Decorrelation Effect:
$$\text{Var}(\bar{f}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2 \xrightarrow{\rho \to 0} \frac{\sigma^2}{B}$$
Standard Choices:
$$m = \lfloor\sqrt{p}\rfloor, \quad m = \lfloor\log_2(p)\rfloor, \quad m = \left\lfloor\frac{p}{3}\right\rfloor$$

Feature randomness decorrelates trees -- lowering $\rho$ in the variance formula. Even if one feature is highly predictive, not every tree will use it at the root, creating diverse tree structures.
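The variance formula can be checked with a quick simulation: build $B$ equicorrelated Gaussian "estimators" with pairwise correlation $\rho$ and compare the empirical variance of their mean against $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. (A toy model of the ensemble, not actual trees.)

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma = 50, 0.3, 1.0
n_trials = 200_000

# Equicorrelated estimators: a shared component (weight sqrt(rho))
# plus independent noise (weight sqrt(1 - rho)) gives correlation rho
shared = rng.standard_normal((n_trials, 1))
noise = rng.standard_normal((n_trials, B))
estimators = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * noise)

empirical = estimators.mean(axis=1).var()
theoretical = rho * sigma**2 + (1 - rho) / B * sigma**2

print(f"empirical:   {empirical:.4f}")
print(f"theoretical: {theoretical:.4f}")  # 0.3 + 0.7/50 = 0.314
```

Note that the $\rho\sigma^2$ term does not shrink as $B$ grows, which is exactly why lowering $\rho$ via feature randomness matters more than adding trees.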

Out-of-Bag (OOB) Error

OOB Prediction:
$$\hat{y}_i^{\text{OOB}} = \text{aggregate}\left(\{f_b(\mathbf{x}_i) : i \notin \mathcal{B}_b\}\right)$$
OOB Error:
$$\text{OOB Error} = \frac{1}{n}\sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{\text{OOB}}\right)$$
  1. Each sample $\mathbf{x}_i$ is predicted only by trees that did not include it in their bootstrap sample
  2. OOB error approximates leave-one-out cross-validation
  3. No need for a separate validation set -- built-in honest estimate
  4. Enable in scikit-learn with oob_score=True
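A minimal scikit-learn sketch of the built-in OOB estimate, using a synthetic dataset (assumes scikit-learn is installed; dataset and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # score each sample only with trees that never saw it
    random_state=0,
    n_jobs=-1,
)
rf.fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

No train/validation split was needed: `oob_score_` is computed from the samples each tree left out of its bootstrap.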

Feature Importance

Gini Importance:
$$\text{Imp}(X_j) = \sum_{\text{nodes splitting on } X_j} \Delta G, \quad \text{where } \Delta G = G_{\text{parent}} - \sum_k \frac{n_k}{n_{\text{parent}}} G_k$$
Permutation Importance:
$$\text{Imp}(X_j) = \text{Score}_{\text{original}} - \text{Score}_{\text{permuted } X_j}$$
Gini Impurity:
$$G = 1 - \sum_{k=1}^{K} p_k^2$$

Gini importance (default in sklearn) is biased toward high-cardinality features. Permutation importance is more reliable and model-agnostic. SHAP values provide theoretically grounded, per-prediction importance.
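A hedged sketch comparing the two methods in scikit-learn on a synthetic dataset (feature counts and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Gini (impurity-based) importance: free byproduct of training,
# but biased toward high-cardinality features
gini_imp = rf.feature_importances_

# Permutation importance: score drop when each feature is shuffled,
# measured on held-out data
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_imp = perm.importances_mean

print("Gini:       ", gini_imp.round(3))
print("Permutation:", perm_imp.round(3))
```

Computing permutation importance on a held-out set (as above) avoids crediting features the forest merely memorized.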

Hyperparameters

n_estimators:
Number of trees. Typical range: 100--500. More trees = better but slower. Performance plateaus eventually.
max_depth:
Maximum tree depth. Default: None (fully grown). Limit to prevent overfitting on noisy data.
max_features:
Features per split: $\sqrt{p}$, $\log_2(p)$, or tune via CV. Controls tree correlation $\rho$.
min_samples_split:
Minimum samples to split a node. Default: 2. Increase for regularization.
min_samples_leaf:
Minimum samples in a leaf node. Default: 1. Increase to smooth predictions.
bootstrap:
Default: True. Set to False to use the entire dataset per tree (loses OOB capability).
n_jobs:
Set to -1 for full parallelism. Trees are independent and train in parallel.

Random Forest is remarkably robust to hyperparameters. Start with defaults, then tune n_estimators, max_features, and max_depth using cross-validation.
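A minimal cross-validation tuning sketch over the three parameters named above (the grid values are illustrative starting points, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,       # 3-fold cross-validation per parameter combination
    n_jobs=-1,  # trees and folds are independent, so parallelize freely
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

For larger grids, `RandomizedSearchCV` samples the space instead of exhausting it and usually finds comparable parameters faster.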

Pros vs Cons

Pros:

  • No feature scaling required -- tree-based splits are scale-invariant
  • Handles mixed feature types naturally (missing-value support depends on the implementation)
  • Built-in feature importance via Gini or permutation methods
  • Fully parallelizable -- each tree trains independently
  • Robust to outliers in features -- splits depend on value ordering, not magnitude, so extreme values have limited influence
  • Rarely overfits with more trees -- adding trees reduces variance without increasing bias

Cons:

  • Less interpretable than a single decision tree -- hundreds of trees are hard to visualize
  • Memory intensive -- stores all trees in memory at prediction time
  • Slower prediction than a single tree -- must query all $B$ trees
  • Default Gini importance is biased for high-cardinality and continuous features
  • Cannot extrapolate beyond training range (regression) -- predictions bounded by seen values

Interview Quick-Fire

Q: What is a Random Forest?

A: An ensemble of decision trees, each trained on a bootstrap sample with random feature subsets at each split. Final prediction is the majority vote (classification) or average (regression) of all trees. Combines bagging with feature randomness to reduce variance.

Q: Bagging vs. Boosting -- what is the difference?

A: Bagging (used in RF) trains trees independently on bootstrap samples and aggregates via voting/averaging -- it reduces variance. Boosting (e.g., XGBoost) trains trees sequentially, where each new tree corrects errors of previous ones -- it reduces bias. Bagging is parallel; boosting is sequential.

Q: What is OOB error and why is it useful?

A: Out-of-Bag error uses the ~36.8% of samples left out of each bootstrap to evaluate that tree. Each sample is predicted only by trees that did not train on it. OOB error approximates leave-one-out CV without needing a separate validation set, saving data and computation.

Q: How does Random Forest measure feature importance?

A: Two main methods: (1) Gini importance -- total decrease in impurity across all splits using that feature, averaged over trees. (2) Permutation importance -- measures accuracy drop when a feature's values are randomly shuffled. Permutation importance is preferred as Gini importance is biased toward high-cardinality features.

Q: Random Forest vs. single Decision Tree -- when to pick RF?

A: Always prefer RF when accuracy matters -- it reduces the high variance of individual trees by averaging many decorrelated trees. A single tree is preferred only when full interpretability is essential (e.g., clinical decision rules). RF sacrifices interpretability for significantly better generalization.

Q: When should you use Random Forest?

A: RF excels with tabular data, mixed feature types, when you need a strong baseline with minimal tuning, and when feature importance is desired. It works well for both classification and regression. Avoid RF when you need real-time low-latency predictions, extrapolation, or when data is very high-dimensional and sparse (e.g., text).

Q: How does Random Forest handle overfitting?

A: RF resists overfitting through two mechanisms: (1) Bootstrap sampling gives each tree a different training set, and (2) random feature subsets at each split decorrelate the trees. More trees never increase overfitting -- they only reduce variance. The ensemble variance formula shows: as $B \to \infty$, variance approaches $\rho\sigma^2$, where $\rho$ is controlled by max_features.

Q: Can Random Forest be used for regression?

A: Yes. RandomForestRegressor averages predictions from all trees: $\hat{y} = \frac{1}{B}\sum f_b(\mathbf{x})$. Key difference from classification: uses $m = \lfloor p/3 \rfloor$ features per split (vs. $\sqrt{p}$) and mean squared error for splitting. Limitation: RF regression cannot extrapolate beyond the range of training target values.
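The extrapolation limitation can be demonstrated directly: train on a linear target over $[0, 10]$ and predict far outside that range. A sketch assuming scikit-learn is installed (data is synthetic and illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(300, 1))
y_train = 2.0 * X_train.ravel() + rng.normal(0, 0.5, 300)  # targets roughly in [0, 20]

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Every tree predicts a mean of training targets in some leaf, so the
# forest's prediction can never exceed the training target range
pred_far = rf.predict([[100.0]])[0]  # true value would be ~200
print(f"prediction at x=100: {pred_far:.2f} (max training target: {y_train.max():.2f})")
```

For targets with strong trends, a common workaround is to model the trend separately (e.g. a linear term) and fit the forest on the residuals.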
