Random Forest
Cheat Sheet
Your quick reference for Random Forest -- from bootstrap sampling and feature randomness to OOB error and hyperparameter tuning.
Each tree sees a different random subset of the training data. This diversity among trees is what reduces variance and prevents overfitting compared to a single decision tree.
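The bootstrap mechanics behind this can be sketched with plain NumPy (a minimal illustration of sampling with replacement, not sklearn's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
data = np.arange(n)

# A bootstrap sample draws n points WITH replacement, so some points
# repeat and roughly 1/e of them are never drawn at all.
sample = rng.choice(data, size=n, replace=True)

unique_frac = np.unique(sample).size / n
print(f"unique fraction: {unique_frac:.3f}")  # ~0.632; the rest are out-of-bag
```

Each tree trains on one such sample; the ~36.8% it never saw become its out-of-bag set.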
Feature randomness decorrelates trees -- lowering $\rho$ in the variance formula. Even if one feature is highly predictive, not every tree will use it at the root, creating diverse tree structures.
Enable oob_score=True to get a free validation estimate from the out-of-bag samples. Gini importance (the default in sklearn) is biased toward high-cardinality features. Permutation importance is more reliable and model-agnostic. SHAP values provide theoretically grounded, per-prediction importance.
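A minimal sketch comparing the two importance methods on a synthetic dataset (dataset and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Gini (impurity-based) importance: fast, but biased toward
# high-cardinality features; sums to 1 across features.
gini_imp = rf.feature_importances_

# Permutation importance: accuracy drop when each feature is shuffled,
# measured on held-out data, so it reflects generalization.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm_imp = perm.importances_mean
```

Permutation importance is computed on the test split here; computing it on training data can still overstate the importance of features the forest memorized.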
n_estimators: 100 by default. More trees only reduce variance; increase until OOB error plateaus (at the cost of compute).
max_depth: None (fully grown). Limit to prevent overfitting on noisy data.
max_features: 'sqrt' by default for classification. The main lever for decorrelating trees.
min_samples_split: 2 by default. Increase to regularize.
min_samples_leaf: 1 by default. Increase to smooth predictions on noisy data.
bootstrap: True. Set to False to use the entire dataset per tree (loses OOB capability).
n_jobs: -1 for full parallelism. Trees are independent and train in parallel.
Random Forest is remarkably robust to hyperparameters. Start with defaults, then tune n_estimators, max_features, and max_depth using cross-validation.
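The recommended tuning workflow can be sketched with a grid search (the grid values below are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Tune only the three parameters that matter most; defaults elsewhere.
param_grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", 0.5],
    "max_depth": [None, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

For larger grids, RandomizedSearchCV covers the space more cheaply than an exhaustive grid.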
Q: What is a Random Forest?
A: An ensemble of decision trees, each trained on a bootstrap sample with random feature subsets at each split. Final prediction is the majority vote (classification) or average (regression) of all trees. Combines bagging with feature randomness to reduce variance.
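A minimal end-to-end example (the Iris dataset is used only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 100 trees (the default); each sees a bootstrap sample of X_tr and
# considers a random subset of features at every split.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # majority vote across trees
```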
Q: How does bagging differ from boosting?
A: Bagging (used in RF) trains trees independently on bootstrap samples and aggregates via voting/averaging -- it reduces variance. Boosting (e.g., XGBoost) trains trees sequentially, where each new tree corrects errors of previous ones -- it reduces bias. Bagging is parallel; boosting is sequential.
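A side-by-side sketch of the two paradigms in sklearn (synthetic data, illustrative settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: independent deep trees, aggregated by vote (variance reduction).
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: shallow trees fit sequentially, each correcting the errors of
# the ensemble so far (bias reduction).
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                random_state=0).fit(X_tr, y_tr)
```

Note the defaults reflect the philosophy: RF grows deep trees and averages, boosting grows shallow trees and stacks corrections.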
Q: What is Out-of-Bag (OOB) error and why is it useful?
A: Out-of-Bag error uses the ~36.8% of samples left out of each bootstrap to evaluate that tree. Each sample is predicted only by trees that did not train on it. OOB error approximates leave-one-out CV without needing a separate validation set, saving data and computation.
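In sklearn this is a single flag (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# oob_score=True scores each training sample using only the trees
# that did NOT see it in their bootstrap sample.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

Note oob_score_ requires bootstrap=True (the default) and enough trees that every sample is out-of-bag for at least one tree.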
Q: How does Random Forest measure feature importance?
A: Two main methods: (1) Gini importance -- total decrease in impurity across all splits using that feature, averaged over trees. (2) Permutation importance -- measures accuracy drop when a feature's values are randomly shuffled. Permutation importance is preferred as Gini importance is biased toward high-cardinality features.
Q: When would you prefer a Random Forest over a single decision tree?
A: Prefer RF whenever predictive accuracy is the priority -- it reduces the high variance of individual trees by averaging many decorrelated trees. A single tree is preferred only when full interpretability is essential (e.g., clinical decision rules). RF sacrifices some interpretability for significantly better generalization.
Q: When is Random Forest a good choice, and when should you avoid it?
A: RF excels with tabular data, mixed feature types, when you need a strong baseline with minimal tuning, and when feature importance is desired. It works well for both classification and regression. Avoid RF when you need real-time low-latency predictions, extrapolation, or when data is very high-dimensional and sparse (e.g., text).
Q: Why doesn't Random Forest overfit as you add more trees?
A: RF resists overfitting through two mechanisms: (1) Bootstrap sampling gives each tree a different training set, and (2) random feature subsets at each split decorrelate the trees. More trees never increase overfitting -- they only reduce variance. The ensemble variance formula shows: as $B \to \infty$, variance approaches $\rho\sigma^2$, where $\rho$ is controlled by max_features.
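The variance formula for an average of $B$ correlated trees is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$; a tiny numeric check (illustrative values of $\rho$ and $\sigma^2$) shows how adding trees only shrinks the second term:

```python
# Variance of the average of B trees, each with variance sigma^2 and
# pairwise correlation rho: rho*sigma^2 + (1 - rho)*sigma^2 / B
def ensemble_variance(rho: float, sigma2: float, B: int) -> float:
    return rho * sigma2 + (1.0 - rho) * sigma2 / B

sigma2 = 1.0   # variance of a single tree (illustrative)
rho = 0.3      # pairwise correlation, lowered by max_features

single = ensemble_variance(rho, sigma2, 1)    # equals sigma^2 itself
many = ensemble_variance(rho, sigma2, 500)    # approaches the floor rho*sigma^2
print(single, many)
```

Lowering max_features lowers $\rho$ and therefore the floor itself, which is why it is the key tuning knob.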
Q: Can Random Forest be used for regression?
A: Yes. RandomForestRegressor averages predictions from all trees: $\hat{y} = \frac{1}{B}\sum f_b(\mathbf{x})$. Key difference from classification: uses $m = \lfloor p/3 \rfloor$ features per split (vs. $\sqrt{p}$) and mean squared error for splitting. Limitation: RF regression cannot extrapolate beyond the range of training target values.
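The extrapolation limit is easy to demonstrate on a synthetic linear target (illustrative data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.1, size=300)   # linear target, y in ~[0, 20]

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Inside the training range the fit is good (true value at x=5 is 10)...
inside = rf.predict([[5.0]])[0]

# ...but far outside it, predictions flatten near the maximum training
# target (~20) instead of following the trend to y=100: no extrapolation.
outside = rf.predict([[50.0]])[0]
print(inside, outside)
```

This is because every tree's leaf prediction is an average of training targets, so the forest can never output a value outside the training target range.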