Random Forest
15 interview questions from basic to advanced
Random Forest is an ensemble learning method that builds multiple decision trees using bootstrap samples and random feature subsets, then aggregates their predictions through majority voting (classification) or averaging (regression). Each tree sees a different random subset of the data and features, making the trees decorrelated.
This diversity among trees reduces variance while maintaining low bias, typically outperforming a single decision tree. The combination of bagging (random samples) and random feature selection at each split is what gives Random Forest its power and robustness across a wide range of problems.
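A minimal sketch of this, assuming scikit-learn is available (the dataset here is synthetic, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 trees is grown on a bootstrap sample, and at every split
# only a random subset of features is considered (max_features="sqrt" by
# default for classification), which is what decorrelates the trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```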
Bagging (Bootstrap Aggregating) creates multiple training sets by sampling with replacement from the original data. Each bootstrap sample is the same size as the original but contains approximately 63.2% unique samples. The remaining 36.8% are left out and can be used for out-of-bag error estimation.
Random Forest extends bagging by also randomly selecting a subset of features at each node split. This double randomness -- random samples AND random features -- creates highly diverse trees that, when averaged, produce robust predictions with low variance. Plain bagging only randomizes samples, but RF's additional feature randomness is what makes the trees truly decorrelated.
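The 63.2% figure can be checked directly with a quick simulation (a sketch using NumPy only): sampling n indices with replacement, as bagging does, leaves roughly 1 - 1/e ≈ 63.2% of them unique.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Draw n indices with replacement, exactly as one bootstrap sample does.
boot = rng.integers(0, n, size=n)
# Fraction of distinct original samples that made it into the bootstrap.
unique_frac = np.unique(boot).size / n   # approx. 1 - 1/e ~ 0.632
```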
For classification, each tree votes for a class and the final prediction is the majority vote. For regression, each tree predicts a numerical value and the final prediction is the average. The sklearn API uses RandomForestClassifier and RandomForestRegressor respectively.
Both share the same bagging + feature randomness mechanism but differ in how they aggregate predictions and measure node impurity. Classification typically uses Gini impurity to evaluate splits, while regression uses MSE (mean squared error). The core algorithm remains the same -- only the aggregation and splitting criteria change between the two tasks.
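The averaging behavior of the regressor can be verified by hand, since sklearn exposes the fitted trees via `estimators_` (a sketch on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The ensemble prediction is the plain mean of the individual tree predictions.
tree_preds = np.stack([tree.predict(X[:5]) for tree in reg.estimators_])
manual_mean = tree_preds.mean(axis=0)
ensemble_pred = reg.predict(X[:5])
```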
Random Forest is much more robust than a single tree. It achieves lower variance by averaging many trees, is less prone to overfitting, and produces more stable predictions that don't change dramatically with small changes to the data. It also provides built-in feature importance metrics and OOB error estimation, eliminating the need for a separate validation set.
RF handles high-dimensional data well and works effectively with mixed feature types. The tradeoff is reduced interpretability compared to a single tree -- while a single decision tree can be visualized and explained easily, a forest of hundreds of trees is essentially a black-box model. Training time and memory usage are also higher due to building multiple trees.
No, Random Forest does not require feature scaling (standardization or normalization). This is because decision trees make splits based on feature thresholds -- they only care about the ordering of values, not their magnitude. Whether a feature ranges from 0-1 or 0-1000000, the tree finds the same optimal split point.
This makes RF very convenient as a "plug-and-play" algorithm that works well with mixed feature types and scales. Unlike distance-based algorithms (KNN, SVM) or gradient-based models (logistic regression, neural networks), tree-based methods are invariant to monotonic transformations of individual features, so scaling has no effect on the model's performance.
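The scale invariance is easy to demonstrate: applying a monotonic transform to the features leaves the learned splits, and hence the predictions, unchanged (a sketch; the transform here is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_scaled = X * 1e6 + 42  # monotonic transform: ordering of values is preserved

rf1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf2 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)

# Same split partitions -> same predictions, despite wildly different scales.
agreement = (rf1.predict(X) == rf2.predict(X_scaled)).mean()
```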
Each bootstrap sample leaves out approximately 36.8% of the training data (OOB samples). For each training instance, we collect predictions only from trees that did NOT include it in their bootstrap sample. The OOB error is the average prediction error across all training instances using only their OOB predictions.
Advantages: (1) It comes essentially for free, as a by-product of training, (2) It provides a nearly unbiased estimate similar to leave-one-out cross-validation, (3) No need to set aside a separate validation set, which is valuable when data is limited, (4) It is useful for monitoring convergence as trees are added. The OOB estimate is slightly pessimistic because each observation is scored by fewer trees than the full ensemble, but in practice it tracks cross-validation error closely.
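In sklearn this is a single flag, `oob_score=True` (a minimal sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# With oob_score=True, each sample is scored using only the trees whose
# bootstrap sample did not contain it -- no separate validation set needed.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_acc = rf.oob_score_
```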
Two main methods exist: (1) Gini Importance (MDI -- Mean Decrease in Impurity): For each feature, sum the total decrease in Gini impurity across all splits using that feature across all trees, then normalize. This is fast to compute but biased toward high-cardinality features that offer more potential split points.
(2) Permutation Importance (MDA -- Mean Decrease in Accuracy): For each feature, randomly shuffle its values and measure how much the OOB error increases. The larger the increase, the more important the feature. This method is less biased but slower to compute. Permutation importance is generally preferred for reliable feature selection. For correlated features, consider conditional permutation importance or SHAP values, as standard permutation importance may undervalue correlated features.
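Both methods are available in sklearn: Gini importance via `feature_importances_` and permutation importance via `sklearn.inspection.permutation_importance` (a sketch; with `shuffle=False` the three informative features land in the first three columns):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# 3 informative features out of 10; the rest are noise.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

gini_imp = rf.feature_importances_           # MDI, computed during training
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm_imp = perm.importances_mean             # MDA, measured on held-out data
```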
max_features controls how many features are randomly selected at each split. Lower values introduce more randomness, creating more decorrelated trees that reduce ensemble variance but potentially increase individual tree bias. Higher values introduce less randomness, producing more correlated trees with higher ensemble variance but lower individual tree bias.
The sweet spot balances these effects: sqrt(p) works well for classification, p/3 for regression, where p is the total number of features. Tuning this parameter often has the biggest impact on RF performance. Setting max_features too low can cause underfitting (trees are too random to learn useful patterns), while setting it too high makes the forest behave like plain bagging with correlated trees.
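Because OOB error is free, it is a convenient way to compare max_features settings without a separate validation split (a sketch; `None` means all features, i.e. plain-bagging behavior):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           random_state=0)

scores = {}
for mf in (1, "sqrt", 0.5, None):  # from maximal randomness to plain bagging
    rf = RandomForestClassifier(n_estimators=100, max_features=mf,
                                oob_score=True, random_state=0).fit(X, y)
    scores[mf] = rf.oob_score_     # OOB accuracy for this max_features setting
```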
Bagging (used in RF) builds trees independently in parallel on bootstrap samples and aggregates via averaging or voting -- it primarily reduces variance. Each tree is trained independently and has no knowledge of the other trees. Boosting (used in Gradient Boosting, AdaBoost) builds trees sequentially, where each new tree corrects errors of the previous ensemble -- it primarily reduces bias.
Bagging is simpler, more parallelizable, and less prone to overfitting. Boosting often achieves higher accuracy but is more sensitive to hyperparameters and noise, and can overfit if not tuned properly. In practice, gradient boosting methods (XGBoost, LightGBM) often outperform Random Forest on structured data, but RF remains a strong baseline due to its robustness and minimal tuning requirements.
For missing values: (1) sklearn's RF doesn't handle them natively -- you must impute first, (2) Some implementations (R's randomForest) use proximity-based imputation, where proximity between observations is computed from tree terminal nodes, (3) Missing value indicators can be created as additional binary features. The bootstrap sampling in RF naturally provides robustness to imputation noise.
For categorical features: (1) sklearn requires encoding (one-hot or ordinal), (2) Some implementations handle categories natively with multi-way splits. RF handles high-cardinality categoricals better than many algorithms because each tree only sees a random subset of features, reducing overfitting to rare categories. One-hot encoding can be problematic for very high-cardinality features as it creates many sparse columns, so ordinal or target encoding may be preferred.
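A typical sklearn preprocessing pipeline for both issues might look like this (a sketch on hypothetical toy data: a numeric column with NaNs plus a string category; ordinal encoding is a reasonable choice for trees since they only use value ordering):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: column 0 is numeric with a missing value,
# column 1 is a string category.
X = np.array([[1.0, "a"], [np.nan, "b"], [3.0, "a"], [4.0, "c"]], dtype=object)
y = np.array([0, 1, 0, 1])

pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), [0]),  # impute missing numerics
    ("cat", OrdinalEncoder(), [1]),                  # integer-code categories
])
model = Pipeline([
    ("pre", pre),
    ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
]).fit(X, y)
preds = model.predict(X)
```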
For B trees with individual variance σ² and pairwise correlation ρ: Var(f̄) = Var(1/B · Σf_b) = (1/B²)[B·σ² + B(B-1)·ρσ²] = ρσ² + (1-ρ)σ²/B. As B→∞, variance approaches ρσ². If trees are perfectly correlated (ρ=1), bagging provides zero variance reduction. If uncorrelated (ρ=0), variance reduces as σ²/B.
Random Forest's feature randomness reduces ρ by making trees learn different patterns, which is its key innovation over plain bagging. This formula explains why RF outperforms bagged trees -- the feature randomness decorrelates the trees, pushing ρ closer to 0. The practical implication is that adding more trees always helps (reduces the (1-ρ)σ²/B term), but the irreducible floor ρσ² can only be lowered by further decorrelating the trees through more aggressive feature subsampling.
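The formula can be checked numerically by simulating equicorrelated "tree predictions" and comparing the empirical variance of their mean against ρσ² + (1-ρ)σ²/B (a NumPy sketch with arbitrary values of σ², ρ, and B):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, rho, B = 4.0, 0.3, 50

# Equicorrelated Gaussian "tree predictions": variance sigma2, correlation rho.
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))
preds = rng.multivariate_normal(np.zeros(B), cov, size=200_000)

emp_var = preds.mean(axis=1).var()                # empirical Var(mean of B trees)
theory = rho * sigma2 + (1 - rho) * sigma2 / B    # the formula above
```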
They disagree when: (1) Features have different cardinalities -- Gini importance favors high-cardinality features (more potential splits) while permutation importance is unbiased, (2) Features are correlated -- Gini importance splits credit among correlated features while permutation importance may undervalue them (shuffling one correlated feature is partially compensated by the other), (3) Features interact -- permutation importance captures interaction effects better since shuffling disrupts interactions.
Best practice: use permutation importance for feature selection, Gini importance for quick screening during exploratory analysis. For correlated features, consider conditional permutation importance (which permutes a feature conditionally on correlated features) or SHAP values (which provide theoretically grounded attribution based on Shapley values from game theory). SHAP values handle both correlation and interaction effects correctly but are computationally more expensive.
Diagnosis: low training AND test accuracy, OOB error similar to test error but both high. This indicates the model lacks the capacity to capture the underlying patterns. Underfitting in RF is relatively rare with default parameters, so it usually indicates insufficient features or inappropriate problem framing.
Fixes: (1) Increase max_depth or set to None (remove depth limit), (2) Decrease min_samples_split and min_samples_leaf to allow finer splits, (3) Increase max_features to give each tree more features to consider, (4) Engineer better features or add polynomial/interaction features, (5) Check if n_estimators is sufficient (at least 100-500), (6) Consider that RF may not suit the problem -- highly complex patterns may need gradient boosting or neural networks. If none of these work, the issue likely lies in the feature representation rather than the model itself.
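The effect of capacity constraints is easy to see by comparing a deliberately shallow forest against one with no depth limit (a sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=0)

# Depth-1 stumps underfit: too little capacity per tree to capture the signal.
shallow = RandomForestClassifier(n_estimators=100, max_depth=1,
                                 random_state=0).fit(X, y)
# max_depth=None (the default) removes the depth limit entirely.
full = RandomForestClassifier(n_estimators=100, max_depth=None,
                              random_state=0).fit(X, y)

shallow_train_acc = shallow.score(X, y)
full_train_acc = full.score(X, y)
```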
For each observation x_i, define OOB(i) as the set of trees whose bootstrap sample did NOT contain x_i. The OOB prediction for x_i averages only trees in OOB(i). Since P(x_i not in bootstrap b) = (1-1/n)^n → e^{-1} ≈ 0.368, each observation is OOB in approximately 36.8% of trees.
As n→∞ and B→∞, the OOB estimator converges to the leave-one-out cross-validation error. The key insight: OOB error uses a different subset of trees for each observation, just as LOO-CV trains a different model for each fold. Empirically, OOB error is slightly pessimistic (uses fewer trees per prediction than the full ensemble) but very close to CV error. This theoretical connection was established by Breiman and provides strong justification for using OOB error as a reliable model evaluation metric without the computational cost of full cross-validation.
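The limit (1-1/n)^n → e^{-1} is quick to verify numerically:

```python
import math

n = 10_000
# Probability that a given sample is absent from one tree's bootstrap sample.
p_oob = (1 - 1 / n) ** n   # converges to 1/e ~ 0.368 as n grows
```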
Training complexity: O(B · n · p · log(n)) for B trees, n samples, p features. Prediction complexity: O(B · depth) per sample. Because the trees are independent, training parallelizes across cores and scales roughly linearly with core count, making RF more tractable than sequential methods like boosting.
For 100M rows: (1) Subsample -- use the max_samples parameter to train each tree on a fraction (e.g., 10%) while maintaining OOB estimation, (2) Distribute -- use Spark MLlib's RandomForest which distributes tree building across a cluster, (3) Feature reduction -- PCA or feature selection to reduce p, (4) Incremental learning -- train on chunks and combine forests (forests can be trivially merged), (5) Use a more efficient implementation like LightGBM's random forest mode which uses histogram-based splitting (O(n) instead of O(n·log(n)) per split). For prediction: batch predict with n_jobs=-1 for full parallelism across all CPU cores.
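The subsampling option (1) is exposed in sklearn as the `max_samples` parameter (a small-scale sketch; the dataset here is far smaller than 100M rows, purely to show the API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Each tree is trained on only 10% of the rows, cutting per-tree cost roughly
# 10x; n_jobs=-1 builds the (independent) trees in parallel across all cores.
rf = RandomForestClassifier(n_estimators=100, max_samples=0.1,
                            n_jobs=-1, random_state=0).fit(X, y)
acc = rf.score(X, y)
```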