Random Forest
From bootstrap aggregating to feature randomness. A comprehensive, visually interactive deep dive into the powerful ensemble method that combines decorrelated decision trees for robust predictions.
How the idea of combining many weak learners evolved into one of machine learning's most reliable algorithms.
The story of Random Forest begins with Leo Breiman, a statistician at UC Berkeley whose work fundamentally reshaped how we think about prediction. In 2001, Breiman published his landmark paper "Random Forests" in the journal Machine Learning, introducing an algorithm that would become one of the most widely used and trusted methods in all of data science.
Breiman's insight did not appear in a vacuum. It built upon two foundational ideas: his own bagging method (Bootstrap Aggregating, published in 1996) and Tin Kam Ho's random subspace method (1998). Bagging showed that averaging the predictions of many models trained on bootstrap samples of the data could dramatically reduce variance without increasing bias. Ho's work demonstrated that selecting random subsets of features at each split could further improve generalization.
Breiman's genius was to combine these two sources of randomness -- random samples and random features -- into a single elegant algorithm. The result was a method that was remarkably accurate, resistant to overfitting, and required almost no hyperparameter tuning. He also provided the theoretical framework explaining why the method works, connecting it to the strength of individual trees and the correlation between them.
Random Forest belongs to the broader family of ensemble methods -- algorithms that combine multiple models to produce a single, superior prediction. The fundamental principle behind ensemble learning is that a group of weak learners, when properly combined, can form a strong learner. This idea has deep roots in both statistics and computer science.
The theoretical justification comes from the Condorcet Jury Theorem (1785), which states that if each member of a jury has a probability greater than 0.5 of making the correct decision, then the probability of the majority being correct increases toward 1 as the jury size grows.
Formally, if each of \( B \) voters is independently correct with probability \( p \), the probability that the majority vote is correct is:

\[ P(\text{majority correct}) = \sum_{k = \lfloor B/2 \rfloor + 1}^{B} \binom{B}{k} \, p^k (1-p)^{B-k} \]

Where \( B \) is the number of trees and \( p \) is the accuracy of each individual tree. This result holds when the errors of individual trees are independent -- and this is precisely what the random feature selection in Random Forest aims to achieve.
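For intuition, this jury probability can be computed directly. A small sketch using only the Python standard library (the voter accuracy \( p = 0.6 \) is an arbitrary illustrative value):

```python
from math import comb

def majority_vote_accuracy(B: int, p: float) -> float:
    """Probability that a majority of B independent voters, each
    correct with probability p, reaches the correct decision."""
    # Sum the binomial probabilities of more than half the voters being right.
    return sum(comb(B, k) * p**k * (1 - p)**(B - k)
               for k in range(B // 2 + 1, B + 1))

for B in (1, 11, 101, 1001):
    print(B, round(majority_vote_accuracy(B, 0.6), 4))
```

With \( p = 0.6 \), a single voter is right 60% of the time, yet an ensemble of 1001 independent voters is almost always right -- exactly the effect the theorem predicts.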
From theoretical foundations to one of the most deployed algorithms in production systems worldwide, Random Forest has had a remarkable trajectory:
Why combining many diverse, imperfect decision trees produces remarkably accurate and stable predictions.
Imagine you need to estimate the number of jelly beans in a large jar. If you ask one person, their guess might be wildly off. But if you ask 500 people and take the average of their guesses, the result is remarkably close to the true answer. This is the wisdom of crowds phenomenon, and it is the core intuition behind Random Forest.
Each decision tree in a Random Forest is like one person making a guess. Individual trees are high-variance estimators -- they are sensitive to the specific training data they see, and small changes in the data can produce very different trees. However, when we average the predictions of hundreds of such trees, the individual errors tend to cancel out, leaving a stable, accurate prediction.
The key requirement for the wisdom of crowds to work is diversity. If all 500 people consulted each other before guessing, their estimates would be correlated and the average would be no better than a single guess. Random Forest ensures diversity through two mechanisms: bootstrap sampling (each tree sees a different random subset of the data) and random feature selection (each split considers only a random subset of features).
A single decision tree grown to full depth is a weak learner in the sense that it overfits the training data and has high variance. Its test error is dominated by the variance term in the bias-variance decomposition:

\[ \mathbb{E}\left[(y - \hat{f}(\mathbf{x}))^2\right] = \text{Bias}^2\!\left[\hat{f}(\mathbf{x})\right] + \text{Var}\!\left[\hat{f}(\mathbf{x})\right] + \sigma_\epsilon^2 \]
When we average \( B \) independent estimators, each with variance \( \sigma^2 \), the variance of the average is \( \sigma^2 / B \). However, trees trained on the same dataset are not independent -- they are correlated. If the pairwise correlation between trees is \( \rho \), the variance of the ensemble becomes:

\[ \text{Var}\left[\frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})\right] = \rho \sigma^2 + \frac{(1-\rho)\sigma^2}{B} \]
The first term \( \rho \sigma^2 \) cannot be reduced by adding more trees -- it depends entirely on the correlation between trees. The second term \( \frac{(1-\rho)\sigma^2}{B} \) vanishes as \( B \to \infty \). This is why decorrelating the trees (reducing \( \rho \)) is so important.
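A quick numerical check makes the variance floor visible. The values \( \sigma^2 = 1 \) and \( \rho = 0.3 \) below are arbitrary illustrative choices:

```python
def ensemble_variance(sigma2: float, rho: float, B: int) -> float:
    """Variance of the average of B estimators, each with variance
    sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

# The second term shrinks with B; the first term never does.
for B in (1, 10, 100, 10_000):
    print(B, round(ensemble_variance(1.0, 0.3, B), 5))
```

No matter how many trees are added, the variance never drops below \( \rho \sigma^2 = 0.3 \) here -- only decorrelating the trees can lower that floor.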
Averaging many trees dramatically reduces the variance without increasing bias, producing smooth and stable decision boundaries
Unlike single trees, Random Forest rarely overfits even with hundreds of trees. Adding more trees never hurts -- it only helps or has no effect
Works well out of the box with default hyperparameters, making it an excellent first model to try on any tabular dataset
The statistical foundation that makes ensemble averaging so powerful -- sampling with replacement.
A bootstrap sample is created by randomly drawing \( n \) samples with replacement from a dataset of \( n \) observations. Because sampling is done with replacement, some observations will appear multiple times in the bootstrap sample, while others will not appear at all.
The probability that any single observation is not selected in a single draw is \( 1 - \frac{1}{n} \). Since we draw \( n \) times independently, the probability that the observation is never selected is:

\[ \left(1 - \frac{1}{n}\right)^{n} \]
As \( n \to \infty \), this converges to:

\[ \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = e^{-1} \approx 0.368 \]
This means approximately 36.8% of the original observations are left out of each bootstrap sample. These left-out observations are called Out-of-Bag (OOB) samples and play a crucial role in model evaluation.
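The 36.8% figure is easy to verify with a quick simulation using only the standard library:

```python
import random

random.seed(0)
n = 100_000
# Draw n indices with replacement -- one bootstrap sample.
sample = {random.randrange(n) for _ in range(n)}
oob_fraction = 1 - len(sample) / n
print(f"OOB fraction: {oob_fraction:.4f}")  # close to 1/e ~ 0.368
```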
Bagging (Bootstrap AGGregatING) was introduced by Leo Breiman in 1996. The procedure is: (1) generate \( B \) bootstrap samples from the training data, (2) train a separate model on each bootstrap sample, and (3) combine the predictions by averaging (regression) or majority voting (classification).
For regression, the bagged prediction is the average:

\[ \hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x}) \]
For classification, bagging uses majority voting:

\[ \hat{f}_{\text{bag}}(\mathbf{x}) = \arg\max_{c} \sum_{b=1}^{B} \mathbb{1}\left[\hat{f}_b(\mathbf{x}) = c\right] \]
Bagging works particularly well with high-variance, low-bias base learners like deep decision trees. A deep tree has low bias because it can model complex relationships, but high variance because it is very sensitive to the specific training data. Bagging reduces the variance while keeping the bias low.
Bagging a stable estimator like linear regression produces little improvement because the variance is already low. The average of many similar linear models is just another linear model with nearly the same coefficients. Bagging shines with unstable estimators where small data perturbations cause large changes in the fitted model.
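The bagging procedure can be sketched in a few lines of numpy. As a stand-in for a deep tree, this illustration uses a deliberately unstable base learner -- a 1-nearest-neighbor memorizer -- and compares one model's error against the bagged average (the data and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 5, n))
y = np.sin(x) + rng.normal(0, 0.3, n)      # noisy observations of sin(x)

def fit_predict(xs, ys, query):
    """Unstable base learner: predict the y of the nearest training point."""
    idx = np.abs(xs[None, :] - query[:, None]).argmin(axis=1)
    return ys[idx]

B = 100
preds = []
for _ in range(B):
    boot = rng.integers(0, n, n)            # bootstrap indices, with replacement
    preds.append(fit_predict(x[boot], y[boot], x))
bagged = np.mean(preds, axis=0)             # aggregate by averaging

true = np.sin(x)
single_mse = np.mean((preds[0] - true) ** 2)
bagged_mse = np.mean((bagged - true) ** 2)
print(f"single model MSE: {single_mse:.4f}, bagged MSE: {bagged_mse:.4f}")
```

The single memorizer reproduces the noise in its bootstrap sample; averaging 100 of them cancels much of that noise -- exactly the variance reduction bagging promises.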
See how bootstrap sampling works in practice. Circles represent data points. On the left is the original dataset; on the right is the bootstrap sample with duplicates highlighted and OOB samples shown below in gray.
Adjust N to change dataset size. Click Resample to draw a new bootstrap sample. Orange-glowing circles appear multiple times. Gray circles below are OOB (not selected).
The secret ingredient that transforms bagged trees into a true Random Forest -- random subsets of features at every split.
In standard bagging, each tree considers all features when deciding how to split a node. This means that if one feature is a strong predictor, every tree will use that feature as its first split, making all the trees look very similar. Similar trees produce correlated predictions, and correlated predictions limit the variance reduction achieved by averaging.
Random Forest solves this: at each node split, instead of evaluating all \( p \) features, the algorithm randomly selects a subset of \( m \) features and chooses the best split only among those \( m \) features. This forces the trees to use different features at their top splits, producing structurally diverse trees.
The recommended values for \( m \) depend on the task:

\[ m = \lfloor \sqrt{p} \rfloor \ \text{(classification)}, \qquad m = \lfloor p/3 \rfloor \ \text{(regression)} \]

Where \( p \) is the total number of features. For classification, we use the square root of the total features; for regression, we use one-third. These defaults were determined empirically by Breiman and have been validated across hundreds of datasets.
Consider a dataset with 20 features where feature \( x_7 \) is by far the strongest predictor. Without feature randomness (pure bagging), every tree would split on \( x_7 \) at the root. Despite bootstrap samples being different, the tree structures would be nearly identical.
With Random Forest using \( m = \lfloor\sqrt{20}\rfloor = 4 \) features per split, any given tree has only a \( \frac{4}{20} = 20\% \) chance of even seeing \( x_7 \) at the root split. Trees that do not see \( x_7 \) must find alternative splitting strategies, exploring other features and interactions. The result is a collection of structurally diverse trees.
The trade-off is: smaller \( m \) produces more decorrelated trees (lower \( \rho \)) but each individual tree is weaker (higher bias). Larger \( m \) produces stronger individual trees but with higher correlation. The sweet spot \( m = \sqrt{p} \) balances these effects optimally for most classification problems.
Extra-Trees push randomness even further: instead of finding the best split among the \( m \) random features, they choose a random split threshold for each feature. This further decorrelates the trees and can sometimes outperform standard Random Forest, especially on noisy datasets.
Breiman defined two key properties: the strength \( s \) of the individual trees (their average accuracy) and the mean correlation \( \bar{\rho} \) between pairs of trees. The generalization error is bounded by:

\[ \text{PE}^* \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2} \]
This bound reveals the fundamental trade-off: to minimize the generalization error, we want high strength \( s \) (accurate individual trees) and low correlation \( \bar{\rho} \) (diverse trees). Decreasing \( m \) decreases both \( \bar{\rho} \) and \( s \). The optimal \( m \) minimizes the ratio \( \bar{\rho} / s^2 \), and the default values \( \sqrt{p} \) and \( p/3 \) typically achieve a near-optimal balance.
A step-by-step walkthrough of how Random Forest builds its ensemble of decorrelated decision trees.
Given a training set \( \mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\} \) with \( p \) features, the algorithm proceeds as follows:
For tree \( b \) (where \( b = 1, \ldots, B \)), draw a bootstrap sample \( \mathcal{D}_b \) of size \( n \) from \( \mathcal{D} \) by sampling with replacement. Approximately 63.2% of unique observations will be included.
Grow an unpruned decision tree \( T_b \) on \( \mathcal{D}_b \). At each internal node, randomly select \( m \) features from the \( p \) available features. Find the best split among only these \( m \) features using Gini impurity or information gain.
Continue splitting each node until a stopping criterion is met: each leaf contains fewer than \( n_{\min} \) samples (default: 1 for classification, 5 for regression), or until the node is pure.
Repeat steps 1-3 to produce \( B \) trees: \( \{T_1, T_2, \ldots, T_B\} \). Typical values of \( B \) range from 100 to 1000.
For classification, each tree casts a vote and the forest predicts the majority class:

\[ \hat{y} = \arg\max_{c} \sum_{b=1}^{B} \mathbb{1}\left[T_b(\mathbf{x}) = c\right] \]
For regression, the forest prediction is the average:

\[ \hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x}) \]
One can also obtain class probabilities: \( P(y = c | \mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}[T_b(\mathbf{x}) = c] \). This soft voting often yields better results when used with probability-based metrics like log-loss.
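The difference between hard (majority) voting and soft (probability-averaged) voting shows up whenever one tree is very confident. The per-tree probabilities below are hypothetical:

```python
# Hypothetical probability estimates from three trees for one sample,
# over classes [0, 1].
tree_probs = [[0.90, 0.10], [0.45, 0.55], [0.45, 0.55]]

# Hard voting: each tree votes for its most probable class.
votes = [max(range(2), key=lambda c: p[c]) for p in tree_probs]
hard_pred = max(set(votes), key=votes.count)

# Soft voting: average the probabilities, then take the argmax.
avg = [sum(p[c] for p in tree_probs) / len(tree_probs) for c in range(2)]
soft_pred = max(range(2), key=lambda c: avg[c])

print(hard_pred, soft_pred)  # hard voting picks class 1, soft voting class 0
```

Two lukewarm votes outvote one confident tree under hard voting, while soft voting lets the confident probability estimate dominate.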
Watch how adding more trees smooths the decision boundary. Each individual tree creates a jagged boundary, but the ensemble average becomes increasingly smooth.
Adjust trees and max features per split. Thin transparent lines show individual tree boundaries; the thick cyan line shows the ensemble.
A free, built-in cross-validation estimate that comes naturally from the bootstrap sampling procedure.
Each bootstrap sample leaves out approximately 36.8% of observations. For any observation \( (\mathbf{x}_i, y_i) \), it will be absent from roughly one-third of all trees. We can use these trees -- the ones that did not train on observation \( i \) -- to make a prediction. This is the Out-of-Bag (OOB) prediction.
Let \( \mathcal{O}_i \) denote the set of trees for which observation \( i \) is out-of-bag. The OOB prediction is:

\[ \hat{y}_i^{\text{OOB}} = \arg\max_{c} \sum_{b \in \mathcal{O}_i} \mathbb{1}\left[T_b(\mathbf{x}_i) = c\right] \]
The OOB error rate is:

\[ \text{Err}_{\text{OOB}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\hat{y}_i^{\text{OOB}} \neq y_i\right] \]
The OOB error estimate is essentially equivalent to leave-one-out cross-validation but comes at no additional computational cost. Each observation is predicted by trees that never saw it during training, providing an unbiased estimate of the generalization error.
Breiman showed empirically that the OOB error estimate is as accurate as using a separate test set of the same size as the training set. This means you can use all your data for training while still getting a reliable performance estimate.
The mathematical justification relies on the fact that each OOB prediction uses approximately \( B/e \approx 0.368B \) trees, which is still a substantial ensemble. As \( B \) grows, the OOB error converges to the true generalization error.
You can use OOB error instead of cross-validation to tune hyperparameters like \( m \) and \( n_{\min} \). This is much faster than k-fold CV because you only train the forest once per configuration. Set oob_score=True in scikit-learn.
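A minimal sketch of this pattern with scikit-learn -- one fit per candidate value of max_features, scored by OOB accuracy (the dataset and candidate values are chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

scores = {}
for m in ('sqrt', 'log2', 0.5):
    rf = RandomForestClassifier(
        n_estimators=200, max_features=m,
        oob_score=True, random_state=42, n_jobs=-1,
    )
    rf.fit(X, y)                 # a single fit; no k-fold loop required
    scores[m] = rf.oob_score_

for m, s in scores.items():
    print(f"max_features={m}: OOB accuracy = {s:.4f}")
```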
As the number of trees \( B \) increases, the OOB error estimate stabilizes. With very few trees (say \( B = 10 \)), the OOB estimate is noisy because each observation is only predicted by 3-4 trees. As \( B \) grows, more trees contribute to each OOB prediction, and the estimate converges.
Unlike the training error which can be made arbitrarily small by growing complex trees, the OOB error provides an honest estimate of generalization performance. If the OOB error stops decreasing as you add more trees, you have reached the point of diminishing returns.
Two powerful methods for understanding which features drive predictions in a Random Forest.
Gini importance measures how much each feature contributes to reducing impurity across all splits in all trees. For each feature \( j \):

\[ \text{Imp}_{\text{Gini}}(j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_b \,:\, \text{split on } j} p(t) \, \Delta G(t) \]
Where the inner sum is over all nodes \( t \) in tree \( T_b \) that split on feature \( j \), \( p(t) \) is the proportion of samples reaching node \( t \), and \( \Delta G(t) \) is the decrease in Gini impurity:

\[ \Delta G(t) = G(t) - \frac{n_L}{n_t} G(t_L) - \frac{n_R}{n_t} G(t_R) \]

Here \( t_L \) and \( t_R \) are the child nodes produced by the split and \( n_L, n_R, n_t \) are their sample counts.
Where \( G(t) = 1 - \sum_{c=1}^{C} p_c^2 \) is the Gini impurity. Gini importance is fast to compute but has a known bias: it tends to overestimate the importance of continuous features and high-cardinality categorical features.
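A worked example of one split's impurity decrease makes the formula concrete. The class counts below are hypothetical:

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = [40, 40]                 # class counts reaching node t
left, right = [30, 5], [10, 35]   # counts after a candidate split
n_l, n_r, n_t = sum(left), sum(right), sum(parent)

# Impurity decrease: parent impurity minus weighted child impurities.
delta = gini(parent) - (n_l / n_t) * gini(left) - (n_r / n_t) * gini(right)
print(f"Gini decrease: {delta:.4f}")  # 0.1984
```

The perfectly mixed parent has impurity 0.5; the split produces two much purer children, so this candidate would score well against alternatives.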
Permutation importance is more reliable: if a feature is important, randomly shuffling its values should significantly increase the prediction error. For each feature \( j \):
Calculate the OOB error of the trained forest: \( \text{Error}_{\text{base}} \).
Randomly shuffle the values of feature \( j \) in the OOB data, breaking the relationship between feature \( j \) and the target.
Compute the OOB error with the permuted feature: \( \text{Error}_{\text{perm},j} \).
The importance of feature \( j \) is: \( \text{Imp}_{\text{perm}}(j) = \text{Error}_{\text{perm},j} - \text{Error}_{\text{base}} \).
Permutation importance is model-agnostic and avoids the bias toward high-cardinality features, making it the preferred method when accurate feature ranking is critical.
Compare Gini importance and Permutation importance for a dataset with 7 features. Notice how the rankings can differ between the two methods.
Toggle between Gini (cyan bars) and Permutation (orange bars) importance. Watch how feature rankings shift between methods.
Understanding exactly how Random Forest reduces prediction error through the lens of bias-variance decomposition.
Consider \( B \) trees, each with prediction variance \( \sigma^2 \) and pairwise correlation \( \rho \). The variance of the ensemble average is:

\[ \text{Var}\left[\frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})\right] = \rho \sigma^2 + \frac{(1-\rho)\sigma^2}{B} \]
This formula has two terms: the correlation floor \( \rho \sigma^2 \), which no amount of additional trees can reduce, and the averaging term \( \frac{(1-\rho)\sigma^2}{B} \), which vanishes as \( B \to \infty \).
A critical property of averaging is that it does not change the bias. If each individual tree has bias \( \text{Bias}[T_b(\mathbf{x})] = \mu(\mathbf{x}) - f(\mathbf{x}) \), then the ensemble has the same bias:

\[ \text{Bias}\left[\frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})\right] = \mu(\mathbf{x}) - f(\mathbf{x}) \]
This is why Random Forest uses deep, unpruned trees -- they have low bias, and the high variance is handled by ensemble averaging. This contrasts with boosting methods like Gradient Boosting, which build shallow trees (high bias, low variance) and reduce bias iteratively.
The chart below shows how the test error of a Random Forest decreases as the number of trees increases, compared to a single decision tree. The single tree error remains flat while the forest error drops rapidly and then plateaus.
The key parameters that control Random Forest behavior and how to tune them effectively.
Controls how many trees are grown. More trees reduce variance but increase training time linearly.
Controls the maximum depth each tree can grow to. By default, trees grow until each leaf is pure or contains fewer than min_samples_split samples.
The most important hyperparameter unique to Random Forest. It controls \( m \), the number of features considered at each split: 'sqrt' (\( \sqrt{p} \), the classification default), 'log2', a float interpreted as a fraction of the features, or None to use all features (which reduces the forest to plain bagging).
Minimum samples required to split an internal node (default: 2). Increasing this acts as a regularizer. Values between 2 and 20 are common.
Minimum samples required in a leaf node (default: 1). Increasing this smooths the model. Values between 1 and 10 are typical.
Whether to use bootstrap samples (default: True). Setting False means each tree trains on the full dataset. Generally keep True.
Whether to use OOB samples to estimate generalization error (default: False). Set True for free validation. Only available when bootstrap=True.
Real-world domains where Random Forest delivers reliable, production-ready results.
From medical diagnosis to spam detection, Random Forest is a top performer for binary and multi-class classification on structured tabular data.
Predicting house prices, stock returns, or energy consumption. RF regression handles nonlinear relationships and interactions automatically.
Permutation importance provides a model-agnostic ranking of features, helping identify which variables matter most for prediction.
Isolation Forest, a variant of Random Forest, detects anomalies by measuring how easily a data point can be isolated through random splits.
Train a Random Forest classifier on two-moon data and visualize the decision regions.
The chart below compares classification accuracy of different ensemble and tree-based methods. Random Forest consistently ranks near the top, offering a strong balance between accuracy and simplicity.
Decision tree splits depend only on the ordering of feature values, making them invariant to monotone transformations. Unlike SVM or logistic regression, RF does not require standardization or scaling.
Works seamlessly with both numerical and categorical features, and naturally handles missing values in many implementations.
Gini and permutation importance provide insights into which features drive predictions, aiding interpretability.
Each tree is trained independently, making Random Forest trivially parallelizable across CPU cores.
OOB error provides a reliable generalization estimate without a held-out validation set.
A forest of 500 deep trees can consume significant memory. Each tree stores all split rules and leaf values.
Prediction requires traversing all B trees. With B=1000 and deep trees, this can be too slow for latency-critical applications.
Predictions are bounded by the range of training targets. RF cannot predict values outside this range.
On well-tuned benchmarks, gradient boosting methods (XGBoost, LightGBM) typically achieve 1-3% higher accuracy.
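The bounded-prediction limitation for regression is easy to demonstrate: a forest trained on targets up to about 20 cannot predict the true value of 40 at an out-of-range input. A minimal sketch (the linear target is chosen only for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 2 * X.ravel()                          # targets span roughly [0, 20]

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Query far outside the training range: the true value is 2 * 20 = 40,
# but every leaf averages training targets, so the output caps near 20.
pred = rf.predict(np.array([[20.0]]))[0]
print(f"prediction at x=20: {pred:.2f} (training max target: {y.max():.2f})")
```

A linear model would extrapolate the trend; the forest cannot, which is why RF is a poor fit for trending time series or any task requiring extrapolation.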
From basic classification to complete pipelines with RandomizedSearchCV.
The simplest way to get started with Random Forest using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Train Random Forest -- no scaling needed!
model = RandomForestClassifier(
    n_estimators=200, max_features='sqrt',
    random_state=42, oob_score=True, n_jobs=-1
)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f"OOB Score: {model.oob_score_:.4f}")
print(f"Feature importances: {model.feature_importances_}")
Using Random Forest for nonlinear regression. No feature scaling is required.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Generate nonlinear data
np.random.seed(42)
X = np.sort(5 * np.random.rand(300, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Train RF Regressor
rfr = RandomForestRegressor(
    n_estimators=300, min_samples_leaf=3,
    random_state=42, oob_score=True, n_jobs=-1
)
rfr.fit(X_train, y_train)
# Evaluate
y_pred = rfr.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R^2: {r2_score(y_test, y_pred):.4f}")
print(f"OOB R^2: {rfr.oob_score_:.4f}")
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint
# Load data
X, y = load_breast_cancer(return_X_y=True)
# Define parameter distributions
param_dist = {
    'n_estimators': randint(100, 1000),
    'max_depth': [None, 10, 20, 30, 50],
    'max_features': ['sqrt', 'log2', 0.3, 0.5],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'bootstrap': [True],
}
# Randomized search with 5-fold cross-validation
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
search = RandomizedSearchCV(
    rf, param_dist, n_iter=50, cv=5,
    scoring='accuracy', random_state=42, n_jobs=-1
)
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_breast_cancer
# Load and train
X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names
rf = RandomForestClassifier(
    n_estimators=300, random_state=42, n_jobs=-1
)
rf.fit(X, y)
# Gini importance (built-in)
gini_imp = rf.feature_importances_
# Permutation importance
perm_result = permutation_importance(
    rf, X, y, n_repeats=10, random_state=42, n_jobs=-1
)
perm_imp = perm_result.importances_mean
# Display top 10 features by each method
gini_idx = np.argsort(gini_imp)[::-1][:10]
perm_idx = np.argsort(perm_imp)[::-1][:10]
print("Top 10 by Gini Importance:")
for i in gini_idx:
    print(f"  {feature_names[i]:30s} {gini_imp[i]:.4f}")
print("\nTop 10 by Permutation Importance:")
for i in perm_idx:
    print(f"  {feature_names[i]:30s} {perm_imp[i]:.4f}")
Use Gini importance for quick exploratory analysis -- it is fast and requires no extra computation. Use permutation importance for rigorous feature selection, especially when features have different cardinalities. Permutation importance on a held-out test set is the gold standard.
import numpy as np
from sklearn.ensemble import IsolationForest
# Generate normal data with some anomalies
np.random.seed(42)
X_normal = np.random.randn(300, 2)
X_anomaly = np.random.uniform(-4, 4, size=(20, 2))
X = np.vstack([X_normal, X_anomaly])
# Train Isolation Forest
iso_forest = IsolationForest(
    n_estimators=200, contamination=0.06,
    random_state=42, n_jobs=-1
)
predictions = iso_forest.fit_predict(X)
# -1 = anomaly, 1 = normal
n_anomalies = (predictions == -1).sum()
print(f"Detected {n_anomalies} anomalies out of {len(X)} points")
print(f"Anomaly scores: {iso_forest.score_samples(X)[:5]}")