Decision Trees & Random Forests Cheat Sheet
Everything you need on one page. Perfect for revision, interviews, and quick reference.
Entropy H(p) = −Σᵢ pᵢ log₂ pᵢ ranges over [0, log₂(c)]; Gini impurity G(p) = 1 − Σᵢ pᵢ² ranges over [0, 1 − 1/c], where c is the number of classes. For binary classification: entropy max = 1, Gini max = 0.5.
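A minimal pure-Python sketch of both impurity measures (function names are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p_i * log2(p_i)); 0*log2(0) treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity: G = 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in probs)

# Balanced binary split hits the maxima quoted above.
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.5, 0.5]))     # 0.5

# Three balanced classes: entropy max = log2(3) ~ 1.585, Gini max = 1 - 1/3 ~ 0.667
print(entropy([1/3, 1/3, 1/3]))
print(gini([1/3, 1/3, 1/3]))
```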
Training complexity: O(n · m · log n) per tree, where n = samples, m = features (dominated by sorting feature values at each node).
Cost-complexity pruning minimizes R_α(T) = R(T) + α|T|, where R(T) = misclassification rate, |T| = number of leaves, and α is found via cross-validation. Use ccp_alpha in scikit-learn.
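Assuming scikit-learn is available, the pruning-path API can be sketched like this; the dataset and the particular alpha picked are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow a full tree, then ask scikit-learn for the effective alphas at which
# cost-complexity pruning would collapse each subtree.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)  # increasing sequence of candidate alphas

# Refit with a large alpha: larger alpha => stronger pruning => fewer leaves.
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())
```

In practice, α is chosen by cross-validating over `path.ccp_alphas` (e.g. with GridSearchCV) rather than picked by hand.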
Each bootstrap sample contains ~63.2% (≈ 1 − 1/e) of the unique training points; the remaining ~36.8% are out-of-bag (OOB). OOB error approximates leave-one-out CV.
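The 63.2% figure is easy to verify by simulation; this pure-Python sketch draws one bootstrap sample and counts distinct indices:

```python
import random

random.seed(0)
n = 10_000
# One bootstrap sample: n draws with replacement from n indices.
sample = [random.randrange(n) for _ in range(n)]
unique_fraction = len(set(sample)) / n

# The theoretical value is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ~ 0.632.
print(f"unique fraction: {unique_fraction:.3f}")
```

In scikit-learn, setting `oob_score=True` on `RandomForestClassifier` computes the OOB estimate for you as `oob_score_`.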
Start with defaults; tune learning_rate and n_estimators together first (they trade off), then max_depth, then regularization (e.g. subsample, min_samples_leaf).
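A minimal staged-search sketch following that order, assuming scikit-learn; the grid values and dataset are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Stage 1: learning_rate and n_estimators together (they trade off).
stage1 = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"learning_rate": [0.05, 0.1], "n_estimators": [100, 200]},
    cv=3,
).fit(X, y)

# Stage 2: tree depth, keeping the stage-1 winners fixed.
stage2 = GridSearchCV(
    GradientBoostingClassifier(random_state=0, **stage1.best_params_),
    {"max_depth": [2, 3, 4]},
    cv=3,
).fit(X, y)

print(stage1.best_params_, stage2.best_params_)
```

A third pass over regularization parameters (e.g. `subsample`, `min_samples_leaf`) would follow the same pattern.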
Q: What is the difference between Gini impurity and entropy?
A: Both measure node impurity. Entropy uses logarithms and ranges [0, log₂(c)]. Gini uses squared probabilities and ranges [0, 1-1/c]. In practice, they produce nearly identical trees. Gini is slightly faster (no log computation) and is the scikit-learn default.
Q: Why are single decision trees high-variance models?
A: The greedy splitting algorithm is sensitive to the training data. Small changes in the data can lead to completely different split decisions at the root, cascading into entirely different tree structures. Ensembles fix this by averaging many trees.
Q: How does a Random Forest reduce variance?
A: Random Forest uses two sources of randomness: (1) bootstrap sampling gives each tree different training data, and (2) random feature selection at each split decorrelates the trees. Averaging many decorrelated trees reduces variance without increasing bias.
Q: What is the difference between bagging and boosting?
A: Bagging trains trees independently in parallel on bootstrap samples and averages them (reduces variance). Boosting trains trees sequentially, where each tree corrects the errors of the previous ones (reduces bias and variance). Boosting typically achieves higher accuracy but is more prone to overfitting.
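One way to see the parallel/sequential contrast in code, assuming scikit-learn: bagged trees are independent (so `n_jobs=-1` can fit them in parallel), while `staged_predict` exposes boosting's sequential stages. Dataset and ensemble sizes are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Bagging: independent trees on bootstrap samples; fitted in parallel.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        n_jobs=-1, random_state=0).fit(X, y)

# Boosting: trees are fitted one after another; staged_predict yields the
# ensemble's prediction after each sequential stage.
boost = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
staged_acc = [(pred == y).mean() for pred in boost.staged_predict(X)]
print(staged_acc[0], staged_acc[-1])  # training accuracy typically rises with stages
```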
Q: Why can't decision trees extrapolate?
A: Trees partition the feature space into axis-aligned rectangles and assign a constant prediction to each region. Leaf regions extend to infinity, so any value outside the training range falls into a boundary leaf and receives that leaf's constant prediction. A tree cannot continue trends beyond the data, unlike linear models.
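A small sketch of this failure mode, assuming scikit-learn: fit a tree on a linear trend and query far outside the training range:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train on y = 2x for x in [0, 10); a linear model would extrapolate the trend.
X = np.arange(0, 10, 0.1).reshape(-1, 1)
y = 2 * X.ravel()
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Outside the training range the tree returns the constant prediction of the
# boundary leaf -- the trend does not continue.
print(tree.predict([[9.9]]))    # ~19.8 (near the last training point)
print(tree.predict([[100.0]]))  # same boundary leaf -> same constant, not ~200
```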
Q: What makes XGBoost more accurate and faster than plain gradient boosting?
A: XGBoost uses second-order Taylor expansion for better split finding, built-in L1/L2 regularization to prevent overfitting, column subsampling for diversity, sparsity-aware splitting for missing values, and cache-optimized parallel computation for speed.
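The second-order machinery boils down to XGBoost's split-gain formula, which can be sketched in plain Python without the xgboost library; the gradient/hessian numbers below are made up for illustration:

```python
def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """XGBoost structure-score gain for a candidate split.

    G_*/H_* are sums of first/second derivatives (gradients/hessians) of the
    loss over the samples in each child; lam is the L2 penalty on leaf
    weights, gamma the per-leaf complexity penalty.
    """
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma

# A split that separates positive from negative gradients scores well...
print(split_gain(G_left=-4.0, H_left=4.0, G_right=4.0, H_right=4.0))   # 3.2
# ...while a split whose children look alike gains nothing (negative with gamma).
print(split_gain(G_left=2.0, H_left=4.0, G_right=2.0, H_right=4.0, gamma=0.1))
```

A split is kept only when this gain is positive, which is how λ and γ act as built-in regularization.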