
Decision Trees
Interview Questions

15 commonly asked decision trees interview questions with detailed answers. Master splitting criteria, pruning strategies, and ensemble methods to ace your next ML interview.

EASY What is a decision tree and how does it work?

A decision tree is a supervised learning algorithm that makes predictions by learning a hierarchy of if-else decision rules from the training data. The model structure resembles an inverted tree where the topmost node is called the root, internal nodes represent tests on features, branches represent the outcomes of those tests, and leaf nodes hold the final predictions.

During training, the algorithm recursively splits the data at each node by selecting the feature and threshold that best separates the target classes (for classification) or reduces variance (for regression). At prediction time, a new sample simply traverses the tree from root to leaf, following the branch that matches its feature values at each decision point. The prediction at the leaf is typically the majority class or mean target value of the training samples that ended up in that leaf.

Key Points
  • Hierarchical structure of if-else decision rules
  • Root node at top, internal decision nodes, and leaf prediction nodes
  • Recursively splits data using the best feature and threshold
  • Predictions made by traversing the tree from root to leaf
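The root-to-leaf traversal described above can be sketched with scikit-learn; the iris dataset and the depth limit here are illustrative choices, not part of the question:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree on a small labelled dataset
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Prediction traverses the learned if-else rules from root to leaf
pred = clf.predict(X[:1])

# The hierarchy of decision rules can be printed as text
rules = export_text(clf, feature_names=load_iris().feature_names)
```

export_text makes the "hierarchy of if-else rules" concrete: each indented line is one test on a feature, and each leaf line shows the predicted class.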
EASY What is the difference between classification and regression trees?

Classification trees are used when the target variable is categorical. At each leaf node, they predict a class label, typically the majority class among the training samples in that leaf. The splitting criteria used during training are based on impurity measures such as Gini impurity or entropy, which quantify how mixed the classes are at a given node.

Regression trees are used when the target variable is continuous. Each leaf predicts a numerical value, usually the mean of the target values of the training samples in that leaf. The splitting criteria minimize variance or mean squared error within the resulting child nodes. Despite this difference in output and splitting metric, the underlying recursive partitioning algorithm is the same for both types.

Key Points
  • Classification trees predict categorical labels; regression trees predict continuous values
  • Classification uses Gini impurity or entropy for splits
  • Regression uses variance reduction or MSE for splits
  • Same recursive partitioning algorithm underlies both
EASY What is Gini impurity?

Gini impurity measures how often a randomly chosen element from a node would be incorrectly classified if it were labeled according to the distribution of classes in that node. It is calculated as Gini = 1 - sum(p_i^2), where p_i is the proportion of samples belonging to class i. A Gini value of 0 means the node is perfectly pure (all samples belong to one class), while a maximum value of 0.5 (for binary classification) indicates an even 50/50 split between classes.

During tree construction, the algorithm evaluates every possible split and selects the one that produces the largest reduction in Gini impurity, known as the Gini gain. The weighted Gini impurity of the child nodes is subtracted from the parent node's Gini impurity, and the split with the highest gain is chosen. Gini impurity is the default splitting criterion in scikit-learn's DecisionTreeClassifier because it is computationally efficient -- it avoids the logarithmic calculation required by entropy.

Key Points
  • Formula: Gini = 1 - sum(p_i^2)
  • Ranges from 0 (pure) to 0.5 (maximum impurity for binary)
  • Measures probability of incorrect random classification
  • Default criterion in scikit-learn due to computational efficiency
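The formula is short enough to verify by hand; a minimal implementation:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

pure = gini([0, 0, 0, 0])   # all one class
mixed = gini([0, 0, 1, 1])  # even 50/50 binary split
```

pure evaluates to 0.0 and mixed to 0.5, matching the bounds stated above.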
EASY What are the advantages of decision trees?

Decision trees are one of the most interpretable models in machine learning. The tree structure can be visualized and understood by non-technical stakeholders, making it easy to explain why a particular prediction was made by tracing the path from root to leaf. They require minimal data preprocessing -- no feature scaling or normalization is needed, and they can handle both numerical and categorical features natively.

Decision trees can capture non-linear relationships and feature interactions without explicit feature engineering. They are also robust to outliers because splits are based on rank ordering rather than absolute values. Additionally, they implicitly perform feature selection since only the most informative features appear in the tree. Training and prediction are both fast, and the model can handle multi-output problems naturally.

Key Points
  • Highly interpretable and easy to visualize
  • No feature scaling or normalization required
  • Handle both numerical and categorical data
  • Capture non-linear relationships and feature interactions
  • Implicit feature selection and robust to outliers
EASY What is pruning in decision trees?

Pruning is a regularization technique that reduces the size of a decision tree to prevent overfitting. An unpruned tree can grow until every leaf is perfectly pure, memorizing the training data including noise and outliers. Pruning removes branches that provide little predictive power on unseen data, resulting in a simpler, more generalizable model.

There are two main types of pruning. Pre-pruning (early stopping) halts tree growth during construction by setting constraints such as maximum depth, minimum samples per leaf, or minimum impurity decrease. Post-pruning (cost-complexity pruning) first grows the full tree and then removes subtrees that do not significantly improve performance on a validation set. Scikit-learn implements cost-complexity pruning via the ccp_alpha parameter, where higher alpha values produce more aggressive pruning.

Key Points
  • Reduces tree size to prevent overfitting
  • Pre-pruning stops growth early using constraints
  • Post-pruning removes branches after full tree is built
  • Cost-complexity pruning controlled by ccp_alpha in scikit-learn
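Both pruning styles map onto scikit-learn parameters; a sketch (the dataset and the alpha value are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unpruned tree vs. cost-complexity (post-)pruned tree
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

# Candidate alphas for tuning come from the pruning path
path = full.cost_complexity_pruning_path(X_tr, y_tr)
```

In practice ccp_alpha is chosen by cross-validating over the values in path.ccp_alphas.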
MEDIUM Compare Gini impurity, entropy, and information gain.

Gini impurity and entropy are both measures of node impurity used to evaluate the quality of a split. Gini impurity is calculated as 1 - sum(p_i^2) and ranges from 0 to 0.5 for binary classification. Entropy uses the formula -sum(p_i * log2(p_i)) and ranges from 0 to 1 for binary classification. Both reach their minimum at pure nodes and their maximum when classes are equally distributed, but entropy penalizes impurity more heavily due to the logarithmic function.

Information gain is not an impurity measure itself but rather the reduction in entropy (or Gini) achieved by a split. It is computed as the parent node's impurity minus the weighted average impurity of the child nodes. The split with the highest information gain is selected. In practice, Gini and entropy produce nearly identical trees in about 98% of cases. Gini is slightly faster to compute since it avoids logarithms, which is why it is the default in many implementations. Entropy-based information gain is used by the ID3 and C4.5 algorithms.

Key Points
  • Gini: 1 - sum(p_i^2), faster to compute, default in CART
  • Entropy: -sum(p_i * log2(p_i)), penalizes impurity more heavily
  • Information gain = parent impurity - weighted child impurity
  • Both produce nearly identical trees in most cases
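All three quantities can be computed directly from class proportions; a small numerical sketch (the 80/20 child split is an invented example):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Parent is a 50/50 node; a candidate split sends 80/20 and 20/80 to the children
parent = entropy([0.5, 0.5])
children = 0.5 * entropy([0.8, 0.2]) + 0.5 * entropy([0.2, 0.8])
info_gain = parent - children
```

Here parent is 1.0 (maximum binary entropy) while gini([0.5, 0.5]) is 0.5, illustrating the different ranges of the two measures.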
MEDIUM How do decision trees handle continuous and categorical features?

For continuous features, the algorithm sorts the feature values and evaluates binary splits at every midpoint between consecutive distinct values. For example, if a feature has values [1, 3, 7, 10], it tests thresholds at 2, 5, and 8.5. The threshold that produces the best impurity reduction is selected, creating a binary split of the form "feature <= threshold" for the left branch and "feature > threshold" for the right branch.

For categorical features, the handling depends on the algorithm. CART considers all possible binary partitions of the categories into two groups and picks the best one. For a feature with k categories, this means evaluating 2^(k-1) - 1 possible partitions, which becomes expensive for high-cardinality features. ID3 and C4.5 create one branch per category value (multi-way split). In scikit-learn, categorical features must be encoded numerically (e.g., one-hot or ordinal encoding) before fitting, as the implementation only supports numerical inputs.

Key Points
  • Continuous: sort values, test all midpoint thresholds
  • Categorical: CART uses binary partitions; ID3/C4.5 use multi-way splits
  • High-cardinality categoricals can be computationally expensive
  • Scikit-learn requires numerical encoding for categorical features
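The midpoint rule from the worked example above takes only a couple of lines:

```python
import numpy as np

values = np.array([1.0, 3.0, 7.0, 10.0])

# Candidate thresholds sit at midpoints between consecutive distinct values
sorted_vals = np.sort(np.unique(values))
thresholds = (sorted_vals[:-1] + sorted_vals[1:]) / 2.0
```

thresholds comes out as [2.0, 5.0, 8.5], matching the example in the answer.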
MEDIUM Explain the CART algorithm.

CART (Classification and Regression Trees), introduced by Breiman et al. in 1984, is the most widely used decision tree algorithm and the one implemented in scikit-learn. It always produces binary trees -- each internal node has exactly two children. The algorithm greedily searches for the best feature and threshold at each node by evaluating all possible binary splits and selecting the one that maximizes the impurity reduction (Gini for classification, MSE for regression).

CART uses a top-down, greedy recursive partitioning approach. Starting with the full dataset at the root, it finds the best binary split, partitions the data into two subsets, and recursively repeats the process on each subset. Growth stops when a stopping criterion is met (e.g., maximum depth, minimum samples per node, or no further impurity reduction). For pruning, CART uses cost-complexity pruning (also called minimal cost-complexity pruning or weakest-link pruning), which iteratively removes the subtree that contributes the least to overall tree performance, controlled by a complexity parameter alpha.

Key Points
  • Produces strictly binary trees (two children per node)
  • Uses Gini impurity for classification, MSE for regression
  • Greedy top-down recursive partitioning
  • Cost-complexity pruning with alpha parameter for regularization
  • Foundation of scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor
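The greedy split search at the heart of CART fits in a few lines; this is a simplified classification-only sketch (no stopping criteria, no pruning), not scikit-learn's actual implementation:

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Evaluate every feature/threshold pair; keep the largest Gini reduction."""
    n, d = X.shape
    parent = gini(y)
    best = (None, None, 0.0)  # (feature index, threshold, impurity reduction)
    for j in range(d):
        vals = np.sort(np.unique(X[:, j]))
        for t in (vals[:-1] + vals[1:]) / 2.0:  # midpoint thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best[2]:
                best = (j, t, parent - child)
    return best

feature, threshold, gain = best_split(
    np.array([[1.0], [2.0], [8.0], [9.0]]), np.array([0, 0, 1, 1])
)
```

The full algorithm simply calls this search recursively on each resulting subset until a stopping criterion fires.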
MEDIUM What causes overfitting in decision trees and how do you prevent it?

Decision trees are highly prone to overfitting because, without constraints, they will keep splitting until every leaf node is pure or contains a single sample. This creates extremely deep and complex trees that memorize training data, including noise, outliers, and random fluctuations. The resulting model has very low bias but extremely high variance -- it performs well on training data but generalizes poorly to unseen data.

Several techniques prevent overfitting. Pre-pruning constraints include setting max_depth to limit tree depth, min_samples_split to require a minimum number of samples before splitting, min_samples_leaf to enforce a minimum number of samples in each leaf, and max_features to consider only a random subset of features at each split. Post-pruning via cost-complexity pruning (ccp_alpha) removes branches after the tree is fully grown. Beyond single-tree methods, ensemble approaches like Random Forests and Gradient Boosting combine many trees to reduce variance while maintaining low bias.

Key Points
  • Unrestricted trees memorize noise and outliers (high variance)
  • Pre-pruning: max_depth, min_samples_split, min_samples_leaf
  • Post-pruning: cost-complexity pruning with ccp_alpha
  • Ensemble methods (Random Forest, Boosting) reduce variance
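The pre-pruning constraints named above correspond one-to-one to scikit-learn parameters; the specific values below are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
constrained = DecisionTreeClassifier(
    max_depth=4,           # cap tree depth
    min_samples_split=20,  # a node needs 20 samples before it may split
    min_samples_leaf=10,   # every leaf keeps at least 10 samples
    random_state=0,
).fit(X, y)
```

The constrained tree is far smaller, trading a little training accuracy for lower variance on unseen data.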
MEDIUM How does feature importance work in decision trees?

Feature importance in decision trees is computed based on how much each feature contributes to reducing impurity across all nodes where it is used for splitting. For each node, the impurity decrease is calculated as the difference between the parent node's impurity and the weighted sum of the child nodes' impurities. These impurity decreases are accumulated for each feature across all splits in the tree and then normalized so that all importances sum to one.

This is known as Mean Decrease in Impurity (MDI) or Gini importance. Features that appear higher in the tree (closer to the root) and are used more frequently tend to have higher importance scores because they affect more samples and produce larger impurity reductions. However, MDI has a known bias toward features with many unique values (high cardinality), as these features have more possible split points and are statistically more likely to produce a good split. Permutation importance is an alternative that avoids this bias by measuring the drop in model performance when feature values are randomly shuffled.

Key Points
  • Based on total impurity reduction across all splits using each feature
  • Also called Mean Decrease in Impurity (MDI) or Gini importance
  • Features closer to root tend to have higher importance
  • Biased toward high-cardinality features; permutation importance is an alternative
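Both importance flavors are available in scikit-learn; a sketch (the dataset and depth are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# MDI / Gini importance: accumulated impurity decreases, normalized to sum to 1
mdi = clf.feature_importances_

# Permutation importance: performance drop when one feature is shuffled
perm = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
```

For honest permutation scores, compute them on held-out data rather than on the training set used here.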
HARD Compare ID3, C4.5, and CART algorithms in detail.

ID3 (Iterative Dichotomiser 3) was developed by Quinlan in 1986. It uses information gain (entropy-based) as its splitting criterion and creates multi-way splits -- one branch per unique value of a categorical feature. ID3 only handles categorical features and does not support continuous attributes. It has no built-in pruning mechanism and is therefore highly prone to overfitting. A major limitation is that information gain is biased toward features with many categories.

C4.5, also by Quinlan (1993), is an extension of ID3 that addresses its key limitations. It uses gain ratio instead of information gain to correct the bias toward high-cardinality features. C4.5 handles both continuous and categorical features, supports missing values by distributing instances probabilistically across branches, and includes an error-based post-pruning mechanism. CART, developed by Breiman et al. (1984), differs by producing strictly binary trees, using Gini impurity for classification and MSE for regression, and applying cost-complexity pruning. CART handles both classification and regression, while ID3 and C4.5 are classification-only.

Key Points
  • ID3: entropy-based, multi-way splits, categorical only, no pruning
  • C4.5: gain ratio, handles continuous features and missing values, error-based pruning
  • CART: binary splits, Gini/MSE, supports regression, cost-complexity pruning
  • C4.5 corrects information gain bias with gain ratio normalization
  • CART is the foundation for scikit-learn and most modern implementations
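C4.5's gain ratio normalization is easy to make concrete; a minimal sketch, where the list of groups stands in for one candidate multi-way split:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(y, groups):
    """Information gain divided by split information (C4.5's normalization)."""
    n = len(y)
    gain = entropy(y) - sum(len(g) / n * entropy(g) for g in groups)
    split_info = -sum(len(g) / n * np.log2(len(g) / n) for g in groups)
    return gain / split_info if split_info > 0 else 0.0

y = np.array([0, 0, 1, 1])
ratio = gain_ratio(y, [y[:2], y[2:]])  # a perfect two-way split
```

Splitting into many tiny branches inflates split_info, so the ratio penalizes exactly the high-cardinality splits that raw information gain favors.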
HARD How do Random Forests improve upon single decision trees?

Random Forests, introduced by Breiman in 2001, combine two sources of randomness to build an ensemble of decorrelated decision trees. First, bagging (bootstrap aggregating) trains each tree on a different random sample drawn with replacement from the training data, so each tree sees a slightly different dataset. Second, at each split, only a random subset of features (typically sqrt(d) for classification or d/3 for regression) is considered, preventing any single dominant feature from appearing in every tree.

The key insight is that decorrelating the individual trees dramatically reduces the variance of the ensemble average. A single decision tree has low bias but high variance. Averaging many decorrelated trees shrinks the variance as trees are added, down to a floor set by the residual correlation between trees, while leaving bias approximately unchanged. The final prediction is determined by majority vote (classification) or averaging (regression). Random Forests are also robust to overfitting: increasing the number of trees does not harm generalization, it only stabilizes the ensemble. They provide out-of-bag (OOB) error estimates for free and reliable feature importance scores.

Key Points
  • Bagging + random feature subsets decorrelate trees
  • Reduces variance dramatically while maintaining low bias
  • Majority vote for classification, averaging for regression
  • More trees never hurts; provides free OOB error estimates
  • Much more robust to overfitting than single decision trees
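The two sources of randomness and the OOB estimate map directly onto RandomForestClassifier parameters; the dataset and tree count here are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,     # bagging: each tree sees a bootstrap sample
    max_features="sqrt",  # random feature subset considered at each split
    oob_score=True,       # accuracy on samples each tree never saw
    random_state=0,
).fit(X, y)

oob = rf.oob_score_
```

The OOB score behaves like a built-in cross-validation estimate, since every tree is evaluated only on the roughly one third of samples left out of its bootstrap.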
HARD Explain Gradient Boosted Trees and their advantages.

Gradient Boosted Trees (GBT) build an ensemble of decision trees sequentially, where each new tree is trained to correct the errors (residuals) of the combined ensemble so far. Unlike Random Forests, which build trees independently in parallel, boosting is an additive process: the model at step m is F_m(x) = F_{m-1}(x) + learning_rate * h_m(x), where h_m is a shallow decision tree fitted to the negative gradient of the loss function. This gradient-based formulation, introduced by Friedman in 2001, generalizes boosting to any differentiable loss function.

GBT typically uses shallow trees (depth 3-8) as weak learners rather than the deep trees used in Random Forests. The learning rate shrinks each tree's contribution, requiring more trees but producing better generalization through regularization. Advantages include consistently achieving state-of-the-art accuracy on structured/tabular data, flexibility through custom loss functions, and built-in handling of missing values (in implementations like XGBoost and LightGBM). The main trade-offs are sequential training (slower than Random Forest), sensitivity to hyperparameters, and higher risk of overfitting if not properly tuned.

Key Points
  • Sequential ensemble -- each tree corrects errors of previous trees
  • Fits trees to negative gradients of the loss function
  • Uses shallow trees with a learning rate for regularization
  • State-of-the-art on tabular data (XGBoost, LightGBM, CatBoost)
  • Trade-offs: slower training, hyperparameter sensitivity
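The additive update F_m(x) = F_{m-1}(x) + learning_rate * h_m(x) corresponds to scikit-learn's GradientBoostingClassifier; the hyperparameter values below are typical but arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(
    n_estimators=200,   # trees added sequentially
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # shallow weak learners
    random_state=0,
).fit(X_tr, y_tr)

score = gbt.score(X_te, y_te)
```

staged_predict can be used to watch test error as trees are added, which is how an early-stopping point is usually picked.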
HARD How would you handle imbalanced classes with decision trees?

Class imbalance causes decision trees to favor the majority class because impurity measures are dominated by the larger class. The most direct approach is to use class weights that are inversely proportional to class frequency. In scikit-learn, setting class_weight="balanced" automatically adjusts the impurity calculation so that misclassifying a minority sample incurs a larger penalty. This is equivalent to scaling the Gini or entropy contribution of each sample by its class weight, effectively giving the minority class more influence on split decisions.

Data-level techniques include oversampling the minority class (SMOTE generates synthetic minority samples by interpolating between existing ones), undersampling the majority class, or using a combination of both. For ensemble methods, Balanced Random Forest combines undersampling with bagging by training each tree on a balanced bootstrap sample. Threshold tuning adjusts the classification threshold away from the default 0.5 to optimize for the desired metric (e.g., F1-score or recall). Cost-sensitive learning assigns different misclassification costs per class. The choice of evaluation metric is also critical -- use precision-recall AUC, F1-score, or Matthews correlation coefficient rather than accuracy.

Key Points
  • class_weight="balanced" adjusts impurity calculations
  • SMOTE and other resampling techniques at the data level
  • Balanced Random Forest for ensemble-level balancing
  • Threshold tuning and cost-sensitive learning
  • Use F1, precision-recall AUC, or MCC instead of accuracy
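A sketch of the class_weight approach on a synthetic 95/5 problem (the dataset and depth are invented for illustration; the size of the improvement varies by problem):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# ~95% majority class, ~5% minority class
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
balanced = DecisionTreeClassifier(
    max_depth=3, class_weight="balanced", random_state=0
).fit(X, y)

# Fraction of minority samples each model recovers (recall on class 1)
minority = y == 1
recall_plain = plain.predict(X[minority]).mean()
recall_balanced = balanced.predict(X[minority]).mean()
```

With balanced weights, each minority sample counts roughly 19x as much in the impurity calculation, so splits that isolate minority regions become worthwhile.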
HARD What are the computational complexity considerations for tree algorithms?

Training a decision tree involves evaluating possible splits at each node. For a dataset with n samples and d features, at each node the algorithm must sort each feature (O(n log n) per feature) and evaluate all possible thresholds. The total training complexity for a balanced tree is O(d * n * log(n) * log(n)), where one log(n) comes from sorting and the other from the tree depth. In the worst case (completely unbalanced tree), depth becomes O(n), making training O(d * n^2 * log(n)).

Prediction time is O(depth) per sample, which is O(log n) for balanced trees. Memory usage is O(number of nodes), which is at most O(n) for a fully grown tree. For ensemble methods, these costs multiply by the number of trees. Modern implementations use algorithmic optimizations to improve scalability: histogram-based splitting (LightGBM) discretizes features into bins reducing split evaluation to O(bins) instead of O(n), approximate quantile sketches (XGBoost) reduce the number of candidate thresholds, and GPU acceleration parallelizes split evaluation. These optimizations make tree ensembles practical for datasets with millions of rows and thousands of features.

Key Points
  • Training: O(d * n * log(n) * log(n)) for balanced trees
  • Prediction: O(log n) per sample for balanced trees
  • Histogram-based splitting reduces per-node cost dramatically
  • Ensemble costs scale linearly with number of trees
  • XGBoost, LightGBM use approximate methods for scalability
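The gap between exact and histogram-based split search can be seen with a few lines of NumPy; this is a sketch of the idea, not LightGBM's actual binning code:

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=100_000)

# Exact search: roughly one candidate threshold per distinct value, O(n)
exact_candidates = np.unique(feature).size

# Histogram search: discretize once into a fixed number of bins, then
# only O(bins) candidate thresholds per node regardless of n
n_bins = 255
edges = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))
binned = np.digitize(feature, edges[1:-1])  # integer bin index per sample
hist_candidates = n_bins
```

With continuous data, exact_candidates is close to 100,000 while hist_candidates stays at 255, which is the core of the histogram speedup.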
