Gradient Boosting (XGBoost)
From sequential error correction to state-of-the-art ensemble learning. A comprehensive, visually interactive deep dive into the most dominant algorithm for structured data in machine learning.
How the idea of combining weak learners sequentially evolved into the most powerful algorithm for structured data.
The story of gradient boosting begins with a fundamental question in computational learning theory: can a set of weak learners be combined to create a strong learner? In 1990, Robert Schapire proved that the answer is yes, and in 1997, Yoav Freund and Robert Schapire introduced AdaBoost (Adaptive Boosting), the first practical boosting algorithm.
AdaBoost works by training a sequence of weak classifiers (typically shallow decision trees called "stumps"), where each subsequent classifier focuses more on the examples that previous classifiers got wrong. It assigns higher weights to misclassified samples, forcing the next learner to pay more attention to the hard cases. The final prediction is a weighted majority vote of all the weak learners.
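As a minimal sketch of this reweighting idea, scikit-learn's AdaBoostClassifier uses exactly this scheme (its default base learner is a depth-1 decision stump); the dataset here is a synthetic toy problem chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# AdaBoost's default base learner is a depth-1 tree ("stump"); each
# round reweights misclassified samples so the next stump focuses on them.
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
print(f"Training accuracy: {model.score(X, y):.3f}")
```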
While AdaBoost was revolutionary, its theoretical understanding was limited to classification with exponential loss. In 2001, Jerome Friedman published a landmark paper that reframed boosting as gradient descent in function space. This was the birth of Gradient Boosting Machines (GBM).
Friedman's key insight was that boosting could be understood as iteratively adding functions (trees) that move the ensemble prediction in the direction of the negative gradient of the loss function. This generalization meant that gradient boosting could optimize any differentiable loss function -- making it applicable to regression, classification, ranking, and more.
The update rule at each iteration is
\[ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta\, h_m(\mathbf{x}) \]
Where \( F_m \) is the ensemble after \( m \) iterations, \( h_m \) is the new weak learner fitted to the negative gradient (pseudo-residuals), and \( \eta \) is the learning rate (shrinkage parameter).
From Friedman's foundational work to today's highly optimized implementations, gradient boosting has become the dominant algorithm for structured/tabular data:
Why fitting to the errors of previous models leads to powerful ensemble predictions.
The core idea behind gradient boosting is remarkably simple: build models sequentially, where each new model corrects the errors of the previous ensemble. Instead of trying to solve the entire problem at once, you solve it incrementally.
Think of it as a team of analysts reviewing a forecast. The first analyst makes a rough prediction. The second analyst looks at the errors (residuals) from the first prediction and builds a model to predict those errors. The third analyst then models the remaining errors, and so on. Each subsequent model focuses on what the current ensemble still gets wrong.
In the simplest case (squared error loss for regression), each new tree is fitted to the residuals -- the difference between the true values and the current predictions:
\[ r_i = y_i - F_{m-1}(\mathbf{x}_i) \]
The new tree \( h_m(\mathbf{x}) \) is trained to predict these residuals. When we add it to the ensemble, the combined prediction gets closer to the true values. After many iterations, the residuals become very small, and the ensemble achieves high accuracy.
A single large decision tree can overfit badly. Gradient boosting instead uses many small trees (weak learners), each making a small correction. This additive approach acts as a form of regularization -- each tree contributes only a small improvement, preventing the ensemble from memorizing noise in the training data.
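The residual-fitting loop described above can be sketched in a few lines with numpy and scikit-learn's DecisionTreeRegressor; the sine-wave data and the learning rate of 0.1 are illustrative choices, not part of any library's defaults.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 200)

# Start from the mean, then repeatedly fit a shallow tree to the residuals.
eta = 0.1
F = np.full_like(y, y.mean())
trees = []
for m in range(100):
    residuals = y - F                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += eta * tree.predict(X)              # small correction per round
    trees.append(tree)

print(f"Final training MSE: {np.mean((y - F) ** 2):.4f}")
```

After 100 rounds the ensemble's error is far below the mean-only baseline, even though each individual depth-2 tree is a very weak model.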
Gradient boosting builds the final model as a sum of weak learners:
\[ F_M(\mathbf{x}) = F_0(\mathbf{x}) + \eta \sum_{m=1}^{M} h_m(\mathbf{x}) \]
Where \( F_0 \) is the initial prediction (often the mean of the target), \( M \) is the total number of boosting rounds, and \( \eta \) is the learning rate that controls how much each tree contributes.
Each tree is added greedily to minimize the loss function. The ensemble cannot go back and change previous trees -- it can only add new corrections.
Any differentiable loss function can be used -- MSE, MAE, log-loss, Huber loss, and custom losses for specific business objectives.
The base learners are typically shallow decision trees (depth 3-8). They are intentionally kept weak -- a single tree is not accurate, but the ensemble of hundreds or thousands of them is very powerful.
Loss minimization in function space -- gradient descent without parameters.
In traditional gradient descent, we minimize a loss function by updating parameters in the direction of the negative gradient. In gradient boosting, we minimize the loss by updating a function -- the ensemble prediction itself. The objective is:
\[ F^* = \arg\min_{F} \sum_{i=1}^{n} L\big(y_i, F(\mathbf{x}_i)\big) \]
Where \( L \) is a differentiable loss function and \( F \) is the model we are building. Instead of optimizing over a fixed set of parameters, we optimize over the space of functions by iteratively adding new functions (trees).
In parametric gradient descent, we update parameters: \( \theta_{t+1} = \theta_t - \eta \nabla_\theta L \). In gradient boosting, we update the function itself:
\[ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) - \eta\, \nabla_F L \big|_{F = F_{m-1}} \]
The negative gradient of the loss with respect to the current prediction gives us the direction in which to improve. For each training point, this negative gradient is:
\[ g_i = -\left[ \frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)} \right]_{F = F_{m-1}} \]
These values \( g_i \) are called pseudo-residuals. We fit a decision tree \( h_m \) to these pseudo-residuals, which approximates the negative gradient direction.
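To make the pseudo-residuals concrete, here is a small numpy sketch computing them for two common losses; the specific target and prediction values are arbitrary illustrations.

```python
import numpy as np

# Pseudo-residuals g_i = -dL/dF for two common losses.
y = np.array([1.0, 0.0, 1.0])        # targets (binary labels for log-loss)
F = np.array([0.2, 0.4, 0.9])        # current raw predictions

# Squared error L = (1/2)(y - F)^2  ->  -dL/dF = y - F (the ordinary residual)
g_mse = y - F

# Binary log-loss on a raw score F: with p = sigmoid(F), -dL/dF = y - p
p = 1.0 / (1.0 + np.exp(-F))
g_logloss = y - p

print("MSE pseudo-residuals:     ", g_mse)
print("Log-loss pseudo-residuals:", g_logloss)
```

In both cases the pseudo-residual is positive when the prediction is too low and negative when it is too high, which is exactly the correction direction the next tree should learn.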
XGBoost improves on basic gradient boosting by using a second-order Taylor expansion of the loss function. This is analogous to Newton's method vs. gradient descent -- using curvature information for faster convergence:
\[ \mathcal{L}^{(m)} \approx \sum_{i=1}^{n} \left[ g_i\, h_m(\mathbf{x}_i) + \tfrac{1}{2} H_i\, h_m(\mathbf{x}_i)^2 \right] + \Omega(h_m) \]
Where \( g_i = \partial L / \partial F(\mathbf{x}_i) \) is the first derivative (gradient), \( H_i = \partial^2 L / \partial F(\mathbf{x}_i)^2 \) is the second derivative (Hessian), and \( \Omega(h_m) \) is a regularization term that penalizes model complexity.
Using the Hessian (second derivative) allows XGBoost to take more informed steps during optimization. Instead of just knowing the direction to move (gradient), it also knows the curvature, enabling it to determine the optimal step size. This is why XGBoost often converges faster than standard gradient boosting implementations.
The chart below illustrates how training and validation error decrease as more boosting rounds (trees) are added. Notice how training error continues to decrease while validation error eventually plateaus or increases -- this is the point where early stopping should be applied.
Step-by-step: how gradient boosting builds its ensemble one tree at a time.
The complete gradient boosting algorithm proceeds as follows:
Start with an initial prediction \( F_0(\mathbf{x}) \) that minimizes the loss. For squared error, this is the mean of the target values: \( F_0 = \bar{y} \). For log-loss, this is the log-odds of the positive class.
For each training point, compute the negative gradient of the loss with respect to the current prediction:
\[ r_{im} = -\left[ \frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)} \right]_{F = F_{m-1}} \]
Train a decision tree \( h_m(\mathbf{x}) \) on the training data with the pseudo-residuals \( r_{im} \) as the target. The tree learns to approximate the negative gradient direction.
Add the new tree to the ensemble with a learning rate \( \eta \):
\[ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta\, h_m(\mathbf{x}) \]
Go back to step 2 and repeat for \( M \) iterations (boosting rounds). The final model is the sum of all trees.
The beauty of the gradient boosting framework is that the algorithm stays the same -- only the pseudo-residuals change depending on the loss function:
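The four steps above can be assembled into a from-scratch gradient boosting classifier in a few dozen lines. This is a sketch for log-loss (it fits trees directly to the pseudo-residuals and omits Friedman's per-leaf line search); the toy dataset and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=8, random_state=1)

eta, M = 0.1, 100

# Step 1: initialize with the log-odds of the positive class
p0 = y.mean()
F = np.full(len(y), np.log(p0 / (1 - p0)))

trees = []
for m in range(M):
    # Step 2: pseudo-residuals for log-loss are y - sigmoid(F)
    p = 1.0 / (1.0 + np.exp(-F))
    r = y - p
    # Step 3: fit a regression tree to the pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=3).fit(X, r)
    # Step 4: shrink its contribution and add it to the ensemble
    F += eta * tree.predict(X)
    trees.append(tree)

acc = ((F > 0) == y).mean()
print(f"Training accuracy after {M} rounds: {acc:.3f}")
```

Swapping in a different loss only changes the pseudo-residual line in step 2 -- the rest of the loop is untouched, which is the point of Friedman's framework.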
See how gradient boosting builds its prediction step by step. Each boosting round adds a new tree that corrects the remaining errors.
Use the slider to add more boosting rounds. Watch how the ensemble prediction (orange line) gets closer to the true function (blue dashed) as more trees are added.
How controlling each tree's contribution prevents overfitting and improves generalization.
The learning rate \( \eta \) (also called the shrinkage parameter) controls how much each tree contributes to the final ensemble. It scales the contribution of each new tree:
\[ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta\, h_m(\mathbf{x}), \quad 0 < \eta \le 1 \]
There is a fundamental trade-off between the learning rate and the number of trees:
In practice, the standard approach is to set a small learning rate (0.01-0.1) and then use early stopping to find the optimal number of trees. This means you keep adding trees until the validation error stops improving, then stop. This automatically finds the right balance between underfitting and overfitting.
Setting a fixed number of trees (e.g., n_estimators=1000) without early stopping is risky. With a high learning rate, 1000 trees may severely overfit. With a low learning rate, 1000 trees may underfit. Always use early stopping with a validation set to find the optimal number of boosting rounds.
See how different learning rates affect the convergence of gradient boosting. A smaller learning rate leads to smoother convergence but requires more boosting rounds.
Adjust the learning rate slider and observe how the model converges. Lower values require more iterations but produce smoother, more generalizable models.
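Outside the interactive demo, the same experiment can be run with scikit-learn: staged_predict yields the ensemble's prediction after each boosting round, so one fitted model traces the whole validation curve. The dataset and learning-rate grid here are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_round = {}
for eta in (0.5, 0.1, 0.01):
    model = GradientBoostingRegressor(
        n_estimators=300, learning_rate=eta, max_depth=3, random_state=0
    ).fit(X_tr, y_tr)
    # staged_predict yields predictions after each boosting round,
    # tracing the validation curve without retraining.
    val_mse = [np.mean((y_val - p) ** 2) for p in model.staged_predict(X_val)]
    best_round[eta] = int(np.argmin(val_mse)) + 1
    print(f"eta={eta}: best at round {best_round[eta]}, "
          f"val MSE {min(val_mse):.1f}")
```

Typically the large learning rate reaches its best validation error in far fewer rounds but then degrades, while the small one keeps improving throughout -- the trade-off described above.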
Controlling the complexity and behavior of each individual tree in the ensemble.
The max_depth parameter controls the maximum depth of each decision tree in the ensemble. It directly determines the complexity of the interactions the model can capture:
The default in XGBoost is max_depth=6, which provides a good balance for most problems.
In XGBoost, min_child_weight specifies the minimum sum of instance weights (Hessians) needed in a child node. For standard squared-error loss, this is effectively the minimum number of samples in a leaf. Higher values create more conservative (simpler) trees:
Setting this value higher prevents the model from learning highly specific patterns (noise) in the training data. Values of 1-10 are typical.
Gradient boosting can use stochastic sampling at both the row and column level to reduce overfitting and speed up training:
Fraction of training samples used for each tree (e.g., 0.8 means 80%). Introduces randomness like bagging. Typical range: 0.5-1.0.
Fraction of features considered for each tree. Similar to Random Forest's feature bagging. Typical range: 0.5-1.0.
Fraction of features considered at each depth level within a tree. Provides even finer-grained randomization.
Friedman (2002) showed that adding randomness through subsampling (called "stochastic gradient boosting") significantly improves generalization. By not using all data for every tree, the model is less likely to overfit. This is one of the most effective regularization techniques for gradient boosting.
How XGBoost controls model complexity beyond just tree depth and learning rate.
One of XGBoost's key innovations is its regularized objective function. Unlike standard GBM, XGBoost explicitly adds a regularization term to the loss:
\[ \mathcal{L} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(h_m) \]
Where the regularization term for each tree is:
\[ \Omega(h) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j| \]
Here \( T \) is the number of leaves, \( w_j \) are the leaf weights (predictions), \( \gamma \) penalizes the number of leaves (complexity cost), \( \lambda \) is the L2 regularization on leaf weights, and \( \alpha \) is the L1 regularization on leaf weights.
Minimum loss reduction required to make a further split on a leaf node. A higher gamma makes the algorithm more conservative -- it will only split if the split improves the objective by at least gamma. Effectively controls the minimum gain for a split. Values of 0-5 are typical.
L2 regularization on the leaf weights. Shrinks the leaf weights toward zero, preventing any single leaf from having an extreme value. Default is 1.0. Higher values make the model more conservative.
L1 regularization on the leaf weights. Encourages sparsity by pushing some leaf weights to exactly zero. Useful when you have many features and want implicit feature selection. Default is 0.
Perhaps the most important regularization technique for gradient boosting is early stopping. Instead of pre-specifying the number of trees, you monitor the validation error and stop training when it stops improving:
Without early stopping, gradient boosting will continue adding trees until it perfectly memorizes the training data. Always use early stopping with a held-out validation set. The early_stopping_rounds parameter in XGBoost (typically 10-50) controls how many rounds of no improvement to tolerate before stopping.
Understanding which features drive predictions -- from simple counts to SHAP values.
XGBoost provides three built-in methods to measure feature importance:
The average improvement in the loss function when a feature is used for splitting. This is the most informative metric -- it tells you how much each feature contributes to reducing the training error. Features with higher gain are more important for making accurate predictions.
The average number of samples affected by splits on a feature. A feature with high cover influences the predictions of many training samples, even if its gain per split is moderate.
The number of times a feature is used for splitting across all trees. Simple but can be misleading -- a feature used often with small improvements may appear more important than a feature used rarely with large improvements.
The chart below compares the importance of features in a sample gradient boosting model using the "gain" metric.
While built-in importance metrics are useful, SHAP (SHapley Additive exPlanations) values provide the most theoretically grounded and interpretable feature attributions. They decompose each individual prediction additively:
\[ f(\mathbf{x}) = \phi_0 + \sum_{j=1}^{p} \phi_j(\mathbf{x}) \]
Where \( \phi_0 \) is the base value (average prediction) and \( \phi_j(\mathbf{x}) \) is the SHAP value for feature \( j \) -- the contribution of feature \( j \) to the prediction for this specific instance. SHAP values are based on cooperative game theory (Shapley values) and satisfy desirable properties like local accuracy, consistency, and missingness.
Built-in importance metrics are global and can be inconsistent (a feature's importance can decrease when you make it more useful). SHAP values are consistent, provide both global and local explanations, and show the direction of effect. The shap library has optimized TreeSHAP for gradient boosting models.
XGBoost, LightGBM, and CatBoost -- three implementations, different trade-offs.
Released by Tianqi Chen in 2014, XGBoost became the go-to gradient boosting library for several reasons:
Released by Microsoft in 2017, LightGBM introduced several innovations for speed and scalability:
Released by Yandex in 2017, CatBoost specializes in handling categorical features:
The chart below compares the three major gradient boosting implementations across key dimensions.
Where gradient boosting dominates -- from Kaggle competitions to production ML systems.
XGBoost and LightGBM are the dominant algorithms for tabular data competitions. The majority of winning solutions on structured data use gradient boosting, often blended with other models.
Digital advertising platforms use gradient boosting to predict whether a user will click an ad. Features include user demographics, browsing history, ad content, and contextual signals.
Banks and fintech companies use gradient boosting to detect fraudulent transactions in real time. The ability to handle imbalanced classes and mixed feature types makes it ideal.
Gradient boosting models assess credit risk by analyzing financial history, demographic data, and behavioral patterns. Feature importance provides interpretability required by regulators.
Predicting patient outcomes, disease diagnosis from lab results, and drug response modeling. Gradient boosting handles missing values and mixed data types common in medical data.
Learning to rank products, content, or search results. XGBoost's ranking objective (pairwise and listwise) is widely used in search and recommendation engines.
Experiment with gradient boosting parameters and see how they affect the decision boundary. Adjust the number of estimators and tree depth.
Consistently outperforms other algorithms on structured/tabular datasets. Dominates machine learning competitions and production systems for non-image, non-text data.
Works with numerical, categorical, and ordinal features without extensive preprocessing. Handles missing values natively (in XGBoost and LightGBM).
Provides built-in feature importance metrics. SHAP values offer detailed, per-prediction explanations that satisfy regulatory requirements.
Tree-based models are naturally robust to outliers. Combined with regularization and early stopping, gradient boosting resists overfitting.
Trees are built sequentially, limiting parallelism at the algorithm level. Training can be slower than Random Forest for very large datasets.
Requires careful tuning of learning rate, number of trees, tree depth, regularization, and subsampling parameters. Can be complex for beginners.
Tree-based models cannot extrapolate beyond the range of training data. For predictions outside the training distribution, performance degrades.
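The extrapolation failure is easy to demonstrate: trees are piecewise constant, and every split threshold lies inside the training range, so predictions flatten out beyond it. A sketch on a noiseless linear target:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Train on y = 2x for x in [0, 10]; the true value at x = 15 is 30,
# but the ensemble can only output values seen during training.
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2 * X.ravel()
model = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)

inside = model.predict([[5.0]])[0]   # interpolation: close to the true 10
beyond = model.predict([[15.0]])[0]  # extrapolation: capped near f(10)
far = model.predict([[25.0]])[0]     # lands in the same leaves as x = 15

print(f"f(5)={inside:.1f}  f(15)={beyond:.1f}  f(25)={far:.1f}")
```

Every input past the training maximum falls into the same leaves, so f(15) and f(25) are identical -- a linear model, by contrast, would extrapolate the trend.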
For images, text, and audio, deep learning models (CNNs, Transformers) typically outperform gradient boosting. GB is best suited for tabular/structured data.
From sklearn's GradientBoostingClassifier to XGBoost's native API.
The simplest way to get started with gradient boosting using scikit-learn's built-in implementation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Gradient Boosting Classifier
model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

print("Feature importances (top 5):")
indices = np.argsort(model.feature_importances_)[::-1][:5]
for i in indices:
    print(f"  Feature {i}: {model.feature_importances_[i]:.4f}")
Using XGBoost's native API with early stopping -- the recommended approach for production models.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load and split data: a held-out validation set for early stopping
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Create DMatrix objects (XGBoost's optimized data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test)

# Set parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 4,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_lambda': 1.0,
    'reg_alpha': 0.1,
    'seed': 42
}

# Train with early stopping
model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=30,
    verbose_eval=50
)

# Predict and evaluate
y_pred = (model.predict(dtest) > 0.5).astype(int)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Best iteration: {model.best_iteration}")
Modern hyperparameter optimization using Optuna with XGBoost -- more efficient than grid search.
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 2, 8),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 10.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 5.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'random_state': 42
    }
    model = XGBClassifier(**params, eval_metric='logloss')
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
Using SHAP values for interpretable feature importance with XGBoost.
import shap
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

# Train model
X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names
model = xgb.XGBClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1,
    random_state=42, eval_metric='logloss'
)
model.fit(X, y)

# Compute SHAP values with the optimized TreeSHAP algorithm
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Summary plot (global feature importance with direction of effect)
shap.summary_plot(shap_values, X, feature_names=feature_names)

# Single prediction explanation
shap.force_plot(explainer.expected_value,
                shap_values[0], X[0],
                feature_names=feature_names)
XGBoost offers two APIs: the native xgb.train() API (more control, DMatrix objects) and the sklearn-compatible XGBClassifier/XGBRegressor (easier integration with sklearn pipelines, GridSearchCV). Use the native API for production and the sklearn API for prototyping and when you need sklearn compatibility.