Stacking (Stacked Generalization) is an ensemble technique in which a "meta-learner" model is trained to combine the predictions of multiple diverse "base learners". Instead of simple averaging or voting, stacking learns WHEN to trust each model (Model A is best for young customers, Model B is best for high-income customers) by using the base models' predictions as input features to a second-level model. It is often the highest-performing ensemble approach and a common winning strategy in Kaggle competitions.
What Is Stacking?
- Definition: A two-level ensemble method where Level 1 consists of diverse base models that generate predictions, and Level 2 is a meta-learner (usually a simple linear model) that takes those predictions as features and learns the optimal way to combine them.
- Why It's Better Than Simple Averaging: Averaging weights every model equally. Stacking can learn to trust the Random Forest more for some kinds of inputs and the Neural Network more for others, capturing conditional expertise that uniform weighting cannot (see the sketch after this list).
- The Key Insight: Different models have different strengths. A linear model might be best for extrapolation, a tree model for capturing interactions, and a neural network for non-linear patterns. Stacking automatically allocates trust based on each model's demonstrated accuracy on different regions of the data.
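To make the contrast with simple averaging concrete, here is a toy sketch, assuming two base models whose predicted probabilities on a validation set are simulated with different noise levels (all names and numbers are illustrative placeholders, not part of any real pipeline):

```python
# Toy contrast between uniform averaging and a learned combination of two
# base models' predicted probabilities (all numbers here are illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)                        # validation labels
p_forest = np.clip(y_val + rng.normal(0, 0.4, 500), 0, 1)   # reliable model
p_linear = np.clip(y_val + rng.normal(0, 0.8, 500), 0, 1)   # noisier model

# Simple averaging fixes both weights at 0.5 regardless of model quality
p_avg = (p_forest + p_linear) / 2

# A stacking meta-learner fits the weights from validation performance instead
meta = LogisticRegression().fit(np.column_stack([p_forest, p_linear]), y_val)
print("learned coefficients:", meta.coef_.round(2))  # typically favors p_forest
```

The meta-learner's coefficients reflect how informative each model's predictions actually were, which is exactly the trust allocation uniform averaging cannot express.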
How Stacking Works
| Step | Process | Detail |
|------|---------|--------|
| 1. Train base models | SVM, Random Forest, Neural Net | Each trained on training data |
| 2. Generate meta-features | Each base model predicts on validation set | 3 models → 3 new features per example |
| 3. Train meta-learner | Logistic Regression on meta-features | Learns optimal combination weights |
| 4. Predict | Base models predict on new data → meta-learner combines | Final ensemble prediction |
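The table above maps to a short from-scratch sketch. The code below is a hypothetical illustration of the four steps using a single held-out validation split to generate the meta-features; the dataset and base-model choices are placeholders, and the out-of-fold variant described in the next section is the more data-efficient approach.

```python
# Minimal from-scratch stacking: train base models, build meta-features on a
# holdout split, fit a meta-learner, then combine predictions on new data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)

# Step 1: train diverse base models on the training split
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(probability=True, random_state=0),
    MLPClassifier(max_iter=500, random_state=0),
]
for model in base_models:
    model.fit(X_train, y_train)

# Step 2: each base model predicts on the held-out split -> 3 meta-features
meta_features = np.column_stack(
    [model.predict_proba(X_hold)[:, 1] for model in base_models]
)

# Step 3: the meta-learner learns how to weight the base predictions
meta_learner = LogisticRegression().fit(meta_features, y_hold)

# Step 4: at prediction time, stack base outputs and let the meta-learner combine them
def stacked_predict(X_new):
    new_meta = np.column_stack(
        [model.predict_proba(X_new)[:, 1] for model in base_models]
    )
    return meta_learner.predict(new_meta)
```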
Preventing Data Leakage in Stacking
The critical mistake: generating meta-features from the same data the base models were trained on. Base models look unrealistically accurate on their own training data, so the meta-learner learns to over-trust those in-sample predictions and generalizes poorly.
Solution: K-Fold Out-of-Fold Predictions
| Fold | Base Model Trains On | Generates Predictions For |
|------|---------------------|--------------------------|
| Fold 1 held out | Folds 2-5 | Fold 1 (out-of-fold predictions) |
| Fold 2 held out | Folds 1, 3-5 | Fold 2 (out-of-fold predictions) |
| ... | ... | ... |
| All folds combined | | Complete set of honest meta-features |
Each training example gets a prediction from a model that never saw it — preventing leakage.
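A minimal sketch of this procedure, assuming scikit-learn's cross_val_predict is used to collect the out-of-fold predictions (the data and base-model choices are illustrative):

```python
# Leak-free meta-features: every training example's prediction comes from a
# fold in which that example was held out.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=2000, n_features=20, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(probability=True, random_state=0),
]

# Out-of-fold predicted probabilities for each base model, one column per model
oof_meta_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The meta-learner is trained only on these honest predictions
meta_learner = LogisticRegression().fit(oof_meta_features, y_train)

# The base models are then refit on the full training set for inference
for model in base_models:
    model.fit(X_train, y_train)
```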
Common Stacking Architectures
| Base Models (Level 1) | Meta-Learner (Level 2) | Use Case |
|-----------------------|----------------------|----------|
| LR, RF, XGBoost, SVM | Logistic Regression | Standard stacking |
| LightGBM, CatBoost, Neural Net | Ridge Regression | Kaggle competitions |
| Multiple fine-tuned BERTs | Linear combination | NLP tasks |
| ResNet, EfficientNet, ViT | Simple MLP | Computer vision |
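The same pattern extends to regression. Here is a hedged sketch of the competition-style row using scikit-learn's StackingRegressor with a Ridge meta-learner; GradientBoostingRegressor and SVR stand in for the LightGBM/CatBoost base models typical in those pipelines, and the dataset is a placeholder.

```python
# Regression stacking: boosted-tree and kernel base models combined by Ridge.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)

reg_stacker = StackingRegressor(
    estimators=[
        ('gbr', GradientBoostingRegressor(random_state=0)),
        ('svr', SVR()),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,  # out-of-fold meta-features, same idea as in classification
)
reg_stacker.fit(X, y)
```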
Python Implementation
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data so the snippet runs end to end
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacker = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('svm', SVC(probability=True)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions (prevents leakage)
)
stacker.fit(X_train, y_train)
```
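Continuing from the snippet above, the fitted stack exposes its base models through the named_estimators_ attribute, which makes it easy to check whether the meta-learner actually adds value on the held-out split:

```python
# Compare the stacked ensemble with each fitted base model on held-out data
# (continues from the snippet above; X_test and y_test come from the split there).
print("stacked ensemble:", stacker.score(X_test, y_test))
for name, model in stacker.named_estimators_.items():
    print(f"{name:16s}:", model.score(X_test, y_test))
```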
Stacking is the most powerful ensemble technique for combining diverse models: a meta-learner learns the conditional weighting of base model predictions, capturing when each model is most trustworthy. It consistently achieves top results in competitions and fits production systems where maximizing accuracy justifies the added complexity of a multi-model pipeline.