Ensemble learning

An ensemble learning method is one that combines several models and aggregates their results, for example by averaging or voting, usually obtaining better performance than any single model.

Diversity implies that the learning models are independent and that their errors are not correlated.

This post is an introduction to ensemble learning.

Diversity

Ways to achieve diversity:

  • Use different training sets
  • Use different learning algorithms
  • Use different explanatory variables
  • Use different output

Methods

Ensemble learning methods:

  • Voting
  • Bagging
  • Pasting
  • Random forest
  • Boosting
  • Stacking

Voting

Voting trains different learning algorithms on the same data and then returns the result that is predicted most often.

Class label voting types for classification:

  • Plurality
  • Majority
  • Unanimity
  • Weighting

Plurality looks for the mode, i.e. the most repeated prediction.

Voting types for regression are mean and median.
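
As a minimal sketch, this is how hard (plurality) voting could look with scikit-learn’s VotingClassifier; the chosen estimators and the generated data are illustrative assumptions:

```python
# Hard voting: each classifier votes and the mode wins (plurality).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=500, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("svc", SVC()),
        ("tree", DecisionTreeClassifier()),
    ],
    voting="hard",  # plurality vote; "soft" averages predicted probabilities instead
)
voting_clf.fit(X_train, y_train)
```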

Pasting

Pasting applies the same training algorithm to random subsets of the training set, sampled without replacement.

Replacement, in the context of statistical learning, means that the training set may include the same data instance more than once.

Because it can be parallelized, it scales well.
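
A minimal sketch of pasting with scikit-learn’s BaggingClassifier (the data and hyperparameters are illustrative); note the bootstrap=False flag, which requests sampling without replacement:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=500, random_state=42)

pasting_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,   # each predictor sees a random 80% of the training set
    bootstrap=False,   # without replacement, i.e. pasting
    n_jobs=-1,         # train the predictors in parallel
)
pasting_clf.fit(X_train, y_train)
```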

Bagging

Bootstrap aggregation, or bagging, trains models on random subsets sampled with replacement.

It usually combines models of the same type.

Bagging introduces a bit more diversity in the training subsets, so the subsets differ more from each other than the ones produced by pasting.

As the subset data may be distorted (because of duplicated or omitted instances), bagging ends up with a slightly higher bias than pasting.

On the other hand, the extra diversity also means that the predictors end up being less correlated, i.e. there are more differences and disagreements between them. Averaging may then improve the results, so the ensemble’s variance is reduced.

Bagging can be explained intuitively with the sentence “slightly worse trees, but much better forest“.

Because it can be parallelized, it scales well.
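
The pasting sketch above turns into bagging by flipping a single flag; again, the hyperparameters are illustrative:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,  # with replacement, i.e. bagging
    n_jobs=-1,
)
```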

Examples of algorithms using bagging:

  • Bagged Decision Trees
  • Random Forest
  • Extra Trees

Out-of-bag evaluation

Out-of-bag (OOB) evaluation is a method to evaluate a bagging model’s performance.

On average, each bootstrap sample contains about 63% of the unique training instances, so about 37% are left out. The reason is that the probability of an instance never being sampled in m draws with replacement is (1 − 1/m)^m, which approaches exp(−1) ≈ 37% as m grows; the remaining 1 − exp(−1) ≈ 63% are sampled at least once.

If there are enough estimators, then each instance in the training set will likely be an OOB instance of several estimators.

Each left-out instance can be used to test only the predictors that didn’t see it during training, comparing the averaged prediction against the true value. The overall error is then computed from the per-instance errors.
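
As a sketch, scikit-learn’s BaggingClassifier can do this automatically with oob_score=True (the data and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=500, random_state=42)

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,  # evaluate each instance only on the predictors that never saw it
    n_jobs=-1,
)
bagging_clf.fit(X_train, y_train)
print(bagging_clf.oob_score_)  # accuracy estimate without a separate validation set
```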

The random patches method samples both training instances and features.

The random subspaces method samples only features, keeping all training instances.
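
Both sampling schemes can be sketched with the same estimator, assuming illustrative sampling ratios:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random patches: sample training instances AND features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    max_samples=0.7, bootstrap=True,            # sample instances
    max_features=0.7, bootstrap_features=True,  # sample features
)

# Random subspaces: keep all instances, sample only features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    max_samples=1.0, bootstrap=False,           # all instances
    max_features=0.7, bootstrap_features=True,  # sample features
)
```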

Random forest

You can read this post about random forest.

Boosting

Hypothesis boosting, or simply boosting (in Spanish, intensificación), refers to any ensemble method that combines several learners sequentially, each one trying to correct its predecessor.

It usually combines models of the same type.

Unlike bagging, it doesn’t resample the training set: each predictor is trained on the full set.

Boosting methods:

  • AdaBoost
  • Gradient boosting machines
  • XGBoost

Adaptive boosting (AdaBoost) is a boosting method where the instances most underfitted by the previous predictor get more weight when training the next predictor.

Once all predictors are trained, each one is assigned a weight depending on its overall accuracy, and the ensemble’s prediction is a weighted combination of their predictions.

The training is sequential, so it doesn’t scale as well as bagging or pasting.
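
A minimal AdaBoost sketch with scikit-learn, using decision stumps as the weak learners (the data and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=500, random_state=42)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a decision stump as the weak learner
    n_estimators=200,
    learning_rate=0.5,
)
ada_clf.fit(X_train, y_train)
```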


Gradient boosting is a boosting method where each new predictor is fitted to the residual errors made by the previous one.

Histogram-based gradient boosting (HGB) speeds up training on large data sets by binning the input features, as in histograms.
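
The residual-fitting idea can be sketched by hand with plain decision trees (the toy data set is an illustrative assumption); scikit-learn’s GradientBoostingRegressor and HistGradientBoostingRegressor implement the same principle:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((100, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(scale=0.05, size=100)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals1 = y - tree1.predict(X)              # errors of the first predictor
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals1)
residuals2 = residuals1 - tree2.predict(X)     # errors left after two predictors
tree3 = DecisionTreeRegressor(max_depth=2).fit(X, residuals2)

# The ensemble predicts by summing the trees' predictions
y_pred = sum(tree.predict(X) for tree in (tree1, tree2, tree3))
```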

Bibliography:

  • GÉRON, Aurélien. Chapter 7 “Ensemble Learning and Random Forests”, section “Boosting”. In: HOML. 3rd ed. O’Reilly.
  • SCHAPIRE, Robert E., FREUND, Yoav. Boosting: Foundations and Algorithms. The MIT Press, 2012.

Stacking

Stacked generalization, or stacking (in Spanish, apilamiento), uses a blending predictor to aggregate the predictions made by the base predictors.

It combines two groups of models:

  • A group of primary models, the level 0 models, each of a different type.
  • A supervised level 1 model, the blender, that learns to combine the primary models’ predictions.

The blending predictor takes as input features the different predictions of the base predictors (produced by applying k-fold cross-validation), together with the expected values from the original data set. It is then fitted to learn which base predictor to trust in which situation.
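
A minimal stacking sketch with scikit-learn’s StackingClassifier (the level 0 models and data are illustrative); the cv parameter controls the k-fold used to produce the blender’s training inputs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=500, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[                 # level 0 models, each of a different type
        ("rf", RandomForestClassifier()),
        ("svc", SVC()),
    ],
    final_estimator=LogisticRegression(),  # level 1 blender
    cv=5,  # k-fold used to generate the base predictions the blender learns from
)
stacking_clf.fit(X_train, y_train)
```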

Algorithms:

  • Stacked Generalization
  • Blending Ensemble
  • Super Learner Ensemble

Bibliography

  • Ensemble learning [online]. Wikipedia.
  • GÉRON, Aurélien. Chapter 7 “Ensemble Learning and Random Forests”. In: HOML. 3rd ed. O’Reilly.
  • ZHOU, Zhi-Hua. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.

Related entries

  • Machine learning algorithms
  • Statistical learning
