This article gives a systematic introduction to the principles and applications of ensemble learning.

Introduction

What is ensemble learning?

Put simply, it means fusing multiple algorithms. The idea is so simple and direct that one proverb captures it perfectly: "three cobblers with their wits combined equal Zhuge Liang", i.e. many heads are better than one. In practice, ensemble learning combines a variety of algorithms, large and small, and lets them work together on one problem. It is not a single machine learning algorithm in itself; rather, it completes the learning task by building and combining multiple learners.

Ensemble learning can be used for classification, regression, feature selection, anomaly detection and more; you can find it in virtually every area of machine learning.

Using ensemble learning well comes down to two key questions: 1) how to train each learner, and 2) how to combine the learners. Many methods have been proposed around these two questions; the most representative are the well-known bagging and boosting approaches, which are also the main secret behind the success of today's two workhorses, RF (Random Forests) and GBDT (Gradient Boosting Decision Tree).
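As a quick illustration (a sketch only; the toy dataset and parameters are made up, not taken from this article), scikit-learn already ships both ideas: RandomForestClassifier is a bagging-style ensemble and GradientBoostingClassifier is a boosting-style one.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: trees are trained independently on bootstrap samples (Random Forest)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# Boosting: trees are trained sequentially, each correcting its predecessors (GBDT)
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)

print('RF   accuracy: %.4f' % cross_val_score(rf, X, y, cv=3).mean())
print('GBDT accuracy: %.4f' % cross_val_score(gbdt, X, y, cv=3).mean())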

Gradient Boosting is really a framework into which many different algorithms can be plugged; see the discussion in 机器学习与数学3 for details. "Boost" means "to improve": a boosting algorithm is generally an iterative process in which each new round of training tries to improve on the result of the previous one.
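To make the iterative idea concrete, here is a bare-bones boosting loop for squared-loss regression: each new tree is fitted to the residuals of the current ensemble, so every round tries to correct what the previous rounds got wrong (a sketch; the learning rate, tree depth, and toy data are arbitrary).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

pred = np.zeros_like(y)            # start from an all-zero prediction
learning_rate, trees = 0.1, []
for _ in range(50):
    residual = y - pred            # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)   # each tree nudges the prediction toward y
    trees.append(tree)

print('MSE after boosting: %.4f' % np.mean((y - pred) ** 2))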

XGBoost + LR

sklearn-apply

        """Apply trees in the ensemble to X, return leaf indices.

        .. versionadded:: 0.17

        Parameters
        ----------
        X : array-like or sparse matrix, shape = [n_samples, n_features]
            The input samples. Internally, its dtype will be converted to
            ``dtype=np.float32``. If a sparse matrix is provided, it will
            be converted to a sparse ``csr_matrix``.

        Returns
        -------
        X_leaves : array_like, shape = [n_samples, n_estimators, n_classes]
            For each datapoint x in X and for each tree in the ensemble,
            return the index of the leaf x ends up in each estimator.
            In the case of binary classification n_classes is 1.
        """

When working with GBDT (or, similarly, XGBoost), the apply method returns, for each input sample x, the index of the leaf it falls into in every base learner (sub-tree) of the model. For binary (0-1) classification a single tree per stage is enough to separate positive from negative, so no extra class dimension is needed and n_classes = 1.
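To make the returned shape concrete, here is a minimal sketch; it assumes a binary-classification GradientBoostingClassifier named grd that has already been fitted on some matrix X_train (both names are placeholders, not taken from the text above).

# apply() returns one leaf index per sample, per tree; for binary
# classification the trailing class axis has length 1.
leaves = grd.apply(X_train)        # shape: (n_samples, n_estimators, 1)
print(leaves.shape)
leaf_features = leaves[:, :, 0]    # 2-D array, ready for one-hot encoding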

Another thing to note: when stacking the models, the training data must be split appropriately so that overfitting is avoided.

# Adapted from the scikit-learn example "Feature transformations with ensembles
# of trees" (see the references at the end); the imports and data setup are
# filled in so the snippet runs on its own.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

n_estimator = 10
X, y = make_classification(n_samples=80000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# It is important to train the ensemble of trees on a different subset
# of the training data than the linear regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train,
                                                            y_train,
                                                            test_size=0.5)

# Fit the GBDT, one-hot encode its leaf indices, then fit LR on that encoding
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
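For completeness, the prediction step from the same scikit-learn example can be appended: map the test samples to their leaf indices, one-hot encode them with the already-fitted encoder, and let the logistic regression output the final probability.

# Score the test set with the GBDT -> one-hot -> LR pipeline
from sklearn.metrics import roc_auc_score
y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
print('GBDT + LR AUC: %.5f' % roc_auc_score(y_test, y_pred_grd_lm))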

Example

# python 3.5.3 + scikit-learn 0.18.1

from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder
import numpy as np

def gbdt_lr_train(libsvmFileName):

    # load the sample data
    X_all, y_all = load_svmlight_file(libsvmFileName)

    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.3, random_state = 42)

    # define the GBDT model
    gbdt = GradientBoostingClassifier(n_estimators=40, max_depth=3, verbose=0,max_features=0.5)

    # train the GBDT
    gbdt.fit(X_train, y_train)

    # predict and evaluate AUC
    y_pred_gbdt = gbdt.predict_proba(X_test.toarray())[:, 1]
    gbdt_auc = roc_auc_score(y_test, y_pred_gbdt)
    print('gbdt auc: %.5f' % gbdt_auc)

    # LR trained on the raw features
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    # predict and evaluate AUC
    y_pred_test = lr.predict_proba(X_test)[:, 1]
    lr_test_auc = roc_auc_score(y_test, y_pred_test)
    print('LR AUC on the raw features: %.5f' % lr_test_auc)

    # encode the raw features as GBDT leaf indices
    X_train_leaves = gbdt.apply(X_train)[:,:,0]
    X_test_leaves = gbdt.apply(X_test)[:,:,0]

    # one-hot encode all the leaf-index features
    (train_rows, cols) = X_train_leaves.shape

    gbdtenc = OneHotEncoder()
    X_trans = gbdtenc.fit_transform(np.concatenate((X_train_leaves, X_test_leaves), axis=0))

    # define the LR model
    lr = LogisticRegression()
    # train LR on the GBDT-encoded features
    lr.fit(X_trans[:train_rows, :], y_train)
    # predict and evaluate AUC
    y_pred_gbdtlr1 = lr.predict_proba(X_trans[train_rows:, :])[:, 1]
    gbdt_lr_auc1 = roc_auc_score(y_test, y_pred_gbdtlr1)
    print('LR AUC on the GBDT-encoded features: %.5f' % gbdt_lr_auc1)

    # define the LR model
    lr = LogisticRegression(n_jobs=-1)
    # combine the GBDT-encoded features with the raw features
    X_train_ext = hstack([X_trans[:train_rows, :], X_train])
    X_test_ext = hstack([X_trans[train_rows:, :], X_test])

    print(X_train_ext.shape)
    # train LR on the combined features
    lr.fit(X_train_ext, y_train)

    # predict and evaluate AUC
    y_pred_gbdtlr2 = lr.predict_proba(X_test_ext)[:, 1]
    gbdt_lr_auc2 = roc_auc_score(y_test, y_pred_gbdtlr2)
    print('LR AUC on the combined features: %.5f' % gbdt_lr_auc2)


if __name__ == '__main__':
    gbdt_lr_train('data/sample_libsvm_data.txt')   


References

https://blog.csdn.net/dengxing1234/article/details/73739481

https://cloud.tencent.com/developer/article/1061660

XGBoost and Spark in ad ranking:

https://mp.weixin.qq.com/s/4i5O0QlKpWz_5_NA9gdiPA?utm_medium=hao.caibaojian.com&utm_source=hao.caibaojian.com



By this point the main topic of this article is actually finished; what follows is a bonus.
The idea of combining XGBoost with LR comes from a Facebook research paper [5]: the tree model is used to select (transform) features, and LR outputs the final CTR score, making full use of the strengths of both models. In practice, XGBoost + LR beats a standalone XGBoost both in offline evaluation and in online A/B tests. Beyond the gains from the scheme in the paper, we took this stacking idea further and, on the engineering side, pulled the LR layer out as a separate component, which has the following advantages:
1. Non-continuous features, such as ID-type features, can be carried by the LR layer alone.
2. In fact, with a little processing, many models with strong feature extraction can feed the LR layer; handled properly, LR is still very powerful.
3. Combining different feature sources in the LR layer makes it easy to marry "memorization" and "generalization", similar to the wide-and-deep idea.
4. LR's own advantages: it is well suited to large-scale parallelism, its online learning algorithms are mature, and so on.

XGBoost + LR does not go beyond feature engineering

https://cloud.tencent.com/developer/article/1006009

XGBoost + LR has produced good results both in industry and in competitions. But XGBoost leaf nodes cannot fully replace hand-crafted features, and XGBoost + LR never attempts, the way deep learning does, to tell the story of automatic feature engineering. In the end, XGBoost + LR does not move beyond feature engineering.

How does XGBoost handle missing values?

It decides whether a missing value should go to the left or the right subtree.

Inside XGBoost, at every split node the samples whose value of the split variable is missing are routed down the left branch and down the right branch in turn; the effect of each routing on the objective is computed, and the direction (left or right) that lowers the objective more is taken as the default direction for missing values at that node. At prediction time, samples missing that variable are sent in the learned direction. The points below summarize the behaviour, and a small sketch follows them.

1) You can directly feed data in as a sparse matrix that contains only the non-missing values, i.e. features that are not present in the sparse feature matrix are treated as "missing".

2) XGBoost will handle it internally and you do not need to do anything about it.

3) It depends on how you present the data. If you put data in as LIBSVM format and list zero-valued features there, they will not be treated as missing.

4) Internally, XGBoost will automatically learn the best direction to go when a value is missing. Equivalently, this can be viewed as automatically "learning" the best imputation value for missing values based on the reduction in training loss.
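A minimal sketch of how this looks in code (it assumes the xgboost Python package is installed; the toy data and parameters are illustrative only):

import numpy as np
import xgboost as xgb

# np.nan entries are treated as missing values
X = np.array([[1.0, np.nan],
              [2.0, 0.0],
              [np.nan, 3.0],
              [4.0, 1.0]])
y = np.array([0, 0, 1, 1])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
params = {'objective': 'binary:logistic', 'max_depth': 2}
bst = xgb.train(params, dtrain, num_boost_round=5)

# At prediction time, samples with a missing value follow the default
# direction learned for each split.
print(bst.predict(xgb.DMatrix(X, missing=np.nan)))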


References

https://www.zhihu.com/question/58230411

https://blog.csdn.net/vitodi/article/details/59541300

https://www.zhihu.com/question/34867991?sort=created

If one-hot encoding produces too many columns, does that hurt an XGBoost model?

Imbalanced positive and negative samples

After feature construction, the positive-to-negative ratio of the dataset is about 1:1200. The data are severely imbalanced, which can easily make model training fail. Here we address the problem with downsampling and an f1_score-based evaluation metric.

Consider downsampling the negative samples in the training set. To avoid poor coverage of the feature space from purely random sampling, first run k-means clustering on the negative samples (following method (2) suggested by Sergey Feldman), then subsample within each cluster to obtain a well-covered set of negatives, and finally combine them with the positives into a reasonably balanced training set. A sketch of this idea is given after the link below.

https://blog.csdn.net/snoopy_yuan/article/details/75808006
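The clustering-then-subsampling idea can be sketched as follows; the helper function and its parameters are illustrative, not code from the post linked above.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_undersample(X_neg, n_clusters=10, per_cluster=100, random_state=42):
    # Cluster the negatives, then sample from every cluster so that the
    # selected negatives cover the feature space more evenly than a
    # purely random draw would.
    km = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X_neg)
    rng = np.random.RandomState(random_state)
    picked = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        take = min(per_cluster, len(idx))
        picked.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(picked)

# Usage idea: neg_idx = kmeans_undersample(X[y == 0]); combine those negatives
# with all positives, then evaluate with f1_score rather than accuracy.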

References

https://blog.csdn.net/lilyth_lilyth/article/details/48032119

http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html

https://blog.csdn.net/shine19930820/article/details/71713680#python%E5%AE%9E%E7%8E%B0

https://www.zhihu.com/question/39254529

http://www.csdn.net/article/2015-03-02/2824069