Class Imbalance: Undersampling

Posted: 2018-05-22 22:14:03

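Class imbalance refers to classification tasks in which the number of training examples differs greatly between classes.

There are three common remedies: 1. undersampling, 2. oversampling, and 3. threshold moving.

In the project I have been working on for the past few days, the target is positive in fewer than 4% of the records and there is plenty of data, so I chose undersampling:

Undersampling removes some negative examples so that the numbers of positive and negative examples are close, and then trains on the result. The basic algorithm is as follows: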

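from sklearn.utils import shuffle

def undersampling(train, desired_apriori):
    # train: DataFrame with a binary `target` column
    # desired_apriori: desired share of target=1 records after undersampling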
    # Get the indices per target value
    idx_0 = train[train.target == 0].index
    idx_1 = train[train.target == 1].index
    # Get original number of records per target value
    nb_0 = len(train.loc[idx_0])
    nb_1 = len(train.loc[idx_1])
    # Calculate the undersampling rate and resulting number of records with target=0
    undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
    undersampled_nb_0 = int(undersampling_rate*nb_0)
    print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
    print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
    # Randomly select records with target=0 to get at the desired a priori
    undersampled_idx = shuffle(idx_0, n_samples=undersampled_nb_0)
    # Construct list with remaining indices
    idx_list = list(undersampled_idx) + list(idx_1)
    # Return undersample data frame
    train = train.loc[idx_list].reset_index(drop=True)

    return train
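
The `undersampling_rate` line can be derived by requiring that positives make up a fraction $p$ (`desired_apriori`) of the resampled data. The symbols here are introduced only for this sketch: $n_0$ and $n_1$ are the original counts of target=0 and target=1 records, and $r$ is the fraction of target=0 records that is kept.

```latex
\[
p = \frac{n_1}{n_1 + r\,n_0}
\quad\Longrightarrow\quad
r\,n_0 = \frac{(1-p)\,n_1}{p}
\quad\Longrightarrow\quad
r = \frac{(1-p)\,n_1}{p\,n_0}
\]
```

As a made-up example, with $n_0 = 9600$, $n_1 = 400$ and $p = 0.10$ this gives $r = 360/960 = 0.375$, so 3600 negatives are kept and positives become $400/4000 = 10\%$ of the data.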

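This was written for a specific project, so the class being undersampled is the negative class (target == 0). Reusing it elsewhere will require some changes.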
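
One possible way to make those changes is sketched below: the majority class is picked by count instead of being hard-coded as 0. The function name `undersample_majority` and its parameters (`target_col`, `desired_minority_share`, `seed`) are hypothetical names chosen for this sketch, and it assumes a binary target and a unique DataFrame index.

```python
import numpy as np
import pandas as pd


def undersample_majority(df, target_col="target", desired_minority_share=0.5, seed=None):
    """Randomly drop majority-class rows until the minority class makes up
    roughly `desired_minority_share` of the result (binary targets only)."""
    counts = df[target_col].value_counts()  # sorted by count, largest first
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    majority_label, minority_label = counts.index[0], counts.index[1]
    n_major, n_minor = int(counts.iloc[0]), int(counts.iloc[1])

    # Same formula as in undersampling(), with the roles decided by class counts
    n_keep = int(round((1 - desired_minority_share) * n_minor / desired_minority_share))
    n_keep = min(n_keep, n_major)  # cannot keep more rows than exist

    rng = np.random.default_rng(seed)
    major_idx = df.index[df[target_col] == majority_label].to_numpy()
    minor_idx = df.index[df[target_col] == minority_label].to_numpy()
    kept_major = rng.choice(major_idx, size=n_keep, replace=False)

    keep = np.concatenate([kept_major, minor_idx])
    return df.loc[keep].reset_index(drop=True)


# Toy data (made up for illustration): 90 negatives, 10 positives
toy = pd.DataFrame({"x": range(100), "target": [0] * 90 + [1] * 10})
balanced = undersample_majority(toy, desired_minority_share=0.5, seed=0)
print(balanced["target"].value_counts().to_dict())
```

The `min(...)` guard is a design choice of this sketch: if the requested share is already lower than the minority's current share, all majority rows are kept instead of raising an error.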

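If undersampling discards negative examples at random, it may lose important information. To address this, Zhou Zhihua's lab proposed the undersampling algorithm EasyEnsemble: using ensemble learning, the negative examples are divided into several sets for different learners to use. Each learner then sees undersampled data, yet taken as a whole no important information is lost. This only takes a small change on top of the basic undersampling method: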

def easyensemble(df, desired_apriori, n_subsets=10):
    train_resample = []
    for _ in range(n_subsets):
        sel_train = undersampling(df, desired_apriori)
        train_resample.append(sel_train)
    return train_resample
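
The list returned by `easyensemble` still has to be turned into a classifier. A minimal sketch of one way to do that follows: fit one model per subset and average the predicted positive-class probabilities. This averaging is a simplification chosen for the sketch and is not necessarily how the paper combines its learners. The helper names `fit_on_subsets` and `predict_proba_mean` are hypothetical, and `make_model` is assumed to be any zero-argument factory returning an object with scikit-learn style `fit` and `predict_proba` methods.

```python
import numpy as np


def fit_on_subsets(subsets, make_model, target_col="target"):
    """Fit one fresh model on each undersampled DataFrame in `subsets`."""
    models = []
    for sub in subsets:
        X = sub.drop(columns=[target_col])
        y = sub[target_col]
        model = make_model()  # new, unfitted model for every subset
        model.fit(X, y)
        models.append(model)
    return models


def predict_proba_mean(models, X):
    """Average the positive-class probability over all fitted models."""
    # predict_proba is assumed to return one column per class,
    # with column 1 corresponding to target == 1
    probs = [np.asarray(m.predict_proba(X))[:, 1] for m in models]
    return np.mean(probs, axis=0)
```

With scikit-learn installed, `make_model` could for instance be `lambda: AdaBoostClassifier()` from `sklearn.ensemble`, called as `fit_on_subsets(easyensemble(df, 0.5), make_model)`.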

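For a closer look, the figure below is the algorithm description from the original paper, Exploratory Undersampling for Class-Imbalance Learning:
[Figure: the EasyEnsemble algorithm as presented in the paper]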

Original article: https://www.cnblogs.com/bjwu/p/9073937.html
