
scikit-learn:4.3. Preprocessing data(standardi/normali/binari..zation、encoding、missing value)

Published: 2015-07-23 09:29:56


Reference: http://scikit-learn.org/stable/modules/preprocessing.html


This post covers the utility functions and transformer classes of the sklearn.preprocessing package, including standardization, normalization, binarization, encoding categorical features, and handling missing values.


1. Standardization, or mean removal and variance scaling

Standardization means transforming each feature so that it looks like standard normally distributed data (a Gaussian with zero mean and unit variance). The point is that all features end up on a comparable scale, so that no single feature with a much larger variance dominates the estimator's objective.

(Before the details, see this further discussion on the importance of centering and scaling data: http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)


In practice, we usually ignore the exact shape of the distribution and simply center each feature by subtracting its mean, then scale it by dividing by its standard deviation.
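That recipe can be written directly in NumPy. A minimal sketch, using the same toy matrix as the examples below (only the shape of the computation matters here):

```python
import numpy as np

# Toy 3x3 feature matrix; each column is one feature.
X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Standardize column-wise: subtract each feature's mean,
# then divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Each column now has zero mean and unit variance.
print(X_scaled.mean(axis=0))  # ~[0. 0. 0.]
print(X_scaled.std(axis=0))   # [1. 1. 1.]
```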

This section mainly covers the scale function, the StandardScaler class, and MinMaxScaler.

The function scale provides a quick and easy way to perform this operation on a single array-like dataset:

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled                                          
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

Scaled data has zero mean and unit variance:

>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])

>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])

The preprocessing module further provides a utility class StandardScaler, which computes the mean and standard deviation on a training set and then applies the same transformation to the test set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:

>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)

>>> scaler.mean_                                      
array([ 1. ...,  0. ...,  0.33...])

>>> scaler.std_                                       
array([ 0.81...,  0.81...,  1.24...])

>>> scaler.transform(X)                               
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

The scaler instance can then be used on new data to transform it the same way it did on the training set:

>>> scaler.transform([[-1.,  1., 0.]])                
array([[-2.44...,  1.22..., -0.26...]])
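Because the scaler learns its statistics in fit and reapplies them in transform, it slots naturally into a Pipeline. A minimal sketch, where the LogisticRegression estimator and the toy labels are illustrative assumptions, not part of the original example:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
y_train = np.array([0, 1, 0])  # made-up labels for illustration

# The scaler is fit on the training data inside the pipeline,
# and the same transformation is reused automatically at predict time.
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
pred = pipe.predict([[-1., 1., 0.]])
```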

It is possible to disable either centering or scaling by passing with_mean=False or with_std=False to the constructor of StandardScaler.
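One common reason to do this: centering a scipy.sparse matrix would destroy its sparsity, so for sparse input only scaling is applied. A small sketch, with a made-up sparse matrix for illustration:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse input; subtracting the mean would densify it.
X_sparse = sparse.csr_matrix([[1., 0., 2.],
                              [0., 0., 3.],
                              [4., 5., 6.]])

# with_mean=False: scale each column by its std, but skip centering,
# so zero entries stay zero and the matrix stays sparse.
scaler = StandardScaler(with_mean=False).fit(X_sparse)
X_scaled = scaler.transform(X_sparse)
```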


As its name suggests, MinMaxScaler scales features to a range between a given min and max, most commonly [0, 1]. The motivations include robustness to features with very small standard deviations and preserving zero entries in sparse data.

The transformation is computed as (min and max being the bounds of the desired feature_range):

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

(As the formula shows, the scaling is applied independently to each column, so the same transformation also works on a 1d array; for regression tasks, one can consider scaling the target variable this way.)

Scaling to [0, 1], for example:

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

If, instead of calling fit_transform on the training set, you call fit, the same transformation can then be applied to the test set:

>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])
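MinMaxScaler also accepts an explicit feature_range when [0, 1] is not what you want. A short sketch scaling the same training data to [-1, 1] (the range chosen here is just an example):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Map each feature's observed [min, max] onto [-1, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)
# On the training data, every column now spans exactly [-1, 1].
```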

(Besides centering and scaling, some models also assume linear independence of the features. To remove linear correlation between features, e.g. when applying PCA to images, see sklearn.decomposition.PCA or sklearn.decomposition.RandomizedPCA with whiten=True.)
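A quick sketch of what whitening does, using synthetic correlated data made up for illustration: after PCA(whiten=True), the transformed components are uncorrelated with unit variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Two strongly correlated features: the second column is a noisy
# copy of the first.
x = rng.randn(200)
X = np.column_stack([x, x + 0.1 * rng.randn(200)])

# whiten=True rescales each principal component to unit variance,
# so the output covariance is (approximately) the identity matrix.
X_white = PCA(whiten=True).fit_transform(X)
cov = np.cov(X_white.T)
```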



To be continued...








Copyright notice: this is the author's original post; please do not repost without permission.



Original post: http://blog.csdn.net/mmc2015/article/details/47016313
