
scikit-learn:4.3. Preprocessing data(standardi/normali/binari..zation、encoding、missing value)

Published: 2015-07-23 09:29:56


Reference: http://scikit-learn.org/stable/modules/preprocessing.html


This post covers the utility functions and transformer classes of the sklearn.preprocessing package, including standardization, normalization, binarization, encoding categorical features, and handling missing values.


1. Standardization, or mean removal and variance scaling

Standardization means transforming each feature so that it looks like standard normally distributed data (a Gaussian with zero mean and unit variance). The point is that all features end up on a comparable scale, so that no single feature with a much larger variance dominates the estimator's objective.

(Before the details, see this further discussion on the importance of centering and scaling data: http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)


In practice, we usually ignore the exact shape of the distribution and simply center each feature by subtracting its mean, then scale it by dividing by its standard deviation.
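That recipe can be written directly in NumPy. A minimal sketch, using the same toy matrix as the examples below (only the shape of the computation matters here):

```python
import numpy as np

# Toy 3x3 feature matrix; each column is one feature.
X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Standardize column-wise: subtract each feature's mean,
# then divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Each column now has zero mean and unit variance.
print(X_scaled.mean(axis=0))  # ~[0. 0. 0.]
print(X_scaled.std(axis=0))   # [1. 1. 1.]
```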

This section mainly covers the scale function, the StandardScaler class, and MinMaxScaler.

The function scale provides a quick and easy way to perform this operation on a single array-like dataset:

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled                                          
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

Scaled data has zero mean and unit variance:

>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])

>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])

The preprocessing module further provides a utility class StandardScaler, which computes the mean and standard deviation on a training set and then applies the same transformation to the test set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:

>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)

>>> scaler.mean_                                      
array([ 1. ...,  0. ...,  0.33...])

>>> scaler.std_                                       
array([ 0.81...,  0.81...,  1.24...])

>>> scaler.transform(X)                               
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

The scaler instance can then be used on new data to transform it the same way it did on the training set:

>>> scaler.transform([[-1.,  1., 0.]])                
array([[-2.44...,  1.22..., -0.26...]])
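Because the scaler learns its statistics in fit and reapplies them in transform, it slots naturally into a Pipeline. A minimal sketch, where the LogisticRegression estimator and the toy labels are illustrative assumptions, not part of the original example:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
y_train = np.array([0, 1, 0])  # made-up labels for illustration

# The scaler is fit on the training data inside the pipeline,
# and the same transformation is reused automatically at predict time.
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
pred = pipe.predict([[-1., 1., 0.]])
```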

It is possible to disable either centering or scaling by passing with_mean=False or with_std=False to the constructor of StandardScaler.
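One common reason to do this: centering a scipy.sparse matrix would destroy its sparsity, so for sparse input only scaling is applied. A small sketch, with a made-up sparse matrix for illustration:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse input; subtracting the mean would densify it.
X_sparse = sparse.csr_matrix([[1., 0., 2.],
                              [0., 0., 3.],
                              [4., 5., 6.]])

# with_mean=False: scale each column by its std, but skip centering,
# so zero entries stay zero and the matrix stays sparse.
scaler = StandardScaler(with_mean=False).fit(X_sparse)
X_scaled = scaler.transform(X_sparse)
```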


As its name suggests, MinMaxScaler scales features to a range between a given min and max, most commonly [0, 1]. The motivations include robustness to features with very small standard deviations and preserving zero entries in sparse data.

The transformation is computed as (min and max being the bounds of the desired feature_range):

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

(As the formula shows, the scaling is applied independently to each column, so the same transformation also works on a 1d array; for regression tasks, one can consider scaling the target variable this way.)

Scaling to [0, 1], for example:

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

If, instead of calling fit_transform on the training set, you call fit, the same transformation can then be applied to the test set:

>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])
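MinMaxScaler also accepts an explicit feature_range when [0, 1] is not what you want. A short sketch scaling the same training data to [-1, 1] (the range chosen here is just an example):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Map each feature's observed [min, max] onto [-1, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)
# On the training data, every column now spans exactly [-1, 1].
```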

(Besides centering and scaling, some models also assume linear independence of the features. To remove linear correlation between features, e.g. when applying PCA to images, see sklearn.decomposition.PCA or sklearn.decomposition.RandomizedPCA with whiten=True.)
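A quick sketch of what whitening does, using synthetic correlated data made up for illustration: after PCA(whiten=True), the transformed components are uncorrelated with unit variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Two strongly correlated features: the second column is a noisy
# copy of the first.
x = rng.randn(200)
X = np.column_stack([x, x + 0.1 * rng.randn(200)])

# whiten=True rescales each principal component to unit variance,
# so the output covariance is (approximately) the identity matrix.
X_white = PCA(whiten=True).fit_transform(X)
cov = np.cov(X_white.T)
```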



To be continued...








Copyright notice: this is the author's original post; please do not repost without permission.



Original post: http://blog.csdn.net/mmc2015/article/details/47016313
