6.3. Preprocessing data
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on a dataset containing marginal outliers are highlighted in Compare the effect of different scalers on data with outliers.
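As a rough illustrative sketch (the data below is made up, not taken from the guide), a robust scaler bases its statistics on the median and interquartile range, so a single extreme value distorts its output far less than it distorts a mean/variance-based scaler:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical 1-D feature with one extreme outlier (the value 100.)
X = np.array([[1.], [2.], [3.], [4.], [100.]])

# StandardScaler uses the mean and standard deviation, both of which
# are pulled strongly toward the outlier.
print(StandardScaler().fit_transform(X).ravel())

# RobustScaler centers on the median and scales by the interquartile
# range, so the inlier values keep a sensible spread.
print(RobustScaler().fit_transform(X).ravel())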
6.3.1. Standardization, or mean removal and variance scaling
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
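A minimal NumPy sketch of this center-and-scale operation, using made-up feature values on very different scales, might look as follows:

import numpy as np

# Hypothetical feature matrix: 3 samples, 2 features on different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Center each feature by subtracting its mean, then scale by its
# standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0))  # approximately [0., 0.]
print(X_scaled.std(axis=0))   # approximately [1., 1.]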
The preprocessing module provides the StandardScaler utility class, which is a quick and easy way to perform the following operation on an array-like dataset:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
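Continuing this snippet, the fitted scaler exposes the statistics it computed (mean_ and scale_) and applies the transformation with transform; a brief follow-up sketch:

# Per-feature statistics learned from X_train during fit.
print(scaler.mean_)    # per-feature means
print(scaler.scale_)   # per-feature standard deviations

# Applying the transformation yields columns with zero mean and unit variance.
X_scaled = scaler.transform(X_train)
print(X_scaled.mean(axis=0))  # approximately [0. 0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1. 1.]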