Here we will introduce these extensive implementation details, i.e., tricks or tips, for building and training your own deep networks.
The discussion is organized into eight aspects:
1. data augmentation
2. pre-processing of images
3. initialization of networks
4. some tips during training
5. selection of activation functions
6. diverse regularizations
7. insights found from figures and results
8. methods for ensembling multiple deep networks
The aspects are introduced in turn below:
I. Data augmentation
1. There are many ways to do data augmentation, such as the popular horizontal flipping, random crops, and color jittering. Moreover, you could try combinations of multiple processing steps, e.g., doing rotation and random scaling at the same time. In addition, you can try to raise the saturation and value (the S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (the same for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. Also, you could add a value between -0.1 and 0.1 to the hue (the H component of HSV) of all pixels in the image/patch.
2. Krizhevsky et al. [1] proposed fancy PCA. You first perform PCA on the set of RGB pixel values throughout your training images. Then, for each training image, add the following quantity to each RGB image pixel (i.e., $I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T$): $[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$, where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the $3 \times 3$ covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.
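The two augmentation items above can be sketched in NumPy roughly as follows. This is only a minimal illustration, not the code used in [1]: the function names, the `(N, H, W, 3)` float array layout, and drawing the $\alpha_i$ once per image are assumptions made for the example.

```python
import numpy as np

def random_flip_and_crop(img, crop_h, crop_w, rng):
    """Randomly flip an (H, W, 3) image horizontally, then take a random crop."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                   # reverse the width axis
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_h + 1)       # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

def fit_rgb_pca(images):
    """PCA over all RGB pixels of the training set (images: (N, H, W, 3) floats)."""
    pixels = images.reshape(-1, 3)
    pixels = pixels - pixels.mean(axis=0)
    cov = pixels.T @ pixels / pixels.shape[0]   # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # columns of eigvecs are the p_i
    return eigvals, eigvecs

def fancy_pca(img, eigvals, eigvecs, rng, sigma=0.1):
    """Add [p1, p2, p3] [a1*l1, a2*l2, a3*l3]^T to every pixel of one image."""
    alpha = rng.normal(0.0, sigma, size=3)      # assumed: one draw per image
    shift = eigvecs @ (alpha * eigvals)         # a single RGB offset, shape (3,)
    return img + shift

rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))               # toy stand-in for a training set
eigvals, eigvecs = fit_rgb_pca(images)
augmented = fancy_pca(random_flip_and_crop(images[0], 6, 6, rng), eigvals, eigvecs, rng)
```

Because the same offset is added to every pixel, fancy PCA changes the overall color and intensity of the image while leaving its spatial content untouched.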
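II. Pre-processing
1. The first and simplest pre-processing approach is to zero-center the data and then normalize it.
Code:
>>> import numpy as np
>>> # X is assumed to be a float array of shape (N, D), one example per row
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize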
2. Another pre-processing approach, similar to the first one, is PCA whitening.
>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix
>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data
>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the square roots of the eigenvalues (S holds the singular values of cov, which equal its eigenvalues because cov is symmetric positive semi-definite); 1e-5 guards against division by zero
A note on the two approaches above: in practice, PCA and whitening are not used with Convolutional Neural Networks. However, it is still very important to zero-center the data, and it is common to see normalization of every pixel as well.
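As a sanity check on the whitening recipe, the sketch below wraps the same steps in two helper functions and verifies that the whitened training data has approximately identity covariance. The helper names and the toy data are hypothetical. Fitting the statistics on the training split only and reusing them on other data is a common convention that is assumed here, not something stated above.

```python
import numpy as np

def fit_pca_whitening(X_train):
    """Estimate the mean, rotation U and eigenvalues S from the training data."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    U, S, _ = np.linalg.svd(cov)
    return mean, U, S

def apply_pca_whitening(X, mean, U, S, eps=1e-5):
    """Center, rotate into the eigenbasis, and rescale each component."""
    return (X - mean) @ U / np.sqrt(S + eps)

rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])              # correlates the three features
X_train = (rng.standard_normal((2000, 3)) * [3.0, 1.0, 0.5]) @ mix + 10.0
X_test = (rng.standard_normal((500, 3)) * [3.0, 1.0, 0.5]) @ mix + 10.0

mean, U, S = fit_pca_whitening(X_train)
X_train_white = apply_pca_whitening(X_train, mean, U, S)
X_test_white = apply_pca_whitening(X_test, mean, U, S)   # reuse training statistics

cov_white = X_train_white.T @ X_train_white / X_train_white.shape[0]
```

Here `cov_white` comes out close to the identity matrix, which is what "decorrelated with unit variance" means in practice.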
III. Initialization
A common choice is to initialize the weights to small random numbers, i.e., a small constant times $N(0,1)$, where $N(0,1)$ is a zero-mean, unit-standard-deviation Gaussian. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.
One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in (i.e., its number of inputs), as follows:
>>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n)
The initialization above, which calibrates the variances of neurons, does not take ReLUs into account. A more recent paper on this topic by He et al. [4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of neurons in the network should be $2.0/n$, as:
>>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation
[4] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
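The effect of the two scalings can be checked numerically. The following is a rough sketch under simple assumptions (a single fully connected layer, standard normal inputs, and hypothetical sizes `n`, `m`, `batch`). It compares unscaled weights, the $1/\sqrt{n}$ scaling, and the $\sqrt{2/n}$ scaling when the inputs are ReLU outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, batch = 512, 256, 2000          # fan-in, number of units, number of samples

# Case 1: unit-variance inputs feeding a linear unit.
x = rng.standard_normal((batch, n))
W_naive = rng.standard_normal((n, m))
W_scaled = rng.standard_normal((n, m)) / np.sqrt(n)
var_naive = (x @ W_naive).var()       # grows with the fan-in, roughly n
var_scaled = (x @ W_scaled).var()     # stays close to 1

# Case 2: the inputs are ReLU outputs, so about half of them are zero.
a = np.maximum(0.0, rng.standard_normal((batch, n)))
W_he = rng.standard_normal((n, m)) * np.sqrt(2.0 / n)
ms_scaled = np.mean((a @ W_scaled) ** 2)   # roughly 0.5: the signal shrinks
ms_he = np.mean((a @ W_he) ** 2)           # roughly 1: the signal scale is kept
```

The factor of 2 compensates for the ReLU zeroing out about half of its inputs, which halves the mean squared activation passed to the next layer.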
It is important to use small filters (e.g., $3 \times 3$) and small strides (e.g., 1) with zero-padding, which not only reduces the number of parameters but also improves the accuracy of the whole deep network. Meanwhile, the special case mentioned above, i.e., $3 \times 3$ filters with stride 1, can preserve the spatial size of images/feature maps. For the pooling layers, the commonly used pooling size is $2 \times 2$.
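The claim about preserving the spatial size follows from the standard output-size formula for convolution and pooling layers, $(W - F + 2P)/S + 1$. Below is a small sketch with hypothetical helper names. The 3x3 filter with padding 1, the 2x2 pooling with stride 2, and the 64-channel comparison are illustrative values chosen for this example.

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    span = in_size + 2 * padding - filter_size
    if span < 0 or span % stride != 0:
        raise ValueError("filter, stride and padding do not tile the input")
    return span // stride + 1

def conv_weight_count(filter_size, channels):
    """Weights in one conv layer with `channels` input and output channels (no biases)."""
    return filter_size * filter_size * channels * channels

same_size = conv_output_size(32, filter_size=3, stride=1, padding=1)   # 3x3, stride 1, pad 1
pooled = conv_output_size(32, filter_size=2, stride=2, padding=0)      # 2x2 pooling, stride 2

# Three stacked 3x3 layers cover a 7x7 receptive field with fewer weights
# than a single 7x7 layer.
stacked_3x3 = 3 * conv_weight_count(3, 64)
single_7x7 = conv_weight_count(7, 64)
```

With these values, `same_size` equals the input size, `pooled` is half of it, and `stacked_3x3` is smaller than `single_7x7`.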
Figure source: http://cs231n.stanford.edu/index.html
Several activation functions:
Sigmoid:
The sigmoid non-linearity has the mathematical form $\sigma(x) = 1/(1 + e^{-x})$. It takes a real-valued number and "squashes" it into the range between 0 and 1. In particular, large negative numbers become close to 0 and large positive numbers become close to 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). At the maximum value of 1, the neuron is saturated.
The sigmoid has fallen out of favor because of two drawbacks:
(1) Sigmoids saturate and kill gradients.
When the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero. The plot makes this clear: the curve is flat at both ends.
The key problem is that, during backpropagation, this (local) gradient is multiplied by the gradient of this gate's output with respect to the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient, and almost no signal will flow through the neuron to its weights and recursively to its data. In addition, one must take extra care when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large, most neurons become saturated and the network will barely learn.
(2) Sigmoid outputs are not zero-centered.
This is undesirable since neurons in later layers of processing in a neural network would be receiving data that is not zero-centered. This has implications for the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g., $x > 0$ element-wise in $f = w^T x + b$), then the gradient on the weights $w$ will, during back-propagation, become either all positive or all negative (depending on the gradient of the whole expression $f$).
This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data, the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience, but it has much less severe consequences than the saturated activation problem above.
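Both drawbacks can be seen in a few lines of NumPy. This is only an illustrative sketch: the input vector `x` and the upstream gradient value are made-up numbers, and `sigmoid_grad` uses the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Local gradient of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Drawback (1): the local gradient peaks at 0.25 (at x = 0) and is nearly
# zero in both saturated tails.
local_grads = sigmoid_grad(np.array([-10.0, 0.0, 10.0]))

# Drawback (2): for f = w.x + b the gradient on w is (dL/df) * x, so when
# every entry of x is positive, all entries of grad_w share the sign of dL/df.
x = np.array([0.2, 0.7, 0.9])      # e.g. outputs of sigmoid units in the previous layer
upstream = -1.3                    # hypothetical value of dL/df
grad_w = upstream * x
```

Since even the largest local gradient is 0.25, every sigmoid layer scales the backpropagated signal down by at least a factor of four, and by far more once units saturate.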
Tanh(x)
The tanh non-linearity squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.
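The relationship between the two functions can be made concrete: tanh is a scaled and shifted sigmoid, $\tanh(x) = 2\sigma(2x) - 1$. The short check below (a sketch over an arbitrary symmetric grid of inputs) confirms the identity and compares the mean outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)                 # symmetric around zero
identity_holds = np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

sigmoid_mean = sigmoid(x).mean()              # centered at 0.5
tanh_mean = np.tanh(x).mean()                 # centered at 0
```

For inputs that are symmetric around zero, the sigmoid outputs average to 0.5 while the tanh outputs average to 0, which is the zero-centering property described above.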
ReLU:
The Rectified Linear Unit (ReLU) computes the function $f(x) = \max(0, x)$, which is simply thresholded at zero.
There are several pros and cons to using the ReLUs:
(Pros) Compared to sigmoid/tanh neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero. Meanwhile, ReLUs do not saturate for positive inputs.
(Pros) It was found to greatly accelerate (e.g., by a factor of 6 in [1]) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form (an explanation that remains to be verified).
(Cons) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e., neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.
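One way to monitor the "dead" units described above is to count the units whose ReLU output is zero for every example in a dataset. The sketch below is a hypothetical diagnostic, not part of any library. The layer sizes are arbitrary, and the oversized update is simulated by pushing 20 of 50 biases far into the negative range.

```python
import numpy as np

def dead_relu_fraction(X, W, b):
    """Fraction of units whose ReLU output is zero for every row of X."""
    pre = X @ W + b                           # pre-activations, shape (N, units)
    active_somewhere = (pre > 0).any(axis=0)  # True if the unit fires at least once
    return 1.0 - active_somewhere.mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))           # toy dataset
W = rng.standard_normal((20, 50)) * np.sqrt(2.0 / 20)
b = np.zeros(50)
healthy = dead_relu_fraction(X, W, b)

# Simulate the aftermath of a too-large update on 20 of the 50 units.
b_bad = b.copy()
b_bad[:20] = -100.0
damaged = dead_relu_fraction(X, W, b_bad)
```

A unit counted as dead here receives zero gradient on every example, because the ReLU derivative is zero wherever the pre-activation is not positive. Gradient descent alone therefore cannot bring it back.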
Tricks for Deep Neural Networks (translated edition, highlights)
Original article: http://blog.csdn.net/pandav5/article/details/51178032