1. Regularization and the Bias-Variance Decomposition
Regularization: strategies for reducing variance.
The error can be decomposed into the sum of bias, variance, and noise: error = bias + variance + noise.
Bias measures how far the learning algorithm's expected prediction deviates from the true value; it characterizes the fitting ability of the algorithm itself.
Variance measures how much the performance changes when equally sized training sets vary; it characterizes the effect of data perturbations.
Noise expresses the lower bound of the expected generalization error that any learning algorithm can reach on the current task.
High bias: the model is not complex enough and its fitting capacity is insufficient (underfitting).
High variance: poor generalization (overfitting).
Regularization refers to methods that strengthen a model's ability to generalize.
Objective function:
$Obj = Cost + \text{Regularization Term}$
L1 regularization term: $\sum_{i}^{N}\left|w_{i}\right|$
L2 regularization term: $\sum_{i}^{N} w_{i}^{2}$
The regularization term constrains the parameters to stay small; a model with small weights has a larger tolerance to perturbations of the input data, i.e. better generalization.
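As a rough sketch (not from the original text), such a penalty can be added to the loss by hand in PyTorch; the model, loss, and λ below are illustrative placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)          # stand-in model, just for illustration
loss_fn = nn.MSELoss()
lambda_reg = 1e-2                 # hypothetical regularization strength

x, y = torch.randn(8, 10), torch.randn(8, 1)
cost = loss_fn(model(x), y)

l1_term = sum(p.abs().sum() for p in model.parameters())   # sum_i |w_i|
l2_term = sum(p.pow(2).sum() for p in model.parameters())  # sum_i w_i^2

obj = cost + lambda_reg * l1_term   # Obj = Cost + Regularization Term (L1 here)
obj.backward()
```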
Besides this, there are many other regularization methods, and they do not necessarily act on the objective function.
2. Weight Decay Regularization
L2 regularization = weight decay.
In fact, "weight decay" is simply another name for the L2 regularization term.
Objective without the L2 term: $Obj = Loss$
Objective with the L2 term: $Obj = Loss + \frac{\lambda}{2}\sum_{i}^{N} w_{i}^{2}$
Parameter update without the L2 term (the learning rate is taken as 1 for simplicity):
$w_{i+1} = w_{i} - \frac{\partial Obj}{\partial w_{i}} = w_{i} - \frac{\partial Loss}{\partial w_{i}}$
Parameter update with the L2 term:
$w_{i+1} = w_{i} - \frac{\partial Obj}{\partial w_{i}} = w_{i} - \left(\frac{\partial Loss}{\partial w_{i}} + \lambda w_{i}\right) = w_{i}(1-\lambda) - \frac{\partial Loss}{\partial w_{i}}$
As can be seen, the effect of the L2 term is that at every update the parameter is first decayed by a factor of $(1-\lambda)$ and only then the gradient of the loss is subtracted, so the parameters keep shrinking as the iterations go on.
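In PyTorch this is what the optimizer's weight_decay argument implements. A minimal check (a toy one-parameter example with no momentum, not part of the original experiment) that SGD's weight_decay adds λ·w to the gradient before the update:

```python
import torch

lam, lr = 0.1, 0.1
w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=lr, weight_decay=lam)

loss = (w * 2).sum()     # dLoss/dw = 2
loss.backward()
optimizer.step()         # w <- w - lr * (dLoss/dw + lam * w)

manual = 1.0 - lr * (2.0 + lam * 1.0)
print(w.item(), manual)  # both 0.79 (up to float precision)
```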
Experiment:
```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from tools.common_tools import set_seed
from torch.utils.tensorboard import SummaryWriter

set_seed(1)  # reproducibility

n_hidden = 200
max_iter = 2000
disp_interval = 200
lr_init = 0.01


def gen_data(num_data=10, x_range=(-1, 1)):
    """Generate a tiny noisy linear dataset: y = 1.5x + noise."""
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w * train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w * test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y


train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))


class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)


net_normal = MLP(neural_num=n_hidden)
net_weight_decay = MLP(neural_num=n_hidden)

# two identical optimizers; only the second one uses weight decay (L2 regularization)
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9, weight_decay=1e-2)

loss_func = torch.nn.MSELoss()

writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")

for epoch in range(max_iter):

    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_wdecay.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_wdecay.step()

    if (epoch + 1) % disp_interval == 0:

        # log parameter and gradient histograms to TensorBoard
        for name, layer in net_normal.named_parameters():
            writer.add_histogram(name + '_grad_normal', layer.grad, epoch)
            writer.add_histogram(name + '_data_normal', layer, epoch)

        for name, layer in net_weight_decay.named_parameters():
            writer.add_histogram(name + '_grad_weight_decay', layer.grad, epoch)
            writer.add_histogram(name + '_data_weight_decay', layer, epoch)

        test_pred_normal, test_pred_wdecay = net_normal(test_x), net_weight_decay(test_x)

        # plot the fitted curves of both networks
        plt.scatter(train_x.data.numpy(), train_y.data.numpy(), c='blue', s=50, alpha=0.3, label='train')
        plt.scatter(test_x.data.numpy(), test_y.data.numpy(), c='red', s=50, alpha=0.3, label='test')
        plt.plot(test_x.data.numpy(), test_pred_normal.data.numpy(), 'r-', lw=3, label='no weight decay')
        plt.plot(test_x.data.numpy(), test_pred_wdecay.data.numpy(), 'b--', lw=3, label='weight decay')
        plt.text(-0.25, -1.5, 'no weight decay loss={:.6f}'.format(loss_normal.item()),
                 fontdict={'size': 15, 'color': 'red'})
        plt.text(-0.25, -2, 'weight decay loss={:.6f}'.format(loss_wdecay.item()),
                 fontdict={'size': 15, 'color': 'red'})

        plt.ylim((-2.5, 2.5))
        plt.legend(loc='upper left')
        plt.title("Epoch: {}".format(epoch + 1))
        plt.show()
        plt.close()
```
matplotlib output:
As can be seen, L2 regularization effectively prevents overfitting and improves the model's generalization.
Below are the TensorBoard visualizations:
The first set shows the parameters without weight decay: as the iterations go on, the parameters stay spread over roughly [-1, 1].
The second set shows the parameters with weight decay: as the iterations go on, the scale of the parameters keeps shrinking until they are concentrated in a very small range, which is exactly what weight decay means.
3. Dropout Regularization
```python
torch.nn.Dropout(p=0.5, inplace=False)
```
Function: Dropout layer.
Parameters:
p – probability of an element to be zeroed. Default: 0.5
inplace – If set to True, will do this operation in-place. Default: False
Implementation detail: during training, the activations that are kept are scaled by $\frac{1}{1-p}$, i.e. divided by $1-p$.
Dropout deactivates neurons with a given probability: in every training iteration (each forward and backward pass) some neurons are randomly dropped. This reduces the model's reliance on any single neuron and also mitigates the problem of parts of the network being under-trained (i.e. only a few neurons keep receiving gradient updates while the others barely change).
Random: dropout probability
Deactivate: weight = 0
Data scale change: during training, the kept outputs are divided by 1 - drop_prob.
For example, with drop_prob = 0.3 we have 1 - drop_prob = 0.7. During training 30% of the neurons do not participate, so the forward-pass result is only about 70% of the scale it would have at test time (at test time all neurons participate, i.e. drop_prob = 0). Dividing by 1 - drop_prob during training therefore compensates for the scale reduction caused by dropping part of the neurons.
Reference: "Dropout: A Simple Way to Prevent Neural Networks from Overfitting"
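A small sketch (a toy tensor, not from the original text) to check this scale behaviour of nn.Dropout in train vs. eval mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.3)
x = torch.ones(10000)

drop.train()
y = drop(x)              # surviving entries are scaled to 1 / (1 - 0.3) ≈ 1.4286
print(y.unique())        # tensor([0.0000, 1.4286])
print(y.mean())          # ≈ 1.0 on average, so the expected scale is preserved

drop.eval()
print(drop(x).unique())  # tensor([1.]) – dropout is the identity at eval time
```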
Experiment:
```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from tools.common_tools import set_seed
from torch.utils.tensorboard import SummaryWriter

set_seed(1)  # reproducibility

n_hidden = 200
max_iter = 2000
disp_interval = 400
lr_init = 0.01


def gen_data(num_data=10, x_range=(-1, 1)):
    """Generate a tiny noisy linear dataset: y = 1.5x + noise."""
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w * train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w * test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y


train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))


class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)


net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)    # no dropout
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)  # dropout with p=0.5

optim_normal = torch.optim.SGD(net_prob_0.parameters(), lr=lr_init, momentum=0.9)
optim_reglar = torch.optim.SGD(net_prob_05.parameters(), lr=lr_init, momentum=0.9)

loss_func = torch.nn.MSELoss()

writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")

for epoch in range(max_iter):

    pred_normal, pred_wdecay = net_prob_0(train_x), net_prob_05(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_reglar.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_reglar.step()

    if (epoch + 1) % disp_interval == 0:

        # switch to eval mode so dropout is disabled while visualizing
        net_prob_0.eval()
        net_prob_05.eval()

        for name, layer in net_prob_0.named_parameters():
            writer.add_histogram(name + '_grad_normal', layer.grad, epoch)
            writer.add_histogram(name + '_data_normal', layer, epoch)

        for name, layer in net_prob_05.named_parameters():
            writer.add_histogram(name + '_grad_regularization', layer.grad, epoch)
            writer.add_histogram(name + '_data_regularization', layer, epoch)

        test_pred_prob_0, test_pred_prob_05 = net_prob_0(test_x), net_prob_05(test_x)

        plt.scatter(train_x.data.numpy(), train_y.data.numpy(), c='blue', s=50, alpha=0.3, label='train')
        plt.scatter(test_x.data.numpy(), test_y.data.numpy(), c='red', s=50, alpha=0.3, label='test')
        plt.plot(test_x.data.numpy(), test_pred_prob_0.data.numpy(), 'r-', lw=3, label='d_prob_0')
        plt.plot(test_x.data.numpy(), test_pred_prob_05.data.numpy(), 'b--', lw=3, label='d_prob_05')
        plt.text(-0.25, -1.5, 'd_prob_0 loss={:.8f}'.format(loss_normal.item()),
                 fontdict={'size': 15, 'color': 'red'})
        plt.text(-0.25, -2, 'd_prob_05 loss={:.6f}'.format(loss_wdecay.item()),
                 fontdict={'size': 15, 'color': 'red'})

        plt.ylim((-2.5, 2.5))
        plt.legend(loc='upper left')
        plt.title("Epoch: {}".format(epoch + 1))
        plt.show()
        plt.close()

        # back to train mode so dropout is active again
        net_prob_0.train()
        net_prob_05.train()
```
matplotlib output:
Below are the TensorBoard visualizations:
As can be seen, with dropout the parameters end up smaller in scale and more concentrated in distribution, which is also a hallmark of good generalization.
4. Batch Normalization
Batch Normalization: normalization over a batch.
Batch: a batch of data, usually a mini-batch.
Normalization: zero mean, unit variance.
Advantages:
A larger learning rate can be used, which speeds up convergence.
Weight initialization no longer has to be carefully designed.
Dropout can be dropped, or a smaller dropout rate used.
L2 regularization can be dropped, or a smaller weight decay used.
LRN (local response normalization) is no longer needed.
Reference: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"
Computation:
The algorithm is simple. For an incoming mini-batch $\mathcal{B}$ (the data produced by one dataloader iteration), compute its sample mean $\mu_{\mathcal{B}}$ and sample variance $\sigma_{\mathcal{B}}^{2}$, standardize the inputs to obtain $\widehat{x}_{i}$ (the $\epsilon$ in the denominator is a tiny constant that prevents division by zero), and finally apply an affine transform to $\widehat{x}_{i}$ to obtain $y_{i}$ (the parameters $\gamma, \beta$ are also learned by backpropagation). The affine transform can re-introduce a suitable mean and variance for the standardized $\widehat{x}_{i}$; since $\gamma, \beta$ are learned, it is even possible to end up with $\gamma = \sigma_{\mathcal{B}},\ \beta = \mu_{\mathcal{B}}$, in which case the BN layer is effectively an identity and does not help training converge. These two parameters give the BN layer considerable flexibility.
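Written out, in the notation of the BN paper, for a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$ the four steps are:

$$
\begin{aligned}
\mu_{\mathcal{B}} &= \frac{1}{m}\sum_{i=1}^{m} x_i, &
\sigma_{\mathcal{B}}^{2} &= \frac{1}{m}\sum_{i=1}^{m}\left(x_i-\mu_{\mathcal{B}}\right)^{2}, \\
\widehat{x}_i &= \frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}, &
y_i &= \gamma\,\widehat{x}_i + \beta .
\end{aligned}
$$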
4.1 _BatchNorm
_BatchNorm is the base class of BatchNorm1d / BatchNorm2d / BatchNorm3d.
```python
__init__(self, num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True)
```
Main attributes:
running_mean: the running mean
running_var: the running variance
weight: the gamma of the affine transform
bias: the beta of the affine transform
training: whether the module is in training mode
In general, PyTorch models inherit from nn.Module and therefore have a training attribute indicating whether the model is in training mode. This state determines whether certain layers, such as BN or Dropout, keep their behaviour fixed or not. Usually model.train() puts the model into training mode and model.eval() puts it into evaluation mode.
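A tiny illustration (a toy Sequential model, not from the original text) of how train()/eval() just toggles the training flag that BN and Dropout consult:

```python
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Dropout(0.5))

net.train()                       # recursively sets .training = True
print([m.training for m in net])  # [True, True, True]

net.eval()                        # recursively sets .training = False
print([m.training for m in net])  # [False, False, False]
```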
Parameters:
num_features: the number of features of a sample
eps: correction term added to the denominator
momentum: the coefficient of the exponential weighted average used to estimate the current mean/var
affine: whether the affine transform is applied; with affine=False, γ = 1 and β = 0 and they cannot be learned or updated. It is usually set to affine=True.
track_running_stats: whether running statistics are tracked (training-style behavior) or only the current batch statistics are used (test-style behavior).
Among the BN arguments, two deserve particular attention: affine, which specifies whether the affine transform is applied, and track_running_stats, which specifies whether batch statistics are tracked.
track_running_stats=True means BN tracks the statistics of the batches seen over the whole training process to obtain a running mean and variance, rather than relying only on the statistics of the current input batch. Conversely, with track_running_stats=False only the mean and variance of the current input batch are computed.
At inference time, if track_running_stats=False and the batch_size is small, the batch statistics can deviate considerably from the global statistics, which may lead to poor results.
Also note that when track_running_stats=True and training=True, the computed running_mean and running_var are only bookkeeping: with training=True the output does not use running_mean and running_var but the mean and var of the current batch. Only when training=False are the running_mean and running_var accumulated during training actually used.
In general, training and track_running_stats give four combinations:
1. training=True, track_running_stats=True. This is the expected setting for the training phase: BN tracks the batch statistics over the whole training process, but the output is not computed from running_mean and running_var; it is standardized with the mean and var of the current input batch.
2. training=True, track_running_stats=False. BN only computes the statistics of the current training batch, which may not describe the global data statistics well.
3. training=False, track_running_stats=True. This is the expected setting for the test phase: BN uses the running_mean and running_var obtained during training and does not update them. In general, calling model.eval() on a model that contains BN layers is enough to get this behavior.
4. training=False, track_running_stats=False. Same effect as (2), only in test mode. This is generally not used: it relies only on the statistics of the test batch, which can be shifted relative to the training data and lead to poor results.
Note that the running_mean and running_var of a BN layer are updated inside forward(), not in optimizer.step(); so in training mode the BN statistics change even if you never call step() manually.
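A minimal check of this point (a toy BN layer, not from the original text): a forward pass alone, with no optimizer involved, already moves the running statistics:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)        # track_running_stats=True by default
print(bn.running_mean)        # tensor([0., 0., 0.])

bn.train()
_ = bn(torch.randn(8, 3))     # forward pass only, no optimizer.step()
print(bn.running_mean)        # already updated from the batch statistics
```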
When track_running_stats=True, running_mean and running_var are updated as follows:
running_mean = (1 - momentum) * pre_running_mean + momentum * mean_t
running_var = (1 - momentum) * pre_running_var + momentum * var_t
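As a concrete check against the experiment in 4.2 below (momentum = 0.3; the batch consists of three identical samples, so for the first feature the batch mean is 1 and the batch variance is 0): starting from the defaults running_mean = 0 and running_var = 1, one update gives running_mean = 0.7 · 0 + 0.3 · 1 = 0.3 and running_var = 0.7 · 1 + 0.3 · 0 = 0.7, matching the printed values 0.3000 and 0.7000 of the first iteration.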
4.2 BatchNorm1d
```python
torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
```
Parameters:
num_features – C from an expected input of size (N, C, L) or L from input of size (N, L)
eps – a value added to the denominator for numerical stability. Default: 1e-5
momentum – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
affine – a boolean value that when set to True, this module has learnable affine parameters. Default: True
track_running_stats – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and uses batch statistics instead in both training and eval modes if the running mean and variance are None. Default: True
Input shape:
nn.BatchNorm1d input: batch size × number of features × 1d feature (for example, the output of a single neuron is a 1d feature).
Experiment:
```python
import torch
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # reproducibility

batch_size = 3
num_features = 5
momentum = 0.3

features_shape = (1)  # note: (1) is just the int 1, i.e. a 1d feature of length 1

# build a batch of 3 identical samples, each with 5 features: [1, 2, 3, 4, 5]
feature_map = torch.ones(features_shape)
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm1d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1  # the defaults BN starts from

for i in range(2):
    bn.train()
    outputs = bn(feature_maps_bs)  # forward in train mode updates the running statistics
    print("iteration:{}, running mean: {}".format(i, bn.running_mean))
    print("iteration:{}, running var:{}".format(i, bn.running_var))

    bn.eval()
    outputs = bn(feature_maps_bs)  # forward in eval mode uses the running statistics
    print("iteration:{}, outputs:\n{}".format(i, outputs))

    # reproduce the running-statistics update by hand
    mean_t, var_t = feature_maps_bs.mean(dim=0), feature_maps_bs.var(dim=0)

    running_mean = (1 - momentum) * running_mean + momentum * mean_t
    running_var = (1 - momentum) * running_var + momentum * var_t

    print("iteration:{}, manually computed running mean: {}".format(i, running_mean.mean(dim=-1)))
    print("iteration:{}, manually computed running var:{}".format(i, running_var.mean(dim=-1)))
    print('norm_out=\n{}'.format((feature_maps_bs - running_mean.unsqueeze(dim=0)) / np.sqrt(running_var.unsqueeze(dim=0))))
```
Output:
```
input data:
tensor([[[1.], [2.], [3.], [4.], [5.]],
        [[1.], [2.], [3.], [4.], [5.]],
        [[1.], [2.], [3.], [4.], [5.]]]) shape is torch.Size([3, 5, 1])
iteration:0, running mean: tensor([0.3000, 0.6000, 0.9000, 1.2000, 1.5000])
iteration:0, running var:tensor([0.7000, 0.7000, 0.7000, 0.7000, 0.7000])
iteration:0, outputs:
tensor([[[0.8367], [1.6733], [2.5100], [3.3466], [4.1833]],
        [[0.8367], [1.6733], [2.5100], [3.3466], [4.1833]],
        [[0.8367], [1.6733], [2.5100], [3.3466], [4.1833]]])
iteration:0, manually computed running mean: tensor([0.3000, 0.6000, 0.9000, 1.2000, 1.5000])
iteration:0, manually computed running var:tensor([0.7000, 0.7000, 0.7000, 0.7000, 0.7000])
norm_out=
tensor([[[0.8367], [1.6733], [2.5100], [3.3466], [4.1833]],
        [[0.8367], [1.6733], [2.5100], [3.3466], [4.1833]],
        [[0.8367], [1.6733], [2.5100], [3.3466], [4.1833]]])
iteration:1, running mean: tensor([0.5100, 1.0200, 1.5300, 2.0400, 2.5500])
iteration:1, running var:tensor([0.4900, 0.4900, 0.4900, 0.4900, 0.4900])
iteration:1, outputs:
tensor([[[0.7000], [1.4000], [2.1000], [2.8000], [3.5000]],
        [[0.7000], [1.4000], [2.1000], [2.8000], [3.5000]],
        [[0.7000], [1.4000], [2.1000], [2.8000], [3.5000]]])
iteration:1, manually computed running mean: tensor([0.5100, 1.0200, 1.5300, 2.0400, 2.5500])
iteration:1, manually computed running var:tensor([0.4900, 0.4900, 0.4900, 0.4900, 0.4900])
norm_out=
tensor([[[0.7000], [1.4000], [2.1000], [2.8000], [3.5000]],
        [[0.7000], [1.4000], [2.1000], [2.8000], [3.5000]],
        [[0.7000], [1.4000], [2.1000], [2.8000], [3.5000]]])
```
4.3 BatchNorm2d
```python
torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
```
Parameters:
num_features – C from an expected input of size (N, C, H, W)
eps – a value added to the denominator for numerical stability. Default: 1e-5
momentum – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
affine – a boolean value that when set to True, this module has learnable affine parameters. Default: True
track_running_stats – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and uses batch statistics instead in both training and eval modes if the running mean and variance are None. Default: True
Input shape:
nn.BatchNorm2d input: batch size × number of features × 2d feature (for example, a feature map produced by a convolution is a 2d feature).
Experiment:
```python
import torch
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # reproducibility

batch_size = 3
num_features = 5
momentum = 0.3

features_shape = (1)  # note: (1) is just the int 1

# random input of shape (batch_size, num_features, 1)
feature_maps_bs = torch.randn(batch_size, num_features, features_shape)
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm1d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1

for i in range(2):
    bn.train()
    outputs = bn(feature_maps_bs)
    print("iteration:{}, running mean: {}".format(i, bn.running_mean))
    print("iteration:{}, running var:{}".format(i, bn.running_var))

    bn.eval()
    outputs = bn(feature_maps_bs)
    print("iteration:{}, outputs:\n{}".format(i, outputs))

    # reproduce the update by hand: statistics are computed per feature (channel),
    # over all remaining dims
    feature_maps_bs_1 = torch.transpose(feature_maps_bs, dim0=0, dim1=1)
    feature_maps_bs_1 = feature_maps_bs_1.reshape((num_features, -1))
    mean_t, var_t = feature_maps_bs_1.mean(dim=1), feature_maps_bs_1.var(dim=1)

    running_mean = (1 - momentum) * running_mean + momentum * mean_t
    running_var = (1 - momentum) * running_var + momentum * var_t

    print("iteration:{}, manually computed running mean: {}".format(i, running_mean))
    print("iteration:{}, manually computed running var:{}".format(i, running_var))

    feature_maps_mean = running_mean.unsqueeze(0).unsqueeze(-1)
    feature_maps_var = running_var.unsqueeze(0).unsqueeze(-1)
    print("iteration:{}, manually computed out:\n{}".format(
        i, (feature_maps_bs - feature_maps_mean) / np.sqrt(feature_maps_var + 1e-05)))
```
Output:
```
input data:
tensor([[[ 0.6614], [ 0.2669], [ 0.0617], [ 0.6213], [-0.4519]],
        [[-0.1661], [-1.5228], [ 0.3817], [-1.0276], [-0.5631]],
        [[-0.8923], [-0.0583], [-0.1955], [-0.9656], [ 0.4224]]]) shape is torch.Size([3, 5, 1])
iteration:0, running mean: tensor([-0.0397, -0.1314,  0.0248, -0.1372, -0.0593])
iteration:0, running var:tensor([0.8813, 0.9727, 0.7251, 0.9621, 0.7874])
iteration:0, outputs:
tensor([[[ 0.7468], [ 0.4039], [ 0.0433], [ 0.7733], [-0.4425]],
        [[-0.1347], [-1.4108], [ 0.4191], [-0.9078], [-0.5678]],
        [[-0.9082], [ 0.0742], [-0.2587], [-0.8446], [ 0.5428]]],
       grad_fn=<NativeBatchNormBackward>)
iteration:0, manually computed running mean: tensor([-0.0397, -0.1314,  0.0248, -0.1372, -0.0593])
iteration:0, manually computed running var:tensor([0.8813, 0.9727, 0.7251, 0.9621, 0.7874])
iteration:0, manually computed out:
tensor([[[ 0.7468], [ 0.4039], [ 0.0433], [ 0.7733], [-0.4425]],
        [[-0.1347], [-1.4108], [ 0.4191], [-0.9078], [-0.5678]],
        [[-0.9082], [ 0.0742], [-0.2587], [-0.8446], [ 0.5428]]])
iteration:1, running mean: tensor([-0.0675, -0.2234,  0.0421, -0.2332, -0.1007])
iteration:1, running var:tensor([0.7982, 0.9536, 0.5326, 0.9355, 0.6386])
iteration:1, outputs:
tensor([[[ 0.8158], [ 0.5021], [ 0.0268], [ 0.8835], [-0.4395]],
        [[-0.1104], [-1.3306], [ 0.4652], [-0.8213], [-0.5785]],
        [[-0.9232], [ 0.1691], [-0.3256], [-0.7572], [ 0.6547]]],
       grad_fn=<NativeBatchNormBackward>)
iteration:1, manually computed running mean: tensor([-0.0675, -0.2234,  0.0421, -0.2332, -0.1007])
iteration:1, manually computed running var:tensor([0.7982, 0.9536, 0.5326, 0.9355, 0.6386])
iteration:1, manually computed out:
tensor([[[ 0.8158], [ 0.5021], [ 0.0268], [ 0.8835], [-0.4395]],
        [[-0.1104], [-1.3306], [ 0.4652], [-0.8213], [-0.5785]],
        [[-0.9232], [ 0.1691], [-0.3256], [-0.7572], [ 0.6547]]])
```
4.4 Summary of the BN computation
In summary, both BatchNorm1d and BatchNorm2d compute the mean and variance over every dim of a batch except the channel dim (i.e. num_features), and then standardize the batch with them.
In addition, γ and β come as one pair per channel and are learned via backpropagation, being updated in optimizer.step().
4.5 An application of BN
Experiment:
```python
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader
from matplotlib import pyplot as plt
from model.lenet import LeNet
from tools.my_dataset import RMBDataset
from tools.common_tools import set_seed


class LeNet_bn(nn.Module):
    """LeNet with a BN layer after each conv layer and after the first fc layer."""
    def __init__(self, classes):
        super(LeNet_bn, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(num_features=6)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(num_features=16)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(num_features=120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, classes)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = F.max_pool2d(out, 2)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)
        out = F.max_pool2d(out, 2)

        out = out.view(out.size(0), -1)

        out = self.fc1(out)
        out = self.bn3(out)
        out = F.relu(out)

        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

    def initialize_weights(self):
        # deliberately poor initialization: N(0, 0.5)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight.data, mean=0, std=0.5)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, mean=0, std=0.5)
                m.bias.data.zero_()


set_seed(1)  # reproducibility
rmb_label = {"1": 0, "100": 1}

MAX_EPOCH = 10
BATCH_SIZE = 16
LR = 0.01
log_interval = 10
val_interval = 1

# data
split_dir = os.path.join("..", "data", "rmb_split")
train_dir = os.path.join(split_dir, "train")
valid_dir = os.path.join(split_dir, "valid")

norm_mean = [0.485, 0.456, 0.406]
norm_std = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.RandomCrop(32, padding=4),
    transforms.RandomGrayscale(p=0.8),
    transforms.ToTensor(),
    transforms.Normalize(norm_mean, norm_std),
])

valid_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize(norm_mean, norm_std),
])

train_data = RMBDataset(data_dir=train_dir, transform=train_transform)
valid_data = RMBDataset(data_dir=valid_dir, transform=valid_transform)

train_loader = DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(dataset=valid_data, batch_size=BATCH_SIZE)

# model
net = LeNet(classes=2)        # switch to LeNet_bn(classes=2) for the BN run below
net.initialize_weights()

# loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=LR, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# training
train_curve = list()
valid_curve = list()

iter_count = 0
writer = SummaryWriter(comment='test_your_comment', filename_suffix="_test_your_filename_suffix")

for epoch in range(MAX_EPOCH):

    loss_mean = 0.
    correct = 0.
    total = 0.

    net.train()
    for i, data in enumerate(train_loader):

        iter_count += 1

        # forward
        inputs, labels = data
        outputs = net(inputs)

        # backward
        optimizer.zero_grad()
        loss = criterion(outputs, labels)
        loss.backward()

        # update weights
        optimizer.step()

        # training statistics
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).squeeze().sum().numpy()

        loss_mean += loss.item()
        train_curve.append(loss.item())
        if (i + 1) % log_interval == 0:
            loss_mean = loss_mean / log_interval
            print("Training:Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] Loss: {:.4f} Acc:{:.2%}".format(
                epoch, MAX_EPOCH, i + 1, len(train_loader), loss_mean, correct / total))
            loss_mean = 0.

        writer.add_scalars("Loss", {"Train": loss.item()}, iter_count)
        writer.add_scalars("Accuracy", {"Train": correct / total}, iter_count)

    scheduler.step()  # update the learning rate

    # validation
    if (epoch + 1) % val_interval == 0:

        correct_val = 0.
        total_val = 0.
        loss_val = 0.
        net.eval()
        with torch.no_grad():
            for j, data in enumerate(valid_loader):
                inputs, labels = data
                outputs = net(inputs)
                loss = criterion(outputs, labels)

                _, predicted = torch.max(outputs.data, 1)
                total_val += labels.size(0)
                correct_val += (predicted == labels).squeeze().sum().numpy()

                loss_val += loss.item()

            valid_curve.append(loss.item())
            print("Valid:\t Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] Loss: {:.4f} Acc:{:.2%}".format(
                epoch, MAX_EPOCH, j + 1, len(valid_loader), loss_val, correct / total))

            writer.add_scalars("Loss", {"Valid": loss.item()}, iter_count)
            writer.add_scalars("Accuracy", {"Valid": correct / total}, iter_count)
```
Here the weights are initialized from a normal distribution with mean=0 and std=0.5, which is not a good initialization. If we do not use the BN layers (plain LeNet), we get the following result:
```
Training:Epoch[009/010] Iteration[010/010] Loss: 0.6934 Acc:50.00%
Valid:	 Epoch[009/010] Iteration[002/002] Loss: 1.3892 Acc:50.00%
```
By epoch 9 the loss is still large and the accuracy is 50%, which for a binary classification problem is essentially random guessing; the model has not learned anything.
Now we use the BN layers (LeNet_bn):
```
Training:Epoch[009/010] Iteration[010/010] Loss: 0.0019 Acc:100.00%
Valid:	 Epoch[009/010] Iteration[002/002] Loss: 0.0000 Acc:100.00%
```
As can be seen, the change is dramatic: the BN layers make the model far less sensitive to the weight initialization.
5. LN / IN / GN Normalization
Common normalization methods:
Batch Normalization (BN)
Layer Normalization (LN)
Instance Normalization (IN)
Group Normalization (GN)
Suppose a batch of input data has shape B×C×H×W; the four methods differ in which dims the statistics are computed over (see the sketch after this list).
BN computes the mean and variance over B, H, W: one mean and variance per channel C.
LN computes the mean and variance over C, H, W: one mean and variance per sample B.
IN computes the mean and variance over H, W: one mean and variance per sample B and channel C.
GN first splits C into several groups Ci and computes the mean and variance over Ci, H, W: one mean and variance per sample B and group Ci.
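As a rough illustration (not from the original code), these four reductions can be written directly with tensor.mean; the sizes and group count below are arbitrary:

```python
import torch

B, C, H, W, G = 2, 6, 4, 4, 3          # G groups of C // G channels each
x = torch.randn(B, C, H, W)

bn_mean = x.mean(dim=(0, 2, 3))                            # BN: over B, H, W -> shape (C,)
ln_mean = x.mean(dim=(1, 2, 3))                            # LN: over C, H, W -> shape (B,)
in_mean = x.mean(dim=(2, 3))                               # IN: over H, W    -> shape (B, C)
gn_mean = x.view(B, G, C // G, H, W).mean(dim=(2, 3, 4))   # GN: over Ci, H, W -> shape (B, G)

print(bn_mean.shape, ln_mean.shape, in_mean.shape, gn_mean.shape)
```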
5.1 Layer Normalization
Motivation: BN is not applicable to dynamic networks such as RNNs, and it performs poorly with small batch sizes.
Idea: compute a mean and variance for each sample's output (for B×C×H×W data, over C, H, W).
Notes:
There is no running_mean or running_var anymore.
gamma and beta are element-wise.
Reference: "Layer Normalization"
```python
nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)
```
Main parameters:
normalized_shape: the feature shape of this layer
eps: correction term added to the denominator
elementwise_affine: whether the affine transform is applied
```python
import torch
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # reproducibility

batch_size = 2
num_features = 3

features_shape = (2, 2)

feature_maps_bs = torch.randn(batch_size, num_features, features_shape[0], features_shape[1])

# normalize over the trailing dims (C, H, W) of each sample
ln = nn.LayerNorm(feature_maps_bs.size()[1:])
output = ln(feature_maps_bs)

print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))
print("weight shape:{}".format(ln.weight.shape))
print("outputs:\n{}".format(output))

# reproduce the computation by hand: one mean/var per sample
feature_maps_bs_1 = feature_maps_bs.reshape(batch_size, -1)
mean_t, var_t = feature_maps_bs_1.mean(dim=1), feature_maps_bs_1.var(dim=1, unbiased=False)
print('mean_t={}, var_t={}'.format(mean_t, var_t))

feature_maps_mean = mean_t.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
feature_maps_var = var_t.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
print("manually computed out:\n{}".format(
    (feature_maps_bs - feature_maps_mean) / np.sqrt(feature_maps_var + 1e-05)))
```
Output:
```
input data:
tensor([[[[-1.5256, -0.7502], [-0.6540, -1.6095]],
         [[-0.1002, -0.6092], [-0.9798, -1.6091]],
         [[ 0.4391,  1.1712], [ 1.7674, -0.0954]]],

        [[[ 0.1394, -1.5785], [-0.3206, -0.2993]],
         [[-0.7984,  0.3357], [ 0.2753,  1.7163]],
         [[-0.0561,  0.9107], [-1.3924,  2.6891]]]]) shape is torch.Size([2, 3, 2, 2])
weight shape:torch.Size([3, 2, 2])
outputs:
tensor([[[[-1.1093, -0.3588], [-0.2656, -1.1905]],
         [[ 0.2705, -0.2222], [-0.5810, -1.1901]],
         [[ 0.7925,  1.5011], [ 2.0783,  0.2751]]],

        [[[ 0.0037, -1.4722], [-0.3915, -0.3732]],
         [[-0.8020,  0.1724], [ 0.1205,  1.3584]],
         [[-0.1643,  0.6663], [-1.3123,  2.1942]]]],
       grad_fn=<NativeLayerNormBackward>)
mean_t=tensor([-0.3796,  0.1351]), var_t=tensor([1.0673, 1.3549])
manually computed out:
tensor([[[[-1.1093, -0.3588], [-0.2656, -1.1905]],
         [[ 0.2705, -0.2222], [-0.5810, -1.1901]],
         [[ 0.7925,  1.5011], [ 2.0783,  0.2751]]],

        [[[ 0.0037, -1.4722], [-0.3915, -0.3732]],
         [[-0.8020,  0.1724], [ 0.1205,  1.3584]],
         [[-0.1643,  0.6663], [-1.3123,  2.1942]]]])
```
The LN weight has shape (3, 2, 2), which shows that the gamma and beta of the affine transform are element-wise (whereas BN has one gamma and beta per channel).
In addition, LN simply standardizes over the trailing contiguous dimensions of the input. For an input of size [2, 3, 2, 2], ln = nn.LayerNorm((2)), ln = nn.LayerNorm((2, 2)) and ln = nn.LayerNorm((3, 2, 2)) are all valid.
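A quick sketch (not from the original text) checking those three choices; the weight shape follows normalized_shape while the output keeps the input shape:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 2, 2)

for shape in [(2,), (2, 2), (3, 2, 2)]:         # trailing dims of the input
    ln = nn.LayerNorm(shape)
    print(shape, ln(x).shape, ln.weight.shape)  # output shape == input shape
```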
5.2 Instance Normalization
Motivation: BN is not suitable for image generation.
Idea: compute a mean and variance for each feature map (for B×C×H×W data, over H, W).
References:
"Instance Normalization: The Missing Ingredient for Fast Stylization"
"Image Style Transfer Using Convolutional Neural Networks"
```python
nn.InstanceNorm2d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
```
Main parameters:
num_features: the number of features of a sample (the most important one)
eps: correction term added to the denominator
momentum: the coefficient of the exponential weighted average used to estimate the current mean/var
affine: whether the affine transform is applied
track_running_stats: whether running statistics are tracked (training-style behavior) or not
The parameters and attributes are exactly the same as BN's.
However, track_running_stats and affine default to False here, which suggests these two features are rarely used with IN.
Experiment:
```python
import torch
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # reproducibility

batch_size = 3
num_features = 3

features_shape = (2, 2)

feature_maps_bs = torch.randn(batch_size, num_features, features_shape[0], features_shape[1])
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

instance_n = nn.InstanceNorm2d(num_features=num_features)
running_mean, running_var = 0, 1

outputs = instance_n(feature_maps_bs)
print("outputs:\n{}".format(outputs))

# reproduce the computation by hand: one mean/var per sample and per channel (over H, W)
feature_maps_bs_1 = feature_maps_bs.reshape((batch_size, num_features, -1))
mean_t, var_t = feature_maps_bs_1.mean(dim=-1), feature_maps_bs_1.var(dim=-1, unbiased=False)

feature_maps_mean = mean_t.unsqueeze(-1).unsqueeze(-1)
feature_maps_var = var_t.unsqueeze(-1).unsqueeze(-1)
print("manually computed out:\n{}".format(
    (feature_maps_bs - feature_maps_mean) / np.sqrt(feature_maps_var + 1e-05)))
```
Output:
```
input data:
tensor([[[[-1.5256, -0.7502], [-0.6540, -1.6095]],
         [[-0.1002, -0.6092], [-0.9798, -1.6091]],
         [[-0.7121,  0.3037], [-0.7773, -0.2515]]],

        [[[-0.2223,  1.6871], [ 0.2284,  0.4676]],
         [[-0.6970, -1.1608], [ 0.6995,  0.1991]],
         [[ 0.1991,  0.0457], [ 0.1530, -0.4757]]],

        [[[-1.8821, -0.7765], [ 2.0242, -0.0865]],
         [[ 2.3571, -1.0373], [ 1.5748, -0.6298]],
         [[ 2.4070,  0.2786], [ 0.2468,  1.1843]]]]) shape is torch.Size([3, 3, 2, 2])
outputs:
tensor([[[[-0.8982,  0.8840], [ 1.1052, -1.0910]],
         [[ 1.3167,  0.3915], [-0.2821, -1.4260]],
         [[-0.8146,  1.5307], [-0.9650,  0.2490]]],

        [[[-1.0785,  1.6222], [-0.4410, -0.1027]],
         [[-0.6262, -1.2614], [ 1.2866,  0.6011]],
         [[ 0.8118,  0.2422], [ 0.6406, -1.6945]]],

        [[[-1.1945, -0.4185], [ 1.5472,  0.0658]],
         [[ 1.2488, -1.1181], [ 0.7033, -0.8340]],
         [[ 1.5656, -0.8529], [-0.8890,  0.1763]]]])
manually computed out:
tensor([[[[-0.8982,  0.8840], [ 1.1052, -1.0910]],
         [[ 1.3167,  0.3915], [-0.2821, -1.4260]],
         [[-0.8146,  1.5307], [-0.9650,  0.2490]]],

        [[[-1.0785,  1.6222], [-0.4410, -0.1027]],
         [[-0.6262, -1.2614], [ 1.2866,  0.6011]],
         [[ 0.8118,  0.2422], [ 0.6406, -1.6945]]],

        [[[-1.1945, -0.4185], [ 1.5472,  0.0658]],
         [[ 1.2488, -1.1181], [ 0.7033, -0.8340]],
         [[ 1.5656, -0.8529], [-0.8890,  0.1763]]]])
```
5.3 Group Normalization
Motivation: with small batches, the statistics estimated by BN are inaccurate.
Idea: when there are not enough samples, pool channels together instead.
Notes:
There is no running_mean or running_var anymore.
gamma and beta are per-channel.
Typical use: large models, i.e. small-batch-size tasks.
Reference: "Group Normalization"
```python
nn.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True)
```
Main parameters:
num_groups: the number of groups
num_channels: the number of channels (features)
eps: correction term added to the denominator
affine: whether the affine transform is applied
Experiment:
```python
import torch
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # reproducibility

batch_size = 2
num_features = 4
num_groups = 2

features_shape = (2, 2)

feature_maps_bs = torch.randn(batch_size, num_features, features_shape[0], features_shape[1])

gn = nn.GroupNorm(num_groups, num_features)
outputs = gn(feature_maps_bs)

print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))
print("outputs:\n{}".format(outputs))
print(gn.weight.shape)

# reproduce the computation by hand: split the channels into 2 groups of 2,
# then compute one mean/var per sample and per group
feature_maps_bs_c1 = feature_maps_bs[:, :int(num_features / num_groups), :, :]
feature_maps_bs_c2 = feature_maps_bs[:, int(num_features / num_groups):, :, :]

feature_maps_bs_1 = feature_maps_bs_c1.reshape(batch_size, -1)
feature_maps_bs_2 = feature_maps_bs_c2.reshape(batch_size, -1)

mean_1, var_1 = feature_maps_bs_1.mean(dim=-1), feature_maps_bs_1.var(dim=-1, unbiased=False)
mean_2, var_2 = feature_maps_bs_2.mean(dim=-1), feature_maps_bs_2.var(dim=-1, unbiased=False)

feature_maps_mean_1 = mean_1.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
feature_maps_var_1 = var_1.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
feature_maps_mean_2 = mean_2.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
feature_maps_var_2 = var_2.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)

output1 = (feature_maps_bs_c1 - feature_maps_mean_1) / np.sqrt(feature_maps_var_1 + 1e-05)
output2 = (feature_maps_bs_c2 - feature_maps_mean_2) / np.sqrt(feature_maps_var_2 + 1e-05)
output = torch.cat([output1, output2], dim=1)
print("manually computed out:\n{}".format(output))
```
Output:
```
input data:
tensor([[[[-1.5256, -0.7502], [-0.6540, -1.6095]],
         [[-0.1002, -0.6092], [-0.9798, -1.6091]],
         [[-0.7121,  0.3037], [-0.7773, -0.2515]],
         [[-0.2223,  1.6871], [ 0.2284,  0.4676]]],

        [[[-0.6970, -1.1608], [ 0.6995,  0.1991]],
         [[ 0.8657,  0.2444], [-0.6629,  0.8073]],
         [[ 1.1017, -0.1759], [-2.2456, -1.4465]],
         [[ 0.0612, -0.6177], [-0.7981, -0.1316]]]]) shape is torch.Size([2, 4, 2, 2])
outputs:
tensor([[[[-1.0505e+00,  4.4155e-01], [ 6.2676e-01, -1.2119e+00]],
         [[ 1.6925e+00,  7.1295e-01], [-1.5879e-04, -1.2112e+00]],
         [[-1.0862e+00,  2.8861e-01], [-1.1744e+00, -4.6273e-01]],
         [[-4.2323e-01,  2.1608e+00], [ 1.8671e-01,  5.1043e-01]]],

        [[[-1.0067e+00, -1.6429e+00], [ 9.0893e-01,  2.2244e-01]],
         [[ 1.1368e+00,  2.8461e-01], [-9.5998e-01,  1.0568e+00]],
         [[ 1.7266e+00,  3.7595e-01], [-1.8119e+00, -9.6716e-01]],
         [[ 6.2659e-01, -9.1099e-02], [-2.8173e-01,  4.2280e-01]]]],
       grad_fn=<NativeGroupNormBackward>)
torch.Size([4])
manually computed out:
tensor([[[[-1.0505e+00,  4.4155e-01], [ 6.2676e-01, -1.2119e+00]],
         [[ 1.6925e+00,  7.1295e-01], [-1.5886e-04, -1.2112e+00]],
         [[-1.0862e+00,  2.8861e-01], [-1.1744e+00, -4.6273e-01]],
         [[-4.2323e-01,  2.1608e+00], [ 1.8671e-01,  5.1043e-01]]],

        [[[-1.0067e+00, -1.6429e+00], [ 9.0893e-01,  2.2244e-01]],
         [[ 1.1368e+00,  2.8461e-01], [-9.5998e-01,  1.0568e+00]],
         [[ 1.7266e+00,  3.7595e-01], [-1.8119e+00, -9.6716e-01]],
         [[ 6.2659e-01, -9.1099e-02], [-2.8173e-01,  4.2280e-01]]]])
```
The GN weight has shape [4] (= num_channels), i.e. the gamma and beta of the affine transform come as one pair per channel.