1. Attributes of the optimizer
```python
class Optimizer(object):
    def __init__(self, params, defaults):
        self.defaults = defaults
        self.state = defaultdict(dict)
        self.param_groups = []
```
defaults: the optimizer's hyperparameters (a dict holding 'lr', 'momentum' and the other optimizer settings)
state: per-parameter cached values, e.g. the momentum buffers
param_groups: the parameter groups being managed (a list of dicts; each dict holds one group of parameters together with the hyperparameter settings for that group)
_step_count: the number of updates performed so far; used by learning-rate schedulers (the sketch below inspects these attributes on a live optimizer)
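These attributes can be read directly from any optimizer instance. A minimal sketch, not part of the original experiments, with SGD chosen as an example:

```python
import torch
import torch.optim as optim

w = torch.randn((2, 2), requires_grad=True)
optimizer = optim.SGD([w], lr=0.1, momentum=0.9)

print(optimizer.defaults)       # {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, ...}
print(optimizer.state)          # empty until step() has run with momentum enabled
print(optimizer.param_groups)   # a list with one dict: the tensor w plus its hyperparameters
```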
2. Methods of the optimizer
2.1 step()
Purpose: perform a single parameter update.
Experiment:
```python
import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1)

print("weight before step:{}".format(weight.data))
optimizer.step()
print("weight after step:{}".format(weight.data))
```
```
weight before step:tensor([[0.6614, 0.2669],
        [0.0617, 0.6213]])
weight after step:tensor([[ 0.5614,  0.1669],
        [-0.0383,  0.5213]])
```
Since the gradient was manually set to 1 and lr=0.1, every element of the parameter decreases by 0.1 after the update.
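As a sanity check, here is a minimal sketch (assuming plain SGD, no momentum or weight decay) verifying that step() applies exactly w ← w - lr * g(w):

```python
import torch
import torch.optim as optim

w = torch.randn((2, 2), requires_grad=True)
w.grad = torch.ones((2, 2))
expected = w.data - 0.1 * w.grad         # the hand-computed update

optimizer = optim.SGD([w], lr=0.1)
optimizer.step()

print(torch.allclose(w.data, expected))  # True
```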
2.2 zero_grad()
Purpose: set the gradients of all managed parameters to zero.
Experiment:
```python
import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1)

print("weight before step:{}".format(weight.data))
optimizer.step()
print("weight after step:{}".format(weight.data))

print("weight in optimizer:{}\nweight in weight:{}\n".format(
    id(optimizer.param_groups[0]['params'][0]), id(weight)))

print("weight.grad is {}\n".format(weight.grad))
optimizer.zero_grad()
print("after optimizer.zero_grad(), weight.grad is\n{}".format(weight.grad))
```
```
weight before step:tensor([[0.6614, 0.2669],
        [0.0617, 0.6213]])
weight after step:tensor([[ 0.5614,  0.1669],
        [-0.0383,  0.5213]])
weight in optimizer:2188611511704
weight in weight:2188611511704

weight.grad is tensor([[1., 1.],
        [1., 1.]])

after optimizer.zero_grad(), weight.grad is
tensor([[0., 0.],
        [0., 0.]])
```
The lines weight in optimizer:2188611511704 and weight in weight:2188611511704 show that the parameter held by the optimizer and the weight tensor are the same object at the same address, so modifying one modifies the other, which is exactly what we want.
After optimizer.zero_grad() runs, the gradient is cleared. In practice we call optimizer.zero_grad() in every training iteration (otherwise gradients accumulate across iterations), so that each iteration works with freshly computed gradients only.
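A minimal sketch of that per-iteration pattern (the model, data and loss below are placeholder assumptions, not part of the original experiments):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)                    # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for iteration in range(2):
    inputs = torch.randn(4, 10)             # placeholder batch
    labels = torch.randint(0, 2, (4,))

    optimizer.zero_grad()                   # clear old gradients so they do not accumulate
    loss = criterion(model(inputs), labels)
    loss.backward()                         # compute fresh gradients
    optimizer.step()                        # update the parameters
```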
2.3 add_param_group()
Purpose: add a parameter group.
Experiment:
```python
import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1)

print("optimizer.param_groups is\n{}".format(optimizer.param_groups))

w2 = torch.randn((3, 3), requires_grad=True)
optimizer.add_param_group({"params": w2, 'lr': 0.0001})

print("optimizer.param_groups is\n{}".format(optimizer.param_groups))
```
```
optimizer.param_groups is
[{'params': [tensor([[0.6614, 0.2669],
        [0.0617, 0.6213]], requires_grad=True)], 'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False}]
optimizer.param_groups is
[{'params': [tensor([[0.6614, 0.2669],
        [0.0617, 0.6213]], requires_grad=True)], 'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False}, {'params': [tensor([[-0.4519, -0.1661, -1.5228],
        [ 0.3817, -1.0276, -0.5631],
        [-0.8923, -0.0583, -0.1955]], requires_grad=True)], 'lr': 0.0001, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False}]
```
If you want two sets of weights to use different learning rates (or other hyperparameters), use add_param_group. After the call, optimizer.param_groups contains an additional dict that manages the hyperparameter settings for w2.
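The same effect can be achieved at construction time by passing a list of group dicts, which is the usual pattern when fine-tuning. A minimal sketch with an assumed two-part model:

```python
import torch.nn as nn
import torch.optim as optim

backbone = nn.Linear(10, 10)   # assumed "pretrained" part: small learning rate
head = nn.Linear(10, 2)        # assumed new part: uses the default learning rate

optimizer = optim.SGD([
    {"params": backbone.parameters(), "lr": 0.0001},
    {"params": head.parameters()},          # no 'lr' here, falls back to lr=0.1 below
], lr=0.1, momentum=0.9)

for group in optimizer.param_groups:
    print(group["lr"])                      # 0.0001, then 0.1
```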
2.4 state_dict()
Purpose: return a dict describing the optimizer's current state.
Experiment:
```python
import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
opt_state_dict = optimizer.state_dict()

print("state_dict before step:\n", opt_state_dict)

for i in range(10):
    optimizer.step()

print("state_dict after step:\n", optimizer.state_dict())

torch.save(optimizer.state_dict(), os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))
```
```
state_dict before step:
 {'state': {}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [2529699450456]}]}
state_dict after step:
 {'state': {2529699450456: {'momentum_buffer': tensor([[6.5132, 6.5132],
        [6.5132, 6.5132]])}, 'default_factory': {}}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [2529699450456]}]}
```
optimizer.state_dict() returns a dict with two keys, 'state' and 'param_groups', which capture the optimizer's current state so that the next training run can resume directly from it.
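In practice the optimizer state is usually saved together with the model state in a single checkpoint. A minimal sketch (the file name and model here are assumptions):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

checkpoint = {
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": 10,
}
torch.save(checkpoint, "checkpoint.pkl")    # assumed file name
```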
2.5 load_state_dict()
Purpose: load a state dict into the optimizer.
Experiment:
```python
import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
state_dict = torch.load(os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))

print("state_dict before load state:\n", optimizer.state_dict())
optimizer.load_state_dict(state_dict)
print("state_dict after load state:\n", optimizer.state_dict())
```
```
state_dict before load state:
 {'state': {}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [1517314336152]}]}
state_dict after load state:
 {'state': {1517314336152: {'momentum_buffer': tensor([[6.5132, 6.5132],
        [6.5132, 6.5132]])}, 'default_factory': {}}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [1517314336152]}]}
```
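After loading, the momentum buffers are restored, so training continues from where it stopped instead of rebuilding the momentum from zero. A minimal sketch of resuming from the checkpoint saved earlier (same assumed file name; the optimizer must be constructed with the same configuration before loading):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

checkpoint = torch.load("checkpoint.pkl")           # assumed file name
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1               # resume from the next epoch
```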
3. Learning rate and momentum
3.1 Learning rate
Gradient descent: $w_{i+1} = w_{i} - LR * g(w_{i})$
The learning rate (LR) controls the step size of each update.
Experiment:
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(1)


def func(x_t):
    """
    y = (2x)^2 = 4*x^2
    dy/dx = 8x
    """
    return torch.pow(2 * x_t, 2)


x = torch.tensor([2.], requires_grad=True)

# plot the function curve
x_t = torch.linspace(-3, 3, 100)
y = func(x_t)
plt.plot(x_t.numpy(), y.numpy(), label="y = 4*x^2")
plt.grid()
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(1)


def func(x_t):
    """
    y = (2x)^2 = 4*x^2
    dy/dx = 8x
    """
    return torch.pow(2 * x_t, 2)


x = torch.tensor([2.], requires_grad=True)

iter_rec, loss_rec, x_rec = list(), list(), list()

lr = 0.01
max_iteration = 20

for i in range(max_iteration):
    y = func(x)
    y.backward()

    print("Iter:{}, X:{:8}, X.grad:{:8}, loss:{:10}".format(
        i, x.detach().numpy()[0], x.grad.detach().numpy()[0], y.item()))

    x_rec.append(x.item())

    x.data.sub_(lr * x.grad)    # x -= lr * x.grad
    x.grad.zero_()

    iter_rec.append(i)
    loss_rec.append(y.item())   # record the scalar loss value

# left panel: loss over iterations
plt.subplot(121).plot(iter_rec, loss_rec, '-ro')
plt.xlabel("Iteration")
plt.ylabel("Loss value")

# right panel: trajectory of x on the function curve
x_t = torch.linspace(-3, 3, 100)
y = func(x_t)
plt.subplot(122).plot(x_t.numpy(), y.numpy(), label="y = 4*x^2")
plt.grid()
y_rec = [func(torch.tensor(i)).item() for i in x_rec]
plt.subplot(122).plot(x_rec, y_rec, '-ro')
plt.legend()
plt.show()
```
If we change the learning rate, we get plots like the ones shown here.
From those plots we can draw the following conclusions:
A large learning rate easily prevents convergence; the loss may even increase instead of decrease.
A small learning rate always converges, but the smaller it is, the smaller each step and the longer training takes.
The learning rate is not "the smaller the better" either: in the plot, lr=0.1 performs worse than lr=0.125, because lr=0.125 reaches the minimum in a single step (the gradient at x=2 is 8*2=16, and 2 - 16*0.125 = 0). A short sketch after this list verifies that calculation.
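A minimal sketch of that one-step check: for y = 4*x^2 the gradient at x = 2 is 16, so one step with lr = 0.125 lands exactly at the minimum:

```python
import torch

x = torch.tensor([2.], requires_grad=True)
y = torch.pow(2 * x, 2)         # y = 4*x^2
y.backward()

print(x.grad)                   # tensor([16.])
x.data.sub_(0.125 * x.grad)     # x <- x - lr * grad = 2 - 0.125 * 16
print(x.data)                   # tensor([0.])
```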
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(1)


def func(x_t):
    """
    y = (2x)^2 = 4*x^2
    dy/dx = 8x
    """
    return torch.pow(2 * x_t, 2)


iteration = 100
num_lr = 10
lr_min, lr_max = 0.01, 0.2

lr_list = np.linspace(lr_min, lr_max, num=num_lr).tolist()
loss_rec = [[] for l in range(len(lr_list))]
iter_rec = list()

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)
    for iter in range(iteration):
        y = func(x)
        y.backward()
        x.data.sub_(lr * x.grad)    # x -= lr * x.grad
        x.grad.zero_()

        loss_rec[i].append(y.item())

for i, loss_r in enumerate(loss_rec):
    plt.plot(range(len(loss_r)), loss_r, label="LR: {}".format(lr_list[i]))
plt.legend()
plt.xlabel('Iterations')
plt.ylabel('Loss value')
plt.show()
```
A side-by-side comparison across many learning rates leads to the same conclusions.
3.2 Momentum
$v_{i} = m * v_{i-1} + g(w_{i})$
$w_{i+1} = w_{i} - lr * v_{i}$
$w_{i+1}$: the parameters after the (i+1)-th update
$lr$: learning rate
$v_{i}$: the update quantity
$m$: momentum coefficient
$g(w_{i})$: the gradient of $w_{i}$
Take the 100th update quantity $v_{100}$ as an example:
$$\begin{aligned} v_{100} &= m * v_{99} + g(w_{100}) \\ &= g(w_{100}) + m * (m * v_{98} + g(w_{99})) \\ &= g(w_{100}) + m * g(w_{99}) + m^{2} * v_{98} \\ &= g(w_{100}) + m * g(w_{99}) + m^{2} * g(w_{98}) + m^{3} * v_{97} \end{aligned}$$
As the expansion shows, momentum takes the gradients of past steps into account when computing the current update. The smaller m is, the less the past matters; the closer m is to 1, the more the past matters. A typical choice is m = 0.9.
The figure above shows these weights on past gradients after normalization.
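A minimal sketch (not from the original post) that prints these weights for m = 0.9, both raw and normalized, making the exponential decay explicit:

```python
m = 0.9
weights = [m ** k for k in range(10)]   # weight of the gradient from k steps ago
total = sum(weights)

for k, w in enumerate(weights):
    print("k={:2d}  m^k={:.4f}  normalized={:.4f}".format(k, w, w / total))
```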
Experiment:
```python
import torch
import numpy as np
import torch.optim as optim
import matplotlib.pyplot as plt

torch.manual_seed(1)


def func(x):
    return torch.pow(2 * x, 2)    # y = (2x)^2 = 4*x^2, dy/dx = 8x


iteration = 100
m = 0.0     # momentum for the lr=0.01 curve; set to 0.9 or 0.63 to reproduce the plots discussed below

lr_list = [0.01, 0.03]

momentum_list = list()
loss_rec = [[] for l in range(len(lr_list))]
iter_rec = list()

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)

    momentum = 0. if lr == 0.03 else m
    momentum_list.append(momentum)

    optimizer = optim.SGD([x], lr=lr, momentum=momentum)

    for iter in range(iteration):
        y = func(x)
        y.backward()

        optimizer.step()
        optimizer.zero_grad()

        loss_rec[i].append(y.item())

for i, loss_r in enumerate(loss_rec):
    plt.plot(range(len(loss_r)), loss_r,
             label="LR: {} M:{}".format(lr_list[i], momentum_list[i]))
plt.legend()
plt.xlabel('Iterations')
plt.ylabel('Loss value')
plt.show()
```
With m = 0, lr = 0.03 converges faster than lr = 0.01.
With m = 0.9, lr = 0.01 overshoots and oscillates, i.e. m is too large for this learning rate.
With m = 0.63, lr = 0.01 converges faster than lr = 0.03. (The sketch after this list checks that optim.SGD's momentum update matches the formulas above.)
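A minimal sketch that runs optim.SGD with momentum next to a hand-written version of the formulas from this section (assuming dampening=0 and no weight decay, which are SGD's defaults); the two trajectories coincide up to floating-point error:

```python
import torch
import torch.optim as optim

lr, m = 0.01, 0.9

# optimizer-driven update on y = 4*x^2
x_opt = torch.tensor([2.], requires_grad=True)
optimizer = optim.SGD([x_opt], lr=lr, momentum=m)

# manual update: v = m*v + g(x), x = x - lr*v
x_man = torch.tensor([2.])
v = torch.zeros(1)

for _ in range(5):
    y = torch.pow(2 * x_opt, 2)
    optimizer.zero_grad()
    y.backward()
    optimizer.step()

    g = 8 * x_man               # gradient of y = 4*x^2
    v = m * v + g
    x_man = x_man - lr * v

print(x_opt.data, x_man)        # the same value from both update rules
```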
4. Optimizers
4.1 optim.SGD: stochastic gradient descent
```python
torch.optim.SGD(params, lr=<required parameter>,
                momentum=0, dampening=0,
                weight_decay=0, nesterov=False)
```
Main parameters:
params: the parameter groups to manage
lr: initial learning rate
momentum: momentum coefficient (the m, or beta, in the formulas above)
weight_decay: L2 regularization coefficient
nesterov: whether to use NAG (Nesterov accelerated gradient); see the usage sketch after this list
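A minimal sketch of a typical SGD configuration; the concrete values (momentum 0.9, weight decay 1e-4) are common choices used here as assumptions, not prescriptions from the text:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)                 # placeholder model
optimizer = optim.SGD(model.parameters(),
                      lr=0.1,
                      momentum=0.9,      # the momentum coefficient m
                      weight_decay=1e-4, # L2 regularization strength
                      nesterov=True)     # use Nesterov accelerated gradient
```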
4.2 optim.Adagrad: gradient descent with adaptive learning rates
4.3 optim.RMSprop: an improvement on Adagrad
4.4 optim.Adadelta: an improvement on Adagrad
4.5 optim.Adam: RMSprop combined with momentum
4.6 optim.Adamax: Adam with an upper bound on the learning rate
4.7 optim.SparseAdam: a sparse version of Adam
4.8 optim.ASGD: averaged stochastic gradient descent
4.9 optim.Rprop: resilient backpropagation
4.10 optim.LBFGS: an improvement on BFGS (limited-memory BFGS)
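All of these optimizers share the same interface: construct them with the parameters to manage, then call zero_grad()/step() in the training loop as shown earlier. A minimal sketch with illustrative (assumed) hyperparameters:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)

sgd = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
adagrad = optim.Adagrad(model.parameters(), lr=0.01)
rmsprop = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)
adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
```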