Optimizers

1. Attributes of the optimizer

class Optimizer(object):
    def __init__(self, params, defaults):
        self.defaults = defaults
        self.state = defaultdict(dict)
        self.param_groups = []
  • defaults: the optimizer's hyperparameters (a dict holding 'lr', 'momentum', and the other optimizer hyperparameters)
  • state: cached state for the parameters, e.g. the momentum buffers
  • param_groups: the parameter groups being managed (param_groups is a list of dicts; each dict holds one group of parameters together with the hyperparameter settings for that group)
  • _step_count: counts the number of updates; used by learning-rate schedulers (a short inspection sketch follows below)
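
A minimal sketch (with an illustrative 2x2 tensor standing in for a model parameter) that prints these attributes on an SGD instance:

import torch
import torch.optim as optim

# a single 2x2 tensor acts as the "parameter" being optimized
w = torch.randn((2, 2), requires_grad=True)
optimizer = optim.SGD([w], lr=0.1, momentum=0.9)

print(optimizer.defaults)      # {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, ...}
print(optimizer.state)         # empty until step() first populates the momentum buffers
print(optimizer.param_groups)  # one dict: the managed params plus their hyperparameter settings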

2. Methods of the optimizer

2.1 step()

Purpose: perform a single update step

Experiment:

import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)  # set the random seed

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1)

print("weight before step:{}".format(weight.data))
optimizer.step()  # change lr to 1 or 0.1 and observe the result
print("weight after step:{}".format(weight.data))
weight before step:tensor([[0.6614, 0.2669],
[0.0617, 0.6213]])
weight after step:tensor([[ 0.5614, 0.1669],
[-0.0383, 0.5213]])

The gradient was manually set to 1 and lr = 0.1, so every element of the parameter decreases by 0.1 after the update.
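
Written out for the first element, this is just the plain SGD update: $w' = w - lr * g(w) = 0.6614 - 0.1 \times 1 = 0.5614$.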

2.2 zero_grad()

Purpose: clear the gradients of all managed parameters

Experiment:

import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)  # set the random seed

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1)

print("weight before step:{}".format(weight.data))
optimizer.step()  # change lr to 1 or 0.1 and observe the result
print("weight after step:{}".format(weight.data))

print("weight in optimizer:{}\nweight in weight:{}\n".format(id(optimizer.param_groups[0]['params'][0]), id(weight)))

print("weight.grad is {}\n".format(weight.grad))
optimizer.zero_grad()
print("after optimizer.zero_grad(), weight.grad is\n{}".format(weight.grad))
weight before step:tensor([[0.6614, 0.2669],
[0.0617, 0.6213]])
weight after step:tensor([[ 0.5614, 0.1669],
[-0.0383, 0.5213]])
weight in optimizer:2188611511704
weight in weight:2188611511704

weight.grad is tensor([[1., 1.],
[1., 1.]])

after optimizer.zero_grad(), weight.grad is
tensor([[0., 0.],
[0., 0.]])

From "weight in optimizer:2188611511704" and "weight in weight:2188611511704" we can see that the parameter held by the optimizer and weight itself share the same address: the optimizer stores a reference to the very same tensor, so changing one changes the other, which is exactly the behaviour we want.
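
This can also be verified directly with Python's is operator; a one-line sketch:

print(optimizer.param_groups[0]['params'][0] is weight)   # True: the very same tensor object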

After optimizer.zero_grad() runs, the gradients are cleared. In practice we call optimizer.zero_grad() in every training iteration to clear the gradients (otherwise they accumulate), so that each iteration works with correctly computed gradients; a typical loop is sketched below.
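
A minimal sketch of the usual ordering inside a training loop (model, criterion, and loader are hypothetical placeholders, not part of the experiment above):

for inputs, labels in loader:
    optimizer.zero_grad()              # clear gradients left over from the previous iteration
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()                    # compute fresh gradients
    optimizer.step()                   # apply one update using those gradients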

2.3 add_param_group()

Purpose: add a parameter group

Experiment:

import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)  # set the random seed

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1)

print("optimizer.param_groups is\n{}".format(optimizer.param_groups))

w2 = torch.randn((3, 3), requires_grad=True)

optimizer.add_param_group({"params": w2, 'lr': 0.0001})

print("optimizer.param_groups is\n{}".format(optimizer.param_groups))
optimizer.param_groups is
[{'params': [tensor([[0.6614, 0.2669],
[0.0617, 0.6213]], requires_grad=True)], 'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False}]
optimizer.param_groups is
[{'params': [tensor([[0.6614, 0.2669],
[0.0617, 0.6213]], requires_grad=True)], 'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False}, {'params': [tensor([[-0.4519, -0.1661, -1.5228],
[ 0.3817, -1.0276, -0.5631],
[-0.8923, -0.0583, -0.1955]], requires_grad=True)], 'lr': 0.0001, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False}]

If you want two groups of weights to use different learning rates or other hyperparameters, add_param_group is the function to use. After the call, the optimizer.param_groups list contains one extra dict that manages the hyperparameter settings for w2. The same grouping can also be set up at construction time, as sketched below.
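
A minimal sketch of passing a list of group dicts to the constructor instead (w1 and w2 are illustrative tensors):

import torch
import torch.optim as optim

w1 = torch.randn((2, 2), requires_grad=True)
w2 = torch.randn((3, 3), requires_grad=True)

# two groups with their own learning rates; options not given in a group fall back to the defaults
optimizer = optim.SGD([
    {'params': [w1], 'lr': 0.1},
    {'params': [w2], 'lr': 0.0001},
], lr=0.01, momentum=0.9)

print(len(optimizer.param_groups))  # 2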

2.4 state_dict()

Purpose: return a dict holding the optimizer's current state information

Experiment:

import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)  # set the random seed

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
opt_state_dict = optimizer.state_dict()

print("state_dict before step:\n", opt_state_dict)

for i in range(10):
    optimizer.step()

print("state_dict after step:\n", optimizer.state_dict())

torch.save(optimizer.state_dict(), os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))
state_dict before step:
{'state': {}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [2529699450456]}]}
state_dict after step:
{'state': {2529699450456: {'momentum_buffer': tensor([[6.5132, 6.5132],
[6.5132, 6.5132]])}, 'default_factory': {}}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [2529699450456]}]}

optimizer.state_dict() returns a dict with two keys, 'state' and 'param_groups', which holds the optimizer's current state so that the next training run can resume directly from this state.

2.5 load_state_dict()

Purpose: load a state-information dict

Experiment:

import os
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
import torch
import torch.optim as optim
from tools.common_tools import set_seed

set_seed(1)  # set the random seed

weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
state_dict = torch.load(os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))

print("state_dict before load state:\n", optimizer.state_dict())
optimizer.load_state_dict(state_dict)
print("state_dict after load state:\n", optimizer.state_dict())
state_dict before load state:
{'state': {}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [1517314336152]}]}
state_dict after load state:
{'state': {1517314336152: {'momentum_buffer': tensor([[6.5132, 6.5132],
[6.5132, 6.5132]])}, 'default_factory': {}}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [1517314336152]}]}
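
As the output shows, load_state_dict() restores the momentum buffer saved in 2.4 into the freshly created optimizer. Together the two methods are typically used to checkpoint and resume training; a minimal sketch (model, epoch, and the file name are hypothetical placeholders):

# save a checkpoint (model is a hypothetical nn.Module)
checkpoint = {"model_state": model.state_dict(),
              "optimizer_state": optimizer.state_dict(),
              "epoch": epoch}
torch.save(checkpoint, "checkpoint.pkl")

# ...later, resume from the checkpoint
checkpoint = torch.load("checkpoint.pkl")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
start_epoch = checkpoint["epoch"] + 1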

3. Learning rate and momentum

3.1 Learning rate

Gradient descent: $w_{i+1} = w_i - LR * g(w_i)$

The learning rate (LR) controls the size of each update step.

Experiment:

import torch
import numpy as np
import matplotlib.pyplot as plt
torch.manual_seed(1)


def func(x_t):
    """
    y = (2x)^2 = 4*x^2    dy/dx = 8x
    """
    return torch.pow(2*x_t, 2)


# init
x = torch.tensor([2.], requires_grad=True)

# ------------------------------ plot data ------------------------------
x_t = torch.linspace(-3, 3, 100)
y = func(x_t)
plt.plot(x_t.numpy(), y.numpy(), label="y = 4*x^2")
plt.grid()
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

[Figure: plot of y = 4*x^2 over x in [-3, 3]]

import torch
import numpy as np
import matplotlib.pyplot as plt
torch.manual_seed(1)


def func(x_t):
    """
    y = (2x)^2 = 4*x^2    dy/dx = 8x
    """
    return torch.pow(2*x_t, 2)


# init
x = torch.tensor([2.], requires_grad=True)

# ------------------------------ gradient descent ------------------------------
iter_rec, loss_rec, x_rec = list(), list(), list()

lr = 0.01            # also try 1., 0.5, 0.2, 0.1, 0.125
max_iteration = 20   # 4 iterations are enough for lr = 1. or 0.5; use 20 or 200 for the smaller learning rates

for i in range(max_iteration):

    y = func(x)
    y.backward()

    print("Iter:{}, X:{:8}, X.grad:{:8}, loss:{:10}".format(
        i, x.detach().numpy()[0], x.grad.detach().numpy()[0], y.item()))

    x_rec.append(x.item())

    x.data.sub_(lr * x.grad)  # x -= lr * x.grad, i.e. x = x - lr * x.grad
    x.grad.zero_()

    iter_rec.append(i)
    loss_rec.append(y.item())  # store a plain float so matplotlib can plot it

plt.subplot(121).plot(iter_rec, loss_rec, '-ro')
plt.xlabel("Iteration")
plt.ylabel("Loss value")

x_t = torch.linspace(-3, 3, 100)
y = func(x_t)
plt.subplot(122).plot(x_t.numpy(), y.numpy(), label="y = 4*x^2")
plt.grid()
y_rec = [func(torch.tensor(i)).item() for i in x_rec]
plt.subplot(122).plot(x_rec, y_rec, '-ro')
plt.legend()
plt.show()

[Figure: loss vs. iteration (left) and the descent trajectory on y = 4*x^2 (right) for lr = 0.01]

If we change the learning rate, the resulting plots look as follows:

[Figures: the same two plots for several other learning rates]

From the plots above we can draw the following conclusions:

  1. A large learning rate easily makes the result fail to converge; the loss goes up instead of down.

  2. A small learning rate always lets the result converge, but the smaller the learning rate, the smaller each step and the longer training takes.

  3. The learning rate is not "the smaller the better" either: in the plots above lr=0.1 does not do as well as lr=0.125, which reaches the minimum in a single step because 2 - 16 * 0.125 = 0 (worked out below).
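
Written out with the gradient-descent rule from 3.1, where g(x) = 8x and the starting point is x = 2:

$x_1 = x_0 - lr * g(x_0) = 2 - 0.125 \times 8 \times 2 = 0$

so lr = 0.125 lands exactly on the minimum of y = 4*x^2 in one step.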

import torch
import numpy as np
import matplotlib.pyplot as plt
torch.manual_seed(1)


def func(x_t):
    """
    y = (2x)^2 = 4*x^2    dy/dx = 8x
    """
    return torch.pow(2*x_t, 2)


iteration = 100
num_lr = 10
lr_min, lr_max = 0.01, 0.2 # .5 .3 .2

lr_list = np.linspace(lr_min, lr_max, num=num_lr).tolist()
loss_rec = [[] for l in range(len(lr_list))]
iter_rec = list()

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)
    for iter in range(iteration):

        y = func(x)
        y.backward()
        x.data.sub_(lr * x.grad)  # x = x - lr * x.grad
        x.grad.zero_()

        loss_rec[i].append(y.item())

for i, loss_r in enumerate(loss_rec):
    plt.plot(range(len(loss_r)), loss_r, label="LR: {}".format(lr_list[i]))
plt.legend()
plt.xlabel('Iterations')
plt.ylabel('Loss value')
plt.show()

[Figure: loss curves for 10 learning rates evenly spaced between 0.01 and 0.2]

This side-by-side comparison leads to the same conclusions as before.

3.2 Momentum

$v_i = m * v_{i-1} + g(w_i)$
$w_{i+1} = w_i - lr * v_i$

$w_{i+1}$: the parameters after the (i+1)-th update
$lr$: the learning rate
$v_i$: the update quantity
$m$: the momentum coefficient
$g(w_i)$: the gradient of $w_i$

Take the 100th update quantity $v_{100}$ as an example:

$\begin{aligned} v_{100} &= m * v_{99} + g(w_{100}) \\ &= g(w_{100}) + m * (m * v_{98} + g(w_{99})) \\ &= g(w_{100}) + m * g(w_{99}) + m^2 * v_{98} \\ &= g(w_{100}) + m * g(w_{99}) + m^2 * g(w_{98}) + m^3 * v_{97} \end{aligned}$

As this shows, momentum lets the gradients of past steps influence the current update. The smaller m is, the less weight the past receives; the closer m is to 1, the more it receives. A common choice is m = 0.9; the sketch below unrolls the recursion numerically.
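
A minimal numeric sketch (not part of the original code) that unrolls this recursion with a constant gradient of 1 and m = 0.9, the same setting as the state_dict() experiment in 2.4:

m = 0.9
v = 0.0
for i in range(1, 11):
    v = m * v + 1.0        # v_i = m * v_{i-1} + g(w_i), with g(w_i) = 1 at every step
    print(i, round(v, 4))  # 1.0, 1.9, 2.71, ..., 6.5132

After 10 steps v equals 6.5132, which matches the momentum_buffer value shown in the state_dict() output of section 2.4.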

[Figure]

[Figure]

The second plot above shows the weights after normalization.

Experiment:

import torch
import numpy as np
import torch.optim as optim
import matplotlib.pyplot as plt
torch.manual_seed(1)

def func(x):
    return torch.pow(2*x, 2)   # y = (2x)^2 = 4*x^2    dy/dx = 8x

iteration = 100
m = 0.0 # .9 .63

lr_list = [0.01, 0.03]

momentum_list = list()
loss_rec = [[] for l in range(len(lr_list))]
iter_rec = list()

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)

    momentum = 0. if lr == 0.03 else m
    momentum_list.append(momentum)

    optimizer = optim.SGD([x], lr=lr, momentum=momentum)

    for iter in range(iteration):

        y = func(x)
        y.backward()

        optimizer.step()
        optimizer.zero_grad()

        loss_rec[i].append(y.item())

for i, loss_r in enumerate(loss_rec):
    plt.plot(range(len(loss_r)), loss_r, label="LR: {} M:{}".format(lr_list[i], momentum_list[i]))
plt.legend()
plt.xlabel('Iterations')
plt.ylabel('Loss value')
plt.show()

[Figure: loss curves for lr=0.01 and lr=0.03, both with m=0]

When m = 0, lr = 0.03 converges faster than lr = 0.01.

[Figure: loss curves with m=0.9 for lr=0.01 (lr=0.03 keeps m=0)]

When m = 0.9 is used with lr = 0.01, the loss overshoots and oscillates, which means m is too large.

[Figure: loss curves with m=0.63 for lr=0.01 (lr=0.03 keeps m=0)]

When m = 0.63, lr = 0.01 converges faster than lr = 0.03.

4. Optimizers

4.1 optim.SGD: stochastic gradient descent

torch.optim.SGD(params,
                lr=<required parameter>,
                momentum=0,
                dampening=0,
                weight_decay=0,
                nesterov=False)

Main parameters:

  • params: the parameter groups to manage
  • lr: initial learning rate
  • momentum: momentum coefficient (the m in the formulas above, often written as beta)
  • weight_decay: L2 regularization coefficient
  • nesterov: whether to use NAG (Nesterov accelerated gradient); see the usage sketch below
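
A minimal usage sketch combining these options (the values are only illustrative):

import torch
import torch.optim as optim

w = torch.randn((2, 2), requires_grad=True)

# SGD with momentum, L2 weight decay, and Nesterov acceleration
optimizer = optim.SGD([w], lr=0.1, momentum=0.9,
                      weight_decay=1e-4, nesterov=True)

w.grad = torch.ones((2, 2))   # fake gradient, as in the experiments above
optimizer.step()
optimizer.zero_grad()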

4.2 optim.Adagrad: adaptive learning-rate gradient descent

4.3 optim.RMSprop: an improvement on Adagrad

4.4 optim.Adadelta: an improvement on Adagrad

4.5 optim.Adam: RMSprop combined with momentum

4.6 optim.Adamax: Adam with an upper bound on the learning rate

4.7 optim.SparseAdam: a sparse version of Adam

4.8 optim.ASGD: averaged stochastic gradient descent

4.9 optim.Rprop: resilient backpropagation

4.10 optim.LBFGS: an improvement on BFGS
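
All of these optimizers share the Optimizer interface from section 2 (step(), zero_grad(), state_dict(), param_groups, ...), so they can usually be swapped in with a one-line change. A minimal sketch using Adam (the hyperparameter values are only illustrative):

import torch
import torch.optim as optim

w = torch.randn((2, 2), requires_grad=True)

# drop-in replacement for SGD: the construction differs, the interface does not
optimizer = optim.Adam([w], lr=1e-3, betas=(0.9, 0.999), weight_decay=0)

w.grad = torch.ones((2, 2))
optimizer.step()
optimizer.zero_grad()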