Introduction
NumPy excels at vectorized numerical computation, but for more flexible and complex data tasks, such as attaching labels to data, handling missing values, grouping, and building pivot tables, it falls short. Pandas, a library built on top of NumPy, provides high-level data structures and operations that make data analysis faster and easier.
12.1 Creating Objects
12.1.1 The Pandas Series Object
A Series is a one-dimensional array of labeled data.
Creating a Series object
General form: pd.Series(data, index=index, dtype=dtype)
data: the data; can be a list, a dictionary, or a NumPy array
index: the index; optional
dtype: the data type; optional
1. Creating from a list
import pandas as pd

data = pd.Series([1.5, 3, 4.5, 6])
data
0 1.5
1 3.0
2 4.5
3 6.0
dtype: float64
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
a 1.5
b 3.0
c 4.5
d 6.0
dtype: float64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
data
a 1
b 2
c 3
d 4
dtype: int64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
Note: the data can mix several types.
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
a 1
b 2
c 3
d 4
dtype: object
data["c"]
'3'
The data type can be coerced explicitly.
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
data["c"]
3.0
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data
ValueError: could not convert string to float: 'a'
2. Creating from a one-dimensional NumPy array
import numpy as np

x = np.arange(5)
pd.Series(x)
0 0
1 1
2 2
3 3
4 4
dtype: int32
3. Creating from a dictionary
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)
population
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
dtype: int64
When creating from a dictionary, if an index is given, only matching keys are kept; labels not found in the dictionary get NaN.
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])
population
BeiJing 2154.0
HangZhou 981.0
c NaN
d NaN
dtype: float64
4. When data is a scalar
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
12.1.2 The Pandas DataFrame Object
A DataFrame is a two-dimensional array of labeled data.
Creating a DataFrame object
General form: pd.DataFrame(data, index=index, columns=columns)
data: the data; can be a list, a dictionary, or a NumPy array
index: the row index; optional
columns: the column labels; optional
1. From a Series object
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)
pd.DataFrame(population)
             0
BeiJing   2154
ShangHai  2424
ShenZhen  1303
HangZhou   981
pd.DataFrame(population, columns=["population"])

          population
BeiJing         2154
ShangHai        2424
ShenZhen        1303
HangZhou         981
2. From a dictionary of Series objects
GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468}
GDP = pd.Series(GDP_dict)
GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
pd.DataFrame({"population": population, "GDP": GDP})

          population    GDP
BeiJing         2154  30320
ShangHai        2424  32680
ShenZhen        1303  24222
HangZhou         981  13468
Note: an entry that is too short is automatically filled in; here the scalar "China" is broadcast to every row.
pd.DataFrame({"population": population,
              "GDP": GDP,
              "country": "China"})

          population    GDP country
BeiJing         2154  30320   China
ShangHai        2424  32680   China
ShenZhen        1303  24222   China
HangZhou         981  13468   China
3. From a list of dictionaries
import numpy as np
import pandas as pd

data = [{"a": i, "b": 2*i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
data = pd.DataFrame(data)
data
data1 = data["a"].copy()
data1
0 0
1 1
2 2
Name: a, dtype: int64
data1[0] = 10
data1

0    10
1     1
2     2
Name: a, dtype: int64
data = [{"a": 1, "b": 1}, {"b": 3, "c": 4}]
data
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
pd.DataFrame(data)

     a  b    c
0  1.0  1  NaN
1  NaN  3  4.0
4. From a two-dimensional NumPy array
data = np.random.randint(10, size=(3, 2))
data
array([[3, 0],
[9, 6],
[7, 7]])
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])

   foo  bar
a    3    0
b    9    6
c    7    7
12.2 DataFrame Properties
1. Attributes
data = pd.DataFrame({"pop": population, "GDP": GDP})
data

           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468
(1) df.values returns the underlying data as a NumPy array
array([[ 2154, 30320],
[ 2424, 32680],
[ 1303, 24222],
[ 981, 13468]], dtype=int64)
(2) df.index returns the row index
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')
(3) df.columns returns the column index
Index(['pop', 'GDP'], dtype='object')
(4) df.shape returns the shape
(4, 2)
(5) df.size returns the number of elements
8
(6) df.dtypes returns the data type of each column
pop int64
GDP int64
dtype: object
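For reference, the outputs in (1) through (6) above presumably come from attribute accesses like the following (a minimal sketch using the data frame built above):

data.values    # the underlying data as a NumPy array
data.index     # the row index
data.columns   # the column index
data.shape     # (rows, columns) -> (4, 2)
data.size      # total number of elements -> 8
data.dtypes    # dtype of each column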
2. Indexing
There are so many indexing methods that it is easy to get confused, so here is a summary of loc and iloc (these two are the recommended ways to index); a short sketch follows this list.
Selecting rows
data.loc['label'] selects one row and returns a Series
data.loc[['label']] selects one row and returns a DataFrame
data.loc[['label1', 'label2']] selects multiple rows and returns a DataFrame
data.loc['label1':'label3'] selects the rows from 'label1' through 'label3' and returns a DataFrame
data.iloc[1] selects row 1 and returns a Series
data.iloc[[1]] selects row 1 and returns a DataFrame
data.iloc[[1, 3]] selects rows 1 and 3 and returns a DataFrame
data.iloc[0:3] selects rows 0 through 2 and returns a DataFrame
Note: all of the above can be written as (i)loc[row selector, :]; omitting the second colon also selects rows by default.
Selecting columns
data.loc[:, 'label'] returns a Series
data.loc[:, ['label']] returns a DataFrame
data.loc[:, ['label1', 'label2']] selects multiple columns and returns a DataFrame
data.loc[:, 'label1':'label3'] selects the columns from 'label1' through 'label3' and returns a DataFrame
data.iloc[:, 1] selects column 1 and returns a Series
data.iloc[:, [1]] selects column 1 and returns a DataFrame
data.iloc[:, [1, 3]] selects columns 1 and 3 and returns a DataFrame
data.iloc[:, 0:3] selects columns 0 through 2 and returns a DataFrame
Selecting elements
data.loc['row label', 'column label'] returns a scalar
data.loc[['row label'], ['column label']] returns a DataFrame
data.loc[['row label 1', 'row label 2'], ['column label 1', 'column label 2']] returns a DataFrame
data.loc['row label 1':'row label 3', 'column label 1':'column label 2'] selects the intersection of those rows and columns and returns a DataFrame
data.iloc[1, 2] selects the element at row 1, column 2 and returns a scalar
data.iloc[[1], [2]] selects the element at row 1, column 2 and returns a DataFrame
data.iloc[[1, 0], [1, 3]] selects the four elements at the intersections of rows 1 and 0 with columns 1 and 3 and returns a DataFrame
data.iloc[1:3, 1:2] selects rows 1 through 2 and column 1 (iloc slices exclude the endpoint) and returns a DataFrame
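As a quick illustration of the rules above (a minimal sketch, assuming the city data frame with the pop and GDP columns used in this chapter):

data.loc["BeiJing"]                 # one row by label  -> Series
data.loc[["BeiJing", "HangZhou"]]   # rows by label list -> DataFrame
data.loc[:, "GDP"]                  # one column by label -> Series
data.iloc[0:2, [0]]                 # first two rows, column 0 -> DataFrame
data.loc["BeiJing", "GDP"]          # a single element -> scalar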
           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468
(1) Getting columns
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
Name: pop, dtype: int64
           pop
BeiJing   2154
ShangHai  2424
ShenZhen  1303
HangZhou   981
            GDP   pop
BeiJing   30320  2154
ShangHai  32680  2424
ShenZhen  24222  1303
HangZhou  13468   981
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
Name: GDP, dtype: int64
(2) Getting rows
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
          pop    GDP
BeiJing  2154  30320
data.loc[["BeiJing", "HangZhou"]]

           pop    GDP
BeiJing   2154  30320
HangZhou   981  13468
           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
          pop    GDP
BeiJing  2154  30320
           pop    GDP
ShangHai  2424  32680
HangZhou   981  13468
(3) Getting a scalar

           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468
data.loc["BeiJing", "GDP"]
30320
30320
30320
(4) Indexing a Series object
pandas.core.series.Series
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
30320
3. Slicing
df = pd.DataFrame(np.random.randn(6, 4),
                  index=['2019-01-01', '2019-01-02', '2019-01-03',
                         '2019-01-04', '2019-01-05', '2019-01-06'],
                  columns=["A", "B", "C", "D"])
df
                   A         B         C         D
2019-01-01 -0.242680  0.310893 -2.323013  0.725643
2019-01-02 -1.683455  0.442882  0.061800  0.522243
2019-01-03  0.320874 -1.309343  0.418429  0.084683
2019-01-04 -0.894394  0.698045  1.695152 -0.122940
2019-01-05  0.731645  0.663405  0.189481 -0.872651
2019-01-06  0.050233 -1.236851 -0.246509 -0.328486
(1) Row slicing
df["2019-01-01":"2019-01-03"]
                   A         B         C         D
2019-01-01 -0.242680  0.310893 -2.323013  0.725643
2019-01-02 -1.683455  0.442882  0.061800  0.522243
2019-01-03  0.320874 -1.309343  0.418429  0.084683
df.loc["2019-01-01":"2019-01-03"]
                   A         B         C         D
2019-01-01 -0.242680  0.310893 -2.323013  0.725643
2019-01-02 -1.683455  0.442882  0.061800  0.522243
2019-01-03  0.320874 -1.309343  0.418429  0.084683
                   A         B         C         D
2019-01-01 -0.242680  0.310893 -2.323013  0.725643
2019-01-02 -1.683455  0.442882  0.061800  0.522243
2019-01-03  0.320874 -1.309343  0.418429  0.084683
(2) Column slicing
                   A         B         C         D
2019-01-01 -0.242680  0.310893 -2.323013  0.725643
2019-01-02 -1.683455  0.442882  0.061800  0.522243
2019-01-03  0.320874 -1.309343  0.418429  0.084683
2019-01-04 -0.894394  0.698045  1.695152 -0.122940
2019-01-05  0.731645  0.663405  0.189481 -0.872651
2019-01-06  0.050233 -1.236851 -0.246509 -0.328486
                   A         B         C
2019-01-01 -0.242680  0.310893 -2.323013
2019-01-02 -1.683455  0.442882  0.061800
2019-01-03  0.320874 -1.309343  0.418429
2019-01-04 -0.894394  0.698045  1.695152
2019-01-05  0.731645  0.663405  0.189481
2019-01-06  0.050233 -1.236851 -0.246509
                   A         B         C
2019-01-01 -0.242680  0.310893 -2.323013
2019-01-02 -1.683455  0.442882  0.061800
2019-01-03  0.320874 -1.309343  0.418429
2019-01-04 -0.894394  0.698045  1.695152
2019-01-05  0.731645  0.663405  0.189481
2019-01-06  0.050233 -1.236851 -0.246509
(3) Taking values in various ways
                   A         B         C         D
2019-01-01 -0.242680  0.310893 -2.323013  0.725643
2019-01-02 -1.683455  0.442882  0.061800  0.522243
2019-01-03  0.320874 -1.309343  0.418429  0.084683
2019-01-04 -0.894394  0.698045  1.695152 -0.122940
2019-01-05  0.731645  0.663405  0.189481 -0.872651
2019-01-06  0.050233 -1.236851 -0.246509 -0.328486
df.loc["2019-01-02":"2019-01-03", "C":"D"]
                   C         D
2019-01-02  0.061800  0.522243
2019-01-03  0.418429  0.084683
                   C         D
2019-01-02  0.061800  0.522243
2019-01-03  0.418429  0.084683
df.loc["2019-01-04":"2019-01-06", ["A", "C"]]
                   A         C
2019-01-04 -0.894394  1.695152
2019-01-05  0.731645  0.189481
2019-01-06  0.050233 -0.246509
                   A         C
2019-01-04 -0.894394  1.695152
2019-01-05  0.731645  0.189481
2019-01-06  0.050233 -0.246509
df.loc[["2019-01-02", "2019-01-06"], "C":"D"]
                   C         D
2019-01-02  0.061800  0.522243
2019-01-06 -0.246509 -0.328486
                   A         B         C
2019-01-02 -1.683455  0.442882  0.061800
2019-01-06  0.050233 -1.236851 -0.246509
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
                   A         D
2019-01-04 -0.894394 -0.122940
2019-01-06  0.050233 -0.328486
                   A         D
2019-01-02 -1.683455  0.522243
2019-01-06  0.050233 -0.328486
4. Boolean indexing
                   A         B         C         D
2019-01-01 -0.968004  0.024658  1.365858 -0.280383
2019-01-02 -0.409198  1.228846  0.673056  0.396339
2019-01-03 -2.356054 -0.458781  0.063541 -0.016453
2019-01-04  0.847711 -1.054088 -0.114295  0.003130
2019-01-05  1.181092  0.017556 -0.598561 -0.303268
2019-01-06 -0.084136 -0.542269  1.378498  0.919418
                A      B      C      D
2019-01-01  False   True   True  False
2019-01-02  False   True   True   True
2019-01-03  False  False   True  False
2019-01-04   True  False  False   True
2019-01-05   True   True  False  False
2019-01-06  False  False   True   True
                   A         B         C         D
2019-01-01       NaN  0.024658  1.365858       NaN
2019-01-02       NaN  1.228846  0.673056  0.396339
2019-01-03       NaN       NaN  0.063541       NaN
2019-01-04  0.847711       NaN       NaN  0.003130
2019-01-05  1.181092  0.017556       NaN       NaN
2019-01-06       NaN       NaN  1.378498  0.919418
2019-01-01 False
2019-01-02 False
2019-01-03 False
2019-01-04 True
2019-01-05 True
2019-01-06 False
Freq: D, Name: A, dtype: bool
                   A         B         C         D
2019-01-04  0.847711 -1.054088 -0.114295  0.003130
2019-01-05  1.181092  0.017556 -0.598561 -0.303268
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
                   A         B         C         D      E
2019-01-01 -0.968004  0.024658  1.365858 -0.280383    one
2019-01-02 -0.409198  1.228846  0.673056  0.396339    one
2019-01-03 -2.356054 -0.458781  0.063541 -0.016453    two
2019-01-04  0.847711 -1.054088 -0.114295  0.003130  three
2019-01-05  1.181092  0.017556 -0.598561 -0.303268   four
2019-01-06 -0.084136 -0.542269  1.378498  0.919418  three
ind = df2["E"].isin(["two", "four"])
ind
2019-01-01 False
2019-01-02 False
2019-01-03 True
2019-01-04 False
2019-01-05 True
2019-01-06 False
Freq: D, Name: E, dtype: bool
                   A         B         C         D     E
2019-01-03 -2.356054 -0.458781  0.063541 -0.016453   two
2019-01-05  1.181092  0.017556 -0.598561 -0.303268  four
5. Assignment
                   A         B         C         D
2019-01-01 -0.968004  0.024658  1.365858 -0.280383
2019-01-02 -0.409198  1.228846  0.673056  0.396339
2019-01-03 -2.356054 -0.458781  0.063541 -0.016453
2019-01-04  0.847711 -1.054088 -0.114295  0.003130
2019-01-05  1.181092  0.017556 -0.598561 -0.303268
2019-01-06 -0.084136 -0.542269  1.378498  0.919418
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01 1
2019-01-02 2
2019-01-03 3
2019-01-04 4
2019-01-05 5
2019-01-06 6
Freq: D, dtype: int64
df["E"] = s1
df

                   A         B         C         D  E
2019-01-01 -0.968004  0.024658  1.365858 -0.280383  1
2019-01-02 -0.409198  1.228846  0.673056  0.396339  2
2019-01-03 -2.356054 -0.458781  0.063541 -0.016453  3
2019-01-04  0.847711 -1.054088 -0.114295  0.003130  4
2019-01-05  1.181092  0.017556 -0.598561 -0.303268  5
2019-01-06 -0.084136 -0.542269  1.378498  0.919418  6
df.loc["2019-01-01", "A"] = 0
df
                   A         B         C         D  E
2019-01-01  0.000000  0.024658  1.365858 -0.280383  1
2019-01-02 -0.409198  1.228846  0.673056  0.396339  2
2019-01-03 -2.356054 -0.458781  0.063541 -0.016453  3
2019-01-04  0.847711 -1.054088 -0.114295  0.003130  4
2019-01-05  1.181092  0.017556 -0.598561 -0.303268  5
2019-01-06 -0.084136 -0.542269  1.378498  0.919418  6
                   A         B         C         D  E
2019-01-01  0.000000  0.000000  1.365858 -0.280383  1
2019-01-02 -0.409198  1.228846  0.673056  0.396339  2
2019-01-03 -2.356054 -0.458781  0.063541 -0.016453  3
2019-01-04  0.847711 -1.054088 -0.114295  0.003130  4
2019-01-05  1.181092  0.017556 -0.598561 -0.303268  5
2019-01-06 -0.084136 -0.542269  1.378498  0.919418  6
1 2 df["D" ] = np.array([5 ]*len (df)) df
                   A         B         C  D  E
2019-01-01  0.000000  0.000000  1.365858  5  1
2019-01-02 -0.409198  1.228846  0.673056  5  2
2019-01-03 -2.356054 -0.458781  0.063541  5  3
2019-01-04  0.847711 -1.054088 -0.114295  5  4
2019-01-05  1.181092  0.017556 -0.598561  5  5
2019-01-06 -0.084136 -0.542269  1.378498  5  6
df.index = [i for i in range(len(df))]
df
          A         B         C  D  E
0  0.000000  0.000000  1.365858  5  1
1 -0.409198  1.228846  0.673056  5  2
2 -2.356054 -0.458781  0.063541  5  3
3  0.847711 -1.054088 -0.114295  5  4
4  1.181092  0.017556 -0.598561  5  5
5 -0.084136 -0.542269  1.378498  5  6
df.columns = [i for i in range(df.shape[1])]
df
          0         1         2  3  4
0  0.000000  0.000000  1.365858  5  1
1 -0.409198  1.228846  0.673056  5  2
2 -2.356054 -0.458781  0.063541  5  3
3  0.847711 -1.054088 -0.114295  5  4
4  1.181092  0.017556 -0.598561  5  5
5 -0.084136 -0.542269  1.378498  5  6
12.3 Numerical Operations and Statistical Analysis
1. Inspecting the data
import pandas as pd
import numpy as np

dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])
df
                   A         B         C         D
2019-01-01  1.051134 -2.357190 -0.441416  0.756528
2019-01-02 -1.779281  1.581360  2.070031  0.672192
2019-01-03  1.072005 -0.447704  0.783539 -0.962688
2019-01-04  1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289  1.637207  1.798403  0.464263
2019-01-06  1.597604 -0.099617  0.735458  1.495397
(1) Viewing the first rows
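The two tables that follow presumably come from head() calls along these lines (a hedged sketch):

df.head()    # first five rows (the default)
df.head(2)   # first two rows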
                   A         B         C         D
2019-01-01  1.051134 -2.357190 -0.441416  0.756528
2019-01-02 -1.779281  1.581360  2.070031  0.672192
2019-01-03  1.072005 -0.447704  0.783539 -0.962688
2019-01-04  1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289  1.637207  1.798403  0.464263
                   A        B         C         D
2019-01-01  1.051134 -2.35719 -0.441416  0.756528
2019-01-02 -1.779281  1.58136  2.070031  0.672192
(2) Viewing the last rows
                   A         B         C         D
2019-01-02 -1.779281  1.581360  2.070031  0.672192
2019-01-03  1.072005 -0.447704  0.783539 -0.962688
2019-01-04  1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289  1.637207  1.798403  0.464263
2019-01-06  1.597604 -0.099617  0.735458  1.495397
                   A         B         C         D
2019-01-04  1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289  1.637207  1.798403  0.464263
2019-01-06  1.597604 -0.099617  0.735458  1.495397
(3) Viewing summary information
df.iloc[0, 3] = np.nan
df
                   A         B         C         D
2019-01-01  1.051134 -2.357190 -0.441416       NaN
2019-01-02 -1.779281  1.581360  2.070031  0.672192
2019-01-03  1.072005 -0.447704  0.783539 -0.962688
2019-01-04  1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289  1.637207  1.798403  0.464263
2019-01-06  1.597604 -0.099617  0.735458  1.495397
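The summary that follows presumably comes from df.info(), which reports the index, the non-null count and dtype of each column, and the memory usage (hedged sketch):

df.info()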
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
2. NumPy universal functions work on Pandas objects too
(1) Vectorized operations
x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
np.exp(x)

     0         1         2          3
0  1.0  2.718282  7.389056  20.085537
y = pd.DataFrame(np.arange(4, 8).reshape(1, 4))
y
(2) Matrix operations
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x
[output omitted: x, 30 rows x 30 columns of random integers 0-9; the pandas display truncates the middle columns]
[output omitted: another 30 x 30 table of single-digit integers; display truncated]
np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y
[output omitted: y, 30 rows x 30 columns of random integers 0-9; display truncated]
[output omitted: the 30 x 30 matrix product; display truncated]
264 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
103 µs ± 8.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
array([[6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7, 7, 2, 5, 4, 1, 7, 5, 1, 4, 0,
9, 5, 8, 0, 9, 2, 6, 3],
[8, 2, 4, 2, 6, 4, 8, 6, 1, 3, 8, 1, 9, 8, 9, 4, 1, 3, 6, 7, 2, 0,
3, 1, 7, 3, 1, 5, 5, 9],
[3, 5, 1, 9, 1, 9, 3, 7, 6, 8, 7, 4, 1, 4, 7, 9, 8, 8, 0, 8, 6, 8,
7, 0, 7, 7, 2, 0, 7, 2],
[2, 0, 4, 9, 6, 9, 8, 6, 8, 7, 1, 0, 6, 6, 7, 4, 2, 7, 5, 2, 0, 2,
4, 2, 0, 4, 9, 6, 6, 8],
[9, 9, 2, 6, 0, 3, 3, 4, 6, 6, 3, 6, 2, 5, 1, 9, 8, 4, 5, 3, 9, 6,
8, 6, 0, 0, 8, 8, 3, 8],
[2, 6, 5, 7, 8, 4, 0, 2, 9, 7, 5, 7, 8, 3, 0, 0, 9, 3, 6, 1, 2, 0,
4, 0, 7, 0, 0, 1, 1, 5],
[6, 4, 0, 0, 2, 1, 4, 9, 5, 6, 3, 6, 7, 0, 5, 7, 4, 3, 1, 5, 5, 0,
8, 5, 2, 3, 3, 2, 9, 2],
[2, 3, 6, 3, 8, 0, 7, 6, 1, 7, 0, 8, 8, 1, 6, 9, 2, 6, 9, 8, 3, 0,
1, 0, 4, 4, 6, 8, 8, 2],
[2, 2, 3, 7, 5, 7, 0, 7, 3, 0, 7, 3, 5, 7, 3, 2, 8, 2, 8, 1, 1, 1,
5, 2, 8, 3, 0, 3, 0, 4],
[3, 7, 7, 6, 2, 0, 0, 2, 5, 6, 5, 5, 5, 2, 5, 7, 1, 4, 0, 0, 4, 2,
3, 2, 0, 0, 4, 5, 2, 8],
[4, 7, 0, 4, 2, 0, 3, 4, 6, 0, 2, 1, 8, 9, 5, 9, 2, 7, 7, 1, 5, 6,
1, 9, 1, 9, 0, 7, 0, 8],
[5, 6, 9, 6, 9, 2, 1, 8, 7, 9, 6, 8, 3, 3, 0, 7, 2, 6, 1, 1, 6, 5,
2, 8, 9, 5, 9, 9, 5, 0],
[3, 9, 5, 5, 4, 0, 7, 4, 4, 6, 3, 5, 3, 2, 6, 7, 3, 1, 9, 2, 0, 7,
2, 9, 6, 9, 4, 9, 4, 6],
[8, 4, 0, 9, 9, 0, 1, 5, 8, 7, 4, 0, 6, 4, 5, 6, 2, 9, 2, 4, 5, 8,
4, 0, 3, 4, 9, 9, 4, 6],
[3, 0, 4, 6, 9, 9, 5, 4, 3, 1, 3, 9, 9, 2, 9, 0, 7, 4, 3, 7, 6, 1,
0, 3, 7, 1, 2, 0, 0, 2],
[4, 2, 0, 0, 7, 9, 1, 2, 1, 2, 6, 0, 9, 7, 9, 9, 9, 1, 2, 8, 6, 3,
9, 4, 1, 7, 3, 8, 4, 8],
[3, 9, 4, 8, 7, 2, 0, 2, 3, 1, 0, 6, 7, 6, 4, 0, 6, 6, 8, 2, 8, 0,
0, 3, 8, 5, 2, 0, 3, 8],
[2, 8, 6, 3, 2, 9, 4, 4, 2, 8, 3, 4, 3, 4, 6, 8, 6, 4, 9, 9, 6, 9,
4, 2, 6, 1, 8, 9, 9, 0],
[5, 6, 7, 9, 8, 1, 9, 1, 4, 4, 5, 2, 7, 0, 5, 3, 0, 6, 8, 3, 3, 5,
2, 5, 6, 9, 9, 2, 6, 2],
[1, 9, 3, 7, 8, 6, 0, 2, 8, 0, 8, 7, 0, 5, 4, 5, 9, 4, 5, 4, 4, 3,
2, 2, 3, 8, 1, 8, 0, 0],
[4, 5, 5, 2, 6, 8, 9, 7, 5, 7, 4, 7, 9, 3, 9, 7, 9, 1, 4, 8, 3, 5,
0, 8, 0, 4, 3, 2, 5, 1],
[2, 4, 8, 1, 9, 7, 1, 4, 6, 7, 0, 5, 0, 1, 0, 4, 9, 8, 5, 0, 0, 1,
8, 2, 0, 4, 6, 5, 0, 4],
[4, 5, 2, 4, 6, 4, 4, 4, 9, 9, 2, 0, 4, 8, 0, 2, 3, 0, 0, 7, 1, 7,
6, 9, 9, 1, 5, 5, 2, 1],
[0, 5, 4, 8, 0, 6, 4, 4, 1, 2, 6, 5, 1, 5, 1, 1, 1, 2, 1, 3, 8, 5,
0, 7, 6, 9, 2, 0, 4, 3],
[9, 7, 0, 9, 0, 3, 7, 4, 1, 5, 4, 1, 2, 8, 6, 6, 5, 7, 3, 7, 3, 7,
8, 2, 2, 1, 9, 2, 2, 4],
[4, 1, 9, 5, 4, 5, 0, 4, 8, 9, 1, 0, 9, 8, 9, 8, 8, 5, 7, 0, 9, 3,
0, 7, 0, 2, 3, 7, 5, 9],
[6, 7, 1, 9, 7, 2, 6, 2, 6, 1, 9, 5, 2, 2, 8, 6, 4, 9, 6, 8, 0, 6,
5, 9, 8, 0, 3, 8, 3, 9],
[2, 8, 1, 3, 5, 1, 7, 7, 0, 2, 9, 8, 4, 5, 3, 9, 1, 7, 5, 4, 8, 0,
4, 5, 4, 5, 5, 6, 3, 7],
[6, 8, 6, 2, 2, 7, 4, 3, 7, 5, 1, 3, 3, 5, 5, 0, 7, 5, 2, 8, 1, 7,
9, 2, 4, 5, 9, 5, 3, 2],
[3, 0, 3, 0, 0, 9, 5, 4, 3, 2, 0, 5, 1, 7, 9, 4, 6, 9, 1, 7, 1, 3,
0, 4, 8, 0, 8, 7, 5, 6]])
array([[5, 8, 9, 5, 0, 0, 1, 7, 6, 9, 2, 4, 5, 2, 4, 2, 4, 7, 7, 9, 1, 7,
0, 6, 9, 9, 7, 6, 9, 1],
[0, 1, 8, 8, 3, 9, 8, 7, 3, 6, 5, 1, 9, 3, 4, 8, 1, 4, 0, 3, 9, 2,
0, 4, 9, 2, 7, 7, 9, 8],
[6, 9, 3, 7, 7, 4, 5, 9, 3, 6, 8, 0, 2, 7, 7, 9, 7, 3, 0, 8, 7, 7,
1, 1, 3, 0, 8, 6, 4, 5],
[6, 2, 5, 7, 8, 4, 4, 7, 7, 4, 9, 0, 2, 0, 7, 1, 7, 9, 8, 4, 0, 1,
9, 8, 2, 3, 1, 2, 7, 2],
[6, 0, 9, 2, 6, 6, 2, 7, 7, 0, 6, 5, 1, 4, 6, 0, 6, 5, 1, 2, 1, 5,
4, 0, 7, 8, 9, 5, 7, 0],
[9, 3, 9, 1, 4, 4, 6, 8, 8, 9, 2, 7, 5, 5, 4, 5, 8, 5, 8, 1, 1, 8,
7, 0, 3, 4, 2, 0, 3, 5],
[1, 2, 4, 3, 0, 6, 0, 7, 2, 8, 3, 0, 8, 4, 2, 9, 0, 3, 8, 1, 4, 3,
3, 6, 7, 3, 5, 3, 2, 4],
[4, 0, 3, 3, 8, 3, 5, 6, 7, 5, 1, 7, 0, 2, 8, 2, 1, 4, 0, 4, 1, 7,
3, 1, 6, 6, 9, 6, 9, 6],
[0, 0, 2, 9, 6, 0, 6, 7, 0, 3, 9, 0, 3, 4, 7, 5, 3, 8, 8, 0, 6, 7,
9, 5, 4, 9, 5, 2, 5, 6],
[6, 8, 7, 7, 7, 2, 6, 0, 5, 2, 1, 8, 5, 9, 4, 9, 1, 2, 0, 4, 7, 0,
6, 2, 4, 3, 6, 7, 6, 3],
[0, 6, 4, 7, 6, 2, 9, 5, 9, 9, 9, 8, 6, 4, 2, 9, 4, 0, 0, 3, 4, 9,
3, 9, 1, 2, 5, 4, 0, 8],
[2, 3, 9, 9, 4, 4, 8, 2, 1, 6, 3, 8, 9, 7, 0, 5, 2, 2, 8, 5, 0, 5,
9, 8, 6, 6, 0, 4, 7, 3],
[0, 1, 6, 0, 6, 1, 6, 4, 2, 5, 4, 6, 2, 9, 2, 7, 5, 0, 7, 8, 8, 8,
0, 7, 2, 0, 7, 1, 1, 9],
[5, 1, 5, 9, 6, 4, 9, 8, 7, 5, 1, 8, 0, 5, 3, 9, 0, 4, 8, 6, 2, 4,
3, 2, 0, 0, 4, 2, 5, 0],
[0, 3, 8, 5, 3, 1, 4, 7, 3, 2, 2, 2, 6, 6, 0, 1, 5, 6, 5, 8, 8, 5,
5, 7, 5, 9, 1, 3, 9, 3],
[3, 3, 6, 1, 3, 0, 5, 0, 5, 2, 7, 6, 4, 0, 2, 4, 8, 7, 6, 7, 7, 1,
7, 7, 3, 8, 3, 0, 6, 3],
[0, 6, 5, 9, 6, 4, 6, 6, 2, 2, 4, 1, 2, 3, 9, 3, 6, 7, 0, 3, 3, 6,
8, 6, 5, 1, 3, 2, 6, 3],
[6, 7, 2, 8, 0, 1, 8, 6, 0, 0, 1, 2, 7, 7, 4, 4, 0, 1, 0, 8, 5, 6,
2, 5, 4, 3, 0, 6, 2, 1],
[9, 4, 4, 0, 9, 8, 7, 7, 6, 1, 7, 2, 4, 5, 6, 7, 2, 0, 5, 2, 7, 9,
9, 7, 1, 1, 4, 6, 5, 6],
[4, 1, 1, 5, 1, 2, 6, 2, 3, 3, 2, 3, 0, 0, 1, 5, 0, 5, 8, 8, 0, 0,
0, 9, 8, 5, 9, 3, 4, 0],
[9, 8, 6, 3, 9, 9, 0, 8, 1, 6, 6, 1, 3, 7, 3, 2, 3, 0, 2, 8, 2, 9,
0, 1, 3, 9, 4, 8, 8, 8],
[2, 8, 6, 4, 9, 0, 5, 5, 6, 1, 7, 6, 5, 7, 1, 9, 7, 5, 6, 7, 6, 7,
5, 6, 8, 7, 4, 2, 4, 0],
[0, 3, 5, 9, 0, 3, 6, 5, 1, 1, 8, 4, 7, 0, 1, 7, 1, 5, 4, 5, 6, 2,
5, 3, 9, 3, 9, 5, 1, 9],
[7, 7, 0, 8, 6, 1, 2, 0, 4, 4, 6, 1, 0, 9, 4, 9, 3, 0, 5, 1, 1, 9,
6, 0, 2, 8, 3, 7, 2, 5],
[6, 0, 4, 2, 3, 1, 0, 5, 7, 0, 2, 2, 0, 9, 5, 1, 1, 6, 3, 2, 1, 1,
2, 7, 5, 2, 9, 4, 7, 3],
[5, 0, 2, 1, 4, 9, 4, 6, 9, 3, 8, 8, 7, 1, 8, 7, 8, 9, 2, 2, 5, 5,
3, 5, 9, 2, 7, 4, 1, 6],
[9, 8, 1, 8, 1, 6, 2, 6, 1, 8, 3, 7, 0, 7, 0, 0, 7, 3, 9, 5, 2, 5,
1, 2, 5, 3, 3, 6, 1, 8],
[1, 8, 6, 4, 6, 9, 5, 4, 7, 2, 0, 2, 0, 5, 9, 4, 1, 4, 5, 2, 9, 3,
1, 5, 1, 1, 7, 1, 2, 6],
[0, 7, 7, 4, 3, 2, 7, 8, 5, 2, 4, 9, 2, 2, 3, 5, 9, 6, 4, 9, 0, 2,
8, 3, 7, 3, 9, 2, 3, 8],
[8, 0, 2, 6, 8, 3, 6, 4, 9, 7, 6, 3, 2, 9, 1, 5, 5, 6, 9, 4, 6, 7,
8, 5, 7, 2, 5, 3, 4, 5]])
24.7 µs ± 3.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
22 µs ± 344 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x.values, y.values)
43.4 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# x1 and y1 are assumed to be the NumPy arrays shown above (x.values and y.values)
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)
def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k] * y[k][j]
            row.append(sum_row)
        res.append(row)
    return res
5.86 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In general, pure numerical computation runs faster in NumPy.
NumPy is geared toward computation, while Pandas is geared toward data handling.
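The timings above presumably compare the DataFrame product, the raw NumPy product, and the pure-Python f() defined above. Outside IPython's %timeit, a comparable check could look like this (a sketch under those assumptions):

import timeit

# average time per call over 1000 repetitions on the 30x30 frames defined above
t_pandas = timeit.timeit(lambda: x.dot(y), number=1000) / 1000
t_numpy = timeit.timeit(lambda: np.dot(x.values, y.values), number=1000) / 1000
print(t_pandas, t_numpy)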
(3) Broadcasting
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
   A  B  C
0  6  3  7
1  4  6  9
2  2  6  7
A 6
B 3
C 7
Name: 0, dtype: int32
x / x.iloc[0]

          A    B         C
0  1.000000  1.0  1.000000
1  0.666667  2.0  1.285714
2  0.333333  2.0  1.000000
0 6
1 4
2 2
Name: A, dtype: int32
x.div(x["A"], axis=0)

     A    B         C
0  1.0  0.5  1.166667
1  1.0  1.5  2.250000
2  1.0  3.0  3.500000
x.div(x.iloc[0], axis=1)
          A    B         C
0  1.000000  1.0  1.000000
1  0.666667  2.0  1.285714
2  0.333333  2.0  1.000000
3. New behaviors
(1) Index alignment
A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
   A  B  C
0  7  5  1
1  4  0  9
2  5  8  0
Pandas automatically aligns the indices of the two objects; positions missing from either one are represented by np.nan.
      A     B   C
0  10.0  12.0 NaN
1   6.0   1.0 NaN
2   NaN   NaN NaN
      A     B    C
0  10.0  12.0  1.0
1   6.0   1.0  9.0
2   5.0   8.0  0.0
      A     B   C
0  21.0  35.0 NaN
1   8.0   0.0 NaN
2   NaN   NaN NaN
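The three results above presumably come from operations like these (a sketch; fill_value=0 is the usual way to obtain the second, fully filled table):

A + B                    # indices aligned, missing positions become NaN
A.add(B, fill_value=0)   # treat missing positions as 0 before adding
A * B                    # the same alignment rule applies to multiplication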
(2) Statistics
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
array([0, 1, 2])
from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1
    A
0   2
1   2
2   2
3   1
4   2
5   1
6   1
7   2
8   1
9   2
10  2
11  0
12  2
13  0
14  2
15  2
16  0
17  0
18  2
19  1
array([0, 1, 2])
2 11
1 5
0 4
Name: A, dtype: int64
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981}
population = pd.Series(population_dict)

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468}
GDP = pd.Series(GDP_dict)

city_info = pd.DataFrame({"population": population, "GDP": GDP})
city_info
          population    GDP
BeiJing         2154  30320
ShangHai        2424  32680
ShenZhen        1303  24222
HangZhou         981  13468
city_info["per_GDP"] = city_info["GDP"] / city_info["population"]
city_info
          population    GDP    per_GDP
BeiJing         2154  30320  14.076137
ShangHai        2424  32680  13.481848
ShenZhen        1303  24222  18.589409
HangZhou         981  13468  13.728848
Sorting in ascending order
city_info.sort_values(by="per_GDP")
          population    GDP    per_GDP
ShangHai        2424  32680  13.481848
HangZhou         981  13468  13.728848
BeiJing         2154  30320  14.076137
ShenZhen        1303  24222  18.589409
Sorting in descending order
city_info.sort_values(by="per_GDP", ascending=False)
          population    GDP    per_GDP
ShenZhen        1303  24222  18.589409
BeiJing         2154  30320  14.076137
HangZhou         981  13468  13.728848
ShangHai        2424  32680  13.481848
Sorting along an axis
data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
    C   B   A  D
2   3  13  17  8
1   1  19  14  6
0  11   7  14  2
Sorting rows by index
    C   B   A  D
0  11   7  14  2
1   1  19  14  6
2   3  13  17  8
Sorting columns by index
    A   B   C  D
2  17  13   3  8
1  14  19   1  6
0  14   7  11  2
data.sort_index(axis=1, ascending=False)
   D   C   B   A
2  8   3  13  17
1  6   1  19  14
0  2  11   7  14
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)), columns=list("ABCD"))
df
           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588
Count of non-null values
A 6
B 6
C 6
D 6
dtype: int64
Sum
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
0 7.947086
1 28.289408
2 -1.427283
3 -4.893571
4 20.476806
5 3.043402
dtype: float64
Maximum and minimum
A -3.386712
B -4.333177
C -4.032613
D -0.152567
dtype: float64
0 6.367969
1 13.113252
2 5.094802
3 2.537716
4 5.577329
5 6.398588
dtype: float64
           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588
A 1
B 1
C 4
D 5
dtype: int64
Mean
A 2.378704
B 1.588916
C 0.655231
D 4.283124
dtype: float64
Variance
A 34.980702
B 19.110656
C 18.948144
D 6.726776
dtype: float64
Standard deviation
A 5.914449
B 4.371574
C 4.352947
D 2.593603
dtype: float64
Median
A 1.126767
B 1.531743
C 0.176172
D 5.273518
dtype: float64
Mode
data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
   A  B
0  4  2
1  3  2
2  2  0
3  2  4
4  2  0
5  4  1
6  2  0
7  1  1
8  3  4
9  2  0
75th percentile
A 3.539202
B 5.052594
C 4.539740
D 6.157738
Name: 0.75, dtype: float64
All of the above at once
               A         B         C         D
count   6.000000  6.000000  6.000000  6.000000
mean    2.378704  1.588916  0.655231  4.283124
std     5.914449  4.371574  4.352947  2.593603
min    -3.386712 -4.333177 -4.032613 -0.152567
25%    -1.256706 -1.265251 -2.925910  3.158284
50%     1.126767  1.531743  0.176172  5.273518
75%     3.539202  5.052594  4.539740  6.157738
max    13.113252  6.774559  5.577329  6.398588
data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
   A  B  C  D
0  a  a  c  d
1  c  a  c  b
2  a  a  d  c
        A  B  C  D
count   3  3  3  3
unique  2  1  2  3
top     a  a  c  c
freq    2  3  2  1
Correlation coefficients and covariance
          A         B         C         D
A  1.000000  0.831063  0.331060  0.510821
B  0.831063  1.000000  0.179244  0.719112
C  0.331060  0.179244  1.000000 -0.450365
D  0.510821  0.719112 -0.450365  1.000000
A 1.000000
B 0.831063
C 0.331060
D 0.510821
dtype: float64
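The statistics shown in this subsection presumably come from calls along these lines (a sketch using the df and data frames defined above; the exact cells are not shown in the text):

df.count()          # non-null values per column
df.sum()            # column sums; df.sum(axis=1) for row sums
df.min()            # column minima; df.max(axis=1) for row maxima
df.idxmax()         # index label of each column's maximum
df.mean(), df.var(), df.std(), df.median()
df.quantile(0.75)   # 75th percentile
df.describe()       # all of the above in one table
df.corr()           # correlation matrix; df.corrwith(df["A"]) against one column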
Custom summaries
apply(method) applies the given method to each column by default (pass axis=1 to apply it to each row).
           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588
           A          B         C          D
0   1.082198   3.557396 -3.060476   6.367969
1  14.195450  10.331955 -0.185923  11.895013
2  12.159109   5.998778  4.908878  11.742447
3   8.772397   4.476413  2.386669  14.280162
4  13.100888  10.027406  7.963999  19.300153
5  14.272224   9.533497  3.931385  25.698741
df.apply(np.cumsum, axis=1)
           A          B          C          D
0   1.082198   4.639594   1.579117   7.947086
1  13.113252  19.887811  22.762364  28.289408
2  -2.036341  -6.369518  -1.274717  -1.427283
3  -3.386712  -4.909077  -7.431287  -4.893571
4   4.328491   9.879485  15.456814  20.476806
5   1.171336   0.677427  -3.355186   3.043402
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.apply(lambda x: x.max() - x.min())
A 16.499965
B 11.107736
C 9.609942
D 6.551155
dtype: float64
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()],
                     index=["Count", "mean", "max", "idxmin", "std"])

df.apply(my_describe)
                A         B         C         D
Count    6.000000  6.000000  6.000000  6.000000
mean     2.378704  1.588916  0.655231  4.283124
max     13.113252  6.774559  5.577329  6.398588
idxmin   3.000000  2.000000  5.000000  2.000000
std      5.914449  4.371574  4.352947  2.593603
12.4 Handling Missing Values
1. Detecting missing values
import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
data
     A    B     C
0    1  NaN     2
1  NaN    3     4
2    5    6  None
Note: with None, strings, and so on mixed in, every column's dtype becomes object, which consumes more resources than int or float.
A object
B object
C object
dtype: object
       A      B      C
0  False   True  False
1   True  False  False
2  False  False   True
       A      B      C
0   True  False   True
1  False   True   True
2   True   True  False
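The two boolean tables above presumably come from:

data.isnull()    # True where a value is missing
data.notnull()   # True where a value is present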
2. Dropping missing values
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]), columns=["A", "B", "C", "D"])
data
      A     B     C     D
0   1.0   NaN   2.0   3.0
1   NaN   4.0   5.0   6.0
2   7.0   8.0   NaN   9.0
3  10.0  11.0  12.0  13.0
Note: np.nan is a special floating-point value.
A float64
B float64
C float64
D float64
dtype: object
(1) Dropping whole rows
      A     B     C     D
3  10.0  11.0  12.0  13.0
(2) Dropping whole columns
data.dropna(axis="columns")
      D
0   3.0
1   6.0
2   9.0
3  13.0
data["D"] = np.nan
data

      A     B     C   D
0   1.0   NaN   2.0 NaN
1   NaN   4.0   5.0 NaN
2   7.0   8.0   NaN NaN
3  10.0  11.0  12.0 NaN
data.dropna(axis="columns", how="all")
      A     B     C
0   1.0   NaN   2.0
1   NaN   4.0   5.0
2   7.0   8.0   NaN
3  10.0  11.0  12.0
data.dropna(axis="columns", how="any")
data.loc[3] = np.nan
data
     A    B    C   D
0  1.0  NaN  2.0 NaN
1  NaN  4.0  5.0 NaN
2  7.0  8.0  NaN NaN
3  NaN  NaN  NaN NaN
data.dropna(how="all")

     A    B    C   D
0  1.0  NaN  2.0 NaN
1  NaN  4.0  5.0 NaN
2  7.0  8.0  NaN NaN
3. Filling missing values
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]), columns=["A", "B", "C", "D"])
data
      A     B     C     D
0   1.0   NaN   2.0   3.0
1   NaN   4.0   5.0   6.0
2   7.0   8.0   NaN   9.0
3  10.0  11.0  12.0  13.0
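The filled tables that follow presumably come from calls along these lines (a sketch; the third strategy, filling with the overall mean, appears further below with its code shown):

data.fillna(value=5)            # replace every NaN with the constant 5
data.fillna(value=data.mean())  # replace NaN with each column's mean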
      A     B     C     D
0   1.0   5.0   2.0   3.0
1   5.0   4.0   5.0   6.0
2   7.0   8.0   5.0   9.0
3  10.0  11.0  12.0  13.0
A 6.000000
B 7.666667
C 6.333333
D 7.750000
dtype: float64
      A          B          C     D
0   1.0   7.666667   2.000000   3.0
1   6.0   4.000000   5.000000   6.0
2   7.0   8.000000   6.333333   9.0
3  10.0  11.000000  12.000000  13.0
fill = data.stack().mean()
fill
7.0
      A     B     C     D
0   1.0   7.0   2.0   3.0
1   7.0   4.0   5.0   6.0
2   7.0   8.0   7.0   9.0
3  10.0  11.0  12.0  13.0
12.5 Merging Data
import pandas as pd
import numpy as np

def make_df(cols, ind):
    "A simple DataFrame"
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

make_df("ABC", range(3))
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
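The vertically concatenated frame shown next presumably comes from:

pd.concat([df_1, df_2])   # stack vertically; the original row labels are kept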
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
A B
0 A0 B0
1 A1 B1
C D
0 C0 D0
1 C1 D1
pd.concat([df_3, df_4], axis=1)
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
Overlapping row indices
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
A B
1 A1 B1
2 A2 B2
A B
1 A1 B1
2 A2 B2
    A   B
1  A1  B1
2  A2  B2
1  A1  B1
2  A2  B2
pd.concat([df_5, df_6], ignore_index=True)
    A   B
0  A1  B1
1  A2  B2
2  A1  B1
3  A2  B2
Overlapping column labels
df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
1 B1 C1 D1
2 B2 C2 D2
pd.concat([df_7, df_8], axis=1)
    A   B   C   B   C   D
1  A1  B1  C1  B1  C1  D1
2  A2  B2  C2  B2  C2  D2
pd.concat([df_7, df_8], axis=1, ignore_index=True)
    0   1   2   3   4   5
1  A1  B1  C1  B1  C1  D1
2  A2  B2  C2  B2  C2  D2
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
B C
1 B1 C1
2 B2 C2
    A   B   C
0  A1  B1  C1
1  A2  B2  C2
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
C B
2 C2 B2
1 C1 B1
    A   B   C
0  A1  B1  C1
1  A2  B2  C2
Example: merging city information
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
       city   pop
0   BeiJing  2154
1  HangZhou   981
2  ShenZhen  1303
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
       city    GDP
0   BeiJing  30320
1  ShangHai  32680
2  HangZhou  13468
city_info = pd.merge(population, GDP)
city_info
       city   pop    GDP
0   BeiJing  2154  30320
1  HangZhou   981  13468
city_info = pd.merge(population, GDP, how="outer")
city_info
       city     pop      GDP
0   BeiJing  2154.0  30320.0
1  HangZhou   981.0  13468.0
2  ShenZhen  1303.0      NaN
3  ShangHai     NaN  32680.0
12.6 Grouping and Pivot Tables
df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                   "data1": range(6),
                   "data2": np.random.randint(0, 10, size=6)})
df
  key  data1  data2
0   A      0      3
1   B      1      3
2   C      2      4
3   C      3      6
4   B      4      6
5   A      5      3
(1) Grouping
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018DD1697308>
     data1  data2
key
A        5      6
B        5      9
C        5     10
df.groupby("key").mean()
     data1  data2
key
A      2.5    3.0
B      2.5    4.5
C      2.5    5.0
for i in df.groupby("key"):
    print(str(i))
('A', key data1 data2
0 A 0 3
5 A 5 3)
('B', key data1 data2
1 B 1 3
4 B 4 6)
('C', key data1 data2
2 C 2 4
3 C 3 6)
df.groupby("key")["data2"].sum()
key
A 6
B 9
C 10
Name: data2, dtype: int32
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
df.groupby("key")["data1"].describe()
     count  mean       std  min   25%  50%   75%  max
key
A      2.0   2.5  3.535534  0.0  1.25  2.5  3.75  5.0
B      2.0   2.5  2.121320  1.0  1.75  2.5  3.25  4.0
C      2.0   2.5  0.707107  2.0  2.25  2.5  2.75  3.0
df.groupby("key").aggregate(["min", "median", "max"])
    data1             data2
      min median max    min median max
key
A       0    2.5   5      3    3.0   3
B       1    2.5   4      3    4.5   6
C       2    2.5   3      4    5.0   6
def filter_func(x):
    return x["data2"].std() > 3

df.groupby("key")["data2"].std()
key
A 0.000000
B 2.121320
C 1.414214
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
  key  data1  data2
0   A      0      3
1   B      1      3
2   C      2      4
3   C      3      6
4   B      4      6
5   A      5      3
df.groupby("key").transform(lambda x: x - x.mean())
   data1  data2
0   -2.5    0.0
1   -1.5   -1.5
2   -0.5   -1.0
3    0.5    1.0
4    1.5    1.5
5    2.5    0.0
  key  data1  data2
0   A      0      3
1   B      1      3
2   C      2      4
3   C      3      6
4   B      4      6
5   A      5      3
df.groupby("key").apply(lambda x: x - x.mean())
   data1  data2
0   -2.5    0.0
1   -1.5   -1.5
2   -0.5   -1.0
3    0.5    1.0
4    1.5    1.5
5    2.5    0.0
  key  data1  data2
0   A      0      3
1   B      1      3
2   C      2      4
3   C      3      6
4   B      4      6
5   A      5      3
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x

df.groupby("key").apply(norm_by_data2)
  key     data1  data2
0   A  0.000000      3
1   B  0.111111      3
2   C  0.200000      4
3   C  0.300000      6
4   B  0.444444      6
5   A  0.833333      3
L = [0, 1, 0, 1, 2, 0]
df
  key  data1  data2
0   A      0      3
1   B      1      3
2   C      2      4
3   C      3      6
4   B      4      6
5   A      5      3
df.groupby(L).sum()

   data1  data2
0      7     10
1      4      9
2      4      6
df2 = df.set_index("key")
df2
     data1  data2
key
A        0      3
B        1      3
C        2      4
C        3      6
B        4      6
A        5      3
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
          data1  data2
constant     10     19
first         5      6
df2.groupby(str.lower).mean()
   data1  data2
a    2.5    3.0
b    2.5    4.5
c    2.5    5.0
df2.groupby([str.lower, mapping]).mean()
            data1  data2
a first       2.5    3.0
b constant    2.5    4.5
c constant    2.5    5.0
Example 1: processing the planet observation data
import seaborn as sns

planets = sns.load_dataset("planets")
(1035, 6)
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
            number  orbital_period        mass     distance         year
count  1035.000000      992.000000  513.000000   808.000000  1035.000000
mean      1.785507     2002.917596    2.638161   264.069282  2009.070531
std       1.240976    26014.728304    3.818617   733.116493     3.972567
min       1.000000        0.090706    0.003600     1.350000  1989.000000
25%       1.000000        5.442540    0.229000    32.560000  2007.000000
50%       1.000000       39.979500    1.260000    55.250000  2010.000000
75%       2.000000      526.005000    3.040000   178.500000  2012.000000
max       7.000000   730000.000000   25.000000  8500.000000  2014.000000
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
decade = 10 * (planets["year"] // 10)
decade.head()
0 2000
1 2000
2 2010
3 2000
4 2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0 2000s
1 2000s
2 2010s
3 2000s
4 2000s
Name: decade, dtype: object
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
planets.groupby(["method", decade]).sum()
                                      number  orbital_period       mass   distance    year
method                        decade
Astrometry                    2010s        2    1.262360e+03    0.00000      35.75    4023
Eclipse Timing Variations     2000s        5    1.930800e+04    6.05000     261.44    6025
                              2010s       10    2.345680e+04    4.20000    1000.00   12065
Imaging                       2000s       29    1.350935e+06    0.00000     956.83   40139
                              2010s       21    6.803750e+04    0.00000    1210.08   36208
Microlensing                  2000s       12    1.732500e+04    0.00000       0.00   20070
                              2010s       15    4.750000e+03    0.00000   41440.00   26155
Orbital Brightness Modulation 2010s        5    2.127920e+00    0.00000    2360.00    6035
Pulsar Timing                 1990s        9    1.900153e+02    0.00000       0.00    5978
                              2000s        1    3.652500e+04    0.00000       0.00    2003
                              2010s        1    9.070629e-02    0.00000    1200.00    2011
Pulsation Timing Variations   2000s        1    1.170000e+03    0.00000       0.00    2007
Radial Velocity               1980s        1    8.388800e+01   11.68000      40.57    1989
                              1990s       52    1.091561e+04   68.17820     723.71   55943
                              2000s      475    2.633526e+05  945.31928   15201.16  619775
                              2010s      424    1.809630e+05  316.47890   11382.67  432451
Transit                       2000s       64    2.897102e+02    0.00000   31823.31  124462
                              2010s      712    8.087813e+03    1.47000  102419.46  673999
Transit Timing Variations     2010s        9    2.393505e+02    0.00000    3313.00    8050
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
                              number
decade                         1980s 1990s  2000s  2010s
method
Astrometry                       0.0   0.0    0.0    2.0
Eclipse Timing Variations        0.0   0.0    5.0   10.0
Imaging                          0.0   0.0   29.0   21.0
Microlensing                     0.0   0.0   12.0   15.0
Orbital Brightness Modulation    0.0   0.0    0.0    5.0
Pulsar Timing                    0.0   9.0    1.0    1.0
Pulsation Timing Variations      0.0   0.0    1.0    0.0
Radial Velocity                  1.0  52.0  475.0  424.0
Transit                          0.0   0.0   64.0  712.0
Transit Timing Variations        0.0   0.0    0.0    9.0
(2) Pivot tables
Example 2: analyzing the Titanic passenger data
import seaborn as sns

titanic = sns.load_dataset("titanic")
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
T = titanic[titanic.age.notnull()].copy()

T.age.apply(lambda x: 60 if x >= 60 else x)
T.age.value_counts()
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
885 39.0
886 27.0
887 19.0
889 26.0
890 32.0
Name: age, Length: 714, dtype: float64
24.00 30
22.00 27
18.00 26
19.00 25
30.00 25
..
55.50 1
70.50 1
66.00 1
23.50 1
0.42 1
Name: age, Length: 88, dtype: int64
Age = 10 * (T["age"] // 10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
0 20
1 30
2 20
3 30
4 30
Name: age, dtype: int32
20 220
30 167
10 102
40 89
0 62
50 48
60 19
70 6
80 1
Name: age, dtype: int64
0 20s
1 30s
2 20s
3 30s
4 30s
...
885 30s
886 20s
887 10s
889 20s
890 30s
Name: age, Length: 714, dtype: object
T.groupby(["sex", Age])["survived"].mean().unstack()
age           0         10        20        30        40        50        60   70   80
sex
female  0.633333  0.755556  0.722222  0.833333  0.687500  0.888889  1.000000  NaN  NaN
male    0.593750  0.122807  0.168919  0.214953  0.210526  0.133333  0.133333  0.0  1.0
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
age           0         10        20        30        40        50        60   70   80
sex
female  0.633333  0.755556  0.722222  0.833333  0.687500  0.888889  1.000000  NaN  NaN
male    0.593750  0.122807  0.168919  0.214953  0.210526  0.133333  0.133333  0.0  1.0
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
titanic.groupby("sex")[["survived"]].mean()
        survived
sex
female  0.742038
male    0.188908
titanic.groupby("sex")["survived"].mean()
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
titanic.pivot_table("survived", index="sex", columns="class")
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True)
class      First    Second     Third       All
sex
female  0.968085  0.921053  0.500000  0.742038
male    0.368852  0.157407  0.135447  0.188908
All     0.629630  0.472826  0.242363  0.383838
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})
              fare                       survived
class        First     Second      Third    First Second Third
sex
female  106.125798  21.970121  16.118810       91     70    72
male     67.226127  19.741782  12.661633       45     17    47
12.7 Other Topics
(1) Vectorized string operations
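The original gives no example here; a minimal sketch of what Pandas string vectorization looks like (the sample values are illustrative, not from the original):

names = pd.Series(["peter", "Paul", None, "MARY"])
names.str.capitalize()     # vectorized string method; the missing value stays NaN
names.str.contains("a")    # boolean mask, also NaN-safe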
(2) Working with time series
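Likewise, a minimal time-series sketch (illustrative, not from the original):

index = pd.date_range("2019-01-01", periods=6, freq="D")
ts = pd.Series(range(6), index=index)
ts.resample("2D").sum()    # aggregate the daily values into two-day bins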
(3) Multi-level indexing for multi-dimensional data
base_data = np.array([[1771, 11115],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]])
data = pd.DataFrame(base_data,
                    index=[["BeiJing", "BeiJing", "ShangHai", "ShangHai",
                            "ShenZhen", "ShenZhen", "HangZhou", "HangZhou"],
                           [2008, 2018]*4],
                    columns=["population", "GDP"])
data
               population    GDP
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
data.index.names = ["city", "year"]
data
               population    GDP
city     year
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
city year
BeiJing 2008 11115
2018 30320
ShangHai 2008 14070
2018 32680
ShenZhen 2008 7806
2018 24222
HangZhou 2008 4789
2018 13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008 14070
2018 32680
Name: GDP, dtype: int32
data.loc["ShangHai", 2018]["GDP"]
32680
(4) High-performance Pandas: eval()
df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000, 100))) for i in range(4))

%timeit (df1 + df2) / (df3 + df4)
19.1 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.eval("(df1+df2)/(df3+df4)")
14 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
          A         B         C
0  0.444109  0.403663  0.865813
1  0.294881  0.458937  0.235739
2  0.312199  0.256650  0.393556
3  0.080160  0.176736  0.385579
4  0.830258  0.352358  0.186917
res_1 = pd.eval("(df.A + df.B) / (df.C - 1)")
res_2 = df.eval("(A + B) / (C - 1)")
np.allclose(res_1, res_2)
True
1 2 df["D" ] = pd.eval ("(df.A+df.B)/(df.C-1)" ) df.head()
          A         B         C         D
0  0.444109  0.403663  0.865813 -6.317854
1  0.294881  0.458937  0.235739 -0.986336
2  0.312199  0.256650  0.393556 -0.938006
3  0.080160  0.176736  0.385579 -0.418112
4  0.830258  0.352358  0.186917 -1.454484
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
          A         B         C         D
0  0.444109  0.403663  0.865813 -6.317854
1  0.294881  0.458937  0.235739 -0.986336
2  0.312199  0.256650  0.393556 -0.938006
3  0.080160  0.176736  0.385579 -0.418112
4  0.830258  0.352358  0.186917 -1.454484
column_mean = df.mean(axis=1)
res = df.eval("A + @column_mean")
res.head()
0 -0.706958
1 0.295687
2 0.318299
3 0.136251
4 0.809020
dtype: float64
(5) High-performance Pandas: query()
          A         B         C         D
0  0.444109  0.403663  0.865813 -6.317854
1  0.294881  0.458937  0.235739 -0.986336
2  0.312199  0.256650  0.393556 -0.938006
3  0.080160  0.176736  0.385579 -0.418112
4  0.830258  0.352358  0.186917 -1.454484
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.97 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
3.35 ms ± 440 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
           A         B         C         D
5   0.125739  0.557392  0.556752 -1.541193
6   0.264407  0.738427  0.669857 -3.037572
11  0.427045  0.719621  0.732786 -4.291193
12  0.050538  0.864942  0.883998 -7.891950
19  0.310231  0.558427  0.627987 -2.335022
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True
(6) When to use eval() and query()
For small arrays, the ordinary methods are actually faster.
32000
8000000
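The two numbers above presumably report the in-memory size of the small frame and of one of the large frames, which is the usual yardstick for deciding whether eval()/query() are worth using (a hedged sketch):

df.values.nbytes    # 1000 rows x 4 float64 columns  -> 32000 bytes
df1.values.nbytes   # 10000 rows x 100 float64 columns -> 8000000 bytes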