The Python Pandas Library


Introduction

NumPy excels at vectorized numerical computation.

But for more flexible and complex data tasks, such as attaching labels to data, handling missing values, grouping, and building pivot tables, NumPy falls short.

Pandas, built on top of NumPy, provides high-level data structures and tools that make data analysis faster and easier.

12.1 Creating Objects

12.1.1 The Pandas Series Object

A Series is a one-dimensional array of labeled data.

Creating a Series object

General form: pd.Series(data, index=index, dtype=dtype)

data: the data; can be a list, a dict, or a NumPy array

index: the index; an optional argument

dtype: the data type; an optional argument

1. Creating from a list

  • If index is omitted, it defaults to an integer sequence
import pandas as pd

data = pd.Series([1.5, 3, 4.5, 6])
data
0    1.5
1    3.0
2    4.5
3    6.0
dtype: float64
  • Specifying an index
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
a    1.5
b    3.0
c    4.5
d    6.0
dtype: float64
  • Specifying the data type

If dtype is omitted, it is inferred from the data passed in.
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])    
data
a    1
b    2
c    3
d    4
dtype: int64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64

Note: the data can contain a mix of types

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
a    1
b    2
c    3
d    4
dtype: object
1
data["a"]
1
1
data["c"]
'3'

The data type can be coerced explicitly

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64
1
data["c"]
3.0
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data
ValueError: could not convert string to float: 'a'

2. Creating from a one-dimensional NumPy array

import numpy as np

x = np.arange(5)
pd.Series(x)
0    0
1    1
2    2
3    3
4    4
dtype: int32

3. Creating from a dict

  • By default, the dict keys become the index and the dict values become the data
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
population
BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
dtype: int64
  • When creating from a dict with an explicit index, the keys are matched against that index; entries not found get the value NaN
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])    
population
BeiJing     2154.0
HangZhou     981.0
c              NaN
d              NaN
dtype: float64

4. Creating from a scalar

pd.Series(5, index=[100, 200, 300])
100    5
200    5
300    5
dtype: int64

12.1.2 The Pandas DataFrame Object

A DataFrame is a two-dimensional array of labeled data.

Creating a DataFrame object

General form: pd.DataFrame(data, index=index, columns=columns)

data: the data; can be a list, a dict, or a NumPy array

index: the row index; an optional argument

columns: the column labels; an optional argument

1. Creating from a Series object

population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }

population = pd.Series(population_dict)
pd.DataFrame(population)
0
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
1
pd.DataFrame(population, columns=["population"])
population
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981

2. Creating from a dict of Series objects

GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }

GDP = pd.Series(GDP_dict)
GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64
pd.DataFrame({"population": population,
"GDP": GDP})
population GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468

Note: values too short to fill a column (such as a scalar) are automatically broadcast

pd.DataFrame({"population": population,
"GDP": GDP,
"country": "China"})
population GDP country
BeiJing 2154 30320 China
ShangHai 2424 32680 China
ShenZhen 1303 24222 China
HangZhou 981 13468 China

3. Creating from a list of dicts

  • The list positions become the index and the dict keys become the columns
import numpy as np
import pandas as pd

data = [{"a": i, "b": 2*i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
data = pd.DataFrame(data)
data
a b
0 0 0
1 1 2
2 2 4
data1 = data["a"].copy()
data1
0    0
1    1
2    2
Name: a, dtype: int64
data1[0] = 10
data1
0    10
1     1
2     2
Name: a, dtype: int64
1
data
a b
0 0 0
1 1 2
2 2 4
  • Keys missing from a row default to NaN
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
1
pd.DataFrame(data)
a b c
0 1.0 1 NaN
1 NaN 3 4.0

4. Creating from a two-dimensional NumPy array

data = np.random.randint(10, size=(3, 2))
data
array([[3, 0],
       [9, 6],
       [7, 7]])
1
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])
foo bar
a 3 0
b 9 6
c 7 7

12.2 DataFrame Properties

1. Attributes

data = pd.DataFrame({"pop": population, "GDP": GDP})
data
pop GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468

(1) df.values returns the data as a NumPy array

data.values
array([[ 2154, 30320],
       [ 2424, 32680],
       [ 1303, 24222],
       [  981, 13468]], dtype=int64)

(2) df.index returns the row index

data.index
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')

(3) df.columns returns the column labels

data.columns
Index(['pop', 'GDP'], dtype='object')

(4) df.shape returns the shape

data.shape
(4, 2)

(5) df.size returns the number of elements

data.size
8

(6) df.dtypes returns the data type of each column

data.dtypes
pop    int64
GDP    int64
dtype: object

2. Indexing

There are so many indexing methods that it is easy to get confused, so here is a summary of loc and iloc (indexing with these two is recommended); a short demonstration sketch follows the list:

  1. Selecting rows
    • data.loc['label'] selects one row and returns a Series
    • data.loc[['label']] selects one row and returns a DataFrame
    • data.loc[['label1', 'label2']] selects multiple rows and returns a DataFrame
    • data.loc['label1' : 'label3'] selects the rows from 'label1' through 'label3' and returns a DataFrame
    • data.iloc[1] selects row 1 and returns a Series
    • data.iloc[[1]] selects row 1 and returns a DataFrame
    • data.iloc[[1, 3]] selects rows 1 and 3 and returns a DataFrame
    • data.iloc[0:3] selects rows 0 through 2 and returns a DataFrame
    • Note: all of the above can also be written as (i)loc[row_selector, :]; omitting the second part still selects rows by default
  2. Selecting columns
    • data.loc[:, 'label'] returns a Series
    • data.loc[:, ['label']] returns a DataFrame
    • data.loc[:, ['label1', 'label2']] selects multiple columns and returns a DataFrame
    • data.loc[:, 'label1' : 'label3'] selects the columns from 'label1' through 'label3' and returns a DataFrame
    • data.iloc[:, 1] selects column 1 and returns a Series
    • data.iloc[:, [1]] selects column 1 and returns a DataFrame
    • data.iloc[:, [1, 3]] selects columns 1 and 3 and returns a DataFrame
    • data.iloc[:, 0:3] selects columns 0 through 2 and returns a DataFrame
  3. Selecting elements
    • data.loc['row_label', 'col_label'] returns a single element
    • data.loc[['row_label'], ['col_label']] returns a DataFrame
    • data.loc[['row1', 'row2'], ['col1', 'col2']] returns a DataFrame
    • data.loc['row1' : 'row3', 'col1' : 'col2'] selects the intersection of rows 'row1' through 'row3' and columns 'col1' through 'col2' and returns a DataFrame
    • data.iloc[1, 2] selects the element at row 1, column 2 and returns a single element
    • data.iloc[[1], [2]] selects the element at row 1, column 2 and returns a DataFrame
    • data.iloc[[1, 0], [1, 3]] selects the four elements where rows 1 and 0 cross columns 1 and 3 and returns a DataFrame
    • data.iloc[1:3, 1:2] selects the intersection of rows 1-2 and column 1 (end points excluded) and returns a DataFrame
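To make these rules concrete, here is a minimal sketch (the small DataFrame demo is made up only for this illustration) showing that loc works with labels, iloc works with integer positions, and that list selectors return a DataFrame while scalar selectors return a Series or a single value:

import pandas as pd

demo = pd.DataFrame({"pop": [2154, 2424, 1303],
                     "GDP": [30320, 32680, 24222]},
                    index=["BeiJing", "ShangHai", "ShenZhen"])

demo.loc["BeiJing"]              # label -> Series (one row)
demo.loc[["BeiJing"]]            # list of labels -> DataFrame
demo.loc["BeiJing":"ShangHai"]   # label slice, end point included
demo.iloc[0:2]                   # position slice, end point excluded
demo.loc[:, "pop"]               # one column as a Series
demo.iloc[[0, 2], [1]]           # rows 0 and 2, column 1 -> DataFrame
demo.loc["BeiJing", "GDP"]       # a single element -> scalar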
data
pop GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468

(1) Getting columns

  • Dictionary style
data["pop"]
BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
Name: pop, dtype: int64
1
data[["pop"]]
pop
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
1
data[["GDP", "pop"]]
GDP pop
BeiJing 30320 2154
ShangHai 32680 2424
ShenZhen 24222 1303
HangZhou 13468 981
  • Attribute style
data.GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
Name: GDP, dtype: int64

(2) Getting rows

  • Label-based indexing with df.loc
data.loc["BeiJing"]
pop     2154
GDP    30320
Name: BeiJing, dtype: int64
1
data.loc[["BeiJing"]]
pop GDP
BeiJing 2154 30320
1
data.loc[["BeiJing", "HangZhou"]]
pop GDP
BeiJing 2154 30320
HangZhou 981 13468
  • Position-based indexing with df.iloc
data
pop GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468
1
data.iloc[0]
pop     2154
GDP    30320
Name: BeiJing, dtype: int64
1
data.iloc[[0]]
pop GDP
BeiJing 2154 30320
1
data.iloc[[1, 3]]
pop GDP
ShangHai 2424 32680
HangZhou 981 13468

(3) Getting a single value

1
data
pop GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468
1
data.loc["BeiJing", "GDP"]
30320
1
data.iloc[0, 1]
30320
1
data.values[0][1]
30320

(4) Indexing a Series object

1
type(data.GDP)
pandas.core.series.Series
1
GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64
1
GDP["BeiJing"]
30320

3. Slicing

df = pd.DataFrame(np.random.randn(6,4), index=['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06'], columns=["A", "B", "C", "D"])
df
A B C D
2019-01-01 -0.242680 0.310893 -2.323013 0.725643
2019-01-02 -1.683455 0.442882 0.061800 0.522243
2019-01-03 0.320874 -1.309343 0.418429 0.084683
2019-01-04 -0.894394 0.698045 1.695152 -0.122940
2019-01-05 0.731645 0.663405 0.189481 -0.872651
2019-01-06 0.050233 -1.236851 -0.246509 -0.328486

(1) Row slices

df["2019-01-01": "2019-01-03"]
A B C D
2019-01-01 -0.242680 0.310893 -2.323013 0.725643
2019-01-02 -1.683455 0.442882 0.061800 0.522243
2019-01-03 0.320874 -1.309343 0.418429 0.084683
1
df.loc["2019-01-01": "2019-01-03"]
A B C D
2019-01-01 -0.242680 0.310893 -2.323013 0.725643
2019-01-02 -1.683455 0.442882 0.061800 0.522243
2019-01-03 0.320874 -1.309343 0.418429 0.084683
1
df.iloc[0: 3]
A B C D
2019-01-01 -0.242680 0.310893 -2.323013 0.725643
2019-01-02 -1.683455 0.442882 0.061800 0.522243
2019-01-03 0.320874 -1.309343 0.418429 0.084683

(2) Column slices

df
A B C D
2019-01-01 -0.242680 0.310893 -2.323013 0.725643
2019-01-02 -1.683455 0.442882 0.061800 0.522243
2019-01-03 0.320874 -1.309343 0.418429 0.084683
2019-01-04 -0.894394 0.698045 1.695152 -0.122940
2019-01-05 0.731645 0.663405 0.189481 -0.872651
2019-01-06 0.050233 -1.236851 -0.246509 -0.328486
1
df.loc[:, "A": "C"]
A B C
2019-01-01 -0.242680 0.310893 -2.323013
2019-01-02 -1.683455 0.442882 0.061800
2019-01-03 0.320874 -1.309343 0.418429
2019-01-04 -0.894394 0.698045 1.695152
2019-01-05 0.731645 0.663405 0.189481
2019-01-06 0.050233 -1.236851 -0.246509
1
df.iloc[:, 0: 3]
A B C
2019-01-01 -0.242680 0.310893 -2.323013
2019-01-02 -1.683455 0.442882 0.061800
2019-01-03 0.320874 -1.309343 0.418429
2019-01-04 -0.894394 0.698045 1.695152
2019-01-05 0.731645 0.663405 0.189481
2019-01-06 0.050233 -1.236851 -0.246509

(3) Mixed selections

df
A B C D
2019-01-01 -0.242680 0.310893 -2.323013 0.725643
2019-01-02 -1.683455 0.442882 0.061800 0.522243
2019-01-03 0.320874 -1.309343 0.418429 0.084683
2019-01-04 -0.894394 0.698045 1.695152 -0.122940
2019-01-05 0.731645 0.663405 0.189481 -0.872651
2019-01-06 0.050233 -1.236851 -0.246509 -0.328486
  • Slicing rows and columns at the same time
df.loc["2019-01-02": "2019-01-03", "C":"D"]
C D
2019-01-02 0.061800 0.522243
2019-01-03 0.418429 0.084683
1
df.iloc[1: 3, 2:]
C D
2019-01-02 0.061800 0.522243
2019-01-03 0.418429 0.084683
  • A row slice combined with scattered column selection
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]
A C
2019-01-04 -0.894394 1.695152
2019-01-05 0.731645 0.189481
2019-01-06 0.050233 -0.246509
1
df.iloc[3:, [0, 2]]
A C
2019-01-04 -0.894394 1.695152
2019-01-05 0.731645 0.189481
2019-01-06 0.050233 -0.246509
  • Scattered row selection combined with a column slice
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]
C D
2019-01-02 0.061800 0.522243
2019-01-06 -0.246509 -0.328486
1
df.iloc[[1, 5], 0: 3]
A B C
2019-01-02 -1.683455 0.442882 0.061800
2019-01-06 0.050233 -1.236851 -0.246509
  • Scattered selection of both rows and columns
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
A D
2019-01-04 -0.894394 -0.122940
2019-01-06 0.050233 -0.328486
1
df.iloc[[1, 5], [0, 3]]
A D
2019-01-02 -1.683455 0.522243
2019-01-06 0.050233 -0.328486

4. Boolean indexing

df
A B C D
2019-01-01 -0.968004 0.024658 1.365858 -0.280383
2019-01-02 -0.409198 1.228846 0.673056 0.396339
2019-01-03 -2.356054 -0.458781 0.063541 -0.016453
2019-01-04 0.847711 -1.054088 -0.114295 0.003130
2019-01-05 1.181092 0.017556 -0.598561 -0.303268
2019-01-06 -0.084136 -0.542269 1.378498 0.919418
1
df > 0
A B C D
2019-01-01 False True True False
2019-01-02 False True True True
2019-01-03 False False True False
2019-01-04 True False False True
2019-01-05 True True False False
2019-01-06 False False True True
1
df[df > 0]
A B C D
2019-01-01 NaN 0.024658 1.365858 NaN
2019-01-02 NaN 1.228846 0.673056 0.396339
2019-01-03 NaN NaN 0.063541 NaN
2019-01-04 0.847711 NaN NaN 0.003130
2019-01-05 1.181092 0.017556 NaN NaN
2019-01-06 NaN NaN 1.378498 0.919418
1
df.A > 0
2019-01-01    False
2019-01-02    False
2019-01-03    False
2019-01-04     True
2019-01-05     True
2019-01-06    False
Freq: D, Name: A, dtype: bool
1
df[df.A > 0]
A B C D
2019-01-04 0.847711 -1.054088 -0.114295 0.003130
2019-01-05 1.181092 0.017556 -0.598561 -0.303268
  • The isin() method
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
A B C D E
2019-01-01 -0.968004 0.024658 1.365858 -0.280383 one
2019-01-02 -0.409198 1.228846 0.673056 0.396339 one
2019-01-03 -2.356054 -0.458781 0.063541 -0.016453 two
2019-01-04 0.847711 -1.054088 -0.114295 0.003130 three
2019-01-05 1.181092 0.017556 -0.598561 -0.303268 four
2019-01-06 -0.084136 -0.542269 1.378498 0.919418 three
ind = df2["E"].isin(["two", "four"])
ind
2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
2019-01-06    False
Freq: D, Name: E, dtype: bool
1
df2[ind]
A B C D E
2019-01-03 -2.356054 -0.458781 0.063541 -0.016453 two
2019-01-05 1.181092 0.017556 -0.598561 -0.303268 four

5. Assignment

df
A B C D
2019-01-01 -0.968004 0.024658 1.365858 -0.280383
2019-01-02 -0.409198 1.228846 0.673056 0.396339
2019-01-03 -2.356054 -0.458781 0.063541 -0.016453
2019-01-04 0.847711 -1.054088 -0.114295 0.003130
2019-01-05 1.181092 0.017556 -0.598561 -0.303268
2019-01-06 -0.084136 -0.542269 1.378498 0.919418
  • Adding a new column to a DataFrame
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01    1
2019-01-02    2
2019-01-03    3
2019-01-04    4
2019-01-05    5
2019-01-06    6
Freq: D, dtype: int64
df["E"] = s1
df
A B C D E
2019-01-01 -0.968004 0.024658 1.365858 -0.280383 1
2019-01-02 -0.409198 1.228846 0.673056 0.396339 2
2019-01-03 -2.356054 -0.458781 0.063541 -0.016453 3
2019-01-04 0.847711 -1.054088 -0.114295 0.003130 4
2019-01-05 1.181092 0.017556 -0.598561 -0.303268 5
2019-01-06 -0.084136 -0.542269 1.378498 0.919418 6
  • Modifying values by assignment
df.loc["2019-01-01", "A"] = 0
df
A B C D E
2019-01-01 0.000000 0.024658 1.365858 -0.280383 1
2019-01-02 -0.409198 1.228846 0.673056 0.396339 2
2019-01-03 -2.356054 -0.458781 0.063541 -0.016453 3
2019-01-04 0.847711 -1.054088 -0.114295 0.003130 4
2019-01-05 1.181092 0.017556 -0.598561 -0.303268 5
2019-01-06 -0.084136 -0.542269 1.378498 0.919418 6
df.iloc[0, 1] = 0
df
A B C D E
2019-01-01 0.000000 0.000000 1.365858 -0.280383 1
2019-01-02 -0.409198 1.228846 0.673056 0.396339 2
2019-01-03 -2.356054 -0.458781 0.063541 -0.016453 3
2019-01-04 0.847711 -1.054088 -0.114295 0.003130 4
2019-01-05 1.181092 0.017556 -0.598561 -0.303268 5
2019-01-06 -0.084136 -0.542269 1.378498 0.919418 6
df["D"] = np.array([5]*len(df))   # can be simplified to df["D"] = 5
df
A B C D E
2019-01-01 0.000000 0.000000 1.365858 5 1
2019-01-02 -0.409198 1.228846 0.673056 5 2
2019-01-03 -2.356054 -0.458781 0.063541 5 3
2019-01-04 0.847711 -1.054088 -0.114295 5 4
2019-01-05 1.181092 0.017556 -0.598561 5 5
2019-01-06 -0.084136 -0.542269 1.378498 5 6
  • Modifying the index and columns
df.index = [i for i in range(len(df))]
df
A B C D E
0 0.000000 0.000000 1.365858 5 1
1 -0.409198 1.228846 0.673056 5 2
2 -2.356054 -0.458781 0.063541 5 3
3 0.847711 -1.054088 -0.114295 5 4
4 1.181092 0.017556 -0.598561 5 5
5 -0.084136 -0.542269 1.378498 5 6
df.columns = [i for i in range(df.shape[1])]
df
0 1 2 3 4
0 0.000000 0.000000 1.365858 5 1
1 -0.409198 1.228846 0.673056 5 2
2 -2.356054 -0.458781 0.063541 5 3
3 0.847711 -1.054088 -0.114295 5 4
4 1.181092 0.017556 -0.598561 5 5
5 -0.084136 -0.542269 1.378498 5 6

12.3 Numerical Operations and Statistical Analysis


1. Inspecting the data

import pandas as pd
import numpy as np

dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
A B C D
2019-01-01 1.051134 -2.357190 -0.441416 0.756528
2019-01-02 -1.779281 1.581360 2.070031 0.672192
2019-01-03 1.072005 -0.447704 0.783539 -0.962688
2019-01-04 1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289 1.637207 1.798403 0.464263
2019-01-06 1.597604 -0.099617 0.735458 1.495397

(1) Viewing the first rows

df.head()    # the first 5 rows by default
A B C D
2019-01-01 1.051134 -2.357190 -0.441416 0.756528
2019-01-02 -1.779281 1.581360 2.070031 0.672192
2019-01-03 1.072005 -0.447704 0.783539 -0.962688
2019-01-04 1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289 1.637207 1.798403 0.464263
1
df.head(2)
A B C D
2019-01-01 1.051134 -2.35719 -0.441416 0.756528
2019-01-02 -1.779281 1.58136 2.070031 0.672192

(2) Viewing the last rows

df.tail()    # the last 5 rows by default
A B C D
2019-01-02 -1.779281 1.581360 2.070031 0.672192
2019-01-03 1.072005 -0.447704 0.783539 -0.962688
2019-01-04 1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289 1.637207 1.798403 0.464263
2019-01-06 1.597604 -0.099617 0.735458 1.495397
1
df.tail(3) 
A B C D
2019-01-04 1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289 1.637207 1.798403 0.464263
2019-01-06 1.597604 -0.099617 0.735458 1.495397

(3) Viewing summary information

df.iloc[0, 3] = np.nan
df
A B C D
2019-01-01 1.051134 -2.357190 -0.441416 NaN
2019-01-02 -1.779281 1.581360 2.070031 0.672192
2019-01-03 1.072005 -0.447704 0.783539 -0.962688
2019-01-04 1.306330 -1.088984 -0.122274 -0.692286
2019-01-05 -0.528289 1.637207 1.798403 0.464263
2019-01-06 1.597604 -0.099617 0.735458 1.495397
1
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes

2. NumPy universal functions also work on Pandas objects

(1) Vectorized operations

x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
0 1 2 3
0 0 1 2 3
1
x+5
0 1 2 3
0 5 6 7 8
1
np.exp(x)
0 1 2 3
0 1.0 2.718282 7.389056 20.085537
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y
0 1 2 3
0 4 5 6 7
1
x*y
0 1 2 3
0 0 5 12 21

(2) Matrix operations

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 6 3 7 4 6 9 2 6 7 4 ... 4 0 9 5 8 0 9 2 6 3
1 8 2 4 2 6 4 8 6 1 3 ... 2 0 3 1 7 3 1 5 5 9
2 3 5 1 9 1 9 3 7 6 8 ... 6 8 7 0 7 7 2 0 7 2
3 2 0 4 9 6 9 8 6 8 7 ... 0 2 4 2 0 4 9 6 6 8
4 9 9 2 6 0 3 3 4 6 6 ... 9 6 8 6 0 0 8 8 3 8
5 2 6 5 7 8 4 0 2 9 7 ... 2 0 4 0 7 0 0 1 1 5
6 6 4 0 0 2 1 4 9 5 6 ... 5 0 8 5 2 3 3 2 9 2
7 2 3 6 3 8 0 7 6 1 7 ... 3 0 1 0 4 4 6 8 8 2
8 2 2 3 7 5 7 0 7 3 0 ... 1 1 5 2 8 3 0 3 0 4
9 3 7 7 6 2 0 0 2 5 6 ... 4 2 3 2 0 0 4 5 2 8
10 4 7 0 4 2 0 3 4 6 0 ... 5 6 1 9 1 9 0 7 0 8
11 5 6 9 6 9 2 1 8 7 9 ... 6 5 2 8 9 5 9 9 5 0
12 3 9 5 5 4 0 7 4 4 6 ... 0 7 2 9 6 9 4 9 4 6
13 8 4 0 9 9 0 1 5 8 7 ... 5 8 4 0 3 4 9 9 4 6
14 3 0 4 6 9 9 5 4 3 1 ... 6 1 0 3 7 1 2 0 0 2
15 4 2 0 0 7 9 1 2 1 2 ... 6 3 9 4 1 7 3 8 4 8
16 3 9 4 8 7 2 0 2 3 1 ... 8 0 0 3 8 5 2 0 3 8
17 2 8 6 3 2 9 4 4 2 8 ... 6 9 4 2 6 1 8 9 9 0
18 5 6 7 9 8 1 9 1 4 4 ... 3 5 2 5 6 9 9 2 6 2
19 1 9 3 7 8 6 0 2 8 0 ... 4 3 2 2 3 8 1 8 0 0
20 4 5 5 2 6 8 9 7 5 7 ... 3 5 0 8 0 4 3 2 5 1
21 2 4 8 1 9 7 1 4 6 7 ... 0 1 8 2 0 4 6 5 0 4
22 4 5 2 4 6 4 4 4 9 9 ... 1 7 6 9 9 1 5 5 2 1
23 0 5 4 8 0 6 4 4 1 2 ... 8 5 0 7 6 9 2 0 4 3
24 9 7 0 9 0 3 7 4 1 5 ... 3 7 8 2 2 1 9 2 2 4
25 4 1 9 5 4 5 0 4 8 9 ... 9 3 0 7 0 2 3 7 5 9
26 6 7 1 9 7 2 6 2 6 1 ... 0 6 5 9 8 0 3 8 3 9
27 2 8 1 3 5 1 7 7 0 2 ... 8 0 4 5 4 5 5 6 3 7
28 6 8 6 2 2 7 4 3 7 5 ... 1 7 9 2 4 5 9 5 3 2
29 3 0 3 0 0 9 5 4 3 2 ... 1 3 0 4 8 0 8 7 5 6

30 rows × 30 columns

  • Transpose
z = x.T
z
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 6 8 3 2 9 2 6 2 2 3 ... 4 2 4 0 9 4 6 2 6 3
1 3 2 5 0 9 6 4 3 2 7 ... 5 4 5 5 7 1 7 8 8 0
2 7 4 1 4 2 5 0 6 3 7 ... 5 8 2 4 0 9 1 1 6 3
3 4 2 9 9 6 7 0 3 7 6 ... 2 1 4 8 9 5 9 3 2 0
4 6 6 1 6 0 8 2 8 5 2 ... 6 9 6 0 0 4 7 5 2 0
5 9 4 9 9 3 4 1 0 7 0 ... 8 7 4 6 3 5 2 1 7 9
6 2 8 3 8 3 0 4 7 0 0 ... 9 1 4 4 7 0 6 7 4 5
7 6 6 7 6 4 2 9 6 7 2 ... 7 4 4 4 4 4 2 7 3 4
8 7 1 6 8 6 9 5 1 3 5 ... 5 6 9 1 1 8 6 0 7 3
9 4 3 8 7 6 7 6 7 0 6 ... 7 7 9 2 5 9 1 2 5 2
10 3 8 7 1 3 5 3 0 7 5 ... 4 0 2 6 4 1 9 9 1 0
11 7 1 4 0 6 7 6 8 3 5 ... 7 5 0 5 1 0 5 8 3 5
12 7 9 1 6 2 8 7 8 5 5 ... 9 0 4 1 2 9 2 4 3 1
13 2 8 4 6 5 3 0 1 7 2 ... 3 1 8 5 8 8 2 5 5 7
14 5 9 7 7 1 0 5 6 3 5 ... 9 0 0 1 6 9 8 3 5 9
15 4 4 9 4 9 0 7 9 2 7 ... 7 4 2 1 6 8 6 9 0 4
16 1 1 8 2 8 9 4 2 8 1 ... 9 9 3 1 5 8 4 1 7 6
17 7 3 8 7 4 3 3 6 2 4 ... 1 8 0 2 7 5 9 7 5 9
18 5 6 0 5 5 6 1 9 8 0 ... 4 5 0 1 3 7 6 5 2 1
19 1 7 8 2 3 1 5 8 1 0 ... 8 0 7 3 7 0 8 4 8 7
20 4 2 6 0 9 2 5 3 1 4 ... 3 0 1 8 3 9 0 8 1 1
21 0 0 8 2 6 0 0 0 1 2 ... 5 1 7 5 7 3 6 0 7 3
22 9 3 7 4 8 4 8 1 5 3 ... 0 8 6 0 8 0 5 4 9 0
23 5 1 0 2 6 0 5 0 2 2 ... 8 2 9 7 2 7 9 5 2 4
24 8 7 7 0 0 7 2 4 8 0 ... 0 0 9 6 2 0 8 4 4 8
25 0 3 7 4 0 0 3 4 3 0 ... 4 4 1 9 1 2 0 5 5 0
26 9 1 2 9 8 0 3 6 0 4 ... 3 6 5 2 9 3 3 5 9 8
27 2 5 0 6 8 1 2 8 3 5 ... 2 5 5 0 2 7 8 6 5 7
28 6 5 7 6 3 1 9 8 0 2 ... 5 0 2 4 2 5 3 3 3 5
29 3 9 2 8 8 5 2 2 4 8 ... 1 4 1 3 4 9 9 7 2 6

30 rows × 30 columns

np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 5 8 9 5 0 0 1 7 6 9 ... 1 7 0 6 9 9 7 6 9 1
1 0 1 8 8 3 9 8 7 3 6 ... 9 2 0 4 9 2 7 7 9 8
2 6 9 3 7 7 4 5 9 3 6 ... 7 7 1 1 3 0 8 6 4 5
3 6 2 5 7 8 4 4 7 7 4 ... 0 1 9 8 2 3 1 2 7 2
4 6 0 9 2 6 6 2 7 7 0 ... 1 5 4 0 7 8 9 5 7 0
5 9 3 9 1 4 4 6 8 8 9 ... 1 8 7 0 3 4 2 0 3 5
6 1 2 4 3 0 6 0 7 2 8 ... 4 3 3 6 7 3 5 3 2 4
7 4 0 3 3 8 3 5 6 7 5 ... 1 7 3 1 6 6 9 6 9 6
8 0 0 2 9 6 0 6 7 0 3 ... 6 7 9 5 4 9 5 2 5 6
9 6 8 7 7 7 2 6 0 5 2 ... 7 0 6 2 4 3 6 7 6 3
10 0 6 4 7 6 2 9 5 9 9 ... 4 9 3 9 1 2 5 4 0 8
11 2 3 9 9 4 4 8 2 1 6 ... 0 5 9 8 6 6 0 4 7 3
12 0 1 6 0 6 1 6 4 2 5 ... 8 8 0 7 2 0 7 1 1 9
13 5 1 5 9 6 4 9 8 7 5 ... 2 4 3 2 0 0 4 2 5 0
14 0 3 8 5 3 1 4 7 3 2 ... 8 5 5 7 5 9 1 3 9 3
15 3 3 6 1 3 0 5 0 5 2 ... 7 1 7 7 3 8 3 0 6 3
16 0 6 5 9 6 4 6 6 2 2 ... 3 6 8 6 5 1 3 2 6 3
17 6 7 2 8 0 1 8 6 0 0 ... 5 6 2 5 4 3 0 6 2 1
18 9 4 4 0 9 8 7 7 6 1 ... 7 9 9 7 1 1 4 6 5 6
19 4 1 1 5 1 2 6 2 3 3 ... 0 0 0 9 8 5 9 3 4 0
20 9 8 6 3 9 9 0 8 1 6 ... 2 9 0 1 3 9 4 8 8 8
21 2 8 6 4 9 0 5 5 6 1 ... 6 7 5 6 8 7 4 2 4 0
22 0 3 5 9 0 3 6 5 1 1 ... 6 2 5 3 9 3 9 5 1 9
23 7 7 0 8 6 1 2 0 4 4 ... 1 9 6 0 2 8 3 7 2 5
24 6 0 4 2 3 1 0 5 7 0 ... 1 1 2 7 5 2 9 4 7 3
25 5 0 2 1 4 9 4 6 9 3 ... 5 5 3 5 9 2 7 4 1 6
26 9 8 1 8 1 6 2 6 1 8 ... 2 5 1 2 5 3 3 6 1 8
27 1 8 6 4 6 9 5 4 7 2 ... 9 3 1 5 1 1 7 1 2 6
28 0 7 7 4 3 2 7 8 5 2 ... 0 2 8 3 7 3 9 2 3 8
29 8 0 2 6 8 3 6 4 9 7 ... 6 7 8 5 7 2 5 3 4 5

30 rows × 30 columns

1
x.dot(y)
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 616 560 723 739 612 457 681 799 575 590 ... 523 739 613 580 668 602 733 585 657 700
1 520 438 691 600 612 455 666 764 707 592 ... 555 681 503 679 641 506 779 494 633 590
2 557 570 786 807 690 469 804 828 704 573 ... 563 675 712 758 793 672 754 550 756 638
3 605 507 664 701 660 496 698 806 651 575 ... 582 685 668 586 629 534 678 484 591 626
4 599 681 753 873 721 563 754 770 620 654 ... 633 747 661 677 726 649 716 610 735 706
5 422 354 602 627 613 396 617 627 489 423 ... 456 572 559 537 499 384 589 436 574 507
6 359 446 599 599 481 357 577 572 451 464 ... 449 550 495 532 633 554 663 476 565 602
7 531 520 698 590 607 537 665 696 571 472 ... 576 588 551 665 652 527 742 528 650 599
8 449 322 547 533 593 399 584 638 587 424 ... 402 596 523 523 447 362 561 386 529 484
9 373 433 525 601 522 345 551 521 434 447 ... 508 498 438 478 459 418 488 407 503 496
10 500 427 574 607 667 477 652 656 615 477 ... 622 702 531 610 558 532 598 471 582 561
11 664 694 772 841 779 574 730 810 711 608 ... 591 760 616 638 721 676 846 678 754 708
12 545 547 687 701 721 576 689 724 710 532 ... 674 684 648 694 710 564 757 571 671 656
13 574 586 723 750 691 494 696 787 667 523 ... 618 681 568 682 715 644 756 557 690 604
14 502 382 645 557 570 403 538 677 500 501 ... 369 650 507 576 546 531 554 437 616 463
15 510 505 736 651 649 510 719 733 694 557 ... 605 717 574 642 678 576 755 455 598 654
16 567 376 614 612 643 514 598 724 547 464 ... 456 639 520 560 569 442 596 517 659 532
17 626 716 828 765 740 603 809 852 692 591 ... 664 716 655 721 742 612 819 593 744 712
18 600 559 667 664 641 556 624 815 638 564 ... 581 701 559 677 710 554 748 597 614 657
19 445 431 661 681 641 552 690 719 602 474 ... 515 637 576 620 572 512 599 455 622 538
20 523 569 784 725 713 501 740 772 638 640 ... 589 775 664 686 726 672 747 548 723 645
21 487 465 553 639 517 449 592 609 454 398 ... 492 567 534 404 554 417 561 466 498 492
22 479 449 574 686 583 377 566 614 563 455 ... 453 539 491 501 596 520 722 478 565 501
23 483 386 476 526 550 426 492 585 536 482 ... 322 541 438 456 487 408 502 426 474 481
24 523 551 658 767 537 444 663 731 576 577 ... 522 590 525 664 691 548 635 526 641 538
25 652 656 738 753 853 508 752 815 669 576 ... 694 833 693 606 575 616 704 559 728 672
26 578 577 744 856 699 497 779 800 733 587 ... 630 754 704 834 760 680 765 592 731 629
27 554 494 665 689 630 574 695 703 636 599 ... 554 685 532 658 649 554 693 577 634 668
28 498 552 659 784 552 492 690 775 544 551 ... 567 636 518 599 742 521 733 533 605 604
29 513 491 563 642 477 367 589 647 516 484 ... 428 574 504 548 553 483 540 407 547 455

30 rows × 30 columns

1
%timeit x.dot(y)
264 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1
%timeit np.dot(x, y)
103 µs ± 8.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • Comparing NumPy and Pandas on the same operation
x1 = np.array(x)
x1
array([[6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7, 7, 2, 5, 4, 1, 7, 5, 1, 4, 0,
        9, 5, 8, 0, 9, 2, 6, 3],
       [8, 2, 4, 2, 6, 4, 8, 6, 1, 3, 8, 1, 9, 8, 9, 4, 1, 3, 6, 7, 2, 0,
        3, 1, 7, 3, 1, 5, 5, 9],
       [3, 5, 1, 9, 1, 9, 3, 7, 6, 8, 7, 4, 1, 4, 7, 9, 8, 8, 0, 8, 6, 8,
        7, 0, 7, 7, 2, 0, 7, 2],
       [2, 0, 4, 9, 6, 9, 8, 6, 8, 7, 1, 0, 6, 6, 7, 4, 2, 7, 5, 2, 0, 2,
        4, 2, 0, 4, 9, 6, 6, 8],
       [9, 9, 2, 6, 0, 3, 3, 4, 6, 6, 3, 6, 2, 5, 1, 9, 8, 4, 5, 3, 9, 6,
        8, 6, 0, 0, 8, 8, 3, 8],
       [2, 6, 5, 7, 8, 4, 0, 2, 9, 7, 5, 7, 8, 3, 0, 0, 9, 3, 6, 1, 2, 0,
        4, 0, 7, 0, 0, 1, 1, 5],
       [6, 4, 0, 0, 2, 1, 4, 9, 5, 6, 3, 6, 7, 0, 5, 7, 4, 3, 1, 5, 5, 0,
        8, 5, 2, 3, 3, 2, 9, 2],
       [2, 3, 6, 3, 8, 0, 7, 6, 1, 7, 0, 8, 8, 1, 6, 9, 2, 6, 9, 8, 3, 0,
        1, 0, 4, 4, 6, 8, 8, 2],
       [2, 2, 3, 7, 5, 7, 0, 7, 3, 0, 7, 3, 5, 7, 3, 2, 8, 2, 8, 1, 1, 1,
        5, 2, 8, 3, 0, 3, 0, 4],
       [3, 7, 7, 6, 2, 0, 0, 2, 5, 6, 5, 5, 5, 2, 5, 7, 1, 4, 0, 0, 4, 2,
        3, 2, 0, 0, 4, 5, 2, 8],
       [4, 7, 0, 4, 2, 0, 3, 4, 6, 0, 2, 1, 8, 9, 5, 9, 2, 7, 7, 1, 5, 6,
        1, 9, 1, 9, 0, 7, 0, 8],
       [5, 6, 9, 6, 9, 2, 1, 8, 7, 9, 6, 8, 3, 3, 0, 7, 2, 6, 1, 1, 6, 5,
        2, 8, 9, 5, 9, 9, 5, 0],
       [3, 9, 5, 5, 4, 0, 7, 4, 4, 6, 3, 5, 3, 2, 6, 7, 3, 1, 9, 2, 0, 7,
        2, 9, 6, 9, 4, 9, 4, 6],
       [8, 4, 0, 9, 9, 0, 1, 5, 8, 7, 4, 0, 6, 4, 5, 6, 2, 9, 2, 4, 5, 8,
        4, 0, 3, 4, 9, 9, 4, 6],
       [3, 0, 4, 6, 9, 9, 5, 4, 3, 1, 3, 9, 9, 2, 9, 0, 7, 4, 3, 7, 6, 1,
        0, 3, 7, 1, 2, 0, 0, 2],
       [4, 2, 0, 0, 7, 9, 1, 2, 1, 2, 6, 0, 9, 7, 9, 9, 9, 1, 2, 8, 6, 3,
        9, 4, 1, 7, 3, 8, 4, 8],
       [3, 9, 4, 8, 7, 2, 0, 2, 3, 1, 0, 6, 7, 6, 4, 0, 6, 6, 8, 2, 8, 0,
        0, 3, 8, 5, 2, 0, 3, 8],
       [2, 8, 6, 3, 2, 9, 4, 4, 2, 8, 3, 4, 3, 4, 6, 8, 6, 4, 9, 9, 6, 9,
        4, 2, 6, 1, 8, 9, 9, 0],
       [5, 6, 7, 9, 8, 1, 9, 1, 4, 4, 5, 2, 7, 0, 5, 3, 0, 6, 8, 3, 3, 5,
        2, 5, 6, 9, 9, 2, 6, 2],
       [1, 9, 3, 7, 8, 6, 0, 2, 8, 0, 8, 7, 0, 5, 4, 5, 9, 4, 5, 4, 4, 3,
        2, 2, 3, 8, 1, 8, 0, 0],
       [4, 5, 5, 2, 6, 8, 9, 7, 5, 7, 4, 7, 9, 3, 9, 7, 9, 1, 4, 8, 3, 5,
        0, 8, 0, 4, 3, 2, 5, 1],
       [2, 4, 8, 1, 9, 7, 1, 4, 6, 7, 0, 5, 0, 1, 0, 4, 9, 8, 5, 0, 0, 1,
        8, 2, 0, 4, 6, 5, 0, 4],
       [4, 5, 2, 4, 6, 4, 4, 4, 9, 9, 2, 0, 4, 8, 0, 2, 3, 0, 0, 7, 1, 7,
        6, 9, 9, 1, 5, 5, 2, 1],
       [0, 5, 4, 8, 0, 6, 4, 4, 1, 2, 6, 5, 1, 5, 1, 1, 1, 2, 1, 3, 8, 5,
        0, 7, 6, 9, 2, 0, 4, 3],
       [9, 7, 0, 9, 0, 3, 7, 4, 1, 5, 4, 1, 2, 8, 6, 6, 5, 7, 3, 7, 3, 7,
        8, 2, 2, 1, 9, 2, 2, 4],
       [4, 1, 9, 5, 4, 5, 0, 4, 8, 9, 1, 0, 9, 8, 9, 8, 8, 5, 7, 0, 9, 3,
        0, 7, 0, 2, 3, 7, 5, 9],
       [6, 7, 1, 9, 7, 2, 6, 2, 6, 1, 9, 5, 2, 2, 8, 6, 4, 9, 6, 8, 0, 6,
        5, 9, 8, 0, 3, 8, 3, 9],
       [2, 8, 1, 3, 5, 1, 7, 7, 0, 2, 9, 8, 4, 5, 3, 9, 1, 7, 5, 4, 8, 0,
        4, 5, 4, 5, 5, 6, 3, 7],
       [6, 8, 6, 2, 2, 7, 4, 3, 7, 5, 1, 3, 3, 5, 5, 0, 7, 5, 2, 8, 1, 7,
        9, 2, 4, 5, 9, 5, 3, 2],
       [3, 0, 3, 0, 0, 9, 5, 4, 3, 2, 0, 5, 1, 7, 9, 4, 6, 9, 1, 7, 1, 3,
        0, 4, 8, 0, 8, 7, 5, 6]])
y1 = np.array(y)
y1
array([[5, 8, 9, 5, 0, 0, 1, 7, 6, 9, 2, 4, 5, 2, 4, 2, 4, 7, 7, 9, 1, 7,
        0, 6, 9, 9, 7, 6, 9, 1],
       [0, 1, 8, 8, 3, 9, 8, 7, 3, 6, 5, 1, 9, 3, 4, 8, 1, 4, 0, 3, 9, 2,
        0, 4, 9, 2, 7, 7, 9, 8],
       [6, 9, 3, 7, 7, 4, 5, 9, 3, 6, 8, 0, 2, 7, 7, 9, 7, 3, 0, 8, 7, 7,
        1, 1, 3, 0, 8, 6, 4, 5],
       [6, 2, 5, 7, 8, 4, 4, 7, 7, 4, 9, 0, 2, 0, 7, 1, 7, 9, 8, 4, 0, 1,
        9, 8, 2, 3, 1, 2, 7, 2],
       [6, 0, 9, 2, 6, 6, 2, 7, 7, 0, 6, 5, 1, 4, 6, 0, 6, 5, 1, 2, 1, 5,
        4, 0, 7, 8, 9, 5, 7, 0],
       [9, 3, 9, 1, 4, 4, 6, 8, 8, 9, 2, 7, 5, 5, 4, 5, 8, 5, 8, 1, 1, 8,
        7, 0, 3, 4, 2, 0, 3, 5],
       [1, 2, 4, 3, 0, 6, 0, 7, 2, 8, 3, 0, 8, 4, 2, 9, 0, 3, 8, 1, 4, 3,
        3, 6, 7, 3, 5, 3, 2, 4],
       [4, 0, 3, 3, 8, 3, 5, 6, 7, 5, 1, 7, 0, 2, 8, 2, 1, 4, 0, 4, 1, 7,
        3, 1, 6, 6, 9, 6, 9, 6],
       [0, 0, 2, 9, 6, 0, 6, 7, 0, 3, 9, 0, 3, 4, 7, 5, 3, 8, 8, 0, 6, 7,
        9, 5, 4, 9, 5, 2, 5, 6],
       [6, 8, 7, 7, 7, 2, 6, 0, 5, 2, 1, 8, 5, 9, 4, 9, 1, 2, 0, 4, 7, 0,
        6, 2, 4, 3, 6, 7, 6, 3],
       [0, 6, 4, 7, 6, 2, 9, 5, 9, 9, 9, 8, 6, 4, 2, 9, 4, 0, 0, 3, 4, 9,
        3, 9, 1, 2, 5, 4, 0, 8],
       [2, 3, 9, 9, 4, 4, 8, 2, 1, 6, 3, 8, 9, 7, 0, 5, 2, 2, 8, 5, 0, 5,
        9, 8, 6, 6, 0, 4, 7, 3],
       [0, 1, 6, 0, 6, 1, 6, 4, 2, 5, 4, 6, 2, 9, 2, 7, 5, 0, 7, 8, 8, 8,
        0, 7, 2, 0, 7, 1, 1, 9],
       [5, 1, 5, 9, 6, 4, 9, 8, 7, 5, 1, 8, 0, 5, 3, 9, 0, 4, 8, 6, 2, 4,
        3, 2, 0, 0, 4, 2, 5, 0],
       [0, 3, 8, 5, 3, 1, 4, 7, 3, 2, 2, 2, 6, 6, 0, 1, 5, 6, 5, 8, 8, 5,
        5, 7, 5, 9, 1, 3, 9, 3],
       [3, 3, 6, 1, 3, 0, 5, 0, 5, 2, 7, 6, 4, 0, 2, 4, 8, 7, 6, 7, 7, 1,
        7, 7, 3, 8, 3, 0, 6, 3],
       [0, 6, 5, 9, 6, 4, 6, 6, 2, 2, 4, 1, 2, 3, 9, 3, 6, 7, 0, 3, 3, 6,
        8, 6, 5, 1, 3, 2, 6, 3],
       [6, 7, 2, 8, 0, 1, 8, 6, 0, 0, 1, 2, 7, 7, 4, 4, 0, 1, 0, 8, 5, 6,
        2, 5, 4, 3, 0, 6, 2, 1],
       [9, 4, 4, 0, 9, 8, 7, 7, 6, 1, 7, 2, 4, 5, 6, 7, 2, 0, 5, 2, 7, 9,
        9, 7, 1, 1, 4, 6, 5, 6],
       [4, 1, 1, 5, 1, 2, 6, 2, 3, 3, 2, 3, 0, 0, 1, 5, 0, 5, 8, 8, 0, 0,
        0, 9, 8, 5, 9, 3, 4, 0],
       [9, 8, 6, 3, 9, 9, 0, 8, 1, 6, 6, 1, 3, 7, 3, 2, 3, 0, 2, 8, 2, 9,
        0, 1, 3, 9, 4, 8, 8, 8],
       [2, 8, 6, 4, 9, 0, 5, 5, 6, 1, 7, 6, 5, 7, 1, 9, 7, 5, 6, 7, 6, 7,
        5, 6, 8, 7, 4, 2, 4, 0],
       [0, 3, 5, 9, 0, 3, 6, 5, 1, 1, 8, 4, 7, 0, 1, 7, 1, 5, 4, 5, 6, 2,
        5, 3, 9, 3, 9, 5, 1, 9],
       [7, 7, 0, 8, 6, 1, 2, 0, 4, 4, 6, 1, 0, 9, 4, 9, 3, 0, 5, 1, 1, 9,
        6, 0, 2, 8, 3, 7, 2, 5],
       [6, 0, 4, 2, 3, 1, 0, 5, 7, 0, 2, 2, 0, 9, 5, 1, 1, 6, 3, 2, 1, 1,
        2, 7, 5, 2, 9, 4, 7, 3],
       [5, 0, 2, 1, 4, 9, 4, 6, 9, 3, 8, 8, 7, 1, 8, 7, 8, 9, 2, 2, 5, 5,
        3, 5, 9, 2, 7, 4, 1, 6],
       [9, 8, 1, 8, 1, 6, 2, 6, 1, 8, 3, 7, 0, 7, 0, 0, 7, 3, 9, 5, 2, 5,
        1, 2, 5, 3, 3, 6, 1, 8],
       [1, 8, 6, 4, 6, 9, 5, 4, 7, 2, 0, 2, 0, 5, 9, 4, 1, 4, 5, 2, 9, 3,
        1, 5, 1, 1, 7, 1, 2, 6],
       [0, 7, 7, 4, 3, 2, 7, 8, 5, 2, 4, 9, 2, 2, 3, 5, 9, 6, 4, 9, 0, 2,
        8, 3, 7, 3, 9, 2, 3, 8],
       [8, 0, 2, 6, 8, 3, 6, 4, 9, 7, 6, 3, 2, 9, 1, 5, 5, 6, 9, 4, 6, 7,
        8, 5, 7, 2, 5, 3, 4, 5]])
1
%timeit x1.dot(y1)
24.7 µs ± 3.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1
%timeit np.dot(x1, y1)
22 µs ± 344 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1
%timeit np.dot(x.values, y.values)
43.4 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)
def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res
1
%timeit f(x3, y3)
5.86 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In general, pure numerical computation runs faster in NumPy.

NumPy is geared toward computation, while Pandas is geared toward data handling.

(3) Broadcasting

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
A B C
0 6 3 7
1 4 6 9
2 2 6 7
  • Broadcasting by row
x.iloc[0]
A    6
B    3
C    7
Name: 0, dtype: int32
1
x/x.iloc[0]
A B C
0 1.000000 1.0 1.000000
1 0.666667 2.0 1.285714
2 0.333333 2.0 1.000000
  • Broadcasting by column
x.A
0    6
1    4
2    2
Name: A, dtype: int32
1
x.div(x.A, axis=0)             # add sub div mul
A B C
0 1.0 0.5 1.166667
1 1.0 1.5 2.250000
2 1.0 3.0 3.500000
1
x.div(x.iloc[0], axis=1)
A B C
0 1.000000 1.0 1.000000
1 0.666667 2.0 1.285714
2 0.333333 2.0 1.000000

3. New usage patterns

(1) Index alignment

A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
A B
0 3 7
1 2 1
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
A B C
0 7 5 1
1 4 0 9
2 5 8 0
  • Pandas automatically aligns the indices of the two objects; entries missing from either one become np.nan
A+B
A B C
0 10.0 12.0 NaN
1 6.0 1.0 NaN
2 NaN NaN NaN
  • Missing entries can instead be filled via fill_value
A.add(B, fill_value=0)
A B C
0 10.0 12.0 1.0
1 6.0 1.0 9.0
2 5.0 8.0 0.0
1
A*B
A B C
0 21.0 35.0 NaN
1 8.0 0.0 NaN
2 NaN NaN NaN

(2) Statistics

  • Counting distinct values
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
1
np.unique(y)
array([0, 1, 2])
from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1
A
0 2
1 2
2 2
3 1
4 2
5 1
6 1
7 2
8 1
9 2
10 2
11 0
12 2
13 0
14 2
15 2
16 0
17 0
18 2
19 1
1
np.unique(y1)
array([0, 1, 2])
1
y1["A"].value_counts()
2    11
1     5
0     4
Name: A, dtype: int64
  • Deriving a new column and sorting
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)

GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }
GDP = pd.Series(GDP_dict)

city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info
population GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info
population GDP per_GDP
BeiJing 2154 30320 14.076137
ShangHai 2424 32680 13.481848
ShenZhen 1303 24222 18.589409
HangZhou 981 13468 13.728848

Sorting in ascending order

city_info.sort_values(by="per_GDP")
population GDP per_GDP
ShangHai 2424 32680 13.481848
HangZhou 981 13468 13.728848
BeiJing 2154 30320 14.076137
ShenZhen 1303 24222 18.589409

Sorting in descending order

city_info.sort_values(by="per_GDP", ascending=False)
population GDP per_GDP
ShenZhen 1303 24222 18.589409
BeiJing 2154 30320 14.076137
HangZhou 981 13468 13.728848
ShangHai 2424 32680 13.481848

Sorting along an axis

data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
C B A D
2 3 13 17 8
1 1 19 14 6
0 11 7 14 2

Sorting by the row index

data.sort_index()
C B A D
0 11 7 14 2
1 1 19 14 6
2 3 13 17 8

Sorting by the column labels

data.sort_index(axis=1)
A B C D
2 17 13 3 8
1 14 19 1 6
0 14 7 11 2
1
data.sort_index(axis=1, ascending=False)
D C B A
2 8 3 13 17
1 6 1 19 14
0 2 11 7 14
  • Statistical methods
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df
A B C D
0 1.082198 3.557396 -3.060476 6.367969
1 13.113252 6.774559 2.874553 5.527044
2 -2.036341 -4.333177 5.094802 -0.152567
3 -3.386712 -1.522365 -2.522209 2.537716
4 4.328491 5.550994 5.577329 5.019991
5 1.171336 -0.493910 -4.032613 6.398588

Number of non-null values

df.count()
A    6
B    6
C    6
D    6
dtype: int64

Sum

df.sum()
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
1
df.sum(axis=1)
0     7.947086
1    28.289408
2    -1.427283
3    -4.893571
4    20.476806
5     3.043402
dtype: float64

Maximum and minimum

df.min()
A   -3.386712
B   -4.333177
C   -4.032613
D   -0.152567
dtype: float64
1
df.max(axis=1)
0     6.367969
1    13.113252
2     5.094802
3     2.537716
4     5.577329
5     6.398588
dtype: float64
1
df
A B C D
0 1.082198 3.557396 -3.060476 6.367969
1 13.113252 6.774559 2.874553 5.527044
2 -2.036341 -4.333177 5.094802 -0.152567
3 -3.386712 -1.522365 -2.522209 2.537716
4 4.328491 5.550994 5.577329 5.019991
5 1.171336 -0.493910 -4.032613 6.398588
1
df.idxmax()
A    1
B    1
C    4
D    5
dtype: int64

Mean

df.mean()
A    2.378704
B    1.588916
C    0.655231
D    4.283124
dtype: float64

Variance

df.var()
A    34.980702
B    19.110656
C    18.948144
D     6.726776
dtype: float64

Standard deviation

df.std()
A    5.914449
B    4.371574
C    4.352947
D    2.593603
dtype: float64

Median

df.median()
A    1.126767
B    1.531743
C    0.176172
D    5.273518
dtype: float64

Mode

data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
A B
0 4 2
1 3 2
2 2 0
3 2 4
4 2 0
5 4 1
6 2 0
7 1 1
8 3 4
9 2 0
1
data.mode()
A B
0 2 0

The 75% quantile

df.quantile(0.75)
A    3.539202
B    5.052594
C    4.539740
D    6.157738
Name: 0.75, dtype: float64

Everything at once with describe()

df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 2.378704 1.588916 0.655231 4.283124
std 5.914449 4.371574 4.352947 2.593603
min -3.386712 -4.333177 -4.032613 -0.152567
25% -1.256706 -1.265251 -2.925910 3.158284
50% 1.126767 1.531743 0.176172 5.273518
75% 3.539202 5.052594 4.539740 6.157738
max 13.113252 6.774559 5.577329 6.398588
data_2 = pd.DataFrame([["a", "a", "c", "d"],
["c", "a", "c", "b"],
["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
A B C D
0 a a c d
1 c a c b
2 a a d c
1
data_2.describe()
A B C D
count 3 3 3 3
unique 2 1 2 3
top a a c c
freq 2 3 2 1

Correlation coefficients and covariance

df.corr()
A B C D
A 1.000000 0.831063 0.331060 0.510821
B 0.831063 1.000000 0.179244 0.719112
C 0.331060 0.179244 1.000000 -0.450365
D 0.510821 0.719112 -0.450365 1.000000
1
df.corrwith(df["A"])
A    1.000000
B    0.831063
C    0.331060
D    0.510821
dtype: float64

Custom aggregation output

Using apply(method): by default the given method is applied to each column

df
A B C D
0 1.082198 3.557396 -3.060476 6.367969
1 13.113252 6.774559 2.874553 5.527044
2 -2.036341 -4.333177 5.094802 -0.152567
3 -3.386712 -1.522365 -2.522209 2.537716
4 4.328491 5.550994 5.577329 5.019991
5 1.171336 -0.493910 -4.032613 6.398588
1
df.apply(np.cumsum)
A B C D
0 1.082198 3.557396 -3.060476 6.367969
1 14.195450 10.331955 -0.185923 11.895013
2 12.159109 5.998778 4.908878 11.742447
3 8.772397 4.476413 2.386669 14.280162
4 13.100888 10.027406 7.963999 19.300153
5 14.272224 9.533497 3.931385 25.698741
1
df.apply(np.cumsum, axis=1)
A B C D
0 1.082198 4.639594 1.579117 7.947086
1 13.113252 19.887811 22.762364 28.289408
2 -2.036341 -6.369518 -1.274717 -1.427283
3 -3.386712 -4.909077 -7.431287 -4.893571
4 4.328491 9.879485 15.456814 20.476806
5 1.171336 0.677427 -3.355186 3.043402
1
df.apply(sum)
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
1
df.sum()
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
1
df.apply(lambda x: x.max()-x.min())
A    16.499965
B    11.107736
C     9.609942
D     6.551155
dtype: float64
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()],
                     index=["Count", "mean", "max", "idxmin", "std"])

df.apply(my_describe)
A B C D
Count 6.000000 6.000000 6.000000 6.000000
mean 2.378704 1.588916 0.655231 4.283124
max 13.113252 6.774559 5.577329 6.398588
idxmin 3.000000 2.000000 5.000000 2.000000
std 5.914449 4.371574 4.352947 2.593603

12.4 Handling Missing Values

1. Detecting missing values

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
[np.nan, 3, 4],
[5, 6, None]]), columns=["A", "B", "C"])
data
A B C
0 1 NaN 2
1 NaN 3 4
2 5 6 None

Note: with None, strings, and the like present, every column's dtype becomes object, which is more expensive than int or float (see the conversion sketch after the dtypes output below).

data.dtypes
A    object
B    object
C    object
dtype: object
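If the values really are numeric apart from the missing entries, the object columns can be converted back to numeric dtypes explicitly; this is only a sketch of one way to do it (data_numeric is an illustrative name), and None becomes np.nan in the process:

data_numeric = data.apply(pd.to_numeric)   # convert each object column to a numeric dtype
data_numeric.dtypes                        # all float64; None has been replaced by NaN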
data.isnull()
A B C
0 False True False
1 True False False
2 False False True
1
data.notnull()
A B C
0 True False True
1 False True True
2 True True False

2. Dropping missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
A B C D
0 1.0 NaN 2.0 3.0
1 NaN 4.0 5.0 6.0
2 7.0 8.0 NaN 9.0
3 10.0 11.0 12.0 13.0

Note: np.nan is a special floating-point value (see the brief check after the dtypes output below).

data.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object
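A quick check of that claim (independent of this particular data set):

import numpy as np

type(np.nan)        # <class 'float'>
np.nan == np.nan    # False: NaN compares unequal to everything, including itself
np.nan + 1          # nan: it propagates through arithmetic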

(1) Dropping whole rows

data.dropna()
A B C D
3 10.0 11.0 12.0 13.0

(2) Dropping whole columns

data.dropna(axis="columns")
D
0 3.0
1 6.0
2 9.0
3 13.0
data["D"] = np.nan
data
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN
3 10.0 11.0 12.0 NaN
1
data.dropna(axis="columns", how="all")
A B C
0 1.0 NaN 2.0
1 NaN 4.0 5.0
2 7.0 8.0 NaN
3 10.0 11.0 12.0
1
data.dropna(axis="columns", how="any")
0
1
2
3
data.loc[3] = np.nan
data
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN
3 NaN NaN NaN NaN
1
data.dropna(how="all")
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN

3. Filling missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
A B C D
0 1.0 NaN 2.0 3.0
1 NaN 4.0 5.0 6.0
2 7.0 8.0 NaN 9.0
3 10.0 11.0 12.0 13.0
1
data.fillna(value=5)
A B C D
0 1.0 5.0 2.0 3.0
1 5.0 4.0 5.0 6.0
2 7.0 8.0 5.0 9.0
3 10.0 11.0 12.0 13.0
  • Filling with the mean
fill = data.mean()
fill
A    6.000000
B    7.666667
C    6.333333
D    7.750000
dtype: float64
1
data.fillna(value=fill)
A B C D
0 1.0 7.666667 2.000000 3.0
1 6.0 4.000000 5.000000 6.0
2 7.0 8.000000 6.333333 9.0
3 10.0 11.000000 12.000000 13.0
fill = data.stack().mean()
fill
7.0
1
data.fillna(value=fill)
A B C D
0 1.0 7.0 2.0 3.0
1 7.0 4.0 5.0 6.0
2 7.0 8.0 7.0 9.0
3 10.0 11.0 12.0 13.0

12.5 Combining Data

  • A helper function that builds a DataFrame
import pandas as pd
import numpy as np

def make_df(cols, ind):
    """A simple DataFrame"""
    data = {c: [str(c)+str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

make_df("ABC", range(3))
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
  • Vertical concatenation
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4
1
pd.concat([df_1, df_2])
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
  • Horizontal concatenation
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
1
pd.concat([df_3, df_4], axis=1)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
  • Overlapping indices

Overlapping rows

df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
    A   B
1  A1  B1
2  A2  B2
    A   B
1  A1  B1
2  A2  B2
1
pd.concat([df_5, df_6])
A B
1 A1 B1
2 A2 B2
1 A1 B1
2 A2 B2
1
pd.concat([df_5, df_6],ignore_index=True)
A B
0 A1 B1
1 A2 B2
2 A1 B1
3 A2 B2

Overlapping columns

df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
1  B1  C1  D1
2  B2  C2  D2
1
pd.concat([df_7, df_8], axis=1)
A B C B C D
1 A1 B1 C1 B1 C1 D1
2 A2 B2 C2 B2 C2 D2
1
pd.concat([df_7, df_8],axis=1, ignore_index=True)
0 1 2 3 4 5
1 A1 B1 C1 B1 C1 D1
2 A2 B2 C2 B2 C2 D2
  • Aligned merging with merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    B   C
1  B1  C1
2  B2  C2
1
pd.merge(df_9, df_10)
A B C
0 A1 B1 C1
1 A2 B2 C2
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    C   B
2  C2  B2
1  C1  B1
1
pd.merge(df_9, df_10)
A B C
0 A1 B1 C1
1 A2 B2 C2

Example: merging city information

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
"pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
city pop
0 BeiJing 2154
1 HangZhou 981
2 ShenZhen 1303
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
"GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
city GDP
0 BeiJing 30320
1 ShangHai 32680
2 HangZhou 13468
city_info = pd.merge(population, GDP)
city_info
city pop GDP
0 BeiJing 2154 30320
1 HangZhou 981 13468
city_info = pd.merge(population, GDP, how="outer")
city_info
city pop GDP
0 BeiJing 2154.0 30320.0
1 HangZhou 981.0 13468.0
2 ShenZhen 1303.0 NaN
3 ShangHai NaN 32680.0

12.6 Grouping and Pivot Tables

df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
"data1": range(6),
"data2": np.random.randint(0, 10, size=6)})
df
key data1 data2
0 A 0 3
1 B 1 3
2 C 2 4
3 C 3 6
4 B 4 6
5 A 5 3

(1) Grouping

  • Lazy evaluation
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018DD1697308>
1
df.groupby("key").sum()
data1 data2
key
A 5 6
B 5 9
C 5 10
1
df.groupby("key").mean()
data1 data2
key
A 2.5 3.0
B 2.5 4.5
C 2.5 5.0
for i in df.groupby("key"):
    print(str(i))
('A',   key  data1  data2
0   A      0      3
5   A      5      3)
('B',   key  data1  data2
1   B      1      3
4   B      4      6)
('C',   key  data1  data2
2   C      2      4
3   C      3      6)
  • Selecting a column from the groups
df.groupby("key")["data2"].sum()
key
A     6
B     9
C    10
Name: data2, dtype: int32
  • Iterating over the groups
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
A     shape=(2, 3)
B     shape=(2, 3)
C     shape=(2, 3)
  • Calling methods on the groups
df.groupby("key")["data1"].describe()
count mean std min 25% 50% 75% max
key
A 2.0 2.5 3.535534 0.0 1.25 2.5 3.75 5.0
B 2.0 2.5 2.121320 1.0 1.75 2.5 3.25 4.0
C 2.0 2.5 0.707107 2.0 2.25 2.5 2.75 3.0
  • More complex operations are supported
df.groupby("key").aggregate(["min", "median", "max"])
data1 data2
min median max min median max
key
A 0 2.5 5 3 3.0 3
B 1 2.5 4 3 4.5 6
C 2 2.5 3 4 5.0 6
  • Filtering
def filter_func(x):
    return x["data2"].std() > 3

df.groupby("key")["data2"].std()
key
A    0.000000
B    2.121320
C    1.414214
Name: data2, dtype: float64
1
df.groupby("key").filter(filter_func)
key data1 data2
  • Transformation
df
key data1 data2
0 A 0 3
1 B 1 3
2 C 2 4
3 C 3 6
4 B 4 6
5 A 5 3
1
df.groupby("key").transform(lambda x: x-x.mean())
data1 data2
0 -2.5 0.0
1 -1.5 -1.5
2 -0.5 -1.0
3 0.5 1.0
4 1.5 1.5
5 2.5 0.0
1
df
key data1 data2
0 A 0 3
1 B 1 3
2 C 2 4
3 C 3 6
4 B 4 6
5 A 5 3
1
df.groupby("key").apply(lambda x: x-x.mean())
data1 data2
0 -2.5 0.0
1 -1.5 -1.5
2 -0.5 -1.0
3 0.5 1.0
4 1.5 1.5
5 2.5 0.0
  • The apply() method
df
key data1 data2
0 A 0 3
1 B 1 3
2 C 2 4
3 C 3 6
4 B 4 6
5 A 5 3
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x
1
df.groupby("key").apply(norm_by_data2)
key data1 data2
0 A 0.000000 3
1 B 0.111111 3
2 C 0.200000 4
3 C 0.300000 6
4 B 0.444444 6
5 A 0.833333 3
  • Using a list or array as the grouping key
L = [0, 1, 0, 1, 2, 0]
df
key data1 data2
0 A 0 3
1 B 1 3
2 C 2 4
3 C 3 6
4 B 4 6
5 A 5 3
1
df.groupby(L).sum()
data1 data2
0 7 10
1 4 9
2 4 6
  • Using a dict to map the index to groups
df2 = df.set_index("key")
df2
data1 data2
key
A 0 3
B 1 3
C 2 4
C 3 6
B 4 6
A 5 3
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
data1 data2
constant 10 19
first 5 6
  • Any Python function
df2.groupby(str.lower).mean()
data1 data2
a 2.5 3.0
b 2.5 4.5
c 2.5 5.0
  • A list combining any of the valid keys above
df2.groupby([str.lower, mapping]).mean()
data1 data2
a first 2.5 3.0
b constant 2.5 4.5
c constant 2.5 5.0

Example 1: processing the planet observation data

import seaborn as sns

planets = sns.load_dataset("planets")
1
planets.shape
(1035, 6)
1
planets.head()
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
1
planets.describe()
number orbital_period mass distance year
count 1035.000000 992.000000 513.000000 808.000000 1035.000000
mean 1.785507 2002.917596 2.638161 264.069282 2009.070531
std 1.240976 26014.728304 3.818617 733.116493 3.972567
min 1.000000 0.090706 0.003600 1.350000 1989.000000
25% 1.000000 5.442540 0.229000 32.560000 2007.000000
50% 1.000000 39.979500 1.260000 55.250000 2010.000000
75% 2.000000 526.005000 3.040000 178.500000 2012.000000
max 7.000000 730000.000000 25.000000 8500.000000 2014.000000
1
planets.head()
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
decade = 10 * (planets["year"] // 10)
decade.head()
0    2000
1    2000
2    2010
3    2000
4    2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0    2000s
1    2000s
2    2010s
3    2000s
4    2000s
Name: decade, dtype: object
1
planets.head()
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
1
planets.groupby(["method", decade]).sum()
number orbital_period mass distance year
method decade
Astrometry 2010s 2 1.262360e+03 0.00000 35.75 4023
Eclipse Timing Variations 2000s 5 1.930800e+04 6.05000 261.44 6025
2010s 10 2.345680e+04 4.20000 1000.00 12065
Imaging 2000s 29 1.350935e+06 0.00000 956.83 40139
2010s 21 6.803750e+04 0.00000 1210.08 36208
Microlensing 2000s 12 1.732500e+04 0.00000 0.00 20070
2010s 15 4.750000e+03 0.00000 41440.00 26155
Orbital Brightness Modulation 2010s 5 2.127920e+00 0.00000 2360.00 6035
Pulsar Timing 1990s 9 1.900153e+02 0.00000 0.00 5978
2000s 1 3.652500e+04 0.00000 0.00 2003
2010s 1 9.070629e-02 0.00000 1200.00 2011
Pulsation Timing Variations 2000s 1 1.170000e+03 0.00000 0.00 2007
Radial Velocity 1980s 1 8.388800e+01 11.68000 40.57 1989
1990s 52 1.091561e+04 68.17820 723.71 55943
2000s 475 2.633526e+05 945.31928 15201.16 619775
2010s 424 1.809630e+05 316.47890 11382.67 432451
Transit 2000s 64 2.897102e+02 0.00000 31823.31 124462
2010s 712 8.087813e+03 1.47000 102419.46 673999
Transit Timing Variations 2010s 9 2.393505e+02 0.00000 3313.00 8050
1
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
number
decade 1980s 1990s 2000s 2010s
method
Astrometry 0.0 0.0 0.0 2.0
Eclipse Timing Variations 0.0 0.0 5.0 10.0
Imaging 0.0 0.0 29.0 21.0
Microlensing 0.0 0.0 12.0 15.0
Orbital Brightness Modulation 0.0 0.0 0.0 5.0
Pulsar Timing 0.0 9.0 1.0 1.0
Pulsation Timing Variations 0.0 0.0 1.0 0.0
Radial Velocity 1.0 52.0 475.0 424.0
Transit 0.0 0.0 64.0 712.0
Transit Timing Variations 0.0 0.0 0.0 9.0

(2) Pivot tables

Example 2: analyzing the Titanic passenger data

import seaborn as sns

titanic = sns.load_dataset("titanic")
1
titanic.head()
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
1
T = titanic[titanic.age.notnull()].copy()
T.age.apply(lambda x: 60 if x>=60 else x)   # note: the result is not assigned back, so T.age is unchanged
T.age.value_counts()
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
885    39.0
886    27.0
887    19.0
889    26.0
890    32.0
Name: age, Length: 714, dtype: float64






24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
         ..
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: age, Length: 88, dtype: int64
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
0    20
1    30
2    20
3    30
4    30
Name: age, dtype: int32






20    220
30    167
10    102
40     89
0      62
50     48
60     19
70      6
80      1
Name: age, dtype: int64
1
Age.astype(str)+"s"
0      20s
1      30s
2      20s
3      30s
4      30s
      ... 
885    30s
886    20s
887    10s
889    20s
890    30s
Name: age, Length: 714, dtype: object
1
T.groupby(["sex", Age])["survived"].mean().unstack()
age 0 10 20 30 40 50 60 70 80
sex
female 0.633333 0.755556 0.722222 0.833333 0.687500 0.888889 1.000000 NaN NaN
male 0.593750 0.122807 0.168919 0.214953 0.210526 0.133333 0.133333 0.0 1.0
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
age 0 10 20 30 40 50 60 70 80
sex
female 0.633333 0.755556 0.722222 0.833333 0.687500 0.888889 1.000000 NaN NaN
male 0.593750 0.122807 0.168919 0.214953 0.210526 0.133333 0.133333 0.0 1.0
1
titanic.describe()
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
1
titanic.groupby("sex")[["survived"]].mean()
survived
sex
female 0.742038
male 0.188908
1
titanic.groupby("sex")["survived"].mean()
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64
1
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
  • The pivot table
titanic.pivot_table("survived", index="sex", columns="class")
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
1
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True)
class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
1
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})
fare survived
class First Second Third First Second Third
sex
female 106.125798 21.970121 16.118810 91 70 72
male 67.226127 19.741782 12.661633 45 17 47

12.7 Other Topics

(1) Vectorized string operations
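A minimal sketch (the Series contents are made up for illustration): Pandas exposes vectorized string methods through the .str accessor of a Series, and missing values simply stay missing:

import pandas as pd

names = pd.Series(["BeiJing", "shang hai", None, "ShenZhen"])
names.str.lower()          # vectorized lower-casing; None stays missing
names.str.len()            # length of each string
names.str.contains("Zh")   # boolean mask
names.str.split()          # split each string on whitespace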

(2) Working with time series
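A minimal sketch of date-based indexing and resampling (the date range and values are made up for illustration):

import numpy as np
import pandas as pd

idx = pd.date_range("2019-01-01", periods=6, freq="D")
ts = pd.Series(np.arange(6), index=idx)
ts["2019-01-02":"2019-01-04"]   # slice by date labels (end point included)
ts.resample("2D").sum()         # downsample into 2-day bins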

(3) Hierarchical indexing: for multi-dimensional data

base_data = np.array([[1771, 11115 ],
[2154, 30320],
[2141, 14070],
[2424, 32680],
[1077, 7806],
[1303, 24222],
[798, 4789],
[981, 13468]])
data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"]\
, [2008, 2018]*4], columns=["population", "GDP"])
data
population GDP
BeiJing 2008 1771 11115
2018 2154 30320
ShangHai 2008 2141 14070
2018 2424 32680
ShenZhen 2008 1077 7806
2018 1303 24222
HangZhou 2008 798 4789
2018 981 13468
data.index.names = ["city", "year"]
data
population GDP
city year
BeiJing 2008 1771 11115
2018 2154 30320
ShangHai 2008 2141 14070
2018 2424 32680
ShenZhen 2008 1077 7806
2018 1303 24222
HangZhou 2008 798 4789
2018 981 13468
1
data["GDP"]
city      year
BeiJing   2008    11115
          2018    30320
ShangHai  2008    14070
          2018    32680
ShenZhen  2008     7806
          2018    24222
HangZhou  2008     4789
          2018    13468
Name: GDP, dtype: int32
1
data.loc["ShangHai", "GDP"]
year
2008    14070
2018    32680
Name: GDP, dtype: int32
1
data.loc["ShangHai", 2018]["GDP"]
32680

(4) High-performance Pandas: eval()

df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
1
%timeit (df1+df2)/(df3+df4)
19.1 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • eval() reduces the memory allocated to intermediate results in compound expressions
%timeit pd.eval("(df1+df2)/(df3+df4)")
14 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
  • Column-wise expressions
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
A B C
0 0.444109 0.403663 0.865813
1 0.294881 0.458937 0.235739
2 0.312199 0.256650 0.393556
3 0.080160 0.176736 0.385579
4 0.830258 0.352358 0.186917
1
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
1
res_2 = df.eval("(A+B)/(C-1)")
1
np.allclose(res_1, res_2)
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
A B C D
0 0.444109 0.403663 0.865813 -6.317854
1 0.294881 0.458937 0.235739 -0.986336
2 0.312199 0.256650 0.393556 -0.938006
3 0.080160 0.176736 0.385579 -0.418112
4 0.830258 0.352358 0.186917 -1.454484
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
A B C D
0 0.444109 0.403663 0.865813 -6.317854
1 0.294881 0.458937 0.235739 -0.986336
2 0.312199 0.256650 0.393556 -0.938006
3 0.080160 0.176736 0.385579 -0.418112
4 0.830258 0.352358 0.186917 -1.454484
  • Using local variables (referenced with the @ prefix)
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()
0   -0.706958
1    0.295687
2    0.318299
3    0.136251
4    0.809020
dtype: float64

(5) High-performance Pandas: query()

df.head()
A B C D
0 0.444109 0.403663 0.865813 -6.317854
1 0.294881 0.458937 0.235739 -0.986336
2 0.312199 0.256650 0.393556 -0.938006
3 0.080160 0.176736 0.385579 -0.418112
4 0.830258 0.352358 0.186917 -1.454484
1
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.97 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1
%timeit df.query("(A < 0.5)&(B > 0.5)")
3.35 ms ± 440 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1
df.query("(A < 0.5)&(B > 0.5)").head()
A B C D
5 0.125739 0.557392 0.556752 -1.541193
6 0.264407 0.738427 0.669857 -3.037572
11 0.427045 0.719621 0.732786 -4.291193
12 0.050538 0.864942 0.883998 -7.891950
19 0.310231 0.558427 0.627987 -2.335022
1
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True

(6) When to use eval() and query()

For small arrays, the ordinary approach is actually faster (the query() timing above is an example); eval() and query() mainly pay off when the data, whose size can be gauged with df.values.nbytes below, takes up a substantial share of memory.

df.values.nbytes
32000
df1.values.nbytes
8000000