Basic Pandas Operations

Pandas: A Data Analysis Library

1. Data Preprocessing

import pandas as pd

1.1 Reading Data

The Titanic dataset

df = pd.read_csv('./data/titanic.csv')
# Show the data; head() returns the first 5 rows by default
df.head()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

# tail() returns the last 5 rows by default
df.tail()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
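
Both head() and tail() accept a row count. The sketch below illustrates this on a tiny inline CSV; the text in csv_text is a made-up stand-in for the Titanic file, read through io.StringIO so the example runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for a CSV file on disk
csv_text = """PassengerId,Name,Age,Fare
1,Braund,22,7.25
2,Cumings,38,71.2833
3,Heikkinen,26,7.925
4,Futrelle,35,53.1
"""

sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)   # first 2 rows instead of the default 5
last_one = sample.tail(1)    # last row only
print(first_two)
print(last_one)
```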

1.2 DataFrame Structure

# DataFrame is the core data structure of the Pandas toolkit
# info() prints a summary of the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

# Return the row index
df.index
RangeIndex(start=0, stop=891, step=1)

# List the name of every column (feature)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

# Return the dtype of every column; object usually means strings
df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

# Get the underlying values as a NumPy array
df.values
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
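
The array above has dtype=object because the columns mix numbers and strings. A minimal sketch of that effect, using a made-up two-column frame (the name mixed is only for illustration) and to_numpy(), which does the same job as .values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one string column
mixed = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': ['male', 'female']})

all_values = mixed.to_numpy()            # mixed column types -> dtype object
age_values = mixed[['Age']].to_numpy()   # numeric columns only -> float64

print(all_values.dtype, age_values.dtype)
print(mixed.shape)                       # (rows, columns)
```

Selecting only the numeric columns before converting keeps a numeric dtype, which is usually what numerical code expects.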

1.3 Indexing Data

# Select a single column; the result is a Series
age = df['Age']
age[:5]
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

# .values exposes the column as a NumPy array
age.values[:5]
array([22., 38., 26., 35., 35.])

# Pandas adds an integer index by default when reading data; a column can be set as the index instead
data = df.set_index('Name')
data.head()

PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S

# Select a subset of columns by name
df = pd.read_csv('./data/titanic.csv')
df[['Age','Fare']][:5]

Age Fare
0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
4 35.0 8.0500

# Select one row by integer position with .iloc[]
df.iloc[0]
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

# Slice a range of rows
df.iloc[:5]

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

# Select some rows and some columns by position
df.iloc[:5,1:3]

Survived Pclass
0 0 3
1 1 1
2 1 3
3 1 1
4 0 3

# Look up data by label with .loc[]
df = df.set_index('Name')
df.loc['Heikkinen, Miss. Laina']
PassengerId                   3
Survived                      1
Pclass                        3
Sex                      female
Age                          26
SibSp                         0
Parch                         0
Ticket         STON/O2. 3101282
Fare                      7.925
Cabin                       NaN
Embarked                      S
Name: Heikkinen, Miss. Laina, dtype: object

# Get the value of one column for that row
df.loc['Heikkinen, Miss. Laina','Fare']
7.925

# Select several rows with a label slice (both endpoints are included)
df.loc['Heikkinen, Miss. Laina':'Allen, Mr. William Henry',:]

PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.925 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.100 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.050 NaN S

# Assign a value
df.loc['Heikkinen, Miss. Laina','Fare'] = 1000
df[:5]

PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S

# A comparison yields a boolean Series that can be used as an index
df['Fare'] > 40
Name
Braund, Mr. Owen Harris                                      False
Cumings, Mrs. John Bradley (Florence Briggs Thayer)           True
Heikkinen, Miss. Laina                                        True
Futrelle, Mrs. Jacques Heath (Lily May Peel)                  True
Allen, Mr. William Henry                                     False
Moran, Mr. James                                             False
McCarthy, Mr. Timothy J                                       True
Palsson, Master. Gosta Leonard                               False
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)            False
Nasser, Mrs. Nicholas (Adele Achem)                          False
Sandstrom, Miss. Marguerite Rut                              False
Bonnell, Miss. Elizabeth                                     False
Saundercock, Mr. William Henry                               False
Andersson, Mr. Anders Johan                                  False
Vestrom, Miss. Hulda Amanda Adolfina                         False
Hewlett, Mrs. (Mary D Kingcome)                              False
Rice, Master. Eugene                                         False
Williams, Mr. Charles Eugene                                 False
Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)      False
Masselmani, Mrs. Fatima                                      False
Fynney, Mr. Joseph J                                         False
Beesley, Mr. Lawrence                                        False
McGowan, Miss. Anna "Annie"                                  False
Sloper, Mr. William Thompson                                 False
Palsson, Miss. Torborg Danira                                False
Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)    False
Emir, Mr. Farred Chehab                                      False
Fortune, Mr. Charles Alexander                                True
O'Dwyer, Miss. Ellen "Nellie"                                False
Todoroff, Mr. Lalio                                          False
                                                             ...  
Giles, Mr. Frederick Edward                                  False
Swift, Mrs. Frederick Joel (Margaret Welles Barron)          False
Sage, Miss. Dorothy Edith "Dolly"                             True
Gill, Mr. John William                                       False
Bystrom, Mrs. (Karolina)                                     False
Duran y More, Miss. Asuncion                                 False
Roebling, Mr. Washington Augustus II                          True
van Melkebeke, Mr. Philemon                                  False
Johnson, Master. Harold Theodor                              False
Balkic, Mr. Cerin                                            False
Beckwith, Mrs. Richard Leonard (Sallie Monypeny)              True
Carlsson, Mr. Frans Olof                                     False
Vander Cruyssen, Mr. Victor                                  False
Abelson, Mrs. Samuel (Hannah Wizosky)                        False
Najib, Miss. Adele Kiamie "Jane"                             False
Gustafsson, Mr. Alfred Ossian                                False
Petroff, Mr. Nedelio                                         False
Laleff, Mr. Kristo                                           False
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)                 True
Shelley, Mrs. William (Imanita Parrish Hall)                 False
Markun, Mr. Johann                                           False
Dahlberg, Miss. Gerda Ulrika                                 False
Banfield, Mr. Frederick James                                False
Sutehall, Mr. Henry Jr                                       False
Rice, Mrs. William (Margaret Norton)                         False
Montvila, Rev. Juozas                                        False
Graham, Miss. Margaret Edith                                 False
Johnston, Miss. Catherine Helen "Carrie"                     False
Behr, Mr. Karl Howell                                        False
Dooley, Mr. Patrick                                          False
Name: Fare, Length: 891, dtype: bool

# Use the boolean mask to keep passengers whose fare is greater than 40
df[df['Fare'] > 40][:5]

PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Fortune, Mr. Charles Alexander 28 0 1 male 19.0 3 2 19950 263.0000 C23 C25 C27 S

# Keep only male passengers
df[df['Sex'] == 'male'][:5]

PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Moran, Mr. James 6 0 3 male NaN 0 0 330877 8.4583 NaN Q
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Palsson, Master. Gosta Leonard 8 0 3 male 2.0 3 1 349909 21.0750 NaN S

# Compute the mean age of male passengers
df.loc[df['Sex'] == 'male', 'Age'].mean()
30.72664459161148

# Count the passengers older than 70
(df['Age'] > 70).sum()
5
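
Two details of the indexing above are easy to miss: .iloc slices exclude the stop position while .loc label slices include both endpoints, and several boolean conditions can be combined with & and |. A small sketch on a made-up table (the frame people and its values are illustrative only):

```python
import pandas as pd

# Hypothetical mini passenger table indexed by name
people = pd.DataFrame(
    {'Sex': ['male', 'female', 'female', 'male'],
     'Age': [22.0, 38.0, 26.0, 54.0],
     'Fare': [7.25, 71.28, 7.92, 51.86]},
    index=['Braund', 'Cumings', 'Heikkinen', 'McCarthy'])

by_position = people.iloc[0:2]               # rows 0 and 1; the stop position is excluded
by_label = people.loc['Braund':'Heikkinen']  # label slices include both endpoints

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
rich_men = people[(people['Sex'] == 'male') & (people['Fare'] > 40)]

print(len(by_position), len(by_label))
print(rich_men.index.tolist())
```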

1.4 Creating a DataFrame

# Build a DataFrame from a dict that maps column names to lists of values
data = {'country':['China','America','India'],
'population':[3, 14, 6]}
data_df = pd.DataFrame(data)
data_df

country population
0 China 3
1 America 14
2 India 6
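
Besides a dict of columns, a DataFrame can also be built from a list of row dicts, and the index can be given explicitly. A brief sketch with made-up values (records, from_records and from_columns are illustrative names):

```python
import pandas as pd

# Hypothetical records; one dict per row
records = [
    {'city': 'X', 'area': 10},
    {'city': 'Y', 'area': 25},
]
from_records = pd.DataFrame(records)

# A dict of columns plus an explicit index
from_columns = pd.DataFrame({'area': [10, 25]}, index=['X', 'Y'])

print(from_records)
print(from_columns)
```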

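1.5 Series Operations

# A DataFrame is a two-dimensional table; each single column is a one-dimensional Series,
# so a DataFrame can be viewed as a set of Series that share one index
# Create a Series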
data = [11,45,21]
index = ['a','b','c']
s = pd.Series(data = data, index = index)
s
a    11
b    45
c    21
dtype: int64

# Look up by label
s.loc['b']
45

# Look up by position
s.iloc[1]
45

# Modify a value (work on a copy so that s stays unchanged)
s1 = s.copy()
s1.loc['a'] = 100
s1
a    100
b     45
c     21
dtype: int64

# Replace a value in place
s1.replace(100,101,inplace = True)
s1
a    101
b     45
c     21
dtype: int64

# Inspect and modify the index
s1.index
Index(['a', 'b', 'c'], dtype='object')

# Assign a new list of labels to replace the whole index
s1.index = ['a','b','z']
s1.index
Index(['a', 'b', 'z'], dtype='object')

# Rename individual index labels
s1.rename(index = {'a':'A'},inplace = True)
s1.index
Index(['A', 'b', 'z'], dtype='object')

# Append another Series; Series.append was removed in pandas 2.0, so pd.concat is used here
data = [100,101]
index = ['e','f']
s2 = pd.Series(data, index = index)
s3 = pd.concat([s1, s2])
s3
A    101
b     45
z     21
e    100
f    101
dtype: int64

# Assign a value by label
s3['f'] = 500
s3
A    101
b     45
z     21
e    100
f    500
dtype: int64
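
When two Series are joined, the labels of both inputs are kept by default. The sketch below, on two made-up Series, contrasts that with ignore_index=True, which discards the labels and numbers the result from 0:

```python
import pandas as pd

first = pd.Series([101, 45], index=['A', 'b'])
second = pd.Series([100, 101], index=['e', 'f'])

combined = pd.concat([first, second])                       # keeps the original labels
renumbered = pd.concat([first, second], ignore_index=True)  # fresh 0..n-1 index

print(combined)
print(renumbered)
```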

2. Data Analysis

2.1 Statistical Analysis

# Create a DataFrame
df = pd.DataFrame([[1,2,3],[4,5,6]],index = ['a', 'b'], columns = ['A','B', 'C'])
df

A B C
a 1 2 3
b 4 5 6

# Sum down each column (axis = 0 is the default)
df.sum()
A    5
B    7
C    9
dtype: int64

# Sum across each row
df.sum(axis = 1)
a     6
b    15
dtype: int64
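
# Reload the Titanic data: the outputs below were computed on it, not on the small frame above
df = pd.read_csv('./data/titanic.csv')
# Quick summary statistics for the numeric columns
df.describe()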

PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
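
# Covariance matrix; numeric_only=True is required on pandas 2.0+ when the frame has string columns
df.cov(numeric_only=True)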

PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 66231.000000 -0.626966 -7.561798 138.696504 -16.325843 -0.342697 161.883369
Survived -0.626966 0.236772 -0.137703 -0.551296 -0.018954 0.032017 6.221787
Pclass -7.561798 -0.137703 0.699015 -4.496004 0.076599 0.012429 -22.830196
Age 138.696504 -0.551296 -4.496004 211.019125 -4.163334 -2.344191 73.849030
SibSp -16.325843 -0.018954 0.076599 -4.163334 1.216043 0.368739 8.748734
Parch -0.342697 0.032017 0.012429 -2.344191 0.368739 0.649728 8.661052
Fare 161.883369 6.221787 -22.830196 73.849030 8.748734 8.661052 2469.436846

# Correlation coefficients; numeric_only=True for the same reason
df.corr(numeric_only=True)

PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000

# Count how often each distinct value occurs in a column (discrete values)
df['Sex'].value_counts()
male      577
female    314
Name: Sex, dtype: int64

# Continuous values can be counted per interval; bins is the number of equal-width intervals
df['Age'].value_counts(ascending = True, bins = 5)
(64.084, 80.0]       11
(48.168, 64.084]     69
(0.339, 16.336]     100
(32.252, 48.168]    188
(16.336, 32.252]    346
Name: Age, dtype: int64

# Binning with pd.cut and explicit bin edges
ages = [15,18,20,21,22,34,41,52,63,79]
bins = [10,40,80]
bins_res = pd.cut(ages,bins)
bins_res
[(10, 40], (10, 40], (10, 40], (10, 40], (10, 40], (10, 40], (40, 80], (40, 80], (40, 80], (40, 80]]
Categories (2, interval[int64]): [(10, 40] < (40, 80]]
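
The intervals returned by pd.cut can be given readable names and then counted. A short sketch reusing the same ages and bin edges; the label names 'young' and 'old' are made up for illustration:

```python
import pandas as pd

ages = [15, 18, 20, 21, 22, 34, 41, 52, 63, 79]
bins = [10, 40, 80]

# The label names are illustrative, not part of the original example
groups = pd.cut(ages, bins, labels=['young', 'old'])
counts = pd.Series(groups).value_counts()   # wrap in a Series to count per bin

print(counts)
```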

2.3 groupby Operations

import pandas as pd
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],'data':[0,5,10,5,10,15,10,15,20]})
df

key data
0 A 0
1 B 5
2 C 10
3 A 5
4 B 10
5 C 15
6 A 10
7 B 15
8 C 20

# Sum the values for each key
df.groupby('key').sum()

data
key
A 15
B 30
C 45
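
A groupby is not limited to one statistic per call. As a sketch on the same key/data values (the names df_keys and summary are illustrative), agg() computes several statistics per group at once:

```python
import pandas as pd

df_keys = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                        'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Several statistics per group in one call
summary = df_keys.groupby('key')['data'].agg(['sum', 'mean', 'max'])
print(summary)
```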

3. Common Function Operations

3.1 Merge Operations

# Join two DataFrames on a shared key column
left = pd.DataFrame({'key':['K0','K1','K2','K3'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})
pd.merge(left,right,on = 'key')

key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
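
In the example above every key appears in both tables, so the join type does not matter. The sketch below uses two made-up tables whose keys only partly overlap to show the default inner join next to how='outer'; indicator=True adds a _merge column that records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'C': ['C1', 'C2', 'C3']})

inner = pd.merge(left, right, on='key')   # default how='inner': keys present in both
outer = pd.merge(left, right, on='key', how='outer', indicator=True)  # all keys

print(inner)
print(outer)
```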

3.2 Sorting

data  = pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],
'data':[4,3,2,1,12,3,4,5,7]})
data

group data
0 a 4
1 a 3
2 a 2
3 b 1
4 b 12
5 b 3
6 c 4
7 c 5
8 c 7

# Sort by group ascending, then by data descending within each group
data.sort_values(by = ['group','data'],ascending = [True,False],inplace=True)
data

group data
0 a 4
1 a 3
2 a 2
4 b 12
5 b 3
3 b 1
8 c 7
7 c 5
6 c 4
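
After sorting, the old row labels travel with the rows, as the output above shows. A minimal sketch, on made-up data, of sorting without inplace=True and renumbering the result with reset_index:

```python
import pandas as pd

scores = pd.DataFrame({'group': ['a', 'b', 'a', 'b'], 'data': [2, 1, 4, 12]})

ordered = (scores
           .sort_values(by=['group', 'data'], ascending=[True, False])
           .reset_index(drop=True))   # renumber rows 0..n-1 after sorting
print(ordered)
```

Returning a new frame instead of sorting in place leaves the original scores untouched.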

3.3 Handling Duplicates and Missing Values

data = pd.DataFrame({'k1':['one'] *3 + ['two']* 4,
'k2':[3,2,1,3,3,4,4]})
data

k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
4 two 3
5 two 4
6 two 4

# Remove fully duplicated rows with drop_duplicates()
res = data.drop_duplicates()
res

k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
5 two 4

# Consider only certain columns when looking for duplicates
res1 = data.drop_duplicates(subset='k1')
res1

k1 k2
0 one 3
3 two 3

# Add a new column with assign()
import numpy as np
df = pd.DataFrame({'data1':np.random.randn(5),'data2':np.random.randn(5)})
df2 = df.assign(ratio = df['data1']/df['data2'])
df2

data1 data2 ratio
0 0.795552 1.063400 0.748121
1 1.516393 1.453561 1.043226
2 -1.043458 -0.210488 4.957335
3 1.112729 1.536009 0.724429
4 0.302984 -1.075604 -0.281687

# Build a small DataFrame that contains missing values
df = pd.DataFrame([range(3),[0,np.nan,0],[0,0,np.nan],range(3)])
df

0 1 2
0 0 1.0 2.0
1 0 NaN 0.0
2 0 0.0 NaN
3 0 1.0 2.0

# Use isnull() to flag missing values
df.isnull()

0 1 2
0 False False False
1 False True False
2 False False True
3 False False False

# Check whether each column contains any missing value
df.isnull().any()
0    False
1     True
2     True
dtype: bool

# Check along the other axis: whether each row contains any missing value
df.isnull().any(axis=1)
0    False
1     True
2     True
3    False
dtype: bool

# Fill missing values with a constant
df.fillna(5)

0 1 2
0 0 1.0 2.0
1 0 5.0 0.0
2 0 0.0 5.0
3 0 1.0 2.0
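
Filling with a constant is only one option. As a sketch on a made-up frame (gaps is an illustrative name), missing values can also be filled with each column's mean, or the incomplete rows can be dropped:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'x': [0.0, np.nan, 2.0, 4.0],
                     'y': [1.0, 1.0, np.nan, 4.0]})

filled = gaps.fillna(gaps.mean())   # each column's NaN gets that column's mean
complete = gaps.dropna()            # keep only rows without any NaN

print(filled)
print(complete)
```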

3.4 Custom Functions with apply

data = pd.DataFrame({'food':['A1','A2','B1','B2','B3','C1','C2'],
'data':[1,2,3,4,5,6,7]})
data

food data
0 A1 1
1 A2 2
2 B1 3
3 B2 4
4 B3 5
5 C1 6
6 C2 7

# Map each food code to its category; with axis = 'columns', apply() passes one row at a time
def food_map(series):
    if series['food'] == 'A1':
        return 'A'
    elif series['food'] == 'A2':
        return 'A'
    elif series['food'] == 'B1':
        return 'B'
    elif series['food'] == 'B2':
        return 'B'
    elif series['food'] == 'B3':
        return 'B'
    elif series['food'] == 'C1':
        return 'C'
    elif series['food'] == 'C2':
        return 'C'
data['food_map'] = data.apply(food_map,axis = 'columns')
data

food data food_map
0 A1 1 A
1 A2 2 A
2 B1 3 B
3 B2 4 B
4 B3 5 B
5 C1 6 C
6 C2 7 C
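
For a plain lookup like this one, a row-wise apply() is not strictly necessary. The sketch below shows two alternatives that give the same result here: Series.map with a dictionary, and taking the first character of the code. The names foods, food_to_group, by_map and by_str are illustrative, and the second option assumes the category is always the first character.

```python
import pandas as pd

foods = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'B2', 'B3', 'C1', 'C2'],
                      'data': [1, 2, 3, 4, 5, 6, 7]})

# Option 1: look each code up in a dictionary
food_to_group = {'A1': 'A', 'A2': 'A', 'B1': 'B', 'B2': 'B',
                 'B3': 'B', 'C1': 'C', 'C2': 'C'}
foods['by_map'] = foods['food'].map(food_to_group)

# Option 2: assumes the group is simply the first character of the code
foods['by_str'] = foods['food'].str[0]
print(foods)
```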

3.5 Time Series Operations

# Use the time column as the index when reading the data
data = pd.read_csv('./data/flowdata.csv',index_col = 0,parse_dates = True)
data.head()

L06_347 LS06_347 LS06_348
Time
2009-01-01 00:00:00 0.137417 0.097500 0.016833
2009-01-01 03:00:00 0.131250 0.088833 0.016417
2009-01-01 06:00:00 0.113500 0.091250 0.016750
2009-01-01 09:00:00 0.135750 0.091500 0.016250
2009-01-01 12:00:00 0.140917 0.096167 0.017000

# Select the data for one year; use .loc, since data['2013'] no longer selects rows on pandas 2.0+
data.loc['2013']

L06_347 LS06_347 LS06_348
Time
2013-01-01 00:00:00 1.688333 1.688333 0.207333
2013-01-01 03:00:00 2.693333 2.693333 0.201500
2013-01-01 06:00:00 2.220833 2.220833 0.166917
2013-01-01 09:00:00 2.055000 2.055000 0.175667
2013-01-01 12:00:00 1.710000 1.710000 0.129583
2013-01-01 15:00:00 1.420000 1.420000 0.096333
2013-01-01 18:00:00 1.178583 1.178583 0.083083
2013-01-01 21:00:00 0.898250 0.898250 0.077167
2013-01-02 00:00:00 0.860000 0.860000 0.075000
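
With a DatetimeIndex in place, rows can be selected by a partial date string and aggregated over time periods. The sketch below uses synthetic 3-hourly data as a stand-in for flowdata.csv (the frame flow and its value column are made up) and averages it per day with resample:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for flowdata.csv: one value every 3 hours
times = pd.date_range('2012-12-31', periods=24, freq='3h')
flow = pd.DataFrame({'value': np.arange(24.0)}, index=times)

in_2013 = flow.loc['2013']              # partial-string selection of one year
daily_mean = flow.resample('D').mean()  # average the 3-hourly values per day

print(len(in_2013))
print(daily_mean)
```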

3.6 Plotting

# Simple plotting with pandas
%matplotlib inline
df = pd.DataFrame(np.random.randn(10,4).cumsum(0),index = np.arange(0,100,10),
columns = ['A','B','C','D'])
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1dbf30fb860>

import matplotlib.pyplot as plt
# Create a figure with two subplots arranged in two rows and one column
fig,axes = plt.subplots(2,1)
data = pd.Series(np.random.rand(16),index = list('abcdefghijklmnop'))
data.plot(ax = axes[0],kind = 'bar')
data.plot(ax = axes[1],kind = 'barh')
<matplotlib.axes._subplots.AxesSubplot at 0x1dbf37e0080>

df = pd.DataFrame(np.random.rand(6,4),index = ['one','two','three','four','five','six'],
columns = pd.Index(['A','B','C','D'], name = 'Genus'))
df

Genus A B C D
one 0.825736 0.816818 0.836805 0.288769
two 0.568115 0.108279 0.188345 0.343175
three 0.669199 0.137701 0.567066 0.813652
four 0.961713 0.971082 0.319790 0.780224
five 0.196340 0.901948 0.684793 0.644339
six 0.249157 0.321956 0.110594 0.574358

# Grouped bar chart: one group of bars per row, one bar per column
df.plot(kind = 'bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1dbf35d4cf8>

data = pd.read_csv('./data/macrodata.csv')
data.head()

year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19

# Scatter plot of two columns
data.plot.scatter('quarter','realgdp')
<matplotlib.axes._subplots.AxesSubplot at 0x1dbf39df198>


4. Tips for Processing Large Datasets

4.1 Numeric Type Conversion

gl = pd.read_csv('./data/game_logs.csv')
gl.head()
D:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3049: DtypeWarning: Columns (12,13,14,15,19,20,81,83,85,87,93,94,95,96,97,98,99,100,105,106,108,109,111,112,114,115,117,118,120,121,123,124,126,127,129,130,132,133,135,136,138,139,141,142,144,145,147,148,150,151,153,154,156,157,160) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

date number_of_game day_of_week v_name v_league v_game_number h_name h_league h_game_number v_score ... h_player_7_name h_player_7_def_pos h_player_8_id h_player_8_name h_player_8_def_pos h_player_9_id h_player_9_name h_player_9_def_pos additional_info acquisition_info
0 18710504 0 Thu CL1 na 1 FW1 na 1 0 ... Ed Mincher 7.0 mcdej101 James McDermott 8.0 kellb105 Bill Kelly 9.0 NaN Y
1 18710505 0 Fri BS1 na 1 WS3 na 1 20 ... Asa Brainard 1.0 burrh101 Henry Burroughs 9.0 berth101 Henry Berthrong 8.0 HTBF Y
2 18710506 0 Sat CL1 na 2 RC1 na 1 12 ... Pony Sager 6.0 birdg101 George Bird 7.0 stirg101 Gat Stires 9.0 NaN Y
3 18710508 0 Mon CL1 na 3 CH1 na 1 12 ... Ed Duffy 6.0 pinke101 Ed Pinkham 5.0 zettg101 George Zettlein 1.0 NaN Y
4 18710509 0 Tue BS1 na 2 TRO na 1 9 ... Steve Bellan 5.0 pikel101 Lip Pike 3.0 cravb101 Bill Craver 6.0 HTBF Y

5 rows × 161 columns


# Number of rows and columns
gl.shape
(171907, 161)

# Downcasting numeric types can reduce memory usage; first check the current footprint
gl.info(memory_usage = 'deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB
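
The downcasting idea can be sketched on a small synthetic table, since game_logs.csv itself is not needed to see the effect. Everything below (the frame logs, its columns and sizes) is made up for illustration: pd.to_numeric with downcast picks the smallest integer type that holds the values, and converting a string column with few distinct values to category stores integer codes instead of repeated strings.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large log table
n = 10_000
logs = pd.DataFrame({
    'score': np.arange(n, dtype='int64') % 30,        # small integers stored as int64
    'day': ['Mon', 'Tue', 'Wed', 'Thu'] * (n // 4),   # only a few distinct strings
})

before = logs.memory_usage(deep=True).sum()

slim = logs.copy()
slim['score'] = pd.to_numeric(slim['score'], downcast='unsigned')  # int64 -> uint8 here
slim['day'] = slim['day'].astype('category')                       # strings -> integer codes

after = slim.memory_usage(deep=True).sum()
print(before, after)
```

The savings on a real dataset depend on the value ranges and on how many distinct strings each column holds, so it is worth measuring with memory_usage(deep=True) before and after.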