Pandas: A Data Analysis Library

1. Data Preprocessing

import pandas as pd

1.1 Reading Data

The Titanic dataset

df = pd.read_csv('./data/titanic.csv')
# Show the data; head() returns the first 5 rows by default
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

# tail() returns the last 5 rows by default
df.tail()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.00	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.00	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q
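
As a minimal sketch of the same calls that does not depend on `./data/titanic.csv`, the snippet below reads a tiny CSV from an in-memory string. The three-row sample and the variable names are made up for illustration. Both `head` and `tail` accept the number of rows to return.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real file.
csv_text = """PassengerId,Survived,Age
1,0,22.0
2,1,38.0
3,1,
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)  # first 2 rows instead of the default 5
last_one = sample.tail(1)   # last row only

print(first_two)
print(last_one)
```

The empty `Age` field in the last row is read as NaN, the same marker that appears in the `Cabin` column above.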

1.2 DataFrame Structure

# DataFrame is the core data structure of pandas
# info() shows basic information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

# Return the index
df.index

RangeIndex(start=0, stop=891, step=1)

# List the name of every column (feature)
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

# Return the dtype of every column; object usually means strings
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

# Get the underlying values as a NumPy array
df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

1.3 Indexing Data

age = df['Age']
age[:5]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

age.values[:5]

array([22., 38., 26., 35., 35.])

# pandas adds a numeric index by default when reading data; a column can be set as the index instead
data = df.set_index('Name')
data.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name
Braund, Mr. Owen Harris	1	0	3	male	22.0	1	0	A/5 21171	7.2500	NaN	S
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	2	1	1	female	38.0	1	0	PC 17599	71.2833	C85	C
Heikkinen, Miss. Laina	3	1	3	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	1	female	35.0	1	0	113803	53.1000	C123	S
Allen, Mr. William Henry	5	0	3	male	35.0	0	0	373450	8.0500	NaN	S

# Select part of the data by column name
df = pd.read_csv('./data/titanic.csv')
df[['Age','Fare']][:5]

	Age	Fare
0	22.0	7.2500
1	38.0	71.2833
2	26.0	7.9250
3	35.0	53.1000
4	35.0	8.0500

# Select one row by position with .iloc[]
df.iloc[0]

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

# Slice a range of rows
df.iloc[:5]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

# Select some rows and some columns by position
df.iloc[:5,1:3]

	Survived	Pclass
0	0	3
1	1	1
2	1	3
3	1	1
4	0	3

# Select data by label with .loc[]
df = df.set_index('Name')
df.loc['Heikkinen, Miss. Laina']

PassengerId                   3
Survived                      1
Pclass                        3
Sex                      female
Age                          26
SibSp                         0
Parch                         0
Ticket         STON/O2. 3101282
Fare                      7.925
Cabin                       NaN
Embarked                      S
Name: Heikkinen, Miss. Laina, dtype: object

# A single column value for the given row
df.loc['Heikkinen, Miss. Laina','Fare']

7.925

# Select several rows with a label slice
df.loc['Heikkinen, Miss. Laina':'Allen, Mr. William Henry',:]

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name
Heikkinen, Miss. Laina	3	1	3	female	26.0	0	0	STON/O2. 3101282	7.925	NaN	S
Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	1	female	35.0	1	0	113803	53.100	C123	S
Allen, Mr. William Henry	5	0	3	male	35.0	0	0	373450	8.050	NaN	S
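
One detail worth spelling out: a `.loc` label slice includes its end label, while an `.iloc` position slice excludes its end position. The sketch below illustrates this on a small made-up frame. The shortened names and the values are assumptions for illustration, not the real Titanic rows.

```python
import pandas as pd

# Hypothetical data, indexed by name like the Titanic example.
people = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0, 35.0], "Fare": [7.25, 71.28, 7.93, 53.10]},
    index=["Braund", "Cumings", "Heikkinen", "Futrelle"],
)

by_position = people.iloc[1:3]               # rows 1 and 2; position 3 is excluded
by_label = people.loc["Cumings":"Futrelle"]  # the "Futrelle" row is included

print(len(by_position), len(by_label))
```

This is why the label slice above returns three rows, ending with the `Allen, Mr. William Henry` row itself.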

# Assign a value
df.loc['Heikkinen, Miss. Laina','Fare'] = 1000
df[:5]

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name
Braund, Mr. Owen Harris	1	0	3	male	22.0	1	0	A/5 21171	7.2500	NaN	S
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	2	1	1	female	38.0	1	0	PC 17599	71.2833	C85	C
Heikkinen, Miss. Laina	3	1	3	female	26.0	0	0	STON/O2. 3101282	1000.0000	NaN	S
Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	1	female	35.0	1	0	113803	53.1000	C123	S
Allen, Mr. William Henry	5	0	3	male	35.0	0	0	373450	8.0500	NaN	S

# A comparison produces a boolean Series that can be used as an index
df['Fare'] > 40

Name
Braund, Mr. Owen Harris                                      False
Cumings, Mrs. John Bradley (Florence Briggs Thayer)           True
Heikkinen, Miss. Laina                                        True
Futrelle, Mrs. Jacques Heath (Lily May Peel)                  True
Allen, Mr. William Henry                                     False
Moran, Mr. James                                             False
McCarthy, Mr. Timothy J                                       True
Palsson, Master. Gosta Leonard                               False
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)            False
Nasser, Mrs. Nicholas (Adele Achem)                          False
Sandstrom, Miss. Marguerite Rut                              False
Bonnell, Miss. Elizabeth                                     False
Saundercock, Mr. William Henry                               False
Andersson, Mr. Anders Johan                                  False
Vestrom, Miss. Hulda Amanda Adolfina                         False
Hewlett, Mrs. (Mary D Kingcome)                              False
Rice, Master. Eugene                                         False
Williams, Mr. Charles Eugene                                 False
Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)      False
Masselmani, Mrs. Fatima                                      False
Fynney, Mr. Joseph J                                         False
Beesley, Mr. Lawrence                                        False
McGowan, Miss. Anna "Annie"                                  False
Sloper, Mr. William Thompson                                 False
Palsson, Miss. Torborg Danira                                False
Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)    False
Emir, Mr. Farred Chehab                                      False
Fortune, Mr. Charles Alexander                                True
O'Dwyer, Miss. Ellen "Nellie"                                False
Todoroff, Mr. Lalio                                          False
                                                             ...  
Giles, Mr. Frederick Edward                                  False
Swift, Mrs. Frederick Joel (Margaret Welles Barron)          False
Sage, Miss. Dorothy Edith "Dolly"                             True
Gill, Mr. John William                                       False
Bystrom, Mrs. (Karolina)                                     False
Duran y More, Miss. Asuncion                                 False
Roebling, Mr. Washington Augustus II                          True
van Melkebeke, Mr. Philemon                                  False
Johnson, Master. Harold Theodor                              False
Balkic, Mr. Cerin                                            False
Beckwith, Mrs. Richard Leonard (Sallie Monypeny)              True
Carlsson, Mr. Frans Olof                                     False
Vander Cruyssen, Mr. Victor                                  False
Abelson, Mrs. Samuel (Hannah Wizosky)                        False
Najib, Miss. Adele Kiamie "Jane"                             False
Gustafsson, Mr. Alfred Ossian                                False
Petroff, Mr. Nedelio                                         False
Laleff, Mr. Kristo                                           False
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)                 True
Shelley, Mrs. William (Imanita Parrish Hall)                 False
Markun, Mr. Johann                                           False
Dahlberg, Miss. Gerda Ulrika                                 False
Banfield, Mr. Frederick James                                False
Sutehall, Mr. Henry Jr                                       False
Rice, Mrs. William (Margaret Norton)                         False
Montvila, Rev. Juozas                                        False
Graham, Miss. Margaret Edith                                 False
Johnston, Miss. Catherine Helen "Carrie"                     False
Behr, Mr. Karl Howell                                        False
Dooley, Mr. Patrick                                          False
Name: Fare, Length: 891, dtype: bool

# Use the boolean mask to select passengers whose fare is above 40
df[df['Fare'] > 40][:5]

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	2	1	1	female	38.0	1	0	PC 17599	71.2833	C85	C
Heikkinen, Miss. Laina	3	1	3	female	26.0	0	0	STON/O2. 3101282	1000.0000	NaN	S
Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	1	female	35.0	1	0	113803	53.1000	C123	S
McCarthy, Mr. Timothy J	7	0	1	male	54.0	0	0	17463	51.8625	E46	S
Fortune, Mr. Charles Alexander	28	0	1	male	19.0	3	2	19950	263.0000	C23 C25 C27	S

df[df['Sex'] == 'male'][:5]

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name
Braund, Mr. Owen Harris	1	0	3	male	22.0	1	0	A/5 21171	7.2500	NaN	S
Allen, Mr. William Henry	5	0	3	male	35.0	0	0	373450	8.0500	NaN	S
Moran, Mr. James	6	0	3	male	NaN	0	0	330877	8.4583	NaN	Q
McCarthy, Mr. Timothy J	7	0	1	male	54.0	0	0	17463	51.8625	E46	S
Palsson, Master. Gosta Leonard	8	0	3	male	2.0	3	1	349909	21.0750	NaN	S

# Mean age of the male passengers
df.loc[df['Sex'] == 'male', 'Age'].mean()

30.72664459161148
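
Boolean masks can also be combined. The following is a small sketch on made-up rows, where only the column names mirror the Titanic data. It uses `&`, `|` and parentheses around each condition, and shows that summing a mask counts the `True` entries.

```python
import pandas as pd

# Hypothetical passengers; only the column names mirror the Titanic data.
passengers = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, None, 71.0],
        "Fare": [7.25, 71.28, 7.93, 34.65],
    }
)

# Use & (and), | (or), ~ (not); each condition needs its own parentheses.
older_men = passengers[(passengers["Sex"] == "male") & (passengers["Age"] > 30)]
cheap_or_female = passengers[(passengers["Fare"] < 10) | (passengers["Sex"] == "female")]

# Comparisons against NaN are False, so missing ages are not counted.
n_over_30 = int((passengers["Age"] > 30).sum())

print(older_men)
print(n_over_30)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.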

# Count the passengers older than 70
(df['Age'] > 70).sum()

1.4 Creating a DataFrame

data = {'country':['China','America','India'],
       'population':[3, 14, 6]}
data_df = pd.DataFrame(data)
data_df

	country	population
0	China	3
1	America	14
2	India	6

1.5 Series Operations

# A DataFrame can be viewed as a 2-D table; a single column is a 1-D Series, and a DataFrame is made up of Series
# Create a Series
data = [11,45,21]
index = ['a','b','c']
s = pd.Series(data = data, index = index)
s

a    11
b    45
c    21
dtype: int64

# Look up by label
s.loc['b']

# Look up by position
s.iloc[1]

# Modify a value
s1 = s.copy()
s1.loc['a'] = 100
s1

a    100
b     45
c     21
dtype: int64

s1.replace(100,101,inplace = True)
s1

a    101
b     45
c     21
dtype: int64

# Modify the index
s1.index

Index(['a', 'b', 'c'], dtype='object')

s1.index = ['a','b','z']
s1.index

Index(['a', 'b', 'z'], dtype='object')

# Rename an index label
s1.rename(index = {'a':'A'},inplace = True)
s1.index

Index(['A', 'b', 'z'], dtype='object')

data = [100,101]
index = ['e','f']
s2 = pd.Series(data, index)
# Series.append was removed in pandas 2.0; pd.concat gives the same result
s3 = pd.concat([s1, s2])
s3

A    101
b     45
z     21
e    100
f    101
dtype: int64

s3['f'] = 500
s3

A    101
b     45
z     21
e    100
f    500
dtype: int64

2. Data Analysis

2.1 Statistical Analysis

# Create a DataFrame
df = pd.DataFrame([[1,2,3],[4,5,6]],index = ['a', 'b'], columns = ['A','B', 'C'])
df

	A	B	C
a	1	2	3
b	4	5	6

df.sum()

A    5
B    7
C    9
dtype: int64

df.sum(axis = 1)

a     6
b    15
dtype: int64

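# Quick summary statistics of the samples
# (reload the Titanic data first, because df was reassigned to the small example above)
df = pd.read_csv('./data/titanic.csv')
df.describe()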

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

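# Covariance matrix (numeric_only=True skips the string columns, which recent pandas versions no longer drop silently)
df.cov(numeric_only=True)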

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
PassengerId	66231.000000	-0.626966	-7.561798	138.696504	-16.325843	-0.342697	161.883369
Survived	-0.626966	0.236772	-0.137703	-0.551296	-0.018954	0.032017	6.221787
Pclass	-7.561798	-0.137703	0.699015	-4.496004	0.076599	0.012429	-22.830196
Age	138.696504	-0.551296	-4.496004	211.019125	-4.163334	-2.344191	73.849030
SibSp	-16.325843	-0.018954	0.076599	-4.163334	1.216043	0.368739	8.748734
Parch	-0.342697	0.032017	0.012429	-2.344191	0.368739	0.649728	8.661052
Fare	161.883369	6.221787	-22.830196	73.849030	8.748734	8.661052	2469.436846

# Correlation coefficients (numeric_only=True skips the string columns)
df.corr(numeric_only=True)

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
PassengerId	1.000000	-0.005007	-0.035144	0.036847	-0.057527	-0.001652	0.012658
Survived	-0.005007	1.000000	-0.338481	-0.077221	-0.035322	0.081629	0.257307
Pclass	-0.035144	-0.338481	1.000000	-0.369226	0.083081	0.018443	-0.549500
Age	0.036847	-0.077221	-0.369226	1.000000	-0.308247	-0.189119	0.096067
SibSp	-0.057527	-0.035322	0.083081	-0.308247	1.000000	0.414838	0.159651
Parch	-0.001652	0.081629	0.018443	-0.189119	0.414838	1.000000	0.216225
Fare	0.012658	0.257307	-0.549500	0.096067	0.159651	0.216225	1.000000

# Count the occurrences of each value in a column (discrete values)
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

# Continuous values can be counted per interval; bins sets the number of intervals
df['Age'].value_counts(ascending = True, bins = 5)

(64.084, 80.0]       11
(48.168, 64.084]     69
(0.339, 16.336]     100
(32.252, 48.168]    188
(16.336, 32.252]    346
Name: Age, dtype: int64

# Binning
ages = [15,18,20,21,22,34,41,52,63,79]
bins = [10,40,80]
bins_res = pd.cut(ages,bins)
bins_res

[(10, 40], (10, 40], (10, 40], (10, 40], (10, 40], (10, 40], (40, 80], (40, 80], (40, 80], (40, 80]]
Categories (2, interval[int64]): [(10, 40] < (40, 80]]
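
As a possible extension of the binning example, `pd.cut` also takes a `labels` argument so the bins get readable names. The label names `young` and `old` below are made up for this sketch.

```python
import pandas as pd

ages = [15, 18, 20, 21, 22, 34, 41, 52, 63, 79]
bins = [10, 40, 80]

# Label names are made up for this example.
groups = pd.cut(ages, bins, labels=["young", "old"])

# Count how many values fall into each bin.
counts = pd.Series(groups).value_counts()

print(counts)
```

Each interval is open on the left and closed on the right by default, so an age of exactly 40 would fall into the first bin.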

2.3 groupby Operations

import pandas as pd
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],'data':[0,5,10,5,10,15,10,15,20]})
df

	key	data
0	A	0
1	B	5
2	C	10
3	A	5
4	B	10
5	C	15
6	A	10
7	B	15
8	C	20

# Sum of the values for each key
df.groupby('key').sum()

	data
key
A	15
B	30
C	45
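
A groupby is not limited to a single statistic. The sketch below rebuilds the same frame under the assumed name `df_demo` and applies several aggregations at once with `agg`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "key": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "data": [0, 5, 10, 5, 10, 15, 10, 15, 20],
    }
)

# Select the column first, then apply several aggregations at once.
summary = df_demo.groupby("key")["data"].agg(["sum", "mean", "max"])

print(summary)
```

The result has one row per key and one column per aggregation, and its `sum` column matches the table above.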

3. Common Operations

3.1 Merge

left = pd.DataFrame({'key':['K0','K1','K2','K3'],
                   'A':['A0','A1','A2','A3'],
                   'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
                    'C':['C0','C1','C2','C3'],
                    'D':['D0','D1','D2','D3']})
pd.merge(left,right,on = 'key')

	key	A	B	C	D
0	K0	A0	B0	C0	D0
1	K1	A1	B1	C1	D1
2	K2	A2	B2	C2	D2
3	K3	A3	B3	C3	D3
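
In the example above both frames contain exactly the same keys, so the join type does not matter. When the keys differ, the `how` parameter decides which rows survive. The frames `left_demo` and `right_demo` below are made up to show this.

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right_demo = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

inner = pd.merge(left_demo, right_demo, on="key")               # default how="inner"
outer = pd.merge(left_demo, right_demo, on="key", how="outer")  # keep every key
left_only = pd.merge(left_demo, right_demo, on="key", how="left")

print(outer)
```

Rows without a partner on the other side are filled with NaN in the outer and left joins.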

3.2 Sorting

data  = pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],
                     'data':[4,3,2,1,12,3,4,5,7]})
data

	group	data
0	a	4
1	a	3
2	a	2
3	b	1
4	b	12
5	b	3
6	c	4
7	c	5
8	c	7

data.sort_values(by = ['group','data'],ascending = [True,False],inplace=True)
data

	group	data
0	a	4
1	a	3
2	a	2
4	b	12
5	b	3
3	b	1
8	c	7
7	c	5
6	c	4
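
Notice that the row labels (0, 1, 2, 4, 5, 3, ...) travel with the rows after sorting. A minimal sketch of sorting without `inplace` and then renumbering the rows, on a made-up frame named `scores`:

```python
import pandas as pd

scores = pd.DataFrame(
    {"group": ["a", "a", "b", "b"], "data": [3, 4, 12, 1]}
)

# Without inplace=True, sort_values returns a new DataFrame.
ranked = scores.sort_values(by=["group", "data"], ascending=[True, False])

# The original row labels travel with the rows; reset_index renumbers them.
renumbered = ranked.reset_index(drop=True)

print(ranked)
print(renumbered)
```

Returning a new frame leaves `scores` untouched, which tends to be easier to reason about than modifying it in place.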

3.3 Handling Missing Values

data = pd.DataFrame({'k1':['one'] *3 + ['two']* 4,
                    'k2':[3,2,1,3,3,4,4]})
data

	k1	k2
0	one	3
1	one	2
2	one	1
3	two	3
4	two	3
5	two	4
6	two	4

# Remove duplicate rows with drop_duplicates()
res =  data.drop_duplicates()
res

	k1	k2
0	one	3
1	one	2
2	one	1
3	two	3
5	two	4

# Consider only certain columns when looking for duplicates
res1 = data.drop_duplicates(subset='k1')
res1

	k1	k2
0	one	3
3	two	3

# Add a new column with assign()
import numpy as np
df = pd.DataFrame({'data1':np.random.randn(5),'data2':np.random.randn(5)})
df2 = df.assign(ratio = df['data1']/df['data2'])
df2

	data1	data2	ratio
0	0.795552	1.063400	0.748121
1	1.516393	1.453561	1.043226
2	-1.043458	-0.210488	4.957335
3	1.112729	1.536009	0.724429
4	0.302984	-1.075604	-0.281687

df = pd.DataFrame([range(3),[0,np.nan,0],[0,0,np.nan],range(3)])
df

	0	1	2
0	0	1.0	2.0
1	0	NaN	0.0
2	0	0.0	NaN
3	0	1.0	2.0

# Use isnull() to flag missing values
df.isnull()

	0	1	2
0	False	False	False
1	False	True	False
2	False	False	True
3	False	False	False

# Check whether each column contains any missing value
df.isnull().any()

0    False
1     True
2     True
dtype: bool

# Check along the other axis: does each row contain a missing value?
df.isnull().any(axis=1)

0    False
1     True
2     True
3    False
dtype: bool

# Fill missing values
df.fillna(5)

	0	1	2
0	0	1.0	2.0
1	0	5.0	0.0
2	0	0.0	5.0
3	0	1.0	2.0
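
Filling with a constant is only one option. As a rough sketch on a made-up frame, missing values can also be counted per column, filled with each column's mean, or dropped row by row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, 4.0]})

# Count missing values per column.
missing_per_column = frame.isnull().sum()

# Fill each column with its own mean (NaN is skipped when computing the mean).
filled = frame.fillna(frame.mean())

# Alternatively, drop every row that contains at least one NaN.
complete_rows = frame.dropna()

print(filled)
```

Which strategy is appropriate depends on the data. Mean filling keeps all rows but shrinks the variance, while `dropna` can discard a large share of a sparse table.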

3.4 Custom Functions with apply

data = pd.DataFrame({'food':['A1','A2','B1','B2','B3','C1','C2'],
                    'data':[1,2,3,4,5,6,7]})
data

	food	data
0	A1	1
1	A2	2
2	B1	3
3	B2	4
4	B3	5
5	C1	6
6	C2	7

def food_map(series):
    if series['food'] == 'A1':
        return 'A'
    elif series['food'] == 'A2':
        return 'A'
    elif series['food'] == 'B1':
        return 'B'
    elif series['food'] == 'B2':
        return 'B'
    elif series['food'] == 'B3':
        return 'B'
    elif series['food'] == 'C1':
        return 'C'
    elif series['food'] == 'C2':
        return 'C' 
data['food_map'] = data.apply(food_map,axis = 'columns')
data

	food	data	food_map
0	A1	1	A
1	A2	2	A
2	B1	3	B
3	B2	4	B
4	B3	5	B
5	C1	6	C
6	C2	7	C
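
For a plain value-to-value lookup like this one, a dictionary passed to `Series.map` is a shorter alternative to the `if`/`elif` chain. The sketch below rebuilds the frame under the assumed name `foods`. The second option relies on the observation that, in this particular data, the category is the first character of the food name.

```python
import pandas as pd

foods = pd.DataFrame(
    {"food": ["A1", "A2", "B1", "B2", "B3", "C1", "C2"], "data": [1, 2, 3, 4, 5, 6, 7]}
)

# Option 1: explicit lookup table; unmapped values would become NaN.
lookup = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "B3": "B", "C1": "C", "C2": "C"}
foods["food_map"] = foods["food"].map(lookup)

# Option 2: here the category happens to be the first character of the name.
foods["first_char"] = foods["food"].str[0]

print(foods)
```

`apply` with `axis='columns'` remains the more general tool when the result depends on several columns of the same row.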

3.5 Time Series Operations

# Use the time column as the index when reading the data
data = pd.read_csv('./data/flowdata.csv',index_col = 0,parse_dates = True)
data.head()

	L06_347	LS06_347	LS06_348
Time
2009-01-01 00:00:00	0.137417	0.097500	0.016833
2009-01-01 03:00:00	0.131250	0.088833	0.016417
2009-01-01 06:00:00	0.113500	0.091250	0.016750
2009-01-01 09:00:00	0.135750	0.091500	0.016250
2009-01-01 12:00:00	0.140917	0.096167	0.017000

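# Select the data for one year (recent pandas versions no longer accept data['2013'] for row selection, so use .loc)
data.loc['2013']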

	L06_347	LS06_347	LS06_348
Time
2013-01-01 00:00:00	1.688333	1.688333	0.207333
2013-01-01 03:00:00	2.693333	2.693333	0.201500
2013-01-01 06:00:00	2.220833	2.220833	0.166917
2013-01-01 09:00:00	2.055000	2.055000	0.175667
2013-01-01 12:00:00	1.710000	1.710000	0.129583
2013-01-01 15:00:00	1.420000	1.420000	0.096333
2013-01-01 18:00:00	1.178583	1.178583	0.083083
2013-01-01 21:00:00	0.898250	0.898250	0.077167
2013-01-02 00:00:00	0.860000	0.860000	0.075000
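
The same kind of selection can be tried without `flowdata.csv`. The sketch below builds a small made-up daily series, with the assumed names `flow` and `level`, and selects rows by year, by date range, and by month.

```python
import pandas as pd

# Hypothetical daily readings that straddle a year boundary.
index = pd.date_range("2012-12-30", periods=5, freq="D")
flow = pd.DataFrame({"level": [0.1, 0.2, 0.3, 0.4, 0.5]}, index=index)

year_2013 = flow.loc["2013"]                      # every row in 2013
first_days = flow.loc["2013-01-01":"2013-01-02"]  # both endpoints are included

# Boolean masks on index attributes work as well.
december = flow[flow.index.month == 12]

print(year_2013)
```

Selecting by a partial date string only works when the index is a `DatetimeIndex`, which is what `parse_dates = True` provides when reading the file.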

3.6 Plotting

# Simple plotting with pandas
%matplotlib inline
df = pd.DataFrame(np.random.randn(10,4).cumsum(0),index = np.arange(0,100,10),
                  columns = ['A','B','C','D'])
df.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1dbf30fb860>

import matplotlib.pyplot as plt
# Create a figure with two rows and one column of subplots
fig,axes = plt.subplots(2,1)
data = pd.Series(np.random.rand(16),index = list('abcdefghijklmnop'))
data.plot(ax = axes[0],kind = 'bar')
data.plot(ax = axes[1],kind = 'barh')

<matplotlib.axes._subplots.AxesSubplot at 0x1dbf37e0080>


df = pd.DataFrame(np.random.rand(6,4),index = ['one','two','three','four','five','six'],
                 columns = pd.Index(['A','B','C','D'], name = 'Genus'))
df

Genus	A	B	C	D
one	0.825736	0.816818	0.836805	0.288769
two	0.568115	0.108279	0.188345	0.343175
three	0.669199	0.137701	0.567066	0.813652
four	0.961713	0.971082	0.319790	0.780224
five	0.196340	0.901948	0.684793	0.644339
six	0.249157	0.321956	0.110594	0.574358

df.plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x1dbf35d4cf8>

data = pd.read_csv('./data/macrodata.csv')
data.head()

	year	quarter	realgdp	realcons	realinv	realgovt	realdpi	cpi	m1	tbilrate	unemp	pop	infl	realint
0	1959.0	1.0	2710.349	1707.4	286.898	470.045	1886.9	28.98	139.7	2.82	5.8	177.146	0.00	0.00
1	1959.0	2.0	2778.801	1733.7	310.859	481.301	1919.7	29.15	141.7	3.08	5.1	177.830	2.34	0.74
2	1959.0	3.0	2775.488	1751.8	289.226	491.260	1916.4	29.35	140.5	3.82	5.3	178.657	2.74	1.09
3	1959.0	4.0	2785.204	1753.7	299.356	484.052	1931.3	29.37	140.0	4.33	5.6	179.386	0.27	4.06
4	1960.0	1.0	2847.699	1770.5	331.722	462.199	1955.5	29.54	139.6	3.50	5.2	180.007	2.31	1.19

data.plot.scatter('quarter','realgdp')

<matplotlib.axes._subplots.AxesSubplot at 0x1dbf39df198>

4. Tips for Large Datasets

4.1 Numeric Type Conversion

gl = pd.read_csv('./data/game_logs.csv')
gl.head()

D:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3049: DtypeWarning: Columns (12,13,14,15,19,20,81,83,85,87,93,94,95,96,97,98,99,100,105,106,108,109,111,112,114,115,117,118,120,121,123,124,126,127,129,130,132,133,135,136,138,139,141,142,144,145,147,148,150,151,153,154,156,157,160) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

	date	day_of_week	v_name	v_league	v_game_number	h_name	h_league	h_game_number	v_score	...	h_player_7_name	h_player_7_def_pos	h_player_8_id	h_player_8_name	h_player_8_def_pos	h_player_9_id	h_player_9_name	h_player_9_def_pos	additional_info	acquisition_info
0	18710504	Thu	CL1	na	1	FW1	na	1	0	...	Ed Mincher	7.0	mcdej101	James McDermott	8.0	kellb105	Bill Kelly	9.0	NaN	Y
1	18710505	Fri	BS1	na	1	WS3	na	1	20	...	Asa Brainard	1.0	burrh101	Henry Burroughs	9.0	berth101	Henry Berthrong	8.0	HTBF	Y
2	18710506	Sat	CL1	na	2	RC1	na	1	12	...	Pony Sager	6.0	birdg101	George Bird	7.0	stirg101	Gat Stires	9.0	NaN	Y
3	18710508	Mon	CL1	na	3	CH1	na	1	12	...	Ed Duffy	6.0	pinke101	Ed Pinkham	5.0	zettg101	George Zettlein	1.0	NaN	Y
4	18710509	Tue	BS1	na	2	TRO	na	1	9	...	Steve Bellan	5.0	pikel101	Lip Pike	3.0	cravb101	Bill Craver	6.0	HTBF	Y

5 rows × 161 columns

gl.shape

(171907, 161)

# Downcasting numeric types can reduce memory usage
gl.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB
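
A minimal sketch of what such a conversion could look like, using a small made-up frame named `logs` in place of `game_logs.csv`. It downcasts an integer column with `pd.to_numeric` and converts a repetitive string column to the `category` dtype. The column names are borrowed from the table above, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a large log table.
logs = pd.DataFrame(
    {
        "score": np.arange(1000, dtype="int64") % 30,
        "day_of_week": ["Thu", "Fri", "Sat", "Mon"] * 250,
    }
)
before = logs.memory_usage(deep=True).sum()

# Downcast integers to the smallest type that can hold the values.
logs["score"] = pd.to_numeric(logs["score"], downcast="integer")

# Repeated strings are stored once each when the column is categorical.
logs["day_of_week"] = logs["day_of_week"].astype("category")

after = logs.memory_usage(deep=True).sum()
print(logs.dtypes)
print(before, after)
```

The savings depend on the data: downcasting helps when the values fit a narrower type, and `category` helps when a column has few distinct values relative to its length.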


Shylock's Blog
