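A 10-Minute Introduction to Pandas
Pandas is one of the most important Python modules for data analysis. This article is based on "10-minute tour of pandas" by Wes McKinney, the author of Pandas.
First, install Pandas and two related packages, numpy and matplotlib: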
pip install pandas
pip install numpy
pip install matplotlib
Import pandas, numpy, and matplotlib:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Object Creation
A Series is a one-dimensional labelled sequence. Create a Series with the default integer index:
>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
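The dtype above is float64 because np.nan is a floating-point value, so a list that mixes integers with np.nan is stored as floats. The following is a minimal sketch of that behaviour; the names `missing`, `n_missing`, and `n_valid` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as above: the np.nan entry forces a floating-point dtype.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

missing = s.isna()              # boolean Series, True where the value is NaN
n_missing = int(missing.sum())  # number of missing entries
n_valid = int(s.count())        # count() skips NaN values

print(s.dtype, n_missing, n_valid)
```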
A DataFrame is a table with multiple columns, each identified by a label. Like a Series, a DataFrame also has an index:
>>> dates = pd.date_range('20170101', periods=6)
>>> dates
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>> df.shape
(6, 4)
>>> df
A B C D
2017-01-01 0.147072 1.235226 0.143952 0.831411
2017-01-02 0.862293 -0.725103 -0.104664 1.265863
2017-01-03 0.281511 0.956868 -0.741193 0.129071
2017-01-04 -0.664475 0.965653 1.522392 1.129707
2017-01-05 -1.364532 -0.167877 0.078448 0.217550
2017-01-06 0.717721 0.344734 -0.951364 0.362032
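Because np.random.randn draws new numbers on every run, your values will differ from those printed here. A minimal sketch, assuming you want repeatable numbers, is to seed a NumPy generator first. The seed 0 is an arbitrary choice and not part of the original tour.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, used only for repeatability
dates = pd.date_range('20170101', periods=6)

# 6 rows (one per date) and 4 columns labelled A to D.
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

print(df.shape)  # (6, 4)
```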
Create a DataFrame from a dict of objects; each value in the dict is converted to a Series:
>>> df2 = pd.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20170102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'E': pd.Categorical(["test", "train", "test", "train"]),
...                     'F': 'foo'})
>>> df2
A B C D E F
0 1.0 2017-01-02 1.0 3 test foo
1 1.0 2017-01-02 1.0 3 train foo
2 1.0 2017-01-02 1.0 3 test foo
3 1.0 2017-01-02 1.0 3 train foo
Check the dtype of each column:
>>> df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
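Note that scalar values such as 1. and 'foo' are repeated to match the length of the array-like values in the dict. The sketch below illustrates this with a reduced version of the dict; the frame name `small` is hypothetical.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'A': 1.,                                # scalar, repeated for every row
    'D': np.array([3] * 4, dtype='int32'),  # length-4 array sets the row count
    'F': 'foo',                             # scalar string, also repeated
})

print(small.shape)
print(small['D'].dtype)
```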
View the values of a single column:
>>> df2.C
0 1.0
1 1.0
2 1.0
3 1.0
Name: C, dtype: float32
Viewing Data
Use head() to view the first few rows of a DataFrame and tail() to view the last few:
>>> df.head(3)
A B C D
2017-01-01 0.147072 1.235226 0.143952 0.831411
2017-01-02 0.862293 -0.725103 -0.104664 1.265863
2017-01-03 0.281511 0.956868 -0.741193 0.129071
>>> df.tail(3)
A B C D
2017-01-04 -0.664475 0.965653 1.522392 1.129707
2017-01-05 -1.364532 -0.167877 0.078448 0.217550
2017-01-06 0.717721 0.344734 -0.951364 0.362032
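When called without an argument, head() and tail() return 5 rows. A small sketch with a hypothetical 10-row frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': np.arange(10)})  # hypothetical 10-row frame

first = frame.head()   # no argument: the first 5 rows
last = frame.tail(2)   # the last 2 rows

print(len(first), list(last['x']))
```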
Under the hood, a DataFrame stores its data as numpy arrays. You can also inspect the index, columns, and values separately:
>>> df.index
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
>>> df.values
array([[ 0.14707226, 1.23522557, 0.14395236, 0.83141137],
[ 0.86229302, -0.72510256, -0.10466379, 1.26586314],
[ 0.28151127, 0.95686785, -0.74119266, 0.12907115],
[-0.66447533, 0.96565318, 1.52239163, 1.12970702],
[-1.36453175, -0.16787707, 0.07844812, 0.21755034],
[ 0.71772123, 0.34473429, -0.95136372, 0.36203183]])
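In pandas 0.24 and later, to_numpy() is available as an alternative to the .values attribute. A minimal sketch with a small hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

arr = frame.to_numpy()      # 2-D numpy array holding the frame's values
print(type(arr).__name__)   # ndarray
print(arr.shape)            # (2, 2)
```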
describe() gives a quick statistical summary of each numeric column:
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.003402 0.434917 -0.008738 0.655939
std 0.855916 0.763118 0.872870 0.486500
min -1.364532 -0.725103 -0.951364 0.129071
25% -0.461588 -0.039724 -0.582060 0.253671
50% 0.214292 0.650801 -0.013108 0.596722
75% 0.608669 0.963457 0.127576 1.055133
max 0.862293 1.235226 1.522392 1.265863
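The result of describe() is itself a DataFrame indexed by statistic name, so individual statistics can be read with loc. A sketch on a hypothetical single-column frame:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})
summary = frame.describe()

# Rows of the summary are labelled by statistic name.
count_a = summary.loc['count', 'A']
mean_a = summary.loc['mean', 'A']
median_a = summary.loc['50%', 'A']   # the 50% row is the median

print(count_a, mean_a, median_a)
```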
Transpose a DataFrame:
>>> df.T
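Transposing swaps the row labels and the column labels. A small sketch with a hypothetical 3x2 frame makes the effect visible:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

t = frame.T   # rows become columns and columns become rows

print(frame.shape, t.shape)
print(list(t.index), list(t.columns))
```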
Sorting a DataFrame
(1) Use sort_index to sort by index
The ascending parameter defaults to True.
axis=0 sorts by the row labels (the index); axis=1 sorts by the column labels:
>>> df.sort_index(axis=1, ascending=False)
(2) Use sort_values to sort by the values of a column
>>> df.sort_values(by='B', ascending=False)
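Both sorting methods return a new DataFrame and leave the original untouched. The sketch below shows this on a small hypothetical frame whose rows and columns start out of order:

```python
import pandas as pd

frame = pd.DataFrame({'B': [2, 3, 1], 'A': [9, 7, 8]}, index=['r2', 'r0', 'r1'])

by_rows = frame.sort_index()                           # axis=0: order by row labels
by_cols = frame.sort_index(axis=1)                     # axis=1: order by column labels
by_value = frame.sort_values(by='B', ascending=False)  # largest B first

print(list(by_rows.index), list(by_cols.columns), list(by_value['B']))
```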
Selection
Rows and Columns
Select a single column:
>>> df['A']
>>> df.A
Slicing: use [] with a slice to select specific rows
>>> df[0:3]
A B C D
2017-01-01 0.147072 1.235226 0.143952 0.831411
2017-01-02 0.862293 -0.725103 -0.104664 1.265863
2017-01-03 0.281511 0.956868 -0.741193 0.129071
Selection by Label
Select a row by its label (here dates[0] is Timestamp('2017-01-01 00:00:00')):
>>> df.loc[dates[0]]
A 0.147072
B 1.235226
C 0.143952
D 0.831411
Select multiple labels; a slice such as 'A':'B' means everything from A through B, with both endpoints included:
>>> df.loc[:,['A','B']]
A B
2017-01-01 0.147072 1.235226
2017-01-02 0.862293 -0.725103
2017-01-03 0.281511 0.956868
2017-01-04 -0.664475 0.965653
2017-01-05 -1.364532 -0.167877
2017-01-06 0.717721 0.344734
>>> df.loc['20170102':'20170104',['A','B']]
A B
2017-01-02 0.862293 -0.725103
2017-01-03 0.281511 0.956868
2017-01-04 -0.664475 0.965653
>>> df.loc['20170102',['A','B']]
A 0.862293
B -0.725103
Name: 2017-01-02 00:00:00, dtype: float64
>>> df.at[dates[0],'A']
0.14707225966646126
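With loc, label slices include both endpoints on the row axis and on the column axis. The following sketch uses a hypothetical frame filled with the integers 0 to 23 so the selected values are easy to check:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slices include both endpoints: 3 rows, columns A through C.
sub = frame.loc['20170102':'20170104', 'A':'C']

# at reads a single scalar by row label and column label.
value = frame.at[dates[0], 'A']

print(sub.shape, value)
```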
Selection by Position
Select all elements of the fourth row (position 3):
>>> df.iloc[3]
A -0.664475
B 0.965653
C 1.522392
D 1.129707
Select rows 3 to 4 and columns 0 to 1 (zero-based positions):
>>> df.iloc[3:5,0:2]
A B
2017-01-04 -0.664475 0.965653
2017-01-05 -1.364532 -0.167877
Select a single element:
>>> df.iloc[1,1]
>>> df.iat[1,1]
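Unlike loc, iloc follows Python slicing rules, so the end position is excluded. A sketch on a hypothetical frame filled with the integers 0 to 23:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

block = frame.iloc[3:5, 0:2]   # rows 3 and 4, columns 0 and 1 (end excluded)
single = frame.iat[1, 1]       # scalar at row 1, column 1

print(block.shape, single)
```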
Comparison Operations
>>> df[df.A > 0]
A B C D
2017-01-01 0.147072 1.235226 0.143952 0.831411
2017-01-02 0.862293 -0.725103 -0.104664 1.265863
2017-01-03 0.281511 0.956868 -0.741193 0.129071
2017-01-06 0.717721 0.344734 -0.951364 0.362032
Select all elements greater than 0; positions that do not satisfy the condition are filled with NaN:
>>> df[df > 0]
A B C D
2017-01-01 0.147072 1.235226 0.143952 0.831411
2017-01-02 0.862293 NaN NaN 1.265863
2017-01-03 0.281511 0.956868 NaN 0.129071
2017-01-04 NaN 0.965653 1.522392 1.129707
2017-01-05 NaN NaN 0.078448 0.217550
2017-01-06 0.717721 0.344734 NaN 0.362032
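Several conditions can be combined with & (and) or | (or), provided each comparison is wrapped in parentheses. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [-1, 2, 3, -4]})

# Row filter: keep rows where both A and B are positive.
both_pos = frame[(frame['A'] > 0) & (frame['B'] > 0)]

# Element-wise mask: non-positive entries become NaN.
masked = frame[frame > 0]

print(list(both_pos.index), int(masked.isna().sum().sum()))
```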
isin(): test whether each value belongs to a given set
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one','two','three','four','three']
>>> df2
A B C D E
2017-01-01 0.147072 1.235226 0.143952 0.831411 one
2017-01-02 0.862293 -0.725103 -0.104664 1.265863 one
2017-01-03 0.281511 0.956868 -0.741193 0.129071 two
2017-01-04 -0.664475 0.965653 1.522392 1.129707 three
2017-01-05 -1.364532 -0.167877 0.078448 0.217550 four
2017-01-06 0.717721 0.344734 -0.951364 0.362032 three
>>> df2[df2['E'].isin(['two','four'])]
A B C D E
2017-01-03 0.281511 0.956868 -0.741193 0.129071 two
2017-01-05 -1.364532 -0.167877 0.078448 0.217550 four
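The boolean mask returned by isin() can be inverted with ~ to select the rows that are not in the set. A sketch using the same labels as column E above, in a hypothetical one-column frame:

```python
import pandas as pd

frame = pd.DataFrame({'E': ['one', 'one', 'two', 'three', 'four', 'three']})

mask = frame['E'].isin(['two', 'four'])  # True where E is 'two' or 'four'
selected = frame[mask]
rest = frame[~mask]                      # ~ inverts the mask

print(list(selected.index), len(rest))
```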
Setting
Add a new column to a DataFrame; the values are aligned by index:
>>> s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20170102', periods=6))
>>> s1
2017-01-02 1
2017-01-03 2
2017-01-04 3
2017-01-05 4
2017-01-06 5
2017-01-07 6
Freq: D, dtype: int64
>>> df['F'] = s1
>>> df
A B C D F
2017-01-01 0.147072 1.235226 0.143952 0.831411 NaN
2017-01-02 0.862293 -0.725103 -0.104664 1.265863 1.0
2017-01-03 0.281511 0.956868 -0.741193 0.129071 2.0
2017-01-04 -0.664475 0.965653 1.522392 1.129707 3.0
2017-01-05 -1.364532 -0.167877 0.078448 0.217550 4.0
2017-01-06 0.717721 0.344734 -0.951364 0.362032 5.0
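The NaN in the first row and the missing 2017-01-07 value both follow from index alignment: only labels present in the DataFrame's index are kept, and labels with no match in the Series become NaN. A reduced sketch with a hypothetical 3-row frame:

```python
import pandas as pd

dates = pd.date_range('20170101', periods=3)
frame = pd.DataFrame({'A': [10, 20, 30]}, index=dates)

# The Series starts one day later than the frame's index.
s1 = pd.Series([1, 2, 3], index=pd.date_range('20170102', periods=3))

frame['F'] = s1   # aligned on the index; 2017-01-04 is dropped

print(frame)
```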
Setting by label
>>> df.at[dates[0],'A'] = 0
>>> df['A']
2017-01-01 0.000000
2017-01-02 0.862293
2017-01-03 0.281511
2017-01-04 -0.664475
2017-01-05 -1.364532
2017-01-06 0.717721
Setting by position
>>> df.iat[0,1] = 0
Setting with a numpy array
>>> df.loc[:,'D'] = np.array([5] * len(df))
>>> df.D
2017-01-01 5
2017-01-02 5
2017-01-03 5
2017-01-04 5
2017-01-05 5
2017-01-06 5
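The three setting styles can be combined on one frame. The sketch below uses a hypothetical all-float frame so that no assignment changes a column's dtype:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.zeros((3, 2)), index=list('xyz'), columns=['A', 'B'])

frame.loc[:, 'B'] = np.array([5.0] * len(frame))  # whole column from an array
frame.at['x', 'A'] = 1.0                          # one cell, by label
frame.iat[1, 1] = 2.0                             # one cell, by position

print(frame)
```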
Setting with a boolean condition
>>> df2 = df.copy()
>>> df2[df2 > 0] = -df2
>>> df2
A B C D F
2017-01-01 0.000000 0.000000 -0.143952 -5 NaN
2017-01-02 -0.862293 -0.725103 -0.104664 -5 -1.0
2017-01-03 -0.281511 -0.956868 -0.741193 -5 -2.0
2017-01-04 -0.664475 -0.965653 -1.522392 -5 -3.0
2017-01-05 -1.364532 -0.167877 -0.078448 -5 -4.0
2017-01-06 -0.717721 -0.344734 -0.951364 -5 -5.0
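The conditional assignment above flips the sign of every positive entry and leaves zeros, negatives, and NaN unchanged. As a sketch, the same result can be obtained without in-place assignment by using DataFrame.where, which keeps values where the condition holds and takes the replacement elsewhere. The frame and the names `flipped` and `alt` are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [0.0, 1.5, -2.0], 'B': [3.0, -4.0, np.nan]})

# In-place style, as in the text above.
flipped = frame.copy()
flipped[flipped > 0] = -flipped

# Equivalent expression: keep entries that are <= 0, negate the others.
alt = frame.where(frame <= 0, -frame)

print(flipped)
```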