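pandas in Depth: Data Manipulation

Three phases

Data preparation
Data transformation
Data aggregation

Loading
Assembling
- Merging - pandas.merge()
- Concatenating - pandas.concat()
- Combining - pandas.DataFrame.combine_first()
Reshaping (pivoting)
Removing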

Merging

Example 1:

import numpy as np
import pandas as pd
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],'price':[12.33,11.44,33.21,13.23,33.62]})
frame2 = pd.DataFrame({'id':['pencil','ball','pencil','pen'],'color':['white','red','red','black']})
pd.merge(frame1,frame2)

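It is usually best to state the merge criterion explicitly with the on option.

Example 2:

pd.merge(frame1, frame2, on='id')  # on= requires the key column to have the same name in both frames

# When the key columns are named differently, pass left_on and right_on instead
frame2.columns = ['sid', 'color']  # frame2's key column is now called 'sid'
pd.merge(frame1, frame2, left_on='id', right_on='sid')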
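
The difference between on and left_on/right_on is easier to see on a concrete result. The following is a minimal sketch; the frame names left and right are illustrative, and the data mirrors the frames from Example 1 with the second key column named sid.

```python
import pandas as pd

left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'price': [12.33, 11.44, 33.21, 13.23, 33.62]})
right = pd.DataFrame({'sid': ['pencil', 'ball', 'pencil', 'pen'],
                      'color': ['white', 'red', 'red', 'black']})

# The key columns have different names, so each side is named explicitly.
# The default how='inner' keeps only keys that occur in both frames.
merged = pd.merge(left, right, left_on='id', right_on='sid')
print(merged)
```

'pencil' occurs twice in right, so it yields two rows in the result, while 'mug' and 'ashtray' have no partner and are dropped.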

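Concatenation

NumPy's concatenate() function performs this kind of concatenation on arrays.

array1 = np.arange(9).reshape((3, 3))
array2 = np.arange(9).reshape((3, 3)) + 6
np.concatenate([array1, array2], axis=1)  # axis=1 joins side by side (more columns); axis=0 stacks vertically (more rows)

The pandas concat() function does the same for Series and DataFrame objects.

ser1 = pd.Series(np.random.rand(4), index=[1, 2, 3, 4])
ser2 = pd.Series(np.random.rand(4), index=[5, 6, 7, 8])
pd.concat([ser1, ser2])
# axis=0 (the default) appends rows; axis=1 places the objects side by side as columns
# join='inner' keeps only the shared index labels; join='outer' (the default) keeps all of them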
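
The effect of the axis and join options can be checked on two small Series with partly overlapping indexes. This is a sketch with made-up values; s1 and s2 are illustrative names.

```python
import pandas as pd

s1 = pd.Series([0.1, 0.2, 0.3], index=[1, 2, 3])
s2 = pd.Series([0.4, 0.5, 0.6], index=[3, 4, 5])

stacked = pd.concat([s1, s2])                        # axis=0: one longer Series
side_by_side = pd.concat([s1, s2], axis=1)           # axis=1: a DataFrame, outer join on the index
common = pd.concat([s1, s2], axis=1, join='inner')   # only index labels present in both

print(stacked)
print(side_by_side)
print(common)
```

With the outer join, labels that exist in only one Series are filled with NaN in the other column; with the inner join only label 3 survives.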

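Combining

Series objects: combine_first()

Combining fills the gaps in one object with values from another and aligns the data on the index at the same time.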

ser1=pd.Series(np.random.rand(5),index=[1,2,3,4,5])
ser2=pd.Series(np.random.rand(4),index=[2,4,5,6])
ser1.combine_first(ser2)
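
With random values the alignment is hard to follow, so here is a small sketch with fixed numbers (the names a and b are illustrative). Values from the calling Series take priority; the other Series only fills missing entries and supplies labels the caller lacks.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0], index=[1, 2, 3])
b = pd.Series([20.0, 30.0, 40.0], index=[2, 3, 4])

combined = a.combine_first(b)
print(combined)
# label 1 -> 1.0   (only in a)
# label 2 -> 20.0  (NaN in a, filled from b)
# label 3 -> 3.0   (a wins over b's 30.0)
# label 4 -> 40.0  (only in b)
```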

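Pivoting

Pivoting rearranges the data so that column labels become row labels, or the reverse.

Two operations:

stacking: rotates the columns into rows
unstacking: rotates the rows into columns

frame1 = pd.DataFrame(np.arange(9).reshape(3, 3), index=['w', 'b', 'r'], columns=['ball', 'pen', 'pencil'])
ser = frame1.stack()  # returns a Series with a two-level index
ser.unstack()  # returns a DataFrame again

# Converting long format to wide format: DataFrame.pivot
# (longframe is assumed to be a long-format DataFrame with 'color' and 'item' columns plus a value column)
wideframe = longframe.pivot(index='color', columns='item')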
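
The notes use longframe without defining it. The sketch below builds one from a stacked frame so the whole round trip can be run; the column name value and the frame name wide are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.arange(9).reshape(3, 3),
                    index=['white', 'black', 'red'],
                    columns=['ball', 'pen', 'pencil'])

long_series = wide.stack()           # Series indexed by (row label, column label)
round_trip = long_series.unstack()   # back to a wide DataFrame

# Long format: one row per (color, item) observation.
longframe = long_series.rename_axis(['color', 'item']).reset_index(name='value')

# pivot() spreads the 'item' values into columns again.
wideframe = longframe.pivot(index='color', columns='item', values='value')
print(longframe.head())
print(wideframe)
```

pivot() takes keyword arguments here because recent pandas releases no longer accept index and columns positionally.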

Removing a column

del frame1['ball']

Removing unwanted rows

frame1.drop('w')  # returns a new DataFrame without the row labelled 'w'

Removing duplicates

The DataFrame duplicated() method detects duplicate rows and returns a Boolean Series.

dframe.duplicated()
# Keep only the duplicated rows
dframe[dframe.duplicated()]
# Drop the duplicated rows
dframe.drop_duplicates()
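
dframe is not defined in these notes, so here is a small self-contained sketch with invented data. A row counts as a duplicate only if an identical row appeared earlier, so the first occurrence is always kept.

```python
import pandas as pd

dframe = pd.DataFrame({'color': ['white', 'white', 'red', 'red', 'white'],
                       'value': [2, 1, 3, 3, 2]})

mask = dframe.duplicated()               # True where a row repeats an earlier row
print(dframe[mask])                      # rows 3 and 4
unique_rows = dframe.drop_duplicates()   # rows 0, 1 and 2 remain
print(unique_rows)
```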

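Mapping

A dict is a convenient way to define a mapping.

replace(): replaces values
map(): creates a new column
rename(): replaces index labels

### Replacing
newcolors = {'rosso': 'red', 'verde': 'green'}
frame.replace(newcolors)

ser.replace(np.nan, 0)

### Adding values through a mapping
price = {'ball': 5.56, 'mug': 4.3}
frame['price'] = frame['item'].map(price)

### Renaming axis indexes
reindex = {0: 'first', 2: 'second'}
frame.rename(reindex)
frame.rename(index={1: 'first'}, columns={'item': 'object'})
# inplace option: whether to modify the calling object itself instead of returning a new one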
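
The three mapping functions can be tried together on one small frame. This is a sketch with invented data; note that rename() is the method that relabels an index, while replace() only touches values.

```python
import pandas as pd

frame = pd.DataFrame({'item': ['ball', 'mug', 'pen'],
                      'color': ['rosso', 'verde', 'white']})

# replace(): swap values according to a dict
frame = frame.replace({'rosso': 'red', 'verde': 'green'})

# map(): derive a new column by looking each item up in a dict
prices = {'ball': 5.56, 'mug': 4.3}
frame['price'] = frame['item'].map(prices)   # 'pen' has no entry, so it becomes NaN

# rename(): relabel index and columns; returns a new frame unless inplace=True
renamed = frame.rename(index={0: 'first', 1: 'second'}, columns={'item': 'object'})
print(renamed)
```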

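Discretization

result = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]
bins = [0, 25, 50, 75, 100]
# Apply the cut function to result
cat = pd.cut(result, bins)
>>> type(cat)
<class 'pandas.core.arrays.categorical.Categorical'>
# The return value is a Categorical object
cat.categories  # the bins (called levels in very old pandas releases)
cat.codes  # bin number of each element (formerly labels)
# Count the occurrences in each bin
cat.value_counts()
# cut also takes a labels option with one label per bin, e.g. labels=['a', 'b', 'c', 'd']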
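
Here is a runnable sketch of the binning step using the current attribute names (categories and codes). The label strings are illustrative, and the list assumes the sixth and seventh scores are 90 and 99.

```python
import pandas as pd

results = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]
bins = [0, 25, 50, 75, 100]

cat = pd.cut(results, bins)
print(cat.categories)   # the four intervals (0, 25], (25, 50], (50, 75], (75, 100]
print(cat.codes)        # position of each element's bin

counts = cat.value_counts()   # occurrences per bin, in bin order

# Four bins need exactly four labels.
labelled = pd.cut(results, bins,
                  labels=['unlikely', 'less likely', 'likely', 'highly likely'])
print(counts)
print(labelled)
```

Passing a label list whose length differs from the number of bins raises an error, which is why four labels are given for the five bin edges.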

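Detecting and filtering outliers

randframe = pd.DataFrame(np.random.randn(1000, 3))

The describe() function shows the descriptive statistics of each column.

Suppose we treat any element whose absolute value is more than three times its column's standard deviation as an outlier. The std() function gives the standard deviation of each column.

randframe.std()

Filter the DataFrame object accordingly:

randframe[(np.abs(randframe) > (3 * randframe.std())).any(axis=1)]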
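
The one-line filter is easier to read when split into steps. The sketch below does that with a seeded generator so repeated runs give the same rows; the intermediate names threshold and is_outlier are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)              # fixed seed so the run is repeatable
randframe = pd.DataFrame(rng.standard_normal((1000, 3)))

threshold = 3 * randframe.std()             # one threshold per column
is_outlier = randframe.abs() > threshold    # compared column by column
outlier_rows = randframe[is_outlier.any(axis=1)]   # rows with at least one outlier

print(randframe.describe())
print(outlier_rows)
```

any(axis=1) reduces each row of the Boolean frame to a single True or False, which then selects whole rows.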

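Permutation (random reordering)

nframe = pd.DataFrame(np.arange(25).reshape(5, 5))
# permutation(5) returns the integers 0 to 4 in random order
new_order = np.random.permutation(5)
nframe.take(new_order)

Random sampling

The np.random.randint() function draws random row positions, possibly with repeats.
sample = np.random.randint(0, len(nframe), size=3)
nframe.take(sample)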
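
The same two ideas can be written with NumPy's newer Generator interface, which replaces the legacy np.random.permutation and np.random.randint calls used in the notes. The sketch also shows DataFrame.sample as a built-in alternative when repeated rows are not wanted. Seeds and variable names are arbitrary.

```python
import numpy as np
import pandas as pd

nframe = pd.DataFrame(np.arange(25).reshape(5, 5))
rng = np.random.default_rng(42)

# Permutation: every row exactly once, in random order.
shuffled = nframe.take(rng.permutation(len(nframe)))

# Sampling with replacement: positions may repeat.
picks = rng.integers(0, len(nframe), size=3)
sampled = nframe.take(picks)

# Sampling without replacement using the pandas method.
no_repeats = nframe.sample(n=3, random_state=42)

print(shuffled)
print(sampled)
print(no_repeats)
```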

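String manipulation

Built-in string methods

split(): splits a string

test = '12312,bob'
test.split(',')
# ['12312', 'bob']

strip(): trims whitespace

tokens = [s.strip() for s in test.split(',')]

join(): concatenates strings

strings = ['1', '2', '3', '45', '5']
','.join(strings)

in, index(), find(): searching for a substring

test.index('bob')  # index() raises ValueError if the substring is missing
test.find('bottom')  # find() returns -1 if the substring is missing
'bottom' in test

count(): number of occurrences

test.count('bottom')

replace(): returns a copy with one substring replaced by another

test.replace('A', 'a')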
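
These methods are usually chained when cleaning a delimited record. A short sketch with an invented record string follows; it also shows how find() and index() differ when the substring is missing.

```python
record = ' 12312 , bob ,  bottom shelf '

# Split on commas, then trim the whitespace around each piece.
fields = [part.strip() for part in record.split(',')]
rejoined = ';'.join(fields)

position = record.find('bottom')   # index of the first match, or -1 if absent
has_top = 'top' in record          # membership test returns a bool

try:
    record.index('top')            # index() raises instead of returning -1
except ValueError:
    index_failed = True
else:
    index_failed = False

print(fields, rejoined, position, has_top, index_failed)
```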

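Regular expressions

import re

Regex operations fall into three categories:

Pattern matching
Substitution
Splitting

re.split()

text = "This is        an \t odd \n text!"
re.split(r'\s+', text)

# Internally the pattern is compiled first; this is the equivalent two-step form
regex = re.compile(r'\s+')
regex.split(text)

re.findall()

# Substrings that start with 'A' or 'a' followed by one or more word characters
text = 'A! This is my address: 16 Boltom Avenue, Boston'
re.findall(r'[Aa]\w+', text)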
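
The notes show splitting and findall() but not the other two categories. The sketch below adds one example each of pattern matching with groups and of substitution, plus a findall() variant that uses a word boundary and the IGNORECASE flag rather than a character class. The patterns are illustrative choices.

```python
import re

text = 'A! This is my address: 16 Boltom Avenue, Boston'

# \b anchors the match at the start of a word; IGNORECASE covers 'A' and 'a'.
starts_with_a = re.findall(r'\ba\w+', text, flags=re.IGNORECASE)

# Pattern matching with capture groups: a number followed by two words.
match = re.search(r'(\d+)\s+(\w+ \w+)', text)

# Substitution: collapse every run of whitespace into a single space.
tidy = re.sub(r'\s+', ' ', 'This is        an \t odd \n text!')

print(starts_with_a)    # ['address', 'Avenue']
print(match.groups())   # ('16', 'Boltom Avenue')
print(tidy)             # This is an odd text!
```

The lone 'A' before the exclamation mark is not matched because the pattern requires at least one more word character after the first letter.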

GroupBy

SPLIT-APPLY-COMBINE: three phases

Splitting the data into groups
Applying a function to each group
Combining the results

# In practice only the groupby() function is needed
frame = pd.DataFrame({'color':['white','red','green','red','green'],'obj':['pen','pencil','pencil','ashtray','pen'],'price1':[5.56,4.20,1.3,0.56,2.75],'price2':[4.75,4.12,1.60,0.75,3.15]})

>>> frame
   color      obj  price1  price2
0  white      pen    5.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.60
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15


# Group by color and compute the mean of price1
group = frame['price1'].groupby(frame['color'])
# The result is a GroupBy object
group.groups  # shows how the rows were grouped
group.mean()  # mean of each group
group.sum()  # sum of each group

Hierarchical grouping

ggroup = frame['price1'].groupby([frame['color'], frame['obj']])

frame[['price1', 'price2']].groupby(frame['color']).mean()
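
Grouping on two keys yields a result with a two-level index. Here is a sketch using the same data, written with column names passed to groupby() instead of Series objects; both spellings group the same way.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'obj': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                      'price1': [5.56, 4.20, 1.3, 0.56, 2.75],
                      'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

# Two keys: the result is indexed by (color, obj) pairs.
by_color_obj = frame.groupby(['color', 'obj'])['price1'].sum()

# One key, two value columns: one row of means per color.
means = frame.groupby('color')[['price1', 'price2']].mean()

print(by_color_obj)
print(means)
```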

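Group iteration

for name, group in frame.groupby('color'):
    print(name)
    print(group)

Functions on groups

group = frame.groupby('color')
group['price1'].quantile(0.6)  # compute a quantile directly

# A custom aggregation function (named value_range so it does not shadow the built-in range)
def value_range(series):
    return series.max() - series.min()
group['price1'].agg(value_range)

group[['price1', 'price2']].agg(value_range)  # numeric columns only; 'obj' holds strings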
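
A custom function can be passed to agg() alongside built-in aggregations. The sketch below uses the same data; the function name value_range is an illustrative choice that avoids shadowing Python's built-in range.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'obj': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                      'price1': [5.56, 4.20, 1.3, 0.56, 2.75],
                      'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

def value_range(series):
    """Spread between the largest and smallest value in a group."""
    return series.max() - series.min()

group = frame.groupby('color')
q60 = group['price1'].quantile(0.6)
ranges = group[['price1', 'price2']].agg(value_range)          # numeric columns only
summary = group['price1'].agg(['mean', 'std', value_range])    # several functions at once

print(q60)
print(ranges)
print(summary)
```

The string column obj is left out of the aggregation because subtracting strings is not defined. In summary, the column produced by the custom function is named after the function.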
