美國大選資料分析
- 導入資料及相關庫
- 資料清洗
-
- 填充缺失值
- 資料轉換
- 面元化資料(分桶)
- 資料聚合與分組運算
- 時間處理
-
- str轉datetime(datetime為時間序列資料類型)
- 重新采樣和頻率轉換
導入資料及相關庫
#import相關的庫
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('datasets/fec/P00000001-ALL.csv')
df.head()
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 250.0 20-JUN-11 NaN NaN NaN SA17A 736166
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 50.0 23-JUN-11 NaN NaN NaN SA17A 736166
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e+08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11 NaN NaN NaN SA17A 749073
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e+08 NONE RETIRED 250.0 01-AUG-11 NaN NaN NaN SA17A 749073
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e+08 NONE RETIRED 300.0 20-JUN-11 NaN NaN NaN SA17A 736166
取幾個典型特征分析
df1 = df[["cand_nm","contbr_nm","contbr_st","contbr_employer","contbr_occupation","contb_receipt_amt","contb_receipt_dt"]]
df1.head()
cand_nm contbr_nm contbr_st contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt
0 Bachmann, Michelle HARVEY, WILLIAM AL RETIRED RETIRED 250.0 20-JUN-11
1 Bachmann, Michelle HARVEY, WILLIAM AL RETIRED RETIRED 50.0 23-JUN-11
2 Bachmann, Michelle SMITH, LANIER AL INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11
3 Bachmann, Michelle BLEVINS, DARONDA AR NONE RETIRED 250.0 01-AUG-11
4 Bachmann, Michelle WARDENBURG, HAROLD AR NONE RETIRED 300.0 20-JUN-11
簡單總覽資料情況
#檢視資料框形狀
df1.shape
#檢視資料的資訊,包括每個字段的名稱、非空數量、字段的資料類型
df1.info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cand_nm 1001731 non-null object
1 contbr_nm 1001731 non-null object
2 contbr_st 1001727 non-null object
3 contbr_employer 988002 non-null object
4 contbr_occupation 993301 non-null object
5 contb_receipt_amt 1001731 non-null float64
6 contb_receipt_dt 1001731 non-null object
dtypes: float64(1), object(6)
各字段含義
cand_nm – 接受捐贈的候選人姓名
contbr_nm – 捐贈人姓名
contbr_st – 捐贈人所在州
contbr_employer – 捐贈人所在公司
contbr_occupation – 捐贈人職業
contb_receipt_amt – 捐贈數額(美元)
contb_receipt_dt – 收到捐款的日期
#對資料進行統計性分析
df1.describe
資料清洗
填充缺失值
#從data.info()得知,contbr_employer、contbr_occupation均有少量缺失值,均填充為NOT PROVIDED
data['contbr_employer'].fillna('NOT PROVIDED',inplace=True)
data['contbr_occupation'].fillna('NOT PROVIDED',inplace=True)
資料轉換
針對"cand_nm 候選人姓名"這一條特征,我們檢視他的所有名單(pd.Series.value_counts()),
并用字典标明是屬于共和黨還是民主黨
df1["cand_nm"].value_counts() #顯示這一特征下不同類别的數量
Obama, Barack 593746
Paul, Ron 143757
Romney, Mitt 107229
Gingrich, Newt 47679
Santorum, Rick 46559
Cain, Herman 20107
Perry, Rick 13575
Bachmann, Michelle 13140
Roemer, Charles E. 'Buddy' III 5920
Pawlenty, Timothy 4555
Huntsman, Jon 4156
Johnson, Gary Earl 1234
McCotter, Thaddeus G 74
Name: cand_nm, dtype: int64
df['cand_nm'].unique() 也可以檢視這一列的不同值,傳回值為一維數組
#建立字典标明對應黨派
parties = {'Bachmann, Michelle': 'Republican',
'Cain, Herman': 'Republican',
'Gingrich, Newt': 'Republican',
'Huntsman, Jon': 'Republican',
'Johnson, Gary Earl': 'Republican',
'McCotter, Thaddeus G': 'Republican',
'Obama, Barack': 'Democrat',
'Paul, Ron': 'Republican',
'Pawlenty, Timothy': 'Republican',
'Perry, Rick': 'Republican',
"Roemer, Charles E. 'Buddy' III": 'Republican',
'Romney, Mitt': 'Republican',
'Santorum, Rick': 'Republican'}
#通過pd.Series.map()函數添加一列存儲黨派資訊
df1["party"] = df1["cand_nm"].map(parties)
df1.groupby('party').sum() #檢視不同黨派貢獻金額
contb_receipt_amt
party
Democrat 1.335026e+08
Republican 1.652488e+08
#檢視兩黨得票數
df1['party'].value_counts()
Democrat 593746
Republican 407985
Name: party, dtype: int64
發現Republican(共和黨)接受的贊助總金額更高,Democrat(民主黨)獲得的贊助次數更多一些
#為了簡化分析,我們将範圍控制在contb_receipt_amt>0中
df2= df1[df1["contb_receipt_amt"]>0]
由于奧巴馬和羅姆尼為兩個主要候選人,是以準備一個僅對他們有貢獻的子集
df_lmab = df2[df2['cand_nm'].isin(['Obama, Barack','Romney, Mitt'])]
排序:按照職業彙總對贊助總金額進行排序
根據職業分析捐贈是一個常見的統計分析:律師傾向于捐更多的錢給民主黨,商務人士更偏向與共和黨
#DataFrame.sort_values(by, ascending=True, inplace=False)
by是根據哪一列進行排序,可以傳入多列;ascending=True是升序排序,False為降序;inplace=Ture則是修改原dataframe,預設為False
df2.groupby('contbr_occupation')['contb_receipt_amt'].sum().sort_values(ascending=False)[:20]
利用函數進行資料轉換:職業與雇主資訊分析
利用了dict.get它允許沒有映射關系的職業也能“通過”)
#建立一個職業對應字典,把相同職業的不同表達映射為對應的職業,比如把C.E.O.映射為CEO
occupation_map = {
'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
'INFORMATION REQUESTED':'NOT PROVIDED',
'SELF' : 'SELF-EMPLOYED',
'SELF EMPLOYED' : 'SELF-EMPLOYED',
'C.E.O.':'CEO',
'LAWYER':'ATTORNEY',
}
# 如果不在字典中,傳回x
f = lambda x: occupation_map.get(x, x)
df2.contbr_occupation = df2.contbr_occupation.map(f)
對雇主的捐獻進行同樣處理
emp_mapping = {
'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
'INFORMATION REQUESTED' : 'NOT PROVIDED',
'SELF' : 'SELF-EMPLOYED',
'SELF EMPLOYED' : 'SELF-EMPLOYED',
}
# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
df2.contbr_employer = df2.contbr_employer.map(f)
面元化資料(分桶)
利用cut函數根據出資額大小将資料離散化到多個面元(桶)中
把之前篩選的奧巴馬和洛爾尼的子集中的捐贈金額進行處理
bins = np.array([0,1,10,100,1000,10000,100000,1000000,10000000])
labels = pd.cut(df_lmab ['contb_receipt_amt'],bins)
資料聚合與分組運算
Groupby即分組運算,其過程可概括為“split-apply-combine”(拆分-應用-合并)。即分組後對各部分進行運算
拆分的對象為pandas對象(Series、DataFrame等);拆分的依據是分組鍵,可以是清單、數組(長度與待分組的軸一樣)、字典、Series、函數、DataFrame列名
![](https://img.laitimes.com/img/_0nNw4CM6IyYiwiM6ICdiwiIwczX0xiRGZkRGZ0Xy9GbvNGL2EzXlpXazxSP9EVY0w2VZFDZywEMW1mY1RzRapnTtxkb5ckYplTeMZTTINGMShUYfRHelRHLwEzX39GZhh2css2RkBnVHFmb1clWvB3MaVnRtp1XlBXe0xyayFWbyVGdhd3LcV2Zh1Wa9M3clN2byBXLzN3btg3Pn5GcugTMxEjM1QTMzAjMxAjMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
透視表(pivot_table)分析黨派和職業
#按照黨派、職業對贊助金額進行彙總,類似excel中的透視表操作,聚合函數為sum
by_occupation = data.pivot_table('contb_receipt_amt',index='contbr_occupation',columns='party',aggfunc='sum')
#過濾掉贊助金額小于200W的資料
over_2mm = by_occupation[by_occupation.sum(1)>2000000]
資料聚合(aggregate)
資料聚合,即任何能從數組産生标量值的資料轉換過程。如mean、count、min、sum等,此外可以自定義聚合函數,或是已經定義好的任何方法。Groupby方法後的聚合,是在分組對象上調用聚合方法,再進行彙總。
分組級運算及轉換(transform和apply)
#來了解一下對Obama和Romney總出資最高的職業和雇主
def groupby_again(group,key, n = 2):
totals = group.groupby(key).sum()
return totals.sort_values( by ='contb_receipt_amt',ascending=False)[:n]
groupbyed = df_lmab.groupby('cand_nm')
groupbyed.apply(groupby_again,'contbr_occupation',n=7)
#同樣的,使用get_top_amounts()對雇主進行分析處理
grouped.apply(groupby_again,'contbr_employer',n=10)
#來了解一下對兩位候選人資助頻次最高的人
def paixu(group,n=2):
top_counts = group["contbr_nm"].value_counts()
top_final = top_counts.sort_values(ascending = False)[:n]
return top_final
groupeed = df_lmab.groupby("cand_nm")
groupeed.apply(paixu,n=5)
對贊助金額進行分組統計
首先統計各出資區間的贊助筆數,這裡用到unstack(),stack()函數是堆疊,unstack()函數就是不要堆疊,即把多層索引變為表格資料但是資料類型仍為DataFrame
df_lmab.groupby(["cand_nm",labels]).size()
#輸出
cand_nm contb_receipt_amt
Obama, Barack (0, 1] 493
(1, 10] 40070
(10, 100] 372280
(100, 1000] 153991
(1000, 10000] 22284
(10000, 100000] 2
(100000, 1000000] 3
(1000000, 10000000] 4
Romney, Mitt (0, 1] 77
(1, 10] 3681
(10, 100] 31853
(100, 1000] 43357
(1000, 10000] 26186
(10000, 100000] 1
(100000, 1000000] 0
(1000000, 10000000] 0
dtype: int64
df_lmab.groupby(["cand_nm",labels]).size().unstack(0)
#輸出
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 493 77
(1, 10] 40070 3681
(10, 100] 372280 31853
(100, 1000] 153991 43357
(1000, 10000] 22284 26186
(10000, 100000] 2 1
(100000, 1000000] 3 0
(1000000, 10000000] 4 0
統計各區間的贊助金額
grouped_bins = df_lmab.groupby(["cand_nm",labels])
bucket_sums = grouped_bins['contb_receipt_amt'].sum().unstack(0)
bucket_sums
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 318.24 77.00
(1, 10] 337267.62 29819.66
(10, 100] 20288981.41 1987783.76
(100, 1000] 54798531.46 22363381.69
(1000, 10000] 51753705.67 63942145.42
(10000, 100000] 59100.00 12700.00
(100000, 1000000] 1490683.08 NaN
(1000000, 10000000] 7148839.76 NaN
對贊助金額進行可視化
增加百分比是效果更明顯
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1),axis=0)
normed_sums[:-2].plot(kind = "barh)
時間處理
str轉datetime(datetime為時間序列資料類型)
df2['time'] = pd.to_datetime(df2['contb_receipt_dt'])
重新設定索引為time
daf2.set_index('time',inplace=True)
重新采樣和頻率轉換
pandas對象都擁有resample方法,該方法是所有頻率轉換的工具函數。resample擁有類似groupby的API可以調用resample對資料分組,之後再調用聚合函數
vs_time = df2.groupby('cand_nm').resample('M')['cand_nm'].count()
vs_time.unstack(0) #把層索引轉變為表格資料
參考連結
2012美國大選資料集