天天看點

dataframe groupby_利用Python進行資料分析GroupBy資料的分組與聚合操作資料聚合通用split-apply-combine資料透視表和交叉表總結

資料的分組與聚合操作

在資料分析工作流中,一個重要的工作是對資料進行分類,并在每一組上應用一個聚合函數或轉換函數。在經曆載入、合并、準備資料集之後,可能需要計算分組統計或者制作資料透視表用于報告或者可視化的目的。

pandas

提供了一個非常靈活的groupby接口,來對資料集進行切片、切塊和總結。本章主要内容如下:

  • 使用一個或多個鍵将pandas對象拆分成多塊
  • 計算組彙總統計資訊
  • 應用組内變換或其他操作
  • 計算資料透視表和交叉表
  • 執行分位數分析和其他統計組分析
import pandas as pdimport numpy as np
           

GroupBy

groupby包含三個步驟:split-apply-combine。首先是将資料按照特定軸向分離,進行分組;然後将函數應用到各個組中;最後将結果聯合起來。

dataframe groupby_利用Python進行資料分析GroupBy資料的分組與聚合操作資料聚合通用split-apply-combine資料透視表和交叉表總結
# 我們先建立一個DataFrame。對其進行groupby操作df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],                   'key2': ['one', 'two', 'one', 'two', 'one'],                   'data1': np.random.randn(5),                   'data2': np.random.randn(5)})df
           
key1 key2 data1 data2
a one 0.209170 -0.331921
1 a two 0.162572 0.322581
2 b one 0.145062 1.242235
3 b two -1.705477 -0.563494
4 a one 0.309842 -1.541020
# 根據key1标簽計算data1列的均值grouped = df['data1'].groupby(df['key1'])grouped
           
# grouped變量是一個GroupBy對象,可以對其進行操作# 比如計算均值grouped.mean()   # 得到的就是data1根據key1分組的均值
           
key1
a    0.227195
b   -0.780207
Name: data1, dtype: float64
           
# 當然我們可以使用清單傳入Series分組鍵進行分組means = df['data1'].groupby([df['key1'], df['key2']]).mean() # means含有多層索引means
           
key1  key2
a     one     0.259506
      two     0.162572
b     one     0.145062
      two    -1.705477
Name: data1, dtype: float64
           
# 當然,分組的鍵也可以是正确長度的任何數組states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])years = np.array([2005, 2005, 2006, 2005, 2006])df['data1'].groupby([states, years]).mean()
           
California  2005    0.162572
            2006    0.145062
Ohio        2005   -0.748154
            2006    0.309842
Name: data1, dtype: float64
           
# 分組資訊如果在DataFrame中,則可以直接使用列名作為分組鍵df.groupby('key1').mean()  # 因為df['key2']不是數值資料,是以并不會計算均值
           
data1 data2
key1
a 0.227195 -0.516787
b -0.780207 0.339371
# 使用size檢視組大小資訊,注意缺失值将排除在外df.groupby(['key1', 'key2']).size()
           
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
           

周遊各分組

GroupBy對象支援疊代,會先生成一個包含組名和資料塊的2維元組序列

for name, group in df.groupby('key1'):    print(name)    print(group)
           
a
  key1 key2     data1     data2
0    a  one  0.209170 -0.331921
1    a  two  0.162572  0.322581
4    a  one  0.309842 -1.541020
b
  key1 key2     data1     data2
2    b  one  0.145062  1.242235
3    b  two -1.705477 -0.563494
           
# 在有多個分組鍵的情況下,元組中的第一個元素是鍵值的元組for k, group in df.groupby(['key1', 'key2']):    print(k)    print(group)
           
('a', 'one')
  key1 key2     data1     data2
0    a  one  0.209170 -0.331921
4    a  one  0.309842 -1.541020
('a', 'two')
  key1 key2     data1     data2
1    a  two  0.162572  0.322581
('b', 'one')
  key1 key2     data1     data2
2    b  one  0.145062  1.242235
('b', 'two')
  key1 key2     data1     data2
3    b  two -1.705477 -0.563494
           
# 選擇任意一塊資料進行操作pieces = dict(list(df.groupby('key1')))pieces['b']
           
key1 key2 data1 data2
2 b one 0.145062 1.242235
3 b two -1.705477 -0.563494

預設情況下,groupby在axis=0的軸向上進行分組,但是也可以標明其他軸向。

# 我們可以根據資料類型進行分組df.dtypes
           
key1      object
key2      object
data1    float64
data2    float64
dtype: object
           
grouped = df.groupby(df.dtypes, axis=1)for dtype, group in grouped:    print(dtype)    print(group)
           
float64
      data1     data2
0  0.209170 -0.331921
1  0.162572  0.322581
2  0.145062  1.242235
3 -1.705477 -0.563494
4  0.309842 -1.541020
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
           

選擇一列或者所有列的子集

将從DataFrame中建立的GroupBy對象用列名稱或者列名稱數組進行索引時,相當于擷取用于聚合的列子集。也就是說:

  • df.groupby('key1')['data1'] 相當于df['data1'].groupby(df['key1'])
  • df.groupby('key1')[['data1']]相當于df[['data1']].groupby(df['key1'])

對于大型資料集,可能隻需要聚合少部分列

# 計算data2列的均值并獲得DataFramedf.groupby('key1')[['data1']].mean()
           
data1
key1
a 0.227195
b -0.780207
# 如果隻有單個列名進行傳遞,則傳回的是Seriess_grouped = df.groupby(['key1', 'key2'])['data2']s_grouped # 可以看到獲得的是分組的Series
           
s_grouped.mean()
           
key1  key2
a     one    -0.936470
      two     0.322581
b     one     1.242235
      two    -0.563494
Name: data2, dtype: float64
           

使用字典和Series分組

分組資訊可能會以非數組形式存在

people = pd.DataFrame(np.random.randn(5, 5),                       columns=['a', 'b', 'c', 'd', 'e'],                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])people.iloc[2:3, [1, 2]] = np.nanpeople
           
a b c d e
Joe 1.221724 0.110267 1.285643 -2.290401 0.409033
Steve 0.847775 -0.455833 0.725666 -0.694706 -1.162236
Wes -0.504510 NaN NaN 1.069262 1.364316
Jim -2.089744 -0.940379 -0.737187 0.189542 -1.010504
Travis -1.310530 -0.341813 -1.730868 0.289435 0.053035
# 'f'未用于分組mapping = {'a': 'red', 'b':'blue', 'c':'blue', 'd':'blue', 'e': 'red', 'f': 'orange'}by_column = people.groupby(mapping, axis=1)by_column.sum()
           
blue red
Joe -0.894491 1.630757
Steve -0.424873 -0.314461
Wes 1.069262 0.859806
Jim -1.488024 -3.100248
Travis -1.783247 -1.257495
# 使用Series分組map_series = pd.Series(mapping)people.groupby(map_series, axis=1).count()
           
blue red
Joe 3 2
Steve 3 2
Wes 1 2
Jim 3 2
Travis 3 2

使用函數分組

作為分組鍵傳遞的函數将會按照每個索引值調用一次,同時傳回值用作分組名稱。

# 根據名字長度分組people.groupby(len).sum()
           
a b c d e
3 -1.372529 -0.830111 0.548455 -1.031598 0.762845
5 0.847775 -0.455833 0.725666 -0.694706 -1.162236
6 -1.310530 -0.341813 -1.730868 0.289435 0.053035

可以将函數與數組、字典或Series進行混合,所有的對象在内部都轉換為數組。

key_list = ['one', 'one', 'one', 'two', 'two']for key, group in people.groupby([len, key_list]):    print(key)    print(group)
           
(3, 'one')
            a         b         c         d         e
Joe  1.221724  0.110267  1.285643 -2.290401  0.409033
Wes -0.504510       NaN       NaN  1.069262  1.364316
(3, 'two')
            a         b         c         d         e
Jim -2.089744 -0.940379 -0.737187  0.189542 -1.010504
(5, 'one')
              a         b         c         d         e
Steve  0.847775 -0.455833  0.725666 -0.694706 -1.162236
(6, 'two')
              a         b         c         d         e
Travis -1.31053 -0.341813 -1.730868  0.289435  0.053035
           

根據索引層級分組

在分層索引的某個層級上可以進行聚合

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],                                     [1, 3, 5, 1, 3]],                                     names=['cty', 'tenor'])hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)hier_df
           
cty US JP
tenor 1 3 5 1 3
2.303857 1.048559 -0.188793 -0.178907 -0.334412
1 -0.221124 0.218428 -0.435082 0.538684 -0.740732
2 0.187450 0.325777 -0.892818 -0.897597 0.855642
3 -0.006315 0.537607 0.855795 -0.135583 -0.635938
# 根據層級索引時,将層級數值或層級名稱傳遞給level關鍵字hier_df.groupby(level='cty', axis=1).count()
           
cty JP US
2 3
1 2 3
2 2 3
3 2 3

資料聚合

資料聚合就是對Series或DataFrame所有列使用aggregate和所需函數,或者調用mean、std等方法 groupby優化方法:

函數名 描述
count 分組中的非NA數量
sum 非NA值的累和
mean 非NA值的均值
median 非NA值的算數中位數

逐列及多函數應用

# 使用上篇文章使用的tips資料集tips = pd.read_csv(r'tip.csv')tips['tip_pct'] = tips['tip'] / (tips['total_bill'])tips.head()
           
total_bill tip smoker day time size tip_pct
16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808
# 根據day和smoker對tips進行分組grouped = tips.groupby(['day', 'smoker'])# 獲得描述性統計(不使用describe方法)# 将函數名以字元串形式傳遞grouped_pct = grouped['tip_pct']grouped_pct.agg('mean') # 得到的是Series
           
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
           
# 若傳遞的是函數或者函數名的清單,則會獲得一個列名為函數名的DataFrame# 首先定義一個函數,獲得極差def peak_to_peak(arr):    return arr.max() - arr.min()grouped_pct.agg(['mean', 'std', peak_to_peak]) # 隻有groupby内置的方法才可以直接傳字元串,否則報錯
           
mean std peak_to_peak
day smoker
Fri No 0.151650 0.028123 0.067349
Yes 0.174783 0.051293 0.159925
Sat No 0.158048 0.039767 0.235193
Yes 0.147906 0.061375 0.290095
Sun No 0.160113 0.042347 0.193226
Yes 0.187250 0.154134 0.644685
Thur No 0.160298 0.038774 0.193350
Yes 0.163863 0.039389 0.151240
# 如果傳遞的是(name, function)的元組,那麼DataFrame的列名則是元組的第一個元素grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
           
foo bar
day smoker
Fri No 0.151650 0.028123
Yes 0.174783 0.051293
Sat No 0.158048 0.039767
Yes 0.147906 0.061375
Sun No 0.160113 0.042347
Yes 0.187250 0.154134
Thur No 0.160298 0.038774
Yes 0.163863 0.039389

在DataFrame中,可以指定應用到所有列上的函數清單或者每一列上應用不同函數

functions = ['count', 'mean', 'max']result = grouped['tip_pct', 'total_bill'].agg(functions)result
           
tip_pct total_bill
count mean max count mean max
day smoker
Fri No 4 0.151650 0.187735 4 18.420000 22.75
Yes 15 0.174783 0.263480 15 16.813333 40.17
Sat No 45 0.158048 0.291990 45 19.661778 48.33
Yes 42 0.147906 0.325733 42 21.276667 50.81
Sun No 57 0.160113 0.252672 57 20.506667 48.17
Yes 19 0.187250 0.710345 19 24.120000 45.35
Thur No 45 0.160298 0.266312 45 17.113111 41.19
Yes 17 0.163863 0.241255 17 19.190588 43.11

産生的DataFrame擁有分層列

result['tip_pct']
           
count mean max
day smoker
Fri No 4 0.151650 0.187735
Yes 15 0.174783 0.263480
Sat No 45 0.158048 0.291990
Yes 42 0.147906 0.325733
Sun No 57 0.160113 0.252672
Yes 19 0.187250 0.710345
Thur No 45 0.160298 0.266312
Yes 17 0.163863 0.241255

和以前一樣,可以傳遞具有自定義名稱的元組清單

ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]grouped['tip_pct', 'total_bill'].agg(ftuples)
           
tip_pct total_bill
Durchschnitt Abweichung Durchschnitt Abweichung
day smoker
Fri No 0.151650 0.000791 18.420000 25.596333
Yes 0.174783 0.002631 16.813333 82.562438
Sat No 0.158048 0.001581 19.661778 79.908965
Yes 0.147906 0.003767 21.276667 101.387535
Sun No 0.160113 0.001793 20.506667 66.099980
Yes 0.187250 0.023757 24.120000 109.046044
Thur No 0.160298 0.001503 17.113111 59.625081
Yes 0.163863 0.001551 19.190588 69.808518

使用含有列名與函數對應關系的字典傳遞給agg,使列執行不同的函數操作;當多個函數應用于至少一個列時,會産生多層列

grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'],             'size': 'sum'})
           
tip_pct size
min max mean std sum
day smoker
Fri No 0.120385 0.187735 0.151650 0.028123 9
Yes 0.103555 0.263480 0.174783 0.051293 31
Sat No 0.056797 0.291990 0.158048 0.039767 115
Yes 0.035638 0.325733 0.147906 0.061375 104
Sun No 0.059447 0.252672 0.160113 0.042347 167
Yes 0.065660 0.710345 0.187250 0.154134 49
Thur No 0.072961 0.266312 0.160298 0.038774 112
Yes 0.090014 0.241255 0.163863 0.039389 40

傳回不含行索引的聚合資料

# 使用as_index=False來禁用分組鍵作為索引tips.groupby(['day','smoker'], as_index=False).mean()# 當然,使用reset_index也是一樣的tips.groupby(['day','smoker']).mean().reset_index()
           
day smoker total_bill tip size tip_pct
Fri No 18.420000 2.812500 2.250000 0.151650
1 Fri Yes 16.813333 2.714000 2.066667 0.174783
2 Sat No 19.661778 3.102889 2.555556 0.158048
3 Sat Yes 21.276667 2.875476 2.476190 0.147906
4 Sun No 20.506667 3.167895 2.929825 0.160113
5 Sun Yes 24.120000 3.516842 2.578947 0.187250
6 Thur No 17.113111 2.673778 2.488889 0.160298
7 Thur Yes 19.190588 3.030000 2.352941 0.163863

通用split-apply-combine

  • apply方法

使用之前的tip資料集,選出tip_pct最高的五個組。首先,可以寫一個在特定列中選出最大值所在行的函數

def  top(df, n=5, column='tip_pct'):    return tips.sort_values(by=column)[-n:]
           
top(tips, n=6)
           
total_bill tip smoker day time size tip_pct
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
# 按照smoker, day分組# top函數在每一行分組上被調用,之後使用pandas.concat将結果粘貼,分組名為标簽# top函數的參數可以放在函數後進行傳遞tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
           
total_bill tip smoker day time size tip_pct
smoker day
No Fri 170 50.81 10.0 Yes Sat Dinner 3 0.196812
Sat 170 50.81 10.0 Yes Sat Dinner 3 0.196812
Sun 170 50.81 10.0 Yes Sat Dinner 3 0.196812
Thur 170 50.81 10.0 Yes Sat Dinner 3 0.196812
Yes Fri 170 50.81 10.0 Yes Sat Dinner 3 0.196812
Sat 170 50.81 10.0 Yes Sat Dinner 3 0.196812
Sun 170 50.81 10.0 Yes Sat Dinner 3 0.196812
Thur 170 50.81 10.0 Yes Sat Dinner 3 0.196812

壓縮分組鍵

groupby

傳遞groupby_key=False ,不展示分組索引

tips.groupby('smoker').apply(top)
           
total_bill tip smoker day time size tip_pct
smoker
No 183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
Yes 183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
tips.groupby('smoker', as_index=False).apply(top)
           
total_bill tip smoker day time size tip_pct
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
1 183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
tips.groupby('smoker', group_keys=False).apply(top)
           
total_bill tip smoker day time size tip_pct
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
tips.groupby(['day','smoker']).mean()
           
total_bill tip size tip_pct
day smoker
Fri No 18.420000 2.812500 2.250000 0.151650
Yes 16.813333 2.714000 2.066667 0.174783
Sat No 19.661778 3.102889 2.555556 0.158048
Yes 21.276667 2.875476 2.476190 0.147906
Sun No 20.506667 3.167895 2.929825 0.160113
Yes 24.120000 3.516842 2.578947 0.187250
Thur No 17.113111 2.673778 2.488889 0.160298
Yes 19.190588 3.030000 2.352941 0.163863
tips.groupby(['day','smoker'], group_keys=False).mean()
           
total_bill tip size tip_pct
day smoker
Fri No 18.420000 2.812500 2.250000 0.151650
Yes 16.813333 2.714000 2.066667 0.174783
Sat No 19.661778 3.102889 2.555556 0.158048
Yes 21.276667 2.875476 2.476190 0.147906
Sun No 20.506667 3.167895 2.929825 0.160113
Yes 24.120000 3.516842 2.578947 0.187250
Thur No 17.113111 2.673778 2.488889 0.160298
Yes 19.190588 3.030000 2.352941 0.163863

注意:

as_index

:對于聚合輸出,傳回以組标簽作為索引的對象。僅與DataFrame輸入相關。as_index = False實際上是“SQL風格”的分組輸出。

group_keys

:調用apply時,選擇是否将組鍵添加到索引

分位數與桶分析

使用

groupby

方法與

cut, qcut

配合使用,對資料更友善地進行分桶或分位分析

frame = pd.DataFrame({'data1': np.random.randn(1000),                      'data2': np.random.randn(1000)})quantiles = pd.cut(frame.data1, 4)quantiles[:10]
           
0     (-1.138, 0.242]
1      (0.242, 1.622]
2      (0.242, 1.622]
3     (-1.138, 0.242]
4     (-1.138, 0.242]
5    (-2.524, -1.138]
6      (0.242, 1.622]
7     (-1.138, 0.242]
8     (-1.138, 0.242]
9     (-1.138, 0.242]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-2.524, -1.138] < (-1.138, 0.242] < (0.242, 1.622] < (1.622, 3.002]]
           
def get_stats(group):    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}grouped = frame.data2.groupby(quantiles)grouped.apply(get_stats)
           
data1                  
(-2.524, -1.138]  min       -2.243129
                  max        2.431713
                  count    116.000000
                  mean       0.111999
(-1.138, 0.242]   min       -2.734511
                  max        2.722345
                  count    467.000000
                  mean       0.028268
(0.242, 1.622]    min       -2.654807
                  max        2.826075
                  count    356.000000
                  mean       0.075527
(1.622, 3.002]    min       -2.873482
                  max        2.594543
                  count     61.000000
                  mean      -0.007187
Name: data2, dtype: float64
           
# 上面獲得的是等長桶,我們可以使用qcut方法獲得等大小的桶grouping = pd.qcut(frame.data1, 10, labels=False)grouped = frame.data2.groupby(grouping)grouped.apply(get_stats).unstack()
           
min max count mean
data1
-2.243129 2.431713 100.0 0.102224
1 -2.296063 2.376008 100.0 -0.015589
2 -1.947194 2.722345 100.0 0.070937
3 -1.980102 2.534790 100.0 0.107431
4 -2.734511 2.335623 100.0 -0.025969
5 -2.424619 2.345450 100.0 0.098001
6 -2.654807 2.398495 100.0 0.026486
7 -2.076728 2.826075 100.0 0.038720
8 -2.483275 2.118598 100.0 0.056305
9 -2.873482 2.594543 100.0 0.067875

使用指定分組值填充缺失值

s = pd.Series(np.random.randn(6))s[::2] = np.nans.fillna(s.mean())
           
0    0.057708
1   -0.045778
2    0.057708
3    0.047644
4    0.057708
5    0.171259
dtype: float64
           
states = ['Ohio', 'New Year', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']group_key = ['East'] * 4 + ['West'] * 4data = pd.Series(np.random.randn(8), index=states)data
           
Ohio          0.350673
New Year     -0.126434
Vermont      -0.299054
Florida      -0.175114
Oregon        0.667130
Nevada        1.023922
California    0.840702
Idaho        -0.662078
dtype: float64
           
data[['Vermont', 'Nevada', 'Idaho']] = np.nandata
           
Ohio          0.350673
New Year     -0.126434
Vermont            NaN
Florida      -0.175114
Oregon        0.667130
Nevada             NaN
California    0.840702
Idaho              NaN
dtype: float64
           
data.groupby(group_key).mean()
           
East    0.016375
West    0.753916
dtype: float64
           
fill_mean = lambda g: g.fillna(g.mean())data.groupby(group_key).apply(fill_mean)
           
Ohio          0.350673
New Year     -0.126434
Vermont       0.016375
Florida      -0.175114
Oregon        0.667130
Nevada        0.753916
California    0.840702
Idaho         0.753916
dtype: float64
           
# 為每個分組預定義填充值fill_values = {'East': 0.5, 'West': -1}fill_func = lambda g: g.fillna(fill_values[g.name])data.groupby(group_key).apply(fill_func)
           
Ohio          0.350673
New Year     -0.126434
Vermont       0.500000
Florida      -0.175114
Oregon        0.667130
Nevada       -1.000000
California    0.840702
Idaho        -1.000000
dtype: float64
           

示例:随機采樣與排列

使用

Series

的sample方法

# 構造一副英式撲克牌# 紅桃。黑桃,梅花,方塊suits = ['H', 'S', 'C', 'D']card_val = (list(range(1, 11)) + [10] * 3) * 4base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']cards = []for suit in ['H', 'S', 'C', 'D']:    cards.extend(str(num) + suit for num in base_names)deck = pd.Series(card_val, index=cards)
           
deck[:13]
           
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64
           
# 從其中随機拿出五張牌deck.sample(5)
           
10D    10
7H      7
AH      1
6S      6
4D      4
dtype: int64
           
# 為了後續友善使用,我們定義一個随機抽牌的函數def draw(deck, n=5):    return deck.sample(n)# 從每個花色中随機抽取兩張牌,花色是牌名的最後一個字元get_suit = lambda card: card[-1]deck.groupby(get_suit).apply(draw, 2)
           
C  QC     10
   KC     10
D  10D    10
   JD     10
H  2H      2
   QH     10
S  7S      7
   AS      1
dtype: int64
           

示例:分組權重平均

df = pd.DataFrame({'category': ['a', 'a', 'a', 'b', 'b', 'b'],                   'data': np.random.randn(6),                   'weights': np.random.rand(6)})df
           
category data weights
a 0.262284 0.552361
1 a 1.141605 0.295554
2 a 0.219825 0.030885
3 b 1.523184 0.917011
4 b -0.304085 0.136345
5 b -1.890389 0.219738
grouped = df.groupby('category')get_wavg = lambda g: np.average(g['data'], weights=g['weights'])grouped.apply(get_wavg)
           
category
a    0.556522
b    0.738300
dtype: float64
           

資料透視表和交叉表

資料透視表:根據一個或多個鍵聚合一張表的資料,将資料在矩形格式中排列。

pandas

中出了groupby工具以及分層索引的重塑操作外,還有

pivot_table

方法,可以實作資料透視表的功能,

pivot_table

還可以添加部分總計。

# 使用index設定分組tips.pivot_table(index=['day', 'smoker'])
           
size tip tip_pct total_bill
day smoker
Fri No 2.250000 2.812500 0.151650 18.420000
Yes 2.066667 2.714000 0.174783 16.813333
Sat No 2.555556 3.102889 0.158048 19.661778
Yes 2.476190 2.875476 0.147906 21.276667
Sun No 2.929825 3.167895 0.160113 20.506667
Yes 2.578947 3.516842 0.187250 24.120000
Thur No 2.488889 2.673778 0.160298 17.113111
Yes 2.352941 3.030000 0.163863 19.190588
# 使用groupby呈現grouped = tips.groupby(['day', 'smoker'])for key, group in grouped:    print(key)    print(group)
           
('Fri', 'No')
     total_bill   tip smoker  day    time  size   tip_pct
91        22.49  3.50     No  Fri  Dinner     2  0.155625
94        22.75  3.25     No  Fri  Dinner     2  0.142857
99        12.46  1.50     No  Fri  Dinner     2  0.120385
223       15.98  3.00     No  Fri   Lunch     3  0.187735
('Fri', 'Yes')
     total_bill   tip smoker  day    time  size   tip_pct
90        28.97  3.00    Yes  Fri  Dinner     2  0.103555
92         5.75  1.00    Yes  Fri  Dinner     2  0.173913
93        16.32  4.30    Yes  Fri  Dinner     2  0.263480
95        40.17  4.73    Yes  Fri  Dinner     4  0.117750
96        27.28  4.00    Yes  Fri  Dinner     2  0.146628
97        12.03  1.50    Yes  Fri  Dinner     2  0.124688
98        21.01  3.00    Yes  Fri  Dinner     2  0.142789
100       11.35  2.50    Yes  Fri  Dinner     2  0.220264
101       15.38  3.00    Yes  Fri  Dinner     2  0.195059
220       12.16  2.20    Yes  Fri   Lunch     2  0.180921
221       13.42  3.48    Yes  Fri   Lunch     2  0.259314
222        8.58  1.92    Yes  Fri   Lunch     1  0.223776
224       13.42  1.58    Yes  Fri   Lunch     2  0.117735
225       16.27  2.50    Yes  Fri   Lunch     2  0.153657
226       10.09  2.00    Yes  Fri   Lunch     2  0.198216
('Sat', 'No')
     total_bill   tip smoker  day    time  size   tip_pct
19        20.65  3.35     No  Sat  Dinner     3  0.162228
20        17.92  4.08     No  Sat  Dinner     2  0.227679
21        20.29  2.75     No  Sat  Dinner     2  0.135535
22        15.77  2.23     No  Sat  Dinner     2  0.141408
23        39.42  7.58     No  Sat  Dinner     4  0.192288
24        19.82  3.18     No  Sat  Dinner     2  0.160444
25        17.81  2.34     No  Sat  Dinner     4  0.131387
26        13.37  2.00     No  Sat  Dinner     2  0.149589
27        12.69  2.00     No  Sat  Dinner     2  0.157604
28        21.70  4.30     No  Sat  Dinner     2  0.198157
29        19.65  3.00     No  Sat  Dinner     2  0.152672
30         9.55  1.45     No  Sat  Dinner     2  0.151832
31        18.35  2.50     No  Sat  Dinner     4  0.136240
32        15.06  3.00     No  Sat  Dinner     2  0.199203
33        20.69  2.45     No  Sat  Dinner     4  0.118415
34        17.78  3.27     No  Sat  Dinner     2  0.183915
35        24.06  3.60     No  Sat  Dinner     3  0.149626
36        16.31  2.00     No  Sat  Dinner     3  0.122624
37        16.93  3.07     No  Sat  Dinner     3  0.181335
38        18.69  2.31     No  Sat  Dinner     3  0.123596
39        31.27  5.00     No  Sat  Dinner     3  0.159898
40        16.04  2.24     No  Sat  Dinner     3  0.139651
57        26.41  1.50     No  Sat  Dinner     2  0.056797
59        48.27  6.73     No  Sat  Dinner     4  0.139424
64        17.59  2.64     No  Sat  Dinner     3  0.150085
65        20.08  3.15     No  Sat  Dinner     3  0.156873
66        16.45  2.47     No  Sat  Dinner     2  0.150152
68        20.23  2.01     No  Sat  Dinner     2  0.099357
70        12.02  1.97     No  Sat  Dinner     2  0.163894
71        17.07  3.00     No  Sat  Dinner     3  0.175747
74        14.73  2.20     No  Sat  Dinner     2  0.149355
75        10.51  1.25     No  Sat  Dinner     2  0.118934
104       20.92  4.08     No  Sat  Dinner     2  0.195029
108       18.24  3.76     No  Sat  Dinner     2  0.206140
110       14.00  3.00     No  Sat  Dinner     2  0.214286
111        7.25  1.00     No  Sat  Dinner     1  0.137931
212       48.33  9.00     No  Sat  Dinner     4  0.186220
227       20.45  3.00     No  Sat  Dinner     4  0.146699
228       13.28  2.72     No  Sat  Dinner     2  0.204819
232       11.61  3.39     No  Sat  Dinner     2  0.291990
233       10.77  1.47     No  Sat  Dinner     2  0.136490
235       10.07  1.25     No  Sat  Dinner     2  0.124131
238       35.83  4.67     No  Sat  Dinner     3  0.130338
239       29.03  5.92     No  Sat  Dinner     3  0.203927
242       17.82  1.75     No  Sat  Dinner     2  0.098204
('Sat', 'Yes')
     total_bill    tip smoker  day    time  size   tip_pct
56        38.01   3.00    Yes  Sat  Dinner     4  0.078927
58        11.24   1.76    Yes  Sat  Dinner     2  0.156584
60        20.29   3.21    Yes  Sat  Dinner     2  0.158206
61        13.81   2.00    Yes  Sat  Dinner     2  0.144823
62        11.02   1.98    Yes  Sat  Dinner     2  0.179673
63        18.29   3.76    Yes  Sat  Dinner     4  0.205577
67         3.07   1.00    Yes  Sat  Dinner     1  0.325733
69        15.01   2.09    Yes  Sat  Dinner     2  0.139241
72        26.86   3.14    Yes  Sat  Dinner     2  0.116902
73        25.28   5.00    Yes  Sat  Dinner     2  0.197785
76        17.92   3.08    Yes  Sat  Dinner     2  0.171875
102       44.30   2.50    Yes  Sat  Dinner     3  0.056433
103       22.42   3.48    Yes  Sat  Dinner     2  0.155219
105       15.36   1.64    Yes  Sat  Dinner     2  0.106771
106       20.49   4.06    Yes  Sat  Dinner     2  0.198145
107       25.21   4.29    Yes  Sat  Dinner     2  0.170171
109       14.31   4.00    Yes  Sat  Dinner     2  0.279525
168       10.59   1.61    Yes  Sat  Dinner     2  0.152030
169       10.63   2.00    Yes  Sat  Dinner     2  0.188147
170       50.81  10.00    Yes  Sat  Dinner     3  0.196812
171       15.81   3.16    Yes  Sat  Dinner     2  0.199873
206       26.59   3.41    Yes  Sat  Dinner     3  0.128244
207       38.73   3.00    Yes  Sat  Dinner     4  0.077459
208       24.27   2.03    Yes  Sat  Dinner     2  0.083642
209       12.76   2.23    Yes  Sat  Dinner     2  0.174765
210       30.06   2.00    Yes  Sat  Dinner     3  0.066534
211       25.89   5.16    Yes  Sat  Dinner     4  0.199305
213       13.27   2.50    Yes  Sat  Dinner     2  0.188395
214       28.17   6.50    Yes  Sat  Dinner     3  0.230742
215       12.90   1.10    Yes  Sat  Dinner     2  0.085271
216       28.15   3.00    Yes  Sat  Dinner     5  0.106572
217       11.59   1.50    Yes  Sat  Dinner     2  0.129422
218        7.74   1.44    Yes  Sat  Dinner     2  0.186047
219       30.14   3.09    Yes  Sat  Dinner     4  0.102522
229       22.12   2.88    Yes  Sat  Dinner     2  0.130199
230       24.01   2.00    Yes  Sat  Dinner     4  0.083299
231       15.69   3.00    Yes  Sat  Dinner     3  0.191205
234       15.53   3.00    Yes  Sat  Dinner     2  0.193175
236       12.60   1.00    Yes  Sat  Dinner     2  0.079365
237       32.83   1.17    Yes  Sat  Dinner     2  0.035638
240       27.18   2.00    Yes  Sat  Dinner     2  0.073584
241       22.67   2.00    Yes  Sat  Dinner     2  0.088222
('Sun', 'No')
     total_bill   tip smoker  day    time  size   tip_pct
0         16.99  1.01     No  Sun  Dinner     2  0.059447
1         10.34  1.66     No  Sun  Dinner     3  0.160542
2         21.01  3.50     No  Sun  Dinner     3  0.166587
3         23.68  3.31     No  Sun  Dinner     2  0.139780
4         24.59  3.61     No  Sun  Dinner     4  0.146808
5         25.29  4.71     No  Sun  Dinner     4  0.186240
6          8.77  2.00     No  Sun  Dinner     2  0.228050
7         26.88  3.12     No  Sun  Dinner     4  0.116071
8         15.04  1.96     No  Sun  Dinner     2  0.130319
9         14.78  3.23     No  Sun  Dinner     2  0.218539
10        10.27  1.71     No  Sun  Dinner     2  0.166504
11        35.26  5.00     No  Sun  Dinner     4  0.141804
12        15.42  1.57     No  Sun  Dinner     2  0.101816
13        18.43  3.00     No  Sun  Dinner     4  0.162778
14        14.83  3.02     No  Sun  Dinner     2  0.203641
15        21.58  3.92     No  Sun  Dinner     2  0.181650
16        10.33  1.67     No  Sun  Dinner     3  0.161665
17        16.29  3.71     No  Sun  Dinner     3  0.227747
18        16.97  3.50     No  Sun  Dinner     3  0.206246
41        17.46  2.54     No  Sun  Dinner     2  0.145475
42        13.94  3.06     No  Sun  Dinner     2  0.219512
43         9.68  1.32     No  Sun  Dinner     2  0.136364
44        30.40  5.60     No  Sun  Dinner     4  0.184211
45        18.29  3.00     No  Sun  Dinner     2  0.164024
46        22.23  5.00     No  Sun  Dinner     2  0.224921
47        32.40  6.00     No  Sun  Dinner     4  0.185185
48        28.55  2.05     No  Sun  Dinner     3  0.071804
49        18.04  3.00     No  Sun  Dinner     2  0.166297
50        12.54  2.50     No  Sun  Dinner     2  0.199362
51        10.29  2.60     No  Sun  Dinner     2  0.252672
52        34.81  5.20     No  Sun  Dinner     4  0.149382
53         9.94  1.56     No  Sun  Dinner     2  0.156942
54        25.56  4.34     No  Sun  Dinner     4  0.169797
55        19.49  3.51     No  Sun  Dinner     2  0.180092
112       38.07  4.00     No  Sun  Dinner     3  0.105070
113       23.95  2.55     No  Sun  Dinner     2  0.106472
114       25.71  4.00     No  Sun  Dinner     3  0.155581
115       17.31  3.50     No  Sun  Dinner     2  0.202195
116       29.93  5.07     No  Sun  Dinner     4  0.169395
150       14.07  2.50     No  Sun  Dinner     2  0.177683
151       13.13  2.00     No  Sun  Dinner     2  0.152323
152       17.26  2.74     No  Sun  Dinner     3  0.158749
153       24.55  2.00     No  Sun  Dinner     4  0.081466
154       19.77  2.00     No  Sun  Dinner     4  0.101163
155       29.85  5.14     No  Sun  Dinner     5  0.172194
156       48.17  5.00     No  Sun  Dinner     6  0.103799
157       25.00  3.75     No  Sun  Dinner     4  0.150000
158       13.39  2.61     No  Sun  Dinner     2  0.194922
159       16.49  2.00     No  Sun  Dinner     4  0.121286
160       21.50  3.50     No  Sun  Dinner     4  0.162791
161       12.66  2.50     No  Sun  Dinner     2  0.197472
162       16.21  2.00     No  Sun  Dinner     3  0.123381
163       13.81  2.00     No  Sun  Dinner     2  0.144823
165       24.52  3.48     No  Sun  Dinner     3  0.141925
166       20.76  2.24     No  Sun  Dinner     2  0.107900
167       31.71  4.50     No  Sun  Dinner     4  0.141911
185       20.69  5.00     No  Sun  Dinner     5  0.241663
('Sun', 'Yes')
     total_bill   tip smoker  day    time  size   tip_pct
164       17.51  3.00    Yes  Sun  Dinner     2  0.171331
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
173       31.85  3.18    Yes  Sun  Dinner     2  0.099843
174       16.82  4.00    Yes  Sun  Dinner     2  0.237812
175       32.90  3.11    Yes  Sun  Dinner     2  0.094529
176       17.89  2.00    Yes  Sun  Dinner     2  0.111794
177       14.48  2.00    Yes  Sun  Dinner     2  0.138122
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
179       34.63  3.55    Yes  Sun  Dinner     2  0.102512
180       34.65  3.68    Yes  Sun  Dinner     4  0.106205
181       23.33  5.65    Yes  Sun  Dinner     2  0.242177
182       45.35  3.50    Yes  Sun  Dinner     3  0.077178
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
184       40.55  3.00    Yes  Sun  Dinner     2  0.073983
186       20.90  3.50    Yes  Sun  Dinner     3  0.167464
187       30.46  2.00    Yes  Sun  Dinner     5  0.065660
188       18.15  3.50    Yes  Sun  Dinner     3  0.192837
189       23.10  4.00    Yes  Sun  Dinner     3  0.173160
190       15.69  1.50    Yes  Sun  Dinner     2  0.095602
('Thur', 'No')
     total_bill   tip smoker   day    time  size   tip_pct
77        27.20  4.00     No  Thur   Lunch     4  0.147059
78        22.76  3.00     No  Thur   Lunch     2  0.131810
79        17.29  2.71     No  Thur   Lunch     2  0.156738
81        16.66  3.40     No  Thur   Lunch     2  0.204082
82        10.07  1.83     No  Thur   Lunch     1  0.181728
84        15.98  2.03     No  Thur   Lunch     2  0.127034
85        34.83  5.17     No  Thur   Lunch     4  0.148435
86        13.03  2.00     No  Thur   Lunch     2  0.153492
87        18.28  4.00     No  Thur   Lunch     2  0.218818
88        24.71  5.85     No  Thur   Lunch     2  0.236746
89        21.16  3.00     No  Thur   Lunch     2  0.141777
117       10.65  1.50     No  Thur   Lunch     2  0.140845
118       12.43  1.80     No  Thur   Lunch     2  0.144811
119       24.08  2.92     No  Thur   Lunch     4  0.121262
120       11.69  2.31     No  Thur   Lunch     2  0.197605
121       13.42  1.68     No  Thur   Lunch     2  0.125186
122       14.26  2.50     No  Thur   Lunch     2  0.175316
123       15.95  2.00     No  Thur   Lunch     2  0.125392
124       12.48  2.52     No  Thur   Lunch     2  0.201923
125       29.80  4.20     No  Thur   Lunch     6  0.140940
126        8.52  1.48     No  Thur   Lunch     2  0.173709
127       14.52  2.00     No  Thur   Lunch     2  0.137741
128       11.38  2.00     No  Thur   Lunch     2  0.175747
129       22.82  2.18     No  Thur   Lunch     3  0.095530
130       19.08  1.50     No  Thur   Lunch     2  0.078616
131       20.27  2.83     No  Thur   Lunch     2  0.139615
132       11.17  1.50     No  Thur   Lunch     2  0.134288
133       12.26  2.00     No  Thur   Lunch     2  0.163132
134       18.26  3.25     No  Thur   Lunch     2  0.177985
135        8.51  1.25     No  Thur   Lunch     2  0.146886
136       10.33  2.00     No  Thur   Lunch     2  0.193611
137       14.15  2.00     No  Thur   Lunch     2  0.141343
139       13.16  2.75     No  Thur   Lunch     2  0.208967
140       17.47  3.50     No  Thur   Lunch     2  0.200343
141       34.30  6.70     No  Thur   Lunch     6  0.195335
142       41.19  5.00     No  Thur   Lunch     5  0.121389
143       27.05  5.00     No  Thur   Lunch     6  0.184843
144       16.43  2.30     No  Thur   Lunch     2  0.139988
145        8.35  1.50     No  Thur   Lunch     2  0.179641
146       18.64  1.36     No  Thur   Lunch     3  0.072961
147       11.87  1.63     No  Thur   Lunch     2  0.137321
148        9.78  1.73     No  Thur   Lunch     2  0.176892
149        7.51  2.00     No  Thur   Lunch     2  0.266312
195        7.56  1.44     No  Thur   Lunch     2  0.190476
243       18.78  3.00     No  Thur  Dinner     2  0.159744
('Thur', 'Yes')
     total_bill   tip smoker   day   time  size   tip_pct
80        19.44  3.00    Yes  Thur  Lunch     2  0.154321
83        32.68  5.00    Yes  Thur  Lunch     2  0.152999
138       16.00  2.00    Yes  Thur  Lunch     2  0.125000
191       19.81  4.19    Yes  Thur  Lunch     2  0.211509
192       28.44  2.56    Yes  Thur  Lunch     2  0.090014
193       15.48  2.02    Yes  Thur  Lunch     2  0.130491
194       16.58  4.00    Yes  Thur  Lunch     2  0.241255
196       10.34  2.00    Yes  Thur  Lunch     2  0.193424
197       43.11  5.00    Yes  Thur  Lunch     4  0.115982
198       13.00  2.00    Yes  Thur  Lunch     2  0.153846
199       13.51  2.00    Yes  Thur  Lunch     2  0.148038
200       18.71  4.00    Yes  Thur  Lunch     3  0.213789
201       12.74  2.01    Yes  Thur  Lunch     2  0.157771
202       13.00  2.00    Yes  Thur  Lunch     2  0.153846
203       16.40  2.50    Yes  Thur  Lunch     2  0.152439
204       20.53  4.00    Yes  Thur  Lunch     4  0.194837
205       16.47  3.23    Yes  Thur  Lunch     3  0.196114
           
# 在tip_pct和size上進行聚合,并根據time分組,把smoker放入表的列,而将day放入表的行tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')
           
size tip_pct
smoker No Yes No Yes
time day
Dinner Fri 2.000000 2.222222 0.139622 0.165347
Sat 2.555556 2.476190 0.158048 0.147906
Sun 2.929825 2.578947 0.160113 0.187250
Thur 2.000000 NaN 0.159744 NaN
Lunch Fri 3.000000 1.833333 0.187735 0.188937
Thur 2.500000 2.352941 0.160311 0.163863
# 通過傳遞margins=True來包含部分統計。tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True)
           
size tip_pct
smoker No Yes All No Yes All
time day
Dinner Fri 2.000000 2.222222 2.166667 0.139622 0.165347 0.158916
Sat 2.555556 2.476190 2.517241 0.158048 0.147906 0.153152
Sun 2.929825 2.578947 2.842105 0.160113 0.187250 0.166897
Thur 2.000000 NaN 2.000000 0.159744 NaN 0.159744
Lunch Fri 3.000000 1.833333 2.000000 0.187735 0.188937 0.188765
Thur 2.500000 2.352941 2.459016 0.160311 0.163863 0.161301
All 2.668874 2.408602 2.569672 0.159328 0.163196 0.160803
# 我們也可以使用不同的聚合函數,将函數傳遞給aggfunc,比如'count'和lentips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day', aggfunc=len, margins=True)
           
day Fri Sat Sun Thur All
time smoker
Dinner No 3.0 45.0 57.0 1.0 106.0
Yes 9.0 42.0 19.0 NaN 70.0
Lunch No 1.0 NaN NaN 44.0 45.0
Yes 6.0 NaN NaN 17.0 23.0
All 19.0 87.0 76.0 62.0 244.0
# 當然,也可以使用fill_value填充空值tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'], columns='day', aggfunc='mean', fill_value=0)
           
day Fri Sat Sun Thur
time size smoker
Dinner 1 No 0.000000 0.137931 0.000000 0.000000
Yes 0.000000 0.325733 0.000000 0.000000
2 No 0.139622 0.162705 0.168859 0.159744
Yes 0.171297 0.148668 0.207893 0.000000
3 No 0.000000 0.154661 0.152663 0.000000
Yes 0.000000 0.144995 0.152660 0.000000
4 No 0.000000 0.150096 0.148143 0.000000
Yes 0.117750 0.124515 0.193370 0.000000
5 No 0.000000 0.000000 0.206928 0.000000
Yes 0.000000 0.106572 0.065660 0.000000
6 No 0.000000 0.000000 0.103799 0.000000
Lunch 1 No 0.000000 0.000000 0.000000 0.181728
Yes 0.223776 0.000000 0.000000 0.000000
2 No 0.000000 0.000000 0.000000 0.166005
Yes 0.181969 0.000000 0.000000 0.158843
3 No 0.187735 0.000000 0.000000 0.084246
Yes 0.000000 0.000000 0.000000 0.204952
4 No 0.000000 0.000000 0.000000 0.138919
Yes 0.000000 0.000000 0.000000 0.155410
5 No 0.000000 0.000000 0.000000 0.121389
6 No 0.000000 0.000000 0.000000 0.173706

pivot_table

選項

選項名 描述
values 需要聚合的列名;預設情況下聚合所有數值型的列
index 在結果透視表的行上進行分組的列名或其他分組鍵
columns 在結果透視表的列上進行分組的列名或其他分組鍵
aggfunc 聚合函數或函數清單,預設為'mean'
fill_value 替換缺失值
dropna 删除含有NA的列
margins 添加行/列小計和總計

crosstab-交叉表

crosstab

前兩個參數是數組、Series或者數組清單,計算分組頻率

pd.crosstab([tips['time'], tips['day']], tips['smoker'], margins=True)
           
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244

總結

本文主要講解了

pandas

中分組聚合常用的GroupBy方法,這在資料分析中應用非常廣泛。下一次我們将繼續講解時間序列。

繼續閱讀