Pandas DataFrame groupby() Function

1. Pandas groupby() Function

Pandas DataFrame groupby() function is used to group rows that have the same values. It’s mostly used with aggregate functions (count, sum, min, max, mean) to get the statistics based on one or more column values.

The Pandas groupby() function is very similar to the SQL GROUP BY statement; after all, a DataFrame and a SQL table are both tabular structures made of rows and columns. On its own, groupby() is an intermediate step: it creates the groups, and an aggregate function then produces the final result.

2. Split Apply Combine

Grouping is also known as the split-apply-combine process. The groupby() function splits the data based on some criteria, the aggregate function is applied to each group, and the per-group results are combined into the result DataFrame. The diagram below illustrates this behavior with a simple example.

Split Apply Combine Example
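
The three steps can also be written out by hand. The following is a minimal sketch, assuming a small DataFrame with hypothetical Role and Salary columns; it performs the split, apply, and combine steps explicitly and compares the result with the groupby() shortcut.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, 4000, 5000],
})

# Split: one sub-DataFrame per distinct Role value
pieces = {role: group for role, group in df.groupby("Role")}

# Apply: compute an aggregate on each piece
means = {role: group["Salary"].mean() for role, group in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(means, name="Salary").rename_axis("Role")

# groupby() followed by an aggregate does all three steps at once
builtin = df.groupby("Role")["Salary"].mean()

print(manual.to_dict() == builtin.to_dict())
```

Writing the steps out this way is only for illustration; in practice the chained groupby() call is shorter and is the usual choice.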

3. Pandas DataFrame groupby() Syntax

The groupby() function syntax is:

groupby(
    self,
    by=None,
    axis=0,
    level=None,
    as_index=True,
    sort=True,
    group_keys=True,
    squeeze=False,
    observed=False,
    **kwargs
)
           
  • by: determines how the rows are grouped. Typically this is a column name or a list of column names.

  • axis: determines whether to group by rows (0) or columns (1).

  • level: used with a MultiIndex (hierarchical index) to group by one or more index levels.

  • as_index: if True, the aggregated output uses the group labels as its index.

  • sort: whether to sort the group keys. Passing False can improve performance on larger DataFrame objects.

  • group_keys: when calling apply, add the group keys to the index to identify the pieces.

  • squeeze: reduce the dimensionality of the return type if possible, otherwise return a consistent type.

  • observed: if True, only show observed values for categorical groupers; if False, show all values for categorical groupers.

  • **kwargs: only accepts the keyword argument ‘mutated’, which is passed to groupby.
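
A few of these parameters are easier to follow with a concrete run. The following is a minimal sketch, assuming a small DataFrame with hypothetical Role and Salary columns and an extra, unused 'Reviewer' category invented purely to show the effect of observed.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, 4000, 5000],
})

# as_index=False keeps the group labels as a regular column
flat = df.groupby("Role", as_index=False)[["Salary"]].sum()

# sort=False keeps the groups in order of first appearance
unsorted = df.groupby("Role", sort=False)["Salary"].sum()

# observed only matters for categorical groupers;
# 'Reviewer' is a hypothetical category with no rows
roles = pd.Categorical(df["Role"], categories=["Author", "Editor", "Reviewer"])
cat_df = df.assign(Role=roles)
all_categories = cat_df.groupby("Role", observed=False)["Salary"].sum()
seen_only = cat_df.groupby("Role", observed=True)["Salary"].sum()

print(list(flat.columns))
print(list(unsorted.index))
print(list(all_categories.index))
print(list(seen_only.index))
```

Passing observed explicitly, as done here, also makes the code independent of whichever default the installed pandas version uses.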

The groupby() function returns DataFrameGroupBy or SeriesGroupBy depending on the calling object.
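
The return types can be checked directly. This is a small sketch, assuming an illustrative DataFrame with hypothetical Role and Salary columns; it prints the class name of the object returned in each case.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author"],
    "Salary": [10000, 8000, 6000],
})

# Calling groupby() on a DataFrame
frame_groups = df.groupby("Role")

# Calling groupby() on a Series, grouped by another Series
series_groups = df["Salary"].groupby(df["Role"])

# Selecting a single column from a DataFrame groupby also yields a SeriesGroupBy
column_groups = df.groupby("Role")["Salary"]

print(type(frame_groups).__name__)
print(type(series_groups).__name__)
print(type(column_groups).__name__)
```

Neither object holds a result yet; an aggregate such as mean() or sum() has to be called on it to produce one.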

4. Pandas groupby() Example

Let’s say we have a CSV file with the below content.

ID,Name,Role,Salary
1,Pankaj,Editor,10000
2,Lisa,Editor,8000
3,David,Author,6000
4,Ram,Author,4000
5,Anupam,Author,5000
           

We will use the Pandas read_csv() function to read the CSV file and create the DataFrame object.

import pandas as pd

df = pd.read_csv('records.csv')

print(df)
           

Output:

   ID    Name    Role  Salary
0   1  Pankaj  Editor   10000
1   2    Lisa  Editor    8000
2   3   David  Author    6000
3   4     Ram  Author    4000
4   5  Anupam  Author    5000
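
If the records.csv file is not at hand, the same DataFrame can be built from an in-memory string. This is a sketch of one way to do it, assuming the CSV content shown earlier; io.StringIO stands in for the file.

```python
import io

import pandas as pd

# Same content as the CSV file in this example, held in memory
csv_text = """ID,Name,Role,Salary
1,Pankaj,Editor,10000
2,Lisa,Editor,8000
3,David,Author,6000
4,Ram,Author,4000
5,Anupam,Author,5000
"""

# read_csv() accepts any file-like object, not only a path
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(list(df.columns))
```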
           

4.1) Average Salary Group By Role

We want to know the average salary of the employees based on their role. So we will use groupby() function to create groups based on the ‘Role’ column. Then call the aggregate function mean() to calculate the average and produce the result. Since we don’t need ID and Name columns, we will remove them from the output.

df_groupby_role = df.groupby(['Role'])

# select only the column to aggregate; 'Role' is the grouping key
# and is carried in the index, so it must not be selected again
df_groupby_role = df_groupby_role[["Salary"]]

# get the average
df_groupby_role_mean = df_groupby_role.mean()

print(df_groupby_role_mean)
           

Output:

        Salary
Role          
Author    5000
Editor    9000
           

The group labels are used as the index of the output, which is not always convenient. We can turn them back into a regular column by calling the reset_index() function.

df_groupby_role_mean = df_groupby_role_mean.reset_index()
print(df_groupby_role_mean)
           

Output:

     Role  Salary
0  Author    5000
1  Editor    9000
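
Several statistics can be computed in one pass with agg(). The following is a minimal sketch, assuming the same records as in this example (rebuilt inline so the snippet runs on its own); the output column names avg_salary and headcount are hypothetical.

```python
import pandas as pd

# Same records as the CSV file in this example
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Pankaj", "Lisa", "David", "Ram", "Anupam"],
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, 4000, 5000],
})

# A list of function names gives one output column per statistic
stats = df.groupby("Role")["Salary"].agg(["count", "sum", "min", "max", "mean"])

# Named aggregation: output_name=(input_column, function)
summary = df.groupby("Role").agg(
    avg_salary=("Salary", "mean"),
    headcount=("ID", "count"),
).reset_index()

print(stats)
print(summary)
```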
           

4.2) Total Salary Paid By Role

In this example, we will calculate the total salary paid for each role.

df_salary_by_role = df.groupby(['Role'])[["Salary"]].sum().reset_index()
print(df_salary_by_role)
           

Output:

     Role  Salary
0  Author   15000
1  Editor   18000
           

This example is more compact because all the calls are chained in a single statement. In the earlier example, the steps were split up for clarity.

4.3) Total Number of Employees by Role

We can use the size() function, which returns the number of rows in each group, to get this data.

# size() returns a Series; name=... labels the size column during reset_index()
df_size_by_role = df.groupby(['Role']).size().reset_index(name='Count')
print(df_size_by_role)
           

Output:

     Role  Count
0  Author      3
1  Editor      2
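
Note that size() and count() are not interchangeable when values are missing. This is a small sketch, assuming a variant of the data in which one salary is missing (the missing value is invented for illustration): size() counts every row in a group, while count() counts only non-null values.

```python
import pandas as pd

# Hypothetical variant of the data with one missing salary
df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, None, 5000],
})

# size(): number of rows per group, missing values included
sizes = df.groupby("Role").size()

# count(): number of non-null Salary values per group
counts = df.groupby("Role")["Salary"].count()

print(sizes.to_dict())
print(counts.to_dict())
```

For a head count per role, size() is therefore the safer choice, since it does not depend on which columns happen to contain missing values.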
           

5. References

  • Pandas group by: split-apply-combine

  • Pandas DataFrame groupby() API Doc

Translated from: https://www.journaldev.com/33402/pandas-dataframe-groupby-function