Grouping and Aggregating with Pandas

June 10, 2024

Grouping in Pandas involves splitting the data into groups based on some criteria. Once split, we can apply various aggregation functions to these groups independently. This is particularly useful in scenarios where we need to summarize data, such as calculating the mean, sum, count, or other statistical measures for each group.

Aggregation refers to applying a function that reduces multiple values to a single value. Common aggregation functions include sum, mean, min, max, count, and more.

Table of Contents

Basic Concepts

Grouping with groupby:
The groupby method in Pandas is used to split data into groups based on one or more keys. The syntax is straightforward:

   grouped = df.groupby('column_name')

This creates a GroupBy object, which can then be used to apply aggregation functions.

Aggregation Functions:
Pandas provides a wide range of aggregation functions, such as mean, sum, min, max, count, etc. These functions can be applied to a GroupBy object to compute aggregated results.

Step-by-Step Guide

Let’s dive into a step-by-step guide with examples.

Step 1: Importing Pandas and Creating a DataFrame

First, we need to import Pandas and create a sample DataFrame:

import pandas as pd

# Sample DataFrame
data = {
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [50000, 60000, 55000, 58000, 75000, 72000, 70000]
}

df = pd.DataFrame(data)
print(df)

Step 2: Grouping Data

To group the data by the ‘Department’ column:

grouped = df.groupby('Department')

Step 3: Applying Aggregation Functions

We can apply various aggregation functions to our grouped data. For example, to calculate the mean salary for each department:

mean_salary = grouped['Salary'].mean()
print(mean_salary)

Output:

Department
HR       56500.0
IT       72333.3
Sales    55000.0
Name: Salary, dtype: float64

Advanced Grouping and Aggregating

Multiple Aggregation Functions

You can apply multiple aggregation functions at once using the agg method:

agg_funcs = grouped['Salary'].agg(['mean', 'sum', 'max', 'min'])
print(agg_funcs)

Output:

              mean    sum    max    min
Department                              
HR         56500.0  113000  58000  55000
IT         72333.3  217000  75000  70000
Sales      55000.0  110000  60000  50000

Aggregating Multiple Columns

If you want to apply different aggregation functions to different columns, you can pass a dictionary to the agg method:

agg_funcs = grouped.agg({
    'Salary': ['mean', 'sum'],
    'Employee': 'count'
})
print(agg_funcs)

Output:

               Salary              Employee
                 mean     sum   count
Department                              
HR            56500.0  113000       2
IT            72333.3  217000       3
Sales         55000.0  110000       2

Custom Aggregation Functions

You can also define and apply custom aggregation functions. For example, if you want to calculate the range of salaries for each department:

def salary_range(series):
    return series.max() - series.min()

salary_ranges = grouped['Salary'].agg(salary_range)
print(salary_ranges)

Output:

Department
HR       3000
IT       5000
Sales    10000
Name: Salary, dtype: int64

Grouping by Multiple Columns

Pandas allows grouping by multiple columns by passing a list of column names to groupby:

# Sample DataFrame with additional column
data = {
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'IT'],
    'Team': ['East', 'West', 'East', 'West', 'East', 'West', 'West'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [50000, 60000, 55000, 58000, 75000, 72000, 70000]
}

df = pd.DataFrame(data)

# Grouping by multiple columns
grouped = df.groupby(['Department', 'Team'])
mean_salary = grouped['Salary'].mean()
print(mean_salary)

Output:

Department  Team
HR          East    55000.0
            West    58000.0
IT          East    75000.0
            West    71000.0
Sales       East    50000.0
            West    60000.0
Name: Salary, dtype: float64

Using `transform` for Group-Specific Calculations

The transform method is useful when you need to perform group-specific calculations and return a DataFrame of the same shape as the original. For example, calculating the normalized salary within each department:

df['Normalized_Salary'] = grouped['Salary'].transform(lambda x: (x - x.mean()) / x.std())
print(df)

Pivot Tables

Pivot tables provide a way to summarize data with multi-dimensional grouping. They are very similar to Excel pivot tables:

pivot = df.pivot_table(values='Salary', index='Department', columns='Team', aggfunc='mean')
print(pivot)

Output:

Team         East      West
Department                    
HR           55000.0  58000.0
IT           75000.0  71000.0
Sales        50000.0  60000.0

Grouping and aggregating data in Pandas is an essential skill for data analysis. It allows you to summarize and extract insights from large datasets efficiently. This guide covered the basics and advanced techniques of grouping and aggregating, including custom aggregation functions, grouping by multiple columns, and using pivot tables. Mastering these techniques will enable you to perform complex data analysis tasks with ease.

Grouping and Aggregating with Pandas

Basic Concepts

Step-by-Step Guide

Step 1: Importing Pandas and Creating a DataFrame

Step 2: Grouping Data

Step 3: Applying Aggregation Functions

Advanced Grouping and Aggregating

Multiple Aggregation Functions

Aggregating Multiple Columns

Custom Aggregation Functions

Grouping by Multiple Columns

Using `transform` for Group-Specific Calculations

Pivot Tables

Popular

223. Rectangle Area – Leetcode Solutions

264. Ugly Number II – Leetcode Solutions

226. Invert Binary Tree – Leetcode Solutions

32. Longest Valid Parentheses – Leetcode Solutions

148. Sort List – Leetcode Solutions

What is a 502 Bad Gateway Error and How to Fix It

102. Binary Tree Level Order Traversal – Leetcode Solutions

Grouping and Aggregating with Pandas

Basic Concepts

Step-by-Step Guide

Step 1: Importing Pandas and Creating a DataFrame

Step 2: Grouping Data

Step 3: Applying Aggregation Functions

Advanced Grouping and Aggregating

Multiple Aggregation Functions

Aggregating Multiple Columns

Custom Aggregation Functions

Grouping by Multiple Columns

Using transform for Group-Specific Calculations

Pivot Tables

Related Post

Popular

Using `transform` for Group-Specific Calculations