Grouping in Pandas involves splitting the data into groups based on some criteria. Once split, we can apply various aggregation functions to these groups independently. This is particularly useful in scenarios where we need to summarize data, such as calculating the mean, sum, count, or other statistical measures for each group.
Aggregation refers to applying a function that reduces multiple values to a single value. Common aggregation functions include sum, mean, min, max, count, and more.
Basic Concepts
- Grouping with
groupby
:
Thegroupby
method in Pandas is used to split data into groups based on one or more keys. The syntax is straightforward:
grouped = df.groupby('column_name')
This creates a GroupBy
object, which can then be used to apply aggregation functions.
- Aggregation Functions:
Pandas provides a wide range of aggregation functions, such asmean
,sum
,min
,max
,count
, etc. These functions can be applied to aGroupBy
object to compute aggregated results.
Step-by-Step Guide
Let’s dive into a step-by-step guide with examples.
Step 1: Importing Pandas and Creating a DataFrame
First, we need to import Pandas and create a sample DataFrame:
import pandas as pd
# Sample DataFrame
data = {
'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'IT'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
'Salary': [50000, 60000, 55000, 58000, 75000, 72000, 70000]
}
df = pd.DataFrame(data)
print(df)
Step 2: Grouping Data
To group the data by the ‘Department’ column:
grouped = df.groupby('Department')
Step 3: Applying Aggregation Functions
We can apply various aggregation functions to our grouped data. For example, to calculate the mean salary for each department:
mean_salary = grouped['Salary'].mean()
print(mean_salary)
Output:
Department
HR 56500.0
IT 72333.3
Sales 55000.0
Name: Salary, dtype: float64
Advanced Grouping and Aggregating
Multiple Aggregation Functions
You can apply multiple aggregation functions at once using the agg
method:
agg_funcs = grouped['Salary'].agg(['mean', 'sum', 'max', 'min'])
print(agg_funcs)
Output:
mean sum max min
Department
HR 56500.0 113000 58000 55000
IT 72333.3 217000 75000 70000
Sales 55000.0 110000 60000 50000
Aggregating Multiple Columns
If you want to apply different aggregation functions to different columns, you can pass a dictionary to the agg
method:
agg_funcs = grouped.agg({
'Salary': ['mean', 'sum'],
'Employee': 'count'
})
print(agg_funcs)
Output:
Salary Employee
mean sum count
Department
HR 56500.0 113000 2
IT 72333.3 217000 3
Sales 55000.0 110000 2
Custom Aggregation Functions
You can also define and apply custom aggregation functions. For example, if you want to calculate the range of salaries for each department:
def salary_range(series):
return series.max() - series.min()
salary_ranges = grouped['Salary'].agg(salary_range)
print(salary_ranges)
Output:
Department
HR 3000
IT 5000
Sales 10000
Name: Salary, dtype: int64
Grouping by Multiple Columns
Pandas allows grouping by multiple columns by passing a list of column names to groupby
:
# Sample DataFrame with additional column
data = {
'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'IT'],
'Team': ['East', 'West', 'East', 'West', 'East', 'West', 'West'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
'Salary': [50000, 60000, 55000, 58000, 75000, 72000, 70000]
}
df = pd.DataFrame(data)
# Grouping by multiple columns
grouped = df.groupby(['Department', 'Team'])
mean_salary = grouped['Salary'].mean()
print(mean_salary)
Output:
Department Team
HR East 55000.0
West 58000.0
IT East 75000.0
West 71000.0
Sales East 50000.0
West 60000.0
Name: Salary, dtype: float64
Using transform
for Group-Specific Calculations
The transform
method is useful when you need to perform group-specific calculations and return a DataFrame of the same shape as the original. For example, calculating the normalized salary within each department:
df['Normalized_Salary'] = grouped['Salary'].transform(lambda x: (x - x.mean()) / x.std())
print(df)
Pivot Tables
Pivot tables provide a way to summarize data with multi-dimensional grouping. They are very similar to Excel pivot tables:
pivot = df.pivot_table(values='Salary', index='Department', columns='Team', aggfunc='mean')
print(pivot)
Output:
Team East West
Department
HR 55000.0 58000.0
IT 75000.0 71000.0
Sales 50000.0 60000.0
Grouping and aggregating data in Pandas is an essential skill for data analysis. It allows you to summarize and extract insights from large datasets efficiently. This guide covered the basics and advanced techniques of grouping and aggregating, including custom aggregation functions, grouping by multiple columns, and using pivot tables. Mastering these techniques will enable you to perform complex data analysis tasks with ease.