How to Meg DataFrames
Merging DataFrames in Pandas is similar to performing joins in SQL. It allows you to combine data from different sources into a single DataFrame based on common keys or indices. The primary functions for merging DataFrames in Pandas are merge()
, join()
, and concat()
.
Types of Joins
Before diving into the methods, it’s important to understand the types of joins:
- Inner Join: Returns only the rows with matching keys in both DataFrames.
- Left Join: Returns all rows from the left DataFrame and matched rows from the right DataFrame. Unmatched rows get NaN in the columns from the right DataFrame.
- Right Join: Returns all rows from the right DataFrame and matched rows from the left DataFrame. Unmatched rows get NaN in the columns from the left DataFrame.
- Outer Join: Returns all rows from both DataFrames. Unmatched rows get NaN in the columns from the other DataFrame.
Merging with merge()
The merge()
function is the most versatile method for merging DataFrames. It allows you to specify the type of join and the columns to merge on.
Basic Merge
Here’s a simple example of merging two DataFrames:
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
'EmployeeID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'EmployeeID': [3, 4, 5, 6],
'Department': ['HR', 'IT', 'Sales', 'Marketing']
})
# Merging on EmployeeID
merged_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
print(merged_df)
Output:
EmployeeID Name Department
0 3 Charlie HR
1 4 David IT
Specifying Join Types
You can specify the type of join using the how
parameter:
# Left Join
left_joined = pd.merge(df1, df2, on='EmployeeID', how='left')
print(left_joined)
Output:
EmployeeID Name Department
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie HR
3 4 David IT
Merging on Multiple Columns
You can merge on multiple columns by passing a list to the on
parameter:
# Sample DataFrames
df1 = pd.DataFrame({
'EmployeeID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'IT', 'IT', 'HR']
})
df2 = pd.DataFrame({
'EmployeeID': [3, 4, 5, 6],
'Department': ['HR', 'IT', 'Sales', 'Marketing'],
'Salary': [55000, 72000, 60000, 58000]
})
# Merging on EmployeeID and Department
merged_df = pd.merge(df1, df2, on=['EmployeeID', 'Department'], how='inner')
print(merged_df)
Output:
EmployeeID Name Department Salary
0 4 David IT 72000
Joining with join()
The join()
method is used to join DataFrames based on their index. It is particularly useful when the indices of the DataFrames are meaningful.
Basic Join
Here’s an example of using join()
:
# Sample DataFrames
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'IT', 'IT', 'HR']
}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({
'Salary': [50000, 60000, 55000, 72000]
}, index=[1, 2, 3, 4])
# Joining on index
joined_df = df1.join(df2)
print(joined_df)
Output:
Name Department Salary
1 Alice HR 50000
2 Bob IT 60000
3 Charlie IT 55000
4 David HR 72000
Joining on a Key
If you need to join on a specific column, you can set the column as the index:
df1.set_index('EmployeeID', inplace=True)
df2.set_index('EmployeeID', inplace=True)
# Joining on index
joined_df = df1.join(df2)
print(joined_df)
Output:
Name Department Salary
EmployeeID
1 Alice HR 50000
2 Bob IT 60000
3 Charlie IT 55000
4 David HR 72000
Concatenating with concat()
The concat()
function is used to concatenate DataFrames along a particular axis. It is useful for stacking DataFrames either vertically (axis=0) or horizontally (axis=1).
Vertical Concatenation
Here’s an example of vertical concatenation:
# Sample DataFrames
df1 = pd.DataFrame({
'EmployeeID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'EmployeeID': [4, 5, 6],
'Name': ['David', 'Eve', 'Frank']
})
# Concatenating vertically
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)
Output:
EmployeeID Name
0 1 Alice
1 2 Bob
2 3 Charlie
0 4 David
1 5 Eve
2 6 Frank
Horizontal Concatenation
Here’s an example of horizontal concatenation:
# Sample DataFrames
df1 = pd.DataFrame({
'EmployeeID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'Department': ['HR', 'IT', 'IT'],
'Salary': [50000, 60000, 55000]
})
# Concatenating horizontally
concatenated_df = pd.concat([df1, df2], axis=1)
print(concatenated_df)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000
1 2 Bob IT 60000
2 3 Charlie IT 55000
Handling Duplicate Indices
When concatenating DataFrames, you might encounter duplicate indices. You can resolve this by using the ignore_index
parameter:
# Concatenating with ignore_index
concatenated_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concatenated_df)
Output:
EmployeeID Name Department Salary
0 1.0 Alice NaN NaN
1 2.0 Bob NaN NaN
2 3.0 Charlie NaN NaN
3 NaN NaN HR 50000.0
4 NaN NaN IT 60000.0
5 NaN NaN IT 55000.0
Combining with combine_first()
The combine_first()
method allows you to combine two DataFrames, filling in missing values from the first DataFrame with values from the second DataFrame.
Basic Usage
Here’s an example:
# Sample DataFrames
df1 = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [np.nan, 2, 3, 4]
})
df2 = pd.DataFrame({
'A': [5, 6, 7, 8],
'B': [1, 2, 3, 4]
})
# Combining with combine_first
combined_df = df1.combine_first(df2)
print(combined_df)
Output:
A B
0 1.0 1.0
1 2.0 2.0
2 7.0 3.0
3 4.0 4.0