HomepythonHow to Merge Multiple DataFrames in Pandas

How to Merge Multiple DataFrames in Pandas

How to Meg DataFrames

Merging DataFrames in Pandas is similar to performing joins in SQL. It allows you to combine data from different sources into a single DataFrame based on common keys or indices. The primary functions for merging DataFrames in Pandas are merge(), join(), and concat().

Types of Joins

Before diving into the methods, it’s important to understand the types of joins:

  1. Inner Join: Returns only the rows with matching keys in both DataFrames.
  2. Left Join: Returns all rows from the left DataFrame and matched rows from the right DataFrame. Unmatched rows get NaN in the columns from the right DataFrame.
  3. Right Join: Returns all rows from the right DataFrame and matched rows from the left DataFrame. Unmatched rows get NaN in the columns from the left DataFrame.
  4. Outer Join: Returns all rows from both DataFrames. Unmatched rows get NaN in the columns from the other DataFrame.

Merging with merge()

The merge() function is the most versatile method for merging DataFrames. It allows you to specify the type of join and the columns to merge on.

Basic Merge

Here’s a simple example of merging two DataFrames:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'EmployeeID': [3, 4, 5, 6],
    'Department': ['HR', 'IT', 'Sales', 'Marketing']
})

# Merging on EmployeeID
merged_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
print(merged_df)

Output:

   EmployeeID     Name Department
0           3  Charlie         HR
1           4    David         IT

Specifying Join Types

You can specify the type of join using the how parameter:

# Left Join
left_joined = pd.merge(df1, df2, on='EmployeeID', how='left')
print(left_joined)

Output:

   EmployeeID     Name Department
0           1    Alice        NaN
1           2      Bob        NaN
2           3  Charlie         HR
3           4    David         IT

Merging on Multiple Columns

You can merge on multiple columns by passing a list to the on parameter:

# Sample DataFrames
df1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'HR']
})

df2 = pd.DataFrame({
    'EmployeeID': [3, 4, 5, 6],
    'Department': ['HR', 'IT', 'Sales', 'Marketing'],
    'Salary': [55000, 72000, 60000, 58000]
})

# Merging on EmployeeID and Department
merged_df = pd.merge(df1, df2, on=['EmployeeID', 'Department'], how='inner')
print(merged_df)

Output:

   EmployeeID     Name Department  Salary
0           4    David        IT   72000

Joining with join()

The join() method is used to join DataFrames based on their index. It is particularly useful when the indices of the DataFrames are meaningful.

Basic Join

Here’s an example of using join():

# Sample DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'HR']
}, index=[1, 2, 3, 4])

df2 = pd.DataFrame({
    'Salary': [50000, 60000, 55000, 72000]
}, index=[1, 2, 3, 4])

# Joining on index
joined_df = df1.join(df2)
print(joined_df)

Output:

       Name Department  Salary
1     Alice        HR   50000
2       Bob        IT   60000
3   Charlie        IT   55000
4     David        HR   72000

Joining on a Key

If you need to join on a specific column, you can set the column as the index:

df1.set_index('EmployeeID', inplace=True)
df2.set_index('EmployeeID', inplace=True)

# Joining on index
joined_df = df1.join(df2)
print(joined_df)

Output:

            Name Department  Salary
EmployeeID                           
1          Alice        HR   50000
2            Bob        IT   60000
3        Charlie        IT   55000
4          David        HR   72000

Concatenating with concat()

The concat() function is used to concatenate DataFrames along a particular axis. It is useful for stacking DataFrames either vertically (axis=0) or horizontally (axis=1).

Vertical Concatenation

Here’s an example of vertical concatenation:

# Sample DataFrames
df1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'EmployeeID': [4, 5, 6],
    'Name': ['David', 'Eve', 'Frank']
})

# Concatenating vertically
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)

Output:

   EmployeeID     Name
0           1    Alice
1           2      Bob
2           3  Charlie
0           4    David
1           5      Eve
2           6    Frank

Horizontal Concatenation

Here’s an example of horizontal concatenation:

# Sample DataFrames
df1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT'],
    'Salary': [50000, 60000, 55000]
})

# Concatenating horizontally
concatenated_df = pd.concat([df1, df2], axis=1)
print(concatenated_df)

Output:

   EmployeeID     Name Department  Salary
0           1    Alice        HR   50000
1           2      Bob        IT   60000
2           3  Charlie        IT   55000

Handling Duplicate Indices

When concatenating DataFrames, you might encounter duplicate indices. You can resolve this by using the ignore_index parameter:

# Concatenating with ignore_index
concatenated_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concatenated_df)

Output:

   EmployeeID     Name Department  Salary
0         1.0    Alice        NaN     NaN
1         2.0      Bob        NaN     NaN
2         3.0  Charlie        NaN     NaN
3         NaN      NaN         HR  50000.0
4         NaN      NaN         IT  60000.0
5         NaN      NaN         IT  55000.0

Combining with combine_first()

The combine_first() method allows you to combine two DataFrames, filling in missing values from the first DataFrame with values from the second DataFrame.

Basic Usage

Here’s an example:

# Sample DataFrames
df1 = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
})

df2 = pd.DataFrame({
    'A': [5, 6, 7, 8],
    'B': [1, 2, 3, 4]
})

# Combining with combine_first
combined_df = df1.combine_first(df2)
print(combined_df)

Output:

     A    B
0  1.0  1.0
1  2.0  2.0
2  7.0  3.0
3  4.0  4.0
Subscribe
Notify of

0 Comments
Inline Feedbacks
View all comments

Popular