Pandas 丢掉 DataFrame 中重复的行

Suraj Joshi 2023年1月30日 2021年1月22日
  1. DataFrame.drop_duplicates() 语法
  2. 使用 DataFrame.drop_duplicates() 方法删除重复的行
  3. drop_duplicates() 方法中设置 keep='last'
Pandas 丢掉 DataFrame 中重复的行

本教程介绍了如何使用 DataFrame.drop_duplicates() 方法从 Pandas DataFrame 中删除所有重复的行。

DataFrame.drop_duplicates() 语法

DataFrame.drop_duplicates(subset=None, 
                          keep='first', 
                          inplace=False, 
                          ignore_index=False)

它返回一个 DataFrame,删除 DataFrame 中所有重复的行。

使用 DataFrame.drop_duplicates() 方法删除重复的行

import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

df_without_duplicates = df_with_duplicates.drop_duplicates()

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

输出:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300 

它会删除所有列的所有值都相同的行。默认情况下,DataFrame 中每一列都有相同值的行才被认为是重复的。在 df_with_duplicates DataFrame 中,第一行和第五行对所有列都有相同的值,所以第五行被删除。

设置 subset 参数以仅基于特定列删除重复项

import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

df_without_duplicates = df_with_duplicates.drop_duplicates(subset=['Name'])

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

输出:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100 

在这里,我们将 Name 作为 subset 参数传给 drop_duplicates() 方法。第四行和第五行被删除,因为它们的 Name 列的值与第一列相同。

drop_duplicates() 方法中设置 keep='last'

import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

df_without_duplicates = df_with_duplicates.drop_duplicates(
    subset=['Name'], keep="last")

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

输出:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
5  302   Watch  300 

它删除了所有的行,除了最后一行与 Name 列值相同的行。

我们设置 keep=False 来删除任何一列中具有相同值的所有行。

import pandas as pd

df_with_duplicates = pd.DataFrame({
    'Id': [302, 504, 708, 103, 303, 302],
    'Name': ['Watch', 'Camera', 'Phone', 'Shoes', 'Watch', 'Watch'],
    'Cost': ["300", "400", "350", "100", "300", "300"]
})

df_without_duplicates = df_with_duplicates.drop_duplicates(
    subset=['Name'], keep=False)

print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")

print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")

输出:

DataFrame with duplicates:
    Id    Name Cost
0  302   Watch  300
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100
4  303   Watch  300
5  302   Watch  300 

DataFrame without duplicates:
    Id    Name Cost
1  504  Camera  400
2  708   Phone  350
3  103   Shoes  100 

它删除了第一、五、六行,因为它们的 Name 列都有相同的值。

Author: Suraj Joshi
Suraj Joshi avatar Suraj Joshi avatar

Suraj Joshi is a backend software engineer at Matrice.ai.

LinkedIn

相关文章 - Pandas DataFrame Row