Split a Pandas DataFrame

Suraj Joshi | January 30, 2023 | January 22, 2021
  1. Split a DataFrame Using Row Indices
  2. Split a DataFrame Using the groupby() Method
  3. Split a DataFrame Using the sample() Method

This tutorial explains how to split a DataFrame into multiple smaller DataFrames using row indices, the DataFrame.groupby() method, and the DataFrame.sample() method.

We will use the apprix_df DataFrame below to demonstrate each approach.

import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish","Rabindra","Manish","Samir","Binam"],
    'Post': ["CEO","CTO","System Admin","Consultant","Engineer"],
    'Qualification':["MBA","MS","MCA","PhD","BE"]
})

print("Apprix Team DataFrame:")
print(apprix_df,"\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin           MCA
3     Samir    Consultant           PhD
4     Binam      Engineer            BE

Split a DataFrame Using Row Indices

import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish","Rabindra","Manish","Samir","Binam"],
    'Post': ["CEO","CTO","System Admin","Consultant","Engineer"],
    'Qualification':["MBA","MS","MCA","PhD","BE"]
})

print("Apprix Team DataFrame:")
print(apprix_df,"\n")

apprix_1 = apprix_df.iloc[:2,:]
apprix_2 = apprix_df.iloc[2:,:]

print("The DataFrames formed by splitting the Apprix Team DataFrame are:","\n")
print(apprix_1,"\n")
print(apprix_2,"\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin           MCA
3     Samir    Consultant           PhD
4     Binam      Engineer            BE

The DataFrames formed by splitting the Apprix Team DataFrame are:

       Name Post Qualification
0     Anish  CEO           MBA
1  Rabindra  CTO            MS

     Name          Post Qualification
2  Manish  System Admin           MCA
3   Samir    Consultant           PhD
4   Binam      Engineer            BE

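This splits the DataFrame apprix_df into two parts by row position. The first part contains the first two rows of apprix_df, and the second part contains the last three rows.

We specify the rows for each split in the iloc indexer. [:2,:] selects the rows before position 2 (the row at position 2 is excluded) and all columns of the DataFrame. Therefore, apprix_df.iloc[:2,:] selects the first two rows of apprix_df, those at positions 0 and 1. Likewise, apprix_df.iloc[2:,:] selects every row from position 2 onward.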
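
The same slicing idea extends to more than two pieces. The following is a minimal sketch, assuming a hypothetical helper named split_by_positions (it is not part of pandas) that cuts a DataFrame at a list of row positions using iloc. The cut positions [2, 4] are chosen only for illustration.

```python
import pandas as pd


def split_by_positions(df, positions):
    """Split df into consecutive pieces at the given row positions.

    Hypothetical helper for illustration; not a pandas function.
    Assumes positions are sorted and lie between 0 and len(df).
    """
    bounds = [0, *positions, len(df)]
    # Each piece covers rows [start, stop); stop is excluded, as with iloc.
    return [df.iloc[start:stop, :] for start, stop in zip(bounds[:-1], bounds[1:])]


apprix_df = pd.DataFrame({
    'Name': ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
    'Post': ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
    'Qualification': ["MBA", "MS", "MCA", "PhD", "BE"]
})

# Cut before position 2 and before position 4, giving three pieces.
parts = split_by_positions(apprix_df, [2, 4])

for part in parts:
    print(part, "\n")
```

Each piece keeps the row labels of the original DataFrame; calling reset_index(drop=True) on a piece would renumber it from 0 if that is preferred.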

Split a DataFrame Using the groupby() Method

import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish","Rabindra","Manish","Samir","Binam"],
    'Post': ["CEO","CTO","System Admin","Consultant","Engineer"],
    'Qualification':["MBA","MS","MS","PhD","MS"]
})

print("Apprix Team DataFrame:")
print(apprix_df,"\n")

groups = apprix_df.groupby(apprix_df.Qualification)
ms_df = groups.get_group("MS")
mba_df=groups.get_group("MBA")
phd_df=groups.get_group("PhD")

print("Group with Qualification MS:")
print(ms_df,"\n")

print("Group with Qualification MBA:")
print(mba_df,"\n")

print("Group with Qualification PhD:")
print(phd_df,"\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
3     Samir    Consultant           PhD
4     Binam      Engineer            MS

Group with Qualification MS:
       Name          Post Qualification
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
4     Binam      Engineer            MS

Group with Qualification MBA:
    Name Post Qualification
0  Anish  CEO           MBA

Group with Qualification PhD:
    Name        Post Qualification
3  Samir  Consultant           PhD

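This splits apprix_df into three parts based on the values of the Qualification column. Rows with the same Qualification value are placed in the same group.

The groupby() method forms the groups from the values of the Qualification column. We then use the get_group() method to extract the rows that belong to each group.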
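
When the group names are not known in advance, calling get_group() once per name becomes impractical. As a minimal sketch of one alternative, iterating over the grouped object yields each key together with its sub-DataFrame, so all groups can be collected into a dictionary. The variable name frames_by_qualification is an assumption made for this example.

```python
import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
    'Post': ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
    'Qualification': ["MBA", "MS", "MS", "PhD", "MS"]
})

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
frames_by_qualification = {
    qualification: group
    for qualification, group in apprix_df.groupby("Qualification")
}

for qualification, group in frames_by_qualification.items():
    print("Group with Qualification", qualification)
    print(group, "\n")
```

Looking up frames_by_qualification["MS"] then gives the same rows as groups.get_group("MS") in the listing above.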

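Split a DataFrame Using the sample() Method

We can form a new DataFrame by randomly sampling rows from an existing DataFrame with the sample() method. Its frac parameter sets the fraction of rows to draw from the parent DataFrame.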

import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish","Rabindra","Manish","Samir","Binam"],
    'Post': ["CEO","CTO","System Admin","Consultant","Engineer"],
    'Qualification':["MBA","MS","MS","PhD","MS"]
})

print("Apprix Team DataFrame:")
print(apprix_df,"\n")

random_df = apprix_df.sample(frac=0.4,random_state=60)

print("Random split from the Apprix Team DataFrame:")
print(random_df)

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
3     Samir    Consultant           PhD
4     Binam      Engineer            MS

Random split from the Apprix Team DataFrame:
    Name      Post Qualification
0  Anish       CEO           MBA
4  Binam  Engineer            MS

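This randomly samples 40% of the rows of apprix_df (2 of its 5 rows) and displays the DataFrame formed from the sampled rows. Setting random_state ensures that the same random sample is drawn on every run.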
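
sample() returns only the sampled rows. To turn it into a full two-way split, one option is to drop the sampled index labels from the original DataFrame and keep what is left. The sketch below assumes the index labels are unique (true for the default integer index used here); the names sampled_df and remaining_df are chosen for illustration.

```python
import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
    'Post': ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
    'Qualification': ["MBA", "MS", "MS", "PhD", "MS"]
})

# Randomly draw 40% of the rows; random_state makes the draw repeatable.
sampled_df = apprix_df.sample(frac=0.4, random_state=60)

# Drop the sampled index labels to obtain the rows that were not drawn.
# This relies on the index labels being unique.
remaining_df = apprix_df.drop(sampled_df.index)

print("Sampled rows:")
print(sampled_df, "\n")

print("Remaining rows:")
print(remaining_df)
```

Together the two DataFrames contain every row of apprix_df exactly once, which is the usual shape of a random train/test style split.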

Author: Suraj Joshi

Suraj Joshi is a backend software engineer at Matrice.ai.
