Split a Pandas DataFrame

Suraj Joshi | January 30, 2023 | January 22, 2021
  1. Split a DataFrame Using Row Indices
  2. Split a DataFrame Using the groupby() Method
  3. Split a DataFrame Using the sample() Method

This tutorial explains how to split a DataFrame into multiple smaller DataFrames using row indices, the DataFrame.groupby() method, and the DataFrame.sample() method.

We will use the apprix_df DataFrame below to demonstrate each approach.

import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish","Rabindra","Manish","Samir","Binam"],
    'Post': ["CEO","CTO","System Admin","Consultant","Engineer"],
    'Qualification':["MBA","MS","MCA","PhD","BE"]
})

print("Apprix Team DataFrame:")
print(apprix_df,"\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin           MCA
3     Samir    Consultant           PhD
4     Binam      Engineer            BE

Split a DataFrame Using Row Indices

import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish","Rabindra","Manish","Samir","Binam"],
    'Post': ["CEO","CTO","System Admin","Consultant","Engineer"],
    'Qualification':["MBA","MS","MCA","PhD","BE"]
})

print("Apprix Team DataFrame:")
print(apprix_df,"\n")

apprix_1 = apprix_df.iloc[:2,:]
apprix_2 = apprix_df.iloc[2:,:]

print("The DataFrames formed by splitting the Apprix Team DataFrame are:","\n")
print(apprix_1,"\n")
print(apprix_2,"\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin           MCA
3     Samir    Consultant           PhD
4     Binam      Engineer            BE

The DataFrames formed by splitting the Apprix Team DataFrame are:

       Name Post Qualification
0     Anish  CEO           MBA
1  Rabindra  CTO            MS

     Name          Post Qualification
2  Manish  System Admin           MCA
3   Samir    Consultant           PhD
4   Binam      Engineer            BE

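This splits the apprix_df DataFrame into two parts using row indices. The first part contains the first two rows of apprix_df, and the second part contains the last three rows.

The iloc attribute lets us specify which rows go into each split. [:2,:] selects the rows before index 2 (the row at index 2 is excluded) and all columns of the DataFrame. So apprix_df.iloc[:2,:] selects the first two rows of apprix_df, at indices 0 and 1. Note that iloc selects by integer position, which matches the index labels here because apprix_df uses the default 0-based index.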
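
The same slicing idea extends to more than two parts. The following is a minimal sketch, not part of the original example: split_by_rows and its chunk_size parameter are hypothetical names introduced here for illustration. It cuts the DataFrame into consecutive pieces of at most chunk_size rows using iloc.

```python
import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
    'Post': ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
    'Qualification': ["MBA", "MS", "MCA", "PhD", "BE"]
})


def split_by_rows(df, chunk_size):
    """Hypothetical helper: return consecutive row slices of at most chunk_size rows."""
    return [
        df.iloc[start:start + chunk_size, :]
        for start in range(0, len(df), chunk_size)
    ]


# Five rows with chunk_size=2 give pieces of 2, 2, and 1 rows.
chunks = split_by_rows(apprix_df, 2)

for chunk in chunks:
    print(chunk, "\n")
```

Because iloc slices by position, the last piece simply holds whatever rows remain when the row count is not a multiple of chunk_size.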

Split a DataFrame Using the groupby() Method

import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish","Rabindra","Manish","Samir","Binam"],
    'Post': ["CEO","CTO","System Admin","Consultant","Engineer"],
    'Qualification':["MBA","MS","MS","PhD","MS"]
})

print("Apprix Team DataFrame:")
print(apprix_df,"\n")

groups = apprix_df.groupby(apprix_df.Qualification)
ms_df = groups.get_group("MS")
mba_df = groups.get_group("MBA")
phd_df = groups.get_group("PhD")

print("Group with Qualification MS:")
print(ms_df,"\n")

print("Group with Qualification MBA:")
print(mba_df,"\n")

print("Group with Qualification PhD:")
print(phd_df,"\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
3     Samir    Consultant           PhD
4     Binam      Engineer            MS

Group with Qualification MS:
       Name          Post Qualification
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
4     Binam      Engineer            MS

Group with Qualification MBA:
    Name Post Qualification
0  Anish  CEO           MBA

Group with Qualification PhD:
    Name        Post Qualification
3  Samir  Consultant           PhD

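This splits the apprix_df DataFrame into three parts based on the values in the Qualification column (MBA, MS, and PhD). Rows with the same Qualification value are placed in the same group.

The groupby() method forms the groups from the values in the Qualification column. We then use the get_group() method to extract the rows that belong to each group.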
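
When the group names are not known in advance, calling get_group() once per name is not practical. As a hedged alternative sketch, we can iterate over the groupby object, which yields (group key, sub-DataFrame) pairs, and collect them in a dictionary. The variable name groups_by_qualification is an assumption made for this illustration.

```python
import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
    'Post': ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
    'Qualification': ["MBA", "MS", "MS", "PhD", "MS"]
})

# Iterating over a groupby object yields (key, sub-DataFrame) pairs,
# so one comprehension builds a DataFrame for every distinct value.
groups_by_qualification = {
    qualification: group
    for qualification, group in apprix_df.groupby("Qualification")
}

for qualification, group in groups_by_qualification.items():
    print("Group with Qualification " + qualification + ":")
    print(group, "\n")
```

Each dictionary value keeps the original row labels, so groups_by_qualification["MS"] contains the rows with index 1, 2, and 4.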

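Split a DataFrame Using the sample() Method

We can form a new DataFrame by randomly sampling rows from an existing DataFrame with the sample() method. The frac parameter sets the fraction of rows to draw from the parent DataFrame.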

import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish","Rabindra","Manish","Samir","Binam"],
    'Post': ["CEO","CTO","System Admin","Consultant","Engineer"],
    'Qualification':["MBA","MS","MS","PhD","MS"]
})

print("Apprix Team DataFrame:")
print(apprix_df,"\n")

random_df = apprix_df.sample(frac=0.4,random_state=60)

print("Random split from the Apprix Team DataFrame:")
print(random_df)

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
3     Samir    Consultant           PhD
4     Binam      Engineer            MS

Random split from the Apprix Team DataFrame:
    Name      Post Qualification
0  Anish       CEO           MBA
4  Binam  Engineer            MS

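This randomly samples 40% of the rows from the apprix_df DataFrame (2 of its 5 rows) and displays the DataFrame formed from the sampled rows. Setting random_state ensures that the same random sample is drawn every time the code runs.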
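
sample() on its own returns only the sampled rows. To get a complete split into two DataFrames, one common approach is to drop the sampled rows from the parent by their index labels. The sketch below assumes the index labels are unique, as they are with the default index here; remaining_df is a name introduced for this illustration.

```python
import pandas as pd

apprix_df = pd.DataFrame({
    'Name': ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
    'Post': ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
    'Qualification': ["MBA", "MS", "MS", "PhD", "MS"]
})

# Randomly draw 40% of the rows (2 of 5).
sampled_df = apprix_df.sample(frac=0.4, random_state=60)

# Everything that was not sampled; relies on unique index labels.
remaining_df = apprix_df.drop(sampled_df.index)

print("Sampled rows:")
print(sampled_df, "\n")

print("Remaining rows:")
print(remaining_df)
```

Together the two DataFrames cover every row of apprix_df exactly once, which is the usual pattern for a simple train/test style split.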

Author: Suraj Joshi
Suraj Joshi is a backend software engineer at Matrice.ai.
