Create a Pipeline in Python

Jay Shaw Oct 10, 2023

Create a Pipeline in Python for a Custom Dataset
Create a Pipeline in Python for a Scikit-Learn Dataset

This article demonstrates how to create a Python pipeline for machine learning, first for a custom dataset and then for an sklearn dataset.

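Create a Pipeline in Python for a Custom Dataset

We need two packages to create the pipeline: pandas to build the DataFrame and sklearn for the pipeline itself. From sklearn, we use two components, Pipeline and LinearRegression.

Below is the list of all the imports used.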

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

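Form a Dataset With Values of an Equation

The program creates a pipeline that predicts the result of an equation once the model has been trained on enough values of that equation.

The equation used here is:

c = a + 3\sqrt[3]{b}

We create a pandas DataFrame from values that satisfy this equation, with a, b, and c stored in the columns col1, col2, and col3. The snippet shows only the first two rows; the complete program further down lists all twelve.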

df = pd.DataFrame(columns=["col1", "col2", "col3"], data=[[15, 8, 21], [16, 27, 25]])
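As a rough sketch, the rows of such a dataset could also be generated from the equation instead of being typed by hand. The names a, b, and c and the choice of perfect cubes for b (so that the cube root is a whole number) are assumptions made for illustration, chosen to match the rows used in this article.

```python
import numpy as np
import pandas as pd

# Assumed inputs: a runs from 15 to 26 and b holds the cubes of 2..13,
# so the cube root of b is always a whole number.
a = np.arange(15, 27)
b = np.arange(2, 14) ** 3

# c = a + 3 * cbrt(b); rounding guards against floating-point error in cbrt.
c = a + 3 * np.rint(np.cbrt(b)).astype(int)

df = pd.DataFrame({"col1": a, "col2": b, "col3": c})
print(df.head(2))
```

Building the columns from the formula keeps the data consistent with the equation if the ranges are changed later.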

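Split the Data Into Training and Test Sets

Every machine learning model needs its data split into two unequal parts. After the split, we use the two sets to train and test the model.

The larger part is used for training, and the other part is used for testing the model.

In the code snippet below, the first 8 rows are used to train the model, and the rest are used to test it.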

learn = df.iloc[:8]
evaluate = df.iloc[8:]

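A scikit-learn pipeline works by taking values in and returning results. The values are supplied through two inputs, X and y.

In the equation used, c is a function of a and b. So, to fit the linear regression model inside the pipeline, we put the a and b values into X and the c values into y.

Note that both the learning set and the evaluation set get their own X and y. In each set, the columns for a and b form the features (X), and the column for c forms the target (y).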

learn_X = learn.drop("col3", axis=1)
learn_y = learn.col3

evaluate_X = evaluate.drop("col3", axis=1)
evaluate_y = evaluate.col3

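In the code above, the pandas drop() function removes the c column (col3) when the values are assigned to learn_X. The learn_y variable receives the values of the c column.

axis=1 refers to columns, while axis=0 refers to rows.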
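The effect of the axis argument can be seen on a small frame. The following is a minimal sketch; the three-row DataFrame and the names without_col3 and without_row0 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    columns=["col1", "col2", "col3"],
    data=[[15, 8, 21], [16, 27, 25], [17, 64, 29]],
)

# axis=1 drops a column by its label.
without_col3 = df.drop("col3", axis=1)

# axis=0 drops a row by its index label.
without_row0 = df.drop(0, axis=0)

print(list(without_col3.columns))
print(list(without_row0.index))
```

drop() returns a new DataFrame by default, so df itself still has all three columns afterwards.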

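Create a Python Pipeline and Fit Values in It

We use the Pipeline class to create a pipeline in Python and store it in a variable so that we can use it afterwards.

Here, a variable named rock is declared for this purpose.

Inside the pipeline, we must give each step a name and the model to use: ('Model for Linear Regression', LinearRegression()).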

rock = Pipeline(steps=[("Model for Linear Regression", LinearRegression())])

Once the pipeline is created, we fit it with the learning values so that the linear model inside the pipeline is trained on the provided data.

rock.fit(learn_X, learn_y)

After the pipeline is trained, the rock.predict() function predicts the values for evaluate_X.

The predicted values are stored in a new variable, evalve, and printed.

evalve = rock.predict(evaluate_X)
print(f"\n{evalve}")
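To judge how close the predictions are, they can be compared with evaluate_y. The sketch below is one way to do that. It rebuilds the article's twelve rows with a list comprehension for brevity and uses mean_absolute_error from sklearn.metrics, which the original program does not use.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline

# Compact rebuild of the twelve rows: [15, 8, 21] ... [26, 2197, 65].
df = pd.DataFrame(
    columns=["col1", "col2", "col3"],
    data=[[13 + k, k**3, 13 + 4 * k] for k in range(2, 14)],
)

learn, evaluate = df.iloc[:8], df.iloc[8:]
learn_X, learn_y = learn.drop("col3", axis=1), learn.col3
evaluate_X, evaluate_y = evaluate.drop("col3", axis=1), evaluate.col3

rock = Pipeline(steps=[("Model for Linear Regression", LinearRegression())])
rock.fit(learn_X, learn_y)
evalve = rock.predict(evaluate_X)

# Average absolute difference between the true and the predicted values.
mae = mean_absolute_error(evaluate_y, evalve)
print(f"Mean absolute error: {mae:.6f}")
```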

Let's put everything together to see how the pipeline is created and how it performs.

import pandas as pd

# import warnings
# warnings.filterwarnings('ignore')

from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression

df = pd.DataFrame(
    columns=["col1", "col2", "col3"],
    data=[
        [15, 8, 21],
        [16, 27, 25],
        [17, 64, 29],
        [18, 125, 33],
        [19, 216, 37],
        [20, 343, 41],
        [21, 512, 45],
        [22, 729, 49],
        [23, 1000, 53],
        [24, 1331, 57],
        [25, 1728, 61],
        [26, 2197, 65],
    ],
)

learn = df.iloc[:8]
evaluate = df.iloc[8:]

learn_X = learn.drop("col3", axis=1)
learn_y = learn.col3

evaluate_X = evaluate.drop("col3", axis=1)
evaluate_y = evaluate.col3

print("\n step: Here, the pipeline is formed")
rock = Pipeline(steps=[("Model for Linear Regression", LinearRegression())])
print("\n Step: Fitting the data inside")
rock.fit(learn_X, learn_y)
print("\n Searching for outcomes after evaluation")
evalve = rock.predict(evaluate_X)
print(f"\n{evalve}")

Output:

 step: Here, the pipeline is formed

 Step: Fitting the data inside

 Searching for outcomes after evaluation

[53. 57. 61. 65.]

As we can see, the pipeline predicts the exact values.
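One possible explanation for the exact match: in this particular dataset, the cube root of col2 always equals col1 - 13, so col3 = 4 * col1 - 39, which a linear model can represent exactly even though the equation is not linear in b. The sketch below inspects the fitted coefficients to check this. It rebuilds the twelve rows with a list comprehension for brevity, and the variable name model is introduced only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Compact rebuild of the twelve rows: [15, 8, 21] ... [26, 2197, 65].
df = pd.DataFrame(
    columns=["col1", "col2", "col3"],
    data=[[13 + k, k**3, 13 + 4 * k] for k in range(2, 14)],
)

learn = df.iloc[:8]
learn_X = learn.drop("col3", axis=1)
learn_y = learn.col3

rock = Pipeline(steps=[("Model for Linear Regression", LinearRegression())])
rock.fit(learn_X, learn_y)

# Look up the fitted estimator by the step name given to the pipeline.
model = rock.named_steps["Model for Linear Regression"]
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

The coefficient for col1 should come out close to 4, the one for col2 close to 0, and the intercept close to -39, up to floating-point error.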

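Create a Pipeline in Python for a Scikit-Learn Dataset

This example demonstrates how to create a pipeline in Python for a scikit-learn dataset. Running a pipeline on a large dataset differs slightly from running it on a small one.

With larger datasets, the pipeline usually needs additional steps that preprocess the data, such as scaling, before the model is fitted.

Below are the imports we need.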

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import datasets

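A dataset bundled with sklearn is used. It has several attributes, but we will use only two: data and target.

Load the Dataset and Split It Into Training and Test Sets

We load the dataset into the variable bc and store its data and target values in the variables X and y.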

bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

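After loading the dataset, we define the learning and evaluation variables. The dataset must be split into a training set and a test set.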

a_learn, a_evaluate, b_learn, b_evaluate = train_test_split(
    X, y, test_size=0.40, random_state=1, stratify=y
)

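We split the dataset into four variables: a_learn, a_evaluate, b_learn, and b_evaluate. The a_ variables hold the features (X), and the b_ variables hold the labels (y). Unlike the previous program, the split is done by the train_test_split() function.

test_size=0.4 tells the function to reserve 40% of the dataset for testing and the remaining 60% for training.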

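random_state=1 fixes the seed used to shuffle the data before splitting, so the function produces the same split, and therefore the same predictions, on every run. Any fixed integer, including 0, behaves this way; only the default random_state=None gives a different split on each run.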
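The reproducibility of a seeded split can be checked directly. The following is a minimal sketch under the assumption that comparing the returned arrays element by element is an adequate test; the helper name same_split is hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)


def same_split(seed):
    """Split twice with the same seed and report whether all parts match."""
    first = train_test_split(X, y, test_size=0.40, random_state=seed, stratify=y)
    second = train_test_split(X, y, test_size=0.40, random_state=seed, stratify=y)
    return all(np.array_equal(p, q) for p, q in zip(first, second))


# Any fixed integer seed gives a repeatable split, 0 included.
repeatable_with_1 = same_split(1)
repeatable_with_0 = same_split(0)
print(repeatable_with_1, repeatable_with_0)
```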

stratify=y ensures that each split keeps the same class proportions as y. If y contains 15% 1s and 85% 0s, stratify makes sure that both the training set and the test set contain about 15% 1s and 85% 0s.
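To see stratification at work, the share of label 1 can be compared across the full target and the two splits. This is a small sketch; the names full_share, learn_share, and evaluate_share are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)

a_learn, a_evaluate, b_learn, b_evaluate = train_test_split(
    X, y, test_size=0.40, random_state=1, stratify=y
)

# The labels are 0 and 1, so the mean is the share of label 1.
full_share = y.mean()
learn_share = b_learn.mean()
evaluate_share = b_evaluate.mean()

print(f"full: {full_share:.3f}")
print(f"learn: {learn_share:.3f}")
print(f"evaluate: {evaluate_share:.3f}")
```

The three shares should be nearly equal; small differences remain because the class counts have to be whole numbers.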

Create a Python Pipeline and Fit Values in It

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=10, max_features=5, max_depth=2, random_state=1
    ),
)

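Where,

make_pipeline() is a scikit-learn function that creates a pipeline and names its steps automatically.
StandardScaler() standardizes each feature by subtracting its mean and scaling it to unit variance.
RandomForestClassifier() is an ensemble model that draws random samples from the dataset, builds a decision tree on each sample, and lets every tree predict a result. The trees' predictions are then combined, and the class with the most support becomes the final prediction.
n_estimators is the number of decision trees to build.
max_features is the number of features considered when looking for the best split at each node.
max_depth is the maximum depth of each tree.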
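What StandardScaler() does to the data can be illustrated on a tiny array. The sketch below uses made-up values; after scaling, each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# fit_transform learns each column's mean and standard deviation,
# then subtracts the mean and divides by the standard deviation.
scaled = StandardScaler().fit_transform(values)

print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```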

After the pipeline is created, we fit the values and predict the results.

pipeline.fit(a_learn, b_learn)
y_pred = pipeline.predict(a_evaluate)
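The predictions can be compared with the held-out labels in b_evaluate to get a single quality figure. The sketch below is one way to do so, assuming accuracy is an acceptable metric for this dataset; accuracy_score comes from sklearn.metrics and is not part of the original program.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)

a_learn, a_evaluate, b_learn, b_evaluate = train_test_split(
    X, y, test_size=0.40, random_state=1, stratify=y
)

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=10, max_features=5, max_depth=2, random_state=1
    ),
)

pipeline.fit(a_learn, b_learn)
y_pred = pipeline.predict(a_evaluate)

# Fraction of test samples whose predicted label matches the true label.
accuracy = accuracy_score(b_evaluate, y_pred)
print(f"Accuracy: {accuracy:.3f}")
```

pipeline.score(a_evaluate, b_evaluate) should report the same figure, since a classifier's default score is accuracy.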

Let's look at the complete program.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn import datasets

bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

a_learn, a_evaluate, b_learn, b_evaluate = train_test_split(
    X, y, test_size=0.40, random_state=1, stratify=y
)

# Create the pipeline

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=10, max_features=5, max_depth=2, random_state=1
    ),
)


pipeline.fit(a_learn, b_learn)

y_pred = pipeline.predict(a_evaluate)

print(y_pred)

Output:

[0 0 0 1 1 1 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1
 1 1 0 1 1 1 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1
 1 1 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1
 1 1 0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 0 1 1
 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 0 1 1
 1 1 1 1 1 0]
