Extract Images From a PDF File Using Python

Lakshay Kapoor, October 10, 2023


Install the PyMuPDF Library in Python
Extract Images From a PDF File in Python

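Python can perform many operations on external files and sources. One of them is extracting images from a PDF file, which is especially useful when the PDF is too long to go through by hand.

This guide shows you how to extract images from a PDF file in Python.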

Install the `PyMuPDF` Library in Python

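To perform this operation, you must install the PyMuPDF library. It helps users work with files in PDF, XPS, FB2, OpenXPS, and EPUB formats. It is a versatile library, known for its high performance and rendering quality. It does not ship with Python, however. To install it, together with Pillow, which the import step below also uses, run the following command.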

pip install PyMuPDF Pillow
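
To confirm that the install worked, one option is to ask Python's packaging metadata which versions are present. This is only a minimal sketch: the helper name `installed_version` is made up for illustration, and it relies on nothing beyond the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as they appear in the pip command above.
    for name in ("PyMuPDF", "Pillow"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'not installed'}")
```

If either line reports "not installed", rerunning the pip command in the same environment that runs your script is usually the first thing to try.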

Extract Images From a PDF File in Python

Extracting images from a PDF file is a step-by-step process:

First, import all the required libraries.

import fitz
import io
from PIL import Image

Then, define the path of the file to extract images from, and open the file with the open() function of the fitz module.

file_path = "randomfile.pdf"
open_file = fitz.open(file_path)
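
If the path might be wrong, it can help to fail early with a clear message before handing the file to PyMuPDF. The sketch below is one way to do that, under the assumption that a missing file should raise `FileNotFoundError`; the function name `open_pdf` is hypothetical, while `fitz.open()` is the same call used above.

```python
from pathlib import Path


def open_pdf(file_path):
    """Open a PDF with PyMuPDF after checking that the path points to a file."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so the path check still works if PyMuPDF is not installed yet.
    import fitz

    # Pass a plain string, which every PyMuPDF release accepts.
    return fitz.open(str(path))
```

Calling `open_pdf("randomfile.pdf")` returns the same kind of document object as `fitz.open()`, so `len()` of the result still gives the number of pages.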

After that, iterate over every page of the PDF file and check whether the page contains any images.

for page_number in range(len(open_file)):
    # Index the document opened above to get one page
    page = open_file[page_number]
    # get_images() was called getImageList() in older PyMuPDF releases
    list_image = page.get_images()

    if list_image:
        print(f"{len(list_image)} images found on page {page_number}")
    else:
        print("No images found on page", page_number)

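In this step, the get_images() function (named getImageList() in older PyMuPDF releases) returns a list of tuples, one for each image referenced on the page. The first item of each tuple is the image's xref, the number that identifies the image object inside the PDF.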
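
Each of these tuples packs several fields, and only the first one, the xref, is needed in the next step. The sketch below uses a hand-written sample tuple (the values are invented, not read from a real PDF) to show how the fields could be unpacked. The field order follows the PyMuPDF documentation for the default `full=False` call, and the `ImageEntry` name is an assumption made for illustration.

```python
from typing import NamedTuple


class ImageEntry(NamedTuple):
    """Field layout of one item from page.get_images() with full=False."""
    xref: int            # cross-reference number of the image object
    smask: int           # xref of the soft mask, 0 if there is none
    width: int           # width in pixels
    height: int          # height in pixels
    bpc: int             # bits per color component
    colorspace: str      # for example "DeviceRGB"
    alt_colorspace: str  # alternate colorspace, often empty
    name: str            # symbolic name used on the page
    filter: str          # compression filter, for example "DCTDecode"


# Invented sample values, shaped like one real list item.
sample = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode")

entry = ImageEntry(*sample)
print(entry.xref, entry.width, entry.height)
```

Wrapping the tuple this way is optional; indexing with `img[0]` gives the same xref with less ceremony.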

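Then, the extract_image() function (extractImage() in older PyMuPDF releases) returns a dictionary with additional information about each image, such as its size and file extension. This step runs as a loop nested inside the page loop.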

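for image_number, img in enumerate(page.get_images(), start=1):
    # The first item of each tuple is the image's xref
    xref = img[0]

    # extract_image() returns a dictionary describing the image
    image_base = open_file.extract_image(xref)
    bytes_image = image_base["image"]  # raw image data

    ext_image = image_base["ext"]  # file extension, such as "png" or "jpeg"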

Once all these steps are combined into a single program, you can easily extract every image from a PDF file.
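
Put together, the steps might look like the sketch below. It goes one step beyond the snippets above by writing each image to disk, which the article does not show, so the output directory, the file-naming scheme, and the function names `image_filename` and `extract_images` are all assumptions. It uses the current PyMuPDF method names `get_images()` and `extract_image()`, and it writes the raw image bytes directly instead of passing them through Pillow.

```python
from pathlib import Path


def image_filename(page_index, image_number, ext):
    """Build a file name such as 'page5_img1.png' (pages shown 1-based)."""
    return f"page{page_index + 1}_img{image_number}.{ext}"


def extract_images(pdf_path, output_dir="extracted_images"):
    """Save every image referenced in the PDF and return the written paths."""
    # Imported lazily so image_filename() stays usable without PyMuPDF.
    import fitz

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []

    with fitz.open(pdf_path) as pdf:
        for page_index in range(len(pdf)):
            page = pdf[page_index]
            for image_number, img in enumerate(page.get_images(), start=1):
                xref = img[0]  # first tuple item is the xref
                base_image = pdf.extract_image(xref)
                name = image_filename(page_index, image_number, base_image["ext"])
                target = out / name
                target.write_bytes(base_image["image"])
                written.append(target)
    return written
```

For example, `extract_images("randomfile.pdf")` would return a list of `Path` objects for the files it wrote. An image that is referenced on several pages is saved once per page in this sketch.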

Now, suppose randomfile.pdf has 5 pages and contains a single image, on the last page. Because page numbers start at 0, the fifth page is reported as page 4, so the output looks like this.

No images found on page 0
No images found on page 1
No images found on page 2
No images found on page 3
1 images found on page 4

Author: Lakshay Kapoor

Lakshay Kapoor is a final year B.Tech Computer Science student at Amity University Noida. He is familiar with programming languages and their real-world applications (Python/R/C++). Deeply interested in the area of Data Sciences and Machine Learning.
