Extract Images From a PDF File Using Python

Lakshay Kapoor, October 10, 2023



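Python can perform many operations on external files and sources. One of them is extracting images from a PDF file, which is especially useful when the PDF is too long to go through by hand.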

This guide shows how to extract images from a PDF file in Python.

Install the `PyMuPDF` Library in Python

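To do this, the PyMuPDF library must be installed. The library lets you work with files in PDF, XPS, FB2, OpenXPS, and EPUB formats. It is a versatile library known for its high performance and rendering quality. However, it does not ship with Python. To install it, run the following command, which also installs Pillow, the package that provides the PIL module imported later in this guide.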

pip install PyMuPDF Pillow
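
To confirm that the installation worked, you can ask Python which versions it sees. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this example, and it assumes Python 3.8 or newer, where importlib.metadata is available.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "PyMuPDF" and "Pillow" are the distribution names used in the pip command above.
for name in ("PyMuPDF", "Pillow"):
    print(name, installed_version(name) or "is not installed")
```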

Extract Images From a PDF File in Python

Extracting the images from a PDF file is a step-by-step process:

First, import all the required libraries.

import fitz
import io
from PIL import Image

Then, define the path of the file to extract images from, and open the file with the open() function of the fitz module.

file_path = "randomfile.pdf"
open_file = fitz.open(file_path)

Next, iterate over every page of the PDF file and check whether the page contains any images.

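for page_number in range(len(open_file)):
    # Fetch the page from the document opened above (open_file).
    page = open_file[page_number]
    # get_images() was named getImageList() in older PyMuPDF releases.
    list_image = page.get_images()

    if list_image:
        print(f"{len(list_image)} images found on page {page_number}")
    else:
        print("No images found on page", page_number)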

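In this step, the get_images() function (named getImageList() in older PyMuPDF releases) returns every image on the page as a list of tuples, one tuple per image.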
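
To make the tuple format more concrete, here is a small sketch that picks a few fields out of one entry. The field order shown (xref, smask, width, height, ...) is an assumption based on the PyMuPDF documentation and should be checked against the version you have installed. The helper describe_image_entry and the sample tuple are invented for illustration and do not come from a real PDF.

```python
def describe_image_entry(entry):
    """Pick the commonly used fields out of one image tuple.

    Assumed field order (check the PyMuPDF docs for your version):
    xref, smask, width, height, bpc, colorspace, ...
    """
    xref, smask, width, height = entry[:4]
    return {"xref": xref, "width": width, "height": height, "has_mask": smask > 0}


# A made-up tuple in the assumed layout, standing in for one item of page.get_images().
sample_entry = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im0", "DCTDecode")
print(describe_image_entry(sample_entry))
```

Only the first field, the xref, is needed for the extraction step that follows.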

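Then, the extract_image() function (named extractImage() in older PyMuPDF releases) returns additional information about each image, such as its size and file extension. This step runs as a loop nested inside the page loop.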

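# Runs inside the page loop, so page and open_file are already defined.
for image_number, img in enumerate(page.get_images(), start=1):
    # The first item of each tuple is the image's xref (cross-reference number).
    xref = img[0]
    # extract_image() was named extractImage() in older PyMuPDF releases.
    image_base = open_file.extract_image(xref)
    bytes_image = image_base["image"]  # raw image bytes
    ext_image = image_base["ext"]  # file extension, for example "png"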
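
The dictionary returned for each image can be inspected before anything is written to disk. The sketch below assumes the dictionary carries the keys "ext", "width", "height", and "image", which is how the PyMuPDF documentation describes it; the helper summarize_extracted and the sample dictionary are hypothetical and only stand in for a real result.

```python
def summarize_extracted(info):
    """Build a one-line description from an extract_image()-style dictionary."""
    size_in_bytes = len(info["image"])
    return f'{info["ext"]} image, {info["width"]}x{info["height"]} px, {size_in_bytes} bytes'


# Hypothetical dictionary with the keys used here; real values would come from
# calling extract_image(xref) on the opened document.
sample_info = {"ext": "png", "width": 640, "height": 480, "image": b"\x89PNG fake bytes"}
print(summarize_extracted(sample_info))
```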

After combining all these steps into a single program, you can easily extract every image from a PDF file.

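Now, suppose the randomfile.pdf file has 5 pages and contains only one image, on the last (fifth) page. Because page numbers are counted from 0, the last page is reported as page 4, and the output looks like this.

No images found on page 0
No images found on page 1
No images found on page 2
No images found on page 3
1 images found on page 4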
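
The steps above can be combined in more than one way. The following is a minimal sketch that also writes each image to disk, assuming a recent PyMuPDF release where the methods are spelled get_images() and extract_image() (older releases used getImageList() and extractImage()). The function names, the extracted_images folder, and the file naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def extract_images(document, out_dir):
    """Save every image in `document` to `out_dir` and return the saved paths.

    `document` is expected to behave like an opened PyMuPDF document:
    it supports len(), indexing by page number, and extract_image(xref).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for page_number in range(len(document)):
        page = document[page_number]
        for image_number, img in enumerate(page.get_images(), start=1):
            xref = img[0]  # first tuple item: the image's cross-reference number
            info = document.extract_image(xref)
            # Hypothetical naming scheme: page index, image counter, original extension.
            target = out_dir / f"page{page_number}_image{image_number}.{info['ext']}"
            target.write_bytes(info["image"])
            saved.append(target)
    return saved


def extract_images_from_pdf(pdf_path, out_dir="extracted_images"):
    """Open `pdf_path` with PyMuPDF and extract its images."""
    import fitz  # imported here so extract_images() stays usable without PyMuPDF

    with fitz.open(pdf_path) as open_file:
        return extract_images(open_file, out_dir)
```

Calling extract_images_from_pdf("randomfile.pdf") would then return the list of files that were written. Keeping the loop in extract_images() separate from the code that opens the file is a design choice that makes the loop easy to try with a stand-in object.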

Author: Lakshay Kapoor

Lakshay Kapoor is a final year B.Tech Computer Science student at Amity University Noida. He is familiar with programming languages and their real-world applications (Python/R/C++). Deeply interested in the area of Data Sciences and Machine Learning.
