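Complete Guide to Getting PDF Coordinates in Python: From the Basics to Practical Use

"I want to get the position of specific text or elements in a PDF file." "I want to automate processing using the coordinate information inside a PDF." Many developers have needs like these.

PDF coordinate extraction is a key technique for automating document analysis and data extraction. It is what lets you read the amount field of an invoice automatically or locate the input positions of a form.

This article explains how to get coordinate information from PDFs with Python in a beginner-friendly way, from basic usage through to practical applications.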

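Basic Concepts of PDF Coordinates
1. How the PDF Coordinate System Works
2. When You Need Coordinates
Basic Coordinate Extraction with PyPDF2
1. Installing and Setting Up PyPDF2
2. How to Get Text Coordinate Information
High-Precision Coordinate Extraction with pdfplumber
Fast Coordinate Extraction with PyMuPDF
Practical Examples
1. Automatically Extracting Invoice Amounts
2. Structured Extraction of Table Data
Coordinate Conversion and Page Coordinate Systems
1. Converting to Other Coordinate Systems
2. Handling Page Rotation
Error Handling and Debugging
1. Common Errors and How to Handle Them
2. Visualization for Debugging
Performance Optimization
1. Processing Large PDFs Efficiently
Summary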

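Basic Concepts of PDF Coordinates

How the PDF Coordinate System Works

PDF files use their own coordinate system.

The basic coordinate system

The origin (0, 0) is normally the bottom-left corner of the page
The X axis increases to the right
The Y axis increases upward
The unit is the point (1 point = 1/72 inch)

Example page sizes

A4: about 595 × 842 points
Letter: 612 × 792 points
B5 (JIS): about 516 × 729 points

Note that some libraries report positions measured from the top-left corner instead, for example the top and bottom values in pdfplumber and the coordinates returned by PyMuPDF. Knowing which origin a value uses is what lets you get accurate positions.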
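Because every coordinate is expressed in points, it often helps to convert to physical units when checking a value against a ruler or a print specification. The sketch below only uses the 1 point = 1/72 inch relationship described above; the helper names (points_to_mm, mm_to_points, points_to_inches) are made up for this illustration.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def points_to_inches(points):
    """Convert PDF points to inches."""
    return points / POINTS_PER_INCH


def points_to_mm(points):
    """Convert PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def mm_to_points(mm):
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# A4 paper is 210 mm x 297 mm, which rounds to the familiar 595 x 842 points
print(round(mm_to_points(210)), round(mm_to_points(297)))
```

Running the helpers against known paper sizes is a quick way to confirm that a page size reported by a library is plausible.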

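When You Need Coordinates

PDF coordinate extraction is useful in situations such as the following:

Automating data extraction

Automatically reading amounts and dates from invoices
Locating the signature field in a contract
Structured extraction of table data

Form processing

Checking the position of input fields
Getting the state of checkboxes
Getting the coordinates of buttons and links

These techniques are commonly used to streamline routine document work.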

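Basic Coordinate Extraction with PyPDF2

Installing and Setting Up PyPDF2

Let's start with PyPDF2, the most basic of the libraries covered here. Note that PyPDF2 is no longer actively developed; maintenance continues under the name pypdf, which provides the same PdfReader interface used below.

Installation

pip install PyPDF2

Loading a PDF file

import PyPDF2

# Open the PDF file
with open('sample.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)

    # Check the number of pages
    num_pages = len(pdf_reader.pages)
    print(f"Number of pages: {num_pages}")

    # Get the first page
    first_page = pdf_reader.pages[0]

This basic code loads a PDF file and reads its basic information.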

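How to Get Text Coordinate Information

With PyPDF2 you can get the page size and the text on a page. Keep in mind that extract_text() on its own returns only the text, without positions.

Example: extracting text and the page size

import PyPDF2

def get_text_coordinates(pdf_path):
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        page = pdf_reader.pages[0]  # First page

        # Get the page size
        page_box = page.mediabox
        width = float(page_box.width)
        height = float(page_box.height)

        print(f"Page size: {width} × {height} points")

        # Extract the text
        text = page.extract_text()
        print(f"Extracted text: {text}")

        return width, height, text

# Usage
width, height, text = get_text_coordinates('sample.pdf')

PyPDF2 is limited when it comes to detailed coordinate information, so pdfplumber is recommended for more advanced processing.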
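If you want to stay with PyPDF2, approximate positions can still be collected through the visitor_text argument of extract_text(), available in recent PyPDF2 releases. The library calls the visitor with the text, the current transformation matrix, the text matrix, the font dictionary and the font size. The following is a minimal sketch, not a complete solution: the helper names make_position_visitor and collect_text_positions are made up for this example, and it assumes the last two entries of the text matrix are a good enough approximation of the position, which may not hold for PDFs that apply additional transformations.

```python
def make_position_visitor(records):
    """Return a visitor that appends (text, x, y) for each non-empty text chunk."""
    def visitor(text, cm, tm, font_dict, font_size):
        # tm is the text matrix; tm[4] and tm[5] hold its translation part
        if text.strip():
            records.append((text, tm[4], tm[5]))
    return visitor


def collect_text_positions(pdf_path, page_index=0):
    """Collect (text, x, y) tuples from one page (hypothetical helper)."""
    import PyPDF2  # imported here so the visitor above can be used on its own

    records = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page = reader.pages[page_index]
        page.extract_text(visitor_text=make_position_visitor(records))
    return records
```

Calling collect_text_positions('sample.pdf') returns a list of tuples. The chunks follow the PDF's internal text operations, so a single visible word may arrive in several pieces.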

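High-Precision Coordinate Extraction with pdfplumber

Features and Installation

pdfplumber is one of the most widely used libraries for PDF coordinate extraction.

Installation

pip install pdfplumber

Main features

High-precision text coordinates
Automatic table detection
Position information for images and shapes
An intuitive API

Code for Getting Detailed Coordinate Information

Here is a concrete way to get coordinates with pdfplumber.

Basic coordinate extraction

import pdfplumber

def extract_text_with_coordinates(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # Process the first page
        page = pdf.pages[0]

        # Get the page size
        width = page.width
        height = page.height
        print(f"Page size: {width} × {height}")

        # Get the coordinates of each character.
        # y0/y1 are measured from the bottom of the page;
        # char['top'] and char['bottom'] are measured from the top.
        for char in page.chars:
            print(f"Character: '{char['text']}'")
            print(f"Coordinates: x0={char['x0']:.2f}, y0={char['y0']:.2f}")
            print(f"             x1={char['x1']:.2f}, y1={char['y1']:.2f}")
            print(f"Font: {char['fontname']}, size: {char['size']}")
            print("---")

# Usage
extract_text_with_coordinates('sample.pdf')

This code gives you detailed position information for every single character.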
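pdfplumber describes the vertical position of a character in two ways: y0 and y1 are distances from the bottom of the page, while top and bottom are distances from the top of the page. Mixing the two is an easy source of bugs, so here is a small sketch of how the two pairs relate. The function names are made up for this illustration.

```python
def to_top_based(y0, y1, page_height):
    """Convert (y0, y1), measured from the page bottom, to (top, bottom)."""
    return page_height - y1, page_height - y0


def to_bottom_based(top, bottom, page_height):
    """Convert (top, bottom), measured from the page top, to (y0, y1)."""
    return page_height - bottom, page_height - top


# A character whose box spans y0=700 to y1=712 on an 842-point-high page
top, bottom = to_top_based(700, 712, 842)
print(top, bottom)  # 130 142
```

The two conversions are inverses of each other, so converting back and forth returns the original pair.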

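Searching for the Coordinates of Specific Text

This is a practical way to find the text you are looking for and get its coordinates.

Example: searching for specific text

import pdfplumber

def find_text_coordinates(pdf_path, target_text):
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            # Check every word on the page.
            # Words from extract_words() carry 'top' and 'bottom'
            # (measured from the top of the page), not 'y0' and 'y1'.
            for word in page.extract_words():
                if target_text in word['text']:
                    print(f"Found on page {page_num + 1}:")
                    print(f"Text: '{word['text']}'")
                    print(f"Coordinates: x0={word['x0']:.2f}, top={word['top']:.2f}")
                    print(f"             x1={word['x1']:.2f}, bottom={word['bottom']:.2f}")

                    return {
                        'page': page_num + 1,
                        'text': word['text'],
                        'x0': word['x0'],
                        'top': word['top'],
                        'x1': word['x1'],
                        'bottom': word['bottom']
                    }

    print(f"'{target_text}' was not found")
    return None

# Usage ('合計金額' means "total amount")
coordinates = find_text_coordinates('invoice.pdf', '合計金額')

With this function you can efficiently locate specific items, such as the amount field on an invoice.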
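Finding a label is usually only half the job: the value you want tends to sit to the right of the label on the same line. The following sketch shows one way to pick it out. It assumes a list of dictionaries shaped like pdfplumber's extract_words() output (with text, x0, x1, top and bottom keys); the function name find_value_right_of and the tolerance of 3 points are choices made for this example, not part of any library.

```python
def find_value_right_of(words, label, y_tolerance=3.0):
    """Return the nearest word to the right of `label` on the same line, or None."""
    for anchor in words:
        if label not in anchor['text']:
            continue
        # Keep words that start after the label ends and sit on the same line
        candidates = [
            w for w in words
            if w is not anchor
            and w['x0'] >= anchor['x1']
            and abs(w['top'] - anchor['top']) <= y_tolerance
        ]
        if candidates:
            return min(candidates, key=lambda w: w['x0'])
    return None


# Sample data in the shape of extract_words() output
words = [
    {'text': '合計金額', 'x0': 72.0, 'x1': 120.0, 'top': 500.0, 'bottom': 512.0},
    {'text': '¥12,800', 'x0': 400.0, 'x1': 450.0, 'top': 500.5, 'bottom': 512.5},
    {'text': '備考', 'x0': 72.0, 'x1': 96.0, 'top': 530.0, 'bottom': 542.0},
]
print(find_value_right_of(words, '合計金額')['text'])  # ¥12,800
```

The tolerance absorbs small baseline differences between fonts; a real document may need a larger or smaller value.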

Fast Coordinate Extraction with PyMuPDF

Features of PyMuPDF

PyMuPDF (fitz) is a library known for its fast processing.

Installation

pip install PyMuPDF

Main advantages

Very fast processing
Can also get image coordinates
Supports vector shapes
Efficient memory usage

Implementing Coordinate Extraction with PyMuPDF

Here is a code example for fast coordinate extraction.

Basic implementation

import fitz  # PyMuPDF

def get_coordinates_with_pymupdf(pdf_path):
    # Open the PDF document
    doc = fitz.open(pdf_path)

    # Get the first page
    page = doc[0]

    # Get the page size
    rect = page.rect
    print(f"Page size: {rect.width} × {rect.height}")

    # Get text blocks and their coordinates.
    # PyMuPDF measures coordinates from the top-left corner of the page.
    text_dict = page.get_text("dict")

    for block in text_dict["blocks"]:
        if "lines" in block:  # Text block
            for line in block["lines"]:
                for span in line["spans"]:
                    bbox = span["bbox"]  # Bounding box (x0, y0, x1, y1)
                    text = span["text"]

                    print(f"Text: '{text}'")
                    print(f"Coordinates: x0={bbox[0]:.2f}, y0={bbox[1]:.2f}")
                    print(f"             x1={bbox[2]:.2f}, y1={bbox[3]:.2f}")
                    print("---")

    doc.close()

# Usage
get_coordinates_with_pymupdf('sample.pdf')
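A phrase you care about is often split across several spans, each with its own bounding box. To treat the phrase as one region, you can merge the boxes into the smallest rectangle that contains them all. This is a minimal sketch that works on plain (x0, y0, x1, y1) tuples such as the span bbox values above; union_bbox is a name chosen for this example.

```python
def union_bbox(bboxes):
    """Return the smallest (x0, y0, x1, y1) box that contains every input box."""
    bboxes = list(bboxes)
    if not bboxes:
        raise ValueError("at least one bounding box is required")
    x0 = min(b[0] for b in bboxes)
    y0 = min(b[1] for b in bboxes)
    x1 = max(b[2] for b in bboxes)
    y1 = max(b[3] for b in bboxes)
    return (x0, y0, x1, y1)


# Two spans on the same line merged into one region
print(union_bbox([(10, 20, 50, 32), (55, 21, 90, 33)]))  # (10, 20, 90, 33)
```

Because it only takes minimums and maximums, the same helper works whether the origin is at the top-left or the bottom-left, as long as all boxes use the same convention.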

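Getting the Coordinates of Images and Shapes

With PyMuPDF you can also get the coordinates of elements other than text.

Code for getting image coordinates

import fitz  # PyMuPDF

def get_image_coordinates(pdf_path):
    doc = fitz.open(pdf_path)
    page = doc[0]

    # Get the list of images (full=True returns the complete entries
    # that get_image_bbox() expects)
    image_list = page.get_images(full=True)

    for img_index, img in enumerate(image_list):
        # Get the details of the image
        xref = img[0]  # Cross-reference number of the image object
        bbox = page.get_image_bbox(img)

        print(f"Image {img_index + 1} (xref {xref}):")
        print(f"Coordinates: x0={bbox.x0:.2f}, y0={bbox.y0:.2f}")
        print(f"             x1={bbox.x1:.2f}, y1={bbox.y1:.2f}")
        print(f"Size: {bbox.width:.2f} × {bbox.height:.2f}")
        print("---")

    doc.close()

# Usage
get_image_coordinates('document_with_images.pdf')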
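Once you have bounding boxes for both images and text, a common next step is to ask which pieces of text overlap a given image, for example to find a caption or a stamp. The sketch below shows a plain rectangle-overlap test on (x0, y0, x1, y1) tuples. The function names and the sample values are made up for this illustration.

```python
def bboxes_overlap(a, b):
    """Return True if two (x0, y0, x1, y1) boxes share any area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def labels_overlapping(items, region):
    """Return the labels of (label, bbox) items whose box overlaps `region`."""
    return [label for label, bbox in items if bboxes_overlap(bbox, region)]


image_box = (100, 100, 300, 250)
spans = [
    ("Figure 1", (120, 240, 180, 260)),   # partly inside the image box
    ("Page header", (72, 30, 200, 45)),   # well above the image
]
print(labels_overlapping(spans, image_box))  # ['Figure 1']
```

Boxes that only touch along an edge are not counted as overlapping here; use <= in place of < if you want them included.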

Practical Examples

Automatically Extracting Invoice Amounts

This is an amount extraction routine for invoices, a task that comes up often in real business settings.

Invoice analysis example

import pdfplumber
import re

def extract_invoice_amounts(pdf_path):
    amounts = []

    # Amount pattern: yen sign followed by digits and commas
    amount_pattern = r'¥[\d,]+'

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Search word by word so that each match keeps its coordinates
            for word in page.extract_words():
                if re.search(amount_pattern, word['text']):
                    amount_info = {
                        'text': word['text'],
                        # Center of the word; y is measured from the top of the page
                        'x': (word['x0'] + word['x1']) / 2,
                        'y': (word['top'] + word['bottom']) / 2,
                        'page': page.page_number
                    }
                    amounts.append(amount_info)

    return amounts

# Usage
invoice_amounts = extract_invoice_amounts('invoice.pdf')
for amount in invoice_amounts:
    print(f"Amount: {amount['text']}, coordinates: ({amount['x']:.1f}, {amount['y']:.1f})")
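The extracted amounts are still strings such as '¥12,800'. To sum or compare them, they need to become numbers. Here is a small sketch that does the conversion with a regular expression. It assumes yen amounts without decimals and also accepts the full-width yen sign (￥), which appears in some Japanese documents; parse_yen and AMOUNT_RE are names made up for this example.

```python
import re

# Half-width or full-width yen sign, optional spaces, then digits and commas
AMOUNT_RE = re.compile(r'[¥￥]\s*([\d,]+)')


def parse_yen(text):
    """Return the first yen amount in `text` as an int, or None if there is none."""
    match = AMOUNT_RE.search(text)
    if not match:
        return None
    digits = match.group(1).replace(',', '')
    return int(digits) if digits else None


print(parse_yen('¥12,800'))  # 12800
```

Returning None instead of raising keeps the function easy to use inside a loop over many words, most of which are not amounts.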

Structured Extraction of Table Data

This is a way to extract tabular data accurately with the help of coordinate information.

Table extraction code

import pdfplumber

def extract_table_with_coordinates(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]

        # Detect tables automatically; each result knows its bounding box
        tables = page.find_tables()

        if tables:
            table = tables[0]  # Process the first table

            print("Table data and coordinates:")
            print(f"Table bounding box: {table.bbox}")
            for row_index, row in enumerate(table.extract()):
                for col_index, cell in enumerate(row):
                    if cell:  # Skip empty cells
                        print(f"Row {row_index}, column {col_index}: '{cell}'")

            # Stricter detection based on the ruling lines drawn in the PDF
            table_settings = {
                "vertical_strategy": "lines",
                "horizontal_strategy": "lines"
            }

            detailed_table = page.extract_table(table_settings)
            return detailed_table

        return None

# Usage
table_data = extract_table_with_coordinates('table_document.pdf')
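When a table has no ruling lines, automatic detection can miss it. In that case you can rebuild the rows yourself from word coordinates: words with nearly the same vertical position belong to the same row, and sorting by x0 gives the column order. The sketch below assumes dictionaries shaped like pdfplumber's extract_words() output; group_words_into_rows and the 3-point tolerance are choices made for this example.

```python
def group_words_into_rows(words, y_tolerance=3.0):
    """Group words into rows by their 'top' value and return rows of text."""
    rows = []
    for word in sorted(words, key=lambda w: (w['top'], w['x0'])):
        # Same row if the word is close to the first word of the current row
        if rows and abs(word['top'] - rows[-1][0]['top']) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order each row left to right and keep only the text
    return [[w['text'] for w in sorted(row, key=lambda w: w['x0'])] for row in rows]


words = [
    {'text': 'Item', 'x0': 72, 'top': 100.0},
    {'text': 'Qty', 'x0': 200, 'top': 100.4},
    {'text': 'Price', 'x0': 300, 'top': 99.8},
    {'text': 'Apple', 'x0': 72, 'top': 120.0},
    {'text': '3', 'x0': 200, 'top': 120.2},
    {'text': '¥300', 'x0': 300, 'top': 120.1},
]
print(group_words_into_rows(words))
# [['Item', 'Qty', 'Price'], ['Apple', '3', '¥300']]
```

This does not handle cells that span several lines; it is meant as a starting point for simple, single-line rows.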

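Coordinate Conversion and Page Coordinate Systems

Converting to Other Coordinate Systems

This is how to convert PDF coordinates to pixel coordinates or other units.

Coordinate conversion example

def convert_coordinates(pdf_coords, page_height, dpi=72):
    """
    Convert PDF coordinates to screen coordinates.
    PDF uses a bottom-left origin; screens use a top-left origin.
    """
    x_pdf, y_pdf = pdf_coords

    # Flip the Y coordinate (bottom-left origin → top-left origin)
    y_screen = page_height - y_pdf

    # DPI conversion (only when needed)
    if dpi != 72:  # 1 point = 1/72 inch, so 72 DPI maps points to pixels 1:1
        scale_factor = dpi / 72
        x_screen = x_pdf * scale_factor
        y_screen = y_screen * scale_factor
    else:
        x_screen = x_pdf

    return x_screen, y_screen

# Usage
pdf_x, pdf_y = 100, 200
page_height = 842  # Height of A4
screen_x, screen_y = convert_coordinates((pdf_x, pdf_y), page_height)
print(f"PDF coordinates: ({pdf_x}, {pdf_y}) → screen coordinates: ({screen_x}, {screen_y})")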
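Converting a whole bounding box takes a little more care than converting a single point: after the Y flip, the box's upper edge (the larger PDF y value) becomes the smaller pixel value. The sketch below converts a bottom-left-origin box to a pixel rectangle, which is handy when cropping a rendered page image. The function name pdf_bbox_to_pixels and the sample values are made up for this illustration.

```python
def pdf_bbox_to_pixels(bbox, page_height, dpi=150):
    """Convert a bottom-left-origin (x0, y0, x1, y1) box in points to
    a top-left-origin (left, top, right, bottom) box in pixels."""
    x0, y0, x1, y1 = bbox
    scale = dpi / 72  # pixels per point
    left = x0 * scale
    top = (page_height - y1) * scale      # upper edge of the box
    right = x1 * scale
    bottom = (page_height - y0) * scale   # lower edge of the box
    return tuple(round(v) for v in (left, top, right, bottom))


# A box near the top of an A4 page, rendered at 144 DPI (2 pixels per point)
print(pdf_bbox_to_pixels((72, 700, 144, 712), page_height=842, dpi=144))
# (144, 260, 288, 284)
```

The result can be passed to image libraries that expect a (left, top, right, bottom) crop box.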

Handling Page Rotation

This is how to correct coordinates on a rotated PDF page.

Rotation correction code

def adjust_for_rotation(x, y, rotation, page_width, page_height):
    """
    Correct coordinates for page rotation.
    x, y: point on the unrotated page (bottom-left origin)
    rotation: clockwise rotation in degrees; PDF pages rotate in multiples of 90
    page_width, page_height: size of the unrotated page
    """
    rotation %= 360
    if rotation == 0:
        return x, y
    elif rotation == 90:
        return y, page_width - x
    elif rotation == 180:
        return page_width - x, page_height - y
    elif rotation == 270:
        return page_height - y, x
    else:
        raise ValueError(f"Unsupported rotation: {rotation} (expected a multiple of 90)")

# Usage
original_x, original_y = 100, 200
rotation_angle = 90  # Rotated by 90 degrees
page_w, page_h = 595, 842

adjusted_x, adjusted_y = adjust_for_rotation(
    original_x, original_y, rotation_angle, page_w, page_h
)
print(f"Coordinates after rotation correction: ({adjusted_x}, {adjusted_y})")
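Sometimes you need the opposite direction: a position picked on the displayed (rotated) page has to be mapped back to the unrotated page. The sketch below inverts the 90, 180 and 270 degree cases, assuming clockwise rotation, a bottom-left origin, and that page_width and page_height describe the unrotated page. The function name unrotate_point is made up for this example.

```python
def unrotate_point(u, v, rotation, page_width, page_height):
    """Map a point (u, v) on the rotated page back to the unrotated page.

    Assumes clockwise rotation in multiples of 90 degrees and a
    bottom-left origin; page_width/page_height are the unrotated size.
    """
    rotation %= 360
    if rotation == 0:
        return u, v
    if rotation == 90:
        return page_width - v, u
    if rotation == 180:
        return page_width - u, page_height - v
    if rotation == 270:
        return v, page_height - u
    raise ValueError(f"Unsupported rotation: {rotation}")


# (100, 200) on an A4 page appears at (200, 495) after a 90-degree rotation
print(unrotate_point(200, 495, 90, 595, 842))  # (100, 200)
```

Checking that a forward conversion followed by this function returns the original point is a simple way to confirm that both directions use the same convention.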

Error Handling and Debugging

Common Errors and How to Handle Them

Here are problems that often come up during PDF coordinate extraction, along with ways to handle them.

Implementation with error handling

import pdfplumber

def safe_coordinate_extraction(pdf_path):
    try:
        with pdfplumber.open(pdf_path) as pdf:
            if len(pdf.pages) == 0:
                raise ValueError("The PDF contains no pages")

            page = pdf.pages[0]

            # Collect coordinates ('top'/'bottom' are measured from the page top)
            coordinates = []
            for word in page.extract_words():
                coordinates.append({
                    'text': word['text'],
                    'x0': word['x0'],
                    'top': word['top'],
                    'x1': word['x1'],
                    'bottom': word['bottom']
                })

            return coordinates

    except FileNotFoundError:
        print(f"File not found: {pdf_path}")
        return None
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None

# Usage
result = safe_coordinate_extraction('sample.pdf')
if result:
    print(f"Extracted {len(result)} elements")

Visualization for Debugging

This code lets you check coordinate information visually.

Coordinate visualization code

import pdfplumber
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def visualize_coordinates(pdf_path, output_image='coordinates.png'):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]

        # Create the figure
        fig, ax = plt.subplots(1, 1, figsize=(8, 11))

        # Match the axes to the page size; the Y axis is inverted because
        # 'top' and 'bottom' are measured from the top of the page
        ax.set_xlim(0, page.width)
        ax.set_ylim(page.height, 0)
        ax.set_aspect('equal')

        # Visualize the coordinates of each word
        for word in page.extract_words():
            x0, top, x1, bottom = word['x0'], word['top'], word['x1'], word['bottom']

            # Draw the bounding box
            rect = patches.Rectangle(
                (x0, top), x1 - x0, bottom - top,
                linewidth=1, edgecolor='red', facecolor='none'
            )
            ax.add_patch(rect)

            # Draw the text (non-Latin text needs a font that supports it)
            ax.text(x0, top, word['text'], fontsize=6, ha='left', va='bottom')

        plt.title('PDF coordinate visualization')
        plt.xlabel('X coordinate')
        plt.ylabel('Y coordinate (from top)')
        plt.savefig(output_image, dpi=150, bbox_inches='tight')
        plt.show()

# Usage
visualize_coordinates('sample.pdf')

Performance Optimization

Processing Large PDFs Efficiently

This is how to improve performance when working with large PDF files.

Optimized example

import pdfplumber

def optimized_coordinate_extraction(pdf_path, target_pages=None):
    """
    Coordinate extraction with memory efficiency in mind.
    """
    coordinates = []

    with pdfplumber.open(pdf_path) as pdf:
        total_pages = len(pdf.pages)
        pages_to_process = range(total_pages) if target_pages is None else target_pages

        for page_num in pages_to_process:
            if 0 <= page_num < total_pages:
                page = pdf.pages[page_num]

                # Extract only what is needed
                for word in page.extract_words():
                    # Store compact tuples instead of full dicts (saves memory)
                    coordinates.append((
                        word['text'],
                        round(word['x0'], 2),
                        round(word['top'], 2),
                        round(word['x1'], 2),
                        round(word['bottom'], 2)
                    ))

                # Release the objects pdfplumber cached for this page
                page.flush_cache()

    return coordinates

# Usage (process only the first 5 pages)
coords = optimized_coordinate_extraction('large_document.pdf', target_pages=range(5))