Extracting line items from scanned PDFs with pdfplumber

A scanned expense receipt is an image with no vector text layer, so pdfplumber’s page.extract_words() returns an empty list and any coordinate parser silently emits zero rows. The fix is to overlay an OCR-generated text layer before spatial parsing, never during it. This page is the scanned-document deep dive under the parent pdfplumber Line-Item Parsing guide, which owns coordinate-space extraction for documents that already carry text; here the job is getting reliable, confidence-scored text onto an image-only page first. It sits inside the broader Receipt Ingestion & OCR Data Extraction framework and hands typed rows downstream to the Core Policy Architecture & Taxonomy Design rule engine.

Why Standard Approaches Fail

Pointing a coordinate parser straight at a scanned PDF fails in three named ways, and every one produces a silent false negative rather than a visible crash — the worst outcome for an audit trail:

Empty-text-layer collapse. pdfplumber reads PDF operators and vector glyphs only; an image-only page has none, so extract_words() returns []. An unguarded parser reads that as a clean, zero-line-item receipt and lets it pass policy evaluation unexamined.
Rasterized coordinate drift. Once OCR does run, word boxes come back in image-pixel space at whatever DPI the page was rendered. Comparing those pixel coordinates against pdfplumber’s point-based page.height for header/footer filtering, or against a fixed y_tolerance, drops, merges, or splits rows depending on the scan resolution, so the same vendor’s receipt parses differently at 200 DPI and 300 DPI.
OCR tokenization noise. Tesseract inserts zero-width characters, detaches currency symbols from amounts ($ and 1,240.00 as separate tokens), and misreads a smudged . as , in the amount. Naive string splitting then produces 1 and 240 as two line items, or drops the decimal entirely. This is a class of error the Receipt Error Categorization stage exists to triage rather than absorb blindly.
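
The coordinate-drift failure can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper name `px_to_pt` and the sample offsets are assumptions, and the one fact it relies on is that a PDF point is 1/72 inch.

```python
def px_to_pt(pixels: float, dpi: int) -> float:
    """Convert a raster pixel offset to PDF points (1 pt = 1/72 inch)."""
    return pixels * 72.0 / dpi


# Hypothetical example: the same printed row sits 2 inches below the page top.
top_px_200 = 2 * 200   # 400 px when rendered at 200 DPI
top_px_300 = 2 * 300   # 600 px when rendered at 300 DPI

# Raw pixel values disagree, so a fixed margin or tolerance behaves differently per scan.
assert top_px_200 != top_px_300

# After scaling, both land on the same point coordinate (144 pt).
assert px_to_pt(top_px_200, 200) == px_to_pt(top_px_300, 300) == 144.0
```

Any threshold expressed in points (a header margin, a row tolerance) therefore only means the same thing across scans once every OCR box has been scaled this way.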

Architecture & Algorithm

The pipeline is strictly two-stage. Stage one gates each page: if it already has a native text layer it is deferred to the native parser; if it is image-only it is rasterized and OCR’d. Stage two clusters the resulting words into visual rows by vertical proximity, parses monetary values with decimal.Decimal to avoid float drift, and carries the real per-word Tesseract confidence into every row so a below-threshold receipt is flagged, not trusted. pdfplumber owns page geometry and rasterization; Tesseract supplies text and confidence; both live in one point-based coordinate space after a single 72 / dpi scale, which closes the coordinate-drift failure above.
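
As a brief illustration of why stage two parses money with decimal.Decimal, the following sketch sums the same hypothetical charges as floats and as decimals. The values are made up for the example; only the standard library is used.

```python
from decimal import Decimal

# Hypothetical line items: ten charges of 0.10 each.
float_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0"))

print(float_total == 1.0)                 # False: binary floats accumulate rounding error
print(decimal_total == Decimal("1.00"))   # True: decimal arithmetic is exact here
```

Note that the Decimal values are built from strings. Constructing them from floats would carry the binary rounding error into the decimal value.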

Memory stays flat because pages are rasterized and released one at a time inside the pdfplumber context manager, and the extractor is a generator — resident set does not grow with document length. Latency is dominated by OCR (roughly 150–400 ms/page at 300 DPI), not by the O(N log N) row clustering.

from __future__ import annotations

import logging
import re
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import Iterator, List

import pdfplumber
import pytesseract
from pytesseract import Output

# --- Structured audit logging: every emitted row is reconstructable from the log ---
logger = logging.getLogger("expense_audit.scanned_line_items")
if not logger.handlers:
    logger.setLevel(logging.INFO)
    _handler = logging.StreamHandler()
    # Messages are already JSON objects, so embed them unquoted to keep each log line valid JSON.
    _handler.setFormatter(
        logging.Formatter('{"ts":"%(asctime)s","lvl":"%(levelname)s","msg":%(message)s}')
    )
    logger.addHandler(_handler)

# Tesseract per-word confidence below this is treated as untrustworthy text.
MIN_WORD_CONF = 70.0
# --oem 1 = LSTM engine; --psm 6 = assume a uniform block of text (tabular receipts).
TESS_CONFIG = "--oem 1 --psm 6"

# Require two decimal places so "1,200" can never be split into "1" and "200".
AMOUNT_RE = re.compile(r"(?i)(?:[$€£¥]|USD|EUR|GBP|CAD|AUD)?\s*(\d[\d,]*\.\d{2})")
CURRENCY_RE = re.compile(r"(?i)[$€£¥]|USD|EUR|GBP|CAD|AUD")
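# Explicit escapes (ZWSP, ZWNJ, ZWJ, word joiner, BOM) instead of invisible literal characters.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")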
_SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP", "¥": "JPY"}


@dataclass(frozen=True)
class ScannedLineItem:
    """Immutable, policy-ready row with audit-grade OCR provenance."""

    description: str
    amount: Decimal
    currency: str
    page_index: int
    row_index: int
    ocr_confidence: float           # minimum per-word Tesseract confidence in the row
    policy_flags: List[str] = field(default_factory=list)


@dataclass
class _Word:
    text: str
    top: float                      # PDF points from the page top
    x0: float
    conf: float


def _normalize_currency(token: str) -> str:
    return _SYMBOL_TO_CODE.get(token, token.upper())


def _ocr_page_words(page, dpi: int) -> List[_Word]:
    """Rasterize a scanned page and OCR it into coordinate-bearing words.

    pdfplumber renders the embedded image; Tesseract returns text plus a real
    per-word confidence. Pixel boxes are scaled to PDF points so header/footer
    filtering against page.height stays in one coordinate space.
    """
    scale = 72.0 / dpi              # image pixels -> PDF points
    image = page.to_image(resolution=dpi).original   # PIL.Image
    data = pytesseract.image_to_data(image, config=TESS_CONFIG, output_type=Output.DICT)

    words: List[_Word] = []
    for i, raw in enumerate(data["text"]):
        text = ZERO_WIDTH.sub("", raw).strip()
        conf = float(data["conf"][i])
        if not text or conf < 0:    # conf == -1 marks non-text regions
            continue
        words.append(
            _Word(text=text, top=data["top"][i] * scale, x0=data["left"][i] * scale, conf=conf)
        )
    return words


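def _cluster_rows(words: List[_Word], y_tolerance: float) -> List[List[_Word]]:
    """Group words into visual rows by vertical proximity; deterministic ordering."""
    if not words:
        return []
    # Sort top-to-bottom to form rows; x0 and text only break ties deterministically.
    words = sorted(words, key=lambda w: (w.top, w.x0, w.text))
    rows: List[List[_Word]] = [[words[0]]]
    for word in words[1:]:
        if abs(word.top - rows[-1][0].top) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # OCR baselines jitter by a fraction of a point, so restore left-to-right
    # reading order inside each row before the caller joins the text.
    return [sorted(row, key=lambda w: w.x0) for row in rows]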


def extract_scanned_line_items(
    pdf_path: str,
    y_tolerance: float = 4.0,
    dpi: int = 300,
    header_footer_margin: float = 50.0,
) -> Iterator[ScannedLineItem]:
    """Yield policy-ready line items from a scanned (image-only) expense PDF.

    Pages are OCR'd and released one at a time, so resident memory is flat
    regardless of page count. Native-text pages are skipped for the native parser.
    """
    with pdfplumber.open(pdf_path) as pdf:
        for page_idx, page in enumerate(pdf.pages):
            if page.extract_words():
                logger.info('{"event":"native_text_layer","page":%d}', page_idx)
                continue

            words = _ocr_page_words(page, dpi=dpi)
            top_limit = header_footer_margin
            bottom_limit = float(page.height) - header_footer_margin
            body = [w for w in words if top_limit <= w.top <= bottom_limit]

            for row_idx, row in enumerate(_cluster_rows(body, y_tolerance)):
                text = " ".join(w.text for w in row)
                match = AMOUNT_RE.search(text)
                if not match:
                    continue

                try:
                    amount = Decimal(match.group(1).replace(",", ""))
                except InvalidOperation:
                    logger.debug('{"event":"bad_amount","page":%d,"row":%d}', page_idx, row_idx)
                    continue

                currency_match = CURRENCY_RE.search(text)
                currency = _normalize_currency(currency_match.group(0)) if currency_match else "USD"
                description = AMOUNT_RE.sub("", text).strip(" .-")
                row_conf = min(w.conf for w in row)

                flags: List[str] = []
                if row_conf < MIN_WORD_CONF:
                    flags.append("LOW_OCR_CONFIDENCE")
                if amount <= 0:
                    flags.append("NON_POSITIVE_AMOUNT")
                if re.search(r"(?i)\b(tip|gratuity|alcohol|minibar|personal)\b", description):
                    flags.append("NON_REIMBURSABLE_CATEGORY")

                item = ScannedLineItem(
                    description=description or "(unlabeled line item)",
                    amount=amount,
                    currency=currency,
                    page_index=page_idx,
                    row_index=row_idx,
                    ocr_confidence=row_conf,
                    policy_flags=flags,
                )
                logger.info(
                    '{"event":"emit","page":%d,"row":%d,"amount":"%s","conf":%.1f,"flags":"%s"}',
                    page_idx, row_idx, item.amount, row_conf, ",".join(flags) or "-",
                )
                yield item

The confidence gate is the audit-critical line: a row whose weakest word scored below MIN_WORD_CONF is still emitted, but tagged LOW_OCR_CONFIDENCE so the routing layer can divert it to human review instead of approving a misread amount. Because ScannedLineItem is frozen, no downstream stage can rewrite an amount after extraction, preserving the chain of custody.
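
The chain-of-custody claim can be checked directly. The sketch below uses a hypothetical stand-in class, `FrozenRow`, reduced to three fields; it is not the guide's ScannedLineItem, and the tuple-typed flags are an assumption made for the example.

```python
from dataclasses import FrozenInstanceError, dataclass
from decimal import Decimal
from typing import Tuple


@dataclass(frozen=True)
class FrozenRow:
    """Hypothetical stand-in for ScannedLineItem, reduced to three fields."""

    amount: Decimal
    ocr_confidence: float
    policy_flags: Tuple[str, ...] = ()   # a tuple, so flags cannot be edited in place either


row = FrozenRow(Decimal("1240.00"), 62.0, ("LOW_OCR_CONFIDENCE",))

try:
    row.amount = Decimal("12.40")        # a downstream stage trying to "correct" the amount
except FrozenInstanceError:
    tamper_blocked = True
else:
    tamper_blocked = False

assert tamper_blocked
assert row.amount == Decimal("1240.00")
```

One caveat: frozen only blocks attribute rebinding. ScannedLineItem stores policy_flags as a list, which can still be appended to in place; a tuple closes that gap if flag immutability matters.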

Step-by-Step Integration

Install the OCR toolchain and pin it. The parser needs the Tesseract binary plus the Python wrapper, and both are part of the deterministic contract — an engine change moves word boundaries across an entire historical corpus. Pin exact versions and verify:
```
tesseract --version            # pin the major.minor you test against
pip install "pdfplumber==0.11.4" "pytesseract==0.3.13"
```
Normalize the scan before OCR. Deskew, denoise, and render at a consistent DPI upstream; borderline-legible thermal receipts are covered in Optimizing Tesseract for faded receipt text, not by this parser. Feeding raw, skewed images inflates the LOW_OCR_CONFIDENCE rate and re-introduces coordinate drift.

Confirm the empty-layer gate fires. Assert that a known image-only page yields rows rather than silently returning nothing:

from decimal import Decimal

# extract_scanned_line_items is the generator defined in the module above.
rows = list(extract_scanned_line_items("fixtures/scanned_hotel_folio.pdf"))
assert rows, "scanned page produced no line items: OCR overlay did not run"
assert all(isinstance(r.amount, Decimal) for r in rows)

Verify confidence propagation, not a proxy. Every row must carry a real Tesseract score, and low-confidence rows must be flagged rather than dropped:

assert all(0 <= r.ocr_confidence <= 100 for r in rows)
# Compare against the module constant so the test tracks the configured threshold.
assert all("LOW_OCR_CONFIDENCE" in r.policy_flags
           for r in rows if r.ocr_confidence < MIN_WORD_CONF)

Wire it behind the batch queue. Call the generator from a stateless worker keyed on a correlation ID, following Building async batch queues for high-volume receipt uploads; never OCR synchronously in a request path, since a single dense page can block for hundreds of milliseconds.
Route by flag. Send rows tagged LOW_OCR_CONFIDENCE or NON_POSITIVE_AMOUNT to the manual queue owned by Receipt Error Categorization; let clean rows flow to policy evaluation and, from there, into Automated Policy Validation & Anomaly Flagging.
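
The final routing step might look like the following minimal sketch. The queue names and the helpers `route` and `partition` are hypothetical. It also assumes that rows flagged only NON_REIMBURSABLE_CATEGORY continue to policy evaluation, since the step names just two flags for manual review.

```python
from typing import Iterable, Sequence

# Flags that force human review, as listed in the routing step above.
REVIEW_FLAGS = frozenset({"LOW_OCR_CONFIDENCE", "NON_POSITIVE_AMOUNT"})


def route(policy_flags: Iterable[str]) -> str:
    """Return a hypothetical queue name for a row based on its flags."""
    return "manual_review" if REVIEW_FLAGS.intersection(policy_flags) else "policy_evaluation"


def partition(rows: Sequence) -> dict:
    """Split rows (anything with a .policy_flags attribute) by destination queue."""
    queues = {"manual_review": [], "policy_evaluation": []}
    for row in rows:
        queues[route(row.policy_flags)].append(row)
    return queues
```

Keeping the review flags in one frozenset means a new review-worthy flag is a one-line, auditable change rather than a scattered edit.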

Edge Cases & Gotchas

Edge condition	Failure it causes	Mitigation
Page has a native text layer but you force OCR anyway	Double text, halved confidence, duplicate rows	Gate on `page.extract_words()` first; only rasterize truly empty pages
Currency symbol OCR’d as a separate token	`$` and `1,240.00` split; currency lost from the amount row	Cluster by Y-proximity so the symbol rejoins the row before the regex runs
Amount misread with no decimal point (`124000` for `1,240.00`)	Two-orders-of-magnitude overstatement passes policy	`AMOUNT_RE` requires `\.\d{2}`; rows without a valid decimal are skipped, not guessed
DPI changed between runs	Row boundaries shift; deterministic replay breaks	Pin `dpi`; scale pixels to points with `72 / dpi` so `y_tolerance` is resolution-independent
Rotated or skewed scan	Words cluster across the wrong rows	Deskew upstream; validate against a rotated-page fixture in CI
Multi-currency receipt (FX line + local total)	First `CURRENCY_RE` hit wins for the whole row	Keep currency per row, not per document; reconcile totals in the policy engine, not the parser
Faint watermark OCR’d as text	Phantom low-value rows	The `MIN_WORD_CONF` gate flags them `LOW_OCR_CONFIDENCE` for review rather than trusting them

Silent rule changes violate SOX Section 404 control expectations, so the confidence score and every flag are emitted on each row and logged, keeping the extraction decision reproducible from the pinned configuration.
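
The missing-decimal row in the table can be verified in isolation. This sketch copies the AMOUNT_RE pattern from the extractor; the helper `parse_amount` and the sample strings are assumptions added for illustration.

```python
import re
from decimal import Decimal
from typing import Optional

# Same pattern as AMOUNT_RE in the extractor: two decimal places are mandatory.
AMOUNT_RE = re.compile(r"(?i)(?:[$€£¥]|USD|EUR|GBP|CAD|AUD)?\s*(\d[\d,]*\.\d{2})")


def parse_amount(row_text: str) -> Optional[Decimal]:
    """Return the first well-formed amount in a row, or None if there is none."""
    match = AMOUNT_RE.search(row_text)
    return Decimal(match.group(1).replace(",", "")) if match else None


assert parse_amount("Room charge $ 1,240.00") == Decimal("1240.00")
assert parse_amount("Room charge 124000") is None     # decimal point lost: skipped, not guessed
assert parse_amount("Room charge 1,200") is None      # never split into "1" and "200"
```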

FAQ

Why does pdfplumber return empty strings on my scanned receipts?

Because the page contains only a raster image and no vector text or PDF text operators, extract_words() and extract_text() have nothing to read. A scanned PDF must pass through OCR to gain a searchable text layer first — either an overlay step (ocrmypdf) or, as shown here, rasterizing the page with page.to_image() and OCR’ing it with pytesseract.image_to_data so you also capture per-word confidence.

Should I OCR the whole PDF or just the failed pages?

Only the pages that lack a text layer. Corporate documents are frequently mixed — a native invoice with a scanned receipt stapled behind it. Gate each page on page.extract_words() and skip OCR where native text already exists; re-OCR’ing a native page wastes CPU and can lower quality by rasterizing crisp vector glyphs.
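
A per-page gate for mixed documents could be sketched as follows. The function `split_pages_by_text_layer` and the `FakePage` test double are hypothetical; the only pdfplumber behavior assumed is that a page exposes extract_words() and returns an empty list when there is no text layer.

```python
from typing import Iterable, List, Tuple


def split_pages_by_text_layer(pages: Iterable) -> Tuple[List[int], List[int]]:
    """Return (native_page_indexes, scanned_page_indexes).

    Each page only needs an extract_words() method, which pdfplumber's Page
    objects provide; an empty result is treated as "image-only".
    """
    native: List[int] = []
    scanned: List[int] = []
    for index, page in enumerate(pages):
        (native if page.extract_words() else scanned).append(index)
    return native, scanned


class FakePage:
    """Hypothetical test double so the sketch runs without a PDF on disk."""

    def __init__(self, words):
        self._words = words

    def extract_words(self):
        return self._words


# A native invoice page followed by a scanned receipt page.
mixed = [FakePage([{"text": "Invoice"}]), FakePage([])]
assert split_pages_by_text_layer(mixed) == ([0], [1])
```

With a real file, the same function would be called on pdf.pages inside a pdfplumber.open(...) context.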

What DPI should I render at for reliable amount parsing?

300 DPI is the reliable default for printed receipts and keeps OCR near 150–400 ms/page. Drop to 200 DPI only for clean, large-font folios where latency matters; raise to 400 DPI for dense thermal itemizations where the decimal point is small. Whatever you choose, pin it — the 72 / dpi scale makes y_tolerance resolution-independent only if dpi itself does not drift between runs.

How do I keep line-item output deterministic after a Tesseract upgrade?

Treat the OCR engine and its language data as version-pinned dependencies and keep golden-file fixtures. An LSTM model change can shift a word box or a confidence score, so run the fixture suite on every upgrade and compare emitted rows byte-for-byte; a diff means you must re-baseline the historical corpus before promoting the new engine.
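
One way to implement the byte-for-byte comparison is to hash a canonical serialization of the emitted rows. The sketch below is an assumption about how such a check might look, not the guide's fixture suite; `rows_digest` and the sample rows are hypothetical.

```python
import hashlib
import json
from decimal import Decimal
from typing import Iterable, Mapping


def rows_digest(rows: Iterable[Mapping]) -> str:
    """Hash rows in a canonical JSON form so any changed field changes the digest."""
    canonical = "\n".join(
        json.dumps(row, sort_keys=True, default=str, ensure_ascii=True) for row in rows
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical rows emitted before and after an engine upgrade.
golden = [{"description": "Room charge", "amount": Decimal("1240.00"), "ocr_confidence": 93.0}]
same = [{"ocr_confidence": 93.0, "amount": Decimal("1240.00"), "description": "Room charge"}]
drifted = [{"description": "Room charge", "amount": Decimal("1240.00"), "ocr_confidence": 91.0}]

assert rows_digest(same) == rows_digest(golden)       # key order does not matter
assert rows_digest(drifted) != rows_digest(golden)    # a shifted confidence score is a diff
```

For dataclass rows such as ScannedLineItem, dataclasses.asdict(row) would produce the mapping to pass in.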

Why are my currency symbols splitting from the amount?

Tesseract often tokenizes $ or € as its own word, and can inject zero-width characters between glyphs. The row-clustering step reassembles the symbol and the number into one string by vertical proximity before AMOUNT_RE runs, and ZERO_WIDTH strips the invisible characters — so match on the joined row text, never on individual OCR tokens.
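
The difference between matching per token and matching on the joined row can be shown with a few made-up tokens. In this sketch the token list, the simplified pattern `AMOUNT_WITH_SYMBOL`, and the exact set of zero-width code points are all assumptions for illustration.

```python
import re

# Explicit escapes for common zero-width code points (assumed set; extend as needed).
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")
AMOUNT_WITH_SYMBOL = re.compile(r"([$€£¥])\s*(\d[\d,]*\.\d{2})")

# Hypothetical OCR tokens for one visual row: detached symbol, zero-width noise in the amount.
tokens = ["Room", "charge", "$", "1,2\u200b40.00"]

# Matching token by token never sees the symbol and the number together.
assert not any(AMOUNT_WITH_SYMBOL.search(t) for t in tokens)

# Clean each token, join the row, then match once on the whole string.
row_text = " ".join(ZERO_WIDTH.sub("", t) for t in tokens)
match = AMOUNT_WITH_SYMBOL.search(row_text)
assert match is not None
assert match.groups() == ("$", "1,240.00")
```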

Related Guides

pdfplumber Line-Item Parsing: the parent guide covering coordinate extraction for text-bearing PDFs
Optimizing Tesseract for faded receipt text: tuning the OCR stage that feeds this parser
Building async batch queues for high-volume receipt uploads: concurrency wrapper for the page generator
Classifying OCR extraction errors for manual review: triage path for low-confidence rows
Receipt Ingestion & OCR Data Extraction: the parent intake framework this stage plugs into