Tesseract OCR Configuration for Expense Report Auditing & Policy Violation Detection

Deterministic Tesseract configuration turns receipt character recognition from a probabilistic guess into an auditable control point, so that low-confidence reads are quarantined instead of silently poisoning the general ledger.

Within the broader Receipt Ingestion & OCR Data Extraction framework, this component owns the character-recognition contract: how a normalized image is turned into text plus per-field confidence, and how that confidence decides whether a record advances or halts. It assumes images are already preprocessed and hands high-confidence text to pdfplumber line-item parsing for tabular extraction, delegates orchestration and backpressure to Async batch processing, routes failures to Receipt error categorization, and defers the hardest binarization work on degraded imagery to Optimizing Tesseract for faded receipt text. What follows is the Tesseract stage that produces confidence-gated, audit-ready text for those downstream stages.

Problem Framing & Root Causes

Default Tesseract configuration assumes clean, typewritten documents and fails predictably on receipt imagery. Four named failure modes account for most silent data corruption in accounts-payable pipelines:

Confidence bleed — raw text is consumed downstream without inspecting per-word confidence, so a 40%-confidence total ($1OO.OO) is parsed as a real number and posts a phantom liability.
Whitelist hallucination — with no character constraint, the LSTM engine invents symbols (§, ¤, box-drawing glyphs) that break amount parsers and merchant matchers.
Segmentation collapse — the wrong Page Segmentation Mode splits or merges the dense single-column layout of a thermal print, scrambling line-item order.
Engine drift — an unpinned Tesseract binary or tessdata file changes recognition behaviour between deploys, so a receipt that passed last month is rejected today with no code change and no audit explanation.

The remedy is to make the OCR stage emit structured metadata alongside text and to gate on confidence before any parsing occurs, so noisy reads are routed to review rather than guessed at.
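As a minimal sketch of why the gate has to look at individual tokens, the function below finds the least confident amount-like word in an image_to_data-style dict. The AMOUNT_LIKE pattern, the function name, and the sample values are illustrative assumptions, not part of the module presented later.

```python
import re
from typing import Any, Dict, Optional, Tuple

# Hypothetical pattern: a currency-like token, tolerating the OCR confusions
# O/o for 0 and I/l for 1 so that "$1OO.OO" is still treated as an amount.
AMOUNT_LIKE = re.compile(r"^[$€£¥]?[\dOoIl]+[.,][\dOoIl]{2}$")


def weakest_amount_token(tess_data: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return the (token, confidence) pair of the least confident amount-like word."""
    candidates = [
        (tok, float(conf))
        for tok, conf in zip(tess_data["text"], tess_data["conf"])
        if float(conf) >= 0 and AMOUNT_LIKE.match(tok.strip())
    ]
    return min(candidates, key=lambda pair: pair[1]) if candidates else None


# Nine confident words and one weak total: the mean hides the bad read.
sample = {
    "text": ["", "ACME", "MARKET", "12", "MAIN", "ST", "COFFEE", "4.50", "TAX", "TOTAL", "$1OO.OO"],
    "conf": [-1, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40],
}
words = [float(c) for c in sample["conf"] if float(c) >= 0]
mean_conf = sum(words) / len(words)   # (9 * 96 + 40) / 10
weakest = weakest_amount_token(sample)
```

In this sample the receipt-wide mean clears an 85-point threshold while the total itself sits at 40, which is the confidence-bleed case described above. A per-token check of this kind would complement an average-based gate, not replace it.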

Design Constraints & Prerequisites

Before wiring this stage into an AP or corporate-travel pipeline, five contracts must hold:

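Engine version. Tesseract 5.x with a version-pinned tessdata directory. The code passes --oem 3, Tesseract's default mode, which picks the engine from what the installed language data supports; with LSTM-only tessdata that is the LSTM engine, and --oem 1 requests LSTM explicitly. Pin the binary and the language data together, since recognition is a function of both.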
Upstream data contract. Each image arrives already normalized (deskewed, grayscale-eligible, correct DPI). This stage performs light binarization only; heavy restoration of faded thermal prints belongs in Optimizing Tesseract for faded receipt text.
Memory budget. Peak resident memory must scale with the batch window, never with total volume. That forces a generator that yields one result at a time rather than materializing a list of decoded images.
Throughput / CPU. OCR is CPU-bound. Under the Async batch processing engine it runs in a worker thread; here it is kept as a pure, synchronous function so it stays unit-testable and can be offloaded by the caller.
Compliance preconditions. Every extraction emits an append-only audit record carrying confidence, --psm/--oem state, a correlation ID, and an explicit routing decision, so any downstream policy flag from Automated Policy Validation & Anomaly Flagging traces back to a verifiable read.
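The throughput constraint above can be sketched as follows: a caller offloads a synchronous per-image OCR function to worker threads and caps how many run at once. The names ocr_in_threads and ocr_one are hypothetical, and the real orchestration belongs to the Async batch processing stage; this only illustrates why keeping the OCR function synchronous makes it easy to offload.

```python
import asyncio
from pathlib import Path
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def ocr_in_threads(
    paths: Iterable[Path],
    ocr_one: Callable[[Path], T],   # hypothetical synchronous per-image OCR function
    max_concurrent: int = 4,
) -> List[T]:
    """Run a blocking OCR callable in worker threads, at most max_concurrent at once."""
    gate = asyncio.Semaphore(max_concurrent)

    async def _one(path: Path) -> T:
        async with gate:  # backpressure: cap the number of in-flight OCR calls
            return await asyncio.to_thread(ocr_one, path)

    # gather() returns results in input order; pass one batch window at a time
    # so the result list stays bounded.
    return list(await asyncio.gather(*(_one(p) for p in paths)))
```

A caller would run it with something like asyncio.run(ocr_in_threads(batch, my_ocr_fn)), where batch is one window of paths and my_ocr_fn is whatever single-image wrapper the pipeline defines.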

Production Python Implementation

The module below is self-contained and runnable on Python 3.11+ with pytesseract, opencv-python, and Pillow. It performs light binarization, extracts word-level confidence via image_to_data, gates on average confidence, and yields one audit-ready result at a time so memory stays bounded by the batch window rather than total volume. Tesseract reports confidence on a 0–100 scale, with -1 marking non-word (block/line/paragraph) rows that must be filtered before averaging. Tesseract 5 writes fractional values, and depending on the pytesseract release they may arrive truncated to integers or as numeric strings.

import json
import logging
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Generator

import cv2
import pytesseract
from PIL import Image

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[logging.FileHandler("ocr_audit.log", mode="a")],
)
logger = logging.getLogger("expense_ocr")

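# Character set permitted in receipt text. Anything outside it is dropped by
# Tesseract, preventing hallucinated glyphs from reaching the amount parser.
# No space character: pytesseract splits the config string on whitespace, so
# a space inside the whitelist would truncate it at that point.
CHAR_WHITELIST = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    ".,$€£¥%&+-/"
)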

# Confidence thresholds on Tesseract's native 0–100 scale.
PASS_THRESHOLD = 85.0
REVIEW_THRESHOLD = 70.0


@dataclass
class OCRExtractionResult:
    """Audit record for one receipt read. Serialized verbatim to the log."""

    receipt_id: str
    raw_text: str
    avg_confidence: float          # 0–100 (Tesseract native scale)
    critical_fields: Dict[str, Any]
    psm: int
    oem: int
    status: str                    # "PASS" | "REVIEW_REQUIRED" | "REJECTED"
    timestamp: str
    correlation_id: str
    error: str = ""


def preprocess_image(image_path: Path) -> Image.Image:
    """Light, deterministic binarization: grayscale + adaptive threshold.

    Heavy restoration of faded thermal prints is out of scope here — see
    /receipt-ingestion-ocr-data-extraction/tesseract-ocr-configuration/optimizing-tesseract-for-faded-receipt-text/.
    """
    img = cv2.imread(str(image_path))
    if img is None:
        raise FileNotFoundError(f"unreadable image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return Image.fromarray(
        cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 11, 2,
        )
    )


def build_config(psm: int, oem: int) -> str:
    """Assemble the Tesseract config string with a pinned whitelist."""
    return (
        f"--psm {psm} --oem {oem} "
        f"-c tessedit_char_whitelist={CHAR_WHITELIST} "
        "-c preserve_interword_spaces=1"
    )


def extract_with_confidence(pil_image: Image.Image, psm: int, oem: int) -> Dict[str, Any]:
    """Return Tesseract's word-level data dict (text, conf, bounding boxes)."""
    return pytesseract.image_to_data(
        pil_image,
        output_type=pytesseract.Output.DICT,
        config=build_config(psm, oem),
    )


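def average_confidence(tess_data: Dict[str, Any]) -> float:
    """Mean confidence over real words; -1 (non-word) rows are excluded."""
    # float(), not int(): Tesseract 5 can emit fractional values such as
    # "96.34", which int() rejects when they arrive as strings.
    scores = [float(c) for c in tess_data["conf"] if float(c) >= 0]
    return round(sum(scores) / len(scores), 2) if scores else 0.0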


def joined_text(tess_data: Dict[str, Any]) -> str:
    """Reassemble non-empty, real-word tokens into a single text string."""
    # float() tolerates fractional confidence strings that int() would reject.
    return " ".join(
        tok for tok, conf in zip(tess_data["text"], tess_data["conf"])
        if float(conf) >= 0 and tok.strip()
    ).strip()


def classify(avg_conf: float) -> str:
    """Map average confidence to an explicit routing decision."""
    if avg_conf >= PASS_THRESHOLD:
        return "PASS"
    if avg_conf >= REVIEW_THRESHOLD:
        return "REVIEW_REQUIRED"
    return "REJECTED"


def ocr_receipts(
    image_dir: Path, *, psm: int = 6, oem: int = 3,
) -> Generator[OCRExtractionResult, None, None]:
    """Stream one audit-ready result per image.

    Memory-safe: images are decoded lazily one at a time, so resident memory
    is a function of a single receipt, not of the directory size.
    """
    images = sorted([*image_dir.glob("*.png"), *image_dir.glob("*.jpg")])
    for img_path in images:
        now = datetime.now(timezone.utc).isoformat()
        try:
            tess_data = extract_with_confidence(preprocess_image(img_path), psm, oem)
            avg_conf = average_confidence(tess_data)
            yield OCRExtractionResult(
                receipt_id=img_path.stem,
                raw_text=joined_text(tess_data),
                avg_confidence=avg_conf,
                critical_fields={"merchant": "", "date": "", "total": ""},
                psm=psm,
                oem=oem,
                status=classify(avg_conf),
                timestamp=now,
                correlation_id=f"EXP-{img_path.stem}",
            )
        except Exception as exc:  # isolate failure to this receipt only
            logger.error("OCR_FAILURE | %s | %s", img_path.name, exc)
            yield OCRExtractionResult(
                receipt_id=img_path.stem, raw_text="", avg_confidence=0.0,
                critical_fields={}, psm=psm, oem=oem, status="REJECTED",
                timestamp=now, correlation_id=f"ERR-{img_path.stem}", error=str(exc),
            )


def log_and_route(result: OCRExtractionResult) -> str:
    """Emit the append-only audit line and return the routing decision.

    PASS       → downstream parsing (pdfplumber line-item extraction).
    REVIEW_... → manual review queue with correlation ID + confidence.
    REJECTED   → error categorization for triage.
    """
    logger.info(json.dumps(asdict(result), separators=(",", ":")))
    if result.status == "REVIEW_REQUIRED":
        logger.warning(
            "POLICY_FLAG | %s | confidence %.1f/100",
            result.correlation_id, result.avg_confidence,
        )
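    elif result.status == "REJECTED":
        # A record can be rejected by the gate or by an exception; log which.
        reason = result.error or "below confidence gate"
        logger.error("REJECTED | %s | %s", result.correlation_id, reason)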
    return result.status


if __name__ == "__main__":
    for res in ocr_receipts(Path("./incoming")):
        log_and_route(res)

High-confidence records (PASS) flow directly into pdfplumber line-item parsing for multi-page invoices; REVIEW_REQUIRED records are held for a human; and REJECTED records are triaged by Receipt error categorization before any retry. Because the stage never guesses on low-confidence fields, the Duplicate Receipt Detection and policy engines downstream never operate on phantom line items.

Configuration Reference

Every tunable is set explicitly and version-pinned. The whitelist and OEM in particular must be locked so recognition behaviour cannot drift silently across Tesseract releases.

Flag / key	Type	Default	Rationale
`--psm`	int	`6`	Single uniform block. Receipts are dense single-column layouts; PSM 6 avoids the line-splitting artifacts that PSM 3 (auto) produces on thermal prints.
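`--oem`	int	`3`	Tesseract's default engine mode, chosen from what the installed tessdata supports; with LSTM-only language data that is the LSTM engine, which is superior on degraded fonts and irregular spacing. `1` requests LSTM explicitly. Pin it; legacy modes recognize differently.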
`tessedit_char_whitelist`	str	`CHAR_WHITELIST`	Drops out-of-set glyphs so hallucinated symbols never reach the amount parser. Include the currency symbols your fleet actually sees.
`preserve_interword_spaces`	int (0/1)	`1`	Keeps column alignment intact for downstream line-item extraction.
`PASS_THRESHOLD`	float	`85.0`	Average confidence at or above which a record advances unattended.
`REVIEW_THRESHOLD`	float	`70.0`	Floor for the manual-review band; below it, records are rejected outright.
`tessdata` version	pinned dir	—	Version-control the language data and pin the binary release; recognition is a function of both, so an unpinned upgrade is an unlogged behaviour change.

Set TESSDATA_PREFIX explicitly in the deployment environment and record the Tesseract version in the audit log at startup, so any recognition change is attributable to a specific, logged upgrade.
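One way to capture that startup record is sketched below. It assumes TESSDATA_PREFIX points at the directory that holds the .traineddata files and that the binary is reachable as tesseract on PATH; the function names and the dictionary keys are hypothetical.

```python
import hashlib
import os
import shutil
import subprocess
from pathlib import Path
from typing import Dict, Optional


def sha256_of(path: Path) -> str:
    """Hash a traineddata file so the exact language data is attributable."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tesseract_version_line() -> Optional[str]:
    """First line of `tesseract --version`, or None if the binary is not on PATH."""
    binary = shutil.which("tesseract")
    if binary is None:
        return None
    proc = subprocess.run([binary, "--version"], capture_output=True, text=True)
    # Some builds print the version banner to stderr instead of stdout.
    lines = (proc.stdout or proc.stderr).splitlines()
    return lines[0] if lines else None


def engine_fingerprint(lang: str = "eng") -> Dict[str, Optional[str]]:
    """Collect version and tessdata identity for a startup audit line."""
    prefix = os.environ.get("TESSDATA_PREFIX")
    data_file = Path(prefix) / f"{lang}.traineddata" if prefix else None
    return {
        "tesseract_version": tesseract_version_line(),
        "tessdata_prefix": prefix,
        "traineddata_sha256": (
            sha256_of(data_file) if data_file and data_file.is_file() else None
        ),
    }
```

Logging the returned dict once at container startup, for example as a JSON line next to the per-receipt audit records, would let a reviewer tie a shift in recognition to a specific binary and language-data hash.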

Validation & Testing

Because recognition, confidence, and routing are separated into pure functions, the confidence gate is unit-testable without invoking Tesseract at all — feed synthetic image_to_data dicts and assert on the routing decision and the -1 filtering.

import pytest

from ocr import average_confidence, classify, joined_text  # module above


def _data(words, confs):
    """Build a minimal Tesseract-style output dict."""
    return {"text": words, "conf": [str(c) for c in confs]}


def test_average_ignores_non_word_rows():
    # -1 rows (block/line/para) must not drag the mean toward zero.
    # text and conf are parallel lists, so the non-word row gets an empty token.
    assert average_confidence(_data(["", "ACME", "12.50"], [-1, 96, 92])) == 94.0


def test_empty_read_scores_zero_not_crash():
    assert average_confidence(_data([""], [-1])) == 0.0


def test_joined_text_drops_blanks_and_non_words():
    data = _data(["ACME", "", "12.50"], [95, -1, 91])
    assert joined_text(data) == "ACME 12.50"


@pytest.mark.parametrize("avg,expected", [
    (99.0, "PASS"),
    (85.0, "PASS"),            # boundary is inclusive
    (84.9, "REVIEW_REQUIRED"),
    (70.0, "REVIEW_REQUIRED"),
    (69.9, "REJECTED"),
    (0.0, "REJECTED"),
])
def test_confidence_gate_boundaries(avg, expected):
    assert classify(avg) == expected

Keep fixtures for the edge conditions that break naive configs: a faded thermal print that yields near-empty raw_text (must land in REJECTED, never silently PASS), a receipt whose output contains only non-word (-1) rows, and a mixed-currency receipt that exercises the whitelist. Timezone-sensitive fields such as transaction date are validated downstream, not here; that is owned by the date-window logic under Automated Policy Validation & Anomaly Flagging.
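The near-empty fixture deserves particular care, because a gate that looks only at the mean would pass a faded print that yields a single confident token. One possible guard, sketched here with a hypothetical MIN_WORDS floor and a hypothetical function name, rejects reads that contain too few real words before the average is considered.

```python
from typing import Any, Dict

PASS_THRESHOLD = 85.0      # same values as the module's thresholds
REVIEW_THRESHOLD = 70.0
MIN_WORDS = 3              # hypothetical floor; tune against real receipts


def classify_with_word_floor(tess_data: Dict[str, Any], min_words: int = MIN_WORDS) -> str:
    """Route on average confidence, but reject reads with too few real words."""
    scores = [
        float(conf)
        for tok, conf in zip(tess_data["text"], tess_data["conf"])
        if float(conf) >= 0 and tok.strip()
    ]
    if len(scores) < min_words:
        return "REJECTED"  # near-empty read: one confident token is not a receipt
    avg = sum(scores) / len(scores)
    if avg >= PASS_THRESHOLD:
        return "PASS"
    if avg >= REVIEW_THRESHOLD:
        return "REVIEW_REQUIRED"
    return "REJECTED"
```

The right floor depends on the receipts a given fleet actually sees, so the value of 3 is a placeholder and not a recommendation.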

Operational Runbook

Deploy, monitor, and roll back the OCR stage with the following checklist.

Pre-deploy. Pin the Tesseract binary and tessdata version, run the test suite, and log the resolved Tesseract version at container startup. Confirm TESSDATA_PREFIX resolves and the whitelist matches the currency set in production traffic.
Shadow first. Run the new config in shadow against a sample of yesterday’s traffic and diff avg_confidence and status distributions against the current build before cutting over. A shift in the REJECTED rate with no traffic change signals engine or tessdata drift.
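The shadow comparison can be as simple as diffing the share of each routing status between the two builds. The sketch below assumes the statuses have already been collected from both runs as plain strings; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def status_rates(statuses: Iterable[str]) -> Dict[str, float]:
    """Share of each routing status, in percent of all records."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {s: 100.0 * n / total for s, n in counts.items()} if total else {}


def status_shift(baseline: Iterable[str], candidate: Iterable[str]) -> Dict[str, float]:
    """Percentage-point change per status (candidate minus baseline)."""
    base, cand = status_rates(baseline), status_rates(candidate)
    return {
        s: round(cand.get(s, 0.0) - base.get(s, 0.0), 2)
        for s in sorted(set(base) | set(cand))
    }
```

A positive value for REJECTED on identical input traffic is the drift signal described above; the same helper could be pointed at bucketed avg_confidence values to compare their distributions.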

Monitor these signals.

Signal	Healthy	Alert threshold
`PASS` rate	matches historical baseline	drops > 10 points (preprocessing or engine regression)
`REVIEW_REQUIRED` rate	stable band	2× baseline (imagery quality or threshold drift)
`OCR_FAILURE` log rate	< 0.5% of receipts	> 2% sustained over 10 min (unreadable inputs / decode errors)
Per-receipt latency	flat	trending up (contention on the OCR thread pool)

Roll back. Because the config is code plus pinned data, roll back by redeploying the prior image and tessdata together — never one without the other, or recognition will not match the rolled-back audit history.
Reprocess quarantined records. After a config fix, resubmit REJECTED and REVIEW_REQUIRED receipts through this stage only, preserving the original correlation IDs so lineage stays intact. Persistently unreadable images are escalated via Receipt error categorization.

Together, a pinned engine, a locked whitelist, and a hard confidence gate make Tesseract a compliance control rather than a preprocessing afterthought: every approval or rejection downstream traces to a deterministic, logged read.

Receipt Ingestion & OCR Data Extraction — parent overview of the ingestion and extraction stages.
Optimizing Tesseract for faded receipt text — binarization strategies for degraded thermal prints upstream of this stage.
pdfplumber line-item parsing — deterministic tabular extraction that consumes high-confidence OCR text.
Async batch processing — the orchestration layer that offloads this CPU-bound OCR step across workers.
Receipt error categorization — triage for rejected and low-confidence reads.
Automated Policy Validation & Anomaly Flagging — the downstream engine that consumes confidence-gated text.