Receipt Error Categorization for Expense Automation Pipelines

Receipt error categorization is the deterministic control gate that turns every ambiguous extraction result into exactly one typed, auditable error state before any spend reaches a policy engine or financial ledger. Within the broader Receipt Ingestion & OCR Data Extraction framework, this component sits directly after structured parsing and before policy evaluation: it consumes the confidence-scored records emitted by Tesseract OCR configuration and pdfplumber line-item parsing, and it hands a single routing decision to the downstream Core Policy Architecture & Taxonomy Design engine. It owns the extraction-error taxonomy, severity assignment, and exception routing; it delegates genuine spend-rule conflicts to the Automated Policy Validation & Anomaly Flagging pillar and the human-triage detail to classifying OCR extraction errors for manual review. This guide covers the categorization engine, its taxonomy, the configuration surface, the test contract, and the runbook that makes every decision defensible.

Problem Framing & Root Causes

Extraction rarely fails cleanly — it fails ambiguously, and ambiguity is what leaks bad data into ledgers. Four named failure modes account for nearly every categorization incident in production accounts-payable and travel pipelines:

Silent confidence decay — OCR returns a value with a low per-field probability (faded thermal ink, skewed capture), and a pipeline that ignores the score treats a guessed $1,200.50 exactly like a certain one, so reimbursement rides on a hallucinated digit.
Null-as-zero coercion — a missing mandatory field (merchant, date, total) is coerced to an empty string or 0.0 downstream, converting a structural failure into a plausible-looking transaction that clears validation and corrupts reconciliation.
Arithmetic drift — parsed line items, tax, and grand total fail to reconcile because a row was merged or dropped during table extraction, yet the header total still parses, so the record looks complete while its detail is wrong.
Category collision — one record trips several rules at once (low confidence and a missing field), and a naive engine emits conflicting error records or routes the same receipt to two queues, fracturing the audit trail.

The design goal is an engine that is deterministic (same input, same category, same audit hash), that resolves every record to a single terminal routing decision while preserving the full violation set for audit, and that treats every ambiguity as an explicit typed state rather than a silent pass.
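To make the null-as-zero failure mode concrete, the following minimal sketch contrasts a coercing parser with one that keeps absence as an explicit state. The helper names coerce_total and typed_total are hypothetical and are not part of the pipeline; they only illustrate why a missing total has to stay visibly missing until the gate can type it.

```python
from typing import Optional


def coerce_total(raw: Optional[str]) -> float:
    """Anti-pattern: a missing total silently becomes a plausible 0.0."""
    return float(raw or 0)


def typed_total(raw: Optional[str]) -> Optional[float]:
    """Keep absence explicit so a later gate can flag the record."""
    if raw is None or raw.strip() == "":
        return None
    return float(raw)


print(coerce_total(None))     # 0.0, indistinguishable from a real zero-value receipt
print(typed_total(None))      # None, still visibly missing
print(typed_total("42.50"))   # 42.5
```

With the second form, a null total reaches the categorizer as None and can be typed as a missing mandatory field instead of clearing validation as a zero-value transaction.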

Design Constraints & Prerequisites

This stage occupies a fixed checkpoint: strictly after structured parsing and strictly before policy scoring. The upstream data contract is a normalized record carrying record_id, optional merchant / transaction_date / total_amount / currency, optional converted_amount / reference_rate for the currency-drift check, a per-record ocr_confidence in [0, 1], an optional line_items list, and a source_hash (the SHA-256 of the original artifact) that anchors chain of custody. Confidence scores originate upstream in Tesseract OCR configuration; tabular structure originates in pdfplumber line-item parsing. Categorization cannot compensate for a broken upstream contract: if confidence or field-null state is not propagated, this gate fails closed rather than guessing.
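As an illustration of how an upstream stage might produce the source_hash described above, here is a minimal sketch that hashes the original artifact in fixed-size blocks. The function name compute_source_hash and the 64 KiB block size are assumptions made for this example; the guide does not prescribe how the ingestion stage computes the digest.

```python
import hashlib
import io
from typing import BinaryIO


def compute_source_hash(stream: BinaryIO, block_size: int = 65536) -> str:
    """Return the hex SHA-256 of a binary stream, read in fixed-size blocks."""
    digest = hashlib.sha256()
    # iter(callable, sentinel) keeps reading until read() returns b"" at EOF.
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


# In-memory stand-in for a receipt file opened with open(path, "rb").
artifact = io.BytesIO(b"%PDF-1.7 example receipt bytes")
print(compute_source_hash(artifact))
```

Reading in blocks keeps memory use independent of artifact size, which matters for multi-page scanned PDFs.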

Every record must leave with exactly one of the categories below, each mapped immutably to a severity, a compliance flag, and a resolution path so exceptions reach the correct queue without manual reclassification:

Category Code	Trigger Condition	Severity	Compliance Flag	Resolution Path
`POLICY_PRECHECK_VIOLATION`	A cheap precheck breaches a hard spend rule (restricted MCC, per-diem cap)	`CRITICAL`	Yes	Route to the anomaly-flagging pillar; freeze payout
`SCHEMA_VALIDATION_FAILURE`	Payload fails strict schema coercion at the boundary	`CRITICAL`	Yes	Reject and log schema drift
`MISSING_MANDATORY_FIELD`	Required field (merchant, date, total) absent or null	`HIGH`	Yes	AP exception queue; block reimbursement
`CURRENCY_CONVERSION_DRIFT`	Converted amount deviates beyond tolerance from the reference rate	`HIGH`	Yes	Treasury validation; freeze payout
`LINE_ITEM_ARITHMETIC_MISMATCH`	Line items + tax fail to reconcile against the grand total	`MEDIUM`	No	Travel-policy audit; conditional approval
`OCR_CONFIDENCE_BELOW_THRESHOLD`	Per-record confidence falls below the enterprise floor	`LOW`	No	Re-preprocess or route to manual review
`CLEAN_PASS`	All rules satisfied	`INFO`	No	Proceed to the policy engine
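One way to enforce the "mapped immutably" requirement at runtime is a read-only view over the routing table. The sketch below is an assumption about how that could look, not the guide's implementation (the module later in this guide keeps a plain dict); Route and ROUTING are hypothetical names, and only two rows of the table are reproduced.

```python
from types import MappingProxyType
from typing import Mapping, NamedTuple


class Route(NamedTuple):
    severity: str
    compliance_flag: bool
    resolution_path: str


# Two rows from the taxonomy table; the rest would follow the same shape.
_ROUTES = {
    "MISSING_MANDATORY_FIELD": Route("HIGH", True, "ap_exception_queue"),
    "CLEAN_PASS": Route("INFO", False, "route_to_policy_engine"),
}

# Read-only view: lookups work, item assignment raises TypeError.
ROUTING: Mapping[str, Route] = MappingProxyType(_ROUTES)

print(ROUTING["MISSING_MANDATORY_FIELD"].resolution_path)
try:
    ROUTING["CLEAN_PASS"] = Route("LOW", False, "elsewhere")  # type: ignore[index]
except TypeError as exc:
    print("rejected:", exc)
```

MappingProxyType is only a view, so code that still holds the underlying dict can mutate it; keeping that dict private to the module is what makes the contract effectively immutable.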

Because a close cycle can push millions of line items through this gate, loading a whole dataset into memory can exhaust containerized workers and trigger out-of-memory (OOM) kills. The engine therefore reads records in bounded chunks and yields each decision as it is made, the same chunked-streaming discipline described in async batch processing, so memory stays flat regardless of ledger size. Two compliance preconditions apply: a pinned policy_version is stamped on every decision so Sarbanes-Oxley Act reviewers can reconstruct the exact taxonomy active at processing time, and a deterministic decision hash ensures that any replay reproduces the identical audit fingerprint.

Production Python Implementation

The module below is self-contained and runnable. It uses Pydantic v2 for strict boundary validation, a generator interface to keep memory flat across arbitrarily large batches, and structured JSON logging that emits explicit audit metadata on every decision. All rules are evaluated, the full violation set is preserved, and the highest-severity violation becomes the single terminal routing decision — so a record that is both low-confidence and missing a field routes once, to the more serious queue, without losing the secondary finding.

from __future__ import annotations

import hashlib
import json
import logging
from datetime import datetime, timezone
from enum import Enum
from itertools import islice
from typing import Callable, Iterator, Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError

POLICY_VERSION = "receipt-error-tax/2026.07"

# --- Structured audit logging -------------------------------------------------

class _AuditJSONFormatter(logging.Formatter):
    """Emit one machine-readable JSON object per categorization decision so a
    SIEM or compliance warehouse can ingest it without regex parsing."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "stage": "receipt_error_categorization",
            "record_id": getattr(record, "record_id", "UNKNOWN"),
            "category": getattr(record, "category", "UNKNOWN"),
            "severity": getattr(record, "severity", "INFO"),
            "compliance_flag": getattr(record, "compliance_flag", False),
            "resolution_path": getattr(record, "resolution_path", ""),
            "decision_hash": getattr(record, "decision_hash", ""),
            "policy_version": getattr(record, "policy_version", ""),
            "violations": getattr(record, "violations", []),
            "msg": record.getMessage(),
        }
        return json.dumps(payload, separators=(",", ":"))


logger = logging.getLogger("expense.receipt_error_categorization")
if not logger.handlers:
    _handler = logging.StreamHandler()
    _handler.setFormatter(_AuditJSONFormatter())
    logger.addHandler(_handler)
    logger.setLevel(logging.INFO)


# --- Taxonomy -----------------------------------------------------------------

class ErrorCategory(str, Enum):
    POLICY_PRECHECK_VIOLATION = "POLICY_PRECHECK_VIOLATION"
    SCHEMA_VALIDATION_FAILURE = "SCHEMA_VALIDATION_FAILURE"
    MISSING_MANDATORY_FIELD = "MISSING_MANDATORY_FIELD"
    CURRENCY_CONVERSION_DRIFT = "CURRENCY_CONVERSION_DRIFT"
    LINE_ITEM_ARITHMETIC_MISMATCH = "LINE_ITEM_ARITHMETIC_MISMATCH"
    OCR_CONFIDENCE_BELOW_THRESHOLD = "OCR_CONFIDENCE_BELOW_THRESHOLD"
    CLEAN_PASS = "CLEAN_PASS"


# Severity rank (higher wins) plus the immutable routing contract per category.
_SEVERITY_RANK: dict[ErrorCategory, int] = {
    ErrorCategory.POLICY_PRECHECK_VIOLATION: 50,
    ErrorCategory.SCHEMA_VALIDATION_FAILURE: 50,
    ErrorCategory.MISSING_MANDATORY_FIELD: 40,
    ErrorCategory.CURRENCY_CONVERSION_DRIFT: 40,
    ErrorCategory.LINE_ITEM_ARITHMETIC_MISMATCH: 30,
    ErrorCategory.OCR_CONFIDENCE_BELOW_THRESHOLD: 20,
    ErrorCategory.CLEAN_PASS: 0,
}

_ROUTING: dict[ErrorCategory, tuple[str, bool, str]] = {
    #                              severity   compliance  resolution_path
    ErrorCategory.POLICY_PRECHECK_VIOLATION:     ("CRITICAL", True,  "anomaly_flagging_freeze"),
    ErrorCategory.SCHEMA_VALIDATION_FAILURE:     ("CRITICAL", True,  "reject_schema_drift"),
    ErrorCategory.MISSING_MANDATORY_FIELD:       ("HIGH",     True,  "ap_exception_queue"),
    ErrorCategory.CURRENCY_CONVERSION_DRIFT:     ("HIGH",     True,  "treasury_validation"),
    ErrorCategory.LINE_ITEM_ARITHMETIC_MISMATCH: ("MEDIUM",   False, "travel_policy_audit"),
    ErrorCategory.OCR_CONFIDENCE_BELOW_THRESHOLD:("LOW",      False, "reprocess_or_manual"),
    ErrorCategory.CLEAN_PASS:                    ("INFO",     False, "route_to_policy_engine"),
}


# --- Schemas ------------------------------------------------------------------

class ExtractionPayload(BaseModel):
    """Normalized record produced by parsing. Extra keys are rejected so an
    upstream contract change surfaces here instead of silently propagating."""

    model_config = ConfigDict(extra="forbid")

    record_id: str
    ocr_confidence: float = Field(ge=0.0, le=1.0)
    merchant: Optional[str] = None
    transaction_date: Optional[str] = None
    total_amount: Optional[float] = None
    currency: Optional[str] = None
    converted_amount: Optional[float] = None
    reference_rate: Optional[float] = None
    line_items: Optional[list[dict]] = None
    source_hash: str = ""


class CategorizedRecord(BaseModel):
    record_id: str
    category: ErrorCategory
    severity: str
    compliance_flag: bool
    resolution_path: str
    violations: list[str]          # full set of tripped rules, for audit
    policy_version: str
    decision_hash: str
    categorized_at_utc: str


# --- Engine -------------------------------------------------------------------

# A precheck maps a validated payload to a category or None. Genuine spend-rule
# evaluation lives in the policy pillar; only cheap hard-stops belong here.
PolicyPrecheck = Callable[["ExtractionPayload"], Optional[ErrorCategory]]

MANDATORY_FIELDS: frozenset[str] = frozenset({"merchant", "transaction_date", "total_amount"})


class ReceiptErrorCategorizer:
    """Resolve one ExtractionPayload to one terminal CategorizedRecord."""

    def __init__(
        self,
        *,
        confidence_floor: float = 0.85,
        arithmetic_tolerance: float = 0.01,
        fx_drift_tolerance: float = 0.02,
        policy_prechecks: tuple[PolicyPrecheck, ...] = (),
        policy_version: str = POLICY_VERSION,
    ) -> None:
        self._confidence_floor = confidence_floor
        self._arithmetic_tolerance = arithmetic_tolerance
        self._fx_drift_tolerance = fx_drift_tolerance
        self._policy_prechecks = policy_prechecks
        self._policy_version = policy_version

    def _collect_violations(self, payload: ExtractionPayload) -> list[ErrorCategory]:
        found: list[ErrorCategory] = []

        for precheck in self._policy_prechecks:
            hit = precheck(payload)
            if hit is not None:
                found.append(hit)

        present = {k for k, v in payload.model_dump().items() if v not in (None, "")}
        if MANDATORY_FIELDS - present:
            found.append(ErrorCategory.MISSING_MANDATORY_FIELD)

        if (
            payload.converted_amount is not None
            and payload.total_amount is not None
            and payload.reference_rate is not None
        ):
            expected = payload.total_amount * payload.reference_rate
            # Divide by |expected| so negative totals (refunds) are still checked.
            if expected and abs(payload.converted_amount - expected) / abs(expected) > self._fx_drift_tolerance:
                found.append(ErrorCategory.CURRENCY_CONVERSION_DRIFT)

        if payload.line_items and payload.total_amount is not None:
            line_total = sum(float(i.get("amount", 0)) for i in payload.line_items)
            tax = sum(float(i.get("tax", 0)) for i in payload.line_items)
            if abs((line_total + tax) - payload.total_amount) > self._arithmetic_tolerance:
                found.append(ErrorCategory.LINE_ITEM_ARITHMETIC_MISMATCH)

        if payload.ocr_confidence < self._confidence_floor:
            found.append(ErrorCategory.OCR_CONFIDENCE_BELOW_THRESHOLD)

        return found

    def _finalize(
        self, record_id: str, category: ErrorCategory, violations: list[str], source_hash: str
    ) -> CategorizedRecord:
        severity, compliance_flag, resolution_path = _ROUTING[category]
        fingerprint = "|".join(
            [record_id, category.value, resolution_path, self._policy_version, source_hash]
        )
        decision_hash = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
        decision = CategorizedRecord(
            record_id=record_id,
            category=category,
            severity=severity,
            compliance_flag=compliance_flag,
            resolution_path=resolution_path,
            violations=violations,
            policy_version=self._policy_version,
            decision_hash=decision_hash,
            categorized_at_utc=datetime.now(timezone.utc).isoformat(),
        )
        logger.info(
            "categorization_decision",
            extra={
                "record_id": decision.record_id,
                "category": decision.category.value,
                "severity": decision.severity,
                "compliance_flag": decision.compliance_flag,
                "resolution_path": decision.resolution_path,
                "decision_hash": decision.decision_hash,
                "policy_version": decision.policy_version,
                "violations": violations,
            },
        )
        return decision

    def categorize(self, raw: dict) -> CategorizedRecord:
        record_id = str(raw.get("record_id", "UNKNOWN"))
        try:
            payload = ExtractionPayload.model_validate(raw)
        except ValidationError:
            return self._finalize(
                record_id,
                ErrorCategory.SCHEMA_VALIDATION_FAILURE,
                [ErrorCategory.SCHEMA_VALIDATION_FAILURE.value],
                str(raw.get("source_hash", "")),
            )

        violations = self._collect_violations(payload)
        if not violations:
            return self._finalize(
                payload.record_id, ErrorCategory.CLEAN_PASS, [], payload.source_hash
            )

        # Highest-severity violation is the terminal routing decision; the full
        # set is preserved on the record for downstream audit.
        terminal = max(violations, key=lambda c: _SEVERITY_RANK[c])
        return self._finalize(
            payload.record_id, terminal, [v.value for v in violations], payload.source_hash
        )

    def categorize_stream(
        self, payloads: Iterator[dict], *, chunk_size: int = 500
    ) -> Iterator[CategorizedRecord]:
        """Yield decisions in bounded chunks so memory stays flat at any scale."""
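        # Bind one iterator up front: islice() over a list or other re-iterable
        # would restart from the first element on every pass and never finish.
        source = iter(payloads)
        while chunk := list(islice(source, chunk_size)):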
            for raw in chunk:
                yield self.categorize(raw)


if __name__ == "__main__":
    def block_restricted_mcc(p: ExtractionPayload) -> Optional[ErrorCategory]:
        # Cheap hard-stop example; deep policy eval is delegated downstream.
        # ExtractionPayload carries no MCC field, so this demo uses a merchant-name
        # prefix as a stand-in for a restricted-MCC lookup (e.g. 7995, gambling).
        if (p.merchant or "").lower().startswith("casino"):
            return ErrorCategory.POLICY_PRECHECK_VIOLATION
        return None

    engine = ReceiptErrorCategorizer(policy_prechecks=(block_restricted_mcc,))
    fixtures = [
        {"record_id": "R1", "ocr_confidence": 0.97, "merchant": "Grand Hotel",
         "transaction_date": "2026-06-01", "total_amount": 420.0, "currency": "USD",
         "line_items": [{"amount": 400.0, "tax": 20.0}], "source_hash": "abc"},
        {"record_id": "R2", "ocr_confidence": 0.40, "merchant": "Faded Diner",
         "transaction_date": "2026-06-02", "total_amount": 30.0, "currency": "USD",
         "source_hash": "def"},
        {"record_id": "R3", "ocr_confidence": 0.95, "merchant": None,
         "transaction_date": "2026-06-03", "total_amount": 55.0, "currency": "USD",
         "source_hash": "ghi"},
        {"record_id": "R4", "ocr_confidence": 0.99, "merchant": "Casino Royale",
         "transaction_date": "2026-06-04", "total_amount": 200.0, "currency": "USD",
         "source_hash": "jkl"},
    ]
    for decision in engine.categorize_stream(iter(fixtures)):
        print(decision.record_id, decision.category.value, decision.resolution_path)

R1 is a CLEAN_PASS (confident, complete, and its line items reconcile), R2 is OCR_CONFIDENCE_BELOW_THRESHOLD, R3 is MISSING_MANDATORY_FIELD because merchant is null, and R4 is POLICY_PRECHECK_VIOLATION — the highest-severity terminal state even though the record is otherwise clean. Folding source_hash and policy_version into decision_hash makes re-processing idempotent: the same artifact under the same taxonomy always produces the same audit fingerprint.
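The idempotency claim can be checked in isolation. The sketch below reproduces the module's fingerprint recipe (the same five pipe-joined fields, hashed with SHA-256) as a standalone function; decision_fingerprint is a hypothetical name, and the 2026.08 version string is an invented example of a bumped policy_version.

```python
import hashlib


def decision_fingerprint(record_id: str, category: str, resolution_path: str,
                         policy_version: str, source_hash: str) -> str:
    """SHA-256 over the same pipe-joined fields the module hashes."""
    joined = "|".join([record_id, category, resolution_path, policy_version, source_hash])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


first = decision_fingerprint("R2", "OCR_CONFIDENCE_BELOW_THRESHOLD",
                             "reprocess_or_manual", "receipt-error-tax/2026.07", "def")
replay = decision_fingerprint("R2", "OCR_CONFIDENCE_BELOW_THRESHOLD",
                              "reprocess_or_manual", "receipt-error-tax/2026.07", "def")
bumped = decision_fingerprint("R2", "OCR_CONFIDENCE_BELOW_THRESHOLD",
                              "reprocess_or_manual", "receipt-error-tax/2026.08", "def")

print(first == replay)   # True: same inputs, same fingerprint
print(first == bumped)   # False: a version bump changes the fingerprint
```

Note that categorized_at_utc and the violations list are not part of the fingerprint, which is why a replay at a different time still matches; it is also why thresholds must be tied to policy_version, since a changed threshold alone does not alter the hash inputs unless it changes the category.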

Configuration Reference

Every categorization decision is judged by exactly one immutable configuration. Pin the policy_version in your config store and bump it whenever a threshold or the taxonomy changes, so audit reconstruction resolves to a single rule set.

Key	Type	Default	Rationale
`confidence_floor`	`float`	`0.85`	Records below this per-record score are flagged `OCR_CONFIDENCE_BELOW_THRESHOLD`; align it with the score distribution from Tesseract OCR configuration.
`arithmetic_tolerance`	`float`	`0.01`	Absolute currency tolerance for line-item reconciliation; keep it at one minor unit to absorb rounding without masking a dropped row.
`fx_drift_tolerance`	`float`	`0.02`	Fractional deviation allowed between the converted amount and `total × reference_rate` before flagging drift.
`policy_prechecks`	`tuple[Callable, ...]`	`()`	Cheap hard-stops evaluated first; deep spend-rule logic is delegated to the anomaly-flagging pillar.
`policy_version`	`str`	`receipt-error-tax/2026.07`	Stamped on every decision and folded into `decision_hash` for point-in-time audit reconstruction.
`chunk_size`	`int`	`500`	Rows per streamed chunk (an argument to `categorize_stream`, not the constructor); lower it if worker memory approaches the container budget during month-end close.

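Version-pin pydantic (v2) in your lockfile: a major bump changes validation semantics. The schema here relies on v2's model_config = ConfigDict(extra="forbid"), whereas the v1 API configures the same behavior through an inner Config class, so mixing v1 and v2 schemas across the pipeline risks losing the guard that surfaces upstream contract drift.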
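A frozen dataclass is one way to hold "exactly one immutable configuration" in code. This is a sketch under assumptions: CategorizerConfig and its range checks are hypothetical, and only the key names and defaults are taken from the table above.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class CategorizerConfig:
    """Hypothetical immutable bundle of the keys in the table above."""
    policy_version: str
    confidence_floor: float = 0.85
    arithmetic_tolerance: float = 0.01
    fx_drift_tolerance: float = 0.02
    chunk_size: int = 500

    def __post_init__(self) -> None:
        # Reject values that would make a rule meaningless before any record runs.
        if not 0.0 <= self.confidence_floor <= 1.0:
            raise ValueError("confidence_floor must be within [0, 1]")
        if self.arithmetic_tolerance < 0 or self.fx_drift_tolerance < 0:
            raise ValueError("tolerances must be non-negative")
        if self.chunk_size < 1:
            raise ValueError("chunk_size must be at least 1")


config = CategorizerConfig(policy_version="receipt-error-tax/2026.07")
try:
    config.confidence_floor = 0.5  # type: ignore[misc]
except FrozenInstanceError:
    print("config is immutable; bump policy_version and build a new one")
```

The validated values could then be passed to the ReceiptErrorCategorizer constructor as keyword arguments, with chunk_size supplied to categorize_stream.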

Validation & Testing

Categorization fails at the boundaries between states and at severity precedence, so the suite pins the exact transitions and asserts that decision_hash is stable across repeated runs to prove idempotency.

import pytest

from receipt_error_categorization import (ErrorCategory,
                                          ReceiptErrorCategorizer)


def _raw(**overrides: object) -> dict:
    base = dict(record_id="R", ocr_confidence=0.95, merchant="M",
                transaction_date="2026-06-01", total_amount=30.0,
                currency="USD", source_hash="h")
    base.update(overrides)
    return base


@pytest.fixture
def engine() -> ReceiptErrorCategorizer:
    return ReceiptErrorCategorizer()


def test_clean_record_passes(engine: ReceiptErrorCategorizer) -> None:
    assert engine.categorize(_raw()).category is ErrorCategory.CLEAN_PASS


def test_low_confidence_is_flagged(engine: ReceiptErrorCategorizer) -> None:
    d = engine.categorize(_raw(ocr_confidence=0.40))
    assert d.category is ErrorCategory.OCR_CONFIDENCE_BELOW_THRESHOLD


def test_missing_field_beats_low_confidence(engine: ReceiptErrorCategorizer) -> None:
    # Both rules trip; the higher-severity category must win the routing.
    d = engine.categorize(_raw(merchant=None, ocr_confidence=0.40))
    assert d.category is ErrorCategory.MISSING_MANDATORY_FIELD
    assert ErrorCategory.OCR_CONFIDENCE_BELOW_THRESHOLD.value in d.violations


def test_arithmetic_mismatch_detected(engine: ReceiptErrorCategorizer) -> None:
    d = engine.categorize(_raw(total_amount=99.0,
                               line_items=[{"amount": 10.0, "tax": 1.0}]))
    assert d.category is ErrorCategory.LINE_ITEM_ARITHMETIC_MISMATCH


def test_schema_failure_on_bad_confidence(engine: ReceiptErrorCategorizer) -> None:
    d = engine.categorize(_raw(ocr_confidence=1.7))
    assert d.category is ErrorCategory.SCHEMA_VALIDATION_FAILURE


def test_decision_hash_is_idempotent(engine: ReceiptErrorCategorizer) -> None:
    raw = _raw()
    assert engine.categorize(raw).decision_hash == engine.categorize(raw).decision_hash

The gating test for this component is test_missing_field_beats_low_confidence: if severity precedence regresses, a record that should be held in the AP exception queue would instead route to the softer reprocess path, so a failure here must block the build. Fixtures should also cover the edge cases that reliably break parsers: faded thermal ink (very low confidence), timezone-shifted transaction_date strings, and split-tender transactions where line items reconcile to a partial total. The human-triage handling of those cases belongs to classifying OCR extraction errors for manual review.
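Boundary fixtures for the arithmetic rule are easier to pin down with exact decimals than with binary floats. The sketch below is a hypothetical helper for building such fixtures, not the module's implementation (which compares floats); reconciles and its string-based amounts are assumptions made for the example. It mirrors the module's rule that a difference exactly equal to the tolerance still passes.

```python
from decimal import Decimal


def reconciles(line_items: list[dict], total: str, tolerance: str = "0.01") -> bool:
    """True when line amounts plus tax match the total within the tolerance."""
    computed = sum(
        (Decimal(str(i.get("amount", "0"))) + Decimal(str(i.get("tax", "0")))
         for i in line_items),
        Decimal("0"),
    )
    return abs(computed - Decimal(total)) <= Decimal(tolerance)


meal = [{"amount": "10.00", "tax": "1.00"}]
print(reconciles(meal, "11.00"))   # True: exact match
print(reconciles(meal, "11.01"))   # True: difference equals the tolerance
print(reconciles(meal, "11.02"))   # False: one cent past the tolerance

# Split tender: items total 50.00 but only 30.00 was charged to this card.
print(reconciles([{"amount": "50.00", "tax": "0.00"}], "30.00"))   # False
```

Each of these cases maps onto a fixture for the engine: the first two should stay CLEAN_PASS, and the last two should surface LINE_ITEM_ARITHMETIC_MISMATCH.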

Operational Runbook

Deploy behind a version gate. Ship the new policy_version alongside the taxonomy and process each in-flight batch under the manifest active when it started, never mid-flight. Roll back by re-pinning the previous version — no code revert required.
Wire the audit stream. Route the JSON log events to your SIEM or compliance warehouse and confirm decision_hash, category, compliance_flag, and policy_version appear on every event before enabling downstream routing.
Fail closed on schema drift. Alert on any SCHEMA_VALIDATION_FAILURE; a nonzero rate means an upstream parser changed its output contract and the extraction stage must be inspected before the batch proceeds.
Baseline the category distribution. Emit a counter per ErrorCategory over a clean week, then alert on deviation from that baseline.
Alert thresholds. Page when OCR_CONFIDENCE_BELOW_THRESHOLD exceeds its baseline (a capture-quality or OCR-config regression), when MISSING_MANDATORY_FIELD spikes (a parsing contract break), or when POLICY_PRECHECK_VIOLATION rises (a genuine compliance event or a precheck misconfiguration).
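Route by severity, not by rule count. Confirm HIGH and CRITICAL categories bypass the policy engine and land directly in the AP/compliance queues, and that MEDIUM and LOW categories follow their own resolution paths (travel-policy audit, reprocess or manual review); only CLEAN_PASS records advance to Core Policy Architecture & Taxonomy Design.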
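Roll forward, not back, on data. Because categorization is idempotent, reprocessing a batch under the same policy_version yields identical decision_hash values for every record whose source artifact and category are unchanged; only records whose correction changes the category or the source_hash receive a new fingerprint, so audit continuity survives any replay.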
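The baseline-and-deviation step in the runbook can be sketched as follows. The function names and the 0.05 absolute-share threshold are assumptions for illustration; a production setup would more likely express this as counters and alert rules in the monitoring stack.

```python
from collections import Counter
from typing import Iterable


def category_rates(categories: Iterable[str]) -> dict[str, float]:
    """Share of each category among the decisions observed."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def drifted(baseline: dict[str, float], current: dict[str, float],
            max_abs_delta: float = 0.05) -> list[str]:
    """Categories whose share moved more than max_abs_delta from the baseline."""
    names = set(baseline) | set(current)
    return sorted(
        n for n in names
        if abs(current.get(n, 0.0) - baseline.get(n, 0.0)) > max_abs_delta
    )


baseline = category_rates(["CLEAN_PASS"] * 90 + ["OCR_CONFIDENCE_BELOW_THRESHOLD"] * 10)
today = category_rates(["CLEAN_PASS"] * 70 + ["OCR_CONFIDENCE_BELOW_THRESHOLD"] * 30)
print(drifted(baseline, today))
# ['CLEAN_PASS', 'OCR_CONFIDENCE_BELOW_THRESHOLD']
```

In this example the low-confidence share tripled from 10% to 30%, the kind of shift the runbook attributes to a capture-quality or OCR-config regression.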

Receipt error categorization is a deterministic compliance control, not a heuristic filter. Strict boundary validation, severity-ranked terminal routing, and immutable audit logging give AP and travel teams a component that scales predictably, satisfies audit requirements, and turns ambiguous extraction failures into reproducible, auditable states.

Receipt Ingestion & OCR Data Extraction — the parent framework this gate plugs into
Classifying OCR extraction errors for manual review — human-triage routing for flagged records
Tesseract OCR configuration — produces the confidence scores this gate thresholds
pdfplumber line-item parsing — produces the line-item structure checked for arithmetic drift
Async batch processing — the chunked-streaming discipline this engine reuses
Merchant category code routing — the downstream consumer of clean, categorized records