Duplicate Receipt Detection: Implementation Guide for Expense Automation Pipelines

Duplicate receipt detection is the deterministic gate that stops the same expense from being reimbursed twice — whether it arrives as a re-uploaded photo, a corporate-card feed that overlaps a manual submission, or a PDF and its emailed copy. Within the broader Automated Policy Validation & Anomaly Flagging framework, this component consumes the canonical records produced by Receipt Ingestion & OCR Data Extraction, fingerprints each line item, and matches it against historical submissions before routing or approval. It owns exact and tolerance-bounded matching; it delegates fiscal-boundary overlap to Date Window Validation Logic, vendor entity resolution to Merchant Category Code Routing, and match-score cutoffs to Dynamic Threshold Tuning. This guide covers the matching engine, its configuration surface, and the audit trail that makes every decision defensible.

Problem Framing & Root Causes

The dominant production failure is memory exhaustion: naive implementations load the entire historical receipt set into RAM for pairwise comparison, degrading to O(N²) and stalling AP queues during month-end close. Three further failure modes cause silent false negatives: normalization drift, where AMZN MKTP and Amazon Marketplace or 1,234.50 and 1234.5 never compare equal under string/float equality; temporal decoupling, where the same transaction lands in adjacent reporting periods because capture, submission, and settlement dates diverge; and float rounding, where accumulated IEEE-754 error makes two identical amounts differ by a fraction of a cent. Each of these lets a genuine duplicate slip past an “exact match” rule, so the engine must normalize aggressively and match on a stable fingerprint rather than raw fields.
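The silent failure modes above are easy to reproduce with the standard library alone. The following is a minimal sketch; the sample values are illustrative, and the timezone case is only one way a transaction date can diverge between capture and settlement.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

# Normalization drift: raw string equality fails on formatting alone,
# while Decimal compares the numeric value.
assert "1,234.50" != "1234.5"
assert Decimal("1,234.50".replace(",", "")) == Decimal("1234.5")

# Float rounding: binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum is not equal to 0.3. Decimal arithmetic on strings is exact.
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# Temporal decoupling (timezone variant): the same instant falls on
# different calendar days, and here different months, depending on the zone.
captured_local = datetime(2026, 3, 31, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
captured_utc = captured_local.astimezone(timezone.utc)
assert captured_local.date() != captured_utc.date()
```

Each assertion corresponds to a duplicate that an "exact match" rule on raw fields would miss.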

Design Constraints & Prerequisites

This component is stateless per invocation but reads and writes a persistent historical index. It assumes upstream stages have already resolved OCR confidence and coerced records into the canonical schema; if critical fields are missing, it must route the record to human review rather than risk a false negative. Latency and memory are bounded by processing in fixed-size chunks and delegating joins to an embedded analytical engine (DuckDB) that spills to disk rather than exhausting RAM.

Constraint	Requirement	Rationale
Upstream contract	`receipt_id`, `amount`, `transaction_date`, `merchant_name`, `currency` present and OCR-validated	Missing fields must divert to a fallback chain, not silently match
Money representation	Compared at 2-decimal precision via `decimal.Decimal`, never `float` arithmetic	Eliminates rounding drift that breaks amount equality
Timestamps	Timezone-aware, normalized to UTC (ISO 8601)	Prevents the same transaction landing in two periods
Memory ceiling	Bounded per chunk (default 50k rows); historical index on disk	Avoids O(N²) in-memory joins at close
Latency target	Sub-second per batch at enterprise volume	Keeps AP queues moving during peak windows
Compliance	Every comparison and routing decision logged immutably	Satisfies Sarbanes-Oxley control evidence requirements

The historical index and its retention window are the only shared state; treat both as configuration-as-code so finance teams can tune tolerances without a redeploy. Sensitive receipt payloads referenced by the index must respect the isolation rules defined in Security & Compliance Boundaries.

Production Python Implementation

The engine is three composable stages: deterministic normalization, an out-of-core batch matcher, and audit-emitting routing. Each is independently testable and emits structured metadata.

Deterministic normalization

Normalization collapses surface variation into a single fingerprint. Amounts are quantized with decimal.Decimal, timestamps converted to UTC, and merchant strings stripped to a canonical form before a SHA-256 match_key is computed. When a merchant string is ambiguous, resolve it against the standardized tables owned by Merchant Category Code Routing before hashing so that the same vendor always yields the same key.

import re
import hashlib
import logging
from datetime import datetime, timezone
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict, Any

logger = logging.getLogger("expense.dedup.normalize")


def normalize_receipt(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse a raw OCR record into a deterministic matching fingerprint.

    Money is quantized with Decimal (never float) so identical amounts always
    compare equal; timestamps are forced to UTC; merchant strings are reduced
    to a canonical form. The SHA-256 match_key is stable across formatting
    variance (case, punctuation, whitespace, trailing zeros).
    """
    # Currency amount: quantize to 2 dp; assumes a pre-validated ISO 4217 code.
    amount = Decimal(str(raw["amount"])).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

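    # Timestamp: parse and force to UTC (ISO 8601). astimezone() would treat a
    # naive value as host-local time, so reject it instead of guessing.
    dt = datetime.fromisoformat(str(raw["transaction_date"]).replace("Z", "+00:00"))
    if dt.tzinfo is None:
        raise ValueError("transaction_date must be timezone-aware")
    dt_utc = dt.astimezone(timezone.utc)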

    # Merchant: lowercase, strip punctuation, collapse whitespace.
    merchant_clean = re.sub(r"[^a-z0-9\s]", "", str(raw["merchant_name"]).lower())
    merchant_clean = re.sub(r"\s+", " ", merchant_clean).strip()

    match_key_str = f"{dt_utc.strftime('%Y-%m-%d')}|{amount}|{merchant_clean}"
    match_hash = hashlib.sha256(match_key_str.encode("utf-8")).hexdigest()

    normalized = {
        "receipt_id": raw["receipt_id"],
        "transaction_date_utc": dt_utc.isoformat(),
        "amount_normalized": str(amount),  # keep Decimal-safe string form
        "merchant_normalized": merchant_clean,
        "currency_iso": raw.get("currency", "USD"),
        "match_key": match_hash,
        "raw_payload_ref": raw.get("ocr_job_id", "unknown"),
    }
    logger.debug("normalized_receipt", extra={"receipt_id": raw["receipt_id"], "match_key": match_hash})
    return normalized

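SHA-256 is deterministic, so identical normalized inputs always yield identical keys. The hash itself does not absorb OCR jitter: only the variance that normalization removes (case, punctuation, whitespace, trailing zeros) collapses to one key, and a misread character still produces a different fingerprint. The hashlib module exposes the FIPS-standardized SHA-2 family; see the official hashlib documentation. Records that arrive with null critical fields must never reach this function; they belong in the fallback chain described under Receipt Error Categorization.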
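One way to keep incomplete records away from normalize_receipt is a small guard that runs first. The sketch below is an assumption, not part of the engine: the helper names missing_critical_fields and split_batch are hypothetical, and the field list is taken from the upstream contract in the constraints table.

```python
from typing import Any, Dict, List, Tuple

# Critical fields named in the upstream contract.
CRITICAL_FIELDS = ("receipt_id", "amount", "transaction_date", "merchant_name", "currency")


def missing_critical_fields(raw: Dict[str, Any]) -> List[str]:
    """Return the critical fields that are absent, None, or blank strings."""
    missing = []
    for field in CRITICAL_FIELDS:
        value = raw.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


def split_batch(
    batch: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Tuple[Dict[str, Any], List[str]]]]:
    """Partition a raw batch into records safe to normalize and records for review."""
    ready, review = [], []
    for raw in batch:
        missing = missing_critical_fields(raw)
        if missing:
            # Keep the list of missing fields so the reviewer sees why it was held.
            review.append((raw, missing))
        else:
            ready.append(raw)
    return ready, review
```

Only the first list would be passed to normalize_receipt; the second would go to whatever review queue the fallback chain defines.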

Memory-efficient batch matcher

The matcher delegates the pairwise join to DuckDB’s vectorized engine over an on-disk index, so RAM stays flat regardless of historical volume. The incoming batch is loaded into a temp table and joined by match_key with a bounded amount tolerance — never an in-memory cross product.

import duckdb
import logging
from pathlib import Path
from typing import Iterator, List, Dict, Any

logger = logging.getLogger("expense.dedup.matcher")


class DuplicateMatcher:
    """Out-of-core duplicate matcher backed by an on-disk DuckDB index."""

    _COLUMNS = ("match_key", "receipt_id", "transaction_date_utc",
                "amount_normalized", "merchant_normalized")

    def __init__(self, historical_db_path: Path, amount_tolerance: str = "0.01",
                 chunk_size: int = 50_000) -> None:
        self.amount_tolerance = amount_tolerance
        self.chunk_size = chunk_size
        self.con = duckdb.connect(str(historical_db_path))
        self._init_index()

    def _init_index(self) -> None:
        self.con.execute("""
            CREATE TABLE IF NOT EXISTS receipt_index (
                match_key            VARCHAR PRIMARY KEY,
                receipt_id           VARCHAR,
                transaction_date_utc TIMESTAMP,
                amount_normalized    DECIMAL(12,2),
                merchant_normalized  VARCHAR
            )
        """)
        self.con.execute("CREATE INDEX IF NOT EXISTS idx_receipt_id ON receipt_index (receipt_id)")

    def _rows(self, batch: List[Dict[str, Any]]):
        return [tuple(r[c] for c in self._COLUMNS) for r in batch]

    def ingest_batch(self, normalized_batch: List[Dict[str, Any]]) -> None:
        """Persist a normalized batch; INSERT OR IGNORE makes re-ingest idempotent."""
        if not normalized_batch:
            return
        self.con.executemany(
            f"INSERT OR IGNORE INTO receipt_index ({', '.join(self._COLUMNS)}) VALUES (?,?,?,?,?)",
            self._rows(normalized_batch),
        )

    def find_duplicates(self, normalized_batch: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
        """Yield one audit record per detected duplicate for the incoming batch."""
        if not normalized_batch:
            return
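        # CTAS with LIMIT 0 copies column names and types only, so _incoming
        # carries no PRIMARY KEY and a batch may hold several rows per match_key.
        self.con.execute(
            "CREATE TEMP TABLE IF NOT EXISTS _incoming AS SELECT * FROM receipt_index LIMIT 0"
        )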
        self.con.execute("DELETE FROM _incoming")
        self.con.executemany(
            f"INSERT INTO _incoming ({', '.join(self._COLUMNS)}) VALUES (?,?,?,?,?)",
            self._rows(normalized_batch),
        )

        rows = self.con.execute("""
            SELECT b.receipt_id        AS new_receipt_id,
                   h.receipt_id        AS duplicate_of,
                   b.amount_normalized AS new_amount,
                   h.amount_normalized AS historical_amount,
                   b.transaction_date_utc AS new_date,
                   h.transaction_date_utc AS historical_date,
                   b.match_key
            FROM _incoming b
            JOIN receipt_index h
              ON b.match_key = h.match_key
             AND b.receipt_id != h.receipt_id
             AND ABS(b.amount_normalized - h.amount_normalized) <= CAST(? AS DECIMAL(12,2))
        """, [str(self.amount_tolerance)]).fetchall()

        cols = ["new_receipt_id", "duplicate_of", "new_amount", "historical_amount",
                "new_date", "historical_date", "match_key"]
        for row in rows:
            r = dict(zip(cols, row))
            logger.info("duplicate_detected", extra={"receipt_id": r["new_receipt_id"],
                                                     "duplicate_of": r["duplicate_of"]})
            yield {
                "type": "DUPLICATE_DETECTED",
                "new_receipt_id": r["new_receipt_id"],
                "duplicate_of": r["duplicate_of"],
                # Keep the variance as a string so the audit record stays Decimal-exact.
                "amount_variance": str(r["new_amount"] - r["historical_amount"]),
                "match_key": r["match_key"],
                "routing_action": "HOLD_FOR_AP_REVIEW",
            }

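This avoids a Python-level O(N²) comparison: the equality join on match_key runs inside DuckDB against the on-disk index, and DuckDB can spill large intermediates to its temp directory rather than holding everything in RAM. Two consequences of the schema are worth stating. First, the quantized amount is already part of match_key, so rows that share a key have equal amounts and the tolerance predicate acts as a defensive guard, not as a way to widen matching. Second, because match_key is the primary key, the index keeps only the first receipt seen for each fingerprint, and every later duplicate is reported against that one. Exact-key matching also only catches receipts that fall on the same normalized UTC day; overlaps that span a fiscal cutoff are the concern of the dedicated pattern in Detecting duplicate expenses across overlapping submission windows, which layers a sliding UTC window on top of this same index.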
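The constructor stores chunk_size, but the methods shown take whatever batch they are given. One way to honour the memory ceiling is to slice the incoming stream before it reaches the matcher. The helper below is a sketch under that assumption; iter_chunks is a hypothetical name and is not part of DuplicateMatcher.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def iter_chunks(
    records: Iterable[Dict[str, Any]], chunk_size: int = 50_000
) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive lists of at most chunk_size records from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical wiring, mirroring the ingest-then-find order used in the tests:
#
# for chunk in iter_chunks(normalized_records, matcher.chunk_size):
#     matcher.ingest_batch(chunk)
#     for hit in matcher.find_duplicates(chunk):
#         route_and_log(hit, policy_engine)
```

Ingesting before querying means the first receipt for each fingerprint is already in the index when the join runs, so duplicates inside the same chunk are reported as well as duplicates of historical rows.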

Audit-emitting routing

Every match, non-match, and routing decision is logged as structured JSON with correlation IDs so an auditor can reconstruct the exact decision path. structlog renders events that stream cleanly into SIEM platforms and AP reconciliation dashboards.

import structlog
from typing import Dict, Any, Protocol

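# Wrap a plain PrintLogger so each event is rendered exactly once as JSON.
# Wrapping structlog.get_logger() here would pass the rendered string through
# structlog's globally configured processors a second time.
logger = structlog.wrap_logger(
    structlog.PrintLogger(),
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)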


class PolicyEngine(Protocol):
    def queue_for_manual_review(self, *, receipt_id: str, reason: str, escalation_tier: str) -> None: ...


def route_and_log(match_result: Dict[str, Any], policy_engine: PolicyEngine) -> None:
    """Emit an immutable audit event and route the flagged receipt to AP review."""
    logger.info(
        "duplicate_detection_result",
        receipt_id=match_result["new_receipt_id"],
        match_type="DETERMINISTIC_HASH",
        duplicate_of=match_result["duplicate_of"],
        amount_variance=match_result["amount_variance"],
        routing_action=match_result["routing_action"],
        compliance_flag="POLICY_VIOLATION_SUSPECTED",
        audit_hash=match_result["match_key"],
    )
    if match_result["routing_action"] == "HOLD_FOR_AP_REVIEW":
        policy_engine.queue_for_manual_review(
            receipt_id=match_result["new_receipt_id"],
            reason="Duplicate receipt detected within tolerance window",
            escalation_tier="AP_MANAGER",
        )

Structured logging ties every decision back to a specific normalization state and match key, which is what turns a suspected duplicate into defensible audit evidence rather than an opaque flag.
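Structured logs only count as audit evidence if later tampering is detectable. The guide does not prescribe a mechanism for this; one common technique, sketched here as an assumption, is to chain each event to the hash of the one before it. The names chain_event, verify_chain, and GENESIS are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash that anchors the first event


def chain_event(event: Dict[str, Any], prev_hash: str) -> Dict[str, Any]:
    """Return the event with prev_hash and a SHA-256 over prev_hash plus canonical JSON."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    return {**event, "prev_hash": prev_hash, "event_hash": digest}


def verify_chain(entries: List[Dict[str, Any]]) -> bool:
    """Recompute every hash in order; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in entries:
        event = {k: v for k, v in entry.items() if k not in ("prev_hash", "event_hash")}
        if entry["prev_hash"] != prev:
            return False
        if chain_event(event, prev)["event_hash"] != entry["event_hash"]:
            return False
        prev = entry["event_hash"]
    return True
```

A chain like this makes edits detectable but does not prevent them; write-once storage or an external anchor would still be needed for stronger guarantees.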

Configuration Reference

Expose every tunable through environment variables or a centralized policy store so finance can adjust controls without a code change. Pin duckdb and structlog versions; DuckDB’s on-disk format and SQL surface evolve between minor releases.

Key	Type	Default	Rationale
`DEDUP_CHUNK_SIZE`	int	`50000`	Rows per batch; raise for throughput, lower to cap memory/temp-disk use
`DEDUP_AMOUNT_TOLERANCE`	str (Decimal)	`"0.01"`	Absolute cent tolerance to absorb rounding without over-matching
`DEDUP_HISTORICAL_DB_PATH`	path	`/data/receipts.duckdb`	On-disk index location; must be on durable storage
`DEDUP_RETENTION_DAYS`	int	`540`	How far back the index is kept; align with the fiscal audit window
`DEDUP_ROUTING_ACTION`	enum	`HOLD_FOR_AP_REVIEW`	Terminal action on a match; alternatives: `AUTO_REJECT`, `FLAG_ONLY`
`DEDUP_ESCALATION_TIER`	enum	`AP_MANAGER`	Review queue tier for held receipts
`DEDUP_ON_MISSING_FIELD`	enum	`FALLBACK_REVIEW`	Behaviour when critical fields are null; never `SKIP`
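The keys in the table can be read and validated in one place at startup. The following is a minimal sketch, assuming environment variables as the source; DedupConfig and load_config are hypothetical names, and the defaults are copied from the table.

```python
import os
from dataclasses import dataclass
from decimal import Decimal
from pathlib import Path
from typing import Mapping, Optional

ROUTING_ACTIONS = {"HOLD_FOR_AP_REVIEW", "AUTO_REJECT", "FLAG_ONLY"}


@dataclass(frozen=True)
class DedupConfig:
    chunk_size: int
    amount_tolerance: Decimal
    historical_db_path: Path
    retention_days: int
    routing_action: str
    escalation_tier: str
    on_missing_field: str


def load_config(env: Optional[Mapping[str, str]] = None) -> DedupConfig:
    """Read the DEDUP_* keys, falling back to the documented defaults."""
    env = os.environ if env is None else env
    cfg = DedupConfig(
        chunk_size=int(env.get("DEDUP_CHUNK_SIZE", "50000")),
        amount_tolerance=Decimal(env.get("DEDUP_AMOUNT_TOLERANCE", "0.01")),
        historical_db_path=Path(env.get("DEDUP_HISTORICAL_DB_PATH", "/data/receipts.duckdb")),
        retention_days=int(env.get("DEDUP_RETENTION_DAYS", "540")),
        routing_action=env.get("DEDUP_ROUTING_ACTION", "HOLD_FOR_AP_REVIEW"),
        escalation_tier=env.get("DEDUP_ESCALATION_TIER", "AP_MANAGER"),
        on_missing_field=env.get("DEDUP_ON_MISSING_FIELD", "FALLBACK_REVIEW"),
    )
    # Fail fast on values the table rules out.
    if cfg.chunk_size <= 0 or cfg.retention_days <= 0:
        raise ValueError("chunk size and retention days must be positive")
    if cfg.amount_tolerance < 0:
        raise ValueError("amount tolerance must be non-negative")
    if cfg.routing_action not in ROUTING_ACTIONS:
        raise ValueError(f"unknown routing action: {cfg.routing_action}")
    if cfg.on_missing_field == "SKIP":
        raise ValueError("DEDUP_ON_MISSING_FIELD must never be SKIP")
    return cfg
```

Parsing the tolerance as Decimal keeps it consistent with the money representation rule, and rejecting SKIP at load time enforces the constraint before any batch runs.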

Validation & Testing

Test the fingerprint, not the plumbing: assert that inputs which differ only by formatting collapse to one match_key, and that genuinely different receipts do not. Use fixtures drawn from real failure modes — merchant tokenization drift, timezone shifts, and trailing-zero amounts.

import pytest
from pathlib import Path


def test_formatting_variants_collapse_to_one_key():
    a = normalize_receipt({"receipt_id": "A", "amount": "1234.5",
                           "transaction_date": "2026-03-01T23:30:00Z",
                           "merchant_name": "Amazon Marketplace", "currency": "USD"})
    b = normalize_receipt({"receipt_id": "B", "amount": 1234.50,
                           "transaction_date": "2026-03-01T23:30:00+00:00",
                           "merchant_name": "AMAZON  MARKETPLACE!!", "currency": "USD"})
    assert a["match_key"] == b["match_key"]


def test_distinct_amounts_do_not_match():
    a = normalize_receipt({"receipt_id": "A", "amount": "10.00",
                           "transaction_date": "2026-03-01T10:00:00Z",
                           "merchant_name": "Cafe", "currency": "USD"})
    b = normalize_receipt({"receipt_id": "B", "amount": "10.50",
                           "transaction_date": "2026-03-01T10:00:00Z",
                           "merchant_name": "Cafe", "currency": "USD"})
    assert a["match_key"] != b["match_key"]


def test_matcher_flags_reingested_duplicate(tmp_path: Path):
    matcher = DuplicateMatcher(tmp_path / "test.duckdb")
    first = normalize_receipt({"receipt_id": "R1", "amount": "42.00",
                               "transaction_date": "2026-03-01T09:00:00Z",
                               "merchant_name": "Hotel Co", "currency": "USD"})
    matcher.ingest_batch([first])
    dup = normalize_receipt({"receipt_id": "R2", "amount": "42.00",
                             "transaction_date": "2026-03-01T09:00:00Z",
                             "merchant_name": "hotel  co", "currency": "USD"})
    matcher.ingest_batch([dup])
    results = list(matcher.find_duplicates([dup]))
    assert results and results[0]["duplicate_of"] == "R1"
    assert results[0]["routing_action"] == "HOLD_FOR_AP_REVIEW"

Gate deployments on these assertions in CI. Cross-check the amount-tolerance edge (± DEDUP_AMOUNT_TOLERANCE) and confirm that records missing a critical field raise before reaching normalize_receipt, so they divert to review instead of matching.

Operational Runbook

Provision the index. Mount durable storage for DEDUP_HISTORICAL_DB_PATH and back-fill from reconciled historical receipts before enabling enforcement, so the first live batch has context to match against.
Deploy behind a fallback. Wire DEDUP_ON_MISSING_FIELD=FALLBACK_REVIEW; if OCR returns null for amount, transaction_date, or merchant_name, route to human review rather than forcing a match.
Tune the chunk size. Start at 50,000 rows and watch DuckDB’s temp-directory usage during month-end close; lower it if temp spill approaches the disk budget.
Set alert thresholds. Page when detected-duplicate rate deviates more than 3σ from the trailing 30-day baseline, when batch latency exceeds the sub-second target, or when the fallback-review rate spikes (a sign of upstream OCR degradation).
Sweep retention. Evict rows older than DEDUP_RETENTION_DAYS on an off-peak schedule to keep the index bounded and the join fast.
Roll back safely. Because ingest_batch uses INSERT OR IGNORE, re-running a batch after a rollback is idempotent; to disable enforcement, flip DEDUP_ROUTING_ACTION to FLAG_ONLY so decisions are still logged for audit while nothing is held.
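The 3σ alert rule in the runbook can be expressed as a small check over the trailing baseline. This is a sketch only: deviates_beyond_sigma is a hypothetical helper, the baseline is assumed to be one daily duplicate rate per day, and the population standard deviation is one reasonable choice among several.

```python
from statistics import mean, pstdev
from typing import Sequence


def deviates_beyond_sigma(baseline: Sequence[float], today: float, sigmas: float = 3.0) -> bool:
    """Return True when today's rate is more than `sigmas` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        # Not enough history to estimate spread; do not page.
        return False
    mu = mean(baseline)
    sd = pstdev(baseline)
    if sd == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return today != mu
    return abs(today - mu) > sigmas * sd
```

For example, with a 30-day baseline hovering around a 1% duplicate rate, a day at 5% would trip the check while a day at 1.05% would not.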

Duplicate receipt detection is a deterministic compliance control, not a heuristic. Strict normalization, an out-of-core join, and immutable audit logging give AP teams a component that scales predictably, satisfies audit requirements, and eliminates the memory bottleneck that plagues legacy expense systems.

Automated Policy Validation & Anomaly Flagging — the parent framework this component plugs into
Detecting duplicate expenses across overlapping submission windows — sliding-window matching across fiscal boundaries
Date Window Validation Logic — temporal boundary enforcement that feeds this engine
Merchant Category Code Routing — vendor entity resolution used during normalization
Dynamic Threshold Tuning — governs match tolerances and score cutoffs
Receipt Ingestion & OCR Data Extraction — the upstream stage that produces canonical records