Detecting duplicate expenses across overlapping submission windows

The same expense evades a period-bucketed dedup rule whenever its capture, submission, and settlement timestamps straddle a fiscal cutoff, so the two copies land in adjacent reporting periods and never get compared. This pattern extends the exact-key matcher documented in the parent Duplicate Receipt Detection guide with a sliding UTC window, so overlaps that cross a month-end, a rolling reimbursement cycle, or a multi-entity consolidation boundary are still caught. It sits inside the broader Automated Policy Validation & Anomaly Flagging framework and consumes the canonical records emitted upstream by Receipt Ingestion & OCR Data Extraction.

Why Standard Approaches Fail

Discrete submission_date BETWEEN period_start AND period_end bucketing breaks in production for three named reasons, each producing a silent false negative rather than a visible error:

Temporal decoupling. transaction_date, receipt_capture_date, and submission_date diverge across mobile uploads, manual corrections, and batch-processor lags. A receipt captured on the 30th but submitted on the 3rd lands in the next fiscal bucket, so a period-scoped query never sees both copies. This is the same boundary problem that Date Window Validation Logic resolves for policy dates — here it defeats deduplication instead.
Timezone normalization gaps. Corporate-card feeds settle in UTC while employee submissions carry a local offset. A 23:59 receipt stamped at UTC-5 (US/Eastern standard time) resolves to 04:59 UTC the following day, or 03:59 UTC during daylight time, so a capture in the last minute of the month falls past the month-end cutoff in the UTC feed, exactly as the sibling pattern validating expense dates against corporate travel policies describes for window enforcement.
OCR tokenization drift. Strict string equality fails when merchant text varies (AMZN MKTP vs Amazon Marketplace) or an amount is misread by a decimal place. Entity resolution belongs to Merchant Category Code Routing; until a stable token is available, a fuzzy fallback constrained by MCC and amount tolerance is required to avoid over-matching.
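The first two failure modes can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical `fiscal_bucket` helper that keys a record by the calendar month of its timestamp as given; the fixed UTC-5 offset is chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset, used purely for illustration (US Eastern standard time).
UTC_MINUS_5 = timezone(timedelta(hours=-5))


def fiscal_bucket(ts: datetime) -> str:
    """Hypothetical period key: calendar month of the timestamp as given."""
    return ts.strftime("%Y-%m")


# Employee copy: captured in the last minute of January, local time.
local_capture = datetime(2026, 1, 31, 23, 59, tzinfo=UTC_MINUS_5)
# Card-feed copy of the same purchase, expressed in UTC.
card_feed = local_capture.astimezone(timezone.utc)

local_bucket = fiscal_bucket(local_capture)  # January
feed_bucket = fiscal_bucket(card_feed)       # February
same_instant = local_capture == card_feed    # aware datetimes compare by instant
```

Both values describe the same instant, yet a month-scoped query places them in different buckets and never compares them.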

Architecture & Algorithm

Replace isolated period buckets with a rolling historical timeline. Normalize every timestamp to UTC, insert it into a sorted index, and evaluate each incoming record against a configurable tolerance buffer (±72 hours by default), using bisect to locate the window bounds in O(log N). Avoid a linear scan per record, which degrades to O(N²) over a close-period batch. Note that bisect.insort finds the slot in O(log N) but still shifts list elements on insert, which is one reason to keep the index bounded. Candidates then flow through a deterministic fallback chain that halts at the first positive match, so the emitted decision is reproducible and its precedence is auditable. Match-score cutoffs used inside the fuzzy stage are governed by Dynamic Threshold Tuning.

Memory and latency notes are inline: @dataclass(slots=True) (Python 3.10+) generates __slots__ and drops the per-instance __dict__, the sorted timeline keeps window lookup logarithmic, and pre-filtering by currency and MCC before any fuzzy comparison keeps CPU cost bounded.

import bisect
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional, Tuple

# Audit-safe structured logging: every decision is reconstructable from the log.
audit_logger = logging.getLogger("expense.dedup.overlap")
if not audit_logger.handlers:
    _handler = logging.StreamHandler()
    _handler.setFormatter(logging.Formatter("%(asctime)s | %(levelname)s | %(message)s"))
    audit_logger.addHandler(_handler)
audit_logger.setLevel(logging.INFO)


@dataclass(slots=True)
class ExpenseRecord:
    id: str
    transaction_dt: datetime      # timezone-aware; converted to UTC on ingest
    submission_dt: datetime       # timezone-aware
    amount: float
    currency: str
    merchant_raw: str
    mcc: str
    card_ref: Optional[str] = None


class OverlapDuplicateDetector:
    """Sliding-window duplicate detector for records that cross period boundaries.

    Maintains a UTC-sorted timeline and evaluates each record against a bounded
    tolerance window using binary search. A deterministic fallback chain assigns
    exactly one match reason, highest confidence first, and every positive match
    is logged for SOX-traceable audit evidence.
    """

    def __init__(self, tolerance_hours: int = 72, amount_variance_pct: float = 0.02,
                 fuzzy_min_score: int = 85) -> None:
        self.tolerance = timedelta(hours=tolerance_hours)
        self.variance = amount_variance_pct
        self.fuzzy_min_score = fuzzy_min_score
        # Sorted timeline of (utc_timestamp, record_id) enables O(log N) windowing.
        self._timeline: List[Tuple[datetime, str]] = []
        self._index: Dict[str, ExpenseRecord] = {}
        # Deterministic precedence: first positive result wins.
        self._fallback_chain: List[Callable[[ExpenseRecord, List[ExpenseRecord]], Optional[str]]] = [
            self._exact_match,
            self._card_ref_cross_ref,
            self._fuzzy_mcc_match,
            self._temporal_amount_match,
        ]

    def ingest(self, record: ExpenseRecord) -> None:
        utc_ts = record.transaction_dt.astimezone(timezone.utc)
        self._index[record.id] = record
        bisect.insort(self._timeline, (utc_ts, record.id))

    def _get_window_candidates(self, record: ExpenseRecord) -> List[ExpenseRecord]:
        utc_ts = record.transaction_dt.astimezone(timezone.utc)
        # "" sorts before any id, "~" (0x7E) sorts after typical ids, so the pair
        # bounds capture every record whose timestamp is within the tolerance band.
        left = bisect.bisect_left(self._timeline, (utc_ts - self.tolerance, ""))
        right = bisect.bisect_right(self._timeline, (utc_ts + self.tolerance, "~"))
        return [self._index[idx] for _, idx in self._timeline[left:right] if idx != record.id]

    def _exact_match(self, rec: ExpenseRecord, candidates: List[ExpenseRecord]) -> Optional[str]:
        for c in candidates:
            if (rec.amount == c.amount and rec.currency == c.currency and
                    rec.merchant_raw.strip().lower() == c.merchant_raw.strip().lower()):
                return f"EXACT_MATCH:{c.id}"
        return None

    def _card_ref_cross_ref(self, rec: ExpenseRecord, candidates: List[ExpenseRecord]) -> Optional[str]:
        # A shared corporate-card transaction id is a stronger signal than any string similarity.
        if not rec.card_ref:
            return None
        for c in candidates:
            if c.card_ref and c.card_ref == rec.card_ref:
                return f"CARD_REF:{c.id}"
        return None

    def _fuzzy_mcc_match(self, rec: ExpenseRecord, candidates: List[ExpenseRecord]) -> Optional[str]:
        # rapidfuzz is optional; degrade to exact-token comparison if unavailable.
        try:
            from rapidfuzz import fuzz
            score = lambda a, b: fuzz.token_set_ratio(a, b)
        except ImportError:  # pragma: no cover - environment-dependent
            score = lambda a, b: 100 if a.split() == b.split() else 0
        for c in (c for c in candidates if c.mcc == rec.mcc and c.currency == rec.currency):
            if (score(rec.merchant_raw, c.merchant_raw) >= self.fuzzy_min_score and
                    abs(rec.amount - c.amount) <= abs(rec.amount) * self.variance):
                return f"FUZZY_MCC:{c.id}"
        return None

    def _temporal_amount_match(self, rec: ExpenseRecord, candidates: List[ExpenseRecord]) -> Optional[str]:
        for c in candidates:
            if (rec.currency == c.currency and
                    abs(rec.amount - c.amount) <= abs(rec.amount) * self.variance):
                return f"TEMPORAL_AMOUNT:{c.id}"
        return None

    def evaluate(self, record: ExpenseRecord) -> Dict[str, object]:
        """Return a routing decision with the single winning match reason, if any."""
        candidates = self._get_window_candidates(record)
        if not candidates:
            return {"status": "CLEAN", "flag": None, "matched_id": None}

        for rule in self._fallback_chain:
            flag = rule(record, candidates)
            if flag:
                matched_id = flag.split(":", 1)[1]
                audit_logger.info(
                    "duplicate_flagged id=%s reason=%s matched_id=%s window_hours=%s",
                    record.id, flag.split(":", 1)[0], matched_id, self.tolerance.total_seconds() / 3600,
                )
                return {"status": "DUPLICATE_DETECTED", "flag": flag,
                        "matched_id": matched_id, "routing_action": "HOLD_FOR_AP_REVIEW"}

        return {"status": "CLEAN", "flag": None, "matched_id": None}

The _exact_match and _card_ref_cross_ref rules run before the fuzzy stage precisely because they are deterministic and cheap; a corporate-card transaction id is a stronger signal than any string similarity, so it outranks fuzzy matching in the precedence order.

Step-by-Step Integration

Widen the window past your worst decoupling gap. Set tolerance_hours to at least the longest observed lag between capture and settlement in a trailing 90-day sample. Verify the boundary holds:

det = OverlapDuplicateDetector(tolerance_hours=72)
base = datetime(2026, 3, 31, 23, 59, tzinfo=timezone(timedelta(hours=-5)))  # last minute of March at a fixed UTC-5 offset
det.ingest(ExpenseRecord("A", base, base, 120.0, "USD", "AMZN MKTP", "5942"))
probe = ExpenseRecord("B", base + timedelta(hours=6), base, 120.0, "USD", "AMZN MKTP", "5942")
assert det.evaluate(probe)["status"] == "DUPLICATE_DETECTED"  # A is 31 March local, B is 1 April local: still caught

Normalize on ingest, never in the matcher. Convert to UTC and resolve the merchant token before ingest(); do not mutate raw OCR. Records missing amount, transaction_dt, or merchant_raw must divert to the fallback described in classifying OCR extraction errors for manual review, not enter the timeline.
Confirm precedence is deterministic. Feed a record that satisfies two rules and assert the higher-confidence reason wins:
```
assert det.evaluate(probe)["flag"].startswith(("EXACT_MATCH", "CARD_REF"))
```
Bound the index. Evict entries older than tolerance_hours * 2 on an off-peak sweep so the timeline stays small and lookups stay logarithmic; flush to a persistent store between chunks for batches over ~500k records/month.
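Wire the audit stream. Route the audit_logger output to your SIEM and confirm reason and matched_id appear on every DUPLICATE_DETECTED event before enabling any hold action. The formatter configured in the listing emits pipe-delimited text, so swap in a JSON formatter if your pipeline expects structured events.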
Deploy behind a flag-only mode first. Log decisions without holding receipts for one close cycle, reconcile flagged pairs against known duplicates, then promote to HOLD_FOR_AP_REVIEW.
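The "Bound the index" step can be sketched as a standalone sweep. This is a minimal illustration, assuming the same `(utc_timestamp, record_id)` timeline layout and id-keyed index used by the detector; `evict_older_than` is a hypothetical helper, not part of the listing above.

```python
import bisect
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def evict_older_than(timeline: List[Tuple[datetime, str]], index: Dict[str, object],
                     now_utc: datetime, tolerance: timedelta) -> int:
    """Drop entries older than 2x the tolerance; return how many were removed."""
    cutoff = now_utc - 2 * tolerance
    # "" sorts before every id, so this finds the first entry at or after the cutoff.
    keep_from = bisect.bisect_left(timeline, (cutoff, ""))
    for _, record_id in timeline[:keep_from]:
        index.pop(record_id, None)
    del timeline[:keep_from]  # one slice deletion instead of repeated pops
    return keep_from


# Usage: four entries aged 200h, 150h, 100h and 10h against a 72h tolerance.
now = datetime(2026, 4, 10, tzinfo=timezone.utc)
timeline = [(now - timedelta(hours=h), f"r{h}") for h in (200, 150, 100, 10)]
index = {rid: object() for _, rid in timeline}
removed = evict_older_than(timeline, index, now, timedelta(hours=72))
```

Because the timeline is sorted, everything to evict sits in one contiguous prefix, so the sweep needs a single binary search rather than a full scan.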

Edge Cases & Gotchas

Edge condition	Failure it causes	Mitigation
Timezone offset dropped on ingest	Two copies land in different UTC days; window misses one	Require timezone-aware `datetime`; reject naive values before `ingest()`
FX-revalued corporate-card copy	Amounts differ by a few cents or more after revaluation, so the exact-match rule fails	Rely on the `±2%` `amount_variance_pct` band in the fuzzy/temporal rules
Merchant tokenization drift	`AMZN MKTP` vs `Amazon Marketplace` never string-equal	Resolve via Merchant Category Code Routing; fall back to fuzzy score ≥ 85 within one MCC
Legitimate same-day repeat purchase	Two genuine coffees flagged as one duplicate	Constrain fuzzy stage to identical MCC + amount band; route holds to human review, never auto-reject
OCR confidence below 0.85	Text-based rules produce false positives on degraded scans	Skip fuzzy text matching; rely on `_card_ref_cross_ref` and `_temporal_amount_match` only
Window set narrower than decoupling lag	Cross-period duplicate slips through	Size `tolerance_hours` from observed capture-to-settlement lag, not the fiscal period length
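The first row's mitigation matters because `astimezone()` on a naive `datetime` silently assumes the machine's local zone instead of failing. A minimal pre-ingest gate might look like the following; `ingest_block_reason` and its reason strings are hypothetical names for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def ingest_block_reason(transaction_dt: Optional[datetime], amount: Optional[float],
                        merchant_raw: Optional[str]) -> Optional[str]:
    """Return why a record must skip the timeline, or None if it may be ingested."""
    if transaction_dt is None or amount is None or not (merchant_raw or "").strip():
        return "MISSING_FIELD"
    # A datetime is naive when it has no tzinfo or its tzinfo yields no offset.
    if transaction_dt.tzinfo is None or transaction_dt.utcoffset() is None:
        return "NAIVE_DATETIME"
    return None


aware = datetime(2026, 3, 31, 23, 59, tzinfo=timezone(timedelta(hours=-5)))
naive = datetime(2026, 3, 31, 23, 59)

ok_reason = ingest_block_reason(aware, 120.0, "AMZN MKTP")     # None: safe to ingest
naive_reason = ingest_block_reason(naive, 120.0, "AMZN MKTP")  # blocked
blank_reason = ingest_block_reason(aware, 120.0, "   ")        # blocked
```

Records that return a reason would divert to manual review rather than enter the timeline.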

Silent rule overrides undermine the control evidence that SOX Section 404 assessments rely on, so the fallback chain executes sequentially and logs the winning reason; when multiple rules would fire, the highest-precedence match is recorded and the suppressed paths remain reproducible from the pinned configuration. External log-retention practice follows NIST SP 800-92.
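If the audit stream has to be machine-parseable downstream, one option is a JSON formatter attached to the audit logger. The sketch below is an assumption rather than part of the detector: `JsonAuditFormatter` and the `record_id`, `reason`, and `matched_id` field names are hypothetical, and the fields would be passed through the standard `extra=` argument of the logging call.

```python
import json
import logging


class JsonAuditFormatter(logging.Formatter):
    """Hypothetical formatter: renders each audit record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via `extra=` land as attributes on the LogRecord.
        for key in ("record_id", "reason", "matched_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


# Build a record by hand to show the output shape without configuring handlers.
sample = logging.makeLogRecord({
    "msg": "duplicate_flagged", "levelname": "INFO",
    "record_id": "B", "reason": "CARD_REF", "matched_id": "A",
})
parsed = json.loads(JsonAuditFormatter().format(sample))
```

Keeping `reason` and `matched_id` as separate keys lets a reviewer filter flagged pairs by rule without parsing the message text.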

FAQ

How wide should the tolerance window be?

Size it from data, not policy. Measure the distribution of submission_dt - transaction_dt across a trailing 90-day sample and set tolerance_hours above the 99th percentile of that lag. The default of 72 hours covers most corporate-card settlement delays plus a weekend; anything narrower re-introduces the cross-period miss the pattern exists to close.
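One way to turn that measurement into a setting is a nearest-rank percentile over the observed lags, rounded up to whole hours. This is a sketch under stated assumptions: `tolerance_hours_from_lags` is a hypothetical helper, and the lags are assumed to be `timedelta` values computed from `submission_dt - transaction_dt`.

```python
import math
from datetime import timedelta
from typing import List


def tolerance_hours_from_lags(lags: List[timedelta], percentile: int = 99) -> int:
    """Nearest-rank percentile of observed lags, rounded up to whole hours."""
    if not lags:
        raise ValueError("need at least one observed lag")
    ordered = sorted(abs(lag) for lag in lags)
    # Integer ceiling of percentile * n / 100 avoids float rounding surprises.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return math.ceil(ordered[rank - 1].total_seconds() / 3600)


# Usage: 100 synthetic lags of 1..100 hours (illustrative data, not measurements).
sample_lags = [timedelta(hours=h) for h in range(1, 101)]
p99_hours = tolerance_hours_from_lags(sample_lags)
```

The result is a floor for `tolerance_hours`; adding headroom on top of it is a judgment call that depends on how costly a missed duplicate is.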

Why is the timeline sorted instead of a hash lookup?

A hash keyed on an exact timestamp only finds records at the same instant, which defeats the whole point: duplicates here differ in time. A sorted timeline plus bisect locates both window bounds in O(log N) and then returns the k records between them, so window retrieval stays fast even as the historical index grows into the millions.
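The contrast is easy to see on a toy timeline. The sketch below mirrors the `(timestamp, id)` layout used on this page; the ids and offsets are made up for illustration.

```python
import bisect
from datetime import datetime, timedelta, timezone

base = datetime(2026, 4, 1, tzinfo=timezone.utc)
# Sorted (timestamp, id) pairs, mirroring the timeline layout used on this page.
timeline = [(base + timedelta(hours=h), f"r{h}") for h in (0, 30, 60, 100, 200)]


def ids_within(center: datetime, tolerance: timedelta) -> list:
    """Return ids whose timestamp lies inside center +/- tolerance."""
    lo = bisect.bisect_left(timeline, (center - tolerance, ""))
    hi = bisect.bisect_right(timeline, (center + tolerance, "~"))
    return [rid for _, rid in timeline[lo:hi]]


probe_ts = base + timedelta(hours=50)
window_ids = ids_within(probe_ts, timedelta(hours=72))
# A dict keyed on the exact timestamp finds nothing for the same probe.
exact_hit = {ts: rid for ts, rid in timeline}.get(probe_ts)
```

The window query returns every neighbour within 72 hours of the probe, while the exact-key lookup comes back empty because no record shares the probe's instant.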

How does this differ from the exact-key matcher in the parent guide?

The parent Duplicate Receipt Detection engine hashes a normalized day-plus-amount-plus-merchant key and catches copies inside the same normalized day. This page layers a continuous UTC window on top so copies whose timestamps fall in adjacent periods — the exact case a same-day key misses — are still compared.

Won’t a ±2% amount band flag legitimate similar purchases?

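In the fuzzy stage, only when they also share an MCC and currency and fall inside the tolerance window. The last-resort _temporal_amount_match rule checks currency and amount alone, so it can pair unrelated purchases of similar value. In both cases the routing action is HOLD_FOR_AP_REVIEW, never auto-reject. The band exists to absorb FX revaluation and rounding on genuine duplicates; tune it through Dynamic Threshold Tuning if your merchant mix produces false holds.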
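The band arithmetic itself is small enough to check by hand. A minimal sketch, assuming a hypothetical `within_band` helper that measures the tolerance against the absolute value of the first amount so that credits (negative amounts) behave the same way as charges:

```python
def within_band(a: float, b: float, variance_pct: float = 0.02) -> bool:
    """True when b lies within variance_pct of a, measured against |a|."""
    return abs(a - b) <= abs(a) * variance_pct


fx_copy = within_band(120.00, 121.80)       # 1.80 apart, band is 2.40
too_far = within_band(120.00, 123.00)       # 3.00 apart, band is 2.40
credit_copy = within_band(-120.00, -121.80) # same gap on a refund
```

Note that the band is relative to the incoming record's amount, so the check is slightly asymmetric: swapping the two arguments changes the allowed gap by a few cents.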

What happens when OCR confidence is too low to trust the merchant string?

Bypass the fuzzy text stage entirely and let _card_ref_cross_ref and _temporal_amount_match decide. If no card reference exists and confidence is below threshold, the record should route to human review via the receipt error categorization path rather than risk a text-based false positive.
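That routing decision can be expressed as a small selector that picks which rules run for a given record. This is a sketch under stated assumptions: `rules_for` is a hypothetical helper, the 0.85 cutoff is taken from the edge-case table, and the rule names are returned as strings only to keep the example self-contained.

```python
from typing import List, Optional

OCR_MIN_CONFIDENCE = 0.85  # cutoff from the edge-case table; tune per scanner


def rules_for(ocr_confidence: float, card_ref: Optional[str]) -> List[str]:
    """Pick the rule names to run; an empty list means divert to manual review."""
    if ocr_confidence >= OCR_MIN_CONFIDENCE:
        return ["_exact_match", "_card_ref_cross_ref",
                "_fuzzy_mcc_match", "_temporal_amount_match"]
    if card_ref:
        # Merchant text is untrusted, so only the non-text rules remain.
        return ["_card_ref_cross_ref", "_temporal_amount_match"]
    return []


full_chain = rules_for(0.97, None)
degraded_chain = rules_for(0.60, "TXN-001")
manual_review = rules_for(0.60, None)
```

Dropping `_exact_match` along with the fuzzy rule in the degraded case is deliberate here, since both compare the merchant string that the low-confidence scan cannot be trusted to provide.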

Duplicate Receipt Detection — the parent guide whose exact-key matcher this window extends
Validating expense dates against corporate travel policies — the sibling temporal pattern for policy windows
Classifying OCR extraction errors for manual review — where low-confidence records divert before matching
Merchant Category Code Routing — vendor entity resolution used by the fuzzy stage
Dynamic Threshold Tuning — governs the fuzzy score and amount-variance cutoffs