Detecting duplicate expenses across overlapping submission windows

Expense report auditing fractures at fiscal cutoffs, rolling reimbursement cycles, and multi-entity consolidation periods. When submission windows overlap, rigid exact-match deduplication misses split receipts, timezone-shifted transactions, and delayed corporate card feeds. Detecting duplicate expenses across overlapping submission windows requires a sliding-window architecture, tolerance-based matching, and deterministic fallback chains aligned with Automated Policy Validation & Anomaly Flagging standards. This guide provides root cause analysis, memory-optimized Python patterns, and audit-safe validation sequences for AP and travel operations.

Root Cause Analysis: Temporal Decoupling & Boundary Failures

Traditional deduplication relies on submission_date BETWEEN period_start AND period_end. This fails in production due to three systemic breakdowns:

  1. Temporal Decoupling: transaction_date, receipt_capture_date, and submission_date diverge across mobile uploads, manual corrections, and batch processor lags. A receipt captured on day 30 of one period may be submitted on day 3 of the next, landing in an adjacent fiscal bucket.
  2. Timezone Normalization Gaps: Corporate cards settle in UTC, while employee submissions use the local time zone. A 23:59 EST receipt resolves to 04:59 UTC the following day, bypassing month-end cutoffs.
  3. Rule Collisions & OCR Variance: Strict “exact match” policies intersect with anomaly scoring models that permit ±2% FX variance. Merchant tokenization drift (AMZN MKTP vs Amazon Marketplace) and decimal misreads guarantee silent false negatives when string equality is enforced.
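
The timezone gap in item 2 can be reproduced in a few lines. This is a minimal sketch, assuming a fixed UTC-5 offset as a stand-in for EST; the helper name to_utc is hypothetical and not part of the detector shown later.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset standing in for EST (assumption for illustration;
# a real pipeline would more likely resolve zones by name).
EST = timezone(timedelta(hours=-5), "EST")

def to_utc(local_dt: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC, rejecting naive input."""
    if local_dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before normalizing")
    return local_dt.astimezone(timezone.utc)

# A receipt stamped one minute before midnight on the last day of January.
receipt_local = datetime(2024, 1, 31, 23, 59, tzinfo=EST)
receipt_utc = to_utc(receipt_local)

print(receipt_local.month, receipt_utc.month)  # 1 2
print(receipt_utc.isoformat())                 # 2024-02-01T04:59:00+00:00
```

The same receipt belongs to January in local time and to February in UTC, which is why a month-end BETWEEN filter on one of the two fields can miss its twin.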

Sliding-Window Architecture & Memory Optimization

Replace isolated period buckets with a rolling historical state table. Normalize all timestamps to UTC using timezone-aware datetime objects, then apply a configurable tolerance buffer (±72 hours by default). Evaluate transactions against a continuous timeline rather than discrete windows. This approach maintains temporal continuity across fiscal boundaries and aligns with Duplicate Receipt Detection best practices.

Memory & Latency Constraints:

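  • Linear scans over historical expense tables degrade to O(N²) as submission volume scales. Use a sorted index with binary search (bisect) to locate the window boundaries in O(log N); reading the k candidates inside the window is then O(k). Note that bisect.insort still shifts list elements, so each insertion is O(N) in the worst case.
  • Apply __slots__ to data models to eliminate per-instance __dict__ overhead. The saving is quoted here at roughly 40% for high-throughput pipelines, but it varies with field count and Python version, so measure it on your own records.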
  • Pre-filter candidates by currency and MCC before executing fuzzy string comparisons to minimize CPU cycles.
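
The pre-filter in the last bullet can be sketched as a bucket lookup. This is an illustration under stated assumptions: Txn, bucket_by_currency_mcc, and prefilter are hypothetical names, and the record carries only the fields the filter needs.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

class Txn(NamedTuple):
    """Hypothetical minimal record; only the fields the pre-filter needs."""
    id: str
    currency: str
    mcc: str
    merchant_raw: str

Buckets = Dict[Tuple[str, str], List[Txn]]

def bucket_by_currency_mcc(records: List[Txn]) -> Buckets:
    """Group records by (currency, MCC) so lookups skip unrelated rows."""
    buckets: Buckets = defaultdict(list)
    for rec in records:
        buckets[(rec.currency, rec.mcc)].append(rec)
    return buckets

def prefilter(record: Txn, buckets: Buckets) -> List[Txn]:
    """Return same-currency, same-MCC candidates, excluding the record itself."""
    same_bucket = buckets.get((record.currency, record.mcc), [])
    return [c for c in same_bucket if c.id != record.id]

history = [
    Txn("a1", "USD", "5942", "AMZN MKTP US"),
    Txn("a2", "USD", "5812", "CAFE NERO"),
    Txn("a3", "EUR", "5942", "AMAZON EU"),
]
incoming = Txn("a4", "USD", "5942", "Amazon Marketplace")
survivors = prefilter(incoming, bucket_by_currency_mcc(history))
print([c.id for c in survivors])  # ['a1']
```

Only the surviving candidates would then be passed to the fuzzy string comparison, which is the expensive step.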

Production Python Implementation

The following class implements a deterministic, memory-efficient sliding-window detector with explicit fallback precedence.

import bisect
import logging
from datetime import datetime, timedelta, timezone
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass
from rapidfuzz import fuzz

# Audit-safe logging configuration
audit_logger = logging.getLogger("expense_audit")
audit_logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s | %(levelname)s | %(message)s"))
audit_logger.addHandler(handler)

@dataclass(slots=True)
class ExpenseRecord:
    id: str
    transaction_dt: datetime
    submission_dt: datetime
    amount: float
    currency: str
    merchant_raw: str
    mcc: str
    card_ref: Optional[str] = None

class OverlapDuplicateDetector:
    def __init__(self, tolerance_hours: int = 72, amount_variance_pct: float = 0.02):
        self.tolerance = timedelta(hours=tolerance_hours)
        self.variance = amount_variance_pct
        # Sorted timeline: (utc_timestamp, record_id)
        self._timeline: List[Tuple[datetime, str]] = []
        # Fast lookup index
        self._index: Dict[str, ExpenseRecord] = {}
        # Deterministic fallback precedence
        self._fallback_chain = [
            self._exact_match,
            self._fuzzy_mcc_match,
            self._temporal_amount_match,
            self._card_ref_cross_ref
        ]

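    def ingest(self, record: ExpenseRecord) -> None:
        # astimezone() silently treats a naive datetime as local time, so reject it.
        if record.transaction_dt.tzinfo is None:
            raise ValueError(f"transaction_dt for {record.id} must be timezone-aware")
        utc_ts = record.transaction_dt.astimezone(timezone.utc)
        self._index[record.id] = record
        bisect.insort(self._timeline, (utc_ts, record.id))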

    def _get_window_candidates(self, record: ExpenseRecord) -> List[ExpenseRecord]:
        utc_ts = record.transaction_dt.astimezone(timezone.utc)
        start = utc_ts - self.tolerance
        end = utc_ts + self.tolerance
        # Bisect on the timestamp alone so the bounds do not depend on sentinel
        # record IDs. key= needs Python 3.10+, the same floor as slots=True.
        left = bisect.bisect_left(self._timeline, start, key=lambda entry: entry[0])
        right = bisect.bisect_right(self._timeline, end, key=lambda entry: entry[0])
        return [self._index[idx] for _, idx in self._timeline[left:right] if idx != record.id]

    def _exact_match(self, rec: ExpenseRecord, candidates: List[ExpenseRecord]) -> Optional[str]:
        for c in candidates:
            if (rec.amount == c.amount and
                rec.currency == c.currency and
                rec.merchant_raw.strip().lower() == c.merchant_raw.strip().lower()):
                return f"EXACT_MATCH:{c.id}"
        return None

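    def _fuzzy_mcc_match(self, rec: ExpenseRecord, candidates: List[ExpenseRecord]) -> Optional[str]:
        # abs() keeps the tolerance positive for refunds (negative amounts).
        tolerance = abs(rec.amount) * self.variance
        for c in candidates:
            # Cheap equality filters run before the fuzzy comparison.
            if c.mcc != rec.mcc or c.currency != rec.currency:
                continue
            # Lowercase both sides so the score does not depend on letter case.
            score = fuzz.token_set_ratio(rec.merchant_raw.lower(), c.merchant_raw.lower())
            if score >= 85 and abs(rec.amount - c.amount) <= tolerance:
                return f"FUZZY_MCC:{c.id}"
        return None

    def _temporal_amount_match(self, rec: ExpenseRecord, candidates: List[ExpenseRecord]) -> Optional[str]:
        tolerance = abs(rec.amount) * self.variance
        for c in candidates:
            # Raw amounts are only comparable within the same currency.
            if c.currency == rec.currency and abs(rec.amount - c.amount) <= tolerance:
                return f"TEMPORAL_AMOUNT:{c.id}"
        return None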

    def _card_ref_cross_ref(self, rec: ExpenseRecord, candidates: List[ExpenseRecord]) -> Optional[str]:
        if not rec.card_ref: return None
        for c in candidates:
            if c.card_ref == rec.card_ref:
                return f"CARD_REF:{c.id}"
        return None

    def evaluate(self, record: ExpenseRecord) -> Dict:
        candidates = self._get_window_candidates(record)
        if not candidates:
            return {"status": "CLEAN", "flags": [], "eval_path": []}
        
        eval_path = []
        for fallback in self._fallback_chain:
            flag = fallback(record, candidates)
            if flag:
                eval_path.append(flag)
                break  # Deterministic precedence: halt at first positive match
        
        if not eval_path:
            return {"status": "CLEAN", "flags": [], "eval_path": []}
            
        audit_logger.info(f"DUPLICATE_FLAGGED id={record.id} path={eval_path[0]}")
        return {
            "status": "DUPLICATE_DETECTED",
            "flags": eval_path,
            # maxsplit=1 keeps record IDs that themselves contain ":" intact
            "matched_ids": [flag.split(":", 1)[1] for flag in eval_path],
            "eval_path": eval_path
        }

Deterministic Fallback Chains & Audit Compliance

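Silent rule overrides undermine SOX 404 control requirements. The fallback matrix must execute sequentially, and every evaluation step should be written to an immutable audit trail; the evaluate() method above logs only the flagged match, so extend it if your controls require a per-rule record. When OCR confidence drops below 0.85, bypass text-based matching and route directly to _card_ref_cross_ref. ExpenseRecord as defined above carries no OCR confidence field, so this routing assumes one is supplied upstream. Routing on the card reference prevents false positives from receipt degradation while staying consistent with NIST SP 800-92 guidelines for computer security log management.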
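
The OCR routing rule can be sketched as a small chain selector. This is a hedged illustration: select_chain, the ocr_confidence argument, and the stand-in rules are all hypothetical, and the detector class above would need an OCR confidence value passed in from upstream before it could use anything like this.

```python
from typing import Callable, List, Optional, Sequence

# A rule takes (record, candidates) and returns a flag string or None.
Rule = Callable[[dict, Sequence[dict]], Optional[str]]

OCR_CONFIDENCE_FLOOR = 0.85  # threshold stated in the text

def select_chain(
    ocr_confidence: Optional[float],
    full_chain: List[Rule],
    card_ref_rule: Rule,
) -> List[Rule]:
    """Pick the rules to run for one record.

    Below the confidence floor, text-based rules are skipped and only the
    card-reference rule runs. A missing score is treated as trustworthy
    here, which is an assumption of this sketch.
    """
    if ocr_confidence is not None and ocr_confidence < OCR_CONFIDENCE_FLOOR:
        return [card_ref_rule]
    return list(full_chain)

# Stand-in rules so the sketch runs on its own.
def exact_rule(rec: dict, cands: Sequence[dict]) -> Optional[str]:
    return None

def fuzzy_rule(rec: dict, cands: Sequence[dict]) -> Optional[str]:
    return None

def card_rule(rec: dict, cands: Sequence[dict]) -> Optional[str]:
    return "CARD_REF:e-17"

chain = [exact_rule, fuzzy_rule, card_rule]
low = select_chain(0.60, chain, card_rule)
high = select_chain(0.95, chain, card_rule)
print(len(low), len(high))  # 1 3
```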

Precedence Matrix:

  1. EXACT_MATCH: Identical amount, currency, and normalized merchant string.
  2. FUZZY_MCC: Tokenized similarity ≥85% constrained to identical MCC, amount within ±2%.
  3. TEMPORAL_AMOUNT: Amount within tolerance, overlapping UTC window, merchant variance ignored.
  4. CARD_REF: Corporate card transaction ID match (highest confidence fallback).

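If multiple rules could trigger, the pipeline reports the highest-precedence match. Because evaluate() halts at the first positive match, lower-precedence rules are never evaluated for that record; logging suppressed paths therefore requires running the full chain and recording every hit. This explicit evaluation chain eliminates ambiguous scoring and provides AP managers with defensible audit artifacts.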
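
One way to capture suppressed paths is to run every rule and keep the first hit as primary. The following is a minimal sketch, not the detector's own method: evaluate_all, the logger name, and the stand-in rules are hypothetical, and records are plain dicts to keep the block self-contained.

```python
import logging
from typing import Callable, Dict, List, Optional, Sequence

logger = logging.getLogger("expense_audit.sketch")  # hypothetical logger name

# A rule takes (record, candidates) and returns a flag string or None.
Rule = Callable[[dict, Sequence[dict]], Optional[str]]

def evaluate_all(record: dict, candidates: Sequence[dict], chain: List[Rule]) -> Dict:
    """Run every rule in precedence order and report all hits.

    The first hit is the primary flag; later hits are recorded as
    suppressed so the audit trail shows what else would have fired.
    """
    hits: List[str] = []
    for rule in chain:
        flag = rule(record, candidates)
        if flag:
            hits.append(flag)
    if not hits:
        return {"status": "CLEAN", "primary": None, "suppressed": []}
    primary, suppressed = hits[0], hits[1:]
    logger.info(
        "DUPLICATE_FLAGGED id=%s primary=%s suppressed=%s",
        record["id"], primary, suppressed,
    )
    return {"status": "DUPLICATE_DETECTED", "primary": primary, "suppressed": suppressed}

# Stand-in rules so the sketch runs on its own.
def exact(rec: dict, cands: Sequence[dict]) -> Optional[str]:
    return None

def fuzzy(rec: dict, cands: Sequence[dict]) -> Optional[str]:
    return "FUZZY_MCC:e-17"

def card(rec: dict, cands: Sequence[dict]) -> Optional[str]:
    return "CARD_REF:e-17"

result = evaluate_all({"id": "e-42"}, [{"id": "e-17"}], [exact, fuzzy, card])
print(result["primary"], result["suppressed"])  # FUZZY_MCC:e-17 ['CARD_REF:e-17']
```

The trade-off is cost: every rule runs for every record with candidates, so the fuzzy comparison is no longer skipped when an exact match already exists.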

Latency Tuning & Pipeline Integration

For high-volume AP pipelines (>500k records/month), optimize ingestion and evaluation throughput:

  • Batch Chunking: Process records in 10k-row micro-batches. Flush the _timeline index to a persistent store (for example Redis or SQLite) between batches to bound memory growth.
  • Vectorized Fallbacks: Replace Python loops with Polars or Pandas for bulk candidate evaluation. Use categorical encoding for currency and mcc to accelerate joins and filters.
  • Index Pruning: Evict records older than (tolerance_hours * 2) from _timeline to maintain O(log N) lookup performance. Implement a background TTL sweeper that runs during off-peak windows.
  • Monitoring: Instrument evaluate() with Prometheus counters for duplicate_detected_total, fallback_chain_depth, and window_lookup_latency_ms. Alert when fallback depth exceeds 3, meaning matches are being resolved only by the fourth and last rule, which indicates degraded OCR quality or policy misalignment.
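
The Index Pruning step above can be sketched with the same bisect approach used for window lookups. This is an illustration under stated assumptions: prune_timeline is a hypothetical helper, the timeline holds (utc_timestamp, record_id) tuples sorted ascending, and all timestamps are timezone-aware.

```python
import bisect
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

def prune_timeline(
    timeline: List[Tuple[datetime, str]],
    index: Dict[str, object],
    now: datetime,
    tolerance: timedelta,
) -> int:
    """Evict entries older than twice the tolerance; return how many were removed."""
    cutoff = now - 2 * tolerance
    # (cutoff,) sorts before any (cutoff, record_id), so bisect_left lands on
    # the first entry at or after the cutoff.
    cut = bisect.bisect_left(timeline, (cutoff,))
    for _, record_id in timeline[:cut]:
        index.pop(record_id, None)
    del timeline[:cut]
    return cut

utc = timezone.utc
timeline = [
    (datetime(2024, 2, 1, tzinfo=utc), "old"),
    (datetime(2024, 2, 4, tzinfo=utc), "edge"),
    (datetime(2024, 2, 9, tzinfo=utc), "new"),
]
index: Dict[str, object] = {"old": object(), "edge": object(), "new": object()}

removed = prune_timeline(
    timeline, index, now=datetime(2024, 2, 10, tzinfo=utc), tolerance=timedelta(hours=72)
)
print(removed, [rid for _, rid in timeline])  # 1 ['edge', 'new']
```

A record sitting exactly on the cutoff is kept, which errs on the side of retaining a possible match.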

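Deploy the detector as an embedded library within your expense ingestion DAG, or as a microservice. Because OverlapDuplicateDetector holds _timeline and _index in process memory, a microservice deployment is stateless only if that state is externalized to a shared store. Validate against historical reconciliation reports before production rollout, and enforce schema versioning on ExpenseRecord to prevent silent field drift.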