Dynamic Threshold Tuning for Adaptive Expense Policy Validation

Dynamic threshold tuning replaces static spending limits with per-group boundaries that recalibrate against recent spend history, so an expense pipeline flags genuine outliers without drowning finance ops in false positives. This layer sits inside the Automated Policy Validation & Anomaly Flagging stage: it consumes records that have already been deduplicated and temporally validated, computes an adaptive per-group ceiling, and emits a deterministic routing decision plus an explainable deviation score. It deliberately delegates the fixed, non-negotiable ceilings to the Spending Cap Hierarchies layer, location-aware allowances to Per Diem Rate Structuring, and category assignment to Merchant Category Code Routing — its only job is the statistical band between the policy floor and the hard cap.

The engineering problem is narrow but subtle: a threshold that adapts is a threshold that can be manipulated, starved, or destabilized. This page covers how to compute adaptive baselines that are memory-bounded and reproducible, how to clamp them against deterministic guardrails so adaptation can never relax a compliance ceiling, and how to deploy, test, and roll back the component without breaking the audit trail.

Problem Framing & Root Causes

Static caps fail in two directions at once: set them low and every legitimate conference week generates a review queue; set them high and steady policy drift goes unnoticed until month-end reconciliation. Adaptive thresholds fix that, but they introduce their own failure modes that a production implementation must name and neutralize.

Baseline poisoning: if the historical window that feeds a baseline includes unreviewed or fraudulent spend, the percentile drifts upward and the band widens to admit the very behavior it should catch. Baselines must be computed only from records that already cleared upstream validation.
Cold-start sparsity: new departments, newly onboarded merchant category codes, or rare travel corridors have too few samples to support a stable percentile. Extrapolating from a thin sample produces a band that is either absurdly tight or wide open.
Seasonal drift: end-of-quarter travel spikes and annual conference cycles inflate the rolling window, quietly relaxing the ceiling exactly when scrutiny should increase. A baseline that refreshes per-transaction chases this noise instead of dampening it.
Boundary clustering: when many submissions land just under the active threshold, that is a signal the band is misaligned (or that spend is being structured to evade it), not that everything is compliant.

Each of these is a data-quality or cadence problem, not a modeling problem — which is why the mitigation is a strict input contract, a sample-count guardrail, and a fixed refresh cadence rather than a more elaborate estimator.
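Boundary clustering is the one failure mode here that can be measured directly from evaluation output. The following is a minimal sketch, assuming per-group deviation ratios (amount divided by effective threshold) are already available; the name `boundary_cluster_share` and the 5% band width are illustrative assumptions, not part of the engine.

```python
from typing import Iterable, Optional


def boundary_cluster_share(ratios: Iterable[Optional[float]], band: float = 0.05) -> float:
    """Fraction of ratios that land in [1 - band, 1.0], just under the threshold.

    `band` is a hypothetical tuning knob; choose it from observed data.
    """
    # Drop None and NaN (NaN is the only float that is not equal to itself).
    values = [r for r in ratios if r is not None and r == r]
    if not values:
        return 0.0
    near = sum(1 for r in values if (1.0 - band) <= r <= 1.0)
    return near / len(values)


# Six submissions, four of them sitting within 5% under the threshold.
share = boundary_cluster_share([0.40, 0.96, 0.97, 0.98, 0.99, 1.20])
```

A share that stays high across several refresh cycles suggests a misaligned band or structured spend; what counts as "high" is a policy decision and is deliberately left out of the sketch.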

Design Constraints & Prerequisites

This component is a mid-pipeline stage with a hard input contract. It must never see a raw payload; it must never be the first line of defense against duplicates or bad dates. Upstream stages guarantee the following before a record reaches threshold evaluation:

Field	Type	Guaranteed by	Notes
`transaction_id`	`str`	Ingestion	Stable idempotency key
`employee_id`	`str`	Ingestion	Maps to a department/cost center
`department`	`str`	Policy taxonomy	Grouping dimension for baselines
`merchant_category_code`	`str`	MCC Routing	Grouping dimension for baselines
`amount_minor`	`int`	Ingestion	Integer minor units (cents); never a float
`is_deduplicated`	`bool`	Duplicate Receipt Detection	Repeat submissions already suppressed
`date_state`	`str`	Date Window Validation Logic	Only `VALID`/`GRACE_PERIOD_APPLIED` proceed

Storing money as an integer count of minor units (never a float) is non-negotiable: floating-point drift silently corrupts percentile and cap comparisons and breaks reconciliation totals downstream. The other constraints are operational:
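To make the float problem concrete, here is a small sketch of converting a decimal string to integer minor units with the standard-library `decimal` module. The helper name `to_minor_units` and its `exponent` parameter are assumptions for illustration; in this pipeline the conversion belongs to Ingestion, not to this stage.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Convert a decimal string such as "19.99" to integer minor units.

    `exponent` is the number of minor-unit digits (2 for cents). It is an
    assumption here; a real ledger would derive it from the currency.
    """
    scaled = Decimal(amount) * (Decimal(10) ** exponent)
    return int(scaled.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Binary floats cannot represent most decimal fractions exactly.
float_total_matches = (0.1 + 0.2) == 0.3

# Integer minor units add exactly.
int_total_matches = (
    to_minor_units("0.10") + to_minor_units("0.20") == to_minor_units("0.30")
)
```

Here `float_total_matches` is `False` and `int_total_matches` is `True`, which is the whole argument for carrying `amount_minor` as an `int` through percentile and cap comparisons.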

Memory: enterprise ledgers exceed available RAM at month-end close. Baseline computation must stream the historical window rather than loading it whole; evaluation must process submissions in bounded batches. When the historical read itself is the bottleneck, offload it to the Async Batch Processing layer and hand this stage the pre-aggregated samples.
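Determinism: given the same baselines and config, evaluation must reproduce the same thresholds, deviation ratios, and routing decisions exactly, so an auditor can replay any decision. The only field allowed to differ between replays is the audit record's emission timestamp.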
Compliance: every decision emits an append-only audit record. The retention and immutability requirements are owned by the Security & Compliance Boundaries layer; this stage must produce records that satisfy them.
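The input contract can be enforced defensively at the stage boundary. The sketch below assumes a pandas batch carrying the `is_deduplicated` and `date_state` fields from the table above; the function name `split_by_contract` and the `EXPIRED` state in the example are hypothetical, and what happens to rejected rows is left open.

```python
from typing import Tuple

import pandas as pd

# The only date states the contract allows through to threshold evaluation.
ALLOWED_DATE_STATES = {"VALID", "GRACE_PERIOD_APPLIED"}


def split_by_contract(batch: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Return (accepted, rejected) rows according to the upstream guarantees."""
    ok = batch["is_deduplicated"].eq(True) & batch["date_state"].isin(ALLOWED_DATE_STATES)
    return batch[ok], batch[~ok]


batch = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t3"],
    "is_deduplicated": [True, False, True],
    # "EXPIRED" is a made-up state used only to show a rejected row.
    "date_state": ["VALID", "VALID", "EXPIRED"],
})
accepted, rejected = split_by_contract(batch)
```

Only `t1` is accepted in this example. A non-empty `rejected` frame means an upstream stage broke its guarantee, which is worth surfacing rather than silently dropping.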

Production Python Implementation

The implementation has three parts: a streaming accumulator that gathers per-group samples from the historical window, a baseline builder that turns those samples into clamped per-group thresholds, and a vectorized evaluator that assigns a routing decision and emits an audit record for every submission. All money is handled in integer minor units; the deviation ratio is unitless.

from __future__ import annotations

import json
import logging
from collections import defaultdict
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Dict, Iterator, List, Tuple

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("expense.threshold_engine")

# (merchant_category_code, department) — the grouping dimension for a baseline.
GroupKey = Tuple[str, str]


@dataclass(frozen=True)
class ThresholdConfig:
    """Tunable settings for adaptive threshold evaluation.

    Frozen so a config instance is hashable and safe to reuse across the
    baseline-build and evaluation passes without defensive copying.
    """
    rolling_percentile: float = 0.85       # percentile that anchors the baseline
    min_samples: int = 30                  # cold-start guardrail
    floor_minor_units: int = 5_00          # $5.00 absolute floor
    hard_cap_minor_units: int = 5_000_00   # $5,000.00 deterministic ceiling
    soft_flag_multiplier: float = 1.15
    hard_block_multiplier: float = 1.35


def accumulate_group_samples(
    chunks: Iterator[pd.DataFrame],
    config: ThresholdConfig,
) -> Dict[GroupKey, np.ndarray]:
    """Stream the historical window and accumulate per-group amount samples.

    Memory holds one integer per retained row rather than the full row, so it
    still grows with the size of the window. For very large or
    high-cardinality ledgers, swap the list accumulator for reservoir sampling
    or a t-digest to bound memory regardless of window size.
    """
    buckets: Dict[GroupKey, List[int]] = defaultdict(list)
    for chunk in chunks:
        chunk = chunk.copy()
        chunk["amount_minor"] = pd.to_numeric(chunk["amount_minor"], errors="coerce")
        chunk = chunk.dropna(subset=["amount_minor", "merchant_category_code", "department"])
        chunk["amount_minor"] = chunk["amount_minor"].astype("int64")
        grouped = chunk.groupby(["merchant_category_code", "department"], sort=False)
        for (mcc, dept), grp in grouped:
            buckets[(str(mcc), str(dept))].extend(grp["amount_minor"].tolist())
    return {k: np.asarray(v, dtype=np.int64) for k, v in buckets.items()}


def compute_dynamic_baselines(
    samples: Dict[GroupKey, np.ndarray],
    config: ThresholdConfig,
) -> Dict[GroupKey, int]:
    """Turn per-group samples into clamped per-group thresholds (minor units).

    Groups below ``min_samples`` are omitted so the evaluator falls back to a
    deterministic floor instead of extrapolating from a thin sample. Every
    baseline is clamped into [floor, hard_cap] so adaptation can never widen a
    band past the policy ceiling.
    """
    baselines: Dict[GroupKey, int] = {}
    skipped = 0
    for key, arr in samples.items():
        if arr.size < config.min_samples:
            skipped += 1
            continue
        raw = float(np.percentile(arr, config.rolling_percentile * 100.0))
        clamped = int(min(max(raw, config.floor_minor_units), config.hard_cap_minor_units))
        baselines[key] = clamped
    logger.info(json.dumps({
        "event": "baselines_built",
        "groups_kept": len(baselines),
        "groups_cold_start": skipped,
        "config": asdict(config),
    }))
    return baselines


def evaluate_batch(
    batch: pd.DataFrame,
    baselines: Dict[GroupKey, int],
    config: ThresholdConfig,
) -> pd.DataFrame:
    """Assign a routing decision and deviation score to every submission.

    Vectorized over the batch. Records with a missing amount fall through to
    NEEDS_DATA rather than silently auto-approving.
    """
    batch = batch.copy()
    batch["amount_minor"] = pd.to_numeric(batch["amount_minor"], errors="coerce")

    keys = list(zip(batch["merchant_category_code"].astype(str),
                    batch["department"].astype(str)))
    adaptive = np.array([baselines.get(k, config.floor_minor_units) for k in keys],
                        dtype=np.int64)
    # The deterministic ceiling always wins over an adaptive baseline.
    batch["effective_threshold"] = np.minimum(adaptive, config.hard_cap_minor_units)

    batch["deviation_ratio"] = (
        batch["amount_minor"].astype("float64") / batch["effective_threshold"]
    )

    ratio = batch["deviation_ratio"]
    conditions = [
        ratio.isna(),
        ratio <= 1.0,
        ratio <= config.soft_flag_multiplier,
        ratio <= config.hard_block_multiplier,
    ]
    choices = ["NEEDS_DATA", "AUTO_APPROVE", "SOFT_FLAG", "MANAGER_REVIEW"]
    batch["routing_decision"] = np.select(conditions, choices, default="HARD_BLOCK")
    return batch


def emit_audit_record(row: pd.Series, config: ThresholdConfig) -> None:
    """Emit one append-only, replayable decision record.

    Includes the exact threshold and config that produced the decision so an
    auditor can reconstruct the outcome without the original baselines.
    """
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "transaction_id": str(row["transaction_id"]),
        "employee_id": str(row["employee_id"]),
        "merchant_category_code": str(row["merchant_category_code"]),
        "department": str(row["department"]),
        "amount_minor": None if pd.isna(row["amount_minor"]) else int(row["amount_minor"]),
        "effective_threshold_minor": int(row["effective_threshold"]),
        "deviation_ratio": None if pd.isna(row["deviation_ratio"]) else round(float(row["deviation_ratio"]), 4),
        "routing_decision": row["routing_decision"],
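        "percentile": config.rolling_percentile,
        # The multipliers decide the routing band, so replay needs them too.
        "soft_flag_multiplier": config.soft_flag_multiplier,
        "hard_block_multiplier": config.hard_block_multiplier,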
        "compliance_framework": "SOX_404",
    }, default=str))


def run_stage(
    history_chunks: Iterator[pd.DataFrame],
    submission_batches: Iterator[pd.DataFrame],
    config: ThresholdConfig,
) -> Iterator[pd.DataFrame]:
    """Build baselines once, then evaluate each submission batch as a generator."""
    samples = accumulate_group_samples(history_chunks, config)
    baselines = compute_dynamic_baselines(samples, config)
    for batch in submission_batches:
        evaluated = evaluate_batch(batch, baselines, config)
        for _, row in evaluated.iterrows():
            emit_audit_record(row, config)
        yield evaluated

Wire the historical read with pandas.read_csv(path, chunksize=...) so the window streams instead of loading whole; see the official pandas guide on iterating through files chunk-by-chunk. The run_stage generator keeps only one submission batch resident at a time, so peak memory is bounded by the batch size plus the accumulated per-group samples.
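The docstring of accumulate_group_samples suggests reservoir sampling when the per-group sample lists themselves grow too large. As a rough sketch of that option, the class below implements the classic Algorithm R; the `Reservoir` name, its `capacity`, and the seeding scheme are assumptions, and the percentile computed from a reservoir is an approximation of the full-window percentile.

```python
import random
from typing import List


class Reservoir:
    """Fixed-size uniform random sample of a stream (Algorithm R)."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.seen = 0                      # total values offered, not just kept
        self.items: List[int] = []
        self._rng = random.Random(seed)    # seeded so a rebuild is repeatable

    def add(self, value: int) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(value)
            return
        # Keep the new value with probability capacity / seen by drawing a
        # slot in [0, seen) and replacing only if it falls inside the reservoir.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = value


res = Reservoir(capacity=100, seed=42)
for amount in range(10_000):
    res.add(amount)
```

With one reservoir per group, memory is bounded by groups times `capacity`. The cold-start check should then compare `seen`, not `len(items)`, against `min_samples`, and replays are repeatable only for the same seed, input order, and Python version.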

Configuration Reference

Every tunable lives on ThresholdConfig so policy is expressed as code and versioned in source control, not hard-coded in the engine.

Key	Type	Default	Rationale
`rolling_percentile`	`float`	`0.85`	Anchors the baseline at the 85th percentile of recent same-group spend — high enough to permit normal variance, low enough to surface outliers.
`min_samples`	`int`	`30`	Cold-start guardrail: below this count a group is omitted and evaluation falls back to `floor_minor_units`.
`floor_minor_units`	`int`	`500`	Absolute floor so a sparse or low-spend group cannot produce a near-zero band that flags everything.
`hard_cap_minor_units`	`int`	`500000`	Deterministic ceiling; the adaptive baseline is clamped to this so tuning can never relax a compliance limit. Mirror the value owned by Spending Cap Hierarchies.
`soft_flag_multiplier`	`float`	`1.15`	Records above 1.0x and up to 1.15x of the effective threshold get a soft flag but no human stop, which absorbs benign variance.
`hard_block_multiplier`	`float`	`1.35`	Records above 1.35x of the effective threshold route straight to `HARD_BLOCK`; records above 1.15x and up to 1.35x route to `MANAGER_REVIEW`.

Pin numpy and pandas to explicit patch versions in your lockfile: np.percentile interpolation defaults and groupby ordering are behaviorally stable within a minor series but should never float across a deploy, or two runs of the same ledger can disagree on a boundary case. Treat any change to rolling_percentile, the multipliers, or the cap as a policy change — version it, and snapshot the config alongside the baselines it produced.
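One way to snapshot a config together with the baselines it produced is to hash a canonical serialization of both. This is a sketch under assumptions: the function name `snapshot_hash`, the `|` key separator (which presumes neither grouping value contains that character), and the choice of SHA-256 over canonical JSON are illustrative, not a prescribed format.

```python
import hashlib
import json
from typing import Dict, Tuple


def snapshot_hash(baselines: Dict[Tuple[str, str], int], config: dict) -> str:
    """SHA-256 over a canonical JSON rendering of baselines plus config."""
    payload = {
        # Tuple keys are not valid JSON object keys, so join them into strings.
        "baselines": {f"{mcc}|{dept}": value for (mcc, dept), value in baselines.items()},
        "config": config,
    }
    # sort_keys and fixed separators make the byte output independent of
    # dict insertion order and whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


config = {"rolling_percentile": 0.85, "min_samples": 30}
live_hash = snapshot_hash({("5812", "sales"): 10_000, ("7011", "ops"): 25_000}, config)
reordered_hash = snapshot_hash({("7011", "ops"): 25_000, ("5812", "sales"): 10_000}, config)
```

The two hashes are equal because insertion order is normalized away, while any change to a baseline or a config value yields a different digest. In the engine, `asdict(config)` would supply the config dictionary.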

Validation & Testing

Because a misaligned band silently changes who gets reimbursed, the routing math needs assertion-level tests, not smoke tests. Cover the guardrails explicitly: the clamp to the hard cap, the cold-start fallback, and the missing-amount path.

import numpy as np
import pandas as pd

from expense.threshold_engine import (
    ThresholdConfig,
    compute_dynamic_baselines,
    evaluate_batch,
)


def test_baseline_is_clamped_to_hard_cap():
    cfg = ThresholdConfig(min_samples=3, hard_cap_minor_units=10_000)
    # 85th percentile of this sample is 111,000 (NumPy's default linear
    # interpolation), well above the 10k cap.
    samples = {("5812", "sales"): np.array([50_000, 80_000, 100_000, 120_000])}
    baselines = compute_dynamic_baselines(samples, cfg)
    assert baselines[("5812", "sales")] == 10_000  # cap wins


def test_cold_start_group_is_omitted():
    cfg = ThresholdConfig(min_samples=30)
    samples = {("7011", "new-dept"): np.array([12_000, 13_000])}  # only 2 rows
    assert compute_dynamic_baselines(samples, cfg) == {}


def test_routing_bands_and_missing_amount():
    cfg = ThresholdConfig(floor_minor_units=10_000)
    baselines = {("5812", "sales"): 10_000}
    batch = pd.DataFrame([
        {"transaction_id": "t1", "employee_id": "e1", "merchant_category_code": "5812",
         "department": "sales", "amount_minor": 9_000},    # under -> approve
        {"transaction_id": "t2", "employee_id": "e1", "merchant_category_code": "5812",
         "department": "sales", "amount_minor": 11_000},   # 1.10x -> soft flag
        {"transaction_id": "t3", "employee_id": "e1", "merchant_category_code": "5812",
         "department": "sales", "amount_minor": 20_000},   # 2.0x -> hard block
        {"transaction_id": "t4", "employee_id": "e1", "merchant_category_code": "5812",
         "department": "sales", "amount_minor": None},     # missing -> needs data
    ])
    out = evaluate_batch(batch, baselines, cfg)
    assert out["routing_decision"].tolist() == [
        "AUTO_APPROVE", "SOFT_FLAG", "HARD_BLOCK", "NEEDS_DATA",
    ]

Add fixtures for the failure modes named earlier: a poisoned historical window (inject an un-reviewed 100x outlier and assert the clamp still bounds the band), a boundary-clustering batch (many rows at 0.98x and assert a monitoring hook fires), and a group whose sample count sits exactly at min_samples (off-by-one at the fallback edge). Run the same fixed batch through two independent invocations and assert the audit records are identical once the emission timestamp is excluded, since emit_audit_record stamps each record with the current UTC time.
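A poisoned-window fixture might look like the sketch below. It inlines the percentile-and-clamp step from compute_dynamic_baselines so the example is self-contained; the helper name `clamped_p85` and the sample sizes are assumptions. Note that a single outlier rarely moves an 85th percentile, so the fixture injects enough outliers (20% of the window) to shift it.

```python
import numpy as np

# ThresholdConfig defaults, in minor units.
FLOOR, HARD_CAP = 500, 500_000


def clamped_p85(samples: np.ndarray) -> int:
    """Inlined copy of the percentile-then-clamp step, for illustration only."""
    raw = float(np.percentile(samples, 85.0))
    return int(min(max(raw, FLOOR), HARD_CAP))


clean = np.full(40, 20_000, dtype=np.int64)                 # forty $200.00 records
outliers = np.full(10, 2_000_000, dtype=np.int64)           # ten records at 100x
poisoned = np.concatenate([clean, outliers])

clean_baseline = clamped_p85(clean)
poisoned_baseline = clamped_p85(poisoned)
```

The clean baseline is 20,000 and the poisoned one is pinned at the 500,000 cap. The clamp holds, but the band is still 25 times wider than it should be, which is why the input contract, not the clamp, is the primary defense against poisoning.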

Operational Runbook

Deploy, monitor, and roll back adaptive thresholds as a single versioned unit — the config, the baselines it produced, and the code that read them.

Refresh on a fixed cadence, not per transaction. Rebuild baselines weekly (every two weeks for low-volume groups) from records that already cleared Duplicate Receipt Detection and Date Window Validation Logic. A fixed cadence dampens seasonal noise; real-time refresh chases it.
Snapshot and hash every baseline set. Persist the baseline map plus the ThresholdConfig under a content hash before it goes live, so any past decision can be replayed exactly.
Canary before full cutover. Run the new baselines in shadow against a recent day of traffic and diff routing decisions against the live set. Investigate any group whose auto-approval rate moves by more than a few percentage points in either direction.
Alert on the leading indicators, not just error rates. Page when the cold-start group count climbs (upstream taxonomy churn), when deviation ratios cluster within a few percent of a boundary (misaligned band or structured spend), or when the HARD_BLOCK rate for any group spikes against its trailing average.
Roll back by pointer swap. Because baselines are immutable snapshots, rollback is repointing the engine at the previous hash — no recomputation, no partial state. Keep at least the last two snapshots hot.
Reconcile against the audit trail. After each refresh, confirm every emitted record carries the threshold and config that produced it, satisfying the retention rules owned by Security & Compliance Boundaries.
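The canary step can be reduced to a per-group comparison of auto-approval rates between the live and shadow decision sets. This is a minimal sketch, assuming decisions arrive as (group, routing_decision) pairs; the function names and the 0.03 default for `max_shift` are stand-ins for "a few points", not values from the policy.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

GroupKey = Tuple[str, str]        # (merchant_category_code, department)
Decision = Tuple[GroupKey, str]   # (group, routing_decision)


def auto_approve_rate(decisions: Iterable[Decision]) -> Dict[GroupKey, float]:
    """Share of AUTO_APPROVE decisions per group."""
    totals: Dict[GroupKey, int] = defaultdict(int)
    approved: Dict[GroupKey, int] = defaultdict(int)
    for group, decision in decisions:
        totals[group] += 1
        if decision == "AUTO_APPROVE":
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def groups_to_investigate(
    live: Iterable[Decision],
    shadow: Iterable[Decision],
    max_shift: float = 0.03,  # hypothetical tolerance
) -> Dict[GroupKey, float]:
    """Groups whose auto-approval rate moved by more than max_shift."""
    live_rate = auto_approve_rate(live)
    shadow_rate = auto_approve_rate(shadow)
    shifts: Dict[GroupKey, float] = {}
    # Only groups present in both sets can be compared.
    for group in live_rate.keys() & shadow_rate.keys():
        shift = shadow_rate[group] - live_rate[group]
        if abs(shift) > max_shift:
            shifts[group] = shift
    return shifts


SALES: GroupKey = ("5812", "sales")
OPS: GroupKey = ("7011", "ops")
live = ([(SALES, "AUTO_APPROVE")] * 8 + [(SALES, "SOFT_FLAG")] * 2
        + [(OPS, "AUTO_APPROVE")] * 5 + [(OPS, "MANAGER_REVIEW")] * 5)
shadow = ([(SALES, "AUTO_APPROVE")] * 8 + [(SALES, "SOFT_FLAG")] * 2
          + [(OPS, "AUTO_APPROVE")] * 7 + [(OPS, "MANAGER_REVIEW")] * 3)
flagged = groups_to_investigate(live, shadow)
```

Only the ops group is flagged here, with its auto-approval rate rising from 50% to 70% under the shadow baselines. Groups that appear in only one set are skipped by this sketch and would need separate handling.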