Core Policy Architecture & Taxonomy Design

Manual reimbursement review does not scale: every subjective judgment an AP analyst makes is unreproducible, unauditable, and impossible to replay when an external auditor asks why a 2023 expense passed. Core Policy Architecture & Taxonomy Design solves this by compiling your reimbursement handbook into a deterministic, cryptographically traceable rule engine — one where identical inputs always yield identical audit outcomes regardless of host, worker count, or execution order. This is the control plane of an automated expense pipeline: it defines how policies are modeled as data, how raw line items are normalized into canonical categories, and how cascading limits resolve to a single, defensible verdict per transaction.

This guide is written for finance-operations engineers and AP platform teams who need to move policy logic out of spreadsheets and stored procedures and into version-controlled, testable code. Within the broader Expense Audit Automation Guide, this is the layer that consumes structured records from the Receipt Ingestion & OCR Data Extraction framework and hands verdicts downstream to Automated Policy Validation & Anomaly Flagging. Everything here optimizes for one property above all: deterministic execution that produces a tamper-evident audit trail suitable for SOX and internal-audit scrutiny.

Deterministic control plane: three feeds converge into a single engine — schema validation, taxonomy normalization, then priority rule traversal — emitting a reproducible SHA-256 audit manifest. Four sub-domains supply the limits the rules carry.

Foundational Architecture & Data Modeling

Policies must be serialized as structured data, never as prose documents or wiki pages. A production engine relies on a normalized schema that separates three concerns: the rule definition (what limit applies), the context selector (when it applies — geography, department, role), and the enforcement disposition (what happens on breach). Collapsing these into one flat record is the single most common cause of unmaintainable policy code, because it forces a schema migration every time a new dimension is added.

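Every policy object is validated with Pydantic before it ever reaches the evaluator. Validation at the boundary means the engine can assume type safety internally: no defensive if isinstance(...) branches, and no ad hoc coercions scattered through the evaluator. Note that Pydantic's default (lax) mode still converts compatible inputs, such as a numeric string into a float, so coercion happens once at load time rather than not at all; enable strict mode on the models if such inputs should be rejected instead. The models below are the canonical data contract for the entire control plane.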

from __future__ import annotations

import hashlib
import json
import logging
from datetime import date, datetime, timezone
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, field_validator

logger = logging.getLogger("policy.schema")


class ExpenseCategory(str, Enum):
    """Canonical expense taxonomy. Raw MCCs and OCR strings map onto these."""

    MEALS = "meals"
    LODGING = "lodging"
    TRANSPORT = "transport"
    SOFTWARE = "software"
    MISC = "misc"


class Disposition(str, Enum):
    """What the engine does when a rule matches and the amount breaches it."""

    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"
    REVIEW_REQUIRED = "REVIEW_REQUIRED"


class PolicyRule(BaseModel):
    """A single enforceable constraint. Immutable once loaded into the engine."""

    model_config = {"frozen": True}

    rule_id: str
    category: ExpenseCategory
    base_limit: float = Field(gt=0.0)
    geo_modifier: float = Field(default=1.0, ge=0.0)
    role_modifier: float = Field(default=1.0, ge=0.0)
    # Higher priority wins. Regulatory caps sit at 90-100; role overrides at 1-50.
    priority: int = Field(ge=1, le=100)
    # Optional context selectors; None means "applies to any value on this axis".
    geo_code: Optional[str] = None
    role_level: Optional[str] = None
    effective_from: date

    @field_validator("rule_id")
    @classmethod
    def _rule_id_is_stable(cls, v: str) -> str:
        # rule_id must be deterministic and human-traceable — no UUIDs here.
        if not v or " " in v:
            raise ValueError("rule_id must be a stable, space-free identifier")
        return v


class ExpenseLineItem(BaseModel):
    """One normalized line item arriving from the ingestion/taxonomy layer."""

    transaction_id: str
    employee_id: str
    category: ExpenseCategory
    amount: float = Field(ge=0.0)
    currency: str = "USD"
    geo_code: str
    role_level: str
    incurred_on: date
    receipt_hash: Optional[str] = None

Two design decisions here are load-bearing. First, PolicyRule is frozen: once a rule set is loaded, no code path can mutate a limit mid-run, which is what guarantees that two workers evaluating the same batch reach the same verdict. Second, rule_id is deliberately not a UUID — auditors trace violations by rule identifier, so LODGING_EXEC_EMEA_2026 is worth far more than a random hex string.

The taxonomy layer sits upstream of these models and is where most real-world accuracy is won or lost. Merchant category codes, OCR-extracted descriptors, and free-text employee memos must all collapse onto the ExpenseCategory enum before evaluation. That normalization is deep enough to warrant its own treatment in Expense Category Taxonomies, which defines the three-tier hierarchy and the deterministic string-to-node mapping the engine depends on. Ambiguous descriptors that resolve to more than one node are never guessed at — they are routed to the fallback path described later, preserving determinism.
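As an illustration of the "never guess" rule, here is a minimal sketch of a deterministic descriptor-to-category mapping. The keyword table and the classify function are hypothetical and deliberately flat; the three-tier hierarchy itself is defined in Expense Category Taxonomies and is not reproduced here.

```python
from enum import Enum
from typing import Optional


class ExpenseCategory(str, Enum):
    MEALS = "meals"
    LODGING = "lodging"
    TRANSPORT = "transport"
    SOFTWARE = "software"
    MISC = "misc"


# Hypothetical keyword table. A real deployment would load this from the
# versioned taxonomy rather than hard-code it.
KEYWORDS: dict[str, ExpenseCategory] = {
    "hotel": ExpenseCategory.LODGING,
    "inn": ExpenseCategory.LODGING,
    "restaurant": ExpenseCategory.MEALS,
    "cafe": ExpenseCategory.MEALS,
    "taxi": ExpenseCategory.TRANSPORT,
    "airline": ExpenseCategory.TRANSPORT,
}


def classify(descriptor: str) -> Optional[ExpenseCategory]:
    """Return the single matching category, or None when the descriptor is
    unmapped or ambiguous. The caller routes None to manual review."""
    tokens = descriptor.lower().split()
    matches = {KEYWORDS[t] for t in tokens if t in KEYWORDS}
    return next(iter(matches)) if len(matches) == 1 else None
```

With this table, "Grand Hotel Lisbon" resolves to lodging, while "Hotel Restaurant Roma" hits two categories and comes back as None instead of being assigned to whichever keyword happened to be checked first.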

Pipeline Stage Map

Policy evaluation is one stage in a larger flow, and its correctness depends on strict ordering upstream. Introducing the taxonomy before OCR confidence is scored amplifies noise; running caps before per-diem normalization produces contradictory verdicts. The canonical sequence and its failure-mode ownership:

#	Stage	Responsibility	Owns failure mode
1	Ingestion	Accept corporate-card, ERP, and upload payloads; assign `transaction_id`	Duplicate/replayed payloads
2	OCR extraction	Text + confidence scoring on receipt imagery	Faded/low-confidence text
3	Field normalization	Parse amounts, currencies, dates into typed fields	FX and timezone drift
4	Taxonomy classification	Map raw descriptors → `ExpenseCategory`	MCC ambiguity, unmapped merchants
5	Policy evaluation	Priority rule traversal → `AuditEntry` per item	Contradictory/overlapping rules
6	Routing & audit	Emit verdicts, sign manifest, route exceptions	Non-reproducible approvals

Stages 1–4 are owned by the ingestion and OCR domain; the taxonomy classifier only activates once confidence thresholds (typically ≥ 0.85) are met and monetary and temporal fields are normalized. This control plane owns stages 5 and 6. Records that fail an upstream stage never reach the evaluator — they are quarantined by the ingestion layer, which keeps the engine’s input contract clean and its verdicts reproducible.

Stages 1–4 belong to the ingestion and OCR domain; a confidence gate (≥ 0.85, after field normalization) admits clean records to classification and quarantines the rest. The policy engine owns stages 5–6 — policy evaluation and signed routing.
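The confidence gate can be sketched as a pure partition function. ExtractedRecord and its field names are assumptions made for this example; the real record shape belongs to the ingestion and OCR domain.

```python
from dataclasses import dataclass

CONFIDENCE_GATE = 0.85  # threshold quoted in the text; tune per deployment


@dataclass(frozen=True)
class ExtractedRecord:
    """Hypothetical stand-in for one OCR output record."""

    transaction_id: str
    ocr_confidence: float
    fields_normalized: bool


def gate(
    records: list[ExtractedRecord],
) -> tuple[list[ExtractedRecord], list[ExtractedRecord]]:
    """Split records into (admitted, quarantined), preserving input order."""
    admitted: list[ExtractedRecord] = []
    quarantined: list[ExtractedRecord] = []
    for rec in records:
        ok = rec.fields_normalized and rec.ocr_confidence >= CONFIDENCE_GATE
        (admitted if ok else quarantined).append(rec)
    return admitted, quarantined
```

Because the function keeps input order and holds no state, running it twice on the same batch yields the same two lists, which keeps the evaluator's input reproducible.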

Core Algorithm: The Deterministic Evaluation Engine

This is the heart of the control plane. The engine loads an immutable rule set, sorts it once by priority, and evaluates each line item against the applicable rules in a fixed order. It emits one AuditEntry per rule touched, short-circuits on the first hard failure so that the highest-priority breach dictates the final state, and routes any unmatched item to a human-in-the-loop queue rather than silently passing it. Every entry carries explicit audit metadata — the exact rule applied, the computed effective limit, and a UTC timestamp — so the trail is self-describing.

class AuditEntry(BaseModel):
    """Immutable record of a single rule evaluation against one line item."""

    # Frozen, like PolicyRule, so an emitted entry cannot be altered in memory.
    model_config = {"frozen": True}

    audit_id: str
    timestamp: str
    transaction_id: str
    employee_id: str
    rule_applied: str
    effective_limit: float
    submitted_amount: float
    status: Disposition
    violation_details: Optional[str] = None


class PolicyEngine:
    """Deterministic, stateless-per-item expense policy evaluator.

    The engine is constructed once from an immutable rule set and reused
    across a batch. It never mutates rules and never depends on wall-clock
    ordering, so N workers evaluating the same batch produce identical
    audit manifests.
    """

    #: Fraction of the effective limit above which an item is flagged WARN.
    WARN_THRESHOLD: float = 0.90

    def __init__(self, rules: list[PolicyRule]) -> None:
        # Deterministic ordering: priority desc, then rule_id asc as a stable
        # tie-breaker so equal-priority rules never reorder between runs.
        self.rules: tuple[PolicyRule, ...] = tuple(
            sorted(rules, key=lambda r: (-r.priority, r.rule_id))
        )
        self.audit_trail: list[AuditEntry] = []
        logger.info(
            "policy_engine_loaded",
            extra={"rule_count": len(self.rules)},
        )

    def _rule_applies(self, rule: PolicyRule, item: ExpenseLineItem) -> bool:
        """A rule applies when its category matches, it is in effect on the
        incurred date, and every set context selector matches the item."""
        if rule.category != item.category:
            return False
        if rule.effective_from > item.incurred_on:
            return False
        if rule.geo_code is not None and rule.geo_code != item.geo_code:
            return False
        if rule.role_level is not None and rule.role_level != item.role_level:
            return False
        return True

    def _make_entry(
        self,
        item: ExpenseLineItem,
        rule_id: str,
        effective_limit: float,
        status: Disposition,
        details: Optional[str],
    ) -> AuditEntry:
        # audit_id is derived, not random, so the manifest hash is reproducible.
        seed = f"{item.transaction_id}:{rule_id}:{effective_limit:.4f}"
        return AuditEntry(
            audit_id=hashlib.sha256(seed.encode()).hexdigest()[:16],
            timestamp=datetime.now(timezone.utc).isoformat(),
            transaction_id=item.transaction_id,
            employee_id=item.employee_id,
            rule_applied=rule_id,
            effective_limit=round(effective_limit, 2),
            submitted_amount=item.amount,
            status=status,
            violation_details=details,
        )

    def evaluate(self, item: ExpenseLineItem) -> list[AuditEntry]:
        """Evaluate one line item and return the ordered audit entries."""
        line_audit: list[AuditEntry] = []

        for rule in self.rules:
            if not self._rule_applies(rule, item):
                continue

            effective_limit = rule.base_limit * rule.geo_modifier * rule.role_modifier
            status = Disposition.PASS
            details: Optional[str] = None

            if item.amount > effective_limit:
                status = Disposition.FAIL
                details = f"Exceeds limit by {item.amount - effective_limit:.2f} {item.currency}"
            elif item.amount > effective_limit * self.WARN_THRESHOLD:
                status = Disposition.WARN
                details = "Within 10% of the policy threshold"

            entry = self._make_entry(
                item, rule.rule_id, effective_limit, status, details
            )
            line_audit.append(entry)

            if status is Disposition.FAIL:
                # Highest-priority breach wins; stop so lower rules cannot
                # override a hard failure. This is the determinism guarantee.
                logger.warning(
                    "policy_violation",
                    extra={
                        "transaction_id": item.transaction_id,
                        "rule_applied": rule.rule_id,
                        "effective_limit": entry.effective_limit,
                        "submitted_amount": item.amount,
                    },
                )
                break

        if not line_audit:
            line_audit.append(self._apply_fallback(item))

        self.audit_trail.extend(line_audit)
        return line_audit

    def _apply_fallback(self, item: ExpenseLineItem) -> AuditEntry:
        """No rule matched — never auto-approve. Route to manual review with
        explicit routing metadata so the taxonomy gap can be resolved."""
        logger.info(
            "fallback_unmapped",
            extra={"transaction_id": item.transaction_id, "category": item.category},
        )
        return self._make_entry(
            item,
            rule_id="FALLBACK_UNMAPPED",
            effective_limit=0.0,
            status=Disposition.REVIEW_REQUIRED,
            details="No matching policy rule; routed to AP manual review",
        )

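    def generate_audit_manifest(self) -> str:
        """SHA-256 over the sorted audit trail: the point-in-time proof."""
        # Exclude the wall-clock timestamp and sort the entries so the digest
        # depends only on the evaluated data, not on when or in what order
        # the batch ran. Timestamps stay in the stored entries themselves.
        entries = sorted(
            (e.model_dump(exclude={"timestamp"}) for e in self.audit_trail),
            key=lambda d: (d["transaction_id"], d["audit_id"]),
        )
        payload = json.dumps(entries, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()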

The _rule_applies selector is where the multi-dimensional matching happens, and it is intentionally explicit rather than clever: category, effective date, then each optional context axis. Because rules are sorted by priority descending with rule_id as a stable tie-breaker, the traversal order is fully determined by the rule set; there is no dependency on dict insertion order, thread scheduling, or the wall clock. The audit_id is derived from a hash of the transaction, rule, and effective limit rather than a random UUID, which is a precondition for generate_audit_manifest() producing the same digest on every replay of the same data. The other precondition is that only replay-stable fields feed the digest: the wall-clock timestamp on each AuditEntry differs between runs, so it must stay out of the hashed payload.
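A short check makes the ordering claim concrete. The Rule dataclass below is a cut-down stand-in for PolicyRule (only the two fields the sort uses), and the rule identifiers are invented for the example; the sort key is the one used in PolicyEngine.__init__.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Cut-down stand-in for PolicyRule: only the fields the sort uses."""

    rule_id: str
    priority: int


def traversal_order(rules: list[Rule]) -> list[str]:
    # Same key as PolicyEngine.__init__: priority descending, rule_id ascending.
    ordered = sorted(rules, key=lambda r: (-r.priority, r.rule_id))
    return [r.rule_id for r in ordered]


rules = [
    Rule("MEALS_ROLE_MGR", 40),
    Rule("MEALS_REG_CAP", 95),
    Rule("MEALS_ROLE_EXEC", 40),
    Rule("MEALS_DEFAULT", 10),
]

baseline = traversal_order(rules)

# However the input list is shuffled, the traversal order is unchanged.
for seed in range(20):
    shuffled = rules[:]
    random.Random(seed).shuffle(shuffled)
    assert traversal_order(shuffled) == baseline
```

The two priority-40 rules always appear in rule_id order, which is why the tie-breaker matters: without it, their relative position would depend on the order in which the config happened to be loaded.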

When an item matches no rule, _apply_fallback fires. The critical property is that the fallback disposition is REVIEW_REQUIRED, never PASS: an unmapped merchant or a newly introduced category must always surface to a human, because silently approving it would create exactly the kind of unauditable gap this architecture exists to eliminate. In production these quarantined items feed the same exception queue used by Automated Policy Validation & Anomaly Flagging, so anomaly scoring and unmapped-taxonomy triage share one review surface.

How the sub-domains plug into the engine

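The engine above is deliberately generic about where its limits come from. Alongside the taxonomy layer covered earlier, three sub-domains plug into it: two supply limits and one constrains what the engine may read.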

Location-aware limits are produced by the Per Diem Rate Structuring layer, which resolves federal benchmarks such as the GSA per-diem tables (and corporate regional adjustments) into the geo_modifier a rule carries. Rates are keyed by effective date so a 2025 trip is never judged against 2026 rates.
Cascading ceilings — departmental budgets, project allocations, and role caps that intersect — are modeled by Spending Cap Hierarchies. In practice these form a directed acyclic graph of constraint nodes where a regulatory or executive cap at priority 95 short-circuits a role override at priority 40, which is exactly the break-on-FAIL behavior the engine implements.
Which fields the engine is even allowed to read — receipt imagery, PII, cross-border payloads — is governed by Security & Compliance Boundaries, covered in the compliance section below.

Rules are sorted by priority descending. Traversal walks top to bottom while each amount stays within its limit; the first rule that breaches short-circuits to a hard FAIL — mirroring the engine's break-on-FAIL. Only an item that clears every applicable rule reaches a PASS or WARN verdict.
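The cascade can be worked through with small numbers. This is a simplified sketch, not the engine itself: Cap stands in for a PolicyRule that has already passed _rule_applies, and the rule identifiers and limits are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cap:
    """Stand-in for a PolicyRule that already applies to the item."""

    rule_id: str
    priority: int
    base_limit: float
    geo_modifier: float = 1.0
    role_modifier: float = 1.0

    @property
    def effective_limit(self) -> float:
        return self.base_limit * self.geo_modifier * self.role_modifier


def verdict(
    amount: float, caps: list[Cap], warn_threshold: float = 0.90
) -> list[tuple[str, str]]:
    """Walk caps by priority; return (rule_id, status) pairs, stopping at
    the first FAIL so lower-priority caps cannot override it."""
    trail: list[tuple[str, str]] = []
    for cap in sorted(caps, key=lambda c: (-c.priority, c.rule_id)):
        if amount > cap.effective_limit:
            trail.append((cap.rule_id, "FAIL"))
            break
        warn = amount > cap.effective_limit * warn_threshold
        trail.append((cap.rule_id, "WARN" if warn else "PASS"))
    return trail


caps = [
    # Role override: 200 * 1.5 * 2.0 = 600.00 effective limit.
    Cap("LODGING_ROLE_EXEC", priority=40, base_limit=200.0,
        geo_modifier=1.5, role_modifier=2.0),
    # Regulatory cap: 400.00 effective limit, evaluated first.
    Cap("LODGING_REG_CAP", priority=95, base_limit=400.0),
]
```

A 450.00 claim fails at the priority-95 cap and the role override is never consulted, even though its 600.00 limit would have allowed the amount. A 380.00 claim clears both rules, with a WARN on the regulatory cap because it exceeds 90% of 400.00.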

Auditability, Versioning & Compliance

Deterministic execution is only half the compliance story; the other half is proving, months later, that a given verdict was correct under the rules in force at the time. Three mechanisms deliver that.

Append-only manifests. Every batch run ends by calling generate_audit_manifest(), which hashes the sorted audit trail into a single SHA-256 digest. That digest, together with the serialized entries, is written to an append-only sink — object storage with a WORM (write-once, read-many) retention lock, or a dedicated audit table with no UPDATE/DELETE grants. Tampering with any entry changes the digest, making silent alteration detectable.
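To see why a single digest is enough to detect alteration, consider this stdlib-only sketch. It hashes plain dicts rather than AuditEntry objects, and the manifest_digest helper and sample entries are assumptions made for the example.

```python
import hashlib
import json


def manifest_digest(entries: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of the entries."""
    # Sort entries so the digest does not depend on arrival order, and sort
    # keys so it does not depend on dict construction order.
    canonical = sorted(
        entries, key=lambda e: (e["transaction_id"], e["rule_applied"])
    )
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


entries = [
    {"transaction_id": "T-2", "rule_applied": "MEALS_DEFAULT",
     "status": "PASS", "submitted_amount": 42.0},
    {"transaction_id": "T-1", "rule_applied": "LODGING_REG_CAP",
     "status": "FAIL", "submitted_amount": 450.0},
]
original = manifest_digest(entries)

# Simulate an after-the-fact edit to one verdict.
tampered = [dict(e) for e in entries]
tampered[1]["status"] = "PASS"
```

Recomputing the digest over the tampered list gives a different value from the stored one, while merely reordering the untouched entries does not. The digest only proves integrity if it is itself stored somewhere the editor of the entries cannot rewrite, which is the role of the WORM sink.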

Point-in-time policy snapshots. When policies change for a new fiscal year or a regulatory update, you must never re-evaluate historical reports against the new rules. Store each rule set with its own content hash and evaluate every report against the exact snapshot active on its incurred_on date.

def snapshot_hash(rules: list[PolicyRule]) -> str:
    """Content hash of a rule set — pins evaluations to a policy version."""
    payload = json.dumps(
        sorted((r.model_dump() for r in rules), key=lambda d: d["rule_id"]),
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def audit_record(rules: list[PolicyRule], manifest: str) -> dict[str, str]:
    """The minimal record an external auditor needs to replay a run."""
    return {
        "policy_snapshot": snapshot_hash(rules),
        "audit_manifest": manifest,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

Persisting the policy_snapshot alongside the audit_manifest gives an auditor everything needed to replay a run: load the pinned rule set, re-run the batch, and confirm the manifest matches. This supports the SOX expectation that controls be both operating and evidenced, and it eliminates retroactive compliance drift, the class of finding where a report “passes” today only because the rules loosened after it was filed.
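Pinning a report to the snapshot in force on its incurred_on date amounts to a sorted lookup. The registry below is hypothetical: the SNAPSHOTS list, the snapshot_for function, and the placeholder hash strings are assumptions for illustration, and a real registry would hold the digests produced by snapshot_hash.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical registry of (effective_from, snapshot hash), sorted by date.
# The hash strings are placeholders, not real digests.
SNAPSHOTS: list[tuple[date, str]] = [
    (date(2024, 1, 1), "snapshot-2024"),
    (date(2025, 1, 1), "snapshot-2025"),
    (date(2026, 1, 1), "snapshot-2026"),
]


def snapshot_for(incurred_on: date) -> str:
    """Return the snapshot in force on incurred_on: the entry with the
    latest effective_from that is not after that date."""
    starts = [start for start, _ in SNAPSHOTS]
    idx = bisect_right(starts, incurred_on) - 1
    if idx < 0:
        raise LookupError(f"no policy snapshot in force on {incurred_on}")
    return SNAPSHOTS[idx][1]
```

An expense incurred on 31 December 2025 resolves to the 2025 snapshot even if it is audited in 2026, and a date before the first snapshot raises rather than silently falling back to some default rule set.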

Chain of custody for verdicts. Because each AuditEntry names its employee_id, rule_applied, and effective_limit, the trail is a self-contained explanation of every reimbursement decision. Internal audit can answer “why did this $412 hotel night pass?” without spelunking through application logs — the entry states the rule, the location-adjusted limit, and the outcome.

Integration & Operational Readiness

Wire the control plane into an existing AP or travel stack as a stateless, idempotent microservice. It consumes normalized payloads from OCR pipelines, ERP exports, or corporate-card feeds; runs deterministic traversal; and returns structured audit manifests. Statelessness matters because it lets you scale evaluation horizontally — the engine holds no per-request state beyond the batch it is given, so any worker can process any batch.

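Idempotency can be enforced at the boundary using the derived audit_id: re-submitting the same transaction against the same policy snapshot yields entries with the same audit_id values, so a sink that deduplicates on audit_id treats a retried or replayed message as a no-op rather than a double-charge against a budget. The engine itself appends to its in-memory trail on every evaluate() call, so the deduplication has to live in the sink or the caller. This is the same replay-safety discipline that Duplicate Receipt Detection relies on downstream.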
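One way to get that no-op behavior is a sink keyed by audit_id. The class below is an in-memory sketch under that assumption; IdempotentAuditSink is a hypothetical name, and a production sink would back the same check with a unique constraint or a conditional write.

```python
class IdempotentAuditSink:
    """In-memory stand-in for an append-only store keyed by audit_id."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    def write(self, entry: dict) -> bool:
        """Store the entry once; return False if this audit_id was seen."""
        audit_id = entry["audit_id"]
        if audit_id in self._entries:
            return False  # replayed message: nothing is written twice
        self._entries[audit_id] = entry
        return True

    def __len__(self) -> int:
        return len(self._entries)
```

Writing the same entry twice leaves exactly one record in the sink, and the boolean return lets the caller count replays for monitoring without treating them as errors.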

Treat policy definitions as configuration-as-code. Rules live in version control, are reviewed by pull request, and are validated in CI before they can ship:

def load_rules_from_config(raw_rules: list[dict]) -> list[PolicyRule]:
    """Validate a config payload into typed rules — call this in CI and at boot."""
    validated: list[PolicyRule] = []
    for raw in raw_rules:
        # Pydantic raises ValidationError on any malformed rule, failing the
        # build before a bad limit can reach production.
        validated.append(PolicyRule.model_validate(raw))
    logger.info("rules_validated", extra={"count": len(validated)})
    return validated

Because the rule set is decoupled from application logic, finance teams deploy threshold adjustments, regional rate refreshes, and compliance patches by merging a config change. The change still goes through pull-request review and the CI validation gate, but it requires neither a redeployment of the service nor a review of Python code for what is fundamentally a business decision. Feed the validated set into Dynamic Threshold Tuning if you want limits that adapt to observed spend distributions rather than static numbers.

A minimal operational checklist for standing this up:

Pin pydantic and the Python runtime; record both in the audit snapshot metadata.
Run load_rules_from_config as a required CI gate on every policy PR.
Point the audit sink at WORM storage with a retention lock ≥ your SOX retention window.
Emit the structured policy_violation log to your SIEM, keyed by transaction_id.
Store policy_snapshot + audit_manifest per batch; verify replay in a nightly job.
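The nightly replay check in the last item could look roughly like the sketch below. The digest and verify_replay helpers are hypothetical, the entries are plain dicts, and rerun stands for whatever callable re-evaluates the stored batch against the pinned rule set.

```python
import hashlib
import json
from typing import Callable


def digest(entries: list[dict]) -> str:
    """SHA-256 over the entries with sorted keys."""
    payload = json.dumps(entries, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_replay(
    stored: dict[str, str],
    current_snapshot: str,
    rerun: Callable[[], list[dict]],
) -> bool:
    """Return True only if the pinned rule set is unchanged and the re-run
    reproduces the stored manifest."""
    if stored["policy_snapshot"] != current_snapshot:
        return False  # the rule set loaded for replay is not the pinned one
    return stored["audit_manifest"] == digest(rerun())
```

A False result is worth alerting on in either branch: a snapshot mismatch means the wrong rules were loaded, and a manifest mismatch means the entries or the engine's behavior changed since the batch was first recorded.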

Failure Modes & Edge Cases

The engine is deterministic, but the surrounding system fails in predictable ways. Design for these explicitly rather than discovering them in an audit.

Failure mode	Root cause	Mitigation
Silent auto-pass of new merchant	New MCC has no taxonomy mapping	Fallback returns `REVIEW_REQUIRED`, never `PASS`
Contradictory verdicts	Two equal-priority rules match one item	Stable `rule_id` tie-break; lint rule set for overlap in CI
Retroactive drift	Historical report re-run against new rules	Pin evaluation to `policy_snapshot` by `incurred_on` date
Non-reproducible manifest	Random `audit_id`, wall-clock fields in the hashed payload, or unsorted trail	Derive `audit_id` from content; hash only replay-stable fields; sort entries and use `sort_keys=True`
Timezone-skewed effective dates	`incurred_on` compared across TZs	Normalize all dates to a single TZ before ingestion
FX mismatch on limits	Item currency ≠ limit currency	Convert to a base currency upstream; store rate used
Float rounding divergence	Accumulated `float` error across hosts	Round `effective_limit` at emission; consider `Decimal` for money
Rule-set explosion	One flat rule per geo × role × category	Use context selectors (`None` = wildcard) instead of enumerating

The most expensive of these is retroactive drift, because it is invisible until an auditor asks for a point-in-time replay and the numbers no longer reconcile. The policy_snapshot discipline in the compliance section is the direct countermeasure. The most common is rule-set explosion: teams enumerate a rule for every combination of geography, role, and category, and the set becomes unmaintainable. The None-as-wildcard context selectors on PolicyRule exist precisely to keep the set small — write one broad rule and let a higher-priority specific rule override it only where needed.
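The CI overlap lint mentioned in the table can be sketched with the same wildcard semantics. Rule is a cut-down stand-in for PolicyRule, find_overlaps is a hypothetical helper, and the check ignores effective_from for brevity, so it may flag pairs that never coexist in time.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Cut-down stand-in for PolicyRule: only the fields overlap depends on."""

    rule_id: str
    category: str
    priority: int
    geo_code: Optional[str] = None    # None = wildcard
    role_level: Optional[str] = None  # None = wildcard


def _axis_overlaps(a: Optional[str], b: Optional[str]) -> bool:
    # Two selectors can match the same item if either is a wildcard
    # or both name the same value.
    return a is None or b is None or a == b


def find_overlaps(rules: list[Rule]) -> list[tuple[str, str]]:
    """Return (rule_id, rule_id) pairs that share a category and priority
    and could both match one item."""
    hits: list[tuple[str, str]] = []
    for a, b in combinations(sorted(rules, key=lambda r: r.rule_id), 2):
        if (
            a.category == b.category
            and a.priority == b.priority
            and _axis_overlaps(a.geo_code, b.geo_code)
            and _axis_overlaps(a.role_level, b.role_level)
        ):
            hits.append((a.rule_id, b.rule_id))
    return hits
```

A broad wildcard rule and a specific rule at the same priority are reported as a pair; giving the specific rule a higher priority removes the finding and makes the intended override explicit instead of leaving it to the rule_id tie-break.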