Automated compliance checks
- Full provenance tracking per change: model version, prompt context, agent session ID, iteration count
- Automated compliance checks run without manual intervention on every merge
- AI-generated code is distinguishable from human-written code in version control (metadata, labels, or attribution)
- Provenance data is queryable (e.g., "show all changes made by model X in the last 30 days")
- Compliance check results are aggregated into a governance dashboard
Evidence
- Provenance metadata on commits/PRs showing full attribution chain
- Automated compliance check configuration with zero manual steps
- VCS query showing AI-vs-human code distinction
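The provenance query called for above ("show all changes made by model X in the last 30 days") can be sketched in a few lines. This is a minimal illustration, assuming provenance is captured as structured records (e.g., emitted as commit trailers or PR annotations and aggregated into a log); the field names and sample values are invented:

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical provenance store: one record per change, with the
# attribution fields listed in the checklist above.
RECORDS = [
    {"commit": "a1b2c3", "model_version": "model-x-2024-06",
     "agent_session": "s-481", "iterations": 3,
     "timestamp": "2024-07-01T12:00:00+00:00"},
    {"commit": "d4e5f6", "model_version": "model-y-2024-05",
     "agent_session": "s-482", "iterations": 1,
     "timestamp": "2024-05-20T09:30:00+00:00"},
]

def changes_by_model(records, model_version, days=30, now=None):
    """Return changes made by `model_version` in the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [
        r for r in records
        if r["model_version"] == model_version
        and datetime.fromisoformat(r["timestamp"]) >= cutoff
    ]
```

The same shape of query works whether the store is a log file, a database table, or commit metadata surfaced through the VCS.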
What It Is
Automated compliance checks at L4 go beyond the process gates of L3 (did the developer fill in the right fields?) to evaluate substantive compliance questions automatically: does this AI-generated code introduce patterns that are prohibited by the organization's security policy? Does this change affect a regulatory boundary in the system (a PCI scope crossing, a HIPAA data flow, an EU AI Act high-risk system component)? Does the model version used in this change have any known issues that require elevated review?
At L4 (Optimized), automated compliance checks run continuously - not just at PR time, but in the background as code is deployed, as regulatory guidance updates are published, and as new vulnerability disclosures affect AI-generated code patterns. The checks are not yes/no gates (that's L3) but scored assessments: this change has a compliance risk score of 73, say, based on its proximity to regulated data flows and the model version used. High-score changes get elevated review; low-score changes flow through automatically.
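A minimal sketch of that kind of scored routing, with invented weights, thresholds, and path conventions (a real system would derive these from policy and data-flow analysis rather than hard-code them):

```python
# Illustrative only: model-version risk weights and regulated path
# prefixes are assumptions made up for this sketch.
MODEL_RISK = {"model-x-2024-06": 10, "model-x-2024-03": 35}
REGULATED_PATHS = ("payments/", "pii/", "health/")

def compliance_risk_score(changed_files, model_version):
    """Score 0-100 from proximity to regulated data flows plus model risk."""
    proximity = 60 if any(
        f.startswith(REGULATED_PATHS) for f in changed_files) else 5
    model_risk = MODEL_RISK.get(model_version, 20)  # default for unknown versions
    return min(100, proximity + model_risk)

def route(score, threshold=50):
    """High-score changes get elevated review; low-score changes auto-merge."""
    return "elevated-review" if score >= threshold else "auto-merge"
```

The point is the shape of the decision, not the numbers: scoring turns a binary gate into a routing function.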
The technology stack for automated compliance checks at L4 typically includes:
- Static analysis tools configured with compliance-specific rules (detecting data flow patterns that cross regulatory boundaries)
- AI-powered code review that applies policy rules at a semantic level (not just syntactic patterns)
- Provenance analysis that correlates model version with compliance risk
- Regulatory mapping tools that maintain a current map of which code modules are in regulatory scope
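As a toy illustration of the regulatory mapping piece, assuming a hand-maintained map from path prefixes to regimes (real tooling would derive scope from data-flow analysis rather than a static table):

```python
# Hypothetical scope map: which code paths fall under which regime.
SCOPE_MAP = {
    "payments/": {"PCI-DSS"},
    "profiles/": {"GDPR"},
    "scoring/":  {"EU-AI-Act"},
}

def scopes_touched(changed_files):
    """Return the set of regulatory regimes a change set touches."""
    touched = set()
    for path in changed_files:
        for prefix, scopes in SCOPE_MAP.items():
            if path.startswith(prefix):
                touched |= scopes
    return touched
```

Run against every merge, a map like this is what makes scope crossings visible at change time rather than in an annual review.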
The shift from L3 to L4 compliance checks is a shift in what's being evaluated. L3 checks are process checks: was the form filled in correctly? L4 checks are substantive checks: does the content of this change raise compliance concerns? L4 checks require understanding the semantics of code changes, not just their metadata. This is where AI-powered compliance review becomes self-referential: you're using AI to check whether AI-generated code meets compliance requirements.
Why It Matters
- Process compliance is necessary but not sufficient - an AI-generated change can have complete MVAT fields, be reviewed by an approved reviewer, and still introduce a GDPR violation or a PCI scope crossing. Automated substantive checks catch what process checks cannot
- Regulatory boundaries are dynamic - as code evolves, what was outside regulatory scope can cross into scope without any developer deliberately intending it. Automated boundary tracking detects these crossings immediately rather than in annual security reviews
- AI-generated code may have systematic patterns - if a specific model version consistently generates a particular insecure pattern (e.g., a specific vulnerable serialization approach), automated scanning can detect all instances of that pattern across the codebase simultaneously
- Scales with AI-generated PR volume - as agents generate hundreds of PRs per day, human-dependent substantive compliance review cannot scale. Automated checks provide the substantive evaluation that allows auto-merge policies to operate safely
- Creates a risk-scored review queue - not all PRs need the same level of human review. Automated compliance scoring routes high-risk changes to expert reviewers and allows low-risk changes to merge automatically. This is how high-velocity AI-assisted delivery maintains compliance without review bottlenecks
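The systematic-pattern point above can be illustrated with a sketch that joins provenance data to a pattern scan. The model name, file contents, and the choice of unsafe deserialization as the flagged pattern are all invented for the example:

```python
import re

# Example pattern: a newly disclosed insecure construct that a specific
# model version is known to emit (unsafe pickle deserialization here).
INSECURE_PATTERN = re.compile(r"pickle\.loads\(")

def scan_model_output(files_by_model, model_version, sources):
    """Return files produced by `model_version` containing the pattern.

    `files_by_model` maps model version -> list of file paths (from
    provenance data); `sources` maps file path -> file contents.
    """
    return [
        path for path in files_by_model.get(model_version, [])
        if INSECURE_PATTERN.search(sources.get(path, ""))
    ]
```

Because provenance narrows the search to one model's output, a single disclosure can be swept across the whole codebase in one pass.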
Getting Started
Common Pitfalls
Mistakes teams actually make at this stage - and how to avoid them
How Different Roles See It
Bob has been running L3 compliance gates for six months and has near-100% process compliance. But a security incident last month revealed that an AI-generated change introduced a GDPR data retention issue - the process was compliant (the MVAT was filled in, the reviewer approved) but the substantive compliance requirement was missed. Bob needs to add substantive checks.
Sarah wants to use compliance risk scoring to improve her productivity analysis. Specifically, she wants to understand whether high-risk PRs (those that trigger compliance escalations) have different review times, defect rates, and team distributions than low-risk PRs. This analysis would let her target developer training more precisely.
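One way Sarah's comparison could start, with invented PR records and field names, is a simple group-by of review metrics over risk tiers:

```python
from statistics import mean

# Illustrative PR records: risk tier from compliance scoring, plus the
# review outcomes Sarah wants to compare. Values are made up.
PRS = [
    {"risk": "high", "review_hours": 9.0, "defects": 2},
    {"risk": "high", "review_hours": 7.0, "defects": 1},
    {"risk": "low",  "review_hours": 1.0, "defects": 0},
    {"risk": "low",  "review_hours": 2.0, "defects": 0},
]

def mean_by_tier(prs, metric):
    """Average a metric (e.g. review_hours, defects) per risk tier."""
    tiers = {}
    for pr in prs:
        tiers.setdefault(pr["risk"], []).append(pr[metric])
    return {tier: mean(values) for tier, values in tiers.items()}
```

If high-risk PRs show materially different review times or defect rates, that is the signal for targeting training at the teams producing them.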
Victor's agent workflows generate PRs that touch multiple compliance domains simultaneously - a single agent session might implement a feature that touches both PCI-scoped payment logic and GDPR-scoped user preferences. Current compliance checks evaluate each dimension independently, but Victor knows that changes touching multiple compliance boundaries simultaneously are riskier than single-boundary changes.
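Victor's multi-boundary concern could be encoded as a super-linear score boost rather than independent per-domain checks; the domain weights and multiplier below are assumptions chosen for illustration:

```python
# Hypothetical per-domain weights; real values would come from policy.
DOMAIN_WEIGHT = {"PCI-DSS": 30, "GDPR": 25, "EU-AI-Act": 35}

def multi_boundary_score(domains):
    """Score a change by the compliance domains it touches.

    Each extra domain multiplies the risk: interactions between
    regimes are treated as riskier than the sum of their parts.
    """
    base = sum(DOMAIN_WEIGHT.get(d, 20) for d in domains)
    multiplier = 1.0 + 0.5 * max(0, len(domains) - 1)
    return min(100, base * multiplier)
```

A single-domain change scores its base weight, while a change crossing PCI and GDPR boundaries in one agent session scores well above the two evaluated separately.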
Further Reading
From the Field
Recent releases, projects, and discussions relevant to this maturity level.