DORA + basic AI tracking

At L2 (Guided), teams have moved past the L1 silence on metrics.

·DORA metrics are tracked consistently with a dashboard
·AI tool license count vs. active usage rate is measured
·PR throughput per developer is tracked

·AI acceptance rate (% of AI suggestions accepted) is measured per tool
·Metrics are reviewed in team retrospectives at least monthly
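
As a rough illustration of the two ratio metrics in the checklist above, the sketch below computes license utilization and per-tool acceptance rate. The record shape, tool names, and numbers are hypothetical; real admin exports will look different.

```python
from dataclasses import dataclass


@dataclass
class ToolStats:
    """Hypothetical weekly export for one AI tool (all fields are assumptions)."""
    tool: str
    licenses_purchased: int
    active_users: int
    suggestions_shown: int
    suggestions_accepted: int


def license_utilization(s: ToolStats) -> float:
    """Share of purchased licenses that have an active user, in [0, 1]."""
    return s.active_users / s.licenses_purchased if s.licenses_purchased else 0.0


def acceptance_rate(s: ToolStats) -> float:
    """Share of shown suggestions that were accepted, in [0, 1]."""
    return s.suggestions_accepted / s.suggestions_shown if s.suggestions_shown else 0.0


# Made-up numbers for two placeholder tools.
stats = [
    ToolStats("tool-a", licenses_purchased=50, active_users=30,
              suggestions_shown=4000, suggestions_accepted=1000),
    ToolStats("tool-b", licenses_purchased=20, active_users=5,
              suggestions_shown=0, suggestions_accepted=0),
]

for s in stats:
    print(f"{s.tool}: utilization={license_utilization(s):.0%}, "
          f"acceptance={acceptance_rate(s):.0%}")
```

Keeping the two ratios per tool, rather than pooled, matches the "per tool" wording in the checklist and avoids one heavily used tool masking an unused one.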

Evidence

·DORA metrics dashboard with current data
·License utilization report (licenses purchased vs. active users)
·PR throughput chart showing per-developer breakdown

What It Is

At L2 (Guided), teams have moved past the L1 silence on metrics. They're tracking the four DORA metrics - deployment frequency, lead time for changes, change failure rate, and mean time to restore - and they've added a basic layer of AI-specific tracking on top. "Basic AI tracking" means capturing the signals that DORA doesn't cover: how many of your PRs are AI-assisted, how many developers are actively using AI tools each week, and how AI-labeled PRs compare to human-authored PRs on review time and defect rate.
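
To make the four DORA metrics concrete, here is a minimal sketch that derives them from a list of deployment records. The `Deployment` record and its field names are assumptions for illustration; tools such as Four Keys or a CI/CD platform's own reporting define these inputs in their own way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical deployment record; field names are assumptions."""
    commit_time: datetime                     # first commit of the change
    deploy_time: datetime                     # when it reached production
    failed: bool = False                      # did it cause a production failure?
    restored_time: Optional[datetime] = None  # when service was restored


def dora_summary(deploys: list[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over one reporting period."""
    if not deploys:
        raise ValueError("need at least one deployment in the period")
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_time - d.deploy_time
                for d in failures if d.restored_time is not None]
    return {
        "deployment_frequency_per_day": len(deploys) / period_days,
        "median_lead_time": median(d.deploy_time - d.commit_time for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (sum(restores, timedelta()) / len(restores)
                                 if restores else None),
    }


# Made-up week of four deployments, one of which failed and was restored in 1h.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=8), failed=True,
               restored_time=t0 + timedelta(hours=9)),
    Deployment(t0, t0 + timedelta(hours=12)),
    Deployment(t0, t0 + timedelta(hours=24)),
]
summary = dora_summary(deploys, period_days=7)
```

The sketch uses the median for lead time because a few very slow changes would otherwise dominate the figure; that choice is an assumption, not part of the DORA definitions.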

The DORA + basic AI tracking combination is the minimum viable measurement stack for a team that has made a meaningful AI investment. DORA gives you delivery performance. Basic AI tracking gives you the first correlation signal: are the developers who use AI tools more also performing better on DORA metrics? This correlation is imperfect - it doesn't prove causation, it doesn't control for developer seniority, and it doesn't account for which AI usage patterns are producing results - but it's the first step toward evidence-based AI program management.

The word "basic" is important. Basic AI tracking at L2 does not include ITS, CPI, TORS, or Agent Autonomy Score - those are L3/L4 metrics that require more sophisticated instrumentation. Basic tracking is: usage rates, AI-labeled PR percentage, and a first pass at comparing AI-assisted vs. non-AI-assisted PR cycle times. This is achievable with a few GitHub Actions, a labeling convention, and a simple dashboard.
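
A minimal sketch of the two PR-level numbers named above, AI-labeled PR percentage and a first-pass cycle-time comparison. The label names and the PR records are assumptions for illustration; in practice the records would come from your PR data export.

```python
from collections import defaultdict
from statistics import median

# Assumed labeling convention; adjust to whatever labels your team uses.
AI_LABELS = {"ai-authored", "ai-assisted"}

# Hypothetical PR records: (label, cycle time in hours from open to merge).
prs = [
    ("ai-assisted", 10.0),
    ("ai-authored", 6.0),
    ("human-authored", 20.0),
    ("human-authored", 30.0),
    ("ai-assisted", 14.0),
]


def ai_pr_share(records):
    """Fraction of PRs that carry an AI label."""
    return sum(1 for label, _ in records if label in AI_LABELS) / len(records)


def median_cycle_time_by_group(records):
    """Median cycle time (hours) for AI-labeled vs. non-AI PRs."""
    groups = defaultdict(list)
    for label, hours in records:
        groups["ai" if label in AI_LABELS else "non-ai"].append(hours)
    return {group: median(values) for group, values in groups.items()}


print(f"AI-labeled PR share: {ai_pr_share(prs):.0%}")
print(median_cycle_time_by_group(prs))
```

This is deliberately a first pass: it does not control for PR size, team, or seniority, which is consistent with the "basic" scope of L2.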

Most teams at L2 discover a pattern in their data that surprises them: there is a significant spread between high-usage and low-usage developers on DORA metrics. Developers who use AI tools daily and have developed good prompting and review habits are measurably faster. Developers who have licenses but use them inconsistently show no throughput improvement. This insight - that usage rate is a better predictor of impact than license count - is the core finding that drives L2 to L3 progression.

Why It Matters

Combines established and new signals - DORA metrics are trusted by engineering leadership and finance; pairing them with AI tracking gives AI metrics the credibility boost of being presented alongside a recognized framework
Surfaces the usage/impact correlation - basic AI tracking almost always reveals that high-usage developers outperform low-usage developers on delivery metrics; this data makes the case for adoption programs that move low-usage developers upward
Creates accountability for AI tool ROI - a team tracking DORA + AI metrics can answer the ROI question with data rather than anecdotes; "our AI-high-usage cohort has 40% shorter PR cycle time than our low-usage cohort" is a defensible ROI claim
Establishes the foundation for L3 metrics - ITS, CPI, and TORS require understanding of the AI-assisted PR workflow; teams that have already labeled PRs and tracked basic AI metrics have the data hygiene foundation to add these more sophisticated metrics later
Makes adoption gaps visible - basic tracking reveals which teams or individuals are not adopting AI tools effectively; this makes targeted intervention possible rather than leaving low-adoption situations to persist invisibly

Getting Started

Implement DORA tracking if you haven't already - Use the Four Keys open-source project, LinearB, Jellyfish, or your CI/CD platform's native DORA reporting. Get all four metrics on a dashboard and reviewed weekly before adding AI layers.
Create an AI-labeling convention - Define labels for your PR workflow: ai-authored (agent wrote the majority of changes), ai-assisted (developer used AI for significant portions but wrote most of it), and human-authored (no AI involvement). Enforce this via PR template prompts that ask developers to apply the appropriate label.
Instrument AI tool usage rates - Pull weekly active user data from your AI tool admin dashboards (GitHub Copilot Business, Cursor Teams, etc.). Define "active user" as having used the tool on 3+ days in the week. Publish this as a weekly metric alongside DORA.
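Build a cohort comparison - Group developers into high-AI-usage (active 4+ days/week) and low-AI-usage (active 0-2 days/week) cohorts. Developers active exactly 3 days/week count as active users but fall into neither cohort; leave them out of the comparison or report them as a separate middle group. Compare the two cohorts' DORA metrics. Do high-usage developers have shorter lead times? More frequent deployments? This comparison is your first evidence of AI impact.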
Track AI-assisted PR review time separately - Does it take reviewers longer or shorter to review AI-generated code? This metric surfaces both quality signals (AI code that's hard to review suggests context problems) and process signals (reviewers who are unfamiliar with AI code take longer initially but speed up with practice).
Present DORA + AI metrics together in engineering reviews - Never present AI metrics in isolation. Always show them alongside DORA. "Our deployment frequency is up 20% and correlates with the 30% increase in AI-assisted PRs this quarter" is a much stronger story than either metric alone.
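
The usage-rate and cohort steps above can be sketched as follows. The thresholds come from the steps themselves; the developer names, active-day counts, and lead times are made up, and the choice of median lead time as the comparison metric is an assumption.

```python
from statistics import median

# Hypothetical inputs: days each developer used an AI tool this week, and
# each developer's median PR lead time in hours over the same period.
active_days = {"ana": 5, "ben": 4, "cy": 3, "dee": 1, "eli": 0}
lead_time_hours = {"ana": 18.0, "ben": 22.0, "cy": 30.0, "dee": 40.0, "eli": 36.0}

ACTIVE_MIN_DAYS = 3  # "active user": used the tool on 3+ days in the week
HIGH_MIN_DAYS = 4    # high-usage cohort: active 4+ days/week
LOW_MAX_DAYS = 2     # low-usage cohort: active 0-2 days/week


def weekly_active_users(days):
    """Names of developers who meet the active-user threshold."""
    return sorted(dev for dev, d in days.items() if d >= ACTIVE_MIN_DAYS)


def cohort(days_active):
    """Map a day count to 'high', 'low', or None (exactly 3 days fits neither)."""
    if days_active >= HIGH_MIN_DAYS:
        return "high"
    if days_active <= LOW_MAX_DAYS:
        return "low"
    return None


def cohort_median_lead_time(days, lead_times):
    """Median lead time (hours) per cohort, skipping developers in neither."""
    buckets = {"high": [], "low": []}
    for dev, d in days.items():
        c = cohort(d)
        if c is not None and dev in lead_times:
            buckets[c].append(lead_times[dev])
    return {c: median(v) for c, v in buckets.items() if v}


print(weekly_active_users(active_days))
print(cohort_median_lead_time(active_days, lead_time_hours))
```

With cohorts this small the comparison is only directional; treat it as a prompt for questions rather than a result.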

Tip

GitHub's built-in DORA metrics (available in GitHub Enterprise through Insights) now include some AI-specific signals from Copilot. If you're on GitHub Enterprise, check the Insights dashboard before building custom instrumentation - you may already have a significant portion of the L2 metrics stack available.


Common Pitfalls

Adding AI tracking without fixing DORA first. If your DORA metrics are inconsistently defined or poorly instrumented, adding AI tracking on top creates confusion rather than clarity. Fix the DORA foundation before layering AI metrics. Bad DORA data + AI labels = misleading conclusions.

Tracking too many AI metrics simultaneously. It's tempting to instrument everything at once: acceptance rates, active users, PR labels, code quality scores, survey responses, cost per token. The result is metric overload - no one knows which metrics matter and the dashboard becomes noise. At L2, pick three metrics and track them consistently for a full quarter before adding more.

Using AI label adherence as a performance metric. If developers believe that applying ai-authored labels will lead to their work being scrutinized more heavily or used against them, they'll stop applying labels. The purpose of labeling is organizational learning, not individual evaluation. Make this explicit and enforce it consistently.

Not segmenting by team or role. AI tool impact varies significantly by role and codebase. Frontend developers using AI for React components may see different throughput gains than backend developers writing database migrations. Reporting a single average hides these differences. Segment the DORA + AI data by team and role to get actionable insights.
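
To illustrate how a single average can hide segment differences, the sketch below compares the AI vs. non-AI lead-time gap overall and per (team, role) segment. The `PRRow` shape, team names, and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class PRRow(NamedTuple):
    """Hypothetical PR record; field names are assumptions."""
    team: str
    role: str
    ai_labeled: bool
    lead_time_hours: float


rows = [
    PRRow("web", "frontend", True, 10.0),
    PRRow("web", "frontend", False, 20.0),
    PRRow("platform", "backend", True, 30.0),
    PRRow("platform", "backend", False, 32.0),
]


def ai_gap(rows):
    """Mean non-AI lead time minus mean AI lead time (positive = AI is faster).

    Raises statistics.StatisticsError if either side has no rows.
    """
    ai = [r.lead_time_hours for r in rows if r.ai_labeled]
    non_ai = [r.lead_time_hours for r in rows if not r.ai_labeled]
    return mean(non_ai) - mean(ai)


def ai_gap_by_segment(rows):
    """The same gap, computed separately for each (team, role) segment."""
    segments = defaultdict(list)
    for r in rows:
        segments[(r.team, r.role)].append(r)
    return {seg: ai_gap(seg_rows) for seg, seg_rows in segments.items()}


print("overall gap (h):", ai_gap(rows))
print("per segment (h):", ai_gap_by_segment(rows))
```

In this made-up data the overall gap sits between a large gap for one segment and a small one for the other, which is the kind of difference a single average would conceal.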

Treating correlation as causation too early. The finding that high-AI-usage developers have better DORA metrics is promising but not proof. These developers may be inherently more productive, more senior, or working on simpler tasks. Acknowledge this caveat when reporting. The way to move from correlation to causation is a controlled experiment - take a cohort of matched developers, give half AI tools, measure the difference. Most teams never run this experiment, but it's worth acknowledging the limitation of the observational data.
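
For teams that do want to attempt the controlled experiment described above, the assignment step might look like the sketch below. Matching on years of seniority alone is an assumption made for brevity; a real design would likely match on more attributes (team, codebase, task mix). All names and numbers are hypothetical.

```python
import random


def assign_matched_pairs(developers, seed=0):
    """Split developers into treatment (gets AI tools) and control groups.

    developers: list of (name, seniority_years) tuples.
    Developers are sorted by seniority, neighbours are paired, and one member
    of each pair is chosen at random for treatment. With an odd head count,
    the most senior developer is left out of the experiment.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    ordered = sorted(developers, key=lambda d: d[1])
    treatment, control = [], []
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i][0], ordered[i + 1][0]]
        rng.shuffle(pair)
        treatment.append(pair[0])
        control.append(pair[1])
    return treatment, control


devs = [("a", 1), ("b", 2), ("c", 5), ("d", 6), ("e", 10)]
treatment, control = assign_matched_pairs(devs, seed=42)
print("treatment:", treatment, "control:", control)
```

Random assignment within matched pairs is what separates this from the observational cohort comparison: seniority can no longer explain a difference between the groups.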


How Different Roles See It

Bob - Head of Engineering

Bob has established DORA tracking across his teams and is now seeing that deployment frequency is up year-over-year, but he can't tell how much of that is AI tools vs. the new CI pipeline they rebuilt in Q2. He wants to isolate the AI contribution.

What Bob should do: Bob should apply AI PR labels retroactively where possible (many git commit messages and PR descriptions contain signals of AI involvement) and require them on all new PRs. The key analytical question is: for PRs labeled ai-assisted or ai-authored, what is the lead time compared to human-authored PRs of similar size and complexity? Controlling for PR size removes a major confound (AI-generated PRs tend to be larger, which inflates lead time). Bob should present this analysis in the next quarterly business review as evidence of AI impact on delivery performance. The retroactive labeling won't be perfect, but it will be good enough to show directional impact.
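
One simple way to control for PR size is to compare lead times only within size buckets, as in this sketch. The bucket thresholds (50 and 300 changed lines) and the PR records are arbitrary assumptions; pick boundaries that fit your own PR size distribution.

```python
from collections import defaultdict
from statistics import median


def size_bucket(lines_changed):
    """Assign a PR to a size bucket; thresholds are illustrative assumptions."""
    if lines_changed < 50:
        return "small"
    if lines_changed < 300:
        return "medium"
    return "large"


# Hypothetical PR records: (label, lines changed, lead time in hours).
prs = [
    ("ai-assisted", 30, 4.0),
    ("human-authored", 40, 6.0),
    ("ai-authored", 400, 20.0),
    ("human-authored", 350, 30.0),
    ("ai-assisted", 500, 24.0),
]


def lead_time_by_size_and_origin(records):
    """Median lead time (hours) keyed by (size bucket, 'ai' or 'human')."""
    groups = defaultdict(list)
    for label, lines, hours in records:
        origin = "human" if label == "human-authored" else "ai"
        groups[(size_bucket(lines), origin)].append(hours)
    return {key: median(values) for key, values in groups.items()}


for key, hours in sorted(lead_time_by_size_and_origin(prs).items()):
    print(key, hours)
```

Comparing "large AI" against "large human" PRs, rather than all AI against all human PRs, keeps a difference in typical PR size from masquerading as a difference in lead time.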


Sarah - Productivity Lead

Sarah has developer usage rate data from Copilot Business and basic DORA metrics from GitHub Insights. She wants to connect the two - to show that developers using Copilot more are shipping faster - but she's not sure how to do the analysis rigorously.

What Sarah should do: Sarah should run a cohort analysis with three groups: non-users (0 Copilot suggestions accepted in a week), light users (1-10 accepted suggestions/day), and heavy users (more than 10 accepted suggestions/day). For each group, compute average PR cycle time and PR volume per week over the last quarter. The analysis will likely show that heavy users have shorter cycle times and higher PR volume. Sarah should present this with the caveat that it's observational, not causal - but then follow up with a recommendation: target the non-users and light users for structured adoption support. The data doesn't prove causation, but it does identify where intervention will have the highest expected return.
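
A minimal sketch of the three-group analysis described above. The per-developer rows are made up, and treating exactly 10 accepted suggestions per day as "light" is an assumption about where the boundary falls.

```python
from statistics import mean


def usage_group(accepted_per_day):
    """Classify a developer by average accepted suggestions per day."""
    if accepted_per_day == 0:
        return "non-user"
    if accepted_per_day <= 10:  # boundary choice is an assumption
        return "light"
    return "heavy"


# Hypothetical rows: (accepted suggestions/day, avg PR cycle time h, PRs/week).
devs = [
    (0, 40.0, 2),
    (4, 30.0, 3),
    (10, 34.0, 3),
    (25, 20.0, 5),
    (18, 24.0, 5),
]


def summarize(rows):
    """Average cycle time and PR volume per usage group."""
    groups = {}
    for accepted, cycle_hours, prs_per_week in rows:
        groups.setdefault(usage_group(accepted), []).append(
            (cycle_hours, prs_per_week))
    return {
        group: {
            "avg_cycle_time_h": mean(c for c, _ in members),
            "avg_prs_per_week": mean(p for _, p in members),
            "developers": len(members),
        }
        for group, members in groups.items()
    }


for group, row in summarize(devs).items():
    print(group, row)
```

Reporting the number of developers in each group alongside the averages helps readers judge how much weight a difference can bear.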


Victor - Staff Engineer, AI Champion

Victor runs agent-heavy workflows and knows the DORA + basic tracking metrics are table stakes. He's already thinking about ITS and CPI, the L3 metrics. But he recognizes that the team needs to build the L2 foundation before jumping to L3.

What Victor should do: Victor should help build the L2 instrumentation that will feed into L3. Specifically: the PR labeling convention needs to be more granular than a simple ai-assisted label. Victor should propose a tagging schema that captures the agent type (Claude Code, Copilot, Cursor), the task type (new feature, bug fix, refactor, test writing), and the autonomy level (fully autonomous, human-guided, collaborative). This richer metadata makes the L3 metrics - ITS and CPI - much more meaningful because they can be segmented by task type and agent. Building this schema now is the infrastructure investment that makes L3 metrics analysis straightforward rather than a data archaeology project.
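
The tagging schema Victor proposes could be modeled along these lines. The three dimensions and their values come from the paragraph above; the `prefix:value` label format and all class and method names are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Agent(Enum):
    CLAUDE_CODE = "claude-code"
    COPILOT = "copilot"
    CURSOR = "cursor"


class TaskType(Enum):
    NEW_FEATURE = "new-feature"
    BUG_FIX = "bug-fix"
    REFACTOR = "refactor"
    TEST_WRITING = "test-writing"


class Autonomy(Enum):
    FULLY_AUTONOMOUS = "fully-autonomous"
    HUMAN_GUIDED = "human-guided"
    COLLABORATIVE = "collaborative"


@dataclass(frozen=True)
class PRTag:
    """One PR's AI metadata: which agent, what kind of task, how autonomous."""
    agent: Agent
    task: TaskType
    autonomy: Autonomy

    def to_labels(self) -> list[str]:
        """Render as PR labels in an assumed 'prefix:value' format."""
        return [f"agent:{self.agent.value}",
                f"task:{self.task.value}",
                f"autonomy:{self.autonomy.value}"]

    @classmethod
    def from_labels(cls, labels: list[str]) -> "PRTag":
        """Parse labels back into a tag; unrelated labels are ignored.

        Raises KeyError if a dimension is missing and ValueError if a value
        is not part of the schema.
        """
        fields = dict(label.split(":", 1) for label in labels if ":" in label)
        return cls(Agent(fields["agent"]),
                   TaskType(fields["task"]),
                   Autonomy(fields["autonomy"]))


tag = PRTag(Agent.CLAUDE_CODE, TaskType.BUG_FIX, Autonomy.HUMAN_GUIDED)
labels = tag.to_labels()
print(labels)
```

Using closed enumerations rather than free-text labels means a typo fails loudly at parse time instead of silently creating a new category in later analysis.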
