No AI-specific metrics


·Delivery is tracked with at least basic metrics
·Standard delivery metrics are in place (AI-specific metrics come later)

·Team acknowledges the need for AI-specific metrics beyond traditional DORA
·Basic deployment frequency is at least known (even if not dashboarded)

Evidence

·Absence of metrics dashboard or inconsistent/manual tracking
·No AI-specific fields in existing metrics systems

What It Is

At L1, engineering teams that have adopted AI tools - GitHub Copilot, Cursor, Claude Code - are running those tools with zero AI-specific metrics. They know how many licenses they've purchased. They do not know how many are actively used, what percentage of production code was AI-assisted, whether AI-generated code has a higher or lower defect rate than human-written code, or whether AI adoption has changed developer throughput in any measurable way. The AI tools are running, but there's no instrumentation to see what they're doing.

This is the normal starting state. Teams adopt AI tools because developers request them or because leadership mandates them, and the focus is on adoption and tooling setup. Measurement is treated as something to figure out later, once the tools are embedded in the workflow. The problem is that "later" often never arrives, and the team ends up 12 months into AI tool usage with no ability to answer the question every engineering leader eventually faces: "What did we get for this investment?"

The absence of AI-specific metrics is different from the absence of DORA metrics. DORA metrics measure delivery performance, which existed before AI. AI-specific metrics measure a new phenomenon: the degree to which AI is changing how code is produced. They answer different questions. How many iterations does it take an agent to produce a passing CI result? What percentage of merged PRs were primarily AI-authored? How does the defect rate of AI-generated code compare to human-generated code in the same codebase? These questions have no analog in traditional software metrics, and they require instrumentation that most teams at L1 haven't built.
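To make the three questions above concrete, here is a minimal sketch of how they could be computed once PRs carry the right fields. The `PullRequest` record and every field name on it are hypothetical, not part of any tool's API; the point is only to show what data each metric needs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PullRequest:
    # All fields are hypothetical; map them to whatever your tooling records.
    ai_authored: bool         # was an AI agent the primary author?
    ci_runs_until_green: int  # CI attempts before the first passing run
    caused_defect: bool       # later linked to a bug or incident


def ai_share(prs):
    """Fraction of merged PRs that were primarily AI-authored."""
    if not prs:
        return None
    return sum(pr.ai_authored for pr in prs) / len(prs)


def defect_rate(prs, ai_authored):
    """Defect rate within one cohort (AI-authored or human-authored)."""
    cohort = [pr for pr in prs if pr.ai_authored == ai_authored]
    if not cohort:
        return None  # no data for this cohort yet
    return sum(pr.caused_defect for pr in cohort) / len(cohort)


def mean_iterations_to_green(prs):
    """Average CI attempts on AI-authored PRs before a passing result."""
    runs = [pr.ci_runs_until_green for pr in prs if pr.ai_authored]
    return mean(runs) if runs else None


# Illustrative data only.
merged = [
    PullRequest(ai_authored=True, ci_runs_until_green=3, caused_defect=False),
    PullRequest(ai_authored=True, ci_runs_until_green=1, caused_defect=True),
    PullRequest(ai_authored=False, ci_runs_until_green=1, caused_defect=False),
    PullRequest(ai_authored=False, ci_runs_until_green=2, caused_defect=False),
]

print(ai_share(merged))
print(defect_rate(merged, ai_authored=True), defect_rate(merged, ai_authored=False))
print(mean_iterations_to_green(merged))
```

None of these functions is hard to write. What a team at L1 is missing is the three input fields, which is why the instrumentation has to come first.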

The gap between "AI tools installed" and "AI impact measured" is where most organizations live. It's a dangerous place because decisions about AI investment are being made on impressions and anecdotes rather than data. Some developers love the tools and exaggerate their productivity gains. Others are skeptical and underreport their usage. Without AI-specific metrics, there's no way to cut through the noise and understand what's actually happening.

Why It Matters

Invisible AI usage means invisible impact - if you don't know which PRs were AI-assisted and which were human-written, you can't compare defect rates, review time, or test coverage between the two cohorts; the signal is permanently lost
License spend without usage data is waste - at L1, teams often discover that 30-40% of Copilot licenses are paid for but never actively used; without usage metrics, this waste is invisible until someone does a license audit
AI tool selection requires evidence - deciding whether to use Claude Code vs Cursor vs Copilot should be data-driven; teams without AI-specific metrics make these decisions based on demos and developer preferences rather than actual productivity impact in their specific codebase
Regulatory and compliance risk - as AI-generated code becomes subject to audit requirements (EU AI Act, SOC 2 extensions, etc.), teams that didn't track AI usage have no way to produce the records they need; the measurement gap becomes a compliance gap
Cannot learn what works - some AI usage patterns produce much better outcomes than others; without metrics, there's no way to identify which patterns are working and systematize them across the team

Getting Started

Define "AI-assisted" for your context - Before instrumentation, decide what counts as AI-assisted code. A PR where the developer used Copilot for one line completion? A PR written primarily by an agent? A PR reviewed by an AI review tool? The definition shapes the metrics. Start with a simple binary: was an AI agent the primary author of this PR's changes?
Add an AI label to your PR workflow - The simplest possible instrumentation is a PR label. Create a label like ai-assisted or agent-authored and ask developers to apply it when appropriate. This is zero-infrastructure measurement. It's imperfect and relies on developer discipline, but it gives you your first data point: what percentage of PRs in a given week are developers identifying as AI-assisted?
Instrument your AI tool's native analytics - GitHub Copilot, Cursor, and most enterprise AI coding tools have built-in usage analytics. Check the admin console. You likely already have data on acceptance rates, active users, and lines accepted. Pull this data into your engineering dashboard.
Track AI tool active usage, not license count - Define "active usage" (e.g., used the AI tool at least 3 days in a week) and track this metric weekly per developer. The gap between license count and active users is your first meaningful AI-specific signal.
Start a simple AI metrics log - Create a shared document or dashboard where you record weekly: AI tool active users, PR count (total and AI-labeled), and any qualitative observations. Even a Google Sheet works. The goal is a habit of recording, not a sophisticated BI system.
Review AI metrics in your sprint retrospective - Whatever AI metrics you start tracking, make them a standing agenda item in the sprint retrospective or engineering review. "Here's our AI usage this week - anything interesting?" creates the habit of attention that makes measurement meaningful.
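The steps above reduce to a small weekly calculation. The following is a sketch, assuming you can export two things by hand or by script: a list of (developer, day) usage events from your AI tool's admin console, and the labels on each PR merged that week. The function names, the input shapes, and the sample data are all illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

ACTIVE_DAYS_THRESHOLD = 3  # "active" = used the tool on at least 3 days in the week


def active_users(usage_events, threshold=ACTIVE_DAYS_THRESHOLD):
    """usage_events: iterable of (developer, day) pairs for one week."""
    days_by_dev = defaultdict(set)
    for dev, day in usage_events:
        days_by_dev[dev].add(day)  # a set, so repeat use on one day counts once
    return {dev for dev, days in days_by_dev.items() if len(days) >= threshold}


def labeled_share(pr_labels, label="ai-assisted"):
    """pr_labels: one list of label names per PR merged that week."""
    if not pr_labels:
        return 0.0
    return sum(label in labels for labels in pr_labels) / len(pr_labels)


# Illustrative data only.
licenses = 5
events = [
    ("ana", date(2025, 3, 3)), ("ana", date(2025, 3, 4)), ("ana", date(2025, 3, 6)),
    ("ben", date(2025, 3, 3)), ("ben", date(2025, 3, 3)),
    ("cy", date(2025, 3, 5)),
]
prs = [["ai-assisted", "bug"], ["docs"], [], ["ai-assisted"]]

active = active_users(events)
weekly_row = {
    "licenses": licenses,
    "active_users": len(active),
    "license_gap": licenses - len(active),
    "ai_labeled_share": labeled_share(prs),
}
print(weekly_row)
```

One `weekly_row` per week, appended to a spreadsheet, is enough for the metrics log described above.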

Tip

GitHub's Copilot usage API and the Copilot Business admin dashboard give you acceptance rate, active users, and lines of code accepted without any custom instrumentation. If you're on GitHub Enterprise, you have this data available right now - you're just not looking at it.


Common Pitfalls

Conflating license count with usage. Buying 50 Copilot licenses and reporting "50 developers use AI" is a common mistake that produces misleading ROI calculations. License count is a budget line item. Active usage is a behavior metric. Track both and report the gap honestly.

Treating acceptance rate as the primary success metric. GitHub Copilot's acceptance rate (percentage of suggestions accepted) is widely reported but weakly predictive of productivity impact. Developers can accept low-quality suggestions that slow them down later, or reject high-quality suggestions that they'll type themselves anyway. Acceptance rate measures engagement, not value. It's worth tracking but shouldn't be the headline metric.

Waiting for perfect instrumentation before tracking anything. Teams frequently say "we'll track AI impact once we have a proper BI pipeline." The BI pipeline takes six months to build, and by then the historical data is gone. Start with imperfect manual tracking - PR labels, weekly surveys, a shared spreadsheet - and improve the instrumentation over time. Something is always better than nothing.

Not tracking the baseline before AI tool rollout. The most important AI-specific metric is the before/after comparison. Teams that roll out AI tools without establishing a pre-AI baseline of PR throughput, lead time, and defect rate cannot compute impact. If you haven't started yet, measure now. If you've already rolled out, find whatever historical data you can from before the rollout.
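A baseline comparison can start as a simple split of historical PR data at the rollout date. This is a sketch under stated assumptions: the rollout date, the record fields (`merged`, `lead_time_h`, `defect`), and the sample values are hypothetical placeholders for whatever your Git host and issue tracker can export.

```python
from datetime import date
from statistics import median

ROLLOUT = date(2025, 1, 15)  # hypothetical rollout date


def split_by_rollout(prs, rollout=ROLLOUT):
    """Partition PR records into those merged before and after the rollout."""
    before = [p for p in prs if p["merged"] < rollout]
    after = [p for p in prs if p["merged"] >= rollout]
    return before, after


def summarize(prs):
    """Throughput, lead time, and defect rate for one period."""
    if not prs:
        return None
    return {
        "pr_count": len(prs),
        "median_lead_time_h": median(p["lead_time_h"] for p in prs),
        "defect_rate": sum(p["defect"] for p in prs) / len(prs),
    }


# Illustrative data only.
history = [
    {"merged": date(2024, 12, 2), "lead_time_h": 40, "defect": False},
    {"merged": date(2024, 12, 20), "lead_time_h": 30, "defect": True},
    {"merged": date(2025, 2, 3), "lead_time_h": 20, "defect": False},
    {"merged": date(2025, 2, 10), "lead_time_h": 26, "defect": False},
]

before, after = split_by_rollout(history)
baseline, post = summarize(before), summarize(after)
print(baseline)
print(post)
```

Two cautions apply when reading the result: `pr_count` is only comparable when both periods cover the same length of time, and a before/after difference shows correlation with the rollout, not proof that the tools caused it.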

Ignoring qualitative metrics. Developer satisfaction, perceived productivity, and tool preference are not captured in activity metrics but are critical for understanding adoption health. Run a brief monthly developer survey: "How much is AI tooling helping you in your daily work? (1-5)" The trend in this number is as important as any quantitative metric.
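Tracking the trend in that survey number takes very little code. A minimal sketch, assuming responses are collected as a mapping from month to a list of 1-5 scores (the input shape and the sample scores are illustrative):

```python
from statistics import mean


def monthly_trend(responses):
    """responses: {month: [1-5 scores]}.

    Returns (month, mean score, change vs. previous month) per month.
    """
    rows, prev = [], None
    for month in sorted(responses):
        avg = round(mean(responses[month]), 2)
        change = None if prev is None else round(avg - prev, 2)
        rows.append((month, avg, change))
        prev = avg
    return rows


# Illustrative data only.
survey = {
    "2025-01": [3, 4, 2, 3],
    "2025-02": [4, 4, 3, 3],
    "2025-03": [4, 5, 3, 4],
}

for row in monthly_trend(survey):
    print(row)
```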


How Different Roles See It

Bob, Head of Engineering

Bob has approved AI tool purchases for his team over the past year and is now being asked to present the ROI to the CTO. He pulls together license costs, queries a few developers about their experience, and realizes he has no consistent data to show. Some developers love the tools, some barely use them, and he has no way to quantify the impact on delivery.

What Bob should do: Bob should treat the absence of AI metrics as a technical debt item, not an analytics problem. He should assign an engineering manager to own AI metrics for the next quarter, with a specific goal: by the end of the quarter, the team has three AI-specific metrics tracked weekly (active usage rate, AI-labeled PR percentage, Copilot acceptance rate) and a simple dashboard that shows trends. Bob should also establish the expectation that AI tool usage is tracked - not to surveil developers but because the organization is making a significant investment and needs to understand its impact. The framing matters: metrics are how the organization learns, not how it judges individuals.


Sarah, Productivity Lead

Sarah is tasked with improving developer productivity and suspects that AI tools are having an impact - she just can't prove it. She's heard anecdotes from some developers about being dramatically more productive and from others about the tools being distracting. She needs data to separate signal from noise.

What Sarah should do: Sarah should run a 30-day AI measurement sprint. For 30 days, she asks every developer to do two things: (1) apply an ai-assisted label to every PR where AI was a significant contributor, and (2) fill out a two-question weekly survey (productivity rating, tool satisfaction rating). At the end of 30 days, Sarah compares: do developers who label PRs as AI-assisted have higher PR throughput? Is there a correlation between tool satisfaction and PR cycle time? This quick experiment produces the first real data the team has about AI impact and identifies which developers and use cases are getting the most value - which is the foundation for spreading those patterns.
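The two comparisons at the end of the sprint could be computed as follows. This is a sketch, not a prescribed analysis: the per-developer record fields and the sample values are hypothetical, and Pearson correlation is one reasonable choice of statistic among several.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return None
    return cov / (sx * sy)


def throughput_by_cohort(devs):
    """Mean PRs merged per developer, split by whether they labeled any PR ai-assisted."""
    users = [d["prs_merged"] for d in devs if d["ai_labeled"] > 0]
    others = [d["prs_merged"] for d in devs if d["ai_labeled"] == 0]
    return (mean(users) if users else None, mean(others) if others else None)


# Illustrative 30-day totals per developer.
devs = [
    {"dev": "a", "prs_merged": 12, "ai_labeled": 8, "satisfaction": 5, "cycle_h": 10},
    {"dev": "b", "prs_merged": 9, "ai_labeled": 3, "satisfaction": 4, "cycle_h": 14},
    {"dev": "c", "prs_merged": 7, "ai_labeled": 0, "satisfaction": 2, "cycle_h": 22},
    {"dev": "d", "prs_merged": 6, "ai_labeled": 0, "satisfaction": 3, "cycle_h": 20},
]

ai_users, non_users = throughput_by_cohort(devs)
r = pearson([d["satisfaction"] for d in devs], [d["cycle_h"] for d in devs])
print(ai_users, non_users, r)
```

With a handful of developers over 30 days, a result like this is suggestive rather than conclusive. Developers who choose to use AI heavily may differ from their peers in other ways, so treat the numbers as a prompt for follow-up questions.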


Victor, Staff Engineer - AI Champion

Victor runs sophisticated agent workflows and is convinced the productivity gains are real. But when he advocates for deeper AI investment in architecture reviews, people ask for data he doesn't have. His personal experience is compelling but not generalizable.

What Victor should do: Victor should build the AI metrics instrumentation himself as a platform contribution. He should write a simple GitHub Actions workflow that automatically tags PRs based on commit message patterns (agent-generated commits often have recognizable signatures), pulls Copilot usage data from the API, and publishes a weekly metrics digest to Slack. This takes an afternoon to build and immediately gives the whole team visibility into AI usage patterns. Victor should document the methodology clearly so others can understand what's being measured and why. The act of building the instrumentation is itself an advocacy tool - it makes the absence of AI metrics visible and provides the foundation for the L2 measurement work that follows.
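The tagging logic at the core of such a workflow might look like the sketch below. The regular expressions are assumptions about what agent-generated commits could look like (a co-author trailer or a "generated with" line). Inspect your own agents' commits and adjust them before trusting the output. Fetching commit messages and applying the label through the GitHub API is left out.

```python
import re

# Hypothetical signatures; verify against real commits from your own tools.
AGENT_PATTERNS = [
    re.compile(r"^co-authored-by:.*\b(claude|copilot|cursor)\b", re.IGNORECASE | re.MULTILINE),
    re.compile(r"generated with .*(claude code|copilot|cursor)", re.IGNORECASE),
]


def is_agent_commit(message):
    """True if the commit message matches any known agent signature."""
    return any(p.search(message) for p in AGENT_PATTERNS)


def label_for_pr(commit_messages, threshold=0.5):
    """Return 'agent-authored' when at least `threshold` of commits match, else None."""
    if not commit_messages:
        return None
    share = sum(is_agent_commit(m) for m in commit_messages) / len(commit_messages)
    return "agent-authored" if share >= threshold else None


# Illustrative commit messages only.
m1 = "Fix pagination bug\n\nCo-authored-by: Claude <noreply@example.com>"
m2 = "Refactor auth module"
m3 = "Add tests\n\nGenerated with Claude Code"

print(label_for_pr([m1, m2]))
print(label_for_pr([m2]))
```

Pattern matching will miss AI-assisted work that leaves no trace in the commit message, such as inline completions, so it complements the manual ai-assisted label instead of replacing it.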
