How to measure the impact of AI code:
Throughput, Quality and Autonomy
When the first coding agents launched last year, it was not clear whether they could handle real software engineering work. Now it’s becoming widely accepted that agents will write most of the code on large engineering teams — while engineers shift toward architecture, review, and building internal harnesses.
Engineering leaders now carry a mandate to adopt AI and figure out how to make it work well for their organizations, but lack clear visibility into what’s working, what’s breaking, and where to invest next. Without the right metrics, it’s easy to mistake activity for progress. To measure the real impact of coding agents you need to track AI-generated code through the entire SDLC — from prompt to production — and drive improvements in:
- Throughput
- Quality and durability
- Agent autonomy
- Agent efficiency
1. Measuring AI Code Throughput
It's more important than ever to understand how much work is actually shipping. It's not enough to count merged PRs — you have to weight each unit of work using a measure like story points, or use an LLM to score the scope of each PR.
The key discipline for the AI era: "done" means done — in production, serving customers, reliably. It's easier than ever to ship code, but if you can't keep the quality bar high up front, you'll spend the time you "saved" on rework. Make sure that rework is counted as part of the original task.
- Count when it works, not when it merges. Track throughput from the start of the task through the point where the changes have settled and are actually delivering value in production.
- Weight by the scope of the changes, not # of PRs. A small fix and a large feature shouldn't count the same. Size the work that lands so throughput reflects real output, not PR volume.
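Scope-weighted, production-counted throughput can be sketched as follows. The field names and the weighting scheme here are illustrative assumptions, not a Git AI API — the point is that only settled changes count, and rework is charged back to the original task:

```python
from dataclasses import dataclass

@dataclass
class ShippedChange:
    scope_points: float   # size of the change: story points or an LLM-scored estimate
    in_production: bool   # settled in production and delivering value, not just merged
    rework_points: float  # follow-up fixes attributed back to the original task

def weighted_throughput(changes: list[ShippedChange]) -> float:
    """Sum the scope of changes that actually landed, net of rework."""
    return sum(
        c.scope_points - c.rework_points
        for c in changes
        if c.in_production
    )

changes = [
    ShippedChange(1.0, True, 0.0),   # small fix, shipped cleanly
    ShippedChange(8.0, True, 2.0),   # large feature, some rework after "done"
    ShippedChange(3.0, False, 0.0),  # merged but not yet settled — doesn't count
]
print(weighted_throughput(changes))  # 7.0
```

Note that the large feature contributes 6 points, not 8: the rework it triggered is subtracted rather than counted as fresh output.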
2. Measuring the Quality and Durability of AI Changes
With accurate AI-attribution, every production incident, bug, rollback, and hotfix can be traced back to the exact agent session that wrote the code — and mined to drive continuous improvements to your prompts/skills, AI-review rules, and codebase context.
That same attribution also surfaces failure modes in AI-aided development that most teams can't see today:
- "Not really done". AI-generated code that keeps getting modified after the task was marked done — silent rewrites during the next feature, regressions patched the following week, quiet deletions. We've seen 30-day durability of AI code range from ~30% to ~85%. At the low end, most AI code is getting rewritten within a month of shipping — a signal that the team is merging code that does not hold up.
- Review-time churn. AI-generated code that gets heavily rewritten between PR open and merge. With AI writing most of the code, human review is the new bottleneck — and churn at this stage points to a harness or operator problem: thin codebase context, weak prompting, incomplete rules, or missing tests and other forms of automated verification.
- Session failure rate in production. The share of AI sessions whose shipped code caused an incident, rollback, or hotfix. It's Change Failure Rate applied to sessions instead of commits.
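Once attribution data exists, each of these failure modes reduces to a simple ratio. A minimal sketch — the counts are assumed inputs, not Git AI's actual data model:

```python
def durability_30d(lines_shipped: int, lines_surviving: int) -> float:
    """Share of AI-authored lines still intact 30 days after shipping."""
    return lines_surviving / lines_shipped if lines_shipped else 0.0

def review_churn(lines_at_pr_open: int, lines_rewritten_before_merge: int) -> float:
    """Share of AI-generated lines rewritten between PR open and merge."""
    return lines_rewritten_before_merge / lines_at_pr_open if lines_at_pr_open else 0.0

def session_failure_rate(sessions: int, failed_sessions: int) -> float:
    """Change Failure Rate applied to agent sessions: share whose shipped
    code caused an incident, rollback, or hotfix."""
    return failed_sessions / sessions if sessions else 0.0

print(durability_30d(1000, 310))     # 0.31 — the low end: most code rewritten within a month
print(session_failure_rate(200, 14)) # 0.07
```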
AI-attribution lets you track the impact of coding agents through the entire SDLC and create a feedback loop: every incident and every major rework of agent contributions becomes a data point that platform teams can use to add guardrails, improve context, and build a better automated software factory.
3. Measuring Agent Autonomy
Imagine two agent sessions working on the same bug fix. Agent 1 pulls context from your issue tracker and produces a patch that gets reviewed and merged into production. Agent 2 gets to the same place — but only after a lot of steering and interruptions.
A straight line from intent to production
- → Agent pulls context from the issue tracker
- → Writes failing tests for the bug
- → Corrects the logic without introducing regressions
- → Opens a PR with a clear accounting of cause and fix
- → Developer reviews and approves
A loop of steering, rewrites, and regressions
- ↯ Engineer kicks things off locally
- ↯ Agent struggles to reproduce — repo documentation is thin
- ↯ Developer steers it toward the right part of the codebase
- ↯ It takes the agent a few tries to fix the issue correctly
- ↯ PR opens; reviewer spots a missed edge case
- ↯ Engineer re-prompts the agent, rewrites most of the patch
- ↯ Ships. Customer reports a regression
- ↯ A manual hotfix goes out the next morning
Both sessions might show 100% AI-authored code. Both might have strong acceptance rates. Both might get done 5x faster than the same work a year ago. But one of them is effective and autonomous — freeing up engineers to work on other issues — while the other is a time suck. What separates them is the level of autonomy of the software factory.
Agent autonomy is driven by three things:
- Prompt quality. There are wide gaps between the best and average prompters on any given team. Better prompts produce straighter paths from intent to working code, with fewer steering corrections and fewer abandoned branches. This is a skill that can be measured, modeled, and taught.
- Codebase context. Agents get lost in sparse codebases. The more signal available — documentation, agent skills, architectural context, examples — the less time the agent spends thrashing. Teams that invest in making their codebases agent-ready see dramatically better results.
- Automated verification. Agents that can run tests and verify their own outputs stay autonomous for longer and solve problems with less ongoing input. They catch regressions before committing, not after deploying.
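One way to make autonomy measurable is to count human interventions per session and reward sessions that ship with little steering. The scoring below is a hypothetical illustration under assumed inputs, not a metric Git AI prescribes:

```python
from dataclasses import dataclass

@dataclass
class AgentSession:
    steering_messages: int  # human prompts after the initial task prompt
    restarts: int           # abandoned approaches re-prompted from scratch
    shipped: bool           # whether the session's code reached production

def autonomy_score(s: AgentSession) -> float:
    """1.0 = shipped with zero steering; decays with each intervention.
    Restarts are weighted more heavily than mid-course corrections."""
    if not s.shipped:
        return 0.0
    interventions = s.steering_messages + 2 * s.restarts
    return 1.0 / (1 + interventions)

straight_line = AgentSession(steering_messages=0, restarts=0, shipped=True)
steering_loop = AgentSession(steering_messages=4, restarts=1, shipped=True)
print(autonomy_score(straight_line))  # 1.0
print(autonomy_score(steering_loop))  # ~0.14
```

Both sessions in the example shipped — but the score separates the straight line from the loop, which raw acceptance rates cannot.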
4. Measuring Agent Efficiency
When AI code is tracked from prompt all the way to production, it is possible to measure how efficiently the agent gets from intent to working code. There can be large differences between engineers, teams, and even repositories at the same company. In one repo, 50 lines might be generated for every 1 that makes it to production; on teams that have invested in making their work agent-ready, the ratio might be closer to 4:1. Tokens are expensive — and inefficient sessions burn through them fast. The gap between a 4:1 and a 50:1 ratio isn't just wasted time, it's wasted money.
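The ratio itself is trivial to compute per session, per engineer, or per repo — here is a sketch with assumed line counts:

```python
def generation_ratio(lines_generated: int, lines_in_production: int) -> float:
    """Lines the agent generated per line that survived to production.
    Lower is better: ~4:1 suggests an agent-ready repo, ~50:1 suggests thrash."""
    if lines_in_production == 0:
        return float("inf")  # nothing shipped — pure throwaway work
    return lines_generated / lines_in_production

print(generation_ratio(400, 100))   # 4.0 — context-rich, few missteps
print(generation_ratio(5000, 100))  # 50.0 — heavy in-session regeneration
```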
- Context-rich, few missteps. The agent figures things out fast — most of what it commits makes it through review and into production.
- Thrash inside the session. Regeneration, backtracking, and abandoned approaches — most of what the agent generates never makes it into a commit.
Of course, some sessions and tasks are genuinely harder — an agent working through a complex migration or an unfamiliar part of the codebase will naturally generate more throwaway code than one fixing a well-understood bug. These ratios are much more useful in aggregate. If you're seeing one codebase with consistently worse agent efficiency than another repo, that's a signal worth paying attention to — and potentially worth investing in better context, documentation, or test coverage to close the gap.
Both kinds of teams can ship code that's >95% AI-generated — but the thrash-heavy team spends far more tokens, time, and engineer attention getting there. The real leverage isn't a cheaper model or a smarter agent; it's building the right environment for your agents to work in, shortening the path from intent to working code.
The teams that adapt and learn to integrate AI this year will be the ones that treat AI coding as a system to be tuned: measuring throughput honestly, tracking quality and durability through production, and working to continuously improve agent autonomy. Git AI gives you visibility into coding agents and the code they generate — a compass to help you walk in the right direction.
Git AI: The open source standard for tracking
AI-code from prompt to production.
curl -sSL https://usegitai.com/install.sh | bash