Tests written manually, coverage < 40%

The baseline testing state at L1 - manual test writing, chronic under-coverage, and the compounding debt that makes AI-generated code increasingly risky to ship.

· An automated test suite exists and runs
· The team writes and maintains its own tests

· Team is aware of flaky test impact (16% of dev time per Google data)
· AI-generated tests are reviewed for circular testing (testing what code does, not what it should do)

Evidence

· Coverage report from the existing test suite
· Test authorship in git history (manual, no agent attribution)

What It Is

At Level 1 (Ad-hoc), all tests are written by hand, one at a time, by developers who are also responsible for the code they're testing. Coverage is typically below 40% - not because developers don't know better, but because writing tests is slow, repetitive, and routinely deprioritized when deadlines approach. The result is a codebase where more than half the logic runs in production with no automated verification at all.

This isn't a discipline problem. It's a structural one. Manual test writing requires a developer to hold two mental models simultaneously: what the code is supposed to do, and how to express that in test code. Under deadline pressure, the second model is the first casualty. Tests get written for the happy path, edge cases get a comment that says "TODO: test this," and that comment stays in the codebase for years.

The sub-40% threshold is significant because it marks the point where the test suite provides the illusion of coverage rather than actual safety. A 38% coverage number sounds like meaningful progress until you realize that the covered 38% is often concentrated in trivial utility functions, while the business-critical payment logic, state machines, and integration paths have nothing guarding them.
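
To make that arithmetic concrete, here is a minimal Python sketch. The module names and line counts are invented for illustration and do not come from any real codebase; the point is only that an aggregate of 38% can coexist with zero coverage on the critical paths.

```python
# Hypothetical per-module line counts: (covered_lines, total_lines).
# These numbers are made up for illustration.
modules = {
    "utils/strings.py": (330, 350),
    "utils/dates.py": (240, 250),
    "payments/charge.py": (0, 500),
    "orders/state_machine.py": (0, 300),
    "integrations/webhooks.py": (0, 100),
}


def aggregate_coverage(mods):
    """Overall line coverage in percent across all modules."""
    covered = sum(c for c, _ in mods.values())
    total = sum(t for _, t in mods.values())
    return 100.0 * covered / total


def uncovered_modules(mods):
    """Names of modules with no covered lines at all, sorted alphabetically."""
    return sorted(name for name, (c, _) in mods.items() if c == 0)


print(f"aggregate: {aggregate_coverage(modules):.1f}%")
print("no coverage:", uncovered_modules(modules))
```

With these numbers the aggregate is 38.0%, yet every line that counts comes from the two utility modules, and the payment, state machine, and integration modules are entirely unguarded.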

At L1, this situation becomes a compounding problem as soon as AI agents enter the picture. AI-generated code is fast but needs verification. Without tests, you can't verify AI output at scale. The lack of coverage that was manageable when humans wrote code at human speed becomes a critical liability when an agent can produce 500 lines of unverified logic in minutes.

Why It Matters

Low test coverage at L1 is not just a quality issue - it's a rate limiter on every subsequent maturity level:

AI verification gap - Without tests, you cannot safely accept AI-generated code. Every PR becomes a manual review exercise, erasing the speed gains from generation.
Refactoring paralysis - Developers are afraid to touch legacy code because there's no safety net. Technical debt compounds and architecture degrades over time.
Hidden regressions - Changes that break existing behavior go undetected until production. Fixing a bug found in production is commonly estimated to cost several times more than catching it in CI; figures of 5-10x are often quoted.
False velocity - Teams feel productive (tickets are closing) but quality is silently degrading. The debt surfaces as a production incident, not a planning item.
Blocker for automation - Automated merge decisions, incremental test selection, and self-healing test suites (L4-L5) are impossible without a reliable, comprehensive test foundation.

The path out of L1 is not to write more manual tests - it's to introduce AI-assisted test generation (L2) while simultaneously tracking coverage as a first-class metric. The goal at L2 is not perfection but momentum: every PR either maintains coverage or increases it.
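
One way to express the "maintain or increase" rule is a small gate that compares a PR's coverage with the baseline on the main branch. This is a sketch under stated assumptions, not a prescribed implementation: the function name, the `tolerance` parameter, and the idea of passing in two percentages are all hypothetical, and many CI systems and coverage services offer their own equivalent.

```python
from typing import Tuple


def coverage_gate(
    baseline_pct: float, current_pct: float, tolerance: float = 0.0
) -> Tuple[bool, str]:
    """Return (passed, message).

    The gate passes when coverage has not dropped by more than `tolerance`
    percentage points relative to the baseline. `tolerance` is a hypothetical
    knob for absorbing rounding noise; it defaults to zero.
    """
    delta = current_pct - baseline_pct
    passed = delta >= -tolerance
    status = "OK" if passed else "FAIL"
    message = (
        f"{status}: coverage {baseline_pct:.1f}% -> {current_pct:.1f}% "
        f"({delta:+.1f} points)"
    )
    return passed, message


if __name__ == "__main__":
    ok, msg = coverage_gate(baseline_pct=38.0, current_pct=37.5)
    print(msg)
    # A CI job could exit non-zero here to block the merge.
    raise SystemExit(0 if ok else 1)
```

The gate deliberately has no absolute target. It only stops the number from going backwards, which matches the goal of momentum rather than perfection.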

Tip

Start by measuring what you actually have. Run a coverage report against your main branch and find the five highest-risk files with the lowest coverage. Make those the first targets for AI-generated tests at L2. You don't need to boil the ocean - you need a beachhead.
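
A rough sketch of how that shortlist could be produced, assuming you already have a per-file coverage percentage and a risk score. The `risk` field is a hypothetical 1-5 rating assigned by the team (5 = handles money or authentication); it is not something a coverage tool reports, and the file names are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStat:
    path: str
    percent_covered: float  # 0-100, taken from the coverage report
    risk: int  # hypothetical team-assigned score, 1 (low) to 5 (high)


def first_targets(stats: List[FileStat], limit: int = 5) -> List[FileStat]:
    """Highest risk first; among equal risk, lowest coverage first."""
    ranked = sorted(stats, key=lambda s: (-s.risk, s.percent_covered))
    return ranked[:limit]


# Invented example data.
example = [
    FileStat("utils/strings.py", 95.0, 1),
    FileStat("payments/charge.py", 12.0, 5),
    FileStat("auth/session.py", 40.0, 5),
    FileStat("orders/state_machine.py", 0.0, 4),
    FileStat("reports/export.py", 10.0, 2),
    FileStat("integrations/webhooks.py", 5.0, 4),
    FileStat("ui/theme.py", 0.0, 1),
]

for stat in first_targets(example):
    print(f"{stat.path}: risk {stat.risk}, {stat.percent_covered:.0f}% covered")
```

Sorting on risk before coverage is a design choice: a completely untested theme file ranks below a partially tested payment module, because the cost of a failure there is lower.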

Getting Started

Establish a coverage baseline - Run your coverage tool (Istanbul/nyc for JavaScript, pytest-cov for Python, JaCoCo for Java) and commit the numbers. Don't set targets yet - just make the current state visible. You can't improve what you can't measure.
Identify the highest-risk coverage gaps - Coverage by percentage is misleading. Sort by risk: what code handles money, authentication, data integrity, or external integrations? A 20% coverage number in your payment processor is more dangerous than 0% in a utility library.
Add coverage reporting to CI - Make coverage reporting a required CI step so the number is visible on every PR. Even without a threshold, visibility creates accountability.
Start with one component - Pick one service or module and commit to bringing it to 60% coverage before moving on. Trying to raise coverage everywhere at once produces motion without progress.
Require tests with new logic - For every PR that introduces new logic, require that tests come with it. This won't fix the existing debt, but it stops the bleeding.
Document your test-writing conventions - Even at L1, write down what "a good test" looks like in your codebase. This becomes the foundation for the CLAUDE.md context injection at L2.
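
The first two steps above can be sketched as a script that reduces a coverage report to a small baseline worth committing. The sketch assumes the JSON layout written by coverage.py's `coverage json` command (a top-level `totals` object plus a per-file `summary`); the `RISKY_PREFIXES` tuple, the function name, and the sample report are hypothetical and would be replaced by whatever your team considers high-risk.

```python
import json

# Hypothetical path prefixes for code that handles money or authentication.
RISKY_PREFIXES = ("payments/", "auth/")


def build_baseline(report, risky_prefixes=RISKY_PREFIXES):
    """Reduce a parsed coverage report to the numbers worth committing."""
    risky = {
        path: round(data["summary"]["percent_covered"], 1)
        for path, data in sorted(report["files"].items())
        if path.startswith(risky_prefixes)
    }
    return {
        "total_percent": round(report["totals"]["percent_covered"], 1),
        "risky_files": risky,
    }


# Invented sample in the assumed layout; a real report has many more fields.
SAMPLE_REPORT = json.loads("""
{
  "totals": {"percent_covered": 37.5},
  "files": {
    "payments/charge.py": {"summary": {"percent_covered": 20.0}},
    "utils/strings.py": {"summary": {"percent_covered": 96.0}},
    "auth/session.py": {"summary": {"percent_covered": 0.0}}
  }
}
""")

print(json.dumps(build_baseline(SAMPLE_REPORT), indent=2))
```

In practice the dictionary would come from loading the report file your coverage tool writes instead of the inline sample. Keeping the risky files next to the total makes it harder for a healthy-looking aggregate to hide a 20% payment module.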

Common Pitfalls

Chasing the coverage number, not the coverage. Teams that optimize for the percentage metric learn to game it: test every getter and setter, skip the complex business logic. A 70% coverage number full of trivial assertions is worse than a 40% number with focused tests on critical paths - it gives false confidence without providing real safety.

Retrofitting tests onto untested legacy code. Legacy code is often structured in ways that resist testing: tight coupling, hidden dependencies, global state. Don't start by testing legacy code - start by testing new code, and address legacy debt through the strangler fig pattern or targeted refactoring when you must touch it.

Treating test writing as a separate task. At L1, developers often finish a feature and then "add tests" as a follow-up item. Tests written after the code is working tend to be superficial - they verify the code you wrote, not the behavior you intended. The fix is writing test cases before or during implementation, not after.

Ignoring test quality in code review. Coverage goes up but trust goes down when test quality isn't reviewed. A test that asserts expect(result).toBeDefined() increases coverage without providing any safety. Code review should evaluate test assertions with the same rigor as the implementation.
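
The difference is easy to see side by side. The sketch below uses Python rather than the JavaScript-style assertion quoted above, and `apply_discount` is a hypothetical function invented for the example. Both tests execute the discount calculation and so earn coverage for it, but only the second would fail if the calculation were wrong.

```python
def apply_discount(total_cents: int, percent: int) -> int:
    """Hypothetical function under test: price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_cents - (total_cents * percent) // 100


def test_discount_weak():
    # Runs the code, so the lines count as covered, but this passes for
    # almost any return value, including a wrong one.
    assert apply_discount(1000, 10) is not None


def test_discount_behavior():
    # Pins down what the function should do, including edge cases.
    assert apply_discount(1000, 10) == 900
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 100) == 0
    # Invalid input must be rejected, not silently accepted.
    try:
        apply_discount(1000, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for percent > 100")
```

A reviewer looking only at the coverage delta cannot tell these two tests apart, which is why the assertions themselves need to be read in review.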

How Different Roles See It

Bob, Head of Engineering

Bob's team has been shipping for two years without a systematic testing conversation. Coverage is estimated at around 35% - but it's never been formally measured. A recent production incident (a refactor broke an integration that had no tests) cost the team a full day of incident response and a difficult conversation with a customer.

What Bob should do: Use the incident as a catalyst, not a blame exercise. The first action is measurement: run coverage reporting across all services and make the results visible to the whole team. Then establish a simple policy: coverage cannot decrease on any PR. This doesn't fix the existing debt, but it stops the accumulation. Bob should also start the conversation about AI-assisted test generation as a way to climb out of the debt hole without burning developer time - it's the core proposition for moving to L2.

Sarah, Productivity Lead

Sarah is trying to make the case for expanding AI tooling to include test generation, but her stakeholders keep asking: "If developers aren't writing tests now, why will AI tools fix that? Won't they just generate bad tests?" She doesn't have a great answer yet.

What Sarah should do: The stakeholder concern is legitimate and deserves a direct answer. The reason coverage from manual test writing stays below 40% isn't laziness - it's friction. Writing a test manually requires context-switching, boilerplate, and cognitive overhead that developers deprioritize under pressure. AI test generation removes the friction: the developer writes the implementation, the agent generates the test scaffolding, and the developer's job becomes reviewing rather than authoring. Sarah should frame the investment not as "AI writes our tests" but as "AI removes the friction that causes us to skip tests."

Victor, Staff Engineer - AI Champion

Victor has 85% coverage on his services because he practices TDD religiously. He's frustrated watching the rest of the team ship undertested code that he eventually gets paged about at 2am. He has asked for tests in code review many times and been told "we'll add them later."

What Victor should do: Victor's TDD practice is proof that high coverage is achievable - but preaching TDD to a team under deadline pressure is ineffective. His leverage is in tooling, not culture. Victor should set up AI-assisted test generation for the team's most common test patterns, lower the friction to near-zero, and demonstrate on one PR what it looks like when the AI generates the test scaffolding. Making test generation fast is more effective than making test writing mandatory.
