Can Coding Agents Become Engineers? We Are Finding Out

If top engineers are not spending most of their time writing code, then our AI agents should not be judged only by code output either. Here is a new evaluation approach that measures how coding agents investigate, validate, and improve real software systems, much like junior engineers do.


Engineers Are Doing Less Coding. Agents Should Be Measured The Same Way

Executives at major tech companies have said it out loud. Many top software engineers are not primarily writing code anymore. Their value is in understanding complex systems, tracing behavior, validating hypotheses, and explaining what they find so teams can act with confidence.

If that is what human engineers are doing, then coding agents should be evaluated like engineers, not just code generators. We should ask whether an agent can investigate a running system, gather runtime evidence, and produce a grounded explanation. That is the bar that helps us answer a bigger question: can coding agents become engineers in real workflows, not just on toy problems?

In this post, I share a practical evaluation approach that I have been building and testing. It measures how agents understand, validate, and improve real software systems inside real repositories. It focuses on investigative and maintenance work, the part of engineering that happens before and after code changes.

A Three-Part Evaluation That Mirrors Real Work

To evaluate agents like junior engineers, we need tasks that look like what junior engineers actually do. That means learning unfamiliar codebases, writing tests to lock behavior, and refactoring for readability and maintainability without breaking things.

Here are the three evaluations I use to cover that surface area:

  • Codebase QnA. Understand complex codebases through runtime analysis and multi-file reasoning. The agent cannot just guess. It must run the system, inspect logs, trace calls, and explain how things work with supporting evidence.
  • Test Writing. Write targeted, meaningful tests that exercise real functionality and increase code coverage. The goal is not raw coverage. It is whether tests reflect real behavior, catch regressions, and provide guardrails for future changes.
  • Refactoring. Restructure code to improve readability and maintainability while preserving behavior. The agent must keep the system working, communicate why the refactor helps, and demonstrate that nothing broke.
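The Test Writing bar above is worth making concrete: a meaningful test locks real behavior so regressions fail loudly, while a coverage-only test merely executes lines. A minimal sketch in Python, using a hypothetical `slugify` helper invented for illustration:

```python
import re

def slugify(title: str) -> str:
    # Hypothetical helper under test: lowercase, collapse runs of
    # non-alphanumerics into single hyphens, and trim the ends.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# A coverage-only test executes the code but asserts almost nothing:
def test_slugify_runs():
    slugify("Hello World")  # passes even if the output is wrong

# A behavior-locking test pins down the contract, so a refactor that
# silently changes edge-case handling fails loudly:
def test_slugify_locks_behavior():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  --Already--Slugged--  ") == "already-slugged"
    assert slugify("") == ""
```

Both tests count toward raw coverage; only the second one is a guardrail for future changes.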

Codebase QnA is available now, with Test Writing and Refactoring following soon. Together, these evaluations make it possible to judge agents on the same investigative and maintenance workflows that define real engineering.

Why Build On Real Repositories And Real Environments

Most coding benchmarks focus on the correctness of a final change. That is useful, but it misses the work around the change. Engineers spend substantial time figuring out how a system behaves, validating their assumptions, and explaining their reasoning to teammates. Good agents should do the same.

That is why these evaluations run agents inside reproducible environments built from real software repos. Agents can inspect code, run commands, execute the system, and interact with standard developer tools. We then focus scoring on how the agent investigates behavior, gathers evidence, and grounds its explanations in runtime reality.

This approach pushes measurement toward interaction with a working codebase, not just a clean interface and a static output. It captures whether agents can operate as system collaborators that think with the codebase, not just produce snippets.

What A Codebase QnA Task Looks Like

Codebase QnA tasks reflect the kinds of questions engineers ask when they join or debug a real system. The agent needs to get oriented, trace flows, validate what it thinks is happening, and then explain it clearly.

Example onboarding prompt: "When I run 'kitten @ ls' from another terminal, how does that command reach the running 'kitty' instance and get processed?"

Answering that requires tracing Unix socket communication, IPC framing, and command dispatch across both C and Python. The agent then needs to run the system and check that its explanation matches observed behavior.
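To make the tracing target concrete, here is a generic sketch of that IPC shape in Python: a short-lived client sends a framed JSON command over a Unix domain socket, and a long-running process dispatches it and replies. The socket path, framing, and command names below are illustrative, not kitty's actual protocol.

```python
import json, os, socket, tempfile, threading

# Illustrative socket path; kitty negotiates its own via --listen-on.
SOCK = os.path.join(tempfile.mkdtemp(), "demo.sock")

# The long-running "instance": bind and listen before the client connects.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(SOCK)
srv.listen(1)

def serve_one():
    conn, _ = srv.accept()
    cmd = json.loads(conn.recv(4096).decode())  # read one command frame
    if cmd["cmd"] == "ls":                      # command dispatch
        conn.sendall(json.dumps({"windows": ["w1", "w2"]}).encode())
    conn.close()

t = threading.Thread(target=serve_one)
t.start()

# The "kitten" side: connect, send the command, read the reply.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(SOCK)
cli.sendall(json.dumps({"cmd": "ls"}).encode())
reply = json.loads(cli.recv(4096).decode())
cli.close()
t.join()
srv.close()
print(reply)  # {'windows': ['w1', 'w2']}
```

An agent that grounds its answer would run the real system, observe this round trip in logs or strace output, and cite the actual files that implement each side.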

Tasks cover multiple investigation types, including:

  • Architecture and system design. Map components, boundaries, and interactions.
  • Root cause analysis. Explain unexpected behavior and identify the minimal failing path.
  • Onboarding. Understand unfamiliar systems and summarize how key features work.
  • Security reasoning. Trace boundaries, permissions, and data flows.
  • API or library integration. Explain how external dependencies are wired into the system.

The dataset mirrors real engineering conditions. Tasks are net new, authored by experienced engineers and technical experts inside open-source repositories. They draw from production-grade systems like terminal emulators, mail servers, and object storage platforms. They span multiple architectures and languages, including Go, Python, C, and TypeScript.

Each task passes multi-stage review and is evaluated with structured rubrics that score whether an agent's explanation reflects how the system actually works. That guards against hand-waving and rewards evidence-backed reasoning.
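A structured rubric can be as simple as a list of named claims a correct explanation must establish, each graded pass/fail. A minimal sketch, with a hypothetical schema and grader output (the item wording below is invented for illustration):

```python
from dataclasses import dataclass

# Hypothetical shape of one rubric item: a claim the agent's
# explanation must get right, and whether runtime artifacts are
# required to back it up.
@dataclass
class RubricItem:
    claim: str
    evidence_required: bool

def score_explanation(items, satisfied):
    """Fraction of rubric items the grader marked as satisfied."""
    hits = sum(1 for item in items if satisfied.get(item.claim, False))
    return hits / len(items)

rubric = [
    RubricItem("identifies the Unix socket as the transport", True),
    RubricItem("traces dispatch from socket read to handler", True),
    RubricItem("cites the exact files involved", False),
]
grades = {
    "identifies the Unix socket as the transport": True,
    "traces dispatch from socket read to handler": True,
    "cites the exact files involved": False,
}
print(score_explanation(rubric, grades))  # 2 of 3 items satisfied
```

Keeping each item binary makes disagreements between graders easy to localize to a single claim.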

How The Measurement Works

Agents operate in sandboxed environments so evaluation is reproducible and safe. They have access to standard developer tools to inspect code, run commands, and execute the system itself. To keep interactions consistent, I use a scaffold that standardizes how the agent reads files, runs processes, logs findings, and produces final explanations.
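To show what "standardizes how the agent reads files, runs processes, logs findings" can mean in practice, here is a hypothetical thin tool layer. The class name and methods are my own sketch, not the actual scaffold:

```python
import subprocess
from pathlib import Path

class Scaffold:
    """Uniform tool layer: every read, command, and finding goes
    through one choke point, so the transcript is complete and
    auditable by the rubric."""

    def __init__(self, repo_root):
        self.root = Path(repo_root)
        self.findings = []  # evidence log the grader can inspect

    def read_file(self, rel_path):
        return (self.root / rel_path).read_text()

    def run(self, *argv, timeout=60):
        # Record exactly what was executed and what came back.
        proc = subprocess.run(argv, cwd=self.root, capture_output=True,
                              text=True, timeout=timeout)
        self.findings.append({"cmd": argv, "stdout": proc.stdout,
                              "returncode": proc.returncode})
        return proc

    def explain(self, text):
        # Final answer, packaged with the evidence that grounds it.
        return {"explanation": text, "evidence": self.findings}
```

Because all tool use funnels through `run`, programmatic checks later on can verify from `findings` that the agent actually executed the system rather than describing it from memory.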

I also tested a few models with their native shells and harnesses to check whether measurement was sensitive to the interaction style. That helped validate the rubric across different agent entry points and tool usage patterns.

Scoring combines programmatic checks with expert defined rubrics. Programmatic checks verify that the agent actually ran the system, touched the right files, or executed specific commands. Rubrics judge the quality of the investigation and the grounding of the explanation against runtime evidence.
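A programmatic check of that kind can be expressed as assertions over the agent's transcript. A minimal sketch, with invented file paths and transcript fields (not the benchmark's actual format):

```python
def passes_checks(transcript, required_files, required_cmds):
    """Verify the agent actually exercised the system: it read the
    files that matter and ran the commands the task requires."""
    read = set(transcript.get("files_read", []))
    ran = [" ".join(c) for c in transcript.get("commands", [])]
    files_ok = all(f in read for f in required_files)
    cmds_ok = all(any(req in cmd for cmd in ran) for req in required_cmds)
    return files_ok and cmds_ok

# Illustrative transcript for the kitty question above.
transcript = {
    "files_read": ["kitty/remote_control.py", "kitty/child.c"],
    "commands": [["kitty", "--listen-on", "unix:/tmp/k.sock"],
                 ["kitten", "@", "ls"]],
}
print(passes_checks(transcript,
                    required_files=["kitty/remote_control.py"],
                    required_cmds=["kitten @ ls"]))  # True
```

Checks like this are cheap and objective, which is why they pair well with the more judgment-heavy rubric items.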

The primary metric is Task Resolve Rate. It is the share of tasks where every rubric item is satisfied. Secondary views break down performance by task type, language, and environment features. Results are tracked on a public leaderboard so the community can see progress over time.
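Since a task resolves only when every rubric item passes, the metric is strict by construction. A small sketch with illustrative data:

```python
def task_resolve_rate(results):
    """Share of tasks where every rubric item is satisfied.
    Each element of `results` is the per-item pass/fail list
    for one task."""
    resolved = sum(1 for rubric in results if all(rubric))
    return resolved / len(results)

results = [
    [True, True, True],    # resolved
    [True, False, True],   # one miss -> not resolved
    [True, True],          # resolved
    [False, False, True],  # not resolved
]
print(task_resolve_rate(results))  # 0.5
```

The all-or-nothing rule means a partially correct explanation scores the same as a wholly wrong one, which keeps the headline number honest; the secondary breakdowns recover the partial-credit picture.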

Why This Matters For Teams

If we want agents to contribute to real codebases, we need to know whether they can work inside messy systems. Most teams do not fail because they cannot generate code. They fail because they cannot build confidence in changes, manage complexity, and explain what is going on.

By evaluating agents on investigation and maintenance workflows, we raise the bar to something a team can trust. A stronger agent is one that can enter a repo, learn how a feature works, write tests that reflect reality, refactor safely, and leave behind a clear explanation others can follow.

This also changes how we deploy agents. Instead of single-shot code generation, we can move toward collaboration loops. Agents propose an investigation plan, gather runtime evidence, draft a hypothesis, and then act. Humans review the plan and outputs, then guide the next step. That is a workflow that fits how engineering actually happens.
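That loop has a simple shape in code. A sketch with trivial stand-in agent and reviewer objects, invented here purely to show where the human gates sit:

```python
class StubAgent:
    def propose_plan(self, task):     return f"plan for {task}"
    def gather_evidence(self, plan):  return ["log line", "trace"]
    def draft_hypothesis(self, ev):   return f"hypothesis from {len(ev)} artifacts"
    def act(self, hypothesis):        return f"acted on: {hypothesis}"

class StubReviewer:
    def approve(self, item):          return True  # a human would decide

def collaboration_loop(agent, reviewer, task, max_rounds=3):
    for _ in range(max_rounds):
        plan = agent.propose_plan(task)           # 1. propose a plan
        if not reviewer.approve(plan):
            continue                              # human sends it back
        evidence = agent.gather_evidence(plan)    # 2. run, inspect, log
        hypothesis = agent.draft_hypothesis(evidence)  # 3. hypothesize
        if reviewer.approve(hypothesis):
            return agent.act(hypothesis)          # 4. act, gated by review
    return None  # escalate to a human after repeated rejections

print(collaboration_loop(StubAgent(), StubReviewer(), "trace IPC"))
```

The point of the structure is that the agent never acts without first surfacing a reviewable plan and evidence trail.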

Early Learnings From Running Agents In Real Systems

Running agents inside real repos surfaces different strengths and weaknesses than static benchmarks. Some models can write plausible explanations but struggle to ground them in logs or traces. Others are good at shell interactions but do not connect findings to a coherent narrative.

Two patterns stand out:

  • Runtime grounding is the bottleneck. Asking an agent to prove its claims forces it to run the system, collect artifacts, and link evidence to conclusions. This is where performance diverges from pure text generation benchmarks.
  • Multi-file reasoning matters. Many behaviors are spread across languages, layers, and processes. Agents that can track cross-file references and stitch them back into a single explanation do much better.

On top of that, clear communication is a skill. A messy but correct shell trace is less useful than a structured explanation that cites exact paths, functions, and commands. The rubric rewards that clarity so teams can act on the output without guessing.

From Investigation To Collaboration

The next phase of AI coding reflects a simple reality. The value is not in producing lines of code. It is in modifying software while reasoning about complex environments. Agents that can investigate, validate, and maintain systems are closer to how engineers operate every day.

This evaluation approach begins to map those workflows. As the benchmarks expand and more of the ecosystem participates, the definition of a capable coding agent will move from code generator to system collaborator. That is where the real leverage shows up for teams and companies.

I will be sharing updates on the Test Writing and Refactoring evaluations, including how they interact with Codebase QnA and how the three pieces work together to measure end-to-end workflows. The goal is simple. Measure what matters for real engineering, then improve agents against that bar.

What Comes Next

There are plenty of open questions. How do we balance autonomy and oversight in collaborative loops? What is the right mix of programmatic checks and expert rubrics for different task types? How do we standardize scaffolds without constraining agent creativity?

That said, the path is becoming clearer. Put agents inside real systems. Make them run, inspect, and explain. Score them on evidence-grounded reasoning, test quality, and safe maintenance. Then use those results to drive both model improvements and workflow design.

If you are experimenting with coding agents in your org, consider starting with investigation tasks before you let agents make changes. Ask for a plan, ask for evidence, and ask for a written explanation. It is a small shift that yields big gains in trust and consistency.

Summary and Key Takeaways

  • Evaluate agents like engineers. Measure investigation, validation, and explanation, not just code output.
  • Use real repos and environments. Ground agent reasoning in runtime evidence, logs, and traces.
  • Cover three workflows. Codebase QnA for understanding, Test Writing for guardrails, Refactoring for safe maintenance.
  • Adopt clear metrics. Task Resolve Rate combines programmatic checks and expert rubrics to score end-to-end performance.
  • Move toward collaboration. Agents should act as system collaborators, proposing plans, gathering evidence, and communicating clearly.

Coding agents will not replace engineers overnight, but they can start to do more of the work engineers actually do. With better evaluations and grounded workflows, we can find out how close they are and where to push next.

Tags: AI, coding agents, software engineering, benchmarks, LLM

Written by

Tharun P Karun

Full-Stack Engineer & AI Enthusiast. Writing tutorials, reviews, and lessons learned.

โ† Back to all posts
Published March 5, 2026