
Most engineering organizations are discovering that AI agents do not make bad engineering systems better. They make them worse, faster. The flaky tests, undocumented tooling, and mysterious build failures that developers learned to work around become blocking problems the moment an agent hits them, and the blame lands on AI rather than on the underlying platform issues that were always there. 

Mike Swafford, Vice President of Software Engineering at Microsoft, has spent the past year moving some of the largest repositories in the industry from vibe coding into full agentic development. What he learned runs counter to the dominant narrative about where the real work lives. “Repos need to become agent-ready in the same way they once needed to become cloud-ready,” Swafford states. “Stable tools, explicit build and test instructions, discoverability, and reproducible environments matter more than most teams think.”

Determinism Beats Model Quality

The instinct when agentic development underperforms is to reach for a better model. The actual problem, in almost every case, is the repository itself. AI agents amplify whatever weaknesses already exist in an engineering system. A surprising amount of the work that actually unlocks agent productivity is not about AI at all; it is about repo instruction and tooling curation: cleaning up the build environment, documenting what was previously implicit, and eliminating the failure modes that experienced engineers navigate through muscle memory.
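As a rough illustration of what "agent-ready" can mean in practice, the sketch below scans a repository for written-down build instructions, an environment definition, and pinned dependencies. The checklist and every file name in it are assumptions chosen for illustration; they are not a standard and not a description of Microsoft's tooling.

```python
from pathlib import Path

# Hypothetical checklist. The file names are illustrative assumptions only;
# a real team would substitute whatever its own repositories use.
READINESS_CHECKS = {
    "build and test instructions": ["AGENTS.md", "CONTRIBUTING.md", "README.md"],
    "reproducible environment": [".devcontainer/devcontainer.json", "Dockerfile"],
    "pinned dependencies": ["requirements.txt", "package-lock.json", "Cargo.lock"],
}


def check_repo(root):
    """Map each check name to the first candidate file found, or None."""
    root = Path(root)
    results = {}
    for name, candidates in READINESS_CHECKS.items():
        results[name] = next(
            (c for c in candidates if (root / c).is_file()), None
        )
    return results


def missing_checks(root):
    """List the check names for which no candidate file exists."""
    return [name for name, found in check_repo(root).items() if found is None]
```

Calling `missing_checks(".")` from a repository root returns the items that still live only in engineers' heads. File presence is a weak proxy: it says nothing about whether the instructions are accurate, only that something was written down.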

The organizational dividend from this work extends well beyond agent performance. The same changes that make a repository agent-ready make it significantly better for human engineers. Discoverability, reproducible environments, and stable tooling are not AI-specific requirements; they are the engineering hygiene that large codebases tend to deprioritize as they grow. Agentic development creates a forcing function to address the debt that teams have been deferring for years.

Specificity Is the Key to Agent Output

AI is effective at well-defined tasks. It does not have the judgment of a seasoned engineer, and expecting it to infer what matters from context produces results that reflect that gap. The organizations getting the most value from agents have done the upfront work of writing down the important things (objective success criteria, coding guidelines, architecture structure, validation frameworks, and repo-specific review instructions) and have built agents that consistently enforce those standards.

Swafford’s framing is “measure twice, code once.” The more preparation done upfront, the more specific the guardrails, the more valuable the agent output becomes. For important decisions, having multiple agents produce implementations and selecting the best elements from each is not redundancy; it is quality control. The non-negotiables around design, security, and architecture need to be explicit before development begins, because agents will optimize for whatever criteria they are given and ignore the ones that were never articulated.
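One way to make non-negotiables explicit is to encode them as named checks that every candidate implementation must pass before it is compared with the others. The sketch below is a minimal version of that idea under stated assumptions: the `Candidate` type, the check names, and the numeric `score` are all hypothetical, and it picks a single winner rather than combining the best elements of several implementations as the article describes.

```python
from dataclasses import dataclass

# Hypothetical non-negotiables; a real list would come from the team's own
# design, security, and architecture requirements.
NON_NEGOTIABLES = ("tests_pass", "security_scan_clean", "follows_architecture")


@dataclass
class Candidate:
    agent: str          # which agent produced this implementation
    checks: dict        # check name -> bool, filled in by external tooling
    score: float = 0.0  # assumed quality score, e.g. from human review


def violations(candidate):
    """Return the non-negotiables this candidate fails or never reported."""
    return [n for n in NON_NEGOTIABLES if not candidate.checks.get(n, False)]


def select(candidates):
    """Pick the highest-scoring candidate that violates nothing, or None."""
    eligible = [c for c in candidates if not violations(c)]
    return max(eligible, key=lambda c: c.score, default=None)
```

A check that was never reported counts as a failure here, which mirrors the point above: a criterion that is not articulated and verified is effectively ignored.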

Eval Infrastructure Is an Underrated Differentiator

The industry conversation around agentic development is almost entirely focused on model selection. Swafford’s experience points to a different variable as the real competitive differentiator: eval infrastructure. Evals, treated seriously, are a combination of test paths and experiments that capture the lived experience of developers working in large repositories across real scenarios: refactoring, security fixes, feature work, performance enhancements, and code reviews.

At Microsoft, evals inform which models perform best in which scenarios, validate repository-specific behavior, and provide the evidence base that finance teams need when making cost-optimization decisions. That last application illustrates the organizational leverage that serious eval infrastructure creates; it transforms AI investment decisions from directional bets into evidence-based choices. The teams that build this capability are not just improving their current agent performance. They are building the measurement foundation that will continue to differentiate them as models and tooling evolve.
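To make the idea of scenario-level evals concrete, here is a minimal sketch that aggregates pass/fail results per model and scenario and reports which model did best in each scenario. The `EvalResult` record, the model names, and the use of a simple pass rate are assumptions for illustration; the article does not describe how Microsoft's evals are structured or scored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str      # hypothetical model identifier
    scenario: str   # e.g. "refactoring", "security fix", "code review"
    passed: bool    # whether the run met its success criteria


def pass_rates(results):
    """Return {(model, scenario): fraction of runs that passed}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for r in results:
        bucket = counts[(r.model, r.scenario)]
        bucket[0] += int(r.passed)
        bucket[1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def best_model_per_scenario(results):
    """Return {scenario: (model, pass_rate)} for the top model in each."""
    best = {}
    for (model, scenario), rate in pass_rates(results).items():
        if scenario not in best or rate > best[scenario][1]:
            best[scenario] = (model, rate)
    return best
```

A table like this is the kind of evidence that could support a cost decision: if a cheaper model matches a more expensive one on a given scenario, the pass rates show it. A real harness would also need enough runs per cell for the differences to be meaningful.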

Follow Mike L. Swafford on LinkedIn for more insights on AI in engineering, large-scale developer productivity, and building the systems that make agentic development work at enterprise scale.
