Beyond AGENTS.md: Harness Engineering, Loop-Based Delivery, and Context-Aware Prompting
Most AI coding workflows still assume a simple pattern: give the model a prompt, let it write code, and then clean up whatever comes out. That can work for a small bug fix, a minor UI enhancement, or a narrow refactor. But once you are building real features in a complex production codebase, that pattern starts to break down.
The problem is not just model quality. The problem is process.
What matters is not whether an agent can produce code in one shot. What matters is whether the system around the agent can repeatedly turn messy user intent into production-safe changes. That means better orchestration, better artifacts, better feedback loops, and better context delivery.
That is where harness engineering becomes useful. And in my view, the next step after a good harness is not a bigger AGENTS.md file. It is a context-aware prompting system that injects the right information at the right step of the loop.
Harness Engineering & Agent-Guided Work
Harness engineering starts with a simple shift in perspective: the human engineer is no longer only writing code directly. The human is designing the environment in which agents can succeed.
That environment includes the repository structure, architectural constraints, test and CI workflows, migration strategy, deployment rules, observability, coding conventions, and the documents that explain how the system works. The harness is everything that shapes agent behavior before the agent ever writes a line of code.
This matters because single-shot prompting has a ceiling. If I ask an agent to build a small feature from scratch with a few sentences of instruction, I might get something plausible. But in a large codebase, plausibility is not enough. I need the result to fit the architecture, respect production constraints, preserve rollout safety, and integrate with the rest of the system.
That is why the harness matters. It gives the agent a world to operate in.
But there is an important distinction here: a harness is not the same thing as a static instruction dump. A lot of teams treat AGENTS.md as if it is the answer. They keep appending more and more rules into one file and hope the agent will absorb them at the right moment. In practice, that becomes noisy, bloated, and brittle.
A static instruction file can still be useful, but only for truly global rules: things that apply to every task, every step, and every model. Once a document tries to cover every possible case, it stops being guidance and starts becoming clutter.
The Role of Workflow & Feedback
A good harness is necessary, but it is not sufficient. The real breakthrough comes from pairing the harness with an explicit workflow.
The pattern I have found most useful is a loop-based process. Instead of asking an agent to solve everything in one pass, the work is broken into artifacts and stages.
At the top level, the process begins with a product requirements document. The purpose of that document is not to be technical. Its job is to turn unstructured conversation into something coherent: business needs, user outcomes, functional requirements, constraints, and success criteria.
That PRD then gets broken into technical specs. This step matters because the way you ship production software is usually not the same as the way a user describes a feature. A single user requirement may need to be implemented in several safe stages: migrations before behavior, behavior before UI, backfills before rollout, and so on. The spec layer translates product intent into engineering increments.
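To make the spec layer concrete, here is a minimal sketch of how a PRD requirement might be decomposed into ordered engineering increments. The `Spec` structure, the example titles, and the dependency names are all hypothetical illustrations, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Spec:
    """One safe engineering increment derived from a PRD requirement."""
    title: str
    depends_on: list[str] = field(default_factory=list)

# Hypothetical decomposition: one user-facing requirement ("show order history")
# staged as migrations before behavior, behavior before UI.
specs = [
    Spec("add order_history table migration"),
    Spec("backfill historical orders",
         depends_on=["add order_history table migration"]),
    Spec("expose order-history API",
         depends_on=["backfill historical orders"]),
    Spec("render order-history UI",
         depends_on=["expose order-history API"]),
]

def shippable_order(specs: list[Spec]) -> list[str]:
    """Return spec titles in an order that respects dependencies."""
    done: list[str] = []
    remaining = list(specs)
    while remaining:
        # A spec is ready once everything it depends on has shipped.
        ready = [s for s in remaining if all(d in done for d in s.depends_on)]
        if not ready:
            raise ValueError("circular dependency between specs")
        for s in ready:
            done.append(s.title)
            remaining.remove(s)
    return done
```

The point of the structure is that shipping order is derived from declared dependencies rather than from the order a user happened to describe the feature in.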
From there, each spec runs through its own internal loop. In my preferred version, that loop has four core steps:
- Plan — produce a detailed work plan for the spec.
- Work — implement the change.
- Review — verify that the implementation actually satisfies the spec.
- Ship — commit, merge, deploy, or otherwise complete the increment.
This creates two loops at once: an outer loop across specs, and an inner loop inside each spec.
The planning step is especially important. It creates an artifact that is optimized for the implementation model. Rather than asking the coding agent to infer everything from a broad spec and a giant codebase, the planner produces a focused work plan: what needs to change, how it should be verified, which tests should be run, and what completion looks like.
Then the implementation step can stay narrow. It is not discovering the problem anymore. It is executing a plan.
When implementation finishes, it should not just stop. It should write a work summary: what changed, what tests ran, what assumptions were made, and how the spec requirements were addressed. That summary then feeds the review step, where another model, another pass, or another review process checks the claim against reality.
This is where the loop becomes powerful. If review fails, the workflow does not collapse. It routes back to plan or work and iterates again. You can cap the number of iterations, but the key point is that failure is handled as part of the design.
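A minimal sketch of that inner loop, with the iteration cap and the route-back-on-failure behavior, might look like this. The step functions are stand-ins for whatever model calls or processes your orchestrator uses, and the artifact shapes (work plan, work summary, review verdict) are illustrative assumptions:

```python
MAX_ITERATIONS = 3  # cap retries so a failing spec cannot loop forever

def run_spec(spec, plan_step, work_step, review_step, ship_step):
    """Inner loop for one spec: plan, work, review, then ship or retry.

    Each stage produces an artifact that feeds the next:
    plan -> work plan, work -> work summary, review -> verdict.
    """
    feedback = None
    for attempt in range(1, MAX_ITERATIONS + 1):
        plan = plan_step(spec, feedback)      # focused work plan for this spec
        summary = work_step(plan)             # implement, then write a work summary
        verdict = review_step(spec, summary)  # check the claim against the spec
        if verdict["passed"]:
            ship_step(summary)                # commit, merge, deploy the increment
            return {"shipped": True, "attempts": attempt}
        feedback = verdict["issues"]          # route failures back into planning
    return {"shipped": False, "attempts": MAX_ITERATIONS}
```

Note that failure is an ordinary branch, not an exception: a failed review simply becomes input to the next planning pass.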
This also solves an important context-window problem. Not every token inside a long prompt is equally useful. As context gets longer, quality can degrade, especially in complex codebases with lots of specialized names and patterns. A loop-based process lets each step operate inside a narrower, more useful slice of context rather than drowning in the full history of the task.
Context-Aware Prompting
This is the point where I think many teams make a wrong turn.
They notice that agents miss things, and their first instinct is to add more instructions to AGENTS.md. For example, imagine a Rust project using SQLx. A coding agent makes valid changes, but CI fails because sqlx prepare was not run and the prepared query data is out of date. The obvious reaction is to add another instruction to the agent guide: “If you change queries, run sqlx prepare.”
That sounds reasonable, but if you keep doing this, the file becomes a junk drawer of edge cases.
The usual answer to that problem is progressive disclosure: keep a slim top-level file and let the agent decide when to read additional documentation. I do not think that is a strong solution. It pushes the context-selection problem back onto the coding agent itself. Now the agent is not only implementing code, it is also deciding which instructions matter, when to go looking for them, and whether it has understood enough to proceed.
That is the wrong place to spend model effort.
A better approach is what I would call context-aware prompting. The idea is simple: do not leave it to the coding agent to decide what it should read. Build a separate orchestration step whose job is to assemble the best possible prompt for the current step.
In other words, instead of giving the model a giant manual and asking it to self-direct, you give it a task-specific working set.
For a planning step, that working set might include:
- the spec being planned
- relevant architecture documents
- nearby code structure
- rollout constraints
- migration patterns
- existing test conventions
For an implementation step, it might include:
- the approved work plan
- exact file excerpts or interfaces to modify
- relevant patterns from the codebase
- commands to validate the work
- edge cases already identified during planning
For a review step, it might include:
- the original spec
- the work summary
- the diff or changed files
- explicit verification criteria
- known failure modes to inspect
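The working sets above can be sketched as a simple orchestrator-side registry: each phase declares what it needs, and a loader fetches it. The registry keys, material names, and `load_material` function are hypothetical; the retrieval mechanism would be whatever your orchestration layer actually uses:

```python
# Each workflow phase declares the materials its prompt should contain.
WORKING_SETS = {
    "plan":   ["spec", "architecture_docs", "code_structure",
               "rollout_constraints", "migration_patterns", "test_conventions"],
    "work":   ["work_plan", "file_excerpts", "codebase_patterns",
               "validation_commands", "known_edge_cases"],
    "review": ["spec", "work_summary", "diff",
               "verification_criteria", "known_failure_modes"],
}

def assemble_prompt(phase: str, load_material, task_id: str) -> str:
    """Build a phase-specific prompt instead of one generic instruction dump."""
    sections = []
    for name in WORKING_SETS[phase]:
        body = load_material(task_id, name)  # orchestrator-side retrieval
        if body:  # skip materials that do not exist for this task
            sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)
```

The design choice worth noticing: context selection happens before the model is called, in a step you can inspect and version, rather than inside the coding agent's own reasoning.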
The point is not merely to stuff more information into the prompt. The point is to inject the most relevant information for that exact phase of the workflow.
This is why I think the next generation of agent systems will not be defined by better generic agent files. They will be defined by better context assembly.
Evolution: From Harness to Context-Aware
Once you start thinking this way, the system becomes capable of improving itself.
A strong harness gives you process. A loop gives you iteration. Context-aware prompting gives you precision. But the final piece is learning from context gaps.
There will still be times when the agent needs more information. That is fine. The important thing is not to hide that event. Make it explicit.
If an agent wants to read additional documentation, inspect more files, or ask for more context, require it to log why. What is it trying to understand? What is missing from the current prompt? Which assumption is blocking progress?
That log becomes a gold mine.
It tells you where your prompt-construction system is weak. It shows which information should probably have been injected earlier. It reveals the recurring blind spots in your orchestration layer.
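One way to make that log actionable is an append-only record of every out-of-prompt request, plus a query that surfaces recurring gaps. The file name, field names, and threshold below are hypothetical choices for illustration:

```python
import json
import time

GAP_LOG = "context_gaps.jsonl"  # hypothetical log location

def log_context_gap(phase, task_id, requested, reason, missing, path=GAP_LOG):
    """Record every time an agent had to reach outside its assembled prompt."""
    entry = {
        "ts": time.time(),
        "phase": phase,          # plan / work / review
        "task": task_id,
        "requested": requested,  # file or doc the agent asked for
        "reason": reason,        # what it was trying to understand
        "missing": missing,      # what the prompt failed to supply
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def recurring_gaps(min_count=3, path=GAP_LOG):
    """Materials requested repeatedly: candidates for earlier injection."""
    counts: dict[str, int] = {}
    with open(path) as f:
        for line in f:
            req = json.loads(line)["requested"]
            counts[req] = counts.get(req, 0) + 1
    return {name: n for name, n in counts.items() if n >= min_count}
```

Anything that shows up in `recurring_gaps` is evidence that the prompt-construction step, not the coding agent, is where the fix belongs.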
Now you are no longer just improving the coding step. You are improving the step that prepares the coding step.
That is a very different kind of optimization.
You could still keep a small, broad AGENTS.md file for universal rules. You could still allow optional deeper documentation. But the primary mechanism would shift from agent-led exploration to orchestrator-led context injection, with auditability around every escape hatch.
Over time, that means your system can become better at building its own prompts. It starts to learn which materials matter during planning, which matter during implementation, and which matter during review. It becomes less dependent on a model wandering through a repo and more dependent on a repeatable algorithm for supplying context.
That is much closer to engineering than hoping an agent will read the right file at the right time.
Conclusion
The biggest shift in AI-assisted software delivery is not that models can write code. It is that we can now design systems around them.
Harness engineering is a strong foundation because it recognizes that agent quality depends on the environment, constraints, and feedback loops around the agent. But in complex production work, a harness alone is not enough. A static AGENTS.md file is not enough either.
The next step is to break work into explicit stages, create artifacts that travel between those stages, and assemble step-specific context so each agent operates inside a narrow, high-signal window.
That is the real move beyond generic agent guidance.
The best agent is not the one that reads the most documentation. It is the one that is given the right context, at the right time, for the right step.
And if you can log the gaps, audit the misses, and continuously improve how that context is assembled, the system does not just execute work. It gets better at preparing work.
That is where I think this is heading.