MAY 11, 2026
13 MIN READ

From Prompt Engineering to Harness Engineering
How harness engineering is redefining AI agents for GRC, shifting focus from prompts and context to runtime, policy, and audit-ready systems.

Watching the discipline of building with LLMs reinvent itself, in real time, for the third time in three years.

The Discipline Keeps Reinventing Itself—That’s the Point

It's worth pausing to notice how fast the discipline of building with large language models has been rewriting its own definition.

In 2023, the practice was prompt engineering. The job was to coax better output from a fixed model by choosing your words carefully. Whole consultancies were founded on it. Conferences filled rooms with it. A handful of people became genuinely famous for being good at it.

In 2024 and early 2025, the center of gravity shifted to context engineering. The realization, articulated most clearly in Anthropic's public writing but felt everywhere, was that the prompt is only a thin slice of what the model actually sees. 

The real craft was in what you put in the window: retrieval, compression, grounding, structured note-taking, multi-agent context sharing. The job stopped being "write a better prompt" and started being "curate a better information environment."

We are now, visibly, inside a third shift. The center of gravity is moving again, this time toward what the field is starting to call harness engineering.

This framing is not mine. LangChain has a post titled "The Anatomy of an Agent Harness." Phil Schmid has been writing about the importance of agent harnesses in 2026. Anthropic has published at least four pieces on harness design. Aakash Gupta's framing was blunt: 2025 was agents. 2026 is agent harnesses.

The question worth asking isn't whether this is a real shift—it clearly is. The question is what it actually means, and what the next wave of the work looks like for serious builders.

What "Harness" Means, and Why the Word Matters

A harness, in this evolving usage, is the runtime environment a language model operates inside. It owns everything that isn't the model itself: which tools the model can call, how context gets assembled, where memory lives, how policies get enforced, how runs get traced, how failures get retried.

Phil Schmid has articulated the cleanest analogy: the model is the CPU, and the harness is the operating system around it. The model reasons. The harness decides what the model sees, what it can do, and what happens around the reasoning.
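Here is roughly what that division of labor looks like in code. This is a deliberately toy TypeScript sketch, not any framework's API; the interface names are mine. The point is only that the loop, the policy check, the tracing, and the failure handling all live outside the model call.

```typescript
// A minimal, hypothetical sketch: the model "reasons", the harness owns
// everything around that call — tools, policy, tracing, retries.

interface ToolCall { tool: string; args: Record<string, unknown>; }
interface ModelStep { thought: string; action?: ToolCall; finalAnswer?: string; }

interface Model {
  // The only thing the model does: map assembled context to a next step.
  step(context: string): Promise<ModelStep>;
}

interface Tool {
  name: string;
  run(args: Record<string, unknown>): Promise<string>;
}

interface Harness {
  assembleContext(history: string[]): string;           // context engineering
  isAllowed(call: ToolCall): boolean;                    // policy gate
  trace(event: string, data: unknown): void;             // observability
}

async function runAgent(model: Model, tools: Map<string, Tool>, harness: Harness): Promise<string> {
  const history: string[] = [];
  for (let turn = 0; turn < 20; turn++) {                // the harness bounds the loop
    const step = await model.step(harness.assembleContext(history));
    harness.trace("model_step", step);
    if (step.finalAnswer) return step.finalAnswer;
    if (!step.action) continue;
    if (!harness.isAllowed(step.action)) {               // policy runs before the action
      history.push(`DENIED: ${step.action.tool}`);
      continue;
    }
    const tool = tools.get(step.action.tool);
    const observation = tool ? await tool.run(step.action.args) : "unknown tool";
    history.push(observation);                           // state the model sees next turn
  }
  return "gave up: turn budget exhausted";               // failure handling is the harness's job
}
```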

That shift in language tracks a shift in where the interesting engineering actually lives. In a prompt-engineering world, the leverage is in the words. In a context-engineering world, the leverage is in the information pipeline that fills the window. In a harness-engineering world, the leverage is in the surrounding runtime, and almost none of it is the model.

The progression has a pattern to it. Each era subsumes the one before it rather than replacing it. Good prompt engineering is still required; it's just no longer the whole job. Good context engineering is still required; it's now a subsystem inside the harness. Each shift has broadened what "building with LLMs" actually means.

The Current Shape of a Harness

The consensus that's emerging in public writing, across LangChain, Anthropic, Temporal, HumanLayer, Epsilla, and agent-engineering.dev, is that a full agent harness has something like eight layers. No two authors draw the diagram identically, but the layers are recognizable across all of them:

  • Input & Trigger

  • Context Engineering

  • Orchestration & Planning

  • Reasoning Core (containing Memory, the LLM, and Tools)

  • Policy & Guardrails

  • State & Session

  • Action & Output

  • Observability (cuts across all layers)

Each layer is the subject of active, visible work across the industry right now: durable execution maturing in the state and session layer, protocol standardization (MCP) reshaping the tool surface, evals hardening under observability, memory systems getting their first real theoretical treatments inside the reasoning core. None of this is settled. All of it is moving.
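One way I find useful to read that list is as a set of interfaces the harness composes. The sketch below is my own TypeScript rendering under that assumption, not any particular vendor's API; the names simply mirror the layers above.

```typescript
// Hypothetical: the eight layers rendered as composable interfaces.
// No framework is implied; the names mirror the list above.

interface InputTrigger  { next(): Promise<{ kind: "event" | "schedule" | "request"; payload: unknown }>; }
interface ContextEngine { assemble(task: unknown, memory: MemoryStore): Promise<string>; }
interface Orchestrator  { plan(task: unknown): Promise<string[]>; }   // ordered step ids
interface MemoryStore   { read(key: string): Promise<string | undefined>; write(key: string, value: string): Promise<void>; }
interface LLM           { complete(context: string): Promise<string>; }
interface ToolRegistry  { invoke(name: string, args: unknown): Promise<string>; }
interface PolicyEngine  { authorize(action: { name: string; args: unknown }): Promise<boolean>; }
interface SessionState  { checkpoint(runId: string, state: unknown): Promise<void>; }
interface ActionOutput  { emit(result: unknown): Promise<void>; }
interface Observability { span<T>(name: string, fn: () => Promise<T>): Promise<T>; }  // cuts across everything

// The harness is the composition, not any single layer.
interface AgentHarness {
  input: InputTrigger;
  context: ContextEngine;
  orchestrator: Orchestrator;
  core: { memory: MemoryStore; llm: LLM; tools: ToolRegistry };
  policy: PolicyEngine;
  state: SessionState;
  output: ActionOutput;
  observability: Observability;
}
```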

The Pattern Beneath the Shift

If you squint, the three eras tell a consistent story about where the leverage sits.

Prompt engineering assumed the model was the bottleneck, and the job was to wring more out of it. Context engineering noticed that the bottleneck had moved: the model was plenty capable, but it was operating on impoverished information. The job was to fix the information environment.

Harness engineering is noticing that the bottleneck has moved again. The model is capable, the information environment is solvable, and the new limit is the runtime. 

What happens when a step fails at 2am? What does "this agent acted on behalf of this customer" actually mean mechanically? How do you reconstruct a chain of decisions for an auditor six months later? How do you compose ten specialist agents without ending up with ten disconnected scripts? 

None of that is a prompting problem. None of it is a context problem. It's plumbing.

The unglamorous reading of all of this is that the field keeps discovering that the hard problems in applied AI are the ones operating systems have always had. Concurrency. State. Policy. Identity. Resource accounting. Observability. Failure recovery. 

The harness era is, in a real sense, the moment the field has to grow up and take those seriously, not as nice-to-haves, but as the substrate without which nothing composes.

That is a much less glamorous story than the one about the model. But it is probably the more durable one.

Where This Gets Interesting, and Where It Remains Unsolved

A few of the layers look genuinely well-developed in public discourse. Orchestration and durable execution have a strong body of work behind them. Observability and evals have credible tools and methodologies. The tool layer has a serious protocol (MCP) with a roadmap and adoption.

A few of the layers are still, honestly, unsolved in public.

Memory is underspecified. Most teams use "memory" to mean "a vector store." That's not memory; that's retrieval. Real memory systems have a taxonomy (working, episodic, semantic, procedural), lifecycle policies, and, ideally, a unified read/write/query surface across all of them. Almost nobody has built that yet.
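To make that concrete, here is a minimal sketch of what a unified memory surface might look like, assuming the four-kind taxonomy above and treating lifecycle (expiry) as a per-record policy. All names are illustrative.

```typescript
// Hypothetical sketch of "memory as more than a vector store": one surface,
// four kinds, each with its own lifecycle policy.

type MemoryKind = "working" | "episodic" | "semantic" | "procedural";

interface MemoryRecord {
  kind: MemoryKind;
  key: string;
  value: string;
  writtenAt: Date;
  ttlMs?: number;            // lifecycle: working memory expires, semantic usually doesn't
}

interface MemorySurface {
  write(record: MemoryRecord): Promise<void>;
  read(kind: MemoryKind, key: string): Promise<MemoryRecord | undefined>;
  query(kind: MemoryKind, filter: (r: MemoryRecord) => boolean): Promise<MemoryRecord[]>;
}

// A toy in-process implementation, just to show lifecycle enforced in one place.
class InMemorySurface implements MemorySurface {
  private records: MemoryRecord[] = [];
  async write(record: MemoryRecord) { this.records.push(record); }
  async read(kind: MemoryKind, key: string) {
    return (await this.query(kind, r => r.key === key)).at(-1);
  }
  async query(kind: MemoryKind, filter: (r: MemoryRecord) => boolean) {
    const now = Date.now();
    return this.records.filter(r =>
      r.kind === kind &&
      filter(r) &&
      (r.ttlMs === undefined || r.writtenAt.getTime() + r.ttlMs > now)  // expiry as policy
    );
  }
}
```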

Context as live state is underspecified. Loading "what is true for this customer, right now, under these exceptions, with this scope" as a first-class runtime concept is hard, and most systems today still reduce it to "a big system prompt plus some retrieved docs."

Policy as a gate, not a filter, is underspecified. Most harnesses today treat policy as something applied to outputs. In a lot of regulated contexts, policy is better understood as an authorization layer, a gate the agent has to pass through before acting.

An audit fabric is barely discussed. In most industries, "the log will do" is a defensible position. In others, the log is the product. The infrastructure to make an agent's decisions reconstructable, attributable, and defensible to an external reviewer is a different thing from application logging. Almost nobody is writing about this, and it will matter.

I don't have a clean answer to any of those yet. Neither, I think, does the field. These are the places the next year of work gets interesting.

What I Think a GRC-Native Harness Actually Looks Like

I spend most of my time thinking about AI in the GRC space, which tends to surface the hard version of each of these problems early. The generic eight-layer picture above applies, but in a regulated context several layers reshape in ways I think are genuinely specific to the domain.

Evidence has to be a first-class type, not a file. In a generic harness, agents produce "outputs": text, artifacts, tool calls. In a GRC harness, most of what the agent produces is evidence, a structured assertion linked to a specific control, from a specific source, with provenance and a timestamp. Treating evidence as "a document the agent wrote" is the category error that quietly kills most GRC-AI projects.
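A hedged sketch of that distinction in TypeScript: evidence as a typed record with provenance, versus an artifact that merely references evidence. The field names are illustrative, not a real schema.

```typescript
// Hypothetical: evidence as a typed record, not "a document the agent wrote".
// The shape follows the prose above; the field names are illustrative.

interface Evidence {
  id: string;
  controlId: string;                       // the specific control this supports
  assertion: string;                       // the structured claim being made
  source: { system: string; query: string; retrievedAt: Date };   // provenance
  collectedBy: { agentId: string; onBehalfOf: string; tenantId: string };
  collectedAt: Date;
}

// A free-text artifact can reference evidence, but it is not itself evidence.
interface Artifact { text: string; supportingEvidence: Evidence[]; }
```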

Policy belongs on the front of the action, not the back. Most general-purpose harnesses treat policy as a filter applied to outputs. In GRC, that framing is backwards. The question isn't "is the output safe to show." It's "is the agent authorized to take this action, on this tenant, under this framework, given this exception state, right now." Policy is an authorization gate that runs before the action, not a content check that runs after.
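A minimal sketch of what "policy on the front of the action" might look like, assuming a deny-by-default rule engine. Everything here is illustrative; the point is that the decision is a function of actor, principal, tenant, framework, and exception state, evaluated before the tool runs.

```typescript
// Hypothetical: policy as a pre-action authorization gate, not a post-hoc output filter.

interface ActionRequest {
  action: string;                          // e.g. "collect_evidence" — illustrative
  agentId: string;
  onBehalfOf: string;                      // the human principal
  tenantId: string;
  framework: string;                       // e.g. "SOC 2" — illustrative
  activeExceptions: string[];
}

interface RuleDecision { allowed: boolean; reason: string; }
type PolicyRule = (req: ActionRequest) => RuleDecision | null;   // null = no opinion

interface PolicyDecision extends RuleDecision { policyVersion: string; evaluatedAt: Date; }

function authorize(req: ActionRequest, rules: PolicyRule[], policyVersion: string): PolicyDecision {
  for (const rule of rules) {
    const decision = rule(req);
    if (decision) return { ...decision, policyVersion, evaluatedAt: new Date() };
  }
  // Deny by default: an ungoverned action is not an allowed action.
  return { allowed: false, reason: "no rule permitted this action", policyVersion, evaluatedAt: new Date() };
}
```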

Identity has three axes, not one. In a consumer AI product, "who is this?" is one question. In a GRC harness, it's three: which agent is acting, on behalf of which human, inside which tenant's boundary. Every layer of the harness has to know all three, and the boundaries between tenants have to be enforced as hard walls rather than conventions.
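The smallest useful version of that, sketched with illustrative names: every call carries all three axes, and crossing a tenant boundary is an error, not a fallback.

```typescript
// Hypothetical: the three identity axes carried on every call,
// with tenant isolation enforced structurally rather than by convention.

interface Principal {
  agentId: string;        // which agent is acting
  userId: string;         // on behalf of which human
  tenantId: string;       // inside which tenant's boundary
}

function assertTenantBoundary(principal: Principal, resourceTenantId: string): void {
  if (principal.tenantId !== resourceTenantId) {
    // A hard wall: cross-tenant access fails loudly, every time.
    throw new Error(
      `tenant boundary violation: agent ${principal.agentId} (tenant ${principal.tenantId}) ` +
      `attempted to touch a resource in tenant ${resourceTenantId}`
    );
  }
}
```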

Regulatory context is live state, not configuration. Which frameworks apply to this tenant, what version of each, which controls are in scope, which exceptions are currently active, which policies have been updated since the last audit — that isn't a config file you load at startup. It's a live state the agent has to reason inside, and it changes constantly.
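One way to make that distinction concrete is to model regulatory context as a point-in-time snapshot that gets re-resolved at the start of every run rather than loaded once at process start. This is a hypothetical shape, not a real schema.

```typescript
// Hypothetical: regulatory context as live state the agent reasons inside.

interface RegulatoryContext {
  tenantId: string;
  frameworks: { name: string; version: string }[];
  controlsInScope: string[];
  activeExceptions: { controlId: string; expiresAt: Date }[];
  policiesUpdatedSinceLastAudit: string[];
  resolvedAt: Date;                        // this snapshot goes stale; it is not configuration
}

interface RegulatoryContextSource {
  // Resolved at the start of every run (and after long pauses), not at startup.
  resolve(tenantId: string): Promise<RegulatoryContext>;
}

async function withFreshContext<T>(
  source: RegulatoryContextSource,
  tenantId: string,
  run: (ctx: RegulatoryContext) => Promise<T>
): Promise<T> {
  const ctx = await source.resolve(tenantId);
  return run(ctx);
}
```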

Reversibility should be a property of every tool, not an afterthought. Every tool the harness exposes should be designed with an explicit answer to "what happens if we're wrong about this?" Some actions are recoverable, some aren't, and the ones that aren't deserve human-in-the-loop gates by default.
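A sketch of what that could look like if reversibility were part of the tool contract itself. The types are illustrative; the behavior to notice is that irreversible tools pause for human approval by default.

```typescript
// Hypothetical: every tool declares its blast radius up front.

type Reversibility =
  | { kind: "reversible"; undo: (args: unknown) => Promise<void> }
  | { kind: "irreversible"; requiresHumanApproval: true };

interface HarnessTool {
  name: string;
  run(args: unknown): Promise<string>;
  reversibility: Reversibility;
}

async function executeTool(tool: HarnessTool, args: unknown, approvedByHuman: boolean): Promise<string> {
  if (tool.reversibility.kind === "irreversible" && !approvedByHuman) {
    // Human-in-the-loop gate by default for actions we cannot take back.
    return `paused: "${tool.name}" is irreversible and needs explicit human approval`;
  }
  return tool.run(args);
}
```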

The audit fabric is the product, not a logging layer. In a GRC context, the substrate has to serve external reviewers, months or years after the fact, reconstructing exactly what the agent did, on whose behalf, against which policy version, with what evidence. That's not a log. It's an evidentiary record with different guarantees (immutability, chain of custody, decision attribution) than application logging cares about.
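To show how different that is from application logging, here is a toy hash-chained audit record. It is a sketch, not a production design: each entry binds the action to the acting principal, the policy version in effect, the supporting evidence, the agent's stated rationale, and the previous entry, so the chain can be verified long after the fact.

```typescript
import { createHash } from "node:crypto";

// Hypothetical: an append-only, hash-chained evidentiary record.

interface AuditEntry {
  sequence: number;
  timestamp: string;                       // ISO 8601
  principal: { agentId: string; onBehalfOf: string; tenantId: string };
  action: string;
  rationale: string;                       // why the agent chose this action
  evidenceIds: string[];
  policyVersion: string;
  previousHash: string;
  hash: string;                            // hash over this entry plus previousHash
}

function appendEntry(
  chain: AuditEntry[],
  entry: Omit<AuditEntry, "sequence" | "previousHash" | "hash">
): AuditEntry {
  const previousHash = chain.at(-1)?.hash ?? "genesis";
  const sequence = chain.length;
  const body = JSON.stringify({ ...entry, sequence, previousHash });
  const hash = createHash("sha256").update(body).digest("hex");
  const full: AuditEntry = { ...entry, sequence, previousHash, hash };
  chain.push(full);
  return full;
}

function verifyChain(chain: AuditEntry[]): boolean {
  // Any edit to any entry breaks every hash after it: chain of custody, not a log level.
  return chain.every((entry, i) => {
    const expectedPrev = i === 0 ? "genesis" : chain[i - 1].hash;
    const { hash, sequence, previousHash, ...rest } = entry;
    const body = JSON.stringify({ ...rest, sequence, previousHash: expectedPrev });
    return previousHash === expectedPrev &&
           hash === createHash("sha256").update(body).digest("hex");
  });
}
```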

Durability has to cover the decision, not just the execution. The harness also has to make the agent's reasoning durable: why did it choose this action, which evidence supported that choice, which policy version was in effect. "The code ran to completion" is not the same as "the decision can be reconstructed." An auditor cares about the second one.

Regulated categories also tend to discover the limits of each era a little earlier than the rest of the market. The prompt era was short there. The context era was short there. The harness era, I suspect, is going to be long, because most of the items in the list above aren't problems anyone solves in a single quarter.

Most of This Isn't an AI Engineering Problem

A point worth making: most of what makes a real agent harness hard isn't the AI part. It's the part where it has to plug into an engineering organization that has already made some architectural decisions, most of them long before "agents" was a meaningful word.

There is no "buy the agent harness" SKU. The hard problems are old problems in new clothing, and the people who solve them are the people who already know your company's API surfaces, identity model, transaction boundaries, and deployment topology.

Authorization. Which agent is allowed to call which API endpoint, on whose behalf, with what scope? Sounds like an AI question; it's almost entirely a permissions-and-identity question. Standards like OAuth 2.1 cover the delegation flow — they don't cover agent-to-agent chains, action-level policy scopes, or the audit-grade attestation a regulated context eventually demands.

Cross-system reversibility. When an agent's action spans multiple services and step three fails, you're back in saga-pattern territory: compensating transactions across service boundaries, run by an orchestrator that survives failure.
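For the avoidance of doubt about what "saga-pattern territory" means here, a minimal sketch: each step registers its compensation, and a failure unwinds what already completed, in reverse order. A real version would be driven by a durable orchestrator rather than an in-process loop.

```typescript
// Hypothetical: a saga-style wrapper for a multi-service agent action.

interface SagaStep<T> {
  name: string;
  run(): Promise<T>;
  compensate(result: T): Promise<void>;    // "undo" across the service boundary
}

async function runSaga(steps: SagaStep<unknown>[]): Promise<void> {
  const completed: { step: SagaStep<unknown>; result: unknown }[] = [];
  for (const step of steps) {
    try {
      const result = await step.run();
      completed.push({ step, result });
    } catch (err) {
      // Unwind what already happened, newest first, then surface the failure.
      for (const { step: done, result } of completed.reverse()) {
        await done.compensate(result);
      }
      throw new Error(`saga failed at "${step.name}": ${String(err)}`);
    }
  }
}
```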

Identity and tenant boundaries. Most organizations have an identity story for humans and a separate, messier one for service accounts. Agents are a third category that fits neither, and they have to carry tenant context through every call they make.

Observability across the seam. Once an agent run touches five services, your application traces and your AI traces are two different graphs. Stitching them so a single action is reconstructable end-to-end is a real instrumentation project.
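The simplest version of that stitching is a single correlation id, minted when the agent run starts and carried through every downstream call. The sketch below is illustrative; a real system would propagate this through its tracing infrastructure rather than an in-memory array.

```typescript
// Hypothetical: one runId joins application spans and AI spans into a single graph.

interface Span {
  runId: string;                           // the agent run that caused this work
  system: "agent" | "application";
  name: string;
  startedAt: number;
  endedAt: number;
}

const spans: Span[] = [];

async function traced<T>(runId: string, system: Span["system"], name: string, fn: () => Promise<T>): Promise<T> {
  const startedAt = Date.now();
  try {
    return await fn();
  } finally {
    spans.push({ runId, system, name, startedAt, endedAt: Date.now() });
  }
}

// Reconstructing a single action end-to-end becomes a query by runId over both graphs.
function reconstructRun(runId: string): Span[] {
  return spans.filter(s => s.runId === runId).sort((a, b) => a.startedAt - b.startedAt);
}
```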

The honest message to leadership: the next year of work isn't "buy an AI platform and integrate it." It's "revisit several of our existing distributed-systems decisions in light of agents, build the parts that don't exist yet, and accept that some of this is platform-level engineering no vendor will do for us."

Questions I'm Asking

A short list of open questions I'd like to compare notes on with anyone else working in this space:

  • Does memory become a standardized subsystem the way context assembly has, or does it stay bespoke for another year or two?

  • Does MCP's momentum continue, and do we end up with an agent-side protocol standard as well as a tool-side one?

  • Do evals become the primary feedback loop into fine-tuning, or do we end up with something more like a closed-loop online-learning pattern?

  • Does "policy as a gate" become a named architectural layer, or does it stay buried inside application logic?

  • Is the next era after harness engineering "systems engineering for fleets of agents," or something else entirely?

I don't know. None of us do yet. But the pace at which the field has renamed its own central discipline three times in three years is, at minimum, worthy of attention. 

This article was originally published on Medium in the AI in GRC publication.

One Last Thing

This isn’t theoretical for me. As an AI leader at Drata, I’m helping build the agent harness from the ground up, in a category where every action eventually meets an auditor. We think this layer, the runtime around the model rather than the model itself, is where the next several years of GRC get decided. And we’re building like we believe it: rethinking memory, identity, policy, and the audit fabric as first-class primitives, not features bolted onto a chatbot.

If you find this kind of work fascinating, we’re hiring. AI engineers, platform engineers, anyone who looks at the gap between “agents that demo” and “agents that survive an audit” and sees a career’s worth of interesting problems. Come build the future of compliance with us.


Lior Solomon
VPE, Data
Lior Solomon is VP of Engineering, Data at Drata, where he leads the company’s data engineering and AI efforts, with a focus on building trustworthy systems and privacy-by-design practices. Prior to Drata, he held senior data and engineering leadership roles including VP of Engineering, Data and Head of Data at Vimeo.
