Why Most Agent Frameworks Solve the Wrong Problem
Most agent frameworks optimize the prompt loop. Production agent infrastructure has to optimize governance, durability, memory, and review.
Most agent frameworks solve for the part that demos well: prompt an agent, give it a tool, let it loop, show the output. That is useful if your goal is a personal assistant, a prototype, or a bounded workflow where failure is cheap.
That is not the hard problem I care about.
The hard problem is what happens when agents become part of an operating system for work. Multiple agents. Long-running tasks. Human approvals. Trust calibration. Quality scoring. Budget limits. Memory that survives a session. Review surfaces that let an operator see what actually happened.
Over the past 15 months and 1,270+ commits, I've been building OrgX precisely because most agent tooling stops before that layer. This post is the argument for why the infrastructure problem matters more than the framework race.
Most Frameworks Optimize The Demo Loop
Single-agent systems are well-understood. You give an agent a tool, a prompt, and a goal. It does its thing. But when you need multiple agents to collaborate on complex work — with handoffs, approvals, quality checks, and organizational memory — the coordination problem explodes.
What I needed was not a prettier loop. I needed agents that could:
- Spawn sub-agents for specialized tasks
- Hand off work between agents with full context preservation
- Get human approval before taking high-impact actions
- Score and evaluate their own output quality
- Remember organizational patterns and decisions across sessions
None of the existing frameworks handled all of this well. Most focus on single-agent chains or crew-style task assignment, but they stop short of organization-wide governance, trust calibration, or proof that an autonomous action should have been allowed in the first place.
That is the category error. They treat the prompt as the primitive. I needed the organization to be the primitive.
The Missing Layer Is Infrastructure
The infrastructure layer is the part that answers questions frameworks usually defer:
- Who is allowed to do what?
- Which actions require review?
- What happens if a task spans hours instead of seconds?
- How do you preserve memory across agents and sessions?
- How do you prove why an action was routed, approved, blocked, or retried?
Without that layer, you do not have reliable agent operations. You have agent theater.
Protocols Matter Because Portability Matters
The Model Context Protocol (MCP) became the backbone of OrgX for a specific reason: it provides a standardized way for agents to discover and use tools without tight coupling.
In OrgX, every capability is exposed as an MCP server:
- Entity management (creating tasks, projects, initiatives)
- Agent spawning (launching specialized sub-agents)
- Decision approval (human-in-the-loop governance)
- Quality scoring (composite evaluation of agent output)
- Org memory (querying organizational knowledge)
This means any MCP-compatible client — Claude Code, Cursor, a custom CLI — can interact with the full OrgX system through a single protocol. No custom integrations needed.
// Example: Agent spawning via MCP tool call
const result = await mcpClient.callTool("spawn_agent_task", {
  title: "Review PR #847",
  description: "Check for security vulnerabilities and test coverage",
  assignee: "security-review-agent",
  parent_task_id: currentTask.id,
});
Durable Workflows Beat Request-Response
The biggest architectural shift was moving from request-response agent interactions to durable workflows. An agent task in OrgX isn't a single LLM call — it's a stateful workflow that can:
- Pause for approval — Agent proposes an action, human reviews
- Resume after interruption — Context preserved across sessions
- Fork into sub-tasks — Agent spawns child tasks that report back
- Accumulate quality scores — Each step contributes to a composite score
This required a state machine approach. Every task has a lifecycle: created → in_progress → pending_approval → approved → completed, with quality gates at each transition.
Frameworks that optimize only for prompt execution skip this entirely. That is fine until a task needs to pause for a day, wait for a decision, resume with preserved state, and still be inspectable afterward.
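The lifecycle above can be sketched as an explicit transition table with a quality-gate hook at every edge. This is a minimal illustration using the state names from the post, not the OrgX implementation; the `advance` function and gate signature are assumptions.

```typescript
// Illustrative task lifecycle:
// created → in_progress → pending_approval → approved → completed
type TaskState =
  | "created"
  | "in_progress"
  | "pending_approval"
  | "approved"
  | "completed";

// Legal transitions; anything not listed here is rejected outright.
const transitions: Record<TaskState, TaskState[]> = {
  created: ["in_progress"],
  in_progress: ["pending_approval"],
  pending_approval: ["approved"],
  approved: ["completed"],
  completed: [],
};

// A quality gate runs at every transition; a failing gate keeps the
// task in its current state instead of letting it advance.
function advance(
  state: TaskState,
  next: TaskState,
  gate: (from: TaskState, to: TaskState) => boolean
): TaskState {
  if (!transitions[state].includes(next)) {
    throw new Error(`illegal transition: ${state} -> ${next}`);
  }
  return gate(state, next) ? next : state;
}
```

Making transitions data rather than control flow is what lets a task pause for a day and resume: the current state is a value you can persist and inspect, not a position in a call stack.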
Governance Is Not Optional
Not all agents should have the same permissions. OrgX implements a trust model where:
- New agents start with limited autonomy — every action requires approval
- Proven agents (based on quality scores and outcome attribution) earn more autonomy
- High-impact actions always require human approval regardless of trust level
- Trust degrades if quality scores drop
This isn't theoretical. It is what you need the moment an agent can mutate state, move money, touch customer records, or trigger downstream automation with a real blast radius.
// Simplified trust check
async function checkAgentAutonomy(agentId: string, action: Action) {
  const trustContext = await getTrustContext(agentId);
  // High-impact actions always go to a human, regardless of trust level.
  if (action.impact === "high") return { requiresApproval: true };
  // Agents below the autonomy threshold need approval for everything.
  if (trustContext.score < AUTONOMY_THRESHOLD) return { requiresApproval: true };
  return { requiresApproval: false };
}
Quality Has To Feed Back Into The System
Every piece of agent work in OrgX gets scored across multiple dimensions:
- Completeness — Did the agent address all requirements?
- Accuracy — Is the output factually correct?
- Code quality — Does generated code follow standards?
- Communication — Did the agent explain its reasoning?
These individual scores feed into a composite scoring engine that weights dimensions based on the task type. A code review task weights accuracy and code quality higher; a research task weights completeness higher.
The scoring isn't just for record-keeping. It feeds back into the trust model, outcome attribution, and future routing decisions. If quality does not change the system's future behavior, it is decoration.
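Per-task-type weighting can be sketched as a small lookup table over the four dimensions. The weight values below are assumptions chosen to match the example in the text (code review weights accuracy and code quality higher, research weights completeness higher); they are not OrgX's real configuration.

```typescript
// Illustrative composite scorer: per-dimension scores weighted by task type.
type Dimension = "completeness" | "accuracy" | "codeQuality" | "communication";

// Each row sums to 1.0 so the composite stays on the same 0..1 scale.
const weightsByTaskType: Record<string, Record<Dimension, number>> = {
  code_review: { completeness: 0.2, accuracy: 0.35, codeQuality: 0.35, communication: 0.1 },
  research:    { completeness: 0.45, accuracy: 0.3, codeQuality: 0.0, communication: 0.25 },
};

function compositeScore(
  taskType: string,
  scores: Record<Dimension, number>
): number {
  const weights = weightsByTaskType[taskType];
  let total = 0;
  for (const dim of Object.keys(weights) as Dimension[]) {
    total += weights[dim] * scores[dim];
  }
  return total;
}
```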
Memory Is Part Of Reliability
One of the hardest problems was organizational memory. Agents need to:
- Remember past decisions and their outcomes
- Learn from what worked and what didn't
- Share knowledge across agent boundaries
OrgX stores learnings in Supabase with semantic search capability. When an agent starts a new task, it queries relevant learnings:
const learnings = await queryOrgMemory({
  context: task.description,
  limit: 5,
  minRelevance: 0.7,
});
This creates a compounding advantage. The more work agents do, the better the organization's collective memory becomes. Frameworks that reset to zero every session leave that advantage on the table.
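Behind a `minRelevance` threshold like the 0.7 above usually sits a vector-similarity comparison. Here is a self-contained sketch of that filtering step, assuming learnings carry precomputed embedding vectors (embedding generation and the Supabase query itself are out of scope):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep only learnings above the relevance threshold, most relevant
// first, capped at `limit` — mirroring the query parameters above.
function filterLearnings<T extends { embedding: number[] }>(
  query: number[],
  learnings: T[],
  minRelevance: number,
  limit: number
): T[] {
  return learnings
    .map((l) => ({ l, sim: cosineSimilarity(query, l.embedding) }))
    .filter((x) => x.sim >= minRelevance)
    .sort((a, b) => b.sim - a.sim)
    .slice(0, limit)
    .map((x) => x.l);
}
```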
The 7-Repo Ecosystem
OrgX isn't a single repo. It's an ecosystem of seven repos, including:
- Core platform — Next.js app with Supabase backend
- MCP server — Tool definitions and handler implementations
- Claude Code skills — Custom skills for Claude Code integration
- Cursor plugin — IDE integration for agent orchestration
- CLI tools — Command-line interfaces for agent management
- OpenClaw plugin — Browser extension for agent control
This modularity was intentional. Each piece can evolve independently, and the MCP protocol ensures they all interoperate.
What I Would Push Earlier
Build the trust model first. I added governance after building the core orchestration, and retrofitting trust checks was painful.
Invest in observability from day one. Debugging multi-agent workflows without good tracing is like debugging distributed systems without logs.
Design for human-in-the-loop from the start. The approval workflow wasn't an afterthought, but I underestimated how central it would become.
The Real Split: Applications vs Infrastructure
Building agent infrastructure is fundamentally different from building agent applications. Applications use agents. Infrastructure makes agents reliable, governable, inspectable, and worth trusting.
The patterns that matter most:
- Durable workflows over request-response
- Trust-based governance over unlimited autonomy
- Quality gates at every state transition
- Organizational memory that compounds over time
- Protocol-first design (MCP) for interoperability
If a framework does not help with those things, it may still be useful. It is just solving a different problem.
If you're building in this space, I'd love to compare notes. Reach out at hopeatina@gmail.com or find me on GitHub.