TL;DR
Statewright is a state-machine guardrail for AI coding agents. Instead of trusting a prompt to make Claude Code, Codex, or Cursor “plan first, then implement, then test,” it enforces those phases at the protocol layer — the model can’t see the Edit tool during planning, and can’t run rm from a testing state, because the MCP server refuses to surface those tools. It hit Show HN on May 12, 2026 and got the comment section nodding: this is what observability stacks were supposed to fix and didn’t.
Key facts:
- Show HN May 12, 2026 — front page, positive reception around “make the problem smaller, not the model bigger”
- Apache 2.0 Rust engine (`crates/engine`), zero runtime deps, embeddable
- Plugins for Claude Code, Codex, opencode, Pi, and Cursor: the first four enforce at the protocol layer; Cursor is advisory
- Visual workflow editor at statewright.ai/workflows
- Measurable lift on local models: `gpt-oss:20b` and `gemma4:31b` went 2/10 → 10/10 on a 5-task SWE-bench subset with constraints enabled
- Bash discernment: even when `Bash` is allowed, write redirects (`>>`), destructive ops, and scripting interpreters are blocked in non-write states
- Free tier: 3 workflows, 200 transitions/month. Pro: $29/mo, 10 workflows, 2,500 transitions
- Limitations: provisional patent on protocol-layer tool gating (carve-outs for solo devs/self-hosters), 5-task SWE-bench isn’t the full benchmark, Cursor enforcement is advisory
If you’ve watched a coding agent re-read the same file five times without ever editing, or seen Claude Code happily run rm -rf because the system prompt “suggested” not to, Statewright is the most principled fix I’ve seen ship in 2026.
The Problem: Suggestions Aren’t Enforcement
The agentic coding workflow recommended in every blog post — “plan, implement, test, repeat” — works great when the model cooperates. The trouble is LLMs are trained to be helpful, not disciplined. So you get:
- Read-loop death spirals. Model re-reads the same file 5+ times, forgets each prior read mid-context, never edits, gives up.
- Phase blur. You said “plan first,” but the model is editing on turn 2 because the file looked easy.
- Destructive ops in test phases. “Verify the tests pass” becomes “let me clean up” becomes `rm -rf` in the wrong directory.
- Jailbreaks via context. The user asks the agent to ignore the system prompt; the agent obliges.
The standard fix is observability + bigger models. Datadog, Helicone, LangSmith — they tell you what went wrong, after it did. Bigger models reduce failures but don’t eliminate them, and cost more per turn.
Statewright bets differently: stop trusting the model. Define the workflow as a state machine. Bind each state to an explicit tool allow-list. When the model is in planning, it sees Read, Grep, Glob — full stop. Edit doesn’t exist in its tool list. You can’t be jailbroken into using a tool you can’t see.
How Statewright Works
The architecture has three layers:
1. The Rust engine (Apache 2.0)
crates/engine is a deterministic state-machine evaluator. No LLM in the loop. You hand it a MachineDefinition (states, transitions, guards, allowed tools), it tells you whether a transition is valid given the current state and context. Embeddable, no async runtime requirement.
```rust
use statewright_engine::{MachineDefinition, resolve_transition, validate_definition};

// Parse the workflow JSON and sanity-check the definition.
let def: MachineDefinition = serde_json::from_str(workflow_json)?;
validate_definition(&def)?;
// From `current_state`, is the "READY" event legal given `context`?
let next = resolve_transition(&def, current_state, "READY", &context)?;
```
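That `Result` is the whole enforcement contract at the embedding level. A minimal sketch of a host reacting to it, assuming only what the snippet shows (`apply_allow_list` is a hypothetical host-side function, and the concrete error type is the engine’s to define):

```rust
// Sketch only: `resolve_transition` and the "READY" event come from the
// snippet above; everything else is a hypothetical host-side detail.
match resolve_transition(&def, current_state, "READY", &context) {
    Ok(next) => {
        // Legal transition: swap the agent's visible tools to `next`'s allow-list.
        apply_allow_list(&next);
    }
    Err(denied) => {
        // Illegal transition: surface the refusal; nothing executes.
        eprintln!("transition refused: {denied}");
    }
}
```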
If you want guardrails without the cloud, ship just this. Single developer or single team self-hosting of the full stack is covered under FSL-1.1 (converts to Apache 2.0 on May 3, 2029).
2. The MCP plugin layer
Each integration is a plugin sitting between the agent and the engine:
- Claude Code — hooks + MCP, hard enforcement (tools physically hidden when not in the current state’s allow-list)
- Codex — hooks, hard enforcement (alpha)
- opencode — TypeScript plugin, hard enforcement (alpha)
- Pi — Skills extension, hard enforcement (alpha)
- Cursor — MCP + rules, advisory only because Cursor’s architecture doesn’t let MCP gate Cursor’s native tools
“Hard” means the tool call is rejected at the protocol layer before the model sees it. The model gets a message like “Edit is not available in state planning. Available tools: Read, Grep, Glob. Transition with statewright_transition.” This isn’t suggestion — it’s denied at the wire.
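The mechanism is easy to picture. The sketch below is illustrative, not Statewright’s plugin code (every name in it is hypothetical): the tool list the model receives is intersected with the current state’s allow-list before it crosses the wire, so a blocked tool is never advertised at all.

```rust
/// Illustrative only: a stand-in for an MCP plugin's view of the tool list.
/// Filtering happens before the model sees anything, so a jailbreak has no
/// hidden tool to reach for.
struct Tool {
    name: String,
}

fn visible_tools(all_tools: Vec<Tool>, allowed: &[&str]) -> Vec<Tool> {
    all_tools
        .into_iter()
        .filter(|t| allowed.contains(&t.name.as_str()))
        .collect()
}

fn main() {
    let all: Vec<Tool> = ["Read", "Grep", "Glob", "Edit", "Write", "Bash"]
        .iter()
        .map(|n| Tool { name: n.to_string() })
        .collect();
    // In `planning`, Edit/Write/Bash aren't denied; they simply aren't listed.
    for t in visible_tools(all, &["Read", "Grep", "Glob"]) {
        println!("{}", t.name); // prints Read, Grep, Glob
    }
}
```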
3. The workflow definition
A workflow is JSON. Here’s the canonical bugfix workflow from the README:
```json
{
"id": "bugfix",
"initial": "planning",
"states": {
"planning": {
"allowed_tools": ["Read", "Grep", "Glob"],
"max_iterations": 8,
"on": { "READY": "implementing" }
},
"implementing": {
"allowed_tools": ["Read", "Edit", "Write"],
"max_edit_lines": 20,
"max_files_per_state": 3,
"on": { "DONE": "testing" }
},
"testing": {
"allowed_tools": ["Read", "Bash"],
"allowed_commands": ["pytest", "cargo test", "npm test"],
"on": {
"PASS": { "target": "completed", "guard": "tests_passed" },
"FAIL_TEST": "implementing"
}
},
"completed": { "type": "final" }
},
"guards": {
"tests_passed": { "field": "test_result", "op": "eq", "value": "pass" }
}
}
```
Three things to notice here:
- States aren’t a DAG. `testing` can transition back to `implementing` on `FAIL_TEST`. This is the part most “agent orchestrators” get wrong: real agentic work loops, retries, and backs up. A DAG-based scheduler can’t model that without bolting on retry logic.
- Guards are programmatic predicates. The `tests_passed` guard reads `test_result` from the context and only transitions if it equals `"pass"`. The guard doesn’t ask the model “did the tests pass?”; it inspects state the engine controls. (A sketch of the mechanism follows this list.)
- Allow-lists are explicit, not derived. `testing` permits `Bash`, but only with `allowed_commands: ["pytest", "cargo test", "npm test"]`, prefix-matched. The model can’t run `bash -c "pytest && rm -rf ..."` because `bash -c` isn’t on the prefix list, and redirect-blocking would catch the cleanup anyway.
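To make the guard mechanics concrete, here’s a minimal re-implementation of what a predicate like `tests_passed` has to do: read a field out of engine-owned context and compare it, with no model in the loop. Field names match the workflow above; the code is a sketch of the idea, not the engine’s internals.

```rust
use serde_json::{json, Value};

/// Shape mirrors the workflow's guard object: { "field", "op", "value" }.
struct Guard {
    field: String,
    op: String,
    value: Value,
}

fn guard_passes(guard: &Guard, context: &Value) -> bool {
    match guard.op.as_str() {
        // Only "eq" appears in the workflow above; other ops would slot in here.
        "eq" => context.get(&guard.field) == Some(&guard.value),
        _ => false,
    }
}

fn main() {
    let tests_passed = Guard {
        field: "test_result".into(),
        op: "eq".into(),
        value: json!("pass"),
    };
    // Context is written by the engine from tool results, not by the model,
    // so the model can't claim its way through the gate.
    assert!(guard_passes(&tests_passed, &json!({ "test_result": "pass" })));
    assert!(!guard_passes(&tests_passed, &json!({ "test_result": "fail" })));
}
```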
Installing It in Claude Code (90 Seconds)
```
# Inside Claude Code
/plugin marketplace add statewright/statewright
/plugin install statewright
/reload-plugins
```
Then sign up at statewright.ai, generate an API key, paste it when prompted. (Recent Claude versions may flag the paste as suspicious — paste again and confirm.)
Start the bundled workflow:
```
❯ start the bugfix workflow — fix the failing tests in calc.py
◆ statewright — statewright_start (workflow: bugfix)
◆ [statewright] Workflow activated: bugfix
◆ statewright — statewright_get_state (MCP)
◆ Current phase: planning. Let me read the code first.
Read 2 files
[statewright] planning => implementing
◆ statewright — statewright_transition (READY)
Edit calc.py: 1 line changed
[statewright] implementing => testing
◆ statewright — statewright_transition (DONE)
Bash: pytest -x — 7 passed
[statewright] testing => completed
◆ [statewright] Workflow complete. 46 seconds.
```
What you don’t see in that trace is what didn’t happen: no thrashing, no re-reads, no aborted edit attempts. In planning the model saw five tools total (Read, Grep, Glob, plus the Statewright MCP tools), transitioned cleanly to implementing, made one edit, transitioned to testing, ran pytest, done.
The Benchmark That Got HN’s Attention
Statewright validates on a 5-task SWE-bench subset:
| Model | Size | Bug Fix (26 lines) | SWE-bench (5 tasks) |
|---|---|---|---|
| gemma3 | 3.3GB | FAIL | FAIL |
| gemma4:e2b | 7.2GB | PASS* | FAIL |
| gpt-oss:20b | 13.8GB | PASS | PASS (5/5) |
| gemma4:31b | 19.9GB | PASS | PASS (5/5) |
| llama3.3 | 42.5GB | PASS | PASS (2/2)† |
*with specialized edit_line tool adaptation. †tested on 2 of the 5 tasks.
Headline: two local models in the 13–20GB range went from 2/10 to 10/10 with constraints enabled. Same models, hardware, tasks. The only change was tool space narrowed per phase.
The 13GB threshold is the floor where models retain enough file content for accurate edits. Below that, no amount of state-machine discipline saves you — the model can’t hold the file in working memory. That’s a model limit, not a Statewright one.
Frontier models benefit differently: fewer tokens to completion. 8 turns of planning becomes 2 when you’ve narrowed to 3 tools. Across a team’s daily usage, real money.
Caveat: 5 tasks isn’t the full 2,294-instance SWE-bench. Directionally correct, but don’t quote them as “Statewright triples SWE-bench scores.” The author is upfront about this.
Bash Discernment: The Detail That Matters
This is the feature I didn’t know I wanted until I read the docs. Even when Bash is in a state’s allowed_tools, Statewright enforces sub-tool-level restrictions:
| Blocked in non-write states | Why |
|---|---|
| `>` and `>>` redirects | Writing through the shell bypasses Edit limits |
| `rm`, `shred`, `dd of=…` | Destructive ops |
| `bash -c`, `sh -c`, `python -c`, `node -e` | Scripting interpreters bypass command allow-lists |
| Pipes into `sh`/`bash` | Curl-pipe-to-shell |
`allowed_commands` is prefix-matched per state, so `testing` can grant `pytest` without granting `pytest tests/ && rm -rf ./tmp`. Combine that with `max_edit_lines` and `max_files_per_state` in `implementing`, and you’ve built a rate-limited, sandboxed coding agent without writing any policy code.
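Here’s a toy version of the two checks working together. Real shell vetting needs proper tokenization, so treat this purely as the policy shape, not Statewright’s parser:

```rust
/// Toy policy check: prefix-match against the state's allowed_commands,
/// then refuse write redirects, destructive ops, and interpreter escapes.
/// Substring matching is for illustration; a real implementation tokenizes.
fn command_allowed(cmd: &str, allowed_prefixes: &[&str]) -> bool {
    let prefix_ok = allowed_prefixes.iter().any(|p| cmd.starts_with(p));
    let blocked = [
        ">", "rm ", "shred ", "dd of=", "bash -c", "sh -c", "python -c",
        "node -e", "| sh", "| bash",
    ];
    prefix_ok && !blocked.iter().any(|b| cmd.contains(b))
}

fn main() {
    let allowed = ["pytest", "cargo test", "npm test"];
    assert!(command_allowed("pytest -x", &allowed));
    assert!(!command_allowed("pytest tests/ && rm -rf ./tmp", &allowed)); // destructive op
    assert!(!command_allowed("bash -c \"pytest\"", &allowed)); // interpreter escape
    assert!(!command_allowed("pytest > results.txt", &allowed)); // write redirect
}
```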
Other useful guardrails baked in:
- Approval gates: `requires_approval: true` pauses for human review before transitioning into a state. Useful for `deploy` or `merge_pr`.
- Environment scoping: `blocked_env` and `env_overrides` per state. Strip `AWS_*` during `planning`, inject `NODE_ENV=test` during `testing` (sketched below).
- Session isolation: state is keyed by `CLAUDE_SESSION_ID`, so parallel agent sessions don’t trample each other’s state machines.
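A sketch of the environment-scoping idea, with hypothetical signatures (the real config keys are `blocked_env` and `env_overrides`; everything else here is made up for illustration):

```rust
use std::collections::HashMap;

/// Illustrative: compute the environment a tool subprocess should see in a
/// given state. Blocked prefixes strip variables; overrides inject them.
fn scoped_env(
    base: &HashMap<String, String>,
    blocked_prefixes: &[&str],  // e.g. ["AWS_"] during planning
    overrides: &[(&str, &str)], // e.g. [("NODE_ENV", "test")] during testing
) -> HashMap<String, String> {
    let mut env: HashMap<String, String> = base
        .iter()
        .filter(|(k, _)| !blocked_prefixes.iter().any(|p| k.starts_with(p)))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    for (k, v) in overrides {
        env.insert(k.to_string(), v.to_string());
    }
    env
}

fn main() {
    let mut base = HashMap::new();
    base.insert("AWS_SECRET_ACCESS_KEY".into(), "hunter2".into());
    base.insert("PATH".into(), "/usr/bin".into());
    let testing = scoped_env(&base, &["AWS_"], &[("NODE_ENV", "test")]);
    assert!(!testing.contains_key("AWS_SECRET_ACCESS_KEY"));
    assert_eq!(testing.get("NODE_ENV").map(String::as_str), Some("test"));
}
```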
Community Reactions on HN
Three threads of feedback emerge from the Show HN comments:
1. The framing landed. Multiple commenters singled out “making the problem smaller, not the model bigger” as the right reframe. Bigger context windows have hit diminishing returns; tighter tool spaces show measurable lift. Not a new idea in research, but the first product I’ve seen ship it as a one-command install.
2. The patent caught flak — but the carve-outs landed. A provisional patent on “constraining LLM agent tool access at the protocol layer” raised eyebrows. The author’s response: defensive, aimed at preventing larger companies from blocking competitors. The PATENTS.md pledge covers solo devs, researchers, open source projects, and single-team self-hosted deployments. Apache 2.0 in full on May 3, 2029.
3. The integration breadth got respect. Shipping Day-1 with Claude Code, Codex, opencode, Pi, and Cursor (advisory) is unusual — most agent tooling launches with one and “soon” on the rest.
There’s also a recurring observation that Cursor users get less out of this because Cursor’s tool system isn’t MCP-gated.
Honest Limitations
Buried in the README, but worth surfacing:
- MCP support is a hard requirement (or hooks, for non-MCP agents like Codex). Agents that support neither get no enforcement at all.
- Workflow definitions are still hand-authored, though there’s a `statewright_create_workflow` tool that lets the agent generate them from the JSON schema. The visual editor at statewright.ai/workflows is faster for non-trivial machines.
- Cursor is advisory. Don’t pay for Statewright Pro if Cursor is 80% of your usage.
- 5-task SWE-bench is not the full benchmark. Treat the numbers as directional, not authoritative.
- Too-restrictive workflows get stuck. If you define a state that requires `pytest` to pass but your codebase uses `unittest`, the agent has no way out. `statewright_deactivate` is the escape hatch.
- Cloud or self-host, pick one. Self-hosting the full stack is FSL-1.1 (limited to a single dev or team). The engine alone is Apache 2.0.
Where It Fits
Statewright is most useful when:
- you run local models and want usable work out of 13–20GB models that otherwise flail;
- you’re paying for a frontier model and want to cut tokens-per-task via tighter tool scoping;
- you run agents in CI/production and need protocol-layer guardrails for compliance;
- you want approval gates on destructive transitions without writing a custom orchestrator.
It’s less useful for single-turn tasks, Cursor-primary workflows (advisory only), and open-ended research/exploration where unrestricted tool use is the point.
How It Compares
| Approach | Enforcement | Open Source | Visual Editor |
|---|---|---|---|
| Statewright | Protocol-layer, hard | Engine Apache 2.0 | Yes (cloud) |
| Prompt engineering / CLAUDE.md | Suggestion only | n/a | No |
| LangGraph | Code-level, soft | MIT | Studio (paid) |
| OpenAI Agent SDK handoffs | Code-level, soft | MIT | No |
| Custom hooks (claude-hooks etc.) | Hard, hand-rolled | Varies | No |
The differentiator isn’t the state-machine idea — LangGraph and others have offered that for a year. It’s that Statewright enforces at the agent’s protocol layer, not in your application code. That means it works even when you’re not the one calling the agent (e.g., a teammate using their Claude Code with your team’s workflows).
FAQ
Is Statewright a replacement for LangGraph? Not really. LangGraph is a code-level orchestration framework you call from your Python. Statewright is a guardrail that wraps an existing agent (Claude Code, Codex, etc.) and constrains what it can do. If you’re building an agent from scratch, LangGraph. If you’re hardening an off-the-shelf agent, Statewright.
Does it work with local Ollama models? Yes, and that’s where the validation numbers come from. gpt-oss:20b and gemma4:31b running locally via Ollama showed the biggest lift. You’ll want at least 13GB of model to clear the working-memory floor.
What’s the deal with the patent? A provisional patent covers the method of constraining LLM agent tool access at the protocol layer. The author’s framing is defensive — protecting against larger companies, not blocking solo devs or self-hosters. The PATENTS.md pledge explicitly covers independent implementations by solo developers, researchers, open source projects, and single-team self-hosted deployments. Apache 2.0 code becomes patent-free in 2029.
Can I run the whole thing offline? The Rust engine, yes — it’s embeddable, deterministic, and has no runtime dependencies. The plugin layer for Claude Code/Codex/etc. is also installable locally. The managed cloud (workflow storage, MCP gateway, run history) is the optional piece — useful for teams, skippable for solo work.
How does it compare to writing a CLAUDE.md with “first plan, then implement”? A CLAUDE.md is a suggestion the model can rationalize away. Statewright is enforcement the model can’t see around. In the SWE-bench subset, prompt-only discipline produced 2/10; Statewright produced 10/10 with the same models. It’s the difference between a code-review checklist and a CI lint: one relies on cooperation, the other is a wall.
Is the free tier actually usable? For one developer doing 1–2 workflows on personal projects, yes. 200 transitions/month is ~7/day — plenty for focused sessions. Heavy daily use pushes you to Pro ($29/mo, 2500 transitions). The engine itself is Apache 2.0 if you’d rather skip SaaS entirely.
Bottom Line
Statewright is the first agent-guardrail product I’d actually pay for. The framing is right (constrain the problem, not the model), the engineering is right (protocol-layer enforcement, not prompt suggestions), and the integration breadth is right (Claude Code, Codex, opencode, Pi on launch). The 2/10 → 10/10 SWE-bench lift on local models is the kind of number you don’t see often in agent tooling.
If you run coding agents in any production capacity — even just shipping side projects faster — install the Claude Code plugin and try the bugfix workflow on something real. Worst case, uninstall and you’re out 90 seconds. Best case, your agent stops re-reading the same file five times and finishes the task.
Repo: github.com/statewright/statewright. Docs: docs.statewright.ai. Visual editor: statewright.ai/workflows.