TL;DR
Cua (pronounced “cooa”) is open-source infrastructure for Computer-Use Agents (CUAs) — AI agents that don’t call APIs, they actually drive a computer. They see the screen, move the mouse, type into fields, and complete tasks the way a human does. Cua bundles the four pieces you need to build, train, and evaluate those agents seriously: agent-ready sandboxes, an SDK that gives you one API across every OS, a benchmarking suite (cua-bench), and Lume — the macOS/Linux virtualization layer that started the whole project.
The repo is one of the strongest performers on GitHub Trending this week with 15,000+ stars and ~1,300 added in the last seven days. It’s a Y Combinator X25 company that has stayed firmly committed to MIT-licensed open source.
Key facts:
- One SDK, every OS — Sandbox.ephemeral(Image.linux()) works the same for .macos(), .windows(), .android(), or BYOI (.qcow2/.iso)
- Cua Driver (new) — drives native macOS apps in the background without stealing your cursor, focus, or Space; works on non-AX surfaces like Chromium, Figma, Blender, DAWs
- MCP server included for Claude Code, Cursor, and any custom MCP client
- Lume uses Apple's Virtualization.framework for near-native macOS VM performance on Apple Silicon
- CuaBot (npx cuabot) — gives any coding agent (cuabot claude, cuabot openclaw) a sandboxed desktop with H.265 streaming and a shared clipboard
- Cua-Bench — runs OSWorld, ScreenSpot, Windows Arena, and custom datasets; exports trajectories for RL training
- MIT licensed, hosted at cua.ai
- Honest limitation: macOS-VM features only work on Apple Silicon, and the cloud-runtime (cua.ai) is a paid managed offering — fully self-hosted is supported but the local QEMU path requires real disk space and patience
If you’ve been building computer-use agents by gluing together pyautogui, screenshots, and an LLM and praying nothing breaks, Cua is the first project I’ve seen that treats this as actual infrastructure instead of a weekend hack.
The Problem: Computer-Use Agents Are Hard for Boring Reasons
The fun part of a computer-use agent is the model. The boring part — the part that eats 90% of your engineering time — is everything around the model:
- Where does it run? You don’t want a hallucinating agent clicking around on your real desktop.
- How do you see what it sees? Screenshots, accessibility trees, raw pixels — pick your poison.
- How do you move the mouse and keyboard? PyAutoGUI on the host, xdotool on Linux, AppleScript on macOS, SendInput on Windows. A different API for every OS.
- How do you reset between runs? VM snapshots? Container resets? Fresh installs?
- How do you measure if it’s actually getting better? OSWorld, ScreenSpot, Windows Arena all use different harnesses.
Most teams I’ve seen end up with a custom orchestrator that works on exactly one OS, breaks on the next macOS update, and is impossible to evaluate against published benchmarks. That’s the gap Cua is closing.
What Makes Cua Different
Cua is really four products under one repo. Knowing which piece does what saves a lot of time.
1. Cua Sandbox SDK — One API, Every OS
This is the headline. Instead of writing OS-specific glue code, you get a single Python interface:
# Requires Python 3.11+
from cua import Sandbox, Image
# Same API regardless of OS or runtime
async with Sandbox.ephemeral(Image.linux()) as sb:  # or .macos() .windows() .android()
    result = await sb.shell.run("echo hello")
    screenshot = await sb.screenshot()
    await sb.mouse.click(100, 200)
    await sb.keyboard.type("Hello from Cua!")
    await sb.mobile.gesture((100, 500), (100, 200))  # multi-touch gestures
That Sandbox.ephemeral() context manager is the whole pitch. The same five methods — shell.run, screenshot, mouse.click, keyboard.type, mobile.gesture — work whether the underlying runtime is a Linux container, a Windows VM, a macOS VM running on Apple Silicon, or an Android emulator. You can target Cua’s managed cloud at cua.ai or run the same image locally via QEMU. The compatibility matrix is the cleanest I’ve seen in the space:
| Runtime | Linux container | Linux VM | macOS | Windows | Android | BYOI |
|---|---|---|---|---|---|---|
| Cloud (cua.ai) | ✅ | ✅ | ✅ | ✅ | ✅ | 🔜 |
| Local (QEMU) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
That last column — BYOI (Bring Your Own Image) — is what turns Cua from a demo into infrastructure. You can hand it a .qcow2 of a Windows install with your specific software stack already configured and it just works.
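Here is a minimal sketch of what BYOI could look like from the SDK side. The Image.from_path(...) constructor and its argument are my assumptions (the post doesn't show the real call), but the driving code is the same handful of methods as before.
# Hypothetical sketch: the constructor for custom images is an assumption
# (check the SDK docs for the real name); the point is the driving code doesn't change.
import asyncio
from cua import Sandbox, Image

async def main():
    win = Image.from_path("win11-golden.qcow2")   # assumed helper for a prebuilt .qcow2
    async with Sandbox.ephemeral(win) as sb:
        await sb.shell.run("notepad.exe")         # same methods as every other target
        await sb.screenshot()

asyncio.run(main())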
2. Cua Driver — Background Computer-Use on macOS
This one is genuinely novel. Almost every existing computer-use agent on macOS works by hijacking your desktop: it grabs the cursor, switches to a target app, takes a screenshot, clicks, repeats. You can’t use the machine while it runs.
cua-driver runs the agent in the background. It clicks, types, and verifies state in target windows without stealing your cursor, focus, or current Space. More importantly, it works on the surfaces where macOS Accessibility APIs don’t help: Chromium-based web content, canvas-based tools (Figma, Blender, DAWs, game engines), and other “non-AX” applications.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh)"
It ships with a CLI and an MCP server, so Claude Code, Cursor, and any other MCP-aware agent can drive native macOS apps with a few tool calls. Every session records as a replayable trajectory — handy when you need to understand why your agent did the wrong thing on run #847.
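If you want to wire that into Claude Code, it's a one-time MCP registration. claude mcp add is Claude Code's standard registration command; the cua-driver mcp launch invocation below is an assumption about how the bundled server starts, so check the cua-driver README for the exact command.
# Register the bundled MCP server with Claude Code.
# "cua-driver mcp" as the server launch command is an assumption, not confirmed CLI.
claude mcp add cua-driver -- cua-driver mcp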
3. CuaBot — Drop-In Sandbox for Any Agent
cuabot is the consumer-facing layer. It’s a single npx command that wraps the heavy machinery so any coding agent can get a sandboxed desktop with one line:
npx cuabot # First-time setup wizard
# Run an agent inside a sandbox
cuabot claude # Claude Code in a fresh desktop
cuabot openclaw # OpenClaw in a sandbox
# Drive a sandboxed GUI workflow directly
cuabot chromium
cuabot --screenshot
cuabot --type "hello"
cuabot --click 100 200
The clever part is the rendering: individual sandbox windows appear natively on your real desktop using H.265 streaming, with a shared clipboard and audio passthrough. There’s no “open this URL to see your VM” step. It looks and feels like just another app.
4. Cua-Bench — The Part Researchers Care About
cd cua-bench
uv tool install -e . && cb image create linux-docker
# Run benchmark with agent
cb run dataset datasets/cua-bench-basic --agent cua-agent --max-parallel 4
cua-bench runs your agent against OSWorld, ScreenSpot, Windows Arena, and custom task datasets, then exports trajectories you can use for fine-tuning or RL. --max-parallel 4 is the important part: spinning up four sandbox VMs at once and running a full benchmark suite used to require a custom Kubernetes cluster. Now it's a flag.
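The export is what feeds fine-tuning or RL. The post doesn't document the trajectory format, so the JSONL layout assumed below (one step per line with an observation, an action, and an optional reward) and the file path are illustrative only; adapt the keys to whatever cb actually writes.
# Hypothetical trajectory reader; field names and file layout are assumptions.
import json
from pathlib import Path

def load_steps(path: str):
    """Yield (screenshot_path, action, reward) tuples from a JSONL trajectory export."""
    for line in Path(path).read_text().splitlines():
        step = json.loads(line)
        yield step["observation"]["screenshot"], step["action"], step.get("reward", 0.0)

# Feed these into your fine-tuning or RL data pipeline
steps = list(load_steps("runs/cua-bench-basic/task_0001.jsonl"))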
5. Lume — The Foundation
lume is the original project that became Cua. It's a CLI for creating and running macOS and Linux VMs on Apple Silicon using Apple's Virtualization.framework directly, instead of going through Docker Desktop / Parallels / VMware.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
lume run macos-sequoia-vanilla:latest
That one command pulls a vanilla Sequoia image and boots it. Performance is near-native because there’s no extra hypervisor layer. If you’ve ever tried to test your agent on real macOS in CI, you know how rare that is.
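Beyond run, the CLI covers the usual image and VM lifecycle. lume pull is referenced later in this post; the other subcommands below are my best guess at the standard set, so confirm the exact names with lume --help.
lume pull macos-sequoia-vanilla:latest   # fetch the image up front (large download)
lume ls                                  # list local VMs and images (subcommand name assumed)
lume stop macos-sequoia-vanilla          # shut the VM down
lume delete macos-sequoia-vanilla        # reclaim the disk space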
Quickstart: Drive a Sandbox in 90 Seconds
The cleanest first run is the Linux container target — no VM disks, no licensing weirdness.
# Install
pip install cua
# Make sure Docker is running, then:
python - <<'PY'
import asyncio
from cua import Sandbox, Image
async def main():
    async with Sandbox.ephemeral(Image.linux()) as sb:
        await sb.shell.run("apt-get update && apt-get install -y firefox-esr")
        await sb.shell.run("firefox &")
        await asyncio.sleep(3)
        png = await sb.screenshot()
        with open("firefox.png", "wb") as f:
            f.write(png)
        await sb.mouse.click(640, 360)
        await sb.keyboard.type("hello cua")
asyncio.run(main())
PY
You get back firefox.png showing a real Firefox window inside a real Linux desktop. From here it’s the same code path whether you swap Image.linux() for Image.macos() (Apple Silicon required) or Image.windows().
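Concretely, the swap is one argument (the macOS image needs an Apple Silicon host, and Windows images bring their own licensing considerations):
# Identical driving code, different OS underneath (Apple Silicon host required for macOS)
async with Sandbox.ephemeral(Image.macos()) as sb:
    await sb.keyboard.type("hello from a macOS sandbox")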
A More Realistic Agent: Vision-Loop Sketch
If you want to wire a vision-capable LLM directly to Cua, the loop is short. Take a screenshot, ask the model for the next action as JSON, dispatch it via sb.mouse.click / sb.keyboard.type, repeat until done.
import asyncio, base64, json
from cua import Sandbox, Image
from anthropic import AsyncAnthropic
claude = AsyncAnthropic()
async def loop():
    async with Sandbox.ephemeral(Image.linux()) as sb:
        await sb.shell.run("firefox https://en.wikipedia.org &")
        await asyncio.sleep(4)
        for _ in range(8):
            shot = await sb.screenshot()
            msg = await claude.messages.create(
                model="claude-opus-4-7", max_tokens=512,
                messages=[{"role": "user", "content": [
                    {"type": "image", "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(shot).decode()}},
                    {"type": "text", "text":
                        "Output ONE action as JSON: "
                        "{action:'click'|'type'|'done', x, y, text}."}
                ]}],
            )
            # Parse the model's reply and dispatch it (assumes the reply is bare JSON)
            step = json.loads(msg.content[0].text)
            if step["action"] == "click":
                await sb.mouse.click(step["x"], step["y"])
            elif step["action"] == "type":
                await sb.keyboard.type(step["text"])
            else:
                break
asyncio.run(loop())
If you don’t want to write that loop yourself, cua-agent ships one — and Cua-Bench uses --agent cua-agent as the default baseline.
Community Reactions
I went through the HN launch thread, the r/LocalLLaMA, r/cursor, and r/AI_Agents discussions, and the open GitHub issues. The reactions clustered into three groups:
Builders who’d been suffering: The “finally” reaction was strong. From the r/cursor launch thread: “Computer provides a PyAutoGUI-compatible interface that can be plugged into any AI agent system (OpenAI Agents SDK, Langchain, CrewAI, AutoGen, etc.).” The secure-environment angle keeps coming up — people don’t want to give Claude an unrestricted shell on their MacBook.
Skeptics on the “screen pixels” approach: A recurring r/AI_Agents critique is that pixel-driving agents are inherently brittle. “Guardrails + recovery when it misclicks” is the real challenge. Cua’s trajectory recording helps with debugging but doesn’t solve brittleness; that’s still on the model.
The “Docker for CUA” framing landed: From r/LocalLLaMA: “Cua is the Docker for Computer-Use Agent.” That framing is doing a lot of work — and it’s accurate: ephemeral, image-based, ports-and-volumes shaped, easily snapshotted.
General HN sentiment: one of the more substantive YC-batch open-source agent projects, because the founders kept shipping infrastructure (Lume, the driver, cua-bench) instead of just an API wrapper.
Honest Limitations
After spending real time with it, here’s what I’d want to know before committing:
- Apple Silicon only for macOS targets. Lume uses Virtualization.framework, which means M-series Macs only. Intel Macs and non-Apple hosts can't run macOS sandboxes. Linux/Windows/Android targets work everywhere.
- The cloud runtime is a paid product. Self-hosted via local QEMU is fully supported and MIT-licensed, but if you want the managed cua.ai cloud you're on a usage-based plan. That's fine — somebody has to pay for the GPUs and disks — just be clear which path you're on.
- Disk and bandwidth. A macOS VM image is ~50 GB. A Windows image with software preinstalled is comparable. The first lume pull is a coffee break, not a 30-second affair.
- The agent itself is still your problem. Cua gives you the substrate. The model that decides where to click is up to you. cua-agent provides a default ReAct loop, but state-of-the-art accuracy on OSWorld is still in the 30–50% range for the best models — this isn't a "ship to grandma" technology yet.
- MCP server feature parity. The MCP integration for Claude Code is solid for the Driver path (macOS native apps). The full Sandbox SDK over MCP is more recent — expect rough edges if you're driving Windows VMs through Claude Code today.
- The repo moves fast. 1,300+ stars per week is great for momentum and bad for stable APIs. Pin your cua version (a minimal pin example follows this list), read the changelog before upgrading, and keep an eye on the Discord for breaking changes.
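A minimal pin, with a placeholder version (use whatever you actually tested against):
pip install "cua==0.4.1"   # placeholder version number, not a recommendation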
None of these are dealbreakers. They’re the normal cost of being early on a serious-looking infrastructure project.
Where Cua Fits vs. Alternatives
- OpenAI Operator / Anthropic Computer Use API — these are models, not infrastructure. You still need a sandbox to point them at. Cua is the sandbox.
- Anchor Browser, Browserbase, Steel — browser-only sandboxes. Excellent for web automation, useless for “open Photoshop and apply this filter.”
- PyAutoGUI / Selenium / Playwright — host-level automation. No sandbox, no cross-OS API, no benchmark harness.
- E2B Desktop, Daytona — closest competitors. E2B Desktop is solid for Linux. Cua’s edge is the multi-OS coverage (macOS especially) and the bench/Driver pieces.
- Skyvern, MultiOn — full vertical products. They include the model. Cua is BYO model.
If you’re building a vertical product and just want a working agent, look at Skyvern. If you’re building infrastructure, training a model, or shipping an open-source CUA, Cua is the most complete starting point right now.
FAQ
Q: Do I need an Apple Silicon Mac to use Cua? No. You only need Apple Silicon if you want to spin up macOS sandboxes. Linux containers, Linux VMs, Windows VMs, and Android emulators run on any host that can run QEMU — that includes Linux servers, Intel Macs, and Windows machines via WSL2.
Q: How is Cua different from running an agent against my real desktop with PyAutoGUI? Two things: isolation and reproducibility. PyAutoGUI clicks on your desktop, sees your files, and breaks the moment a notification pops up. Cua runs the whole thing in an ephemeral VM or container, so the agent’s state is reset on every run, your real machine is untouched, and the same script runs identically on a teammate’s laptop or a CI runner.
Q: Does Cua include the AI model, or do I bring my own?
You bring your own. Cua provides the sandbox, the SDK, and the benchmark harness. It works with any vision-capable LLM — Claude, GPT-4o, Gemini, Qwen-VL, the recent open-source vision models. cua-agent provides a default ReAct loop you can plug a model into, but the model itself is your choice.
Q: Can I use Cua with Claude Code or Cursor?
Yes. The cua-driver package ships an MCP server. Add it to Claude Code or Cursor’s MCP config and your coding agent gains a screenshot, click, type set of tools that operate on real native macOS apps in the background. cuabot claude goes a step further and runs Claude Code itself inside a sandboxed desktop.
Q: Is the cloud version free?
The open-source SDK and local QEMU path are MIT-licensed and free forever. The managed cua.ai cloud — where you don’t have to provision your own VMs — is a paid usage-based service. Most hobbyists will be fine on local-only.
Q: How does Cua-Bench compare to OSWorld directly? Cua-Bench can run OSWorld (it’s one of the supported datasets), but it adds parallel execution, trajectory export, and dataset versioning on top. If you’re doing serious eval work, it’s a strict superset. If you just want to run the official OSWorld harness once, the original repo is fine.
Verdict: Worth Your Bookmark
Cua is in the small group of agent-infrastructure projects that look like they’ll still matter in 18 months. The combination of one-API-many-OSes, the macOS-background driver, an actual benchmark harness, and an MCP server makes it the closest thing to “Docker for desktops” that anyone has shipped. The 15K stars and 1,300/week growth aren’t accidents.
If you’re doing anything with computer-use agents — even just experimenting — clone the repo, run the Linux quickstart, and bookmark cua-bench for when you start caring about numbers instead of demos.
Repo: github.com/trycua/cua
Docs: cua.ai/docs
License: MIT