Microsoft Fara-7B Review: 7B Computer-Use Agent SOTA

TL;DR

Fara-7B is Microsoft’s first agentic small language model purpose-built for computer use — the family of “click, type, scroll, look at the screen, repeat” tasks that have dominated AI agent benchmarks since the Anthropic Computer Use launch in late 2024. At just 7 billion parameters, it lands ahead of OpenAI’s computer-use-preview, UI-TARS-1.5-7B, and GLM-4.1V-9B-Thinking on the WebTailBench benchmark while running locally on a single 24GB GPU.

Key facts:

Released by Microsoft Research, November 24, 2025; GitHub passed 5.8K stars with 370 new this week as the Fara-1.5 agent harness teases approach
Vision-language base: Qwen2.5-VL-7B fine-tuned with 145K synthetic trajectories generated via the Magentic-One multi-agent framework
State of the art at 7B: 73.5% WebVoyager, 34.1% Online-Mind2Web, 38.4% WebTailBench — beats UI-TARS-1.5 across every category
Pixel-native, no a11y trees: predicts mouse/keyboard coordinates directly from screenshots, no accessibility-tree dependency
Efficient: averages 16 steps per task vs. ~41 for comparable agents — 2.5× fewer LLM calls
Runs locally via vLLM on a 24GB GPU; or hosted on Azure Foundry for zero-setup
Magentic-UI integration: drop Fara-7B into Microsoft’s open-source agent UI for a working browser-controlling assistant in 5 minutes
Microsoft also released WebTailBench (609 real-world tasks across 11 segments) and CUAVerifierBench (judge-evaluation benchmark)
Limitations: experimental release, English-web focus, sensitive to anti-bot defenses, Microsoft explicitly warns to sandbox it

If you’ve been watching computer-use models from a distance, this is the first openly downloadable small model that genuinely competes with the proprietary SOTA from OpenAI and Anthropic.

Why a 7B Computer-Use Model Matters

For the past 18 months, “agent that uses your computer” has been the most expensive way to use AI. OpenAI’s computer-use-preview costs around $0.15 per task. Anthropic’s Claude Sonnet with computer-use costs more per multi-step session. UI-TARS, the previous open-source SOTA, needed careful prompting and bigger backbones to hit a useful success rate. The economics ruled out anything that needed to run thousands of agent sessions per day — agentic web scraping, batch shopping comparison, mass form-filling, QA test generation.

Fara-7B changes that math in two ways. First, a 7B vision-language model runs comfortably on consumer hardware — a single RTX 4090, an M-series Mac with 32GB unified memory, or a $0.50/hour cloud GPU. Second, because Microsoft trained Fara specifically to predict pixel coordinates from screenshots (no accessibility tree parsing, no separate vision module to wire up), the inference pipeline is just vllm serve plus Playwright. There’s no extra orchestration tax.

The combination changes who can deploy this. A solo developer can now host Fara-7B locally on their gaming PC, route Playwright through it, and have an agent that books flights, fills forms, and shops across retailers — at zero per-task cost. The published benchmarks suggest the success rate is good enough for production on focused task families, even if the macro average still trails GPT-4o-prompted SoM agents.

The Numbers: How Fara Stacks Up

Microsoft published clean head-to-head numbers across four standard web-agent benchmarks. The interesting comparison is Fara-7B vs. similarly-sized open-source models and proprietary computer-use APIs:

Model	Params	WebVoyager	Online-M2W	DeepShop	WebTailBench
SoM Agent (GPT-4o-0513)	–	90.6	57.7	49.1	60.4
SoM Agent (o3-mini)	–	79.3	55.4	49.7	52.7
OpenAI computer-use-preview	–	70.9	42.9	24.7	25.7
GLM-4.1V-9B-Thinking	9B	66.8	33.9	32.0	22.4
Fara-7B	7B	73.5	34.1	26.2	38.4
UI-TARS-1.5-7B	7B	66.4	31.3	11.6	19.5

Three things to notice:

Among native computer-use models, Fara-7B is now SOTA at any size class up to its own. It beats OpenAI’s computer-use-preview on WebVoyager and on Microsoft’s new WebTailBench (38.4 vs 25.7), and trounces UI-TARS-1.5 at the same parameter count (19.5 → 38.4 is a 97% lift on WebTailBench).
GPT-4o with Set-of-Marks (SoM) prompting still wins overall, but the gap matters less than the price/availability difference. SoM agents need GPT-4o-class APIs and accessibility tree parsing infrastructure. Fara-7B needs a 24GB GPU.
WebTailBench breakdown shows category strengths: Fara-7B leads computer-use models in every single category (Shopping 52.4, Flights 37.9, Hotels 53.8, Restaurants 47.4, Real Estate 23.6, Multi-step Shopping List 49.0). The weakest area is multi-step compositional tasks — same blind spot as every other agent.

The efficiency stat is the underrated headline: Fara-7B completes its average task in 16 steps versus ~41 steps for comparable models. That means 2.5× fewer LLM calls, 2.5× less token spend if you’re paying for inference, and 2.5× lower latency end-to-end. For agentic workloads at scale, this matters more than the raw success rate.

How It Actually Works

Fara-7B starts from Qwen2.5-VL-7B as the base vision-language model, then fine-tunes on 145,000 synthetic trajectories generated by Magentic-One — Microsoft Research’s multi-agent framework for solving complex web tasks. The training data pipeline goes:

Magentic-One agents (using GPT-4o under the hood) solve thousands of web tasks while logging every screenshot + action.
A verifier filters out failed trajectories using the same Universal Verifier (MMRubricAgent) that’s later released as part of WebTailBench.
The clean trajectories — observation (screenshot) + action (click/type/scroll coords) sequences — become supervised fine-tuning data for Qwen2.5-VL-7B.

The result is a model that looks at screenshots and predicts mouse/keyboard coordinates directly. No accessibility tree. No DOM serialization. No HTML-to-text intermediate step. This is the same paradigm as UI-TARS and Claude’s computer use, but with a much smaller backbone and a sharper synthetic-data recipe.

The inference loop is dead simple:

# Pseudocode of the inner Fara agent loop
while not done:
    screenshot = playwright.page.screenshot()
    action = fara_model.predict(screenshot, task, history)
    # action is one of: click(x,y), type(text), scroll(dy), key(name), done(answer)
    playwright.execute(action)
    history.append((screenshot, action))

Real Code: Running Fara-7B Locally in 5 Minutes

The actual onramp from a clean Ubuntu box with a 24GB GPU:

# 1. Clone and install
git clone https://github.com/microsoft/fara.git
cd fara
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[vllm]
playwright install

# 2. Serve the model (in one terminal)
vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
# If you OOM: add --tensor-parallel-size 2 across two GPUs

# 3. Run a task (in another terminal)
fara-cli --task "what's the weather in New York right now"

vLLM downloads the weights from HuggingFace, spins up an OpenAI-compatible inference server on port 5000, and fara-cli launches a headless Chromium via Playwright, screenshots the page, and feeds the loop. The first task takes ~90 seconds end to end including model load; subsequent tasks are 15–60 seconds depending on complexity.

For a UI, plug it into Magentic-UI:

# In a separate clone of microsoft/magentic-ui
docker compose up -d
# Then point Magentic-UI's computer-use model at http://localhost:5000

You get a browser-controlling chat interface that watches the model click around in real time. Video demos in the repo show shopping for an Xbox across multiple retailers, filing a GitHub issue, and pulling driving directions with a constraint (“the route should pass a cheese shop”) — all driven by Fara-7B running locally.

Honest Limitations

Microsoft is explicit about the experimental scope, and a week of poking confirms the rough edges:

English-language web only. Training data is overwhelmingly English. Performance on Chinese / Japanese / Korean sites degrades sharply.
Anti-bot vulnerable. Like every browser-driving agent, Fara gets caught by Cloudflare Turnstile, hCaptcha, and aggressive rate-limiting. Use residential proxies for any non-trivial scraping workload.
Sensitive-domain warning. Microsoft’s README explicitly says: “avoiding sensitive data or high-risk domains.” Don’t point this at your bank account or healthcare portal.
Coordinate hallucination. On novel layouts or unusual screen resolutions, Fara occasionally clicks 50–80 pixels off. The 16-step average hides a long tail of “agent clicked nowhere productive three times before recovering.”
Multi-step compositional tasks struggle. WebTailBench’s Compositional Tasks segment is the weakest at 23.0%. If your task involves “do A, save the result, then use it in B on a different site,” budget for failure.
Windows-native is rough. WSL2 is strongly recommended. Native Windows install requires manual playwright install plumbing.
License nuance. The model weights are released for research use; commercial deployment requires reviewing the Microsoft Research License + the underlying Qwen2.5-VL license. Not a clean Apache-2.0.

Community Reaction

The Hacker News thread on the Microsoft Research blog post hit the front page on November 24, 2025, with 740 points. The dominant takes:

“This is the first 7B computer-use model I’d actually try in production. The efficiency stat — 16 steps vs 41 — is what matters for cost-sensitive automation.” — top HN comment, 312 points
“Microsoft quietly shipped the open-weight equivalent of computer-use-preview six months after OpenAI charged for it. Wild.” — second-top comment
“Synthetic data from Magentic-One is the actual moat. Anyone can fine-tune Qwen2.5-VL — Microsoft’s data pipeline is what other teams will struggle to replicate.”

The skeptical takes are equally sharp:

The WebTailBench gap to GPT-4o SoM (38.4 vs 60.4) is large. For applications that need 90%+ success rates, this still isn’t ready.
Microsoft’s benchmark is Microsoft’s benchmark. Independent replication is still pending.
Several researchers noted that Magentic-One was used both to generate training data and run baselines in some intermediate experiments — Microsoft addressed this in the paper but the optics warrant external replication.

The Fara-1.5 announcement teased for 2026-05-21 (“agent harness coming soon!”) suggests Microsoft sees this as a platform, not a one-shot release.

How It Compares to the Alternatives

Model	Open weights	Params	Cost per task	WebTailBench	Best for
Fara-7B	Yes	7B	~$0.0 local	38.4	Local automation, scale
UI-TARS-1.5-7B	Yes	7B	~$0.0 local	19.5	Now obsolete vs Fara
OpenAI computer-use-preview	No	—	$0.15/task	25.7	Hosted, no setup
GLM-4.1V-9B-Thinking	Yes	9B	~$0.0 local	22.4	Reasoning-heavy tasks
GPT-4o + SoM agent	No	—	$0.20+/task	60.4	Highest success rate

The decision framework is clear: if you can afford GPT-4o SoM and need maximum success rate, that still wins. For everything else — local, batch, privacy-sensitive, or cost-sensitive workloads — Fara-7B is the new default.

FAQ

Q: What hardware do I need to run Fara-7B locally?

24GB VRAM is the practical minimum for single-GPU inference with --dtype auto (BF16). An RTX 4090, A5000, or RTX 3090 all work. With --tensor-parallel-size 2 you can split across two smaller GPUs. Apple Silicon: M2 Max / M3 Max / M4 Max with 32GB+ unified memory runs it via vLLM’s MPS backend, though throughput is lower than NVIDIA. No GPU? Use Azure Foundry hosting — Microsoft set up Fara-7B as a deployable model.

Q: How does Fara-7B compare to Anthropic Claude computer use?

Claude Sonnet with computer use is a larger, more capable model that does well on tasks requiring real-world reasoning. Fara-7B is purpose-built and 7B — it loses on reasoning-heavy multi-step tasks, wins on cost, latency, and step efficiency. Different tools. For “click 50 listings on Zillow and extract prices,” Fara. For “read this 12-step regulatory checklist and decide if my form complies,” Claude.

Q: Can I commercially deploy Fara-7B?

Read the license carefully. Fara-7B inherits from Qwen2.5-VL (Tongyi Qianwen license, restrictive on >100M MAU products) and adds Microsoft’s research-use terms. For most startup and indie deployments this is fine; for FAANG-scale or regulated-industry use, check with legal. Don’t assume Apache-2.0.

Q: What is WebTailBench and why should I care?

WebTailBench is Microsoft’s new 609-task benchmark covering 11 real-world web task categories that older benchmarks (WebVoyager, Online-Mind2Web) underrepresented — shopping, flights, hotels, restaurants, ticketing, real estate, jobs, plus three multi-step categories. It’s a more realistic measure of “is this agent useful for actual web automation work?” than older sets. The V2 release (May 2026) refreshed all calendar-bound tasks and revised rubrics.

Q: Does Fara-7B work with OpenClaw / Claude Code / Cursor?

Not natively — those are coding-agent harnesses, not browser-controlling computer-use frameworks. The intended host is Magentic-UI (Microsoft’s open-source agent UI). You can also wire Fara-7B into custom Playwright loops or into the Fara-Agent reference class. Combining a coding agent (for writing automation scripts) with a computer-use model (for executing them in a browser) is an interesting pattern — see the andrew.ooo Agent-Reach review for the read-only side of that stack.

Verdict: The First Truly Practical Open Computer-Use Model

Fara-7B isn’t the most capable computer-use agent in existence — GPT-4o-prompted SoM agents still hold the top of the WebTailBench leaderboard. But it’s the first openly downloadable, locally-runnable, 7B-parameter computer-use model that’s good enough to be useful for real work. The efficiency advantage (16 steps vs 41) and the elimination of per-task API cost flip the economics for any workload that needs to run thousands of agent sessions.

The release also signals where Microsoft Research’s agent strategy is going: small specialist models trained on synthetic data from larger multi-agent frameworks, then released openly to seed an ecosystem. The Fara-1.5 agent harness teased for later this year suggests this is the first in a series.

Recommended for indie developers and small teams doing browser automation, agentic web scraping, QA test generation, batch form-filling, and any privacy-sensitive workflow where data shouldn’t leave local hardware. Skip if you need 90%+ success rates on complex multi-site reasoning — pay for GPT-4o SoM in that case.

Star github.com/microsoft/fara, pull the weights from HuggingFace, and fara-cli --task "do something useful" ninety seconds later.