video-use Review: browser-use Team's AI Video Editor

The browser-use team built their reputation by giving LLMs a structured DOM instead of screenshots. Now they’ve applied the same trick to video.

video-use is a 100% open-source skill that turns Claude Code, Codex, Hermes, OpenClaw — or any coding agent with shell access — into a full video editor. You drop raw footage into a folder, tell your agent what you want, and it hands you back final.mp4. No timeline. No menus. No presets.

The project shipped in April 2026. As of early July 2026 it had passed 13,000 GitHub stars, with more than 3,000 added in a single week, making it one of the fastest-growing repos in the browser-use org. It sits in a category that barely existed six months ago: coding agents editing video from raw source files.

I spent a few days running it against podcast rushes, product demos, and travel B‑roll. This is the review.

What video-use actually is

Strip the marketing away and video-use is three things:

A skill package (SKILL.md + helpers/*.py) any Anthropic-style agent can load, with production rules like “subtitles applied last” and “30ms audio fades at every cut boundary.”
A transcript-first data pipeline that turns raw video into a compact phrase-level Markdown transcript — using ElevenLabs Scribe for word-level timestamps and speaker diarization.
A thin ffmpeg orchestration layer (render.py, grade.py, timeline_view.py) so the LLM expresses edits as data (edl.json) and lets plain ffmpeg encode.
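
To make "edits as data" concrete, here is a minimal sketch of what an edit decision list could look like. The field names (source, start, end, grade) and the Segment class are hypothetical, picked for illustration; the real edl.json schema lives in the repo and may look different.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical EDL shape for illustration only; the real edl.json
# schema in video-use may use different field names.
@dataclass
class Segment:
    source: str      # path to the raw take
    start: float     # in-point in the source, seconds
    end: float       # out-point in the source, seconds
    grade: str = "neutral_cinematic"

def total_duration(segments):
    """Length of the finished cut: the sum of the kept ranges."""
    return sum(s.end - s.start for s in segments)

def to_json(segments):
    """Serialize with sorted keys so the same edit always produces the same text."""
    return json.dumps({"segments": [asdict(s) for s in segments]},
                      indent=2, sort_keys=True)

edl = [
    Segment("take_001.mov", 2.52, 6.74),
    Segment("take_003.mov", 48.0, 132.0),
]
print(round(total_duration(edl), 2))
print(to_json(edl))
```

The point of the shape is that the model only ever edits a small list of numbers and names; anything that touches pixels is left to the renderer.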

The core insight, from the README:

Naive approach: 30,000 frames × 1,500 tokens = 45M tokens of noise. Video Use: 12KB text + a handful of PNGs.

The LLM never watches the video. It reads it.
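
The README's arithmetic is easy to check. The sketch below reuses its figures; the 4-bytes-per-token conversion is my own rough assumption, and the handful of PNGs is left out of the count.

```python
# Back-of-the-envelope comparison using the README's own figures.
frames = 30_000
tokens_per_frame = 1_500
naive_tokens = frames * tokens_per_frame

# Assumption (not from the README): roughly 4 bytes of text per token.
transcript_bytes = 12 * 1024
transcript_tokens = transcript_bytes // 4

print(f"naive: {naive_tokens:,} tokens")
print(f"transcript: ~{transcript_tokens:,} tokens")
print(f"ratio: ~{naive_tokens // transcript_tokens:,}x")
```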

The architecture in one diagram

Raw footage/
├── take_001.mov      ──► transcribe.py ──► transcripts/take_001.json  (cached)
├── take_002.mov      ──► transcribe.py ──► transcripts/take_002.json  (cached)
└── ...
                                │
                                ▼
                        pack_transcripts.py
                                │
                                ▼
                     edit/takes_packed.md  ◄── LLM's primary "reading view"
                                │
              ┌─────────────────┴──────────────────┐
              ▼                                    ▼
   LLM proposes strategy               timeline_view.py (on demand)
   (plain English, waits              filmstrip + waveform PNG at
    for user confirmation)             ambiguous decision points
              │
              ▼
        edit/edl.json  (cut decisions as data)
              │
              ▼
   render.py: per-segment extract → -c copy concat → overlays → subtitles LAST
              │
              ▼
   Self-eval: timeline_view on the RENDERED output at every cut boundary
              │
              ▼
        edit/final.mp4

Two design decisions carry most of the weight:

Audio is primary, visuals follow. Cuts are always proposed on speech boundaries and silence gaps from the Scribe transcript. Visuals are only sampled at ambiguous moments (a long pause: was it a “thinking beat” worth keeping, or dead air worth cutting?).
Ask → confirm → execute → self-eval → persist. The agent has to describe its plan in plain English before it touches the cut, and it re-renders + re-inspects before it shows you anything.
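
Finding candidate cut points from word timestamps takes only a few lines. This is a sketch under assumptions: the word dictionaries (text, start, end) are a simplified stand-in rather than the actual Scribe response format, and the 0.5s threshold is borrowed from the transcript packer's phrase-break setting.

```python
def silence_gaps(words, min_gap=0.5):
    """Return (gap_start, gap_end) pairs between consecutive words
    where the pause is at least min_gap seconds."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] >= min_gap:
            gaps.append((prev["end"], nxt["start"]))
    return gaps

# Simplified word records; the field names are assumptions.
words = [
    {"text": "We", "start": 6.08, "end": 6.25},
    {"text": "fixed", "start": 6.27, "end": 6.55},
    {"text": "this.", "start": 6.57, "end": 6.74},
    {"text": "Wait,", "start": 7.90, "end": 8.20},
]
print(silence_gaps(words))
```

Each gap is a place where a cut can land without clipping speech. Whether a long gap is a thinking beat or dead air is the judgment call the agent makes by sampling visuals.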

Installation: really is one paste

The install flow is deliberately paste-into-your-agent-and-walk-away. From the README:

Set up https://github.com/browser-use/video-use for me.
Read install.md first to install this repo, wire up ffmpeg,
register the skill with your agent, set up the ElevenLabs API
key, then read SKILL.md and helpers/.

That paragraph works because agents like Claude Code and Codex have a skills directory convention. The install flow clones the repo, symlinks it into ~/.claude/skills/video-use/ (or ~/.codex/skills/video-use/), runs uv sync, verifies ffmpeg/ffprobe, and prompts for ELEVENLABS_API_KEY.

Or do it by hand in four steps:

# 1. Clone and symlink into your agent's skills directory
git clone https://github.com/browser-use/video-use ~/Developer/video-use
mkdir -p ~/.claude/skills   # ln fails if the skills directory doesn't exist yet
ln -sfn ~/Developer/video-use ~/.claude/skills/video-use   # Claude Code
# mkdir -p ~/.codex/skills && ln -sfn ~/Developer/video-use ~/.codex/skills/video-use  # Codex

# 2. Install Python deps (ffmpeg-python, elevenlabs, pillow, etc.)
cd ~/Developer/video-use
uv sync   # or: pip install -e .

# 3. System binaries
brew install ffmpeg       # required
brew install yt-dlp       # optional, for downloading online sources

# 4. Add your ElevenLabs API key
cp .env.example .env
$EDITOR .env              # ELEVENLABS_API_KEY=...

Grab a key at elevenlabs.io/app/settings/api-keys. Scribe is billed per minute of audio; a 20-minute podcast take costs roughly a few cents.
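
If you install by hand, a quick preflight check saves a confusing first run. The sketch below is my own, not part of the repo. It only looks at the process environment, so a key that lives solely in .env will show up as missing until something loads that file.

```python
import os
import shutil

def preflight(env=os.environ):
    """Return a list of problems; an empty list means ready to go."""
    problems = []
    for binary in ("ffmpeg", "ffprobe"):
        if shutil.which(binary) is None:
            problems.append(f"{binary} not found on PATH")
    if not env.get("ELEVENLABS_API_KEY"):
        problems.append("ELEVENLABS_API_KEY is not set")
    return problems

for problem in preflight():
    print("missing:", problem)
```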

A real session: podcast edit end-to-end

Here’s what a session actually looks like. I cd’d into a directory with four raw takes of a two-person podcast and started Claude Code:

$ cd ~/podcast/ep47-raw
$ ls
take_001.mov  take_002.mov  take_003.mov  take_004.mov
$ claude

Then just:

> edit these into a 4–6 minute launch video for a developer audience.
  Neutral cinematic grade. 2-word uppercase subtitle chunks. Cut out
  the second take entirely — the guest coughed through it.

Behind the scenes, the agent did roughly this:

Ran ffprobe on the sources, then transcribed all four takes in parallel via transcribe_batch.py (4-worker ElevenLabs Scribe fan-out, cached to edit/transcripts/).
Packed the transcripts into edit/takes_packed.md — phrase-level, break on silence ≥ 0.5s, ~12 KB for 40 minutes of source. This is the LLM’s reading view:

## take_001 (duration: 512.4s, 87 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.
  [007.90-012.11] S1 Wait, so — what does that mean exactly?
  [012.55-014.20] S0 (laughter)

Proposed a strategy in plain English: “Open on take_001 002.52 → 006.74 (hook). Cut to take_003 048–132s. Skip take_002. Close on take_004 210–230s. Target 4:47. Grade: neutral cinematic. Confirm before I cut.”
Waited for me to say “yes.” The agent isn’t allowed to touch the cut until the user approves the plan in words.
Generated edit/edl.json — every segment, source, start, end, grade, overlay, and subtitle style as data.
Rendered per segment, concatenated the segments with ffmpeg -c copy, applied overlays, then burned subtitles last.
Self-evaluated by running timeline_view.py on the rendered output at every cut boundary (visual jumps, waveform spikes, hidden subtitles, misaligned overlays). Up to 3 fix-and-re-render passes.
Handed me edit/preview.mp4.

Wall-clock: about 8 minutes. Transcription dominated. Two revisions later (“20% tighter”, “swap the opening line”) the final rendered without re-transcription — Scribe results are cached per source.
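
The packing step in that session is simple enough to sketch. This is not pack_transcripts.py itself: the word records (text, start, end, speaker) are an assumed shape, and breaking on a speaker change is my own addition alongside the 0.5s silence rule described above.

```python
def pack_phrases(words, max_gap=0.5):
    """Group word dicts into phrases, breaking on a pause of max_gap
    seconds or more, or on a speaker change."""
    phrases = []
    for w in words:
        last = phrases[-1] if phrases else None
        if (last is None
                or w["start"] - last["end"] >= max_gap
                or w["speaker"] != last["speaker"]):
            phrases.append({"start": w["start"], "end": w["end"],
                            "speaker": w["speaker"], "text": w["text"]})
        else:
            last["end"] = w["end"]
            last["text"] += " " + w["text"]
    return phrases

def format_phrase(p):
    """Render one phrase in the [start-end] speaker text layout."""
    return f"[{p['start']:06.2f}-{p['end']:06.2f}] {p['speaker']} {p['text']}"

# Assumed word-record shape, for illustration.
words = [
    {"text": "We", "start": 6.08, "end": 6.25, "speaker": "S0"},
    {"text": "fixed", "start": 6.27, "end": 6.55, "speaker": "S0"},
    {"text": "this.", "start": 6.57, "end": 6.74, "speaker": "S0"},
    {"text": "Wait,", "start": 7.90, "end": 8.20, "speaker": "S1"},
]
for phrase in pack_phrases(words):
    print(format_phrase(phrase))
```

One line per phrase is what keeps 40 minutes of source down to a few kilobytes while still giving the model a timestamp for every place it might cut.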

The 12 hard rules — production correctness

The most interesting part of SKILL.md isn’t aesthetics. It’s a list labeled Hard Rules — non-negotiable correctness constraints:

1. Subtitles applied LAST in the filter chain, otherwise overlays hide captions (silent failure).
2. Per-segment extract → lossless -c copy concat, not single-pass filtergraph; this avoids double-encoding when overlays land.
3. 30ms audio fades at every segment boundary (afade=t=in:st=0:d=0.03,afade=t=out:st={dur-0.03}:d=0.03), otherwise audible pops.
4. Overlays use setpts=PTS-STARTPTS+T/TB to shift the overlay’s frame 0 to its window start.
5. Master SRT uses output-timeline offsets: output_time = word.start - segment_start + segment_offset.
6. Never cut inside a word. Snap every cut edge to a Scribe word boundary.
7. Pad every cut edge by 30–200ms, because Scribe timestamps drift 50–100ms.
8. Word-level verbatim ASR only. Never SRT/phrase mode. Never normalized fillers.
9. Cache transcripts per source. Never re-transcribe unless the file itself changed.
10. Parallel sub-agents for multiple animations, never sequential.
11. Strategy confirmation before execution.
12. All session outputs in <videos_dir>/edit/. Never write inside video-use/.

The list is worth reading even if you never use video-use — accumulated bruises from actually shipping this pipeline.
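
The audio-fade rule is the easiest one to turn into code. The helper below is a sketch of how the quoted filter string could be built for a segment of known length. The function name and the guard against very short segments are mine, not the repo's.

```python
def boundary_fades(duration, fade=0.03):
    """Build an ffmpeg audio filter string that fades in over the first
    `fade` seconds of a segment and out over the last `fade` seconds."""
    if duration < 2 * fade:
        raise ValueError("segment is shorter than the two fades combined")
    out_start = duration - fade
    return (f"afade=t=in:st=0:d={fade:g},"
            f"afade=t=out:st={out_start:.3f}:d={fade:g}")

print(boundary_fades(4.22))
```

The resulting string would go after ffmpeg's -af option when a segment is extracted.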

Overlays, grades, and generated animations

Animation overlays render via parallel sub-agents in any of four systems:

HyperFrames: HeyGen’s declarative framer
Remotion: React-based programmatic video
Manim: 3Blue1Brown’s math animation library
PIL: plain Pillow for raster overlays

Each animation lives in edit/animations/slot_<id>/. The parent agent spawns one sub-agent per slot in parallel via Claude Code’s Agent tool — wall time is the slowest single animation, not the sum. On a five-animation launch video: 90 seconds instead of 7+ minutes.
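
The "slowest single animation, not the sum" claim is ordinary fan-out behaviour. The sketch below uses threads and sleeps as stand-ins for sub-agents and renders, with made-up durations scaled down to fractions of a second. It shows the timing pattern, not how Claude Code's Agent tool is actually invoked.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def render_slot(seconds):
    """Stand-in for one animation render; it just sleeps."""
    time.sleep(seconds)
    return seconds

# Simulated render times, scaled down so the demo finishes quickly.
durations = [0.05, 0.10, 0.20, 0.10, 0.05]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(durations)) as pool:
    results = list(pool.map(render_slot, durations))
elapsed = time.perf_counter() - start

print(f"sequential would take ~{sum(durations):.2f}s")
print(f"parallel took ~{elapsed:.2f}s (slowest slot: {max(durations):.2f}s)")
```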

Color grading is a one-liner:

helpers/grade.py -i in.mp4 -o graded.mp4 --preset neutral_cinematic
# or custom ffmpeg chain:
helpers/grade.py -i in.mp4 -o graded.mp4 \
  --filter 'curves=preset=medium_contrast,eq=saturation=1.05:gamma=0.98'

Grades apply per-segment before concat — combined with rule 2, that’s one encode per segment, not two.
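
Here is roughly what "grade per segment, then concat with -c copy" looks like as ffmpeg invocations. This is a sketch, not render.py: the function names and the file names (seg_000.mp4, segments.txt, joined.mp4) are hypothetical, and paths containing single quotes would need escaping in the list file.

```python
def segment_cmd(source, start, end, out_path, vf):
    """Extract one range and grade it in the same (single) encode."""
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
            "-i", source, "-vf", vf, out_path]

def concat_list(paths):
    """Body of a list file for ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in paths)

def concat_cmd(list_path, out_path):
    """Join the graded segments without re-encoding them."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", out_path]

vf = "curves=preset=medium_contrast,eq=saturation=1.05:gamma=0.98"
print(" ".join(segment_cmd("take_001.mov", 2.52, 6.74, "seg_000.mp4", vf)))
print(concat_list(["seg_000.mp4", "seg_001.mp4"]), end="")
print(" ".join(concat_cmd("segments.txt", "joined.mp4")))
```

Stream-copy concat only works when every segment shares the same codec settings, which is one reason to push all segments through the same extract step.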

What it’s genuinely good at

After a few days on real work:

Talking-head content. Podcasts, video essays, tutorial narration, launch videos. Transcript-first is perfect for anything speech-driven.
Multi-take selection. Pick the best of six takes without watching all six. The LLM reads them all.
Filler-word removal at scale. Every “um,” “uh,” false start comes with a word-level timestamp — one pass to cut them.
Iterating verbally. “Tighter.” “Warmer grade.” “Bigger subtitle chunks.” Each revision is minutes, not hours.
Session handoff. project.md persists memory. Start with Claude Code, resume next week in Codex.
Reproducibility. edl.json is a checked-in data file. Regenerate the same edit deterministically.
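
Filler removal shows why word-level timestamps matter. The sketch below turns a word list into the ranges worth keeping. The filler set and the word-record shape are illustrative assumptions, and a real pass would still apply the padding and fade rules to each range.

```python
FILLERS = {"um", "uh", "er", "erm"}  # illustrative list, not from the repo

def keep_ranges(words, fillers=FILLERS):
    """Return merged (start, end) ranges covering every non-filler word."""
    ranges = []
    extend = False  # True while the previous word was kept
    for w in words:
        if w["text"].strip(".,!?").lower() in fillers:
            extend = False
            continue
        if extend:
            ranges[-1] = (ranges[-1][0], w["end"])
        else:
            ranges.append((w["start"], w["end"]))
        extend = True
    return ranges

words = [
    {"text": "So", "start": 0.00, "end": 0.20},
    {"text": "um,", "start": 0.25, "end": 0.60},
    {"text": "we", "start": 0.70, "end": 0.85},
    {"text": "fixed", "start": 0.90, "end": 1.20},
    {"text": "this.", "start": 1.25, "end": 1.50},
]
print(keep_ranges(words))
```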

Honest limitations

video-use is not a Premiere replacement, and the README is upfront about it:

Purely visual content is a bad fit. No speech, nothing for the LLM to reason about. Wordless drone montages: use something else.
ElevenLabs Scribe is a hard dependency. No local Whisper fallback in the box. Sensitive material: non-starter.
Word-timing drift. Scribe timestamps drift 50–100ms. Edge padding hides most of it, but tightly-paced comedic timing can feel slightly off vs. a hand-cut.
No frame-accurate motion graphics. Overlays sit on top; nothing gets composited into a scene. Rotoscoping and motion tracking need a real NLE.
Long-form gets expensive. 90-minute raw source at Scribe’s per-minute pricing is a few dollars per first pass. Iteration is cheap because results cache.
ffmpeg is the ceiling. Effects that need After Effects still need After Effects.
Windows story is weaker. Docs and community are macOS/Linux-first. WSL2 works fine.
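
The drift limitation is easier to judge once you see what snapping and padding do to a proposed cut. This sketch assumes sorted word records with start and end fields and uses a 50ms pad, which sits inside the 30–200ms range from the hard rules. The function is mine, not a helper from the repo.

```python
def snap_and_pad(start, end, words, pad=0.05, clip_end=None):
    """Snap a proposed (start, end) range outward to the boundaries of
    the words it touches, then widen each edge by `pad` seconds."""
    inside = [w for w in words if w["end"] > start and w["start"] < end]
    if not inside:
        raise ValueError("range contains no words")
    padded_start = max(0.0, inside[0]["start"] - pad)
    padded_end = inside[-1]["end"] + pad
    if clip_end is not None:
        padded_end = min(clip_end, padded_end)
    return round(padded_start, 3), round(padded_end, 3)

# Words must be sorted by start time.
words = [
    {"text": "wasted.", "start": 4.75, "end": 5.36},
    {"text": "We", "start": 6.08, "end": 6.25},
    {"text": "fixed", "start": 6.27, "end": 6.55},
    {"text": "this.", "start": 6.57, "end": 6.74},
    {"text": "Wait,", "start": 7.90, "end": 8.20},
]
# A proposed range that starts mid-"We" and ends mid-"this."
print(snap_and_pad(6.15, 6.70, words))
```

The pad is where tight comedic timing gets lost: every kept range grows by a few frames on each side, and those frames add up across dozens of cuts.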

Community reactions

Reception since launch has been unusually positive:

The browser-use org page lists video-use alongside the flagship browser agent and desktop app — signaling first-class product status.
Coverage from SoloSoft, Openflows, and The Menon Lab all landed on the same observation: the transcript-first pipeline makes it feel qualitatively different from previous “AI video editors” that tried to reason from frames.
The most cited critiques on X and Reddit are the ElevenLabs dependency and the cost of transcribing long sources, both legitimate concerns that a local-Whisper fallback would fix.
Independent developers have started shipping video-use plugins as their own agent skills — automatic B‑roll insertion, chapter marker generation. The “skill” packaging is proving to be a real extension point.

When to use video-use vs. alternatives

Tool	Best for	Trade-off
video-use	Podcasts, launch videos, tutorials with agent-first workflow	Needs ElevenLabs, ffmpeg-ceiling on effects
OpenMontage	Full agentic pipeline, 12 pipelines, 500+ skills	Heavier setup, broader surface area
Descript	Text-based editing with polished UI	Proprietary, cloud-only, subscription
DaVinci / Premiere	Color-critical or motion-graphics-heavy work	Manual, hours per edit, no LLM
yt-dlp + ffmpeg	One-off clip extraction	You write the whole pipeline

video-use wins when material is speech-driven, iteration speed matters more than pixel-perfect control, and you already live in Claude Code or Codex.

FAQ

Does video-use work with any coding agent? Yes — the README explicitly lists Claude Code, Codex, Hermes, and OpenClaw, and anything that can run shell commands + read a skill file. The skill is just a directory with SKILL.md and helper scripts; the agent-specific step is symlinking it into that agent’s skills directory. Cursor and Windsurf work via their own skill/rule conventions.

Do I need an ElevenLabs subscription? You need an API key, but ElevenLabs offers a free tier with limited monthly minutes that’s enough to try the whole pipeline on a short podcast. Paid usage is billed per minute of audio transcribed. There’s no local Whisper fallback in the current release; adding one is the most common community feature request.

Can it edit videos without spoken audio? Poorly. The whole pipeline is built around the transcript being the LLM’s primary reading surface. For wordless material — B‑roll, dance, music videos, silent product demos — the LLM has almost nothing to reason about and you’d be better off with a traditional editor.

Does it re-transcribe every session? No. Transcripts are cached per source file in edit/transcripts/<name>.json and only regenerated if the source file itself changes. This is Rule 9 in the hard rules. Iteration on cut decisions, grades, and overlays reuses the cached JSON.
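
A cache check along these lines can be sketched in a few lines. Comparing modification times is an assumption on my part; the repo may detect changes differently, for example with a content hash.

```python
from pathlib import Path

def needs_transcription(source: Path, cache_dir: Path) -> bool:
    """True if there is no cached transcript, or the source file has
    been modified more recently than the cached transcript."""
    cached = cache_dir / f"{source.stem}.json"
    if not cached.exists():
        return True
    return source.stat().st_mtime > cached.stat().st_mtime
```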

Where do the outputs go? All session artifacts live under <your_videos_dir>/edit/ — project.md for persistent session memory, takes_packed.md for the LLM’s reading view, edl.json for cut decisions, clips_graded/ for per-segment extracts, animations/slot_<id>/ for overlays, master.srt for subtitles, preview.mp4, and final.mp4. The video-use/ repo itself is never written to.
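
For master.srt, the output-timeline offset from the hard rules is worth seeing with numbers. The sketch applies output_time = word.start - segment_start + segment_offset and formats the result as an SRT timestamp. The function names and example values are mine.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_output_time(word_time, segment_start, segment_offset):
    """Map a source-timeline time onto the output timeline."""
    return word_time - segment_start + segment_offset

# A segment that starts at 48.0s in its source and is placed 4.22s
# into the output, after a 4.22s opening segment.
start = to_output_time(50.5, segment_start=48.0, segment_offset=4.22)
print(srt_timestamp(start))
```

Skip the offset and the subtitles stay on source time, which puts every caption after the first cut at the wrong moment.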

How does the self-eval loop actually work? After render, the agent runs timeline_view.py on the rendered output (not the sources) at every cut boundary — a ±1.5s window around each cut. It checks each generated PNG for visual discontinuity, waveform spikes past the 30ms fade, subtitles hidden behind overlays, or misaligned animation frames. If it finds any of these, it fixes and re-renders, capping at 3 self-eval passes before surfacing the issue to the user.
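
The loop described above can be sketched as two pieces: working out where to look, and capping the retries. Everything here is illustrative. The render and inspect callables are hypothetical placeholders supplied by the caller, not functions from the repo.

```python
def cut_boundaries(segment_durations):
    """Output-timeline times where one segment ends and the next begins."""
    boundaries, t = [], 0.0
    for d in segment_durations[:-1]:
        t += d
        boundaries.append(round(t, 3))
    return boundaries

def inspection_windows(segment_durations, half_width=1.5):
    """A (start, end) window around every cut, clamped to the output."""
    total = sum(segment_durations)
    return [(round(max(0.0, b - half_width), 3),
             round(min(total, b + half_width), 3))
            for b in cut_boundaries(segment_durations)]

def self_eval(render, inspect, max_passes=3):
    """Render, inspect, and repeat until clean or out of passes.
    `render` and `inspect` are caller-supplied callables (hypothetical)."""
    output, issues = None, []
    for _ in range(max_passes):
        output = render(issues)    # re-render with the known issues in hand
        issues = inspect(output)   # look at the rendered file, not the sources
        if not issues:
            break
    return output, issues          # leftover issues get surfaced to the user

print(inspection_windows([4.22, 84.0, 20.0]))
```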

What license is it under? MIT. Everything — the pipeline, helpers, and skill definition — is fully open source. You can fork it, ship your own version, or vendor parts of it inside a proprietary product.

Is this actually production-ready? For talking-head content, podcasts, launch videos, and tutorial narration: yes, it’s shipping real work today, including for the browser-use team’s own 15-second TikTok demo advertising it. For anything that would strain a traditional NLE — color-critical work, motion graphics, VFX — you’ll want a real editor. Treat video-use as the fastest possible path from raw footage to polished-enough draft, not as a replacement for professional post-production.

Verdict

video-use is one of the more surprising open-source releases of 2026 because it doesn’t ask an LLM to do the thing LLMs are worst at — reasoning about pixels. Instead, it does the browser-use trick: give the model a well-structured symbolic representation of the medium (a transcript with word-level timestamps), let it reason there, and let deterministic code (ffmpeg) do the actual rendering. Wrap the whole thing in an agent skill so any coding agent can pick it up in one paste.

If you spend any time producing speech-driven video and you already use Claude Code or Codex for other work, video-use is worth an afternoon to try. It’s MIT-licensed, sits on a well-known dependency stack, and the hard-rules list alone is a useful education in what silent-failure modes actually exist in an automated video pipeline.

For the browser-use team specifically, video-use is another data point in a pattern that’s starting to look like a thesis: the right primitive for an agent isn’t a screenshot, it’s a structured view of the medium. DOM for the web. Transcript for video. Whatever’s next for whatever comes next.

Repo: github.com/browser-use/video-use.