
Video Vision MCP

Give your AI eyes & ears.

The MCP tool that lets any AI watch and listen to videos. YouTube · TikTok · Reels · X · local files. Local Whisper. No API keys.

See how it works ↓

See it actually work

Eight seconds. One URL. Your AI gets the whole video.

The MCP tool that lets any AI watch and listen — frames, transcript, scenes — the moment you paste a URL. Live-typed, on a loop.

claude · video-vision-mcp · TikTok · 0:47

# paste any video URL — frames, transcript & scenes go straight into your AI

vvmp: https://tiktok.com/@chef/video/12345 (0:47)



The Pain

Your AI can read docs. Write code. Look at images. It still can't watch or listen to a video.

Some influencer just dropped a 90-second video showing the exact workflow you want your AI to build. You can see every click, every prompt, every step.

But your AI can't see it. So now you're pausing every 4 seconds, screenshotting, transcribing by hand, pasting fragments into chat. Fifteen minutes of monkey work before you've even asked the first real question. Sometimes you just give up.

Without Video Vision MCP                            | With Video Vision MCP
Pause. Screenshot. Pause. Transcribe. Pause. Paste. | Paste URL. Ask. Done.
15+ minutes per video                               | ~30 seconds
Half the context lost in translation                | Frames + transcript + timestamps, all at once
Works for some YouTube videos                       | YouTube · TikTok · Reels · X · 1000+ platforms · local files

STOP THE CLOCK

How much is the pause-and-paste tax costing you?

At the page's default inputs (5 videos a week, $40 an hour):

Hours saved / year: 63
Dollars back / year: $2,513
Videos / year: 260

Math: ~14.5 min saved per video × your weekly volume × 52 weeks. Tool cost: $0.
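Those figures follow directly from the stated formula; a quick sketch with the page's default inputs (5 videos a week, $40/hour):

```python
# Back-of-the-envelope check of the savings math above.
minutes_saved_per_video = 14.5
videos_per_week = 5
hourly_rate = 40  # dollars

videos_per_year = videos_per_week * 52                        # 260
hours_saved = videos_per_year * minutes_saved_per_video / 60  # ~62.8, shown as 63
dollars_back = hours_saved * hourly_rate                      # ~2,513

print(videos_per_year, round(hours_saved), round(dollars_back))  # 260 63 2513
```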

Who This Is For

Anyone who's ever wished their AI could just watch the damn video.

That influencer just showed a 20-step Claude setup. Now my AI can build it.

Drop the URL into your AI chat. It watches, extracts every command, every click, every config. You get a working script.

Watch this and write me an automation that does the same thing.

Why did this competitor's ad work?

Hand your AI a TikTok or Reel. It analyzes the hook, the pacing, the cuts, the hashtags. Tells you exactly why it converts.

Watch this competitor's TikTok. Tell me the hook, the pacing, why it went viral.

Make this video mine.

Drop a YouTube. Your AI rewrites it as a tweet thread, a LinkedIn post, a script in your own voice.

Turn this 5-min YouTube into a tweet thread in my voice.

Bake this. Build this. Set this up.

Cooking video, IKEA assembly, AI agent build. Your AI watches, converts it to a numbered checklist, and can guide you through it.

Watch this baking video. Make me a shopping list and step-by-step instructions.

Summarize the lecture I didn't have time for.

Hour-long talks, university lectures, podcasts, interviews. AI watches, summarizes, pulls timestamps for the parts you actually need.

Summarize this 1-hour talk in 5 bullets with timestamps for the key moments.

Reproduce this bug from my screen recording.

Screen recordings, demo videos, code walkthroughs. AI extracts every visible command, every UI state, every error.

Here's my screen recording of the bug. Reproduce it and propose a fix.

Why This, Not That

Most AI tools were built to read. This one lets them watch and listen.

Most AI tools struggle the moment you give them a TikTok, a Reel, an X clip, or a private file — they were built for text, not video. Video Vision MCP is the missing layer: paste any URL or point to any local file, and your AI gets the frames, the transcript, and the timestamps. That's it.

                            | Without Video Vision MCP    | With Video Vision MCP
YouTube                     | Sometimes (transcript only) | Frames + transcript + scenes
TikTok                      | Not natively                | Yes
Instagram Reels             | Not natively                | Yes
X / Twitter videos          | Not natively                | Yes
Vimeo / Twitch / 1000+      | No                          | Yes (yt-dlp)
Local mp4 / mov files       | No                          | Yes
Inside Cursor / Claude Code | Not really                  | Yes (any MCP client)
API key required            | Yes (your existing one)     | No (runs locally)
Cost per video              | Tokens + your time          | $0

100% local.

Your videos never leave your machine. No upload, no cloud, no telemetry.

Zero API keys.

No OpenAI, no Gemini, no Anthropic billing. Whisper runs on your CPU.

1000+ platforms.

If yt-dlp can grab it, this can analyze it. Plus any local file.

Universal AI.

Works with Claude Code, Cursor, Cline, Windsurf, Continue, Claude Desktop — any MCP client.

Under the Hood

One URL in. Everything your AI needs out.

01

Download

yt-dlp pulls the video. YouTube, TikTok, Reels, X, 1000+ platforms. Or just point it at a local file.

02

Smart scenes

Instead of grabbing one frame every 5 seconds (dumb), it detects actual scene changes — the moments that matter.
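Scene-change detection of this kind usually boils down to a thresholded difference between consecutive frames. The tool's actual detector isn't documented here; this toy version over flat lists of pixel values just illustrates the idea:

```python
def detect_scene_changes(frames, threshold=0.3):
    """Return indices of frames that differ sharply from the previous frame.

    Each frame is a flat list of 0-255 pixel values; a normalized mean
    absolute difference above `threshold` counts as a scene cut.
    """
    cuts = [0]  # always keep the opening frame
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / (255 * len(cur))
        if diff > threshold:
            cuts.append(i)
    return cuts

# Three dark frames, then three bright ones: one cut, at frame 3.
frames = [[0] * 4] * 3 + [[200] * 4] * 3
print(detect_scene_changes(frames))  # [0, 3]
```

A fixed one-frame-per-5-seconds grid would sample six frames here regardless of content; the threshold approach keeps only the two that actually look different.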

03

Burned-in time

Every frame has the time visibly stamped. Your AI knows exactly when something happens.
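The overlay itself is presumably drawn with something like ffmpeg's drawtext filter; the label format is just a seconds-to-clock conversion:

```python
def timestamp_label(seconds: float) -> str:
    """Format a frame's offset as the clock label burned into the image."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

print(timestamp_label(47))    # 00:47
print(timestamp_label(3725))  # 1:02:05
```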

04

Audio · captions or local Whisper

If the platform has subtitles, grabs them instantly. If not, Whisper runs locally on CPU. No API key, no cloud, no GPU.
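The captions-first logic is a simple fallback. The helper names below (`fetch_captions`, `transcribe_locally`) are stand-ins for illustration, not the tool's real API:

```python
def get_transcript(url, fetch_captions, transcribe_locally):
    """Prefer platform subtitles; fall back to a local Whisper pass."""
    captions = fetch_captions(url)  # e.g. subtitles pulled via yt-dlp
    if captions:
        return captions, "platform-captions"
    return transcribe_locally(url), "local-whisper"  # CPU Whisper, no cloud

# Simulated: the first video has captions, the second does not.
print(get_transcript("u1", lambda u: "hello world", lambda u: "whisper text"))
print(get_transcript("u2", lambda u: None, lambda u: "whisper text"))
```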

You don't configure any of this. It figures it out.

Just Talk to It

Real prompts that already work.

Click any card to copy the prompt. Paste it into Claude, Cursor, or any MCP-connected AI with a video URL.

Watch this YouTube tutorial and give me the exact steps as a numbered list.

I recorded myself doing this task manually. Watch it and write an automation spec.

Transcribe this meeting recording and pull out all the action items.

Watch this product demo. List every feature shown, in order, with timestamps.

Extract every terminal command visible in this coding tutorial.

Compare the UI in this screen recording against our Figma spec.

Watch this TikTok and tell me exactly how they edited it — cuts, transitions, effects.

Summarize this 1-hour conference talk in 5 bullet points. Include timestamps for the key moments.

Get It Running

Two minutes. Zero API keys.

Pick your tool and follow the steps below. Works with any AI coding agent — Claude Code, Cursor, Cline, Windsurf, Continue, Claude Desktop, or anything MCP-compatible.

Paste this into Claude / Cursor / any AI chat — it'll handle setup itself.

Please configure the Video Vision MCP so you can watch videos for me. The package is @oamaestro/video-vision-mcp and the install command is: npx -y @oamaestro/video-vision-mcp — register it in your MCP settings, then confirm it's connected.

No API keys. No environment variables. No “step 3 of 11”. Just paste and go.
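If you'd rather register the server by hand, a typical MCP client configuration entry looks like this. The example assumes Claude Desktop's claude_desktop_config.json; the file location varies by client, and the "video-vision" key name is your choice:

```json
{
  "mcpServers": {
    "video-vision": {
      "command": "npx",
      "args": ["-y", "@oamaestro/video-vision-mcp"]
    }
  }
}
```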

Questions

Everything you'd ask before installing.

Does this need an API key?

No. Nothing. No OpenAI, no Gemini, no Anthropic API. Whisper runs locally on your CPU.

Does it actually work with Instagram Reels and TikTok?

Yes. Drop a URL, it pulls the video via yt-dlp and analyzes it. Works the same for YouTube, X, Facebook, Vimeo, Twitch, and 1000+ others.

What about videos that need login (private TikToks, etc.)?

Pass a cookies.txt file via the cookies parameter. Standard yt-dlp practice.

Do I need a GPU?

No. Whisper runs on the CPU. The first run downloads a ~150 MB model (about a minute); after that it's cached at ~/.oamaestro/models/ for good.

How long does a 30-min video take?

Frame extraction and caption grabbing take seconds. If captions exist, the total is under a minute. Without captions, Whisper needs a few minutes on CPU. Use start_time/end_time to analyze just the sections you need.

What AI tools does it work with?

Claude Code, Cursor, Cline, Windsurf, Continue, Claude Desktop. Anything MCP-compatible.

Is anything sent to the cloud?

Nothing you feed it is uploaded. The only network traffic is the video download itself (from YouTube, TikTok, etc., as you'd expect). Analysis is 100% local.

What's the disk usage?

Roughly 8 MB per minute of video, all auto-cleaned when the server stops. Or call cleanup manually anytime.
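As a quick sanity check on that figure, using the stated per-minute rate:

```python
mb_per_minute = 8
video_minutes = 30
temp_mb = mb_per_minute * video_minutes
print(temp_mb)  # 240 MB of temporary files for a 30-minute video
```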

Open source?

MIT licensed. Free forever. Use it, fork it, ship it.

Where do I report bugs or request features?

GitHub Issues — or DM @OAMaestro anywhere.

It's completely free. Open source. Do whatever you want with it.
But if it saved you 15 minutes of screenshotting and transcribing — or helped you ship something cool — you can buy me a coffee if you want.
No obligation. Genuinely.

Minimum £4.99. Pay what you feel like.

Secure checkout via Stripe · Card, Apple Pay, Google Pay