Computer Use MCP Server: PyAutoGUI + OpenCV + Ollama vision
slothitude ab6a232a66 Fix: remove invalid host/port kwargs from FastMCP.run()
FastMCP.run() only takes transport and mount_path. Also fixed
COMPACT_MODE env var init so it reads before __main__.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-02 04:57:55 +10:00
.gitignore
.mcp.json Add startup env var diagnostics, fix OCR CUDA on Lappy 2026-06-01 23:37:37 +10:00
computer_use_mcp.py Fix: remove invalid host/port kwargs from FastMCP.run() 2026-06-02 04:57:55 +10:00
ocr_server.py Add OCR server with idle VRAM unload (120s timeout) 2026-06-01 23:47:04 +10:00
README.md Add any_tool meta-tool dispatch + compact mode (14K → 440 tokens) 2026-06-02 04:53:44 +10:00
run_tests.py Expand to 26 tools: window mgmt, multi-monitor, pixel color, OCR, clipboard 2026-06-01 15:12:58 +10:00
test_tools.py Expand to 26 tools: window mgmt, multi-monitor, pixel color, OCR, clipboard 2026-06-01 15:12:58 +10:00

computer-use-mcp

MCP server for desktop automation on Windows. PyAutoGUI + OpenCV + Vision AI + Win32.

75 tools for mouse, keyboard, screenshots, screen understanding, window management, pixel color, OCR, clipboard, template matching, UI element detection, recording, semantic screen memory, macro recording, workflow automation, visual testing, YOLO detection, table extraction, form auto-fill, policy enforcement, plugins, and more.

Setup

pip install mcp pyautogui opencv-contrib-python Pillow pywin32 psutil pywinauto

Use opencv-contrib-python (not opencv-python) for ORB feature matching. pywinauto required for UIA/accessibility tools.

Optional (for local OCR fallback):

pip install rapidocr-onnxruntime

Optional (for YOLO UI detection):

pip install ultralytics

Environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| VISION_BACKEND | nvidia | Vision provider: ollama or nvidia |
| COMPUTER_VISION_MODEL | qwen2.5vl:3b | Ollama vision model |
| OLLAMA_BASE | http://localhost:11434 | Ollama API base URL |
| NVIDIA_VISION_URL | (NIM endpoint) | NVIDIA NIM endpoint |
| NVIDIA_VISION_MODEL | meta/llama-3.2-90b-vision-instruct | NVIDIA model |
| NVIDIA_API_KEY | (from .mcp.json) | NVIDIA API bearer token |
| VISION_TIMEOUT | 300 | Vision API timeout (seconds) |
| SCREEN_MAX_DIMENSION | 1280 | Max dimension for vision screenshots |
| COMPUTER_USE_DATA_DIR | data | Directory for templates and screenshots |
| OCR_SERVICE_URL | http://192.168.0.33:8100/ocr | Remote RapidOCR endpoint |
| TRACE_LOG_PATH | (disabled) | Set to a JSONL path to persist action trace to disk |
| COMPUTER_USE_POLICY_PATH | data/computer_use_policy.json | Permission policy file |
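
As a rough illustration of how these variables might be consumed, here is a minimal sketch of a loader that merges the environment over the documented defaults. The `load_config` function and the `DEFAULTS` dict are hypothetical names for this example, not the server's actual code; only the variable names and default values come from the table above.

```python
import os

# Defaults copied from the table above. The loader itself is illustrative.
DEFAULTS = {
    "VISION_BACKEND": "nvidia",
    "COMPUTER_VISION_MODEL": "qwen2.5vl:3b",
    "OLLAMA_BASE": "http://localhost:11434",
    "VISION_TIMEOUT": "300",
    "SCREEN_MAX_DIMENSION": "1280",
    "COMPUTER_USE_DATA_DIR": "data",
}


def load_config(env=None):
    """Merge environment values over the documented defaults."""
    env = os.environ if env is None else env
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # Environment values are always strings, so numeric settings are converted.
    cfg["VISION_TIMEOUT"] = int(cfg["VISION_TIMEOUT"])
    cfg["SCREEN_MAX_DIMENSION"] = int(cfg["SCREEN_MAX_DIMENSION"])
    if cfg["VISION_BACKEND"] not in ("ollama", "nvidia"):
        raise ValueError(f"unsupported VISION_BACKEND: {cfg['VISION_BACKEND']!r}")
    return cfg
```

Passing a plain dict instead of `os.environ` (for example `load_config({"VISION_BACKEND": "ollama"})`) makes the defaults easy to check without touching the real environment.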

Adding to Claude Desktop

In claude_desktop_config.json (or via claude mcp add):

{
  "mcpServers": {
    "computer-use": {
      "command": "python",
      "args": ["-m", "computer_use_mcp"],
      "cwd": "C:/Users/aaron/computer-use-mcp",
      "env": {
        "NVIDIA_API_KEY": "your-key-here"
      }
    }
  }
}

HTTP API Mode

Run as HTTP/SSE server for LAN remote desktop driving or agent meshes:

python -m computer_use_mcp --http --host 0.0.0.0 --port 8000

The server exposes all 75 tools via FastMCP's SSE transport.

Compact Mode — any_tool (saves ~13K tokens per request)

By default, all 75 tools send their schemas to the LLM (~14K tokens). Compact mode exposes only 3 meta-tools and lets the LLM discover + call everything dynamically:

python -m computer_use_mcp --compact
# or
COMPACT_MODE=1 python -m computer_use_mcp

| Mode | MCP tools sent to LLM | Token cost |
| --- | --- | --- |
| Full (default) | 78 tools | ~14K tokens |
| Compact (--compact) | 3 meta-tools | ~440 tokens |

How compact mode works

Three meta-tools replace the entire catalog:

Tool Description
any_tool(tool_name, arguments) Call any tool by name. Supports aliases (click → computer_click, screenshot → computer_screenshot).
tool_catalog(category, search, level) Discover tools. Level 0 = names only (~800 tokens), 1 = names + descriptions (~1.3K tokens), 2 = full schemas.
tool_info(name) Get full parameter details for one tool.

Agent workflow:

tool_catalog(level=0) → see all tool names
tool_info("computer_click") → get params {x, y, button, clicks}
any_tool("computer_click", '{"x": 500, "y": 300}') → execute
any_tool("click", '{"x": 500, "y": 300}') → same, via alias
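
The dispatch idea behind this workflow can be sketched in a few lines. This is not the server's implementation: the `TOOLS` and `ALIASES` dicts and the two stand-in tool functions are hypothetical, and the real `tool_catalog` also accepts `category` and `search`. The sketch only shows alias resolution plus JSON-string arguments, as in the calls above.

```python
import json


# Hypothetical stand-ins for real tools; they only echo their arguments.
def computer_click(x, y, button="left", clicks=1):
    """Click at coordinates."""
    return {"clicked": [x, y], "button": button, "clicks": clicks}


def computer_screenshot(monitor=0):
    """Take a screenshot."""
    return {"monitor": monitor}


TOOLS = {"computer_click": computer_click, "computer_screenshot": computer_screenshot}
ALIASES = {"click": "computer_click", "screenshot": "computer_screenshot"}


def any_tool(tool_name, arguments="{}"):
    """Resolve an alias, decode the JSON arguments, and call the tool."""
    name = ALIASES.get(tool_name, tool_name)
    if name not in TOOLS:
        return {"error": f"unknown tool: {tool_name}"}
    kwargs = json.loads(arguments) if arguments else {}
    return TOOLS[name](**kwargs)


def tool_catalog(level=0):
    """Level 0 returns names only; level 1 adds the first docstring line."""
    if level == 0:
        return sorted(TOOLS)
    return {name: (fn.__doc__ or "").strip().split("\n")[0]
            for name, fn in sorted(TOOLS.items())}
```

Because only the meta-tools carry schemas, the per-tool detail stays out of the prompt until the agent asks for it.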

Tools

Screenshots & Vision

Tool Description
computer_screenshot(monitor, region, inline_b64) Screenshot. monitor=0 = all, 1+ = specific. region="x,y,w,h" crop. inline_b64=True returns inline data:image/png;base64,... URI for remote clients.
analyze_screen(question, monitor) Send screenshot to vision AI. Returns text description.
get_monitors() List connected monitors with bounds, resolution, primary flag.
screen_record(duration, fps, region, monitor) Record short video to data/videos/ for agent replay debugging.

Template Matching

Tool Description
save_template(name, x, y, width, height, monitor) Crop a screen region and save as reusable template (data/templates/).
find_on_screen(template, threshold, multi_scale, monitor, method) Find a template on screen. Results are NMS-deduplicated. The method parameter selects the matcher:
  - method="auto" (default): ORB feature matching first, falling back to template matching.
  - method="feature": ORB only. Best for rotation/scale.
  - method="template": multi-scale pixel matching only.
find_and_click_all(template, button, threshold, min_distance, multi_scale) Find all instances and click each. Clusters nearby matches.

UI Element Detection (template-free)

Tool Description
find_elements(query, region, min_area, monitor) Find UI elements by color, shape, or both — no template. Returns scored, sorted, NMS-deduplicated results.
click_element(query, index, min_area, monitor) Find elements and click one at its center.
detect_ui(labels, confidence, region, monitor) YOLO neural-network detection with semantic labels (keyboard, mouse, phone, screen). Under 50 ms on CPU.
click_detected(label, index, confidence, monitor) Detect with YOLO and click the result.

Query language (find_elements):

Query What it finds
"blue buttons" Blue-colored rectangles
"red circles" Red circles (close buttons, indicators)
"close button" Red circles (close/X buttons)
"yellow" All yellow-colored regions
"icons" Any blob-shaped contours
"Save button" Rectangle-shaped elements

Colors: red, blue, green, yellow, orange, purple, white, gray, black

Shapes: rectangle (button, rect, square, box, bar), circle (round, dot, oval, ellipse), blob (icon, shape)
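
To make the query language concrete, here is a minimal sketch of how a query string could be reduced to a color and a canonical shape using the word lists above. The `parse_query` function is hypothetical and the server's real parser may differ, for instance in how it handles plurals or the "close button" special case.

```python
COLORS = {"red", "blue", "green", "yellow", "orange", "purple", "white", "gray", "black"}
SHAPES = {
    "rectangle": {"rectangle", "button", "rect", "square", "box", "bar"},
    "circle": {"circle", "round", "dot", "oval", "ellipse"},
    "blob": {"blob", "icon", "shape"},
}


def parse_query(query):
    """Map a free-text query to {"color": ..., "shape": ...}; None means "any"."""
    text = query.lower().strip()
    if text == "close button":
        # Documented special case: close/X buttons are treated as red circles.
        return {"color": "red", "shape": "circle"}
    color = shape = None
    for word in text.split():
        # Naive plural handling, enough for "buttons", "circles", "icons".
        word = word[:-1] if word.endswith("s") else word
        if color is None and word in COLORS:
            color = word
        for canonical, synonyms in SHAPES.items():
            if shape is None and word in synonyms:
                shape = canonical
    return {"color": color, "shape": shape}
```

Words outside both lists (such as "Save" in "Save button") are ignored in this sketch, so that query reduces to a shape-only search for rectangles.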

Mouse

Tool Description
mouse_position() Get current mouse cursor position {x, y}.
computer_click(x, y, button, clicks) Click at coordinates. Button: left/right/middle.
computer_move(x, y, duration) Move mouse without clicking.
computer_scroll(amount, direction) Scroll at current position.
computer_drag(x1, y1, x2, y2, duration) Drag from point A to B.

Keyboard

Tool Description
computer_type(text, interval) Type text or hotkeys. Supports ctrl+a, alt+tab, win+d, f1-f24, delete, backspace, arrows, numpad, and all standard keys. Unknown keys in combos produce an error instead of silently typing.

Window Management (Win32)

Tool Description
window_list(title_filter, visible_only) Enumerate open windows.
window_focus(title_or_handle, bring_to_front) Focus window. Uses edit-distance ranking when multiple partial matches exist. Unminimizes.
window_move(title_or_handle, x, y) Move window to coordinates.
window_resize(title_or_handle, width, height) Resize window.
window_maximize(title_or_handle) Maximize window.
window_minimize(title_or_handle) Minimize window.
window_close(title_or_handle) Close window gracefully.
window_screenshot(title_or_handle) Screenshot a specific window.
window_enumerate_controls(title_or_handle, control_type, depth) List all interactive controls with type, text, rect, AutomationId, enabled/visible state. Win32 accessibility snapshot.
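
The edit-distance ranking mentioned for window_focus can be illustrated as follows. This is a sketch under assumptions: it uses plain Levenshtein distance on lower-cased titles, and `rank_windows` is a hypothetical helper, not a function exported by the server.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rank_windows(query, titles):
    """Keep titles containing the query, closest match first."""
    q = query.lower()
    matches = [t for t in titles if q in t.lower()]
    return sorted(matches, key=lambda t: edit_distance(q, t.lower()))


titles = ["Notepad++ - readme.txt", "Calculator", "Notepad"]
ranked = rank_windows("notepad", titles)
```

With these titles, the exact "Notepad" window ranks ahead of "Notepad++ - readme.txt", which is the behavior you want when several partial matches exist.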

Pixel Color

Tool Description
pixel_color(x, y) Get RGB and hex at a coordinate.
pixel_color_region(x, y, w, h) Average color of a region.

Screen Change Detection

Tool Description
wait_for_change(region, timeout, interval, threshold) Block until a screen region changes visually.
screen_diff(region, baseline_path, describe) Compare screen against saved baseline. Optionally send diff to vision model.
screen_hash(region, monitor) Perceptual hash (pHash) of current screen. For fast change detection between steps.
wait_until_stable(region, timeout, interval, monitor) Wait until screen stops changing (hash stabilizes). For loading animations.
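
The polling logic behind wait_until_stable can be sketched without any screen capture by injecting the hash function. Everything here is an assumption for illustration: the `get_hash` callable, the `required` streak length, and the exact-equality comparison (a real pHash comparison might tolerate a small Hamming distance instead).

```python
import time


def wait_until_stable(get_hash, timeout=10.0, interval=0.5, required=2,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_hash() until it returns the same value `required` times in a row."""
    deadline = clock() + timeout
    last, streak = get_hash(), 1
    while streak < required:
        if clock() >= deadline:
            return {"stable": False, "hash": last}
        sleep(interval)
        current = get_hash()
        # Any change resets the streak; an identical hash extends it.
        streak = streak + 1 if current == last else 1
        last = current
    return {"stable": True, "hash": last}


# Simulated screen: two changing frames, then a settled one.
frames = iter(["a1", "b2", "c3", "c3", "c3"])
result = wait_until_stable(lambda: next(frames), timeout=5, interval=0,
                           required=3, sleep=lambda seconds: None)
```

In this simulation the loop returns once "c3" has been seen three times in a row.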

OCR (three-tier)

Tool Description
screen_ocr(region, question) Extract text: remote RapidOCR → local ONNX → vision model. Structured output from local tier includes bounding boxes.
find_on_screen_text(text, region, monitor, case_sensitive) Find a text string on screen via OCR. Returns match positions, line numbers, context.

Table Extraction

Tool Description
extract_table(region, monitor, headers) Extract a table from screen via OCR + bounding box clustering. Returns list-of-dicts. Auto-detects headers from first row or accepts explicit headers.
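
A minimal sketch of the bounding-box clustering idea: group OCR boxes into rows by vertical center, order each row left to right, then zip against the headers. The box format (`text`, `x`, `y`, `w`, `h`), the `row_tolerance` value, and both function names are assumptions for this example rather than the server's actual data model.

```python
def extract_rows(boxes, row_tolerance=10):
    """Group OCR boxes into rows by vertical center, cells ordered left to right."""
    rows = []
    for box in sorted(boxes, key=lambda b: b["y"] + b["h"] / 2):
        center = box["y"] + box["h"] / 2
        if rows and abs(center - rows[-1]["center"]) <= row_tolerance:
            rows[-1]["cells"].append(box)
        else:
            rows.append({"center": center, "cells": [box]})
    return [[c["text"] for c in sorted(r["cells"], key=lambda b: b["x"])]
            for r in rows]


def to_records(rows, headers=None):
    """Turn rows into a list of dicts; the first row supplies headers by default."""
    if not rows:
        return []
    if headers is None:
        headers, rows = rows[0], rows[1:]
    return [dict(zip(headers, row)) for row in rows]


boxes = [
    {"text": "Age", "x": 100, "y": 11, "w": 30, "h": 12},
    {"text": "Name", "x": 10, "y": 10, "w": 40, "h": 12},
    {"text": "30", "x": 100, "y": 40, "w": 20, "h": 12},
    {"text": "Ada", "x": 10, "y": 41, "w": 30, "h": 12},
]
records = to_records(extract_rows(boxes))
```

Passing explicit `headers` treats every detected row as data, mirroring the optional headers argument of extract_table.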

Clipboard

Tool Description
clipboard_get_image() Save the image currently on the clipboard to data/images/.
clipboard_set_image(path) Copy image file to clipboard.
clipboard_get_text() Get text content from system clipboard.
clipboard_set_text(text) Copy text to system clipboard.

Accessibility (UIA)

Tool Description
accessibility_tree(title_or_handle, depth) Get UI Automation tree. Names, types, AutomationIds, rects.
click_by_automation_id(automation_id, title_or_handle, action) Click element by AutomationId. Actions: click, double_click, right_click, invoke.
ui_find(name, control_type, title_or_handle, depth) Find elements by name substring and/or control type. No AutomationId needed. Returns matched elements with rects.
ui_get_value(automation_id, title_or_handle) Read current value/text of a UI element. Useful for text fields, checkboxes, dropdowns.
ui_wait(automation_id, timeout, title_or_handle) Wait for a UI element to appear. Polls UIA tree until found or timeout.

Semantic Screen Memory

Tool Description
remember_screen(label, description, region, monitor) Snapshot screen and store under a label. Avoids re-screenshotting in multi-step workflows.
recall_screen(label) Retrieve a stored screen state's metadata and path.
forget_screen(label) Remove a stored screen state from memory.
list_screens() List all stored screen memory labels with age and hash.

Macro Recorder

Tool Description
macro_record(name) Start recording a macro. All tool calls captured until macro_stop().
macro_stop() Stop recording and save macro as JSONL to data/macros/.
macro_play(name, speed, max_actions) Replay a recorded macro. Adjust speed with multiplier.
macro_list() List all recorded macros with action counts.
macro_delete(name) Delete a recorded macro.

Workflow Graph Engine

Tool Description
workflow_define(name, steps) Define a multi-step workflow as JSON. Steps support retry, branching, and conditional goto.
workflow_run(name, start_step) Execute a workflow. Steps run sequentially with retry/branching support.
workflow_status(name) Get workflow run status, current step, and log.
workflow_list() List all defined workflows.
workflow_delete(name) Delete a workflow definition.

Visual Assertions (UI Testing)

Tool Description
assert_visible(template, threshold, monitor) Assert a template is visible on screen. Returns pass/fail with evidence screenshot.
assert_text_present(text, region, case_sensitive) Assert text is present via OCR. Returns pass/fail with evidence.
assert_element_state(query, expected_state, region, monitor) Assert a UI element exists in the expected state (visible, hidden, clickable).

Form Auto-Fill

Tool Description
form_fill(data, title_or_handle, click_submit) Fill a form by providing {field: value} JSON. Finds UI elements by name via UIA and fills them. Optionally clicks Submit/OK/Login.

Permission System

Tool Description
policy_show() Display the current permission policy.
policy_set(blocked_tools, blocked_processes, allowed_regions, rate_limits) Update the policy. Creates computer_use_policy.json if needed.
policy_reload() Reload policy from disk.

Plugin System

Tool Description
plugin_list() List all available and loaded plugins from data/plugins/.

System

Tool Description
shell_run(command, timeout, cwd) Run shell command, return output.
launch_app(name, args) Launch application by name.
process_list(name_filter) List running processes.
process_kill(pid, force) Kill process by PID.
file_read(path, lines) Read a file.
file_write(path, content) Write a file. Creates parent dirs.
file_list(path, pattern) List files in a directory.
file_exists(path) Check if file/directory exists.
action_trace(clear, last_n) Action trace log — every tool call recorded for crash diagnosis. Persist to disk via TRACE_LOG_PATH.

Multi-Monitor

Uses ImageGrab.grab(all_screens=True) for multi-monitor setups. monitor=1 or monitor=2 for specific display, monitor=0 (default) for all.

Template Matching Methods

Multi-Scale Template Matching (method="template")

Brute-force pixel correlation at 7 scales [0.75, 0.8, 0.9, 1.0, 1.1, 1.2, 1.25]. Results are NMS-deduplicated to prevent overlapping detections.
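
For readers unfamiliar with NMS (non-maximum suppression), here is a small pure-Python sketch of greedy, IoU-based deduplication. The detection format (`box` as `(x, y, w, h)` plus a `score`) and the 0.3 threshold are assumptions for the example; the server's own threshold and data layout are not documented here.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0


def nms(detections, iou_threshold=0.3):
    """Keep the highest-scoring box in each group of overlapping detections."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    {"box": (102, 101, 50, 50), "score": 0.88},  # near-duplicate of the next box
    {"box": (100, 100, 50, 50), "score": 0.92},
    {"box": (400, 300, 50, 50), "score": 0.81},
]
unique = nms(detections)
```

Matching the same template at several scales tends to produce clusters of near-identical hits, which is why a pass like this is needed before clicking each result.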

ORB Feature Matching (method="feature")

ORB keypoints → BFMatcher + Lowe ratio test (0.75) → findHomography(RANSAC) → perspectiveTransform for bounding box. Resilient to rotation, scale, partial occlusion.

Auto Mode (method="auto", default)

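ORB runs first. If it finds no match, the server falls back to multi-scale template matching, so auto mode handles rotated or rescaled targets as well as low-texture targets that yield too few keypoints.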

OCR Pipeline

Three-tier fallback chain:

  1. Remote RapidOCR (HTTP) — fastest, <100ms, requires OCR_SERVICE_URL host
  2. Local ONNX (rapidocr-onnxruntime) — no network needed, returns structured bounding boxes
  3. Vision model (Ollama/NVIDIA NIM) — slowest, best for Q&A about screen content
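
The fallback chain above can be sketched as a loop over tiers. The tier callables here are fakes that simulate an unreachable remote host and an empty local result; `run_ocr_chain` and its return shape are hypothetical and only illustrate the ordering and error handling.

```python
def run_ocr_chain(image, tiers):
    """Try each (name, callable) tier in order; return the first non-empty result."""
    errors = {}
    for name, tier in tiers:
        try:
            text = tier(image)
        except Exception as exc:  # a failed tier should not abort the chain
            errors[name] = str(exc)
            continue
        if text:
            return {"tier": name, "text": text, "errors": errors}
    return {"tier": None, "text": "", "errors": errors}


# Fake tiers standing in for the remote service, the local ONNX model,
# and the vision model.
def remote_ocr(image):
    raise ConnectionError("OCR host unreachable")


def local_ocr(image):
    return ""  # nothing recognized


def vision_ocr(image):
    return "Welcome back"


result = run_ocr_chain(b"fake-image-bytes", [
    ("remote", remote_ocr),
    ("local", local_ocr),
    ("vision", vision_ocr),
])
```

Recording which tier answered, along with the errors from earlier tiers, makes it easier to notice when the fast path has silently stopped working.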

Plugin System

Drop Python files in data/plugins/ to extend the server with custom tools. Each plugin can use @mcp.tool() to register new tools. Example:

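# data/plugins/my_tools.py
# Assumption: the plugin loader makes the server's FastMCP instance available
# to this module as `mcp` before the file runs; otherwise `mcp` is undefined.
from mcp.server.fastmcp import FastMCP

mcp: FastMCP  # annotation only; the instance itself comes from the loader

@mcp.tool()
def my_custom_tool(arg: str) -> dict:
    """Return the argument upper-cased."""
    return {"result": arg.upper()}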

Permission System

Define computer_use_policy.json to restrict tool access:

{
  "blocked_tools": ["shell_run"],
  "blocked_processes": ["regedit", "taskmgr"],
  "allowed_regions": [{"x": 0, "y": 0, "w": 1920, "h": 1080}],
  "rate_limits": {
    "computer_click": {"max_calls": 10, "window_seconds": 1}
  }
}

Or set it programmatically with policy_set().
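
To show how such a policy might be evaluated before each tool call, here is a minimal sketch covering blocked tools, allowed regions, and sliding-window rate limits. The `Policy` class is hypothetical and leaves out `blocked_processes`; the server's actual enforcement order and error messages may differ.

```python
import time
from collections import defaultdict, deque


class Policy:
    def __init__(self, policy, clock=time.monotonic):
        self.blocked_tools = set(policy.get("blocked_tools", []))
        self.allowed_regions = policy.get("allowed_regions", [])
        self.rate_limits = policy.get("rate_limits", {})
        self.calls = defaultdict(deque)  # tool name -> timestamps of recent calls
        self.clock = clock

    def _inside(self, x, y):
        return any(r["x"] <= x < r["x"] + r["w"] and r["y"] <= y < r["y"] + r["h"]
                   for r in self.allowed_regions)

    def check(self, tool, args=None):
        """Return (allowed, reason) for one tool call."""
        args = args or {}
        if tool in self.blocked_tools:
            return False, "tool is blocked"
        if self.allowed_regions and "x" in args and "y" in args:
            if not self._inside(args["x"], args["y"]):
                return False, "coordinates outside allowed regions"
        limit = self.rate_limits.get(tool)
        if limit:
            now, history = self.clock(), self.calls[tool]
            # Drop timestamps that have aged out of the window.
            while history and now - history[0] >= limit["window_seconds"]:
                history.popleft()
            if len(history) >= limit["max_calls"]:
                return False, "rate limit exceeded"
            history.append(now)
        return True, "ok"


policy = Policy({
    "blocked_tools": ["shell_run"],
    "allowed_regions": [{"x": 0, "y": 0, "w": 1920, "h": 1080}],
    "rate_limits": {"computer_click": {"max_calls": 10, "window_seconds": 1}},
}, clock=lambda: 0.0)  # frozen clock so every call lands in the same window
clicks = [policy.check("computer_click", {"x": 500, "y": 300})[0] for _ in range(11)]
```

With the clock frozen, the first ten clicks pass and the eleventh is rejected, matching the `max_calls: 10` limit in the example policy.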

UIA Workflow for Native Apps

For form-fill or data-extraction over native Windows apps:

  1. window_enumerate_controls(title) — see all controls with their IDs and rects
  2. ui_find(name="Username", control_type="Edit") — find elements without knowing IDs
  3. ui_get_value(automation_id) — read current text/value
  4. click_by_automation_id(automation_id, action="invoke") — click buttons
  5. ui_wait(automation_id, timeout=5) — wait for elements to appear
  6. accessibility_tree(title, depth=3) — full tree for complex navigation

Agent Workflow Patterns

Multi-step with screen memory

remember_screen("initial") → click → wait_until_stable() → recall_screen("initial")

Macro recording

macro_record("login") → [do stuff] → macro_stop() → macro_play("login")

Workflow automation

workflow_define("deploy", '[{"tool":"computer_click","args":{"x":500,"y":300},"retry":3}]') → workflow_run("deploy")
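
A minimal sketch of how a step list like the one above could be executed with retries. It assumes `"retry": 3` means up to three retries after the first attempt, and it omits branching and conditional goto because their field names are not documented here. `run_workflow` and the `call_tool` callback are hypothetical names.

```python
import json


def run_workflow(steps_json, call_tool):
    """Run steps in order; retry each failing step up to its "retry" count."""
    log = []
    for index, step in enumerate(json.loads(steps_json)):
        attempts = 1 + int(step.get("retry", 0))
        for attempt in range(1, attempts + 1):
            try:
                result = call_tool(step["tool"], step.get("args", {}))
                log.append({"step": index, "attempt": attempt, "ok": True,
                            "result": result})
                break
            except Exception as exc:
                log.append({"step": index, "attempt": attempt, "ok": False,
                            "error": str(exc)})
        else:
            # Every attempt failed, so the workflow stops at this step.
            return {"status": "failed", "failed_step": index, "log": log}
    return {"status": "completed", "log": log}


# Fake tool that fails twice before succeeding, to exercise the retry path.
failures = {"left": 2}


def flaky_click(tool, args):
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RuntimeError("target not found")
    return {"clicked": [args["x"], args["y"]]}


outcome = run_workflow(
    '[{"tool":"computer_click","args":{"x":500,"y":300},"retry":3}]', flaky_click)
```

The returned log plays the role of the run log that workflow_status reports: one entry per attempt, with the error text for each failure.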

Visual testing

assert_visible("logo.png") → assert_text_present("Welcome") → assert_element_state("submit button", "visible")