Computer Use MCP Server: PyAutoGUI + OpenCV + Ollama vision

Python 100%

Find a file

slothitude ab6a232a66 Fix: remove invalid host/port kwargs from FastMCP.run() FastMCP.run() only takes transport and mount_path. Also fixed COMPACT_MODE env var init so it reads before __main__. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>		2026-06-02 04:57:55 +10:00
.gitignore
.mcp.json	Add startup env var diagnostics, fix OCR CUDA on Lappy	2026-06-01 23:37:37 +10:00
computer_use_mcp.py	Fix: remove invalid host/port kwargs from FastMCP.run()	2026-06-02 04:57:55 +10:00
ocr_server.py	Add OCR server with idle VRAM unload (120s timeout)	2026-06-01 23:47:04 +10:00
README.md	Add any_tool meta-tool dispatch + compact mode (14K → 440 tokens)	2026-06-02 04:53:44 +10:00
run_tests.py	Expand to 26 tools: window mgmt, multi-monitor, pixel color, OCR, clipboard	2026-06-01 15:12:58 +10:00
test_tools.py	Expand to 26 tools: window mgmt, multi-monitor, pixel color, OCR, clipboard	2026-06-01 15:12:58 +10:00

README.md

computer-use-mcp

MCP server for desktop automation on Windows. PyAutoGUI + OpenCV + Vision AI + Win32.

75 tools for mouse, keyboard, screenshots, screen understanding, window management, pixel color, OCR, clipboard, template matching, UI element detection, recording, semantic screen memory, macro recording, workflow automation, visual testing, YOLO detection, table extraction, form auto-fill, policy enforcement, plugins, and more.

Setup

pip install mcp pyautogui opencv-contrib-python Pillow pywin32 psutil pywinauto

Use opencv-contrib-python (not opencv-python) for ORB feature matching. pywinauto required for UIA/accessibility tools.

Optional (for local OCR fallback):

pip install rapidocr-onnxruntime

Optional (for YOLO UI detection):

pip install ultralytics

Environment variables:

Variable	Default	Description
`VISION_BACKEND`	`nvidia`	Vision provider: `ollama` or `nvidia`
`COMPUTER_VISION_MODEL`	`qwen2.5vl:3b`	Ollama vision model
`OLLAMA_BASE`	`http://localhost:11434`	Ollama API base URL
`NVIDIA_VISION_URL`	(NIM endpoint)	NVIDIA NIM endpoint
`NVIDIA_VISION_MODEL`	`meta/llama-3.2-90b-vision-instruct`	NVIDIA model
`NVIDIA_API_KEY`	(from .mcp.json)	NVIDIA API bearer token
`VISION_TIMEOUT`	`300`	Vision API timeout (seconds)
`SCREEN_MAX_DIMENSION`	`1280`	Max dimension for vision screenshots
`COMPUTER_USE_DATA_DIR`	`data`	Directory for templates and screenshots
`OCR_SERVICE_URL`	`http://192.168.0.33:8100/ocr`	Remote RapidOCR endpoint
`TRACE_LOG_PATH`	(disabled)	Set to a JSONL path to persist action trace to disk
`COMPUTER_USE_POLICY_PATH`	`data/computer_use_policy.json`	Permission policy file

Adding to Claude Desktop

In claude_desktop_config.json (or via claude mcp add):

{
  "mcpServers": {
    "computer-use": {
      "command": "python",
      "args": ["-m", "computer_use_mcp"],
      "cwd": "C:/Users/aaron/computer-use-mcp",
      "env": {
        "NVIDIA_API_KEY": "your-key-here"
      }
    }
  }
}

HTTP API Mode

Run as HTTP/SSE server for LAN remote desktop driving or agent meshes:

python -m computer_use_mcp --http --host 0.0.0.0 --port 8000

The server exposes all 75 tools via FastMCP's SSE transport.

Compact Mode — any_tool (saves ~13K tokens per request)

By default, all 75 tools send their schemas to the LLM (~14K tokens). Compact mode exposes only 3 meta-tools and lets the LLM discover + call everything dynamically:

python -m computer_use_mcp --compact
# or
COMPACT_MODE=1 python -m computer_use_mcp

Mode	MCP tools sent to LLM	Token cost
Full (default)	78 tools	~14K tokens
Compact (`--compact`)	3 meta-tools	~440 tokens

How compact mode works

Three meta-tools replace the entire catalog:

Tool	Description
`any_tool(tool_name, arguments)`	Call any tool by name. Supports aliases (`click` → `computer_click`, `screenshot` → `computer_screenshot`).
`tool_catalog(category, search, level)`	Discover tools. Level 0 = names only (~800 tokens), 1 = names + descriptions (~1.3K tokens), 2 = full schemas.
`tool_info(name)`	Get full parameter details for one tool.

Agent workflow:

tool_catalog(level=0) → see all tool names
tool_info("computer_click") → get params {x, y, button, clicks}
any_tool("computer_click", '{"x": 500, "y": 300}') → execute
any_tool("click", '{"x": 500, "y": 300}') → same, via alias

Tools

Screenshots & Vision

Tool	Description
`computer_screenshot(monitor, region, inline_b64)`	Screenshot. `monitor=0` = all, `1+` = specific. `region="x,y,w,h"` crop. `inline_b64=True` returns inline `data:image/png;base64,...` URI for remote clients.
`analyze_screen(question, monitor)`	Send screenshot to vision AI. Returns text description.
`get_monitors()`	List connected monitors with bounds, resolution, primary flag.
`screen_record(duration, fps, region, monitor)`	Record short video to `data/videos/` for agent replay debugging.

Template Matching

Tool	Description
`save_template(name, x, y, width, height, monitor)`	Crop a screen region and save as reusable template (`data/templates/`).
`find_on_screen(template, threshold, multi_scale, monitor, method)`	Find a template on screen. Results are NMS-deduplicated.
	`method="auto"` (default) — ORB feature matching first, falls back to template
	`method="feature"` — ORB only. Best for rotation/scale.
	`method="template"` — Multi-scale pixel matching only.
`find_and_click_all(template, button, threshold, min_distance, multi_scale)`	Find all instances and click each. Clusters nearby matches.

UI Element Detection (template-free)

Tool	Description
`find_elements(query, region, min_area, monitor)`	Find UI elements by color, shape, or both — no template. Returns scored, sorted, NMS-deduplicated results.
`click_element(query, index, min_area, monitor)`	Find elements and click one at its center.
`detect_ui(labels, confidence, region, monitor)`	YOLO neural network detection. Semantic labels (keyboard, mouse, phone, screen). <50ms CPU.
`click_detected(label, index, confidence, monitor)`	Detect with YOLO and click the result.

Query language (find_elements):

Query	What it finds
`"blue buttons"`	Blue-colored rectangles
`"red circles"`	Red circles (close buttons, indicators)
`"close button"`	Red circles (close/X buttons)
`"yellow"`	All yellow-colored regions
`"icons"`	Any blob-shaped contours
`"Save button"`	Rectangle-shaped elements

Colors: red, blue, green, yellow, orange, purple, white, gray, black Shapes: rectangle (button, rect, square, box, bar), circle (round, dot, oval, ellipse), blob (icon, shape)

Mouse

Tool	Description
`mouse_position()`	Get current mouse cursor position `{x, y}`.
`computer_click(x, y, button, clicks)`	Click at coordinates. Button: `left`/`right`/`middle`.
`computer_move(x, y, duration)`	Move mouse without clicking.
`computer_scroll(amount, direction)`	Scroll at current position.
`computer_drag(x1, y1, x2, y2, duration)`	Drag from point A to B.

Keyboard

Tool	Description
`computer_type(text, interval)`	Type text or hotkeys. Supports `ctrl+a`, `alt+tab`, `win+d`, `f1`-`f24`, `delete`, `backspace`, arrows, numpad, and all standard keys. Unknown keys in combos produce an error instead of silently typing.

Window Management (Win32)

Tool	Description
`window_list(title_filter, visible_only)`	Enumerate open windows.
`window_focus(title_or_handle, bring_to_front)`	Focus window. Uses edit-distance ranking when multiple partial matches exist. Unminimizes.
`window_move(title_or_handle, x, y)`	Move window to coordinates.
`window_resize(title_or_handle, width, height)`	Resize window.
`window_maximize(title_or_handle)`	Maximize window.
`window_minimize(title_or_handle)`	Minimize window.
`window_close(title_or_handle)`	Close window gracefully.
`window_screenshot(title_or_handle)`	Screenshot a specific window.
`window_enumerate_controls(title_or_handle, control_type, depth)`	List all interactive controls with type, text, rect, AutomationId, enabled/visible state. Win32 accessibility snapshot.

Pixel Color

Tool	Description
`pixel_color(x, y)`	Get RGB and hex at a coordinate.
`pixel_color_region(x, y, w, h)`	Average color of a region.

Screen Change Detection

Tool	Description
`wait_for_change(region, timeout, interval, threshold)`	Block until a screen region changes visually.
`screen_diff(region, baseline_path, describe)`	Compare screen against saved baseline. Optionally send diff to vision model.
`screen_hash(region, monitor)`	Perceptual hash (pHash) of current screen. For fast change detection between steps.
`wait_until_stable(region, timeout, interval, monitor)`	Wait until screen stops changing (hash stabilizes). For loading animations.

OCR (three-tier)

Tool	Description
`screen_ocr(region, question)`	Extract text: remote RapidOCR → local ONNX → vision model. Structured output from local tier includes bounding boxes.
`find_on_screen_text(text, region, monitor, case_sensitive)`	Find a text string on screen via OCR. Returns match positions, line numbers, context.

Table Extraction

Tool	Description
`extract_table(region, monitor, headers)`	Extract a table from screen via OCR + bounding box clustering. Returns list-of-dicts. Auto-detects headers from first row or accepts explicit headers.

Clipboard

Tool	Description
`clipboard_get_image()`	Copy image from clipboard to `data/images/`.
`clipboard_set_image(path)`	Copy image file to clipboard.
`clipboard_get_text()`	Get text content from system clipboard.
`clipboard_set_text(text)`	Copy text to system clipboard.

Accessibility (UIA)

Tool	Description
`accessibility_tree(title_or_handle, depth)`	Get UI Automation tree. Names, types, AutomationIds, rects.
`click_by_automation_id(automation_id, title_or_handle, action)`	Click element by AutomationId. Actions: `click`, `double_click`, `right_click`, `invoke`.
`ui_find(name, control_type, title_or_handle, depth)`	Find elements by name substring and/or control type. No AutomationId needed. Returns matched elements with rects.
`ui_get_value(automation_id, title_or_handle)`	Read current value/text of a UI element. Useful for text fields, checkboxes, dropdowns.
`ui_wait(automation_id, timeout, title_or_handle)`	Wait for a UI element to appear. Polls UIA tree until found or timeout.

Semantic Screen Memory

Tool	Description
`remember_screen(label, description, region, monitor)`	Snapshot screen and store under a label. Avoids re-screenshotting in multi-step workflows.
`recall_screen(label)`	Retrieve a stored screen state's metadata and path.
`forget_screen(label)`	Remove a stored screen state from memory.
`list_screens()`	List all stored screen memory labels with age and hash.

Macro Recorder

Tool	Description
`macro_record(name)`	Start recording a macro. All tool calls captured until `macro_stop()`.
`macro_stop()`	Stop recording and save macro as JSONL to `data/macros/`.
`macro_play(name, speed, max_actions)`	Replay a recorded macro. Adjust speed with multiplier.
`macro_list()`	List all recorded macros with action counts.
`macro_delete(name)`	Delete a recorded macro.

Workflow Graph Engine

Tool	Description
`workflow_define(name, steps)`	Define a multi-step workflow as JSON. Steps support retry, branching, and conditional goto.
`workflow_run(name, start_step)`	Execute a workflow. Steps run sequentially with retry/branching support.
`workflow_status(name)`	Get workflow run status, current step, and log.
`workflow_list()`	List all defined workflows.
`workflow_delete(name)`	Delete a workflow definition.

Visual Assertions (UI Testing)

Tool	Description
`assert_visible(template, threshold, monitor)`	Assert a template is visible on screen. Returns pass/fail with evidence screenshot.
`assert_text_present(text, region, case_sensitive)`	Assert text is present via OCR. Returns pass/fail with evidence.
`assert_element_state(query, expected_state, region, monitor)`	Assert a UI element exists in the expected state (`visible`, `hidden`, `clickable`).

Form Auto-Fill

Tool	Description
`form_fill(data, title_or_handle, click_submit)`	Fill a form by providing `{field: value}` JSON. Finds UI elements by name via UIA and fills them. Optionally clicks Submit/OK/Login.

Permission System

Tool	Description
`policy_show()`	Display the current permission policy.
`policy_set(blocked_tools, blocked_processes, allowed_regions, rate_limits)`	Update the policy. Creates `computer_use_policy.json` if needed.
`policy_reload()`	Reload policy from disk.

Plugin System

Tool	Description
`plugin_list()`	List all available and loaded plugins from `data/plugins/`.

System

Tool	Description
`shell_run(command, timeout, cwd)`	Run shell command, return output.
`launch_app(name, args)`	Launch application by name.
`process_list(name_filter)`	List running processes.
`process_kill(pid, force)`	Kill process by PID.
`file_read(path, lines)`	Read a file.
`file_write(path, content)`	Write a file. Creates parent dirs.
`file_list(path, pattern)`	List files in a directory.
`file_exists(path)`	Check if file/directory exists.
`action_trace(clear, last_n)`	Action trace log — every tool call recorded for crash diagnosis. Persist to disk via `TRACE_LOG_PATH`.

Multi-Monitor

Uses ImageGrab.grab(all_screens=True) for multi-monitor setups. monitor=1 or monitor=2 for specific display, monitor=0 (default) for all.

Template Matching Methods

Multi-Scale Template Matching (`method="template"`)

Brute-force pixel correlation at 7 scales [0.75, 0.8, 0.9, 1.0, 1.1, 1.2, 1.25]. Results are NMS-deduplicated to prevent overlapping detections.

ORB Feature Matching (`method="feature"`)

ORB keypoints → BFMatcher + Lowe ratio test (0.75) → findHomography(RANSAC) → perspectiveTransform for bounding box. Resilient to rotation, scale, partial occlusion.

Auto Mode (`method="auto"`, default)

ORB first, fall back to template matching. Best of both worlds.

OCR Pipeline

Three-tier fallback chain:

Remote RapidOCR (HTTP) — fastest, <100ms, requires OCR_SERVICE_URL host
Local ONNX (rapidocr-onnxruntime) — no network needed, returns structured bounding boxes
Vision model (Ollama/NVIDIA NIM) — slowest, best for Q&A about screen content

Plugin System

Drop Python files in data/plugins/ to extend the server with custom tools. Each plugin can use @mcp.tool() to register new tools. Example:

# data/plugins/my_tools.py
from mcp.server.fastmcp import FastMCP

@mcp.tool()
def my_custom_tool(arg: str) -> dict:
    return {"result": arg.upper()}

Permission System

Define computer_use_policy.json to restrict tool access:

{
  "blocked_tools": ["shell_run"],
  "blocked_processes": ["regedit", "taskmgr"],
  "allowed_regions": [{"x": 0, "y": 0, "w": 1920, "h": 1080}],
  "rate_limits": {
    "computer_click": {"max_calls": 10, "window_seconds": 1}
  }
}

Or set it programmatically with policy_set().

UIA Workflow for Native Apps

For form-fill or data-extraction over native Windows apps:

window_enumerate_controls(title) — see all controls with their IDs and rects
ui_find(name="Username", control_type="Edit") — find elements without knowing IDs
ui_get_value(automation_id) — read current text/value
click_by_automation_id(automation_id, action="invoke") — click buttons
ui_wait(automation_id, timeout=5) — wait for elements to appear
accessibility_tree(title, depth=3) — full tree for complex navigation

Agent Workflow Patterns

Multi-step with screen memory

remember_screen("initial") → click → wait_until_stable() → recall_screen("initial")

Macro recording

macro_record("login") → [do stuff] → macro_stop() → macro_play("login")

Workflow automation

workflow_define("deploy", '[{"tool":"computer_click","args":{"x":500,"y":300},"retry":3}]') → workflow_run("deploy")

Visual testing

assert_visible("logo.png") → assert_text_present("Welcome") → assert_element_state("submit button", "visible")

README.md

computer-use-mcp

Setup

Adding to Claude Desktop

HTTP API Mode

Compact Mode — any_tool (saves ~13K tokens per request)

How compact mode works

Tools

Screenshots & Vision

Template Matching

UI Element Detection (template-free)

Mouse

Keyboard

Window Management (Win32)

Pixel Color

Screen Change Detection

OCR (three-tier)

Table Extraction

Clipboard

Accessibility (UIA)

Semantic Screen Memory

Macro Recorder

Workflow Graph Engine

Visual Assertions (UI Testing)

Form Auto-Fill

Permission System

Plugin System

System

Multi-Monitor

Template Matching Methods

Multi-Scale Template Matching (method="template")

ORB Feature Matching (method="feature")

Auto Mode (method="auto", default)

OCR Pipeline

Plugin System

Permission System

UIA Workflow for Native Apps

Agent Workflow Patterns

Multi-step with screen memory

Macro recording

Workflow automation

Visual testing

Multi-Scale Template Matching (`method="template"`)

ORB Feature Matching (`method="feature"`)

Auto Mode (`method="auto"`, default)