X-PLUG/OSWorld-MCP

OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

🔔 Updates

2025-10-28: We released our paper and project page! 🎉

📄 Paper: arXiv:2510.24563  |  🌐 Project page: osworld-mcp.github.io


📑 Overview & Key Highlights

OSWorld-MCP is a comprehensive and fair benchmark for evaluating computer-use agents in real-world scenarios.
It jointly measures Model Context Protocol (MCP) tool invocation capabilities, graphical user interface (GUI) operation skills, and decision-making performance.
Designed as an extension of OSWorld, it significantly improves realism, balance, and comparability in evaluation.

Key Features & Findings

  • 158 validated MCP tools spanning 7 common applications (LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, OS utilities), including 25 distractor tools for robustness testing
  • 250 tool-beneficial tasks → 69% of benchmark tasks benefit from MCP tools
  • Multi-round tool invocation possible, posing real decision-making challenges
  • MCP tools boost model accuracy & efficiency — e.g., OpenAI o3: 8.3% → 20.4% (15 steps)
  • Highest observed Tool Invocation Rate (TIR) = 36.3% (Claude-4-Sonnet, 50 steps) → indicating ample room for improvement
  • MCP tools improve agent metrics
  • Higher tool invocation correlates with higher accuracy
  • Combining tools introduces significant challenges

Architecture Overview

Figure: OSWorld-MCP evaluation framework integrating GUI actions and MCP tool invocations.


⚙️ Installation & Usage

1️⃣ Preparation: Code Setup

# Clone OSWorld base repo
git clone https://github.com/xlang-ai/OSWorld.git

# Clone OSWorld-MCP
git clone https://github.com/X-PLUG/OSWorld-MCP.git

Integrate OSWorld-MCP files into OSWorld to enable MCP support.


2️⃣ Preparation: Docker Environment

  1. Copy MCP files into /home inside Docker:
/home/
├── mcp_server/
└── osworld_mcp_client.py
  2. Install dependencies:
pip install -r requirements.txt
  3. Install Node.js
  4. Launch MCP server:
cd mcp_server
bash debug_server.sh

A successful launch opens the local MCP debug UI in your browser.
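
Before launching the server, it can help to confirm that the files from step 1 are where the later steps expect them. The following is a minimal sketch, not part of the repository: the function names `check_mcp_layout` and `node_available` are hypothetical, and it assumes that `mcp_server/` and `osworld_mcp_client.py` both sit directly under `/home` and that `debug_server.sh` lives inside `mcp_server/`, as the commands above suggest.

```python
import shutil
from pathlib import Path


def check_mcp_layout(root="/home"):
    """Return a list of problems; an empty list means the layout looks ready.

    Assumption: the expected paths are inferred from the setup steps,
    not taken from any official checklist.
    """
    root = Path(root)
    problems = []
    if not (root / "mcp_server").is_dir():
        problems.append("missing directory: mcp_server/")
    if not (root / "osworld_mcp_client.py").is_file():
        problems.append("missing file: osworld_mcp_client.py")
    if not (root / "mcp_server" / "debug_server.sh").is_file():
        problems.append("missing file: mcp_server/debug_server.sh")
    return problems


def node_available():
    """Report whether a `node` executable is on PATH (step 3)."""
    return shutil.which("node") is not None
```

Calling `check_mcp_layout()` inside the container and printing the returned list gives a quick view of anything that still needs to be copied.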


3️⃣ Running Evaluation

Example: Evaluate Claude 4 Sonnet (15 steps):

python run_multienv_e2e.py \
    --api_url <your_api_url> \
    --api_key <your_api_key> \
    --model 'claude-sonnet-4-20250514-thinking' \
    --test_all_meta_path 'evaluation_examples/test_all.json' \
    --num_envs 1 \
    --action_space mcp \
    --max_steps 15 \
    --max_trajectory_length 15
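
The leaderboard reports results at both 15 and 50 steps, so the same command is typically run more than once. The sketch below only assembles the argument list from the flags shown above; `build_eval_command` is a hypothetical helper, and setting `--max_trajectory_length` equal to `--max_steps` for the 50-step run is an assumption carried over from the 15-step example.

```python
import shlex


def build_eval_command(api_url, api_key, model, max_steps, num_envs=1):
    """Assemble the evaluation command as an argument list (nothing is executed)."""
    return [
        "python", "run_multienv_e2e.py",
        "--api_url", api_url,
        "--api_key", api_key,
        "--model", model,
        "--test_all_meta_path", "evaluation_examples/test_all.json",
        "--num_envs", str(num_envs),
        "--action_space", "mcp",
        "--max_steps", str(max_steps),
        # Assumption: trajectory length mirrors max_steps, as in the example.
        "--max_trajectory_length", str(max_steps),
    ]


for steps in (15, 50):
    cmd = build_eval_command(
        "<your_api_url>",
        "<your_api_key>",
        "claude-sonnet-4-20250514-thinking",
        steps,
    )
    # Print a shell-quoted version for copy and paste.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell-quoting problems with API keys.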

📐 Key Metrics

  1. Task Accuracy (Acc): percentage of tasks completed successfully.
  2. Tool Invocation Rate (TIR): percentage of tasks on which the agent makes the correct decision about whether to invoke a tool.
  3. Average Completion Steps (ACS): average number of actions per completed task.
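
To make the three definitions concrete, here is a minimal sketch that computes them from per-task records. It reads the definitions literally and is not the benchmark's scoring code: the `TaskResult` fields are hypothetical, TIR is assumed to count a task as correct when tool use matches the task's tool-beneficial label, and ACS is assumed to average over successful tasks only. The paper's exact formulas may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    success: bool          # task completed successfully
    tool_beneficial: bool  # task is labelled as benefiting from an MCP tool
    used_tool: bool        # agent invoked at least one MCP tool
    steps: int             # number of actions taken


def compute_metrics(results):
    """Return Acc and TIR as percentages and ACS as a step count."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    acc = 100.0 * sum(r.success for r in results) / n
    # Assumption: a decision is "correct" when tool use matches the label.
    tir = 100.0 * sum(r.used_tool == r.tool_beneficial for r in results) / n
    # Assumption: ACS averages over successfully completed tasks only.
    completed = [r.steps for r in results if r.success]
    acs = sum(completed) / len(completed) if completed else float("nan")
    return {"acc": acc, "tir": tir, "acs": acs}
```

Separating the tool-use decision from task success in the record makes it easy to swap in a stricter TIR definition, for example one that also requires the task to succeed.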

📊 Leaderboard (Sorted by Accuracy)

🔗 Live Leaderboard: osworld-mcp.github.io

Max Steps: 15

| Model / Agent | Acc (%) | TIR (%) | ACS |
|---|---|---|---|
| Agent-S2.5 | 42.1 | 30.0 | 10.0 |
| Claude-4-Sonnet | 35.3 | 30.0 | 10.4 |
| Seed1.5-VL | 32.0 | 25.1 | 10.2 |
| Qwen3-VL | 31.3 | 24.5 | 10.5 |
| Gemini-2.5-Pro | 20.5 | 16.8 | 11.4 |
| OpenAI o3 | 20.4 | 16.7 | 11.6 |
| Qwen2.5-VL | 15.8 | 13.1 | 13.5 |

Max Steps: 50

| Model / Agent | Acc (%) | TIR (%) | ACS |
|---|---|---|---|
| Agent-S2.5 | 49.5 | 35.3 | 17.0 |
| Claude-4-Sonnet | 43.3 | 36.6 | 20.1 |
| Qwen3-VL | 39.1 | 29.5 | 21.1 |
| Seed1.5-VL | 38.4 | 29.0 | 23.0 |
| Gemini-2.5-Pro | 27.2 | 21.5 | 29.7 |
| OpenAI o3 | 25.2 | 21.0 | 32.1 |
| Qwen2.5-VL | 14.8 | 10.9 | 37.2 |
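
One way to read the two tables together is to look at how much accuracy each agent gains from the larger step budget. The sketch below copies the Acc columns from the tables above; the helper name `accuracy_gain` is hypothetical and the script is only an illustration, not part of the benchmark tooling.

```python
# Accuracy (%) copied from the two leaderboard tables above.
ACC_15 = {
    "Agent-S2.5": 42.1, "Claude-4-Sonnet": 35.3, "Seed1.5-VL": 32.0,
    "Qwen3-VL": 31.3, "Gemini-2.5-Pro": 20.5, "OpenAI o3": 20.4,
    "Qwen2.5-VL": 15.8,
}
ACC_50 = {
    "Agent-S2.5": 49.5, "Claude-4-Sonnet": 43.3, "Qwen3-VL": 39.1,
    "Seed1.5-VL": 38.4, "Gemini-2.5-Pro": 27.2, "OpenAI o3": 25.2,
    "Qwen2.5-VL": 14.8,
}


def accuracy_gain(acc_short, acc_long):
    """Return the change in accuracy, in percentage points, per agent."""
    return {
        name: round(acc_long[name] - acc_short[name], 1)
        for name in acc_short
        if name in acc_long
    }


gains = accuracy_gain(ACC_15, ACC_50)
# List agents from largest to smallest gain.
for name, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.1f}")
```

In these tables, Qwen2.5-VL is the only entry whose accuracy is lower at 50 steps than at 15.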

📚 Citation

@article{jia2025osworldmcp,
  title={OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents},
  author={Jia, Hongrui and Liao, Jitong and Zhang, Xi and Xu, Haiyang and Xie, Tianbao and Jiang, Chaoya and Yan, Ming and Liu, Si and Ye, Wei and Huang, Fei},
  year={2025},
  journal={arXiv preprint arXiv:2510.24563}
}
