MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Published in The Fourteenth International Conference on Learning Representations (ICLR 2026), 2025
As LLM agents become more capable of using tools, robust benchmarks are needed to evaluate their performance on complex real-world tasks. This paper introduces MCP-Bench, a benchmark for evaluating tool-using LLM agents through the Model Context Protocol (MCP). The benchmark comprises diverse tasks that require agents to interact with multiple MCP servers, execute multi-step workflows, and maintain state across extended interactions.
Recommended citation:

@article{wang2025mcpbench,
  title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
  author={Wang, Zihan and Chang, Qiyuan and Patel, Harsh and Biju, Sonia and Wu, Chenghan E and Liu, Qinghao and Ding, Aoyuan and Rezazadeh, Arman and Shah, Ankit Parag and others},
  journal={arXiv preprint arXiv:2508.20453},
  year={2025}
}
Download Paper