MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Published in The Fourteenth International Conference on Learning Representations (ICLR 2026), 2025
As LLM agents become more capable of using tools, robust benchmarks are needed to evaluate their performance on complex real-world tasks. This paper introduces MCP-Bench, a benchmark for evaluating tool-using LLM agents through the Model Context Protocol (MCP). The benchmark comprises diverse tasks that require agents to interact with multiple MCP servers, execute multi-step workflows, and maintain state across extended interactions.
Recommended citation:

@article{wang2025mcpbench,
  title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
  author={Wang, Zihan and Chang, Qiyuan and Patel, Harsh and Biju, Sonia and Wu, Chenghan E and Liu, Qinghao and Ding, Aoyuan and Rezazadeh, Arman and Shah, Ankit Parag and others},
  journal={arXiv preprint arXiv:2508.20453},
  year={2025}
}
Download Paper