MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
arXiv preprint arXiv:2508.20453, 2025
As LLM agents become more capable of using tools, robust benchmarks are needed to evaluate their performance on complex real-world tasks. This paper introduces MCP-Bench, a benchmark for evaluating tool-using LLM agents through the Model Context Protocol (MCP). Our benchmark includes diverse tasks that require agents to interact with multiple MCP servers, handle multi-step workflows, and manage state across extended interactions.
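To make the interaction pattern concrete, below is a minimal sketch of an agent-side client talking to a single MCP server over stdio using the official mcp Python SDK. The server command (weather_server.py) and the tool name (get_forecast) are hypothetical placeholders for illustration, not part of MCP-Bench; a benchmark task would involve several such sessions and chained tool calls.

# Minimal agent-side MCP client sketch using the official `mcp` Python SDK.
# The server script and tool name below are illustrative placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the MCP server as a subprocess and connect to it over stdio.
    server = StdioServerParameters(command="python", args=["weather_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools the server exposes; an agent surfaces these
            # schemas to the LLM so it can plan multi-step workflows.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Invoke one tool with structured arguments and read the result.
            result = await session.call_tool(
                "get_forecast", arguments={"city": "Tokyo"}
            )
            print(result.content)


asyncio.run(main())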
@article{wang2025mcpbench,
  title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
  author={Wang, Zihan and Chang, Qiyuan and Patel, Harsh and Biju, Sonia and Wu, Chenghan E and Liu, Qinghao and Ding, Aoyuan and Rezazadeh, Arman and Shah, Ankit Parag and others},
  journal={arXiv preprint arXiv:2508.20453},
  year={2025}
}