by SeraphimSerapis · Agent Tool · ★ 150
Indexed by AgentSkillsHub · Auto-synced every 8h
tool-eval-bench

A tool-calling quality benchmark for evaluating LLM tool-use in agentic workflows across open-weight model serving stacks (vLLM, LiteLLM, llama.cpp). Also includes pluggable accuracy benchmarks (GSM8K, MMLU, IFEval) via the same OpenAI-compatible endpoints.

Inspired by ToolCall-15, this tool runs 69 deterministic scenarios (+ 15 opt-in Hard Mode) through OpenAI-compatible endpoints, scores each result as pass, partial, or fail, and produces detailed trace reports. Mock tool responses include realistic payload noise (extra metadata, timestamps, nested objects) to test whether models can extract relevant fields from noisy API responses. It also includes an integrated throughput benchmark (llama-bench style) for measuring prefill and token generation speed.

Scope. tool-eval-bench measures tool-calling quality: whether a model picks the right tool, passes the right parameters, chains tools correctly, and handles errors and safety boundaries. It is not a full agentic system benchmark (see Related Work for how it compares to BFCL, PinchBench, and Claw-Eval).

What It Measures

Tool-Call Quality (69 scenarios across 15 categories)

Picking the right tool f
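As an illustration of the pass, partial, or fail scoring described in the summary, here is a minimal Python sketch. It is not taken from tool-eval-bench: the `Scenario` class, the `score_tool_call` function, and the rule that "right tool, wrong parameters" counts as partial are all assumptions made for this example. The only detail borrowed from real APIs is the OpenAI-compatible tool-call shape, where `function.arguments` is a JSON-encoded string.

```python
import json
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical scenario shape; not the project's actual data model."""
    expected_tool: str
    expected_args: dict


def score_tool_call(scenario: Scenario, tool_call: dict) -> str:
    """Grade one OpenAI-style tool call as 'pass', 'partial', or 'fail'.

    Assumed grading rule (illustrative only):
      - wrong tool or unparseable arguments -> fail
      - right tool, arguments differ        -> partial
      - right tool, arguments match exactly -> pass
    """
    function = tool_call.get("function", {})
    if function.get("name") != scenario.expected_tool:
        return "fail"  # the model picked the wrong tool
    try:
        # OpenAI-compatible endpoints return arguments as a JSON string.
        args = json.loads(function.get("arguments") or "{}")
    except (json.JSONDecodeError, TypeError):
        return "fail"  # arguments were not valid JSON
    if args == scenario.expected_args:
        return "pass"
    return "partial"  # right tool, wrong or incomplete parameters


# Made-up scenario and model outputs, for illustration only.
scenario = Scenario(expected_tool="get_weather", expected_args={"city": "Berlin"})
good_call = {"function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'}}
wrong_args_call = {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
wrong_tool_call = {"function": {"name": "search_web", "arguments": '{"q": "Berlin"}'}}

for call in (good_call, wrong_args_call, wrong_tool_call):
    print(call["function"]["name"], "->", score_tool_call(scenario, call))
```

Because the grading is a pure function of the expected and observed call, the same scenario always yields the same score, which is the property a deterministic benchmark relies on. The project's real criteria for a partial result may differ from the rule assumed here.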
| Field | Value |
|---|---|
| Stars | 150 |
| Forks | 15 |
| Language | Python |
| Category | Agent Tool |
| License | MIT |
| Quality Score | 72.56/100 |
| Last Updated | 2026-07-01 |
| Created | 2026-04-17 |
| Platforms | python |
| Est. Tokens | ~23k |
Looking for a tool-eval-bench alternative? If you're comparing tool-eval-bench with other agent tools, these 6 projects are the closest alternatives on Agent Skills Hub, ranked by topic overlap, star count, and community traction.
Skill to give Claude Code (and any coding agent) the ability to generate beautiful and practical Excalidraw di
A collection of Agent skills and Claude Code plugins for HashiCorp products.
A collection of standardized Agent Skills to teach GitHub Copilot, Claude, Gemini and Cursor about modern Andr
Claude Code Skill Factory — A powerful open-source toolkit for building and deploying production-ready Claude
Lightweight registry to discover, install, and manage all public Claude plugins and agent skills for your favo
Claude Code Skills for software engineering workflows - Git automation, testing, and code review
tool-eval-bench is a tool-calling quality benchmark for LLM serving stacks. It provides 80+ deterministic scenarios testing multi-turn orchestration, safety boundaries, and structured output, and supports vLLM, SGLang, and llama.cpp. It is categorized as an Agent Tool with 150 GitHub stars.
tool-eval-bench is primarily written in Python.
You can find installation instructions and usage details in the tool-eval-bench GitHub repository at github.com/SeraphimSerapis/tool-eval-bench. The project has 150 stars and 15 forks, indicating an active community.
tool-eval-bench is released under the MIT license, making it free to use and modify according to the license terms.
The top alternatives to tool-eval-bench on Agent Skills Hub include excalidraw-diagram-skill, agent-skills, and awesome-android-agent-skills. Each offers a different approach to the same problem space; compare them side-by-side by stars, quality score, and community activity.