tool-eval-bench — Agent Tool by SeraphimSerapis

Last updated: 2026-07-01 · Indexed by AgentSkillsHub · Auto-synced every 8h

About tool-eval-bench

tool-eval-bench is a tool-calling quality benchmark for evaluating LLM tool use in agentic workflows across open-weight model serving stacks (vLLM, LiteLLM, llama.cpp). It also includes pluggable accuracy benchmarks (GSM8K, MMLU, IFEval) that run via the same OpenAI-compatible endpoints.

Inspired by ToolCall-15, this tool runs 69 deterministic scenarios (+ 15 opt-in Hard Mode) through OpenAI-compatible endpoints, scores each result as pass, partial, or fail, and produces detailed trace reports. Mock tool responses include realistic payload noise (extra metadata, timestamps, nested objects) to test whether models can extract relevant fields from noisy API responses. It also includes an integrated throughput benchmark (llama-bench style) for measuring prefill and token generation speed.

Scope

tool-eval-bench measures tool-calling quality: whether a model picks the right tool, passes the right parameters, chains tools correctly, and handles errors and safety boundaries. It is not a full agentic system benchmark (see Related Work for how it compares to BFCL, PinchBench, and Claw-Eval).

What It Measures

Tool-Call Quality (69 scenarios across 15 categories)

Picking the right tool f
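To make the pass / partial / fail idea concrete, here is a minimal sketch of how a single tool call might be graded. It is not tool-eval-bench's actual scoring code: the `ExpectedCall` class, the `score_tool_call` function, the `get_weather` tool, and the rule "right tool with wrong parameters counts as partial" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExpectedCall:
    """Hypothetical description of the tool call a scenario expects."""
    tool: str
    arguments: dict[str, Any] = field(default_factory=dict)


def score_tool_call(expected: ExpectedCall, actual: dict[str, Any]) -> str:
    """Return 'pass', 'partial', or 'fail' for a single tool call.

    `actual` mimics the shape {"name": ..., "arguments": {...}} after the
    JSON argument string of an OpenAI-compatible response has been parsed.
    """
    if actual.get("name") != expected.tool:
        return "fail"  # wrong tool chosen
    got = actual.get("arguments", {})
    # Right tool: pass only if every expected argument matches exactly.
    if all(got.get(key) == value for key, value in expected.arguments.items()):
        return "pass"
    return "partial"  # right tool, but wrong or missing parameters


# Hypothetical scenario: the model should ask for Berlin's weather in Celsius.
expected = ExpectedCall(
    tool="get_weather",
    arguments={"city": "Berlin", "unit": "celsius"},
)
results = [
    score_tool_call(expected, {"name": "get_weather",
                               "arguments": {"city": "Berlin", "unit": "celsius"}}),
    score_tool_call(expected, {"name": "get_weather",
                               "arguments": {"city": "Berlin"}}),
    score_tool_call(expected, {"name": "search_web",
                               "arguments": {"query": "Berlin weather"}}),
]
print(results)  # ['pass', 'partial', 'fail']
```

Because the grading depends only on the tool name and parsed arguments, the same scenario gives the same verdict on every run, which is what "deterministic" suggests here. A real harness would presumably also check multi-step chains and error handling, which this sketch leaves out.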

Quick Facts

Stars	150
Forks	15
Language	Python
Category	Agent Tool
License	MIT
Quality Score	72.6/100
Last Updated	2026-07-01
Created	2026-04-17
Platforms	python
Est. Tokens	~23k

Compatible Skills

These tools work well together with tool-eval-bench for enhanced workflows:

ContextPilot: 51% match (semantic similarity 0.18; complementary function, same language, similar popularity, shared platform)

tool-eval-bench alternative? Top 6 similar tools

Looking for a tool-eval-bench alternative? If you're comparing tool-eval-bench with other agent tools, these 6 projects are the closest alternatives on Agent Skills Hub, ranked by topic overlap, star count, and community traction.

excalidraw-diagram-skill by coleam00 · ⭐ 718
Skill to give Claude Code (and any coding agent) the ability to generate beautiful and practical Excalidraw di
agent-skills by hashicorp · ⭐ 639
A collection of Agent skills and Claude Code plugins for HashiCorp products.
awesome-android-agent-skills by new-silvermoon · ⭐ 588
A collection of standardized Agent Skills to teach GitHub Copilot, Claude, Gemini and Cursor about modern Andr
claude-code-skill-factory by alirezarezvani · ⭐ 571
Claude Code Skill Factory — A powerful open-source toolkit for building and deploying production-ready Claude
claude-plugins by Kamalnrf · ⭐ 522
Lightweight registry to discover, install, and manage all public Claude plugins and agent skills for your favo
claude-skills-marketplace by mhattingpete · ⭐ 442
Claude Code Skills for software engineering workflows - Git automation, testing, and code review

Popular Python Agent Tools

TrendRadar ⭐ 60.2k · MCP Server
gpt-researcher ⭐ 27.9k · MCP Server
Scrapling ⭐ 67.2k · MCP Server
serena ⭐ 26.0k · MCP Server
MaxKB ⭐ 21.6k · MCP Server

Frequently Asked Questions

What is tool-eval-bench?

tool-eval-bench is a tool-calling quality benchmark for LLM serving stacks. It provides 80+ deterministic scenarios testing multi-turn orchestration, safety boundaries, and structured output, and supports vLLM, SGLang, and llama.cpp. It is categorized as an Agent Tool and has 150 GitHub stars.

What programming language is tool-eval-bench written in?

tool-eval-bench is primarily written in Python.

How do I install or use tool-eval-bench?

You can find installation instructions and usage details in the tool-eval-bench GitHub repository at github.com/SeraphimSerapis/tool-eval-bench. The project has 150 stars and 15 forks.

What license does tool-eval-bench use?

tool-eval-bench is released under the MIT license, making it free to use and modify according to the license terms.

What are the best alternatives to tool-eval-bench?

The top alternatives to tool-eval-bench on Agent Skills Hub include excalidraw-diagram-skill, agent-skills, and awesome-android-agent-skills. Each takes a different approach; compare them side by side by stars, quality score, and community activity.
