by philschmid · Agent Tool · ★ 107
Indexed by AgentSkillsHub · Auto-synced every 8h
AI Agent Benchmark Compendium

This post provides a high-level overview of over 50 modern benchmarks, grouped into four key categories: Function Calling and Tool Use, General Assistant and Reasoning, Coding and Software Engineering, and Computer Interactions. Would love to keep this up to date and extend it as new benchmarks come out. Please open PRs or issues.

Function Calling & Tool Use

BFCL (Berkeley Function Calling Leaderboard)

BFCL is a comprehensive benchmark designed to evaluate the function calling (also known as tool use) capabilities of Large Language Models (LLMs) in a wide range of real-world settings. It assesses models across various scenarios, including serial (simple), parallel, and multi-turn interactions, and evaluates agentic capabilities such as reasoning in stateful multi-step environments, memory, web search, and format sensitivity.

Links: Paper, Dataset

ToolBench

A massive-scale benchmark designed for evaluating and facilitating large language models in mastering over 16,000 real-world RESTful APIs. It functions as an instruction-tuning dataset for tool use, which was automatically generated using ChatGPT to enhance the general tool-use c
| Field | Value |
| --- | --- |
| Stars | 107 |
| Forks | 9 |
| Category | Agent Tool |
| Quality Score | 50.98/100 |
| Open Issues | 2 |
| Last Updated | 2025-10-15 |
| Created | 2025-10-15 |
| Est. Tokens | ~3k |
Looking for an ai-agent-benchmark-compendium alternative? If you're comparing ai-agent-benchmark-compendium with other Agent Tool projects, these 5 projects are the closest alternatives on Agent Skills Hub, ranked by topic overlap, star count, and community traction.
Lightweight registry to discover, install, and manage all public Claude plugins and agent skills for your favo
Claude Code Skills for software engineering workflows - Git automation, testing, and code review
A Claude Code skill that turns PDFs, docs, and codebases into Obsidian study vaults
86 product management skills from Lenny's Podcast for Claude Code and AI agents. Hiring, user research, strate
Power rename/refactor tool (now with agent skill support!)
ai-agent-benchmark-compendium is a compendium of over 50 benchmarks for evaluating AI agents, categorized into Function Calling & Tool Use, General Assistant & Reasoning, Coding & Software Engineering, and Computer Interaction. It is categorized as an Agent Tool with 107 GitHub stars.
You can find installation instructions and usage details in the ai-agent-benchmark-compendium GitHub repository at github.com/philschmid/ai-agent-benchmark-compendium. The project has 107 stars and 9 forks.
The top alternatives to ai-agent-benchmark-compendium on Agent Skills Hub include claude-plugins, claude-skills-marketplace, and tutor-skills. Each offers a different approach to the same problem space; compare them side by side by stars, quality score, and community activity.