ai-agent-benchmark-compendium — Agent Tool by philschmid

Last updated: 2025-10-15 · Indexed by AgentSkillsHub · Auto-synced every 8h

About ai-agent-benchmark-compendium

AI Agent Benchmark Compendium

This post provides a high-level overview of over 50 modern benchmarks, grouped into four key categories: Function Calling and Tool Use, General Assistant and Reasoning, Coding and Software Engineering, and Computer Interaction. I would love to keep this up to date and extend it as new benchmarks come up. Please open PRs or issues.

Function Calling & Tool Use

BFCL (Berkeley Function Calling Leaderboard)

BFCL is a comprehensive benchmark designed to evaluate the function calling (also known as tool use) capabilities of Large Language Models (LLMs) in a wide range of real-world settings. It assesses models across various scenarios, including serial (simple), parallel, and multi-turn interactions, and evaluates agentic capabilities such as reasoning in stateful multi-step environments, memory, web search, and format sensitivity.

ToolBench

A massive-scale benchmark designed for evaluating and facilitating large language models in mastering over 16,000 real-world RESTful APIs. It functions as an instruction-tuning dataset for tool use, which was automatically generated using ChatGPT to enhance the general tool-use c

Quick Facts

Stars	107
Forks	9
Category	Agent Tool
Quality Score	51.0/100
Open Issues	2
Last Updated	2025-10-15
Created	2025-10-15
Est. Tokens	~3k

Top 5 alternatives to ai-agent-benchmark-compendium

Looking for an alternative to ai-agent-benchmark-compendium? These 5 projects are the closest alternatives among the Agent Tool listings on Agent Skills Hub, ranked by topic overlap, star count, and community traction.

claude-plugins by Kamalnrf · ⭐ 522
Lightweight registry to discover, install, and manage all public Claude plugins and agent skills for your favo
claude-skills-marketplace by mhattingpete · ⭐ 442
Claude Code Skills for software engineering workflows - Git automation, testing, and code review
tutor-skills by RoundTable02 · ⭐ 400
A Claude Code skill that turns PDFs, docs, and codebases into Obsidian study vaults
lenny-skills by RefoundAI · ⭐ 382
86 product management skills from Lenny's Podcast for Claude Code and AI agents. Hiring, user research, strate
repren by jlevy · ⭐ 371
Power rename/refactor tool (now with agent skill support!)

Frequently Asked Questions

What is ai-agent-benchmark-compendium?

ai-agent-benchmark-compendium is a compendium of over 50 benchmarks for evaluating AI agents, categorized into Function Calling & Tool Use, General Assistant & Reasoning, Coding & Software Engineering, and Computer Interaction. It is categorized as an Agent Tool and has 107 GitHub stars.

How do I install or use ai-agent-benchmark-compendium?

You can find usage details in the ai-agent-benchmark-compendium GitHub repository at github.com/philschmid/ai-agent-benchmark-compendium. The project has 107 stars and 9 forks.

What are the best alternatives to ai-agent-benchmark-compendium?

The top alternatives to ai-agent-benchmark-compendium on Agent Skills Hub include claude-plugins, claude-skills-marketplace, and tutor-skills. Each takes a different approach to the same problem space; compare them side by side by stars, quality score, and community activity.
