PawBench — Codex Skill by agentscope-ai

Last updated: 2026-06-25 · Indexed by AgentSkillsHub · Auto-synced every 8h

About PawBench

🐾 PawBench

A Model × Harness co-evaluation benchmark for agentic AI.

150 agent tasks · 9 models · 3 harnesses · task slices · diagnostic traces

The same model can behave very differently once it is placed inside a real agent runtime. A failure may come from model reasoning, missing tools, weak skill discovery, poor workspace awareness, brittle web access, or a completion check that is too loose. A single final pass rate cannot separate these causes. PawBench is built around one c…
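To illustrate why a single pass rate hides the source of a failure, here is a minimal sketch that breaks results down per (model, harness) cell and counts failure causes. It is not PawBench's API or data format: the record layout, the model and harness names, the failure-cause labels, and both helper functions are hypothetical, chosen only to mirror the failure sources listed above.

```python
from collections import defaultdict

# Hypothetical records: (model, harness, failure_cause), where
# failure_cause is None when the task passed. All names are made up.
results = [
    ("model-a", "harness-x", None),
    ("model-a", "harness-x", "missing_tool"),
    ("model-a", "harness-y", None),
    ("model-a", "harness-y", None),
    ("model-b", "harness-x", "model_reasoning"),
    ("model-b", "harness-x", None),
    ("model-b", "harness-y", "web_access"),
    ("model-b", "harness-y", "web_access"),
]


def pass_rate_by_cell(records):
    """Return the pass rate for every (model, harness) pair."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, harness, cause in records:
        totals[(model, harness)] += 1
        if cause is None:
            passes[(model, harness)] += 1
    return {cell: passes[cell] / totals[cell] for cell in totals}


def failure_causes_by_cell(records):
    """Count failure causes for every (model, harness) pair that failed."""
    causes = defaultdict(lambda: defaultdict(int))
    for model, harness, cause in records:
        if cause is not None:
            causes[(model, harness)][cause] += 1
    return {cell: dict(counts) for cell, counts in causes.items()}


# One aggregate number over all runs.
overall = sum(1 for _, _, cause in results if cause is None) / len(results)
# The same runs, split by cell and by cause.
cells = pass_rate_by_cell(results)
causes = failure_causes_by_cell(results)

print(f"overall pass rate: {overall:.2f}")
for cell, rate in sorted(cells.items()):
    print(cell, f"{rate:.2f}", causes.get(cell, {}))
```

In this toy data the overall pass rate is 0.50, yet the per-cell rates range from 0.00 to 1.00, and the failures in one cell are all attributed to web access rather than to model reasoning. That spread is the kind of signal an aggregate score cannot show.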

agent benchmark harness hermes llm openclaw qwenpaw

Quick Facts

Stars	72
Forks	5
Language	Python
Category	Codex Skill
License	Apache-2.0
Quality Score	66.8/100
Open Issues	2
Last Updated	2026-06-25
Created	2026-05-15
Platforms	python
Est. Tokens	~16k

Compatible Skills

These tools work well together with PawBench for enhanced workflows:

OpenClawProBench: semantic similarity 0.72, same language, similar popularity, shared platform (60%)
ollama-benchmark: semantic similarity 0.37, complementary, same language, similar popularity, shared platform (58%)
llm-d-benchmark: semantic similarity 0.37, complementary, same language, similar popularity, shared platform (58%)

PawBench alternative? Top 6 similar tools

Looking for a PawBench alternative? These 6 projects are the closest alternatives among the Codex Skill tools on Agent Skills Hub, ranked by topic overlap, star count, and community traction.

clawshell by clawshell · ⭐ 280
The Runtime Security Layer for OpenClaw/Hermes-agent, the essential safety harness for PII & sensitive credent
omnicoreagent by omnirexflora-labs · ⭐ 241
Open Python agent harness for production AI apps: tools, MCP, memory, workspace, telemetry, subagents, backgro
LLM-Agent-Benchmark-List by zhangxjohn · ⭐ 167
A benchmark list for evaluation of large language models.
Network-AI by Jovancoding · ⭐ 65
Traffic light for AI Agents and TypeScript/Node multi-agent orchestrator with shared state, guardrails, and ad
ollama-benchmark by aidatatools · ⭐ 345
LLM Benchmark for Throughput via Ollama (Local LLMs)
kindly-web-search-mcp-server by Shelpuk-AI-Technology-Consulting · ⭐ 332
Kindly Web Search MCP Server: Web search + robust content retrieval for AI coding tools (Claude Code, Codex, C

More Codex Skill Tools

Explore other popular codex skill tools:

refly ⭐ 7.1k
nano-banana-pro-prompts-recommend-skill ⭐ 1.7k
skillshare ⭐ 2.3k
pi-skills ⭐ 1.9k
skillkit ⭐ 1.2k
ai-maestro ⭐ 709
upskill ⭐ 652
Claude-to-IM-skill ⭐ 2.8k
learning-opportunities ⭐ 2.1k
solana-dev-skill ⭐ 520


Popular Python Agent Tools

TrendRadar ⭐ 60.2k · MCP Server
gpt-researcher ⭐ 27.9k · MCP Server
Scrapling ⭐ 67.2k · MCP Server
serena ⭐ 26.0k · MCP Server
MaxKB ⭐ 21.6k · MCP Server

Frequently Asked Questions

What is PawBench?

PawBench is a benchmark for evaluating LLM × harness performance. It is categorized as a Codex Skill and has 72 GitHub stars.

What programming language is PawBench written in?

PawBench is primarily written in Python. It covers topics such as agent, benchmark, harness.

How do I install or use PawBench?

Installation instructions and usage details are in the PawBench GitHub repository at github.com/agentscope-ai/PawBench. The project has 72 stars and 5 forks.

What license does PawBench use?

PawBench is released under the Apache-2.0 license, making it free to use and modify according to the license terms.

What are the best alternatives to PawBench?

The top alternatives to PawBench on Agent Skills Hub include clawshell, omnicoreagent, and LLM-Agent-Benchmark-List. Each takes a different approach to the same problem space; compare them side by side by stars, quality score, and community activity.
