PawBench — Codex Skill by agentscope-ai

by agentscope-ai · Codex Skill · ★ 72

Indexed by AgentSkillsHub · Auto-synced every 8h

About PawBench

🐾 PawBench

A Model × Harness co-evaluation benchmark for agentic AI.

150 agent tasks · 9 models · 3 harnesses · task slices · diagnostic traces

The same model can behave very differently once it is placed inside a real agent runtime. A failure may come from model reasoning, missing tools, weak skill discovery, poor workspace awareness, brittle web access, or a completion check that is too loose. A single final pass rate cannot separate these causes. PawBench is built around one c
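The claim that a single final pass rate cannot separate causes can be illustrated with a small sketch. Everything below is invented for illustration: the records, the model and harness names, and the helper functions are hypothetical and are not PawBench data or its API.

```python
from collections import defaultdict

# Hypothetical run records: (model, harness, passed).
# These are made-up values, not real PawBench results.
RUNS = [
    ("model-a", "harness-x", True),
    ("model-a", "harness-x", True),
    ("model-a", "harness-y", False),
    ("model-a", "harness-y", False),
    ("model-b", "harness-x", True),
    ("model-b", "harness-x", False),
    ("model-b", "harness-y", True),
    ("model-b", "harness-y", False),
]


def pass_rate(records):
    """Fraction of records that passed, ignoring model and harness."""
    records = list(records)
    return sum(passed for _, _, passed in records) / len(records)


def pass_rate_by_cell(records):
    """Pass rate for each (model, harness) pair."""
    cells = defaultdict(list)
    for model, harness, passed in records:
        cells[(model, harness)].append(passed)
    return {cell: sum(results) / len(results) for cell, results in cells.items()}


def pass_rate_by_model(records):
    """Pass rate per model, aggregated over all harnesses."""
    models = defaultdict(list)
    for model, _, passed in records:
        models[model].append(passed)
    return {model: sum(results) / len(results) for model, results in models.items()}


overall = pass_rate(RUNS)
by_model = pass_rate_by_model(RUNS)
by_cell = pass_rate_by_cell(RUNS)

print("overall:", overall)
for model, rate in sorted(by_model.items()):
    print(model, rate)
for (model, harness), rate in sorted(by_cell.items()):
    print(model, harness, rate)
```

In this made-up data both models have the same aggregate pass rate, yet model-a passes every task under one harness and none under the other. Only the per-cell breakdown exposes that difference, which is the kind of separation a Model × Harness view is meant to provide.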

Topics: agent · benchmark · harness · hermes · llm · openclaw · qwenpaw

Quick Facts

Stars: 72
Forks: 5
Language: Python
Category: Codex Skill
License: Apache-2.0
Quality Score: 66.7888531271658/100
Open Issues: 2
Last Updated: 2026-06-25
Created: 2026-05-15
Platforms: python
Est. Tokens: ~16k

Compatible Skills

These tools work well together with PawBench for enhanced workflows:

  • OpenClawProBench: semantic similarity 0.72; same language, similar popularity, shared platform (60%)
  • ollama-benchmark: semantic similarity 0.37; complementary, same language, similar popularity, shared platform (58%)
  • llm-d-benchmark: semantic similarity 0.37; complementary, same language, similar popularity, shared platform (58%)

PawBench alternatives: top 6 similar tools

If you're comparing PawBench with other Codex Skill tools, these 6 projects are the closest alternatives on Agent Skills Hub, ranked by topic overlap, star count, and community traction.

  • clawshell by clawshell · ⭐ 280

    The Runtime Security Layer for OpenClaw/Hermes-agent, the essential safety harness for PII & sensitive credent

  • omnicoreagent by omnirexflora-labs · ⭐ 241

    Open Python agent harness for production AI apps: tools, MCP, memory, workspace, telemetry, subagents, backgro

  • LLM-Agent-Benchmark-List by zhangxjohn · ⭐ 167

    A benchmark list for evaluation of large language models.

  • Network-AI by Jovancoding · ⭐ 65

    Traffic light for AI Agents and TypeScript/Node multi-agent orchestrator with shared state, guardrails, and ad

  • ollama-benchmark by aidatatools · ⭐ 345

    LLM Benchmark for Throughput via Ollama (Local LLMs)

  • kindly-web-search-mcp-server by Shelpuk-AI-Technology-Consulting · ⭐ 332

    Kindly Web Search MCP Server: Web search + robust content retrieval for AI coding tools (Claude Code, Codex, C

Frequently Asked Questions

What is PawBench?

PawBench is a benchmark for evaluating LLM × harness performance. It is categorized as a Codex Skill and has 72 GitHub stars.

What programming language is PawBench written in?

PawBench is primarily written in Python. It covers topics such as agent, benchmark, harness.

How do I install or use PawBench?

Installation instructions and usage details are in the PawBench GitHub repository at github.com/agentscope-ai/PawBench. The project has 72 stars and 5 forks.

What license does PawBench use?

PawBench is released under the Apache-2.0 license, making it free to use and modify according to the license terms.

What are the best alternatives to PawBench?

The top alternatives to PawBench on Agent Skills Hub include clawshell, omnicoreagent, and LLM-Agent-Benchmark-List. Each takes a different approach to the same problem space; compare them side by side by stars, quality score, and community activity.
