ClawProBench — Codex Skill by suyoumo

by suyoumo · Codex Skill · ★ 576

Last updated: 2026-04-30 · Indexed by AgentSkillsHub · Auto-synced every 8h

About ClawProBench

ClawProBench is a transparent, live-first benchmark harness for evaluating model capability inside the OpenClaw runtime. It focuses on real OpenClaw execution with deterministic grading, structured reports, and benchmark-profile selection. The default ranking path is a single benchmark profile; broader active coverage remains available through the other profiles. The current worktree inventory reports 102 active scenarios and 162 total catalog scenarios, some of them incubating. Leaderboard: browse the public leaderboard and benchmark cases at suyoumo.github.io/bench.
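
This listing does not show the harness's actual interface, so the following is only a minimal sketch of what deterministic grading with repeated trials could look like. Every name in it (grade_scenario, run_trials, the toy scenario) is an illustrative assumption, not ClawProBench code.

```python
# Minimal sketch only: all names are illustrative assumptions,
# not ClawProBench's real API.
import random


def grade_scenario(transcript: str, expected: str) -> float:
    """Deterministic grade: identical inputs always yield the same score."""
    return 1.0 if expected in transcript else 0.0


def run_trials(scenario, expected: str, trials: int = 5, seed: int = 0) -> float:
    """Repeated-trial reliability: rerun the scenario with a fixed seed per
    trial and report the mean grade, so full reruns are reproducible."""
    scores = []
    for trial in range(trials):
        rng = random.Random(seed + trial)  # fixed per-trial seed
        scores.append(grade_scenario(scenario(rng), expected))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    def toy_scenario(rng: random.Random) -> str:
        # Stand-in for a real OpenClaw execution that yields a transcript.
        return f"agent output {rng.randint(0, 1)}"

    print(run_trials(toy_scenario, expected="agent output"))
```

The fixed per-trial seeds are the point: rerunning the same scenario set produces the same grades, which is what "deterministic grading" implies.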

Tags: agent · benchmark · evaluation · harness · leaderboard · llm · openclaw

Quick Facts

Stars: 576
Forks: 49
Language: Python
Category: Codex Skill
License: Apache-2.0
Quality Score: 53.296/100
Last Updated: 2026-04-30
Created: 2025-03-02
Platforms: python
Est. Tokens: ~199k

ClawProBench alternative? Top 6 similar tools

Looking for a ClawProBench alternative? If you're comparing ClawProBench with other Codex Skill tools, these 6 projects are the closest alternatives on Agent Skills Hub, ranked by topic overlap, star count, and community traction (one plausible scoring scheme is sketched after the list).

  • Awesome-LLM-Eval by onejune2018 · ⭐ 615

Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, ma…

  • claw-eval by claw-eval · ⭐ 514

    Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks verified by humans.

  • hope-agent by shiwenwen · ⭐ 318

A personal AI assistant that remembers and grows · on call on desktop / cloud / IM, and reachable remotely from a phone

  • memsearch by zilliztech · ⭐ 1.6k

A persistent, unified memory layer for all your AI agents (e.g. Claude Code, Codex), backed by Markdown and Mi…

  • AutoR by AutoX-AI-Labs · ⭐ 633

    AI handles execution, humans own the direction, and every run becomes an inspectable research artifact on disk

  • OpenJudge by agentscope-ai · ⭐ 581

    OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
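
Agent Skills Hub does not publish its ranking formula, so the sketch below is just one plausible reading of "topic overlap, star count, and community traction" as a weighted score. The weights, the Jaccard overlap measure, and the normalization constants are all assumptions.

```python
# Illustrative only: the hub's real ranking is not published.
# Weights, Jaccard overlap, and normalizers are assumptions.
import math


def jaccard(a: set[str], b: set[str]) -> float:
    """Topic overlap as intersection over union, in 0..1."""
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_score(topic_overlap: float, stars: int, forks: int,
                     w_topics: float = 0.6, w_stars: float = 0.3,
                     w_traction: float = 0.1) -> float:
    """Blend topic overlap with log-scaled popularity and traction."""
    star_term = math.log1p(stars) / math.log1p(100_000)      # ~0..1
    traction_term = math.log1p(forks) / math.log1p(10_000)   # ~0..1
    return w_topics * topic_overlap + w_stars * star_term + w_traction * traction_term


claw_topics = {"agent", "benchmark", "evaluation", "harness",
               "leaderboard", "llm", "openclaw"}
candidate = {"evaluation", "benchmark", "leaderboard", "llm"}
# Star count from the listing above; the fork count is a placeholder.
print(similarity_score(jaccard(claw_topics, candidate), stars=615, forks=40))
```

Log-scaling stars and forks is a deliberate choice in this sketch, so a very popular but topically distant project cannot drown out close topical matches.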


Frequently Asked Questions

What is ClawProBench?

ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability. It is categorized as a Codex Skill with 576 GitHub stars.

What programming language is ClawProBench written in?

ClawProBench is primarily written in Python. It covers topics such as agent, benchmark, evaluation.

How do I install or use ClawProBench?

You can find installation instructions and usage details in the ClawProBench GitHub repository at github.com/suyoumo/ClawProBench. The project has 576 stars and 49 forks, indicating an active community.
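
The repository README is the authoritative source for setup. As a hedged sketch only, a Python project like this is often installable straight from GitHub; the snippet below assumes pip-installability, which this listing does not confirm.

```python
# Hedged sketch: assumes the repo is pip-installable, which this listing
# does not confirm. Check the README for the real instructions.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "git+https://github.com/suyoumo/ClawProBench"],
    check=True,  # raise if the install fails
)
```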

What license does ClawProBench use?

ClawProBench is released under the Apache-2.0 license, making it free to use and modify according to the license terms.

What are the best alternatives to ClawProBench?

The top alternatives to ClawProBench on Agent Skills Hub include Awesome-LLM-Eval, claw-eval, and hope-agent. Each offers a different approach to the same problem space; compare them side by side by stars, quality score, and community activity.
