ClawProBench — Codex Skill by suyoumo

by suyoumo · Codex Skill · ★ 576

Last updated: 2026-04-30 · Indexed by AgentSkillsHub · Auto-synced every 8h

About ClawProBench

ClawProBench is a transparent, live-first benchmark harness for evaluating model capability inside the OpenClaw runtime. It focuses on real OpenClaw execution with deterministic grading, structured reports, and benchmark-profile selection. The default ranking path is a single benchmark profile; broader active coverage remains available through the other profiles. The current worktree inventory reports 102 active scenarios and 162 total catalog scenarios, some of them incubating. Leaderboard: browse the public leaderboard and benchmark cases at suyoumo.github.io/bench.
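
This listing does not show the harness's actual interface, so the following is only a minimal sketch of what deterministic grading with repeated trials could look like. Every name in it (grade_scenario, run_trials, the toy scenario) is an illustrative assumption, not ClawProBench code.

```python
# Minimal sketch only: all names are illustrative assumptions,
# not ClawProBench's real API.
import random


def grade_scenario(transcript: str, expected: str) -> float:
    """Deterministic grade: identical inputs always yield the same score."""
    return 1.0 if expected in transcript else 0.0


def run_trials(scenario, expected: str, trials: int = 5, seed: int = 0) -> float:
    """Repeated-trial reliability: rerun the scenario with a fixed seed per
    trial and report the mean grade, so full reruns are reproducible."""
    scores = []
    for trial in range(trials):
        rng = random.Random(seed + trial)  # fixed per-trial seed
        scores.append(grade_scenario(scenario(rng), expected))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    def toy_scenario(rng: random.Random) -> str:
        # Stand-in for a real OpenClaw execution that yields a transcript.
        return f"agent output {rng.randint(0, 1)}"

    print(run_trials(toy_scenario, expected="agent output"))
```

The fixed per-trial seeds are the point: rerunning the same scenario set produces the same grades, which is what "deterministic grading" implies.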

Tags: agent · benchmark · evaluation · harness · leaderboard · llm · openclaw

Quick Facts

Stars: 576
Forks: 49
Language: Python
Category: Codex Skill
License: Apache-2.0
Quality Score: 53.296/100
Last Updated: 2026-04-30
Created: 2025-03-02
Platforms: python
Est. Tokens: ~199k

ClawProBench alternative? Top 6 similar tools

Looking for a ClawProBench alternative? If you're comparing ClawProBench with other Codex Skill tools, these 6 projects are the closest alternatives on Agent Skills Hub, ranked by topic overlap, star count, and community traction (one plausible scoring scheme is sketched after the list).

  • Awesome-LLM-Eval by onejune2018 · ⭐ 615

Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, ma…

  • claw-eval by claw-eval · ⭐ 514

    Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks verified by humans.

  • hope-agent by shiwenwen · ⭐ 318

A personal AI assistant that remembers and grows · on call on desktop / cloud / IM, and reachable remotely from a phone

  • memsearch by zilliztech · ⭐ 1.6k

A persistent, unified memory layer for all your AI agents (e.g. Claude Code, Codex), backed by Markdown and Mi…

  • AutoR by AutoX-AI-Labs · ⭐ 633

    AI handles execution, humans own the direction, and every run becomes an inspectable research artifact on disk

  • OpenJudge by agentscope-ai · ⭐ 581

    OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
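
Agent Skills Hub does not publish its ranking formula, so the sketch below is just one plausible reading of "topic overlap, star count, and community traction" as a weighted score. The weights, the Jaccard overlap measure, and the normalization constants are all assumptions.

```python
# Illustrative only: the hub's real ranking is not published.
# Weights, Jaccard overlap, and normalizers are assumptions.
import math


def jaccard(a: set[str], b: set[str]) -> float:
    """Topic overlap as intersection over union, in 0..1."""
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_score(topic_overlap: float, stars: int, forks: int,
                     w_topics: float = 0.6, w_stars: float = 0.3,
                     w_traction: float = 0.1) -> float:
    """Blend topic overlap with log-scaled popularity and traction."""
    star_term = math.log1p(stars) / math.log1p(100_000)      # ~0..1
    traction_term = math.log1p(forks) / math.log1p(10_000)   # ~0..1
    return w_topics * topic_overlap + w_stars * star_term + w_traction * traction_term


claw_topics = {"agent", "benchmark", "evaluation", "harness",
               "leaderboard", "llm", "openclaw"}
candidate = {"evaluation", "benchmark", "leaderboard", "llm"}
# Star count from the listing above; the fork count is a placeholder.
print(similarity_score(jaccard(claw_topics, candidate), stars=615, forks=40))
```

Log-scaling stars and forks is a deliberate choice in this sketch, so a very popular but topically distant project cannot drown out close topical matches.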


Frequently Asked Questions

What is ClawProBench?

ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability. It is categorized as a Codex Skill with 576 GitHub stars.

What programming language is ClawProBench written in?

ClawProBench is primarily written in Python. It covers topics such as agent, benchmark, evaluation.

How do I install or use ClawProBench?

You can find installation instructions and usage details in the ClawProBench GitHub repository at github.com/suyoumo/ClawProBench. The project has 576 stars and 49 forks, indicating an active community.
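
The repository README is the authoritative source for setup. As a hedged sketch only, a Python project like this is often installable straight from GitHub; the snippet below assumes pip-installability, which this listing does not confirm.

```python
# Hedged sketch: assumes the repo is pip-installable, which this listing
# does not confirm. Check the README for the real instructions.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "git+https://github.com/suyoumo/ClawProBench"],
    check=True,  # raise if the install fails
)
```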

What license does ClawProBench use?

ClawProBench is released under the Apache-2.0 license, making it free to use and modify according to the license terms.

What are the best alternatives to ClawProBench?

The top alternatives to ClawProBench on Agent Skills Hub include Awesome-LLM-Eval, claw-eval, and hope-agent. Each offers a different approach to the same problem space; compare them side by side by stars, quality score, and community activity.
