Discover tools for evaluating, benchmarking, and comparing AI model performance and outputs.
Web Codegen Scorer is a tool for evaluating the quality of web code generated by LLMs.
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to probe the technical boundaries of generative AI.
BackdoorAgent is a stage-aware framework and benchmark that instruments LLM-agent workflows (planning, memory, tools) to systematically inject, track, and evaluate backdoor triggers across multi-step trajectories using unified metrics such as ASR (attack success rate) and clean accuracy.
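To make the two headline metrics concrete, here is a minimal Python sketch of how ASR and clean accuracy are typically computed over logged runs. This is an illustration of the metrics themselves, not BackdoorAgent's actual API; the run-record fields are hypothetical.

```python
def attack_success_rate(triggered_runs):
    """Fraction of backdoor-triggered runs in which the attacker's
    target behavior was actually produced (ASR)."""
    return sum(run["target_behavior"] for run in triggered_runs) / len(triggered_runs)

def clean_accuracy(clean_runs):
    """Fraction of trigger-free runs in which the agent still
    completed its task correctly."""
    return sum(run["task_success"] for run in clean_runs) / len(clean_runs)

# Hypothetical run records; a real harness would populate these
# from logged agent trajectories.
triggered = [{"target_behavior": True}, {"target_behavior": False}]
clean = [{"task_success": True}, {"task_success": True}]
print(attack_success_rate(triggered))  # 0.5
print(clean_accuracy(clean))           # 1.0
```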
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for AI. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration. Used by OpenAI and Anthropic.
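To show what such a declarative config looks like, here is a hedged Python sketch that writes a minimal `promptfooconfig.yaml` and invokes the promptfoo CLI. The provider IDs and assertion type follow promptfoo's documented YAML schema but are illustrative choices, not the only options:

```python
import subprocess
import yaml  # pip install pyyaml

# Minimal promptfoo config: one prompt, two providers, one test case
# with a case-insensitive substring assertion.
config = {
    "prompts": ["Summarize in one sentence: {{text}}"],
    "providers": [
        "openai:gpt-4o-mini",
        "anthropic:messages:claude-3-5-sonnet-20241022",
    ],
    "tests": [
        {
            "vars": {"text": "Promptfoo compares model outputs side by side."},
            "assert": [{"type": "icontains", "value": "promptfoo"}],
        }
    ],
}

with open("promptfooconfig.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# Run the eval via the CLI (requires Node.js and provider API keys).
subprocess.run(["npx", "promptfoo@latest", "eval"], check=True)
```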
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, the OpenAI Agents SDK, LangChain, AutoGen, AG2, and CamelAI.
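A minimal sketch of the session pattern from the SDK's documented examples follows; the exact API surface may differ across agentops versions, so treat this as an assumption-laden illustration rather than a definitive integration:

```python
import agentops
from openai import OpenAI

# Start a monitored session; agentops auto-instruments supported
# LLM clients so calls, tokens, and cost are tracked.
agentops.init(api_key="your-agentops-api-key")

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)

# Mark the session outcome (per earlier SDK docs; newer versions
# may manage session lifecycle differently).
agentops.end_session("Success")
```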
ReLE benchmark: a continuously updated capability evaluation of Chinese AI large models. It currently covers 359 models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Baidu Wenxin ERNIE-X1.1 and ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFLYTEK Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. Beyond the leaderboard, it also provides a defect library of more than 2 million model failure cases for the community to analyze and use to improve LLMs.
A full-stack AI red-teaming platform that secures AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP Scan, AI Infra Scan, and LLM jailbreak evaluation.
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
This repository contains code and tooling for the Abacus.AI LLM Context Expansion project, along with evaluation scripts and benchmark tasks that measure a model's information-retrieval capabilities under context expansion. Key experimental results and instructions for reproducing and building on them are also included.
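A toy version of this style of retrieval test (not the repository's actual harness) hides a fact at a chosen depth in a long context and checks whether the model can recall it; `query_model` below is a hypothetical stand-in for any LLM call:

```python
import random

def make_needle_prompt(needle, depth_fraction, filler_sentences=200):
    """Embed a 'needle' fact at a relative depth inside filler text."""
    filler = ["The sky was a pleasant shade of blue that afternoon."] * filler_sentences
    pos = int(depth_fraction * len(filler))
    haystack = filler[:pos] + [needle] + filler[pos:]
    context = " ".join(haystack)
    return f"{context}\n\nQuestion: What is the secret number? Answer:"

def retrieval_accuracy(query_model, trials=10):
    """Score how often the model recalls the needle at random depths."""
    hits = 0
    for _ in range(trials):
        secret = str(random.randint(1000, 9999))
        prompt = make_needle_prompt(f"The secret number is {secret}.", random.random())
        if secret in query_model(prompt):  # query_model: hypothetical LLM call
            hits += 1
    return hits / trials
```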
| Tool | Stars | Language | License | Score |
|---|---|---|---|---|
| web-codegen-scorer | ★ 705 | TypeScript | MIT | 42 |
| Awesome-LLM-Eval | ★ 615 | — | MIT | 31 |
| BackdoorAgent | ★ 30 | Python | Apache-2.0 | 31 |
| promptfoo | ★ 18.7k | TypeScript | MIT | 47 |
| agentops | ★ 5.4k | Python | MIT | 43 |
| chinese-llm-benchmark | ★ 5.7k | — | — | 41 |
| AI-Infra-Guard | ★ 3.3k | Python | Apache-2.0 | 45 |
| AgentBench | ★ 3.2k | Python | Apache-2.0 | 38 |
| OpenJudge | ★ 493 | Python | Apache-2.0 | 38 |
| Long-Context | ★ 600 | Python | Apache-2.0 | 28 |
The top model evaluation tools by composite score are promptfoo (47), AI-Infra-Guard (45), and agentops (43). The score combines GitHub stars, community activity, and code quality.
Most tools listed here are open-source. 9 out of 10 have explicit open-source licenses, making them free to use and modify.
Consider your tech stack (language compatibility), project scale (stars signal community trust), and the specific features you need; the comparison table above lets you evaluate candidates side by side.