by aisa-group · Codex Skill · ★ 354
Indexed by AgentSkillsHub · Auto-synced every 8h
PostTrainBench: Can LLM Agents Automate LLM Post-Training?

We introduce PostTrainBench, a benchmark that measures the ability of CLI agents to post-train pre-trained large language models (LLMs). In PostTrainBench, the agent's task is to improve the performance of a base LLM on a given benchmark. The agent is given access to an evaluation script and 10 hours on an H100 GPU. Performance is measured by the benchmark score of the post-trained LLM. This setup naturally evaluates an agent's ability to conduct AI R&D.

Important: Harbor support coming soon! This repository currently targets our internal HPC cluster (HTCondor). We are adding Harbor support to make it straightforward to run on rented hardware (e.g., cloud GPUs). See our PR.

Leaderboard

Scores are weighted averages across 7 benchmarks and 4 models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, and Gemma-3-4B). Agents with multiple runs show averaged results.
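The aggregation described for the leaderboard (a weighted average over benchmark and model cells, then an average over runs) can be sketched roughly as follows. This is only an illustration: the function names, the per-benchmark weights, and the example scores are hypothetical, and the actual weighting used by PostTrainBench is not stated here.

```python
from statistics import mean


def weighted_run_score(scores, weights):
    """Weighted average of one run's scores.

    scores:  dict mapping (model, benchmark) -> score for a single run
    weights: dict mapping benchmark -> weight (hypothetical values)
    """
    total_weight = sum(weights[bench] for (_, bench) in scores)
    weighted_sum = sum(weights[bench] * s for (_, bench), s in scores.items())
    return weighted_sum / total_weight


def leaderboard_score(runs, weights):
    """Average the weighted score over all runs of one agent."""
    return mean(weighted_run_score(run, weights) for run in runs)


# Made-up numbers purely for illustration (not real benchmark results).
example_weights = {"bench_a": 1.0, "bench_b": 3.0}
example_runs = [
    {
        ("model_1", "bench_a"): 10.0,
        ("model_1", "bench_b"): 20.0,
        ("model_2", "bench_a"): 30.0,
        ("model_2", "bench_b"): 40.0,
    },
    {
        ("model_1", "bench_a"): 0.0,
        ("model_1", "bench_b"): 0.0,
        ("model_2", "bench_a"): 0.0,
        ("model_2", "bench_b"): 0.0,
    },
]

print(leaderboard_score(example_runs, example_weights))
```

Because a weighted average is linear, averaging per-run scores gives the same result as averaging each cell across runs first, provided every run covers the same model and benchmark pairs.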
| Field | Value |
| --- | --- |
| Stars | 354 |
| Forks | 44 |
| Language | Python |
| Category | Codex Skill |
| License | MIT |
| Quality Score | 38.9/100 |
| Open Issues | 15 |
| Last Updated | 2026-06-10 |
| Created | 2025-11-28 |
| Platforms | claude-code, cli, codex, gemini, python |
| Est. Tokens | ~1020k |
Looking for a PostTrainBench alternative? If you're comparing PostTrainBench with other Codex Skill tools, these 6 projects are the closest alternatives on Agent Skills Hub, ranked by topic overlap, star count, and community traction.
Agent orchestration & security template featuring MCP tool building, agent2agent workflows, mechanistic interp
The control plane for AI coding agents.
Engineering decisions engine that know when they're stale. Frame, compare, decide — with evidence decay and p
A curated list of awesome LLM and AI Agent Skills, resources and tools for customising AI Agent workflows - th
Coordinate your coding agents like a group chat — read receipts, delivery tracking, and remote ops from your p
Run Claude Code, Codex, and Gemini side by side — each in its own git worktree
PostTrainBench measures how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours. It is categorized as a Codex Skill and has 354 GitHub stars.
PostTrainBench is primarily written in Python. It covers topics such as ai-research-automation, ai-safety, claude-code.
You can find installation instructions and usage details in the PostTrainBench GitHub repository at github.com/aisa-group/PostTrainBench. The project has 354 stars and 44 forks, indicating an active community.
PostTrainBench is released under the MIT license, making it free to use and modify according to the license terms.
The top alternatives to PostTrainBench on Agent Skills Hub include template-repo, ai-devkit, and quint-code. Each offers a different approach to the same problem space; compare them side by side by stars, quality score, and community activity.