PostTrainBench — Codex Skill by aisa-group

by aisa-group · Codex Skill · ★ 354

About PostTrainBench

PostTrainBench: Can LLM Agents Automate LLM Post-Training?

We introduce PostTrainBench, a benchmark that measures the ability of CLI agents to post-train pre-trained large language models (LLMs). In PostTrainBench, the agent's task is to improve the performance of a base LLM on a given benchmark. The agent is given access to an evaluation script and 10 hours on an H100 GPU. Performance is measured by the benchmark score of the post-trained LLM. This setup naturally evaluates an agent's ability to conduct AI R&D.

Important: Harbor support coming soon! This repository currently targets our internal HPC cluster (HTCondor). We are adding Harbor support to make it straightforward to run on rented hardware (e.g., cloud GPUs). See our PR.

Leaderboard

Scores are weighted averages across 7 benchmarks and 4 models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, and Gemma-3-4B). Agents with multiple runs show averaged results.
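The leaderboard description (runs averaged per agent, then a weighted average across benchmarks and models) can be illustrated with a minimal sketch. The page does not state the actual weights or the order of aggregation, so everything below is an assumption: the function name `leaderboard_score`, the benchmark names `bench_a` and `bench_b`, the weights, and the scores are all hypothetical and are not taken from the PostTrainBench repository.

```python
from collections import defaultdict
from statistics import fmean


def leaderboard_score(runs, benchmark_weights):
    """Aggregate one agent's runs into a single number (illustrative only).

    runs: list of dicts, one per run, mapping (model, benchmark) -> score.
    benchmark_weights: dict mapping benchmark -> weight (hypothetical values).
    """
    # Step 1: average repeated runs for each (model, benchmark) cell.
    cells = defaultdict(list)
    for run in runs:
        for key, score in run.items():
            cells[key].append(score)
    averaged = {key: fmean(scores) for key, scores in cells.items()}

    # Step 2: weighted average over all cells; the weight depends on the benchmark.
    total_weight = sum(benchmark_weights[bench] for (_, bench) in averaged)
    if total_weight == 0:
        raise ValueError("benchmark weights must not sum to zero")
    weighted_sum = sum(
        benchmark_weights[bench] * score for (_, bench), score in averaged.items()
    )
    return weighted_sum / total_weight


# Hypothetical data: two runs of one agent on one model and two made-up benchmarks.
example_runs = [
    {("Qwen3-1.7B", "bench_a"): 20.0, ("Qwen3-1.7B", "bench_b"): 40.0},
    {("Qwen3-1.7B", "bench_a"): 30.0, ("Qwen3-1.7B", "bench_b"): 40.0},
]
example_weights = {"bench_a": 1.0, "bench_b": 3.0}

print(leaderboard_score(example_runs, example_weights))  # 36.25
```

Averaging runs before weighting keeps an agent with many runs from counting more than an agent with one; whether the real leaderboard aggregates in this order is not stated on this page.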

ai-research-automation · ai-safety · claude-code · codex-cli · gemini-cli · post-training

Quick Facts

Stars: 354
Forks: 44
Language: Python
Category: Codex Skill
License: MIT
Quality Score: 38.9/100
Open Issues: 15
Last Updated: 2026-06-10
Created: 2025-11-28
Platforms: claude-code, cli, codex, gemini, python
Est. Tokens: ~1020k

Compatible Skills

These tools work well together with PostTrainBench for enhanced workflows:

  • claude-scholar: 54% (semantic 0.26, complementary, same language, similar popularity, shared platform)
  • renderdoc-skill: 53% (semantic 0.23, complementary, same language, similar popularity, shared platform)
  • talkito: 53% (semantic 0.22, complementary, same language, similar popularity, shared platform)
  • claude-codex-settings: 52% (semantic 0.19, complementary, same language, similar popularity, shared platform)
  • af-deep-research: 51% (semantic 0.17, complementary, same language, similar popularity, shared platform)

PostTrainBench alternative? Top 6 similar tools

The 6 projects below are the closest alternatives to PostTrainBench among Codex Skill tools on Agent Skills Hub, ranked by topic overlap, star count, and community traction.

  • template-repo by AndrewAltimit · ⭐ 128

    Agent orchestration & security template featuring MCP tool building, agent2agent workflows, mechanistic interp

  • ai-devkit by codeaholicguy · ⭐ 1.4k

    The control plane for AI coding agents.

  • quint-code by m0n0x41d · ⭐ 1.3k

    Engineering decisions engine that know when they're stale. Frame, compare, decide — with evidence decay and p

  • awesome-llm-skills by Prat011 · ⭐ 1.1k

    A curated list of awesome LLM and AI Agent Skills, resources and tools for customising AI Agent workflows - th

  • cccc by ChesterRa · ⭐ 987

    Coordinate your coding agents like a group chat — read receipts, delivery tracking, and remote ops from your p

  • parallel-code by johannesjo · ⭐ 739

    Run Claude Code, Codex, and Gemini side by side — each in its own git worktree

Frequently Asked Questions

What is PostTrainBench?

PostTrainBench measures how well CLI agents such as Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours. It is categorized as a Codex Skill and has 354 GitHub stars.

What programming language is PostTrainBench written in?

PostTrainBench is primarily written in Python. It covers topics such as ai-research-automation, ai-safety, claude-code.

How do I install or use PostTrainBench?

Installation instructions and usage details are in the PostTrainBench GitHub repository at github.com/aisa-group/PostTrainBench. The project has 354 stars and 44 forks.

What license does PostTrainBench use?

PostTrainBench is released under the MIT license, making it free to use and modify according to the license terms.

What are the best alternatives to PostTrainBench?

The top alternatives to PostTrainBench on Agent Skills Hub include template-repo, ai-devkit, and quint-code. Each takes a different approach to the same problem space; compare them by stars, quality score, and community activity.
