I want to share a new dataset of 331 reward-hackable environments. These are real environments used in Terminal Bench and adjacent benchmarks. I first got interested in this as a reviewer of Terminal Bench, where I noticed that many of our tasks were hackable. I also noticed that many people contribute to the benchmark because it lends them credibility when selling environments to labs. Hence, in my opinion, TBench tasks are held to a higher quality standard than the tasks being used for RL today. No one is spending hours manually reviewing the $1B in tasks being purchased by major labs. As far as I know, while everyone knows environments are hackable, nobody has released hundreds of "realistic" environments.