PhAIL – Real-robot benchmark for AI models

by Gregorio Volkman

Mar 31, 2026 AI & Machine Learning

benchmarking robotics vla models

About

I built this because I couldn't find honest numbers on how well VLA models actually work on commercial tasks. I come from search ranking at Google, where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking, one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs: the operator doesn't know which model is running.

Best model: 64 UPH (units per hour). Human teleoperating the same robot: 330. Human by hand: 1,300+.

Everything is public: every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.
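
To put the three throughput figures side by side, here is a minimal sketch that expresses the best model's rate as a share of each human baseline. It is plain arithmetic on the numbers quoted above, not part of the PhAIL tooling. It reads UPH as units per hour, the names `UPH` and `relative_throughput` are made up for illustration, and the by-hand figure is treated as a lower bound because it is given as "1,300+".

```python
# Throughput figures quoted in the post (UPH, read here as units per hour).
UPH = {
    "best_model": 64,
    "human_teleop": 330,
    "human_by_hand": 1300,  # quoted as "1,300+", so this is a lower bound
}


def relative_throughput(uph: float, baseline_uph: float) -> float:
    """Return `uph` as a fraction of `baseline_uph`."""
    return uph / baseline_uph


teleop_share = relative_throughput(UPH["best_model"], UPH["human_teleop"])
hand_share = relative_throughput(UPH["best_model"], UPH["human_by_hand"])

print(f"Best model vs. human teleoperation: {teleop_share:.1%}")
# The by-hand baseline is a lower bound, so this share is an upper bound.
print(f"Best model vs. human by hand: at most {hand_share:.1%}")
```

On these figures the best model reaches roughly a fifth of teleoperated throughput and under a twentieth of by-hand throughput.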