PhAIL – Real-robot benchmark for AI models

Mar 31, 2026 AI & Machine Learning
benchmarking robotics vla models

About

I built this because I couldn't find honest numbers on how well VLA models actually work on commercial tasks. I come from search ranking at Google, where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking, one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.

Best model: 64 UPH. Human teleoperating the same robot: 330. Human by hand: 1,300+.

Everything is public: every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.
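To put the three throughput figures quoted above in proportion, here is a minimal Python sketch. It assumes UPH means units picked per hour (the post does not spell the acronym out) and treats "1,300+" as a lower bound of 1,300. The function and dictionary names are made up for illustration and are not part of the PhAIL codebase.

```python
# Throughput figures quoted in the post.
# Assumption: UPH = units picked per hour.
THROUGHPUT_UPH = {
    "best_model": 64,
    "human_teleop": 330,
    "human_by_hand": 1300,  # post says "1,300+", so this is a lower bound
}


def relative_throughput(uph: float, baseline_uph: float) -> float:
    """Return uph expressed as a fraction of baseline_uph."""
    if baseline_uph <= 0:
        raise ValueError("baseline_uph must be positive")
    return uph / baseline_uph


def seconds_per_unit(uph: float) -> float:
    """Average seconds spent per picked unit at a given hourly rate."""
    if uph <= 0:
        raise ValueError("uph must be positive")
    return 3600.0 / uph


if __name__ == "__main__":
    model = THROUGHPUT_UPH["best_model"]
    for name, uph in THROUGHPUT_UPH.items():
        share = relative_throughput(model, uph)
        print(f"{name}: {uph} UPH, {seconds_per_unit(uph):.1f} s/unit, "
              f"best model reaches {share:.1%} of this rate")
```

Under those assumptions, the best model runs at roughly 19% of the teleoperated rate and at most about 5% of the by-hand rate, or about 56 seconds per pick versus about 11 and under 3.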
