Run 500B+ Parameter LLMs Locally on a Mac Mini

by Jeramy Goodwin

Visit

Mar 9, 2026 Developer Tools

Gallery

About

Hi HN, I built OpenGraviton, an open-source AI inference engine that pushes the limits of running extremely large LLMs on consumer hardware. By combining 1.58-bit ternary quantization, dynamic sparsity with Top-K pruning and MoE routing, and mmap-based layer streaming, OpenGraviton can run models far larger than your system RAM—even on a Mac Mini. Early benchmarks: TinyLlama-1.1B drops from ~2GB (FP16) to ~0.24GB with ternary quantization. At 140B scale, models that normally require ~280GB fit within ~35GB packed. Optimized for Apple Silicon with Metal + C++ tensor unpacking, plus speculative decoding for faster generation. Check benchmarks, architecture, and details here: https://opengraviton.github.io GitHub: https://github.com/opengraviton This project isn’t just about squeezing massive models onto tiny hardware—it’s about democratizing access to giant LLMs without cloud costs. Feedback, forks, and ideas are very welcome!