Keywords: Small LMs, On-device computation, Cloud costs, Minions protocol, Distributed AI
Overview: This article introduces “Minions,” a system designed to offload a significant portion of large language model (LLM) workloads to consumer devices by enabling small, on-device models to collaborate with larger, frontier models in the cloud. This approach aims to reduce cloud costs by processing long contexts locally, with minimal impact on performance. The authors envision a future where an on-device “intelligence layer” interacts with cloud-based models to deliver cost-effective and always-on intelligent applications. They present two protocols, “Minion” and “Minions,” demonstrating significant cost savings while maintaining high accuracy on data-intensive tasks like financial analysis, medical reasoning, and question answering. The article concludes with opportunities for practitioners, researchers, and hackers to contribute to the development and application of the Minions protocol.
Section Summaries:
- 🌬️ The Tailwinds: Frontier Models Cost $$$, While Local Hardware Collects Dust: Consumer devices possess increasing computational power, and tools like Ollama enable them to run capable small LMs. While these local models can handle basic tasks, more complex tasks are still handled by expensive cloud-based frontier models. The authors propose leveraging local LMs to reduce the reliance on costly cloud resources.
- ❓ The Big Question: Can Local LMs Step Up?: While frontier models excel at higher-order reasoning, small LMs are rapidly improving. The authors explore whether small on-device models can collaborate with larger cloud models to tackle challenging tasks at a reduced cost. They focus on data-intensive tasks like financial analysis, medical reasoning, and question answering to evaluate this collaboration.
- 💻 The Naive Attempt: Minion: The initial approach involves a simple chat between an on-device LM and a cloud model, where the local model has access to long documents. This reduces cloud costs significantly while maintaining a reasonable level of performance. However, the performance is limited by the small LM’s struggles with long contexts and multi-step instructions.
- 🎩 The Sophisticated Attempt: Minions: The enhanced Minions protocol introduces a decomposition-execution-aggregation loop to address the weaknesses of small LMs. The remote LM decomposes the task, the local LM executes subtasks in parallel, and the remote LM aggregates the results. This approach achieves high accuracy at a fraction of the cost of remote-only solutions.
- 🔍 Tuning The Minions: Key factors influencing the cost-accuracy tradeoff include model choice, scaling inference-time compute, and sequential communication. Local models below 3B parameters are not effective, and techniques like repeated sampling and finer-grained decomposition improve performance. Allowing for more communication rounds between local and remote models also enhances accuracy.
- 🚀 To infinity and Beyond: Minions represents a step towards establishing a communication protocol between small on-device LMs and frontier LMs. As GPUs become more common in consumer devices, smarter ways to distribute AI workloads are needed. The authors encourage practitioners, researchers, and hackers to get involved with Minions to further its development and application.
Related Tools:
- Ollama: https://ollama.com/
References:
- Paper: https://arxiv.org/abs/2502.15964
- GitHub: https://github.com/HazyResearch/minions
- FinanceBench: https://github.com/patronus-ai/financebench/tree/main
- LongHealth: https://github.com/kbressem/LongHealth
- QAsper: https://huggingface.co/datasets/allenai/qasper/viewer/qasper/train
Original Article Link: https://hazyresearch.stanford.edu/blog/2025-02-24-minions
source: https://hazyresearch.stanford.edu/blog/2025-02-24-minions