文章摘要:
关键字: DeepSeek R1, Logits Distillation, Model Optimization, Qwen, Llama3
概述: This article explores the re-distillation of smaller DeepSeek R1 models using logits distillation from larger models. The authors demonstrate that this method significantly improves performance across various benchmarks, including mathematical reasoning and general knowledge, while being highly cost-effective. By using the output distributions of larger models to guide smaller ones, they achieve performance gains with a relatively small dataset of 35,000 samples. The re-distilled models, based on Qwen and Llama3 architectures, are made freely available on Hugging Face. The authors emphasize the efficiency and cost-effectiveness of their approach, highlighting its potential for further advancements in the development of high-performance, small-scale models. The experiments show that re-distillation is a practical way to enhance the capabilities of smaller models without requiring extensive resources.
分节阅读:
- Introduction: The DeepSeek R1 distilled models are popular for running the R1 model on consumer hardware. These models include Qwen2.5 and Llama3 models, which can be optimized with quantization methods like HQQ. The original distilled models were trained using supervised fine-tuning, but this work proposes using logits distillation for further alignment.
- Re-Distillation Approach: The approach uses larger models’ output distributions to guide smaller models, which is effective without extensive training data. Quantization using HQQ is necessary for larger teacher models due to memory constraints. The training objective uses KL-divergence loss with logits clipping to prevent training instability.
- Benchmarks: The benchmarks show significant improvements across multiple tasks, with the GSM8K score increasing by over 4% for Qwen models and almost 14% for the Llama3 version. The re-distilled models consistently outperform the original distilled models in various tests. The results demonstrate the effectiveness of the re-distillation approach.
- Conclusion: Logit alignment is a highly effective and economical method for enhancing smaller models like the DeepSeek R1 models. The improved models are shared on Hugging Face, encouraging further advancements in high-performance, small-scale models. The authors believe this approach will inspire further innovation in the field.
相关工具:
- Hugging Face: https://huggingface.co/collections/mobiuslabsgmbh/deepseek-r1-redistill-6793d3bea92c7fff0639ab4d
- HQQ: https://github.com/mobiusml/hqq/
- GemLite: https://github.com/mobiusml/gemlite/
参考文献:
- DeepSeek R1: https://arxiv.org/html/2412.19437v1
- Logits Distillation: https://arxiv.org/abs/1503.02531
- DeepSeek R1 distilled models: https://huggingface.co/deepseek-ai/DeepSeek-R1
- Qwen2.5: https://github.com/QwenLM/Qwen2.5
- Llama3: https://ai.meta.com/blog/future-of-ai-built-with-llama/
- DeepSeek-R1-Distill-Qwen-32B: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/
- DeepSeek-R1-Distill-Qwen-1.5B: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/
- DeepSeek-R1-Distill-Qwen-7B: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/
- DeepSeek-R1-Distill-Llama-70B: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- DeepSeek-R1-Distill-Llama-8B: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- MetaMathQA: https://huggingface.co/datasets/meta-math/MetaMathQA
- Orca Math: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k
- Evol-Instruct: https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1
- Numira R1: https://huggingface.co/datasets/mlfoundations-dev/filtered_numina_R1
- R1-Distill-SFT: https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT
- Qwen1.5B Re-Distill: https://huggingface.co/mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.1
- Qwen7B Re-Distill: https://huggingface.co/mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-7B-v1.1
- Llama3-8B Re-Distill: https://huggingface.co/mobiuslabsgmbh/DeepSeek-R1-ReDistill-LLama3-8B-v1.1