Language: Chinese
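Keywords: LLM, patterns, engineering, product, evaluation
Overview: This article looks in depth at practical patterns for integrating large language models (LLMs) into systems and products. It covers seven key patterns: evals, retrieval-augmented generation, fine-tuning, caching, guardrails, defensive UX, and collecting user feedback. For each, it explains how the pattern improves LLM performance, reduces cost and risk, or serves users better. It also discusses how other machine learning patterns, such as the data flywheel, cascades, and monitoring, apply to LLM systems. The author stresses that teams building LLM-based systems and products should choose the patterns that fit the problem at hand and keep collecting user feedback to improve the model.
Section-by-section summary: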
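- Evals: To measure performance
- Evals are a set of measurements, made up of benchmark data and metrics, used to assess how well a model performs on a task. They are essential for teams that are serious about building products: they measure how the system is doing and detect regressions. Without evals, we would be flying blind, or would have to visually inspect LLM outputs after every change.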
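- Retrieval-Augmented Generation: To add knowledge
- Retrieval-augmented generation (RAG) fetches relevant data from outside the foundation model and augments the input with it, giving the model richer context to improve its output. By grounding the model in the retrieved context, RAG helps reduce hallucinations and improves factuality. In addition, keeping a retrieval index up to date is cheaper than continually pre-training an LLM.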
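- Fine-tuning: To get better at specific tasks
- Fine-tuning takes a pre-trained model (one already trained on a large amount of data) and refines it further on a specific task. It can improve the performance of an off-the-shelf base model and can even surpass third-party LLMs. By fine-tuning and hosting our own models, we can ensure that data does not leave our network and scale throughput as needed.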
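- Caching: To reduce latency and cost
- Caching stores previously retrieved or computed data so that future requests for the same data can be served faster. It can significantly reduce latency for responses that have been served before. By removing the need to recompute the response for the same input again and again, it also cuts the number of LLM requests and therefore cost.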
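- Guardrails: To ensure output quality
- In the context of LLMs, guardrails validate the model's output, ensuring that it not only sounds good but is also syntactically correct, factual, and free of harmful content. Guardrails help make model outputs reliable and consistent enough to use in production. They also provide an additional layer of safety and maintain quality control over LLM output.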
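- Defensive UX: To anticipate & handle errors gracefully
- Defensive UX is a design strategy that acknowledges that bad things, such as inaccuracies or hallucinations, can happen while users interact with machine learning or LLM-based products. It helps mitigate these problems by improving accessibility, increasing trust, and making the product more pleasant to use. Designing both the system and the interface to handle ambiguous situations and errors leads to a smoother experience overall.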
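- Collect user feedback: To build our data flywheel
- Collecting user feedback lets us learn what users prefer. For LLM products specifically, user feedback helps build evals, fine-tuning data, and guardrails. It helps improve our models and adapt them to individual preferences, and the feedback loop also helps us evaluate the overall performance of the system.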
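
The evals pattern in the list above pairs benchmark data with a metric and uses the resulting score to catch regressions. A minimal Python sketch of that loop follows. The benchmark items, the `stub_model` function, the exact-match metric, and the 0.05 tolerance are all hypothetical choices made for illustration, not something the summarized article prescribes.

```python
from typing import Callable, List, Tuple

# Hypothetical benchmark data: (prompt, reference answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
    ("Largest planet?", "Jupiter"),
]


def exact_match(prediction: str, reference: str) -> bool:
    """Metric: case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()


def run_eval(model: Callable[[str], str],
             benchmark: List[Tuple[str, str]] = BENCHMARK) -> float:
    """Return the fraction of benchmark items the model answers correctly."""
    hits = sum(exact_match(model(prompt), ref) for prompt, ref in benchmark)
    return hits / len(benchmark)


def has_regressed(score: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the score drops more than `tolerance` below baseline."""
    return score < baseline - tolerance


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: any str -> str callable works)."""
    answers = {
        "2 + 2 =": "4",
        "Capital of France?": "paris ",
        "Opposite of hot?": "warm",  # deliberately wrong
        "Largest planet?": "Jupiter",
    }
    return answers.get(prompt, "")


if __name__ == "__main__":
    score = run_eval(stub_model)
    print(f"score={score:.2f} regressed={has_regressed(score, baseline=0.9)}")
```

Running a check like this on every prompt or model change is one way to avoid inspecting outputs by eye; a real setup would likely swap in a larger benchmark and a task-appropriate metric.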
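
The retrieval-augmented generation pattern in the list above has two steps: fetch relevant data from outside the model, then add it to the input. The sketch below illustrates both under loud assumptions: the `DOCUMENTS` corpus is made up, the prompt template is hypothetical, and bag-of-words cosine similarity stands in for the learned embeddings and approximate nearest neighbor index a production system would more plausibly use.

```python
import math
import re
from collections import Counter
from typing import List

# Hypothetical document store; a real system would index far more text.
DOCUMENTS: List[str] = [
    "The cache stores LLM responses keyed by prompt.",
    "Guardrails validate model output before it reaches users.",
    "The retrieval index is refreshed nightly from the product wiki.",
]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term count (assumption, not a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: List[str], k: int = 1) -> str:
    """Augment the user query with retrieved context before calling the LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt("How often is the retrieval index refreshed?", DOCUMENTS))
```

Updating `DOCUMENTS` changes what the model sees without retraining it, which is the cost advantage the summary attributes to keeping a retrieval index fresh.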
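
The caching pattern in the list above avoids recomputing a response for an input that has already been served. One simple way to sketch it is an exact-match cache keyed on a hash of the normalized prompt. The `PromptCache` class and its normalization rule (lowercasing and collapsing whitespace) are assumptions for illustration; whether two prompts may safely share a cached response depends on the application, and the summarized article does not specify a keying scheme.

```python
import hashlib
from typing import Callable, Dict


class PromptCache:
    """Exact-match response cache wrapped around an LLM callable (hypothetical helper)."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Assumption: prompts differing only in case or spacing are equivalent.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def complete(self, prompt: str) -> str:
        """Serve from the cache when possible; otherwise call the LLM and store the result."""
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._llm(prompt)
        self._store[key] = response
        return response


if __name__ == "__main__":
    def slow_llm(prompt: str) -> str:
        # Stand-in for a slow, billable LLM request.
        return f"response to: {prompt}"

    cache = PromptCache(slow_llm)
    cache.complete("What is RAG?")
    cache.complete("  what is   rag? ")  # served from the cache
    print(f"hits={cache.hits} misses={cache.misses}")
```

Each cache hit is one fewer LLM request, which is where both the latency and the cost savings come from. A cache keyed on embedding similarity instead of an exact hash would catch more paraphrases at the risk of returning a response to a subtly different question.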
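
The guardrails pattern in the list above validates LLM output before it is used. The sketch below covers two of the checks the summary names, syntactic correctness and harmful content, and deliberately leaves out factuality, which needs more than string checks. The expected JSON shape (`answer` and `sources` keys) and the placeholder `BLOCKLIST` terms are hypothetical.

```python
import json
import re
from typing import List, Set

# Placeholder terms for illustration only; a real list would be curated.
BLOCKLIST: Set[str] = {"darn", "heck"}
# Assumption: the LLM was asked to reply with a JSON object holding these keys.
REQUIRED_KEYS: Set[str] = {"answer", "sources"}


def check_output(raw: str) -> List[str]:
    """Return a list of violations; an empty list means the output passed."""
    # Syntactic check: the output must parse as a JSON object.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems: List[str] = []

    # Structural check: all required keys must be present.
    missing = REQUIRED_KEYS - set(data)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # Content check: no blocklisted word may appear in the answer text.
    words = set(re.findall(r"[a-z']+", str(data.get("answer", "")).lower()))
    flagged = words & BLOCKLIST
    if flagged:
        problems.append(f"blocked terms: {sorted(flagged)}")

    return problems


if __name__ == "__main__":
    print(check_output('{"answer": "Paris", "sources": ["wiki"]}'))
    print(check_output('{"answer": "heck no"}'))
    print(check_output("Sure! Here is your answer."))
```

A caller could retry the request, fall back to a default response, or surface an error whenever `check_output` returns a non-empty list; which of these is appropriate is a product decision.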
Related tools:
- EleutherAI Eval: https://github.com/EleutherAI/lm-evaluation-harness
- AlpacaEval: https://github.com/tatsu-lab/alpaca_eval
- sentence-transformers: https://github.com/UKPLab/sentence-transformers
- FAISS: https://github.com/facebookresearch/faiss
- HNSW: https://github.com/nmslib/hnswlib
- ScaNN: https://github.com/google-research/google-research/tree/master/scann
- GPTCache: https://github.com/zilliztech/GPTCache
- Guardrails package: https://github.com/ShreyaR/guardrails
- NeMo-Guardrails: https://github.com/NVIDIA/NeMo-Guardrails
- Guidance: https://github.com/microsoft/guidance
- List of Dirty, Naughty, Obscene, and Otherwise Bad Words: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
- profanity detection: https://pypi.org/project/profanity-check/