Patterns for Building LLM-based Systems & Products

好的，这是对您提供的文本的分析和总结：

语言： 中文

关键字： LLM, 模式, 工程, 产品, 评估

概述： 本文深入探讨了将大型语言模型（LLM）集成到系统和产品中的实用模式。文章从评估、检索增强生成、微调、缓存、防护栏、防御性用户体验和收集用户反馈七个关键模式入手，详细阐述了如何提高LLM的性能、降低成本和风险，以及如何更好地服务于用户。此外，文章还讨论了数据飞轮、级联和监控等其他机器学习模式在LLM系统中的应用。作者强调，在构建基于LLM的系统和产品时，需要根据实际问题选择合适的模式，并不断收集用户反馈以优化模型。

分节阅读：

Evals: To measure performance（评估：衡量性能）
- 评估是用于评估模型在任务上的表现的一组测量方法，包括基准数据和指标。评估对于认真构建产品的团队至关重要，可以衡量系统的表现并检测任何退化。没有评估，我们将盲目飞行，或者每次更改都必须目视检查LLM输出。
Retrieval-Augmented Generation: To add knowledge（检索增强生成：添加知识）
- 检索增强生成（RAG）从基础模型外部获取相关数据，并使用此数据增强输入，提供更丰富的上下文以改善输出。RAG通过将模型建立在检索到的上下文中来帮助减少幻觉，从而提高事实性。此外，保持检索索引的最新状态比持续预训练LLM更便宜。
Fine-tuning: To get better at specific tasks（微调：更好地完成特定任务）
- 微调是指采用预训练模型（已经使用大量数据进行训练）并在特定任务上进一步改进它的过程。微调可以提高现成基础模型的性能，甚至可以超越第三方LLM。通过微调和托管我们自己的模型，我们可以确保数据不会离开我们的网络，并可以根据需要扩展吞吐量。
Caching: To reduce latency and cost（缓存：减少延迟和成本）
- 缓存是一种存储先前检索或计算的数据的技术，以便将来对相同数据的请求可以更快地得到服务。缓存可以显着减少先前已服务的响应的延迟。通过消除一次又一次计算相同输入的响应的需要，我们可以减少LLM请求的数量，从而节省成本。
Guardrails: To ensure output quality（防护栏：确保输出质量）
- 在LLM的上下文中，防护栏验证LLM的输出，确保输出不仅听起来不错，而且在语法上正确、真实且没有有害内容。防护栏有助于确保模型输出的可靠性和一致性，以便在生产中使用。防护栏还提供额外的安全层，并保持对LLM输出的质量控制。
Defensive UX: To anticipate & handle errors gracefully（防御性用户体验：预测并优雅地处理错误）
- 防御性用户体验是一种设计策略，它承认在用户与机器学习或基于LLM的产品交互期间可能会发生不好的事情，例如不准确或幻觉。防御性用户体验可以通过提供增加的可访问性、增加的信任和更好的用户体验来帮助缓解上述问题。通过设计系统和用户体验来处理模棱两可的情况和错误，防御性用户体验为更流畅、更愉快的用户体验铺平了道路。
Collect user feedback: To build our data flywheel（收集用户反馈：构建我们的数据飞轮）
- 收集用户反馈使我们能够了解他们的偏好。具体到LLM产品，用户反馈有助于构建评估、微调和防护栏。用户反馈有助于我们的模型改进，并使我们能够适应个人偏好。反馈循环还有助于我们评估系统的整体性能。

相关工具：

EleutherAI Eval: https://github.com/EleutherAI/lm-evaluation-harness
AlpacaEval: https://github.com/tatsu-lab/alpaca_eval
sentence-transformers: https://github.com/UKPLab/sentence-transformers
FAISS: https://github.com/facebookresearch/faiss
HNSW: https://github.com/nmslib/hnswlib
ScaNN: https://github.com/google-research/google-research/tree/master/scann
GPTCache: https://github.com/zilliztech/GPTCache
Guardrails package:https://github.com/ShreyaR/guardrails
NeMo-Guardrails: https://github.com/NVIDIA/NeMo-Guardrails
Guidance: https://github.com/microsoft/guidance
List of Dirty, Naughty, Obscene, and Otherwise Bad Words: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
profanity detection: https://pypi.org/project/profanity-check/

参考文献：

Measuring massive multitask language understanding: https://arxiv.org/abs/2009.03300
A Framework for Few-Shot Language Model Evaluation: https://github.com/EleutherAI/lm-evaluation-harness
Holistic evaluation of language models: https://arxiv.org/abs/2211.09110
AlpacaFarm: A Simulation Framework for Methods That Learn from Human Feedback: https://github.com/tatsu-lab/alpaca_eval
Bleu: a method for automatic evaluation of machine translation: https://dl.acm.org/doi/10.3115/1073083.1073135
Rouge: A package for automatic evaluation of summaries: https://aclanthology.org/W04-1013/
Bertscore: Evaluating text generation with bert: https://arxiv.org/abs/1904.09675
MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance: https://arxiv.org/abs/1909.02622
A survey of evaluation metrics used for NLG systems: https://arxiv.org/abs/2008.12009
Rogue Scores: https://aclanthology.org/2023.acl-long.107/
Gpteval: Nlg evaluation using gpt-4 with better human alignment: https://arxiv.org/abs/2303.16634
What’s going on with the Open LLM Leaderboard?: https://huggingface.co/blog/evaluating-mmlu-leaderboard#whats-going-on-with-the-open-llm-leaderboard
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena: https://arxiv.org/abs/2306.05685
Qlora: Efficient finetuning of quantized llms: https://arxiv.org/abs/2305.14314
MPT-7B and The Beginning of Context=Infinity: https://www.latent.space/p/mosaic-mpt-7b#details
The New Language Model Stack: https://www.sequoiacap.com/article/llm-stack-perspective/
Learning transferable visual models from natural language supervision: https://arxiv.org/abs/2103.00020
Search: Query Matching via Lexical, Graph, and Embedding Methods: https://eugeneyan.com/writing/search-query-matching/
How context affects language models’ factual predictions: https://arxiv.org/abs/2005.04611
Dense passage retrieval for open-domain question answering: https://arxiv.org/abs/2004.04906
Retrieval-augmented generation for knowledge-intensive nlp tasks: https://arxiv.org/abs/2005.11401
Leveraging passage retrieval with generative models for open domain question answering: https://arxiv.org/abs/2007.01282
Improving language models by retrieving from trillions of tokens: https://arxiv.org/abs/2112.04426
Internet-augmented language models through few-shot prompting for open-domain question answering: https://arxiv.org/abs/2203.05115
Codet5+: Open code large language models for code understanding and generation: https://arxiv.org/abs/2305.07922
Precise zero-shot dense retrieval without relevance labels: https://arxiv.org/abs/2212.10496
Obsidian-Copilot: An Assistant for Writing & Reflecting: https://eugeneyan.com/writing/obsidian-copilot/
Enriching word vectors with subword information: https://arxiv.org/abs/1607.04606
Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation: https://arxiv.org/abs/2004.09813
Text embeddings by weakly-supervised contrastive pre-training: https://arxiv.org/abs/2212.03533
One embedder, any task: Instruction-finetuned text embeddings: https://arxiv.org/abs/2212.09741
Billion-Scale Similarity Search with GPUs: https://arxiv.org/abs/1702.08734
Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs: https://arxiv.org/abs/1603.09320
Accelerating Large-Scale Inference with Anisotropic Vector Quantization: https://arxiv.org/abs/1908.10396
Training language models to follow instructions with human feedback: https://arxiv.org/abs/2203.02155
Universal language model fine-tuning for text classification: https://arxiv.org/abs/1801.06146
Bert: Pre-training of deep bidirectional transformers for language understanding: https://arxiv.org/abs/1810.04805
Improving language understanding with unsupervised learning: https://openai.com/research/language-unsupervised
Exploring the limits of transfer learning with a unified text-to-text transformer: https://arxiv.org/abs/1910.10683
The power of scale for parameter-efficient prompt tuning: https://arxiv.org/abs/2104.08691
Prefix-tuning: Optimizing continuous prompts for generation: https://arxiv.org/abs/2101.00190
Parameter-efficient transfer learning for NLP: https://arxiv.org/abs/1902.00751
Lora: Low-rank adaptation of large language models: https://arxiv.org/abs/2106.09685
Qlora: Efficient finetuning of quantized llms: https://arxiv.org/abs/2305.14314
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference: https://cims.nyu.edu/~sbowman/multinli/
Training a helpful and harmless assistant with reinforcement learning from human feedback: https://arxiv.org/abs/2204.05862
Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models: https://arxiv.org/abs/2303.08896
Guidelines for human-AI interaction: https://www.microsoft.com/en-us/research/publication/guidelines-for-human-ai-interaction/
People + AI Guidebook: https://pair.withgoogle.com/guidebook/
Human Interface Guidelines for Machine Learning: https://developer.apple.com/design/human-interface-guidelines/machine-learning
A Human Perspective on Algorithmic Similarity: https://slideslive.com/38934788/a-human-perspective-on-algorithmic-similarity?ref=folder-59726

原文链接： https://eugeneyan.com/writing/llm-patterns/

source: https://eugeneyan.com/writing/llm-patterns/