Keywords: LLMs, neural networks, transformer architecture, self-attention, embeddings
Overview: This article provides a comprehensive, self-contained explanation of how Large Language Models (LLMs) work, starting from basic mathematical principles (addition and multiplication). It aims to demystify the inner workings of LLMs and the Transformer architecture by stripping away jargon and representing concepts as numerical operations. The article covers a wide range of topics, from simple neural networks and training processes to more advanced concepts like embeddings, self-attention, and positional encoding, ultimately explaining the GPT and Transformer architectures. The goal is to enable a determined reader to theoretically recreate a modern LLM from the information provided.
Section-by-section reading (short illustrative code sketches for these sections follow the list):
- A simple neural network: A neural network classifies objects based on numerical inputs, such as RGB color values and volume. The network’s output is interpreted to represent different classes, like “leaf” or “flower,” based on which output neuron has a higher value. The network learns to associate specific weights with each input to produce the desired output, and the same network can be repurposed for different tasks by changing the input data and output interpretation.
- How are these models trained? Training a model involves starting with random weights and iteratively adjusting them to minimize the difference (loss) between the model’s output and the desired output from training data. This process, called gradient descent, involves calculating the gradient of the loss with respect to each parameter and nudging the parameter in the direction that reduces the loss. Training deep networks is complex due to issues like vanishing or exploding gradients, and requires significant computational resources.
- How does all this help generate language? Language generation is achieved by interpreting the numerical output of a neural network as characters in a sentence. By assigning a number to each character, the network can be trained to predict the next character in a sequence. This process can be repeated recursively to generate entire sentences, forming the basis of generative AI.
- What makes LLMs work so well? Several innovations contribute to the effectiveness of LLMs, including embeddings, which are trained numerical representations of input tokens. Subword tokenizers break words into smaller units, allowing the model to handle a larger vocabulary and understand relationships between similar words. Self-attention mechanisms allow the model to weigh the importance of different words in the input sequence when making predictions.
- Embeddings: Embeddings are trained numerical representations of inputs, where gradient descent is applied not only to the weights but also to the numbers representing the inputs. Instead of arbitrarily assigning numbers to characters, embeddings learn better numbers that allow us to train better networks. Each embedding is a vector of several numbers rather than a single number, so it can capture more of the richness of a concept.
- Sub-word tokenizers: Subword tokenizers break words down into smaller units and then embed those units. This makes it easier for the model to learn, for example, what an “s” following a familiar word means. A tokenizer takes your input text, splits it into tokens, and returns the corresponding numbers used to look up each token’s embedding vector in the embedding matrix.
- Self-attention: Self-attention combines the embedding vectors of the words in the input, but instead of adding them directly it applies a weight to each. In the article’s simplified formulation, the weights depend on the vector being added and on the last word available in the sentence, i.e., the word immediately preceding the one we are about to predict.
- Softmax: Softmax converts the outputs of the last layer into the (0, 1) range while preserving their order. It is a generalization of the logistic function. Given 10 arbitrary numbers, it returns 10 outputs, each between 0 and 1, that together sum to 1, so they can be interpreted as probabilities.
- Residual connections: A residual connection takes the output of a block (such as self-attention) and adds the block’s original input back to it before passing the result to the next block. As networks get deeper (more layers between input and output), they become increasingly hard to train, and residual connections have been shown to help with these training challenges.
- Layer Normalization: Layer normalization is a fairly simple layer that normalizes the data coming into it by subtracting the mean and dividing by the standard deviation. This stabilizes the input vector and helps with training deep networks. A layer norm layer also has a learned scale and a bias parameter.
- Dropout: Dropout is a simple but effective method to avoid overfitting. Inserting a dropout layer during training randomly deletes a certain percentage of the direct neuron connections between the layers where it is inserted, which forces the network to train with a lot of redundancy.
- Multi-head Attention: Multi-head attention runs several attention heads in parallel (they all take the same inputs), and their outputs are simply concatenated. Each head generates its own keys, queries, and values from the shared input.
- Positional encoding and embedding: A positional embedding is no different from any other embedding, except that instead of embedding the word vocabulary it embeds the positions 1, 2, 3, and so on. It is a matrix whose vectors have the same length as the word embedding vectors, with one column per position. A positional encoding is likewise a vector of the same length as the word embedding vector, but it is not an embedding in the sense that it is not trained; it is a fixed set of numbers.
- The GPT architecture: The GPT architecture is used in most GPT models (with variations across them). It is called a “transformer” because it is derived from, and is a type of, the transformer architecture. GPT is simply a particular arrangement of the blocks described above, with the next-token-prediction interpretation discussed in the article’s first part.
- The transformer architecture: Transformers not only improved prediction accuracy, they are also easier and more efficient to train than previous models, allowing for larger model sizes. The transformer was originally created for machine translation and consists of an “encoder” and a “decoder”, which are basically two separate blocks. The encoder takes the source sentence (e.g., German) and produces an intermediate representation (again, basically a bunch of numbers); the decoder then generates the translated sentence from that representation.
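A minimal sketch for “A simple neural network”: a single layer of made-up weights maps (R, G, B, volume) inputs to two output neurons read as “leaf” vs. “flower”. The specific weights and input values are illustrative assumptions, not values from the article.

```python
import numpy as np

# Hypothetical weights for a tiny classifier: 4 inputs (R, G, B, volume)
# feed 2 output neurons interpreted as "leaf" and "flower".
W = np.array([[ 0.05, -0.02],   # weight of R for each output neuron
              [ 0.10, -0.04],   # weight of G
              [-0.03,  0.06],   # weight of B
              [ 0.02, -0.01]])  # weight of volume
b = np.array([0.0, 0.0])

def classify(features):
    outputs = features @ W + b                          # one number per output neuron
    return ["leaf", "flower"][int(np.argmax(outputs))]  # the higher output wins

print(classify(np.array([32, 107, 56, 11.2])))  # greenish and small -> "leaf"
```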
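For “How are these models trained?”: a toy gradient-descent loop on a single parameter with a squared-error loss, using a numerical gradient. Real training differentiates millions of parameters analytically, so treat this only as the shape of the idea.

```python
# Toy gradient descent: one parameter w, squared-error loss, numerical gradient.
def loss(w, x, target):
    return (w * x - target) ** 2

w, x, target = 0.1, 2.0, 10.0      # start from an arbitrary ("random") weight
lr, eps = 0.01, 1e-6
for step in range(200):
    grad = (loss(w + eps, x, target) - loss(w - eps, x, target)) / (2 * eps)
    w -= lr * grad                 # move against the gradient to reduce the loss
print(w, loss(w, x, target))       # w approaches 5.0, loss approaches 0
```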
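For “How does all this help generate language?”: the recursive generate-one-character-then-feed-it-back loop. A tiny bigram frequency table stands in for the neural network here, an assumption made only to keep the sketch self-contained.

```python
from collections import Counter, defaultdict

# A bigram table stands in for the trained network; the generation loop is the
# same idea: predict the next character, append it, and feed the text back in.
corpus = "humpty dumpty sat on a wall humpty dumpty had a great fall"
counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def predict_next(ch):
    return counts[ch].most_common(1)[0][0]   # most frequent follower of ch

text = "h"
for _ in range(20):                          # generate 20 characters recursively
    text += predict_next(text[-1])
print(text)
```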
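For “Embeddings”: an embedding matrix as an ordinary trainable array, looked up by token index. The vocabulary, sizes, and the commented-out update rule are hypothetical.

```python
import numpy as np

# An embedding matrix is just a trainable array with one row per token.
vocab = ["humpty", "dumpty", "sat", "on", "a", "wall"]   # hypothetical tiny vocabulary
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))      # 4 numbers per token

token_id = vocab.index("wall")
vector = embedding_matrix[token_id]      # look up the embedding vector for "wall"
print(vector)

# During training, gradient descent would update this row like any other weight:
# embedding_matrix[token_id] -= learning_rate * gradient_of_loss_wrt_vector
```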
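For “Sub-word tokenizers”: a toy greedy longest-match tokenizer over a hand-picked vocabulary. Real subword tokenizers such as BPE learn their vocabulary from data; this only shows the split-and-look-up step.

```python
# Toy greedy longest-match tokenizer over a hand-picked vocabulary.
vocab = {"cat": 0, "dog": 1, "s": 2, "run": 3, "ning": 4}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens, [vocab[t] for t in tokens]

print(tokenize("cats"))      # (['cat', 's'], [0, 2]); ids index into the embedding matrix
print(tokenize("running"))   # (['run', 'ning'], [3, 4])
```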
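For “Self-attention”: a single-query attention step in which the last word’s vector decides how heavily each word’s value vector is weighted in the sum. The projection matrices are random stand-ins for trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 4
embeddings = rng.normal(size=(5, d))        # 5 words in the sentence so far, 4 numbers each
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # random stand-ins

q = embeddings[-1] @ Wq                     # query from the last available word
K = embeddings @ Wk                         # one key per word
V = embeddings @ Wv                         # one value per word

weights = softmax(q @ K.T / np.sqrt(d))     # one weight per word, summing to 1
attended = weights @ V                      # weighted sum used to predict the next word
print(weights, attended)
```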
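For “Softmax”: the function itself plus a small usage example showing that the outputs stay in order, lie in (0, 1), and sum to 1.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -3.0])
probs = softmax(scores)
print(probs, probs.sum())          # roughly [0.66 0.24 0.10 0.004], summing to 1.0
```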
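For “Residual connections”: the input is added back to a block’s output. The block here is a stand-in with made-up weights.

```python
import numpy as np

def block(x):
    # stand-in for a self-attention or feed-forward block, with made-up weights
    W = np.full((4, 4), 0.1)
    return np.tanh(x @ W)

x = np.array([1.0, 2.0, -1.0, 0.5])
out = x + block(x)                 # the residual ("skip") connection
print(out)
```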
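For “Layer Normalization”: normalize to zero mean and unit standard deviation, then apply a scale and bias (identity values here for clarity; in a real model they are trained).

```python
import numpy as np

def layer_norm(x, scale, bias, eps=1e-5):
    return scale * (x - x.mean()) / (x.std() + eps) + bias

x = np.array([2.0, 4.0, 6.0, 8.0])
print(layer_norm(x, scale=np.ones(4), bias=np.zeros(4)))   # mean ~0, std ~1
```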
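For “Dropout”: randomly zero a fraction of values during training and rescale the rest. The rescaling convention shown ("inverted dropout") is one common choice, assumed here.

```python
import numpy as np

def dropout(x, p, rng):
    mask = rng.random(x.shape) >= p      # keep each value with probability 1 - p
    return x * mask / (1.0 - p)          # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
print(dropout(np.ones(10), p=0.3, rng=rng))   # roughly 30% of entries become 0
# At inference time dropout is switched off and the input passes through unchanged.
```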
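For “Multi-head Attention”: several heads over the same input, each with its own (random, stand-in) projection matrices, with the head outputs concatenated at the end.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))                      # 5 tokens, 8-dimensional embeddings
heads = []
for _ in range(4):                               # 4 heads, each producing 2 dimensions
    Wq, Wk, Wv = (rng.normal(size=(8, 2)) for _ in range(3))   # per-head projections
    heads.append(attention_head(X, Wq, Wk, Wv))
print(np.concatenate(heads, axis=-1).shape)      # (5, 8): concatenated head outputs
```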
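For “Positional encoding and embedding”: a second embedding matrix indexed by position and added to the word embeddings. All matrices are random stand-ins for trained values.

```python
import numpy as np

rng = np.random.default_rng(3)
d, max_len, vocab_size = 4, 16, 100
word_emb = rng.normal(size=(vocab_size, d))   # trained in a real model
pos_emb = rng.normal(size=(max_len, d))       # also trained (unlike a fixed positional encoding)

token_ids = [7, 42, 7]                        # the same token appears at positions 0 and 2
X = word_emb[token_ids] + pos_emb[np.arange(len(token_ids))]
print(X.shape)   # (3, 4); rows 0 and 2 now differ because their positions differ
```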
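For “The GPT architecture”: one GPT-style block wiring layer norm, masked self-attention, residual additions, and a feed-forward layer together. The pre-norm ordering and ReLU activation are simplifying assumptions, and all matrices are random stand-ins.

```python
import numpy as np

# One GPT-style block: layer norm -> masked self-attention -> residual add ->
# layer norm -> feed-forward -> residual add. A full model stacks many such
# blocks between the embedding lookup and the final softmax over the vocabulary.
rng = np.random.default_rng(4)
T, d = 5, 8                                       # 5 tokens, 8-dimensional embeddings

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def masked_self_attention(x):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores += np.triu(np.full((T, T), -1e9), k=1)  # each token sees only earlier tokens
    return softmax(scores) @ V

def feed_forward(x):
    W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
    return np.maximum(0, x @ W1) @ W2              # linear -> ReLU -> linear

def gpt_block(x):
    x = x + masked_self_attention(layer_norm(x))   # residual around attention
    x = x + feed_forward(layer_norm(x))            # residual around the feed-forward layer
    return x

x = rng.normal(size=(T, d))                        # token + positional embeddings
print(gpt_block(gpt_block(x)).shape)               # (5, 8): ready for the next block
```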
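For “The transformer architecture”: a very compressed encoder-decoder sketch in which cross-attention lets the decoder look at the encoder’s output vectors. Feed-forward layers, residuals, layer norm, and masking are omitted, and all matrices are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(queries_from, keys_values_from):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # random stand-ins
    Q, K, V = queries_from @ Wq, keys_values_from @ Wk, keys_values_from @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

source = rng.normal(size=(6, d))          # embeddings of a 6-token source (e.g. German) sentence
target_so_far = rng.normal(size=(3, d))   # embeddings of the target tokens produced so far

encoded = attention(source, source)                  # encoder: self-attention over the source
decoded = attention(target_so_far, target_so_far)    # decoder: self-attention over the target prefix
decoded = attention(decoded, encoded)                # decoder: cross-attention into the encoder output
print(decoded.shape)                                 # (3, 8): used to predict the next target token
```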
Related Tools:
References:
- Vaswani et al. (2017), Attention is all you need
Original Article Link: https://towardsdatascience.com/understanding-llms-from-scratch-using-middle-school-math-e602d27ec876/