
# Large Language Models — Structured Notes

## What is an LLM?

- A mathematical function that takes text as input and returns a probability distribution over possible next tokens
- Does not produce words directly; produces tokens

## Making an LLM: Overview

Two stages:

1. **Pretraining**
2. **Post-training**

## Pretraining

### 1. Data Collection

Three goals:

- **Quantity** — large volume so the model learns language rules
- **Diversity** — broad knowledge coverage
- **Quality** — no harmful content, no PII, no noise, no duplicates

Pipeline steps (see [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)):

- URL filtering
- Text extraction
- Cleaning / noise removal
- Deduplication
- PII removal

### 2. Tokenizer Training

#### What is a Token?

- The single unit of text an LLM produces per pass
- Not a word, not a character — a subword unit
- LLM providers charge per token (see [OpenRouter](https://openrouter.ai/models))

**Why not words?**

- Over 1 million English words; can't enumerate them all
- Can't handle novel/coined words

**Why not characters?**

- Some next characters are trivially obvious (e.g. "tokenizatio" → "n"); wasteful to run a full forward pass for each
- Characters carry no meaning individually; tokens do ("token", "ization" have meaning; "t", "o", "k" don't)
- Would make inference far more expensive

**Tokenization demo:** [Tiktokenizer](https://tiktokenizer.vercel.app/?model=gpt2)

**Famous consequence:** LLMs never see "strawberry" — they see "st", "raw", "berry" (as token IDs). This explains apparent letter-counting failures.

#### BPE (Byte Pair Encoding)

- Algorithm used by GPT
- Start with every character as its own token
- Repeatedly merge the most frequent adjacent pair into a new token
- After thousands of merges: common words become single tokens; rare/novel words split into known sub-tokens
- Trained independently, before the LLM itself
- [BPE demo](file:///D:/pro/0.%20Ongoing%20Projects/gpt-presentation/bpe_demo.html)

### 3. Embeddings

#### What are Embeddings?

- Each token ID is mapped to a high-dimensional vector
- GPT-2: 768 to 1,600 dimensions depending on model size; GPT-3 (175B): 12,288 dimensions; Qwen3-8B: 4,096 dimensions
- [TensorFlow Embedding Projector](https://projector.tensorflow.org/)

**Why vectors and not token IDs?**

- Token IDs are arbitrary labels, so arithmetic on them is meaningless to a neural network
- Vectors allow encoding and deriving *meaning* through arithmetic

**Examples:**

```
king - man + woman ≈ queen
father - mother ≈ man - woman
Hitler + Italy - Germany ≈ Mussolini
```

**Why high dimensionality?**

- Each dimension encodes some latent property (e.g. "Italian-ness", "dictator-ness")
- No human can pre-program these; they emerge from training

#### Polysemy Problem

- Many tokens have multiple meanings (e.g. "bank", "mole")
- Solution: embeddings are not static — the transformer updates them using surrounding context

#### Similarity Measurement

- Similarity between two embedding vectors = **dot product** (equal to cosine similarity when the vectors are normalised to unit length)
- Dot product = 0 → orthogonal, no correlation
- Dot product > 0 → similar direction
- Dot product < 0 → opposite direction
- [Semantle demo](https://semantle.com/)

### 4. Transformer Architecture

**Origin paper:** [Attention Is All You Need (2017)](https://arxiv.org/pdf/1706.03762)

- Originally designed for machine translation
- Now powers nearly all text-based AI
- "T" in GPT = Transformer

**Visualizer:** [Transformer Explainer](https://poloclub.github.io/transformer-explainer/)

#### The Core Problem

> "The cat sat on the mat because **it** was tired."

- "it" cannot be resolved by position alone
- The model must attend to other tokens to update the meaning of "it"

#### Attention Mechanism

Three vectors per token (derived by multiplying the token embedding by trained weight matrices):

- **Query (Q):** "What am I looking for?"
- **Key (K):** "What do I contain?"
- **Value (V):** "What information do I give if matched?"

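The three projections above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's code: the sizes `seq_len`, `d_model` and `d_k`, the helper names `matmul` and `random_matrix`, and the random stand-ins for the trained weight matrices are all assumptions made for the example.

```python
import random

random.seed(0)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols):
    """Hypothetical helper: a rows x cols matrix of Gaussian noise."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen only for illustration.
seq_len, d_model, d_k = 5, 8, 4

# Stand-in token embeddings: one row per token in the sequence.
X = random_matrix(seq_len, d_model)

# Stand-ins for the trained weight matrices (random here, learned in a real model).
W_q = random_matrix(d_model, d_k)
W_k = random_matrix(d_model, d_k)
W_v = random_matrix(d_model, d_k)

Q = matmul(X, W_q)  # "What am I looking for?"
K = matmul(X, W_k)  # "What do I contain?"
V = matmul(X, W_v)  # "What information do I give if matched?"

# Every token now has its own query, key and value vector of length d_k.
print(len(Q), len(Q[0]))
```

Each token gets its own row in `Q`, `K` and `V`; only the weight matrices are shared across positions.
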
**Intuition:** `k[q] = v` — attention is a *soft* dictionary lookup

**Formula:**

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

- $QK^T$: dot product of each query with each key (measures match)
- $\sqrt{d_k}$: normalisation — prevents the dot product from inflating with more dimensions
- $M$: mask — set to $-\infty$ for future tokens, so softmax zeroes them out
- Result is multiplied by V to extract the matched value

**Masked Self-Attention:**

- "Masked" = future tokens are hidden during training
- "Self" = the sentence attends to itself

[Attention formula demo](file:///D:/pro/0.%20Ongoing%20Projects/gpt-presentation/transformer_attention.html)

#### Multi-Head Attention

- Attention is run many times in parallel (multiple "heads")
- Outputs are combined, then passed through layer normalisation and a feedforward neural network

#### Full Forward Pass Summary

```
Input tokens
→ Token embeddings + Positional embeddings
→ Q, K, V vectors (via weight matrices)
→ QKᵀ → mask → softmax → × V
→ Feedforward neural network
→ Logits vector
→ Softmax → probability distribution over next token
```

[LLM visualisation (BBycroft)](https://bbycroft.net/llm)

### 5. Training Objective: Cross-Entropy Loss

[Cross-entropy explainer](file:///D:/pro/0.%20Ongoing%20Projects/gpt-presentation/cross_entropy_explainer.html)

- At the start, all weights are random → outputs are random
- **Loss function:** Cross-entropy

$$L = -\log(p_{\text{correct token}})$$

- Low probability assigned to correct token → high loss
- Model updates weights via backpropagation to reduce loss

**Perplexity:**

- Interpretability wrapper around cross-entropy: perplexity is $e^{L}$ with $L$ averaged over tokens, so a perplexity of 10 means the model is about as uncertain as if it were choosing uniformly among 10 tokens
- [Perplexity demo](https://perplexity.vercel.app/)

## Post-Training

### Base Model Problem

- The pretrained model is a **base model / foundation model**
- It only predicts next tokens
- Ask "What is the capital of France?" → it may continue with more quiz questions instead of giving an answer
- It is a **stochastic parrot**: repeats patterns from training data without following instructions

### Stage 1: Supervised Fine-Tuning (SFT)

- Same cross-entropy loss, different data
- Data = (prompt, ideal response) pairs
- Teaches the model what assistant-style output *looks like*

**Limitation:** SFT has no concept of "wrong"

- It only increases probability of the demonstration response
- Does not decrease probability of harmful completions
- Rephrasing a prompt can bypass SFT refusals entirely

### Stage 2: Preference Fine-Tuning (RLHF / DPO)

**Why needed:**

- SFT alone does not align the model's behaviour robustly
- Example of unaligned model: Microsoft Bing "Sydney" — [LessWrong post](https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned)

**Data collection steps:**

1. Create a set of prompts
2. Run SFT model to get multiple outputs per prompt
3. Have human annotators *rank* the outputs
4. Store the ranked preferences as a dataset

**Training methods:**

- **RLHF** (Reinforcement Learning from Human Feedback) — e.g. PPO
- **DPO** (Direct Preference Optimization) — and variants, e.g. IPO
- Both can be combined with **PEFT / LoRA** for efficiency

**Key difference from SFT:**

- Preference training with negative examples *directly decreases* probability of harmful completions across the reward landscape, not just at a single prompt

### Stage 3: Reasoning Fine-Tuning

- "Think step by step" prompting was used to improve results
- LLM providers have now baked this in automatically via fine-tuning
- Model is trained on step-by-step reasoning from textbooks
- Reinforcement learning is used to reinforce correct reasoning chains
- This is what produces "thinking" / reasoning models (vs instruct models)

> Arguing LLMs can't "reason" is like arguing submarines can't "swim."

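To make the Stage 2 point about negative examples concrete, here is a minimal sketch of the DPO loss for a single (chosen, rejected) pair. The frozen reference model and the `beta` parameter come from the standard DPO formulation rather than from these notes, and the log-probabilities are made-up numbers, not outputs of any real model.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability a model assigns to a full
    response: `policy_*` from the model being trained, `ref_*` from the
    frozen SFT reference model.
    """
    # How much more the policy favours each response than the reference does.
    chosen_shift = policy_chosen - ref_chosen
    rejected_shift = policy_rejected - ref_rejected
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)), written as log(1 + exp(-z)).
    return math.log1p(math.exp(-z))

# Toy log-probabilities (illustrative values only).
before = dpo_loss(-12.0, -12.0, -12.0, -12.0)          # policy identical to reference
after = dpo_loss(-11.0, -15.0, -12.0, -12.0)           # chosen up, rejected down
only_rejected_down = dpo_loss(-12.0, -15.0, -12.0, -12.0)

print(before, after, only_rejected_down)
```

The loss falls even when only the rejected response becomes less likely, which is the behaviour SFT's cross-entropy on demonstrations cannot express.
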
## Evaluation (Evals)

- [ARC Prize Leaderboard](https://arcprize.org/leaderboard) — reasoning benchmark; humans ~100%, best models ~84%
- [METR](https://metr.org/) — coding/agent benchmark (somewhat controversial)
- [Chatbot Arena](https://arena.ai/leaderboard/search) — human preference-based ranking (no fixed test set)

## Quantization

**Problem:** Running large models requires storing all parameters in VRAM

**Example calculation:**

- Qwen3-8B: 8 billion parameters × 16 bits (BF16) = **16 GB** required
- A typical laptop GPU has 6 GB VRAM

**Solution — Quantization:**

- Round each 16-bit weight value down to fewer bits
- Common: **4-bit quantization** → ~4 GB for an 8B model

**GGUF format** — used by [llama.cpp](https://github.com/ggerganov/llama.cpp)

**Q4_K_M explained:**

| Component | Meaning |
|---|---|
| `Q4` | 4 bits per weight |
| `_K` | K-quants: different layers get different precision (attention layers kept higher) |
| `_M` | Medium quality tier (balanced size vs quality) |

Other tiers: `_S` (small/aggressive), `_L` (large/higher precision)

**Reference:** [Quantization explainer](https://www.maartengrootendorst.com/blog/quantization/) | [Bartowski's quantized models](https://huggingface.co/bartowski/Qwen_Qwen3-8B-GGUF)

## Running Models Locally

- [Ollama](https://ollama.com/) or llama.cpp to run GGUF models
- [Hugging Face](https://huggingface.co/) — repository for open-source models and datasets

## Full Training Pipeline Summary

1. Collect data
2. Train tokenizer (BPE)
3. Train base model (cross-entropy on raw text)
4. Train SFT model (cross-entropy on demonstration data)
5. Train aligned model (RLHF / DPO with human preference data)
6. Optional: fine-tune for specific use cases
7. Quantize for deployment

## Resources for Building Your Own Model

| Resource | Purpose |
|---|---|
| [Neural Networks: Zero to Hero — Karpathy](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) | Full from-scratch training course |
| [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) | GPT-2 level model, trains in ~2 min |
| [Unsloth](https://github.com/unslothai/unsloth) | Efficient LoRA fine-tuning of existing models |
| [Hugging Face Datasets](https://huggingface.co/datasets) | Fine-tuning datasets |
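
As a worked version of the VRAM arithmetic in the Quantization section, the sketch below estimates weight storage only. The function name `weight_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and the estimate ignores activations, the KV cache, and the extra bits that mixed-precision schemes such as K-quants spend on some tensors.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory needed to store the weights alone, in GB (10^9 bytes)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1e9

n_params = 8e9  # an 8B-parameter model such as Qwen3-8B

bf16_gb = weight_memory_gb(n_params, 16)  # unquantized BF16 weights
q4_gb = weight_memory_gb(n_params, 4)     # idealised 4-bit quantization

print(f"BF16: {bf16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

Real GGUF files come out somewhat larger than the idealised 4-bit figure, so treat the result as a lower bound when checking whether a model fits on a given GPU.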
