# Large Language Models — Structured Notes
## What is an LLM?
- A mathematical function that takes text as input and returns a probability distribution over possible next tokens
- Does not produce words directly; produces tokens
## Making an LLM: Overview
Two stages:
1. **Pretraining**
2. **Post-training**
## Pretraining
### 1. Data Collection
Three goals:
- **Quantity** — large volume so the model learns language rules
- **Diversity** — broad knowledge coverage
- **Quality** — no harmful content, no PII, no noise, no duplicates
Pipeline steps (see [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)):
- URL filtering
- Text extraction
- Cleaning / noise removal
- Deduplication
- PII removal
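The pipeline steps above can be sketched in a few lines of Python. This is a toy illustration, not how FineWeb is implemented: the blocklist, the email pattern, and the function name `clean_corpus` are assumptions made up for the example, and text extraction is assumed to have already happened.

```python
import hashlib
import re

# Assumptions for illustration only: a one-entry blocklist and a simple email pattern.
BLOCKED_DOMAINS = {"spam.example"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean_corpus(docs):
    """Take (url, text) pairs and return cleaned, de-duplicated texts."""
    seen = set()
    kept = []
    for url, text in docs:
        # URL filtering: drop documents from blocklisted domains.
        domain = url.split("/")[2] if "://" in url else url
        if domain in BLOCKED_DOMAINS:
            continue
        # Cleaning: collapse runs of whitespace left over from extraction.
        text = " ".join(text.split())
        if not text:
            continue
        # Deduplication: exact match on a hash of the lower-cased text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # PII removal: mask email addresses.
        kept.append(EMAIL_RE.sub("<EMAIL>", text))
    return kept
```

Production pipelines typically use fuzzy deduplication and much broader PII detection; exact hashing and a single regex are used here only to keep the sketch short.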
### 2. Tokenizer Training
#### What is a Token?
- The single unit of text an LLM produces per pass
- Not a word, not a character — a subword unit
- LLM API providers bill per token (see [OpenRouter](https://openrouter.ai/models))
**Why not words?**
- Over 1 million English words; can't enumerate them all
- Can't handle novel/coined words
**Why not characters?**
- Some next characters are trivially obvious (e.g. "tokenizatio" → "n"); wasteful to run full forward pass
- Characters carry no meaning individually; tokens do ("token", "ization" have meaning; "t", "o", "k" don't)
- Would make inference far more expensive
**Tokenization demo:** [Tiktokenizer](https://tiktokenizer.vercel.app/?model=gpt2)
**Famous consequence:** LLMs never see the letters of "strawberry". They see subword tokens such as "st", "raw", "berry" (as token IDs; the exact split depends on the tokenizer). This explains apparent failures at counting letters.
#### BPE (Byte Pair Encoding)
- Tokenization algorithm used by the GPT models (in a byte-level variant)
- Start with every character as its own token
- Repeatedly merge the most frequent adjacent pair into a new token
- After thousands of merges: common words become single tokens; rare/novel words split into known sub-tokens
- Trained independently, before the LLM itself
- [BPE demo](file:///D:/pro/0.%20Ongoing%20Projects/gpt-presentation/bpe_demo.html)
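A minimal sketch of the merge loop described above, working on characters rather than bytes. The function name `train_bpe` and the early stop when no pair repeats are choices made for this example, not details taken from any GPT tokenizer.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text`, starting from characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append((left, right))
        # Rewrite the token list, replacing each occurrence of the pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(tokens)
```

The learned `merges` list is the tokenizer: encoding new text means applying the same merges in the same order.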
### 3. Embeddings
#### What are Embeddings?
- Each token ID is mapped to a high-dimensional vector
- GPT-2: 768 (small) to 1,600 (XL) dimensions; GPT-3 (175B): 12,288 dimensions; Qwen3-8B: 4,096 dimensions
- [TensorFlow Embedding Projector](https://projector.tensorflow.org/)
**Why vectors and not token IDs?**
- Token IDs are arbitrary integer labels, so arithmetic on them carries no meaning
- Vectors allow encoding and deriving *meaning* through arithmetic
**Examples:**
```
king - man + woman ≈ queen
father - mother ≈ man - woman
Hitler + Italy - Germany ≈ Mussolini
```
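To make the arithmetic concrete, here is a toy sketch with hand-picked 2-D vectors. The vectors, the two "dimensions" (royalty and maleness), and the helper names are all invented for illustration; real embeddings are learned and far larger.

```python
# Toy embeddings, hand-picked for illustration: [royalty, maleness].
toy_embeddings = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def nearest(vector, exclude=()):
    """Return the word whose embedding is closest (squared Euclidean) to `vector`."""
    def distance(word):
        return sum((a - b) ** 2 for a, b in zip(toy_embeddings[word], vector))
    return min((w for w in toy_embeddings if w not in exclude), key=distance)


def analogy(a, b, c):
    """Compute a - b + c and return the nearest word other than the three inputs."""
    target = [
        x - y + z
        for x, y, z in zip(toy_embeddings[a], toy_embeddings[b], toy_embeddings[c])
    ]
    return nearest(target, exclude={a, b, c})


print(analogy("king", "man", "woman"))  # queen
```

With learned embeddings the result is only approximately equal to the target word's vector, which is why the examples above use ≈.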
**Why high dimensionality?**
- Each dimension encodes some latent property (e.g. "Italian-ness", "dictator-ness")
- No human can pre-program these; they emerge from training
#### Polysemy Problem
- Many tokens have multiple meanings (e.g. "bank", "mole")
- Solution: embeddings are not static — the transformer updates them using surrounding context
#### Similarity Measurement
- Similarity between two embedding vectors = **dot product**
- Dot product = 0 → orthogonal (unrelated)
- Dot product > 0 → similar direction
- Dot product < 0 → opposite
- [Semantle demo](https://semantle.com/)
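The three cases above can be checked with a few lines of Python. The vectors are made up for the example. `cosine_similarity` is an addition not mentioned in the notes: it is the dot product divided by both vector lengths, which is often used when only direction should matter.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product normalised by both lengths; always between -1 and 1."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


v = [1.0, 2.0]
print(dot(v, [2.0, 4.0]))    # 10.0 -> positive: similar direction
print(dot(v, [2.0, -1.0]))   # 0.0  -> orthogonal
print(dot(v, [-1.0, -2.0]))  # -5.0 -> negative: opposite direction
```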
### 4. Transformer Architecture
**Origin paper:** [Attention Is All You Need (2017)](https://arxiv.org/pdf/1706.03762)
- Originally designed for machine translation
- Now powers nearly all text-based AI
- "T" in GPT = Transformer
**Visualizer:** [Transformer Explainer](https://poloclub.github.io/transformer-explainer/)
#### The Core Problem
> "The cat sat on the mat because **it** was tired."
- "it" cannot be resolved by position alone
- The model must attend to other tokens to update the meaning of "it"
#### Attention Mechanism
Three vectors per token (derived by multiplying the token embedding by trained weight matrices):
- **Query (Q):** "What am I looking for?"
- **Key (K):** "What do I contain?"
- **Value (V):** "What information do I give if matched?"
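**Intuition:** attention is a *soft* dictionary lookup. An ordinary lookup `d[key]` returns one value for an exact key match; attention compares the query against every key and returns a blend of all values, weighted by how well each key matches.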
**Formula:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$
- $QK^T$: dot product of every query with every key (measures how well they match)
- $\sqrt{d_k}$: scaling by the key dimension $d_k$, which keeps dot products from growing as the number of dimensions increases
- $M$: mask, set to $-\infty$ for future tokens (and 0 elsewhere), so softmax gives them zero weight
- The softmax weights are multiplied by V, giving a weighted average of the values
**Masked Self-Attention:**
- "Masked" = each token can attend only to itself and earlier tokens; future tokens are hidden
- "Self" = Q, K, and V all come from the same sequence, so the sentence attends to itself
[Attention formula demo](file:///D:/pro/0.%20Ongoing%20Projects/gpt-presentation/transformer_attention.html)
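A pure-Python sketch of the formula above for a single head, written with lists instead of a tensor library so each step is visible. The function names are chosen for this example. It assumes Q, K, and V have already been computed (one row per token) and that the causal mask is the only mask applied.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(Q, K, V):
    """Return (outputs, weights) for causal self-attention over per-token rows."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                # Mask M: token i may not look at future token j.
                scores.append(-math.inf)
            else:
                # Scaled dot product q . k / sqrt(d_k).
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value rows.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = masked_attention(X, X, X)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only see itself
```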
#### Multi-Head Attention
- Attention is run many times in parallel (multiple "heads")
- Outputs are combined, then passed through layer normalisation and a feedforward neural network
#### Full Forward Pass Summary
```
Input tokens
→ Token embeddings + Positional embeddings
→ Q, K, V vectors (via weight matrices)
→ QKᵀ → mask → softmax → × V
→ Feedforward neural network
→ Logits vector
→ Softmax → probability distribution over next token
```
[LLM visualisation (BBycroft)](https://bbycroft.net/llm)
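The last two steps of the summary (logits → softmax → next-token distribution) can be sketched as follows. The three-word vocabulary and the logit values are made up for the example; a real model has tens of thousands of tokens.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(vocab, logits, rng):
    """Sample one token from the distribution defined by the logits."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]


vocab = ["Paris", "London", "banana"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.1]               # hypothetical model output
print(softmax(logits))
print(sample_next_token(vocab, logits, random.Random(0)))
```

Because the next token is sampled rather than always taken as the most likely one, the same prompt can produce different continuations.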
### 5. Training Objective: Cross-Entropy Loss
[Cross-entropy explainer](file:///D:/pro/0.%20Ongoing%20Projects/gpt-presentation/cross_entropy_explainer.html)
- At the start, all weights are random → outputs are random
- **Loss function:** Cross-entropy
$$L = -\log(p_{\text{correct token}})$$
- Low probability assigned to correct token → high loss
- Model updates weights via backpropagation to reduce loss
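A small worked example of the loss for a single prediction. The probability values are invented; `cross_entropy` is just the formula above applied to one token.

```python
import math


def cross_entropy(probs, correct_index):
    """Loss for one prediction: -log of the probability given to the correct token."""
    return -math.log(probs[correct_index])


confident = cross_entropy([0.05, 0.90, 0.05], correct_index=1)
unsure = cross_entropy([0.45, 0.10, 0.45], correct_index=1)
print(round(confident, 3))  # 0.105
print(round(unsure, 3))     # 2.303
```

During training, this value is averaged over every token position in a batch before backpropagation.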
**Perplexity:**
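- The exponential of the average cross-entropy loss; roughly, how many tokens the model is effectively choosing between at each step (lower is better)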
- [Perplexity demo](https://perplexity.vercel.app/)
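Perplexity can be computed directly from the probabilities a model assigned to the correct tokens. The function name and input values below are made up for illustration.

```python
import math


def perplexity(correct_token_probs):
    """exp of the mean cross-entropy over a sequence of predictions."""
    losses = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(losses) / len(losses))


# A model that always gives the correct token probability 0.25 is as
# uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```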
## Post-Training
### Base Model Problem
- The pretrained model is a **base model / foundation model**
- It only predicts next tokens
- Ask "What is the capital of France?" → it may continue with more quiz questions rather than an answer
- It is a **stochastic parrot**: repeats patterns from training data without following instructions
### Stage 1: Supervised Fine-Tuning (SFT)
- Same cross-entropy loss, different data
- Data = (prompt, ideal response) pairs
- Teaches the model what assistant-style output *looks like*
**Limitation:** SFT has no concept of "wrong"
- It only increases probability of the demonstration response
- Does not decrease probability of harmful completions
- Rephrasing a prompt can bypass SFT refusals entirely
### Stage 2: Preference Fine-Tuning (RLHF / DPO)
**Why needed:**
- SFT alone does not align the model's behaviour robustly
- Example of unaligned model: Microsoft Bing "Sydney" — [LessWrong post](https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned)
**Data collection steps:**
1. Create a set of prompts
2. Run SFT model to get multiple outputs per prompt
3. Have human annotators *rank* the outputs
4. Store the ranked preferences as a dataset
**Training methods:**
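- **RLHF** (Reinforcement Learning from Human Feedback): trains a reward model on the rankings, then optimises the LLM against it with an RL algorithm such as PPO
- **DPO** (Direct Preference Optimization): optimises the LLM directly on preference pairs, with no separate reward model; variants include IPO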
- Both can be combined with **PEFT / LoRA** for efficiency
**Key difference from SFT:**
- Preference training uses rejected responses as negative examples, so it *directly decreases* the probability of dispreferred (e.g. harmful) completions instead of only increasing the probability of demonstrations
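To show how a rejected response enters the loss, here is a sketch of the standard DPO objective for one preference pair. The function name, the default `beta=0.1`, and the log-probability values are assumptions for this example; the reference model is assumed to be the frozen SFT model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy raises the chosen response and lowers the rejected one: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss falls both when the chosen response becomes more likely and when the rejected one becomes less likely, which is the behaviour SFT alone cannot provide.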
### Stage 3: Reasoning Fine-Tuning
- "Think step by step" prompting used to improve results
- LLM providers have now baked this in automatically via fine-tuning
- Model is trained on step-by-step reasoning from textbooks
- Reinforcement learning is used to reinforce correct reasoning chains
- This is what produces "thinking" / reasoning models (vs instruct models)
> Arguing LLMs can't "reason" is like arguing submarines can't "swim."
## Evaluation (Evals)
- [ARC Prize Leaderboard](https://arcprize.org/leaderboard) — reasoning benchmark; humans ~100%, best models ~84%
- [METR](https://metr.org/) — coding/agent benchmark (somewhat controversial)
- [Chatbot Arena](https://arena.ai/leaderboard/search) — human preference-based ranking (no fixed test set)
## Quantization
**Problem:** Running large models requires storing all parameters in VRAM
**Example calculation:**
- Qwen3-8B: 8 billion parameters × 2 bytes each (BF16 = 16 bits) = **16 GB** for the weights alone
- A typical laptop GPU has 6 GB VRAM
**Solution — Quantization:**
- Store each weight with fewer bits by rounding it to the nearest value on a coarser grid
- Common: **4-bit quantization** → ~4 GB for an 8B model
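The size arithmetic, plus one very simple way to round weights to 4 bits, can be sketched as below. The scheme shown is symmetric absmax quantization with a single scale; it is an illustrative assumption, not the block-wise method GGUF uses, and the function names are made up for the example.

```python
def weights_size_gb(num_params, bits_per_weight):
    """Storage needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_absmax(weights, bits=4):
    """Map floats to signed integers using one shared scale.

    Assumes at least one weight is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


print(weights_size_gb(8e9, 16))  # 16.0
print(weights_size_gb(8e9, 4))   # 4.0

q, scale = quantize_absmax([0.7, -0.3, 0.12, 0.0])
print(q)
print(dequantize(q, scale))
```

The dequantized values differ slightly from the originals; that rounding error is the quality cost paid for the smaller file.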
**GGUF format** — used by [llama.cpp](https://github.com/ggerganov/llama.cpp)
**Q4_K_M explained:**
| Component | Meaning |
|---|---|
| `Q4` | 4 bits per weight |
| `_K` | K-quant scheme: weights are quantized in small blocks grouped into super-blocks, each block with its own scale |
| `_M` | Medium mix: most tensors at 4 bits, with some (e.g. parts of attention and feed-forward) kept at higher precision |
Other tiers: `_S` (small/aggressive), `_L` (large/higher precision)
**Reference:** [Quantization explainer](https://www.maartengrootendorst.com/blog/quantization/) | [Bartowski's quantized models](https://huggingface.co/bartowski/Qwen_Qwen3-8B-GGUF)
## Running Models Locally
- [Ollama](https://ollama.com/) or llama.cpp to run GGUF models
- [Hugging Face](https://huggingface.co/) — repository for open-source models and datasets
## Full Training Pipeline Summary
1. Collect data
2. Train tokenizer (BPE)
3. Train base model (cross-entropy on raw text)
4. Train SFT model (cross-entropy on demonstration data)
5. Train aligned model (RLHF / DPO with human preference data)
6. Optional: fine-tune for specific use cases
7. Quantize for deployment
## Resources for Building Your Own Model
| Resource | Purpose |
|---|---|
| [Neural Networks: Zero to Hero — Karpathy](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) | Full from-scratch training course |
| [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) | GPT-2 level model, trains in ~2 min |
| [Unsloth](https://github.com/unslothai/unsloth) | Efficient LoRA fine-tuning of existing models |
| [Hugging Face Datasets](https://huggingface.co/datasets) | Fine-tuning datasets |