## Introduction to Embedding Models
Embedding models map discrete units of text, such as words or sentences, to continuous vectors that capture the semantic relationships natural language processing (NLP) tasks rely on. These dense vectors let machines compare context, similarity, and meaning numerically. This guide walks through the evolution and implementation of embedding techniques, starting with classical methods and progressing to modern transformer-based approaches. You'll gain practical skills by building models from scratch in PyTorch, evaluating their performance, and applying them effectively.
Whether you're enhancing search engines, recommendation systems, or chatbots, mastering embeddings unlocks powerful NLP capabilities. We draw from proven architectures, providing code examples and best practices to make concepts actionable.
## Fundamentals of Embeddings
At their core, embeddings map high-dimensional sparse data (e.g., one-hot vectors) to low-dimensional dense spaces where similar items are close together. Key properties include:
- **Dimensionality**: Commonly 50 to 300 dimensions for static word vectors and 384 to 1,024 for transformer encoders, balancing expressiveness and efficiency.
- **Semantic Capture**: Vectors encode relationships like 'king' - 'man' + 'woman' ≈ 'queen'.
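
The analogy property can be made concrete with a toy example. The sketch below uses hand-picked, hypothetical 3-dimensional vectors (real embeddings are learned and far larger), and the helper names `cosine` and `analogy` are illustrative rather than taken from any library.

```python
import numpy as np

# Hypothetical vectors, hand-picked so the analogy works.
# Illustrative axes: royalty, maleness, femaleness.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

Excluding the three input words from the candidates is the usual convention in analogy evaluation; without it the nearest neighbor is often one of the inputs.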
### Why Embeddings Matter
Embeddings sidestep the curse of dimensionality that comes with vocabulary-sized one-hot vectors, and they let downstream tasks such as classification or clustering operate on fixed-size numeric features instead of raw text. In practice, pre-trained models like Word2Vec or BERT save substantial compute, but understanding their internals makes customization possible.
Start by visualizing: use t-SNE or PCA to project embeddings down to two dimensions, plot them, and look for clusters of related words.
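
As a minimal sketch of the PCA route, the projection can be computed with a singular value decomposition. The function name `pca_2d` is made up for this example, and random vectors stand in for trained embeddings, so no meaningful clusters should be expected from this data.

```python
import numpy as np

def pca_2d(embeddings):
    """Project (n_items, dim) embeddings onto their top two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Stand-in data: 20 random 50-dimensional vectors instead of trained embeddings.
fake_embeddings = rng.normal(size=(20, 50))
points = pca_2d(fake_embeddings)
print(points.shape)  # (20, 2)
```

The resulting `points` array can be passed to any plotting library as x and y coordinates, with the corresponding words as labels.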
## Skip-Gram Model: Predicting Context from Target
The skip-gram model, popularized by Word2Vec, predicts surrounding context words given a target word. This self-supervised approach leverages vast unlabeled text.
### Architecture Overview
- **Input**: One-hot encoded target word.
- **Embedding Layer**: Linear projection to dense vector (weights are the embeddings).
- **Output**: Softmax over vocabulary for context words.
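Mathematically, training maximizes P(context | target) = exp(v_target · u_context) / Σ_w exp(v_target · u_w), a softmax over the dot products between the target embedding v and the output (context) embedding u of every word w in the vocabulary.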
### Implementation in PyTorch
Here's a basic skip-gram setup:
```python
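import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # Rows of this matrix are the word embeddings we want to learn.
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        # Projects an embedding to one logit per vocabulary word.
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, target):
        embed = self.embeddings(target)  # (batch, embed_dim)
        return self.output(embed)        # (batch, vocab_size)


# Placeholder data: random (target, context) id pairs. Replace with real windowed pairs.
pairs = TensorDataset(torch.randint(0, 10000, (64,)), torch.randint(0, 10000, (64,)))
dataloader = DataLoader(pairs, batch_size=16, shuffle=True)

# Training loop sketch
model = SkipGram(vocab_size=10000, embed_dim=300)
optimizer = torch.optim.Adam(model.parameters())
for target, context in dataloader:
    logits = model(target)
    loss = F.cross_entropy(logits, context.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()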
```
Train on windowed contexts (e.g., window size 5) from corpora like Wikipedia dumps.
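
The windowed contexts mentioned above can be generated with a few lines of plain Python. This is a sketch with a fixed symmetric window; the function name `skipgram_pairs` is illustrative, and tokenization is assumed to be a simple whitespace split.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token and its neighbors within `window`."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

sentence = "the quick brown fox".split()
pairs = list(skipgram_pairs(sentence, window=1))
# pairs is now:
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
print(pairs)
```

For training, each word would then be mapped to an integer id so the pairs can be fed to an embedding layer.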
## Optimizing Training: Negative Sampling and Hierarchical Softmax
Full softmax over large vocabularies (millions) is computationally expensive (O(V) per example).
### Negative Sampling
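Approximate the full softmax by drawing K negative (randomly sampled) context words for each positive pair and optimizing a binary logistic loss:
Loss = -[ log sigmoid(target · positive) + Σ_k log sigmoid(-target · negative_k) ]
The cost scales linearly with K (typically 5-20) rather than with the vocabulary size V. A loss function implementing this: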
```python
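import torch
import torch.nn.functional as F

def negative_sampling_loss(embed_target, embed_pos, embed_neg):
    # embed_target, embed_pos: (batch, dim); embed_neg: (batch, K, dim), so K is implicit.
    score_pos = torch.sum(embed_target * embed_pos, dim=1)                  # (batch,)
    score_neg = torch.bmm(embed_neg, embed_target.unsqueeze(2)).squeeze(2)  # (batch, K)
    # Maximize log-sigmoid of the positive score and of each negated negative score.
    log_likelihood = F.logsigmoid(score_pos) + F.logsigmoid(-score_neg).sum(dim=1)
    return -log_likelihood.mean()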
```
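
Where the negatives come from matters as well. The Word2Vec paper draws them from the unigram distribution raised to the 3/4 power, which gives rare words a somewhat higher chance of being picked. The sketch below assumes word ids index into a list of counts; the function names and the toy counts are hypothetical.

```python
import numpy as np

def negative_sampling_probs(counts, power=0.75):
    """Smoothed unigram distribution: p(w) is proportional to count(w) ** power."""
    weights = np.asarray(counts, dtype=np.float64) ** power
    return weights / weights.sum()

def sample_negatives(probs, batch_size, k, rng):
    """Draw a (batch_size, k) array of word ids; a draw may occasionally equal the positive."""
    return rng.choice(len(probs), size=(batch_size, k), p=probs)

counts = [1000, 100, 10, 1]  # hypothetical word frequencies
probs = negative_sampling_probs(counts)
negatives = sample_negatives(probs, batch_size=4, k=5, rng=np.random.default_rng(0))
print(probs.round(3), negatives.shape)
```

This sketch does not filter out negatives that collide with the true context word; with a large vocabulary such collisions are rare, and simple implementations often ignore them.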
### Hierarchical Softmax
Tree-structured softmax reduces complexity to O(log V) using a Huffman tree where frequent words have shorter paths.
For production, negative sampling often suffices due to simplicity.
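
To see why a Huffman tree gives frequent words shorter paths, the following sketch builds one from hypothetical word counts and reports each word's path length (the number of binary decisions hierarchical softmax would make for it). It only computes depths; a full implementation would also store a learnable vector at every inner node.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: path length} for a Huffman tree built from word frequencies."""
    tiebreak = count()  # unique ints so the heap never has to compare dicts
    heap = [(f, next(tiebreak), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        merged = {w: d + 1 for w, d in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"the": 50, "cat": 20, "sat": 15, "zygote": 1}  # hypothetical counts
lengths = huffman_code_lengths(freqs)
print(lengths)
```

With these counts, "the" ends up one step from the root while "zygote" sits three steps down, so the most common predictions are also the cheapest.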
## Evaluating Embedding Quality
Intrinsic evaluation:
- **Word Similarity**: Spearman correlation on datasets like WordSim-353 (e.g., 'car' and 'automobile' should be close).
- **Analogy Tasks**: Vector arithmetic accuracy.
Extrinsic: Plug into downstream tasks like NER or sentiment analysis.
Use libraries like Gensim for benchmarks:
```python
from gensim.models import KeyedVectors
# .bin files use the binary word2vec format, so binary=True is required
model = KeyedVectors.load_word2vec_format('embeddings.bin', binary=True)
print(model.similarity('king', 'queen'))
```
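
The word-similarity metric itself is simple to compute by hand. The sketch below implements Spearman correlation as the Pearson correlation of ranks; the human ratings and model scores are made-up numbers for four hypothetical word pairs, not values from WordSim-353.

```python
import numpy as np

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values)
    ranked = np.empty(len(values))
    ranked[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranked[tied] = ranked[tied].mean()
    return ranked

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Made-up human ratings (0-10) and model cosine similarities for four word pairs.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.82, 0.60, 0.65, 0.10]
print(round(spearman(human_scores, model_scores), 2))  # 0.8
```

Because only the ordering matters, the model's similarity scores do not need to be on the same scale as the human ratings.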
## GloVe: Global Vectors for Word Representation
Unlike skip-gram's local contexts, GloVe factorizes a global word co-occurrence matrix.
### Core Idea
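Fit embeddings so that W W^T ≈ log X, where X_ij counts how often word j appears in the context of word i and the rows of W are the embeddings.
Loss: Σ_{i,j} f(X_ij) (w_i^T w_j + b_i + b_j - log X_ij)^2
f is a weighting function that down-weights rare co-occurrences and caps the influence of very frequent ones; the sum runs over nonzero X_ij only. The original formulation uses separate context vectors and biases for word j; they are merged here for brevity.
Training resembles skip-gram, except that each example is a precomputed co-occurrence count rather than a sampled (target, context) pair. GloVe embeddings tend to perform well on word similarity and analogy benchmarks.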
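
The weighted least-squares loss can be sketched directly in NumPy. The weighting function below uses x_max = 100 and alpha = 0.75, the values reported in the GloVe paper; the toy co-occurrence matrix, the function names, and the choice to share one embedding matrix between target and context words are simplifying assumptions of this example.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x / x_max) ** alpha, then saturates at 1."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(w, b, cooc):
    """Weighted squared error over nonzero co-occurrence counts.

    Simplification: one embedding matrix `w` and one bias vector `b` are
    shared by target and context words.
    """
    i, j = np.nonzero(cooc)
    x = cooc[i, j]
    pred = np.sum(w[i] * w[j], axis=1) + b[i] + b[j]
    return float(np.sum(glove_weight(x) * (pred - np.log(x)) ** 2))

rng = np.random.default_rng(0)
cooc = np.array([[0, 12, 3], [12, 0, 150], [3, 150, 0]], dtype=np.float64)  # toy counts
w = rng.normal(scale=0.1, size=(3, 8))
b = np.zeros(3)
print(glove_loss(w, b, cooc))
```

Skipping zero counts keeps log X_ij defined and means the cost of one pass depends on the number of observed co-occurrences, not on the square of the vocabulary size.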
## FastText: Subword Information
Extends Word2Vec with character n-gram subwords (e.g., with trigrams, 'apple' → '<ap', 'app', 'ppl', 'ple', 'le>', where '<' and '>' mark the word boundaries). Handles out-of-vocabulary (OOV) words by combining the vectors of their subwords.
Well suited to morphologically rich languages. Training is similar to skip-gram, with a bag of subwords as input.
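
Subword extraction is easy to reproduce. The sketch below wraps a word in boundary markers and slides windows of length 3 to 6 over it, the range used in the FastText paper; the function name `char_ngrams` is illustrative, and the hashing of n-grams into a fixed number of buckets that real implementations perform is omitted.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `<word>`, with < and > marking word boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("apple", n_min=3, n_max=3))
# ['<ap', 'app', 'ppl', 'ple', 'le>']
```

An unseen word such as "apples" shares most of these n-grams with "apple", which is why its vector can be assembled from subword vectors learned during training.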
## Modern Embeddings: Transformers and Contextual Representations
Static embeddings (Word2Vec/GloVe) assign one vector per word. Contextual models like BERT generate dynamic embeddings via transformers.
### BERT-Style Embeddings
- Token + positional + segment embeddings → Transformer layers → [CLS] or pooled output.
Fine-tune for sentence embeddings using contrastive losses (e.g., Multiple Negatives Ranking Loss).
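
The idea behind Multiple Negatives Ranking Loss is that, within a batch of (anchor, positive) pairs, every other positive serves as a negative for a given anchor. The NumPy sketch below only computes the loss value to show the mechanics; it has no gradients, and the scale factor of 20 and all names are assumptions of this example rather than a reference implementation.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch contrastive loss: row i of `positives` is the match for row i of `anchors`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                    # (batch, batch) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 16))
aligned = anchors + 0.01 * rng.normal(size=(4, 16))  # near-duplicates as positives
shuffled = aligned[::-1]                             # deliberately mismatched pairs
print(multiple_negatives_ranking_loss(anchors, aligned))
print(multiple_negatives_ranking_loss(anchors, shuffled))
```

The first value should be close to zero and the second much larger, since in the shuffled batch no anchor is paired with its true match. Larger batches supply more negatives per anchor, which is one reason this loss tends to benefit from big batch sizes.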
## Sentence Transformers: Ready-to-Use Framework
The [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers) builds on transformers for semantic search.
Fine-tuning example:
```python
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
model = SentenceTransformer('all-MiniLM-L6-v2')
train_examples = [InputExample(texts=['text1', 'text2'], label=0.8)]
train_dataloader = DataLoader(train_examples, shuffle=True)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)])
```
## Hands-On Course Resources
For complete notebooks covering all architectures, check the official repository: [https://github.com/rasbt/embedding-models-course](https://github.com/rasbt/embedding-models-course). It includes PyTorch implementations, datasets, and evaluation scripts.
## Practical Applications
- **Search**: FAISS indexing of embeddings for fast similarity.
- **Recommendations**: User-item dot products.
- **Clustering**: K-means on document embeddings.
Example: Semantic search pipeline:
```python
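import faiss
import numpy as np

# `model` is the SentenceTransformer from above; `sentences` is a list of strings.
# Normalizing makes the inner product equal to cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # 384 for all-MiniLM-L6-v2
index.add(np.asarray(embeddings, dtype=np.float32))
query_embedding = model.encode(['example query'], normalize_embeddings=True)  # (1, dim)
D, I = index.search(np.asarray(query_embedding, dtype=np.float32), 5)  # scores, indices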
```
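
The recommendation use case listed above follows the same pattern with plain dot products. This is a minimal sketch with hypothetical 2-dimensional user and item embeddings; in a real system both would be learned, for example by matrix factorization.

```python
import numpy as np

def recommend(user_vec, item_matrix, k=2):
    """Return indices of the k items with the highest dot-product score."""
    scores = item_matrix @ user_vec
    return np.argsort(-scores)[:k].tolist()

# Hypothetical embeddings with illustrative axes: "sci-fi", "romance".
items = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])
user = np.array([1.0, 0.0])  # a user who only likes sci-fi
print(recommend(user, items))  # [0, 2]
```

For catalogs too large to score exhaustively, the same item matrix can be loaded into a FAISS inner-product index and queried with the user vector.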
## Best Practices and Scaling
- Use subword tokenizers (BPE) for robustness.
- Train on diverse, large corpora.
- Quantize embeddings for deployment (8-bit).
- Monitor for biases via WEAT tests.
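
As one way to picture the 8-bit quantization mentioned above, the sketch below applies symmetric per-vector scaling: each float32 vector is stored as int8 values plus a single scale. The scheme and function names are assumptions of this example; production systems may use other schemes, such as per-dimension ranges or product quantization.

```python
import numpy as np

def quantize_int8(embeddings):
    """Symmetric per-vector quantization: float32 -> int8 plus one scale per row."""
    max_abs = np.abs(embeddings).max(axis=1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    quantized = np.round(embeddings / scales).astype(np.int8)
    return quantized, scales.astype(np.float32)

def dequantize(quantized, scales):
    """Approximate reconstruction of the original float32 vectors."""
    return quantized.astype(np.float32) * scales

rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 384)).astype(np.float32)  # stand-in embeddings
q, s = quantize_int8(vectors)
error = np.abs(dequantize(q, s) - vectors).max()
print(q.dtype, q.nbytes, vectors.nbytes, error)
```

The int8 array takes a quarter of the memory of the float32 original, and the reconstruction error per component is bounded by about half of that vector's scale. Whether that error is acceptable should be checked against retrieval quality on your own data.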
This guide equips you to build production-grade embedding systems. Experiment with the repo, iterate on your data, and integrate into apps for immediate impact.