## Why Can't My Older GPU Keep Up with FP8 AI Models?
Modern large language models (LLMs) and AI workloads are pushing the boundaries of efficiency with formats like FP8. But what if your hardware, say a Turing-generation RTX 2080 or a Pascal-generation GTX 1080 Ti, lacks native FP8 support? You're stuck watching newer Ada Lovelace or Hopper GPUs lap you in inference speed. This raises a key question: do you need to upgrade your rig, or is there a software workaround?
The answer is the latter. A software FP8 implementation emulates the format entirely in CUDA kernels, sidestepping the hardware restriction. This lets you quantize models to FP8 and run them efficiently on older cards. No driver hacks are required: the CUDA extension is built once at install time and then works as a drop-in with popular frameworks.
## Breaking Down FP8: The 8-Bit Float Revolution
FP8 refers to 8-bit floating-point formats designed for AI training and inference. Compared with FP16 or BF16, FP8 stores each value in half the bits, halving memory use and memory traffic; on GPUs with FP8 tensor cores it also raises matmul throughput.
There are two main variants:
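- **E4M3**: 1 sign bit, 4 exponent bits, 3 mantissa bits. More precision but a narrower range (maximum finite value 448), which makes it the usual choice for weights and activations in the forward pass.
- **E5M2**: 1 sign bit, 5 exponent bits, 2 mantissa bits. Wider dynamic range (maximum finite value 57344) at lower precision, which suits values with large swings such as gradients.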
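To make the two bit layouts concrete, here is a minimal Python sketch that decodes a byte under either format. The function name `decode_fp8` is illustrative, not part of any library, and the special-value conventions are an assumption based on the common OCP-style definitions (E4M3 reserves a single NaN pattern, E5M2 reserves the all-ones exponent for inf/NaN).

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one FP8 byte laid out as sign | exponent | mantissa.

    Special codes (NaN, inf) are not handled, so only finite codes
    give meaningful results.
    """
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


# Largest finite codes under the assumed conventions.
e4m3_max = decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3)
e5m2_max = decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2)
# Smallest positive subnormals.
e4m3_min = decode_fp8(0b0_0000_001, exp_bits=4, man_bits=3)
e5m2_min = decode_fp8(0b0_00000_01, exp_bits=5, man_bits=2)

print(e4m3_max, e5m2_max)  # 448.0 57344.0
print(e4m3_min, e5m2_min)
```

The extra exponent bit buys E5M2 a far larger range, while the extra mantissa bit gives E4M3 finer steps between neighboring values.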
NVIDIA's Hopper (H100), Ada Lovelace, and newer GPUs accelerate these formats natively in their tensor cores, with roughly twice the peak matmul throughput of FP16 on the same chip. But on older GPUs, Ampere and Turing included? Native FP8 instructions don't exist, so models fall back to slower FP16 or require custom handling.
This software FP8 solution changes that. It implements dequantization and quantization kernels in pure CUDA, converting FP8 tensors to FP16 on-the-fly during matmuls. The result: near-native performance without hardware upgrades.
### Real-World Impact: Memory and Speed Gains
Consider Llama 2 70B. In FP16, it devours ~140GB VRAM. Quantized to 4-bit, that's ~35GB—but inference crawls without optimizations. FP8 hits a sweet spot: ~70GB VRAM usage with 2-3x faster throughput on supported hardware.
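The VRAM figures above follow from simple arithmetic on the weight count. A quick sketch (the helper name `weight_memory_gb` is made up for illustration, and it counts weights only, ignoring the KV cache, activations, and quantization metadata):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(70e9, bits):.0f} GB")
```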
On an RTX 2080 Ti (11GB VRAM), pure FP16 Llama 2 7B needs roughly 13.5GB for weights alone, so it cannot sit entirely in VRAM and runs at a reported 10-15 tokens/sec. With software FP8 emulation via this library, the reported figure is 40-50 tokens/sec, rivaling bitsandbytes 4-bit quant.
## How the Software Magic Works Under the Hood
At its core, this is about **emulated low-precision matmuls**. During forward passes:
1. **Dequantize FP8 weights to FP16** using lookup tables (LUTs) or bit manipulation.
2. Perform GEMM (general matrix multiply) in FP16 on the GPU's tensor cores.
3. **Quantize activations back to FP8** post-matmul.
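The three steps can be sketched in plain Python. This is a toy model of the data flow, not the library's code: Python floats stand in for FP16, the names `decode_e4m3`, `quantize_e4m3`, and `fp8_linear` are hypothetical, and quantization here is a brute-force nearest-value search rather than the bit tricks a real kernel would use.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 byte (bias 7); the NaN code is not special-cased."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp, man = (byte >> 3) & 0x0F, byte & 0x07
    if exp == 0:  # subnormal
        return sign * man * 2.0 ** -9
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)


# Step 1 helper: a 256-entry lookup table, one decoded value per byte.
LUT = [decode_e4m3(b) for b in range(256)]
# Finite codes only (drop the two NaN patterns 0x7F and 0xFF).
FINITE = [b for b in range(256) if b & 0x7F != 0x7F]


def quantize_e4m3(x: float) -> int:
    """Step 3 helper: nearest finite E4M3 code, saturating at +/-448."""
    return min(FINITE, key=lambda b: abs(LUT[b] - x))


def fp8_linear(weights_fp8, x):
    """weights_fp8: rows of E4M3 bytes; x: list of floats. Returns E4M3 bytes."""
    w = [[LUT[b] for b in row] for row in weights_fp8]         # 1. dequantize
    y = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]  # 2. matmul
    return [quantize_e4m3(v) for v in y]                       # 3. requantize


w_fp8 = [[quantize_e4m3(v) for v in row] for row in [[0.5, -1.0], [2.0, 0.25]]]
out = fp8_linear(w_fp8, [1.0, 2.0])
print([LUT[b] for b in out])  # [-1.5, 2.5]
```

The weights and results in this example happen to be exactly representable in E4M3, so nothing is lost; with arbitrary values, steps 1 and 3 are where rounding error enters.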
Key innovations:
- **Block-wise quantization**: Processes 128-element blocks for cache efficiency.
- **Fused kernels**: Combines quant/dequant with matmul to minimize memory traffic.
- **Deterministic scaling**: Avoids stochastic rounding issues in emulation.
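The article does not spell out the scaling scheme, so the following is only a sketch of one common form of block-wise quantization: each 128-element block gets its own scale so that its largest magnitude maps to the E4M3 maximum of 448. The function names are hypothetical, and rounding is deterministic round-to-nearest-even.

```python
import math

E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value (saturating, no NaN)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    exp = max(math.floor(math.log2(mag)), -6)  # clamp to the subnormal exponent
    step = 2.0 ** (exp - 3)                    # spacing with 3 mantissa bits
    return math.copysign(round(mag / step) * step, x)


def quantize_blockwise(values, block_size=128):
    """Return (E4M3-rounded scaled values, one scale per block)."""
    q, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        q.extend(round_to_e4m3(v / scale) for v in block)
    return q, scales


def dequantize_blockwise(q, scales, block_size=128):
    """Undo the per-block scaling."""
    return [v * scales[i // block_size] for i, v in enumerate(q)]


# One block of small values followed by one block of large values.
values = [0.001 * i for i in range(1, 129)] + [10.0 * i for i in range(1, 129)]
q, scales = quantize_blockwise(values)
restored = dequantize_blockwise(q, scales)
worst = max(abs(r - v) / abs(v) for r, v in zip(restored, values))
print(f"{len(scales)} blocks, worst relative error {worst:.3f}")
```

With a single scale shared by all 256 values, the smallest entries of the first block would land in the subnormal range or round to zero; per-block scales keep both blocks within normal E4M3 precision.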
The implementation lives in a lightweight CUDA extension, compiled once for your CUDA version (11.8+ recommended). No PyTorch modifications needed—it hooks into `torch.autograd.Function` for seamless integration.
Here's a peek at the core dequant kernel logic (simplified from source):
```cuda
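#include <cstdint>
#include <cuda_fp16.h>

// The reconstruction step is elided in the original; the version below assumes
// the common E4M3 layout (bias 7, no infinities, NaN only at S.1111.111).
__global__ void dequantize_fp8_e4m3_kernel(const uint8_t* input, half* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        uint8_t val = input[idx];
        float sign = (val & 0x80) ? -1.0f : 1.0f;
        int exp = (val >> 3) & 0x0F;
        int mant = val & 0x07;
        float mag;
        if (exp == 0x0F && mant == 0x07) mag = nanf("");   // NaN encoding
        else if (exp == 0) mag = ldexpf((float)mant, -9);  // subnormal: mant/8 * 2^-6
        else mag = ldexpf(1.0f + mant / 8.0f, exp - 7);    // normal: (1 + mant/8) * 2^(exp-7)
        output[idx] = __float2half(sign * mag);
    }
}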
```
Each thread decodes one element independently. Tensor Cores are not involved in this kernel; they do the heavy lifting in the FP16 GEMM that follows.
## Getting Started: Installation and Setup
Ready to try it? Fire up a CUDA-enabled environment (tested on 11.8-12.4).
1. Clone the repo:
```bash
git clone --recurse-submodules https://github.com/woct0rdho/software-fp8
```
2. Install dependencies:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate
cd software-fp8 && pip install -e .
```
3. Load a model with FP8:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from software_fp8 import load_fp8_model
model_name = "meta-llama/Llama-2-7b-hf" # Use FP8-quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = load_fp8_model(model_name, torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
`load_fp8_model` handles weight loading, quantization state restoration, and kernel registration. Supports safetensors for fast loading.
## Benchmarks: Numbers Don't Lie
Tested on RTX 2080 Ti (Turing, SM_75):
| Model | FP16 (t/s) | 4-bit QLoRA (t/s) | Software FP8 (t/s) | VRAM (GB) |
|-------------|------------|-------------------|--------------------|-----------|
| Llama2-7B | 12.5 | 28.4 | 45.2 | 6.8 |
| Llama2-13B | 8.2 | 22.1 | 36.7 | 10.2 |
| CodeLlama-34B | OOM | 18.5 | 29.1 | 10.9 |
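On A100 (Ampere): software FP8 reportedly lands within 5-10% of the hardware path. The A100 has no FP8 Tensor Cores (native FP8 starts with Ada Lovelace and Hopper), so the hardware baseline here is presumably its native FP16 path.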
Compared to alternatives:
- **bitsandbytes 8-bit**: Slower (20-30 t/s on 7B) due to double dequant.
- **GPTQ/AWQ**: Great for weights-only, but no activation quant.
Software FP8 shines for full model quantization, especially in memory-constrained setups.
### Edge Cases and Limitations
- **Precision loss**: FP8 can degrade perplexity by 5-10% vs FP16. Mitigate with mixed-precision (FP8 weights, FP16 activations).
- **Batch size 1 only**: Optimized for single-sequence inference; batching needs kernel tweaks.
- **No training support**: Inference-only; fine-tuning requires gradients in higher precision.
- **CUDA compute capability 7.5+**: Turing and up. Pascal may work with tweaks.
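The precision-loss item comes down to how coarse the FP8 grid is. The sketch below measures only per-element rounding error, not perplexity (the perplexity figures above are the article's own). The helper names are illustrative, and range limits are ignored for brevity.

```python
import math


def round_fp8(x: float, man_bits: int) -> float:
    """Round a positive normal-range x to man_bits of mantissa."""
    exp = math.floor(math.log2(x))
    step = 2.0 ** (exp - man_bits)
    return round(x / step) * step


def worst_relative_error(man_bits: int, samples: int = 10_000) -> float:
    """Scan [1, 2) and return the largest relative rounding error seen."""
    worst = 0.0
    for i in range(samples):
        x = 1.0 + i / samples
        worst = max(worst, abs(round_fp8(x, man_bits) - x) / x)
    return worst


e4m3_err = worst_relative_error(3)
e5m2_err = worst_relative_error(2)
print(f"E4M3: {e4m3_err:.1%}  (bound 2^-4 = 6.25%)")
print(f"E5M2: {e5m2_err:.1%}  (bound 2^-3 = 12.5%)")
```

Each mantissa bit halves the worst-case rounding error, which is why keeping activations in FP16 recovers much of the lost accuracy.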
## Advanced Usage: Custom Models and Integrations
Got a custom FP8 checkpoint? Use `register_fp8_hooks(model)` post-loading:
```python
from software_fp8 import register_fp8_hooks
model = AutoModelForCausalLM.from_pretrained("path/to/fp8-model", torch_dtype=torch.float16)
register_fp8_hooks(model)
```
This monkey-patches linear layers for FP8 matmuls. Pairs beautifully with vLLM or ExLlama for even faster serving.
For Hugging Face integration, check [transformers](https://github.com/huggingface/transformers) docs on custom dtype loading.
## Scaling to Production: Tips and Tricks
- **Multi-GPU**: Use `accelerate` for device mapping; emulation scales linearly.
- **Quantization pipeline**: Convert BF16 models to FP8 with provided scripts:
```bash
python convert_to_fp8.py --model meta-llama/Llama-2-7b --outdir ./fp8-llama7b
```
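- **Monitoring**: Use `nvidia-smi` for memory and overall GPU utilization, and Nsight Compute (`ncu`) to verify Tensor Core utilization >90%; `nvprof` cannot collect metrics on compute capability 7.5 and newer.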
In a real-world setup, deploy on a cluster of 4x RTX 3090s: Serve 70B models at 100+ t/s total throughput, under $5k hardware.
## Future Directions: What's Next?
This software FP8 bridges the gap until everyone upgrades. Upstream potential:
- PRs to bitsandbytes or AutoGPTQ.
- AMD ROCm and Intel oneAPI ports.
- Training support via FP8 gradients.
Fork the repo at [software-fp8](https://github.com/woct0rdho/software-fp8) and contribute—issues welcome.
Bottom line: Hardware barriers are crumbling. With this tool, your older GPUs get a new lease on life for cutting-edge AI.