What I Learned Benchmarking FP16 vs INT4 LLM Inference with vLLM
Most quantization posts say INT4 is faster and cheaper, but few show exactly how much faster on real hardware. I wanted to answer one question for myself: is INT4 actually faster than FP16 in production, or is that just something everyone repeats?
So I built llm-inference-bench, ran Mistral-7B in FP16 and INT4 on an NVIDIA L4, and measured throughput, latency, and scaling across batch sizes and sequence lengths. The result was useful, but the most important lesson was unexpected.
Project: github.com/bereketlemma/llm-inference-bench
Dashboard: bench.bereketlemma.com
Why This Question Matters
This is not just a benchmark vanity metric. The answer affects real production tradeoffs.
- Inference cost: better throughput means fewer GPUs for the same traffic.
- User experience: lower tail latency means fewer slow responses.
- Memory limits: quantization can let larger models fit safely in VRAM.
- Capacity planning: higher requests per second on the same hardware budget.
Concepts, Quickly
FP16: 16-bit floating point weights, common baseline for fast GPU inference.
INT4 quantization: compresses weights to 4-bit integers, reducing memory bandwidth and often improving throughput.
AWQ (Activation-aware Weight Quantization): a method for preserving quality while quantizing.
vLLM: an inference engine optimized for high-throughput serving with strong batching behavior.
Throughput vs latency: throughput tells you system capacity; latency tells you what each user feels. You need both.
Why P99 matters: averages hide bad tails; P99 captures slow requests that hurt production experience.
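To make the P50/P99 distinction concrete, here is a minimal sketch of nearest-rank percentiles over per-request latencies. The sample values are illustrative, not from the benchmark:

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Nine fast requests and one slow straggler.
latencies_ms = [102, 98, 110, 105, 101, 99, 480, 103, 100, 104]
mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
# The mean and P50 look healthy; only P99 exposes the slow tail.
print(f"mean={mean:.0f}ms p50={p50}ms p99={p99}ms")
```

One straggler barely moves the mean but dominates P99, which is exactly why averages hide bad tails.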
How I Ran the Benchmark
I benchmarked Mistral-7B on an NVIDIA L4 (24GB) across 18 configurations.
- FP16 model: mistralai/Mistral-7B-v0.1
- INT4 model: TheBloke/Mistral-7B-v0.1-AWQ
- Engine: vLLM 0.16.0
- Batch sizes: 1, 4, 8
- Sequence lengths: 128, 256, 512
- Warmup: 3 iterations
- Measurement: 10 runs per configuration
- Decoding: greedy, temperature = 0.0 for reproducibility
- Region/hardware: GCP us-west1-a, NVIDIA L4 24GB
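The sweep above can be sketched as a simple Cartesian product. The names here are mine for illustration, not the actual API of llm-inference-bench:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    model: str
    batch_size: int
    max_tokens: int
    warmup_iters: int = 3
    measured_runs: int = 10
    temperature: float = 0.0  # greedy decoding for reproducibility

MODELS = ["mistralai/Mistral-7B-v0.1", "TheBloke/Mistral-7B-v0.1-AWQ"]
BATCH_SIZES = [1, 4, 8]
SEQ_LENS = [128, 256, 512]

# 2 models x 3 batch sizes x 3 sequence lengths = 18 configurations
sweep = [RunConfig(m, b, s) for m, b, s in product(MODELS, BATCH_SIZES, SEQ_LENS)]
print(len(sweep))
```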
The Graphs That Tell the Story
Graph 1: Average throughput, FP16 vs INT4
Caption: INT4 AWQ-Marlin delivered a consistent throughput gain across the entire sweep.
Graph 2: P50 and P99 latency comparison
Caption: INT4 improves not only average speed, but also tail behavior that users actually notice.
Graph 3: Throughput scaling by batch size
Caption: quantization benefits grow with larger batch sizes where memory pressure rises.
Graph 4: Extreme case (batch=8, tokens=512)
Caption: FP16 shows a clear latency wall while INT4 keeps much higher usable throughput.
You can explore these views live on bench.bereketlemma.com.
Main Findings
INT4 AWQ-Marlin was clearly faster on average, but the improvement was not uniform. The gap widened under heavier load, where FP16 hit stronger memory and latency pressure.
One practical signal from this run: throughput alone can look great while latency tails still hurt. If you only optimize tokens/sec, you can still ship a bad user experience.
What Surprised Me Most
Before this project, I assumed quantization speedups were mostly automatic. I thought loading an AWQ model into vLLM would naturally take the best path.
That was wrong.
Standard AWQ was not enough in my tests. The real unlock was explicitly setting quantization="awq_marlin". Without that, performance can fall back to a slower path even if logs suggest Marlin is available.
That single configuration detail changed the conclusion from "INT4 is mixed" to "INT4 is clearly better here."
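For reference, a minimal sketch of forcing the Marlin path when loading the model in vLLM (requires a CUDA GPU and the `vllm` package; `quantization` is a constructor argument of vLLM's `LLM` class):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-v0.1-AWQ",
    quantization="awq_marlin",  # explicit: plain "awq" can fall back to a slower kernel
)
params = SamplingParams(temperature=0.0, max_tokens=128)  # greedy, as in the benchmark
outputs = llm.generate(["Explain KV caching in one sentence."], params)
```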
Representative Rows
FP16 Baseline

| Batch | Tokens | P50 (ms) | P99 (ms) | Tok/s | Req/s |
|---|---|---|---|---|---|
| 1 | 128 | 3,587 | 3,590 | 17.9 | 1.00 |
| 4 | 256 | 3,740 | 3,760 | 68.3 | 1.07 |
| 8 | 512 | 30,590 | 30,600 | 133.9 | 0.26 |
INT4 AWQ-Marlin

| Batch | Tokens | P50 (ms) | P99 (ms) | Tok/s | Req/s |
|---|---|---|---|---|---|
| 1 | 128 | 2,084 | 2,087 | 61.4 | 0.48 |
| 4 | 256 | 4,394 | 4,395 | 233.1 | 0.91 |
| 8 | 512 | 9,545 | 9,548 | 429.1 | 0.84 |
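Dividing the tok/s columns row by row gives the throughput speedup at each load level:

```python
# Throughput (tok/s) from the representative rows, keyed by (batch, tokens).
fp16_toks = {(1, 128): 17.9, (4, 256): 68.3, (8, 512): 133.9}
int4_toks = {(1, 128): 61.4, (4, 256): 233.1, (8, 512): 429.1}

for key in fp16_toks:
    speedup = int4_toks[key] / fp16_toks[key]
    print(f"batch={key[0]} tokens={key[1]}: {speedup:.1f}x")
# batch=1 tokens=128: 3.4x
# batch=4 tokens=256: 3.4x
# batch=8 tokens=512: 3.2x
```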
What This Taught Me About Inference Optimization
Quantization choice, kernel path, and benchmark methodology are tightly coupled. You cannot trust one without checking the others.
My checklist now is simple:
- Always evaluate throughput and P99 together.
- Test the exact kernel path, not just model format labels.
- Sweep batch size and sequence length, not one default config.
- Use warmups and multiple measured runs for stable numbers.
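The last two checklist items boil down to one measurement pattern. A minimal sketch, where `run_once` stands in for an actual inference call:

```python
import time

def benchmark(run_once, warmup_iters=3, measured_runs=10):
    for _ in range(warmup_iters):       # warm caches, kernels, allocator
        run_once()
    samples = []
    for _ in range(measured_runs):      # keep every sample for percentiles
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples  # per-run latency in ms; report P50 and P99, not just the mean

samples = benchmark(lambda: sum(range(10_000)))
```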
How to Repeat This on Your Hardware
If you want to reproduce this quickly, start with one GPU and a small matrix, then scale up.
```shell
git clone https://github.com/bereketlemma/llm-inference-bench.git
cd llm-inference-bench
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# quick GPU sanity check (small model)
python main.py
```
For production-style runs, switch to:
```python
config = BenchmarkConfig.production_config()
```
Then run your sweep and inspect both throughput and P99. If AWQ is not faster, check kernel configuration first, then batch/sequence settings.
Quantization is usually worth it when you are memory-bound, throughput-constrained, or trying to serve higher concurrency on fixed hardware. It may matter less if your workload is lightly loaded and already below latency targets.
Closing Thought
This project taught me that inference optimization is not just about picking a quantized checkpoint. It is about verifying the actual execution path. In my case, the difference between awq and awq_marlin completely changed the result.
If you try this on a different GPU or model family, I would love to compare results. Reach out through my contact page.
𝒷𝑒𝓇𝑒𝓀𝑒𝓉 𝓁𝑒𝓂𝓂𝒶