What I Learned Benchmarking FP16 vs INT4 LLM Inference with vLLM
Most quantization posts say INT4 is faster and cheaper, but few show exactly how much faster on real hardware. I wanted to answer one question for myself: is INT4 actually faster than FP16 in production, or is that just something everyone repeats?
So I built llm-inference-bench, ran Mistral-7B in FP16 and INT4 on an NVIDIA L4, and measured throughput, latency, and scaling across batch sizes and sequence lengths. The result was useful, but the most important lesson was unexpected.
Project: github.com/bereketlemma/llm-inference-bench
Dashboard: bench.bereketlemma.com
Why This Question Matters
This is not just a benchmark vanity metric. The answer affects real production tradeoffs.
- Inference cost: better throughput means fewer GPUs for the same traffic.
- User experience: lower tail latency means fewer slow responses.
- Memory limits: quantization can let larger models fit safely in VRAM.
- Capacity planning: higher requests per second on the same hardware budget.
Concepts, Quickly
FP16: 16-bit floating point weights, common baseline for fast GPU inference.
INT4 quantization: compresses weights to 4-bit integers, reducing memory bandwidth and often improving throughput.
AWQ (Activation-aware Weight Quantization): a method for preserving quality while quantizing.
vLLM: an inference engine optimized for high-throughput serving with strong batching behavior.
Throughput vs latency: throughput tells you system capacity; latency tells you what each user feels. You need both.
Why P99 matters: averages hide bad tails; P99 captures slow requests that hurt production experience.
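To make the P50/P99 distinction concrete, here is a minimal sketch of nearest-rank percentiles over per-request latencies. The sample values are illustrative, not from the benchmark:

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Nine fast requests and one slow straggler.
latencies_ms = [102, 98, 110, 105, 101, 99, 480, 103, 100, 104]
mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
# The mean and P50 look healthy; only P99 exposes the slow tail.
print(f"mean={mean:.0f}ms p50={p50}ms p99={p99}ms")
```

One straggler barely moves the mean but dominates P99, which is exactly why averages hide bad tails.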
How I Ran the Benchmark
I benchmarked Mistral-7B on an NVIDIA L4 (24GB) across 18 configurations.
- FP16 model: mistralai/Mistral-7B-v0.1
- INT4 model: TheBloke/Mistral-7B-v0.1-AWQ
- Engine: vLLM 0.16.0
- Batch sizes: 1, 4, 8
- Sequence lengths: 128, 256, 512
- Warmup: 3 iterations
- Measurement: 10 runs per configuration
- Decoding: greedy, temperature = 0.0 for reproducibility
- Region/hardware: GCP us-west1-a, NVIDIA L4 24GB
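The sweep above can be sketched as a simple Cartesian product. The names here are mine for illustration, not the actual API of llm-inference-bench:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    model: str
    batch_size: int
    max_tokens: int
    warmup_iters: int = 3
    measured_runs: int = 10
    temperature: float = 0.0  # greedy decoding for reproducibility

MODELS = ["mistralai/Mistral-7B-v0.1", "TheBloke/Mistral-7B-v0.1-AWQ"]
BATCH_SIZES = [1, 4, 8]
SEQ_LENS = [128, 256, 512]

# 2 models x 3 batch sizes x 3 sequence lengths = 18 configurations
sweep = [RunConfig(m, b, s) for m, b, s in product(MODELS, BATCH_SIZES, SEQ_LENS)]
print(len(sweep))
```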
The Graphs That Tell the Story
Graph 1: Average throughput, FP16 vs INT4
Caption: INT4 AWQ-Marlin delivered a consistent throughput gain across the entire sweep.
Graph 2: P50 and P99 latency comparison
Caption: INT4 improves not only average speed, but also tail behavior that users actually notice.
Graph 3: Throughput scaling by batch size
Caption: quantization benefits grow with larger batch sizes where memory pressure rises.
Graph 4: Extreme case (batch=8, tokens=512)
Caption: FP16 shows a clear latency wall while INT4 keeps much higher usable throughput.
You can explore these views live on bench.bereketlemma.com.
Main Findings
INT4 AWQ-Marlin was clearly faster on average, but the improvement was not uniform. The gap widened under heavier load, where FP16 hit stronger memory and latency pressure.
One practical signal from this run: throughput alone can look great while latency tails still hurt. If you only optimize tokens/sec, you can still ship a bad user experience.
What Surprised Me Most
Before this project, I assumed quantization speedups were mostly automatic. I thought loading an AWQ model into vLLM would naturally take the best path.
That was wrong.
Standard AWQ was not enough in my tests. The real unlock was explicitly setting quantization="awq_marlin". Without that, performance can fall back to a slower path even if logs suggest Marlin is available.
That single configuration detail changed the conclusion from "INT4 is mixed" to "INT4 is clearly better here."
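For reference, a minimal sketch of forcing the Marlin path when loading the model in vLLM (requires a CUDA GPU and the `vllm` package; `quantization` is a constructor argument of vLLM's `LLM` class):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-v0.1-AWQ",
    quantization="awq_marlin",  # explicit: plain "awq" can fall back to a slower kernel
)
params = SamplingParams(temperature=0.0, max_tokens=128)  # greedy, as in the benchmark
outputs = llm.generate(["Explain KV caching in one sentence."], params)
```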
Representative Rows
FP16 Baseline

| Batch | Tokens | P50 (ms) | P99 (ms) | Tok/s | Req/s |
|---|---|---|---|---|---|
| 1 | 128 | 3,587 | 3,590 | 17.9 | 1.00 |
| 4 | 256 | 3,740 | 3,760 | 68.3 | 1.07 |
| 8 | 512 | 30,590 | 30,600 | 133.9 | 0.26 |
INT4 AWQ-Marlin

| Batch | Tokens | P50 (ms) | P99 (ms) | Tok/s | Req/s |
|---|---|---|---|---|---|
| 1 | 128 | 2,084 | 2,087 | 61.4 | 0.48 |
| 4 | 256 | 4,394 | 4,395 | 233.1 | 0.91 |
| 8 | 512 | 9,545 | 9,548 | 429.1 | 0.84 |
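Dividing the tok/s columns row by row gives the throughput speedup at each load level:

```python
# Throughput (tok/s) from the representative rows, keyed by (batch, tokens).
fp16_toks = {(1, 128): 17.9, (4, 256): 68.3, (8, 512): 133.9}
int4_toks = {(1, 128): 61.4, (4, 256): 233.1, (8, 512): 429.1}

for key in fp16_toks:
    speedup = int4_toks[key] / fp16_toks[key]
    print(f"batch={key[0]} tokens={key[1]}: {speedup:.1f}x")
# batch=1 tokens=128: 3.4x
# batch=4 tokens=256: 3.4x
# batch=8 tokens=512: 3.2x
```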
What This Taught Me About Inference Optimization
Quantization choice, kernel path, and benchmark methodology are tightly coupled. You cannot trust one without checking the others.
My checklist now is simple:
- Always evaluate throughput and P99 together.
- Test the exact kernel path, not just model format labels.
- Sweep batch size and sequence length, not one default config.
- Use warmups and multiple measured runs for stable numbers.
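The last two checklist items boil down to one measurement pattern. A minimal sketch, where `run_once` stands in for an actual inference call:

```python
import time

def benchmark(run_once, warmup_iters=3, measured_runs=10):
    for _ in range(warmup_iters):       # warm caches, kernels, allocator
        run_once()
    samples = []
    for _ in range(measured_runs):      # keep every sample for percentiles
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples  # per-run latency in ms; report P50 and P99, not just the mean

samples = benchmark(lambda: sum(range(10_000)))
```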
How to Repeat This on Your Hardware
If you want to reproduce this quickly, start with one GPU and a small matrix, then scale up.
```shell
git clone https://github.com/bereketlemma/llm-inference-bench.git
cd llm-inference-bench
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# quick GPU sanity check (small model)
python main.py
```
For production-style runs, switch to:
```python
config = BenchmarkConfig.production_config()
```
Then run your sweep and inspect both throughput and P99. If AWQ is not faster, check kernel configuration first, then batch/sequence settings.
Quantization is usually worth it when you are memory-bound, throughput-constrained, or trying to serve higher concurrency on fixed hardware. It may matter less if your workload is lightly loaded and already below latency targets.
Closing Thought
This project taught me that inference optimization is not just about picking a quantized checkpoint. It is about verifying the actual execution path. In my case, the difference between awq and awq_marlin completely changed the result.
If you try this on a different GPU or model family, I would love to compare results. Reach out through my contact page.
𝒷𝑒𝓇𝑒𝓀𝑒𝓉 𝓁𝑒𝓂𝓂𝒶