Precise VRAM accounting — not the "weights only" estimate everyone else gives you. Includes KV cache (grows with context), activation memory, runtime overhead. Answers: will this actually fit?
Total VRAM needed
29.4GB
20.0 GB weights + 7.77 KV + 0.1 act + 1.5 overhead
Est. speed · on RTX 4090 reference
34tok/s
memory-bound · scale ∝ your GPU bandwidth/total
Memory pressure vs RTX 4090 (24 GB cap) ⚠ Doesn't fit · −5.4 GB short
▸ Why this is more accurate than llama.cpp's default estimate
Most VRAM calculators only tell you weight size. That's
missing the point — KV cache often dominates at long context. Here's what we compute:
Model weights: quant-specific GGUF block size × paramsKV cache: ctx_len × n_layers × n_kv_heads × dim_head × 2 (K+V) × 2 bytesActivation: peak per-layer scratch (~1% with flash attention)Overhead: CUDA workspace, scheduler queues, lib staticsSpike v0.2 · 26 models · 88 GPUs · 3 sources (LiveBench / Aider / Arena)
Methodology & changelog: coming soon