what/llm/can/i/run

VRAM Calculator

Precise VRAM accounting, not the "weights only" estimate everyone else gives you. Includes the KV cache (which grows with context), activation memory, and runtime overhead. Answers: will this actually fit?

▸ Configure

model: Qwen2.5 Coder 32B (32.0B params)

quantization

context length: 32,768 tokens (options: 2k 4k 8k 16k 32k 64k 128k)

batch size

OS / framework reserve (options: 1G 1.5G 2G 8G 12G)

▸ Result

Total VRAM needed

29.4 GB

20.0 GB weights + 7.77 GB KV cache + 0.1 GB activations + 1.5 GB overhead

Est. speed · on RTX 4090 reference

34 tok/s

memory-bound · scales with your GPU's memory bandwidth ÷ total VRAM needed
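The reference speed appears consistent with a simple memory-bound model in which every generated token reads the whole resident footprint once, so tok/s ≈ memory bandwidth ÷ total. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the RTX 4090 bandwidth of about 1008 GB/s is an assumption taken from the card's published spec, not from this page.

```python
def estimate_tok_per_s(bandwidth_gb_s: float, total_gb: float) -> float:
    """Rough decode-speed ceiling: one full pass over memory per token."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return bandwidth_gb_s / total_gb


def scale_to_gpu(reference_tok_s: float, reference_bw: float, your_bw: float) -> float:
    """Scale a reference speed in proportion to memory bandwidth."""
    return reference_tok_s * your_bw / reference_bw


RTX_4090_BANDWIDTH_GB_S = 1008.0  # assumption: published spec value
TOTAL_GB = 29.4                   # "Total VRAM needed" from the result above

reference = estimate_tok_per_s(RTX_4090_BANDWIDTH_GB_S, TOTAL_GB)
print(f"{reference:.0f} tok/s")  # 34 tok/s
```

To estimate another card, pass its memory bandwidth to `scale_to_gpu`. Treat the result as an upper bound: it ignores compute limits and kernel overhead.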

Memory pressure vs RTX 4090 (24 GB cap): ⚠ Doesn't fit · -5.4 GB short

▸ Fits on these GPUs

✓ RTX 5090 · 32 GB · +2.6 GB headroom
✗ RTX 5080 · 16 GB · -13.4 GB short
✗ RTX 5070 Ti · 16 GB · -13.4 GB short
✗ RTX 5070 · 12 GB · -17.4 GB short
✗ RTX 5060 Ti 16GB · 16 GB · -13.4 GB short
✗ RTX 5060 · 8 GB · -21.4 GB short
✗ RTX 4090 · 24 GB · -5.4 GB short
✗ RTX 4080 SUPER · 16 GB · -13.4 GB short
✗ RTX 4080 · 16 GB · -13.4 GB short
✗ RTX 4070 Ti SUPER · 16 GB · -13.4 GB short
✗ RTX 4070 Ti · 12 GB · -17.4 GB short
✗ RTX 4070 SUPER · 12 GB · -17.4 GB short
✗ RTX 4070 · 12 GB · -17.4 GB short
✗ RTX 4060 Ti 16GB · 16 GB · -13.4 GB short
✗ RTX 4060 Ti 8GB · 8 GB · -21.4 GB short
✗ RTX 4060 · 8 GB · -21.4 GB short
✗ RTX 3090 Ti · 24 GB · -5.4 GB short
✗ RTX 3090 · 24 GB · -5.4 GB short
✗ RTX 3080 Ti · 12 GB · -17.4 GB short
✗ RTX 3080 12GB · 12 GB · -17.4 GB short
✗ RTX 3080 10GB · 10 GB · -19.4 GB short
✗ RTX 3070 Ti · 8 GB · -21.4 GB short
✗ RTX 3070 · 8 GB · -21.4 GB short
✗ RTX 3060 Ti · 8 GB · -21.4 GB short
✗ RTX 3060 12GB · 12 GB · -17.4 GB short
✗ RTX 3060 8GB · 8 GB · -21.4 GB short
✗ RTX 3050 8GB · 8 GB · -21.4 GB short
✗ RTX 2080 Ti · 11 GB · -18.4 GB short
✗ RTX 2080 SUPER · 8 GB · -21.4 GB short
✓ A6000 Pro · 48 GB · +18.6 GB headroom
✓ RTX 6000 Ada · 48 GB · +18.6 GB headroom
✓ H100 PCIe · 80 GB · +50.6 GB headroom
✓ H100 SXM5 · 80 GB · +50.6 GB headroom
✓ H200 SXM5 · 141 GB · +111.6 GB headroom
✓ A100 80GB · 80 GB · +50.6 GB headroom
✓ A100 40GB · 40 GB · +10.6 GB headroom
✓ L40S · 48 GB · +18.6 GB headroom
✗ L4 · 24 GB · -5.4 GB short
✗ M1 16GB · 16 GB · -13.4 GB short
✗ M1 Pro 16GB · 16 GB · -13.4 GB short
✓ M1 Pro 32GB · 32 GB · +2.6 GB headroom
✓ M1 Max 32GB · 32 GB · +2.6 GB headroom
✓ M1 Max 64GB · 64 GB · +34.6 GB headroom
✓ M1 Ultra 64GB · 64 GB · +34.6 GB headroom
✓ M1 Ultra 128GB · 128 GB · +98.6 GB headroom
✗ M2 16GB · 16 GB · -13.4 GB short
✗ M2 24GB · 24 GB · -5.4 GB short
✗ M2 Pro 16GB · 16 GB · -13.4 GB short
✓ M2 Pro 32GB · 32 GB · +2.6 GB headroom
✓ M2 Max 32GB · 32 GB · +2.6 GB headroom
✓ M2 Max 64GB · 64 GB · +34.6 GB headroom
✓ M2 Max 96GB · 96 GB · +66.6 GB headroom
✓ M2 Ultra 64GB · 64 GB · +34.6 GB headroom
✓ M2 Ultra 128GB · 128 GB · +98.6 GB headroom
✓ M2 Ultra 192GB · 192 GB · +162.6 GB headroom
✗ M3 16GB · 16 GB · -13.4 GB short
✗ M3 24GB · 24 GB · -5.4 GB short
✗ M3 Pro 18GB · 18 GB · -11.4 GB short
✓ M3 Pro 36GB · 36 GB · +6.6 GB headroom
✓ M3 Max 36GB · 36 GB · +6.6 GB headroom
✓ M3 Max 48GB · 48 GB · +18.6 GB headroom
✓ M3 Max 64GB · 64 GB · +34.6 GB headroom
✓ M3 Max 96GB · 96 GB · +66.6 GB headroom
✓ M3 Max 128GB · 128 GB · +98.6 GB headroom
✓ M3 Ultra 96GB · 96 GB · +66.6 GB headroom
✓ M3 Ultra 192GB · 192 GB · +162.6 GB headroom
✓ M3 Ultra 256GB · 256 GB · +226.6 GB headroom
✓ M3 Ultra 512GB · 512 GB · +482.6 GB headroom
✗ M4 16GB · 16 GB · -13.4 GB short
✗ M4 24GB · 24 GB · -5.4 GB short
✓ M4 32GB · 32 GB · +2.6 GB headroom
✗ M4 Pro 24GB · 24 GB · -5.4 GB short
✓ M4 Pro 48GB · 48 GB · +18.6 GB headroom
✓ M4 Max 36GB · 36 GB · +6.6 GB headroom
✓ M4 Max 48GB · 48 GB · +18.6 GB headroom
✓ M4 Max 64GB · 64 GB · +34.6 GB headroom
✓ M4 Max 96GB · 96 GB · +66.6 GB headroom
✓ M4 Max 128GB · 128 GB · +98.6 GB headroom
✗ RX 7900 XTX · 24 GB · -5.4 GB short
✗ RX 7900 XT · 20 GB · -9.4 GB short
✗ RX 7900 GRE · 16 GB · -13.4 GB short
✗ RX 7800 XT · 16 GB · -13.4 GB short
✗ RX 7700 XT · 12 GB · -17.4 GB short
✗ RX 6900 XT · 16 GB · -13.4 GB short
✗ RX 6800 XT · 16 GB · -13.4 GB short
✓ MI300X · 192 GB · +162.6 GB headroom
✗ Arc A770 16GB · 16 GB · -13.4 GB short
✗ Intel Iris Xe · 4 GB · -25.4 GB short
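Each verdict in this list is the device's memory capacity minus the 29.4 GB total. A small sketch of that check, with a hypothetical `verdict` helper (the name and output layout are illustrative, not the site's code):

```python
def headroom_gb(capacity_gb: float, required_gb: float) -> float:
    """Positive means spare memory; negative means the model does not fit."""
    return round(capacity_gb - required_gb, 1)


def verdict(name: str, capacity_gb: float, required_gb: float = 29.4) -> str:
    """Format one fit/no-fit line for a device."""
    delta = headroom_gb(capacity_gb, required_gb)
    mark, label = ("✓", "headroom") if delta >= 0 else ("✗", "short")
    return f"{mark} {name} · {capacity_gb:g} GB · {delta:+.1f} GB {label}"


for gpu, capacity in [("RTX 5090", 32), ("RTX 4090", 24), ("A100 40GB", 40)]:
    print(verdict(gpu, capacity))
```

A few gigabytes of headroom is a thin margin: raising the context length or batch size grows the KV cache and can turn a ✓ into a ✗.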

▸ Why this is more accurate than llama.cpp's default estimate

Most VRAM calculators only tell you the weight size. That misses the point: the KV cache often dominates at long context. Here's what we compute:

Model weights: params × quant-specific bits per weight (set by the GGUF block layout) ÷ 8
KV cache: ctx_len × n_layers × n_kv_heads × dim_head × 2 (K+V) × 2 bytes
Activation: peak per-layer scratch (~1% with flash attention)
Overhead: CUDA workspace, scheduler queues, lib statics
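The four terms above can be sketched as plain arithmetic. This is an illustration under stated assumptions, not the calculator's actual code. The 5.0 bits per weight is inferred from the 20.0 GB shown for 32.0B params. The layer count, KV-head count, and head dimension are assumed values for a Qwen2.5 32B-class model and do not appear on this page. Activation and overhead are passed in directly because the page gives them only as results.

```python
GB = 1e9  # decimal gigabytes


def weights_bytes(params: float, bits_per_weight: float) -> float:
    """Weight storage: parameter count times average bits per weight."""
    return params * bits_per_weight / 8


def kv_cache_bytes(ctx_len: int, n_layers: int, n_kv_heads: int,
                   dim_head: int, bytes_per_elem: int = 2) -> int:
    """KV cache: one K and one V vector per token, layer, and KV head."""
    return ctx_len * n_layers * n_kv_heads * dim_head * 2 * bytes_per_elem


def total_gb(weights: float, kv: float, activation_gb: float,
             overhead_gb: float) -> float:
    """Sum all four terms, returning decimal GB."""
    return (weights + kv) / GB + activation_gb + overhead_gb


# Assumed architecture values (not stated on this page).
N_LAYERS, N_KV_HEADS, DIM_HEAD = 64, 8, 128

w = weights_bytes(32.0e9, bits_per_weight=5.0)  # 5.0 is inferred, see above
kv = kv_cache_bytes(32_768, N_LAYERS, N_KV_HEADS, DIM_HEAD)

print(f"weights: {w / GB:.1f} GB")   # weights: 20.0 GB
print(f"KV cache: {kv / GB:.2f} GB")  # KV cache: 8.59 GB
```

With these assumed shape values the KV term comes out near 8.6 GB, not the 7.77 shown in the result, so the calculator presumably uses different architecture values or unit conventions. The structure of the formula is the point: doubling `ctx_len` doubles the KV cache and leaves the weights unchanged.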