
Local LLM Hardware Sizing Guide

Estimate the GPU memory needed for local LLM inference based on model size, quantization, context length, and concurrency.

Model and format

Model size

Balanced local LLM baseline

Quantization

Smallest common local format; quality must be tested.

Context tokens

Concurrent users

Runtime overhead (GB)

Model memory: 6 GB

KV cache: 2 GB

Overhead: 4 GB
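The three figures above (model memory, KV cache, overhead) can be approximated with simple arithmetic. The following is a minimal sketch, not the calculator's actual formula, which this page does not state. The function name `estimate_memory_gb`, the architecture parameters (`n_layers`, `n_kv_heads`, `head_dim`), the 16-bit KV cache, and the 4 GB overhead default are all assumptions chosen for illustration. It also assumes the worst case where every concurrent user fills the full context.

```python
def estimate_memory_gb(
    params_billion: float,
    bits_per_weight: float,
    context_tokens: int,
    concurrent_users: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    kv_bytes_per_value: int = 2,  # assumption: 16-bit KV cache entries
    overhead_gb: float = 4.0,     # assumption: fixed runtime overhead
) -> dict:
    """Rough GPU memory estimate in decimal GB (hypothetical helper)."""
    # Weights: billions of parameters x bits per weight / 8 bits per byte.
    model_gb = params_billion * bits_per_weight / 8

    # KV cache: one key and one value vector per layer for every token.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_value
    kv_gb = kv_bytes_per_token * context_tokens * concurrent_users / 1e9

    total_gb = model_gb + kv_gb + overhead_gb
    return {
        "model_gb": model_gb,
        "kv_cache_gb": kv_gb,
        "overhead_gb": overhead_gb,
        "total_gb": total_gb,
    }


# Illustrative inputs only: an 8B model at 4 bits per weight, 8192-token
# context, one user, and made-up but plausible architecture values.
estimate = estimate_memory_gb(
    params_billion=8,
    bits_per_weight=4,
    context_tokens=8192,
    concurrent_users=1,
    n_layers=32,
    n_kv_heads=8,
    head_dim=128,
)
print(estimate)
```

Because the KV cache term scales linearly with both context length and concurrent users, raising either input can outgrow the weights themselves. Real runtimes add allocator fragmentation and activation buffers, so treat the result as a lower bound to verify on the target hardware.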

| Target | Typical fit | Notes |
| --- | --- | --- |
| Laptop / CPU-only | 3B–8B quantized | Good for demos, low throughput, not ideal for production concurrency. |
| 16 GB GPU | 7B–8B quantized | Good starter for private RAG and prototypes. |
| 24 GB GPU | 7B–14B quantized | Strong local RAG baseline. Common single-GPU target. |
| 48 GB GPU | 14B–32B quantized | Better quality and larger context headroom. |
| 80 GB GPU | 32B–70B quantized or smaller FP16 | Serious self-hosted production and long-context workloads. |
| Multi-GPU server | 70B+ or high concurrency | Needed for large models, high throughput, and enterprise serving. |
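To connect a memory estimate to the GPU targets listed above, one option is to pick the smallest tier whose capacity covers the estimate with some margin. This is a rough sketch under stated assumptions: the function name `pick_target` and the 90% usable-memory factor (`headroom`) are hypothetical choices for illustration, not values from this guide. Only the tier names and capacities come from the list above.

```python
# GPU tiers from the guide, smallest first: (capacity in GB, label).
TIERS = [
    (16, "16 GB GPU"),
    (24, "24 GB GPU"),
    (48, "48 GB GPU"),
    (80, "80 GB GPU"),
]


def pick_target(total_gb: float, headroom: float = 0.9) -> str:
    """Return the smallest tier that fits total_gb (hypothetical helper).

    headroom is an assumed fraction of GPU memory treated as usable.
    """
    for capacity_gb, label in TIERS:
        if total_gb <= capacity_gb * headroom:
            return label
    # Nothing on a single GPU fits, so fall through to the last tier.
    return "Multi-GPU server"


print(pick_target(12.0))
print(pick_target(15.0))
print(pick_target(100.0))
```

A 15 GB estimate lands on the 24 GB tier here because it exceeds 90% of 16 GB. Lowering `headroom` makes the choice more conservative, which may suit workloads with bursty concurrency.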

SovAIHub can help size GPU, CPU, memory, storage, runtime, and deployment topology for private RAG or air-gapped AI.