Local LLM Hardware Sizing Guide
Estimate the GPU memory needed for local LLM inference based on model size, quantization, context length, and concurrency.
Model and format
Balanced local LLM baseline
Smallest common local format; output quality must be tested.
Runtime assumptions
Model memory: 6 GB
KV cache: 2 GB
Overhead: 4 GB
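The three figures above add up to a total GPU memory estimate. The following is a minimal sketch of that arithmetic in Python, not the calculator's actual implementation. The names `MemoryEstimate`, `weights_gb`, and `kv_cache_gb` are hypothetical, and the two helper formulas are common rules of thumb (weights as parameters times bits per weight; KV cache as keys plus values per layer, per token, per concurrent sequence). The architecture numbers in the usage lines are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class MemoryEstimate:
    """Hypothetical container for the three components shown above (in GB)."""
    model_gb: float
    kv_cache_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        # Total GPU memory is the plain sum of the three components.
        return self.model_gb + self.kv_cache_gb + self.overhead_gb


def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rule of thumb (assumption): parameters x bits per weight / 8, in GB (1e9 bytes)."""
    return params_billions * bits_per_weight / 8.0


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_sequences: int,
                bytes_per_value: int = 2) -> float:
    """Rule of thumb (assumption): one key and one value vector per layer,
    per KV head, per token, per concurrent sequence."""
    total_bytes = (2 * layers * kv_heads * head_dim * bytes_per_value
                   * context_tokens * concurrent_sequences)
    return total_bytes / 1e9


# Figures taken from the estimate above.
estimate = MemoryEstimate(model_gb=6, kv_cache_gb=2, overhead_gb=4)
print(f"Total: {estimate.total_gb} GB")

# Illustrative inputs only; real values depend on the chosen model and runtime.
print(weights_gb(params_billions=8, bits_per_weight=4))
print(kv_cache_gb(layers=32, kv_heads=8, head_dim=128,
                  context_tokens=8192, concurrent_sequences=1))
```

Because KV cache grows linearly with both context length and the number of concurrent sequences, doubling either one doubles that component while model memory stays fixed.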
Common hardware targets
Target | Typical fit | Notes
Laptop / CPU-only | 3B–8B quantized | Good for demos. Low throughput; not ideal for production concurrency.
16 GB GPU | 7B–8B quantized | Good starter for private RAG and prototypes.
24 GB GPU | 7B–14B quantized | Strong local RAG baseline. Common single-GPU target.
48 GB GPU | 14B–32B quantized | Better quality and larger context headroom.
80 GB GPU | 32B–70B quantized, or smaller models in FP16 | Serious self-hosted production and long-context workloads.
Multi-GPU server | 70B+ or high concurrency | Needed for large models, high throughput, and enterprise serving.
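As a rough illustration of how a memory estimate maps onto these targets, the sketch below picks the smallest single-GPU tier whose memory covers a required total. The function name `smallest_fitting_target` and the optional `headroom` parameter are hypothetical, and the tier sizes are simply the GPU capacities listed in the table. It ignores throughput and concurrency, which the table treats as separate reasons to move to a multi-GPU server.

```python
# Single-GPU memory tiers from the table above, in GB.
GPU_TIERS_GB = [16, 24, 48, 80]


def smallest_fitting_target(required_gb: float, headroom: float = 0.0) -> str:
    """Return the smallest listed GPU tier that holds required_gb.

    headroom is an assumed safety margin expressed as a fraction
    (0.1 means reserve an extra 10 percent).
    """
    needed = required_gb * (1.0 + headroom)
    for size in GPU_TIERS_GB:
        if needed <= size:
            return f"{size} GB GPU"
    # Nothing on a single card fits, so fall through to the last table row.
    return "Multi-GPU server"


# 6 GB model + 2 GB KV cache + 4 GB overhead = 12 GB.
print(smallest_fitting_target(12))
print(smallest_fitting_target(100))
```

Treat the result as a starting point: a model that fits in memory can still be too slow for the intended workload.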
Need a real hardware sizing review?
SovAIHub can help size GPU, CPU, memory, storage, runtime, and deployment topology for private RAG or air-gapped AI.