Your first local LLM in 10 minutes with Ollama
No cloud, no API key, no data leaving your machine. A genuinely beginner-proof path to running a capable model locally — plus how to know which size your hardware can actually handle.
- Min hardware: 8GB RAM (16GB+ for the good stuff)
- Read time: 10 min
- Stack: Ollama, llama.cpp
You don’t need a data-centre GPU to run a useful model at home. You need Ollama and about ten minutes. This is the path I point everyone to first.
1. Install Ollama
One installer, every platform. Grab it from ollama.com. On macOS and Windows it’s a normal app; on Linux it’s a one-line script. Done.
2. Pull and run a model
ollama run llama3.2
That’s it. The first run downloads the weights; after that the model loads from your local disk and runs fully offline. You’re now chatting with a model that lives entirely on your machine.
3. Pick a size your hardware can handle
This is where most people get frustrated. A rough rule of thumb:
| Your RAM / VRAM | Comfortable model size |
|---|---|
| 8 GB | 1–3B parameters |
| 16 GB | 7–8B parameters |
| 24 GB+ | 13–14B, or quantized larger |
| 48 GB+ | 30B+ quantized |
If a model swaps to disk, it’ll crawl. When in doubt, go one size smaller — a fast 8B beats a 70B that takes a minute per sentence.
Quantization is your friend. A 4-bit (Q4) quant of a bigger model often beats a full-precision smaller one, and its weights take roughly a quarter of the memory the same model needs at 16-bit.
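To see why quantization changes what fits, here is a back-of-the-envelope estimate in Python. It is a rough sketch, not a measurement: the function name `estimate_model_memory_gb` and the 1.2 overhead factor (headroom for context cache and runtime buffers) are my own assumptions, and real Q4 files usually store slightly more than 4 bits per weight.

```python
def estimate_model_memory_gb(
    params_billions: float,
    bits_per_weight: float,
    overhead: float = 1.2,  # assumed headroom for context cache and buffers
) -> float:
    """Rough memory needed to run a model, in GB (10^9 bytes)."""
    # Each weight takes bits_per_weight / 8 bytes.
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


if __name__ == "__main__":
    examples = [
        ("3B at 4-bit", 3, 4),
        ("8B at 4-bit", 8, 4),
        ("8B at 16-bit", 8, 16),
        ("70B at 4-bit", 70, 4),
    ]
    for label, params, bits in examples:
        print(f"{label}: ~{estimate_model_memory_gb(params, bits):.1f} GB")
```

Under these assumptions an 8B model at 4-bit needs about 5 GB, while the same model at 16-bit needs about 19 GB, which is why the quantized version is the one that fits a 16 GB machine. Leave room for your operating system and other apps on top of the estimate.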
4. Talk to it from your own code
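Ollama serves an HTTP API on localhost:11434. Besides its native /api routes, it exposes an OpenAI-compatible endpoint under /v1, so most tooling “just works” by pointing at it. The native generate route looks like this: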
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain gridfinity in one sentence."
}'
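The same request from Python might look like the sketch below, using only the standard library. It sends the same model and prompt as the curl call, with `"stream": False` added so the reply arrives as a single JSON object rather than a stream of chunks. The helper names `build_request` and `generate` are hypothetical, and the code assumes Ollama is running on its default port with `llama3.2` already pulled.

```python
import json
import urllib.request

# Ollama's native generate route on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.2", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # With streaming off, the generated text is in the "response" field.
    return body["response"]


# Example (needs a running Ollama server):
# print(generate("Explain gridfinity in one sentence."))
```

Keeping request construction separate from the network call makes the payload easy to inspect or test without a server. The generous timeout is a guess on my part; the first call after a cold start can be slow while the model loads.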
Where to go next
- Try a code-focused model for editor autocomplete.
- Add a small embedding model and you’ve got the start of local RAG.
- Watch your temperatures — sustained local inference is a real workout for a laptop.
Next guide: giving your local model your own documents to read, without anything touching the cloud.