Question 1

How do I run a model from Hugging Face in Ollama?

Accepted Answer

Ollama can pull GGUF models directly from Hugging Face: `ollama run hf.co/{username}/{repository}` — for example `ollama run hf.co/unsloth/Qwen3-8B-GGUF`. To pick a specific quantization, append it as a tag: `hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_M`. This works for any public GGUF repository and is the most reliable way to get exact quantizations. This tool generates these commands for you when you select a Hugging Face model.
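As a sketch of what the builder produces, the hypothetical `hf_run_cmd` shell function below assembles the `hf.co/{username}/{repository}[:quantization]` reference described above. It only prints the command and never calls Ollama. When no tag is given, Ollama picks a default quantization from the repository (Hugging Face's documentation says Q4_K_M when the repo has one).

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the `ollama run` command for a Hugging Face GGUF repo.
# Usage: hf_run_cmd <username> <repository> [quantization]
hf_run_cmd() {
  local user="$1" repo="$2" quant="${3:-}"
  local ref="hf.co/${user}/${repo}"
  # Append the quantization as a tag only when one is given;
  # with no tag, Ollama chooses a default quantization from the repo.
  if [[ -n "$quant" ]]; then
    ref="${ref}:${quant}"
  fi
  printf 'ollama run %s\n' "$ref"
}

hf_run_cmd unsloth Qwen3-8B-GGUF          # ollama run hf.co/unsloth/Qwen3-8B-GGUF
hf_run_cmd unsloth Qwen3-8B-GGUF Q4_K_M   # ollama run hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_M
```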

Question 2

How do I increase the context window in Ollama?

Accepted Answer

Two ways: **temporarily** inside an `ollama run` session with `/set parameter num_ctx 32768`, or **permanently** with a Modelfile containing `FROM llama3.1:8b` followed by `PARAMETER num_ctx 32768`, built with `ollama create mymodel -f Modelfile`. Ollama defaults to a small context (often 4K) regardless of what the model supports, and this is the most common reason long prompts get truncated. Keep in mind that a larger context uses more VRAM for the KV cache.
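The permanent route can be scripted. This minimal sketch writes the two-line Modelfile from the answer with a shell heredoc and, if the `ollama` CLI is on the PATH, builds `mymodel` from it. It assumes the Ollama server is already running, which `ollama create` needs.

```shell
#!/usr/bin/env bash
# Write a Modelfile that raises the default context window to 32K tokens.
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 32768
EOF

# Build the customized model, but only if the ollama CLI is installed.
# (`ollama create` also needs the Ollama server to be running.)
if command -v ollama >/dev/null 2>&1; then
  ollama create mymodel -f Modelfile
fi
```

Afterwards, `ollama run mymodel` starts the model with the 32K context already applied.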

Question 3

What is a Modelfile and when do I need one?

Accepted Answer

A Modelfile is Ollama's recipe format for customizing models, like a Dockerfile for LLMs. You need one to set a permanent system prompt, change default parameters (context size, temperature), or package a custom GGUF file. The format is a `FROM <base model or GGUF path>` line, optional `PARAMETER <name> <value>` lines, and an optional `SYSTEM "<prompt>"` line. Then `ollama create <name> -f Modelfile` builds it.
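A fuller Modelfile that packages a local GGUF file might look like the sketch below. The file name `my-model.Q4_K_M.gguf`, the model name `my-assistant`, and the parameter values are placeholders chosen for illustration; `FROM` also accepts a library model such as `llama3.1:8b`.

```shell
#!/usr/bin/env bash
# Placeholder file and model names; replace them with your own.
GGUF=./my-model.Q4_K_M.gguf
NAME=my-assistant

# Unquoted EOF so ${GGUF} expands; the SYSTEM prompt uses triple quotes,
# which Modelfiles accept for text that may span several lines.
cat > Modelfile <<EOF
FROM ${GGUF}
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM """You are a concise assistant. Answer in plain English."""
EOF

# Build only when both the CLI and the GGUF file are present.
if command -v ollama >/dev/null 2>&1 && [[ -f "$GGUF" ]]; then
  ollama create "$NAME" -f Modelfile
fi
```

`ollama show --modelfile my-assistant` prints the Modelfile Ollama stored, which is a quick way to confirm the parameters took effect.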

Question 4

How do I make Ollama listen on my network (not just localhost)?

Accepted Answer

Set the OLLAMA_HOST environment variable to 0.0.0.0 before starting the server: `OLLAMA_HOST=0.0.0.0:11434 ollama serve` (Linux/macOS) or, on Windows, set it as a system environment variable and restart Ollama. Be aware this exposes the API to your network without authentication, so only do it on trusted networks or behind a reverse proxy.
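On Linux, the official install script runs Ollama as a systemd service, so exporting OLLAMA_HOST in your own shell does not reach it; Ollama's FAQ recommends setting the variable on the service instead. The hypothetical `ollama_host_dropin` function below only prints the systemd drop-in content, so you can review it before installing anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a systemd drop-in that sets OLLAMA_HOST for the
# ollama service. Defaults to all interfaces on the standard port 11434.
# Usage: ollama_host_dropin [host:port]
ollama_host_dropin() {
  local bind="${1:-0.0.0.0:11434}"
  printf '[Service]\nEnvironment="OLLAMA_HOST=%s"\n' "$bind"
}

ollama_host_dropin
```

To install it, run `sudo mkdir -p /etc/systemd/system/ollama.service.d`, then `ollama_host_dropin | sudo tee /etc/systemd/system/ollama.service.d/override.conf`, then `sudo systemctl daemon-reload && sudo systemctl restart ollama`. Running `sudo systemctl edit ollama.service` edits the same override file interactively.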

Question 5

How do I run multiple models or serve multiple users with Ollama?

Accepted Answer

Two environment variables control this: `OLLAMA_MAX_LOADED_MODELS` (how many models stay in memory simultaneously; each needs its own VRAM) and `OLLAMA_NUM_PARALLEL` (how many requests one model serves concurrently; each parallel slot needs its own KV cache). For a single-GPU setup serving a few users, `OLLAMA_NUM_PARALLEL=4` with one loaded model is a reasonable starting point.
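To see why each slot costs memory, here is a rough back-of-the-envelope sketch. Ollama's FAQ describes parallel requests as multiplying the allocated context (for example, a 2K context with 4 parallel requests becomes an 8K context). The hypothetical `kv_cache_mib` function estimates only the KV cache, assuming an f16 cache (2 bytes per value) and ignoring model weights and compute buffers; the example dimensions are taken from Llama 3.1 8B's published configuration.

```shell
#!/usr/bin/env bash
# Rough KV-cache estimate in MiB. Per token, every layer stores one K and one V
# vector per KV head: 2 * layers * kv_heads * head_dim * bytes_per_value.
# Each parallel slot gets its own num_ctx tokens of cache.
# Usage: kv_cache_mib <layers> <kv_heads> <head_dim> <num_ctx> <num_parallel> [bytes_per_value]
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 parallel=$5 bytes=${6:-2}  # 2 bytes = f16
  local per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))
  echo $(( per_token * ctx * parallel / 1024 / 1024 ))
}

# Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128), 4K context, 4 parallel slots
kv_cache_mib 32 8 128 4096 4    # prints 2048
```

With a single slot the same model needs about 512 MiB of KV cache, so going to `OLLAMA_NUM_PARALLEL=4` adds roughly 1.5 GiB. Both variables are read by the server, e.g. `OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=1 ollama serve`.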

Question 6

Why is my Ollama model slow?

Accepted Answer

Most common causes, in order: **1)** The model does not fully fit in VRAM and some layers spilled to the CPU. Check with `ollama ps`: the PROCESSOR column shows the CPU/GPU split, and anything other than `100% GPU` means part of the model runs on the CPU. **2)** The context size is set very high, inflating the KV cache beyond VRAM. **3)** Another loaded model is competing for VRAM. **4)** Flash attention is off; set `OLLAMA_FLASH_ATTENTION=1` on the server. Use our "What LLM Can I Run?" tool to check what actually fits your GPU.
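Cause 1 can be checked automatically. This hypothetical `partially_offloaded` function filters `ollama ps` output and prints the names of loaded models whose PROCESSOR column is anything other than `100% GPU`. It assumes the table layout `ollama ps` prints (a header row, then one model per line with the name first) and may need adjusting if that format changes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ollama ps` output on stdin and prints the names
# of loaded models that are NOT running entirely on the GPU.
partially_offloaded() {
  # Skip the header row; NAME is the first column, and fully offloaded
  # models report "100% GPU" in the PROCESSOR column.
  awk 'NR > 1 && $0 !~ /100% GPU/ { print $1 }'
}

# Usage: ollama ps | partially_offloaded
```

An empty result means every loaded model fits in VRAM. Flash attention is likewise a server setting, so enable it where the server starts, e.g. `OLLAMA_FLASH_ATTENTION=1 ollama serve`.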

Question 7

How do I delete models and free up disk space?

Accepted Answer

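`ollama rm <model>` deletes a model, and `ollama list` shows everything you have downloaded, with sizes. By default, models live in ~/.ollama/models on macOS and Linux (or /usr/share/ollama/.ollama/models when Ollama runs as the Linux systemd service) and in C:\Users\<username>\.ollama\models on Windows; the OLLAMA_MODELS environment variable changes this location. Note that removing a model you created with `ollama create` does not remove the base model it was built FROM, so remove that separately if you no longer need it.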
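Before deleting anything, it can help to see how much space the model store takes. The `ollama_models_dir` function in this Linux/macOS sketch is a hypothetical heuristic: it prefers `OLLAMA_MODELS` if set (Ollama's variable for relocating the store), then the systemd service path `/usr/share/ollama/.ollama/models` if it exists, and otherwise falls back to `~/.ollama/models`.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic for locating Ollama's model store.
ollama_models_dir() {
  if [[ -n "${OLLAMA_MODELS:-}" ]]; then
    printf '%s\n' "$OLLAMA_MODELS"                   # explicit override
  elif [[ -d /usr/share/ollama/.ollama/models ]]; then
    printf '%s\n' /usr/share/ollama/.ollama/models   # Linux systemd service install
  else
    printf '%s\n' "$HOME/.ollama/models"             # per-user default
  fi
}

dir="$(ollama_models_dir)"
if [[ -d "$dir" ]]; then
  du -sh "$dir"   # total disk space used by downloaded models
else
  echo "No model directory found at $dir"
fi
```

Reading the service directory may require `sudo`. Rerunning the script after `ollama rm <model>` confirms how much space was freed.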

Ollama Command Builder

Build Ollama Commands Without Memorizing Syntax
