Ollama Command Builder
Build Ollama commands without memorizing syntax: run, pull, and create commands for any model, complete Modelfiles, and server environment configuration — with built-in VRAM checks.
Ollama Commands: The Complete Mental Model
Ollama has three layers of commands, and knowing which layer you are working in makes everything click:
Model management (like a package manager): ollama pull downloads, ollama list shows what you have, ollama rm deletes, ollama cp duplicates. Models are identified as name:tag where the tag encodes size and quantization (llama3.1:8b, gemma3:27b-q8_0).
Running models (interactive or API): ollama run <model> starts an interactive chat (and pulls the model if needed). ollama serve runs the API server that other apps connect to. ollama ps shows what is currently loaded in memory and whether it is on GPU or CPU.
Customization (Modelfiles): ollama create <name> -f Modelfile builds a custom variant — your own system prompt, parameters, or imported GGUF weights.
The session commands (/set parameter, /show info, /bye) work inside an ollama run session and are temporary; Modelfile settings are permanent.
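As a small illustration of the name:tag convention described above, here is a minimal Python sketch that splits a model reference into its parts. The helper name parse_model_ref and the ModelRef type are hypothetical, not part of Ollama, and the sketch assumes the reference contains no registry host with a port number. It relies on Ollama resolving a missing tag to latest.

```python
from typing import NamedTuple


class ModelRef(NamedTuple):
    name: str
    tag: str


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'name:tag' on the last colon (hypothetical helper).

    Assumes no 'host:port/' prefix; a missing tag falls back to 'latest',
    which is the tag Ollama uses when none is given.
    """
    name, sep, tag = ref.rpartition(":")
    if not sep:
        return ModelRef(ref, "latest")
    return ModelRef(name, tag)


for ref in ("llama3.1:8b", "gemma3:27b-q8_0", "mistral"):
    print(parse_model_ref(ref))
```

The tag is kept as one opaque string because the library does not use a single fixed scheme for encoding size and quantization.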
Getting Exact Quantizations: Library Tags vs Hugging Face
Ollama's library (ollama.com/library) hosts popular models with a limited set of quantization tags. The default tag (e.g. llama3.1:8b) is usually a 4-bit quantization, most often Q4_K_M; check the model's tag list on its library page to confirm. Other quantizations follow patterns such as 8b-instruct-q8_0, but not every model has every quant.
When you need a specific quantization the library does not have — or any model not in the library — pull directly from Hugging Face: ollama pull hf.co/{user}/{repo}:{QUANT}. Any public GGUF repository works, and quant tags like Q4_K_M, Q5_K_M, Q8_0 map directly to the GGUF filenames in the repo.
Rule of thumb: use library tags for the common case (default Q4), and hf.co pulls when you care about the exact quantization or want models outside the library.
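The hf.co pattern above can be sketched as a small string builder. This is an illustrative Python helper, not part of Ollama: the function name hf_pull_command is hypothetical, and it only assembles the command text without running it or checking that the repository or quantization exists.

```python
def hf_pull_command(user: str, repo: str, quant: str = "") -> str:
    """Build an 'ollama pull' command for a Hugging Face GGUF repo.

    Hypothetical helper: it only formats the string following the
    hf.co/{user}/{repo}:{QUANT} pattern and does not contact any server.
    """
    ref = f"hf.co/{user}/{repo}"
    if quant:
        # The quantization is appended as the tag after a colon.
        ref += f":{quant}"
    return f"ollama pull {ref}"


print(hf_pull_command("unsloth", "Qwen3-8B-GGUF", "Q4_K_M"))
# ollama pull hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_M
```

Leaving quant empty omits the tag, so Ollama picks whichever quantization it defaults to for that repository.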
Frequently Asked Questions
Common questions about the Ollama Command Builder
Ollama can pull GGUF models directly from Hugging Face: ollama run hf.co/{username}/{repository} — for example ollama run hf.co/unsloth/Qwen3-8B-GGUF. To pick a specific quantization, append it as a tag: hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_M. This works for any public GGUF repository and is the most reliable way to get exact quantizations. This tool generates these commands for you when you select a Hugging Face model.
There are two ways to change the context size: temporarily in a session with /set parameter num_ctx 32768, or permanently with a Modelfile containing FROM llama3.1:8b and PARAMETER num_ctx 32768, built via ollama create mymodel -f Modelfile. Ollama defaults to a small context (often 4K) regardless of what the model supports, which is the most common reason long prompts get truncated. A larger context also uses more VRAM, because the KV cache grows with it.
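To get a feel for how much VRAM a larger num_ctx costs, the standard KV cache estimate can be computed directly. The sketch below is a rough approximation under stated assumptions: an unquantized 16-bit cache, and the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128). The function name kv_cache_bytes is hypothetical, and real usage in Ollama will differ somewhat because of runtime overhead and any cache quantization.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Assumed architecture values for Llama 3.1 8B (not read from Ollama).
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 32768):
    gib = kv_cache_bytes(n_ctx, **LLAMA31_8B) / 2**30
    print(f"num_ctx={n_ctx}: ~{gib:.1f} GiB KV cache")
# num_ctx=4096: ~0.5 GiB KV cache
# num_ctx=32768: ~4.0 GiB KV cache
```

The cost is linear in num_ctx, so going from 4K to 32K multiplies the cache by eight on top of the model weights.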
A Modelfile is Ollama's recipe format for customizing models — like a Dockerfile for LLMs. You need one to: set a permanent system prompt, change default parameters (context size, temperature), or package a custom GGUF file. The format: FROM <base model>, PARAMETER <name> <value> lines, and SYSTEM "<your prompt>". Then ollama create <name> -f Modelfile builds it.
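The Modelfile format described above is simple enough to generate from a few inputs. The following Python sketch renders one as text; render_modelfile is a hypothetical helper name, and the parameter values and system prompt are placeholders chosen for illustration. It covers only the FROM, PARAMETER, and SYSTEM instructions mentioned here.

```python
def render_modelfile(base: str, parameters: dict, system: str = "") -> str:
    """Render Modelfile text with FROM, PARAMETER, and optional SYSTEM lines."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        # Triple quotes allow the system prompt to span multiple lines.
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.1:8b",
    {"num_ctx": 32768, "temperature": 0.2},
    system="You are a concise assistant.",
)
print(text, end="")
```

Saving the output as Modelfile and running ollama create mymodel -f Modelfile would then build the variant; mymodel is just an example name.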
To make the API reachable from other machines, set the OLLAMA_HOST environment variable to 0.0.0.0 before starting the server: OLLAMA_HOST=0.0.0.0:11434 ollama serve (Linux/macOS), or set it as a system environment variable on Windows. This exposes the API to your network without authentication, so only do it on trusted networks or behind a reverse proxy.
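Because binding to 0.0.0.0 removes the loopback-only protection, it can help to check an OLLAMA_HOST value before using it. The sketch below is a hypothetical helper (is_network_exposed is not an Ollama function) and assumes the value is a plain IPv4 address or hostname with an optional :port suffix; URL schemes and bracketed IPv6 forms are not handled.

```python
import ipaddress


def is_network_exposed(ollama_host: str) -> bool:
    """Return True if the bind address is reachable beyond this machine.

    Assumes 'host' or 'host:port' with an IPv4 address or a hostname.
    """
    host = ollama_host.rsplit(":", 1)[0] if ":" in ollama_host else ollama_host
    try:
        # 127.0.0.0/8 is loopback; 0.0.0.0 means "all interfaces".
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal: treat any name other than localhost as exposed.
        return host != "localhost"


for value in ("127.0.0.1:11434", "0.0.0.0:11434", "localhost"):
    print(value, "->", "exposed" if is_network_exposed(value) else "local only")
```

Treating unknown hostnames as exposed is a deliberately cautious default for this sketch.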
Two environment variables control concurrency: OLLAMA_MAX_LOADED_MODELS (how many models stay in memory simultaneously, each needing its own VRAM) and OLLAMA_NUM_PARALLEL (how many requests one model serves concurrently, each parallel slot needing its own KV cache). For a single-GPU setup serving a few users, OLLAMA_NUM_PARALLEL=4 with one loaded model is a reasonable starting point.
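The starting point suggested above can be turned into a launch command for Linux or macOS shells. This is a minimal Python sketch that only formats the command line; serve_command is a hypothetical helper name, and the values are the ones from the text rather than a tuned recommendation.

```python
import shlex


def serve_command(env: dict) -> str:
    """Format 'VAR=value ... ollama serve' for a POSIX shell (string only)."""
    # shlex.quote leaves simple values untouched and quotes anything unsafe.
    prefix = " ".join(f"{key}={shlex.quote(str(value))}" for key, value in env.items())
    return f"{prefix} ollama serve".strip()


print(serve_command({"OLLAMA_MAX_LOADED_MODELS": 1, "OLLAMA_NUM_PARALLEL": 4}))
# OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_NUM_PARALLEL=4 ollama serve
```

On Windows the same variables would be set as system environment variables instead of being prefixed to the command.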
The most common causes of slow generation, in order: 1) The model does not fully fit in VRAM and layers spilled to CPU; check with ollama ps, which shows the GPU/CPU split. 2) The context size is set very high, inflating the KV cache beyond VRAM. 3) Another model is also loaded, competing for VRAM. 4) Flash attention is off; set OLLAMA_FLASH_ATTENTION=1. Use our "What LLM Can I Run?" tool to check what actually fits your GPU.
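The first check, reading the GPU/CPU split from ollama ps, can be automated once the PROCESSOR column has been extracted. The sketch below assumes that column takes one of the forms "100% GPU", "100% CPU", or "48%/52% CPU/GPU"; that format is an assumption based on commonly seen output and may change between Ollama versions. The function name gpu_percent is hypothetical.

```python
import re


def gpu_percent(processor: str) -> int:
    """Return the share of the model on GPU from an 'ollama ps' PROCESSOR cell.

    Assumed formats: '100% GPU', '100% CPU', or '48%/52% CPU/GPU'.
    """
    cell = processor.strip()
    split = re.fullmatch(r"(\d+)%/(\d+)% CPU/GPU", cell)
    if split:
        # In the split form the second number is the GPU share.
        return int(split.group(2))
    single = re.fullmatch(r"(\d+)% (GPU|CPU)", cell)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        return pct if device == "GPU" else 100 - pct
    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


for cell in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    print(f"{cell!r}: {gpu_percent(cell)}% on GPU")
```

Any result below 100 means some layers are running on the CPU, which is the spill described in cause 1.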
ollama rm <model> deletes a model. ollama list shows everything you have downloaded with sizes. Models live in ~/.ollama/models (macOS/Linux) or C:\Users\… (Windows). Deleting a model built with ollama create does not remove the base model it was built FROM; remove that separately if you no longer need it.