Ollama Command Builder

Build Ollama commands without memorizing syntax: run, pull, and create commands for any model, complete Modelfiles, and server environment configuration — with built-in VRAM checks.

100% Private - Runs Entirely in Your Browser
No data is sent to any server. All processing happens locally on your device.

Ollama Commands: The Complete Mental Model

Ollama has three layers of commands, and knowing which layer you are working in makes everything click:

Model management (like a package manager): ollama pull downloads, ollama list shows what you have, ollama rm deletes, ollama cp duplicates. Models are identified as name:tag, where the tag encodes the size and, optionally, the variant and quantization (llama3.1:8b, llama3.1:8b-instruct-q8_0).

Running models (interactive or API): ollama run <model> starts an interactive chat (and pulls the model if needed). ollama serve runs the API server that other apps connect to. ollama ps shows what is currently loaded in memory and whether it is on GPU or CPU.

Customization (Modelfiles): ollama create <name> -f Modelfile builds a custom variant — your own system prompt, parameters, or imported GGUF weights.

The session commands (/set parameter, /show info, /bye) work inside an ollama run session and are temporary; Modelfile settings are permanent.
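The three layers can be walked through in one short script. This is a minimal sketch, assuming llama3.1:8b as the model and my-copy as a hypothetical name for the duplicate; the commands run only when ollama is found on the PATH.

```shell
# Model reference in name:tag form.
NAME="llama3.1"
TAG="8b"
MODEL="${NAME}:${TAG}"

if command -v ollama >/dev/null 2>&1; then
    # Layer 1: model management.
    ollama pull "$MODEL"        # download
    ollama list                 # show what is installed
    ollama cp "$MODEL" my-copy  # duplicate under a hypothetical name
    ollama rm my-copy           # delete the duplicate again

    # Layer 2: running. Passing a prompt as an argument gives a one-shot
    # answer instead of an interactive chat.
    ollama run "$MODEL" "Say hello in five words."
    ollama ps                   # what is loaded, and on GPU or CPU

    # Layer 3 (ollama create) needs a Modelfile and is not shown here.
else
    echo "ollama not found on PATH; skipping ${MODEL}"
fi
```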

Getting Exact Quantizations: Library Tags vs Hugging Face

Ollama's library (ollama.com/library) hosts popular models with a limited set of quantization tags. The default tag (e.g. llama3.1:8b) is usually a 4-bit quantization, typically Q4_K_M. Other quantizations follow patterns like 8b-instruct-q8_0, but not every model has every quant.

When you need a specific quantization the library does not have — or any model not in the library — pull directly from Hugging Face: ollama pull hf.co/{user}/{repo}:{QUANT}. Any public GGUF repository works, and quant tags like Q4_K_M, Q5_K_M, Q8_0 map directly to the GGUF filenames in the repo.

Rule of thumb: use library tags for the common case (default Q4), and hf.co pulls when you care about the exact quantization or want models outside the library.
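The hf.co reference is plain string assembly, which the sketch below illustrates. The repository is the unsloth/Qwen3-8B-GGUF example used elsewhere on this page; substitute any public GGUF repository and a quant that actually exists in it.

```shell
# Pieces of the reference: hf.co/{user}/{repo}:{QUANT}
HF_USER="unsloth"
HF_REPO="Qwen3-8B-GGUF"
QUANT="Q4_K_M"

HF_REF="hf.co/${HF_USER}/${HF_REPO}:${QUANT}"

# Print the command rather than running it, so it can be reviewed first.
echo "ollama pull ${HF_REF}"
```

Printing the command first makes it easy to confirm the quant tag against the filenames in the repository before starting a multi-gigabyte download.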

Frequently Asked Questions

Common questions about the Ollama Command Builder

Ollama can pull GGUF models directly from Hugging Face: ollama run hf.co/{username}/{repository} — for example ollama run hf.co/unsloth/Qwen3-8B-GGUF. To pick a specific quantization, append it as a tag: hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_M. This works for any public GGUF repository and is the most reliable way to get exact quantizations. This tool generates these commands for you when you select a Hugging Face model.

There are two ways to change the context size: temporarily, inside a session, with /set parameter num_ctx 32768; or permanently, with a Modelfile containing FROM llama3.1:8b and PARAMETER num_ctx 32768, built via ollama create mymodel -f Modelfile. Ollama defaults to a small context (often 4K) regardless of what the model supports, which is the most common reason long prompts get truncated. Remember that a larger context uses more VRAM, because the KV cache grows with it.
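To get a feel for how much VRAM a larger context costs, the KV cache can be estimated with simple arithmetic. The sketch below assumes the Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) and an unquantized f16 cache. These numbers are assumptions to verify against your model's configuration, and the actual allocation in Ollama may differ.

```shell
# Assumed architecture values for Llama 3.1 8B; check your model's config.
LAYERS=32       # transformer layers
KV_HEADS=8      # key/value heads (grouped-query attention)
HEAD_DIM=128    # dimension per head
BYTES=2         # bytes per cache entry at f16
NUM_CTX=32768   # context length being considered

# Keys and values are both cached, hence the leading factor of 2.
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
KV_MIB=$((PER_TOKEN * NUM_CTX / 1024 / 1024))

echo "KV cache at num_ctx=${NUM_CTX}: ${KV_MIB} MiB"
```

Under these assumptions each token costs 128 KiB of cache, so a 32768-token context needs about 4 GiB on top of the model weights.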

A Modelfile is Ollama's recipe format for customizing models — like a Dockerfile for LLMs. You need one to: set a permanent system prompt, change default parameters (context size, temperature), or package a custom GGUF file. The format: FROM <base model>, PARAMETER <name> <value> lines, and SYSTEM "<your prompt>". Then ollama create <name> -f Modelfile builds it.
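As an illustration of the format, the sketch below writes a small Modelfile and builds it. The temperature value, the system prompt, and the name mymodel are illustrative assumptions, and the build step runs only when ollama is found on the PATH.

```shell
# Write the Modelfile into the current directory.
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
SYSTEM "You are a concise assistant for code review."
EOF

# Build the custom variant if ollama is available.
if command -v ollama >/dev/null 2>&1; then
    ollama create mymodel -f Modelfile
else
    echo "ollama not found on PATH; Modelfile written but not built"
fi
```

Once it is built, ollama run mymodel starts a session with these settings already applied.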

To make Ollama reachable from other machines, set the OLLAMA_HOST environment variable to 0.0.0.0 before starting the server: OLLAMA_HOST=0.0.0.0:11434 ollama serve (Linux/macOS), or set it as a system environment variable on Windows and restart Ollama. Be aware that this exposes the API to your network without authentication, so only do it on trusted networks or behind a reverse proxy.
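A quick way to confirm that the server is reachable is to query it from a second machine. The sketch below only defines two helper functions and runs nothing on its own. SERVER_IP is a hypothetical placeholder for the server's LAN address, and start_server and list_remote_models are names invented for this example. GET /api/tags is the Ollama API endpoint that lists installed models.

```shell
# Hypothetical LAN address of the machine running Ollama.
SERVER_IP="192.168.1.50"
API_URL="http://${SERVER_IP}:11434"

# On the server: bind to all interfaces instead of localhost only.
start_server() {
    OLLAMA_HOST=0.0.0.0:11434 ollama serve
}

# On a client: list the models the remote server has installed.
list_remote_models() {
    curl -s "${API_URL}/api/tags"
}

echo "Remote API base URL: ${API_URL}"
```

If list_remote_models returns JSON from the client, the bind address and any firewall rules are set correctly.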

Two environment variables control concurrency: OLLAMA_MAX_LOADED_MODELS (how many models stay in memory simultaneously; each needs its own VRAM) and OLLAMA_NUM_PARALLEL (how many requests one model serves concurrently; each parallel slot needs its own KV cache). For a single-GPU setup serving a few users, OLLAMA_NUM_PARALLEL=4 with one loaded model is a reasonable starting point.
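The starting point above can be written down as follows, together with a rough view of what the parallel slots cost. PER_SLOT_KV_MIB is a hypothetical per-slot figure; replace it with an estimate for your own model and context size. The variables only take effect when they are set in the environment of the ollama serve process.

```shell
# One resident model, four parallel request slots.
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=4

# Each slot needs its own KV cache, so cache memory scales with slot count.
PER_SLOT_KV_MIB=512   # hypothetical per-slot KV cache size in MiB
TOTAL_KV_MIB=$((PER_SLOT_KV_MIB * OLLAMA_NUM_PARALLEL))

echo "KV cache budget: ${TOTAL_KV_MIB} MiB across ${OLLAMA_NUM_PARALLEL} slots"
```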

When generation is slow, the most common causes are, in order: 1) The model does not fully fit in VRAM and layers spilled to CPU; check with ollama ps (it shows the GPU/CPU split). 2) Context size is set very high, inflating the KV cache beyond VRAM. 3) Another model is also loaded, competing for VRAM. 4) Flash attention is off; set OLLAMA_FLASH_ATTENTION=1 in the server's environment. Use our "What LLM Can I Run?" tool to check what actually fits your GPU.

ollama rm <model> deletes a model. ollama list shows everything you have downloaded, with sizes. Models live in ~/.ollama/models (macOS/Linux) or C:\Users\<username>\.ollama\models (Windows). Note that removing a model you created with ollama create does not remove the base model it was built FROM; remove that separately if you no longer need it.
