Local AI Chat
Chat with an AI model that runs entirely in your browser — free, no signup, no API key. Pick a model like Llama 3.2 or Qwen 2.5, or load your own GGUF file. Conversations never leave your device and it works offline once the model is downloaded.
How AI Chat Can Run in a Browser
Until recently, chatting with an LLM meant either a cloud API (your text goes to a server) or installing software like Ollama. WebGPU changed that: browsers can now run compute on your graphics card directly, which is exactly what LLM inference needs.
This tool uses two engines. WebLLM compiles models to run on your GPU through WebGPU — this is the fast path, used for the recommended models and HuggingFace MLC repos. wllama is llama.cpp compiled to WebAssembly — slower because it runs on the CPU, but it works in any browser and can open standard .gguf model files.
When you pick a model, your browser downloads its weights from a public CDN, caches them, and loads them into GPU memory. That download is the only network traffic involved, and your messages are never part of it. From then on, every token the AI generates is computed on your hardware. The privacy is structural, not a policy promise: there is no server that could log your conversation even if we wanted to.
Choosing the Right Model for Your Hardware
Bigger models give better answers but need more GPU memory and run slower. A practical guide:
1B models (Llama 3.2 1B, Qwen 2.5 1.5B) — run on nearly anything with WebGPU, including integrated graphics and phones. Good for quick factual questions, simple drafts, and trying out local AI. Download: under 1GB.
3B models (Llama 3.2 3B, Qwen 2.5 3B, Phi 3.5 Mini) — need roughly 4GB of GPU memory. Noticeably smarter: better reasoning, longer coherent answers, fewer mistakes. The sweet spot for most laptops with a real GPU or Apple Silicon.
7-8B models (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) — need 6GB+ VRAM or a 16GB+ Apple Silicon Mac. These are genuinely useful assistants — the same class of model many people run with Ollama.
The tool detects your GPU and marks what fits, but the real test is trying one: if generation feels too slow (under ~5 tokens/second), step down a size. And if your hardware can handle more than the browser allows, you will get better results running models natively — see What LLM Can I Run for the full picture of your machine's capability.
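The sizing guide above can be sketched as a small lookup. This is a minimal illustration, not the tool's actual detection logic: the tier table, the function names, and treating the 1B tier as having no minimum are assumptions, the thresholds are the rough figures from the guide, and the Apple Silicon unified-memory case is left out for brevity.

```python
from typing import Optional

# Hypothetical tier table built from the guide's rough figures.
# Ordered from largest to smallest: (label, minimum GPU memory in GB).
MODEL_TIERS = [
    ("7-8B", 6.0),
    ("3B", 4.0),
    ("1B", 0.0),  # assumption: "runs on nearly anything" means no minimum
]

# Below this speed the guide suggests stepping down a size.
MIN_COMFORTABLE_TOK_PER_S = 5.0


def largest_tier_that_fits(gpu_memory_gb: float) -> str:
    """Return the largest tier whose memory threshold is met."""
    for label, min_gb in MODEL_TIERS:
        if gpu_memory_gb >= min_gb:
            return label
    return MODEL_TIERS[-1][0]


def step_down(label: str) -> str:
    """Return the next smaller tier, or the same tier if already smallest."""
    labels = [name for name, _ in MODEL_TIERS]
    index = labels.index(label)
    return labels[min(index + 1, len(labels) - 1)]


def recommend(gpu_memory_gb: float,
              measured_tok_per_s: Optional[float] = None) -> str:
    """Pick a tier by memory, then step down once if generation is too slow."""
    tier = largest_tier_that_fits(gpu_memory_gb)
    if (measured_tok_per_s is not None
            and measured_tok_per_s < MIN_COMFORTABLE_TOK_PER_S):
        tier = step_down(tier)
    return tier
```

For example, `recommend(8.0)` returns `"7-8B"`, while `recommend(8.0, measured_tok_per_s=3.0)` returns `"3B"` because the measured speed falls under the 5 tokens/second threshold.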
Frequently Asked Questions
Common questions about the Local AI Chat
Your chats are private. The AI model runs inside your browser on your own GPU through WebGPU (the same technology browser games use for graphics), or on your CPU through WebAssembly. Your messages are processed locally and never sent to us or any AI provider. Conversations are saved only in your browser's local storage, and you can verify the privacy claim yourself: once a model is downloaded, the chat keeps working with your internet disconnected.
Models come from three sources. Recommended: a curated set from Llama 3.2 1B (runs on almost anything) up to Llama 3.1 8B and Qwen 2.5 7B (these need a GPU with 6GB+ memory). HuggingFace: paste any repo with MLC-format weights. Your own file: load a .gguf model file (up to ~2GB) from your computer; it runs on the CPU via WebAssembly, which is slower but works in any browser. The tool detects your GPU and marks which models will fit.
The first time you use a model, your browser downloads its weights (roughly 700MB for a 1B model up to ~4GB for an 8B model) and compiles it for your GPU. That download is cached by your browser, so every later session starts in seconds — and the same cached model is shared by all the local-AI tools on this site (the summarizer, PII redactor, and phishing analyzer).
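The download sizes quoted above follow from simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. A rough sketch, assuming 4-bit quantized weights and nominal parameter counts (the function name and the default are illustrative; real downloads run somewhat larger because of quantization metadata and because "1B" models often carry more than a billion parameters):

```python
def estimate_download_gb(num_parameters: float,
                         bits_per_weight: float = 4.0) -> float:
    """Estimate weight size in GB (10^9 bytes) for a quantized model.

    Assumption: every weight is stored at `bits_per_weight` bits and
    file overhead is ignored, so this is a lower-bound estimate.
    """
    if num_parameters < 0 or bits_per_weight <= 0:
        raise ValueError("parameter count and bit width must be positive")
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal sizes: a 1B model and an 8B model at 4 bits per weight.
small_gb = estimate_download_gb(1e9)   # 0.5
large_gb = estimate_download_gb(8e9)   # 4.0
```

The 8B result lines up with the ~4GB figure; the 1B result comes in under the quoted 700MB for the reasons noted in the assumptions.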
The chat works offline after the first download. Model weights are cached in your browser, so once you have chatted with a model while online, you can load the page and keep chatting with no internet connection. This also makes it one of the few AI chat options that work on an air-gapped or restricted network (load the model first, then disconnect).
For the recommended models: a browser with WebGPU (Chrome, Edge, or Safari 18+) and a GPU. Small 1B models run on integrated graphics and Apple Silicon; the 7-8B models want a discrete GPU with 6GB+ VRAM or a Mac with 16GB+ unified memory. For .gguf files, any modern browser works — no GPU needed — but generation is slower since it runs on the CPU. Use our What LLM Can I Run tool to see exactly what your machine handles.
This tool differs from ChatGPT in two big ways. Privacy: ChatGPT sends every message to OpenAI's servers; this tool sends nothing anywhere. Capability: ChatGPT runs models with hundreds of billions of parameters, while browser models top out around 8 billion, so their answers are noticeably weaker. They are fine for quick questions, drafts, brainstorming, and learning, but not a replacement for frontier models on hard problems. You are trading some intelligence for complete privacy and zero cost.
Models already downloaded by Ollama or LM Studio cannot be loaded directly: both store models in their own internal formats and folders that web pages cannot read (a browser security restriction). Instead, download the .gguf file for the same model from HuggingFace and load it with the "Your .gguf file" option. Note the ~2GB browser limit, which covers Q4 quantizations of 1B-3B models. For bigger models, run them natively; our Ollama Command Builder gives you the exact command.
Model weights are cached per-website, not per-tool. When you download Llama 3.2 1B here, the Private AI Summarizer, PII Redactor, and Phishing Email Analyzer can all use it instantly without re-downloading. The "Downloaded" manager in the model picker shows everything cached and lets you delete models to free disk space.
Tokens per second (tok/s) is the speed at which the model generates text. A token is about three-quarters of a word, so 10 tok/s is roughly 7 to 8 words per second. 10+ tok/s feels smooth; below 5 tok/s feels slow. The number depends almost entirely on your GPU's memory bandwidth, which is why the same model is fast on an RTX 4090 and slow on integrated graphics. Our LLM Inference Speed Calculator predicts this number for any GPU and model combination.
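The link between memory bandwidth and speed can be sketched with a common rule of thumb: generating each token requires reading roughly all of the model's weights once, so bandwidth divided by model size gives a ceiling on tokens per second. This is an assumption-laden estimate, not how the calculator mentioned above necessarily works; the function names and the "usable" label for the 5 to 10 tok/s range are illustrative, and real speeds land below the ceiling.

```python
WORDS_PER_TOKEN = 0.75  # a token is about three-quarters of a word


def estimate_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound tokens/second if every token reads all weights once."""
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


def words_per_s(tok_per_s: float) -> float:
    """Convert tokens/second to approximate words/second."""
    return tok_per_s * WORDS_PER_TOKEN


def feel(tok_per_s: float) -> str:
    """Classify speed using the thresholds quoted in the text."""
    if tok_per_s >= 10:
        return "smooth"
    if tok_per_s < 5:
        return "slow"
    return "usable"  # hypothetical label for the range in between


# Illustrative numbers: 100 GB/s of bandwidth and a 4 GB model.
ceiling = estimate_tok_per_s(bandwidth_gb_s=100.0, model_size_gb=4.0)  # 25.0
```

Because the ceiling scales linearly with bandwidth, halving the model size (for example by stepping down a tier) roughly doubles the achievable speed on the same hardware.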
Answer quality is limited mainly by model size. The models that fit in a browser have 1-8 billion parameters; frontier cloud models have hundreds of billions plus extensive fine-tuning. Small models make more factual mistakes, follow complex instructions less reliably, and write less polished prose. They are still genuinely useful for everyday questions, summaries, code snippets, and drafts, and they are improving fast. For the best local quality, use the biggest model your hardware supports.