TLDR: WebLLM + WebGPU now runs real LLMs in the browser — offline, zero API cost, data never leaves the device. The 2 GB download and GPU dependency make it a hybrid-fallback problem in practice, not a cloud replacement. Strong fit for privacy-sensitive features and high-frequency low-complexity tasks.
What if your AI feature worked offline, had zero API cost, and never sent user data to a server? That's what on-device AI in the browser promises — and in 2025, it's no longer a research curiosity. Teams are shipping it.
I've been experimenting with this for a while and the results are more useful than I expected for specific use cases — and clearly wrong for others. Here's the honest picture.
Why this matters for frontend engineers
The standard LLM architecture: user → your server → API provider → response. On-device flips this: the model runs in the browser, on the user's GPU.
What you gain:
- Zero marginal cost per inference — no API fees per user interaction
- Works offline after initial model download
- Data never leaves the device — privacy by architecture, not policy
- No latency from network round-trips
- No rate limits, no cold starts
What you give up:
- Large initial download (roughly 0.7–4.5 GB for the models in the table below, and every user has to fetch it before the first response)
- Completely dependent on user hardware — no dedicated GPU means slow inference
- Smaller models = less capable (no Sonnet-quality reasoning on a phone)
- WebGPU support is still incomplete across browsers
- Smaller context windows than cloud models
The use cases where on-device wins are specific: offline PWAs, privacy-sensitive inputs, high-frequency low-complexity tasks where you'd otherwise be paying per request. The use cases where cloud is clearly better: anything requiring real reasoning, enterprise on older hardware, or when you need consistent quality across all users.
WebGPU: The foundation everything runs on
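Everything in browser AI runs on WebGPU, the web standard for GPU compute. It is enabled by default in Chrome and Edge 113+ on desktop. Safari enables it by default only from Safari 26; earlier versions expose it behind a feature flag.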
async function checkWebGPUSupport() {
  if (!navigator.gpu) {
    return { supported: false, reason: "WebGPU not available in this browser" };
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: "No GPU adapter found" };
  }
  // Current browsers expose adapter.info. requestAdapterInfo() is the older,
  // since-removed form, kept here only as a fallback for older builds.
  const info = adapter.info ?? (await adapter.requestAdapterInfo?.());
  return {
    supported: true,
    vendor: info?.vendor,
    architecture: info?.architecture,
  };
}
Firefox support is still partial: WebGPU is enabled by default only on some platforms (Windows first, starting with Firefox 141). Always check before attempting to load a model, and always have a cloud API fallback ready.
WebLLM: The most production-ready option
WebLLM from MLC AI compiles LLMs to WebGPU using Machine Learning Compilation. It's the most mature option for running LLMs in the browser today.
npm install @mlc-ai/web-llm
Basic usage
import * as webllm from "@mlc-ai/web-llm";

// The progress callback is part of the engine config, not an option of reload().
const engine = new webllm.MLCEngine({
  initProgressCallback: (progress) => {
    updateProgressBar(progress.progress); // fraction between 0 and 1
  },
});

// First run downloads the weights (about 2 GB for this model), so show a clear progress indicator
await engine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");

// OpenAI-compatible API
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain React hooks in one paragraph." }],
  temperature: 0.7,
  max_tokens: 512,
});
console.log(response.choices[0].message.content);
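The progress callback can fire very often during a multi-gigabyte download, so it can help to forward only meaningful changes to the UI. Here is a minimal sketch of that idea. `makeProgressReporter` and the `ProgressReport` interface are names I made up for illustration; the only thing taken from the snippet above is that the report carries a `progress` fraction, and the optional `text` field is an assumption.

```typescript
// Local stand-in for the object passed to the progress callback.
// `progress` is the 0..1 fraction used above; `text` is an assumed optional label.
interface ProgressReport {
  progress: number;
  text?: string;
}

// Returns a callback that forwards whole-percent changes only, so the UI is
// not re-rendered for every tiny progress tick.
function makeProgressReporter(
  render: (percent: number, label: string) => void,
): (report: ProgressReport) => void {
  let lastPercent = -1;
  return (report) => {
    // Clamp defensively in case a value falls slightly outside 0..1.
    const clamped = Math.min(1, Math.max(0, report.progress));
    const percent = Math.floor(clamped * 100);
    if (percent === lastPercent) return; // nothing visible changed
    lastPercent = percent;
    render(percent, report.text ?? `Downloading model: ${percent}%`);
  };
}
```

The returned function has the same one-argument shape as the callback above, so it could be passed as `initProgressCallback` with `render` updating your progress bar.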
Streaming
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: userMessage }],
  stream: true,
});

let fullText = "";
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  fullText += delta;
  updateUI(fullText);
}
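Updating the UI on every token can be more rendering work than you need. The sketch below accumulates deltas the same way as the loop above but only calls the UI callback every few chunks. `collectStream`, `fakeStream`, and the `everyN` parameter are hypothetical names of mine, and the `StreamChunk` type only mirrors the fields the loop above reads; it is not WebLLM's own type.

```typescript
// Minimal chunk shape, mirroring the fields the streaming loop reads.
interface StreamChunk {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Accumulates deltas and calls onUpdate once every `everyN` chunks,
// then once more at the end so the final text is always rendered.
async function collectStream(
  stream: AsyncIterable<StreamChunk>,
  onUpdate: (text: string) => void,
  everyN = 4,
): Promise<string> {
  let fullText = "";
  let sinceUpdate = 0;
  for await (const chunk of stream) {
    fullText += chunk.choices[0]?.delta?.content ?? "";
    sinceUpdate += 1;
    if (sinceUpdate >= everyN) {
      onUpdate(fullText);
      sinceUpdate = 0;
    }
  }
  onUpdate(fullText);
  return fullText;
}

// Fake stream for trying the helper without loading a model.
async function* fakeStream(parts: string[]): AsyncGenerator<StreamChunk> {
  for (const content of parts) {
    yield { choices: [{ delta: { content } }] };
  }
}
```

Because the helper accepts any async iterable with this chunk shape, it can be exercised with `fakeStream(["Hel", "lo"])` in a unit test before wiring it to a real engine.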
Model choices
| Model | Download size | Native context | Use case |
|---|---|---|---|
| Llama-3.2-1B-Instruct-q4f32_1 | ~700 MB | 128k | Ultra-fast, basic tasks |
| Llama-3.2-3B-Instruct-q4f32_1 | ~2 GB | 128k | Good balance |
| Llama-3.1-8B-Instruct-q4f32_1 | ~4.5 GB | 128k | Near-GPT-3.5 quality |
| Phi-3.5-mini-instruct-q4f16_1 | ~2.2 GB | 128k | Microsoft's compact model |
| Gemma-2-2b-it-q4f32_1 | ~1.5 GB | 8k | Google's lightweight model |
Note: the context column is what each model supports natively. WebLLM's prebuilt configurations default to a much smaller window (4,096 tokens, as far as I can tell), so budget your prompts accordingly.
Starting recommendation: Llama-3.2-3B — 2 GB download, fast on mid-range GPUs, surprisingly capable for autocomplete, summarization, and simple Q&A. Don't start with the 8B model until you've validated that users' hardware handles it.
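If you want to encode that recommendation rather than hard-code one model, a small selection helper is one option. This is a sketch under assumptions: `pickModel` and the budget parameter are hypothetical, the sizes are the approximate figures from the table, and the `-MLC` suffix on the 1B and 8B IDs simply follows the naming pattern of the 3B ID used in this post. How you derive the budget is up to you (for example from `navigator.storage.estimate()` or a user setting), and download size says nothing about whether the GPU is fast enough.

```typescript
interface ModelOption {
  id: string;
  downloadMB: number; // approximate download size from the table
}

// The Llama options from the table, largest first.
const LLAMA_OPTIONS: ModelOption[] = [
  { id: "Llama-3.1-8B-Instruct-q4f32_1-MLC", downloadMB: 4500 },
  { id: "Llama-3.2-3B-Instruct-q4f32_1-MLC", downloadMB: 2000 },
  { id: "Llama-3.2-1B-Instruct-q4f32_1-MLC", downloadMB: 700 },
];

// Returns the largest model whose download fits the budget, or null if none does.
function pickModel(
  budgetMB: number,
  options: ModelOption[] = LLAMA_OPTIONS,
): ModelOption | null {
  const sorted = [...options].sort((a, b) => b.downloadMB - a.downloadMB);
  return sorted.find((m) => m.downloadMB <= budgetMB) ?? null;
}
```

A `null` result is the signal to skip on-device entirely and use the cloud path. Passing a conservative budget keeps you on the 3B model until you have validated the 8B one on real user hardware.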
Chrome's built-in AI APIs
Google is embedding AI directly into Chrome via the Chrome AI APIs (Gemini Nano). Your site ships no model of its own: Chrome manages the model and shares it across sites, although the browser may still have to download it the first time a built-in API is used.
// Experimental API: the entry point and option names have changed between
// Chrome releases, so feature-detect and treat this shape as a snapshot.
if ("ai" in window && "languageModel" in window.ai) {
  const session = await window.ai.languageModel.create({
    systemPrompt: "You are a helpful writing assistant.",
  });
  const stream = session.promptStreaming("Improve this sentence: " + userText);
  for await (const chunk of stream) {
    outputElement.textContent = chunk;
  }
  session.destroy();
}
Current limitations as of mid-2025: requires Chrome Canary or Chrome 127+ with flags, Gemini Nano is small (great for autocomplete and summarization, not for complex reasoning), no user-controlled model selection. Not production-ready today, but worth testing early so you're ready when it stabilizes.
The hybrid architecture pattern
Pure on-device doesn't work for everyone's hardware. The pattern I'd ship is hybrid: try on-device first, fall back to cloud if the device can't handle it.
import * as webllm from "@mlc-ai/web-llm";

type Message = webllm.ChatCompletionMessageParam;

// Placeholder for your existing server-side completion call.
declare function callCloudAPI(messages: Message[]): Promise<string>;

class AIProvider {
  private webLLMEngine?: webllm.MLCEngine;
  private onDeviceAvailable = false;

  async initialize() {
    const gpuCheck = await checkWebGPUSupport();
    if (!gpuCheck.supported) return;
    try {
      this.webLLMEngine = new webllm.MLCEngine();
      await this.webLLMEngine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");
      this.onDeviceAvailable = true;
    } catch (err) {
      // On-device load failed: complete() will use the cloud path instead.
      console.warn("On-device model failed to load", err);
    }
  }

  async complete(messages: Message[]): Promise<string> {
    if (this.onDeviceAvailable && this.webLLMEngine) {
      const result = await this.webLLMEngine.chat.completions.create({ messages });
      return result.choices[0].message.content ?? "";
    }
    return callCloudAPI(messages);
  }
}
Users with capable hardware get free, private, offline inference. Everyone else gets a consistent cloud experience. Neither group needs to know which path they're on.
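The class above only falls back when the model fails to load. On-device inference can also fail or stall at request time, so a per-request fallback is worth considering. Below is a minimal sketch of that idea, not a drop-in part of the class: `Completer`, `withTimeout`, and `withFallback` are names I made up, the 15-second default is an arbitrary assumption, and the completers take a plain string prompt to keep the example self-contained.

```typescript
type Completer = (prompt: string) => Promise<string>;

// Rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Tries the primary completer; on error or timeout, uses the fallback instead.
function withFallback(
  primary: Completer,
  fallback: Completer,
  timeoutMs = 15000,
): Completer {
  return async (prompt) => {
    try {
      return await withTimeout(primary(prompt), timeoutMs);
    } catch {
      return fallback(prompt);
    }
  };
}
```

Usage would look like `const complete = withFallback(onDeviceComplete, cloudComplete)`, where both arguments are your own functions. Note that the timeout only stops waiting for the primary path; it does not cancel work already running there.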
Angular + Web Workers: Keep the main thread free
Running inference on the main thread blocks the UI completely. You need a Web Worker — no exceptions.
// ai.worker.ts
import * as webllm from "@mlc-ai/web-llm";

const engine = new webllm.MLCEngine();

self.onmessage = async (event) => {
  const { type, payload } = event.data;

  if (type === "LOAD") {
    // The progress callback is registered on the engine, not passed to reload().
    engine.setInitProgressCallback((p) =>
      self.postMessage({ type: "PROGRESS", progress: p.progress }),
    );
    await engine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");
    self.postMessage({ type: "READY" });
  }

  if (type === "INFER") {
    const stream = await engine.chat.completions.create({
      messages: payload.messages,
      stream: true,
    });
    // Forward each non-empty delta to the main thread as it arrives.
    for await (const chunk of stream) {
      const text = chunk.choices[0]?.delta?.content ?? "";
      if (text) self.postMessage({ type: "CHUNK", text });
    }
    self.postMessage({ type: "DONE" });
  }
};
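
// Angular service (a separate file from the worker)
import { Injectable, signal } from "@angular/core";

@Injectable({ providedIn: "root" })
export class OnDeviceAIService {
  // The worker uses ES imports, so it has to be created as a module worker.
  private worker = new Worker(new URL("./ai.worker.ts", import.meta.url), {
    type: "module",
  });
  readonly status = signal<"idle" | "loading" | "ready" | "inferring">("idle");
  readonly progress = signal(0);

  constructor() {
    // Only the load-related messages are handled here; CHUNK and DONE
    // handling for inference is omitted from this snippet.
    this.worker.onmessage = ({ data }) => {
      if (data.type === "PROGRESS") this.progress.set(data.progress);
      if (data.type === "READY") this.status.set("ready");
    };
  }

  load() {
    this.status.set("loading");
    this.worker.postMessage({ type: "LOAD" });
  }
}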
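The worker posts four message types, but the service above reacts to only two of them. One way to keep the protocol honest is to type the messages as a discriminated union and fold them into state with a pure function. This is a sketch, not part of the service: `WorkerMessage`, `AIState`, and `reduce` are hypothetical names, and the message shapes are copied from the worker code above.

```typescript
// Messages the worker posts back to the main thread.
type WorkerMessage =
  | { type: "PROGRESS"; progress: number }
  | { type: "READY" }
  | { type: "CHUNK"; text: string }
  | { type: "DONE" };

interface AIState {
  status: "idle" | "loading" | "ready" | "inferring";
  progress: number;
  output: string;
}

const initialState: AIState = { status: "idle", progress: 0, output: "" };

// Pure reducer: given the current state and one worker message,
// returns the next state without mutating the input.
function reduce(state: AIState, msg: WorkerMessage): AIState {
  switch (msg.type) {
    case "PROGRESS":
      return { ...state, status: "loading", progress: msg.progress };
    case "READY":
      return { ...state, status: "ready", progress: 1 };
    case "CHUNK":
      return { ...state, status: "inferring", output: state.output + msg.text };
    case "DONE":
      return { ...state, status: "ready" };
  }
}
```

In the service, the `onmessage` handler could call `reduce` and write the result into signals. A real implementation would also clear `output` when a new INFER request is sent, which this sketch leaves out.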
When to actually use on-device
| Use case | On-device? | Why |
|---|---|---|
| Offline-first PWA | Yes | No network needed |
| Privacy-sensitive input (journals, health) | Yes | Data never leaves device |
| High-frequency autocomplete | Yes | Sub-100ms possible, no API cost |
| Complex reasoning / long documents | No | Small models struggle here |
| Consistent quality across all users | No | Varies by hardware |
| Enterprise / older devices | No | Insufficient GPU |
On-device AI isn't replacing cloud LLMs. It's filling the gap where cloud is overkill — offline, private, high-frequency, low-complexity tasks. The 2 GB download is the biggest UX barrier right now. As models get smaller and better (Llama 4 Scout is already impressive at its activation size), and as Chrome's built-in AI stabilizes, the tradeoffs will continue to shift. Worth understanding now so you're not learning from scratch when it becomes the obvious choice.