
Next.js + WebGPU: Build an Offline-First AI Chatbot That Runs 100% in the Browser (No Backend Needed)

Making offline AI work in your next web application


Overview

The future of software engineering is AI. But the future of AI is not on the server - it's in the browser, running locally and offline on the user's device.

But is it really possible as of 2025? Short answer - yes.

With WebGPU, modern browsers can run AI models locally on the user’s device.
Combine that with a frontend framework such as Next.js, and you can build a fully functional offline AI chatbot that works:

  • without a backend,

  • without API keys,

  • without sending data to the cloud, and

  • even in airplane mode.

This is Offline-First AI, and it’s one of the hottest topics in web development right now.

In this tutorial, you’ll learn how to build an LLM-powered offline chatbot in Next.js using WebGPU + Transformers.js.

Why Offline-First AI Is Exploding in 2025

Before diving into the tutorial, let's look at why offline-first AI is taking off:

  • No backend, no GPU servers - LLMs are expensive to host. Browser LLMs = zero infra cost.

  • Privacy-friendly - Messages never leave the device. Ideal for enterprise, health, finance.

  • Fast & responsive - Running on the user's GPU via WebGPU means low-latency inference with no network round trip.

  • Works even with poor internet - Download once → use forever.

Architecture of an Offline WebGPU Chatbot in Next.js

Next.js (Client Components Only)
│
├── Transformers.js (WebGPU backend)
│       └── TinyLlama / Phi / DistilGPT2 model
│
├── Model Hosted Locally
│       └── /public/models/ (ONNX weights)
│
└── No API calls / No backend

Everything runs in-browser.

Let the code begin -

Step 1 - Create a New Next.js Project

npx create-next-app offline-chatbot
cd offline-chatbot

Step 2 - Install Transformers.js (WebGPU Enabled)

npm install @huggingface/transformers

WebGPU support shipped in Transformers.js v3, which is published as @huggingface/transformers (the older v2 package, @xenova/transformers, runs on WASM only). You opt in by passing device: "webgpu" when creating a pipeline. WebGPU is available today in Chrome and Edge, with Firefox and Safari support rolling out.
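Since not every browser exposes WebGPU yet, it's worth deciding the device at runtime instead of hard-coding it. A minimal sketch, assuming Transformers.js v3's `device` option; `pickDevice` is a helper name introduced here, not part of the library:

```typescript
// Hypothetical helper: choose a Transformers.js execution device.
// "webgpu" when the browser exposes navigator.gpu, otherwise fall back to "wasm".
function pickDevice(hasWebGPU: boolean): "webgpu" | "wasm" {
  return hasWebGPU ? "webgpu" : "wasm";
}

// Usage in a client component (browser only):
// const device = pickDevice(typeof navigator !== "undefined" && !!navigator.gpu);
// const generator = await pipeline("text-generation", modelId, { device });
```

Keeping the check in one place makes it easy to tighten later (for example, by also requesting a GPU adapter).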

Step 3 - Download a Small LLM (TinyLlama or Phi)

You need a small model for fast browser inference.

Recommended (ONNX builds on the Hugging Face Hub - verify the exact repo names, as they change):

  • Xenova/TinyLlama-1.1B-Chat-v1.0

  • Xenova/phi-1_5

  • Xenova/distilgpt2 (lightest)

For this tutorial, we’ll use TinyLlama.

The model is auto-downloaded and cached by Transformers.js, so no backend required.
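The first download can be hundreds of megabytes, so you may want to warm the cache when the page mounts and show progress. A sketch using the library's `progress_callback` option; `formatProgress` is a helper name introduced here for the UI label:

```typescript
// Hypothetical helper: turn a Transformers.js progress event into a UI label.
// Events during download look roughly like { status: "progress", file, progress }.
function formatProgress(p: { status: string; file?: string; progress?: number }): string {
  if (p.status === "progress" && typeof p.progress === "number") {
    return `Downloading ${p.file ?? "model"}: ${Math.round(p.progress)}%`;
  }
  return p.status;
}

// Usage (assumption: v3 pipeline options):
// const generator = await pipeline("text-generation", "Xenova/TinyLlama-1.1B-Chat-v1.0", {
//   progress_callback: (p) => setStatus(formatProgress(p)),
// });
```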

Step 4 - Build the Offline Chatbot Component

Create the chatbot page:

app/chat/page.tsx

"use client";

import { useState } from "react";
import { pipeline } from "@huggingface/transformers";

// Create the pipeline once per page load and reuse it across sends.
// The model files themselves are cached by the browser after the first download.
let generatorPromise: Promise<any> | null = null;
function getGenerator() {
  generatorPromise ??= pipeline(
    "text-generation",
    "Xenova/TinyLlama-1.1B-Chat-v1.0",
    { device: "webgpu" }
  );
  return generatorPromise;
}

export default function Chatbot() {
  const [messages, setMessages] = useState([
    { from: "bot", text: "Hello! I'm your offline AI assistant 🤖" },
  ]);
  const [input, setInput] = useState("");
  const [loading, setLoading] = useState(false);

  async function handleSend() {
    if (!input.trim()) return;

    const userMessage = input;
    setMessages((m) => [...m, { from: "user", text: userMessage }]);
    setInput("");
    setLoading(true);

    // Loads on first use, reused afterward
    const generator = await getGenerator();

    const output = await generator(userMessage, {
      max_new_tokens: 60,
      temperature: 0.7,
    });

    setMessages((m) => [
      ...m,
      {
        from: "bot",
        // Strip the echoed prompt so only the completion is shown
        text: output[0].generated_text.replace(userMessage, "").trim(),
      },
    ]);

    setLoading(false);
  }

  return (
    <div className="p-8 max-w-xl mx-auto">
      <h1 className="text-3xl font-bold mb-4">
        Offline AI Chatbot (Next.js + WebGPU)
      </h1>

      <div className="border p-4 rounded-md mb-4 h-96 overflow-y-auto">
        {messages.map((msg, i) => (
          <div
            key={i}
            className={`my-2 p-2 rounded-md ${
              msg.from === "user" ? "bg-blue-200 ml-auto w-fit" : "bg-gray-200"
            }`}
          >
            {msg.text}
          </div>
        ))}
        {loading && <div className="italic">Thinking...</div>}
      </div>

      <div className="flex gap-2">
        <input
          className="border p-2 flex-1 rounded-md"
          placeholder="Type a message..."
          value={input}
          onChange={(e) => setInput(e.target.value)}
        />
        <button
          onClick={handleSend}
          className="bg-black text-white px-4 py-2 rounded-md"
        >
          Send
        </button>
      </div>
    </div>
  );
}

This component:

  • loads TinyLlama locally

  • performs inference using WebGPU

  • stores chat history in component state

  • produces responses offline

Step 5 - Confirm WebGPU Acceleration Is Active

Create a utility:

app/components/WebGPUStatus.tsx

"use client";

export default function WebGPUStatus() {
  const supported = typeof navigator !== "undefined" && !!navigator.gpu;

  return (
    <p className="text-sm text-gray-700 my-2">
      WebGPU: {supported ? "Supported ✓" : "Not Supported ✗"}
    </p>
  );
}

Add to the page:

<WebGPUStatus />

If WebGPU isn't available, Transformers.js can still run the model on its WASM backend (slower). Note that hard-coding device: "webgpu" may fail outright on unsupported browsers, so feature-detect navigator.gpu and fall back to device: "wasm".
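`navigator.gpu` existing doesn't guarantee a usable GPU (adapters can be blocklisted), so a stricter check requests one. A sketch; the `gpu` parameter is injected so the helper stays testable outside the browser:

```typescript
// Hypothetical helper: confirm WebGPU is actually usable, not just present.
async function hasUsableWebGPU(
  gpu?: { requestAdapter(): Promise<unknown | null> }
): Promise<boolean> {
  if (!gpu) return false; // navigator.gpu missing entirely
  try {
    // requestAdapter() resolves to null when no suitable GPU is available
    return (await gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}

// In the browser: const ok = await hasUsableWebGPU(navigator.gpu);
```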

Step 6 - Add Streaming Responses (LLM Typing Effect) – Optional but Awesome

Replace plain generation with a TextStreamer (Transformers.js v3), which invokes a callback for each decoded chunk:

import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "Xenova/TinyLlama-1.1B-Chat-v1.0",
  { device: "webgpu" }
);

// Add an empty bot message first so the streaming callback has one to update.
setMessages((m) => [...m, { from: "bot", text: "" }]);

let full = "";

const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true, // don't echo the user's prompt back into the chat
  callback_function: (chunk: string) => {
    full += chunk;
    setMessages((m) => [...m.slice(0, -1), { from: "bot", text: full }]);
  },
});

await generator(userMessage, {
  max_new_tokens: 80,
  temperature: 0.7,
  streamer,
});

Now your chatbot types like ChatGPT, fully offline.

Step 7 - Memory System (Storing Chat History Locally)

Add offline persistence via localStorage (add useEffect to the React import):

useEffect(() => {
  try {
    const saved = localStorage.getItem("chat-history");
    if (saved) setMessages(JSON.parse(saved));
  } catch {
    // Ignore corrupted history
  }
}, []);

useEffect(() => {
  localStorage.setItem("chat-history", JSON.stringify(messages));
}, [messages]);

Now your chatbot persists conversations across reloads - even while fully offline.
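localStorage contents can be corrupted or hand-edited, and JSON.parse throws on bad input. A small defensive parser keeps hydration safe; `safeParseHistory` is a helper name introduced here:

```typescript
type Message = { from: string; text: string };

// Hypothetical helper: parse stored chat history defensively,
// returning the fallback when the value is missing, invalid JSON,
// or not an array of messages.
function safeParseHistory(raw: string | null, fallback: Message[]): Message[] {
  if (!raw) return fallback;
  try {
    const parsed = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : fallback;
  } catch {
    return fallback;
  }
}

// Usage: setMessages(safeParseHistory(localStorage.getItem("chat-history"), messages));
```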

Optional: Step 8 - Ship the Model With Your App (True Fully Offline Mode)

Place the model files into /public/models/tiny-llama/ and point Transformers.js at them via its env settings:

import { env, pipeline } from "@huggingface/transformers";

// Resolve model names against files served from /public instead of the Hub.
env.allowRemoteModels = false;
env.localModelPath = "/models/";

const generator = await pipeline("text-generation", "tiny-llama");

This avoids downloading from the Hugging Face Hub, which makes the app a good fit for:

  • offline enterprise apps

  • kiosk apps

  • PWA apps

  • private deployments

Final Results

You now have a Next.js + WebGPU offline-first chatbot that:

  • needs zero backend

  • keeps every message on-device

  • runs fast thanks to WebGPU

  • works offline after the first load

Conclusion - Offline-First AI Is the Future of Web Apps

We’re entering a new era where AI no longer requires servers, GPUs, or API keys. With Next.js + WebGPU, you can build AI products that run:

  • locally on the user's device

  • fast, private, and secure

  • without depending on the cloud

  • without paying for inference GPUs

  • and without worrying about rate limits or outages

The offline-first chatbot you built here is more than a cool demo - it’s the foundation of a new category of web apps:

  • local-only AI assistants

  • enterprise-secure internal chatbots

  • private productivity tools (notes, summarizers, generators)

  • AI-powered PWAs that work in airplane mode

  • fully edge-based WebGPU ML applications

As browsers continue to improve WebGPU support and more compact LLMs appear (e.g., TinyLlama, Phi-2, Gemma-2B, Qwen2-1.5B), we’re heading toward a world where your browser is the new AI runtime.

This is the perfect time to start building offline-first AI experiences - and with Next.js, you already have a framework developers know and love.