"DiffusionGemma" has been haunting every other AI newsletter since June 10, complete with the headline figure of over 1,000 tokens per second — and hardly anyone has had time to clarify what "diffusion" even means for text. Here you'll read what the model genuinely changes, and which part of it matters for an SME.
At a glance: DiffusionGemma is an open AI model from Google (Apache 2.0, released June 10, 2026) that generates text via diffusion: not token by token, but an entire 256-token block in parallel. That makes it up to four times faster than comparable Gemma models — over 700 tokens per second on a single RTX 5090 — and quantized it fits into 18 GB of VRAM. For mid-sized companies, exactly one thing follows from all this: a fast model now runs offline on a consumer GPU. The model is experimental, not a finished production model.
What is DiffusionGemma?
DiffusionGemma is an open language model from the Gemma 4 family by Google, Apache-2.0-licensed and therefore free of license fees. Technically it is a mixture-of-experts model with 26 billion parameters, of which only around 3.8 billion actively compute per request. All weights sit in memory, but only a fraction does the computing. That's the usual MoE trick for speed despite size.
The real novelty sits in the "diffusion" part. Conventional language models work autoregressively, like a typewriter: one token, then the next, each depending on the previous one. DiffusionGemma instead starts from noise and denoises a complete block in a single pass. Google uses the image of a printing press stamping out an entire paragraph at once. Concretely, the model generates 256 tokens in parallel per forward pass, not one after another.
That diffusion works for text at all is the real surprise. The technique comes from image generation, where it carves pixels out of noise. Text is discrete, every token counts, and for a long time autoregressive prediction was considered the more natural approach. DiffusionGemma shows that the parallel approach has become practical for text.
Before this turns into a hype reflex, a sober framework helps. What does diffusion change — and what doesn't it?
- Changes: speed. Parallel block generation instead of token by token — according to Google, up to four times faster than comparable Gemma models.
- Changes: the everyday viability of local AI. Fast enough on a single consumer GPU, not just in a data center.
- Doesn't necessarily change: top-end performance. The claim is speed, not the best reasoning benchmark. DiffusionGemma is explicitly experimental.
- Doesn't change: the data sovereignty logic. Local stays local — the sovereignty question arises exactly as it does with any other open model.
As a side note, according to the model card the model also accepts image and video inputs and outputs text. For the case that matters here — a fast text assistant — that's a bonus, not the main argument.
Why is DiffusionGemma causing such a stir right now?
The trigger is concrete and fresh: Google released DiffusionGemma openly under Apache 2.0 on June 10, with numbers that are easy to share. Over 1,000 tokens per second on an NVIDIA H100, over 700 on an RTX 5090. Figures like that instantly spawn the big thesis.
On LinkedIn it sounds like this: "New architecture, the next GPT killer, diffusion makes cloud LLMs obsolete." That's the wrong focus. The record on the H100 is a research headline, not an argument for mid-sized companies. The relevant number sits right next to it: over 700 tokens per second on a single RTX 5090 — and quantized, the model fits into 18 GB of VRAM. That's not a data center. That's a decent workstation.
And that is the actual news. For years, the standard excuse against local AI went: "too slow for productive use — better stick with the cloud." DiffusionGemma dismantles exactly that sentence. Not because it's the strongest model in the world, but because fast inference no longer depends on a cloud data center. Frontier performance doesn't shift as a result. The everyday viability of local AI does.
What does DiffusionGemma mean for local AI in mid-sized companies?
"Local" sounds like a GPU farm, a server room, and an ops team nobody has. In reality it means: a high-end consumer GPU with 18 GB of memory and a tool like Ollama that loads the model. Requests don't travel through a third-party cloud API — they stay on your machine. That's the entire difference, technically and legally.
In practice, that means a fast text assistant running offline on a workstation. Drafts, summaries, internal Q&A workflows: tasks where speed and confidentiality matter and the last decimal of a benchmark doesn't. For this broad middle of everyday work, a fast local model is often the calmer choice than a frontier model accessed through a US API. One example: a law firm has client correspondence summarized and pre-drafted locally. The data stays in-house, the answer arrives in seconds, and nobody has to explain after the fact why legal briefs traveled through someone else's server.
Two honest caveats belong here. First, DiffusionGemma is experimental — a research release, not a battle-tested production model. Second, speed is no substitute for reasoning: for demanding analysis or complex coding, the frontier models remain ahead. Which open model is worth it for which case, and when the cloud remains the better answer, is covered in detail in the guide to local AI models. Whether a specific model fits your existing GPU is something the hardware calculator for local models works out for you.
How do I connect DiffusionGemma to Corporate LLM?
A local model is set up quickly. The real question is how it reaches a team's day-to-day work — with an interface, permission management, and shared agents. That's exactly where Corporate LLM comes in, without bundling the model itself.
The first route is Bring Your Own Model. You connect your own model endpoint with your own key: OpenAI-compatible such as vLLM or llama.cpp, a securely reachable Ollama endpoint, or OpenRouter. DiffusionGemma runs on your own hardware or with the hosting provider of your choice, requests go directly to your endpoint, and billing runs through your contract. BYOM is available on the Free plan and all paid tiers; the interface with Spaces, Agents, and team management sits on top.
The second route is the same open-source models, EU-hosted. No ops team of your own, yet data stays in the EU, with a data processing agreement (DPA) and no US transfer. The decisive point: you decide per Space or per Agent, without swapping your tooling stack. Your own local model via BYOM where speed and full control matter, EU-hosted models where the operational overhead is too much for you. Both in one interface, instead of a company-wide "cloud or local" decision of principle. How to connect DiffusionGemma concretely via Bring Your Own Model, step by step, is shown in the update Use DiffusionGemma in Corporate LLM now.
Can DiffusionGemma be used locally in a GDPR-compliant way?
Running locally eliminates third-country data transfers and with them the Schrems II question — but it is not a free pass. As long as inference runs on your own hardware, inputs never leave your own infrastructure. That removes the transfer under Art. 44 GDPR and the subsequent articles on third-country transfers. The technical and organizational measures under Art. 32 GDPR are often easier to satisfy on your own hardware, because the data never leaves the building.
Two points remain. The obligations of purpose limitation, a deletion concept, and documentation continue to apply unchanged. And as soon as the hardware sits with an external hosting provider — say, via colocation or rented servers — you need a data processing agreement under Art. 28 GDPR with that operator. The LLM inference itself does not constitute processing on behalf of a controller here; the agreement attaches to the hardware, not the model. With a true on-premise air gap, this point disappears as well. That the model is openly licensed changes none of this: the license governs use of the model, not the handling of your data.
How to deploy strong cloud models in a legally sound way beyond local operation — from the right provider plan through EU hosting to running locally — is mapped out in the guide Using Claude in a GDPR-compliant way.
Using DiffusionGemma locally: your next steps
Three steps keep things matter-of-fact instead of chasing the headline number. First, check in the hardware calculator whether an 18 GB model like DiffusionGemma fits your existing GPU. Then compare in the model database with live scores how open models stack up side by side on strength, memory footprint, and license before you commit to one. And once a model convinces you, connect its endpoint via Bring Your Own Model instead of sending sensitive inputs through someone else's API. You connect DiffusionGemma itself exactly the same way — explained step by step in the update Use DiffusionGemma in Corporate LLM now. If you'd rather sort out the fundamental platform options first, you'll find that framing in the four routes to an LLM platform for SMEs.
DiffusionGemma isn't the end of the cloud — it's one more piece of evidence for the same movement: open models are becoming fast and good enough that data sovereignty is no longer a question of comfort. With this sequence you hold a defensible decision, ready to answer the data question as the controller before the first audit asks it.
Frequently asked questions
What is DiffusionGemma in one sentence?
An open Google model that generates text via diffusion — entire token blocks in parallel instead of token by token — making it significantly faster and able to run locally on a consumer GPU.
How fast is DiffusionGemma?
Google cites over 1,000 tokens per second on an NVIDIA H100 and over 700 on an RTX 5090 — up to four times faster than comparable Gemma models. The exact figure depends on hardware and quantization.
Can I run DiffusionGemma locally?
Yes. It is Apache-2.0-licensed and fits into 18 GB of VRAM when quantized, so it runs on a high-end consumer GPU. The weights are available on Hugging Face, Kaggle, and Vertex AI.
Is diffusion better than the usual architecture for text?
Not across the board. Diffusion primarily delivers speed through parallel block generation. Top-tier reasoning is not the claim, and DiffusionGemma is explicitly experimental.
Does DiffusionGemma make cloud LLMs obsolete?
No. It shifts the everyday viability of local AI, not frontier performance. For many SMEs, a hybrid remains the sensible setup: sensitive workloads local or EU-hosted, the rest via the cloud.



