Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction

The latest research from Google

Jun 26, 2026

Freezing The Base Model And Training A Lightweight Drafting Head With A Verifier Cuts On-Device Latency While Preserving Bit-For-Bit Output

Science, Technology & Innovation · Jun 26, 2026

Google retrofitted Multi-Token Prediction (MTP) onto Gemini Nano v3 by freezing the backbone and training a lightweight drafting head plus a verifier. Drafted tokens are verified, so outputs stay bit-for-bit identical to the base model's, and Pixel devices get out-of-the-box latency speedups without retraining or requalifying the base model.
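To make the draft-then-verify idea concrete, here is a minimal sketch of greedy speculative decoding. It is not Google's implementation: `toy_target` and `toy_drafter` are hypothetical stand-ins for the frozen backbone and the drafting head, greedy decoding is assumed, and the verification loop stands in for what would be a single batched pass on real hardware. The point it illustrates is that a token is kept only when it matches what the target would have produced, so the output equals plain decoding while fewer target passes are needed.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call (one pass) per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextToken, drafter: NextToken, prompt: List[int], n_new: int, k: int = 3
) -> Tuple[List[int], int]:
    """Draft k tokens, verify them against the target, keep the matching prefix.

    Returns the generated tokens and the number of verification passes.
    """
    seq = list(prompt)
    produced = 0
    passes = 0
    while produced < n_new:
        # 1. The cheap drafter proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(seq + draft))

        # 2. One verification pass (simulated here with a loop over positions).
        passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the pass also yields one bonus token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
        produced += len(accepted)
    return seq[len(prompt):len(prompt) + n_new], passes


def toy_target(seq: List[int]) -> int:
    """Hypothetical deterministic 'backbone' over a 50-token vocabulary."""
    return (31 * sum(seq) + len(seq)) % 50


def toy_drafter(seq: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except at every 4th length."""
    guess = toy_target(seq)
    return (guess + 1) % 50 if len(seq) % 4 == 0 else guess
```

Calling `speculative_decode(toy_target, toy_drafter, [1, 2, 3], 20)` returns the same 20 tokens as `greedy_decode(toy_target, [1, 2, 3], 20)` while using fewer passes than tokens. A drafter that is always wrong degrades to one token per pass but still cannot change the output.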


Zero-Copy Drafter Reuses Main Model KV Cache To Eliminate Prefill Latency And Reduce Runtime Memory For Edge AI

A “zero-copy” drafter design has an MTP head cross-attend to the main model’s frozen key-value (KV) cache instead of building its own. This eliminates drafter prefill latency and cuts about 130 MB of runtime memory per instance versus a standalone drafter. The design addresses the mobile dynamic-memory bottleneck and shows that memory architecture can be as decisive as model quality for edge AI deployment.
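The sharing idea can be sketched in a few lines. This is an illustration under stated assumptions, not the production design: `KVCache`, `DraftHead`, and `cross_attend` are hypothetical names, the attention is single-head and unbatched, and plain Python lists stand in for device tensors. What it shows is that the head holds a reference to the backbone's cache, so nothing is copied and nothing is prefilled a second time.

```python
import math
from typing import List

Vector = List[float]


class KVCache:
    """Keys and values written by the frozen backbone, one pair per position."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)


def cross_attend(query: Vector, cache: KVCache) -> Vector:
    """Single-head scaled dot-product attention of one query over the cache."""
    if not cache.keys:
        raise ValueError("cache is empty")
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cache.keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [
        sum(w * value[i] for w, value in zip(weights, cache.values)) / total
        for i in range(dim)
    ]


class DraftHead:
    """Hypothetical drafter head that reads the backbone cache by reference."""

    def __init__(self, shared_cache: KVCache) -> None:
        self.cache = shared_cache  # a reference: no copy, no separate prefill

    def context(self, query: Vector) -> Vector:
        return cross_attend(query, self.cache)
```

Because `DraftHead` stores the same object the backbone writes to, any position the backbone appends is visible to the head immediately. A standalone drafter would instead need its own cache and its own prefill over the prompt.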


MTP Improves On-Device Inference Efficiency By Increasing Tokens Per Pass And Reducing Verification Frequency, Improving Battery Life

Google's MTP combines speculative verification, richer drafts, and a redesigned on-device inference stack to validate nearly two extra tokens per pass in production features (e.g., AI Notification Summaries, Proofread). Fewer verification passes mean heavy processors wake less often, which lowers energy use and improves battery life, making on-device AI more usable and commercially defensible.
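A rough back-of-the-envelope calculation shows why extra tokens per pass translate into fewer processor wake-ups. The numbers below are illustrative assumptions, not figures from the article: the baseline is taken to be one token per pass, "nearly two extra" is modeled as 1.9, and the 120-token output length is made up.

```python
import math


def passes_needed(n_tokens: int, tokens_per_pass: float) -> int:
    """Number of verification passes required to emit n_tokens."""
    if n_tokens < 0 or tokens_per_pass <= 0:
        raise ValueError("n_tokens must be >= 0 and tokens_per_pass > 0")
    return math.ceil(n_tokens / tokens_per_pass)


# Illustrative assumptions only (not measured values).
SUMMARY_TOKENS = 120         # hypothetical output length
BASELINE_PER_PASS = 1.0      # assumed: plain decoding emits one token per pass
MTP_PER_PASS = 1.0 + 1.9     # assumed: "nearly two extra" validated tokens

baseline_passes = passes_needed(SUMMARY_TOKENS, BASELINE_PER_PASS)
mtp_passes = passes_needed(SUMMARY_TOKENS, MTP_PER_PASS)
```

Under these assumptions the pass count drops from 120 to 42, roughly a third. Actual energy savings depend on hardware behavior that this arithmetic does not model.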


Integrated MTP Heads Outperform Standalone Drafters By Reusing Backbone Activations

Google found that attaching a lightweight MTP drafter head to a model’s final hidden states (a late-exit design) outperforms similarly sized standalone drafters in speculative decoding, giving speedups of roughly 50% or more on Pixel 9 and up to 55% higher token acceptance. The implication is that reusing the backbone’s internal state matters more than parameter parity.
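The link between token acceptance and speed can be illustrated with the standard expected-length formula from the speculative decoding literature: if each drafted token is accepted independently with probability `alpha` and `k` tokens are drafted, a pass yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The sketch below assumes that independence model; the acceptance rates used in the usage note are hypothetical and are not taken from the article.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that every pass emits at least one token
    (the correction or bonus token from the target model).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # all drafts accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

For example, comparing `expected_tokens_per_pass(0.40, 4)` with `expected_tokens_per_pass(0.62, 4)` shows how a drafter with a higher acceptance rate validates more tokens in each pass. Whether the reported 55% is a relative or absolute improvement is not stated, so these inputs are only placeholders.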