Per-Layer Embeddings: The secret powering Gemma 4 models

Google dropped Gemma 4 on April 2, 2026, a family of open models that push the boundaries of efficient, multimodal, and agentic AI. The lineup includes a 31B dense model, a 26B Mixture-of-Experts (MoE) variant, and two standout smaller models: Gemma 4 E2B and Gemma 4 E4B.

What makes the "E" models special?

They're not just smaller versions of bigger transformers. They introduce (or refine) a clever architectural trick called Per-Layer Embeddings (PLE) that delivers strong performance with far fewer effective parameters. This is the kind of innovation that makes frontier-level intelligence feasible on phones, laptops, and edge devices. Let's unpack why PLE feels like magic and how it works under the hood.

Traditional Embeddings: One Shot at the Start

In a standard decoder-only transformer (think Llama, Mistral, or earlier Gemma models):

  • Input tokens are converted into dense vectors via a single embedding table (vocabulary size × hidden dimension) right at the beginning.

  • This embedding vector enters the residual stream and gets refined layer by layer through attention and feed-forward networks.

  • By the time you reach deeper layers, the original token identity can get "diluted" as contextual information accumulates.

This works great for large models with plenty of capacity. But for tiny models (a few billion parameters or less), it's a bottleneck. You either need more layers (increasing compute) or accept lower representational power.
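The single-lookup flow described above can be sketched in a few lines of NumPy. Everything here is a toy stand-in: the sizes, the random table, and the "layer" update are illustrative, not Gemma's actual architecture.

```python
import numpy as np

# Toy decoder-style embedding flow (all sizes illustrative, not Gemma's).
vocab_size, hidden_dim = 100, 16
rng = np.random.default_rng(0)
embed_table = rng.standard_normal((vocab_size, hidden_dim))

token_ids = np.array([7, 42, 7])   # a tiny 3-token sequence
hidden = embed_table[token_ids]    # the single lookup, at the very start

# From here on, every layer only refines `hidden`; the original token
# identity has to survive the whole residual stream on its own.
for _ in range(4):
    hidden = hidden + 0.1 * np.tanh(hidden)   # stand-in for attention + FFN
```

Note that after the first line of the loop, nothing ever consults `embed_table` again: whatever token-level detail the deeper layers need must already be encoded in the residual stream.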

Enter PLE: Fresh Token Signals at Every Layer

Per-Layer Embeddings flip this script. Instead of relying solely on the initial embedding that propagates through the entire network, PLE gives each decoder layer its own small, dedicated embedding lookup for every token. Here's the high-level flow (based on Gemma 4's design):

  1. Main Residual Stream : Starts with the standard token embedding (e.g., 1,536 dimensions for E2B or 2,560 for E4B) and flows through attention + FFN as usual.

  2. Parallel PLE Pathway : For each token and each layer:

    • A lightweight lookup happens in a per-layer embedding table (much smaller dimension — around 256 dims in some descriptions, though exact sizes vary).

    • This combines:

      • A token-identity component (pure lookup from a dedicated table — like reminding the layer "this is the word 'cat'").

      • A context-aware component (a learned projection from the current main hidden state, injecting what's happening in the sequence so far).

    • The result is a small vector tailored specifically for that layer.

  3. Injection : This per-layer vector is added to (or modulates) the hidden states via a lightweight residual block, often after the attention and feed-forward sub-layers. A gating mechanism can weigh how strongly to apply it.

The outcome? Every layer gets a "fresh reminder" of the token's identity and role, without forcing the main residual stream to carry everything from layer 1. It's like giving each stage of processing its own cheat sheet. This approach was previewed in earlier models like Gemma-3n and is fully leveraged in Gemma 4's edge-focused variants.
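The three-step flow above can be sketched end to end. This is a minimal NumPy sketch of the idea, not Gemma's implementation: the table shapes, the projections, and the scalar gate are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, ple_dim, n_layers = 100, 16, 4, 3

# One small embedding table *per layer* (the PLE tables), plus per-layer
# projections into and out of the tiny PLE space, and a toy scalar gate.
ple_tables = rng.standard_normal((n_layers, vocab_size, ple_dim)) * 0.1
proj_in    = rng.standard_normal((n_layers, hidden_dim, ple_dim)) * 0.1
proj_out   = rng.standard_normal((n_layers, ple_dim, hidden_dim)) * 0.1
gate       = np.full(n_layers, 0.5)

main_table = rng.standard_normal((vocab_size, hidden_dim))
token_ids = np.array([7, 42])
hidden = main_table[token_ids]          # 1. main residual stream starts here

for layer in range(n_layers):
    hidden = hidden + 0.1 * np.tanh(hidden)       # stand-in for attn + FFN
    # 2. parallel PLE pathway: token-identity lookup + context projection
    tok_part = ple_tables[layer][token_ids]       # "this is the word 'cat'"
    ctx_part = np.tanh(hidden @ proj_in[layer])   # what's happening so far
    ple_vec  = tok_part + ctx_part                # small per-layer vector
    # 3. gated injection back into the main stream
    hidden = hidden + gate[layer] * (ple_vec @ proj_out[layer])
```

The key structural point survives even in this toy version: each layer re-reads its own table with the raw `token_ids`, so token identity is re-injected at every depth instead of being carried solely by the residual stream.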

Why It Feels Like Magic: Parameter Efficiency Redefined

Here's where the numbers get interesting:

  • Gemma 4 E2B: ~5.1B total parameters (including embeddings) → 2.3B effective parameters.

  • Gemma 4 E4B: ~8B total parameters (including embeddings) → 4.5B effective parameters.

The "effective" count reflects the active compute during inference. The extra parameters live mostly in large but cheap lookup tables — fast matrix lookups with minimal FLOPs.Benefits include:

  • On-device friendliness — These tables can be memory-mapped (loaded from flash/storage on demand) rather than sitting fully in precious VRAM/RAM. Combined with 2-bit/4-bit quantization via LiteRT, the E2B can run in under 1.5 GB of memory on some devices.

  • Better retention of token information — Deeper layers don't lose fine-grained token details as easily.

  • Scalable expressiveness — Small models punch above their weight in reasoning, coding, and multimodal tasks (text + image + audio on the E variants) without ballooning compute.

  • Inference speed — Lookups are blazing fast; the model behaves like its effective size during forward passes.
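The memory-mapping idea from the first bullet can be sketched with NumPy's `mmap_mode`, a rough stand-in for what an on-device runtime like LiteRT does with its own formats. The file name and table sizes here are invented for the example.

```python
import os
import tempfile
import numpy as np

# Write a (toy) PLE table to disk once; a real deployment would ship
# this inside the model package.
vocab_size, ple_dim = 1000, 256
path = os.path.join(tempfile.mkdtemp(), "ple_table.npy")
toy_table = np.random.default_rng(0).standard_normal(
    (vocab_size, ple_dim)).astype(np.float16)
np.save(path, toy_table)

# Memory-map instead of loading: the OS pages rows in on demand,
# so the full table never has to sit in RAM at once.
table = np.load(path, mmap_mode="r")
row = np.asarray(table[42])   # only this row is actually read from storage
```

Because each token needs only a handful of rows per forward pass, the working set stays tiny even though the table itself is large.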

In short: You get more "intelligence per parameter" by decoupling static token knowledge from dynamic computation.
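A rough back-of-the-envelope illustrates how lookup tables can dominate the total parameter count. Every number below is an assumption chosen for illustration (a Gemma-scale vocabulary, a 256-dim PLE width, 30 layers); none are official figures.

```python
# Illustrative arithmetic only: assumed vocabulary, PLE width, and depth.
vocab_size = 262_144   # assumed Gemma-scale vocabulary
ple_dim    = 256       # assumed per-layer embedding width
n_layers   = 30        # assumed decoder depth

ple_params = vocab_size * ple_dim * n_layers
print(f"PLE table params: {ple_params / 1e9:.2f}B")  # ~2.01B

# Per token, per layer, the lookup costs O(ple_dim) memory reads,
# not a matmul over the whole table: heavy storage, cheap compute.
```

Under these assumed sizes, roughly two billion parameters would live in tables that contribute almost nothing to FLOPs, which is exactly the gap between "total" and "effective" counts.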

How PLE Fits Into Gemma 4's Bigger Picture

PLE doesn't exist in isolation. Gemma 4 pairs it with other efficiency tricks for the smaller models:

  • Hybrid attention: Alternating sliding-window (local, 512 tokens) and global attention layers.

  • Dual RoPE (standard for local, proportional for global) → up to 128K–256K context.

  • Shared KV cache in later layers.

  • Lightweight multimodal encoders (vision ~150M params, audio ~300M).

The larger 31B dense and 26B MoE models skip PLE (they already have ample capacity) but share the hybrid attention backbone. The result? Gemma 4 delivers strong benchmark performance across reasoning, agentic workflows (tool use, structured output), and on-device multimodal understanding, all while staying open and deployable.
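The alternating local/global pattern boils down to two different attention masks. A toy sketch, using a window of 3 instead of the 512-token window mentioned above:

```python
import numpy as np

# Toy causal masks for the hybrid attention pattern (window of 3 for
# readability; the article's local layers use a 512-token window).
seq_len, window = 8, 3

i = np.arange(seq_len)[:, None]   # query positions
j = np.arange(seq_len)[None, :]   # key positions

global_mask = j <= i                          # full causal attention
local_mask  = (j <= i) & (i - j < window)     # causal sliding window

# A local layer attends to at most `window` recent tokens per query:
print(local_mask.sum(axis=1))   # [1 2 3 3 3 3 3 3]
```

Global layers still see the whole prefix, while local layers keep KV-cache and compute bounded by the window, which is why interleaving the two scales to very long contexts.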

Why This Matters for Developers and the Edge

For years, we've chased bigger models in the cloud. PLE is part of a broader shift: making capable AI truly local and private.

  • Build offline agents on mobile.

  • Run multimodal apps (camera + mic input) with near-zero latency.

  • Fine-tune or deploy without massive GPUs.

If you're experimenting, grab the models from Hugging Face, Google AI Studio, or Ollama. The E2B/E4B shine when quantized and optimized with tools like LiteRT or MLX.

Final Thoughts

Per-Layer Embeddings aren't flashy like Mixture-of-Experts or massive scale. They're a quiet, elegant hack: a rethinking of how information flows through a transformer to squeeze more capability out of limited hardware.

In Gemma 4, this "magic" helps close the gap between cloud giants and on-device AI. As edge hardware improves and quantization gets smarter, techniques like PLE could become standard in the next generation of efficient open models.

© 2026