This paper introduces the Kondo Gate for Reinforcement Learning, a compute-efficient mechanism that uses a "delight" signal (the product of advantage and surprisal) computed during the forward pass to decide whether to execute an expensive backward pass. Evaluations on MNIST and Transformer token reversal show that the method can skip over 97% of backward passes while matching the learning quality of the full Delightful Policy Gradient (DG), significantly outperforming standard PG and PPO.
TL;DR
DeepMind researchers have introduced the Kondo Gate, a mechanism that decides whether to perform a backward pass based on a "delight" signal calculated during the forward pass. By filtering for samples that are both valuable (high advantage) and unexpected (high surprisal), they can skip up to 97% of backpropagation steps without losing model performance. It effectively treats training like speculative decoding: use a cheap check to see if the expensive work is necessary.
The "Compute-Inefficiency" Problem in RL
In modern Reinforcement Learning, the backward pass is typically 2-4x more expensive than the forward pass. Yet, in standard Policy Gradient (PG) or PPO, we spend the same amount of compute on every sample.
- The Problem: Most samples are "boring"—they either confirm what the agent already knows or punish a mistake the agent has already learned to avoid.
- The Insight: If we could identify "breakthroughs" during the forward pass, we could skip the math for everything else.
Methodology: The Kondo Gate
The core of the paper is the definition of Delight ($\chi$): $$\chi = \text{Advantage} \times \text{Surprisal}$$
Why this product? Advantage alone ignores rarity; surprisal alone ignores value (prioritizing "surprising failures"). Delight targets the intersection: rare successes that teach the learner something new.
The Kondo Gate samples a Bernoulli variable $G$:
- Forward Pass: Calculate delight $\chi$.
- Gating: Draw $G \sim \text{Ber}(\sigma((\chi - \lambda)/\eta))$, where $\lambda$ is the "compute price" and $\eta$ is a temperature that sharpens or softens the threshold.
- Conditional Backprop: Only compute $\nabla_\theta$ if $G = 1$ (see the sketch below).
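Here is a minimal PyTorch sketch of the mechanism as described above. It is our reconstruction, not the authors' code; `lam` and `eta` are placeholder hyperparameters.

```python
import torch

def kondo_gate_step(policy, optimizer, obs, action, advantage, lam=1.0, eta=0.1):
    """One policy-gradient step guarded by a Kondo Gate (illustrative sketch)."""
    # Forward pass only: no autograd graph is built for the screening.
    with torch.no_grad():
        logp = policy(obs).log_softmax(-1) \
                    .gather(-1, action.unsqueeze(-1)).squeeze(-1)
        surprisal = -logp                        # rarity of the taken action
        delight = advantage * surprisal          # chi = Advantage x Surprisal
        gate = torch.bernoulli(torch.sigmoid((delight - lam) / eta)).bool()

    if not gate.any():
        return 0  # every sample skipped: no backward pass this step

    # Expensive path: rebuild the graph only for the gated samples.
    logp_kept = policy(obs[gate]).log_softmax(-1) \
                     .gather(-1, action[gate].unsqueeze(-1)).squeeze(-1)
    loss = -(advantage[gate] * logp_kept).mean()  # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return int(gate.sum())
```

Because the screening forward runs under `no_grad`, gated samples need a second, graph-building forward; at a ~3% keep rate that overhead is small next to the backward passes being skipped.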

Experimental Evidence
1. MNIST: 100x Efficiency Boost
In an RL version of MNIST, the Kondo Gate ($\rho=0.03$) matched the error rate of the full Delightful Policy Gradient while using two orders of magnitude fewer backward passes. As the cost ratio of backward/forward passes increases, the speedup of the Kondo Gate grows linearly.
Figure: (a) measured in forward passes, the gated variant (DG-K) matches full DG; (b) measured in backward passes, DG-K shows a massive savings advantage.
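As a back-of-the-envelope cost model (our framing, not a derivation from the paper): if a forward pass costs $c_f$ and a backward pass costs $r \cdot c_f$, gating with keep rate $\rho$ cuts the expected per-sample cost from $c_f(1 + r)$ to $c_f(1 + \rho r)$, for a speedup of

$$\frac{1 + r}{1 + \rho r},$$

which is roughly linear in $r$ while $\rho r \ll 1$ and saturates at $1/\rho$ (about 33x at $\rho = 0.03$).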
2. Transformer Token Reversal
The authors tested the gate on a Transformer tasked with reversing sequences. As sequences get longer ($H$) and vocabularies larger ($M$), informative events become rarer. This is where the Kondo Gate shines: it focuses a fixed compute budget on the rare "learning frontier" tokens.
Measured in backward passes, the Kondo Gate (fixed $\rho = 3\%$) solves significantly harder problems than PG or PPO.
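For sequence tasks, one plausible way to realize this focus is a per-token gate on the loss; the following sketch is our guess at the granularity (the paper may gate per sequence instead), with placeholder hyperparameters.

```python
import torch

def gated_sequence_loss(logits, targets, advantages, lam=1.0, eta=0.1):
    """Per-token Kondo gating for a sequence task (illustrative sketch).

    logits:     [B, T, V]  policy outputs over the vocabulary
    targets:    [B, T]     tokens actually emitted
    advantages: [B, T]     per-token advantage estimates
    """
    logp = logits.log_softmax(-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # The gating decision is forward-only: compute it without gradients.
    with torch.no_grad():
        delight = advantages * (-logp)                         # chi per token
        gate = torch.bernoulli(torch.sigmoid((delight - lam) / eta))
    # Masked objective: tokens the gate skips contribute zero gradient.
    return -(gate * advantages * logp).sum() / gate.sum().clamp(min=1.0)
```

Note that this masked form still runs one backward per batch; the real compute saving comes from skipping the backward entirely whenever nothing in the batch passes the gate, as in the earlier sketch.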
The "Gambling" Pathology: When it Fails
The authors candidly identify a failure mode: the Gambling Regime. If a suboptimal action has extremely high reward variance (like a slot machine), a lucky draw can look like a breakthrough. Because the policy rarely picks this arm, the surprisal is high and the delight is maximized, so the gate "sparks joy" on a false signal. This, however, is a fundamental limit of per-sample statistics, not something specific to the Kondo Gate.
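A toy NumPy illustration of the pathology, with invented numbers: a rare, lucky payoff on a worse arm produces both a large advantage and a large surprisal, so its delight dwarfs everything else.

```python
import numpy as np

# Two-armed bandit (values illustrative): arm 0 always pays 1.0; arm 1 is a
# slot machine paying 20.0 with probability 0.02, else 0.0 -- so its expected
# reward is only 0.4, and the policy rightly prefers arm 0.
pi = np.array([0.97, 0.03])            # current policy
baseline = pi[0] * 1.0 + pi[1] * 0.4   # value estimate used for the advantage

def delight(arm, reward):
    advantage = reward - baseline
    surprisal = -np.log(pi[arm])
    return advantage * surprisal

print("steady win on arm 0:  ", delight(0, 1.0))   # ~0.0005: nothing to learn
print("typical loss on arm 1:", delight(1, 0.0))   # ~-3.4: gate stays shut
print("jackpot on arm 1:     ", delight(1, 20.0))  # ~66.7: false breakthrough
```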
Deep Insight: Speculative Training
One of the most exciting takeaways is that the gate tolerates approximate delight. You don't need a full-precision forward pass to decide if a sample is worth a backward pass. You could use:
- Quantized/Distilled models for screening.
- Cached values from previous iterations.
This points toward a Speculative Decoding for Training paradigm: a cheap, fast "referee" model screens data, and only the "joy-sparking" samples are sent to the large, expensive model for actual weight updates.
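A sketch of what that pattern could look like, under assumptions: `referee` is a hypothetical cheap screener (e.g., a quantized or distilled copy of the policy), and the threshold hyperparameters are placeholders.

```python
import torch

@torch.no_grad()
def screen(referee, obs, action, advantage, lam=1.0, eta=0.1):
    """Cheap referee scores delight; only its picks reach the big model."""
    logp = referee(obs).log_softmax(-1) \
                 .gather(-1, action.unsqueeze(-1)).squeeze(-1)
    delight = advantage * (-logp)
    return torch.bernoulli(torch.sigmoid((delight - lam) / eta)).bool()

def speculative_train_step(referee, policy, optimizer, obs, action, advantage):
    keep = screen(referee, obs, action, advantage)
    if not keep.any():
        return  # nothing sparked joy: the expensive model never runs
    logp = policy(obs[keep]).log_softmax(-1) \
                 .gather(-1, action[keep].unsqueeze(-1)).squeeze(-1)
    loss = -(advantage[keep] * logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

As with speculative decoding, the screener only decides which samples get the expensive treatment; the gradients themselves are always computed by the full model.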
Conclusion
The Kondo Gate makes a compelling case that, in complex sequence modeling, the vast majority of gradient updates are redundant. By selectively "tidying up" the training process and computing gradients only for delightful samples, it matches the learning quality of the full Delightful Policy Gradient with a fraction of the traditional compute budget.
References:
- Osband, I. (2026). Does This Gradient Spark Joy? Google DeepMind.
- Osband, I. (2025). Delightful Policy Gradient.
