[Google DeepMind] Does This Gradient Spark Joy? Selective Training via the Kondo Gate
Abstract

This paper introduces the Kondo gate for Reinforcement Learning, a compute-efficient mechanism that uses a "delight" signal (the product of advantage and surprisal) during the forward pass to decide whether to execute an expensive backward pass. Evaluations on MNIST and Transformer token reversal show that the method can skip over 97% of backward passes while matching the learning quality of full Delightful Policy Gradient (DG), significantly outperforming standard PG and PPO.

TL;DR

DeepMind researchers have introduced the Kondo Gate, a mechanism that decides whether to perform a backward pass based on a "delight" signal calculated during the forward pass. By filtering for samples that are both valuable (high advantage) and unexpected (high surprisal), they can skip over 97% of backpropagation steps without losing model performance. It effectively treats training like speculative decoding: use a cheap check to see if the expensive work is necessary.

The "Compute-Inefficiency" Problem in RL

In modern Reinforcement Learning, the backward pass is typically 2-4x more expensive than the forward pass. Yet, in standard Policy Gradient (PG) or PPO, we spend the same amount of compute on every sample.

  • The Problem: Most samples are "boring"—they either confirm what the agent already knows or punish a mistake the agent has already learned to avoid.
  • The Insight: If we could identify "breakthroughs" during the forward pass, we could skip the math for everything else.

Methodology: The Kondo Gate

The core of the paper is the definition of Delight ($\chi$): $$\chi = \text{Advantage} \times \text{Surprisal}$$

Why this product? Advantage alone ignores rarity; surprisal alone ignores value (prioritizing "surprising failures"). Delight targets the intersection: rare successes that teach the learner something new.
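The intuition behind the product can be made concrete with a tiny numerical sketch (my own illustration, not the paper's code; surprisal is taken as $-\log \pi(a \mid s)$):

```python
import math

def delight(advantage, action_prob):
    """Delight = advantage * surprisal, with surprisal = -log pi(a|s)."""
    surprisal = -math.log(action_prob)
    return advantage * surprisal

# Rare success: high advantage, low probability -> large positive delight.
rare_success = delight(advantage=2.0, action_prob=0.01)
# Surprising failure: negative advantage, low probability -> large NEGATIVE delight.
surprising_failure = delight(advantage=-2.0, action_prob=0.01)
# Expected success: high probability -> surprisal near zero, delight near zero.
expected_success = delight(advantage=2.0, action_prob=0.95)

assert rare_success > expected_success > surprising_failure
```

Advantage alone would rank the expected success as highly as the rare one; surprisal alone would rank the surprising failure as highly as the rare success. Only the product singles out the rare success.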

The Kondo Gate samples a Bernoulli variable $G$:

  1. Forward Pass: Calculate delight $\chi$.
  2. Gating: Draw $G \sim \text{Ber}(\sigma((\chi - \lambda)/\eta))$, where $\lambda$ is the "compute price" and $\eta$ a temperature.
  3. Conditional Backprop: Only compute $\nabla_\theta$ if $G = 1$.

Algorithm 1: Kondo Gate Implementation
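The three steps above can be sketched as follows (a minimal illustration under the stated definitions, not the authors' implementation; the function names and default hyperparameters are assumptions):

```python
import math
import random

def gate_probability(chi, lam, eta):
    """sigma((chi - lam) / eta): lam is the 'compute price', eta a temperature."""
    return 1.0 / (1.0 + math.exp(-(chi - lam) / eta))

def kondo_step(advantage, action_prob, lam=1.0, eta=0.5, rng=random):
    """One gated step: forward-pass quantities in, a backprop decision out."""
    surprisal = -math.log(action_prob)       # from the forward pass
    chi = advantage * surprisal              # delight
    g = rng.random() < gate_probability(chi, lam, eta)  # G ~ Ber(.)
    if g:
        pass  # the expensive backward pass would run here
    return g
```

High-delight samples (e.g. `advantage=5.0`, `action_prob=0.01`) gate through almost surely, while boring samples are almost always skipped; raising `lam` tightens the budget.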

Experimental Evidence

1. MNIST: 100x Efficiency Boost

In an RL version of MNIST, the Kondo Gate ($\rho=0.03$) matched the error rate of the full Delightful Policy Gradient while using two orders of magnitude fewer backward passes. As the cost ratio of backward/forward passes increases, the speedup of the Kondo Gate grows linearly.
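As a back-of-the-envelope illustration of where the savings come from (a simple cost model of my own, not the paper's accounting):

```python
def training_cost(n_samples, gate_rate, backward_cost_ratio):
    """Total compute in forward-pass units: every sample pays one forward
    pass; only a gate_rate fraction pays the more expensive backward pass."""
    return n_samples * (1.0 + gate_rate * backward_cost_ratio)

# Standard PG: backprop on every sample (gate_rate = 1.0).
full = training_cost(10_000, gate_rate=1.0, backward_cost_ratio=3.0)
# Kondo gate at rho = 0.03: only 3% of samples trigger a backward pass.
gated = training_cost(10_000, gate_rate=0.03, backward_cost_ratio=3.0)
speedup = full / gated
```

Under this model, the more the backward pass dominates total cost, the more of that cost the gate eliminates, which matches the paper's observation that savings grow with the backward/forward cost ratio.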

Figure: Performance on MNIST. (a) In forward-pass count, DG-K matches DG; (b) in backward-pass count, DG-K shows the massive savings.

2. Transformer Token Reversal

The authors tested the gate on a Transformer tasked with reversing sequences. As sequences got longer ($H$) and vocabularies larger ($M$), informative events became rarer. This is where the Kondo gate shines: it focuses the fixed compute budget on the rare "learning frontier" tokens.

Figure: Scaling results. In backward-pass space, the Kondo gate (fixed $\rho = 3\%$) solves significantly harder problems than PG or PPO.

The "Gambling" Pathology: When it Fails

The authors honestly identify a failure mode: the Gambling Regime. If a suboptimal action has extremely high reward variance (like a slot machine), a "lucky draw" can look like a breakthrough. Because the policy rarely picks this arm, the surprisal is high, and delight is maximized. This causes the gate to "spark joy" on a false signal. The authors note, however, that this is a fundamental limit of any per-sample statistic, not a flaw specific to the Kondo gate.
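A toy calculation makes the pathology concrete (my own illustrative numbers, not the paper's experiment):

```python
import math

# "Slot machine" arm: pays 20.0 with probability 0.05, else 0.0.
# Its mean payoff (1.0) is no better than a safe arm, but its variance is huge.
pi_slot = 0.02                   # the policy rarely picks this arm...
surprisal = -math.log(pi_slot)   # ...so its surprisal is high
baseline = 1.0                   # value baseline used for the advantage

# On a lucky draw (payoff 20.0) the advantage is large, so delight spikes:
lucky_delight = (20.0 - baseline) * surprisal
# Yet in expectation the arm teaches nothing; its average delight is ~0:
mean_delight = (0.05 * 20.0 - baseline) * surprisal
```

Per-sample delight sees only the lucky draw, not the expectation, so the gate happily spends backward passes chasing the slot machine.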

Deep Insight: Speculative Training

One of the most exciting takeaways is that the gate tolerates approximate delight. You don't need a full-precision forward pass to decide if a sample is worth a backward pass. You could use:

  • Quantized/Distilled models for screening.
  • Cached values from previous iterations.

This points toward a Speculative Decoding for Training paradigm: a cheap, fast "referee" model screens data, and only the "joy-sparking" samples are sent to the large, expensive model for actual weight updates.
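One way such a pipeline could look (purely hypothetical; `cheap_delight` stands in for a quantized or distilled screener, and the sample format is invented for the example):

```python
def speculative_screen(batch, cheap_delight, threshold):
    """Route only 'joy-sparking' samples to the expensive model.

    cheap_delight: an approximate delight estimate from a small or quantized
    screener model (an assumption; any cheap proxy would do).
    """
    return [sample for sample in batch if cheap_delight(sample) > threshold]

# Toy usage: each sample carries a precomputed proxy delight score.
batch = [{"id": i, "score": s} for i, s in enumerate([0.1, 5.0, 0.3, 7.2])]
kept = speculative_screen(batch, cheap_delight=lambda s: s["score"], threshold=1.0)
# Only the high-delight samples would reach the large model for weight updates.
```

Because the gate is a threshold on a noisy scalar, a screener whose delight estimate is merely correlated with the true value can preserve most of the savings.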

Conclusion

The Kondo Gate suggests that in complex sequence modeling, the vast majority of gradient updates are redundant. By selectively "tidying up" the training process and only computing gradients for delightful samples, a model can match full-gradient performance with a fraction of the traditional compute budget.


References:

  • Osband, I. (2026). Does This Gradient Spark Joy? Google DeepMind.
  • Osband, I. (2025). Delightful Policy Gradient.
