[ICLR 2026] CLIPO: Beyond Binary Rewards with Contrastive Policy Optimization
Abstract

The paper introduces CLIPO (Contrastive Learning in Policy Optimization), a framework that integrates trajectory-level representation learning into Reinforcement Learning with Verifiable Rewards (RLVR). By applying an InfoNCE objective to align successful reasoning paths and separate them from failed ones, CLIPO achieves SOTA generalization and robustness across multiple LLM reasoning benchmarks.

TL;DR

Reinforcement Learning with Verifiable Rewards (RLVR) has a "hallucination" problem: models can get the right answer for the wrong reasons. CLIPO (Contrastive Learning in Policy Optimization) solves this by forcing the model to recognize the "invariant structure" of correct reasoning. By aligning successful trajectories in a latent space, it provides a dense reward signal that improves generalization on MATH and GSM8K benchmarks by up to 3.36 points.

Problem: The "Happy Families" Trap of RLVR

Current RLVR methods like GRPO operate on a simple binary: did the code compile, or is the math answer correct? This coarse feedback creates a "spurious reward" problem. As Leo Tolstoy might say, all correct reasoning paths share a logic ("Happy families are all alike"), while incorrect paths fail in infinite, noisy ways.

When we only reward the final answer, the model might overfit to specific patterns or "cheat" by copying the answer without logic. Prior solutions involved Process Reward Models (PRMs), but these require expensive, human-annotated step-by-step labels that are hard to scale.

Methodology: Distilling Logic through Contrast

The core insight of CLIPO is that successful reasoning paths share underlying semantic invariants, while errors are sporadic and uncorrelated noise.

1. Latent Semantic Embedding

CLIPO appends a lightweight Contrastive Head to the transformer backbone. It takes the mean-pooled hidden states of a full reasoning trajectory and projects them into a low-dimensional latent space.
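
The paper does not spell out this module's exact shape here, so the following is only a minimal sketch of such a head; the two-layer MLP, the layer sizes, and all names are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class ContrastiveHead(nn.Module):
    """Illustrative projection head: mean-pool a trajectory's hidden states,
    then map the pooled vector into a small latent space (sizes are assumed)."""

    def __init__(self, hidden_size: int = 4096, latent_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, latent_dim),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        z = self.proj(pooled)
        # Unit-norm embeddings so that dot products act as cosine similarities.
        return nn.functional.normalize(z, dim=-1)
```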

2. The InfoNCE Reward

Within a group of sampled rollouts, CLIPO identifies "Positive" trajectories (those reaching the correct answer) and "Negative" ones (those that don't). It then optimizes an InfoNCE objective (sketched in code after the list):

  • Align: Pull multiple positive trajectories together in the embedding space.
  • Contrast: Push positive trajectories away from the negative "noise."
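
As a rough illustration, a group-wise InfoNCE loss over one prompt's rollouts could be computed as below. This is a sketch under assumptions: the temperature value, the supervised-contrastive averaging, and the handling of degenerate groups are mine, not the paper's.

```python
import torch

def group_infonce(embeddings: torch.Tensor, is_correct: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (G, d) unit-norm trajectory embeddings for one prompt's G rollouts.
    is_correct: (G,) boolean mask from the verifiable reward.
    Returns a scalar loss; returns 0 when the group cannot support a contrastive
    signal (the "sparse success" case discussed in the limitations)."""
    pos_idx = is_correct.nonzero(as_tuple=True)[0]
    neg_idx = (~is_correct).nonzero(as_tuple=True)[0]
    if len(pos_idx) < 2 or len(neg_idx) == 0:
        return embeddings.new_zeros(())  # nothing to align or contrast against

    sim = embeddings @ embeddings.T / temperature  # (G, G) cosine similarities
    losses = []
    for i in pos_idx:
        # Candidates for anchor i: the remaining positives first, then all negatives.
        others = torch.cat([pos_idx[pos_idx != i], neg_idx])
        logits = sim[i, others]
        pos_mask = torch.zeros_like(logits, dtype=torch.bool)
        pos_mask[: len(pos_idx) - 1] = True
        log_prob = logits - torch.logsumexp(logits, dim=0)
        # Align: maximize similarity to other positives; Contrast: the softmax
        # denominator pushes the anchor away from negatives.
        losses.append(-log_prob[pos_mask].mean())
    return torch.stack(losses).mean()
```

Minimizing this loss tightens the cluster of correct trajectories while pushing incorrect ones away, which is the align/contrast pair described above.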

Figure: Overall architecture of CLIPO.

The contrastive loss is converted into a dense reward term and added to the verifiable environment reward. This reshapes the reward landscape, prioritizing solutions that "agree" with the consensus logic of success.
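
Concretely, the shaping step could be as simple as the following; the mixing coefficient and the within-group standardization are illustrative assumptions, not values from the paper:

```python
import torch

def shape_rewards(verifiable: torch.Tensor, contrastive: torch.Tensor,
                  coeff: float = 0.1) -> torch.Tensor:
    """verifiable: (G,) binary outcome rewards for a group of rollouts.
    contrastive: (G,) per-trajectory contrastive scores (e.g. negative InfoNCE loss)."""
    # Standardize the dense term within the group so it shapes, but never
    # overrides, the verifiable signal.
    contrastive = (contrastive - contrastive.mean()) / (contrastive.std() + 1e-6)
    return verifiable + coeff * contrastive
```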

Experiments: Superior Generalization

The researchers tested CLIPO across two tracks: general grade-school math (Track I) and competition-level math (Track II).

Key Findings:

  • Robustness to Perturbations: On GSM8K-P2 (perturbed problems), CLIPO improved over GRPO by +3.36 points, suggesting the model is learning the underlying logic rather than memorizing surface patterns.
  • Cross-Model Universality: Whether using Qwen2.5-7B, Llama-3.1-8B, or DeepSeek-R1-Distill, CLIPO provided consistent gains (up to +1.31 average).
  • Semantic Manifold: Visualization (t-SNE) shows that after training, "correct" reasoning paths form tight, distinct clusters, while "incorrect" ones are scattered.

Figure: t-SNE visualization showing the emergence of a semantic manifold where correct reasoning paths cluster together.

Why It Works: The Denoising Effect

CLIPO acts as a denoising mechanism. By maximizing Mutual Information (MI) between positive rollouts, the model is forced to ignore the "hallucinatory artifacts"—the unique ways to be wrong—and focus on the shared logical steps necessary for success. It effectively "learns" a process reward model implicitly, without ever seeing a single human step-label.
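
The MI framing follows the standard InfoNCE bound from representation learning (van den Oord et al., 2018): minimizing the contrastive loss over N candidates maximizes a lower bound on the mutual information between paired positive embeddings. In notation chosen for this note (not taken from the paper):

```latex
% z_i, z_j: embeddings of two correct trajectories; tau: temperature; N: candidates.
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\left[\log
      \frac{\exp\left(\operatorname{sim}(z_i, z_j)/\tau\right)}
           {\sum_{k=1}^{N} \exp\left(\operatorname{sim}(z_i, z_k)/\tau\right)}\right],
\qquad
I(z_i; z_j) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}
```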

Conclusion and Future Outlook

CLIPO represents a significant shift from "What is the answer?" to "How do we arrive there?" by leveraging the relational structure between solutions. This framework is not limited to math; it holds massive potential for Code Generation and Agentic Planning, where the "search space" for correct solutions is vast but the underlying logic remains consistent.

Limitations

The method depends on the quality of the sampled group. If a group contains only one or zero correct answers, the contrastive signal cannot be computed. Future work may explore cross-prompt contrastive learning to mitigate this "sparse success" bottleneck.


Paper Title: CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
Authors: Sijia Cui, Pengyu Cheng, et al. (Alibaba Qwen Team & CAS)
Code: GitHub - Qwen-Applications/CLIPO
