Is reinforcement learning from human feedback scalable to superhuman AI?

What is RLHF, and where has AI already achieved superhuman results?

Reinforcement learning from human feedback (RLHF) is a technique in which humans provide feedback, such as ratings or corrections, to train an AI agent. It is especially useful when the goal is hard to define mathematically. For example, in training ChatGPT, human preference judgments steer the model toward more helpful and harmless responses [1]. This human-in-the-loop approach has proven effective, but its scalability depends on whether humans can reliably judge the AI's performance.
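To make the idea of learning from preferences concrete, here is a minimal sketch of how a pairwise human judgment can become a training signal. It assumes a Bradley-Terry preference model, which is a common choice for reward models but is not described in the sources above; the function names are illustrative only.

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B,
    given scalar rewards assigned by a (hypothetical) reward model."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under the model.
    Minimizing it pushes the chosen response's reward above the rejected one's."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# The loss is small when the reward model agrees with the human label
# and large when it disagrees.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

The loss depends only on the difference between the two rewards, so the human never has to give an absolute score, only say which response is better.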

In narrow, well-defined domains, AI systems have already reached superhuman performance, although the two examples cited here did not rely on RLHF. Pluribus, an AI trained primarily through self-play, defeated top human professionals in six-player no-limit Texas hold'em poker, a milestone because multiplayer poker was considered a major AI challenge [5]. Similarly, a deep learning system trained on human-annotated data surpassed human accuracy in reconstructing 3D neural circuits from electron microscope images, scoring well above the estimated human benchmark on the SNEMI3D challenge [3]. These examples show that when the task is constrained and success can be precisely defined (e.g., winning at poker or correctly identifying neuron boundaries), learning systems can scale to superhuman levels. The open question is whether a training signal built from human judgments can do the same.

The human bottleneck: can AI feedback replace human feedback?

The main obstacle to scaling RLHF is that gathering high-quality human feedback is expensive and slow, especially for tasks where humans are not experts. A 2023 study directly compared RLHF with an alternative called RL from AI Feedback (RLAIF), in which an off-the-shelf large language model generates the preference labels instead of humans [2]. Across summarization, helpful dialogue, and harmless dialogue tasks, RLAIF achieved performance comparable to RLHF, meaning that AI-generated feedback was nearly as good as human feedback at a fraction of the cost [2]. This suggests that the human bottleneck can be partially bypassed.
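As a rough illustration of AI-generated preference labels, the sketch below turns a labeler model's scores for "candidate 1" versus "candidate 2" into a soft preference and averages over both presentation orders to dampen position bias. The `Labeler` interface and `toy_labeler` are hypothetical stand-ins rather than a real LLM API, and the procedure is an assumption for illustration, not necessarily the exact method of [2].

```python
import math
from typing import Callable, Tuple

# Hypothetical interface: given (context, first, second), return the
# log-probabilities the labeler assigns to answering "1" or "2".
Labeler = Callable[[str, str, str], Tuple[float, float]]

def soft_preference(logp_first: float, logp_second: float) -> float:
    """Softmax over two log-probabilities: probability the first candidate wins."""
    m = max(logp_first, logp_second)  # subtract the max for numerical stability
    e1 = math.exp(logp_first - m)
    e2 = math.exp(logp_second - m)
    return e1 / (e1 + e2)

def ai_preference(labeler: Labeler, context: str, a: str, b: str) -> float:
    """Probability that A is preferred, averaged over both presentation orders."""
    p_when_a_first = soft_preference(*labeler(context, a, b))
    p_when_a_second = 1.0 - soft_preference(*labeler(context, b, a))
    return 0.5 * (p_when_a_first + p_when_a_second)

def toy_labeler(context: str, first: str, second: str) -> Tuple[float, float]:
    """Stand-in for an LLM: favors longer text and, as a deliberate flaw,
    gives a fixed bonus to whichever candidate is shown first.
    The unnormalized scores play the role of log-probabilities."""
    return (len(first) + 0.5, float(len(second)))

# Two equally long candidates: the position bonus cancels out after averaging.
tie = ai_preference(toy_labeler, "ctx", "aa", "bb")
```

The resulting soft labels can be used wherever human preference labels would otherwise go, which is where the cost savings come from.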

However, AI feedback does not remove humans from the loop. The same study found that a technique called direct-RLAIF, which skips training a separate reward model and instead obtains rewards directly from an off-the-shelf LLM during RL, actually outperformed standard RLAIF [2]. Even so, the labeling model must still be told what counts as a good response, so AI feedback can scale while still depending on human-defined goals. The survey on human-in-the-loop RL emphasizes that humans remain essential for defining the task, evaluating the agent, and ensuring safety, especially in high-risk applications [1]. The bottleneck therefore shifts from gathering feedback to designing the right reward structure.
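One way to picture a reward obtained directly from a language model, without an intermediate reward model, is to have the model rate a response on a fixed scale and collapse its score distribution into a single number. The sketch below assumes a 1 to 10 scale and a rescaling to [-1, 1]; these details and the function name are assumptions for illustration and should not be read as the exact recipe of [2].

```python
from typing import Dict

def direct_reward(score_probs: Dict[int, float], low: int = 1, high: int = 10) -> float:
    """Collapse a model's distribution over rating tokens into a scalar reward.

    score_probs maps each rating (low..high) to the weight the model puts on it.
    Weights are renormalized, the expected rating is computed, and the result
    is rescaled linearly so that `low` maps to -1 and `high` maps to +1.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("score_probs must contain positive weight")
    expected = sum(score * p for score, p in score_probs.items()) / total
    return 2.0 * (expected - low) / (high - low) - 1.0

# A response the model rates as mostly 8 or 9 gets a clearly positive reward.
good = direct_reward({7: 0.1, 8: 0.5, 9: 0.4})
# A response split evenly between the extremes gets a neutral reward.
mixed = direct_reward({1: 0.5, 10: 0.5})
```

Note that the rating instructions given to the model are still written by people, which is the sense in which the goal remains human-defined.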

The fundamental limit: can RLHF guide an AI that surpasses human understanding?

The deepest challenge is that RLHF requires humans to evaluate the AI's actions, but once the AI becomes superhuman in a domain, humans may no longer be competent judges. For instance, in the poker example, the AI learned to bluff and strategize in ways that even expert players could not consistently counter [5]. If the AI's reasoning becomes opaque or its strategies exceed human comprehension, human feedback becomes unreliable, a problem noted in the philosophical analysis of superhuman AI [4].

This creates a paradox: RLHF works best when humans can provide accurate feedback, but superhuman AI by definition exceeds human capabilities. The 2024 survey on human-in-the-loop RL highlights that explainability methods are critical to bridge this gap, allowing humans to understand and trust the AI's decisions even when they cannot directly evaluate them [1]. Without such methods, RLHF may hit a ceiling where human feedback is too noisy or too slow to guide further improvement. So while AI can reach superhuman performance in narrow tasks, scaling RLHF to general superhuman intelligence likely requires new techniques, such as AI self-feedback or interpretable reward models, that go beyond current human-in-the-loop approaches.
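The ceiling described above can be illustrated with a toy model of a noisy judge. Assume, purely for illustration, that a judge perceives each response's true quality plus independent Gaussian noise and prefers whichever one seems better. The probability of picking the truly better response then depends only on the ratio of the quality gap to the judge's noise level. The model and names below are assumptions, not results from the cited sources.

```python
import math

def judge_accuracy(quality_gap: float, judge_noise: float) -> float:
    """Probability that a noisy judge prefers the truly better response.

    Each perceived quality is true quality + N(0, judge_noise^2), so the
    perceived difference is N(quality_gap, 2 * judge_noise^2) and the judge
    is correct with probability Phi(gap / (noise * sqrt(2))),
    which equals 0.5 * (1 + erf(gap / (2 * noise))).
    """
    if judge_noise <= 0:
        raise ValueError("judge_noise must be positive")
    return 0.5 * (1.0 + math.erf(quality_gap / (2.0 * judge_noise)))

# When the judge can easily tell the responses apart, labels are almost always right.
clear_case = judge_accuracy(quality_gap=1.0, judge_noise=0.1)
# When the difference is far below the judge's resolution, labels approach coin flips.
beyond_judge = judge_accuracy(quality_gap=1.0, judge_noise=10.0)
```

In this toy picture, improvements that humans cannot perceive produce labels close to random, so additional feedback carries little information about which response is actually better.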

Sources used in this answer

[1] Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities

Human-in-the-loop RL is fundamentally a human-centric paradigm; explainability methods are essential for scaling to superhuman AI, especially when human feedback becomes unreliable.

2024 · Carl Orge Retzlaff, Srijita Das, Christabel Wayllace, Payam Mousavi, Mohammad Afshari, Tianpei Yang, Anna Saranti, Alessa Angerschmid, Matthew E. Taylor, Andreas Holzinger · J. Artif. Intell. Res.

[2] RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

RLAIF (using AI-generated preferences) achieves performance comparable to RLHF across summarization and dialogue tasks, and direct-RLAIF, which obtains rewards directly from an off-the-shelf LLM without training a separate reward model, outperforms standard RLAIF, showing that AI feedback can scale.

2023 · Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash · ICML

[3] Superhuman Accuracy on the SNEMI3D Connectomics Challenge

A 3D U-Net trained on affinity prediction surpassed human accuracy on the SNEMI3D connectomics challenge by a large margin, demonstrating superhuman performance in a narrow, well-defined task.

2017 · Kisuk Lee, Jonathan Zung, Peter Li, Viren Jain, H. Sebastian Seung · arXiv

[4] Superhuman AI

Superhuman AI raises philosophical questions about human ability to evaluate or control systems that exceed human competence, highlighting a fundamental limit for RLHF.

2023 · Gabriele Gramelsberger · Philosophisches Jahrbuch

[5] Superhuman AI for multiplayer poker

Pluribus defeated top human professionals in six-player no-limit Texas hold'em poker, a milestone showing that AI trained through self-play can achieve superhuman performance in complex multiplayer games.

2019 · Noam Brown, Tuomas Sandholm · Science (New York, N.Y.)
