The next big breakthrough will be AIs learning on the job

Dwarkesh Podcast

Jun 26, 2026

Dreaming As A Fourth Scaling Axis For Post-Deployment Learning Through Internal Simulations

Science, Technology & Innovation · Jun 26, 2026

The text proposes a speculative “dreaming” scaling axis: deployed models build simulators and train inside them to amplify sparse, high-value real-world signals, analogous to model-based RL methods such as EfficientZero. If this works, post-deployment improvement could shift toward organizations with wide distribution and distillation infrastructure, though the text acknowledges that full-world simulation is very hard.
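
To make the amplification idea concrete, here is a minimal tabular sketch in the spirit of Dyna-style model-based RL. It is not EfficientZero and not the method described in the episode; the chain task, the `real_trace` data, and every function name are hypothetical. One logged real episode is memorized as a model, then replayed ("dreamed") many times so that a single sparse reward propagates back to the starting state.

```python
from collections import defaultdict

ACTIONS = ("left", "right")
GAMMA, ALPHA = 0.9, 0.5  # discount and learning rate (illustrative values)

def q_update(q, s, a, r, s_next, terminal):
    """One Q-learning backup for a single transition."""
    best_next = 0.0 if terminal else max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])

# One real episode on a 5-state chain: reward arrives only at the last step.
real_trace = [
    (0, "right", 0.0, 1, False),
    (1, "right", 0.0, 2, False),
    (2, "right", 0.0, 3, False),
    (3, "right", 1.0, 4, True),
]

def learn(trace, dream_sweeps):
    """Learn from real transitions once, then replay the memorized model."""
    q = defaultdict(float)
    model = {}  # (state, action) -> (reward, next_state, terminal)
    for s, a, r, s_next, terminal in trace:
        q_update(q, s, a, r, s_next, terminal)
        model[(s, a)] = (r, s_next, terminal)
    for _ in range(dream_sweeps):  # simulated experience, no new real data
        for (s, a), (r, s_next, terminal) in model.items():
            q_update(q, s, a, r, s_next, terminal)
    return q

real_only = learn(real_trace, dream_sweeps=0)
dreamed = learn(real_trace, dream_sweeps=20)
print(real_only[(0, "right")], dreamed[(0, "right")])
```

With no dreaming, the start state learns nothing from the single episode because the reward has not yet propagated backward; with replay through the learned model, its value becomes positive without any further real interaction. A memorized lookup table is the easiest possible simulator, which is exactly why the full-world case is so much harder.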


RLVR Faces Horizon Generalization Challenge: Short-Horizon Training May Not Generalize To Long-Horizon Real-World Competence

The critique argues that reinforcement learning with verifiable rewards (RLVR), run in verifiable, reproducible environments, can yield strong short-horizon, within-session agents but may fail to generalize to long-horizon, open-ended real-world competence. The key empirical question is therefore whether competence learned in containerized tasks transfers across horizon length, ambiguity, and real-world interaction.


Reinforcement Learning Progress Depends On Replayable Environments And Simulator Design More Than Model Scale

AI progress driven by reinforcement-learning methods depends less on having clear success metrics than on whether tasks can be converted into highly replayable, deterministic, massively parallel training environments. Coding and math fit this mold. Web and real-world economic domains (orders, travel, taxes, business, litigation, trading, politics) lag because their workflows are non-replayable, their feedback is slow, sparse, and non-stationary, and they sit behind bot defenses. The main bottleneck is therefore building simulators and environments rather than simply training larger models.
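
As a rough illustration of what "replayable and deterministic" means in practice, the sketch below defines a toy verifiable task whose every episode is fully determined by a seed. The class `ArithmeticEnv`, the `rollout` helper, and the policies are hypothetical names invented for this example, not an API from the episode or from any RL library.

```python
import random

class ArithmeticEnv:
    """Toy verifiable task: answer a seeded addition problem."""

    def reset(self, seed):
        rng = random.Random(seed)  # all randomness flows from the seed
        self.a, self.b = rng.randint(0, 99), rng.randint(0, 99)
        return f"{self.a} + {self.b} = ?"

    def step(self, answer):
        # Verifiable reward: checked mechanically, no human judgment needed.
        return 1.0 if answer == self.a + self.b else 0.0

def rollout(seed, policy):
    """Run one episode; the same seed and policy always give the same result."""
    env = ArithmeticEnv()
    prompt = env.reset(seed)
    return prompt, env.step(policy(prompt))

def adder_policy(prompt):
    left, _, right, *_ = prompt.split()
    return int(left) + int(right)

# Episodes are independent, so a real system could run them in parallel.
results = [rollout(seed, adder_policy) for seed in range(4)]
print(results)
```

Booking travel or filing taxes has no equivalent of `reset(seed)`: the world cannot be rewound to an identical state, and feedback may arrive weeks later, which is the gap the summary points to.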


On-Policy Self-Distillation Provides Reward-Free Dense Supervision For Continual Learning By Distilling Session Context Into The Base Model

On-policy self-distillation (OPSD) trains a base “student” model to absorb in-session knowledge from a context-rich “teacher” by matching the teacher’s token-level predictions. This offers denser supervision and avoids an outer-loop reward, and it is claimed to outperform RLVR and naive transcript fine-tuning for continual learning. The text notes one failure mode, in which teacher guidance can become harmful after the student makes off-distribution errors, and says Trajectory-Refined Distillation can mitigate it. OPSD is presented as a practical way to turn deployment traces into model improvement.
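
The token-level matching objective can be sketched as a per-position divergence between teacher and student next-token distributions, averaged over a trajectory the student itself sampled. This is a minimal sketch under assumptions: the summary does not say which divergence is used, so the forward KL below is a placeholder choice, and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (assumed divergence)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_steps, student_steps):
    """Mean per-token KL along a trajectory sampled from the student.

    teacher_steps: logits from the model WITH the session context in its prompt.
    student_steps: logits from the same model WITHOUT that context.
    """
    kls = [token_kl(t, s) for t, s in zip(teacher_steps, student_steps)]
    return sum(kls) / len(kls)

# Toy 3-token vocabulary, 2-token trajectory.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student = [[0.5, 0.5, 0.0], [0.0, 1.0, 1.0]]
print(distillation_loss(teacher, student))
```

Every token position contributes a full distribution's worth of signal, in contrast to one scalar reward per episode. The failure mode mentioned above also shows up here: the teacher is scored on prefixes the student chose, so once the student drifts off-distribution the teacher's targets may no longer be useful.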


Continual Learning Requires An Intermediate Mechanism To Combine In-Context Efficiency And Weight-Based Generalization

The document frames continual learning as a compression tradeoff. In-context learning is sample-efficient but stores a large KV cache per token (claimed 320 KB/token for Llama 3 70B), whereas weights store ~0.075 bits/token, a gap of roughly 35M×. Scaling context windows therefore won’t substitute for durable learning. Moving knowledge into weights requires massive repeated signals (e.g., Cursor Tab’s 400M+ daily accepted-edit objective), which implies that enterprise on-the-job adaptation needs intermediate mechanisms that combine in-context sample efficiency with weight-like compression. That makes architecture and learning-rule innovation crucial.
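
The figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below is one way to arrive at them, assuming values the summary does not state: 80 layers, 8 key-value heads of dimension 128, 16-bit cache entries, 70 billion 16-bit parameters, and roughly 15 trillion pretraining tokens. The function names are hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def weight_bits_per_token(n_params, bits_per_param, n_training_tokens):
    """Weight storage divided evenly across every token seen in training."""
    return n_params * bits_per_param / n_training_tokens

# Assumed Llama 3 70B-style configuration (not stated in the summary).
kv_bytes = kv_cache_bytes_per_token(
    n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2
)  # 327,680 bytes = 320 KiB

weight_bits = weight_bits_per_token(
    n_params=70e9, bits_per_param=16, n_training_tokens=15e12
)  # about 0.075 bits

gap = kv_bytes * 8 / weight_bits  # compare bits with bits
print(f"KV cache: {kv_bytes / 1024:.0f} KiB/token")
print(f"Weights:  {weight_bits:.3f} bits/token")
print(f"Gap:      {gap / 1e6:.1f}M x")
```

Under these assumptions the ratio comes out near 35 million, matching the claimed gap. The comparison is deliberately crude: it treats weights as if they stored an equal share of every training token, which is the point of the framing rather than a measurement.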