The next big breakthrough will be AIs learning on the job · Dwarkesh Podcast
Science, Technology & Innovation · Jun 26, 2026
The text proposes a speculative “dreaming” scaling axis in which deployed models build simulators and train inside them to amplify sparse, high-value real-world signals, analogous to model-based RL methods such as EfficientZero. If this works, post-deployment improvement would shift toward organizations with wide distribution and distillation infrastructure, though the text acknowledges that full-world simulation is very hard.
The critique argues that RL with verifiable rewards (RLVR), run in reproducible environments, can yield strong short-horizon, within-session agents but may fail to generalize to long-horizon, open-ended real-world competence. The key empirical question is therefore whether competence learned in containerized tasks transfers across horizon length, ambiguity, and real-world interaction.
AI progress driven by reinforcement-learning methods depends less on clear success metrics than on whether tasks can be converted into highly replayable, deterministic, massively parallel training environments. Coding and math fit this mold. Web and real-world economic domains (orders, travel, taxes, business, litigation, trading, politics) lag because their workflows cannot be replayed, their feedback is slow, sparse, and non-stationary, and they are guarded by bot defenses. The main bottleneck is therefore building simulators and environments, not simply training larger models.
On-policy self-distillation (OPSD) trains a base “student” model to absorb in-session knowledge from a context-rich “teacher” by matching the teacher’s token-level predictions. This offers denser supervision and avoids an outer-loop reward, and it is claimed to outperform both RLVR and naive transcript fine-tuning for continual learning. One noted failure mode is that teacher guidance can become harmful after the student makes off-distribution errors, which Trajectory-Refined Distillation can mitigate. The approach is presented as a practical way to turn deployment traces into model improvement.
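The token-level matching idea can be illustrated with a toy loss. This is only a minimal sketch under stated assumptions, not the method from the episode: the function names (`softmax`, `kl_divergence`, `opsd_loss`) are hypothetical, the logits are hand-written stand-ins for model outputs, and the choice of KL(teacher || student) as the divergence is an assumption. In a real setup both sets of logits would be computed on tokens the student itself sampled, with the teacher seeing extra in-session context.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def opsd_loss(student_logits, teacher_logits):
    """Mean per-position KL(teacher || student).

    Each argument is a list with one logit vector per token position.
    Every position contributes its own training signal, which is what
    makes the supervision dense compared with one reward per episode.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy example: a 3-token vocabulary and a 2-token student-sampled sequence.
# Position 0: student and teacher agree. Position 1: the teacher, which
# (hypothetically) saw extra context, strongly prefers token 1.
student_logits = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher_logits = [[2.0, 0.5, 0.1], [0.1, 2.5, 0.2]]

loss = opsd_loss(student_logits, teacher_logits)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero wherever the student already matches the teacher and positive where the teacher's extra context changes its prediction, so gradient steps on it would pull only the disagreeing positions toward the teacher.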
The document frames continual learning as a compression tradeoff. In-context learning is sample-efficient but stores a large KV cache for every token (a claimed 320 KB/token for Llama 3 70B), whereas weights store about 0.075 bits/token, a gap of roughly 35 million times. Scaling context windows therefore will not substitute for durable learning. Moving knowledge into weights requires massive repeated signals (e.g., Cursor Tab’s 400M+ daily accepted-edit objective), which implies that enterprise on-the-job adaptation needs intermediate mechanisms that combine in-context sample efficiency with weight-like compression. That makes architecture and learning-rule innovation crucial.
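The quoted figures can be reproduced with a back-of-the-envelope calculation. The summary does not say how they were derived, so the inputs below are assumptions taken from the publicly reported Llama 3 70B configuration (80 layers, 8 key-value heads, head dimension 128, 16-bit values, about 15 trillion training tokens); the variable names are illustrative only.

```python
# Assumed Llama 3 70B configuration (not stated in the summary itself).
N_LAYERS = 80          # transformer layers
N_KV_HEADS = 8         # key/value heads under grouped-query attention
HEAD_DIM = 128         # dimension per attention head
BYTES_PER_VALUE = 2    # 16-bit cache entries
N_PARAMS = 70e9        # parameter count
BITS_PER_PARAM = 16    # 16-bit weights
TRAIN_TOKENS = 15e12   # approximate pretraining token count

# KV cache per token: one key and one value vector per KV head per layer.
kv_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_bits = kv_bytes * 8

# Weight capacity spread over every token seen in training.
weight_bits_per_token = N_PARAMS * BITS_PER_PARAM / TRAIN_TOKENS

ratio = kv_bits / weight_bits_per_token

print(f"KV cache: {kv_bytes / 1024:.0f} KiB per token")
print(f"Weights:  {weight_bits_per_token:.3f} bits per token")
print(f"Gap:      about {ratio / 1e6:.0f} million times")
```

Under these assumptions the KV cache comes to 327,680 bytes (320 KiB) per token and the weights to about 0.075 bits per token, which matches the roughly 35-million-fold gap quoted above.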