Value functions and dense rewards
In-context learning seems to largely just work, and we have pushed RLHF and RL on verifiable rewards to the limits of economic viability, to the point that Yann LeCun raised a billion dollars in seed funding on the thesis that further scaling has diminishing returns. While LLMs are great coders now, Karpathy describes them as having a “jagged” intelligence boundary: they are simultaneously superhuman and make mistakes a human at that skill level would not make.
“Just train up the LLMs, train on more synthetic data, hire thousands of people to school your system in post-training, invent new tweaks on RL - I think is complete bullshit. It’s just never going to work.” - Yann LeCun
The missing link seems to be a good value function, i.e. a dense reward. There is exciting progress like SDPO, which looks more like an engineering hack than a generalizable solution, yet it outperforms GRPO in initial testing. There is also speculation about what innovation might come next to enable self-guided online learning (“System M”).
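To make the sparse-versus-dense distinction concrete, here is a minimal, illustrative sketch. It is not taken from SDPO or GRPO; all names in it are hypothetical. A sparse setup assigns credit only from one terminal outcome score, while a dense setup scores every intermediate step, e.g. with a learned value function.

```python
# Illustrative sketch only: sparse outcome rewards (one score per rollout)
# vs. dense per-step rewards (a score for every step, e.g. from a learned
# value function). All names here are hypothetical placeholders.

from typing import Callable, List


def sparse_returns(steps: List[str], outcome_reward: float) -> List[float]:
    """Every step gets credit only from the single terminal reward."""
    return [0.0] * (len(steps) - 1) + [outcome_reward]


def dense_returns(steps: List[str], value_fn: Callable[[str], float]) -> List[float]:
    """Each step gets immediate feedback from a (hypothetical) value function."""
    return [value_fn(step) for step in steps]


if __name__ == "__main__":
    rollout = ["plan", "draft solution", "check edge case", "final answer"]

    # Sparse: credit must propagate back from one terminal signal.
    print(sparse_returns(rollout, outcome_reward=1.0))  # [0.0, 0.0, 0.0, 1.0]

    # Dense: every step is scored, so the learning signal per token is much richer.
    toy_value_fn = lambda step: 0.5 + 0.01 * len(step)  # stand-in scorer
    print(dense_returns(rollout, toy_value_fn))
```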
If you look at nature, you could view evolutionary reproductive pressure as a sparse reward: it does not immediately shape behavior, but selects for, and thus shapes, changes in our genetic code. Mammal behavior also has less sparse, more immediate reward functions for meeting basic needs: eating is rewarding when you are hungry, and so is sleeping when you are tired.
There are other reward functions to consider in nature, such as dopamine release when there is a prediction error.1
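That dopamine signal is commonly modeled as a temporal-difference prediction error. A minimal sketch of the computation, with made-up value estimates purely for illustration:

```python
# Minimal sketch of a reward prediction error (temporal-difference error),
# the quantity dopamine neurons are often modeled as signaling.
# The value estimates below are made-up numbers for illustration.

def prediction_error(reward: float, v_next: float, v_current: float, gamma: float = 0.99) -> float:
    """delta = r + gamma * V(s') - V(s): positive when the outcome beats expectations."""
    return reward + gamma * v_next - v_current


# Expected a mediocre state (V=0.2), got food (r=1.0) and a promising next state (V=0.5):
print(prediction_error(reward=1.0, v_next=0.5, v_current=0.2))  # large positive error
# Expected a good state (V=0.8) but got nothing and a dead end (V=0.0):
print(prediction_error(reward=0.0, v_next=0.0, v_current=0.8))  # negative error
```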
If we have a dense reward function that acts as a value function, together with a world model, something like (day-)dreaming may become a way to test different actions and receive a reward. Let’s call this dream-testing. Viewed through the lens of LLMs’ very low token efficiency during training, dream-testing would cut down the number of world-interaction tokens required per unit of intelligence gained.
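A minimal sketch of what dream-testing could look like, assuming we had a learned world model and value function (both stubbed out with toy logic here): candidate actions are rolled out in imagination and ranked by predicted value, so no real-world interaction is spent on the comparison.

```python
# Sketch of "dream-testing": evaluate candidate actions inside a learned world
# model and score the imagined outcomes with a value function, instead of
# spending real-world interactions. Both models are stubbed with toy logic.

from typing import Callable, List


def dream_test(
    state: str,
    candidate_actions: List[str],
    world_model: Callable[[str, str], str],  # predicts next state (assumed learned)
    value_fn: Callable[[str], float],        # dense reward / value estimate (assumed learned)
    horizon: int = 3,
) -> str:
    """Roll each candidate action out in imagination and return the highest-value one."""
    best_action, best_value = candidate_actions[0], float("-inf")
    for action in candidate_actions:
        imagined_state, total_value = state, 0.0
        for _ in range(horizon):
            # Crude imagined policy: keep repeating the candidate action.
            imagined_state = world_model(imagined_state, action)
            total_value += value_fn(imagined_state)
        if total_value > best_value:
            best_action, best_value = action, total_value
    return best_action


if __name__ == "__main__":
    # Toy stand-ins for the learned components.
    toy_world_model = lambda s, a: f"{s}->{a}"
    toy_value_fn = lambda s: -0.01 * len(s) + (1.0 if "rest" in s else 0.0)

    print(dream_test("tired", ["work", "rest", "scroll"], toy_world_model, toy_value_fn))
```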
This would remove the scaling wall imposed by data scarcity, allow further scaling, and likely lead to a less jagged intelligence boundary. Additionally, it would unlock continuous learning in individual situations, where far fewer real-world interactions are available for training.
1. Wolfram Schultz, “Predictive reward signal of dopamine neurons,” Journal of Neurophysiology 80, no. 1 (1998): 1–27.