RE: LeoThread 2025-07-13 23:00


Scaling up reinforcement learning is a hot topic these days. While it is expected to deliver further gains in the near term, it probably isn't the complete picture.



Fundamentally, reinforcement learning adjusts the likelihood of actions based on whether an outcome was favorable or not.
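That mechanism can be made concrete with a toy REINFORCE-style update (a hypothetical minimal sketch, not any particular production implementation): actions taken in a rollout with a favorable outcome get their probability pushed up, unfavorable ones pushed down.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, action, reward, lr=0.5):
    """Nudge logits so the taken action becomes more likely when
    reward > 0 and less likely when reward < 0 (policy-gradient step
    for a single-step episode)."""
    probs = softmax(logits)
    return [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0]                                  # uniform two-action policy
logits = reinforce_update(logits, action=0, reward=1.0)  # action 0 paid off
probs = softmax(logits)                              # action 0 is now favored
```

The whole learning signal here is that one scalar `reward`, which is exactly the property the rest of this thread pokes at.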


This approach offers significant leverage over explicit supervision: instead of labeled demonstrations, you only need a verifier function that can score the final outcome.


However, a potential limitation becomes evident for longer tasks that involve minutes or hours of interaction, where substantial effort is spent to derive a single scalar outcome to guide the gradient.


Moreover, this mechanism does not entirely mirror the human method of learning for most intelligence tasks, which typically involves a reflective review phase.


In such cases, the process includes evaluating what worked, what did not, and how to improve next time—generating explicit lessons akin to adding a new instruction to the system prompt, which are sometimes later distilled into the model's underlying capabilities. For instance, the recent memory feature is an early step in this direction in some models, though it currently supports personalization more than robust problem-solving.


This reflective aspect is missing in settings like Atari reinforcement learning, which had no large language models and no in-context learning to draw on.


An illustrative algorithm involves executing several rollouts for a given task, compiling them with their respective rewards, and using a meta-prompt to review the successes and failures.


This process produces a “lesson” that can be incorporated into a system prompt or broader lessons database.
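The loop described above could be sketched roughly as follows. This is a hypothetical outline, not an existing API: `run_rollout` and `ask_model` stand in for whatever agent harness and LLM call you actually have, and `lessons_db` is just a list.

```python
def reflect_and_store(task, run_rollout, ask_model, lessons_db, n_rollouts=4):
    """Run several rollouts, pack them with their rewards into a
    meta-prompt, and ask a reviewer model to extract one explicit lesson."""
    # Each rollout returns a (transcript, reward) pair.
    rollouts = [run_rollout(task) for _ in range(n_rollouts)]

    report = "\n\n".join(
        f"Rollout {i} (reward={reward}):\n{transcript}"
        for i, (transcript, reward) in enumerate(rollouts)
    )
    meta_prompt = (
        f"Task: {task}\n\n{report}\n\n"
        "Review what worked and what failed across these attempts, "
        "then state one concrete lesson for next time."
    )

    lesson = ask_model(meta_prompt)
    lessons_db.append(lesson)  # later prepended to the system prompt
    return lesson
```

The interesting part is that the "gradient" here is a sentence in context rather than a scalar applied to the weights.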


One example of such a lesson addresses the difficulty language models have with tasks like counting letters due to tokenization issues.


An early fix involved explicitly instructing the model to list letters with commas and count them individually.
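The workaround itself is easy to illustrate in plain code (a toy stand-in for what the instructed model does, not a model call): separate the letters with commas first, so each one is its own unit, then count matches one by one.

```python
def spell_out(word):
    """Separate each letter with a comma, as the instruction asks the
    model to do: 'strawberry' -> 's, t, r, a, w, b, e, r, r, y'."""
    return ", ".join(word)

def count_letter(word, letter):
    """Count occurrences of `letter` by walking the spelled-out form
    one letter at a time, mimicking the instructed procedure."""
    letters = spell_out(word).split(", ")
    return sum(1 for ch in letters if ch == letter)

n = count_letter("strawberry", "r")  # -> 3
```

The point of the comma trick is that tokenization no longer hides letters inside multi-character tokens, so the model can attend to each one individually.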


The challenge now lies in making these lessons emerge dynamically from the learning process rather than being manually engineered, and in distilling these lessons over time without unnecessarily bloating the context.


In summary, while reinforcement learning promises further gains because it is highly leveraged and economically efficient, it does not seem to capture everything about how learning works—especially as tasks grow longer.


There may be additional improvement curves to explore that are specific to large language models, potentially opening up exciting new avenues beyond traditional game or robotics environments.
