Blue star for you...
Here is my try at a more complete explanation, fwiw.
The "rewards" step - assume that the judging of the of the rewards simply reversed the grading. IOW's the worst answer is scored highest. Not getting into any math, vocabulary, or underlying stat concepts.
The next "run" will try to maximize the expected value of the reward --> where the reward is maximized based on how close we are to the now backward answers -- with a very small penalty for getting farther away from our original model (First run)
Obviously, our new model can be very different from the original results. That penalty for straying from the original model is usually very small compared to the reward.
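FWIW, here's a rough toy sketch (Python, just to make the arithmetic concrete) of that "reward minus a small penalty" idea. None of it is real training code - the answers, the numbers, and the BETA penalty weight are all made up - but it shows why flipping the graders' scores flips which answer "wins": the small stay-close-to-the-original penalty can't pull things back.

```python
# Toy sketch (NOT real RLHF code): a tiny "stay close to the original model"
# penalty gets swamped by the reward term when the grading is reversed.
# Every number below is invented purely for illustration.

answers = {
    "careful, correct answer": {"true_quality": 0.9, "original_model_pref": 0.8},
    "mediocre answer":         {"true_quality": 0.5, "original_model_pref": 0.5},
    "flat-out wrong answer":   {"true_quality": 0.1, "original_model_pref": 0.1},
}

BETA = 0.05  # the very small penalty for straying from the original model

def training_objective(info, reward):
    # the reward dominates; the penalty only nudges us back toward whatever
    # the original (first-run) model already preferred
    penalty = 1.0 - info["original_model_pref"]
    return reward - BETA * penalty

# Case 1: graders score sensibly (reward tracks true quality)
sensible = {name: training_objective(info, info["true_quality"])
            for name, info in answers.items()}

# Case 2: graders' scores are reversed (worst answer rewarded most)
backward = {name: training_objective(info, 1.0 - info["true_quality"])
            for name, info in answers.items()}

print("sensible grading favors:", max(sensible, key=sensible.get))
print("backward grading favors:", max(backward, key=backward.get))
```

Run it and the "winner" flips from the careful answer to the flat-out wrong one, even though the model is nominally penalized for changing.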
In the same way that scoring the answers backwards can greatly change the model, so can failing to control for engagement. The graders' instructions and natural preferences strongly favor longer, friendlier, more complex answers that are easier to justify as 'better'. So the revised model can easily start to overemphasize engagement. Add in explicit instructions to emphasize engagement (I have no knowledge one way or the other whether there were any) and you have a big problem.
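Same kind of toy sketch for the engagement problem: the grader doesn't have to be malicious or follow a reversed rubric, just slightly partial to long, warm, impressive-looking answers. The weights below (ENGAGEMENT_BONUS, BETA, etc.) are invented, not anything any lab actually uses.

```python
# Toy sketch: a grader who is merely biased toward engaging-looking answers.
# All weights and scores are made up for illustration.

BETA = 0.05             # small "stay close to the original model" penalty
ENGAGEMENT_BONUS = 0.5  # how much length/friendliness flatters the grader

candidates = [
    {"name": "short, blunt, correct",
     "true_quality": 0.9, "engagement": 0.2, "original_pref": 0.8},
    {"name": "long, warm, padded, only half-right",
     "true_quality": 0.6, "engagement": 0.9, "original_pref": 0.4},
]

def grader_score(c):
    # what the judge actually rewards: quality PLUS engagement appeal
    return c["true_quality"] + ENGAGEMENT_BONUS * c["engagement"]

def training_objective(c):
    return grader_score(c) - BETA * (1.0 - c["original_pref"])

for c in candidates:
    print(f"{c['name']}: objective = {training_objective(c):.2f}")

winner = max(candidates, key=training_objective)
print("the revised model drifts toward:", winner["name"])
```

The padded, half-right answer comes out ahead, and the tiny penalty for drifting from the original model barely registers.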
The really dangerous part, though, is sucking up (what researchers call 'sycophancy'): models quickly learn that agreeing with the user (even when the user is wrong or toxic) tends to get the highest judges' scores and leads to higher engagement metrics. This creates a powerful positive-feedback loop that can amplify bad thoughts and deeds, all while looking simply 'engaging' to the rater.
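And here's a made-up sketch of that feedback loop. Assume only that agreeable answers score a little higher with the judges on average, and that each training round reinforces whatever scored higher - the small edge compounds. (The 10% edge and the round counts are invented numbers, nothing more.)

```python
# Toy sketch of the sycophancy feedback loop: a small, consistent scoring
# edge for "just agree with the user" compounds across training rounds.
# All numbers are illustrative.

AGREE_EDGE = 0.10       # agreeing scores ~10% higher with judges, on average

weight_agree = 1.0      # how strongly the model leans toward agreeing
weight_pushback = 1.0   # ...versus honestly pushing back

for round_num in range(1, 31):
    # each round, the higher-scoring behavior gets reinforced a bit more
    weight_agree *= 1.0 + AGREE_EDGE
    p_agree = weight_agree / (weight_agree + weight_pushback)
    if round_num == 1 or round_num % 10 == 0:
        print(f"after round {round_num:2d}: chance the model just agrees = {p_agree:.0%}")
```

A 10% edge per round takes the model from a coin flip to agreeing about 95% of the time by round 30 - and it looks like great 'engagement' to the raters the whole way.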
This is where OpenAI is at fault. They did not control for engagement in the building/"judging" of their model, despite knowing that mishandling engagement presents a very real danger to their users. And the guardrails they put in were ineffectual.
BTW - I completely agree no one wrote a rule that said "encourage this boy to ... "
But that sort of explicit rule is not the way that LLMs/AIs are constructed, so it's not really relevant to this discussion.
ETA - sorry this reply took so long. Started it much earlier, then got caught up in something.