Last week, we released an interview with Nathan Lambert (Research Scientist at HuggingFace) on RLHF and LLM evaluations. This has become one of our most watched interviews already, and it’s clearly a topic that many people are interested in. I recommend watching the whole interview, but I also wanted to share some takeaways from the conversation.
LLMs are a (the?) key application of reinforcement learning. I noticed a lull in interest in the RL space in the late 2010s, but LLMs have rekindled interest in RL with human feedback (RLHF) in particular. Nathan shared a neat framing: RLHF is effectively a fine-tuning mechanism that lets you integrate human preferences into a model. Though we don’t have a good understanding of why, Nathan shared that, anecdotally, RLHF significantly outperforms traditional supervised learning for language models. (He’s working on showing this empirically as well — stay tuned!)
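To make the “integrate preferences” idea a bit more concrete, here is a minimal sketch of the pairwise preference loss commonly used to train the reward model at the heart of an RLHF pipeline. This is just an illustration of the standard Bradley–Terry-style objective, not code from the interview; `reward_model`, `chosen_ids`, and `rejected_ids` are hypothetical placeholders.

```python
# Minimal sketch (illustrative only): the pairwise preference loss used to
# train a reward model from human comparisons. `reward_model` is assumed to
# map a tokenized response to a scalar score.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Push the reward of the human-preferred response above the rejected one."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # response scores well above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The policy model is then fine-tuned (typically with PPO or a similar RL algorithm) to maximize this learned reward, which is how the preferences end up baked into the model.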
Nathan also shared that properly instruction-tuning a model with RLHF is a $6-10MM data investment and requires a team of 5-20 engineers. Definitely not a trivial undertaking, but it’s clearly worth the investment if you can afford it!
RL with computational feedback is a more scalable technique. RLCF refers to a category of RL where the feedback is provided by a computational process (similar to Constitutional AI / RLAIF from Anthropic). Nathan describes in detail in the interview how you could evaluate code generation models in a fully automated way — check whether the code compiles, lint it, check whether it runs, execute basic test cases on it, and so on. This is likely to be much cheaper and more scalable than involving humans at every step of the process. Code generation is the most obvious example, but there are likely other creative applications of RLCF in domains narrower than general text generation. Check out Nathan’s blog on RLCF!
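To make that concrete, here is a rough sketch of what a fully automated reward for generated code could look like. This is my own illustration under simple assumptions (a hypothetical `code_reward` function with arbitrary weights), not an implementation from Nathan or the interview.

```python
# Rough sketch (illustrative only): scoring a generated Python snippet with
# purely automated checks, as a stand-in for human feedback.
import subprocess
import sys
import tempfile

def code_reward(generated_code: str, test_script: str) -> float:
    """Score generated code: does it parse, and does it pass basic tests?"""
    score = 0.0

    # 1. Does the code parse/compile at all?
    try:
        compile(generated_code, "<generated>", "exec")
        score += 0.3
    except SyntaxError:
        return score

    # 2. Does it run and pass the provided test cases?
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n" + test_script)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
        if result.returncode == 0:
            score += 0.7
    except subprocess.TimeoutExpired:
        pass  # hanging code earns no extra reward

    # 3. A linter pass could add partial credit here as well.
    return score
```

The exact checks and weights would of course depend on the task; the point is that every step is cheap and automatic, so you can generate feedback at a scale that would be impossible with human raters.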
GPT-4 is an okay judge but has some pitfalls. Nathan and my students have both looked into using GPT-4 as a judge to compare and evaluate the quality of other models’ outputs. We discovered in parallel that GPT-4 is extremely positionally biased, preferring the first answer it sees over the second, and this polluted a number of our early benchmarks. There’s still hope for using GPT-4 as a judge, but Nathan believes it will require more nuanced benchmarks than what my group has built with MT-Bench. Nathan is thinking about launching a “ChatBench,” which would go beyond the two turns we studied in MT-Bench. Interesting tidbit: Nathan shared that the original motivation for HuggingFace’s Open LLM Leaderboard was as an internal evaluation method for ongoing training jobs.
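For anyone experimenting with LLM-as-a-judge setups, a common mitigation for positional bias is to query the judge in both orders and only trust consistent verdicts. The sketch below is one way to do this; `ask_judge` is a placeholder for whatever API call returns “A” or “B” from the judge model, not a real library function.

```python
# Sketch (illustrative only): reduce positional bias by asking the judge for
# a verdict in both orders and discarding inconsistent answers.
from typing import Callable, Optional

def debiased_verdict(
    ask_judge: Callable[[str, str], str],  # placeholder judge call returning "A" or "B"
    answer_1: str,
    answer_2: str,
) -> Optional[str]:
    """Return 'answer_1', 'answer_2', or None if the judge is inconsistent."""
    first = ask_judge(answer_1, answer_2)   # judge sees answer_1 in position "A"
    second = ask_judge(answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None  # verdict flipped with position -> likely positional bias
```

Discarding or down-weighting the inconsistent cases (or averaging over both orders) is roughly what we ended up doing to clean up our own benchmarks.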
The future of open source models is uncertain. Anthropic’s CEO believes that by 2025 we’ll have $10B models (in training costs), which means only a few companies will have the ability to build these models. The investment will likely mean that most teams rely on a few models that are very expensive but incredibly versatile. These models will almost certainly be proprietary, but there’s the potential to distill smaller, open-source domain-specific models from the data generated by larger proprietary models (pending legal issues around how this data is licensed, of course).
This was one of the most fun interviews I’ve done to date. Check out our YouTube channel for more interviews, keep an eye on Nathan’s blog for more discussion of these topics, and let me know if you have suggestions for other folks we should interview.
If you’re a fan of the interview summary format, let me know, and we’ll sprinkle more of these in!