An introduction to evaluating LLMs
As more and more LLMs have been released over the last six months, comparing model quality has become a favorite pastime. We each have personal experiences with different models, and many folks use different models for different tasks: Bard for analysis & synthesis, Claude for code generation, GPT for general knowledge, and so on. Many of us have also probably looked at rankings like the LMSys model leaderboard or the HuggingFace Open LLM leaderboard, both of which are packed with different numbers.
Despite all the comparisons, quantitative and qualitative, it’s still not clear how we should be thinking about model quality. Is there one best model, or are different models well-suited to different tasks? The answer, as always, is that it depends, but our goal here is to give you an overview of how LLM evaluations work today.
Evaluation Techniques
A quick disclaimer: as an introduction, this post is going to be light on technical details. For a deeper dive, we recommend reading the original MT-Bench paper (which covers many of the techniques behind the LMSys leaderboard) as well as many of the posts from our friend Nathan Lambert at Interconnects.
Aggregate Human Evaluations
The ultimate goal is still to produce an output that a human finds useful. That makes human-based evaluations the gold standard for LLMs.
While any one interaction between a model and a human may be noisy (the human is in a bad mood or phrased their question poorly) or biased (in whatever direction that human happens to be biased), a wisdom-of-crowds argument underlies human-based evaluations. If you aggregate data from enough conversations, which the LMSys Arena has been able to do, you get an accurate reflection of human preferences, at least within the population you’re gathering data from.
The LMSys Arena & Leaderboard have applied this most effectively, generating an Elo score (now using the more stable Bradley-Terry model) for each LLM by showing two responses to each human question and asking the human to pick which one is better. Better is, of course, subjective, but considering LLMs are still (for now? 🧐) serving human needs, the aggregated data is quite accurate.
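To make the mechanics concrete, here is a minimal sketch of how pairwise votes can be turned into Elo-style ratings. It is an illustration of the idea only, not the LMSys pipeline (which now fits a Bradley-Terry model over all votes at once), and the battle log below is made up.

```python
from collections import defaultdict

# Minimal sketch: online Elo updates from pairwise "A vs. B" human votes.
K = 32  # update step size (a common default in chess Elo)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

# Hypothetical battle log: (winner, loser) pairs from human votes.
battles = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for winner, loser in battles:
    update_elo(ratings, winner, loser)

print(dict(ratings))
```

With enough votes, the ratings converge toward a ranking that reflects how often each model wins head-to-head, which is exactly the aggregate signal the leaderboard reports.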
Human evaluations power the LMSys Leaderboard linked above, and the quantity of data LMSys has amassed has made their leaderboard the gold standard in the eyes of many.
Traditional NLP Techniques
Natural language processing has, of course, been around for decades, and researchers were developing benchmarks long before ChatGPT existed. Given how quickly the LLM space has come to the fore, many of these metrics have naturally been applied to LLM evaluations.
For example, one of the most popular metrics is BLEU (bilingual evaluation understudy), which scores generated text by its n-gram overlap with one or more reference texts. As the name might imply, the metric was originally developed to evaluate machine translation, but it has been applied to other text generation tasks as well. While BLEU can be viewed as a reasonable proxy for how close a model’s output is to a reference, such a narrow metric is not a good evaluation of the broad set of tasks an LLM can accomplish.
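As a concrete example, here is how you might compute a sentence-level BLEU score with NLTK’s implementation; the candidate and reference sentences are invented purely for illustration.

```python
# Sentence-level BLEU with NLTK: modified n-gram precision between a candidate
# and one or more references, combined with a brevity penalty.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "the cat sat on the mat".split(),
    "there is a cat on the mat".split(),
]
candidate = "the cat is on the mat".split()

# Smoothing avoids zero scores when a higher-order n-gram never matches.
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform weights over 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```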
LLM-Specific Evaluations
As LLMs have matured, we’ve begun to see the popularization of LLM-specific evaluation techniques. These techniques focus on a breadth of tasks and also evaluate the conversational nature of LLMs.
MMLU (massive multitask language understanding), for example, provides an evaluation set of 57 general-knowledge tasks spanning subjects like math, history, and computer science. As you can imagine, the benchmark is designed to evaluate the breadth and generality of a model’s capabilities, and GPT-3 was the first model to achieve meaningfully above-random accuracy on it. MMLU is also one of the main metrics that the Open LLM leaderboard from HuggingFace uses.
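To give a feel for how a benchmark like MMLU is scored, here is a rough sketch of a multiple-choice accuracy loop. The questions are toy examples (not MMLU data), and `pick_answer` is a hypothetical stand-in for the model under test; real harnesses usually compare the likelihood the model assigns to each answer choice.

```python
# Rough sketch of scoring a multiple-choice benchmark in the style of MMLU.
from typing import Callable, List

questions = [
    # (question, choices, index of correct choice) -- toy examples, not MMLU data
    ("What is 7 * 8?", ["54", "56", "64", "49"], 1),
    ("Which data structure is FIFO?", ["stack", "queue", "heap", "trie"], 1),
]

def accuracy(pick_answer: Callable[[str, List[str]], int]) -> float:
    """Fraction of questions where the model picks the correct choice."""
    correct = 0
    for question, choices, answer_idx in questions:
        if pick_answer(question, choices) == answer_idx:
            correct += 1
    return correct / len(questions)

# A trivial baseline that always picks the first option; on real 4-way
# multiple-choice questions, random guessing sits around 25%, which is the
# "above-random" bar MMLU was designed around.
print(accuracy(lambda question, choices: 0))
```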
MT-Bench is another LLM-oriented evaluation technique that Joey’s group developed at UC Berkeley. The idea is to evaluate a model’s ability to accomplish a task over multiple exchanges with a user. This is an important evaluation technique because, as most of us know from experience, we rarely get what we’re looking for the first time around. While the dataset for MT-Bench is limited (primarily focused on two-turn conversations), this is a critical direction to explore.
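As a rough sketch of what a multi-turn evaluation looks like in practice, the snippet below collects a two-turn exchange by feeding the full conversation history back to the model on each turn. The `generate` function is a hypothetical placeholder for whatever chat API you are testing, and the prompts are made up rather than taken from MT-Bench.

```python
# Sketch of collecting a two-turn exchange: ask an initial question, then a
# follow-up that depends on the first answer. `generate` is a placeholder.

def generate(messages: list) -> str:
    """Placeholder for a chat-completion call to the model under evaluation."""
    return "..."

turns = [
    "Write a short email requesting a meeting next week.",
    "Now rewrite it to be more formal and under 50 words.",
]

messages = []
for user_turn in turns:
    messages.append({"role": "user", "content": user_turn})
    reply = generate(messages)  # the model sees the full history each turn
    messages.append({"role": "assistant", "content": reply})

# `messages` now holds the full two-turn transcript, ready to be scored
# by humans or by a judge model, as discussed next.
```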
The MT-Bench paper linked above also discusses the model-as-a-judge technique, which has become increasingly popular. The idea is to use a state-of-the-art LLM (usually GPT-4) to evaluate the quality of other models’ outputs. This is a fascinating direction, but there are some key pitfalls, ranging from easily tackled problems (GPT-4 has a strong positional bias toward the first answer it sees) to more fundamental flaws (perpetuating the biases of existing models). There’s much more work to be done here!
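Here is a sketch of pairwise model-as-a-judge scoring that includes a common mitigation for positional bias: ask the judge twice with the answer order swapped and only count verdicts that agree. `ask_judge` and the prompt are hypothetical placeholders for however you call your judge model, not the exact setup from the paper.

```python
# Sketch of pairwise model-as-a-judge with a simple positional-bias check.
JUDGE_PROMPT = (
    "You are judging two answers to the same question.\n"
    "Question: {question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
    "Reply with exactly 'A' or 'B' for the better answer, or 'TIE'."
)

def ask_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Placeholder: format JUDGE_PROMPT, send it to the judge model, return 'A', 'B', or 'TIE'."""
    prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    raise NotImplementedError("send `prompt` to your judge model of choice")

def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Query the judge in both orders; only accept a winner if the orders agree."""
    first = ask_judge(question, answer_1, answer_2)   # answer_1 shown as A
    second = ask_judge(question, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    if first == second == "TIE":
        return "tie"
    return "inconsistent"  # the verdict flipped with the order: likely positional bias
```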
Evaluating LLMs is a critical task, and its importance will only grow. As more companies adopt the technology (and as more models come out), understanding which models work well for which tasks is going to matter more and more. While plenty of ink has been spilled on new techniques, there’s a ton of work left to be done, both in general-purpose model evaluation and in specific domains.
One area we’re particularly interested in, and one that has seen limited work so far, is understanding which models are best at learning after pre-training. In other words, which models are the best candidates for fine-tuning? If you’re thinking about this problem, please reach out!