We need better LLM evaluations
Imagine someone asked you, “Is Postgres or Snowflake better?” You’d probably find that to be an extremely confused question. Better for what, exactly? If you were picking a database to back your new social media app, Snowflake would be a crazy choice — it would be way too slow. If you were trying to process a few TB of data, you should probably go with Snowflake. Obviously, there’s no one answer to the question.
We’ve grown increasingly frustrated with LLM evaluations because they try to do exactly that: provide one general answer about which LLM is “better.” We believe this is misguided.
When we wrote our introduction to evaluating LLMs a couple months ago, we were thinking about how to evaluate different models’ performance relative to each other on a wide range of tasks. We realized, both in our own work and in looking at what’s out there, that there were no simple answers. Existing evals aren’t bad per se, but they stuff a lot of different concepts and skills (instruction following, retrieval, reasoning, etc.) into a single benchmark. They also try to balance these skills with general-purpose quality (a.k.a. vibes-based evals). That’s understandably difficult to do.
Our frustration has crystallized further. A single number that tells us how “good” an LLM is doesn’t really tell us much at all, whether we’re looking for a quick answer to a trivia question or making a fundamental architectural decision for a product. The reality is that the whole concept of one LLM being better or worse than another, in the abstract, doesn’t really make sense.
Different models are naturally going to excel at different tasks (just like humans). For users — especially those building products — having visibility into those tradeoffs is going to be a critical part of the decision-making process. To capture that reality, we need to have a wider set of more nuanced benchmarks.
Here are some of the LLM skills we care about at RunLLM. This list isn’t meant to be comprehensive, just a first cut:
Instruction following: How many instructions can a model track and faithfully execute when given a complex scenario? From our experiments, it’s clear that, for example, GPT-4 is significantly better than GPT-3.5 at this, and Claude 3 Sonnet is somewhere in between but closer to GPT-4. What’s not clear is by how much, or where things break. (A rough way to probe this is sketched after the list.)
Information retrieval: This is what all the long-context needle-in-a-haystack benchmarks are testing. The latest model releases all seem to excel at these tasks.
Determinism: Even with temperature set to 0, models often return inconsistent results on the exact same input. Anecdotally, Haiku seems to do this more often than larger models, but we haven’t tested this extensively. (A simple way to measure this, along with latency and verbosity, is sketched after the list.)
Synthesis: Some models have a tendency to regurgitate the facts they were fed in the prompt, while others are able to connect the dots more intelligently and provide answers that link multiple pieces of information.
Verbosity: We all know that GPT-4 word vomits endlessly. I’d love to see how much of an answer is filler and repetition vs. how much contains new & relevant information. Then we can ask for OpenAI refunds for the useless tokens!
Latency: This is self-explanatory. 🙂
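To make the instruction-following point concrete, here’s a minimal sketch of the kind of probe we have in mind: give a model a prompt with a handful of mechanically checkable constraints, then count how many it actually satisfies. This is not a real benchmark; it assumes the OpenAI Python SDK (v1+) with an API key in the environment, and the prompt, constraint set, and model names are purely illustrative.

```python
# Sketch of an instruction-following probe: prompt with explicit, checkable rules,
# then verify each rule programmatically. Illustrative only, not a benchmark.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = (
    "Summarize the benefits of connection pooling for a web app. "
    "Follow ALL of these rules: "
    "(1) use exactly three bullet points, each starting with '- '; "
    "(2) keep each bullet under 15 words; "
    "(3) do not use the word 'database'; "
    "(4) end the response with the line 'DONE'."
)

def check_constraints(answer: str) -> dict:
    """Return a pass/fail flag for each rule in the prompt."""
    lines = [l.strip() for l in answer.strip().splitlines() if l.strip()]
    bullets = [l for l in lines if l.startswith("- ")]
    return {
        "three_bullets": len(bullets) == 3,
        "short_bullets": all(len(b[2:].split()) < 15 for b in bullets),
        "no_banned_word": "database" not in answer.lower(),
        "ends_with_done": bool(lines) and lines[-1] == "DONE",
    }

def instruction_following_score(model: str, trials: int = 5) -> float:
    """Average fraction of constraints satisfied across several trials."""
    scores = []
    for _ in range(trials):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,
        )
        results = check_constraints(resp.choices[0].message.content or "")
        scores.append(sum(results.values()) / len(results))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    for model in ["gpt-3.5-turbo", "gpt-4"]:  # illustrative model names
        print(model, instruction_following_score(model))
```

A real eval would need many prompts, harder constraint combinations, and checks that can’t be gamed, but even a toy harness like this surfaces where a model starts dropping instructions.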
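Here’s an equally rough sketch for determinism and latency, with answer length as a crude verbosity proxy: call the same model on the same prompt at temperature 0 a handful of times and see how much the outputs and response times vary. It assumes the OpenAI Python SDK and illustrative model names; pointing the same loop at Anthropic’s client would let you run the Haiku comparison mentioned above.

```python
# Sketch of a determinism / latency / verbosity probe: repeat the same call at
# temperature 0 and measure output variance, response time, and answer length.
import statistics
import time

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = "In two sentences, explain what a write-ahead log is."

def probe(model: str, trials: int = 10) -> dict:
    answers, latencies = [], []
    for _ in range(trials):
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,
        )
        latencies.append(time.perf_counter() - start)
        answers.append((resp.choices[0].message.content or "").strip())
    return {
        # Fraction of calls that returned the most common answer; 1.0 = fully deterministic.
        "determinism": max(answers.count(a) for a in set(answers)) / len(answers),
        "distinct_answers": len(set(answers)),
        "p50_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        # Crude verbosity proxy: average word count per answer.
        "mean_words": statistics.mean(len(a.split()) for a in answers),
    }

if __name__ == "__main__":
    for model in ["gpt-3.5-turbo", "gpt-4"]:  # illustrative model names
        print(model, probe(model))
```

Exact-string matching is a blunt measure of determinism (two answers can differ trivially in wording), and word count says nothing about filler versus substance, but even these crude numbers vary noticeably across models.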
This list is probably far from complete, but it’s hopefully illustrative. We can look at MMLU, Elo, and all the other numbers, but they won’t help us pick models that fall in the right parts of the design space across each of these categories.
Joey’s LMSys research group at Berkeley runs the Chatbot Arena. In addition to Elo, they’ve been developing benchmarks that break down model capabilities on a range of tasks. In the next few weeks, they’ll also be extending the Arena to provide category-specific Elo (Bradley-Terry) scores, giving a more nuanced ranking of models across different skills. This is a big step in the right direction, but we believe we need more active community development of evaluation methods, specifically in task-specific domains.
Today, there are very few benchmarks that help us understand these characteristics, so deciding which models to use is a guess-and-check exercise. The closest we’ve seen are the needle-in-a-haystack benchmarks mentioned above, along with some discussion of variance in latency.
This is a glaring gap. Going back to the database analogy from the beginning, you can look at benchmarks like TPC-C and TPC-H to understand the performance of different systems on industry-standard workloads. Benchmarks won’t solve all your problems for you, of course. In practice, you will deploy Postgres, find your workload has an unexpected read-write ratio, and stick Redis in front of it to improve performance. What benchmarks will do is make sure you don’t build your Django app on Snowflake.
Today, every team building with LLMs starts out flying blind and reinventing the wheel along the way. There’s simply no reason this needs to be the case. Instead of working on general-purpose benchmarks that try to cram every property into a single number, we need to move towards benchmarks that measure specific skills. That’s the first step towards letting users make informed tradeoffs.