We’ll be the first to tell you that we’re obsessed with delivering high-quality responses at RunLLM. Spending time understanding, analyzing, and improving quality is core to how we build trust with our customers — and, if you ask our engineering team, core to what we nag everyone about. But what are we actually doing when we say we’re optimizing for high-quality responses?
The truth is that there’s no clear definition for what “high-quality” AI actually is. Typically, when we say high quality, we mean, roughly, “We got the answer that we expected to get from the LLM.” In other words, we don’t have an empirical way to measure what high-quality AI is, so we’re mostly relying on whether it meets our expectations. Without an empirical measure, we fall back to trying things out and seeing how the LLM responds. This is what we’ve heard jokingly (though increasingly seriously) called vibes-based evals.
The scientists in us cringe at how fluffy this is, but (to some extent) it’s not bad. We used to begrudgingly tolerate vibes-based evals, but they’ve started to grow on us in the last few months. Slowly, we’ve come around to thinking that vibes are a great place to start (if ultimately insufficient).
Why vibes aren’t so bad
Earlier this year, we were frustrated by the idea of vibes-based evals. Most customers we worked with at RunLLM didn’t have pre-built test sets, so we had to rely on just trying some questions to see what worked.
We think of this as a form of the blank page problem — you look at an empty chat window and think, “Okay, well, what do I type here?” and enter the first thing that comes to mind. The style and content of those first few responses have an outsized impact on someone’s impression of the quality of a product.
Having done dozens of POCs with customers in the last few months, we’re starting to feel more warmly about this approach than we did before. The thing with a product like RunLLM is that your customers aren’t going to run a disciplined, scalable evaluation process to determine whether the answer they get is correct. They’ll ask a question, get an answer, and be satisfied if it solved their problem. If you evaluate our product the same way, that’s probably a reasonable place to start.
Even beyond that, we’ve found that evaluating incrementally and without a pre-defined test set has its benefits.
Incremental improvement builds trust. Given that we’re in the early days of AI, it’s unlikely that even the best designed assistant is going to be perfectly suited to a new product. Working with customers to improve incrementally helps build trust. On the other hand, presenting metrics as a static artifact can feel limiting.
End-user requests can be unexpectedly weird. Building test sets is great, and we’d still recommend it to everyone building AI products. In practice, though, real-world requests are going to be occasionally hilariously out-of-distribution in a way that test sets will struggle to capture. (Some of our favorite examples.) Having these requests interspersed with substantive ones can help test the bounds of your system well.
Some feedback isn’t measurable. We’ve found that customer preferences can vary a ton around things like conciseness vs. detail and citation depth. Much of this doesn’t come out in a first conversation. Only after trying many questions can someone say with confidence that they want the assistant to behave in a certain way — the hands-on-keyboard time turns out to be invaluable.
Why not use other evals?
With all that said, wouldn’t it still be better to have some metrics? They might not be comprehensive or perfect, but something’s better than nothing… right?
The short answer is yes. The issue is that there aren’t great evaluation frameworks out there. Measures like MMLU attempt to capture too many different skills in a single number. The LMSys Leaderboard has begun to help on this front by capturing task-specific Elo scores (e.g., coding, instruction following), but this unfortunately doesn’t help us show how good an individual product is.
Even still, we think we can do better as a community. As we’ve argued in the past, we strongly believe we need better LLM evaluations. We even built our own evaluation framework at RunLLM to help guide our development and our customer conversations. None of those opinions have changed.
What we’ve found in building our own evaluation framework is that, of course, it has pitfalls too. The biggest issue is that metrics are hard to understand. Our evaluation framework for RunLLM measures correctness, coherence, conciseness, and relevance — each of these criteria has a specific rubric we use for scoring. Unfortunately, you’ll need to closely read each answer and the rubric to justify the score that you see on the screen. And because we’re using an LLM as a judge, you’ll occasionally see some outlier scores that make you question the trustworthiness of the results. Unless you’re planning on using the same test set to evaluate many products, the time you’ll spend building trust in a metrics framework is better spent building trust in the product itself.
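To make concrete what rubric-based, LLM-as-a-judge scoring looks like in practice, here’s a minimal sketch. The criteria names mirror the ones above, but the rubric text, the `openai` client usage, and the model choice are illustrative assumptions, not our production framework.

```python
# Minimal LLM-as-a-judge sketch: score one answer against a per-criterion rubric.
# Rubric wording and model name are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRICS = {
    "correctness": "5 = factually right and complete; 1 = wrong or misleading.",
    "coherence": "5 = logically structured and easy to follow; 1 = disjointed.",
    "conciseness": "5 = no filler; 1 = padded with irrelevant material.",
    "relevance": "5 = directly addresses the question; 1 = off-topic.",
}

def judge(question: str, answer: str) -> dict:
    """Ask a judge model to score an answer on each criterion, returning JSON."""
    rubric_text = "\n".join(f"- {name}: {desc}" for name, desc in RUBRICS.items())
    prompt = (
        "Score the answer below on each criterion from 1 to 5 using this rubric:\n"
        f"{rubric_text}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        'Respond with JSON only, e.g. {"correctness": 4, "coherence": 5, ...}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduces (but doesn't eliminate) score variance
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Even in a sketch this small, the trust problem is visible: the judge is itself an LLM, so the same rubric can produce different scores on different runs.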
What’s worse is that the quality of an evaluation depends significantly on its test set, and constructing good test sets is really hard. Unlike software tests, where inputs are typed and constrained, text can come in all sorts of strange formats — as we touched on earlier, generating this unexpected input is easy for humans to do. It’s one thing to generate a single, shared test set, but it’s significantly more difficult to do this in a programmatic way when you’re building a customized product for each customer.
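To illustrate why the programmatic route falls short, here’s a naive sketch that asks an LLM to derive questions from a documentation page. The function and prompt are hypothetical, and the output is, by construction, in-distribution: it reads like the docs, not like the strange things real users actually type.

```python
# Naive test-set generation sketch: derive candidate questions from a doc page.
# Illustrative only; the generated questions tend to be documentation paraphrases.
from openai import OpenAI

client = OpenAI()

def questions_from_doc(doc_text: str, n: int = 3) -> list[str]:
    """Ask an LLM for n questions a user might plausibly ask about this page."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative generator model
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} questions a user might ask after reading this page. "
                "One question per line, no numbering.\n\n" + doc_text
            ),
        }],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]
```

Questions generated this way rarely include the typos, half-pasted stack traces, or oddly phrased requests that show up in real usage, which is exactly the input a good test set needs to cover.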
Finally, we’ve also found that customers are skeptical of assistants being overfit for a particular test set. This is a valid concern, and while we explain how we keep our testing framework separate from assistant improvements, there’s little that builds trust like using a tool yourself.
What are the pitfalls?
Things aren’t always great with vibes-based evals. We’ve found a few consistent trends in where vibes can go off the rails:
First impressions are strong. As we mentioned earlier, if the first few things you try are good, you’ll be biased towards thinking that future responses are good. If the first few things you try are bad, the opposite will be true. In the absence of a more detailed evaluation, those first impressions carry a lot of weight, and they can make you prematurely conclude that a product is the best thing ever or absolutely terrible.
Feedback is not detailed enough. Most assistants allow you to give binary positive or negative (thumbs up or thumbs down) feedback on answers. Unfortunately, this feedback doesn’t capture much nuance. Positive feedback could mean (1) you solved my problem; (2) you put me on the right track; or (3) you didn’t answer my question, but I’m glad you didn’t hallucinate. Those are all positive things, but they don’t give you much signal on where the assistant can improve. Negative feedback can similarly mean either “this was a wrong answer” or “I’m annoyed that this isn’t possible.” Without a concrete reference point, you’re often left guessing about where you can improve. (We sketch what richer feedback could capture below.)
Questions are hard to ask. Learning to use an LLM well has its own learning curve, mediated both by the underlying model and by the fine-tuning and prompting layered on top. Users can learn quickly if they have some combination of patience and experience with LLMs, but in the absence of both, confusingly phrased questions (and confusing feedback) can lead assistants astray pretty quickly. This is, of course, the flip side of the points above about generating inputs that represent real-world data. Realistically, you probably want a bit of both, but the balance is hard to strike. This is where an evaluation framework that incorporates some out-of-distribution questions could be helpful.
These three areas are where evals can make the biggest difference: bringing a consistent, holistic view to the quality of the assistant.
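As a concrete illustration of the feedback point above, here’s a small sketch of a schema that separates the signal into the categories we described. The type and field names are our own invention for this post, not a real RunLLM API.

```python
# Sketch of a feedback schema that captures more than thumbs up / thumbs down.
# Categories mirror the ones discussed above; names are illustrative only.
from dataclasses import dataclass
from enum import Enum

class FeedbackReason(Enum):
    SOLVED_PROBLEM = "solved my problem"
    RIGHT_TRACK = "put me on the right track"
    HONEST_NO_ANSWER = "didn't answer, but didn't hallucinate"
    WRONG_ANSWER = "the answer was wrong"
    UNSUPPORTED_REQUEST = "annoyed this isn't possible"

@dataclass
class AnswerFeedback:
    positive: bool
    reason: FeedbackReason
    comment: str | None = None  # optional free-text detail

# A thumbs-up that still tells you where to improve:
feedback = AnswerFeedback(positive=True, reason=FeedbackReason.HONEST_NO_ANSWER)
```

Even a coarse taxonomy like this turns an ambiguous thumbs-up into something you can act on.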
The best of both worlds
A few months back, we viewed vibes-based evals as a stepping stone towards building more empirical test sets. In the absence of that type of test set, we decided we had to make do with vibes-based evals. This is where we’ve changed our minds the most dramatically.
We still believe that we need better LLM benchmarks — both general-purpose and task-specific ones. We still believe that disciplined evaluation processes are super valuable and should be included in how customers are choosing to buy LLM-based products. What we’ve now realized is that you also need to have hands-on-keyboard time with a product to convince yourself that it’s going to do what it’s supposed to do — especially when real-world data is noisy and tricky.
Vibes-based evals are a critical tool for good evaluation, and we shouldn’t deride them as low-quality or undisciplined, as we once did. We should embrace the vibes and make sure that AI products are delivering good experiences both qualitatively and quantitatively.