Lessons from customer evaluations of an AI product
We’ve had RunLLM in users’ hands for 4 months now, and we’ve learned a lot. Most of what we’ve learned has obviously turned into product improvements, but we’ve also started to notice some clear trends in how customers are evaluating products. These observations are mostly to help teams selling early-stage AI products, but we also hope that they will be gentle nudges for those on the other side of the table.
As we’ve talked about recently, evaluating LLMs and LLM-based products is hard, so we can’t blame anyone evaluating a new tool for not having a clear framework. Unlike evaluations of a CRM or a database, there aren’t set criteria or neat feature matrices. In other words, we’re all making it up as we go. 🙂
You learn by touching the stove. Perhaps the most interesting trend we’ve found is that some of our best customers have tended to be those who have already tried to build a custom assistant in-house or with a third-party vendor. They’ve seen what an AI product that’s not fully matured looks like and how highly variable the responses can be, so they know what kinds of failure modes to look for. They can also recognize higher-quality responses more quickly.
These have also been the teams who show up to our conversations with pre-defined evaluation sets: they baseline their expectations of our assistant against that set and continue from there. Good performance on the evaluation set isn’t proof of good quality, but poor performance is a reliable indicator of bad quality. The flip side is that teams who haven’t yet experienced an AI assistant tend to rely on vibes alone (more below).
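To make that concrete, here’s a rough sketch of what that kind of baselining might look like in practice. The `EvalCase` structure, the `ask_assistant` stub, and the must-mention rubric are all hypothetical stand-ins for whatever interface and pass/fail criteria a team actually uses; the point is just to run the same fixed set of questions through the assistant and track how many it clears.

```python
# A minimal sketch of baselining an assistant against a pre-defined evaluation set.
# `ask_assistant` is a placeholder for whatever API or UI is being evaluated.

from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    must_mention: list[str]  # key facts a passing answer should contain


def ask_assistant(question: str) -> str:
    # Placeholder: call the assistant under evaluation here.
    raise NotImplementedError


def run_baseline(cases: list[EvalCase]) -> float:
    passed = 0
    for case in cases:
        answer = ask_assistant(case.question).lower()
        if all(term.lower() in answer for term in case.must_mention):
            passed += 1
        else:
            print(f"MISS: {case.question}")
    return passed / len(cases)


# Hypothetical examples of the kinds of questions a support team might baseline with.
cases = [
    EvalCase("How do I rotate an API key?", ["settings", "revoke"]),
    EvalCase("Which Python versions do you support?", ["3.9"]),
]
# print(f"Baseline pass rate: {run_baseline(cases):.0%}")
```

Keyword matching is obviously a blunt instrument, but even a rubric this crude is enough to catch regressions from one evaluation call to the next.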
It’s all about the data. We wrote about this last week, but it’s become so clear that it bears repeating. At every step of our pipeline — ingesting data sources, fine-tuning an LLM, processing user inputs, and generating responses — the quality and specificity of the data are the #1 determinant of answer quality.
If a customer gives us feedback about an answer that could be improved, the first thing we do is look at our telemetry to understand why we didn’t find the right data. Once we understand that, we can update our data and inference pipelines to address the issue. The simplest fix is almost always improving the data processing. After dozens of these revision cycles, we’ve found that a bad answer is usually a sign that we haven’t ingested all the necessary data for that question.
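As an illustration of that debugging loop, here’s roughly what the telemetry check looks like in spirit. The `RetrievalTrace` shape and field names are invented for the example, not our actual schema; the idea is simply to check whether the right source ever showed up in retrieval before blaming anything downstream.

```python
# A sketch of triaging a flagged answer by inspecting retrieval telemetry.
# The data shapes below are illustrative, not an actual telemetry schema.

from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    source: str   # e.g., a docs URL or file path
    score: float  # retriever similarity score


@dataclass
class RetrievalTrace:
    question: str
    chunks: list[RetrievedChunk]


def diagnose(trace: RetrievalTrace, expected_source: str, min_score: float = 0.5) -> str:
    hits = [c for c in trace.chunks if expected_source in c.source]
    if not hits:
        return "Expected source never retrieved: the data is likely missing or unindexed."
    if max(c.score for c in hits) < min_score:
        return "Expected source retrieved but ranked low: improve chunking or indexing."
    return "The right data was retrieved: look at the prompt or generation step instead."
```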
Managing expectations is difficult. We’ve generally found there are two camps of people unhappy with AI products. The first camp is those who think it’s still just a party trick — something that generates useful answers sometimes but doesn’t have the potential to be consistently helpful. The second is those who expect the world, asking a model to infer answers from information that might not be written down anywhere.
Interestingly, the skeptics are easier to convince than the maximalists. The former, if you can get them to invest some time into testing an assistant, see value very quickly. The latter expect that any error is the product builder’s fault rather than a limitation of the existing technology or the underlying data. Convincing the maximalists means explaining that, yes, AI is a powerful tool, but it’s still only as good as the data you put into it. If your documentation has gaps, so will the assistant’s answers.
Vibes are still king. Vibes-based evals have gone from being a tongue-in-cheek joke to the dominant evaluation method for most LLM-based products. (They always were, but we just never admitted it until recently.) If you’re not familiar with the term, it’s shorthand for “try the model out and see how good the responses are.” It’s obviously not an empirical solution, and it’s not going to give you pristine performance numbers — but the truth is that pristine performance numbers aren’t all that useful.
While referring to them as ‘vibes-based’ may make them sound silly, this is still a great way to evaluate an AI product. In our case, our customers know better than anyone the types of technical support questions they are getting, so their intuition about what our assistant should be able to answer is pretty good. Most of the confidence that’s built or lost during an evaluation comes from trying one-off questions and getting a sense for what the product can or can’t do.
At the end of the day, it’s on us — the product builders — to prove value to our customers, from setting expectations to delivering on them. Vibes-based evals are great when expectations are clearly set, but depending on the product, they can sell the product short (e.g., the user doesn’t realize it just needs more data to work well) or oversell it (e.g., the user happens to ask the 10 questions you ace but not the 4 you would miss).
What that means is that, in a still-forming market, we need to be thinking about how to better evaluate our own tools and surface those insights to customers. For AI products, generic benchmarks like MMLU are going to tell us almost nothing. Building customer confidence is going to mean developing product- and task-specific measures instead.
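One direction we find promising, sketched very loosely below: score the assistant on a customer’s own historical support questions, broken down by topic, rather than on a generic benchmark. The `judge` function is a placeholder (in practice it might be a human reviewer, a rubric, or an LLM-as-judge), and the ticket format is invented for the example.

```python
# A loose sketch of a product-specific measure: per-topic pass rates on the
# customer's own historical support questions. Field names are illustrative.

from collections import defaultdict

def judge(question: str, answer: str) -> bool:
    # Placeholder: return True if this answer would have resolved the ticket.
    raise NotImplementedError


def per_topic_pass_rate(tickets: list[dict]) -> dict[str, float]:
    # tickets: [{"topic": "auth", "question": "...", "answer": "..."}, ...]
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for t in tickets:
        totals[t["topic"]] += 1
        passes[t["topic"]] += judge(t["question"], t["answer"])
    return {topic: passes[topic] / totals[topic] for topic in totals}
```

A breakdown like this tells a customer where the assistant is strong and where the data is thin, which is exactly the conversation a generic benchmark can’t start.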