Back in the 2010s, big data was all the rage. As the famous quote (partially — we won’t include the full thing here) goes: “everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” We’re sure this phenomenon is wholly unfamiliar to everyone working in AI today.
In that era, everything was about gathering data that you could then analyze and eventually use for machine learning. The ability to gather data at scale and use that data to generate better product experiences was at the heart of Google and Facebook’s growth.
What’s interesting about the generative AI craze is that gathering data is no longer the scarce, differentiating capability it was 10-15 years ago. LLMs today can generate massive amounts of plausible-sounding text for pennies, and AI applications by their nature can both create and process more information than any person can possibly keep track of.
You can now get personally tailored, highly detailed answers for fractions of a cent in a way that simply wasn’t possible before. If you’re steeped in AI like we are, you’ve probably gotten to the point where the inability to get exactly what you’re looking for, having to search through documentation or click through Google results instead, is incredibly frustrating. The benefit of that level of customization is obvious: We can all be more productive almost instantly. But the fascinating side effect is that it generates a treasure trove of interesting data. The critical question today is not whether data is available; it’s what you do with the data you’re inevitably going to gather.
Our experience at RunLLM illustrates this perfectly. We’ve seen time and time again that user bases go from asking tens of questions a week to thousands of questions a week once they realize that they can get high-quality, reliable answers from us. Again, that’s a volume of data that no team has time to read through, but because of the nature of LLMs, it’s full of interesting information. We’ve helped customers analyze those conversations to find issues and gaps in documentation, to identify bugs in their products, and to surface feature requests that weren’t showing up anywhere else. In many ways, RunLLM has the clearest picture of what customers are doing and what they’re struggling with, but without the proper analysis methods, those insights would be lost.
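To make that concrete, here’s a minimal sketch of what this kind of conversation mining can look like. To be clear, this isn’t our production pipeline: it assumes a hypothetical list of transcript strings and uses an off-the-shelf chat API to tag each conversation as a documentation gap, a bug report, or a feature request.

```python
# Minimal sketch: mining support conversations for documentation gaps,
# bug reports, and feature requests. Not a production pipeline --
# `conversations` is a hypothetical list of transcript strings.
import json
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TAGGING_PROMPT = """You are analyzing a support conversation.
Return a JSON object with two fields:
  "categories": a subset of ["documentation_gap", "product_bug", "feature_request", "none"]
  "summary": one concrete sentence describing the underlying issue.
Conversation:
{conversation}
"""

def tag_conversation(conversation: str) -> dict:
    """Ask the model to classify a single conversation."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": TAGGING_PROMPT.format(conversation=conversation)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def summarize_week(conversations: list[str]) -> Counter:
    """Count how often each issue category shows up across a batch."""
    counts = Counter()
    for conversation in conversations:
        counts.update(tag_conversation(conversation)["categories"])
    return counts
```

Even a crude tagger like this turns thousands of unread transcripts into a ranked list of things worth a human’s attention.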
To put it briefly, LLMs have encouraged a level of usage that uncovers fascinating trends, and they enable a fidelity of analysis that wasn’t possible before. But if you don’t know what you’re looking for, you won’t be able to put any of that data to work.
Unfortunately, we don’t have any great principles that guide you through exactly what to look for in your data. The short answer is that it’s going to require a lot of experimentation, domain expertise, and customer feedback. Some of the best lessons we’ve learned over the last two years have come from hearing customers say things like, “I was reading through the conversation history, and I was really surprised to see…” or “This chat was fascinating because I didn’t know that documentation page said…” Having heard those themes enough times, we realized these were things (along with a whole bunch more on our roadmap!) that we could automate.
What we can share, however, is what we’ve learned in our journey to better harness the data that RunLLM is gathering. We’re super early in that journey, but we’ve already made (plenty of) mistakes and formulated some hypotheses about where this trend is headed.
Insights are hard. If you leave LLMs to their own devices, it’s very easy to generate confusing, unhelpful, or highly abstract “insights” that don’t actually help anyone. We learned that the hard way: Embarrassingly, our original attempts at topic modeling for RunLLM were not effective, and we heard from our customers that the question categories we first generated were so vague that they didn’t really have a use for them.
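One way to push back on that vagueness (the snippet below is purely an illustrative sketch on hypothetical data, not our production topic model) is to cluster questions first and only ask an LLM to name each cluster afterwards, with a prompt that forbids generic labels and demands a specific feature or error.

```python
# Illustrative sketch of topic modeling that resists vague labels:
# embed questions, cluster them, then name each cluster with a prompt
# that demands a concrete, product-specific label. Hypothetical data.
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

def embed(questions: list[str]) -> np.ndarray:
    """Embed user questions with an off-the-shelf embedding model."""
    response = client.embeddings.create(model="text-embedding-3-small", input=questions)
    return np.array([item.embedding for item in response.data])

def name_cluster(examples: list[str]) -> str:
    """Label one cluster; the prompt explicitly bans generic themes."""
    prompt = (
        "Here are user questions that belong to one cluster:\n"
        + "\n".join(f"- {q}" for q in examples[:10])
        + "\nName this cluster with a short, specific label that mentions the "
        "exact feature, workflow, or error involved. Do NOT use vague labels "
        "like 'general questions' or 'usage issues'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def topic_model(questions: list[str], n_topics: int = 8) -> dict[str, list[str]]:
    """Group questions into topics and attach concrete names."""
    labels = KMeans(n_clusters=n_topics, n_init="auto").fit_predict(embed(questions))
    clusters: dict[str, list[str]] = {}
    for topic_id in range(n_topics):
        members = [q for q, label in zip(questions, labels) if label == topic_id]
        clusters[name_cluster(members)] = members
    return clusters
```

The point isn’t this particular recipe; it’s that the structure of the analysis has to force specificity, because the model won’t supply it on its own.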
These datasets are gold. We may not have the clearest picture of how to harness this data today, but over time, we’re confident that these datasets are going to be some of the most valuable moats that AI companies will be able to create. Of course, OpenAI and Anthropic will have the ultimate breadth of data: they’re gathering an incredible amount of data around general conversation topics, from sports and history to medicine and science. But that dataset is notably lacking in specific kinds of expertise. The reason most (good) AI applications are successful today is that they’re able to provide the kind of content that generic LLMs can’t. This point has two downstream implications, which the next two points cover.
First, generic LLMs won’t get better at highly specialized tasks. The data that would help them with this kind of improvement simply won’t be available to the large model providers. When we say “highly specialized tasks,” we don’t mean things like programming (which LLMs have obviously gotten incredibly good at); we’re thinking about things that require deep expertise and domain knowledge, like writing sales emails for complex products or doing highly complex technical support.
Second, AI applications will specialize over time. The availability of exactly the data that generic model providers are missing will enable specialized applications to get better. In turn, that means better results, deeper insights, and more value for customers. As the first generation of AI application companies gets entrenched, it’s going to become harder and harder for skeptics to argue that a generic LLM could do the same task; that simply won’t be the case.
The hidden challenge is labeling. What we’ve glossed over in this post is how you know what data is good enough to build on. Companies like Scale AI have grown incredibly quickly by enabling teams to get high-quality human labels for data as a service. While that model has worked for generic data, it’s going to become increasingly difficult to do well as applications become more and more specialized. Simultaneously, as the volume of available data increases, we’re going to need to find a more scalable way to label data. This is far from a solved problem, but it will be an absolutely critical aspect of putting this data to work.
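One direction that seems plausible to us is LLM-assisted labeling with targeted human review: let a model propose a label and a confidence, accept the high-confidence labels, and route everything else to a domain expert. The sketch below is hypothetical (the prompt, the threshold, and the model choice are all assumptions), but it shows the shape of the loop.

```python
# Hypothetical sketch of LLM-assisted labeling with human escalation:
# the model proposes a label plus a confidence score, and anything
# below a threshold is routed to a domain expert for review.
import json
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()

@dataclass
class ProposedLabel:
    text: str
    label: str
    confidence: float
    needs_human_review: bool

LABEL_PROMPT = """Label the following support answer as "correct", "partially_correct",
or "incorrect", and estimate your confidence between 0 and 1.
Return JSON: {{"label": ..., "confidence": ...}}.
Answer to evaluate:
{text}
"""

def propose_label(text: str, review_threshold: float = 0.8) -> ProposedLabel:
    """Let the model label one item; flag low-confidence items for humans."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": LABEL_PROMPT.format(text=text)}],
        response_format={"type": "json_object"},
    )
    parsed = json.loads(response.choices[0].message.content)
    confidence = float(parsed["confidence"])
    return ProposedLabel(
        text=text,
        label=parsed["label"],
        confidence=confidence,
        needs_human_review=confidence < review_threshold,
    )
```

Self-reported confidence is a weak signal on its own, so domain experts still anchor the loop; the goal is only to spend their time where it matters most.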
For all the hype data got in the 2010s, very few companies were able to build data moats based on growing usage. Google was the poster child, and everyone was chasing that model (though most didn’t get there). With AI applications, data availability is no longer the problem: by their very nature, AI applications generate an exhaust of useful, actionable data.
As with any complex problem, there isn’t a single answer here for what you should be doing with your data. It’s going to depend on what application you’re building, how receptive your customers are to these kinds of analyses, and so on. Regardless of how you answer those questions, what’s obvious is that you should be thinking about two things: (1) how you build your data moat over time, and (2) how you can start experimenting with putting that data to use. Any company that figures those things out is going to be very happy in a few years’ time.
AI is a tool now.