There's more to LLMs than chat
Largely due to the success of ChatGPT (followed by Claude and Bard), LLMs have become synonymous with a chat-based interface. Chat interfaces make for great demos that show off the comprehension and breadth of LLMs, but they also have some key limitations — many of which have become increasingly obvious as the space has matured.
Way back in November 2022 — when ChatGPT first entered the zeitgeist — there were plenty of takes suggesting that LLMs would spell the end of Google Search. A year in, it’s safe to say that while LLM-based chat assistants have worked their way into many of our daily workflows, they aren’t a drop-in replacement for search. There are both technical (breadth vs. depth) and human (old habits die hard) reasons for that, but we believe it points to a more fundamental truth: Chat isn’t always the best metaphor for building products that surface useful information.
This isn’t to say that chat is bad — it’s just limited. As we see LLMs work their way into more and more applications, we’re starting to see smart product teams push the boundaries on UX, using LLMs in ways that move beyond a user writing a prompt into a text area and hitting ‘enter’. We’re very excited about these directions.
Of course, an LLM is always going to require text as input, so someone is going to have to write the prompt (or at least the prompt template). What’s increasingly clear is that this can (and often should) be hidden from end users.
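To make that concrete, here's a minimal sketch of a hidden prompt template (the product name and context fields are hypothetical, purely for illustration): the user types a short question, and the application quietly expands it into a detailed prompt using context it already has.

```python
# Hypothetical sketch: the user types a short question; the application
# expands it into a detailed prompt using context it already has on hand.
PROMPT_TEMPLATE = """You are a support assistant for {product_name}.
The user is currently on the "{current_page}" page and is on the "{plan_tier}" plan.
Answer the question below in at most three sentences, citing documentation where possible.

Question: {user_input}
"""

def build_prompt(user_input: str, app_context: dict) -> str:
    # None of this detail is written (or ever seen) by the end user.
    return PROMPT_TEMPLATE.format(user_input=user_input, **app_context)

prompt = build_prompt(
    "why was I charged twice?",
    {"product_name": "Acme Billing", "current_page": "Invoices", "plan_tier": "Pro"},
)
```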
Here are some of the key limitations of chat we’ve observed, and how some of our favorite products are overcoming them.
Chat is verbose. Because search was the first dominant way to organize the breadth of information on the internet, most of us are used to tossing off a few words into our URL bar and seeing what pops up. The results are algorithmically organized, but it doesn’t end there. In reality, we use some combination of Google’s rankings and our own heuristics about trustworthy sources to pick 1 or 2 of the most promising options to dig in on. Occasionally, we come back and refine our search or try other results.
As we’ve learned, good prompting is different: it requires clear instructions & guidelines on model outputs, which means prompts need to be fleshed out in significant detail. This interaction is very much the opposite of a few words entered into a search box, and users are often frustrated either by lackluster responses (due to insufficient context) or by the time required to get a prompt right — context and effort a human supplies almost without thinking when searching.
This doesn’t mean LLMs have no role to play in search & data retrieval. In fact, tools like Phind and Perplexity have already shown the power of using a familiar search interface, assuming there’s sufficient plumbing on the backend to provide the necessary context. Without knowing the implementation details, we assume the backend plumbing here is a combination of prompt templating & selection along with traditional information retrieval techniques. Arc’s upcoming Explore feature similarly takes in search-style inputs and uses LLMs to evaluate and summarize results.
These tools are likely just the first step in moving past the verbosity of chat and the human-driven heuristics that traditional search requires, but most importantly, they show how thoughtful UX design can sidestep that verbosity entirely.
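To illustrate what that plumbing might look like, here's a rough sketch. It's a guess at the general pattern, not how Phind, Perplexity, or Arc actually work, and `search_index` and `llm` are hypothetical stand-ins for a real search backend and model client.

```python
# A guess at the general pattern: the search-style query drives traditional
# retrieval, and the LLM is only asked to evaluate and summarize the results.
# `search_index` and `llm` are hypothetical stand-ins, not real libraries.
def answer_search_query(query: str, search_index, llm) -> str:
    hits = search_index.query(query, top_k=5)  # classic IR: BM25, vector search, etc.
    sources = "\n\n".join(
        f"[{i + 1}] {hit.title}\n{hit.snippet}" for i, hit in enumerate(hits)
    )
    prompt = (
        "Using only the numbered sources below, answer the query "
        "and cite sources like [1].\n\n"
        f"Query: {query}\n\nSources:\n{sources}"
    )
    return llm.complete(prompt)
```

The user still types a few terse words; the verbose prompt is assembled for them behind the scenes.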
Conversations don’t express uncertainty. There’s an aspect of human psychology at play when it comes to chat. When you ask a question, you expect to get an answer. If you get the wrong answer, you start to lose trust in the information source, whether a person or a model — which is a bad place to be for an LLM!
There has been some research into quantifying model response confidence, but it’s early, and as we’ve learned from 538’s election forecasts, humans are bad at understanding confidence scores.
Going back to our search example, most of us have learned that the first result for a query isn’t all that much more likely to be right than the second or third one. We don’t go to the 10th page on Google, but the results are more tiered (page 1 > page 2 > … > page 10) than ranked (result 1 > result 2 > result 3). That implicit uncertainty has taught us that we’ll have to work to get the right answer, so we don’t blame Google if result 1 is wrong. Because of this, we’re less bothered even when Google actively surfaces information that’s incorrect — there are immediate options to find another information source.
Chat, on the other hand, creates a sense of certainty — this response is the answer, and there’s no immediate way to get another possible answer. If you don’t like the answer, you can ask again, but models tend to quickly go off the rails when this happens. Designing interfaces that better ground responses in multiple sources and — most importantly — express uncertainty is an open problem. At a minimum, presenting multiple responses eases the burden of having one right answer, but there’s likely more interesting work to be done here — we’re not sure what that looks like, but we’re excited to see what people design.
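One simple starting point, sketched below with the OpenAI Python client (the model name is a placeholder, and this is just one way to do it): sample several independent responses and show them side by side, rather than implying a single response is the answer.

```python
# Sample multiple candidate responses so the interface can present options
# instead of one authoritative-looking answer. The model name is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What likely caused the 500 errors last night?"}],
    n=3,              # ask for three independent samples
    temperature=0.9,  # higher temperature so the candidates actually differ
)

# Show all candidates side by side; let the user judge between them.
for i, choice in enumerate(response.choices, start=1):
    print(f"Candidate {i}:\n{choice.message.content}\n")
```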
Users might not know what they want. The chat metaphor works well when you’re seeking out information. You might chat with a friend to make dinner plans or with a senior engineer on your team to understand the codebase better — so naturally, when you need help debugging some code, chatting with an LLM makes sense.
But that’s not the only context in which LLMs are useful. There are a number of places where LLMs can proactively surface useful information in response to an external event. For example, a document system might surface documents relevant to the one you’re reading. Or a customer success tool might summarize all recent tickets similar to the one a user just filed, ready for review. Or, more familiarly, GitHub Copilot guesses at what code you’d like to write next based on what you’ve been doing thus far.
Chat is simply the wrong way to approach this interaction. You could likely ask a chat interface to see if there are relevant documents, but you might not know what they contain or even what types of relevant documents to ask about. Proactively surfacing information, on the other hand, is powerful when a user doesn’t know what they don’t know. There may or may not be relevant information in the world, and they’d like to see it if it exists.
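Here's a rough sketch of that pattern using embedding similarity (the `embed` function is a hypothetical stand-in for whatever embedding model you use, and the 0.8 threshold is arbitrary): as the user reads a document, anything in the corpus that scores high enough is surfaced alongside it, with no prompt written by anyone.

```python
# Proactive surfacing: compare the open document against the corpus and
# suggest anything similar enough. `embed` is a hypothetical embedding model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def related_documents(current_doc: str, corpus: dict[str, str], embed, threshold: float = 0.8):
    """Return (title, score) pairs worth showing next to the open document."""
    current_vec = embed(current_doc)
    suggestions = []
    for title, text in corpus.items():
        score = cosine_similarity(current_vec, embed(text))
        if score >= threshold:
            suggestions.append((title, score))
    # No chat, no query from the user: suggestions appear (or don't) on their own.
    return sorted(suggestions, key=lambda pair: pair[1], reverse=True)
```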
We hate buzzwords, but we believe this is where good UX will start to feel “AI-native.” Rather than relying on you to seek something out, the tool you’re using guesses at what you might need based on what you’re doing (and improves over time!).
Ultimately, chat is here to stay — and that’s not a bad thing. There are plenty of times when we just need to ask a question. But thus far, our community has been overly anchored to chat because that’s where OpenAI kicked things off. As LLMs make their way into every product, company, and workflow, we’ll need to move past the basics. We’ve listed a few here, but there are likely many more to come.