Over the last few months, we have been speaking with several teams who are building LLM-powered applications. The space is so new and changing so fast that it is tough to keep up with best practices or "standard" tooling. (What does standard even mean when there seems to be a new GitHub repo with 100k stars every week?) Even so, we have found some common patterns for what to do and what not to do across these conversations.
LLMs are incredibly powerful, but as with all powerful tools, they need to be used in the right way. Ask the model to sift through large amounts of information or reason with very little background, and its responses might go off the rails. Try to treat an LLM like a scikit-learn Linear Regressor, and you’ll be wasting both your time and your money. Striking the right balance is critical.
To help you get started, this post will cover common pitfalls & best practices for building your first LLM app and some of the most popular open-source tools used today. (And we do mean today!)
Don't fine-tune a model*
If you come from the ancient world of 2022 machine learning, you might be a little incredulous: If you don't own the model, what advantage do you have?
Foundation models have changed the game. Pre-trained models are so powerful that you no longer need to train or fine-tune a model to see value. Most of the cool applications you see today (e.g., Notion AI, Superhuman AI) are using the OpenAI API with little or no customization of the model, as far as we're aware. The secret's all in what data you give the model (more on this below).
Of course, as your applications mature, you might break this rule; there are plenty of reasons to fine-tune a model, but the cost and complexity make it a bad place to start.
Where to start: For most people, a hosted model API like OpenAI's GPT or Anthropic's Claude is a great starting point. If you cannot use these model APIs for data privacy or security reasons, we would recommend first getting a sense of what is possible on test/dummy data with hosted models. Once you are confident in a direction, you can look into using open-source models like Llama 2 or Vicuna. (See here or here for model leaderboards and here for a comparison of serving tools.)
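To give you a sense of how little is involved, here's a minimal sketch using the OpenAI Python SDK. (The client interface and model names change between SDK versions, so treat the specifics as illustrative rather than canonical.)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

# One API call — no training, no fine-tuning, just a prompt.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # or whichever hosted model you have access to
    messages=[{"role": "user", "content": "Explain what a vector DB is in two sentences."}],
)
print(response.choices[0].message.content)
```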
*Of course, once your application’s mature enough, you should also consider fine-tuning, as we learned when building Gorilla. 🙂
Data is (still) king
If you are not training a model yourself, how do you customize it to your use case? The answer is to give it the right data at the right time. This may be counterintuitive — ML models historically have learned how to perform a task by seeing many, many examples of nearly the exact input-output pair they’ll see in the wild. But the general "reasoning" capabilities of LLMs mean that you can now do what’s called zero-shot or in-context learning. In other words, LLMs are capable of completing a task they have never seen before if given sufficient information.
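As a toy illustration of in-context learning (the policy text and question below are made up, and we're assuming the OpenAI Python SDK), you can hand the model the relevant information directly in the prompt and ask it to answer from that alone:

```python
from openai import OpenAI

client = OpenAI()

# Context the model has never seen during training.
policy = (
    "Vacation policy: full-time employees accrue 1.5 days of paid vacation "
    "per month, capped at 25 days per year. Unused days do not roll over."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{policy}\n\nQuestion: Do my unused vacation days carry over to next year?"},
    ],
)
print(response.choices[0].message.content)
```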
The most common pattern here is retrieval-augmented generation (RAG). Imagine you are building a chatbot for your company's internal wiki, which has hundreds or thousands of documents. You wouldn’t want to pass all these documents into the model to answer every question because: (1) it is expensive to process that much text, and (2) much like a human, the model will get “confused” if it reads a lot of irrelevant information. Instead, you can search for the most relevant documents to your query (e.g., if I ask about vacation policy, pull in the wiki pages about vacation benefits) and only feed that information into the model. You can then ask the model to read the relevant document and answer the query.
It’s still an open question how best to do text search. Model providers like OpenAI offer an API to construct an "embedding" — a numerical vector representation of your text — which you can then insert into a vector DB. Vector DBs provide APIs that take the embedding for a query and retrieve the documents most relevant to that query (measured by vector similarity). Alternatively, text indices like Elasticsearch can also work well. (More on this in another blog post coming soon!)
Where to start: LlamaIndex is the most popular open-source tool for doing RAG. It provides a nice Python interface that lets you take a corpus of text data, embed each document, construct an index, and then run retrieval-augmented queries over that index.
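As a rough sketch — LlamaIndex's import paths have moved around between versions, and "data" here is a placeholder folder of your own documents — the core loop looks something like this:

```python
# Recent releases import from llama_index.core; older ones used `from llama_index import ...`.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load your corpus (e.g., exported wiki pages) from a local folder.
documents = SimpleDirectoryReader("data").load_data()

# Chunk, embed, and index the documents (uses your configured embedding model,
# which defaults to OpenAI's unless you override it).
index = VectorStoreIndex.from_documents(documents)

# Retrieval-augmented query: fetch the most relevant chunks, then ask the LLM
# to answer using only those chunks.
query_engine = index.as_query_engine()
print(query_engine.query("What is our vacation policy?"))
```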
If you are feeling adventurous, you should be able to build a simple version of this in Python yourself by using the OpenAI Python SDK and a hosted vector DB like Pinecone.
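A bare-bones version might look like the following (the documents and question are placeholders, and the in-memory numpy search stands in for a hosted vector DB like Pinecone):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()
EMBED_MODEL = "text-embedding-ada-002"  # or a newer embedding model

docs = [
    "Vacation policy: employees accrue 1.5 paid vacation days per month.",
    "Expense policy: submit receipts within 30 days of purchase.",
    "Security policy: rotate credentials every 90 days.",
]

# Embed every document once, up front. In production you'd upsert these
# vectors into a vector DB (e.g., Pinecone) instead of keeping them in memory.
doc_vecs = np.array(
    [d.embedding for d in client.embeddings.create(model=EMBED_MODEL, input=docs).data]
)

def answer(question: str) -> str:
    # Embed the query and find the most similar document by cosine similarity.
    q = np.array(client.embeddings.create(model=EMBED_MODEL, input=[question]).data[0].embedding)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = docs[int(np.argmax(sims))]

    # Feed only the retrieved context to the model.
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return reply.choices[0].message.content

print(answer("How many vacation days do I earn each month?"))
```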
Experiment with your parameters
With the rate of innovation in every model & tool in this space, there’s tons of optimization you can do to improve the performance of your application. Your mileage may vary with each one of these techniques, but here are a few examples to start with:
Prompt customization: Models respond differently depending on how your prompt is structured, and there are often easy wins to be found by experimenting with different phrasings and formats to nudge the model in the direction you would like.
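A quick way to experiment is to run the same input through a few prompt variants and compare the outputs — a hypothetical sketch, assuming the OpenAI Python SDK and a made-up support-ticket task:

```python
from openai import OpenAI

client = OpenAI()

ticket = "The export button has been broken since yesterday and our client demo is tomorrow."

# Two framings of the same task; small wording/structure changes often shift output quality.
prompts = [
    f"Summarize this support ticket: {ticket}",
    "You are a support triage assistant. Summarize the ticket below in one sentence, "
    f"then label its urgency as LOW, MEDIUM, or HIGH.\n\nTicket: {ticket}",
]

for p in prompts:
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": p}],
    )
    print(out.choices[0].message.content, "\n---")
```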
Segmentation before embedding: If you are using an embeddings model to do text retrieval, how you segment your text can be critical. Intuitively, the larger the chunk of text you construct an embedding for, the lower the fidelity of the resulting vector. Picking the right segmentation is therefore critical — LlamaIndex by default chunks text into 500-token segments, but using semantic knowledge of your data (e.g., splitting on sections or headings) to improve this segmentation can have huge benefits.
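If you want to roll your own segmentation, a simple token-based splitter using the tiktoken library might look like the sketch below (the 500-token size and overlap are just starting points, not recommendations):

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into ~max_tokens-token chunks with a small overlap.

    Token-count splitting is only a baseline; splitting on semantic boundaries
    (sections, headings, paragraphs) usually retrieves better.
    """
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks
```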
Use clever retrieval techniques: The HyDE paper from CMU showed that constructing a "hypothetical" document — asking GPT to write a document that might answer a query — then embedding that hypothetical document and using the embedding as the search query for your retrieval can have huge benefits. The more you can improve the information you put into your prompt, the better.
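Here's a hypothetical sketch of the HyDE idea (assuming the OpenAI Python SDK; the model and embedding names are placeholders you'd swap for your own):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def hyde_query_embedding(question: str) -> np.ndarray:
    # Step 1: ask the model to write a passage that *might* answer the question.
    hypothetical = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Write a short passage that answers: {question}"}],
    ).choices[0].message.content

    # Step 2: embed the hypothetical passage instead of the raw question,
    # and use this vector as the search query against your document index.
    emb = client.embeddings.create(
        model="text-embedding-ada-002",
        input=[hypothetical],
    ).data[0].embedding
    return np.array(emb)
```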
Other parameters to consider are temperature (how “creative” or randomized the model’s responses are), embedding model selection & vector size (larger vectors have higher fidelity but are more expensive to store), and text search mechanisms.
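Temperature in particular is a one-line change worth playing with — a quick sketch, again assuming the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI()
question = "List three risks of shipping an LLM feature."

# temperature=0 makes outputs nearly deterministic (good for extraction/QA);
# higher values trade consistency for variety (good for brainstorming).
deterministic = client.chat.completions.create(
    model="gpt-3.5-turbo", temperature=0,
    messages=[{"role": "user", "content": question}],
)
creative = client.chat.completions.create(
    model="gpt-3.5-turbo", temperature=1.0,
    messages=[{"role": "user", "content": question}],
)
```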
There are likely 20 other things we should be discussing here, but this should help you get off to a strong start.
LLMs are incredibly powerful and are able to process and synthesize large amounts of information, provided you give them the right data. These models do not need to be trained or fine-tuned in the way ML models historically did — at least not right off the bat — but they are also not omnipotent (i.e., garbage in, garbage out is still true).
If you have other lessons from your experiences with building LLM apps, we would love to hear them!