How to Optimize Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) is the most popular technique for infusing LLMs with private or proprietary data. RAG applications allow you to use a private set of documents to contextualize your prompts — for example, you might build a chatbot that answers questions about your company’s internal wiki. The “retrieval augmentation” in the name refers to searching for the documents most relevant to the user’s question; those documents are then combined into a prompt and passed to an LLM, which answers the query based on what was retrieved.
Briefly, an application using retrieval-augmented generation will have the following structure (a minimal code sketch follows the list):
Indexing: Periodically (e.g., once a day), a text search index is updated with all the latest queryable documents. The most common technique is to use an embeddings model to generate a numerical vector representation of each document and insert the vector-document pair into a vector database. Alternatively, a traditional text index like Elastic will also work.
Retrieval: When receiving a query from a user, you first retrieve the most relevant documents — either by embedding the query and doing a similarity search against your vector DB, or by querying your text index.
Generation: You combine the retrieved documents into a prompt (roughly) of the form: “Read the following documents: {documents}. Now answer this question: {query}.” This prompt will give the model sufficient information to answer the question.
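To make those three steps concrete, here is a minimal sketch in Python. The `embed` and `complete` functions are placeholders for whatever embeddings model and LLM you actually use, and a plain in-memory list stands in for the vector database; the point is the structure, not a production implementation.

```python
import numpy as np

# Placeholders for your embeddings model and LLM of choice; these names are
# illustrative, not any specific library's API.
def embed(text: str) -> np.ndarray:
    raise NotImplementedError("call your embeddings model here")

def complete(prompt: str) -> str:
    raise NotImplementedError("call your LLM here")

# Indexing: embed every document and store (vector, document) pairs.
# A real application would insert these into a vector database instead.
def build_index(documents: list[str]) -> list[tuple[np.ndarray, str]]:
    return [(embed(doc), doc) for doc in documents]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieval: embed some text and return the k most similar documents.
def retrieve(text: str, index: list[tuple[np.ndarray, str]], k: int = 3) -> list[str]:
    q = embed(text)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

# Generation: stuff the retrieved documents into a prompt and ask the LLM.
def answer(query: str, index: list[tuple[np.ndarray, str]]) -> str:
    docs = "\n\n".join(retrieve(query, index))
    prompt = f"Read the following documents:\n{docs}\n\nNow answer this question: {query}"
    return complete(prompt)
```

Something like `answer("How do I request PTO?", build_index(wiki_pages))` is the whole naive loop; the techniques below refine either the indexing or the retrieval half of it.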
While naive retrieval augmentation is great for simple applications, more complex queries can quickly expose flaws in the basic method. The most common issue is that embeddings models do not capture the context of the text with sufficient fidelity, which makes retrieving the correct documents difficult. The queries themselves are often short, so their embeddings may not land close to those of the documents that actually answer them (more on this below). If the retrieval process goes awry, you’re likely to get irrelevant or nonsensical answers from the model in the generation step.
Luckily, there are a number of techniques that can help improve RAG applications. These are some of our favorites.
Hypothetical Document Embedding (HyDE). A research paper from CMU showed that hypothetical document embedding can significantly improve retrieval performance. As the name suggests, instead of simply embedding the base query, you first ask an LLM to generate a hypothetical document that might be relevant to the query. You embed this hypothetical document, use that embedding to retrieve relevant documents from the vector DB, and use those documents to answer the original question.
Intuitively, this longer document will likely contain many of the same keywords and concepts as the actual document(s) you wish to retrieve. The original query, on the other hand, might only be a sentence, so its embedding carries far less of that context. As a result, the hypothetical document’s vector will be much more similar to the target document(s) than the query’s vector is, resulting in better retrieval quality.
Of course, the tradeoff here is cost and latency: you will be paying in both money and time for the extra LLM call that generates the hypothetical document. If you’re willing to make that tradeoff, HyDE is a great technique for improving retrieval quality. It’s also worth noting that HyDE is one example of a broader family of techniques that rewrite or expand the query so that it includes more relevant keywords.
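Sketched on top of the pipeline above (reusing the placeholder `complete` and `retrieve` helpers), HyDE only changes what gets embedded; the prompt wording here is just one plausible choice.

```python
# HyDE: embed an LLM-generated hypothetical answer instead of the raw query.
def hyde_retrieve(query: str, index, k: int = 3) -> list[str]:
    hypothetical = complete(
        f"Write a short passage that could plausibly answer this question: {query}"
    )
    # Retrieve real documents whose embeddings are close to the hypothetical one.
    return retrieve(hypothetical, index, k)

def hyde_answer(query: str, index) -> str:
    docs = "\n\n".join(hyde_retrieve(query, index))
    prompt = f"Read the following documents:\n{docs}\n\nNow answer this question: {query}"
    return complete(prompt)
```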
Improved text segmentation. Text segmentation is not a new problem in NLP. Tools like Llama Index naively segment text into evenly sized (e.g., 500 token) chunks. However, this segmentation will likely break text across semantically meaningful boundaries. For example, a single paragraph in a blog post or a section in a research paper usually discusses a single concept. Blindly splitting every 500 tokens can lump two semantically distinct concepts into one chunk, and the resulting embedding vector then carries a “weaker signal” for each of them. As a result, the retrieval process is less likely to surface that information. Even if retrieval does surface it, the model will receive a mix of relevant and irrelevant text, which can confuse it when answering the user’s question.
Using a semantic understanding of the text being embedded to segment it at meaningful points — or even just using heuristics like paragraph boundaries — can significantly improve the quality of your retrieval-augmented generation.
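One cheap version of this heuristic, as a sketch: treat paragraph breaks as the only legal split points and pack whole paragraphs into chunks up to a size budget. The whitespace word count below is a crude stand-in for a real tokenizer.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 500) -> list[str]:
    # Split on blank lines so a paragraph is never cut in half.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # crude proxy for a token count
        # Close out the current chunk if this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)  # an oversized single paragraph becomes its own chunk
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```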
Custom (re)ranking functions. This is a more advanced technique than the previous two, but if applied well, it can significantly improve performance of a RAG application.
By default, most vector databases rank results by the dot product or cosine similarity between the query vector and each database entry: the larger the similarity, the more relevant the document is considered. Results are sorted by similarity score, a cutoff is applied (e.g., 0.75), and the prompt is generated as described above. However, the embedding alone might not tell the whole story, and the ranking function can be augmented with supplementary metadata.
For example, consider a RAG application that answers questions about a company’s internal wiki. A custom ranking function might take the top documents returned from the vector DB and consult a recent activity log to check whether the user has accessed any of them lately. If they have, those documents are probably more relevant to their question, so we boost their relevance. Conversely, if the user recently asked a similar question, we might look at which documents were used to answer it and reduce those documents’ relevance, since the user probably didn’t get the answer they were looking for.
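Here is a sketch of what that might look like. The `recently_viewed_docs` and `docs_used_for_similar_questions` helpers are hypothetical (you’d implement them against your own activity logs and chat history), and the boost and penalty values are arbitrary illustrations.

```python
def rerank(query: str, user_id: str,
           candidates: list[tuple[float, str]],
           cutoff: float = 0.75) -> list[str]:
    """Re-rank (cosine_similarity, doc_id) pairs returned by the vector DB."""
    rescored = []
    for similarity, doc_id in candidates:
        if similarity < cutoff:
            continue  # keep the default similarity-cutoff behavior
        score = similarity
        # Boost documents the user opened recently (hypothetical helper).
        if doc_id in recently_viewed_docs(user_id):
            score += 0.1
        # Penalize documents already used to answer a similar recent question,
        # since they apparently didn't resolve it (hypothetical helper).
        if doc_id in docs_used_for_similar_questions(user_id, query):
            score -= 0.1
        rescored.append((score, doc_id))
    rescored.sort(reverse=True)
    return [doc_id for _, doc_id in rescored]
```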
Writing custom ranking functions should be done with care and requires a deeper understanding of the dynamics of a particular application. If done well, it can significantly improve application performance.
RAG (as opposed to fine-tuning) is a great place to start when building LLM applications. While naive RAG will suffice for a proof of concept, you’ll quickly need to optimize your application to make it production-ready. We’ve discussed a few common ideas here, but there are many other possible optimizations depending on what you’re building. Share your favorite ideas below!