Last year, we planned to write a blog post about the impending war between fine-tuning LLMs and retrieval augmentation. We never quite got around to writing that post, and in the meantime, the world shifted under our feet. The current consensus is not either-or but both-and. Fine-tune a model to teach it a skill, and use RAG to ensure freshness of data. This is (roughly) the technique we’re using in our own product at RunLLM.
Over the course of 2023, we saw the proliferation of fine-tuning APIs — starting with OpenAI’s but quickly followed by third-party service providers like Together AI, Anyscale, and Lamini. These services make fine-tuning a model turnkey — upload a dataset, pick a model, and hit go. A few hours later, you have a fine-tuned model and an inference endpoint. There’s certainly a ton of complex engineering going on under the hood, but as a user, you’re exposed to none of it. It feels like everyone should be able to fine-tune a model now, except… where do you get the data?
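To make that flow concrete, here’s roughly what “upload a dataset, pick a model, and hit go” looks like against OpenAI’s fine-tuning API. This is a minimal sketch: the file name and base model are placeholders, and the third-party providers expose similar (but not identical) endpoints.

```python
# Sketch of a managed fine-tuning flow using OpenAI's Python SDK.
# "training_data.jsonl" is a placeholder; it should contain one
# chat-formatted training example per line.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the training dataset.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Pick a base model and kick off the fine-tuning job.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# 3. Check on the job; once it succeeds, you get a model ID you can
#    call like any other model at inference time.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```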
As with many things in AI, the hardest questions come back to data engineering. There’s certainly some magic involved in fine-tuning LLMs, but the main challenge turns out to be yet more data engineering. Let’s dive into how you might fine-tune your own models.
Synthetic data is king
In this context, synthetic data has turned out to be a powerful tool. Rather than relying on human-generated data, which can be slow and expensive to obtain, the “one weird trick” that’s unlocked a whole wave of fine-tuning has been to use a more powerful model (e.g., GPT-4) to generate a large volume of synthetic data that can then be used to fine-tune a smaller model for a particular task.
At first, this felt unintuitive. If we’re defaulting to using a powerful model like GPT-4, why do we need fine-tuning in the first place? The answer is surprisingly simple — you’re using the general-purpose reasoning capabilities of the large model to handle an ambiguous task (e.g., “Read this document, and write 10 question-answer pairs relevant to it.”). You take those pairs and train the smaller model on them. The smaller model won’t have the general-purpose reasoning capabilities of the larger model (especially after you fine-tune it), but it will be very good at following the pattern of the examples you provided.
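Here’s a minimal sketch of that pattern: ask the stronger model for question-answer pairs about a document, then write them out in the chat format the fine-tuning APIs expect. The prompt wording, JSON schema, and file paths are illustrative, not our actual pipeline.

```python
# Sketch of synthetic data generation: a stronger model (GPT-4) reads a
# document and emits question-answer pairs, which we save as JSONL
# training examples for fine-tuning a smaller model.
import json
from openai import OpenAI

client = OpenAI()

def generate_qa_pairs(document: str, n: int = 10) -> list[dict]:
    # Prompt and output schema are assumptions for illustration.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Read this document and write {n} question-answer pairs "
                "relevant to it. Respond with a JSON list of objects with "
                f"'question' and 'answer' keys.\n\n{document}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)

def to_training_example(pair: dict) -> dict:
    # One line of the JSONL file used to fine-tune the smaller model.
    return {"messages": [
        {"role": "user", "content": pair["question"]},
        {"role": "assistant", "content": pair["answer"]},
    ]}

with open("training_data.jsonl", "w") as f:
    for pair in generate_qa_pairs(open("docs/page.md").read()):
        f.write(json.dumps(to_training_example(pair)) + "\n")
```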
Using the larger model for dataset generation can be expensive, but it’s a fixed cost — you generate your dataset once for training, fine-tune the model, and then you get cheaper & faster inference moving forward. In any case, it’s orders of magnitude cheaper than relying on humans. If the underlying data changes, you can still use RAG techniques to maintain freshness, as long as the basic facts & structure you fine-tuned the model on remain constant.
But again, (synthetic) data engineering
Unfortunately, we’ve just done a whole lot of handwaving. While the fine-tuning APIs we mentioned above have abstracted away GPU allocation and deployment, we still need to make sure our data is high quality — and that’s usually the biggest stumbling block. GPT-4 and similar models help us generate data, but even these models are garbage in, garbage out.
This is where the data engineering comes in. Here are some of the challenges our team at RunLLM has had to tackle as we’ve worked through the synthetic dataset generation process.
How large should the input data be for dataset generation? Just because we have 100k+ context windows now doesn’t mean we should use them, and the more data you provide, the more easily the model gets distracted. Of course, if you slice too thinly, you lose valuable context.
How many examples is enough? Well, how many licks does it take to reach the center of a Tootsie Pop? One of life’s great mysteries, and it really depends on the nuances of your data.
Is there such a thing as too small a dataset for fine-tuning? We think so, but we’re not sure — we’ve found a few cases where a small enough dataset was effectively ignored by the model post-fine-tuning. Even though we showed the model 100s of examples of a particular usage pattern, it hallucinated its own ideas because the aggregate dataset was too small.
How do you ensure sufficient diversity in your examples? Inundating the model with the same few examples repeatedly is terrible for model performance, so you need to vary the type & order of examples significantly (see the sketch after this list).
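To make a couple of these levers concrete, here’s a sketch of slicing source documents into moderately sized chunks before generation, and deduplicating & shuffling the generated examples so the model doesn’t see long runs of near-identical data. The chunk size and dedup key are assumptions you’d tune for your own dataset.

```python
# Two of the knobs above: chunking inputs before synthetic generation,
# and deduplicating/shuffling the generated examples for diversity.
import hashlib
import random

def chunk_document(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    # Fixed-size character chunks with a little overlap so we don't cut
    # context mid-thought; real pipelines often split on headings instead.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def dedupe_and_shuffle(examples: list[dict], seed: int = 42) -> list[dict]:
    # Drop examples whose question text is an exact duplicate, then shuffle
    # so fine-tuning doesn't present long runs of similar examples.
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(ex["question"].strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    random.Random(seed).shuffle(unique)
    return unique
```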
None of these questions are about the algorithms themselves — we haven’t looked at model architecture, fine-tuning epochs, or LoRA. But we have found that getting the dataset just right is the best determinant of model quality. Unfortunately, we don’t have many answers to share here because we’ve found that the answer varies significantly based on the underlying data — at the moment, a healthy experiment budget is your best friend.
It’s simple until it isn’t
Thus far, we haven’t touched on what’s happening under the hood of these fine-tuning services. The services are powerful and highly cost effective, but they expose relatively few knobs for users to manage. Our best guess is that most of these services are using LoRA to fine-tune models and rapidly swap them at inference time. You can control your dataset (of course) and the number of fine-tuning epochs but little else.
However, you may eventually want to move to your own models — either for cost, quality, or privacy reasons (or all the above). When you get to that point, here are some of the challenges you’ll have to start thinking about:
Where do I get GPUs? It seems like the GPU shortage is poised to ease up soon, but in January ‘24, it’s still quite difficult to find GPU allocations on the public clouds.
Do I use LoRA or do full weight updates? Our friends at Anyscale have shown that LoRA is more efficient but leads to slightly worse model quality. It’s a valuable tool for a service that’s rapidly switching between multiple fine-tuned models on top of the same base model, but if you only have a few models, it might be worth investing in full weight updates. (You might be interested in other parameter-efficient fine-tuning methods as well; see the LoRA sketch after this list.)
What fine-tuning implementation should I use? There are a number of popular implementations, and new techniques and optimizations are released daily. Finding the right level of stability for your use case is critical.
What about deployments? This was a big challenge pre-LLMs and will continue to be difficult; however, we’ll likely see cloud providers and other services provide bring-your-own-model deployment mechanisms for LLMs.
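For reference, here’s roughly what a LoRA fine-tune looks like if you run it yourself with Hugging Face’s peft library. The base model, rank, and target modules below are illustrative defaults, not recommendations.

```python
# Rough sketch of a do-it-yourself LoRA setup with Hugging Face peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # assumed base model for illustration
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension; higher = more capacity
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model

# From here you'd run a standard training loop (e.g., transformers.Trainer)
# over your JSONL dataset; only the LoRA adapter weights are updated.
```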
It may sound silly, but contending with the sheer size of LLMs is one of the biggest challenges. It’s led to a concentration of deployments, open-source and commercial, in a few select organizations. However, if we start to see the release of smaller open-source LLMs, that will alleviate much of the complexity here. (Yes, this is part of our ongoing campaign to advocate for small models.)