Should you fine-tune a model?
A couple weeks ago, we wrote about where to start when building an LLM application. One of our claims (perhaps controversial) was that you shouldn’t fine-tune your own models. Of course, like all rules, this one should be broken sometimes. There are indeed cases when you should fine-tune an LLM for a particular task, but it’s certainly not where you should start.
If you’re coming from the pre-LLM machine learning world, this may be surprising. Before LLMs, we built (supervised) ML models that we trained by giving them many examples of input-output pairs that were similar to what the model would see in production. Those training examples were exactly what enabled the model to learn its task.
The power of LLMs is that their reasoning is far more general-purpose than that of a scikit-learn linear regressor. Without getting into a philosophical debate about intelligence, it’s fairly easy to prove to yourself that an LLM is capable of reasoning over information it hasn’t seen before: Construct an arbitrary story and ask it a question about the details of that story. It’ll answer with reasonable accuracy. This means you can get pretty far with an off-the-shelf model.
So when should you fine-tune? Here’s a rough rule of thumb: Fine-tuning is good for teaching a model generalized skills, while off-the-shelf LLMs are good at analyzing or synthesizing specific information.
When you shouldn’t fine-tune a model
This list isn’t exhaustive; instead, it’s meant to give you a sense of the types of tasks LLMs can perform off the shelf. As a theme, you’ll notice that providing the right information to the model enables it to do zero-shot[1] (or few-shot) learning, which means the model learns from what’s in the prompt. The list of things an off-the-shelf model can do will also likely get longer over time[2].
To chat with my data: Applications that are doing retrieval-augmented generation (RAG) to allow you to ask questions of a large corpus of documents almost certainly don’t require a fine-tuned model. One of the main challenges here is doing good text search and retrieval (see our interview with the co-creator of Llama Index for more), but assuming you’re able to retrieve the right data for a particular question, the model is more than capable of synthesizing answers based on the information you give it.
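If the retrieval piece is working, the generation piece really is just prompt construction. Here’s a minimal sketch of the pattern; the keyword-overlap retriever and the hard-coded documents are toy stand-ins for a real vector store and your actual corpus:

```python
# Minimal RAG sketch: retrieve the most relevant documents for a question,
# then stuff them into the prompt of an off-the-shelf model.
# The keyword-overlap retriever and the documents below are toy stand-ins
# for a real vector store (e.g., one managed by Llama Index) and a real corpus.

DOCUMENTS = [
    "Refunds are processed within 5 business days of approval.",
    "Enterprise customers can enable SSO from the admin console.",
    "The API rate limit is 100 requests per minute per key.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(question, DOCUMENTS))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_rag_prompt("How long do refunds take?"))
# Send the resulting prompt to whatever off-the-shelf model you use; no fine-tuning involved.
```

In production you’d swap the toy retriever for embedding search, but the model itself stays off the shelf.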
To use a particular tone of voice: We’ve met some teams who are considering fine-tuning models to adopt a particular tone of voice (friendly, helpful, etc.). This is almost certainly not necessary to get started. If you’re concerned about faithfully matching a tone, the best place to start is by providing examples in your prompt. These examples let the model pick up the intended tone via in-context learning. Your prompt might look something like this:
Provide an answer to the following question from a customer. Use a friendly and supportive tone. The customer question is: {input}
Here are some examples of friendly and helpful responses to questions.
Question: “How do I get a refund?”
Answer: “Thanks for your question! To initiate the refund process, you’ll need to first submit information about the issue you’re having with your current product. Once one of our customer support specialists takes a look at your request, you’ll hear back from us. Let me know if there’s any more detail you need!” …
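If you’re assembling that kind of prompt programmatically, a sketch might look like the following; the example pairs are placeholders you’d replace with real responses written in your brand’s voice:

```python
# Sketch of assembling a few-shot "tone of voice" prompt.
# The question/answer pairs are placeholders; use real responses that match your tone.

FEW_SHOT_EXAMPLES = [
    (
        "How do I get a refund?",
        "Thanks for your question! To initiate the refund process, you'll need to "
        "first submit information about the issue you're having with your current "
        "product. Let me know if there's any more detail you need!",
    ),
    # ... a handful more examples here
]

def build_tone_prompt(customer_question: str) -> str:
    examples = "\n\n".join(
        f'Question: "{q}"\nAnswer: "{a}"' for q, a in FEW_SHOT_EXAMPLES
    )
    return (
        "Provide an answer to the following question from a customer. "
        "Use a friendly and supportive tone.\n\n"
        "Here are some examples of friendly and helpful responses to questions.\n\n"
        f"{examples}\n\n"
        f"The customer question is: {customer_question}"
    )

print(build_tone_prompt("Can I change my shipping address after placing an order?"))
```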
To do basic data analysis: LLMs are very good at reasoning over unstructured data to do basic data extraction (e.g., pulling a customer name, company name, and title into a JSON blob from the body of an email) or basic classification (e.g., deciding whether a security alert is valid). Again, priming the model with examples of correctly formatted JSON or of valid security alerts, as above, enables it to do effective zero-shot (or few-shot) learning.
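As a sketch of the extraction case, you might prompt for JSON and parse the reply; the email, the field names, and the call_your_llm placeholder below are all illustrative:

```python
import json

# Sketch of extraction with an off-the-shelf model: ask for structured JSON, then parse the reply.
# The email, the field names, and call_your_llm are illustrative placeholders.

EXTRACTION_PROMPT = """Extract the customer name, company name, and title from the email below.
Respond with only a JSON object containing the keys "customer_name", "company_name", and "title".

Email:
{email}
"""

def parse_extraction(model_reply: str) -> dict:
    """Parse the model's JSON reply; raises if the reply isn't valid JSON."""
    return json.loads(model_reply)

email = "Hi team, this is Dana Lee, VP of Operations at Acme Corp. Could you send pricing details?"
prompt = EXTRACTION_PROMPT.format(email=email)
print(prompt)
# reply = call_your_llm(prompt)   # whichever client and model you're using
# print(parse_extraction(reply))  # e.g. {"customer_name": "Dana Lee", ...}
```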
To generate customized text: Lots of LLM applications take the form of filling out a template, such as generating a sales email customized to a user based on their LinkedIn profile. Being specific with the model about the type of customization (e.g., mention sports, avoid the weather), the tone, and the level of detail desired in the email gives the model what it needs to generate an effective email without any fine-tuning.
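A sketch of that kind of prompt assembly might look like this; the profile fields and constraints are illustrative placeholders:

```python
# Sketch of a "fill out a template" prompt for a customized sales email.
# The profile fields and constraints are illustrative placeholders.

def build_email_prompt(profile: dict, constraints: dict) -> str:
    return (
        "Write a short sales email customized to this prospect.\n"
        f"Prospect profile: {profile}\n"
        f"Tone: {constraints['tone']}\n"
        f"Mention: {', '.join(constraints['mention'])}\n"
        f"Avoid: {', '.join(constraints['avoid'])}\n"
        "Keep it under 120 words."
    )

prompt = build_email_prompt(
    profile={"name": "Sam", "role": "Head of Data", "interests": ["cycling", "open source"]},
    constraints={"tone": "warm but concise", "mention": ["sports"], "avoid": ["the weather"]},
)
print(prompt)
```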
When you should fine-tune a model
Fine-tuning is a good fit when the model needs a skill that can’t be sufficiently conveyed in a handful of examples; instead, the model needs to internalize a generalizable skill to perform its task.
For example, imagine you were building an application that synthesized information from medical records but needed to exclude any personal information. The specific regulations around what information is permissible to reveal are nuanced, differ by jurisdiction, and change over time. You can’t provide 5-10 examples that cover the full set of restrictions. In this case, fine-tuning a model on a dataset full of properly redacted summaries will enable it to learn how to detect and redact personal information.
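To make that concrete, here’s a rough sketch of assembling such a dataset. The JSONL prompt/completion layout is a common convention, but check your provider’s fine-tuning documentation for the exact format it expects; the records below are invented:

```python
import json

# Sketch of building a fine-tuning dataset for the redaction task.
# Each example pairs a raw record with a properly redacted summary.
# The prompt/completion JSONL layout is a common convention; your provider's
# fine-tuning docs define the exact format it expects.

examples = [
    {
        "prompt": "Summarize this medical record, removing all personal information:\n"
                  "Patient John Smith, DOB 1980-03-12, treated for a fractured wrist...",
        "completion": "A patient was treated for a fractured wrist...",
    },
    # ... hundreds or thousands more, covering the full range of redaction rules
]

with open("redaction_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```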
An example we have personal experience with is Gorilla. Briefly, Gorilla uses an LLM to automatically generate code snippets for various APIs (HuggingFace Transformers, Kubernetes, etc.) based on plain-English descriptions. For example, “List all pods running in the default namespace” would generate kubectl get pods -n default.
At first glance, this seems like a sure-fire case where a model should be able to reason its way through the task, provided it’s given the right API documents. In practice, the Gorilla team has compared the performance of a fine-tuned model with pure retrieval augmentation. Our empirical results (more on this in a future post!) show that fine-tuning outperforms retrieval augmentation. Our guess as to why: API documents are often not written in a way that is conducive to effective vector retrieval. That fact, combined with the breadth of functionality a system like Kubernetes or HuggingFace offers, means that correctly interpreting a question and translating it into the right API call is in fact a skill a model needs to be taught.
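To make the contrast concrete, here’s an illustrative sketch (not Gorilla’s actual code or data format) of the two approaches side by side:

```python
# Illustrative contrast only; not Gorilla's actual code or data format.

# 1) Retrieval augmentation: stuff (hopefully relevant) API docs into the prompt
#    of an off-the-shelf model and ask it to map the request onto them.
retrieved_docs = "kubectl get pods [-n NAMESPACE]  # List pods in a namespace\n..."
rag_prompt = (
    f"API documentation:\n{retrieved_docs}\n\n"
    "Write the command for: List all pods running in the default namespace"
)

# 2) Fine-tuning: train on many (instruction, api_call) pairs so the mapping itself
#    becomes a learned skill; at inference time the prompt is just the instruction.
finetune_pair = {
    "instruction": "List all pods running in the default namespace",
    "api_call": "kubectl get pods -n default",
}
print(rag_prompt)
```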
The heuristic that fine-tuning is best for teaching models new skills aligns with the most popular fine-tuned models we can think of:
GitHub Copilot uses Codex, which is GPT-3 fine-tuned for code generation. (Replit has built an LLM from scratch for code completion.)
Gorilla, as described above.
Once you’ve decided to fine-tune a model, there’s a whole bunch of technical details to consider. That’s a broad enough topic that it warrants a separate post. We won’t cover it here, but keep an eye out for more soon!
As with every rule, “don’t fine-tune a model” should sometimes be broken. Fine-tuning is a powerful technique to teach an LLM a particular skill that can’t be summarized in a few pattern-matchable examples. More often than not, however, you’ll save yourself time, money, and a whole lot of stress by avoiding fine-tuning. Start with an off-the-shelf model and turn to fine-tuning when you have no other options[3] left.
[1] It’s called zero-shot learning because the prompt gives the model no examples of the task, only instructions; few-shot means the prompt includes a handful of examples for the model to pattern-match against.
[2] Off-the-shelf models are constantly getting better and being trained on new skills — for example, look at GPT’s recent update to support function calls. Whether we can keep stuffing more general-purpose skills into a single model remains to be seen, but for the time being, the models are certainly getting more capable.
[3] In a post coming soon, we’ll talk about techniques to optimize retrieval augmentation.