In recent weeks we’ve been writing about the quality advantage and the cost advantage that proprietary models have over open-source LLMs. The second post turned out to be our most popular post ever (over 30K views! 🤯). Our conclusion has been that open-source LLMs will gain the most adoption by serving as the basis for the community that will fine-tune specialized models, rather than by trying to win head-to-head against proprietary models like GPT. This turned out to be a more controversial topic than we expected! Coincidentally, there has been extensive discussion around the merits of open-source LLMs from the community at large. Our friends on the Retort podcast had a great discussion about what truly constitutes open source.
This has been fascinating. Coming from Berkeley, which has a long open-source tradition, we’re big fans of open-source software in general. But frankly, much of the discussion has been incredibly vague: advocates argue for open-source LLMs as an unalloyed good, often without being specific about what exactly they would like to see.
This has led us to start thinking about why open-source LLMs matter, and what benefits they might have.
First, however, we want to avoid falling into a moving-goalposts trap, so let’s be specific. What does it mean for an LLM to be open source? Here are a few definitions:
Publicly available weights: Models like LLaMa 2 and Mistral fall into this bucket. Their builders have released the weights that comprise the model under fairly permissive licenses, so that users can pick up the models and build custom deployments (see the sketch after this list).
Publicly available datasets: As far as we’re aware, no major open-source LLM has done this yet, but releasing the data the model was trained on would be an impressive step, as it would allow the community to understand the model’s potential biases and flaws.
Publicly available training code & infrastructure: Most large model builders have kept this guarded to date as well. Given the number of configuration parameters that go into a model’s training run, combined with the RLHF that many models undergo, releasing this would also help the community understand a model from first principles.
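As a concrete illustration of what weight availability buys you, here is a minimal sketch of pulling an open-weights model and running it locally with the Hugging Face transformers library. The model name and prompt are our own illustrative choices, and gated models like LLaMa 2 additionally require accepting the license on the Hub:

```python
# Minimal sketch: load an open-weights LLM and generate text locally.
# Assumes the transformers + accelerate libraries and a machine with
# enough GPU (or CPU) memory for the chosen model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # illustrative; any open-weights causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Summarize why open model weights matter in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```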
As has been discussed elsewhere, the dataset creation process and the expertise embedded in the model training process are closely guarded. The major OSS model providers haven’t released much (or any) information about the datasets used, much to the chagrin of the open-source community. As a result, we’ve thus far mostly seen publicly available weights but scant information about datasets and training code + infrastructure.
Let’s return to the original question. Say, hypothetically, that open-source advocates won this war. What value would we gain, at large, from having truly open-source LLMs: models where weights, datasets, and code + infrastructure were all available?
Community oversight: Understanding a model’s blind spots and pitfalls is critical both for future model improvements and for alignment research. Simply interacting with a model like GPT through its chat interface or API already reveals plenty of blind spots, and researchers have been able to test policies by pushing boundaries here with hosted models. Whether visibility into a model’s underlying dataset provides actionable insights into its biases is still an open research question. Obviously, the editorial choices that model builders make (e.g., dropping or including data) are important; however, given the large investment and potentially fraught legality of data use, it’s extremely unlikely we’ll see many of these datasets in full (barring government intervention).
Re-creation of models: This is one of the open-source community’s biggest frustrations with the lack of available information about datasets and code. Ideally, community efforts to re-create existing models would allow researchers to experiment with different model parameters and alignment approaches. Realistically, however, the scale of these models makes re-creation unlikely or impossible. GPU costs alone for training are prohibitive, and the infrastructure + human costs of RLHF put the nail in the coffin. Unlike commodity storage infrastructure, where a user can realistically deploy Minio instead of using AWS S3, the premium on the hardware and time required to re-create models rules out effective experimentation. Community efforts simply won’t be able to re-create GPT-scale (or even LLaMa-scale) models; public sector efforts or large research consortia might manage it, but bottom-up experimentation would still be out of reach. Alignment research will likely have to be treated as an add-on to existing models.
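To make "prohibitive" concrete, here is a rough back-of-envelope sketch of pre-training compute using the commonly cited ~6 × parameters × tokens FLOPs approximation. Every number below (model size, token count, GPU throughput, utilization, price) is our own illustrative assumption, not a figure reported by any model builder:

```python
# Back-of-envelope pre-training cost using the ~6 * N * D FLOPs approximation.
# All inputs are illustrative assumptions: a 70B-parameter model, 2T training
# tokens, A100-class GPUs at ~312 TFLOP/s peak (bf16), ~40% utilization, and
# ~$2 per GPU-hour.
params = 70e9                         # model parameters (assumed)
tokens = 2e12                         # training tokens (assumed)
flops_needed = 6 * params * tokens    # ~6 FLOPs per parameter per token

peak_flops = 312e12                   # approximate A100 bf16 peak throughput
utilization = 0.4                     # assumed model FLOPs utilization
gpu_hours = flops_needed / (peak_flops * utilization) / 3600

cost_per_gpu_hour = 2.0               # assumed cloud price in USD
print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * cost_per_gpu_hour:,.0f}")
# -> roughly 1.9M GPU-hours and several million dollars of raw compute,
#    before counting data pipelines, failed runs, engineering time, or RLHF.
```

And that is just a single final training run; real training efforts involve many experiments and restarts on top of it.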
Self-hosting and custom deployments: This is a commonly cited concern, and while there are highly security-sensitive cases where an enterprise might want a custom LLM, we fully expect OpenAI and Azure (and, correspondingly, AWS + Anthropic and GCP) to solve this problem. The growing gulf in model quality makes it hard to justify an open-source LLM when you can get a secure deployment of a proprietary model, especially with the right data-sharing protections. Just this week, we spoke to a ~$100B tech company that’s working with a major cloud provider on terms to share private information with the cloud provider’s LLM deployments. Realistically, the economies of scale and deployment efficiencies the major model builders can provide make them difficult to beat.
Specialization: This is the most compelling argument, and the one we made in our previous post. Open-source LLMs are great bases for specialized LLMs. While the GPT fine-tuning API is powerful, it only allows fine-tuning via LoRA (rather than full weight updates), and it limits users’ ability to apply more advanced model specialization techniques like RLHF or RLCF, which are likely to become very valuable as specialized models mature. This is where open-source models are most likely to thrive in the coming years.
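For readers less familiar with the mechanics, here is a minimal sketch of LoRA fine-tuning on an open-weights model using the Hugging Face transformers and peft libraries. The base model, target modules, and hyperparameters are illustrative assumptions, not a recommended recipe:

```python
# Minimal LoRA fine-tuning sketch: attach small low-rank adapters to an
# open-weights model so only a tiny fraction of parameters are trained.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # illustrative open-weights base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections (architecture-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights

# From here, train with the standard transformers Trainer (or a custom loop)
# on a domain-specific dataset, then save or merge the adapter for serving.
```

With the weights in hand, nothing stops you from doing full-parameter fine-tuning or plugging the model into an RLHF pipeline instead, which is exactly the flexibility the hosted fine-tuning APIs restrict.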
When it comes to specialization, open-source models are already quite powerful. A LinkedIn commenter on our last post pointed out that Code-LLaMa 34B is already the best code model out there — we agree! This is a great example of the success of domain-specific models. Unfortunately, fine-tuning can still be extremely expensive because of the GPU and time investment required to train a model (see our post on OpenAI’s current cost advantages). Thankfully, we already know from plenty of real-world examples (including our own work) that fine-tuned models don’t need to reach the scale and generality of a model like GPT-4.
This line of thinking leads to an obvious conclusion: Open-source models don’t need to get better; they need to get smaller and more focused. The post linked above shows that there are about two orders of magnitude (roughly 100x) of improvement in cost and scale that open-source LLMs need to achieve in order to match GPT. If they can cross that barrier, they change the degree to which companies can effectively specialize models, and they form a viable path forward for OSS.
We’re strong believers in open source generally, but the outcome here is obvious: Open-source models simply can’t compete with the general-purpose quality of hosted LLMs. That’s okay; it’s not a defeat but an opportunity. Someone fine-tuning a model doesn’t need the most general-purpose model, just one they can train well for their task. If open-source models can maintain quality while reducing size, there’s a whole world of specialization to unlock here.
One nit (or maybe an important point): models will never be exactly re-created, only replicated. The infrastructure matters too much for an exact match to be viable.