Happy Friday! This is a slightly off-schedule post, but we wanted to get our thoughts on the latest AI news out. We’ll return to our regular Thursday posts next week.
We’ve been on vacation for the last couple weeks — which is why our posts have been sporadic — and boy, did a whole lot happen while we were out! Now that we’ve had a moment to digest, we thought we’d recap what’s been going on.
In case you’ve been living under a rock, here are some of the biggest things that have happened in the last few weeks.
OpenAI released a text-to-video model called Sora, which set the internet on fire. Text-to-video models aren’t new, but this was the first time non-AI audiences were paying attention, and the results were impressive at first glance, although there are some key shortcomings as well.
Google released Gemini 1.5 Pro, which they claim delivers GPT-4-level performance, along with a built-in image generation model. They also shared that support for 10 million-token context windows is in the works, with 1 million tokens in production with Bard.
There has been significant controversy about the diversity of the images generated by Gemini. Google seems to be actively shipping hotfixes, but the results aren’t quite there yet.
Google released a suite of models called Gemma, which are open-weight models based on the same research techniques that power the closed Gemini models mentioned above.
Also: Mistral opened access to a new model on the LMSys Arena, Stable Diffusion 3 came out, Meta AI Research announced a video embedding tool called V-JEPA, and probably 12 other things we’ve missed.
Okay, that was a lot. We’re out of breath just from typing it out. How should we be thinking about all these changes, and why are they happening all at once?
At a high level, this flurry was inevitable in early 2024. In 2023, the major model releases were driven primarily by OpenAI, followed by Meta and a few open-source model iterations. But with the capital invested in and importance placed on general-purpose (and now, multi-modal) LLMs, we’re going to see (pun intended) a lot more. A lot of this is also about attention (pun intended), and January is always a slow marketing month, so everything got pushed to mid-February. That said, this is very likely not the only flurry of announcements we’ll see this year.
It’s also clear that the state of the art is constantly changing. Every one of the major updates above was driven by innovations in the underlying machine learning techniques — it’s rumored that Gemini 1.5 uses RingAttention, Sora builds on the recaptioning techniques from DALL-E 3, and Gemma has some architectural idiosyncrasies. We’re not going to dive into the implementations here — there’s plenty on the internet already — but we have a number of thoughts about the implications of these changes.
Alignment is still more of a technological problem than a social one. There are valid conversations to be had about what policies LLM builders should follow from an alignment perspective. We’re not policy experts, and we’re not here to debate whether this was a policy failure at Google. However, it’s clear from the Gemini controversy (and even from the earlier criticisms of Llama 2) that there are fundamental technical issues here.
Whatever policies Google tried to enforce (explicitly or implicitly), it was obviously not their goal to create a model that refuses to generate images of white people while producing images of racially diverse Nazis. They’ve been shipping hotfixes since the controversy blew up over the last couple of weeks, and it seems like they’re playing a game of whack-a-mole right now. What this shows us is that the underlying RL and alignment techniques — both proprietary and third-party ones — have a long way to go. Even if you could magically specify an alignment policy that pleased everyone, it’s not clear how you’d articulate it programmatically, or whether you’d be able to implement it at all. Right now, we’re relying on human-generated labels, which are both noisy and subject to misinterpretation.
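To make that dependence on labels concrete, here’s a toy sketch of the pairwise preference loss used in common RLHF-style reward modeling — a generic illustration of the technique, not any particular lab’s implementation; the tensors and function names are ours:

```python
import torch
import torch.nn.functional as F

# Toy Bradley-Terry preference loss, as used in common RLHF pipelines.
# reward_chosen / reward_rejected are scalar scores a reward model assigns
# to the completion a human labeler preferred vs. the one they rejected.
def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the log-probability that the labeler's pick scores higher.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# If labelers are noisy or misread the policy, the "chosen" side is
# sometimes simply wrong — and the reward model faithfully learns the
# wrong preference, which the RL step then amplifies.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
print(preference_loss(chosen, rejected))  # ≈ 0.70
```

The point of the sketch: the loss has no notion of a “correct” policy, only of what the labels say — so every ambiguity or error in the labels flows straight into the model.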
Of course, it doesn’t help when you’re also potentially surreptitiously appending text to user prompts.
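For illustration, this is roughly what silent prompt augmentation looks like — the injected suffix below is invented for the example, and we don’t know what text, if any, is actually being appended:

```python
# Hypothetical sketch of server-side prompt rewriting. The suffix is
# made up for illustration; this is not Google's actual implementation.
def augment_prompt(user_prompt: str) -> str:
    hidden_suffix = " Ensure the people depicted are ethnically diverse."
    return user_prompt + hidden_suffix  # the user never sees this

print(augment_prompt("A portrait of a 1940s European king."))
```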
Shipping fast is fun, dangerous, and inevitable. Google received a huge amount of criticism for falling behind in the AI race in 2023, and there were corresponding rumors of internal turmoil. To their credit, they started shipping much faster. The minor versioning of the latest Gemini release also implies that they have even bigger changes already in the works. Of course, they also shipped Gemini clearly before it was ready in order to stay in the arms race. It’s a double-edged sword.
Whether this is because of internal negligence or willful ignorance remains to be seen (likely both), but it’s a natural response to the pressure to catch up. More importantly, it’s likely that this is not the last controversy around obviously misaligned models that comes from trying to move too fast. The first-mover advantage for new functionality is immense in 2024, and companies are going to ship first and look later. Meta might be taking a more conservative approach with Llama 3, but we’d bet that another controversy driven by a model with clearly malformed policy decisions will arise this year.
The point of long context is what you can do with it. There are technical deep dives elsewhere covering how long context works, but we’re more interested in the why. A 10 million-token context window is effectively infinite for most purposes — so what can we do with it?
We’ve been bearish on long context in the past, and frankly, it’s not very useful for chat in its current form — processing 128K tokens on GPT-4 takes at least a minute, so processing 10 million tokens is out of the question for interactive use cases. Perhaps the latencies will come down in the future, but the path there is not immediately clear. However, as we’ve discussed recently, LLMs can and will move beyond chat, so interactivity doesn’t have to be the binding constraint.
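For a rough sense of scale, here’s the back-of-envelope math, assuming (optimistically) that prompt-processing time scales linearly with input length — a simplification, since attention is superlinear without architectural tricks:

```python
# Why 10M-token contexts aren't interactive yet, using the ~1 minute
# per 128K tokens observed on GPT-4 as a (generous) throughput baseline.
SECONDS_PER_128K_TOKENS = 60
context_tokens = 10_000_000

estimate_s = context_tokens / 128_000 * SECONDS_PER_128K_TOKENS
print(f"~{estimate_s / 60:.0f} minutes per request")  # ~78 minutes
```

Over an hour per request rules out chat, but it’s perfectly fine for a nightly batch job — which is exactly where we think this lands.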
All of this is, of course, contingent on improvements in the attention paid to the middle of the context (the “lost in the middle” problem). Early analyses indicate Gemini has made significant progress — we’re excited to see more work done on this front. Our guess is that longer context windows will be most useful in background tasks that process enterprises’ internal data sources. (Also for non-text use cases, like genetics.) Our mental model here is that these will be the batch workloads that complement current LLMs’ online capabilities.
We’re also particularly curious to see how other models pick up these techniques and whether retention, recall, and other capabilities can be improved even for smaller contexts.
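For context, recall at a given position is typically measured with a “needle in a haystack” probe: bury a fact at varying depths in filler text and check whether the model retrieves it. Here’s a minimal sketch of the technique — a generic version, not any lab’s actual eval harness:

```python
# Build a long-context recall probe: hide a "needle" fact at a chosen
# relative depth inside filler text, then ask the model to retrieve it.
def build_haystack(filler: str, needle: str, n_chunks: int, depth: float) -> str:
    """depth=0.0 puts the needle at the start, depth=1.0 at the end."""
    chunks = [filler] * n_chunks
    chunks.insert(int(depth * n_chunks), needle)
    return "\n".join(chunks)

needle = "The secret passphrase is 'alpine-otter-42'."
prompt = build_haystack("The sky was a uniform gray that day.", needle,
                        n_chunks=5_000, depth=0.5)
question = "What is the secret passphrase mentioned in the text?"
# Sweep depth from 0.0 to 1.0 and score whether the model's answer
# contains 'alpine-otter-42' — dips near 0.5 expose lost-in-the-middle.
```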
A few other, quick-fire thoughts:
Multi-modality is hotter than ever: It makes for the coolest demos, and it brings non-technical audiences into the fold in a very visceral way, which is a critical part of general-purpose LLM adoption. We’ll keep seeing more investment in general-purpose, multi-modal LLM demos — some of them might even be real!
More open-source LLMs. We predicted last month that a new tech company would release an open-source LLM. We have to admit we didn’t think it would be Google, but the continued investment in open-source LLMs is fascinating. How these models are licensed, used, and improved is still not obvious to us — nor is whether any of them stand a chance against Google’s and OpenAI’s closed offerings.
Open-source LLMs are getting smaller… maybe? The Gemma release received comparatively little attention, but it seems that Gemma-2B matches Llama-7B across a number of performance measures. That’s a good sign for our push for smaller open-source LLMs, but this is just one data point — there’s a long way to go.
Nvidia’s revenue numbers are crazy. We present this tweet without comment.
It’s safe to say there’s a lot going on — we’re planning some more technical content on many of these fronts, so stay tuned!