Last fall, we wrote that OpenAI is too cheap to beat. To date, that's still our most popular post, with over 30k views on Substack. With a title like that, it generated the strong opinions you'd expect, both agreeing and disagreeing with us. The gist of that post is that the cost-performance tradeoff OpenAI was offering at the time was as close to optimal as you were going to get.
This is an awesome post, well thought out. And you are spot on: as I dig deep and productionize small or specialized language models for automating workflows, I clearly see that you do not need a large LLM for everything. I have two questions: 1. I am curious about the claim that Claude 3 Opus is 3x more expensive than GPT-4. Can you point to any data or source behind that? And 2. You compare the scenarios of RAG and fine-tuning. Are you looking into or evaluating merging models?
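For anyone who wants to sanity-check a cost claim like this themselves: the ratio depends on whose list prices you use and on the input/output token mix of your workload. Here is a minimal sketch of a blended-cost comparison; the per-million-token prices below are placeholder assumptions for illustration, not the figures behind the post's claim.

```python
# Placeholder per-million-token prices (assumed, not sourced from the post).
PRICES = {
    "claude-3-opus": {"input": 15.00, "output": 75.00},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
}

def blended_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, given its token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a RAG-style request with a large prompt and a short answer.
opus = blended_cost("claude-3-opus", input_tokens=4000, output_tokens=500)
gpt4 = blended_cost("gpt-4-turbo", input_tokens=4000, output_tokens=500)
print(f"Opus: ${opus:.4f}  GPT-4 Turbo: ${gpt4:.4f}  ratio: {opus / gpt4:.1f}x")
```

Under these assumed prices the ratio shifts with the workload: input-heavy requests narrow it, output-heavy requests widen it.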
Elo being a coin flip really makes me think we don't know how to compare models in general. When you average over many tasks that are somewhat saturating, as with most AI tasks, the signal saturates.
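To put numbers on the coin-flip point: under the standard Elo model, the expected win probability implied by a rating gap is 1 / (1 + 10^(-gap/400)), so small leaderboard gaps really are close to even odds. A quick sketch:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Small rating gaps are close to a coin flip:
for gap in (0, 25, 50, 100):
    print(f"{gap:>3}-point gap -> {elo_win_prob(1200 + gap, 1200):.1%} win rate")
# 0 -> 50.0%, 25 -> 53.6%, 50 -> 57.1%, 100 -> 64.0%
```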
What do you mean by saturating?
Two decent LLM answers to an English question are just so similar most of the time. Usually only specific questions expose weaknesses.
This also relates to what we were talking about at the end of the post w.r.t. Elo asymptoting. Getting from 1000 to 1100 Elo is probably a function of meeting the baseline expectation. Getting from 1150 to 1200 is probably the cream of the crop differentiating itself.
Ah, yeah, 100%. The "it looks like an LLM" problem.
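To make the asymptote concrete: if judges can only tell two models apart on a small fraction of prompts and the rest are effectively coin flips, the observed win rate gets pinned near 50%, which caps the Elo gap the leaderboard can ever show. A toy sketch, where the separable-prompt fractions and the 90% win rate on those prompts are purely assumed:

```python
import math

def implied_elo_gap(win_prob: float) -> float:
    """Invert the Elo expected-score formula: the gap implied by a win rate."""
    return 400.0 * math.log10(win_prob / (1.0 - win_prob))

# Assumption: judges tell the models apart only on a fraction of prompts;
# the stronger model wins 90% of those, and the rest are coin flips.
for separable in (0.05, 0.10, 0.25):
    p = 0.5 * (1 - separable) + 0.9 * separable
    print(f"{separable:.0%} separable prompts -> win rate {p:.1%}, "
          f"max Elo gap ~{implied_elo_gap(p):.0f}")
# 5% -> 52.0% win rate, gap ~14; 10% -> 54.0%, ~28; 25% -> 60.0%, ~70
```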