6 Comments

This is an awesome post - well thought out. And you are spot on: as I dig deeper and productionize workflows using small or specialized language models, I clearly see that you do not need a large LLM for everything. I have two questions: 1. I am curious about the claim that Claude 3 Opus is 3x more expensive than GPT-4. Can you point to any data or source behind that? And 2. You compare the scenarios of RAG and fine-tuning. Are you looking into or evaluating merging models?

Elo being a coin flip really makes me think we don’t know how to compare models in a general way. When you average over many tasks that are somewhat saturating, which is true of most AI tasks, the signal saturates too.

What do you mean by saturating?

Two decent LLM answers to an English question are just so similar most of the time. Usually only very specific questions expose weaknesses.

This also relates to what we were talking about at the end of the post w.r.t. Elo asymptoting. Getting from 1000 to 1100 Elo is probably a function of meeting the baseline expectation. Getting from 1150 to 1200 is probably the cream of the crop differentiating itself.
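For concreteness, here is a minimal sketch of the standard Elo expected-score formula, showing how rating gaps like these map to win probabilities (the specific ratings below are illustrative, not taken from the post):

```python
# Standard Elo expected-score formula: the probability that a player
# rated r_a beats a player rated r_b. Ratings below are illustrative.

def expected_score(r_a: float, r_b: float) -> float:
    """P(A beats B) under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# A 100-point gap (1000 vs 1100) implies only ~64% wins for the
# higher-rated model; a 50-point gap at the top (1150 vs 1200) is
# ~57%, not far from a coin flip.
for low, high in [(1000, 1100), (1150, 1200)]:
    print(f"{high} vs {low}: {expected_score(high, low):.1%} expected win rate")
```

Because the curve is logistic, equal rating gains buy progressively smaller increases in win probability, which is the asymptoting effect described above.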

Ah, yeah, 100%. The "it looks like an LLM" problem.
