Many of us are in the trenches building AI application startups in 2024. The rate of innovation at every level of the stack means there are always shiny new toys to play with, but it can also feel like building on quicksand. Do I need to check out and evaluate every new model release? Do I need to update my RAG architecture to support the latest vector DB? Do I need to experiment with the latest embeddings model or search technique? Do I need to sleep anymore?
Our team at RunLLM has spent a lot of time figuring out how to keep up with the craziness. These are some of the lessons we learned, some by luck and others through pain. We don’t promise we’re right about any one of these (best practices change with the technology!), but hopefully they help point you in the right direction.
Keep up with the news, but don’t be an expert. 2024 has had a frankly ridiculous amount of news so far, and it can feel overwhelming to try to keep up with everything. You might be tempted to go into hermit mode — put your headphones on, shut out the world, and just build. That’s okay for short periods, but the rate of change in the space means that you have to poke your head out pretty frequently. The fundamental technology — LLMs, embeddings models, vector DBs, everything — is getting better too fast to be building on a static stack.
That said, don’t spend your time becoming an expert. You don’t need to understand the nuances of mixture of experts models or be able to explain the MMLU benchmark to anyone. Learn enough to evaluate each piece of news for its relevance to your product. If something’s not relevant to you, move past it quickly. Going a mile wide and an inch deep will allow you to keep up with what affects you the most.
Build for configurability. Your applications should not be inextricably tied to any piece of technology in this space. You should build in a way that allows you to swap components out as the state of the art changes. How you do that is a personal choice — you can choose to use a more complex abstraction layer like a model router, or you can build a simple, custom abstraction like our team did at RunLLM.
Note that we’re not advocating (yet?) for real-time configuration changes — given the high variance between different models, it’s unlikely you’re going to be dynamically swapping LLMs using the same prompt. Instead, think about how much work it’d be for an engineer to experiment with Claude 3, for example. Configurability can still require engineering time — just not a ton.
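To make that concrete, here is a minimal sketch of the kind of thin abstraction we mean. It’s not our actual code: the config shape, model names, and the `complete` helper are all illustrative, and it assumes the official openai and anthropic Python SDKs.

```python
# A minimal sketch of a provider abstraction, not production code.
# The config format and model names here are illustrative assumptions.
from dataclasses import dataclass

import anthropic
import openai


@dataclass
class LLMConfig:
    provider: str          # e.g., "openai" or "anthropic"
    model: str             # e.g., "gpt-4-turbo" or "claude-3-opus-20240229"
    temperature: float = 0.0


def complete(config: LLMConfig, system: str, prompt: str) -> str:
    """Route a single completion call to whichever provider is configured."""
    if config.provider == "openai":
        client = openai.OpenAI()
        resp = client.chat.completions.create(
            model=config.model,
            temperature=config.temperature,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
        )
        return resp.choices[0].message.content
    elif config.provider == "anthropic":
        client = anthropic.Anthropic()
        resp = client.messages.create(
            model=config.model,
            max_tokens=1024,
            temperature=config.temperature,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    raise ValueError(f"Unknown provider: {config.provider}")
```

With something like this in place, trying out Claude 3 becomes a config change plus some prompt iteration, not a rewrite of your application code.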
Know what’s sacred, and know what’s not. To prioritize well, you have to know what you’re willing to change and what’s non-negotiable. This is important for two reasons. First, as a startup, you’re simply not going to be able to try out every new piece of technology. Second, if you don’t have your non-negotiables clearly outlined, the pace of innovation in the space will constantly make you question whether you should just try this one new thing.
This can be a little fluffy, so let us explain by way of example. At RunLLM, what’s sacred is our use of fine-tuned LLMs to learn customers’ technology stacks — regardless of how long context windows get, we believe our solution is more economical and higher quality. What’s also sacred is our focus on responsiveness, so we bias towards using smaller, faster models.
On the other hand, our underlying vector DB is not sacred (we recently switched from Chroma to Elastic to run both vector and text queries), our use of model providers is reasonably flexible, and we’re consistently experimenting with new RAG optimizations. Your list might not be the same as ours, but you sure should have one.
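As an aside, part of what made Elastic appealing is that it can serve a vector query and a keyword query in a single request. Here is a rough sketch of what that looks like with the Elasticsearch 8.x Python client; the index name, field names, and sizes are assumptions for illustration, not our real schema.

```python
# Illustrative hybrid retrieval against Elasticsearch (8.x Python client).
# Index name, field names, and result sizes are assumptions, not our schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def hybrid_search(query_text: str, query_vector: list[float], k: int = 10):
    """Run a kNN vector query and a BM25 text query in one request."""
    return es.search(
        index="docs",
        knn={
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,
        },
        query={"match": {"content": query_text}},
        size=k,
    )
```

Elasticsearch scores both clauses and combines the results, which gives you a simple form of hybrid retrieval without operating two separate stores.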
Scope your experiments aggressively. The thing with LLMs (and ML in general) is that there’s a never-ending list of experiments you can run. One more prompt iteration genuinely might unlock the performance you were looking for — these models are complex enough that we really can’t predict behavior linearly.
Optimizing endlessly is an easy trap for engineers to fall into, so it’s important that you set clear and aggressive timelines for your experiments with new technology. Get used to saying that an experiment didn’t work and that the status quo is better — it might be painful at first, but it’s better to save your time for the next experiment.
Trust your team’s instincts. Ultimately, you and your team are the ones building the product. All the frameworks in the world, including this one, can’t accurately guide every decision that you make. The time that you’re spending running experiments and tuning results is honing your intuition about what works and what doesn’t in this space. That means that from time to time, you’ll have an idea that violates the principles here — you don’t want to run after every crazy idea you have, but you’ll certainly have to sometimes.
Like we said above, this guidance isn’t comprehensive, and it might not be right for everyone. If you have other rules or frameworks you’ve been using, we’d love to hear from you. Happy building!