We’re always excited to try out new AI products but are perpetually short on time to actually do it. Over the holidays, we finally had a chance to try out Devin from Cognition, which was one of the most hyped AI products of 2024.
If you’ve been living under a rock: Cognition released a very flashy demo of Devin, its software engineering agent, in March of last year, showing the agent navigating multiple code files to accomplish a reasonably complicated programming task. We were a little skeptical of how generalizable this would be, but Cognition made Devin generally available last December, so we thought the holidays would be a good time to take Devin for a spin.
Before we dive into sharing our experiences, we want to be very clear that any negative thoughts we share below aren’t meant to be dismissive of Devin (or of any other products). We have all the respect in the world for anyone building AI products — the rate of change of the technology and the general craziness in the market give us all more than enough to keep up with. We’re sharing our feedback to both get better at building RunLLM and to hopefully share perspective that benefits others. If we get anything wrong, feel free to share feedback!
A similar blog post about Devin from Answer AI came across our radar this past week. We haven’t had a chance to do a close read, but for what it’s worth, their takeaways seem to be pretty similar to ours.
Introduction
Devin presents itself as a fairly humble and self-aware tool. The expectation it sets is that it’s a junior software engineer that’s able to accomplish tasks with sufficiently specific guidance. Its rule of thumb is that ~3 hours of work can be accomplished autonomously before Cognition worries about the agent diverging. The expectation setting was clear and quite helpful, and this aligns well with what we’ve written previously about AI work and human handoffs.
Scott Wu (the CEO of Cognition) shared his team’s mental model for using Devin in an interview that we found quite interesting: The “right” way to use Devin is to give it a task first thing in the morning and let it run while you’re doing other work. A few hours later, you can come back to it, inspect the progress it made, and provide any necessary guidance. By the end of the day, it should have been able to complete the task you set it. This aligns very well with how you might collaborate with a junior engineer on your team.
Getting Started & UX
Devin’s user experience and getting started flow are well built. All you need to do is connect Devin to your GitHub repository. Devin scans your repo to get an understanding of what the major components are as well as compilation steps, linting, best practices, etc. As you might imagine for a startup, we have relatively little of that, but it was still able to automatically infer the basics of the build process.
Once it’s taken a pass, it presents its findings to you and asks you to add any additional installation or configuration steps for each individual source directory. As a sanity check, it tries to run all the installation and setup steps.
Specifying a new task for Devin is fairly straightforward. It doesn’t yet have integration with task management systems like Linear or Jira, so it gives you a simple text area to type in a description of the task you want it to tackle. (You can also tag it on a Slack thread, but we haven’t found that to be particularly useful.)
When you give it a new task, Devin will first formulate a plan of action by inspecting the codebase and identifying the changes it thinks it should make. You can watch it inspect different files and see the modifications it’s making in real time on the web app (which is neat), and once it’s finished, it’ll run linters and tests (if you have them!), resolve any issues that come up, and open a pull request. You can comment on the pull request itself to have Devin make changes or fix bugs.
tl;dr: The experience is intuitive and well-integrated into the way most engineering teams work.
What Works
Where Devin excels is in making narrowly-scoped and well-defined changes that are restricted to a single component and touch a few files. Some of the first things that we had it do were: improve the formatting of a pie chart, fix an edge case in how our API was returning data, and set a default update schedule when creating a new data source. Each of these tasks would have taken a person about an hour to investigate, implement, and test. With Devin, it took a minute or two to write a task description and five minutes to test the results before merging.
While it occasionally gets tripped up over general best practices vs. codebase-specific ones, it’s quickly able to react to your guidance — it builds a codebase-specific knowledge base over time that you can inspect and edit. For example, it originally tried to install and import @tanstack/react-query, which isn’t used in the RunLLM codebase. Once we asked it to follow the pattern used in another file, it immediately self-corrected.
We’re not sure if this is generally true or specific to our experience, but it seems to be better at understanding and fixing frontend components than backend ones. For example, it was able to easily navigate a fairly complex set of Redux stores to properly improve state management, but it got confused by a fairly basic FastAPI implementation in Python.
tl;dr: Devin is able to save you time when fixing small bugs or making minor improvements, especially in a frontend codebase.
What Doesn’t Work
After trying out some simpler tasks, we decided to give it something a little more complex that we would still expect a junior engineer to be capable of doing. This was a full-stack task that required adding a new (but isolated) database table, adding a new API call, and integrating that API call into a frontend interaction. Each change itself was relatively minor, but it would need to put the pieces together. We would expect an engineer to do this in a few hours of work.
This is where things started to go poorly. In the original task description, we specified explicitly that there was no database table that tracked the information we wanted. Devin found a legacy API route that accomplished something similar, and it was able to adapt it for the new task, but it seemed to believe that the existence of the legacy route meant that the database would have the information we wanted. Despite repeated prompting, we weren’t able to get it to add a table to the schema.
On the frontend, it added the new button in a location that would only be shown in certain cases. We asked it repeatedly to show the button in all cases, and again, despite repeated prompting, it wasn’t able to make the change.
After trying a few different prompts for both of the above issues, we decided to clone the branch ourselves and see if we could get the PR into a workable state based on what Devin had started. On closer inspection, it turned out that much of the code it added to connect the frontend to the new API call was unworkable — it had neglected the best practices we’d taught it previously, and the code for processing API responses was a gnarly mess. At this point, we decided it would be easier to implement this from scratch than it would be to try to fix Devin’s code.
Interestingly, we also found that it was prone to getting stuck in a loop where it was trying to resolve linter errors but cycling between changes that it had made previously. In this loop, it didn’t seem to be prioritizing listening to user feedback about what to improve in the PR.
We’re not totally sure what caused this level of confusion. The task for each component should have been quite easy, so our best guess is that working across multiple components is what tripped Devin up the most. It didn’t seem to be able to isolate its plan or the feedback we gave to particular components.
tl;dr: A full-stack task that we would fully expect a junior engineer to complete within a day was well beyond Devin’s capabilities. If this was an engineer we’d hired, we would not have kept them on the team.
Economics
When Devin works, the economics of using it are pretty good. You currently pay $500 for 250 ACUs, and the small tasks that Devin succeeded at took 1-5 ACUs ($2-10). Paying a few dollars to fix small bugs and save even just one hour per bug is a great tradeoff — one that we would make any day of the week. The issue is that there’s a very narrow set of tasks that are long enough to require an engineer to context switch and short enough to be in Devin’s working window.
When Devin doesn’t work, the economics start to look suspect. The 3 bigger tasks we tried averaged about 20 ACUs, and 2 of the 3 didn’t yield usable results. While $40 would be extremely cheap for implementing these larger tasks, our (to be fair, limited) sample indicates that these larger tasks consume a disproportionate number of ACUs — these tasks weren’t 5-10x harder than the smaller ones that succeeded. More importantly, they often fail, so you get nothing for your $40.
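To make the math above concrete, here’s a back-of-the-envelope sketch of the per-task costs. The pricing is the $500-for-250-ACU plan mentioned above; the ACU counts and 1-in-3 success rate on larger tasks come from our own (small) sample, so treat the numbers as illustrative rather than general:

```python
# Back-of-the-envelope Devin cost math, using the numbers from our runs.
# Assumes the $500-for-250-ACU plan described above.

PRICE_PER_ACU = 500 / 250  # $2.00 per ACU


def task_cost(acus):
    """Dollar cost of a task that consumes the given number of ACUs."""
    return acus * PRICE_PER_ACU


small = (task_cost(1), task_cost(5))  # small fixes: $2 to $10 each
large = task_cost(20)                 # a bigger task: $40

# Only 1 of our 3 bigger tasks produced a usable PR, so the expected
# spend per *successful* large task is roughly three attempts' worth:
expected_per_success = 3 * large      # $120

print(small, large, expected_per_success)
```

At ~$120 of expected spend per successful larger task, the comparison with an hour or two of engineer time starts to look much less favorable than it does for the small fixes.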
Perhaps the math would work out if RunLLM were a larger company with 50+ well-scoped bugs a month that Devin could fix, but we’ve already run through most of the tasks that we expect Devin to succeed at.
tl;dr: For tasks that Devin is able to complete, it’s an absolute no-brainer to pay the cost. When Devin doesn’t work, you’ll be frustrated that you’re wasting your money.
Takeaways
The promise of an AI-powered software engineer is obvious, and we’re big fans of the modality of spinning off work for an AI to do while you focus on more important things. From a UX perspective, Devin does a great job of enabling you to do that.
As with everything in AI, though, it’s early. We’ve really enjoyed the chance to experiment with Devin, but we find it hard to agree that it’s at the level of a junior software engineer. It’s very good at fixing well-defined tasks that would take a person about an hour to do. Once you go beyond that in scope, the results start to vary pretty dramatically.
We’re confident that Devin (and the underlying technology) will get better, and as they do, the ability to deliver on the promise will grow. As for where things stand today, there’s a long way to go before any software engineers need to worry about losing their jobs.