Testing Devin

12/25/2024

DevinAI became generally available earlier this month. At Liquid AI, we subscribed on day one to try it out.

I tested Devin on our chat application with 10 diverse tasks across the full stack: deployment improvements, backend refactoring, frontend UI tweaks, and end-to-end testing.

The results were disappointing: after burning through 106 credits ($212), Devin fully succeeded on only one of the 10 tasks and partially succeeded on another.

The single complete success was implementing a mock Prisma client for unit testing, the simplest of the 10. I left only a few code review comments, and was able to merge Devin's pull request (PR) without any modification.
The partial success came from adding a new LLM endpoint in the backend. Devin created a decent draft PR, but midway through I found a better implementation that required significant refactoring, and explaining it to Devin was too much trouble. So I ended up taking over the PR and completing it myself.
For the remaining 8 requests, Devin either could not create a working PR, or the PRs did not solve the problem, even after rounds of conversations and corrections.

When I started to test Devin, I was impressed by the smooth experience, and felt that it would be the future of software development: one engineer commanding hundreds of AI agents to do all the work (and eventually the engineer is replaced by Devin and loses the job).

In practice, Devin produces code similar to what Claude produces, and shares the same major flaw: later steps may undo earlier ones or even introduce regressions, probably due to limited context length. As an autonomous agent, it can be slow and error-prone without human intervention. Altogether, it is far worse than me + Claude in both correctness and execution velocity.

It's good to know that engineering jobs are still safe. Even though Devin is way cheaper than me, at least for now only I can do the job 😅. But given that some of the best former Scale AI engineers are working on Devin, I do have high hopes for it, and believe that it can be a lot better in the next 6 months.

By the way, to my friends from LiveRamp, Scale AI, and Airbyte, if you are interested in playing with these latest AI tools while contributing to advanced foundation models, come join me at Liquid AI. We are hiring.
