I vibe coded for two weeks
I’ve been working solo on a new project at my current company, which involves migrating a desktop browser extension to mobile (fairly simple software that scans and downloads images). If you don’t know, mobile browser extensions are only possible on iOS through the Safari Web Extension API, unless you manually sideload them on your device (Android).
Instead of following a traditional migration workflow, my CTO wanted to run an experiment and test whether LLMs can help us (engineers) become more productive (as in reducing time to release) by vibe coding the extension using only prompts. We set a deadline of two weeks (ten work days), aiming to have a version ready for an internal release by then, with a simple goal: no more than 20% of work time could be spent manually writing or fixing generated code; everything else had to be done via prompting.
We bought a Cursor subscription and let it choose which model fits best for each prompt (Claude 3-8, GPT-4o, gemini-pro-max, among others). To ease future maintenance, we decided to use web technologies we already work with, like React, Styled Components and Webpack, as part of the tech stack. Encoding these choices as system rules in Cursor was essential to reduce the likelihood of ending up with code that relied on different frameworks and libraries to compile. We also added a system rule instructing Cursor to write clean, concise code and not to override any project configuration through Webpack.
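For illustration, the rules we relied on looked roughly like the sketch below (paraphrased; the exact wording and file location are assumptions, not our actual rules file):

```text
# Cursor system rules (illustrative sketch)
- Use React, Styled Components and Webpack; do not introduce other frameworks or bundlers.
- Write clean, concise code.
- Do not override or modify the project's Webpack configuration.
- Do not delete existing files or code unless explicitly asked to.
```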
Timeline
We first aimed at getting an end-to-end prototype working, which took two work days. Initially we thought we could get it done within a morning, but Cursor wasn’t able to understand why nothing was appearing on the screen. After manual inspection we realized that the React components’ HTML was rendering properly, but the UI library’s CSS wasn’t being included in the main CSS file (content.css). Cursor tried many times to fix the issue by injecting stylesheet elements into the page, but we had to fix it manually by importing the bundled UI library CSS file into the main CSS file.
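The manual fix boiled down to a one-line import in the extension’s main stylesheet. A minimal sketch, assuming a hypothetical UI library whose bundle lives at node_modules/some-ui-kit/dist/styles.css:

```css
/* content.css — the stylesheet the extension injects into pages */

/* Pull in the UI library's bundled CSS so its rules ship with the extension
   instead of being injected into the page at runtime (path is hypothetical). */
@import "some-ui-kit/dist/styles.css";

/* project-specific styles follow */
```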
We spent the following days fixing all visible UI and logic bugs in the generated code, as well as updating the design to match the specification in Figma. We didn’t use any MCP server to feed Cursor the Figma frames; instead, we attached screenshots of those frames in the chat conversation. I was genuinely impressed that Cursor was able to quickly understand them and produce a reasonably close copy of the frames. This stage took about five work days.
The remaining three days were spent adjusting the feature flows to match the requirements specified by the product team.
The goods
Now I’m going to talk about what I really enjoyed in this experiment:
- Prototyping: at this point I think it’s well known that LLMs are really great at turning an idea into reality in a short amount of time. I had previously experimented with this through ChatGPT, but using an agent that takes actions simplified the process by a lot.
- Automated Development: streamlining decisions and executing actions through agents really feels supernatural. I think the next big step in this area is leveraging agents as offline bots that keep trying to complete and perfect a task while engineers are sleeping.
- Inference: I found it quite interesting that, with such short context on the domain/product, Cursor was able to infer what the next requirements I might ask it to take on could be. And this wasn’t the only way Cursor could infer new requirements; I also noticed the same thing when feeding it frame screenshots.
- Companion: probably the biggest long-term advantage of a tool like Cursor is using it as a companion, a virtual pair reviewer/programmer, a second pair of hands. One major development shift I foresee is using Cursor to generate the glue/common code that we once depended on libraries/packages for.
The bads
As expected, everything has a cost, and we noticed several downsides to vibe coding:
- Duration: we didn’t meet the initial deadline, since the extension wasn’t ready to be released after two weeks. Furthermore, we also weren’t able to hit the 80/20 goal we established at the beginning. To me, it felt like 55% of work time went into fixing bugs and 45% into prompting.
- Conversation Length: most of my Cursor chats had to be restarted after ten or more prompts, otherwise the time to respond/act became unbearably slow.
- Hallucination: although it didn’t feel as bad as with the initial models available on ChatGPT, I still felt that some of the generated code was out of place. Frankly, this is not a Cursor issue, but rather a limitation of LLM context length and training, as Safari Web Extensions are quite unpopular. I also felt insecure the whole time, thinking that Cursor could randomly delete a file or code without my supervision, even though this was somewhat protected against by system rules.
- Sloppiness: and here I mean human sloppiness. My prompts and decision-making started deteriorating because I got so used to getting results fast; many of my chat conversations ended up as prompts like “fix it” or just a pasted error log/stack trace. For me this is the biggest issue, and I knew from the start that I would feel this way (we were sold the idea that AI would make us more powerful, but I think it’s making us dumber!).
Overall I’m glad we ran this experiment on a new project, since it’s much easier to evaluate the state of tools like Cursor on a greenfield project with few requirements and decision flows. I wouldn’t recommend replicating the same thing on a legacy project, as it would bring more harm than good to the engineer or team. I also think that, at $20/month, Cursor is well worth it.