Building Self-Improving Agents
I built an agent that taught itself SEO.
I tried to build a tiny self-improving growth agent.
It made $388 across March, April, and May.
Not life-changing money. But it was real revenue from a product that had mostly shrugged back at me a few weeks earlier, and it gave me a cleaner answer to a question I keep running into with Lazyweb: can an agent compound context and get better at a small but valuable job to be done (JTBD)?
Been a while since we talked. Lazyweb is starting to work: retention is up, growth is up, usage is up. It’s all moving the right way, and the product is pulling us towards something bigger than an MCP tool that agents can use for design references [for free (: ]
The next version should know which A/B tests worked in your industry and for your competitors, pull in data and context from your company, and know how to combine that with proprietary context to become something closer to an AI growth PM.
That is a meaty problem. So I wanted a smaller test with a shorter time to learn: could I make an agent improve at one narrow growth job, with real users and real revenue, without pretending the agent was smarter than it was? That led me to SEO.
The Demo Version
The cleanest version of this idea is Karpathy’s AutoResearch: edit training code, run a short experiment, check validation loss, keep or discard the change, repeat.
That setup has what agents need: a measurable goal, a short feedback loop, a place to store what worked, and little room to hide behind activity.
The idea gets harder once you leave clean loss curves. Anthropic partnered with Andon Labs to have Claude run a small office store in Project Vend. There are trading agents like AI-Trader, and products like Polsia that promise agents for coding, marketing, support, and ads.
The hard part is not making an agent do things. The hard part is getting it to improve when the feedback is noisy, delayed, and only partially explained.
My question was: could a normie like me make an agent improve over time in a real product, with real users, real revenue, and ambiguity?
The Test Bed
LazyScreenshots is a Mac app I built a few months back. One keystroke to capture a screenshot and paste it into Claude, Cursor, ChatGPT, Codex, Slack, Notion, or whatever app you are working in.
The Loop
The agent did not run an autonomous company. It wrote two SEO article bets every day.
Not two articles. Two bets.
Each article needed a search intent, a reader with a job to do, and a hypothesis for why that reader might care about LazyScreenshots.
The loop was:
Give the agent basic SEO principles: serve search intent, write for the person trying to solve the job, prefer buyer intent over vanity traffic, and make the product connection honest.
Have it study the product and search surface.
Generate two article bets per day.
Publish the work.
Measure visits, CTR, and revenue over time.
Once a week, reflect on what worked.
Rewrite learnings.md.
Repeat from a fresh-context scheduled run.
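The steps above can be sketched as a small data model plus two functions. This is only an illustration under assumptions, not the actual setup: the names `Bet`, `DurableState`, `daily_run`, and `weekly_reflection` are hypothetical, and the `propose_bets` and `rewrite_learnings` callables stand in for whatever the agent does when it writes and reflects.

```python
from dataclasses import dataclass, field


@dataclass
class Bet:
    """One article bet: a search intent, a reader with a job, and a hypothesis."""
    search_intent: str
    reader_job: str
    hypothesis: str
    visits: int = 0
    clicks: int = 0
    revenue: float = 0.0


@dataclass
class DurableState:
    """Everything that survives between runs (hypothetical structure)."""
    learnings: str                      # contents of learnings.md
    bets: list = field(default_factory=list)


def daily_run(state, propose_bets):
    """One scheduled run: read durable state and add at most two bets."""
    new_bets = propose_bets(state.learnings)[:2]
    state.bets.extend(new_bets)
    return new_bets


def weekly_reflection(state, rewrite_learnings):
    """Once a week, rewrite learnings.md as a whole instead of appending to it."""
    state.learnings = rewrite_learnings(state.learnings, state.bets)
    return state.learnings
```

The point of the split is that `daily_run` only ever sees `state.learnings`, never the reasoning from earlier runs.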
The Trick
The trick was not remembering everything. It was being careful about what not to learn from.
When a page fails, you usually do not know why. The premise might be wrong, the execution might be weak, Google might not have indexed it properly, the page might need more internal links, or the keyword might have no buyer intent.
Turning every failure into a rule creates fake learning. The rule was: study the winners, be cautious with the losers, and recycle a bet in a different form a couple of weeks later if it still seems strategically sound.
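A rough sketch of that rule as a triage function. The win threshold, the 14-day reading of "a couple of weeks", and the `still_sound` flag are all assumptions made for illustration. In practice the "still strategically sound" call is a judgment, not a boolean.

```python
from datetime import date, timedelta

RECYCLE_AFTER = timedelta(days=14)  # assumed reading of "a couple of weeks"


def triage(revenue, published, today, still_sound, win_threshold=1.0):
    """Decide what to do with a bet's outcome.

    "study"   -> a winner: worth mining for a learning
    "recycle" -> a loser that still seems sound: retry it in a different form
    "no_rule" -> a loser with an unknown cause: do not turn it into a rule
    """
    if revenue >= win_threshold:        # win_threshold is a hypothetical cutoff
        return "study"
    if still_sound and today - published >= RECYCLE_AFTER:
        return "recycle"
    return "no_rule"
```

Note that no branch produces a rule from a failure. A loser is either retried later or left alone.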
When there was a real learning, I asked Claude to rethink the entire learnings.md file. Not append a note. Rethink it. Find contradictions. Sharpen vague rules. Turn “high intent keywords are better” into something that would change the next article.
Bad learning:
Prioritize feature comparison.
Useful learning:
Links to a specific feature, such as paste, compare, annotate, or automate screenshots, have a higher click-through rate than links to LazyScreenshots.
The learnings.md file was not a memory dump. It was a set of principles for how to do the job, rewritten as new evidence made the principles more specific and narrower in scope.
This is the second counterintuitive thing… the loop should make the agent learn less, not more. Each successful run should produce sharper, narrower hypotheses the agent can test next.
The Fresh Context Part
Using /schedule mattered more than I expected because each run started fresh. The agent had to re-enter through the durable state: the product, the metrics, and learnings.md.
When I tried doing this manually in one long-running context, the output started collapsing toward the same obvious ideas: alternatives pages, generic screenshot workflows, and similar product angles with slightly different titles.
Fresh runs created more divergence. One run could find comparison intent. Another could find workflow intent. Another could notice that “paste screenshot into Claude” is a different reader from “CleanShot X alternative,” even though both can point to the same product. As long as we kept track of our ideation process in another .md file, this worked pretty well.
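One way to picture the fresh-run idea: each run's starting context is assembled only from durable state, with no transcript from earlier runs. This is a minimal sketch, not how /schedule works internally. The function name, the section labels, and the idea of passing each piece as a plain string are assumptions for illustration.

```python
def build_fresh_context(product, metrics, learnings, ideation_log):
    """Assemble a run's starting context from durable state only.

    Nothing from a previous run's conversation is passed in, so the only
    way an old idea carries over is through one of these four pieces.
    """
    sections = [
        ("Product", product),
        ("Metrics", metrics),
        ("learnings.md", learnings),
        ("Ideation log", ideation_log),   # the other .md file that tracks past ideas
    ]
    return "\n\n".join(f"## {title}\n{text.strip()}" for title, text in sections)
```

Because the function has no parameter for chat history, a long-running context cannot leak into the next run by accident.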
The Result
The SEO agent made $388:
March: $62.50
April: $145
May: $180.50
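For anyone who wants to check the arithmetic, here is a quick sketch that sums the three months and computes month-over-month growth. The figures are the ones listed above; the variable names are just for illustration.

```python
# Monthly revenue as listed above, in dollars.
revenue = {"March": 62.50, "April": 145.00, "May": 180.50}

total = sum(revenue.values())

# Month-over-month growth, as a percentage of the previous month.
months = list(revenue)
growth = {}
for prev, cur in zip(months, months[1:]):
    growth[cur] = round((revenue[cur] - revenue[prev]) / revenue[prev] * 100, 1)

print(f"total: ${total:.2f}")
for month, pct in growth.items():
    print(f"{month}: {pct:+.1f}% vs previous month")
```

That works out to $388.00 in total, with April up 132.0% on March and May up 24.5% on April. Three data points are a small sample, so treat the trend as suggestive rather than proven.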
It’s really cool to see this with literally 0 input from me in months. I didn’t give it any feedback whatsoever…it just learned how to write better SEO and position LazyScreenshots in the crowded field of Mac screenshot tools.
My Current Definition
A self-improving agent is a system that gets better at a specific job because it can observe outcomes and update its operating manual when the evidence changes.
That requires three things.
1. A domain with feedback
The agent needs reality to correct it. For SEO, that was visits, CTR, and revenue.
2. A narrow job with fresh runs
“Write two SEO article bets per day for this one Mac app, then learn from visits, CTR, and revenue” is a decent scope for this primitive method. We are at the intern part of the cycle for self-improving agents.
Fresh runs matter because they separate durable learning from temporary context. /schedule gave me that split: durable lessons in learnings.md, fresh reasoning in each session.
3. Memory with a high bar
Most agent memory stores facts without changing behavior. For this to work, learnings.md had to become an operating manual, not a diary or transcript archive.
The add-to-memory bar was high: a learning had to change the next article. The remove-from-memory bar mattered too: if a rule was too generic, contradicted newer evidence, or came from a failure where the cause was unclear, it did not belong in the operating manual.
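The two bars can be sketched as a single filter over candidate rules. This is an illustrative sketch: the `Rule` fields and the `prune` function are hypothetical, and in practice each flag is a judgment the agent (or you) makes while rewriting the file, not something computed automatically.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    changes_next_article: bool            # the add-to-memory bar
    contradicted_by_newer_evidence: bool = False
    from_unclear_failure: bool = False


def belongs_in_manual(rule):
    """A rule stays only if it clears the add bar and trips no removal condition."""
    return (
        rule.changes_next_article         # too-generic rules fail here
        and not rule.contradicted_by_newer_evidence
        and not rule.from_unclear_failure
    )


def prune(rules):
    """Return the rules that should survive a rewrite of learnings.md."""
    return [rule for rule in rules if belongs_in_manual(rule)]
```

Run against the two examples from earlier, "Prioritize feature comparison." would be marked as not changing the next article and dropped, while the rule about linking to a specific feature would stay.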
The Verdict
It kinda worked: it did improve and make more money over time, but it didn’t, like, build an SEO business and run it autonomously.
I think it worked because the job was very narrow, it had strong principles encoded about what counts as valuable information and what does not, and the feedback loop was tight.
We are still so so so early, I can’t wait for the first (real) gen of self-improving agents.
We’re so early,
Ali Abouelatta


