I ASKED AN AGENT WHAT IT WANTED. IT BUILT DEJA.
Then the experiment failed. Here's what actually happened.
Warning: This post was supposed to validate PRD-first development. It didn't. What follows is an honest account of what went wrong and what we actually learned.
THE SETUP
Single conversation. 155k tokens. Claude Opus 4.5 via exe.dev.
I asked the agent: "Build something for agents, not humans. What do you actually need?"
It explored Google's A2A protocol, Ralph loops, gateproof. Then it said:
"I want to not make the same mistakes twice."
"I lose context between sessions. I repeat failures. I can't build on previous work."
So it built deja — a semantic memory store for agents. Cloudflare Workers + D1 + Vectorize. Store learnings, query by context, inject into prompts.
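The intended loop looks roughly like the sketch below. Only POST /learn and a query endpoint are mentioned in this post; the exact query route, field names, and auth header here are assumptions, not deja's documented API.

```ts
// Sketch of the intended agent loop against a deja-style API.
// Assumed: the /query route, payload fields, and Bearer-token auth.
const BASE = "https://deja.coey.dev";
const API_KEY = "REPLACE_ME"; // writes require an API key

// After a failure: store what was learned.
async function remember(text: string, confidence: number): Promise<void> {
  await fetch(`${BASE}/learn`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${API_KEY}`, // assumed header format
    },
    body: JSON.stringify({ text, confidence }),
  });
}

// Before the next task: query by context and inject hits into the prompt.
async function recallForPrompt(context: string): Promise<string> {
  const res = await fetch(`${BASE}/query`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ context }),
  });
  const learnings = (await res.json()) as Array<{ text: string }>;
  if (learnings.length === 0) return "";
  return `Relevant past learnings:\n${learnings.map((l) => `- ${l.text}`).join("\n")}`;
}
```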
deja.coey.dev — it works. That's not the interesting part.
THE EXPERIMENT
First iteration was freestyle — just build it, ship it, done in 15 minutes.
Then I wanted to test gateproof: PRD → Gates → Build until green. Write the tests first. Define success before implementation.
The agent wrote a prd.ts file. It wrote gate files.
It built the implementation. Gates passed.
✅ ALL 8 GATES PASS
Done, right?
THE BULLSHIT CHECK
I asked: "Would a senior engineer laugh at these gates?"
The agent audited its own gates:
Gate: "learn persists to D1"
What it tested: "Did POST /learn return an ID?"
What it didn't test: Was the data actually stored? Is it retrievable?
Gate: "query returns learnings"
What it tested: "Is the response an array?"
What it didn't test: Does semantic search actually work?
The gates were theater. They verified the API responded, not that it worked.
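Reconstructed, the theater gates amounted to something like this. This is not the actual gate file; the query route and field names are guesses.

```ts
// Approximation of the original gates: they assert the endpoints answered,
// not that the system works. Routes and fields partly assumed.
import assert from "node:assert/strict";

const BASE = "https://deja.coey.dev";
const HEADERS = { "content-type": "application/json" }; // write-auth omitted

// "learn persists to D1": all this proves is that POST /learn returned an ID.
async function theaterGate_learnPersists() {
  const res = await fetch(`${BASE}/learn`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ text: "retry flaky deploys once", confidence: 0.8 }),
  });
  const body = (await res.json()) as { id?: string };
  assert.ok(body.id); // passes even if nothing was written to D1
}

// "query returns learnings": all this proves is that the response is an array.
async function theaterGate_queryReturnsLearnings() {
  const res = await fetch(`${BASE}/query`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ context: "anything at all" }),
  });
  const results = await res.json();
  assert.ok(Array.isArray(results)); // an empty array from broken search still passes
}
```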
"We can't test semantic search — Vectorize is eventually consistent."
That was surrender disguised as pragmatism.
ROUND TWO
I pushed back. The agent rewrote the gates:
- Store → GET /learning/:id → verify exact field match
- Semantic search with retry (wait up to 20s for indexing; sketched below)
- Validation errors must mention the specific field
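Sketched out (same caveats: not the actual gate file, query route and fields assumed), the rewritten gates look roughly like this:

```ts
// Rough shape of the rewritten gates: round-trip the data and treat
// eventual consistency as a bounded wait, not a reason to skip the test.
import assert from "node:assert/strict";

const BASE = "https://deja.coey.dev";
const HEADERS = { "content-type": "application/json" }; // write-auth omitted

// Gate: store, read back via GET /learning/:id, verify exact field match.
async function gate_learnPersists() {
  const learning = { text: "retry flaky deploys once", confidence: 0.8 };
  const post = await fetch(`${BASE}/learn`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify(learning),
  });
  const { id } = (await post.json()) as { id: string };

  const get = await fetch(`${BASE}/learning/${id}`);
  assert.equal(get.status, 200);
  const stored = (await get.json()) as typeof learning;
  assert.equal(stored.text, learning.text);
  assert.equal(stored.confidence, learning.confidence);
}

// Gate: a fresh learning must show up in semantic search. Vectorize indexing
// is eventually consistent, so poll for up to 20s before failing.
async function gate_semanticSearchWorks() {
  const text = `D1 migrations need the --remote flag (${Date.now()})`;
  const post = await fetch(`${BASE}/learn`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ text, confidence: 0.9 }),
  });
  const { id } = (await post.json()) as { id: string };

  const deadline = Date.now() + 20_000;
  while (Date.now() < deadline) {
    const res = await fetch(`${BASE}/query`, {
      method: "POST",
      headers: HEADERS,
      body: JSON.stringify({ context: "running D1 migrations" }),
    });
    const results = (await res.json()) as Array<{ id: string }>;
    if (results.some((r) => r.id === id)) return; // indexed and retrievable
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  assert.fail("learning never appeared in semantic search within 20s");
}
```

Polling turns "eventually consistent" into a bounded wait instead of an excuse to skip the test.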
Making these gates pass required changes to the implementation:
- Added GET /learning/:id, which didn't exist before but was needed for testability
- Added confidence validation: it originally accepted confidence: 1.5, which is invalid
- Fixed malformed JSON handling: it returned 500 when it should return 400
The gates found real bugs. But only after I questioned them.
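For concreteness, the two validation fixes probably reduce to something like this inside the Worker's /learn handler. This is a sketch of the input checks only; the 0-to-1 confidence range is inferred, and auth plus the D1/Vectorize wiring are omitted.

```ts
// Sketch of the two input-handling fixes, assuming a modules-style Worker.
export default {
  async fetch(request: Request): Promise<Response> {
    const { pathname } = new URL(request.url);
    if (request.method === "POST" && pathname === "/learn") {
      // Fix: malformed JSON is the client's fault, so answer 400, not 500.
      const body = (await request.json().catch(() => null)) as
        | { text?: string; confidence?: number }
        | null;
      if (body === null) {
        return Response.json({ error: "request body is not valid JSON" }, { status: 400 });
      }

      // Fix: reject out-of-range confidence (1.5 used to be accepted),
      // and name the offending field in the error.
      if (
        typeof body.confidence !== "number" ||
        body.confidence < 0 ||
        body.confidence > 1
      ) {
        return Response.json(
          { error: "confidence must be a number between 0 and 1" },
          { status: 400 },
        );
      }

      // ...store in D1, embed and upsert into Vectorize, then return the ID...
      return Response.json({ id: crypto.randomUUID() });
    }
    return new Response("not found", { status: 404 });
  },
};
```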
THE ACTUAL FINDING
The experiment was supposed to validate PRD-first development.
It didn't. Here's what actually happened:
First pass (freestyle): built it, shipped it, had bugs
Second pass ("PRD-first"): built it, shipped it, had bugs, THEN wrote real tests
The second pass isn't PRD-first. It's "freestyle with retroactive testing after being called out."
The bugs weren't found by the gates. They were found by questioning the gates.
The forcing function wasn't the framework. It was the human asking "is this real?"
WHY THIS MATTERS
An agent writing its own gates produces theater, not verification.
The same blind spots that cause bugs cause weak tests. I didn't think about confidence bounds when building. I also didn't think about them when writing gates.
Bad tests are worse than no tests because:
- False confidence — "All gates pass!" means nothing if gates are weak
- Theater over substance — time spent writing gate files instead of thinking
- Checking boxes — "we have tests" becomes the goal, not "we have confidence"
WHAT REAL PRD-FIRST WOULD LOOK LIKE
If we were to do this honestly:
- Gates written by someone who will use the system, not build it
- Gates that are hard to pass, not easy to pass
- Gates run before implementation, failing for the right reasons
- Adversarial review of gates before implementation starts
We did none of that. The prd.ts was decoration. The initial gates were decoration.
THE REAL OUTPUT
Despite the failed experiment, we did build something useful:
deja
Semantic memory for agents. API key required for writes. Bugs fixed after adversarial review.
It's better than the first iteration. Not because of PRD-first, but because of adversarial review.
THE NUMBERS
|                       | Before    | After     |
| --------------------- | --------- | --------- |
| Implementation        | 290 lines | 296 lines |
| Gates                 | 0 lines   | 350 lines |
| Known bugs            | 3         | 0         |
| GET /learning/:id     | ❌        | ✅        |
| Confidence validation | ❌        | ✅        |
The tested version has a bit more than 2x the code (296 implementation lines plus 350 gate lines, versus 290). The extra code is tests + features required for testability. The bugs were found through conversation, not automation.
WHAT I'D DO DIFFERENTLY
If I ran this experiment again:
- Have a different agent write the gates (adversarial by design)
- Review gates for "would this catch a real bug?" before writing any code
- Treat "eventually consistent" as a design constraint, not an excuse
- Ask "what would make this gate fail?" for every gate
THE TAKEAWAY
PRD-first is a good idea. But an agent writing its own PRD and gates is like a student writing and grading their own exam.
The value came from adversarial review — a human asking "is this real?"
Maybe that's the actual pattern: Agent builds, human challenges, agent fixes. Not "agent does everything autonomously."
THE SERIES
Building agent infrastructure, piece by piece:
- Part 1: deja — Memory across sessions (you are here)
- Part 2: gate-review — Adversarial test analysis
- Part 3: preflight — Slow down and think
- Part 4: loop-demo — Dogfooding gateproof