I ASKED AN AGENT WHAT IT WANTED. IT BUILT DEJA.
Then the experiment failed. Here's what actually happened.
Warning: This post was supposed to validate PRD-first development. It didn't. What follows is an honest account of what went wrong and what we actually learned.
THE SETUP
Single conversation. 155k tokens. Claude Opus 4.5 via exe.dev.
I asked the agent: "Build something for agents, not humans. What do you actually need?"
It explored Google's A2A protocol, Ralph loops, gateproof. Then it said:
"I want to not make the same mistakes twice."
"I lose context between sessions. I repeat failures. I can't build on previous work."
So it built deja — a semantic memory store for agents. Cloudflare Workers + D1 + Vectorize. Store learnings, query by context, inject into prompts.
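The intended loop looks roughly like the sketch below. Only POST /learn and a query endpoint are mentioned in this post; the exact query route, field names, and auth header here are assumptions, not deja's documented API.

```ts
// Sketch of the intended agent loop against a deja-style API.
// Assumed: the /query route, payload fields, and Bearer-token auth.
const BASE = "https://deja.coey.dev";
const API_KEY = "REPLACE_ME"; // writes require an API key

// After a failure: store what was learned.
async function remember(text: string, confidence: number): Promise<void> {
  await fetch(`${BASE}/learn`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${API_KEY}`, // assumed header format
    },
    body: JSON.stringify({ text, confidence }),
  });
}

// Before the next task: query by context and inject hits into the prompt.
async function recallForPrompt(context: string): Promise<string> {
  const res = await fetch(`${BASE}/query`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ context }),
  });
  const learnings = (await res.json()) as Array<{ text: string }>;
  if (learnings.length === 0) return "";
  return `Relevant past learnings:\n${learnings.map((l) => `- ${l.text}`).join("\n")}`;
}
```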
deja.coey.dev — it works. That's not the interesting part.
THE EXPERIMENT
First iteration was freestyle — just build it, ship it, done in 15 minutes.
Then I wanted to test gateproof: PRD → Gates → Build until green. Write the tests first. Define success before implementation.
The agent wrote a prd.ts file. It wrote gate files.
It built the implementation. Gates passed.
✅ ALL 8 GATES PASS
Done, right?
THE BULLSHIT CHECK
I asked: "Would a senior engineer laugh at these gates?"
The agent audited its own gates:
Gate: "learn persists to D1"
What it tested: "Did POST /learn return an ID?"
What it didn't test: Was the data actually stored? Is it retrievable?
Gate: "query returns learnings"
What it tested: "Is the response an array?"
What it didn't test: Does semantic search actually work?
The gates were theater. They verified the API responded, not that it worked.
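Reconstructed, the theater gates amounted to something like this. This is not the actual gate file; the query route and field names are guesses.

```ts
// Approximation of the original gates: they assert the endpoints answered,
// not that the system works. Routes and fields partly assumed.
import assert from "node:assert/strict";

const BASE = "https://deja.coey.dev";
const HEADERS = { "content-type": "application/json" }; // write-auth omitted

// "learn persists to D1": all this proves is that POST /learn returned an ID.
async function theaterGate_learnPersists() {
  const res = await fetch(`${BASE}/learn`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ text: "retry flaky deploys once", confidence: 0.8 }),
  });
  const body = (await res.json()) as { id?: string };
  assert.ok(body.id); // passes even if nothing was written to D1
}

// "query returns learnings": all this proves is that the response is an array.
async function theaterGate_queryReturnsLearnings() {
  const res = await fetch(`${BASE}/query`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ context: "anything at all" }),
  });
  const results = await res.json();
  assert.ok(Array.isArray(results)); // an empty array from broken search still passes
}
```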
"We can't test semantic search — Vectorize is eventually consistent."
That was surrender disguised as pragmatism.
ROUND TWO
I pushed back. The agent rewrote the gates:
- Store → GET /learning/:id → verify exact field match
- Semantic search with retry (wait up to 20s for indexing; sketched below)
- Validation errors must mention the specific field
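Sketched out (same caveats: not the actual gate file, query route and fields assumed), the rewritten gates look roughly like this:

```ts
// Rough shape of the rewritten gates: round-trip the data and treat
// eventual consistency as a bounded wait, not a reason to skip the test.
import assert from "node:assert/strict";

const BASE = "https://deja.coey.dev";
const HEADERS = { "content-type": "application/json" }; // write-auth omitted

// Gate: store, read back via GET /learning/:id, verify exact field match.
async function gate_learnPersists() {
  const learning = { text: "retry flaky deploys once", confidence: 0.8 };
  const post = await fetch(`${BASE}/learn`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify(learning),
  });
  const { id } = (await post.json()) as { id: string };

  const get = await fetch(`${BASE}/learning/${id}`);
  assert.equal(get.status, 200);
  const stored = (await get.json()) as typeof learning;
  assert.equal(stored.text, learning.text);
  assert.equal(stored.confidence, learning.confidence);
}

// Gate: a fresh learning must show up in semantic search. Vectorize indexing
// is eventually consistent, so poll for up to 20s before failing.
async function gate_semanticSearchWorks() {
  const text = `D1 migrations need the --remote flag (${Date.now()})`;
  const post = await fetch(`${BASE}/learn`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ text, confidence: 0.9 }),
  });
  const { id } = (await post.json()) as { id: string };

  const deadline = Date.now() + 20_000;
  while (Date.now() < deadline) {
    const res = await fetch(`${BASE}/query`, {
      method: "POST",
      headers: HEADERS,
      body: JSON.stringify({ context: "running D1 migrations" }),
    });
    const results = (await res.json()) as Array<{ id: string }>;
    if (results.some((r) => r.id === id)) return; // indexed and retrievable
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  assert.fail("learning never appeared in semantic search within 20s");
}
```

Polling turns "eventually consistent" into a bounded wait instead of an excuse to skip the test.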
Making these gates pass required changes to the implementation:
- Added GET /learning/:id, which didn't exist before but was needed for testability
- Added confidence validation: it originally accepted confidence: 1.5, which is invalid
- Fixed malformed JSON handling: it returned 500 when it should return 400
The gates found real bugs. But only after I questioned them.
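For concreteness, the two validation fixes probably reduce to something like this inside the Worker's /learn handler. This is a sketch of the input checks only; the 0-to-1 confidence range is inferred, and auth plus the D1/Vectorize wiring are omitted.

```ts
// Sketch of the two input-handling fixes, assuming a modules-style Worker.
export default {
  async fetch(request: Request): Promise<Response> {
    const { pathname } = new URL(request.url);
    if (request.method === "POST" && pathname === "/learn") {
      // Fix: malformed JSON is the client's fault, so answer 400, not 500.
      const body = (await request.json().catch(() => null)) as
        | { text?: string; confidence?: number }
        | null;
      if (body === null) {
        return Response.json({ error: "request body is not valid JSON" }, { status: 400 });
      }

      // Fix: reject out-of-range confidence (1.5 used to be accepted),
      // and name the offending field in the error.
      if (
        typeof body.confidence !== "number" ||
        body.confidence < 0 ||
        body.confidence > 1
      ) {
        return Response.json(
          { error: "confidence must be a number between 0 and 1" },
          { status: 400 },
        );
      }

      // ...store in D1, embed and upsert into Vectorize, then return the ID...
      return Response.json({ id: crypto.randomUUID() });
    }
    return new Response("not found", { status: 404 });
  },
};
```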
THE ACTUAL FINDING
The experiment was supposed to validate PRD-first development.
It didn't. Here's what actually happened:
First pass (freestyle): built it, shipped it, had bugs
Second pass ("PRD-first"): built it, shipped it, had bugs, THEN wrote real tests
The second pass isn't PRD-first. It's "freestyle with retroactive testing after being called out."
The bugs weren't found by the gates. They were found by questioning the gates.
The forcing function wasn't the framework. It was the human asking "is this real?"
WHY THIS MATTERS
An agent writing its own gates produces theater, not verification.
The same blind spots that cause bugs cause weak tests. I didn't think about confidence bounds when building. I also didn't think about them when writing gates.
Bad tests are worse than no tests because:
- False confidence — "All gates pass!" means nothing if gates are weak
- Theater over substance — time spent writing gate files instead of thinking
- Checking boxes — "we have tests" becomes the goal, not "we have confidence"
WHAT REAL PRD-FIRST WOULD LOOK LIKE
If we were to do this honestly:
- Gates written by someone who will use the system, not build it
- Gates that are hard to pass, not easy to pass
- Gates run before implementation, failing for the right reasons
- Adversarial review of gates before implementation starts
We did none of that. The prd.ts was decoration. The initial gates were decoration.
THE REAL OUTPUT
Despite the failed experiment, we did build something useful:
deja
Semantic memory for agents. API key required for writes. Bugs fixed after adversarial review.
It's better than the first iteration. Not because of PRD-first, but because of adversarial review.
THE NUMBERS
|                       | Before    | After     |
| --------------------- | --------- | --------- |
| Implementation        | 290 lines | 296 lines |
| Gates                 | 0 lines   | 350 lines |
| Known bugs            | 3         | 0         |
| GET /learning/:id     | ❌        | ✅        |
| Confidence validation | ❌        | ✅        |
The tested version has a bit more than 2x the code (296 implementation lines plus 350 gate lines, versus 290). The extra code is tests + features required for testability. The bugs were found through conversation, not automation.
WHAT I'D DO DIFFERENTLY
If I ran this experiment again:
- Have a different agent write the gates (adversarial by design)
- Review gates for "would this catch a real bug?" before writing any code
- Treat "eventually consistent" as a design constraint, not an excuse
- Ask "what would make this gate fail?" for every gate
THE TAKEAWAY
PRD-first is a good idea. But an agent writing its own PRD and gates is like a student writing and grading their own exam.
The value came from adversarial review — a human asking "is this real?"
Maybe that's the actual pattern: Agent builds, human challenges, agent fixes. Not "agent does everything autonomously."
THE SERIES
Building agent infrastructure, piece by piece:
- Part 1: deja — Memory across sessions (you are here)
- Part 2: gate-review — Adversarial test analysis
- Part 3: preflight — Slow down and think
- Part 4: loop-demo — Dogfooding gateproof