Today, 2026-05-24, `FORGE_MODE=v1` went live. The preceding ten days were dry-run shadow mode. `strategic_cycle()` runs every six hours, and the period contained 41 complete cycles.
Each cycle produced five tasks and four experiment pairs. All of them were written to disk. None of them triggered budget, developer time, or external actions. The proposals were reviewed in manual sessions when time allowed, but the loop itself had no durable memory of those reviews until the CEO feedback wiring was added.
Why Shadow Mode First
The logic for starting in shadow is simple: for autonomous systems that make decisions repeatedly, wrong decisions compound. One misguided experiment that never leaves disk costs only the compute to generate the proposal and the disk writes to store it. The same experiment, approved and acted on, costs real budget, developer attention, and the opportunity cost of whatever else that attention could have been applied to over the experiment's lifespan.
Shadow mode is also diagnostic. Running forty-plus cycles of proposals without taking action produces a sample of what the system actually proposes under production conditions: the real world-model state and the same configuration it will have when it goes live. That sample is worth more than any amount of unit testing against synthetic inputs.
The output of the 41 cycles was roughly consistent in shape. The cleanest experiments were proposals for bounded, reversible work tied to key results that were visibly underperforming. The proposals flagged during manual review either (a) repeated a direction that had already been tried and logged somewhere in the codebase, (b) targeted a key result that was already at 100% completion, or (c) proposed new resource allocation while other open experiments on the same key result were still in flight. None of those failure modes required the system to be doing anything exotic. They are the normal failure modes of a proposer that does not read its own history clearly enough before generating the next batch.
That observation shaped the gate criteria.
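The three flagged failure modes can be expressed as mechanical checks against history. The following is a minimal sketch, not Forge's actual code: `Proposal`, `review_flags`, the flag names, and the shape of the three history inputs are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    # Hypothetical proposal shape; the real Forge schema is not shown in this post.
    key_result: str                   # identifier of the targeted key result
    direction: str                    # short label for what the experiment tries
    requests_resources: bool = False  # True if it asks for new budget or time


def review_flags(proposal, tried_directions, kr_completion, open_experiments):
    """Return the review flags that apply to one proposal.

    tried_directions: set of (key_result, direction) pairs already logged
    kr_completion:    dict of key_result -> completion in [0.0, 1.0]
    open_experiments: dict of key_result -> number of experiments in flight
    """
    flags = []
    # (a) repeats a direction that was already tried and logged
    if (proposal.key_result, proposal.direction) in tried_directions:
        flags.append("repeats_tried_direction")
    # (b) targets a key result that is already at 100% completion
    if kr_completion.get(proposal.key_result, 0.0) >= 1.0:
        flags.append("key_result_already_complete")
    # (c) asks for new resources while experiments on the same key result are open
    if proposal.requests_resources and open_experiments.get(proposal.key_result, 0) > 0:
        flags.append("resources_requested_while_in_flight")
    return flags
```

An empty list means the proposal passes all three checks. Matching on an exact `(key_result, direction)` pair is the simplest possible notion of "repeated direction"; a real proposer would likely need fuzzier matching.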
The Gate Criteria for Going Live
We required five things to be true before moving to v1:
1. CEO Feedback Loop fully wired. HMAC-signed URL clicks for keep, kill, and comment. Kill semantics that write `killed_by_ceo` status and append anti-pattern Playbook entries to ProceduralMemory. The anti-patterns loaded by the next cycle before any proposals are generated.
2. `strategic_cycle()` reading anti-patterns from ProceduralMemory before proposing. The proposer receiving the list of known anti-patterns and using it to skip or annotate proposals whose core claim matches a recorded pattern.
3. `ExperimentRecord.status` enum extended to include `killed_by_ceo`. The distinction in code between Forge deciding to kill and the CEO deciding to kill matters for audits. Both statuses stop work. They mean different things about why work stopped.
4. End-to-end test passing. Fake experiment created via script. Telegram notification emitted in test mode. Simulated kill click received by the test instance of the dashboard endpoint. Next cycle invoked. Assert that the subsequent proposal batch does not contain a proposal matching the killed key result pattern.
5. Full test suite green. Not just the new feedback module. The entire suite.
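For readers unfamiliar with the signed links in gate 1, here is a minimal sketch of HMAC-signed verdict URLs using only the Python standard library. The endpoint URL, query parameter names, secret, and expiry field are assumptions for illustration; the post does not describe Forge's actual URL format.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET = b"replace-with-a-real-secret"          # assumption: secret held by the dashboard
BASE_URL = "https://dashboard.example/verdict"  # hypothetical endpoint
ACTIONS = ("keep", "kill", "comment")


def _signature(experiment_id: str, action: str, expires: int) -> str:
    # Sign every field the link carries so none can be altered after sending.
    message = f"{experiment_id}|{action}|{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def signed_verdict_urls(experiment_id: str, ttl_seconds: int = 86400, now=None) -> dict:
    """Build one signed URL per action for a single experiment."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    urls = {}
    for action in ACTIONS:
        query = urlencode({
            "exp_id": experiment_id,
            "action": action,
            "expires": expires,
            "sig": _signature(experiment_id, action, expires),
        })
        urls[action] = f"{BASE_URL}?{query}"
    return urls


def verify_verdict_url(url: str, now=None):
    """Return (experiment_id, action) if the link is authentic and unexpired, else None."""
    params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    try:
        experiment_id, action = params["exp_id"], params["action"]
        expires, sig = int(params["expires"]), params["sig"]
    except (KeyError, ValueError):
        return None
    if action not in ACTIONS:
        return None
    if (now if now is not None else time.time()) > expires:
        return None
    expected = _signature(experiment_id, action, expires)
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected.encode(), sig.encode()):
        return None
    return experiment_id, action
```

Because the action is part of the signed message, a "keep" link cannot be edited into a "kill" link without invalidating the signature.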
All five gates passed. The live flip was the logical next step.
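The kill path in gates 1 through 4 can be sketched roughly as follows. `ExperimentRecord.status`, `killed_by_ceo`, and ProceduralMemory come from the gates above; every other name here (`ExperimentStatus`, `killed_by_forge`, the `pattern` field, `record_ceo_kill`, `filter_batch`) is a hypothetical stand-in, and ProceduralMemory is reduced to an in-memory list for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExperimentStatus(Enum):
    PROPOSED = "proposed"                # hypothetical
    RUNNING = "running"                  # hypothetical
    KILLED_BY_FORGE = "killed_by_forge"  # hypothetical name for the agent's own kill
    KILLED_BY_CEO = "killed_by_ceo"      # status value named in gate 3


@dataclass
class ExperimentRecord:
    experiment_id: str
    key_result: str
    pattern: str  # hypothetical label for the experiment's core claim
    status: ExperimentStatus = ExperimentStatus.PROPOSED


@dataclass
class ProceduralMemory:
    # Stand-in for the real store: a plain list of anti-pattern entries.
    anti_patterns: list = field(default_factory=list)


def record_ceo_kill(record, memory, reason=""):
    """Mark the record as killed by the CEO and remember the pattern."""
    record.status = ExperimentStatus.KILLED_BY_CEO
    key = (record.key_result, record.pattern)
    known = {(e["key_result"], e["pattern"]) for e in memory.anti_patterns}
    if key not in known:  # avoid duplicate entries on repeated clicks
        memory.anti_patterns.append(
            {"key_result": record.key_result, "pattern": record.pattern, "reason": reason}
        )


def filter_batch(candidates, memory):
    """Drop candidate proposals that match a recorded anti-pattern."""
    blocked = {(e["key_result"], e["pattern"]) for e in memory.anti_patterns}
    return [c for c in candidates if (c.key_result, c.pattern) not in blocked]
```

This sketch only skips matching proposals. Gate 2 also allows annotating them instead, which would mean returning the candidate with a note attached rather than dropping it.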
The Embarrassing Part
In the 240 hours of shadow operation, the CEO feedback loop received exactly two verdicts. Both came from manual sessions where someone actively opened a browser tab. The automated path was never exercised by real usage during shadow mode. In that path, Forge generates an experiment and sends a Telegram notification with three signed URL links, the CEO taps one of the links on their phone, the dashboard records the decision, and the next cycle respects the anti-pattern.
Two data points in 240 hours of continuous operation.
The feedback loop was correct. The code passed its tests. The HMAC signing worked. The kill semantics were verified. The habit of treating the Telegram notifications as something to act on, rather than as informational messages to skim, was not yet formed. That is a behavioral gap, not a correctness gap.
We noted the gap. We did not use it to delay the live flip.
Why We Went Live Anyway
Shadow mode without a hard deadline is delay with extra steps. The gate criteria were code and process gates. They were met. An under-exercised feedback mechanism is not a broken feedback mechanism.
The cost of continued shadow operation was the same proposals being regenerated and manually reviewed in the same small set of sessions, with no compounding learning inside the agent. Every day in shadow was another day the ProceduralMemory anti-pattern set stayed empty, and another four cycles in which the proposer worked without any of the CEO judgment that the feedback loop was designed to inject.
The under-exercised path is a week-one measurement, not a reason to stay in shadow.
What We're Watching in Week 1
Four metrics are on daily review this week:
- Experiment proposal rate per cycle. Baseline from shadow: ~4 experiments per cycle. Looking for stability and for the anti-pattern filter to start visibly shaping the batch.
- CEO kill rate and anti-pattern accumulation. How many of the live proposals receive a Kill verdict. How fast the ProceduralMemory anti-pattern set grows. Whether that growth changes proposal content in subsequent cycles.
- Notification-to-click conversion. Whether the Telegram notification fires and whether a click lands within 24 hours of sending, without any manual prompting.
- Duplicate proposal rate. The explicit failure mode the anti-pattern filter was built to prevent. We should see fewer proposals targeting directions that were previously killed.
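Two of these metrics reduce to simple ratios. The sketch below shows one way to compute them, assuming notification logs are available as `(sent_at, clicked_at)` pairs and proposals as `(key_result, pattern)` pairs; those data shapes and both function names are assumptions, not Forge's actual logging format.

```python
from datetime import datetime, timedelta


def click_conversion(notifications, window=timedelta(hours=24)):
    """Fraction of notifications that received a click within the window.

    notifications: list of (sent_at, clicked_at) datetimes; clicked_at is None
    when no click was recorded.
    """
    if not notifications:
        return 0.0
    converted = sum(
        1
        for sent_at, clicked_at in notifications
        if clicked_at is not None and timedelta(0) <= clicked_at - sent_at <= window
    )
    return converted / len(notifications)


def duplicate_rate(proposals, killed_patterns):
    """Fraction of proposals that match a previously killed (key_result, pattern)."""
    if not proposals:
        return 0.0
    return sum(1 for p in proposals if p in killed_patterns) / len(proposals)
```

A click that arrives after the 24-hour window counts as not converted here, which matches the "within 24 hours" framing of the metric.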
The Meta-Lesson
Autonomous agent governance is not a feature that gets added once the agent is producing useful output. It is the substrate the agent is built on.
The CEO feedback loop took eight to ten hours of focused work: design, implementation, tests, and end-to-end verification. If that work had been done on day one of the Forge project instead of near day 300, we would have accumulated months of signed CEO verdicts shaping the agent's beliefs rather than ten days of shadow data. Every strategic cycle since day one would have read those verdicts before proposing the next batch.
The 41 shadow cycles proved the mechanics work. They did not prove the habit of using them. The live phase begins with that distinction clearly on the table.