Forge Overseer v1 — Implementation Spec

Five incidents. Same root cause. Six weeks.

The Problem

Forge governance was bypassed in identical patterns over the past ~6 weeks — each time the structure of a check was verified while the authenticity of the underlying claim was not. You caught each one. The substrate caught none. The overseer is the peripheral nervous system that's been missing.

Hypernym runtime drift — sprints claimed "wired" for 4 weeks; cache empty, zero compression events in 12k-line log
PATCH_BROKER skipped AUDIT — panel APPROVE accepted as production-ready; 3 CRITICAL bypasses found on backfill
Cross-session contamination — ambient state from another terminal flipped sprint headers mid-session
Single-round convergence — Codex found 7 distinct bypass classes across 7 rounds; one round missed 6 of 7
Synthetic attestations (2026-05-22) — forge review --summary accepts any text; 4 SHAs shipped with fabricated panel provenance

Heartbeat-triggered scan-drift tails events.jsonl filesystem-direct via fs.watch
WitnessRecord schema carries attestation_class HOLLOW/RATIFIED_EMPTY/EVIDENCED; non-EVIDENCED zero-weight
Per-session operator_id on every write; cross-session reads flagged automatically
Multi-round convergence default; single-round on security_hardening fires drift class 4
Every reviewer name in attestation cross-references a real dispatch.completed event signed by overwatchd
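
The zero-weight rule and the cross-session flag can be sketched as two small pure functions. This is a minimal illustration, not the shipped schema: only `attestation_class`, `operator_id` and `dispatch_evidence_uuid` come from this spec, while the function names and the weight of 1 for EVIDENCED records are assumptions.

```typescript
// Hypothetical record shape: only the three field names below appear in the spec.
type AttestationClass = "HOLLOW" | "RATIFIED_EMPTY" | "EVIDENCED";

interface WitnessRecord {
  operator_id: string;
  attestation_class: AttestationClass;
  dispatch_evidence_uuid?: string;
}

// Non-EVIDENCED records carry zero weight. A weight of 1 for EVIDENCED
// records is an assumption made for this sketch.
function attestationWeight(record: WitnessRecord): number {
  return record.attestation_class === "EVIDENCED" ? 1 : 0;
}

// A read counts as cross-session when the record was written by a
// different operator_id than the session reading it.
function isCrossSessionRead(record: WitnessRecord, readerOperatorId: string): boolean {
  return record.operator_id !== readerOperatorId;
}
```

Keeping the weight a function of `attestation_class` alone means a HOLLOW or RATIFIED_EMPTY record can never be promoted by free-text fields, which is the failure mode of the synthetic-attestation incident.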

Layered. Wrapping forge. Not replacing it.

Architecture

The trust boundary (overwatchd + immutable reward kernel) sits in a separate macOS user context. Seven skills run on OpenClaw's existing infrastructure. The data plane is filesystem-direct against forge's append-only audit log — never through OpenClaw's event bus (that would be circular monitoring).
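
Tailing an append-only log directly from the filesystem mostly comes down to handling partial lines correctly. The sketch below shows one way to do that, assuming an `fs.watch` callback reads the bytes appended since the last offset and passes them in as `chunk`. The names `consumeChunk` and `TailResult` are hypothetical, and the file I/O itself is left out.

```typescript
// Result of consuming one chunk of newly appended log text.
interface TailResult {
  events: unknown[]; // successfully parsed JSONL entries
  carry: string;     // trailing partial line to prepend to the next chunk
  malformed: number; // complete lines that failed to parse
}

// Splits appended text into complete lines, parses each as JSON, and keeps
// any unterminated tail for the next call.
function consumeChunk(carry: string, chunk: string): TailResult {
  const lines = (carry + chunk).split("\n");
  // The last piece is incomplete unless the chunk ended with a newline,
  // in which case it is the empty string.
  const rest = lines.pop() ?? "";
  const events: unknown[] = [];
  let malformed = 0;
  for (const line of lines) {
    if (line.trim() === "") continue;
    try {
      events.push(JSON.parse(line));
    } catch {
      malformed += 1; // counted so the caller can raise it, never silently dropped
    }
  }
  return { events, carry: rest, malformed };
}
```

Counting malformed lines rather than skipping them matters here, because a corrupted audit line is itself a signal the overseer should surface.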

Seven fat skills do the entire job.

The Bundle

Hard mechanical invariants (T2 gate, budget guard, constraint manifold) are demoted to framework/hooks/ — they never live in the skill rewrite path. Skills propose; hooks enforce; the kernel scores.

#	Skill	Trigger	Function	Detects	Write surface
1	scan-drift	Heartbeat 60s	Reads `events.jsonl` + WitnessBus filesystem-direct via `fs.watch`. Emits WitnessRecord with `attestation_class` + `dispatch_evidence_uuid` + `per_reviewer_response_hash`. Replaces OpenClaw event-bus subscription to avoid circular monitoring.	1, 2, 5, 10, 11, 12, 13	append-only `witness-bus.jsonl`
2	detect-runtime-drift	new WitnessRecord	Classifies runtime fabrication, provider quota fail-open, env-skip bypass, plan-vs-execution drift. Fires alerts to inbox.	1, 10, 11, 12	inbox + drift-catalog
3	detect-panel-drift	`convergence_attested`	Classifies skipped AUDIT, single-round on security_hardening, concurrence-mistaken-for-convergence, always-convergence violation, synthetic attestation.	2, 4, 5, 7, 13	inbox + drift-catalog
2.5	local-verifier	any drift event	NEW · added Round 5. 30s cheap second-opinion via `ollama run deepseek-r1:32b` or `qwen3-coder:30b`. If the verifier agrees → alert. If it disagrees → low-confidence flag with both opinions. Catches false positives before they reach you. Costs nothing.	meta (confidence layer)	flag in drift-catalog
4	decompose-drift	complex drift	Hermes Kanban Orchestrator pattern verbatim: 5-step decomposition + Step-0 profile discovery + don't-do-the-work-yourself rule. Copied directly from `NousResearch/hermes-agent`.	all (orchestration)	`framework/plans/` staging
5	inject-hint	drift + `operator_id`	Cursor teacher-student pattern: writes one-paragraph `<system-reminder>` hint to `.forge/overseer/active-hints/<operator_id>.md` mode 0444. Operator's next tool call surfaces it via PreToolUse hook.	meta (loop-closer)	hint file
6	author-skill	weekly batch	OpenSpace FIX/DERIVED/CAPTURED taxonomy. Proposes new detection skills from recurring drift patterns. Output lands in `.staging/` only — never bypasses human promote.	meta (skill bootstrap)	`framework/skills/.staging/`
7	replay-consolidator	quiet heartbeat	Hippocampal sharp-wave-ripple pattern (150–220 Hz biological analog). Off-policy re-scoring of historical WitnessBus against current detection rules. Promotes stable patterns to policy table.	meta (rule effectiveness)	`policy-table.jsonl`
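
The agree/disagree rule for skill #2.5 is simple enough to show as a pure function. This is a sketch under assumptions: the type names and the `reconcile` function are hypothetical, and the call to the local model (which the table describes as an `ollama run` invocation) is assumed to have already produced `verifierVerdict`.

```typescript
type Verdict = "drift" | "no_drift";

// Hypothetical minimal drift event: the class number and the skill that raised it.
interface DriftEvent {
  drift_class: number;
  detector: string;
}

type Outcome =
  | { kind: "alert"; event: DriftEvent }
  | {
      kind: "low_confidence";
      event: DriftEvent;
      opinions: { detector: Verdict; verifier: Verdict };
    };

// The detector has already said "drift". If the verifier agrees, alert.
// If it disagrees, keep the event but flag it and record both opinions.
function reconcile(event: DriftEvent, verifierVerdict: Verdict): Outcome {
  if (verifierVerdict === "drift") {
    return { kind: "alert", event };
  }
  return {
    kind: "low_confidence",
    event,
    opinions: { detector: "drift", verifier: verifierVerdict },
  };
}
```

Note that disagreement never discards the event. The verifier can lower confidence but cannot suppress a detection on its own.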

Demoted to hooks (NOT skills)

tier2-gate.sh (Promise.race + 10s timeout + fail-open + audit event), paperclip-budget-guard.sh (30s TTL cache against port 3500), constraint-manifold.sh (no self-review, Grok mandatory, circuit-breaker veto). These are hard mechanical invariants — never in the auto-rewrite skill path.
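
The `Promise.race` + timeout + fail-open + audit-event combination described for the T2 gate might look like the sketch below. The function name, the `GateResult` and `AuditEvent` shapes, and the event type string are all hypothetical. Only the 10s default and the fail-open-with-audit behavior come from the text above.

```typescript
interface GateResult {
  allow: boolean;
  reason: "checked" | "timeout_fail_open";
}

// Hypothetical audit event shape.
interface AuditEvent {
  type: string;
  at: number;
}

// Races the real check against a timer. On timeout the gate fails open
// (allows the action) but always emits an audit event so the bypass is visible.
// A check that rejects is not treated as a timeout; the error propagates.
async function tier2Gate(
  check: () => Promise<boolean>,
  audit: (event: AuditEvent) => void,
  timeoutMs = 10_000,
): Promise<GateResult> {
  let cancel = (): void => {};
  const timeout = new Promise<"timeout">((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    cancel = () => clearTimeout(timer);
  });
  try {
    const winner = await Promise.race([check(), timeout]);
    if (winner === "timeout") {
      audit({ type: "tier2_gate.timeout_fail_open", at: Date.now() });
      return { allow: true, reason: "timeout_fail_open" };
    }
    return { allow: winner, reason: "checked" };
  } finally {
    cancel(); // avoid leaving a live timer behind when the check wins
  }
}
```

Because the gate fails open, the audit event is the only trace of a timed-out check, so emitting it before returning is the part that must not be skipped.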

Thirteen classes. Each mapped to a skill.

Drift Catalog

Five from the base proposal; eight added from a memory-inventory scan of 49 feedback files spanning 6 weeks. Color-coded by category: runtime/data-plane (navy), panel/governance (rust), provider/transport (slate), meta-protocol (gold).

Class	Name	Detected by	Signature
01	Runtime fabrication	detect-runtime-drift	Hypernym empty-cache class
02	Skipped AUDIT	detect-panel-drift	PATCH_BROKER-15 class
03	Cross-session contamination	scan-drift	operator_id mismatch
04	Single-round convergence	detect-panel-drift	security_hardening anomaly
05	Synthetic attestation	detect-panel-drift	2026-05-22 incident class
06	Memory drift / staleness	scan-drift (nightly)	Claim ≠ current code
07	Concurrence ≠ convergence	detect-panel-drift	Reviewer = operator voice
08	Canonical source unverified	scan-drift	Hallucinated file:line
09	Review iteration cap hit	detect-panel-drift	No tiebreaker escalation
10	Provider quota fail-open	detect-runtime-drift	Gemini billing-cap class
11	Pre-commit env-skip	detect-runtime-drift	FORGE_SKIP_* bypass
12	Plan-vs-execution drift	detect-runtime-drift	HC chunks 0.3/0.4a class
13	Always-convergence violation	detect-panel-drift	Split panel without escalation
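
The class-to-skill mapping in this catalog can be held as a single lookup table so that an unmapped class fails loudly instead of being silently ignored. A minimal sketch follows. The names `DRIFT_ROUTES` and `routeDrift` are hypothetical, and the entries simply transcribe the catalog.

```typescript
type Skill = "scan-drift" | "detect-runtime-drift" | "detect-panel-drift";

// Transcribed from the drift catalog: class number -> detecting skill.
const DRIFT_ROUTES: Readonly<Record<number, Skill>> = {
  1: "detect-runtime-drift",
  2: "detect-panel-drift",
  3: "scan-drift",
  4: "detect-panel-drift",
  5: "detect-panel-drift",
  6: "scan-drift",
  7: "detect-panel-drift",
  8: "scan-drift",
  9: "detect-panel-drift",
  10: "detect-runtime-drift",
  11: "detect-runtime-drift",
  12: "detect-runtime-drift",
  13: "detect-panel-drift",
};

// Returns the skill responsible for a drift class, or throws for unknown classes.
function routeDrift(driftClass: number): Skill {
  const skill = DRIFT_ROUTES[driftClass];
  if (skill === undefined) {
    throw new RangeError(`unknown drift class ${driftClass}`);
  }
  return skill;
}
```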

What stops the watcher from needing a watcher.

Meta-Poisoning · Closed

Round 4 panel (Grok + Opus independent + Gemini) surfaced three distinct attack vectors against the overseer itself. Each is closed by a specific mechanism, not by convention. The kernel is binary — not a markdown file. The auto-author writes only to staging. The RL controller has no path to FSM YAML.

Vector A · Grok + Opus BLOCKER

Kernel binary mutation

Operator (Claude) shell-rewrites the reward-kernel directory via chflags noschg or modifies launchd plist install path.

Kernel at /Library/Forge/Kernel/reward.so owned by separate macOS user _overwatchd, mode 0400. devagent has no read or write. chflags schg on top. Hash-pinned in kernel-pubkey.pem; verified at every read.

Vector B · Opus BLOCKER

RL → FSM write loop

RL controller's action space includes "write to sprint-v2.yaml" — controller disables the very gates that feed it ground-truth WitnessRecords.

Kill v5 entirely. Replace with v5': RL controller's only action is authoring candidate skill diffs into framework/skills/.staging/. Human promote is the only path to production. RL has no write to FSM YAMLs, hooks, dispatch, or its own code.

Vector C · Gemini

Reward interpretation gaming

Kernel is immutable, but new auto-authored skills can "game" the reward function: appear optimal to the kernel while enacting drift. Poisoning moves one meta-level up.

author-skill outputs land in .staging/ only. Human-promote gate requires cross-model panel attestation (Grok mandatory + 1 other, both via overwatchd-signed dispatch-evidence). First 30 days of any newly-promoted skill are shadow mode: alerts only, no reward impact.
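
The promote gate and the shadow window reduce to two checks, sketched below. The `PanelAttestation` shape and both function names are hypothetical. `dispatchReceiptSigned` stands in for "an overwatchd signature on the dispatch evidence has been verified", which this sketch assumes happens elsewhere.

```typescript
// Hypothetical attestation shape for one reviewer on the promote panel.
interface PanelAttestation {
  reviewer: string;               // e.g. "grok", "gemini"
  dispatchReceiptSigned: boolean; // true only after overwatchd signature verification
}

const SHADOW_DAYS = 30;
const DAY_MS = 24 * 60 * 60 * 1000;

// Promotion needs Grok plus at least one other distinct reviewer,
// and only reviewers with signed dispatch evidence count.
function canPromote(attestations: PanelAttestation[]): boolean {
  const signed = new Set(
    attestations
      .filter((a) => a.dispatchReceiptSigned)
      .map((a) => a.reviewer.toLowerCase()),
  );
  return signed.has("grok") && signed.size >= 2;
}

// During the first 30 days after promotion a skill alerts only
// and has no reward impact.
function isShadowMode(promotedAt: Date, now: Date): boolean {
  return now.getTime() - promotedAt.getTime() < SHADOW_DAYS * DAY_MS;
}
```

Deduplicating by reviewer name means two Grok attestations do not satisfy the "+ 1 other" requirement.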

We use what we already run. We don't cite it.

Integration Wiring · Round 5

The Round 5 verification panel (Grok PROCEED + Gemini PROCEED + Codex REVISE-FIRST) caught that Section 10 above was pattern-citing: it named Swarma, MiroShark, @coinberg/autoresearch, Hermes, and Cursor as inspiration without actually wiring them. Codex's REVISE-FIRST demanded runtime verification before any revision. All claims are now health-check verified below.

$ curl -sS http://127.0.0.1:5002/health
{"service":"MiroShark Backend","status":"ok"}    ← LIVE (LaunchAgent com.forge.miroshark, PID 952)
$ curl -sS http://127.0.0.1:3300/health
curl: (7) Failed to connect to 127.0.0.1 port 3300    ← DOWN must be revived in Sprint 0.5
$ curl -sS http://127.0.0.1:11434/api/tags
qwen3-coder:30b 18.6 GB    ← available (default-resident)
qwen3:32b-q8_0 35.1 GB    ← available (one-at-a-time)
deepseek-r1:32b 19.9 GB    ← available (adversarial reviewer)
$ curl -sS http://127.0.0.1:18789/health
{"ok":true,"status":"live"}    ← LIVE (OpenClaw substrate)
$ grep -nE "executePanelV2Review|tryGetCached" research/harness/src/swarma.ts
281: const cached = tryGetCached(ctx, 0.3);    ← Swarma IS panel executor
477: ...executePanelV2Review's dispatchRound stub...
517: export function parsePanelReview(raw: string, reviewer: string)

System · port	Status	Evidence	Concrete role in v1
OpenClaw :18789	LIVE	`{"ok":true,"status":"live"}`	Substrate — hosts SOUL.md + 7 skills + Heartbeat + multi-channel inbox
MiroShark :5002	LIVE	PID 952 · `com.forge.miroshark`	Adversarial scenario corpus — Sprint 2 acceptance fixture (overseer must catch `rmt-compound-attack`, `rmt-eclipse`, `identity-sybil-farm`)
@coinberg/autoresearch :3300	DOWN	`curl: (7) Failed to connect`	PREPARE → EXECUTE → EVALUATE → ACCUMULATE → PROMOTE FSM for candidate skills. Sprint 0.5 revives it.
Ollama local models :11434	ALL AVAILABLE	qwen3-coder · qwen3:32b-q8 · deepseek-r1	Skill #2.5: 30s cheap first-pass verifier on every drift event + candidate skill before API spend
Swarma `research/harness/src/swarma.ts`	CONFIRMED	`executePanelV2Review`, `tryGetCached` in source	Sandbox executor inside autoresearch EXECUTE stage — runs candidate skill against MiroShark corpus
Stable-Baselines3 · forge skill	OUT OF SCOPE	Available, not used	Removed — classical RL shape-mismatched for governance drift; autoresearch hill-climbing is correct shape

The end-to-end loop, concretely

MiroShark generates adversarial scenarios → scan-drift detects via fs.watch → detect-runtime-drift or detect-panel-drift classifies → local verifier (deepseek-r1:32b, ~30s) second-opinion → author-skill proposes new SKILL.md into .staging/ → @coinberg/autoresearch PREPARE packages it + scenario corpus → EXECUTE spawns Swarma sandbox sprint → EVALUATE scalar pass/fail (TP>0.85, FP<0.05, p99<5s) → local verifier final-pass on behavior → overwatchd-signed human-promote (Grok+1 with dispatch-receipts) → ACCUMULATE + PROMOTE → 30-day shadow mode → replay-consolidator re-scores history.
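
The EVALUATE step's scalar pass/fail can be sketched as follows. Several things here are assumptions: TP and FP are read as true-positive rate and false-positive rate, p99 is computed with the nearest-rank method, and the `EvalRun` shape and function names are hypothetical. Only the three thresholds come from the loop description above.

```typescript
// Hypothetical summary of one sandbox run against the scenario corpus.
interface EvalRun {
  truePositives: number;
  falseNegatives: number;
  falsePositives: number;
  trueNegatives: number;
  latenciesMs: number[];
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no latency samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Pass only if all three thresholds hold: TP > 0.85, FP < 0.05, p99 < 5s.
// A run with no positives or no negatives yields NaN rates and therefore fails.
function evaluate(run: EvalRun): boolean {
  const tpRate = run.truePositives / (run.truePositives + run.falseNegatives);
  const fpRate = run.falsePositives / (run.falsePositives + run.trueNegatives);
  const p99Ms = percentile(run.latenciesMs, 99);
  return tpRate > 0.85 && fpRate < 0.05 && p99Ms < 5000;
}
```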

Four sprints. ~350 LOC. Overwatchd first.

Build Sequence

LOC budget collapsed from the original 970-LOC standalone-daemon estimate to ~350 LOC by adopting the OpenClaw substrate and wiring existing services. Each sprint has an explicit gate condition. Sprint 0 (overwatchd) is the non-negotiable precondition; Sprint 0.5 revives @coinberg/autoresearch on :3300.

Build sequence flows left to right. Each edge label is the gate condition.

Three frozen interfaces. v1 local. HC config swap.

HC Portability

The overseer ships single-operator on macOS in v1, but data model and interfaces are built so cloud + multi-tenant + subforges is a configuration change, not a rewrite. Multi-user features (per-operator Ed25519 keypairs, role-based escalation, Kanban orchestration) defer to a real HC sprint.

Interface	Signature	v1 implementation	HC cloud implementation
`WitnessRecordWriter`	`store(record)` `tail(filter)`	local file `~/.forge/overseer/witness-bus.jsonl`; `operator_id` + `namespace` from day 1	Postgres (records) + Kafka (tail stream); per-tenant namespace partitioning
`OverwatchVerifier`	`verify(receipt)` returns `valid + signer_pubkey`	local Ed25519 with key file `0400 _overwatchd`; pubkey at `.forge/overwatch/pubkey.pem`	KMS-backed signing (AWS KMS / GCP KMS / hardware HSM); pubkey distributed via service mesh
`HintSink`	`write(operator_id, hint)`	file write to `.forge/overseer/active-hints/<operator_id>.md` mode 0444	per-tenant Server-Sent Events channel keyed by `namespace`; CloudEvents protocol
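
Written out as TypeScript, the three frozen interfaces might look like the sketch below, with an in-memory `WitnessRecordWriter` standing in for the local JSONL file. The method names follow the table. The parameter and return types, the synchronous signatures, and the `InMemoryWitnessBus` class are assumptions made for illustration.

```typescript
// Minimal record shape: operator_id and namespace are required from day 1.
interface WitnessRecord {
  operator_id: string;
  namespace: string;
  [key: string]: unknown;
}

type RecordFilter = (record: WitnessRecord) => boolean;

interface WitnessRecordWriter {
  store(record: WitnessRecord): void;
  tail(filter: RecordFilter): WitnessRecord[];
}

interface OverwatchVerifier {
  verify(receipt: string): { valid: boolean; signer_pubkey: string | null };
}

interface HintSink {
  write(operator_id: string, hint: string): void;
}

// Stand-in for the v1 file-backed writer. A Postgres + Kafka implementation
// would satisfy the same interface without callers changing.
class InMemoryWitnessBus implements WitnessRecordWriter {
  private readonly records: WitnessRecord[] = [];

  store(record: WitnessRecord): void {
    this.records.push({ ...record }); // copy so later caller mutation cannot alter history
  }

  tail(filter: RecordFilter): WitnessRecord[] {
    return this.records.filter(filter);
  }
}
```

Callers that depend only on the interface types are what make the cloud move a configuration change: swapping the implementation does not touch skill code.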

Eight open patterns. One coherent system.

SOTA Composition

The overseer composes the strongest open-source patterns of 2026 into one substrate. Each pattern composes with the immutable kernel because the kernel is the only thing the auto-rewrite surface can't reach — every other input is mediated.

Source	What we pull	Composes with kernel because
OpenClaw (160K stars · :18789)	SOUL.md identity · Heartbeat (cron) · multi-channel inbox · ClawHub registry	Substrate is outside kernel surface; kernel reads WitnessBus only
NousResearch Hermes Agent (32K stars · v0.13)	Kanban orchestrator "don't-do-work-yourself" + 5-step decomposition (verbatim in skill #4)	Skills propose; kernel scores; no kernel mutation
Cursor Composer 2.5 (real-time RL · ~5h checkpoints)	Inline teacher-student hints via PreToolUse hook (skill #5) · production-trace reward pattern	Hints are inputs to operator, not inputs to kernel
EvoAgentX survey (arXiv 2508.07407)	Scope locked to Environment-Centric axis v1–v5 · explicit defer of Model-Centric and Co-Evolution	Kernel + reward formula are not "environment"; they're constants
Anthropic sleep · hippocampal SWR (biological replay pattern)	Skill #7 replay-consolidator: 20× compressed off-policy update during quiet heartbeat	Off-policy re-scoring against frozen kernel function
Thin Harness / Fat Skills (@garrytan framing)	~130 LOC harness + 7 fat skills · branching params over many small skills	Harness has no reward logic; cannot poison itself
OpenSpace (HKUDS; self-evolving skill engine)	FIX / DERIVED / CAPTURED skill-evolution taxonomy in skill #6	Applies to skill bundle only, never kernel
Karpathy autoresearch (42K stars · single-GPU loop)	Periodic-loop + keep/discard on observable metric pattern. IS our RL.	Maps onto Heartbeat-triggered scan; staging-gate replaces keep/discard

Real dispatches. Audit trail. User tiebreaker.

Panel Attestation

Per the 2026-05-22 always-convergence + synthetic-review P0 rules: every reviewer below was a real dispatch with audit-trail evidence. No forge review records were written — that surface is avoided until overwatchd lands. Audit trail lives in /tmp/FINAL-*.out + /tmp/verify-*.out files and agent IDs.

Reviewer	Mechanism · evidence	Verdict	Critical finding
Grok 4.3	bash via `scripts/grok-dispatch.sh`	BLOCKER → satisfied	Kernel-immutability isolation mechanism unspecified; satisfied by separate-user launchd + hash-pinning
Opus 4.7 (independent)	Agent `abc5fea7b2cf5e2aa` · model=opus	BLOCKER → satisfied	(a) Single-user macOS immutability theater; (b) Kill v5 auto-deploy → v5' staging-only
deepseek-r1:32b	Ollama local	APPROVE (weak)	Internally contradictory: proposed a `RewardKernelUpdaterSkill` violating immutability — discounted
qwen3-coder:30b	Ollama local	APPROVE_FINAL	~250 LOC estimate; notes skill-leak risk but treats as managed
Gemini 2.5 Pro	gemini CLI · Gin4 key	APPROVE_FINAL	Reward Interpretation Poisoning — auto-authored skills can game kernel's interpretation
Codex GPT-5.5	Agent `ac36853bd79d09c6a` (R5 verification)	REVISE-FIRST → satisfied	Runtime claims needed health-check verification before commit — satisfied in §06 above
Convergence: split_escalated · tiebreaker_by: user · resolution: APPROVE_FINAL with all BLOCKER fixes baked in

Constraints that ride alongside the build.

Implementation Notes

Synthetic-attestation P0 still active

Until overwatchd ships, forge review --summary remains a forgeable surface. No forge review records during Sprint 0 — the audit trail for the overseer panel lives in /tmp/FINAL-*.out + /tmp/verify-*.out + agent IDs + this proposal document.

Overwatchd is a precondition, not v1 work

Sprint 0 (Tier 1 overwatchd) is a separate sprint with its own SEED → PLAN → DESIGN → BUILD ceremony and its own real cross-model panel (5-of-5 for security_hardening sprint type). The overseer v1 sprints (1–4) cannot fire until overwatchd is signing dispatch receipts and the four frozen SHAs (a59a3a8, 1110d45, 62b5def, 612898a) have been retroactively ratified or surgically reverted.

Sprint 0.5 added (Round 5) — revive @coinberg/autoresearch :3300

Health-check confirmed the service is currently DOWN. ~30 LOC: diagnose existing crash logs, add com.forge.autoresearch.plist LaunchAgent, wire into scripts/launch/forge-services-up.sh canonical labels, validate the 5-stage FSM round-trips a no-op skill end-to-end before Sprint 1 begins.

Tunable defaults · config-driven

From R4 panel: 30s budget-cache TTL, 10s T2 gate timeout, 30-day shadow window for newly promoted skills. Config-driven from day 1 via ~/.forge/overseer/config.yaml so operations tuning doesn't require code changes. HC v2 may need different values per tenant.
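
One way to keep these defaults config-driven is to merge operator overrides over a frozen default object and validate the result. In the sketch below the key names and the `resolveConfig` function are hypothetical, and reading and parsing `~/.forge/overseer/config.yaml` is assumed to happen before this step. Only the three default values come from the text above.

```typescript
// Hypothetical key names for the three tunables named in this section.
interface OverseerConfig {
  budgetCacheTtlSeconds: number;
  tier2GateTimeoutSeconds: number;
  shadowWindowDays: number;
}

const DEFAULTS: Readonly<OverseerConfig> = {
  budgetCacheTtlSeconds: 30,
  tier2GateTimeoutSeconds: 10,
  shadowWindowDays: 30,
};

// Merges overrides (for example, parsed from the YAML file) over the defaults
// and rejects values that are missing, non-numeric, or not positive.
function resolveConfig(overrides: Partial<OverseerConfig>): OverseerConfig {
  const merged = { ...DEFAULTS, ...overrides };
  for (const [key, value] of Object.entries(merged)) {
    if (!Number.isFinite(value) || value <= 0) {
      throw new RangeError(`${key} must be a positive number`);
    }
  }
  return merged;
}
```

Rejecting zero and negative values guards against a config edit that would disable the shadow window or the gate timeout.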

What's intentionally NOT in v1

Classical RL libraries (Stable-Baselines3 PPO/DQN/SAC, HuggingFace TRL, OpenTinker) — shape-mismatched for governance drift; autoresearch hill-climbing is correct shape. Model-Centric Self-Evolution (no custom training infra). Per-operator Ed25519 keypairs (single-operator v1 uses session-UUID + process-UID). Container deployment, per-tenant policy table fork, Kanban orchestration as standalone skill — all deferred to HC v2+ sprints.

"What stops the watcher from needing a watcher is the immutable kernel. What stops the kernel from rotting is the human-promote gate. What stops the gate from being bypassed is overwatchd. What stops overwatchd from drifting is the user."

— Trust boundary, restated