From spec to script

The spec says what an autonomous ML researcher must do; ml-research.js says how, as deterministic control flow over the Hugging Face skills. This view is the bridge: it aligns the spec's phases with the script's, then shows which layer enforces each rule — the host harness, the script's pure code guards, or the constrained subagents — and calls out exactly what was kept faithfully, adapted, or deliberately omitted.

Tip: boxes in the alignment diagram are clickable — spec boxes open ① Spec explorer, script boxes open ② Script internals at the matching phase.

Phase alignment

spec §5 ↔ script phases

A near 1-to-1 mapping, with a few telling differences: the spec's single Research phase is split into research + an adversarial verify; the spec's "implementation & preflight" fans out into implement, code review, a free CPU smoke, and a tiny billable GPU preflight (the script's answer to the C7 sandbox gap); explicit added gates appear for job-readiness and final conformance; and the §5.9 autonomous improvement loop is omitted.

plan research validate build / preflight run / monitor complete + added ~ adapted × omitted

Spec phase	Script phase(s)	How it is realized
§5.1 Intake & planning	0 Intake & Plan	One `PLAN_SCHEMA` agent classifies the task and freezes the baseline; trivial requests early-exit with a direct answer.
§5.2 Research	1 Research 2 Verify Research	Four parallel `Explore` finders + a synthesis agent (isolation = separate subagent contexts), then an adversarial critic confirms cited papers/datasets/APIs exist. C4 citation graph: reduced fidelity
§5.3 Resource validation	3 Resources	Parallel model + dataset verification; sizes hardware; `needsUserDecision`/`!exists` → STOP (R8).
§5.4 Data audit	4 Data Audit	Audit + a format critic that maps the profile's `schemaRules` to the method; incompatible & un-remappable → STOP.
§5.5 Implementation & preflight	5 Implement 6 Code Review 7 CPU Smoke 8 GPU Preflight	Adapt the shipped reference script → critic → free local CPU smoke → tiny billable GPU job. C7 sandbox gap: smoke = CPU + tiny GPU job
§5.6 Job submission	9 Job Readiness 10 Full Job	The pre-flight checklist becomes an explicit gate agent; then one-job-first submission with a healthy-start confirmation. Pure submit-invariant runs before launch.
§5.7 Monitor & iterate	10 Full Job	Structured trackio alerts drive a bounded corrective retry (divergence → lr×0.1; OOM ladder). grid sweep omitted
§5.8 Evaluate & persist	11 Evaluate & Persist ✓ Final Verification	Evaluate + confirm-it-works + confirm-persisted, then a tool-backed conformance check against the §9 criteria.
§5.9 Autonomous loop	—	omitted The workflow stops at the first verified result; no improve-and-research-again loop.

Three enforcement layers

who guarantees what

The single most useful lens on the implementation: each spec requirement is enforced at one of three layers. The hardest rules live in Layer B — plain JavaScript that the model cannot argue past.

A Host Workflow harness

Properties of the runtime the script executes on — not re-coded in the script.

Deterministic control flow (loops, conditionals, fan-out).
Subagent context isolation → realizes §5.2 research isolation, generalized.
Token budget + compaction → §7.6.
Bounded iteration + repetition / continuation / malformed / truncation guards → §7.1–§7.5.

B Pure code guards

Plain JS in the script, independent of any subagent's self-report.

isScopeChange() / PROTECTED_KEYS → R3 / R4 / R7.
assertSubmitInvariant() → §5.6 by-value + timeout + monitoring + no local path.
canSpend() / COST_CAP_USD → §7.7 autonomous-with-cap.
attemptWithRetry bounded loop, analyze-before-retry → R10.
Phase ordering & explicit stoppedReason returns.

C Constrained subagents

The model does the work, fenced by prompts, schemas, and critics.

HARD_RULES preamble → restates R1–R14 + principles.
13 forced-output schemas → branchable structured results.
FAILURE_TAXONOMY → closed set of allowed minimal fixes.
4 adversarial critics → "verify, never assume" at the risky moments.
TASK_PROFILES → skill + reference-script selection.

Principles → mechanism

§2

Principle	Realized by
1 · Knowledge is stale	C mandatory Research phase + adapt shipped reference scripts + `HARD_RULES`; the research critic confirms APIs are real.
2 · Verify, never assume	C resource verification + data audit + four critics + a tool-backed final verification.
3 · Test small first	B/C free CPU smoke (1 step) then a tiny billable GPU preflight before the full job.
4 · Persist or lose it	B/C concrete `push_to_hub` in the script + submit-invariant + the readiness checklist + eval confirms persistence.
5 · Preserve intent	B frozen `baseline` + `isScopeChange()` pure guard + `scope_change` STOP.

Hard rules → enforcement

§6 · R1–R14

Where each non-negotiable rule actually lives. A harness · B pure guard · C subagent prompt/schema/critic.

Rule	Layer	How it is enforced in the script
R1 persistence as part of run	CB	IMPL sets a concrete `persistenceDest`; readiness checklist requires `persistenceConcrete`; not the template `username/...`.
R2 upload side artifacts	C	`HARD_RULES` instructs explicit upload of logs/scripts to the Hub.
R3 minimal fix	BC	`FAILURE_TAXONOMY` supplies the single allowed fix per category; `isScopeChange()` rejects anything broader.
R4 no scope-changing fix	B	`PROTECTED_KEYS` (method/model/dataset/seqlen) → pure `isScopeChange()` → `scope_change` STOP, even if the agent didn't flag it.
R5 sized timeout	BC	IMPL `timeoutHours`; `assertSubmitInvariant()` refuses a submit without it.
R6 hardware sizing	C	Resources agent recommends a flavor sized to the model footprint; `HARD_RULES` forbids over/undersizing.
R7 OOM ladder	CB	`FAILURE_TAXONOMY.oom` + carried `oomLadderRung` escalate 1→2→3; `isScopeChange()` blocks seqlen/method "fixes".
R8 no silent substitution	BC	`needsUserDecision`/`!exists` and the dataset-format critic's `mustReblock` → STOP and ask.
R9 verify schema by inspection	C	Data Audit (`DATA_AUDIT_SCHEMA`) + format critic before any job.
R10 never retry unchanged	BC	`attemptWithRetry` always inserts a diagnosis + fix between attempts; `FAILURE_TAXONOMY` demands a specific change.
R11 API errors via docs	C	`HARD_RULES` + `recheckedDocs` in the analysis schema for `wrong_trainer_arg`/imports.
R12 prefer prebuilt	C	`HARD_RULES` + the code-review critic flags building heavy deps from source.
R13 secrets from env	C	Prompts pass `HF_TOKEN` as a secret, never inlined or logged.
R14 direct URLs	C	Schemas carry `url` fields; final verification checks every artifact is a direct URL.

Capabilities → skills & agents

§4 · C1–C14 · Appendix A

Cap	Where used	Realized by
C1 hub search	Research, Resources	`hf-cli`, `huggingface-best`
C2 dataset inspect	Data Audit	`huggingface-datasets` + profile `inspector`
C3 papers	Research finders	`huggingface-papers`
C4 citation graph	Research: citations finder	paper links + `WebSearch/WebFetch`; `citationGraphFidelity="reduced"` partial gap, declared
C5 docs	Research, failure analysis	doc-search tools (R11 re-checks)
C6 example code	Research, Implement	adapt the skills' shipped reference scripts
C7 sandbox (CPU/GPU)	CPU Smoke, GPU Preflight	local CPU run + a tiny billable GPU job gap → split realization
C8 managed jobs	GPU Preflight, Full Job	`hf-cli` (`hf jobs run/logs`)
C9 training methods	Implement, jobs	profile trainer skill (LLM / vision / sentence-transformers)
C10 tracking + alerts	Implement, Full Job	`huggingface-trackio` (`list alerts --json`) — the §5.7 decision channel
C11 durable storage	Implement, Evaluate	`push_to_hub` + `hf-cli`
C12 evaluation	Evaluate & Persist	`huggingface-community-evals` / trainer eval
C13 general web	Research fallback	host harness `WebSearch/WebFetch`
C14 notifications	—	not used out of scope for this script

Control contract → harness vs. script

§7

The script runs on a Workflow harness that already provides most of §7. The script adds the cost-cap approval policy and per-stage bounded retries.

Contract	Layer	Realization
§7.1 bounded iteration	AB	Harness caps the run; the script adds per-stage `maxAttempts` (smoke 3, jobs `MAX_JOB_RETRIES`).
§7.2 repetition guard	AB	Harness doom-loop detection; `attemptWithRetry` forbids identical retries by construction.
§7.3 continuation guard	A	Harness; the script's control flow is deterministic — it always proceeds or STOPs explicitly.
§7.4 malformed-action	A	Harness; structured-output schemas also force well-formed returns.
§7.5 output truncation	AB	Harness; large content is written to `WORK_DIR` files, not inlined into messages.
§7.6 context compaction	A	Harness token budget; subagent isolation keeps each context small.
§7.7 approval gate	B	`canSpend()` auto-approves paid jobs up to `COST_CAP_USD`; scheduled/recurring jobs out of scope (human-gated).
§7.8 effort probing	A	Harness; non-normative.

Faithful · adapted · omitted

the scoping at a glance

✓ Kept faithfully

Research-first, mandatory, example-grounded.
Verify by inspection; no silent substitution (R8 STOP).
Smoke before scale; one-job-first.
Pre-flight checklist as a hard gate.
Persistence configured up front.
Alert-driven corrective iteration (§5.7).
No-scope-change as a pure code guard.
Evaluate + tool-backed §9 conformance.

~ Adapted (rule kept, mechanism changed)

C7 GPU sandbox gap → free CPU smoke + a tiny billable GPU preflight.
C4 citation graph gap → paper links + web, with fidelity="reduced" declared.
§7.7 autonomous approval → auto-submit paid jobs up to a dollar cost cap.
Research split into research + an explicit adversarial verify.

× Deliberately omitted

§5.9 autonomous improvement loop — stop at the first verified result.
Grid sweep hyperparameter exploration.
Scheduled / recurring jobs — always human-gated, out of scope.
C14 notifications — not used.

Net effect: the script is a single-shot, autonomous, cost-capped realization that runs the spec's full discipline up to the first verified-and-persisted result, halting and surfacing to the user at any rule boundary it cannot honor with a minimal fix.