ML Research Workflow Spec → Script Map

From spec to script

The spec says what an autonomous ML researcher must do; ml-research.js says how, as deterministic control flow over the Hugging Face skills. This view is the bridge: it aligns the spec's phases with the script's, then shows which layer enforces each rule — the host harness, the script's pure code guards, or the constrained subagents — and calls out exactly what was kept faithfully, adapted, or deliberately omitted.


Phase alignment

spec §5 ↔ script phases

A near 1-to-1 mapping, with a few telling differences: the spec's single Research phase is split into research + an adversarial verify; the spec's "implementation & preflight" fans out into implement, code review, a free CPU smoke, and a tiny billable GPU preflight (the script's answer to the C7 sandbox gap); explicit added gates appear for job-readiness and final conformance; and the §5.9 autonomous improvement loop is omitted.

[Alignment diagram: spec §5 phases on the left, ml-research.js phases on the right. Phase groups: plan, research, validate, build / preflight, run / monitor, complete. Legend: + added, ~ adapted, × omitted.]

Spec · §5 phases

  • §5.1 Phase 0: Intake & planning (classify · plan)
  • §5.2 Phase 1: Research (literature-first · subagent isolation)
  • §5.3 Phase 2: Resource validation (confirm + inspect model/dataset)
  • §5.4 Phase 3: Data audit (format ↔ method compatibility)
  • §5.5 Phase 4: Implementation & preflight (sandbox-first · GPU smoke test)
  • §5.6 Phase 5: Job submission (pre-flight checklist · one-job-first)
  • §5.7 Phase 6: Monitor & iterate (alert-driven decisions · sweeps)
  • §5.8 Phase 7: Evaluate & persist (exists · persisted · evaluated · linked)
  • §5.9 Autonomous improvement loop (research→improve→research again); × omitted: stop at first verified result

ml-research.js · phases

  • 0 Intake & Plan: freeze baseline · PLAN_SCHEMA
  • 1 Research: parallel finders → synthesis
  • 2 Verify Research: adversarial critic (+ added)
  • 3 Resources: verify + size · R8 STOP
  • 4 Data Audit: + format critic
  • 5 Implement: adapt reference script (by value)
  • 6 Code Review: critic before any run (+ added)
  • 7 CPU Smoke: free / local · 1 step (~ C7 gap)
  • 8 GPU Preflight: tiny billable job · real GPU paths (~ C7 gap)
  • 9 Job Readiness: checklist gate (one-time) (+ added gate)
  • 10 Full Job: one-job-first · alert-driven retry
  • 11 Evaluate & Persist: confirm works · persisted
  • Final Verification: tool-backed §9 conformance (+ added)

| Spec phase | Script phase(s) | How it is realized |
| --- | --- | --- |
| §5.1 Intake & planning | 0 Intake & Plan | One PLAN_SCHEMA agent classifies the task and freezes the baseline; trivial requests early-exit with a direct answer. |
| §5.2 Research | 1 Research; 2 Verify Research | Four parallel Explore finders + a synthesis agent (isolation = separate subagent contexts), then an adversarial critic confirms cited papers/datasets/APIs exist. (C4 citation graph: reduced fidelity) |
| §5.3 Resource validation | 3 Resources | Parallel model + dataset verification; sizes hardware; needsUserDecision/!exists → STOP (R8). |
| §5.4 Data audit | 4 Data Audit | Audit + a format critic that maps the profile's schemaRules to the method; incompatible & un-remappable → STOP. |
| §5.5 Implementation & preflight | 5 Implement; 6 Code Review; 7 CPU Smoke; 8 GPU Preflight | Adapt the shipped reference script → critic → free local CPU smoke → tiny billable GPU job. (C7 sandbox gap: smoke = CPU + tiny GPU job) |
| §5.6 Job submission | 9 Job Readiness; 10 Full Job | The pre-flight checklist becomes an explicit gate agent; then one-job-first submission with a healthy-start confirmation. Pure submit-invariant runs before launch. |
| §5.7 Monitor & iterate | 10 Full Job | Structured trackio alerts drive a bounded corrective retry (divergence → lr×0.1; OOM ladder). (grid sweep omitted) |
| §5.8 Evaluate & persist | 11 Evaluate & Persist; ✓ Final Verification | Evaluate + confirm-it-works + confirm-persisted, then a tool-backed conformance check against the §9 criteria. |
| §5.9 Autonomous loop | omitted | The workflow stops at the first verified result; no improve-and-research-again loop. |
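
The ordering in this table can be pictured as a plain sequential driver that ends the run with an explicit stoppedReason as soon as a phase refuses to continue. The following is a minimal sketch under stated assumptions, not the script's actual code: runPhases, the { ok, reason } result shape, and the reason string "resource_missing" are all hypothetical, and the real phases presumably call subagents rather than inline functions.

```javascript
// Hypothetical sketch: phases run in a fixed order; the first phase that
// reports a problem ends the run with an explicit stoppedReason.
function runPhases(phases, state) {
  const completed = [];
  for (const phase of phases) {
    const result = phase.run(state); // assumed shape: { ok: boolean, reason?: string }
    if (!result.ok) {
      return { completed, stoppedAt: phase.name, stoppedReason: result.reason };
    }
    completed.push(phase.name);
  }
  return { completed, stoppedAt: null, stoppedReason: null };
}

// Example: the Resources phase stops because the dataset does not exist (R8).
const demoPhases = [
  { name: "0 Intake & Plan", run: () => ({ ok: true }) },
  { name: "1 Research", run: () => ({ ok: true }) },
  { name: "2 Verify Research", run: () => ({ ok: true }) },
  {
    name: "3 Resources",
    run: (s) => (s.datasetExists ? { ok: true } : { ok: false, reason: "resource_missing" }),
  },
  { name: "4 Data Audit", run: () => ({ ok: true }) },
];

const demoOutcome = runPhases(demoPhases, { datasetExists: false });
console.log(demoOutcome.stoppedAt, demoOutcome.stoppedReason);
```

Because the driver is ordinary control flow, later phases cannot run after a STOP, whatever a subagent reports.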

Three enforcement layers

who guarantees what

The single most useful lens on the implementation: each spec requirement is enforced at one or more of three layers. The hardest rules live in Layer B, plain JavaScript that the model cannot argue past.

A Host Workflow harness

Properties of the runtime the script executes on — not re-coded in the script.

  • Deterministic control flow (loops, conditionals, fan-out).
  • Subagent context isolation → realizes §5.2 research isolation, generalized.
  • Token budget + compaction → §7.6.
  • Bounded iteration + repetition / continuation / malformed / truncation guards → §7.1–§7.5.

B Pure code guards

Plain JS in the script, independent of any subagent's self-report.

  • isScopeChange() / PROTECTED_KEYS → R3 / R4 / R7.
  • assertSubmitInvariant() → §5.6 by-value + timeout + monitoring + no local path.
  • canSpend() / COST_CAP_USD → §7.7 autonomous-with-cap.
  • attemptWithRetry bounded loop, analyze-before-retry → R10.
  • Phase ordering & explicit stoppedReason returns.
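
To make "pure code guard" concrete, here is a minimal sketch of how a scope-change check could look. The names isScopeChange and PROTECTED_KEYS and the four protected keys come from this page; the comparison logic, the config shape, and the placeholder repository names are assumptions for illustration only.

```javascript
// Keys listed on this page as protected (R4). How the real script compares
// them is not shown here; strict inequality is an assumption.
const PROTECTED_KEYS = ["method", "model", "dataset", "seqlen"];

// Returns true when a proposed config differs from the frozen baseline on any
// protected key. Pure: no I/O and no reliance on what a subagent claims.
function isScopeChange(baseline, proposed) {
  return PROTECTED_KEYS.some((key) => proposed[key] !== baseline[key]);
}

// Hypothetical frozen baseline (placeholder model and dataset ids).
const baseline = Object.freeze({
  method: "sft",
  model: "org/base-model",
  dataset: "org/train-data",
  seqlen: 2048,
  lr: 2e-5,
});

// Lowering the learning rate is allowed; shrinking seqlen to dodge an OOM is not.
const lrFix = { ...baseline, lr: 2e-6 };
const seqlenFix = { ...baseline, seqlen: 1024 };

console.log(isScopeChange(baseline, lrFix)); // false
console.log(isScopeChange(baseline, seqlenFix)); // true
```

The guard inspects the proposed config itself, so it fires even when the agent did not flag its fix as a scope change.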

C Constrained subagents

The model does the work, fenced by prompts, schemas, and critics.

  • HARD_RULES preamble → restates R1–R14 + principles.
  • 13 forced-output schemas → branchable structured results.
  • FAILURE_TAXONOMY → closed set of allowed minimal fixes.
  • 4 adversarial critics → "verify, never assume" at the risky moments.
  • TASK_PROFILES → skill + reference-script selection.
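
A closed failure taxonomy can be sketched as a lookup from failure category to the single fix allowed for it. Only the "divergence → lr×0.1" fix and the three-rung OOM ladder are stated on this page; the contents of each rung, the config field names, and the proposeFix helper are hypothetical placeholders.

```javascript
// Assumed rung contents; the page only says the ladder escalates 1→2→3.
const OOM_LADDER = [
  (c) => ({ ...c, batchSize: Math.max(1, Math.floor(c.batchSize / 2)), gradAccum: c.gradAccum * 2 }),
  (c) => ({ ...c, gradientCheckpointing: true }),
  (c) => ({ ...c, flavor: "next-larger-flavor" }),
];

// Closed set: a category that is not listed has no allowed fix.
const FAILURE_TAXONOMY = {
  divergence: (cfg) => ({ ...cfg, lr: cfg.lr * 0.1 }),
  oom: (cfg, rung) => OOM_LADDER[rung - 1](cfg),
};

// Hypothetical helper: returns the one allowed fix, or an explicit stop.
function proposeFix(category, cfg, oomLadderRung = 1) {
  if (!Object.prototype.hasOwnProperty.call(FAILURE_TAXONOMY, category)) {
    return { stop: true, reason: "unknown_failure:" + category };
  }
  if (category === "oom" && (oomLadderRung < 1 || oomLadderRung > OOM_LADDER.length)) {
    return { stop: true, reason: "oom_ladder_exhausted" };
  }
  return { stop: false, config: FAILURE_TAXONOMY[category](cfg, oomLadderRung) };
}

const startConfig = { lr: 1, batchSize: 8, gradAccum: 1 };
console.log(proposeFix("oom", startConfig, 1));
console.log(proposeFix("oom", startConfig, 4));
```

None of the sketched fixes touch a protected key, which is what would let a taxonomy like this coexist with the isScopeChange() guard.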

Principles → mechanism

§2
| Principle | Layer | Realized by |
| --- | --- | --- |
| 1 · Knowledge is stale | C | Mandatory Research phase + adapt shipped reference scripts + HARD_RULES; the research critic confirms APIs are real. |
| 2 · Verify, never assume | C | Resource verification + data audit + four critics + a tool-backed final verification. |
| 3 · Test small first | B/C | Free CPU smoke (1 step) then a tiny billable GPU preflight before the full job. |
| 4 · Persist or lose it | B/C | Concrete push_to_hub in the script + submit-invariant + the readiness checklist + eval confirms persistence. |
| 5 · Preserve intent | B | Frozen baseline + isScopeChange() pure guard + scope_change STOP. |
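
The submit-invariant named under principle 4 is described elsewhere on this page as checking four things: script passed by value, a timeout, monitoring, and no local path. A minimal sketch of such a check follows. Apart from timeoutHours, every field name (scriptSource, monitoringEnabled, command) and the local-path heuristic are assumptions, not the script's real interface.

```javascript
// Hypothetical pre-submit invariant: throws instead of trusting a subagent's
// claim that the job is ready.
function assertSubmitInvariant(job) {
  const problems = [];
  // By value: the script text itself travels with the job (assumed field).
  if (typeof job.scriptSource !== "string" || job.scriptSource.length === 0) {
    problems.push("script must be passed by value");
  }
  if (!(job.timeoutHours > 0)) {
    problems.push("timeoutHours must be set");
  }
  if (!job.monitoringEnabled) {
    problems.push("monitoring must be enabled");
  }
  // Crude heuristic for local paths in the command line (assumption).
  if (/(^|\s)(\/|\.\/|~\/)/.test(String(job.command))) {
    problems.push("command references a local path");
  }
  if (problems.length > 0) {
    throw new Error("submit invariant failed: " + problems.join("; "));
  }
}

const readyJob = {
  scriptSource: "print('train')",
  timeoutHours: 2,
  monitoringEnabled: true,
  command: "python train.py",
};
assertSubmitInvariant(readyJob); // passes silently

let refusal = null;
try {
  assertSubmitInvariant({ ...readyJob, timeoutHours: undefined, command: "python ./train.py" });
} catch (err) {
  refusal = err.message;
}
console.log(refusal);
```

Collecting every violation before throwing gives the caller one complete message to surface rather than a single failure per attempt.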

Hard rules → enforcement

§6 · R1–R14

Where each non-negotiable rule actually lives. A harness · B pure guard · C subagent prompt/schema/critic.

| Rule | Layer | How it is enforced in the script |
| --- | --- | --- |
| R1 persistence as part of run | C B | IMPL sets a concrete persistenceDest; readiness checklist requires persistenceConcrete; not the template username/.... |
| R2 upload side artifacts | C | HARD_RULES instructs explicit upload of logs/scripts to the Hub. |
| R3 minimal fix | B C | FAILURE_TAXONOMY supplies the single allowed fix per category; isScopeChange() rejects anything broader. |
| R4 no scope-changing fix | B | PROTECTED_KEYS (method/model/dataset/seqlen) → pure isScopeChange() → scope_change STOP, even if the agent didn't flag it. |
| R5 sized timeout | B C | IMPL timeoutHours; assertSubmitInvariant() refuses a submit without it. |
| R6 hardware sizing | C | Resources agent recommends a flavor sized to the model footprint; HARD_RULES forbids over/undersizing. |
| R7 OOM ladder | C B | FAILURE_TAXONOMY.oom + carried oomLadderRung escalate 1→2→3; isScopeChange() blocks seqlen/method "fixes". |
| R8 no silent substitution | B C | needsUserDecision/!exists and the dataset-format critic's mustReblock → STOP and ask. |
| R9 verify schema by inspection | C | Data Audit (DATA_AUDIT_SCHEMA) + format critic before any job. |
| R10 never retry unchanged | B C | attemptWithRetry always inserts a diagnosis + fix between attempts; FAILURE_TAXONOMY demands a specific change. |
| R11 API errors via docs | C | HARD_RULES + recheckedDocs in the analysis schema for wrong_trainer_arg/imports. |
| R12 prefer prebuilt | C | HARD_RULES + the code-review critic flags building heavy deps from source. |
| R13 secrets from env | C | Prompts pass HF_TOKEN as a secret, never inlined or logged. |
| R14 direct URLs | C | Schemas carry url fields; final verification checks every artifact is a direct URL. |
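
R10 ("never retry unchanged") is described as holding by construction: a diagnosis and fix always sit between two attempts. The sketch below shows one way a bounded loop can guarantee that. It is a synchronous simplification with a hypothetical signature; the real attemptWithRetry presumably awaits subagents and jobs, and the reason strings here are invented for illustration.

```javascript
// Hypothetical bounded retry loop: a retry only happens after analyze()
// returns a config that differs from the one that just failed.
function attemptWithRetry(run, analyze, config, maxAttempts) {
  const history = [];
  let current = config;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = run(current);
    history.push({ attempt, config: current, ok: result.ok });
    if (result.ok) {
      return { ok: true, attempts: attempt, config: current, history };
    }
    if (attempt === maxAttempts) break;
    const next = analyze(result, current); // diagnosis + fix
    // Key-order-sensitive comparison; good enough for a sketch.
    if (!next || JSON.stringify(next) === JSON.stringify(current)) {
      return { ok: false, stoppedReason: "no_change_proposed", history };
    }
    current = next;
  }
  return { ok: false, stoppedReason: "max_attempts", history };
}

// Toy job: fits in memory only when batchSize <= 4; the fix halves batchSize.
const toyRun = (cfg) => ({ ok: cfg.batchSize <= 4, error: "oom" });
const toyAnalyze = (result, cfg) => ({ ...cfg, batchSize: cfg.batchSize / 2 });

const retried = attemptWithRetry(toyRun, toyAnalyze, { batchSize: 16 }, 3);
console.log(retried.ok, retried.attempts, retried.config.batchSize); // true 3 4
```

An analyzer that hands back the same config ends the run with an explicit reason instead of burning another attempt.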

Capabilities → skills & agents

§4 · C1–C14 · Appendix A
| Cap | Where used | Realized by |
| --- | --- | --- |
| C1 hub search | Research, Resources | hf-cli, huggingface-best |
| C2 dataset inspect | Data Audit | huggingface-datasets + profile inspector |
| C3 papers | Research finders | huggingface-papers |
| C4 citation graph | Research: citations finder | paper links + WebSearch/WebFetch; citationGraphFidelity="reduced" (partial gap, declared) |
| C5 docs | Research, failure analysis | doc-search tools (R11 re-checks) |
| C6 example code | Research, Implement | adapt the skills' shipped reference scripts |
| C7 sandbox (CPU/GPU) | CPU Smoke, GPU Preflight | local CPU run + a tiny billable GPU job (gap → split realization) |
| C8 managed jobs | GPU Preflight, Full Job | hf-cli (hf jobs run/logs) |
| C9 training methods | Implement, jobs | profile trainer skill (LLM / vision / sentence-transformers) |
| C10 tracking + alerts | Implement, Full Job | huggingface-trackio (list alerts --json), the §5.7 decision channel |
| C11 durable storage | Implement, Evaluate | push_to_hub + hf-cli |
| C12 evaluation | Evaluate & Persist | huggingface-community-evals / trainer eval |
| C13 general web | Research fallback | host harness WebSearch/WebFetch |
| C14 notifications | not used | out of scope for this script |

Control contract → harness vs. script

§7

The script runs on a Workflow harness that already provides most of §7. The script adds the cost-cap approval policy and per-stage bounded retries.

| Contract | Layer | Realization |
| --- | --- | --- |
| §7.1 bounded iteration | A B | Harness caps the run; the script adds per-stage maxAttempts (smoke 3, jobs MAX_JOB_RETRIES). |
| §7.2 repetition guard | A B | Harness doom-loop detection; attemptWithRetry forbids identical retries by construction. |
| §7.3 continuation guard | A | Harness; the script's control flow is deterministic: it always proceeds or STOPs explicitly. |
| §7.4 malformed-action | A | Harness; structured-output schemas also force well-formed returns. |
| §7.5 output truncation | A B | Harness; large content is written to WORK_DIR files, not inlined into messages. |
| §7.6 context compaction | A | Harness token budget; subagent isolation keeps each context small. |
| §7.7 approval gate | B | canSpend() auto-approves paid jobs up to COST_CAP_USD; scheduled/recurring jobs out of scope (human-gated). |
| §7.8 effort probing | A | Harness; non-normative. |
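
The §7.7 cost-cap policy amounts to a small arithmetic check. The sketch below assumes a running spend total and a worst-case estimate of hourly price times timeout; the cap value of 25, the estimateJobCost helper, and the argument shapes are illustrative assumptions, since this page does not state the real cap or how cost is estimated.

```javascript
const COST_CAP_USD = 25; // illustrative value only; the real cap is not stated here

// Pure budget check: approve a paid job only if cumulative spend stays
// within the cap. Rejects estimates that are missing or nonsensical.
function canSpend(spentUsd, estimatedUsd, capUsd = COST_CAP_USD) {
  if (!Number.isFinite(estimatedUsd) || estimatedUsd < 0) return false;
  return spentUsd + estimatedUsd <= capUsd;
}

// Hypothetical worst-case estimate: the job runs until its timeout.
function estimateJobCost(hourlyUsd, timeoutHours) {
  return hourlyUsd * timeoutHours;
}

// A 15-minute GPU preflight at $2/h fits; a 4-hour job after $20 spent does not.
console.log(canSpend(0, estimateJobCost(2, 0.25))); // true
console.log(canSpend(20, estimateJobCost(2, 4))); // false
```

Tying the estimate to the timeout would make the R5 timeout requirement do double duty: without a timeout there is no finite worst-case cost to approve.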

Faithful · adapted · omitted

the scoping at a glance

✓ Kept faithfully

  • Research-first, mandatory, example-grounded.
  • Verify by inspection; no silent substitution (R8 STOP).
  • Smoke before scale; one-job-first.
  • Pre-flight checklist as a hard gate.
  • Persistence configured up front.
  • Alert-driven corrective iteration (§5.7).
  • No-scope-change as a pure code guard.
  • Evaluate + tool-backed §9 conformance.

~ Adapted (rule kept, mechanism changed)

  • C7 GPU sandbox gap → free CPU smoke + a tiny billable GPU preflight.
  • C4 citation graph gap → paper links + web, with fidelity="reduced" declared.
  • §7.7 autonomous approval → auto-submit paid jobs up to a dollar cost cap.
  • Research split into research + an explicit adversarial verify.

× Deliberately omitted

  • §5.9 autonomous improvement loop — stop at the first verified result.
  • Grid sweep hyperparameter exploration.
  • Scheduled / recurring jobs — always human-gated, out of scope.
  • C14 notifications — not used.
Net effect: the script is a single-shot, autonomous, cost-capped realization that runs the spec's full discipline up to the first verified-and-persisted result, halting and surfacing to the user at any rule boundary it cannot honor with a minimal fix.