From spec to script
The spec says what an autonomous ML researcher must do; ml-research.js
says how, as deterministic control flow over the Hugging Face skills. This view is the bridge: it aligns
the spec's phases with the script's, then shows which layer enforces each rule — the host harness, the
script's pure code guards, or the constrained subagents — and calls out exactly what was kept faithfully,
adapted, or deliberately omitted.
Phase alignment
spec §5 ↔ script phases
A near 1-to-1 mapping, with a few telling differences. The spec's single Research phase is split into research plus an adversarial verify. The spec's "implementation & preflight" fans out into implement, code review, a free CPU smoke, and a tiny billable GPU preflight (the script's answer to the C7 sandbox gap). Explicit gates are added for job readiness and final conformance, and the §5.9 autonomous improvement loop is omitted.
| Spec phase | Script phase(s) | How it is realized |
|---|---|---|
| §5.1 Intake & planning | 0 Intake & Plan | One PLAN_SCHEMA agent classifies the task and freezes the baseline; trivial requests early-exit with a direct answer. |
| §5.2 Research | 1 Research, 2 Verify Research | Four parallel Explore finders + a synthesis agent (isolation = separate subagent contexts), then an adversarial critic confirms cited papers/datasets/APIs exist. (C4 citation graph: reduced fidelity.) |
| §5.3 Resource validation | 3 Resources | Parallel model + dataset verification; sizes hardware; needsUserDecision/!exists → STOP (R8). |
| §5.4 Data audit | 4 Data Audit | Audit + a format critic that maps the profile's schemaRules to the method; incompatible & un-remappable → STOP. |
| §5.5 Implementation & preflight | 5 Implement, 6 Code Review, 7 CPU Smoke, 8 GPU Preflight | Adapt the shipped reference script → critic → free local CPU smoke → tiny billable GPU job. (C7 sandbox gap: smoke = CPU + tiny GPU job.) |
| §5.6 Job submission | 9 Job Readiness, 10 Full Job | The pre-flight checklist becomes an explicit gate agent; then one-job-first submission with a healthy-start confirmation. The pure submit-invariant runs before launch. |
| §5.7 Monitor & iterate | 10 Full Job | Structured trackio alerts drive a bounded corrective retry (divergence → lr×0.1; OOM ladder). Grid sweep omitted. |
| §5.8 Evaluate & persist | 11 Evaluate & Persist, ✓ Final Verification | Evaluate + confirm-it-works + confirm-persisted, then a tool-backed conformance check against the §9 criteria. |
| §5.9 Autonomous loop | (none) | Omitted. The workflow stops at the first verified result; no improve-and-research-again loop. |
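
The phase ordering in the table can be pictured as a fixed list that a small runner walks in order, stopping with an explicit reason as soon as any phase asks to. This is a minimal sketch, not the code in ml-research.js: the phase keys, the runPipeline helper, the handler contract ({ stop } or { patch }), and the result shape are all assumptions made for illustration. Only the phase order and the idea of an explicit stoppedReason come from the text.

```javascript
// Hypothetical phase keys, in the order listed in the alignment table.
const PHASES = [
  "intake", "research", "verifyResearch", "resources", "dataAudit",
  "implement", "codeReview", "cpuSmoke", "gpuPreflight",
  "jobReadiness", "fullJob", "evaluatePersist", "finalVerification",
];

// Walk the phases in a fixed order. A handler may return { stop: "<reason>" }
// to halt the run, or { patch: {...} } to add facts to the shared state.
async function runPipeline(handlers, state = {}) {
  for (const phase of PHASES) {
    const handler = handlers[phase];
    if (typeof handler !== "function") {
      return { ok: false, stoppedAt: phase, stoppedReason: "missing_handler", state };
    }
    const result = await handler(state);
    if (result && result.stop) {
      // Explicit STOP: the run never silently continues past a failed gate.
      return { ok: false, stoppedAt: phase, stoppedReason: result.stop, state };
    }
    Object.assign(state, result && result.patch);
  }
  return { ok: true, state };
}
```

Because the loop is ordinary code, a later phase cannot run unless every earlier phase returned without a stop, regardless of what any subagent reports.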
Three enforcement layers
who guarantees what
The single most useful lens on the implementation: each spec requirement is enforced at one of three layers. The hardest rules live in Layer B: plain JavaScript that the model cannot argue past.
A Host Workflow harness
Properties of the runtime the script executes on — not re-coded in the script.
- Deterministic control flow (loops, conditionals, fan-out).
- Subagent context isolation → realizes §5.2 research isolation, generalized.
- Token budget + compaction → §7.6.
- Bounded iteration + repetition / continuation / malformed / truncation guards → §7.1–§7.5.
B Pure code guards
Plain JS in the script, independent of any subagent's self-report.
- isScopeChange() / PROTECTED_KEYS → R3 / R4 / R7.
- assertSubmitInvariant() → §5.6 by-value + timeout + monitoring + no local path.
- canSpend() / COST_CAP_USD → §7.7 autonomous-with-cap.
- attemptWithRetry bounded loop, analyze-before-retry → R10.
- Phase ordering & explicit stoppedReason returns.
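
To make the idea of a pure guard concrete, here is a minimal sketch of a scope-change check. The four protected keys are the ones named in the R4 row (method, model, dataset, seqlen). The function signature, the changedProtectedKeys helper, and the sample baseline values are assumptions for illustration and may differ from the real script.

```javascript
// Keys that a corrective fix is never allowed to change (names from the R4 row).
const PROTECTED_KEYS = ["method", "model", "dataset", "seqlen"];

// List the protected keys whose proposed value differs from the frozen baseline.
function changedProtectedKeys(baseline, proposed) {
  return PROTECTED_KEYS.filter(
    (key) => key in proposed && proposed[key] !== baseline[key]
  );
}

// Pure predicate: depends only on its arguments, never on an agent's self-report.
function isScopeChange(baseline, proposed) {
  return changedProtectedKeys(baseline, proposed).length > 0;
}

// Placeholder baseline; the identifiers below are illustrative, not real repos.
const baseline = Object.freeze({
  method: "sft",
  model: "example-org/base-model",
  dataset: "example-org/train-data",
  seqlen: 2048,
});

const lrFix = { learningRate: 2e-5 };  // touches no protected key
const seqlenFix = { seqlen: 1024 };    // shrinks a protected key
```

With these inputs, isScopeChange(baseline, lrFix) is false and isScopeChange(baseline, seqlenFix) is true, so a sequence-length "fix" would be routed to a scope_change STOP even if the agent proposing it did not flag it.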
C Constrained subagents
The model does the work, fenced by prompts, schemas, and critics.
- HARD_RULES preamble → restates R1–R14 + principles.
- 13 forced-output schemas → branchable structured results.
- FAILURE_TAXONOMY → closed set of allowed minimal fixes.
- 4 adversarial critics → "verify, never assume" at the risky moments.
- TASK_PROFILES → skill + reference-script selection.
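
A closed taxonomy can be read as a lookup table from failure category to the one fix that category permits. The sketch below illustrates that reading under stated assumptions: only the divergence → lr×0.1 fix and the three-rung OOM ladder carried in oomLadderRung come from the text. The config field names, the content of each rung, and the proposeFix helper are hypothetical.

```javascript
// Illustrative OOM ladder. The text only says rungs escalate 1 → 2 → 3;
// what each rung does here is an assumption.
const OOM_LADDER = [
  (cfg) => ({
    ...cfg,
    batchSize: Math.max(1, Math.floor(cfg.batchSize / 2)),
    gradAccumSteps: cfg.gradAccumSteps * 2,
  }),
  (cfg) => ({ ...cfg, gradientCheckpointing: true }),
  (cfg) => ({ ...cfg, hardwareTier: cfg.hardwareTier + 1 }),
];

// Closed set: a category that is not listed has no allowed fix.
const FAILURE_TAXONOMY = Object.freeze({
  divergence: (cfg) => ({
    config: { ...cfg, learningRate: cfg.learningRate * 0.1 },
  }),
  oom: (cfg, state) => {
    const rung = (state.oomLadderRung ?? 0) + 1;
    if (rung > OOM_LADDER.length) return null; // ladder exhausted
    return { config: OOM_LADDER[rung - 1](cfg), oomLadderRung: rung };
  },
});

// Returns the single allowed fix for a category, or null when none exists.
function proposeFix(category, cfg, state = {}) {
  const known = Object.prototype.hasOwnProperty.call(FAILURE_TAXONOMY, category);
  return known ? FAILURE_TAXONOMY[category](cfg, state) : null;
}
```

A null result is the caller's cue to stop and report rather than improvise a broader change.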
Principles → mechanism
§2
| Principle | Layer | Realized by |
|---|---|---|
| 1 · Knowledge is stale | C | Mandatory Research phase + adapt shipped reference scripts + HARD_RULES; the research critic confirms APIs are real. |
| 2 · Verify, never assume | C | Resource verification + data audit + four critics + a tool-backed final verification. |
| 3 · Test small first | B/C | Free CPU smoke (1 step), then a tiny billable GPU preflight before the full job. |
| 4 · Persist or lose it | B/C | Concrete push_to_hub in the script + submit-invariant + the readiness checklist + eval confirms persistence. |
| 5 · Preserve intent | B | Frozen baseline + isScopeChange() pure guard + scope_change STOP. |
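
The submit-invariant mentioned under principle 4 is described elsewhere on this page as checking four things: by-value, timeout, monitoring, and no local path. The sketch below is one plausible reading of those four checks, not the script's actual implementation. The job field names (scriptSource, timeoutHours, monitoring, args), the interpretation of "by-value" as inline script source, and the local-path heuristic are all assumptions.

```javascript
// Assumed heuristic: arguments that look like paths on the submitting machine.
const LOCAL_PATH = /^(\.{1,2}\/|~\/|\/)/;

// Collect every violated condition so the caller sees all problems at once.
function submitInvariantProblems(job) {
  const problems = [];
  if (typeof job.scriptSource !== "string" || job.scriptSource.trim() === "") {
    problems.push("script must be passed by value (inline source)");
  }
  if (!(Number.isFinite(job.timeoutHours) && job.timeoutHours > 0)) {
    problems.push("timeoutHours must be a positive number");
  }
  if (job.monitoring !== true) {
    problems.push("monitoring must be enabled");
  }
  const args = Array.isArray(job.args) ? job.args : [];
  const local = args.filter((a) => typeof a === "string" && LOCAL_PATH.test(a));
  if (local.length > 0) {
    problems.push(`local paths are not allowed: ${local.join(", ")}`);
  }
  return problems;
}

// Throws before launch if any condition fails; returns nothing on success.
function assertSubmitInvariant(job) {
  const problems = submitInvariantProblems(job);
  if (problems.length > 0) {
    throw new Error(`submit refused: ${problems.join("; ")}`);
  }
}
```

Running the check as a throwing assertion immediately before the submit call means a job that lacks a timeout or monitoring cannot be launched by accident.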
Hard rules → enforcement
§6 · R1–R14
Where each non-negotiable rule actually lives. Layers: A harness · B pure guard · C subagent prompt/schema/critic.
| Rule | Layer | How it is enforced in the script |
|---|---|---|
| R1 persistence as part of run | CB | IMPL sets a concrete persistenceDest; readiness checklist requires persistenceConcrete; not the template username/.... |
| R2 upload side artifacts | C | HARD_RULES instructs explicit upload of logs/scripts to the Hub. |
| R3 minimal fix | BC | FAILURE_TAXONOMY supplies the single allowed fix per category; isScopeChange() rejects anything broader. |
| R4 no scope-changing fix | B | PROTECTED_KEYS (method/model/dataset/seqlen) → pure isScopeChange() → scope_change STOP, even if the agent didn't flag it. |
| R5 sized timeout | BC | IMPL timeoutHours; assertSubmitInvariant() refuses a submit without it. |
| R6 hardware sizing | C | Resources agent recommends a flavor sized to the model footprint; HARD_RULES forbids over/undersizing. |
| R7 OOM ladder | CB | FAILURE_TAXONOMY.oom + carried oomLadderRung escalate 1→2→3; isScopeChange() blocks seqlen/method "fixes". |
| R8 no silent substitution | BC | needsUserDecision/!exists and the dataset-format critic's mustReblock → STOP and ask. |
| R9 verify schema by inspection | C | Data Audit (DATA_AUDIT_SCHEMA) + format critic before any job. |
| R10 never retry unchanged | BC | attemptWithRetry always inserts a diagnosis + fix between attempts; FAILURE_TAXONOMY demands a specific change. |
| R11 API errors via docs | C | HARD_RULES + recheckedDocs in the analysis schema for wrong_trainer_arg/imports. |
| R12 prefer prebuilt | C | HARD_RULES + the code-review critic flags building heavy deps from source. |
| R13 secrets from env | C | Prompts pass HF_TOKEN as a secret, never inlined or logged. |
| R14 direct URLs | C | Schemas carry url fields; final verification checks every artifact is a direct URL. |
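
The R10 row says a retry always has a diagnosis and a fix between attempts. A bounded loop with that property can be sketched as follows. The parameter names, the result shape, and the JSON-based equality check are assumptions for illustration; the real attemptWithRetry may be structured differently.

```javascript
// Bounded retry loop: every retry must be preceded by an analysis step
// that returns a changed config. Hypothetical signature and result shape.
async function attemptWithRetry({ run, analyze, config, maxAttempts }) {
  let current = config;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outcome = await run(current);
    if (outcome.ok) {
      return { ok: true, attempts: attempt, config: current, outcome };
    }
    if (attempt === maxAttempts) break; // no budget left for another try

    // Analyze before retrying; null means no allowed fix exists.
    const next = await analyze(outcome, current);
    if (!next) {
      return { ok: false, attempts: attempt, stoppedReason: "no_allowed_fix" };
    }
    // Refuse an identical retry. JSON comparison is key-order sensitive,
    // which is good enough for configs derived by object spread.
    if (JSON.stringify(next) === JSON.stringify(current)) {
      return { ok: false, attempts: attempt, stoppedReason: "unchanged_retry" };
    }
    current = next;
  }
  return { ok: false, attempts: maxAttempts, stoppedReason: "max_attempts" };
}
```

Because the only path back to run() goes through analyze() and the equality check, an unchanged retry is ruled out by the loop's structure rather than by a prompt instruction.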
Capabilities → skills & agents
§4 · C1–C14 · Appendix A
| Cap | Where used | Realized by |
|---|---|---|
| C1 hub search | Research, Resources | hf-cli, huggingface-best |
| C2 dataset inspect | Data Audit | huggingface-datasets + profile inspector |
| C3 papers | Research finders | huggingface-papers |
| C4 citation graph | Research: citations finder | paper links + WebSearch/WebFetch; citationGraphFidelity="reduced" (partial gap, declared) |
| C5 docs | Research, failure analysis | doc-search tools (R11 re-checks) |
| C6 example code | Research, Implement | adapt the skills' shipped reference scripts |
| C7 sandbox (CPU/GPU) | CPU Smoke, GPU Preflight | local CPU run + a tiny billable GPU job (gap → split realization) |
| C8 managed jobs | GPU Preflight, Full Job | hf-cli (hf jobs run/logs) |
| C9 training methods | Implement, jobs | profile trainer skill (LLM / vision / sentence-transformers) |
| C10 tracking + alerts | Implement, Full Job | huggingface-trackio (list alerts --json) — the §5.7 decision channel |
| C11 durable storage | Implement, Evaluate | push_to_hub + hf-cli |
| C12 evaluation | Evaluate & Persist | huggingface-community-evals / trainer eval |
| C13 general web | Research fallback | host harness WebSearch/WebFetch |
| C14 notifications | (none) | not used; out of scope for this script |
Control contract → harness vs. script
§7
The script runs on a Workflow harness that already provides most of §7. The script adds the cost-cap approval policy and per-stage bounded retries.
| Contract | Layer | Realization |
|---|---|---|
| §7.1 bounded iteration | AB | Harness caps the run; the script adds per-stage maxAttempts (smoke 3, jobs MAX_JOB_RETRIES). |
| §7.2 repetition guard | AB | Harness doom-loop detection; attemptWithRetry forbids identical retries by construction. |
| §7.3 continuation guard | A | Harness; the script's control flow is deterministic — it always proceeds or STOPs explicitly. |
| §7.4 malformed-action | A | Harness; structured-output schemas also force well-formed returns. |
| §7.5 output truncation | AB | Harness; large content is written to WORK_DIR files, not inlined into messages. |
| §7.6 context compaction | A | Harness token budget; subagent isolation keeps each context small. |
| §7.7 approval gate | B | canSpend() auto-approves paid jobs up to COST_CAP_USD; scheduled/recurring jobs out of scope (human-gated). |
| §7.8 effort probing | A | Harness; non-normative. |
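
The §7.7 row describes auto-approval of paid jobs up to a cap. A budget check of that kind can be as small as the sketch below. The cap value, the argument list of canSpend, and the estimateJobCostUsd helper are assumptions for illustration; only the names canSpend and COST_CAP_USD appear in the text.

```javascript
// Placeholder cap; the real value is configured in the script.
const COST_CAP_USD = 25;

// Worst-case estimate: the job runs for its full timeout at the hourly rate.
function estimateJobCostUsd(hourlyRateUsd, timeoutHours) {
  return hourlyRateUsd * timeoutHours;
}

// True only when the new job keeps cumulative spend at or under the cap.
// Non-finite or negative inputs are rejected rather than treated as free.
function canSpend(spentUsd, estimateUsd, capUsd = COST_CAP_USD) {
  const valid =
    Number.isFinite(spentUsd) && Number.isFinite(estimateUsd) &&
    spentUsd >= 0 && estimateUsd >= 0;
  return valid && spentUsd + estimateUsd <= capUsd;
}
```

Estimating from the timeout rather than the expected runtime is a conservative choice: it ties the R5 sized timeout to the spend decision, so an oversized timeout makes a job harder to approve.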
Faithful · adapted · omitted
the scoping at a glance
✓ Kept faithfully
- Research-first, mandatory, example-grounded.
- Verify by inspection; no silent substitution (R8 STOP).
- Smoke before scale; one-job-first.
- Pre-flight checklist as a hard gate.
- Persistence configured up front.
- Alert-driven corrective iteration (§5.7).
- No-scope-change as a pure code guard.
- Evaluate + tool-backed §9 conformance.
~ Adapted (rule kept, mechanism changed)
- C7 GPU sandbox gap → free CPU smoke + a tiny billable GPU preflight.
- C4 citation graph gap → paper links + web, with fidelity="reduced" declared.
- §7.7 autonomous approval → auto-submit paid jobs up to a dollar cost cap.
- Research split into research + an explicit adversarial verify.
× Deliberately omitted
- §5.9 autonomous improvement loop — stop at the first verified result.
- Grid sweep hyperparameter exploration.
- Scheduled / recurring jobs — always human-gated, out of scope.
- C14 notifications — not used.