Inside ml-research.js
An executable encoding of the behavioral spec as a deterministic Workflow orchestration.
The script owns the control flow, the gates, and the discipline; the actual ML work is delegated to
subagents that drive the Hugging Face skills (huggingface-skills:*). It is the same workflow
the spec describes — research-first, validate-before-spend, smoke-before-scale, persist-and-verify — rendered as
JavaScript that fans out subagents, forces structured output, and enforces the hard rules in code.
It runs on a Workflow harness that shares the local filesystem across subagents
but not their context. Scripts pass between local subagents via WORK_DIR; the remote HF Job
receives code by value (inline source), never a local path.
Scoping decision 1 — autonomous paid submit, up to a cap
Paid HF Jobs auto-submit autonomously up to a dollar cost cap (spec §7.7's autonomous
mode). canSpend() gates every billable submit; spentUSD accrues across the run.
Scheduled/recurring jobs are out of scope (always human-gated).
Scoping decision 2 — stop at the first verified result
The workflow stops at the first verified result. The spec §5.9 autonomous improvement loop and the grid sweep are omitted. The alert-driven corrective retries that get one good run to complete (divergence → lr×0.1, the OOM ladder, etc.) are kept.
The orchestration model
five ideas that shape every phase
Before the phases, five structural choices explain how the script turns a prose spec into enforceable control flow.
Isolated context per step
Each agent() call runs in its own context; the orchestrator only sees the returned object. This
is exactly the spec's research-subagent isolation, generalized to every phase.
Forced JSON schemas
Every subagent returns a schema-validated object (PLAN_SCHEMA, RESEARCH_SCHEMA,
JOB_SCHEMA, …). The orchestrator branches on fields, never on free prose.
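A minimal sketch of branching on validated fields rather than prose. The reduced PLAN_SCHEMA shape, the validate helper, and routePlan are hypothetical stand-ins; the script's real schemas and the harness's own validation are not shown in this document.

```javascript
// Hypothetical, reduced PLAN_SCHEMA: field name -> expected typeof result.
const PLAN_SCHEMA = { isTrivial: 'boolean', taskType: 'string', method: 'string', baseline: 'object' }

// Assumed helper: returns the names of fields that are absent, null, or mistyped.
function validate(schema, obj) {
  if (obj === null || typeof obj !== 'object') return Object.keys(schema)
  return Object.keys(schema).filter(key => obj[key] === null || typeof obj[key] !== schema[key])
}

// The orchestrator decides from fields only; free text is never parsed.
function routePlan(plan) {
  const bad = validate(PLAN_SCHEMA, plan)
  if (bad.length > 0) return { stopped: true, stoppedReason: 'planning_failed', bad }
  if (plan.isTrivial) return { next: 'direct_answer' }
  return { next: 'research', taskType: plan.taskType }
}
```

An object that fails validation never reaches the branching logic, so a subagent that answers in prose produces a stop rather than a misread decision.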
Code, not self-report
isScopeChange() and assertSubmitInvariant() run in plain JS, independent of what a
subagent claims. The model cannot talk its way past them.
Adversarial verification
Four read-only Explore critics re-check research, data format, code, and final conformance —
tool-backed, with a mustReblock flag that can halt the line.
Declarative genericity
TASK_PROFILES maps task type → skill, reference script, schema rules. One pipeline serves LLM,
vision, embedding, eval-only, and data-only tasks with no branch explosion.
Tokens vs. dollars
The harness budget.* tracks tokens; HF compute spend is tracked separately in
spentUSD against COST_CAP_USD.
The 13-phase pipeline
meta.phases · top-to-bottom
The phase annotations flag fan-out, critics, retry loops, cost gates, and the points where the workflow can STOP and surface to the user. Two free early exits (eval-only, data-only) branch out after the data audit.
0 Intake & Plan
One agent() call returns a PLAN_SCHEMA object: it classifies the task, picks
taskType (llm | vision | embedding | eval | data | other) and method, and
freezes the baseline {model, dataset, method, sequenceLength} exactly as the user expressed
it — the frozen object the scope-change guard later compares against.
isTrivial=true only for a pure factual question / status check / resource lookup → return directAnswer and skip everything (the spec's trivial-request branch). resolveProfile(taskType) selects the skill, reference scripts, and schema rules.
1 Research — parallel fan-out
A parallel([...]) of four read-only Explore finders, each in its own context,
each returning a FINDER_SCHEMA object:
- landmark — anchor papers, read methodology (not abstracts).
- citations — downstream work; no full citation-graph API, so it uses paper-page links + host WebSearch/WebFetch and sets fidelity="reduced".
- examples — working code with current APIs, preferring the skill's shipped reference scripts.
- datasets — which datasets produced the reported results; confirm they load.
A synthesis agent folds the finders into one RESEARCH_SCHEMA: a ranked recipe table, code
patterns (correct imports + current trainer args), SOTA landscape, references, and
citationGraphFidelity.
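The fan-out above can be sketched with plain promises. Here agent is a stub standing in for the harness call, Promise.all stands in for the harness parallel([...]), and the "full" fidelity value and the rule for deriving citationGraphFidelity are assumptions; the document only states that the citations finder sets fidelity="reduced".

```javascript
// Stub for the harness agent() call; the real one runs a subagent in its own context.
async function agent(role) {
  return {
    role,
    findings: [`stub finding for ${role}`],
    sources: [],
    fidelity: role === 'citations' ? 'reduced' : 'full', // 'full' is an assumed value
  }
}

// Fan out the four finders concurrently; each resolves to a FINDER_SCHEMA-like object.
async function runFinders() {
  const roles = ['landmark', 'citations', 'examples', 'datasets']
  const results = await Promise.all(roles.map(role => agent(role)))
  // Assumed synthesis rule: any reduced finder makes the overall fidelity reduced.
  const citationGraphFidelity = results.some(r => r.fidelity === 'reduced') ? 'reduced' : 'full'
  return { results, citationGraphFidelity }
}
```

Because each finder returns only its own object, the orchestrator sees four small structured results and none of the finders' working context.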
2 Verify Research — adversarial critic
A read-only Explore critic returns a CRITIC_SCHEMA object. It tool-checks that the
cited papers, datasets, and models actually exist and that code patterns use real current APIs.
mustReblock=true only for hard problems (a cited resource doesn't exist, a hallucinated import);
"is every result truly attributable?" is a soft warn. If blocked, one corrective re-research pass
runs.
3 Resources — verify & size
A parallel([model, dataset]) of Explore agents (RESOURCE_SCHEMA). If
the user named a model/dataset, it confirms existence and inspects; if not, it evaluates candidates and
recommends. It also recommends a hardware flavor sized to the model footprint (R6).
If a requested resource is missing or unusable (needsUserDecision, or !exists), the workflow STOPs with resource_unavailable: it asks the user rather than substituting (R8).
4 Data Audit — format ↔ method
An audit agent (DATA_AUDIT_SCHEMA) inspects schema/columns, splits, distributions, and sample
rows, then validates the format against the method using the profile's schemaRules
(e.g. DPO needs prompt/chosen/rejected). A format critic confirms compatibility; if it can be fixed
by column remapping it passes with a mapping, otherwise mustReblock → STOP
(dataset_format_incompatible, an R8 ask-the-user stop).
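A rough sketch of the format check, assuming the rule is "every required field is present or can be remapped from an existing column". The required fields come from the schemaRules examples in this document; the ALIASES table and the checkFormat function are hypothetical.

```javascript
// Required fields per method, from the schemaRules examples in this document.
const REQUIRED = { dpo: ['prompt', 'chosen', 'rejected'], grpo: ['prompt'] }

// Hypothetical alias table: column names that may be remapped onto a required field.
const ALIASES = { prompt: ['question', 'instruction'], chosen: ['accepted'], rejected: ['declined'] }

function checkFormat(method, columns) {
  const mapping = {}
  const missing = []
  for (const field of REQUIRED[method] ?? []) {
    if (columns.includes(field)) continue
    const alias = (ALIASES[field] ?? []).find(a => columns.includes(a))
    if (alias) mapping[alias] = field // fixable by column remapping
    else missing.push(field)          // genuinely absent
  }
  if (missing.length > 0) {
    return { formatCompatible: false, mustReblock: true, stoppedReason: 'dataset_format_incompatible', missing }
  }
  return { formatCompatible: true, mappingNeeded: Object.keys(mapping).length > 0, mapping }
}
```

A dataset with columns question, accepted, rejected passes with a two-entry mapping, while one that lacks any column for rejected stops the run.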
Free early exits: profile.evalOnly → evaluate then finalize; profile.dataOnly → build/persist a dataset then finalize. Both skip train/smoke/preflight/full-job.
5 Implement — adapt the reference script
The implementer adapts the profile's shipped reference script (e.g. train_sft_example.py) and
writes train.py + eval.py into WORK_DIR, returning an
IMPL_SCHEMA whose trainScriptContent is the full inline source (by-value
transport). The script must wire trackio metrics + structured alerts, a concrete push_to_hub
destination, a sized timeout, the chosen flavor, OOM knobs (batch/accum/grad-checkpoint), and
SMOKE knobs (SMOKE=1 → 1 step, tiny slice, CPU+fp32, optional tiny proxy model).
6 Code Review — critic before any run
An Explore critic reads the files and flags hallucinated imports, wrong trainer args, local-path
leakage, missing monitoring, a username/... placeholder destination, short timeout, OOM-readiness,
and source-build smells (R12). Any error-severity issue sets mustReblock and a code-fix pass edits
the files before anything executes.
7 CPU Smoke — free, local
Runs attemptWithRetry (≤3): execute train.py with SMOKE=1 on CPU+fp32
for one train step + one eval step (uv run resolves PEP-723 deps). A high loss is not a
failure; an exception/import/arg/schema error is. If the real model can't load on CPU, a tiny proxy model is
used (the real load is first exercised on GPU preflight). Each failure → analyze → minimal-fix → retry.
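The pass/fail rule for the smoke run can be sketched as follows. The run shape { exitCode, stderr, loss } and the classifySmoke function are assumptions for illustration; the document only specifies that a crash fails and a high loss does not.

```javascript
// Hypothetical shape of a smoke run result: { exitCode, stderr, loss }.
function classifySmoke(run) {
  // Only a crash counts as a failure; the loss value is deliberately ignored.
  if (run.exitCode !== 0) {
    const lastLine = (run.stderr ?? '').split('\n').filter(Boolean).pop()
    return { status: 'failure', reason: lastLine ?? 'unknown error' }
  }
  return { status: 'success' }
}

console.log(classifySmoke({ exitCode: 0, stderr: '', loss: 11.2 }).status) // success
console.log(classifySmoke({ exitCode: 1, stderr: 'Traceback\nImportError: no module named foo\n' }).reason)
```

The last non-empty stderr line is passed on as the failure reason, which is what the analyze step would start from.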
8 GPU Preflight — tiny billable job
Gated by canSpend(estimatedPreflightUSD) and assertSubmitInvariant() before
submitting. Then attemptWithRetry (≤MAX_JOB_RETRIES) submits a tiny GPU job by value
on a small flavor, polls logs to a terminal state, and validates the GPU-only code paths CPU could not:
CUDA, mixed precision, the real model load, small-scale OOM. recordSpend() accrues the cost.
9 Job Readiness — the one-time gate
A single CHECKLIST_SCHEMA agent verifies the spec §5.6 pre-flight checklist:
reference impl cited, dataset format verified, GPU smoke ok and the real model loaded on GPU, concrete
persistence destination, sized timeout, monitoring wired, by-value transport. If
!allSatisfied → STOP (preflight_checklist_unsatisfied).
10 Full Job — one-job-first
Cost-gated and invariant-checked again, then attemptWithRetry (≤MAX_JOB_RETRIES)
submits one full job by value, polls logs to confirm healthy (a step advancing / first metric),
then monitors trackio alerts (trackio list alerts --json --since <cursor>). An ERROR-level
alert (divergence/NaN/OOM) is treated as a failure to analyze and correct — the alert-driven retry the spec's
§5.7 calls for. The dashboard URL is recorded as an artifact.
11 Evaluate & Persist
An EVAL_SCHEMA agent evaluates the trained model and confirms it actually works (not
merely produced), verifies it is persisted on the Hub, and reports the metric+value, eval URL, model URL, and
dashboard URL. (For eval-only / data-only tasks, this is reached directly from the early exit.)
✓ Final Verification
finalize() runs a read-only, tool-backed Explore agent
(COMPLETION_SCHEMA) that re-checks the spec §9 completion criteria — verify, do not trust prior
claims: research preceded implementation, resources verified, smoke-tested, checklist satisfied, alerts emitted,
result persisted & evaluated, no scope change vs. the frozen baseline, every artifact a direct URL.
Sets result.conforms.
Key mechanisms
the reusable machinery behind the phases
The phases above are thin; most of the discipline lives in a handful of reusable pieces.
Pure code guards — enforced in JS, not by the model
Scope-change guard (R3/R4/R7). A pure function checks whether a proposed fix touches a protected key.
Even if a subagent doesn't flag scopeChange, this catches it:
const PROTECTED_KEYS = ['method','model','dataset','max_seq_length', /* …variants… */]
function isScopeChange(configChanges) {
  return Object.keys(configChanges).some(k => PROTECTED_KEYS.includes(k.toLowerCase()))
}
// in the retry loop:
const scope = analysis.scopeChange || isScopeChange(analysis.configChanges)
if (scope) return { stopped: true, stoppedReason: 'scope_change' } // STOP, ask the user
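A short usage sketch of the guard. The elided key variants are left out, so only the four named keys are protected here, and the fallback to an empty object for a missing configChanges is an added assumption, not something the source shows.

```javascript
// Only the four keys named in the document; the elided variants are not guessed.
const PROTECTED_KEYS = ['method', 'model', 'dataset', 'max_seq_length']

function isScopeChange(configChanges) {
  // Added guard (assumption): treat a missing configChanges object as "no changes".
  return Object.keys(configChanges ?? {}).some(k => PROTECTED_KEYS.includes(k.toLowerCase()))
}

console.log(isScopeChange({ learning_rate: 1e-5 }))  // false
console.log(isScopeChange({ Max_Seq_Length: 1024 })) // true: keys are compared case-insensitively
console.log(isScopeChange(undefined))                // false
```

The check looks at which keys a fix touches, not at their values, so even a seemingly harmless change to a protected key is treated as a scope change.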
Billable-submit invariant (§5.6). Run before every paid submit; refuses to launch unless the code is present by value, the timeout is sized, monitoring is wired, and no local path leaked:
function assertSubmitInvariant(impl) {
  // missing unless: trainScriptContent (inline, >=50 chars), timeoutHours,
  // monitoringWired, and NO /Users/ or ./ml-run/ path in the script source
  return { ok: missing.length === 0, missing }
}
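One way the invariant could be filled in, following the four conditions in the comment above. The labels pushed onto missing (such as noLocalPath) are illustrative names, not the script's own.

```javascript
function assertSubmitInvariant(impl) {
  const missing = []
  const source = impl.trainScriptContent ?? ''
  if (source.length < 50) missing.push('trainScriptContent')  // code must be present by value
  if (!(impl.timeoutHours > 0)) missing.push('timeoutHours')  // timeout must be sized
  if (!impl.monitoringWired) missing.push('monitoringWired')  // monitoring must be wired
  // No local path may leak into a script that runs remotely.
  if (source.includes('/Users/') || source.includes('./ml-run/')) missing.push('noLocalPath')
  return { ok: missing.length === 0, missing }
}

const check = assertSubmitInvariant({
  trainScriptContent: 'open("./ml-run/train.py")  # a local path leaked into the inline source',
  timeoutHours: 2,
  monitoringWired: true,
})
console.log(check.ok, check.missing) // false [ 'noLocalPath' ]
```

Returning the full missing list, instead of failing on the first problem, lets the stop message name everything that has to be fixed before a paid submit.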
attemptWithRetry — analyze → minimal-fix → retry
One helper drives CPU smoke, GPU preflight, and the full job. It never retries unchanged (R10): every retry is preceded by a diagnosis and a bounded, scope-safe fix.
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  const res = await submit(attempt, ctx)
  if (res.status === 'success') return res                                  // done
  if (attempt === maxAttempts) return { stopped:true, stoppedReason:'retries_exhausted' }
  const analysis = await analyze(attempt, res, ctx)                         // FAILURE_ANALYSIS
  if (scopeChange || analysis.unrecoverable) return { stopped:true }        // STOP
  await applyFix(analysis, ctx)                                             // edit train.py
  ctx.oomLadderRung = Math.max(ctx.oomLadderRung, analysis.oomLadderRung)   // R7 ladder carries
}
The carried ctx remembers the OOM ladder rung (so R7 escalates 1→2→3 across attempts) and the
alert poll cursor (--since).
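A self-contained sketch of the loop with its collaborators injected, so it can be run with stubs. The options-object signature, the 'unrecoverable' stop reason, and the default rung of 0 are assumptions; in the script, submit, analyze, and applyFix wrap subagent calls.

```javascript
const PROTECTED_KEYS = ['method', 'model', 'dataset', 'max_seq_length']
const isScopeChange = changes =>
  Object.keys(changes ?? {}).some(k => PROTECTED_KEYS.includes(k.toLowerCase()))

async function attemptWithRetry({ maxAttempts, submit, analyze, applyFix, ctx }) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await submit(attempt, ctx)
    if (res.status === 'success') return res
    if (attempt === maxAttempts) return { stopped: true, stoppedReason: 'retries_exhausted' }
    // Never retry unchanged: diagnose first, then apply a bounded fix.
    const analysis = await analyze(attempt, res, ctx)
    if (analysis.scopeChange || isScopeChange(analysis.configChanges)) {
      return { stopped: true, stoppedReason: 'scope_change' }
    }
    if (analysis.unrecoverable) return { stopped: true, stoppedReason: 'unrecoverable' }
    await applyFix(analysis, ctx)
    // The ladder rung only ever moves up across attempts.
    ctx.oomLadderRung = Math.max(ctx.oomLadderRung, analysis.oomLadderRung ?? 0)
  }
}
```

With a submit stub that fails once and then succeeds, the loop returns the success result on the second attempt and ctx.oomLadderRung holds whatever rung the analysis reported.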
The four adversarial critics
Each is a read-only Explore agent. The first three return a CRITIC_SCHEMA with a mustReblock flag; the final verifier returns a COMPLETION_SCHEMA. They turn "verify, never assume" into a tool-backed second opinion at the riskiest moments:
- Verify Research (Phase 2) — do cited papers/datasets/models exist; are APIs real?
- Format critic (Phase 4) — is the dataset shape compatible with the method, or remappable?
- Code review (Phase 6) — hallucinated imports, local-path leak, missing persistence/monitoring.
- Final verification (finalize) — tool-backed conformance against the spec §9 criteria.
TASK_PROFILES — declarative genericity
One table lets a single pipeline serve every task type without per-type branching. Each profile names the skill, the per-method reference script, the dataset inspector, the eval skill, and the schema rules:
llm: {
skill: 'huggingface-skills:huggingface-llm-trainer',
scriptByMethod: { sft:'train_sft_example.py', dpo:'train_dpo_example.py', grpo:'train_grpo_example.py' },
evalSkill: 'huggingface-skills:huggingface-community-evals',
schemaRules: { sft:'messages | text | prompt+completion', dpo:'prompt, chosen, rejected', grpo:'prompt' },
}
Profiles eval (evalOnly) and data (dataOnly) trigger the
early exits; vision, embedding, and other reuse the same flow.
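A sketch of how a profile lookup might drive the pipeline. Only the llm entry is copied from this document; the eval and data entries are placeholders that carry just the early-exit flags, and the two-argument resolveProfile with its return shape is hypothetical (the script calls resolveProfile(taskType)).

```javascript
const TASK_PROFILES = {
  llm: {
    skill: 'huggingface-skills:huggingface-llm-trainer',
    scriptByMethod: { sft: 'train_sft_example.py', dpo: 'train_dpo_example.py', grpo: 'train_grpo_example.py' },
    schemaRules: { sft: 'messages | text | prompt+completion', dpo: 'prompt, chosen, rejected', grpo: 'prompt' },
  },
  eval: { evalOnly: true }, // placeholder entry
  data: { dataOnly: true }, // placeholder entry
}

// Assumed behavior: unknown task types are reported, never silently defaulted.
function resolveProfile(taskType, method) {
  const profile = TASK_PROFILES[taskType]
  if (!profile) return { found: false }
  return {
    found: true,
    profile,
    referenceScript: profile.scriptByMethod?.[method] ?? null,
    earlyExit: profile.evalOnly ? 'eval' : profile.dataOnly ? 'data' : null,
  }
}

console.log(resolveProfile('llm', 'dpo').referenceScript) // train_dpo_example.py
console.log(resolveProfile('eval').earlyExit)             // eval
```

Adding a task type then means adding a table entry; the phase code keeps reading the same fields.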
FAILURE_TAXONOMY — a closed set of fixes
The analysis prompt injects a closed list of failure categories, each with the only allowed minimal fix — so diagnosis is constrained, not open-ended. A few:
- oom → the R7 ladder in order (preserve effective batch → grad-checkpoint → bigger memory); never reduce seqlen or switch method.
- wrong_trainer_arg → re-check current docs (R11) and correct the kwarg; set recheckedDocs.
- dataset_schema_mismatch → remap columns; if fields genuinely missing → scope-change STOP (R8).
- timeout → raise the timeout (R5); shrinking data/epochs/seqlen to fit is a forbidden scope change.
- divergence_nan → corrective config change (lr×0.1 or per the alert) — iteration, not a crash.
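The closed-set idea can be sketched as a lookup that refuses anything outside the table. The entries below cover only the categories listed above (the script's full taxonomy is not shown), and the allowedFix helper with its return shape is hypothetical.

```javascript
// The R7 ladder in the order given above; rungs are numbered from 1.
const OOM_LADDER = ['preserve effective batch', 'grad-checkpoint', 'bigger memory']

// Partial taxonomy: only the categories this document lists.
const FAILURE_TAXONOMY = {
  oom: 'next rung of the R7 ladder',
  wrong_trainer_arg: 're-check current docs and correct the kwarg',
  dataset_schema_mismatch: 'remap columns',
  timeout: 'raise the timeout',
  divergence_nan: 'corrective config change (lr x 0.1 or per the alert)',
}

function allowedFix(category, currentRung = 0) {
  // Own-property check so inherited names like 'toString' are not accepted.
  if (!Object.prototype.hasOwnProperty.call(FAILURE_TAXONOMY, category)) {
    return { allowed: false, reason: 'category not in taxonomy' }
  }
  if (category !== 'oom') return { allowed: true, fix: FAILURE_TAXONOMY[category] }
  const nextRung = currentRung + 1
  if (nextRung > OOM_LADDER.length) return { allowed: false, reason: 'OOM ladder exhausted' }
  return { allowed: true, fix: OOM_LADDER[nextRung - 1], oomLadderRung: nextRung }
}

console.log(allowedFix('oom', 1)) // { allowed: true, fix: 'grad-checkpoint', oomLadderRung: 2 }
```

Because the only OOM fixes are ladder rungs, reducing sequence length or switching method cannot come out of this lookup at all.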
HARD_RULES + SKILLS_NOTE — the shared preamble
Implementation and analysis prompts are prefixed with a HARD_RULES block that restates R1–R14
and the principles in imperative form (research-grounded, verify-never-assume, no silent scope change, no silent
substitution, persistence, sized timeout/hardware, the OOM ladder, prefer prebuilt, secrets-from-env, direct
URLs), plus a SKILLS_NOTE telling the agent to drive the HF skills and prefer their shipped
reference scripts over writing ML code from memory.
Cost-cap gating — autonomous spend, bounded
Two helpers bound HF spend independently of the harness token budget. Each billable submit is gated:
function canSpend(estUSD){ return (spentUSD + estUSD) <= COST_CAP_USD }
// before GPU preflight and before the full job:
if (!canSpend(est)) return { stoppedReason: 'cost_cap_full_job' } // fully prepared; only spend is gated
recordSpend(actualOrEstimatedUSD) // accrue after submit
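A runnable sketch of the same gate, wrapped in a factory so the example carries its own state. The script keeps spentUSD and COST_CAP_USD as run-level variables; makeBudget and gatedSubmit are hypothetical names, and the cap of 10 is the documented costCapUSD default.

```javascript
function makeBudget(costCapUSD = 10) {
  let spentUSD = 0
  return {
    canSpend: estUSD => spentUSD + estUSD <= costCapUSD,
    recordSpend: usd => { spentUSD += usd },
    spent: () => spentUSD,
  }
}

// Gate a billable submit: check the estimate first, accrue only when the submit goes ahead.
function gatedSubmit(budget, estUSD, phase) {
  if (!budget.canSpend(estUSD)) return { stopped: true, stoppedReason: `cost_cap_${phase}` }
  budget.recordSpend(estUSD)
  return { stopped: false }
}

const budget = makeBudget(10)
console.log(gatedSubmit(budget, 2, 'preflight')) // { stopped: false }
console.log(gatedSubmit(budget, 9, 'full_job'))  // { stopped: true, stoppedReason: 'cost_cap_full_job' }
console.log(budget.spent())                      // 2
```

The refused submit leaves spentUSD untouched, which matches the note that the job is fully prepared and only the spend is gated.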
The 13 forced-output schemas
Every subagent is forced to return a validated object, so the orchestrator branches on fields rather than parsing prose:
| Schema | Phase | Key fields the orchestrator branches on |
|---|---|---|
| PLAN | 0 | isTrivial, taskType, method, baseline |
| FINDER | 1 | findings, sources, fidelity |
| RESEARCH | 1 | recipe[], codePatterns[], citationGraphFidelity |
| CRITIC | 2,4,6 | ok, mustReblock, issues[] |
| RESOURCE | 3 | modelVerified.exists, datasetVerified.exists, needsUserDecision |
| DATA_AUDIT | 4 | formatCompatible, requiredFields, mappingNeeded |
| IMPL | 5 | trainScriptContent, persistenceDest, timeoutHours, monitoringWired, smokeKnobs |
| SMOKE | 7 | status, usedProxyModel |
| JOB | 8,10 | status, healthy, alerts[], lastAlertTimestamp |
| FAILURE_ANALYSIS | 7,8,10 | category, configChanges, scopeChange, oomLadderRung, unrecoverable |
| CHECKLIST | 9 | allSatisfied, missing[] |
| EVAL | 11 | evaluated, confirmedWorks, modelUrl |
| COMPLETION | final | conforms, missing[], artifacts[] |
Where it stops
result.stoppedReason
The workflow is designed to halt and surface to the user rather than push through a
violation. Every stop sets a machine-readable stoppedReason.
| stoppedReason | Phase | Why it stops |
|---|---|---|
| resource_unavailable | 3 | Requested model/dataset missing or unusable — ask, don't substitute (R8). |
| dataset_format_incompatible | 4 | Schema can't satisfy the method and can't be remapped (R8). |
| cpu_smoke_* / gpu_preflight_* / full_job_* | 7,8,10 | Could not be made to pass within retries/rules. |
| scope_change | 7,8,10 | The only viable fix would touch a protected key (R3/R4/R7). Enforced by a pure code guard. |
| cost_cap_preflight / cost_cap_full_job | 8,10 | Estimated spend would exceed COST_CAP_USD. Job is prepared; only spend is gated. |
| submit_invariant_preflight / _full | 8,10 | By-value / timeout / monitoring / no-local-path invariant failed. Enforced by a pure code guard. |
| preflight_checklist_unsatisfied | 9 | The §5.6 pre-flight checklist gate did not pass. |
| planning_failed / implementation_failed | 0,5 | A required subagent returned nothing usable. |
Config & arguments
args = task string, or { … }
Inputs
- task — the request (or pass args as a bare string).
- model, dataset — optional explicit resources.
- hubOrg — namespace for persisted outputs.
- costCapUSD — default 10; the autonomous spend ceiling.
- maxJobRetries — default 3; GPU preflight + full job.
- smokeModel, workDir — the proxy model for CPU smoke; the working directory, default ./ml-run.
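A small sketch of how the two accepted argument forms might be normalized. The normalizeArgs name and the returned shape are assumptions; the defaults are the ones stated above, and smokeModel stays undefined because no default is given for it.

```javascript
function normalizeArgs(args) {
  // A bare string is treated as the task; an object is copied so the caller's value is untouched.
  const given = typeof args === 'string' ? { task: args } : { ...args }
  return {
    task: given.task,
    model: given.model,
    dataset: given.dataset,
    hubOrg: given.hubOrg,
    costCapUSD: given.costCapUSD ?? 10,
    maxJobRetries: given.maxJobRetries ?? 3,
    smokeModel: given.smokeModel,
    workDir: given.workDir ?? './ml-run',
  }
}

console.log(normalizeArgs('fine-tune a small model').costCapUSD)       // 10
console.log(normalizeArgs({ task: 'eval only', costCapUSD: 0 }).costCapUSD) // 0
```

Using ?? rather than || keeps an explicit costCapUSD of 0, which a caller might pass to forbid any paid submit.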
Run-level state
- spentUSD — accrues across billable submits.
- result — accumulates every phase's output, artifacts[], and stoppedReason/conforms.
- baseline — the frozen user intent the scope-change guard compares against.
- SMOKE_RETRIES = 3 — CPU smoke attempts.