ML Research Workflow Script Internals
① Spec explorer ② Script internals ③ Spec → script map

Inside ml-research.js

An executable encoding of the behavioral spec as a deterministic Workflow orchestration. The script owns the control flow, the gates, and the discipline; the actual ML work is delegated to subagents that drive the Hugging Face skills (huggingface-skills:*). It is the same workflow the spec describes — research-first, validate-before-spend, smoke-before-scale, persist-and-verify — rendered as JavaScript that fans out subagents, forces structured output, and enforces the hard rules in code.

It runs on a Workflow harness that shares the local filesystem across subagents but not their context. Scripts pass between local subagents via WORK_DIR; the remote HF Job receives code by value (inline source), never a local path.

13 orchestrated phases
13 forced-output JSON schemas
4 adversarial critics
2 pure code guards
$10 default autonomous cost cap

Scoping decision 1 — autonomous paid submit, up to a cap

Paid HF Jobs auto-submit autonomously up to a dollar cost cap (spec §7.7's autonomous mode). canSpend() gates every billable submit; spentUSD accrues across the run. Scheduled/recurring jobs are out of scope (always human-gated).

Scoping decision 2 — stop at the first verified result

The workflow stops at the first verified result. The spec §5.9 autonomous improvement loop and the grid sweep are omitted. The alert-driven corrective retries that get one good run to complete (divergence → lr×0.1, the OOM ladder, etc.) are kept.

The orchestration model

six ideas that shape every phase

Before the phases, six structural choices explain how the script turns a prose spec into enforceable control flow.

Subagents

Isolated context per step

Each agent() call runs in its own context; the orchestrator only sees the returned object. This is exactly the spec's research-subagent isolation, generalized to every phase.

Structured output

Forced JSON schemas

Every subagent returns a schema-validated object (PLAN_SCHEMA, RESEARCH_SCHEMA, JOB_SCHEMA, …). The orchestrator branches on fields, never on free prose.
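
As a rough illustration of branching on fields rather than prose, the sketch below checks a returned object for required fields before the orchestrator reads it. The helpers requireFields and afterPlan are hypothetical, and the field list is taken from the PLAN row of the schema table in this document; the real PLAN_SCHEMA and the harness's validation mechanism are not shown here.

```javascript
// Hypothetical stand-in for schema enforcement: the real harness validates
// against PLAN_SCHEMA; this only checks that the required fields are present.
const PLAN_REQUIRED = ['isTrivial', 'taskType', 'method', 'baseline']

function requireFields(obj, required) {
  const isObject = obj !== null && typeof obj === 'object'
  const missing = required.filter(k => !isObject || !(k in obj))
  return { ok: missing.length === 0, missing }
}

// The orchestrator reads fields of a validated object; free prose never passes.
function afterPlan(plan) {
  const check = requireFields(plan, PLAN_REQUIRED)
  if (!check.ok) return { stopped: true, stoppedReason: 'planning_failed', missing: check.missing }
  return { stopped: false, taskType: plan.taskType, method: plan.method }
}
```

A bare string such as afterPlan('free prose') fails the check and yields planning_failed, which is the point: there is nothing to parse, only fields to read.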

Pure guards

Code, not self-report

isScopeChange() and assertSubmitInvariant() run in plain JS, independent of what a subagent claims. The model cannot talk its way past them.

Critics

Adversarial verification

Four read-only Explore critics re-check research, data format, code, and final conformance — tool-backed, with a mustReblock flag that can halt the line.

Profiles

Declarative genericity

TASK_PROFILES maps task type → skill, reference script, schema rules. One pipeline serves LLM, vision, embedding, eval-only, and data-only tasks with no branch explosion.

Budgets

Tokens vs. dollars

The harness budget.* tracks tokens; HF compute spend is tracked separately in spentUSD against COST_CAP_USD.

The 13-phase pipeline

meta.phases · top-to-bottom

Annotations on each phase flag fan-out, critics, retry loops, cost gates, and the points where the workflow can STOP and surface to the user. Two free early-exits (eval-only, data-only) branch out after the data audit.

Stage legend: Plan · Research · Validate · Implement · Test / gate · Run · Verify

  • 0 Intake & Plan: classify · freeze baseline · build plan. [trivial? → direct answer] [PLAN_SCHEMA · resolveProfile()]
  • 1 Research: papers · citations · examples · datasets. [⇉ parallel: 4 finders → synthesis (RESEARCH)]
  • 2 Verify Research: do the cited papers / datasets / APIs exist? [critic · mustReblock → one re-research pass]
  • 3 Resources: verify model + dataset · size hardware. [⇉ model + dataset] [⏹ R8 STOP if missing]
  • 4 Data Audit: schema/stats · format ↔ method critic. [format critic] [⏹ STOP if incompatible] [eval-only / data-only early-exit → Evaluate]
  • 5 Implement: adapt reference script → train.py + eval.py. [IMPL_SCHEMA · by-value] [monitoring/persist/timeout]
  • 6 Code Review: hallucinated imports · local-path leak · OOM-ready. [critic → code-fix]
  • 7 CPU Smoke: 1 train + 1 eval step · free / local. [↻ analyze→fix ≤3] [⏹ STOP on scope-change]
  • 8 GPU Preflight: tiny billable job · real model load · CUDA paths. [💲 cost gate · billable] [↻ ≤3 · submit-invariant]
  • 9 Job Readiness: one-time pre-flight checklist gate. [🔒 checklist gate] [⏹ STOP if unsatisfied]
  • 10 Full Job: one-job-first · confirm healthy · monitor alerts. [💲 cost gate] [↻ ≤3 alert-driven]
  • 11 Evaluate & Persist: confirm it works · persisted · dashboard URL. [EVAL_SCHEMA]
  • Final Verification: tool-backed conformance vs. spec §9. [COMPLETION_SCHEMA]

0 Intake & Plan

One agent() call returns a PLAN_SCHEMA object: it classifies the task, picks taskType (llm | vision | embedding | eval | data | other) and method, and freezes the baseline {model, dataset, method, sequenceLength} exactly as the user expressed it — the frozen object the scope-change guard later compares against.

  • isTrivial=true only for a pure factual question / status check / resource lookup → return directAnswer and skip everything (spec's trivial-request branch).
  • resolveProfile(taskType) selects the skill, reference scripts, and schema rules.
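
The trivial branch and the baseline freeze can be sketched as below. This is an illustration only: freezeBaseline and handlePlan are hypothetical helper names, and using Object.freeze is an assumption about how "frozen" might be enforced, not something the script is confirmed to do.

```javascript
// Sketch: field names follow the baseline described above
// ({ model, dataset, method, sequenceLength }); the helpers are hypothetical.
function freezeBaseline({ model, dataset, method, sequenceLength }) {
  // Object.freeze turns later writes into no-ops (or TypeErrors in strict mode).
  return Object.freeze({ model, dataset, method, sequenceLength })
}

function handlePlan(plan) {
  // Trivial requests return a direct answer and skip every later phase.
  if (plan.isTrivial) return { trivial: true, directAnswer: plan.directAnswer }
  return { trivial: false, taskType: plan.taskType, baseline: freezeBaseline(plan.baseline) }
}
```

Freezing a copy rather than the plan's own object means a later phase that mutates its input still cannot alter what the scope-change guard compares against.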
1 Research — parallel fan-out

A parallel([...]) of four read-only Explore finders, each in its own context, each returning a FINDER_SCHEMA object:

  • landmark — anchor papers, read methodology (not abstracts).
  • citations — downstream work; no full citation-graph API, so it uses paper-page links + host WebSearch/WebFetch and sets fidelity="reduced".
  • examples — working code with current APIs, preferring the skill's shipped reference scripts.
  • datasets — which datasets produced the reported results; confirm they load.

A synthesis agent folds the finders into one RESEARCH_SCHEMA: a ranked recipe table, code patterns (correct imports + current trainer args), SOTA landscape, references, and citationGraphFidelity.
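
A minimal sketch of the fan-out and synthesis step follows. Promise.all stands in for the harness's parallel([...]), whose real signature is not shown in this document; runFinder and synthesize are hypothetical placeholders for agent() calls, and the 'full' fidelity value is an assumption (only "reduced" is named above).

```javascript
// Sketch: four finders run concurrently, then one synthesis call folds them.
const FINDERS = ['landmark', 'citations', 'examples', 'datasets']

async function research(runFinder, synthesize) {
  // Each finder has its own context; only the returned objects come back.
  const results = await Promise.all(FINDERS.map(name => runFinder(name)))
  // If any finder worked at reduced fidelity, the synthesis says so.
  const reduced = results.some(r => r.fidelity === 'reduced')
  const synthesis = await synthesize(results)
  return { ...synthesis, citationGraphFidelity: reduced ? 'reduced' : 'full' }
}
```

Because the finders share no context, the synthesis agent is the only place where their outputs meet.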

2 Verify Research — adversarial critic

A read-only Explore critic returns a CRITIC_SCHEMA object. It tool-checks that the cited papers, datasets, and models actually exist and that code patterns use real current APIs. mustReblock=true only for hard problems (a cited resource doesn't exist, a hallucinated import); "is every result truly attributable?" is a soft warn. If blocked, one corrective re-research pass runs.
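
The "block, then one corrective pass" behaviour might look like the sketch below. The names runCritic and reResearch are hypothetical stand-ins for agent() calls. Whether the critic runs again after the corrective pass is not stated above, so this sketch does not re-check.

```javascript
// Sketch: a critic verdict gates the research; a hard block triggers exactly one
// corrective re-research pass, never a loop.
async function verifyResearch(research, runCritic, reResearch) {
  const verdict = await runCritic(research)
  if (!verdict.mustReblock) {
    // Soft warnings travel along in verdict.issues but do not block.
    return { research, verdict, reran: false }
  }
  const corrected = await reResearch(research, verdict.issues)
  return { research: corrected, verdict, reran: true }
}
```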

3 Resources — verify & size

A parallel([model, dataset]) of Explore agents (RESOURCE_SCHEMA). If the user named a model/dataset, it confirms existence and inspects; if not, it evaluates candidates and recommends. It also recommends a hardware flavor sized to the model footprint (R6).

If a requested resource is missing/unusable (needsUserDecision, or !exists) the workflow STOPs with resource_unavailable — it asks the user rather than substituting (R8).
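
A sketch of that stop condition, using the RESOURCE_SCHEMA fields named in this document (modelVerified.exists, datasetVerified.exists, needsUserDecision). The function name checkResources and the missing array are illustrative additions.

```javascript
// Sketch: turn a RESOURCE_SCHEMA-shaped object into a continue/stop decision.
function checkResources(resource) {
  const missing = []
  if (!resource.modelVerified?.exists) missing.push('model')
  if (!resource.datasetVerified?.exists) missing.push('dataset')
  if (resource.needsUserDecision || missing.length > 0) {
    // R8: ask the user rather than silently substituting another resource.
    return { stopped: true, stoppedReason: 'resource_unavailable', missing }
  }
  return { stopped: false }
}
```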
4 Data Audit — format ↔ method

An audit agent (DATA_AUDIT_SCHEMA) inspects schema/columns, splits, distributions, and sample rows, then validates the format against the method using the profile's schemaRules (e.g. DPO needs prompt/chosen/rejected). A format critic confirms compatibility: if the mismatch can be fixed by column remapping, the audit passes with a mapping; otherwise mustReblock → STOP (dataset_format_incompatible, an R8 ask-the-user stop).

Early exits here: profile.evalOnly → evaluate then finalize; profile.dataOnly → build/persist a dataset then finalize. Both skip train/smoke/preflight/full-job.
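
The "compatible, remappable, or stop" decision can be sketched as follows. The required fields for dpo and grpo come from the schemaRules examples in this document; checkFormat, REQUIRED_FIELDS, and the mapping shape are hypothetical, and the sft rule (which lists alternatives) is left out to keep the sketch short.

```javascript
// Sketch: required fields per method; mapping is { requiredField: existingColumn }.
const REQUIRED_FIELDS = { dpo: ['prompt', 'chosen', 'rejected'], grpo: ['prompt'] }

function checkFormat(method, columns, mapping = {}) {
  const required = REQUIRED_FIELDS[method] || []
  const resolved = {}
  const missing = []
  for (const field of required) {
    // Prefer an exact column; otherwise fall back to the proposed remapping.
    const source = columns.includes(field) ? field : mapping[field]
    if (source && columns.includes(source)) resolved[field] = source
    else missing.push(field)
  }
  if (missing.length > 0) {
    return { formatCompatible: false, stoppedReason: 'dataset_format_incompatible', missing }
  }
  const mappingNeeded = Object.entries(resolved).some(([field, source]) => field !== source)
  return { formatCompatible: true, mappingNeeded, mapping: resolved }
}
```

For example, a DPO dataset with columns prompt, preferred, rejected passes with the mapping { chosen: 'preferred' }, while one that has no candidate for chosen stops.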
5 Implement — adapt the reference script

The implementer adapts the profile's shipped reference script (e.g. train_sft_example.py) and writes train.py + eval.py into WORK_DIR, returning an IMPL_SCHEMA whose trainScriptContent is the full inline source (by-value transport). The script must wire trackio metrics + structured alerts, a concrete push_to_hub destination, a sized timeout, the chosen flavor, OOM knobs (batch/accum/grad-checkpoint), and SMOKE knobs (SMOKE=1 → 1 step, tiny slice, CPU+fp32, optional tiny proxy model).

6 Code Review — critic before any run

An Explore critic reads the files and flags hallucinated imports, wrong trainer args, local-path leakage, missing monitoring, a username/... placeholder destination, short timeout, OOM-readiness, and source-build smells (R12). Any error-severity issue sets mustReblock and a code-fix pass edits the files before anything executes.

7 CPU Smoke — free, local

Runs attemptWithRetry (≤3): execute train.py with SMOKE=1 on CPU+fp32 for one train step + one eval step (uv run resolves PEP-723 deps). A high loss is not a failure; an exception/import/arg/schema error is. If the real model can't load on CPU, a tiny proxy model is used (the real load is first exercised on GPU preflight). Each failure → analyze → minimal-fix → retry.

This phase is the script's free realization of "test small before you spend big": it catches import/dep/arg/schema/wiring bugs before any paid GPU job.
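
One way to encode "a high loss is not a failure; a crash is" is sketched below. The input shape ({ exitCode, stderr, loss }) and the import_error and exception labels are assumptions for illustration, not the script's SMOKE_SCHEMA; wrong_trainer_arg is the category name used in the failure taxonomy later in this document.

```javascript
// Sketch: classify a smoke run by how the process ended, never by the loss value.
function classifySmoke({ exitCode, stderr = '', loss }) {
  if (exitCode !== 0) {
    const kind = /ModuleNotFoundError|ImportError/.test(stderr) ? 'import_error'
      : /TypeError.*unexpected keyword/.test(stderr) ? 'wrong_trainer_arg'
      : 'exception'
    return { status: 'failed', kind }
  }
  // One train step and one eval step completed; the loss is reported, not judged.
  return { status: 'success', loss }
}
```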
8 GPU Preflight — tiny billable job

Gated by canSpend(estimatedPreflightUSD) and assertSubmitInvariant() before submitting. Then attemptWithRetry (≤MAX_JOB_RETRIES) submits a tiny GPU job by value on a small flavor, polls logs to a terminal state, and validates the GPU-only code paths CPU could not: CUDA, mixed precision, the real model load, small-scale OOM. recordSpend() accrues the cost.
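
The two gates in front of the preflight submit might be composed as in this sketch. gatePreflight is a hypothetical helper, and the two checks are injected so the example stays self-contained. The order in which the script runs them is not stated above, so cost-first is an assumption.

```javascript
// Sketch: both gates must pass before a billable preflight submit.
function gatePreflight(estUSD, impl, { canSpend, assertSubmitInvariant }) {
  if (!canSpend(estUSD)) return { ok: false, stoppedReason: 'cost_cap_preflight' }
  const inv = assertSubmitInvariant(impl)
  if (!inv.ok) {
    return { ok: false, stoppedReason: 'submit_invariant_preflight', missing: inv.missing }
  }
  return { ok: true } // safe to submit; recordSpend() accrues the cost afterwards
}
```

Both gates are plain functions of state and the IMPL object, so neither depends on what a subagent reports about its own readiness.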

9 Job Readiness — the one-time gate

A single CHECKLIST_SCHEMA agent verifies the spec §5.6 pre-flight checklist: reference impl cited, dataset format verified, GPU smoke ok and the real model loaded on GPU, concrete persistence destination, sized timeout, monitoring wired, by-value transport. If !allSatisfied → STOP (preflight_checklist_unsatisfied).
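
How the orchestrator could act on the checklist result is sketched below. In the script the checklist is produced by an agent; evaluateChecklist and the item keys here are hypothetical and only show how allSatisfied and missing[] relate.

```javascript
// Sketch: derive allSatisfied and missing[] from a map of checklist item -> boolean.
function evaluateChecklist(items) {
  const missing = Object.entries(items).filter(([, ok]) => !ok).map(([name]) => name)
  const allSatisfied = missing.length === 0
  return allSatisfied
    ? { allSatisfied, missing }
    : { allSatisfied, missing, stopped: true, stoppedReason: 'preflight_checklist_unsatisfied' }
}
```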

10 Full Job — one-job-first

Cost-gated and invariant-checked again, then attemptWithRetry (≤MAX_JOB_RETRIES) submits one full job by value, polls logs to confirm healthy (a step advancing / first metric), then monitors trackio alerts (trackio list alerts --json --since <cursor>). An ERROR-level alert (divergence/NaN/OOM) is treated as a failure to analyze and correct — the alert-driven retry the spec's §5.7 calls for. The dashboard URL is recorded as an artifact.
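
A sketch of how one poll's alerts could be folded into the JOB fields (healthy, alerts[], lastAlertTimestamp). The alert shape ({ level, timestamp }) with ISO-8601 string timestamps is an assumption about the JSON that the alerts command returns; digestAlerts is a hypothetical helper.

```javascript
// Sketch: an ERROR-level alert marks the run unhealthy; the cursor advances to the
// newest timestamp seen so the next poll can pass it to --since.
function digestAlerts(alerts, cursor = null) {
  const errors = alerts.filter(a => a.level === 'ERROR')
  // ISO-8601 timestamps in the same timezone compare correctly as strings.
  const newest = alerts.reduce(
    (max, a) => (max === null || a.timestamp > max ? a.timestamp : max),
    cursor
  )
  return { healthy: errors.length === 0, errors, lastAlertTimestamp: newest }
}
```

An empty poll leaves the cursor where it was, so no alert is skipped or read twice across retries.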

11 Evaluate & Persist

An EVAL_SCHEMA agent evaluates the trained model and confirms it actually works (not merely produced), verifies it is persisted on the Hub, and reports the metric+value, eval URL, model URL, and dashboard URL. (For eval-only / data-only tasks, this is reached directly from the early exit.)

Final Verification

finalize() runs a read-only, tool-backed Explore agent (COMPLETION_SCHEMA) that re-checks the spec §9 completion criteria — verify, do not trust prior claims: research preceded implementation, resources verified, smoke-tested, checklist satisfied, alerts emitted, result persisted & evaluated, no scope change vs. the frozen baseline, every artifact a direct URL. Sets result.conforms.

Key mechanisms

the reusable machinery behind the phases

The phases above are thin; most of the discipline lives in a handful of reusable pieces.

Pure code guards — enforced in JS, not by the model

Scope-change guard (R3/R4/R7). A pure function checks whether a proposed fix touches a protected key. Even if a subagent doesn't flag scopeChange, this catches it:

const PROTECTED_KEYS = ['method','model','dataset','max_seq_length', /* …variants… */]
function isScopeChange(configChanges) {
  return Object.keys(configChanges).some(k => PROTECTED_KEYS.includes(k.toLowerCase()))
}
// in the retry loop:
const scope = analysis.scopeChange || isScopeChange(analysis.configChanges)
if (scope) return { stopped: true, stoppedReason: 'scope_change' } // STOP, ask the user
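
A quick usage check of the guard, re-declared here so the snippet runs on its own. The elided key variants stay elided, and the `|| {}` fallback for a missing configChanges is an addition for this example, not part of the original function.

```javascript
// Sketch: the guard is case-insensitive on keys and ignores values entirely.
const PROTECTED_KEYS = ['method', 'model', 'dataset', 'max_seq_length'] // variants elided in the original
function isScopeChange(configChanges) {
  return Object.keys(configChanges || {}).some(k => PROTECTED_KEYS.includes(k.toLowerCase()))
}

console.log(isScopeChange({ learning_rate: 1e-5 }))  // false: not a protected key
console.log(isScopeChange({ Max_Seq_Length: 512 }))  // true: key matches after lowercasing
console.log(isScopeChange(undefined))                // false: nothing proposed
```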

Billable-submit invariant (§5.6). Run before every paid submit; refuses to launch unless the code is present by value, the timeout is sized, monitoring is wired, and no local path leaked:

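function assertSubmitInvariant(impl) {
  // NOTE: the labels pushed to `missing` are illustrative names
  const src = impl.trainScriptContent || ''
  const missing = []
  if (src.length < 50) missing.push('trainScriptContent')              // code must travel by value (inline, >=50 chars)
  if (!impl.timeoutHours) missing.push('timeoutHours')                 // timeout must be sized
  if (!impl.monitoringWired) missing.push('monitoringWired')           // monitoring must be wired
  if (/\/Users\/|\.\/ml-run\//.test(src)) missing.push('noLocalPath')  // no /Users/ or ./ml-run/ path in the source
  return { ok: missing.length === 0, missing }
}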
attemptWithRetry — analyze → minimal-fix → retry

One helper drives CPU smoke, GPU preflight, and the full job. It never retries unchanged (R10): every retry is preceded by a diagnosis and a bounded, scope-safe fix.

for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  const res = await submit(attempt, ctx)
  if (res.status === 'success') return res          // done
  if (attempt === maxAttempts) return { stopped:true, stoppedReason:'retries_exhausted' }
  const analysis = await analyze(attempt, res, ctx)   // FAILURE_ANALYSIS
  const scopeChange = analysis.scopeChange || isScopeChange(analysis.configChanges)  // pure guard, not self-report
  if (scopeChange || analysis.unrecoverable) return { stopped:true }  // STOP
  await applyFix(analysis, ctx)                       // edit train.py
  ctx.oomLadderRung = Math.max(ctx.oomLadderRung, analysis.oomLadderRung)  // R7 ladder carries
}

The carried ctx remembers the OOM ladder rung (so R7 escalates 1→2→3 across attempts) and the alert poll cursor (--since).
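
A self-contained variant of the loop, runnable with stubbed callbacks, is sketched below. The options-object signature, the attempts field, the unrecoverable stop reason, and the reduced protected-key list are assumptions made for the example; submit, analyze, and applyFix stand in for the real phase-specific callbacks.

```javascript
// Sketch: analyze -> scope guard -> minimal fix -> retry, with ctx carried across attempts.
const PROTECTED = ['method', 'model', 'dataset', 'max_seq_length']
const touchesProtected = changes =>
  Object.keys(changes || {}).some(k => PROTECTED.includes(k.toLowerCase()))

async function attemptWithRetry({ submit, analyze, applyFix, maxAttempts = 3 }) {
  const ctx = { oomLadderRung: 0 }
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await submit(attempt, ctx)
    if (res.status === 'success') return { ...res, attempts: attempt }
    if (attempt === maxAttempts) return { stopped: true, stoppedReason: 'retries_exhausted' }
    const analysis = await analyze(attempt, res, ctx)
    if (analysis.scopeChange || touchesProtected(analysis.configChanges)) {
      return { stopped: true, stoppedReason: 'scope_change' }
    }
    if (analysis.unrecoverable) return { stopped: true, stoppedReason: 'unrecoverable' }
    await applyFix(analysis, ctx) // never retry unchanged (R10)
    ctx.oomLadderRung = Math.max(ctx.oomLadderRung, analysis.oomLadderRung || 0)
  }
}
```

The final attempt returns before analysis runs, so no diagnosis is spent on a failure that cannot be retried.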

The four adversarial critics

Each is a read-only Explore agent returning a CRITIC_SCHEMA with a mustReblock flag. They turn "verify, never assume" into a tool-backed second opinion at the riskiest moments:

  • Verify Research (Phase 2) — do cited papers/datasets/models exist; are APIs real?
  • Format critic (Phase 4) — is the dataset shape compatible with the method, or remappable?
  • Code review (Phase 6) — hallucinated imports, local-path leak, missing persistence/monitoring.
  • Final verification (finalize) — tool-backed conformance against the spec §9 criteria.
TASK_PROFILES — declarative genericity

One table lets a single pipeline serve every task type without per-type branching. Each profile names the skill, the per-method reference script, the dataset inspector, the eval skill, and the schema rules:

llm: {
  skill: 'huggingface-skills:huggingface-llm-trainer',
  scriptByMethod: { sft:'train_sft_example.py', dpo:'train_dpo_example.py', grpo:'train_grpo_example.py' },
  evalSkill: 'huggingface-skills:huggingface-community-evals',
  schemaRules: { sft:'messages | text | prompt+completion', dpo:'prompt, chosen, rejected', grpo:'prompt' },
}

Profiles eval (evalOnly) and data (dataOnly) trigger the early exits; vision, embedding, and other reuse the same flow.
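
Profile lookup and early-exit routing might look like this sketch. Only the llm entry mirrors the profile shown above (trimmed); the contents of the eval, data, and other entries, the fallback to other for unlisted types, and the route labels are assumptions. The vision and embedding profiles are omitted for brevity.

```javascript
// Sketch: a declarative table plus two tiny functions replaces per-type branching.
const TASK_PROFILES = {
  llm: {
    skill: 'huggingface-skills:huggingface-llm-trainer',
    scriptByMethod: { sft: 'train_sft_example.py', dpo: 'train_dpo_example.py', grpo: 'train_grpo_example.py' },
  },
  eval: { evalOnly: true },
  data: { dataOnly: true },
  other: {},
}

function resolveProfile(taskType) {
  // hasOwnProperty avoids matching inherited keys such as "constructor".
  const known = Object.prototype.hasOwnProperty.call(TASK_PROFILES, taskType)
  return known ? TASK_PROFILES[taskType] : TASK_PROFILES.other
}

function route(profile) {
  if (profile.evalOnly) return 'evaluate_then_finalize'
  if (profile.dataOnly) return 'build_dataset_then_finalize'
  return 'full_pipeline'
}
```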

FAILURE_TAXONOMY — a closed set of fixes

The analysis prompt injects a closed list of failure categories, each with the only allowed minimal fix — so diagnosis is constrained, not open-ended. A few:

  • oom → the R7 ladder in order (preserve effective batch → grad-checkpoint → bigger memory); never reduce seqlen or switch method.
  • wrong_trainer_arg → re-check current docs (R11) and correct the kwarg; set recheckedDocs.
  • dataset_schema_mismatch → remap columns; if fields genuinely missing → scope-change STOP (R8).
  • timeout → raise the timeout (R5); shrinking data/epochs/seqlen to fit is a forbidden scope change.
  • divergence_nan → corrective config change (lr×0.1 or per the alert) — iteration, not a crash.
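
A closed taxonomy can be sketched as a lookup that refuses anything outside the table. The categories and fixes paraphrase the list above; the object layout, allowedFix, and treating an unknown category or an exhausted ladder as unrecoverable are assumptions for illustration.

```javascript
// Sketch: each category maps to its only allowed fix; oom walks the R7 ladder in order.
const FAILURE_TAXONOMY = {
  oom: 'apply the next R7 ladder rung',
  wrong_trainer_arg: 're-check current docs and correct the kwarg',
  dataset_schema_mismatch: 'remap columns',
  timeout: 'raise the timeout',
  divergence_nan: 'corrective config change (lr x0.1 or per the alert)',
}
const OOM_LADDER = ['preserve effective batch', 'grad-checkpoint', 'bigger memory']

function allowedFix(category, ctx) {
  if (!Object.prototype.hasOwnProperty.call(FAILURE_TAXONOMY, category)) {
    return { unrecoverable: true } // outside the closed set: no improvised fixes
  }
  if (category !== 'oom') return { fix: FAILURE_TAXONOMY[category] }
  const rung = ctx.oomLadderRung + 1
  if (rung > OOM_LADDER.length) return { unrecoverable: true } // ladder exhausted
  return { fix: OOM_LADDER[rung - 1], oomLadderRung: rung }
}
```

With ctx.oomLadderRung carried across attempts, three successive OOM failures walk the rungs 1, 2, 3 and a fourth has nowhere left to go.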
HARD_RULES + SKILLS_NOTE — the shared preamble

Implementation and analysis prompts are prefixed with a HARD_RULES block that restates R1–R14 and the principles in imperative form (research-grounded, verify-never-assume, no silent scope change, no silent substitution, persistence, sized timeout/hardware, the OOM ladder, prefer prebuilt, secrets-from-env, direct URLs), plus a SKILLS_NOTE telling the agent to drive the HF skills and prefer their shipped reference scripts over writing ML code from memory.

Cost-cap gating — autonomous spend, bounded

Two helpers bound HF spend independently of the harness token budget. Each billable submit is gated:

function canSpend(estUSD){ return (spentUSD + estUSD) <= COST_CAP_USD }
// before GPU preflight and before the full job:
if (!canSpend(est)) return { stoppedReason: 'cost_cap_full_job' }  // fully prepared; only spend is gated
recordSpend(actualOrEstimatedUSD)  // accrue after submit
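
A worked example of the arithmetic, using a closure so the running total cannot be changed from outside. makeSpendLedger is a hypothetical wrapper around the two helpers named above, and the dollar figures are made up for the example.

```javascript
// Sketch: a spend ledger with the $10 default cap.
function makeSpendLedger(capUSD = 10) {
  let spentUSD = 0
  return {
    canSpend: estUSD => spentUSD + estUSD <= capUSD,
    recordSpend: usd => { spentUSD += usd; return spentUSD },
    spent: () => spentUSD,
  }
}

const ledger = makeSpendLedger(10)
console.log(ledger.canSpend(1.5))  // true: 0 + 1.5 <= 10
ledger.recordSpend(1.5)            // preflight ran
console.log(ledger.canSpend(9))    // false: 1.5 + 9 = 10.5 > 10
console.log(ledger.canSpend(8.5))  // true: 1.5 + 8.5 = 10, exactly at the cap
```

The comparison is inclusive, so a job estimated to land exactly on the cap is still allowed.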
The 13 forced-output schemas

Every subagent is forced to return a validated object, so the orchestrator branches on fields rather than parsing prose:

| Schema | Phase | Key fields the orchestrator branches on |
| --- | --- | --- |
| PLAN | 0 | isTrivial, taskType, method, baseline |
| FINDER | 1 | findings, sources, fidelity |
| RESEARCH | 1 | recipe[], codePatterns[], citationGraphFidelity |
| CRITIC | 2, 4, 6 | ok, mustReblock, issues[] |
| RESOURCE | 3 | modelVerified.exists, datasetVerified.exists, needsUserDecision |
| DATA_AUDIT | 4 | formatCompatible, requiredFields, mappingNeeded |
| IMPL | 5 | trainScriptContent, persistenceDest, timeoutHours, monitoringWired, smokeKnobs |
| SMOKE | 7 | status, usedProxyModel |
| JOB | 8, 10 | status, healthy, alerts[], lastAlertTimestamp |
| FAILURE_ANALYSIS | 7, 8, 10 | category, configChanges, scopeChange, oomLadderRung, unrecoverable |
| CHECKLIST | 9 | allSatisfied, missing[] |
| EVAL | 11 | evaluated, confirmedWorks, modelUrl |
| COMPLETION | final | conforms, missing[], artifacts[] |

Where it stops

result.stoppedReason

The workflow is designed to halt and surface to the user rather than push through a violation. Every stop sets a machine-readable stoppedReason.

| stoppedReason | Phase | Why it stops |
| --- | --- | --- |
| resource_unavailable | 3 | Requested model/dataset missing or unusable: ask, don't substitute (R8). [no-substitute] |
| dataset_format_incompatible | 4 | Schema can't satisfy the method and can't be remapped (R8). |
| cpu_smoke_* / gpu_preflight_* / full_job_* | 7, 8, 10 | Could not be made to pass within retries/rules. |
| scope_change | 7, 8, 10 | The only viable fix would touch a protected key (R3/R4/R7). [pure guard] |
| cost_cap_preflight / cost_cap_full_job | 8, 10 | Estimated spend would exceed COST_CAP_USD. Job is prepared; only spend is gated. |
| submit_invariant_preflight / _full | 8, 10 | By-value / timeout / monitoring / no-local-path invariant failed. [pure guard] |
| preflight_checklist_unsatisfied | 9 | The §5.6 pre-flight checklist gate did not pass. |
| planning_failed / implementation_failed | 0, 5 | A required subagent returned nothing usable. |

Config & arguments

args = task string, or { … }

Inputs

  • task — the request (or pass args as a bare string).
  • model, dataset — optional explicit resources.
  • hubOrg — namespace for persisted outputs.
  • costCapUSD — default 10; the autonomous spend ceiling.
  • maxJobRetries — default 3; GPU preflight + full job.
  • smokeModel: the tiny proxy model used when the real model can't load on CPU.
  • workDir: the shared working directory (WORK_DIR); default ./ml-run.
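
Accepting either a bare task string or an options object could be handled as in this sketch. normalizeArgs is a hypothetical helper; the defaults are the ones listed above, and dropping undefined values before merging is a choice made for the example.

```javascript
// Sketch: normalize args into one options object with documented defaults.
function normalizeArgs(args) {
  const given = typeof args === 'string' ? { task: args } : { ...(args || {}) }
  // Drop undefined values so they cannot overwrite a default.
  const defined = Object.fromEntries(Object.entries(given).filter(([, v]) => v !== undefined))
  return { costCapUSD: 10, maxJobRetries: 3, workDir: './ml-run', ...defined }
}
```

Both normalizeArgs('fine-tune a model on my dataset') and normalizeArgs({ task: '...', costCapUSD: 25 }) then yield the same shape for the rest of the script to read.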

Run-level state

  • spentUSD — accrues across billable submits.
  • result — accumulates every phase's output, artifacts[], and stoppedReason / conforms.
  • baseline — the frozen user intent the scope-change guard compares against.
  • SMOKE_RETRIES = 3 — CPU smoke attempts.
For a guided view of how each spec section maps onto these phases and mechanisms, open ③ Spec → script map.