Inside `ml-research.js`

An executable encoding of the behavioral spec as a deterministic Workflow orchestration. The script owns the control flow, the gates, and the discipline; the actual ML work is delegated to subagents that drive the Hugging Face skills (huggingface-skills:*). It is the same workflow the spec describes — research-first, validate-before-spend, smoke-before-scale, persist-and-verify — rendered as JavaScript that fans out subagents, forces structured output, and enforces the hard rules in code.

It runs on a Workflow harness that shares the local filesystem across subagents but not their context. Scripts pass between local subagents via WORK_DIR; the remote HF Job receives code by value (inline source), never a local path.

13 orchestrated phases

13 forced-output JSON schemas

4 adversarial critics

2 pure code guards

$10 default autonomous cost cap

Scoping decision 1 — autonomous paid submit, up to a cap

Paid HF Jobs auto-submit autonomously up to a dollar cost cap (spec §7.7's autonomous mode). canSpend() gates every billable submit; spentUSD accrues across the run. Scheduled/recurring jobs are out of scope (always human-gated).

Scoping decision 2 — stop at the first verified result

The workflow stops at the first verified result. The spec §5.9 autonomous improvement loop and the grid sweep are omitted. The alert-driven corrective retries that get one good run to complete (divergence → lr×0.1, the OOM ladder, etc.) are kept.

The orchestration model

six ideas that shape every phase

Before the phases, six structural choices explain how the script turns a prose spec into enforceable control flow.

Subagents

Isolated context per step

Each agent() call runs in its own context; the orchestrator only sees the returned object. This is exactly the spec's research-subagent isolation, generalized to every phase.

Structured output

Forced JSON schemas

Every subagent returns a schema-validated object (PLAN_SCHEMA, RESEARCH_SCHEMA, JOB_SCHEMA, …). The orchestrator branches on fields, never on free prose.

Pure guards

Code, not self-report

isScopeChange() and assertSubmitInvariant() run in plain JS, independent of what a subagent claims. The model cannot talk its way past them.

Critics

Adversarial verification

Four read-only Explore critics re-check research, data format, code, and final conformance — tool-backed, with a mustReblock flag that can halt the line.

Profiles

Declarative genericity

TASK_PROFILES maps task type → skill, reference script, schema rules. One pipeline serves LLM, vision, embedding, eval-only, and data-only tasks with no branch explosion.

Budgets

Tokens vs. dollars

The harness budget.* tracks tokens; HF compute spend is tracked separately in spentUSD against COST_CAP_USD.

The 13-phase pipeline

meta.phases · top-to-bottom

Annotations on each phase flag fan-out, critics, retry loops, cost gates, and the points where the workflow can STOP and surface to the user. Two free early exits (eval-only, data-only) branch out after the data audit.

Phase categories: Plan · Research · Validate · Implement · Test / gate · Run · Verify

0 Intake & Plan

One agent() call returns a PLAN_SCHEMA object: it classifies the task, picks taskType (llm | vision | embedding | eval | data | other) and method, and freezes the baseline {model, dataset, method, sequenceLength} exactly as the user expressed it — the frozen object the scope-change guard later compares against.

isTrivial=true only for a pure factual question / status check / resource lookup → return directAnswer and skip everything (spec's trivial-request branch).
resolveProfile(taskType) selects the skill, reference scripts, and schema rules.

PLAN_SCHEMA TASK_PROFILES

1 Research — parallel fan-out

A parallel([...]) of four read-only Explore finders, each in its own context, each returning a FINDER_SCHEMA object:

landmark — anchor papers, read methodology (not abstracts).
citations — downstream work; no full citation-graph API, so it uses paper-page links + host WebSearch/WebFetch and sets fidelity="reduced".
examples — working code with current APIs, preferring the skill's shipped reference scripts.
datasets — which datasets produced the reported results; confirm they load.

A synthesis agent folds the finders into one RESEARCH_SCHEMA: a ranked recipe table, code patterns (correct imports + current trainer args), SOTA landscape, references, and citationGraphFidelity.

subagent isolation FINDER / RESEARCH schemas

2 Verify Research — adversarial critic

A read-only Explore critic returns a CRITIC_SCHEMA object. It tool-checks that the cited papers, datasets, and models actually exist and that code patterns use real current APIs. mustReblock=true is reserved for hard problems (a cited resource doesn't exist, a hallucinated import); "is every result truly attributable?" is only a soft warning. If blocked, one corrective re-research pass runs.

the four critics

3 Resources — verify & size

A parallel([model, dataset]) of Explore agents (RESOURCE_SCHEMA). If the user named a model/dataset, it confirms existence and inspects; if not, it evaluates candidates and recommends. It also recommends a hardware flavor sized to the model footprint (R6).

If a requested resource is missing/unusable (needsUserDecision, or !exists) the workflow STOPs with resource_unavailable — it asks the user rather than substituting (R8).

4 Data Audit — format ↔ method

An audit agent (DATA_AUDIT_SCHEMA) inspects schema/columns, splits, distributions, and sample rows, then validates the format against the method using the profile's schemaRules (e.g. DPO needs prompt/chosen/rejected). A format critic confirms compatibility; if it can be fixed by column remapping it passes with a mapping, otherwise mustReblock → STOP (dataset_format_incompatible, an R8 ask-the-user stop).
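
The format check described above can be sketched as a pure function. This is a minimal illustration, not the script's actual code: `checkFormat`, its arguments, and the shape of `mapping` are assumptions; only the DPO field names come from the profile's schemaRules.

```javascript
// Hypothetical helper: decide whether dataset columns satisfy a method's
// required fields, optionally through a column remapping.
function checkFormat(columns, requiredFields, mapping = {}) {
  const available = new Set(columns)
  const resolved = {}
  const missing = []
  for (const field of requiredFields) {
    if (available.has(field)) {
      resolved[field] = field            // column already has the right name
    } else if (mapping[field] && available.has(mapping[field])) {
      resolved[field] = mapping[field]   // fixable by renaming a column
    } else {
      missing.push(field)                // genuinely absent: not remappable
    }
  }
  const mappingNeeded = Object.entries(resolved).some(([f, c]) => f !== c)
  return { formatCompatible: missing.length === 0, mappingNeeded, resolved, missing }
}

const dpoFields = ['prompt', 'chosen', 'rejected']
// Remappable: the prompt lives in a column called "question".
const remappable = checkFormat(['question', 'chosen', 'rejected'], dpoFields, { prompt: 'question' })
// Incompatible: nothing can be renamed into the required fields.
const incompatible = checkFormat(['text'], dpoFields)
```

In this sketch the first call would pass with a mapping, while the second corresponds to the mustReblock case that ends in a dataset_format_incompatible stop.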

Early exits here: profile.evalOnly → evaluate then finalize; profile.dataOnly → build/persist a dataset then finalize. Both skip train/smoke/preflight/full-job.

5 Implement — adapt the reference script

The implementer adapts the profile's shipped reference script (e.g. train_sft_example.py) and writes train.py + eval.py into WORK_DIR, returning an IMPL_SCHEMA whose trainScriptContent is the full inline source (by-value transport). The script must wire trackio metrics + structured alerts, a concrete push_to_hub destination, a sized timeout, the chosen flavor, OOM knobs (batch/accum/grad-checkpoint), and SMOKE knobs (SMOKE=1 → 1 step, tiny slice, CPU+fp32, optional tiny proxy model).

IMPL_SCHEMA HARD_RULES preamble

6 Code Review — critic before any run

An Explore critic reads the files and flags hallucinated imports, wrong trainer args, local-path leakage, missing monitoring, a username/... placeholder destination, short timeout, OOM-readiness, and source-build smells (R12). Any error-severity issue sets mustReblock and a code-fix pass edits the files before anything executes.

7 CPU Smoke — free, local

Runs attemptWithRetry (≤3): execute train.py with SMOKE=1 on CPU+fp32 for one train step + one eval step (uv run resolves PEP-723 deps). A high loss is not a failure; an exception/import/arg/schema error is. If the real model can't load on CPU, a tiny proxy model is used (the real load is first exercised on GPU preflight). Each failure → analyze → minimal-fix → retry.

This phase is the script's free realization of "test small before you spend big": it catches import/dep/arg/schema/wiring bugs before any paid GPU job.
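
The pass/fail rule for the smoke run (a high loss passes, a crash does not) can be sketched as a small classifier. The function name and the `exitCode`, `errorType`, and `loss` fields are assumptions made for illustration; the real SMOKE_SCHEMA may be shaped differently.

```javascript
// Hypothetical classifier for a smoke-run outcome.
// Error kinds mirror the ones named in the text: exception/import/arg/schema.
const SMOKE_FAILURE_TYPES = ['exception', 'import', 'arg', 'schema']

function classifySmoke(run) {
  if (run.exitCode !== 0 || SMOKE_FAILURE_TYPES.includes(run.errorType)) {
    return { status: 'failure', reason: run.errorType || 'exception' }
  }
  // A large loss after a single step is expected and deliberately ignored.
  return { status: 'success', loss: run.loss }
}

const highLoss = classifySmoke({ exitCode: 0, loss: 11.7 })
const badImport = classifySmoke({ exitCode: 1, errorType: 'import' })
```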

attemptWithRetry FAILURE_TAXONOMY scope-change guard

8 GPU Preflight — tiny billable job

Gated by canSpend(estimatedPreflightUSD) and assertSubmitInvariant() before submitting. Then attemptWithRetry (≤MAX_JOB_RETRIES) submits a tiny GPU job by value on a small flavor, polls logs to a terminal state, and validates the GPU-only code paths CPU could not: CUDA, mixed precision, the real model load, small-scale OOM. recordSpend() accrues the cost.

cost-cap gate submit invariant retry loop

9 Job Readiness — the one-time gate

A single CHECKLIST_SCHEMA agent verifies the spec §5.6 pre-flight checklist: reference impl cited, dataset format verified, GPU smoke ok and the real model loaded on GPU, concrete persistence destination, sized timeout, monitoring wired, by-value transport. If !allSatisfied → STOP (preflight_checklist_unsatisfied).

10 Full Job — one-job-first

Cost-gated and invariant-checked again, then attemptWithRetry (≤MAX_JOB_RETRIES) submits one full job by value, polls logs to confirm the job is healthy (a step advancing / first metric), then monitors trackio alerts (trackio list alerts --json --since <cursor>). An ERROR-level alert (divergence/NaN/OOM) is treated as a failure to analyze and correct; this is the alert-driven retry that spec §5.7 calls for. The dashboard URL is recorded as an artifact.
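
How an alert poll could feed the retry loop is sketched below. The alert shape (`level`, `title`, `timestamp`) and the `scanAlerts` helper are assumptions; the actual JSON emitted by trackio may use different field names.

```javascript
// Hypothetical alert scan: find the first ERROR-level alert newer than the
// cursor and compute the cursor for the next poll.
function scanAlerts(alerts, cursor = null) {
  // ISO-8601 UTC strings in one format compare correctly as plain strings.
  const fresh = alerts.filter(a => cursor === null || a.timestamp > cursor)
  const error = fresh.find(a => a.level === 'ERROR') || null
  // Advance the cursor to the newest alert seen so the next poll skips it.
  const nextCursor = fresh.reduce(
    (max, a) => (max === null || a.timestamp > max ? a.timestamp : max),
    cursor
  )
  return { failed: error !== null, error, nextCursor }
}

const alerts = [
  { level: 'WARN', title: 'slow step', timestamp: '2024-01-01T00:01:00Z' },
  { level: 'ERROR', title: 'loss is NaN', timestamp: '2024-01-01T00:02:00Z' },
]
const firstPoll = scanAlerts(alerts)                         // sees the ERROR alert
const secondPoll = scanAlerts(alerts, firstPoll.nextCursor)  // nothing new
```

Carrying `nextCursor` between attempts plays the role of the `--since` cursor, so an alert that already triggered a corrective retry is not counted twice.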

cost-cap gate alert-driven retry

11 Evaluate & Persist

An EVAL_SCHEMA agent evaluates the trained model and confirms it actually works (not merely produced), verifies it is persisted on the Hub, and reports the metric+value, eval URL, model URL, and dashboard URL. (For eval-only / data-only tasks, this is reached directly from the early exit.)

✓ Final Verification

finalize() runs a read-only, tool-backed Explore agent (COMPLETION_SCHEMA) that re-checks the spec §9 completion criteria — verify, do not trust prior claims: research preceded implementation, resources verified, smoke-tested, checklist satisfied, alerts emitted, result persisted & evaluated, no scope change vs. the frozen baseline, every artifact a direct URL. Sets result.conforms.

Key mechanisms

the reusable machinery behind the phases

The phases above are thin; most of the discipline lives in a handful of reusable pieces.

Pure code guards — enforced in JS, not by the model

Scope-change guard (R3/R4/R7). A pure function checks whether a proposed fix touches a protected key. Even if a subagent doesn't flag scopeChange, this catches it:

const PROTECTED_KEYS = ['method','model','dataset','max_seq_length', /* …variants… */]
function isScopeChange(configChanges) {
  return Object.keys(configChanges).some(k => PROTECTED_KEYS.includes(k.toLowerCase()))
}
// in the retry loop:
const scope = analysis.scopeChange || isScopeChange(analysis.configChanges)
if (scope) return { stopped: true, stoppedReason: 'scope_change' } // STOP, ask the user
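
A short usage example may make the guard's behavior concrete. It is a self-contained copy restricted to the four keys the excerpt lists explicitly (the elided variants are left out), and the sample `configChanges` objects are invented for illustration.

```javascript
// Self-contained copy of the guard with only the explicitly listed keys.
const PROTECTED_KEYS = ['method', 'model', 'dataset', 'max_seq_length']

function isScopeChange(configChanges) {
  return Object.keys(configChanges).some(k => PROTECTED_KEYS.includes(k.toLowerCase()))
}

// A learning-rate tweak is a permitted corrective fix...
const lrFix = isScopeChange({ learning_rate: 2e-6 })

// ...but shortening the sequence length is a scope change, even when the
// subagent's own analysis reports scopeChange: false.
const analysis = { scopeChange: false, configChanges: { max_seq_length: 512 } }
const stop = analysis.scopeChange || isScopeChange(analysis.configChanges)
```

Here `lrFix` is false and `stop` is true: the second case halts with scope_change regardless of what the subagent claimed.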

Billable-submit invariant (§5.6). Run before every paid submit; refuses to launch unless the code is present by value, the timeout is sized, monitoring is wired, and no local path leaked:

function assertSubmitInvariant(impl) {
  // missing unless: trainScriptContent (inline, >=50 chars), timeoutHours,
  // monitoringWired, and NO /Users/ or ./ml-run/ path in the script source
  return { ok: missing.length === 0, missing }
}
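
The body of the invariant is elided above. One way it might be filled in, following the four conditions in the comment, is sketched here; the labels pushed into `missing` (including `localPathLeak`) are hypothetical, and the sample inputs are invented.

```javascript
// Sketch of the elided body, under the assumptions stated above.
function assertSubmitInvariant(impl) {
  const src = typeof impl.trainScriptContent === 'string' ? impl.trainScriptContent : ''
  const missing = []
  if (src.length < 50) missing.push('trainScriptContent')      // code must travel by value
  if (!(impl.timeoutHours > 0)) missing.push('timeoutHours')   // timeout must be sized
  if (!impl.monitoringWired) missing.push('monitoringWired')   // monitoring must be wired
  if (src.includes('/Users/') || src.includes('./ml-run/')) {
    missing.push('localPathLeak')                              // no local path in the source
  }
  return { ok: missing.length === 0, missing }
}

const good = assertSubmitInvariant({
  trainScriptContent: 'import trackio\n' + '# training code placeholder\n'.repeat(3),
  timeoutHours: 2,
  monitoringWired: true,
})
const bad = assertSubmitInvariant({
  trainScriptContent: 'open("/Users/me/data.json")  # padded to pass the length check ........',
  timeoutHours: 2,
  monitoringWired: true,
})
```

Because the function only inspects the IMPL object, it needs no model call and cannot be bypassed by a subagent's self-report.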

attemptWithRetry — analyze → minimal-fix → retry

One helper drives CPU smoke, GPU preflight, and the full job. It never retries unchanged (R10): every retry is preceded by a diagnosis and a bounded, scope-safe fix.

for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  const res = await submit(attempt, ctx)
  if (res.status === 'success') return res          // done
  if (attempt === maxAttempts) return { stopped:true, stoppedReason:'retries_exhausted' }
  const analysis = await analyze(attempt, res, ctx)   // FAILURE_ANALYSIS
  const scope = analysis.scopeChange || isScopeChange(analysis.configChanges)  // self-report OR pure guard
  if (scope || analysis.unrecoverable) return { stopped:true }  // STOP
  await applyFix(analysis, ctx)                       // edit train.py
  ctx.oomLadderRung = Math.max(ctx.oomLadderRung, analysis.oomLadderRung)  // R7 ladder carries
}

The carried ctx remembers the OOM ladder rung (so R7 escalates 1→2→3 across attempts) and the alert poll cursor (--since).

The four adversarial critics

Each is a read-only Explore agent. The first three return a CRITIC_SCHEMA with a mustReblock flag; final verification returns a COMPLETION_SCHEMA. They turn "verify, never assume" into a tool-backed second opinion at the riskiest moments:

Verify Research (Phase 2) — do cited papers/datasets/models exist; are APIs real?
Format critic (Phase 4) — is the dataset shape compatible with the method, or remappable?
Code review (Phase 6) — hallucinated imports, local-path leak, missing persistence/monitoring.
Final verification (finalize) — tool-backed conformance against the spec §9 criteria.

TASK_PROFILES — declarative genericity

One table lets a single pipeline serve every task type without per-type branching. Each profile names the skill, the per-method reference script, the dataset inspector, the eval skill, and the schema rules:

llm: {
  skill: 'huggingface-skills:huggingface-llm-trainer',
  scriptByMethod: { sft:'train_sft_example.py', dpo:'train_dpo_example.py', grpo:'train_grpo_example.py' },
  evalSkill: 'huggingface-skills:huggingface-community-evals',
  schemaRules: { sft:'messages | text | prompt+completion', dpo:'prompt, chosen, rejected', grpo:'prompt' },
}

Profiles eval (evalOnly) and data (dataOnly) trigger the early exits; vision, embedding, and other reuse the same flow.
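
A sketch of how profile lookup could drive the early exits follows. The llm entry is abridged from the excerpt above; the eval and data entries carry only the early-exit flags, and `routeAfterAudit`, its return labels, and the empty-profile fallback are assumptions rather than the script's real code.

```javascript
// Abridged, partly hypothetical profile table.
const TASK_PROFILES = {
  llm: {
    skill: 'huggingface-skills:huggingface-llm-trainer',
    scriptByMethod: { sft: 'train_sft_example.py', dpo: 'train_dpo_example.py', grpo: 'train_grpo_example.py' },
    schemaRules: { sft: 'messages | text | prompt+completion', dpo: 'prompt, chosen, rejected', grpo: 'prompt' },
  },
  eval: { evalOnly: true },
  data: { dataOnly: true },
}

// Assumed behavior: task types without an entry get an empty profile.
function resolveProfile(taskType) {
  const known = Object.prototype.hasOwnProperty.call(TASK_PROFILES, taskType)
  return known ? TASK_PROFILES[taskType] : {}
}

// Pick the branch the pipeline takes after the data audit.
function routeAfterAudit(taskType) {
  const profile = resolveProfile(taskType)
  if (profile.evalOnly) return 'evaluate_then_finalize'
  if (profile.dataOnly) return 'build_dataset_then_finalize'
  return 'implement'
}
```

The routing reads two flags from data, so adding a task type means adding a table entry rather than another branch.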

FAILURE_TAXONOMY — a closed set of fixes

The analysis prompt injects a closed list of failure categories, each with the only allowed minimal fix — so diagnosis is constrained, not open-ended. A few:

oom → the R7 ladder in order (preserve effective batch → grad-checkpoint → bigger memory); never reduce seqlen or switch method.
wrong_trainer_arg → re-check current docs (R11) and correct the kwarg; set recheckedDocs.
dataset_schema_mismatch → remap columns; if fields genuinely missing → scope-change STOP (R8).
timeout → raise the timeout (R5); shrinking data/epochs/seqlen to fit is a forbidden scope change.
divergence_nan → corrective config change (lr×0.1 or per the alert) — iteration, not a crash.
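
The idea of a closed taxonomy can be illustrated with a frozen lookup table that rejects anything outside it. This sketch covers only the five categories listed above (the real list is longer), paraphrases each fix, and uses a hypothetical object shape and helper names.

```javascript
// Hypothetical encoding of the five categories listed above.
const FAILURE_TAXONOMY = Object.freeze({
  oom: 'apply the next R7 ladder rung; never reduce seqlen or switch method',
  wrong_trainer_arg: 're-check current docs and correct the kwarg',
  dataset_schema_mismatch: 'remap columns; stop if fields are genuinely missing',
  timeout: 'raise the timeout; never shrink data, epochs, or seqlen',
  divergence_nan: 'corrective config change such as lr x 0.1',
})

// R7 ladder order, as listed above.
const OOM_LADDER = ['preserve_effective_batch', 'grad_checkpoint', 'bigger_memory']

// Only categories in the table have an allowed fix; anything else is rejected.
function allowedFix(category) {
  if (!Object.prototype.hasOwnProperty.call(FAILURE_TAXONOMY, category)) {
    throw new Error(`unknown failure category: ${category}`)
  }
  return FAILURE_TAXONOMY[category]
}

// Escalate one rung per OOM; rungs are 1-based and capped at the last rung.
function nextOomRung(currentRung) {
  return Math.min(currentRung + 1, OOM_LADDER.length)
}
```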

HARD_RULES + SKILLS_NOTE — the shared preamble

Implementation and analysis prompts are prefixed with a HARD_RULES block that restates R1–R14 and the principles in imperative form (research-grounded, verify-never-assume, no silent scope change, no silent substitution, persistence, sized timeout/hardware, the OOM ladder, prefer prebuilt, secrets-from-env, direct URLs), plus a SKILLS_NOTE telling the agent to drive the HF skills and prefer their shipped reference scripts over writing ML code from memory.

Cost-cap gating — autonomous spend, bounded

Two helpers bound HF spend independently of the harness token budget. Each billable submit is gated:

function canSpend(estUSD){ return (spentUSD + estUSD) <= COST_CAP_USD }
// before GPU preflight and before the full job:
if (!canSpend(est)) return { stoppedReason: 'cost_cap_full_job' }  // fully prepared; only spend is gated
recordSpend(actualOrEstimatedUSD)  // accrue after submit
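
A runnable sketch of the two helpers working together is given below. `makeSpendLedger` is a hypothetical wrapper introduced so the example is self-contained (the script itself keeps spentUSD and COST_CAP_USD as run-level state), and the dollar amounts are invented.

```javascript
// Sketch of a spend ledger bounded by a cost cap.
function makeSpendLedger(costCapUSD = 10) {
  let spentUSD = 0
  return {
    canSpend: estUSD => spentUSD + estUSD <= costCapUSD,
    recordSpend: usd => { spentUSD += usd; return spentUSD },
    spent: () => spentUSD,
  }
}

const ledger = makeSpendLedger(10)
const preflightOk = ledger.canSpend(0.5)   // tiny preflight fits under the cap
ledger.recordSpend(0.5)                    // accrue after the submit
const fullJobOk = ledger.canSpend(12)      // 0.5 + 12 > 10, so the full job is gated
const stop = fullJobOk ? null : { stopped: true, stoppedReason: 'cost_cap_full_job' }
```

The gate compares accumulated plus estimated spend against the cap, so a job that would fit on its own can still be refused late in a run.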

The 13 forced-output schemas

Every subagent is forced to return a validated object, so the orchestrator branches on fields rather than parsing prose:

Schema	Phase	Key fields the orchestrator branches on
PLAN	0	`isTrivial`, `taskType`, `method`, `baseline`
FINDER	1	`findings`, `sources`, `fidelity`
RESEARCH	1	`recipe[]`, `codePatterns[]`, `citationGraphFidelity`
CRITIC	2,4,6	`ok`, `mustReblock`, `issues[]`
RESOURCE	3	`modelVerified.exists`, `datasetVerified.exists`, `needsUserDecision`
DATA_AUDIT	4	`formatCompatible`, `requiredFields`, `mappingNeeded`
IMPL	5	`trainScriptContent`, `persistenceDest`, `timeoutHours`, `monitoringWired`, `smokeKnobs`
SMOKE	7	`status`, `usedProxyModel`
JOB	8,10	`status`, `healthy`, `alerts[]`, `lastAlertTimestamp`
FAILURE_ANALYSIS	7,8,10	`category`, `configChanges`, `scopeChange`, `oomLadderRung`, `unrecoverable`
CHECKLIST	9	`allSatisfied`, `missing[]`
EVAL	11	`evaluated`, `confirmedWorks`, `modelUrl`
COMPLETION	final	`conforms`, `missing[]`, `artifacts[]`

Where it stops

result.stoppedReason

The workflow is designed to halt and surface to the user rather than push through a violation. Every stop sets a machine-readable stoppedReason.

stoppedReason	Phase	Why it stops
resource_unavailable	3	Requested model/dataset missing or unusable: ask, don't substitute (R8).
dataset_format_incompatible	4	Schema can't satisfy the method and can't be remapped (R8).
cpu_smoke_* / gpu_preflight_* / full_job_*	7,8,10	Could not be made to pass within retries/rules.
scope_change	7,8,10	The only viable fix would touch a protected key (R3/R4/R7); enforced by a pure guard.
cost_cap_preflight / cost_cap_full_job	8,10	Estimated spend would exceed `COST_CAP_USD`. Job is prepared; only spend is gated.
submit_invariant_preflight / _full	8,10	By-value / timeout / monitoring / no-local-path invariant failed; enforced by a pure guard.
preflight_checklist_unsatisfied	9	The §5.6 pre-flight checklist gate did not pass.
planning_failed / implementation_failed	0,5	A required subagent returned nothing usable.

Config & arguments

args = task string, or { … }

Inputs

task — the request (or pass args as a bare string).
model, dataset — optional explicit resources.
hubOrg — namespace for persisted outputs.
costCapUSD — default 10; the autonomous spend ceiling.
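maxJobRetries — default 3; applies to the GPU preflight and the full job.
smokeModel — optional tiny proxy model for the CPU smoke run.
workDir — working directory shared by local subagents; default ./ml-run.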
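
How the two accepted argument forms might be normalized is sketched here. The defaults come from the list above; the `normalizeArgs` name, the returned shape, and the use of null for unset options are assumptions.

```javascript
// Hypothetical normalizer: accepts a bare task string or an options object.
function normalizeArgs(args) {
  const input = typeof args === 'string' ? { task: args } : { ...args }
  return {
    task: input.task,
    model: input.model ?? null,
    dataset: input.dataset ?? null,
    hubOrg: input.hubOrg ?? null,
    costCapUSD: input.costCapUSD ?? 10,       // autonomous spend ceiling
    maxJobRetries: input.maxJobRetries ?? 3,  // GPU preflight + full job
    smokeModel: input.smokeModel ?? null,
    workDir: input.workDir ?? './ml-run',
  }
}

const fromString = normalizeArgs('fine-tune a small model')
const fromObject = normalizeArgs({ task: 'evaluate a model', costCapUSD: 0, workDir: './scratch' })
```

Using `??` rather than `||` keeps an explicit costCapUSD of 0 (no paid submits at all) instead of silently replacing it with the default.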

Run-level state

spentUSD — accrues across billable submits.
result — accumulates every phase's output, artifacts[], and stoppedReason / conforms.
baseline — the frozen user intent the scope-change guard compares against.
SMOKE_RETRIES = 3 — CPU smoke attempts.

For a guided view of how each spec section maps onto these phases and mechanisms, open ③ Spec → script map.