An autonomous ML researcher, as a behavioral spec
Given a request to train, fine-tune, evaluate, process data for, or run inference with a model, the system researches the current literature and tooling, validates resources, implements a solution grounded in that research, runs it on managed compute, monitors it, iterates to improve it, and delivers a persisted, verified result with zero avoidable errors. This document is a technology-neutral behavioral specification — the workflow, the rules, and the control contract — not an implementation.
In scope
The end-to-end research-and-engineering workflow, the hard rules that keep it from failing or drifting, and the harness control contract that keeps the autonomous loop productive and bounded.
Out of scope
Concrete tool APIs (provided by hf-skills), UI, transport, billing, and model-provider specifics. Wherever the workflow needs a concrete capability, the spec names an abstract capability and maps it to hf-skills in Appendix A.
Design principles
§2 · the invariants that motivate everything
These five principles are the reason every later rule exists. An implementation that preserves them while changing the mechanics is still conformant.
Assume internal ML knowledge is stale
The agent MUST NOT write ML code from memory. Every implementation MUST be grounded in freshly retrieved literature, documentation, and working example code. Research is the primary mechanism that prevents hallucinated imports and wrong configs.
Verify, never assume
Resource existence, model architecture/size, dataset schema and columns, and format-to-method compatibility MUST be confirmed by inspection before any expensive operation.
Test small before you spend big
Code MUST be smoke-tested on representative-but-tiny inputs/hardware before launch at full scale. One verified small run precedes any batch, sweep, or long job.
Persist or lose it
Compute environments are ephemeral. Any artifact that must survive MUST be explicitly pushed to durable storage as part of the run, not after.
Preserve the user's intent
When something fails, the fix MUST be the minimal change that keeps the original request intact. The agent MUST NOT silently change the task (method, dataset, model, sequence length) to make an error go away.
The workflow
§5 · research-first, validate-before-spend, monitor-and-iterate
A research-first, plan-tracked, validate-before-spend, monitor-and-iterate loop. Trivial non-code requests MAY be answered directly; anything that produces or runs ML code runs the full workflow.
Phase 0 — Intake & planning · §5.1
- The agent MUST determine whether the request is trivial (skip to a direct answer) or an implementation task (run the full workflow).
- For any task with three or more steps, the agent MUST create and maintain an explicit plan: ordered to-do items with status pending / in_progress / completed.
Plan discipline (normative)
- Exactly one item is in_progress at any time.
- An item is marked completed immediately after it genuinely finishes, not batched, and only if it succeeded with no errors.
- The plan is updated frequently so progress is legible.
- A failed/blocked item stays in_progress (or pending); a new item is added to resolve the blocker rather than marking the blocked item done.
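The plan discipline above can be sketched as a small state machine. This is a minimal illustration, not the spec's implementation: the class and method names (`Plan`, `PlanItem`, `start`, `complete`, `block`) are hypothetical, and parking a blocked item as pending is one of the two options the rule allows.

```python
from dataclasses import dataclass, field


@dataclass
class PlanItem:
    title: str
    status: str = "pending"  # pending | in_progress | completed


@dataclass
class Plan:
    items: list = field(default_factory=list)

    def add(self, title: str) -> PlanItem:
        item = PlanItem(title)
        self.items.append(item)
        return item

    def start(self, item: PlanItem) -> None:
        # Enforce "exactly one item is in_progress at any time".
        for other in self.items:
            if other is not item and other.status == "in_progress":
                raise ValueError(f"already in progress: {other.title!r}")
        item.status = "in_progress"

    def complete(self, item: PlanItem, succeeded: bool) -> None:
        # Only a genuinely successful, currently active item may be completed.
        if item.status != "in_progress" or not succeeded:
            raise ValueError(f"cannot complete {item.title!r}")
        item.status = "completed"

    def block(self, item: PlanItem, resolution_title: str) -> PlanItem:
        # A blocked item is never marked done: park it as pending and add
        # a new item that resolves the blocker.
        item.status = "pending"
        return self.add(resolution_title)
```

Raising on an invalid transition, rather than silently ignoring it, keeps a violated invariant visible to the orchestrator.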
Phase 1 — Research (literature-first, mandatory) · heart of the workflow
This phase MUST NOT be skipped for implementation tasks. Its goal is to replace stale internal knowledge with a concrete, current, example-grounded recipe before any ML code is written.
Default research procedure
- Find the landmark paper(s) for the task or domain.
- Crawl their citation graph to surface recent downstream work that cites and improves on the anchor.
- Read the methodology sections of the most promising papers (recent, strong results, well-cited, reputable venue). Read methods, not abstracts.
- Extract the recipe: dataset, training method, hyperparameters that produced the reported results. Every extracted fact MUST be attributable to a specific result ("dataset X + method Y produced score Z on benchmark B").
- Confirm the referenced datasets actually exist and are usable.
- Find working example code using the current library APIs for the chosen method.
Research subagent contract
- Isolation. Deep reading MUST run in a context window separate from the orchestrator's; the orchestrator receives only the subagent's final summary.
- Read-only. The subagent MUST NOT submit jobs, write durable artifacts, or mutate state. It only reads (papers, docs, code, dataset metadata) and may do general web retrieval.
- Bounded. The subagent has its own iteration cap (ref: ~60 steps), a context-budget guard (warn when usage is high, hard-stop near the limit and force a summary), and the same repetition guard as the orchestrator.
- Input. A specific task description plus context: goal, known anchor papers / arXiv IDs, and what the orchestrator needs. Name anchors when known.
- Output (bounded, structured, ~500–1500 words): a ranked recipe table (paper → result → dataset → method → key hyperparameters → key insight); code patterns (correct imports, config/trainer arguments, current-API snippets); a short state-of-the-art landscape; and essential references with links.
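One way to make the output contract checkable is to give each recipe-table row a fixed shape and to test the attribution and length requirements mechanically. The sketch below is illustrative only: `RecipeRow`, `unattributed`, and `within_word_budget` are hypothetical names, and counting whitespace-separated tokens is an assumed approximation of "words".

```python
from dataclasses import dataclass, field


@dataclass
class RecipeRow:
    paper: str
    result: str        # e.g. "score Z on benchmark B"
    dataset: str
    method: str
    hyperparameters: dict = field(default_factory=dict)
    insight: str = ""


def unattributed(row: RecipeRow) -> list:
    """Names of the attribution fields that are empty for this row."""
    required = ("paper", "result", "dataset", "method")
    return [name for name in required if not getattr(row, name).strip()]


def within_word_budget(summary: str, low: int = 500, high: int = 1500) -> bool:
    # Approximates the ~500-1500 word bound by counting whitespace tokens.
    return low <= len(summary.split()) <= high
```

A row for which `unattributed` returns a non-empty list does not satisfy the "attributable to a specific result" requirement and would be sent back to the subagent or dropped.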
Phase 2 — Resource discovery & validation · §5.3
Before implementing, the agent MUST establish concrete, verified resources.
- Model. If named, confirm it exists and inspect it (architecture, size, tokenizer, license, suitability). If not named, evaluate a few candidates (ref: 3–5) and select on task-fit / quality / size / cost. The agent MUST NOT silently substitute a different model; if the requested one is unusable, it says so and asks.
- Dataset. As with the model: confirm it exists and inspect it. Format-to-method compatibility MUST be validated at this stage (the checks are listed in Phase 3).
- Hardware. Choose compute sized to the model footprint. Do not default to the most expensive tier without justification, and do not undersize.
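Sizing compute to the model footprint can be approximated with a back-of-the-envelope estimate. The sketch below rests on loudly stated assumptions: roughly 2 bytes per parameter for 16-bit inference, roughly 16 bytes per parameter for full fine-tuning with Adam in mixed precision (weights, gradients, optimizer states), a 1.2x overhead factor, and no model of activation memory. The tier names passed in are hypothetical.

```python
# Rough rules of thumb (assumptions, not measurements).
BYTES_PER_PARAM = {
    "inference_16bit": 2,
    "full_finetune_adam_mixed": 16,
}


def estimate_gb(n_params: float, mode: str, overhead: float = 1.2) -> float:
    """Approximate accelerator memory in GB, excluding activations."""
    return n_params * BYTES_PER_PARAM[mode] * overhead / 1e9


def pick_tier(required_gb: float, tiers: dict) -> str:
    """Return the smallest tier (name -> memory in GB) that fits."""
    fitting = [(mem, name) for name, mem in tiers.items() if mem >= required_gb]
    if not fitting:
        raise ValueError(f"no tier fits {required_gb:.1f} GB")
    return min(fitting)[1]
```

Choosing the smallest fitting tier addresses both halves of the rule: it avoids defaulting to the most expensive option and refuses to undersize. The estimate is only a starting point; the smoke test in Phase 4 is what confirms the fit.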
Phase 3 — Data audit · mandatory before using any dataset
The agent MUST inspect a dataset before working with it and MUST NOT assume its shape.
- Inspect: schema/columns, rows per split, value distributions for key columns, sample rows.
- Surface anything notable: class imbalance, missing values, unexpected formats, outliers, duplicates.
- Validate format against the training method — training fails fast on schema mismatch:
| Method | Required fields |
|---|---|
| SFT | messages, or text, or prompt+completion |
| DPO | prompt, chosen, rejected |
| GRPO | prompt |
| Other | confirm the documented schema during research |
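The table above translates directly into a column check that can run before any compute is spent. This is a minimal sketch: the function name `missing_columns` is hypothetical, it only looks at column names (not value types), and methods outside the table are deliberately rejected so that their schema is confirmed during research.

```python
# Each method maps to one or more acceptable column sets, per the table.
REQUIRED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}],
    "grpo": [{"prompt"}],
}


def missing_columns(method: str, columns) -> list:
    """Return [] if the columns satisfy the method, else what is missing."""
    alternatives = REQUIRED_COLUMNS.get(method.lower())
    if alternatives is None:
        raise KeyError(f"unknown method {method!r}: confirm its schema in research")
    present = set(columns)
    if any(required <= present for required in alternatives):
        return []
    # Report the gap for the alternative that is closest to being satisfied.
    closest = min(alternatives, key=lambda required: len(required - present))
    return sorted(closest - present)
```

For example, a DPO dataset with only `prompt` and `chosen` yields `["rejected"]`, which is the kind of concrete finding the audit is meant to surface.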
Phase 4 — Implementation & preflight (sandbox-first) · GPU smoke test
Code is written from the research findings (Phase 1), not from memory.
- Sandbox-first development. Non-trivial scripts MUST be developed and tested in a disposable sandbox before launching at scale: write → install deps → run small → fix → scale up.
- GPU preflight smoke test (mandatory for GPU work). If the job will run on GPU or the script loads a model / exercises a GPU code path (CUDA, mixed precision, quantization, fused/optimized attention, graph compilation), the agent MUST run a tiny smoke test on representative hardware (same imports, same model-loading path, same training entrypoint, a tiny subset), then fix any failure before scaling. CPU-only execution cannot validate GPU code paths.
- If preflight hardware cannot fit the full path, test the largest useful subset, state what was not covered, and submit one full job first (Phase 5).
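The smoke-test rule can be sketched as a thin wrapper that calls the real training entrypoint on a tiny slice of the data. Everything named here (`smoke_test`, `entrypoint`, `n_examples`) is hypothetical, and the wrapper does not itself check the hardware: it only validates GPU code paths if it is executed on a representative GPU.

```python
from itertools import islice


def smoke_test(entrypoint, dataset, n_examples=8, **config):
    """Run the real training entrypoint on a tiny subset and report the outcome."""
    tiny = list(islice(iter(dataset), n_examples))
    try:
        # Same entrypoint and config as the full run; only the data shrinks.
        entrypoint(tiny, **config)
    except Exception as exc:  # surface any failure instead of scaling up
        return {"passed": False, "n_examples": len(tiny),
                "error": f"{type(exc).__name__}: {exc}"}
    return {"passed": True, "n_examples": len(tiny), "error": None}
```

The returned record is what the pre-flight checklist in Phase 5 would cite as the smoke-test result.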
Phase 5 — Job submission (managed compute) · pre-flight checklist
- Code must reach the remote environment by value. A managed job runs in a fresh environment with no access to local paths. The script MUST be supplied as inline source, a file written into the job's own environment, or a public URL. Local checkout paths MUST NOT be passed.
Pre-flight checklist — MUST be satisfiable and stated before launch
- Reference implementation — which researched example this is based on.
- Dataset format verified — columns confirmed (Phase 3).
- GPU smoke test — hardware + result, or an explicit reason it is not applicable.
- Persistence configured — durable push enabled with a concrete destination id. Without this the trained artifact is lost when the environment is torn down.
- Timeout sized to the expected duration of the work, not left at the default.
- Monitoring included — tracker wired in and publishing to a live dashboard.
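A pre-job gate for the checklist above might look like the following sketch. The `Preflight` record and its field names are hypothetical; the six checks mirror the six checklist items, and a job would be refused while `missing_items` returns anything.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Preflight:
    reference_implementation: str = ""   # which researched example this is based on
    dataset_format_verified: bool = False
    gpu_smoke_test: str = ""             # hardware + result, or reason not applicable
    persistence_destination: str = ""    # concrete durable destination id
    timeout_minutes: Optional[int] = None
    monitoring_enabled: bool = False


def missing_items(p: Preflight) -> list:
    """Return the checklist items that are not yet satisfied."""
    checks = {
        "reference implementation": bool(p.reference_implementation.strip()),
        "dataset format verified": p.dataset_format_verified,
        "GPU smoke test (or reason not applicable)": bool(p.gpu_smoke_test.strip()),
        "persistence destination": bool(p.persistence_destination.strip()),
        "timeout": p.timeout_minutes is not None and p.timeout_minutes > 0,
        "monitoring": p.monitoring_enabled,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of gaps, rather than a bare boolean, lets the agent state exactly what remains before launch.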
Phase 6 — Monitoring & closed-loop iteration · §5.7
Monitoring is not passive logging; it is the decision channel that drives the next iteration.
Structured alerts at decision points (numeric values + actionable suggestion)
- ERROR — stop and change approach (divergence, NaN, OOM).
- WARN — tweak hyperparameters (overfitting, early-stopping signal, KL spike, reward collapse, slow convergence).
- INFO — milestones (training complete, target reached, checkpoint saved).
Example alert: "loss=12.4 at step 200 — lr likely too high, try x0.1". A later step MUST be able to parse it and act.
Reference decision policy (read alerts back, not raw metric points)
- diverged → learning-rate × 0.1
- overfitting → weight-decay × 10, or reduce capacity
- early-stopping signal → learning-rate × 0.5, or adjust schedule
- high accuracy → refine around the current config
The agent mutates only the keys the alerts justify changing; it reads the prior config and changes the minimum.
Sweeps, not hand-tuning. Hyperparameter exploration MUST be done by launching a sweep over a grid and evaluating each run automatically, not by editing one value at a time.
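The reference decision policy can be sketched as a lookup from alert kind to a single config mutation. The alert labels (`diverged`, `overfitting`, `early_stopping`) and the config keys (`learning_rate`, `weight_decay`) are assumed names for illustration; only the first option of each rule is encoded, and "high accuracy" maps to no mutation because refining around the current config is a sweep decision rather than a key change.

```python
# alert kind -> (config key, multiplicative factor), per the reference policy
POLICY = {
    "diverged": ("learning_rate", 0.1),
    "overfitting": ("weight_decay", 10.0),
    "early_stopping": ("learning_rate", 0.5),
}


def next_config(prior: dict, alert_kinds) -> dict:
    """Copy the prior config and change only the keys the alerts justify."""
    updated = dict(prior)  # never mutate the recorded prior config
    for kind in alert_kinds:
        rule = POLICY.get(kind)
        if rule is None:
            continue  # unknown or informational alerts change nothing
        key, factor = rule
        updated[key] = updated[key] * factor
    return updated
```

Copying the prior config first keeps the run record intact, so the difference between two iterations is exactly the set of keys the alerts touched.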
Phase 7 — Evaluation, persistence & completion · §5.8
A task is not done until:
- The required output exists (final model / reached metric / updated dataset).
- The output is persisted to durable storage.
- The model has been evaluated and confirmed to work (not merely produced).
- For training runs, a working monitoring dashboard URL has been provided.
Autonomous / headless loop discipline · §5.9
When running with no human in the loop (fixed time/compute budget, no one to re-prompt):
- Every step makes progress via an action. A response that performs no action ends the loop with no way to resume; the agent MUST always take a next action (work the plan, verify outputs, or plan ahead) rather than returning idle text.
- Do not stop early. While budget remains, the agent MUST keep improving and MUST NOT declare itself "done" or ask whether to continue. There is no one to answer.
- Iterate as a loop, not a checklist. After a working result, keep going: research → implement → train/evaluate → persist → improve → research again.
- When out of ideas, go back to the literature. Crawl citation graphs deeper, read unread papers, combine recipes, re-read the task and training logs for missed angles.
- Budget time explicitly. Check remaining budget periodically and reserve a margin at the end (ref: ~10 minutes) for final evaluation and saving, so the loop never ends with an unsaved or unevaluated result.
- Out-of-band notifications are used only when the user asked for them or the task clearly requires reporting to a configured destination — not for routine chatter.
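Explicit time budgeting with a reserved margin can be sketched as follows. The `Budget` class is hypothetical; the 600-second default reflects the ~10 minute reference margin above, and the injectable `clock` argument is an assumption added so the logic can be exercised without waiting.

```python
import time


class Budget:
    """Track remaining wall-clock budget and protect a final reserve."""

    def __init__(self, total_seconds, reserve_seconds=600, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds
        self._reserve = reserve_seconds

    def remaining(self) -> float:
        return max(0.0, self._deadline - self._clock())

    def can_start(self, estimated_seconds) -> bool:
        # A new iteration must fit in the time left after the reserve.
        return self.remaining() - self._reserve >= estimated_seconds

    def must_finalize(self) -> bool:
        # Once only the reserve is left, stop iterating: evaluate and save.
        return self.remaining() <= self._reserve
```

A loop would call `can_start` before launching each iteration and switch to final evaluation and persistence as soon as `must_finalize` is true.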
Roles & components
§3 · abstract
The workflow is defined over these abstract roles. An implementation MAY collapse or split them, but the responsibilities MUST exist somewhere.
Orchestrator
Drives the phases, maintains the plan, enforces the rules and the control contract, and decides what to do next. Owns the main conversation/context.
Research subagent
A separately-contexted worker that performs literature and documentation mining and returns a compact, structured summary, keeping the orchestrator's context clean. See Phase 1.
Execution surface
The capabilities that act on the outside world: resource discovery/inspection, sandbox code execution, managed job submission and monitoring, durable storage. Provided by hf-skills.
Tracker
The experiment-tracking capability used both to record metrics and to emit and read back structured alerts that drive iteration decisions (Phase 6).
Plan
An explicit, ordered, mutable to-do list that makes progress legible and decomposes multi-step work (Phase 0).
Capability requirements
§4 · C1–C14
The abstract capabilities the workflow requires. Each MUST be satisfiable by hf-skills (or the host harness). Appendix A gives the concrete mapping and flags the gaps.
| # | Abstract capability | Required for |
|---|---|---|
| C1 | Search the model/dataset hub; fetch repo details | Resource discovery & validation |
| C2 | Inspect a dataset: schema, columns, splits, sample rows, statistics | Data audit (Phase 3) |
| C3 | Search and read research papers; follow links to code/datasets | Research (Phase 1) |
| C4 | Trace citation graphs (references and forward citations) | Deep research (partial gap) |
| C5 | Search/retrieve current library documentation | Research, implementation |
| C6 | Find and read working example code | Research, implementation |
| C7 | Execute code in a disposable sandbox (CPU and GPU tiers) | Preflight, Phase 4 (partial gap) |
| C8 | Submit, configure, monitor, and cancel managed compute jobs | Job execution (Phase 5) |
| C9 | Train/fine-tune models with standard methods (SFT/DPO/GRPO, etc.) | Implementation |
| C10 | Record metrics and emit/read structured training alerts | Monitoring & iteration (Phase 6) |
| C11 | Durable storage for models, datasets, logs, results | Persistence (Principle 4) |
| C12 | Evaluate a model on a benchmark/task | Completion (Phase 7) |
| C13 | General web/document retrieval | Research fallback (gap · host harness) |
| C14 | Out-of-band notification (optional) | Reporting, §5.9 (gap · host harness) |
Hard rules & invariants
§6 · R1–R14
The non-negotiable rules. They restate the principles as concrete prohibitions and requirements.
Harness control contract
§7 · keeps the loop bounded & unstuck
These rules govern the loop itself: they keep an autonomous agent bounded, unstuck, and within its context budget. They are independent of the ML domain and enforced by the host harness or hooks. Numbers are reference defaults.
§7.1 Bounded iteration
§7.2 Repetition / "doom-loop" guard
- Identical repetition: same action + same arguments repeated (ref: 3 in a row) → "stop repeating this, try a fundamentally different strategy".
- Cyclic repetition: a short sequence (ref length 2–5) repeated (ref: ≥2 full cycles) → "you are in a repeating cycle, break it".
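Both detectors can be sketched over a history of actions, where each action is any comparable value such as a `(tool_name, arguments)` tuple. The function names are hypothetical and the thresholds default to the reference values above; note that a run of identical actions also registers as a cycle, which is harmless since both cases call for a change of strategy.

```python
def identical_repeats(history, threshold=3):
    """True if the last `threshold` actions are all the same action."""
    if len(history) < threshold:
        return False
    tail = history[-threshold:]
    return all(action == tail[0] for action in tail)


def cyclic_repeats(history, min_len=2, max_len=5, min_cycles=2):
    """True if the history ends with a short sequence repeated back to back."""
    for n in range(min_len, max_len + 1):
        span = n * min_cycles
        if len(history) < span:
            continue
        tail = history[-span:]
        cycle = tail[:n]
        # Every position in the tail must match the candidate cycle.
        if all(tail[i] == cycle[i % n] for i in range(span)):
            return True
    return False
```

When either function returns true, the harness would inject the corresponding corrective message instead of executing the repeated action again.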
§7.3 Continuation guard
§7.4 Malformed-action guard
§7.5 Output-truncation recovery
§7.6 Context compaction
§7.7 Approval gate
- Auto-approved: read-only research, inspection, discovery; routine code execution in the default low-cost sandbox; status/metadata queries.
- Approval-required: provisioning non-default (GPU/larger) compute; submitting paid compute jobs; destructive storage operations (delete repo, delete branch/tag, merge, force-upload/overwrite); creating durable repos.
- Always human-gated: recurring/scheduled jobs (a standing cost commitment) require explicit human approval even under otherwise-autonomous policies.
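The three approval tiers can be sketched as a classification table plus one decision function. The action labels are hypothetical, and two behaviors are assumptions rather than statements of the spec: unknown actions fall back to approval-required, and an autonomous policy is taken to pre-approve the approval-required tier.

```python
from enum import Enum


class Gate(Enum):
    AUTO = "auto-approved"
    APPROVAL = "approval-required"
    HUMAN = "always human-gated"


# Hypothetical action labels; a real harness would map its own tool calls here.
GATES = {
    "read_paper": Gate.AUTO,
    "inspect_dataset": Gate.AUTO,
    "run_default_sandbox": Gate.AUTO,
    "provision_gpu_sandbox": Gate.APPROVAL,
    "submit_paid_job": Gate.APPROVAL,
    "delete_repo": Gate.APPROVAL,
    "create_repo": Gate.APPROVAL,
    "create_scheduled_job": Gate.HUMAN,
}


def gate_for(action: str) -> Gate:
    # Assumption: unknown actions fall back to approval-required.
    return GATES.get(action, Gate.APPROVAL)


def needs_human(action: str, autonomous: bool) -> bool:
    gate = gate_for(action)
    if gate is Gate.HUMAN:
        return True  # scheduled jobs stay human-gated even when autonomous
    # Assumption: an autonomous policy pre-approves approval-required actions.
    return gate is Gate.APPROVAL and not autonomous
```

Defaulting unknown actions to the stricter tier errs on the side of asking, which matches the intent of gating anything that costs money or destroys data.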
§7.8 Effort / budget probing
State & artifacts
§8
Across the workflow the agent maintains:
The plan
The live decomposition and progress (Phase 0).
Research findings
The recipe table, code patterns, references that ground implementation — the authority later phases cite (Phase 1).
Validated resources
Confirmed model, dataset (with verified schema), and chosen hardware (Phases 2–3).
Run records
Job ids, configs, tracker project/run names, dashboard URLs, and the alert history that drives iteration (Phases 5–6).
Durable outputs
The persisted model/dataset/logs and evaluation results, each linked by URL (Phase 7).
Completion criteria
§9 · conformance summary
An execution conforms to this spec if, for an ML implementation request:
- Research preceded implementation, and the implementation cites concrete findings (Principle 1, Phase 1).
- Resources were verified by inspection, including dataset-format-to-method compatibility (Principle 2, Phases 2–3).
- GPU code was smoke-tested before scaling, or the omission was justified (Principle 3, Phase 4).
- The pre-flight checklist was satisfied before any job, batches went one-job-first, and durable persistence was configured up front (Phase 5, Principle 4).
- Monitoring emitted structured alerts and the next iteration was driven by them (Phase 6).
- The result was persisted and evaluated, and all artifacts were linked (Phase 7).
- No rule in §6 was violated; in particular no silent scope change or resource substitution occurred.
- The loop stayed bounded and unstuck under the control contract (§7).
Appendices
A: mapping (informative) · B: realization (non-normative)
Appendix A — Capability → hf-skills mapping (informative)
How the abstract capabilities of §4 are satisfied by hf-skills. The skill names are the execution surface; do not reimplement them.
| Cap | Provided by (hf-skills) | Notes |
|---|---|---|
| C1 | hf-cli (models/datasets list/info), huggingface-best, hub MCP tools | Model/dataset discovery & validation. |
| C2 | huggingface-datasets (Dataset Viewer), trainer-skill validation helpers | Schema, splits, samples, stats. Satisfies the §5.4 audit. |
| C3 | huggingface-papers, hf-cli papers, huggingface-paper-publisher | Read methodology from the markdown; follow linked artifacts. |
| C4 | Partial gap. huggingface-papers exposes linked artifacts but not a full citations graph. | Crawl downstream work via paper-page links + host web retrieval (C13); accept reduced fidelity and say so. |
| C5 | Companion doc tools (hf_doc_search / hf_doc_fetch) | Current TRL/Transformers/etc. APIs. |
| C6 | Trainer skills ship reference scripts; host harness can read repos | Prefer copying production templates over synthesizing. |
| C7 | Partial gap. No drop-in GPU-sandbox skill. | Realize §5.5 preflight as a short, cheap job (C8) on a small GPU flavor with a tiny subset, or a local GPU smoke test via huggingface-community-evals (--limit). |
| C8 | hf-cli (jobs run/inspect/logs/cancel, scheduled jobs), hf_jobs | Set timeout (R5), flavor (R6), env/secrets, persistence. |
| C9 | huggingface-llm-trainer, huggingface-vision-trainer, train-sentence-transformers | Method ↔ data-shape rules in §5.4 align with these skills. |
| C10 | huggingface-trackio (init/log/alert/finish, CLI --json, Space dashboards) | This is the §5.7 decision channel. |
| C11 | hf-cli (upload/repos/buckets), trainers' push_to_hub, datasets upload | Satisfies R1/R2. |
| C12 | huggingface-community-evals (inspect-ai / lighteval), trainer eval hooks | Satisfies §5.8 "evaluated and confirmed". |
| C13 | Gap in hf-skills. Source from the host harness's web search/fetch. | Used by research (C4 fallback) and non-HF docs. |
| C14 | Gap in hf-skills. Source from the host harness (messaging) if needed. | Optional; gated per §5.9. |
Appendix B — Realizing the workflow as skills / subagents / hooks (non-normative)
Illustrative only. The intent of the split: hf-skills provides the doing, and the researcher provides only the discipline (phase ordering, verification gates, persistence and anti-scope-change rules, and loop guards).
- Driving skill(s). Encode the phase order and gate conditions (§5) as the researcher's top-level procedure: research → validate → audit → preflight → submit → monitor → iterate → evaluate, delegating each concrete action to the relevant hf-skill.
- Research subagent. Implement §5.2 as a separate subagent with its own context and a read-only toolset. Return schema = recipe table + code patterns + references. This is the one place a subagent is structurally required (context isolation).
- Hooks for the control contract (§7) and hard gates (§6), enforced rather than merely requested:
  - Pre-job hook: refuse a job unless the §5.6 pre-flight checklist is satisfied; refuse local paths in scripts.
  - Batch hook: allow only one job from a batch until its logs confirm a healthy start.
  - Repetition / continuation / malformed guards: the §7.2–7.4 detectors.
  - Compaction hook: the §7.6 policy.
  - Approval hook: the §7.7 policy.
- Plan/tracker. Use the host harness's todo mechanism for the §5.1 plan, and huggingface-trackio for the §5.7 alert-driven loop.