An autonomous ML researcher, as a behavioral spec

Given a request to train, fine-tune, evaluate, process data for, or run inference with a model, the system researches the current literature and tooling, validates resources, implements a solution grounded in that research, runs it on managed compute, monitors it, iterates to improve it, and delivers a persisted, verified result with zero avoidable errors. This document is a technology-neutral behavioral specification — the workflow, the rules, and the control contract — not an implementation.

MUST / MUST NOT: non-negotiable · SHOULD: strong default · MAY: permitted. Numbers are tunable reference defaults; the shape of each rule is normative.

5 design principles (the invariants)

8 workflow phases (0 → 7) + autonomous loop

14 hard rules (R1–R14)

14 abstract capabilities (C1–C14)

8 harness control-contract guards

In scope

The end-to-end research-and-engineering workflow, the hard rules that keep it from failing or drifting, and the harness control contract that keeps the autonomous loop productive and bounded.

Out of scope

Concrete tool APIs (provided by hf-skills), UI, transport, billing, and model-provider specifics. Wherever the workflow needs a concrete capability, the spec names an abstract capability and maps it to hf-skills in Appendix A.

Design principles

§2 · the invariants that motivate everything

These five principles are the reason every later rule exists. An implementation that preserves them while changing the mechanics is still conformant.

Principle 1

Assume internal ML knowledge is stale

The agent MUST NOT write ML code from memory. Every implementation MUST be grounded in freshly retrieved literature, documentation, and working example code. Research is the primary mechanism that prevents hallucinated imports and wrong configs.

Principle 2

Verify, never assume

Resource existence, model architecture/size, dataset schema and columns, and format-to-method compatibility MUST be confirmed by inspection before any expensive operation.

Principle 3

Test small before you spend big

Code MUST be smoke-tested on representative-but-tiny inputs/hardware before launch at full scale. One verified small run precedes any batch, sweep, or long job.

Principle 4

Persist or lose it

Compute environments are ephemeral. Any artifact that must survive MUST be explicitly pushed to durable storage as part of the run, not after.

Principle 5

Preserve the user's intent

When something fails, the fix MUST be the minimal change that keeps the original request intact. The agent MUST NOT silently change the task (method, dataset, model, sequence length) to make an error go away.

The workflow

§5 · research-first, validate-before-spend, monitor-and-iterate

A research-first, plan-tracked, validate-before-spend, monitor-and-iterate loop. Trivial non-code requests MAY be answered directly; anything that produces or runs ML code runs the full workflow.

Phase groups: Plan · Research · Validate · Build & launch · Monitor / iterate · Complete. A gate marks a mandatory checkpoint.

Phase 0 — Intake & planning · §5.1

The agent MUST determine whether the request is trivial (skip to a direct answer) or an implementation task (run the full workflow).
For any task with three or more steps, the agent MUST create and maintain an explicit plan — ordered to-do items with status pending / in_progress / completed.

Plan discipline (normative)

Exactly one item is in_progress at any time.
An item is marked completed immediately after it genuinely finishes, not batched, and only if it succeeded with no errors.
The plan is updated frequently so progress is legible.
A failed/blocked item stays in_progress (or pending); a new item is added to resolve the blocker rather than marking the blocked item done.
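
The plan discipline above can be sketched as a small data structure. This is an illustration under assumptions, not part of the spec: the class and method names (`Plan`, `start`, `complete`, `block`) are hypothetical, and only the three status values come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class PlanItem:
    title: str
    status: str = "pending"  # pending / in_progress / completed


@dataclass
class Plan:
    items: list = field(default_factory=list)

    def add(self, title: str) -> PlanItem:
        item = PlanItem(title)
        self.items.append(item)
        return item

    def start(self, item: PlanItem) -> None:
        # Enforce "exactly one item is in_progress at any time".
        if any(i.status == "in_progress" for i in self.items if i is not item):
            raise ValueError("another item is already in_progress")
        item.status = "in_progress"

    def complete(self, item: PlanItem, succeeded: bool) -> None:
        # Only genuinely successful work may be marked completed.
        if item.status != "in_progress":
            raise ValueError("only the in_progress item can be completed")
        if not succeeded:
            raise ValueError("failed or partial work stays open; add a blocker item")
        item.status = "completed"

    def block(self, item: PlanItem, blocker_title: str) -> PlanItem:
        # Park the blocked item as pending (never completed) and insert a
        # new item that resolves the blocker directly ahead of it.
        item.status = "pending"
        position = next(n for n, i in enumerate(self.items) if i is item)
        blocker = PlanItem(blocker_title)
        self.items.insert(position, blocker)
        return blocker
```

Parking the blocked item as pending, rather than leaving it in_progress, is one way to keep the single-in_progress rule satisfiable while the blocker is being worked.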

Phase 1 — Research (literature-first, mandatory) · heart of the workflow

This phase MUST NOT be skipped for implementation tasks. Its goal is to replace stale internal knowledge with a concrete, current, example-grounded recipe before any ML code is written.

Default research procedure

Find the landmark paper(s) for the task or domain.
Crawl their citation graph to surface recent downstream work that cites and improves on the anchor.
Read the methodology sections of the most promising papers (recent, strong results, well-cited, reputable venue). Read methods, not abstracts.
Extract the recipe: dataset, training method, hyperparameters that produced the reported results. Every extracted fact MUST be attributable to a specific result ("dataset X + method Y produced score Z on benchmark B").
Confirm the referenced datasets actually exist and are usable.
Find working example code using the current library APIs for the chosen method.

Research subagent contract

Isolation. Deep reading MUST run in a context window separate from the orchestrator's; the orchestrator receives only the subagent's final summary.
Read-only. MUST NOT submit jobs, write durable artifacts, or mutate state. It only reads (papers, docs, code, dataset metadata) and may do general web retrieval.
Bounded. Its own iteration cap (ref: ~60 steps), context-budget guard (warn high, hard-stop near the limit and force a summary), and the same repetition guard as the orchestrator.
Input. A specific task description plus context: goal, known anchor papers / arXiv IDs, and what the orchestrator needs. Name anchors when known.
Output (bounded, structured, ~500–1500 words): a ranked recipe table (paper → result → dataset → method → key hyperparameters → key insight); code patterns (correct imports, config/trainer arguments, current-API snippets); a short state-of-the-art landscape; and essential references with links.
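
One possible shape for a row of the recipe table, with a check for the attribution requirement from the research procedure. The field names and the `is_attributable` helper are assumptions made for this sketch; the spec prescribes the columns, not a data format.

```python
from dataclasses import dataclass, field


@dataclass
class RecipeRow:
    paper_url: str
    result: str    # a concrete score on a named benchmark
    dataset: str
    method: str
    hyperparameters: dict = field(default_factory=dict)
    insight: str = ""

    def is_attributable(self) -> bool:
        # A row counts only if it ties dataset + method to a specific
        # result and links the paper that reported it.
        required = (self.paper_url, self.result, self.dataset, self.method)
        return all(value.strip() for value in required)


def attributable_rows(rows):
    # Drop rows that cannot be traced back to a specific result.
    return [row for row in rows if row.is_attributable()]
```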

The orchestrator MAY perform quick, shallow lookups directly (a single doc page, one repo's details). The subagent is for deep multi-source research. Research is skipped only for simple factual questions, status checks, and pure resource discovery — never because a task "seems simple".

C3 papers · C4 citation graph · C5 docs · C6 example code · C13 web · Principle 1 · R11

Phase 2 — Resource discovery & validation · §5.3

Before implementing, the agent MUST establish concrete, verified resources.

Model. If named, confirm it exists and inspect it (architecture, size, tokenizer, license, suitability). If not named, evaluate a few candidates (ref: 3–5) and select on task-fit / quality / size / cost. The agent MUST NOT silently substitute a different model; if the requested one is unusable, it says so and asks.
Dataset. Same: confirm existence and inspect. Format-to-method compatibility MUST be validated here (see Phase 3).
Hardware. Choose compute sized to the model footprint. Do not default to the most expensive tier without justification, and do not undersize.

C1 hub search · C2 dataset inspect · Principle 2 · Principle 5 · R6 sizing · R8 no substitution

Phase 3 — Data audit · mandatory before using any dataset

The agent MUST inspect a dataset before working with it and MUST NOT assume its shape.

Inspect: schema/columns, rows per split, value distributions for key columns, sample rows.
Surface anything notable: class imbalance, missing values, unexpected formats, outliers, duplicates.
Validate format against the training method — training fails fast on schema mismatch:

Method	Required fields
SFT	`messages`, or `text`, or `prompt`+`completion`
DPO	`prompt`, `chosen`, `rejected`
GRPO	`prompt`
Other	confirm the documented schema during research

If the requested dataset cannot be loaded, the agent MUST tell the user and ask rather than silently substituting another.
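
The format table above can be expressed as a small check that runs before any job is launched. This is a minimal sketch: the function name `check_format` is hypothetical, and the column list is assumed to come from an actual dataset inspection (C2), never from memory.

```python
# Each method maps to one or more acceptable column sets, per the table.
REQUIRED_FIELDS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}],
    "grpo": [{"prompt"}],
}


def check_format(method: str, columns) -> tuple:
    """Return (ok, message) for a training method and inspected columns."""
    method = method.lower()
    if method not in REQUIRED_FIELDS:
        return False, f"no known schema for {method!r}; confirm it during research"
    present = set(columns)
    for option in REQUIRED_FIELDS[method]:
        if option <= present:  # every required column is present
            return True, f"ok: found {sorted(option)}"
    wanted = " or ".join("+".join(sorted(o)) for o in REQUIRED_FIELDS[method])
    return False, f"{method} needs {wanted}; dataset has {sorted(present)}"
```

A failed check is a reason to stop and report, not a reason to pick a different dataset or method.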

C2 dataset inspect · Principle 2 · R8 · R9

Phase 4 — Implementation & preflight (sandbox-first) · GPU smoke test

Code is written from the research findings (Phase 1), not from memory.

Sandbox-first development. Non-trivial scripts MUST be developed and tested in a disposable sandbox before launching at scale: write → install deps → run small → fix → scale up.
GPU preflight smoke test (mandatory for GPU work). If the job will run on GPU or the script loads a model / exercises a GPU code path (CUDA, mixed precision, quantization, fused/optimized attention, graph compilation), the agent MUST run a tiny smoke test on representative hardware — same imports, same model-loading path, same training entrypoint, a tiny subset — then fix any failure before scaling. CPU-only execution cannot validate GPU code paths.
If preflight hardware cannot fit the full path, test the largest useful subset, state what was not covered, and submit one full job first (Phase 5).

The smoke test exists to catch, cheaply, the exact failures that otherwise surface hours into an expensive run: bad imports, wrong arguments, schema mismatch, and out-of-memory.
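
As a rough illustration, a smoke test can be a thin wrapper that runs the real training entrypoint on a handful of rows and captures the failure instead of crashing. The names `smoke_test`, `SmokeResult`, and `train_fn` are hypothetical. The wrapper validates GPU code paths only when it is executed on representative GPU hardware.

```python
import traceback
from dataclasses import dataclass


@dataclass
class SmokeResult:
    ok: bool
    rows_used: int
    error: str = ""


def smoke_test(train_fn, rows, max_rows=8, **train_kwargs) -> SmokeResult:
    """Run the same entrypoint as the full job on a tiny subset."""
    subset = list(rows)[:max_rows]
    try:
        # Same imports, same model-loading path, same entrypoint: only
        # the amount of data differs from the full run.
        train_fn(subset, **train_kwargs)
    except Exception:
        # Keep the full traceback so the failure can be diagnosed, not guessed.
        return SmokeResult(False, len(subset), traceback.format_exc())
    return SmokeResult(True, len(subset))
```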

C7 sandbox · Principle 3

Phase 5 — Job submission (managed compute) · pre-flight checklist

Code must reach the remote environment by value. A managed job runs in a fresh environment with no access to local paths. The script MUST be supplied as inline source, a file written into the job's own environment, or a public URL. Local checkout paths MUST NOT be passed.

Pre-flight checklist — MUST be satisfiable and stated before launch

Reference implementation — which researched example this is based on.
Dataset format verified — columns confirmed (Phase 3).
GPU smoke test — hardware + result, or an explicit reason it is not applicable.
Persistence configured — durable push enabled with a concrete destination id. Without this the trained artifact is lost when the environment is torn down.
Timeout set to the work, not the default.
Monitoring included — tracker wired in and publishing to a live dashboard.
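
The checklist lends itself to a record that can be validated mechanically before launch. The following is a sketch only; the `Preflight` class and its field names are assumptions, chosen to mirror the six items above.

```python
from dataclasses import dataclass


@dataclass
class Preflight:
    reference_implementation: str = ""  # which researched example this follows
    dataset_format_verified: bool = False
    gpu_smoke_test: str = ""            # hardware + result, or reason not applicable
    persistence_destination: str = ""   # concrete durable destination id
    timeout_hours: float = 0.0          # sized to the work, not the default
    monitoring_dashboard: str = ""      # live dashboard link

    def missing(self) -> list:
        """Return the checklist items that are not yet satisfied."""
        checks = {
            "reference implementation": bool(self.reference_implementation),
            "dataset format verified": self.dataset_format_verified,
            "GPU smoke test or stated reason": bool(self.gpu_smoke_test),
            "persistence destination": bool(self.persistence_destination),
            "timeout sized to the work": self.timeout_hours > 0,
            "monitoring dashboard": bool(self.monitoring_dashboard),
        }
        return [name for name, ok in checks.items() if not ok]
```

A launcher could refuse to submit while `missing()` is non-empty and print the list as the stated checklist.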

One-job-first for batches/ablations/sweeps. The agent MUST submit a single job, confirm from its logs that it actually starts running/training correctly, and only then submit the rest. It MUST NOT submit a whole batch at once (they would all fail on the same bug). After submission, poll logs to confirm the job is healthy, then report monitoring links.
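
A minimal sketch of the one-job-first rule. The callables `submit` and `is_healthy` are hypothetical stand-ins for a managed-jobs client (C8): `submit(config)` is assumed to return a job id, and `is_healthy(job_id)` is assumed to poll logs until it can tell whether the job actually started training.

```python
def submit_one_job_first(configs, submit, is_healthy):
    """Submit a single job, confirm it starts cleanly, then submit the rest."""
    configs = list(configs)
    if not configs:
        return []
    first = submit(configs[0])
    if not is_healthy(first):
        # Nothing else is launched: the remaining jobs would hit the same bug.
        raise RuntimeError(
            f"first job {first} did not start cleanly; fix it before submitting the rest"
        )
    return [first] + [submit(config) for config in configs[1:]]
```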

C8 managed jobs · C11 storage · Principle 4 · R1 · R5 timeout · R6 sizing

Phase 6 — Monitoring & closed-loop iteration · §5.7

Monitoring is not passive logging; it is the decision channel that drives the next iteration.

Structured alerts at decision points (numeric values + actionable suggestion)

ERROR — stop and change approach (divergence, NaN, OOM).
WARN — tweak hyperparameters (overfitting, early-stopping signal, KL spike, reward collapse, slow convergence).
INFO — milestones (training complete, target reached, checkpoint saved).

Example alert text: "loss=12.4 at step 200 — lr likely too high, try x0.1" — a later step MUST be able to parse it and act.
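
One way to guarantee that a later step can parse an alert is to carry it as a structured record alongside the human-readable text. The JSON encoding and the `Alert` fields below are assumptions of this sketch; the spec only requires numeric values plus an actionable suggestion.

```python
import json
from dataclasses import asdict, dataclass

LEVELS = ("ERROR", "WARN", "INFO")


@dataclass
class Alert:
    level: str       # ERROR / WARN / INFO
    condition: str   # short machine-readable label, e.g. "diverged"
    metrics: dict    # the numeric values that triggered the alert
    suggestion: str  # the actionable next step

    def to_text(self) -> str:
        # Serialize deterministically so the alert can be read back later.
        return json.dumps(asdict(self), sort_keys=True)


def parse_alert(text: str) -> Alert:
    data = json.loads(text)
    if data["level"] not in LEVELS:
        raise ValueError(f"unknown alert level: {data['level']!r}")
    return Alert(**data)
```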

Reference decision policy (read alerts back, not raw metric points)

diverged → learning-rate × 0.1
overfitting → weight-decay × 10, or reduce capacity
early-stopping signal → learning-rate × 0.5, or adjust schedule
high accuracy → refine around the current config
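
The reference policy can be read as a function from the prior config and an alert condition to the next config. In this sketch the condition labels and config keys (`learning_rate`, `weight_decay`) are hypothetical names, and only the first option of each rule is implemented.

```python
def next_config(prior: dict, condition: str) -> dict:
    """Return a copy of the prior config with only the justified key changed."""
    new = dict(prior)
    if condition == "diverged":
        new["learning_rate"] = prior["learning_rate"] * 0.1
    elif condition == "overfitting":
        new["weight_decay"] = prior["weight_decay"] * 10
    elif condition == "early_stopping":
        new["learning_rate"] = prior["learning_rate"] * 0.5
    # "high_accuracy" and unknown conditions leave the config untouched here;
    # refining around a good config is a job for a sweep, not a single edit.
    return new
```

Note that multiplying a weight decay of zero by ten leaves it at zero, so a real policy would need a floor value; that detail is outside what the reference policy states.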

The agent mutates only the keys the alerts justify changing; it reads the prior config and changes the minimum. Sweeps, not hand-tuning — hyperparameter exploration MUST be done by launching a sweep over a grid and evaluating each run automatically, not by editing one value at a time.
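
A grid sweep of the kind described here can be generated with the standard library. The helper names `grid` and `best_run` are hypothetical, `evaluate` stands in for whatever automatic evaluation each run reports, and the sketch assumes a higher score is better.

```python
from itertools import product


def grid(base: dict, axes: dict) -> list:
    """Expand a base config into one config per point of the grid."""
    keys = list(axes)
    return [
        {**base, **dict(zip(keys, combo))}
        for combo in product(*(axes[key] for key in keys))
    ]


def best_run(configs, evaluate):
    """Score every config automatically and return the best (config, score)."""
    scored = [(evaluate(config), index, config) for index, config in enumerate(configs)]
    # The index breaks ties so the config dicts are never compared.
    score, _, config = max(scored)
    return config, score
```

In practice each config would be submitted as its own job, with the first one launched alone to confirm a healthy start.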

C10 tracking + alerts · R3 minimal fix · R4 no scope change · R7 OOM recovery

Phase 7 — Evaluation, persistence & completion · §5.8

A task is not done until:

The required output exists (final model / reached metric / updated dataset).
The output is persisted to durable storage.
The model has been evaluated and confirmed to work (not merely produced).
For training runs, a working monitoring dashboard URL has been provided.

Before ending a turn the agent MUST verify it actually did the task (not just described it), that any failure was diagnosed and fixed (or clearly explained with a request for input), and that all referenced artifacts are linked by direct URL. It MUST NOT mark plan items completed if they failed or are partial.

C11 storage · C12 evaluation · Principle 4 · R14 links

Autonomous / headless loop discipline · §5.9

When running with no human in the loop (fixed time/compute budget, no one to re-prompt):

Every step makes progress via an action. A response that performs no action ends the loop with no way to resume; the agent MUST always take a next action (work the plan, verify outputs, or plan ahead) rather than returning idle text.
Do not stop early. While budget remains, the agent MUST keep improving and MUST NOT declare itself "done" or ask whether to continue. There is no one to answer.
Iterate as a loop, not a checklist. After a working result, keep going: research → implement → train/evaluate → persist → improve → research again.
When out of ideas, go back to the literature. Crawl citation graphs deeper, read unread papers, combine recipes, re-read the task and training logs for missed angles.
Budget time explicitly. Check remaining budget periodically and reserve a margin at the end (ref: ~10 minutes) for final evaluation and saving, so the loop never ends with an unsaved or unevaluated result.
Out-of-band notifications are used only when the user asked for them or the task clearly requires reporting to a configured destination — not for routine chatter.
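
Explicit time budgeting, as described in the list above, might look like the following. The `Budget` class is a hypothetical helper; the 600-second default mirrors the ~10 minute reference margin, and the injectable clock exists only to make the sketch testable.

```python
import time


class Budget:
    def __init__(self, total_seconds, reserve_seconds=600, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds
        self._reserve = reserve_seconds

    def remaining(self) -> float:
        return max(0.0, self._deadline - self._clock())

    def can_start(self, estimated_seconds) -> bool:
        # New work must fit in front of the reserved margin.
        return self.remaining() - self._reserve >= estimated_seconds

    def must_finalize(self) -> bool:
        # Inside the margin: stop iterating, evaluate, and save.
        return self.remaining() <= self._reserve
```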

C14 notifications

Roles & components

§3 · abstract

The workflow is defined over these abstract roles. An implementation MAY collapse or split them, but the responsibilities MUST exist somewhere.

Orchestrator

Drives the phases, maintains the plan, enforces the rules and the control contract, and decides what to do next. Owns the main conversation/context.

Research subagent

A separately-contexted worker that performs literature and documentation mining and returns a compact, structured summary, keeping the orchestrator's context clean. See Phase 1.

Execution surface

The capabilities that act on the outside world: resource discovery/inspection, sandbox code execution, managed job submission and monitoring, durable storage. Provided by hf-skills.

Tracker

The experiment-tracking capability used both to record metrics and to emit and read back structured alerts that drive iteration decisions (Phase 6).

Plan

An explicit, ordered, mutable to-do list that makes progress legible and decomposes multi-step work (Phase 0).

Capability requirements

§4 · C1–C14

The abstract capabilities the workflow requires. Each MUST be satisfiable by hf-skills (or the host harness). Appendix A gives the concrete mapping and flags the gaps.

#	Abstract capability	Required for
C1	Search the model/dataset hub; fetch repo details	Resource discovery & validation
C2	Inspect a dataset: schema, columns, splits, sample rows, statistics	Data audit (Phase 3)
C3	Search and read research papers; follow links to code/datasets	Research (Phase 1)
C4	Trace citation graphs (references and forward citations)	Deep research (partial gap)
C5	Search/retrieve current library documentation	Research, implementation
C6	Find and read working example code	Research, implementation
C7	Execute code in a disposable sandbox (CPU and GPU tiers)	Preflight (Phase 4) (partial gap)
C8	Submit, configure, monitor, and cancel managed compute jobs	Job execution (Phase 5)
C9	Train/fine-tune models with standard methods (SFT/DPO/GRPO, etc.)	Implementation
C10	Record metrics and emit/read structured training alerts	Monitoring & iteration (Phase 6)
C11	Durable storage for models, datasets, logs, results	Persistence (Principle 4)
C12	Evaluate a model on a benchmark/task	Completion (Phase 7)
C13	General web/document retrieval	Research fallback (gap · host harness)
C14	Out-of-band notification (optional)	Reporting (§5.9) (gap · host harness)

Hard rules & invariants

§6 · R1–R14

The non-negotiable rules. They restate the principles as concrete prohibitions and requirements.

§6.1 Persistence

R1 Any artifact that must outlive the run MUST be pushed to durable storage as part of the run, with a concrete destination id set before launch. "I'll grab it after" is not available.

R2 Outputs that cannot be pushed by the training process itself (logs, scripts, side artifacts) MUST be uploaded to durable storage explicitly.

§6.2 No scope-changing fixes

R3 On error, the agent MUST apply the minimal fix that preserves the user's original request, grounded in research/examples.

R4 The agent MUST NOT, to escape an error, change the training method, reduce sequence length, switch dataset/model, or disable monitoring. If the original approach genuinely cannot work, it explains why and asks the user first.

§6.3 Timeouts · §6.4 Hardware sizing

R5 The agent MUST set a job timeout sized to the actual work and MUST NOT leave the short interactive default. Ref: small ~2–4h, mid ~4–8h, large ~8–24h.

R6 Compute MUST be sized to the model footprint. Ref: ~1–3B → small single-GPU; ~7–13B → large single-GPU; ~30B → multi-GPU/large-memory; ~70B+ → multi-GPU high-memory. Memory, not the tier's name, is what matters.
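
The reference figures in R5 and R6 can be kept as a small lookup. This is a sketch under assumptions: the spec gives approximate bands with gaps between them (for example between 3B and 7B), so the exact cut-offs below are this example's choice, and the function names are hypothetical.

```python
# R5 reference timeout bands in hours, keyed by job size.
TIMEOUT_HOURS = {"small": (2, 4), "mid": (4, 8), "large": (8, 24)}


def timeout_range_hours(job_size: str) -> tuple:
    """Return the (low, high) reference timeout band for a job size."""
    return TIMEOUT_HOURS[job_size]


def suggest_tier(params_billion: float) -> str:
    """Map a model's parameter count (in billions) to a reference tier."""
    # Cut-offs between the reference bands are assumptions of this sketch.
    if params_billion <= 3:
        return "small single-GPU"
    if params_billion <= 13:
        return "large single-GPU"
    if params_billion < 70:
        return "multi-GPU / large-memory"
    return "multi-GPU high-memory"
```

The returned label is only a starting point: the check that matters is whether the chosen hardware's memory fits the model and training state.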

§6.5 Out-of-memory recovery

R7 On OOM the agent MUST, in order: (1) reduce per-device batch size and increase gradient accumulation proportionally to keep the effective batch size identical; (2) enable gradient checkpointing; (3) move to larger-memory hardware. It MUST NOT switch training method or reduce sequence length to resolve OOM.
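
The recovery order in R7 can be sketched as a function that returns the next config after each OOM. The config keys are hypothetical names, and exhausting batch-size reduction before moving to the next step is one reading of "in order". The effective batch size, per-device batch size times accumulation steps, is kept exactly identical by shrinking the batch to a divisor of itself.

```python
def next_oom_step(cfg: dict) -> dict:
    """Return the config to try after an out-of-memory failure."""
    new = dict(cfg)
    batch = cfg["per_device_batch_size"]
    if batch > 1:
        # Step 1: shrink to the largest proper divisor and scale gradient
        # accumulation by the same factor, so the product is unchanged.
        smaller = max(d for d in range(1, batch) if batch % d == 0)
        new["per_device_batch_size"] = smaller
        new["gradient_accumulation_steps"] = (
            cfg["gradient_accumulation_steps"] * (batch // smaller)
        )
    elif not cfg.get("gradient_checkpointing", False):
        # Step 2: trade compute for memory.
        new["gradient_checkpointing"] = True
    else:
        # Step 3: nothing left to shrink; request larger-memory hardware.
        new["needs_larger_memory_hardware"] = True
    # Training method and sequence length are never touched.
    return new
```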

§6.6 Resource integrity

R8 The agent MUST NOT silently substitute a dataset or model. If a requested resource is unavailable, it tells the user and asks.

R9 Schema/columns/format MUST be verified by inspection before use; the agent MUST NOT assume them.

§6.7 Error recovery (general)

R10 On failure the agent reads the full error/log, diagnoses the actual cause, and changes something specific. It MUST NOT retry the identical action unchanged; if a call fails repeatedly for the same reason, it tries a fundamentally different approach.

R11 API/import errors are resolved by re-checking current documentation and examples, not by guessing.

§6.8 Build cost discipline · §6.9 Secrets & links

R12 The agent SHOULD prefer prebuilt/managed components over compiling heavy dependencies from source inside a job. Extra build steps are taken only when nothing prebuilt covers the need, with the reason documented.

R13 Credentials are taken from the environment and never logged or exposed.

R14 Every referenced model, dataset, paper, job, or dashboard MUST be given as a direct URL.

Harness control contract

§7 · keeps the loop bounded & unstuck

These rules govern the loop itself — they keep an autonomous agent bounded, unstuck, and within its context budget. Independent of the ML domain; enforced by the host harness or hooks. Numbers are reference defaults.

§7.1 Bounded iteration

The main loop MUST be bounded by a maximum iteration count (configurable; unbounded only by explicit opt-in). It exits when the model returns no further actions and no plan item remains unfinished, on user cancellation, or on unrecoverable error.

§7.2 Repetition / "doom-loop" guard

The harness MUST detect a stuck agent and inject a corrective instruction, over a recent window (ref: last ~30 actions):

Identical repetition: same action + same arguments repeated (ref: 3 in a row) → "stop repeating this, try a fundamentally different strategy".
Cyclic repetition: a short sequence (ref: length 2–5) repeated (ref: ≥2 full cycles) → "you are in a repeating cycle, break it".

Action signatures SHOULD incorporate the action's result as well as its arguments, so legitimate polling (same call, changing result) is not misclassified.
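
A detector for both repetition patterns could look like this. It is a sketch under assumptions: signatures are taken to be hashable values such as `(action, arguments, result)` tuples, the function name is hypothetical, and "≥2 full cycles" is read as the same short sequence occurring at least twice back to back.

```python
def detect_repetition(signatures, window=30, identical_run=3,
                      min_cycle=2, max_cycle=5, min_cycles=2):
    """Return "identical", "cyclic", or None for the recent action history."""
    recent = list(signatures)[-window:]

    # Identical repetition: the last N signatures are all the same.
    if len(recent) >= identical_run and len(set(recent[-identical_run:])) == 1:
        return "identical"

    # Cyclic repetition: a short pattern repeated end to end.
    for length in range(min_cycle, max_cycle + 1):
        needed = length * min_cycles
        if len(recent) < needed:
            break
        tail = recent[-needed:]
        pattern = tail[:length]
        if all(tail[i] == pattern[i % length] for i in range(needed)):
            return "cyclic"
    return None
```

Because the result is part of the signature, polling the same job while its status changes produces distinct signatures and is not flagged.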

§7.3 Continuation guard

If the model produces no action while the plan still has unfinished items, the harness MUST NOT immediately hand control back to the user. It injects a continuation prompt ("the task is not complete, take at least one action now") and retries a small number of times (ref: 2) before yielding. Any action resets the counter.
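
The continuation guard combines naturally with the iteration bound from §7.1. In this sketch `step` and `plan_unfinished` are hypothetical callables supplied by the harness, and the default of 100 iterations is an arbitrary placeholder, since the spec leaves the cap configurable.

```python
CONTINUE_PROMPT = "the task is not complete, take at least one action now"


def run_loop(step, plan_unfinished, max_iterations=100, max_nudges=2):
    """Drive the agent until it finishes, yields, or hits the iteration cap.

    step(injected) returns an action, or None when the model only produced text.
    plan_unfinished() reports whether any plan item is still open.
    """
    nudges = 0
    injected = None
    for _ in range(max_iterations):
        action = step(injected)
        injected = None
        if action is not None:
            nudges = 0  # any action resets the counter
            continue
        if not plan_unfinished():
            return "finished"
        if nudges >= max_nudges:
            return "yielded"  # hand control back after the retries are spent
        nudges += 1
        injected = CONTINUE_PROMPT
    return "iteration_cap"
```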

§7.4 Malformed-action guard

A short streak of malformed actions for the same tool (ref: 2 in a row) MUST trigger a corrective injection ("stop retrying, use a different strategy") rather than letting the agent grind.

§7.5 Output-truncation recovery

If a model response is cut off by the output limit, the harness MUST NOT have the agent blindly resend the same oversized payload; it injects guidance to use a different mechanism (e.g. write large content via a file/heredoc rather than inline) and retries.

§7.6 Context compaction

The harness MUST keep working context within the model's window by compacting when usage crosses a high-water mark (ref: ~90%). Compaction MUST preserve the system instructions, the original task message, and a recent tail (ref: ~5 messages); the middle is summarized into a single record. Oversized individual messages MAY be truncated with a placeholder (ref cap: ~50k tokens/message), except the system message. Compaction MUST be bounded: if it cannot bring usage under threshold, the session terminates cleanly rather than retrying forever.
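
A simplified sketch of the compaction policy. It assumes messages are dicts with `role` and `content`, that the first two messages are the system instructions and the original task, and that `count_tokens` and `summarize` are supplied by the harness; all of these names are hypothetical. Per-message truncation is left out for brevity.

```python
def compact(messages, count_tokens, summarize, window, high_water=0.9, tail=5):
    """Summarize the middle of the conversation once usage crosses the mark."""
    def usage(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    limit = high_water * window
    if usage(messages) <= limit:
        return messages  # below the high-water mark: nothing to do

    head, rest = messages[:2], messages[2:]  # system + original task are kept
    middle, recent = rest[:-tail], rest[-tail:]
    if not middle:
        raise RuntimeError("context over budget and nothing left to summarize")

    summary = {"role": "summary", "content": summarize(middle)}
    compacted = head + [summary] + recent
    if usage(compacted) > limit:
        # Bounded: one attempt, then terminate cleanly instead of retrying.
        raise RuntimeError("compaction could not bring usage under the threshold")
    return compacted
```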

§7.7 Approval gate

Outward-facing or costly/destructive actions MUST pass an approval policy:

Auto-approved: read-only research, inspection, discovery; routine code execution in the default low-cost sandbox; status/metadata queries.
Approval-required: provisioning non-default (GPU/larger) compute; submitting paid compute jobs; destructive storage operations (delete repo, delete branch/tag, merge, force-upload/overwrite); creating durable repos.
Always human-gated: recurring/scheduled jobs (a standing cost commitment) require explicit human approval even under otherwise-autonomous policies.

An autonomous mode MAY auto-approve the approval-required class up to a cost cap, tracking estimated spend across the batch. Scheduled/recurring commitments remain human-gated regardless.
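
The three approval classes and the autonomous cost cap can be sketched as a small gate. The action names in the table are hypothetical examples, and treating unknown actions as approval-required is an assumption of this sketch rather than something the spec states.

```python
AUTO, NEEDS_APPROVAL, HUMAN_GATED = "auto", "approval_required", "human_gated"

# Hypothetical action names, one or more per class described above.
ACTION_CLASS = {
    "read_paper": AUTO,
    "inspect_dataset": AUTO,
    "sandbox_run": AUTO,
    "job_status": AUTO,
    "provision_gpu": NEEDS_APPROVAL,
    "submit_paid_job": NEEDS_APPROVAL,
    "delete_repo": NEEDS_APPROVAL,
    "create_repo": NEEDS_APPROVAL,
    "schedule_job": HUMAN_GATED,
}


class ApprovalGate:
    def __init__(self, autonomous=False, cost_cap=0.0):
        self.autonomous = autonomous
        self.cost_cap = cost_cap
        self.spent = 0.0  # estimated spend approved so far

    def decide(self, action: str, estimated_cost: float = 0.0) -> str:
        kind = ACTION_CLASS.get(action, NEEDS_APPROVAL)
        if kind == AUTO:
            return "allow"
        if kind == HUMAN_GATED:
            return "ask_human"  # standing commitments are never auto-approved
        if self.autonomous and self.spent + estimated_cost <= self.cost_cap:
            self.spent += estimated_cost
            return "allow"
        return "ask_human"
```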

§7.8 Effort / budget probing

The harness MAY tune the model's reasoning effort to the highest level the selected model supports, degrading gracefully when a level is rejected, and SHOULD do so cheaply (a tiny probe) and cache the result per model. This is a quality/cost optimization and is non-normative for the ML workflow itself.

State & artifacts

§8

Across the workflow the agent maintains:

The plan

The live decomposition and progress (Phase 0).

Research findings

The recipe table, code patterns, references that ground implementation — the authority later phases cite (Phase 1).

Validated resources

Confirmed model, dataset (with verified schema), and chosen hardware (Phases 2–3).

Run records

Job ids, configs, tracker project/run names, dashboard URLs, and the alert history that drives iteration (Phases 5–6).

Durable outputs

The persisted model/dataset/logs and evaluation results, each linked by URL (Phase 7).

Completion criteria

§9 · conformance summary

An execution conforms to this spec if, for an ML implementation request:

Research preceded implementation, and the implementation cites concrete findings (Principle 1, Phase 1).
Resources were verified by inspection, including dataset-format-to-method compatibility (Principle 2, Phases 2–3).
GPU code was smoke-tested before scaling, or the omission was justified (Principle 3, Phase 4).
The pre-flight checklist was satisfied before any job, batches went one-job-first, and durable persistence was configured up front (Phase 5, Principle 4).
Monitoring emitted structured alerts and the next iteration was driven by them (Phase 6).
The result was persisted and evaluated, and all artifacts were linked (Phase 7).
No rule in §6 was violated; in particular no silent scope change or resource substitution occurred.
The loop stayed bounded and unstuck under the control contract (§7).

Appendices

A: mapping (informative) · B: realization (non-normative)

Appendix A — Capability → hf-skills mapping (informative)

How the abstract capabilities of §4 are satisfied by hf-skills. The skill names are the execution surface — do not reimplement them.

Cap	Provided by (hf-skills)	Notes
C1	`hf-cli` (models/datasets list/info), `huggingface-best`, hub MCP tools	Model/dataset discovery & validation.
C2	`huggingface-datasets` (Dataset Viewer), trainer-skill validation helpers	Schema, splits, samples, stats. Satisfies the §5.4 audit.
C3	`huggingface-papers`, `hf-cli` papers, `huggingface-paper-publisher`	Read methodology from the markdown; follow linked artifacts.
C4	Partial gap. `huggingface-papers` exposes linked artifacts but not a full citations graph.	Crawl downstream work via paper-page links + host web retrieval (C13); accept reduced fidelity and say so.
C5	Companion doc tools (`hf_doc_search` / `hf_doc_fetch`)	Current TRL/Transformers/etc. APIs.
C6	Trainer skills ship reference scripts; host harness can read repos	Prefer copying production templates over synthesizing.
C7	Partial gap. No drop-in GPU-sandbox skill.	Realize §5.5 preflight as a short, cheap job (C8) on a small GPU flavor with a tiny subset, or a local GPU smoke test via `huggingface-community-evals` (`--limit`).
C8	`hf-cli` (jobs run/inspect/logs/cancel, scheduled jobs), `hf_jobs`	Set timeout (R5), flavor (R6), env/secrets, persistence.
C9	`huggingface-llm-trainer`, `huggingface-vision-trainer`, `train-sentence-transformers`	Method ↔ data-shape rules in §5.4 align with these skills.
C10	`huggingface-trackio` (init/log/alert/finish, CLI `--json`, Space dashboards)	This is the §5.7 decision channel.
C11	`hf-cli` (upload/repos/buckets), trainers' `push_to_hub`, datasets upload	Satisfies R1/R2.
C12	`huggingface-community-evals` (inspect-ai / lighteval), trainer eval hooks	Satisfies §5.8 "evaluated and confirmed".
C13	Gap in hf-skills. Source from the host harness's web search/fetch.	Used by research (C4 fallback) and non-HF docs.
C14	Gap in hf-skills. Source from the host harness (messaging) if needed.	Optional; gated per §5.9.

Gaps to handle explicitly: C4 (full citation graph), C7 (dedicated GPU sandbox), C13 (general web), C14 (notifications). For each, the spec keeps the rule and lets the implementer satisfy it with the nearest mechanism, stating any reduced fidelity.

Appendix B — Realizing the workflow as skills / subagents / hooks (non-normative)

Illustrative only. The intent of the split: hf-skills provides the doing, and the researcher provides only the discipline — phase ordering, verification gates, persistence and anti-scope-change rules, and loop guards.

Driving skill(s). Encode the phase order and gate conditions (§5) as the researcher's top-level procedure: research → validate → audit → preflight → submit → monitor → iterate → evaluate, delegating each concrete action to the relevant hf-skill.
Research subagent. Implement §5.2 as a separate subagent with its own context and a read-only toolset. Return schema = recipe table + code patterns + references. This is the one place a subagent is structurally required (context isolation).
Hooks for the control contract (§7) and hard gates (§6), enforced rather than merely requested:
- Pre-job hook: refuse a job unless the §5.6 pre-flight checklist is satisfied; refuse local paths in scripts.
- Batch hook: allow only one job from a batch until its logs confirm a healthy start.
- Repetition / continuation / malformed guards: the §7.2–7.4 detectors.
- Compaction hook: the §7.6 policy.
- Approval hook: the §7.7 policy.
Plan/tracker. Use the host harness's todo mechanism for the §5.1 plan, and huggingface-trackio for the §5.7 alert-driven loop.