Mainspring

Source: docs/prd.md.

Mainspring — Product Requirements Document (PRD)

Status: canonical plan for Mainspring (autonomous execution loop). Lives at docs/prd.md. The Method that produced this document is at docs/method.md; Mainspring was built with the same PRD-first discipline it now ships. Owner: Mainspring maintainers. Companion docs: docs/guide.md (operator commands and recovery shortcuts).

This is the document that drives Mainspring as a clean modular Apache-2.0 OSS tool prepared for the mainspring v1.0.0 GitHub source release. Nothing else in the repo speaks for Mainspring’s plan; if it disagrees with this file, this file wins.


Contents

  1. Mission
  2. Durability principles
  3. Current truth snapshot (verified)
  4. Naming and brand boundary
  5. Target architecture
  6. Phase map
  7. Architecture decisions (ADRs)
  8. Operational doctrine
  9. Health rituals
  10. Disaster recovery
  11. Versioning and migration
  12. v1.0 GitHub release checklist
  13. Explicit non-goals
  14. Backlog (Must / Should / Could / Won’t)
  15. Appendix A — Source-of-knowledge recipes
  16. Appendix B — Verification commands
  17. Appendix C — Competitor landscape / competitive positioning

Mission

Mainspring is a single-operator, single-host autonomous execution instrument: it picks work from a Product Requirements Document (PRD) or backlog, runs a writer model or CLI against it, runs an independent reviewer as a hard gate, captures verifiable outcomes as JSONL, and stops or continues based on what actually shipped — not what the agent claims.

Mainspring is positioned as PRD-first AI coding agent orchestration for production-grade software delivery, not as another generic autonomous coding chat wrapper. The public surface must explain the practical buyer problem first: vibe coding is useful for exploration, but production work needs intent, review, evidence, visibility, and recovery.

Three verbs: pick, ship, record. Anything that doesn’t serve one of these is decoration.

Audience:

  1. Operators shipping production-grade local projects with AI agents. Primary.
  2. Developers who want a local PRD-first AI coding loop with review, evidence, HUD, and Telegram. Secondary.
  3. Contributors extending engines, backlog sources, docs, or release tooling. Tertiary.

Non-audience: enterprise teams, compliance auditors, cloud SaaS users. Mainspring is a local operator tool, not a hosted platform.


Durability principles

The ten commandments. Anything below this line bows to them. If a phase plan, a feature, or an ADR conflicts with one of these, the plan loses.

  1. The wave is the unit. One pass of pick → write → review → log → decide is one wave. Mainspring’s correctness, observability, and metrics all hinge on the wave being a clean atomic concept. Never blur it.
  2. Truth before autonomy. A wrong autonomous loop is worse than a slow one. Mainspring never marks a Taskmaster item done unless the reviewer hard-gate passed and the changed-file check confirmed product code moved.
  3. The reviewer is the only gate. No silent fallback that bypasses the review verdict. If the reviewer is unreachable, the wave fails closed (logs + Telegram, no auto-pass).
  4. Writer output must be visible to the reviewer. The SC2259 silent-failure bug (heredoc-overrides-stdin) is the canonical anti-example. Every writer→reviewer handoff must be byte-verifiable. Never rebuild that bug.
  5. JSONL is the contract. .mainspring/logs/waves.jsonl schema (defined below) is frozen. Adding fields is non-breaking; removing or renaming a field requires a schema_version bump and a 90-day deprecation window.
  6. The CLI is the contract. Every flag in --help is a public API. Adding flags is non-breaking; removing a flag requires deprecation + stderr warning for one minor version.
  7. No embedded heredocs. Bash dispatches; Python scripts compute. No python3 - <<'PY', no python3 -c '...' longer than 80 chars in critical paths. Pretty-printing is a real .py file with tests.
  8. Fail closed, fail loud. If we can’t reach the LLM, the writer’s output, the review verdict, the JSONL writer, or the lock file — we stop the wave with a stderr message. Never paper over.
  9. No fictional features. The doctrine, the guide, and --help may only describe behavior present in the current commit. If a phase claims a feature, the test for it must already be green.
  10. Reversible by default. Auto-checkpoint commits are OK. Destructive cleanup, force shutdown, force push, rm -rf of worktrees, and git reset --hard all need an explicit user gate (--restart-team, --repair-state, --force). Mainspring should never destroy the user’s in-progress work to make its own bookkeeping cleaner.
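
Principles 2, 3, and 8 combine into one decision rule at the end of each wave. The following is a minimal sketch of that rule, not the shipped implementation: `WaveOutcome` and `decide_wave` are hypothetical names introduced for illustration, while `verdict` and `product_files_changed` mirror the JSONL field names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WaveOutcome:
    """Hypothetical container for what one wave observed (illustrative only)."""
    verdict: Optional[str]       # "PASS", "FAIL", or None if the review never returned
    product_files_changed: int   # changed product files after the writer ran


def decide_wave(outcome: WaveOutcome) -> str:
    """Return "done" only when both gates hold; otherwise refuse to mark done."""
    if outcome.verdict is None:
        # Reviewer unreachable or crashed: fail closed, never auto-pass.
        return "fail_closed"
    if outcome.verdict != "PASS":
        return "not_done"
    if outcome.product_files_changed <= 0:
        # A PASS with no product code moved is not a shipped outcome.
        return "not_done"
    return "done"
```

The point of the shape is that there is exactly one path to "done", and every missing input lands on a non-done result.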

Current truth snapshot (verified)

Captured 2026-06-15 after HUD empty-state and active-card polish, packaged runtime Python isolation for global HUD use, command-help hardening, public release-surface copy cleanup, installed-wheel CLI contract guards, and a fresh release gate against the standalone repository layout.

Metric | Value | Source
mainspring.sh LOC | 391 | wc -l mainspring.sh on 2026-06-16
Bash functions across entrypoint + lib | 171 | rg function-pattern scan on mainspring.sh lib/*.sh on 2026-06-16
Source shell syntax | passes | bash -n mainspring.sh lib/*.sh
Source shell lint | passes | shellcheck -S warning mainspring.sh lib/*.sh
Python lint | passes | ruff check py
Python format | passes, 91 files already formatted | ruff format --check py
Active embedded Python in shell paths | 0 heredoc / inline-parser hits | rg scan for python3 - << and python3 -c in shell
Pytest suite files | 42 | find py/tests -name 'test_*.py' plus count
Bats suite files | 12 | find tests/bats -name '*.bats' plus count
Python line coverage | 90.6% (coverage gate pass) | make coverage on 2026-06-16
Full local gate | passes: 1124 Python tests passed, 1 skipped; 223 Bats; HUD/docs-site smoke; dependency audit; package smoke; PRD validation | make release-check on 2026-06-16
Global editable CLI | passes: make install-user; command -v mainspring resolves ~/.local/bin/mainspring; installed HUD captured output exits one-shot; unsupported release-management probes fail with Unknown command | local pipx smoke on 2026-06-15
Taskmaster runtime state | not tracked in source release | py/tests/test_no_hardcoded_paths.py
Public release checklist | clean main is public; hosted CI and GitHub Pages are green; remaining external action is signing v1.0.0 and publishing the GitHub Release | GitHub repository state, hosted Actions runs, Pages deployment, signing key
PRD readiness score | 900 / 1000 | source-install product path is green; remaining points are publication checklist items
Telegram notifier health in clean env | disabled until MAINSPRING_NOTIFY_ENABLED=1 | ./mainspring.sh notify-health --format json

Resolved historical critical bug: the former SC2259 path "${cmd[@]}" | tee "$raw_file" | python3 - <<'PY' ... PY let the heredoc override piped writer output, so the reviewer could see an empty display_file. P1 extracted stream prettification into py/stream_json_prettify.py, added the non-empty display-file regression, and removed active embedded Python from source shell paths.

What works in v1 today (confirmed):

What remains for publication:

1000-point PRD score rule: 900 points are reserved for product readiness: implementation, tests, documentation, packaging smoke, installed CLI behavior, HUD/Telegram usability, and security/public-repo hygiene. The remaining 100 points are reserved for publication checklist items. The current score 900/1000 holds until the signed v1.0.0 release is created from the final clean commit.


Naming and brand boundary

Three names that must never blur:

Name | What it is | Where it lives
Mainspring | This tool: the autonomous execution loop. The CLI binary, the runtime, the brand. | Standalone mainspring repository; mainspring.sh plus packaged mainspring console entry point. Apache-2.0.
Team backend | External dependency: a separate team-orchestration CLI that Mainspring uses only for explicit --topology team runs. Not part of the normal solo path. | Wherever the backend CLI is installed on PATH. Mainspring depends on it for team topology only.
Taskmaster | External dependency: backlog source. Mainspring picks work from it. | task-master CLI + .taskmaster/ directory.

The user-facing bag:

Compatibility boundary: MAINSPRING_* and .mainspring/ are the public names. Older pre-v1 runtime names are not part of the public configuration contract; historical ledger/replay readers may still parse old recorded fields, but new launches use the Mainspring namespace.


Target architecture

The v1 source tree is intentionally boring: one Bash entrypoint, small Bash modules, tested Python helpers, committed docs, and gitignored project-local runtime state.

File layout

mainspring.sh                      # Bash entrypoint and CLI dispatch
lib/                               # Bash modules: lock, log, help, status,
                                   # doctor, notify, team, wave, wizard
py/                                # tested Python helpers and CLIs
  engines/                         # EngineAdapter implementations
  bench/                           # source-only SWE-bench helpers; not installed runtime
  tests/                           # pytest suite
tests/bats/                        # shell integration tests
docs/                              # README-linked operator, method, metrics,
                                   # architecture, PRD, and operator docs
method/                            # reusable Mainspring Method templates/skill
presets/                           # built-in run profiles
schema/                            # project config JSON Schema
packaging/homebrew/                # source-only Homebrew tap publishing runbook
.mainspring/                       # gitignored project-local runtime state
  logs/waves.jsonl                 # append-only wave ledger
  state/last-run.env               # safe saved setup, parsed without source
  state/notify-state.json          # Telegram dedup/rate-limit state

CLI contract (frozen at v1.0)

mainspring [taskmaster|night] [flags]
  Modes:
    taskmaster                     read .taskmaster/ backlog, pick ready work
    night                          read PRD brief, writer chooses next slice

  Topology:
    --topology solo|team
    --pair <writer>+<reviewer>     e.g. claude+codex, gemini+claude
    --engine <name>                writer engine (when --pair not used)
    --review-engine <name>         reviewer engine
    --model <id>                   override writer model
    --review-model <id>            override reviewer model
    --speed-profile standard|fast|max
    --max-agents 1-6
    --once                         single wave then exit
    --prd <path>                   night mode PRD path

  Observe / inspect:
    hud [--once|--json|--local]    global live operator dashboard
    status                         runtime + git + scheduler + waves snapshot
    last-run [--format json]       show saved setup + repeat commands
    --metrics [--days N]           query waves.jsonl
    engines [--json]               registered engine inventory
    limits [engine ...] [--hours N] run-readiness, quota, and spend snapshot

  Setup / planning:
    init <name>                    scaffold Method PRD docs
    validate-prd <path>            validate Product Requirements Document shape
    decompose <prd-path>           turn one PRD phase into Taskmaster tasks
    next [tasks.json]              print next blocker-aware task id
    scope-check [tasks.json]       audit Taskmaster task shape

  Recovery / verification:
    doctor                         env + dependency sanity check
    stop --force [--all]           stop recorded Mainspring processes
    --repair-state --dry-run       preview stale runtime cleanup
    --repair-state --force         apply reviewed stale runtime cleanup
    --self-test                    one self-test wave on a synthetic task
    --self-test-all                full pair-mode matrix
    notify-test                    send sample Telegram notification
    notify-health [--format json]  inspect notifier daemon state
    notify-restart                 restart only the recorded notifier daemon

  Evidence / local maintenance:
    replay <show|diff|build|run>   inspect or reconstruct recorded waves
    --list-presets                 print available presets

  Run modifiers:
    --wizard                       interactive setup
    --last-run                     reuse .mainspring/state/last-run.env
    --restart-team                 destructive: reset active team backend state
    --preset <name>                load preset env
    --dry-run                      print resolved settings, no API calls

JSONL wave schema (frozen at v1.0, schema_version=1)

.mainspring/logs/waves.jsonl — one JSON line per completed wave. Append-only via flock on waves.jsonl.lock.

Required fields (frozen — adding new ones is the only allowed change without a schema_version bump):

Field      Type                       Description
ts         string (ISO-8601 UTC, Z)   wave completion timestamp
mode       enum                       taskmaster | night
engine     enum                       writer engine: codex | claude | gemini | …
wave       integer                    1-indexed wave counter within the run
exit_code  integer                    writer exit code

Standard optional fields (always emitted, may be null):

Field                   Type           Description
review_engine           enum           reviewer engine
model, review_model     string         model ids
pair                    string         <engine>+<review_engine> for easy jq grouping
task_id                 string | null  Taskmaster id
work_id                 string | null  subtask id when applicable
topology                enum           solo | team
team_name               string | null  active team name when topology=team
duration_s              number         wall-clock seconds
product_files_changed   integer        count from count_product_file_changes
verdict                 enum           PASS | FAIL | null (review crashed)
chapter_delta           string         +50 / -3 / 0 style, signed
competitor_delta        string         same
launch_delta            string         same
product_score           integer        0–1000 rubric
retry_used              boolean        one-shot reviewer retry was triggered
failure_reason_class    string | null  routing:plugin_invisible, engine:quota, review:invalid_json, …
codex_short_delta_pct   number | null  usage delta as % of short window
claude_short_delta      number | null  Claude usage delta
gemini_short_delta_pct  number | null  future engines extend the same shape

Schema versioning: required fields are frozen. Removing or renaming one bumps schema_version and triggers a 90-day deprecation window where wave_log.py writes both old and new shapes.
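
The required-field rule and the append-under-lock rule can be sketched together. This is an illustration under stated assumptions rather than the code in wave_log.py: `missing_required` and `append_wave` are hypothetical helper names, and the lock path is derived as `<log file>.lock` to match the waves.jsonl.lock name given above.

```python
import fcntl
import json
from pathlib import Path

# Required fields as listed in the schema table above.
REQUIRED_WAVE_FIELDS = ("ts", "mode", "engine", "wave", "exit_code")


def missing_required(record: dict) -> list:
    """Return the required field names absent from a wave record."""
    return [name for name in REQUIRED_WAVE_FIELDS if name not in record]


def append_wave(log_path: Path, record: dict) -> None:
    """Append one JSON line while holding an exclusive flock on <log>.lock."""
    missing = missing_required(record)
    if missing:
        # Fail loud instead of writing a partial record.
        raise ValueError(f"wave record missing required fields: {missing}")
    log_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = log_path.with_name(log_path.name + ".lock")
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record, sort_keys=True) + "\n")
        # The lock is released when lock_file closes at the end of the block.
```

Because new optional fields are simply extra keys, a reader written against this check keeps working when fields are added, which is the non-breaking case the schema rule describes.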

Concrete code shapes (load-bearing)

These are the current public contracts that anchor the architecture. The live source tree is the root-level mainspring.sh, lib/, and py/ layout.

run_ai_turn(role, prompt, log, display) — the engine dispatcher

# lib/engines.sh delegates command construction to py/engines/registry.py.
# Direct CLI engines (claude/codex) and provider engines (gemini, openai,
# anthropic, azure, openrouter, mistral, grok, ollama, litellm) all fail closed
# through the same registry readiness checks before a wave launches.

Adding a new engine means adding one adapter under py/engines/, registering the default model/readiness contract, and covering it with registry tests. Provider engines must never silently fall back to another provider, model, or reviewer.

acquire_lock / release_lock — flock on fd 9

# lib/lock.sh
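acquire_lock() {
  mkdir -p "$(dirname "$LOCK_FILE")"
  # Open in append mode: a plain '>' would truncate the file and erase the
  # current holder's PID before we know whether we own the lock.
  exec 9>>"$LOCK_FILE"
  if ! flock -n 9; then
    local existing_pid
    existing_pid="$(cat "$LOCK_FILE" 2>/dev/null || true)"
    echo "Mainspring already running (pid ${existing_pid:-unknown}); stop it or wait." >&2
    exit 1
  fi
  # We hold the lock: replace any stale PID with ours.
  : >"$LOCK_FILE"
  echo "$$" >&9
}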
release_lock() {
  exec 9>&- 2>/dev/null || true
  if [ -f "$LOCK_FILE" ] && [ "$(cat "$LOCK_FILE" 2>/dev/null || true)" = "$$" ]; then
    rm -f "$LOCK_FILE"
  fi
}

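The kernel releases the flock on fd 9 when the holding process exits, including on SIGKILL, so a crashed run cannot block the next one, provided no child process still holds an inherited copy of fd 9. The lock file itself may stay on disk after an unclean exit; its PID content is purely advisory for human inspection.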
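
The same flock behavior can be observed from Python. This is a small demonstration of the locking semantics only, not a port of lib/lock.sh: `try_lock` is a hypothetical helper and the lock path is a throwaway temp file.

```python
import fcntl
import os
import tempfile


def try_lock(path: str):
    """Open path without truncating it and try a non-blocking exclusive flock.

    Returns the open fd on success, or None if another holder has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


lock_path = os.path.join(tempfile.mkdtemp(), "mainspring.lock")  # hypothetical path
first = try_lock(lock_path)    # first holder acquires the lock
second = try_lock(lock_path)   # refused while the first descriptor is open
os.close(first)                # closing the descriptor releases the lock
third = try_lock(lock_path)    # the lock can be acquired again
```

The lock belongs to the open file description, not to the file on disk, which is why a leftover lock file with an old PID in it never blocks a new run.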

check_write_scope — post-wave path guard

# lib/write_scope.sh
# Reads newline-separated changed file paths on stdin.
# Returns 0 if no path matches a denied pattern (secrets, dependencies, .git, build output).
# Returns 1 and prints offenders on stderr otherwise.
check_write_scope() {
  local offenders=()
  local path
  while IFS= read -r path; do
    [ -z "${path// }" ] && continue
    case "$path" in
      .env|.env.*|.secret|.secret.*|*/.env|*/.env.*|*/.secret|*/.secret.*)
        offenders+=("$path"); continue ;;
      node_modules/*|*/node_modules/*) offenders+=("$path"); continue ;;
      .git/*|*/.git/*)                  offenders+=("$path"); continue ;;
      dist/*|coverage/*|playwright-report/*|test-results/*)
        offenders+=("$path"); continue ;;
      src/*|apps/*|tests/*|e2e/*|docs/*|.taskmaster/*|scripts/*|shared/*|server/*|public/*|plugins/*)
        continue ;;
      *) continue ;;  # top-level dotfile-clean path tolerated
    esac
  done
  if [ "${#offenders[@]}" -gt 0 ]; then
    printf 'write_scope violation: %s\n' "${offenders[@]}" >&2
    return 1
  fi
  return 0
}

Invoked after the writer finishes, before review prompt build. Failure here forces a review-fail with reason scope:violation.
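
To see which paths the guard rejects, the deny arms of the shell `case` can be mirrored in Python. This is a sketch for reasoning about the patterns, not part of the Mainspring source: `scope_offenders` is a hypothetical name, and `fnmatchcase` is used because, like a shell `case` pattern, its `*` also matches `/`.

```python
from fnmatch import fnmatchcase

# Deny patterns copied from the case arms of check_write_scope above.
DENY_PATTERNS = (
    ".env", ".env.*", ".secret", ".secret.*",
    "*/.env", "*/.env.*", "*/.secret", "*/.secret.*",
    "node_modules/*", "*/node_modules/*",
    ".git/*", "*/.git/*",
    "dist/*", "coverage/*", "playwright-report/*", "test-results/*",
)


def scope_offenders(paths):
    """Return the changed paths that match a deny pattern, in input order."""
    offenders = []
    for path in paths:
        if path.replace(" ", "") == "":
            continue  # skip blank lines, like the shell loop
        if any(fnmatchcase(path, pattern) for pattern in DENY_PATTERNS):
            offenders.append(path)
    return offenders
```

One consequence worth noticing: patterns such as `dist/*` are anchored at the start of the path, so a nested `apps/web/dist/x.js` is not flagged by the patterns as written, while `node_modules`, `.git`, and secret files are caught at any depth.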

parse_review.py — required review JSON fields

REQUIRED_FIELDS = (
    "verdict", "chapters", "chapter_delta", "competitor_delta", "launch_delta",
    "product_score", "strengths", "gaps", "next_actions", "verification_evidence", "rationale",
)

The reviewer is prompted to emit a fenced json ... block. If absent, fall back to a Markdown KEY: VALUE parser (legacy v1 shape). If still missing required fields → review FAIL with reason review:missing_fields:<key> written to JSONL.
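
The parse order described above (fenced JSON first, KEY: VALUE fallback second, then the required-field check) can be sketched as follows. This is an illustration, not the contents of parse_review.py: `parse_review`, `JSON_BLOCK`, and `KEY_VALUE` are hypothetical names, and treating an unparsable JSON block as a reason to try the fallback parser is an assumption of this sketch.

```python
import json
import re

REQUIRED_FIELDS = (
    "verdict", "chapters", "chapter_delta", "competitor_delta", "launch_delta",
    "product_score", "strengths", "gaps", "next_actions", "verification_evidence", "rationale",
)

FENCE = "`" * 3  # built at runtime so the example can sit inside a fenced block
JSON_BLOCK = re.compile(FENCE + r"json\s*\n(.*?)\n\s*" + FENCE, re.DOTALL)
KEY_VALUE = re.compile(r"^([A-Za-z_]+):[ \t]*(.+)$", re.MULTILINE)


def parse_review(text: str):
    """Return (fields, failure_reason); failure_reason is None when complete."""
    fields = None
    match = JSON_BLOCK.search(text)
    if match:
        try:
            fields = json.loads(match.group(1))
        except json.JSONDecodeError:
            fields = None  # assumption: fall through to the legacy parser
    if not isinstance(fields, dict):
        # Fallback: Markdown-style KEY: VALUE lines, keys lower-cased.
        fields = {key.lower(): value.strip() for key, value in KEY_VALUE.findall(text)}
    for name in REQUIRED_FIELDS:
        if name not in fields:
            return fields, f"review:missing_fields:{name}"
    return fields, None
```

An empty reviewer response yields `review:missing_fields:verdict`, so a silent reviewer can never be read as a pass.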


Phase map

This is the historical implementation map that produced the current source release. Completed items remain as audit trail; current release truth lives in the verified snapshot above and the v1.0 checklist below. Each future phase must end green: make all, targeted tests, docs updates, and fresh evidence.

P-Audit — Release audit remediation (DONE 2026-05-03)

Goal: keep external release-audit findings executable through the Method tooling without moving them out of Taskmaster.

P0 — Reality reset (DONE 2026-04-26)

Goal: docs and disk match reality, no parallel planning artifacts.

P1 — Critical bugs + portability (1-2 days, in place on mainspring.sh) — 🟢 ACTIVE since 2026-04-26

Goal: stop the silent failures. The Claude→Claude review gate must demonstrably see writer output.

Acceptance:

Rubric impact: correctness 95→140, portability 90→105.

P2 — De-monolith (~2 weeks, incremental commits to feature branch)

Goal: main entry ≤ 500 LOC, no embedded Python, no duplication between writer/reviewer paths.

Acceptance:

Rubric impact: architecture 60→155.

P3 — Tests + observability (~1 week)

Goal: safety net for aggressive P4–P7 refactors. Failing test fails the wave.

Acceptance:

Rubric impact: testability 55→135, observability 130→160.

P4 — UX polish (~1 week)

Goal: “production ready” → “delightful to operate”.

Acceptance:

Rubric impact: safety 110→132, UX 70→95.

P4.5 — Mainspring Method tooling (~1 week)

Goal: make the Mainspring Method (the doctrine-first dev flow at docs/method.md) executable as Mainspring CLI subcommands. Today the Method is documented plus CLI-assisted; this phase made its key steps callable from the CLI so operators (and future Mainspring waves themselves) can invoke them programmatically.

The Method package source lives under method/ and ships as part of the Mainspring OSS release. CLI commands in this phase wrap those templates and validators.

Acceptance:

Rubric impact: Method productization +30, UX 95→110.

P5 — Observability and engine support (~1.5 weeks)

Goal: the three big features the operator wants for the OSS release.

Acceptance:

Rubric impact: observability 160→185, ergonomics 95→120, extensibility +20 (new axis).

P6 — Metrics-driven routing (~1 week)

Goal: the routing default (which pair, which topology) gets chosen by data, not preference.

Acceptance:

Rubric impact: observability 185→210, decision quality +30.

P7 — Repo extraction + GitHub release (1-2 days)

Goal: Mainspring ships as its own Apache-2.0 OSS repo on GitHub as a clean source-install v1.0 release. The current public release procedure is the single checklist in v1.0 GitHub release checklist; older scratch bootstrap commands are not part of the public contract.

Acceptance:

Rubric impact: packaging +50 (new axis), distributability +30.


Architecture decisions (ADRs)

Six load-bearing decisions. Each is reversible only at high cost; each is documented here so future maintainers can read them and either re-confirm or override.

ADR-01: License = Apache-2.0

Context: Mainspring will become public OSS. License choice is effectively permanent (changing it later requires consent from every contributor).

Options considered: MIT (simplest), Apache-2.0 (explicit patent grant), BSD-2-Clause.

Decision: Apache-2.0.

Rationale: Mainspring is infrastructure tooling (runs for life), not a 200-LOC library. The patent grant matters because: (a) the engine-adapter pattern is novel-ish; (b) someone could fork Mainspring, patent the adapter approach, and try to enforce against the original. Apache-2.0 reduces that risk through its explicit contributor patent grant and its patent-retaliation clause. Widely adopted infrastructure projects such as Kubernetes use Apache-2.0; matching that signals enterprise readiness and lowers the legal-review burden for teams that adopt it. MIT is simpler but loses the patent grant for no practical gain.

Consequences: every source file gets an SPDX header; NOTICE file required; copyright held by “Mainspring contributors” (no CLA, future contributors implicitly accept under §5).

Reversal cost: very high (relicensing public OSS requires every contributor’s consent). Get this right now.

ADR-02: Nested-repo strategy = configurable team-exclude

Context: Operators sometimes have nested git repos (submodule-style, ignored nested checkouts) inside their workspace. Team workers operate in worktrees that don’t see those nested repos; team-mode dispatch of nested-repo-scoped tasks would silently fail.

Options considered: ignore nested repos and let reviewer failure catch it; force all nested-repo work to solo manually; auto-detect nested git roots on every dispatch; expose an explicit exclude-prefix knob.

Decision: Mainspring exposes MAINSPRING_TEAM_EXCLUDE_PREFIXES plus a pre-v1 compatibility alias as a colon/comma-separated list of path prefixes that team mode skips. .claude/ local helper worktrees are excluded by default, and operator-configured prefixes are additive. Team mode skips matching Taskmaster items with failure_reason_class=routing:plugin_invisible. Such work routes to the solo lane, which sees the nested repo because it runs in the leader workspace.

Rationale: generic exclusion is a one-line, fully-tested guard; the operator decides which paths are nested-repo or team-invisible per project.

Consequences: team mode may leave some ready tasks untouched when their scope matches an exclude prefix; operators must keep the prefix list honest per project. Doctor and routing reports must make skipped scopes visible so the skip is never silent.

Reversal cost: low — change the env var.
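
The exclude-prefix rule in ADR-02 amounts to a small parse-and-match step. The sketch below assumes a plain string-prefix comparison against a task's scope path; how Mainspring derives that path is not specified here, and `team_exclude_prefixes` and `team_skip_reason` are hypothetical names.

```python
import re

DEFAULT_EXCLUDE_PREFIXES = (".claude/",)  # excluded by default per ADR-02


def team_exclude_prefixes(raw: str) -> tuple:
    """Parse a colon/comma-separated list; operator entries add to the defaults."""
    extra = [part.strip() for part in re.split(r"[:,]", raw or "") if part.strip()]
    merged = list(DEFAULT_EXCLUDE_PREFIXES)
    for prefix in extra:
        if prefix not in merged:
            merged.append(prefix)
    return tuple(merged)


def team_skip_reason(task_path: str, prefixes: tuple):
    """Return the failure class when team mode should skip this scope, else None."""
    if any(task_path.startswith(prefix) for prefix in prefixes):
        return "routing:plugin_invisible"
    return None
```

For example, a value of `vendor/plugin-a/:tools/nested` (hypothetical paths) would make team mode skip any task scoped under those directories and leave it for the solo lane.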

ADR-04: Model policy = always premium

Context: routing decision — keep fast/mini lanes for low-risk docs-only work, or force every wave through the most capable model?

Options considered: always premium; low-risk docs-only lanes; dynamic pair selection by recent metrics; manual per-wave model choice.

Decision: always premium. Default models: Codex gpt-5.5 with reasoning_effort=xhigh; Claude opus; Gemini gemini-2.5-pro. No “fast” lane is shipped as a default preset.

Rationale: Mainspring’s mission is high-quality autonomous execution. Cheaping out on a docs task that the reviewer then has to re-do costs more cycles than running premium once. The user explicitly chose this. Future engines must default to their flagship model.

Consequences: the P6 metrics-driven auto-disable rule is restricted to non-default lanes that an operator explicitly enabled. Premium pairs are never auto-disabled.

Reversal cost: low — change defaults in wizard.sh.

ADR-05: Team failure semantics = both failed + blocked

Context: when a team task fails for a non-recoverable routing reason attached to a Taskmaster item (for example an explicit scope block or stale empty parent), what state goes where? Recoverable preflight visibility skips are covered by ADR-02 and route to solo instead of blocking the backlog.

Options considered: mark only the team backend task failed; mark only the Taskmaster item blocked; retry indefinitely with the same routing; dual-mark both systems with the same machine-readable reason.

Decision: for non-recoverable task-scoped routing failures, mark the team backend task failed (with failure_reason_class recorded in the team ledger and waves.jsonl), AND mark the Taskmaster item blocked (with the same machine-readable reason in the task body). Supervision must not re-dispatch a known-blocked item until the operator clears the block manually. Recoverable reasons such as routing:plugin_invisible are logged as failed team-preflight rows and then processed by the solo lane.

Rationale: dual-marking gives the operator two views of the same fact: the team metrics show “this team had N routing failures of class X” (useful for triage), and Taskmaster shows “task #42 is blocked because Y” (useful when picking next work). Single-marking either way loses one of those views.

Consequences: blockers become explicit operator work instead of hidden scheduler state; clearing a false-positive block requires manual Taskmaster action. Metrics can group by failure_reason_class across both ledgers because the same code is written to both places.

Reversal cost: medium — would require unwinding the dispatch ledger schema.

ADR-06: Auto-checkpoint policy = keep recovery commits out of public history

Context: Mainspring’s auto-checkpoint commits operational state during fanout (using Lore trailers + denylist). Final history quality depends on the operator reviewing checkpoints and publishing semantic commits.

Options considered: disable auto-checkpoint entirely; keep operational checkpoint commits as final history; auto-squash without asking; keep checkpointing and document public-history review.

Decision: keep auto-checkpoint as-is, but keep public-history preparation outside the Mainspring CLI. The operator uses normal git review and semantic commits before publication.

Rationale: auto-checkpoint preserves work without operator intervention, which is the whole point of autonomous execution. Manual finalization preserves history quality, which is the point of OSS publication. Doing both means the worst-case path still has an auditable trail of what happened, while the happy path ships clean semantic commits.

Consequences: operators get durable recovery points during autonomous fanout, but PR branches still require final history review. Mainspring does not expose a public history-rewrite command.

Reversal cost: low — this keeps history tooling outside the product surface.

ADR-07: Implementation language strategy = Bash for orchestration, Python for structured data

Context: Mainspring orchestrates AI agents through subprocesses (Codex, Claude, Gemini CLIs), parses their structured outputs (stream-json events, review verdicts), manages tmux + worktree fanout, and reads/writes JSONL state files. The natural shape spans two very different concerns: shell glue (process spawning, pipes, signal handling, fanout) and structured data (JSON parsing, schema validation, formatted reporting). Choosing one language for both means losing the other’s strengths.

Options considered:

  1. Bash only. Verified painful in v1: 4644 LOC monolith, 9 embedded python3 - <<'PY' heredocs that produced the canonical SC2259 silent-failure bug, 8 inline python3 -c '...' parsers, JSON munging through a sed/awk underbelly. The whole P0+P1+P2 effort exists because this approach was untenable.
  2. Python only. Clean tests, single language, async streaming via Anthropic/OpenAI SDKs, idiomatic for AI-agent tools (Aider, GPT Engineer, Mentat, Claude Engineer all chose Python). But: subprocess+pipe plumbing is verbose (Popen with stdin/stdout PIPE + signal handling = ~3× the bash-equivalent LOC); tmux + worktree fanout is awkward through subprocess; loses Bash’s “pipe is a first-class verb” feel.
  3. Go. Single binary, no runtime dependency, fast startup, strong concurrency primitives. But: build pipeline required (cross-compile per platform), every release becomes a binary distribution problem, maintainers would have to own CI for releases; raises the adoption barrier from “git clone, run” to “download binary or set up Go toolchain”.
  4. Rust. Same upside as Go but more packaging and contributor friction than this local operator tool needs.
  5. Node/TypeScript. Awkward as shell-glue (everything goes through child_process.spawn); npm dependencies in an OSS CLI tool is an anti-pattern; would clash with the planned pip install distribution path.
  6. Bash + Python (current). Bash for what it’s good at; Python for what it’s good at; explicit CLI boundary between them; both pre-installed on every macOS/Linux developer workstation; zero build step. The split that organically emerged from the v1 → v2 refactor.

Decision: Bash + Python, with a strict CLI boundary.

Rationale:

Consequences:

Reversal cost: medium-to-high for a full rewrite to pure Python; trivial for incremental Python expansion (e.g. moving more bash logic into Python on a per-module basis). The boundary is designed to allow incremental migration if we ever decide to go pure Python — bash would shrink module by module while CLI calls stay stable.

Re-evaluate when:

  1. We need async streaming directly through Anthropic/OpenAI SDKs (skipping the claude -p / codex exec CLIs). At that point a pure-Python rewrite becomes attractive because the SDKs are Python-first.
  2. We want pip install mainspring as a distribution channel after the v1.0 OSS source release.
  3. We want a real plugin system for engines (Python entry_points beats bash sourcing).

Until any of those three triggers, the current split is the right shape. Bash for shell, Python for data. Each language gets the work it was designed for.


Operational doctrine

How to actually use Mainspring in daily work. This is the lived contract; see guide.md for the full command reference.

When to use solo vs team

Default to solo unless the explicit reason for team is satisfied:

Otherwise solo. Solo is faster to debug, doesn’t require tmux capacity, and produces the same quality output for single-item work.

Rule of thumb: if solo would finish the queue faster than team would even start (because of fanout overhead), pick solo.

Pair selection (until P6 metrics override)

Goal | Pair | Why
Maximum quality, no speed concern | claude+codex (opus + gpt-5.5 xhigh) | best writer + best reviewer; differing model families catch each other’s blind spots
Same family double-check | claude+claude or codex+codex | useful when one provider is rate-limited
Most reasoning needed (complex refactors) | codex+codex xhigh | Codex with xhigh effort and Codex review is the highest-effort lane
Fastest decision (only when single-step) | claude+claude | Claude’s tool-calling is faster than Codex round-trips

After P6 lands, consult mainspring --metrics --routing and use the data, not the table.

Reading --metrics

Three signals matter most:

  1. Pass rate per pair, last 14 days. If a pair drops below 70%, investigate before another wave on it. Below 50% → auto-disable should have kicked in (P6).
  2. Top 5 stuck task ids. A stuck task = ≥3 consecutive FAIL waves. Promote stuck tasks out of the queue: either manual review, switch pair, or mark blocked with a reason.
  3. Mean duration trend. Sudden 2x increase = engine quality degrading or task complexity drifting.
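
The first two signals can be computed directly from parsed waves.jsonl records. This is a sketch of the arithmetic, not the `--metrics` implementation: the function names are hypothetical, timestamps are assumed to carry whole seconds (`YYYY-MM-DDTHH:MM:SSZ`), and "stuck" is read as "the most recent three waves for that task all ended in FAIL".

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def pass_rate_by_pair(waves, now, days=14):
    """Share of PASS verdicts per pair among waves completed in the last `days`."""
    cutoff = now - timedelta(days=days)
    totals, passes = defaultdict(int), defaultdict(int)
    for wave in waves:
        finished = datetime.strptime(wave["ts"], "%Y-%m-%dT%H:%M:%SZ")
        finished = finished.replace(tzinfo=timezone.utc)
        if finished < cutoff:
            continue
        totals[wave["pair"]] += 1
        passes[wave["pair"]] += wave.get("verdict") == "PASS"
    return {pair: passes[pair] / totals[pair] for pair in totals}


def stuck_tasks(waves, threshold=3):
    """Task ids whose most recent `threshold` waves (in log order) are all FAIL."""
    history = defaultdict(list)
    for wave in waves:
        if wave.get("task_id") is not None:
            history[wave["task_id"]].append(wave.get("verdict"))
    return sorted(
        task for task, verdicts in history.items()
        if len(verdicts) >= threshold
        and all(verdict == "FAIL" for verdict in verdicts[-threshold:])
    )
```

A null verdict (review crashed) counts against the pass rate but does not count as a FAIL for the stuck check in this sketch; either choice should be made explicit when reading the numbers.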

When to --restart-team

Only when the team is provably stuck and --repair-state --dry-run doesn’t reveal a recoverable cause. --restart-team preserves worker heads under refs/mainspring-preserve/... before resetting, so it’s not destructive — but it does reset team backend state. Use it as a last resort.

Auto-checkpoint discipline

Auto-checkpoint commits during fanout are operational; review and squash them before publication. Never push those checkpoints directly to a PR branch — they are short-lived recovery checkpoints, not release history.

Cost awareness without cost guardrails

Per ADR-04, there is no cost guardrail. The operator’s daily-digest Telegram message (P5-1) shows total spend; the operator decides when to pause. The combination of premium-only lanes + visible daily spend + manual stop is sufficient for a single-operator tool.


Health rituals

Two recurring checks. Each is small enough to actually do; the consequences of skipping are large enough to make the discipline worthwhile.

Weekly (≤ 10 min)

  1. mainspring --metrics --days 7 — check pass rate per pair, top stuck tasks.
  2. mainspring doctor — confirm dependencies + git state clean.
  3. wc -l .mainspring/logs/waves.jsonl — sanity check that the JSONL is growing.
  4. Review local git history before pushing public work.
  5. If the daily digest noted any disabled pairs, either re-enable manually with reason or accept and move on.

Monthly (≤ 30 min)

  1. Run the weekly ritual.
  2. Read the last 4 weeks of .mainspring/logs/notifier.log — look for retry-loop events; investigate any task that retry-looped 3+ times.
  3. Run make all from the repository root — must be green.
  4. Audit .mainspring/state/disabled-pairs.json — pairs that have been auto-disabled for > 30 days should be either re-enabled or removed from the registry entirely.
  5. Skim the last 30 days of waves.jsonl for any failure_reason_class value that’s new — every value should map to a known taxonomy entry in P4-5.
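
Step 5 of the monthly ritual can be partly mechanized. This is a minimal sketch, assuming each line of waves.jsonl is a JSON object that may carry a `failure_reason_class` field; the set of known classes passed in is a placeholder for the P4-5 taxonomy, not its real contents.

```python
import json


def unknown_failure_classes(jsonl_lines, known_classes):
    """Return sorted failure_reason_class values that are not in `known_classes`."""
    unknown = set()
    for line in jsonl_lines:
        text = line.strip()
        if not text:
            continue  # tolerate blank lines
        record = json.loads(text)
        value = record.get("failure_reason_class")
        if value is not None and value not in known_classes:
            unknown.add(value)
    return sorted(unknown)
```

In practice the lines would come from iterating over the open log file, and an empty result means every recorded class already maps to the taxonomy.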

Disaster recovery

Six failure modes that have happened or are likely to happen, with concrete recovery steps.

.mainspring/ corrupted (e.g. partial write of waves.jsonl)

Detect: mainspring --metrics errors with Invalid JSON at line N.

Recover:

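# Set the damaged file aside so nothing below overwrites it.
mv .mainspring/logs/waves.jsonl .mainspring/logs/waves.jsonl.corrupt
# Option 1: jq stops at the first malformed line, so this keeps only the records before it.
jq -c '.' .mainspring/logs/waves.jsonl.corrupt > .mainspring/logs/waves.jsonl 2>/dev/null
# Option 2 (more aggressive, replaces the output of option 1): keep every well-formed line.
grep -v '^$' .mainspring/logs/waves.jsonl.corrupt | while IFS= read -r line; do
  # printf avoids echo's shell-dependent handling of backslashes inside JSON strings.
  printf '%s\n' "$line" | jq -e . >/dev/null 2>&1 && printf '%s\n' "$line"
done > .mainspring/logs/waves.jsonl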
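
Where jq is not installed, the same salvage can be sketched in Python. This is an illustrative alternative rather than part of Mainspring; it assumes every valid record is a JSON object on a single line, and it also reports which line numbers were dropped so the damage can be inspected.

```python
import json


def split_jsonl(lines):
    """Split lines into (well-formed records, line numbers that were rejected)."""
    good, bad_line_numbers = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are skipped, not counted as damage
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            bad_line_numbers.append(number)
            continue
        if isinstance(parsed, dict):
            good.append(text)
        else:
            # Assumption: a bare scalar or list is not a valid wave record.
            bad_line_numbers.append(number)
    return good, bad_line_numbers
```

The surviving lines can then be written back to .mainspring/logs/waves.jsonl while the .corrupt copy is kept for later investigation.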

Dead host mid-wave

Detect: lock file shows pid that no longer exists; mainspring status shows in-progress wave with timestamp > 30 min old.

Recover: run mainspring --repair-state --dry-run first. If the preview only touches stale runtime bookkeeping, run mainspring --repair-state --force, then resume normal flow. Because flock is fd-based, the OS already released the lock when the host died.
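
The two detection conditions can be checked by hand with a few lines of Python. This is a hedged sketch for POSIX hosts only: how the pid and wave start time are read from Mainspring's lock and status files is not shown, and both function names are hypothetical.

```python
import os
from datetime import datetime, timedelta, timezone


def pid_is_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wave_is_stale(started_at, now, max_age=timedelta(minutes=30)):
    """Return True if an in-progress wave started more than `max_age` ago."""
    return now - started_at > max_age
```

A live pid does not prove the lock owner is still Mainspring, because pids can be reused after a reboot, so the --repair-state --dry-run preview remains the authoritative check.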

Runaway loop (wave count climbing without progress)

Detect: --metrics shows ≥10 consecutive FAIL waves on the same task, or mainspring hud shows STUCK / repeated RETRY with the same task. Telegram should have already surfaced one actionable retry_loop event.

Recover: use the retry-loop notification’s Open: command to inspect the local HUD/log. If it is genuinely stuck, use the notification’s Stop: command or mainspring stop --force; mark the offending task blocked in Taskmaster with failure_reason_class=manual:runaway; investigate offline. Resume after the block clears.

Stale worktrees / zombie tmux panes

Detect: mainspring doctor warns; git worktree list shows worktrees pointing to non-existent paths.

Recover: start with mainspring --repair-state --dry-run. If git itself reports stale worktrees, run git worktree prune. For team backend state, use mainspring --last-run --restart-team only after checking the preserved worker heads that Mainspring reports. Avoid broad tmux cleanup; kill only a specific session after manually confirming it is unrelated to active work.
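
The stale-worktree check can be reproduced outside mainspring doctor. The sketch below parses the text produced by `git worktree list --porcelain` and reports registered worktrees whose directories are gone; the function name and the injectable `exists` parameter are illustrative choices, and the code is read-only.

```python
import os


def missing_worktrees(porcelain_text, exists=os.path.isdir):
    """Return registered worktree paths that no longer exist on disk."""
    prefix = "worktree "
    paths = [
        line[len(prefix):]
        for line in porcelain_text.splitlines()
        if line.startswith(prefix)
    ]
    return [path for path in paths if not exists(path)]
```

The input would typically be the captured stdout of the git command run from the repository root. Anything this returns is a candidate for git worktree prune, after the dry-run preview described above.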

Lock without owner (rare; only happens if flock is unavailable on the platform)

Detect: mainspring exits immediately with “already running” but no process matches the recorded pid.

Recover: run mainspring --repair-state --dry-run, then mainspring --repair-state --force only if the preview identifies the lock as stale. Mainspring will re-acquire on next launch. If this happens repeatedly, mainspring doctor should be flagging missing flock support; install util-linux or the platform equivalent.

Telegram daemon stuck

Detect: .mainspring/logs/notifier.log has not appended in > 1 hour despite waves continuing.

Recover: run mainspring notify-health --format json. If it reports next_step=restart-notifier-daemon, run mainspring notify-restart, then mainspring notify-test to confirm Telegram delivery. notify-restart only stops the PID recorded for this runtime’s notifier after validating the process command and current ledger path; do not use broad process-name kills because they can kill another project’s notifier.
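
The detection rule (no log append for more than an hour while waves continue) can be sketched as a modification-time check. This is an assumption-laden illustration, not how notify-health works internally: it treats the file's mtime as the last append, and it treats a missing log during active waves as suspicious.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return seconds since `path` was last modified, or None if it is missing."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path)
    except FileNotFoundError:
        return None


def notifier_looks_stuck(age_seconds, waves_active, max_age_seconds=3600):
    """Flag a silent notifier only while waves are still running."""
    if not waves_active:
        return False  # an idle runtime legitimately writes nothing
    return age_seconds is None or age_seconds > max_age_seconds
```

A positive result is only a prompt to run mainspring notify-health, which remains the source of the next_step recommendation.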


Versioning and migration

SemVer policy

Mainspring follows strict SemVer from v1.0.0 onward:

Schema versioning (JSONL)

schema_version=1 for v1.0.0–v1.x. To bump to schema_version=2:

  1. Add the new shape to wave_log.py append. Emit both shapes for 90 days (overlap window).
  2. Update metrics.py to read both versions.
  3. Document the new shape in docs/metrics.md with a compatibility note.
  4. After 90 days, drop emission of the old shape; readers continue to support it for one more major version.
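
The overlap window and the dual-version reader can be illustrated as follows. This is a sketch under stated assumptions: the version 2 shape (a nested `outcome` object) is invented purely for the example, records without a `schema_version` field are treated as version 1, and neither function is taken from wave_log.py or metrics.py.

```python
from datetime import date, timedelta

OVERLAP_WINDOW = timedelta(days=90)


def versions_to_emit(today, v2_introduced):
    """Return the schema versions a writer should emit on `today`."""
    if today < v2_introduced:
        return [1]
    if today < v2_introduced + OVERLAP_WINDOW:
        return [1, 2]  # overlap: both shapes are written
    return [2]


def read_verdict(record):
    """Read the verdict from either shape (the version 2 layout is hypothetical)."""
    version = record.get("schema_version", 1)
    if version == 1:
        return record["verdict"]
    if version == 2:
        return record["outcome"]["verdict"]
    raise ValueError(f"unsupported schema_version: {version!r}")
```

Raising on an unknown version, rather than guessing, keeps a reader from silently misinterpreting a future shape.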

Env var deprecation

Renaming an env var into the MAINSPRING_* namespace in P7:

  1. Read both for one minor version.
  2. If the old one is set and the new one isn’t, log a stderr deprecation notice once per process.
  3. After one minor version, drop reading the old one.
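
The read-both-then-warn-once behaviour in steps 1 and 2 might look like the sketch below. The helper name and its injectable `environ` and `stream` parameters are illustrative, and the variable names used with it are placeholders rather than real Mainspring settings.

```python
import os
import sys

_warned = set()  # old names already reported in this process


def read_renamed_env(new_name, old_name, environ=None, stream=None):
    """Prefer `new_name`; fall back to `old_name` with a one-time stderr notice."""
    environ = os.environ if environ is None else environ
    stream = sys.stderr if stream is None else stream
    if new_name in environ:
        return environ[new_name]
    if old_name in environ:
        if old_name not in _warned:
            _warned.add(old_name)
            print(f"warning: {old_name} is deprecated; use {new_name}", file=stream)
        return environ[old_name]
    return None
```

Dropping the old name after one minor version then amounts to deleting the fallback branch.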

Runtime state note

Public v1 writes .mainspring/ directly. Pre-v1 private runtime trees are outside the public operator path.


v1.0 GitHub release checklist

Mainspring’s source-install product gate is the local gate below. Public publication is ordinary GitHub repository work, not a hidden CLI workflow and not a Mainspring subcommand. There is no public release subcommand for this on purpose: after the verified source tree is ready, publish the reviewed final release commit, make the repository public, sign the tag, and create the GitHub Release. The public main branch, hosted CI, and hosted docs are already live; Homebrew, benchmarks, and provider-matrix evidence are follow-up credibility steps.

Before the maintainer publishes or republishes a release commit:

make release-check

make release-check is intentionally boring: it runs make all, package smoke, Python coverage, Product Requirements Document (PRD) validation, and git diff --check. It performs no GitHub mutations, no tag creation, no provider calls, and no hidden release-state updates.

Then the owner publishes from a reviewed release commit:

Package-manager distribution, benchmark numbers, and live provider matrix evidence are follow-up credibility work, not requirements for the first source-install release. Hosted docs are already published at https://dlogvinenko.github.io/mainspring/.

Correctness

Architecture

Tests

Observability

Safety

UX

Portability

Documentation

Release


Explicit non-goals

These are things Mainspring will never become. Reject any change that moves toward them.


Backlog (Must / Should / Could / Won’t)

Ranked by value-per-effort. Must items block the source-install v1.0 code release; Should items ship in v1.x; Could are later candidates; Won’t are explicit dead ends.

Must (blocks source-install v1.0.0 code release)

  1. P1-1 SC2259 fix (heredoc → real .py) — single highest-impact bug fix.
  2. P1-2 Remove hardcoded $HOME path.
  3. P2-1, P2-3, P2-4 Heredoc extraction + run_ai_turn merge + bash modularization.
  4. P2-2 python3 -c consolidation into team_status.py.
  5. P3-1 / P3-2 / P3-3 / P3-5 Tests + JSONL + Makefile.
  6. P3-4 --metrics command at the standard-questions level.
  7. P4-1 Structured review JSON (kills regex parsing in critical path).
  8. P4-2 --dry-run mode.
  9. P4-3 Presets.
  10. P5-1 Telegram notifications.
  11. P-Comp-1 LiteLLM multi-provider registry and fail-closed provider routing.
  12. P7-1 → P7-6 Repo extraction, public README, and source-install release hygiene.

Should (v1.x post-release)

  1. P4-5 Failure reason taxonomy.
  2. P4-6 / P4-7 / P4-8 Worktree visibility routing + bootstrap auto-close + dispatch ledger.
  3. P5-2 HUD (rich-based TUI).
  4. P6 Metrics-driven routing + auto-disable + daily digest.

Could (v2 candidates, only if data justifies)

  1. Web HUD (separate from TUI; only if multiple users ask).
  2. Additional engines: Ollama (local), OpenRouter (multi-provider), Grok.
  3. Property/fuzz testing on JSONL emitter.
  4. Golden-file testing for review-prompt drift.

Won’t (explicit non-goals — do not propose)


Phase P-Comp — Post-competitor-analysis amendments (2026-04-27)

Goal: apply the recommendations from Appendix C — Competitor landscape and docs/competitive-analysis.md. Ratified 2026-04-27; 8 strong-recommend items + 5 considered items + 4 explicit skips.

Strategic reframe (item 0 — applies to all subsequent work): Mainspring is positioned as Product Requirements Document (PRD)-first AI coding agent orchestration for production-grade software delivery, NOT as “yet another autonomous coding agent”. The Method remains the durable asset, but the public entry point must be understandable to people who do not already know the Method: one command to start, clear explanation of PRD vs. vibe coding, visible HUD, Telegram operations, hard reviewer gate, and evidence ledger. Direct competitors (Composio AO, OpenAI Symphony, Taskmaster autopilot) own the orchestrator-only niche; Mainspring’s differentiator is making doctrine executable and auditable across any project.

Strongly recommend (market evidence after the source-install release)

Recommend (v1.x growth work after source release)

v1.x roadmap ownership: P-Comp-6, P-Comp-7 local implementation and hosted publication, and P-Comp-8 are complete. Future hosted-docs work is limited to optional custom-domain polish; the default GitHub Pages site is already published.

Considered (Could — v1.x or later, not blocking v1.0)

Could-lane re-evaluation triggers: P-Comp-9’s core interface is complete; non-Taskmaster adapters become eligible when at least two alternate backlog sources are requested by real operators. P-Comp-12 becomes eligible only after P-Comp-1 proves Python provider dispatch in real waves. P-Comp-13 becomes eligible after public adoption creates a concrete maintainer question that anonymous counts would answer. P-Comp-9, P-Comp-10, and P-Comp-11 are already closed.

Explicit skips (Won’t — confirmed non-goals)

Acceptance for closing P-Comp


Appendix A — Source-of-knowledge recipes

This appendix points future maintainers and AI agents at the current source of truth. It is intentionally a map, not a second implementation plan.

CLI truth

Runtime and logs

Operator visibility

Review and safety gates

Package payload

Verification map


Appendix B — Verification commands

Use these from the repository root when validating a release candidate. These commands are intentionally boring: they prove the source tree, package payload, PRD, and diff hygiene without calling live AI providers.

set -e
make release-check
./mainspring.sh doctor
./mainspring.sh --dry-run --once

Optional live engine smoke, only when credentials/quota are intentionally available:

./mainspring.sh --self-test
./mainspring.sh --self-test-all

Optional portability smoke, only when Docker is available:

docker run --rm -v "$(pwd):/m" -w /m alpine:3.19 sh -c 'apk add bash python3 git shellcheck && bash mainspring.sh doctor || true'

Appendix C — Competitor landscape / competitive positioning (June 2026 refresh)

The detailed current market analysis lives in docs/competitive-analysis.md. It supersedes the April 2026 snapshot that previously lived inline here.

Snapshot date: 2026-06-14. Product claims were checked against official docs and public repository surfaces. Exact popularity metrics are intentionally omitted because popularity signals drift quickly.

Current strategic finding

Mainspring should not compete as “another coding agent.” OpenCode, Cline, Goose, Aider, OpenHands, Roo Code, GitHub Copilot cloud agent, and Devin already own the broad coding-agent mindshare.

Mainspring should compete as:

Product Requirements Document (PRD)-first AI coding agent orchestration for production-grade software delivery.

That means Mainspring exists to solve the operator problem that generic agents leave behind: intent, bounded work, independent review, evidence, global status, notifications, local/private model routing, and recovery.

June 2026 release score

The refreshed 1000-point release-readiness score in docs/competitive-analysis.md rates Mainspring’s v1 source release readiness at 900/1000. This is a source-release readiness score, not a claim that Mainspring has more distribution than established competitors.

Mainspring scores high on:

Next public credibility evidence after the source release:

Closest threats

| Threat | Why it matters | Mainspring response |
| --- | --- | --- |
| Agent Orchestrator | Worktrees, PR automation, CI fixes, review comment loops, tracker integrations. | Stay Product Requirements Document (PRD)-first and evidence-first; add optional GitHub/Linear backlog adapters later. |
| OpenAI Symphony | Strong “manage work, not agents” positioning plus OpenAI brand. | Stay local/private, multi-engine, and operator-owned. |
| Claude Task Master | Owns PRD-to-task decomposition and overlaps with autopilot. | Be explicit: Mainspring complements Taskmaster by adding execution, review, HUD, Telegram, and evidence. |
| OpenCode / Goose / Aider | Broader coding-agent mindshare and provider/local-model support. | Do not fight on chat UX; own autonomous execution control. |
| Cline / Roo Code | Strong editor-native trust and approval UX. | Own unattended CLI waves where per-action approval is the wrong workflow. |
| GitHub Copilot cloud agent / Devin | Hosted issue-to-PR convenience and enterprise reach. | Own local/private, inspectable, non-SaaS workflows. |

Search and positioning requirements

Public copy should repeatedly use these phrases where natural:

The next market-facing gates are: signed release announcement, package install path, comparison pages, 60-second demo, and benchmark evidence.


Last edited: 2026-06-15. This file is the canonical plan; if any other file in the repo disagrees, update that file to match this one.