Mainspring — Product Requirements Document (PRD)
Status: canonical plan for Mainspring (autonomous execution loop). Lives at docs/prd.md. The Method that produced this document is at docs/method.md; Mainspring was built with the same PRD-first discipline it now ships.
Owner: Mainspring maintainers.
Companion docs: docs/guide.md — operator commands and recovery shortcuts.
This is the document that drives Mainspring as a clean modular Apache-2.0 OSS
tool prepared for the mainspring v1.0.0 GitHub source release. Nothing else
in the repo speaks for Mainspring’s plan; if it disagrees with this file, this
file wins.
Contents
- Mission
- Durability principles
- Current truth snapshot (verified)
- Naming and brand boundary
- Target architecture
- Phase map
- Architecture decisions (ADRs)
- Operational doctrine
- Health rituals
- Disaster recovery
- Versioning and migration
- v1.0 GitHub release checklist
- Explicit non-goals
- Backlog (Must / Should / Could / Won’t)
- Appendix A — Source-of-knowledge recipes
- Appendix B — Verification commands
- Appendix C — Competitor landscape / competitive positioning
Mission
Mainspring is a single-operator, single-host autonomous execution instrument: it picks work from a PRD or backlog, runs a writer model or CLI against it, runs an independent reviewer as a hard gate, captures verifiable outcomes as JSONL, and stops or continues based on what actually shipped, not what the agent claims.
Mainspring is positioned as PRD-first AI coding agent orchestration for production-grade software delivery, not as another generic autonomous coding chat wrapper. The public surface must explain the practical buyer problem first: vibe coding is useful for exploration, but production work needs intent, review, evidence, visibility, and recovery.
Three verbs: pick, ship, record. Anything that doesn’t serve one of these is decoration.
Audience:
- Operators shipping production-grade local projects with AI agents. Primary.
- Developers who want a local PRD-first AI coding loop with review, evidence, HUD, and Telegram. Secondary.
- Contributors extending engines, backlog sources, docs, or release tooling. Tertiary.
Non-audience: enterprise teams, compliance auditors, cloud SaaS users. Mainspring is a local operator tool, not a hosted platform.
Durability principles
The ten commandments. Anything below this line bows to them. If a phase plan, a feature, or an ADR conflicts with one of these, the plan loses.
- The wave is the unit. One pass of `pick → write → review → log → decide` is one wave. Mainspring’s correctness, observability, and metrics all hinge on the wave being a clean atomic concept. Never blur it.
- Truth before autonomy. A wrong autonomous loop is worse than a slow one. Mainspring never marks a Taskmaster item done unless the reviewer hard-gate passed and the changed-file check confirmed product code moved.
- The reviewer is the only gate. No silent fallback that bypasses the review verdict. If the reviewer is unreachable, the wave fails closed (logs + Telegram, no auto-pass).
- Writer output must be visible to the reviewer. The SC2259 silent-failure bug (heredoc-overrides-stdin) is the canonical anti-example. Every writer→reviewer handoff must be byte-verifiable. Never rebuild that bug.
- JSONL is the contract. The `.mainspring/logs/waves.jsonl` schema (defined below) is frozen. Adding fields is non-breaking; removing or renaming a field requires a `schema_version` bump and a 90-day deprecation window.
- The CLI is the contract. Every flag in `--help` is a public API. Adding flags is non-breaking; removing a flag requires deprecation + stderr warning for one minor version.
- No embedded heredocs. Bash dispatches; Python scripts compute. No `python3 - <<'PY'`, no `python3 -c '...'` longer than 80 chars in critical paths. Pretty-printing is a real `.py` file with tests.
- Fail closed, fail loud. If we can’t reach the LLM, the writer’s output, the review verdict, the JSONL writer, or the lock file, we stop the wave with a stderr message. Never paper over.
- No fictional features. The doctrine, the guide, and `--help` may only describe behavior present in the current commit. If a phase claims a feature, the test for it must already be green.
- Reversible by default. Auto-checkpoint commits are OK; destructive cleanup, force shutdown, force push, `rm -rf` of worktrees, and `git reset --hard` all need an explicit user gate (`--restart-team`, `--repair-state`, `--force`). Mainspring should never destroy the user’s in-progress work to make its own bookkeeping cleaner.
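The "truth before autonomy" and "reviewer is the only gate" rules can be read as one small decision function. The sketch below is illustrative only: `should_mark_done` is a hypothetical name, not a function from the Mainspring source, and it assumes the verdict arrives as `"PASS"`, `"FAIL"`, or `None` (reviewer crashed or unreachable), matching the `verdict` field described later in this document.

```python
from typing import Optional


def should_mark_done(verdict: Optional[str], product_files_changed: int) -> bool:
    """Hypothetical gate: mark work done only on PASS plus real product changes."""
    # Anything other than an explicit PASS fails closed, including a missing
    # verdict from a crashed or unreachable reviewer.
    if verdict != "PASS":
        return False
    # A PASS with no product files changed is not evidence that anything shipped.
    return product_files_changed > 0


if __name__ == "__main__":
    for verdict, changed in [("PASS", 3), ("PASS", 0), ("FAIL", 5), (None, 5)]:
        print(verdict, changed, should_mark_done(verdict, changed))
```

Note that there is no branch that turns a missing verdict into a pass; that omission is the point of the fail-closed rule.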
Current truth snapshot (verified)
Captured 2026-06-15 after HUD empty-state and active-card polish, packaged runtime Python isolation for global HUD use, command-help hardening, public release-surface copy cleanup, installed-wheel CLI contract guards, and a fresh release gate against the standalone repository layout.
| Metric | Value | Source |
|---|---|---|
| mainspring.sh LOC | 391 | wc -l mainspring.sh on 2026-06-16 |
| Bash functions across entrypoint + lib | 171 | rg function-pattern scan on mainspring.sh lib/*.sh on 2026-06-16 |
| Source shell syntax | passes | bash -n mainspring.sh lib/*.sh |
| Source shell lint | passes | shellcheck -S warning mainspring.sh lib/*.sh |
| Python lint | passes | ruff check py |
| Python format | passes, 91 files already formatted | ruff format --check py |
| Active embedded Python in shell paths | 0 heredoc / inline-parser hits | rg scan for python3 - << and python3 -c in shell |
| Pytest suite files | 42 | find py/tests -name 'test_*.py' plus count |
| Bats suite files | 12 | find tests/bats -name '*.bats' plus count |
| Python line coverage | 90.6% (coverage gate pass) | make coverage on 2026-06-16 |
| Full local gate | passes: 1124 Python tests passed, 1 skipped; 223 Bats; HUD/docs-site smoke; dependency audit; package smoke; PRD validation | make release-check on 2026-06-16 |
| Global editable CLI | passes: make install-user; command -v mainspring resolves ~/.local/bin/mainspring; installed HUD captured output exits one-shot; unsupported release-management probes fail with Unknown command | local pipx smoke on 2026-06-15 |
| Taskmaster runtime state | not tracked in source release | py/tests/test_no_hardcoded_paths.py |
| Public release checklist | clean main is public; hosted CI and GitHub Pages are green; remaining external action is signing v1.0.0 and publishing the GitHub Release | GitHub repository state, hosted Actions runs, Pages deployment, signing key |
| PRD readiness score | 900 / 1000 | source-install product path is green; remaining points are publication checklist items |
| Telegram notifier health in clean env | disabled until MAINSPRING_NOTIFY_ENABLED=1 | ./mainspring.sh notify-health --format json |
Resolved historical critical bug: the former SC2259 path
"${cmd[@]}" | tee "$raw_file" | python3 - <<'PY' ... PY let the heredoc
override piped writer output, so the reviewer could see an empty
display_file. P1 extracted stream prettification into
py/stream_json_prettify.py, added the non-empty display-file regression, and
removed active embedded Python from source shell paths.
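The failure mode is easy to reproduce in isolation. The sketch below is not Mainspring code: it drives a tiny shell snippet from Python, using `cat` as a stand-in for the old embedded `python3 -` consumer. It assumes a POSIX shell (`bash` or `sh`) is available on PATH. Because the heredoc is bound to the consumer's stdin after the pipe is set up, the piped "writer output" never reaches the consumer.

```python
import shutil
import subprocess

# Same shape as the historical bug: a pipe feeds the consumer, but a heredoc
# is also redirected onto the consumer's stdin, and the heredoc wins.
SCRIPT = """printf 'writer output\\n' | cat <<'EOF'
heredoc body
EOF
"""


def run_snippet(script: str) -> str:
    """Run the snippet in a POSIX shell and return what the consumer printed."""
    shell = shutil.which("bash") or "sh"
    result = subprocess.run(
        [shell, "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # The consumer echoes the heredoc, not the piped writer output.
    print(repr(run_snippet(SCRIPT)))
```

Moving the consumer into a real script file, as P1 did with `py/stream_json_prettify.py`, leaves stdin free to carry the piped data.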
What works in v1 today (confirmed):
- `status`, `doctor`, `--dry-run`, `--last-run`, `--repair-state`, `--metrics`, `hud`, `engines`, `limits`, `replay`, `notify-health`, `notify-restart`, `init`, `decompose`, `scope-check`, `next`, and `validate-prd` are implemented as local CLI surfaces.
- Pre-v1 compatibility commands are not part of the public v1 source surface; fresh projects start directly in `.mainspring/`.
- Solo + team topologies for `taskmaster` and `night` modes remain the runtime shape; team mode now has visibility routing, bootstrap cleanup, and duplicate dispatch prevention.
- Auto-checkpoint keeps Lore trailers and denylist protections; final public history review remains normal git work outside Mainspring’s public CLI.
- `.mainspring/logs/waves.jsonl` is the primary runtime ledger in fresh projects; older pre-v1 runtime state is read only as a compatibility input when already present.
- `MAINSPRING_*` env vars are primary; older pre-v1 env aliases remain compatibility inputs and emit deprecation warnings when used.
What remains for publication:
- The source-install path is public on `main`. The published history is short, readable, and free of local development artifacts; it is not from checkpoint-heavy working history.
- Completed external publication work: the final ref is pushed, the repository is public, GitHub metadata is set, hosted CI is green, GitHub Pages is live, and the draft GitHub Release uses notes from `CHANGELOG.md`.
- The remaining external step is ordinary maintainer signing work: create the signed `v1.0.0` tag on the clean commit and publish the GitHub Release.
- Homebrew, benchmark numbers, and real provider-matrix evidence are valuable follow-up credibility work. They are not normal operator commands and do not block the source-install repository.
1000-point PRD score rule: 900 points are reserved for product readiness:
implementation, tests, documentation, packaging smoke, installed CLI behavior,
HUD/Telegram usability, and security/public-repo hygiene. The remaining 100
points are reserved for publication checklist items. The current score
900/1000 holds until the signed v1.0.0 release is created from the final
clean commit.
Naming and brand boundary
Three names that must never blur:
| Name | What it is | Where it lives |
|---|---|---|
| Mainspring | This tool — the autonomous execution loop. The CLI binary, the runtime, the brand. | Standalone mainspring repository; mainspring.sh plus packaged mainspring console entry point. Apache-2.0. |
| Team backend | External dependency — a separate team-orchestration CLI that Mainspring uses only for explicit --topology team runs. Not part of the normal solo path. | Wherever the backend CLI is installed on PATH. Mainspring depends on it for team topology only. |
| Taskmaster | External dependency — backlog source. Mainspring picks work from it. | task-master CLI + .taskmaster/ directory. |
The user-facing bag:
- CLI binary: `mainspring` from `make install-user` / pipx, with `./mainspring.sh --project <path>` as the source-checkout fallback.
- Runtime state: `.mainspring/` for fresh projects.
- Env vars: `MAINSPRING_*` for runtime defaults.
- Config dir on user system: reserved for future global presets / quota cache. Runtime state stays project-local by default.
- Logs: `.mainspring/logs/`, JSONL feed `waves.jsonl`, latest-symlinks `latest.log` / `latest-summary.log`.
Compatibility boundary: MAINSPRING_* and .mainspring/ are the public names. Older pre-v1 runtime names are not part of the public configuration contract; historical ledger/replay readers may still parse old recorded fields, but new launches use the Mainspring namespace.
Target architecture
The v1 source tree is intentionally boring: one Bash entrypoint, small Bash modules, tested Python helpers, committed docs, and gitignored project-local runtime state.
File layout
mainspring.sh # Bash entrypoint and CLI dispatch
lib/ # Bash modules: lock, log, help, status,
# doctor, notify, team, wave, wizard
py/ # tested Python helpers and CLIs
engines/ # EngineAdapter implementations
bench/ # source-only SWE-bench helpers; not installed runtime
tests/ # pytest suite
tests/bats/ # shell integration tests
docs/ # README-linked operator, method, metrics,
# architecture, PRD, and operator docs
method/ # reusable Mainspring Method templates/skill
presets/ # built-in run profiles
schema/ # project config JSON Schema
packaging/homebrew/ # source-only Homebrew tap publishing runbook
.mainspring/ # gitignored project-local runtime state
logs/waves.jsonl # append-only wave ledger
state/last-run.env # safe saved setup, parsed without source
state/notify-state.json # Telegram dedup/rate-limit state
CLI contract (frozen at v1.0)
mainspring [taskmaster|night] [flags]
Modes:
taskmaster read .taskmaster/ backlog, pick ready work
night read PRD brief, writer chooses next slice
Topology:
--topology solo|team
--pair <writer>+<reviewer> e.g. claude+codex, gemini+claude
--engine <name> writer engine (when --pair not used)
--review-engine <name> reviewer engine
--model <id> override writer model
--review-model <id> override reviewer model
--speed-profile standard|fast|max
--max-agents 1-6
--once single wave then exit
--prd <path> night mode PRD path
Observe / inspect:
hud [--once|--json|--local] global live operator dashboard
status runtime + git + scheduler + waves snapshot
last-run [--format json] show saved setup + repeat commands
--metrics [--days N] query waves.jsonl
engines [--json] registered engine inventory
limits [engine ...] [--hours N] run-readiness, quota, and spend snapshot
Setup / planning:
init <name> scaffold Method PRD docs
validate-prd <path> validate Product Requirements Document shape
decompose <prd-path> turn one PRD phase into Taskmaster tasks
next [tasks.json] print next blocker-aware task id
scope-check [tasks.json] audit Taskmaster task shape
Recovery / verification:
doctor env + dependency sanity check
stop --force [--all] stop recorded Mainspring processes
--repair-state --dry-run preview stale runtime cleanup
--repair-state --force apply reviewed stale runtime cleanup
--self-test one self-test wave on a synthetic task
--self-test-all full pair-mode matrix
notify-test send sample Telegram notification
notify-health [--format json] inspect notifier daemon state
notify-restart restart only the recorded notifier daemon
Evidence / local maintenance:
replay <show|diff|build|run> inspect or reconstruct recorded waves
--list-presets print available presets
Run modifiers:
--wizard interactive setup
--last-run reuse .mainspring/state/last-run.env
--restart-team destructive: reset active team backend state
--preset <name> load preset env
--dry-run print resolved settings, no API calls
JSONL wave schema (frozen at v1.0, schema_version=1)
.mainspring/logs/waves.jsonl — one JSON line per completed wave. Append-only via flock on waves.jsonl.lock.
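A minimal sketch of a flock-guarded append, assuming a POSIX host where Python's `fcntl` module is available. `append_wave` is a hypothetical helper written for this illustration; it is not the `wave_log.py` implementation. The only details taken from this document are the one-JSON-object-per-line format and the sibling `<ledger>.lock` file.

```python
import fcntl
import json
import os
from pathlib import Path


def append_wave(ledger: Path, record: dict) -> None:
    """Append one JSON line to the ledger while holding an exclusive flock."""
    ledger.parent.mkdir(parents=True, exist_ok=True)
    # Lock a sibling file (waves.jsonl.lock) rather than the ledger itself.
    lock_path = ledger.with_name(ledger.name + ".lock")
    with open(lock_path, "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            with open(ledger, "a", encoding="utf-8") as out:
                out.write(json.dumps(record, sort_keys=True) + "\n")
                out.flush()
                os.fsync(out.fileno())  # push the line to disk before unlocking
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "logs" / "waves.jsonl"
        append_wave(path, {"wave": 1, "verdict": "PASS"})
        print(path.read_text(encoding="utf-8"), end="")
```

Writing each record as a single line under the lock keeps concurrent writers from interleaving partial lines.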
Required fields (frozen — adding new ones is the only allowed change without a schema_version bump):
| Field | Type | Description |
|---|---|---|
| `ts` | string (ISO-8601 UTC, Z) | wave completion timestamp |
| `mode` | enum | taskmaster \| night |
| `engine` | enum | writer engine: codex \| claude \| gemini \| … |
| `wave` | integer | 1-indexed wave counter within the run |
| `exit_code` | integer | writer exit code |
Standard optional fields (always emitted, may be null):
| Field | Type | Description |
|---|---|---|
| `review_engine` | enum | reviewer engine |
| `model`, `review_model` | string | model ids |
| `pair` | string | `<engine>+<review_engine>` for easy jq grouping |
| `task_id` | string \| null | Taskmaster id |
| `work_id` | string \| null | subtask id when applicable |
| `topology` | enum | solo \| team |
| `team_name` | string \| null | active team name when topology=team |
| `duration_s` | number | wall-clock seconds |
| `product_files_changed` | integer | count from count_product_file_changes |
| `verdict` | enum | PASS \| FAIL \| null (review crashed) |
| `chapter_delta` | string | +50 / -3 / 0 style, signed |
| `competitor_delta` | string | same |
| `launch_delta` | string | same |
| `product_score` | integer | 0–1000 rubric |
| `retry_used` | boolean | one-shot reviewer retry was triggered |
| `failure_reason_class` | string \| null | routing:plugin_invisible, engine:quota, review:invalid_json, … |
| `codex_short_delta_pct` | number \| null | usage delta as % of short window |
| `claude_short_delta` | number \| null | Claude usage delta |
| `gemini_short_delta_pct` | number \| null | future engines extend the same shape |
Schema versioning: required fields are frozen. Removing or renaming one bumps schema_version and triggers a 90-day deprecation window where wave_log.py writes both old and new shapes.
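As an illustration of what the frozen required fields imply for a reader, the sketch below checks one ledger line against the five required fields listed above. `check_wave_line` and its problem labels (`missing:*`, `invalid_type:*`, and so on) are hypothetical names chosen for this example; they are not taken from `wave_log.py` or `metrics.py`.

```python
import json

# Required fields and their JSON types, as listed in the table above.
REQUIRED_TYPES = {"ts": str, "mode": str, "engine": str, "wave": int, "exit_code": int}
MODES = ("taskmaster", "night")


def check_wave_line(line: str) -> list:
    """Return a list of problems for one waves.jsonl line (empty means OK)."""
    record = json.loads(line)
    problems = []
    for name, expected in REQUIRED_TYPES.items():
        if name not in record:
            problems.append("missing:" + name)
        elif type(record[name]) is not expected:  # strict: bool is not an int here
            problems.append("invalid_type:" + name)
    if type(record.get("ts")) is str and not record["ts"].endswith("Z"):
        problems.append("invalid_format:ts")  # timestamps are UTC with a Z suffix
    if type(record.get("mode")) is str and record["mode"] not in MODES:
        problems.append("invalid_value:mode")
    if type(record.get("wave")) is int and record["wave"] < 1:
        problems.append("invalid_range:wave")  # the wave counter is 1-indexed
    return problems


if __name__ == "__main__":
    sample = {"ts": "2026-06-16T10:00:00Z", "mode": "night",
              "engine": "claude", "wave": 1, "exit_code": 0}
    print(check_wave_line(json.dumps(sample)))
```

Unknown extra fields are deliberately ignored, which matches the rule that adding fields is non-breaking.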
Concrete code shapes (load-bearing)
These are the current public contracts that anchor the architecture. The live
source tree is the root-level mainspring.sh, lib/, and py/ layout.
run_ai_turn(role, prompt, log, display) — the engine dispatcher
# lib/engines.sh delegates command construction to py/engines/registry.py.
# Direct CLI engines (claude/codex) and provider engines (gemini, openai,
# anthropic, azure, openrouter, mistral, grok, ollama, litellm) all fail closed
# through the same registry readiness checks before a wave launches.
Adding a new engine means adding one adapter under py/engines/, registering
the default model/readiness contract, and covering it with registry tests.
Provider engines must never silently fall back to another provider, model, or
reviewer.
acquire_lock / release_lock — flock on fd 9
# lib/lock.sh
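acquire_lock() {
  mkdir -p "$(dirname "$LOCK_FILE")"
  # Open in append mode: a plain '>' would truncate the file and erase the
  # current holder's PID before we know whether the lock is free.
  exec 9>>"$LOCK_FILE"
  if ! flock -n 9; then
    local existing_pid
    existing_pid="$(cat "$LOCK_FILE" 2>/dev/null || true)"
    echo "Mainspring already running (pid ${existing_pid:-unknown}); stop it or wait." >&2
    exit 1
  fi
  # Lock acquired: drop any stale PID, then record ours for human inspection.
  : >"$LOCK_FILE"
  echo "$$" >&9
}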
release_lock() {
exec 9>&- 2>/dev/null || true
if [ -f "$LOCK_FILE" ] && [ "$(cat "$LOCK_FILE" 2>/dev/null || true)" = "$$" ]; then
rm -f "$LOCK_FILE"
fi
}
The kernel auto-releases fd 9 on any exit (including SIGKILL), so the script can never leave a stale lock. The PID file content is purely advisory for human inspection.
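The release-on-close behaviour can be observed directly. The sketch below assumes a POSIX host with `fcntl`; `try_lock` is a hypothetical helper, not part of `lib/lock.sh`. It takes a non-blocking exclusive flock, shows that a second attempt on the same path is refused while the first descriptor is open, and shows that closing the descriptor (which the kernel also does when a process exits) frees the lock.

```python
import fcntl
import os
import tempfile
from typing import Optional


def try_lock(path: str) -> Optional[int]:
    """Return an fd holding an exclusive flock, or None if it is already held."""
    # No O_TRUNC: a failed attempt must not wipe the holder's advisory PID.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "mainspring.lock")
        holder = try_lock(lock_path)
        print("first attempt got lock:", holder is not None)
        print("second attempt got lock:", try_lock(lock_path) is not None)
        os.close(holder)  # closing the descriptor releases the lock
        retry = try_lock(lock_path)
        print("after close got lock:", retry is not None)
        os.close(retry)
```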
check_write_scope — post-wave path guard
# lib/write_scope.sh
# Reads newline-separated changed file paths on stdin.
# Returns 0 if all paths are inside the allowed product scope.
# Returns 1 and prints offenders on stderr otherwise.
check_write_scope() {
local offenders=()
local path
while IFS= read -r path; do
[ -z "${path// }" ] && continue
case "$path" in
.env|.env.*|.secret|.secret.*|*/.env|*/.env.*|*/.secret|*/.secret.*)
offenders+=("$path"); continue ;;
node_modules/*|*/node_modules/*) offenders+=("$path"); continue ;;
.git/*|*/.git/*) offenders+=("$path"); continue ;;
dist/*|coverage/*|playwright-report/*|test-results/*)
offenders+=("$path"); continue ;;
src/*|apps/*|tests/*|e2e/*|docs/*|.taskmaster/*|scripts/*|shared/*|server/*|public/*|plugins/*)
continue ;;
*) continue ;; # top-level dotfile-clean path tolerated
esac
done
if [ "${#offenders[@]}" -gt 0 ]; then
printf 'write_scope violation: %s\n' "${offenders[@]}" >&2
return 1
fi
return 0
}
Invoked after the writer finishes, before review prompt build. Failure here forces a review-fail with reason scope:violation.
parse_review.py — required review JSON fields
REQUIRED_FIELDS = (
"verdict", "chapters", "chapter_delta", "competitor_delta", "launch_delta",
"product_score", "strengths", "gaps", "next_actions", "verification_evidence", "rationale",
)
The reviewer is prompted to emit a fenced `json` block. If the block is absent, the parser falls back to a Markdown `KEY: VALUE` parser (legacy v1 shape). If required fields are still missing after both attempts, the review is a FAIL with reason `review:missing_fields:<key>` written to JSONL.
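The parse order described above (fenced JSON first, `KEY: VALUE` fallback, then a required-field check) can be sketched as follows. This is an illustration under stated assumptions, not the real `py/parse_review.py`: the `parse_review` function, its return shape, the lower-casing of fallback keys, and the choice to report only the first missing field are all hypothetical. The `review:invalid_json` label is borrowed from the `failure_reason_class` examples in the schema table.

```python
import json
import re

REQUIRED_FIELDS = (
    "verdict", "chapters", "chapter_delta", "competitor_delta", "launch_delta",
    "product_score", "strengths", "gaps", "next_actions",
    "verification_evidence", "rationale",
)

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block
FENCED_JSON = re.compile(FENCE + r"json\s*\n(.*?)\n\s*" + FENCE, re.DOTALL)
KEY_VALUE = re.compile(r"^([A-Za-z_]+):[ \t]*(\S.*)$", re.MULTILINE)


def parse_review(text: str):
    """Return (fields, failure_reason); failure_reason is None when complete."""
    match = FENCED_JSON.search(text)
    if match:
        try:
            fields = json.loads(match.group(1))
        except json.JSONDecodeError:
            return {}, "review:invalid_json"
        if not isinstance(fields, dict):
            return {}, "review:invalid_json"
    else:
        # Fallback: legacy KEY: VALUE lines, with keys folded to lower case.
        fields = {k.lower(): v.strip() for k, v in KEY_VALUE.findall(text)}
    for name in REQUIRED_FIELDS:
        if name not in fields:
            return fields, "review:missing_fields:" + name
    return fields, None


if __name__ == "__main__":
    legacy = "\n".join(name.upper() + ": example" for name in REQUIRED_FIELDS[:-1])
    print(parse_review(legacy)[1])
```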
Phase map
This is the historical implementation map that produced the current source
release. Completed items remain as audit trail; current release truth lives in
the verified snapshot above and the v1.0 checklist below. Each future phase must
end green: make all, targeted tests, docs updates, and fresh evidence.
P-Audit — Release audit remediation (DONE 2026-05-03)
Goal: keep external release-audit findings executable through the Method tooling without moving them out of Taskmaster.
- P-Audit-1f + P4.5-5 Make canonical PRD validate clean. Preserve the ADR required subsection headings in `docs/prd.md`, keep the canonical PRD as the validator fixture, and verify `./mainspring.sh decompose docs/prd.md --phase P-Audit` completes without PRD validation errors.
  - 2026-05-03 completion status: Taskmaster task 1 is done; PRD validation and full local gates pass. Remaining PRD gaps are publication checklist items, not P-Audit validator cleanup.
P0 — Reality reset (DONE 2026-04-26)
Goal: docs and disk match reality, no parallel planning artifacts.
- Delete the unproven legacy prototype directory.
- Delete the obsolete loop refactor scratch plan (Codex-authored).
- Extract the old in-project loop script into the standalone root-level `mainspring.sh`.
- Move the operator guide into the standalone repo as `docs/guide.md`.
- Rename the old loop-script test file to the Mainspring naming convention and update legacy path references (42 occurrences).
- Update CLAUDE.md, README.md, docs/execution/README.md, current-scope-program-update-plan.md, BIG-BACKLOG-ALGORITHM.md.
- Add `.mainspring/` to `.gitignore`.
- Write this PRD (now `docs/prd.md` after OSS extraction).
- Formalize the Mainspring Method (now `docs/method.md` plus `method/` templates after OSS extraction).
- Write the operational Playbook (now `docs/playbook.md`).
- Renumber phases to insert P4.5 (Method tooling) for `mainspring init/decompose/scope-check/next/validate-prd` CLI commands.
P1 — Critical bugs + portability (1-2 days, in place on mainspring.sh) — 🟢 ACTIVE since 2026-04-26
Goal: stop the silent failures. The Claude→Claude review gate must demonstrably see writer output.
- P1-1 🔥 Fix SC2259 heredoc-overrides-pipe at lines 2454, 2641. ✅ DONE 2026-04-26. Extracted Python pretty-print to `py/stream_json_prettify.py` (`--mode writer|reviewer`, SPDX Apache-2.0, ruff clean). Replaced both heredocs with `| CLAUDE_DISPLAY_FILE="$display_file" python3 "$MAINSPRING_PY_DIR/stream_json_prettify.py" --mode <role>` (env var inline, fixes SC2031 too). Also fixed root path resolution after standalone repo extraction. Verified: `bash -n` clean, `shellcheck -S error` returned 0 SC2259, heredoc count dropped, and smoke tests cover the canonical `CLAUDE_DISPLAY_FILE`-not-empty regression for the silent-failure bug.
- P1-2 Remove hardcoded `$HOME` path at line 84. ✅ DONE 2026-04-26, hardened 2026-06-13. Replaced raw `$HOME` glob with allowlisted `fnm env --json` parsing when `fnm` is on PATH, keeping the glob as a fallback for systems without the `fnm` binary. The launcher no longer evaluates generated shell code during PATH bootstrap. Added `fnm` to `doctor` as WARN (not FAIL). No hardcoded user paths remain.
- P1-3 Fix SC2155 / SC2034. ✅ DONE 2026-04-26. Split all 5 `local x="$(date ...)"` into separate declare + assign (SC2155). Removed 5 dead variables: `status`, `phase`, `dead`, `total` from `active_team_status_summary`; `dispatched_any`, `idle_rounds` from `supervise_team_run` (SC2034). Result: `shellcheck -S warning` now returns 0 warnings (was 11).
- P1-4 Verify gate works. ✅ DONE 2026-04-26. Mainspring Wave 1 (`night --topology solo --pair claude+claude --once=false`) ran end-to-end against this PRD: writer streamed visible output via `stream_json_prettify.py`, reviewer hard-gate ratified with exit code 0 (PASS), and the display file was non-empty. The current release evidence is the v1.0 verification snapshot above; pre-release loop paths are not part of the public contract.
Acceptance:
- `shellcheck -S error mainspring.sh lib/*.sh` = 0
- `shellcheck -S warning mainspring.sh lib/*.sh` = 0 or every warning is intentional and documented
- One real wave produces non-empty display file + meaningful reviewer rationale on all 4 pair modes
Rubric impact: correctness 95→140, portability 90→105.
P2 — De-monolith (~2 weeks, incremental commits to feature branch)
Goal: main entry ≤ 500 LOC, no embedded Python, no duplication between writer/reviewer paths.
- P2-1 Extract all 9 heredocs to `py/`. (Complete: all strict `python3 - <<'PY'` heredocs extracted; the later broader heredoc cleanup is also complete.)
  - `wave_ledger` → `wave_log.py` (3 subcommands: append, last-summary, routing-hints)
  - `usage_snapshot` → `engine_quota.py` (2 subcommands: codex, claude)
  - `settings_persister` → `last_run.py` (2 subcommands: save, load)
  - `dispatch_parallel_team_tasks` → `team_dispatch.py dispatch`
  - `team_task_counts` → `team_dispatch.py counts`
  - `nudge_pending_team_tasks` → `team_dispatch.py nudge`
  - `requeue_stale_team_tasks` → `team_dispatch.py requeue`
  - `engine_stream_handler`: confirmed both paths already use `stream_json_prettify.py` (P1-1)
- P2-2 Consolidate 8 `python3 -c` invocations into one CLI: `py/team_status.py --field status|phase|dead-workers|usable-check`. All 8 callsites replaced.
- P2-1b Extract remaining 13 broader heredocs (`python3 - <args> <<'PY'`). All 13 extracted:
  - `taskmaster.py` (11 subcommands: status-summary, scope-summary, task-metadata, task-status, pick-work-item, parent-rollup, next, fallback-next, collect-parallel, team-skip-reason, scope-check)
  - `prd_brief.py` (1 subcommand: build brief from PRD markdown)
  - `runtime_state.py` (5 subcommands: team-status, runtime-summary, repair, cleanup-stale, find-latest-team)
- P2-3 Merge `run_engine` + `run_reviewer`. Implement `run_ai_turn(role)` per the contract above. Extract `spawn_codex`, `spawn_claude` per-engine helpers. Extract `run_with_heartbeat(pid, label)` for the heartbeat loop. Delete `run_engine` and `run_reviewer`. (Complete: `run_ai_turn`, `spawn_codex`, `spawn_claude`, `run_with_heartbeat` implemented in `lib/engines.sh`.)
- P2-4 Split into bash modules under `lib` per the layout above. Use `source` in `mainspring.sh`. Each module ≤ 600 LOC. Module-private helpers may not be called across module boundaries. (Complete: 24 modules extracted to `lib`, `mainspring.sh` at 391 LOC, largest module `lib/help.sh` at 598 LOC, tied with `lib/team.sh`. Full local gate on 2026-06-16: pytest suite green, 1 skipped; 223 Bats; HUD/docs-site smoke and dependency audit OK.)
- P2-5 Make hidden coupling explicit. ✅ DONE 2026-04-26. The 5 audit offenders resolved: removed obsolete `STATUS_FILE`; documented the intentional trap globals; kept the settings persister encapsulated; converted `append_wave_ledger` and `auto_checkpoint_dirty_tree` from hidden global reads to explicit parameters; and documented the globals contract in `lib/common.sh`. Current v1.0 gates cover shell lint, Python tests, Bats, and packaging smoke.
Acceptance:
- `mainspring.sh` ≤ 500 LOC
- `shellcheck` clean on all `lib/*.sh`
- `bash -n` clean on every shell file
- `doctor`, `--self-test-all`, `--last-run` all still work
- Real wave succeeds end-to-end on all 4 pair modes
- No `python3 - <<'PY'` anywhere in the tree (achieved: 0 heredocs of any form remain)
- `python3 -c` only in calls under 80 chars and not in critical paths (achieved: 0 `python3 -c` calls remain)
Rubric impact: architecture 60→155.
P3 — Tests + observability (~1 week)
Goal: safety net for aggressive P4–P7 refactors. Failing test fails the wave.
- P3-1 bats-core suite. ✅ DONE 2026-04-26. 33 bats tests across 5 files (`test_common.bats`, `test_wizard.bats`, `test_wave.bats`, `test_review.bats`, `test_lock.bats`, `test_log.bats`). Covers: `apply_pair_mode`, `apply_default_models_for_current_pair`, `count_product_file_changes`, `count_nonempty_lines`, `print_limited_lines`, `format_epoch_local`, `review_output_hard_validate`, `extract_review_field`, `acquire_lock`, `build_review_prompt`, `append_review_ledgers`, `append_wave_summary`. `# Scenario:` + `# Expected:` convention. Tests skip cleanly on missing modules. `bash tests/bats/run.sh` all green.
- P3-2 pytest suite. ✅ DONE 2026-04-26. 130 tests across 10 test files covering all 10 `.py` modules. Happy paths, edge cases, regression tests (SC2259 display_file non-empty regression, concurrent flock safety). Target was ≥35; achieved 130.
- P3-3 JSONL wave log. ✅ DONE 2026-04-26. `wave_log.py append --ledger <path>` now atomically appends under an exclusive `flock` on `<path>.lock`. `log.sh` updated to use `--ledger` instead of shell `>>` redirection. Backward-compatible stdout fallback when no `--ledger` arg. 4 new pytest tests including concurrent-safety test (5 parallel writers, all 5 entries survive). `metrics.py` reader implemented (see P3-4). Current public metrics documentation lives at `docs/metrics.md`.
- P3-4 `--metrics` command. ✅ DONE 2026-04-26. `py/metrics.py` implements all 7 standard questions: total waves, success rate, mean duration, top stuck tasks, mean chapter delta per pair, expensive waves, and pass rate per pair. Flags: `--days N`, `--since DATE`, `--format json|text`, `--pair X+Y`. Integrated into `mainspring.sh` as `--metrics [--days N] [--since DATE] [--pair X+Y] [--format json|text]`. Help text and pytest coverage lock compute, filter, format, and CLI paths.
- P3-5 Makefile + local CI. ✅ DONE 2026-04-26. The root `Makefile` owns `shell-lint`, `ruff`, `pytest`, `bats`, `lint`, `test`, `all`, and `clean`. Verified by the current v1.0 gates: shellcheck OK, ruff OK, pytest green, Bats green, HUD smoke OK, and docs-site smoke OK.
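One of the seven standard questions, pass rate per pair, shows how little a metrics reader needs beyond the ledger itself. The sketch below is not `py/metrics.py`: `pass_rate_per_pair` is a hypothetical function, and grouping records without a `pair` value under `"unknown"` is an assumption made for this example. It relies only on the `pair` and `verdict` fields from the wave schema.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def pass_rate_per_pair(lines: Iterable[str]) -> Dict[str, float]:
    """Compute the fraction of PASS verdicts per writer+reviewer pair."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the ledger
        record = json.loads(line)
        pair = record.get("pair") or "unknown"
        totals[pair] += 1
        # A null verdict (review crashed) counts as a wave but not as a pass.
        if record.get("verdict") == "PASS":
            passes[pair] += 1
    return {pair: passes[pair] / totals[pair] for pair in totals}


if __name__ == "__main__":
    ledger = [
        json.dumps({"pair": "claude+codex", "verdict": "PASS"}),
        json.dumps({"pair": "claude+codex", "verdict": "FAIL"}),
        json.dumps({"pair": "gemini+claude", "verdict": "PASS"}),
    ]
    print(pass_rate_per_pair(ledger))
```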
Acceptance:
- `make all` from the repository root green
- ≥ 25 bats + ≥ 35 pytest passing (current v1.0 evidence: 223 Bats and pytest suite green)
- coverage ≥ 80% on `py/` (achieved: `make coverage` reports 90.6%, above the 80% gate)
- `waves.jsonl` populated by every wave (achieved: flock-guarded append via `--ledger`)
- `--metrics` answers all 7 questions above (achieved: all 7 standard questions answered)
- failing test in suite fails `--self-test-all`
Rubric impact: testability 55→135, observability 130→160.
P4 — UX polish (~1 week)
Goal: “production ready” → “delightful to operate”.
- P4-1 Structured review JSON. ✅ DONE 2026-04-26. Created `py/parse_review.py` (JSON-strict parser with markdown `FIELD: VALUE` fallback, canonical fields, CLI with `parse`/`validate`/`field`/`shell-vars` subcommands). Updated `build_review_prompt()` in `lib/review.sh` to request fenced `json` as primary format, keeping legacy `FIELD: VALUE` as documented fallback. Replaced regex-scraping calls across review ledger append, hard validation, repair instructions, and field extraction with `parse_review.py` calls. Validation now produces specific error messages such as `missing FILES_TO_TOUCH` and `review:missing_fields:CHAPTERS`. Current v1.0 local gates cover this path.
  - 2026-05-03 hardening: `parse_review.py validate` now fails closed on malformed `PRODUCT_SCORE` values (`invalid_type:PRODUCT_SCORE`) and out-of-range rubric scores (`invalid_range:PRODUCT_SCORE`) instead of accepting any non-empty string. Verified by focused parse-review tests, review Bats, PRD validation, and the local gate at the time.
  - 2026-05-03 delta validation hardening: the same gate now rejects malformed `CHAPTER_SCORE_DELTA`, `COMPETITOR_DELTA`, and `LAUNCH_DELTA` values with `invalid_type:*` reasons instead of allowing non-numeric review-score evidence through. Verified by focused parse-review tests, review Bats, PRD validation, and the local gate at the time.
- P4-2 `--dry-run` mode. ✅ DONE 2026-04-26. Standalone `--dry-run` prints resolved settings (mode, pair, engines, models, topology, speed, agents, paths), writer/reviewer command shapes, and dependency checks, with zero API calls. Works with `--preset`, `--last-run`, and `--repair-state`. Integrated into `lib/wizard.sh` as `run_dry_run()` and covered by Bats plus Python tests.
  - 2026-05-04 last-run discoverability follow-up: `mainspring last-run` now shows the saved per-project setup without launching work, including `.mainspring/state/last-run.env`, saved timestamp, mode/topology/pair/models, speed/agents, PRD, CI retry settings, and exact repeat/preview commands. `mainspring --last-run` remains the execution resume path.
  - 2026-05-04 wizard resume follow-up: plain interactive `mainspring` now checks the saved per-project setup first, prints the same readable last-run summary, and offers `Continue with saved setup` before falling through to the normal manual wizard. Explicit `mainspring --last-run` remains the non-interactive resume command.
  - 2026-06-13 first-run hardening, polished 2026-06-14: plain `mainspring` now enters the guided setup surface even when stdin is not a TTY. Empty non-interactive stdin fails closed with explicit commands (`last-run` to inspect saved setup, `--last-run` to resume saved setup, `--dry-run --once` to preview defaults, and `init checkout-redesign` to scaffold PRD-backed starter docs with a replace-the-name hint) instead of falling through to `.taskmaster directory not found`. Verified by top-level Bats regressions and a fresh temp-project smoke.
  - 2026-06-09 command-alias hardening: user-facing noun commands `mainspring status`, `mainspring doctor`, and `mainspring notify-test` dispatch to the existing read-only/test paths. Compatibility flag spellings remain parser-only for old scripts and stay out of public help/docs. Verified by Bats dispatch regressions and help/README tests.
- P4-3 Presets. ✅ DONE 2026-04-26. Created root-level `presets/` with `nightly-max.env`, `conservative-docs.env`, and `fast-smoke.env`. Loaded via `last_run.py` safe parsing (no `source`). `--list-presets` shows descriptions, and `--preset <name>` loads before CLI flag resolution so flags still win.
- P4-4 Public history review kept outside the CLI. ✅ UPDATED 2026-06-15. Earlier pre-public builds included a publication-only history tool. It was removed from the runtime, package manifest, tests, and public docs before v1 because it was not product behavior and made the CLI look larger than the real operator workflow. Public history review now stays in normal git release-owner practice.
- P4-5 Failure reason taxonomy. ✅ DONE 2026-05-03.
failure_taxonomy.pystandardisesfailure_reason_classvalues onrouting:*,engine:*,review:*,scope:*, andteam:*, upgrades legacy bare classes, and owns the routing-failure action policy. Recoverable team visibility failures fall back to solo; non-recoverable task-scoped routing failures block Taskmaster with the same machine-readable code. Team preflight consults this policy before applying a Taskmaster block, andmetrics.pyuses the same normalisation helper for repeat-failure clusters. Current v1.0 gates cover taxonomy, ledger, metrics, Taskmaster, team, and full local CI. - P4-6 Worktree visibility routing rule. ✅ DONE 2026-05-03. Team mode skips Taskmaster items whose declared scope matches
MAINSPRING_TEAM_EXCLUDE_PREFIXESor the pre-v1 compatibility alias, emitsrouting:plugin_invisibleas the recoverable team skip reason, and falls back to solo per ADR-02.doctornow scans for nested.gitroots and warns when a nested repo path is not covered by the exclude prefixes, while reporting OK for covered paths and the compatibility alias. Verified bypython3 -m pytest py/tests/test_taskmaster.py -q(39 passed),bats tests/bats/test_doctor.bats(4 passed),bash -n lib/doctor.sh, andshellcheck -S warning lib/doctor.sh tests/bats/test_doctor.bats.- 2026-05-03 local helper worktree hardening:
.claude/is now a built-in team exclude prefix, so local Claude/Codex helper worktrees do not trigger doctor visibility warnings and cannot be selected for git-worktree fanout. Operator-configured exclude prefixes remain additive. Current v1.0 gates cover Taskmaster, doctor, and full local CI.
- P4-7 Bootstrap auto-close. ✅ DONE 2026-05-03. `team_dispatch.py close-bootstrap` closes active non-Task Master ...bootstrap tasks by claiming pending tasks when needed and transitioning them to completed; Taskmaster work and already-completed bootstrap tasks are ignored. Verified by `py/tests/test_team_dispatch.py`.
- P4-8 Duplicate dispatch prevention. ✅ DONE 2026-05-03. `team_dispatch.py dispatch` persists a per-team dispatch ledger keyed by Taskmaster id, refreshes from active team tasks, blocks automatic redispatch for pending/in-progress/completed/failed ids, records failed create attempts, and allows explicit `--retry-task <id>` overrides. Verified by `py/tests/test_team_dispatch.py`.
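The duplicate-dispatch guard described in P4-8 can be pictured as a small ledger keyed by Taskmaster id. The sketch below is illustrative only: the `DispatchLedger` class, its JSON layout, and its method names are assumptions made for this example, not the shipped `team_dispatch.py` interface.

```python
import json
import tempfile
from pathlib import Path

# Statuses that block automatic redispatch, as listed in P4-8.
BLOCKING = {"pending", "in-progress", "completed", "failed"}


class DispatchLedger:
    """Hypothetical per-team ledger keyed by Taskmaster id."""

    def __init__(self, path: Path):
        self.path = path
        self.entries = json.loads(path.read_text()) if path.exists() else {}

    def may_dispatch(self, task_id: str, retry: bool = False) -> bool:
        # An explicit retry override always wins; otherwise any known
        # blocking status prevents a second dispatch of the same id.
        if retry:
            return True
        return self.entries.get(task_id) not in BLOCKING

    def record(self, task_id: str, status: str) -> None:
        # Persist after every change so a restarted dispatcher sees it.
        self.entries[task_id] = status
        self.path.write_text(json.dumps(self.entries, indent=2, sort_keys=True))


ledger_path = Path(tempfile.mkdtemp()) / "dispatch-ledger.json"
ledger = DispatchLedger(ledger_path)
first = ledger.may_dispatch("42")                # nothing recorded yet
ledger.record("42", "pending")
second = ledger.may_dispatch("42")               # blocked: already pending
retried = ledger.may_dispatch("42", retry=True)  # explicit override
print(first, second, retried)
```

Persisting the ledger on every `record` call is what lets the guard survive a dispatcher restart; an in-memory set would allow the same id to be dispatched twice after a crash.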
Acceptance:
- review parser fails specifically (`missing_field:CHAPTERS`, `invalid_type:PRODUCT_SCORE`), with no silent passes (achieved: 39 pytest tests)
- `--dry-run` makes zero external API calls (verified via `strace`/test fixture) (achieved: 3 bats tests confirm output + zero API calls)
- presets cover 3 flag combos (`nightly-max`, `conservative-docs`, `fast-smoke`), loaded via safe parsing (achieved: 4 bats tests)
- public history review remains outside the public CLI surface
- a wave that violates team worktree visibility records `failure_reason_class=routing:plugin_invisible` and falls back to solo
- a non-recoverable task-scoped routing failure records `failure_reason_class=routing:scope_blocked` and blocks the Taskmaster item
- two team tasks for the same id can’t both be `pending`
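The two routing acceptance cases above amount to a small policy function. A minimal sketch follows, assuming that every `routing:*` class other than `routing:plugin_invisible` is non-recoverable; the function name, the returned action labels, and the `engine:timeout` class used in the usage line are hypothetical and not taken from `failure_taxonomy.py`.

```python
# Recoverable routing classes fall back to the solo lane. Only
# routing:plugin_invisible is named as recoverable in this PRD.
RECOVERABLE = {"routing:plugin_invisible"}


def routing_action(failure_reason_class: str) -> str:
    """Map a failure class to a hypothetical action label."""
    if not failure_reason_class.startswith("routing:"):
        return "not_routing"       # engine:*, review:*, scope:*, team:*
    if failure_reason_class in RECOVERABLE:
        return "fallback_solo"     # skip in team mode, process in solo
    return "block_taskmaster"      # assumption: all other routing classes block


for cls in ("routing:plugin_invisible", "routing:scope_blocked", "engine:timeout"):
    print(cls, "->", routing_action(cls))
```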
Rubric impact: safety 110→132, UX 70→95.
P4.5 — Mainspring Method tooling (~1 week)
Goal: make the Mainspring Method (the doctrine-first dev flow at docs/method.md) executable as Mainspring CLI subcommands. Today the Method is documented plus CLI-assisted; this phase made its key steps callable from the CLI so operators (and future Mainspring waves themselves) can invoke them programmatically.
The Method package source lives under method/ and ships as part of the Mainspring OSS release. CLI commands in this phase wrap those templates and validators.
- P4.5-1 `mainspring init <name>`. ✅ DONE 2026-05-03. `py/method_init.py` scaffolds `docs/<slug>/prd.md` from the Method PRD template, creates `.mainspring/state` and `.mainspring/logs`, records `active-prd.json`, initializes Taskmaster when available, and fails closed when the template or Taskmaster bootstrap is missing. Verified by `py/tests/test_method_init.py` and top-level Bats dispatch coverage.
- Create `docs/<name>/prd.md` from `method/templates/prd.md` with ` =` substituted.
- Run `task-master init` if `.taskmaster/` doesn’t exist.
- Create `.mainspring/` runtime state dir.
- Print next-step hints (`run mainspring doctor`, `apply the Method to write the PRD`, etc.).
- P4.5-2 `mainspring decompose <prd-path>`. ✅ DONE 2026-05-03. `py/decompose.py` validates PRDs, parses the Phase Map, selects requested or active phases, emits deterministic Taskmaster prompt plans, classifies manual blockers, and can idempotently apply generated tasks to a Taskmaster tasks file with backups and digests. Verified by `py/tests/test_decompose.py` and Bats top-level dispatch/apply coverage.
- P4.5-3 `mainspring scope-check`. ✅ DONE 2026-05-03. `taskmaster.py scope-check` audits active backlog items for vague titles, missing acceptance criteria, missing test plans, manual blockers in the wrong place, and oversized work. It exits non-zero for high-severity violations and reports clean backlogs as zero violations. Verified by `py/tests/test_taskmaster.py` and Bats top-level dispatch coverage.
- Flag tasks with vague titles (“improve X”, “clean up Y”).
- Flag manual-blocker tasks NOT in the last phase.
- Flag tasks without acceptance criteria.
- Flag tasks without test plans.
- Flag tasks > half-day estimated effort (must split).
- Print a violations report; exit non-zero if any high-severity violations.
- P4.5-4 `mainspring next` (blocker-aware). ✅ DONE 2026-05-03. `taskmaster.py next` skips blocked tasks and unmet dependencies, returns ready subtasks, and prefers tasks in the active PRD phase from `.mainspring/state/active-prd.json`. Verified by `py/tests/test_taskmaster.py` and Bats top-level dispatch coverage.
- P4.5-5 PRD-shape validator. ✅ DONE 2026-05-03. `py/prd_validate.py` validates the 17-section PRD shape, unresolved placeholders, Current Truth Snapshot source commands, ADR required subfields, and Backlog Won’t coverage. `mainspring validate-prd docs/prd.md` passes and `decompose` uses it as preflight. Verified by `py/tests/test_prd_validate.py`.
- P4.5-6 Method package shipped with Mainspring OSS release. ✅ DONE 2026-05-03. The extracted repo includes `method/` with the Method skill, templates, install script, and README. Public README frames the Method as a first-class feature and links `docs/method.md`. Verified by `py/tests/test_public_readme.py` and Method template presence.
- 2026-05-03 release-doc reconciliation: `docs/method.md`, the Method skill, Method README, Method PRD templates, and the Playbook now reflect the shipped P4.5 CLI reality (`mainspring init`, `decompose`, `scope-check`, `next`, `validate-prd`) instead of describing those commands as unshipped roadmap work. Verified by `py/tests/test_method_docs.py` plus the focused docs gate.
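The blocker-aware selection in P4.5-4 can be sketched as a filter followed by a stable sort. This is a minimal illustration, not the `taskmaster.py next` implementation: the `Task` dataclass, the status strings `pending`/`done`/`blocked`, and the phase labels are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    id: str
    status: str = "pending"   # assumed status vocabulary
    phase: str = ""           # assumed PRD phase label
    dependencies: list = field(default_factory=list)


def pick_next(tasks, active_phase):
    """Return the first ready task, preferring the active PRD phase."""
    done = {t.id for t in tasks if t.status == "done"}
    ready = [
        t for t in tasks
        if t.status == "pending"                      # skips blocked tasks
        and all(dep in done for dep in t.dependencies)  # skips unmet deps
    ]
    # Stable sort: active-phase tasks first, original order otherwise.
    ready.sort(key=lambda t: t.phase != active_phase)
    return ready[0] if ready else None


backlog = [
    Task("1", "done", "P4"),
    Task("2", "blocked", "P5"),
    Task("3", "pending", "P6", ["1"]),
    Task("4", "pending", "P5", ["2"]),   # dependency 2 is not done
    Task("5", "pending", "P5", ["1"]),
]
chosen = pick_next(backlog, active_phase="P5")
print(chosen.id if chosen else None)
```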
Acceptance:
- `mainspring init demo-feature` produces a valid `docs/demo-feature/prd.md` skeleton, `.taskmaster/` initialised, `.mainspring/` state dir present.
- `mainspring decompose docs/prd.md` produces a Taskmaster backlog matching the PRD’s current phase structure (idempotent: a second run produces no new tasks).
- `mainspring scope-check` on a backlog containing one “improve dashboard” task flags it; on a clean backlog reports “0 violations”.
- `mainspring validate-prd docs/prd.md` exits 0 (the Mainspring PRD is the canonical example and must validate clean against its own validator).
- `mainspring next` skips tasks marked `blocked` and tasks with unresolved dependencies.
- Method package documented in Mainspring’s public README at extraction time.
Rubric impact: Method productization +30, UX 95→110.
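The scope-check rules listed under P4.5-3 lend themselves to a per-task checker. The sketch below is hedged throughout: the task field names (`title`, `acceptance`, `test_plan`, `estimate_hours`), the four-hour reading of "half-day", and the violation labels are assumptions for illustration, and the manual-blocker phase rule is omitted.

```python
import re

# Vague-title patterns taken from the PRD examples ("improve X", "clean up Y").
VAGUE_TITLE = re.compile(r"^\s*(improve|clean up)\b", re.IGNORECASE)
HALF_DAY_HOURS = 4  # assumption: half a day is treated as four hours


def scope_violations(task: dict) -> list:
    """Return hypothetical violation labels for one backlog item."""
    found = []
    if VAGUE_TITLE.search(task.get("title", "")):
        found.append("vague_title")
    if not task.get("acceptance"):
        found.append("missing_acceptance")
    if not task.get("test_plan"):
        found.append("missing_test_plan")
    if task.get("estimate_hours", 0) > HALF_DAY_HOURS:
        found.append("oversized")
    return found


vague = {"title": "improve dashboard"}
clean = {
    "title": "Add retry cap to notifier watch loop",
    "acceptance": "watch loop stops after the cap",
    "test_plan": "pytest regression for the cap",
    "estimate_hours": 3,
}
print(scope_violations(vague))
print(len(scope_violations(clean)), "violations")
```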
P5 — Observability and engine support (~1.5 weeks)
Goal: the three big features the operator wants for the OSS release.
- P5-1 Telegram notifications. ✅ DONE 2026-04-27. Created `py/notify_telegram.py` (watch/send/test subcommands, SPDX Apache-2.0, ruff clean). Event classes: `wave_failed`, `retry_loop`, `loop_stopped`, `quota_warn`, `team_stuck`, `milestone`, and `daily_digest`. Per-event rate limiting and persistent dedup state live in `.mainspring/state/notify-state.json`. Daemon failure never blocks a wave. `lib/notify.sh` owns daemon start/stop/reap and `run_notify_test`; `mainspring.sh` integrates `notify-test`, `notify-health`, auto-launch, and cleanup. Env: `MAINSPRING_TELEGRAM_BOT_TOKEN`, `MAINSPRING_TELEGRAM_CHAT_ID`, `MAINSPRING_NOTIFY_ENABLED`.
- 2026-05-03 recovery hardening: `mainspring notify-health [--format text|json]` now reports disabled, unconfigured, healthy, starting, stale, and config-error states with canonical `next_step` guidance; `mainspring notify-restart` replaces only the recorded notifier PID after validating that the process command belongs to the current runtime ledger. The stuck-daemon playbook now routes operators through `notify-health` → `notify-restart` → `notify-test` instead of broad process-name kills. Verified by `python3 -m pytest py/tests/test_notify_telegram.py py/tests/test_hud.py -q` (173 passed), `bats tests/bats/test_notify.bats tests/bats/test_wizard.bats` (74 passed), `bash -n mainspring.sh lib/notify.sh lib/help.sh`, `shellcheck -S error mainspring.sh lib/notify.sh lib/help.sh tests/bats/test_notify.bats tests/bats/test_wizard.bats`, `ruff check py/notify_telegram.py py/tests/test_notify_telegram.py py/tests/test_hud.py`, and `python3 py/prd_validate.py docs/prd.md`.
- 2026-05-03 acceptance evidence: the watcher has direct regressions for burst failure batches: unrelated failed waves send one `wave_failed` alert with suppressed duplicates, while repeated failures for the same task promote to one stronger `retry_loop` alert. In both paths the watcher advances `last_line_count` through the whole batch and does not block the wave path. The killed-daemon acceptance remains covered by `tests/bats/test_notify.bats::start_notify_daemon recovers from killed daemon pid`, which removes a dead recorded PID, starts a replacement, and returns success. Verified by `python3 -m pytest py/tests/test_notify_telegram.py::TestWatchLoop::test_watch_fifty_failures_do_not_flood_wave_failed -q` and the notify Bats gate.
- 2026-05-04 project-context hardening: Telegram event messages now include `Project:` from the watched runtime ledger root and `Tag:` from `MAINSPRING_TASKMASTER_TAG` or `.taskmaster/state.json` when Taskmaster context is present. This makes simultaneous Mainspring runs distinguishable in one Telegram chat. Current v1.0 gates cover notifier regressions and full local CI.
- 2026-05-04 idle-stop alerting: the wave loop now appends an explicit `STOP` ledger row when it exits after the idle streak threshold, and the Telegram watcher emits `loop_stopped` with project/tag/task/reason context. This fixes the prior blind spot where the daemon processed final `IDLE` rows but sent no terminal notification. Verified by notifier pytest and notify Bats coverage.
- 2026-06-09 operator payload hardening: actionable Telegram events now include `Folder:` plus task, pair, result, reason, next action, and duration fields where the wave ledger has them. `retry_loop` and `team_stuck` alerts now point at the latest affected task/pair instead of only saying that something is stuck. This fixes the multi-project operator gap where one Telegram chat could not reliably tell which checkout, tag, and run needed attention.
- 2026-05-04 shutdown drain hardening: cleanup now waits briefly for an owned notifier daemon to drain pending ledger lines before reaping it. This closes the race where the idle-threshold `STOP` row was written but the watcher was killed before its next poll. Verified by Bats coverage plus local log/state diagnosis from a live project runtime.
- 2026-05-04 actionable retry-loop alerting: when the latest ledger state is already a retry loop, the watcher sends one `retry_loop` message before per-wave events and suppresses duplicate `wave_failed` noise for that batch. Retry-loop messages now include the reason plus project-local `Open:` and `Stop:` commands when the watched ledger root is known. Verified by notifier pytest coverage.
- P5-2 HUD / dashboard. ✅ DONE 2026-05-03. `mainspring hud` dispatches to `py/hud.py`, a read-only terminal HUD with text, JSON, watch, and Rich Live render modes. Snapshot panels cover current wave, recent waves, today-style metrics, quota gauges for Codex / Claude / Gemini, Telegram notifier health, ledger health, and active team state. Flags include `--once`, `--watch`, `--rich`, `--json`, `--since`, `--width`, `--interval`, and bounded `--iterations`; there is no web port or mutation control. Runtime dependency `rich>=13.7,<15` is declared in `requirements.txt`. Verified by `python3 py/hud.py --json --once --ledger .mainspring/logs/waves.jsonl --state-dir .mainspring/state | python3 -m json.tool`, Rich renders at widths 80 and 200, `py/tests/test_hud.py` coverage in `make all`, and the `hud-rich-smoke` Makefile gate.
- 2026-05-04 operator HUD polish: `mainspring hud --rich --watch` now renders the live Rich dashboard instead of rejecting the documented flag combination. HUD snapshots include the watched project folder, current/recent wave started and stopped times, and recent waves sorted newest stopped first. Verified by `python3 -m pytest py/tests/test_hud.py -q`, `./mainspring.sh hud --rich --watch --iterations 1 --interval 0 --width 120`, installed global `mainspring hud --rich --watch --iterations 1 --interval 0 --width 120`, and full `make all`.
- 2026-05-04 usability follow-up, refreshed 2026-06-13: plain `mainspring hud` now opens the live dashboard in an interactive terminal, captured/scripted output remains one-shot, stale `session.json` cwd no longer overrides the watched runtime folder, long wave IDs are compacted in the Rich table, and normal recovery surfaces now point at current-project `mainspring stop --force` instead of promoting cross-project process cleanup. Verified by focused HUD/state/README tests plus full `make all`.
- 2026-05-04 global operator follow-up: plain `mainspring hud` is now a machine-level operator dashboard, not a current-folder-only view. It discovers live Mainspring work processes from process commands/cwd plus runtime roots; rows show live status, folder, PID, Taskmaster tag, task, wave, pair, started/last stopped time, verdict, Telegram state, and team state. Stale known runtimes are opt-in via `mainspring hud --all-runtimes`; the old single-project dashboard remains available as `mainspring hud --local` or explicit `--ledger`/`--state-dir`. Verified by focused global HUD tests plus full `make all`.
- 2026-05-04 default-live follow-up: interactive plain `mainspring hud` now promotes to the Rich watch dashboard even though the shell wrapper injects global seed paths; captured output remains a finite one-shot for pipes, tests, and scripts. Verified by focused HUD CLI tests, Bats dispatch coverage, installed CLI smoke, and full `make all`.
- 2026-05-04 operator-state follow-up, refreshed 2026-06-09: global HUD rows now distinguish process liveness from operator health with human labels (`Running`, `Waiting`, `Blocked`, `Failed`, `Stopped cleanly`) while preserving stable machine state values in JSON. Rows surface failure reason, consecutive failed waves, and next action. The wave scope filters also ignore generated build caches and runtime SQLite sidecars (`build/`, `.gradle/`, `target/`, frontend caches, `*.db-wal`, `*.db-shm`) so successful Gradle/Vitest verification no longer fails the reviewer gate only because test/build tools touched generated output.
- 2026-05-04 progress follow-up: HUD snapshots now include a lightweight read-only project progress signal. Taskmaster leaf tasks for the active tag are counted first, ignoring cancelled items; if Taskmaster files are absent, the HUD falls back to Product Requirements Document (PRD) checkbox completion. Global and local HUD views surface the resulting `Progress` value so the operator can see broad movement without reading logs. Verified by focused HUD tests and CLI smoke.
- 2026-06-12 interrupt hygiene follow-up: `Ctrl-C`/`KeyboardInterrupt` in live HUD modes now exits cleanly with status 130 instead of printing a Python traceback after the Rich panel. Regression coverage exercises both plain `--watch` and Rich live modes, and the installed CLI was verified with a real `SIGINT` smoke.
- 2026-06-14 public snapshot polish: captured global HUD output now uses the same operator language as the Rich dashboard: `View: all projects on this machine`, `needs action` counts, multiline run cards, `Folder`, `Tag`, `Task`, `Progress`, `Result`, `Reason`, `Telegram`, and `Next` fields. Stale global-scope wording, vague attention labels, and dense key/value debug-line styling are removed from public snapshots and the committed HUD demo/preview assets. Verified by focused HUD/README/hygiene tests and source CLI snapshot smoke.
- 2026-06-12 process-boundary follow-up: every Python CLI entrypoint now routes through shared interrupt handling, with direct subpackage entrypoints covered by local `KeyboardInterrupt` guards. Operator interrupts now consistently exit 130 instead of leaking Python tracebacks; notifier watch persists state before exiting 130. A repo-wide static test prevents new naked `__main__` entrypoints from bypassing this contract, and packaging metadata ships `cli_runtime.py` both as an installed module and as a `share/mainspring/py` runtime helper.
- P5-3 Multi-engine provider support. Superseded by P-Comp-1’s LiteLLM-backed engine registry rather than a bespoke plugin protocol. Code-side support now routes Gemini, Grok, OpenAI, Anthropic, Azure, OpenRouter, Mistral, and Ollama through provider adapters, with missing modules or credentials failing closed instead of silently falling back. Remaining market follow-up: run at least one real non-author provider wave, preferably Gemini, against a docs-only task with explicit credentials.
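The burst behaviour described under P5-1 (one `wave_failed` alert for unrelated failures, promotion to one `retry_loop` alert for repeated failures on the same task) can be sketched as a batch reducer. This is an assumption-laden illustration rather than the `notify_telegram.py` watcher: the function name, the `retry_threshold` parameter, and the returned dictionary shape are hypothetical.

```python
from collections import Counter


def summarize_failures(failed_task_ids, retry_threshold=2):
    """Collapse one batch of failed waves into at most one alert."""
    if not failed_task_ids:
        return None
    task_id, hits = Counter(failed_task_ids).most_common(1)[0]
    suppressed = len(failed_task_ids) - 1  # everything beyond the one alert
    if hits >= retry_threshold:
        # Repeated failures on one task promote to the stronger event.
        return {"event": "retry_loop", "task": task_id, "suppressed": suppressed}
    # Unrelated failures: alert on the latest, suppress the rest.
    return {"event": "wave_failed", "task": failed_task_ids[-1], "suppressed": suppressed}


unrelated = [str(n) for n in range(50)]   # fifty distinct failing tasks
looping = ["7", "7", "7"]                 # the same task failing repeatedly
print(summarize_failures(unrelated))
print(summarize_failures(looping))
```

Reducing the whole batch before sending is what keeps fifty consecutive failures from producing fifty messages; the watcher can still advance its read position past every processed line.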
Acceptance:
- Telegram daemon survives 50 consecutive wave failures without flood; sends 1 message + 1 dedup-suppressed log line per remaining failure
- Telegram daemon kill -9 → wave loop unaffected
- `mainspring hud` runs cleanly on a populated `waves.jsonl`; `--json` output round-trips through `jq`
- adding `engines/grok.py` requires zero changes to `engines.sh` or `wizard.sh`
- `mainspring --pair gemini+claude --once` runs end-to-end against a docs-only task
Rubric impact: observability 160→185, ergonomics 95→120, extensibility +20 (new axis).
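The HUD progress signal from P5-2 (Taskmaster leaf tasks first, ignoring cancelled items, with a PRD checkbox fallback) can be approximated as below. The task dictionary shape (`status`, `subtasks`), the `done`/`cancelled` status strings, and the checkbox regex are assumptions for illustration, not the `py/hud.py` implementation.

```python
import re


def leaf_tasks(tasks):
    """Yield tasks that have no subtasks, walking nested subtasks."""
    for task in tasks:
        subtasks = task.get("subtasks") or []
        if subtasks:
            yield from leaf_tasks(subtasks)
        else:
            yield task


def progress(tasks=None, prd_text=""):
    """Return (done, total) from Taskmaster leaves or PRD checkboxes."""
    if tasks is not None:
        leaves = [t for t in leaf_tasks(tasks) if t.get("status") != "cancelled"]
        done = sum(1 for t in leaves if t.get("status") == "done")
        return done, len(leaves)
    # Fallback: count markdown task-list checkboxes in the PRD text.
    boxes = re.findall(r"^\s*[-*] \[([ xX])\]", prd_text, flags=re.MULTILINE)
    return sum(1 for box in boxes if box != " "), len(boxes)


sample_tasks = [
    {"status": "done"},
    {"status": "pending", "subtasks": [
        {"status": "done"}, {"status": "cancelled"}, {"status": "pending"},
    ]},
]
sample_prd = "- [x] ship HUD\n- [ ] ship digest\n* [X] ship notifier\n"
print(progress(sample_tasks))
print(progress(None, sample_prd))
```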
P6 — Metrics-driven routing (~1 week)
Goal: the routing default (which pair, which topology) gets chosen by data, not preference.
- P6-1 Extend JSONL fields. ✅ DONE 2026-05-03. `wave_log.py` emits the additive v1 fields `topology`, `team`, `team_name`, `failure_reason_class`, `task_status_before`, and `task_status_after` without a schema bump. `failure_reason_class` is derived from explicit values or failure prefixes for routing readers. Verified by `py/tests/test_wave_log.py` coverage in the focused routing gate.
- P6-2 Routing report. ✅ DONE 2026-05-03. `mainspring --metrics --routing` now reports pass rate by pair, pass rate by topology, mean duration by task class, repeat-failure clusters, cost per chapter delta per pair, 14-day chapter-delta-per-dollar values, and auto-disable candidates. Verified by `python3 -m pytest py/tests/test_metrics.py -q` (22 passed), `mainspring --metrics --routing --format json --days 365 | python3 -m json.tool`, and the focused 127-test routing/LiteLLM/ledger gate.
- P6-3 Auto-disable rule. ✅ DONE 2026-05-03. `metrics.py --routing --update-disabled-pairs` writes `.mainspring/state/disabled-pairs.json`, preserves manual reactivation, and only recommends lower-cost/fast lanes below 70% of the 14-day median chapter-delta-per-dollar value. `run_wizard()` hides auto-disabled default pairs and offers manual override. Verified by `py/tests/test_metrics.py` and `tests/bats/test_wizard.bats` coverage; live production disablement remains data-dependent.
- P6-4 Daily digest content. ✅ DONE 2026-05-03. `build_daily_digest()` now emits data-derived digest lines for total waves, pass rate, mean duration, total cost, current quota status, disabled pairs, top 3 stuck task ids, tokens by pair, role tokens by pair, top 3 cost waves, and cost-per-positive-movement metrics. Cost truth prefers ledger `cost_usd`/`total_cost_usd`; if absent, it estimates only from known model prices + token counts and labels estimated rows. Verified by `python3 -m pytest py/tests/test_notify_telegram.py py/tests/test_notifier_recovery_docs.py -q` (140 passed), including P-Comp-5 calendar-day acceptance coverage.
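The cost-truth rule in P6-4 (prefer ledger cost fields, otherwise estimate from known prices and label the row) can be sketched as a single helper. Only `cost_usd` and `total_cost_usd` come from this PRD; the `model` and `tokens` field names, the per-1k-token price table, and the model name in the usage lines are hypothetical.

```python
def wave_cost(row: dict, price_per_1k_tokens: dict):
    """Return (cost_usd, is_estimated) for one ledger row, or (None, False)."""
    # Ledger-recorded cost is treated as truth when present.
    for key in ("cost_usd", "total_cost_usd"):
        if row.get(key) is not None:
            return float(row[key]), False
    # Otherwise estimate only when both price and token count are known.
    price = price_per_1k_tokens.get(row.get("model"))
    tokens = row.get("tokens")
    if price is None or tokens is None:
        return None, False
    return tokens / 1000 * price, True


prices = {"example-model": 0.25}  # illustrative price, not a real rate
print(wave_cost({"cost_usd": 0.5}, prices))
print(wave_cost({"model": "example-model", "tokens": 2000}, prices))
print(wave_cost({"model": "unknown-model", "tokens": 2000}, prices))
```

Returning the `is_estimated` flag alongside the number lets the digest label estimated rows instead of mixing them silently with ledger-recorded costs.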
Acceptance:
- `--metrics --routing` answers “which pair is best for tests today” with a number + sample size
- a deliberately bad pair (e.g. `claude-haiku+claude-haiku`) auto-disables after 14 days of underperformance
- the daily digest contains zero hand-written prose
Rubric impact: observability 185→210, decision quality +30.
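The P6-3 auto-disable rule reduces to a median comparison with two exemptions. The sketch below assumes the 14-day chapter-delta-per-dollar values have already been aggregated per pair; the function name, argument names, and the sample numbers are illustrative, and the premium exemption follows ADR-04.

```python
from statistics import median


def auto_disable_candidates(value_by_pair, premium_pairs,
                            manually_enabled=frozenset(), threshold=0.70):
    """Pairs below threshold * median, excluding premium and re-enabled pairs."""
    if not value_by_pair:
        return []
    cutoff = threshold * median(value_by_pair.values())
    return sorted(
        pair for pair, value in value_by_pair.items()
        if value < cutoff
        and pair not in premium_pairs       # premium pairs are never disabled
        and pair not in manually_enabled    # manual reactivation is preserved
    )


# Illustrative 14-day chapter-delta-per-dollar values (made-up numbers).
values = {
    "codex+claude": 10.0,
    "claude+codex": 8.0,
    "claude-haiku+claude-haiku": 2.0,
    "gemini+claude": 1.0,
}
premium = {"codex+claude", "claude+codex", "gemini+claude"}
print(auto_disable_candidates(values, premium))
```

With these numbers the median is 5.0 and the cutoff is 3.5, so only the non-premium pair below the cutoff is listed.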
P7 — Repo extraction + GitHub release (1-2 days)
Goal: Mainspring ships as its own Apache-2.0 OSS repo on GitHub as a clean source-install v1.0 release. The current public release procedure is the single checklist in v1.0 GitHub release checklist; older scratch bootstrap commands are not part of the public contract.
- P7-1 Internal renames. ✅ DONE 2026-05-03. One-pass mechanical cleanup across the now-modular tree:
- pre-v1 env aliases → `MAINSPRING_*` (with backwards-compat: read both, log deprecation for one minor version, drop in v1.1.0)
- pre-v1 runtime paths → `.mainspring/` (fresh public projects start directly in `.mainspring/`; no public compatibility helper ships)
- All prose mentions of the legacy product name → “Mainspring”
- Internal log labels (`[claude]` → unchanged because those are engine names, NOT the tool name)
- 2026-05-03 completion status, reconciled 2026-06-13 for public v1: operator-facing runtime defaults read `MAINSPRING_*` first with compatibility fallback. Runtime-root resolution is centralized: fresh projects use `.mainspring/` for logs/state/team/lock/last-run paths. Historical extraction shims are not part of the public source surface. Helper fallbacks in status/team/doctor/self-test default to `.mainspring` when sourced without top-level launcher globals. Current v1.0 gates cover status/team/wizard regressions, PRD validation, and packaging/SPDX checks.
- 2026-05-03 prose cleanup closure: literal legacy product-name search now returns no matches outside ignored runtime/build directories. Focused verification passed with `bats tests/bats/test_common.bats` (8 passed), `python3 -m pytest py/tests/test_mainspring_bootstrap.py py/tests/test_public_readme.py py/tests/test_prd_validate.py -q` (27 passed, 1 skipped), `bash -n lib/common.sh mainspring.sh`, `shellcheck -S warning lib/common.sh mainspring.sh`, and `python3 py/prd_validate.py docs/prd.md`.
- 2026-05-03 standalone command cleanup: operator-facing generated commands now target the extracted repo contract (`./mainspring.sh`, `mainspring`, and root-level `make all`). Covered surfaces: Method init next steps, PRD decomposition writer prompts, replay command reconstruction, stale-process cleanup matching, Method task templates, and the operator playbook. Legacy process patterns remain only where needed for migration or safe cleanup.
- P7-2 Source tree.

    mainspring/                      # repo root
    |-- mainspring.sh                # entry script
    |-- lib/                         # bash modules
    |-- py/                          # python helpers and pytest suite
    |-- presets/                     # env presets
    |-- schema/                      # project config schema
    |-- tests/bats/                  # bash test suite
    |-- tests/golden-runs/           # replay regression fixtures
    |-- docs/
    |   |-- architecture.md
    |   |-- competitive-analysis.md
    |   |-- method.md
    |   |-- metrics.md
    |   |-- guide.md
    |   |-- playbook.md
    |   |-- prd.md
    |   `-- assets/
    |-- Makefile
    |-- README.md                    # quickstart + screenshots + GitHub flair
    |-- LICENSE                      # Apache-2.0 boilerplate
    |-- NOTICE                       # mandatory under Apache-2.0
    |-- SECURITY.md                  # vulnerability reporting policy
    |-- CONTRIBUTING.md              # how to add an engine adapter, run tests
    |-- CHANGELOG.md                 # starts at v1.0.0
    |-- .github/workflows/ci.yml     # shellcheck + ruff + bats + pytest matrix
    `-- .gitignore

- P7-3 Apache-2.0 hygiene. `LICENSE` + `NOTICE` files. SPDX header on every source file: `# SPDX-License-Identifier: Apache-2.0`. Third-party deps (`rich`, `pytest`, `ruff`, `shellcheck`, `bats-core`) listed in `NOTICE`. No copyright on the author personally; copyright is held by “Mainspring contributors” so future PR authors don’t need a CLA.
- P7-4 Public README. ✅ DONE 2026-05-04, refreshed 2026-06-14. `README.md` is a GitHub-facing landing page with release badges, a committed visual hero (`docs/assets/readme-hero.svg`), explicit current GitHub source install instructions, generic PATH guidance for `~/.local/bin`, plain-language positioning, green/red badge decision tables, one-command start (`mainspring`), concise default `mainspring --help` plus full `mainspring help --full` contract, Product Requirements Document (PRD) explanation, vibe-coding tradeoff framing, key-feature table, HUD and Telegram sections, copy/paste commands, engine support matrix, canonical docs links, and no unsupported release-management CLI commands. The committed HUD preview and asciinema demo now use the same public snapshot vocabulary as the actual CLI and avoid stale table/debug output. Verified by `python3 -m pytest py/tests/test_public_readme.py py/tests/test_hud.py py/tests/test_no_hardcoded_paths.py -q`, `make release-check`, `make install-user`, source/global CLI smoke, SVG render smoke, and JSONL parsing of the asciinema cast.
- P7-5 GitHub publication procedure. The source tree is already a standalone repository. Publication uses the maintained checklist below: run local gates, publish only the reviewed final release commit, make the repository public, set the Product Requirements Document (PRD)-first description and topics, ensure a signed `v1.0.0` tag exists on the final clean commit, and publish a GitHub Release from `CHANGELOG.md`. Keep publication as ordinary GitHub work; do not add release-only Mainspring commands.
Acceptance:
- The GitHub repository is public, Apache-2.0, and has the production-grade PRD-first description plus discovery topics.
- `git clone <repo> <fresh-dir> && cd <fresh-dir> && make all` green from a fresh clone.
- `mainspring --help` works on a fresh box with only `bash`, `python3`, `git` installed (other deps reported by `mainspring doctor`)
Rubric impact: packaging +50 (new axis), distributability +30.
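The P7-1 compatibility rule for env aliases (read both names, prefer `MAINSPRING_*`, log a deprecation) can be sketched as a lookup helper. The helper name and the legacy variable name `LEGACYTOOL_NOTIFY_ENABLED` are hypothetical; the PRD does not name the pre-v1 aliases.

```python
import os
import warnings


def read_setting(name, legacy_name, env=None, default=None):
    """Read MAINSPRING_* first, then a deprecated pre-v1 alias."""
    env = os.environ if env is None else env
    if name in env:
        return env[name]
    if legacy_name in env:
        # Keep working for one minor version, but tell the operator.
        warnings.warn(
            f"{legacy_name} is deprecated; use {name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[legacy_name]
    return default


# LEGACYTOOL_NOTIFY_ENABLED is a made-up alias for illustration only.
fake_env = {"LEGACYTOOL_NOTIFY_ENABLED": "1"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = read_setting("MAINSPRING_NOTIFY_ENABLED", "LEGACYTOOL_NOTIFY_ENABLED", env=fake_env)
print(value, len(caught))
```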
Architecture decisions (ADRs)
Six load-bearing decisions. Each is reversible only at high cost; each is documented here so future maintainers can read them and either re-confirm or override.
ADR-01: License = Apache-2.0
Context: Mainspring will become public OSS. License choice is effectively permanent (changing it later requires consent from every contributor).
Options considered: MIT (simplest), Apache-2.0 (explicit patent grant), BSD-2-Clause.
Decision: Apache-2.0.
Rationale: Mainspring is infrastructure tooling (runs for life), not a 200-LOC library. The patent grant matters because: (a) the engine-adapter pattern is novel-ish; (b) someone could fork Mainspring, patent the adapter approach, and try to enforce against the original. Apache-2.0's explicit contributor patent grant and patent-retaliation clause (§3) reduce that risk. Mature infrastructure projects (Kubernetes, Bazel, Prometheus, Gradle) use Apache-2.0; matching that signals enterprise readiness and simplifies legal review for adopting teams. MIT is simpler but loses the patent grant for no practical gain.
Consequences: every source file gets an SPDX header; NOTICE file required; copyright held by “Mainspring contributors” (no CLA, future contributors implicitly accept under §5).
Reversal cost: very high (relicensing public OSS requires every contributor’s consent). Get this right now.
ADR-02: Nested-repo strategy = configurable team-exclude
Context: Operators sometimes have nested git repos (submodule-style, ignored nested checkouts) inside their workspace. Team workers operate in worktrees that don’t see those nested repos; team-mode dispatch of nested-repo-scoped tasks would silently fail.
Options considered: ignore nested repos and let reviewer failure catch it; force all nested-repo work to solo manually; auto-detect nested git roots on every dispatch; expose an explicit exclude-prefix knob.
Decision: Mainspring exposes MAINSPRING_TEAM_EXCLUDE_PREFIXES plus a pre-v1 compatibility alias as a colon/comma-separated list of path prefixes that team mode skips. .claude/ local helper worktrees are excluded by default, and operator-configured prefixes are additive. Team mode skips matching Taskmaster items with failure_reason_class=routing:plugin_invisible. Such work routes to the solo lane, which sees the nested repo because it runs in the leader workspace.
Rationale: generic exclusion is a one-line, fully-tested guard; the operator decides which paths are nested-repo or team-invisible per project.
Consequences: team mode may leave some ready tasks untouched when their scope matches an exclude prefix; operators must keep the prefix list honest per project. Doctor and routing reports must make skipped scopes visible so the skip is never silent.
Reversal cost: low — change the env var.
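The exclude-prefix behaviour in ADR-02 can be sketched in a few lines: split the colon/comma-separated value, add the built-in `.claude/` prefix, and match a task's declared scope. The function names and the plain `startswith` matching are assumptions for illustration; the sample paths are made up.

```python
import re

# .claude/ is excluded by default; operator prefixes are additive.
BUILTIN_EXCLUDES = (".claude/",)


def exclude_prefixes(raw):
    """Parse a colon/comma-separated prefix list and merge the built-ins."""
    configured = [p.strip() for p in re.split(r"[:,]", raw or "") if p.strip()]
    merged = list(BUILTIN_EXCLUDES)
    for prefix in configured:
        if prefix not in merged:
            merged.append(prefix)
    return merged


def team_skip_reason(scope_path, raw):
    """Return the recoverable skip reason when the scope is team-invisible."""
    for prefix in exclude_prefixes(raw):
        if scope_path.startswith(prefix):
            return "routing:plugin_invisible"
    return None


raw_value = "plugins/foo/:vendor/bar/, extras/"
print(exclude_prefixes(raw_value))
print(team_skip_reason("plugins/foo/src/a.py", raw_value))
print(team_skip_reason("src/main.py", raw_value))
```

A task that returns a skip reason here would be left for the solo lane, which runs in the leader workspace and can see the nested repo.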
ADR-04: Model policy = always premium
Context: routing decision — keep fast/mini lanes for low-risk docs-only work, or force every wave through the most capable model?
Options considered: always premium; low-risk docs-only lanes; dynamic pair selection by recent metrics; manual per-wave model choice.
Decision: always premium. Default models: Codex gpt-5.5 with reasoning_effort=xhigh; Claude opus; Gemini gemini-2.5-pro. No “fast” lane is shipped as a default preset.
Rationale: Mainspring’s mission is high-quality autonomous execution. Cheaping out on a docs task that the reviewer then has to re-do costs more cycles than running premium once. The user explicitly chose this. Future engines must default to their flagship model.
Consequences: the P6 metrics-driven auto-disable rule is restricted to non-default lanes that an operator explicitly enabled. Premium pairs are never auto-disabled.
Reversal cost: low — change defaults in wizard.sh.
ADR-05: Team failure semantics = both failed + blocked
Context: when a team task fails for a non-recoverable routing reason attached to a Taskmaster item (for example an explicit scope block or stale empty parent), what state goes where? Recoverable preflight visibility skips are covered by ADR-02 and route to solo instead of blocking the backlog.
Options considered: mark only the team backend task failed; mark only the Taskmaster item blocked; retry indefinitely with the same routing; dual-mark both systems with the same machine-readable reason.
Decision: for non-recoverable task-scoped routing failures, mark the team backend task failed (with failure_reason_class recorded in the team ledger and waves.jsonl), AND mark the Taskmaster item blocked (with the same machine-readable reason in the task body). Supervision must not re-dispatch a known-blocked item until the operator clears the block manually. Recoverable reasons such as routing:plugin_invisible are logged as failed team-preflight rows and then processed by the solo lane.
Rationale: dual-marking gives the operator two views of the same fact: the team metrics show “this team had N routing failures of class X” (useful for triage), and Taskmaster shows “task #42 is blocked because Y” (useful when picking next work). Single-marking either way loses one of those views.
Consequences: blockers become explicit operator work instead of hidden scheduler state; clearing a false-positive block requires manual Taskmaster action. Metrics can group by failure_reason_class across both ledgers because the same code is written to both places.
Reversal cost: medium — would require unwinding the dispatch ledger schema.
ADR-06: Auto-checkpoint policy = keep recovery commits out of public history
Context: Mainspring’s auto-checkpoint commits operational state during fanout (using Lore trailers + denylist). Final history quality depends on the operator reviewing checkpoints and publishing semantic commits.
Options considered: disable auto-checkpoint entirely; keep operational checkpoint commits as final history; auto-squash without asking; keep checkpointing and document public-history review.
Decision: keep auto-checkpoint as-is, but keep public-history preparation outside the Mainspring CLI. The operator uses normal git review and semantic commits before publication.
Rationale: auto-checkpoint preserves work without operator intervention, which is the whole point of autonomous execution. Manual finalization preserves history quality, which is the point of OSS publication. Doing both means the worst-case path still has an auditable trail of what happened, while the happy path ships clean semantic commits.
Consequences: operators get durable recovery points during autonomous fanout, but PR branches still require final history review. Mainspring does not expose a public history-rewrite command.
Reversal cost: low — this keeps history tooling outside the product surface.
ADR-07: Implementation language strategy = Bash for orchestration, Python for structured data
Context: Mainspring orchestrates AI agents through subprocesses (Codex, Claude, Gemini CLIs), parses their structured outputs (stream-json events, review verdicts), manages tmux + worktree fanout, and reads/writes JSONL state files. The natural shape spans two very different concerns: shell glue (process spawning, pipes, signal handling, fanout) and structured data (JSON parsing, schema validation, formatted reporting). Choosing one language for both means losing the other’s strengths.
Options considered:
- Bash only. Verified painful in v1: 4644 LOC monolith, 9 embedded python3 - <<'PY' heredocs that produced the canonical SC2259 silent-failure bug, 8 inline python3 -c '...' parsers, JSON munging through a sed/awk underbelly. The whole P0+P1+P2 effort exists because this approach was untenable.
- Python only. Clean tests, single language, async streaming via Anthropic/OpenAI SDKs, idiomatic for AI-agent tools (Aider, GPT Engineer, Mentat, Claude Engineer all chose Python). But: subprocess+pipe plumbing is verbose (Popen with stdin/stdout PIPE + signal handling = ~3× the bash-equivalent LOC); tmux + worktree fanout is awkward through subprocess; loses Bash’s “pipe is a first-class verb” feel.
- Go. Single binary, no runtime dependency, fast startup, strong concurrency primitives. But: build pipeline required (cross-compile per platform), every release becomes a binary distribution problem, maintainers would have to own CI for releases; raises the adoption barrier from “git clone, run” to “download binary or set up Go toolchain”.
- Rust. Same upside as Go but more packaging and contributor friction than this local operator tool needs.
- Node/TypeScript. Awkward as shell-glue (everything goes through child_process.spawn); npm dependencies in an OSS CLI tool is an anti-pattern; would clash with the planned pip install distribution path.
- Bash + Python (current). Bash for what it’s good at; Python for what it’s good at; explicit CLI boundary between them; both pre-installed on every macOS/Linux developer workstation; zero build step. The split that organically emerged from the v1 → v2 refactor.
Decision: Bash + Python, with a strict CLI boundary.
- Bash owns: orchestration (mainspring.sh + 23 modules in lib/), CLI flag parsing, process spawning, pipes, signals, tmux + worktree fanout, lockfile management, log directories.
- Python owns (py/*.py): JSON parsing (stream-json prettify, review verdict parsing, runtime-state queries), JSONL emission (wave_log.py), schema validation, taskmaster query helpers, team-dispatch JSON mangling, engine quota snapshots, safe env-file loading.
- Boundary: Bash invokes Python via real CLIs (python3 py/<name>.py --flag value), never via embedded heredocs. Python modules accept all input via argv/stdin/explicit env vars; no ambient config. Each Python CLI is independently testable with pytest.
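To make the boundary concrete, here is a minimal sketch of what a helper that obeys it might look like. The module name (a hypothetical py/count_events.py), the `--type` flag, and the `"type"` field are illustrative assumptions, not part of Mainspring. What the sketch does show is the contract: input arrives only through argv and stdin, output goes to stdout, and the core function is importable for pytest.

```python
import argparse
import json
import sys
from typing import Iterable


def count_events(lines: Iterable[str], event_type: str) -> int:
    """Count JSON objects whose "type" field equals event_type."""
    count = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        event = json.loads(line)  # raises json.JSONDecodeError on bad input
        if isinstance(event, dict) and event.get("type") == event_type:
            count += 1
    return count


def main(argv=None) -> int:
    # All configuration comes from argv; nothing is read from ambient state.
    parser = argparse.ArgumentParser(description="Count stream events of one type.")
    parser.add_argument("--type", required=True, help="event type to count")
    args = parser.parse_args(argv)
    try:
        total = count_events(sys.stdin, args.type)
    except json.JSONDecodeError as exc:
        print(f"invalid JSON on stdin: {exc}", file=sys.stderr)
        return 2
    print(json.dumps({"type": args.type, "count": total}))
    return 0
```

A real module would end with an `if __name__ == "__main__": raise SystemExit(main())` guard, so Bash can pipe a stream into `python3 py/count_events.py --type result` and branch on the exit code. Since stdin is consumed by a separate process, not by a heredoc, the writer's stream cannot be silently clobbered.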
Rationale:
- Bash is the right shape for orchestration. Spawning two processes, piping their outputs through tee and a transformer, backgrounding with &, waiting on PIDs: these are bash one-liners. The same logic in Python is subprocess.Popen(stdin=PIPE, stdout=PIPE, ...) plus thread/select juggling. P2-3’s run_ai_turn is 30 lines of bash; the equivalent Python would be 80–120 lines and a class.
- Python is the right shape for structured data. Every .py module does one thing: parse, validate, format. The current pytest suite gives broad parser, formatter, runtime-state, package, docs, and release-surface coverage with zero ceremony. Equivalent bash would be 3× the LOC and far weaker assertions.
- The CLI boundary kills the v1 anti-pattern. The SC2259 bug existed because Python was embedded as a heredoc inside bash; the heredoc clobbered stdin and the writer’s stream silently dropped. With real .py files invoked through subprocess, the boundary is explicit, contract-bound, and unit-testable. The bug is structurally impossible to recreate.
- Zero build step matters for adoption. The current honest public path is git clone https://github.com/dlogvinenko/mainspring.git, cd mainspring, make install-user, then cd <project> and mainspring; direct PyPI and Homebrew installs are future distribution work. Source checkouts still work with ./mainspring.sh --project <path>. Adding Go/Rust would require either a cross-platform release pipeline or go install / cargo install as install instructions, both of which add friction before the source-install path has public usage. Bash + Python is what direnv, asdf, nvm ship as, and they have lived for a decade.
- The current maintenance stack already includes Bash and Python, and both are common on developer workstations. A Go/Rust rewrite would force another primary language onto the maintenance surface before the source-install path has public adoption.
- 2026-05-03 trigger note: P-Comp-1 has activated re-evaluation trigger #1. Engine command construction and provider dispatch are now Python-owned through the registry + LiteLLM runner; Bash still owns pipes, process supervision, and log capture. The live provider evidence remains credential-gated and must not be faked.
Consequences:
- Two languages to lint and test (shellcheck + bash -n for shell, ruff + pytest for Python). The Makefile (P3-5) absorbs this: one make lint test runs everything.
- Distribution is source-first: git clone, make install-user, then run the global mainspring command from any project. method/install.sh installs only the optional Method skill. Direct PyPI and Homebrew channels remain follow-up distribution work after the source release.
- AI-agent ecosystem norms (Aider, GPT Engineer, etc. → Python) are noted but not followed. Mainspring’s identity is “shell-orchestrator that calls Python helpers”, not “Python AI agent”.
- Provider adapters now live in Python behind the registry + LiteLLM runner; bash side stays minimal.
Reversal cost: medium-to-high for a full rewrite to pure Python; trivial for incremental Python expansion (e.g. moving more bash logic into Python on a per-module basis). The boundary is designed to allow incremental migration if we ever decide to go pure Python — bash would shrink module by module while CLI calls stay stable.
Re-evaluate when:
- We need async streaming directly through Anthropic/OpenAI SDKs (skipping the claude -p / codex exec CLIs). At that point a pure-Python rewrite becomes attractive because the SDKs are Python-first.
- We want pip install mainspring as a distribution channel after the v1.0 OSS source release.
- We want a real plugin system for engines (Python entry_points beats bash sourcing).
Until one of those three triggers fires, the current split is the right shape. Bash for shell, Python for data. Each language gets the work it was designed for.
Operational doctrine
How to actually use Mainspring in daily work. This is the lived contract; see guide.md for the full command reference.
When to use solo vs team
Default to solo unless the explicit reason for team is satisfied:
- 4 ready Taskmaster items with non-overlapping scope, AND
- leader workspace is clean (or only .taskmaster/ dirty), AND
- no scope-blocked items in the head of the queue, AND
- no plugin/nested-repo items in the head of the queue.
Otherwise solo. Solo is faster to debug, doesn’t require tmux capacity, and produces the same quality output for single-item work.
Rule of thumb: if solo would finish the queue faster than team would even start (because of fanout overhead), pick solo.
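The checklist above reduces to a single predicate. The sketch below is illustrative only: `QueueState` and its field names are hypothetical, and it reads "4 ready items" as "at least 4", which is an assumption about the intended threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueState:
    ready_non_overlapping: int       # ready Taskmaster items with disjoint scope
    dirty_paths: tuple               # uncommitted paths in the leader workspace
    head_has_scope_blocked: bool     # scope-blocked item near the head of the queue
    head_has_plugin_or_nested: bool  # plugin/nested-repo item near the head


def choose_mode(state: QueueState, min_ready: int = 4) -> str:
    """Return "team" only when every condition holds; otherwise "solo"."""
    # A workspace counts as clean if nothing is dirty outside .taskmaster/.
    workspace_ok = all(
        path == ".taskmaster" or path.startswith(".taskmaster/")
        for path in state.dirty_paths
    )
    if (
        state.ready_non_overlapping >= min_ready
        and workspace_ok
        and not state.head_has_scope_blocked
        and not state.head_has_plugin_or_nested
    ):
        return "team"
    return "solo"
```

Defaulting to "solo" on any failed condition matches the doctrine: team mode has to earn its fanout overhead.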
Pair selection (until P6 metrics override)
| Goal | Pair | Why |
|---|---|---|
| Maximum quality, no speed concern | claude+codex (opus + gpt-5.5 xhigh) | best writer + best reviewer; differing model families catch each other’s blind spots |
| Same family double-check | claude+claude or codex+codex | useful when one provider is rate-limited |
| Most reasoning needed (complex refactors) | codex+codex xhigh | Codex with xhigh effort and Codex review is the highest-effort lane |
| Fastest decision (only when single-step) | claude+claude | Claude’s tool-calling is faster than Codex round-trips |
After P6 lands, consult mainspring --metrics --routing and use the data, not the table.
Reading --metrics
Three signals matter most:
- Pass rate per pair, last 14 days. If a pair drops below 70%, investigate before another wave on it. Below 50% → auto-disable should have kicked in (P6).
- Top 5 stuck task ids. A stuck task = ≥3 consecutive FAIL waves. Promote stuck tasks out of the queue: either manual review, switch pair, or mark blocked with a reason.
- Mean duration trend. Sudden 2x increase = engine quality degrading or task complexity drifting.
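For readers who want to see how the first two signals could be derived from wave records, here is a minimal sketch. The field names `pair`, `task_id`, and `result` are assumptions for illustration; docs/metrics.md remains the authority on the real JSONL schema. The caller is assumed to have already filtered the records to the window of interest (for example, the last 14 days) and ordered them oldest-first.

```python
from collections import defaultdict


def pass_rate_by_pair(waves):
    """Map each pair to passed/total over the given waves."""
    counts = defaultdict(lambda: [0, 0])  # pair -> [passed, total]
    for wave in waves:
        counts[wave["pair"]][1] += 1
        if wave["result"] == "PASS":
            counts[wave["pair"]][0] += 1
    return {pair: passed / total for pair, (passed, total) in counts.items()}


def pair_status(rate):
    """Apply the 70% / 50% thresholds from the doctrine above."""
    if rate < 0.50:
        return "auto-disable"
    if rate < 0.70:
        return "investigate"
    return "ok"


def stuck_tasks(waves, threshold=3):
    """Task ids whose latest waves are `threshold` or more consecutive FAILs."""
    streak = defaultdict(int)
    for wave in waves:  # assumed oldest-first
        if wave["result"] == "FAIL":
            streak[wave["task_id"]] += 1
        else:
            streak[wave["task_id"]] = 0  # any non-FAIL result resets the streak
    return sorted(task for task, n in streak.items() if n >= threshold)
```

The streak counter resets on any non-FAIL result, so a task that failed three times and then passed is not reported as stuck.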
When to --restart-team
Only when the team is provably stuck and --repair-state --dry-run doesn’t reveal a recoverable cause. --restart-team preserves worker heads under refs/mainspring-preserve/... before resetting, so it’s not destructive — but it does reset team backend state. Use it as a last resort.
Auto-checkpoint discipline
Auto-checkpoint commits during fanout are operational; review and squash them before publication. Never push those checkpoints directly to a PR branch — they are short-lived recovery checkpoints, not release history.
Cost awareness without cost guardrails
Per ADR-04, no cost guardrail. The operator’s daily-digest Telegram message (P5-1) shows total spend; the operator decides when to pause. The combination of premium-only lanes + visible daily spend + manual stop is sufficient for a single-operator tool.
Health rituals
Two recurring checks. Each is small enough to actually do, and the consequences of skipping are large enough to make the discipline worthwhile.
Weekly (≤ 10 min)
- mainspring --metrics --days 7: check pass rate per pair, top stuck tasks.
- mainspring doctor: confirm dependencies + git state clean.
- wc -l .mainspring/logs/waves.jsonl: sanity check that the JSONL is growing.
- Review local git history before pushing public work.
- If the daily digest noted any disabled pairs, either re-enable manually with reason or accept and move on.
Monthly (≤ 30 min)
- Run the weekly ritual.
- Read the last 4 weeks of .mainspring/logs/notifier.log and look for retry-loop events; investigate any task that retry-looped 3+ times.
- Run make all from the repository root; it must be green.
- Audit .mainspring/state/disabled-pairs.json: pairs that have been auto-disabled for > 30 days should be either re-enabled or removed from the registry entirely.
- Skim the last 30 days of waves.jsonl for any failure_reason_class value that’s new; every value should map to a known taxonomy entry in P4-5.
Disaster recovery
Six failure modes that have happened or are likely to happen, with concrete recovery steps.
.mainspring/ corrupted (e.g. partial write of waves.jsonl)
Detect: mainspring --metrics errors with Invalid JSON at line N.
Recover:
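# Set the damaged ledger aside so the evidence is not overwritten.
mv .mainspring/logs/waves.jsonl .mainspring/logs/waves.jsonl.corrupt
# Quick pass: jq stops at the first parse error, so this keeps only the records before it.
jq -c '.' .mainspring/logs/waves.jsonl.corrupt > .mainspring/logs/waves.jsonl 2>/dev/null
# Or filter line by line, keeping every well-formed record and skipping damaged ones:
grep -v '^$' .mainspring/logs/waves.jsonl.corrupt | while IFS= read -r line; do
  printf '%s\n' "$line" | jq -e . >/dev/null 2>&1 && printf '%s\n' "$line"
done > .mainspring/logs/waves.jsonl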
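Where jq is not available, the same line-by-line filter can be sketched in Python. This is an illustrative equivalent, not a shipped Mainspring helper, and `keep_well_formed` is a hypothetical name. It also reports how many lines were dropped, which the shell loop does not.

```python
import json


def keep_well_formed(lines):
    """Return (good_lines, dropped_count) for an iterable of raw JSONL lines."""
    good, dropped = [], 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped, not counted as damage
        try:
            json.loads(line)
        except json.JSONDecodeError:
            dropped += 1  # partial write or otherwise malformed record
            continue
        good.append(line)
    return good, dropped
```

To use it, open the `.corrupt` copy, pass the file object to `keep_well_formed`, and write the returned lines back to waves.jsonl with a trailing newline each. One behavioral difference from the shell version: `jq -e` also rejects lines whose value is `null` or `false`, whereas this sketch keeps any syntactically valid JSON.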
Dead host mid-wave
Detect: lock file shows pid that no longer exists; mainspring status shows in-progress wave with timestamp > 30 min old.
Recover: run mainspring --repair-state --dry-run first. If the preview
only touches stale runtime bookkeeping, run mainspring --repair-state --force,
then resume normal flow. Because flock is fd-based, the OS already released the
lock when the host died.
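The two detection signals can be checked by hand, but a sketch makes the logic explicit. The helper names below are hypothetical, and how Mainspring itself reads the lock file is not specified here. The pid check uses the standard POSIX idiom of sending signal 0.

```python
import os


def pid_alive(pid: int) -> bool:
    """True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wave_is_old(last_update_epoch: float, now_epoch: float,
                max_age_s: int = 30 * 60) -> bool:
    """True if the in-progress wave was last updated more than max_age_s ago."""
    return (now_epoch - last_update_epoch) > max_age_s
```

Pids are recycled by the OS, so a live pid does not prove the original owner is still running. Treat the result as a hint that supports the `--repair-state --dry-run` preview, not as a substitute for it.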
Runaway loop (wave count climbing without progress)
Detect: --metrics shows ≥10 consecutive FAIL waves on the same task, or mainspring hud shows STUCK / repeated RETRY with the same task. Telegram should have already surfaced one actionable retry_loop event.
Recover: use the retry-loop notification’s Open: command to inspect the local HUD/log. If it is genuinely stuck, use the notification’s Stop: command or mainspring stop --force; mark the offending task blocked in Taskmaster with failure_reason_class=manual:runaway; investigate offline. Resume after the block clears.
Stale worktrees / zombie tmux panes
Detect: mainspring doctor warns; git worktree list shows worktrees pointing to non-existent paths.
Recover: start with mainspring --repair-state --dry-run. If git itself
reports stale worktrees, run git worktree prune. For team backend state, use
mainspring --last-run --restart-team only after checking the preserved worker
heads that Mainspring reports. Avoid broad tmux cleanup; kill only a specific
session after manually confirming it is unrelated to active work.
Lock without owner (rare; only happens if flock is unavailable on the platform)
Detect: mainspring exits immediately with “already running” but no process matches the recorded pid.
Recover: run mainspring --repair-state --dry-run, then
mainspring --repair-state --force only if the preview identifies the lock as
stale. Mainspring will re-acquire on next launch. If this happens repeatedly,
mainspring doctor should be flagging missing flock support; install
util-linux or the platform equivalent.
Telegram daemon stuck
Detect: .mainspring/logs/notifier.log has not appended in > 1 hour despite waves continuing.
Recover: run mainspring notify-health --format json. If it reports next_step=restart-notifier-daemon, run mainspring notify-restart, then mainspring notify-test to confirm Telegram delivery. notify-restart only stops the PID recorded for this runtime’s notifier after validating the process command and current ledger path; do not use broad process-name kills because they can kill another project’s notifier.
Versioning and migration
SemVer policy
Mainspring follows strict SemVer from v1.0.0 onward:
- MAJOR: removing/renaming a CLI flag, removing a JSONL required field, removing an env var, breaking the EngineAdapter Protocol.
- MINOR: adding a CLI flag, adding a JSONL field, adding an engine adapter, adding an env var.
- PATCH: bug fixes, doc updates, internal refactors with no public surface change.
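As a worked example of the policy, the sketch below maps change kinds to the required bump and applies the largest one. The change labels are hypothetical names invented for illustration; only the MAJOR/MINOR/PATCH assignments come from the bullets above.

```python
# Hypothetical change labels; the mapping mirrors the policy bullets.
BUMP_FOR_CHANGE = {
    "remove_or_rename_cli_flag": "major",
    "remove_required_jsonl_field": "major",
    "remove_env_var": "major",
    "break_engine_adapter_protocol": "major",
    "add_cli_flag": "minor",
    "add_jsonl_field": "minor",
    "add_engine_adapter": "minor",
    "add_env_var": "minor",
    "bug_fix": "patch",
    "doc_update": "patch",
    "internal_refactor": "patch",
}
_RANK = {"patch": 0, "minor": 1, "major": 2}


def next_version(current: str, changes) -> str:
    """Apply the largest bump required by any change in the release.

    Raises ValueError for an empty change list and KeyError for an
    unknown label, so an unclassified change cannot slip through.
    """
    major, minor, patch = (int(part) for part in current.split("."))
    bump = max((BUMP_FOR_CHANGE[c] for c in changes), key=_RANK.__getitem__)
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

A release that adds a CLI flag and fixes a bug is a MINOR bump, since the largest required bump wins.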
Schema versioning (JSONL)
schema_version=1 for v1.0.0–v1.x. To bump to schema_version=2:
- Add the new shape to wave_log.py append. Emit both shapes for 90 days (overlap window).
- Update metrics.py to read both versions.
- Document the new shape in docs/metrics.md with a compatibility note.
- After 90 days, drop emission of the old shape; readers continue to support it for one more major version.
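The overlap window and the dual-version reader can be sketched as follows. This assumes nothing about what a schema_version=2 record would contain: the reader takes a caller-supplied mapping of version to normalizer, and the function names are hypothetical, not taken from wave_log.py or metrics.py.

```python
import json
from datetime import date, timedelta

OVERLAP_WINDOW = timedelta(days=90)


def shapes_to_emit(today: date, v2_start: date) -> list:
    """Schema versions the writer emits on a given day."""
    if today < v2_start:
        return [1]
    if today < v2_start + OVERLAP_WINDOW:
        return [1, 2]  # overlap window: both shapes are written
    return [2]


def read_wave(line: str, readers: dict) -> dict:
    """Parse one ledger line with the reader registered for its schema_version."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("ledger line is not a JSON object")
    version = record.get("schema_version")
    if version not in readers:
        # Fail closed on unknown versions; never guess at a shape.
        raise ValueError(f"unsupported schema_version: {version!r}")
    return readers[version](record)
```

Keeping the version-1 reader registered after emission stops is what lets readers "continue to support it for one more major version".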
Env var deprecation
Renaming an env var into the MAINSPRING_* namespace in P7:
- Read both for one minor version.
- If the old one is set and the new one isn’t, log a stderr deprecation notice once per process.
- After one minor version, drop reading the old one.
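The read-both-then-warn-once rule can be sketched as a small lookup helper. The function name and the variable names used to exercise it (`MAINSPRING_EXAMPLE`, `LEGACY_EXAMPLE`) are hypothetical; no real Mainspring variable is implied.

```python
import os
import sys

_already_warned = set()  # old names that have produced a notice in this process


def _warn_stderr(message: str) -> None:
    print(message, file=sys.stderr)


def read_renamed_env(new_name, old_name, environ=None, warn=_warn_stderr):
    """Prefer new_name; fall back to old_name with a one-time deprecation notice."""
    env = os.environ if environ is None else environ
    if new_name in env:
        return env[new_name]  # new name wins, even if the old one is also set
    if old_name in env:
        if old_name not in _already_warned:
            _already_warned.add(old_name)
            warn(f"deprecated: {old_name} is set; use {new_name} instead")
        return env[old_name]
    return None
```

Passing `environ` and `warn` explicitly keeps the helper testable without touching the real process environment. Dropping the old name after one minor version then amounts to deleting the fallback branch.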
Runtime state note
Public v1 writes .mainspring/ directly. Pre-v1 private runtime trees are outside the public operator path.
v1.0 GitHub release checklist
Mainspring’s source-install product gate is the local gate below. Public
publication is ordinary GitHub repository work, not a hidden CLI workflow and not
a Mainspring subcommand. There is no public release subcommand for this on
purpose: after the verified source tree is ready, publish the reviewed final
release commit, make the repository public, sign the tag, and create the GitHub
Release. The public main branch, hosted CI, and hosted docs are already live;
Homebrew, benchmarks, and provider-matrix evidence are follow-up credibility
steps.
Before the maintainer publishes or republishes a release commit:
make release-check
make release-check is intentionally boring: it runs make all, package
smoke, Python coverage, Product Requirements Document (PRD) validation, and
git diff --check. It performs no GitHub mutations, no tag creation, no
provider calls, and no hidden release-state updates.
Then the owner publishes from a reviewed release commit:
- Confirm the repository history intended for public release is clean. Public source history must not contain local paths, private project names, memory artifacts, tokens, screenshots, or runtime ledgers.
- Push the final release ref with a GitHub credential that has workflow scope when workflow files are present. Done for the clean public main commit.
- Confirm the GitHub repository description and topics match the public positioning, then set repository visibility to public. Done on 2026-06-15.
- Create a signed v1.0.0 tag on the exact release commit.
- Publish the GitHub Release using CHANGELOG.md as the release-note source. The draft release is already retargeted to the clean commit.
Package-manager distribution, benchmark numbers, and live provider matrix
evidence are follow-up credibility work, not requirements for the first
source-install release. Hosted docs are already published at
https://dlogvinenko.github.io/mainspring/.
Correctness
- shellcheck -S warning clean on all *.sh (zero errors, zero warnings; no exceptions) (make all: shellcheck OK)
- bash -n clean on all *.sh (make all: bash -n OK)
- ruff check + ruff format --check clean on all *.py (make all: ruff check OK, 91 files already formatted)
- Built-in writer/reviewer pair modes resolve through the public CLI and self-test surfaces (Bats pair parsing, engine command construction, --self-test, and --self-test-all coverage; real provider-matrix evidence is follow-up credibility work)
- Review gate demonstrably receives writer output (no SC2259 regression) (P1-4 real wave evidence + make all stream/review regression tests)
Architecture
- Main entrypoint ≤ 500 LOC (391 LOC after CLI parsing, runtime dispatch extraction, public env cleanup, and source archive root detection; verified by wc -l mainspring.sh)
- Zero python3 - <<'PY' heredocs (all extracted in P2)
- No python3 -c invocations longer than 80 chars (all consolidated in P2-2)
- No duplicated writer/reviewer engine paths (unified run_ai_turn in lib/engines.sh)
- All bash modules ≤ 600 LOC (largest: lib/help.sh at 598 LOC; tied with lib/team.sh; verified by wc -l lib/*.sh)
Tests
- ≥ 25 bats tests passing in < 30s (223 Bats pass in make all on 2026-06-14)
- ≥ 35 pytest tests passing in < 30s (pytest suite green, 1 skipped, in make all on 2026-06-14)
- ≥ 80% line coverage on Python modules (make coverage: 90.6%, gate pass on 2026-06-16)
- CI matrix green on Linux + macOS (local .github/workflows/ci.yml runs make all on ubuntu-latest and macos-latest; v1 release evidence is the hosted GitHub Actions run on the clean public main commit)
Observability
- Every wave emits one valid waves.jsonl line (flock-guarded append --ledger)
- --metrics answers all 7 standard questions (130-test metrics module)
- --metrics --routing answers pair-effectiveness questions (verified by py/tests/test_metrics.py and JSON CLI smoke)
Safety
- No hardcoded user paths (verified by py/tests/test_no_hardcoded_paths.py)
  - 2026-05-03 portability hardening: the guard now scans the current standalone release tree (mainspring.sh, lib/, py/, tests/, docs/, method/, packaging metadata, and generated docs assets) instead of stale pre-extraction roots, and docs/assets/hud-demo.cast no longer embeds an absolute local user-home PRD path.
- flock-based concurrency (verified by py/tests/test_fd_lock.py + tests/bats/test_lock.bats)
- Write-scope whitelist enforced; .env* / node_modules / .git always rejected (verified by tests/bats/test_write_scope.bats)
- Destructive ops require explicit --restart-team / --repair-state --force (automatic team preflight now refuses worktree/state cleanup without --restart-team; stale process cleanup and state repair require --force; verified by tests/bats/test_team.bats, py/tests/test_runtime_state.py, shellcheck, and bash -n)
UX
- doctor covers every external dependency (./mainspring.sh doctor reports command deps, fd-lock fallback, Python rich/pytest, engine registry module/env requirements, notifier, Taskmaster, logs, and current WARN gaps)
  - 2026-05-03 dependency hygiene: doctor treats node_modules as relevant only when package.json exists, and provider-engine module probes use the active virtualenv, repo .venv, or MAINSPRING_PY before falling back to python3. Credential setup rows no longer masquerade as missing Python modules when litellm is installed.
- --dry-run makes zero API calls (standalone mode prints resolved settings + commands, exits with no API calls)
- Presets cover the 3 common flag combos (nightly-max, conservative-docs, fast-smoke in presets/)
- Telegram notifications work on all 7 event classes (wave_failed, retry_loop, loop_stopped, quota_warn, team_stuck, milestone, daily_digest; actionable event payloads include project, folder, Taskmaster tag, task, pair, result, reason, next action, and duration context when available)
- mainspring hud renders cleanly on 80x24 and 200x60 terminals (verified by py/tests/test_hud.py + hud-rich-smoke)
Portability
- Fresh box: git clone && make all green without editing files
  - 2026-06-14 status refresh: the current worktree-local make all is green (pytest suite green, 1 skipped; 223 Bats; HUD/docs-site smoke and dependency audit OK). make package-smoke, make coverage, PRD validation, and git diff --check are also green on the current source line. Before tagging, rerun the same gates on the exact commit being released.
  - 2026-05-18 status refresh: the current worktree-local make all was green at the time; the current v1.0 snapshot above is the release-facing evidence.
  - 2026-05-04 clean-clone smoke: make fresh-clone-smoke ran from a clean published checkout, cloned the source into a disposable checkout, and ran make all.
  - 2026-05-03 local gate status: make fresh-clone-smoke now refuses dirty worktrees, clones clean HEAD into a temporary directory, and runs make all from the clone so local uncommitted state cannot masquerade as fresh-box evidence. Verified by python3 -m pytest py/tests/test_fresh_clone_smoke.py -q plus a direct fail-closed run against a dirty worktree.
  - 2026-05-03 standalone repo hygiene: the root Makefile now treats the checkout root as ROOT, so make all and HUD smoke targets do not read runtime state from outside a fresh clone. Regression coverage lives in py/tests/test_fresh_clone_smoke.py.
  - 2026-05-03 local agent worktree hygiene: shell-lint discovery now prunes .claude/, and team visibility defaults exclude .claude/, so local Claude/Codex helper checkouts cannot make make all lint unrelated worktree files or trigger team fanout warnings. Verified by the fresh-clone smoke regression, focused doctor/Taskmaster regressions, ./mainspring.sh doctor, and the local gate at the time.
  - 2026-05-03 coverage-gate hygiene: py/fresh_clone_smoke.py now captures and replays child make output through Python streams so the stdlib trace coverage harness sees the same stdout/stderr shape as a real subprocess. Verified by python3 -m pytest py/tests/test_fresh_clone_smoke.py -q, make coverage, and make all.
  - 2026-05-03 evidence-artifact hardening: successful make fresh-clone-smoke runs now write .mainspring/state/fresh-clone-smoke-evidence.json with the clean source HEAD, source repo, target, command, and result so the eventual release evidence is auditable without trusting scrollback. The artifact is written only after the cloned make all passes; dirty-worktree and failed-clone paths still fail closed without producing an evidence file.
- Local wheel + isolated pipx smoke passes from source checkout (make package-smoke: sdist+wheel build, bootstrap tests green with 1 skipped, and 23 Homebrew formula tests passed; make pipx-smoke: 1 passed)
- Homebrew formula metadata is generated from explicit release inputs (py/tests/test_homebrew_formula.py: validates URL/sha/version guards, pyproject-derived python@X.Y, Homebrew desc length, dependency shape, and Ruby syntax when ruby is installed)
- mainspring doctor runs on a box with only bash, python3, git installed (minimal-PATH Bats fixture verifies missing task-master / team backend command stays WARN-only even with active team state; doctor_active_team_report no longer leaks command-not-found output)
- Provider adapters fail closed on missing modules, credentials, malformed responses, or unknown engine names (real non-author provider evidence is follow-up market evidence after the source release)
Documentation
- README.md answers how to install Mainspring, how users verify PATH visibility, why Mainspring exists, when to use it, how to start with mainspring, what HUD/Telegram/Product Requirements Document (PRD)-first execution provide, and where to go next
- docs/prd.md is the canonical plan (this file; see opening status and Mission source-of-truth language)
- guide.md is the operator reference
- metrics.md documents the JSONL schema in full
- architecture.md documents EngineAdapter Protocol + extension points
- CONTRIBUTING.md walks through “add an engine adapter” as the canonical contribution
- GitHub Pages docs source is generated from canonical docs and covered by a local smoke gate (Pages is enabled and deployed at https://dlogvinenko.github.io/mainspring/; hosted Docs Site run 27561122958 passed on the clean public commit)
Legal
- LICENSE = Apache-2.0 boilerplate, unmodified
- NOTICE lists third-party deps with their licenses
- SPDX header on every source file (verified by py/tests/test_source_license_headers.py)
- Copyright = “Mainspring contributors”, no individual
Release
- GitHub repository description and discovery topics are set
  - 2026-06-13 owner metadata evidence: the repository description is “Product Requirements Document (PRD)-first AI coding agent orchestration for production-grade software delivery.” Discovery topics include ai-coding-agent, coding-agent, agent-orchestration, prd, taskmaster, llm-agents, codex, claude, ollama, litellm, and developer-tools. Repository visibility is public as of 2026-06-15.
- Signed annotated v1.0.0 tag exists on the final clean release commit
  - 2026-06-15 owner action: import or unlock the maintainer GPG key, sign the clean public main commit, push the tag, then publish the prepared draft GitHub Release.
- GitHub release notes pulled from CHANGELOG.md
  - 2026-06-15 release-note source status: CHANGELOG.md contains the v1.0.0 release narrative, and the draft GitHub Release is retargeted to the clean commit with those notes.
- GitHub repository page shows public Apache-2.0 status
  - 2026-06-15 evidence: dlogvinenko/mainspring is public, defaults to main, and exposes the Apache-2.0 source tree at the clean release commit.
Explicit non-goals
These are things Mainspring will never become. Reject any change that moves toward them.
- Not a multi-machine system. Single host, single operator, single tmux. If you need fleet management, you need a different tool.
- Not an OpenTelemetry citizen. No distributed tracing. Single-host JSONL is the observability surface. Future export to Prometheus / OTel is a non-goal because Mainspring is not part of an SRE fleet.
- Not Sentry-instrumented. No exception aggregation service. Telegram notifications + notifier.log are the alerting surface.
- Not a cost-governed tool. No automatic spend caps, no budget enforcement. Per ADR-04 the operator runs premium models and watches the daily digest. Cost discipline is human, not automated.
- Not a read-scope sandbox. Mainspring trusts its own writer/reviewer agents to not leak secrets in summary logs. If that trust ever breaks, the response is a secret-scan post-wave hook, not a chroot/namespace sandbox.
- Not a cross-machine resume tool. If a host dies mid-wave, the wave is lost. The next host start re-picks the task from Taskmaster. No state replication.
- Not a Linear / Jira / GitHub Issues integration. Mainspring reads Taskmaster only. Wrapping a different backlog source is a fork, not a feature request.
- Not a UI for managing waves. HUD is read-only. No “kill wave” buttons, no “retry” buttons, no drag-and-drop. Operations are CLI-driven.
- Not a public-server-exposed dashboard. HUD is localhost only, no port, no auth. Anyone wanting remote access uses ssh.
- Not a parallel writer-multiplexer. One writer per wave. No fan-out across writers within a single wave. (Team mode runs separate waves in parallel; that’s not the same as one wave with N writers.)
- Not a per-action approval tool. Cline-style approve-every-command UX breaks the autonomous-loop ethos. Operators choose scope and gates up front; the reviewer and tests stop unsafe outcomes.
- Not a demo-video generator. Walkthrough videos are valuable in some agent platforms, but Mainspring’s evidence surface is JSONL, review verdicts, tests, logs, and replay.
- Not a second memory layer. Taskmaster, PRDs, ledgers, and git history own durable state. Adding Mem0/Hermes-style session memory duplicates those contracts.
- Not a community-growth product. Discord or similar community operations are outside the personal-tool-to-OSS scope until real adoption creates a maintainer need.
- Not a backwards-compatibility museum. v0.x → v1.0 wipes history. From v1.0 onwards, deprecation windows are 90 days for schema changes, 1 minor version for env vars. After that, gone.
Backlog (Must / Should / Could / Won’t)
Ranked by value-per-effort. Must items block the source-install v1.0 code release; Should items ship in v1.x; Could are later candidates; Won’t are explicit dead ends.
Must (blocks source-install v1.0.0 code release)
- P1-1 SC2259 fix (heredoc → real .py): the single highest-impact bug fix.
- P1-2 Remove hardcoded $HOME path.
- P2-1, P2-3, P2-4 Heredoc extraction + run_ai_turn merge + bash modularization.
- P2-2 python3 -c consolidation into team_status.py.
- P3-1 / P3-2 / P3-3 / P3-5 Tests + JSONL + Makefile.
- P3-4 --metrics command at the standard-questions level.
- P4-1 Structured review JSON (kills regex parsing in critical path).
- P4-2 --dry-run mode.
- P4-3 Presets.
- P5-1 Telegram notifications.
- P-Comp-1 LiteLLM multi-provider registry and fail-closed provider routing.
- P7-1 → P7-6 Repo extraction, public README, and source-install release hygiene.
Should (v1.x post-release)
- P4-5 Failure reason taxonomy.
- P4-6 / P4-7 / P4-8 Worktree visibility routing + bootstrap auto-close + dispatch ledger.
- P5-2 HUD (rich-based TUI).
- P6 Metrics-driven routing + auto-disable + daily digest.
Could (v2 candidates, only if data justifies)
- Web HUD (separate from TUI; only if multiple users ask).
- Additional engines: Ollama (local), OpenRouter (multi-provider), Grok.
- Property/fuzz testing on JSONL emitter.
- Golden-file testing for review-prompt drift.
Won’t (explicit non-goals — do not propose)
- See Explicit non-goals.
- Linear/Jira/GH Issues backlog adapter.
- Cost guardrails / hard spend caps.
- OpenTelemetry / Sentry integration.
- Read-scope sandbox.
- Multi-machine state sync.
- Cross-machine resume.
- Public web exposure.
- “Fast” model lanes as defaults.
Phase P-Comp — Post-competitor-analysis amendments (2026-04-27)
Goal: apply the recommendations from Appendix C — Competitor landscape and docs/competitive-analysis.md. Ratified 2026-04-27; 8 strong-recommend items + 5 considered items + 4 explicit skips.
Strategic reframe (item 0 — applies to all subsequent work): Mainspring is positioned as Product Requirements Document (PRD)-first AI coding agent orchestration for production-grade software delivery, NOT as “yet another autonomous coding agent”. The Method remains the durable asset, but the public entry point must be understandable to people who do not already know the Method: one command to start, clear explanation of PRD vs. vibe coding, visible HUD, Telegram operations, hard reviewer gate, and evidence ledger. Direct competitors (Composio AO, OpenAI Symphony, Taskmaster autopilot) own the orchestrator-only niche; Mainspring’s differentiator is making doctrine executable and auditable across any project.
Strongly recommend (market evidence after the source-install release)
- P-Comp-1 LiteLLM adoption (replaces P5-3 bespoke adapters). Drop the hand-rolled provider branching in favor of the shared engine registry and LiteLLM provider adapter. Get Gemini, Ollama, OpenRouter, Mistral, Groq, Azure, OpenAI, and Anthropic via one abstraction. `py/engines/_base.py` and `py/engines/litellm_adapter.py` own the provider boundary; P5-3 task body rewrites point at LiteLLM instead of a bespoke shell plugin protocol.
  - 2026-05-03 code-side status: registry-routed LiteLLM and provider adapters exist for Gemini, Grok, OpenAI, Anthropic, Azure, OpenRouter, Mistral, and Ollama. `litellm_runner.py` writes usage/cost sidecars and now fails closed on malformed response shape before treating provider output as writer/reviewer text. Remaining market follow-up: install `litellm` in the runtime environment and run a real provider wave with explicit credentials. Missing dependencies or API keys affect that provider run only, not the source-install release.
  - 2026-05-03 dry-run remediation status: provider readiness preflight now aggregates missing setup and prints exact operator guidance such as `python3 -m pip install -r requirements.txt` and setting `GOOGLE_API_KEY` to the provider credential during `mainspring night --pair gemini+claude --dry-run`. This does not run the provider; it makes setup actionable before a real wave.
  - 2026-05-03 doctor remediation status: the shared engine inventory now emits `setup:` lines for missing LiteLLM modules and provider env vars, so `mainspring doctor` shows the same exact remediation without running a provider. Verified by focused engine-registry tests, doctor Bats, PRD validation, and the local gate at the time. This improves readiness visibility; it does not replace a real provider run.
  - 2026-05-03 interpreter consistency hardening: provider dry-run readiness and runtime LiteLLM command construction now use the same Mainspring Python resolver as doctor (`VIRTUAL_ENV`, repo `.venv`, `MAINSPRING_PY`, then `python3`). This prevents false `litellm` missing-module reports when dependencies are installed in the repo virtualenv, while still failing closed on missing provider credentials.
  - 2026-05-03 runtime remediation consistency: the live LiteLLM runner now uses the same missing-module remediation as doctor and dry-run (`python3 -m pip install -r requirements.txt`) before exiting closed. This still does not close the live provider evidence gap; it prevents the runtime failure path from giving weaker setup guidance than the preflight paths.
  - 2026-05-03 live credential preflight hardening: the live LiteLLM runner now checks known provider-model env vars before calling the provider, so a Gemini run without `GOOGLE_API_KEY` exits closed with the same actionable credential guidance used by doctor and dry-run. The shared LiteLLM provider mapping now feeds registry validation and runtime checks. Verified by `python3 -m pytest py/tests/test_litellm_runner.py py/tests/test_engine_registry.py -q`, `ruff check py/litellm_runner.py py/engines/litellm_adapter.py py/engines/registry.py py/tests/test_litellm_runner.py py/tests/test_engine_registry.py`, and `ruff format --check py/litellm_runner.py py/engines/litellm_adapter.py py/engines/registry.py py/tests/test_litellm_runner.py py/tests/test_engine_registry.py`. This keeps provider runs fail-closed; credentials and a real docs-only provider wave remain market-evidence work.
- P-Comp-2 `mainspring replay <wave-id>`. ✅ DONE 2026-05-03. Implemented `py/replay.py` plus top-level `mainspring replay <show|diff|build|run>` dispatch. Replay reads wave rows from `waves.jsonl`, resolves canonical or legacy wave ids, reconstructs the CLI command, supports `--engine`, `--reviewer`, `--model`, `--review-model`, and `--save-as`, and records replay provenance through `wave_log.py` (`replayed_from`, `replay_overrides`, optional wave-id override). Deterministic prompt-backed replays fail closed on missing prompt snapshots, prompt hash drift, git HEAD drift, or dirty-tree drift unless the operator explicitly allows worktree drift; older rows require explicit `--allow-live-reconstruction`. Golden-run replay evidence preserves `chapter_delta`, `competitor_delta`, `launch_delta`, `product_score`, verdict, success state, and exit code; reviewer swaps surface drift in `replay diff`; the committed golden source row includes a prompt snapshot whose dry-run replay validates with `Real-run validation: OK` without launching a provider. Verified by `python3 -m pytest py/tests/test_replay.py py/tests/test_golden_run.py py/tests/test_wave_log.py -q` (123 passed), `bats tests/bats/test_wizard.bats tests/bats/test_log.bats` (42 passed), `python3 py/golden_run.py check-all tests/golden-runs` (OK), `python3 py/replay.py run golden-002-slice-impl tests/golden-runs/mainspring-prd-to-pr/waves.jsonl --dry-run --save-as golden-002-replay-smoke` (validation OK), `python3 py/replay.py diff golden-002-slice-impl golden-003-replay-evidence tests/golden-runs/mainspring-prd-to-pr/waves.jsonl`, and `ruff check` / `ruff format --check` on replay-related files.
- P-Comp-3 SWE-bench-Verified score. Run Mainspring on the SWE-bench-Verified benchmark, then publish the `%-solved` number in public docs only after a real result exists. The source tree keeps the validated runner script `py/bench/swe_bench.py`, but v1 no longer ships a public placeholder result page.
  - 2026-06-13 public-release cleanup: the empty benchmark result page was removed from the public docs site because benchmark collateral should appear only after real evidence exists. `py/bench/swe_bench.py` remains source-only benchmark tooling with focused tests and is not part of the installed `mainspring` runtime payload. Remaining benchmark work: generate real Mainspring predictions, run SWE-bench Verified in an explicit benchmark environment, then publish the actual `% resolved` number.
- P-Comp-4 pipx + Homebrew distribution (extends P7). Post-source distribution work: publish package-manager paths so users can install without cloning the repository. Ship `pyproject.toml` + entry point so `pipx install mainspring` works. Publish a Homebrew tap (`brew install <tap>/mainspring`). Bash entry script becomes a thin shim that the Python entry point invokes. The source-install release remains valid before these channels exist.
  - 2026-05-03 local packaging status: `pyproject.toml` declares the `mainspring` console entry point, runtime dependencies, and packaged data files for `mainspring.sh`, `lib/`, `py/`, `presets/`, `schema/`, and `method/`. `Makefile` now exposes `package-smoke` and `pipx-smoke` verification. Verified by package and pipx smoke tests. Remaining distribution work: publish package-manager metadata and capture an external fresh-box install.
  - 2026-05-03 Homebrew formula metadata status: `py/homebrew_formula.py` now generates `Formula/mainspring.rb` from explicit release version, tarball URL, homepage, sha256 inputs, and the `pyproject.toml` `requires-python` floor so formula `python@X.Y` cannot drift silently from package metadata. The formula keeps only runtime dependencies (`bash`, `git`, Python) and no longer declares unused `:test` dependencies; tests lock the Homebrew desc length, dependency shape, pyproject-derived Python version, and Ruby syntax. `packaging/homebrew/README.md` documents the tap publish sequence, including the required `brew update-python-resources mainspring` step for vendoring `rich`/`litellm` resources. Verified by focused formula tests, generated-formula Ruby syntax, ruff checks on formula files, and package smoke. Remaining distribution work: publish the tap and capture external `brew install <tap>/mainspring` output.
  - 2026-05-04 global editable CLI hardening, refreshed 2026-06-13: `Makefile` now exposes `install-user`/`dev-install`, runs `pipx ensurepath`, removes any old Mainspring pipx environment, and installs the current checkout as the user-level `mainspring` command. The Python console bootstrap marks pipx invocations with `MAINSPRING_CONSOLE_ENTRYPOINT=1`, so editable installs target the caller’s project directory while direct `./mainspring.sh` source-checkout runs still require `--project` to control another repo. README, guide, and `--help` now document install-once/run-anywhere usage and the source-checkout `--project` fallback. Current v1.0 gates include global install smoke and bootstrap coverage.
- P-Comp-5 Daily cost digest in Telegram (extends P5-1). ✅ DONE 2026-05-03. P5-1’s daily 09:00 digest now uses the previous local calendar day and includes total spend, top-3 most expensive waves, tokens per pair, role-token breakdowns, cost per positive `chapter_delta`, cost per explicit or inferred `product_score` movement, uncosted movement callouts, quota status, and disabled pairs. Sources are ledger cost fields first, then a local explicit price table only when token counts and model ids are known; unknown models stay uncosted instead of fabricating spend. Verified by `test_comp5_daily_cost_digest_acceptance` plus the focused notifier suite (140 passed).
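The cost-source order described for P-Comp-5 can be sketched as follows. This is a minimal illustration, not the notifier's actual code: the row field names `cost_usd`, `tokens`, and `model`, the price table contents, and both function names are assumptions; only `chapter_delta` comes from the ledger description above.

```python
from typing import Optional

# Hypothetical price table: USD per 1M tokens, keyed by model id.
PRICE_PER_MILLION_USD = {"example-model": 2.00}


def wave_cost(row: dict) -> Optional[float]:
    """Return the wave's cost in USD, or None when it must stay uncosted."""
    # 1. A cost recorded in the ledger row always wins.
    if isinstance(row.get("cost_usd"), (int, float)):
        return float(row["cost_usd"])
    # 2. Otherwise use the price table, but only when both the token
    #    count and a known model id are present.
    tokens = row.get("tokens")
    price = PRICE_PER_MILLION_USD.get(row.get("model"))
    if isinstance(tokens, int) and price is not None:
        return tokens / 1_000_000 * price
    # 3. Unknown model or missing tokens: stay uncosted, never guess.
    return None


def cost_per_positive_chapter_delta(rows: list[dict]) -> Optional[float]:
    """Known spend divided by total positive chapter_delta movement."""
    spend = sum(c for c in map(wave_cost, rows) if c is not None)
    movement = sum(r["chapter_delta"] for r in rows if r.get("chapter_delta", 0) > 0)
    return spend / movement if movement else None
```

Returning `None` rather than `0.0` for an uncosted wave keeps "no known price" distinguishable from "free", which is what lets a digest call out uncosted movement separately.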
Recommend (v1.x growth work after source release)
- P-Comp-6 YAML config + JSON Schema (`.mainspring.yaml`). ✅ DONE 2026-05-03. Project-local config loads after presets and before execution, CLI flags win on conflict, lowercase aliases are accepted, and schema validation fails closed via `schema/config.schema.json`. The schema is included in the packaged data files. Verified by `py/tests/test_last_run.py`, `py/tests/test_mainspring_bootstrap.py`, and `tests/bats/test_wizard.bats`.
- P-Comp-7 GitHub Pages docs-site source and workflow. ✅ DONE 2026-05-04, refreshed 2026-06-15. `py/docs_site.py` generates a Jekyll source tree from the committed canonical docs (`docs/prd.md`, `docs/method.md`, `docs/playbook.md`, `docs/guide.md`, `docs/competitive-analysis.md`, plus architecture and metrics pages) without duplicating a second planning source or publishing empty benchmark result pages. `.github/workflows/pages.yml` builds that generated source with `actions/jekyll-build-pages`, uploads the Pages artifact, and deploys only on `main` pushes after the repo owner enables GitHub Pages and sets `MAINSPRING_ENABLE_PAGES_DEPLOY=1`. Workflow permissions are least-privilege: build and pull-request runs keep read-only contents permissions, while `pages: write` and `id-token: write` exist only on the gated deploy job. `make docs-site-smoke` validates the generator output locally. Hosted Pages is live at https://dlogvinenko.github.io/mainspring/ and returns HTTP 200 with SEO, Open Graph, and Twitter metadata. Verified by `python3 -m pytest py/tests/test_docs_site.py py/tests/test_ci_workflow.py py/tests/test_prd_validate.py -q`, `make docs-site-smoke`, hosted Docs Site run `27561122958`, and an HTTP smoke check against the published URL.
- P-Comp-8 Golden-run regression fixture. ✅ DONE 2026-05-03. Created `tests/golden-runs/mainspring-prd-to-pr/` with a 3-wave PRD-to-PR ledger and deterministic `expected.txt`; `golden_run.py check-all` now has a committed scenario to diff in pytest. This gives replay/ledger behavior an end-to-end fixture without launching model CLIs.
v1.x roadmap ownership: P-Comp-6, P-Comp-7 local implementation and hosted publication, and P-Comp-8 are complete. Future hosted-docs work is limited to optional custom-domain polish; the default GitHub Pages site is already published.
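The config precedence stated for P-Comp-6 (presets, then project-local config, then CLI flags) can be illustrated with a small layered merge. This is a sketch under assumptions: the key names, the choice of upper case as the canonical spelling, and the `ALLOWED_KEYS` set standing in for `schema/config.schema.json` are all hypothetical.

```python
# Hypothetical allow-list standing in for real JSON Schema validation.
ALLOWED_KEYS = {"ENGINE", "REVIEWER", "MODE"}


def resolve_config(preset: dict, project: dict, cli_flags: dict) -> dict:
    """Merge layers so that later layers override earlier ones."""
    merged: dict = {}
    for layer in (preset, project, cli_flags):
        for key, value in layer.items():
            # Lowercase aliases are folded onto one canonical key
            # (assumed here to be the upper-case spelling).
            canonical = key.upper()
            if canonical not in ALLOWED_KEYS:
                # Fail closed: an unknown key stops the run instead of
                # being silently ignored.
                raise ValueError(f"unknown config key: {key}")
            if value is not None:  # an unset CLI flag does not override
                merged[canonical] = value
    return merged
```

For example, `resolve_config({"MODE": "code"}, {"mode": "debug"}, {"MODE": None})` keeps the project-level `debug` because the CLI flag was not set.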
Considered (Could — v1.x or later, not blocking v1.0)
- P-Comp-9 `BacklogSource` plugin interface (Taskmaster + GH Issues + Linear). ✅ DONE 2026-05-03. `py/backlog_source.py` defines the Python protocol (`list_ready_tasks`, `get_details(id)`, `mark_done(id)`, `mark_blocked(id, reason)`) and ships the Taskmaster JSON adapter only for v1.0. The existing `taskmaster.py next/metadata/status` read paths now exercise the adapter, while status mutations call the `task-master` CLI and fail closed on command errors instead of pretending the backlog changed. Packaged runtime metadata includes the new module. Non-Taskmaster adapters remain opt-in roadmap work, not v1.0 behavior. Verified by `python3 -m pytest py/tests/test_backlog_source.py py/tests/test_taskmaster.py py/tests/test_mainspring_bootstrap.py::test_pyproject_declares_pipx_console_entry_and_runtime_payload -q`.
- P-Comp-10 `--auto-retry-ci <N>` opt-in retry loop. ✅ DONE 2026-05-03. Default remains `0` (current stop-on-fail behavior). When enabled, writer failures classified as `typecheck_fail`, `lint_fail`, or `test_fail` record `engine:<reason>` in the ledger, increment `retry_count`, keep the same Taskmaster item, and inject the captured failure-output tail plus optional `--auto-retry-ci-log` tail into the next writer prompt. The retry cap is enforced before each retry so the loop cannot run away. Verified by Bats coverage in `test_wave.bats` and CLI/dry-run coverage in `test_wizard.bats`; full `make all` includes both.
- P-Comp-11 Role-based agent modes (`--mode architect|code|debug|ask`). ✅ DONE 2026-05-03. `--mode` is parsed by the CLI, saved/loaded through presets and `.mainspring.yaml`, shown in dry-run output, appended to the writer prompt, and passed into the reviewer lens. `architect` and `ask` are advisory no-edit modes: review hard validation rejects user file edits and keeps Taskmaster items in `review` instead of closing them. `debug` adds root-cause and targeted-verification discipline while `code` remains the default implementation lane. Verified by focused Bats coverage in `test_log.bats`, `test_review.bats`, and `test_wizard.bats`; full `make all` includes these paths.
- P-Comp-12 Plugin entry-points via Python (replace bash `source`). When LiteLLM (P-Comp-1) lands and proves the Python expansion path, revisit ADR-07: move engine + backlog adapters to `pyproject.toml` `[project.entry-points]` mechanism. v2 conversation; not v1.x. Triggers ADR-07 re-evaluation per the “trigger #3 plugin system” note.
- P-Comp-13 Plausible-style anonymous opt-in telemetry. Track only: wave count, pass-rate, version, OS. Default OFF. Aider and OpenHands do this. Lower priority because Mainspring has private-first ethos; revisit only after v1.0 OSS release if adoption signal demands it.
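The capped retry behavior described for P-Comp-10 can be sketched like this. The real loop lives in the Bash/Python runtime; here `run_with_retries`, the `run_writer` callback shape, and the returned dict are hypothetical, while the three failure classes, the `engine:<reason>` format, and `retry_count` follow the text above.

```python
RETRYABLE = {"typecheck_fail", "lint_fail", "test_fail"}


def run_with_retries(run_writer, max_retries: int = 0, tail_lines: int = 20) -> dict:
    """Call run_writer(extra_prompt) -> (ok, reason, output) until done.

    max_retries=0 reproduces stop-on-fail behavior.
    """
    retry_count = 0
    extra_prompt = ""
    while True:
        ok, reason, output = run_writer(extra_prompt)
        if ok:
            return {"success": True, "retry_count": retry_count}
        # The cap is checked before each retry, so the loop cannot run away.
        if reason not in RETRYABLE or retry_count >= max_retries:
            return {
                "success": False,
                "reason": f"engine:{reason}",
                "retry_count": retry_count,
            }
        retry_count += 1
        # Feed only the tail of the failure output into the next prompt.
        extra_prompt = "\n".join(output.splitlines()[-tail_lines:])
```

Failures outside the three retryable classes stop immediately regardless of the cap, which keeps the loop from masking problems a retry cannot fix.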
Could-lane re-evaluation triggers: P-Comp-9’s core interface is complete; non-Taskmaster adapters become eligible when at least two alternate backlog sources are requested by real operators. P-Comp-12 becomes eligible only after P-Comp-1 proves Python provider dispatch in real waves. P-Comp-13 becomes eligible after public adoption creates a concrete maintainer question that anonymous counts would answer. P-Comp-9, P-Comp-10, and P-Comp-11 are already closed.
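For readers evaluating a future non-Taskmaster adapter, the P-Comp-9 interface might look roughly like the sketch below. Only the four method names come from this PRD; the parameter name `task_id`, the return types, and the `InMemoryBacklog` class are assumptions for illustration, not the contents of `py/backlog_source.py`.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class BacklogSource(Protocol):
    """Method names follow the PRD; signatures and types are assumed."""

    def list_ready_tasks(self) -> list[dict]: ...
    def get_details(self, task_id: str) -> dict: ...
    def mark_done(self, task_id: str) -> None: ...
    def mark_blocked(self, task_id: str, reason: str) -> None: ...


class InMemoryBacklog:
    """Hypothetical adapter used only to exercise the protocol."""

    def __init__(self, tasks: dict[str, dict]) -> None:
        self._tasks = tasks

    def list_ready_tasks(self) -> list[dict]:
        return [t for t in self._tasks.values() if t["status"] == "ready"]

    def get_details(self, task_id: str) -> dict:
        return self._tasks[task_id]

    def mark_done(self, task_id: str) -> None:
        # An unknown id raises KeyError instead of pretending the
        # backlog changed.
        self._tasks[task_id]["status"] = "done"

    def mark_blocked(self, task_id: str, reason: str) -> None:
        self._tasks[task_id].update(status="blocked", reason=reason)
```

Because the protocol is structural, an adapter does not need to inherit from `BacklogSource`; `isinstance(adapter, BacklogSource)` only checks that the four methods exist.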
Explicit skips (Won’t — confirmed non-goals)
- Per-action approval mode (Cline-style). Breaks autonomous-loop ethos. Cline + Aider already serve that niche.
- Walkthrough video artifact (Symphony-style). Useful later, but heavy ops cost. Not v1.
- Memory / context persistence between sessions (Mem0, Hermes). Taskmaster owns state. Adding our own memory layer = duplication.
- Discord community channel. Personal-tool ethos. Per current PRD anti-goals. Revisit only if OSS adoption growth demands it.
Acceptance for closing P-Comp
- All source-release Must items complete and locally verified; remaining provider/benchmark/package evidence is tracked as follow-up growth work after the source release.
- Should items (P-Comp-6 through P-Comp-8) on the v1.x roadmap with explicit owners + estimates.
- Could items documented in Backlog with re-evaluation triggers.
- Skips added to “Explicit non-goals” section above.
- Mission section in this PRD updated with the Method-first reframe quoted in item 0.
- README and docs-site entry explain Product Requirements Document (PRD)-first AI coding, vibe-coding tradeoffs, install path, one-command start, HUD, Telegram, and evidence-ledger value in plain language.
Appendix A — Source-of-knowledge recipes
This appendix points future maintainers and AI agents at the current source of truth. It is intentionally a map, not a second implementation plan.
CLI truth
- `lib/help.sh` is the public help contract. Any new command or flag needs help text plus Bats coverage before it is documented elsewhere.
- `lib/cli.sh` is the argument parser. Public command spelling lives there.
- `docs/guide.md` is the human-facing command reference. Its command tables group command families with complete runnable variants, not detached flags.
Runtime and logs
- `.mainspring/logs/waves.jsonl` is the wave ledger. Additive fields are OK; removal or rename needs a schema-version migration.
- `py/wave_log.py` owns ledger rows and failure context.
- `py/replay.py` is the source of truth for reconstructing recorded waves.
- `py/runtime_state.py` discovers live runtimes for HUD, status, and notifier recovery. It must not trust stale session cwd over a verified process cwd.
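The "additive fields are OK" rule for the wave ledger implies that readers should tolerate unknown keys while still rejecting rows that lack the fields they depend on. A minimal reader sketch, assuming hypothetical required field names (`wave_id`, `verdict`) and a hypothetical `read_waves` helper rather than the real `py/wave_log.py` API:

```python
import json

# Hypothetical minimum a consumer relies on; the real ledger schema may differ.
REQUIRED_FIELDS = {"wave_id", "verdict"}


def read_waves(lines) -> list[dict]:
    """Parse JSONL ledger lines, ignoring unknown (additive) fields."""
    rows = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: malformed JSON") from exc
        if not isinstance(row, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        missing = REQUIRED_FIELDS - set(row)
        if missing:
            # A removed or renamed field surfaces here instead of
            # silently producing wrong results downstream.
            raise ValueError(f"line {number}: missing fields {sorted(missing)}")
        rows.append(row)
    return rows
```

With this shape, a newer writer can add fields without breaking an older reader, while a rename is caught immediately, which is the case the schema-version migration is meant to cover.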
Operator visibility
- `py/hud.py` owns global/local HUD rendering, progress estimation, and clean interrupt behavior.
- `py/notify_telegram.py` owns Telegram event selection, deduplication, project/folder/tag context, and loop-stopped alerts.
- `lib/notify.sh` owns only daemon lifecycle and recorded-PID validation. Broad process-name kills are not an acceptable recovery path.
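Recorded-PID validation, as opposed to a broad process-name kill, can be sketched as below. The real logic lives in `lib/notify.sh`; this Python illustration is POSIX-only, and the `recorded_pid_if_alive` helper and PID-file layout are hypothetical. A production check would also compare the process command line to guard against PID reuse.

```python
import os
from pathlib import Path
from typing import Optional


def recorded_pid_if_alive(pid_file: Path) -> Optional[int]:
    """Return the recorded PID only if it names a live process we may signal."""
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None  # missing or garbage PID file: nothing to act on
    if pid <= 0:
        return None  # never signal process groups or "all processes"
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return None  # stale PID file
    except PermissionError:
        return None  # exists but is not ours: leave it alone
    return pid
```

Every failure path returns `None`, so a caller that acts only on a returned PID cannot end up signalling an unrelated process.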
Review and safety gates
- `lib/review.sh` builds the reviewer prompt and applies the hard gate.
- `py/parse_review.py` validates structured review output and keeps the required review fields machine-checkable.
- `lib/write_scope.sh` protects the operator’s checkout from forbidden path changes and generated-output noise.
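A fail-closed review gate of the kind described above treats anything it cannot parse or verify as a rejection. The sketch below assumes the reviewer emits JSON and uses hypothetical field names (`verdict`, `summary`) and a hypothetical `review_passes` helper; the actual contract is whatever `py/parse_review.py` enforces.

```python
import json

# Hypothetical required fields and their expected types.
REQUIRED = {"verdict": str, "summary": str}
PASSING_VERDICTS = {"pass"}


def review_passes(raw: str) -> bool:
    """Return True only for a well-formed review with a passing verdict."""
    try:
        review = json.loads(raw)
    except json.JSONDecodeError:
        return False  # unparseable output never passes the gate
    if not isinstance(review, dict):
        return False
    for field, kind in REQUIRED.items():
        if not isinstance(review.get(field), kind):
            return False  # a missing or mistyped field is a rejection
    return review["verdict"] in PASSING_VERDICTS
```

The gate has exactly one path to `True`, so a truncated or free-form reviewer reply can never be mistaken for approval.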
Package payload
- `pyproject.toml` declares the installed console script and runtime payload.
- `MANIFEST.in` declares source distribution collateral.
- `py/mainspring_bootstrap.py` launches the packaged runtime without inheriting a project virtualenv that could hide dependencies.
Verification map
- Shell syntax and lint: `bash -n mainspring.sh lib/*.sh` and `shellcheck -S warning mainspring.sh lib/*.sh`.
- Python lint and format: `ruff check py` and `ruff format --check py`.
- Unit/integration behavior: `python3 -m pytest py/tests -q` and `bash tests/bats/run.sh`.
- Public docs and payload checks: `make release-check` (which expands to `make all`, `make package-smoke`, coverage, PRD validation, and `git diff --check`).
Appendix B — Verification commands
Use these from the repository root when validating a release candidate. These commands are intentionally boring: they prove the source tree, package payload, PRD, and diff hygiene without calling live AI providers.
set -e
make release-check
./mainspring.sh doctor
./mainspring.sh --dry-run --once
Optional live engine smoke, only when credentials/quota are intentionally available:
./mainspring.sh --self-test
./mainspring.sh --self-test-all
Optional portability smoke, only when Docker is available:
docker run --rm -v "$(pwd):/m" -w /m alpine:3.19 sh -c 'apk add bash python3 git shellcheck && bash mainspring.sh doctor || true'
Appendix C — Competitor landscape / competitive positioning (June 2026 refresh)
The detailed current market analysis lives in `docs/competitive-analysis.md`. It supersedes the April 2026 snapshot that previously lived inline here.
Snapshot date: 2026-06-14. Product claims were checked against official docs and public repository surfaces. Exact popularity metrics are intentionally omitted because popularity signals drift quickly.
Current strategic finding
Mainspring should not compete as “another coding agent.” OpenCode, Cline, Goose, Aider, OpenHands, Roo Code, GitHub Copilot cloud agent, and Devin already own the broad coding-agent mindshare.
Mainspring should compete as:
Product Requirements Document (PRD)-first AI coding agent orchestration for production-grade software delivery.
That means Mainspring exists to solve the operator problem that generic agents leave behind: intent, bounded work, independent review, evidence, global status, notifications, local/private model routing, and recovery.
June 2026 release score
The refreshed 1000-point release-readiness score in `docs/competitive-analysis.md` rates Mainspring’s v1 source release readiness at 900/1000. This is a source-release readiness score, not a claim that Mainspring has more distribution than established competitors.
Mainspring scores high on:
- Product Requirements Document (PRD)-first production-grade workflow.
- Taskmaster-aware work selection.
- Independent writer/reviewer wave model.
- Fail-closed JSONL evidence, replay, and failure taxonomy.
- Global dashboard and Telegram operator visibility.
- Local/private writer model routing through Ollama or MTPLX plus Codex/Claude reviewer.
Next public credibility evidence after the source release:
- Signed `v1.0.0` tag and GitHub Release.
- Package-manager install path.
- Demo video or GIF showing PRD -> wave -> reviewer -> HUD -> Telegram -> ledger.
- Published SWE-bench Verified or equivalent benchmark result.
- Optional GitHub Issues, Linear, and Jira backlog adapters.
Closest threats
| Threat | Why it matters | Mainspring response |
|---|---|---|
| Agent Orchestrator | Worktrees, PR automation, CI fixes, review comment loops, tracker integrations. | Stay Product Requirements Document (PRD)-first and evidence-first; add optional GitHub/Linear backlog adapters later. |
| OpenAI Symphony | Strong “manage work, not agents” positioning plus OpenAI brand. | Stay local/private, multi-engine, and operator-owned. |
| Claude Task Master | Owns PRD-to-task decomposition and overlaps with autopilot. | Be explicit: Mainspring complements Taskmaster by adding execution, review, HUD, Telegram, and evidence. |
| OpenCode / Goose / Aider | Broader coding-agent mindshare and provider/local-model support. | Do not fight on chat UX; own autonomous execution control. |
| Cline / Roo Code | Strong editor-native trust and approval UX. | Own unattended CLI waves where per-action approval is the wrong workflow. |
| GitHub Copilot cloud agent / Devin | Hosted issue-to-PR convenience and enterprise reach. | Own local/private, inspectable, non-SaaS workflows. |
Search and positioning requirements
Public copy should repeatedly use these phrases where natural:
- Product Requirements Document (PRD)-first AI coding agent orchestration.
- Local AI coding agent orchestration for production-grade software delivery.
- Writer/reviewer AI coding workflow.
- Taskmaster execution loop.
- Fail-closed AI code review gate.
- JSONL evidence ledger and replay.
- Terminal HUD for multiple coding agents.
- Telegram alerts for autonomous coding runs.
- Local model writer with Codex or Claude reviewer.
The next market-facing gates are: signed release announcement, package install path, comparison pages, 60-second demo, and benchmark evidence.
Last edited: 2026-06-15. This file is the canonical plan; if any other file in the repo disagrees, update that file to match this one.