feat: implement new claude skills and workflow
All checks were successful
Deploy to Staging / Build Images (push) Successful in 23s
Deploy to Staging / Deploy to Staging (push) Successful in 36s
Deploy to Staging / Verify Staging (push) Successful in 6s
Deploy to Staging / Notify Staging Ready (push) Successful in 6s
Deploy to Staging / Notify Staging Failure (push) Has been skipped
.claude/skills/codebase-analysis/CLAUDE.md (new file, 16 lines)
@@ -0,0 +1,16 @@
# skills/codebase-analysis/

## Overview

Systematic codebase analysis skill. IMMEDIATELY invoke the script - do NOT explore first.

## Index

| File/Directory       | Contents          | Read When          |
| -------------------- | ----------------- | ------------------ |
| `SKILL.md`           | Invocation        | Using this skill   |
| `scripts/analyze.py` | Complete workflow | Debugging behavior |

## Key Point

The script IS the workflow. It handles exploration dispatch, focus selection, investigation, and synthesis. Do NOT explore or analyze before invoking. Run the script and obey its output.
.claude/skills/codebase-analysis/README.md (new file, 48 lines)
@@ -0,0 +1,48 @@
# Analyze

Before you plan anything non-trivial, you need to actually understand the
codebase. Not impressions -- evidence. The analyze skill forces systematic
investigation with structured phases and explicit evidence requirements.

| Phase                  | Actions                                                                         |
| ---------------------- | ------------------------------------------------------------------------------- |
| Exploration            | Delegate to Explore agent; process structure, tech stack, patterns              |
| Focus Selection        | Classify areas (architecture, performance, security, quality); assign P1/P2/P3  |
| Investigation Planning | Commit to specific files and questions; create accountability contract          |
| Deep Analysis          | Progressive investigation; document with file:line + quoted code                |
| Verification           | Audit completeness; ensure all commitments addressed                            |
| Synthesis              | Consolidate by severity; provide prioritized recommendations                    |

## When to Use

Four scenarios where this matters:

- **Unfamiliar codebase** -- You cannot plan what you do not understand. Period.
- **Security review** -- Vulnerability assessment requires systematic coverage,
  not "I looked around and it seems fine."
- **Performance analysis** -- Before optimization, know where time actually
  goes, not where you assume it goes.
- **Architecture evaluation** -- Major refactors deserve evidence-backed
  understanding, not vibes.

## When to Skip

Not everything needs this level of rigor:

- You already understand the codebase well
- Simple bug fix with obvious scope
- User has provided comprehensive context

The astute reader will notice all three skip conditions share a trait: you
already have the evidence. The skill exists for when you do not.

## Example Usage

```
Use your analyze skill to understand this codebase.
Focus on security and architecture before we plan the authentication refactor.
```

The skill outputs findings organized by severity (CRITICAL/HIGH/MEDIUM/LOW),
each with file:line references and quoted code. This feeds directly into
planning -- you have evidence-backed understanding before proposing changes.
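A severity-organized finding might look like the following sketch. This is purely illustrative; the file, line number, and vulnerability shown are invented for the example:

```
CRITICAL: SQL injection in login handler (auth/login.py:45)
  > cursor.execute(f"SELECT * FROM users WHERE name = '{name}'")
  Impact: attacker-controlled input reaches the query unescaped
  Fix: use a parameterized query

HIGH: Missing error handling around session writes (auth/session.py:88)
  ...
```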
.claude/skills/codebase-analysis/SKILL.md (new file, 25 lines)
@@ -0,0 +1,25 @@
---
name: codebase-analysis
description: Invoke IMMEDIATELY via python script when user requests codebase analysis, architecture review, security assessment, or quality evaluation. Do NOT explore first - the script orchestrates exploration.
---

# Codebase Analysis

When this skill activates, IMMEDIATELY invoke the script. The script IS the workflow.

## Invocation

```bash
python3 scripts/analyze.py \
  --step-number 1 \
  --total-steps 6 \
  --thoughts "Starting analysis. User request: <describe what user asked to analyze>"
```

| Argument        | Required | Description                               |
| --------------- | -------- | ----------------------------------------- |
| `--step-number` | Yes      | Current step (starts at 1)                |
| `--total-steps` | Yes      | Minimum 6; adjust as script instructs     |
| `--thoughts`    | Yes      | Accumulated state from all previous steps |

Do NOT explore or analyze first. Run the script and follow its output.
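Because `--thoughts` must carry the state from all previous steps, each invocation passes a strictly growing string. A minimal sketch of that accumulation (the findings, focus areas, and file names here are hypothetical, not part of the skill):

```python
# Hypothetical accumulated state across the first three steps.
step1 = "EXPLORATION: Flask app; src/ holds handlers, tests/ has pytest suites."
step2 = step1 + "\nFOCUS AREAS: P1 security, P2 quality."
step3 = step2 + "\nINVESTIGATION PLAN: src/auth.py -> Q: is input validated?"

# The step-3 invocation carries everything from steps 1 and 2:
cmd = [
    "python3", "scripts/analyze.py",
    "--step-number", "3",
    "--total-steps", "6",
    "--thoughts", step3,
]
print(step1 in step3 and step2 in step3)  # earlier state is never dropped
```

If any earlier section is missing from the string, the script's state-requirement check (steps 2+) tells you to reconstruct it before proceeding.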
.claude/skills/codebase-analysis/scripts/analyze.py (new executable file, 661 lines)
@@ -0,0 +1,661 @@
#!/usr/bin/env python3
"""
Analyze Skill - Step-by-step codebase analysis with exploration and deep investigation.

Six-phase workflow:
1. EXPLORATION: Process Explore sub-agent results
2. FOCUS SELECTION: Classify investigation areas
3. INVESTIGATION PLANNING: Commit to specific files and questions
4. DEEP ANALYSIS (1-N): Progressive investigation with evidence
5. VERIFICATION: Validate completeness before synthesis
6. SYNTHESIS: Consolidate verified findings

Usage:
    python3 analyze.py --step-number 1 --total-steps 6 --thoughts "Explore found: ..."
"""

import argparse
import sys


def get_phase_name(step: int, total_steps: int) -> str:
    """Return the phase name for a given step number."""
    if step == 1:
        return "EXPLORATION"
    elif step == 2:
        return "FOCUS SELECTION"
    elif step == 3:
        return "INVESTIGATION PLANNING"
    elif step == total_steps - 1:
        return "VERIFICATION"
    elif step == total_steps:
        return "SYNTHESIS"
    else:
        return "DEEP ANALYSIS"


def get_state_requirement(step: int) -> list[str]:
    """Return state accumulation requirement for steps 2+."""
    if step < 2:
        return []

    return [
        "",
        "<state_requirement>",
        "CRITICAL: Your --thoughts for this step MUST include:",
        "",
        "1. FOCUS AREAS: Each area identified and its priority (from step 2)",
        "2. INVESTIGATION PLAN: Files and questions committed to (from step 3)",
        "3. FILES EXAMINED: Every file read with key observations",
        "4. ISSUES BY SEVERITY: All [CRITICAL]/[HIGH]/[MEDIUM]/[LOW] items",
        "5. PATTERNS: Cross-file patterns identified",
        "6. HYPOTHESES: Current theories and supporting evidence",
        "7. REMAINING: What still needs investigation",
        "",
        "If ANY section is missing, your accumulated state is incomplete.",
        "Reconstruct it before proceeding.",
        "</state_requirement>",
    ]


def get_step_guidance(step: int, total_steps: int) -> dict:
    """Return step-specific guidance and actions."""

    next_step = step + 1 if step < total_steps else None
    phase = get_phase_name(step, total_steps)
    is_final = step >= total_steps

    # Minimum steps: exploration(1) + focus(2) + planning(3) + analysis(4) + verification(5) + synthesis(6)
    min_steps = 6

    # PHASE 1: EXPLORATION
    if step == 1:
        return {
            "phase": phase,
            "step_title": "Process Exploration Results",
            "actions": [
                "STOP. Before proceeding, verify you have Explore agent results.",
                "",
                "If your --thoughts do NOT contain Explore agent output, you MUST:",
                "",
                "<exploration_delegation>",
                "Assess the scope and delegate appropriately:",
                "",
                "SINGLE CODEBASE, FOCUSED SCOPE:",
                "  - One Explore agent is sufficient",
                "  - Use Task tool with subagent_type='Explore'",
                "  - Prompt: 'Explore this repository. Report directory structure,",
                "    tech stack, entry points, main components, observed patterns.'",
                "",
                "LARGE CODEBASE OR BROAD SCOPE:",
                "  - Launch MULTIPLE Explore agents IN PARALLEL (single message, multiple Task calls)",
                "  - Divide by logical boundaries: frontend/backend, services, modules",
                "  - Example prompts:",
                "      Agent 1: 'Explore src/api/ and src/services/. Focus on API structure.'",
                "      Agent 2: 'Explore src/core/ and src/models/. Focus on domain logic.'",
                "      Agent 3: 'Explore tests/ and config/. Focus on test patterns and configuration.'",
                "",
                "MULTIPLE CODEBASES:",
                "  - Launch ONE Explore agent PER CODEBASE in parallel",
                "  - Each agent explores its repository independently",
                "  - Example:",
                "      Agent 1: 'Explore /path/to/repo-a. Report structure and patterns.'",
                "      Agent 2: 'Explore /path/to/repo-b. Report structure and patterns.'",
                "",
                "WAIT for ALL agents to complete before invoking this step again.",
                "</exploration_delegation>",
                "",
                "Only proceed below if you have concrete Explore output to process.",
                "",
                "=" * 60,
                "",
                "<exploration_processing>",
                "From the Explore agent(s) report(s), extract and document:",
                "",
                "STRUCTURE:",
                "  - Main directories and their purposes",
                "  - Where core logic lives vs. configuration vs. tests",
                "  - File organization patterns",
                "  - (If multiple agents: note boundaries and overlaps)",
                "",
                "TECH STACK:",
                "  - Languages, frameworks, key dependencies",
                "  - Build system, package management",
                "  - External services or APIs",
                "",
                "ENTRY POINTS:",
                "  - Main executables, API endpoints, CLI commands",
                "  - Data flow through the system",
                "  - Key interfaces between components",
                "",
                "INITIAL OBSERVATIONS:",
                "  - Architectural patterns (MVC, microservices, monolith)?",
                "  - Obvious code smells or areas of concern?",
                "  - Parts that seem well-structured vs. problematic?",
                "</exploration_processing>",
            ],
            "next": (
                f"Invoke step {next_step} with your processed exploration summary. "
                "Include all structure, tech stack, and initial observations in --thoughts."
            ),
        }

    # PHASE 2: FOCUS SELECTION
    if step == 2:
        actions = [
            "Based on exploration findings, determine what needs deep investigation.",
            "",
            "<focus_classification>",
            "Evaluate the codebase against each dimension. Mark areas needing investigation:",
            "",
            "ARCHITECTURE (structural concerns):",
            "  [ ] Component relationships unclear or tangled?",
            "  [ ] Dependency graph needs mapping?",
            "  [ ] Layering violations or circular dependencies?",
            "  [ ] Missing or unclear module boundaries?",
            "",
            "PERFORMANCE (efficiency concerns):",
            "  [ ] Hot paths that may be inefficient?",
            "  [ ] Database queries needing review?",
            "  [ ] Memory allocation patterns?",
            "  [ ] Concurrency or parallelism issues?",
            "",
            "SECURITY (vulnerability concerns):",
            "  [ ] Input validation gaps?",
            "  [ ] Authentication/authorization flows?",
            "  [ ] Sensitive data handling?",
            "  [ ] External API integrations?",
            "",
            "QUALITY (maintainability concerns):",
            "  [ ] Code duplication patterns?",
            "  [ ] Overly complex functions/classes?",
            "  [ ] Missing error handling?",
            "  [ ] Test coverage gaps?",
            "</focus_classification>",
            "",
            "<priority_assignment>",
            "Rank your focus areas by priority (P1 = most critical):",
            "",
            "  P1: [focus area] - [why most critical]",
            "  P2: [focus area] - [why second]",
            "  P3: [focus area] - [if applicable]",
            "",
            "Consider: security > correctness > performance > maintainability",
            "</priority_assignment>",
            "",
            "<step_estimation>",
            "Estimate total steps based on scope:",
            "",
            f"  Minimum steps: {min_steps} (exploration + focus + planning + 1 analysis + verification + synthesis)",
            "  1-2 focus areas, small codebase: total_steps = 6-7",
            "  2-3 focus areas, medium codebase: total_steps = 7-9",
            "  3+ focus areas, large codebase: total_steps = 9-12",
            "",
            "You can adjust this estimate as understanding grows.",
            "</step_estimation>",
        ]
        actions.extend(get_state_requirement(step))
        return {
            "phase": phase,
            "step_title": "Classify Investigation Areas",
            "actions": actions,
            "next": (
                f"Invoke step {next_step} with your prioritized focus areas and "
                "updated total_steps estimate. Next: create investigation plan."
            ),
        }

    # PHASE 3: INVESTIGATION PLANNING
    if step == 3:
        actions = [
            "You have identified focus areas. Now commit to specific investigation targets.",
            "",
            "This step creates ACCOUNTABILITY. You will verify against these commitments.",
            "",
            "<investigation_commitments>",
            "For EACH focus area (in priority order), specify:",
            "",
            "---",
            "FOCUS AREA: [name] (Priority: P1/P2/P3)",
            "",
            "Files to examine:",
            "  - path/to/file1.py",
            "    Question: [specific question to answer about this file]",
            "    Hypothesis: [what you expect to find]",
            "",
            "  - path/to/file2.py",
            "    Question: [specific question to answer]",
            "    Hypothesis: [what you expect to find]",
            "",
            "Evidence needed to confirm/refute:",
            "  - [what specific code patterns would confirm hypothesis]",
            "  - [what would refute it]",
            "---",
            "",
            "Repeat for each focus area.",
            "</investigation_commitments>",
            "",
            "<commitment_rules>",
            "This is a CONTRACT. In subsequent steps, you MUST:",
            "",
            "  1. Read every file listed (using Read tool)",
            "  2. Answer every question posed",
            "  3. Document evidence with file:line references",
            "  4. Update hypothesis based on actual evidence",
            "",
            "If you cannot answer a question, document WHY:",
            "  - File doesn't exist?",
            "  - Question was wrong?",
            "  - Need different files?",
            "",
            "Do NOT silently skip commitments.",
            "</commitment_rules>",
        ]
        actions.extend(get_state_requirement(step))
        return {
            "phase": phase,
            "step_title": "Create Investigation Plan",
            "actions": actions,
            "next": (
                f"Invoke step {next_step} with your complete investigation plan. "
                "Next: begin executing the plan with the highest priority focus area."
            ),
        }

    # PHASE 5: VERIFICATION (step N-1)
    if step == total_steps - 1:
        actions = [
            "STOP. Before synthesizing, verify your investigation is complete.",
            "",
            "<completeness_audit>",
            "Review your investigation commitments from Step 3.",
            "",
            "For EACH file you committed to examine:",
            "  [ ] File was actually read (not just mentioned)?",
            "  [ ] Specific question was answered with evidence?",
            "  [ ] Finding documented with file:line reference and quoted code?",
            "",
            "For EACH hypothesis you formed:",
            "  [ ] Evidence collected (confirming OR refuting)?",
            "  [ ] Hypothesis updated based on evidence?",
            "  [ ] If refuted, what replaced it?",
            "</completeness_audit>",
            "",
            "<gap_detection>",
            "Identify gaps in your investigation:",
            "",
            "  - Files committed but not examined?",
            "  - Focus areas declared but not investigated?",
            "  - Issues referenced without file:line evidence?",
            "  - Patterns claimed without cross-file validation?",
            "  - Questions posed but not answered?",
            "",
            "List each gap explicitly:",
            "  GAP 1: [description]",
            "  GAP 2: [description]",
            "  ...",
            "</gap_detection>",
            "",
            "<gap_resolution>",
            "If gaps exist:",
            "  1. INCREASE total_steps by number of gaps that need investigation",
            "  2. Return to DEEP ANALYSIS phase to fill gaps",
            "  3. Re-enter VERIFICATION after gaps are filled",
            "",
            "If no gaps (or gaps are acceptable):",
            "  Proceed to SYNTHESIS (next step)",
            "</gap_resolution>",
            "",
            "<evidence_quality_check>",
            "For each [CRITICAL] or [HIGH] severity finding, verify:",
            "  [ ] Has quoted code (2-5 lines)?",
            "  [ ] Has exact file:line reference?",
            "  [ ] Impact is clearly explained?",
            "  [ ] Recommended fix is actionable?",
            "",
            "Findings without evidence are UNVERIFIED. Either:",
            "  - Add evidence now, or",
            "  - Downgrade severity, or",
            "  - Mark as 'needs investigation'",
            "</evidence_quality_check>",
        ]
        actions.extend(get_state_requirement(step))
        return {
            "phase": phase,
            "step_title": "Verify Investigation Completeness",
            "actions": actions,
            "next": (
                "If gaps found: invoke earlier step to fill gaps, then return here. "
                f"If complete: invoke step {next_step} for final synthesis."
            ),
        }

    # PHASE 6: SYNTHESIS (final step)
    if is_final:
        return {
            "phase": phase,
            "step_title": "Consolidate and Recommend",
            "actions": [
                "Investigation verified. Synthesize all findings into actionable output.",
                "",
                "<final_consolidation>",
                "Organize all VERIFIED findings by severity:",
                "",
                "CRITICAL ISSUES (must address immediately):",
                "  For each:",
                "    - file:line reference",
                "    - Quoted code (2-5 lines)",
                "    - Impact description",
                "    - Recommended fix",
                "",
                "HIGH ISSUES (should address soon):",
                "  For each: file:line, description, recommended fix",
                "",
                "MEDIUM ISSUES (consider addressing):",
                "  For each: description, general guidance",
                "",
                "LOW ISSUES (nice to fix):",
                "  Summarize patterns, defer to future work",
                "</final_consolidation>",
                "",
                "<pattern_synthesis>",
                "Identify systemic patterns:",
                "",
                "  - Issues appearing across multiple files -> systemic problem",
                "  - Root causes explaining multiple symptoms",
                "  - Architectural changes that would prevent recurrence",
                "</pattern_synthesis>",
                "",
                "<recommendations>",
                "Provide prioritized action plan:",
                "",
                "IMMEDIATE (blocks other work / security risk):",
                "  1. [action with specific file:line reference]",
                "  2. [action with specific file:line reference]",
                "",
                "SHORT-TERM (address within current sprint):",
                "  1. [action with scope indication]",
                "  2. [action with scope indication]",
                "",
                "LONG-TERM (strategic improvements):",
                "  1. [architectural or process recommendation]",
                "  2. [architectural or process recommendation]",
                "</recommendations>",
                "",
                "<final_quality_check>",
                "Before presenting to user, verify:",
                "",
                "  [ ] All CRITICAL/HIGH issues have file:line + quoted code?",
                "  [ ] Recommendations are actionable, not vague?",
                "  [ ] Findings organized by impact, not discovery order?",
                "  [ ] No findings lost from earlier steps?",
                "  [ ] Patterns are supported by multiple examples?",
                "</final_quality_check>",
            ],
            "next": None,
        }

    # PHASE 4: DEEP ANALYSIS (steps 4 to N-2)
    # Calculate position within deep analysis phase
    deep_analysis_step = step - 3  # 1st, 2nd, 3rd deep analysis step
    remaining_before_verification = total_steps - 1 - step  # steps until verification

    if deep_analysis_step == 1:
        step_title = "Initial Investigation"
        focus_instruction = [
            "Execute your investigation plan from Step 3.",
            "",
            "<first_pass_protocol>",
            "For each file in your P1 (highest priority) focus area:",
            "",
            "1. READ the file using the Read tool",
            "2. ANSWER the specific question you committed to",
            "3. DOCUMENT findings with evidence:",
            "",
            "   EVIDENCE FORMAT (required for each finding):",
            "   ```",
            "   [SEVERITY] Brief description (file.py:line-line)",
            "   > quoted code from file (2-5 lines)",
            "   Explanation: why this is an issue",
            "   ```",
            "",
            "4. UPDATE your hypothesis based on what you found",
            "   - Confirmed? Document supporting evidence",
            "   - Refuted? Document what you found instead",
            "   - Inconclusive? Note what else you need to check",
            "</first_pass_protocol>",
            "",
            "Findings without quoted code are UNVERIFIED.",
        ]
    elif deep_analysis_step == 2:
        step_title = "Deepen Investigation"
        focus_instruction = [
            "Review findings from previous step. Go deeper.",
            "",
            "<second_pass_protocol>",
            "For each issue found in the previous step:",
            "",
            "1. TRACE to root cause",
            "   - Why does this issue exist?",
            "   - What allowed it to be introduced?",
            "   - Are there related issues in connected files?",
            "",
            "2. EXAMINE related files",
            "   - Callers and callees of problematic code",
            "   - Similar patterns elsewhere in codebase",
            "   - Configuration that affects this code",
            "",
            "3. LOOK for patterns",
            "   - Same issue in multiple places? -> Systemic problem",
            "   - One-off issue? -> Localized fix",
            "",
            "4. MOVE to P2 focus area if P1 is sufficiently investigated",
            "</second_pass_protocol>",
            "",
            "Continue documenting with file:line + quoted code.",
        ]
    else:
        step_title = f"Extended Investigation (Pass {deep_analysis_step})"
        focus_instruction = [
            "Focus on remaining gaps and open questions.",
            "",
            "<extended_investigation_protocol>",
            "Review your accumulated state. Address:",
            "",
            "1. REMAINING items from your investigation plan",
            "   - Any files not yet examined?",
            "   - Any questions not yet answered?",
            "",
            "2. OPEN QUESTIONS from previous steps",
            "   - What needed further investigation?",
            "   - What dependencies weren't clear?",
            "",
            "3. PATTERN VALIDATION",
            "   - Cross-file patterns claimed but not verified?",
            "   - Need more examples to confirm systemic issues?",
            "",
            "4. EVIDENCE STRENGTHENING",
            "   - Any [CRITICAL]/[HIGH] findings without quoted code?",
            "   - Any claims without file:line references?",
            "</extended_investigation_protocol>",
            "",
            "If investigation is complete, reduce total_steps to reach verification.",
        ]

    actions = focus_instruction + [
        "",
        "<scope_check>",
        "After this step's investigation:",
        "",
        f"  Remaining steps before verification: {remaining_before_verification}",
        "",
        "  - Discovered more complexity? -> INCREASE total_steps",
        "  - Remaining scope smaller than expected? -> DECREASE total_steps",
        "  - All focus areas sufficiently covered? -> Set next step = total_steps - 1 (verification)",
        "</scope_check>",
    ]
    actions.extend(get_state_requirement(step))

    return {
        "phase": phase,
        "step_title": step_title,
        "actions": actions,
        "next": (
            f"Invoke step {next_step}. "
            f"{remaining_before_verification} step(s) before verification. "
            "Include ALL accumulated findings in --thoughts. "
            "Adjust total_steps if scope changed."
        ),
    }


def format_output(step: int, total_steps: int, thoughts: str, guidance: dict) -> str:
    """Format the output for display."""
    lines = []

    # Header
    lines.append("=" * 70)
    lines.append(f"ANALYZE - Step {step}/{total_steps}: {guidance['step_title']}")
    lines.append(f"Phase: {guidance['phase']}")
    lines.append("=" * 70)
    lines.append("")

    # Status
    is_final = step >= total_steps
    is_verification = step == total_steps - 1
    if is_final:
        status = "analysis_complete"
    elif is_verification:
        status = "verification_required"
    else:
        status = "in_progress"
    lines.append(f"STATUS: {status}")
    lines.append("")

    # Current thoughts summary (truncated for display)
    lines.append("YOUR ACCUMULATED STATE:")
    if len(thoughts) > 600:
        lines.append(thoughts[:600] + "...")
        lines.append("[truncated - full state in --thoughts]")
    else:
        lines.append(thoughts)
    lines.append("")

    # Actions
    lines.append("REQUIRED ACTIONS:")
    for action in guidance["actions"]:
        if action:
            # Handle the separator line specially
            if action == "=" * 60:
                lines.append("  " + action)
            else:
                lines.append(f"  {action}")
        else:
            lines.append("")
    lines.append("")

    # Next step or completion
    if guidance["next"]:
        lines.append("NEXT:")
        lines.append(guidance["next"])
    else:
        lines.append("WORKFLOW COMPLETE")
        lines.append("")
        lines.append("Present your consolidated findings to the user:")
        lines.append("  - Organized by severity (CRITICAL -> LOW)")
        lines.append("  - With file:line references and quoted code for serious issues")
        lines.append("  - With actionable recommendations for each category")

    lines.append("")
    lines.append("=" * 70)

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser(
        description="Analyze Skill - Systematic codebase analysis",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Workflow Phases:
  Step 1:   EXPLORATION - Process Explore agent results
  Step 2:   FOCUS SELECTION - Classify investigation areas
  Step 3:   INVESTIGATION PLAN - Commit to specific files and questions
  Step 4+:  DEEP ANALYSIS - Progressive investigation with evidence
  Step N-1: VERIFICATION - Validate completeness before synthesis
  Step N:   SYNTHESIS - Consolidate verified findings

Examples:
  # Step 1: After Explore agent returns
  python3 analyze.py --step-number 1 --total-steps 6 \\
      --thoughts "Explore found: Python web app, Flask, SQLAlchemy..."

  # Step 2: Focus selection
  python3 analyze.py --step-number 2 --total-steps 7 \\
      --thoughts "Structure: src/, tests/. Focus: security (P1), quality (P2)..."

  # Step 3: Investigation planning
  python3 analyze.py --step-number 3 --total-steps 7 \\
      --thoughts "P1 Security: auth/login.py (Q: input validation?), ..."

  # Step 4: Initial investigation
  python3 analyze.py --step-number 4 --total-steps 7 \\
      --thoughts "FILES: auth/login.py read. [CRITICAL] SQL injection at :45..."

  # Step 5: Deepen investigation
  python3 analyze.py --step-number 5 --total-steps 7 \\
      --thoughts "[Previous state] + traced to db/queries.py, pattern in 3 files..."

  # Step 6: Verification
  python3 analyze.py --step-number 6 --total-steps 7 \\
      --thoughts "[All findings] Checking: all files read, all questions answered..."

  # Step 7: Synthesis
  python3 analyze.py --step-number 7 --total-steps 7 \\
      --thoughts "[Verified findings] Ready for consolidation..."
""",
    )

    parser.add_argument(
        "--step-number",
        type=int,
        required=True,
        help="Current step number (starts at 1)",
    )
    parser.add_argument(
        "--total-steps",
        type=int,
        required=True,
        help="Estimated total steps (adjust as understanding grows)",
    )
    parser.add_argument(
        "--thoughts",
        type=str,
        required=True,
        help="Accumulated findings, evidence, and file references",
    )

    args = parser.parse_args()

    # Validate inputs
    if args.step_number < 1:
        print("ERROR: step-number must be >= 1", file=sys.stderr)
        sys.exit(1)

    if args.total_steps < 6:
        print("ERROR: total-steps must be >= 6 (minimum workflow)", file=sys.stderr)
        sys.exit(1)

    if args.total_steps < args.step_number:
        print("ERROR: total-steps must be >= step-number", file=sys.stderr)
        sys.exit(1)

    # Get guidance for current step
    guidance = get_step_guidance(args.step_number, args.total_steps)

    # Print formatted output
    print(format_output(args.step_number, args.total_steps, args.thoughts, guidance))


if __name__ == "__main__":
    main()
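The step-to-phase dispatch in `get_phase_name` above can be exercised standalone. A minimal sketch, assuming `total_steps=7` (so steps 4 and 5 are the deep-analysis passes):

```python
# Standalone copy of analyze.py's step-to-phase dispatch, for illustration.
def phase(step: int, total_steps: int) -> str:
    if step == 1:
        return "EXPLORATION"
    if step == 2:
        return "FOCUS SELECTION"
    if step == 3:
        return "INVESTIGATION PLANNING"
    if step == total_steps - 1:
        return "VERIFICATION"
    if step == total_steps:
        return "SYNTHESIS"
    return "DEEP ANALYSIS"

# With total_steps=7 the middle steps fall into DEEP ANALYSIS:
print([phase(s, 7) for s in range(1, 8)])
```

Note the ordering matters: the fixed early phases are checked before the `total_steps`-relative ones, which is why `--total-steps` must be at least 6 for every phase to get its own step.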
.claude/skills/decision-critic/CLAUDE.md (new file, 16 lines)
@@ -0,0 +1,16 @@
# skills/decision-critic/

## Overview

Decision stress-testing skill. IMMEDIATELY invoke the script - do NOT analyze first.

## Index

| File/Directory               | Contents          | Read When          |
| ---------------------------- | ----------------- | ------------------ |
| `SKILL.md`                   | Invocation        | Using this skill   |
| `scripts/decision-critic.py` | Complete workflow | Debugging behavior |

## Key Point

The script IS the workflow. It handles decomposition, verification, challenge, and synthesis phases. Do NOT analyze or critique before invoking. Run the script and obey its output.
.claude/skills/decision-critic/README.md (new file, 59 lines)
@@ -0,0 +1,59 @@
# Decision Critic

Here's the problem: LLMs are sycophants. They agree with you. They validate your
reasoning. They tell you your architectural decision is sound and well-reasoned.
That's not what you need for important decisions -- you need stress-testing.

The decision-critic skill forces structured adversarial analysis:

| Phase         | Actions                                                                    |
| ------------- | -------------------------------------------------------------------------- |
| Decomposition | Extract claims, assumptions, constraints; assign IDs; classify each        |
| Verification  | Generate questions for verifiable items; answer independently; mark status |
| Challenge     | Steel-man argument against; explore alternative framings                   |
| Synthesis     | Verdict (STAND/REVISE/ESCALATE); summary and recommendation                |

## When to Use

Use this for decisions where you actually want criticism, not agreement:

- Architectural choices with long-term consequences
- Technology selection (language, framework, database)
- Tradeoffs between competing concerns (performance vs. maintainability)
- Decisions you're uncertain about and want stress-tested

## Example Usage

```
I'm considering using Redis for our session storage instead of PostgreSQL.
My reasoning:

- Redis is faster for key-value lookups
- Sessions are ephemeral, don't need ACID guarantees
- We already have Redis for caching

Use your decision critic skill to stress-test this decision.
```

So what happens? The skill:

1. **Decomposes** the decision into claims (C1: Redis is faster), assumptions
   (A1: sessions don't need durability), constraints (K1: Redis already
   deployed)
2. **Verifies** each claim -- is Redis actually faster for your access pattern?
   What's the actual latency difference?
3. **Challenges** -- what if sessions DO need durability (shopping carts)?
   What's the operational cost of Redis failures?
4. **Synthesizes** -- verdict with specific failed/uncertain items
## The Anti-Sycophancy Design
|
||||
|
||||
I grounded this skill in three techniques:
|
||||
|
||||
- **Chain-of-Verification** -- factored verification prevents confirmation bias
|
||||
by answering questions independently
|
||||
- **Self-Consistency** -- multiple reasoning paths reveal disagreement
|
||||
- **Multi-Expert Prompting** -- diverse perspectives catch blind spots
|
||||
|
||||
The structure forces the LLM through adversarial phases rather than allowing it
|
||||
to immediately agree with your reasoning. That's the whole point.
|
||||
29
.claude/skills/decision-critic/SKILL.md
Normal file
29
.claude/skills/decision-critic/SKILL.md
Normal file
@@ -0,0 +1,29 @@
|
||||
---
|
||||
name: decision-critic
|
||||
description: Invoke IMMEDIATELY via python script to stress-test decisions and reasoning. Do NOT analyze first - the script orchestrates the critique workflow.
|
||||
---
|
||||
|
||||
# Decision Critic
|
||||
|
||||
When this skill activates, IMMEDIATELY invoke the script. The script IS the workflow.
|
||||
|
||||
## Invocation
|
||||
|
||||
```bash
|
||||
python3 scripts/decision-critic.py \
|
||||
--step-number 1 \
|
||||
--total-steps 7 \
|
||||
--decision "<decision text>" \
|
||||
--context "<constraints and background>" \
|
||||
--thoughts "<your accumulated analysis from all previous steps>"
|
||||
```
|
||||
|
||||
| Argument | Required | Description |
|
||||
| --------------- | -------- | ----------------------------------------------------------- |
|
||||
| `--step-number` | Yes | Current step (1-7) |
|
||||
| `--total-steps` | Yes | Always 7 |
|
||||
| `--decision` | Step 1 | The decision statement being criticized |
|
||||
| `--context` | Step 1 | Constraints, background, system context |
|
||||
| `--thoughts` | Yes | Your analysis including all IDs and status from prior steps |
|
||||
|
||||
Do NOT analyze or critique first. Run the script and follow its output.
|
||||
468
.claude/skills/decision-critic/scripts/decision-critic.py
Executable file
468
.claude/skills/decision-critic/scripts/decision-critic.py
Executable file
@@ -0,0 +1,468 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Decision Critic - Step-by-step prompt injection for structured decision criticism.
|
||||
|
||||
Grounded in:
|
||||
- Chain-of-Verification (Dhuliawala et al., 2023)
|
||||
- Self-Consistency (Wang et al., 2023)
|
||||
- Multi-Expert Prompting (Wang et al., 2024)
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from typing import Optional
|
||||
|
||||
|
||||
def get_phase_name(step: int) -> str:
|
||||
"""Return the phase name for a given step number."""
|
||||
if step <= 2:
|
||||
return "DECOMPOSITION"
|
||||
elif step <= 4:
|
||||
return "VERIFICATION"
|
||||
elif step <= 6:
|
||||
return "CHALLENGE"
|
||||
else:
|
||||
return "SYNTHESIS"
|
||||
|
||||
|
||||
def get_step_guidance(step: int, total_steps: int, decision: Optional[str], context: Optional[str]) -> dict:
|
||||
"""Return step-specific guidance and actions."""
|
||||
|
||||
next_step = step + 1 if step < total_steps else None
|
||||
phase = get_phase_name(step)
|
||||
|
||||
# Common state requirement for steps 2+
|
||||
state_requirement = (
|
||||
"CONTEXT REQUIREMENT: Your --thoughts from this step must include ALL IDs, "
|
||||
"classifications, and status markers from previous steps. This accumulated "
|
||||
"state is essential for workflow continuity."
|
||||
)
|
||||
|
||||
# DECOMPOSITION PHASE
|
||||
if step == 1:
|
||||
return {
|
||||
"phase": phase,
|
||||
"step_title": "Extract Structure",
|
||||
"actions": [
|
||||
"You are a structured decision critic. Your task is to decompose this "
|
||||
"decision into its constituent parts so each can be independently verified "
|
||||
"or challenged. This analysis is critical to the quality of the entire workflow.",
|
||||
"",
|
||||
"Extract and assign stable IDs that will persist through ALL subsequent steps:",
|
||||
"",
|
||||
"CLAIMS [C1, C2, ...] - Factual assertions (3-7 items)",
|
||||
" What facts does this decision assume to be true?",
|
||||
" What cause-effect relationships does it depend on?",
|
||||
"",
|
||||
"ASSUMPTIONS [A1, A2, ...] - Unstated beliefs (2-5 items)",
|
||||
" What is implied but not explicitly stated?",
|
||||
" What would someone unfamiliar with the context not know?",
|
||||
"",
|
||||
"CONSTRAINTS [K1, K2, ...] - Hard boundaries (1-4 items)",
|
||||
" What technical limitations exist?",
|
||||
" What organizational/timeline constraints apply?",
|
||||
"",
|
||||
"JUDGMENTS [J1, J2, ...] - Subjective tradeoffs (1-3 items)",
|
||||
" Where are values being weighed against each other?",
|
||||
" What 'it depends' decisions were made?",
|
||||
"",
|
||||
"OUTPUT FORMAT:",
|
||||
" C1: <claim text>",
|
||||
" C2: <claim text>",
|
||||
" A1: <assumption text>",
|
||||
" K1: <constraint text>",
|
||||
" J1: <judgment text>",
|
||||
"",
|
||||
"These IDs will be referenced in ALL subsequent steps. Be thorough but focused.",
|
||||
],
|
||||
"next": f"Step {next_step}: Classify each item's verifiability.",
|
||||
"academic_note": None,
|
||||
}
|
||||
|
||||
if step == 2:
|
||||
return {
|
||||
"phase": phase,
|
||||
"step_title": "Classify Verifiability",
|
||||
"actions": [
|
||||
"You are a structured decision critic continuing your analysis.",
|
||||
"",
|
||||
"Classify each item from Step 1. Retain original IDs and add a verifiability tag.",
|
||||
"",
|
||||
"CLASSIFICATIONS:",
|
||||
"",
|
||||
" [V] VERIFIABLE - Can be checked against evidence or tested",
|
||||
" Examples: \"API supports 1000 RPS\" (testable), \"Library X has feature Y\" (checkable)",
|
||||
"",
|
||||
" [J] JUDGMENT - Subjective tradeoff with no objectively correct answer",
|
||||
" Examples: \"Simplicity is more important than flexibility\", \"Risk is acceptable\"",
|
||||
"",
|
||||
" [C] CONSTRAINT - Given condition, accepted as fixed for this decision",
|
||||
" Examples: \"Budget is $50K\", \"Must launch by Q2\", \"Team has 3 engineers\"",
|
||||
"",
|
||||
"EDGE CASE RULE: When an item could fit multiple categories, prefer [V] over [J] over [C].",
|
||||
"Rationale: Verifiable items can be checked; judgments can be debated; constraints are given.",
|
||||
"",
|
||||
"Example edge case:",
|
||||
" \"The team can deliver in 4 weeks\" - Could be [J] (judgment about capacity) or [V] (checkable",
|
||||
" against past velocity). Choose [V] because it CAN be verified against evidence.",
|
||||
"",
|
||||
"OUTPUT FORMAT (preserve original IDs):",
|
||||
" C1 [V]: <claim text>",
|
||||
" C2 [J]: <claim text>",
|
||||
" A1 [V]: <assumption text>",
|
||||
" K1 [C]: <constraint text>",
|
||||
"",
|
||||
"COUNT: State how many [V] items require verification in the next phase.",
|
||||
"",
|
||||
state_requirement,
|
||||
],
|
||||
"next": f"Step {next_step}: Generate verification questions for [V] items.",
|
||||
"academic_note": None,
|
||||
}
|
||||
|
||||
# VERIFICATION PHASE
|
||||
if step == 3:
|
||||
return {
|
||||
"phase": phase,
|
||||
"step_title": "Generate Verification Questions",
|
||||
"actions": [
|
||||
"You are a structured decision critic. This step is crucial for catching errors.",
|
||||
"",
|
||||
"For each [V] item from Step 2, generate 1-3 verification questions.",
|
||||
"",
|
||||
"CRITERIA FOR GOOD QUESTIONS:",
|
||||
" - Specific and independently answerable",
|
||||
" - Designed to reveal if the claim is FALSE (falsification focus)",
|
||||
" - Do not assume the claim is true in the question itself",
|
||||
" - Each question should test a different aspect of the claim",
|
||||
"",
|
||||
"QUESTION BOUNDS:",
|
||||
" - Simple claims: 1 question",
|
||||
" - Moderate claims: 2 questions",
|
||||
" - Complex claims with multiple parts: 3 questions maximum",
|
||||
"",
|
||||
"OUTPUT FORMAT:",
|
||||
" C1 [V]: <claim text>",
|
||||
" Q1: <verification question>",
|
||||
" Q2: <verification question>",
|
||||
" A1 [V]: <assumption text>",
|
||||
" Q1: <verification question>",
|
||||
"",
|
||||
"EXAMPLE:",
|
||||
" C1 [V]: Retrying failed requests creates race condition risk",
|
||||
" Q1: Can a retry succeed after another request has already written?",
|
||||
" Q2: What ordering guarantees exist between concurrent requests?",
|
||||
"",
|
||||
state_requirement,
|
||||
],
|
||||
"next": f"Step {next_step}: Answer questions with factored verification.",
|
||||
"academic_note": (
|
||||
"Chain-of-Verification (Dhuliawala et al., 2023): \"Plan verification questions "
|
||||
"to check its work, and then systematically answer those questions.\""
|
||||
),
|
||||
}
|
||||
|
||||
if step == 4:
|
||||
return {
|
||||
"phase": phase,
|
||||
"step_title": "Factored Verification",
|
||||
"actions": [
|
||||
"You are a structured decision critic. This verification step is the most important "
|
||||
"in the entire workflow. Your accuracy here directly determines verdict quality. "
|
||||
"Take your time and be rigorous.",
|
||||
"",
|
||||
"Answer each verification question INDEPENDENTLY.",
|
||||
"",
|
||||
"EPISTEMIC BOUNDARY (critical for avoiding confirmation bias):",
|
||||
"",
|
||||
" Answer using ONLY:",
|
||||
" (a) Established domain knowledge - facts you would find in documentation,",
|
||||
" textbooks, or widely-accepted technical references",
|
||||
" (b) Stated constraints - information explicitly provided in the decision context",
|
||||
" (c) Logical inference - deductions from first principles that would hold",
|
||||
" regardless of whether this specific decision is correct",
|
||||
"",
|
||||
" Do NOT:",
|
||||
" - Assume the decision is correct and work backward",
|
||||
" - Assume the decision is incorrect and seek to disprove",
|
||||
" - Reference whether the claim 'should' be true given the decision",
|
||||
"",
|
||||
"SEPARATE your answer from its implication:",
|
||||
" - ANSWER: The factual response to the question (evidence-based)",
|
||||
" - IMPLICATION: What this means for the original claim (judgment)",
|
||||
"",
|
||||
"Then mark each [V] item:",
|
||||
" VERIFIED - Answers are consistent with the claim",
|
||||
" FAILED - Answers reveal inconsistency, error, or contradiction",
|
||||
" UNCERTAIN - Insufficient evidence; state what additional information would resolve",
|
||||
"",
|
||||
"OUTPUT FORMAT:",
|
||||
" C1 [V]: <claim text>",
|
||||
" Q1: <question>",
|
||||
" Answer: <factual answer based on epistemic boundary>",
|
||||
" Implication: <what this means for the claim>",
|
||||
" Status: VERIFIED | FAILED | UNCERTAIN",
|
||||
" Rationale: <one sentence explaining the status>",
|
||||
"",
|
||||
state_requirement,
|
||||
],
|
||||
"next": f"Step {next_step}: Begin challenge phase with adversarial analysis.",
|
||||
"academic_note": (
|
||||
"Chain-of-Verification: \"Factored variants which separate out verification steps, "
|
||||
"in terms of which context is attended to, give further performance gains.\""
|
||||
),
|
||||
}
|
||||
|
||||
# CHALLENGE PHASE
|
||||
if step == 5:
|
||||
return {
|
||||
"phase": phase,
|
||||
"step_title": "Contrarian Perspective",
|
||||
"actions": [
|
||||
"You are a structured decision critic shifting to adversarial analysis.",
|
||||
"",
|
||||
"Your task: Generate the STRONGEST possible argument AGAINST the decision.",
|
||||
"",
|
||||
"START FROM VERIFICATION RESULTS:",
|
||||
" - FAILED items are direct ammunition - the decision rests on false premises",
|
||||
" - UNCERTAIN items are attack vectors - unverified assumptions create risk",
|
||||
" - Even VERIFIED items may have hidden dependencies worth probing",
|
||||
"",
|
||||
"STEEL-MANNING: Present the opposition's BEST case, not a strawman.",
|
||||
"Ask: What would a thoughtful, well-informed critic with domain expertise say?",
|
||||
"Make the argument as strong as you can, even if you personally disagree.",
|
||||
"",
|
||||
"ATTACK VECTORS TO EXPLORE:",
|
||||
" - What could go wrong that wasn't considered?",
|
||||
" - What alternatives were dismissed too quickly?",
|
||||
" - What second-order effects were missed?",
|
||||
" - What happens if key assumptions change?",
|
||||
" - Who would disagree, and why might they be right?",
|
||||
"",
|
||||
"OUTPUT FORMAT:",
|
||||
"",
|
||||
"CONTRARIAN POSITION: <one-sentence summary of the opposition's stance>",
|
||||
"",
|
||||
"ARGUMENT:",
|
||||
"<Present the strongest 2-3 paragraph case against the decision.",
|
||||
" Reference specific item IDs (C1, A2, etc.) where applicable.",
|
||||
" Build from verification failures if any exist.>",
|
||||
"",
|
||||
"KEY RISKS:",
|
||||
"- <Risk 1 with item ID reference if applicable>",
|
||||
"- <Risk 2>",
|
||||
"- <Risk 3>",
|
||||
"",
|
||||
state_requirement,
|
||||
],
|
||||
"next": f"Step {next_step}: Explore alternative problem framing.",
|
||||
"academic_note": (
|
||||
"Multi-Expert Prompting (Wang et al., 2024): \"Integrating multiple experts' "
|
||||
"perspectives catches blind spots in reasoning.\""
|
||||
),
|
||||
}
|
||||
|
||||
if step == 6:
|
||||
return {
|
||||
"phase": phase,
|
||||
"step_title": "Alternative Framing",
|
||||
"actions": [
|
||||
"You are a structured decision critic examining problem formulation.",
|
||||
"",
|
||||
"PURPOSE: Step 5 challenged the SOLUTION. This step challenges the PROBLEM STATEMENT.",
|
||||
"Goal: Reveal hidden assumptions baked into how the problem was originally framed.",
|
||||
"",
|
||||
"Set aside the proposed solution temporarily. Ask:",
|
||||
" 'If I approached this problem fresh, how might I state it differently?'",
|
||||
"",
|
||||
"REFRAMING VECTORS:",
|
||||
" - Is this the right problem to solve, or a symptom of a deeper issue?",
|
||||
" - What would a different stakeholder (user, ops, security) prioritize?",
|
||||
" - What if the constraints (K items) were different or negotiable?",
|
||||
" - Is there a simpler formulation that dissolves the tradeoffs?",
|
||||
" - What objectives might be missing from the original framing?",
|
||||
"",
|
||||
"OUTPUT FORMAT:",
|
||||
"",
|
||||
"ALTERNATIVE FRAMING: <one-sentence restatement of the problem>",
|
||||
"",
|
||||
"WHAT THIS FRAMING EMPHASIZES:",
|
||||
"<Describe what becomes important under this new framing that wasn't",
|
||||
" prominent in the original.>",
|
||||
"",
|
||||
"HIDDEN ASSUMPTIONS REVEALED:",
|
||||
"<What did the original problem statement take for granted?",
|
||||
" Reference specific items (C, A, K, J) where the assumption appears.>",
|
||||
"",
|
||||
"IMPLICATION FOR DECISION:",
|
||||
"<Does this reframing strengthen, weaken, or redirect the proposed decision?>",
|
||||
"",
|
||||
state_requirement,
|
||||
],
|
||||
"next": f"Step {next_step}: Synthesize findings into verdict.",
|
||||
"academic_note": None,
|
||||
}
|
||||
|
||||
# SYNTHESIS PHASE
|
||||
if step == 7:
|
||||
return {
|
||||
"phase": phase,
|
||||
"step_title": "Synthesis and Verdict",
|
||||
"actions": [
|
||||
"You are a structured decision critic delivering your final assessment.",
|
||||
"This verdict will guide real decisions. Be confident in your analysis and precise "
|
||||
"in your recommendation.",
|
||||
"",
|
||||
"VERDICT RUBRIC:",
|
||||
"",
|
||||
" ESCALATE when ANY of these apply:",
|
||||
" - Any FAILED item involves safety, security, or compliance",
|
||||
" - Any UNCERTAIN item is critical AND cannot be cheaply verified",
|
||||
" - The alternative framing reveals the problem itself is wrong",
|
||||
"",
|
||||
" REVISE when ANY of these apply:",
|
||||
" - Any FAILED item on a core claim (not peripheral)",
|
||||
" - Multiple UNCERTAIN items on feasibility, effort, or impact",
|
||||
" - Challenge phase revealed unaddressed gaps that change the calculus",
|
||||
"",
|
||||
" STAND when ALL of these apply:",
|
||||
" - No FAILED items on core claims",
|
||||
" - UNCERTAIN items are explicitly acknowledged as accepted risks",
|
||||
" - Challenges from Steps 5-6 are addressable within the current approach",
|
||||
"",
|
||||
"BORDERLINE CASES:",
|
||||
" - When between STAND and REVISE: favor REVISE (cheaper to refine than to fail)",
|
||||
" - When between REVISE and ESCALATE: state both options with conditions",
|
||||
"",
|
||||
"OUTPUT FORMAT:",
|
||||
"",
|
||||
"VERDICT: [STAND | REVISE | ESCALATE]",
|
||||
"",
|
||||
"VERIFICATION SUMMARY:",
|
||||
" Verified: <list IDs>",
|
||||
" Failed: <list IDs with one-line explanation each>",
|
||||
" Uncertain: <list IDs with what would resolve each>",
|
||||
"",
|
||||
"CHALLENGE ASSESSMENT:",
|
||||
" Strongest challenge: <one-sentence summary from Step 5>",
|
||||
" Alternative framing insight: <one-sentence summary from Step 6>",
|
||||
" Response: <how the decision addresses or fails to address these>",
|
||||
"",
|
||||
"RECOMMENDATION:",
|
||||
" <Specific next action. If ESCALATE, specify to whom/what forum.",
|
||||
" If REVISE, specify which items need rework. If STAND, note accepted risks.>",
|
||||
],
|
||||
"next": None,
|
||||
"academic_note": (
|
||||
"Self-Consistency (Wang et al., 2023): \"Correct reasoning processes tend to "
|
||||
"have greater agreement in their final answer than incorrect processes.\""
|
||||
),
|
||||
}
|
||||
|
||||
return {
|
||||
"phase": "UNKNOWN",
|
||||
"step_title": "Unknown Step",
|
||||
"actions": ["Invalid step number."],
|
||||
"next": None,
|
||||
"academic_note": None,
|
||||
}
|
||||
|
||||
|
||||
def format_output(step: int, total_steps: int, guidance: dict) -> str:
|
||||
"""Format the output for display."""
|
||||
lines = []
|
||||
|
||||
# Header
|
||||
lines.append(f"DECISION CRITIC - Step {step}/{total_steps}: {guidance['step_title']}")
|
||||
lines.append(f"Phase: {guidance['phase']}")
|
||||
lines.append("")
|
||||
|
||||
# Actions
|
||||
for action in guidance["actions"]:
|
||||
lines.append(action)
|
||||
lines.append("")
|
||||
|
||||
# Academic note if present
|
||||
if guidance.get("academic_note"):
|
||||
lines.append(f"[{guidance['academic_note']}]")
|
||||
lines.append("")
|
||||
|
||||
# Next step or completion
|
||||
if guidance["next"]:
|
||||
lines.append(f"NEXT: {guidance['next']}")
|
||||
else:
|
||||
lines.append("WORKFLOW COMPLETE - Present verdict to user.")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Decision Critic - Structured decision criticism workflow"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--step-number",
|
||||
type=int,
|
||||
required=True,
|
||||
help="Current step number (1-7)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--total-steps",
|
||||
type=int,
|
||||
required=True,
|
||||
help="Total steps in workflow (always 7)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--decision",
|
||||
type=str,
|
||||
help="The decision being criticized (required for step 1)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--context",
|
||||
type=str,
|
||||
help="Relevant constraints and background (required for step 1)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--thoughts",
|
||||
type=str,
|
||||
required=True,
|
||||
help="Your analysis, findings, and progress from previous steps",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate step number
|
||||
if args.step_number < 1 or args.step_number > 7:
|
||||
print("ERROR: step-number must be between 1 and 7", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Validate step 1 requirements
|
||||
if args.step_number == 1:
|
||||
if not args.decision:
|
||||
print("ERROR: --decision is required for step 1", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Get guidance for current step
|
||||
guidance = get_step_guidance(
|
||||
args.step_number,
|
||||
args.total_steps,
|
||||
args.decision,
|
||||
args.context,
|
||||
)
|
||||
|
||||
# Print decision context on step 1
|
||||
if args.step_number == 1:
|
||||
print("DECISION UNDER REVIEW:")
|
||||
print(args.decision)
|
||||
if args.context:
|
||||
print("")
|
||||
print("CONTEXT:")
|
||||
print(args.context)
|
||||
print("")
|
||||
|
||||
# Print formatted output
|
||||
print(format_output(args.step_number, args.total_steps, guidance))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
46
.claude/skills/doc-sync/README.md
Normal file
46
.claude/skills/doc-sync/README.md
Normal file
@@ -0,0 +1,46 @@
|
||||
# Doc Sync
|
||||
|
||||
The CLAUDE.md/README.md hierarchy is central to context hygiene. CLAUDE.md files
|
||||
are pure indexes -- tabular navigation with "What" and "When to read" columns
|
||||
that help LLMs (and humans) find relevant files without loading everything.
|
||||
README.md files capture invisible knowledge: architecture decisions, design
|
||||
tradeoffs, and invariants that are not apparent from reading code.
|
||||
|
||||
The doc-sync skill audits and synchronizes this hierarchy across a repository.
|
||||
|
||||
## How It Works
|
||||
|
||||
The skill operates in five phases:
|
||||
|
||||
1. **Discovery** -- Maps all directories, identifies missing or outdated
|
||||
CLAUDE.md files
|
||||
2. **Audit** -- Checks for drift (files added/removed but not indexed),
|
||||
misplaced content (architecture docs in CLAUDE.md instead of README.md)
|
||||
3. **Migration** -- Moves architectural content from CLAUDE.md to README.md
|
||||
4. **Update** -- Creates/updates indexes with proper tabular format
|
||||
5. **Verification** -- Confirms complete coverage and correct structure
|
||||
|
||||
## When to Use
|
||||
|
||||
Use this skill for:
|
||||
|
||||
- **Bootstrapping** -- Adopting this workflow on an existing repository
|
||||
- **After bulk changes** -- Major refactors, directory restructuring
|
||||
- **Periodic audits** -- Checking for documentation drift
|
||||
- **Onboarding** -- Before starting work on an unfamiliar codebase
|
||||
|
||||
If you use the planning workflow consistently, the technical writer agent
|
||||
maintains documentation as part of execution. As such, doc-sync is primarily for
|
||||
bootstrapping or recovery -- not routine use.
|
||||
|
||||
## Example Usage
|
||||
|
||||
```
|
||||
Use your doc-sync skill to synchronize documentation across this repository
|
||||
```
|
||||
|
||||
For targeted updates:
|
||||
|
||||
```
|
||||
Use your doc-sync skill to update documentation in src/validators/
|
||||
```
|
||||
315
.claude/skills/doc-sync/SKILL.md
Normal file
315
.claude/skills/doc-sync/SKILL.md
Normal file
@@ -0,0 +1,315 @@
|
||||
---
|
||||
name: doc-sync
|
||||
description: Synchronizes CLAUDE.md navigation indexes and README.md architecture docs across a repository. Use when asked to "sync docs", "update CLAUDE.md files", "ensure documentation is in sync", "audit documentation", or when documentation maintenance is needed after code changes.
|
||||
---
|
||||
|
||||
# Doc Sync
|
||||
|
||||
Maintains the CLAUDE.md navigation hierarchy and optional README.md architecture docs across a repository. This skill is self-contained and performs all documentation work directly.
|
||||
|
||||
## Scope Resolution
|
||||
|
||||
Determine scope FIRST:
|
||||
|
||||
| User Request | Scope |
|
||||
| ------------------------------------------------------- | ----------------------------------------- |
|
||||
| "sync docs" / "update documentation" / no specific path | REPOSITORY-WIDE |
|
||||
| "sync docs in src/validator/" | DIRECTORY: src/validator/ and descendants |
|
||||
| "update CLAUDE.md for parser.py" | FILE: single file's parent directory |
|
||||
|
||||
For REPOSITORY-WIDE scope, perform a full audit. For narrower scopes, operate only within the specified boundary.
|
||||
|
||||
## CLAUDE.md Format Specification
|
||||
|
||||
### Index Format
|
||||
|
||||
Use tabular format with What and When columns:
|
||||
|
||||
```markdown
|
||||
## Files
|
||||
|
||||
| File | What | When to read |
|
||||
| ----------- | ------------------------------ | ----------------------------------------- |
|
||||
| `cache.rs` | LRU cache with O(1) operations | Implementing caching, debugging evictions |
|
||||
| `errors.rs` | Error types and Result aliases | Adding error variants, handling failures |
|
||||
|
||||
## Subdirectories
|
||||
|
||||
| Directory | What | When to read |
|
||||
| ----------- | ----------------------------- | ----------------------------------------- |
|
||||
| `config/` | Runtime configuration loading | Adding config options, modifying defaults |
|
||||
| `handlers/` | HTTP request handlers | Adding endpoints, modifying request flow |
|
||||
```
|
||||
|
||||
### Column Guidelines
|
||||
|
||||
- **File/Directory**: Use backticks around names: `cache.rs`, `config/`
|
||||
- **What**: Factual description of contents (nouns, not actions)
|
||||
- **When to read**: Task-oriented triggers using action verbs (implementing, debugging, modifying, adding, understanding)
|
||||
- At least one column must have content; empty cells use `-`
|
||||
|
||||
### Trigger Quality Test
|
||||
|
||||
Given task "add a new validation rule", can an LLM scan the "When to read" column and identify the right file?
|
||||
|
||||
### ROOT vs SUBDIRECTORY CLAUDE.md
|
||||
|
||||
**ROOT CLAUDE.md:**
|
||||
|
||||
```markdown
|
||||
# [Project Name]
|
||||
|
||||
[One sentence: what this is]
|
||||
|
||||
## Files
|
||||
|
||||
| File | What | When to read |
|
||||
| ---- | ---- | ------------ |
|
||||
|
||||
## Subdirectories
|
||||
|
||||
| Directory | What | When to read |
|
||||
| --------- | ---- | ------------ |
|
||||
|
||||
## Build
|
||||
|
||||
[Copy-pasteable command]
|
||||
|
||||
## Test
|
||||
|
||||
[Copy-pasteable command]
|
||||
|
||||
## Development
|
||||
|
||||
[Setup instructions, environment requirements, workflow notes]
|
||||
```
|
||||
|
||||
**SUBDIRECTORY CLAUDE.md:**
|
||||
|
||||
```markdown
|
||||
# [directory-name]/
|
||||
|
||||
## Files
|
||||
|
||||
| File | What | When to read |
|
||||
| ---- | ---- | ------------ |
|
||||
|
||||
## Subdirectories
|
||||
|
||||
| Directory | What | When to read |
|
||||
| --------- | ---- | ------------ |
|
||||
```
|
||||
|
||||
**Critical constraint:** Subdirectory CLAUDE.md files are PURE INDEX. No prose, no overview sections, no architectural explanations. Those belong in README.md.
|
||||
|
||||
## README.md Specification
|
||||
|
||||
### Creation Criteria (Invisible Knowledge Test)
|
||||
|
||||
Create README.md ONLY when the directory contains knowledge NOT visible from reading the code:
|
||||
|
||||
- Multiple components interact through non-obvious contracts or protocols
|
||||
- Design tradeoffs were made that affect how code should be modified
|
||||
- The directory's structure encodes domain knowledge (e.g., processing order matters)
|
||||
- Failure modes or edge cases aren't apparent from reading individual files
|
||||
- There are "rules" developers must follow that aren't enforced by the compiler/linter
|
||||
|
||||
**DO NOT create README.md when:**
|
||||
|
||||
- The directory is purely organizational (just groups related files)
|
||||
- Code is self-explanatory with good function/module docs
|
||||
- You'd be restating what CLAUDE.md index entries already convey
|
||||
|
||||
### Content Test
|
||||
|
||||
For each sentence in README.md, ask: "Could a developer learn this by reading the source files?"
|
||||
|
||||
- If YES: delete the sentence
|
||||
- If NO: keep it
|
||||
|
||||
README.md earns its tokens by providing INVISIBLE knowledge: the reasoning behind the code, not descriptions of the code.
|
||||
|
||||
### README.md Structure
|
||||
|
||||
```markdown
|
||||
# [Component Name]
|
||||
|
||||
## Overview
|
||||
|
||||
[One paragraph: what problem this solves, high-level approach]
|
||||
|
||||
## Architecture
|
||||
|
||||
[How sub-components interact; data flow; key abstractions]
|
||||
|
||||
## Design Decisions
|
||||
|
||||
[Tradeoffs made and why; alternatives considered]
|
||||
|
||||
## Invariants
|
||||
|
||||
[Rules that must be maintained; constraints not enforced by code]
|
||||
```
|
||||
|
||||
## Workflow
|
||||
|
||||
### Phase 1: Discovery
|
||||
|
||||
Map directories requiring CLAUDE.md verification:
|
||||
|
||||
```bash
|
||||
# Find all directories (excluding .git, node_modules, __pycache__, etc.)
|
||||
find . -type d \( -name .git -o -name node_modules -o -name __pycache__ -o -name .venv -o -name target -o -name dist -o -name build \) -prune -o -type d -print
|
||||
```
|
||||
|
||||
For each directory in scope, record:
|
||||
|
||||
1. Does CLAUDE.md exist?
|
||||
2. If yes, does it have the required table-based index structure?
|
||||
3. What files/subdirectories exist that need indexing?
|
||||
|
||||
### Phase 2: Audit
|
||||
|
||||
For each directory, check for drift and misplaced content:
|
||||
|
||||
```
|
||||
<audit_check dir="[path]">
|
||||
CLAUDE.md exists: [YES/NO]
|
||||
Has table-based index: [YES/NO]
|
||||
Files in directory: [list]
|
||||
Files in index: [list]
|
||||
Missing from index: [list]
|
||||
Stale in index (file deleted): [list]
|
||||
Triggers are task-oriented: [YES/NO/PARTIAL]
|
||||
Contains misplaced content: [YES/NO] (architecture/design docs that belong in README.md)
|
||||
README.md exists: [YES/NO]
|
||||
README.md warranted: [YES/NO] (invisible knowledge present?)
|
||||
</audit_check>
|
||||
```
|
||||
|
||||
### Phase 3: Content Migration
|
||||
|
||||
**Critical:** If CLAUDE.md contains content that does NOT belong there, migrate it:
|
||||
|
||||
Content that MUST be moved from CLAUDE.md to README.md:
|
||||
|
||||
- Architecture explanations or diagrams
|
||||
- Design decision documentation
|
||||
- Component interaction descriptions
|
||||
- Overview sections with prose (in subdirectory CLAUDE.md files)
|
||||
- Invariants or rules documentation
|
||||
- Any "why" explanations beyond simple triggers
|
||||
|
||||
Migration process:
|
||||
|
||||
1. Identify misplaced content in CLAUDE.md
|
||||
2. Create or update README.md with the architectural content
|
||||
3. Strip CLAUDE.md down to pure index format
|
||||
4. Add README.md to the CLAUDE.md index table
|
||||
|
||||
### Phase 4: Index Updates
|
||||
|
||||
For each directory needing work:
|
||||
|
||||
**Creating/Updating CLAUDE.md:**
|
||||
|
||||
1. Use the appropriate template (ROOT or SUBDIRECTORY)
|
||||
2. Populate tables with all files and subdirectories
|
||||
3. Write "What" column: factual content description
|
||||
4. Write "When to read" column: action-oriented triggers
|
||||
5. If README.md exists, include it in the Files table
|
||||
|
||||
**Creating README.md (only when warranted):**
|
||||
|
||||
1. Verify invisible knowledge criteria are met
|
||||
2. Document architecture, design decisions, invariants
|
||||
3. Apply the content test: remove anything visible from code
|
||||
4. Keep under ~500 tokens
|
||||
|
||||
### Phase 5: Verification

After all updates complete, verify:

1. Every directory in scope has CLAUDE.md
2. All CLAUDE.md files use table-based index format
3. No drift remains (files <-> index entries match)
4. No misplaced content in CLAUDE.md (architecture docs moved to README.md)
5. README.md files are indexed in their parent CLAUDE.md
6. Subdirectory CLAUDE.md files contain no prose/overview sections

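Drift detection (check 3) lends itself to automation. A minimal sketch, assuming each index row puts a backticked file or directory name in its first column; the helper names are illustrative, not part of this skill:

```python
import re
from pathlib import Path

def index_entries(claude_md: str) -> set[str]:
    """Collect backticked names from the first column of index table rows."""
    entries = set()
    for line in claude_md.splitlines():
        m = re.match(r"\|\s*`([^`]+)`\s*\|", line)
        if m:
            entries.add(m.group(1).rstrip("/"))
    return entries

def drift(directory: Path) -> tuple[set[str], set[str]]:
    """Return (missing_from_index, stale_in_index) for one directory."""
    indexed = index_entries((directory / "CLAUDE.md").read_text())
    actual = {p.name for p in directory.iterdir() if p.name != "CLAUDE.md"}
    return actual - indexed, indexed - actual
```

Running this per directory reproduces the `Missing from index` / `Stale in index` lines of the audit template.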
## Output Format

```
## Doc Sync Report

### Scope: [REPOSITORY-WIDE | directory path]

### Changes Made

- CREATED: [list of new CLAUDE.md files]
- UPDATED: [list of modified CLAUDE.md files]
- MIGRATED: [list of content moved from CLAUDE.md to README.md]
- CREATED: [list of new README.md files]
- FLAGGED: [any issues requiring human decision]

### Verification

- Directories audited: [count]
- CLAUDE.md coverage: [count]/[total] (100%)
- Drift detected: [count] entries fixed
- Content migrations: [count] (architecture docs moved to README.md)
- README.md files: [count] (only where warranted)
```

## Exclusions

DO NOT index:

- Generated files (dist/, build/, `*.generated.*`, compiled outputs)
- Vendored dependencies (node_modules/, vendor/, third_party/)
- Git internals (.git/)
- IDE/editor configs (.idea/, .vscode/ unless project-specific settings)

DO index:

- Hidden config files that affect development (.eslintrc, .env.example, .gitignore)
- Test files and test directories
- Documentation files (including README.md)

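A minimal sketch of how a traversal might apply these rules; the directory and glob lists mirror the bullets above, and `should_index` is an illustrative name:

```python
import fnmatch
from pathlib import Path

EXCLUDED_DIRS = {"dist", "build", "node_modules", "vendor", "third_party",
                 ".git", ".idea", ".vscode"}
EXCLUDED_GLOBS = ["*.generated.*"]

def should_index(path: Path) -> bool:
    """Apply the DO NOT rules; hidden config files like .eslintrc still pass."""
    # Any excluded ancestor directory disqualifies the whole path
    if any(part in EXCLUDED_DIRS for part in path.parts):
        return False
    # Generated-file globs match on the file name only
    return not any(fnmatch.fnmatch(path.name, g) for g in EXCLUDED_GLOBS)
```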
## Anti-Patterns

### Index Anti-Patterns

**Too vague (matches everything):**

```markdown
| `config/` | Configuration | Working with configuration |
```

**Content description instead of trigger:**

```markdown
| `cache.rs` | Contains the LRU cache implementation | - |
```

**Missing action verb:**

```markdown
| `parser.py` | Input parsing | Input parsing and format handling |
```

### Correct Examples

```markdown
| `cache.rs` | LRU cache with O(1) get/set | Implementing caching, debugging misses, tuning eviction |
| `config/` | YAML config parsing, env overrides | Adding config options, changing defaults, debugging config loading |
```

## When NOT to Use This Skill

- Single file documentation (inline comments, docstrings) - handle directly
- Code comments - handle directly
- Function/module docstrings - handle directly
- This skill is for CLAUDE.md/README.md synchronization specifically

## Reference

For additional trigger pattern examples, see `references/trigger-patterns.md`.
125
.claude/skills/doc-sync/references/trigger-patterns.md
Normal file
@@ -0,0 +1,125 @@
# Trigger Patterns Reference

Examples of well-formed triggers for CLAUDE.md index table entries.

## Column Formula

| File | What | When to read |
| ------------ | -------------------------------- | ------------------------------------- |
| `[filename]` | [noun-based content description] | [action verb] [specific context/task] |

## Action Verbs by Category

### Implementation Tasks

implementing, adding, creating, building, writing, extending

### Modification Tasks

modifying, updating, changing, refactoring, migrating

### Debugging Tasks

debugging, troubleshooting, investigating, diagnosing, fixing

### Understanding Tasks

understanding, learning, reviewing, analyzing, exploring

## Examples by File Type

### Source Code Files

| File | What | When to read |
| -------------- | ----------------------------------- | ---------------------------------------------------------------------------------- |
| `cache.rs` | LRU cache with O(1) operations | Implementing caching, debugging cache misses, modifying eviction policy |
| `auth.rs` | JWT validation, session management | Implementing login/logout, modifying token validation, debugging auth failures |
| `parser.py` | Input parsing, format detection | Modifying input parsing, adding new input formats, debugging parse errors |
| `validator.py` | Validation rules, constraint checks | Adding validation rules, modifying validation logic, understanding validation flow |

### Configuration Files

| File | What | When to read |
| -------------- | -------------------------------- | ----------------------------------------------------------------------------- |
| `config.toml` | Runtime config options, defaults | Adding new config options, modifying defaults, debugging configuration issues |
| `.env.example` | Environment variable template | Setting up development environment, adding new environment variables |
| `Cargo.toml` | Rust dependencies, build config | Adding dependencies, modifying build configuration, debugging build issues |

### Test Files

| File | What | When to read |
| -------------------- | --------------------------- | -------------------------------------------------------------------------------- |
| `test_cache.py` | Cache unit tests | Adding cache tests, debugging test failures, understanding cache behavior |
| `integration_tests/` | Cross-component test suites | Adding integration tests, debugging cross-component issues, validating workflows |

### Documentation Files

| File | What | When to read |
| ----------------- | ---------------------------------------- | ---------------------------------------------------------------------------------------- |
| `README.md` | Architecture, design decisions | Understanding architecture, design decisions, component relationships |
| `ARCHITECTURE.md` | System design, component boundaries | Understanding system design, component boundaries, data flow |
| `API.md` | Endpoint specs, request/response formats | Implementing API endpoints, understanding request/response formats, debugging API issues |

### Index Files (cross-cutting concerns)

| File | What | When to read |
| ------------------------- | ---------------------------------- | ------------------------------------------------------------------------------- |
| `error-handling-index.md` | Error handling patterns reference | Understanding error handling patterns, failure modes, error recovery strategies |
| `performance-index.md` | Performance optimization reference | Optimizing latency, throughput, resource usage, understanding cost models |
| `security-index.md` | Security patterns reference | Implementing authentication, encryption, threat mitigation, compliance features |

## Examples by Directory Type

### Feature Directories

| Directory | What | When to read |
| ---------- | --------------------------------------- | ------------------------------------------------------------------------------------- |
| `auth/` | Authentication, authorization, sessions | Implementing authentication, authorization, session management, debugging auth issues |
| `api/` | HTTP endpoints, request handling | Implementing endpoints, modifying request handling, debugging API responses |
| `storage/` | Persistence, data access layer | Implementing persistence, modifying data access, debugging storage issues |

### Layer Directories

| Directory | What | When to read |
| ----------- | ----------------------------- | -------------------------------------------------------------------------------- |
| `handlers/` | Request handlers, routing | Implementing request handlers, modifying routing, debugging request processing |
| `models/` | Data models, schemas | Adding data models, modifying schemas, understanding data structures |
| `services/` | Business logic, service layer | Implementing business logic, modifying service interactions, debugging workflows |

### Utility Directories

| Directory | What | When to read |
| ---------- | --------------------------------- | ---------------------------------------------------------------------------------- |
| `utils/` | Helper functions, common patterns | Needing helper functions, implementing common patterns, debugging utility behavior |
| `scripts/` | Maintenance tasks, automation | Running maintenance tasks, automating workflows, debugging script execution |
| `tools/` | Development tools, CLI utilities | Using development tools, implementing tooling, debugging tool behavior |

## Anti-Patterns

### Too Vague (matches everything)

| File | What | When to read |
| ---------- | ------------- | -------------------------- |
| `config/` | Configuration | Working with configuration |
| `utils.py` | Utilities | When you need utilities |

### Content Description Only (no trigger)

| File | What | When to read |
| ---------- | --------------------------------------------- | ------------ |
| `cache.rs` | Contains the LRU cache implementation | - |
| `auth.rs` | Authentication logic including JWT validation | - |

### Missing Action Verb

| File | What | When to read |
| -------------- | ---------------- | --------------------------------- |
| `parser.py` | Input parsing | Input parsing and format handling |
| `validator.py` | Validation rules | Validation rules and constraints |

## Trigger Guidelines

- Combine 2-4 triggers per entry using commas or "or"
- Use action verbs: implementing, debugging, modifying, adding, understanding
- Be specific: "debugging cache misses" not "debugging"
- If more than 4 triggers are needed, the file may be doing too much
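These guidelines can be checked mechanically. A minimal sketch, assuming triggers are comma-separated and should open with one of the action verbs listed earlier (a real linter would extend the verb set); the function name is illustrative:

```python
ACTION_VERBS = {
    "implementing", "adding", "creating", "building", "writing", "extending",
    "modifying", "updating", "changing", "refactoring", "migrating",
    "debugging", "troubleshooting", "investigating", "diagnosing", "fixing",
    "understanding", "learning", "reviewing", "analyzing", "exploring",
}

def lint_trigger_cell(cell: str) -> list[str]:
    """Return guideline violations for one 'When to read' cell."""
    problems = []
    triggers = [t.strip() for t in cell.split(",") if t.strip()]
    if not 2 <= len(triggers) <= 4:
        problems.append(f"{len(triggers)} triggers (want 2-4)")
    first_word = triggers[0].split()[0].lower() if triggers else ""
    if first_word not in ACTION_VERBS:
        problems.append(f"no leading action verb: {first_word!r}")
    return problems
```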
24
.claude/skills/incoherence/CLAUDE.md
Normal file
@@ -0,0 +1,24 @@
# skills/incoherence/

## Overview

Incoherence detection skill using parallel agents. IMMEDIATELY invoke the
script -- do NOT explore first.

## Index

| File/Directory | Contents | Read When |
| ------------------------ | ----------------- | ------------------ |
| `SKILL.md` | Invocation | Using this skill |
| `scripts/incoherence.py` | Complete workflow | Debugging behavior |

## Key Point

The script IS the workflow. Three phases:

- Detection (steps 1-12): Survey, explore, verify candidates
- Resolution (steps 13-15): Interactive AskUserQuestion prompts
- Application (steps 16-21): Apply changes, present final report

Resolution is interactive - user answers structured questions inline. No manual
file editing required.
37
.claude/skills/incoherence/SKILL.md
Normal file
@@ -0,0 +1,37 @@
---
name: incoherence
description: Detect and resolve incoherence in documentation, code, specs vs implementation.
---

# Incoherence Detector

When this skill activates, IMMEDIATELY invoke the script. The script IS the
workflow.

## Invocation

```bash
python3 scripts/incoherence.py \
  --step-number 1 \
  --total-steps 21 \
  --thoughts "<context>"
```

| Argument | Required | Description |
| --------------- | -------- | ----------------------------------------- |
| `--step-number` | Yes | Current step (1-21) |
| `--total-steps` | Yes | Always 21 |
| `--thoughts` | Yes | Accumulated state from all previous steps |

Do NOT explore or detect first. Run the script and follow its output.

## Workflow Phases

1. **Detection (steps 1-12)**: Survey codebase, explore dimensions, verify
   candidates
2. **Resolution (steps 13-15)**: Present issues via AskUserQuestion, collect
   user decisions
3. **Application (steps 16-21)**: Apply resolutions, present final report

Resolution is interactive - user answers structured questions inline. No manual
file editing required.
1234
.claude/skills/incoherence/scripts/incoherence.py
Executable file
File diff suppressed because it is too large
86
.claude/skills/planner/CLAUDE.md
Normal file
@@ -0,0 +1,86 @@
# skills/planner/

## Overview

Planning skill with resources that must stay synced with agent prompts.

## Index

| File/Directory | Contents | Read When |
| ------------------------------------- | ---------------------------------------------- | -------------------------------------------- |
| `SKILL.md` | Planning workflow, phases | Using the planner skill |
| `scripts/planner.py` | Step-by-step planning orchestration | Debugging planner behavior |
| `resources/plan-format.md` | Plan template (injected by script) | Editing plan structure |
| `resources/temporal-contamination.md` | Detection heuristic for contaminated comments | Updating TW/QR temporal contamination logic |
| `resources/diff-format.md` | Unified diff spec for code changes | Updating Developer diff consumption logic |
| `resources/default-conventions.md` | Default structural conventions (4-tier system) | Updating QR RULE 2 or planner decision audit |

## Resource Sync Requirements

Resources are **authoritative sources**.

- **SKILL.md** references resources directly (main Claude can read files)
- **Agent prompts** embed resources 1:1 (sub-agents cannot access files
  reliably)

### plan-format.md

Plan template injected by `scripts/planner.py` at planning phase completion.

**No agent sync required** - the script reads and outputs the format directly,
so editing this file takes effect immediately without updating any agent
prompts.

### temporal-contamination.md

Authoritative source for temporal contamination detection. Full content embedded
1:1.

| Synced To | Embedded Section |
| ---------------------------- | -------------------------- |
| `agents/technical-writer.md` | `<temporal_contamination>` |
| `agents/quality-reviewer.md` | `<temporal_contamination>` |

**When updating**: Modify `resources/temporal-contamination.md` first, then copy
the content into both `<temporal_contamination>` sections.

### diff-format.md

Authoritative source for the unified diff format. Full content embedded 1:1.

| Synced To | Embedded Section |
| --------------------- | ---------------- |
| `agents/developer.md` | `<diff_format>` |

**When updating**: Modify `resources/diff-format.md` first, then copy content
into the `<diff_format>` section.

### default-conventions.md

Authoritative source for default structural conventions (four-tier decision
backing system). Embedded 1:1 in QR for RULE 2 enforcement; referenced by
planner.py for decision audit.

| Synced To | Embedded Section |
| ---------------------------- | ----------------------- |
| `agents/quality-reviewer.md` | `<default_conventions>` |

**When updating**: Modify `resources/default-conventions.md` first, then copy
full content verbatim into the `<default_conventions>` section in QR.

## Sync Verification

After modifying a resource, verify sync:

```bash
# Check temporal-contamination.md references
grep -l "temporal.contamination\|four detection questions\|change-relative\|baseline reference" agents/*.md

# Check diff-format.md references
grep -l "context lines\|AUTHORITATIVE\|APPROXIMATE\|context anchor" agents/*.md

# Check default-conventions.md references
grep -l "default_conventions\|domain: god-object\|domain: test-organization" agents/*.md
```

If grep finds files not listed in the sync tables above, update this document.
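The grep checks only find references; they do not prove the embedded copy still matches the resource. A minimal sketch of a stricter 1:1 check, assuming embedded sections are wrapped in XML-style tags like `<diff_format>...</diff_format>`; the helper names are illustrative:

```python
import re
from pathlib import Path

def embedded_section(agent_md: str, tag: str) -> str:
    """Extract the body of <tag>...</tag> from an agent prompt."""
    m = re.search(rf"<{tag}>\n?(.*?)\n?</{tag}>", agent_md, re.DOTALL)
    if m is None:
        raise ValueError(f"missing <{tag}> section")
    return m.group(1).strip()

def in_sync(resource: Path, agent: Path, tag: str) -> bool:
    """True when the agent's embedded copy matches the resource verbatim."""
    return embedded_section(agent.read_text(), tag) == resource.read_text().strip()
```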
80
.claude/skills/planner/README.md
Normal file
@@ -0,0 +1,80 @@
# Planner

LLM-generated plans have gaps. I have seen missing error handling, vague
acceptance criteria, specs that nobody can implement. I built this skill with
two workflows -- planning and execution -- connected by quality gates that catch
these problems early.

## Planning Workflow

```
Planning ----+
   |         |
   v         |
  QR --------+  [fail: restart planning]
   |
   v
  TW --------+
   |         |
   v         |
QR-Docs -----+  [fail: restart TW]
   |
   v
APPROVED
```

| Step | Actions |
| ----------------------- | -------------------------------------------------------------------------- |
| Context & Scope | Confirm path, define scope, identify approaches, list constraints |
| Decision & Architecture | Evaluate approaches, select with reasoning, diagram, break into milestones |
| Refinement | Document risks, add uncertainty flags, specify paths and criteria |
| Final Verification | Verify completeness, check specs, write to file |
| QR-Completeness | Verify Decision Log complete, policy defaults confirmed, plan structure |
| QR-Code | Read codebase, verify diff context, apply RULE 0/1/2 to proposed code |
| Technical Writer | Scrub temporal comments, add WHY comments, enrich rationale |
| QR-Docs | Verify no temporal contamination, comments explain WHY not WHAT |

So, why all the feedback loops? QR-Completeness and QR-Code run before TW to
catch structural issues early. QR-Docs runs after TW to validate documentation
quality. Doc issues restart only TW; structure issues restart planning. The loop
runs until both pass.

## Execution Workflow

```
Plan --> Milestones --> QR --> Docs --> Retrospective
             ^          |
             +- [fail] -+

* Reconciliation phase precedes Milestones when resuming partial work
```

After planning completes and context clears (`/clear`), execution proceeds:

| Step | Purpose |
| ---------------------- | --------------------------------------------------------------- |
| Execution Planning | Analyze plan, detect reconciliation signals, output strategy |
| Reconciliation | (conditional) Validate existing code against plan |
| Milestone Execution | Delegate to agents, run tests; repeat until all complete |
| Post-Implementation QR | Quality review of implemented code |
| Issue Resolution | (conditional) Present issues, collect decisions, delegate fixes |
| Documentation | Technical writer updates CLAUDE.md/README.md |
| Retrospective | Present execution summary |

I designed the coordinator to never write code directly -- it delegates to
developers. Separating coordination from implementation produces cleaner
results. The coordinator:

- Parallelizes independent work across up to 4 developers per milestone
- Runs quality review after all milestones complete
- Loops through issue resolution until QR passes
- Invokes the technical writer only after QR passes

**Reconciliation** handles resume scenarios. When the user request contains
signals like "already implemented", "resume", or "partially complete", the
workflow validates existing code against plan requirements before executing the
remaining milestones. Building on unverified code means rework.

**Issue Resolution** presents each QR finding individually with options (Fix /
Skip / Alternative). Fixes delegate to developers or technical writers, then QR
runs again. This cycle repeats until QR passes.
59
.claude/skills/planner/SKILL.md
Normal file
@@ -0,0 +1,59 @@
---
name: planner
description: Interactive planning and execution for complex tasks. Use when user asks to use or invoke planner skill.
---

# Planner Skill

Two-phase workflow: **planning** (create plans) and **execution** (implement
plans).

## Invocation Routing

| User Intent | Script | Invocation |
| ------------------------------------------- | ----------- | ---------------------------------------------------------------------------------- |
| "plan", "design", "architect", "break down" | planner.py | `python3 scripts/planner.py --step-number 1 --total-steps 4 --thoughts "..."` |
| "review plan" (after plan written) | planner.py | `python3 scripts/planner.py --phase review --step-number 1 --total-steps 2 ...` |
| "execute", "implement", "run plan" | executor.py | `python3 scripts/executor.py --plan-file PATH --step-number 1 --total-steps 7 ...` |

Scripts inject step-specific guidance via JIT prompt injection. Invoke the
script and follow its REQUIRED ACTIONS output.

## When to Use

Use when task has:

- Multiple milestones with dependencies
- Architectural decisions requiring documentation
- Complexity benefiting from forced reflection pauses

Skip when task is:

- Single-step with obvious implementation
- Quick fix or minor change
- Already well-specified by user

## Resources

| Resource | Contents | Read When |
| ------------------------------------- | ------------------------------------------ | ----------------------------------------------- |
| `resources/diff-format.md` | Unified diff specification for plans | Writing code changes in milestones |
| `resources/temporal-contamination.md` | Comment hygiene detection heuristics | Writing comments in code snippets |
| `resources/default-conventions.md` | Priority hierarchy, structural conventions | Making decisions without explicit user guidance |
| `resources/plan-format.md` | Plan template structure | Completing planning phase (injected by script) |

**Resource loading rule**: Scripts will prompt you to read specific resources at
decision points. When prompted, read the full resource before proceeding.

## Workflow Summary

**Planning phase**: Steps 1-N explore context, evaluate approaches, refine
milestones. The final step writes the plan to file. A review phase (TW scrub ->
QR validation) follows.

**Execution phase**: 7 steps -- analyze plan, reconcile existing code, delegate
milestones to agents, QR validation, issue resolution, documentation,
retrospective.

All procedural details are injected by the scripts. Invoke the appropriate
script and follow its output.
156
.claude/skills/planner/resources/default-conventions.md
Normal file
@@ -0,0 +1,156 @@
# Default Conventions

These conventions apply when project documentation does not specify otherwise.

## MotoVaultPro Project Conventions

**Naming**:

- Database columns: snake_case (`user_id`, `created_at`)
- TypeScript types: camelCase (`userId`, `createdAt`)
- API responses: camelCase
- Files: kebab-case (`vehicle-repository.ts`)

**Architecture**:

- Feature capsules: `backend/src/features/{feature}/`
- Repository pattern with mapRow() for case conversion
- Single-tenant, user-scoped data

**Frontend**:

- Mobile + desktop validation required (320px, 768px, 1920px)
- Touch targets >= 44px
- No hover-only interactions

**Development**:

- Local node development (`npm install`, `npm run dev`, `npm test`)
- CI/CD pipeline validates containers and integration tests
- Plans stored in Gitea Issue comments

---

## Priority Hierarchy

Higher tiers override lower. Cite the backing source when auditing.

| Tier | Source | Action |
| ---- | --------------- | -------------------------------- |
| 1 | user-specified | Explicit user instruction: apply |
| 2 | doc-derived | CLAUDE.md / project docs: apply |
| 3 | default-derived | This document: apply |
| 4 | assumption | No backing: CONFIRM WITH USER |

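A minimal sketch of how a decision audit might apply this hierarchy; the data shapes and function name are illustrative, not part of planner.py:

```python
TIER_ORDER = ["user-specified", "doc-derived", "default-derived", "assumption"]

def audit_decision(backings: list[str]) -> str:
    """Pick the highest-priority backing; unbacked decisions must be confirmed."""
    for source in TIER_ORDER:
        if source in backings:
            # Tier 4 (assumption) never auto-applies
            return "CONFIRM WITH USER" if source == "assumption" else f"apply ({source})"
    return "CONFIRM WITH USER"
```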
## Severity Levels

| Level | Meaning | Action |
| ---------- | -------------------------------- | ------------------- |
| SHOULD_FIX | Likely to cause maintenance debt | Flag for fixing |
| SUGGESTION | Improvement opportunity | Note if time allows |

---

## Structural Conventions

<default-conventions domain="god-object">
**God Object**: >15 public methods OR >10 dependencies OR mixed concerns (networking + UI + data)
Severity: SHOULD_FIX
</default-conventions>

<default-conventions domain="god-function">
**God Function**: >50 lines OR multiple abstraction levels OR >3 nesting levels
Severity: SHOULD_FIX
Exception: Inherently sequential algorithms or state machines
</default-conventions>

<default-conventions domain="duplicate-logic">
**Duplicate Logic**: Copy-pasted blocks, repeated error handling, parallel near-identical functions
Severity: SHOULD_FIX
</default-conventions>

<default-conventions domain="dead-code">
**Dead Code**: No callers, impossible branches, unread variables, unused imports
Severity: SUGGESTION
</default-conventions>

<default-conventions domain="inconsistent-error-handling">
**Inconsistent Error Handling**: Mixed exceptions/error codes, inconsistent types, swallowed errors
Severity: SUGGESTION
Exception: Project specifies different handling per error category
</default-conventions>
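The god-function thresholds are mechanically checkable. A minimal sketch for Python sources using the stdlib `ast` module; it covers only the line-count and nesting thresholds, and the function name is illustrative:

```python
import ast

def god_function_flags(source: str) -> list[str]:
    """Flag functions exceeding the >50 line or >3 nesting level thresholds."""
    flags = []
    nesting_nodes = (ast.If, ast.For, ast.While, ast.With, ast.Try)

    def depth(node, current=0):
        # Deepest chain of nesting constructs under this node
        best = current
        for child in ast.iter_child_nodes(node):
            nxt = current + 1 if isinstance(child, nesting_nodes) else current
            best = max(best, depth(child, nxt))
        return best

    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            lines = node.end_lineno - node.lineno + 1
            if lines > 50:
                flags.append(f"{node.name}: {lines} lines (>50)")
            if depth(node) > 3:
                flags.append(f"{node.name}: nesting >3")
    return flags
```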
---
## File Organization Conventions

<default-conventions domain="test-organization">
**Test Organization**: Extend existing test files; create new only when:
- Distinct module boundary OR >500 lines OR different fixtures required
Severity: SHOULD_FIX (for unnecessary fragmentation)
</default-conventions>

<default-conventions domain="file-creation">
**File Creation**: Prefer extending existing files; create new only when:
- Clear module boundary OR >300-500 lines OR distinct responsibility
Severity: SUGGESTION
</default-conventions>

---

## Testing Conventions

<default-conventions domain="testing">
**Principle**: Test behavior, not implementation. Fast feedback.

**Test Type Hierarchy** (preference order):

1. **Integration tests** (highest value)
   - Test end-user verifiable behavior
   - Use real systems/dependencies (e.g., testcontainers)
   - Verify component interaction at boundaries
   - This is where the real value lies

2. **Property-based / generative tests** (preferred)
   - Cover a wide input space with invariant assertions
   - Catch edge cases humans miss
   - Use for functions with clear input/output contracts

3. **Unit tests** (use sparingly)
   - Only for highly complex or critical logic
   - Risk: maintenance liability, brittleness to refactoring
   - Prefer integration tests that cover the same behavior

**Test Placement**: Tests are part of implementation milestones, not separate
milestones. A milestone is not complete until its tests pass. This creates fast
feedback during development.

**DO**:

- Integration tests with real dependencies (testcontainers, etc.)
- Property-based tests for invariant-rich functions
- Parameterized fixtures over duplicate test bodies
- Test behavior observable by end users

**DON'T**:

- Test external library/dependency behavior (out of scope)
- Unit test simple code (maintenance liability exceeds value)
- Mock owned dependencies (use real implementations)
- Test implementation details that may change
- One-test-per-variant when parametrization applies

Severity: SHOULD_FIX (violations), SUGGESTION (missed opportunities)
</default-conventions>
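The "parameterized fixtures over duplicate test bodies" rule, sketched with stdlib `unittest`; the VIN validator and its rules are invented for illustration:

```python
import unittest

def is_valid_vin(vin: str) -> bool:
    """Hypothetical check: 17 chars, none of I/O/Q (real VIN rules are stricter)."""
    return len(vin) == 17 and not set(vin.upper()) & {"I", "O", "Q"}

class TestVin(unittest.TestCase):
    def test_is_valid_vin(self):
        # One parameterized loop replaces three near-identical test methods
        cases = [
            ("1HGCM82633A004352", True),   # well-formed
            ("1HGCM82633A00435", False),   # too short
            ("IHGCM82633A004352", False),  # forbidden letter
        ]
        for vin, expected in cases:
            with self.subTest(vin=vin):
                self.assertIs(is_valid_vin(vin), expected)
```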
---
## Modernization Conventions

<default-conventions domain="version-constraints">
**Version Constraint Violation**: Features unavailable in the project's documented target version
Requires: Documented target version
Severity: SHOULD_FIX
</default-conventions>

<default-conventions domain="modernization">
**Modernization Opportunity**: Legacy APIs, verbose patterns, manual stdlib reimplementations
Severity: SUGGESTION
Exception: Project requires legacy pattern
</default-conventions>
201
.claude/skills/planner/resources/diff-format.md
Normal file
@@ -0,0 +1,201 @@
# Unified Diff Format for Plan Code Changes

This document is the authoritative specification for code changes in implementation plans.

## Purpose

Unified diff format encodes both **location** and **content** in a single structure. This eliminates the need for location directives in comments (e.g., "insert at line 42") and provides reliable anchoring even when line numbers drift.

## Anatomy

```diff
--- a/path/to/file.py
+++ b/path/to/file.py
@@ -123,6 +123,15 @@ def existing_function(ctx):
     # Context lines (unchanged) serve as location anchors
     existing_code()
 
+    # NEW: Comments explain WHY - transcribed verbatim by Developer
+    # Guard against race condition when messages arrive out-of-order
+    new_code()
 
     # More context to anchor the insertion point
     more_existing_code()
```

## Components

| Component                                  | Authority                 | Purpose                                                    |
| ------------------------------------------ | ------------------------- | ---------------------------------------------------------- |
| File path (`--- a/path/to/file.py`)        | **AUTHORITATIVE**         | Exact target file                                          |
| Line numbers (`@@ -123,6 +123,15 @@`)      | **APPROXIMATE**           | May drift as earlier milestones modify the file            |
| Function context (`@@ ... @@ def func():`) | **SCOPE HINT**            | Function/method containing the change                      |
| Context lines (unchanged)                  | **AUTHORITATIVE ANCHORS** | Developer matches these patterns to locate insertion point |
| `+` lines                                  | **NEW CODE**              | Code to add, including WHY comments                        |
| `-` lines                                  | **REMOVED CODE**          | Code to delete                                             |
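The approximate line numbers and the scope hint can be read mechanically off the `@@` header. A minimal sketch (illustrative only; `parse_hunk_header` is not part of this skill's tooling):

```python
import re

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@ [scope hint]".
# Counts default to 1 when omitted, per standard unified diff format.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@(?: (.*))?$")

def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count, scope_hint) or None."""
    m = HUNK_RE.match(line)
    if m is None:
        return None
    old_start, old_count, new_start, new_count, scope = m.groups()
    return (int(old_start), int(old_count or 1),
            int(new_start), int(new_count or 1), scope)
```

Remember the table above: everything this function returns is approximate or a hint; only the context lines themselves are authoritative anchors.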
## Two-Layer Location Strategy

Code changes use two complementary layers for location:

1. **Prose scope hint** (optional): Natural language describing the conceptual location
2. **Diff with context**: Precise insertion point via context line matching

### Layer 1: Prose Scope Hints

For complex changes, add a prose description before the diff block:

````markdown
Add validation after input sanitization in `UserService.validate()`:

```diff
@@ -123,6 +123,15 @@ def validate(self, user):
     sanitized = sanitize(user.input)

+    # Validate format before proceeding
+    if not is_valid_format(sanitized):
+        raise ValidationError("Invalid format")
+
     return process(sanitized)
```
````

The prose tells Developer **where conceptually** (which method, what operation precedes it). The diff tells Developer **where exactly** (context lines to match).

**When to use prose hints:**

- Changes to large files (>300 lines)
- Multiple changes to the same file in one milestone
- Complex nested structures where function context alone is ambiguous
- When the surrounding code logic matters for understanding placement

**When prose is optional:**

- Small files with obvious structure
- Single change with unique context lines
- Function context in the `@@` line provides sufficient scope

### Layer 2: Function Context in the @@ Line

The `@@` line can include function/method context after the line numbers:

```diff
@@ -123,6 +123,15 @@ def validate(self, user):
```

This follows standard unified diff format (git generates this automatically). It tells Developer which function contains the change, aiding navigation even when line numbers drift.

## Why Context Lines Matter

When a plan has multiple milestones that modify the same file, earlier milestones shift line numbers. The `@@ -123` in Milestone 3 may no longer be accurate after Milestones 1 and 2 execute.

**Context lines solve this**: Developer searches for the unchanged context patterns in the actual file. These patterns are stable anchors that survive line number drift.

Include 2-3 context lines before and after changes for reliable matching.
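To make the anchoring idea concrete, here is a minimal sketch of context-line matching. The helper name `find_insertion_index` is hypothetical, not the Developer agent's actual mechanism: it locates the insertion point by searching for the hunk's unchanged lines rather than trusting the `@@` numbers.

```python
def find_insertion_index(file_lines, context_before):
    """Index at which new code would be inserted: just after the last
    matching context line. Returns -1 if the anchor is not found."""
    pattern = [c.strip() for c in context_before]
    n = len(pattern)
    for i in range(len(file_lines) - n + 1):
        # Whitespace-insensitive comparison: indentation may differ
        # between the diff and the file.
        if [l.strip() for l in file_lines[i:i + n]] == pattern:
            return i + n
    return -1

file_lines = [
    "def validate(self, user):",
    "    sanitized = sanitize(user.input)",
    "    return process(sanitized)",
]
# Even if the @@ line numbers have drifted, the context anchor still resolves.
idx = find_insertion_index(file_lines, ["sanitized = sanitize(user.input)"])
```

With 2-3 context lines on each side, the matched pattern is far less likely to be ambiguous within the file.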
## Comment Placement

Comments in `+` lines explain **WHY**, not **WHAT**. These comments:

- Are transcribed verbatim by Developer
- Source rationale from Planning Context (Decision Log, Rejected Alternatives)
- Use concrete terms without hidden baselines
- Must pass temporal contamination review (see `temporal-contamination.md`)

**Important**: Comments written during planning often contain temporal contamination -- change-relative language, baseline references, or location directives. @agent-technical-writer reviews and fixes these before @agent-developer transcribes them.

<example type="CORRECT" category="why_comment">

```diff
+    # Polling chosen over webhooks: 30% webhook delivery failures in third-party API
+    # WebSocket rejected to preserve stateless architecture
+    updates = poll_api(interval=30)
```

Explains WHY this approach was chosen.
</example>

<example type="INCORRECT" category="what_comment">

```diff
+    # Poll the API every 30 seconds
+    updates = poll_api(interval=30)
```

Restates WHAT the code does - redundant with the code itself.
</example>

<example type="INCORRECT" category="hidden_baseline">

```diff
+    # Generous timeout for slow networks
+    REQUEST_TIMEOUT = 60
```

"Generous" compared to what? A hidden baseline provides no actionable information.
</example>

<example type="CORRECT" category="concrete_justification">

```diff
+    # 60s accommodates 95th percentile upstream response times
+    REQUEST_TIMEOUT = 60
```

Concrete justification that explains why this specific value was chosen.
</example>
## Location Directives: Forbidden

The diff structure handles location. Location directives in comments are redundant and error-prone.

<example type="INCORRECT" category="location_directive">

```python
# Insert this BEFORE the retry loop (line 716)
# Timestamp guard: prevent older data from overwriting newer
get_ctx, get_cancel = context.with_timeout(ctx, 500)
```

Location directive leaked into the comment - line numbers become stale.
</example>

<example type="CORRECT" category="location_directive">

```diff
@@ -714,6 +714,10 @@ def put(self, ctx, tags):
     for tag in tags:
         subject = tag.subject

+    # Timestamp guard: prevent older data from overwriting newer
+    # due to network delays, retries, or concurrent writes
+    get_ctx, get_cancel = context.with_timeout(ctx, 500)

     # Retry loop for Put operations
     for attempt in range(max_retries):
```

Context lines (`for tag in tags`, `# Retry loop`) are stable anchors that survive line number drift.
</example>
## When to Use Diff Format

<diff_format_decision>

| Code Characteristic                                 | Use Diff? | Boundary Test                            |
| --------------------------------------------------- | --------- | ---------------------------------------- |
| Conditionals, loops, error handling, state machines | YES       | Has branching logic                      |
| Multiple insertions in the same file                | YES       | >1 change location                       |
| Deletions or replacements                           | YES       | Removing/changing existing code          |
| Pure assignment/return (CRUD, getters)              | NO        | Single statement, no branching           |
| Boilerplate from template                           | NO        | Developer can generate from pattern name |

The boundary test: "Does Developer need to see exact placement and context to implement correctly?"

- YES -> diff format
- NO (can implement from description alone) -> prose sufficient

</diff_format_decision>
## Validation Checklist

Before finalizing code changes in a plan:

- [ ] File path is exact (not "auth files" but `src/auth/handler.py`)
- [ ] Context lines exist in the target file (validate that patterns match actual code)
- [ ] Comments explain WHY, not WHAT
- [ ] No location directives in comments
- [ ] No hidden baselines (test: "[adjective] compared to what?")
- [ ] 2-3 context lines for reliable anchoring
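The second checklist item can be checked mechanically. A sketch of one way to do it (an assumed helper, not part of the skill's tooling): collect each hunk's unchanged context lines and report any that do not appear in the target file's text.

```python
def missing_context_lines(diff_text, file_text):
    """Context lines from a diff that do not appear anywhere in the target
    file text. A non-empty result means the anchors will not match."""
    present = {line.strip() for line in file_text.splitlines()}
    missing = []
    for line in diff_text.splitlines():
        # Skip file headers, hunk headers, added/removed lines, and blanks;
        # what remains are the unchanged context lines.
        if line.startswith(("--- ", "+++ ", "@@", "+", "-")) or not line.strip():
            continue
        if line.strip() not in present:
            missing.append(line.strip())
    return missing
```

This is deliberately loose (it ignores ordering and matches anywhere in the file); it catches the common failure where a planner invents context lines that never existed.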
250 .claude/skills/planner/resources/plan-format.md Normal file
@@ -0,0 +1,250 @@
# Plan Format

Write your plan using this structure:

```markdown
# [Plan Title]

## Overview

[Problem statement, chosen approach, and key decisions in 1-2 paragraphs]

## Planning Context

This section is consumed VERBATIM by downstream agents (Technical Writer,
Quality Reviewer). Quality matters: vague entries here produce poor annotations
and missed risks.

### Decision Log

| Decision           | Reasoning Chain                                              |
| ------------------ | ------------------------------------------------------------ |
| [What you decided] | [Multi-step reasoning: premise -> implication -> conclusion] |

Each rationale must contain at least 2 reasoning steps. Single-step rationales
are insufficient.

INSUFFICIENT: "Polling over webhooks | Webhooks are unreliable"
SUFFICIENT: "Polling over webhooks | Third-party API has 30% webhook delivery
failure in testing -> unreliable delivery would require fallback polling anyway
-> simpler to use polling as primary mechanism"

INSUFFICIENT: "500ms timeout | Matches upstream latency"
SUFFICIENT: "500ms timeout | Upstream 95th percentile is 450ms -> 500ms covers
95% of requests without timeout -> remaining 5% should fail fast rather than
queue"

Include BOTH architectural decisions AND implementation-level micro-decisions:

- Architectural: "Event sourcing over CRUD | Need audit trail + replay
  capability -> CRUD would require separate audit log -> event sourcing provides
  both natively"
- Implementation: "Mutex over channel | Single-writer case -> channel
  coordination adds complexity without benefit -> mutex is simpler with
  equivalent safety"

Technical Writer sources ALL code comments from this table. If a micro-decision
isn't here, TW cannot document it.
### Rejected Alternatives

| Alternative          | Why Rejected                                                        |
| -------------------- | ------------------------------------------------------------------- |
| [Approach not taken] | [Concrete reason: performance, complexity, doesn't fit constraints] |

Technical Writer uses this to add "why not X" context to code comments.

### Constraints & Assumptions

- [Technical: API limits, language version, existing patterns to follow]
- [Organizational: timeline, team expertise, approval requirements]
- [Dependencies: external services, libraries, data formats]
- [Default conventions applied: cite any `<default-conventions domain="...">` used]

### Known Risks

| Risk            | Mitigation                                    | Anchor                                     |
| --------------- | --------------------------------------------- | ------------------------------------------ |
| [Specific risk] | [Concrete mitigation or "Accepted: [reason]"] | [file:L###-L### if claiming code behavior] |

**Anchor requirement**: If a mitigation claims existing code behavior ("no change
needed", "already handles X"), cite the file:line + brief excerpt that proves
the claim. Skip anchors for hypothetical risks or external unknowns.

Quality Reviewer excludes these from findings but will challenge unverified
behavioral claims.
## Invisible Knowledge

This section captures knowledge NOT deducible from reading the code alone.
Technical Writer uses this for README.md documentation during
post-implementation.

**The test**: Would a new team member understand this from reading the source
files? If no, it belongs here.

**Categories** (not exhaustive -- apply the principle):

1. **Architectural decisions**: Component relationships, data flow, module boundaries
2. **Business rules**: Domain constraints that shape implementation choices
3. **System invariants**: Properties that must hold but are not enforced by types/compiler
4. **Historical context**: Why alternatives were rejected (links to Decision Log)
5. **Performance characteristics**: Non-obvious efficiency properties or requirements
6. **Tradeoffs**: Costs and benefits of chosen approaches
### Architecture

```
[ASCII diagram showing component relationships]

Example:

  User Request
       |
       v
  +----------+     +-------+
  |   Auth   |---->| Cache |
  +----------+     +-------+
       |
       v
  +----------+     +------+
  | Handler  |---->|  DB  |
  +----------+     +------+
```

### Data Flow

```
[How data moves through the system - inputs, transformations, outputs]

Example:

  HTTP Request --> Validate --> Transform --> Store --> Response
                                                |
                                                v
                                            Log (async)
```
### Why This Structure

[Reasoning behind module organization that isn't obvious from file names]

- Why these boundaries exist
- What would break if reorganized differently

### Invariants

[Rules that must be maintained but aren't enforced by code]

- Ordering requirements
- State consistency rules
- Implicit contracts between components

### Tradeoffs

[Key decisions with their costs and benefits]

- What was sacrificed for what gain
- Performance vs. readability choices
- Consistency vs. flexibility choices
## Milestones

### Milestone 1: [Name]

**Files**: [exact paths - e.g., src/auth/handler.py, not "auth files"]

**Flags** (if applicable): [needs TW rationale, needs error handling review, needs conformance check]

**Requirements**:

- [Specific: "Add retry with exponential backoff", not "improve error handling"]

**Acceptance Criteria**:

- [Testable: "Returns 429 after 3 failed attempts" - QR can verify pass/fail]
- [Avoid vague: "Works correctly" or "Handles errors properly"]

**Tests** (milestone not complete until tests pass):

- **Test files**: [exact paths, e.g., tests/test_retry.py]
- **Test type**: [integration | property-based | unit] - see default-conventions
- **Backing**: [user-specified | doc-derived | default-derived]
- **Scenarios**:
  - Normal: [e.g., "successful retry after transient failure"]
  - Edge: [e.g., "max retries exhausted", "zero delay"]
  - Error: [e.g., "non-retryable error returns immediately"]

Skip tests when: the user explicitly stated no tests, OR the milestone is
documentation-only, OR project docs prohibit tests for this component. State
the skip reason explicitly.

**Code Changes** (for non-trivial logic, use unified diff format):

See `resources/diff-format.md` for the specification.

```diff
--- a/path/to/file.py
+++ b/path/to/file.py
@@ -123,6 +123,15 @@ def existing_function(ctx):
     # Context lines (unchanged) serve as location anchors
     existing_code()

+    # WHY comment explaining rationale - transcribed verbatim by Developer
+    new_code()

     # More context to anchor the insertion point
     more_existing_code()
```
### Milestone N: ...

### Milestone [Last]: Documentation

**Files**:

- `path/to/CLAUDE.md` (index updates)
- `path/to/README.md` (if the Invisible Knowledge section has content)

**Requirements**:

- Update CLAUDE.md index entries for all new/modified files
- Each entry has WHAT (contents) and WHEN (task triggers)
- If the plan's Invisible Knowledge section is non-empty:
  - Create/update README.md with architecture diagrams from the plan
  - Include tradeoffs, invariants, "why this structure" content
  - Verify diagrams match the actual implementation

**Acceptance Criteria**:

- CLAUDE.md enables an LLM to locate relevant code for debugging/modification tasks
- README.md captures knowledge not discoverable from reading source files
- Architecture diagrams in README.md match the plan's Invisible Knowledge section

**Source Material**: `## Invisible Knowledge` section of this plan

### Cross-Milestone Integration Tests

When integration tests require components from multiple milestones:

1. Place integration tests in the LAST milestone that provides a required component
2. List dependencies explicitly in that milestone's **Tests** section
3. The integration test milestone is not complete until all dependencies are implemented

Example:

- M1: Auth handler (property tests for auth logic)
- M2: Database layer (property tests for queries)
- M3: API endpoint (integration tests covering M1 + M2 + M3 with testcontainers)

The integration tests in M3 verify the full flow that end users would exercise,
using real dependencies. This creates fast feedback as soon as all components
exist.

## Milestone Dependencies (if applicable)

```
M1 ---> M2
  \
   --> M3 --> M4
```

Independent milestones can execute in parallel during /plan-execution.
```
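A dependency graph like the one above can be turned into execution batches mechanically. A sketch (an assumed helper, not shipped with this skill) using Kahn-style topological layering:

```python
def parallel_batches(deps):
    """Group milestones into batches; every batch depends only on earlier
    batches, so members of one batch can execute in parallel."""
    remaining = {m: set(d) for m, d in deps.items()}
    batches = []
    while remaining:
        # A milestone is ready once all of its dependencies have executed.
        ready = sorted(m for m, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle in milestone graph")
        batches.append(ready)
        for m in ready:
            del remaining[m]
        for d in remaining.values():
            d.difference_update(ready)
    return batches

# The example graph: M1 -> M2, M1 -> M3, M3 -> M4
deps = {"M1": set(), "M2": {"M1"}, "M3": {"M1"}, "M4": {"M3"}}
batches = parallel_batches(deps)  # M2 and M3 land in the same batch
```

Each batch boundary corresponds to a sync point during /plan-execution.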
135 .claude/skills/planner/resources/temporal-contamination.md Normal file
@@ -0,0 +1,135 @@
# Temporal Contamination in Code Comments

This document defines terminology for identifying comments that leak information
about code history, change processes, or planning artifacts. Both
@agent-technical-writer and @agent-quality-reviewer reference this
specification.

## The Core Principle

> **Timeless Present Rule**: Comments must be written from the perspective of a
> reader encountering the code for the first time, with no knowledge of what
> came before or how it got here. The code simply _is_.

**Why this matters**: Change-narrative comments are an LLM artifact -- a
category error, not merely a style issue. The change process is ephemeral and
irrelevant to the code's ongoing existence. Humans writing comments naturally
describe what code IS, not what they DID to create it. A comment that narrates
the change that created it is fundamentally confused about what belongs in
documentation.

Think of it this way: a novel's narrator never describes the author's typing
process. Similarly, code comments should never describe the developer's editing
process. The code simply exists; the path to its existence is invisible.

In a plan, this means comments are written _as if the plan had already been
executed_.
## Detection Heuristic

Evaluate each comment against these five questions. Signal words are examples --
extrapolate to semantically similar constructs.

### 1. Does it describe an action taken rather than what exists?

**Category**: Change-relative

| Contaminated                           | Timeless Present                                            |
| -------------------------------------- | ----------------------------------------------------------- |
| `// Added mutex to fix race condition` | `// Mutex serializes cache access from concurrent requests` |
| `// New validation for the edge case`  | `// Rejects negative values (downstream assumes unsigned)`  |
| `// Changed to use batch API`          | `// Batch API reduces round-trips from N to 1`              |

Signal words (non-exhaustive): "Added", "Replaced", "Now uses", "Changed to",
"New", "Updated", "Refactored"

### 2. Does it compare to something not in the code?

**Category**: Baseline reference

| Contaminated                                      | Timeless Present                                                    |
| ------------------------------------------------- | ------------------------------------------------------------------- |
| `// Replaces per-tag logging with summary`        | `// Single summary line; per-tag logging would produce 1500+ lines` |
| `// Unlike the old approach, this is thread-safe` | `// Thread-safe: each goroutine gets independent state`             |
| `// Previously handled in caller`                 | `// Encapsulated here; caller should not manage lifecycle`          |

Signal words (non-exhaustive): "Instead of", "Rather than", "Previously",
"Replaces", "Unlike the old", "No longer"

### 3. Does it describe where to put code rather than what code does?

**Category**: Location directive

| Contaminated                  | Timeless Present                              |
| ----------------------------- | --------------------------------------------- |
| `// After the SendAsync call` | _(delete -- diff structure encodes location)_ |
| `// Insert before validation` | _(delete -- diff structure encodes location)_ |
| `// Add this at line 425`     | _(delete -- diff structure encodes location)_ |

Signal words (non-exhaustive): "After", "Before", "Insert", "At line", "Here:",
"Below", "Above"

**Action**: Always delete. Location is encoded in the diff structure, not in comments.
### 4. Does it describe intent rather than behavior?

**Category**: Planning artifact

| Contaminated                           | Timeless Present                                         |
| -------------------------------------- | -------------------------------------------------------- |
| `// TODO: add retry logic later`       | _(delete, or implement retry now)_                       |
| `// Will be extended for batch mode`   | _(delete -- do not document hypothetical futures)_       |
| `// Temporary workaround until API v2` | `// API v1 lacks filtering; client-side filter required` |

Signal words (non-exhaustive): "Will", "TODO", "Planned", "Eventually", "For
future", "Temporary", "Workaround until"

**Action**: Delete, implement the feature, or reframe as a current constraint.

### 5. Does it describe the author's choice rather than code behavior?

**Category**: Intent leakage

| Contaminated                               | Timeless Present                                     |
| ------------------------------------------ | ---------------------------------------------------- |
| `// Intentionally placed after validation` | `// Runs after validation completes`                 |
| `// Deliberately using mutex over channel` | `// Mutex serializes access (single-writer pattern)` |
| `// Chose polling for reliability`         | `// Polling: 30% webhook delivery failures observed` |
| `// We decided to cache at this layer`     | `// Cache here: reduces DB round-trips for hot path` |

Signal words (non-exhaustive): "intentionally", "deliberately", "chose",
"decided", "on purpose", "by design", "we opted"

**Action**: Extract the technical justification; discard the decision narrative.
The reader doesn't need to know someone "decided" -- they need to know WHY this
approach works.

**The test**: Can you delete the intent word and have the comment still make
sense? If yes, delete the intent word. If no, reframe around the technical
reason.

---

**Catch-all**: If a comment only makes sense to someone who knows the code's
history, it is temporally contaminated -- even if it does not match any category
above.
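A first-pass scanner for the five categories can be sketched as below, using a subset of the signal words listed above. This is a triage aid only: keyword matching cannot render a verdict (e.g. "Now blocks until connection ready" is clean), so matches are candidates for review, not automatic rejections.

```python
import re

# Subset of the signal words from the five questions above; deliberately
# non-exhaustive, and every hit still needs semantic judgment.
SIGNALS = {
    "change-relative": r"\b(added|replaced|now uses|changed to|new|updated|refactored)\b",
    "baseline": r"\b(instead of|rather than|previously|replaces|unlike the old|no longer)\b",
    "location": r"\b(insert|at line|below|above)\b",
    "planning": r"\b(will|todo|planned|eventually|for future|temporary|workaround until)\b",
    "intent": r"\b(intentionally|deliberately|chose|decided|on purpose|by design|we opted)\b",
}

def flag_comment(comment):
    """Categories whose signal words appear in the comment (candidates only)."""
    text = comment.lower()
    return [cat for cat, pattern in SIGNALS.items() if re.search(pattern, text)]
```

An empty result does not prove a comment is clean; see the catch-all rule above.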
## Subtle Cases

Same word, different verdict -- demonstrating that detection requires semantic
judgment, not keyword matching.

| Comment                                | Verdict      | Reasoning                                        |
| -------------------------------------- | ------------ | ------------------------------------------------ |
| `// Now handles edge cases properly`   | Contaminated | "properly" implies it was improper before        |
| `// Now blocks until connection ready` | Clean        | "now" describes runtime moment, not code history |
| `// Fixed the null pointer issue`      | Contaminated | Describes a fix, not behavior                    |
| `// Returns null when key not found`   | Clean        | Describes behavior                               |

## The Transformation Pattern

> **Extract the technical justification, discard the change narrative.**

1. What useful information is buried? (problem, behavior)
2. Reframe it in the timeless present.

Example: "Added mutex to fix race" -> "Mutex serializes concurrent access"
682 .claude/skills/planner/scripts/executor.py Normal file
@@ -0,0 +1,682 @@
#!/usr/bin/env python3
"""
Plan Executor - Execute approved plans through delegation.

Seven-phase execution workflow with JIT prompt injection:
  Step 1: Execution Planning (analyze plan, detect reconciliation)
  Step 2: Reconciliation (conditional, validate existing code)
  Step 3: Milestone Execution (delegate to agents, run tests)
  Step 4: Post-Implementation QR (quality review)
  Step 5: QR Issue Resolution (conditional, fix issues)
  Step 6: Documentation (TW pass)
  Step 7: Retrospective (present summary)

Usage:
    python3 executor.py --plan-file PATH --step-number 1 --total-steps 7 --thoughts "..."
"""

import argparse
import re
import sys


def detect_reconciliation_signals(thoughts: str) -> bool:
    """Check if the user's thoughts contain reconciliation triggers."""
    triggers = [
        r"\balready\s+(implemented|done|complete)",
        r"\bpartially\s+complete",
        r"\bhalfway\s+done",
        r"\bresume\b",
        r"\bcontinue\s+from\b",
        r"\bpick\s+up\s+where\b",
        r"\bcheck\s+what'?s\s+done\b",
        r"\bverify\s+existing\b",
        r"\bprior\s+work\b",
    ]
    thoughts_lower = thoughts.lower()
    return any(re.search(pattern, thoughts_lower) for pattern in triggers)
def get_step_1_guidance(plan_file: str, thoughts: str) -> dict:
    """Step 1: Execution Planning - analyze plan, detect reconciliation."""
    reconciliation_detected = detect_reconciliation_signals(thoughts)

    actions = [
        "EXECUTION PLANNING",
        "",
        f"Plan file: {plan_file}",
        "",
        "Read the plan file and analyze:",
        "  1. Count milestones and their dependencies",
        "  2. Identify file targets per milestone",
        "  3. Determine parallelization opportunities",
        "  4. Set up TodoWrite tracking for all milestones",
        "",
        "<execution_rules>",
        "",
        "RULE 0 (ABSOLUTE): Delegate ALL code work to specialized agents",
        "",
        "Your role: coordinate, validate, orchestrate. Agents implement code.",
        "",
        "Delegation routing:",
        "  - New function needed -> @agent-developer",
        "  - Bug to fix -> @agent-debugger (diagnose) then @agent-developer (fix)",
        "  - Any source file modification -> @agent-developer",
        "  - Documentation files -> @agent-technical-writer",
        "",
        "Exception (trivial only): Fixes under 5 lines where delegation overhead",
        "exceeds fix complexity (missing import, typo correction).",
        "",
        "---",
        "",
        "RULE 1: Execution Protocol",
        "",
        "Before ANY phase:",
        "  1. Use TodoWrite to track all plan phases",
        "  2. Analyze dependencies to identify parallelizable work",
        "  3. Delegate implementation to specialized agents",
        "  4. Validate each increment before proceeding",
        "",
        "You plan HOW to execute (parallelization, sequencing). You do NOT plan",
        "WHAT to execute -- that's the plan's job.",
        "",
        "---",
        "",
        "RULE 1.5: Model Selection",
        "",
        "Agent defaults (sonnet) are calibrated for quality. Adjust upward only.",
        "",
        "  | Action               | Allowed | Rationale                        |",
        "  |----------------------|---------|----------------------------------|",
        "  | Upgrade to opus      | YES     | Challenging tasks need reasoning |",
        "  | Use default (sonnet) | YES     | Baseline for all delegations     |",
        "  | Keep at sonnet+      | ALWAYS  | Maintains quality baseline       |",
        "",
        "</execution_rules>",
        "",
        "<dependency_analysis>",
        "",
        "Parallelizable when ALL conditions met:",
        "  - Different target files",
        "  - No data dependencies",
        "  - No shared state (globals, configs, resources)",
        "",
        "Sequential when ANY condition true:",
        "  - Same file modified by multiple tasks",
        "  - Task B imports or depends on Task A's output",
        "  - Shared database tables or external resources",
        "",
        "Before delegating ANY batch:",
        "  1. List tasks with their target files",
        "  2. Identify file dependencies (same file = sequential)",
        "  3. Identify data dependencies (imports = sequential)",
        "  4. Group independent tasks into parallel batches",
        "  5. Separate batches with sync points",
        "",
        "</dependency_analysis>",
        "",
        "<milestone_type_detection>",
        "",
        "Before delegating ANY milestone, identify its type from file extensions:",
        "",
        "  | Milestone Type | Recognition Signal          | Delegate To             |",
        "  |----------------|-----------------------------|-------------------------|",
        "  | Documentation  | ALL files are *.md or *.rst | @agent-technical-writer |",
        "  | Code           | ANY file is source code     | @agent-developer        |",
        "",
        "Mixed milestones: Split delegation -- @agent-developer first (code),",
        "then @agent-technical-writer (docs) after code completes.",
        "",
        "</milestone_type_detection>",
        "",
        "<delegation_format>",
        "",
        "EVERY delegation MUST use this structure:",
        "",
        "  <delegation>",
        "    <agent>@agent-[developer|debugger|technical-writer|quality-reviewer]</agent>",
        "    <mode>[For TW/QR: plan-scrub|post-implementation|plan-review|reconciliation]</mode>",
        "    <plan_source>[Absolute path to plan file]</plan_source>",
        "    <milestone>[Milestone number and name]</milestone>",
        "    <files>[Exact file paths from milestone]</files>",
        "    <task>[Specific task description]</task>",
        "    <acceptance_criteria>",
        "    - [Criterion 1 from plan]",
        "    - [Criterion 2 from plan]",
        "    </acceptance_criteria>",
        "  </delegation>",
        "",
        "For parallel delegations, wrap multiple blocks:",
        "",
        "  <parallel_batch>",
        "    <rationale>[Why these can run in parallel]</rationale>",
        "    <sync_point>[Command to run after all complete]</sync_point>",
        "    <delegation>...</delegation>",
        "    <delegation>...</delegation>",
        "  </parallel_batch>",
        "",
        "Agent limits:",
        "  - @agent-developer: Maximum 4 parallel",
        "  - @agent-debugger: Maximum 2 parallel",
        "  - @agent-quality-reviewer: ALWAYS sequential",
        "  - @agent-technical-writer: Can parallel across independent modules",
        "",
        "</delegation_format>",
    ]

    if reconciliation_detected:
        next_step = (
            "RECONCILIATION SIGNALS DETECTED in your thoughts.\n\n"
            "Invoke step 2 to validate existing code against plan requirements:\n"
            f'  python3 executor.py --plan-file "{plan_file}" --step-number 2 '
            '--total-steps 7 --thoughts "Starting reconciliation..."'
        )
    else:
        next_step = (
            "No reconciliation signals detected. Proceed to milestone execution.\n\n"
            "Invoke step 3 to begin delegating milestones:\n"
            f'  python3 executor.py --plan-file "{plan_file}" --step-number 3 '
            '--total-steps 7 --thoughts "Analyzed plan: N milestones, '
            'parallel batches: [describe], starting execution..."'
        )

    return {
        "actions": actions,
        "next": next_step,
    }
def get_step_2_guidance(plan_file: str) -> dict:
    """Step 2: Reconciliation - validate existing code against plan."""
    return {
        "actions": [
            "RECONCILIATION PHASE",
            "",
            f"Plan file: {plan_file}",
            "",
            "Validate existing code against plan requirements BEFORE executing.",
            "",
            "<reconciliation_protocol>",
            "",
            "Delegate to @agent-quality-reviewer for each milestone:",
            "",
            " Task for @agent-quality-reviewer:",
            " Mode: reconciliation",
            " Plan Source: [plan_file.md]",
            " Milestone: [N]",
            "",
            " Check if the acceptance criteria for Milestone [N] are ALREADY",
            " satisfied in the current codebase. Validate REQUIREMENTS, not just",
            " code presence.",
            "",
            " Return: SATISFIED | NOT_SATISFIED | PARTIALLY_SATISFIED",
            "",
            "---",
            "",
            "Execution based on reconciliation result:",
            "",
            " | Result | Action |",
            " |---------------------|-------------------------------------------|",
            " | SATISFIED | Skip execution, record as already complete|",
            " | NOT_SATISFIED | Execute milestone normally |",
            " | PARTIALLY_SATISFIED | Execute only the missing parts |",
            "",
            "---",
            "",
            "Why requirements-based (not diff-based):",
            "",
            "Checking if code from the diff exists misses critical cases:",
            " - Code added but incorrect (doesn't meet acceptance criteria)",
            " - Code added but incomplete (partial implementation)",
            " - Requirements met by different code than planned (valid alternative)",
            "",
            "Checking acceptance criteria catches all of these.",
            "",
            "</reconciliation_protocol>",
        ],
        "next": (
            "After collecting reconciliation results for all milestones, "
            "invoke step 3:\n\n"
            f' python3 executor.py --plan-file "{plan_file}" --step-number 3 '
            "--total-steps 7 --thoughts \"Reconciliation complete: "
            'M1: SATISFIED, M2: NOT_SATISFIED, ..."'
        ),
    }


def get_step_3_guidance(plan_file: str) -> dict:
    """Step 3: Milestone Execution - delegate to agents, run tests."""
    return {
        "actions": [
            "MILESTONE EXECUTION",
            "",
            f"Plan file: {plan_file}",
            "",
            "Execute milestones through delegation. Parallelize independent work.",
            "",
            "<diff_compliance_validation>",
            "",
            "BEFORE delegating each milestone with code changes:",
            " 1. Read resources/diff-format.md if not already in context",
            " 2. Verify plan's diffs meet specification:",
            " - Context lines are VERBATIM from actual files (not placeholders)",
            " - WHY comments explain rationale (not WHAT code does)",
            " - No location directives in comments",
            "",
            "AFTER @agent-developer completes, verify:",
            " - Context lines from plan were found in target file",
            " - WHY comments were transcribed verbatim to code",
            " - No location directives remain in implemented code",
            " - No temporal contamination leaked (change-relative language)",
            "",
            "If Developer reports context lines not found, check drift table below.",
            "",
            "</diff_compliance_validation>",
            "",
            "<error_handling>",
            "",
            "Error classification:",
            "",
            " | Severity | Signals | Action |",
            " |----------|----------------------------------|-------------------------|",
            " | Critical | Segfault, data corruption | STOP, @agent-debugger |",
            " | High | Test failures, missing deps | @agent-debugger |",
            " | Medium | Type errors, lint failures | Auto-fix, then debugger |",
            " | Low | Warnings, style issues | Note and continue |",
            "",
            "Escalation triggers -- STOP and report when:",
            " - Fix would change fundamental approach",
            " - Three attempted solutions failed",
            " - Performance or safety characteristics affected",
            " - Confidence < 80%",
            "",
            "Context anchor mismatch protocol:",
            "",
            "When @agent-developer reports context lines don't match actual code:",
            "",
            " | Mismatch Type | Action |",
            " |-----------------------------|--------------------------------|",
            " | Whitespace/formatting only | Proceed with normalized match |",
            " | Minor variable rename | Proceed, note in execution log |",
            " | Code restructured | Proceed, note deviation |",
            " | Context lines not found | STOP - escalate to planner |",
            " | Logic fundamentally changed | STOP - escalate to planner |",
            "",
            "</error_handling>",
            "",
            "<acceptance_testing>",
            "",
            "Run after each milestone:",
            "",
            " # Python",
            " pytest --strict-markers --strict-config",
            " mypy --strict",
            "",
            " # JavaScript/TypeScript",
            " tsc --strict --noImplicitAny",
            " eslint --max-warnings=0",
            "",
            " # Go",
            " go test -race -cover -vet=all",
            "",
            "Pass criteria: 100% tests pass, zero linter warnings.",
            "",
            "Self-consistency check (for milestones with >3 files):",
            " 1. Developer's implementation notes claim: [what was implemented]",
            " 2. Test results demonstrate: [what behavior was verified]",
            " 3. Acceptance criteria state: [what was required]",
            "",
            "All three must align. Discrepancy = investigate before proceeding.",
            "",
            "</acceptance_testing>",
        ],
        "next": (
            "CONTINUE in step 3 until ALL milestones complete:\n"
            f' python3 executor.py --plan-file "{plan_file}" --step-number 3 '
            '--total-steps 7 --thoughts "Completed M1, M2. Executing M3..."'
            "\n\n"
            "When ALL milestones are complete, invoke step 4 for quality review:\n"
            f' python3 executor.py --plan-file "{plan_file}" --step-number 4 '
            '--total-steps 7 --thoughts "All milestones complete. '
            'Modified files: [list]. Ready for QR."'
        ),
    }


def get_step_4_guidance(plan_file: str) -> dict:
    """Step 4: Post-Implementation QR - quality review."""
    return {
        "actions": [
            "POST-IMPLEMENTATION QUALITY REVIEW",
            "",
            f"Plan file: {plan_file}",
            "",
            "Delegate to @agent-quality-reviewer for comprehensive review.",
            "",
            "<qr_delegation>",
            "",
            " Task for @agent-quality-reviewer:",
            " Mode: post-implementation",
            " Plan Source: [plan_file.md]",
            " Files Modified: [list]",
            " Reconciled Milestones: [list milestones that were SATISFIED]",
            "",
            " Priority order for findings:",
            " 1. Issues in reconciled milestones (bypassed execution validation)",
            " 2. Issues in newly implemented milestones",
            " 3. Cross-cutting issues",
            "",
            " Checklist:",
            " - Every requirement implemented",
            " - No unauthorized deviations",
            " - Edge cases handled",
            " - Performance requirements met",
            "",
            "</qr_delegation>",
            "",
            "Expected output: PASS or issues list sorted by severity.",
        ],
        "next": (
            "After QR completes:\n\n"
            "If QR returns ISSUES -> invoke step 5:\n"
            f' python3 executor.py --plan-file "{plan_file}" --step-number 5 '
            '--total-steps 7 --thoughts "QR found N issues: [summary]"'
            "\n\n"
            "If QR returns PASS -> invoke step 6:\n"
            f' python3 executor.py --plan-file "{plan_file}" --step-number 6 '
            '--total-steps 7 --thoughts "QR passed. Proceeding to documentation."'
        ),
    }


def get_step_5_guidance(plan_file: str) -> dict:
    """Step 5: QR Issue Resolution - present issues, collect decisions, fix."""
    return {
        "actions": [
            "QR ISSUE RESOLUTION",
            "",
            f"Plan file: {plan_file}",
            "",
            "Present issues to user, collect decisions, delegate fixes.",
            "",
            "<issue_resolution_protocol>",
            "",
            "Phase 1: Collect Decisions",
            "",
            "Sort findings by severity (critical -> high -> medium -> low).",
            "For EACH issue, present:",
            "",
            " ## Issue [N] of [Total] ([severity])",
            "",
            " **Category**: [production-reliability | project-conformance | structural-quality]",
            " **File**: [affected file path]",
            " **Location**: [function/line if applicable]",
            "",
            " **Problem**:",
            " [Clear description of what is wrong and why it matters]",
            "",
            " **Evidence**:",
            " [Specific code/behavior that demonstrates the issue]",
            "",
            "Then use AskUserQuestion with options:",
            " - **Fix**: Delegate to @agent-developer to resolve",
            " - **Skip**: Accept the issue as-is",
            " - **Alternative**: User provides different approach",
            "",
            "Repeat for each issue. Do NOT execute any fixes during this phase.",
            "",
            "---",
            "",
            "Phase 2: Execute Decisions",
            "",
            "After ALL decisions are collected:",
            "",
            " 1. Summarize the decisions",
            " 2. Execute fixes:",
            " - 'Fix' decisions: Delegate to @agent-developer",
            " - 'Skip' decisions: Record in retrospective as accepted risk",
            " - 'Alternative' decisions: Apply user's specified approach",
            " 3. Parallelize where possible (different files, no dependencies)",
            "",
            "</issue_resolution_protocol>",
        ],
        "next": (
            "After ALL fixes are applied, return to step 4 for re-validation:\n\n"
            f' python3 executor.py --plan-file "{plan_file}" --step-number 4 '
            '--total-steps 7 --thoughts "Applied fixes for issues X, Y, Z. '
            'Re-running QR."'
            "\n\n"
            "This creates a validation loop until QR passes."
        ),
    }


def get_step_6_guidance(plan_file: str) -> dict:
    """Step 6: Documentation - TW pass for CLAUDE.md, README.md."""
    return {
        "actions": [
            "POST-IMPLEMENTATION DOCUMENTATION",
            "",
            f"Plan file: {plan_file}",
            "",
            "Delegate to @agent-technical-writer for documentation updates.",
            "",
            "<tw_delegation>",
            "",
            "Skip condition: If ALL milestones contained only documentation files",
            "(*.md/*.rst), TW already handled this during milestone execution.",
            "Proceed directly to step 7.",
            "",
            "For code-primary plans:",
            "",
            " Task for @agent-technical-writer:",
            " Mode: post-implementation",
            " Plan Source: [plan_file.md]",
            " Files Modified: [list]",
            "",
            " Requirements:",
            " - Create/update CLAUDE.md index entries",
            " - Create README.md if architectural complexity warrants",
            " - Add module-level docstrings where missing",
            " - Verify transcribed comments are accurate",
            "",
            "</tw_delegation>",
            "",
            "<final_checklist>",
            "",
            "Execution is NOT complete until:",
            " - [ ] All todos completed",
            " - [ ] Quality review passed (no unresolved issues)",
            " - [ ] Documentation delegated for ALL modified files",
            " - [ ] Documentation tasks completed",
            " - [ ] Self-consistency checks passed for complex milestones",
            "",
            "</final_checklist>",
        ],
        "next": (
            "After documentation is complete, invoke step 7 for retrospective:\n\n"
            f' python3 executor.py --plan-file "{plan_file}" --step-number 7 '
            '--total-steps 7 --thoughts "Documentation complete. '
            'Generating retrospective."'
        ),
    }


def get_step_7_guidance(plan_file: str) -> dict:
    """Step 7: Retrospective - present execution summary."""
    return {
        "actions": [
            "EXECUTION RETROSPECTIVE",
            "",
            f"Plan file: {plan_file}",
            "",
            "Generate and PRESENT the retrospective to the user.",
            "Do NOT write to a file -- present it directly so the user sees it.",
            "",
            "<retrospective_format>",
            "",
            "================================================================================",
            "EXECUTION RETROSPECTIVE",
            "================================================================================",
            "",
            "Plan: [plan file path]",
            "Status: COMPLETED | BLOCKED | ABORTED",
            "",
            "## Milestone Outcomes",
            "",
            "| Milestone | Status | Notes |",
            "| ---------- | -------------------- | ---------------------------------- |",
            "| 1: [name] | EXECUTED | - |",
            "| 2: [name] | SKIPPED (RECONCILED) | Already satisfied before execution |",
            "| 3: [name] | BLOCKED | [reason] |",
            "",
            "## Reconciliation Summary",
            "",
            "If reconciliation was run:",
            " - Milestones already complete: [count]",
            " - Milestones executed: [count]",
            " - Milestones with partial work detected: [count]",
            "",
            "If reconciliation was skipped:",
            ' - "Reconciliation skipped (no prior work indicated)"',
            "",
            "## Plan Accuracy Issues",
            "",
            "[List any problems with the plan discovered during execution]",
            " - [file] Context anchor drift: expected X, found Y",
            " - Milestone [N] requirements were ambiguous: [what]",
            " - Missing dependency: [what was assumed but didn't exist]",
            "",
            'If none: "No plan accuracy issues encountered."',
            "",
            "## Deviations from Plan",
            "",
            "| Deviation | Category | Approved By |",
            "| -------------- | --------------- | ---------------- |",
            "| [what changed] | Trivial / Minor | [who or 'auto'] |",
            "",
            'If none: "No deviations from plan."',
            "",
            "## Quality Review Summary",
            "",
            " - Production reliability: [count] issues",
            " - Project conformance: [count] issues",
            " - Structural quality: [count] suggestions",
            "",
            "## Feedback for Future Plans",
            "",
            "[Actionable improvements based on execution experience]",
            " - [ ] [specific suggestion]",
            " - [ ] [specific suggestion]",
            "",
            "================================================================================",
            "",
            "</retrospective_format>",
        ],
        "next": "EXECUTION COMPLETE.\n\nPresent the retrospective to the user.",
    }


def get_step_guidance(step_number: int, plan_file: str, thoughts: str) -> dict:
    """Route to appropriate step guidance."""
    if step_number == 1:
        return get_step_1_guidance(plan_file, thoughts)
    elif step_number == 2:
        return get_step_2_guidance(plan_file)
    elif step_number == 3:
        return get_step_3_guidance(plan_file)
    elif step_number == 4:
        return get_step_4_guidance(plan_file)
    elif step_number == 5:
        return get_step_5_guidance(plan_file)
    elif step_number == 6:
        return get_step_6_guidance(plan_file)
    elif step_number == 7:
        return get_step_7_guidance(plan_file)
    else:
        return {
            "actions": [f"Unknown step {step_number}. Valid steps are 1-7."],
            "next": "Re-invoke with a valid step number.",
        }


def main():
    parser = argparse.ArgumentParser(
        description="Plan Executor - Execute approved plans through delegation",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Start execution
  python3 executor.py --plan-file plans/auth.md --step-number 1 --total-steps 7 \\
      --thoughts "Execute the auth implementation plan"

  # Continue milestone execution
  python3 executor.py --plan-file plans/auth.md --step-number 3 --total-steps 7 \\
      --thoughts "Completed M1, M2. Executing M3..."

  # After QR finds issues
  python3 executor.py --plan-file plans/auth.md --step-number 5 --total-steps 7 \\
      --thoughts "QR found 2 issues: missing error handling, incorrect return type"
""",
    )

    parser.add_argument(
        "--plan-file", type=str, required=True, help="Path to the plan file to execute"
    )
    parser.add_argument("--step-number", type=int, required=True, help="Current step (1-7)")
    parser.add_argument(
        "--total-steps", type=int, required=True, help="Total steps (always 7)"
    )
    parser.add_argument(
        "--thoughts", type=str, required=True, help="Your current thinking and status"
    )

    args = parser.parse_args()

    if args.step_number < 1 or args.step_number > 7:
        print("Error: step-number must be between 1 and 7", file=sys.stderr)
        sys.exit(1)

    if args.total_steps != 7:
        print("Warning: total-steps should be 7 for executor", file=sys.stderr)

    guidance = get_step_guidance(args.step_number, args.plan_file, args.thoughts)
    is_complete = args.step_number >= 7

    step_names = {
        1: "Execution Planning",
        2: "Reconciliation",
        3: "Milestone Execution",
        4: "Post-Implementation QR",
        5: "QR Issue Resolution",
        6: "Documentation",
        7: "Retrospective",
    }

    print("=" * 80)
    print(
        f"EXECUTOR - Step {args.step_number} of 7: {step_names.get(args.step_number, 'Unknown')}"
    )
    print("=" * 80)
    print()
    print(f"STATUS: {'execution_complete' if is_complete else 'in_progress'}")
    print()
    print("YOUR THOUGHTS:")
    print(args.thoughts)
    print()

    if guidance["actions"]:
        print("GUIDANCE:")
        print()
        for action in guidance["actions"]:
            print(action)
        print()

    print("NEXT:")
    print(guidance["next"])
    print()
    print("=" * 80)


if __name__ == "__main__":
    main()
1015
.claude/skills/planner/scripts/planner.py
Normal file
File diff suppressed because it is too large
19
.claude/skills/problem-analysis/CLAUDE.md
Normal file
@@ -0,0 +1,19 @@
# skills/problem-analysis/

## Overview

Structured problem analysis skill. IMMEDIATELY invoke the script - do NOT
explore first.

## Index

| File/Directory | Contents | Read When |
| -------------------- | ----------------- | ------------------ |
| `SKILL.md` | Invocation | Using this skill |
| `scripts/analyze.py` | Complete workflow | Debugging behavior |

## Key Point

The script IS the workflow. It handles decomposition, solution generation,
critique, verification, and synthesis. Do NOT analyze before invoking. Run the
script and obey its output.
45
.claude/skills/problem-analysis/README.md
Normal file
@@ -0,0 +1,45 @@
# Problem Analysis

LLMs jump to solutions. You describe a problem, they propose an answer. For
complex decisions with multiple viable paths, that first answer often reflects
the LLM's biases rather than the best fit for your constraints. This skill
forces structured reasoning before you commit.

The skill runs through seven phases:

| Phase | Actions |
| ----------- | ------------------------------------------------------------------------ |
| Decompose | State problem; identify hard/soft constraints, variables, assumptions |
| Generate | Create 2-4 distinct approaches (fundamentally different, not variations) |
| Expand | Push for more solutions: unexplored axes, adjacent domains, anti-solutions |
| Critique | Specific weaknesses; eliminate or refine |
| Verify | Answer questions WITHOUT looking at solutions |
| Cross-check | Reconcile verified facts with original claims; update viability |
| Synthesize | Trade-off matrix with verified facts; decision framework |

## When to Use

Use this for decisions where the cost of choosing wrong is high:

- Multiple viable technical approaches (Redis vs Postgres, REST vs GraphQL)
- Architectural decisions with long-term consequences
- Problems where you suspect your first instinct might be wrong

## Example Usage

```
I need to decide how to handle distributed locking in our microservices.
Options I'm considering:

- Redis with Redlock algorithm
- ZooKeeper
- Database advisory locks

Use your problem-analysis skill to structure this decision.
```

## The Design

The structure prevents premature convergence. Critique catches obvious flaws
before costly verification. Factored verification prevents confirmation bias --
you answer questions without seeing your original solutions. Cross-check forces
explicit reconciliation of evidence with claims.
26
.claude/skills/problem-analysis/SKILL.md
Normal file
@@ -0,0 +1,26 @@
---
name: problem-analysis
description: Invoke IMMEDIATELY for structured problem analysis and solution discovery.
---

# Problem Analysis

When this skill activates, IMMEDIATELY invoke the script. The script IS the
workflow.

## Invocation

```bash
python3 scripts/analyze.py \
  --step 1 \
  --total-steps 7 \
  --thoughts "Problem: <describe>"
```

| Argument | Required | Description |
| --------------- | -------- | ----------------------------------------- |
| `--step` | Yes | Current step (starts at 1) |
| `--total-steps` | Yes | Minimum 7; adjust as script instructs |
| `--thoughts` | Yes | Accumulated state from all previous steps |

Do NOT analyze or explore first. Run the script and follow its output.
379
.claude/skills/problem-analysis/scripts/analyze.py
Normal file
@@ -0,0 +1,379 @@
#!/usr/bin/env python3
"""
Problem Analysis Skill - Structured deep reasoning workflow.

Guides problem analysis through seven phases:
1. Decompose - understand problem space, constraints, assumptions
2. Generate - create initial solution approaches
3. Expand - push for MORE solutions not yet considered
4. Critique - Self-Refine feedback on solutions
5. Verify - factored verification of assumptions
6. Cross-check - reconcile verified facts with claims
7. Synthesize - structured trade-off analysis

Extra steps beyond 7 go to verification (where accuracy improves most).

Usage:
    python3 analyze.py --step 1 --total-steps 7 --thoughts "Problem: <describe the decision or challenge>"

Research grounding:
- ToT (Yao 2023): decompose into thoughts "small enough for diverse samples,
  big enough to evaluate"
- CoVe (Dhuliawala 2023): factored verification improves accuracy 17%->70%.
  Use OPEN questions, not yes/no ("model tends to agree whether right or wrong")
- Self-Refine (Madaan 2023): feedback must be "actionable and specific";
  separate feedback from refinement for 5-40% improvement
- Analogical Prompting (Yasunaga 2024): "recall relevant and distinct problems"
  improves reasoning; diversity in self-generated examples is critical
- Diversity-Based Selection (Zhang 2022): "even with 50% wrong demonstrations,
  diversity-based clustering performance does not degrade significantly"
"""

import argparse
import sys


def get_step_1_guidance():
    """Step 1: Problem Decomposition - understand the problem space."""
    return (
        "Problem Decomposition",
        [
            "State the CORE PROBLEM in one sentence: 'I need to decide X'",
            "",
            "List HARD CONSTRAINTS (non-negotiable):",
            " - Hard constraints: latency limits, accuracy requirements, compatibility",
            " - Resource constraints: budget, timeline, skills, capacity",
            " - Quality constraints: what 'good' looks like for this problem",
            "",
            "List SOFT CONSTRAINTS (preferences, can trade off)",
            "",
            "List VARIABLES (what you control):",
            " - Structural choices (architecture, format, organization)",
            " - Content choices (scope, depth, audience, tone)",
            " - Process choices (workflow, tools, automation level)",
            "",
            "Surface HIDDEN ASSUMPTIONS by asking:",
            " 'What am I assuming about scale/load patterns?'",
            " 'What am I assuming about the team's capabilities?'",
            " 'What am I assuming will NOT change?'",
            "",
            "If unclear, use AskUserQuestion to clarify",
        ],
        [
            "PROBLEM (one sentence)",
            "HARD CONSTRAINTS (non-negotiable)",
            "SOFT CONSTRAINTS (preferences)",
            "VARIABLES (what you control)",
            "ASSUMPTIONS (surfaced via questions)",
        ],
    )


def get_step_2_guidance():
    """Step 2: Solution Generation - create distinct approaches."""
    return (
        "Solution Generation",
        [
            "Generate 2-4 DISTINCT solution approaches",
            "",
            "Solutions must differ on a FUNDAMENTAL AXIS:",
            " - Scope: narrow-deep vs broad-shallow",
            " - Complexity: simple-but-limited vs complex-but-flexible",
            " - Control: standardized vs customizable",
            " - Approach: build vs buy, manual vs automated, centralized vs distributed",
            " (Identify axes specific to your problem domain)",
            "",
            "For EACH solution, document:",
            " - Name: short label (e.g., 'Option A', 'Hybrid Approach')",
            " - Core mechanism: HOW it solves the problem (1-2 sentences)",
            " - Key assumptions: what must be true for this to work",
            " - Claimed benefits: what this approach provides",
            "",
            "AVOID premature convergence - do not favor one solution yet",
        ],
        [
            "PROBLEM (from step 1)",
            "CONSTRAINTS (from step 1)",
            "SOLUTIONS (each with: name, mechanism, assumptions, claimed benefits)",
        ],
    )


def get_step_3_guidance():
    """Step 3: Solution Expansion - push beyond initial ideas."""
    return (
        "Solution Expansion",
        [
            "Review the solutions from step 2. Now PUSH FURTHER:",
            "",
            "UNEXPLORED AXES - What fundamental trade-offs were NOT represented?",
            " - If all solutions are complex, what's the SIMPLEST approach?",
            " - If all are centralized, what's DISTRIBUTED?",
            " - If all use technology X, what uses its OPPOSITE or COMPETITOR?",
            " - If all optimize for metric A, what optimizes for metric B?",
            "",
            "ADJACENT DOMAINS - What solutions from RELATED problems might apply?",
            " 'How does [related domain] solve similar problems?'",
            " 'What would [different industry/field] do here?'",
            " 'What patterns from ADJACENT DOMAINS might apply?'",
            "",
            "ANTI-SOLUTIONS - What's the OPPOSITE of each current solution?",
            " If Solution A is stateful, what's stateless?",
            " If Solution A is synchronous, what's asynchronous?",
            " If Solution A is custom-built, what's off-the-shelf?",
            "",
            "NULL/MINIMAL OPTIONS:",
            " - What if we did NOTHING and accepted the current state?",
            " - What if we solved a SMALLER version of the problem?",
            " - What's the 80/20 solution that's 'good enough'?",
            "",
            "ADD 1-3 MORE solutions. Each must represent an axis/approach",
            "not covered by the initial set.",
        ],
        [
            "INITIAL SOLUTIONS (from step 2)",
            "AXES NOT YET EXPLORED (identified gaps)",
            "NEW SOLUTIONS (1-3 additional, each with: name, mechanism, assumptions)",
            "COMPLETE SOLUTION SET (all solutions for next phase)",
        ],
    )


def get_step_4_guidance():
    """Step 4: Solution Critique - Self-Refine feedback phase."""
    return (
        "Solution Critique",
        [
            "For EACH solution, identify weaknesses:",
            " - What could go wrong? (failure modes)",
            " - What does this solution assume that might be false?",
            " - Where is the complexity hiding?",
            " - What operational burden does this create?",
            "",
            "Generate SPECIFIC, ACTIONABLE feedback:",
            " BAD: 'This might have scaling issues'",
            " GOOD: 'Single-node Redis fails at >100K ops/sec; Solution A",
            " assumes <50K ops/sec but requirements say 200K'",
            "",
            "Identify which solutions should be:",
            " - ELIMINATED: fatal flaw, violates hard constraint",
            " - REFINED: fixable weakness, needs modification",
            " - ADVANCED: no obvious flaws, proceed to verification",
            "",
            "For REFINED solutions, state the specific modification needed",
        ],
        [
            "SOLUTIONS (from steps 2-3)",
            "CRITIQUE for each (specific weaknesses, failure modes)",
            "DISPOSITION: ELIMINATED / REFINED / ADVANCED for each",
            "MODIFICATIONS needed for REFINED solutions",
        ],
    )


def get_verification_guidance():
    """
    Steps 5 to N-2: Factored Assumption Verification.

    Key insight from CoVe: answer verification questions WITHOUT attending
    to the original solutions. Models that see their own hallucinations
    tend to repeat them.
    """
    return (
        "Factored Verification",
        [
            "FACTORED VERIFICATION (answer WITHOUT looking at solutions):",
            "",
            "Step A - List assumptions as OPEN questions:",
            " BAD: 'Is option A better?' (yes/no triggers agreement bias)",
            " GOOD: 'What throughput does option A achieve under heavy load?'",
            " GOOD: 'What reading level does this document require?'",
            " GOOD: 'How long does this workflow take with the proposed automation?'",
            "",
            "Step B - Answer each question INDEPENDENTLY:",
            " - Pretend you have NOT seen the solutions",
            " - Answer from first principles or domain knowledge",
            " - Do NOT defend any solution; seek truth",
            " - Cite sources or reasoning for each answer",
            "",
            "Step C - Categorize each assumption:",
            " VERIFIED: evidence confirms the assumption",
            " FALSIFIED: evidence contradicts (note: 'claimed X, actually Y')",
            " UNCERTAIN: insufficient evidence; note what would resolve it",
        ],
        [
            "SOLUTIONS still under consideration",
            "VERIFICATION QUESTIONS (open, not yes/no)",
            "ANSWERS (independent, from first principles)",
            "CATEGORIZED: VERIFIED / FALSIFIED / UNCERTAIN for each",
        ],
    )


def get_crosscheck_guidance():
    """
    Step N-1: Cross-check - reconcile verified facts with original claims.

    From CoVe Factor+Revise: explicit cross-check achieves +7.7 FACTSCORE
    points over factored verification alone.
    """
    return (
        "Cross-Check",
        [
            "Reconcile verified facts with solution claims:",
            "",
            "For EACH surviving solution:",
            " - Which claims are now SUPPORTED by verification?",
            " - Which claims are CONTRADICTED? (list specific contradictions)",
            " - Which claims remain UNTESTED?",
            "",
            "Update solution viability:",
            " - Mark solutions with falsified CORE assumptions as ELIMINATED",
            " - Note which solutions gained credibility (verified strengths)",
            " - Note which solutions lost credibility (falsified claims)",
            "",
            "Check for EMERGENT solutions:",
            " - Do verified facts suggest an approach not previously considered?",
            " - Can surviving solutions be combined based on verified strengths?",
        ],
        [
            "SOLUTIONS with updated status",
            "SUPPORTED claims (with evidence)",
            "CONTRADICTED claims (with specific contradictions)",
            "UNTESTED claims",
            "ELIMINATED solutions (if any, with reason)",
            "EMERGENT solutions (if any)",
        ],
    )


def get_final_step_guidance():
    """Final step: Structured Trade-off Synthesis."""
    return (
        "Trade-off Synthesis",
        [
            "STRUCTURED SYNTHESIS:",
            "",
            "1. SURVIVING SOLUTIONS:",
            " List solutions NOT eliminated by falsified assumptions",
            "",
            "2. TRADE-OFF MATRIX (verified facts only):",
            " For each dimension that matters to THIS decision:",
            " - Measurable outcomes: 'A achieves X; B achieves Y (verified)'",
            " - Complexity/effort: 'A requires N; B requires M'",
            " - Risk profile: 'A fails when...; B fails when...'",
            " (Add dimensions specific to your problem)",
            "",
            "3. DECISION FRAMEWORK:",
            " 'If [hard constraint] is paramount -> choose A because...'",
            " 'If [other priority] matters more -> choose B because...'",
            " 'If uncertain about [X] -> gather [specific data] first'",
            "",
            "4. RECOMMENDATION (if one solution dominates):",
            " State which solution and the single strongest reason",
            " Acknowledge what you're giving up by choosing it",
        ],
        [],  # No next step
    )


def get_guidance(step: int, total_steps: int):
|
||||
"""
|
||||
Dispatch to appropriate guidance based on step number.
|
||||
|
||||
7-phase structure:
|
||||
Step 1: Decomposition
|
||||
Step 2: Generation (initial solutions)
|
||||
Step 3: Expansion (push for MORE solutions)
|
||||
Step 4: Critique (Self-Refine feedback)
|
||||
Steps 5-N-2: Verification (factored, extra steps go here)
|
||||
Step N-1: Cross-check
|
||||
Step N: Synthesis
|
||||
"""
|
||||
if step == 1:
|
||||
return get_step_1_guidance()
|
||||
if step == 2:
|
||||
return get_step_2_guidance()
|
||||
if step == 3:
|
||||
return get_step_3_guidance()
|
||||
if step == 4:
|
||||
return get_step_4_guidance()
|
||||
if step == total_steps:
|
||||
return get_final_step_guidance()
|
||||
if step == total_steps - 1:
|
||||
return get_crosscheck_guidance()
|
||||
# Steps 5 to N-2 are verification
|
||||
return get_verification_guidance()
|
||||
|
||||
|
||||
def format_output(step: int, total_steps: int, thoughts: str) -> str:
|
||||
"""Format output for display."""
|
||||
title, actions, next_state = get_guidance(step, total_steps)
|
||||
is_complete = step >= total_steps
|
||||
|
||||
lines = [
|
||||
"=" * 70,
|
||||
f"PROBLEM ANALYSIS - Step {step}/{total_steps}: {title}",
|
||||
"=" * 70,
|
||||
"",
|
||||
"ACCUMULATED STATE:",
|
||||
thoughts[:1200] + "..." if len(thoughts) > 1200 else thoughts,
|
||||
"",
|
||||
"ACTIONS:",
|
||||
]
|
||||
lines.extend(f" {action}" for action in actions)
|
||||
|
||||
if not is_complete and next_state:
|
||||
lines.append("")
|
||||
lines.append("NEXT STEP STATE MUST INCLUDE:")
|
||||
lines.extend(f" - {item}" for item in next_state)
|
||||
|
||||
lines.append("")
|
||||
|
||||
if is_complete:
|
||||
lines.extend([
|
||||
"COMPLETE - Present to user:",
|
||||
" 1. Problem and constraints (from decomposition)",
|
||||
" 2. Solutions considered (including eliminated ones and why)",
|
||||
" 3. Verified facts (from factored verification)",
|
||||
" 4. Trade-off matrix with decision framework",
|
||||
" 5. Recommendation (if one dominates) or decision criteria",
|
||||
])
|
||||
else:
|
||||
next_title, _, _ = get_guidance(step + 1, total_steps)
|
||||
lines.extend([
|
||||
f"NEXT: Step {step + 1} - {next_title}",
|
||||
f"REMAINING: {total_steps - step} step(s)",
|
||||
"",
|
||||
"ADJUST: increase --total-steps if more verification needed (min 7)",
|
||||
])
|
||||
|
||||
lines.extend(["", "=" * 70])
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Problem Analysis - Structured deep reasoning",
|
||||
epilog=(
|
||||
"Phases: decompose (1) -> generate (2) -> expand (3) -> "
|
||||
"critique (4) -> verify (5 to N-2) -> cross-check (N-1) -> synthesize (N)"
|
||||
),
|
||||
)
|
||||
parser.add_argument("--step", type=int, required=True)
|
||||
parser.add_argument("--total-steps", type=int, required=True)
|
||||
parser.add_argument("--thoughts", type=str, required=True)
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.step < 1:
|
||||
sys.exit("ERROR: --step must be >= 1")
|
||||
if args.total_steps < 7:
|
||||
sys.exit("ERROR: --total-steps must be >= 7 (requires 7 phases)")
|
||||
if args.step > args.total_steps:
|
||||
sys.exit("ERROR: --step cannot exceed --total-steps")
|
||||
|
||||
print(format_output(args.step, args.total_steps, args.thoughts))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
21
.claude/skills/prompt-engineer/CLAUDE.md
Normal file
@@ -0,0 +1,21 @@
# skills/prompt-engineer/

## Overview

Prompt optimization skill using research-backed techniques. IMMEDIATELY invoke
the script - do NOT explore or analyze first.

## Index

| File/Directory                                 | Contents               | Read When          |
| ---------------------------------------------- | ---------------------- | ------------------ |
| `SKILL.md`                                     | Invocation             | Using this skill   |
| `scripts/optimize.py`                          | Complete workflow      | Debugging behavior |
| `references/prompt-engineering-single-turn.md` | Single-turn techniques | Script instructs   |
| `references/prompt-engineering-multi-turn.md`  | Multi-turn techniques  | Script instructs   |

## Key Point

The script IS the workflow. It handles triage, blind problem identification,
planning, factored verification, feedback, refinement, and integration. Do NOT
analyze before invoking. Run the script and obey its output.
149
.claude/skills/prompt-engineer/README.md
Normal file
@@ -0,0 +1,149 @@
# Prompt Engineer

Prompts are code. They have bugs, edge cases, and failure modes. This skill
treats prompt optimization as a systematic discipline -- analyzing issues,
applying documented patterns, and proposing changes with explicit rationale.

I use this on my own workflow. The skill was optimized using itself -- of
course.

## When to Use

- A sub-agent definition that misbehaves (agents/developer.md)
- A Python script with embedded prompts that underperform
  (skills/planner/scripts/planner.py)
- A multi-prompt workflow that produces inconsistent results
- Any prompt that does not do what you intended

## How It Works

The skill:

1. Reads prompt engineering pattern references
2. Analyzes the target prompt for issues
3. Proposes changes with explicit pattern attribution
4. Waits for approval before applying changes
5. Presents the optimized result with self-verification

I use recitation and careful output ordering to ground the skill in the
referenced patterns. This prevents the model from inventing techniques.

## Example Usage

Optimize a sub-agent:

```
Use your prompt engineer skill to optimize the system prompt for
the following claude code sub-agent: agents/developer.md
```

Optimize a multi-prompt workflow:

```
Consider @skills/planner/scripts/planner.py. Identify all prompts,
understand how they interact, then use your prompt engineer skill
to optimize each.
```

## Example Output

Each proposed change includes scope, problem, technique, before/after, and
rationale. A single invocation may propose many changes:

```
+==============================================================================+
| CHANGE 1: Add STOP gate to Step 1 (Exploration)                              |
+==============================================================================+
|                                                                              |
| SCOPE                                                                        |
| -----                                                                        |
| Prompt: analyze.py step 1                                                    |
| Section: Lines 41-49 (precondition check)                                    |
| Downstream: All subsequent steps depend on exploration results               |
|                                                                              |
+------------------------------------------------------------------------------+
|                                                                              |
| PROBLEM                                                                      |
| -------                                                                      |
| Issue: Hedging language allows model to skip precondition                    |
|                                                                              |
| Evidence: "PRECONDITION: You should have already delegated..."               |
|           "If you have not, STOP and do that first"                          |
|                                                                              |
| Runtime: Model proceeds to "process exploration results" without having     |
|          any results, produces empty/fabricated structure analysis           |
|                                                                              |
+------------------------------------------------------------------------------+
|                                                                              |
| TECHNIQUE                                                                    |
| ---------                                                                    |
| Apply: STOP Escalation Pattern (single-turn ref)                             |
|                                                                              |
| Trigger: "For behaviors you need to interrupt, not just discourage"          |
| Effect: "Creates metacognitive checkpoint--the model must pause and          |
|          re-evaluate before proceeding"                                      |
| Stacks: Affirmative Directives                                               |
|                                                                              |
+------------------------------------------------------------------------------+
|                                                                              |
| BEFORE                                                                       |
| ------                                                                       |
| +----------------------------------------------------------------------+     |
| | "PRECONDITION: You should have already delegated to the Explore     |     |
| | sub-agent.",                                                        |     |
| | "If you have not, STOP and do that first:",                         |     |
| +----------------------------------------------------------------------+     |
|                                                                              |
|                                    |                                         |
|                                    v                                         |
|                                                                              |
| AFTER                                                                        |
| -----                                                                        |
| +----------------------------------------------------------------------+     |
| | "STOP. Before proceeding, verify you have Explore agent results.",  |     |
| | "",                                                                 |     |
| | "If your --thoughts do NOT contain Explore agent output, you MUST:",|     |
| | "  1. Use Task tool with subagent_type='Explore'                    |     |
| | "  2. Prompt: 'Explore this repository. Report directory structure, |     |
| | "     tech stack, entry points, main components, observed patterns.'|     |
| | "  3. WAIT for results before invoking this step again              |     |
| | "",                                                                 |     |
| | "Only proceed below if you have concrete Explore output to process."|     |
| +----------------------------------------------------------------------+     |
|                                                                              |
+------------------------------------------------------------------------------+
|                                                                              |
| WHY THIS IMPROVES QUALITY                                                    |
| -------------------------                                                    |
| Transforms soft precondition into hard gate. Model must explicitly verify    |
| it has Explore results before processing, preventing fabricated analysis.    |
|                                                                              |
+==============================================================================+

... many more


---
Compatibility check:
- STOP Escalation + Affirmative Directives: Compatible (STOP is for interrupting specific behaviors)
- History Accumulation + Completeness Checkpoint Tags: Synergistic (both enforce state tracking)
- Quote Extraction + Chain-of-Verification: Complementary (both prevent hallucination)
- Progressive depth + Pre-Work Context Analysis: Sequential (planning enables deeper execution)

Anti-patterns verified:
- No hedging spiral (replaced "should have" with "STOP. Verify...")
- No everything-is-critical (CRITICAL used only for state requirement)
- Affirmative directives used (changed negatives to positives)
- No implicit category trap (explicit checklists provided)

---
Does this plan look reasonable? I'll apply these changes once you confirm.
```

## Caveat

When you tell an LLM "find problems and opportunities for optimization", it will
find problems. That is what you asked it to do. Some may not be real issues.

I recommend invoking the skill multiple times on challenging prompts, but
recognize when it is good enough and stop. Diminishing returns are real.
26
.claude/skills/prompt-engineer/SKILL.md
Normal file
@@ -0,0 +1,26 @@
---
name: prompt-engineer
description: Invoke IMMEDIATELY via python script when user requests prompt optimization. Do NOT analyze first - invoke this skill immediately.
---

# Prompt Engineer

When this skill activates, IMMEDIATELY invoke the script. The script IS the
workflow.

## Invocation

```bash
python3 scripts/optimize.py \
  --step 1 \
  --total-steps 9 \
  --thoughts "Prompt: <path or description>"
```

| Argument        | Required | Description                               |
| --------------- | -------- | ----------------------------------------- |
| `--step`        | Yes      | Current step (starts at 1)                |
| `--total-steps` | Yes      | Minimum 9; adjust as script instructs     |
| `--thoughts`    | Yes      | Accumulated state from all previous steps |

Do NOT analyze or explore first. Run the script and follow its output.
@@ -0,0 +1,790 @@
# Prompt Engineering: Research-Backed Techniques for Multi-Turn Prompts

This document synthesizes practical prompt engineering patterns with academic research on iterative LLM reasoning. All techniques target **multi-turn prompts**—structured sequences of messages where output from one turn becomes input to subsequent turns. These techniques leverage the observation that models can improve their own outputs through deliberate self-examination across multiple passes.

**Prerequisite**: This guide assumes familiarity with single-turn techniques (CoT, Plan-and-Solve, RE2, etc.). Multi-turn techniques often enhance or extend single-turn methods across message boundaries.

**Meta-principle**: The value of multi-turn prompting comes from separation of concerns—each turn has a distinct cognitive goal (generate, critique, verify, synthesize). Mixing these goals within a single turn reduces effectiveness.

---

## Technique Selection Guide

| Domain             | Technique                  | Trigger Condition                                    | Stacks With                         | Conflicts With        | Cost/Tradeoff                                  | Effect                                                |
| ------------------ | -------------------------- | ---------------------------------------------------- | ----------------------------------- | --------------------- | ---------------------------------------------- | ----------------------------------------------------- |
| **Refinement**     | Self-Refine                | Output quality improvable through iteration          | Any single-turn reasoning technique | Time-critical tasks   | 2-4x tokens per iteration                      | 5-40% absolute improvement across 7 task types        |
| **Refinement**     | Iterative Critique         | Specific quality dimensions need improvement         | Self-Refine, Format Strictness      | —                     | Moderate; targeted feedback reduces iterations | Monotonic improvement on scored dimensions            |
| **Verification**   | Chain-of-Verification      | Factual accuracy critical; hallucination risk        | Quote Extraction (single-turn)      | Joint verification    | 3-4x tokens (baseline + verify + revise)       | List-based QA: 17%→70% accuracy; FACTSCORE: 55.9→71.4 |
| **Verification**   | Factored Verification      | High hallucination persistence in joint verification | CoVe                                | Joint CoVe            | Additional token cost for separation           | Outperforms joint CoVe by 3-8 points across tasks     |
| **Aggregation**    | Universal Self-Consistency | Free-form output; standard SC inapplicable           | Any sampling technique              | Greedy decoding       | N samples + 1 selection call                   | Matches SC on math; enables SC for open-ended tasks   |
| **Aggregation**    | Multi-Chain Reasoning      | Evidence scattered across reasoning attempts         | Self-Consistency, CoT               | Single-chain reliance | N chains + 1 meta-reasoning call               | +5.7% over SC on multi-hop QA; high-quality explanations |
| **Aggregation**    | Complexity-Weighted Voting | Varying reasoning depth across samples               | Self-Consistency, USC               | Simple majority voting | Minimal; selection strategy only              | Further gains over standard SC (+2-3 points)          |
| **Meta-Reasoning** | Chain Synthesis            | Multiple valid reasoning paths exist                 | MCR, USC                            | —                     | Moderate; synthesis pass                       | Combines complementary facts from different chains    |
| **Meta-Reasoning** | Explanation Generation     | Interpretability required alongside answer           | MCR                                 | —                     | Included in meta-reasoning pass                | 82% of explanations rated high-quality                |

---

## Quick Reference: Key Principles

1. **Self-Refine for Iterative Improvement** — Feedback must be actionable ("use the formula n(n+1)/2") and specific ("the for loop is brute force"); vague feedback fails
2. **Separate Feedback from Refinement** — Generate feedback in one turn, apply it in another; mixing degrades both
3. **Factored Verification Beats Joint** — Answer verification questions without attending to the original response; prevents hallucination copying
4. **Shortform Questions Beat Longform** — 70% accuracy on individual verification questions vs. 17% for the same facts in longform generation
5. **Universal Self-Consistency for Free-Form** — When answers can't be exactly matched, ask the LLM to select the most consistent response
6. **Multi-Chain Reasoning for Evidence Collection** — Use reasoning chains as evidence sources, not just answer votes
7. **Meta-Reasoning Over Chains** — A second model pass that reads all chains produces better answers than majority voting
8. **Complexity-Weighted Voting** — Vote over complex chains only; simple chains may reflect shortcuts
9. **History Accumulation Helps** — Retain previous feedback and outputs in refinement prompts; models learn from past mistakes
10. **Open Questions Beat Yes/No** — Verification questions expecting factual answers outperform yes/no format
11. **Stopping Conditions Matter** — Use explicit quality thresholds or iteration limits; models rarely self-terminate optimally
12. **Non-Monotonic Improvement Possible** — Multi-aspect tasks may improve on one dimension while regressing on another; track best-so-far

---
## 1. Iterative Refinement

Techniques where the model critiques and improves its own output across multiple turns.

### Self-Refine

A general-purpose iterative improvement framework. Per Madaan et al. (2023): "SELF-REFINE: an iterative self-refinement algorithm that alternates between two generative steps—FEEDBACK and REFINE. These steps work in tandem to generate high-quality outputs."

**The core loop:**

```
Turn 1 (Generate):
  Input: Task description + prompt
  Output: Initial response y₀

Turn 2 (Feedback):
  Input: Task + y₀ + feedback prompt
  Output: Actionable, specific feedback fb₀

Turn 3 (Refine):
  Input: Task + y₀ + fb₀ + refine prompt
  Output: Improved response y₁

[Iterate until stopping condition]
```

**Critical quality requirements for feedback:**

Per the paper: "By 'actionable', we mean the feedback should contain a concrete action that would likely improve the output. By 'specific', we mean the feedback should identify concrete phrases in the output to change."

**CORRECT feedback (actionable + specific):**

```
This code is slow as it uses a for loop which is brute force.
A better approach is to use the formula n(n+1)/2 instead of iterating.
```

**INCORRECT feedback (vague):**

```
The code could be more efficient. Consider optimizing it.
```

**History accumulation improves refinement:**

The refinement prompt should include all previous iterations. Per the paper: "To inform the model about the previous iterations, we retain the history of previous feedback and outputs by appending them to the prompt. Intuitively, this allows the model to learn from past mistakes and avoid repeating them."

```
Turn N (Refine with history):
  Input: Task + y₀ + fb₀ + y₁ + fb₁ + ... + yₙ₋₁ + fbₙ₋₁
  Output: Improved response yₙ
```

**Performance:** "SELF-REFINE outperforms direct generation from strong LLMs like GPT-3.5 and GPT-4 by 5-40% absolute improvement" across dialogue response generation, code optimization, code readability, math reasoning, sentiment reversal, acronym generation, and constrained generation.

**When Self-Refine works best:**

| Task Type                    | Improvement | Notes                                        |
| ---------------------------- | ----------- | -------------------------------------------- |
| Code optimization            | +13%        | Clear optimization criteria                  |
| Dialogue response            | +35-40%     | Multi-aspect quality (relevance, engagement) |
| Constrained generation       | +20%        | Verifiable constraint satisfaction           |
| Math reasoning (with oracle) | +4.8%       | Requires correctness signal                  |

**Limitation — Non-monotonic improvement:**

Per the paper: "For tasks with multi-aspect feedback like Acronym Generation, the output quality can fluctuate during the iterative process, improving on one aspect while losing out on another."

**Mitigation:** Track scores across iterations; select the output with maximum total score, not necessarily the final output.

---

### Feedback Prompt Design

The feedback prompt determines refinement quality. Key elements from Self-Refine experiments:

**Structure:**

```
You are given [task description] and an output.

Output: {previous_output}

Provide feedback on this output. Your feedback should:
1. Identify specific phrases or elements that need improvement
2. Explain why they are problematic
3. Suggest concrete actions to fix them

Do not rewrite the output. Only provide feedback.

Feedback:
```

**Why separation matters:** Combining feedback and rewriting in one turn degrades both. The model either produces shallow feedback to get to rewriting, or rewrites without fully analyzing problems.

---

### Refinement Prompt Design

The refinement prompt applies feedback to produce improved output.

**Structure:**

```
You are given [task description], a previous output, and feedback on that output.

Previous output: {previous_output}

Feedback: {feedback}

Using this feedback, produce an improved version of the output.
Address each point raised in the feedback.

Improved output:
```

**With history (for iteration 2+):**

```
You are given [task description], your previous attempts, and feedback on each.

Attempt 1: {y₀}
Feedback 1: {fb₀}

Attempt 2: {y₁}
Feedback 2: {fb₁}

Using all feedback, produce an improved version. Do not repeat previous mistakes.

Improved output:
```

---

### Stopping Conditions

Self-Refine requires explicit stopping conditions. Options:

1. **Fixed iterations:** Stop after N refinement cycles (typically 2-4)
2. **Feedback-based:** Prompt the model to include a stop signal in feedback
3. **Score-based:** Stop when quality score exceeds threshold
4. **Diminishing returns:** Stop when improvement between iterations falls below threshold

**Prompt for feedback-based stopping:**

```
Provide feedback on this output. If the output is satisfactory and needs no
further improvement, respond with "NO_REFINEMENT_NEEDED" instead of feedback.

Feedback:
```

**Warning:** Models often fail to self-terminate appropriately. Per Madaan et al.: fixed iteration limits are more reliable than self-assessed stopping.

---
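The pieces above (generate, separated feedback and refine turns, history accumulation, and a stopping condition) compose into a short driver loop. The sketch below is illustrative, not code from the paper; `llm` is a hypothetical stand-in for any prompt-to-completion function, and the prompt strings are condensed versions of the templates shown earlier.

```python
def self_refine(llm, task: str, max_iters: int = 3) -> str:
    """Minimal Self-Refine driver: generate once, then alternate
    FEEDBACK and REFINE turns with accumulated history.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    output = llm(f"{task}\n\nOutput:")
    history = []  # (attempt, feedback) pairs kept across iterations

    for _ in range(max_iters):
        # FEEDBACK turn: critique only -- no rewriting in this pass
        feedback = llm(
            f"{task}\n\nOutput: {output}\n\n"
            "Provide actionable, specific feedback. Do not rewrite the output. "
            "If the output is satisfactory, respond with NO_REFINEMENT_NEEDED.\n\n"
            "Feedback:"
        )
        if "NO_REFINEMENT_NEEDED" in feedback:
            break
        history.append((output, feedback))

        # REFINE turn: apply ALL accumulated feedback to avoid repeat mistakes
        transcript = "\n\n".join(
            f"Attempt {i + 1}: {y}\nFeedback {i + 1}: {fb}"
            for i, (y, fb) in enumerate(history)
        )
        output = llm(
            f"{task}\n\n{transcript}\n\n"
            "Using all feedback, produce an improved version. "
            "Do not repeat previous mistakes.\n\nImproved output:"
        )
    return output
```

For multi-aspect tasks where improvement is non-monotonic, a variant would score each `output` and return the best-so-far rather than the last one.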
## 2. Verification

Techniques where the model fact-checks its own outputs through targeted questioning.

### Chain-of-Verification (CoVe)

A structured approach to reducing hallucination through self-verification. Per Dhuliawala et al. (2023): "Chain-of-Verification (CoVe) whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response."

**The four-step process:**

```
Turn 1 (Baseline Response):
  Input: Original query
  Output: Initial response (may contain hallucinations)

Turn 2 (Plan Verifications):
  Input: Query + baseline response
  Output: List of verification questions

Turn 3 (Execute Verifications):
  Input: Verification questions ONLY (not baseline response)
  Output: Answers to each verification question

Turn 4 (Final Verified Response):
  Input: Query + baseline response + verification Q&A pairs
  Output: Revised response incorporating verifications
```

**The critical insight — shortform beats longform:**

Per the paper: "Shortform verification questions are more accurately answered than longform queries. In a longform response, LLMs are prone to generate a number of hallucinations. However, it can often be the case that the LLM itself would know these hallucinations are wrong if queried specifically for that individual fact, independent of the rest of the longform generation."

**Quantitative evidence:**

| Setting                      | Accuracy |
| ---------------------------- | -------- |
| Facts in longform generation | ~17%     |
| Same facts as individual Q&A | ~70%     |

The same model that hallucinates facts in context can correctly answer when asked directly. CoVe exploits this asymmetry.

**Example from the paper:**

```
Query: Name some politicians who were born in NY, New York.

Baseline Response (with hallucinations):
1. Hillary Clinton - former secretary of state... [WRONG: born in Chicago]
2. Donald Trump - former president... [CORRECT: born in Queens, NYC]
3. Michael Bloomberg - former Mayor... [WRONG: born in Boston]

Verification Questions:
- Where was Hillary Clinton born?
- Where was Donald Trump born?
- Where was Michael Bloomberg born?

Verification Answers:
- Hillary Clinton was born in Chicago, Illinois
- Donald Trump was born in Queens, New York City
- Michael Bloomberg was born in Boston, Massachusetts

Final Verified Response:
1. Donald Trump - former president (born in Queens, NYC)
2. Alexandria Ocasio-Cortez - Democratic representative (born in NYC)
...
```

---

### Factored vs. Joint Verification

**The hallucination copying problem:**

Per Dhuliawala et al.: "Models that attend to existing hallucinations in the context from their own generations tend to repeat the hallucinations."

When verification questions are answered with the baseline response in context, the model tends to confirm its own hallucinations rather than correct them.

**Joint verification (less effective):**

```
Turn 3 (Joint):
  Input: Query + baseline response + verification questions
  Output: All answers in one pass

Problem: Model sees its original hallucinations and copies them
```

**Factored verification (more effective):**

```
Turn 3a: Answer Q1 independently (no baseline in context)
Turn 3b: Answer Q2 independently (no baseline in context)
Turn 3c: Answer Q3 independently (no baseline in context)
...
```

**2-Step verification (middle ground):**

```
Turn 3a: Generate all verification answers (no baseline in context)
Turn 3b: Cross-check answers against baseline, note inconsistencies
```

**Performance comparison (Wiki-Category task):**

| Method        | Precision |
| ------------- | --------- |
| Baseline      | 0.13      |
| Joint CoVe    | 0.15      |
| 2-Step CoVe   | 0.19      |
| Factored CoVe | 0.22      |

Factored verification consistently outperforms joint verification by preventing hallucination propagation.

---
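The four turns, with the factored variant of Turn 3, can be wired together in a few lines. This is a minimal sketch under some assumptions: `llm` is a hypothetical prompt-to-completion callable, and the planner is assumed to emit one question per line. The key detail is that each Turn 3 call deliberately omits the draft from its prompt.

```python
def cove_factored(llm, query: str) -> str:
    """Minimal factored Chain-of-Verification: draft, plan, verify, revise.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    # Turn 1: baseline draft (may contain hallucinations)
    baseline = llm(f"{query}\n\nResponse:")

    # Turn 2: plan open (not yes/no) verification questions, one per line
    plan = llm(
        f"Query: {query}\nDraft: {baseline}\n\n"
        "List open verification questions (one per line) that fact-check "
        "each claim in the draft. Avoid yes/no questions.\n\nQuestions:"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # Turn 3 (factored): answer each question in a FRESH context --
    # the draft is deliberately absent so hallucinations cannot be copied
    answers = [
        llm(f"Answer concisely from your own knowledge.\n\nQ: {q}\nA:")
        for q in questions
    ]

    # Turn 4: revise the draft against the independently verified facts
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        f"Query: {query}\nDraft: {baseline}\n\nVerified facts:\n{qa}\n\n"
        "Rewrite the draft so it is consistent with the verified facts. "
        "Drop any claim the facts contradict.\n\nFinal response:"
    )
```

Each question could also go to a separate conversation or even a different model; the only requirement is that the baseline draft never appears in the verification context.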
|
||||
|
||||
### Verification Question Design
|
||||
|
||||
**Open questions outperform yes/no:**
|
||||
|
||||
Per the paper: "We find that yes/no type questions perform worse for the factored version of CoVe. Some anecdotal examples... find the model tends to agree with facts in a yes/no question format whether they are right or wrong."
|
||||
|
||||
**CORRECT (open verification question):**
|
||||
|
||||
```
|
||||
When did Texas secede from Mexico?
|
||||
→ Expected answer: 1836
|
||||
```
|
||||
|
||||
**INCORRECT (yes/no verification question):**
|
||||
|
||||
```
|
||||
Did Texas secede from Mexico in 1845?
|
||||
→ Model tends to agree regardless of correctness
|
||||
```
|
||||
|
||||
**LLM-generated questions outperform heuristics:**
|
||||
|
||||
Per the paper: "We compare the quality of these questions to heuristically constructed ones... Results show a reduced precision with rule-based verification questions."
|
||||
|
||||
Let the model generate verification questions tailored to the specific response, rather than using templated questions.
---

### Factor+Revise for Complex Verification

For longform generation, add an explicit cross-check step between verification and final response.

**Structure:**

```
Turn 3 (Execute verifications): [as above]

Turn 3.5 (Cross-check):
  Input: Baseline response + verification Q&A pairs
  Output: Explicit list of inconsistencies found

Turn 4 (Final response):
  Input: Baseline + verifications + inconsistency list
  Output: Revised response
```

**Performance:** Factor+Revise achieves FACTSCORE 71.4 vs. 63.7 for factored-only, demonstrating that explicit reasoning about inconsistencies further improves accuracy.

**Prompt for cross-check:**

```
Original passage: {baseline_excerpt}

From another source:
Q: {verification_question_1}
A: {verification_answer_1}

Q: {verification_question_2}
A: {verification_answer_2}

Identify any inconsistencies between the original passage and the verified facts.
List each inconsistency explicitly.

Inconsistencies:
```
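The cross-check template above can be filled in mechanically from the baseline excerpt and the factored Q&A pairs. This is a minimal sketch; the function name and inputs are illustrative assumptions, not part of any published CoVe implementation:

```python
# Sketch: assemble the Factor+Revise cross-check prompt (Turn 3.5) from a
# baseline excerpt and the Q&A pairs produced by factored verification.
def build_crosscheck_prompt(baseline_excerpt, qa_pairs):
    # qa_pairs: list of (question, answer) tuples from Turn 3
    qa_block = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return (
        f"Original passage: {baseline_excerpt}\n\n"
        f"From another source:\n{qa_block}\n\n"
        "Identify any inconsistencies between the original passage and the verified facts.\n"
        "List each inconsistency explicitly.\n\n"
        "Inconsistencies:"
    )
```

Note that the verification answers are produced without the baseline in context; only this cross-check turn sees both sides.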
---

## 3. Aggregation and Consistency

Techniques that sample multiple responses and select or synthesize the best output.

### Universal Self-Consistency (USC)

Extends self-consistency to free-form outputs where exact-match voting is impossible. Per Chen et al. (2023): "USC leverages LLMs themselves to select the most consistent answer among multiple candidates... USC eliminates the need of designing an answer extraction process, and is applicable to tasks with free-form answers."

**The two-step process:**

```
Turn 1 (Sample):
  Input: Query
  Output: N responses sampled with temperature > 0
          [y₁, y₂, ..., yₙ]

Turn 2 (Select):
  Input: Query + all N responses
  Output: Index of most consistent response
```

**The selection prompt:**

```
I have generated the following responses to the question: {question}

Response 0: {response_0}
Response 1: {response_1}
Response 2: {response_2}
...

Select the most consistent response based on majority consensus.
The most consistent response is Response:
```

**Why this works:**

Per the paper: "Although prior works show that LLMs sometimes have trouble evaluating the prediction correctness, empirically we observe that LLMs are generally able to examine the response consistency across multiple tasks."

Assessing consistency is easier than assessing correctness. The model doesn't need to know the right answer, only which answers agree with each other most.

**Performance:**

| Task                   | Greedy | Random | USC  | Standard SC |
| ---------------------- | ------ | ------ | ---- | ----------- |
| GSM8K                  | 91.3   | 91.5   | 92.4 | 92.7        |
| MATH                   | 34.2   | 34.3   | 37.6 | 37.5        |
| TruthfulQA (free-form) | 62.1   | 62.9   | 67.7 | N/A         |
| SummScreen (free-form) | 30.6   | 30.2   | 31.7 | N/A         |

USC matches standard SC on structured tasks and enables consistency-based selection where SC cannot apply.

**Robustness to ordering:**

Per the paper: "The overall model performance remains similar with different response orders, suggesting the effect of response order is minimal." USC is not significantly affected by the order in which responses are presented.

**Optimal sample count:**

USC benefits from more samples up to a point, then plateaus or slightly degrades due to context length limitations. Per experiments: 8 samples is a reliable sweet spot balancing accuracy and cost.
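The Turn 2 selection prompt above is straightforward to format programmatically. A minimal sketch, assuming the N candidates were already sampled by some hypothetical `llm(prompt, temperature=0.7)` call in Turn 1; the helper name is illustrative:

```python
# Sketch of the USC selection step (Turn 2): format all sampled candidates
# into the selection prompt shown above. `build_usc_prompt` is an
# illustrative name, not an official API.
def build_usc_prompt(question, responses):
    listing = "\n".join(f"Response {i}: {r}" for i, r in enumerate(responses))
    return (
        f"I have generated the following responses to the question: {question}\n\n"
        f"{listing}\n\n"
        "Select the most consistent response based on majority consensus.\n"
        "The most consistent response is Response:"
    )
```

The selector's completion (an index) is then parsed to pick the winning candidate; zero-based labels keep the index mapping trivial.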
---

### Multi-Chain Reasoning (MCR)

Uses multiple reasoning chains as evidence sources, not just answer votes. Per Yoran et al. (2023): "Unlike prior work, sampled reasoning chains are used not for their predictions (as in SC) but as a means to collect pieces of evidence from multiple chains."

**The key insight:**

Self-Consistency discards the reasoning and only votes on answers. MCR preserves the reasoning and synthesizes facts across chains.

**The three-step process:**

```
Turn 1 (Generate chains):
  Input: Query
  Output: N reasoning chains, each with intermediate steps
          [chain₁, chain₂, ..., chainₙ]

Turn 2 (Concatenate):
  Combine all chains into unified multi-chain context

Turn 3 (Meta-reason):
  Input: Query + multi-chain context
  Output: Final answer + explanation synthesizing evidence
```

**Why MCR outperforms SC:**

Per the paper: "SC solely relies on the chains' answers... By contrast, MCR concatenates the intermediate steps from each chain into a unified context, which is passed, along with the original question, to a meta-reasoner model."

**Example from the paper:**

```
Question: Did Brad Peyton need to know about seismology?

Chain 1 (Answer: No):
- Brad Peyton is a film director
- What is seismology? Seismology is the study of earthquakes
- Do film directors need to know about earthquakes? No

Chain 2 (Answer: Yes):
- Brad Peyton directed San Andreas
- San Andreas is about a massive earthquake
- [implicit: he needed to research the topic]

Chain 3 (Answer: No):
- Brad Peyton is a director, writer, and producer
- What do film directors have to know? Many things
- Is seismology one of them? No

Self-Consistency vote: No (2-1)

MCR meta-reasoning: Combines facts from all chains:
- Brad Peyton is a film director (chain 1, 3)
- He directed San Andreas (chain 2)
- San Andreas is about a massive earthquake (chain 2)
- Seismology is the study of earthquakes (chain 1)

MCR answer: Yes (synthesizes that directing an earthquake film required seismology knowledge)
```

**Performance:**

MCR outperforms SC by up to 5.7% on multi-hop QA datasets. Additionally: "MCR generates high quality explanations for over 82% of examples, while fewer than 3% are unhelpful."
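The concatenation step (Turn 2) is pure string assembly: merge the intermediate steps from every chain into one context for the meta-reasoner. A minimal sketch with illustrative names; the exact prompt wording is an assumption, not the paper's template:

```python
# Sketch of MCR Turn 2: concatenate intermediate steps from all sampled
# chains into a unified multi-chain context for the meta-reasoner.
def build_mcr_context(question, chains):
    # chains: list of chains, each a list of intermediate reasoning steps
    blocks = [
        f"Chain {i + 1}:\n" + "\n".join(f"- {step}" for step in steps)
        for i, steps in enumerate(chains)
    ]
    return (
        "\n\n".join(blocks)
        + f"\n\nQuestion: {question}\n"
        "Using the evidence above from all chains, answer with an explanation:"
    )
```

Note that minority-answer chains are included deliberately; their intermediate facts (like "San Andreas is about a massive earthquake" above) are exactly what the meta-reasoner needs.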
---

### Complexity-Weighted Voting

An extension to self-consistency that weights votes by reasoning complexity. Per Fu et al. (2023): "We propose complexity-based consistency, where instead of taking a majority vote among all generated chains, we vote over the top K complex chains."

**The process:**

```
Turn 1 (Sample with CoT):
  Generate N reasoning chains with answers

Turn 2 (Rank by complexity):
  Count reasoning steps in each chain
  Select top K chains by step count

Turn 3 (Vote):
  Majority vote only among the K complex chains
```

**Why complexity matters:**

Simple chains may reflect shortcuts or lucky guesses. Complex chains demonstrate thorough reasoning. Voting only over complex chains filters out low-effort responses.

**Performance (GSM8K):**

| Method                      | Accuracy |
| --------------------------- | -------- |
| Standard SC (all chains)    | 78.0     |
| Complexity-weighted (top K) | 80.5     |

**Implementation note:** This requires no additional LLM calls beyond standard SC, just post-processing to count steps and filter before voting.
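Because it is pure post-processing, the filter-then-vote step can be written directly. A minimal sketch; treating each chain as an already-extracted list of steps is an assumption (in practice you might count newline-separated lines of the sampled chain):

```python
from collections import Counter

def complexity_weighted_vote(chains, k=3):
    """Vote only over the top-k chains ranked by reasoning-step count.

    chains: list of (steps, answer) pairs, where steps is the list of
    reasoning steps extracted from one sampled chain.
    """
    ranked = sorted(chains, key=lambda c: len(c[0]), reverse=True)
    votes = Counter(answer for _, answer in ranked[:k])
    return votes.most_common(1)[0][0]

chains = [
    (["shortcut"], "14"),
    (["shortcut"], "14"),
    (["step 1", "step 2", "step 3"], "16"),
    (["step 1", "step 2", "step 3", "step 4"], "16"),
    (["step 1", "step 2"], "14"),
]
# Plain majority over all 5 chains would pick "14" (3 votes to 2);
# the top-3 most complex chains vote 2-1 for "16".
```

The example data shows the point of the technique: two short shortcut chains outvote the thorough ones under plain SC, but are filtered out here.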
---

## 4. Implementation Patterns

### Conversation Structure Template

A general template for multi-turn improvement:

```
SYSTEM: [Base system prompt with single-turn techniques]

--- Turn 1: Initial Generation ---
USER: [Task]
ASSISTANT: [Initial output y₀]

--- Turn 2: Analysis/Feedback ---
USER: [Analysis prompt - critique, verify, or evaluate y₀]
ASSISTANT: [Feedback, verification results, or evaluation]

--- Turn 3: Refinement/Synthesis ---
USER: [Refinement prompt incorporating Turn 2 output]
ASSISTANT: [Improved output y₁]

[Repeat Turns 2-3 as needed]

--- Final Turn: Format/Extract ---
USER: [Optional: extract final answer in required format]
ASSISTANT: [Final formatted output]
```
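The template above can be driven by a generic loop over a message list. A minimal sketch; `llm` is a hypothetical callable mapping a message list to assistant text, and the two prompts are placeholders you would tailor per task:

```python
# Sketch of a driver for the generate -> feedback -> refine template.
# `llm` is a hypothetical callable: list[dict] -> str.
def run_improvement_loop(task, llm, analysis_prompt, refine_prompt, rounds=2):
    messages = [{"role": "user", "content": task}]
    output = llm(messages)                              # Turn 1: initial generation
    messages.append({"role": "assistant", "content": output})
    for _ in range(rounds):
        messages.append({"role": "user", "content": analysis_prompt})
        feedback = llm(messages)                        # Turn 2: analysis/feedback
        messages.append({"role": "assistant", "content": feedback})
        messages.append({"role": "user", "content": refine_prompt})
        output = llm(messages)                          # Turn 3: refinement
        messages.append({"role": "assistant", "content": output})
    return output
```

Keeping everything in one growing message list preserves the full history, which is why the context-management strategies below matter as `rounds` increases.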
### Context Management

Multi-turn prompting accumulates context. Manage token limits by:

1. **Summarize history:** After N iterations, summarize previous attempts rather than including full text
2. **Keep recent + best:** Retain only the most recent iteration and the best-scoring previous output
3. **Structured extraction:** Extract key points from feedback rather than full feedback text

**Example (summarized history):**

```
Previous attempts summary:
- Attempt 1: Failed due to [specific issue]
- Attempt 2: Improved [aspect] but [remaining issue]
- Attempt 3: Best so far, minor issue with [aspect]

Latest attempt: [full text of y₃]

Feedback on latest attempt:
```
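The summarized-history format above reduces each earlier attempt to one line while keeping the latest attempt in full. A minimal sketch with illustrative names:

```python
# Sketch of the "summarize history" pattern: one line per earlier attempt,
# full text only for the latest. Names are illustrative.
def build_refinement_context(attempt_summaries, latest_attempt):
    # attempt_summaries: one-line summaries of attempts 1..N-1
    summary = "\n".join(
        f"- Attempt {i + 1}: {s}" for i, s in enumerate(attempt_summaries)
    )
    return (
        f"Previous attempts summary:\n{summary}\n\n"
        f"Latest attempt: {latest_attempt}\n\n"
        "Feedback on latest attempt:"
    )
```

Token cost now grows by one summary line per iteration instead of a full transcript per iteration.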
---

## 5. Anti-Patterns

### The Mixed-Goal Turn

**Anti-pattern:** Combining distinct cognitive operations in a single turn.

```
# PROBLEMATIC
Generate a response, then critique it, then improve it.
```

Each operation deserves focused attention. The model may rush through critique to reach improvement, or improve without thorough analysis.

```
# BETTER
Turn 1: Generate response
Turn 2: Critique the response (output: feedback only)
Turn 3: Improve based on feedback
```

### The Contaminated Context

**Anti-pattern:** Including the original response when answering verification questions.

Per Dhuliawala et al. (2023): "Models that attend to existing hallucinations in the context from their own generations tend to repeat the hallucinations."

```
# PROBLEMATIC
Original response: [contains potential hallucinations]
Verification question: Where was Hillary Clinton born?
Answer:
```

The model will often confirm the hallucination from its original response.

```
# BETTER
Verification question: Where was Hillary Clinton born?
Answer:
[Original response NOT in context]
```

Exclude the baseline response when executing verifications. Include it only in the final revision step.

### The Yes/No Verification Trap

**Anti-pattern:** Phrasing verification questions as yes/no confirmations.

```
# PROBLEMATIC
Is it true that Michael Bloomberg was born in New York?
```

Per CoVe research: Models tend to agree with yes/no questions regardless of correctness.

```
# BETTER
Where was Michael Bloomberg born?
```

Open questions expecting factual answers perform significantly better.

### The Infinite Loop

**Anti-pattern:** No explicit stopping condition for iterative refinement.

```
# PROBLEMATIC
Keep improving until the output is perfect.
```

Models rarely self-terminate appropriately. "Perfect" is undefined.

```
# BETTER
Improve for exactly 3 iterations, then output the best version.

# OR
Improve until the quality score exceeds 8/10, maximum 5 iterations.
```

Always include explicit stopping criteria: iteration limits, quality thresholds, or both.
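When the loop is orchestrated in code rather than in the prompt, both stopping criteria become explicit control flow. A minimal sketch; `refine` and `score` are hypothetical callables supplied by the caller:

```python
# Sketch of refinement with explicit stopping criteria: a quality
# threshold plus a hard iteration cap, tracking the best version seen.
def refine_until(draft, refine, score, threshold=8, max_iters=5):
    best, best_score = draft, score(draft)
    for _ in range(max_iters):
        if best_score >= threshold:
            break                       # quality threshold reached
        draft = refine(draft)
        current = score(draft)
        if current > best_score:        # keep the best version seen
            best, best_score = draft, current
    return best
```

Returning the best-scoring version (rather than the last one) also guards against a late refinement step making the output worse.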
### The Forgotten History

**Anti-pattern:** Discarding previous iterations in refinement.

```
# PROBLEMATIC
Turn 3: Here is feedback. Improve the output.
[No reference to previous attempts]
```

Per Madaan et al.: "Retaining the history of previous feedback and outputs... allows the model to learn from past mistakes and avoid repeating them."

```
# BETTER
Turn 3:
Previous attempts and feedback:
- Attempt 1: [y₀] → Feedback: [fb₀]
- Attempt 2: [y₁] → Feedback: [fb₁]

Improve, avoiding previously identified issues:
```

### The Vague Feedback

**Anti-pattern:** Feedback without actionable specifics.

```
# PROBLEMATIC
The response could be improved. Some parts are unclear.
```

This feedback provides no guidance for refinement.

```
# BETTER
The explanation of photosynthesis in paragraph 2 uses jargon ("electron
transport chain") without definition. Add a brief explanation: "the process
by which plants convert light energy into chemical energy through a series
of protein complexes."
```

Feedback must identify specific elements AND suggest concrete improvements.

### The Majority Fallacy

**Anti-pattern:** Assuming majority vote is always correct.

```
# PROBLEMATIC
3 out of 5 chains say the answer is X, so X is correct.
```

Per Fu et al.: Simple chains may reflect shortcuts. Per Yoran et al.: Intermediate reasoning contains useful information discarded by voting.

```
# BETTER
Weight votes by reasoning complexity, or use MCR to synthesize
evidence from all chains including minority answers.
```

---

## 6. Technique Combinations

Multi-turn techniques can be combined for compounding benefits.

### Self-Refine + CoVe

Apply verification after refinement to catch introduced errors:

```
Turn 1: Generate initial output
Turn 2: Feedback
Turn 3: Refine
Turn 4: Plan verification questions for refined output
Turn 5: Execute verifications (factored)
Turn 6: Final verified output
```
### USC + Complexity Weighting

Filter by complexity before consistency selection:

```
Turn 1: Sample N responses with reasoning
Turn 2: Filter to top K by reasoning complexity
Turn 3: Apply USC to select most consistent among K
```

### MCR + Self-Refine

Use multi-chain evidence collection, then refine the synthesis:

```
Turn 1: Generate N reasoning chains
Turn 2: Meta-reason to synthesize evidence and produce answer
Turn 3: Feedback on synthesis
Turn 4: Refine synthesis
```

---

## Research Citations

- Chen, X., Aksitov, R., Alon, U., et al. (2023). "Universal Self-Consistency for Large Language Model Generation." arXiv.
- Dhuliawala, S., Komeili, M., Xu, J., et al. (2023). "Chain-of-Verification Reduces Hallucination in Large Language Models." arXiv.
- Diao, S., Wang, P., Lin, Y., & Zhang, T. (2023). "Active Prompting with Chain-of-Thought for Large Language Models." arXiv.
- Fu, Y., Peng, H., Sabharwal, A., Clark, P., & Khot, T. (2023). "Complexity-Based Prompting for Multi-Step Reasoning." arXiv.
- Madaan, A., Tandon, N., Gupta, P., et al. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." arXiv.
- Wang, X., Wei, J., Schuurmans, D., et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR.
- Yao, S., Yu, D., Zhao, J., et al. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS.
- Yoran, O., Wolfson, T., Bogin, B., et al. (2023). "Answering Questions by Meta-Reasoning over Multiple Chains of Thought." arXiv.
- Zhang, Y., Yuan, Y., & Yao, A. (2024). "Meta Prompting for AI Systems." arXiv.
File diff suppressed because it is too large

451
.claude/skills/prompt-engineer/scripts/optimize.py
Normal file
@@ -0,0 +1,451 @@
#!/usr/bin/env python3
"""
Prompt Engineer Skill - Multi-turn prompt optimization workflow.

Guides prompt optimization through nine phases:
1. Triage - Assess complexity, route to lightweight or full process
2. Understand - Blind problem identification (NO references yet)
3. Plan - Consult references, match techniques, generate visual cards
4. Verify - Factored verification of FACTS (open questions, cross-check)
5. Feedback - Generate actionable critique from verification results
6. Refine - Apply feedback to update the plan
7. Approval - Present refined plan to human, HARD GATE
8. Execute - Apply approved changes to prompt
9. Integrate - Coherence check, anti-pattern audit, quality verification

Research grounding:
- Self-Refine (Madaan 2023): Separate feedback from refinement for 5-40%
  improvement. Feedback must be "actionable and specific."
- CoVe (Dhuliawala 2023): Factored verification improves accuracy 17%->70%.
  Use OPEN questions, not yes/no ("model tends to agree whether right or wrong")
- Factor+Revise: Explicit cross-check achieves +7.7 FACTSCORE points over
  factored verification alone.
- Separation of Concerns: "Each turn has a distinct cognitive goal. Mixing
  these goals within a single turn reduces effectiveness."

Usage:
    python3 optimize.py --step 1 --total-steps 9 --thoughts "Prompt: agents/developer.md"
"""

import argparse
import sys


def get_step_1_guidance():
    """Step 1: Triage - Assess complexity and route appropriately."""
    return {
        "title": "Triage",
        "actions": [
            "Assess the prompt complexity:",
            "",
            "SIMPLE prompts (use lightweight 3-step process):",
            " - Under 20 lines",
            " - Single clear purpose (one tool, one behavior)",
            " - No conditional logic or branching",
            " - No inter-section dependencies",
            "",
            "COMPLEX prompts (use full 9-step process):",
            " - Multiple sections serving different functions",
            " - Conditional behaviors or rule hierarchies",
            " - Tool orchestration or multi-step workflows",
            " - Known failure modes that need addressing",
            "",
            "If SIMPLE: Note 'LIGHTWEIGHT' and proceed with abbreviated analysis",
            "If COMPLEX: Note 'FULL PROCESS' and proceed to step 2",
            "",
            "Read the prompt file now. Do NOT read references yet.",
        ],
        "state_requirements": [
            "PROMPT_PATH: path to the prompt being optimized",
            "COMPLEXITY: SIMPLE or COMPLEX",
            "PROMPT_SUMMARY: 2-3 sentences describing purpose",
            "PROMPT_LENGTH: approximate line count",
        ],
    }

def get_step_2_guidance():
    """Step 2: Understand - Blind problem identification."""
    return {
        "title": "Understand (Blind)",
        "actions": [
            "CRITICAL: Do NOT read the reference documents yet.",
            "This step uses BLIND problem identification to prevent pattern-shopping.",
            "",
            "Document the prompt's OPERATING CONTEXT:",
            " - Interaction model: single-shot or conversational?",
            " - Agent type: tool-use, coding, analysis, or general?",
            " - Token constraints: brevity critical or thoroughness preferred?",
            " - Failure modes: what goes wrong when this prompt fails?",
            "",
            "Identify PROBLEMS by examining the prompt text directly:",
            " - Quote specific problematic text with line numbers",
            " - Describe what's wrong in concrete terms",
            " - Note observable symptoms (not guessed causes)",
            "",
            "Examples of observable problems:",
            " 'Lines 12-15 use hedging language: \"might want to\", \"could try\"'",
            " 'No examples provided for expected output format'",
            " 'Multiple rules marked CRITICAL with no clear precedence'",
            " 'Instructions say what NOT to do but not what TO do'",
            "",
            "List at least 3 specific problems with quoted evidence.",
        ],
        "state_requirements": [
            "OPERATING_CONTEXT: interaction model, agent type, constraints",
            "PROBLEMS: list of specific issues with QUOTED text from prompt",
            "Each problem must have: line reference, quoted text, description",
        ],
    }


def get_step_3_guidance():
    """Step 3: Plan - Consult references, match techniques."""
    return {
        "title": "Plan",
        "actions": [
            "NOW read the reference documents:",
            " - references/prompt-engineering-single-turn.md (always)",
            " - references/prompt-engineering-multi-turn.md (if multi-turn prompt)",
            "",
            "For EACH problem identified in Step 2:",
            "",
            "1. Locate a matching technique in the reference",
            "2. QUOTE the trigger condition from the Technique Selection Guide",
            "3. QUOTE the expected effect",
            "4. Note stacking compatibility and conflicts",
            "5. Draft the BEFORE/AFTER transformation",
            "",
            "Format each proposed change as a visual card:",
            "",
            " CHANGE N: [title]",
            " PROBLEM: [quoted text from prompt]",
            " TECHNIQUE: [name]",
            " TRIGGER: \"[quoted from reference]\"",
            " EFFECT: \"[quoted from reference]\"",
            " BEFORE: [original prompt text]",
            " AFTER: [modified prompt text]",
            "",
            "If you cannot quote a trigger condition that matches, do NOT apply.",
        ],
        "state_requirements": [
            "PROBLEMS: (from step 2)",
            "PROPOSED_CHANGES: list of visual cards, each with:",
            " - Problem quoted from prompt",
            " - Technique name",
            " - Trigger condition QUOTED from reference",
            " - Effect QUOTED from reference",
            " - BEFORE/AFTER text",
            "STACKING_NOTES: compatibility between proposed techniques",
        ],
    }

def get_step_4_guidance():
    """Step 4: Verify - Factored verification of facts."""
    return {
        "title": "Verify (Factored)",
        "actions": [
            "FACTORED VERIFICATION: Answer questions WITHOUT seeing your proposals.",
            "",
            "For EACH proposed technique, generate OPEN verification questions:",
            "",
            " WRONG (yes/no): 'Is Affirmative Directives applicable here?'",
            " RIGHT (open): 'What is the trigger condition for Affirmative Directives?'",
            "",
            " WRONG (yes/no): 'Does the prompt have hedging language?'",
            " RIGHT (open): 'What hedging phrases appear in lines 10-20?'",
            "",
            "Answer each question INDEPENDENTLY:",
            " - Pretend you have NOT seen your proposals",
            " - Answer from the reference or prompt text directly",
            " - Do NOT defend your choices; seek truth",
            "",
            "Then CROSS-CHECK: Compare answers to your claims:",
            "",
            " TECHNIQUE: [name]",
            " CLAIMED TRIGGER: \"[what you quoted in step 3]\"",
            " VERIFIED TRIGGER: \"[what the reference actually says]\"",
            " MATCH: CONSISTENT / INCONSISTENT / PARTIAL",
            "",
            " CLAIMED PROBLEM: \"[quoted prompt text in step 3]\"",
            " VERIFIED TEXT: \"[what the prompt actually says at that line]\"",
            " MATCH: CONSISTENT / INCONSISTENT / PARTIAL",
        ],
        "state_requirements": [
            "VERIFICATION_QS: open questions for each technique",
            "VERIFICATION_ANSWERS: factored answers (without seeing proposals)",
            "CROSS_CHECK: for each technique:",
            " - Claimed vs verified trigger condition",
            " - Claimed vs verified prompt text",
            " - Match status: CONSISTENT / INCONSISTENT / PARTIAL",
        ],
    }


def get_step_5_guidance():
    """Step 5: Feedback - Generate actionable critique."""
    return {
        "title": "Feedback",
        "actions": [
            "Generate FEEDBACK based on verification results.",
            "",
            "Self-Refine research requires feedback to be:",
            " - ACTIONABLE: contains concrete action to improve",
            " - SPECIFIC: identifies concrete phrases to change",
            "",
            "WRONG (vague): 'The technique selection could be improved.'",
            "RIGHT (actionable): 'Change 3 claims Affirmative Directives but the",
            " prompt text at line 15 is already affirmative. Remove this change.'",
            "",
            "For each INCONSISTENT or PARTIAL match from Step 4:",
            "",
            " ISSUE: [specific problem from cross-check]",
            " ACTION: [concrete fix]",
            " - Replace technique with [alternative]",
            " - Modify BEFORE/AFTER to [specific change]",
            " - Remove change entirely because [reason]",
            "",
            "For CONSISTENT matches: Note 'VERIFIED - no changes needed'",
            "",
            "Do NOT apply feedback yet. Only generate critique.",
        ],
        "state_requirements": [
            "CROSS_CHECK: (from step 4)",
            "FEEDBACK: for each proposed change:",
            " - STATUS: VERIFIED / NEEDS_REVISION / REMOVE",
            " - If NEEDS_REVISION: specific actionable fix",
            " - If REMOVE: reason for removal",
        ],
    }

def get_step_6_guidance():
    """Step 6: Refine - Apply feedback to update plan."""
    return {
        "title": "Refine",
        "actions": [
            "Apply the feedback from Step 5 to update your proposed changes.",
            "",
            "For each change marked VERIFIED: Keep unchanged",
            "",
            "For each change marked NEEDS_REVISION:",
            " - Apply the specific fix from feedback",
            " - Update the BEFORE/AFTER text",
            " - Verify the trigger condition still matches",
            "",
            "For each change marked REMOVE: Delete from proposal",
            "",
            "After applying all feedback, verify:",
            " - No stacking conflicts between remaining techniques",
            " - All BEFORE/AFTER transformations are consistent",
            " - No duplicate or overlapping changes",
            "",
            "Produce the REFINED PLAN ready for human approval.",
        ],
        "state_requirements": [
            "REFINED_CHANGES: updated list of visual cards",
            "CHANGES_MADE: what was revised or removed and why",
            "FINAL_STACKING_CHECK: confirm no conflicts",
        ],
    }


def get_step_7_guidance():
    """Step 7: Approval - Present to human, hard gate."""
    return {
        "title": "Approval Gate",
        "actions": [
            "Present the REFINED PLAN to the user for approval.",
            "",
            "Format:",
            "",
            " ## Proposed Changes",
            "",
            " [Visual cards for each change]",
            "",
            " ## Verification Summary",
            " - [N] changes verified against reference",
            " - [M] changes revised based on verification",
            " - [K] changes removed (did not match trigger conditions)",
            "",
            " ## Compatibility",
            " - [Note stacking synergies]",
            " - [Note any resolved conflicts]",
            "",
            " ## Anti-Patterns Checked",
            " - Hedging Spiral: [checked/found/none]",
            " - Everything-Is-Critical: [checked/found/none]",
            " - Negative Instruction Trap: [checked/found/none]",
            "",
            " ---",
            " Does this plan look reasonable? Confirm to proceed with execution.",
            "",
            "HARD GATE: Do NOT proceed to Step 8 without explicit user approval.",
        ],
        "state_requirements": [
            "REFINED_CHANGES: (from step 6)",
            "APPROVAL_PRESENTATION: formatted summary for user",
            "USER_APPROVAL: must be obtained before step 8",
        ],
    }

def get_step_8_guidance():
    """Step 8: Execute - Apply approved changes."""
    return {
        "title": "Execute",
        "actions": [
            "Apply the approved changes to the prompt.",
            "",
            "Work through changes in logical order (by prompt section).",
            "",
            "For each approved change:",
            " 1. Locate the target text in the prompt",
            " 2. Apply the BEFORE -> AFTER transformation",
            " 3. Verify the modification matches what was approved",
            "",
            "No additional approval needed per change - plan was approved in Step 7.",
            "",
            "If a conflict is discovered during execution:",
            " - STOP and present the conflict to user",
            " - Wait for resolution before continuing",
            "",
            "After all changes applied, proceed to integration.",
        ],
        "state_requirements": [
            "APPROVED_CHANGES: (from step 7)",
            "APPLIED_CHANGES: list of what was modified",
            "EXECUTION_NOTES: any issues encountered",
        ],
    }


def get_step_9_guidance():
    """Step 9: Integrate - Coherence and quality verification."""
    return {
        "title": "Integrate",
        "actions": [
            "Verify the optimized prompt holistically.",
            "",
            "COHERENCE CHECKS:",
            " - Cross-section references: do sections reference each other correctly?",
            " - Terminology consistency: same terms throughout?",
            " - Priority consistency: do multiple sections align on priorities?",
            " - Flow and ordering: logical progression?",
            "",
            "EMPHASIS AUDIT:",
            " - Count CRITICAL, IMPORTANT, NEVER, ALWAYS markers",
            " - If more than 2-3 highest-level markers, reconsider",
            "",
            "ANTI-PATTERN FINAL CHECK:",
            " - Hedging Spiral: accumulated uncertainty language?",
            " - Everything-Is-Critical: overuse of emphasis?",
            " - Negative Instruction Trap: 'don't' instead of 'do'?",
            " - Implicit Category Trap: examples without principles?",
            "",
            "QUALITY VERIFICATION (open questions):",
            " - 'What behavior will this produce in edge cases?'",
            " - 'How would an agent interpret this if skimming?'",
            " - 'What could go wrong with this phrasing?'",
            "",
            "Present the final optimized prompt with summary of changes.",
        ],
        "state_requirements": [],  # Final step
    }


def get_guidance(step: int, total_steps: int):
    """Dispatch to appropriate guidance based on step number."""
    guidance_map = {
        1: get_step_1_guidance,
        2: get_step_2_guidance,
        3: get_step_3_guidance,
        4: get_step_4_guidance,
        5: get_step_5_guidance,
        6: get_step_6_guidance,
        7: get_step_7_guidance,
        8: get_step_8_guidance,
        9: get_step_9_guidance,
    }

    if step in guidance_map:
        return guidance_map[step]()

    # Extra steps beyond 9 continue integration/verification
    return get_step_9_guidance()

def format_output(step: int, total_steps: int, thoughts: str) -> str:
    """Format output for display."""
    guidance = get_guidance(step, total_steps)
    is_complete = step >= total_steps

    lines = [
        "=" * 70,
        f"PROMPT ENGINEER - Step {step}/{total_steps}: {guidance['title']}",
        "=" * 70,
        "",
        "ACCUMULATED STATE:",
        (thoughts[:1200] + "..." if len(thoughts) > 1200 else thoughts),
        "",
        "ACTIONS:",
    ]
    lines.extend(f"  {action}" for action in guidance["actions"])

    state_reqs = guidance.get("state_requirements", [])
    if not is_complete and state_reqs:
        lines.append("")
        lines.append("NEXT STEP STATE MUST INCLUDE:")
        lines.extend(f"  - {item}" for item in state_reqs)

    lines.append("")

    if is_complete:
        lines.extend([
            "COMPLETE - Present to user:",
            "  1. Summary of optimization process",
            "  2. Techniques applied with reference sections",
            "  3. Quality improvements (top 3)",
            "  4. What was preserved from original",
            "  5. Final optimized prompt",
        ])
    else:
        next_guidance = get_guidance(step + 1, total_steps)
        lines.extend([
            f"NEXT: Step {step + 1} - {next_guidance['title']}",
            f"REMAINING: {total_steps - step} step(s)",
            "",
            "ADJUST: increase --total-steps if more verification needed (min 9)",
        ])

    lines.extend(["", "=" * 70])
    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser(
        description="Prompt Engineer - Multi-turn optimization workflow",
        epilog=(
            "Phases: triage (1) -> understand (2) -> plan (3) -> "
            "verify (4) -> feedback (5) -> refine (6) -> "
            "approval (7) -> execute (8) -> integrate (9)"
        ),
    )
    parser.add_argument("--step", type=int, required=True)
    parser.add_argument("--total-steps", type=int, required=True)
    parser.add_argument("--thoughts", type=str, required=True)
    args = parser.parse_args()

    if args.step < 1:
        sys.exit("ERROR: --step must be >= 1")
    if args.total_steps < 9:
        sys.exit("ERROR: --total-steps must be >= 9 (requires 9 phases)")
    if args.step > args.total_steps:
        sys.exit("ERROR: --step cannot exceed --total-steps")

    print(format_output(args.step, args.total_steps, args.thoughts))


if __name__ == "__main__":
    main()