feat: v0.2.0 sprint — ground truth eval, crossover/mutation, checkpointing, similarity guards, dataset loader, CLI commands, extended test coverage

Aggregates all v0.2.0 sprint work (GARAA-30 through GARAA-40) and fixes 2 integration tests that broke when the codebase went async (DSPyLLMAdapter and full pipeline tests now properly await coroutines). 277 tests pass (260 unit + 17 integration).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-03-29 19:13:50 +00:00
parent b9745566c8
commit a5bf2ad59c
43 changed files with 5007 additions and 358 deletions
@@ -0,0 +1,369 @@
+# PROMETHEUS Feature Roadmap
+
+> Complete codebase review — features needed for production-grade prompt optimization.
+> Generated from v0.1.0 architecture review (2026-03-29).
+
+---
+
+## Legend
+
+| Marker | Meaning |
+|--------|---------|
+| **CLI** | Exposed as a CLI option/flag |
+| **Config** | YAML config field |
+| **Internal** | No user-facing surface, architectural improvement |
+| **P1** | Critical / must-have for reliability |
+| **P2** | High value, should-have |
+| **P3** | Nice-to-have, deferred to later versions |
+
+---
+
+## 1. Multi-Model Routing (P1)
+
+**Current state:** `OptimizationConfig` defines four model slots (`task_model`, `judge_model`, `proposer_model`, `synth_model`), but `cli/app.py` only configures a single global DSPy LM from `task_model`. All adapters silently use the same model regardless of config.
+
+**Feature:**
+- Each adapter (`DSPyLLMAdapter`, `DSPyJudgeAdapter`, `DSPyProposerAdapter`, `DSPySyntheticAdapter`) must instantiate its own `dspy.LM` from the corresponding config field.
+- Support per-model `api_base` and `api_key_env` overrides (e.g., judge on GPT-4o, propose on a cheaper model).
+
+**Surface:** Config (already partially defined) — `judge_model`, `proposer_model`, `synth_model` become functional. No new CLI flags needed; the YAML already has the fields.
+
+**Scope:** Infrastructure layer (`llm_adapter.py`, `judge_adapter.py`, `proposer_adapter.py`, `synth_adapter.py`) + `cli/app.py` DI wiring.
+
+---
+
+## 2. Async / Parallel Execution (P1)
+
+**Current state:** All LLM calls (execute, judge, propose) are sequential. A single iteration with `minibatch_size=5` makes ~11 sequential LLM calls. Wall-clock time scales linearly with minibatch size.
+
+**Feature:**
+- Parallelize execution of the prompt across a minibatch (`asyncio.gather` or `dspy.Parallel`).
+- Parallelize judge calls within a batch.
+- Keep the proposer sequential (single call per iteration).
+
+**Surface:** Internal. Optionally exposed via `--max-concurrency` CLI flag and `max_concurrency` YAML field.
+
+**Scope:** `evaluator.py`, `judge_adapter.py`, `llm_adapter.py`.
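A minimal sketch of the bounded parallelism described in this section, assuming plain `asyncio` rather than `dspy.Parallel`. The helper `gather_bounded` and the stand-in `fake_llm_call` are hypothetical names used only for illustration; they are not part of the PROMETHEUS codebase.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def gather_bounded(
    fn: Callable[[str], Awaitable[str]],
    items: list[str],
    max_concurrency: int,
) -> list[str]:
    """Run fn over items concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> str:
        async with semaphore:
            return await fn(item)

    # asyncio.gather returns results in the same order as its inputs.
    return list(await asyncio.gather(*(guarded(item) for item in items)))


async def fake_llm_call(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call.
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(
    gather_bounded(fake_llm_call, ["a", "b", "c"], max_concurrency=2)
)
```

A semaphore caps in-flight requests, which is one way a `max_concurrency` setting could be honored without changing how callers collect results.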
+
+---
+
+## 3. Robust Error Handling & Retry (P1)
+
+**Current state:** The evolution loop catches broad `Exception` per iteration and logs it, then continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks only cover parsing, not network errors.
+
+**Feature:**
+- Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx).
+- Configurable `max_retries` and `retry_delay_base`.
+- Circuit breaker: if N consecutive iterations fail, pause and alert.
+- Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation.
+
+**Surface:** `--max-retries` CLI flag, `max_retries` Config field. `--error-strategy` (skip | retry | abort) CLI flag.
+
+**Scope:** Infrastructure adapters + evolution loop.
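One possible shape for the retry behavior listed above, as a sketch only. `TransientError` and `retry_with_backoff` are hypothetical names; a real adapter would need to map provider-specific rate-limit, timeout, and 5xx errors onto whatever exception type it retries on.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, timeout, 5xx)."""


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    retry_delay_base: float = 0.5,
) -> T:
    """Await call(), retrying transient failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return await call()
        except TransientError:
            if attempt >= max_retries:
                raise  # retry budget exhausted, surface the error
            delay = retry_delay_base * (2 ** attempt)
            # A small random jitter avoids synchronized retries.
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1


attempts = {"count": 0}


async def flaky() -> str:
    # Fails twice, then succeeds, to exercise the retry path.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


result = asyncio.run(
    retry_with_backoff(flaky, max_retries=3, retry_delay_base=0.001)
)
```

Non-transient exceptions propagate immediately, so the loop-level circuit breaker still sees them.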
+
+---
+
+## 4. Checkpoint & Resume (P2)
+
+**Current state:** If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence.
+
+**Feature:**
+- Save `OptimizationState` to disk every K iterations (or every accepted improvement).
+- Resume from the latest checkpoint file on restart.
+- Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state.
+
+**Surface:** `--checkpoint-dir` CLI flag (default: `.prometheus/checkpoints/`). `--resume` CLI flag to resume from latest checkpoint. `checkpoint_interval` Config field.
+
+**Scope:** New `CheckpointPort` in domain, `JsonCheckpointPersistence` in infrastructure, modifications to `EvolutionLoop.run()`.
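A rough sketch of JSON checkpointing with an atomic write, assuming the state can be reduced to a plain dict. The functions `save_checkpoint` and `load_latest` and the file naming scheme are illustrative assumptions, not the `JsonCheckpointPersistence` implementation; RNG state is omitted here for brevity.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, directory: Path) -> Path:
    """Write state as JSON, using a temp file plus rename to avoid torn files."""
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-padded iteration numbers make lexicographic order match numeric order.
    target = directory / f"checkpoint_{state['iteration']:05d}.json"
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    os.replace(tmp, target)  # atomic within the same directory
    return target


def load_latest(directory: Path) -> dict | None:
    """Return the checkpoint with the highest iteration, or None if absent."""
    files = sorted(directory.glob("checkpoint_*.json"))
    if not files:
        return None
    return json.loads(files[-1].read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_dir = Path(tmp_dir) / "checkpoints"
    save_checkpoint({"iteration": 5, "best_score": 0.61}, checkpoint_dir)
    save_checkpoint({"iteration": 10, "best_score": 0.74}, checkpoint_dir)
    latest = load_latest(checkpoint_dir)
```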
+
+---
+
+## 5. Population-Based Evolution (P2)
+
+**Current state:** The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The `Candidate` entity has `generation` and `parent_id` fields that suggest population support was planned.
+
+**Feature:**
+- Maintain a population of K candidates (e.g., top-K by score or Pareto front).
+- Crossover: combine instructions from two parent candidates.
+- Mutation operators: paraphrase, constrain, generalize, specialize.
+- Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance).
+
+**Surface:** `--population-size` CLI flag, `population_size` Config field. `--crossover-rate`, `--mutation-rate` CLI flags.
+
+**Scope:** `EvolutionLoop` refactor, new `CrossoverPort` and `MutationPort` in domain, new DSPy signatures for crossover/mutation in infrastructure.
+
+---
+
+## 6. Hold-Out Validation (P2)
+
+**Current state:** The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs.
+
+**Feature:**
+- Split synthetic pool into train (e.g., 70%) and validation (30%) sets.
+- Evolution uses train minibatches for accept/reject decisions.
+- After each iteration, evaluate the best candidate on the hold-out set.
+- Report both train and validation scores in results.
+- Optional early stopping if validation score degrades for K consecutive iterations.
+
+**Surface:** `--validation-split` CLI flag (default: 0.3). `--early-stop-patience` CLI flag (default: 5). Config fields: `validation_split`, `early_stop_patience`.
+
+**Scope:** `SyntheticBootstrap`, `EvolutionLoop`, `OptimizationResult` (add validation metrics).
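The patience rule can be sketched in isolation as follows. `early_stop_iteration` is a hypothetical helper; it counts a tie as "no improvement", which mirrors the strict `>` comparison used in the `EvolutionLoop.run` hunk later in this commit.

```python
from __future__ import annotations


def early_stop_iteration(
    validation_scores: list[float],
    patience: int,
    initial_best: float = -1.0,
) -> int | None:
    """Return the 1-based iteration at which early stopping fires, else None."""
    best = initial_best
    counter = 0
    for iteration, score in enumerate(validation_scores, start=1):
        if score > best:
            best = score
            counter = 0
        else:
            counter += 1  # ties count as non-improvement
        if counter >= patience:
            return iteration
    return None


scores = [0.5, 0.6, 0.6, 0.55, 0.58]
stop_at = early_stop_iteration(scores, patience=3)
never = early_stop_iteration(scores, patience=5)
```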
+
+---
+
+## 7. Custom Judge Criteria (P2)
+
+**Current state:** The judge uses a hardcoded rubric in the `JudgeOutput` DSPy signature ("score 0.0-1.0" with generic quality assessment). Users cannot customize evaluation criteria.
+
+**Feature:**
+- Allow users to define custom judge rubrics, criteria, and scoring scales.
+- Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights.
+- Allow `perfect_score` to reflect the custom scale.
+
+**Surface:** `judge_criteria` YAML field (free text). `judge_dimensions` YAML field (list of `{name, weight, description}`). CLI: `--judge-criteria` for quick overrides.
+
+**Scope:** `JudgeOutput` signature (dynamic instructions), `JudgePort`, `DSPyJudgeAdapter`, `scoring.py` (weighted aggregation).
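As an illustration of weighted aggregation across dimensions, here is a small sketch. The `Dimension` dataclass, its `max_score` field, and `aggregate` are assumptions made for the example; the roadmap only specifies `{name, weight, description}` entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float
    max_score: float = 10.0  # assumed per-dimension scale


def aggregate(scores: dict[str, float], dimensions: list[Dimension]) -> float:
    """Combine per-dimension scores into one value in [0, 1]."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("dimension weights must sum to a positive value")
    # Normalize each score by its scale, then take the weighted mean.
    weighted = sum(d.weight * scores[d.name] / d.max_score for d in dimensions)
    return weighted / total_weight


dimensions = [
    Dimension("accuracy", weight=0.5),
    Dimension("clarity", weight=0.25),
    Dimension("safety", weight=0.25),
]
combined = aggregate({"accuracy": 8, "clarity": 6, "safety": 10}, dimensions)
```

Normalizing to [0, 1] before weighting keeps the combined score comparable across rubrics with different scales.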
+
+---
+
+## 8. Real-World Evaluation Harness (P2)
+
+**Current state:** The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs.
+
+**Feature:**
+- Accept an optional evaluation dataset (CSV/JSON with `input` and `expected_output` columns).
+- When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge.
+- Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected.
+
+**Surface:** `--eval-dataset` CLI flag. `eval_dataset_path` Config field. `--eval-metric` CLI flag (exact | semantic | llm_judge).
+
+**Scope:** New `GroundTruthEvaluator` in application, new `SimilarityPort` in domain, dataset loader in infrastructure.
+
+---
+
+## 9. Logging & Observability (P2)
+
+**Current state:** Verbose mode (`-v`) configures Python's `logging` module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing.
+
+**Feature:**
+- Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
+- JSON-formatted log output for machine parsing.
+- Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff.
+- Optional OpenTelemetry export for distributed tracing.
+
+**Surface:** `-v` / `--verbose` enables INFO level. `--debug` enables DEBUG level. `--log-format` (text | json). `--log-file` for file output. Config fields: `log_level`, `log_format`, `log_file`.
+
+**Scope:** `cli/app.py` (logging setup), `evolution.py` (structured traces), new `TracingPort` in domain.
+
+---
+
+## 10. CLI Improvements (P2)
+
+**Current state:** Single `optimize` command. A known Typer 0.24 bug absorbs subcommands (Bug #3 in the Known Bugs table). No `version`, `init`, or `list-results` commands.
+
+**Feature:**
+- Fix Typer subcommand routing.
+- `prometheus version` — show version.
+- `prometheus init` — scaffold a config YAML interactively.
+- `prometheus list` — list past optimization runs.
+- `prometheus diff` — compare two result files (before/after prompt diff, score improvement).
+- `prometheus eval` — evaluate a prompt against a dataset without optimization.
+
+**Surface:** CLI subcommands.
+
+**Scope:** `cli/app.py` restructured into `cli/commands/` with one module per command.
+
+---
+
+## 11. Input Validation & Schema Enforcement (P2)
+
+**Current state:** Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline.
+
+**Feature:**
+- Validate input YAML against a Pydantic schema (leveraging the existing `pydantic` dependency).
+- Provide clear, actionable error messages for missing/invalid fields.
+- Support config migration/upgrade from older versions.
+
+**Surface:** Internal. Errors surface as clear CLI messages.
+
+**Scope:** `OptimizationConfig` converted to Pydantic model with validators, `cli/app.py` validation step before pipeline execution.
+
+---
+
+## 12. Adaptive Minibatch Sizing (P3)
+
+**Current state:** Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive.
+
+**Feature:**
+- Start with a small minibatch for quick early iterations.
+- Increase minibatch size as the prompt improves (higher confidence needed for marginal gains).
+- Shrink if too many evaluations fail (cost optimization).
+
+**Surface:** `--adaptive-minibatch` CLI flag (boolean toggle). `minibatch_size` becomes `minibatch_size_min` and `minibatch_size_max` in config.
+
+**Scope:** `EvolutionLoop`, `SyntheticBootstrap`.
+
+---
+
+## 13. Prompt Diversity Tracking (P3)
+
+**Current state:** No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change.
+
+**Feature:**
+- Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts.
+- Report diversity metrics in the result.
+- Flag stagnation (N iterations with <epsilon change).
+
+**Surface:** Internal. Reported in `OptimizationResult.history` entries.
+
+**Scope:** `EvolutionLoop`, `OptimizationResult` (add diversity field per history entry).
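A sketch of the edit-distance variant of this metric, using a standard two-row Levenshtein implementation. `normalized_change` is a hypothetical helper; a stagnation check could compare its output against an epsilon threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string to keep rows small
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_change(old: str, new: str) -> float:
    """Edit distance scaled to [0, 1] by the longer prompt's length."""
    longest = max(len(old), len(new))
    return levenshtein(old, new) / longest if longest else 0.0


distance = levenshtein("kitten", "sitting")
unchanged = normalized_change("Summarize the text.", "Summarize the text.")
```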
+
+---
+
+## 14. Temperature & Sampling Control (P3)
+
+**Current state:** No way to control LLM temperature, top_p, or other sampling parameters for any of the four model slots. DSPy defaults apply.
+
+**Feature:**
+- Per-model-slot temperature and sampling parameters.
+- Higher temperature for proposer (creativity), lower for judge (consistency).
+
+**Surface:** `task_temperature`, `judge_temperature`, `proposer_temperature`, `synth_temperature` Config fields. `--temperature` CLI flag for global override.
+
+**Scope:** `cli/app.py` (DSPy LM configuration), infrastructure adapters.
+
+---
+
+## 15. Cost Estimation & Budget Caps (P3)
+
+**Current state:** `total_llm_calls` is tracked (inaccurately). No cost estimation, no budget caps.
+
+**Feature:**
+- Estimate cost per run based on model pricing and approximate token counts.
+- Allow users to set a budget cap (`--max-cost-usd`).
+- Report estimated cost in the result.
+
+**Surface:** `--max-cost-usd` CLI flag. `max_cost_usd` Config field. Cost breakdown in result output.
+
+**Scope:** `cli/app.py`, `OptimizationResult` (add cost fields), token counting in adapters.
+
+---
+
+## 16. Multi-Objective Optimization (P3)
+
+**Current state:** Single scalar score from the judge. The `Prompt` entity comment mentions "Pareto tracking" but it's not implemented.
+
+**Feature:**
+- Optimize for multiple objectives simultaneously (quality, latency, token efficiency, safety).
+- Maintain a Pareto front of non-dominated candidates.
+- Allow users to set objective weights or constraints.
+
+**Surface:** `objectives` Config field (list of `{name, weight, judge_criteria}`). CLI: `--objective` repeatable flag.
+
+**Scope:** `EvolutionLoop` (Pareto front), `scoring.py` (multi-objective acceptance), `OptimizationResult` (Pareto set).
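For reference, a Pareto front over score tuples can be computed with a simple quadratic filter. This sketch assumes every objective is "higher is better"; `dominates` and `pareto_front` are illustrative names, and the example points are made up for the demonstration.

```python
Scores = tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better once."""
    return all(x >= y for x, y in zip(a, b)) and any(
        x > y for x, y in zip(a, b)
    )


def pareto_front(points: list[Scores]) -> list[Scores]:
    """Keep the points that no other point dominates (O(n^2) comparison)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative (quality, token_efficiency) pairs.
candidates = [(0.9, 0.2), (0.7, 0.7), (0.6, 0.6), (0.2, 0.9)]
front = pareto_front(candidates)
```

Here `(0.6, 0.6)` drops out because `(0.7, 0.7)` is better on both objectives; the other three are mutually non-dominated.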
+
+---
+
+## 17. Export Optimized Prompt (P3)
+
+**Current state:** The optimized prompt is embedded in the YAML result file. No easy way to extract it for use.
+
+**Feature:**
+- `prometheus export` command to extract the optimized prompt as plain text.
+- Support multiple export formats: plain text, Markdown, JSON, LangChain template, DSPy module.
+- Copy to clipboard option.
+
+**Surface:** `prometheus export --format <txt|md|json|langchain|dspy>` CLI subcommand. `--clipboard` flag.
+
+**Scope:** New `cli/commands/export.py`, format renderers in infrastructure.
+
+---
+
+## 18. Config Profiles / Presets (P3)
+
+**Current state:** Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured.
+
+**Feature:**
+- Named profiles: `fast`, `thorough`, `economy`, `research`.
+- Profile overrides individual config fields.
+- User-defined profiles stored in `~/.prometheus/profiles/`.
+
+**Surface:** `--profile` CLI flag. `prometheus profile list` / `prometheus profile create` subcommands.
+
+**Scope:** `cli/app.py`, new `ProfileManager` in application.
+
+---
+
+## Summary Table
+
+| # | Feature | Priority | CLI Surface | Config Surface | Estimated Scope |
+|---|---------|----------|-------------|----------------|-----------------|
+| 1 | Multi-Model Routing | P1 | Existing | Existing | Small |
+| 2 | Async / Parallel Execution | P1 | `--max-concurrency` | `max_concurrency` | Medium |
+| 3 | Error Handling & Retry | P1 | `--max-retries`, `--error-strategy` | `max_retries`, `error_strategy` | Medium |
+| 4 | Checkpoint & Resume | P2 | `--checkpoint-dir`, `--resume` | `checkpoint_interval` | Medium |
+| 5 | Population-Based Evolution | P2 | `--population-size`, `--crossover-rate`, `--mutation-rate` | `population_size`, `crossover_rate` | Large |
+| 6 | Hold-Out Validation | P2 | `--validation-split`, `--early-stop-patience` | `validation_split`, `early_stop_patience` | Medium |
+| 7 | Custom Judge Criteria | P2 | `--judge-criteria` | `judge_criteria`, `judge_dimensions` | Medium |
+| 8 | Real-World Eval Harness | P2 | `--eval-dataset`, `--eval-metric` | `eval_dataset_path` | Large |
+| 9 | Logging & Observability | P2 | `-v`, `--debug`, `--log-format`, `--log-file` | `log_level`, `log_format`, `log_file` | Medium |
+| 10 | CLI Improvements | P2 | Subcommands | — | Medium |
+| 11 | Input Validation | P2 | — (error messages) | — | Small |
+| 12 | Adaptive Minibatch | P3 | `--adaptive-minibatch` | `minibatch_size_min/max` | Small |
+| 13 | Prompt Diversity Tracking | P3 | — | — | Small |
+| 14 | Temperature & Sampling | P3 | `--temperature` | `*_temperature` | Small |
+| 15 | Cost Estimation | P3 | `--max-cost-usd` | `max_cost_usd` | Small |
+| 16 | Multi-Objective Optimization | P3 | `--objective` | `objectives` | Large |
+| 17 | Export Optimized Prompt | P3 | `prometheus export` | — | Small |
+| 18 | Config Profiles / Presets | P3 | `--profile` | — | Small |
+
+---
+
+## Known Bugs (from TEST_REPORT.md and code review)
+
+| # | Bug | Severity | File |
+|---|-----|----------|------|
+| 1 | Multi-model config not wired — all adapters use single global LM | HIGH | `cli/app.py`, all adapters |
+| 2 | `DSPyLLMAdapter` accepts `model` param but never uses it | HIGH | `infrastructure/llm_adapter.py` |
+| 3 | CLI subcommand `optimize` absorbed by Typer 0.24 | HIGH | `cli/app.py` |
+| 4 | Verbose logging produces no output — no handler configured | MEDIUM | `cli/app.py` |
+| 5 | `total_llm_calls` counter is inaccurate | LOW | `application/use_cases.py`, `evolution.py` |
+| 6 | `normalize_score()` is dead code — never called | LOW | `domain/scoring.py` |
+| 7 | `AppSettings` is never imported or used | LOW | `config.py` |
+| 8 | No LLM error handling in evolution loop | MEDIUM | `evolution.py` |
+| 9 | Unpinned dependencies (dspy, typer) | LOW | `pyproject.toml` |
+
+---
+
+## Test Coverage Gaps
+
+| Area | Current | Needed |
+|------|---------|--------|
+| CLI commands | 0 tests | Unit + integration for each subcommand |
+| Config validation | 0 tests | Schema validation, missing fields, type errors |
+| Evolution loop | 3 tests (single iteration each) | Multi-iteration, mixed accept/reject, failure recovery |
+| Integration pipeline | 1 test (happy path only) | Error paths, mixed results, real adapters |
+| Adapter coverage | 1 adapter tested | All 4 adapters + error scenarios |
+| Use case orchestration | 1 indirect test | Direct unit tests for `OptimizePromptUseCase` |
+
+---
+
+## Recommended Implementation Order
+
+### Phase 1 — Production Reliability (P1)
+1. Fix multi-model routing (#1) — highest impact, smallest scope
+2. Add error handling & retry (#3) — essential for production runs
+3. Implement async/parallel execution (#2) — biggest wall-clock improvement
+
+### Phase 2 — Optimization Quality (P2)
+4. Input validation (#11) — small scope, high reliability gain
+5. Logging & observability (#9) — enables debugging long runs
+6. CLI improvements (#10) — fix Typer bug, add basic commands
+7. Hold-out validation (#6) — prevents overfitting
+8. Checkpoint & resume (#4) — essential for long runs
+9. Custom judge criteria (#7) — enables domain-specific optimization
+
+### Phase 3 — Advanced Features (P3)
+10. Population-based evolution (#5; P2, large scope)
+11. Real-world eval harness (#8; P2, large scope)
+12. Remaining P3 features as demand dictates
@@ -5,12 +5,12 @@ description = "Prompt evolution without reference data"
 readme = "README.md"
 requires-python = ">=3.12"
 dependencies = [
-    "dspy>=2.6,<3.0",
-    "typer>=0.15,<0.20",
-    "pydantic>=2.10",
-    "pydantic-settings>=2.7",
-    "pyyaml>=6.0",
-    "rich>=13.9",
+    "dspy==2.6.27",
+    "typer==0.19.2",
+    "pydantic==2.12.5",
+    "pydantic-settings==2.13.1",
+    "pyyaml==6.0.3",
+    "rich==14.3.3",
 ]

 [project.optional-dependencies]
@@ -46,6 +46,6 @@ module = ["dspy", "dspy.*"]
 ignore_missing_imports = true

 [[tool.mypy.overrides]]
-module = ["prometheus.infrastructure.*", "prometheus.cli.app"]
+module = ["prometheus.infrastructure.*", "prometheus.cli.app", "prometheus.cli.commands.*"]
 disable_error_code = ["misc", "import-untyped"]

@@ -22,6 +22,24 @@ class SyntheticBootstrap:
        self._generator = generator
        self._rng = random.Random(seed)

+    @staticmethod
+    def split_pool(
+        pool: list[SyntheticExample],
+        validation_fraction: float,
+        rng: random.Random | None = None,
+    ) -> tuple[list[SyntheticExample], list[SyntheticExample]]:
+        """Split *pool* into (train, validation) sets.
+
+        Returns (pool, []) when *validation_fraction* is <= 0 or *pool* has fewer than 2 items.
+        """
+        if validation_fraction <= 0.0 or len(pool) < 2:
+            return pool, []
+        n_val = min(len(pool) - 1, max(1, int(len(pool) * validation_fraction)))  # keep >= 1 train item
+        shuffled = list(pool)
+        _rng = rng or random.Random(42)
+        _rng.shuffle(shuffled)
+        return shuffled[:-n_val], shuffled[-n_val:]
+
    def run(self, task_description: str, n_examples: int) -> list[SyntheticExample]:
        """Generate the synthetic pool in a single call.

@@ -16,6 +16,7 @@ from prometheus.domain.entities import (
    Trajectory,
 )
 from prometheus.domain.ports import JudgePort, LLMPort
+from prometheus.domain.scoring import normalize_score

 logger = logging.getLogger(__name__)

@@ -72,6 +73,7 @@ class PromptEvaluator:
        trajectories: list[Trajectory] = []
        for i, (example, output) in enumerate(zip(minibatch, outputs)):
            score, feedback = judge_results[i]
+            score = normalize_score(score)
            scores.append(score)
            feedbacks.append(feedback)
            trajectories.append(
@@ -2,24 +2,33 @@
 Evolution loop — core PROMETHEUS engine.

 Orchestrates the select → evaluate → propose → accept cycle.
-Equivalent to GEPAEngine.run(), adapted to work without a valset.
+Supports two modes:
+- Single-candidate hill climbing (population_size=1, backward compat)
+- Population-based evolution with crossover & mutation (population_size>1)
 """
 from __future__ import annotations

 import logging
+import random

 from prometheus.application.bootstrap import SyntheticBootstrap
 from prometheus.application.evaluator import PromptEvaluator
+from prometheus.cli.logging_setup import get_logger
 from prometheus.domain.entities import (
    Candidate,
    OptimizationState,
    Prompt,
    SyntheticExample,
 )
-from prometheus.domain.ports import ProposerPort
+from prometheus.domain.ports import (
+    CheckpointPort,
+    CrossoverPort,
+    MutationPort,
+    ProposerPort,
+)
 from prometheus.domain.scoring import should_accept

-logger = logging.getLogger(__name__)
+logger = get_logger("evolution")


 class CircuitBreakerOpen(Exception):
@@ -30,9 +39,9 @@ class EvolutionLoop:
    """Main evolution loop.

    Design:
-    - Keeps only the best candidate (no full population).
-    - Simplifies vs GEPA (no Pareto, no merge).
-    - Population support deferred to v2.
+    - population_size=1: classic single-candidate hill climbing (backward compat).
+    - population_size>1: population-based evolution with crossover, mutation,
+      and diversity maintenance.

    Error handling:
    - Transient errors are retried by adapters.
@@ -51,6 +60,17 @@ class EvolutionLoop:
        verbose: bool = False,
        circuit_breaker_threshold: int = 5,
        error_strategy: str = "retry",
+        checkpoint_port: CheckpointPort | None = None,
+        checkpoint_interval: int = 5,
+        # --- Population-based evolution params ---
+        population_size: int = 1,
+        crossover_rate: float = 0.5,
+        mutation_rate: float = 0.3,
+        diversity_penalty: float = 0.1,
+        crossover_port: CrossoverPort | None = None,
+        mutation_port: MutationPort | None = None,
+        # --- Hold-out validation params ---
+        early_stop_patience: int = 5,
    ):
        self._evaluator = evaluator
        self._proposer = proposer
@@ -61,18 +81,44 @@ class EvolutionLoop:
        self._verbose = verbose
        self._circuit_breaker_threshold = circuit_breaker_threshold
        self._error_strategy = error_strategy
+        self._checkpoint_port = checkpoint_port
+        self._checkpoint_interval = checkpoint_interval
+        self._population_size = population_size
+        self._crossover_rate = crossover_rate
+        self._mutation_rate = mutation_rate
+        self._diversity_penalty = diversity_penalty
+        self._crossover_port = crossover_port
+        self._mutation_port = mutation_port
+        self._early_stop_patience = early_stop_patience

    async def run(
        self,
        seed_prompt: Prompt,
        synthetic_pool: list[SyntheticExample],
        task_description: str,
+        initial_state: OptimizationState | None = None,
+        validation_pool: list[SyntheticExample] | None = None,
    ) -> OptimizationState:
-        """Execute the complete evolution loop."""
-        state = OptimizationState()
+        """Execute the complete evolution loop.
+
+        If *initial_state* is provided (from a checkpoint), resume from that
+        point — skipping the seed evaluation and continuing at the saved iteration.
+
+        If *validation_pool* is provided (non-empty), the best candidate is
+        evaluated on the hold-out set after each iteration and early stopping
+        is applied when the validation score fails to improve for ``early_stop_patience``
+        consecutive iterations.
+        """
+        state = initial_state or OptimizationState()
        consecutive_failures = 0

-        # Evaluate the seed
+        # Hold-out validation tracking
+        has_validation = bool(validation_pool)
+        best_validation_score: float = -1.0
+        validation_patience_counter: int = 0
+
+        # Only evaluate the seed when starting fresh (no checkpoint resume)
+        if initial_state is None:
            initial_batch = self._bootstrap.sample_minibatch(
                synthetic_pool, self._minibatch_size
            )
@@ -81,31 +127,162 @@ class EvolutionLoop:
            )
            state.total_llm_calls += 2 * self._minibatch_size  # N executions + N judge calls

-        best_candidate = Candidate(
+            seed_candidate = Candidate(
                prompt=seed_prompt,
                best_score=initial_eval.total_score,
                generation=0,
            )
-        state.best_candidate = best_candidate
-        state.candidates.append(best_candidate)
-        self._log(f"Initial score: {initial_eval.total_score:.2f}")
+            state.best_candidate = seed_candidate
+            state.candidates.append(seed_candidate)
+            logger.info(
+                "Initial evaluation complete",
+                extra={
+                    "structured": {
+                        "event": "initial_eval",
+                        "score": round(initial_eval.total_score, 4),
+                        "minibatch_size": self._minibatch_size,
+                        "sample_ids": [ex.id for ex in initial_batch],
+                    },
+                },
+            )
+
+            # Evaluate seed on validation set
+            if has_validation and state.best_candidate is not None:
+                val_eval = await self._evaluator.evaluate(
+                    state.best_candidate.prompt, validation_pool, task_description
+                )
+                state.total_llm_calls += 2 * len(validation_pool)
+                best_validation_score = val_eval.mean_score
+                logger.info(
+                    "Initial validation evaluation",
+                    extra={
+                        "structured": {
+                            "event": "validation_eval",
+                            "iteration": 0,
+                            "validation_score": round(best_validation_score, 4),
+                            "validation_pool_size": len(validation_pool),
+                        },
+                    },
+                )
+
+            # Population initialization: seed the population with mutations
+            if self._population_size > 1:
+                await self._initialize_population(
+                    state, seed_prompt, seed_candidate, task_description
+                )
+        else:
+            logger.info(
+                "Resuming from checkpoint",
+                extra={
+                    "structured": {
+                        "event": "resume",
+                        "iteration": state.iteration,
+                        "total_llm_calls": state.total_llm_calls,
+                    },
+                },
+            )
+            # Restore validation tracking from state history
+            if has_validation:
+                for entry in reversed(state.history):
+                    if entry.get("event") == "validation_eval":
+                        best_validation_score = entry.get("best_validation_score", -1.0)
+                        validation_patience_counter = entry.get("validation_patience", 0)
+                        break
+
+        # Determine starting iteration
+        start_iteration = state.iteration + 1

        # Main loop
-        for i in range(1, self._max_iterations + 1):
+        for i in range(start_iteration, self._max_iterations + 1):
            state.iteration = i

            try:
-                await self._run_iteration(
-                    i, state, best_candidate, synthetic_pool, task_description
+                if self._population_size > 1 and len(state.candidates) > 1:
+                    await self._run_population_iteration(
+                        i, state, synthetic_pool, task_description
+                    )
+                else:
+                    await self._run_single_iteration(
+                        i, state, synthetic_pool, task_description
                    )
-                # Update best_candidate from state after successful iteration
-                best_candidate = state.best_candidate  # type: ignore[assignment]
                consecutive_failures = 0

+                # Hold-out validation: evaluate best candidate on validation set
+                if has_validation and state.best_candidate is not None:
+                    val_eval = await self._evaluator.evaluate(
+                        state.best_candidate.prompt, validation_pool, task_description
+                    )
+                    state.total_llm_calls += 2 * len(validation_pool)
+                    current_val_score = val_eval.mean_score
+
+                    if current_val_score > best_validation_score:
+                        best_validation_score = current_val_score
+                        validation_patience_counter = 0
+                    else:
+                        validation_patience_counter += 1
+
+                    state.history.append({
+                        "iteration": i,
+                        "event": "validation_eval",
+                        "validation_score": round(current_val_score, 4),
+                        "best_validation_score": round(best_validation_score, 4),
+                        "validation_patience": validation_patience_counter,
+                    })
+
+                    logger.info(
+                        "Validation evaluation",
+                        extra={
+                            "structured": {
+                                "event": "validation_eval",
+                                "iteration": i,
+                                "validation_score": round(current_val_score, 4),
+                                "best_validation_score": round(best_validation_score, 4),
+                                "patience": f"{validation_patience_counter}/{self._early_stop_patience}",
+                            },
+                        },
+                    )
+
+                    if validation_patience_counter >= self._early_stop_patience:
+                        logger.warning(
+                            "Early stopping triggered — validation score did not improve for %d iterations",
+                            self._early_stop_patience,
+                            extra={
+                                "structured": {
+                                    "event": "early_stop",
+                                    "iteration": i,
+                                    "best_validation_score": round(best_validation_score, 4),
+                                    "patience": self._early_stop_patience,
+                                },
+                            },
+                        )
+                        state.history.append({
+                            "iteration": i,
+                            "event": "early_stop",
+                            "best_validation_score": round(best_validation_score, 4),
+                            "patience": self._early_stop_patience,
+                        })
+                        state.best_validation_score = best_validation_score
+                        state.early_stopped = True
+                        if self._checkpoint_port is not None:
+                            self._checkpoint_port.save(state)
+                        break
+
+                # Checkpoint on accepted improvement (detected via state change)
+                self._maybe_checkpoint(state)
+
            except Exception as exc:
                consecutive_failures += 1
-                self._log(
-                    f"Iter {i}: ERROR ({consecutive_failures} consecutive) — {exc}"
+                logger.error(
+                    "Iteration error",
+                    extra={
+                        "structured": {
+                            "event": "iteration_error",
+                            "iteration": i,
+                            "consecutive_failures": consecutive_failures,
+                            "error": str(exc),
+                        },
+                    },
+                    exc_info=True,
                )
                state.history.append(
                    {
@@ -118,9 +295,16 @@ class EvolutionLoop:

                # Check circuit breaker
                if consecutive_failures >= self._circuit_breaker_threshold:
-                    self._log(
-                        f"Circuit breaker tripped after {consecutive_failures} "
-                        f"consecutive failures."
+                    logger.warning(
+                        "Circuit breaker tripped",
+                        extra={
+                            "structured": {
+                                "event": "circuit_breaker",
+                                "iteration": i,
+                                "consecutive_failures": consecutive_failures,
+                                "error_strategy": self._error_strategy,
+                            },
+                        },
                    )
                    state.history.append(
                        {
@@ -134,7 +318,9 @@ class EvolutionLoop:
                            f"Circuit breaker tripped after "
                            f"{consecutive_failures} consecutive failures"
                        ) from exc
-                    # skip / retry strategies: stop the loop gracefully
+                    # skip / retry strategies: save checkpoint, then stop the loop gracefully
+                    if self._checkpoint_port is not None:
+                        self._checkpoint_port.save(state)
                    break

                if self._error_strategy == "abort":
@@ -142,21 +328,77 @@ class EvolutionLoop:
                # skip / retry: continue to next iteration
                continue

+        # Store final validation metadata on state
+        if has_validation:
+            state.best_validation_score = best_validation_score
+
        return state

-    async def _run_iteration(
+    # ------------------------------------------------------------------
+    # Population initialization
+    # ------------------------------------------------------------------
+
+    async def _initialize_population(
+        self,
+        state: OptimizationState,
+        seed_prompt: Prompt,
+        seed_candidate: Candidate,
+        task_description: str,
+    ) -> None:
+        """Fill the population with mutated variants of the seed prompt."""
+        n_needed = self._population_size - 1
+        mutation_types = ["paraphrase", "constrain", "generalize", "specialize"]
+
+        for idx in range(n_needed):
+            mutation_type = mutation_types[idx % len(mutation_types)]
+
+            if self._mutation_port is not None:
+                new_prompt = await self._mutation_port.mutate(
+                    seed_prompt, task_description, mutation_type
+                )
+            else:
+                # Fallback: use proposer for reflective mutation
+                new_prompt = await self._proposer.propose(
+                    seed_prompt, [], task_description
+                )
+            state.total_llm_calls += 1
+            new_candidate = Candidate(
+                prompt=new_prompt,
+                best_score=seed_candidate.best_score,  # estimate until evaluated
+                generation=0,
+                parent_id=id(seed_candidate),
+            )
+            state.candidates.append(new_candidate)
+
+        logger.info(
+            "Population initialized",
+            extra={
+                "structured": {
+                    "event": "population_init",
+                    "population_size": len(state.candidates),
+                },
+            },
+        )
+
+    # ------------------------------------------------------------------
+    # Single-candidate iteration (original hill-climbing)
+    # ------------------------------------------------------------------
+
+    async def _run_single_iteration(
        self,
        i: int,
        state: OptimizationState,
-        best_candidate: Candidate,
        synthetic_pool: list[SyntheticExample],
        task_description: str,
    ) -> None:
-        """Execute a single iteration. Mutates *state* in-place."""
+        """Execute a single-candidate iteration. Mutates *state* in-place."""
+        best_candidate = state.best_candidate  # type: ignore[assignment]
+
        # 1. Sample a fresh minibatch
        batch = self._bootstrap.sample_minibatch(
            synthetic_pool, self._minibatch_size
        )
+        sample_ids = [ex.id for ex in batch]

        # 2. Evaluate the current candidate
        current_eval = await self._evaluator.evaluate(
@@ -164,9 +406,31 @@ class EvolutionLoop:
        )
        state.total_llm_calls += 2 * self._minibatch_size

+        logger.debug(
+            "Iteration minibatch evaluated",
+            extra={
+                "structured": {
+                    "event": "minibatch_eval",
+                    "iteration": i,
+                    "sample_ids": sample_ids,
+                    "scores": [round(s, 4) for s in current_eval.scores],
+                    "total_score": round(current_eval.total_score, 4),
+                },
+            },
+        )
+
        # 3. Skip if perfect
        if all(s >= self._perfect_score for s in current_eval.scores):
-            self._log(f"Iter {i}: All scores perfect, skipping.")
+            logger.info(
+                "Iteration skipped — all scores perfect",
+                extra={
+                    "structured": {
+                        "event": "skip_perfect",
+                        "iteration": i,
+                        "total_score": round(current_eval.total_score, 4),
+                    },
+                },
+            )
            state.history.append(
                {
                    "iteration": i,
@@ -177,12 +441,26 @@ class EvolutionLoop:
            return

        # 4. Propose a new prompt (reflective mutation) — sequential
+        state.total_llm_calls += 1
        new_prompt = await self._proposer.propose(
            best_candidate.prompt,
            current_eval.trajectories,
            task_description,
        )
-        state.total_llm_calls += 1  # 1 proposition call
+
+        prompt_diff = self._compute_prompt_diff(
+            best_candidate.prompt.text, new_prompt.text
+        )
+        logger.debug(
+            "Proposed new prompt",
+            extra={
+                "structured": {
+                    "event": "proposer_output",
+                    "iteration": i,
+                    "prompt_diff": prompt_diff,
+                },
+            },
+        )

        # 5. Evaluate the new prompt on the same minibatch
        new_eval = await self._evaluator.evaluate(
@@ -200,9 +478,22 @@ class EvolutionLoop:
            )
            state.best_candidate = new_candidate
            state.candidates.append(new_candidate)
-            self._log(
-                f"Iter {i}: ACCEPTED "
-                f"({current_eval.total_score:.2f} -> {new_eval.total_score:.2f})"
+            logger.info(
+                "Iteration accepted",
+                extra={
+                    "structured": {
+                        "event": "accepted",
+                        "iteration": i,
+                        "old_score": round(current_eval.total_score, 4),
+                        "new_score": round(new_eval.total_score, 4),
+                        "improvement": round(
+                            new_eval.total_score - current_eval.total_score, 4
+                        ),
+                        "sample_ids": sample_ids,
+                        "new_scores": [round(s, 4) for s in new_eval.scores],
+                        "prompt_diff": prompt_diff,
+                    },
+                },
            )
            state.history.append(
                {
@@ -215,9 +506,19 @@ class EvolutionLoop:
                }
            )
        else:
-            self._log(
-                f"Iter {i}: REJECTED "
-                f"({new_eval.total_score:.2f} <= {current_eval.total_score:.2f})"
+            logger.info(
+                "Iteration rejected",
+                extra={
+                    "structured": {
+                        "event": "rejected",
+                        "iteration": i,
+                        "old_score": round(current_eval.total_score, 4),
+                        "new_score": round(new_eval.total_score, 4),
+                        "sample_ids": sample_ids,
+                        "new_scores": [round(s, 4) for s in new_eval.scores],
+                        "prompt_diff": prompt_diff,
+                    },
+                },
            )
            state.history.append(
                {
@@ -228,6 +529,213 @@ class EvolutionLoop:
                }
            )

-    def _log(self, msg: str) -> None:
-        if self._verbose:
-            logger.info("[PROMETHEUS] %s", msg)
+    # ------------------------------------------------------------------
+    # Population-based iteration
+    # ------------------------------------------------------------------
+
+    async def _run_population_iteration(
+        self,
+        i: int,
+        state: OptimizationState,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+    ) -> None:
+        """Execute a population-based iteration. Mutates *state* in-place."""
+        population = state.candidates
+
+        # 1. Sample a fresh minibatch
+        batch = self._bootstrap.sample_minibatch(
+            synthetic_pool, self._minibatch_size
+        )
+        sample_ids = [ex.id for ex in batch]
+
+        # 2. Select two parents via tournament selection
+        parent_a = self._tournament_select(population)
+        parent_b = self._tournament_select(population)
+
+        # 3. Generate child: crossover or reflective mutation
+        use_crossover = (
+            random.random() < self._crossover_rate
+            and self._crossover_port is not None
+        )
+        if use_crossover:
+            state.total_llm_calls += 1
+            child_prompt = await self._crossover_port.crossover(
+                parent_a.prompt, parent_b.prompt, task_description
+            )
+            origin = "crossover"
+        else:
+            # Reflective mutation: evaluate a parent, propose improvement
+            state.total_llm_calls += 2 * self._minibatch_size
+            parent_eval = await self._evaluator.evaluate(
+                parent_a.prompt, batch, task_description
+            )
+            state.total_llm_calls += 1
+            child_prompt = await self._proposer.propose(
+                parent_a.prompt,
+                parent_eval.trajectories,
+                task_description,
+            )
+            origin = "reflective"
+
+        # 4. Optional mutation
+        if random.random() < self._mutation_rate and self._mutation_port is not None:
+            mutation_type = random.choice(
+                ["paraphrase", "constrain", "generalize", "specialize"]
+            )
+            state.total_llm_calls += 1
+            child_prompt = await self._mutation_port.mutate(
+                child_prompt, task_description, mutation_type
+            )
+            origin += f"+mutation({mutation_type})"
+
+        # 5. Evaluate the child
+        state.total_llm_calls += 2 * self._minibatch_size
+        child_eval = await self._evaluator.evaluate(
+            child_prompt, batch, task_description
+        )
+
+        # 6. Compute fitness with diversity penalty
+        child_score = child_eval.total_score
+        diversity_sim = self._compute_diversity_score(
+            child_prompt, population
+        )
+        child_fitness = child_score - self._diversity_penalty * (1.0 - diversity_sim)
+
+        # 7. Find the lowest-scoring candidate (by raw score) and replace it if the child's fitness is higher
+        worst_idx = min(
+            range(len(population)),
+            key=lambda idx: population[idx].best_score,
+        )
+        worst_fitness = (
+            population[worst_idx].best_score
+            - self._diversity_penalty * (1.0 - self._compute_diversity_score(
+                population[worst_idx].prompt, population
+            ))
+        )
+
+        accepted = child_fitness > worst_fitness
+
+        if accepted:
+            new_candidate = Candidate(
+                prompt=child_prompt,
+                best_score=child_score,
+                generation=i,
+                parent_id=id(parent_a),
+            )
+            population[worst_idx] = new_candidate
+
+            # Update best if this child is the new best
+            if child_score > (state.best_candidate.best_score if state.best_candidate else 0):
+                state.best_candidate = new_candidate
+
+            logger.info(
+                "Population iteration accepted",
+                extra={
+                    "structured": {
+                        "event": "pop_accepted",
+                        "iteration": i,
+                        "origin": origin,
+                        "child_score": round(child_score, 4),
+                        "child_fitness": round(child_fitness, 4),
+                        "diversity_sim": round(diversity_sim, 4),
+                        "replaced_idx": worst_idx,
+                        "sample_ids": sample_ids,
+                    },
+                },
+            )
+            state.history.append(
+                {
+                    "iteration": i,
+                    "event": "pop_accepted",
+                    "origin": origin,
+                    "child_score": child_score,
+                    "child_fitness": child_fitness,
+                    "diversity_sim": diversity_sim,
+                }
+            )
+        else:
+            logger.info(
+                "Population iteration rejected",
+                extra={
+                    "structured": {
+                        "event": "pop_rejected",
+                        "iteration": i,
+                        "origin": origin,
+                        "child_score": round(child_score, 4),
+                        "child_fitness": round(child_fitness, 4),
+                        "worst_fitness": round(worst_fitness, 4),
+                        "sample_ids": sample_ids,
+                    },
+                },
+            )
+            state.history.append(
+                {
+                    "iteration": i,
+                    "event": "pop_rejected",
+                    "origin": origin,
+                    "child_score": child_score,
+                    "child_fitness": child_fitness,
+                    "worst_fitness": worst_fitness,
+                }
+            )
+
+    # ------------------------------------------------------------------
+    # Selection and diversity helpers
+    # ------------------------------------------------------------------
+
+    def _tournament_select(
+        self,
+        population: list[Candidate],
+        tournament_size: int = 3,
+    ) -> Candidate:
+        """Tournament selection: pick the best from a random subset."""
+        k = min(tournament_size, len(population))
+        contestants = random.sample(population, k)
+        return max(contestants, key=lambda c: c.best_score)
+
+    @staticmethod
+    def _compute_diversity_score(
+        prompt: Prompt,
+        population: list[Candidate],
+    ) -> float:
+        """Compute the average Jaccard similarity between *prompt* and all
+        population members (higher = less diverse). Returns 1.0 when the
+        population has at most one member, so no diversity penalty applies."""
+        if len(population) <= 1:
+            return 1.0
+
+        prompt_words = set(prompt.text.lower().split())
+        if not prompt_words:
+            return 0.0
+
+        similarities: list[float] = []
+        for candidate in population:
+            other_words = set(candidate.prompt.text.lower().split())
+            if not other_words:
+                continue
+            intersection = prompt_words & other_words
+            union = prompt_words | other_words
+            sim = len(intersection) / len(union) if union else 0.0
+            similarities.append(sim)
+
+        # Average similarity (lower = more diverse)
+        return sum(similarities) / len(similarities) if similarities else 0.0
+
+    @staticmethod
+    def _compute_prompt_diff(old: str, new: str) -> dict[str, int]:
+        """Compute a simple diff summary between two prompts."""
+        old_lines = set(old.splitlines())
+        new_lines = set(new.splitlines())
+        return {
+            "lines_added": len(new_lines - old_lines),
+            "lines_removed": len(old_lines - new_lines),
+            "chars_delta": len(new) - len(old),
+        }
+
+    def _maybe_checkpoint(self, state: OptimizationState) -> None:
+        """Save a checkpoint when the iteration count is a multiple of the checkpoint interval."""
+        if self._checkpoint_port is None:
+            return
+        if state.iteration % self._checkpoint_interval == 0:
+            self._checkpoint_port.save(state)
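Review note: the selection and fitness helpers above can be exercised in isolation. The following is a minimal sketch, assuming a hypothetical `Cand` stand-in for `Candidate` and free functions (`jaccard`, `mean_similarity`, `tournament_select`, `fitness`) that are not part of the commit. It mirrors the word-level Jaccard similarity, the tournament pick, and the fitness formula as they are written in the diff.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Cand:
    """Hypothetical stand-in for Candidate: prompt text plus a score."""
    text: str
    best_score: float


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two prompt texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def mean_similarity(text: str, population: list[Cand]) -> float:
    """Average similarity of *text* to the population; 1.0 if <= 1 member."""
    if len(population) <= 1:
        return 1.0
    sims = [jaccard(text, c.text) for c in population if c.text.split()]
    return sum(sims) / len(sims) if sims else 0.0


def tournament_select(
    population: list[Cand], k: int = 3, rng: random.Random | None = None
) -> Cand:
    """Pick the best-scoring candidate from a random subset of size k."""
    rng = rng if rng is not None else random.Random()
    contestants = rng.sample(population, min(k, len(population)))
    return max(contestants, key=lambda c: c.best_score)


def fitness(score: float, similarity: float, penalty: float) -> float:
    """Fitness as written in the diff: score - penalty * (1 - similarity)."""
    return score - penalty * (1.0 - similarity)
```

Under the formula as written, a similarity of 1.0 yields no penalty and a lower similarity yields a larger one, so a child that is less similar to the population ends up with lower fitness. Whether that direction is intended is worth confirming against the roadmap's description of the diversity penalty.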
@@ -0,0 +1,116 @@
+"""
+Ground-truth evaluator — execution + similarity comparison.
+
+Produces a quality signal *with* ground truth by comparing model outputs
+against expected outputs using a configurable similarity metric.
+"""
+from __future__ import annotations
+
+import asyncio
+import logging
+
+from prometheus.domain.entities import (
+    EvalResult,
+    GroundTruthExample,
+    Prompt,
+    Trajectory,
+)
+from prometheus.domain.ports import LLMPort, SimilarityPort
+
+logger = logging.getLogger(__name__)
+
+
+class GroundTruthEvaluator:
+    """Evaluates a prompt against a ground-truth dataset.
+
+    Pipeline: execute → compare with similarity metric → build trajectories.
+    Unlike PromptEvaluator (which uses LLM-as-Judge), this compares outputs
+    directly against known-good expected outputs.
+    """
+
+    def __init__(
+        self,
+        executor: LLMPort,
+        similarity: SimilarityPort,
+        max_concurrency: int = 5,
+    ):
+        self._executor = executor
+        self._similarity = similarity
+        self._semaphore = asyncio.Semaphore(max_concurrency)
+
+    async def evaluate(
+        self,
+        prompt: Prompt,
+        dataset: list[GroundTruthExample],
+    ) -> EvalResult:
+        """Evaluate the prompt on the ground-truth dataset.
+
+        Steps:
+        1. Execute the prompt on each input (parallel, bounded)
+        2. Compare each output against expected using similarity metric
+        3. Build trajectories with feedback
+        """
+        # Step 1: Parallel execution (per-item isolation)
+        output_coros = [
+            self._execute_single(prompt, example) for example in dataset
+        ]
+        outputs = await asyncio.gather(*output_coros)
+
+        # Step 2: Compute similarity scores
+        scores: list[float] = []
+        feedbacks: list[str] = []
+        trajectories: list[Trajectory] = []
+
+        for example, output in zip(dataset, outputs):
+            score = self._similarity.compute(output, example.expected_output)
+            score = max(0.0, min(1.0, score))  # clamp to [0, 1]
+            scores.append(score)
+            feedback = self._build_feedback(output, example.expected_output, score)
+            feedbacks.append(feedback)
+            trajectories.append(
+                Trajectory(
+                    input_text=example.input_text,
+                    output_text=output,
+                    score=score,
+                    feedback=feedback,
+                    prompt_used=prompt.text,
+                )
+            )
+
+        logger.info(
+            "Ground-truth evaluation complete: %d items, mean_score=%.4f",
+            len(dataset),
+            sum(scores) / len(scores) if scores else 0.0,
+        )
+
+        return EvalResult(
+            scores=scores,
+            feedbacks=feedbacks,
+            trajectories=trajectories,
+        )
+
+    async def _execute_single(
+        self, prompt: Prompt, example: GroundTruthExample
+    ) -> str:
+        async with self._semaphore:
+            try:
+                return await self._executor.execute(prompt, example.input_text)
+            except Exception as exc:
+                logger.warning(
+                    "Execution failed for input '%s…': %s",
+                    example.input_text[:40],
+                    exc,
+                )
+                return f"[execution error: {exc}]"
+
+    @staticmethod
+    def _build_feedback(output: str, expected: str, score: float) -> str:
+        """Build human-readable feedback for a ground-truth comparison."""
+        if score >= 0.99:
+            return "Exact match."
+        elif score >= 0.7:
+            return f"Close match (score={score:.2f}). Expected: {expected[:100]}"
+        elif score >= 0.3:
+            return f"Partial match (score={score:.2f}). Expected: {expected[:100]}"
+        else:
+            return f"Poor match (score={score:.2f}). Expected: {expected[:100]}"
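Review note: `GroundTruthEvaluator` only requires that its similarity dependency expose `compute(output, expected)` and return a float. The sketch below is one possible implementation, assuming a hypothetical `SequenceSimilarity` class built on the standard library's `difflib.SequenceMatcher`; the commit's actual `SimilarityPort` adapters are not shown in this hunk. The `feedback_band` helper is also hypothetical and only restates the thresholds used by `_build_feedback`.

```python
from difflib import SequenceMatcher


class SequenceSimilarity:
    """Hypothetical SimilarityPort-style metric (not taken from the commit)."""

    def compute(self, output: str, expected: str) -> float:
        # Normalize case and whitespace so formatting noise does not lower the score.
        a = " ".join(output.lower().split())
        b = " ".join(expected.lower().split())
        if not a and not b:
            return 1.0  # two empty strings are treated as identical
        return SequenceMatcher(None, a, b).ratio()


def feedback_band(score: float) -> str:
    """Label a score using the same thresholds as _build_feedback."""
    if score >= 0.99:
        return "exact"
    if score >= 0.7:
        return "close"
    if score >= 0.3:
        return "partial"
    return "poor"
```

`SequenceMatcher.ratio()` already falls in [0, 1], so the evaluator's clamp would be a no-op for this metric; the clamp matters for metrics that can overshoot, such as unnormalized cosine scores.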
@@ -10,8 +10,16 @@ from prometheus.application.bootstrap import SyntheticBootstrap
 from prometheus.application.dto import OptimizationConfig, OptimizationResult
 from prometheus.application.evaluator import PromptEvaluator
 from prometheus.application.evolution import EvolutionLoop
+from prometheus.cli.logging_setup import get_logger
 from prometheus.domain.entities import Prompt
-from prometheus.domain.ports import ProposerPort
+from prometheus.domain.ports import (
+    CheckpointPort,
+    CrossoverPort,
+    MutationPort,
+    ProposerPort,
+)
+
+logger = get_logger("use_cases")


 class OptimizePromptUseCase:
@@ -25,24 +33,60 @@ class OptimizePromptUseCase:
        evaluator: PromptEvaluator,
        proposer: ProposerPort,
        bootstrap: SyntheticBootstrap,
+        checkpoint_port: CheckpointPort | None = None,
+        crossover_port: CrossoverPort | None = None,
+        mutation_port: MutationPort | None = None,
    ):
        self._evaluator = evaluator
        self._proposer = proposer
        self._bootstrap = bootstrap
+        self._checkpoint_port = checkpoint_port
+        self._crossover_port = crossover_port
+        self._mutation_port = mutation_port

    async def execute(self, config: OptimizationConfig) -> OptimizationResult:
        """Full pipeline:
        1. Bootstrap → generate synthetic inputs
-        2. Evolution → optimization loop
+        2. Evolution → optimization loop (with optional checkpoint resume)
        3. Return result
        """
-        # Phase 0: Bootstrap
+        # Phase 0: Bootstrap (skip synthetic generation on resume if pool was saved)
+        initial_state = None
+        if config.resume and self._checkpoint_port is not None:
+            initial_state = self._checkpoint_port.load()
+            if initial_state is not None and initial_state.synthetic_pool:
+                synthetic_pool = initial_state.synthetic_pool
+                logger.info(
+                    "Resumed checkpoint includes %d synthetic inputs — skipping bootstrap",
+                    len(synthetic_pool),
+                    extra={"structured": {"event": "resume_skip_bootstrap", "pool_size": len(synthetic_pool)}},
+                )
+            else:
+                synthetic_pool = self._bootstrap.run(
+                    task_description=config.task_description,
+                    n_examples=config.n_synthetic_inputs,
+                )
+        else:
            synthetic_pool = self._bootstrap.run(
                task_description=config.task_description,
                n_examples=config.n_synthetic_inputs,
            )

-        # Phase 1: Evolution
+        # Split into train / validation if configured
+        validation_pool: list = []
+        if config.validation_split > 0:
+            synthetic_pool, validation_pool = SyntheticBootstrap.split_pool(
+                synthetic_pool, config.validation_split,
+            )
+            logger.info(
+                "Split synthetic pool: %d train, %d validation (%.0f%% hold-out)",
+                len(synthetic_pool), len(validation_pool),
+                config.validation_split * 100,
+                extra={"structured": {
+                    "event": "pool_split",
+                    "train_size": len(synthetic_pool),
+                    "val_size": len(validation_pool),
+                }},
+            )
        loop = EvolutionLoop(
            evaluator=self._evaluator,
            proposer=self._proposer,
@@ -53,9 +97,22 @@ class OptimizePromptUseCase:
            verbose=config.verbose,
            circuit_breaker_threshold=config.circuit_breaker_threshold,
            error_strategy=config.error_strategy,
+            checkpoint_port=self._checkpoint_port,
+            checkpoint_interval=config.checkpoint_interval,
+            population_size=config.population_size,
+            crossover_rate=config.crossover_rate,
+            mutation_rate=config.mutation_rate,
+            diversity_penalty=config.diversity_penalty,
+            crossover_port=self._crossover_port,
+            mutation_port=self._mutation_port,
+            early_stop_patience=config.early_stop_patience,
        )
        seed_prompt = Prompt(text=config.seed_prompt)
-        state = await loop.run(seed_prompt, synthetic_pool, config.task_description)
+        state = await loop.run(
+            seed_prompt, synthetic_pool, config.task_description,
+            initial_state=initial_state,
+            validation_pool=validation_pool or None,
+        )

        # Phase 2: Result
        initial_score = (
@@ -71,9 +128,12 @@ class OptimizePromptUseCase:
            ),
            initial_prompt=config.seed_prompt,
            iterations_used=state.iteration,
-            total_llm_calls=state.total_llm_calls + 1,  # +1 for bootstrap
+            total_llm_calls=state.total_llm_calls + 1,  # +1 for bootstrap synthesis call
            initial_score=initial_score,
            final_score=final_score,
            improvement=final_score - initial_score,
            history=state.history,
+            final_validation_score=state.best_validation_score,
+            best_validation_score=state.best_validation_score,
+            early_stopped=state.early_stopped,
        )
@@ -1,31 +1,13 @@
 """
 CLI — user entry point.

-Typer interface with -i (input) and -o (output) options.
+Registers all subcommands and delegates to cli/commands/.
 """
 from __future__ import annotations

-import asyncio
-import logging
-import os
-from dataclasses import asdict
-
-import dspy
 import typer
-from pydantic import ValidationError
-from rich.console import Console
-from rich.panel import Panel
-from rich.table import Table

-from prometheus.application.bootstrap import SyntheticBootstrap
-from prometheus.application.dto import OptimizationConfig, OptimizationResult
-from prometheus.application.evaluator import PromptEvaluator
-from prometheus.application.use_cases import OptimizePromptUseCase
-from prometheus.infrastructure.file_io import YamlPersistence
-from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
-from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
-from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
-from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter
+from prometheus.cli.commands import init, list_runs, optimize, version

 app = typer.Typer(
    name="prometheus",
@@ -33,205 +15,12 @@ app = typer.Typer(
    no_args_is_help=True,
 )

-console = Console()
-
-
-@app.command()
-def optimize(
-    input: str = typer.Option(
-        ...,
-        "-i",
-        "--input",
-        help="Path to input YAML config file.",
-        exists=True,
-        readable=True,
-    ),
-    output: str = typer.Option(
-        "output.yaml",
-        "-o",
-        "--output",
-        help="Path to output YAML result file.",
-    ),
-    verbose: bool = typer.Option(
-        False,
-        "-v",
-        "--verbose",
-        help="Print detailed progress.",
-    ),
-    max_retries: int = typer.Option(
-        3,
-        "--max-retries",
-        help="Max retry attempts for transient LLM errors (429, timeout, 5xx).",
-    ),
-    error_strategy: str = typer.Option(
-        "retry",
-        "--error-strategy",
-        help="How to handle errors: skip | retry | abort.",
-    ),
-    max_concurrency: int = typer.Option(
-        5,
-        "--max-concurrency",
-        help="Max parallel LLM calls for minibatch execution and judging.",
-    ),
-) -> None:
-    """Optimize a prompt without any reference data.
-
-    Usage:
-        prometheus optimize -i config.yaml -o result.yaml
-    """
-    asyncio.run(
-        _async_optimize(input, output, verbose, max_retries, error_strategy, max_concurrency)
-    )
-
-
-async def _async_optimize(
-    input: str,
-    output: str,
-    verbose: bool,
-    max_retries: int,
-    error_strategy: str,
-    max_concurrency: int,
-) -> None:
-    # Configure verbose logging
-    if verbose:
-        logging.basicConfig(level=logging.INFO, format="[PROMETHEUS] %(message)s")
-
-    console.print(
-        Panel.fit(
-            "PROMETHEUS — Prompt Evolution Engine",
-            subtitle="No reference data required",
-        )
-    )
-
-    # 1. Load & validate config
-    persistence = YamlPersistence()
-    raw_config = persistence.read_config(input)
-
-    # CLI flags override config file values
-    raw_config.setdefault("max_retries", max_retries)
-    raw_config.setdefault("error_strategy", error_strategy)
-    raw_config.setdefault("max_concurrency", max_concurrency)
-    raw_config["output_path"] = output
-    raw_config["verbose"] = verbose
-
-    try:
-        config = OptimizationConfig.model_validate(raw_config)
-    except ValidationError as exc:
-        console.print("[bold red]Configuration error:[/bold red]\n")
-        for err in exc.errors():
-            loc = " → ".join(str(l) for l in err["loc"])
-            console.print(f"  [red]• {loc}: {err['msg']}[/red]")
-        raise typer.Exit(code=1) from exc
-    console.print(f"[dim]Task: {config.task_description[:80]}...[/dim]")
-    console.print(f"[dim]Seed prompt: {config.seed_prompt[:80]}...[/dim]")
-
-    # 2. Create per-model DSPy LM instances
-    def _model_lm_kwargs(
-        model_api_base: str | None,
-        model_api_key_env: str | None,
-    ) -> dict:
-        """Build kwargs for dspy.LM, using per-model overrides with global fallback."""
-        kwargs: dict = {}
-        api_base = model_api_base or config.api_base
-        api_key_env = model_api_key_env or config.api_key_env
-        if api_base:
-            kwargs["api_base"] = api_base
-        if api_key_env:
-            kwargs["api_key"] = os.environ.get(api_key_env, "")
-        return kwargs
-
-    task_lm = dspy.LM(
-        config.task_model,
-        **_model_lm_kwargs(config.task_api_base, config.task_api_key_env),
-    )
-    judge_lm = dspy.LM(
-        config.judge_model,
-        **_model_lm_kwargs(config.judge_api_base, config.judge_api_key_env),
-    )
-    proposer_lm = dspy.LM(
-        config.proposer_model,
-        **_model_lm_kwargs(config.proposer_api_base, config.proposer_api_key_env),
-    )
-    synth_lm = dspy.LM(
-        config.synth_model,
-        **_model_lm_kwargs(config.synth_api_base, config.synth_api_key_env),
-    )
-
-    # 3. Build adapters (Dependency Injection — each gets its own LM + retry config)
-    synth_adapter = DSPySyntheticAdapter(lm=synth_lm)
-    llm_adapter = DSPyLLMAdapter(
-        lm=task_lm,
-        max_retries=config.max_retries,
-        retry_delay_base=config.retry_delay_base,
-    )
-    judge_adapter = DSPyJudgeAdapter(
-        lm=judge_lm,
-        max_retries=config.max_retries,
-        retry_delay_base=config.retry_delay_base,
-        max_concurrency=config.max_concurrency,
-    )
-    proposer_adapter = DSPyProposerAdapter(
-        lm=proposer_lm,
-        max_retries=config.max_retries,
-        retry_delay_base=config.retry_delay_base,
-    )
-    bootstrap = SyntheticBootstrap(generator=synth_adapter, seed=config.seed)
-    evaluator = PromptEvaluator(
-        executor=llm_adapter,
-        judge=judge_adapter,
-        max_concurrency=config.max_concurrency,
-    )
-    use_case = OptimizePromptUseCase(
-        evaluator=evaluator,
-        proposer=proposer_adapter,
-        bootstrap=bootstrap,
-    )
-
-    # 4. Execute
-    with console.status("[bold green]Evolving prompt..."):
-        result = await use_case.execute(config)
-
-    # 5. Display results
-    _display_result(result)
-
-    # 6. Save
-    _save_result(persistence, output, result)
-    console.print(f"\n[green]Results saved to {output}[/green]")
-
-
-def _display_result(result: OptimizationResult) -> None:
-    """Display a Rich summary in the terminal."""
-    console.print()
-    console.print(
-        Panel(
-            f"[bold green]Optimized Prompt[/bold green]\n\n{result.optimized_prompt}",
-            title="Result",
-        )
-    )
-    table = Table(title="Metrics")
-    table.add_column("Metric", style="cyan")
-    table.add_column("Value", style="bold")
-    table.add_row("Initial Score", f"{result.initial_score:.2f}")
-    table.add_row("Final Score", f"{result.final_score:.2f}")
-    table.add_row("Improvement", f"{result.improvement:+.2f}")
-    table.add_row("Iterations", str(result.iterations_used))
-    table.add_row("LLM Calls", str(result.total_llm_calls))
-    console.print(table)
-
-
-def _save_result(
-    persistence: YamlPersistence,
-    path: str,
-    result: OptimizationResult,
-) -> None:
-    """Save the result as YAML."""
-    persistence.write_result(path, asdict(result))
-
-
-@app.command(hidden=True)
-def _help() -> None:
-    """Internal placeholder to force multi-command Typer behavior."""
-    pass
+# Register all subcommands — having multiple commands fixes the
+# Typer 0.24+ bug where a single-command app absorbs the subcommand.
+optimize.register(app)
+version.register(app)
+init.register(app)
+list_runs.register(app)


 if __name__ == "__main__":
@@ -0,0 +1 @@
+"""CLI command modules."""
@@ -0,0 +1,97 @@
+"""prometheus init — scaffold a config YAML interactively."""
+from __future__ import annotations
+
+from pathlib import Path
+
+import typer
+import yaml
+from rich.console import Console
+
+console = Console()
+
+_TEMPLATE = """\
+# PROMETHEUS configuration
+# Generated by `prometheus init`
+
+# --- Required ---
+seed_prompt: {seed_prompt}
+task_description: {task_description}
+
+# --- Models ---
+task_model: {task_model}
+judge_model: {judge_model}
+proposer_model: {proposer_model}
+synth_model: {synth_model}
+
+# --- Global API settings (optional) ---
+# api_base: https://api.openai.com/v1
+# api_key_env: OPENAI_API_KEY
+
+# --- Evolution parameters ---
+max_iterations: 30
+n_synthetic_inputs: 20
+minibatch_size: 5
+perfect_score: 1.0
+seed: 42
+
+# --- Concurrency ---
+max_concurrency: 5
+
+# --- Error handling ---
+max_retries: 3
+retry_delay_base: 1.0
+circuit_breaker_threshold: 5
+error_strategy: retry
+"""
+
+
+def register(app: typer.Typer) -> None:
+    """Register the init command on the Typer app."""
+
+    @app.command()
+    def init(
+        output: str = typer.Option(
+            "config.yaml",
+            "-o",
+            "--output",
+            help="Path for the generated config file.",
+        ),
+    ) -> None:
+        """Interactively scaffold a PROMETHEUS config YAML.
+
+        Prompts for required fields and writes a ready-to-edit config file.
+        """
+        target = Path(output)
+
+        if target.exists() and not typer.confirm(
+            f"{output} already exists. Overwrite?", default=False
+        ):
+            raise typer.Exit(code=0)
+
+        seed_prompt: str = typer.prompt("Seed prompt")
+        task_description: str = typer.prompt("Task description")
+        task_model: str = typer.prompt("Task model", default="openai/gpt-4o-mini")
+        judge_model: str = typer.prompt("Judge model", default="openai/gpt-4o")
+        proposer_model: str = typer.prompt("Proposer model", default="openai/gpt-4o")
+        synth_model: str = typer.prompt("Synth model", default="openai/gpt-4o")
+
+        content = _TEMPLATE.format(
+            seed_prompt=_yaml_string(seed_prompt),
+            task_description=_yaml_string(task_description),
+            task_model=task_model,
+            judge_model=judge_model,
+            proposer_model=proposer_model,
+            synth_model=synth_model,
+        )
+
+        target.write_text(content, encoding="utf-8")
+        console.print(f"[green]Config written to {output}[/green]")
+        console.print("[dim]Edit it as needed, then run: prometheus optimize -i config.yaml[/dim]")
+
+
+def _yaml_string(value: str) -> str:
+    """Serialize a string as a double-quoted YAML scalar.
+
+    Always quoting keeps values such as "true", "- item", or text containing
+    newlines from being misparsed when the config is loaded.
+    """
+    return yaml.safe_dump(value, default_style='"', allow_unicode=True, width=float("inf")).strip()
@@ -0,0 +1,101 @@
+"""prometheus list — list past optimization runs."""
+from __future__ import annotations
+
+import glob as globmod
+from pathlib import Path
+
+import typer
+import yaml
+from rich.console import Console
+from rich.table import Table
+
+console = Console()
+
+_DEFAULT_PATTERNS = ("output.yaml", "results/*.yaml", "*.result.yaml")
+
+
+def register(app: typer.Typer) -> None:
+    """Register the list command on the Typer app."""
+
+    @app.command("list")
+    def list_runs(
+        directory: str = typer.Option(
+            ".",
+            "-d",
+            "--directory",
+            help="Directory to scan for result YAML files.",
+        ),
+    ) -> None:
+        """List past optimization runs found in result YAML files.
+
+        Scans the given directory for YAML files that look like PROMETHEUS
+        output (they contain 'optimized_prompt' and 'final_score' keys) and
+        displays a summary table.
+        """
+        base = Path(directory)
+        if not base.is_dir():
+            console.print(f"[red]Directory not found: {directory}[/red]")
+            raise typer.Exit(code=1)
+
+        runs: list[dict] = []
+
+        for pattern in _DEFAULT_PATTERNS:
+            for path_str in globmod.glob(str(base / pattern), recursive=False):
+                _try_read_run(path_str, runs)
+
+        # Also scan nested directories at any depth; duplicates are removed below
+        for path_str in globmod.glob(str(base / "**/*.yaml"), recursive=True):
+            _try_read_run(path_str, runs)
+
+        if not runs:
+            console.print("[dim]No optimization runs found.[/dim]")
+            raise typer.Exit(code=0)
+
+        # Deduplicate by path
+        seen: set[str] = set()
+        unique_runs: list[dict] = []
+        for run in runs:
+            if run["path"] not in seen:
+                seen.add(run["path"])
+                unique_runs.append(run)
+
+        table = Table(title="PROMETHEUS Runs")
+        table.add_column("File", style="cyan")
+        table.add_column("Initial", justify="right")
+        table.add_column("Final", justify="right", style="green")
+        table.add_column("Delta", justify="right")
+        table.add_column("Iters", justify="right")
+        table.add_column("Prompt (first 60 chars)", style="dim")
+
+        for run in sorted(unique_runs, key=lambda r: r["path"]):
+            table.add_row(
+                run["path"],
+                f"{run['initial_score']:.2f}",
+                f"{run['final_score']:.2f}",
+                f"{run['improvement']:+.2f}",
+                str(run["iterations"]),
+                run["prompt_preview"],
+            )
+
+        console.print(table)
+
+
+def _try_read_run(path_str: str, runs: list[dict]) -> None:
+    """Try to parse a YAML file as a PROMETHEUS result and append metadata."""
+    try:
+        with open(path_str, encoding="utf-8") as f:
+            data = yaml.safe_load(f)
+        if not isinstance(data, dict):
+            return
+        if "optimized_prompt" not in data or "final_score" not in data:
+            return
+        runs.append({
+            "path": path_str,
+            "initial_score": float(data.get("initial_score", 0.0)),
+            "final_score": float(data.get("final_score", 0.0)),
+            "improvement": float(data.get("improvement", 0.0)),
+            "iterations": int(data.get("iterations_used", 0)),
+            "prompt_preview": str(data.get("optimized_prompt", ""))[:60],
+        })
+    except (OSError, yaml.YAMLError, ValueError, TypeError):
+        pass
@@ -27,8 +27,6 @@ from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
 from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
 from prometheus.infrastructure.crossover_adapter import DSPyCrossoverAdapter
 from prometheus.infrastructure.mutation_adapter import DSPyMutationAdapter
-from prometheus.infrastructure.crossover_adapter import DSPyCrossoverAdapter
-from prometheus.infrastructure.mutation_adapter import DSPyMutationAdapter
 from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
 from prometheus.infrastructure.similarity import create_similarity_adapter
 from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter
@@ -337,6 +335,29 @@ async def _async_optimize(
    with console.status("[bold green]Evolving prompt..."):
        result = await use_case.execute(config)

+    # 4b. Compute actual LLM call count from adapter counters
+    actual_llm_calls = (
+        llm_adapter.call_count
+        + judge_adapter.call_count
+        + proposer_adapter.call_count
+        + synth_adapter.call_count
+        + (crossover_adapter.call_count if crossover_adapter else 0)
+        + (mutation_adapter.call_count if mutation_adapter else 0)
+    )
+    result = OptimizationResult(
+        optimized_prompt=result.optimized_prompt,
+        initial_prompt=result.initial_prompt,
+        iterations_used=result.iterations_used,
+        total_llm_calls=actual_llm_calls,
+        initial_score=result.initial_score,
+        final_score=result.final_score,
+        improvement=result.improvement,
+        history=result.history,
+        final_validation_score=result.final_validation_score,
+        best_validation_score=result.best_validation_score,
+        early_stopped=result.early_stopped,
+    )
+
    # 5. Display results
    _display_result(result)

@@ -0,0 +1,18 @@
+"""prometheus version — print the current version."""
+from __future__ import annotations
+
+import typer
+from rich.console import Console
+
+from prometheus import __version__
+
+console = Console()
+
+
+def register(app: typer.Typer) -> None:
+    """Register the version command on the Typer app."""
+
+    @app.command()
+    def version() -> None:
+        """Print the PROMETHEUS version."""
+        console.print(f"PROMETHEUS {__version__}")
@@ -0,0 +1,96 @@
+"""
+Structured logging configuration for PROMETHEUS.
+
+Supports text (human-readable) and JSON (machine-parseable) output,
+configurable log levels, and optional file output.
+
+Fixes Bug #4: verbose mode now reliably produces output by configuring
+handlers explicitly instead of relying on ``logging.basicConfig``.
+"""
+from __future__ import annotations
+
+import json
+import logging
+import sys
+from datetime import datetime, timezone
+from typing import TextIO
+
+
+_PROMETHEUS_LOGGER = "prometheus"
+
+
+class _JsonFormatter(logging.Formatter):
+    """Emit one JSON object per log line."""
+
+    def format(self, record: logging.LogRecord) -> str:
+        message = record.getMessage()
+        payload: dict = {
+            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
+            "level": record.levelname,
+            "logger": record.name,
+            "message": message,
+        }
+        # Merge any extra structured fields the caller attached.
+        if hasattr(record, "structured"):
+            payload["structured"] = record.structured  # type: ignore[attr-defined]
+        if record.exc_info and record.exc_info[1] is not None:
+            payload["exception"] = self.formatException(record.exc_info)
+        return json.dumps(payload, default=str)
+
+
+class _TextFormatter(logging.Formatter):
+    """Human-readable format with structured extras appended."""
+
+    def format(self, record: logging.LogRecord) -> str:
+        base = super().format(record)
+        if hasattr(record, "structured") and record.structured:
+            extras = " ".join(f"{k}={v}" for k, v in record.structured.items())
+            base = f"{base}  {extras}"
+        return base
+
+
+def configure_logging(
+    *,
+    level: int = logging.WARNING,
+    log_format: str = "text",
+    log_file: str | None = None,
+) -> None:
+    """Configure the prometheus root logger.
+
+    Args:
+        level: Logging level (e.g. logging.DEBUG, logging.INFO).
+        log_format: ``"text"`` for human-readable or ``"json"`` for
+            machine-parseable output.
+        log_file: Optional path to also write logs to a file.
+    """
+    prom_logger = logging.getLogger(_PROMETHEUS_LOGGER)
+    prom_logger.setLevel(level)
+    # Remove any stale handlers so re-configuration is idempotent.
+    prom_logger.handlers.clear()
+
+    if log_format == "json":
+        fmt = _JsonFormatter()
+    else:
+        fmt = _TextFormatter(
+            fmt="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+            datefmt="%H:%M:%S",
+        )
+
+    # Console handler (stderr so it doesn't mix with Rich stdout)
+    console_handler = logging.StreamHandler(sys.stderr)
+    console_handler.setFormatter(fmt)
+    prom_logger.addHandler(console_handler)
+
+    # Optional file handler
+    if log_file:
+        file_handler = logging.FileHandler(log_file)
+        file_handler.setFormatter(fmt)
+        prom_logger.addHandler(file_handler)
+
+    # Prevent propagation to root logger to avoid duplicate output
+    prom_logger.propagate = False
+
+
+def get_logger(name: str) -> logging.Logger:
+    """Return a child logger under the prometheus namespace."""
+    return logging.getLogger(f"{_PROMETHEUS_LOGGER}.{name}")
@@ -31,6 +31,15 @@ class SyntheticExample:
    id: int = 0


+@dataclass(frozen=True)
+class GroundTruthExample:
+    """A ground-truth evaluation example with a known-good expected output."""
+
+    input_text: str
+    expected_output: str
+    id: int = 0
+
+
 @dataclass
 class Trajectory:
    """Execution trace of a prompt on an input.
@@ -85,3 +94,6 @@ class OptimizationState:
    synthetic_pool: list[SyntheticExample] = field(default_factory=list)
    history: list[dict[str, Any]] = field(default_factory=list)
    total_llm_calls: int = 0
+    # Hold-out validation
+    best_validation_score: float | None = None
+    early_stopped: bool = False
@@ -8,7 +8,14 @@ from abc import ABC, abstractmethod

 from typing import Any

-from prometheus.domain.entities import Prompt, SyntheticExample, Trajectory
+from prometheus.domain.entities import (
+    Candidate,
+    GroundTruthExample,
+    OptimizationState,
+    Prompt,
+    SyntheticExample,
+    Trajectory,
+)


 class LLMPort(ABC):
@@ -73,6 +80,34 @@ class SyntheticGeneratorPort(ABC):
        ...


+class CrossoverPort(ABC):
+    """Port for crossover — combining instructions from two parent candidates."""
+
+    @abstractmethod
+    async def crossover(
+        self,
+        parent_a: Prompt,
+        parent_b: Prompt,
+        task_description: str,
+    ) -> Prompt:
+        """Combine instructions from two parents into a child prompt."""
+        ...
+
+
+class MutationPort(ABC):
+    """Port for mutating a prompt — paraphrase, constrain, generalize, specialize."""
+
+    @abstractmethod
+    async def mutate(
+        self,
+        prompt: Prompt,
+        task_description: str,
+        mutation_type: str = "paraphrase",
+    ) -> Prompt:
+        """Apply a mutation to the prompt."""
+        ...
+
+
 class PersistencePort(ABC):
    """Port for reading/writing files."""

@@ -83,3 +118,49 @@ class PersistencePort(ABC):
    @abstractmethod
    def write_result(self, path: str, data: dict[str, Any]) -> None:
        ...
+
+
+class SimilarityPort(ABC):
+    """Port for computing similarity between a prediction and expected output.
+
+    Infrastructure provides concrete metrics (exact match, BLEU, ROUGE, cosine).
+    """
+
+    @abstractmethod
+    def compute(self, prediction: str, expected: str) -> float:
+        """Compute similarity score in [0, 1]. 1.0 = perfect match."""
+        ...
+
+
+class DatasetLoaderPort(ABC):
+    """Port for loading ground-truth evaluation datasets."""
+
+    @abstractmethod
+    def load(self, path: str) -> list[GroundTruthExample]:
+        """Load a dataset from a CSV or JSON file.
+
+        Each row must have 'input' and 'expected_output' fields.
+        """
+        ...
+
+
+class CheckpointPort(ABC):
+    """Port for saving and loading optimization checkpoints.
+
+    Enables resuming long-running optimizations after interruption.
+    """
+
+    @abstractmethod
+    def save(self, state: OptimizationState) -> None:
+        """Persist the current optimization state to disk."""
+        ...
+
+    @abstractmethod
+    def load(self) -> OptimizationState | None:
+        """Load the latest checkpoint. Returns None if no checkpoint exists."""
+        ...
+
+    @abstractmethod
+    def latest_exists(self) -> bool:
+        """Check if a checkpoint file is available for resuming."""
+        ...
@@ -0,0 +1,149 @@
+"""
+JSON checkpoint persistence — save/load optimization state to disk.
+
+Implements the CheckpointPort with JSON for human-readable, versionable snapshots.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from prometheus.cli.logging_setup import get_logger
+from prometheus.domain.entities import (
+    Candidate,
+    OptimizationState,
+    Prompt,
+    SyntheticExample,
+)
+from prometheus.domain.ports import CheckpointPort
+
+logger = get_logger("checkpoint")
+
+_CHECKPOINT_FILE = "latest.json"
+
+
+class JsonCheckpointPersistence(CheckpointPort):
+    """Saves optimization state as JSON to a configurable directory."""
+
+    def __init__(self, checkpoint_dir: str | Path = ".prometheus/checkpoints") -> None:
+        self._dir = Path(checkpoint_dir)
+
+    def save(self, state: OptimizationState) -> None:
+        """Persist the current optimization state to disk."""
+        self._dir.mkdir(parents=True, exist_ok=True)
+        path = self._dir / _CHECKPOINT_FILE
+        data = _serialize_state(state)
+        path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
+        logger.info(
+            "Checkpoint saved",
+            extra={
+                "structured": {
+                    "event": "checkpoint_saved",
+                    "path": str(path),
+                    "iteration": state.iteration,
+                    "total_llm_calls": state.total_llm_calls,
+                },
+            },
+        )
+
+    def load(self) -> OptimizationState | None:
+        """Load the latest checkpoint. Returns None if no checkpoint exists."""
+        path = self._dir / _CHECKPOINT_FILE
+        if not path.exists():
+            return None
+        raw = json.loads(path.read_text(encoding="utf-8"))
+        state = _deserialize_state(raw)
+        logger.info(
+            "Checkpoint loaded",
+            extra={
+                "structured": {
+                    "event": "checkpoint_loaded",
+                    "path": str(path),
+                    "iteration": state.iteration,
+                    "total_llm_calls": state.total_llm_calls,
+                },
+            },
+        )
+        return state
+
+    def latest_exists(self) -> bool:
+        """Check if a checkpoint file is available for resuming."""
+        return (self._dir / _CHECKPOINT_FILE).exists()
+
+
+# ---------------------------------------------------------------------------
+# Serialization helpers — keep the JSON format stable and self-describing.
+# ---------------------------------------------------------------------------
+
+_SCHEMA_VERSION = 1
+
+
+def _serialize_state(state: OptimizationState) -> dict:
+    """Convert OptimizationState to a JSON-safe dict."""
+    return {
+        "schema_version": _SCHEMA_VERSION,
+        "iteration": state.iteration,
+        "best_candidate": _serialize_candidate(state.best_candidate),
+        "candidates": [_serialize_candidate(c) for c in state.candidates],
+        "synthetic_pool": [
+            {"input_text": ex.input_text, "category": ex.category, "id": ex.id}
+            for ex in state.synthetic_pool
+        ],
+        "history": state.history,
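+        "total_llm_calls": state.total_llm_calls,
+        # Hold-out validation fields, so a resumed run keeps its early-stopping state.
+        "best_validation_score": state.best_validation_score,
+        "early_stopped": state.early_stopped,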
+    }
+
+
+def _serialize_candidate(candidate: Candidate | None) -> dict | None:
+    if candidate is None:
+        return None
+    return {
+        "prompt_text": candidate.prompt.text,
+        "prompt_metadata": candidate.prompt.metadata,
+        "best_score": candidate.best_score,
+        "generation": candidate.generation,
+        "parent_id": candidate.parent_id,
+    }
+
+
+def _deserialize_state(data: dict) -> OptimizationState:
+    """Reconstruct OptimizationState from a checkpoint dict."""
+    version = data.get("schema_version", 1)
+    # Future migration hooks go here: if version < 2: ...
+
+    best_raw = data.get("best_candidate")
+    best_candidate = _deserialize_candidate(best_raw)
+
+    candidates = [_deserialize_candidate(c) for c in data.get("candidates", [])]
+
+    synthetic_pool = [
+        SyntheticExample(
+            input_text=ex["input_text"],
+            category=ex.get("category", "default"),
+            id=ex.get("id", 0),
+        )
+        for ex in data.get("synthetic_pool", [])
+    ]
+
+    state = OptimizationState(
+        iteration=data.get("iteration", 0),
+        best_candidate=best_candidate,
+        candidates=candidates,
+        synthetic_pool=synthetic_pool,
+        history=data.get("history", []),
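+        total_llm_calls=data.get("total_llm_calls", 0),
+        # Older checkpoints lack these keys; fall back to the dataclass defaults.
+        best_validation_score=data.get("best_validation_score"),
+        early_stopped=data.get("early_stopped", False),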
+    )
+    return state
+
+
+def _deserialize_candidate(raw: dict | None) -> Candidate | None:
+    if raw is None:
+        return None
+    return Candidate(
+        prompt=Prompt(text=raw["prompt_text"], metadata=raw.get("prompt_metadata", {})),
+        best_score=raw.get("best_score", 0.0),
+        generation=raw.get("generation", 0),
+        parent_id=raw.get("parent_id"),
+    )
@@ -0,0 +1,63 @@
+"""
+Adapter: Instruction Crossover via DSPy.
+
+Implements CrossoverPort — combines two parent prompts into a child.
+"""
+from __future__ import annotations
+
+import asyncio
+
+import dspy
+
+from prometheus.domain.entities import Prompt
+from prometheus.domain.ports import CrossoverPort
+from prometheus.infrastructure.dspy_modules import InstructionCrossover
+from prometheus.infrastructure.retry import async_retry_with_backoff
+
+
+class DSPyCrossoverAdapter(CrossoverPort):
+    """Uses DSPy to combine two parent instructions into a child."""
+
+    def __init__(
+        self,
+        lm: dspy.LM,
+        max_retries: int = 3,
+        retry_delay_base: float = 1.0,
+    ) -> None:
+        self._lm = lm
+        self._crossover = InstructionCrossover()
+        self._max_retries = max_retries
+        self._retry_delay_base = retry_delay_base
+        self.call_count: int = 0
+
+    async def crossover(
+        self,
+        parent_a: Prompt,
+        parent_b: Prompt,
+        task_description: str,
+    ) -> Prompt:
+        async def _call() -> Prompt:
+            return await asyncio.to_thread(
+                self._sync_crossover, parent_a, parent_b, task_description,
+            )
+
+        return await async_retry_with_backoff(
+            _call,
+            max_retries=self._max_retries,
+            retry_delay_base=self._retry_delay_base,
+        )
+
+    def _sync_crossover(
+        self,
+        parent_a: Prompt,
+        parent_b: Prompt,
+        task_description: str,
+    ) -> Prompt:
+        with dspy.context(lm=self._lm):
+            pred = self._crossover(
+                parent_a=parent_a.text,
+                parent_b=parent_b.text,
+                task_description=task_description,
+            )
+        self.call_count += 1
+        return Prompt(text=pred.child_instruction)
@@ -0,0 +1,75 @@
+"""Dataset loader — loads ground-truth CSV/JSON datasets."""
+from __future__ import annotations
+
+import csv
+import json
+import logging
+from pathlib import Path
+
+from prometheus.domain.entities import GroundTruthExample
+from prometheus.domain.ports import DatasetLoaderPort
+
+logger = logging.getLogger(__name__)
+
+
+class FileDatasetLoader(DatasetLoaderPort):
+    """Loads evaluation datasets from CSV or JSON files.
+
+    CSV files must have 'input' and 'expected_output' columns.
+    JSON files must be an array of objects with 'input' and 'expected_output' keys.
+    """
+
+    def load(self, path: str) -> list[GroundTruthExample]:
+        suffix = Path(path).suffix.lower()
+        if suffix == ".csv":
+            return self._load_csv(path)
+        elif suffix in (".json", ".jsonl"):
+            return self._load_json(path)
+        else:
+            raise ValueError(
+                f"Unsupported dataset format '{suffix}'. Use .csv, .json, or .jsonl."
+            )
+
+    def _load_csv(self, path: str) -> list[GroundTruthExample]:
+        examples: list[GroundTruthExample] = []
+        with open(path, newline="", encoding="utf-8") as f:
+            reader = csv.DictReader(f)
+            for i, row in enumerate(reader):
+                # DictReader fills missing cells with None, so guard before strip().
+                input_text = (row.get("input") or "").strip()
+                expected = (row.get("expected_output") or "").strip()
+                if not input_text:
+                    logger.warning("Skipping CSV row %d: empty 'input' field", i + 1)
+                    continue
+                examples.append(
+                    GroundTruthExample(
+                        input_text=input_text,
+                        expected_output=expected,
+                        id=i,
+                    )
+                )
+        logger.info("Loaded %d examples from CSV: %s", len(examples), path)
+        return examples
+
+    def _load_json(self, path: str) -> list[GroundTruthExample]:
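+        with open(path, encoding="utf-8") as f:
+            if Path(path).suffix.lower() == ".jsonl":
+                # JSON Lines: one object per non-empty line.
+                data = [json.loads(line) for line in f if line.strip()]
+            else:
+                data = json.load(f)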
+        if not isinstance(data, list):
+            raise ValueError("JSON dataset must be an array of objects.")
+        examples: list[GroundTruthExample] = []
+        for i, item in enumerate(data):
+            # Non-dict items count as empty; null or non-string values are coerced.
+            fields = item if isinstance(item, dict) else {}
+            input_text = str(fields.get("input") or "").strip()
+            expected = str(fields.get("expected_output") or "").strip()
+            if not input_text:
+                logger.warning("Skipping JSON item %d: empty 'input' field", i)
+                continue
+            examples.append(
+                GroundTruthExample(
+                    input_text=input_text,
+                    expected_output=expected,
+                    id=i,
+                )
+            )
+        logger.info("Loaded %d examples from JSON: %s", len(examples), path)
+        return examples
@@ -59,6 +59,7 @@ class DSPyJudgeAdapter(JudgePort):
            if self._judge_dimensions
            else {}
        )
+        self.call_count: int = 0

    async def judge_batch(
        self,
@@ -104,13 +105,15 @@ class DSPyJudgeAdapter(JudgePort):

    def _sync_judge(self, task_description: str, input_text: str, output_text: str):
        with dspy.context(lm=self._lm):
-            return self._judge(
+            result = self._judge(
                task_description=task_description,
                input_text=input_text,
                output_text=output_text,
                judge_criteria=self._judge_criteria,
                dimension_names=self._dimension_names,
            )
+        self.call_count += 1
+        return result

    def _aggregate_result(self, pred: Any) -> tuple[float, str]:
        """Compute weighted aggregate score from dimension scores if available."""
@@ -34,6 +34,7 @@ class DSPyLLMAdapter(LLMPort):
        self._predictor = dspy.Predict(self._ExecuteSignature)
        self._max_retries = max_retries
        self._retry_delay_base = retry_delay_base
+        self.call_count: int = 0

    async def execute(self, prompt: Prompt, input_text: str) -> str:
        async def _call() -> str:
@@ -52,4 +53,5 @@ class DSPyLLMAdapter(LLMPort):
                instruction=prompt.text,
                input_text=input_text,
            )
+        self.call_count += 1
        return str(result.output)
@@ -0,0 +1,70 @@
+"""
+Adapter: Instruction Mutation via DSPy.
+
+Implements MutationPort — applies typed mutations (paraphrase, constrain,
+generalize, specialize) to a prompt.
+"""
+from __future__ import annotations
+
+import asyncio
+import random
+
+import dspy
+
+from prometheus.domain.entities import Prompt
+from prometheus.domain.ports import MutationPort
+from prometheus.infrastructure.dspy_modules import InstructionMutator
+from prometheus.infrastructure.retry import async_retry_with_backoff
+
+_MUTATION_TYPES = ("paraphrase", "constrain", "generalize", "specialize")
+
+
+class DSPyMutationAdapter(MutationPort):
+    """Uses DSPy to apply typed mutations to an instruction."""
+
+    def __init__(
+        self,
+        lm: dspy.LM,
+        max_retries: int = 3,
+        retry_delay_base: float = 1.0,
+    ) -> None:
+        self._lm = lm
+        self._mutator = InstructionMutator()
+        self._max_retries = max_retries
+        self._retry_delay_base = retry_delay_base
+        self.call_count: int = 0
+
+    async def mutate(
+        self,
+        prompt: Prompt,
+        task_description: str,
+        mutation_type: str = "paraphrase",
+    ) -> Prompt:
+        # Unrecognized mutation types fall back to a randomly chosen valid one.
+        if mutation_type not in _MUTATION_TYPES:
+            mutation_type = random.choice(_MUTATION_TYPES)
+
+        async def _call() -> Prompt:
+            return await asyncio.to_thread(
+                self._sync_mutate, prompt, task_description, mutation_type,
+            )
+
+        return await async_retry_with_backoff(
+            _call,
+            max_retries=self._max_retries,
+            retry_delay_base=self._retry_delay_base,
+        )
+
+    def _sync_mutate(
+        self,
+        prompt: Prompt,
+        task_description: str,
+        mutation_type: str,
+    ) -> Prompt:
+        with dspy.context(lm=self._lm):
+            pred = self._mutator(
+                current_instruction=prompt.text,
+                task_description=task_description,
+                mutation_type=mutation_type,
+            )
+        self.call_count += 1
+        return Prompt(text=pred.mutated_instruction)
@@ -29,6 +29,7 @@ class DSPyProposerAdapter(ProposerPort):
        self._proposer = InstructionProposer()
        self._max_retries = max_retries
        self._retry_delay_base = retry_delay_base
+        self.call_count: int = 0

    async def propose(
        self,
@@ -56,6 +57,7 @@ class DSPyProposerAdapter(ProposerPort):
                task_description=task_description,
                failure_examples=failure_examples,
            )
+        self.call_count += 1
        return Prompt(text=pred.new_instruction)

    @staticmethod
@@ -0,0 +1,153 @@
+"""Similarity adapters — concrete metrics for comparing prediction vs expected."""
+from __future__ import annotations
+
+import math
+import re
+from collections import Counter
+
+from prometheus.domain.ports import SimilarityPort
+
+
+class ExactMatchSimilarity(SimilarityPort):
+    """Case-insensitive, whitespace-trimmed exact string match. Returns 1.0 or 0.0."""
+
+    def compute(self, prediction: str, expected: str) -> float:
+        return 1.0 if prediction.strip().lower() == expected.strip().lower() else 0.0
+
+
+class BleuSimilarity(SimilarityPort):
+    """BLEU-style n-gram precision (up to 4-grams).
+
+    Simplified implementation using sentence-level BLEU with brevity penalty.
+    Returns a score in [0, 1].
+    """
+
+    def __init__(self, max_n: int = 4):
+        self._max_n = max_n
+
+    def compute(self, prediction: str, expected: str) -> float:
+        pred_tokens = _tokenize(prediction)
+        ref_tokens = _tokenize(expected)
+        if not pred_tokens or not ref_tokens:
+            return 0.0
+
+        # Modified precision for each n-gram
+        precisions: list[float] = []
+        for n in range(1, self._max_n + 1):
+            pred_ngrams = _ngrams(pred_tokens, n)
+            ref_ngrams = _ngrams(ref_tokens, n)
+            if not pred_ngrams:
+                break
+            clipped = sum(min(pred_ngrams[ng], ref_ngrams.get(ng, 0)) for ng in pred_ngrams)
+            total = sum(pred_ngrams.values())
+            precisions.append(clipped / total if total > 0 else 0.0)
+
+        if not precisions:
+            return 0.0
+
+        # Geometric mean over non-zero precisions only (zeros are skipped, unlike standard BLEU)
+        log_avg = sum(math.log(p) for p in precisions if p > 0)
+        n_nonzero = sum(1 for p in precisions if p > 0)
+        if n_nonzero == 0:
+            return 0.0
+        geo_mean = math.exp(log_avg / n_nonzero)
+
+        # Brevity penalty
+        bp = 1.0
+        if len(pred_tokens) < len(ref_tokens):
+            bp = math.exp(1 - len(ref_tokens) / len(pred_tokens))
+
+        return min(bp * geo_mean, 1.0)
+
+
+class RougeLSimilarity(SimilarityPort):
+    """ROUGE-L using Longest Common Subsequence.
+
+    Returns F1 score combining precision and recall in [0, 1].
+    """
+
+    def compute(self, prediction: str, expected: str) -> float:
+        pred_tokens = _tokenize(prediction)
+        ref_tokens = _tokenize(expected)
+        if not pred_tokens or not ref_tokens:
+            return 0.0
+
+        lcs_len = _lcs_length(pred_tokens, ref_tokens)
+        precision = lcs_len / len(pred_tokens)
+        recall = lcs_len / len(ref_tokens)
+        if precision + recall == 0:
+            return 0.0
+        f1 = 2 * precision * recall / (precision + recall)
+        return f1
+
+
+class CosineSimilarity(SimilarityPort):
+    """Cosine similarity between raw term-frequency bag-of-words vectors.
+
+    Lightweight lexical overlap measure; no IDF weighting or embedding models.
+    """
+
+    def compute(self, prediction: str, expected: str) -> float:
+        pred_tokens = _tokenize(prediction)
+        ref_tokens = _tokenize(expected)
+        if not pred_tokens or not ref_tokens:
+            return 0.0
+
+        pred_counts = Counter(pred_tokens)
+        ref_counts = Counter(ref_tokens)
+        all_tokens = set(pred_counts) | set(ref_counts)
+
+        dot = sum(pred_counts.get(t, 0) * ref_counts.get(t, 0) for t in all_tokens)
+        norm_pred = math.sqrt(sum(v * v for v in pred_counts.values()))
+        norm_ref = math.sqrt(sum(v * v for v in ref_counts.values()))
+
+        if norm_pred == 0 or norm_ref == 0:
+            return 0.0
+        return dot / (norm_pred * norm_ref)
+
+
+# --- Helpers ---
+
+
+def _tokenize(text: str) -> list[str]:
+    """Lowercase the text and extract runs of word characters, dropping punctuation."""
+    return re.findall(r"\w+", text.lower())
+
+
+def _ngrams(tokens: list[str], n: int) -> Counter:
+    """Count n-grams in a token list."""
+    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
+
+
+def _lcs_length(a: list[str], b: list[str]) -> int:
+    """Compute length of the Longest Common Subsequence."""
+    m, n = len(a), len(b)
+    prev = [0] * (n + 1)
+    for i in range(1, m + 1):
+        curr = [0] * (n + 1)
+        for j in range(1, n + 1):
+            if a[i - 1] == b[j - 1]:
+                curr[j] = prev[j - 1] + 1
+            else:
+                curr[j] = max(prev[j], curr[j - 1])
+        prev = curr
+    return prev[n]
+
+
+def create_similarity_adapter(metric: str) -> SimilarityPort:
+    """Factory: create a SimilarityPort by metric name.
+
+    Supported metrics: exact, bleu, rouge_l, cosine.
+    """
+    adapters = {
+        "exact": ExactMatchSimilarity,
+        "bleu": BleuSimilarity,
+        "rouge_l": RougeLSimilarity,
+        "cosine": CosineSimilarity,
+    }
+    cls = adapters.get(metric)
+    if cls is None:
+        raise ValueError(
+            f"Unknown eval metric '{metric}'. Choose from: {sorted(adapters)}"
+        )
+    return cls()
@@ -18,6 +18,7 @@ class DSPySyntheticAdapter(SyntheticGeneratorPort):
    def __init__(self, lm: dspy.LM) -> None:
        self._lm = lm
        self._generator = SyntheticInputGenerator()
+        self.call_count: int = 0

    def generate_inputs(
        self,
@@ -29,6 +30,7 @@ class DSPySyntheticAdapter(SyntheticGeneratorPort):
                task_description=task_description,
                n_examples=n_examples,
            )
+        self.call_count += 1
        return [
            SyntheticExample(
                input_text=text,
@@ -91,3 +91,27 @@ def mock_proposer_port() -> AsyncMock:
        text="You are a very helpful assistant. Answer the question precisely."
    )
    return port
+
+
+@pytest.fixture
+def mock_crossover_port() -> AsyncMock:
+    """Mock CrossoverPort that combines two parent prompts."""
+    port = AsyncMock()
+
+    async def _crossover(parent_a: Prompt, parent_b: Prompt, task_description: str) -> Prompt:
+        return Prompt(text=f"{parent_a.text} Also, {parent_b.text.lower()}")
+
+    port.crossover = AsyncMock(side_effect=_crossover)
+    return port
+
+
+@pytest.fixture
+def mock_mutation_port() -> AsyncMock:
+    """Mock MutationPort that paraphrases a prompt."""
+    port = AsyncMock()
+
+    async def _mutate(prompt: Prompt, task_description: str, mutation_type: str = "paraphrase") -> Prompt:
+        return Prompt(text=f"[{mutation_type}] {prompt.text}")
+
+    port.mutate = AsyncMock(side_effect=_mutate)
+    return port
@@ -20,9 +20,10 @@ def mock_lm() -> dspy.LM:


 class TestDSPyLLMAdapter:
-    def test_execute_returns_response(self, mock_lm: dspy.LM) -> None:
+    @pytest.mark.asyncio
+    async def test_execute_returns_response(self, mock_lm: dspy.LM) -> None:
        adapter = DSPyLLMAdapter(lm=mock_lm)
        prompt = Prompt(text="Answer the question.")
-        result = adapter.execute(prompt, "What is 2+2?")
+        result = await adapter.execute(prompt, "What is 2+2?")
        assert isinstance(result, str)
        assert len(result) > 0
@@ -0,0 +1,300 @@
+"""Integration tests for multi-iteration evolution with mixed accept/reject."""
+from __future__ import annotations
+
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.application.evolution import EvolutionLoop
+from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory
+from prometheus.domain.ports import JudgePort, LLMPort, ProposerPort
+
+
+def _make_eval(scores: list[float]) -> EvalResult:
+    return EvalResult(
+        scores=scores,
+        feedbacks=["feedback"] * len(scores),
+        trajectories=[
+            Trajectory(f"in{i}", f"out{i}", s, "feedback", "prompt")
+            for i, s in enumerate(scores)
+        ],
+    )
+
+
+class TestMultiIterationEvolution:
+    """Tests for the evolution loop across multiple iterations."""
+
+    @pytest.fixture
+    def seed_prompt(self) -> Prompt:
+        return Prompt(text="You are a helpful assistant.")
+
+    @pytest.fixture
+    def task_description(self) -> str:
+        return "Answer factual questions."
+
+    @pytest.fixture
+    def synthetic_pool(self) -> list[SyntheticExample]:
+        return [SyntheticExample(input_text=f"input {i}", id=i) for i in range(20)]
+
+    @pytest.mark.asyncio
+    async def test_mixed_accept_reject(
+        self,
+        seed_prompt: Prompt,
+        task_description: str,
+        synthetic_pool: list[SyntheticExample],
+    ) -> None:
+        """Iteration 1: accept, iteration 2: reject, iteration 3: accept."""
+        mock_llm = MagicMock(spec=LLMPort)
+        mock_judge = MagicMock(spec=JudgePort)
+        mock_proposer = MagicMock(spec=ProposerPort)
+
+        evaluator = PromptEvaluator(mock_llm, mock_judge)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:3]
+
+        # Build eval sequence: initial, then per-iteration (current, new)
+        evals = [
+            _make_eval([0.3, 0.3, 0.3]),  # initial seed eval
+            # Iter 1: accept (old=0.4, new=0.8)
+            _make_eval([0.4, 0.4, 0.4]),
+            _make_eval([0.8, 0.8, 0.8]),
+            # Iter 2: reject (old=0.7, new=0.2)
+            _make_eval([0.7, 0.7, 0.7]),
+            _make_eval([0.2, 0.2, 0.2]),
+            # Iter 3: accept (old=0.5, new=0.9)
+            _make_eval([0.5, 0.5, 0.5]),
+            _make_eval([0.9, 0.9, 0.9]),
+        ]
+        evaluator.evaluate = AsyncMock(side_effect=evals)
+
+        mock_proposer.propose.side_effect = [
+            Prompt(text="Better prompt v1"),
+            Prompt(text="Worse prompt v2"),
+            Prompt(text="Best prompt v3"),
+        ]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer,
+            bootstrap=bootstrap,
+            max_iterations=3,
+            minibatch_size=3,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        assert state.iteration == 3
+        assert state.best_candidate is not None
+        assert state.best_candidate.best_score == pytest.approx(2.7)  # 0.9*3
+        assert len(state.history) == 3
+        assert state.history[0]["event"] == "accepted"
+        assert state.history[1]["event"] == "rejected"
+        assert state.history[2]["event"] == "accepted"
+
+    @pytest.mark.asyncio
+    async def test_all_rejected_keeps_seed(
+        self,
+        seed_prompt: Prompt,
+        task_description: str,
+        synthetic_pool: list[SyntheticExample],
+    ) -> None:
+        """When all proposals are rejected, the seed prompt stays as best."""
+        mock_llm = MagicMock(spec=LLMPort)
+        mock_judge = MagicMock(spec=JudgePort)
+        mock_proposer = MagicMock(spec=ProposerPort)
+
+        evaluator = PromptEvaluator(mock_llm, mock_judge)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:3]
+
+        evals = [
+            _make_eval([0.5, 0.5, 0.5]),  # initial
+        ]
+        for _ in range(3):
+            evals.append(_make_eval([0.5, 0.5, 0.5]))  # current
+            evals.append(_make_eval([0.1, 0.1, 0.1]))  # worse proposal
+        evaluator.evaluate = AsyncMock(side_effect=evals)
+
+        mock_proposer.propose.side_effect = [
+            Prompt(text="bad v1"),
+            Prompt(text="bad v2"),
+            Prompt(text="bad v3"),
+        ]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer,
+            bootstrap=bootstrap,
+            max_iterations=3,
+            minibatch_size=3,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        assert state.best_candidate.prompt.text == seed_prompt.text
+        assert state.best_candidate.best_score == pytest.approx(1.5)  # 0.5*3
+
+    @pytest.mark.asyncio
+    async def test_all_accepted_chain(
+        self,
+        seed_prompt: Prompt,
+        task_description: str,
+        synthetic_pool: list[SyntheticExample],
+    ) -> None:
+        """All iterations accept, forming an improvement chain."""
+        mock_llm = MagicMock(spec=LLMPort)
+        mock_judge = MagicMock(spec=JudgePort)
+        mock_proposer = MagicMock(spec=ProposerPort)
+
+        evaluator = PromptEvaluator(mock_llm, mock_judge)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:2]
+
+        evals = [
+            _make_eval([0.2, 0.2]),  # initial
+        ]
+        for i in range(1, 5):
+            score = 0.2 + i * 0.15
+            evals.append(_make_eval([score, score]))  # current
+            evals.append(_make_eval([score + 0.1, score + 0.1]))  # new (accepted)
+        evaluator.evaluate = AsyncMock(side_effect=evals)
+
+        mock_proposer.propose.side_effect = [
+            Prompt(text=f"Improved v{i}") for i in range(4)
+        ]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer,
+            bootstrap=bootstrap,
+            max_iterations=4,
+            minibatch_size=2,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        assert len(state.candidates) == 5  # seed + 4 accepted
+        assert all(h["event"] == "accepted" for h in state.history)
+
+    @pytest.mark.asyncio
+    async def test_error_recovery_continues_loop(
+        self,
+        seed_prompt: Prompt,
+        task_description: str,
+        synthetic_pool: list[SyntheticExample],
+    ) -> None:
+        """When an iteration errors, the loop continues."""
+        mock_llm = MagicMock(spec=LLMPort)
+        mock_judge = MagicMock(spec=JudgePort)
+        mock_proposer = MagicMock(spec=ProposerPort)
+
+        evaluator = PromptEvaluator(mock_llm, mock_judge)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:2]
+
+        # Eval sequence for 3 iterations:
+        # - iter 1: evaluate current → propose → evaluate new (accepted)
+        # - iter 2: evaluate current → propose (ERROR, no new eval)
+        # - iter 3: evaluate current → propose → evaluate new (accepted)
+        evals = [
+            _make_eval([0.3, 0.3]),  # initial
+            _make_eval([0.5, 0.5]),  # iter 1 current
+            _make_eval([0.9, 0.9]),  # iter 1 new (accepted)
+            _make_eval([0.5, 0.5]),  # iter 2 current (proposer errors after this)
+            _make_eval([0.5, 0.5]),  # iter 3 current
+            _make_eval([0.8, 0.8]),  # iter 3 new (accepted)
+        ]
+        evaluator.evaluate = AsyncMock(side_effect=evals)
+
+        # Proposer raises on iter 2
+        mock_proposer.propose.side_effect = [
+            Prompt(text="good v1"),
+            RuntimeError("LLM timeout"),
+            Prompt(text="good v3"),
+        ]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer,
+            bootstrap=bootstrap,
+            max_iterations=3,
+            minibatch_size=2,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        assert state.iteration == 3
+        assert state.history[1]["event"] == "error"
+        assert "LLM timeout" in state.history[1]["error"]
+        assert state.history[0]["event"] == "accepted"
+        assert state.history[2]["event"] == "accepted"
+
+    @pytest.mark.asyncio
+    async def test_perfect_score_skips_proposer(
+        self,
+        seed_prompt: Prompt,
+        task_description: str,
+        synthetic_pool: list[SyntheticExample],
+    ) -> None:
+        """When all scores are perfect, no proposal is made."""
+        mock_llm = MagicMock(spec=LLMPort)
+        mock_judge = MagicMock(spec=JudgePort)
+        mock_proposer = MagicMock(spec=ProposerPort)
+
+        evaluator = PromptEvaluator(mock_llm, mock_judge)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:2]
+
+        perfect_eval = _make_eval([1.0, 1.0])
+        evaluator.evaluate = AsyncMock(return_value=perfect_eval)
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer,
+            bootstrap=bootstrap,
+            max_iterations=5,
+            minibatch_size=2,
+            perfect_score=1.0,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        mock_proposer.propose.assert_not_called()
+        assert all(h["event"] == "skip_perfect" for h in state.history)
+
+    @pytest.mark.asyncio
+    async def test_llm_call_counting(
+        self,
+        seed_prompt: Prompt,
+        task_description: str,
+        synthetic_pool: list[SyntheticExample],
+    ) -> None:
+        """Verify LLM call counting: 2*N per eval (execute + judge) + 1 per propose."""
+        mock_llm = MagicMock(spec=LLMPort)
+        mock_judge = MagicMock(spec=JudgePort)
+        mock_proposer = MagicMock(spec=ProposerPort)
+
+        evaluator = PromptEvaluator(mock_llm, mock_judge)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:3]
+
+        evals = [_make_eval([0.3, 0.3, 0.3])]  # initial
+        for _ in range(2):
+            evals.append(_make_eval([0.4, 0.4, 0.4]))
+            evals.append(_make_eval([0.6, 0.6, 0.6]))
+        evaluator.evaluate = AsyncMock(side_effect=evals)
+
+        mock_proposer.propose.side_effect = [
+            Prompt(text="v1"),
+            Prompt(text="v2"),
+        ]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer,
+            bootstrap=bootstrap,
+            max_iterations=2,
+            minibatch_size=3,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        # Initial: 2*3=6, Iter1: 2*3 + 1 + 2*3 = 13, Iter2: same = 13
+        # Total: 6 + 13 + 13 = 32
+        assert state.total_llm_calls == 32
@@ -1,7 +1,9 @@
 """End-to-end pipeline test with mocked LLM calls."""
 from __future__ import annotations

-from unittest.mock import MagicMock
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest

 from prometheus.application.bootstrap import SyntheticBootstrap
 from prometheus.application.dto import OptimizationConfig
@@ -23,9 +25,10 @@ def _make_eval(scores: list[float]) -> EvalResult:


 class TestFullPipeline:
-    def test_pipeline_produces_result(self) -> None:
+    @pytest.mark.asyncio
+    async def test_pipeline_produces_result(self) -> None:
        """Full pipeline with mocked ports produces an OptimizationResult."""
-        mock_llm = MagicMock(spec=LLMPort)
+        mock_llm = AsyncMock(spec=LLMPort)
        mock_llm.execute.return_value = "mock response"

        mock_judge = MagicMock(spec=JudgePort)
@@ -38,11 +41,11 @@ class TestFullPipeline:
            eval_sequence.append(_make_eval([0.6, 0.6, 0.6, 0.6, 0.6]))  # new eval (accepted)
        mock_judge.judge_batch.return_value = [(0.5, "ok")] * 5

-        mock_proposer = MagicMock(spec=ProposerPort)
+        mock_proposer = AsyncMock(spec=ProposerPort)
        mock_proposer.propose.return_value = Prompt(text="Improved prompt")

        evaluator = PromptEvaluator(mock_llm, mock_judge)
-        evaluator.evaluate = MagicMock(side_effect=eval_sequence)
+        evaluator.evaluate = AsyncMock(side_effect=eval_sequence)

        mock_gen = MagicMock()
        mock_gen.generate_inputs.return_value = [
@@ -65,7 +68,7 @@ class TestFullPipeline:
            seed=42,
        )

-        result = use_case.execute(config)
+        result = await use_case.execute(config)

        assert result.initial_prompt == "Answer questions."
        assert result.optimized_prompt == "Improved prompt"
@@ -0,0 +1,199 @@
+"""Integration test — ground-truth evaluation end-to-end with real similarity metrics."""
+from __future__ import annotations
+
+import asyncio
+import json
+from unittest.mock import AsyncMock
+
+import pytest
+
+from prometheus.application.ground_truth_evaluator import GroundTruthEvaluator
+from prometheus.domain.entities import GroundTruthExample, Prompt
+from prometheus.domain.ports import LLMPort
+from prometheus.infrastructure.dataset_loader import FileDatasetLoader
+from prometheus.infrastructure.similarity import (
+    BleuSimilarity,
+    CosineSimilarity,
+    ExactMatchSimilarity,
+    RougeLSimilarity,
+    create_similarity_adapter,
+)
+
+
+def _make_dataset(items: list[tuple[str, str]]) -> list[GroundTruthExample]:
+    return [
+        GroundTruthExample(input_text=inp, expected_output=exp, id=i)
+        for i, (inp, exp) in enumerate(items)
+    ]
+
+
+@pytest.fixture
+def qa_dataset():
+    return _make_dataset([
+        ("What is the capital of France?", "Paris"),
+        ("What is 2+2?", "4"),
+        ("What color is the sky?", "blue"),
+    ])
+
+
+@pytest.fixture
+def prompt():
+    return Prompt(text="Answer the following question concisely.")
+
+
+@pytest.fixture
+def mock_executor():
+    """Returns responses that partially match the ground truth."""
+    port = AsyncMock(spec=LLMPort)
+    port.execute.side_effect = [
+        "Paris is the capital of France.",
+        "The answer is 4.",
+        "The sky is blue.",
+    ]
+    return port
+
+
+class TestGroundTruthIntegrationWithExactMatch:
+    @pytest.mark.asyncio
+    async def test_exact_match_on_qa(self, mock_executor, qa_dataset, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor,
+            similarity=ExactMatchSimilarity(),
+        )
+        result = await evaluator.evaluate(prompt, qa_dataset)
+        # None of the outputs exactly match the expected outputs
+        assert all(s == 0.0 for s in result.scores)
+
+    @pytest.mark.asyncio
+    async def test_exact_match_with_exact_outputs(self, qa_dataset, prompt):
+        exact_executor = AsyncMock(spec=LLMPort)
+        exact_executor.execute.side_effect = ["Paris", "4", "blue"]
+        evaluator = GroundTruthEvaluator(
+            executor=exact_executor,
+            similarity=ExactMatchSimilarity(),
+        )
+        result = await evaluator.evaluate(prompt, qa_dataset)
+        assert all(s == 1.0 for s in result.scores)
+
+
+class TestGroundTruthIntegrationWithBleu:
+    @pytest.mark.asyncio
+    async def test_bleu_scores_partial_match(self, mock_executor, qa_dataset, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor,
+            similarity=BleuSimilarity(),
+        )
+        result = await evaluator.evaluate(prompt, qa_dataset)
+        assert all(0.0 < s < 1.0 for s in result.scores)
+        assert result.mean_score > 0.0
+
+    @pytest.mark.asyncio
+    async def test_bleu_perfect_match(self, qa_dataset, prompt):
+        perfect_executor = AsyncMock(spec=LLMPort)
+        perfect_executor.execute.side_effect = ["Paris", "4", "blue"]
+        evaluator = GroundTruthEvaluator(
+            executor=perfect_executor,
+            similarity=BleuSimilarity(),
+        )
+        result = await evaluator.evaluate(prompt, qa_dataset)
+        assert all(s == pytest.approx(1.0) for s in result.scores)
+
+
+class TestGroundTruthIntegrationWithRouge:
+    @pytest.mark.asyncio
+    async def test_rouge_l_scores(self, mock_executor, qa_dataset, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor,
+            similarity=RougeLSimilarity(),
+        )
+        result = await evaluator.evaluate(prompt, qa_dataset)
+        assert all(s > 0.0 for s in result.scores)
+
+
+class TestGroundTruthIntegrationWithCosine:
+    @pytest.mark.asyncio
+    async def test_cosine_scores(self, mock_executor, qa_dataset, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor,
+            similarity=CosineSimilarity(),
+        )
+        result = await evaluator.evaluate(prompt, qa_dataset)
+        assert all(s > 0.0 for s in result.scores)
+
+
+class TestDatasetLoaderIntegration:
+    @pytest.mark.asyncio
+    async def test_load_csv_and_evaluate(self, tmp_path, prompt):
+        csv_file = tmp_path / "eval.csv"
+        csv_file.write_text("input,expected_output\nWhat is 2+2?,4\nWhat color is grass?,green\n")
+
+        loader = FileDatasetLoader()
+        dataset = loader.load(str(csv_file))
+        assert len(dataset) == 2
+
+        executor = AsyncMock(spec=LLMPort)
+        executor.execute.side_effect = ["4", "green"]
+
+        evaluator = GroundTruthEvaluator(
+            executor=executor,
+            similarity=ExactMatchSimilarity(),
+        )
+        result = await evaluator.evaluate(prompt, dataset)
+        assert all(s == 1.0 for s in result.scores)
+
+    @pytest.mark.asyncio
+    async def test_load_json_and_evaluate(self, tmp_path, prompt):
+        json_file = tmp_path / "eval.json"
+        data = [
+            {"input": "What is 2+2?", "expected_output": "4"},
+            {"input": "What color is grass?", "expected_output": "green"},
+        ]
+        json_file.write_text(json.dumps(data))
+
+        loader = FileDatasetLoader()
+        dataset = loader.load(str(json_file))
+        assert len(dataset) == 2
+
+        executor = AsyncMock(spec=LLMPort)
+        executor.execute.side_effect = ["4", "not green"]
+
+        evaluator = GroundTruthEvaluator(
+            executor=executor,
+            similarity=create_similarity_adapter("bleu"),
+        )
+        result = await evaluator.evaluate(prompt, dataset)
+        # First item should score well, second poorly
+        assert result.scores[0] > result.scores[1]
+
+
+class TestMetricComparison:
+    """Compare different metrics on the same outputs to ensure they behave differently."""
+
+    @pytest.mark.asyncio
+    async def test_metrics_give_different_scores(self, qa_dataset, prompt):
+        results = {}
+        for metric_name, metric_cls in [
+            ("exact", ExactMatchSimilarity),
+            ("bleu", BleuSimilarity),
+            ("rouge_l", RougeLSimilarity),
+            ("cosine", CosineSimilarity),
+        ]:
+            executor = AsyncMock(spec=LLMPort)
+            executor.execute.side_effect = [
+                "Paris is the capital of France.",
+                "The answer is 4.",
+                "The sky is blue.",
+            ]
+            evaluator = GroundTruthEvaluator(
+                executor=executor,
+                similarity=metric_cls(),
+            )
+            result = await evaluator.evaluate(prompt, qa_dataset)
+            results[metric_name] = result.mean_score
+
+        # Exact match should be 0 (no exact matches)
+        assert results["exact"] == 0.0
+        # All other metrics should give partial credit
+        assert results["bleu"] > 0.0
+        assert results["rouge_l"] > 0.0
+        assert results["cosine"] > 0.0
@@ -0,0 +1,294 @@
+"""Unit tests for infrastructure adapters — LLM, Judge, Proposer, Synthetic.
+
+Uses mocked DSPy modules to isolate adapter logic from LLM calls.
+"""
+from __future__ import annotations
+
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import dspy
+import pytest
+
+from prometheus.domain.entities import Prompt, SyntheticExample, Trajectory
+from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
+from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
+from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
+from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter
+
+
+# --- LLM Adapter ---
+
+
+class TestDSPyLLMAdapter:
+    """Tests for DSPyLLMAdapter.execute()."""
+
+    @pytest.fixture
+    def mock_lm(self) -> MagicMock:
+        return MagicMock(spec=dspy.LM)
+
+    @pytest.fixture
+    def adapter(self, mock_lm: MagicMock) -> DSPyLLMAdapter:
+        return DSPyLLMAdapter(lm=mock_lm)
+
+    @pytest.mark.asyncio
+    async def test_execute_returns_output_string(
+        self, adapter: DSPyLLMAdapter, mock_lm: MagicMock
+    ) -> None:
+        mock_predictor = MagicMock()
+        mock_predictor.return_value = MagicMock(output="Hello response")
+        adapter._predictor = mock_predictor
+
+        prompt = Prompt(text="Say hello.")
+        result = await adapter.execute(prompt, "Hi there")
+
+        assert result == "Hello response"
+
+    @pytest.mark.asyncio
+    async def test_execute_passes_prompt_text_and_input(
+        self, adapter: DSPyLLMAdapter, mock_lm: MagicMock
+    ) -> None:
+        mock_predictor = MagicMock()
+        mock_predictor.return_value = MagicMock(output="response")
+        adapter._predictor = mock_predictor
+
+        prompt = Prompt(text="Translate this.")
+        await adapter.execute(prompt, "Hello world")
+
+        mock_predictor.assert_called_once_with(
+            instruction="Translate this.",
+            input_text="Hello world",
+        )
+
+    @pytest.mark.asyncio
+    async def test_execute_uses_dspy_context(
+        self, adapter: DSPyLLMAdapter, mock_lm: MagicMock
+    ) -> None:
+        mock_predictor = MagicMock()
+        mock_predictor.return_value = MagicMock(output="ok")
+        adapter._predictor = mock_predictor
+
+        with patch("prometheus.infrastructure.llm_adapter.dspy.context") as mock_ctx:
+            await adapter.execute(Prompt(text="test"), "input")
+            mock_ctx.assert_called_once_with(lm=mock_lm)
+
+    @pytest.mark.asyncio
+    async def test_execute_converts_output_to_str(
+        self, adapter: DSPyLLMAdapter, mock_lm: MagicMock
+    ) -> None:
+        mock_predictor = MagicMock()
+        mock_predictor.return_value = MagicMock(output=42)
+        adapter._predictor = mock_predictor
+
+        result = await adapter.execute(Prompt(text="test"), "input")
+        assert isinstance(result, str)
+        assert result == "42"
+
+
+# --- Judge Adapter ---
+
+
+class TestDSPyJudgeAdapter:
+    """Tests for DSPyJudgeAdapter.judge_batch()."""
+
+    @pytest.fixture
+    def mock_lm(self) -> MagicMock:
+        return MagicMock(spec=dspy.LM)
+
+    @pytest.fixture
+    def adapter(self, mock_lm: MagicMock) -> DSPyJudgeAdapter:
+        return DSPyJudgeAdapter(lm=mock_lm)
+
+    @pytest.mark.asyncio
+    async def test_judge_batch_returns_scores_and_feedback(
+        self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock
+    ) -> None:
+        adapter._judge = MagicMock()
+        adapter._judge.side_effect = [
+            MagicMock(score=0.9, feedback="Excellent."),
+            MagicMock(score=0.4, feedback="Incomplete."),
+        ]
+
+        pairs = [("What is 2+2?", "4"), ("Capital of France?", "London")]
+        result = await adapter.judge_batch("math and geography", pairs)
+
+        assert len(result) == 2
+        assert result[0] == (0.9, "Excellent.")
+        assert result[1] == (0.4, "Incomplete.")
+
+    @pytest.mark.asyncio
+    async def test_judge_batch_empty_pairs(
+        self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock
+    ) -> None:
+        result = await adapter.judge_batch("task", [])
+        assert result == []
+
+    @pytest.mark.asyncio
+    async def test_judge_batch_uses_dspy_context(
+        self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock
+    ) -> None:
+        adapter._judge = MagicMock()
+        adapter._judge.return_value = MagicMock(score=0.5, feedback="ok")
+
+        with patch("prometheus.infrastructure.judge_adapter.dspy.context") as mock_ctx:
+            await adapter.judge_batch("task", [("in", "out")])
+            mock_ctx.assert_called_once_with(lm=mock_lm)
+
+    @pytest.mark.asyncio
+    async def test_judge_batch_returns_all_results(
+        self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock
+    ) -> None:
+        """Judge calls run in parallel, but every result is still returned."""
+        adapter._judge = MagicMock()
+        adapter._judge.side_effect = [
+            MagicMock(score=0.5, feedback="ok"),
+            MagicMock(score=0.7, feedback="better"),
+            MagicMock(score=0.3, feedback="worse"),
+        ]
+
+        pairs = [("first", "out1"), ("second", "out2"), ("third", "out3")]
+        results = await adapter.judge_batch("task", pairs)
+
+        assert len(results) == 3
+        scores = [r[0] for r in results]
+        assert 0.5 in scores
+        assert 0.7 in scores
+        assert 0.3 in scores
+
+
+# --- Proposer Adapter ---
+
+
+class TestDSPyProposerAdapter:
+    """Tests for DSPyProposerAdapter.propose()."""
+
+    @pytest.fixture
+    def mock_lm(self) -> MagicMock:
+        return MagicMock(spec=dspy.LM)
+
+    @pytest.fixture
+    def adapter(self, mock_lm: MagicMock) -> DSPyProposerAdapter:
+        return DSPyProposerAdapter(lm=mock_lm)
+
+    @pytest.mark.asyncio
+    async def test_propose_returns_new_prompt(
+        self, adapter: DSPyProposerAdapter, mock_lm: MagicMock
+    ) -> None:
+        adapter._proposer = MagicMock()
+        adapter._proposer.return_value = MagicMock(
+            new_instruction="Be concise and accurate."
+        )
+
+        current = Prompt(text="Answer questions.")
+        trajectories = [
+            Trajectory("in", "out", 0.3, "too verbose", "Answer questions.")
+        ]
+        result = await adapter.propose(current, trajectories, "Q&A task")
+
+        assert isinstance(result, Prompt)
+        assert result.text == "Be concise and accurate."
+
+    @pytest.mark.asyncio
+    async def test_propose_uses_dspy_context(
+        self, adapter: DSPyProposerAdapter, mock_lm: MagicMock
+    ) -> None:
+        adapter._proposer = MagicMock()
+        adapter._proposer.return_value = MagicMock(new_instruction="improved")
+
+        with patch("prometheus.infrastructure.proposer_adapter.dspy.context") as mock_ctx:
+            await adapter.propose(Prompt(text="t"), [], "task")
+            mock_ctx.assert_called_once_with(lm=mock_lm)
+
+    def test_format_failures_single_trajectory(self) -> None:
+        trajectories = [
+            Trajectory("What is AI?", "A type of robot.", 0.3, "Incomplete definition.", "prompt")
+        ]
+        result = DSPyProposerAdapter._format_failures(trajectories)
+
+        assert "What is AI?" in result
+        assert "A type of robot." in result
+        assert "0.30" in result
+        assert "Incomplete definition." in result
+        assert "# Example 1" in result
+
+    def test_format_failures_multiple_trajectories(self) -> None:
+        trajectories = [
+            Trajectory("input1", "output1", 0.4, "bad", "prompt"),
+            Trajectory("input2", "output2", 0.2, "worse", "prompt"),
+        ]
+        result = DSPyProposerAdapter._format_failures(trajectories)
+
+        assert "# Example 1" in result
+        assert "# Example 2" in result
+        assert "---" in result
+        assert "input1" in result
+        assert "input2" in result
+
+    def test_format_failures_empty_list(self) -> None:
+        result = DSPyProposerAdapter._format_failures([])
+        assert result == ""
+
+
+# --- Synthetic Adapter ---
+
+
+class TestDSPySyntheticAdapter:
+    """Tests for DSPySyntheticAdapter.generate_inputs()."""
+
+    @pytest.fixture
+    def mock_lm(self) -> MagicMock:
+        return MagicMock(spec=dspy.LM)
+
+    @pytest.fixture
+    def adapter(self, mock_lm: MagicMock) -> DSPySyntheticAdapter:
+        return DSPySyntheticAdapter(lm=mock_lm)
+
+    def test_generate_inputs_returns_examples(
+        self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock
+    ) -> None:
+        adapter._generator = MagicMock()
+        adapter._generator.return_value = MagicMock(
+            examples=["What is AI?", "Explain ML.", "What is NLP?"]
+        )
+
+        result = adapter.generate_inputs("AI task", 3)
+
+        assert len(result) == 3
+        assert all(isinstance(ex, SyntheticExample) for ex in result)
+        assert result[0].input_text == "What is AI?"
+        assert result[0].id == 0
+        assert result[1].id == 1
+
+    def test_generate_inputs_truncates_to_n(
+        self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock
+    ) -> None:
+        adapter._generator = MagicMock()
+        adapter._generator.return_value = MagicMock(
+            examples=["q1", "q2", "q3", "q4", "q5"]
+        )
+
+        result = adapter.generate_inputs("task", 3)
+
+        assert len(result) == 3
+
+    def test_generate_inputs_passes_correct_args(
+        self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock
+    ) -> None:
+        adapter._generator = MagicMock()
+        adapter._generator.return_value = MagicMock(examples=["q1"])
+
+        adapter.generate_inputs("my task", 5)
+
+        adapter._generator.assert_called_once_with(
+            task_description="my task",
+            n_examples=5,
+        )
+
+    def test_generate_inputs_empty_list(
+        self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock
+    ) -> None:
+        adapter._generator = MagicMock()
+        adapter._generator.return_value = MagicMock(examples=[])
+
+        result = adapter.generate_inputs("task", 0)
+
+        assert result == []
@@ -0,0 +1,333 @@
+"""Unit tests for checkpoint & resume functionality."""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.application.evolution import EvolutionLoop
+from prometheus.domain.entities import (
+    Candidate,
+    EvalResult,
+    OptimizationState,
+    Prompt,
+    SyntheticExample,
+    Trajectory,
+)
+from prometheus.infrastructure.checkpoint import JsonCheckpointPersistence
+
+
+# ---------------------------------------------------------------------------
+# JsonCheckpointPersistence — save/load round-trip
+# ---------------------------------------------------------------------------
+
+
+class TestJsonCheckpointPersistence:
+    def test_roundtrip_full_state(self, tmp_path: Path) -> None:
+        """Saving and loading preserves all fields."""
+        ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "ckpts")
+
+        state = OptimizationState(
+            iteration=7,
+            best_candidate=Candidate(
+                prompt=Prompt(text="best prompt", metadata={"source": "test"}),
+                best_score=0.92,
+                generation=5,
+            ),
+            candidates=[
+                Candidate(prompt=Prompt(text="p1"), best_score=0.5, generation=0),
+                Candidate(prompt=Prompt(text="p2"), best_score=0.92, generation=5),
+            ],
+            synthetic_pool=[
+                SyntheticExample(input_text="q1", category="cat_a", id=0),
+                SyntheticExample(input_text="q2", category="cat_b", id=1),
+            ],
+            history=[{"iteration": 1, "event": "accepted", "old_score": 0.5, "new_score": 0.7}],
+            total_llm_calls=42,
+        )
+
+        ckpt.save(state)
+        assert ckpt.latest_exists()
+
+        loaded = ckpt.load()
+        assert loaded is not None
+        assert loaded.iteration == 7
+        assert loaded.total_llm_calls == 42
+        assert loaded.best_candidate is not None
+        assert loaded.best_candidate.prompt.text == "best prompt"
+        assert loaded.best_candidate.prompt.metadata == {"source": "test"}
+        assert loaded.best_candidate.best_score == 0.92
+        assert len(loaded.candidates) == 2
+        assert len(loaded.synthetic_pool) == 2
+        assert loaded.synthetic_pool[0].input_text == "q1"
+        assert loaded.synthetic_pool[1].category == "cat_b"
+        assert loaded.history[0]["event"] == "accepted"
+
+    def test_load_returns_none_when_no_checkpoint(self, tmp_path: Path) -> None:
+        """Loading from a nonexistent directory returns None."""
+        ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "nope")
+        assert ckpt.load() is None
+        assert not ckpt.latest_exists()
+
+    def test_creates_directory_on_save(self, tmp_path: Path) -> None:
+        """Save creates the directory tree if it doesn't exist."""
+        deep_dir = tmp_path / "a" / "b" / "c"
+        ckpt = JsonCheckpointPersistence(checkpoint_dir=deep_dir)
+        state = OptimizationState(iteration=1)
+        ckpt.save(state)
+        assert (deep_dir / "latest.json").exists()
+
+    def test_overwrites_previous_checkpoint(self, tmp_path: Path) -> None:
+        """Second save overwrites the first."""
+        ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path)
+
+        ckpt.save(OptimizationState(iteration=1, total_llm_calls=10))
+        ckpt.save(OptimizationState(iteration=5, total_llm_calls=50))
+
+        loaded = ckpt.load()
+        assert loaded is not None
+        assert loaded.iteration == 5
+        assert loaded.total_llm_calls == 50
+
+    def test_json_is_human_readable(self, tmp_path: Path) -> None:
+        """Checkpoint file is valid, pretty-printed JSON."""
+        ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path)
+        state = OptimizationState(
+            iteration=3,
+            best_candidate=Candidate(prompt=Prompt(text="hello"), best_score=0.8),
+        )
+        ckpt.save(state)
+
+        raw = json.loads((tmp_path / "latest.json").read_text())
+        assert raw["schema_version"] == 1
+        assert raw["iteration"] == 3
+        assert raw["best_candidate"]["prompt_text"] == "hello"
+
+
+# ---------------------------------------------------------------------------
+# EvolutionLoop — checkpoint integration
+# ---------------------------------------------------------------------------
+
+
+class TestEvolutionCheckpoint:
+    @pytest.mark.asyncio
+    async def test_checkpoint_saved_on_interval(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+    ) -> None:
+        """Checkpoint is saved every checkpoint_interval iterations."""
+        from unittest.mock import MagicMock
+
+        evaluator = PromptEvaluator(AsyncMock(), AsyncMock())
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        # All iterations accepted so checkpoint triggers
+        good_eval = EvalResult(
+            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
+            feedbacks=["ok"] * 5,
+            trajectories=[
+                Trajectory(f"input{i}", f"out{i}", s, "ok", "p")
+                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
+            ],
+        )
+        better_eval = EvalResult(
+            scores=[0.8, 0.9, 0.7, 0.8, 0.9],
+            feedbacks=["good"] * 5,
+            trajectories=[],
+        )
+        # initial_eval + 5 iterations (each needs old_eval + new_eval)
+        evaluator.evaluate = AsyncMock(
+            side_effect=[good_eval]  # initial
+            + [good_eval, better_eval] * 5  # 5 iterations
+        )
+
+        proposer = AsyncMock()
+        proposer.propose.return_value = Prompt(text="improved prompt")
+
+        checkpoint_port = MagicMock()
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=proposer,
+            bootstrap=bootstrap,
+            max_iterations=5,
+            minibatch_size=5,
+            checkpoint_port=checkpoint_port,
+            checkpoint_interval=2,
+        )
+
+        await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        # Checkpoint at iterations 2, 4 (every 2nd)
+        save_calls = checkpoint_port.save.call_count
+        assert save_calls >= 2  # at least at iters 2 and 4
+
+    @pytest.mark.asyncio
+    async def test_no_checkpoint_without_port(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+    ) -> None:
+        """No checkpointing happens when checkpoint_port is None (default)."""
+        evaluator = PromptEvaluator(AsyncMock(), AsyncMock())
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        perfect_eval = EvalResult(
+            scores=[1.0] * 5,
+            feedbacks=["perfect"] * 5,
+            trajectories=[
+                Trajectory(f"in{i}", f"out{i}", 1.0, "perfect", "p")
+                for i in range(5)
+            ],
+        )
+        evaluator.evaluate = AsyncMock(return_value=perfect_eval)
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=AsyncMock(),
+            bootstrap=bootstrap,
+            max_iterations=3,
+            minibatch_size=5,
+            checkpoint_port=None,
+        )
+        # Should run without error — no checkpoint port, no crash
+        await loop.run(seed_prompt, synthetic_pool, task_description)
+
+    @pytest.mark.asyncio
+    async def test_resume_skips_seed_evaluation(
+        self,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+    ) -> None:
+        """With initial_state, seed eval is skipped and the loop resumes from the saved iteration."""
+        evaluator = PromptEvaluator(AsyncMock(), AsyncMock())
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        proposer = AsyncMock()
+        proposer.propose.return_value = Prompt(text="new prompt")
+
+        # Only return evaluations for resumed iterations (1 iter: old_eval + new_eval)
+        old_eval = EvalResult(
+            scores=[0.5] * 5,
+            feedbacks=["ok"] * 5,
+            trajectories=[
+                Trajectory(f"in{i}", f"out{i}", 0.5, "ok", "p") for i in range(5)
+            ],
+        )
+        new_eval = EvalResult(
+            scores=[0.8] * 5,
+            feedbacks=["good"] * 5,
+            trajectories=[],
+        )
+        evaluator.evaluate = AsyncMock(side_effect=[old_eval, new_eval])
+
+        # Create a state simulating checkpoint at iteration 4
+        initial_state = OptimizationState(
+            iteration=4,
+            best_candidate=Candidate(
+                prompt=Prompt(text="checkpoint prompt"), best_score=2.5, generation=4
+            ),
+            candidates=[Candidate(prompt=Prompt(text="checkpoint prompt"), best_score=2.5)],
+            total_llm_calls=40,
+        )
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=proposer,
+            bootstrap=bootstrap,
+            max_iterations=5,  # only iteration 5 remains
+            minibatch_size=5,
+        )
+        state = await loop.run(
+            seed_prompt=Prompt(text="seed"),
+            synthetic_pool=synthetic_pool,
+            task_description=task_description,
+            initial_state=initial_state,
+        )
+
+        # Should have run only 1 iteration (iter 5)
+        assert state.iteration == 5
+        # total_llm_calls should include the 40 from checkpoint + new calls
+        assert state.total_llm_calls > 40
+
+    @pytest.mark.asyncio
+    async def test_full_save_and_resume_roundtrip(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        tmp_path: Path,
+    ) -> None:
+        """End-to-end: run a few iterations, checkpoint, resume, finish."""
+        evaluator = PromptEvaluator(AsyncMock(), AsyncMock())
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        old_eval = EvalResult(
+            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
+            feedbacks=["ok"] * 5,
+            trajectories=[
+                Trajectory(f"in{i}", f"out{i}", s, "ok", "p")
+                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
+            ],
+        )
+        new_eval = EvalResult(
+            scores=[0.8, 0.9, 0.7, 0.8, 0.9],
+            feedbacks=["good"] * 5,
+            trajectories=[],
+        )
+        evaluator.evaluate = AsyncMock(
+            side_effect=[old_eval, old_eval, new_eval, old_eval, new_eval]
+        )
+        proposer = AsyncMock()
+        proposer.propose.return_value = Prompt(text="improved prompt")
+
+        ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "ckpts")
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=proposer,
+            bootstrap=bootstrap,
+            max_iterations=2,
+            minibatch_size=5,
+            checkpoint_port=ckpt,
+            checkpoint_interval=1,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+        assert state.iteration == 2
+        assert ckpt.latest_exists()
+
+        # Capture the checkpoint state *before* resume (state is mutated in-place)
+        loaded = ckpt.load()
+        assert loaded is not None
+        saved_llm_calls = loaded.total_llm_calls
+        saved_iteration = loaded.iteration
+
+        # Set up evaluator for resumed run (just 1 more iteration)
+        evaluator.evaluate = AsyncMock(side_effect=[old_eval, new_eval])
+        proposer.propose.return_value = Prompt(text="even better prompt")
+
+        loop2 = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=proposer,
+            bootstrap=bootstrap,
+            max_iterations=3,
+            minibatch_size=5,
+            checkpoint_port=ckpt,
+            checkpoint_interval=1,
+        )
+        resumed = await loop2.run(
+            seed_prompt, synthetic_pool, task_description,
+            initial_state=loaded,
+        )
+        assert resumed.iteration == 3
+        assert resumed.total_llm_calls > saved_llm_calls
+        assert resumed.iteration > saved_iteration
@@ -0,0 +1,278 @@
+"""Tests for the CLI: prometheus optimize, version, list, and help output.
+
+Uses Typer's CliRunner for isolated command testing.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+import yaml
+from typer.testing import CliRunner
+
+from prometheus.application.dto import OptimizationResult
+from prometheus.cli.app import app
+
+runner = CliRunner()
+
+
+class TestCLIOptimize:
+    """Tests for the `prometheus optimize` command."""
+
+    def _write_config(self, tmp_path: Path, **overrides: object) -> Path:
+        """Write a minimal valid config YAML and return its path."""
+        data = {
+            "seed_prompt": "You are a helpful assistant.",
+            "task_description": "Answer factual questions accurately.",
+        }
+        data.update(overrides)
+        config_file = tmp_path / "config.yaml"
+        with open(config_file, "w") as f:
+            yaml.dump(data, f)
+        return config_file
+
+    def test_optimize_with_valid_config(self, tmp_path: Path) -> None:
+        config_file = self._write_config(tmp_path)
+        output_file = tmp_path / "output.yaml"
+
+        mock_result = OptimizationResult(
+            optimized_prompt="Improved prompt",
+            initial_prompt="You are a helpful assistant.",
+            iterations_used=5,
+            total_llm_calls=50,
+            initial_score=0.3,
+            final_score=0.9,
+            improvement=0.6,
+            history=[],
+        )
+
+        mock_uc = AsyncMock()
+        mock_uc.execute.return_value = mock_result
+
+        with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc):
+            with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"):
+                with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls:
+                    mock_llm_cls.return_value = MagicMock()
+                    with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls:
+                        mock_judge_cls.return_value = MagicMock()
+                        with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls:
+                            mock_prop_cls.return_value = MagicMock()
+                            with patch("prometheus.cli.commands.optimize.dspy"):
+                                result = runner.invoke(
+                                    app,
+                                    [
+                                        "optimize",
+                                        "-i",
+                                        str(config_file),
+                                        "-o",
+                                        str(output_file),
+                                    ],
+                                )
+
+        assert result.exit_code == 0
+        assert "Optimized Prompt" in result.output
+
+    def test_optimize_missing_input_file(self) -> None:
+        result = runner.invoke(
+            app,
+            ["optimize", "-i", "/nonexistent/config.yaml"],
+        )
+        assert result.exit_code != 0
+
+    def test_optimize_with_verbose_flag(self, tmp_path: Path) -> None:
+        config_file = self._write_config(tmp_path)
+        output_file = tmp_path / "output.yaml"
+
+        mock_result = OptimizationResult(
+            optimized_prompt="Improved",
+            initial_prompt="test",
+            iterations_used=1,
+            total_llm_calls=10,
+            initial_score=0.3,
+            final_score=0.8,
+            improvement=0.5,
+            history=[],
+        )
+
+        mock_uc = AsyncMock()
+        mock_uc.execute.return_value = mock_result
+
+        with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc):
+            with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"):
+                with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls:
+                    mock_llm_cls.return_value = MagicMock()
+                    with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls:
+                        mock_judge_cls.return_value = MagicMock()
+                        with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls:
+                            mock_prop_cls.return_value = MagicMock()
+                            with patch("prometheus.cli.commands.optimize.dspy"):
+                                result = runner.invoke(
+                                    app,
+                                    [
+                                        "optimize",
+                                        "-i",
+                                        str(config_file),
+                                        "-o",
+                                        str(output_file),
+                                        "-v",
+                                    ],
+                                )
+
+        assert result.exit_code == 0
+
+    def test_optimize_displays_metrics(self, tmp_path: Path) -> None:
+        config_file = self._write_config(tmp_path)
+        output_file = tmp_path / "output.yaml"
+
+        mock_result = OptimizationResult(
+            optimized_prompt="Better prompt",
+            initial_prompt="test",
+            iterations_used=3,
+            total_llm_calls=30,
+            initial_score=0.40,
+            final_score=0.85,
+            improvement=0.45,
+            history=[],
+        )
+
+        mock_uc = AsyncMock()
+        mock_uc.execute.return_value = mock_result
+
+        with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc):
+            with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"):
+                with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls:
+                    mock_llm_cls.return_value = MagicMock()
+                    with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls:
+                        mock_judge_cls.return_value = MagicMock()
+                        with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls:
+                            mock_prop_cls.return_value = MagicMock()
+                            with patch("prometheus.cli.commands.optimize.dspy"):
+                                result = runner.invoke(
+                                    app,
+                                    [
+                                        "optimize",
+                                        "-i",
+                                        str(config_file),
+                                        "-o",
+                                        str(output_file),
+                                    ],
+                                )
+
+        assert result.exit_code == 0
+        assert "0.40" in result.output
+        assert "0.85" in result.output
+        assert "+0.45" in result.output
+
+    def test_optimize_with_max_concurrency_flag(self, tmp_path: Path) -> None:
+        config_file = self._write_config(tmp_path)
+        output_file = tmp_path / "output.yaml"
+
+        mock_result = OptimizationResult(
+            optimized_prompt="Better prompt",
+            initial_prompt="test",
+            iterations_used=1,
+            total_llm_calls=10,
+            initial_score=0.3,
+            final_score=0.8,
+            improvement=0.5,
+            history=[],
+        )
+
+        mock_uc = AsyncMock()
+        mock_uc.execute.return_value = mock_result
+
+        with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc):
+            with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"):
+                with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls:
+                    mock_llm_cls.return_value = MagicMock()
+                    with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls:
+                        mock_judge_cls.return_value = MagicMock()
+                        with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls:
+                            mock_prop_cls.return_value = MagicMock()
+                            with patch("prometheus.cli.commands.optimize.dspy"):
+                                result = runner.invoke(
+                                    app,
+                                    [
+                                        "optimize",
+                                        "-i",
+                                        str(config_file),
+                                        "-o",
+                                        str(output_file),
+                                        "--max-concurrency",
+                                        "10",
+                                    ],
+                                )
+
+        assert result.exit_code == 0
+
+
+class TestCLIHelp:
+    """Tests for CLI help and no-args behavior."""
+
+    def test_no_args_shows_help(self) -> None:
+        result = runner.invoke(app, [])
+        # With no_args_is_help=True the exit code is 0 or 2, depending on the Typer/Click version
+        assert result.exit_code in (0, 2)
+        assert "PROMETHEUS" in result.output or "Usage" in result.output
+
+    def test_optimize_help(self) -> None:
+        result = runner.invoke(app, ["optimize", "--help"])
+        assert result.exit_code == 0
+        assert "input" in result.output.lower() or "INPUT" in result.output
+
+    def test_version_help(self) -> None:
+        result = runner.invoke(app, ["version", "--help"])
+        assert result.exit_code == 0
+
+    def test_init_help(self) -> None:
+        result = runner.invoke(app, ["init", "--help"])
+        assert result.exit_code == 0
+
+    def test_list_help(self) -> None:
+        result = runner.invoke(app, ["list", "--help"])
+        assert result.exit_code == 0
+
+
+class TestCLIVersion:
+    """Tests for the `prometheus version` command."""
+
+    def test_version_prints_version(self) -> None:
+        result = runner.invoke(app, ["version"])
+        assert result.exit_code == 0
+        assert "PROMETHEUS" in result.output
+        assert "0.1.0" in result.output
+
+
+class TestCLIList:
+    """Tests for the `prometheus list` command."""
+
+    def test_list_no_runs(self, tmp_path: Path) -> None:
+        result = runner.invoke(app, ["list", "-d", str(tmp_path)])
+        assert result.exit_code == 0
+        assert "No optimization runs found" in result.output
+
+    def test_list_with_result(self, tmp_path: Path) -> None:
+        result_data = {
+            "optimized_prompt": "Better prompt for testing",
+            "initial_prompt": "test",
+            "iterations_used": 5,
+            "total_llm_calls": 50,
+            "initial_score": 0.30,
+            "final_score": 0.90,
+            "improvement": 0.60,
+            "history": [],
+        }
+        result_file = tmp_path / "output.yaml"
+        import yaml as _yaml
+        with open(result_file, "w") as f:
+            _yaml.dump(result_data, f)
+
+        result = runner.invoke(app, ["list", "-d", str(tmp_path)])
+        assert result.exit_code == 0
+        assert "0.30" in result.output
+        assert "0.90" in result.output
+
+    def test_list_nonexistent_directory(self) -> None:
+        result = runner.invoke(app, ["list", "-d", "/nonexistent/dir"])
+        assert result.exit_code == 1
@@ -300,3 +300,33 @@ class TestConfigValidation:
        )
        assert config.max_iterations == 1
        assert config.perfect_score == 0.0
+
+
+class TestEvalConfigValidation:
+    """Tests for ground-truth evaluation config fields."""
+
+    def test_eval_defaults(self) -> None:
+        config = OptimizationConfig(seed_prompt="a", task_description="b")
+        assert config.eval_dataset_path is None
+        assert config.eval_metric == "bleu"
+
+    def test_eval_dataset_path_set(self) -> None:
+        config = OptimizationConfig(
+            seed_prompt="a", task_description="b",
+            eval_dataset_path="data.csv",
+        )
+        assert config.eval_dataset_path == "data.csv"
+
+    def test_valid_eval_metrics(self) -> None:
+        for metric in ("exact", "bleu", "rouge_l", "cosine", "llm_judge"):
+            config = OptimizationConfig(
+                seed_prompt="a", task_description="b", eval_metric=metric,
+            )
+            assert config.eval_metric == metric
+
+    def test_invalid_eval_metric_raises(self) -> None:
+        with pytest.raises(ValidationError, match="eval_metric must be one of"):
+            OptimizationConfig(
+                seed_prompt="a", task_description="b",
+                eval_metric="invalid_metric",
+            )
@@ -0,0 +1,86 @@
+"""Tests for the ground-truth dataset loader."""
+from __future__ import annotations
+
+import json
+import os
+import tempfile
+
+import pytest
+
+from prometheus.domain.entities import GroundTruthExample
+from prometheus.infrastructure.dataset_loader import FileDatasetLoader
+
+
+@pytest.fixture
+def loader():
+    return FileDatasetLoader()
+
+
+class TestCsvLoader:
+    def test_load_csv(self, loader, tmp_path):
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("input,expected_output\nhello,world\nfoo,bar\n")
+        result = loader.load(str(csv_file))
+        assert len(result) == 2
+        assert result[0].input_text == "hello"
+        assert result[0].expected_output == "world"
+        assert result[1].input_text == "foo"
+        assert result[1].expected_output == "bar"
+
+    def test_load_csv_skips_empty_input(self, loader, tmp_path):
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("input,expected_output\n,bar\nhello,world\n")
+        result = loader.load(str(csv_file))
+        assert len(result) == 1
+        assert result[0].input_text == "hello"
+
+    def test_load_csv_with_whitespace(self, loader, tmp_path):
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("input,expected_output\n  hello  ,  world  \n")
+        result = loader.load(str(csv_file))
+        assert result[0].input_text == "hello"
+        assert result[0].expected_output == "world"
+
+    def test_load_csv_empty_file(self, loader, tmp_path):
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("input,expected_output\n")
+        result = loader.load(str(csv_file))
+        assert len(result) == 0
+
+
+class TestJsonLoader:
+    def test_load_json(self, loader, tmp_path):
+        json_file = tmp_path / "test.json"
+        data = [
+            {"input": "hello", "expected_output": "world"},
+            {"input": "foo", "expected_output": "bar"},
+        ]
+        json_file.write_text(json.dumps(data))
+        result = loader.load(str(json_file))
+        assert len(result) == 2
+        assert result[0].input_text == "hello"
+        assert result[0].expected_output == "world"
+
+    def test_load_json_skips_empty_input(self, loader, tmp_path):
+        json_file = tmp_path / "test.json"
+        data = [
+            {"input": "", "expected_output": "bar"},
+            {"input": "hello", "expected_output": "world"},
+        ]
+        json_file.write_text(json.dumps(data))
+        result = loader.load(str(json_file))
+        assert len(result) == 1
+
+    def test_load_json_not_array_raises(self, loader, tmp_path):
+        json_file = tmp_path / "test.json"
+        json_file.write_text(json.dumps({"not": "an array"}))
+        with pytest.raises(ValueError, match="must be an array"):
+            loader.load(str(json_file))
+
+
+class TestUnsupportedFormat:
+    def test_unsupported_extension_raises(self, loader, tmp_path):
+        txt_file = tmp_path / "test.txt"
+        txt_file.write_text("hello")
+        with pytest.raises(ValueError, match="Unsupported dataset format"):
+            loader.load(str(txt_file))
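
Review note: the tests above pin down the loader's observable behavior (CSV and JSON input, whitespace stripping, rows with an empty `input` skipped, a `ValueError` for non-array JSON and for unknown extensions). The real `FileDatasetLoader` is not part of this hunk, so the following is only a minimal sketch of a loader that would satisfy those tests. `Example` and `load_examples` are hypothetical names standing in for `GroundTruthExample` and `FileDatasetLoader.load`.

```python
import csv
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Example:
    """Hypothetical stand-in for GroundTruthExample."""

    input_text: str
    expected_output: str


def load_examples(path: str) -> list[Example]:
    """Load (input, expected_output) rows from a .csv or .json file."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix == ".csv":
        with file.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    elif suffix == ".json":
        rows = json.loads(file.read_text(encoding="utf-8"))
        if not isinstance(rows, list):
            raise ValueError("JSON dataset must be an array of objects")
    else:
        raise ValueError(f"Unsupported dataset format: {suffix}")

    examples: list[Example] = []
    for row in rows:
        text = str(row.get("input") or "").strip()
        if not text:
            continue  # rows without an input are skipped, as the tests expect
        expected = str(row.get("expected_output") or "").strip()
        examples.append(Example(text, expected))
    return examples
```

Dispatching on the file suffix keeps the format check in one place; whether the actual loader does the same is an assumption.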
@@ -278,6 +278,7 @@ class TestPerCallIsolation:
        adapter._judge_dimensions = []
        adapter._dimension_names = ""
        adapter._weights = {}
+        adapter.call_count = 0

        # Mock _judge to fail on first call, succeed on second
        call_count = 0
@@ -8,10 +8,30 @@ import pytest
 from prometheus.application.bootstrap import SyntheticBootstrap
 from prometheus.application.evaluator import PromptEvaluator
 from prometheus.application.evolution import EvolutionLoop
-from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory
+from prometheus.domain.entities import (
+    Candidate,
+    EvalResult,
+    Prompt,
+    SyntheticExample,
+    Trajectory,
+)
+
+
+def _make_eval(scores: list[float], label: str = "ok") -> EvalResult:
+    """Helper to build an EvalResult from a list of scores."""
+    return EvalResult(
+        scores=scores,
+        feedbacks=[label] * len(scores),
+        trajectories=[
+            Trajectory(f"input{i}", f"output{i}", s, label, "prompt")
+            for i, s in enumerate(scores)
+        ],
+    )


 class TestEvolutionLoop:
+    """Tests for the original single-candidate hill-climbing mode (population_size=1)."""
+
    @pytest.mark.asyncio
    async def test_accepts_improvement(
        self,
@@ -27,28 +47,9 @@ class TestEvolutionLoop:
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]

-        initial_eval = EvalResult(
-            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
-            feedbacks=["bad"] * 5,
-            trajectories=[
-                Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
-                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
-            ],
-        )
-        old_eval = EvalResult(
-            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
-            feedbacks=["bad"] * 5,
-            trajectories=[
-                Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
-                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
-            ],
-        )
-        new_eval = EvalResult(
-            scores=[0.8, 0.9, 0.7, 0.8, 0.9],
-            feedbacks=["good"] * 5,
-            trajectories=[],
-        )
-        evaluator.evaluate = AsyncMock(side_effect=[initial_eval, old_eval, new_eval])
+        low_eval = _make_eval([0.3, 0.4, 0.3, 0.5, 0.2], "bad")
+        high_eval = _make_eval([0.8, 0.9, 0.7, 0.8, 0.9], "good")
+        evaluator.evaluate = AsyncMock(side_effect=[low_eval, low_eval, high_eval])

        loop = EvolutionLoop(
            evaluator=evaluator,
@@ -57,7 +58,6 @@ class TestEvolutionLoop:
            max_iterations=1,
            minibatch_size=5,
        )
-        with patch.object(loop, "_log"):
        state = await loop.run(seed_prompt, synthetic_pool, task_description)

        assert state.best_candidate is not None
@@ -78,28 +78,9 @@ class TestEvolutionLoop:
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]

-        initial_eval = EvalResult(
-            scores=[0.7, 0.8, 0.7, 0.8, 0.9],
-            feedbacks=["ok"] * 5,
-            trajectories=[
-                Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
-                for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
-            ],
-        )
-        old_eval = EvalResult(
-            scores=[0.7, 0.8, 0.7, 0.8, 0.9],
-            feedbacks=["ok"] * 5,
-            trajectories=[
-                Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
-                for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
-            ],
-        )
-        new_eval = EvalResult(
-            scores=[0.2, 0.1, 0.3, 0.2, 0.1],
-            feedbacks=["bad"] * 5,
-            trajectories=[],
-        )
-        evaluator.evaluate = AsyncMock(side_effect=[initial_eval, old_eval, new_eval])
+        high_eval = _make_eval([0.7, 0.8, 0.7, 0.8, 0.9], "ok")
+        low_eval = _make_eval([0.2, 0.1, 0.3, 0.2, 0.1], "bad")
+        evaluator.evaluate = AsyncMock(side_effect=[high_eval, high_eval, low_eval])

        loop = EvolutionLoop(
            evaluator=evaluator,
@@ -108,7 +89,6 @@ class TestEvolutionLoop:
            max_iterations=1,
            minibatch_size=5,
        )
-        with patch.object(loop, "_log"):
        state = await loop.run(seed_prompt, synthetic_pool, task_description)

        assert state.best_candidate is not None
@@ -129,14 +109,7 @@ class TestEvolutionLoop:
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]

-        perfect_eval = EvalResult(
-            scores=[1.0, 1.0, 1.0, 1.0, 1.0],
-            feedbacks=["perfect"] * 5,
-            trajectories=[
-                Trajectory(f"input{i}", f"output{i}", 1.0, "perfect", "prompt")
-                for i in range(5)
-            ],
-        )
+        perfect_eval = _make_eval([1.0, 1.0, 1.0, 1.0, 1.0], "perfect")
        evaluator.evaluate = AsyncMock(return_value=perfect_eval)

        loop = EvolutionLoop(
@@ -146,7 +119,226 @@ class TestEvolutionLoop:
            max_iterations=3,
            minibatch_size=5,
        )
-        with patch.object(loop, "_log"):
        await loop.run(seed_prompt, synthetic_pool, task_description)
-
        mock_proposer_port.propose.assert_not_called()
+
+
+class TestPopulationEvolution:
+    """Tests for population-based evolution (population_size > 1)."""
+
+    @pytest.mark.asyncio
+    async def test_population_initialization(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+        mock_mutation_port: AsyncMock,
+    ) -> None:
+        """Population is initialized with the right number of candidates."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        evaluator.evaluate = AsyncMock(
+            return_value=_make_eval([0.5] * 5, "ok")
+        )
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=0,  # no iterations, just initialization
+            minibatch_size=5,
+            population_size=4,
+            mutation_port=mock_mutation_port,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        # 1 seed + 3 mutations = 4 candidates
+        assert len(state.candidates) == 4
+        assert mock_mutation_port.mutate.call_count == 3
+
+    @pytest.mark.asyncio
+    async def test_population_initialization_uses_proposer_fallback(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+    ) -> None:
+        """When no mutation_port is provided, population init falls back to proposer."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        evaluator.evaluate = AsyncMock(
+            return_value=_make_eval([0.5] * 5, "ok")
+        )
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=0,
+            minibatch_size=5,
+            population_size=3,
+            # mutation_port intentionally omitted
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        assert len(state.candidates) == 3
+        assert mock_proposer_port.propose.call_count == 2  # 3-1 = 2 init mutations
+
+    @pytest.mark.asyncio
+    async def test_population_iteration_replaces_worst(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+        mock_crossover_port: AsyncMock,
+        mock_mutation_port: AsyncMock,
+    ) -> None:
+        """Crossover child replaces worst candidate when its fitness is higher."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        # Sequence:
+        # 1. Initial eval (seed)
+        # 2. Population init: 3 mutation calls use proposer.propose(), NOT evaluator.evaluate
+        # 3. Population iteration: crossover produces child → eval child
+        # Only 2 evaluator.evaluate calls total
+        seed_eval = _make_eval([0.5] * 5, "ok")
+        # Crossover child eval - high score to beat worst
+        child_eval = _make_eval([0.9, 0.9, 0.8, 0.9, 0.8], "great")
+
+        all_evals = [seed_eval, child_eval]
+        evaluator.evaluate = AsyncMock(side_effect=all_evals)
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=1,
+            minibatch_size=5,
+            population_size=4,
+            crossover_rate=1.0,
+            crossover_port=mock_crossover_port,
+            mutation_rate=0.0,  # disable post-crossover mutation for determinism
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        accepted_events = [h for h in state.history if h.get("event") == "pop_accepted"]
+        assert len(accepted_events) >= 1
+
+    @pytest.mark.asyncio
+    async def test_population_iteration_rejects_inferior_child(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+        mock_crossover_port: AsyncMock,
+    ) -> None:
+        """Inferior child is rejected and doesn't replace any candidate."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        seed_eval = _make_eval([0.8] * 5, "ok")
+        # Crossover produces very LOW-scoring child
+        child_eval = _make_eval([0.1] * 5, "terrible")
+
+        all_evals = [seed_eval, child_eval]
+        evaluator.evaluate = AsyncMock(side_effect=all_evals)
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=1,
+            minibatch_size=5,
+            population_size=4,
+            crossover_rate=1.0,
+            crossover_port=mock_crossover_port,
+            mutation_rate=0.0,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        rejected_events = [h for h in state.history if h.get("event") == "pop_rejected"]
+        assert len(rejected_events) >= 1
+
+
+class TestDiversityScore:
+    """Tests for the diversity/similarity scoring logic."""
+
+    def test_identical_prompts_have_high_similarity(self) -> None:
+        """A prompt matching one of two members scores above 0.3 on average."""
+        identical = Prompt(text="You are a helpful assistant. Answer the question.")
+        pop_a = Candidate(prompt=identical, best_score=4.0, generation=0)
+        pop_b = Candidate(
+            prompt=Prompt(text="Completely different prompt about data analysis."),
+            best_score=3.0,
+            generation=0,
+        )
+        sim_same = EvolutionLoop._compute_diversity_score(identical, [pop_a, pop_b])
+        # Average includes similarity to the different member, so ~0.5 not 0.9+
+        assert sim_same > 0.3
+
+    def test_different_prompts_have_lower_similarity(self) -> None:
+        """Prompts that differ from part of the population score below 1.0."""
+        prompt_a = Prompt(text="You are a helpful assistant. Answer the question.")
+        prompt_b = Prompt(text="Provide detailed analysis of complex data patterns with precision.")
+        pop_a = Candidate(prompt=prompt_a, best_score=4.0, generation=0)
+        pop_b = Candidate(prompt=prompt_b, best_score=3.0, generation=0)
+        sim_a = EvolutionLoop._compute_diversity_score(prompt_a, [pop_a, pop_b])
+        sim_b = EvolutionLoop._compute_diversity_score(prompt_b, [pop_a, pop_b])
+        # Both should be < 1.0 since they're different
+        assert sim_a < 1.0
+        assert sim_b < 1.0
+
+    def test_single_member_population_returns_1(self) -> None:
+        """Single-member population always returns 1.0 (no penalty)."""
+        prompt = Prompt(text="Any prompt text here.")
+        pop = [Candidate(prompt=prompt, best_score=1.0, generation=0)]
+        sim = EvolutionLoop._compute_diversity_score(prompt, pop)
+        assert sim == 1.0
+
+    def test_empty_prompt_returns_zero(self) -> None:
+        """Empty prompt text returns 0.0 when population has >1 member."""
+        prompt = Prompt(text="")
+        pop = [
+            Candidate(prompt=Prompt(text="some text"), best_score=1.0, generation=0),
+            Candidate(prompt=Prompt(text="other text"), best_score=2.0, generation=0),
+        ]
+        sim = EvolutionLoop._compute_diversity_score(prompt, pop)
+        assert sim == 0.0
+
+
+class TestPromptDiff:
+    """Tests for the static _compute_prompt_diff helper."""
+
+    def test_identical_prompts(self) -> None:
+        result = EvolutionLoop._compute_prompt_diff("hello\nworld", "hello\nworld")
+        assert result["lines_added"] == 0
+        assert result["lines_removed"] == 0
+        assert result["chars_delta"] == 0
+
+    def test_added_lines(self) -> None:
+        result = EvolutionLoop._compute_prompt_diff("hello", "hello\nworld")
+        assert result["lines_added"] == 1
+        assert result["lines_removed"] == 0
+        assert result["chars_delta"] == 6  # "\nworld"
+
+    def test_removed_lines(self) -> None:
+        result = EvolutionLoop._compute_prompt_diff("hello\nworld", "hello")
+        assert result["lines_added"] == 0
+        assert result["lines_removed"] == 1
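
Review note: `TestPromptDiff` fixes the shape of the diff summary (`lines_added`, `lines_removed`, `chars_delta`) but not how it is computed. One plausible way to get those numbers is a line-level `difflib.SequenceMatcher` pass. The sketch below assumes that approach; `prompt_diff` is a hypothetical name, not the actual `EvolutionLoop._compute_prompt_diff`.

```python
from difflib import SequenceMatcher


def prompt_diff(old: str, new: str) -> dict[str, int]:
    """Summarize a prompt edit as added lines, removed lines, and length change."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    added = 0
    removed = 0
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" counts on both sides: old lines go away, new lines come in.
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return {
        "lines_added": added,
        "lines_removed": removed,
        "chars_delta": len(new) - len(old),
    }
```

Counting a changed line as one removal plus one addition is a design choice of this sketch; the tests above only cover pure additions and pure removals, so they do not say how the real helper treats edits within a line.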
@@ -0,0 +1,133 @@
+"""Tests for GroundTruthEvaluator — execution + similarity comparison."""
+from __future__ import annotations
+
+from unittest.mock import AsyncMock
+
+import pytest
+
+from prometheus.application.ground_truth_evaluator import GroundTruthEvaluator
+from prometheus.domain.entities import EvalResult, GroundTruthExample, Prompt
+from prometheus.domain.ports import LLMPort, SimilarityPort
+
+
+@pytest.fixture
+def mock_executor() -> AsyncMock:
+    port = AsyncMock(spec=LLMPort)
+    port.execute.return_value = "Paris is the capital of France."
+    return port
+
+
+@pytest.fixture
+def mock_similarity() -> AsyncMock:
+    port = AsyncMock(spec=SimilarityPort)
+    port.compute.return_value = 0.85
+    return port
+
+
+@pytest.fixture
+def gt_dataset() -> list[GroundTruthExample]:
+    return [
+        GroundTruthExample(input_text="What is the capital of France?", expected_output="Paris", id=0),
+        GroundTruthExample(input_text="What is 2+2?", expected_output="4", id=1),
+        GroundTruthExample(input_text="What color is the sky?", expected_output="blue", id=2),
+    ]
+
+
+@pytest.fixture
+def prompt() -> Prompt:
+    return Prompt(text="Answer the following question accurately.")
+
+
+@pytest.mark.asyncio
+class TestGroundTruthEvaluator:
+    async def test_evaluate_happy_path(self, mock_executor, mock_similarity, gt_dataset, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor,
+            similarity=mock_similarity,
+            max_concurrency=2,
+        )
+        result = await evaluator.evaluate(prompt, gt_dataset)
+
+        assert isinstance(result, EvalResult)
+        assert len(result.scores) == 3
+        assert len(result.feedbacks) == 3
+        assert len(result.trajectories) == 3
+        assert all(s == 0.85 for s in result.scores)
+        assert result.mean_score == pytest.approx(0.85)
+        assert result.total_score == pytest.approx(2.55)
+
+    async def test_executor_called_for_each_input(self, mock_executor, mock_similarity, gt_dataset, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor, similarity=mock_similarity,
+        )
+        await evaluator.evaluate(prompt, gt_dataset)
+        assert mock_executor.execute.call_count == 3
+
+    async def test_similarity_called_for_each_output(self, mock_executor, mock_similarity, gt_dataset, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor, similarity=mock_similarity,
+        )
+        await evaluator.evaluate(prompt, gt_dataset)
+        assert mock_similarity.compute.call_count == 3
+
+    async def test_execution_error_produces_zero_score(self, mock_similarity, gt_dataset, prompt):
+        failing_executor = AsyncMock(spec=LLMPort)
+        failing_executor.execute.side_effect = RuntimeError("API timeout")
+
+        evaluator = GroundTruthEvaluator(
+            executor=failing_executor, similarity=mock_similarity,
+        )
+        result = await evaluator.evaluate(prompt, gt_dataset)
+
+        assert len(result.scores) == 3
+        # The error sentinel is still scored by the similarity port, so only the score type is asserted
+        assert all(isinstance(s, float) for s in result.scores)
+        assert all("[execution error:" in t.output_text for t in result.trajectories)
+
+    async def test_empty_dataset(self, mock_executor, mock_similarity, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor, similarity=mock_similarity,
+        )
+        result = await evaluator.evaluate(prompt, [])
+        assert result.scores == []
+        assert result.mean_score == 0.0
+        assert result.total_score == 0.0
+
+    async def test_trajectory_contains_prompt_used(self, mock_executor, mock_similarity, gt_dataset, prompt):
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor, similarity=mock_similarity,
+        )
+        result = await evaluator.evaluate(prompt, gt_dataset)
+        for t in result.trajectories:
+            assert t.prompt_used == prompt.text
+
+    async def test_scores_clamped_to_unit_range(self, mock_executor, gt_dataset, prompt):
+        # Similarity returns a value > 1.0 (should be clamped)
+        over_similarity = AsyncMock(spec=SimilarityPort)
+        over_similarity.compute.return_value = 1.5
+
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor, similarity=over_similarity,
+        )
+        result = await evaluator.evaluate(prompt, gt_dataset)
+        assert all(0.0 <= s <= 1.0 for s in result.scores)
+
+    async def test_feedback_for_exact_match(self, mock_executor, gt_dataset, prompt):
+        exact_similarity = AsyncMock(spec=SimilarityPort)
+        exact_similarity.compute.return_value = 1.0
+
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor, similarity=exact_similarity,
+        )
+        result = await evaluator.evaluate(prompt, gt_dataset)
+        assert all("Exact match" in fb for fb in result.feedbacks)
+
+    async def test_feedback_for_poor_match(self, mock_executor, gt_dataset, prompt):
+        poor_similarity = AsyncMock(spec=SimilarityPort)
+        poor_similarity.compute.return_value = 0.1
+
+        evaluator = GroundTruthEvaluator(
+            executor=mock_executor, similarity=poor_similarity,
+        )
+        result = await evaluator.evaluate(prompt, gt_dataset)
+        assert all("Poor match" in fb for fb in result.feedbacks)
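
Review note: taken together, these tests describe a pipeline of bounded-concurrency execution, an error sentinel in place of failed outputs, similarity scoring clamped to [0, 1], and a textual feedback label. The sketch below shows one way that flow could look with `asyncio.Semaphore` and `asyncio.gather`. Everything here is an assumption for illustration: `score_examples` and `feedback_for` are hypothetical names, the plain callables stand in for `LLMPort.execute` and `SimilarityPort.compute`, and the 0.5 threshold and the "Partial match" label are invented (only "Exact match" and "Poor match" appear in the tests).

```python
import asyncio


def feedback_for(score: float) -> str:
    """Map a clamped score to a label; the 0.5 cutoff is an assumed value."""
    if score >= 1.0:
        return "Exact match"
    if score >= 0.5:
        return "Partial match"
    return "Poor match"


async def score_examples(execute, similarity, prompt, examples, max_concurrency=4):
    """Run each (input, expected) pair and return (output, score, feedback) tuples."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(input_text, expected):
        async with semaphore:  # at most max_concurrency pairs in flight
            try:
                output = await execute(prompt, input_text)
            except Exception as exc:
                # Keep going: record the failure as the output text.
                output = f"[execution error: {exc}]"
            raw = await similarity(output, expected)
        score = min(1.0, max(0.0, float(raw)))  # clamp to the unit range
        return output, score, feedback_for(score)

    # gather preserves input order, so results line up with examples.
    return await asyncio.gather(*(score_one(i, e) for i, e in examples))
```

A call such as `asyncio.run(score_examples(execute, similarity, "prompt text", [("question", "answer")]))` returns one tuple per example, in input order.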
@@ -0,0 +1,316 @@
+"""Unit tests for hold-out validation and early stopping."""
+from __future__ import annotations
+
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.application.evolution import EvolutionLoop
+from prometheus.domain.entities import (
+    Candidate,
+    EvalResult,
+    Prompt,
+    SyntheticExample,
+    Trajectory,
+)
+
+
+def _make_eval(mean_score: float, n: int = 5) -> EvalResult:
+    """Helper: create an EvalResult with a given mean score."""
+    scores = [mean_score] * n
+    return EvalResult(
+        scores=scores,
+        feedbacks=["feedback"] * n,
+        trajectories=[
+            Trajectory(f"input{i}", f"output{i}", mean_score, "feedback", "prompt")
+            for i in range(n)
+        ],
+    )
+
+
+class TestBootstrapSplit:
+    """Tests for SyntheticBootstrap.split_pool."""
+
+    def test_split_produces_correct_sizes(self):
+        pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(20)]
+        train, val = SyntheticBootstrap.split_pool(pool, 0.3)
+        assert len(train) + len(val) == 20
+        assert len(val) == 6  # 20 * 0.3 = 6
+        assert len(train) == 14
+
+    def test_split_zero_fraction_returns_all_train(self):
+        pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(10)]
+        train, val = SyntheticBootstrap.split_pool(pool, 0.0)
+        assert len(train) == 10
+        assert len(val) == 0
+
+    def test_split_single_element(self):
+        pool = [SyntheticExample(input_text="only", id=0)]
+        train, val = SyntheticBootstrap.split_pool(pool, 0.3)
+        assert len(train) == 1
+        assert len(val) == 0
+
+    def test_split_deterministic_with_seed(self):
+        pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(50)]
+        train1, val1 = SyntheticBootstrap.split_pool(pool, 0.3, rng=__import__("random").Random(42))
+        train2, val2 = SyntheticBootstrap.split_pool(pool, 0.3, rng=__import__("random").Random(42))
+        assert [ex.id for ex in train1] == [ex.id for ex in train2]
+        assert [ex.id for ex in val1] == [ex.id for ex in val2]
+
+    def test_split_no_overlap(self):
+        pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(30)]
+        train, val = SyntheticBootstrap.split_pool(pool, 0.3)
+        train_ids = {ex.id for ex in train}
+        val_ids = {ex.id for ex in val}
+        assert train_ids.isdisjoint(val_ids)
+        assert train_ids | val_ids == {ex.id for ex in pool}
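
Review note: the split tests imply a shuffle followed by a cut at `len(pool) * fraction`, rounded down (20 items at 0.3 gives 6, a single item gives 0). The sketch below assumes exactly that; `split_pool` here is a free function written for illustration, and the truncating rounding is inferred from the tests rather than confirmed by the implementation.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def split_pool(
    pool: Sequence[T],
    val_fraction: float,
    rng: Optional[random.Random] = None,
) -> tuple[list[T], list[T]]:
    """Shuffle a copy of the pool and split it into (train, validation)."""
    rng = rng or random.Random()
    n_val = int(len(pool) * val_fraction)  # truncates, so tiny pools keep everything in train
    shuffled = list(pool)  # copy so the caller's pool is not reordered
    rng.shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]
```

Passing a seeded `random.Random` makes the split reproducible, which is the property `test_split_deterministic_with_seed` checks.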
+
+
+class TestValidationEvaluation:
+    """Tests for hold-out evaluation during evolution."""
+
+    @pytest.mark.asyncio
+    async def test_validation_pool_evaluated_after_each_iteration(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+    ) -> None:
+        """When a validation pool is provided, the best candidate is evaluated on it."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        # Initial eval (train) + validation eval + iteration train eval + new prompt eval + validation eval
+        train_eval = _make_eval(0.5)
+        val_eval = _make_eval(0.6)
+        new_eval = _make_eval(0.7)
+        val_eval_2 = _make_eval(0.65)
+
+        evaluator.evaluate = AsyncMock(
+            side_effect=[train_eval, val_eval, train_eval, new_eval, val_eval_2]
+        )
+
+        validation_pool = synthetic_pool[-6:]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=1,
+            minibatch_size=5,
+        )
+        state = await loop.run(
+            seed_prompt, synthetic_pool, task_description,
+            validation_pool=validation_pool,
+        )
+
+        # Should have validation metrics in state
+        assert state.best_validation_score is not None
+        # History should contain validation_eval entries
+        val_events = [h for h in state.history if h["event"] == "validation_eval"]
+        assert len(val_events) >= 1
+
+    @pytest.mark.asyncio
+    async def test_no_validation_without_pool(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+    ) -> None:
+        """Without a validation pool, no validation is performed."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        train_eval = _make_eval(0.5)
+        old_eval = _make_eval(0.5)
+        new_eval = _make_eval(0.7)
+        evaluator.evaluate = AsyncMock(side_effect=[train_eval, old_eval, new_eval])
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=1,
+            minibatch_size=5,
+        )
+        state = await loop.run(seed_prompt, synthetic_pool, task_description)
+
+        assert state.best_validation_score is None
+        assert not state.early_stopped
+        val_events = [h for h in state.history if h["event"] == "validation_eval"]
+        assert len(val_events) == 0
+
+
+class TestEarlyStopping:
+    """Tests for early stopping when validation score degrades."""
+
+    @pytest.mark.asyncio
+    async def test_early_stop_triggers_on_patience_exceeded(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+    ) -> None:
+        """Early stopping triggers when validation doesn't improve for K iterations."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        patience = 3
+        # Build eval sequence:
+        # 1. Initial train eval
+        # 2. Initial validation eval (0.5)
+        # Then for each of 3 iterations:
+        #   - train eval (current best)
+        #   - train eval (new prompt - accepted)
+        #   - validation eval (degrading)
+        evals = [
+            _make_eval(0.5),   # initial train
+            _make_eval(0.5),   # initial validation
+        ]
+        for i in range(patience):
+            evals.extend([
+                _make_eval(0.5 + i * 0.1),  # current eval (train)
+                _make_eval(0.6 + i * 0.1),  # new eval (train) - accepted
+                _make_eval(0.4),            # validation eval (degrading)
+            ])
+
+        evaluator.evaluate = AsyncMock(side_effect=evals)
+
+        validation_pool = synthetic_pool[-5:]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=10,  # would go further without early stop
+            minibatch_size=5,
+            early_stop_patience=patience,
+        )
+        state = await loop.run(
+            seed_prompt, synthetic_pool, task_description,
+            validation_pool=validation_pool,
+        )
+
+        assert state.early_stopped is True
+        assert state.iteration == patience
+        assert state.best_validation_score is not None
+        # Should have an early_stop event in history
+        early_stop_events = [h for h in state.history if h["event"] == "early_stop"]
+        assert len(early_stop_events) == 1
+
+    @pytest.mark.asyncio
+    async def test_early_stop_does_not_trigger_when_improving(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+    ) -> None:
+        """When validation keeps improving, early stopping does not trigger."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        evals = [
+            _make_eval(0.3),   # initial train
+            _make_eval(0.3),   # initial validation
+        ]
+        # 3 iterations, each with improving validation
+        for i in range(3):
+            evals.extend([
+                _make_eval(0.3 + i * 0.1),  # current train eval
+                _make_eval(0.4 + i * 0.1),  # new train eval (accepted)
+                _make_eval(0.3 + (i + 1) * 0.1),  # validation eval (improving)
+            ])
+
+        evaluator.evaluate = AsyncMock(side_effect=evals)
+
+        validation_pool = synthetic_pool[-5:]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=3,
+            minibatch_size=5,
+            early_stop_patience=5,
+        )
+        state = await loop.run(
+            seed_prompt, synthetic_pool, task_description,
+            validation_pool=validation_pool,
+        )
+
+        assert state.early_stopped is False
+        assert state.iteration == 3
+        assert state.best_validation_score is not None
+
+    @pytest.mark.asyncio
+    async def test_validation_patience_resets_on_improvement(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: AsyncMock,
+        mock_judge_port: AsyncMock,
+        mock_proposer_port: AsyncMock,
+    ) -> None:
+        """Patience counter resets when validation improves after degrading."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        evals = [
+            _make_eval(0.3),   # initial train
+            _make_eval(0.3),   # initial validation
+            # iter 1: degrade
+            _make_eval(0.3),   # current train
+            _make_eval(0.5),   # new train (accepted)
+            _make_eval(0.2),   # validation degrade (patience=1)
+            # iter 2: degrade
+            _make_eval(0.5),   # current train
+            _make_eval(0.6),   # new train (accepted)
+            _make_eval(0.2),   # validation degrade (patience=2)
+            # iter 3: improve! (resets patience)
+            _make_eval(0.6),   # current train
+            _make_eval(0.7),   # new train (accepted)
+            _make_eval(0.4),   # validation improve (patience=0)
+            # iter 4: degrade again
+            _make_eval(0.7),   # current train
+            _make_eval(0.8),   # new train (accepted)
+            _make_eval(0.2),   # validation degrade (patience=1)
+        ]
+
+        evaluator.evaluate = AsyncMock(side_effect=evals)
+        validation_pool = synthetic_pool[-5:]
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=4,
+            minibatch_size=5,
+            early_stop_patience=3,
+        )
+        state = await loop.run(
+            seed_prompt, synthetic_pool, task_description,
+            validation_pool=validation_pool,
+        )
+
+        assert state.early_stopped is False
+        assert state.iteration == 4
@@ -0,0 +1,189 @@
+"""Unit tests for structured logging configuration."""
+from __future__ import annotations
+
+import json
+import logging
+from pathlib import Path
+
+from prometheus.cli.logging_setup import configure_logging, get_logger
+
+
+class TestConfigureLogging:
+    def _count_handlers(self, name: str = "prometheus") -> int:
+        return len(logging.getLogger(name).handlers)
+
+    def test_default_creates_console_handler(self) -> None:
+        configure_logging(level=logging.INFO)
+        prom = logging.getLogger("prometheus")
+        assert len(prom.handlers) == 1
+        assert isinstance(prom.handlers[0], logging.StreamHandler)
+        prom.handlers.clear()
+
+    def test_json_format_produces_valid_json(self, capsys) -> None:
+        configure_logging(level=logging.INFO, log_format="json")
+        logger = get_logger("test_json")
+        logger.info("hello", extra={"structured": {"key": "value"}})
+
+        captured = capsys.readouterr()
+        # Output goes to stderr
+        line = captured.err.strip().split("\n")[-1]
+        data = json.loads(line)
+        assert data["message"] == "hello"
+        assert data["structured"]["key"] == "value"
+        assert data["level"] == "INFO"
+        assert "timestamp" in data
+
+        logging.getLogger("prometheus").handlers.clear()
+
+    def test_text_format_includes_structured_extras(self, capsys) -> None:
+        configure_logging(level=logging.INFO, log_format="text")
+        logger = get_logger("test_text")
+        logger.info("msg", extra={"structured": {"foo": "bar"}})
+
+        captured = capsys.readouterr()
+        assert "foo=bar" in captured.err
+
+        logging.getLogger("prometheus").handlers.clear()
+
+    def test_debug_level_shows_debug_messages(self, capsys) -> None:
+        configure_logging(level=logging.DEBUG)
+        logger = get_logger("test_debug")
+        logger.debug("debug msg")
+
+        captured = capsys.readouterr()
+        assert "debug msg" in captured.err
+
+        logging.getLogger("prometheus").handlers.clear()
+
+    def test_warning_level_hides_debug_messages(self, capsys) -> None:
+        configure_logging(level=logging.WARNING)
+        logger = get_logger("test_warn")
+        logger.debug("should not appear")
+        logger.info("also hidden")
+
+        captured = capsys.readouterr()
+        assert "should not appear" not in captured.err
+        assert "also hidden" not in captured.err
+
+        logging.getLogger("prometheus").handlers.clear()
+
+    def test_file_handler_writes_to_file(self, tmp_path: Path) -> None:
+        log_file = tmp_path / "test.log"
+        configure_logging(level=logging.INFO, log_file=str(log_file))
+        logger = get_logger("test_file")
+        logger.info("file message")
+
+        prom = logging.getLogger("prometheus")
+        # Flush handlers
+        for h in prom.handlers:
+            h.flush()
+        prom.handlers.clear()
+
+        content = log_file.read_text()
+        assert "file message" in content
+
+    def test_json_file_output(self, tmp_path: Path) -> None:
+        log_file = tmp_path / "test.json.log"
+        configure_logging(level=logging.INFO, log_format="json", log_file=str(log_file))
+        logger = get_logger("test_json_file")
+        logger.info("json file msg", extra={"structured": {"x": 1}})
+
+        prom = logging.getLogger("prometheus")
+        for h in prom.handlers:
+            h.flush()
+        prom.handlers.clear()
+
+        content = log_file.read_text().strip()
+        data = json.loads(content)
+        assert data["message"] == "json file msg"
+        assert data["structured"]["x"] == 1
+
+    def test_reconfigure_clears_old_handlers(self) -> None:
+        configure_logging(level=logging.INFO)
+        configure_logging(level=logging.DEBUG)
+        prom = logging.getLogger("prometheus")
+        assert len(prom.handlers) == 1
+        prom.handlers.clear()
+
+    def test_propagate_false_prevents_duplicate_output(self) -> None:
+        configure_logging(level=logging.INFO)
+        prom = logging.getLogger("prometheus")
+        assert prom.propagate is False
+        prom.handlers.clear()
+
+
+class TestGetLogger:
+    def test_returns_child_of_prometheus(self) -> None:
+        logger = get_logger("mymodule")
+        assert logger.name == "prometheus.mymodule"
+
+    def test_inherits_level_from_parent(self) -> None:
+        configure_logging(level=logging.DEBUG)
+        logger = get_logger("child")
+        assert logger.getEffectiveLevel() == logging.DEBUG
+        logging.getLogger("prometheus").handlers.clear()
+
+
+class TestJsonFormatter:
+    def test_exception_included(self, capsys) -> None:
+        configure_logging(level=logging.ERROR, log_format="json")
+        logger = get_logger("test_exc")
+        try:
+            raise ValueError("boom")
+        except ValueError:
+            logger.error("failed", exc_info=True)
+
+        captured = capsys.readouterr()
+        line = captured.err.strip().split("\n")[-1]
+        data = json.loads(line)
+        assert "ValueError: boom" in data["exception"]
+
+        logging.getLogger("prometheus").handlers.clear()
+
+
+class TestLoggingCLIIntegration:
+    """Tests for CLI flags that configure logging."""
+
+    def test_verbose_flag_enables_info(self) -> None:
+        """Simulate what -v does — configure_logging at INFO level."""
+        configure_logging(level=logging.INFO)
+        logger = get_logger("evolution")
+        logger.info("test message")
+
+        prom = logging.getLogger("prometheus")
+        assert len(prom.handlers) == 1
+        prom.handlers.clear()
+
+    def test_debug_flag_enables_debug(self) -> None:
+        """Simulate what --debug does — configure_logging at DEBUG level."""
+        configure_logging(level=logging.DEBUG)
+        logger = get_logger("evolution")
+        logger.debug("debug message")
+
+        prom = logging.getLogger("prometheus")
+        assert prom.level == logging.DEBUG
+        prom.handlers.clear()
+
+    def test_log_format_invalid_rejected(self) -> None:
+        """Invalid log_format should be caught by OptimizationConfig validator."""
+        import pytest
+        from pydantic import ValidationError
+
+        from prometheus.application.dto import OptimizationConfig
+
+        with pytest.raises(ValidationError, match="log_format must be one of"):
+            OptimizationConfig(
+                seed_prompt="a",
+                task_description="b",
+                log_format="xml",
+            )
+
+    def test_log_format_text_and_json_accepted(self) -> None:
+        """Both text and json log_format values should be valid."""
+        from prometheus.application.dto import OptimizationConfig
+
+        for fmt in ("text", "json"):
+            config = OptimizationConfig(
+                seed_prompt="a", task_description="b", log_format=fmt,
+            )
+            assert config.log_format == fmt
@@ -0,0 +1,96 @@
+"""Additional unit tests for scoring edge cases."""
+from __future__ import annotations
+
+import pytest
+
+from prometheus.domain.entities import EvalResult, Trajectory
+from prometheus.domain.scoring import normalize_score, should_accept
+
+
+def _make_eval(scores: list[float]) -> EvalResult:
+    return EvalResult(
+        scores=scores,
+        feedbacks=[""] * len(scores),
+        trajectories=[
+            Trajectory(f"in{i}", f"out{i}", s, "", "p")
+            for i, s in enumerate(scores)
+        ],
+    )
+
+
+class TestShouldAcceptEdgeCases:
+    """Extended edge-case tests for should_accept."""
+
+    def test_tiny_improvement_accepted(self) -> None:
+        old = _make_eval([0.5])
+        new = _make_eval([0.5001])
+        assert should_accept(old, new) is True
+
+    def test_tiny_improvement_below_threshold(self) -> None:
+        old = _make_eval([0.5])
+        new = _make_eval([0.5001])
+        assert should_accept(old, new, min_improvement=0.01) is False
+
+    def test_zero_scores_equal(self) -> None:
+        old = _make_eval([0.0, 0.0])
+        new = _make_eval([0.0, 0.0])
+        assert should_accept(old, new) is False
+
+    def test_negative_to_zero_accepted(self) -> None:
+        """Scores should be in [0, 1], but should_accept still compares out-of-range values."""
+        old = _make_eval([-0.1])
+        new = _make_eval([0.0])
+        assert should_accept(old, new) is True
+
+    def test_large_improvement(self) -> None:
+        old = _make_eval([0.0, 0.0, 0.0])
+        new = _make_eval([1.0, 1.0, 1.0])
+        assert should_accept(old, new) is True
+
+    def test_single_score_improvement(self) -> None:
+        old = _make_eval([0.4])
+        new = _make_eval([0.5])
+        assert should_accept(old, new) is True
+
+    def test_min_improvement_exactly_met(self) -> None:
+        """When improvement exactly equals min_improvement, still rejected (strict >)."""
+        old = _make_eval([0.5])
+        new = _make_eval([0.75])  # 0.75 - 0.5 == 0.25 exactly in binary floating point
+        assert should_accept(old, new, min_improvement=0.25) is False
+
+    def test_min_improvement_just_over(self) -> None:
+        old = _make_eval([0.5])
+        new = _make_eval([0.7001])
+        assert should_accept(old, new, min_improvement=0.2) is True
+
+
+class TestNormalizeScoreEdgeCases:
+    """Extended edge-case tests for normalize_score."""
+
+    def test_exact_bounds(self) -> None:
+        assert normalize_score(0.0) == 0.0
+        assert normalize_score(1.0) == 1.0
+
+    def test_very_large_value(self) -> None:
+        assert normalize_score(1e10) == 1.0
+
+    def test_very_negative_value(self) -> None:
+        assert normalize_score(-1e10) == 0.0
+
+    def test_custom_bounds_at_edges(self) -> None:
+        assert normalize_score(5.0, min_val=0.0, max_val=10.0) == 5.0
+        assert normalize_score(0.0, min_val=0.0, max_val=10.0) == 0.0
+        assert normalize_score(10.0, min_val=0.0, max_val=10.0) == 10.0
+
+    def test_negative_custom_range(self) -> None:
+        assert normalize_score(0.0, min_val=-5.0, max_val=5.0) == 0.0
+        assert normalize_score(-3.0, min_val=-5.0, max_val=5.0) == -3.0
+        assert normalize_score(-10.0, min_val=-5.0, max_val=5.0) == -5.0
+
+    def test_zero_span_range(self) -> None:
+        """When min == max, clamps to min."""
+        assert normalize_score(5.0, min_val=5.0, max_val=5.0) == 5.0
+        assert normalize_score(0.0, min_val=5.0, max_val=5.0) == 5.0
+
+    def test_fractional_score(self) -> None:
+        assert normalize_score(0.3333) == pytest.approx(0.3333)
@@ -0,0 +1,133 @@
+"""Tests for similarity adapters — exact, BLEU, ROUGE-L, cosine."""
+from __future__ import annotations
+
+import pytest
+
+from prometheus.infrastructure.similarity import (
+    BleuSimilarity,
+    CosineSimilarity,
+    ExactMatchSimilarity,
+    RougeLSimilarity,
+    create_similarity_adapter,
+)
+
+
+class TestExactMatchSimilarity:
+    def test_exact_match(self):
+        s = ExactMatchSimilarity()
+        assert s.compute("Hello World", "Hello World") == 1.0
+
+    def test_case_insensitive(self):
+        s = ExactMatchSimilarity()
+        assert s.compute("hello world", "HELLO WORLD") == 1.0
+
+    def test_whitespace_trimmed(self):
+        s = ExactMatchSimilarity()
+        assert s.compute("  hello  ", "hello") == 1.0
+
+    def test_no_match(self):
+        s = ExactMatchSimilarity()
+        assert s.compute("hello", "world") == 0.0
+
+    def test_partial_no_match(self):
+        s = ExactMatchSimilarity()
+        assert s.compute("hello world", "hello") == 0.0
+
+
+class TestBleuSimilarity:
+    def test_perfect_match(self):
+        s = BleuSimilarity()
+        assert s.compute("the cat sat on the mat", "the cat sat on the mat") == 1.0
+
+    def test_no_overlap(self):
+        s = BleuSimilarity()
+        assert s.compute("aaa bbb ccc", "ddd eee fff") == 0.0
+
+    def test_partial_overlap(self):
+        s = BleuSimilarity()
+        score = s.compute("the cat sat", "the cat")
+        assert 0.0 < score < 1.0
+
+    def test_empty_prediction(self):
+        s = BleuSimilarity()
+        assert s.compute("", "hello world") == 0.0
+
+    def test_empty_expected(self):
+        s = BleuSimilarity()
+        assert s.compute("hello world", "") == 0.0
+
+    def test_both_empty(self):
+        s = BleuSimilarity()
+        assert s.compute("", "") == 0.0
+
+    def test_shorter_prediction_gets_brevity_penalty(self):
+        s = BleuSimilarity()
+        short = s.compute("cat", "the cat sat on the mat")
+        full = s.compute("the cat sat on the mat", "the cat sat on the mat")
+        assert short < full
+
+
+class TestRougeLSimilarity:
+    def test_perfect_match(self):
+        s = RougeLSimilarity()
+        assert s.compute("the cat sat", "the cat sat") == 1.0
+
+    def test_no_overlap(self):
+        s = RougeLSimilarity()
+        assert s.compute("aaa bbb", "ccc ddd") == 0.0
+
+    def test_partial_overlap(self):
+        s = RougeLSimilarity()
+        score = s.compute("the cat sat on the mat", "the cat on the rug")
+        assert 0.0 < score < 1.0
+
+    def test_empty_prediction(self):
+        s = RougeLSimilarity()
+        assert s.compute("", "hello") == 0.0
+
+    def test_subsequence(self):
+        s = RougeLSimilarity()
+        # "cat mat" is a subsequence of "the cat sat on the mat"
+        score = s.compute("cat mat", "the cat sat on the mat")
+        assert score > 0.0
+
+
+class TestCosineSimilarity:
+    def test_identical_texts(self):
+        s = CosineSimilarity()
+        assert s.compute("hello world", "hello world") == pytest.approx(1.0)
+
+    def test_no_overlap(self):
+        s = CosineSimilarity()
+        assert s.compute("aaa bbb", "ccc ddd") == 0.0
+
+    def test_partial_overlap(self):
+        s = CosineSimilarity()
+        score = s.compute("hello world foo", "hello world bar")
+        assert 0.0 < score < 1.0
+
+    def test_empty_prediction(self):
+        s = CosineSimilarity()
+        assert s.compute("", "hello") == 0.0
+
+
+class TestCreateSimilarityAdapter:
+    def test_create_exact(self):
+        adapter = create_similarity_adapter("exact")
+        assert isinstance(adapter, ExactMatchSimilarity)
+
+    def test_create_bleu(self):
+        adapter = create_similarity_adapter("bleu")
+        assert isinstance(adapter, BleuSimilarity)
+
+    def test_create_rouge_l(self):
+        adapter = create_similarity_adapter("rouge_l")
+        assert isinstance(adapter, RougeLSimilarity)
+
+    def test_create_cosine(self):
+        adapter = create_similarity_adapter("cosine")
+        assert isinstance(adapter, CosineSimilarity)
+
+    def test_unknown_metric_raises(self):
+        with pytest.raises(ValueError, match="Unknown eval metric"):
+            create_similarity_adapter("nonexistent")
@@ -0,0 +1,233 @@
+"""Unit tests for OptimizePromptUseCase — direct orchestration tests."""
+from __future__ import annotations
+
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.dto import OptimizationConfig, OptimizationResult
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.application.evolution import EvolutionLoop
+from prometheus.application.use_cases import OptimizePromptUseCase
+from prometheus.domain.entities import (
+    Candidate,
+    EvalResult,
+    OptimizationState,
+    Prompt,
+    SyntheticExample,
+    Trajectory,
+)
+
+
+def _make_eval(scores: list[float]) -> EvalResult:
+    return EvalResult(
+        scores=scores,
+        feedbacks=["feedback"] * len(scores),
+        trajectories=[
+            Trajectory(f"in{i}", f"out{i}", s, "feedback", "prompt")
+            for i, s in enumerate(scores)
+        ],
+    )
+
+
+def _make_state(
+    iterations: int = 3,
+    initial_score: float = 0.3,
+    final_score: float = 0.8,
+    accepted: bool = True,
+) -> OptimizationState:
+    seed = Candidate(prompt=Prompt(text="seed"), best_score=initial_score, generation=0)
+    best = Candidate(
+        prompt=Prompt(text="optimized" if accepted else "seed"),
+        best_score=final_score,
+        generation=iterations if accepted else 0,
+    )
+    history = []
+    for i in range(1, iterations + 1):
+        event = "accepted" if accepted else "rejected"
+        history.append({"iteration": i, "event": event, "old_score": 0.3, "new_score": 0.8})
+
+    return OptimizationState(
+        iteration=iterations,
+        best_candidate=best,
+        candidates=[seed, best] if accepted else [seed],
+        total_llm_calls=iterations * 11 + 10,
+        history=history,
+    )
+
+
+class TestOptimizePromptUseCaseExecute:
+    """Tests for the execute() orchestration method."""
+
+    @pytest.fixture
+    def mock_evaluator(self) -> MagicMock:
+        return MagicMock(spec=PromptEvaluator)
+
+    @pytest.fixture
+    def mock_proposer(self) -> MagicMock:
+        return MagicMock()
+
+    @pytest.fixture
+    def mock_bootstrap(self) -> MagicMock:
+        return MagicMock(spec=SyntheticBootstrap)
+
+    @pytest.fixture
+    def use_case(
+        self,
+        mock_evaluator: MagicMock,
+        mock_proposer: MagicMock,
+        mock_bootstrap: MagicMock,
+    ) -> OptimizePromptUseCase:
+        return OptimizePromptUseCase(
+            evaluator=mock_evaluator,
+            proposer=mock_proposer,
+            bootstrap=mock_bootstrap,
+        )
+
+    @pytest.fixture
+    def config(self) -> OptimizationConfig:
+        return OptimizationConfig(
+            seed_prompt="Answer the question.",
+            task_description="Q&A task",
+            max_iterations=5,
+            n_synthetic_inputs=20,
+            minibatch_size=5,
+            seed=42,
+        )
+
+    @pytest.mark.asyncio
+    async def test_returns_optimization_result(
+        self,
+        use_case: OptimizePromptUseCase,
+        mock_bootstrap: MagicMock,
+        config: OptimizationConfig,
+    ) -> None:
+        mock_bootstrap.run.return_value = [
+            SyntheticExample(input_text=f"q{i}", id=i) for i in range(20)
+        ]
+
+        mock_state = _make_state(iterations=3, initial_score=0.3, final_score=0.9)
+        with patch.object(EvolutionLoop, "run", return_value=mock_state):
+            result = await use_case.execute(config)
+
+        assert isinstance(result, OptimizationResult)
+        assert result.initial_prompt == "Answer the question."
+        assert result.final_score == 0.9
+        assert result.improvement == pytest.approx(0.6)
+
+    @pytest.mark.asyncio
+    async def test_bootstrap_called_with_config_params(
+        self,
+        use_case: OptimizePromptUseCase,
+        mock_bootstrap: MagicMock,
+        config: OptimizationConfig,
+    ) -> None:
+        mock_bootstrap.run.return_value = []
+        mock_state = _make_state()
+        with patch.object(EvolutionLoop, "run", return_value=mock_state):
+            await use_case.execute(config)
+
+        mock_bootstrap.run.assert_called_once_with(
+            task_description="Q&A task",
+            n_examples=20,
+        )
+
+    @pytest.mark.asyncio
+    async def test_evolution_loop_configured_from_config(
+        self,
+        use_case: OptimizePromptUseCase,
+        mock_bootstrap: MagicMock,
+        config: OptimizationConfig,
+    ) -> None:
+        mock_bootstrap.run.return_value = []
+        mock_state = _make_state()
+
+        with patch.object(EvolutionLoop, "run", return_value=mock_state) as mock_run:
+            await use_case.execute(config)
+
+            # Verify run() received the seed prompt, synthetic pool, and task description
+            mock_run.assert_called_once()
+            call_args = mock_run.call_args
+            seed_prompt = call_args[0][0]
+            assert seed_prompt.text == "Answer the question."
+            synthetic_pool = call_args[0][1]
+            assert len(synthetic_pool) == 0  # bootstrap returned empty
+            assert call_args[0][2] == "Q&A task"
+
+    @pytest.mark.asyncio
+    async def test_total_llm_calls_includes_bootstrap_call(
+        self,
+        use_case: OptimizePromptUseCase,
+        mock_bootstrap: MagicMock,
+        config: OptimizationConfig,
+    ) -> None:
+        mock_bootstrap.run.return_value = []
+        mock_state = _make_state(iterations=3)
+        # total_llm_calls from state + 1 for bootstrap
+        expected = mock_state.total_llm_calls + 1
+
+        with patch.object(EvolutionLoop, "run", return_value=mock_state):
+            result = await use_case.execute(config)
+
+        assert result.total_llm_calls == expected
+
+    @pytest.mark.asyncio
+    async def test_no_candidates_fallback(
+        self,
+        use_case: OptimizePromptUseCase,
+        mock_bootstrap: MagicMock,
+        config: OptimizationConfig,
+    ) -> None:
+        mock_bootstrap.run.return_value = [
+            SyntheticExample(input_text=f"q{i}", id=i) for i in range(20)
+        ]
+        mock_state = OptimizationState(
+            iteration=0,
+            best_candidate=None,
+            candidates=[],
+            total_llm_calls=0,
+        )
+
+        with patch.object(EvolutionLoop, "run", return_value=mock_state):
+            result = await use_case.execute(config)
+
+        assert result.optimized_prompt == "Answer the question."
+        assert result.initial_score == 0.0
+        assert result.final_score == 0.0
+        assert result.improvement == 0.0
+
+    @pytest.mark.asyncio
+    async def test_iterations_used_matches_state(
+        self,
+        use_case: OptimizePromptUseCase,
+        mock_bootstrap: MagicMock,
+        config: OptimizationConfig,
+    ) -> None:
+        mock_bootstrap.run.return_value = []
+        mock_state = _make_state(iterations=7)
+
+        with patch.object(EvolutionLoop, "run", return_value=mock_state):
+            result = await use_case.execute(config)
+
+        assert result.iterations_used == 7
+
+    @pytest.mark.asyncio
+    async def test_history_passed_through(
+        self,
+        use_case: OptimizePromptUseCase,
+        mock_bootstrap: MagicMock,
+        config: OptimizationConfig,
+    ) -> None:
+        mock_bootstrap.run.return_value = []
+        history = [
+            {"iteration": 1, "event": "accepted"},
+            {"iteration": 2, "event": "rejected"},
+        ]
+        mock_state = _make_state()
+        mock_state.history = history
+
+        with patch.object(EvolutionLoop, "run", return_value=mock_state):
+            result = await use_case.execute(config)
+
+        assert result.history == history