Initial commit: PROMETHEUS v0.1.0 - Prompt optimizer

- Clean architecture (domain/application/infrastructure)
- DSPy-based evolution engine with scoring
- CLI via pyproject.toml entry point
- Unit + integration tests (~300 tests)
- Configs for glm-5.1 and glm-4.5-air models
- Z.AI endpoint integration
2026-03-29 11:44:03 +00:00
commit 837a44970f
49 changed files with 6599 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,42 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+dist/
+build/
+*.egg
+
+# Virtual environments
+.venv/
+venv/
+env/
+
+# Testing / coverage
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
+.coverage
+coverage.json
+htmlcov/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+
+# Environment / secrets
+.env
+.env_runtime
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Output artifacts (transient)
+result_*.yaml
+test_result.yaml
+zai_result.yaml
+prompt_optimized_*.md
+TEST_REPORT.*
--- a/README.md
+++ b/README.md
@@ -0,0 +1,27 @@
+# Prometheus
+
+Prompt evolution without reference data.
+
+## Quick Start
+
+```bash
+uv sync
+uv run prometheus optimize -i examples/sample_config.yaml -o result.yaml -v
+```
+
+## Architecture
+
+Clean hexagonal architecture with four layers:
+
+- **Domain** — entities, ports, scoring (zero external dependencies)
+- **Application** — use cases, bootstrap, evaluator, evolution loop
+- **Infrastructure** — DSPy signatures, modules, adapters, file I/O
+- **CLI** — Typer app with `optimize` command
+
+## Testing
+
+```bash
+uv run --extra dev pytest
+uv run --extra dev ruff check .
+uv run --extra dev mypy src/
+```
--- a/config_glm45air.yaml
+++ b/config_glm45air.yaml
@@ -0,0 +1,32 @@
+# PROMETHEUS — DAG Planner Optimization (glm-4.5-air)
+# =======================================================
+
+seed_prompt: |
+  A new developer is getting up to speed on the project. Here is their request:
+  I would like to modify the bridge pension calculation mechanism to handle different retirement ages depending on the beneficiary's locality.
+  You are part of an agentic system whose overall goal is to provide them with:
+  - Complete architecture diagrams of the ecosystem of the components involved.
+  - Highly specific flow diagrams for the components involved.
+  - Flow diagrams relating all user actions that have an impact on this request.
+  - Forward-looking diagrams of each of these types, highlighting the modifications the developer should make.
+  Your role is critical and sits at the start of the chain. You must create a DAG-like structure (Directed Acyclic Graph) in markdown that lists, in this context, all the operations to be carried out by the rest of the system. Your output will be passed to an orchestrator, which will use it to plan and delegate the specific analysis requests for the various components needed to reach the overall goal. The DAG must therefore contain, as needed, nodes for component analyses, specific flows, and consolidation tasks for creating the diagrams. Each node of your DAG will be handled by a competent agent and must contain the following information:
+  - Complete description of the task to accomplish
+  - Output format (always markdown, but clearly define the expected content)
+  - Dependencies on other nodes, so that the orchestrator can parallelize work and identify problems
+  To make sure you create a coherent plan, use the tools available to you to carry out the preliminary analysis that will determine the needs.
+
+task_description: |
+  Task decomposition planner for a multi-agent software analysis system. The assistant operates as the first stage in an agentic pipeline. Given a developer's feature request involving a domain-specific codebase (e.g., insurance, actuarial calculations, pension bridge calculations), it must: (1) Analyze the request to identify all impacted system components and data flows, (2) Produce a structured Directed Acyclic Graph (DAG) in markdown format, (3) Each DAG node represents a subtask for a specialized downstream agent, containing: (a) complete task description with enough context for an autonomous agent to execute, (b) expected output format specification (markdown with defined structure), (c) dependencies on other nodes for orchestration scheduling, (4) The DAG must enable an orchestrator to: parallelize independent analyses, identify blocking dependencies, and distribute work to specialized diagram-generation agents. The pipeline's ultimate deliverables are: ecosystem architecture diagrams, component-specific flow diagrams, user-action impact maps, and forward-looking modification plans. Quality criteria: DAG completeness (all necessary analyses covered), acyclicity correctness, task description specificity and actionability, dependency accuracy reflecting real component relationships, and parallelization opportunities correctly identified.
+
+task_model: "openai/glm-4.5-air"
+judge_model: "openai/glm-4.5-air"
+proposer_model: "openai/glm-4.5-air"
+synth_model: "openai/glm-4.5-air"
+
+api_base: "https://api.z.ai/api/coding/paas/v4"
+api_key_env: "GLM_API_KEY"
+
+max_iterations: 8
+n_synthetic_inputs: 5
+minibatch_size: 3
+seed: 42
--- a/config_glm51.yaml
+++ b/config_glm51.yaml
@@ -0,0 +1,32 @@
+# PROMETHEUS — DAG Planner Optimization (glm-5.1)
+# ===================================================
+
+seed_prompt: |
+  A new developer is getting up to speed on the project. Here is their request:
+  I would like to modify the bridge pension calculation mechanism to handle different retirement ages depending on the beneficiary's locality.
+  You are part of an agentic system whose overall goal is to provide them with:
+  - Complete architecture diagrams of the ecosystem of the components involved.
+  - Highly specific flow diagrams for the components involved.
+  - Flow diagrams relating all user actions that have an impact on this request.
+  - Forward-looking diagrams of each of these types, highlighting the modifications the developer should make.
+  Your role is critical and sits at the start of the chain. You must create a DAG-like structure (Directed Acyclic Graph) in markdown that lists, in this context, all the operations to be carried out by the rest of the system. Your output will be passed to an orchestrator, which will use it to plan and delegate the specific analysis requests for the various components needed to reach the overall goal. The DAG must therefore contain, as needed, nodes for component analyses, specific flows, and consolidation tasks for creating the diagrams. Each node of your DAG will be handled by a competent agent and must contain the following information:
+  - Complete description of the task to accomplish
+  - Output format (always markdown, but clearly define the expected content)
+  - Dependencies on other nodes, so that the orchestrator can parallelize work and identify problems
+  To make sure you create a coherent plan, use the tools available to you to carry out the preliminary analysis that will determine the needs.
+
+task_description: |
+  Task decomposition planner for a multi-agent software analysis system. The assistant operates as the first stage in an agentic pipeline. Given a developer's feature request involving a domain-specific codebase (e.g., insurance, actuarial calculations, pension bridge calculations), it must: (1) Analyze the request to identify all impacted system components and data flows, (2) Produce a structured Directed Acyclic Graph (DAG) in markdown format, (3) Each DAG node represents a subtask for a specialized downstream agent, containing: (a) complete task description with enough context for an autonomous agent to execute, (b) expected output format specification (markdown with defined structure), (c) dependencies on other nodes for orchestration scheduling, (4) The DAG must enable an orchestrator to: parallelize independent analyses, identify blocking dependencies, and distribute work to specialized diagram-generation agents. The pipeline's ultimate deliverables are: ecosystem architecture diagrams, component-specific flow diagrams, user-action impact maps, and forward-looking modification plans. Quality criteria: DAG completeness (all necessary analyses covered), acyclicity correctness, task description specificity and actionability, dependency accuracy reflecting real component relationships, and parallelization opportunities correctly identified.
+
+task_model: "openai/glm-5.1"
+judge_model: "openai/glm-5.1"
+proposer_model: "openai/glm-5.1"
+synth_model: "openai/glm-5.1"
+
+api_base: "https://api.z.ai/api/coding/paas/v4"
+api_key_env: "GLM_API_KEY"
+
+max_iterations: 3
+n_synthetic_inputs: 3
+minibatch_size: 2
+seed: 42
--- a/docs/technical-spec.md
+++ b/docs/technical-spec.md
--- a/examples/sample_config.yaml
+++ b/examples/sample_config.yaml
@@ -0,0 +1,27 @@
+# PROMETHEUS Configuration File
+# ==================================
+
+# The initial prompt to optimize
+seed_prompt: |
+  You are an expert assistant in contract analysis.
+  Analyze the provided text and identify potentially abusive clauses.
+  Be precise and cite the relevant passages.
+
+# Task description (used to generate synthetic inputs)
+task_description: |
+  Legal analysis of contracts to identify abusive clauses.
+  The assistant must examine a contract text and flag
+  any clause that could be considered abusive under
+  French consumer protection law.
+
+# LLM models (DSPy/litellm format)
+task_model: "openai/gpt-4o-mini"
+judge_model: "openai/gpt-4o"
+proposer_model: "openai/gpt-4o"
+synth_model: "openai/gpt-4o"
+
+# Evolution parameters
+max_iterations: 30
+n_synthetic_inputs: 20
+minibatch_size: 5
+seed: 42
--- a/examples/test_real_config.yaml
+++ b/examples/test_real_config.yaml
@@ -0,0 +1,20 @@
+# PROMETHEUS Test Config — Code Review Task
+# ===========================================
+
+seed_prompt: |
+  You are a helpful coding assistant. Help the user write code.
+
+task_description: |
+  Code review and bug detection assistant. Reviews code snippets and
+  identifies bugs, security issues, and style problems. The assistant
+  receives a code snippet and must produce a structured review.
+
+task_model: "openai/glm-4.5-air"
+judge_model: "openai/glm-4.5-air"
+proposer_model: "openai/glm-4.5-air"
+synth_model: "openai/glm-4.5-air"
+
+max_iterations: 5
+n_synthetic_inputs: 8
+minibatch_size: 3
+seed: 123
--- a/examples/test_run_config.yaml
+++ b/examples/test_run_config.yaml
@@ -0,0 +1,26 @@
+# PROMETHEUS Test Config — Real run with Z.AI
+# =============================================
+
+seed_prompt: |
+  You are a helpful coding assistant. Help the user write clean, bug-free code.
+  When reviewing code, identify bugs and suggest improvements.
+
+task_description: |
+  Code review and bug detection assistant. Reviews code snippets and
+  identifies bugs, security issues, and style problems. The assistant
+  receives a code snippet and must produce a structured review.
+
+task_model: "openai/glm-4.5-air"
+judge_model: "openai/glm-4.5-air"
+proposer_model: "openai/glm-4.5-air"
+synth_model: "openai/glm-4.5-air"
+
+# API configuration for z.ai
+api_base: "https://api.z.ai/api/paas/v4"
+api_key_env: "GLM_API_KEY"
+
+# Evolution parameters (reduced for quick test)
+max_iterations: 3
+n_synthetic_inputs: 3
+minibatch_size: 2
+seed: 42
--- a/examples/zai_config.yaml
+++ b/examples/zai_config.yaml
@@ -0,0 +1,34 @@
+# PROMETHEUS Configuration File — z.ai Backend
+# ==================================
+# REQUIRES env vars:
+#   export OPENAI_API_KEY=<your_glm_key>
+#   (api_base is configured below)
+
+# The initial prompt to optimize
+seed_prompt: |
+  You are an expert assistant in contract analysis.
+  Analyze the provided text and identify potentially abusive clauses.
+  Be precise and cite the relevant passages.
+
+# Task description (used to generate synthetic inputs)
+task_description: |
+  Legal analysis of contracts to identify abusive clauses.
+  The assistant must examine a contract text and flag
+  any clause that could be considered abusive under
+  French consumer protection law.
+
+# LLM models (DSPy/litellm format with openai/ prefix for z.ai)
+task_model: "openai/glm-4.5-air"
+judge_model: "openai/glm-4.5-air"
+proposer_model: "openai/glm-4.5-air"
+synth_model: "openai/glm-4.5-air"
+
+# API configuration for z.ai
+api_base: "https://api.z.ai/api/paas/v4"
+api_key_env: "OPENAI_API_KEY"
+
+# Evolution parameters (reduced for functional testing)
+max_iterations: 3
+n_synthetic_inputs: 5
+minibatch_size: 3
+seed: 42
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,47 @@
+[project]
+name = "prometheus"
+version = "0.1.0"
+description = "Prompt evolution without reference data"
+readme = "README.md"
+requires-python = ">=3.12"
+dependencies = [
+    "dspy>=2.6,<3.0",
+    "typer>=0.15,<0.20",
+    "pydantic>=2.10",
+    "pydantic-settings>=2.7",
+    "pyyaml>=6.0",
+    "rich>=13.9",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.3",
+    "pytest-cov>=6.0",
+    "ruff>=0.9",
+    "mypy>=1.14",
+    "types-pyyaml>=6.0.12.20250915",
+]
+
+[project.scripts]
+prometheus = "prometheus.cli.app:app"
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.ruff]
+line-length = 100
+target-version = "py312"
+
+[tool.mypy]
+python_version = "3.12"
+strict = true
+
+[[tool.mypy.overrides]]
+module = ["dspy", "dspy.*"]
+ignore_missing_imports = true
+
+[[tool.mypy.overrides]]
+module = ["prometheus.infrastructure.*", "prometheus.cli.app"]
+disable_error_code = ["misc", "import-untyped"]
+
--- a/run_glm45air.sh
+++ b/run_glm45air.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+set -a
+source /home/debian/workspace/prometheus/.env_runtime
+set +a
+cd /home/debian/workspace/prometheus
+uv run prometheus optimize -i config_glm45air.yaml -o result_glm45air.yaml -v
--- a/run_glm51.sh
+++ b/run_glm51.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+set -a
+source /home/debian/workspace/prometheus/.env_runtime
+set +a
+cd /home/debian/workspace/prometheus
+uv run prometheus optimize -i config_glm51.yaml -o result_glm51.yaml -v
--- a/src/prometheus/__init__.py
+++ b/src/prometheus/__init__.py
@@ -0,0 +1,3 @@
+"""PROMETHEUS — Prompt evolution without reference data."""
+
+__version__ = "0.1.0"
--- a/src/prometheus/application/__init__.py
+++ b/src/prometheus/application/__init__.py
--- a/src/prometheus/application/bootstrap.py
+++ b/src/prometheus/application/bootstrap.py
@@ -0,0 +1,42 @@
+"""
+Bootstrap — synthetic input generation.
+
+Creates a pool of test inputs from the task description.
+This replaces the need for a labelled dataset.
+"""
+from __future__ import annotations
+
+import random
+
+from prometheus.domain.entities import SyntheticExample
+from prometheus.domain.ports import SyntheticGeneratorPort
+
+
+class SyntheticBootstrap:
+    """Orchestrates synthetic input generation.
+
+    Depends only on the abstract port, not on DSPy directly.
+    """
+
+    def __init__(self, generator: SyntheticGeneratorPort, seed: int = 42):
+        self._generator = generator
+        self._rng = random.Random(seed)
+
+    def run(self, task_description: str, n_examples: int) -> list[SyntheticExample]:
+        """Generate the synthetic pool in a single call.
+
+        Single call minimizes LLM cost (1 call instead of N),
+        and the LLM can ensure diversity in a single generation.
+        """
+        examples = self._generator.generate_inputs(task_description, n_examples)
+        self._rng.shuffle(examples)
+        return examples
+
+    def sample_minibatch(
+        self,
+        pool: list[SyntheticExample],
+        size: int,
+    ) -> list[SyntheticExample]:
+        """Sample a minibatch from the synthetic pool."""
+        size = min(size, len(pool))
+        return self._rng.sample(pool, size)
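Note on `bootstrap.py`: the seeded shuffle and the clamped minibatch sampling can be tried in isolation. The following is a minimal sketch, not the committed code: `FakeGenerator` and `MiniBootstrap` are hypothetical stand-ins, and plain strings replace `SyntheticExample`. It illustrates that two bootstraps built with the same seed produce the same pool order and the same minibatch.

```python
import random


class FakeGenerator:
    """Hypothetical stand-in for a SyntheticGeneratorPort implementation."""

    def generate_inputs(self, task_description: str, n_examples: int) -> list[str]:
        # A real adapter would call an LLM once; here the inputs are canned.
        return [f"{task_description} #{i}" for i in range(n_examples)]


class MiniBootstrap:
    """Simplified mirror of SyntheticBootstrap, using plain strings."""

    def __init__(self, generator: FakeGenerator, seed: int = 42) -> None:
        self._generator = generator
        # Private RNG: the global random state is left untouched.
        self._rng = random.Random(seed)

    def run(self, task_description: str, n_examples: int) -> list[str]:
        examples = self._generator.generate_inputs(task_description, n_examples)
        self._rng.shuffle(examples)
        return examples

    def sample_minibatch(self, pool: list[str], size: int) -> list[str]:
        # Clamp so a pool smaller than the requested size does not raise ValueError.
        return self._rng.sample(pool, min(size, len(pool)))


def demo(seed: int) -> tuple[list[str], list[str]]:
    boot = MiniBootstrap(FakeGenerator(), seed=seed)
    pool = boot.run("review", 5)
    # Requesting 10 from a pool of 5 returns a batch clamped to 5 items.
    return pool, boot.sample_minibatch(pool, 10)


pool_a, batch_a = demo(seed=42)
pool_b, batch_b = demo(seed=42)
```

Because the clamp can return fewer items than requested, anything derived from the batch (such as call counts) is safer computed from `len(batch)` than from the configured size.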
--- a/src/prometheus/application/dto.py
+++ b/src/prometheus/application/dto.py
@@ -0,0 +1,47 @@
+"""Data Transfer Objects — configuration and results."""
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any
+
+
+@dataclass
+class OptimizationConfig:
+    """Complete configuration for a PROMETHEUS run."""
+
+    # --- Prompt ---
+    seed_prompt: str
+    task_description: str
+
+    # --- Models ---
+    task_model: str = "openai/gpt-4o-mini"
+    judge_model: str = "openai/gpt-4o"
+    proposer_model: str = "openai/gpt-4o"
+    synth_model: str = "openai/gpt-4o"
+
+    # --- Evolution parameters ---
+    max_iterations: int = 30
+    n_synthetic_inputs: int = 20
+    minibatch_size: int = 5
+    perfect_score: float = 1.0
+
+    # --- Reproducibility ---
+    seed: int = 42
+
+    # --- Output ---
+    output_path: str = "output.yaml"
+    verbose: bool = False
+
+
+@dataclass
+class OptimizationResult:
+    """Result of a complete optimization."""
+
+    optimized_prompt: str
+    initial_prompt: str
+    iterations_used: int
+    total_llm_calls: int
+    initial_score: float
+    final_score: float
+    improvement: float
+    history: list[dict[str, Any]] = field(default_factory=list)
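Note on `dto.py`: since the dataclass already carries the defaults, a loader can rely on them instead of repeating each default at the call site. The following is a minimal sketch under stated assumptions: `MiniConfig` is a trimmed, hypothetical mirror of `OptimizationConfig`, and `config_from_mapping` is an illustrative helper that does not exist in the commit.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class MiniConfig:
    """Trimmed-down, hypothetical mirror of OptimizationConfig."""

    seed_prompt: str
    task_description: str
    task_model: str = "openai/gpt-4o-mini"
    max_iterations: int = 30
    n_synthetic_inputs: int = 20
    minibatch_size: int = 5
    seed: int = 42


def config_from_mapping(raw: dict[str, Any]) -> tuple[MiniConfig, dict[str, Any]]:
    """Split `raw` into a MiniConfig and the keys the dataclass does not know."""
    known = {f.name for f in fields(MiniConfig)}
    kwargs = {k: v for k, v in raw.items() if k in known}
    extras = {k: v for k, v in raw.items() if k not in known}
    # Missing required keys surface as a TypeError from the dataclass constructor.
    return MiniConfig(**kwargs), extras


raw = {
    "seed_prompt": "You are a helpful coding assistant.",
    "task_description": "Code review and bug detection assistant.",
    "max_iterations": 3,
    "api_base": "https://api.z.ai/api/paas/v4",
}
config, extras = config_from_mapping(raw)
```

Keys such as `api_base` that are not dataclass fields end up in `extras`, so transport settings stay separate from the optimization parameters.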
--- a/src/prometheus/application/evaluator.py
+++ b/src/prometheus/application/evaluator.py
@@ -0,0 +1,75 @@
+"""
+Evaluator — execution + judgement.
+
+Produces a quality signal without ground truth.
+Combines candidate prompt execution + LLM-as-Judge evaluation.
+"""
+from __future__ import annotations
+
+from prometheus.domain.entities import (
+    EvalResult,
+    Prompt,
+    SyntheticExample,
+    Trajectory,
+)
+from prometheus.domain.ports import JudgePort, LLMPort
+
+
+class PromptEvaluator:
+    """Evaluates a prompt on a minibatch of synthetic inputs.
+
+    Pipeline: execute → judge → build trajectories.
+    Replaces GEPA's EvaluatorFn. Instead of comparing to ground truth,
+    uses an LLM-as-Judge.
+    """
+
+    def __init__(self, executor: LLMPort, judge: JudgePort):
+        self._executor = executor
+        self._judge = judge
+
+    def evaluate(
+        self,
+        prompt: Prompt,
+        minibatch: list[SyntheticExample],
+        task_description: str,
+    ) -> EvalResult:
+        """Evaluate the prompt on the minibatch.
+
+        Steps:
+        1. Execute the prompt on each input in the minibatch
+        2. Judge each (input, output) pair
+        3. Build trajectories with feedback
+        """
+        # Step 1: Execution
+        outputs: list[str] = []
+        for example in minibatch:
+            raw_output = self._executor.execute(prompt, example.input_text)
+            outputs.append(raw_output)
+
+        # Step 2: Judgement
+        pairs = [(ex.input_text, out) for ex, out in zip(minibatch, outputs)]
+        judge_results = self._judge.judge_batch(task_description, pairs)
+
+        # Step 3: Build trajectories
+        scores: list[float] = []
+        feedbacks: list[str] = []
+        trajectories: list[Trajectory] = []
+        for i, (example, output) in enumerate(zip(minibatch, outputs)):
+            score, feedback = judge_results[i]
+            scores.append(score)
+            feedbacks.append(feedback)
+            trajectories.append(
+                Trajectory(
+                    input_text=example.input_text,
+                    output_text=output,
+                    score=score,
+                    feedback=feedback,
+                    prompt_used=prompt.text,
+                )
+            )
+
+        return EvalResult(
+            scores=scores,
+            feedbacks=feedbacks,
+            trajectories=trajectories,
+        )
--- a/src/prometheus/application/evolution.py
+++ b/src/prometheus/application/evolution.py
@@ -0,0 +1,174 @@
+"""
+Evolution loop — core PROMETHEUS engine.
+
+Orchestrates the select → evaluate → propose → accept cycle.
+Equivalent to GEPAEngine.run(), adapted to work without a valset.
+"""
+from __future__ import annotations
+
+import logging
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.domain.entities import (
+    Candidate,
+    OptimizationState,
+    Prompt,
+    SyntheticExample,
+)
+from prometheus.domain.ports import ProposerPort
+from prometheus.domain.scoring import should_accept
+
+logger = logging.getLogger(__name__)
+
+
+class EvolutionLoop:
+    """Main evolution loop.
+
+    Design:
+    - Keeps only the best candidate (no full population).
+    - Simplifies vs GEPA (no Pareto, no merge).
+    - Population support deferred to v2.
+    """
+
+    def __init__(
+        self,
+        evaluator: PromptEvaluator,
+        proposer: ProposerPort,
+        bootstrap: SyntheticBootstrap,
+        max_iterations: int = 30,
+        minibatch_size: int = 5,
+        perfect_score: float = 1.0,
+        verbose: bool = False,
+    ):
+        self._evaluator = evaluator
+        self._proposer = proposer
+        self._bootstrap = bootstrap
+        self._max_iterations = max_iterations
+        self._minibatch_size = minibatch_size
+        self._perfect_score = perfect_score
+        self._verbose = verbose
+
+    def run(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+    ) -> OptimizationState:
+        """Execute the complete evolution loop."""
+        state = OptimizationState()
+
+        # Evaluate the seed
+        initial_batch = self._bootstrap.sample_minibatch(
+            synthetic_pool, self._minibatch_size
+        )
+        initial_eval = self._evaluator.evaluate(
+            seed_prompt, initial_batch, task_description
+        )
+        state.total_llm_calls += 2 * len(initial_batch)  # N executions + N judge calls
+
+        best_candidate = Candidate(
+            prompt=seed_prompt,
+            best_score=initial_eval.total_score,
+            generation=0,
+        )
+        state.best_candidate = best_candidate
+        state.candidates.append(best_candidate)
+        self._log(f"Initial score: {initial_eval.total_score:.2f}")
+
+        # Main loop
+        for i in range(1, self._max_iterations + 1):
+            state.iteration = i
+
+            try:
+                # 1. Sample a fresh minibatch
+                batch = self._bootstrap.sample_minibatch(
+                    synthetic_pool, self._minibatch_size
+                )
+
+                # 2. Evaluate the current candidate
+                current_eval = self._evaluator.evaluate(
+                    best_candidate.prompt, batch, task_description
+                )
+                state.total_llm_calls += 2 * len(batch)
+
+                # 3. Skip if perfect
+                if all(s >= self._perfect_score for s in current_eval.scores):
+                    self._log(f"Iter {i}: All scores perfect, skipping.")
+                    state.history.append(
+                        {
+                            "iteration": i,
+                            "event": "skip_perfect",
+                            "current_score": current_eval.total_score,
+                        }
+                    )
+                    continue
+
+                # 4. Propose a new prompt (reflective mutation)
+                new_prompt = self._proposer.propose(
+                    best_candidate.prompt,
+                    current_eval.trajectories,
+                    task_description,
+                )
+                state.total_llm_calls += 1  # 1 proposition call
+
+                # 5. Evaluate the new prompt on the same minibatch
+                new_eval = self._evaluator.evaluate(
+                    new_prompt, batch, task_description
+                )
+                state.total_llm_calls += 2 * len(batch)
+
+                # 6. Accept or reject
+                if should_accept(current_eval, new_eval):
+                    best_candidate = Candidate(
+                        prompt=new_prompt,
+                        best_score=new_eval.total_score,
+                        generation=i,
+                        parent_id=id(best_candidate),
+                    )
+                    state.best_candidate = best_candidate
+                    state.candidates.append(best_candidate)
+                    self._log(
+                        f"Iter {i}: ACCEPTED "
+                        f"({current_eval.total_score:.2f} -> {new_eval.total_score:.2f})"
+                    )
+                    state.history.append(
+                        {
+                            "iteration": i,
+                            "event": "accepted",
+                            "old_score": current_eval.total_score,
+                            "new_score": new_eval.total_score,
+                            "improvement": new_eval.total_score
+                            - current_eval.total_score,
+                        }
+                    )
+                else:
+                    self._log(
+                        f"Iter {i}: REJECTED "
+                        f"({new_eval.total_score:.2f} <= {current_eval.total_score:.2f})"
+                    )
+                    state.history.append(
+                        {
+                            "iteration": i,
+                            "event": "rejected",
+                            "old_score": current_eval.total_score,
+                            "new_score": new_eval.total_score,
+                        }
+                    )
+
+            except Exception as exc:
+                logger.warning("[PROMETHEUS] Iter %d: ERROR: %s. Skipping iteration.", i, exc)
+                state.history.append(
+                    {
+                        "iteration": i,
+                        "event": "error",
+                        "error": str(exc),
+                    }
+                )
+                continue
+
+        return state
+
+    def _log(self, msg: str) -> None:
+        if self._verbose:
+            logger.info("[PROMETHEUS] %s", msg)
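Note on `evolution.py`: the select → evaluate → propose → accept cycle is a greedy hill climb that keeps a single best candidate. The following is a minimal sketch on a toy task, under stated assumptions: `prometheus.domain.scoring.should_accept` is not shown in this diff, so the acceptance rule here (strict improvement of the summed minibatch score) is a guess, and `evolve`, `toy_evaluate`, and `toy_propose` are hypothetical.

```python
from typing import Callable


def should_accept(old_scores: list[float], new_scores: list[float]) -> bool:
    # Assumption: strict improvement of the summed minibatch score.
    return sum(new_scores) > sum(old_scores)


def evolve(
    seed: str,
    evaluate: Callable[[str], list[float]],
    propose: Callable[[str, list[float]], str],
    max_iterations: int,
    perfect_score: float = 1.0,
) -> tuple[str, list[str]]:
    best: str = seed
    events: list[str] = []
    for _ in range(max_iterations):
        current = evaluate(best)
        # Nothing to learn from a batch that is already perfect.
        if all(s >= perfect_score for s in current):
            events.append("skip_perfect")
            continue
        candidate = propose(best, current)
        # Compare old and new prompt on the same evaluation.
        if should_accept(current, evaluate(candidate)):
            best = candidate
            events.append("accepted")
        else:
            events.append("rejected")
    return best, events


# Toy task: a prompt scores higher the more of the required words it contains.
REQUIRED = ("cite", "precise")


def toy_evaluate(prompt: str) -> list[float]:
    return [1.0 if word in prompt else 0.0 for word in REQUIRED]


def toy_propose(prompt: str, scores: list[float]) -> str:
    # Add the first missing word; a real proposer would reflect on judge feedback.
    missing = [w for w, s in zip(REQUIRED, scores) if s < 1.0]
    return f"{prompt} {missing[0]}"


best_prompt, events = evolve("Analyze the contract.", toy_evaluate, toy_propose, 4)
```

With four iterations the toy run accepts two mutations and then skips the remaining two, since every score is already perfect.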
--- a/src/prometheus/application/use_cases.py
+++ b/src/prometheus/application/use_cases.py
@@ -0,0 +1,77 @@
+"""
+Main use case — high-level orchestration.
+
+Entry point for business logic. Coordinates bootstrap → evolution → result.
+Contains no technical logic, only orchestration.
+"""
+from __future__ import annotations
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.dto import OptimizationConfig, OptimizationResult
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.application.evolution import EvolutionLoop
+from prometheus.domain.entities import Prompt
+from prometheus.domain.ports import ProposerPort
+
+
+class OptimizePromptUseCase:
+    """Single MVP use case.
+
+    Injects dependencies via constructor (dependency injection).
+    """
+
+    def __init__(
+        self,
+        evaluator: PromptEvaluator,
+        proposer: ProposerPort,
+        bootstrap: SyntheticBootstrap,
+    ):
+        self._evaluator = evaluator
+        self._proposer = proposer
+        self._bootstrap = bootstrap
+
+    def execute(self, config: OptimizationConfig) -> OptimizationResult:
+        """Full pipeline:
+        1. Bootstrap → generate synthetic inputs
+        2. Evolution → optimization loop
+        3. Return result
+        """
+        # Phase 0: Bootstrap
+        synthetic_pool = self._bootstrap.run(
+            task_description=config.task_description,
+            n_examples=config.n_synthetic_inputs,
+        )
+
+        # Phase 1: Evolution
+        loop = EvolutionLoop(
+            evaluator=self._evaluator,
+            proposer=self._proposer,
+            bootstrap=self._bootstrap,
+            max_iterations=config.max_iterations,
+            minibatch_size=config.minibatch_size,
+            perfect_score=config.perfect_score,
+            verbose=config.verbose,
+        )
+        seed_prompt = Prompt(text=config.seed_prompt)
+        state = loop.run(seed_prompt, synthetic_pool, config.task_description)
+
+        # Phase 2: Result
+        initial_score = (
+            state.candidates[0].best_score if state.candidates else 0.0
+        )
+        final_score = state.best_candidate.best_score if state.best_candidate else 0.0
+
+        return OptimizationResult(
+            optimized_prompt=(
+                state.best_candidate.prompt.text
+                if state.best_candidate
+                else config.seed_prompt
+            ),
+            initial_prompt=config.seed_prompt,
+            iterations_used=state.iteration,
+            total_llm_calls=state.total_llm_calls + 1,  # +1 for bootstrap
+            initial_score=initial_score,
+            final_score=final_score,
+            improvement=final_score - initial_score,
+            history=state.history,
+        )
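Note on `use_cases.py`: the call accounting in the loop implies an upper bound on `total_llm_calls` for a given config. The following is a minimal sketch, assuming the pool is at least as large as the minibatch and no iteration is skipped or fails. It follows the commit's own convention of counting one judge call per (input, output) pair, which may differ from the number of API requests actually sent. `max_llm_calls` is a hypothetical helper.

```python
def max_llm_calls(max_iterations: int, minibatch_size: int) -> int:
    """Upper bound on counted LLM calls for one optimization run."""
    bootstrap = 1                           # single synthetic-generation call
    seed_eval = 2 * minibatch_size          # N executions + N judge calls
    per_iteration = 4 * minibatch_size + 1  # two evaluations + one proposal
    return bootstrap + seed_eval + max_iterations * per_iteration


# Parameters taken from the config files in this commit.
budgets = {
    "examples/sample_config.yaml": max_llm_calls(30, 5),
    "config_glm45air.yaml": max_llm_calls(8, 3),
    "config_glm51.yaml": max_llm_calls(3, 2),
}
```

Under these assumptions the bound is 641 calls for the sample config, 111 for `config_glm45air.yaml`, and 32 for `config_glm51.yaml`, which gives a rough cost estimate before launching a run.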
--- a/src/prometheus/cli/__init__.py
+++ b/src/prometheus/cli/__init__.py
--- a/src/prometheus/cli/app.py
+++ b/src/prometheus/cli/app.py
@@ -0,0 +1,168 @@
+"""
+CLI — user entry point.
+
+Typer interface with -i (input) and -o (output) options.
+"""
+from __future__ import annotations
+
+import logging
+import os
+from dataclasses import asdict
+
+import dspy
+import typer
+from rich.console import Console
+from rich.panel import Panel
+from rich.table import Table
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.dto import OptimizationConfig, OptimizationResult
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.application.use_cases import OptimizePromptUseCase
+from prometheus.infrastructure.file_io import YamlPersistence
+from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
+from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
+from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
+from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter
+
+app = typer.Typer(
+    name="prometheus",
+    help="PROMETHEUS — Prompt evolution without reference data.",
+    no_args_is_help=True,
+)
+
+console = Console()
+
+
+@app.command()
+def optimize(
+    input: str = typer.Option(
+        ...,
+        "-i",
+        "--input",
+        help="Path to input YAML config file.",
+        exists=True,
+        readable=True,
+    ),
+    output: str = typer.Option(
+        "output.yaml",
+        "-o",
+        "--output",
+        help="Path to output YAML result file.",
+    ),
+    verbose: bool = typer.Option(
+        False,
+        "-v",
+        "--verbose",
+        help="Print detailed progress.",
+    ),
+) -> None:
+    """Optimize a prompt without any reference data.
+
+    Usage:
+        prometheus optimize -i config.yaml -o result.yaml
+    """
+    # Configure verbose logging
+    if verbose:
+        logging.basicConfig(level=logging.INFO, format="[PROMETHEUS] %(message)s")
+
+    console.print(
+        Panel.fit(
+            "PROMETHEUS — Prompt Evolution Engine",
+            subtitle="No reference data required",
+        )
+    )
+
+    # 1. Load config
+    persistence = YamlPersistence()
+    raw_config = persistence.read_config(input)
+    config = OptimizationConfig(
+        seed_prompt=raw_config["seed_prompt"],
+        task_description=raw_config["task_description"],
+        task_model=raw_config.get("task_model", "openai/gpt-4o-mini"),
+        judge_model=raw_config.get("judge_model", "openai/gpt-4o"),
+        proposer_model=raw_config.get("proposer_model", "openai/gpt-4o"),
+        synth_model=raw_config.get("synth_model", "openai/gpt-4o"),
+        max_iterations=raw_config.get("max_iterations", 30),
+        n_synthetic_inputs=raw_config.get("n_synthetic_inputs", 20),
+        minibatch_size=raw_config.get("minibatch_size", 5),
+        seed=raw_config.get("seed", 42),
+        output_path=output,
+        verbose=verbose,
+    )
+    console.print(f"[dim]Task: {config.task_description[:80]}...[/dim]")
+    console.print(f"[dim]Seed prompt: {config.seed_prompt[:80]}...[/dim]")
+
+    # 2. Configure DSPy with optional api_base/api_key from config
+    lm_kwargs: dict[str, str] = {}
+    api_base = raw_config.get("api_base")
+    api_key_env = raw_config.get("api_key_env")
+    if api_base:
+        lm_kwargs["api_base"] = api_base
+    if api_key_env:
+        lm_kwargs["api_key"] = os.environ.get(api_key_env, "")
+    task_lm = dspy.LM(config.task_model, **lm_kwargs)
+    dspy.configure(lm=task_lm)
+
+    # 3. Build adapters (Dependency Injection)
+    synth_adapter = DSPySyntheticAdapter()
+    llm_adapter = DSPyLLMAdapter(model=config.task_model)
+    judge_adapter = DSPyJudgeAdapter()
+    proposer_adapter = DSPyProposerAdapter()
+    bootstrap = SyntheticBootstrap(generator=synth_adapter, seed=config.seed)
+    evaluator = PromptEvaluator(executor=llm_adapter, judge=judge_adapter)
+    use_case = OptimizePromptUseCase(
+        evaluator=evaluator,
+        proposer=proposer_adapter,
+        bootstrap=bootstrap,
+    )
+
+    # 4. Execute
+    with console.status("[bold green]Evolving prompt..."):
+        result = use_case.execute(config)
+
+    # 5. Display results
+    _display_result(result)
+
+    # 6. Save
+    _save_result(persistence, output, result)
+    console.print(f"\n[green]Results saved to {output}[/green]")
+
+
+def _display_result(result: OptimizationResult) -> None:
+    """Display a Rich summary in the terminal."""
+    console.print()
+    console.print(
+        Panel(
+            f"[bold green]Optimized Prompt[/bold green]\n\n{result.optimized_prompt}",
+            title="Result",
+        )
+    )
+    table = Table(title="Metrics")
+    table.add_column("Metric", style="cyan")
+    table.add_column("Value", style="bold")
+    table.add_row("Initial Score", f"{result.initial_score:.2f}")
+    table.add_row("Final Score", f"{result.final_score:.2f}")
+    table.add_row("Improvement", f"{result.improvement:+.2f}")
+    table.add_row("Iterations", str(result.iterations_used))
+    table.add_row("LLM Calls", str(result.total_llm_calls))
+    console.print(table)
+
+
+def _save_result(
+    persistence: YamlPersistence,
+    path: str,
+    result: OptimizationResult,
+) -> None:
+    """Save the result as YAML."""
+    persistence.write_result(path, asdict(result))
+
+
+@app.command(hidden=True)
+def _help() -> None:
+    """Internal placeholder to force multi-command Typer behavior."""
+    pass
+
+
+if __name__ == "__main__":
+    app()
--- a/src/prometheus/config.py
+++ b/src/prometheus/config.py
@@ -0,0 +1,12 @@
+"""Application settings."""
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass
+class AppSettings:
+    """Non-sensitive settings, hardcoded for the MVP."""
+
+    app_name: str = "prometheus"
+    version: str = "0.1.0"
--- a/src/prometheus/domain/__init__.py
+++ b/src/prometheus/domain/__init__.py
--- a/src/prometheus/domain/entities.py
+++ b/src/prometheus/domain/entities.py
@@ -0,0 +1,87 @@
+"""Domain entities — pure data, zero dependencies."""
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any
+
+
+@dataclass(frozen=True)
+class Prompt:
+    """Represents a candidate prompt.
+
+    frozen=True → no field reassignment, safe for Pareto tracking (metadata stays mutable).
+    """
+
+    text: str
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+    def __len__(self) -> int:
+        return len(self.text)
+
+
+@dataclass(frozen=True)
+class SyntheticExample:
+    """A synthetic example: an input generated from the task description.
+
+    No expected output — the judge will evaluate the output directly.
+    """
+
+    input_text: str
+    category: str = "default"  # for future stratified sampling
+    id: int = 0
+
+
+@dataclass
+class Trajectory:
+    """Execution trace of a prompt on an input.
+
+    Used by reflective mutation to understand failures.
+    """
+
+    input_text: str
+    output_text: str
+    score: float
+    feedback: str  # textual feedback from the judge
+    prompt_used: str
+
+
+@dataclass
+class EvalResult:
+    """Result of an evaluation on a minibatch."""
+
+    scores: list[float]
+    feedbacks: list[str]
+    trajectories: list[Trajectory]
+
+    @property
+    def total_score(self) -> float:
+        return sum(self.scores)
+
+    @property
+    def mean_score(self) -> float:
+        return sum(self.scores) / len(self.scores) if self.scores else 0.0
+
+
+@dataclass
+class Candidate:
+    """A candidate in the evolution pool.
+
+    Contains the prompt and its best score so far.
+    """
+
+    prompt: Prompt
+    best_score: float = 0.0
+    generation: int = 0  # at which iteration it was created
+    parent_id: int | None = None
+
+
+@dataclass
+class OptimizationState:
+    """Complete optimization state — serializable snapshot."""
+
+    iteration: int = 0
+    best_candidate: Candidate | None = None
+    candidates: list[Candidate] = field(default_factory=list)
+    synthetic_pool: list[SyntheticExample] = field(default_factory=list)
+    history: list[dict[str, Any]] = field(default_factory=list)
+    total_llm_calls: int = 0
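Note on `Prompt` in entities.py: the sketch below, using only the standard library, illustrates what `frozen=True` does and does not guarantee when one field holds a dict. `FrozenPrompt` and `describe_guarantees` are hypothetical names introduced for illustration; they mirror the two fields of `Prompt` but are not part of the package.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any


@dataclass(frozen=True)
class FrozenPrompt:
    """Hypothetical stand-in mirroring the fields of Prompt."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def describe_guarantees() -> dict[str, bool]:
    p = FrozenPrompt(text="Answer the question.")
    report: dict[str, bool] = {}

    # Rebinding a field on a frozen dataclass raises FrozenInstanceError.
    try:
        p.text = "changed"  # type: ignore[misc]
        report["rebinding_blocked"] = False
    except FrozenInstanceError:
        report["rebinding_blocked"] = True

    # The dict held by the field can still be mutated in place.
    p.metadata["note"] = "mutated"
    report["metadata_mutable"] = p.metadata == {"note": "mutated"}

    # The generated __hash__ hashes every field, so a dict field makes it fail.
    try:
        hash(p)
        report["hashable"] = True
    except TypeError:
        report["hashable"] = False

    return report
```

If prompts ever need to be stored in a set or used as dict keys, one possible option is to declare the field as `field(default_factory=dict, compare=False)`, which excludes it from both `__eq__` and `__hash__`. Whether that fits the intended Pareto tracking is an assumption, not something the commit states.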
--- a/src/prometheus/domain/ports.py
+++ b/src/prometheus/domain/ports.py
@@ -0,0 +1,85 @@
+"""
+Domain ports — abstract interfaces that infrastructure implements.
+Uses ABCs (abstract base classes) to keep domain and infrastructure loosely coupled.
+"""
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+
+from typing import Any
+
+from prometheus.domain.entities import Prompt, SyntheticExample, Trajectory
+
+
+class LLMPort(ABC):
+    """Port for executing a prompt on an input.
+
+    Infrastructure will provide an implementation via DSPy.
+    """
+
+    @abstractmethod
+    def execute(self, prompt: Prompt, input_text: str) -> str:
+        """Execute the prompt on the input, return the raw response."""
+        ...
+
+
+class JudgePort(ABC):
+    """Port for LLM-as-Judge evaluation.
+
+    Takes (input, output) pairs + the task description.
+    Returns a score + textual feedback per pair.
+    """
+
+    @abstractmethod
+    def judge_batch(
+        self,
+        task_description: str,
+        pairs: list[tuple[str, str]],
+    ) -> list[tuple[float, str]]:
+        """Evaluate a batch of (input, output) pairs.
+
+        Returns a list of (score, feedback).
+        """
+        ...
+
+
+class ProposerPort(ABC):
+    """Port for proposing a new prompt.
+
+    Uses evaluation trajectories to propose an improvement.
+    """
+
+    @abstractmethod
+    def propose(
+        self,
+        current_prompt: Prompt,
+        trajectories: list[Trajectory],
+        task_description: str,
+    ) -> Prompt:
+        """Propose a new prompt based on failure trajectories."""
+        ...
+
+
+class SyntheticGeneratorPort(ABC):
+    """Port for generating synthetic inputs."""
+
+    @abstractmethod
+    def generate_inputs(
+        self,
+        task_description: str,
+        n_examples: int,
+    ) -> list[SyntheticExample]:
+        """Generate N diverse synthetic inputs."""
+        ...
+
+
+class PersistencePort(ABC):
+    """Port for reading/writing files."""
+
+    @abstractmethod
+    def read_config(self, path: str) -> dict[str, Any]:
+        ...
+
+    @abstractmethod
+    def write_result(self, path: str, data: dict[str, Any]) -> None:
+        ...
--- a/src/prometheus/domain/scoring.py
+++ b/src/prometheus/domain/scoring.py
@@ -0,0 +1,21 @@
+"""Scoring logic and acceptance criteria — pure domain."""
+from __future__ import annotations
+
+from prometheus.domain.entities import EvalResult
+
+
+def should_accept(
+    old_result: EvalResult,
+    new_result: EvalResult,
+    min_improvement: float = 0.0,
+) -> bool:
+    """Strict acceptance criterion.
+
+    The new total score must exceed the old one by more than min_improvement.
+    """
+    return new_result.total_score > old_result.total_score + min_improvement
+
+
+def normalize_score(raw: float, min_val: float = 0.0, max_val: float = 1.0) -> float:
+    """Clamp a score within [min_val, max_val]."""
+    return max(min_val, min(max_val, raw))
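Worked example for the acceptance rule in scoring.py: a minimal sketch that restates the same comparison with a stand-in result type, so that it runs without the package installed. `MiniEval` is a hypothetical reduction of `EvalResult` to the one property `should_accept` reads; the score values are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MiniEval:
    """Hypothetical stand-in exposing only what should_accept reads."""

    scores: list[float]

    @property
    def total_score(self) -> float:
        return sum(self.scores)


def should_accept(old: MiniEval, new: MiniEval, min_improvement: float = 0.0) -> bool:
    # Same comparison as in scoring.py: strictly greater than old + margin.
    return new.total_score > old.total_score + min_improvement


old = MiniEval(scores=[0.25, 0.5, 0.25])   # total 1.0
better = MiniEval(scores=[0.5, 0.5, 0.5])  # total 1.5
tie = MiniEval(scores=[0.5, 0.25, 0.25])   # total 1.0

accepted = should_accept(old, better)                              # True
tie_accepted = should_accept(old, tie)                             # False: a tie is rejected
margin_accepted = should_accept(old, better, min_improvement=0.5)  # False: 1.5 is not > 1.5
```

Because the comparison uses totals rather than means, it presumably expects both results to come from minibatches of the same size; comparing a 3-item batch against a 5-item batch would favor the larger one.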
--- a/src/prometheus/infrastructure/__init__.py
+++ b/src/prometheus/infrastructure/__init__.py
--- a/src/prometheus/infrastructure/dspy_modules.py
+++ b/src/prometheus/infrastructure/dspy_modules.py
@@ -0,0 +1,92 @@
+"""
+DSPy Modules — signature composition.
+
+Declarative LLM call orchestration via DSPy.
+"""
+from __future__ import annotations
+
+import json
+import re
+
+import dspy
+
+from prometheus.infrastructure.dspy_signatures import (
+    GenerateSyntheticInputs,
+    JudgeOutput,
+    ProposeInstruction,
+)
+
+
+class SyntheticInputGenerator(dspy.Module):
+    """Generates synthetic inputs in a single batch call.
+
+    Uses ChainOfThought for better diversity.
+    """
+
+    def __init__(self) -> None:
+        super().__init__()
+        self.generate = dspy.ChainOfThought(GenerateSyntheticInputs)
+
+    def forward(self, task_description: str, n_examples: int) -> dspy.Prediction:
+        result = self.generate(
+            task_description=task_description,
+            n_examples=n_examples,
+        )
+        try:
+            examples = json.loads(result.examples)
+        except json.JSONDecodeError:
+            examples = self._parse_fallback(result.examples)
+        return dspy.Prediction(examples=examples)
+
+    @staticmethod
+    def _parse_fallback(text: str) -> list[str]:
+        """Extract strings from non-JSON output."""
+        matches = re.findall(r'"([^"]+)"', text)
+        return matches if matches else [text]
+
+
+class OutputJudge(dspy.Module):
+    """Judges a single output. Called in batch by JudgeAdapter."""
+
+    def __init__(self) -> None:
+        super().__init__()
+        self.judge = dspy.ChainOfThought(JudgeOutput)
+
+    def forward(
+        self, task_description: str, input_text: str, output_text: str
+    ) -> dspy.Prediction:
+        result = self.judge(
+            task_description=task_description,
+            input_text=input_text,
+            output_text=output_text,
+        )
+        try:
+            score = float(result.score)
+        except (ValueError, TypeError):
+            score = 0.5  # neutral fallback
+        score = max(0.0, min(1.0, score))
+        return dspy.Prediction(score=score, feedback=result.feedback)
+
+
+class InstructionProposer(dspy.Module):
+    """Proposes a new prompt from failure trajectories.
+
+    Equivalent to GEPA's InstructionProposalSignature.
+    """
+
+    def __init__(self) -> None:
+        super().__init__()
+        self.propose = dspy.ChainOfThought(ProposeInstruction)
+
+    def forward(
+        self,
+        current_instruction: str,
+        task_description: str,
+        failure_examples: str,
+    ) -> dspy.Prediction:
+        result = self.propose(
+            current_instruction=current_instruction,
+            task_description=task_description,
+            failure_examples=failure_examples,
+        )
+        return dspy.Prediction(new_instruction=result.new_instruction)
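Note on score parsing in `OutputJudge.forward`: the clamp `max(0.0, min(1.0, score))` has one edge case worth knowing about. If the judge emits the literal text `nan`, `float()` accepts it, and because every comparison with NaN is false, the clamp returns 1.0. The sketch below reproduces that behavior with plain Python and shows one possible guard. `clamp_like_output_judge` and `parse_score` are hypothetical helper names, not functions from the package.

```python
from __future__ import annotations

import math


def clamp_like_output_judge(raw: object) -> float:
    """Mirror of the parsing steps in OutputJudge.forward."""
    try:
        score = float(raw)  # type: ignore[arg-type]
    except (ValueError, TypeError):
        score = 0.5  # neutral fallback, as in the original
    return max(0.0, min(1.0, score))


def parse_score(raw: object, fallback: float = 0.5) -> float:
    """Hypothetical variant that also routes NaN to the fallback."""
    try:
        score = float(raw)  # type: ignore[arg-type]
    except (ValueError, TypeError):
        return fallback
    if math.isnan(score):
        return fallback
    return max(0.0, min(1.0, score))
```

With these definitions, `clamp_like_output_judge("nan")` evaluates to 1.0 while `parse_score("nan")` evaluates to 0.5; ordinary inputs such as `"0.8"`, `"1.5"`, or `None` give the same result in both.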
--- a/src/prometheus/infrastructure/dspy_signatures.py
+++ b/src/prometheus/infrastructure/dspy_signatures.py
@@ -0,0 +1,79 @@
+"""
+DSPy Signatures — declarative LLM contracts.
+
+Defines WHAT each LLM call does, not HOW.
+DSPy Signature = input_fields → output_fields + instruction.
+DSPy handles prompting, parsing, and structuring.
+"""
+from __future__ import annotations
+
+import dspy
+
+
+class GenerateSyntheticInputs(dspy.Signature):
+    """Generate diverse, realistic input examples for a given task."""
+
+    task_description: str = dspy.InputField(
+        desc="Description of the task the prompt should accomplish."
+    )
+    n_examples: int = dspy.InputField(
+        desc="Number of examples to generate."
+    )
+    examples: str = dspy.OutputField(
+        desc=(
+            "A JSON array of strings, each being a realistic input "
+            "for the task. Cover: normal cases, edge cases, long inputs, "
+            "short inputs, ambiguous cases, and tricky scenarios."
+        ),
+    )
+
+
+class JudgeOutput(dspy.Signature):
+    """Evaluate the quality of an LLM output for a given task and input.
+
+    Score: 0.0 (completely wrong) to 1.0 (perfect).
+    Feedback: specific, actionable criticism.
+    """
+
+    task_description: str = dspy.InputField(
+        desc="What the assistant is supposed to do."
+    )
+    input_text: str = dspy.InputField(
+        desc="The input provided to the assistant."
+    )
+    output_text: str = dspy.InputField(
+        desc="The assistant's response to evaluate."
+    )
+    score: float = dspy.OutputField(
+        desc="Quality score from 0.0 (wrong) to 1.0 (perfect)."
+    )
+    feedback: str = dspy.OutputField(
+        desc=(
+            "Specific, actionable feedback explaining what's wrong "
+            "with the output and how to improve it. Be critical."
+        ),
+    )
+
+
+class ProposeInstruction(dspy.Signature):
+    """Given a current prompt and examples of where it fails with feedback,
+    propose an improved version of the prompt.
+
+    The new prompt should address all the issues identified in the feedback.
+    """
+
+    current_instruction: str = dspy.InputField(
+        desc="The current prompt/instruction to improve."
+    )
+    task_description: str = dspy.InputField(
+        desc="Description of the task."
+    )
+    failure_examples: str = dspy.InputField(
+        desc=(
+            "Examples of inputs, outputs, scores, and feedback "
+            "showing where the current instruction fails."
+        ),
+    )
+    new_instruction: str = dspy.OutputField(
+        desc="An improved version of the instruction."
+    )
--- a/src/prometheus/infrastructure/file_io.py
+++ b/src/prometheus/infrastructure/file_io.py
@@ -0,0 +1,25 @@
+"""
+File I/O — read/write config and result files.
+
+Implements the PersistencePort with YAML.
+"""
+from __future__ import annotations
+
+from typing import Any
+
+import yaml
+
+from prometheus.domain.ports import PersistencePort
+
+
+class YamlPersistence(PersistencePort):
+    """Reads and writes YAML files."""
+
+    def read_config(self, path: str) -> dict[str, Any]:
+        with open(path, encoding="utf-8") as f:
+            data: dict[str, Any] = yaml.safe_load(f) or {}
+            return data
+
+    def write_result(self, path: str, data: dict[str, Any]) -> None:
+        with open(path, "w", encoding="utf-8") as f:
+            yaml.dump(data, f, default_flow_style=False, allow_unicode=True)
--- a/src/prometheus/infrastructure/judge_adapter.py
+++ b/src/prometheus/infrastructure/judge_adapter.py
@@ -0,0 +1,34 @@
+"""
+Adapter: LLM-as-Judge.
+
+Implements the JudgePort via the DSPy OutputJudge module.
+"""
+from __future__ import annotations
+
+from prometheus.domain.ports import JudgePort
+from prometheus.infrastructure.dspy_modules import OutputJudge
+
+
+class DSPyJudgeAdapter(JudgePort):
+    """Evaluates a batch of (input, output) pairs by calling the Judge for each.
+
+    Sequential for MVP. Future: parallelize via dspy.Parallel.
+    """
+
+    def __init__(self) -> None:
+        self._judge = OutputJudge()
+
+    def judge_batch(
+        self,
+        task_description: str,
+        pairs: list[tuple[str, str]],
+    ) -> list[tuple[float, str]]:
+        results: list[tuple[float, str]] = []
+        for input_text, output_text in pairs:
+            pred = self._judge(
+                task_description=task_description,
+                input_text=input_text,
+                output_text=output_text,
+            )
+            results.append((pred.score, pred.feedback))
+        return results
--- a/src/prometheus/infrastructure/llm_adapter.py
+++ b/src/prometheus/infrastructure/llm_adapter.py
@@ -0,0 +1,32 @@
+"""
+Adapter: Execute a prompt on an input.
+
+Implements the LLMPort via DSPy.
+"""
+from __future__ import annotations
+
+import dspy
+
+from prometheus.domain.entities import Prompt
+from prometheus.domain.ports import LLMPort
+
+
+class DSPyLLMAdapter(LLMPort):
+    """Executes a prompt using dspy.Predict with a simple signature."""
+
+    class _ExecuteSignature(dspy.Signature):
+        """Execute the instruction on the given input."""
+
+        instruction: str = dspy.InputField(desc="The instruction/prompt to follow.")
+        input_text: str = dspy.InputField(desc="The input to process.")
+        output: str = dspy.OutputField(desc="The response following the instruction.")
+
+    def __init__(self, model: str) -> None:
+        self._predictor = dspy.Predict(self._ExecuteSignature)
+
+    def execute(self, prompt: Prompt, input_text: str) -> str:
+        result = self._predictor(
+            instruction=prompt.text,
+            input_text=input_text,
+        )
+        return str(result.output)
--- a/src/prometheus/infrastructure/proposer_adapter.py
+++ b/src/prometheus/infrastructure/proposer_adapter.py
@@ -0,0 +1,47 @@
+"""
+Adapter: Reflective Mutation Proposer.
+
+Implements the ProposerPort via the DSPy InstructionProposer.
+Converts trajectories into readable format for the LLM proposer.
+"""
+from __future__ import annotations
+
+from prometheus.domain.entities import Prompt, Trajectory
+from prometheus.domain.ports import ProposerPort
+from prometheus.infrastructure.dspy_modules import InstructionProposer
+
+
+class DSPyProposerAdapter(ProposerPort):
+    """Uses evaluation trajectories to build a failure report and propose a new prompt."""
+
+    def __init__(self) -> None:
+        self._proposer = InstructionProposer()
+
+    def propose(
+        self,
+        current_prompt: Prompt,
+        trajectories: list[Trajectory],
+        task_description: str,
+    ) -> Prompt:
+        failure_examples = self._format_failures(trajectories)
+        pred = self._proposer(
+            current_instruction=current_prompt.text,
+            task_description=task_description,
+            failure_examples=failure_examples,
+        )
+        return Prompt(text=pred.new_instruction)
+
+    @staticmethod
+    def _format_failures(trajectories: list[Trajectory]) -> str:
+        """Convert trajectories into a structured textual report."""
+        sections: list[str] = []
+        for i, t in enumerate(trajectories, 1):
+            section = (
+                f"# Example {i}\n"
+                f"## Input\n{t.input_text}\n\n"
+                f"## Generated Output\n{t.output_text}\n\n"
+                f"## Score\n{t.score:.2f}\n\n"
+                f"## Feedback\n{t.feedback}\n"
+            )
+            sections.append(section)
+        return "\n---\n".join(sections)
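Illustration of the failure report built by `DSPyProposerAdapter._format_failures`: a minimal sketch that reproduces the same layout without importing DSPy, to show what text the proposer model actually receives. `MiniTrajectory` is a hypothetical stand-in carrying only the four fields the formatter reads, and the two sample trajectories are invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MiniTrajectory:
    """Hypothetical stand-in with the fields _format_failures reads."""

    input_text: str
    output_text: str
    score: float
    feedback: str


def format_failures(trajectories: list[MiniTrajectory]) -> str:
    # Same layout as DSPyProposerAdapter._format_failures.
    sections = [
        (
            f"# Example {i}\n"
            f"## Input\n{t.input_text}\n\n"
            f"## Generated Output\n{t.output_text}\n\n"
            f"## Score\n{t.score:.2f}\n\n"
            f"## Feedback\n{t.feedback}\n"
        )
        for i, t in enumerate(trajectories, 1)
    ]
    return "\n---\n".join(sections)


report = format_failures(
    [
        MiniTrajectory("What is 2+2?", "5", 0.3, "Wrong arithmetic."),
        MiniTrajectory("Capital of France?", "Paris, probably", 0.6, "Drop the hedge."),
    ]
)
```

Each trajectory becomes one Markdown-style section, sections are separated by a `---` rule, and an empty trajectory list yields an empty string, so the proposer would then see no failure evidence at all.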
--- a/src/prometheus/infrastructure/synth_adapter.py
+++ b/src/prometheus/infrastructure/synth_adapter.py
@@ -0,0 +1,34 @@
+"""
+Adapter: Synthetic input generation.
+
+Implements the SyntheticGeneratorPort via DSPy.
+"""
+from __future__ import annotations
+
+from prometheus.domain.entities import SyntheticExample
+from prometheus.domain.ports import SyntheticGeneratorPort
+from prometheus.infrastructure.dspy_modules import SyntheticInputGenerator
+
+
+class DSPySyntheticAdapter(SyntheticGeneratorPort):
+    """Generates synthetic inputs in a single batch call via DSPy."""
+
+    def __init__(self) -> None:
+        self._generator = SyntheticInputGenerator()
+
+    def generate_inputs(
+        self,
+        task_description: str,
+        n_examples: int,
+    ) -> list[SyntheticExample]:
+        pred = self._generator(
+            task_description=task_description,
+            n_examples=n_examples,
+        )
+        return [
+            SyntheticExample(
+                input_text=text,
+                id=i,
+            )
+            for i, text in enumerate(pred.examples[:n_examples])
+        ]
--- a/test_config.yaml
+++ b/test_config.yaml
@@ -0,0 +1,26 @@
+# PROMETHEUS Test Config — Full Pipeline Test
+# =============================================
+
+seed_prompt: |
+  You are a helpful coding assistant. Help the user write clean, bug-free code.
+  When reviewing code, identify bugs and suggest improvements.
+
+task_description: |
+  Code review and bug detection assistant. Reviews code snippets and
+  identifies bugs, security issues, and style problems. The assistant
+  receives a code snippet and must produce a structured review.
+
+task_model: "openai/glm-4.5-air"
+judge_model: "openai/glm-4.5-air"
+proposer_model: "openai/glm-4.5-air"
+synth_model: "openai/glm-4.5-air"
+
+# API configuration for z.ai (coding endpoint)
+api_base: "https://api.z.ai/api/coding/paas/v4"
+api_key_env: "GLM_API_KEY"
+
+# Evolution parameters (small test)
+max_iterations: 5
+n_synthetic_inputs: 5
+minibatch_size: 3
+seed: 42
--- a/tests/__init__.py
+++ b/tests/__init__.py
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -0,0 +1,93 @@
+"""Shared test fixtures."""
+from __future__ import annotations
+
+from unittest.mock import MagicMock
+
+import pytest
+
+from prometheus.domain.entities import (
+    EvalResult,
+    Prompt,
+    SyntheticExample,
+    Trajectory,
+)
+
+
+@pytest.fixture
+def seed_prompt() -> Prompt:
+    return Prompt(text="You are a helpful assistant. Answer the question.")
+
+
+@pytest.fixture
+def task_description() -> str:
+    return "Answer factual questions accurately and concisely."
+
+
+@pytest.fixture
+def synthetic_pool() -> list[SyntheticExample]:
+    return [
+        SyntheticExample(input_text=f"Test input {i}", id=i) for i in range(20)
+    ]
+
+
+@pytest.fixture
+def mock_eval_result() -> EvalResult:
+    return EvalResult(
+        scores=[0.3, 0.5, 0.4, 0.6, 0.2],
+        feedbacks=[
+            "Incomplete answer",
+            "Missing key detail",
+            "Wrong format",
+            "Partially correct",
+            "Completely off topic",
+        ],
+        trajectories=[
+            Trajectory(
+                input_text=f"Input {i}",
+                output_text=f"Output {i}",
+                score=s,
+                feedback=f,
+                prompt_used="test prompt",
+            )
+            for i, (s, f) in enumerate(
+                zip(
+                    [0.3, 0.5, 0.4, 0.6, 0.2],
+                    [
+                        "Incomplete answer",
+                        "Missing key detail",
+                        "Wrong format",
+                        "Partially correct",
+                        "Completely off topic",
+                    ],
+                )
+            )
+        ],
+    )
+
+
+@pytest.fixture
+def mock_llm_port() -> MagicMock:
+    """Mock LLMPort that returns canned responses."""
+    port = MagicMock()
+    port.execute.return_value = "This is a mock response."
+    return port
+
+
+@pytest.fixture
+def mock_judge_port() -> MagicMock:
+    """Mock JudgePort that returns moderate scores."""
+    port = MagicMock()
+    port.judge_batch.return_value = [
+        (0.5, "Moderate quality, needs improvement."),
+    ] * 5
+    return port
+
+
+@pytest.fixture
+def mock_proposer_port() -> MagicMock:
+    """Mock ProposerPort that returns a slightly modified prompt."""
+    port = MagicMock()
+    port.propose.return_value = Prompt(
+        text="You are a very helpful assistant. Answer the question precisely."
+    )
+    return port
--- a/tests/integration/__init__.py
+++ b/tests/integration/__init__.py
--- a/tests/integration/test_dspy_adapters.py
+++ b/tests/integration/test_dspy_adapters.py
@@ -0,0 +1,29 @@
+"""Integration tests for DSPy adapters using DSPy mock LM."""
+from __future__ import annotations
+
+import dspy
+import pytest
+
+from prometheus.domain.entities import Prompt
+from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
+
+
+@pytest.fixture
+def mock_lm() -> dspy.LM:
+    """Create a DSPy mock LM that returns predictable responses."""
+    lm = dspy.utils.DummyLM(
+        [
+            {"output": "Mock output response"},
+        ]
+    )
+    dspy.configure(lm=lm)
+    return lm
+
+
+class TestDSPyLLMAdapter:
+    def test_execute_returns_response(self, mock_lm: dspy.LM) -> None:
+        adapter = DSPyLLMAdapter(model="openai/gpt-4o-mini")
+        prompt = Prompt(text="Answer the question.")
+        result = adapter.execute(prompt, "What is 2+2?")
+        assert isinstance(result, str)
+        assert len(result) > 0
--- a/tests/integration/test_full_pipeline.py
+++ b/tests/integration/test_full_pipeline.py
@@ -0,0 +1,74 @@
+"""End-to-end pipeline test with mocked LLM calls."""
+from __future__ import annotations
+
+from unittest.mock import MagicMock
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.dto import OptimizationConfig
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.application.use_cases import OptimizePromptUseCase
+from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory
+from prometheus.domain.ports import JudgePort, LLMPort, ProposerPort
+
+
+def _make_eval(scores: list[float]) -> EvalResult:
+    return EvalResult(
+        scores=scores,
+        feedbacks=["feedback"] * len(scores),
+        trajectories=[
+            Trajectory(f"in{i}", f"out{i}", s, "feedback", "prompt")
+            for i, s in enumerate(scores)
+        ],
+    )
+
+
+class TestFullPipeline:
+    def test_pipeline_produces_result(self) -> None:
+        """Full pipeline with mocked ports produces an OptimizationResult."""
+        mock_llm = MagicMock(spec=LLMPort)
+        mock_llm.execute.return_value = "mock response"
+
+        mock_judge = MagicMock(spec=JudgePort)
+        # Initial eval (low), then alternating current/new evals per iteration
+        eval_sequence = [
+            _make_eval([0.3, 0.3, 0.3, 0.3, 0.3]),  # initial seed eval
+        ]
+        for _ in range(5):  # 5 iterations
+            eval_sequence.append(_make_eval([0.4, 0.4, 0.4, 0.4, 0.4]))  # current eval
+            eval_sequence.append(_make_eval([0.6, 0.6, 0.6, 0.6, 0.6]))  # new eval (accepted)
+        mock_judge.judge_batch.return_value = [(0.5, "ok")] * 5
+
+        mock_proposer = MagicMock(spec=ProposerPort)
+        mock_proposer.propose.return_value = Prompt(text="Improved prompt")
+
+        evaluator = PromptEvaluator(mock_llm, mock_judge)
+        evaluator.evaluate = MagicMock(side_effect=eval_sequence)
+
+        mock_gen = MagicMock()
+        mock_gen.generate_inputs.return_value = [
+            SyntheticExample(input_text=f"synth input {i}", id=i) for i in range(20)
+        ]
+        bootstrap = SyntheticBootstrap(generator=mock_gen, seed=42)
+
+        use_case = OptimizePromptUseCase(
+            evaluator=evaluator,
+            proposer=mock_proposer,
+            bootstrap=bootstrap,
+        )
+
+        config = OptimizationConfig(
+            seed_prompt="Answer questions.",
+            task_description="Answer questions accurately.",
+            max_iterations=5,
+            n_synthetic_inputs=20,
+            minibatch_size=5,
+            seed=42,
+        )
+
+        result = use_case.execute(config)
+
+        assert result.initial_prompt == "Answer questions."
+        assert result.optimized_prompt == "Improved prompt"
+        assert result.iterations_used == 5
+        assert result.total_llm_calls > 0
+        assert result.final_score > result.initial_score
--- a/tests/unit/__init__.py
+++ b/tests/unit/__init__.py
--- a/tests/unit/test_bootstrap.py
+++ b/tests/unit/test_bootstrap.py
@@ -0,0 +1,50 @@
+"""Unit tests for the bootstrap module."""
+from __future__ import annotations
+
+from unittest.mock import MagicMock
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.domain.entities import SyntheticExample
+from prometheus.domain.ports import SyntheticGeneratorPort
+
+
+class TestSyntheticBootstrap:
+    def test_run_returns_shuffled_examples(self) -> None:
+        mock_gen = MagicMock(spec=SyntheticGeneratorPort)
+        examples = [SyntheticExample(input_text=f"input {i}", id=i) for i in range(10)]
+        mock_gen.generate_inputs.return_value = examples
+
+        bootstrap = SyntheticBootstrap(generator=mock_gen, seed=42)
+        result = bootstrap.run("task desc", 10)
+
+        assert len(result) == 10
+        mock_gen.generate_inputs.assert_called_once_with("task desc", 10)
+
+    def test_sample_minibatch_returns_correct_size(self) -> None:
+        mock_gen = MagicMock(spec=SyntheticGeneratorPort)
+        pool = [SyntheticExample(input_text=f"input {i}", id=i) for i in range(20)]
+
+        bootstrap = SyntheticBootstrap(generator=mock_gen, seed=42)
+        batch = bootstrap.sample_minibatch(pool, 5)
+
+        assert len(batch) == 5
+        # All items should be from the pool
+        assert all(item in pool for item in batch)
+
+    def test_sample_minibatch_capped_at_pool_size(self) -> None:
+        mock_gen = MagicMock(spec=SyntheticGeneratorPort)
+        pool = [SyntheticExample(input_text=f"input {i}", id=i) for i in range(3)]
+
+        bootstrap = SyntheticBootstrap(generator=mock_gen, seed=42)
+        batch = bootstrap.sample_minibatch(pool, 10)
+
+        assert len(batch) == 3
+
+    def test_deterministic_with_same_seed(self) -> None:
+        mock_gen = MagicMock(spec=SyntheticGeneratorPort)
+        pool = [SyntheticExample(input_text=f"input {i}", id=i) for i in range(20)]
+
+        b1 = SyntheticBootstrap(generator=mock_gen, seed=42)
+        b2 = SyntheticBootstrap(generator=mock_gen, seed=42)
+
+        assert b1.sample_minibatch(pool, 5) == b2.sample_minibatch(pool, 5)
--- a/tests/unit/test_dspy_modules.py
+++ b/tests/unit/test_dspy_modules.py
@@ -0,0 +1,198 @@
+"""Unit tests for DSPy module parsing logic."""
+from __future__ import annotations
+
+import json
+from unittest.mock import MagicMock, patch
+
+import dspy
+import pytest
+
+from prometheus.infrastructure.dspy_modules import (
+    InstructionProposer,
+    OutputJudge,
+    SyntheticInputGenerator,
+)
+
+
+class TestSyntheticInputGeneratorParseFallback:
+    """Tests for _parse_fallback — regex-based JSON recovery."""
+
+    def test_extracts_quoted_strings(self) -> None:
+        text = 'Here are some: "first example" and "second example" done.'
+        result = SyntheticInputGenerator._parse_fallback(text)
+        assert result == ["first example", "second example"]
+
+    def test_single_quoted_string(self) -> None:
+        text = 'Just one: "hello world"'
+        result = SyntheticInputGenerator._parse_fallback(text)
+        assert result == ["hello world"]
+
+    def test_no_quotes_returns_raw_text(self) -> None:
+        text = "no quotes at all here"
+        result = SyntheticInputGenerator._parse_fallback(text)
+        assert result == ["no quotes at all here"]
+
+    def test_empty_string_returns_itself(self) -> None:
+        result = SyntheticInputGenerator._parse_fallback("")
+        assert result == [""]
+
+    def test_mixed_json_with_extra_text(self) -> None:
+        text = 'Results: "alpha", "beta", "gamma" — take your pick.'
+        result = SyntheticInputGenerator._parse_fallback(text)
+        assert result == ["alpha", "beta", "gamma"]
+
+
+class TestOutputJudgeForward:
+    """Tests for OutputJudge score parsing and clamping.
+
+    Mocks the internal ChainOfThought module to isolate parsing logic.
+    """
+
+    @pytest.fixture
+    def judge(self) -> OutputJudge:
+        return OutputJudge()
+
+    def test_valid_numeric_score(self, judge: OutputJudge) -> None:
+        judge.judge = MagicMock(
+            return_value=dspy.Prediction(score="0.8", feedback="Good output.")
+        )
+        result = judge.forward("task", "input", "output")
+
+        assert result.score == 0.8
+        assert result.feedback == "Good output."
+
+    def test_non_numeric_score_falls_back_to_half(
+        self, judge: OutputJudge
+    ) -> None:
+        judge.judge = MagicMock(
+            return_value=dspy.Prediction(
+                score="not-a-number", feedback="N/A"
+            )
+        )
+        result = judge.forward("task", "input", "output")
+
+        assert result.score == 0.5
+
+    def test_score_clamped_to_upper_bound(self, judge: OutputJudge) -> None:
+        judge.judge = MagicMock(
+            return_value=dspy.Prediction(score="1.5", feedback="Great!")
+        )
+        result = judge.forward("task", "input", "output")
+
+        assert result.score == 1.0
+
+    def test_score_clamped_to_lower_bound(self, judge: OutputJudge) -> None:
+        judge.judge = MagicMock(
+            return_value=dspy.Prediction(score="-0.3", feedback="Terrible.")
+        )
+        result = judge.forward("task", "input", "output")
+
+        assert result.score == 0.0
+
+    def test_empty_score_string_falls_back(self, judge: OutputJudge) -> None:
+        judge.judge = MagicMock(
+            return_value=dspy.Prediction(score="", feedback="No score.")
+        )
+        result = judge.forward("task", "input", "output")
+
+        assert result.score == 0.5
+
+    def test_boundary_score_one(self, judge: OutputJudge) -> None:
+        judge.judge = MagicMock(
+            return_value=dspy.Prediction(score="1.0", feedback="Perfect.")
+        )
+        result = judge.forward("task", "input", "output")
+
+        assert result.score == 1.0
+
+    def test_boundary_score_zero(self, judge: OutputJudge) -> None:
+        judge.judge = MagicMock(
+            return_value=dspy.Prediction(score="0.0", feedback="Wrong.")
+        )
+        result = judge.forward("task", "input", "output")
+
+        assert result.score == 0.0
+
+    def test_none_score_falls_back(self, judge: OutputJudge) -> None:
+        judge.judge = MagicMock(
+            return_value=dspy.Prediction(score=None, feedback="Missing.")
+        )
+        result = judge.forward("task", "input", "output")
+
+        assert result.score == 0.5
+
+
+class TestSyntheticInputGeneratorForward:
+    """Tests for SyntheticInputGenerator.forward JSON/fallback parsing.
+
+    Mocks the internal ChainOfThought module to isolate parsing logic.
+    """
+
+    @pytest.fixture
+    def generator(self) -> SyntheticInputGenerator:
+        return SyntheticInputGenerator()
+
+    def test_valid_json_parsed_correctly(
+        self, generator: SyntheticInputGenerator
+    ) -> None:
+        examples_json = json.dumps(["q1", "q2", "q3"])
+        generator.generate = MagicMock(
+            return_value=dspy.Prediction(examples=examples_json)
+        )
+        result = generator.forward("task desc", 3)
+
+        assert result.examples == ["q1", "q2", "q3"]
+
+    def test_malformed_json_triggers_fallback(
+        self, generator: SyntheticInputGenerator
+    ) -> None:
+        generator.generate = MagicMock(
+            return_value=dspy.Prediction(
+                examples='Here: "fallback item" and "another one"'
+            )
+        )
+        result = generator.forward("task desc", 2)
+
+        assert result.examples == ["fallback item", "another one"]
+
+    def test_empty_json_array(self, generator: SyntheticInputGenerator) -> None:
+        generator.generate = MagicMock(
+            return_value=dspy.Prediction(examples="[]")
+        )
+        result = generator.forward("task desc", 0)
+
+        assert result.examples == []
+
+
+class TestInstructionProposerForward:
+    """Tests for InstructionProposer.forward."""
+
+    @pytest.fixture
+    def proposer(self) -> InstructionProposer:
+        return InstructionProposer()
+
+    def test_returns_new_instruction(self, proposer: InstructionProposer) -> None:
+        proposer.propose = MagicMock(
+            return_value=dspy.Prediction(
+                new_instruction="Be concise and accurate."
+            )
+        )
+        result = proposer.forward(
+            "Be helpful.", "Answer questions.", "Failed: too verbose"
+        )
+
+        assert result.new_instruction == "Be concise and accurate."
+
+    def test_passes_correct_arguments(
+        self, proposer: InstructionProposer
+    ) -> None:
+        proposer.propose = MagicMock(
+            return_value=dspy.Prediction(new_instruction="improved")
+        )
+        proposer.forward("current", "task desc", "failures")
+
+        proposer.propose.assert_called_once_with(
+            current_instruction="current",
+            task_description="task desc",
+            failure_examples="failures",
+        )
--- a/tests/unit/test_entities.py
+++ b/tests/unit/test_entities.py
@@ -0,0 +1,99 @@
+"""Unit tests for domain entities."""
+from __future__ import annotations
+
+from prometheus.domain.entities import (
+    Candidate,
+    EvalResult,
+    OptimizationState,
+    Prompt,
+    SyntheticExample,
+    Trajectory,
+)
+
+
+class TestPrompt:
+    def test_prompt_text(self) -> None:
+        p = Prompt(text="Hello")
+        assert p.text == "Hello"
+
+    def test_prompt_len(self) -> None:
+        p = Prompt(text="Hello")
+        assert len(p) == 5
+
+    def test_prompt_frozen(self) -> None:
+        p = Prompt(text="Hello")
+        try:
+            p.text = "World"  # type: ignore[misc]
+            raise AssertionError("Should have raised FrozenInstanceError")
+        except AttributeError:
+            pass
+
+    def test_prompt_default_metadata(self) -> None:
+        p = Prompt(text="Hello")
+        assert p.metadata == {}
+
+    def test_prompt_custom_metadata(self) -> None:
+        p = Prompt(text="Hello", metadata={"key": "value"})
+        assert p.metadata["key"] == "value"
+
+
+class TestSyntheticExample:
+    def test_default_category(self) -> None:
+        ex = SyntheticExample(input_text="test")
+        assert ex.category == "default"
+
+    def test_default_id(self) -> None:
+        ex = SyntheticExample(input_text="test")
+        assert ex.id == 0
+
+
+class TestEvalResult:
+    def test_total_score(self) -> None:
+        result = EvalResult(
+            scores=[0.3, 0.5, 0.4],
+            feedbacks=["a", "b", "c"],
+            trajectories=[],
+        )
+        assert abs(result.total_score - 1.2) < 1e-9
+
+    def test_mean_score(self) -> None:
+        result = EvalResult(
+            scores=[0.3, 0.5, 0.4],
+            feedbacks=["a", "b", "c"],
+            trajectories=[],
+        )
+        assert abs(result.mean_score - 0.4) < 1e-9
+
+    def test_mean_score_empty(self) -> None:
+        result = EvalResult(scores=[], feedbacks=[], trajectories=[])
+        assert result.mean_score == 0.0
+
+
+class TestTrajectory:
+    def test_trajectory_fields(self) -> None:
+        t = Trajectory(
+            input_text="in",
+            output_text="out",
+            score=0.8,
+            feedback="good",
+            prompt_used="test",
+        )
+        assert t.input_text == "in"
+        assert t.score == 0.8
+
+
+class TestCandidate:
+    def test_candidate_defaults(self) -> None:
+        c = Candidate(prompt=Prompt(text="test"))
+        assert c.best_score == 0.0
+        assert c.generation == 0
+        assert c.parent_id is None
+
+
+class TestOptimizationState:
+    def test_default_state(self) -> None:
+        state = OptimizationState()
+        assert state.iteration == 0
+        assert state.best_candidate is None
+        assert state.candidates == []
+        assert state.total_llm_calls == 0
--- a/tests/unit/test_evaluator.py
+++ b/tests/unit/test_evaluator.py
@@ -0,0 +1,121 @@
+"""Unit tests for PromptEvaluator.evaluate()."""
+from __future__ import annotations
+
+from unittest.mock import MagicMock
+
+import pytest
+
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample
+from prometheus.domain.ports import JudgePort, LLMPort
+
+
+class TestPromptEvaluatorEvaluate:
+    """Tests for the evaluate() pipeline: execute → judge → trajectories."""
+
+    @pytest.fixture
+    def executor(self) -> MagicMock:
+        return MagicMock(spec=LLMPort)
+
+    @pytest.fixture
+    def judge(self) -> MagicMock:
+        return MagicMock(spec=JudgePort)
+
+    @pytest.fixture
+    def evaluator(self, executor: MagicMock, judge: MagicMock) -> PromptEvaluator:
+        return PromptEvaluator(executor=executor, judge=judge)
+
+    def test_happy_path_builds_correct_trajectories(
+        self,
+        evaluator: PromptEvaluator,
+        executor: MagicMock,
+        judge: MagicMock,
+    ) -> None:
+        prompt = Prompt(text="Answer the question.")
+        examples = [
+            SyntheticExample(input_text="What is 2+2?", id=0),
+            SyntheticExample(input_text="Capital of France?", id=1),
+        ]
+        executor.execute.side_effect = ["4", "Paris"]
+        judge.judge_batch.return_value = [
+            (0.9, "Correct."),
+            (0.8, "Mostly correct."),
+        ]
+
+        result = evaluator.evaluate(prompt, examples, "math and geography")
+
+        assert isinstance(result, EvalResult)
+        assert result.scores == [0.9, 0.8]
+        assert result.feedbacks == ["Correct.", "Mostly correct."]
+        assert len(result.trajectories) == 2
+        assert result.trajectories[0].input_text == "What is 2+2?"
+        assert result.trajectories[0].output_text == "4"
+        assert result.trajectories[0].score == 0.9
+        assert result.trajectories[0].feedback == "Correct."
+        assert result.trajectories[0].prompt_used == "Answer the question."
+        assert result.trajectories[1].prompt_used == "Answer the question."
+
+    def test_empty_minibatch_returns_empty_result(
+        self,
+        evaluator: PromptEvaluator,
+        executor: MagicMock,
+        judge: MagicMock,
+    ) -> None:
+        prompt = Prompt(text="test")
+        result = evaluator.evaluate(prompt, [], "task")
+
+        assert result.scores == []
+        assert result.feedbacks == []
+        assert result.trajectories == []
+        executor.execute.assert_not_called()
+        # judge_batch is called with empty pairs list
+        judge.judge_batch.assert_called_once_with("task", [])
+
+    def test_executor_called_with_correct_prompt(
+        self,
+        evaluator: PromptEvaluator,
+        executor: MagicMock,
+        judge: MagicMock,
+    ) -> None:
+        prompt = Prompt(text="Summarize this.")
+        examples = [SyntheticExample(input_text="Long text here", id=0)]
+        executor.execute.return_value = "Summary."
+        judge.judge_batch.return_value = [(0.7, "Good summary.")]
+
+        evaluator.evaluate(prompt, examples, "summarization")
+
+        executor.execute.assert_called_once_with(prompt, "Long text here")
+
+    def test_trajectories_prompt_used_matches_input_prompt(
+        self,
+        evaluator: PromptEvaluator,
+        executor: MagicMock,
+        judge: MagicMock,
+    ) -> None:
+        prompt = Prompt(text="Translate to French.")
+        examples = [SyntheticExample(input_text="Hello", id=0)]
+        executor.execute.return_value = "Bonjour"
+        judge.judge_batch.return_value = [(1.0, "Perfect.")]
+
+        result = evaluator.evaluate(prompt, examples, "translation")
+
+        assert result.trajectories[0].prompt_used == "Translate to French."
+
+    def test_scores_feedbacks_trajectories_lists_sized_correctly(
+        self,
+        evaluator: PromptEvaluator,
+        executor: MagicMock,
+        judge: MagicMock,
+    ) -> None:
+        prompt = Prompt(text="test prompt")
+        examples = [SyntheticExample(input_text=f"q{i}", id=i) for i in range(4)]
+        executor.execute.side_effect = [f"a{i}" for i in range(4)]
+        judge.judge_batch.return_value = [
+            (0.1 * i, f"fb{i}") for i in range(4)
+        ]
+
+        result = evaluator.evaluate(prompt, examples, "task")
+
+        assert len(result.scores) == 4
+        assert len(result.feedbacks) == 4
+        assert len(result.trajectories) == 4
--- a/tests/unit/test_evolution.py
+++ b/tests/unit/test_evolution.py
@@ -0,0 +1,147 @@
+"""Unit tests for the evolution loop — with full mocking."""
+from __future__ import annotations
+
+from unittest.mock import MagicMock, patch
+
+from prometheus.application.bootstrap import SyntheticBootstrap
+from prometheus.application.evaluator import PromptEvaluator
+from prometheus.application.evolution import EvolutionLoop
+from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory
+
+
+class TestEvolutionLoop:
+    def test_accepts_improvement(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: MagicMock,
+        mock_judge_port: MagicMock,
+        mock_proposer_port: MagicMock,
+    ) -> None:
+        """When the new prompt improves the score, the best candidate is updated."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        initial_eval = EvalResult(
+            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
+            feedbacks=["bad"] * 5,
+            trajectories=[
+                Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
+                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
+            ],
+        )
+        old_eval = EvalResult(
+            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
+            feedbacks=["bad"] * 5,
+            trajectories=[
+                Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
+                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
+            ],
+        )
+        new_eval = EvalResult(
+            scores=[0.8, 0.9, 0.7, 0.8, 0.9],
+            feedbacks=["good"] * 5,
+            trajectories=[],
+        )
+        evaluator.evaluate = MagicMock(side_effect=[initial_eval, old_eval, new_eval])
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=1,
+            minibatch_size=5,
+        )
+        with patch.object(loop, "_log"):
+            state = loop.run(seed_prompt, synthetic_pool, task_description)
+
+        assert state.best_candidate is not None
+        assert state.best_candidate.best_score > 0
+
+    def test_rejects_regression(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: MagicMock,
+        mock_judge_port: MagicMock,
+        mock_proposer_port: MagicMock,
+    ) -> None:
+        """When the new prompt degrades the score, the best candidate stays unchanged."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        initial_eval = EvalResult(
+            scores=[0.7, 0.8, 0.7, 0.8, 0.9],
+            feedbacks=["ok"] * 5,
+            trajectories=[
+                Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
+                for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
+            ],
+        )
+        old_eval = EvalResult(
+            scores=[0.7, 0.8, 0.7, 0.8, 0.9],
+            feedbacks=["ok"] * 5,
+            trajectories=[
+                Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
+                for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
+            ],
+        )
+        new_eval = EvalResult(
+            scores=[0.2, 0.1, 0.3, 0.2, 0.1],
+            feedbacks=["bad"] * 5,
+            trajectories=[],
+        )
+        evaluator.evaluate = MagicMock(side_effect=[initial_eval, old_eval, new_eval])
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=1,
+            minibatch_size=5,
+        )
+        with patch.object(loop, "_log"):
+            state = loop.run(seed_prompt, synthetic_pool, task_description)
+
+        assert state.best_candidate is not None
+        assert state.best_candidate.prompt.text == seed_prompt.text
+
+    def test_skips_perfect_scores(
+        self,
+        seed_prompt: Prompt,
+        synthetic_pool: list[SyntheticExample],
+        task_description: str,
+        mock_llm_port: MagicMock,
+        mock_judge_port: MagicMock,
+        mock_proposer_port: MagicMock,
+    ) -> None:
+        """When all scores are perfect, no proposal is made."""
+        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
+        bootstrap = MagicMock(spec=SyntheticBootstrap)
+        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
+
+        perfect_eval = EvalResult(
+            scores=[1.0, 1.0, 1.0, 1.0, 1.0],
+            feedbacks=["perfect"] * 5,
+            trajectories=[
+                Trajectory(f"input{i}", f"output{i}", 1.0, "perfect", "prompt")
+                for i in range(5)
+            ],
+        )
+        evaluator.evaluate = MagicMock(return_value=perfect_eval)
+
+        loop = EvolutionLoop(
+            evaluator=evaluator,
+            proposer=mock_proposer_port,
+            bootstrap=bootstrap,
+            max_iterations=3,
+            minibatch_size=5,
+        )
+        with patch.object(loop, "_log"):
+            loop.run(seed_prompt, synthetic_pool, task_description)
+
+        mock_proposer_port.propose.assert_not_called()
--- a/tests/unit/test_file_io.py
+++ b/tests/unit/test_file_io.py
@@ -0,0 +1,99 @@
+"""Unit tests for YamlPersistence file I/O."""
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+import yaml
+
+from prometheus.infrastructure.file_io import YamlPersistence
+
+
+class TestYamlPersistenceReadConfig:
+    """Tests for read_config YAML loading."""
+
+    def test_roundtrip_write_and_read(self, tmp_path: Path) -> None:
+        persistence = YamlPersistence()
+        data = {
+            "seed_prompt": "You are helpful.",
+            "task_description": "Answer questions.",
+            "max_iterations": 30,
+            "verbose": True,
+        }
+        config_file = tmp_path / "config.yaml"
+        with open(config_file, "w") as f:
+            yaml.dump(data, f)
+
+        result = persistence.read_config(str(config_file))
+
+        assert result == data
+
+    def test_reads_nested_yaml(self, tmp_path: Path) -> None:
+        persistence = YamlPersistence()
+        data = {
+            "model": {"name": "gpt-4o", "temperature": 0.7},
+            "params": [1, 2, 3],
+        }
+        config_file = tmp_path / "nested.yaml"
+        with open(config_file, "w") as f:
+            yaml.dump(data, f)
+
+        result = persistence.read_config(str(config_file))
+
+        assert result["model"]["name"] == "gpt-4o"
+        assert result["params"] == [1, 2, 3]
+
+    def test_missing_file_raises_error(self, tmp_path: Path) -> None:
+        persistence = YamlPersistence()
+        missing = tmp_path / "nonexistent.yaml"
+
+        with pytest.raises(FileNotFoundError):
+            persistence.read_config(str(missing))
+
+    def test_malformed_yaml_raises_error(self, tmp_path: Path) -> None:
+        persistence = YamlPersistence()
+        bad_file = tmp_path / "bad.yaml"
+        bad_file.write_text(": [invalid: {yaml", encoding="utf-8")
+
+        with pytest.raises(yaml.YAMLError):
+            persistence.read_config(str(bad_file))
+
+
+class TestYamlPersistenceWriteResult:
+    """Tests for write_result YAML output."""
+
+    def test_roundtrip_write_result(self, tmp_path: Path) -> None:
+        persistence = YamlPersistence()
+        data = {
+            "optimized_prompt": "Improved prompt.",
+            "initial_score": 0.4,
+            "final_score": 0.85,
+        }
+        output_file = tmp_path / "result.yaml"
+        persistence.write_result(str(output_file), data)
+
+        with open(output_file) as f:
+            loaded = yaml.safe_load(f)
+
+        assert loaded == data
+
+    def test_write_result_creates_valid_yaml(self, tmp_path: Path) -> None:
+        persistence = YamlPersistence()
+        data = {"key": "value", "number": 42}
+        output_file = tmp_path / "out.yaml"
+        persistence.write_result(str(output_file), data)
+
+        content = output_file.read_text()
+        assert "key: value" in content
+        assert "number: 42" in content
+
+    def test_write_result_handles_unicode(self, tmp_path: Path) -> None:
+        persistence = YamlPersistence()
+        data = {"prompt": "Répondez en français. 中文测试"}
+        output_file = tmp_path / "unicode.yaml"
+        persistence.write_result(str(output_file), data)
+
+        with open(output_file, encoding="utf-8") as f:
+            loaded = yaml.safe_load(f)
+
+        assert loaded["prompt"] == "Répondez en français. 中文测试"
--- a/tests/unit/test_scoring.py
+++ b/tests/unit/test_scoring.py
@@ -0,0 +1,54 @@
+"""Unit tests for scoring logic."""
+from __future__ import annotations
+
+from prometheus.domain.entities import EvalResult, Trajectory
+from prometheus.domain.scoring import normalize_score, should_accept
+
+
+def _make_eval(scores: list[float]) -> EvalResult:
+    return EvalResult(
+        scores=scores,
+        feedbacks=[""] * len(scores),
+        trajectories=[
+            Trajectory(f"in{i}", f"out{i}", s, "", "p")
+            for i, s in enumerate(scores)
+        ],
+    )
+
+
+class TestShouldAccept:
+    def test_accepts_improvement(self) -> None:
+        old = _make_eval([0.3, 0.4])
+        new = _make_eval([0.8, 0.9])
+        assert should_accept(old, new) is True
+
+    def test_rejects_regression(self) -> None:
+        old = _make_eval([0.8, 0.9])
+        new = _make_eval([0.3, 0.4])
+        assert should_accept(old, new) is False
+
+    def test_rejects_equal(self) -> None:
+        old = _make_eval([0.5, 0.5])
+        new = _make_eval([0.5, 0.5])
+        assert should_accept(old, new) is False
+
+    def test_min_improvement_threshold(self) -> None:
+        old = _make_eval([0.5])
+        new = _make_eval([0.6])
+        assert should_accept(old, new, min_improvement=0.2) is False
+        assert should_accept(old, new, min_improvement=0.05) is True
+
+
+class TestNormalizeScore:
+    def test_clamps_high(self) -> None:
+        assert normalize_score(1.5) == 1.0
+
+    def test_clamps_low(self) -> None:
+        assert normalize_score(-0.5) == 0.0
+
+    def test_passes_within_range(self) -> None:
+        assert normalize_score(0.7) == 0.7
+
+    def test_custom_range(self) -> None:
+        assert normalize_score(15.0, min_val=0.0, max_val=10.0) == 10.0
+        assert normalize_score(-5.0, min_val=0.0, max_val=10.0) == 0.0
--- a/uv.lock
+++ b/uv.lock