Prompt-optimizer/docs/technical-spec.md
Gartoid 837a44970f Initial commit: PROMETHEUS v0.1.0 - Prompt optimizer
- Clean architecture (domain/application/infrastructure)
- DSPy-based evolution engine with scoring
- CLI via pyproject.toml entry point
- Unit + integration tests (~300 tests)
- Configs for glm-5.1 and glm-4.5-air models
- Z.AI endpoint integration
2026-03-29 11:44:03 +00:00


PROMETHEUS MVP: Detailed Technical Specification

Version: 0.1.0
Stack: Python 3.12+ · uv · DSPy · Typer
Architecture: Clean Architecture (hexagonal)
Date: 2025

Table of Contents

  1. Overview & Objectives
  2. Project Structure
  3. Domain Layer
  4. Application Layer
  5. Infrastructure Layer
  6. Presentation Layer (CLI)
  7. Core Algorithm: Detailed Pseudo-Code
  8. I/O File Formats
  9. Configuration & Environment
  10. Tests
  11. Complete Architecture Diagrams

1. Overview & Objectives

1.1 Problem Statement

Prompt optimization frameworks (GEPA, TextGrad, Promptolution) all require a labeled dataset to compute a quality signal. PROMETHEUS removes this dependency by synthesizing its own test data and using an LLM-as-Judge as the evaluation function.

1.2 MVP Objectives

| # | Objective | Acceptance criterion |
|---|-----------|----------------------|
| O1 | Optimize a prompt without any labeled data | Seed prompt → improved prompt, 0 data files required |
| O2 | Simple CLI | `prometheus optimize -i config.yaml -o result.yaml` |
| O3 | Controlled budget | < 500 LLM calls for a complete run |
| O4 | Reproducible | Deterministic seed; identical results given the same seed and the same model |
| O5 | Observable | Structured logging, per-iteration metrics |
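
Objective O3 can be sanity-checked with a rough count. The sketch below is an estimate under stated assumptions, not part of the specified code: `estimate_llm_calls` and its `judge_per_pair` flag are hypothetical, every iteration is assumed to run the full evaluate, propose, evaluate cycle (no perfect-score skips), and the bootstrap is assumed to cost one call.

```python
def estimate_llm_calls(
    max_iterations: int = 30,
    minibatch_size: int = 5,
    judge_per_pair: bool = False,
) -> int:
    """Worst-case LLM calls for one run (hypothetical helper, not in the spec)."""
    judge_calls = minibatch_size if judge_per_pair else 1
    per_eval = minibatch_size + judge_calls    # executions + judging
    bootstrap = 1                              # one synthetic-input generation call
    seed_eval = per_eval                       # initial evaluation of the seed prompt
    per_iteration = per_eval + 1 + per_eval    # evaluate current, propose, evaluate new
    return bootstrap + seed_eval + max_iterations * per_iteration


batch_level = estimate_llm_calls()                     # judge counted once per minibatch
pair_level = estimate_llm_calls(judge_per_pair=True)   # judge invoked once per pair
print(batch_level, pair_level)
```

With the default parameters (30 iterations, minibatches of 5) this gives 397 calls when the judge is counted once per minibatch and 641 when it runs once per (input, output) pair, so whether the run stays under 500 depends on how judge calls are issued and counted.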

1.3 Nominal Flow

┌──────────────┐     ┌───────────────┐     ┌──────────────────┐     ┌────────────┐
│  Fichier     │     │               │     │                  │     │  Fichier   │
│  config.yaml ├───► │  Bootstrap    ├───► │  Evolution Loop  ├───► │  output    │
│  (seed prompt│     │  (synth inputs│     │  (judge + mutate │     │  (optimized│
│   + params)  │     │   generation) │     │   + accept)      │     │   prompt)  │
└──────────────┘     └───────────────┘     └──────────────────┘     └────────────┘

2. Project Structure

prometheus/
├── pyproject.toml                    # uv project config
├── README.md
├── specs/
│   └── technical-spec.md             # this file
│
├── src/
│   └── prometheus/
│       ├── __init__.py
│       ├── cli/                      # PRESENTATION LAYER
│       │   ├── __init__.py
│       │   └── app.py                # Typer CLI app
│       │
│       ├── domain/                   # DOMAIN LAYER (zero dependencies)
│       │   ├── __init__.py
│       │   ├── entities.py           # Dataclasses: Prompt, Candidate, EvalResult, SyntheticExample
│       │   ├── ports.py              # Abstract interfaces (ABC port classes)
│       │   └── scoring.py            # Score combination logic, acceptance criteria
│       │
│       ├── application/              # APPLICATION LAYER (depends on domain only)
│       │   ├── __init__.py
│       │   ├── use_cases.py          # OptimizePromptUseCase
│       │   ├── bootstrap.py          # SyntheticBootstrap
│       │   ├── evolution.py          # EvolutionLoop (reflective mutation loop)
│       │   ├── evaluator.py          # PromptEvaluator (execution + judge)
│       │   └── dto.py                # Config & Result dataclasses
│       │
│       ├── infrastructure/           # INFRASTRUCTURE LAYER (depends on domain + application)
│       │   ├── __init__.py
│       │   ├── dspy_signatures.py    # DSPy Signature definitions
│       │   ├── dspy_modules.py       # DSPy Module implementations
│       │   ├── llm_adapter.py        # DSPyLLMAdapter (implements domain port)
│       │   ├── judge_adapter.py      # DSPyJudgeAdapter (implements domain port)
│       │   ├── proposer_adapter.py   # DSPyProposerAdapter (implements domain port)
│       │   ├── synth_adapter.py      # DSPySyntheticAdapter (implements domain port)
│       │   └── file_io.py            # FileReader, FileWriter
│       │
│       └── config.py                 # Settings (pydantic-settings)
│
├── tests/
│   ├── unit/
│   │   ├── test_entities.py
│   │   ├── test_scoring.py
│   │   ├── test_evolution.py
│   │   └── test_bootstrap.py
│   ├── integration/
│   │   ├── test_dspy_adapters.py
│   │   └── test_full_pipeline.py
│   └── conftest.py
│
└── examples/
    ├── basic_usage.py
    └── sample_config.yaml

2.1 pyproject.toml

[project]
name = "prometheus"
version = "0.1.0"
description = "Prompt evolution without reference data"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "dspy>=2.6",
    "typer>=0.15",
    "pydantic>=2.10",
    "pydantic-settings>=2.7",
    "pyyaml>=6.0",
    "rich>=13.9",
]
[project.optional-dependencies]
dev = [
    "pytest>=8.3",
    "pytest-cov>=6.0",
    "ruff>=0.9",
    "mypy>=1.14",
]
[project.scripts]
prometheus = "prometheus.cli.app:app"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.ruff]
line-length = 100
target-version = "py312"
[tool.mypy]
python_version = "3.12"
strict = true

3. Domain Layer: Entities & Ports

Objective

Define the business core with no external dependencies. No imports of dspy, pydantic, or anything else outside the stdlib.

3.1 entities.py

"""Domain entities — pure data, zero dependencies."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any
@dataclass(frozen=True)
class Prompt:
    """
    Represents a candidate prompt.
    frozen=True → attributes cannot be reassigned, which suits Pareto tracking.
    Caveat: the metadata dict stays mutable and makes instances unhashable.
    """
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)
    def __len__(self) -> int:
        return len(self.text)
@dataclass(frozen=True)
class SyntheticExample:
    """
    A synthetic example: an input generated from the task description.
    No expected output; the judge evaluates the response directly.
    """
    input_text: str
    category: str = "default"  # for future stratified sampling
    id: int = 0
@dataclass
class Trajectory:
    """
    Execution trace of a prompt on one input.
    Used by the reflective mutation to understand failures.
    """
    input_text: str
    output_text: str
    score: float
    feedback: str  # textual feedback from the judge
    prompt_used: str
@dataclass
class EvalResult:
    """Result of an evaluation on a minibatch."""
    scores: list[float]
    feedbacks: list[str]
    trajectories: list[Trajectory]
    @property
    def total_score(self) -> float:
        return sum(self.scores)
    @property
    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
@dataclass
class Candidate:
    """
    A candidate in the evolution pool.
    Contains the prompt plus its cumulative score.
    """
    prompt: Prompt
    best_score: float = 0.0
    generation: int = 0  # iteration at which it was created
    parent_id: int | None = None
@dataclass
class OptimizationState:
    """Complete optimization state: a serializable snapshot."""
    iteration: int = 0
    best_candidate: Candidate | None = None
    candidates: list[Candidate] = field(default_factory=list)
    synthetic_pool: list[SyntheticExample] = field(default_factory=list)
    history: list[dict[str, Any]] = field(default_factory=list)
    total_llm_calls: int = 0
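
One consequence of the `Prompt` definition is worth illustrating. `frozen=True` blocks attribute reassignment, but the generated `__hash__` hashes every compared field, so a `dict` field makes instances unusable as set members or dict keys. The sketch below re-declares `Prompt` so it runs on its own; `HashablePrompt` is a hypothetical variant (not part of the spec) that excludes `metadata` from equality and hashing.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any


@dataclass(frozen=True)
class Prompt:
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class HashablePrompt:
    """Hypothetical variant: metadata is ignored by __eq__ and __hash__."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict, compare=False)


p = Prompt("Summarize the text.")

reassign_blocked = False
try:
    p.text = "other"          # frozen: reassignment is rejected
except FrozenInstanceError:
    reassign_blocked = True

try:
    hash(p)                   # fails because the dict field is unhashable
    prompt_hashable = True
except TypeError:
    prompt_hashable = False

# Two prompts with the same text collapse into one set entry.
seen = {HashablePrompt("a"), HashablePrompt("a", {"gen": 1})}
print(reassign_blocked, prompt_hashable, len(seen))   # True False 1
```

Whether two prompts with identical text but different metadata should count as the same candidate is a design decision the spec leaves open; the variant above assumes they should.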

3.2 ports.py

"""
Domain ports: abstract interfaces implemented by the infrastructure.
Declared as abstract base classes (ABC); adapters subclass them explicitly.
"""
from __future__ import annotations
from abc import ABC, abstractmethod
from prometheus.domain.entities import (
    Prompt, SyntheticExample, Trajectory, EvalResult
)
class LLMPort(ABC):
    """
    Port for executing a prompt on an input.
    The infrastructure provides an implementation via DSPy.
    """
    @abstractmethod
    def execute(self, prompt: Prompt, input_text: str) -> str:
        """Run the prompt on the input and return the raw response."""
        ...
class JudgePort(ABC):
    """
    LLM-as-Judge evaluation port.
    Takes (input, output) pairs plus the task description.
    Returns a score and textual feedback for each pair.
    """
    @abstractmethod
    def judge_batch(
        self,
        task_description: str,
        pairs: list[tuple[str, str]],
    ) -> list[tuple[float, str]]:
        """
        Evaluate a batch of (input, output) pairs.
        Returns a list of (score, feedback) tuples.
        """
        ...
class ProposerPort(ABC):
    """
    Port for proposing a new prompt.
    Uses the evaluation trajectories to propose an improvement.
    """
    @abstractmethod
    def propose(
        self,
        current_prompt: Prompt,
        trajectories: list[Trajectory],
        task_description: str,
    ) -> Prompt:
        """Propose a new prompt based on the failure trajectories."""
        ...
class SyntheticGeneratorPort(ABC):
    """
    Port for generating synthetic inputs.
    """
    @abstractmethod
    def generate_inputs(
        self,
        task_description: str,
        n_examples: int,
    ) -> list[SyntheticExample]:
        """Generate N diverse synthetic inputs."""
        ...
class PersistencePort(ABC):
    """Port for reading and writing files."""
    @abstractmethod
    def read_config(self, path: str) -> dict:
        ...
    @abstractmethod
    def write_result(self, path: str, data: dict) -> None:
        ...
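
Because the ports are abstract base classes, an adapter that forgets a method fails when it is instantiated rather than when it is first called. A minimal sketch follows, re-declaring `JudgePort` locally so the block runs on its own; `KeywordJudge` and `BrokenJudge` are hypothetical test doubles, not part of the project.

```python
from abc import ABC, abstractmethod


class JudgePort(ABC):
    @abstractmethod
    def judge_batch(
        self, task_description: str, pairs: list[tuple[str, str]]
    ) -> list[tuple[float, str]]: ...


class KeywordJudge(JudgePort):
    """Deterministic stand-in for an LLM judge, useful in unit tests."""

    def __init__(self, keyword: str) -> None:
        self._keyword = keyword

    def judge_batch(
        self, task_description: str, pairs: list[tuple[str, str]]
    ) -> list[tuple[float, str]]:
        results = []
        for _input_text, output_text in pairs:
            ok = self._keyword in output_text
            results.append((1.0 if ok else 0.0, "ok" if ok else f"missing '{self._keyword}'"))
        return results


class BrokenJudge(JudgePort):
    """Forgets to implement judge_batch."""


judge = KeywordJudge("Paris")
verdicts = judge.judge_batch(
    "Answer geography questions",
    [("Capital of France?", "Paris"), ("Capital of Italy?", "Milan")],
)
print(verdicts)   # [(1.0, 'ok'), (0.0, "missing 'Paris'")]

try:
    BrokenJudge()
    instantiation_rejected = False
except TypeError:
    instantiation_rejected = True   # abstract method left unimplemented
print(instantiation_rejected)       # True
```

A stub like `KeywordJudge` lets the application layer be tested without any LLM call, which is the main practical benefit of the port design.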

3.3 scoring.py

"""Scoring logic and acceptance criteria (pure domain)."""
from prometheus.domain.entities import EvalResult
def should_accept(
    old_result: EvalResult,
    new_result: EvalResult,
    min_improvement: float = 0.0,
) -> bool:
    """
    Strict acceptance criterion.
    The new candidate's total score must exceed the old total by more than
    min_improvement; a tie is rejected.
    """
    return new_result.total_score > old_result.total_score + min_improvement
def normalize_score(raw: float, min_val: float = 0.0, max_val: float = 1.0) -> float:
    """Clamp a score to [min_val, max_val]."""
    return max(min_val, min(max_val, raw))
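
A short worked example of the acceptance rule may help. The sketch below uses a simplified stand-in, `accept_totals`, that takes plain totals instead of `EvalResult` objects (an assumption made only to keep the block self-contained); the score values are invented for illustration.

```python
def accept_totals(old_total: float, new_total: float, min_improvement: float = 0.0) -> bool:
    """Same comparison as should_accept, applied to plain totals."""
    return new_total > old_total + min_improvement


old_scores = [0.6, 0.8, 0.4, 1.0, 0.7]   # current prompt on a 5-item minibatch
new_scores = [0.7, 0.8, 0.5, 1.0, 0.7]   # candidate prompt on the same minibatch
old_total, new_total = sum(old_scores), sum(new_scores)

accepted_default = accept_totals(old_total, new_total)                       # gain of about 0.2
accepted_margin = accept_totals(old_total, new_total, min_improvement=0.5)   # gain below margin
accepted_tie = accept_totals(old_total, old_total)                           # no gain at all
print(accepted_default, accepted_margin, accepted_tie)   # True False False
```

Since judge scores are noisy, a small positive `min_improvement` is one way to avoid accepting candidates on marginal gains; the spec's default of 0.0 accepts any strict improvement.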

4. Application Layer: Use Cases

Objective

Orchestrate the business logic using only the domain ports. This layer never depends on the concrete infrastructure.

4.1 dto.py

"""Data Transfer Objects: configuration and results."""
from dataclasses import dataclass, field
@dataclass
class OptimizationConfig:
    """Complete configuration of a PROMETHEUS run."""
    # --- Prompt ---
    seed_prompt: str
    task_description: str
    # --- Models ---
    task_model: str = "openai/gpt-4o-mini"
    judge_model: str = "openai/gpt-4o"
    proposer_model: str = "openai/gpt-4o"
    synth_model: str = "openai/gpt-4o"
    # --- Evolution parameters ---
    max_iterations: int = 30
    n_synthetic_inputs: int = 20
    minibatch_size: int = 5
    perfect_score: float = 1.0
    # --- Reproducibility ---
    seed: int = 42
    # --- Output ---
    output_path: str = "output.yaml"
    verbose: bool = False
@dataclass
class OptimizationResult:
    """Result of a complete optimization."""
    optimized_prompt: str
    initial_prompt: str
    iterations_used: int
    total_llm_calls: int
    initial_score: float
    final_score: float
    improvement: float
    history: list[dict] = field(default_factory=list)

4.2 bootstrap.py

"""
Bootstrap: synthetic input generation.
Objective: build a pool of test inputs from the task description.
This pool replaces the labeled dataset.
"""
from __future__ import annotations
import random
from prometheus.domain.ports import SyntheticGeneratorPort
from prometheus.domain.entities import SyntheticExample
class SyntheticBootstrap:
    """
    Orchestrates synthetic input generation.
    Depends only on the abstract port, not on DSPy directly.
    """
    def __init__(self, generator: SyntheticGeneratorPort, seed: int = 42):
        self._generator = generator
        self._rng = random.Random(seed)
    def run(self, task_description: str, n_examples: int) -> list[SyntheticExample]:
        """
        Generate the synthetic pool in a single call.
        Why a single call?
        - Minimizes LLM cost (1 call instead of N)
        - The LLM can enforce diversity within one generation
        - Batching in one prompt allows better coverage
        """
        examples = self._generator.generate_inputs(task_description, n_examples)
        # Shuffle so the pool order does not depend on generation order
        self._rng.shuffle(examples)
        return examples
    def sample_minibatch(
        self,
        pool: list[SyntheticExample],
        size: int,
    ) -> list[SyntheticExample]:
        """Sample a minibatch from the synthetic pool."""
        size = min(size, len(pool))
        return self._rng.sample(pool, size)
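
The private `random.Random(seed)` instance is what makes minibatch sampling repeatable. A minimal sketch of that property follows, using a hypothetical `sample_minibatches` function that mimics the shuffle-then-sample behavior on plain strings; it covers sampling only, since LLM outputs are a separate source of nondeterminism.

```python
import random


def sample_minibatches(pool: list[str], size: int, rounds: int, seed: int) -> list[list[str]]:
    """Shuffle the pool once, then draw `rounds` minibatches from a private RNG."""
    rng = random.Random(seed)     # private RNG, independent of the global random state
    shuffled = list(pool)         # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    return [rng.sample(shuffled, min(size, len(shuffled))) for _ in range(rounds)]


pool = [f"input-{i}" for i in range(20)]
run_a = sample_minibatches(pool, size=5, rounds=3, seed=42)

random.seed(0)
random.random()                   # unrelated use of the global RNG does not interfere

run_b = sample_minibatches(pool, size=5, rounds=3, seed=42)
print(run_a == run_b)             # True: same seed, same minibatches
```

Using a dedicated `Random` instance rather than `random.seed()` keeps the sequence stable even if other libraries draw from the global generator.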

4.3 evaluator.py

"""
Evaluator: execution + judgment.
Objective: produce a quality signal without ground truth.
Combines running the candidate prompt with evaluation by an LLM-as-Judge.
"""
from __future__ import annotations
from prometheus.domain.entities import (
    Prompt, SyntheticExample, Trajectory, EvalResult
)
from prometheus.domain.ports import LLMPort, JudgePort
class PromptEvaluator:
    """
    Evaluates a prompt on a minibatch of synthetic inputs.
    Pipeline: execute → judge → build trajectories.
    This component replaces GEPA's EvaluatorFn.
    Instead of comparing against a ground truth, it uses an LLM-as-Judge.
    """
    def __init__(self, executor: LLMPort, judge: JudgePort):
        self._executor = executor
        self._judge = judge
    def evaluate(
        self,
        prompt: Prompt,
        minibatch: list[SyntheticExample],
        task_description: str,
    ) -> EvalResult:
        """
        Evaluate the prompt on the minibatch.
        Steps:
        1. Run the prompt on each input in the minibatch
        2. Judge each (input, output) pair
        3. Build the trajectories with the feedback
        Returns an EvalResult with scores, feedbacks, and trajectories.
        """
        # ── Step 1: Execution ──
        outputs: list[str] = []
        for example in minibatch:
            raw_output = self._executor.execute(prompt, example.input_text)
            outputs.append(raw_output)
        # ── Step 2: Judgment ──
        pairs = [(ex.input_text, out) for ex, out in zip(minibatch, outputs)]
        judge_results = self._judge.judge_batch(task_description, pairs)
        # ── Step 3: Build trajectories ──
        scores: list[float] = []
        feedbacks: list[str] = []
        trajectories: list[Trajectory] = []
        # strict=True fails fast if the judge returns the wrong number of results
        for example, output, (score, feedback) in zip(
            minibatch, outputs, judge_results, strict=True
        ):
            scores.append(score)
            feedbacks.append(feedback)
            trajectories.append(Trajectory(
                input_text=example.input_text,
                output_text=output,
                score=score,
                feedback=feedback,
                prompt_used=prompt.text,
            ))
        return EvalResult(
            scores=scores,
            feedbacks=feedbacks,
            trajectories=trajectories,
        )

4.4 evolution.py

"""
Evolution loop: the core of the PROMETHEUS engine.
Objective: orchestrate the select → evaluate → propose → accept cycle.
It is the counterpart of GEPAEngine.run(), adapted to work without a valset.
"""
from __future__ import annotations
from prometheus.domain.entities import (
    Prompt, Candidate, EvalResult, OptimizationState, SyntheticExample
)
from prometheus.domain.ports import ProposerPort
from prometheus.domain.scoring import should_accept
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.bootstrap import SyntheticBootstrap
class EvolutionLoop:
    """
    Main evolution loop.
    Design:
    - Keeps only the best candidate (no full population)
    - This is much simpler than GEPA (no Pareto front, no merge)
    - If the MVP works, a population will be added in v2
    """
    def __init__(
        self,
        evaluator: PromptEvaluator,
        proposer: ProposerPort,
        bootstrap: SyntheticBootstrap,
        max_iterations: int = 30,
        minibatch_size: int = 5,
        perfect_score: float = 1.0,
        verbose: bool = False,
    ):
        self._evaluator = evaluator
        self._proposer = proposer
        self._bootstrap = bootstrap
        self._max_iterations = max_iterations
        self._minibatch_size = minibatch_size
        self._perfect_score = perfect_score
        self._verbose = verbose
    def run(
        self,
        seed_prompt: Prompt,
        synthetic_pool: list[SyntheticExample],
        task_description: str,
    ) -> OptimizationState:
        """
        Run the complete evolution loop.
        Pseudo-code:
        ```
        state.best = Candidate(seed_prompt)
        state.best.score = evaluate(seed_prompt)
        for i in range(max_iterations):
            batch = sample_minibatch(pool)
            old_eval = evaluate(state.best.prompt, batch)
            if all perfect: continue
            new_prompt = propose(state.best.prompt, old_eval.trajectories)
            new_eval = evaluate(new_prompt, batch)
            if new_eval > old_eval:
                state.best = Candidate(new_prompt, score=new_eval)
        return state
        ```
        """
        state = OptimizationState()
        # ── Evaluate the seed ──
        initial_batch = self._bootstrap.sample_minibatch(
            synthetic_pool, self._minibatch_size
        )
        initial_eval = self._evaluator.evaluate(
            seed_prompt, initial_batch, task_description
        )
        # Counted as N executions + 1 judge_batch call. An adapter that judges
        # pair by pair issues N judge calls, so the real total can be higher.
        state.total_llm_calls += len(initial_batch) + 1
        best_candidate = Candidate(
            prompt=seed_prompt,
            best_score=initial_eval.total_score,
            generation=0,
        )
        state.best_candidate = best_candidate
        state.candidates.append(best_candidate)
        self._log(f"Initial score: {initial_eval.total_score:.2f}")
        # ── Main loop ──
        for i in range(1, self._max_iterations + 1):
            state.iteration = i
            # 1. Sample a fresh minibatch
            batch = self._bootstrap.sample_minibatch(
                synthetic_pool, self._minibatch_size
            )
            # 2. Evaluate the current candidate
            current_eval = self._evaluator.evaluate(
                best_candidate.prompt, batch, task_description
            )
            state.total_llm_calls += len(batch) + 1
            # 3. Skip if perfect
            if all(s >= self._perfect_score for s in current_eval.scores):
                self._log(f"Iter {i}: All scores perfect, skipping.")
                state.history.append({
                    "iteration": i,
                    "event": "skip_perfect",
                    "current_score": current_eval.total_score,
                })
                continue
            # 4. Propose a new prompt (reflective mutation)
            new_prompt = self._proposer.propose(
                best_candidate.prompt,
                current_eval.trajectories,
                task_description,
            )
            state.total_llm_calls += 1  # one proposal call
            # 5. Evaluate the new prompt on the same minibatch
            new_eval = self._evaluator.evaluate(
                new_prompt, batch, task_description
            )
            state.total_llm_calls += len(batch) + 1
            # 6. Accept or reject
            if should_accept(current_eval, new_eval):
                best_candidate = Candidate(
                    prompt=new_prompt,
                    best_score=new_eval.total_score,
                    generation=i,
                    parent_id=id(best_candidate),
                )
                state.best_candidate = best_candidate
                state.candidates.append(best_candidate)
                self._log(
                    f"Iter {i}: ACCEPTED "
                    f"({current_eval.total_score:.2f} -> {new_eval.total_score:.2f})"
                )
                state.history.append({
                    "iteration": i,
                    "event": "accepted",
                    "old_score": current_eval.total_score,
                    "new_score": new_eval.total_score,
                    "improvement": new_eval.total_score - current_eval.total_score,
                })
            else:
                self._log(
                    f"Iter {i}: REJECTED "
                    f"({new_eval.total_score:.2f} <= {current_eval.total_score:.2f})"
                )
                state.history.append({
                    "iteration": i,
                    "event": "rejected",
                    "old_score": current_eval.total_score,
                    "new_score": new_eval.total_score,
                })
        return state
    def _log(self, msg: str) -> None:
        if self._verbose:
            print(f"[PROMETHEUS] {msg}")
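
The accept/reject dynamics of the loop can be traced on a toy problem. The sketch below is a condensed illustration, not the `EvolutionLoop` class itself: `evolve`, `toy_evaluate`, `toy_propose`, and `REQUIRED_RULES` are all hypothetical, and the "judge" simply scores a prompt by the share of required rules it mentions.

```python
import random

REQUIRED_RULES = ["be concise", "cite sources", "use bullet points"]   # toy task definition


def toy_evaluate(prompt: str, batch: list[str]) -> list[float]:
    """Stand-in for execute + judge: score = share of required rules in the prompt."""
    score = sum(rule in prompt for rule in REQUIRED_RULES) / len(REQUIRED_RULES)
    return [score for _ in batch]


def toy_propose(prompt: str) -> str:
    """Stand-in for the reflective proposer: append the first missing rule."""
    missing = [rule for rule in REQUIRED_RULES if rule not in prompt]
    return f"{prompt} Always {missing[0]}." if missing else prompt


def evolve(seed_prompt, pool, evaluate, propose, max_iterations=5, minibatch_size=3, seed=42):
    rng = random.Random(seed)
    best, history = seed_prompt, []
    for i in range(1, max_iterations + 1):
        batch = rng.sample(pool, min(minibatch_size, len(pool)))
        current = evaluate(best, batch)
        if all(s >= 1.0 for s in current):
            history.append((i, "skip_perfect"))
            continue
        candidate = propose(best)
        new = evaluate(candidate, batch)     # same minibatch as `current`
        if sum(new) > sum(current):          # strict improvement only
            best = candidate
            history.append((i, "accepted"))
        else:
            history.append((i, "rejected"))
    return best, history


pool = [f"question-{i}" for i in range(10)]
best, history = evolve("Answer the question.", pool, toy_evaluate, toy_propose)
print(best)
print([event for _, event in history])
# ['accepted', 'accepted', 'accepted', 'skip_perfect', 'skip_perfect']
```

Evaluating the current and candidate prompts on the same minibatch keeps the comparison fair within an iteration, at the cost of re-evaluating the incumbent every round.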

4.5 use_cases.py

"""
Main use case: high-level orchestration.
Objective: business entry point. Coordinates bootstrap → evolution → result.
Contains no technical logic, only orchestration.
"""
from __future__ import annotations
from prometheus.domain.entities import Prompt
from prometheus.domain.ports import ProposerPort
from prometheus.application.dto import OptimizationConfig, OptimizationResult
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
class OptimizePromptUseCase:
    """
    The single use case of the MVP.
    Dependencies are injected through the constructor (dependency injection).
    """
    def __init__(
        self,
        evaluator: PromptEvaluator,
        proposer: ProposerPort,
        bootstrap: SyntheticBootstrap,
    ):
        self._evaluator = evaluator
        self._proposer = proposer
        self._bootstrap = bootstrap
    def execute(self, config: OptimizationConfig) -> OptimizationResult:
        """
        Full pipeline:
        1. Bootstrap → generate the synthetic inputs
        2. Evolution → optimization loop
        3. Return the result
        """
        # ── Phase 0: Bootstrap ──
        synthetic_pool = self._bootstrap.run(
            task_description=config.task_description,
            n_examples=config.n_synthetic_inputs,
        )
        # ── Phase 1: Evolution ──
        loop = EvolutionLoop(
            evaluator=self._evaluator,
            proposer=self._proposer,
            bootstrap=self._bootstrap,
            max_iterations=config.max_iterations,
            minibatch_size=config.minibatch_size,
            perfect_score=config.perfect_score,
            verbose=config.verbose,
        )
        seed_prompt = Prompt(text=config.seed_prompt)
        state = loop.run(seed_prompt, synthetic_pool, config.task_description)
        # ── Phase 2: Result ──
        # The seed is always the first candidate, so its score is the initial score.
        initial_score = state.candidates[0].best_score if state.candidates else 0.0
        final_score = state.best_candidate.best_score if state.best_candidate else 0.0
        return OptimizationResult(
            optimized_prompt=state.best_candidate.prompt.text if state.best_candidate else config.seed_prompt,
            initial_prompt=config.seed_prompt,
            iterations_used=state.iteration,
            total_llm_calls=state.total_llm_calls + 1,  # +1 for the bootstrap call
            initial_score=initial_score,
            final_score=final_score,
            improvement=final_score - initial_score,
            history=state.history,
        )

5. Infrastructure Layer: DSPy Adapters

Objective

Implement the domain ports with DSPy. Each adapter wraps a dspy.Signature plus a dspy.Module.

5.1 dspy_signatures.py

"""
DSPy Signatures: declarative LLM contracts.
Objective: define WHAT each LLM call does, not HOW.
DSPy Signature = input_fields → output_fields + instruction.
DSPy handles the prompting, parsing, and structuring.
"""
import dspy
class GenerateSyntheticInputs(dspy.Signature):
    """Generate diverse, realistic input examples for a given task."""
    task_description: str = dspy.InputField(
        desc="Description of the task the prompt should accomplish."
    )
    n_examples: int = dspy.InputField(
        desc="Number of examples to generate."
    )
    examples: str = dspy.OutputField(
        desc=(
            "A JSON array of strings, each being a realistic input "
            "for the task. Cover: normal cases, edge cases, long inputs, "
            "short inputs, ambiguous cases, and tricky scenarios."
        ),
    )
class JudgeOutput(dspy.Signature):
    """
    Evaluate the quality of an LLM output for a given task and input.
    Score: 0.0 (completely wrong) to 1.0 (perfect).
    Feedback: specific, actionable criticism.
    """
    task_description: str = dspy.InputField(
        desc="What the assistant is supposed to do."
    )
    input_text: str = dspy.InputField(
        desc="The input provided to the assistant."
    )
    output_text: str = dspy.InputField(
        desc="The assistant's response to evaluate."
    )
    score: float = dspy.OutputField(
        desc="Quality score from 0.0 (wrong) to 1.0 (perfect)."
    )
    feedback: str = dspy.OutputField(
        desc=(
            "Specific, actionable feedback explaining what's wrong "
            "with the output and how to improve it. Be critical."
        ),
    )
class ProposeInstruction(dspy.Signature):
    """
    Given a current prompt and examples of where it fails with feedback,
    propose an improved version of the prompt.
    The new prompt should address all the issues identified in the feedback.
    """
    current_instruction: str = dspy.InputField(
        desc="The current prompt/instruction to improve."
    )
    task_description: str = dspy.InputField(
        desc="Description of the task."
    )
    failure_examples: str = dspy.InputField(
        desc=(
            "Examples of inputs, outputs, scores, and feedback "
            "showing where the current instruction fails."
        ),
    )
    new_instruction: str = dspy.OutputField(
        desc="An improved version of the instruction."
    )

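5.2 dspy_modules.py

"""
DSPy Modules: composition of signatures.
Objective: declarative orchestration of LLM calls via DSPy.
"""
import json
import dspy
from prometheus.infrastructure.dspy_signatures import (
    GenerateSyntheticInputs,
    JudgeOutput,
    ProposeInstruction,
)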
class SyntheticInputGenerator(dspy.Module):
    """
    Generates synthetic inputs in a single batch call.
    Uses ChainOfThought for better diversity.
    """
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateSyntheticInputs)
    def forward(self, task_description: str, n_examples: int):
        result = self.generate(
            task_description=task_description,
            n_examples=n_examples,
        )
        # Parse the JSON array
        try:
            examples = json.loads(result.examples)
        except json.JSONDecodeError:
            # Fallback: extract the quoted strings from the text
            examples = self._parse_fallback(result.examples)
        return dspy.Prediction(examples=examples)
    @staticmethod
    def _parse_fallback(text: str) -> list[str]:
        """Extract strings from non-JSON output."""
        # Heuristic: collect every double-quoted string found in the text
        import re
        matches = re.findall(r'"([^"]+)"', text)
        return matches if matches else [text]
class OutputJudge(dspy.Module):
    """
    Judges a single output. Invoked once per pair of the batch by the JudgeAdapter.
    """
    def __init__(self):
        super().__init__()
        self.judge = dspy.ChainOfThought(JudgeOutput)
    def forward(self, task_description: str, input_text: str, output_text: str):
        result = self.judge(
            task_description=task_description,
            input_text=input_text,
            output_text=output_text,
        )
        # Parse the score (DSPy may return a string)
        try:
            score = float(result.score)
        except (ValueError, TypeError):
            score = 0.5  # neutral fallback
        score = max(0.0, min(1.0, score))
        return dspy.Prediction(score=score, feedback=result.feedback)
class InstructionProposer(dspy.Module):
    """
    Proposes a new prompt from the failure trajectories.
    This is the counterpart of GEPA's InstructionProposalSignature.
    """
    def __init__(self):
        super().__init__()
        self.propose = dspy.ChainOfThought(ProposeInstruction)
    def forward(
        self,
        current_instruction: str,
        task_description: str,
        failure_examples: str,
    ):
        result = self.propose(
            current_instruction=current_instruction,
            task_description=task_description,
            failure_examples=failure_examples,
        )
        return dspy.Prediction(new_instruction=result.new_instruction)
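
LLM output that is supposed to be a JSON array often arrives wrapped in extra prose. The sketch below shows one possible hardening of the parse-with-fallback step in plain Python; `parse_string_array` is a hypothetical helper, not the module's actual method, and it adds an intermediate attempt (parsing the outermost bracketed span) before resorting to quoted-string extraction.

```python
import json
import re


def parse_string_array(text: str) -> list[str]:
    """Best-effort extraction of a list of strings from LLM output."""
    candidates = [text]
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)   # outermost [...] span, if any
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(data, list):                         # reject objects, numbers, bare strings
            return [str(item) for item in data]
    quoted = re.findall(r'"([^"]+)"', text)                # last resort: any quoted strings
    return quoted if quoted else [text]


print(parse_string_array('["a", "b"]'))
# ['a', 'b']
print(parse_string_array('Sure! Here are the inputs: ["x", "y"] Hope this helps.'))
# ['x', 'y']
print(parse_string_array("no structure at all"))
# ['no structure at all']
```

Checking `isinstance(data, list)` matters because `json.loads` happily returns a dict, number, or string for valid JSON that is not an array.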

5.3 llm_adapter.py

"""
Adapter: runs a prompt on an input.
Objective: implement the LLMPort port via DSPy.
"""
import dspy
from prometheus.domain.ports import LLMPort
from prometheus.domain.entities import Prompt
class DSPyLLMAdapter(LLMPort):
    """
    Runs a prompt using dspy.Predict with a simple signature.
    """
    class _ExecuteSignature(dspy.Signature):
        """Execute the instruction on the given input."""
        instruction: str = dspy.InputField(desc="The instruction/prompt to follow.")
        input_text: str = dspy.InputField(desc="The input to process.")
        output: str = dspy.OutputField(desc="The response following the instruction.")
    def __init__(self, model: str):
        self._predictor = dspy.Predict(self._ExecuteSignature)
        # The model is configured globally via dspy.configure(), so `model` is
        # not used here yet; it could also be bound locally if needed.
    def execute(self, prompt: Prompt, input_text: str) -> str:
        result = self._predictor(
            instruction=prompt.text,
            input_text=input_text,
        )
        return result.output

5.4 judge_adapter.py

"""
Adapter: LLM-as-Judge.
Objective: implement the JudgePort port via the DSPy OutputJudge module.
"""
from prometheus.domain.ports import JudgePort
from prometheus.infrastructure.dspy_modules import OutputJudge
class DSPyJudgeAdapter(JudgePort):
    """
    Evaluates a batch of (input, output) pairs by calling the judge once per pair.
    Future optimization: parallelize the calls via dspy.Parallel.
    For the MVP, calls stay sequential.
    """
    def __init__(self):
        self._judge = OutputJudge()
    def judge_batch(
        self,
        task_description: str,
        pairs: list[tuple[str, str]],
    ) -> list[tuple[float, str]]:
        results = []
        for input_text, output_text in pairs:
            pred = self._judge(
                task_description=task_description,
                input_text=input_text,
                output_text=output_text,
            )
            results.append((pred.score, pred.feedback))
        return results
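
The parallelization mentioned as a future optimization can be prototyped with the standard library. The sketch below deliberately swaps in `concurrent.futures.ThreadPoolExecutor` rather than `dspy.Parallel`, and it assumes the per-pair judge is safe to call from several threads, which would need to be verified for the real DSPy module. `judge_batch_parallel` and `slow_fake_judge` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

JudgeOne = Callable[[str, str, str], tuple[float, str]]   # (task, input, output) -> (score, feedback)


def judge_batch_parallel(
    judge_one: JudgeOne,
    task_description: str,
    pairs: list[tuple[str, str]],
    max_workers: int = 4,
) -> list[tuple[float, str]]:
    """Judge all pairs concurrently; results keep the order of `pairs`."""
    if not pairs:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        return list(pool.map(lambda pair: judge_one(task_description, *pair), pairs))


def slow_fake_judge(task: str, input_text: str, output_text: str) -> tuple[float, str]:
    time.sleep(0.05 if input_text == "first" else 0.0)   # the first pair finishes last
    return (1.0, f"judged {input_text}")


results = judge_batch_parallel(slow_fake_judge, "demo task", [("first", "a"), ("second", "b")])
print(results)   # [(1.0, 'judged first'), (1.0, 'judged second')]
```

Order preservation is the property the evaluator relies on, since it pairs judge results with inputs by position.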

5.5 proposer_adapter.py

"""
Adapter: Reflective Mutation Proposer.
Objective: implement the ProposerPort port via the DSPy InstructionProposer.
Converts the trajectories into a format the proposer LLM can read.
"""
from prometheus.domain.ports import ProposerPort
from prometheus.domain.entities import Prompt, Trajectory
from prometheus.infrastructure.dspy_modules import InstructionProposer
class DSPyProposerAdapter(ProposerPort):
    """
    Uses the evaluation trajectories to build
    a "failure report" and propose a new prompt.
    """
    def __init__(self):
        self._proposer = InstructionProposer()
    def propose(
        self,
        current_prompt: Prompt,
        trajectories: list[Trajectory],
        task_description: str,
    ) -> Prompt:
        # Format the trajectories as failure examples
        failure_examples = self._format_failures(trajectories)
        pred = self._proposer(
            current_instruction=current_prompt.text,
            task_description=task_description,
            failure_examples=failure_examples,
        )
        return Prompt(text=pred.new_instruction)
    @staticmethod
    def _format_failures(trajectories: list[Trajectory]) -> str:
        """
        Converts the trajectories into a structured text report.
        Format inspired by GEPA's InstructionProposalSignature:
        # Example 1
        ## Input
        <input_text>
        ## Generated Output
        <output_text>
        ## Score
        <score>
        ## Feedback
        <feedback>
        """
        sections = []
        for i, t in enumerate(trajectories, 1):
            section = (
                f"# Example {i}\n"
                f"## Input\n{t.input_text}\n\n"
                f"## Generated Output\n{t.output_text}\n\n"
                f"## Score\n{t.score:.2f}\n\n"
                f"## Feedback\n{t.feedback}\n"
            )
            sections.append(section)
        return "\n---\n".join(sections)
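
To make the report format concrete, here is a minimal standalone sketch of the same formatting logic. The `Trajectory` dataclass below is a reduced stand-in for the domain entity (its field list is an assumption), and `format_failures` mirrors the static method above as a free function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    # Reduced stand-in for prometheus.domain.entities.Trajectory (fields assumed).
    input_text: str
    output_text: str
    score: float
    feedback: str


def format_failures(trajectories: list[Trajectory]) -> str:
    """Render one Markdown-style section per trajectory, separated by '---'."""
    sections = []
    for i, t in enumerate(trajectories, 1):
        sections.append(
            f"# Example {i}\n"
            f"## Input\n{t.input_text}\n\n"
            f"## Generated Output\n{t.output_text}\n\n"
            f"## Score\n{t.score:.2f}\n\n"
            f"## Feedback\n{t.feedback}\n"
        )
    return "\n---\n".join(sections)


report = format_failures([
    Trajectory("Q1", "A1", 0.3, "Too vague"),
    Trajectory("Q2", "A2", 0.6, "Missing citation"),
])
print(report)
```

Note that every trajectory is included, not only the low-scoring ones, so the proposer also sees what the current prompt already handles well.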

5.6 synth_adapter.py

"""
Adapter: synthetic input generation.
Goal: Implement the SyntheticGeneratorPort port via DSPy.
"""
from prometheus.domain.ports import SyntheticGeneratorPort
from prometheus.domain.entities import SyntheticExample
from prometheus.infrastructure.dspy_modules import SyntheticInputGenerator
class DSPySyntheticAdapter(SyntheticGeneratorPort):
    """
    Generates the synthetic inputs in a single batch call via DSPy.
    """
    def __init__(self):
        self._generator = SyntheticInputGenerator()
    def generate_inputs(
        self,
        task_description: str,
        n_examples: int,
    ) -> list[SyntheticExample]:
        pred = self._generator(
            task_description=task_description,
            n_examples=n_examples,
        )
        return [
            SyntheticExample(
                input_text=text,
                id=i,
            )
            for i, text in enumerate(pred.examples[:n_examples])
        ]

5.7 file_io.py

"""
File I/O: reading and writing config and result files.
Goal: Implement the PersistencePort port with YAML.
"""
import yaml
from prometheus.domain.ports import PersistencePort
class YamlPersistence(PersistencePort):
    """Reads and writes YAML files."""
    def read_config(self, path: str) -> dict:
        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f)
    def write_result(self, path: str, data: dict) -> None:
        with open(path, "w", encoding="utf-8") as f:
            yaml.dump(data, f, default_flow_style=False, allow_unicode=True)

6. Presentation Layer (CLI)

Goal

Provide a simple CLI via Typer. Single entry point: prometheus optimize -i config.yaml -o result.yaml

6.1 config.py

"""
Global configuration.
Goal: Hold application settings as a plain dataclass with hardcoded defaults (no pydantic-settings, file, or env-var loading in the MVP).
"""
from __future__ import annotations
from dataclasses import dataclass
@dataclass
class AppSettings:
    """Non-sensitive settings, hardcoded for the MVP."""
    app_name: str = "prometheus"
    version: str = "0.1.0"

6.2 cli/app.py

"""
CLI: user entry point.
Goal: Typer interface with -i (input) and -o (output) options.
"""
import typer
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
import dspy
from prometheus.application.dto import OptimizationConfig, OptimizationResult
from prometheus.application.use_cases import OptimizePromptUseCase
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.infrastructure.file_io import YamlPersistence
from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter
app = typer.Typer(
    name="prometheus",
    help="🔥 PROMETHEUS — Prompt evolution without reference data.",
    no_args_is_help=True,
)
console = Console()
@app.command()
def optimize(
    input: str = typer.Option(
        ..., "-i", "--input",
        help="Path to input YAML config file.",
        exists=True, readable=True,
    ),
    output: str = typer.Option(
        "output.yaml", "-o", "--output",
        help="Path to output YAML result file.",
    ),
    verbose: bool = typer.Option(
        False, "-v", "--verbose",
        help="Print detailed progress.",
    ),
) -> None:
    """
    Optimize a prompt without any reference data.
    Usage:
        prometheus optimize -i config.yaml -o result.yaml
    """
    console.print(Panel.fit(
        "🔥 [bold red]PROMETHEUS[/bold red] — Prompt Evolution Engine",
        subtitle="No reference data required",
    ))
    # ── 1. Load the config ──
    persistence = YamlPersistence()
    raw_config = persistence.read_config(input)
    config = OptimizationConfig(
        seed_prompt=raw_config["seed_prompt"],
        task_description=raw_config["task_description"],
        task_model=raw_config.get("task_model", "openai/gpt-4o-mini"),
        judge_model=raw_config.get("judge_model", "openai/gpt-4o"),
        proposer_model=raw_config.get("proposer_model", "openai/gpt-4o"),
        synth_model=raw_config.get("synth_model", "openai/gpt-4o"),
        max_iterations=raw_config.get("max_iterations", 30),
        n_synthetic_inputs=raw_config.get("n_synthetic_inputs", 20),
        minibatch_size=raw_config.get("minibatch_size", 5),
        seed=raw_config.get("seed", 42),
        output_path=output,
        verbose=verbose,
    )
    console.print(f"[dim]Task: {config.task_description[:80]}...[/dim]")
    console.print(f"[dim]Seed prompt: {config.seed_prompt[:80]}...[/dim]")
    # ── 2. Configure DSPy ──
    # One LM per role. In the MVP only task_lm is wired in (as the global
    # default below); judge_lm, proposer_lm and synth_lm are created but unused.
    task_lm = dspy.LM(config.task_model)
    judge_lm = dspy.LM(config.judge_model)
    proposer_lm = dspy.LM(config.proposer_model)
    synth_lm = dspy.LM(config.synth_model)
    # ── 3. Build the adapters (Dependency Injection) ──
    dspy.configure(lm=task_lm)  # default LM
    synth_adapter = DSPySyntheticAdapter()
    # The synthesis model could be configured specifically here
    # (in the MVP, the default LM is used)
    llm_adapter = DSPyLLMAdapter(model=config.task_model)
    judge_adapter = DSPyJudgeAdapter()
    proposer_adapter = DSPyProposerAdapter()
    bootstrap = SyntheticBootstrap(generator=synth_adapter, seed=config.seed)
    evaluator = PromptEvaluator(executor=llm_adapter, judge=judge_adapter)
    use_case = OptimizePromptUseCase(
        evaluator=evaluator,
        proposer=proposer_adapter,
        bootstrap=bootstrap,
    )
    # ── 4. Run ──
    with console.status("[bold green]Evolving prompt..."):
        result = use_case.execute(config)
    # ── 5. Display the results ──
    _display_result(result)
    # ── 6. Save ──
    _save_result(persistence, output, result)
    console.print(f"\n[green]✅ Results saved to {output}[/green]")
def _display_result(result: OptimizationResult) -> None:
    """Prints a Rich summary to the terminal."""
    console.print()
    console.print(Panel(
        f"[bold green]Optimized Prompt[/bold green]\n\n{result.optimized_prompt}",
        title="🔥 Result",
    ))
    table = Table(title="Metrics")
    table.add_column("Metric", style="cyan")
    table.add_column("Value", style="bold")
    table.add_row("Initial Score", f"{result.initial_score:.2f}")
    table.add_row("Final Score", f"{result.final_score:.2f}")
    table.add_row("Improvement", f"{result.improvement:+.2f}")
    table.add_row("Iterations", str(result.iterations_used))
    table.add_row("LLM Calls", str(result.total_llm_calls))
    console.print(table)
def _save_result(
    persistence: YamlPersistence,
    path: str,
    result: OptimizationResult,
) -> None:
    """Saves the result as YAML."""
    from dataclasses import asdict
    persistence.write_result(path, asdict(result))
if __name__ == "__main__":
    app()

7. Core Algorithm

Detailed Flow Diagram

flowchart TB
    START(["prometheus optimize<br/>-i config.yaml<br/>-o result.yaml"]) --> LOAD
    LOAD["Load config.yaml"] --> INIT_DSPY["Configure DSPy LMs"]
    INIT_DSPY --> BOOTSTRAP
    subgraph BOOTSTRAP["Phase 0: Bootstrap"]
        direction TB
        B1["DSPySyntheticAdapter<br/>.generate_inputs()"] --> B2["SyntheticInputGenerator<br/>dspy.ChainOfThought<br/>(GenerateSyntheticInputs)"]
        B2 --> B3["Synthetic input pool<br/>[input₁, input₂, ..., input₂₀]"]
    end
    B3 --> LOOP_START
    subgraph LOOP["Phase 1: Evolution Loop (×30)"]
        direction TB
        LOOP_START --> SELECT["Keep the best candidate"]
        SELECT --> SAMPLE["Bootstrap.sample_minibatch()<br/>5 random inputs"]
        SAMPLE --> EXEC
        subgraph EXEC["Evaluate Current"]
            direction TB
            E1["DSPyLLMAdapter.execute()<br/>→ 5 outputs"] --> E2["DSPyJudgeAdapter.judge_batch()<br/>→ 5 × (score, feedback)"]
            E2 --> E3["Build Trajectories<br/>(input, output, score, feedback)"]
        end
        E3 --> CHECK_PERFECT{"All scores ≥ 1.0 ?"}
        CHECK_PERFECT -->|Yes| NEXT_ITER["Skip → next iteration"]
        CHECK_PERFECT -->|No| PROPOSE
        subgraph PROPOSE["Reflective Mutation"]
            direction TB
            P1["DSPyProposerAdapter.propose()"] --> P2["Format failure report<br/>from the Trajectories"]
            P2 --> P3["InstructionProposer<br/>dspy.ChainOfThought<br/>(ProposeInstruction)"]
            P3 --> P4["new_prompt"]
        end
        PROPOSE --> EVAL_NEW
        subgraph EVAL_NEW["Evaluate New"]
            direction TB
            EN1["DSPyLLMAdapter.execute()<br/>→ 5 outputs"] --> EN2["DSPyJudgeAdapter.judge_batch()<br/>→ 5 × (score, feedback)"]
        end
        EVAL_NEW --> ACCEPT{"new_score > old_score ?"}
        ACCEPT -->|Yes| UPDATE["best_candidate = new candidate"]
        ACCEPT -->|No| NEXT_ITER
        UPDATE --> NEXT_ITER
    end
    NEXT_ITER --> MORE{"iterations < max ?"}
    MORE -->|Yes| SELECT
    MORE -->|No| SAVE["Save result.yaml"]
    SAVE --> DONE(["✅ Done"])
    style BOOTSTRAP fill:#0f3460,stroke:#00d2ff,color:#fff
    style LOOP fill:#1a1a2e,stroke:#e94560,color:#fff
    style EXEC fill:#16213e,stroke:#00d2ff,color:#fff
    style PROPOSE fill:#16213e,stroke:#e94560,color:#fff
    style EVAL_NEW fill:#16213e,stroke:#00d2ff,color:#fff
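
The flow above can be sketched as a small greedy loop. This is a simplified illustration, not the EvolutionLoop implementation: `evolve`, `toy_evaluate` and `toy_propose` are hypothetical names, the toy evaluator scores a prompt by its length instead of calling a judge, and comparing the sum of minibatch scores is an assumption about how new_score and old_score are computed.

```python
import random
from typing import Callable

# Hypothetical stand-ins for PromptEvaluator and the proposer port.
Evaluate = Callable[[str, list[str]], list[float]]  # (prompt, batch) -> scores in [0, 1]
Propose = Callable[[str, list[float]], str]         # (prompt, scores) -> new prompt


def evolve(
    seed_prompt: str,
    pool: list[str],
    evaluate: Evaluate,
    propose: Propose,
    max_iterations: int = 30,
    minibatch_size: int = 5,
    seed: int = 42,
) -> tuple[str, list[str]]:
    rng = random.Random(seed)  # seeded RNG keeps minibatch sampling repeatable
    best = seed_prompt
    events: list[str] = []
    for _ in range(max_iterations):
        batch = rng.sample(pool, minibatch_size)
        old_scores = evaluate(best, batch)
        if all(score >= 1.0 for score in old_scores):
            events.append("skipped")  # nothing to fix on this minibatch
            continue
        candidate = propose(best, old_scores)
        new_scores = evaluate(candidate, batch)  # same batch, so scores are comparable
        if sum(new_scores) > sum(old_scores):
            best = candidate
            events.append("accepted")
        else:
            events.append("rejected")
    return best, events


def toy_evaluate(prompt: str, batch: list[str]) -> list[float]:
    # Toy scoring: longer prompts score higher, capped at 1.0.
    return [min(len(prompt) / 40, 1.0) for _ in batch]


def toy_propose(prompt: str, scores: list[float]) -> str:
    # Toy mutation: append a fixed instruction.
    return prompt + " Be precise."


pool = [f"input {i}" for i in range(20)]
best, events = evolve("Answer.", pool, toy_evaluate, toy_propose, max_iterations=5)
```

Evaluating the candidate on the same minibatch as the current best is what makes the strict "greater than" acceptance test meaningful; scores from different minibatches would not be directly comparable.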

Detailed LLM Budget per Iteration

Typical iteration (minibatch_size=5):
┌──────────────────────────────────────┬──────────┐
│ Operation                            │ Calls    │
├──────────────────────────────────────┼──────────┤
│ Execute current (task_lm)            │ 5        │
│ Judge current  (judge_lm)            │ 5        │
│ Propose new    (proposer_lm)         │ 1        │
│ Execute new    (task_lm)             │ 5        │
│ Judge new      (judge_lm)            │ 5        │
├──────────────────────────────────────┼──────────┤
│ TOTAL per iteration                  │ 21       │
├──────────────────────────────────────┼──────────┤
│ Bootstrap                            │ 1        │
│ 30 iterations × 21                   │ 630      │
├──────────────────────────────────────┼──────────┤
│ TOTAL MVP                            │ ~631     │
└──────────────────────────────────────┴──────────┘
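
The arithmetic in the table can be reproduced with a few lines of Python. The function names below are hypothetical, and the figures are an upper bound: an iteration where every score is perfect skips the proposal and the second evaluation. The sketch also shows that, with these defaults, the total exceeds the "< 500 LLM calls" target of objective O3, and how many iterations would fit under it.

```python
def calls_per_iteration(minibatch_size: int) -> int:
    # Execute + judge the current prompt, one proposal,
    # then execute + judge the candidate prompt.
    return 4 * minibatch_size + 1


def total_calls(max_iterations: int, minibatch_size: int, bootstrap_calls: int = 1) -> int:
    return bootstrap_calls + max_iterations * calls_per_iteration(minibatch_size)


def max_iterations_within(budget: int, minibatch_size: int, bootstrap_calls: int = 1) -> int:
    # Largest iteration count whose worst-case total stays <= budget.
    return (budget - bootstrap_calls) // calls_per_iteration(minibatch_size)


worst_case = total_calls(max_iterations=30, minibatch_size=5)
fits_under_500 = max_iterations_within(budget=499, minibatch_size=5)
print(worst_case, fits_under_500, total_calls(fits_under_500, 5))
```

Under these assumptions, staying strictly below 500 calls means either lowering max_iterations or shrinking minibatch_size.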

8. I/O File Formats

8.1 Input: config.yaml

# PROMETHEUS Configuration File
# ==================================
# The initial prompt to optimize
seed_prompt: |
  Tu es un assistant expert en analyse de contrats.
  Analyse le texte fourni et identifie les clauses potentiellement abusives.
  Sois précis et cite les passages concernés.
# Task description (used to generate the synthetic inputs)
task_description: |
  Analyse juridique de contrats pour identifier les clauses abusives.
  L'assistant doit examiner un texte de contrat et signaler
  toute clause qui pourrait être considérée comme abusive selon
  le droit de la consommation français.
# LLM models (DSPy/litellm format)
task_model: "openai/gpt-4o-mini"
judge_model: "openai/gpt-4o"
proposer_model: "openai/gpt-4o"
synth_model: "openai/gpt-4o"
# Evolution parameters
max_iterations: 30
n_synthetic_inputs: 20
minibatch_size: 5
seed: 42
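
Only seed_prompt and task_description are mandatory in this file; every other key falls back to a default in the CLI. A minimal sketch of validating the parsed dictionary before use might look like this. `ConfigSketch` and `build_config` are hypothetical names, not part of the PROMETHEUS code, and the minibatch check is an added assumption.

```python
from dataclasses import dataclass, fields

REQUIRED_KEYS = ("seed_prompt", "task_description")


@dataclass(frozen=True)
class ConfigSketch:
    # Defaults mirror the ones used by the CLI when a key is absent.
    seed_prompt: str
    task_description: str
    task_model: str = "openai/gpt-4o-mini"
    judge_model: str = "openai/gpt-4o"
    proposer_model: str = "openai/gpt-4o"
    synth_model: str = "openai/gpt-4o"
    max_iterations: int = 30
    n_synthetic_inputs: int = 20
    minibatch_size: int = 5
    seed: int = 42


def build_config(raw: dict) -> ConfigSketch:
    """Validate a parsed YAML dict and apply defaults."""
    missing = [key for key in REQUIRED_KEYS if not raw.get(key)]
    if missing:
        raise ValueError(f"Missing required config keys: {', '.join(missing)}")
    known_names = {f.name for f in fields(ConfigSketch)}
    config = ConfigSketch(**{k: v for k, v in raw.items() if k in known_names})
    # Assumed sanity check: a minibatch cannot be larger than the pool it samples from.
    if config.minibatch_size > config.n_synthetic_inputs:
        raise ValueError("minibatch_size cannot exceed n_synthetic_inputs")
    return config


config = build_config({
    "seed_prompt": "You are a helpful assistant.",
    "task_description": "Answer factual questions.",
    "max_iterations": 10,
})
```

Failing early with a named key gives a clearer message than the KeyError raised by a direct dictionary lookup.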

8.2 Output: result.yaml

# PROMETHEUS Optimization Result
# ================================
optimized_prompt: |
  Tu es un analyste juridique spécialisé en droit de la consommation français.
  Pour chaque contrat analysé, applique cette méthodologie:
  1. Identifie toutes les clauses restrictives pour le consommateur
  2. Compare chaque clause aux critères d'abusivité de l'Article L.212-1
  3. Signale les clauses abusives avec: le texte exact, le motif d'abusivité,
     et le risque juridique associé
  Sois exhaustif et cite systématiquement les passages concernés.
initial_prompt: |
  Tu es un assistant expert en analyse de contrats.
  Analyse le texte fourni et identifie les clauses potentiellement abusives.
  Sois précis et cite les passages concernés.
initial_score: 6.8
final_score: 8.9
improvement: 2.1
iterations_used: 30
total_llm_calls: 631
history:
  - iteration: 1
    event: "accepted"
    old_score: 1.2
    new_score: 1.8
    improvement: 0.6
  - iteration: 2
    event: "rejected"
    old_score: 1.8
    new_score: 1.5
  # ... etc
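
The history list lends itself to simple post-run analysis. As a rough sketch, assuming each entry carries the event, old_score and new_score keys shown above, a hypothetical `summarize_history` helper could count accepted and rejected proposals and total the gains from accepted ones.

```python
from collections import Counter


def summarize_history(history: list[dict]) -> dict:
    """Count events and sum the score gains of accepted proposals."""
    events = Counter(entry["event"] for entry in history)
    gains = [
        entry["new_score"] - entry["old_score"]
        for entry in history
        if entry["event"] == "accepted"
    ]
    return {
        "accepted": events.get("accepted", 0),
        "rejected": events.get("rejected", 0),
        "total_gain": round(sum(gains), 2),  # rounded to hide float noise
    }


# The two entries from the example file above.
history = [
    {"iteration": 1, "event": "accepted", "old_score": 1.2, "new_score": 1.8},
    {"iteration": 2, "event": "rejected", "old_score": 1.8, "new_score": 1.5},
]
summary = summarize_history(history)
```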

9. Configuration & Environment

9.1 Environment Variables

# Required (when using OpenAI)
export OPENAI_API_KEY="sk-..."
# Optional (when using other providers)
export ANTHROPIC_API_KEY="..."
export TOGETHERAI_API_KEY="..."
# Optional
export PROMETHEUS_LOG_LEVEL="INFO"  # DEBUG for detailed traces

9.2 Installation and Execution

# Installation
git clone <repo>
cd prometheus
uv sync
# Run
uv run prometheus optimize -i config.yaml -o result.yaml -v
# With options
uv run prometheus optimize \
  -i examples/legal_contract.yaml \
  -o results/legal_optimized.yaml \
  --verbose

10. Test Strategy

10.1 Test Pyramid

         ┌─────────────┐
         │   E2E Test   │  test_full_pipeline.py
         │  (1-2 tests) │  → Mock LLM, verifies the full flow
         ├─────────────┤
         │ Integration  │  test_dspy_adapters.py
         │  (3-5 tests) │  → Real DSPy signatures, mock LM
         ├─────────────┤
         │    Unit      │  test_entities.py
         │ (10+ tests)  │  test_scoring.py
         │              │  test_evolution.py (with mocks)
         └─────────────┘

10.2 tests/conftest.py

"""Shared test fixtures."""
import pytest
from unittest.mock import MagicMock
from prometheus.domain.entities import (
    Prompt, SyntheticExample, Trajectory, EvalResult, Candidate
)
@pytest.fixture
def seed_prompt():
    return Prompt(text="You are a helpful assistant. Answer the question.")
@pytest.fixture
def task_description():
    return "Answer factual questions accurately and concisely."
@pytest.fixture
def synthetic_pool():
    return [
        SyntheticExample(input_text=f"Test input {i}", id=i)
        for i in range(20)
    ]
@pytest.fixture
def mock_eval_result():
    return EvalResult(
        scores=[0.3, 0.5, 0.4, 0.6, 0.2],
        feedbacks=[
            "Incomplete answer",
            "Missing key detail",
            "Wrong format",
            "Partially correct",
            "Completely off topic",
        ],
        trajectories=[
            Trajectory(
                input_text=f"Input {i}",
                output_text=f"Output {i}",
                score=s,
                feedback=f,
                prompt_used="test prompt",
            )
            for i, (s, f) in enumerate(zip(
                [0.3, 0.5, 0.4, 0.6, 0.2],
                [
                    "Incomplete answer",
                    "Missing key detail",
                    "Wrong format",
                    "Partially correct",
                    "Completely off topic",
                ],
            ))
        ],
    )
@pytest.fixture
def mock_llm_port():
    """Mock LLMPort that returns canned responses."""
    port = MagicMock()
    port.execute.return_value = "This is a mock response."
    return port
@pytest.fixture
def mock_judge_port():
    """Mock JudgePort that returns moderate scores."""
    port = MagicMock()
    port.judge_batch.return_value = [
        (0.5, "Moderate quality, needs improvement."),
    ] * 5
    return port
@pytest.fixture
def mock_proposer_port():
    """Mock ProposerPort that returns a slightly modified prompt."""
    port = MagicMock()
    port.propose.return_value = Prompt(
        text="You are a very helpful assistant. Answer the question precisely."
    )
    return port

10.3 tests/unit/test_evolution.py

"""Unit tests for the evolution loop — with full mocking."""
import pytest
from unittest.mock import MagicMock, patch
from prometheus.domain.entities import Prompt, SyntheticExample, EvalResult, Trajectory
from prometheus.application.evolution import EvolutionLoop
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.bootstrap import SyntheticBootstrap
class TestEvolutionLoop:
    """Tests the accept/reject logic of the evolution loop."""
    def test_accepts_improvement(self, seed_prompt, synthetic_pool, task_description,
                                  mock_llm_port, mock_judge_port, mock_proposer_port):
        """
        Scenario: the new prompt improves the score.
        Expected: the best candidate is updated.
        """
        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
        # Old eval = low scores, new eval = high scores
        old_eval = EvalResult(
            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
            feedbacks=["bad"] * 5,
            trajectories=[
                Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
            ],
        )
        new_eval = EvalResult(
            scores=[0.8, 0.9, 0.7, 0.8, 0.9],
            feedbacks=["good"] * 5,
            trajectories=[],
        )
        # evaluator.evaluate called twice per iteration (old + new)
        evaluator.evaluate = MagicMock(side_effect=[old_eval, new_eval])
        loop = EvolutionLoop(
            evaluator=evaluator,
            proposer=mock_proposer_port,
            bootstrap=bootstrap,
            max_iterations=1,
            minibatch_size=5,
        )
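        with patch.object(loop, '_log'):
            state = loop.run(seed_prompt, synthetic_pool, task_description)
        # The proposed prompt should have replaced the seed prompt
        mock_proposer_port.propose.assert_called_once()
        assert state.best_candidate.prompt.text == mock_proposer_port.propose.return_value.text
        assert state.best_candidate.best_score > 0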
    def test_rejects_regression(self, seed_prompt, synthetic_pool, task_description,
                                 mock_llm_port, mock_judge_port, mock_proposer_port):
        """
        Scenario: the new prompt degrades the score.
        Expected: the best candidate stays unchanged.
        """
        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
        old_eval = EvalResult(
            scores=[0.7, 0.8, 0.7, 0.8, 0.9],
            feedbacks=["ok"] * 5,
            trajectories=[
                Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
                for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
            ],
        )
        new_eval = EvalResult(
            scores=[0.2, 0.1, 0.3, 0.2, 0.1],
            feedbacks=["bad"] * 5,
            trajectories=[],
        )
        evaluator.evaluate = MagicMock(side_effect=[old_eval, new_eval])
        loop = EvolutionLoop(
            evaluator=evaluator,
            proposer=mock_proposer_port,
            bootstrap=bootstrap,
            max_iterations=1,
            minibatch_size=5,
        )
        with patch.object(loop, '_log'):
            state = loop.run(seed_prompt, synthetic_pool, task_description)
        # The seed prompt should remain the best
        assert state.best_candidate.prompt.text == seed_prompt.text
    def test_skips_perfect_scores(self, seed_prompt, synthetic_pool, task_description,
                                   mock_llm_port, mock_judge_port, mock_proposer_port):
        """
        Scenario: all scores are perfect.
        Expected: no proposal, move on to the next iteration.
        """
        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
        perfect_eval = EvalResult(
            scores=[1.0, 1.0, 1.0, 1.0, 1.0],
            feedbacks=["perfect"] * 5,
            trajectories=[
                Trajectory(f"input{i}", f"output{i}", 1.0, "perfect", "prompt")
                for i in range(5)
            ],
        )
        evaluator.evaluate = MagicMock(return_value=perfect_eval)
        loop = EvolutionLoop(
            evaluator=evaluator,
            proposer=mock_proposer_port,
            bootstrap=bootstrap,
            max_iterations=3,
            minibatch_size=5,
        )
        with patch.object(loop, '_log'):
            state = loop.run(seed_prompt, synthetic_pool, task_description)
        # The proposer should never have been called
        mock_proposer_port.propose.assert_not_called()

11. Architecture Diagrams

11.1 Hexagonal Architecture: Component View

flowchart TB
    subgraph PRESENTATION["🎯 PRESENTATION (CLI)"]
        CLI["typer CLI<br/>prometheus/cli/app.py"]
    end
    subgraph APPLICATION["⚙️ APPLICATION (Use Cases)"]
        UC["OptimizePromptUseCase"]
        BOOT["SyntheticBootstrap"]
        EVAL["PromptEvaluator"]
        EVO["EvolutionLoop"]
    end
    subgraph DOMAIN["💎 DOMAIN (Entities + Ports)"]
        ENT["Prompt<br/>SyntheticExample<br/>Trajectory<br/>EvalResult<br/>Candidate<br/>OptimizationState"]
        PORTS["LLMPort<br/>JudgePort<br/>ProposerPort<br/>SyntheticGeneratorPort<br/>PersistencePort"]
        SCORE["scoring.py"]
    end
    subgraph INFRA["🔧 INFRASTRUCTURE (DSPy)"]
        DSPY_SIG["dspy_signatures.py<br/>GenerateSyntheticInputs<br/>JudgeOutput<br/>ProposeInstruction"]
        DSPY_MOD["dspy_modules.py<br/>SyntheticInputGenerator<br/>OutputJudge<br/>InstructionProposer"]
        ADAPTERS["DSPyLLMAdapter<br/>DSPyJudgeAdapter<br/>DSPyProposerAdapter<br/>DSPySyntheticAdapter"]
        FILE_IO["YamlPersistence"]
    end
    CLI -->|"OptimizationConfig"| UC
    UC --> BOOT
    UC --> EVO
    EVO --> EVAL
    EVO -->|"ProposerPort"| ADAPTERS
    BOOT -->|"SyntheticGeneratorPort"| ADAPTERS
    EVAL -->|"LLMPort"| ADAPTERS
    EVAL -->|"JudgePort"| ADAPTERS
    ADAPTERS --> DSPY_MOD
    DSPY_MOD --> DSPY_SIG
    CLI -->|"PersistencePort"| FILE_IO
    UC -.->|"depends on"| ENT
    UC -.->|"depends on"| PORTS
    EVO -.->|"depends on"| ENT
    EVO -.->|"depends on"| SCORE
    EVAL -.->|"depends on"| ENT
    ADAPTERS -.->|"implements"| PORTS
    style PRESENTATION fill:#1a1a2e,stroke:#00d2ff,color:#fff
    style APPLICATION fill:#0f3460,stroke:#00d2ff,color:#fff
    style DOMAIN fill:#16213e,stroke:#e94560,color:#fff
    style INFRA fill:#1a1a2e,stroke:#e94560,color:#fff

11.2 Dependency Rule

flowchart LR
    CLI["CLI"] --> APP["Application"]
    APP --> DOMAIN["Domain"]
    INFRA["Infrastructure"] --> DOMAIN
    CLI --> INFRA
    style DOMAIN fill:#e94560,color:#fff
    style APP fill:#0f3460,color:#fff
    style INFRA fill:#1a1a2e,color:#fff
    style CLI fill:#16213e,color:#fff

Rule: Arrows NEVER point from the Domain outward. The Domain knows nothing about DSPy, Typer, or YAML.
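
One way to keep this rule from eroding over time is a small automated check. The sketch below, using only the standard library, parses Python source with ast and reports imports of banned top-level packages. `forbidden_imports` and `BANNED_IN_DOMAIN` are hypothetical names, and wiring the check over the real domain files is left out.

```python
import ast


def forbidden_imports(source: str, banned: set[str]) -> list[str]:
    """Return the banned modules imported by the given Python source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            # Compare on the top-level package, e.g. "yaml" for "yaml.nodes".
            if name.split(".")[0] in banned:
                found.append(name)
    return sorted(found)


# Packages the Domain layer must not know about, per the rule above.
BANNED_IN_DOMAIN = {"dspy", "typer", "yaml"}

clean_source = "from dataclasses import dataclass\nimport abc\n"
leaky_source = "import yaml\nfrom dspy import Signature\n"

clean_hits = forbidden_imports(clean_source, BANNED_IN_DOMAIN)
leaky_hits = forbidden_imports(leaky_source, BANNED_IN_DOMAIN)
```

A unit test could read each file under the domain package and assert that the returned list is empty.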

11.3 Sequence Diagram: Full Run

sequenceDiagram
    participant User
    participant CLI as CLI (Typer)
    participant UC as OptimizePromptUseCase
    participant BOOT as SyntheticBootstrap
    participant SYNTH as DSPySyntheticAdapter
    participant LOOP as EvolutionLoop
    participant EVAL as PromptEvaluator
    participant LLM as DSPyLLMAdapter
    participant JUDGE as DSPyJudgeAdapter
    participant PROP as DSPyProposerAdapter
    participant FS as YamlPersistence
    User->>CLI: prometheus optimize -i in.yaml -o out.yaml
    CLI->>FS: read_config("in.yaml")
    FS-->>CLI: raw_config dict
    CLI->>UC: execute(config)
    Note over UC,SYNTH: Phase 0: Bootstrap
    UC->>BOOT: run(task_desc, 20)
    BOOT->>SYNTH: generate_inputs(task_desc, 20)
    SYNTH->>SYNTH: dspy.ChainOfThought(GenerateSyntheticInputs)
    SYNTH-->>BOOT: [20 SyntheticExample]
    BOOT-->>UC: synthetic_pool
    Note over UC,PROP: Phase 1: Evolution
    UC->>LOOP: run(seed_prompt, pool, task_desc)
    loop 30 iterations
        LOOP->>BOOT: sample_minibatch(pool, 5)
        BOOT-->>LOOP: [5 examples]
        LOOP->>EVAL: evaluate(current_prompt, batch)
        EVAL->>LLM: execute(prompt, input) ×5
        LLM-->>EVAL: 5 outputs
        EVAL->>JUDGE: judge_batch(task_desc, pairs)
        JUDGE->>JUDGE: dspy.ChainOfThought(JudgeOutput) ×5
        JUDGE-->>EVAL: [(score, feedback) ×5]
        EVAL-->>LOOP: EvalResult
        LOOP->>PROP: propose(prompt, trajectories, task_desc)
        PROP->>PROP: dspy.ChainOfThought(ProposeInstruction)
        PROP-->>LOOP: new Prompt
        LOOP->>EVAL: evaluate(new_prompt, batch)
        EVAL->>LLM: execute(new_prompt, input) ×5
        EVAL->>JUDGE: judge_batch(task_desc, pairs)
        EVAL-->>LOOP: new EvalResult
        alt new_score > old_score
            LOOP->>LOOP: best = new_prompt
        end
    end
    LOOP-->>UC: OptimizationState
    UC-->>CLI: OptimizationResult
    CLI->>FS: write_result("out.yaml", result)
    CLI-->>User: ✅ Optimized prompt + metrics

11.4 Data Flow Diagram

flowchart LR
    subgraph INPUT
        YAML["config.yaml"]
    end
    subgraph GENERATION
        SYNTH["Synthetic Pool<br/>20 inputs"]
    end
    subgraph EVAL["Evaluation Pipeline"]
        EXEC["Execute<br/>(task_lm)"]
        JUDGE["Judge<br/>(judge_lm)"]
    end
    subgraph PROPOSAL
        PROP["Propose<br/>(proposer_lm)"]
    end
    subgraph OUTPUT
        RESULT["result.yaml"]
    end
    YAML --> SYNTH
    SYNTH -->|"minibatch 5"| EXEC
    EXEC -->|"outputs"| JUDGE
    JUDGE -->|"scores + feedbacks"| PROP
    PROP -->|"new_prompt"| EXEC
    JUDGE -->|"scores"| RESULT
    PROP --> RESULT
    style INPUT fill:#1a1a2e,stroke:#00d2ff,color:#fff
    style GENERATION fill:#0f3460,stroke:#00d2ff,color:#fff
    style EVAL fill:#16213e,stroke:#e94560,color:#fff
    style PROPOSAL fill:#1a1a2e,stroke:#e94560,color:#fff
    style OUTPUT fill:#0f3460,stroke:#e94560,color:#fff

Section Summary

Section | Goal | Key files
Domain | Pure business core, zero dependencies | entities.py, ports.py, scoring.py
Application | Business orchestration through the ports | use_cases.py, bootstrap.py, evaluator.py, evolution.py
Infrastructure | DSPy implementation of the ports | dspy_signatures.py, dspy_modules.py, *_adapter.py
CLI | Typer user interface | cli/app.py
I/O | YAML config in, YAML result out | file_io.py
Tests | Pyramid: unit → integration → e2e | tests/

The flow: config.yaml → CLI → UseCase → Bootstrap (synth inputs) → EvolutionLoop (evaluate × propose × accept) × N → result.yaml