feat: v0.2.0 sprint — ground truth eval, crossover/mutation, checkpointing, similarity guards, dataset loader, CLI commands, extended test coverage

Aggregates all v0.2.0 sprint work (GARAA-30 through GARAA-40) and fixes
2 integration tests that broke when the codebase went async (DSPyLLMAdapter
and full pipeline tests now properly await coroutines).

277 tests pass (260 unit + 17 integration).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
FullStackDev
2026-03-29 19:13:50 +00:00
parent b9745566c8
commit a5bf2ad59c
43 changed files with 5007 additions and 358 deletions

369
docs/FEATURE_ROADMAP.md Normal file
View File

@@ -0,0 +1,369 @@
# PROMETHEUS Feature Roadmap
> Complete codebase review — features needed for production-grade prompt optimization.
> Generated from v0.1.0 architecture review (2026-03-29).
---
## Legend
| Marker | Meaning |
|--------|---------|
| **CLI** | Exposed as a CLI option/flag |
| **Config** | YAML config field |
| **Internal** | No user-facing surface, architectural improvement |
| **P1** | Critical / must-have for reliability |
| **P2** | High value, should-have |
| **P3** | Nice-to-have, deferred to later versions |
---
## 1. Multi-Model Routing (P1)
**Current state:** `OptimizationConfig` defines four model slots (`task_model`, `judge_model`, `proposer_model`, `synth_model`), but `cli/app.py` only configures a single global DSPy LM from `task_model`. All adapters silently use the same model regardless of config.
**Feature:**
- Each adapter (`DSPyLLMAdapter`, `DSPyJudgeAdapter`, `DSPyProposerAdapter`, `DSPySyntheticAdapter`) must instantiate its own `dspy.LM` from the corresponding config field.
- Support per-model `api_base` and `api_key_env` overrides (e.g., judge on GPT-4o, propose on a cheaper model).
**Surface:** Config (already partially defined) — `judge_model`, `proposer_model`, `synth_model` become functional. No new CLI flags needed; the YAML already has the fields.
**Scope:** Infrastructure layer (`llm_adapter.py`, `judge_adapter.py`, `proposer_adapter.py`, `synth_adapter.py`) + `cli/app.py` DI wiring.
---
## 2. Async / Parallel Execution (P1)
**Current state:** All LLM calls (execute, judge, propose) are sequential. A single iteration with `minibatch_size=5` makes ~11 sequential LLM calls. Wall-clock time scales linearly with minibatch size.
**Feature:**
- Parallelize execution of the prompt across a minibatch (`asyncio.gather` or `dspy.Parallel`).
- Parallelize judge calls within a batch.
- Keep the proposer sequential (single call per iteration).
**Surface:** Internal. Optionally exposed via `--max-concurrency` CLI flag and `max_concurrency` YAML field.
**Scope:** `evaluator.py`, `judge_adapter.py`, `llm_adapter.py`.
---
## 3. Robust Error Handling & Retry (P1)
**Current state:** The evolution loop catches broad `Exception` per iteration and logs it, then continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks only cover parsing, not network errors.
**Feature:**
- Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx).
- Configurable `max_retries` and `retry_delay_base`.
- Circuit breaker: if N consecutive iterations fail, pause and alert.
- Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation.
**Surface:** `--max-retries` CLI flag, `max_retries` Config field. `--error-strategy` (skip | retry | abort) CLI flag.
**Scope:** Infrastructure adapters + evolution loop.
---
## 4. Checkpoint & Resume (P2)
**Current state:** If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence.
**Feature:**
- Save `OptimizationState` to disk every K iterations (or every accepted improvement).
- Resume from the latest checkpoint file on restart.
- Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state.
**Surface:** `--checkpoint-dir` CLI flag (default: `.prometheus/checkpoints/`). `--resume` CLI flag to resume from latest checkpoint. `checkpoint_interval` Config field.
**Scope:** New `CheckpointPort` in domain, `JsonCheckpointPersistence` in infrastructure, modifications to `EvolutionLoop.run()`.
---
## 5. Population-Based Evolution (P2)
**Current state:** The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The `Candidate` entity has `generation` and `parent_id` fields that suggest population support was planned.
**Feature:**
- Maintain a population of K candidates (e.g., top-K by score or Pareto front).
- Crossover: combine instructions from two parent candidates.
- Mutation operators: paraphrase, constrain, generalize, specialize.
- Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance).
**Surface:** `--population-size` CLI flag, `population_size` Config field. `--crossover-rate`, `--mutation-rate` CLI flags.
**Scope:** `EvolutionLoop` refactor, new `CrossoverPort` and `MutationPort` in domain, new DSPy signatures for crossover/mutation in infrastructure.
---
## 6. Hold-Out Validation (P2)
**Current state:** The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs.
**Feature:**
- Split synthetic pool into train (e.g., 70%) and validation (30%) sets.
- Evolution uses train minibatches for accept/reject decisions.
- After each iteration, evaluate the best candidate on the hold-out set.
- Report both train and validation scores in results.
- Optional early stopping if validation score degrades for K consecutive iterations.
**Surface:** `--validation-split` CLI flag (default: 0.3). `--early-stop-patience` CLI flag (default: 5). Config fields: `validation_split`, `early_stop_patience`.
**Scope:** `SyntheticBootstrap`, `EvolutionLoop`, `OptimizationResult` (add validation metrics).
---
## 7. Custom Judge Criteria (P2)
**Current state:** The judge uses a hardcoded rubric in `JudgeOutput` DSPy signature ("score 0.0-1.0" with generic quality assessment). Users cannot customize evaluation criteria.
**Feature:**
- Allow users to define custom judge rubrics, criteria, and scoring scales.
- Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights.
- Allow `perfect_score` to reflect the custom scale.
**Surface:** `judge_criteria` YAML field (free text). `judge_dimensions` YAML field (list of `{name, weight, description}`). CLI: `--judge-criteria` for quick overrides.
**Scope:** `JudgeOutput` signature (dynamic instructions), `JudgePort`, `DSPyJudgeAdapter`, `scoring.py` (weighted aggregation).
---
## 8. Real-World Evaluation Harness (P2)
**Current state:** The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs.
**Feature:**
- Accept an optional evaluation dataset (CSV/JSON with `input` and `expected_output` columns).
- When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge.
- Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected.
**Surface:** `--eval-dataset` CLI flag. `eval_dataset_path` Config field. `--eval-metric` CLI flag (exact | semantic | llm_judge).
**Scope:** New `GroundTruthEvaluator` in application, new `SimilarityPort` in domain, dataset loader in infrastructure.
---
## 9. Logging & Observability (P2)
**Current state:** Verbose mode (`-v`) configures Python's `logging` module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing.
**Feature:**
- Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
- JSON-formatted log output for machine parsing.
- Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff.
- Optional OpenTelemetry export for distributed tracing.
**Surface:** `-v` / `--verbose` enables INFO level. `--debug` enables DEBUG level. `--log-format` (text | json). `--log-file` for file output. Config fields: `log_level`, `log_format`, `log_file`.
**Scope:** `cli/app.py` (logging setup), `evolution.py` (structured traces), new `TracingPort` in domain.
---
## 10. CLI Improvements (P2)
**Current state:** Single `optimize` command. Known Typer 0.24 bug absorbing subcommands (Bug #1). No `version`, `init`, or `list-results` commands.
**Feature:**
- Fix Typer subcommand routing.
- `prometheus version` — show version.
- `prometheus init` — scaffold a config YAML interactively.
- `prometheus list` — list past optimization runs.
- `prometheus diff` — compare two result files (before/after prompt diff, score improvement).
- `prometheus eval` — evaluate a prompt against a dataset without optimization.
**Surface:** CLI subcommands.
**Scope:** `cli/app.py` restructured into `cli/commands/` with one module per command.
---
## 11. Input Validation & Schema Enforcement (P2)
**Current state:** Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline.
**Feature:**
- Validate input YAML against a Pydantic schema (leveraging the existing `pydantic` dependency).
- Provide clear, actionable error messages for missing/invalid fields.
- Support config migration/upgrade from older versions.
**Surface:** Internal. Errors surface as clear CLI messages.
**Scope:** `OptimizationConfig` converted to Pydantic model with validators, `cli/app.py` validation step before pipeline execution.
---
## 12. Adaptive Minibatch Sizing (P3)
**Current state:** Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive.
**Feature:**
- Start with a small minibatch for quick early iterations.
- Increase minibatch size as the prompt improves (higher confidence needed for marginal gains).
- Shrink if too many evaluations fail (cost optimization).
**Surface:** `--adaptive-minibatch` CLI flag (boolean toggle). `minibatch_size` becomes `minibatch_size_min` and `minibatch_size_max` in config.
**Scope:** `EvolutionLoop`, `SyntheticBootstrap`.
---
## 13. Prompt Diversity Tracking (P3)
**Current state:** No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change.
**Feature:**
- Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts.
- Report diversity metrics in the result.
- Flag stagnation (N iterations with <epsilon change).
**Surface:** Internal. Reported in `OptimizationResult.history` entries.
**Scope:** `EvolutionLoop`, `OptimizationResult` (add diversity field per history entry).
---
## 14. Temperature & Sampling Control (P3)
**Current state:** No way to control LLM temperature, top_p, or other sampling parameters for any of the four model slots. DSPy defaults apply.
**Feature:**
- Per-model-slot temperature and sampling parameters.
- Higher temperature for proposer (creativity), lower for judge (consistency).
**Surface:** `task_temperature`, `judge_temperature`, `proposer_temperature`, `synth_temperature` Config fields. `--temperature` CLI flag for global override.
**Scope:** `cli/app.py` (DSPy LM configuration), infrastructure adapters.
---
## 15. Cost Estimation & Budget Caps (P3)
**Current state:** `total_llm_calls` is tracked (inaccurately). No cost estimation, no budget caps.
**Feature:**
- Estimate cost per run based on model pricing and approximate token counts.
- Allow users to set a budget cap (`--max-cost-usd`).
- Report estimated cost in the result.
**Surface:** `--max-cost-usd` CLI flag. `max_cost_usd` Config field. Cost breakdown in result output.
**Scope:** `cli/app.py`, `OptimizationResult` (add cost fields), token counting in adapters.
---
## 16. Multi-Objective Optimization (P3)
**Current state:** Single scalar score from the judge. The `Prompt` entity comment mentions "Pareto tracking" but it's not implemented.
**Feature:**
- Optimize for multiple objectives simultaneously (quality, latency, token efficiency, safety).
- Maintain a Pareto front of non-dominated candidates.
- Allow users to set objective weights or constraints.
**Surface:** `objectives` Config field (list of `{name, weight, judge_criteria}`). CLI: `--objective` repeatable flag.
**Scope:** `EvolutionLoop` (Pareto front), `scoring.py` (multi-objective acceptance), `OptimizationResult` (Pareto set).
---
## 17. Export Optimized Prompt (P3)
**Current state:** The optimized prompt is embedded in the YAML result file. No easy way to extract it for use.
**Feature:**
- `prometheus export` command to extract the optimized prompt as plain text.
- Support multiple export formats: plain text, Markdown, JSON, LangChain template, DSPy module.
- Copy to clipboard option.
**Surface:** `prometheus export --format <txt|md|json|langchain|dspy>` CLI subcommand. `--clipboard` flag.
**Scope:** New `cli/commands/export.py`, format renderers in infrastructure.
---
## 18. Config Profiles / Presets (P3)
**Current state:** Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured.
**Feature:**
- Named profiles: `fast`, `thorough`, `economy`, `research`.
- Profile overrides individual config fields.
- User-defined profiles stored in `~/.prometheus/profiles/`.
**Surface:** `--profile` CLI flag. `prometheus profile list` / `prometheus profile create` subcommands.
**Scope:** `cli/app.py`, new `ProfileManager` in application.
---
## Summary Table
| # | Feature | Priority | CLI Surface | Config Surface | Estimated Scope |
|---|---------|----------|-------------|----------------|-----------------|
| 1 | Multi-Model Routing | P1 | Existing | Existing | Small |
| 2 | Async / Parallel Execution | P1 | `--max-concurrency` | `max_concurrency` | Medium |
| 3 | Error Handling & Retry | P1 | `--max-retries`, `--error-strategy` | `max_retries`, `error_strategy` | Medium |
| 4 | Checkpoint & Resume | P2 | `--checkpoint-dir`, `--resume` | `checkpoint_interval` | Medium |
| 5 | Population-Based Evolution | P2 | `--population-size`, `--crossover-rate` | `population_size`, `crossover_rate` | Large |
| 6 | Hold-Out Validation | P2 | `--validation-split`, `--early-stop-patience` | `validation_split`, `early_stop_patience` | Medium |
| 7 | Custom Judge Criteria | P2 | `--judge-criteria` | `judge_criteria`, `judge_dimensions` | Medium |
| 8 | Real-World Eval Harness | P2 | `--eval-dataset`, `--eval-metric` | `eval_dataset_path` | Large |
| 9 | Logging & Observability | P2 | `--debug`, `--log-format`, `--log-file` | `log_level`, `log_format` | Medium |
| 10 | CLI Improvements | P2 | Subcommands | — | Medium |
| 11 | Input Validation | P2 | — (error messages) | — | Small |
| 12 | Adaptive Minibatch | P3 | `--adaptive-minibatch` | `minibatch_size_min/max` | Small |
| 13 | Prompt Diversity Tracking | P3 | — | — | Small |
| 14 | Temperature & Sampling | P3 | `--temperature` | `*_temperature` | Small |
| 15 | Cost Estimation | P3 | `--max-cost-usd` | `max_cost_usd` | Small |
| 16 | Multi-Objective Optimization | P3 | `--objective` | `objectives` | Large |
| 17 | Export Optimized Prompt | P3 | `prometheus export` | — | Small |
| 18 | Config Profiles / Presets | P3 | `--profile` | — | Small |
---
## Known Bugs (from TEST_REPORT.md and code review)
| # | Bug | Severity | File |
|---|-----|----------|------|
| 1 | Multi-model config not wired — all adapters use single global LM | HIGH | `cli/app.py`, all adapters |
| 2 | `DSPyLLMAdapter` accepts `model` param but never uses it | HIGH | `infrastructure/llm_adapter.py` |
| 3 | CLI subcommand `optimize` absorbed by Typer 0.24 | HIGH | `cli/app.py` |
| 4 | Verbose logging produces no output — no handler configured | MEDIUM | `cli/app.py` |
| 5 | `total_llm_calls` counter is inaccurate | LOW | `application/use_cases.py`, `evolution.py` |
| 6 | `normalize_score()` is dead code — never called | LOW | `domain/scoring.py` |
| 7 | `AppSettings` is never imported or used | LOW | `config.py` |
| 8 | No LLM error handling in evolution loop | MEDIUM | `evolution.py` |
| 9 | Unpinned dependencies (dspy, typer) | LOW | `pyproject.toml` |
---
## Test Coverage Gaps
| Area | Current | Needed |
|------|---------|--------|
| CLI commands | 0 tests | Unit + integration for each subcommand |
| Config validation | 0 tests | Schema validation, missing fields, type errors |
| Evolution loop | 3 tests (single iteration each) | Multi-iteration, mixed accept/reject, failure recovery |
| Integration pipeline | 1 test (happy path only) | Error paths, mixed results, real adapters |
| Adapter coverage | 1 adapter tested | All 4 adapters + error scenarios |
| Use case orchestration | 1 indirect test | Direct unit tests for `OptimizePromptUseCase` |
---
## Recommended Implementation Order
### Phase 1 — Production Reliability (P1)
1. Fix multi-model routing (#1) — highest impact, smallest scope
2. Add error handling & retry (#3) — essential for production runs
3. Implement async/parallel execution (#2) — biggest wall-clock improvement
### Phase 2 — Optimization Quality (P2)
4. Input validation (#11) — small scope, high reliability gain
5. Logging & observability (#9) — enables debugging long runs
6. CLI improvements (#10) — fix Typer bug, add basic commands
7. Hold-out validation (#6) — prevents overfitting
8. Checkpoint & resume (#4) — essential for long runs
9. Custom judge criteria (#7) — enables domain-specific optimization
### Phase 3 — Advanced Features (P3)
10. Population-based evolution (#5)
11. Real-world eval harness (#8)
12. Remaining P3 features as demand dictates

View File

@@ -5,12 +5,12 @@ description = "Prompt evolution without reference data"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"dspy>=2.6,<3.0",
"typer>=0.15,<0.20",
"pydantic>=2.10",
"pydantic-settings>=2.7",
"pyyaml>=6.0",
"rich>=13.9",
"dspy==2.6.27",
"typer==0.19.2",
"pydantic==2.12.5",
"pydantic-settings==2.13.1",
"pyyaml==6.0.3",
"rich==14.3.3",
]
[project.optional-dependencies]
@@ -46,6 +46,6 @@ module = ["dspy", "dspy.*"]
ignore_missing_imports = true
[[tool.mypy.overrides]]
module = ["prometheus.infrastructure.*", "prometheus.cli.app"]
module = ["prometheus.infrastructure.*", "prometheus.cli.app", "prometheus.cli.commands.*"]
disable_error_code = ["misc", "import-untyped"]

View File

@@ -22,6 +22,24 @@ class SyntheticBootstrap:
self._generator = generator
self._rng = random.Random(seed)
@staticmethod
def split_pool(
pool: list[SyntheticExample],
validation_fraction: float,
rng: random.Random | None = None,
) -> tuple[list[SyntheticExample], list[SyntheticExample]]:
"""Split *pool* into (train, validation) sets.
Returns (pool, []) when *validation_fraction* is 0.
"""
if validation_fraction <= 0.0 or len(pool) < 2:
return pool, []
n_val = max(1, int(len(pool) * validation_fraction))
shuffled = list(pool)
_rng = rng or random.Random(42)
_rng.shuffle(shuffled)
return shuffled[:-n_val], shuffled[-n_val:]
def run(self, task_description: str, n_examples: int) -> list[SyntheticExample]:
"""Generate the synthetic pool in a single call.

View File

@@ -16,6 +16,7 @@ from prometheus.domain.entities import (
Trajectory,
)
from prometheus.domain.ports import JudgePort, LLMPort
from prometheus.domain.scoring import normalize_score
logger = logging.getLogger(__name__)
@@ -72,6 +73,7 @@ class PromptEvaluator:
trajectories: list[Trajectory] = []
for i, (example, output) in enumerate(zip(minibatch, outputs)):
score, feedback = judge_results[i]
score = normalize_score(score)
scores.append(score)
feedbacks.append(feedback)
trajectories.append(

View File

@@ -2,24 +2,33 @@
Evolution loop — core PROMETHEUS engine.
Orchestrates the select → evaluate → propose → accept cycle.
Equivalent to GEPAEngine.run(), adapted to work without a valset.
Supports two modes:
- Single-candidate hill climbing (population_size=1, backward compat)
- Population-based evolution with crossover & mutation (population_size>1)
"""
from __future__ import annotations
import logging
import random
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.cli.logging_setup import get_logger
from prometheus.domain.entities import (
Candidate,
OptimizationState,
Prompt,
SyntheticExample,
)
from prometheus.domain.ports import ProposerPort
from prometheus.domain.ports import (
CheckpointPort,
CrossoverPort,
MutationPort,
ProposerPort,
)
from prometheus.domain.scoring import should_accept
logger = logging.getLogger(__name__)
logger = get_logger("evolution")
class CircuitBreakerOpen(Exception):
@@ -30,9 +39,9 @@ class EvolutionLoop:
"""Main evolution loop.
Design:
- Keeps only the best candidate (no full population).
- Simplifies vs GEPA (no Pareto, no merge).
- Population support deferred to v2.
- population_size=1: classic single-candidate hill climbing (backward compat).
- population_size>1: population-based evolution with crossover, mutation,
and diversity maintenance.
Error handling:
- Transient errors are retried by adapters.
@@ -51,6 +60,17 @@ class EvolutionLoop:
verbose: bool = False,
circuit_breaker_threshold: int = 5,
error_strategy: str = "retry",
checkpoint_port: CheckpointPort | None = None,
checkpoint_interval: int = 5,
# --- Population-based evolution params ---
population_size: int = 1,
crossover_rate: float = 0.5,
mutation_rate: float = 0.3,
diversity_penalty: float = 0.1,
crossover_port: CrossoverPort | None = None,
mutation_port: MutationPort | None = None,
# --- Hold-out validation params ---
early_stop_patience: int = 5,
):
self._evaluator = evaluator
self._proposer = proposer
@@ -61,18 +81,44 @@ class EvolutionLoop:
self._verbose = verbose
self._circuit_breaker_threshold = circuit_breaker_threshold
self._error_strategy = error_strategy
self._checkpoint_port = checkpoint_port
self._checkpoint_interval = checkpoint_interval
self._population_size = population_size
self._crossover_rate = crossover_rate
self._mutation_rate = mutation_rate
self._diversity_penalty = diversity_penalty
self._crossover_port = crossover_port
self._mutation_port = mutation_port
self._early_stop_patience = early_stop_patience
async def run(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
initial_state: OptimizationState | None = None,
validation_pool: list[SyntheticExample] | None = None,
) -> OptimizationState:
"""Execute the complete evolution loop."""
state = OptimizationState()
"""Execute the complete evolution loop.
If *initial_state* is provided (from a checkpoint), resume from that
point — skipping the seed evaluation and continuing at the saved iteration.
If *validation_pool* is provided (non-empty), the best candidate is
evaluated on the hold-out set after each iteration and early stopping
is applied when validation score degrades for ``early_stop_patience``
consecutive iterations.
"""
state = initial_state or OptimizationState()
consecutive_failures = 0
# Evaluate the seed
# Hold-out validation tracking
has_validation = bool(validation_pool)
best_validation_score: float = -1.0
validation_patience_counter: int = 0
# Only evaluate the seed when starting fresh (no checkpoint resume)
if initial_state is None:
initial_batch = self._bootstrap.sample_minibatch(
synthetic_pool, self._minibatch_size
)
@@ -81,31 +127,162 @@ class EvolutionLoop:
)
state.total_llm_calls += 2 * self._minibatch_size # N executions + N judge calls
best_candidate = Candidate(
seed_candidate = Candidate(
prompt=seed_prompt,
best_score=initial_eval.total_score,
generation=0,
)
state.best_candidate = best_candidate
state.candidates.append(best_candidate)
self._log(f"Initial score: {initial_eval.total_score:.2f}")
state.best_candidate = seed_candidate
state.candidates.append(seed_candidate)
logger.info(
"Initial evaluation complete",
extra={
"structured": {
"event": "initial_eval",
"score": round(initial_eval.total_score, 4),
"minibatch_size": self._minibatch_size,
"sample_ids": [ex.id for ex in initial_batch],
},
},
)
# Evaluate seed on validation set
if has_validation and state.best_candidate is not None:
val_eval = await self._evaluator.evaluate(
state.best_candidate.prompt, validation_pool, task_description
)
state.total_llm_calls += 2 * len(validation_pool)
best_validation_score = val_eval.mean_score
logger.info(
"Initial validation evaluation",
extra={
"structured": {
"event": "validation_eval",
"iteration": 0,
"validation_score": round(best_validation_score, 4),
"validation_pool_size": len(validation_pool),
},
},
)
# Population initialization: seed the population with mutations
if self._population_size > 1:
await self._initialize_population(
state, seed_prompt, seed_candidate, task_description
)
else:
logger.info(
"Resuming from checkpoint",
extra={
"structured": {
"event": "resume",
"iteration": state.iteration,
"total_llm_calls": state.total_llm_calls,
},
},
)
# Restore validation tracking from state history
if has_validation:
for entry in reversed(state.history):
if entry.get("event") == "validation_eval":
best_validation_score = entry.get("best_validation_score", -1.0)
validation_patience_counter = entry.get("validation_patience", 0)
break
# Determine starting iteration
start_iteration = state.iteration + 1
# Main loop
for i in range(1, self._max_iterations + 1):
for i in range(start_iteration, self._max_iterations + 1):
state.iteration = i
try:
await self._run_iteration(
i, state, best_candidate, synthetic_pool, task_description
if self._population_size > 1 and len(state.candidates) > 1:
await self._run_population_iteration(
i, state, synthetic_pool, task_description
)
else:
await self._run_single_iteration(
i, state, synthetic_pool, task_description
)
# Update best_candidate from state after successful iteration
best_candidate = state.best_candidate # type: ignore[assignment]
consecutive_failures = 0
# Hold-out validation: evaluate best candidate on validation set
if has_validation and state.best_candidate is not None:
val_eval = await self._evaluator.evaluate(
state.best_candidate.prompt, validation_pool, task_description
)
state.total_llm_calls += 2 * len(validation_pool)
current_val_score = val_eval.mean_score
if current_val_score > best_validation_score:
best_validation_score = current_val_score
validation_patience_counter = 0
else:
validation_patience_counter += 1
state.history.append({
"iteration": i,
"event": "validation_eval",
"validation_score": round(current_val_score, 4),
"best_validation_score": round(best_validation_score, 4),
"validation_patience": validation_patience_counter,
})
logger.info(
"Validation evaluation",
extra={
"structured": {
"event": "validation_eval",
"iteration": i,
"validation_score": round(current_val_score, 4),
"best_validation_score": round(best_validation_score, 4),
"patience": f"{validation_patience_counter}/{self._early_stop_patience}",
},
},
)
if validation_patience_counter >= self._early_stop_patience:
logger.warning(
"Early stopping triggered — validation score did not improve for %d iterations",
self._early_stop_patience,
extra={
"structured": {
"event": "early_stop",
"iteration": i,
"best_validation_score": round(best_validation_score, 4),
"patience": self._early_stop_patience,
},
},
)
state.history.append({
"iteration": i,
"event": "early_stop",
"best_validation_score": round(best_validation_score, 4),
"patience": self._early_stop_patience,
})
state.best_validation_score = best_validation_score
state.early_stopped = True
if self._checkpoint_port is not None:
self._checkpoint_port.save(state)
break
# Checkpoint on accepted improvement (detected via state change)
self._maybe_checkpoint(state)
except Exception as exc:
consecutive_failures += 1
self._log(
f"Iter {i}: ERROR ({consecutive_failures} consecutive) — {exc}"
logger.error(
"Iteration error",
extra={
"structured": {
"event": "iteration_error",
"iteration": i,
"consecutive_failures": consecutive_failures,
"error": str(exc),
},
},
exc_info=True,
)
state.history.append(
{
@@ -118,9 +295,16 @@ class EvolutionLoop:
# Check circuit breaker
if consecutive_failures >= self._circuit_breaker_threshold:
self._log(
f"Circuit breaker tripped after {consecutive_failures} "
f"consecutive failures."
logger.warning(
"Circuit breaker tripped",
extra={
"structured": {
"event": "circuit_breaker",
"iteration": i,
"consecutive_failures": consecutive_failures,
"error_strategy": self._error_strategy,
},
},
)
state.history.append(
{
@@ -134,7 +318,9 @@ class EvolutionLoop:
f"Circuit breaker tripped after "
f"{consecutive_failures} consecutive failures"
) from exc
# skip / retry strategies: stop the loop gracefully
# skip / retry strategies: save checkpoint, then stop the loop gracefully
if self._checkpoint_port is not None:
self._checkpoint_port.save(state)
break
if self._error_strategy == "abort":
@@ -142,21 +328,77 @@ class EvolutionLoop:
# skip / retry: continue to next iteration
continue
# Store final validation metadata on state
if has_validation:
state.best_validation_score = best_validation_score
return state
async def _run_iteration(
# ------------------------------------------------------------------
# Population initialization
# ------------------------------------------------------------------
async def _initialize_population(
self,
state: OptimizationState,
seed_prompt: Prompt,
seed_candidate: Candidate,
task_description: str,
) -> None:
"""Fill the population with mutated variants of the seed prompt."""
n_needed = self._population_size - 1
mutation_types = ["paraphrase", "constrain", "generalize", "specialize"]
for idx in range(n_needed):
mutation_type = mutation_types[idx % len(mutation_types)]
if self._mutation_port is not None:
new_prompt = await self._mutation_port.mutate(
seed_prompt, task_description, mutation_type
)
else:
# Fallback: use proposer for reflective mutation
new_prompt = await self._proposer.propose(
seed_prompt, [], task_description
)
state.total_llm_calls += 1
new_candidate = Candidate(
prompt=new_prompt,
best_score=seed_candidate.best_score, # estimate until evaluated
generation=0,
parent_id=id(seed_candidate),
)
state.candidates.append(new_candidate)
logger.info(
"Population initialized",
extra={
"structured": {
"event": "population_init",
"population_size": len(state.candidates),
},
},
)
# ------------------------------------------------------------------
# Single-candidate iteration (original hill-climbing)
# ------------------------------------------------------------------
async def _run_single_iteration(
self,
i: int,
state: OptimizationState,
best_candidate: Candidate,
synthetic_pool: list[SyntheticExample],
task_description: str,
) -> None:
"""Execute a single iteration. Mutates *state* in-place."""
"""Execute a single-candidate iteration. Mutates *state* in-place."""
best_candidate = state.best_candidate # type: ignore[assignment]
# 1. Sample a fresh minibatch
batch = self._bootstrap.sample_minibatch(
synthetic_pool, self._minibatch_size
)
sample_ids = [ex.id for ex in batch]
# 2. Evaluate the current candidate
current_eval = await self._evaluator.evaluate(
@@ -164,9 +406,31 @@ class EvolutionLoop:
)
state.total_llm_calls += 2 * self._minibatch_size
logger.debug(
"Iteration minibatch evaluated",
extra={
"structured": {
"event": "minibatch_eval",
"iteration": i,
"sample_ids": sample_ids,
"scores": [round(s, 4) for s in current_eval.scores],
"total_score": round(current_eval.total_score, 4),
},
},
)
# 3. Skip if perfect
if all(s >= self._perfect_score for s in current_eval.scores):
self._log(f"Iter {i}: All scores perfect, skipping.")
logger.info(
"Iteration skipped — all scores perfect",
extra={
"structured": {
"event": "skip_perfect",
"iteration": i,
"total_score": round(current_eval.total_score, 4),
},
},
)
state.history.append(
{
"iteration": i,
@@ -177,12 +441,26 @@ class EvolutionLoop:
return
# 4. Propose a new prompt (reflective mutation) — sequential
state.total_llm_calls += 1
new_prompt = await self._proposer.propose(
best_candidate.prompt,
current_eval.trajectories,
task_description,
)
state.total_llm_calls += 1 # 1 proposition call
prompt_diff = self._compute_prompt_diff(
best_candidate.prompt.text, new_prompt.text
)
logger.debug(
"Proposed new prompt",
extra={
"structured": {
"event": "proposer_output",
"iteration": i,
"prompt_diff": prompt_diff,
},
},
)
# 5. Evaluate the new prompt on the same minibatch
new_eval = await self._evaluator.evaluate(
@@ -200,9 +478,22 @@ class EvolutionLoop:
)
state.best_candidate = new_candidate
state.candidates.append(new_candidate)
self._log(
f"Iter {i}: ACCEPTED "
f"({current_eval.total_score:.2f} -> {new_eval.total_score:.2f})"
logger.info(
"Iteration accepted",
extra={
"structured": {
"event": "accepted",
"iteration": i,
"old_score": round(current_eval.total_score, 4),
"new_score": round(new_eval.total_score, 4),
"improvement": round(
new_eval.total_score - current_eval.total_score, 4
),
"sample_ids": sample_ids,
"new_scores": [round(s, 4) for s in new_eval.scores],
"prompt_diff": prompt_diff,
},
},
)
state.history.append(
{
@@ -215,9 +506,19 @@ class EvolutionLoop:
}
)
else:
self._log(
f"Iter {i}: REJECTED "
f"({new_eval.total_score:.2f} <= {current_eval.total_score:.2f})"
logger.info(
"Iteration rejected",
extra={
"structured": {
"event": "rejected",
"iteration": i,
"old_score": round(current_eval.total_score, 4),
"new_score": round(new_eval.total_score, 4),
"sample_ids": sample_ids,
"new_scores": [round(s, 4) for s in new_eval.scores],
"prompt_diff": prompt_diff,
},
},
)
state.history.append(
{
@@ -228,6 +529,213 @@ class EvolutionLoop:
}
)
def _log(self, msg: str) -> None:
if self._verbose:
logger.info("[PROMETHEUS] %s", msg)
# ------------------------------------------------------------------
# Population-based iteration
# ------------------------------------------------------------------
async def _run_population_iteration(
self,
i: int,
state: OptimizationState,
synthetic_pool: list[SyntheticExample],
task_description: str,
) -> None:
"""Execute a population-based iteration. Mutates *state* in-place."""
population = state.candidates
# 1. Sample a fresh minibatch
batch = self._bootstrap.sample_minibatch(
synthetic_pool, self._minibatch_size
)
sample_ids = [ex.id for ex in batch]
# 2. Select two parents via tournament selection
parent_a = self._tournament_select(population)
parent_b = self._tournament_select(population)
# 3. Generate child: crossover or reflective mutation
use_crossover = (
random.random() < self._crossover_rate
and self._crossover_port is not None
)
if use_crossover:
state.total_llm_calls += 1
child_prompt = await self._crossover_port.crossover(
parent_a.prompt, parent_b.prompt, task_description
)
origin = "crossover"
else:
# Reflective mutation: evaluate a parent, propose improvement
state.total_llm_calls += 2 * self._minibatch_size
parent_eval = await self._evaluator.evaluate(
parent_a.prompt, batch, task_description
)
state.total_llm_calls += 1
child_prompt = await self._proposer.propose(
parent_a.prompt,
parent_eval.trajectories,
task_description,
)
origin = "reflective"
# 4. Optional mutation
if random.random() < self._mutation_rate and self._mutation_port is not None:
mutation_type = random.choice(
["paraphrase", "constrain", "generalize", "specialize"]
)
state.total_llm_calls += 1
child_prompt = await self._mutation_port.mutate(
child_prompt, task_description, mutation_type
)
origin += f"+mutation({mutation_type})"
# 5. Evaluate the child
state.total_llm_calls += 2 * self._minibatch_size
child_eval = await self._evaluator.evaluate(
child_prompt, batch, task_description
)
# 6. Compute fitness with diversity penalty
child_score = child_eval.total_score
diversity_sim = self._compute_diversity_score(
child_prompt, population
)
child_fitness = child_score - self._diversity_penalty * (1.0 - diversity_sim)
# 7. Find worst candidate and replace if child is better
worst_idx = min(
range(len(population)),
key=lambda idx: population[idx].best_score,
)
worst_fitness = (
population[worst_idx].best_score
- self._diversity_penalty * (1.0 - self._compute_diversity_score(
population[worst_idx].prompt, population
))
)
accepted = child_fitness > worst_fitness
if accepted:
new_candidate = Candidate(
prompt=child_prompt,
best_score=child_score,
generation=i,
parent_id=id(parent_a),
)
population[worst_idx] = new_candidate
# Update best if this child is the new best
if child_score > (state.best_candidate.best_score if state.best_candidate else 0):
state.best_candidate = new_candidate
logger.info(
"Population iteration accepted",
extra={
"structured": {
"event": "pop_accepted",
"iteration": i,
"origin": origin,
"child_score": round(child_score, 4),
"child_fitness": round(child_fitness, 4),
"diversity_sim": round(diversity_sim, 4),
"replaced_idx": worst_idx,
"sample_ids": sample_ids,
},
},
)
state.history.append(
{
"iteration": i,
"event": "pop_accepted",
"origin": origin,
"child_score": child_score,
"child_fitness": child_fitness,
"diversity_sim": diversity_sim,
}
)
else:
logger.info(
"Population iteration rejected",
extra={
"structured": {
"event": "pop_rejected",
"iteration": i,
"origin": origin,
"child_score": round(child_score, 4),
"child_fitness": round(child_fitness, 4),
"worst_fitness": round(worst_fitness, 4),
"sample_ids": sample_ids,
},
},
)
state.history.append(
{
"iteration": i,
"event": "pop_rejected",
"origin": origin,
"child_score": child_score,
"child_fitness": child_fitness,
"worst_fitness": worst_fitness,
}
)
# ------------------------------------------------------------------
# Selection and diversity helpers
# ------------------------------------------------------------------
def _tournament_select(
self,
population: list[Candidate],
tournament_size: int = 3,
) -> Candidate:
"""Tournament selection: pick the best from a random subset."""
k = min(tournament_size, len(population))
contestants = random.sample(population, k)
return max(contestants, key=lambda c: c.best_score)
@staticmethod
def _compute_diversity_score(
prompt: Prompt,
population: list[Candidate],
) -> float:
"""Compute the average Jaccard similarity between *prompt* and all
population members. Returns 1.0 when population has only one member
(no diversity penalty)."""
if len(population) <= 1:
return 1.0
prompt_words = set(prompt.text.lower().split())
if not prompt_words:
return 0.0
similarities: list[float] = []
for candidate in population:
other_words = set(candidate.prompt.text.lower().split())
if not other_words:
continue
intersection = prompt_words & other_words
union = prompt_words | other_words
sim = len(intersection) / len(union) if union else 0.0
similarities.append(sim)
# Average similarity (lower = more diverse)
return sum(similarities) / len(similarities) if similarities else 0.0
@staticmethod
def _compute_prompt_diff(old: str, new: str) -> dict[str, int]:
"""Compute a simple diff summary between two prompts."""
old_lines = set(old.splitlines())
new_lines = set(new.splitlines())
return {
"lines_added": len(new_lines - old_lines),
"lines_removed": len(old_lines - new_lines),
"chars_delta": len(new) - len(old),
}
def _maybe_checkpoint(self, state: OptimizationState) -> None:
"""Save a checkpoint if the interval is met or on accepted improvements."""
if self._checkpoint_port is None:
return
if state.iteration % self._checkpoint_interval == 0:
self._checkpoint_port.save(state)

View File

@@ -0,0 +1,116 @@
"""
Ground-truth evaluator — execution + similarity comparison.
Produces a quality signal *with* ground truth by comparing model outputs
against expected outputs using a configurable similarity metric.
"""
from __future__ import annotations
import asyncio
import logging
from prometheus.domain.entities import (
EvalResult,
GroundTruthExample,
Prompt,
Trajectory,
)
from prometheus.domain.ports import LLMPort, SimilarityPort
logger = logging.getLogger(__name__)
class GroundTruthEvaluator:
"""Evaluates a prompt against a ground-truth dataset.
Pipeline: execute → compare with similarity metric → build trajectories.
Unlike PromptEvaluator (which uses LLM-as-Judge), this compares outputs
directly against known-good expected outputs.
"""
def __init__(
self,
executor: LLMPort,
similarity: SimilarityPort,
max_concurrency: int = 5,
):
self._executor = executor
self._similarity = similarity
self._semaphore = asyncio.Semaphore(max_concurrency)
async def evaluate(
self,
prompt: Prompt,
dataset: list[GroundTruthExample],
) -> EvalResult:
"""Evaluate the prompt on the ground-truth dataset.
Steps:
1. Execute the prompt on each input (parallel, bounded)
2. Compare each output against expected using similarity metric
3. Build trajectories with feedback
"""
# Step 1: Parallel execution (per-item isolation)
output_coros = [
self._execute_single(prompt, example) for example in dataset
]
outputs = await asyncio.gather(*output_coros)
# Step 2: Compute similarity scores
scores: list[float] = []
feedbacks: list[str] = []
trajectories: list[Trajectory] = []
for example, output in zip(dataset, outputs):
score = self._similarity.compute(output, example.expected_output)
score = max(0.0, min(1.0, score)) # normalize to [0, 1]
scores.append(score)
feedback = self._build_feedback(output, example.expected_output, score)
feedbacks.append(feedback)
trajectories.append(
Trajectory(
input_text=example.input_text,
output_text=output,
score=score,
feedback=feedback,
prompt_used=prompt.text,
)
)
logger.info(
"Ground-truth evaluation complete: %d items, mean_score=%.4f",
len(dataset),
sum(scores) / len(scores) if scores else 0.0,
)
return EvalResult(
scores=scores,
feedbacks=feedbacks,
trajectories=trajectories,
)
async def _execute_single(
self, prompt: Prompt, example: GroundTruthExample
) -> str:
async with self._semaphore:
try:
return await self._executor.execute(prompt, example.input_text)
except Exception as exc:
logger.warning(
"Execution failed for input '%s': %s",
example.input_text[:40],
exc,
)
return f"[execution error: {exc}]"
@staticmethod
def _build_feedback(output: str, expected: str, score: float) -> str:
"""Build human-readable feedback for a ground-truth comparison."""
if score >= 0.99:
return "Exact match."
elif score >= 0.7:
return f"Close match (score={score:.2f}). Expected: {expected[:100]}"
elif score >= 0.3:
return f"Partial match (score={score:.2f}). Expected: {expected[:100]}"
else:
return f"Poor match (score={score:.2f}). Expected: {expected[:100]}"

View File

@@ -10,8 +10,16 @@ from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.dto import OptimizationConfig, OptimizationResult
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.cli.logging_setup import get_logger
from prometheus.domain.entities import Prompt
from prometheus.domain.ports import ProposerPort
from prometheus.domain.ports import (
CheckpointPort,
CrossoverPort,
MutationPort,
ProposerPort,
)
logger = get_logger("use_cases")
class OptimizePromptUseCase:
@@ -25,24 +33,60 @@ class OptimizePromptUseCase:
evaluator: PromptEvaluator,
proposer: ProposerPort,
bootstrap: SyntheticBootstrap,
checkpoint_port: CheckpointPort | None = None,
crossover_port: CrossoverPort | None = None,
mutation_port: MutationPort | None = None,
):
self._evaluator = evaluator
self._proposer = proposer
self._bootstrap = bootstrap
self._checkpoint_port = checkpoint_port
self._crossover_port = crossover_port
self._mutation_port = mutation_port
async def execute(self, config: OptimizationConfig) -> OptimizationResult:
"""Full pipeline:
1. Bootstrap → generate synthetic inputs
2. Evolution → optimization loop
2. Evolution → optimization loop (with optional checkpoint resume)
3. Return result
"""
# Phase 0: Bootstrap
# Phase 0: Bootstrap (skip synthetic generation on resume if pool was saved)
initial_state = None
if config.resume and self._checkpoint_port is not None:
initial_state = self._checkpoint_port.load()
if initial_state is not None and initial_state.synthetic_pool:
synthetic_pool = initial_state.synthetic_pool
logger.info(
"Resumed checkpoint includes %d synthetic inputs — skipping bootstrap",
extra={"structured": {"event": "resume_skip_bootstrap", "pool_size": len(synthetic_pool)}},
)
else:
synthetic_pool = self._bootstrap.run(
task_description=config.task_description,
n_examples=config.n_synthetic_inputs,
)
else:
synthetic_pool = self._bootstrap.run(
task_description=config.task_description,
n_examples=config.n_synthetic_inputs,
)
# Phase 1: Evolution
# Split into train / validation if configured
validation_pool: list = []
if config.validation_split > 0:
synthetic_pool, validation_pool = SyntheticBootstrap.split_pool(
synthetic_pool, config.validation_split,
)
logger.info(
"Split synthetic pool: %d train, %d validation (%.0f%% hold-out)",
len(synthetic_pool), len(validation_pool),
config.validation_split * 100,
extra={"structured": {
"event": "pool_split",
"train_size": len(synthetic_pool),
"val_size": len(validation_pool),
}},
)
loop = EvolutionLoop(
evaluator=self._evaluator,
proposer=self._proposer,
@@ -53,9 +97,22 @@ class OptimizePromptUseCase:
verbose=config.verbose,
circuit_breaker_threshold=config.circuit_breaker_threshold,
error_strategy=config.error_strategy,
checkpoint_port=self._checkpoint_port,
checkpoint_interval=config.checkpoint_interval,
population_size=config.population_size,
crossover_rate=config.crossover_rate,
mutation_rate=config.mutation_rate,
diversity_penalty=config.diversity_penalty,
crossover_port=self._crossover_port,
mutation_port=self._mutation_port,
early_stop_patience=config.early_stop_patience,
)
seed_prompt = Prompt(text=config.seed_prompt)
state = await loop.run(seed_prompt, synthetic_pool, config.task_description)
state = await loop.run(
seed_prompt, synthetic_pool, config.task_description,
initial_state=initial_state,
validation_pool=validation_pool or None,
)
# Phase 2: Result
initial_score = (
@@ -71,9 +128,12 @@ class OptimizePromptUseCase:
),
initial_prompt=config.seed_prompt,
iterations_used=state.iteration,
total_llm_calls=state.total_llm_calls + 1, # +1 for bootstrap
total_llm_calls=state.total_llm_calls + 1, # +1 for bootstrap synthesis call
initial_score=initial_score,
final_score=final_score,
improvement=final_score - initial_score,
history=state.history,
final_validation_score=state.best_validation_score,
best_validation_score=state.best_validation_score,
early_stopped=state.early_stopped,
)

View File

@@ -1,31 +1,13 @@
"""
CLI — user entry point.
Typer interface with -i (input) and -o (output) options.
Registers all subcommands and delegates to cli/commands/.
"""
from __future__ import annotations
import asyncio
import logging
import os
from dataclasses import asdict
import dspy
import typer
from pydantic import ValidationError
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.dto import OptimizationConfig, OptimizationResult
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.use_cases import OptimizePromptUseCase
from prometheus.infrastructure.file_io import YamlPersistence
from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter
from prometheus.cli.commands import init, list_runs, optimize, version
app = typer.Typer(
name="prometheus",
@@ -33,205 +15,12 @@ app = typer.Typer(
no_args_is_help=True,
)
console = Console()
@app.command()
def optimize(
input: str = typer.Option(
...,
"-i",
"--input",
help="Path to input YAML config file.",
exists=True,
readable=True,
),
output: str = typer.Option(
"output.yaml",
"-o",
"--output",
help="Path to output YAML result file.",
),
verbose: bool = typer.Option(
False,
"-v",
"--verbose",
help="Print detailed progress.",
),
max_retries: int = typer.Option(
3,
"--max-retries",
help="Max retry attempts for transient LLM errors (429, timeout, 5xx).",
),
error_strategy: str = typer.Option(
"retry",
"--error-strategy",
help="How to handle errors: skip | retry | abort.",
),
max_concurrency: int = typer.Option(
5,
"--max-concurrency",
help="Max parallel LLM calls for minibatch execution and judging.",
),
) -> None:
"""Optimize a prompt without any reference data.
Usage:
prometheus optimize -i config.yaml -o result.yaml
"""
asyncio.run(
_async_optimize(input, output, verbose, max_retries, error_strategy, max_concurrency)
)
async def _async_optimize(
input: str,
output: str,
verbose: bool,
max_retries: int,
error_strategy: str,
max_concurrency: int,
) -> None:
# Configure verbose logging
if verbose:
logging.basicConfig(level=logging.INFO, format="[PROMETHEUS] %(message)s")
console.print(
Panel.fit(
"PROMETHEUS — Prompt Evolution Engine",
subtitle="No reference data required",
)
)
# 1. Load & validate config
persistence = YamlPersistence()
raw_config = persistence.read_config(input)
# CLI flags override config file values
raw_config.setdefault("max_retries", max_retries)
raw_config.setdefault("error_strategy", error_strategy)
raw_config.setdefault("max_concurrency", max_concurrency)
raw_config["output_path"] = output
raw_config["verbose"] = verbose
try:
config = OptimizationConfig.model_validate(raw_config)
except ValidationError as exc:
console.print("[bold red]Configuration error:[/bold red]\n")
for err in exc.errors():
loc = "".join(str(l) for l in err["loc"])
console.print(f" [red]• {loc}: {err['msg']}[/red]")
raise typer.Exit(code=1) from exc
console.print(f"[dim]Task: {config.task_description[:80]}...[/dim]")
console.print(f"[dim]Seed prompt: {config.seed_prompt[:80]}...[/dim]")
# 2. Create per-model DSPy LM instances
def _model_lm_kwargs(
model_api_base: str | None,
model_api_key_env: str | None,
) -> dict:
"""Build kwargs for dspy.LM, using per-model overrides with global fallback."""
kwargs: dict = {}
api_base = model_api_base or config.api_base
api_key_env = model_api_key_env or config.api_key_env
if api_base:
kwargs["api_base"] = api_base
if api_key_env:
kwargs["api_key"] = os.environ.get(api_key_env, "")
return kwargs
task_lm = dspy.LM(
config.task_model,
**_model_lm_kwargs(config.task_api_base, config.task_api_key_env),
)
judge_lm = dspy.LM(
config.judge_model,
**_model_lm_kwargs(config.judge_api_base, config.judge_api_key_env),
)
proposer_lm = dspy.LM(
config.proposer_model,
**_model_lm_kwargs(config.proposer_api_base, config.proposer_api_key_env),
)
synth_lm = dspy.LM(
config.synth_model,
**_model_lm_kwargs(config.synth_api_base, config.synth_api_key_env),
)
# 3. Build adapters (Dependency Injection — each gets its own LM + retry config)
synth_adapter = DSPySyntheticAdapter(lm=synth_lm)
llm_adapter = DSPyLLMAdapter(
lm=task_lm,
max_retries=config.max_retries,
retry_delay_base=config.retry_delay_base,
)
judge_adapter = DSPyJudgeAdapter(
lm=judge_lm,
max_retries=config.max_retries,
retry_delay_base=config.retry_delay_base,
max_concurrency=config.max_concurrency,
)
proposer_adapter = DSPyProposerAdapter(
lm=proposer_lm,
max_retries=config.max_retries,
retry_delay_base=config.retry_delay_base,
)
bootstrap = SyntheticBootstrap(generator=synth_adapter, seed=config.seed)
evaluator = PromptEvaluator(
executor=llm_adapter,
judge=judge_adapter,
max_concurrency=config.max_concurrency,
)
use_case = OptimizePromptUseCase(
evaluator=evaluator,
proposer=proposer_adapter,
bootstrap=bootstrap,
)
# 4. Execute
with console.status("[bold green]Evolving prompt..."):
result = await use_case.execute(config)
# 5. Display results
_display_result(result)
# 6. Save
_save_result(persistence, output, result)
console.print(f"\n[green]Results saved to {output}[/green]")
def _display_result(result: OptimizationResult) -> None:
"""Display a Rich summary in the terminal."""
console.print()
console.print(
Panel(
f"[bold green]Optimized Prompt[/bold green]\n\n{result.optimized_prompt}",
title="Result",
)
)
table = Table(title="Metrics")
table.add_column("Metric", style="cyan")
table.add_column("Value", style="bold")
table.add_row("Initial Score", f"{result.initial_score:.2f}")
table.add_row("Final Score", f"{result.final_score:.2f}")
table.add_row("Improvement", f"{result.improvement:+.2f}")
table.add_row("Iterations", str(result.iterations_used))
table.add_row("LLM Calls", str(result.total_llm_calls))
console.print(table)
def _save_result(
persistence: YamlPersistence,
path: str,
result: OptimizationResult,
) -> None:
"""Save the result as YAML."""
persistence.write_result(path, asdict(result))
@app.command(hidden=True)
def _help() -> None:
"""Internal placeholder to force multi-command Typer behavior."""
pass
# Register all subcommands — having multiple commands fixes the
# Typer 0.24+ bug where a single-command app absorbs the subcommand.
optimize.register(app)
version.register(app)
init.register(app)
list_runs.register(app)
if __name__ == "__main__":

View File

@@ -0,0 +1 @@
"""CLI command modules."""

View File

@@ -0,0 +1,97 @@
"""prometheus init — scaffold a config YAML interactively."""
from __future__ import annotations
from pathlib import Path
import typer
import yaml
from rich.console import Console
console = Console()
_TEMPLATE = """\
# PROMETHEUS configuration
# Generated by `prometheus init`
# --- Required ---
seed_prompt: {seed_prompt}
task_description: {task_description}
# --- Models ---
task_model: {task_model}
judge_model: {judge_model}
proposer_model: {proposer_model}
synth_model: {synth_model}
# --- Global API settings (optional) ---
# api_base: https://api.openai.com/v1
# api_key_env: OPENAI_API_KEY
# --- Evolution parameters ---
max_iterations: 30
n_synthetic_inputs: 20
minibatch_size: 5
perfect_score: 1.0
seed: 42
# --- Concurrency ---
max_concurrency: 5
# --- Error handling ---
max_retries: 3
retry_delay_base: 1.0
circuit_breaker_threshold: 5
error_strategy: retry
"""
def register(app: typer.Typer) -> None:
"""Register the init command on the Typer app."""
@app.command()
def init(
output: str = typer.Option(
"config.yaml",
"-o",
"--output",
help="Path for the generated config file.",
),
) -> None:
"""Interactively scaffold a PROMETHEUS config YAML.
Prompts for required fields and writes a ready-to-edit config file.
"""
target = Path(output)
if target.exists() and not typer.confirm(
f"{output} already exists. Overwrite?", default=False
):
raise typer.Exit(code=0)
seed_prompt: str = typer.prompt("Seed prompt")
task_description: str = typer.prompt("Task description")
task_model: str = typer.prompt("Task model", default="openai/gpt-4o-mini")
judge_model: str = typer.prompt("Judge model", default="openai/gpt-4o")
proposer_model: str = typer.prompt("Proposer model", default="openai/gpt-4o")
synth_model: str = typer.prompt("Synth model", default="openai/gpt-4o")
content = _TEMPLATE.format(
seed_prompt=_yaml_string(seed_prompt),
task_description=_yaml_string(task_description),
task_model=task_model,
judge_model=judge_model,
proposer_model=proposer_model,
synth_model=synth_model,
)
target.write_text(content, encoding="utf-8")
console.print(f"[green]Config written to {output}[/green]")
console.print("[dim]Edit it as needed, then run: prometheus optimize -i config.yaml[/dim]")
def _yaml_string(value: str) -> str:
"""Quote a string for YAML if it contains special characters."""
if any(ch in value for ch in (":", "#", "'", '"', "\n", "{", "}", "[", "]", ",")):
escaped = value.replace("'", "''")
return f"'{escaped}'"
return value

View File

@@ -0,0 +1,101 @@
"""prometheus list — list past optimization runs."""
from __future__ import annotations
import glob as globmod
from pathlib import Path
import typer
import yaml
from rich.console import Console
from rich.table import Table
console = Console()
_DEFAULT_PATTERNS = ("output.yaml", "results/*.yaml", "*.result.yaml")
def register(app: typer.Typer) -> None:
"""Register the list command on the Typer app."""
@app.command("list")
def list_runs(
directory: str = typer.Option(
".",
"-d",
"--directory",
help="Directory to scan for result YAML files.",
),
) -> None:
"""List past optimization runs found in result YAML files.
Scans the given directory for YAML files that look like PROMETHEUS
output (they contain 'optimized_prompt' and 'final_score' keys) and
displays a summary table.
"""
base = Path(directory)
if not base.is_dir():
console.print(f"[red]Directory not found: {directory}[/red]")
raise typer.Exit(code=1)
runs: list[dict] = []
for pattern in _DEFAULT_PATTERNS:
for path_str in globmod.glob(str(base / pattern), recursive=False):
_try_read_run(path_str, runs)
# Also try nested directories one level deep
for path_str in globmod.glob(str(base / "**/*.yaml"), recursive=True):
_try_read_run(path_str, runs)
if not runs:
console.print("[dim]No optimization runs found.[/dim]")
raise typer.Exit(code=0)
# Deduplicate by path
seen: set[str] = set()
unique_runs: list[dict] = []
for run in runs:
if run["path"] not in seen:
seen.add(run["path"])
unique_runs.append(run)
table = Table(title="PROMETHEUS Runs")
table.add_column("File", style="cyan")
table.add_column("Initial", justify="right")
table.add_column("Final", justify="right", style="green")
table.add_column("Delta", justify="right")
table.add_column("Iters", justify="right")
table.add_column("Prompt (first 60 chars)", style="dim")
for run in sorted(unique_runs, key=lambda r: r["path"]):
table.add_row(
run["path"],
f"{run['initial_score']:.2f}",
f"{run['final_score']:.2f}",
f"{run['improvement']:+.2f}",
str(run["iterations"]),
run["prompt_preview"],
)
console.print(table)
def _try_read_run(path_str: str, runs: list[dict]) -> None:
"""Try to parse a YAML file as a PROMETHEUS result and append metadata."""
try:
with open(path_str, encoding="utf-8") as f:
data = yaml.safe_load(f)
if not isinstance(data, dict):
return
if "optimized_prompt" not in data or "final_score" not in data:
return
runs.append({
"path": path_str,
"initial_score": float(data.get("initial_score", 0.0)),
"final_score": float(data.get("final_score", 0.0)),
"improvement": float(data.get("improvement", 0.0)),
"iterations": int(data.get("iterations_used", 0)),
"prompt_preview": str(data.get("optimized_prompt", ""))[:60],
})
except (OSError, yaml.YAMLError, ValueError, TypeError):
pass

View File

@@ -27,8 +27,6 @@ from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
from prometheus.infrastructure.crossover_adapter import DSPyCrossoverAdapter
from prometheus.infrastructure.mutation_adapter import DSPyMutationAdapter
from prometheus.infrastructure.crossover_adapter import DSPyCrossoverAdapter
from prometheus.infrastructure.mutation_adapter import DSPyMutationAdapter
from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
from prometheus.infrastructure.similarity import create_similarity_adapter
from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter
@@ -337,6 +335,29 @@ async def _async_optimize(
with console.status("[bold green]Evolving prompt..."):
result = await use_case.execute(config)
# 4b. Compute actual LLM call count from adapter counters
actual_llm_calls = (
llm_adapter.call_count
+ judge_adapter.call_count
+ proposer_adapter.call_count
+ synth_adapter.call_count
+ (crossover_adapter.call_count if crossover_adapter else 0)
+ (mutation_adapter.call_count if mutation_adapter else 0)
)
result = OptimizationResult(
optimized_prompt=result.optimized_prompt,
initial_prompt=result.initial_prompt,
iterations_used=result.iterations_used,
total_llm_calls=actual_llm_calls,
initial_score=result.initial_score,
final_score=result.final_score,
improvement=result.improvement,
history=result.history,
final_validation_score=result.final_validation_score,
best_validation_score=result.best_validation_score,
early_stopped=result.early_stopped,
)
# 5. Display results
_display_result(result)

View File

@@ -0,0 +1,18 @@
"""prometheus version — print the current version."""
from __future__ import annotations
import typer
from rich.console import Console
from prometheus import __version__
console = Console()
def register(app: typer.Typer) -> None:
"""Register the version command on the Typer app."""
@app.command()
def version() -> None:
"""Print the PROMETHEUS version."""
console.print(f"PROMETHEUS {__version__}")

View File

@@ -0,0 +1,96 @@
"""
Structured logging configuration for PROMETHEUS.
Supports text (human-readable) and JSON (machine-parseable) output,
configurable log levels, and optional file output.
Fixes Bug #4: verbose mode now reliably produces output by configuring
handlers explicitly instead of relying on ``logging.basicConfig``.
"""
from __future__ import annotations
import json
import logging
import sys
from datetime import datetime, timezone
from typing import TextIO
_PROMETHEUS_LOGGER = "prometheus"
class _JsonFormatter(logging.Formatter):
"""Emit one JSON object per log line."""
def format(self, record: logging.LogRecord) -> str:
message = record.getMessage()
payload: dict = {
"timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
"level": record.levelname,
"logger": record.name,
"message": message,
}
# Merge any extra structured fields the caller attached.
if hasattr(record, "structured"):
payload["structured"] = record.structured # type: ignore[attr-defined]
if record.exc_info and record.exc_info[1] is not None:
payload["exception"] = self.formatException(record.exc_info)
return json.dumps(payload, default=str)
class _TextFormatter(logging.Formatter):
"""Human-readable format with structured extras appended."""
def format(self, record: logging.LogRecord) -> str:
base = super().format(record)
if hasattr(record, "structured") and record.structured:
extras = " ".join(f"{k}={v}" for k, v in record.structured.items())
base = f"{base} {extras}"
return base
def configure_logging(
*,
level: int = logging.WARNING,
log_format: str = "text",
log_file: str | None = None,
) -> None:
"""Configure the prometheus root logger.
Args:
level: Logging level (e.g. logging.DEBUG, logging.INFO).
log_format: ``"text"`` for human-readable or ``"json"`` for
machine-parseable output.
log_file: Optional path to also write logs to a file.
"""
prom_logger = logging.getLogger(_PROMETHEUS_LOGGER)
prom_logger.setLevel(level)
# Remove any stale handlers so re-configuration is idempotent.
prom_logger.handlers.clear()
if log_format == "json":
fmt = _JsonFormatter()
else:
fmt = _TextFormatter(
fmt="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
datefmt="%H:%M:%S",
)
# Console handler (stderr so it doesn't mix with Rich stdout)
console_handler = logging.StreamHandler(sys.stderr)
console_handler.setFormatter(fmt)
prom_logger.addHandler(console_handler)
# Optional file handler
if log_file:
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(fmt)
prom_logger.addHandler(file_handler)
# Prevent propagation to root logger to avoid duplicate output
prom_logger.propagate = False
def get_logger(name: str) -> logging.Logger:
"""Return a child logger under the prometheus namespace."""
return logging.getLogger(f"{_PROMETHEUS_LOGGER}.{name}")

View File

@@ -31,6 +31,15 @@ class SyntheticExample:
id: int = 0
@dataclass(frozen=True)
class GroundTruthExample:
"""A ground-truth evaluation example with a known-good expected output."""
input_text: str
expected_output: str
id: int = 0
@dataclass
class Trajectory:
"""Execution trace of a prompt on an input.
@@ -85,3 +94,6 @@ class OptimizationState:
synthetic_pool: list[SyntheticExample] = field(default_factory=list)
history: list[dict[str, Any]] = field(default_factory=list)
total_llm_calls: int = 0
# Hold-out validation
best_validation_score: float | None = None
early_stopped: bool = False

View File

@@ -8,7 +8,14 @@ from abc import ABC, abstractmethod
from typing import Any
from prometheus.domain.entities import Prompt, SyntheticExample, Trajectory
from prometheus.domain.entities import (
Candidate,
GroundTruthExample,
OptimizationState,
Prompt,
SyntheticExample,
Trajectory,
)
class LLMPort(ABC):
@@ -73,6 +80,34 @@ class SyntheticGeneratorPort(ABC):
...
class CrossoverPort(ABC):
"""Port for crossover — combining instructions from two parent candidates."""
@abstractmethod
async def crossover(
self,
parent_a: Prompt,
parent_b: Prompt,
task_description: str,
) -> Prompt:
"""Combine instructions from two parents into a child prompt."""
...
class MutationPort(ABC):
"""Port for mutating a prompt — paraphrase, constrain, generalize, specialize."""
@abstractmethod
async def mutate(
self,
prompt: Prompt,
task_description: str,
mutation_type: str = "paraphrase",
) -> Prompt:
"""Apply a mutation to the prompt."""
...
class PersistencePort(ABC):
"""Port for reading/writing files."""
@@ -83,3 +118,49 @@ class PersistencePort(ABC):
@abstractmethod
def write_result(self, path: str, data: dict[str, Any]) -> None:
...
class SimilarityPort(ABC):
"""Port for computing similarity between a prediction and expected output.
Infrastructure provides concrete metrics (exact match, BLEU, ROUGE, cosine).
"""
@abstractmethod
def compute(self, prediction: str, expected: str) -> float:
"""Compute similarity score in [0, 1]. 1.0 = perfect match."""
...
class DatasetLoaderPort(ABC):
"""Port for loading ground-truth evaluation datasets."""
@abstractmethod
def load(self, path: str) -> list[GroundTruthExample]:
"""Load a dataset from a CSV or JSON file.
Each row must have 'input' and 'expected_output' fields.
"""
...
class CheckpointPort(ABC):
"""Port for saving and loading optimization checkpoints.
Enables resuming long-running optimizations after interruption.
"""
@abstractmethod
def save(self, state: OptimizationState) -> None:
"""Persist the current optimization state to disk."""
...
@abstractmethod
def load(self) -> OptimizationState | None:
"""Load the latest checkpoint. Returns None if no checkpoint exists."""
...
@abstractmethod
def latest_exists(self) -> bool:
"""Check if a checkpoint file is available for resuming."""
...

View File

@@ -0,0 +1,149 @@
"""
JSON checkpoint persistence — save/load optimization state to disk.
Implements the CheckpointPort with JSON for human-readable, versionable snapshots.
"""
from __future__ import annotations
import json
import logging
from dataclasses import asdict
from pathlib import Path
from prometheus.cli.logging_setup import get_logger
from prometheus.domain.entities import (
Candidate,
OptimizationState,
Prompt,
SyntheticExample,
)
from prometheus.domain.ports import CheckpointPort
logger = get_logger("checkpoint")
_CHECKPOINT_FILE = "latest.json"
class JsonCheckpointPersistence(CheckpointPort):
"""Saves optimization state as JSON to a configurable directory."""
def __init__(self, checkpoint_dir: str | Path = ".prometheus/checkpoints") -> None:
self._dir = Path(checkpoint_dir)
def save(self, state: OptimizationState) -> None:
"""Persist the current optimization state to disk."""
self._dir.mkdir(parents=True, exist_ok=True)
path = self._dir / _CHECKPOINT_FILE
data = _serialize_state(state)
path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
logger.info(
"Checkpoint saved",
extra={
"structured": {
"event": "checkpoint_saved",
"path": str(path),
"iteration": state.iteration,
"total_llm_calls": state.total_llm_calls,
},
},
)
def load(self) -> OptimizationState | None:
"""Load the latest checkpoint. Returns None if no checkpoint exists."""
path = self._dir / _CHECKPOINT_FILE
if not path.exists():
return None
raw = json.loads(path.read_text(encoding="utf-8"))
state = _deserialize_state(raw)
logger.info(
"Checkpoint loaded",
extra={
"structured": {
"event": "checkpoint_loaded",
"path": str(path),
"iteration": state.iteration,
"total_llm_calls": state.total_llm_calls,
},
},
)
return state
def latest_exists(self) -> bool:
"""Check if a checkpoint file is available for resuming."""
return (self._dir / _CHECKPOINT_FILE).exists()
# ---------------------------------------------------------------------------
# Serialization helpers — keep the JSON format stable and self-describing.
# ---------------------------------------------------------------------------
_SCHEMA_VERSION = 1
def _serialize_state(state: OptimizationState) -> dict:
"""Convert OptimizationState to a JSON-safe dict."""
return {
"schema_version": _SCHEMA_VERSION,
"iteration": state.iteration,
"best_candidate": _serialize_candidate(state.best_candidate),
"candidates": [_serialize_candidate(c) for c in state.candidates],
"synthetic_pool": [
{"input_text": ex.input_text, "category": ex.category, "id": ex.id}
for ex in state.synthetic_pool
],
"history": state.history,
"total_llm_calls": state.total_llm_calls,
}
def _serialize_candidate(candidate: Candidate | None) -> dict | None:
if candidate is None:
return None
return {
"prompt_text": candidate.prompt.text,
"prompt_metadata": candidate.prompt.metadata,
"best_score": candidate.best_score,
"generation": candidate.generation,
"parent_id": candidate.parent_id,
}
def _deserialize_state(data: dict) -> OptimizationState:
"""Reconstruct OptimizationState from a checkpoint dict."""
version = data.get("schema_version", 1)
# Future migration hooks go here: if version < 2: ...
best_raw = data.get("best_candidate")
best_candidate = _deserialize_candidate(best_raw)
candidates = [_deserialize_candidate(c) for c in data.get("candidates", [])]
synthetic_pool = [
SyntheticExample(
input_text=ex["input_text"],
category=ex.get("category", "default"),
id=ex.get("id", 0),
)
for ex in data.get("synthetic_pool", [])
]
state = OptimizationState(
iteration=data.get("iteration", 0),
best_candidate=best_candidate,
candidates=candidates,
synthetic_pool=synthetic_pool,
history=data.get("history", []),
total_llm_calls=data.get("total_llm_calls", 0),
)
return state
def _deserialize_candidate(raw: dict | None) -> Candidate | None:
if raw is None:
return None
return Candidate(
prompt=Prompt(text=raw["prompt_text"], metadata=raw.get("prompt_metadata", {})),
best_score=raw.get("best_score", 0.0),
generation=raw.get("generation", 0),
parent_id=raw.get("parent_id"),
)

View File

@@ -0,0 +1,63 @@
"""
Adapter: Instruction Crossover via DSPy.
Implements CrossoverPort — combines two parent prompts into a child.
"""
from __future__ import annotations
import asyncio
import dspy
from prometheus.domain.entities import Prompt
from prometheus.domain.ports import CrossoverPort
from prometheus.infrastructure.dspy_modules import InstructionCrossover
from prometheus.infrastructure.retry import async_retry_with_backoff
class DSPyCrossoverAdapter(CrossoverPort):
"""Uses DSPy to combine two parent instructions into a child."""
def __init__(
self,
lm: dspy.LM,
max_retries: int = 3,
retry_delay_base: float = 1.0,
) -> None:
self._lm = lm
self._crossover = InstructionCrossover()
self._max_retries = max_retries
self._retry_delay_base = retry_delay_base
self.call_count: int = 0
async def crossover(
self,
parent_a: Prompt,
parent_b: Prompt,
task_description: str,
) -> Prompt:
async def _call() -> Prompt:
return await asyncio.to_thread(
self._sync_crossover, parent_a, parent_b, task_description,
)
return await async_retry_with_backoff(
_call,
max_retries=self._max_retries,
retry_delay_base=self._retry_delay_base,
)
def _sync_crossover(
self,
parent_a: Prompt,
parent_b: Prompt,
task_description: str,
) -> Prompt:
with dspy.context(lm=self._lm):
pred = self._crossover(
parent_a=parent_a.text,
parent_b=parent_b.text,
task_description=task_description,
)
self.call_count += 1
return Prompt(text=pred.child_instruction)

View File

@@ -0,0 +1,75 @@
"""Dataset loader — loads ground-truth CSV/JSON datasets."""
from __future__ import annotations
import csv
import json
import logging
from pathlib import Path
from prometheus.domain.entities import GroundTruthExample
from prometheus.domain.ports import DatasetLoaderPort
logger = logging.getLogger(__name__)
class FileDatasetLoader(DatasetLoaderPort):
"""Loads evaluation datasets from CSV or JSON files.
CSV files must have 'input' and 'expected_output' columns.
JSON files must be an array of objects with 'input' and 'expected_output' keys.
"""
def load(self, path: str) -> list[GroundTruthExample]:
suffix = Path(path).suffix.lower()
if suffix == ".csv":
return self._load_csv(path)
elif suffix in (".json", ".jsonl"):
return self._load_json(path)
else:
raise ValueError(
f"Unsupported dataset format '{suffix}'. Use .csv, .json, or .jsonl."
)
def _load_csv(self, path: str) -> list[GroundTruthExample]:
examples: list[GroundTruthExample] = []
with open(path, newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
for i, row in enumerate(reader):
input_text = row.get("input", "").strip()
expected = row.get("expected_output", "").strip()
if not input_text:
logger.warning("Skipping CSV row %d: empty 'input' field", i + 1)
continue
examples.append(
GroundTruthExample(
input_text=input_text,
expected_output=expected,
id=i,
)
)
logger.info("Loaded %d examples from CSV: %s", len(examples), path)
return examples
def _load_json(self, path: str) -> list[GroundTruthExample]:
with open(path, encoding="utf-8") as f:
data = json.load(f)
if not isinstance(data, list):
raise ValueError("JSON dataset must be an array of objects.")
examples: list[GroundTruthExample] = []
for i, item in enumerate(data):
input_text = item.get("input", "").strip() if isinstance(item, dict) else ""
expected = (
item.get("expected_output", "").strip() if isinstance(item, dict) else ""
)
if not input_text:
logger.warning("Skipping JSON item %d: empty 'input' field", i)
continue
examples.append(
GroundTruthExample(
input_text=input_text,
expected_output=expected,
id=i,
)
)
logger.info("Loaded %d examples from JSON: %s", len(examples), path)
return examples

View File

@@ -59,6 +59,7 @@ class DSPyJudgeAdapter(JudgePort):
if self._judge_dimensions
else {}
)
self.call_count: int = 0
async def judge_batch(
self,
@@ -104,13 +105,15 @@ class DSPyJudgeAdapter(JudgePort):
def _sync_judge(self, task_description: str, input_text: str, output_text: str):
with dspy.context(lm=self._lm):
return self._judge(
result = self._judge(
task_description=task_description,
input_text=input_text,
output_text=output_text,
judge_criteria=self._judge_criteria,
dimension_names=self._dimension_names,
)
self.call_count += 1
return result
def _aggregate_result(self, pred: Any) -> tuple[float, str]:
"""Compute weighted aggregate score from dimension scores if available."""

View File

@@ -34,6 +34,7 @@ class DSPyLLMAdapter(LLMPort):
self._predictor = dspy.Predict(self._ExecuteSignature)
self._max_retries = max_retries
self._retry_delay_base = retry_delay_base
self.call_count: int = 0
async def execute(self, prompt: Prompt, input_text: str) -> str:
async def _call() -> str:
@@ -52,4 +53,5 @@ class DSPyLLMAdapter(LLMPort):
instruction=prompt.text,
input_text=input_text,
)
self.call_count += 1
return str(result.output)

View File

@@ -0,0 +1,70 @@
"""
Adapter: Instruction Mutation via DSPy.
Implements MutationPort — applies typed mutations (paraphrase, constrain,
generalize, specialize) to a prompt.
"""
from __future__ import annotations
import asyncio
import random
import dspy
from prometheus.domain.entities import Prompt
from prometheus.domain.ports import MutationPort
from prometheus.infrastructure.dspy_modules import InstructionMutator
from prometheus.infrastructure.retry import async_retry_with_backoff
_MUTATION_TYPES = ("paraphrase", "constrain", "generalize", "specialize")
class DSPyMutationAdapter(MutationPort):
"""Uses DSPy to apply typed mutations to an instruction."""
def __init__(
self,
lm: dspy.LM,
max_retries: int = 3,
retry_delay_base: float = 1.0,
) -> None:
self._lm = lm
self._mutator = InstructionMutator()
self._max_retries = max_retries
self._retry_delay_base = retry_delay_base
self.call_count: int = 0
async def mutate(
self,
prompt: Prompt,
task_description: str,
mutation_type: str = "paraphrase",
) -> Prompt:
if mutation_type not in _MUTATION_TYPES:
mutation_type = random.choice(_MUTATION_TYPES)
async def _call() -> Prompt:
return await asyncio.to_thread(
self._sync_mutate, prompt, task_description, mutation_type,
)
return await async_retry_with_backoff(
_call,
max_retries=self._max_retries,
retry_delay_base=self._retry_delay_base,
)
def _sync_mutate(
self,
prompt: Prompt,
task_description: str,
mutation_type: str,
) -> Prompt:
with dspy.context(lm=self._lm):
pred = self._mutator(
current_instruction=prompt.text,
task_description=task_description,
mutation_type=mutation_type,
)
self.call_count += 1
return Prompt(text=pred.mutated_instruction)

View File

@@ -29,6 +29,7 @@ class DSPyProposerAdapter(ProposerPort):
self._proposer = InstructionProposer()
self._max_retries = max_retries
self._retry_delay_base = retry_delay_base
self.call_count: int = 0
async def propose(
self,
@@ -56,6 +57,7 @@ class DSPyProposerAdapter(ProposerPort):
task_description=task_description,
failure_examples=failure_examples,
)
self.call_count += 1
return Prompt(text=pred.new_instruction)
@staticmethod

View File

@@ -0,0 +1,153 @@
"""Similarity adapters — concrete metrics for comparing prediction vs expected."""
from __future__ import annotations
import math
import re
from collections import Counter
from prometheus.domain.ports import SimilarityPort
class ExactMatchSimilarity(SimilarityPort):
"""Case-insensitive exact string match. Returns 1.0 or 0.0."""
def compute(self, prediction: str, expected: str) -> float:
return 1.0 if prediction.strip().lower() == expected.strip().lower() else 0.0
class BleuSimilarity(SimilarityPort):
"""BLEU-style n-gram precision (up to 4-grams).
Simplified implementation using sentence-level BLEU with brevity penalty.
Returns a score in [0, 1].
"""
def __init__(self, max_n: int = 4):
self._max_n = max_n
def compute(self, prediction: str, expected: str) -> float:
pred_tokens = _tokenize(prediction)
ref_tokens = _tokenize(expected)
if not pred_tokens or not ref_tokens:
return 0.0 if not ref_tokens else 0.0
# Modified precision for each n-gram
precisions: list[float] = []
for n in range(1, self._max_n + 1):
pred_ngrams = _ngrams(pred_tokens, n)
ref_ngrams = _ngrams(ref_tokens, n)
if not pred_ngrams:
break
clipped = sum(min(pred_ngrams[ng], ref_ngrams.get(ng, 0)) for ng in pred_ngrams)
total = sum(pred_ngrams.values())
precisions.append(clipped / total if total > 0 else 0.0)
if not precisions:
return 0.0
# Geometric mean of precisions
log_avg = sum(math.log(p) for p in precisions if p > 0)
n_nonzero = sum(1 for p in precisions if p > 0)
if n_nonzero == 0:
return 0.0
geo_mean = math.exp(log_avg / n_nonzero)
# Brevity penalty
bp = 1.0
if len(pred_tokens) < len(ref_tokens):
bp = math.exp(1 - len(ref_tokens) / len(pred_tokens))
return min(bp * geo_mean, 1.0)
class RougeLSimilarity(SimilarityPort):
"""ROUGE-L using Longest Common Subsequence.
Returns F1 score combining precision and recall in [0, 1].
"""
def compute(self, prediction: str, expected: str) -> float:
pred_tokens = _tokenize(prediction)
ref_tokens = _tokenize(expected)
if not pred_tokens or not ref_tokens:
return 0.0
lcs_len = _lcs_length(pred_tokens, ref_tokens)
precision = lcs_len / len(pred_tokens)
recall = lcs_len / len(ref_tokens)
if precision + recall == 0:
return 0.0
f1 = 2 * precision * recall / (precision + recall)
return f1
class CosineSimilarity(SimilarityPort):
"""TF-IDF cosine similarity between bag-of-words vectors.
Lightweight semantic similarity without external embedding models.
"""
def compute(self, prediction: str, expected: str) -> float:
pred_tokens = _tokenize(prediction)
ref_tokens = _tokenize(expected)
if not pred_tokens or not ref_tokens:
return 0.0
pred_counts = Counter(pred_tokens)
ref_counts = Counter(ref_tokens)
all_tokens = set(pred_counts) | set(ref_counts)
dot = sum(pred_counts.get(t, 0) * ref_counts.get(t, 0) for t in all_tokens)
norm_pred = math.sqrt(sum(v * v for v in pred_counts.values()))
norm_ref = math.sqrt(sum(v * v for v in ref_counts.values()))
if norm_pred == 0 or norm_ref == 0:
return 0.0
return dot / (norm_pred * norm_ref)
# --- Helpers ---
def _tokenize(text: str) -> list[str]:
"""Simple whitespace + punctuation tokenizer."""
return re.findall(r"\w+", text.lower())
def _ngrams(tokens: list[str], n: int) -> Counter:
"""Count n-grams in a token list."""
return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
def _lcs_length(a: list[str], b: list[str]) -> int:
"""Compute length of the Longest Common Subsequence."""
m, n = len(a), len(b)
prev = [0] * (n + 1)
for i in range(1, m + 1):
curr = [0] * (n + 1)
for j in range(1, n + 1):
if a[i - 1] == b[j - 1]:
curr[j] = prev[j - 1] + 1
else:
curr[j] = max(prev[j], curr[j - 1])
prev = curr
return prev[n]
def create_similarity_adapter(metric: str) -> SimilarityPort:
"""Factory: create a SimilarityPort by metric name.
Supported metrics: exact, bleu, rouge_l, cosine.
"""
adapters = {
"exact": ExactMatchSimilarity,
"bleu": BleuSimilarity,
"rouge_l": RougeLSimilarity,
"cosine": CosineSimilarity,
}
cls = adapters.get(metric)
if cls is None:
raise ValueError(
f"Unknown eval metric '{metric}'. Choose from: {sorted(adapters)}"
)
return cls()

View File

@@ -18,6 +18,7 @@ class DSPySyntheticAdapter(SyntheticGeneratorPort):
def __init__(self, lm: dspy.LM) -> None:
self._lm = lm
self._generator = SyntheticInputGenerator()
self.call_count: int = 0
def generate_inputs(
self,
@@ -29,6 +30,7 @@ class DSPySyntheticAdapter(SyntheticGeneratorPort):
task_description=task_description,
n_examples=n_examples,
)
self.call_count += 1
return [
SyntheticExample(
input_text=text,

View File

@@ -91,3 +91,27 @@ def mock_proposer_port() -> AsyncMock:
text="You are a very helpful assistant. Answer the question precisely."
)
return port
@pytest.fixture
def mock_crossover_port() -> AsyncMock:
"""Mock CrossoverPort that combines two parent prompts."""
port = AsyncMock()
async def _crossover(parent_a: Prompt, parent_b: Prompt, task_description: str) -> Prompt:
return Prompt(text=f"{parent_a.text} Also, {parent_b.text.lower()}")
port.crossover = AsyncMock(side_effect=_crossover)
return port
@pytest.fixture
def mock_mutation_port() -> AsyncMock:
"""Mock MutationPort that paraphrases a prompt."""
port = AsyncMock()
async def _mutate(prompt: Prompt, task_description: str, mutation_type: str = "paraphrase") -> Prompt:
return Prompt(text=f"[{mutation_type}] {prompt.text}")
port.mutate = AsyncMock(side_effect=_mutate)
return port

View File

@@ -20,9 +20,10 @@ def mock_lm() -> dspy.LM:
class TestDSPyLLMAdapter:
def test_execute_returns_response(self, mock_lm: dspy.LM) -> None:
@pytest.mark.asyncio
async def test_execute_returns_response(self, mock_lm: dspy.LM) -> None:
adapter = DSPyLLMAdapter(lm=mock_lm)
prompt = Prompt(text="Answer the question.")
result = adapter.execute(prompt, "What is 2+2?")
result = await adapter.execute(prompt, "What is 2+2?")
assert isinstance(result, str)
assert len(result) > 0

View File

@@ -0,0 +1,300 @@
"""Integration tests for multi-iteration evolution with mixed accept/reject."""
from __future__ import annotations
from unittest.mock import AsyncMock, MagicMock
import pytest
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory
from prometheus.domain.ports import JudgePort, LLMPort, ProposerPort
def _make_eval(scores: list[float]) -> EvalResult:
return EvalResult(
scores=scores,
feedbacks=["feedback"] * len(scores),
trajectories=[
Trajectory(f"in{i}", f"out{i}", s, "feedback", "prompt")
for i, s in enumerate(scores)
],
)
class TestMultiIterationEvolution:
"""Tests for the evolution loop across multiple iterations."""
@pytest.fixture
def seed_prompt(self) -> Prompt:
return Prompt(text="You are a helpful assistant.")
@pytest.fixture
def task_description(self) -> str:
return "Answer factual questions."
@pytest.fixture
def synthetic_pool(self) -> list[SyntheticExample]:
return [SyntheticExample(input_text=f"input {i}", id=i) for i in range(20)]
@pytest.mark.asyncio
async def test_mixed_accept_reject(
self,
seed_prompt: Prompt,
task_description: str,
synthetic_pool: list[SyntheticExample],
) -> None:
"""Iteration 1: accept, iteration 2: reject, iteration 3: accept."""
mock_llm = MagicMock(spec=LLMPort)
mock_judge = MagicMock(spec=JudgePort)
mock_proposer = MagicMock(spec=ProposerPort)
evaluator = PromptEvaluator(mock_llm, mock_judge)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:3]
# Build eval sequence: initial, then per-iteration (current, new)
evals = [
_make_eval([0.3, 0.3, 0.3]), # initial seed eval
# Iter 1: accept (old=0.4, new=0.8)
_make_eval([0.4, 0.4, 0.4]),
_make_eval([0.8, 0.8, 0.8]),
# Iter 2: reject (old=0.7, new=0.2)
_make_eval([0.7, 0.7, 0.7]),
_make_eval([0.2, 0.2, 0.2]),
# Iter 3: accept (old=0.5, new=0.9)
_make_eval([0.5, 0.5, 0.5]),
_make_eval([0.9, 0.9, 0.9]),
]
evaluator.evaluate = AsyncMock(side_effect=evals)
mock_proposer.propose.side_effect = [
Prompt(text="Better prompt v1"),
Prompt(text="Worse prompt v2"),
Prompt(text="Best prompt v3"),
]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer,
bootstrap=bootstrap,
max_iterations=3,
minibatch_size=3,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert state.iteration == 3
assert state.best_candidate is not None
assert state.best_candidate.best_score == pytest.approx(2.7) # 0.9*3
assert len(state.history) == 3
assert state.history[0]["event"] == "accepted"
assert state.history[1]["event"] == "rejected"
assert state.history[2]["event"] == "accepted"
@pytest.mark.asyncio
async def test_all_rejected_keeps_seed(
self,
seed_prompt: Prompt,
task_description: str,
synthetic_pool: list[SyntheticExample],
) -> None:
"""When all proposals are rejected, the seed prompt stays as best."""
mock_llm = MagicMock(spec=LLMPort)
mock_judge = MagicMock(spec=JudgePort)
mock_proposer = MagicMock(spec=ProposerPort)
evaluator = PromptEvaluator(mock_llm, mock_judge)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:3]
evals = [
_make_eval([0.5, 0.5, 0.5]), # initial
]
for _ in range(3):
evals.append(_make_eval([0.5, 0.5, 0.5])) # current
evals.append(_make_eval([0.1, 0.1, 0.1])) # worse proposal
evaluator.evaluate = AsyncMock(side_effect=evals)
mock_proposer.propose.side_effect = [
Prompt(text="bad v1"),
Prompt(text="bad v2"),
Prompt(text="bad v3"),
]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer,
bootstrap=bootstrap,
max_iterations=3,
minibatch_size=3,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert state.best_candidate.prompt.text == seed_prompt.text
assert state.best_candidate.best_score == pytest.approx(1.5) # 0.5*3
@pytest.mark.asyncio
async def test_all_accepted_chain(
self,
seed_prompt: Prompt,
task_description: str,
synthetic_pool: list[SyntheticExample],
) -> None:
"""All iterations accept, forming an improvement chain."""
mock_llm = MagicMock(spec=LLMPort)
mock_judge = MagicMock(spec=JudgePort)
mock_proposer = MagicMock(spec=ProposerPort)
evaluator = PromptEvaluator(mock_llm, mock_judge)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:2]
evals = [
_make_eval([0.2, 0.2]), # initial
]
for i in range(1, 5):
score = 0.2 + i * 0.15
evals.append(_make_eval([score, score])) # current
evals.append(_make_eval([score + 0.1, score + 0.1])) # new (accepted)
evaluator.evaluate = AsyncMock(side_effect=evals)
mock_proposer.propose.side_effect = [
Prompt(text=f"Improved v{i}") for i in range(4)
]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer,
bootstrap=bootstrap,
max_iterations=4,
minibatch_size=2,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert len(state.candidates) == 5 # seed + 4 accepted
assert all(h["event"] == "accepted" for h in state.history)
@pytest.mark.asyncio
async def test_error_recovery_continues_loop(
self,
seed_prompt: Prompt,
task_description: str,
synthetic_pool: list[SyntheticExample],
) -> None:
"""When an iteration errors, the loop continues."""
mock_llm = MagicMock(spec=LLMPort)
mock_judge = MagicMock(spec=JudgePort)
mock_proposer = MagicMock(spec=ProposerPort)
evaluator = PromptEvaluator(mock_llm, mock_judge)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:2]
# Eval sequence for 3 iterations:
# - iter 1: evaluate current → propose → evaluate new (accepted)
# - iter 2: evaluate current → propose (ERROR, no new eval)
# - iter 3: evaluate current → propose → evaluate new (accepted)
evals = [
_make_eval([0.3, 0.3]), # initial
_make_eval([0.5, 0.5]), # iter 1 current
_make_eval([0.9, 0.9]), # iter 1 new (accepted)
_make_eval([0.5, 0.5]), # iter 2 current (proposer errors after this)
_make_eval([0.5, 0.5]), # iter 3 current
_make_eval([0.8, 0.8]), # iter 3 new (accepted)
]
evaluator.evaluate = AsyncMock(side_effect=evals)
# Proposer raises on iter 2
mock_proposer.propose.side_effect = [
Prompt(text="good v1"),
RuntimeError("LLM timeout"),
Prompt(text="good v3"),
]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer,
bootstrap=bootstrap,
max_iterations=3,
minibatch_size=2,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert state.iteration == 3
assert state.history[1]["event"] == "error"
assert "LLM timeout" in state.history[1]["error"]
assert state.history[0]["event"] == "accepted"
assert state.history[2]["event"] == "accepted"
@pytest.mark.asyncio
async def test_perfect_score_skips_proposer(
self,
seed_prompt: Prompt,
task_description: str,
synthetic_pool: list[SyntheticExample],
) -> None:
"""When all scores are perfect, no proposition is made."""
mock_llm = MagicMock(spec=LLMPort)
mock_judge = MagicMock(spec=JudgePort)
mock_proposer = MagicMock(spec=ProposerPort)
evaluator = PromptEvaluator(mock_llm, mock_judge)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:2]
perfect_eval = _make_eval([1.0, 1.0])
evaluator.evaluate = AsyncMock(return_value=perfect_eval)
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer,
bootstrap=bootstrap,
max_iterations=5,
minibatch_size=2,
perfect_score=1.0,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
mock_proposer.propose.assert_not_called()
assert all(h["event"] == "skip_perfect" for h in state.history)
@pytest.mark.asyncio
async def test_llm_call_counting(
self,
seed_prompt: Prompt,
task_description: str,
synthetic_pool: list[SyntheticExample],
) -> None:
"""Verify LLM call counting: 2*N per eval (execute + judge) + 1 per propose."""
mock_llm = MagicMock(spec=LLMPort)
mock_judge = MagicMock(spec=JudgePort)
mock_proposer = MagicMock(spec=ProposerPort)
evaluator = PromptEvaluator(mock_llm, mock_judge)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:3]
evals = [_make_eval([0.3, 0.3, 0.3])] # initial
for _ in range(2):
evals.append(_make_eval([0.4, 0.4, 0.4]))
evals.append(_make_eval([0.6, 0.6, 0.6]))
evaluator.evaluate = AsyncMock(side_effect=evals)
mock_proposer.propose.side_effect = [
Prompt(text="v1"),
Prompt(text="v2"),
]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer,
bootstrap=bootstrap,
max_iterations=2,
minibatch_size=3,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
# Initial: 2*3=6, Iter1: 2*3 + 1 + 2*3 = 13, Iter2: same = 13
# Total: 6 + 13 + 13 = 32
assert state.total_llm_calls == 32

View File

@@ -1,7 +1,9 @@
"""End-to-end pipeline test with mocked LLM calls."""
from __future__ import annotations
from unittest.mock import MagicMock
from unittest.mock import AsyncMock, MagicMock
import pytest
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.dto import OptimizationConfig
@@ -23,9 +25,10 @@ def _make_eval(scores: list[float]) -> EvalResult:
class TestFullPipeline:
def test_pipeline_produces_result(self) -> None:
@pytest.mark.asyncio
async def test_pipeline_produces_result(self) -> None:
"""Full pipeline with mocked ports produces an OptimizationResult."""
mock_llm = MagicMock(spec=LLMPort)
mock_llm = AsyncMock(spec=LLMPort)
mock_llm.execute.return_value = "mock response"
mock_judge = MagicMock(spec=JudgePort)
@@ -38,11 +41,11 @@ class TestFullPipeline:
eval_sequence.append(_make_eval([0.6, 0.6, 0.6, 0.6, 0.6])) # new eval (accepted)
mock_judge.judge_batch.return_value = [(0.5, "ok")] * 5
mock_proposer = MagicMock(spec=ProposerPort)
mock_proposer = AsyncMock(spec=ProposerPort)
mock_proposer.propose.return_value = Prompt(text="Improved prompt")
evaluator = PromptEvaluator(mock_llm, mock_judge)
evaluator.evaluate = MagicMock(side_effect=eval_sequence)
evaluator.evaluate = AsyncMock(side_effect=eval_sequence)
mock_gen = MagicMock()
mock_gen.generate_inputs.return_value = [
@@ -65,7 +68,7 @@ class TestFullPipeline:
seed=42,
)
result = use_case.execute(config)
result = await use_case.execute(config)
assert result.initial_prompt == "Answer questions."
assert result.optimized_prompt == "Improved prompt"

View File

@@ -0,0 +1,199 @@
"""Integration test — ground-truth evaluation end-to-end with real similarity metrics."""
from __future__ import annotations
import asyncio
import json
import pytest
from unittest.mock import AsyncMock
from prometheus.application.ground_truth_evaluator import GroundTruthEvaluator
from prometheus.domain.entities import GroundTruthExample, Prompt
from prometheus.domain.ports import LLMPort
from prometheus.infrastructure.dataset_loader import FileDatasetLoader
from prometheus.infrastructure.similarity import (
BleuSimilarity,
CosineSimilarity,
ExactMatchSimilarity,
RougeLSimilarity,
create_similarity_adapter,
)
def _make_dataset(items: list[tuple[str, str]]) -> list[GroundTruthExample]:
return [
GroundTruthExample(input_text=inp, expected_output=exp, id=i)
for i, (inp, exp) in enumerate(items)
]
@pytest.fixture
def qa_dataset():
return _make_dataset([
("What is the capital of France?", "Paris"),
("What is 2+2?", "4"),
("What color is the sky?", "blue"),
])
@pytest.fixture
def prompt():
return Prompt(text="Answer the following question concisely.")
@pytest.fixture
def mock_executor():
"""Returns responses that partially match the ground truth."""
port = AsyncMock(spec=LLMPort)
port.execute.side_effect = [
"Paris is the capital of France.",
"The answer is 4.",
"The sky is blue.",
]
return port
class TestGroundTruthIntegrationWithExactMatch:
@pytest.mark.asyncio
async def test_exact_match_on_qa(self, mock_executor, qa_dataset, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor,
similarity=ExactMatchSimilarity(),
)
result = await evaluator.evaluate(prompt, qa_dataset)
# None of the outputs are exact matches with expected outputs
assert all(s == 0.0 for s in result.scores)
@pytest.mark.asyncio
async def test_exact_match_with_exact_outputs(self, qa_dataset, prompt):
exact_executor = AsyncMock(spec=LLMPort)
exact_executor.execute.side_effect = ["Paris", "4", "blue"]
evaluator = GroundTruthEvaluator(
executor=exact_executor,
similarity=ExactMatchSimilarity(),
)
result = await evaluator.evaluate(prompt, qa_dataset)
assert all(s == 1.0 for s in result.scores)
class TestGroundTruthIntegrationWithBleu:
@pytest.mark.asyncio
async def test_bleu_scores_partial_match(self, mock_executor, qa_dataset, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor,
similarity=BleuSimilarity(),
)
result = await evaluator.evaluate(prompt, qa_dataset)
assert all(0.0 < s < 1.0 for s in result.scores)
assert result.mean_score > 0.0
@pytest.mark.asyncio
async def test_bleu_perfect_match(self, qa_dataset, prompt):
perfect_executor = AsyncMock(spec=LLMPort)
perfect_executor.execute.side_effect = ["Paris", "4", "blue"]
evaluator = GroundTruthEvaluator(
executor=perfect_executor,
similarity=BleuSimilarity(),
)
result = await evaluator.evaluate(prompt, qa_dataset)
assert all(s > 0.0 for s in result.scores)
class TestGroundTruthIntegrationWithRouge:
@pytest.mark.asyncio
async def test_rouge_l_scores(self, mock_executor, qa_dataset, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor,
similarity=RougeLSimilarity(),
)
result = await evaluator.evaluate(prompt, qa_dataset)
assert all(s > 0.0 for s in result.scores)
class TestGroundTruthIntegrationWithCosine:
@pytest.mark.asyncio
async def test_cosine_scores(self, mock_executor, qa_dataset, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor,
similarity=CosineSimilarity(),
)
result = await evaluator.evaluate(prompt, qa_dataset)
assert all(s > 0.0 for s in result.scores)
class TestDatasetLoaderIntegration:
@pytest.mark.asyncio
async def test_load_csv_and_evaluate(self, tmp_path, prompt):
csv_file = tmp_path / "eval.csv"
csv_file.write_text("input,expected_output\nWhat is 2+2?,4\nWhat color is grass?,green\n")
loader = FileDatasetLoader()
dataset = loader.load(str(csv_file))
assert len(dataset) == 2
executor = AsyncMock(spec=LLMPort)
executor.execute.side_effect = ["4", "green"]
evaluator = GroundTruthEvaluator(
executor=executor,
similarity=ExactMatchSimilarity(),
)
result = await evaluator.evaluate(prompt, dataset)
assert all(s == 1.0 for s in result.scores)
@pytest.mark.asyncio
async def test_load_json_and_evaluate(self, tmp_path, prompt):
json_file = tmp_path / "eval.json"
data = [
{"input": "What is 2+2?", "expected_output": "4"},
{"input": "What color is grass?", "expected_output": "green"},
]
json_file.write_text(json.dumps(data))
loader = FileDatasetLoader()
dataset = loader.load(str(json_file))
assert len(dataset) == 2
executor = AsyncMock(spec=LLMPort)
executor.execute.side_effect = ["4", "not green"]
evaluator = GroundTruthEvaluator(
executor=executor,
similarity=create_similarity_adapter("bleu"),
)
result = await evaluator.evaluate(prompt, dataset)
# First item should score well, second poorly
assert result.scores[0] > result.scores[1]
class TestMetricComparison:
"""Compare different metrics on the same outputs to ensure they behave differently."""
@pytest.mark.asyncio
async def test_metrics_give_different_scores(self, qa_dataset, prompt):
results = {}
for metric_name, metric_cls in [
("exact", ExactMatchSimilarity),
("bleu", BleuSimilarity),
("rouge_l", RougeLSimilarity),
("cosine", CosineSimilarity),
]:
executor = AsyncMock(spec=LLMPort)
executor.execute.side_effect = [
"Paris is the capital of France.",
"The answer is 4.",
"The sky is blue.",
]
evaluator = GroundTruthEvaluator(
executor=executor,
similarity=metric_cls(),
)
result = await evaluator.evaluate(prompt, qa_dataset)
results[metric_name] = result.mean_score
# Exact match should be 0 (no exact matches)
assert results["exact"] == 0.0
# All other metrics should give partial credit
assert results["bleu"] > 0.0
assert results["rouge_l"] > 0.0
assert results["cosine"] > 0.0

294
tests/unit/test_adapters.py Normal file
View File

@@ -0,0 +1,294 @@
"""Unit tests for infrastructure adapters — LLM, Judge, Proposer, Synthetic.
Uses mocked DSPy modules to isolate adapter logic from LLM calls.
"""
from __future__ import annotations
from unittest.mock import AsyncMock, MagicMock, patch
import dspy
import pytest
from prometheus.domain.entities import Prompt, SyntheticExample, Trajectory
from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter
# --- LLM Adapter ---
class TestDSPyLLMAdapter:
"""Tests for DSPyLLMAdapter.execute()."""
@pytest.fixture
def mock_lm(self) -> MagicMock:
return MagicMock(spec=dspy.LM)
@pytest.fixture
def adapter(self, mock_lm: MagicMock) -> DSPyLLMAdapter:
return DSPyLLMAdapter(lm=mock_lm)
@pytest.mark.asyncio
async def test_execute_returns_output_string(
self, adapter: DSPyLLMAdapter, mock_lm: MagicMock
) -> None:
mock_predictor = MagicMock()
mock_predictor.return_value = MagicMock(output="Hello response")
adapter._predictor = mock_predictor
prompt = Prompt(text="Say hello.")
result = await adapter.execute(prompt, "Hi there")
assert result == "Hello response"
@pytest.mark.asyncio
async def test_execute_passes_prompt_text_and_input(
self, adapter: DSPyLLMAdapter, mock_lm: MagicMock
) -> None:
mock_predictor = MagicMock()
mock_predictor.return_value = MagicMock(output="response")
adapter._predictor = mock_predictor
prompt = Prompt(text="Translate this.")
await adapter.execute(prompt, "Hello world")
mock_predictor.assert_called_once_with(
instruction="Translate this.",
input_text="Hello world",
)
@pytest.mark.asyncio
async def test_execute_uses_dspy_context(
self, adapter: DSPyLLMAdapter, mock_lm: MagicMock
) -> None:
mock_predictor = MagicMock()
mock_predictor.return_value = MagicMock(output="ok")
adapter._predictor = mock_predictor
with patch("prometheus.infrastructure.llm_adapter.dspy.context") as mock_ctx:
await adapter.execute(Prompt(text="test"), "input")
mock_ctx.assert_called_once_with(lm=mock_lm)
@pytest.mark.asyncio
async def test_execute_converts_output_to_str(
self, adapter: DSPyLLMAdapter, mock_lm: MagicMock
) -> None:
mock_predictor = MagicMock()
mock_predictor.return_value = MagicMock(output=42)
adapter._predictor = mock_predictor
result = await adapter.execute(Prompt(text="test"), "input")
assert isinstance(result, str)
assert result == "42"
# --- Judge Adapter ---
class TestDSPyJudgeAdapter:
"""Tests for DSPyJudgeAdapter.judge_batch()."""
@pytest.fixture
def mock_lm(self) -> MagicMock:
return MagicMock(spec=dspy.LM)
@pytest.fixture
def adapter(self, mock_lm: MagicMock) -> DSPyJudgeAdapter:
return DSPyJudgeAdapter(lm=mock_lm)
@pytest.mark.asyncio
async def test_judge_batch_returns_scores_and_feedback(
self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock
) -> None:
adapter._judge = MagicMock()
adapter._judge.side_effect = [
MagicMock(score=0.9, feedback="Excellent."),
MagicMock(score=0.4, feedback="Incomplete."),
]
pairs = [("What is 2+2?", "4"), ("Capital of France?", "London")]
result = await adapter.judge_batch("math and geography", pairs)
assert len(result) == 2
assert result[0] == (0.9, "Excellent.")
assert result[1] == (0.4, "Incomplete.")
@pytest.mark.asyncio
async def test_judge_batch_empty_pairs(
self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock
) -> None:
result = await adapter.judge_batch("task", [])
assert result == []
@pytest.mark.asyncio
async def test_judge_batch_uses_dspy_context(
self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock
) -> None:
adapter._judge = MagicMock()
adapter._judge.return_value = MagicMock(score=0.5, feedback="ok")
with patch("prometheus.infrastructure.judge_adapter.dspy.context") as mock_ctx:
await adapter.judge_batch("task", [("in", "out")])
mock_ctx.assert_called_once_with(lm=mock_lm)
@pytest.mark.asyncio
async def test_judge_batch_returns_all_results(
self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock
) -> None:
"""Judge calls run in parallel but all results are returned."""
adapter._judge = MagicMock()
adapter._judge.side_effect = [
MagicMock(score=0.5, feedback="ok"),
MagicMock(score=0.7, feedback="better"),
MagicMock(score=0.3, feedback="worse"),
]
pairs = [("first", "out1"), ("second", "out2"), ("third", "out3")]
results = await adapter.judge_batch("task", pairs)
assert len(results) == 3
scores = [r[0] for r in results]
assert 0.5 in scores
assert 0.7 in scores
assert 0.3 in scores
# --- Proposer Adapter ---
class TestDSPyProposerAdapter:
"""Tests for DSPyProposerAdapter.propose()."""
@pytest.fixture
def mock_lm(self) -> MagicMock:
return MagicMock(spec=dspy.LM)
@pytest.fixture
def adapter(self, mock_lm: MagicMock) -> DSPyProposerAdapter:
return DSPyProposerAdapter(lm=mock_lm)
@pytest.mark.asyncio
async def test_propose_returns_new_prompt(
self, adapter: DSPyProposerAdapter, mock_lm: MagicMock
) -> None:
adapter._proposer = MagicMock()
adapter._proposer.return_value = MagicMock(
new_instruction="Be concise and accurate."
)
current = Prompt(text="Answer questions.")
trajectories = [
Trajectory("in", "out", 0.3, "too verbose", "Answer questions.")
]
result = await adapter.propose(current, trajectories, "Q&A task")
assert isinstance(result, Prompt)
assert result.text == "Be concise and accurate."
@pytest.mark.asyncio
async def test_propose_uses_dspy_context(
self, adapter: DSPyProposerAdapter, mock_lm: MagicMock
) -> None:
adapter._proposer = MagicMock()
adapter._proposer.return_value = MagicMock(new_instruction="improved")
with patch("prometheus.infrastructure.proposer_adapter.dspy.context") as mock_ctx:
await adapter.propose(Prompt(text="t"), [], "task")
mock_ctx.assert_called_once_with(lm=mock_lm)
def test_format_failures_single_trajectory(self) -> None:
trajectories = [
Trajectory("What is AI?", "A type of robot.", 0.3, "Incomplete definition.", "prompt")
]
result = DSPyProposerAdapter._format_failures(trajectories)
assert "What is AI?" in result
assert "A type of robot." in result
assert "0.30" in result
assert "Incomplete definition." in result
assert "# Example 1" in result
def test_format_failures_multiple_trajectories(self) -> None:
trajectories = [
Trajectory("input1", "output1", 0.4, "bad", "prompt"),
Trajectory("input2", "output2", 0.2, "worse", "prompt"),
]
result = DSPyProposerAdapter._format_failures(trajectories)
assert "# Example 1" in result
assert "# Example 2" in result
assert "---" in result
assert "input1" in result
assert "input2" in result
def test_format_failures_empty_list(self) -> None:
result = DSPyProposerAdapter._format_failures([])
assert result == ""
# --- Synthetic Adapter ---
class TestDSPySyntheticAdapter:
"""Tests for DSPySyntheticAdapter.generate_inputs()."""
@pytest.fixture
def mock_lm(self) -> MagicMock:
return MagicMock(spec=dspy.LM)
@pytest.fixture
def adapter(self, mock_lm: MagicMock) -> DSPySyntheticAdapter:
return DSPySyntheticAdapter(lm=mock_lm)
def test_generate_inputs_returns_examples(
self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock
) -> None:
adapter._generator = MagicMock()
adapter._generator.return_value = MagicMock(
examples=["What is AI?", "Explain ML.", "What is NLP?"]
)
result = adapter.generate_inputs("AI task", 3)
assert len(result) == 3
assert all(isinstance(ex, SyntheticExample) for ex in result)
assert result[0].input_text == "What is AI?"
assert result[0].id == 0
assert result[1].id == 1
def test_generate_inputs_truncates_to_n(
self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock
) -> None:
adapter._generator = MagicMock()
adapter._generator.return_value = MagicMock(
examples=["q1", "q2", "q3", "q4", "q5"]
)
result = adapter.generate_inputs("task", 3)
assert len(result) == 3
def test_generate_inputs_passes_correct_args(
self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock
) -> None:
adapter._generator = MagicMock()
adapter._generator.return_value = MagicMock(examples=["q1"])
adapter.generate_inputs("my task", 5)
adapter._generator.assert_called_once_with(
task_description="my task",
n_examples=5,
)
def test_generate_inputs_empty_list(
self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock
) -> None:
adapter._generator = MagicMock()
adapter._generator.return_value = MagicMock(examples=[])
result = adapter.generate_inputs("task", 0)
assert result == []

View File

@@ -0,0 +1,333 @@
"""Unit tests for checkpoint & resume functionality."""
from __future__ import annotations
import json
from pathlib import Path
from unittest.mock import AsyncMock, MagicMock
import pytest
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.domain.entities import (
Candidate,
EvalResult,
OptimizationState,
Prompt,
SyntheticExample,
Trajectory,
)
from prometheus.infrastructure.checkpoint import JsonCheckpointPersistence
# ---------------------------------------------------------------------------
# JsonCheckpointPersistence — save/load round-trip
# ---------------------------------------------------------------------------
class TestJsonCheckpointPersistence:
def test_roundtrip_full_state(self, tmp_path: Path) -> None:
"""Saving and loading preserves all fields."""
ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "ckpts")
state = OptimizationState(
iteration=7,
best_candidate=Candidate(
prompt=Prompt(text="best prompt", metadata={"source": "test"}),
best_score=0.92,
generation=5,
),
candidates=[
Candidate(prompt=Prompt(text="p1"), best_score=0.5, generation=0),
Candidate(prompt=Prompt(text="p2"), best_score=0.92, generation=5),
],
synthetic_pool=[
SyntheticExample(input_text="q1", category="cat_a", id=0),
SyntheticExample(input_text="q2", category="cat_b", id=1),
],
history=[{"iteration": 1, "event": "accepted", "old_score": 0.5, "new_score": 0.7}],
total_llm_calls=42,
)
ckpt.save(state)
assert ckpt.latest_exists()
loaded = ckpt.load()
assert loaded is not None
assert loaded.iteration == 7
assert loaded.total_llm_calls == 42
assert loaded.best_candidate is not None
assert loaded.best_candidate.prompt.text == "best prompt"
assert loaded.best_candidate.prompt.metadata == {"source": "test"}
assert loaded.best_candidate.best_score == 0.92
assert len(loaded.candidates) == 2
assert len(loaded.synthetic_pool) == 2
assert loaded.synthetic_pool[0].input_text == "q1"
assert loaded.synthetic_pool[1].category == "cat_b"
assert loaded.history[0]["event"] == "accepted"
def test_load_returns_none_when_no_checkpoint(self, tmp_path: Path) -> None:
"""Loading from empty dir returns None."""
ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "nope")
assert ckpt.load() is None
assert not ckpt.latest_exists()
def test_creates_directory_on_save(self, tmp_path: Path) -> None:
"""Save creates the directory tree if it doesn't exist."""
deep_dir = tmp_path / "a" / "b" / "c"
ckpt = JsonCheckpointPersistence(checkpoint_dir=deep_dir)
state = OptimizationState(iteration=1)
ckpt.save(state)
assert (deep_dir / "latest.json").exists()
def test_overwrites_previous_checkpoint(self, tmp_path: Path) -> None:
"""Second save overwrites the first."""
ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path)
ckpt.save(OptimizationState(iteration=1, total_llm_calls=10))
ckpt.save(OptimizationState(iteration=5, total_llm_calls=50))
loaded = ckpt.load()
assert loaded is not None
assert loaded.iteration == 5
assert loaded.total_llm_calls == 50
def test_json_is_human_readable(self, tmp_path: Path) -> None:
"""Checkpoint file is valid, pretty-printed JSON."""
ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path)
state = OptimizationState(
iteration=3,
best_candidate=Candidate(prompt=Prompt(text="hello"), best_score=0.8),
)
ckpt.save(state)
raw = json.loads((tmp_path / "latest.json").read_text())
assert raw["schema_version"] == 1
assert raw["iteration"] == 3
assert raw["best_candidate"]["prompt_text"] == "hello"
# ---------------------------------------------------------------------------
# EvolutionLoop — checkpoint integration
# ---------------------------------------------------------------------------
class TestEvolutionCheckpoint:
@pytest.mark.asyncio
async def test_checkpoint_saved_on_interval(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
) -> None:
"""Checkpoint is saved every checkpoint_interval iterations."""
from unittest.mock import MagicMock
evaluator = PromptEvaluator(AsyncMock(), AsyncMock())
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
# All iterations accepted so checkpoint triggers
good_eval = EvalResult(
scores=[0.3, 0.4, 0.3, 0.5, 0.2],
feedbacks=["ok"] * 5,
trajectories=[
Trajectory(f"input{i}", f"out{i}", s, "ok", "p")
for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
],
)
better_eval = EvalResult(
scores=[0.8, 0.9, 0.7, 0.8, 0.9],
feedbacks=["good"] * 5,
trajectories=[],
)
# initial_eval + 5 iterations (each needs old_eval + new_eval)
evaluator.evaluate = AsyncMock(
side_effect=[good_eval] # initial
+ [good_eval, better_eval] * 5 # 5 iterations
)
proposer = AsyncMock()
proposer.propose.return_value = Prompt(text="improved prompt")
checkpoint_port = MagicMock()
loop = EvolutionLoop(
evaluator=evaluator,
proposer=proposer,
bootstrap=bootstrap,
max_iterations=5,
minibatch_size=5,
checkpoint_port=checkpoint_port,
checkpoint_interval=2,
)
await loop.run(seed_prompt, synthetic_pool, task_description)
# Checkpoint at iterations 2, 4 (every 2nd)
save_calls = checkpoint_port.save.call_count
assert save_calls >= 2 # at least at iters 2 and 4
@pytest.mark.asyncio
async def test_no_checkpoint_without_port(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
) -> None:
"""No checkpointing happens when checkpoint_port is None (default)."""
evaluator = PromptEvaluator(AsyncMock(), AsyncMock())
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
perfect_eval = EvalResult(
scores=[1.0] * 5,
feedbacks=["perfect"] * 5,
trajectories=[
Trajectory(f"in{i}", f"out{i}", 1.0, "perfect", "p")
for i in range(5)
],
)
evaluator.evaluate = AsyncMock(return_value=perfect_eval)
loop = EvolutionLoop(
evaluator=evaluator,
proposer=AsyncMock(),
bootstrap=bootstrap,
max_iterations=3,
minibatch_size=5,
checkpoint_port=None,
)
# Should run without error — no checkpoint port, no crash
await loop.run(seed_prompt, synthetic_pool, task_description)
@pytest.mark.asyncio
async def test_resume_skips_seed_evaluation(
self,
synthetic_pool: list[SyntheticExample],
task_description: str,
) -> None:
"""When initial_state is provided, seed eval is skipped and loop starts from saved iteration."""
evaluator = PromptEvaluator(AsyncMock(), AsyncMock())
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
proposer = AsyncMock()
proposer.propose.return_value = Prompt(text="new prompt")
# Only return evaluations for resumed iterations (1 iter: old_eval + new_eval)
old_eval = EvalResult(
scores=[0.5] * 5,
feedbacks=["ok"] * 5,
trajectories=[
Trajectory(f"in{i}", f"out{i}", 0.5, "ok", "p") for i in range(5)
],
)
new_eval = EvalResult(
scores=[0.8] * 5,
feedbacks=["good"] * 5,
trajectories=[],
)
evaluator.evaluate = AsyncMock(side_effect=[old_eval, new_eval])
# Create a state simulating checkpoint at iteration 4
initial_state = OptimizationState(
iteration=4,
best_candidate=Candidate(
prompt=Prompt(text="checkpoint prompt"), best_score=2.5, generation=4
),
candidates=[Candidate(prompt=Prompt(text="checkpoint prompt"), best_score=2.5)],
total_llm_calls=40,
)
loop = EvolutionLoop(
evaluator=evaluator,
proposer=proposer,
bootstrap=bootstrap,
max_iterations=5, # only iteration 5 remains
minibatch_size=5,
)
state = await loop.run(
seed_prompt=Prompt(text="seed"),
synthetic_pool=synthetic_pool,
task_description=task_description,
initial_state=initial_state,
)
# Should have run only 1 iteration (iter 5)
assert state.iteration == 5
# total_llm_calls should include the 40 from checkpoint + new calls
assert state.total_llm_calls > 40
@pytest.mark.asyncio
async def test_full_save_and_resume_roundtrip(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
tmp_path: Path,
) -> None:
"""End-to-end: run a few iterations, checkpoint, resume, finish."""
evaluator = PromptEvaluator(AsyncMock(), AsyncMock())
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
old_eval = EvalResult(
scores=[0.3, 0.4, 0.3, 0.5, 0.2],
feedbacks=["ok"] * 5,
trajectories=[
Trajectory(f"in{i}", f"out{i}", s, "ok", "p")
for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
],
)
new_eval = EvalResult(
scores=[0.8, 0.9, 0.7, 0.8, 0.9],
feedbacks=["good"] * 5,
trajectories=[],
)
evaluator.evaluate = AsyncMock(
side_effect=[old_eval, old_eval, new_eval, old_eval, new_eval]
)
proposer = AsyncMock()
proposer.propose.return_value = Prompt(text="improved prompt")
ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "ckpts")
loop = EvolutionLoop(
evaluator=evaluator,
proposer=proposer,
bootstrap=bootstrap,
max_iterations=2,
minibatch_size=5,
checkpoint_port=ckpt,
checkpoint_interval=1,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert state.iteration == 2
assert ckpt.latest_exists()
# Capture the checkpoint state *before* resume (state is mutated in-place)
loaded = ckpt.load()
assert loaded is not None
saved_llm_calls = loaded.total_llm_calls
saved_iteration = loaded.iteration
# Set up evaluator for resumed run (just 1 more iteration)
evaluator.evaluate = AsyncMock(side_effect=[old_eval, new_eval])
proposer.propose.return_value = Prompt(text="even better prompt")
loop2 = EvolutionLoop(
evaluator=evaluator,
proposer=proposer,
bootstrap=bootstrap,
max_iterations=3,
minibatch_size=5,
checkpoint_port=ckpt,
checkpoint_interval=1,
)
resumed = await loop2.run(
seed_prompt, synthetic_pool, task_description,
initial_state=loaded,
)
assert resumed.iteration == 3
assert resumed.total_llm_calls > saved_llm_calls
assert resumed.iteration > saved_iteration

278
tests/unit/test_cli.py Normal file
View File

@@ -0,0 +1,278 @@
"""Tests for the CLI interface — prometheus optimize, version, etc.
Uses Typer's CliRunner for isolated command testing.
"""
from __future__ import annotations
from pathlib import Path
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
import yaml
from typer.testing import CliRunner
from prometheus.application.dto import OptimizationResult
from prometheus.cli.app import app
runner = CliRunner()
class TestCLIOptimize:
"""Tests for the `prometheus optimize` command."""
def _write_config(self, tmp_path: Path, **overrides: object) -> Path:
"""Write a minimal valid config YAML and return its path."""
data = {
"seed_prompt": "You are a helpful assistant.",
"task_description": "Answer factual questions accurately.",
}
data.update(overrides)
config_file = tmp_path / "config.yaml"
with open(config_file, "w") as f:
yaml.dump(data, f)
return config_file
def test_optimize_with_valid_config(self, tmp_path: Path) -> None:
config_file = self._write_config(tmp_path)
output_file = tmp_path / "output.yaml"
mock_result = OptimizationResult(
optimized_prompt="Improved prompt",
initial_prompt="You are a helpful assistant.",
iterations_used=5,
total_llm_calls=50,
initial_score=0.3,
final_score=0.9,
improvement=0.6,
history=[],
)
mock_uc = AsyncMock()
mock_uc.execute.return_value = mock_result
with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc):
with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"):
with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls:
mock_llm_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls:
mock_judge_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls:
mock_prop_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.dspy"):
result = runner.invoke(
app,
[
"optimize",
"-i",
str(config_file),
"-o",
str(output_file),
],
)
assert result.exit_code == 0
assert "Optimized Prompt" in result.output
def test_optimize_missing_input_file(self) -> None:
result = runner.invoke(
app,
["optimize", "-i", "/nonexistent/config.yaml"],
)
assert result.exit_code != 0
def test_optimize_with_verbose_flag(self, tmp_path: Path) -> None:
config_file = self._write_config(tmp_path)
output_file = tmp_path / "output.yaml"
mock_result = OptimizationResult(
optimized_prompt="Improved",
initial_prompt="test",
iterations_used=1,
total_llm_calls=10,
initial_score=0.3,
final_score=0.8,
improvement=0.5,
history=[],
)
mock_uc = AsyncMock()
mock_uc.execute.return_value = mock_result
with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc):
with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"):
with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls:
mock_llm_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls:
mock_judge_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls:
mock_prop_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.dspy"):
result = runner.invoke(
app,
[
"optimize",
"-i",
str(config_file),
"-o",
str(output_file),
"-v",
],
)
assert result.exit_code == 0
def test_optimize_displays_metrics(self, tmp_path: Path) -> None:
config_file = self._write_config(tmp_path)
output_file = tmp_path / "output.yaml"
mock_result = OptimizationResult(
optimized_prompt="Better prompt",
initial_prompt="test",
iterations_used=3,
total_llm_calls=30,
initial_score=0.40,
final_score=0.85,
improvement=0.45,
history=[],
)
mock_uc = AsyncMock()
mock_uc.execute.return_value = mock_result
with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc):
with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"):
with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls:
mock_llm_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls:
mock_judge_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls:
mock_prop_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.dspy"):
result = runner.invoke(
app,
[
"optimize",
"-i",
str(config_file),
"-o",
str(output_file),
],
)
assert result.exit_code == 0
assert "0.40" in result.output
assert "0.85" in result.output
assert "+0.45" in result.output
def test_optimize_with_max_concurrency_flag(self, tmp_path: Path) -> None:
config_file = self._write_config(tmp_path)
output_file = tmp_path / "output.yaml"
mock_result = OptimizationResult(
optimized_prompt="Better prompt",
initial_prompt="test",
iterations_used=1,
total_llm_calls=10,
initial_score=0.3,
final_score=0.8,
improvement=0.5,
history=[],
)
mock_uc = AsyncMock()
mock_uc.execute.return_value = mock_result
with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc):
with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"):
with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls:
mock_llm_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls:
mock_judge_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls:
mock_prop_cls.return_value = MagicMock()
with patch("prometheus.cli.commands.optimize.dspy"):
result = runner.invoke(
app,
[
"optimize",
"-i",
str(config_file),
"-o",
str(output_file),
"--max-concurrency",
"10",
],
)
assert result.exit_code == 0
class TestCLIHelp:
"""Tests for CLI help and no-args behavior."""
def test_no_args_shows_help(self) -> None:
result = runner.invoke(app, [])
# Typer uses exit code 2 when no_args_is_help=True
assert result.exit_code in (0, 2)
assert "PROMETHEUS" in result.output or "Usage" in result.output
def test_optimize_help(self) -> None:
result = runner.invoke(app, ["optimize", "--help"])
assert result.exit_code == 0
assert "input" in result.output.lower() or "INPUT" in result.output
def test_version_help(self) -> None:
result = runner.invoke(app, ["version", "--help"])
assert result.exit_code == 0
def test_init_help(self) -> None:
result = runner.invoke(app, ["init", "--help"])
assert result.exit_code == 0
def test_list_help(self) -> None:
result = runner.invoke(app, ["list", "--help"])
assert result.exit_code == 0
class TestCLIVersion:
"""Tests for the `prometheus version` command."""
def test_version_prints_version(self) -> None:
result = runner.invoke(app, ["version"])
assert result.exit_code == 0
assert "PROMETHEUS" in result.output
assert "0.1.0" in result.output
class TestCLIList:
"""Tests for the `prometheus list` command."""
def test_list_no_runs(self, tmp_path: Path) -> None:
result = runner.invoke(app, ["list", "-d", str(tmp_path)])
assert result.exit_code == 0
assert "No optimization runs found" in result.output
def test_list_with_result(self, tmp_path: Path) -> None:
result_data = {
"optimized_prompt": "Better prompt for testing",
"initial_prompt": "test",
"iterations_used": 5,
"total_llm_calls": 50,
"initial_score": 0.30,
"final_score": 0.90,
"improvement": 0.60,
"history": [],
}
result_file = tmp_path / "output.yaml"
import yaml as _yaml
with open(result_file, "w") as f:
_yaml.dump(result_data, f)
result = runner.invoke(app, ["list", "-d", str(tmp_path)])
assert result.exit_code == 0
assert "0.30" in result.output
assert "0.90" in result.output
def test_list_nonexistent_directory(self) -> None:
result = runner.invoke(app, ["list", "-d", "/nonexistent/dir"])
assert result.exit_code == 1

View File

@@ -300,3 +300,33 @@ class TestConfigValidation:
)
assert config.max_iterations == 1
assert config.perfect_score == 0.0
class TestEvalConfigValidation:
"""Tests for ground-truth evaluation config fields."""
def test_eval_defaults(self) -> None:
config = OptimizationConfig(seed_prompt="a", task_description="b")
assert config.eval_dataset_path is None
assert config.eval_metric == "bleu"
def test_eval_dataset_path_set(self) -> None:
config = OptimizationConfig(
seed_prompt="a", task_description="b",
eval_dataset_path="data.csv",
)
assert config.eval_dataset_path == "data.csv"
def test_valid_eval_metrics(self) -> None:
for metric in ("exact", "bleu", "rouge_l", "cosine", "llm_judge"):
config = OptimizationConfig(
seed_prompt="a", task_description="b", eval_metric=metric,
)
assert config.eval_metric == metric
def test_invalid_eval_metric_raises(self) -> None:
with pytest.raises(ValidationError, match="eval_metric must be one of"):
OptimizationConfig(
seed_prompt="a", task_description="b",
eval_metric="invalid_metric",
)

View File

@@ -0,0 +1,86 @@
"""Tests for the ground-truth dataset loader."""
from __future__ import annotations
import json
import os
import tempfile
import pytest
from prometheus.domain.entities import GroundTruthExample
from prometheus.infrastructure.dataset_loader import FileDatasetLoader
@pytest.fixture
def loader():
return FileDatasetLoader()
class TestCsvLoader:
def test_load_csv(self, loader, tmp_path):
csv_file = tmp_path / "test.csv"
csv_file.write_text("input,expected_output\nhello,world\nfoo,bar\n")
result = loader.load(str(csv_file))
assert len(result) == 2
assert result[0].input_text == "hello"
assert result[0].expected_output == "world"
assert result[1].input_text == "foo"
assert result[1].expected_output == "bar"
def test_load_csv_skips_empty_input(self, loader, tmp_path):
csv_file = tmp_path / "test.csv"
csv_file.write_text("input,expected_output\n,bar\nhello,world\n")
result = loader.load(str(csv_file))
assert len(result) == 1
assert result[0].input_text == "hello"
def test_load_csv_with_whitespace(self, loader, tmp_path):
csv_file = tmp_path / "test.csv"
csv_file.write_text("input,expected_output\n hello , world \n")
result = loader.load(str(csv_file))
assert result[0].input_text == "hello"
assert result[0].expected_output == "world"
def test_load_csv_empty_file(self, loader, tmp_path):
csv_file = tmp_path / "test.csv"
csv_file.write_text("input,expected_output\n")
result = loader.load(str(csv_file))
assert len(result) == 0
class TestJsonLoader:
def test_load_json(self, loader, tmp_path):
json_file = tmp_path / "test.json"
data = [
{"input": "hello", "expected_output": "world"},
{"input": "foo", "expected_output": "bar"},
]
json_file.write_text(json.dumps(data))
result = loader.load(str(json_file))
assert len(result) == 2
assert result[0].input_text == "hello"
assert result[0].expected_output == "world"
def test_load_json_skips_empty_input(self, loader, tmp_path):
json_file = tmp_path / "test.json"
data = [
{"input": "", "expected_output": "bar"},
{"input": "hello", "expected_output": "world"},
]
json_file.write_text(json.dumps(data))
result = loader.load(str(json_file))
assert len(result) == 1
def test_load_json_not_array_raises(self, loader, tmp_path):
json_file = tmp_path / "test.json"
json_file.write_text(json.dumps({"not": "an array"}))
with pytest.raises(ValueError, match="must be an array"):
loader.load(str(json_file))
class TestUnsupportedFormat:
def test_unsupported_extension_raises(self, loader, tmp_path):
txt_file = tmp_path / "test.txt"
txt_file.write_text("hello")
with pytest.raises(ValueError, match="Unsupported dataset format"):
loader.load(str(txt_file))

View File

@@ -278,6 +278,7 @@ class TestPerCallIsolation:
adapter._judge_dimensions = []
adapter._dimension_names = ""
adapter._weights = {}
adapter.call_count = 0
# Mock _judge to fail on first call, succeed on second
call_count = 0

View File

@@ -8,10 +8,30 @@ import pytest
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory
from prometheus.domain.entities import (
Candidate,
EvalResult,
Prompt,
SyntheticExample,
Trajectory,
)
def _make_eval(scores: list[float], label: str = "ok") -> EvalResult:
"""Helper to build an EvalResult from a list of scores."""
return EvalResult(
scores=scores,
feedbacks=[label] * len(scores),
trajectories=[
Trajectory(f"input{i}", f"output{i}", s, label, "prompt")
for i, s in enumerate(scores)
],
)
class TestEvolutionLoop:
"""Tests for the original single-candidate hill-climbing mode (population_size=1)."""
@pytest.mark.asyncio
async def test_accepts_improvement(
self,
@@ -27,28 +47,9 @@ class TestEvolutionLoop:
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
initial_eval = EvalResult(
scores=[0.3, 0.4, 0.3, 0.5, 0.2],
feedbacks=["bad"] * 5,
trajectories=[
Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
],
)
old_eval = EvalResult(
scores=[0.3, 0.4, 0.3, 0.5, 0.2],
feedbacks=["bad"] * 5,
trajectories=[
Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
],
)
new_eval = EvalResult(
scores=[0.8, 0.9, 0.7, 0.8, 0.9],
feedbacks=["good"] * 5,
trajectories=[],
)
evaluator.evaluate = AsyncMock(side_effect=[initial_eval, old_eval, new_eval])
low_eval = _make_eval([0.3, 0.4, 0.3, 0.5, 0.2], "bad")
high_eval = _make_eval([0.8, 0.9, 0.7, 0.8, 0.9], "good")
evaluator.evaluate = AsyncMock(side_effect=[low_eval, low_eval, high_eval])
loop = EvolutionLoop(
evaluator=evaluator,
@@ -57,7 +58,6 @@ class TestEvolutionLoop:
max_iterations=1,
minibatch_size=5,
)
with patch.object(loop, "_log"):
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert state.best_candidate is not None
@@ -78,28 +78,9 @@ class TestEvolutionLoop:
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
initial_eval = EvalResult(
scores=[0.7, 0.8, 0.7, 0.8, 0.9],
feedbacks=["ok"] * 5,
trajectories=[
Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
],
)
old_eval = EvalResult(
scores=[0.7, 0.8, 0.7, 0.8, 0.9],
feedbacks=["ok"] * 5,
trajectories=[
Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
],
)
new_eval = EvalResult(
scores=[0.2, 0.1, 0.3, 0.2, 0.1],
feedbacks=["bad"] * 5,
trajectories=[],
)
evaluator.evaluate = AsyncMock(side_effect=[initial_eval, old_eval, new_eval])
high_eval = _make_eval([0.7, 0.8, 0.7, 0.8, 0.9], "ok")
low_eval = _make_eval([0.2, 0.1, 0.3, 0.2, 0.1], "bad")
evaluator.evaluate = AsyncMock(side_effect=[high_eval, high_eval, low_eval])
loop = EvolutionLoop(
evaluator=evaluator,
@@ -108,7 +89,6 @@ class TestEvolutionLoop:
max_iterations=1,
minibatch_size=5,
)
with patch.object(loop, "_log"):
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert state.best_candidate is not None
@@ -129,14 +109,7 @@ class TestEvolutionLoop:
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
perfect_eval = EvalResult(
scores=[1.0, 1.0, 1.0, 1.0, 1.0],
feedbacks=["perfect"] * 5,
trajectories=[
Trajectory(f"input{i}", f"output{i}", 1.0, "perfect", "prompt")
for i in range(5)
],
)
perfect_eval = _make_eval([1.0, 1.0, 1.0, 1.0, 1.0], "perfect")
evaluator.evaluate = AsyncMock(return_value=perfect_eval)
loop = EvolutionLoop(
@@ -146,7 +119,226 @@ class TestEvolutionLoop:
max_iterations=3,
minibatch_size=5,
)
with patch.object(loop, "_log"):
await loop.run(seed_prompt, synthetic_pool, task_description)
mock_proposer_port.propose.assert_not_called()
class TestPopulationEvolution:
"""Tests for population-based evolution (population_size > 1)."""
@pytest.mark.asyncio
async def test_population_initialization(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
mock_mutation_port: AsyncMock,
) -> None:
"""Population is initialized with the right number of candidates."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
evaluator.evaluate = AsyncMock(
return_value=_make_eval([0.5] * 5, "ok")
)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=0, # no iterations, just initialization
minibatch_size=5,
population_size=4,
mutation_port=mock_mutation_port,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
# 1 seed + 3 mutations = 4 candidates
assert len(state.candidates) == 4
assert mock_mutation_port.mutate.call_count == 3
@pytest.mark.asyncio
async def test_population_initialization_uses_proposer_fallback(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
) -> None:
"""When no mutation_port is provided, population init falls back to proposer."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
evaluator.evaluate = AsyncMock(
return_value=_make_eval([0.5] * 5, "ok")
)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=0,
minibatch_size=5,
population_size=3,
# mutation_port intentionally omitted
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert len(state.candidates) == 3
assert mock_proposer_port.propose.call_count == 2 # 3-1 = 2 init mutations
@pytest.mark.asyncio
async def test_population_iteration_replaces_worst(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
mock_crossover_port: AsyncMock,
mock_mutation_port: AsyncMock,
) -> None:
"""Crossover child replaces worst candidate when its fitness is higher."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
# Sequence:
# 1. Initial eval (seed)
# 2. Population init: 3 mutation calls use proposer.propose(), NOT evaluator.evaluate
# 3. Population iteration: crossover produces child → eval child
# Only 2 evaluator.evaluate calls total
seed_eval = _make_eval([0.5] * 5, "ok")
# Crossover child eval - high score to beat worst
child_eval = _make_eval([0.9, 0.9, 0.8, 0.9, 0.8], "great")
all_evals = [seed_eval, child_eval]
evaluator.evaluate = AsyncMock(side_effect=all_evals)
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=1,
minibatch_size=5,
population_size=4,
crossover_rate=1.0,
crossover_port=mock_crossover_port,
mutation_rate=0.0, # disable post-crossover mutation for determinism
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
accepted_events = [h for h in state.history if h.get("event") == "pop_accepted"]
assert len(accepted_events) >= 1
@pytest.mark.asyncio
async def test_population_iteration_rejects_inferior_child(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
mock_crossover_port: AsyncMock,
) -> None:
"""Inferior child is rejected and doesn't replace any candidate."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
seed_eval = _make_eval([0.8] * 5, "ok")
# Crossover produces very LOW-scoring child
child_eval = _make_eval([0.1] * 5, "terrible")
all_evals = [seed_eval, child_eval]
evaluator.evaluate = AsyncMock(side_effect=all_evals)
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=1,
minibatch_size=5,
population_size=4,
crossover_rate=1.0,
crossover_port=mock_crossover_port,
mutation_rate=0.0,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
rejected_events = [h for h in state.history if h.get("event") == "pop_rejected"]
assert len(rejected_events) >= 1
class TestDiversityScore:
"""Tests for the diversity/similarity scoring logic."""
def test_identical_prompts_have_high_similarity(self) -> None:
"""Identical prompts should have very high similarity."""
identical = Prompt(text="You are a helpful assistant. Answer the question.")
pop_a = Candidate(prompt=identical, best_score=4.0, generation=0)
pop_b = Candidate(
prompt=Prompt(text="Completely different prompt about data analysis."),
best_score=3.0,
generation=0,
)
sim_same = EvolutionLoop._compute_diversity_score(identical, [pop_a, pop_b])
# Average includes similarity to the different member, so ~0.5 not 0.9+
assert sim_same > 0.3
def test_different_prompts_have_lower_similarity(self) -> None:
"""Different prompts should have lower similarity than identical ones."""
prompt_a = Prompt(text="You are a helpful assistant. Answer the question.")
prompt_b = Prompt(text="Provide detailed analysis of complex data patterns with precision.")
pop_a = Candidate(prompt=prompt_a, best_score=4.0, generation=0)
pop_b = Candidate(prompt=prompt_b, best_score=3.0, generation=0)
sim_a = EvolutionLoop._compute_diversity_score(prompt_a, [pop_a, pop_b])
sim_b = EvolutionLoop._compute_diversity_score(prompt_b, [pop_a, pop_b])
# Both should be < 1.0 since they're different
assert sim_a < 1.0
assert sim_b < 1.0
def test_single_member_population_returns_1(self) -> None:
"""Single-member population always returns 1.0 (no penalty)."""
prompt = Prompt(text="Any prompt text here.")
pop = [Candidate(prompt=prompt, best_score=1.0, generation=0)]
sim = EvolutionLoop._compute_diversity_score(prompt, pop)
assert sim == 1.0
def test_empty_prompt_returns_zero(self) -> None:
"""Empty prompt text returns 0.0 when population has >1 member."""
prompt = Prompt(text="")
pop = [
Candidate(prompt=Prompt(text="some text"), best_score=1.0, generation=0),
Candidate(prompt=Prompt(text="other text"), best_score=2.0, generation=0),
]
sim = EvolutionLoop._compute_diversity_score(prompt, pop)
assert sim == 0.0
class TestPromptDiff:
"""Tests for the static _compute_prompt_diff helper."""
def test_identical_prompts(self) -> None:
result = EvolutionLoop._compute_prompt_diff("hello\nworld", "hello\nworld")
assert result["lines_added"] == 0
assert result["lines_removed"] == 0
assert result["chars_delta"] == 0
def test_added_lines(self) -> None:
result = EvolutionLoop._compute_prompt_diff("hello", "hello\nworld")
assert result["lines_added"] == 1
assert result["lines_removed"] == 0
assert result["chars_delta"] == 6 # "\nworld"
def test_removed_lines(self) -> None:
result = EvolutionLoop._compute_prompt_diff("hello\nworld", "hello")
assert result["lines_added"] == 0
assert result["lines_removed"] == 1

View File

@@ -0,0 +1,133 @@
"""Tests for GroundTruthEvaluator — execution + similarity comparison."""
from __future__ import annotations
from unittest.mock import AsyncMock
import pytest
from prometheus.application.ground_truth_evaluator import GroundTruthEvaluator
from prometheus.domain.entities import EvalResult, GroundTruthExample, Prompt
from prometheus.domain.ports import LLMPort, SimilarityPort
@pytest.fixture
def mock_executor() -> AsyncMock:
port = AsyncMock(spec=LLMPort)
port.execute.return_value = "Paris is the capital of France."
return port
@pytest.fixture
def mock_similarity() -> AsyncMock:
port = AsyncMock(spec=SimilarityPort)
port.compute.return_value = 0.85
return port
@pytest.fixture
def gt_dataset() -> list[GroundTruthExample]:
return [
GroundTruthExample(input_text="What is the capital of France?", expected_output="Paris", id=0),
GroundTruthExample(input_text="What is 2+2?", expected_output="4", id=1),
GroundTruthExample(input_text="What color is the sky?", expected_output="blue", id=2),
]
@pytest.fixture
def prompt() -> Prompt:
return Prompt(text="Answer the following question accurately.")
@pytest.mark.asyncio
class TestGroundTruthEvaluator:
async def test_evaluate_happy_path(self, mock_executor, mock_similarity, gt_dataset, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor,
similarity=mock_similarity,
max_concurrency=2,
)
result = await evaluator.evaluate(prompt, gt_dataset)
assert isinstance(result, EvalResult)
assert len(result.scores) == 3
assert len(result.feedbacks) == 3
assert len(result.trajectories) == 3
assert all(s == 0.85 for s in result.scores)
assert result.mean_score == pytest.approx(0.85)
assert result.total_score == pytest.approx(2.55)
async def test_executor_called_for_each_input(self, mock_executor, mock_similarity, gt_dataset, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor, similarity=mock_similarity,
)
await evaluator.evaluate(prompt, gt_dataset)
assert mock_executor.execute.call_count == 3
async def test_similarity_called_for_each_output(self, mock_executor, mock_similarity, gt_dataset, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor, similarity=mock_similarity,
)
await evaluator.evaluate(prompt, gt_dataset)
assert mock_similarity.compute.call_count == 3
async def test_execution_error_produces_zero_score(self, mock_similarity, gt_dataset, prompt):
failing_executor = AsyncMock(spec=LLMPort)
failing_executor.execute.side_effect = RuntimeError("API timeout")
evaluator = GroundTruthEvaluator(
executor=failing_executor, similarity=mock_similarity,
)
result = await evaluator.evaluate(prompt, gt_dataset)
assert len(result.scores) == 3
# The similarity adapter is called with the error sentinel
assert all(isinstance(s, float) for s in result.scores)
assert all("[execution error:" in t.output_text for t in result.trajectories)
async def test_empty_dataset(self, mock_executor, mock_similarity, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor, similarity=mock_similarity,
)
result = await evaluator.evaluate(prompt, [])
assert result.scores == []
assert result.mean_score == 0.0
assert result.total_score == 0.0
async def test_trajectory_contains_prompt_used(self, mock_executor, mock_similarity, gt_dataset, prompt):
evaluator = GroundTruthEvaluator(
executor=mock_executor, similarity=mock_similarity,
)
result = await evaluator.evaluate(prompt, gt_dataset)
for t in result.trajectories:
assert t.prompt_used == prompt.text
async def test_scores_clamped_to_unit_range(self, mock_executor, gt_dataset, prompt):
# Similarity returns a value > 1.0 (should be clamped)
over_similarity = AsyncMock(spec=SimilarityPort)
over_similarity.compute.return_value = 1.5
evaluator = GroundTruthEvaluator(
executor=mock_executor, similarity=over_similarity,
)
result = await evaluator.evaluate(prompt, gt_dataset)
assert all(0.0 <= s <= 1.0 for s in result.scores)
async def test_feedback_for_exact_match(self, mock_executor, gt_dataset, prompt):
exact_similarity = AsyncMock(spec=SimilarityPort)
exact_similarity.compute.return_value = 1.0
evaluator = GroundTruthEvaluator(
executor=mock_executor, similarity=exact_similarity,
)
result = await evaluator.evaluate(prompt, gt_dataset)
assert all("Exact match" in fb for fb in result.feedbacks)
async def test_feedback_for_poor_match(self, mock_executor, gt_dataset, prompt):
poor_similarity = AsyncMock(spec=SimilarityPort)
poor_similarity.compute.return_value = 0.1
evaluator = GroundTruthEvaluator(
executor=mock_executor, similarity=poor_similarity,
)
result = await evaluator.evaluate(prompt, gt_dataset)
assert all("Poor match" in fb for fb in result.feedbacks)

View File

@@ -0,0 +1,316 @@
"""Unit tests for hold-out validation and early stopping."""
from __future__ import annotations
from unittest.mock import AsyncMock, MagicMock
import pytest
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.domain.entities import (
Candidate,
EvalResult,
Prompt,
SyntheticExample,
Trajectory,
)
def _make_eval(mean_score: float, n: int = 5) -> EvalResult:
"""Helper: create an EvalResult with a given mean score."""
scores = [mean_score] * n
return EvalResult(
scores=scores,
feedbacks=["feedback"] * n,
trajectories=[
Trajectory(f"input{i}", f"output{i}", mean_score, "feedback", "prompt")
for i in range(n)
],
)
class TestBootstrapSplit:
"""Tests for SyntheticBootstrap.split_pool."""
def test_split_produces_correct_sizes(self):
pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(20)]
train, val = SyntheticBootstrap.split_pool(pool, 0.3)
assert len(train) + len(val) == 20
assert len(val) == 6 # 20 * 0.3 = 6
assert len(train) == 14
def test_split_zero_fraction_returns_all_train(self):
pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(10)]
train, val = SyntheticBootstrap.split_pool(pool, 0.0)
assert len(train) == 10
assert len(val) == 0
def test_split_single_element(self):
pool = [SyntheticExample(input_text="only", id=0)]
train, val = SyntheticBootstrap.split_pool(pool, 0.3)
assert len(train) == 1
assert len(val) == 0
def test_split_deterministic_with_seed(self):
pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(50)]
train1, val1 = SyntheticBootstrap.split_pool(pool, 0.3, rng=MagicMock(wraps=__import__("random").Random(42)))
train2, val2 = SyntheticBootstrap.split_pool(pool, 0.3, rng=MagicMock(wraps=__import__("random").Random(42)))
assert [ex.id for ex in train1] == [ex.id for ex in train2]
assert [ex.id for ex in val1] == [ex.id for ex in val2]
def test_split_no_overlap(self):
pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(30)]
train, val = SyntheticBootstrap.split_pool(pool, 0.3)
train_ids = {ex.id for ex in train}
val_ids = {ex.id for ex in val}
assert train_ids.isdisjoint(val_ids)
assert train_ids | val_ids == {ex.id for ex in pool}
class TestValidationEvaluation:
"""Tests for hold-out evaluation during evolution."""
@pytest.mark.asyncio
async def test_validation_pool_evaluated_after_each_iteration(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
) -> None:
"""When a validation pool is provided, the best candidate is evaluated on it."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
# Initial eval (train) + validation eval + iteration train eval + new prompt eval + validation eval
train_eval = _make_eval(0.5)
val_eval = _make_eval(0.6)
new_eval = _make_eval(0.7)
val_eval_2 = _make_eval(0.65)
evaluator.evaluate = AsyncMock(
side_effect=[train_eval, val_eval, train_eval, new_eval, val_eval_2]
)
validation_pool = synthetic_pool[-6:]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=1,
minibatch_size=5,
)
state = await loop.run(
seed_prompt, synthetic_pool, task_description,
validation_pool=validation_pool,
)
# Should have validation metrics in state
assert state.best_validation_score is not None
# History should contain validation_eval entries
val_events = [h for h in state.history if h["event"] == "validation_eval"]
assert len(val_events) >= 1
@pytest.mark.asyncio
async def test_no_validation_without_pool(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
) -> None:
"""Without a validation pool, no validation is performed."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
train_eval = _make_eval(0.5)
old_eval = _make_eval(0.5)
new_eval = _make_eval(0.7)
evaluator.evaluate = AsyncMock(side_effect=[train_eval, old_eval, new_eval])
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=1,
minibatch_size=5,
)
state = await loop.run(seed_prompt, synthetic_pool, task_description)
assert state.best_validation_score is None
assert not state.early_stopped
val_events = [h for h in state.history if h["event"] == "validation_eval"]
assert len(val_events) == 0
class TestEarlyStopping:
"""Tests for early stopping when validation score degrades."""
@pytest.mark.asyncio
async def test_early_stop_triggers_on_patience_exceeded(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
) -> None:
"""Early stopping triggers when validation doesn't improve for K iterations."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
patience = 3
# Build eval sequence:
# 1. Initial train eval
# 2. Initial validation eval (0.5)
# Then for each of 3 iterations:
# - train eval (current best)
# - train eval (new prompt - accepted)
# - validation eval (degrading)
evals = [
_make_eval(0.5), # initial train
_make_eval(0.5), # initial validation
]
for i in range(patience):
evals.extend([
_make_eval(0.5 + i * 0.1), # current eval (train)
_make_eval(0.6 + i * 0.1), # new eval (train) - accepted
_make_eval(0.4), # validation eval (degrading)
])
evaluator.evaluate = AsyncMock(side_effect=evals)
validation_pool = synthetic_pool[-5:]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=10, # would go further without early stop
minibatch_size=5,
early_stop_patience=patience,
)
state = await loop.run(
seed_prompt, synthetic_pool, task_description,
validation_pool=validation_pool,
)
assert state.early_stopped is True
assert state.iteration == patience
assert state.best_validation_score is not None
# Should have an early_stop event in history
early_stop_events = [h for h in state.history if h["event"] == "early_stop"]
assert len(early_stop_events) == 1
@pytest.mark.asyncio
async def test_early_stop_does_not_trigger_when_improving(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
) -> None:
"""When validation keeps improving, early stopping does not trigger."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
evals = [
_make_eval(0.3), # initial train
_make_eval(0.3), # initial validation
]
# 3 iterations, each with improving validation
for i in range(3):
evals.extend([
_make_eval(0.3 + i * 0.1), # current train eval
_make_eval(0.4 + i * 0.1), # new train eval (accepted)
_make_eval(0.3 + (i + 1) * 0.1), # validation eval (improving)
])
evaluator.evaluate = AsyncMock(side_effect=evals)
validation_pool = synthetic_pool[-5:]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=3,
minibatch_size=5,
early_stop_patience=5,
)
state = await loop.run(
seed_prompt, synthetic_pool, task_description,
validation_pool=validation_pool,
)
assert state.early_stopped is False
assert state.iteration == 3
assert state.best_validation_score is not None
@pytest.mark.asyncio
async def test_validation_patience_resets_on_improvement(
self,
seed_prompt: Prompt,
synthetic_pool: list[SyntheticExample],
task_description: str,
mock_llm_port: AsyncMock,
mock_judge_port: AsyncMock,
mock_proposer_port: AsyncMock,
) -> None:
"""Patience counter resets when validation improves after degrading."""
evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
bootstrap = MagicMock(spec=SyntheticBootstrap)
bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
evals = [
_make_eval(0.3), # initial train
_make_eval(0.3), # initial validation
# iter 1: degrade
_make_eval(0.3), # current train
_make_eval(0.5), # new train (accepted)
_make_eval(0.2), # validation degrade (patience=1)
# iter 2: degrade
_make_eval(0.5), # current train
_make_eval(0.6), # new train (accepted)
_make_eval(0.2), # validation degrade (patience=2)
# iter 3: improve! (resets patience)
_make_eval(0.6), # current train
_make_eval(0.7), # new train (accepted)
_make_eval(0.4), # validation improve (patience=0)
# iter 4: degrade again
_make_eval(0.7), # current train
_make_eval(0.8), # new train (accepted)
_make_eval(0.2), # validation degrade (patience=1)
]
evaluator.evaluate = AsyncMock(side_effect=evals)
validation_pool = synthetic_pool[-5:]
loop = EvolutionLoop(
evaluator=evaluator,
proposer=mock_proposer_port,
bootstrap=bootstrap,
max_iterations=4,
minibatch_size=5,
early_stop_patience=3,
)
state = await loop.run(
seed_prompt, synthetic_pool, task_description,
validation_pool=validation_pool,
)
assert state.early_stopped is False
assert state.iteration == 4

189
tests/unit/test_logging.py Normal file
View File

@@ -0,0 +1,189 @@
"""Unit tests for structured logging configuration."""
from __future__ import annotations
import json
import logging
from pathlib import Path
from prometheus.cli.logging_setup import configure_logging, get_logger
class TestConfigureLogging:
def _count_handlers(self, name: str = "prometheus") -> int:
return len(logging.getLogger(name).handlers)
def test_default_creates_console_handler(self) -> None:
configure_logging(level=logging.INFO)
prom = logging.getLogger("prometheus")
assert len(prom.handlers) == 1
assert isinstance(prom.handlers[0], logging.StreamHandler)
prom.handlers.clear()
def test_json_format_produces_valid_json(self, capsys) -> None:
configure_logging(level=logging.INFO, log_format="json")
logger = get_logger("test_json")
logger.info("hello", extra={"structured": {"key": "value"}})
captured = capsys.readouterr()
# Output goes to stderr
line = captured.err.strip().split("\n")[-1]
data = json.loads(line)
assert data["message"] == "hello"
assert data["structured"]["key"] == "value"
assert data["level"] == "INFO"
assert "timestamp" in data
logging.getLogger("prometheus").handlers.clear()
def test_text_format_includes_structured_extras(self, capsys) -> None:
configure_logging(level=logging.INFO, log_format="text")
logger = get_logger("test_text")
logger.info("msg", extra={"structured": {"foo": "bar"}})
captured = capsys.readouterr()
assert "foo=bar" in captured.err
logging.getLogger("prometheus").handlers.clear()
def test_debug_level_shows_debug_messages(self, capsys) -> None:
configure_logging(level=logging.DEBUG)
logger = get_logger("test_debug")
logger.debug("debug msg")
captured = capsys.readouterr()
assert "debug msg" in captured.err
logging.getLogger("prometheus").handlers.clear()
def test_warning_level_hides_debug_messages(self, capsys) -> None:
configure_logging(level=logging.WARNING)
logger = get_logger("test_warn")
logger.debug("should not appear")
logger.info("also hidden")
captured = capsys.readouterr()
assert "should not appear" not in captured.err
assert "also hidden" not in captured.err
logging.getLogger("prometheus").handlers.clear()
def test_file_handler_writes_to_file(self, tmp_path: Path) -> None:
log_file = tmp_path / "test.log"
configure_logging(level=logging.INFO, log_file=str(log_file))
logger = get_logger("test_file")
logger.info("file message")
prom = logging.getLogger("prometheus")
# Flush handlers
for h in prom.handlers:
h.flush()
prom.handlers.clear()
content = log_file.read_text()
assert "file message" in content
def test_json_file_output(self, tmp_path: Path) -> None:
log_file = tmp_path / "test.json.log"
configure_logging(level=logging.INFO, log_format="json", log_file=str(log_file))
logger = get_logger("test_json_file")
logger.info("json file msg", extra={"structured": {"x": 1}})
prom = logging.getLogger("prometheus")
for h in prom.handlers:
h.flush()
prom.handlers.clear()
content = log_file.read_text().strip()
data = json.loads(content)
assert data["message"] == "json file msg"
assert data["structured"]["x"] == 1
def test_reconfigure_clears_old_handlers(self) -> None:
configure_logging(level=logging.INFO)
configure_logging(level=logging.DEBUG)
prom = logging.getLogger("prometheus")
assert len(prom.handlers) == 1
prom.handlers.clear()
def test_propagate_false_prevents_duplicate_output(self, capsys) -> None:
configure_logging(level=logging.INFO)
prom = logging.getLogger("prometheus")
assert prom.propagate is False
prom.handlers.clear()
class TestGetLogger:
def test_returns_child_of_prometheus(self) -> None:
logger = get_logger("mymodule")
assert logger.name == "prometheus.mymodule"
def test_inherits_level_from_parent(self) -> None:
configure_logging(level=logging.DEBUG)
logger = get_logger("child")
assert logger.getEffectiveLevel() <= logging.DEBUG
logging.getLogger("prometheus").handlers.clear()
class TestJsonFormatter:
def test_exception_included(self, capsys) -> None:
configure_logging(level=logging.ERROR, log_format="json")
logger = get_logger("test_exc")
try:
raise ValueError("boom")
except ValueError:
logger.error("failed", exc_info=True)
captured = capsys.readouterr()
line = captured.err.strip().split("\n")[-1]
data = json.loads(line)
assert "ValueError: boom" in data["exception"]
logging.getLogger("prometheus").handlers.clear()
class TestLoggingCLIIntegration:
"""Tests for CLI flags that configure logging."""
def test_verbose_flag_enables_info(self, tmp_path: Path) -> None:
"""Simulate what -v does — configure_logging at INFO level."""
configure_logging(level=logging.INFO)
logger = get_logger("evolution")
logger.info("test message")
prom = logging.getLogger("prometheus")
assert len(prom.handlers) == 1
prom.handlers.clear()
def test_debug_flag_enables_debug(self) -> None:
"""Simulate what --debug does — configure_logging at DEBUG level."""
configure_logging(level=logging.DEBUG)
logger = get_logger("evolution")
logger.debug("debug message")
prom = logging.getLogger("prometheus")
assert prom.level == logging.DEBUG
prom.handlers.clear()
def test_log_format_invalid_rejected(self) -> None:
"""Invalid log_format should be caught by OptimizationConfig validator."""
from pydantic import ValidationError
from prometheus.application.dto import OptimizationConfig
import pytest
with pytest.raises(ValidationError, match="log_format must be one of"):
OptimizationConfig(
seed_prompt="a",
task_description="b",
log_format="xml",
)
def test_log_format_text_and_json_accepted(self) -> None:
"""Both text and json log_format values should be valid."""
from prometheus.application.dto import OptimizationConfig
for fmt in ("text", "json"):
config = OptimizationConfig(
seed_prompt="a", task_description="b", log_format=fmt,
)
assert config.log_format == fmt

View File

@@ -0,0 +1,96 @@
"""Additional unit tests for scoring edge cases."""
from __future__ import annotations
import pytest
from prometheus.domain.entities import EvalResult, Trajectory
from prometheus.domain.scoring import normalize_score, should_accept
def _make_eval(scores: list[float]) -> EvalResult:
return EvalResult(
scores=scores,
feedbacks=[""] * len(scores),
trajectories=[
Trajectory(f"in{i}", f"out{i}", s, "", "p")
for i, s in enumerate(scores)
],
)
class TestShouldAcceptEdgeCases:
"""Extended edge-case tests for should_accept."""
def test_tiny_improvement_accepted(self) -> None:
old = _make_eval([0.5])
new = _make_eval([0.5001])
assert should_accept(old, new) is True
def test_tiny_improvement_below_threshold(self) -> None:
old = _make_eval([0.5])
new = _make_eval([0.5001])
assert should_accept(old, new, min_improvement=0.01) is False
def test_zero_scores_equal(self) -> None:
old = _make_eval([0.0, 0.0])
new = _make_eval([0.0, 0.0])
assert should_accept(old, new) is False
def test_negative_to_zero_not_accepted(self) -> None:
"""Scores should be [0,1] but test should_accept with edge values."""
old = _make_eval([-0.1])
new = _make_eval([0.0])
assert should_accept(old, new) is True
def test_large_improvement(self) -> None:
old = _make_eval([0.0, 0.0, 0.0])
new = _make_eval([1.0, 1.0, 1.0])
assert should_accept(old, new) is True
def test_single_score_improvement(self) -> None:
old = _make_eval([0.4])
new = _make_eval([0.5])
assert should_accept(old, new) is True
def test_min_improvement_exactly_met(self) -> None:
"""When improvement exactly equals min_improvement, still rejected (strict >)."""
old = _make_eval([0.5])
new = _make_eval([0.7])
assert should_accept(old, new, min_improvement=0.2) is False
def test_min_improvement_just_over(self) -> None:
old = _make_eval([0.5])
new = _make_eval([0.7001])
assert should_accept(old, new, min_improvement=0.2) is True
class TestNormalizeScoreEdgeCases:
"""Extended edge-case tests for normalize_score."""
def test_exact_bounds(self) -> None:
assert normalize_score(0.0) == 0.0
assert normalize_score(1.0) == 1.0
def test_very_large_value(self) -> None:
assert normalize_score(1e10) == 1.0
def test_very_negative_value(self) -> None:
assert normalize_score(-1e10) == 0.0
def test_custom_bounds_at_edges(self) -> None:
assert normalize_score(5.0, min_val=0.0, max_val=10.0) == 5.0
assert normalize_score(0.0, min_val=0.0, max_val=10.0) == 0.0
assert normalize_score(10.0, min_val=0.0, max_val=10.0) == 10.0
def test_negative_custom_range(self) -> None:
assert normalize_score(0.0, min_val=-5.0, max_val=5.0) == 0.0
assert normalize_score(-3.0, min_val=-5.0, max_val=5.0) == -3.0
assert normalize_score(-10.0, min_val=-5.0, max_val=5.0) == -5.0
def test_zero_span_range(self) -> None:
"""When min == max, clamps to min."""
assert normalize_score(5.0, min_val=5.0, max_val=5.0) == 5.0
assert normalize_score(0.0, min_val=5.0, max_val=5.0) == 5.0
def test_fractional_score(self) -> None:
assert normalize_score(0.3333) == pytest.approx(0.3333)

View File

@@ -0,0 +1,133 @@
"""Tests for similarity adapters — exact, BLEU, ROUGE-L, cosine."""
from __future__ import annotations
import pytest
from prometheus.infrastructure.similarity import (
BleuSimilarity,
CosineSimilarity,
ExactMatchSimilarity,
RougeLSimilarity,
create_similarity_adapter,
)
class TestExactMatchSimilarity:
def test_exact_match(self):
s = ExactMatchSimilarity()
assert s.compute("Hello World", "Hello World") == 1.0
def test_case_insensitive(self):
s = ExactMatchSimilarity()
assert s.compute("hello world", "HELLO WORLD") == 1.0
def test_whitespace_trimmed(self):
s = ExactMatchSimilarity()
assert s.compute(" hello ", "hello") == 1.0
def test_no_match(self):
s = ExactMatchSimilarity()
assert s.compute("hello", "world") == 0.0
def test_partial_no_match(self):
s = ExactMatchSimilarity()
assert s.compute("hello world", "hello") == 0.0
class TestBleuSimilarity:
def test_perfect_match(self):
s = BleuSimilarity()
assert s.compute("the cat sat on the mat", "the cat sat on the mat") == 1.0
def test_no_overlap(self):
s = BleuSimilarity()
assert s.compute("aaa bbb ccc", "ddd eee fff") == 0.0
def test_partial_overlap(self):
s = BleuSimilarity()
score = s.compute("the cat sat", "the cat")
assert 0.0 < score < 1.0
def test_empty_prediction(self):
s = BleuSimilarity()
assert s.compute("", "hello world") == 0.0
def test_empty_expected(self):
s = BleuSimilarity()
assert s.compute("hello world", "") == 0.0
def test_both_empty(self):
s = BleuSimilarity()
assert s.compute("", "") == 0.0
def test_shorter_prediction_gets_brevity_penalty(self):
s = BleuSimilarity()
short = s.compute("cat", "the cat sat on the mat")
full = s.compute("the cat sat on the mat", "the cat sat on the mat")
assert short < full
class TestRougeLSimilarity:
def test_perfect_match(self):
s = RougeLSimilarity()
assert s.compute("the cat sat", "the cat sat") == 1.0
def test_no_overlap(self):
s = RougeLSimilarity()
assert s.compute("aaa bbb", "ccc ddd") == 0.0
def test_partial_overlap(self):
s = RougeLSimilarity()
score = s.compute("the cat sat on the mat", "the cat on the rug")
assert 0.0 < score < 1.0
def test_empty_prediction(self):
s = RougeLSimilarity()
assert s.compute("", "hello") == 0.0
def test_subsequence(self):
s = RougeLSimilarity()
# "cat mat" is a subsequence of "the cat sat on the mat"
score = s.compute("cat mat", "the cat sat on the mat")
assert score > 0.0
class TestCosineSimilarity:
def test_identical_texts(self):
s = CosineSimilarity()
assert s.compute("hello world", "hello world") == pytest.approx(1.0)
def test_no_overlap(self):
s = CosineSimilarity()
assert s.compute("aaa bbb", "ccc ddd") == 0.0
def test_partial_overlap(self):
s = CosineSimilarity()
score = s.compute("hello world foo", "hello world bar")
assert 0.0 < score < 1.0
def test_empty_prediction(self):
s = CosineSimilarity()
assert s.compute("", "hello") == 0.0
class TestCreateSimilarityAdapter:
def test_create_exact(self):
adapter = create_similarity_adapter("exact")
assert isinstance(adapter, ExactMatchSimilarity)
def test_create_bleu(self):
adapter = create_similarity_adapter("bleu")
assert isinstance(adapter, BleuSimilarity)
def test_create_rouge_l(self):
adapter = create_similarity_adapter("rouge_l")
assert isinstance(adapter, RougeLSimilarity)
def test_create_cosine(self):
adapter = create_similarity_adapter("cosine")
assert isinstance(adapter, CosineSimilarity)
def test_unknown_metric_raises(self):
with pytest.raises(ValueError, match="Unknown eval metric"):
create_similarity_adapter("nonexistent")

View File

@@ -0,0 +1,233 @@
"""Unit tests for OptimizePromptUseCase — direct orchestration tests."""
from __future__ import annotations
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.dto import OptimizationConfig, OptimizationResult
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.application.use_cases import OptimizePromptUseCase
from prometheus.domain.entities import (
Candidate,
EvalResult,
OptimizationState,
Prompt,
SyntheticExample,
Trajectory,
)
def _make_eval(scores: list[float]) -> EvalResult:
return EvalResult(
scores=scores,
feedbacks=["feedback"] * len(scores),
trajectories=[
Trajectory(f"in{i}", f"out{i}", s, "feedback", "prompt")
for i, s in enumerate(scores)
],
)
def _make_state(
iterations: int = 3,
initial_score: float = 0.3,
final_score: float = 0.8,
accepted: bool = True,
) -> OptimizationState:
seed = Candidate(prompt=Prompt(text="seed"), best_score=initial_score, generation=0)
best = Candidate(
prompt=Prompt(text="optimized" if accepted else "seed"),
best_score=final_score,
generation=iterations if accepted else 0,
)
history = []
for i in range(1, iterations + 1):
event = "accepted" if accepted else "rejected"
history.append({"iteration": i, "event": event, "old_score": 0.3, "new_score": 0.8})
return OptimizationState(
iteration=iterations,
best_candidate=best,
candidates=[seed, best] if accepted else [seed],
total_llm_calls=iterations * 11 + 10,
history=history,
)
class TestOptimizePromptUseCaseExecute:
"""Tests for the execute() orchestration method."""
@pytest.fixture
def mock_evaluator(self) -> MagicMock:
return MagicMock(spec=PromptEvaluator)
@pytest.fixture
def mock_proposer(self) -> MagicMock:
return MagicMock()
@pytest.fixture
def mock_bootstrap(self) -> MagicMock:
return MagicMock(spec=SyntheticBootstrap)
@pytest.fixture
def use_case(
self,
mock_evaluator: MagicMock,
mock_proposer: MagicMock,
mock_bootstrap: MagicMock,
) -> OptimizePromptUseCase:
return OptimizePromptUseCase(
evaluator=mock_evaluator,
proposer=mock_proposer,
bootstrap=mock_bootstrap,
)
@pytest.fixture
def config(self) -> OptimizationConfig:
return OptimizationConfig(
seed_prompt="Answer the question.",
task_description="Q&A task",
max_iterations=5,
n_synthetic_inputs=20,
minibatch_size=5,
seed=42,
)
@pytest.mark.asyncio
async def test_returns_optimization_result(
self,
use_case: OptimizePromptUseCase,
mock_bootstrap: MagicMock,
config: OptimizationConfig,
) -> None:
mock_bootstrap.run.return_value = [
SyntheticExample(input_text=f"q{i}", id=i) for i in range(20)
]
mock_state = _make_state(iterations=3, initial_score=0.3, final_score=0.9)
with patch.object(EvolutionLoop, "run", return_value=mock_state):
result = await use_case.execute(config)
assert isinstance(result, OptimizationResult)
assert result.initial_prompt == "Answer the question."
assert result.final_score == 0.9
assert result.improvement == pytest.approx(0.6)
@pytest.mark.asyncio
async def test_bootstrap_called_with_config_params(
self,
use_case: OptimizePromptUseCase,
mock_bootstrap: MagicMock,
config: OptimizationConfig,
) -> None:
mock_bootstrap.run.return_value = []
mock_state = _make_state()
with patch.object(EvolutionLoop, "run", return_value=mock_state):
await use_case.execute(config)
mock_bootstrap.run.assert_called_once_with(
task_description="Q&A task",
n_examples=20,
)
@pytest.mark.asyncio
async def test_evolution_loop_configured_from_config(
self,
use_case: OptimizePromptUseCase,
mock_bootstrap: MagicMock,
config: OptimizationConfig,
) -> None:
mock_bootstrap.run.return_value = []
mock_state = _make_state()
with patch.object(EvolutionLoop, "run", return_value=mock_state) as mock_run:
await use_case.execute(config)
# Verify the loop was instantiated with correct params
mock_run.assert_called_once()
call_args = mock_run.call_args
seed_prompt = call_args[0][0]
assert seed_prompt.text == "Answer the question."
synthetic_pool = call_args[0][1]
assert len(synthetic_pool) == 0 # bootstrap returned empty
assert call_args[0][2] == "Q&A task"
@pytest.mark.asyncio
async def test_total_llm_calls_includes_bootstrap_call(
self,
use_case: OptimizePromptUseCase,
mock_bootstrap: MagicMock,
config: OptimizationConfig,
) -> None:
mock_bootstrap.run.return_value = []
mock_state = _make_state(iterations=3)
# total_llm_calls from state + 1 for bootstrap
expected = mock_state.total_llm_calls + 1
with patch.object(EvolutionLoop, "run", return_value=mock_state):
result = await use_case.execute(config)
assert result.total_llm_calls == expected
@pytest.mark.asyncio
async def test_no_candidates_fallback(
self,
use_case: OptimizePromptUseCase,
mock_bootstrap: MagicMock,
config: OptimizationConfig,
) -> None:
mock_bootstrap.run.return_value = [
SyntheticExample(input_text=f"q{i}", id=i) for i in range(20)
]
mock_state = OptimizationState(
iteration=0,
best_candidate=None,
candidates=[],
total_llm_calls=0,
)
with patch.object(EvolutionLoop, "run", return_value=mock_state):
result = await use_case.execute(config)
assert result.optimized_prompt == "Answer the question."
assert result.initial_score == 0.0
assert result.final_score == 0.0
assert result.improvement == 0.0
@pytest.mark.asyncio
async def test_iterations_used_matches_state(
self,
use_case: OptimizePromptUseCase,
mock_bootstrap: MagicMock,
config: OptimizationConfig,
) -> None:
mock_bootstrap.run.return_value = []
mock_state = _make_state(iterations=7)
with patch.object(EvolutionLoop, "run", return_value=mock_state):
result = await use_case.execute(config)
assert result.iterations_used == 7
@pytest.mark.asyncio
async def test_history_passed_through(
self,
use_case: OptimizePromptUseCase,
mock_bootstrap: MagicMock,
config: OptimizationConfig,
) -> None:
mock_bootstrap.run.return_value = []
history = [
{"iteration": 1, "event": "accepted"},
{"iteration": 2, "event": "rejected"},
]
mock_state = _make_state()
mock_state.history = history
with patch.object(EvolutionLoop, "run", return_value=mock_state):
result = await use_case.execute(config)
assert result.history == history