diff --git a/docs/FEATURE_ROADMAP.md b/docs/FEATURE_ROADMAP.md new file mode 100644 index 0000000..918663f --- /dev/null +++ b/docs/FEATURE_ROADMAP.md @@ -0,0 +1,369 @@ +# PROMETHEUS Feature Roadmap + +> Complete codebase review — features needed for production-grade prompt optimization. +> Generated from v0.1.0 architecture review (2026-03-29). + +--- + +## Legend + +| Marker | Meaning | +|--------|---------| +| **CLI** | Exposed as a CLI option/flag | +| **Config** | YAML config field | +| **Internal** | No user-facing surface, architectural improvement | +| **P1** | Critical / must-have for reliability | +| **P2** | High value, should-have | +| **P3** | Nice-to-have, deferred to later versions | + +--- + +## 1. Multi-Model Routing (P1) + +**Current state:** `OptimizationConfig` defines four model slots (`task_model`, `judge_model`, `proposer_model`, `synth_model`), but `cli/app.py` only configures a single global DSPy LM from `task_model`. All adapters silently use the same model regardless of config. + +**Feature:** +- Each adapter (`DSPyLLMAdapter`, `DSPyJudgeAdapter`, `DSPyProposerAdapter`, `DSPySyntheticAdapter`) must instantiate its own `dspy.LM` from the corresponding config field. +- Support per-model `api_base` and `api_key_env` overrides (e.g., judge on GPT-4o, propose on a cheaper model). + +**Surface:** Config (already partially defined) — `judge_model`, `proposer_model`, `synth_model` become functional. No new CLI flags needed; the YAML already has the fields. + +**Scope:** Infrastructure layer (`llm_adapter.py`, `judge_adapter.py`, `proposer_adapter.py`, `synth_adapter.py`) + `cli/app.py` DI wiring. + +--- + +## 2. Async / Parallel Execution (P1) + +**Current state:** All LLM calls (execute, judge, propose) are sequential. A single iteration with `minibatch_size=5` makes ~11 sequential LLM calls. Wall-clock time scales linearly with minibatch size. + +**Feature:** +- Parallelize execution of the prompt across a minibatch (`asyncio.gather` or `dspy.Parallel`). +- Parallelize judge calls within a batch. +- Keep the proposer sequential (single call per iteration). + +**Surface:** Internal. Optionally exposed via `--max-concurrency` CLI flag and `max_concurrency` YAML field. + +**Scope:** `evaluator.py`, `judge_adapter.py`, `llm_adapter.py`. + +--- + +## 3. Robust Error Handling & Retry (P1) + +**Current state:** The evolution loop catches broad `Exception` per iteration and logs it, then continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks only cover parsing, not network errors. + +**Feature:** +- Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx). +- Configurable `max_retries` and `retry_delay_base`. +- Circuit breaker: if N consecutive iterations fail, pause and alert. +- Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation. + +**Surface:** `--max-retries` CLI flag, `max_retries` Config field. `--error-strategy` (skip | retry | abort) CLI flag. + +**Scope:** Infrastructure adapters + evolution loop. + +--- + +## 4. Checkpoint & Resume (P2) + +**Current state:** If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence. + +**Feature:** +- Save `OptimizationState` to disk every K iterations (or every accepted improvement). +- Resume from the latest checkpoint file on restart. +- Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state. + +**Surface:** `--checkpoint-dir` CLI flag (default: `.prometheus/checkpoints/`). `--resume` CLI flag to resume from latest checkpoint. `checkpoint_interval` Config field. + +**Scope:** New `CheckpointPort` in domain, `JsonCheckpointPersistence` in infrastructure, modifications to `EvolutionLoop.run()`. + +--- + +## 5. Population-Based Evolution (P2) + +**Current state:** The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The `Candidate` entity has `generation` and `parent_id` fields that suggest population support was planned. + +**Feature:** +- Maintain a population of K candidates (e.g., top-K by score or Pareto front). +- Crossover: combine instructions from two parent candidates. +- Mutation operators: paraphrase, constrain, generalize, specialize. +- Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance). + +**Surface:** `--population-size` CLI flag, `population_size` Config field. `--crossover-rate`, `--mutation-rate` CLI flags. + +**Scope:** `EvolutionLoop` refactor, new `CrossoverPort` and `MutationPort` in domain, new DSPy signatures for crossover/mutation in infrastructure. + +--- + +## 6. Hold-Out Validation (P2) + +**Current state:** The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs. + +**Feature:** +- Split synthetic pool into train (e.g., 70%) and validation (30%) sets. +- Evolution uses train minibatches for accept/reject decisions. +- After each iteration, evaluate the best candidate on the hold-out set. +- Report both train and validation scores in results. +- Optional early stopping if validation score degrades for K consecutive iterations. + +**Surface:** `--validation-split` CLI flag (default: 0.3). `--early-stop-patience` CLI flag (default: 5). Config fields: `validation_split`, `early_stop_patience`. + +**Scope:** `SyntheticBootstrap`, `EvolutionLoop`, `OptimizationResult` (add validation metrics). + +--- + +## 7. Custom Judge Criteria (P2) + +**Current state:** The judge uses a hardcoded rubric in `JudgeOutput` DSPy signature ("score 0.0-1.0" with generic quality assessment). Users cannot customize evaluation criteria. + +**Feature:** +- Allow users to define custom judge rubrics, criteria, and scoring scales. +- Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights. +- Allow `perfect_score` to reflect the custom scale. + +**Surface:** `judge_criteria` YAML field (free text). `judge_dimensions` YAML field (list of `{name, weight, description}`). CLI: `--judge-criteria` for quick overrides. + +**Scope:** `JudgeOutput` signature (dynamic instructions), `JudgePort`, `DSPyJudgeAdapter`, `scoring.py` (weighted aggregation). + +--- + +## 8. Real-World Evaluation Harness (P2) + +**Current state:** The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs. + +**Feature:** +- Accept an optional evaluation dataset (CSV/JSON with `input` and `expected_output` columns). +- When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge. +- Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected. + +**Surface:** `--eval-dataset` CLI flag. `eval_dataset_path` Config field. `--eval-metric` CLI flag (exact | semantic | llm_judge). + +**Scope:** New `GroundTruthEvaluator` in application, new `SimilarityPort` in domain, dataset loader in infrastructure. + +--- + +## 9. Logging & Observability (P2) + +**Current state:** Verbose mode (`-v`) configures Python's `logging` module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing. + +**Feature:** +- Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR). +- JSON-formatted log output for machine parsing. +- Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff. +- Optional OpenTelemetry export for distributed tracing. + +**Surface:** `-v` / `--verbose` enables INFO level. `--debug` enables DEBUG level. `--log-format` (text | json). `--log-file` for file output. Config fields: `log_level`, `log_format`, `log_file`. + +**Scope:** `cli/app.py` (logging setup), `evolution.py` (structured traces), new `TracingPort` in domain. + +--- + +## 10. CLI Improvements (P2) + +**Current state:** Single `optimize` command. Known Typer 0.24 bug absorbing subcommands (Bug #1). No `version`, `init`, or `list-results` commands. + +**Feature:** +- Fix Typer subcommand routing. +- `prometheus version` — show version. +- `prometheus init` — scaffold a config YAML interactively. +- `prometheus list` — list past optimization runs. +- `prometheus diff` — compare two result files (before/after prompt diff, score improvement). +- `prometheus eval` — evaluate a prompt against a dataset without optimization. + +**Surface:** CLI subcommands. + +**Scope:** `cli/app.py` restructured into `cli/commands/` with one module per command. + +--- + +## 11. Input Validation & Schema Enforcement (P2) + +**Current state:** Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline. + +**Feature:** +- Validate input YAML against a Pydantic schema (leveraging the existing `pydantic` dependency). +- Provide clear, actionable error messages for missing/invalid fields. +- Support config migration/upgrade from older versions. + +**Surface:** Internal. Errors surface as clear CLI messages. + +**Scope:** `OptimizationConfig` converted to Pydantic model with validators, `cli/app.py` validation step before pipeline execution. + +--- + +## 12. Adaptive Minibatch Sizing (P3) + +**Current state:** Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive. + +**Feature:** +- Start with a small minibatch for quick early iterations. +- Increase minibatch size as the prompt improves (higher confidence needed for marginal gains). +- Shrink if too many evaluations fail (cost optimization). + +**Surface:** `--adaptive-minibatch` CLI flag (boolean toggle). `minibatch_size` becomes `minibatch_size_min` and `minibatch_size_max` in config. + +**Scope:** `EvolutionLoop`, `SyntheticBootstrap`. + +--- + +## 13. Prompt Diversity Tracking (P3) + +**Current state:** No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change. + +**Feature:** +- Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts. +- Report diversity metrics in the result. +- Flag stagnation (N iterations with ` CLI subcommand. `--clipboard` flag. + +**Scope:** New `cli/commands/export.py`, format renderers in infrastructure. + +--- + +## 18. Config Profiles / Presets (P3) + +**Current state:** Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured. + +**Feature:** +- Named profiles: `fast`, `thorough`, `economy`, `research`. +- Profile overrides individual config fields. +- User-defined profiles stored in `~/.prometheus/profiles/`. + +**Surface:** `--profile` CLI flag. `prometheus profile list` / `prometheus profile create` subcommands. + +**Scope:** `cli/app.py`, new `ProfileManager` in application. + +--- + +## Summary Table + +| # | Feature | Priority | CLI Surface | Config Surface | Estimated Scope | +|---|---------|----------|-------------|----------------|-----------------| +| 1 | Multi-Model Routing | P1 | Existing | Existing | Small | +| 2 | Async / Parallel Execution | P1 | `--max-concurrency` | `max_concurrency` | Medium | +| 3 | Error Handling & Retry | P1 | `--max-retries`, `--error-strategy` | `max_retries`, `error_strategy` | Medium | +| 4 | Checkpoint & Resume | P2 | `--checkpoint-dir`, `--resume` | `checkpoint_interval` | Medium | +| 5 | Population-Based Evolution | P2 | `--population-size`, `--crossover-rate` | `population_size`, `crossover_rate` | Large | +| 6 | Hold-Out Validation | P2 | `--validation-split`, `--early-stop-patience` | `validation_split`, `early_stop_patience` | Medium | +| 7 | Custom Judge Criteria | P2 | `--judge-criteria` | `judge_criteria`, `judge_dimensions` | Medium | +| 8 | Real-World Eval Harness | P2 | `--eval-dataset`, `--eval-metric` | `eval_dataset_path` | Large | +| 9 | Logging & Observability | P2 | `--debug`, `--log-format`, `--log-file` | `log_level`, `log_format` | Medium | +| 10 | CLI Improvements | P2 | Subcommands | — | Medium | +| 11 | Input Validation | P2 | — (error messages) | — | Small | +| 12 | Adaptive Minibatch | P3 | `--adaptive-minibatch` | `minibatch_size_min/max` | Small | +| 13 | Prompt Diversity Tracking | P3 | — | — | Small | +| 14 | Temperature & Sampling | P3 | `--temperature` | `*_temperature` | Small | +| 15 | Cost Estimation | P3 | `--max-cost-usd` | `max_cost_usd` | Small | +| 16 | Multi-Objective Optimization | P3 | `--objective` | `objectives` | Large | +| 17 | Export Optimized Prompt | P3 | `prometheus export` | — | Small | +| 18 | Config Profiles / Presets | P3 | `--profile` | — | Small | + +--- + +## Known Bugs (from TEST_REPORT.md and code review) + +| # | Bug | Severity | File | +|---|-----|----------|------| +| 1 | Multi-model config not wired — all adapters use single global LM | HIGH | `cli/app.py`, all adapters | +| 2 | `DSPyLLMAdapter` accepts `model` param but never uses it | HIGH | `infrastructure/llm_adapter.py` | +| 3 | CLI subcommand `optimize` absorbed by Typer 0.24 | HIGH | `cli/app.py` | +| 4 | Verbose logging produces no output — no handler configured | MEDIUM | `cli/app.py` | +| 5 | `total_llm_calls` counter is inaccurate | LOW | `application/use_cases.py`, `evolution.py` | +| 6 | `normalize_score()` is dead code — never called | LOW | `domain/scoring.py` | +| 7 | `AppSettings` is never imported or used | LOW | `config.py` | +| 8 | No LLM error handling in evolution loop | MEDIUM | `evolution.py` | +| 9 | Unpinned dependencies (dspy, typer) | LOW | `pyproject.toml` | + +--- + +## Test Coverage Gaps + +| Area | Current | Needed | +|------|---------|--------| +| CLI commands | 0 tests | Unit + integration for each subcommand | +| Config validation | 0 tests | Schema validation, missing fields, type errors | +| Evolution loop | 3 tests (single iteration each) | Multi-iteration, mixed accept/reject, failure recovery | +| Integration pipeline | 1 test (happy path only) | Error paths, mixed results, real adapters | +| Adapter coverage | 1 adapter tested | All 4 adapters + error scenarios | +| Use case orchestration | 1 indirect test | Direct unit tests for `OptimizePromptUseCase` | + +--- + +## Recommended Implementation Order + +### Phase 1 — Production Reliability (P1) +1. Fix multi-model routing (#1) — highest impact, smallest scope +2. Add error handling & retry (#3) — essential for production runs +3. Implement async/parallel execution (#2) — biggest wall-clock improvement + +### Phase 2 — Optimization Quality (P2) +4. Input validation (#11) — small scope, high reliability gain +5. Logging & observability (#9) — enables debugging long runs +6. CLI improvements (#10) — fix Typer bug, add basic commands +7. Hold-out validation (#6) — prevents overfitting +8. Checkpoint & resume (#4) — essential for long runs +9. Custom judge criteria (#7) — enables domain-specific optimization + +### Phase 3 — Advanced Features (P3) +10. Population-based evolution (#5) +11. Real-world eval harness (#8) +12. Remaining P3 features as demand dictates diff --git a/pyproject.toml b/pyproject.toml index 6fce5bf..8c96806 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -5,12 +5,12 @@ description = "Prompt evolution without reference data" readme = "README.md" requires-python = ">=3.12" dependencies = [ - "dspy>=2.6,<3.0", - "typer>=0.15,<0.20", - "pydantic>=2.10", - "pydantic-settings>=2.7", - "pyyaml>=6.0", - "rich>=13.9", + "dspy==2.6.27", + "typer==0.19.2", + "pydantic==2.12.5", + "pydantic-settings==2.13.1", + "pyyaml==6.0.3", + "rich==14.3.3", ] [project.optional-dependencies] @@ -46,6 +46,6 @@ module = ["dspy", "dspy.*"] ignore_missing_imports = true [[tool.mypy.overrides]] -module = ["prometheus.infrastructure.*", "prometheus.cli.app"] +module = ["prometheus.infrastructure.*", "prometheus.cli.app", "prometheus.cli.commands.*"] disable_error_code = ["misc", "import-untyped"] diff --git a/src/prometheus/application/bootstrap.py b/src/prometheus/application/bootstrap.py index 84804a7..1a944b4 100644 --- a/src/prometheus/application/bootstrap.py +++ b/src/prometheus/application/bootstrap.py @@ -22,6 +22,24 @@ class SyntheticBootstrap: self._generator = generator self._rng = random.Random(seed) + @staticmethod + def split_pool( + pool: list[SyntheticExample], + validation_fraction: float, + rng: random.Random | None = None, + ) -> tuple[list[SyntheticExample], list[SyntheticExample]]: + """Split *pool* into (train, validation) sets. + + Returns (pool, []) when *validation_fraction* is 0. + """ + if validation_fraction <= 0.0 or len(pool) < 2: + return pool, [] + n_val = max(1, int(len(pool) * validation_fraction)) + shuffled = list(pool) + _rng = rng or random.Random(42) + _rng.shuffle(shuffled) + return shuffled[:-n_val], shuffled[-n_val:] + def run(self, task_description: str, n_examples: int) -> list[SyntheticExample]: """Generate the synthetic pool in a single call. diff --git a/src/prometheus/application/evaluator.py b/src/prometheus/application/evaluator.py index c4f3c89..8db5be7 100644 --- a/src/prometheus/application/evaluator.py +++ b/src/prometheus/application/evaluator.py @@ -16,6 +16,7 @@ from prometheus.domain.entities import ( Trajectory, ) from prometheus.domain.ports import JudgePort, LLMPort +from prometheus.domain.scoring import normalize_score logger = logging.getLogger(__name__) @@ -72,6 +73,7 @@ class PromptEvaluator: trajectories: list[Trajectory] = [] for i, (example, output) in enumerate(zip(minibatch, outputs)): score, feedback = judge_results[i] + score = normalize_score(score) scores.append(score) feedbacks.append(feedback) trajectories.append( diff --git a/src/prometheus/application/evolution.py b/src/prometheus/application/evolution.py index 54ba651..a59c265 100644 --- a/src/prometheus/application/evolution.py +++ b/src/prometheus/application/evolution.py @@ -2,24 +2,33 @@ Evolution loop — core PROMETHEUS engine. Orchestrates the select → evaluate → propose → accept cycle. -Equivalent to GEPAEngine.run(), adapted to work without a valset. +Supports two modes: +- Single-candidate hill climbing (population_size=1, backward compat) +- Population-based evolution with crossover & mutation (population_size>1) """ from __future__ import annotations import logging +import random from prometheus.application.bootstrap import SyntheticBootstrap from prometheus.application.evaluator import PromptEvaluator +from prometheus.cli.logging_setup import get_logger from prometheus.domain.entities import ( Candidate, OptimizationState, Prompt, SyntheticExample, ) -from prometheus.domain.ports import ProposerPort +from prometheus.domain.ports import ( + CheckpointPort, + CrossoverPort, + MutationPort, + ProposerPort, +) from prometheus.domain.scoring import should_accept -logger = logging.getLogger(__name__) +logger = get_logger("evolution") class CircuitBreakerOpen(Exception): @@ -30,9 +39,9 @@ class EvolutionLoop: """Main evolution loop. Design: - - Keeps only the best candidate (no full population). - - Simplifies vs GEPA (no Pareto, no merge). - - Population support deferred to v2. + - population_size=1: classic single-candidate hill climbing (backward compat). + - population_size>1: population-based evolution with crossover, mutation, + and diversity maintenance. Error handling: - Transient errors are retried by adapters. @@ -51,6 +60,17 @@ class EvolutionLoop: verbose: bool = False, circuit_breaker_threshold: int = 5, error_strategy: str = "retry", + checkpoint_port: CheckpointPort | None = None, + checkpoint_interval: int = 5, + # --- Population-based evolution params --- + population_size: int = 1, + crossover_rate: float = 0.5, + mutation_rate: float = 0.3, + diversity_penalty: float = 0.1, + crossover_port: CrossoverPort | None = None, + mutation_port: MutationPort | None = None, + # --- Hold-out validation params --- + early_stop_patience: int = 5, ): self._evaluator = evaluator self._proposer = proposer @@ -61,51 +81,208 @@ class EvolutionLoop: self._verbose = verbose self._circuit_breaker_threshold = circuit_breaker_threshold self._error_strategy = error_strategy + self._checkpoint_port = checkpoint_port + self._checkpoint_interval = checkpoint_interval + self._population_size = population_size + self._crossover_rate = crossover_rate + self._mutation_rate = mutation_rate + self._diversity_penalty = diversity_penalty + self._crossover_port = crossover_port + self._mutation_port = mutation_port + self._early_stop_patience = early_stop_patience async def run( self, seed_prompt: Prompt, synthetic_pool: list[SyntheticExample], task_description: str, + initial_state: OptimizationState | None = None, + validation_pool: list[SyntheticExample] | None = None, ) -> OptimizationState: - """Execute the complete evolution loop.""" - state = OptimizationState() + """Execute the complete evolution loop. + + If *initial_state* is provided (from a checkpoint), resume from that + point — skipping the seed evaluation and continuing at the saved iteration. + + If *validation_pool* is provided (non-empty), the best candidate is + evaluated on the hold-out set after each iteration and early stopping + is applied when validation score degrades for ``early_stop_patience`` + consecutive iterations. + """ + state = initial_state or OptimizationState() consecutive_failures = 0 - # Evaluate the seed - initial_batch = self._bootstrap.sample_minibatch( - synthetic_pool, self._minibatch_size - ) - initial_eval = await self._evaluator.evaluate( - seed_prompt, initial_batch, task_description - ) - state.total_llm_calls += 2 * self._minibatch_size # N executions + N judge calls + # Hold-out validation tracking + has_validation = bool(validation_pool) + best_validation_score: float = -1.0 + validation_patience_counter: int = 0 - best_candidate = Candidate( - prompt=seed_prompt, - best_score=initial_eval.total_score, - generation=0, - ) - state.best_candidate = best_candidate - state.candidates.append(best_candidate) - self._log(f"Initial score: {initial_eval.total_score:.2f}") + # Only evaluate the seed when starting fresh (no checkpoint resume) + if initial_state is None: + initial_batch = self._bootstrap.sample_minibatch( + synthetic_pool, self._minibatch_size + ) + initial_eval = await self._evaluator.evaluate( + seed_prompt, initial_batch, task_description + ) + state.total_llm_calls += 2 * self._minibatch_size # N executions + N judge calls + + seed_candidate = Candidate( + prompt=seed_prompt, + best_score=initial_eval.total_score, + generation=0, + ) + state.best_candidate = seed_candidate + state.candidates.append(seed_candidate) + logger.info( + "Initial evaluation complete", + extra={ + "structured": { + "event": "initial_eval", + "score": round(initial_eval.total_score, 4), + "minibatch_size": self._minibatch_size, + "sample_ids": [ex.id for ex in initial_batch], + }, + }, + ) + + # Evaluate seed on validation set + if has_validation and state.best_candidate is not None: + val_eval = await self._evaluator.evaluate( + state.best_candidate.prompt, validation_pool, task_description + ) + state.total_llm_calls += 2 * len(validation_pool) + best_validation_score = val_eval.mean_score + logger.info( + "Initial validation evaluation", + extra={ + "structured": { + "event": "validation_eval", + "iteration": 0, + "validation_score": round(best_validation_score, 4), + "validation_pool_size": len(validation_pool), + }, + }, + ) + + # Population initialization: seed the population with mutations + if self._population_size > 1: + await self._initialize_population( + state, seed_prompt, seed_candidate, task_description + ) + else: + logger.info( + "Resuming from checkpoint", + extra={ + "structured": { + "event": "resume", + "iteration": state.iteration, + "total_llm_calls": state.total_llm_calls, + }, + }, + ) + # Restore validation tracking from state history + if has_validation: + for entry in reversed(state.history): + if entry.get("event") == "validation_eval": + best_validation_score = entry.get("best_validation_score", -1.0) + validation_patience_counter = entry.get("validation_patience", 0) + break + + # Determine starting iteration + start_iteration = state.iteration + 1 # Main loop - for i in range(1, self._max_iterations + 1): + for i in range(start_iteration, self._max_iterations + 1): state.iteration = i try: - await self._run_iteration( - i, state, best_candidate, synthetic_pool, task_description - ) - # Update best_candidate from state after successful iteration - best_candidate = state.best_candidate # type: ignore[assignment] + if self._population_size > 1 and len(state.candidates) > 1: + await self._run_population_iteration( + i, state, synthetic_pool, task_description + ) + else: + await self._run_single_iteration( + i, state, synthetic_pool, task_description + ) consecutive_failures = 0 + # Hold-out validation: evaluate best candidate on validation set + if has_validation and state.best_candidate is not None: + val_eval = await self._evaluator.evaluate( + state.best_candidate.prompt, validation_pool, task_description + ) + state.total_llm_calls += 2 * len(validation_pool) + current_val_score = val_eval.mean_score + + if current_val_score > best_validation_score: + best_validation_score = current_val_score + validation_patience_counter = 0 + else: + validation_patience_counter += 1 + + state.history.append({ + "iteration": i, + "event": "validation_eval", + "validation_score": round(current_val_score, 4), + "best_validation_score": round(best_validation_score, 4), + "validation_patience": validation_patience_counter, + }) + + logger.info( + "Validation evaluation", + extra={ + "structured": { + "event": "validation_eval", + "iteration": i, + "validation_score": round(current_val_score, 4), + "best_validation_score": round(best_validation_score, 4), + "patience": f"{validation_patience_counter}/{self._early_stop_patience}", + }, + }, + ) + + if validation_patience_counter >= self._early_stop_patience: + logger.warning( + "Early stopping triggered — validation score did not improve for %d iterations", + self._early_stop_patience, + extra={ + "structured": { + "event": "early_stop", + "iteration": i, + "best_validation_score": round(best_validation_score, 4), + "patience": self._early_stop_patience, + }, + }, + ) + state.history.append({ + "iteration": i, + "event": "early_stop", + "best_validation_score": round(best_validation_score, 4), + "patience": self._early_stop_patience, + }) + state.best_validation_score = best_validation_score + state.early_stopped = True + if self._checkpoint_port is not None: + self._checkpoint_port.save(state) + break + + # Checkpoint on accepted improvement (detected via state change) + self._maybe_checkpoint(state) + except Exception as exc: consecutive_failures += 1 - self._log( - f"Iter {i}: ERROR ({consecutive_failures} consecutive) — {exc}" + logger.error( + "Iteration error", + extra={ + "structured": { + "event": "iteration_error", + "iteration": i, + "consecutive_failures": consecutive_failures, + "error": str(exc), + }, + }, + exc_info=True, ) state.history.append( { @@ -118,9 +295,16 @@ class EvolutionLoop: # Check circuit breaker if consecutive_failures >= self._circuit_breaker_threshold: - self._log( - f"Circuit breaker tripped after {consecutive_failures} " - f"consecutive failures." + logger.warning( + "Circuit breaker tripped", + extra={ + "structured": { + "event": "circuit_breaker", + "iteration": i, + "consecutive_failures": consecutive_failures, + "error_strategy": self._error_strategy, + }, + }, ) state.history.append( { @@ -134,7 +318,9 @@ class EvolutionLoop: f"Circuit breaker tripped after " f"{consecutive_failures} consecutive failures" ) from exc - # skip / retry strategies: stop the loop gracefully + # skip / retry strategies: save checkpoint, then stop the loop gracefully + if self._checkpoint_port is not None: + self._checkpoint_port.save(state) break if self._error_strategy == "abort": @@ -142,21 +328,77 @@ class EvolutionLoop: # skip / retry: continue to next iteration continue + # Store final validation metadata on state + if has_validation: + state.best_validation_score = best_validation_score + return state - async def _run_iteration( + # ------------------------------------------------------------------ + # Population initialization + # ------------------------------------------------------------------ + + async def _initialize_population( + self, + state: OptimizationState, + seed_prompt: Prompt, + seed_candidate: Candidate, + task_description: str, + ) -> None: + """Fill the population with mutated variants of the seed prompt.""" + n_needed = self._population_size - 1 + mutation_types = ["paraphrase", "constrain", "generalize", "specialize"] + + for idx in range(n_needed): + mutation_type = mutation_types[idx % len(mutation_types)] + + if self._mutation_port is not None: + new_prompt = await self._mutation_port.mutate( + seed_prompt, task_description, mutation_type + ) + else: + # Fallback: use proposer for reflective mutation + new_prompt = await self._proposer.propose( + seed_prompt, [], task_description + ) + state.total_llm_calls += 1 + new_candidate = Candidate( + prompt=new_prompt, + best_score=seed_candidate.best_score, # estimate until evaluated + generation=0, + parent_id=id(seed_candidate), + ) + state.candidates.append(new_candidate) + + logger.info( + "Population initialized", + extra={ + "structured": { + "event": "population_init", + "population_size": len(state.candidates), + }, + }, + ) + + # ------------------------------------------------------------------ + # Single-candidate iteration (original hill-climbing) + # ------------------------------------------------------------------ + + async def _run_single_iteration( self, i: int, state: OptimizationState, - best_candidate: Candidate, synthetic_pool: list[SyntheticExample], task_description: str, ) -> None: - """Execute a single iteration. Mutates *state* in-place.""" + """Execute a single-candidate iteration. Mutates *state* in-place.""" + best_candidate = state.best_candidate # type: ignore[assignment] + # 1. Sample a fresh minibatch batch = self._bootstrap.sample_minibatch( synthetic_pool, self._minibatch_size ) + sample_ids = [ex.id for ex in batch] # 2. Evaluate the current candidate current_eval = await self._evaluator.evaluate( @@ -164,9 +406,31 @@ class EvolutionLoop: ) state.total_llm_calls += 2 * self._minibatch_size + logger.debug( + "Iteration minibatch evaluated", + extra={ + "structured": { + "event": "minibatch_eval", + "iteration": i, + "sample_ids": sample_ids, + "scores": [round(s, 4) for s in current_eval.scores], + "total_score": round(current_eval.total_score, 4), + }, + }, + ) + # 3. Skip if perfect if all(s >= self._perfect_score for s in current_eval.scores): - self._log(f"Iter {i}: All scores perfect, skipping.") + logger.info( + "Iteration skipped — all scores perfect", + extra={ + "structured": { + "event": "skip_perfect", + "iteration": i, + "total_score": round(current_eval.total_score, 4), + }, + }, + ) state.history.append( { "iteration": i, @@ -177,12 +441,26 @@ class EvolutionLoop: return # 4. Propose a new prompt (reflective mutation) — sequential + state.total_llm_calls += 1 new_prompt = await self._proposer.propose( best_candidate.prompt, current_eval.trajectories, task_description, ) - state.total_llm_calls += 1 # 1 proposition call + + prompt_diff = self._compute_prompt_diff( + best_candidate.prompt.text, new_prompt.text + ) + logger.debug( + "Proposed new prompt", + extra={ + "structured": { + "event": "proposer_output", + "iteration": i, + "prompt_diff": prompt_diff, + }, + }, + ) # 5. Evaluate the new prompt on the same minibatch new_eval = await self._evaluator.evaluate( @@ -200,9 +478,22 @@ class EvolutionLoop: ) state.best_candidate = new_candidate state.candidates.append(new_candidate) - self._log( - f"Iter {i}: ACCEPTED " - f"({current_eval.total_score:.2f} -> {new_eval.total_score:.2f})" + logger.info( + "Iteration accepted", + extra={ + "structured": { + "event": "accepted", + "iteration": i, + "old_score": round(current_eval.total_score, 4), + "new_score": round(new_eval.total_score, 4), + "improvement": round( + new_eval.total_score - current_eval.total_score, 4 + ), + "sample_ids": sample_ids, + "new_scores": [round(s, 4) for s in new_eval.scores], + "prompt_diff": prompt_diff, + }, + }, ) state.history.append( { @@ -215,9 +506,19 @@ class EvolutionLoop: } ) else: - self._log( - f"Iter {i}: REJECTED " - f"({new_eval.total_score:.2f} <= {current_eval.total_score:.2f})" + logger.info( + "Iteration rejected", + extra={ + "structured": { + "event": "rejected", + "iteration": i, + "old_score": round(current_eval.total_score, 4), + "new_score": round(new_eval.total_score, 4), + "sample_ids": sample_ids, + "new_scores": [round(s, 4) for s in new_eval.scores], + "prompt_diff": prompt_diff, + }, + }, ) state.history.append( { @@ -228,6 +529,213 @@ class EvolutionLoop: } ) - def _log(self, msg: str) -> None: - if self._verbose: - logger.info("[PROMETHEUS] %s", msg) + # ------------------------------------------------------------------ + # Population-based iteration + # ------------------------------------------------------------------ + + async def _run_population_iteration( + self, + i: int, + state: OptimizationState, + synthetic_pool: list[SyntheticExample], + task_description: str, + ) -> None: + """Execute a population-based iteration. Mutates *state* in-place.""" + population = state.candidates + + # 1. Sample a fresh minibatch + batch = self._bootstrap.sample_minibatch( + synthetic_pool, self._minibatch_size + ) + sample_ids = [ex.id for ex in batch] + + # 2. Select two parents via tournament selection + parent_a = self._tournament_select(population) + parent_b = self._tournament_select(population) + + # 3. Generate child: crossover or reflective mutation + use_crossover = ( + random.random() < self._crossover_rate + and self._crossover_port is not None + ) + if use_crossover: + state.total_llm_calls += 1 + child_prompt = await self._crossover_port.crossover( + parent_a.prompt, parent_b.prompt, task_description + ) + origin = "crossover" + else: + # Reflective mutation: evaluate a parent, propose improvement + state.total_llm_calls += 2 * self._minibatch_size + parent_eval = await self._evaluator.evaluate( + parent_a.prompt, batch, task_description + ) + state.total_llm_calls += 1 + child_prompt = await self._proposer.propose( + parent_a.prompt, + parent_eval.trajectories, + task_description, + ) + origin = "reflective" + + # 4. Optional mutation + if random.random() < self._mutation_rate and self._mutation_port is not None: + mutation_type = random.choice( + ["paraphrase", "constrain", "generalize", "specialize"] + ) + state.total_llm_calls += 1 + child_prompt = await self._mutation_port.mutate( + child_prompt, task_description, mutation_type + ) + origin += f"+mutation({mutation_type})" + + # 5. Evaluate the child + state.total_llm_calls += 2 * self._minibatch_size + child_eval = await self._evaluator.evaluate( + child_prompt, batch, task_description + ) + + # 6. Compute fitness with diversity penalty + child_score = child_eval.total_score + diversity_sim = self._compute_diversity_score( + child_prompt, population + ) + child_fitness = child_score - self._diversity_penalty * (1.0 - diversity_sim) + + # 7. Find worst candidate and replace if child is better + worst_idx = min( + range(len(population)), + key=lambda idx: population[idx].best_score, + ) + worst_fitness = ( + population[worst_idx].best_score + - self._diversity_penalty * (1.0 - self._compute_diversity_score( + population[worst_idx].prompt, population + )) + ) + + accepted = child_fitness > worst_fitness + + if accepted: + new_candidate = Candidate( + prompt=child_prompt, + best_score=child_score, + generation=i, + parent_id=id(parent_a), + ) + population[worst_idx] = new_candidate + + # Update best if this child is the new best + if child_score > (state.best_candidate.best_score if state.best_candidate else 0): + state.best_candidate = new_candidate + + logger.info( + "Population iteration accepted", + extra={ + "structured": { + "event": "pop_accepted", + "iteration": i, + "origin": origin, + "child_score": round(child_score, 4), + "child_fitness": round(child_fitness, 4), + "diversity_sim": round(diversity_sim, 4), + "replaced_idx": worst_idx, + "sample_ids": sample_ids, + }, + }, + ) + state.history.append( + { + "iteration": i, + "event": "pop_accepted", + "origin": origin, + "child_score": child_score, + "child_fitness": child_fitness, + "diversity_sim": diversity_sim, + } + ) + else: + logger.info( + "Population iteration rejected", + extra={ + "structured": { + "event": "pop_rejected", + "iteration": i, + "origin": origin, + "child_score": round(child_score, 4), + "child_fitness": round(child_fitness, 4), + "worst_fitness": round(worst_fitness, 4), + "sample_ids": sample_ids, + }, + }, + ) + state.history.append( + { + "iteration": i, + "event": "pop_rejected", + "origin": origin, + "child_score": child_score, + "child_fitness": child_fitness, + "worst_fitness": worst_fitness, + } + ) + + # ------------------------------------------------------------------ + # Selection and diversity helpers + # ------------------------------------------------------------------ + + def _tournament_select( + self, + population: list[Candidate], + tournament_size: int = 3, + ) -> Candidate: + """Tournament selection: pick the best from a random subset.""" + k = min(tournament_size, len(population)) + contestants = random.sample(population, k) + return max(contestants, key=lambda c: c.best_score) + + @staticmethod + def _compute_diversity_score( + prompt: Prompt, + population: list[Candidate], + ) -> float: + """Compute the average Jaccard similarity between *prompt* and all + population members. Returns 1.0 when population has only one member + (no diversity penalty).""" + if len(population) <= 1: + return 1.0 + + prompt_words = set(prompt.text.lower().split()) + if not prompt_words: + return 0.0 + + similarities: list[float] = [] + for candidate in population: + other_words = set(candidate.prompt.text.lower().split()) + if not other_words: + continue + intersection = prompt_words & other_words + union = prompt_words | other_words + sim = len(intersection) / len(union) if union else 0.0 + similarities.append(sim) + + # Average similarity (lower = more diverse) + return sum(similarities) / len(similarities) if similarities else 0.0 + + @staticmethod + def _compute_prompt_diff(old: str, new: str) -> dict[str, int]: + """Compute a simple diff summary between two prompts.""" + old_lines = set(old.splitlines()) + new_lines = set(new.splitlines()) + return { + "lines_added": len(new_lines - old_lines), + "lines_removed": len(old_lines - new_lines), + "chars_delta": len(new) - len(old), + } + + def _maybe_checkpoint(self, state: OptimizationState) -> None: + """Save a checkpoint if the interval is met or on accepted improvements.""" + if self._checkpoint_port is None: + return + if state.iteration % self._checkpoint_interval == 0: + self._checkpoint_port.save(state) diff --git a/src/prometheus/application/ground_truth_evaluator.py b/src/prometheus/application/ground_truth_evaluator.py new file mode 100644 index 0000000..3d3dbc0 --- /dev/null +++ b/src/prometheus/application/ground_truth_evaluator.py @@ -0,0 +1,116 @@ +""" +Ground-truth evaluator — execution + similarity comparison. + +Produces a quality signal *with* ground truth by comparing model outputs +against expected outputs using a configurable similarity metric. +""" +from __future__ import annotations + +import asyncio +import logging + +from prometheus.domain.entities import ( + EvalResult, + GroundTruthExample, + Prompt, + Trajectory, +) +from prometheus.domain.ports import LLMPort, SimilarityPort + +logger = logging.getLogger(__name__) + + +class GroundTruthEvaluator: + """Evaluates a prompt against a ground-truth dataset. + + Pipeline: execute → compare with similarity metric → build trajectories. + Unlike PromptEvaluator (which uses LLM-as-Judge), this compares outputs + directly against known-good expected outputs. + """ + + def __init__( + self, + executor: LLMPort, + similarity: SimilarityPort, + max_concurrency: int = 5, + ): + self._executor = executor + self._similarity = similarity + self._semaphore = asyncio.Semaphore(max_concurrency) + + async def evaluate( + self, + prompt: Prompt, + dataset: list[GroundTruthExample], + ) -> EvalResult: + """Evaluate the prompt on the ground-truth dataset. + + Steps: + 1. Execute the prompt on each input (parallel, bounded) + 2. Compare each output against expected using similarity metric + 3. Build trajectories with feedback + """ + # Step 1: Parallel execution (per-item isolation) + output_coros = [ + self._execute_single(prompt, example) for example in dataset + ] + outputs = await asyncio.gather(*output_coros) + + # Step 2: Compute similarity scores + scores: list[float] = [] + feedbacks: list[str] = [] + trajectories: list[Trajectory] = [] + + for example, output in zip(dataset, outputs): + score = self._similarity.compute(output, example.expected_output) + score = max(0.0, min(1.0, score)) # normalize to [0, 1] + scores.append(score) + feedback = self._build_feedback(output, example.expected_output, score) + feedbacks.append(feedback) + trajectories.append( + Trajectory( + input_text=example.input_text, + output_text=output, + score=score, + feedback=feedback, + prompt_used=prompt.text, + ) + ) + + logger.info( + "Ground-truth evaluation complete: %d items, mean_score=%.4f", + len(dataset), + sum(scores) / len(scores) if scores else 0.0, + ) + + return EvalResult( + scores=scores, + feedbacks=feedbacks, + trajectories=trajectories, + ) + + async def _execute_single( + self, prompt: Prompt, example: GroundTruthExample + ) -> str: + async with self._semaphore: + try: + return await self._executor.execute(prompt, example.input_text) + except Exception as exc: + logger.warning( + "Execution failed for input '%s…': %s", + example.input_text[:40], + exc, + ) + return f"[execution error: {exc}]" + + @staticmethod + def _build_feedback(output: str, expected: str, score: float) -> str: + """Build human-readable feedback for a ground-truth comparison.""" + if score >= 0.99: + return "Exact match." + elif score >= 0.7: + return f"Close match (score={score:.2f}). Expected: {expected[:100]}" + elif score >= 0.3: + return f"Partial match (score={score:.2f}). Expected: {expected[:100]}" + else: + return f"Poor match (score={score:.2f}). Expected: {expected[:100]}" diff --git a/src/prometheus/application/use_cases.py b/src/prometheus/application/use_cases.py index 999b132..a6e200f 100644 --- a/src/prometheus/application/use_cases.py +++ b/src/prometheus/application/use_cases.py @@ -10,8 +10,16 @@ from prometheus.application.bootstrap import SyntheticBootstrap from prometheus.application.dto import OptimizationConfig, OptimizationResult from prometheus.application.evaluator import PromptEvaluator from prometheus.application.evolution import EvolutionLoop +from prometheus.cli.logging_setup import get_logger from prometheus.domain.entities import Prompt -from prometheus.domain.ports import ProposerPort +from prometheus.domain.ports import ( + CheckpointPort, + CrossoverPort, + MutationPort, + ProposerPort, +) + +logger = get_logger("use_cases") class OptimizePromptUseCase: @@ -25,24 +33,60 @@ class OptimizePromptUseCase: evaluator: PromptEvaluator, proposer: ProposerPort, bootstrap: SyntheticBootstrap, + checkpoint_port: CheckpointPort | None = None, + crossover_port: CrossoverPort | None = None, + mutation_port: MutationPort | None = None, ): self._evaluator = evaluator self._proposer = proposer self._bootstrap = bootstrap + self._checkpoint_port = checkpoint_port + self._crossover_port = crossover_port + self._mutation_port = mutation_port async def execute(self, config: OptimizationConfig) -> OptimizationResult: """Full pipeline: 1. Bootstrap → generate synthetic inputs - 2. Evolution → optimization loop + 2. Evolution → optimization loop (with optional checkpoint resume) 3. Return result """ - # Phase 0: Bootstrap - synthetic_pool = self._bootstrap.run( - task_description=config.task_description, - n_examples=config.n_synthetic_inputs, - ) + # Phase 0: Bootstrap (skip synthetic generation on resume if pool was saved) + initial_state = None + if config.resume and self._checkpoint_port is not None: + initial_state = self._checkpoint_port.load() + if initial_state is not None and initial_state.synthetic_pool: + synthetic_pool = initial_state.synthetic_pool + logger.info( + "Resumed checkpoint includes %d synthetic inputs — skipping bootstrap", + extra={"structured": {"event": "resume_skip_bootstrap", "pool_size": len(synthetic_pool)}}, + ) + else: + synthetic_pool = self._bootstrap.run( + task_description=config.task_description, + n_examples=config.n_synthetic_inputs, + ) + else: + synthetic_pool = self._bootstrap.run( + task_description=config.task_description, + n_examples=config.n_synthetic_inputs, + ) - # Phase 1: Evolution + # Split into train / validation if configured + validation_pool: list = [] + if config.validation_split > 0: + synthetic_pool, validation_pool = SyntheticBootstrap.split_pool( + synthetic_pool, config.validation_split, + ) + logger.info( + "Split synthetic pool: %d train, %d validation (%.0f%% hold-out)", + len(synthetic_pool), len(validation_pool), + config.validation_split * 100, + extra={"structured": { + "event": "pool_split", + "train_size": len(synthetic_pool), + "val_size": len(validation_pool), + }}, + ) loop = EvolutionLoop( evaluator=self._evaluator, proposer=self._proposer, @@ -53,9 +97,22 @@ class OptimizePromptUseCase: verbose=config.verbose, circuit_breaker_threshold=config.circuit_breaker_threshold, error_strategy=config.error_strategy, + checkpoint_port=self._checkpoint_port, + checkpoint_interval=config.checkpoint_interval, + population_size=config.population_size, + crossover_rate=config.crossover_rate, + mutation_rate=config.mutation_rate, + diversity_penalty=config.diversity_penalty, + crossover_port=self._crossover_port, + mutation_port=self._mutation_port, + early_stop_patience=config.early_stop_patience, ) seed_prompt = Prompt(text=config.seed_prompt) - state = await loop.run(seed_prompt, synthetic_pool, config.task_description) + state = await loop.run( + seed_prompt, synthetic_pool, config.task_description, + initial_state=initial_state, + validation_pool=validation_pool or None, + ) # Phase 2: Result initial_score = ( @@ -71,9 +128,12 @@ class OptimizePromptUseCase: ), initial_prompt=config.seed_prompt, iterations_used=state.iteration, - total_llm_calls=state.total_llm_calls + 1, # +1 for bootstrap + total_llm_calls=state.total_llm_calls + 1, # +1 for bootstrap synthesis call initial_score=initial_score, final_score=final_score, improvement=final_score - initial_score, history=state.history, + final_validation_score=state.best_validation_score, + best_validation_score=state.best_validation_score, + early_stopped=state.early_stopped, ) diff --git a/src/prometheus/cli/app.py b/src/prometheus/cli/app.py index 776f70e..611cbaf 100644 --- a/src/prometheus/cli/app.py +++ b/src/prometheus/cli/app.py @@ -1,31 +1,13 @@ """ CLI — user entry point. -Typer interface with -i (input) and -o (output) options. +Registers all subcommands and delegates to cli/commands/. """ from __future__ import annotations -import asyncio -import logging -import os -from dataclasses import asdict - -import dspy import typer -from pydantic import ValidationError -from rich.console import Console -from rich.panel import Panel -from rich.table import Table -from prometheus.application.bootstrap import SyntheticBootstrap -from prometheus.application.dto import OptimizationConfig, OptimizationResult -from prometheus.application.evaluator import PromptEvaluator -from prometheus.application.use_cases import OptimizePromptUseCase -from prometheus.infrastructure.file_io import YamlPersistence -from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter -from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter -from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter -from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter +from prometheus.cli.commands import init, list_runs, optimize, version app = typer.Typer( name="prometheus", @@ -33,205 +15,12 @@ app = typer.Typer( no_args_is_help=True, ) -console = Console() - - -@app.command() -def optimize( - input: str = typer.Option( - ..., - "-i", - "--input", - help="Path to input YAML config file.", - exists=True, - readable=True, - ), - output: str = typer.Option( - "output.yaml", - "-o", - "--output", - help="Path to output YAML result file.", - ), - verbose: bool = typer.Option( - False, - "-v", - "--verbose", - help="Print detailed progress.", - ), - max_retries: int = typer.Option( - 3, - "--max-retries", - help="Max retry attempts for transient LLM errors (429, timeout, 5xx).", - ), - error_strategy: str = typer.Option( - "retry", - "--error-strategy", - help="How to handle errors: skip | retry | abort.", - ), - max_concurrency: int = typer.Option( - 5, - "--max-concurrency", - help="Max parallel LLM calls for minibatch execution and judging.", - ), -) -> None: - """Optimize a prompt without any reference data. - - Usage: - prometheus optimize -i config.yaml -o result.yaml - """ - asyncio.run( - _async_optimize(input, output, verbose, max_retries, error_strategy, max_concurrency) - ) - - -async def _async_optimize( - input: str, - output: str, - verbose: bool, - max_retries: int, - error_strategy: str, - max_concurrency: int, -) -> None: - # Configure verbose logging - if verbose: - logging.basicConfig(level=logging.INFO, format="[PROMETHEUS] %(message)s") - - console.print( - Panel.fit( - "PROMETHEUS — Prompt Evolution Engine", - subtitle="No reference data required", - ) - ) - - # 1. Load & validate config - persistence = YamlPersistence() - raw_config = persistence.read_config(input) - - # CLI flags override config file values - raw_config.setdefault("max_retries", max_retries) - raw_config.setdefault("error_strategy", error_strategy) - raw_config.setdefault("max_concurrency", max_concurrency) - raw_config["output_path"] = output - raw_config["verbose"] = verbose - - try: - config = OptimizationConfig.model_validate(raw_config) - except ValidationError as exc: - console.print("[bold red]Configuration error:[/bold red]\n") - for err in exc.errors(): - loc = " → ".join(str(l) for l in err["loc"]) - console.print(f" [red]• {loc}: {err['msg']}[/red]") - raise typer.Exit(code=1) from exc - console.print(f"[dim]Task: {config.task_description[:80]}...[/dim]") - console.print(f"[dim]Seed prompt: {config.seed_prompt[:80]}...[/dim]") - - # 2. Create per-model DSPy LM instances - def _model_lm_kwargs( - model_api_base: str | None, - model_api_key_env: str | None, - ) -> dict: - """Build kwargs for dspy.LM, using per-model overrides with global fallback.""" - kwargs: dict = {} - api_base = model_api_base or config.api_base - api_key_env = model_api_key_env or config.api_key_env - if api_base: - kwargs["api_base"] = api_base - if api_key_env: - kwargs["api_key"] = os.environ.get(api_key_env, "") - return kwargs - - task_lm = dspy.LM( - config.task_model, - **_model_lm_kwargs(config.task_api_base, config.task_api_key_env), - ) - judge_lm = dspy.LM( - config.judge_model, - **_model_lm_kwargs(config.judge_api_base, config.judge_api_key_env), - ) - proposer_lm = dspy.LM( - config.proposer_model, - **_model_lm_kwargs(config.proposer_api_base, config.proposer_api_key_env), - ) - synth_lm = dspy.LM( - config.synth_model, - **_model_lm_kwargs(config.synth_api_base, config.synth_api_key_env), - ) - - # 3. Build adapters (Dependency Injection — each gets its own LM + retry config) - synth_adapter = DSPySyntheticAdapter(lm=synth_lm) - llm_adapter = DSPyLLMAdapter( - lm=task_lm, - max_retries=config.max_retries, - retry_delay_base=config.retry_delay_base, - ) - judge_adapter = DSPyJudgeAdapter( - lm=judge_lm, - max_retries=config.max_retries, - retry_delay_base=config.retry_delay_base, - max_concurrency=config.max_concurrency, - ) - proposer_adapter = DSPyProposerAdapter( - lm=proposer_lm, - max_retries=config.max_retries, - retry_delay_base=config.retry_delay_base, - ) - bootstrap = SyntheticBootstrap(generator=synth_adapter, seed=config.seed) - evaluator = PromptEvaluator( - executor=llm_adapter, - judge=judge_adapter, - max_concurrency=config.max_concurrency, - ) - use_case = OptimizePromptUseCase( - evaluator=evaluator, - proposer=proposer_adapter, - bootstrap=bootstrap, - ) - - # 4. Execute - with console.status("[bold green]Evolving prompt..."): - result = await use_case.execute(config) - - # 5. Display results - _display_result(result) - - # 6. Save - _save_result(persistence, output, result) - console.print(f"\n[green]Results saved to {output}[/green]") - - -def _display_result(result: OptimizationResult) -> None: - """Display a Rich summary in the terminal.""" - console.print() - console.print( - Panel( - f"[bold green]Optimized Prompt[/bold green]\n\n{result.optimized_prompt}", - title="Result", - ) - ) - table = Table(title="Metrics") - table.add_column("Metric", style="cyan") - table.add_column("Value", style="bold") - table.add_row("Initial Score", f"{result.initial_score:.2f}") - table.add_row("Final Score", f"{result.final_score:.2f}") - table.add_row("Improvement", f"{result.improvement:+.2f}") - table.add_row("Iterations", str(result.iterations_used)) - table.add_row("LLM Calls", str(result.total_llm_calls)) - console.print(table) - - -def _save_result( - persistence: YamlPersistence, - path: str, - result: OptimizationResult, -) -> None: - """Save the result as YAML.""" - persistence.write_result(path, asdict(result)) - - -@app.command(hidden=True) -def _help() -> None: - """Internal placeholder to force multi-command Typer behavior.""" - pass +# Register all subcommands — having multiple commands fixes the +# Typer 0.24+ bug where a single-command app absorbs the subcommand. +optimize.register(app) +version.register(app) +init.register(app) +list_runs.register(app) if __name__ == "__main__": diff --git a/src/prometheus/cli/commands/__init__.py b/src/prometheus/cli/commands/__init__.py new file mode 100644 index 0000000..9937ab9 --- /dev/null +++ b/src/prometheus/cli/commands/__init__.py @@ -0,0 +1 @@ +"""CLI command modules.""" diff --git a/src/prometheus/cli/commands/init.py b/src/prometheus/cli/commands/init.py new file mode 100644 index 0000000..387cbc0 --- /dev/null +++ b/src/prometheus/cli/commands/init.py @@ -0,0 +1,97 @@ +"""prometheus init — scaffold a config YAML interactively.""" +from __future__ import annotations + +from pathlib import Path + +import typer +import yaml +from rich.console import Console + +console = Console() + +_TEMPLATE = """\ +# PROMETHEUS configuration +# Generated by `prometheus init` + +# --- Required --- +seed_prompt: {seed_prompt} +task_description: {task_description} + +# --- Models --- +task_model: {task_model} +judge_model: {judge_model} +proposer_model: {proposer_model} +synth_model: {synth_model} + +# --- Global API settings (optional) --- +# api_base: https://api.openai.com/v1 +# api_key_env: OPENAI_API_KEY + +# --- Evolution parameters --- +max_iterations: 30 +n_synthetic_inputs: 20 +minibatch_size: 5 +perfect_score: 1.0 +seed: 42 + +# --- Concurrency --- +max_concurrency: 5 + +# --- Error handling --- +max_retries: 3 +retry_delay_base: 1.0 +circuit_breaker_threshold: 5 +error_strategy: retry +""" + + +def register(app: typer.Typer) -> None: + """Register the init command on the Typer app.""" + + @app.command() + def init( + output: str = typer.Option( + "config.yaml", + "-o", + "--output", + help="Path for the generated config file.", + ), + ) -> None: + """Interactively scaffold a PROMETHEUS config YAML. + + Prompts for required fields and writes a ready-to-edit config file. + """ + target = Path(output) + + if target.exists() and not typer.confirm( + f"{output} already exists. Overwrite?", default=False + ): + raise typer.Exit(code=0) + + seed_prompt: str = typer.prompt("Seed prompt") + task_description: str = typer.prompt("Task description") + task_model: str = typer.prompt("Task model", default="openai/gpt-4o-mini") + judge_model: str = typer.prompt("Judge model", default="openai/gpt-4o") + proposer_model: str = typer.prompt("Proposer model", default="openai/gpt-4o") + synth_model: str = typer.prompt("Synth model", default="openai/gpt-4o") + + content = _TEMPLATE.format( + seed_prompt=_yaml_string(seed_prompt), + task_description=_yaml_string(task_description), + task_model=task_model, + judge_model=judge_model, + proposer_model=proposer_model, + synth_model=synth_model, + ) + + target.write_text(content, encoding="utf-8") + console.print(f"[green]Config written to {output}[/green]") + console.print("[dim]Edit it as needed, then run: prometheus optimize -i config.yaml[/dim]") + + +def _yaml_string(value: str) -> str: + """Quote a string for YAML if it contains special characters.""" + if any(ch in value for ch in (":", "#", "'", '"', "\n", "{", "}", "[", "]", ",")): + escaped = value.replace("'", "''") + return f"'{escaped}'" + return value diff --git a/src/prometheus/cli/commands/list_runs.py b/src/prometheus/cli/commands/list_runs.py new file mode 100644 index 0000000..e897647 --- /dev/null +++ b/src/prometheus/cli/commands/list_runs.py @@ -0,0 +1,101 @@ +"""prometheus list — list past optimization runs.""" +from __future__ import annotations + +import glob as globmod +from pathlib import Path + +import typer +import yaml +from rich.console import Console +from rich.table import Table + +console = Console() + +_DEFAULT_PATTERNS = ("output.yaml", "results/*.yaml", "*.result.yaml") + + +def register(app: typer.Typer) -> None: + """Register the list command on the Typer app.""" + + @app.command("list") + def list_runs( + directory: str = typer.Option( + ".", + "-d", + "--directory", + help="Directory to scan for result YAML files.", + ), + ) -> None: + """List past optimization runs found in result YAML files. + + Scans the given directory for YAML files that look like PROMETHEUS + output (they contain 'optimized_prompt' and 'final_score' keys) and + displays a summary table. + """ + base = Path(directory) + if not base.is_dir(): + console.print(f"[red]Directory not found: {directory}[/red]") + raise typer.Exit(code=1) + + runs: list[dict] = [] + + for pattern in _DEFAULT_PATTERNS: + for path_str in globmod.glob(str(base / pattern), recursive=False): + _try_read_run(path_str, runs) + + # Also try nested directories one level deep + for path_str in globmod.glob(str(base / "**/*.yaml"), recursive=True): + _try_read_run(path_str, runs) + + if not runs: + console.print("[dim]No optimization runs found.[/dim]") + raise typer.Exit(code=0) + + # Deduplicate by path + seen: set[str] = set() + unique_runs: list[dict] = [] + for run in runs: + if run["path"] not in seen: + seen.add(run["path"]) + unique_runs.append(run) + + table = Table(title="PROMETHEUS Runs") + table.add_column("File", style="cyan") + table.add_column("Initial", justify="right") + table.add_column("Final", justify="right", style="green") + table.add_column("Delta", justify="right") + table.add_column("Iters", justify="right") + table.add_column("Prompt (first 60 chars)", style="dim") + + for run in sorted(unique_runs, key=lambda r: r["path"]): + table.add_row( + run["path"], + f"{run['initial_score']:.2f}", + f"{run['final_score']:.2f}", + f"{run['improvement']:+.2f}", + str(run["iterations"]), + run["prompt_preview"], + ) + + console.print(table) + + +def _try_read_run(path_str: str, runs: list[dict]) -> None: + """Try to parse a YAML file as a PROMETHEUS result and append metadata.""" + try: + with open(path_str, encoding="utf-8") as f: + data = yaml.safe_load(f) + if not isinstance(data, dict): + return + if "optimized_prompt" not in data or "final_score" not in data: + return + runs.append({ + "path": path_str, + "initial_score": float(data.get("initial_score", 0.0)), + "final_score": float(data.get("final_score", 0.0)), + "improvement": float(data.get("improvement", 0.0)), + "iterations": int(data.get("iterations_used", 0)), + "prompt_preview": str(data.get("optimized_prompt", ""))[:60], + }) + except (OSError, yaml.YAMLError, ValueError, TypeError): + pass diff --git a/src/prometheus/cli/commands/optimize.py b/src/prometheus/cli/commands/optimize.py index 8c39a0e..43842a7 100644 --- a/src/prometheus/cli/commands/optimize.py +++ b/src/prometheus/cli/commands/optimize.py @@ -27,8 +27,6 @@ from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter from prometheus.infrastructure.crossover_adapter import DSPyCrossoverAdapter from prometheus.infrastructure.mutation_adapter import DSPyMutationAdapter -from prometheus.infrastructure.crossover_adapter import DSPyCrossoverAdapter -from prometheus.infrastructure.mutation_adapter import DSPyMutationAdapter from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter from prometheus.infrastructure.similarity import create_similarity_adapter from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter @@ -337,6 +335,29 @@ async def _async_optimize( with console.status("[bold green]Evolving prompt..."): result = await use_case.execute(config) + # 4b. Compute actual LLM call count from adapter counters + actual_llm_calls = ( + llm_adapter.call_count + + judge_adapter.call_count + + proposer_adapter.call_count + + synth_adapter.call_count + + (crossover_adapter.call_count if crossover_adapter else 0) + + (mutation_adapter.call_count if mutation_adapter else 0) + ) + result = OptimizationResult( + optimized_prompt=result.optimized_prompt, + initial_prompt=result.initial_prompt, + iterations_used=result.iterations_used, + total_llm_calls=actual_llm_calls, + initial_score=result.initial_score, + final_score=result.final_score, + improvement=result.improvement, + history=result.history, + final_validation_score=result.final_validation_score, + best_validation_score=result.best_validation_score, + early_stopped=result.early_stopped, + ) + # 5. Display results _display_result(result) diff --git a/src/prometheus/cli/commands/version.py b/src/prometheus/cli/commands/version.py new file mode 100644 index 0000000..f206427 --- /dev/null +++ b/src/prometheus/cli/commands/version.py @@ -0,0 +1,18 @@ +"""prometheus version — print the current version.""" +from __future__ import annotations + +import typer +from rich.console import Console + +from prometheus import __version__ + +console = Console() + + +def register(app: typer.Typer) -> None: + """Register the version command on the Typer app.""" + + @app.command() + def version() -> None: + """Print the PROMETHEUS version.""" + console.print(f"PROMETHEUS {__version__}") diff --git a/src/prometheus/cli/logging_setup.py b/src/prometheus/cli/logging_setup.py new file mode 100644 index 0000000..b4442f8 --- /dev/null +++ b/src/prometheus/cli/logging_setup.py @@ -0,0 +1,96 @@ +""" +Structured logging configuration for PROMETHEUS. + +Supports text (human-readable) and JSON (machine-parseable) output, +configurable log levels, and optional file output. + +Fixes Bug #4: verbose mode now reliably produces output by configuring +handlers explicitly instead of relying on ``logging.basicConfig``. +""" +from __future__ import annotations + +import json +import logging +import sys +from datetime import datetime, timezone +from typing import TextIO + + +_PROMETHEUS_LOGGER = "prometheus" + + +class _JsonFormatter(logging.Formatter): + """Emit one JSON object per log line.""" + + def format(self, record: logging.LogRecord) -> str: + message = record.getMessage() + payload: dict = { + "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(), + "level": record.levelname, + "logger": record.name, + "message": message, + } + # Merge any extra structured fields the caller attached. + if hasattr(record, "structured"): + payload["structured"] = record.structured # type: ignore[attr-defined] + if record.exc_info and record.exc_info[1] is not None: + payload["exception"] = self.formatException(record.exc_info) + return json.dumps(payload, default=str) + + +class _TextFormatter(logging.Formatter): + """Human-readable format with structured extras appended.""" + + def format(self, record: logging.LogRecord) -> str: + base = super().format(record) + if hasattr(record, "structured") and record.structured: + extras = " ".join(f"{k}={v}" for k, v in record.structured.items()) + base = f"{base} {extras}" + return base + + +def configure_logging( + *, + level: int = logging.WARNING, + log_format: str = "text", + log_file: str | None = None, +) -> None: + """Configure the prometheus root logger. + + Args: + level: Logging level (e.g. logging.DEBUG, logging.INFO). + log_format: ``"text"`` for human-readable or ``"json"`` for + machine-parseable output. + log_file: Optional path to also write logs to a file. + """ + prom_logger = logging.getLogger(_PROMETHEUS_LOGGER) + prom_logger.setLevel(level) + # Remove any stale handlers so re-configuration is idempotent. + prom_logger.handlers.clear() + + if log_format == "json": + fmt = _JsonFormatter() + else: + fmt = _TextFormatter( + fmt="%(asctime)s [%(levelname)s] %(name)s: %(message)s", + datefmt="%H:%M:%S", + ) + + # Console handler (stderr so it doesn't mix with Rich stdout) + console_handler = logging.StreamHandler(sys.stderr) + console_handler.setFormatter(fmt) + prom_logger.addHandler(console_handler) + + # Optional file handler + if log_file: + file_handler = logging.FileHandler(log_file) + file_handler.setFormatter(fmt) + prom_logger.addHandler(file_handler) + + # Prevent propagation to root logger to avoid duplicate output + prom_logger.propagate = False + + +def get_logger(name: str) -> logging.Logger: + """Return a child logger under the prometheus namespace.""" + return logging.getLogger(f"{_PROMETHEUS_LOGGER}.{name}") diff --git a/src/prometheus/domain/entities.py b/src/prometheus/domain/entities.py index 40644ff..6d45cc9 100644 --- a/src/prometheus/domain/entities.py +++ b/src/prometheus/domain/entities.py @@ -31,6 +31,15 @@ class SyntheticExample: id: int = 0 +@dataclass(frozen=True) +class GroundTruthExample: + """A ground-truth evaluation example with a known-good expected output.""" + + input_text: str + expected_output: str + id: int = 0 + + @dataclass class Trajectory: """Execution trace of a prompt on an input. @@ -85,3 +94,6 @@ class OptimizationState: synthetic_pool: list[SyntheticExample] = field(default_factory=list) history: list[dict[str, Any]] = field(default_factory=list) total_llm_calls: int = 0 + # Hold-out validation + best_validation_score: float | None = None + early_stopped: bool = False diff --git a/src/prometheus/domain/ports.py b/src/prometheus/domain/ports.py index 88ed554..723a637 100644 --- a/src/prometheus/domain/ports.py +++ b/src/prometheus/domain/ports.py @@ -8,7 +8,14 @@ from abc import ABC, abstractmethod from typing import Any -from prometheus.domain.entities import Prompt, SyntheticExample, Trajectory +from prometheus.domain.entities import ( + Candidate, + GroundTruthExample, + OptimizationState, + Prompt, + SyntheticExample, + Trajectory, +) class LLMPort(ABC): @@ -73,6 +80,34 @@ class SyntheticGeneratorPort(ABC): ... +class CrossoverPort(ABC): + """Port for crossover — combining instructions from two parent candidates.""" + + @abstractmethod + async def crossover( + self, + parent_a: Prompt, + parent_b: Prompt, + task_description: str, + ) -> Prompt: + """Combine instructions from two parents into a child prompt.""" + ... + + +class MutationPort(ABC): + """Port for mutating a prompt — paraphrase, constrain, generalize, specialize.""" + + @abstractmethod + async def mutate( + self, + prompt: Prompt, + task_description: str, + mutation_type: str = "paraphrase", + ) -> Prompt: + """Apply a mutation to the prompt.""" + ... + + class PersistencePort(ABC): """Port for reading/writing files.""" @@ -83,3 +118,49 @@ class PersistencePort(ABC): @abstractmethod def write_result(self, path: str, data: dict[str, Any]) -> None: ... + + +class SimilarityPort(ABC): + """Port for computing similarity between a prediction and expected output. + + Infrastructure provides concrete metrics (exact match, BLEU, ROUGE, cosine). + """ + + @abstractmethod + def compute(self, prediction: str, expected: str) -> float: + """Compute similarity score in [0, 1]. 1.0 = perfect match.""" + ... + + +class DatasetLoaderPort(ABC): + """Port for loading ground-truth evaluation datasets.""" + + @abstractmethod + def load(self, path: str) -> list[GroundTruthExample]: + """Load a dataset from a CSV or JSON file. + + Each row must have 'input' and 'expected_output' fields. + """ + ... + + +class CheckpointPort(ABC): + """Port for saving and loading optimization checkpoints. + + Enables resuming long-running optimizations after interruption. + """ + + @abstractmethod + def save(self, state: OptimizationState) -> None: + """Persist the current optimization state to disk.""" + ... + + @abstractmethod + def load(self) -> OptimizationState | None: + """Load the latest checkpoint. Returns None if no checkpoint exists.""" + ... + + @abstractmethod + def latest_exists(self) -> bool: + """Check if a checkpoint file is available for resuming.""" + ... diff --git a/src/prometheus/infrastructure/checkpoint.py b/src/prometheus/infrastructure/checkpoint.py new file mode 100644 index 0000000..0786dcc --- /dev/null +++ b/src/prometheus/infrastructure/checkpoint.py @@ -0,0 +1,149 @@ +""" +JSON checkpoint persistence — save/load optimization state to disk. + +Implements the CheckpointPort with JSON for human-readable, versionable snapshots. +""" +from __future__ import annotations + +import json +import logging +from dataclasses import asdict +from pathlib import Path + +from prometheus.cli.logging_setup import get_logger +from prometheus.domain.entities import ( + Candidate, + OptimizationState, + Prompt, + SyntheticExample, +) +from prometheus.domain.ports import CheckpointPort + +logger = get_logger("checkpoint") + +_CHECKPOINT_FILE = "latest.json" + + +class JsonCheckpointPersistence(CheckpointPort): + """Saves optimization state as JSON to a configurable directory.""" + + def __init__(self, checkpoint_dir: str | Path = ".prometheus/checkpoints") -> None: + self._dir = Path(checkpoint_dir) + + def save(self, state: OptimizationState) -> None: + """Persist the current optimization state to disk.""" + self._dir.mkdir(parents=True, exist_ok=True) + path = self._dir / _CHECKPOINT_FILE + data = _serialize_state(state) + path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8") + logger.info( + "Checkpoint saved", + extra={ + "structured": { + "event": "checkpoint_saved", + "path": str(path), + "iteration": state.iteration, + "total_llm_calls": state.total_llm_calls, + }, + }, + ) + + def load(self) -> OptimizationState | None: + """Load the latest checkpoint. Returns None if no checkpoint exists.""" + path = self._dir / _CHECKPOINT_FILE + if not path.exists(): + return None + raw = json.loads(path.read_text(encoding="utf-8")) + state = _deserialize_state(raw) + logger.info( + "Checkpoint loaded", + extra={ + "structured": { + "event": "checkpoint_loaded", + "path": str(path), + "iteration": state.iteration, + "total_llm_calls": state.total_llm_calls, + }, + }, + ) + return state + + def latest_exists(self) -> bool: + """Check if a checkpoint file is available for resuming.""" + return (self._dir / _CHECKPOINT_FILE).exists() + + +# --------------------------------------------------------------------------- +# Serialization helpers — keep the JSON format stable and self-describing. +# --------------------------------------------------------------------------- + +_SCHEMA_VERSION = 1 + + +def _serialize_state(state: OptimizationState) -> dict: + """Convert OptimizationState to a JSON-safe dict.""" + return { + "schema_version": _SCHEMA_VERSION, + "iteration": state.iteration, + "best_candidate": _serialize_candidate(state.best_candidate), + "candidates": [_serialize_candidate(c) for c in state.candidates], + "synthetic_pool": [ + {"input_text": ex.input_text, "category": ex.category, "id": ex.id} + for ex in state.synthetic_pool + ], + "history": state.history, + "total_llm_calls": state.total_llm_calls, + } + + +def _serialize_candidate(candidate: Candidate | None) -> dict | None: + if candidate is None: + return None + return { + "prompt_text": candidate.prompt.text, + "prompt_metadata": candidate.prompt.metadata, + "best_score": candidate.best_score, + "generation": candidate.generation, + "parent_id": candidate.parent_id, + } + + +def _deserialize_state(data: dict) -> OptimizationState: + """Reconstruct OptimizationState from a checkpoint dict.""" + version = data.get("schema_version", 1) + # Future migration hooks go here: if version < 2: ... + + best_raw = data.get("best_candidate") + best_candidate = _deserialize_candidate(best_raw) + + candidates = [_deserialize_candidate(c) for c in data.get("candidates", [])] + + synthetic_pool = [ + SyntheticExample( + input_text=ex["input_text"], + category=ex.get("category", "default"), + id=ex.get("id", 0), + ) + for ex in data.get("synthetic_pool", []) + ] + + state = OptimizationState( + iteration=data.get("iteration", 0), + best_candidate=best_candidate, + candidates=candidates, + synthetic_pool=synthetic_pool, + history=data.get("history", []), + total_llm_calls=data.get("total_llm_calls", 0), + ) + return state + + +def _deserialize_candidate(raw: dict | None) -> Candidate | None: + if raw is None: + return None + return Candidate( + prompt=Prompt(text=raw["prompt_text"], metadata=raw.get("prompt_metadata", {})), + best_score=raw.get("best_score", 0.0), + generation=raw.get("generation", 0), + parent_id=raw.get("parent_id"), + ) diff --git a/src/prometheus/infrastructure/crossover_adapter.py b/src/prometheus/infrastructure/crossover_adapter.py new file mode 100644 index 0000000..d29b40d --- /dev/null +++ b/src/prometheus/infrastructure/crossover_adapter.py @@ -0,0 +1,63 @@ +""" +Adapter: Instruction Crossover via DSPy. + +Implements CrossoverPort — combines two parent prompts into a child. +""" +from __future__ import annotations + +import asyncio + +import dspy + +from prometheus.domain.entities import Prompt +from prometheus.domain.ports import CrossoverPort +from prometheus.infrastructure.dspy_modules import InstructionCrossover +from prometheus.infrastructure.retry import async_retry_with_backoff + + +class DSPyCrossoverAdapter(CrossoverPort): + """Uses DSPy to combine two parent instructions into a child.""" + + def __init__( + self, + lm: dspy.LM, + max_retries: int = 3, + retry_delay_base: float = 1.0, + ) -> None: + self._lm = lm + self._crossover = InstructionCrossover() + self._max_retries = max_retries + self._retry_delay_base = retry_delay_base + self.call_count: int = 0 + + async def crossover( + self, + parent_a: Prompt, + parent_b: Prompt, + task_description: str, + ) -> Prompt: + async def _call() -> Prompt: + return await asyncio.to_thread( + self._sync_crossover, parent_a, parent_b, task_description, + ) + + return await async_retry_with_backoff( + _call, + max_retries=self._max_retries, + retry_delay_base=self._retry_delay_base, + ) + + def _sync_crossover( + self, + parent_a: Prompt, + parent_b: Prompt, + task_description: str, + ) -> Prompt: + with dspy.context(lm=self._lm): + pred = self._crossover( + parent_a=parent_a.text, + parent_b=parent_b.text, + task_description=task_description, + ) + self.call_count += 1 + return Prompt(text=pred.child_instruction) diff --git a/src/prometheus/infrastructure/dataset_loader.py b/src/prometheus/infrastructure/dataset_loader.py new file mode 100644 index 0000000..271300c --- /dev/null +++ b/src/prometheus/infrastructure/dataset_loader.py @@ -0,0 +1,75 @@ +"""Dataset loader — loads ground-truth CSV/JSON datasets.""" +from __future__ import annotations + +import csv +import json +import logging +from pathlib import Path + +from prometheus.domain.entities import GroundTruthExample +from prometheus.domain.ports import DatasetLoaderPort + +logger = logging.getLogger(__name__) + + +class FileDatasetLoader(DatasetLoaderPort): + """Loads evaluation datasets from CSV or JSON files. + + CSV files must have 'input' and 'expected_output' columns. + JSON files must be an array of objects with 'input' and 'expected_output' keys. + """ + + def load(self, path: str) -> list[GroundTruthExample]: + suffix = Path(path).suffix.lower() + if suffix == ".csv": + return self._load_csv(path) + elif suffix in (".json", ".jsonl"): + return self._load_json(path) + else: + raise ValueError( + f"Unsupported dataset format '{suffix}'. Use .csv, .json, or .jsonl." + ) + + def _load_csv(self, path: str) -> list[GroundTruthExample]: + examples: list[GroundTruthExample] = [] + with open(path, newline="", encoding="utf-8") as f: + reader = csv.DictReader(f) + for i, row in enumerate(reader): + input_text = row.get("input", "").strip() + expected = row.get("expected_output", "").strip() + if not input_text: + logger.warning("Skipping CSV row %d: empty 'input' field", i + 1) + continue + examples.append( + GroundTruthExample( + input_text=input_text, + expected_output=expected, + id=i, + ) + ) + logger.info("Loaded %d examples from CSV: %s", len(examples), path) + return examples + + def _load_json(self, path: str) -> list[GroundTruthExample]: + with open(path, encoding="utf-8") as f: + data = json.load(f) + if not isinstance(data, list): + raise ValueError("JSON dataset must be an array of objects.") + examples: list[GroundTruthExample] = [] + for i, item in enumerate(data): + input_text = item.get("input", "").strip() if isinstance(item, dict) else "" + expected = ( + item.get("expected_output", "").strip() if isinstance(item, dict) else "" + ) + if not input_text: + logger.warning("Skipping JSON item %d: empty 'input' field", i) + continue + examples.append( + GroundTruthExample( + input_text=input_text, + expected_output=expected, + id=i, + ) + ) + logger.info("Loaded %d examples from JSON: %s", len(examples), path) + return examples diff --git a/src/prometheus/infrastructure/judge_adapter.py b/src/prometheus/infrastructure/judge_adapter.py index 220ac74..a28134f 100644 --- a/src/prometheus/infrastructure/judge_adapter.py +++ b/src/prometheus/infrastructure/judge_adapter.py @@ -59,6 +59,7 @@ class DSPyJudgeAdapter(JudgePort): if self._judge_dimensions else {} ) + self.call_count: int = 0 async def judge_batch( self, @@ -104,13 +105,15 @@ class DSPyJudgeAdapter(JudgePort): def _sync_judge(self, task_description: str, input_text: str, output_text: str): with dspy.context(lm=self._lm): - return self._judge( + result = self._judge( task_description=task_description, input_text=input_text, output_text=output_text, judge_criteria=self._judge_criteria, dimension_names=self._dimension_names, ) + self.call_count += 1 + return result def _aggregate_result(self, pred: Any) -> tuple[float, str]: """Compute weighted aggregate score from dimension scores if available.""" diff --git a/src/prometheus/infrastructure/llm_adapter.py b/src/prometheus/infrastructure/llm_adapter.py index 2134550..b97f7a4 100644 --- a/src/prometheus/infrastructure/llm_adapter.py +++ b/src/prometheus/infrastructure/llm_adapter.py @@ -34,6 +34,7 @@ class DSPyLLMAdapter(LLMPort): self._predictor = dspy.Predict(self._ExecuteSignature) self._max_retries = max_retries self._retry_delay_base = retry_delay_base + self.call_count: int = 0 async def execute(self, prompt: Prompt, input_text: str) -> str: async def _call() -> str: @@ -52,4 +53,5 @@ class DSPyLLMAdapter(LLMPort): instruction=prompt.text, input_text=input_text, ) + self.call_count += 1 return str(result.output) diff --git a/src/prometheus/infrastructure/mutation_adapter.py b/src/prometheus/infrastructure/mutation_adapter.py new file mode 100644 index 0000000..07dba7e --- /dev/null +++ b/src/prometheus/infrastructure/mutation_adapter.py @@ -0,0 +1,70 @@ +""" +Adapter: Instruction Mutation via DSPy. + +Implements MutationPort — applies typed mutations (paraphrase, constrain, +generalize, specialize) to a prompt. +""" +from __future__ import annotations + +import asyncio +import random + +import dspy + +from prometheus.domain.entities import Prompt +from prometheus.domain.ports import MutationPort +from prometheus.infrastructure.dspy_modules import InstructionMutator +from prometheus.infrastructure.retry import async_retry_with_backoff + +_MUTATION_TYPES = ("paraphrase", "constrain", "generalize", "specialize") + + +class DSPyMutationAdapter(MutationPort): + """Uses DSPy to apply typed mutations to an instruction.""" + + def __init__( + self, + lm: dspy.LM, + max_retries: int = 3, + retry_delay_base: float = 1.0, + ) -> None: + self._lm = lm + self._mutator = InstructionMutator() + self._max_retries = max_retries + self._retry_delay_base = retry_delay_base + self.call_count: int = 0 + + async def mutate( + self, + prompt: Prompt, + task_description: str, + mutation_type: str = "paraphrase", + ) -> Prompt: + if mutation_type not in _MUTATION_TYPES: + mutation_type = random.choice(_MUTATION_TYPES) + + async def _call() -> Prompt: + return await asyncio.to_thread( + self._sync_mutate, prompt, task_description, mutation_type, + ) + + return await async_retry_with_backoff( + _call, + max_retries=self._max_retries, + retry_delay_base=self._retry_delay_base, + ) + + def _sync_mutate( + self, + prompt: Prompt, + task_description: str, + mutation_type: str, + ) -> Prompt: + with dspy.context(lm=self._lm): + pred = self._mutator( + current_instruction=prompt.text, + task_description=task_description, + mutation_type=mutation_type, + ) + self.call_count += 1 + return Prompt(text=pred.mutated_instruction) diff --git a/src/prometheus/infrastructure/proposer_adapter.py b/src/prometheus/infrastructure/proposer_adapter.py index 3d49ab5..ef5a311 100644 --- a/src/prometheus/infrastructure/proposer_adapter.py +++ b/src/prometheus/infrastructure/proposer_adapter.py @@ -29,6 +29,7 @@ class DSPyProposerAdapter(ProposerPort): self._proposer = InstructionProposer() self._max_retries = max_retries self._retry_delay_base = retry_delay_base + self.call_count: int = 0 async def propose( self, @@ -56,6 +57,7 @@ class DSPyProposerAdapter(ProposerPort): task_description=task_description, failure_examples=failure_examples, ) + self.call_count += 1 return Prompt(text=pred.new_instruction) @staticmethod diff --git a/src/prometheus/infrastructure/similarity.py b/src/prometheus/infrastructure/similarity.py new file mode 100644 index 0000000..5aedd65 --- /dev/null +++ b/src/prometheus/infrastructure/similarity.py @@ -0,0 +1,153 @@ +"""Similarity adapters — concrete metrics for comparing prediction vs expected.""" +from __future__ import annotations + +import math +import re +from collections import Counter + +from prometheus.domain.ports import SimilarityPort + + +class ExactMatchSimilarity(SimilarityPort): + """Case-insensitive exact string match. Returns 1.0 or 0.0.""" + + def compute(self, prediction: str, expected: str) -> float: + return 1.0 if prediction.strip().lower() == expected.strip().lower() else 0.0 + + +class BleuSimilarity(SimilarityPort): + """BLEU-style n-gram precision (up to 4-grams). + + Simplified implementation using sentence-level BLEU with brevity penalty. + Returns a score in [0, 1]. + """ + + def __init__(self, max_n: int = 4): + self._max_n = max_n + + def compute(self, prediction: str, expected: str) -> float: + pred_tokens = _tokenize(prediction) + ref_tokens = _tokenize(expected) + if not pred_tokens or not ref_tokens: + return 0.0 if not ref_tokens else 0.0 + + # Modified precision for each n-gram + precisions: list[float] = [] + for n in range(1, self._max_n + 1): + pred_ngrams = _ngrams(pred_tokens, n) + ref_ngrams = _ngrams(ref_tokens, n) + if not pred_ngrams: + break + clipped = sum(min(pred_ngrams[ng], ref_ngrams.get(ng, 0)) for ng in pred_ngrams) + total = sum(pred_ngrams.values()) + precisions.append(clipped / total if total > 0 else 0.0) + + if not precisions: + return 0.0 + + # Geometric mean of precisions + log_avg = sum(math.log(p) for p in precisions if p > 0) + n_nonzero = sum(1 for p in precisions if p > 0) + if n_nonzero == 0: + return 0.0 + geo_mean = math.exp(log_avg / n_nonzero) + + # Brevity penalty + bp = 1.0 + if len(pred_tokens) < len(ref_tokens): + bp = math.exp(1 - len(ref_tokens) / len(pred_tokens)) + + return min(bp * geo_mean, 1.0) + + +class RougeLSimilarity(SimilarityPort): + """ROUGE-L using Longest Common Subsequence. + + Returns F1 score combining precision and recall in [0, 1]. + """ + + def compute(self, prediction: str, expected: str) -> float: + pred_tokens = _tokenize(prediction) + ref_tokens = _tokenize(expected) + if not pred_tokens or not ref_tokens: + return 0.0 + + lcs_len = _lcs_length(pred_tokens, ref_tokens) + precision = lcs_len / len(pred_tokens) + recall = lcs_len / len(ref_tokens) + if precision + recall == 0: + return 0.0 + f1 = 2 * precision * recall / (precision + recall) + return f1 + + +class CosineSimilarity(SimilarityPort): + """TF-IDF cosine similarity between bag-of-words vectors. + + Lightweight semantic similarity without external embedding models. + """ + + def compute(self, prediction: str, expected: str) -> float: + pred_tokens = _tokenize(prediction) + ref_tokens = _tokenize(expected) + if not pred_tokens or not ref_tokens: + return 0.0 + + pred_counts = Counter(pred_tokens) + ref_counts = Counter(ref_tokens) + all_tokens = set(pred_counts) | set(ref_counts) + + dot = sum(pred_counts.get(t, 0) * ref_counts.get(t, 0) for t in all_tokens) + norm_pred = math.sqrt(sum(v * v for v in pred_counts.values())) + norm_ref = math.sqrt(sum(v * v for v in ref_counts.values())) + + if norm_pred == 0 or norm_ref == 0: + return 0.0 + return dot / (norm_pred * norm_ref) + + +# --- Helpers --- + + +def _tokenize(text: str) -> list[str]: + """Simple whitespace + punctuation tokenizer.""" + return re.findall(r"\w+", text.lower()) + + +def _ngrams(tokens: list[str], n: int) -> Counter: + """Count n-grams in a token list.""" + return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)) + + +def _lcs_length(a: list[str], b: list[str]) -> int: + """Compute length of the Longest Common Subsequence.""" + m, n = len(a), len(b) + prev = [0] * (n + 1) + for i in range(1, m + 1): + curr = [0] * (n + 1) + for j in range(1, n + 1): + if a[i - 1] == b[j - 1]: + curr[j] = prev[j - 1] + 1 + else: + curr[j] = max(prev[j], curr[j - 1]) + prev = curr + return prev[n] + + +def create_similarity_adapter(metric: str) -> SimilarityPort: + """Factory: create a SimilarityPort by metric name. + + Supported metrics: exact, bleu, rouge_l, cosine. + """ + adapters = { + "exact": ExactMatchSimilarity, + "bleu": BleuSimilarity, + "rouge_l": RougeLSimilarity, + "cosine": CosineSimilarity, + } + cls = adapters.get(metric) + if cls is None: + raise ValueError( + f"Unknown eval metric '{metric}'. Choose from: {sorted(adapters)}" + ) + return cls() diff --git a/src/prometheus/infrastructure/synth_adapter.py b/src/prometheus/infrastructure/synth_adapter.py index 8a10135..9917432 100644 --- a/src/prometheus/infrastructure/synth_adapter.py +++ b/src/prometheus/infrastructure/synth_adapter.py @@ -18,6 +18,7 @@ class DSPySyntheticAdapter(SyntheticGeneratorPort): def __init__(self, lm: dspy.LM) -> None: self._lm = lm self._generator = SyntheticInputGenerator() + self.call_count: int = 0 def generate_inputs( self, @@ -29,6 +30,7 @@ class DSPySyntheticAdapter(SyntheticGeneratorPort): task_description=task_description, n_examples=n_examples, ) + self.call_count += 1 return [ SyntheticExample( input_text=text, diff --git a/tests/conftest.py b/tests/conftest.py index 87da00b..58df951 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -91,3 +91,27 @@ def mock_proposer_port() -> AsyncMock: text="You are a very helpful assistant. Answer the question precisely." ) return port + + +@pytest.fixture +def mock_crossover_port() -> AsyncMock: + """Mock CrossoverPort that combines two parent prompts.""" + port = AsyncMock() + + async def _crossover(parent_a: Prompt, parent_b: Prompt, task_description: str) -> Prompt: + return Prompt(text=f"{parent_a.text} Also, {parent_b.text.lower()}") + + port.crossover = AsyncMock(side_effect=_crossover) + return port + + +@pytest.fixture +def mock_mutation_port() -> AsyncMock: + """Mock MutationPort that paraphrases a prompt.""" + port = AsyncMock() + + async def _mutate(prompt: Prompt, task_description: str, mutation_type: str = "paraphrase") -> Prompt: + return Prompt(text=f"[{mutation_type}] {prompt.text}") + + port.mutate = AsyncMock(side_effect=_mutate) + return port diff --git a/tests/integration/test_dspy_adapters.py b/tests/integration/test_dspy_adapters.py index dd6f909..4fcd0db 100644 --- a/tests/integration/test_dspy_adapters.py +++ b/tests/integration/test_dspy_adapters.py @@ -20,9 +20,10 @@ def mock_lm() -> dspy.LM: class TestDSPyLLMAdapter: - def test_execute_returns_response(self, mock_lm: dspy.LM) -> None: + @pytest.mark.asyncio + async def test_execute_returns_response(self, mock_lm: dspy.LM) -> None: adapter = DSPyLLMAdapter(lm=mock_lm) prompt = Prompt(text="Answer the question.") - result = adapter.execute(prompt, "What is 2+2?") + result = await adapter.execute(prompt, "What is 2+2?") assert isinstance(result, str) assert len(result) > 0 diff --git a/tests/integration/test_evolution_integration.py b/tests/integration/test_evolution_integration.py new file mode 100644 index 0000000..1a0e2fc --- /dev/null +++ b/tests/integration/test_evolution_integration.py @@ -0,0 +1,300 @@ +"""Integration tests for multi-iteration evolution with mixed accept/reject.""" +from __future__ import annotations + +from unittest.mock import AsyncMock, MagicMock + +import pytest + +from prometheus.application.bootstrap import SyntheticBootstrap +from prometheus.application.evaluator import PromptEvaluator +from prometheus.application.evolution import EvolutionLoop +from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory +from prometheus.domain.ports import JudgePort, LLMPort, ProposerPort + + +def _make_eval(scores: list[float]) -> EvalResult: + return EvalResult( + scores=scores, + feedbacks=["feedback"] * len(scores), + trajectories=[ + Trajectory(f"in{i}", f"out{i}", s, "feedback", "prompt") + for i, s in enumerate(scores) + ], + ) + + +class TestMultiIterationEvolution: + """Tests for the evolution loop across multiple iterations.""" + + @pytest.fixture + def seed_prompt(self) -> Prompt: + return Prompt(text="You are a helpful assistant.") + + @pytest.fixture + def task_description(self) -> str: + return "Answer factual questions." + + @pytest.fixture + def synthetic_pool(self) -> list[SyntheticExample]: + return [SyntheticExample(input_text=f"input {i}", id=i) for i in range(20)] + + @pytest.mark.asyncio + async def test_mixed_accept_reject( + self, + seed_prompt: Prompt, + task_description: str, + synthetic_pool: list[SyntheticExample], + ) -> None: + """Iteration 1: accept, iteration 2: reject, iteration 3: accept.""" + mock_llm = MagicMock(spec=LLMPort) + mock_judge = MagicMock(spec=JudgePort) + mock_proposer = MagicMock(spec=ProposerPort) + + evaluator = PromptEvaluator(mock_llm, mock_judge) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:3] + + # Build eval sequence: initial, then per-iteration (current, new) + evals = [ + _make_eval([0.3, 0.3, 0.3]), # initial seed eval + # Iter 1: accept (old=0.4, new=0.8) + _make_eval([0.4, 0.4, 0.4]), + _make_eval([0.8, 0.8, 0.8]), + # Iter 2: reject (old=0.7, new=0.2) + _make_eval([0.7, 0.7, 0.7]), + _make_eval([0.2, 0.2, 0.2]), + # Iter 3: accept (old=0.5, new=0.9) + _make_eval([0.5, 0.5, 0.5]), + _make_eval([0.9, 0.9, 0.9]), + ] + evaluator.evaluate = AsyncMock(side_effect=evals) + + mock_proposer.propose.side_effect = [ + Prompt(text="Better prompt v1"), + Prompt(text="Worse prompt v2"), + Prompt(text="Best prompt v3"), + ] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer, + bootstrap=bootstrap, + max_iterations=3, + minibatch_size=3, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + assert state.iteration == 3 + assert state.best_candidate is not None + assert state.best_candidate.best_score == pytest.approx(2.7) # 0.9*3 + assert len(state.history) == 3 + assert state.history[0]["event"] == "accepted" + assert state.history[1]["event"] == "rejected" + assert state.history[2]["event"] == "accepted" + + @pytest.mark.asyncio + async def test_all_rejected_keeps_seed( + self, + seed_prompt: Prompt, + task_description: str, + synthetic_pool: list[SyntheticExample], + ) -> None: + """When all proposals are rejected, the seed prompt stays as best.""" + mock_llm = MagicMock(spec=LLMPort) + mock_judge = MagicMock(spec=JudgePort) + mock_proposer = MagicMock(spec=ProposerPort) + + evaluator = PromptEvaluator(mock_llm, mock_judge) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:3] + + evals = [ + _make_eval([0.5, 0.5, 0.5]), # initial + ] + for _ in range(3): + evals.append(_make_eval([0.5, 0.5, 0.5])) # current + evals.append(_make_eval([0.1, 0.1, 0.1])) # worse proposal + evaluator.evaluate = AsyncMock(side_effect=evals) + + mock_proposer.propose.side_effect = [ + Prompt(text="bad v1"), + Prompt(text="bad v2"), + Prompt(text="bad v3"), + ] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer, + bootstrap=bootstrap, + max_iterations=3, + minibatch_size=3, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + assert state.best_candidate.prompt.text == seed_prompt.text + assert state.best_candidate.best_score == pytest.approx(1.5) # 0.5*3 + + @pytest.mark.asyncio + async def test_all_accepted_chain( + self, + seed_prompt: Prompt, + task_description: str, + synthetic_pool: list[SyntheticExample], + ) -> None: + """All iterations accept, forming an improvement chain.""" + mock_llm = MagicMock(spec=LLMPort) + mock_judge = MagicMock(spec=JudgePort) + mock_proposer = MagicMock(spec=ProposerPort) + + evaluator = PromptEvaluator(mock_llm, mock_judge) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:2] + + evals = [ + _make_eval([0.2, 0.2]), # initial + ] + for i in range(1, 5): + score = 0.2 + i * 0.15 + evals.append(_make_eval([score, score])) # current + evals.append(_make_eval([score + 0.1, score + 0.1])) # new (accepted) + evaluator.evaluate = AsyncMock(side_effect=evals) + + mock_proposer.propose.side_effect = [ + Prompt(text=f"Improved v{i}") for i in range(4) + ] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer, + bootstrap=bootstrap, + max_iterations=4, + minibatch_size=2, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + assert len(state.candidates) == 5 # seed + 4 accepted + assert all(h["event"] == "accepted" for h in state.history) + + @pytest.mark.asyncio + async def test_error_recovery_continues_loop( + self, + seed_prompt: Prompt, + task_description: str, + synthetic_pool: list[SyntheticExample], + ) -> None: + """When an iteration errors, the loop continues.""" + mock_llm = MagicMock(spec=LLMPort) + mock_judge = MagicMock(spec=JudgePort) + mock_proposer = MagicMock(spec=ProposerPort) + + evaluator = PromptEvaluator(mock_llm, mock_judge) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:2] + + # Eval sequence for 3 iterations: + # - iter 1: evaluate current → propose → evaluate new (accepted) + # - iter 2: evaluate current → propose (ERROR, no new eval) + # - iter 3: evaluate current → propose → evaluate new (accepted) + evals = [ + _make_eval([0.3, 0.3]), # initial + _make_eval([0.5, 0.5]), # iter 1 current + _make_eval([0.9, 0.9]), # iter 1 new (accepted) + _make_eval([0.5, 0.5]), # iter 2 current (proposer errors after this) + _make_eval([0.5, 0.5]), # iter 3 current + _make_eval([0.8, 0.8]), # iter 3 new (accepted) + ] + evaluator.evaluate = AsyncMock(side_effect=evals) + + # Proposer raises on iter 2 + mock_proposer.propose.side_effect = [ + Prompt(text="good v1"), + RuntimeError("LLM timeout"), + Prompt(text="good v3"), + ] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer, + bootstrap=bootstrap, + max_iterations=3, + minibatch_size=2, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + assert state.iteration == 3 + assert state.history[1]["event"] == "error" + assert "LLM timeout" in state.history[1]["error"] + assert state.history[0]["event"] == "accepted" + assert state.history[2]["event"] == "accepted" + + @pytest.mark.asyncio + async def test_perfect_score_skips_proposer( + self, + seed_prompt: Prompt, + task_description: str, + synthetic_pool: list[SyntheticExample], + ) -> None: + """When all scores are perfect, no proposition is made.""" + mock_llm = MagicMock(spec=LLMPort) + mock_judge = MagicMock(spec=JudgePort) + mock_proposer = MagicMock(spec=ProposerPort) + + evaluator = PromptEvaluator(mock_llm, mock_judge) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:2] + + perfect_eval = _make_eval([1.0, 1.0]) + evaluator.evaluate = AsyncMock(return_value=perfect_eval) + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer, + bootstrap=bootstrap, + max_iterations=5, + minibatch_size=2, + perfect_score=1.0, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + mock_proposer.propose.assert_not_called() + assert all(h["event"] == "skip_perfect" for h in state.history) + + @pytest.mark.asyncio + async def test_llm_call_counting( + self, + seed_prompt: Prompt, + task_description: str, + synthetic_pool: list[SyntheticExample], + ) -> None: + """Verify LLM call counting: 2*N per eval (execute + judge) + 1 per propose.""" + mock_llm = MagicMock(spec=LLMPort) + mock_judge = MagicMock(spec=JudgePort) + mock_proposer = MagicMock(spec=ProposerPort) + + evaluator = PromptEvaluator(mock_llm, mock_judge) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:3] + + evals = [_make_eval([0.3, 0.3, 0.3])] # initial + for _ in range(2): + evals.append(_make_eval([0.4, 0.4, 0.4])) + evals.append(_make_eval([0.6, 0.6, 0.6])) + evaluator.evaluate = AsyncMock(side_effect=evals) + + mock_proposer.propose.side_effect = [ + Prompt(text="v1"), + Prompt(text="v2"), + ] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer, + bootstrap=bootstrap, + max_iterations=2, + minibatch_size=3, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + # Initial: 2*3=6, Iter1: 2*3 + 1 + 2*3 = 13, Iter2: same = 13 + # Total: 6 + 13 + 13 = 32 + assert state.total_llm_calls == 32 diff --git a/tests/integration/test_full_pipeline.py b/tests/integration/test_full_pipeline.py index 23b5a49..dc05267 100644 --- a/tests/integration/test_full_pipeline.py +++ b/tests/integration/test_full_pipeline.py @@ -1,7 +1,9 @@ """End-to-end pipeline test with mocked LLM calls.""" from __future__ import annotations -from unittest.mock import MagicMock +from unittest.mock import AsyncMock, MagicMock + +import pytest from prometheus.application.bootstrap import SyntheticBootstrap from prometheus.application.dto import OptimizationConfig @@ -23,9 +25,10 @@ def _make_eval(scores: list[float]) -> EvalResult: class TestFullPipeline: - def test_pipeline_produces_result(self) -> None: + @pytest.mark.asyncio + async def test_pipeline_produces_result(self) -> None: """Full pipeline with mocked ports produces an OptimizationResult.""" - mock_llm = MagicMock(spec=LLMPort) + mock_llm = AsyncMock(spec=LLMPort) mock_llm.execute.return_value = "mock response" mock_judge = MagicMock(spec=JudgePort) @@ -38,11 +41,11 @@ class TestFullPipeline: eval_sequence.append(_make_eval([0.6, 0.6, 0.6, 0.6, 0.6])) # new eval (accepted) mock_judge.judge_batch.return_value = [(0.5, "ok")] * 5 - mock_proposer = MagicMock(spec=ProposerPort) + mock_proposer = AsyncMock(spec=ProposerPort) mock_proposer.propose.return_value = Prompt(text="Improved prompt") evaluator = PromptEvaluator(mock_llm, mock_judge) - evaluator.evaluate = MagicMock(side_effect=eval_sequence) + evaluator.evaluate = AsyncMock(side_effect=eval_sequence) mock_gen = MagicMock() mock_gen.generate_inputs.return_value = [ @@ -65,7 +68,7 @@ class TestFullPipeline: seed=42, ) - result = use_case.execute(config) + result = await use_case.execute(config) assert result.initial_prompt == "Answer questions." assert result.optimized_prompt == "Improved prompt" diff --git a/tests/integration/test_ground_truth_eval.py b/tests/integration/test_ground_truth_eval.py new file mode 100644 index 0000000..3c7b076 --- /dev/null +++ b/tests/integration/test_ground_truth_eval.py @@ -0,0 +1,199 @@ +"""Integration test — ground-truth evaluation end-to-end with real similarity metrics.""" +from __future__ import annotations + +import asyncio +import json + +import pytest +from unittest.mock import AsyncMock + +from prometheus.application.ground_truth_evaluator import GroundTruthEvaluator +from prometheus.domain.entities import GroundTruthExample, Prompt +from prometheus.domain.ports import LLMPort +from prometheus.infrastructure.dataset_loader import FileDatasetLoader +from prometheus.infrastructure.similarity import ( + BleuSimilarity, + CosineSimilarity, + ExactMatchSimilarity, + RougeLSimilarity, + create_similarity_adapter, +) + + +def _make_dataset(items: list[tuple[str, str]]) -> list[GroundTruthExample]: + return [ + GroundTruthExample(input_text=inp, expected_output=exp, id=i) + for i, (inp, exp) in enumerate(items) + ] + + +@pytest.fixture +def qa_dataset(): + return _make_dataset([ + ("What is the capital of France?", "Paris"), + ("What is 2+2?", "4"), + ("What color is the sky?", "blue"), + ]) + + +@pytest.fixture +def prompt(): + return Prompt(text="Answer the following question concisely.") + + +@pytest.fixture +def mock_executor(): + """Returns responses that partially match the ground truth.""" + port = AsyncMock(spec=LLMPort) + port.execute.side_effect = [ + "Paris is the capital of France.", + "The answer is 4.", + "The sky is blue.", + ] + return port + + +class TestGroundTruthIntegrationWithExactMatch: + @pytest.mark.asyncio + async def test_exact_match_on_qa(self, mock_executor, qa_dataset, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, + similarity=ExactMatchSimilarity(), + ) + result = await evaluator.evaluate(prompt, qa_dataset) + # None of the outputs are exact matches with expected outputs + assert all(s == 0.0 for s in result.scores) + + @pytest.mark.asyncio + async def test_exact_match_with_exact_outputs(self, qa_dataset, prompt): + exact_executor = AsyncMock(spec=LLMPort) + exact_executor.execute.side_effect = ["Paris", "4", "blue"] + evaluator = GroundTruthEvaluator( + executor=exact_executor, + similarity=ExactMatchSimilarity(), + ) + result = await evaluator.evaluate(prompt, qa_dataset) + assert all(s == 1.0 for s in result.scores) + + +class TestGroundTruthIntegrationWithBleu: + @pytest.mark.asyncio + async def test_bleu_scores_partial_match(self, mock_executor, qa_dataset, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, + similarity=BleuSimilarity(), + ) + result = await evaluator.evaluate(prompt, qa_dataset) + assert all(0.0 < s < 1.0 for s in result.scores) + assert result.mean_score > 0.0 + + @pytest.mark.asyncio + async def test_bleu_perfect_match(self, qa_dataset, prompt): + perfect_executor = AsyncMock(spec=LLMPort) + perfect_executor.execute.side_effect = ["Paris", "4", "blue"] + evaluator = GroundTruthEvaluator( + executor=perfect_executor, + similarity=BleuSimilarity(), + ) + result = await evaluator.evaluate(prompt, qa_dataset) + assert all(s > 0.0 for s in result.scores) + + +class TestGroundTruthIntegrationWithRouge: + @pytest.mark.asyncio + async def test_rouge_l_scores(self, mock_executor, qa_dataset, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, + similarity=RougeLSimilarity(), + ) + result = await evaluator.evaluate(prompt, qa_dataset) + assert all(s > 0.0 for s in result.scores) + + +class TestGroundTruthIntegrationWithCosine: + @pytest.mark.asyncio + async def test_cosine_scores(self, mock_executor, qa_dataset, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, + similarity=CosineSimilarity(), + ) + result = await evaluator.evaluate(prompt, qa_dataset) + assert all(s > 0.0 for s in result.scores) + + +class TestDatasetLoaderIntegration: + @pytest.mark.asyncio + async def test_load_csv_and_evaluate(self, tmp_path, prompt): + csv_file = tmp_path / "eval.csv" + csv_file.write_text("input,expected_output\nWhat is 2+2?,4\nWhat color is grass?,green\n") + + loader = FileDatasetLoader() + dataset = loader.load(str(csv_file)) + assert len(dataset) == 2 + + executor = AsyncMock(spec=LLMPort) + executor.execute.side_effect = ["4", "green"] + + evaluator = GroundTruthEvaluator( + executor=executor, + similarity=ExactMatchSimilarity(), + ) + result = await evaluator.evaluate(prompt, dataset) + assert all(s == 1.0 for s in result.scores) + + @pytest.mark.asyncio + async def test_load_json_and_evaluate(self, tmp_path, prompt): + json_file = tmp_path / "eval.json" + data = [ + {"input": "What is 2+2?", "expected_output": "4"}, + {"input": "What color is grass?", "expected_output": "green"}, + ] + json_file.write_text(json.dumps(data)) + + loader = FileDatasetLoader() + dataset = loader.load(str(json_file)) + assert len(dataset) == 2 + + executor = AsyncMock(spec=LLMPort) + executor.execute.side_effect = ["4", "not green"] + + evaluator = GroundTruthEvaluator( + executor=executor, + similarity=create_similarity_adapter("bleu"), + ) + result = await evaluator.evaluate(prompt, dataset) + # First item should score well, second poorly + assert result.scores[0] > result.scores[1] + + +class TestMetricComparison: + """Compare different metrics on the same outputs to ensure they behave differently.""" + + @pytest.mark.asyncio + async def test_metrics_give_different_scores(self, qa_dataset, prompt): + results = {} + for metric_name, metric_cls in [ + ("exact", ExactMatchSimilarity), + ("bleu", BleuSimilarity), + ("rouge_l", RougeLSimilarity), + ("cosine", CosineSimilarity), + ]: + executor = AsyncMock(spec=LLMPort) + executor.execute.side_effect = [ + "Paris is the capital of France.", + "The answer is 4.", + "The sky is blue.", + ] + evaluator = GroundTruthEvaluator( + executor=executor, + similarity=metric_cls(), + ) + result = await evaluator.evaluate(prompt, qa_dataset) + results[metric_name] = result.mean_score + + # Exact match should be 0 (no exact matches) + assert results["exact"] == 0.0 + # All other metrics should give partial credit + assert results["bleu"] > 0.0 + assert results["rouge_l"] > 0.0 + assert results["cosine"] > 0.0 diff --git a/tests/unit/test_adapters.py b/tests/unit/test_adapters.py new file mode 100644 index 0000000..ba685bb --- /dev/null +++ b/tests/unit/test_adapters.py @@ -0,0 +1,294 @@ +"""Unit tests for infrastructure adapters — LLM, Judge, Proposer, Synthetic. + +Uses mocked DSPy modules to isolate adapter logic from LLM calls. +""" +from __future__ import annotations + +from unittest.mock import AsyncMock, MagicMock, patch + +import dspy +import pytest + +from prometheus.domain.entities import Prompt, SyntheticExample, Trajectory +from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter +from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter +from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter +from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter + + +# --- LLM Adapter --- + + +class TestDSPyLLMAdapter: + """Tests for DSPyLLMAdapter.execute().""" + + @pytest.fixture + def mock_lm(self) -> MagicMock: + return MagicMock(spec=dspy.LM) + + @pytest.fixture + def adapter(self, mock_lm: MagicMock) -> DSPyLLMAdapter: + return DSPyLLMAdapter(lm=mock_lm) + + @pytest.mark.asyncio + async def test_execute_returns_output_string( + self, adapter: DSPyLLMAdapter, mock_lm: MagicMock + ) -> None: + mock_predictor = MagicMock() + mock_predictor.return_value = MagicMock(output="Hello response") + adapter._predictor = mock_predictor + + prompt = Prompt(text="Say hello.") + result = await adapter.execute(prompt, "Hi there") + + assert result == "Hello response" + + @pytest.mark.asyncio + async def test_execute_passes_prompt_text_and_input( + self, adapter: DSPyLLMAdapter, mock_lm: MagicMock + ) -> None: + mock_predictor = MagicMock() + mock_predictor.return_value = MagicMock(output="response") + adapter._predictor = mock_predictor + + prompt = Prompt(text="Translate this.") + await adapter.execute(prompt, "Hello world") + + mock_predictor.assert_called_once_with( + instruction="Translate this.", + input_text="Hello world", + ) + + @pytest.mark.asyncio + async def test_execute_uses_dspy_context( + self, adapter: DSPyLLMAdapter, mock_lm: MagicMock + ) -> None: + mock_predictor = MagicMock() + mock_predictor.return_value = MagicMock(output="ok") + adapter._predictor = mock_predictor + + with patch("prometheus.infrastructure.llm_adapter.dspy.context") as mock_ctx: + await adapter.execute(Prompt(text="test"), "input") + mock_ctx.assert_called_once_with(lm=mock_lm) + + @pytest.mark.asyncio + async def test_execute_converts_output_to_str( + self, adapter: DSPyLLMAdapter, mock_lm: MagicMock + ) -> None: + mock_predictor = MagicMock() + mock_predictor.return_value = MagicMock(output=42) + adapter._predictor = mock_predictor + + result = await adapter.execute(Prompt(text="test"), "input") + assert isinstance(result, str) + assert result == "42" + + +# --- Judge Adapter --- + + +class TestDSPyJudgeAdapter: + """Tests for DSPyJudgeAdapter.judge_batch().""" + + @pytest.fixture + def mock_lm(self) -> MagicMock: + return MagicMock(spec=dspy.LM) + + @pytest.fixture + def adapter(self, mock_lm: MagicMock) -> DSPyJudgeAdapter: + return DSPyJudgeAdapter(lm=mock_lm) + + @pytest.mark.asyncio + async def test_judge_batch_returns_scores_and_feedback( + self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock + ) -> None: + adapter._judge = MagicMock() + adapter._judge.side_effect = [ + MagicMock(score=0.9, feedback="Excellent."), + MagicMock(score=0.4, feedback="Incomplete."), + ] + + pairs = [("What is 2+2?", "4"), ("Capital of France?", "London")] + result = await adapter.judge_batch("math and geography", pairs) + + assert len(result) == 2 + assert result[0] == (0.9, "Excellent.") + assert result[1] == (0.4, "Incomplete.") + + @pytest.mark.asyncio + async def test_judge_batch_empty_pairs( + self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock + ) -> None: + result = await adapter.judge_batch("task", []) + assert result == [] + + @pytest.mark.asyncio + async def test_judge_batch_uses_dspy_context( + self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock + ) -> None: + adapter._judge = MagicMock() + adapter._judge.return_value = MagicMock(score=0.5, feedback="ok") + + with patch("prometheus.infrastructure.judge_adapter.dspy.context") as mock_ctx: + await adapter.judge_batch("task", [("in", "out")]) + mock_ctx.assert_called_once_with(lm=mock_lm) + + @pytest.mark.asyncio + async def test_judge_batch_returns_all_results( + self, adapter: DSPyJudgeAdapter, mock_lm: MagicMock + ) -> None: + """Judge calls run in parallel but all results are returned.""" + adapter._judge = MagicMock() + adapter._judge.side_effect = [ + MagicMock(score=0.5, feedback="ok"), + MagicMock(score=0.7, feedback="better"), + MagicMock(score=0.3, feedback="worse"), + ] + + pairs = [("first", "out1"), ("second", "out2"), ("third", "out3")] + results = await adapter.judge_batch("task", pairs) + + assert len(results) == 3 + scores = [r[0] for r in results] + assert 0.5 in scores + assert 0.7 in scores + assert 0.3 in scores + + +# --- Proposer Adapter --- + + +class TestDSPyProposerAdapter: + """Tests for DSPyProposerAdapter.propose().""" + + @pytest.fixture + def mock_lm(self) -> MagicMock: + return MagicMock(spec=dspy.LM) + + @pytest.fixture + def adapter(self, mock_lm: MagicMock) -> DSPyProposerAdapter: + return DSPyProposerAdapter(lm=mock_lm) + + @pytest.mark.asyncio + async def test_propose_returns_new_prompt( + self, adapter: DSPyProposerAdapter, mock_lm: MagicMock + ) -> None: + adapter._proposer = MagicMock() + adapter._proposer.return_value = MagicMock( + new_instruction="Be concise and accurate." + ) + + current = Prompt(text="Answer questions.") + trajectories = [ + Trajectory("in", "out", 0.3, "too verbose", "Answer questions.") + ] + result = await adapter.propose(current, trajectories, "Q&A task") + + assert isinstance(result, Prompt) + assert result.text == "Be concise and accurate." + + @pytest.mark.asyncio + async def test_propose_uses_dspy_context( + self, adapter: DSPyProposerAdapter, mock_lm: MagicMock + ) -> None: + adapter._proposer = MagicMock() + adapter._proposer.return_value = MagicMock(new_instruction="improved") + + with patch("prometheus.infrastructure.proposer_adapter.dspy.context") as mock_ctx: + await adapter.propose(Prompt(text="t"), [], "task") + mock_ctx.assert_called_once_with(lm=mock_lm) + + def test_format_failures_single_trajectory(self) -> None: + trajectories = [ + Trajectory("What is AI?", "A type of robot.", 0.3, "Incomplete definition.", "prompt") + ] + result = DSPyProposerAdapter._format_failures(trajectories) + + assert "What is AI?" in result + assert "A type of robot." in result + assert "0.30" in result + assert "Incomplete definition." in result + assert "# Example 1" in result + + def test_format_failures_multiple_trajectories(self) -> None: + trajectories = [ + Trajectory("input1", "output1", 0.4, "bad", "prompt"), + Trajectory("input2", "output2", 0.2, "worse", "prompt"), + ] + result = DSPyProposerAdapter._format_failures(trajectories) + + assert "# Example 1" in result + assert "# Example 2" in result + assert "---" in result + assert "input1" in result + assert "input2" in result + + def test_format_failures_empty_list(self) -> None: + result = DSPyProposerAdapter._format_failures([]) + assert result == "" + + +# --- Synthetic Adapter --- + + +class TestDSPySyntheticAdapter: + """Tests for DSPySyntheticAdapter.generate_inputs().""" + + @pytest.fixture + def mock_lm(self) -> MagicMock: + return MagicMock(spec=dspy.LM) + + @pytest.fixture + def adapter(self, mock_lm: MagicMock) -> DSPySyntheticAdapter: + return DSPySyntheticAdapter(lm=mock_lm) + + def test_generate_inputs_returns_examples( + self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock + ) -> None: + adapter._generator = MagicMock() + adapter._generator.return_value = MagicMock( + examples=["What is AI?", "Explain ML.", "What is NLP?"] + ) + + result = adapter.generate_inputs("AI task", 3) + + assert len(result) == 3 + assert all(isinstance(ex, SyntheticExample) for ex in result) + assert result[0].input_text == "What is AI?" + assert result[0].id == 0 + assert result[1].id == 1 + + def test_generate_inputs_truncates_to_n( + self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock + ) -> None: + adapter._generator = MagicMock() + adapter._generator.return_value = MagicMock( + examples=["q1", "q2", "q3", "q4", "q5"] + ) + + result = adapter.generate_inputs("task", 3) + + assert len(result) == 3 + + def test_generate_inputs_passes_correct_args( + self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock + ) -> None: + adapter._generator = MagicMock() + adapter._generator.return_value = MagicMock(examples=["q1"]) + + adapter.generate_inputs("my task", 5) + + adapter._generator.assert_called_once_with( + task_description="my task", + n_examples=5, + ) + + def test_generate_inputs_empty_list( + self, adapter: DSPySyntheticAdapter, mock_lm: MagicMock + ) -> None: + adapter._generator = MagicMock() + adapter._generator.return_value = MagicMock(examples=[]) + + result = adapter.generate_inputs("task", 0) + + assert result == [] diff --git a/tests/unit/test_checkpoint.py b/tests/unit/test_checkpoint.py new file mode 100644 index 0000000..d9428db --- /dev/null +++ b/tests/unit/test_checkpoint.py @@ -0,0 +1,333 @@ +"""Unit tests for checkpoint & resume functionality.""" +from __future__ import annotations + +import json +from pathlib import Path +from unittest.mock import AsyncMock, MagicMock + +import pytest + +from prometheus.application.bootstrap import SyntheticBootstrap +from prometheus.application.evaluator import PromptEvaluator +from prometheus.application.evolution import EvolutionLoop +from prometheus.domain.entities import ( + Candidate, + EvalResult, + OptimizationState, + Prompt, + SyntheticExample, + Trajectory, +) +from prometheus.infrastructure.checkpoint import JsonCheckpointPersistence + + +# --------------------------------------------------------------------------- +# JsonCheckpointPersistence — save/load round-trip +# --------------------------------------------------------------------------- + + +class TestJsonCheckpointPersistence: + def test_roundtrip_full_state(self, tmp_path: Path) -> None: + """Saving and loading preserves all fields.""" + ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "ckpts") + + state = OptimizationState( + iteration=7, + best_candidate=Candidate( + prompt=Prompt(text="best prompt", metadata={"source": "test"}), + best_score=0.92, + generation=5, + ), + candidates=[ + Candidate(prompt=Prompt(text="p1"), best_score=0.5, generation=0), + Candidate(prompt=Prompt(text="p2"), best_score=0.92, generation=5), + ], + synthetic_pool=[ + SyntheticExample(input_text="q1", category="cat_a", id=0), + SyntheticExample(input_text="q2", category="cat_b", id=1), + ], + history=[{"iteration": 1, "event": "accepted", "old_score": 0.5, "new_score": 0.7}], + total_llm_calls=42, + ) + + ckpt.save(state) + assert ckpt.latest_exists() + + loaded = ckpt.load() + assert loaded is not None + assert loaded.iteration == 7 + assert loaded.total_llm_calls == 42 + assert loaded.best_candidate is not None + assert loaded.best_candidate.prompt.text == "best prompt" + assert loaded.best_candidate.prompt.metadata == {"source": "test"} + assert loaded.best_candidate.best_score == 0.92 + assert len(loaded.candidates) == 2 + assert len(loaded.synthetic_pool) == 2 + assert loaded.synthetic_pool[0].input_text == "q1" + assert loaded.synthetic_pool[1].category == "cat_b" + assert loaded.history[0]["event"] == "accepted" + + def test_load_returns_none_when_no_checkpoint(self, tmp_path: Path) -> None: + """Loading from empty dir returns None.""" + ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "nope") + assert ckpt.load() is None + assert not ckpt.latest_exists() + + def test_creates_directory_on_save(self, tmp_path: Path) -> None: + """Save creates the directory tree if it doesn't exist.""" + deep_dir = tmp_path / "a" / "b" / "c" + ckpt = JsonCheckpointPersistence(checkpoint_dir=deep_dir) + state = OptimizationState(iteration=1) + ckpt.save(state) + assert (deep_dir / "latest.json").exists() + + def test_overwrites_previous_checkpoint(self, tmp_path: Path) -> None: + """Second save overwrites the first.""" + ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path) + + ckpt.save(OptimizationState(iteration=1, total_llm_calls=10)) + ckpt.save(OptimizationState(iteration=5, total_llm_calls=50)) + + loaded = ckpt.load() + assert loaded is not None + assert loaded.iteration == 5 + assert loaded.total_llm_calls == 50 + + def test_json_is_human_readable(self, tmp_path: Path) -> None: + """Checkpoint file is valid, pretty-printed JSON.""" + ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path) + state = OptimizationState( + iteration=3, + best_candidate=Candidate(prompt=Prompt(text="hello"), best_score=0.8), + ) + ckpt.save(state) + + raw = json.loads((tmp_path / "latest.json").read_text()) + assert raw["schema_version"] == 1 + assert raw["iteration"] == 3 + assert raw["best_candidate"]["prompt_text"] == "hello" + + +# --------------------------------------------------------------------------- +# EvolutionLoop — checkpoint integration +# --------------------------------------------------------------------------- + + +class TestEvolutionCheckpoint: + @pytest.mark.asyncio + async def test_checkpoint_saved_on_interval( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + ) -> None: + """Checkpoint is saved every checkpoint_interval iterations.""" + from unittest.mock import MagicMock + + evaluator = PromptEvaluator(AsyncMock(), AsyncMock()) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + # All iterations accepted so checkpoint triggers + good_eval = EvalResult( + scores=[0.3, 0.4, 0.3, 0.5, 0.2], + feedbacks=["ok"] * 5, + trajectories=[ + Trajectory(f"input{i}", f"out{i}", s, "ok", "p") + for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2]) + ], + ) + better_eval = EvalResult( + scores=[0.8, 0.9, 0.7, 0.8, 0.9], + feedbacks=["good"] * 5, + trajectories=[], + ) + # initial_eval + 5 iterations (each needs old_eval + new_eval) + evaluator.evaluate = AsyncMock( + side_effect=[good_eval] # initial + + [good_eval, better_eval] * 5 # 5 iterations + ) + + proposer = AsyncMock() + proposer.propose.return_value = Prompt(text="improved prompt") + + checkpoint_port = MagicMock() + loop = EvolutionLoop( + evaluator=evaluator, + proposer=proposer, + bootstrap=bootstrap, + max_iterations=5, + minibatch_size=5, + checkpoint_port=checkpoint_port, + checkpoint_interval=2, + ) + + await loop.run(seed_prompt, synthetic_pool, task_description) + + # Checkpoint at iterations 2, 4 (every 2nd) + save_calls = checkpoint_port.save.call_count + assert save_calls >= 2 # at least at iters 2 and 4 + + @pytest.mark.asyncio + async def test_no_checkpoint_without_port( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + ) -> None: + """No checkpointing happens when checkpoint_port is None (default).""" + evaluator = PromptEvaluator(AsyncMock(), AsyncMock()) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + perfect_eval = EvalResult( + scores=[1.0] * 5, + feedbacks=["perfect"] * 5, + trajectories=[ + Trajectory(f"in{i}", f"out{i}", 1.0, "perfect", "p") + for i in range(5) + ], + ) + evaluator.evaluate = AsyncMock(return_value=perfect_eval) + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=AsyncMock(), + bootstrap=bootstrap, + max_iterations=3, + minibatch_size=5, + checkpoint_port=None, + ) + # Should run without error — no checkpoint port, no crash + await loop.run(seed_prompt, synthetic_pool, task_description) + + @pytest.mark.asyncio + async def test_resume_skips_seed_evaluation( + self, + synthetic_pool: list[SyntheticExample], + task_description: str, + ) -> None: + """When initial_state is provided, seed eval is skipped and loop starts from saved iteration.""" + evaluator = PromptEvaluator(AsyncMock(), AsyncMock()) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + proposer = AsyncMock() + proposer.propose.return_value = Prompt(text="new prompt") + + # Only return evaluations for resumed iterations (1 iter: old_eval + new_eval) + old_eval = EvalResult( + scores=[0.5] * 5, + feedbacks=["ok"] * 5, + trajectories=[ + Trajectory(f"in{i}", f"out{i}", 0.5, "ok", "p") for i in range(5) + ], + ) + new_eval = EvalResult( + scores=[0.8] * 5, + feedbacks=["good"] * 5, + trajectories=[], + ) + evaluator.evaluate = AsyncMock(side_effect=[old_eval, new_eval]) + + # Create a state simulating checkpoint at iteration 4 + initial_state = OptimizationState( + iteration=4, + best_candidate=Candidate( + prompt=Prompt(text="checkpoint prompt"), best_score=2.5, generation=4 + ), + candidates=[Candidate(prompt=Prompt(text="checkpoint prompt"), best_score=2.5)], + total_llm_calls=40, + ) + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=proposer, + bootstrap=bootstrap, + max_iterations=5, # only iteration 5 remains + minibatch_size=5, + ) + state = await loop.run( + seed_prompt=Prompt(text="seed"), + synthetic_pool=synthetic_pool, + task_description=task_description, + initial_state=initial_state, + ) + + # Should have run only 1 iteration (iter 5) + assert state.iteration == 5 + # total_llm_calls should include the 40 from checkpoint + new calls + assert state.total_llm_calls > 40 + + @pytest.mark.asyncio + async def test_full_save_and_resume_roundtrip( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + tmp_path: Path, + ) -> None: + """End-to-end: run a few iterations, checkpoint, resume, finish.""" + evaluator = PromptEvaluator(AsyncMock(), AsyncMock()) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + old_eval = EvalResult( + scores=[0.3, 0.4, 0.3, 0.5, 0.2], + feedbacks=["ok"] * 5, + trajectories=[ + Trajectory(f"in{i}", f"out{i}", s, "ok", "p") + for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2]) + ], + ) + new_eval = EvalResult( + scores=[0.8, 0.9, 0.7, 0.8, 0.9], + feedbacks=["good"] * 5, + trajectories=[], + ) + evaluator.evaluate = AsyncMock( + side_effect=[old_eval, old_eval, new_eval, old_eval, new_eval] + ) + proposer = AsyncMock() + proposer.propose.return_value = Prompt(text="improved prompt") + + ckpt = JsonCheckpointPersistence(checkpoint_dir=tmp_path / "ckpts") + loop = EvolutionLoop( + evaluator=evaluator, + proposer=proposer, + bootstrap=bootstrap, + max_iterations=2, + minibatch_size=5, + checkpoint_port=ckpt, + checkpoint_interval=1, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + assert state.iteration == 2 + assert ckpt.latest_exists() + + # Capture the checkpoint state *before* resume (state is mutated in-place) + loaded = ckpt.load() + assert loaded is not None + saved_llm_calls = loaded.total_llm_calls + saved_iteration = loaded.iteration + + # Set up evaluator for resumed run (just 1 more iteration) + evaluator.evaluate = AsyncMock(side_effect=[old_eval, new_eval]) + proposer.propose.return_value = Prompt(text="even better prompt") + + loop2 = EvolutionLoop( + evaluator=evaluator, + proposer=proposer, + bootstrap=bootstrap, + max_iterations=3, + minibatch_size=5, + checkpoint_port=ckpt, + checkpoint_interval=1, + ) + resumed = await loop2.run( + seed_prompt, synthetic_pool, task_description, + initial_state=loaded, + ) + assert resumed.iteration == 3 + assert resumed.total_llm_calls > saved_llm_calls + assert resumed.iteration > saved_iteration diff --git a/tests/unit/test_cli.py b/tests/unit/test_cli.py new file mode 100644 index 0000000..d392ac5 --- /dev/null +++ b/tests/unit/test_cli.py @@ -0,0 +1,278 @@ +"""Tests for the CLI interface — prometheus optimize, version, etc. + +Uses Typer's CliRunner for isolated command testing. +""" +from __future__ import annotations + +from pathlib import Path +from unittest.mock import AsyncMock, MagicMock, patch + +import pytest +import yaml +from typer.testing import CliRunner + +from prometheus.application.dto import OptimizationResult +from prometheus.cli.app import app + +runner = CliRunner() + + +class TestCLIOptimize: + """Tests for the `prometheus optimize` command.""" + + def _write_config(self, tmp_path: Path, **overrides: object) -> Path: + """Write a minimal valid config YAML and return its path.""" + data = { + "seed_prompt": "You are a helpful assistant.", + "task_description": "Answer factual questions accurately.", + } + data.update(overrides) + config_file = tmp_path / "config.yaml" + with open(config_file, "w") as f: + yaml.dump(data, f) + return config_file + + def test_optimize_with_valid_config(self, tmp_path: Path) -> None: + config_file = self._write_config(tmp_path) + output_file = tmp_path / "output.yaml" + + mock_result = OptimizationResult( + optimized_prompt="Improved prompt", + initial_prompt="You are a helpful assistant.", + iterations_used=5, + total_llm_calls=50, + initial_score=0.3, + final_score=0.9, + improvement=0.6, + history=[], + ) + + mock_uc = AsyncMock() + mock_uc.execute.return_value = mock_result + + with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc): + with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"): + with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls: + mock_llm_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls: + mock_judge_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls: + mock_prop_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.dspy"): + result = runner.invoke( + app, + [ + "optimize", + "-i", + str(config_file), + "-o", + str(output_file), + ], + ) + + assert result.exit_code == 0 + assert "Optimized Prompt" in result.output + + def test_optimize_missing_input_file(self) -> None: + result = runner.invoke( + app, + ["optimize", "-i", "/nonexistent/config.yaml"], + ) + assert result.exit_code != 0 + + def test_optimize_with_verbose_flag(self, tmp_path: Path) -> None: + config_file = self._write_config(tmp_path) + output_file = tmp_path / "output.yaml" + + mock_result = OptimizationResult( + optimized_prompt="Improved", + initial_prompt="test", + iterations_used=1, + total_llm_calls=10, + initial_score=0.3, + final_score=0.8, + improvement=0.5, + history=[], + ) + + mock_uc = AsyncMock() + mock_uc.execute.return_value = mock_result + + with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc): + with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"): + with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls: + mock_llm_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls: + mock_judge_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls: + mock_prop_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.dspy"): + result = runner.invoke( + app, + [ + "optimize", + "-i", + str(config_file), + "-o", + str(output_file), + "-v", + ], + ) + + assert result.exit_code == 0 + + def test_optimize_displays_metrics(self, tmp_path: Path) -> None: + config_file = self._write_config(tmp_path) + output_file = tmp_path / "output.yaml" + + mock_result = OptimizationResult( + optimized_prompt="Better prompt", + initial_prompt="test", + iterations_used=3, + total_llm_calls=30, + initial_score=0.40, + final_score=0.85, + improvement=0.45, + history=[], + ) + + mock_uc = AsyncMock() + mock_uc.execute.return_value = mock_result + + with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc): + with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"): + with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls: + mock_llm_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls: + mock_judge_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls: + mock_prop_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.dspy"): + result = runner.invoke( + app, + [ + "optimize", + "-i", + str(config_file), + "-o", + str(output_file), + ], + ) + + assert result.exit_code == 0 + assert "0.40" in result.output + assert "0.85" in result.output + assert "+0.45" in result.output + + def test_optimize_with_max_concurrency_flag(self, tmp_path: Path) -> None: + config_file = self._write_config(tmp_path) + output_file = tmp_path / "output.yaml" + + mock_result = OptimizationResult( + optimized_prompt="Better prompt", + initial_prompt="test", + iterations_used=1, + total_llm_calls=10, + initial_score=0.3, + final_score=0.8, + improvement=0.5, + history=[], + ) + + mock_uc = AsyncMock() + mock_uc.execute.return_value = mock_result + + with patch("prometheus.cli.commands.optimize.OptimizePromptUseCase", return_value=mock_uc): + with patch("prometheus.cli.commands.optimize.DSPySyntheticAdapter"): + with patch("prometheus.cli.commands.optimize.DSPyLLMAdapter") as mock_llm_cls: + mock_llm_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.DSPyJudgeAdapter") as mock_judge_cls: + mock_judge_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.DSPyProposerAdapter") as mock_prop_cls: + mock_prop_cls.return_value = MagicMock() + with patch("prometheus.cli.commands.optimize.dspy"): + result = runner.invoke( + app, + [ + "optimize", + "-i", + str(config_file), + "-o", + str(output_file), + "--max-concurrency", + "10", + ], + ) + + assert result.exit_code == 0 + + +class TestCLIHelp: + """Tests for CLI help and no-args behavior.""" + + def test_no_args_shows_help(self) -> None: + result = runner.invoke(app, []) + # Typer uses exit code 2 when no_args_is_help=True + assert result.exit_code in (0, 2) + assert "PROMETHEUS" in result.output or "Usage" in result.output + + def test_optimize_help(self) -> None: + result = runner.invoke(app, ["optimize", "--help"]) + assert result.exit_code == 0 + assert "input" in result.output.lower() or "INPUT" in result.output + + def test_version_help(self) -> None: + result = runner.invoke(app, ["version", "--help"]) + assert result.exit_code == 0 + + def test_init_help(self) -> None: + result = runner.invoke(app, ["init", "--help"]) + assert result.exit_code == 0 + + def test_list_help(self) -> None: + result = runner.invoke(app, ["list", "--help"]) + assert result.exit_code == 0 + + +class TestCLIVersion: + """Tests for the `prometheus version` command.""" + + def test_version_prints_version(self) -> None: + result = runner.invoke(app, ["version"]) + assert result.exit_code == 0 + assert "PROMETHEUS" in result.output + assert "0.1.0" in result.output + + +class TestCLIList: + """Tests for the `prometheus list` command.""" + + def test_list_no_runs(self, tmp_path: Path) -> None: + result = runner.invoke(app, ["list", "-d", str(tmp_path)]) + assert result.exit_code == 0 + assert "No optimization runs found" in result.output + + def test_list_with_result(self, tmp_path: Path) -> None: + result_data = { + "optimized_prompt": "Better prompt for testing", + "initial_prompt": "test", + "iterations_used": 5, + "total_llm_calls": 50, + "initial_score": 0.30, + "final_score": 0.90, + "improvement": 0.60, + "history": [], + } + result_file = tmp_path / "output.yaml" + import yaml as _yaml + with open(result_file, "w") as f: + _yaml.dump(result_data, f) + + result = runner.invoke(app, ["list", "-d", str(tmp_path)]) + assert result.exit_code == 0 + assert "0.30" in result.output + assert "0.90" in result.output + + def test_list_nonexistent_directory(self) -> None: + result = runner.invoke(app, ["list", "-d", "/nonexistent/dir"]) + assert result.exit_code == 1 diff --git a/tests/unit/test_config.py b/tests/unit/test_config.py index bdf138b..86a868f 100644 --- a/tests/unit/test_config.py +++ b/tests/unit/test_config.py @@ -300,3 +300,33 @@ class TestConfigValidation: ) assert config.max_iterations == 1 assert config.perfect_score == 0.0 + + +class TestEvalConfigValidation: + """Tests for ground-truth evaluation config fields.""" + + def test_eval_defaults(self) -> None: + config = OptimizationConfig(seed_prompt="a", task_description="b") + assert config.eval_dataset_path is None + assert config.eval_metric == "bleu" + + def test_eval_dataset_path_set(self) -> None: + config = OptimizationConfig( + seed_prompt="a", task_description="b", + eval_dataset_path="data.csv", + ) + assert config.eval_dataset_path == "data.csv" + + def test_valid_eval_metrics(self) -> None: + for metric in ("exact", "bleu", "rouge_l", "cosine", "llm_judge"): + config = OptimizationConfig( + seed_prompt="a", task_description="b", eval_metric=metric, + ) + assert config.eval_metric == metric + + def test_invalid_eval_metric_raises(self) -> None: + with pytest.raises(ValidationError, match="eval_metric must be one of"): + OptimizationConfig( + seed_prompt="a", task_description="b", + eval_metric="invalid_metric", + ) diff --git a/tests/unit/test_dataset_loader.py b/tests/unit/test_dataset_loader.py new file mode 100644 index 0000000..2a10a69 --- /dev/null +++ b/tests/unit/test_dataset_loader.py @@ -0,0 +1,86 @@ +"""Tests for the ground-truth dataset loader.""" +from __future__ import annotations + +import json +import os +import tempfile + +import pytest + +from prometheus.domain.entities import GroundTruthExample +from prometheus.infrastructure.dataset_loader import FileDatasetLoader + + +@pytest.fixture +def loader(): + return FileDatasetLoader() + + +class TestCsvLoader: + def test_load_csv(self, loader, tmp_path): + csv_file = tmp_path / "test.csv" + csv_file.write_text("input,expected_output\nhello,world\nfoo,bar\n") + result = loader.load(str(csv_file)) + assert len(result) == 2 + assert result[0].input_text == "hello" + assert result[0].expected_output == "world" + assert result[1].input_text == "foo" + assert result[1].expected_output == "bar" + + def test_load_csv_skips_empty_input(self, loader, tmp_path): + csv_file = tmp_path / "test.csv" + csv_file.write_text("input,expected_output\n,bar\nhello,world\n") + result = loader.load(str(csv_file)) + assert len(result) == 1 + assert result[0].input_text == "hello" + + def test_load_csv_with_whitespace(self, loader, tmp_path): + csv_file = tmp_path / "test.csv" + csv_file.write_text("input,expected_output\n hello , world \n") + result = loader.load(str(csv_file)) + assert result[0].input_text == "hello" + assert result[0].expected_output == "world" + + def test_load_csv_empty_file(self, loader, tmp_path): + csv_file = tmp_path / "test.csv" + csv_file.write_text("input,expected_output\n") + result = loader.load(str(csv_file)) + assert len(result) == 0 + + +class TestJsonLoader: + def test_load_json(self, loader, tmp_path): + json_file = tmp_path / "test.json" + data = [ + {"input": "hello", "expected_output": "world"}, + {"input": "foo", "expected_output": "bar"}, + ] + json_file.write_text(json.dumps(data)) + result = loader.load(str(json_file)) + assert len(result) == 2 + assert result[0].input_text == "hello" + assert result[0].expected_output == "world" + + def test_load_json_skips_empty_input(self, loader, tmp_path): + json_file = tmp_path / "test.json" + data = [ + {"input": "", "expected_output": "bar"}, + {"input": "hello", "expected_output": "world"}, + ] + json_file.write_text(json.dumps(data)) + result = loader.load(str(json_file)) + assert len(result) == 1 + + def test_load_json_not_array_raises(self, loader, tmp_path): + json_file = tmp_path / "test.json" + json_file.write_text(json.dumps({"not": "an array"})) + with pytest.raises(ValueError, match="must be an array"): + loader.load(str(json_file)) + + +class TestUnsupportedFormat: + def test_unsupported_extension_raises(self, loader, tmp_path): + txt_file = tmp_path / "test.txt" + txt_file.write_text("hello") + with pytest.raises(ValueError, match="Unsupported dataset format"): + loader.load(str(txt_file)) diff --git a/tests/unit/test_error_handling.py b/tests/unit/test_error_handling.py index 748f762..017cf73 100644 --- a/tests/unit/test_error_handling.py +++ b/tests/unit/test_error_handling.py @@ -278,6 +278,7 @@ class TestPerCallIsolation: adapter._judge_dimensions = [] adapter._dimension_names = "" adapter._weights = {} + adapter.call_count = 0 # Mock _judge to fail on first call, succeed on second call_count = 0 diff --git a/tests/unit/test_evolution.py b/tests/unit/test_evolution.py index 4b70b72..3260a6b 100644 --- a/tests/unit/test_evolution.py +++ b/tests/unit/test_evolution.py @@ -8,10 +8,30 @@ import pytest from prometheus.application.bootstrap import SyntheticBootstrap from prometheus.application.evaluator import PromptEvaluator from prometheus.application.evolution import EvolutionLoop -from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory +from prometheus.domain.entities import ( + Candidate, + EvalResult, + Prompt, + SyntheticExample, + Trajectory, +) + + +def _make_eval(scores: list[float], label: str = "ok") -> EvalResult: + """Helper to build an EvalResult from a list of scores.""" + return EvalResult( + scores=scores, + feedbacks=[label] * len(scores), + trajectories=[ + Trajectory(f"input{i}", f"output{i}", s, label, "prompt") + for i, s in enumerate(scores) + ], + ) class TestEvolutionLoop: + """Tests for the original single-candidate hill-climbing mode (population_size=1).""" + @pytest.mark.asyncio async def test_accepts_improvement( self, @@ -27,28 +47,9 @@ class TestEvolutionLoop: bootstrap = MagicMock(spec=SyntheticBootstrap) bootstrap.sample_minibatch.return_value = synthetic_pool[:5] - initial_eval = EvalResult( - scores=[0.3, 0.4, 0.3, 0.5, 0.2], - feedbacks=["bad"] * 5, - trajectories=[ - Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt") - for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2]) - ], - ) - old_eval = EvalResult( - scores=[0.3, 0.4, 0.3, 0.5, 0.2], - feedbacks=["bad"] * 5, - trajectories=[ - Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt") - for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2]) - ], - ) - new_eval = EvalResult( - scores=[0.8, 0.9, 0.7, 0.8, 0.9], - feedbacks=["good"] * 5, - trajectories=[], - ) - evaluator.evaluate = AsyncMock(side_effect=[initial_eval, old_eval, new_eval]) + low_eval = _make_eval([0.3, 0.4, 0.3, 0.5, 0.2], "bad") + high_eval = _make_eval([0.8, 0.9, 0.7, 0.8, 0.9], "good") + evaluator.evaluate = AsyncMock(side_effect=[low_eval, low_eval, high_eval]) loop = EvolutionLoop( evaluator=evaluator, @@ -57,8 +58,7 @@ class TestEvolutionLoop: max_iterations=1, minibatch_size=5, ) - with patch.object(loop, "_log"): - state = await loop.run(seed_prompt, synthetic_pool, task_description) + state = await loop.run(seed_prompt, synthetic_pool, task_description) assert state.best_candidate is not None assert state.best_candidate.best_score > 0 @@ -78,28 +78,9 @@ class TestEvolutionLoop: bootstrap = MagicMock(spec=SyntheticBootstrap) bootstrap.sample_minibatch.return_value = synthetic_pool[:5] - initial_eval = EvalResult( - scores=[0.7, 0.8, 0.7, 0.8, 0.9], - feedbacks=["ok"] * 5, - trajectories=[ - Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt") - for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9]) - ], - ) - old_eval = EvalResult( - scores=[0.7, 0.8, 0.7, 0.8, 0.9], - feedbacks=["ok"] * 5, - trajectories=[ - Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt") - for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9]) - ], - ) - new_eval = EvalResult( - scores=[0.2, 0.1, 0.3, 0.2, 0.1], - feedbacks=["bad"] * 5, - trajectories=[], - ) - evaluator.evaluate = AsyncMock(side_effect=[initial_eval, old_eval, new_eval]) + high_eval = _make_eval([0.7, 0.8, 0.7, 0.8, 0.9], "ok") + low_eval = _make_eval([0.2, 0.1, 0.3, 0.2, 0.1], "bad") + evaluator.evaluate = AsyncMock(side_effect=[high_eval, high_eval, low_eval]) loop = EvolutionLoop( evaluator=evaluator, @@ -108,8 +89,7 @@ class TestEvolutionLoop: max_iterations=1, minibatch_size=5, ) - with patch.object(loop, "_log"): - state = await loop.run(seed_prompt, synthetic_pool, task_description) + state = await loop.run(seed_prompt, synthetic_pool, task_description) assert state.best_candidate is not None assert state.best_candidate.prompt.text == seed_prompt.text @@ -129,14 +109,7 @@ class TestEvolutionLoop: bootstrap = MagicMock(spec=SyntheticBootstrap) bootstrap.sample_minibatch.return_value = synthetic_pool[:5] - perfect_eval = EvalResult( - scores=[1.0, 1.0, 1.0, 1.0, 1.0], - feedbacks=["perfect"] * 5, - trajectories=[ - Trajectory(f"input{i}", f"output{i}", 1.0, "perfect", "prompt") - for i in range(5) - ], - ) + perfect_eval = _make_eval([1.0, 1.0, 1.0, 1.0, 1.0], "perfect") evaluator.evaluate = AsyncMock(return_value=perfect_eval) loop = EvolutionLoop( @@ -146,7 +119,226 @@ class TestEvolutionLoop: max_iterations=3, minibatch_size=5, ) - with patch.object(loop, "_log"): - await loop.run(seed_prompt, synthetic_pool, task_description) - + await loop.run(seed_prompt, synthetic_pool, task_description) mock_proposer_port.propose.assert_not_called() + + +class TestPopulationEvolution: + """Tests for population-based evolution (population_size > 1).""" + + @pytest.mark.asyncio + async def test_population_initialization( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + mock_mutation_port: AsyncMock, + ) -> None: + """Population is initialized with the right number of candidates.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + evaluator.evaluate = AsyncMock( + return_value=_make_eval([0.5] * 5, "ok") + ) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=0, # no iterations, just initialization + minibatch_size=5, + population_size=4, + mutation_port=mock_mutation_port, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + # 1 seed + 3 mutations = 4 candidates + assert len(state.candidates) == 4 + assert mock_mutation_port.mutate.call_count == 3 + + @pytest.mark.asyncio + async def test_population_initialization_uses_proposer_fallback( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + ) -> None: + """When no mutation_port is provided, population init falls back to proposer.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + evaluator.evaluate = AsyncMock( + return_value=_make_eval([0.5] * 5, "ok") + ) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=0, + minibatch_size=5, + population_size=3, + # mutation_port intentionally omitted + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + assert len(state.candidates) == 3 + assert mock_proposer_port.propose.call_count == 2 # 3-1 = 2 init mutations + + @pytest.mark.asyncio + async def test_population_iteration_replaces_worst( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + mock_crossover_port: AsyncMock, + mock_mutation_port: AsyncMock, + ) -> None: + """Crossover child replaces worst candidate when its fitness is higher.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + # Sequence: + # 1. Initial eval (seed) + # 2. Population init: 3 mutation calls use proposer.propose(), NOT evaluator.evaluate + # 3. Population iteration: crossover produces child → eval child + # Only 2 evaluator.evaluate calls total + seed_eval = _make_eval([0.5] * 5, "ok") + # Crossover child eval - high score to beat worst + child_eval = _make_eval([0.9, 0.9, 0.8, 0.9, 0.8], "great") + + all_evals = [seed_eval, child_eval] + evaluator.evaluate = AsyncMock(side_effect=all_evals) + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=1, + minibatch_size=5, + population_size=4, + crossover_rate=1.0, + crossover_port=mock_crossover_port, + mutation_rate=0.0, # disable post-crossover mutation for determinism + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + accepted_events = [h for h in state.history if h.get("event") == "pop_accepted"] + assert len(accepted_events) >= 1 + + @pytest.mark.asyncio + async def test_population_iteration_rejects_inferior_child( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + mock_crossover_port: AsyncMock, + ) -> None: + """Inferior child is rejected and doesn't replace any candidate.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + seed_eval = _make_eval([0.8] * 5, "ok") + # Crossover produces very LOW-scoring child + child_eval = _make_eval([0.1] * 5, "terrible") + + all_evals = [seed_eval, child_eval] + evaluator.evaluate = AsyncMock(side_effect=all_evals) + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=1, + minibatch_size=5, + population_size=4, + crossover_rate=1.0, + crossover_port=mock_crossover_port, + mutation_rate=0.0, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + rejected_events = [h for h in state.history if h.get("event") == "pop_rejected"] + assert len(rejected_events) >= 1 + + +class TestDiversityScore: + """Tests for the diversity/similarity scoring logic.""" + + def test_identical_prompts_have_high_similarity(self) -> None: + """Identical prompts should have very high similarity.""" + identical = Prompt(text="You are a helpful assistant. Answer the question.") + pop_a = Candidate(prompt=identical, best_score=4.0, generation=0) + pop_b = Candidate( + prompt=Prompt(text="Completely different prompt about data analysis."), + best_score=3.0, + generation=0, + ) + sim_same = EvolutionLoop._compute_diversity_score(identical, [pop_a, pop_b]) + # Average includes similarity to the different member, so ~0.5 not 0.9+ + assert sim_same > 0.3 + + def test_different_prompts_have_lower_similarity(self) -> None: + """Different prompts should have lower similarity than identical ones.""" + prompt_a = Prompt(text="You are a helpful assistant. Answer the question.") + prompt_b = Prompt(text="Provide detailed analysis of complex data patterns with precision.") + pop_a = Candidate(prompt=prompt_a, best_score=4.0, generation=0) + pop_b = Candidate(prompt=prompt_b, best_score=3.0, generation=0) + sim_a = EvolutionLoop._compute_diversity_score(prompt_a, [pop_a, pop_b]) + sim_b = EvolutionLoop._compute_diversity_score(prompt_b, [pop_a, pop_b]) + # Both should be < 1.0 since they're different + assert sim_a < 1.0 + assert sim_b < 1.0 + + def test_single_member_population_returns_1(self) -> None: + """Single-member population always returns 1.0 (no penalty).""" + prompt = Prompt(text="Any prompt text here.") + pop = [Candidate(prompt=prompt, best_score=1.0, generation=0)] + sim = EvolutionLoop._compute_diversity_score(prompt, pop) + assert sim == 1.0 + + def test_empty_prompt_returns_zero(self) -> None: + """Empty prompt text returns 0.0 when population has >1 member.""" + prompt = Prompt(text="") + pop = [ + Candidate(prompt=Prompt(text="some text"), best_score=1.0, generation=0), + Candidate(prompt=Prompt(text="other text"), best_score=2.0, generation=0), + ] + sim = EvolutionLoop._compute_diversity_score(prompt, pop) + assert sim == 0.0 + + +class TestPromptDiff: + """Tests for the static _compute_prompt_diff helper.""" + + def test_identical_prompts(self) -> None: + result = EvolutionLoop._compute_prompt_diff("hello\nworld", "hello\nworld") + assert result["lines_added"] == 0 + assert result["lines_removed"] == 0 + assert result["chars_delta"] == 0 + + def test_added_lines(self) -> None: + result = EvolutionLoop._compute_prompt_diff("hello", "hello\nworld") + assert result["lines_added"] == 1 + assert result["lines_removed"] == 0 + assert result["chars_delta"] == 6 # "\nworld" + + def test_removed_lines(self) -> None: + result = EvolutionLoop._compute_prompt_diff("hello\nworld", "hello") + assert result["lines_added"] == 0 + assert result["lines_removed"] == 1 diff --git a/tests/unit/test_ground_truth_evaluator.py b/tests/unit/test_ground_truth_evaluator.py new file mode 100644 index 0000000..cbd8c70 --- /dev/null +++ b/tests/unit/test_ground_truth_evaluator.py @@ -0,0 +1,133 @@ +"""Tests for GroundTruthEvaluator — execution + similarity comparison.""" +from __future__ import annotations + +from unittest.mock import AsyncMock + +import pytest + +from prometheus.application.ground_truth_evaluator import GroundTruthEvaluator +from prometheus.domain.entities import EvalResult, GroundTruthExample, Prompt +from prometheus.domain.ports import LLMPort, SimilarityPort + + +@pytest.fixture +def mock_executor() -> AsyncMock: + port = AsyncMock(spec=LLMPort) + port.execute.return_value = "Paris is the capital of France." + return port + + +@pytest.fixture +def mock_similarity() -> AsyncMock: + port = AsyncMock(spec=SimilarityPort) + port.compute.return_value = 0.85 + return port + + +@pytest.fixture +def gt_dataset() -> list[GroundTruthExample]: + return [ + GroundTruthExample(input_text="What is the capital of France?", expected_output="Paris", id=0), + GroundTruthExample(input_text="What is 2+2?", expected_output="4", id=1), + GroundTruthExample(input_text="What color is the sky?", expected_output="blue", id=2), + ] + + +@pytest.fixture +def prompt() -> Prompt: + return Prompt(text="Answer the following question accurately.") + + +@pytest.mark.asyncio +class TestGroundTruthEvaluator: + async def test_evaluate_happy_path(self, mock_executor, mock_similarity, gt_dataset, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, + similarity=mock_similarity, + max_concurrency=2, + ) + result = await evaluator.evaluate(prompt, gt_dataset) + + assert isinstance(result, EvalResult) + assert len(result.scores) == 3 + assert len(result.feedbacks) == 3 + assert len(result.trajectories) == 3 + assert all(s == 0.85 for s in result.scores) + assert result.mean_score == pytest.approx(0.85) + assert result.total_score == pytest.approx(2.55) + + async def test_executor_called_for_each_input(self, mock_executor, mock_similarity, gt_dataset, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, similarity=mock_similarity, + ) + await evaluator.evaluate(prompt, gt_dataset) + assert mock_executor.execute.call_count == 3 + + async def test_similarity_called_for_each_output(self, mock_executor, mock_similarity, gt_dataset, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, similarity=mock_similarity, + ) + await evaluator.evaluate(prompt, gt_dataset) + assert mock_similarity.compute.call_count == 3 + + async def test_execution_error_produces_zero_score(self, mock_similarity, gt_dataset, prompt): + failing_executor = AsyncMock(spec=LLMPort) + failing_executor.execute.side_effect = RuntimeError("API timeout") + + evaluator = GroundTruthEvaluator( + executor=failing_executor, similarity=mock_similarity, + ) + result = await evaluator.evaluate(prompt, gt_dataset) + + assert len(result.scores) == 3 + # The similarity adapter is called with the error sentinel + assert all(isinstance(s, float) for s in result.scores) + assert all("[execution error:" in t.output_text for t in result.trajectories) + + async def test_empty_dataset(self, mock_executor, mock_similarity, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, similarity=mock_similarity, + ) + result = await evaluator.evaluate(prompt, []) + assert result.scores == [] + assert result.mean_score == 0.0 + assert result.total_score == 0.0 + + async def test_trajectory_contains_prompt_used(self, mock_executor, mock_similarity, gt_dataset, prompt): + evaluator = GroundTruthEvaluator( + executor=mock_executor, similarity=mock_similarity, + ) + result = await evaluator.evaluate(prompt, gt_dataset) + for t in result.trajectories: + assert t.prompt_used == prompt.text + + async def test_scores_clamped_to_unit_range(self, mock_executor, gt_dataset, prompt): + # Similarity returns a value > 1.0 (should be clamped) + over_similarity = AsyncMock(spec=SimilarityPort) + over_similarity.compute.return_value = 1.5 + + evaluator = GroundTruthEvaluator( + executor=mock_executor, similarity=over_similarity, + ) + result = await evaluator.evaluate(prompt, gt_dataset) + assert all(0.0 <= s <= 1.0 for s in result.scores) + + async def test_feedback_for_exact_match(self, mock_executor, gt_dataset, prompt): + exact_similarity = AsyncMock(spec=SimilarityPort) + exact_similarity.compute.return_value = 1.0 + + evaluator = GroundTruthEvaluator( + executor=mock_executor, similarity=exact_similarity, + ) + result = await evaluator.evaluate(prompt, gt_dataset) + assert all("Exact match" in fb for fb in result.feedbacks) + + async def test_feedback_for_poor_match(self, mock_executor, gt_dataset, prompt): + poor_similarity = AsyncMock(spec=SimilarityPort) + poor_similarity.compute.return_value = 0.1 + + evaluator = GroundTruthEvaluator( + executor=mock_executor, similarity=poor_similarity, + ) + result = await evaluator.evaluate(prompt, gt_dataset) + assert all("Poor match" in fb for fb in result.feedbacks) diff --git a/tests/unit/test_holdout_validation.py b/tests/unit/test_holdout_validation.py new file mode 100644 index 0000000..0c06e67 --- /dev/null +++ b/tests/unit/test_holdout_validation.py @@ -0,0 +1,316 @@ +"""Unit tests for hold-out validation and early stopping.""" +from __future__ import annotations + +from unittest.mock import AsyncMock, MagicMock + +import pytest + +from prometheus.application.bootstrap import SyntheticBootstrap +from prometheus.application.evaluator import PromptEvaluator +from prometheus.application.evolution import EvolutionLoop +from prometheus.domain.entities import ( + Candidate, + EvalResult, + Prompt, + SyntheticExample, + Trajectory, +) + + +def _make_eval(mean_score: float, n: int = 5) -> EvalResult: + """Helper: create an EvalResult with a given mean score.""" + scores = [mean_score] * n + return EvalResult( + scores=scores, + feedbacks=["feedback"] * n, + trajectories=[ + Trajectory(f"input{i}", f"output{i}", mean_score, "feedback", "prompt") + for i in range(n) + ], + ) + + +class TestBootstrapSplit: + """Tests for SyntheticBootstrap.split_pool.""" + + def test_split_produces_correct_sizes(self): + pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(20)] + train, val = SyntheticBootstrap.split_pool(pool, 0.3) + assert len(train) + len(val) == 20 + assert len(val) == 6 # 20 * 0.3 = 6 + assert len(train) == 14 + + def test_split_zero_fraction_returns_all_train(self): + pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(10)] + train, val = SyntheticBootstrap.split_pool(pool, 0.0) + assert len(train) == 10 + assert len(val) == 0 + + def test_split_single_element(self): + pool = [SyntheticExample(input_text="only", id=0)] + train, val = SyntheticBootstrap.split_pool(pool, 0.3) + assert len(train) == 1 + assert len(val) == 0 + + def test_split_deterministic_with_seed(self): + pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(50)] + train1, val1 = SyntheticBootstrap.split_pool(pool, 0.3, rng=MagicMock(wraps=__import__("random").Random(42))) + train2, val2 = SyntheticBootstrap.split_pool(pool, 0.3, rng=MagicMock(wraps=__import__("random").Random(42))) + assert [ex.id for ex in train1] == [ex.id for ex in train2] + assert [ex.id for ex in val1] == [ex.id for ex in val2] + + def test_split_no_overlap(self): + pool = [SyntheticExample(input_text=f"ex{i}", id=i) for i in range(30)] + train, val = SyntheticBootstrap.split_pool(pool, 0.3) + train_ids = {ex.id for ex in train} + val_ids = {ex.id for ex in val} + assert train_ids.isdisjoint(val_ids) + assert train_ids | val_ids == {ex.id for ex in pool} + + +class TestValidationEvaluation: + """Tests for hold-out evaluation during evolution.""" + + @pytest.mark.asyncio + async def test_validation_pool_evaluated_after_each_iteration( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + ) -> None: + """When a validation pool is provided, the best candidate is evaluated on it.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + # Initial eval (train) + validation eval + iteration train eval + new prompt eval + validation eval + train_eval = _make_eval(0.5) + val_eval = _make_eval(0.6) + new_eval = _make_eval(0.7) + val_eval_2 = _make_eval(0.65) + + evaluator.evaluate = AsyncMock( + side_effect=[train_eval, val_eval, train_eval, new_eval, val_eval_2] + ) + + validation_pool = synthetic_pool[-6:] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=1, + minibatch_size=5, + ) + state = await loop.run( + seed_prompt, synthetic_pool, task_description, + validation_pool=validation_pool, + ) + + # Should have validation metrics in state + assert state.best_validation_score is not None + # History should contain validation_eval entries + val_events = [h for h in state.history if h["event"] == "validation_eval"] + assert len(val_events) >= 1 + + @pytest.mark.asyncio + async def test_no_validation_without_pool( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + ) -> None: + """Without a validation pool, no validation is performed.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + train_eval = _make_eval(0.5) + old_eval = _make_eval(0.5) + new_eval = _make_eval(0.7) + evaluator.evaluate = AsyncMock(side_effect=[train_eval, old_eval, new_eval]) + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=1, + minibatch_size=5, + ) + state = await loop.run(seed_prompt, synthetic_pool, task_description) + + assert state.best_validation_score is None + assert not state.early_stopped + val_events = [h for h in state.history if h["event"] == "validation_eval"] + assert len(val_events) == 0 + + +class TestEarlyStopping: + """Tests for early stopping when validation score degrades.""" + + @pytest.mark.asyncio + async def test_early_stop_triggers_on_patience_exceeded( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + ) -> None: + """Early stopping triggers when validation doesn't improve for K iterations.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + patience = 3 + # Build eval sequence: + # 1. Initial train eval + # 2. Initial validation eval (0.5) + # Then for each of 3 iterations: + # - train eval (current best) + # - train eval (new prompt - accepted) + # - validation eval (degrading) + evals = [ + _make_eval(0.5), # initial train + _make_eval(0.5), # initial validation + ] + for i in range(patience): + evals.extend([ + _make_eval(0.5 + i * 0.1), # current eval (train) + _make_eval(0.6 + i * 0.1), # new eval (train) - accepted + _make_eval(0.4), # validation eval (degrading) + ]) + + evaluator.evaluate = AsyncMock(side_effect=evals) + + validation_pool = synthetic_pool[-5:] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=10, # would go further without early stop + minibatch_size=5, + early_stop_patience=patience, + ) + state = await loop.run( + seed_prompt, synthetic_pool, task_description, + validation_pool=validation_pool, + ) + + assert state.early_stopped is True + assert state.iteration == patience + assert state.best_validation_score is not None + # Should have an early_stop event in history + early_stop_events = [h for h in state.history if h["event"] == "early_stop"] + assert len(early_stop_events) == 1 + + @pytest.mark.asyncio + async def test_early_stop_does_not_trigger_when_improving( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + ) -> None: + """When validation keeps improving, early stopping does not trigger.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + evals = [ + _make_eval(0.3), # initial train + _make_eval(0.3), # initial validation + ] + # 3 iterations, each with improving validation + for i in range(3): + evals.extend([ + _make_eval(0.3 + i * 0.1), # current train eval + _make_eval(0.4 + i * 0.1), # new train eval (accepted) + _make_eval(0.3 + (i + 1) * 0.1), # validation eval (improving) + ]) + + evaluator.evaluate = AsyncMock(side_effect=evals) + + validation_pool = synthetic_pool[-5:] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=3, + minibatch_size=5, + early_stop_patience=5, + ) + state = await loop.run( + seed_prompt, synthetic_pool, task_description, + validation_pool=validation_pool, + ) + + assert state.early_stopped is False + assert state.iteration == 3 + assert state.best_validation_score is not None + + @pytest.mark.asyncio + async def test_validation_patience_resets_on_improvement( + self, + seed_prompt: Prompt, + synthetic_pool: list[SyntheticExample], + task_description: str, + mock_llm_port: AsyncMock, + mock_judge_port: AsyncMock, + mock_proposer_port: AsyncMock, + ) -> None: + """Patience counter resets when validation improves after degrading.""" + evaluator = PromptEvaluator(mock_llm_port, mock_judge_port) + bootstrap = MagicMock(spec=SyntheticBootstrap) + bootstrap.sample_minibatch.return_value = synthetic_pool[:5] + + evals = [ + _make_eval(0.3), # initial train + _make_eval(0.3), # initial validation + # iter 1: degrade + _make_eval(0.3), # current train + _make_eval(0.5), # new train (accepted) + _make_eval(0.2), # validation degrade (patience=1) + # iter 2: degrade + _make_eval(0.5), # current train + _make_eval(0.6), # new train (accepted) + _make_eval(0.2), # validation degrade (patience=2) + # iter 3: improve! (resets patience) + _make_eval(0.6), # current train + _make_eval(0.7), # new train (accepted) + _make_eval(0.4), # validation improve (patience=0) + # iter 4: degrade again + _make_eval(0.7), # current train + _make_eval(0.8), # new train (accepted) + _make_eval(0.2), # validation degrade (patience=1) + ] + + evaluator.evaluate = AsyncMock(side_effect=evals) + validation_pool = synthetic_pool[-5:] + + loop = EvolutionLoop( + evaluator=evaluator, + proposer=mock_proposer_port, + bootstrap=bootstrap, + max_iterations=4, + minibatch_size=5, + early_stop_patience=3, + ) + state = await loop.run( + seed_prompt, synthetic_pool, task_description, + validation_pool=validation_pool, + ) + + assert state.early_stopped is False + assert state.iteration == 4 diff --git a/tests/unit/test_logging.py b/tests/unit/test_logging.py new file mode 100644 index 0000000..f132a1e --- /dev/null +++ b/tests/unit/test_logging.py @@ -0,0 +1,189 @@ +"""Unit tests for structured logging configuration.""" +from __future__ import annotations + +import json +import logging +from pathlib import Path + +from prometheus.cli.logging_setup import configure_logging, get_logger + + +class TestConfigureLogging: + def _count_handlers(self, name: str = "prometheus") -> int: + return len(logging.getLogger(name).handlers) + + def test_default_creates_console_handler(self) -> None: + configure_logging(level=logging.INFO) + prom = logging.getLogger("prometheus") + assert len(prom.handlers) == 1 + assert isinstance(prom.handlers[0], logging.StreamHandler) + prom.handlers.clear() + + def test_json_format_produces_valid_json(self, capsys) -> None: + configure_logging(level=logging.INFO, log_format="json") + logger = get_logger("test_json") + logger.info("hello", extra={"structured": {"key": "value"}}) + + captured = capsys.readouterr() + # Output goes to stderr + line = captured.err.strip().split("\n")[-1] + data = json.loads(line) + assert data["message"] == "hello" + assert data["structured"]["key"] == "value" + assert data["level"] == "INFO" + assert "timestamp" in data + + logging.getLogger("prometheus").handlers.clear() + + def test_text_format_includes_structured_extras(self, capsys) -> None: + configure_logging(level=logging.INFO, log_format="text") + logger = get_logger("test_text") + logger.info("msg", extra={"structured": {"foo": "bar"}}) + + captured = capsys.readouterr() + assert "foo=bar" in captured.err + + logging.getLogger("prometheus").handlers.clear() + + def test_debug_level_shows_debug_messages(self, capsys) -> None: + configure_logging(level=logging.DEBUG) + logger = get_logger("test_debug") + logger.debug("debug msg") + + captured = capsys.readouterr() + assert "debug msg" in captured.err + + logging.getLogger("prometheus").handlers.clear() + + def test_warning_level_hides_debug_messages(self, capsys) -> None: + configure_logging(level=logging.WARNING) + logger = get_logger("test_warn") + logger.debug("should not appear") + logger.info("also hidden") + + captured = capsys.readouterr() + assert "should not appear" not in captured.err + assert "also hidden" not in captured.err + + logging.getLogger("prometheus").handlers.clear() + + def test_file_handler_writes_to_file(self, tmp_path: Path) -> None: + log_file = tmp_path / "test.log" + configure_logging(level=logging.INFO, log_file=str(log_file)) + logger = get_logger("test_file") + logger.info("file message") + + prom = logging.getLogger("prometheus") + # Flush handlers + for h in prom.handlers: + h.flush() + prom.handlers.clear() + + content = log_file.read_text() + assert "file message" in content + + def test_json_file_output(self, tmp_path: Path) -> None: + log_file = tmp_path / "test.json.log" + configure_logging(level=logging.INFO, log_format="json", log_file=str(log_file)) + logger = get_logger("test_json_file") + logger.info("json file msg", extra={"structured": {"x": 1}}) + + prom = logging.getLogger("prometheus") + for h in prom.handlers: + h.flush() + prom.handlers.clear() + + content = log_file.read_text().strip() + data = json.loads(content) + assert data["message"] == "json file msg" + assert data["structured"]["x"] == 1 + + def test_reconfigure_clears_old_handlers(self) -> None: + configure_logging(level=logging.INFO) + configure_logging(level=logging.DEBUG) + prom = logging.getLogger("prometheus") + assert len(prom.handlers) == 1 + prom.handlers.clear() + + def test_propagate_false_prevents_duplicate_output(self, capsys) -> None: + configure_logging(level=logging.INFO) + prom = logging.getLogger("prometheus") + assert prom.propagate is False + prom.handlers.clear() + + +class TestGetLogger: + def test_returns_child_of_prometheus(self) -> None: + logger = get_logger("mymodule") + assert logger.name == "prometheus.mymodule" + + def test_inherits_level_from_parent(self) -> None: + configure_logging(level=logging.DEBUG) + logger = get_logger("child") + assert logger.getEffectiveLevel() <= logging.DEBUG + logging.getLogger("prometheus").handlers.clear() + + +class TestJsonFormatter: + def test_exception_included(self, capsys) -> None: + configure_logging(level=logging.ERROR, log_format="json") + logger = get_logger("test_exc") + try: + raise ValueError("boom") + except ValueError: + logger.error("failed", exc_info=True) + + captured = capsys.readouterr() + line = captured.err.strip().split("\n")[-1] + data = json.loads(line) + assert "ValueError: boom" in data["exception"] + + logging.getLogger("prometheus").handlers.clear() + + +class TestLoggingCLIIntegration: + """Tests for CLI flags that configure logging.""" + + def test_verbose_flag_enables_info(self, tmp_path: Path) -> None: + """Simulate what -v does — configure_logging at INFO level.""" + configure_logging(level=logging.INFO) + logger = get_logger("evolution") + logger.info("test message") + + prom = logging.getLogger("prometheus") + assert len(prom.handlers) == 1 + prom.handlers.clear() + + def test_debug_flag_enables_debug(self) -> None: + """Simulate what --debug does — configure_logging at DEBUG level.""" + configure_logging(level=logging.DEBUG) + logger = get_logger("evolution") + logger.debug("debug message") + + prom = logging.getLogger("prometheus") + assert prom.level == logging.DEBUG + prom.handlers.clear() + + def test_log_format_invalid_rejected(self) -> None: + """Invalid log_format should be caught by OptimizationConfig validator.""" + from pydantic import ValidationError + from prometheus.application.dto import OptimizationConfig + + import pytest + + with pytest.raises(ValidationError, match="log_format must be one of"): + OptimizationConfig( + seed_prompt="a", + task_description="b", + log_format="xml", + ) + + def test_log_format_text_and_json_accepted(self) -> None: + """Both text and json log_format values should be valid.""" + from prometheus.application.dto import OptimizationConfig + + for fmt in ("text", "json"): + config = OptimizationConfig( + seed_prompt="a", task_description="b", log_format=fmt, + ) + assert config.log_format == fmt diff --git a/tests/unit/test_scoring_extended.py b/tests/unit/test_scoring_extended.py new file mode 100644 index 0000000..42a6e84 --- /dev/null +++ b/tests/unit/test_scoring_extended.py @@ -0,0 +1,96 @@ +"""Additional unit tests for scoring edge cases.""" +from __future__ import annotations + +import pytest + +from prometheus.domain.entities import EvalResult, Trajectory +from prometheus.domain.scoring import normalize_score, should_accept + + +def _make_eval(scores: list[float]) -> EvalResult: + return EvalResult( + scores=scores, + feedbacks=[""] * len(scores), + trajectories=[ + Trajectory(f"in{i}", f"out{i}", s, "", "p") + for i, s in enumerate(scores) + ], + ) + + +class TestShouldAcceptEdgeCases: + """Extended edge-case tests for should_accept.""" + + def test_tiny_improvement_accepted(self) -> None: + old = _make_eval([0.5]) + new = _make_eval([0.5001]) + assert should_accept(old, new) is True + + def test_tiny_improvement_below_threshold(self) -> None: + old = _make_eval([0.5]) + new = _make_eval([0.5001]) + assert should_accept(old, new, min_improvement=0.01) is False + + def test_zero_scores_equal(self) -> None: + old = _make_eval([0.0, 0.0]) + new = _make_eval([0.0, 0.0]) + assert should_accept(old, new) is False + + def test_negative_to_zero_not_accepted(self) -> None: + """Scores should be [0,1] but test should_accept with edge values.""" + old = _make_eval([-0.1]) + new = _make_eval([0.0]) + assert should_accept(old, new) is True + + def test_large_improvement(self) -> None: + old = _make_eval([0.0, 0.0, 0.0]) + new = _make_eval([1.0, 1.0, 1.0]) + assert should_accept(old, new) is True + + def test_single_score_improvement(self) -> None: + old = _make_eval([0.4]) + new = _make_eval([0.5]) + assert should_accept(old, new) is True + + def test_min_improvement_exactly_met(self) -> None: + """When improvement exactly equals min_improvement, still rejected (strict >).""" + old = _make_eval([0.5]) + new = _make_eval([0.7]) + assert should_accept(old, new, min_improvement=0.2) is False + + def test_min_improvement_just_over(self) -> None: + old = _make_eval([0.5]) + new = _make_eval([0.7001]) + assert should_accept(old, new, min_improvement=0.2) is True + + +class TestNormalizeScoreEdgeCases: + """Extended edge-case tests for normalize_score.""" + + def test_exact_bounds(self) -> None: + assert normalize_score(0.0) == 0.0 + assert normalize_score(1.0) == 1.0 + + def test_very_large_value(self) -> None: + assert normalize_score(1e10) == 1.0 + + def test_very_negative_value(self) -> None: + assert normalize_score(-1e10) == 0.0 + + def test_custom_bounds_at_edges(self) -> None: + assert normalize_score(5.0, min_val=0.0, max_val=10.0) == 5.0 + assert normalize_score(0.0, min_val=0.0, max_val=10.0) == 0.0 + assert normalize_score(10.0, min_val=0.0, max_val=10.0) == 10.0 + + def test_negative_custom_range(self) -> None: + assert normalize_score(0.0, min_val=-5.0, max_val=5.0) == 0.0 + assert normalize_score(-3.0, min_val=-5.0, max_val=5.0) == -3.0 + assert normalize_score(-10.0, min_val=-5.0, max_val=5.0) == -5.0 + + def test_zero_span_range(self) -> None: + """When min == max, clamps to min.""" + assert normalize_score(5.0, min_val=5.0, max_val=5.0) == 5.0 + assert normalize_score(0.0, min_val=5.0, max_val=5.0) == 5.0 + + def test_fractional_score(self) -> None: + assert normalize_score(0.3333) == pytest.approx(0.3333) diff --git a/tests/unit/test_similarity.py b/tests/unit/test_similarity.py new file mode 100644 index 0000000..a86ae0c --- /dev/null +++ b/tests/unit/test_similarity.py @@ -0,0 +1,133 @@ +"""Tests for similarity adapters — exact, BLEU, ROUGE-L, cosine.""" +from __future__ import annotations + +import pytest + +from prometheus.infrastructure.similarity import ( + BleuSimilarity, + CosineSimilarity, + ExactMatchSimilarity, + RougeLSimilarity, + create_similarity_adapter, +) + + +class TestExactMatchSimilarity: + def test_exact_match(self): + s = ExactMatchSimilarity() + assert s.compute("Hello World", "Hello World") == 1.0 + + def test_case_insensitive(self): + s = ExactMatchSimilarity() + assert s.compute("hello world", "HELLO WORLD") == 1.0 + + def test_whitespace_trimmed(self): + s = ExactMatchSimilarity() + assert s.compute(" hello ", "hello") == 1.0 + + def test_no_match(self): + s = ExactMatchSimilarity() + assert s.compute("hello", "world") == 0.0 + + def test_partial_no_match(self): + s = ExactMatchSimilarity() + assert s.compute("hello world", "hello") == 0.0 + + +class TestBleuSimilarity: + def test_perfect_match(self): + s = BleuSimilarity() + assert s.compute("the cat sat on the mat", "the cat sat on the mat") == 1.0 + + def test_no_overlap(self): + s = BleuSimilarity() + assert s.compute("aaa bbb ccc", "ddd eee fff") == 0.0 + + def test_partial_overlap(self): + s = BleuSimilarity() + score = s.compute("the cat sat", "the cat") + assert 0.0 < score < 1.0 + + def test_empty_prediction(self): + s = BleuSimilarity() + assert s.compute("", "hello world") == 0.0 + + def test_empty_expected(self): + s = BleuSimilarity() + assert s.compute("hello world", "") == 0.0 + + def test_both_empty(self): + s = BleuSimilarity() + assert s.compute("", "") == 0.0 + + def test_shorter_prediction_gets_brevity_penalty(self): + s = BleuSimilarity() + short = s.compute("cat", "the cat sat on the mat") + full = s.compute("the cat sat on the mat", "the cat sat on the mat") + assert short < full + + +class TestRougeLSimilarity: + def test_perfect_match(self): + s = RougeLSimilarity() + assert s.compute("the cat sat", "the cat sat") == 1.0 + + def test_no_overlap(self): + s = RougeLSimilarity() + assert s.compute("aaa bbb", "ccc ddd") == 0.0 + + def test_partial_overlap(self): + s = RougeLSimilarity() + score = s.compute("the cat sat on the mat", "the cat on the rug") + assert 0.0 < score < 1.0 + + def test_empty_prediction(self): + s = RougeLSimilarity() + assert s.compute("", "hello") == 0.0 + + def test_subsequence(self): + s = RougeLSimilarity() + # "cat mat" is a subsequence of "the cat sat on the mat" + score = s.compute("cat mat", "the cat sat on the mat") + assert score > 0.0 + + +class TestCosineSimilarity: + def test_identical_texts(self): + s = CosineSimilarity() + assert s.compute("hello world", "hello world") == pytest.approx(1.0) + + def test_no_overlap(self): + s = CosineSimilarity() + assert s.compute("aaa bbb", "ccc ddd") == 0.0 + + def test_partial_overlap(self): + s = CosineSimilarity() + score = s.compute("hello world foo", "hello world bar") + assert 0.0 < score < 1.0 + + def test_empty_prediction(self): + s = CosineSimilarity() + assert s.compute("", "hello") == 0.0 + + +class TestCreateSimilarityAdapter: + def test_create_exact(self): + adapter = create_similarity_adapter("exact") + assert isinstance(adapter, ExactMatchSimilarity) + + def test_create_bleu(self): + adapter = create_similarity_adapter("bleu") + assert isinstance(adapter, BleuSimilarity) + + def test_create_rouge_l(self): + adapter = create_similarity_adapter("rouge_l") + assert isinstance(adapter, RougeLSimilarity) + + def test_create_cosine(self): + adapter = create_similarity_adapter("cosine") + assert isinstance(adapter, CosineSimilarity) + + def test_unknown_metric_raises(self): + with pytest.raises(ValueError, match="Unknown eval metric"): + create_similarity_adapter("nonexistent") diff --git a/tests/unit/test_use_cases.py b/tests/unit/test_use_cases.py new file mode 100644 index 0000000..e30e7cc --- /dev/null +++ b/tests/unit/test_use_cases.py @@ -0,0 +1,233 @@ +"""Unit tests for OptimizePromptUseCase — direct orchestration tests.""" +from __future__ import annotations + +from unittest.mock import AsyncMock, MagicMock, patch + +import pytest + +from prometheus.application.bootstrap import SyntheticBootstrap +from prometheus.application.dto import OptimizationConfig, OptimizationResult +from prometheus.application.evaluator import PromptEvaluator +from prometheus.application.evolution import EvolutionLoop +from prometheus.application.use_cases import OptimizePromptUseCase +from prometheus.domain.entities import ( + Candidate, + EvalResult, + OptimizationState, + Prompt, + SyntheticExample, + Trajectory, +) + + +def _make_eval(scores: list[float]) -> EvalResult: + return EvalResult( + scores=scores, + feedbacks=["feedback"] * len(scores), + trajectories=[ + Trajectory(f"in{i}", f"out{i}", s, "feedback", "prompt") + for i, s in enumerate(scores) + ], + ) + + +def _make_state( + iterations: int = 3, + initial_score: float = 0.3, + final_score: float = 0.8, + accepted: bool = True, +) -> OptimizationState: + seed = Candidate(prompt=Prompt(text="seed"), best_score=initial_score, generation=0) + best = Candidate( + prompt=Prompt(text="optimized" if accepted else "seed"), + best_score=final_score, + generation=iterations if accepted else 0, + ) + history = [] + for i in range(1, iterations + 1): + event = "accepted" if accepted else "rejected" + history.append({"iteration": i, "event": event, "old_score": 0.3, "new_score": 0.8}) + + return OptimizationState( + iteration=iterations, + best_candidate=best, + candidates=[seed, best] if accepted else [seed], + total_llm_calls=iterations * 11 + 10, + history=history, + ) + + +class TestOptimizePromptUseCaseExecute: + """Tests for the execute() orchestration method.""" + + @pytest.fixture + def mock_evaluator(self) -> MagicMock: + return MagicMock(spec=PromptEvaluator) + + @pytest.fixture + def mock_proposer(self) -> MagicMock: + return MagicMock() + + @pytest.fixture + def mock_bootstrap(self) -> MagicMock: + return MagicMock(spec=SyntheticBootstrap) + + @pytest.fixture + def use_case( + self, + mock_evaluator: MagicMock, + mock_proposer: MagicMock, + mock_bootstrap: MagicMock, + ) -> OptimizePromptUseCase: + return OptimizePromptUseCase( + evaluator=mock_evaluator, + proposer=mock_proposer, + bootstrap=mock_bootstrap, + ) + + @pytest.fixture + def config(self) -> OptimizationConfig: + return OptimizationConfig( + seed_prompt="Answer the question.", + task_description="Q&A task", + max_iterations=5, + n_synthetic_inputs=20, + minibatch_size=5, + seed=42, + ) + + @pytest.mark.asyncio + async def test_returns_optimization_result( + self, + use_case: OptimizePromptUseCase, + mock_bootstrap: MagicMock, + config: OptimizationConfig, + ) -> None: + mock_bootstrap.run.return_value = [ + SyntheticExample(input_text=f"q{i}", id=i) for i in range(20) + ] + + mock_state = _make_state(iterations=3, initial_score=0.3, final_score=0.9) + with patch.object(EvolutionLoop, "run", return_value=mock_state): + result = await use_case.execute(config) + + assert isinstance(result, OptimizationResult) + assert result.initial_prompt == "Answer the question." + assert result.final_score == 0.9 + assert result.improvement == pytest.approx(0.6) + + @pytest.mark.asyncio + async def test_bootstrap_called_with_config_params( + self, + use_case: OptimizePromptUseCase, + mock_bootstrap: MagicMock, + config: OptimizationConfig, + ) -> None: + mock_bootstrap.run.return_value = [] + mock_state = _make_state() + with patch.object(EvolutionLoop, "run", return_value=mock_state): + await use_case.execute(config) + + mock_bootstrap.run.assert_called_once_with( + task_description="Q&A task", + n_examples=20, + ) + + @pytest.mark.asyncio + async def test_evolution_loop_configured_from_config( + self, + use_case: OptimizePromptUseCase, + mock_bootstrap: MagicMock, + config: OptimizationConfig, + ) -> None: + mock_bootstrap.run.return_value = [] + mock_state = _make_state() + + with patch.object(EvolutionLoop, "run", return_value=mock_state) as mock_run: + await use_case.execute(config) + + # Verify the loop was instantiated with correct params + mock_run.assert_called_once() + call_args = mock_run.call_args + seed_prompt = call_args[0][0] + assert seed_prompt.text == "Answer the question." + synthetic_pool = call_args[0][1] + assert len(synthetic_pool) == 0 # bootstrap returned empty + assert call_args[0][2] == "Q&A task" + + @pytest.mark.asyncio + async def test_total_llm_calls_includes_bootstrap_call( + self, + use_case: OptimizePromptUseCase, + mock_bootstrap: MagicMock, + config: OptimizationConfig, + ) -> None: + mock_bootstrap.run.return_value = [] + mock_state = _make_state(iterations=3) + # total_llm_calls from state + 1 for bootstrap + expected = mock_state.total_llm_calls + 1 + + with patch.object(EvolutionLoop, "run", return_value=mock_state): + result = await use_case.execute(config) + + assert result.total_llm_calls == expected + + @pytest.mark.asyncio + async def test_no_candidates_fallback( + self, + use_case: OptimizePromptUseCase, + mock_bootstrap: MagicMock, + config: OptimizationConfig, + ) -> None: + mock_bootstrap.run.return_value = [ + SyntheticExample(input_text=f"q{i}", id=i) for i in range(20) + ] + mock_state = OptimizationState( + iteration=0, + best_candidate=None, + candidates=[], + total_llm_calls=0, + ) + + with patch.object(EvolutionLoop, "run", return_value=mock_state): + result = await use_case.execute(config) + + assert result.optimized_prompt == "Answer the question." + assert result.initial_score == 0.0 + assert result.final_score == 0.0 + assert result.improvement == 0.0 + + @pytest.mark.asyncio + async def test_iterations_used_matches_state( + self, + use_case: OptimizePromptUseCase, + mock_bootstrap: MagicMock, + config: OptimizationConfig, + ) -> None: + mock_bootstrap.run.return_value = [] + mock_state = _make_state(iterations=7) + + with patch.object(EvolutionLoop, "run", return_value=mock_state): + result = await use_case.execute(config) + + assert result.iterations_used == 7 + + @pytest.mark.asyncio + async def test_history_passed_through( + self, + use_case: OptimizePromptUseCase, + mock_bootstrap: MagicMock, + config: OptimizationConfig, + ) -> None: + mock_bootstrap.run.return_value = [] + history = [ + {"iteration": 1, "event": "accepted"}, + {"iteration": 2, "event": "rejected"}, + ] + mock_state = _make_state() + mock_state.history = history + + with patch.object(EvolutionLoop, "run", return_value=mock_state): + result = await use_case.execute(config) + + assert result.history == history