FullStackDev a5bf2ad59c feat: v0.2.0 sprint — ground truth eval, crossover/mutation, checkpointing, similarity guards, dataset loader, CLI commands, extended test coverage

Aggregates all v0.2.0 sprint work (GARAA-30 through GARAA-40) and fixes
2 integration tests that broke when the codebase went async (DSPyLLMAdapter
and full pipeline tests now properly await coroutines).

277 tests pass (260 unit + 17 integration).

Co-Authored-By: Paperclip <noreply@paperclip.ing>

2026-03-29 19:13:50 +00:00

PROMETHEUS Feature Roadmap

Complete codebase review — features needed for production-grade prompt optimization. Generated from v0.1.0 architecture review (2026-03-29).

Legend

Marker	Meaning
CLI	Exposed as a CLI option/flag
Config	YAML config field
Internal	No user-facing surface, architectural improvement
P1	Critical / must-have for reliability
P2	High value, should-have
P3	Nice-to-have, deferred to later versions

1. Multi-Model Routing (P1)

Current state: OptimizationConfig defines four model slots (task_model, judge_model, proposer_model, synth_model), but cli/app.py only configures a single global DSPy LM from task_model. All adapters silently use the same model regardless of config.

Feature:

Each adapter (DSPyLLMAdapter, DSPyJudgeAdapter, DSPyProposerAdapter, DSPySyntheticAdapter) must instantiate its own dspy.LM from the corresponding config field.
Support per-model api_base and api_key_env overrides (e.g., judge on GPT-4o, propose on a cheaper model).

Surface: Config (already partially defined) — judge_model, proposer_model, synth_model become functional. No new CLI flags needed; the YAML already has the fields.

Scope: Infrastructure layer (llm_adapter.py, judge_adapter.py, proposer_adapter.py, synth_adapter.py) + cli/app.py DI wiring.
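
As a rough illustration of the wiring described above, the sketch below resolves one model name per adapter slot. The fallback to task_model when a slot is unset is an assumption, as are the helper name resolve_model_slots and the example model strings; only the four field names come from the existing config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OptimizationConfig:
    # Field names follow the roadmap; the None defaults are an assumption.
    task_model: str
    judge_model: Optional[str] = None
    proposer_model: Optional[str] = None
    synth_model: Optional[str] = None


def resolve_model_slots(config: OptimizationConfig) -> dict:
    """Map each adapter slot to a model name, falling back to task_model."""
    return {
        "task": config.task_model,
        "judge": config.judge_model or config.task_model,
        "proposer": config.proposer_model or config.task_model,
        "synth": config.synth_model or config.task_model,
    }


# Illustrative model names only.
config = OptimizationConfig(task_model="openai/gpt-4o-mini", judge_model="openai/gpt-4o")
slots = resolve_model_slots(config)
# Each adapter would then build its own LM from its slot (e.g. dspy.LM(slots["judge"]))
# instead of reading the single global LM.
```

With this shape, the DI wiring in cli/app.py would pass one resolved slot to each adapter constructor.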

2. Async / Parallel Execution (P1)

Current state: All LLM calls (execute, judge, propose) are sequential. A single iteration with minibatch_size=5 makes ~11 sequential LLM calls (presumably 5 executions, 5 judge calls, and 1 proposal). Wall-clock time scales linearly with minibatch size.

Feature:

Parallelize execution of the prompt across a minibatch (asyncio.gather or dspy.Parallel).
Parallelize judge calls within a batch.
Keep the proposer sequential (single call per iteration).

Surface: Internal. Optionally exposed via --max-concurrency CLI flag and max_concurrency YAML field.

Scope: evaluator.py, judge_adapter.py, llm_adapter.py.
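
One possible shape for the bounded parallelism, using asyncio.gather with a semaphore. The helper run_bounded and the stand-in coroutine fake_execute are hypothetical; a real adapter would await the model call where the sleep is.

```python
import asyncio


async def run_bounded(coros, max_concurrency: int):
    """Await all coroutines, with at most max_concurrency in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        async with semaphore:
            return await coro

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_execute(prompt: str, item: str) -> str:
    # Stand-in for an LLM call.
    await asyncio.sleep(0.01)
    return f"{prompt}:{item}"


minibatch = ["a", "b", "c", "d", "e"]
outputs = asyncio.run(
    run_bounded([fake_execute("p", x) for x in minibatch], max_concurrency=2)
)
```

Because gather preserves input order, judge scores can be zipped back onto the minibatch items without extra bookkeeping.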

3. Robust Error Handling & Retry (P1)

Current state: The evolution loop catches broad Exception per iteration and logs it, then continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks only cover parsing, not network errors.

Feature:

Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx).
Configurable max_retries and retry_delay_base.
Circuit breaker: if N consecutive iterations fail, pause and alert.
Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation.

Surface: --max-retries CLI flag, max_retries Config field. --error-strategy (skip | retry | abort) CLI flag.

Scope: Infrastructure adapters + evolution loop.
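
A minimal sketch of retry with exponential backoff, assuming transient failures are surfaced as a dedicated exception type. TransientError and call_with_retry are hypothetical names; the 10% jitter is a design choice of this sketch, not something the roadmap specifies.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, timeout, 5xx)."""


def call_with_retry(fn, max_retries=3, retry_delay_base=0.5, sleep=time.sleep):
    """Call fn(); on TransientError wait base * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; the caller's error strategy decides
            delay = retry_delay_base * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


attempts = []


def flaky():
    # Fails twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("rate limited")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = call_with_retry(flaky, max_retries=3, retry_delay_base=0.5, sleep=delays.append)
```

Injecting the sleep function keeps the retry policy testable without real waiting.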

4. Checkpoint & Resume (P2)

Current state: If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence.

Feature:

Save OptimizationState to disk every K iterations (or every accepted improvement).
Resume from the latest checkpoint file on restart.
Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state.

Surface: --checkpoint-dir CLI flag (default: .prometheus/checkpoints/). --resume CLI flag to resume from latest checkpoint. checkpoint_interval Config field.

Scope: New CheckpointPort in domain, JsonCheckpointPersistence in infrastructure, modifications to EvolutionLoop.run().
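
The following sketch shows one way a JSON checkpoint could capture and restore RNG state alongside the optimization state. The file naming scheme, the payload layout, and both function names are assumptions; the state dict stands in for whatever OptimizationState serializes to.

```python
import json
import os
import random
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, iteration: int, state: dict, rng: random.Random) -> Path:
    """Write state atomically: dump to a temp file, then rename into place."""
    directory.mkdir(parents=True, exist_ok=True)
    payload = {"iteration": iteration, "state": state, "rng_state": rng.getstate()}
    target = directory / f"checkpoint_{iteration:06d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, target)
    return target


def load_latest_checkpoint(directory: Path, rng: random.Random):
    """Return (iteration, state) from the newest checkpoint, or None if absent."""
    files = sorted(directory.glob("checkpoint_*.json"))
    if not files:
        return None
    payload = json.loads(files[-1].read_text(encoding="utf-8"))
    version, internal, gauss_next = payload["rng_state"]
    # JSON turned the state tuples into lists; setstate needs tuples back.
    rng.setstate((version, tuple(internal), gauss_next))
    return payload["iteration"], payload["state"]


with tempfile.TemporaryDirectory() as tmp:
    rng = random.Random(42)
    save_checkpoint(Path(tmp), 10, {"best_prompt": "Be concise.", "llm_calls": 110}, rng)
    expected_next = rng.random()
    resumed_rng = random.Random()
    iteration, state = load_latest_checkpoint(Path(tmp), resumed_rng)
    resumed_next = resumed_rng.random()
```

Writing to a temp file and renaming means an interrupted save cannot leave a half-written checkpoint as the latest one.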

5. Population-Based Evolution (P2)

Current state: The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The Candidate entity has generation and parent_id fields that suggest population support was planned.

Feature:

Maintain a population of K candidates (e.g., top-K by score or Pareto front).
Crossover: combine instructions from two parent candidates.
Mutation operators: paraphrase, constrain, generalize, specialize.
Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance).

Surface: --population-size CLI flag, population_size Config field. --crossover-rate, --mutation-rate CLI flags.

Scope: EvolutionLoop refactor, new CrossoverPort and MutationPort in domain, new DSPy signatures for crossover/mutation in infrastructure.
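
To make the top-K plus diversity idea concrete, here is a sketch that uses difflib's similarity ratio as a stand-in for edit distance. Note that it drops near-duplicates outright rather than penalizing their score, which is a simplification of the feature as described; select_population and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def select_population(candidates, k, max_similarity=0.9):
    """Keep the top-k (prompt, score) pairs, skipping near-duplicates of kept prompts."""
    kept = []
    for prompt, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        too_close = any(
            SequenceMatcher(None, prompt, other).ratio() > max_similarity
            for other, _ in kept
        )
        if not too_close:
            kept.append((prompt, score))
        if len(kept) == k:
            break
    return kept


pool = [
    ("Summarize the text in one sentence.", 0.82),
    ("Summarize the text in one sentence!", 0.81),  # near-duplicate of the first
    ("List the three key facts as bullets.", 0.78),
    ("Translate to French.", 0.40),
]
population = select_population(pool, k=2)
```

Here the second candidate is skipped despite its score, so the third one takes the remaining slot.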

6. Hold-Out Validation (P2)

Current state: The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs.

Feature:

Split synthetic pool into train (e.g., 70%) and validation (30%) sets.
Evolution uses train minibatches for accept/reject decisions.
After each iteration, evaluate the best candidate on the hold-out set.
Report both train and validation scores in results.
Optional early stopping if validation score degrades for K consecutive iterations.

Surface: --validation-split CLI flag (default: 0.3). --early-stop-patience CLI flag (default: 5). Config fields: validation_split, early_stop_patience.

Scope: SyntheticBootstrap, EvolutionLoop, OptimizationResult (add validation metrics).
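
A small sketch of the split and the early-stopping check, reading "degrades for K consecutive iterations" literally as K score drops in a row. Both helper names and the fixed shuffle seed are assumptions.

```python
import random


def split_pool(pool, validation_split=0.3, seed=0):
    """Shuffle a copy of the pool and split it into (train, validation)."""
    items = list(pool)
    random.Random(seed).shuffle(items)
    n_val = int(round(len(items) * validation_split))
    return items[n_val:], items[:n_val]


def should_stop(validation_scores, patience=5):
    """True if the validation score dropped on each of the last `patience` iterations."""
    if len(validation_scores) <= patience:
        return False
    tail = validation_scores[-(patience + 1):]
    return all(later < earlier for earlier, later in zip(tail, tail[1:]))


train, validation = split_pool(range(10), validation_split=0.3)
```

A common alternative is to stop when the score has not beaten its earlier best for K iterations, which is less sensitive to small oscillations.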

7. Custom Judge Criteria (P2)

Current state: The judge uses a hardcoded rubric in JudgeOutput DSPy signature ("score 0.0-1.0" with generic quality assessment). Users cannot customize evaluation criteria.

Feature:

Allow users to define custom judge rubrics, criteria, and scoring scales.
Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights.
Allow perfect_score to reflect the custom scale.

Surface: judge_criteria YAML field (free text). judge_dimensions YAML field (list of {name, weight, description}). CLI: --judge-criteria for quick overrides.

Scope: JudgeOutput signature (dynamic instructions), JudgePort, DSPyJudgeAdapter, scoring.py (weighted aggregation).
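
The weighted aggregation for scoring.py could look roughly like this. The optional scale key (defaulting to 10) is an assumption added so each dimension normalizes to 0.0-1.0; the roadmap only lists name, weight, and description.

```python
def aggregate_score(scores, dimensions):
    """Weighted mean of per-dimension scores, each normalized by its own scale."""
    total_weight = sum(d["weight"] for d in dimensions)
    if total_weight <= 0:
        raise ValueError("dimension weights must sum to a positive value")
    weighted = sum(
        d["weight"] * scores[d["name"]] / d.get("scale", 10.0) for d in dimensions
    )
    return weighted / total_weight


dimensions = [
    {"name": "accuracy", "weight": 0.5},
    {"name": "clarity", "weight": 0.3},
    {"name": "safety", "weight": 0.2},
]
# 0.5 * 0.8 + 0.3 * 0.6 + 0.2 * 1.0 = 0.78
overall = aggregate_score({"accuracy": 8, "clarity": 6, "safety": 10}, dimensions)
```

Dividing by the weight total means users do not have to make their weights sum to exactly 1.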

8. Real-World Evaluation Harness (P2)

Current state: The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs.

Feature:

Accept an optional evaluation dataset (CSV/JSON with input and expected_output columns).
When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge.
Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected.

Surface: --eval-dataset CLI flag. eval_dataset_path Config field. --eval-metric CLI flag (exact | semantic | llm_judge).

Scope: New GroundTruthEvaluator in application, new SimilarityPort in domain, dataset loader in infrastructure.
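
As a sketch of the simplest metric (exact), the code below loads a CSV with input and expected_output columns and computes accuracy. The function names are hypothetical, and the semantic metrics (BLEU, ROUGE, embeddings) are left out because they need external libraries.

```python
import csv
import io


def load_dataset(handle):
    """Read rows with `input` and `expected_output` columns from a CSV file object."""
    return [(row["input"], row["expected_output"]) for row in csv.DictReader(handle)]


def exact_match_accuracy(predict, dataset):
    """Fraction of rows where the prediction equals the expected output after trimming."""
    if not dataset:
        return 0.0
    hits = sum(predict(x).strip() == y.strip() for x, y in dataset)
    return hits / len(dataset)


sample = io.StringIO("input,expected_output\n2+2,4\ncapital of France,Paris\n3*3,9\n")
dataset = load_dataset(sample)
# Stand-in for running the optimized prompt; note the lowercase "paris" miss.
answers = {"2+2": "4", "capital of France": "paris", "3*3": "9"}
accuracy = exact_match_accuracy(lambda x: answers[x], dataset)
```

The case-sensitive miss on "paris" shows why a semantic metric is worth offering next to exact match.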

9. Logging & Observability (P2)

Current state: Verbose mode (-v) configures Python's logging module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing.

Feature:

Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
JSON-formatted log output for machine parsing.
Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff.
Optional OpenTelemetry export for distributed tracing.

Surface: -v / --verbose enables INFO level. --debug enables DEBUG level. --log-format (text | json). --log-file for file output. Config fields: log_level, log_format, log_file.

Scope: cli/app.py (logging setup), evolution.py (structured traces), new TracingPort in domain.
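
A minimal sketch of JSON log output with the standard logging module, including the handler attachment that the verbose-mode bug is missing. The logger name prometheus.evolution and the trace field passed through extra are assumptions.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` land on the record; `trace` is a hypothetical key.
        if hasattr(record, "trace"):
            entry["trace"] = record.trace
        return json.dumps(entry)


stream = io.StringIO()  # a real setup would use sys.stderr or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("prometheus.evolution")
logger.addHandler(handler)  # without a handler, INFO records are not shown
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("iteration done", extra={"trace": {"iteration": 3, "judge_scores": [0.6, 0.8]}})
logged = json.loads(stream.getvalue())
```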

10. CLI Improvements (P2)

Current state: Single optimize command. Known Typer 0.24 bug absorbing subcommands (Bug #1; listed as #3 under Known Bugs below). No version, init, or list-results commands.

Feature:

Fix Typer subcommand routing.
prometheus version — show version.
prometheus init — scaffold a config YAML interactively.
prometheus list — list past optimization runs.
prometheus diff — compare two result files (before/after prompt diff, score improvement).
prometheus eval — evaluate a prompt against a dataset without optimization.

Surface: CLI subcommands.

Scope: cli/app.py restructured into cli/commands/ with one module per command.

11. Input Validation & Schema Enforcement (P2)

Current state: Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline.

Feature:

Validate input YAML against a Pydantic schema (leveraging the existing pydantic dependency).
Provide clear, actionable error messages for missing/invalid fields.
Support config migration/upgrade from older versions.

Surface: Internal. Errors surface as clear CLI messages.

Scope: OptimizationConfig converted to Pydantic model with validators, cli/app.py validation step before pipeline execution.

12. Adaptive Minibatch Sizing (P3)

Current state: Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive.

Feature:

Start with a small minibatch for quick early iterations.
Increase minibatch size as the prompt improves (higher confidence needed for marginal gains).
Shrink if too many evaluations fail (cost optimization).

Surface: --adaptive-minibatch CLI flag (boolean toggle). minibatch_size becomes minibatch_size_min and minibatch_size_max in config.

Scope: EvolutionLoop, SyntheticBootstrap.
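
One way the sizing rule might be expressed, as a sketch. The doubling and halving steps and the thresholds grow_above and shrink_above are illustrative assumptions; only the min and max bounds correspond to the proposed config fields.

```python
def next_minibatch_size(current, best_score, failure_rate,
                        size_min=3, size_max=20, grow_above=0.8, shrink_above=0.3):
    """Grow when the prompt scores well, shrink when many evaluations fail."""
    if failure_rate > shrink_above:
        proposed = current // 2
    elif best_score >= grow_above:
        proposed = current * 2
    else:
        proposed = current
    # Clamp to the configured bounds.
    return max(size_min, min(size_max, proposed))
```

The failure check runs first so that a high score cannot grow the batch while evaluations are failing.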

13. Prompt Diversity Tracking (P3)

Current state: No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change.

Feature:

Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts.
Report diversity metrics in the result.
Flag stagnation (N iterations with <epsilon change).

Surface: Internal. Reported in OptimizationResult.history entries.

Scope: EvolutionLoop, OptimizationResult (add diversity field per history entry).
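
The edit-distance variant can be sketched with the standard two-row dynamic programming form of Levenshtein distance. The stagnation helper, its window, and the normalization by the longer prompt's length are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the two-row dynamic programming formulation."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def is_stagnant(prompts, window=3, epsilon=0.02):
    """True if each of the last `window` steps changed less than epsilon (normalized)."""
    if len(prompts) <= window:
        return False
    recent = prompts[-(window + 1):]
    changes = [
        levenshtein(old, new) / max(len(old), len(new), 1)
        for old, new in zip(recent, recent[1:])
    ]
    return all(change < epsilon for change in changes)
```

The per-step normalized change is also a natural value for the diversity field on each history entry.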

14. Temperature & Sampling Control (P3)

Current state: No way to control LLM temperature, top_p, or other sampling parameters for any of the four model slots. DSPy defaults apply.

Feature:

Per-model-slot temperature and sampling parameters.
Higher temperature for proposer (creativity), lower for judge (consistency).

Surface: task_temperature, judge_temperature, proposer_temperature, synth_temperature Config fields. --temperature CLI flag for global override.

Scope: cli/app.py (DSPy LM configuration), infrastructure adapters.

15. Cost Estimation & Budget Caps (P3)

Current state: total_llm_calls is tracked (inaccurately). No cost estimation, no budget caps.

Feature:

Estimate cost per run based on model pricing and approximate token counts.
Allow users to set a budget cap (--max-cost-usd).
Report estimated cost in the result.

Surface: --max-cost-usd CLI flag. max_cost_usd Config field. Cost breakdown in result output.

Scope: cli/app.py, OptimizationResult (add cost fields), token counting in adapters.
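
A sketch of cost estimation with a budget cap follows. The price table holds placeholder numbers for a made-up model, not real vendor pricing, and BudgetTracker and BudgetExceeded are hypothetical names.

```python
# Placeholder prices in USD per 1M tokens; not real vendor pricing.
HYPOTHETICAL_PRICES = {"model-a": {"input": 1.00, "output": 2.00}}


def estimate_cost_usd(model, input_tokens, output_tokens, prices=HYPOTHETICAL_PRICES):
    """Estimate the cost of one call from token counts and per-million-token prices."""
    price = prices[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


class BudgetExceeded(RuntimeError):
    """Raised when accumulated spend crosses the cap."""


class BudgetTracker:
    """Accumulate estimated spend and raise once max_cost_usd is crossed."""

    def __init__(self, max_cost_usd):
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.4f} of {self.max_cost_usd:.4f} USD"
            )


tracker = BudgetTracker(max_cost_usd=1.5)
tracker.charge(estimate_cost_usd("model-a", 500_000, 250_000))  # 0.5 + 0.5 = 1.0 USD
```

The evolution loop could catch BudgetExceeded and return the best candidate found so far instead of aborting without a result.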

16. Multi-Objective Optimization (P3)

Current state: Single scalar score from the judge. The Prompt entity comment mentions "Pareto tracking" but it's not implemented.

Feature:

Optimize for multiple objectives simultaneously (quality, latency, token efficiency, safety).
Maintain a Pareto front of non-dominated candidates.
Allow users to set objective weights or constraints.

Surface: objectives Config field (list of {name, weight, judge_criteria}). CLI: --objective repeatable flag.

Scope: EvolutionLoop (Pareto front), scoring.py (multi-objective acceptance), OptimizationResult (Pareto set).
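
Maintaining a front of non-dominated candidates can be sketched as below, assuming every objective is expressed so that higher is better. The quadratic scan is fine for small populations; the candidate names and objective values are made up.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the (name, objectives) pairs that no other candidate dominates."""
    return [
        (name, objs)
        for name, objs in candidates
        if not any(dominates(other, objs) for _, other in candidates)
    ]


# Objectives are (quality, token_efficiency); higher is better for both.
candidates = [
    ("A", (0.9, 0.2)),
    ("B", (0.7, 0.7)),
    ("C", (0.6, 0.6)),  # dominated by B
    ("D", (0.3, 0.9)),
]
front = pareto_front(candidates)
```

Objectives where lower is better, such as latency, would need to be negated or inverted before entering this comparison.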

17. Export Optimized Prompt (P3)

Current state: The optimized prompt is embedded in the YAML result file. No easy way to extract it for use.

Feature:

prometheus export command to extract the optimized prompt as plain text.
Support multiple export formats: plain text, Markdown, JSON, LangChain template, DSPy module.
Copy to clipboard option.

Surface: prometheus export --format <txt|md|json|langchain|dspy> CLI subcommand. --clipboard flag.

Scope: New cli/commands/export.py, format renderers in infrastructure.

18. Config Profiles / Presets (P3)

Current state: Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured.

Feature:

Named profiles: fast, thorough, economy, research.
Profile overrides individual config fields.
User-defined profiles stored in ~/.prometheus/profiles/.

Surface: --profile CLI flag. prometheus profile list / prometheus profile create subcommands.

Scope: cli/app.py, new ProfileManager in application.
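
The override order could be sketched as a layered merge. The precedence (base config, then profile, then explicit CLI flags) is an assumption, as are the profile values and the max_iterations field; only the profile names come from the list above.

```python
# Built-in profile values are illustrative; only the names come from the roadmap.
PROFILES = {
    "fast": {"max_iterations": 5, "minibatch_size": 3},
    "thorough": {"max_iterations": 50, "minibatch_size": 10},
}


def apply_profile(base_config, profile_name, cli_overrides=None):
    """Layer settings: base config < profile < explicit CLI overrides."""
    if profile_name not in PROFILES:
        raise KeyError(f"unknown profile: {profile_name}")
    merged = {**base_config, **PROFILES[profile_name]}
    # CLI flags left unset arrive as None and must not clobber profile values.
    merged.update({k: v for k, v in (cli_overrides or {}).items() if v is not None})
    return merged


base = {"task_model": "openai/gpt-4o-mini", "max_iterations": 20, "minibatch_size": 5}
effective = apply_profile(base, "fast", {"minibatch_size": 8, "max_iterations": None})
```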

Summary Table

#	Feature	Priority	CLI Surface	Config Surface	Estimated Scope
1	Multi-Model Routing	P1	Existing	Existing	Small
2	Async / Parallel Execution	P1	`--max-concurrency`	`max_concurrency`	Medium
3	Error Handling & Retry	P1	`--max-retries`, `--error-strategy`	`max_retries`, `error_strategy`	Medium
4	Checkpoint & Resume	P2	`--checkpoint-dir`, `--resume`	`checkpoint_interval`	Medium
5	Population-Based Evolution	P2	`--population-size`, `--crossover-rate`, `--mutation-rate`	`population_size`, `crossover_rate`	Large
6	Hold-Out Validation	P2	`--validation-split`, `--early-stop-patience`	`validation_split`, `early_stop_patience`	Medium
7	Custom Judge Criteria	P2	`--judge-criteria`	`judge_criteria`, `judge_dimensions`	Medium
8	Real-World Eval Harness	P2	`--eval-dataset`, `--eval-metric`	`eval_dataset_path`	Large
9	Logging & Observability	P2	`-v` / `--verbose`, `--debug`, `--log-format`, `--log-file`	`log_level`, `log_format`, `log_file`	Medium
10	CLI Improvements	P2	Subcommands	—	Medium
11	Input Validation	P2	— (error messages)	—	Small
12	Adaptive Minibatch	P3	`--adaptive-minibatch`	`minibatch_size_min/max`	Small
13	Prompt Diversity Tracking	P3	—	—	Small
14	Temperature & Sampling	P3	`--temperature`	`*_temperature`	Small
15	Cost Estimation	P3	`--max-cost-usd`	`max_cost_usd`	Small
16	Multi-Objective Optimization	P3	`--objective`	`objectives`	Large
17	Export Optimized Prompt	P3	`prometheus export`	—	Small
18	Config Profiles / Presets	P3	`--profile`	—	Small

Known Bugs (from TEST_REPORT.md and code review)

#	Bug	Severity	File
1	Multi-model config not wired — all adapters use single global LM	HIGH	`cli/app.py`, all adapters
2	`DSPyLLMAdapter` accepts `model` param but never uses it	HIGH	`infrastructure/llm_adapter.py`
3	CLI subcommand `optimize` absorbed by Typer 0.24	HIGH	`cli/app.py`
4	Verbose logging produces no output — no handler configured	MEDIUM	`cli/app.py`
5	`total_llm_calls` counter is inaccurate	LOW	`application/use_cases.py`, `evolution.py`
6	`normalize_score()` is dead code — never called	LOW	`domain/scoring.py`
7	`AppSettings` is never imported or used	LOW	`config.py`
8	No LLM error handling in evolution loop	MEDIUM	`evolution.py`
9	Unpinned dependencies (dspy, typer)	LOW	`pyproject.toml`

Test Coverage Gaps

Area	Current	Needed
CLI commands	0 tests	Unit + integration for each subcommand
Config validation	0 tests	Schema validation, missing fields, type errors
Evolution loop	3 tests (single iteration each)	Multi-iteration, mixed accept/reject, failure recovery
Integration pipeline	1 test (happy path only)	Error paths, mixed results, real adapters
Adapter coverage	1 adapter tested	All 4 adapters + error scenarios
Use case orchestration	1 indirect test	Direct unit tests for `OptimizePromptUseCase`

Recommended Implementation Order

Phase 1 — Production Reliability (P1)

Fix multi-model routing (#1) — highest impact, smallest scope
Add error handling & retry (#3) — essential for production runs
Implement async/parallel execution (#2) — biggest wall-clock improvement

Phase 2 — Optimization Quality (P2)

Input validation (#11) — small scope, high reliability gain
Logging & observability (#9) — enables debugging long runs
CLI improvements (#10) — fix Typer bug, add basic commands
Hold-out validation (#6) — prevents overfitting
Checkpoint & resume (#4) — essential for long runs
Custom judge criteria (#7) — enables domain-specific optimization

Phase 3 — Advanced Features (P3)

Population-based evolution (#5), a P2 item with Large estimated scope
Real-world eval harness (#8), a P2 item with Large estimated scope
Remaining P3 features as demand dictates
