PROMETHEUS Feature Roadmap
Complete codebase review — features needed for production-grade prompt optimization. Generated from v0.1.0 architecture review (2026-03-29).
Legend
| Marker | Meaning |
|---|---|
| CLI | Exposed as a CLI option/flag |
| Config | YAML config field |
| Internal | No user-facing surface, architectural improvement |
| P1 | Critical / must-have for reliability |
| P2 | High value, should-have |
| P3 | Nice-to-have, deferred to later versions |
1. Multi-Model Routing (P1)
Current state: OptimizationConfig defines four model slots (task_model, judge_model, proposer_model, synth_model), but cli/app.py only configures a single global DSPy LM from task_model. All adapters silently use the same model regardless of config.
Feature:
- Each adapter (`DSPyLLMAdapter`, `DSPyJudgeAdapter`, `DSPyProposerAdapter`, `DSPySyntheticAdapter`) must instantiate its own `dspy.LM` from the corresponding config field.
- Support per-model `api_base` and `api_key_env` overrides (e.g., judge on GPT-4o, propose on a cheaper model).
Surface: Config (already partially defined) — judge_model, proposer_model, synth_model become functional. No new CLI flags needed; the YAML already has the fields.
Scope: Infrastructure layer (llm_adapter.py, judge_adapter.py, proposer_adapter.py, synth_adapter.py) + cli/app.py DI wiring.
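A minimal sketch of the per-slot resolution this feature implies. `ModelSlot`, `resolve_slots`, and the `<slot>_api_base` / `<slot>_api_key_env` field names are hypothetical, and falling back to `task_model` for unset slots is an assumption, not confirmed behavior. How each adapter builds its own `dspy.LM` from a resolved slot is deliberately left out.

```python
import os
from dataclasses import dataclass
from typing import Optional

SLOTS = ("task", "judge", "proposer", "synth")


@dataclass(frozen=True)
class ModelSlot:
    """Hypothetical per-slot settings; not part of the existing codebase."""
    model: str
    api_base: Optional[str] = None
    api_key_env: Optional[str] = None

    def api_key(self) -> Optional[str]:
        # Read the key from the named environment variable, if one is configured.
        return os.environ.get(self.api_key_env) if self.api_key_env else None


def resolve_slots(config: dict) -> dict:
    """Map each slot name to a ModelSlot, falling back to task_model when unset."""
    default_model = config["task_model"]
    resolved = {}
    for slot in SLOTS:
        resolved[slot] = ModelSlot(
            model=config.get(f"{slot}_model") or default_model,
            api_base=config.get(f"{slot}_api_base"),
            api_key_env=config.get(f"{slot}_api_key_env"),
        )
    return resolved


# Illustrative values only; the model names and env var are placeholders.
slots = resolve_slots({
    "task_model": "openai/gpt-4o-mini",
    "judge_model": "openai/gpt-4o",
    "judge_api_key_env": "JUDGE_API_KEY_PLACEHOLDER",
})
```

Resolving all four slots in one place would let the DI wiring in `cli/app.py` hand each adapter exactly one slot, rather than each adapter reading the global config.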
2. Async / Parallel Execution (P1)
Current state: All LLM calls (execute, judge, propose) are sequential. A single iteration with minibatch_size=5 makes ~11 sequential LLM calls. Wall-clock time scales linearly with minibatch size.
Feature:
- Parallelize execution of the prompt across a minibatch (`asyncio.gather` or `dspy.Parallel`).
- Parallelize judge calls within a batch.
- Keep the proposer sequential (single call per iteration).
Surface: Internal. Optionally exposed via --max-concurrency CLI flag and max_concurrency YAML field.
Scope: evaluator.py, judge_adapter.py, llm_adapter.py.
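A sketch of the `asyncio.gather` option with a concurrency cap, assuming the adapters expose awaitable calls. `run_bounded`, `fake_execute`, and `evaluate_minibatch` are hypothetical names; `fake_execute` stands in for a real LLM call.

```python
import asyncio


async def run_bounded(coros, max_concurrency: int = 4):
    """Await all coroutines, never running more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        async with semaphore:
            return await coro

    # gather preserves input order, so results line up with the minibatch items.
    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_execute(prompt: str, item: str) -> str:
    """Stand-in for an LLM call; a real adapter would await the model here."""
    await asyncio.sleep(0.01)
    return f"{prompt}:{item}"


async def evaluate_minibatch(prompt: str, minibatch: list) -> list:
    return await run_bounded(
        [fake_execute(prompt, item) for item in minibatch], max_concurrency=2
    )


outputs = asyncio.run(evaluate_minibatch("p", ["a", "b", "c"]))
```

The semaphore is what a `max_concurrency` setting would map to in this sketch; the proposer call stays outside it because it runs once per iteration.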
3. Robust Error Handling & Retry (P1)
Current state: The evolution loop catches broad Exception per iteration and logs it, then continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks only cover parsing, not network errors.
Feature:
- Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx).
- Configurable `max_retries` and `retry_delay_base`.
- Circuit breaker: if N consecutive iterations fail, pause and alert.
- Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation.
Surface: --max-retries CLI flag, max_retries Config field. --error-strategy (skip | retry | abort) CLI flag.
Scope: Infrastructure adapters + evolution loop.
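A sketch of retry with exponential backoff using the `max_retries` and `retry_delay_base` names above. `TransientError` and `call_with_retry` are hypothetical; which real exceptions count as transient depends on the LLM client and is not specified here. The 10% jitter is also an assumption.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limits, timeouts, 5xx)."""


def call_with_retry(fn, max_retries=3, retry_delay_base=1.0, sleep=time.sleep):
    """Call fn(); on TransientError wait base * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; the caller's error strategy decides
            delay = retry_delay_base * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


# Usage: a call that fails twice, then succeeds. Sleeping is stubbed out
# by recording the requested delays instead of waiting.
attempts = {"n": 0}
delays = []


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


result = call_with_retry(flaky, max_retries=3, retry_delay_base=0.5, sleep=delays.append)
```

Non-transient exceptions propagate immediately, which keeps the `skip | retry | abort` decision in the evolution loop rather than in the adapter.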
4. Checkpoint & Resume (P2)
Current state: If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence.
Feature:
- Save `OptimizationState` to disk every K iterations (or every accepted improvement).
- Resume from the latest checkpoint file on restart.
- Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state.
Surface: --checkpoint-dir CLI flag (default: .prometheus/checkpoints/). --resume CLI flag to resume from latest checkpoint. checkpoint_interval Config field.
Scope: New CheckpointPort in domain, JsonCheckpointPersistence in infrastructure, modifications to EvolutionLoop.run().
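A sketch of JSON checkpoint persistence. The function names, the `checkpoint_<iteration>.json` file naming, and the state keys are hypothetical. It also shows one way to round-trip Python's `random` state through JSON, which turns tuples into lists.

```python
import json
import os
import random
import tempfile
from pathlib import Path


def save_checkpoint(directory, iteration: int, state: dict) -> Path:
    """Write state to <directory>/checkpoint_<iteration>.json atomically."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"checkpoint_{iteration:06d}.json"
    # Write to a temp file first so a crash never leaves a half-written checkpoint.
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"iteration": iteration, **state}, handle)
    os.replace(tmp_name, target)
    return target


def load_latest_checkpoint(directory):
    """Return the newest checkpoint as a dict, or None if there is none."""
    # Zero-padded iteration numbers make lexicographic order match numeric order.
    files = sorted(Path(directory).glob("checkpoint_*.json"))
    if not files:
        return None
    return json.loads(files[-1].read_text(encoding="utf-8"))


def rng_state_to_json(state):
    """Convert random.getstate() output into JSON-friendly lists."""
    version, internal, gauss_next = state
    return [version, list(internal), gauss_next]


def rng_state_from_json(data):
    """Rebuild the tuple shape that random.setstate() expects."""
    version, internal, gauss_next = data
    return (version, tuple(internal), gauss_next)


with tempfile.TemporaryDirectory() as tmp:
    save_checkpoint(tmp, 5, {"best_prompt": "v1", "llm_calls": 55})
    save_checkpoint(tmp, 10, {"best_prompt": "v2", "llm_calls": 110})
    resumed = load_latest_checkpoint(tmp)
```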
5. Population-Based Evolution (P2)
Current state: The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The Candidate entity has generation and parent_id fields that suggest population support was planned.
Feature:
- Maintain a population of K candidates (e.g., top-K by score or Pareto front).
- Crossover: combine instructions from two parent candidates.
- Mutation operators: paraphrase, constrain, generalize, specialize.
- Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance).
Surface: --population-size CLI flag, population_size Config field. --crossover-rate, --mutation-rate CLI flags.
Scope: EvolutionLoop refactor, new CrossoverPort and MutationPort in domain, new DSPy signatures for crossover/mutation in infrastructure.
6. Hold-Out Validation (P2)
Current state: The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs.
Feature:
- Split synthetic pool into train (e.g., 70%) and validation (30%) sets.
- Evolution uses train minibatches for accept/reject decisions.
- After each iteration, evaluate the best candidate on the hold-out set.
- Report both train and validation scores in results.
- Optional early stopping if validation score degrades for K consecutive iterations.
Surface: --validation-split CLI flag (default: 0.3). --early-stop-patience CLI flag (default: 5). Config fields: validation_split, early_stop_patience.
Scope: SyntheticBootstrap, EvolutionLoop, OptimizationResult (add validation metrics).
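A sketch of the split and the patience check. `split_pool` and `should_stop` are hypothetical names. "Degrades" is read here as "fails to beat the best earlier validation score", a common patience variant; the roadmap does not fix the exact rule.

```python
import random


def split_pool(pool, validation_split=0.3, seed=0):
    """Shuffle a copy of the pool and split it into (train, validation)."""
    items = list(pool)
    random.Random(seed).shuffle(items)
    n_val = int(round(len(items) * validation_split))
    return items[n_val:], items[:n_val]


def should_stop(validation_scores, patience=5):
    """True when none of the last `patience` scores beats the best earlier score."""
    if len(validation_scores) <= patience:
        return False
    best_before = max(validation_scores[:-patience])
    return all(score <= best_before for score in validation_scores[-patience:])


train, validation = split_pool(range(10), validation_split=0.3)
stop_now = should_stop([0.5, 0.6, 0.58, 0.59, 0.6], patience=3)
keep_going = should_stop([0.5, 0.6, 0.58, 0.59, 0.7], patience=3)
```

Seeding the shuffle keeps the split stable across a resumed run, so validation items never leak into the train minibatches.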
7. Custom Judge Criteria (P2)
Current state: The judge uses a hardcoded rubric in JudgeOutput DSPy signature ("score 0.0-1.0" with generic quality assessment). Users cannot customize evaluation criteria.
Feature:
- Allow users to define custom judge rubrics, criteria, and scoring scales.
- Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights.
- Allow `perfect_score` to reflect the custom scale.
Surface: judge_criteria YAML field (free text). judge_dimensions YAML field (list of {name, weight, description}). CLI: --judge-criteria for quick overrides.
Scope: JudgeOutput signature (dynamic instructions), JudgePort, DSPyJudgeAdapter, scoring.py (weighted aggregation).
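A sketch of the weighted aggregation that `scoring.py` would need for multi-dimensional scores. `JudgeDimension` mirrors the `{name, weight, description}` shape above; the `scale_max` field and the `aggregate` function are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeDimension:
    """One scoring dimension; scale_max is an added, hypothetical field."""
    name: str
    weight: float
    description: str = ""
    scale_max: float = 10.0


def aggregate(scores: dict, dimensions: list) -> float:
    """Weighted mean of per-dimension scores, each normalized to 0.0-1.0."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("dimension weights must sum to a positive number")
    weighted = sum(d.weight * scores[d.name] / d.scale_max for d in dimensions)
    return weighted / total_weight


dimensions = [
    JudgeDimension("accuracy", weight=2.0),
    JudgeDimension("clarity", weight=1.0),
    JudgeDimension("safety", weight=1.0),
]
overall = aggregate({"accuracy": 8, "clarity": 6, "safety": 10}, dimensions)
```

Normalizing to 0.0-1.0 before weighting keeps the aggregate comparable with the current single-score rubric, whatever scale each dimension uses.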
8. Real-World Evaluation Harness (P2)
Current state: The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs.
Feature:
- Accept an optional evaluation dataset (CSV/JSON with `input` and `expected_output` columns).
- When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge.
- Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected.
Surface: --eval-dataset CLI flag. eval_dataset_path Config field. --eval-metric CLI flag (exact | semantic | llm_judge).
Scope: New GroundTruthEvaluator in application, new SimilarityPort in domain, dataset loader in infrastructure.
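A sketch of the `exact` metric over a CSV dataset with the `input` and `expected_output` columns named above. `load_dataset` and `exact_match_accuracy` are hypothetical helpers, `predict` stands in for running the optimized prompt, and trimming whitespace before comparison is an assumption.

```python
import csv
import io


def load_dataset(handle):
    """Read (input, expected_output) pairs from a CSV file object."""
    reader = csv.DictReader(handle)
    return [(row["input"], row["expected_output"]) for row in reader]


def exact_match_accuracy(predict, dataset) -> float:
    """Fraction of rows where predict(input) equals expected_output after trimming."""
    if not dataset:
        return 0.0
    hits = sum(
        1 for text, expected in dataset
        if predict(text).strip() == expected.strip()
    )
    return hits / len(dataset)


# In-memory CSV so the example runs without a file on disk.
sample = io.StringIO("input,expected_output\n2+2,4\ncapital of France,Paris\n")
dataset = load_dataset(sample)
accuracy = exact_match_accuracy(lambda text: "4" if text == "2+2" else "Lyon", dataset)
```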
9. Logging & Observability (P2)
Current state: Verbose mode (-v) configures Python's logging module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing.
Feature:
- Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
- JSON-formatted log output for machine parsing.
- Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff.
- Optional OpenTelemetry export for distributed tracing.
Surface: -v / --verbose enables INFO level. --debug enables DEBUG level. --log-format (text | json). --log-file for file output. Config fields: log_level, log_format, log_file.
Scope: cli/app.py (logging setup), evolution.py (structured traces), new TracingPort in domain.
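A sketch of the logging setup using only the standard library. The logger name `prometheus`, the `JsonFormatter` class, and the `trace` extra field are assumptions; the point is that a handler is attached explicitly, which is the missing step behind the verbose-mode bug.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"trace": {...}} become record attributes.
        trace = getattr(record, "trace", None)
        if trace is not None:
            payload["trace"] = trace
        return json.dumps(payload)


def configure_logging(level="INFO", log_format="text", stream=None):
    """Attach exactly one handler to the 'prometheus' logger."""
    logger = logging.getLogger("prometheus")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(
        JsonFormatter() if log_format == "json"
        else logging.Formatter("%(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = configure_logging(level="DEBUG", log_format="json", stream=buffer)
log.info("iteration done", extra={"trace": {"iteration": 3, "judge_scores": [0.7, 0.9]}})
```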
10. CLI Improvements (P2)
Current state: Single optimize command. A known Typer 0.24 bug absorbs subcommands (Bug #3 in the Known Bugs table). No version, init, or list-results commands.
Feature:
- Fix Typer subcommand routing.
- `prometheus version`: show version.
- `prometheus init`: scaffold a config YAML interactively.
- `prometheus list`: list past optimization runs.
- `prometheus diff`: compare two result files (before/after prompt diff, score improvement).
- `prometheus eval`: evaluate a prompt against a dataset without optimization.
Surface: CLI subcommands.
Scope: cli/app.py restructured into cli/commands/ with one module per command.
11. Input Validation & Schema Enforcement (P2)
Current state: Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline.
Feature:
- Validate input YAML against a Pydantic schema (leveraging the existing `pydantic` dependency).
- Provide clear, actionable error messages for missing/invalid fields.
- Support config migration/upgrade from older versions.
Surface: Internal. Errors surface as clear CLI messages.
Scope: OptimizationConfig converted to Pydantic model with validators, cli/app.py validation step before pipeline execution.
12. Adaptive Minibatch Sizing (P3)
Current state: Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive.
Feature:
- Start with a small minibatch for quick early iterations.
- Increase minibatch size as the prompt improves (higher confidence needed for marginal gains).
- Shrink if too many evaluations fail (cost optimization).
Surface: --adaptive-minibatch CLI flag (boolean toggle). minibatch_size becomes minibatch_size_min and minibatch_size_max in config.
Scope: EvolutionLoop, SyntheticBootstrap.
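A sketch of one possible sizing rule. `next_minibatch_size`, the step size, and the failure-rate threshold are all hypothetical, and "as the prompt improves" is read here as "after an accepted improvement". The bounds correspond to `minibatch_size_min` and `minibatch_size_max`.

```python
def next_minibatch_size(current, improved, failure_rate,
                        size_min=3, size_max=15, step=2, max_failure_rate=0.3):
    """Grow after an accepted improvement, shrink when too many evaluations fail."""
    if failure_rate > max_failure_rate:
        proposed = current - step
    elif improved:
        proposed = current + step
    else:
        proposed = current
    # Clamp to the configured bounds.
    return max(size_min, min(size_max, proposed))


grown = next_minibatch_size(5, improved=True, failure_rate=0.0)
shrunk = next_minibatch_size(5, improved=False, failure_rate=0.5)
```

Checking the failure rate first means a run that is both improving and failing often shrinks rather than grows, which favors cost over confidence.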
13. Prompt Diversity Tracking (P3)
Current state: No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change.
Feature:
- Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts.
- Report diversity metrics in the result.
- Flag stagnation (N iterations with <epsilon change).
Surface: Internal. Reported in OptimizationResult.history entries.
Scope: EvolutionLoop, OptimizationResult (add diversity field per history entry).
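A sketch of a per-iteration change metric and a stagnation flag. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for Levenshtein distance, so the numbers are not edit distances. `change_ratio`, `is_stagnant`, and the default `epsilon` are hypothetical.

```python
from difflib import SequenceMatcher


def change_ratio(previous: str, current: str) -> float:
    """0.0 for identical prompts, 1.0 when they share no matching text."""
    return 1.0 - SequenceMatcher(None, previous, current).ratio()


def is_stagnant(prompts, window=3, epsilon=0.02) -> bool:
    """True when each of the last `window` prompt changes is below epsilon."""
    if len(prompts) < window + 1:
        return False
    recent = prompts[-(window + 1):]
    return all(change_ratio(a, b) < epsilon for a, b in zip(recent, recent[1:]))


history = ["Summarize the text."] * 4
stalled = is_stagnant(history, window=3)
```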
14. Temperature & Sampling Control (P3)
Current state: No way to control LLM temperature, top_p, or other sampling parameters for any of the four model slots. DSPy defaults apply.
Feature:
- Per-model-slot temperature and sampling parameters.
- Higher temperature for proposer (creativity), lower for judge (consistency).
Surface: task_temperature, judge_temperature, proposer_temperature, synth_temperature Config fields. --temperature CLI flag for global override.
Scope: cli/app.py (DSPy LM configuration), infrastructure adapters.
15. Cost Estimation & Budget Caps (P3)
Current state: total_llm_calls is tracked (inaccurately). No cost estimation, no budget caps.
Feature:
- Estimate cost per run based on model pricing and approximate token counts.
- Allow users to set a budget cap (`--max-cost-usd`).
- Report estimated cost in the result.
Surface: --max-cost-usd CLI flag. max_cost_usd Config field. Cost breakdown in result output.
Scope: cli/app.py, OptimizationResult (add cost fields), token counting in adapters.
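A sketch of cost estimation with a budget cap. `Pricing`, `estimate_cost`, `BudgetTracker`, and `BudgetExceeded` are hypothetical names, and the prices in the usage lines are placeholders, not real model pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens, supplied by the caller; nothing is built in."""
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Estimated USD cost of one call from its token counts."""
    return (input_tokens * pricing.input_per_million
            + output_tokens * pricing.output_per_million) / 1_000_000


class BudgetExceeded(RuntimeError):
    """Raised when accumulated estimated spend passes the cap."""


class BudgetTracker:
    """Accumulates estimated spend and raises once max_cost_usd is crossed."""

    def __init__(self, max_cost_usd: float):
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"estimated spend {self.spent_usd:.4f} USD exceeds cap"
            )


# Placeholder prices for illustration only.
pricing = Pricing(input_per_million=1.0, output_per_million=2.0)
cost = estimate_cost(input_tokens=2_000, output_tokens=500, pricing=pricing)
tracker = BudgetTracker(max_cost_usd=0.005)
tracker.record(cost)
```

Because the cap is checked after each recorded call, a run can overshoot by at most one call's estimated cost.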
16. Multi-Objective Optimization (P3)
Current state: Single scalar score from the judge. The Prompt entity comment mentions "Pareto tracking" but it's not implemented.
Feature:
- Optimize for multiple objectives simultaneously (quality, latency, token efficiency, safety).
- Maintain a Pareto front of non-dominated candidates.
- Allow users to set objective weights or constraints.
Surface: objectives Config field (list of {name, weight, judge_criteria}). CLI: --objective repeatable flag.
Scope: EvolutionLoop (Pareto front), scoring.py (multi-objective acceptance), OptimizationResult (Pareto set).
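A sketch of the non-dominated filter behind a Pareto front. `dominates` and `pareto_front` are hypothetical names, the objective names in the usage lines are examples, and every objective is assumed to be higher-is-better (costs such as latency would need negating first).

```python
def dominates(a: dict, b: dict) -> bool:
    """a dominates b if it is no worse on every objective and better on at least one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: dict) -> list:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


candidates = {
    "A": {"quality": 0.9, "token_efficiency": 0.4},
    "B": {"quality": 0.7, "token_efficiency": 0.8},
    "C": {"quality": 0.6, "token_efficiency": 0.5},  # dominated by B
}
front = pareto_front(candidates)
```

This pairwise check is quadratic in the number of candidates, which should be acceptable for the small populations an optimization run would hold.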
17. Export Optimized Prompt (P3)
Current state: The optimized prompt is embedded in the YAML result file. No easy way to extract it for use.
Feature:
- `prometheus export` command to extract the optimized prompt as plain text.
- Support multiple export formats: plain text, Markdown, JSON, LangChain template, DSPy module.
- Copy to clipboard option.
Surface: prometheus export --format <txt|md|json|langchain|dspy> CLI subcommand. --clipboard flag.
Scope: New cli/commands/export.py, format renderers in infrastructure.
18. Config Profiles / Presets (P3)
Current state: Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured.
Feature:
- Named profiles: `fast`, `thorough`, `economy`, `research`.
- Profile overrides individual config fields.
- User-defined profiles stored in `~/.prometheus/profiles/`.
Surface: --profile CLI flag. prometheus profile list / prometheus profile create subcommands.
Scope: cli/app.py, new ProfileManager in application.
Summary Table
| # | Feature | Priority | CLI Surface | Config Surface | Estimated Scope |
|---|---|---|---|---|---|
| 1 | Multi-Model Routing | P1 | Existing | Existing | Small |
| 2 | Async / Parallel Execution | P1 | `--max-concurrency` | `max_concurrency` | Medium |
| 3 | Error Handling & Retry | P1 | `--max-retries`, `--error-strategy` | `max_retries`, `error_strategy` | Medium |
| 4 | Checkpoint & Resume | P2 | `--checkpoint-dir`, `--resume` | `checkpoint_interval` | Medium |
| 5 | Population-Based Evolution | P2 | `--population-size`, `--crossover-rate` | `population_size`, `crossover_rate` | Large |
| 6 | Hold-Out Validation | P2 | `--validation-split`, `--early-stop-patience` | `validation_split`, `early_stop_patience` | Medium |
| 7 | Custom Judge Criteria | P2 | `--judge-criteria` | `judge_criteria`, `judge_dimensions` | Medium |
| 8 | Real-World Eval Harness | P2 | `--eval-dataset`, `--eval-metric` | `eval_dataset_path` | Large |
| 9 | Logging & Observability | P2 | `--debug`, `--log-format`, `--log-file` | `log_level`, `log_format` | Medium |
| 10 | CLI Improvements | P2 | Subcommands | — | Medium |
| 11 | Input Validation | P2 | — (error messages) | — | Small |
| 12 | Adaptive Minibatch | P3 | `--adaptive-minibatch` | `minibatch_size_min/max` | Small |
| 13 | Prompt Diversity Tracking | P3 | — | — | Small |
| 14 | Temperature & Sampling | P3 | `--temperature` | `*_temperature` | Small |
| 15 | Cost Estimation | P3 | `--max-cost-usd` | `max_cost_usd` | Small |
| 16 | Multi-Objective Optimization | P3 | `--objective` | `objectives` | Large |
| 17 | Export Optimized Prompt | P3 | `prometheus export` | — | Small |
| 18 | Config Profiles / Presets | P3 | `--profile` | — | Small |
Known Bugs (from TEST_REPORT.md and code review)
| # | Bug | Severity | File |
|---|---|---|---|
| 1 | Multi-model config not wired — all adapters use single global LM | HIGH | cli/app.py, all adapters |
| 2 | `DSPyLLMAdapter` accepts `model` param but never uses it | HIGH | infrastructure/llm_adapter.py |
| 3 | CLI subcommand `optimize` absorbed by Typer 0.24 | HIGH | cli/app.py |
| 4 | Verbose logging produces no output — no handler configured | MEDIUM | cli/app.py |
| 5 | `total_llm_calls` counter is inaccurate | LOW | application/use_cases.py, evolution.py |
| 6 | `normalize_score()` is dead code, never called | LOW | domain/scoring.py |
| 7 | `AppSettings` is never imported or used | LOW | config.py |
| 8 | No LLM error handling in evolution loop | MEDIUM | evolution.py |
| 9 | Unpinned dependencies (dspy, typer) | LOW | pyproject.toml |
Test Coverage Gaps
| Area | Current | Needed |
|---|---|---|
| CLI commands | 0 tests | Unit + integration for each subcommand |
| Config validation | 0 tests | Schema validation, missing fields, type errors |
| Evolution loop | 3 tests (single iteration each) | Multi-iteration, mixed accept/reject, failure recovery |
| Integration pipeline | 1 test (happy path only) | Error paths, mixed results, real adapters |
| Adapter coverage | 1 adapter tested | All 4 adapters + error scenarios |
| Use case orchestration | 1 indirect test | Direct unit tests for OptimizePromptUseCase |
Recommended Implementation Order
Phase 1 — Production Reliability (P1)
- Fix multi-model routing (#1) — highest impact, smallest scope
- Add error handling & retry (#3) — essential for production runs
- Implement async/parallel execution (#2) — biggest wall-clock improvement
Phase 2 — Optimization Quality (P2)
- Input validation (#11) — small scope, high reliability gain
- Logging & observability (#9) — enables debugging long runs
- CLI improvements (#10) — fix Typer bug, add basic commands
- Hold-out validation (#6) — prevents overfitting
- Checkpoint & resume (#4) — essential for long runs
- Custom judge criteria (#7) — enables domain-specific optimization
Phase 3 — Advanced Features (P3)
- Population-based evolution (#5)
- Real-world eval harness (#8)
- Remaining P3 features as demand dictates