# PROMETHEUS Feature Roadmap

> Complete codebase review — features needed for production-grade prompt optimization.
> Generated from v0.1.0 architecture review (2026-03-29).

---

## Legend

| Marker | Meaning |
|--------|---------|
| **CLI** | Exposed as a CLI option/flag |
| **Config** | YAML config field |
| **Internal** | No user-facing surface, architectural improvement |
| **P1** | Critical / must-have for reliability |
| **P2** | High value, should-have |
| **P3** | Nice-to-have, deferred to later versions |

---

## 1. Multi-Model Routing (P1)

**Current state:** `OptimizationConfig` defines four model slots (`task_model`, `judge_model`, `proposer_model`, `synth_model`), but `cli/app.py` only configures a single global DSPy LM from `task_model`. All adapters silently use the same model regardless of config.

**Feature:**
- Each adapter (`DSPyLLMAdapter`, `DSPyJudgeAdapter`, `DSPyProposerAdapter`, `DSPySyntheticAdapter`) must instantiate its own `dspy.LM` from the corresponding config field.
- Support per-model `api_base` and `api_key_env` overrides (e.g., judge on GPT-4o, propose on a cheaper model).

**Surface:** Config (already partially defined) — `judge_model`, `proposer_model`, `synth_model` become functional. No new CLI flags needed; the YAML already has the fields.

**Scope:** Infrastructure layer (`llm_adapter.py`, `judge_adapter.py`, `proposer_adapter.py`, `synth_adapter.py`) + `cli/app.py` DI wiring.

---

## 2. Async / Parallel Execution (P1)

**Current state:** All LLM calls (execute, judge, propose) are sequential. A single iteration with `minibatch_size=5` makes ~11 sequential LLM calls. Wall-clock time scales linearly with minibatch size.

**Feature:**
- Parallelize execution of the prompt across a minibatch (`asyncio.gather` or `dspy.Parallel`).
- Parallelize judge calls within a batch.
- Keep the proposer sequential (single call per iteration).

**Surface:** Internal. Optionally exposed via `--max-concurrency` CLI flag and `max_concurrency` YAML field.

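
The minibatch parallelism proposed in section 2 could be sketched with `asyncio.gather` and a semaphore acting as the `max_concurrency` cap. This is a minimal sketch under assumptions, not the project's implementation: `run_minibatch` and `fake_llm_call` are hypothetical names, and a real adapter would wrap an actual LLM request instead of the stand-in shown here.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_minibatch(
    inputs: List[str],
    call: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Run `call` on every input concurrently, at most `max_concurrency` at a time.

    Results are returned in the same order as `inputs`.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item: str) -> str:
        # The semaphore limits how many calls are in flight at once.
        async with semaphore:
            return await call(item)

    return await asyncio.gather(*(bounded(item) for item in inputs))


async def fake_llm_call(text: str) -> str:
    # Hypothetical stand-in for a real LLM request; sleeps to simulate latency.
    await asyncio.sleep(0.01)
    return text.upper()


if __name__ == "__main__":
    outputs = asyncio.run(
        run_minibatch(["a", "b", "c"], fake_llm_call, max_concurrency=2)
    )
    print(outputs)
```

Because `asyncio.gather` preserves input order, judge scores computed this way can still be matched back to their minibatch items by position. The proposer call would stay outside this helper, since the roadmap keeps it sequential.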

**Scope:** `evaluator.py`, `judge_adapter.py`, `llm_adapter.py`.

---

## 3. Robust Error Handling & Retry (P1)

**Current state:** The evolution loop catches broad `Exception` per iteration and logs it, then continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks only cover parsing, not network errors.

**Feature:**
- Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx).
- Configurable `max_retries` and `retry_delay_base`.
- Circuit breaker: if N consecutive iterations fail, pause and alert.
- Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation.

**Surface:** `--max-retries` CLI flag, `max_retries` Config field. `--error-strategy` (skip | retry | abort) CLI flag.

**Scope:** Infrastructure adapters + evolution loop.

---

## 4. Checkpoint & Resume (P2)

**Current state:** If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence.

**Feature:**
- Save `OptimizationState` to disk every K iterations (or every accepted improvement).
- Resume from the latest checkpoint file on restart.
- Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state.

**Surface:** `--checkpoint-dir` CLI flag (default: `.prometheus/checkpoints/`). `--resume` CLI flag to resume from latest checkpoint. `checkpoint_interval` Config field.

**Scope:** New `CheckpointPort` in domain, `JsonCheckpointPersistence` in infrastructure, modifications to `EvolutionLoop.run()`.

---

## 5. Population-Based Evolution (P2)

**Current state:** The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The `Candidate` entity has `generation` and `parent_id` fields that suggest population support was planned.

**Feature:**
- Maintain a population of K candidates (e.g., top-K by score or Pareto front).
- Crossover: combine instructions from two parent candidates.
- Mutation operators: paraphrase, constrain, generalize, specialize.
- Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance).

**Surface:** `--population-size` CLI flag, `population_size` Config field. `--crossover-rate`, `--mutation-rate` CLI flags.

**Scope:** `EvolutionLoop` refactor, new `CrossoverPort` and `MutationPort` in domain, new DSPy signatures for crossover/mutation in infrastructure.

---

## 6. Hold-Out Validation (P2)

**Current state:** The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs.

**Feature:**
- Split synthetic pool into train (e.g., 70%) and validation (30%) sets.
- Evolution uses train minibatches for accept/reject decisions.
- After each iteration, evaluate the best candidate on the hold-out set.
- Report both train and validation scores in results.
- Optional early stopping if validation score degrades for K consecutive iterations.

**Surface:** `--validation-split` CLI flag (default: 0.3). `--early-stop-patience` CLI flag (default: 5). Config fields: `validation_split`, `early_stop_patience`.

**Scope:** `SyntheticBootstrap`, `EvolutionLoop`, `OptimizationResult` (add validation metrics).

---

## 7. Custom Judge Criteria (P2)

**Current state:** The judge uses a hardcoded rubric in `JudgeOutput` DSPy signature ("score 0.0-1.0" with generic quality assessment). Users cannot customize evaluation criteria.

**Feature:**
- Allow users to define custom judge rubrics, criteria, and scoring scales.
- Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights.
- Allow `perfect_score` to reflect the custom scale.

**Surface:** `judge_criteria` YAML field (free text). `judge_dimensions` YAML field (list of `{name, weight, description}`). CLI: `--judge-criteria` for quick overrides.

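
The weighted multi-dimensional scoring proposed in section 7 might be aggregated along the following lines. This is a sketch under stated assumptions: `JudgeDimension` and `aggregate_scores` are hypothetical names, the `max_score` field is an assumed way of expressing a per-dimension scale (the roadmap only lists `name`, `weight`, and `description`), and normalizing the result to 0.0-1.0 is one possible choice that keeps it comparable with the current rubric.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JudgeDimension:
    name: str
    weight: float
    description: str = ""
    max_score: float = 10.0  # assumed scale; not specified in the roadmap


def aggregate_scores(dimensions: List[JudgeDimension], raw: Dict[str, float]) -> float:
    """Return the weighted mean of per-dimension scores, normalized to [0.0, 1.0]."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("dimension weights must sum to a positive number")
    weighted = 0.0
    for d in dimensions:
        if d.name not in raw:
            raise KeyError(f"missing score for dimension {d.name!r}")
        # Clamp to the declared scale so a malformed judge output cannot dominate.
        score = min(max(raw[d.name], 0.0), d.max_score)
        weighted += d.weight * (score / d.max_score)
    return weighted / total_weight


if __name__ == "__main__":
    dims = [
        JudgeDimension("accuracy", weight=2.0),
        JudgeDimension("clarity", weight=1.0),
        JudgeDimension("safety", weight=1.0),
    ]
    print(aggregate_scores(dims, {"accuracy": 10, "clarity": 5, "safety": 0}))
```

With a normalized aggregate, `perfect_score` could remain 1.0 regardless of the per-dimension scales; reporting on the raw custom scale instead would be an equally valid design.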

**Scope:** `JudgeOutput` signature (dynamic instructions), `JudgePort`, `DSPyJudgeAdapter`, `scoring.py` (weighted aggregation).

---

## 8. Real-World Evaluation Harness (P2)

**Current state:** The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs.

**Feature:**
- Accept an optional evaluation dataset (CSV/JSON with `input` and `expected_output` columns).
- When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge.
- Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected.

**Surface:** `--eval-dataset` CLI flag. `eval_dataset_path` Config field. `--eval-metric` CLI flag (exact | semantic | llm_judge).

**Scope:** New `GroundTruthEvaluator` in application, new `SimilarityPort` in domain, dataset loader in infrastructure.

---

## 9. Logging & Observability (P2)

**Current state:** Verbose mode (`-v`) configures Python's `logging` module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing.

**Feature:**
- Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
- JSON-formatted log output for machine parsing.
- Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff.
- Optional OpenTelemetry export for distributed tracing.

**Surface:** `-v` / `--verbose` enables INFO level. `--debug` enables DEBUG level. `--log-format` (text | json). `--log-file` for file output. Config fields: `log_level`, `log_format`, `log_file`.

**Scope:** `cli/app.py` (logging setup), `evolution.py` (structured traces), new `TracingPort` in domain.

---

## 10. CLI Improvements (P2)

**Current state:** Single `optimize` command. Known Typer 0.24 bug absorbing subcommands (Bug #1). No `version`, `init`, or `list-results` commands.

**Feature:**
- Fix Typer subcommand routing.
- `prometheus version` — show version.
- `prometheus init` — scaffold a config YAML interactively.
- `prometheus list` — list past optimization runs.
- `prometheus diff` — compare two result files (before/after prompt diff, score improvement).
- `prometheus eval` — evaluate a prompt against a dataset without optimization.

**Surface:** CLI subcommands.

**Scope:** `cli/app.py` restructured into `cli/commands/` with one module per command.

---

## 11. Input Validation & Schema Enforcement (P2)

**Current state:** Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline.

**Feature:**
- Validate input YAML against a Pydantic schema (leveraging the existing `pydantic` dependency).
- Provide clear, actionable error messages for missing/invalid fields.
- Support config migration/upgrade from older versions.

**Surface:** Internal. Errors surface as clear CLI messages.

**Scope:** `OptimizationConfig` converted to Pydantic model with validators, `cli/app.py` validation step before pipeline execution.

---

## 12. Adaptive Minibatch Sizing (P3)

**Current state:** Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive.

**Feature:**
- Start with a small minibatch for quick early iterations.
- Increase minibatch size as the prompt improves (higher confidence needed for marginal gains).
- Shrink if too many evaluations fail (cost optimization).

**Surface:** `--adaptive-minibatch` CLI flag (boolean toggle). `minibatch_size` becomes `minibatch_size_min` and `minibatch_size_max` in config.

**Scope:** `EvolutionLoop`, `SyntheticBootstrap`.

---

## 13. Prompt Diversity Tracking (P3)

**Current state:** No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change.

**Feature:**
- Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts.
- Report diversity metrics in the result.
- Flag stagnation (N iterations with […]

[…] CLI subcommand. `--clipboard` flag.

**Scope:** New `cli/commands/export.py`, format renderers in infrastructure.

---

## 18. Config Profiles / Presets (P3)

**Current state:** Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured.

**Feature:**
- Named profiles: `fast`, `thorough`, `economy`, `research`.
- Profile overrides individual config fields.
- User-defined profiles stored in `~/.prometheus/profiles/`.

**Surface:** `--profile` CLI flag. `prometheus profile list` / `prometheus profile create` subcommands.

**Scope:** `cli/app.py`, new `ProfileManager` in application.

---

## Summary Table

| # | Feature | Priority | CLI Surface | Config Surface | Estimated Scope |
|---|---------|----------|-------------|----------------|-----------------|
| 1 | Multi-Model Routing | P1 | Existing | Existing | Small |
| 2 | Async / Parallel Execution | P1 | `--max-concurrency` | `max_concurrency` | Medium |
| 3 | Error Handling & Retry | P1 | `--max-retries`, `--error-strategy` | `max_retries`, `error_strategy` | Medium |
| 4 | Checkpoint & Resume | P2 | `--checkpoint-dir`, `--resume` | `checkpoint_interval` | Medium |
| 5 | Population-Based Evolution | P2 | `--population-size`, `--crossover-rate` | `population_size`, `crossover_rate` | Large |
| 6 | Hold-Out Validation | P2 | `--validation-split`, `--early-stop-patience` | `validation_split`, `early_stop_patience` | Medium |
| 7 | Custom Judge Criteria | P2 | `--judge-criteria` | `judge_criteria`, `judge_dimensions` | Medium |
| 8 | Real-World Eval Harness | P2 | `--eval-dataset`, `--eval-metric` | `eval_dataset_path` | Large |
| 9 | Logging & Observability | P2 | `--debug`, `--log-format`, `--log-file` | `log_level`, `log_format` | Medium |
| 10 | CLI Improvements | P2 | Subcommands | — | Medium |
| 11 | Input Validation | P2 | — (error messages) | — | Small |
| 12 | Adaptive Minibatch | P3 | `--adaptive-minibatch` | `minibatch_size_min/max` | Small |
| 13 | Prompt Diversity Tracking | P3 | — | — | Small |
| 14 | Temperature & Sampling | P3 | `--temperature` | `*_temperature` | Small |
| 15 | Cost Estimation | P3 | `--max-cost-usd` | `max_cost_usd` | Small |
| 16 | Multi-Objective Optimization | P3 | `--objective` | `objectives` | Large |
| 17 | Export Optimized Prompt | P3 | `prometheus export` | — | Small |
| 18 | Config Profiles / Presets | P3 | `--profile` | — | Small |

---

## Known Bugs (from TEST_REPORT.md and code review)

| # | Bug | Severity | File |
|---|-----|----------|------|
| 1 | Multi-model config not wired — all adapters use single global LM | HIGH | `cli/app.py`, all adapters |
| 2 | `DSPyLLMAdapter` accepts `model` param but never uses it | HIGH | `infrastructure/llm_adapter.py` |
| 3 | CLI subcommand `optimize` absorbed by Typer 0.24 | HIGH | `cli/app.py` |
| 4 | Verbose logging produces no output — no handler configured | MEDIUM | `cli/app.py` |
| 5 | `total_llm_calls` counter is inaccurate | LOW | `application/use_cases.py`, `evolution.py` |
| 6 | `normalize_score()` is dead code — never called | LOW | `domain/scoring.py` |
| 7 | `AppSettings` is never imported or used | LOW | `config.py` |
| 8 | No LLM error handling in evolution loop | MEDIUM | `evolution.py` |
| 9 | Unpinned dependencies (dspy, typer) | LOW | `pyproject.toml` |

---

## Test Coverage Gaps

| Area | Current | Needed |
|------|---------|--------|
| CLI commands | 0 tests | Unit + integration for each subcommand |
| Config validation | 0 tests | Schema validation, missing fields, type errors |
| Evolution loop | 3 tests (single iteration each) | Multi-iteration, mixed accept/reject, failure recovery |
| Integration pipeline | 1 test (happy path only) | Error paths, mixed results, real adapters |
| Adapter coverage | 1 adapter tested | All 4 adapters + error scenarios |
| Use case orchestration | 1 indirect test | Direct unit tests for `OptimizePromptUseCase` |

---

## Recommended Implementation Order

### Phase 1 — Production Reliability (P1)

1. Fix multi-model routing (#1) — highest impact, smallest scope
2. Add error handling & retry (#3) — essential for production runs
3. Implement async/parallel execution (#2) — biggest wall-clock improvement

### Phase 2 — Optimization Quality (P2)

4. Input validation (#11) — small scope, high reliability gain
5. Logging & observability (#9) — enables debugging long runs
6. CLI improvements (#10) — fix Typer bug, add basic commands
7. Hold-out validation (#6) — prevents overfitting
8. Checkpoint & resume (#4) — essential for long runs
9. Custom judge criteria (#7) — enables domain-specific optimization

### Phase 3 — Advanced Features (P3)

10. Population-based evolution (#5)
11. Real-world eval harness (#8)
12. Remaining P3 features as demand dictates
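
As a starting point for the retry step that Phase 1 lists second (feature #3), exponential backoff around a single LLM call could look roughly like the sketch below. The helper name `call_with_retry`, the default set of transient exception types, and the 10% jitter are assumptions made for illustration; only the `max_retries` and `retry_delay_base` parameter names come from the roadmap.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay_base: float = 1.0,
    transient: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying transient failures with exponential backoff plus jitter."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except transient:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error to the caller
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            delay = retry_delay_base * (2 ** attempt)
            # Up to 10% jitter spreads out retries from concurrent callers.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Errors outside `transient` propagate immediately, which leaves the skip/abort decision of `--error-strategy` to the caller. The injectable `sleep` argument exists so the backoff schedule can be unit-tested without real delays.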