
# PROMETHEUS Feature Roadmap

> Complete codebase review — features needed for production-grade prompt optimization.
> Generated from v0.1.0 architecture review (2026-03-29).

---

## Legend

| Marker | Meaning |
|--------|---------|
| **CLI** | Exposed as a CLI option/flag |
| **Config** | YAML config field |
| **Internal** | Architectural improvement with no user-facing surface |
| **P1** | Critical / must-have for reliability |
| **P2** | High value, should-have |
| **P3** | Nice-to-have, deferred to later versions |

---

## 1. Multi-Model Routing (P1)

**Current state:** `OptimizationConfig` defines four model slots (`task_model`, `judge_model`, `proposer_model`, `synth_model`), but `cli/app.py` only configures a single global DSPy LM from `task_model`. All adapters silently use the same model regardless of config.

**Feature:**
- Each adapter (`DSPyLLMAdapter`, `DSPyJudgeAdapter`, `DSPyProposerAdapter`, `DSPySyntheticAdapter`) must instantiate its own `dspy.LM` from the corresponding config field.
- Support per-model `api_base` and `api_key_env` overrides (e.g., judge on GPT-4o, propose on a cheaper model).

**Surface:** Config (already partially defined) — `judge_model`, `proposer_model`, `synth_model` become functional. No new CLI flags needed; the YAML already has the fields.

**Scope:** Infrastructure layer (`llm_adapter.py`, `judge_adapter.py`, `proposer_adapter.py`, `synth_adapter.py`) + `cli/app.py` DI wiring.
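
**Sketch (illustrative, not existing code):** One way the per-slot settings might be resolved before each adapter builds its own LM. The `ModelSlot` dataclass, the `resolve_slot` helper, and the rule that an unset slot falls back to `task_model` are all assumptions made for this example. The returned dict is meant to be handed to whatever LM constructor the adapter uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelSlot:
    """Hypothetical per-slot settings; field names mirror the roadmap text."""
    model: str
    api_base: Optional[str] = None
    api_key_env: Optional[str] = None  # name of the env var that holds the key


def resolve_slot(config: Mapping[str, object], slot: str,
                 env: Mapping[str, str] = os.environ) -> dict:
    """Build LM keyword arguments for one slot, e.g. "judge_model".

    Assumption: a slot that is missing or empty falls back to task_model.
    """
    raw = config.get(slot) or config["task_model"]
    spec = raw if isinstance(raw, ModelSlot) else ModelSlot(model=str(raw))
    kwargs = {"model": spec.model}
    if spec.api_base:
        kwargs["api_base"] = spec.api_base
    if spec.api_key_env:
        if spec.api_key_env not in env:
            raise KeyError(f"{slot}: environment variable {spec.api_key_env} is not set")
        kwargs["api_key"] = env[spec.api_key_env]
    return kwargs
```

Failing fast on a missing key variable keeps a misconfigured judge or proposer from silently reusing the task model's credentials.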

---

## 2. Async / Parallel Execution (P1)

**Current state:** All LLM calls (execute, judge, propose) are sequential. A single iteration with `minibatch_size=5` makes ~11 sequential LLM calls. Wall-clock time scales linearly with minibatch size.

**Feature:**
- Parallelize execution of the prompt across a minibatch (`asyncio.gather` or `dspy.Parallel`).
- Parallelize judge calls within a batch.
- Keep the proposer sequential (single call per iteration).

**Surface:** Internal. Optionally exposed via `--max-concurrency` CLI flag and `max_concurrency` YAML field.

**Scope:** `evaluator.py`, `judge_adapter.py`, `llm_adapter.py`.
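
**Sketch (illustrative, not existing code):** A bounded fan-out over a minibatch using `asyncio.gather` and a semaphore. The helper name `gather_bounded` and the assumption that adapters expose an async `worker` callable are hypothetical. The same pattern would apply to the judge calls, while the proposer stays a single awaited call.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_bounded(items: Sequence[T],
                         worker: Callable[[T], Awaitable[R]],
                         max_concurrency: int = 4) -> List[R]:
    """Run worker(item) for every item, at most max_concurrency at a time.

    Results come back in the same order as items.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: T) -> R:
        async with semaphore:  # blocks while max_concurrency calls are in flight
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

With `minibatch_size=5` and a concurrency limit of 5 or more, the five execute calls would overlap instead of running back to back. Wall-clock time for that stage would then be roughly that of the slowest call.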

---

## 3. Robust Error Handling & Retry (P1)

**Current state:** The evolution loop catches a broad `Exception` per iteration, logs it, and continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks cover only parsing errors, not network errors.

**Feature:**
- Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx).
- Configurable `max_retries` and `retry_delay_base`.
- Circuit breaker: if N consecutive iterations fail, pause and alert.
- Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation.

**Surface:** `--max-retries` CLI flag, `max_retries` Config field. `--error-strategy` (skip | retry | abort) CLI flag.

**Scope:** Infrastructure adapters + evolution loop.
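
**Sketch (illustrative, not existing code):** Retry with exponential backoff around a single LLM call. `TransientLLMError` is a hypothetical marker exception, since the roadmap does not say which exception types the adapters raise. The 10% jitter is also an assumption. `max_retries` and `retry_delay_base` follow the names used above.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientLLMError(Exception):
    """Hypothetical marker for rate limits, timeouts, and 5xx responses."""


def call_with_retry(fn: Callable[[], T],
                    max_retries: int = 3,
                    retry_delay_base: float = 1.0,
                    retry_on: Tuple[Type[BaseException], ...] = (TransientLLMError,),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying transient failures with exponential backoff.

    The delay before retry k (0-based) is retry_delay_base * 2**k plus up to
    10% random jitter. Exceptions not listed in retry_on propagate at once.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted; the caller applies its error strategy
            delay = retry_delay_base * (2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable")
```

Wrapping each minibatch item separately gives the per-call isolation described above: an item that exhausts its retries can be skipped or can abort the run, depending on the chosen error strategy.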

---

## 4. Checkpoint & Resume (P2)

**Current state:** If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence.

**Feature:**
- Save `OptimizationState` to disk every K iterations (or every accepted improvement).
- Resume from the latest checkpoint file on restart.
- Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state.

**Surface:** `--checkpoint-dir` CLI flag (default: `.prometheus/checkpoints/`). `--resume` CLI flag to resume from latest checkpoint. `checkpoint_interval` Config field.

**Scope:** New `CheckpointPort` in domain, `JsonCheckpointPersistence` in infrastructure, modifications to `EvolutionLoop.run()`.
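
**Sketch (illustrative, not existing code):** JSON checkpointing that also captures the RNG state, so a resumed run draws the same minibatches it would have drawn. The function names and the shape of the `state` dict are assumptions. A real `JsonCheckpointPersistence` would serialize `OptimizationState` instead of a plain dict.

```python
import json
import os
import random
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_checkpoint(path: Path, state: Dict[str, Any], rng: random.Random) -> None:
    """Write state plus RNG state to path via a temp file and a rename."""
    version, internal, gauss_next = rng.getstate()
    payload = {"state": state, "rng": [version, list(internal), gauss_next]}
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    # Swap the finished file into place so a crash never leaves a half-written checkpoint.
    os.replace(tmp_name, path)


def load_checkpoint(path: Path, rng: random.Random) -> Dict[str, Any]:
    """Restore the RNG in place and return the saved state dict."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    version, internal, gauss_next = payload["rng"]
    # JSON turns tuples into lists; setstate needs the inner state as a tuple again.
    rng.setstate((version, tuple(internal), gauss_next))
    return payload["state"]
```

Storing the full generator state rather than only the original seed matters because the loop has already consumed random numbers by the time a checkpoint is written.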

---

## 5. Population-Based Evolution (P2)

**Current state:** The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The `Candidate` entity has `generation` and `parent_id` fields that suggest population support was planned.

**Feature:**
- Maintain a population of K candidates (e.g., top-K by score or Pareto front).
- Crossover: combine instructions from two parent candidates.
- Mutation operators: paraphrase, constrain, generalize, specialize.
- Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance).

**Surface:** `--population-size` CLI flag, `population_size` Config field. `--crossover-rate`, `--mutation-rate` CLI flags.

**Scope:** `EvolutionLoop` refactor, new `CrossoverPort` and `MutationPort` in domain, new DSPy signatures for crossover/mutation in infrastructure.

---

## 6. Hold-Out Validation (P2)

**Current state:** The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs.

**Feature:**
- Split synthetic pool into train (e.g., 70%) and validation (30%) sets.
- Evolution uses train minibatches for accept/reject decisions.
- After each iteration, evaluate the best candidate on the hold-out set.
- Report both train and validation scores in results.
- Optional early stopping if validation score degrades for K consecutive iterations.

**Surface:** `--validation-split` CLI flag (default: 0.3). `--early-stop-patience` CLI flag (default: 5). Config fields: `validation_split`, `early_stop_patience`.

**Scope:** `SyntheticBootstrap`, `EvolutionLoop`, `OptimizationResult` (add validation metrics).
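
**Sketch (illustrative, not existing code):** A deterministic train/validation split and a patience check. Both helper names are hypothetical. The early-stopping rule reads "degrades for K consecutive iterations" as a strict drop versus the previous iteration, which is one possible interpretation. A "no new best for K iterations" rule would be a reasonable alternative.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_pool(pool: Sequence[T], validation_split: float = 0.3,
               seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically, then cut into (train, validation)."""
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be strictly between 0 and 1")
    if len(pool) < 2:
        raise ValueError("need at least two items to split")
    items = list(pool)
    random.Random(seed).shuffle(items)
    # Keep at least one item on each side of the split.
    n_val = min(len(items) - 1, max(1, round(len(items) * validation_split)))
    return items[n_val:], items[:n_val]


def should_stop(val_scores: Sequence[float], patience: int = 5) -> bool:
    """True if the validation score dropped in each of the last `patience` iterations."""
    if patience < 1:
        raise ValueError("patience must be >= 1")
    if len(val_scores) <= patience:
        return False
    recent = list(val_scores[-(patience + 1):])
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```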

---

## 7. Custom Judge Criteria (P2)

**Current state:** The judge uses a hardcoded rubric in the `JudgeOutput` DSPy signature ("score 0.0-1.0" with a generic quality assessment). Users cannot customize the evaluation criteria.

**Feature:**
- Allow users to define custom judge rubrics, criteria, and scoring scales.
- Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights.
- Allow `perfect_score` to reflect the custom scale.

**Surface:** `judge_criteria` YAML field (free text). `judge_dimensions` YAML field (list of `{name, weight, description}`). CLI: `--judge-criteria` for quick overrides.

**Scope:** `JudgeOutput` signature (dynamic instructions), `JudgePort`, `DSPyJudgeAdapter`, `scoring.py` (weighted aggregation).
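
**Sketch (illustrative, not existing code):** Weighted aggregation of multi-dimensional judge scores into a single 0.0-1.0 value. The `JudgeDimension` dataclass extends the `{name, weight, description}` shape with a hypothetical `max_score` field so each dimension can carry its own scale. Neither that field nor the normalization rule is confirmed by the current code.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class JudgeDimension:
    """Hypothetical shape of one judge_dimensions entry plus its scale."""
    name: str
    weight: float
    description: str = ""
    max_score: float = 10.0  # assumed upper bound of this dimension's scale


def aggregate_score(dimensions: Sequence[JudgeDimension],
                    raw_scores: Mapping[str, float]) -> float:
    """Weighted mean of per-dimension scores, each rescaled to 0.0-1.0."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("dimension weights must sum to a positive number")
    weighted = 0.0
    for dim in dimensions:
        score = raw_scores[dim.name]
        if not 0.0 <= score <= dim.max_score:
            raise ValueError(f"{dim.name}: score {score} outside 0..{dim.max_score}")
        weighted += dim.weight * (score / dim.max_score)
    return weighted / total_weight
```

Because the result is already normalized, `perfect_score` could stay at 1.0 under this scheme even when individual dimensions use a 0-10 scale.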

---

## 8. Real-World Evaluation Harness (P2)

**Current state:** The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs.

**Feature:**
- Accept an optional evaluation dataset (CSV/JSON with `input` and `expected_output` columns).
- When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge.
- Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected.

**Surface:** `--eval-dataset` CLI flag. `eval_dataset_path` Config field. `--eval-metric` CLI flag (exact | semantic | llm_judge).

**Scope:** New `GroundTruthEvaluator` in application, new `SimilarityPort` in domain, dataset loader in infrastructure.
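
**Sketch (illustrative, not existing code):** Loading a CSV evaluation dataset and computing exact-match accuracy, the simplest of the metrics listed above. The function names are hypothetical, and `run_prompt` stands in for whatever callable executes the optimized prompt on one input. Trimming whitespace before comparison is an assumption.

```python
import csv
import io
from typing import Callable, Iterable, List, TextIO, Tuple


def load_eval_rows(handle: TextIO) -> List[Tuple[str, str]]:
    """Read (input, expected_output) pairs from a CSV file object."""
    reader = csv.DictReader(handle)
    missing = {"input", "expected_output"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"eval dataset is missing columns: {sorted(missing)}")
    return [(row["input"], row["expected_output"]) for row in reader]


def exact_match_accuracy(rows: Iterable[Tuple[str, str]],
                         run_prompt: Callable[[str], str]) -> float:
    """Fraction of rows whose output equals the expected output after trimming."""
    pairs = list(rows)
    if not pairs:
        raise ValueError("eval dataset is empty")
    hits = sum(run_prompt(text).strip() == expected.strip() for text, expected in pairs)
    return hits / len(pairs)


# Usage with an in-memory dataset and a stub in place of a real LLM call.
sample = io.StringIO("input,expected_output\n2+2,4\ncapital of France,Paris\n")
rows = load_eval_rows(sample)
accuracy = exact_match_accuracy(rows, lambda text: "4" if text == "2+2" else "Lyon")
print(accuracy)  # 0.5
```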

---

## 9. Logging & Observability (P2)

**Current state:** Verbose mode (`-v`) configures Python's `logging` module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing.

**Feature:**
- Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
- JSON-formatted log output for machine parsing.
- Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff.
- Optional OpenTelemetry export for distributed tracing.

**Surface:** `-v` / `--verbose` enables INFO level. `--debug` enables DEBUG level. `--log-format` (text | json). `--log-file` for file output. Config fields: `log_level`, `log_format`, `log_file`.

**Scope:** `cli/app.py` (logging setup), `evolution.py` (structured traces), new `TracingPort` in domain.
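
**Sketch (illustrative, not existing code):** A JSON formatter and a setup function that attaches a handler to the root logger, using only the standard `logging` module. The `configure_logging` name, the output field names, and the convention of passing per-iteration data through `extra={"trace": ...}` are assumptions.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: per-iteration fields arrive via extra={"trace": {...}}.
        trace = getattr(record, "trace", None)
        if trace is not None:
            payload["trace"] = trace
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure_logging(level: str = "INFO", log_format: str = "text",
                      stream=sys.stderr) -> logging.Handler:
    """Attach exactly one handler to the root logger and return it."""
    handler = logging.StreamHandler(stream)
    if log_format == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
    root = logging.getLogger()
    root.handlers[:] = [handler]  # replace, so repeated calls do not duplicate output
    root.setLevel(level.upper())
    return handler
```

Attaching a handler explicitly is what the verbose path is currently missing: setting a level alone produces no output for INFO messages.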

---

## 10. CLI Improvements (P2)

**Current state:** There is a single `optimize` command. A known Typer 0.24 bug absorbs subcommands (Bug #3 in the Known Bugs table). There are no `version`, `init`, or `list` commands.

**Feature:**
- Fix Typer subcommand routing.
- `prometheus version` — show version.
- `prometheus init` — scaffold a config YAML interactively.
- `prometheus list` — list past optimization runs.
- `prometheus diff` — compare two result files (before/after prompt diff, score improvement).
- `prometheus eval` — evaluate a prompt against a dataset without optimization.

**Surface:** CLI subcommands.

**Scope:** `cli/app.py` restructured into `cli/commands/` with one module per command.

---

## 11. Input Validation & Schema Enforcement (P2)

**Current state:** Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline.

**Feature:**
- Validate input YAML against a Pydantic schema (leveraging the existing `pydantic` dependency).
- Provide clear, actionable error messages for missing/invalid fields.
- Support config migration/upgrade from older versions.

**Surface:** Internal. Errors surface as clear CLI messages.

**Scope:** `OptimizationConfig` converted to Pydantic model with validators, `cli/app.py` validation step before pipeline execution.

---

## 12. Adaptive Minibatch Sizing (P3)

**Current state:** Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive.

**Feature:**
- Start with a small minibatch for quick early iterations.
- Increase minibatch size as the prompt improves (higher confidence needed for marginal gains).
- Shrink if too many evaluations fail (cost optimization).

**Surface:** `--adaptive-minibatch` CLI flag (boolean toggle). `minibatch_size` becomes `minibatch_size_min` and `minibatch_size_max` in config.

**Scope:** `EvolutionLoop`, `SyntheticBootstrap`.

---

## 13. Prompt Diversity Tracking (P3)

**Current state:** No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change.

**Feature:**
- Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts.
- Report diversity metrics in the result.
- Flag stagnation (N iterations with <epsilon change).

**Surface:** Internal. Reported in `OptimizationResult.history` entries.

**Scope:** `EvolutionLoop`, `OptimizationResult` (add diversity field per history entry).
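
**Sketch (illustrative, not existing code):** A plain Levenshtein distance, a normalized change metric, and a stagnation check over the prompt history. The helper names, the normalization by the longer prompt, and the default `window` and `epsilon` values are assumptions.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_change(old: str, new: str) -> float:
    """Edit distance scaled to 0.0 (identical) through 1.0 (nothing in common)."""
    longest = max(len(old), len(new))
    return levenshtein(old, new) / longest if longest else 0.0


def is_stagnant(prompts: Sequence[str], window: int = 3, epsilon: float = 0.02) -> bool:
    """True if each of the last `window` prompt transitions changed less than epsilon."""
    if len(prompts) <= window:
        return False
    recent = list(prompts[-(window + 1):])
    return all(normalized_change(p, q) < epsilon for p, q in zip(recent, recent[1:]))
```

Character-level distance is cheap but cannot tell a rephrase from a structural change. Embedding similarity would be needed for that distinction.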

---

## 14. Temperature & Sampling Control (P3)

**Current state:** No way to control LLM temperature, top_p, or other sampling parameters for any of the four model slots. DSPy defaults apply.

**Feature:**
- Per-model-slot temperature and sampling parameters.
- Higher temperature for proposer (creativity), lower for judge (consistency).

**Surface:** `task_temperature`, `judge_temperature`, `proposer_temperature`, `synth_temperature` Config fields. `--temperature` CLI flag for global override.

**Scope:** `cli/app.py` (DSPy LM configuration), infrastructure adapters.

---

## 15. Cost Estimation & Budget Caps (P3)

**Current state:** `total_llm_calls` is tracked, but inaccurately (Bug #5 in the Known Bugs table). There is no cost estimation and no budget cap.

**Feature:**
- Estimate cost per run based on model pricing and approximate token counts.
- Allow users to set a budget cap (`--max-cost-usd`).
- Report estimated cost in the result.

**Surface:** `--max-cost-usd` CLI flag. `max_cost_usd` Config field. Cost breakdown in result output.

**Scope:** `cli/app.py`, `OptimizationResult` (add cost fields), token counting in adapters.
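
**Sketch (illustrative, not existing code):** Accumulating an estimated cost from token counts and enforcing a cap. `ModelPrice`, `CostTracker`, and `BudgetExceeded` are hypothetical names. Prices are supplied by the caller, since the roadmap does not say where pricing data would come from.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelPrice:
    """USD per one million tokens; the caller must supply real values."""
    input_per_mtok: float
    output_per_mtok: float


class BudgetExceeded(RuntimeError):
    """Raised when the estimated spend passes max_cost_usd."""


class CostTracker:
    """Accumulates estimated spend and enforces an optional cap."""

    def __init__(self, prices: Dict[str, ModelPrice],
                 max_cost_usd: Optional[float] = None) -> None:
        self.prices = prices
        self.max_cost_usd = max_cost_usd
        self.total_usd = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add one call's estimated cost; raise BudgetExceeded once over the cap."""
        price = self.prices[model]
        cost = (input_tokens * price.input_per_mtok
                + output_tokens * price.output_per_mtok) / 1_000_000
        self.total_usd += cost
        if self.max_cost_usd is not None and self.total_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"estimated ${self.total_usd:.4f} exceeds cap ${self.max_cost_usd:.4f}")
        return cost
```

This check fires after the call that crosses the cap has already been paid for. A stricter variant would estimate the next call's cost up front and refuse to make it.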

---

## 16. Multi-Objective Optimization (P3)

**Current state:** Single scalar score from the judge. The `Prompt` entity comment mentions "Pareto tracking" but it's not implemented.

**Feature:**
- Optimize for multiple objectives simultaneously (quality, latency, token efficiency, safety).
- Maintain a Pareto front of non-dominated candidates.
- Allow users to set objective weights or constraints.

**Surface:** `objectives` Config field (list of `{name, weight, judge_criteria}`). CLI: `--objective` repeatable flag.

**Scope:** `EvolutionLoop` (Pareto front), `scoring.py` (multi-objective acceptance), `OptimizationResult` (Pareto set).
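
**Sketch (illustrative, not existing code):** A non-dominated filter for maintaining a Pareto front. Scores are plain dicts keyed by objective name, which is an assumed representation. Every objective is treated as "higher is better", so costs such as latency or token count would need to be negated or inverted first.

```python
from typing import Dict, List, Sequence

Scores = Dict[str, float]


def dominates(a: Scores, b: Scores) -> bool:
    """a dominates b if it is no worse on every objective and better on at least one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: Sequence[Scores]) -> List[Scores]:
    """Return the non-dominated candidates in input order. O(n^2) comparisons."""
    return [
        cand for i, cand in enumerate(candidates)
        if not any(dominates(other, cand)
                   for j, other in enumerate(candidates) if j != i)
    ]
```

Candidates with identical scores do not dominate each other, so both stay on the front. A population cap would need a separate tie-breaking rule.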

---

## 17. Export Optimized Prompt (P3)

**Current state:** The optimized prompt is embedded in the YAML result file. No easy way to extract it for use.

**Feature:**
- `prometheus export` command to extract the optimized prompt as plain text.
- Support multiple export formats: plain text, Markdown, JSON, LangChain template, DSPy module.
- Copy to clipboard option.

**Surface:** `prometheus export --format <txt|md|json|langchain|dspy>` CLI subcommand. `--clipboard` flag.

**Scope:** New `cli/commands/export.py`, format renderers in infrastructure.

---

## 18. Config Profiles / Presets (P3)

**Current state:** Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured.

**Feature:**
- Named profiles: `fast`, `thorough`, `economy`, `research`.
- Profile overrides individual config fields.
- User-defined profiles stored in `~/.prometheus/profiles/`.

**Surface:** `--profile` CLI flag. `prometheus profile list` / `prometheus profile create` subcommands.

**Scope:** `cli/app.py`, new `ProfileManager` in application.

---

## Summary Table

| # | Feature | Priority | CLI Surface | Config Surface | Estimated Scope |
|---|---------|----------|-------------|----------------|-----------------|
| 1 | Multi-Model Routing | P1 | Existing | Existing | Small |
| 2 | Async / Parallel Execution | P1 | `--max-concurrency` | `max_concurrency` | Medium |
| 3 | Error Handling & Retry | P1 | `--max-retries`, `--error-strategy` | `max_retries`, `retry_delay_base`, `error_strategy` | Medium |
| 4 | Checkpoint & Resume | P2 | `--checkpoint-dir`, `--resume` | `checkpoint_interval` | Medium |
| 5 | Population-Based Evolution | P2 | `--population-size`, `--crossover-rate`, `--mutation-rate` | `population_size`, `crossover_rate` | Large |
| 6 | Hold-Out Validation | P2 | `--validation-split`, `--early-stop-patience` | `validation_split`, `early_stop_patience` | Medium |
| 7 | Custom Judge Criteria | P2 | `--judge-criteria` | `judge_criteria`, `judge_dimensions` | Medium |
| 8 | Real-World Eval Harness | P2 | `--eval-dataset`, `--eval-metric` | `eval_dataset_path` | Large |
| 9 | Logging & Observability | P2 | `--debug`, `--log-format`, `--log-file` | `log_level`, `log_format`, `log_file` | Medium |
| 10 | CLI Improvements | P2 | Subcommands | — | Medium |
| 11 | Input Validation | P2 | — (error messages) | — | Small |
| 12 | Adaptive Minibatch | P3 | `--adaptive-minibatch` | `minibatch_size_min/max` | Small |
| 13 | Prompt Diversity Tracking | P3 | — | — | Small |
| 14 | Temperature & Sampling | P3 | `--temperature` | `*_temperature` | Small |
| 15 | Cost Estimation | P3 | `--max-cost-usd` | `max_cost_usd` | Small |
| 16 | Multi-Objective Optimization | P3 | `--objective` | `objectives` | Large |
| 17 | Export Optimized Prompt | P3 | `prometheus export` | — | Small |
| 18 | Config Profiles / Presets | P3 | `--profile` | — | Small |

---

## Known Bugs (from TEST_REPORT.md and code review)

| # | Bug | Severity | File |
|---|-----|----------|------|
| 1 | Multi-model config not wired — all adapters use single global LM | HIGH | `cli/app.py`, all adapters |
| 2 | `DSPyLLMAdapter` accepts `model` param but never uses it | HIGH | `infrastructure/llm_adapter.py` |
| 3 | CLI subcommand `optimize` absorbed by Typer 0.24 | HIGH | `cli/app.py` |
| 4 | Verbose logging produces no output — no handler configured | MEDIUM | `cli/app.py` |
| 5 | `total_llm_calls` counter is inaccurate | LOW | `application/use_cases.py`, `evolution.py` |
| 6 | `normalize_score()` is dead code — never called | LOW | `domain/scoring.py` |
| 7 | `AppSettings` is never imported or used | LOW | `config.py` |
| 8 | No retry or per-call isolation for LLM errors in evolution loop (failures are only caught and logged per iteration) | MEDIUM | `evolution.py` |
| 9 | Unpinned dependencies (dspy, typer) | LOW | `pyproject.toml` |

---

## Test Coverage Gaps

| Area | Current | Needed |
|------|---------|--------|
| CLI commands | 0 tests | Unit + integration for each subcommand |
| Config validation | 0 tests | Schema validation, missing fields, type errors |
| Evolution loop | 3 tests (single iteration each) | Multi-iteration, mixed accept/reject, failure recovery |
| Integration pipeline | 1 test (happy path only) | Error paths, mixed results, real adapters |
| Adapter coverage | 1 adapter tested | All 4 adapters + error scenarios |
| Use case orchestration | 1 indirect test | Direct unit tests for `OptimizePromptUseCase` |

---

## Recommended Implementation Order

### Phase 1 — Production Reliability (P1)
1. Fix multi-model routing (#1) — highest impact, smallest scope
2. Add error handling & retry (#3) — essential for production runs
3. Implement async/parallel execution (#2) — biggest wall-clock improvement

### Phase 2 — Optimization Quality (P2)
4. Input validation (#11) — small scope, high reliability gain
5. Logging & observability (#9) — enables debugging long runs
6. CLI improvements (#10) — fix Typer bug, add basic commands
7. Hold-out validation (#6) — prevents overfitting
8. Checkpoint & resume (#4) — essential for long runs
9. Custom judge criteria (#7) — enables domain-specific optimization

### Phase 3 — Advanced Features (P3)
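10. Population-based evolution (#5): P2 in the summary table, scheduled here given its Large scope
11. Real-world eval harness (#8): P2 in the summary table, scheduled here given its Large scope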
12. Remaining P3 features as demand dictates