# PROMETHEUS Feature Roadmap

> Complete codebase review: features needed for production-grade prompt optimization.
> Generated from v0.1.0 architecture review (2026-03-29).

---

## Legend

| Marker | Meaning |
|--------|---------|
| **CLI** | Exposed as a CLI option/flag |
| **Config** | YAML config field |
| **Internal** | No user-facing surface, architectural improvement |
| **P1** | Critical / must-have for reliability |
| **P2** | High value, should-have |
| **P3** | Nice-to-have, deferred to later versions |

---

## 1. Multi-Model Routing (P1)

**Current state:** `OptimizationConfig` defines four model slots (`task_model`, `judge_model`, `proposer_model`, `synth_model`), but `cli/app.py` only configures a single global DSPy LM from `task_model`. All adapters silently use the same model regardless of config.

**Feature:**
- Each adapter (`DSPyLLMAdapter`, `DSPyJudgeAdapter`, `DSPyProposerAdapter`, `DSPySyntheticAdapter`) must instantiate its own `dspy.LM` from the corresponding config field.
- Support per-model `api_base` and `api_key_env` overrides (e.g., judge on GPT-4o, propose on a cheaper model).
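
The per-slot lookup described above could be sketched as follows. This is a minimal illustration, not the project's code: the `<slot>_api_base` / `<slot>_api_key_env` key names, the `SlotSettings` type, and the fallback to `task_model` when a slot is unset are all assumptions. Each adapter would pass the resolved values to its own `dspy.LM`.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotSettings:
    """Resolved settings for one model slot (hypothetical helper type)."""
    model: str
    api_base: Optional[str]
    api_key: Optional[str]


def resolve_slot(config: dict, slot: str) -> SlotSettings:
    """Resolve one slot, falling back to the task model when the slot is unset."""
    # Assumed key layout: "<slot>_model", "<slot>_api_base", "<slot>_api_key_env".
    model = config.get(f"{slot}_model") or config["task_model"]
    api_base = config.get(f"{slot}_api_base")
    key_env = config.get(f"{slot}_api_key_env")
    api_key = os.environ.get(key_env) if key_env else None
    return SlotSettings(model=model, api_base=api_base, api_key=api_key)


# Example config: only the judge slot is overridden.
config = {
    "task_model": "openai/gpt-4o-mini",
    "judge_model": "openai/gpt-4o",
    "judge_api_base": "https://example.invalid/v1",
}
judge = resolve_slot(config, "judge")
proposer = resolve_slot(config, "proposer")
```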

**Surface:** Config (already partially defined): `judge_model`, `proposer_model`, and `synth_model` become functional. No new CLI flags are needed; the YAML already has the fields.

**Scope:** Infrastructure layer (`llm_adapter.py`, `judge_adapter.py`, `proposer_adapter.py`, `synth_adapter.py`) plus `cli/app.py` DI wiring.

---

## 2. Async / Parallel Execution (P1)

**Current state:** All LLM calls (execute, judge, propose) are sequential. A single iteration with `minibatch_size=5` makes ~11 sequential LLM calls (5 executions, 5 judge calls, 1 proposal). Wall-clock time scales linearly with minibatch size.

**Feature:**
- Parallelize execution of the prompt across a minibatch (`asyncio.gather` or `dspy.Parallel`).
- Parallelize judge calls within a batch.
- Keep the proposer sequential (single call per iteration).
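
A minimal sketch of the `asyncio.gather` option with a concurrency cap, assuming the adapters expose coroutine functions. `run_minibatch` and `fake_execute` are hypothetical names; the stand-in sleeps instead of calling a model.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_minibatch(
    items: List[str],
    call: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Run `call` on every item concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item: str) -> str:
        async with semaphore:
            return await call(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(item) for item in items))


async def fake_execute(item: str) -> str:
    """Stand-in for an LLM call (hypothetical); sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return item.upper()


outputs = asyncio.run(
    run_minibatch(["a", "b", "c", "d", "e"], fake_execute, max_concurrency=2)
)
```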

**Surface:** Internal. Optionally exposed via `--max-concurrency` CLI flag and `max_concurrency` YAML field.

**Scope:** `evaluator.py`, `judge_adapter.py`, `llm_adapter.py`.

---

## 3. Robust Error Handling & Retry (P1)

**Current state:** The evolution loop catches broad `Exception` per iteration and logs it, then continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks only cover parsing, not network errors.

**Feature:**
- Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx).
- Configurable `max_retries` and `retry_delay_base`.
- Circuit breaker: if N consecutive iterations fail, pause and alert.
- Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation.
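
The retry and isolation bullets could look roughly like this. It is a sketch under assumptions: `TransientError` is a hypothetical marker exception (real adapters would map rate-limit, timeout, and 5xx errors onto it), and the 10% jitter is an illustrative choice. The circuit breaker is left out.

```python
import random
import time
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, timeout, 5xx)."""


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay_base: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, plus up to 10% jitter.
            delay = retry_delay_base * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


def evaluate_items(
    items: List[str], run_one: Callable[[str], float]
) -> List[Tuple[str, Optional[float]]]:
    """Score each item independently so one failure does not sink the minibatch."""
    results: List[Tuple[str, Optional[float]]] = []
    for item in items:
        try:
            score = call_with_retry(lambda: run_one(item), sleep=lambda _: None)
            results.append((item, score))
        except Exception:
            results.append((item, None))  # isolated failure: recorded, not raised
    return results


# Demo: a call that fails twice with a transient error, then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


result = call_with_retry(flaky, max_retries=3, sleep=lambda _: None)
```

Injecting `sleep` keeps the backoff testable without real delays.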

**Surface:** `--max-retries` CLI flag, `max_retries` Config field. `--error-strategy` (skip | retry | abort) CLI flag.

**Scope:** Infrastructure adapters + evolution loop.

---

## 4. Checkpoint & Resume (P2)

**Current state:** If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence.

**Feature:**
- Save `OptimizationState` to disk every K iterations (or every accepted improvement).
- Resume from the latest checkpoint file on restart.
- Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state.
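
As an illustration of the save/resume cycle, here is a JSON-based sketch. The file naming, the plain `dict` standing in for `OptimizationState`, and the function names are assumptions; only the idea of persisting state plus RNG state comes from the list above.

```python
import json
import os
import random
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(directory: Path, iteration: int, state: dict, rng: random.Random) -> Path:
    """Write one checkpoint atomically: temp file first, then rename into place."""
    directory.mkdir(parents=True, exist_ok=True)
    payload = {"iteration": iteration, "state": state, "rng_state": rng.getstate()}
    target = directory / f"checkpoint_{iteration:06d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=str(directory), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, target)  # atomic when source and target share a filesystem
    return target


def load_latest_checkpoint(directory: Path, rng: random.Random) -> Optional[dict]:
    """Load the highest-numbered checkpoint and restore the RNG, or return None."""
    files = sorted(directory.glob("checkpoint_*.json"))  # zero-padded names sort correctly
    if not files:
        return None
    payload = json.loads(files[-1].read_text(encoding="utf-8"))
    version, internal, gauss_next = payload["rng_state"]
    # JSON turns tuples into lists; setstate needs tuples back.
    rng.setstate((version, tuple(internal), gauss_next))
    return payload


with tempfile.TemporaryDirectory() as tmp:
    rng = random.Random(42)
    rng.random()
    save_checkpoint(Path(tmp), 5, {"best_prompt": "v5", "llm_calls": 55}, rng)
    expected_next = rng.random()  # what the original run would draw next

    resumed_rng = random.Random()
    restored = load_latest_checkpoint(Path(tmp), resumed_rng)
    resumed_next = resumed_rng.random()  # what the resumed run draws next
```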

**Surface:** `--checkpoint-dir` CLI flag (default: `.prometheus/checkpoints/`). `--resume` CLI flag to resume from latest checkpoint. `checkpoint_interval` Config field.

**Scope:** New `CheckpointPort` in domain, `JsonCheckpointPersistence` in infrastructure, modifications to `EvolutionLoop.run()`.

---

## 5. Population-Based Evolution (P2)

**Current state:** The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The `Candidate` entity has `generation` and `parent_id` fields that suggest population support was planned.

**Feature:**
- Maintain a population of K candidates (e.g., top-K by score or Pareto front).
- Crossover: combine instructions from two parent candidates.
- Mutation operators: paraphrase, constrain, generalize, specialize.
- Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance).

**Surface:** `--population-size` CLI flag, `population_size` Config field. `--crossover-rate`, `--mutation-rate` CLI flags.

**Scope:** `EvolutionLoop` refactor, new `CrossoverPort` and `MutationPort` in domain, new DSPy signatures for crossover/mutation in infrastructure.

---

## 6. Hold-Out Validation (P2)

**Current state:** The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs.

**Feature:**
- Split synthetic pool into train (e.g., 70%) and validation (30%) sets.
- Evolution uses train minibatches for accept/reject decisions.
- After each iteration, evaluate the best candidate on the hold-out set.
- Report both train and validation scores in results.
- Optional early stopping if validation score degrades for K consecutive iterations.
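
A sketch of the split and the patience check. The function names are hypothetical, and "degrades" is interpreted here as "none of the last K validation scores beats the best score seen before them", which is one common reading of patience, not necessarily the intended one.

```python
import random
from typing import List, Sequence, Tuple


def split_pool(
    pool: Sequence[str], validation_split: float = 0.3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the pool and cut it into (train, validation)."""
    shuffled = list(pool)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_split))
    return shuffled[n_val:], shuffled[:n_val]


def should_stop(validation_scores: Sequence[float], patience: int = 5) -> bool:
    """True once the last `patience` scores all fail to beat the best score before them."""
    if len(validation_scores) <= patience:
        return False
    best_before = max(validation_scores[:-patience])
    return all(score <= best_before for score in validation_scores[-patience:])


pool = [f"input-{i}" for i in range(10)]
train, validation = split_pool(pool, validation_split=0.3, seed=7)

stalled = should_stop([0.5, 0.6, 0.7, 0.69, 0.68], patience=2)
improving = should_stop([0.5, 0.6, 0.7, 0.69, 0.72], patience=2)
```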

**Surface:** `--validation-split` CLI flag (default: 0.3). `--early-stop-patience` CLI flag (default: 5). Config fields: `validation_split`, `early_stop_patience`.

**Scope:** `SyntheticBootstrap`, `EvolutionLoop`, `OptimizationResult` (add validation metrics).

---

## 7. Custom Judge Criteria (P2)

**Current state:** The judge uses a hardcoded rubric in `JudgeOutput` DSPy signature ("score 0.0-1.0" with generic quality assessment). Users cannot customize evaluation criteria.

**Feature:**
- Allow users to define custom judge rubrics, criteria, and scoring scales.
- Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights.
- Allow `perfect_score` to reflect the custom scale.
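
The weighted aggregation could be sketched as below. The `Dimension` type, its `max_score` field, and the rescaling to `perfect_score` are assumptions layered on the `{name, weight, description}` idea; `description` is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Dimension:
    """One scoring dimension (hypothetical type)."""
    name: str
    weight: float
    max_score: float = 10.0  # assumed per-dimension scale


def aggregate(
    scores: Dict[str, float], dimensions: List[Dimension], perfect_score: float = 1.0
) -> float:
    """Weighted mean of per-dimension scores, rescaled to [0, perfect_score]."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("dimension weights must sum to a positive number")
    weighted = sum(d.weight * (scores[d.name] / d.max_score) for d in dimensions)
    return perfect_score * weighted / total_weight


dimensions = [
    Dimension("accuracy", 0.5),
    Dimension("clarity", 0.3),
    Dimension("safety", 0.2),
]
# 0.5 * 0.8 + 0.3 * 0.6 + 0.2 * 1.0 = 0.78
overall = aggregate({"accuracy": 8, "clarity": 6, "safety": 10}, dimensions)
```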

**Surface:** `judge_criteria` YAML field (free text). `judge_dimensions` YAML field (list of `{name, weight, description}`). CLI: `--judge-criteria` for quick overrides.

**Scope:** `JudgeOutput` signature (dynamic instructions), `JudgePort`, `DSPyJudgeAdapter`, `scoring.py` (weighted aggregation).

---

## 8. Real-World Evaluation Harness (P2)

**Current state:** The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs.

**Feature:**
- Accept an optional evaluation dataset (CSV/JSON with `input` and `expected_output` columns).
- When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge.
- Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected.

**Surface:** `--eval-dataset` CLI flag. `eval_dataset_path` Config field. `--eval-metric` CLI flag (exact | semantic | llm_judge).

**Scope:** New `GroundTruthEvaluator` in application, new `SimilarityPort` in domain, dataset loader in infrastructure.

---

## 9. Logging & Observability (P2)

**Current state:** Verbose mode (`-v`) configures Python's `logging` module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing.

**Feature:**
- Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
- JSON-formatted log output for machine parsing.
- Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff.
- Optional OpenTelemetry export for distributed tracing.
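
A standard-library sketch of the handler setup and JSON output. The logger name `"prometheus"`, the `trace` key passed through `extra=`, and `configure_logging` itself are illustrative assumptions; OpenTelemetry export is not covered.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` land on the record; "trace" is an assumed key.
        trace = getattr(record, "trace", None)
        if trace is not None:
            payload["trace"] = trace
        return json.dumps(payload)


def configure_logging(level: str = "INFO", log_format: str = "text", stream=None) -> logging.Logger:
    """Attach exactly one handler to the (hypothetical) "prometheus" logger."""
    logger = logging.getLogger("prometheus")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    if log_format == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = configure_logging("DEBUG", "json", stream=buffer)
log.info("iteration finished", extra={"trace": {"iteration": 3, "judge_scores": [0.7, 0.9]}})
line = json.loads(buffer.getvalue())
```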

**Surface:** `-v` / `--verbose` enables INFO level. `--debug` enables DEBUG level. `--log-format` (text | json). `--log-file` for file output. Config fields: `log_level`, `log_format`, `log_file`.

**Scope:** `cli/app.py` (logging setup), `evolution.py` (structured traces), new `TracingPort` in domain.

---

## 10. CLI Improvements (P2)

**Current state:** Single `optimize` command. Known Typer 0.24 bug absorbing subcommands (Bug #1). No `version`, `init`, or `list-results` commands.

**Feature:**
- Fix Typer subcommand routing.
- `prometheus version`: show version.
- `prometheus init`: scaffold a config YAML interactively.
- `prometheus list`: list past optimization runs.
- `prometheus diff`: compare two result files (before/after prompt diff, score improvement).
- `prometheus eval`: evaluate a prompt against a dataset without optimization.

**Surface:** CLI subcommands.

**Scope:** `cli/app.py` restructured into `cli/commands/` with one module per command.

---

## 11. Input Validation & Schema Enforcement (P2)

**Current state:** Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline.

**Feature:**
- Validate input YAML against a Pydantic schema (leveraging the existing `pydantic` dependency).
- Provide clear, actionable error messages for missing/invalid fields.
- Support config migration/upgrade from older versions.

**Surface:** Internal. Errors surface as clear CLI messages.

**Scope:** `OptimizationConfig` converted to Pydantic model with validators, `cli/app.py` validation step before pipeline execution.

---

## 12. Adaptive Minibatch Sizing (P3)

**Current state:** Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive.

**Feature:**
- Start with a small minibatch for quick early iterations.
- Increase minibatch size as the prompt improves (higher confidence needed for marginal gains).
- Shrink if too many evaluations fail (cost optimization).

**Surface:** `--adaptive-minibatch` CLI flag (boolean toggle). `minibatch_size` becomes `minibatch_size_min` and `minibatch_size_max` in config.

**Scope:** `EvolutionLoop`, `SyntheticBootstrap`.

---

## 13. Prompt Diversity Tracking (P3)

**Current state:** No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change.

**Feature:**
- Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts.
- Report diversity metrics in the result.
- Flag stagnation (N iterations with <epsilon change).
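
For the edit-distance variant, a small pure-Python sketch. The normalisation by the longer prompt and the `window` / `epsilon` defaults are illustrative assumptions.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def change_ratio(old: str, new: str) -> float:
    """Edit distance normalised by the longer prompt, in [0, 1]."""
    longest = max(len(old), len(new))
    return levenshtein(old, new) / longest if longest else 0.0


def is_stagnant(prompts: List[str], window: int = 3, epsilon: float = 0.05) -> bool:
    """True if the last `window` consecutive changes were all below `epsilon`."""
    ratios = [change_ratio(a, b) for a, b in zip(prompts, prompts[1:])]
    return len(ratios) >= window and all(r < epsilon for r in ratios[-window:])


distance = levenshtein("kitten", "sitting")
stuck = is_stagnant(["Summarize the text."] * 4)
moving = is_stagnant(["Summarize.", "List three facts.", "Explain why.", "Give one example."])
```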

**Surface:** Internal. Reported in `OptimizationResult.history` entries.

**Scope:** `EvolutionLoop`, `OptimizationResult` (add diversity field per history entry).

---

## 14. Temperature & Sampling Control (P3)

**Current state:** No way to control LLM temperature, top_p, or other sampling parameters for any of the four model slots. DSPy defaults apply.

**Feature:**
- Per-model-slot temperature and sampling parameters.
- Higher temperature for proposer (creativity), lower for judge (consistency).

**Surface:** `task_temperature`, `judge_temperature`, `proposer_temperature`, `synth_temperature` Config fields. `--temperature` CLI flag for global override.

**Scope:** `cli/app.py` (DSPy LM configuration), infrastructure adapters.

---

## 15. Cost Estimation & Budget Caps (P3)

**Current state:** `total_llm_calls` is tracked (inaccurately). No cost estimation, no budget caps.

**Feature:**
- Estimate cost per run based on model pricing and approximate token counts.
- Allow users to set a budget cap (`--max-cost-usd`).
- Report estimated cost in the result.
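
One way the estimate and cap could fit together is sketched below. The prices are placeholders rather than real model pricing, the four-characters-per-token rule is a rough heuristic instead of a tokenizer, and `CostTracker` / `BudgetExceeded` are hypothetical names.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when the running estimate passes the cap (hypothetical error type)."""


@dataclass
class CostTracker:
    # USD per 1M tokens; placeholder numbers, not real pricing.
    input_price_per_mtok: float
    output_price_per_mtok: float
    max_cost_usd: float = float("inf")
    spent_usd: float = 0.0

    @staticmethod
    def approx_tokens(text: str) -> int:
        """Rough heuristic (about 4 characters per token), not a real tokenizer."""
        return max(1, len(text) // 4)

    def record(self, prompt: str, completion: str) -> float:
        """Add one call's estimated cost; raise if the cap is exceeded."""
        cost = (
            self.approx_tokens(prompt) * self.input_price_per_mtok
            + self.approx_tokens(completion) * self.output_price_per_mtok
        ) / 1_000_000
        self.spent_usd += cost
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"estimated ${self.spent_usd:.4f} exceeds cap ${self.max_cost_usd:.4f}"
            )
        return cost


tracker = CostTracker(input_price_per_mtok=1.0, output_price_per_mtok=2.0, max_cost_usd=0.001)
# 100 input tokens * $1/M + 50 output tokens * $2/M = $0.0002
first = tracker.record("x" * 400, "y" * 200)
```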

**Surface:** `--max-cost-usd` CLI flag. `max_cost_usd` Config field. Cost breakdown in result output.

**Scope:** `cli/app.py`, `OptimizationResult` (add cost fields), token counting in adapters.

---

## 16. Multi-Objective Optimization (P3)

**Current state:** Single scalar score from the judge. The `Prompt` entity comment mentions "Pareto tracking" but it's not implemented.

**Feature:**
- Optimize for multiple objectives simultaneously (quality, latency, token efficiency, safety).
- Maintain a Pareto front of non-dominated candidates.
- Allow users to set objective weights or constraints.
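
The non-dominated filter behind a Pareto front can be sketched in a few lines. It assumes every objective is expressed as "higher is better" (so latency or token use would be inverted first); the candidate ids and scores are made up for illustration.

```python
from typing import Dict, List


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the ids of candidates that no other candidate dominates."""
    return [
        cid
        for cid, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for oid, other in candidates.items()
            if oid != cid
        )
    ]


# Illustrative scores; all objectives are "higher is better".
candidates = {
    "p1": {"quality": 0.90, "token_efficiency": 0.40},
    "p2": {"quality": 0.70, "token_efficiency": 0.80},
    "p3": {"quality": 0.60, "token_efficiency": 0.30},  # dominated by p1
}
front = pareto_front(candidates)
```

This pairwise check is quadratic in the number of candidates, which should be acceptable for small populations.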

**Surface:** `objectives` Config field (list of `{name, weight, judge_criteria}`). CLI: `--objective` repeatable flag.

**Scope:** `EvolutionLoop` (Pareto front), `scoring.py` (multi-objective acceptance), `OptimizationResult` (Pareto set).

---

## 17. Export Optimized Prompt (P3)

**Current state:** The optimized prompt is embedded in the YAML result file. No easy way to extract it for use.

**Feature:**
- `prometheus export` command to extract the optimized prompt as plain text.
- Support multiple export formats: plain text, Markdown, JSON, LangChain template, DSPy module.
- Copy to clipboard option.

**Surface:** `prometheus export --format <txt|md|json|langchain|dspy>` CLI subcommand. `--clipboard` flag.

**Scope:** New `cli/commands/export.py`, format renderers in infrastructure.

---

## 18. Config Profiles / Presets (P3)

**Current state:** Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured.

**Feature:**
- Named profiles: `fast`, `thorough`, `economy`, `research`.
- A profile overrides individual config fields.
- User-defined profiles stored in `~/.prometheus/profiles/`.

**Surface:** `--profile` CLI flag. `prometheus profile list` / `prometheus profile create` subcommands.

**Scope:** `cli/app.py`, new `ProfileManager` in application.

---

## Summary Table

| # | Feature | Priority | CLI Surface | Config Surface | Estimated Scope |
|---|---------|----------|-------------|----------------|-----------------|
| 1 | Multi-Model Routing | P1 | Existing | Existing | Small |
| 2 | Async / Parallel Execution | P1 | `--max-concurrency` | `max_concurrency` | Medium |
| 3 | Error Handling & Retry | P1 | `--max-retries`, `--error-strategy` | `max_retries`, `error_strategy` | Medium |
| 4 | Checkpoint & Resume | P2 | `--checkpoint-dir`, `--resume` | `checkpoint_interval` | Medium |
| 5 | Population-Based Evolution | P2 | `--population-size`, `--crossover-rate` | `population_size`, `crossover_rate` | Large |
| 6 | Hold-Out Validation | P2 | `--validation-split`, `--early-stop-patience` | `validation_split`, `early_stop_patience` | Medium |
| 7 | Custom Judge Criteria | P2 | `--judge-criteria` | `judge_criteria`, `judge_dimensions` | Medium |
| 8 | Real-World Eval Harness | P2 | `--eval-dataset`, `--eval-metric` | `eval_dataset_path` | Large |
| 9 | Logging & Observability | P2 | `--debug`, `--log-format`, `--log-file` | `log_level`, `log_format` | Medium |
| 10 | CLI Improvements | P2 | Subcommands | None | Medium |
| 11 | Input Validation | P2 | None (error messages only) | None | Small |
| 12 | Adaptive Minibatch | P3 | `--adaptive-minibatch` | `minibatch_size_min/max` | Small |
| 13 | Prompt Diversity Tracking | P3 | None | None | Small |
| 14 | Temperature & Sampling | P3 | `--temperature` | `*_temperature` | Small |
| 15 | Cost Estimation | P3 | `--max-cost-usd` | `max_cost_usd` | Small |
| 16 | Multi-Objective Optimization | P3 | `--objective` | `objectives` | Large |
| 17 | Export Optimized Prompt | P3 | `prometheus export` | None | Small |
| 18 | Config Profiles / Presets | P3 | `--profile` | None | Small |

---

## Known Bugs (from TEST_REPORT.md and code review)

| # | Bug | Severity | File |
|---|-----|----------|------|
| 1 | Multi-model config not wired: all adapters use single global LM | HIGH | `cli/app.py`, all adapters |
| 2 | `DSPyLLMAdapter` accepts `model` param but never uses it | HIGH | `infrastructure/llm_adapter.py` |
| 3 | CLI subcommand `optimize` absorbed by Typer 0.24 | HIGH | `cli/app.py` |
| 4 | Verbose logging produces no output: no handler configured | MEDIUM | `cli/app.py` |
| 5 | `total_llm_calls` counter is inaccurate | LOW | `application/use_cases.py`, `evolution.py` |
| 6 | `normalize_score()` is dead code, never called | LOW | `domain/scoring.py` |
| 7 | `AppSettings` is never imported or used | LOW | `config.py` |
| 8 | No LLM error handling in evolution loop | MEDIUM | `evolution.py` |
| 9 | Unpinned dependencies (dspy, typer) | LOW | `pyproject.toml` |

---

## Test Coverage Gaps

| Area | Current | Needed |
|------|---------|--------|
| CLI commands | 0 tests | Unit + integration for each subcommand |
| Config validation | 0 tests | Schema validation, missing fields, type errors |
| Evolution loop | 3 tests (single iteration each) | Multi-iteration, mixed accept/reject, failure recovery |
| Integration pipeline | 1 test (happy path only) | Error paths, mixed results, real adapters |
| Adapter coverage | 1 adapter tested | All 4 adapters + error scenarios |
| Use case orchestration | 1 indirect test | Direct unit tests for `OptimizePromptUseCase` |

---

## Recommended Implementation Order

### Phase 1: Production Reliability (P1)
1. Fix multi-model routing (#1): highest impact, smallest scope
2. Add error handling & retry (#3): essential for production runs
3. Implement async/parallel execution (#2): biggest wall-clock improvement

### Phase 2: Optimization Quality (P2)
4. Input validation (#11): small scope, high reliability gain
5. Logging & observability (#9): enables debugging long runs
6. CLI improvements (#10): fix Typer bug, add basic commands
7. Hold-out validation (#6): prevents overfitting
8. Checkpoint & resume (#4): essential for long runs
9. Custom judge criteria (#7): enables domain-specific optimization

### Phase 3: Advanced Features (remaining P2, then P3)
10. Population-based evolution (#5)
11. Real-world eval harness (#8)
12. Remaining P3 features as demand dictates