Prompt-optimizer/docs/FEATURE_ROADMAP.md
FullStackDev a5bf2ad59c feat: v0.2.0 sprint — ground truth eval, crossover/mutation, checkpointing, similarity guards, dataset loader, CLI commands, extended test coverage
Aggregates all v0.2.0 sprint work (GARAA-30 through GARAA-40) and fixes
2 integration tests that broke when the codebase went async (DSPyLLMAdapter
and full pipeline tests now properly await coroutines).

277 tests pass (260 unit + 17 integration).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-03-29 19:13:50 +00:00

PROMETHEUS Feature Roadmap

Complete codebase review — features needed for production-grade prompt optimization. Generated from v0.1.0 architecture review (2026-03-29).


Legend

| Marker | Meaning |
|--------|---------|
| CLI | Exposed as a CLI option/flag |
| Config | YAML config field |
| Internal | No user-facing surface, architectural improvement |
| P1 | Critical / must-have for reliability |
| P2 | High value, should-have |
| P3 | Nice-to-have, deferred to later versions |

1. Multi-Model Routing (P1)

Current state: OptimizationConfig defines four model slots (task_model, judge_model, proposer_model, synth_model), but cli/app.py only configures a single global DSPy LM from task_model. All adapters silently use the same model regardless of config.

Feature:

  • Each adapter (DSPyLLMAdapter, DSPyJudgeAdapter, DSPyProposerAdapter, DSPySyntheticAdapter) must instantiate its own dspy.LM from the corresponding config field.
  • Support per-model api_base and api_key_env overrides (e.g., judge on GPT-4o, propose on a cheaper model).

Surface: Config (already partially defined) — judge_model, proposer_model, synth_model become functional. No new CLI flags needed; the YAML already has the fields.

Scope: Infrastructure layer (llm_adapter.py, judge_adapter.py, proposer_adapter.py, synth_adapter.py) + cli/app.py DI wiring.
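
A minimal sketch of the per-slot resolution step, assuming that unset slots fall back to task_model and that overrides live in fields named like judge_api_base and judge_api_key_env. The fallback rule, those field names, and the ModelSlot helper are all assumptions; the roadmap only names the four model fields. Each adapter could then build its own dspy.LM from its resolved slot instead of relying on the global LM.

```python
from dataclasses import dataclass
import os
from typing import Dict, Mapping, Optional

SLOTS = ("task_model", "judge_model", "proposer_model", "synth_model")


@dataclass(frozen=True)
class ModelSlot:
    """Resolved settings for one model slot (hypothetical helper type)."""
    model: str
    api_base: Optional[str] = None
    api_key: Optional[str] = None


def resolve_slots(
    config: Mapping[str, object],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, ModelSlot]:
    """Resolve every slot independently, falling back to task_model when unset."""
    default_model = config["task_model"]
    resolved = {}
    for slot in SLOTS:
        prefix = slot.removesuffix("_model")  # e.g. "judge"
        # Assumed field names: <prefix>_api_base and <prefix>_api_key_env.
        key_env = config.get(f"{prefix}_api_key_env")
        resolved[slot] = ModelSlot(
            model=config.get(slot) or default_model,
            api_base=config.get(f"{prefix}_api_base"),
            api_key=env.get(key_env) if key_env else None,
        )
    return resolved
```

Reading the key through an environment variable name keeps secrets out of the YAML file, which is the apparent intent behind api_key_env.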


2. Async / Parallel Execution (P1)

Current state: All LLM calls (execute, judge, propose) are sequential. A single iteration with minibatch_size=5 makes ~11 sequential LLM calls (roughly 5 execute + 5 judge + 1 propose). Wall-clock time scales linearly with minibatch size.

Feature:

  • Parallelize execution of the prompt across a minibatch (asyncio.gather or dspy.Parallel).
  • Parallelize judge calls within a batch.
  • Keep the proposer sequential (single call per iteration).

Surface: Internal. Optionally exposed via --max-concurrency CLI flag and max_concurrency YAML field.

Scope: evaluator.py, judge_adapter.py, llm_adapter.py.
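
One way to bound parallel execution is asyncio.gather behind a semaphore. The sketch below is illustrative only: run_minibatch and fake_llm are hypothetical names, and fake_llm stands in for a real adapter call.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_minibatch(
    inputs: List[str],
    call: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Run `call` on every input concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item: str) -> str:
        async with semaphore:
            return await call(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(item) for item in inputs))


async def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; sleeps instead of hitting the network."""
    await asyncio.sleep(0.01)
    return prompt.upper()
```

Usage would look like `asyncio.run(run_minibatch(["a", "b"], fake_llm, max_concurrency=2))`. The same helper could wrap judge calls, while the proposer stays a single awaited call per iteration.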


3. Robust Error Handling & Retry (P1)

Current state: The evolution loop catches broad Exception per iteration and logs it, then continues. Individual LLM call failures (timeouts, rate limits, malformed responses) are not retried. DSPy module fallbacks only cover parsing, not network errors.

Feature:

  • Retry with exponential backoff for transient errors (rate limits, timeouts, 5xx).
  • Configurable max_retries and retry_delay_base.
  • Circuit breaker: if N consecutive iterations fail, pause and alert.
  • Per-call error isolation: one bad minibatch item shouldn't fail the whole evaluation.

Surface: --max-retries CLI flag, max_retries Config field. --error-strategy (skip | retry | abort) CLI flag.

Scope: Infrastructure adapters + evolution loop.
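
The retry and circuit-breaker bullets could be sketched as follows. TransientError, call_with_retry, and CircuitBreaker are hypothetical names, and the 10% jitter is an assumption; only max_retries and retry_delay_base come from the roadmap.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, timeout, 5xx)."""


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay_base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient errors with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; the caller decides to skip or abort
            delay = retry_delay_base * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_retries must be >= 0")


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; any success resets it."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold
```

Wrapping each minibatch item in its own call_with_retry gives the per-call isolation the last bullet asks for: one item can exhaust its retries and be skipped without failing the rest of the evaluation.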


4. Checkpoint & Resume (P2)

Current state: If a long optimization run crashes or is interrupted, all progress is lost. There is no intermediate state persistence.

Feature:

  • Save OptimizationState to disk every K iterations (or every accepted improvement).
  • Resume from the latest checkpoint file on restart.
  • Checkpoint includes: current best candidate, all candidates, iteration number, LLM call count, RNG seed state.

Surface: --checkpoint-dir CLI flag (default: .prometheus/checkpoints/). --resume CLI flag to resume from latest checkpoint. checkpoint_interval Config field.

Scope: New CheckpointPort in domain, JsonCheckpointPersistence in infrastructure, modifications to EvolutionLoop.run().
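
A rough sketch of JSON checkpointing with RNG state, assuming OptimizationState can be reduced to a plain dict. The file naming scheme and both function names are assumptions, not the project's actual JsonCheckpointPersistence.

```python
import json
import os
import random
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(directory: Path, iteration: int, state: dict, rng: random.Random) -> Path:
    """Write one checkpoint atomically: temp file first, then rename into place."""
    directory.mkdir(parents=True, exist_ok=True)
    payload = {"iteration": iteration, "state": state, "rng_state": rng.getstate()}
    # Zero-padded names keep lexicographic order equal to iteration order.
    target = directory / f"checkpoint_{iteration:06d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, target)  # atomic when source and target share a filesystem
    return target


def load_latest_checkpoint(directory: Path, rng: random.Random) -> Optional[dict]:
    """Load the highest-numbered checkpoint and restore the RNG, or return None."""
    files = sorted(directory.glob("checkpoint_*.json"))
    if not files:
        return None
    payload = json.loads(files[-1].read_text(encoding="utf-8"))
    version, internal, gauss_next = payload["rng_state"]
    # JSON turns tuples into lists; setstate() needs the inner state as a tuple.
    rng.setstate((version, tuple(internal), gauss_next))
    return payload
```

Writing to a temporary file and renaming means a crash mid-write leaves the previous checkpoint intact, which matters because crashes are exactly the case this feature exists for.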


5. Population-Based Evolution (P2)

Current state: The evolution loop keeps only a single best candidate (hill climbing). No diversity, no crossover, no population dynamics. The Candidate entity has generation and parent_id fields that suggest population support was planned.

Feature:

  • Maintain a population of K candidates (e.g., top-K by score or Pareto front).
  • Crossover: combine instructions from two parent candidates.
  • Mutation operators: paraphrase, constrain, generalize, specialize.
  • Diversity maintenance: penalize candidates too similar to existing ones (cosine similarity or edit distance).

Surface: --population-size CLI flag, population_size Config field. --crossover-rate, --mutation-rate CLI flags.

Scope: EvolutionLoop refactor, new CrossoverPort and MutationPort in domain, new DSPy signatures for crossover/mutation in infrastructure.
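
As an illustration of top-K selection with a diversity guard, the sketch below rejects near-duplicates outright rather than penalizing their score, and it uses difflib's ratio in place of cosine similarity or edit distance. Scored, similarity, and select_population are hypothetical names, and the 0.9 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Scored:
    """Hypothetical stand-in for a scored Candidate."""
    prompt: str
    score: float


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def select_population(
    candidates: List[Scored],
    k: int,
    max_similarity: float = 0.9,
) -> List[Scored]:
    """Greedy top-K by score, skipping near-duplicates of already selected prompts."""
    selected: List[Scored] = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if len(selected) >= k:
            break
        if all(similarity(cand.prompt, kept.prompt) < max_similarity for kept in selected):
            selected.append(cand)
    return selected
```

A hard rejection is simpler to reason about than a score penalty, but it can discard a slightly different prompt that scores well; a penalty term would be the gentler alternative.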


6. Hold-Out Validation (P2)

Current state: The same synthetic inputs are used for both optimization and evaluation. No train/test split. Risk of overfitting to synthetic inputs.

Feature:

  • Split synthetic pool into train (e.g., 70%) and validation (30%) sets.
  • Evolution uses train minibatches for accept/reject decisions.
  • After each iteration, evaluate the best candidate on the hold-out set.
  • Report both train and validation scores in results.
  • Optional early stopping if validation score degrades for K consecutive iterations.

Surface: --validation-split CLI flag (default: 0.3). --early-stop-patience CLI flag (default: 5). Config fields: validation_split, early_stop_patience.

Scope: SyntheticBootstrap, EvolutionLoop, OptimizationResult (add validation metrics).
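
The split and the early-stopping rule might look like the sketch below. It reads "degrades for K consecutive iterations" as "fails to beat the earlier best for K iterations", which is one possible interpretation; split_pool and should_stop_early are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple


def split_pool(
    pool: Sequence[str],
    validation_split: float = 0.3,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically, then carve off the validation fraction."""
    items = list(pool)
    random.Random(seed).shuffle(items)
    n_val = int(round(len(items) * validation_split))
    return items[n_val:], items[:n_val]  # (train, validation)


def should_stop_early(val_scores: Sequence[float], patience: int = 5) -> bool:
    """True once the last `patience` validation scores all fail to beat the earlier best."""
    if len(val_scores) <= patience:
        return False
    best_before = max(val_scores[:-patience])
    return all(score <= best_before for score in val_scores[-patience:])
```

Seeding the shuffle keeps the split stable across resumed runs, so validation scores stay comparable over the whole optimization.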


7. Custom Judge Criteria (P2)

Current state: The judge uses a hardcoded rubric in JudgeOutput DSPy signature ("score 0.0-1.0" with generic quality assessment). Users cannot customize evaluation criteria.

Feature:

  • Allow users to define custom judge rubrics, criteria, and scoring scales.
  • Support multi-dimensional scoring (e.g., accuracy: 0-10, clarity: 0-10, safety: 0-10) with configurable weights.
  • Allow perfect_score to reflect the custom scale.

Surface: judge_criteria YAML field (free text). judge_dimensions YAML field (list of {name, weight, description}). CLI: --judge-criteria for quick overrides.

Scope: JudgeOutput signature (dynamic instructions), JudgePort, DSPyJudgeAdapter, scoring.py (weighted aggregation).
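
Weighted aggregation over judge dimensions could be sketched like this. The max_score field is an assumption added so that differently scaled dimensions can be combined; the roadmap's judge_dimensions entries only list name, weight, and description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JudgeDimension:
    """One scoring dimension (hypothetical shape)."""
    name: str
    weight: float
    max_score: float = 10.0  # assumed per-dimension scale, e.g. 0-10


def aggregate(dimensions: List[JudgeDimension], raw_scores: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each first normalized to [0, 1]."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("dimension weights must sum to a positive number")
    weighted = sum(d.weight * (raw_scores[d.name] / d.max_score) for d in dimensions)
    return weighted / total_weight
```

For example, weights 0.5/0.3/0.2 on accuracy 8, clarity 6, safety 10 give 0.5·0.8 + 0.3·0.6 + 0.2·1.0 = 0.78. Normalizing to [0, 1] first means perfect_score can stay 1.0 regardless of the scale each dimension uses.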


8. Real-World Evaluation Harness (P2)

Current state: The system only evaluates against synthetic inputs. There is no way to test optimized prompts against real inputs with known-good outputs.

Feature:

  • Accept an optional evaluation dataset (CSV/JSON with input and expected_output columns).
  • When provided, use exact/semantic similarity matching against expected outputs instead of (or in addition to) LLM-as-Judge.
  • Report metrics: accuracy, BLEU, ROUGE, or embedding cosine similarity vs expected.

Surface: --eval-dataset CLI flag. eval_dataset_path Config field. --eval-metric CLI flag (exact | semantic | llm_judge).

Scope: New GroundTruthEvaluator in application, new SimilarityPort in domain, dataset loader in infrastructure.
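
A small sketch of the dataset loader and the exact-match metric, using only the input and expected_output column names given above. The function names and the whitespace/case normalization are assumptions; semantic metrics such as BLEU or embedding similarity are out of scope here.

```python
import csv
import io
import json
from typing import Dict, Iterable, List


def load_dataset(text: str, fmt: str) -> List[Dict[str, str]]:
    """Parse CSV or JSON text into rows with `input` and `expected_output` keys."""
    if fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(text)))
    elif fmt == "json":
        rows = json.loads(text)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    for index, row in enumerate(rows):
        missing = {"input", "expected_output"} - set(row)
        if missing:
            raise ValueError(f"row {index} is missing columns: {sorted(missing)}")
    return rows


def exact_match_accuracy(outputs: Iterable[str], expected: Iterable[str]) -> float:
    """Fraction of outputs equal to the expected text, ignoring case and outer whitespace."""
    pairs = list(zip(outputs, expected))
    if not pairs:
        return 0.0
    hits = sum(o.strip().lower() == e.strip().lower() for o, e in pairs)
    return hits / len(pairs)
```

Validating columns at load time surfaces a malformed dataset before any LLM calls are spent on it.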


9. Logging & Observability (P2)

Current state: Verbose mode (-v) configures Python's logging module but no handler is attached (Bug #4 in TEST_REPORT.md). No structured logging, no tracing.

Feature:

  • Proper structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR).
  • JSON-formatted log output for machine parsing.
  • Per-iteration trace: minibatch sample IDs, execution outputs, judge scores, proposer prompt diff.
  • Optional OpenTelemetry export for distributed tracing.

Surface: -v / --verbose enables INFO level. --debug enables DEBUG level. --log-format (text | json). --log-file for file output. Config fields: log_level, log_format, log_file.

Scope: cli/app.py (logging setup), evolution.py (structured traces), new TracingPort in domain.
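
The logging setup could be sketched with the standard library alone. The logger name "prometheus", the trace key passed through extra=, and both helper names are assumptions for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Per-iteration trace fields arrive via logging's `extra=` mechanism.
        entry.update(getattr(record, "trace", {}))
        return json.dumps(entry)


def configure_logging(level: str = "INFO", log_format: str = "text", stream=None) -> logging.Logger:
    """Attach exactly one handler so that log calls actually produce output."""
    logger = logging.getLogger("prometheus")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    if log_format == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(getattr(logging, level.upper()))
    logger.propagate = False
    return logger
```

A call such as `log.info("iteration done", extra={"trace": {"iteration": 3}})` then emits the trace fields alongside the message. Attaching the handler explicitly is what addresses the "no handler is attached" symptom described above.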


10. CLI Improvements (P2)

Current state: Single optimize command. Known Typer 0.24 bug absorbing subcommands (Bug #3 in the Known Bugs table). No version, init, or list-results commands.

Feature:

  • Fix Typer subcommand routing.
  • prometheus version — show version.
  • prometheus init — scaffold a config YAML interactively.
  • prometheus list — list past optimization runs.
  • prometheus diff — compare two result files (before/after prompt diff, score improvement).
  • prometheus eval — evaluate a prompt against a dataset without optimization.

Surface: CLI subcommands.

Scope: cli/app.py restructured into cli/commands/ with one module per command.


11. Input Validation & Schema Enforcement (P2)

Current state: Config YAML is parsed as a raw dict with no schema validation. Missing or wrong-type fields cause cryptic errors deep in the pipeline.

Feature:

  • Validate input YAML against a Pydantic schema (leveraging the existing pydantic dependency).
  • Provide clear, actionable error messages for missing/invalid fields.
  • Support config migration/upgrade from older versions.

Surface: Internal. Errors surface as clear CLI messages.

Scope: OptimizationConfig converted to Pydantic model with validators, cli/app.py validation step before pipeline execution.


12. Adaptive Minibatch Sizing (P3)

Current state: Minibatch size is static throughout the run. Small batches are noisy; large batches are expensive.

Feature:

  • Start with a small minibatch for quick early iterations.
  • Increase minibatch size as the prompt improves (higher confidence needed for marginal gains).
  • Shrink if too many evaluations fail (cost optimization).

Surface: --adaptive-minibatch CLI flag (boolean toggle). minibatch_size becomes minibatch_size_min and minibatch_size_max in config.

Scope: EvolutionLoop, SyntheticBootstrap.
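
One possible sizing rule is sketched below. The thresholds (grow at score 0.7, shrink above a 30% failure rate) and the growth factor are arbitrary assumptions; only the min/max bounds correspond to config fields named above.

```python
def next_minibatch_size(
    current: int,
    best_score: float,
    failure_rate: float,
    size_min: int = 3,
    size_max: int = 20,
    grow_above_score: float = 0.7,
    shrink_above_failure_rate: float = 0.3,
) -> int:
    """Return the minibatch size for the next iteration, clamped to [size_min, size_max]."""
    if failure_rate > shrink_above_failure_rate:
        # Too many failed evaluations: halve the batch to limit wasted calls.
        proposed = current // 2
    elif best_score >= grow_above_score:
        # Marginal gains need more evidence: grow by roughly 50%.
        proposed = current + max(1, current // 2)
    else:
        proposed = current
    return max(size_min, min(size_max, proposed))
```

Checking the failure rate first means cost control wins when both conditions hold, which matches the "cost optimization" note on the shrink bullet.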


13. Prompt Diversity Tracking (P3)

Current state: No visibility into how much the prompt is actually changing between iterations. A "successful" optimization might just rephrase without structural change.

Feature:

  • Compute edit distance (Levenshtein) or embedding cosine similarity between consecutive prompts.
  • Report diversity metrics in the result.
  • Flag stagnation (N iterations with <epsilon change).

Surface: Internal. Reported in OptimizationResult.history entries.

Scope: EvolutionLoop, OptimizationResult (add diversity field per history entry).
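
A dependency-free sketch of the Levenshtein option with a stagnation check. The normalization by the longer prompt's length and the default epsilon of 0.05 are assumptions; normalized_change and is_stagnant are hypothetical names.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_change(old: str, new: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the prompt did not change."""
    longest = max(len(old), len(new))
    return levenshtein(old, new) / longest if longest else 0.0


def is_stagnant(prompts: List[str], n: int = 3, epsilon: float = 0.05) -> bool:
    """True if each of the last n consecutive prompt changes is below epsilon."""
    if len(prompts) < n + 1:
        return False
    recent = prompts[-(n + 1):]
    return all(normalized_change(x, y) < epsilon for x, y in zip(recent, recent[1:]))
```

The value from normalized_change is the kind of number that could populate a per-entry diversity field in OptimizationResult.history. Character-level distance will not tell a rephrase from a structural change, so an embedding-based measure may still be needed for that distinction.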


14. Temperature & Sampling Control (P3)

Current state: No way to control LLM temperature, top_p, or other sampling parameters for any of the four model slots. DSPy defaults apply.

Feature:

  • Per-model-slot temperature and sampling parameters.
  • Higher temperature for proposer (creativity), lower for judge (consistency).

Surface: task_temperature, judge_temperature, proposer_temperature, synth_temperature Config fields. --temperature CLI flag for global override.

Scope: cli/app.py (DSPy LM configuration), infrastructure adapters.


15. Cost Estimation & Budget Caps (P3)

Current state: total_llm_calls is tracked (inaccurately). No cost estimation, no budget caps.

Feature:

  • Estimate cost per run based on model pricing and approximate token counts.
  • Allow users to set a budget cap (--max-cost-usd).
  • Report estimated cost in the result.

Surface: --max-cost-usd CLI flag. max_cost_usd Config field. Cost breakdown in result output.

Scope: cli/app.py, OptimizationResult (add cost fields), token counting in adapters.
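
A budget cap could be enforced with a small tracker like the one below. CostTracker and BudgetExceeded are hypothetical names, and any prices passed in are caller-supplied placeholders, since no real pricing is given here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when the accumulated estimate passes the configured cap."""


@dataclass
class CostTracker:
    # USD per 1M tokens as (input_price, output_price), supplied by the caller.
    pricing: Dict[str, Tuple[float, float]]
    max_cost_usd: Optional[float] = None
    spent_by_model: Dict[str, float] = field(default_factory=dict)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add one call's estimated cost, then enforce the cap."""
        price_in, price_out = self.pricing[model]
        cost = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
        self.spent_by_model[model] = self.spent_by_model.get(model, 0.0) + cost
        if self.max_cost_usd is not None and self.total > self.max_cost_usd:
            raise BudgetExceeded(
                f"estimated ${self.total:.4f} exceeds cap ${self.max_cost_usd:.4f}"
            )
        return cost

    @property
    def total(self) -> float:
        """Estimated spend across all models so far."""
        return sum(self.spent_by_model.values())
```

Keeping spend per model gives the cost breakdown for the result output at no extra bookkeeping cost. Note that the cap is checked after a call is recorded, so a run can overshoot by at most one call.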


16. Multi-Objective Optimization (P3)

Current state: Single scalar score from the judge. The Prompt entity comment mentions "Pareto tracking" but it's not implemented.

Feature:

  • Optimize for multiple objectives simultaneously (quality, latency, token efficiency, safety).
  • Maintain a Pareto front of non-dominated candidates.
  • Allow users to set objective weights or constraints.

Surface: objectives Config field (list of {name, weight, judge_criteria}). CLI: --objective repeatable flag.

Scope: EvolutionLoop (Pareto front), scoring.py (multi-objective acceptance), OptimizationResult (Pareto set).
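
Maintaining a Pareto front reduces to a dominance filter. The sketch below assumes every objective is higher-is-better (costs such as latency would be negated first) and that candidates are plain dicts of objective scores; both are simplifications for illustration.

```python
from typing import Dict, List


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere.

    All objectives are treated as higher-is-better.
    """
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Keep every candidate that no other candidate dominates."""
    return [
        cand for cand in candidates
        if not any(dominates(other, cand) for other in candidates if other is not cand)
    ]
```

This pairwise filter is quadratic in the number of candidates, which should be acceptable for the small populations a prompt optimizer keeps.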


17. Export Optimized Prompt (P3)

Current state: The optimized prompt is embedded in the YAML result file. No easy way to extract it for use.

Feature:

  • prometheus export command to extract the optimized prompt as plain text.
  • Support multiple export formats: plain text, Markdown, JSON, LangChain template, DSPy module.
  • Copy to clipboard option.

Surface: prometheus export --format <txt|md|json|langchain|dspy> CLI subcommand. --clipboard flag.

Scope: New cli/commands/export.py, format renderers in infrastructure.


18. Config Profiles / Presets (P3)

Current state: Every run requires a full config YAML. Common patterns (fast iterate, thorough optimize, cheap run) are not captured.

Feature:

  • Named profiles: fast, thorough, economy, research.
  • Profile overrides individual config fields.
  • User-defined profiles stored in ~/.prometheus/profiles/.

Surface: --profile CLI flag. prometheus profile list / prometheus profile create subcommands.

Scope: cli/app.py, new ProfileManager in application.


Summary Table

| # | Feature | Priority | CLI Surface | Config Surface | Estimated Scope |
|---|---------|----------|-------------|----------------|-----------------|
| 1 | Multi-Model Routing | P1 | Existing | Existing | Small |
| 2 | Async / Parallel Execution | P1 | --max-concurrency | max_concurrency | Medium |
| 3 | Error Handling & Retry | P1 | --max-retries, --error-strategy | max_retries, error_strategy | Medium |
| 4 | Checkpoint & Resume | P2 | --checkpoint-dir, --resume | checkpoint_interval | Medium |
| 5 | Population-Based Evolution | P2 | --population-size, --crossover-rate | population_size, crossover_rate | Large |
| 6 | Hold-Out Validation | P2 | --validation-split, --early-stop-patience | validation_split, early_stop_patience | Medium |
| 7 | Custom Judge Criteria | P2 | --judge-criteria | judge_criteria, judge_dimensions | Medium |
| 8 | Real-World Eval Harness | P2 | --eval-dataset, --eval-metric | eval_dataset_path | Large |
| 9 | Logging & Observability | P2 | --debug, --log-format, --log-file | log_level, log_format | Medium |
| 10 | CLI Improvements | P2 | Subcommands | | Medium |
| 11 | Input Validation | P2 | — (error messages) | | Small |
| 12 | Adaptive Minibatch | P3 | --adaptive-minibatch | minibatch_size_min/max | Small |
| 13 | Prompt Diversity Tracking | P3 | | | Small |
| 14 | Temperature & Sampling | P3 | --temperature | *_temperature | Small |
| 15 | Cost Estimation | P3 | --max-cost-usd | max_cost_usd | Small |
| 16 | Multi-Objective Optimization | P3 | --objective | objectives | Large |
| 17 | Export Optimized Prompt | P3 | prometheus export | | Small |
| 18 | Config Profiles / Presets | P3 | --profile | | Small |

Known Bugs (from TEST_REPORT.md and code review)

| # | Bug | Severity | File |
|---|-----|----------|------|
| 1 | Multi-model config not wired — all adapters use single global LM | HIGH | cli/app.py, all adapters |
| 2 | DSPyLLMAdapter accepts model param but never uses it | HIGH | infrastructure/llm_adapter.py |
| 3 | CLI subcommand optimize absorbed by Typer 0.24 | HIGH | cli/app.py |
| 4 | Verbose logging produces no output — no handler configured | MEDIUM | cli/app.py |
| 5 | total_llm_calls counter is inaccurate | LOW | application/use_cases.py, evolution.py |
| 6 | normalize_score() is dead code — never called | LOW | domain/scoring.py |
| 7 | AppSettings is never imported or used | LOW | config.py |
| 8 | No LLM error handling in evolution loop | MEDIUM | evolution.py |
| 9 | Unpinned dependencies (dspy, typer) | LOW | pyproject.toml |

Test Coverage Gaps

| Area | Current | Needed |
|------|---------|--------|
| CLI commands | 0 tests | Unit + integration for each subcommand |
| Config validation | 0 tests | Schema validation, missing fields, type errors |
| Evolution loop | 3 tests (single iteration each) | Multi-iteration, mixed accept/reject, failure recovery |
| Integration pipeline | 1 test (happy path only) | Error paths, mixed results, real adapters |
| Adapter coverage | 1 adapter tested | All 4 adapters + error scenarios |
| Use case orchestration | 1 indirect test | Direct unit tests for OptimizePromptUseCase |

Phase 1 — Production Reliability (P1)

  1. Fix multi-model routing (#1) — highest impact, smallest scope
  2. Add error handling & retry (#3) — essential for production runs
  3. Implement async/parallel execution (#2) — biggest wall-clock improvement

Phase 2 — Optimization Quality (P2)

  1. Input validation (#11) — small scope, high reliability gain
  2. Logging & observability (#9) — enables debugging long runs
  3. CLI improvements (#10) — fix Typer bug, add basic commands
  4. Hold-out validation (#6) — prevents overfitting
  5. Checkpoint & resume (#4) — essential for long runs
  6. Custom judge criteria (#7) — enables domain-specific optimization

Phase 3 — Advanced Features (P3)

  1. Population-based evolution (#5; rated P2, large scope)
  2. Real-world eval harness (#8; rated P2, large scope)
  3. Remaining P3 features as demand dictates