Initial commit: PROMETHEUS v0.1.0 - Prompt optimizer

- Clean architecture (domain/application/infrastructure)
- DSPy-based evolution engine with scoring
- CLI via pyproject.toml entry point
- Unit + integration tests (~300 tests)
- Configs for glm-5.1 and glm-4.5-air models
- Z.AI endpoint integration
2026-03-29 11:44:03 +00:00
commit 837a44970f
49 changed files with 6599 additions and 0 deletions

42
.gitignore vendored Normal file

@@ -0,0 +1,42 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.egg-info/
dist/
build/
*.egg
# Virtual environments
.venv/
venv/
env/
# Testing / coverage
.pytest_cache/
.ruff_cache/
.mypy_cache/
.coverage
coverage.json
htmlcov/
# IDE
.idea/
.vscode/
*.swp
*.swo
# Environment / secrets
.env
.env_runtime
# OS
.DS_Store
Thumbs.db
# Output artifacts (transient)
result_*.yaml
test_result.yaml
zai_result.yaml
prompt_optimized_*.md
TEST_REPORT.*

27
README.md Normal file

@@ -0,0 +1,27 @@
# Prometheus
Prompt evolution without reference data.
## Quick Start
```bash
uv sync
uv run prometheus optimize -i examples/sample_config.yaml -o result.yaml -v
```
## Architecture
Clean hexagonal architecture with four layers:
- **Domain** — entities, ports, scoring (zero external dependencies)
- **Application** — use cases, bootstrap, evaluator, evolution loop
- **Infrastructure** — DSPy signatures, modules, adapters, file I/O
- **CLI** — Typer app with `optimize` command
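
The dependency direction between these layers can be sketched as follows. This is an illustration rather than the project's code: the port is simplified to plain strings, and `EchoLLM` and `run_once` are hypothetical names.

```python
from abc import ABC, abstractmethod


class LLMPort(ABC):
    """Domain-side interface: knows nothing about DSPy or HTTP."""

    @abstractmethod
    def execute(self, prompt: str, input_text: str) -> str: ...


class EchoLLM(LLMPort):
    """Hypothetical infrastructure adapter standing in for a real model."""

    def execute(self, prompt: str, input_text: str) -> str:
        return f"{prompt} :: {input_text}"


def run_once(llm: LLMPort, prompt: str, input_text: str) -> str:
    """Application-layer code depends only on the port, never on the adapter."""
    return llm.execute(prompt, input_text)


print(run_once(EchoLLM(), "Summarize", "hello"))
```

Swapping `EchoLLM` for a real adapter would not require touching `run_once`, which is the property the layering is meant to provide.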
## Testing
```bash
uv run pytest
uv run ruff check .
uv run mypy src/
```

32
config_glm45air.yaml Normal file

@@ -0,0 +1,32 @@
# PROMETHEUS — DAG Planner Optimization (glm-4.5-air)
# =======================================================
seed_prompt: |
  A new developer is getting up to speed on the project. Here is their request:
  I would like to modify the bridge pension calculation mechanism to handle different retirement ages depending on the beneficiary's locality.
  You are part of an agentic system whose overall goal is to provide them with:
  - Complete architecture diagrams of the ecosystem of the components involved.
  - Highly specific flow diagrams for the components involved.
  - Flow diagrams relating all user actions that have an impact on this request.
  - Forward-looking diagrams of each of these types, highlighting the modifications the developer should make.
  Your role is critical and sits at the start of the chain. You must create a DAG-like structure (Directed Acyclic Graph) in markdown that lists, in this context, all the operations to be carried out by the rest of the system. Your output will be passed to an orchestrator, which will use it to plan and delegate the specific analysis requests for the various components needed to reach the overall goal. The DAG must therefore contain, as needed, nodes for component analyses, specific flows, and consolidation tasks for creating the diagrams. Each node of your DAG will be handled by a competent agent and must contain the following information:
  - Complete description of the task to accomplish
  - Output format (always markdown, but clearly define the expected content)
  - Dependencies on other nodes, so that the orchestrator can parallelize work and identify problems
  To make sure you create a coherent plan, use the tools you have to carry out the preliminary analysis that will determine the needs.

task_description: |
  Task decomposition planner for a multi-agent software analysis system. The assistant operates as the first stage in an agentic pipeline. Given a developer's feature request involving a domain-specific codebase (e.g., insurance, actuarial calculations, pension bridge calculations), it must: (1) Analyze the request to identify all impacted system components and data flows, (2) Produce a structured Directed Acyclic Graph (DAG) in markdown format, (3) Each DAG node represents a subtask for a specialized downstream agent, containing: (a) complete task description with enough context for an autonomous agent to execute, (b) expected output format specification (markdown with defined structure), (c) dependencies on other nodes for orchestration scheduling, (4) The DAG must enable an orchestrator to: parallelize independent analyses, identify blocking dependencies, and distribute work to specialized diagram-generation agents. The pipeline's ultimate deliverables are: ecosystem architecture diagrams, component-specific flow diagrams, user-action impact maps, and forward-looking modification plans. Quality criteria: DAG completeness (all necessary analyses covered), acyclicity correctness, task description specificity and actionability, dependency accuracy reflecting real component relationships, and parallelization opportunities correctly identified.

task_model: "openai/glm-4.5-air"
judge_model: "openai/glm-4.5-air"
proposer_model: "openai/glm-4.5-air"
synth_model: "openai/glm-4.5-air"
api_base: "https://api.z.ai/api/coding/paas/v4"
api_key_env: "GLM_API_KEY"
max_iterations: 8
n_synthetic_inputs: 5
minibatch_size: 3
seed: 42

32
config_glm51.yaml Normal file

@@ -0,0 +1,32 @@
# PROMETHEUS — DAG Planner Optimization (glm-5.1)
# ===================================================
seed_prompt: |
  A new developer is getting up to speed on the project. Here is their request:
  I would like to modify the bridge pension calculation mechanism to handle different retirement ages depending on the beneficiary's locality.
  You are part of an agentic system whose overall goal is to provide them with:
  - Complete architecture diagrams of the ecosystem of the components involved.
  - Highly specific flow diagrams for the components involved.
  - Flow diagrams relating all user actions that have an impact on this request.
  - Forward-looking diagrams of each of these types, highlighting the modifications the developer should make.
  Your role is critical and sits at the start of the chain. You must create a DAG-like structure (Directed Acyclic Graph) in markdown that lists, in this context, all the operations to be carried out by the rest of the system. Your output will be passed to an orchestrator, which will use it to plan and delegate the specific analysis requests for the various components needed to reach the overall goal. The DAG must therefore contain, as needed, nodes for component analyses, specific flows, and consolidation tasks for creating the diagrams. Each node of your DAG will be handled by a competent agent and must contain the following information:
  - Complete description of the task to accomplish
  - Output format (always markdown, but clearly define the expected content)
  - Dependencies on other nodes, so that the orchestrator can parallelize work and identify problems
  To make sure you create a coherent plan, use the tools you have to carry out the preliminary analysis that will determine the needs.

task_description: |
  Task decomposition planner for a multi-agent software analysis system. The assistant operates as the first stage in an agentic pipeline. Given a developer's feature request involving a domain-specific codebase (e.g., insurance, actuarial calculations, pension bridge calculations), it must: (1) Analyze the request to identify all impacted system components and data flows, (2) Produce a structured Directed Acyclic Graph (DAG) in markdown format, (3) Each DAG node represents a subtask for a specialized downstream agent, containing: (a) complete task description with enough context for an autonomous agent to execute, (b) expected output format specification (markdown with defined structure), (c) dependencies on other nodes for orchestration scheduling, (4) The DAG must enable an orchestrator to: parallelize independent analyses, identify blocking dependencies, and distribute work to specialized diagram-generation agents. The pipeline's ultimate deliverables are: ecosystem architecture diagrams, component-specific flow diagrams, user-action impact maps, and forward-looking modification plans. Quality criteria: DAG completeness (all necessary analyses covered), acyclicity correctness, task description specificity and actionability, dependency accuracy reflecting real component relationships, and parallelization opportunities correctly identified.

task_model: "openai/glm-5.1"
judge_model: "openai/glm-5.1"
proposer_model: "openai/glm-5.1"
synth_model: "openai/glm-5.1"
api_base: "https://api.z.ai/api/coding/paas/v4"
api_key_env: "GLM_API_KEY"
max_iterations: 3
n_synthetic_inputs: 3
minibatch_size: 2
seed: 42

1642
docs/technical-spec.md Normal file

File diff suppressed because it is too large


@@ -0,0 +1,27 @@
# PROMETHEUS Configuration File
# ==================================

# The initial prompt to optimize
seed_prompt: |
  You are an expert assistant in contract analysis.
  Analyze the provided text and identify potentially abusive clauses.
  Be precise and cite the relevant passages.

# Task description (used to generate synthetic inputs)
task_description: |
  Legal analysis of contracts to identify abusive clauses.
  The assistant must examine a contract text and flag
  any clause that could be considered abusive under
  French consumer protection law.

# LLM models (DSPy/litellm format)
task_model: "openai/gpt-4o-mini"
judge_model: "openai/gpt-4o"
proposer_model: "openai/gpt-4o"
synth_model: "openai/gpt-4o"

# Evolution parameters
max_iterations: 30
n_synthetic_inputs: 20
minibatch_size: 5
seed: 42
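
The evolution parameters interact: minibatch sampling clamps the batch to the pool size, so a `minibatch_size` larger than `n_synthetic_inputs` has no extra effect. A hedged sketch of a pre-flight check on the parsed config follows. `check_evolution_params` is a hypothetical helper, not part of the package, and the dict stands in for the parsed YAML; the fallback values mirror the defaults used by the CLI.

```python
def check_evolution_params(cfg: dict) -> list[str]:
    """Return human-readable warnings for questionable parameter combinations."""
    warnings = []
    if cfg.get("max_iterations", 30) < 1:
        warnings.append("max_iterations should be at least 1")
    pool = cfg.get("n_synthetic_inputs", 20)
    batch = cfg.get("minibatch_size", 5)
    if batch > pool:
        warnings.append(
            f"minibatch_size ({batch}) exceeds n_synthetic_inputs ({pool}); "
            "batches will be clamped to the pool size"
        )
    return warnings


ok = check_evolution_params(
    {"max_iterations": 30, "n_synthetic_inputs": 20, "minibatch_size": 5}
)
clamped = check_evolution_params({"n_synthetic_inputs": 3, "minibatch_size": 5})
print(ok)
print(clamped)
```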


@@ -0,0 +1,20 @@
# PROMETHEUS Test Config — Code Review Task
# ===========================================

seed_prompt: |
  You are a helpful coding assistant. Help the user write code.

task_description: |
  Code review and bug detection assistant. Reviews code snippets and
  identifies bugs, security issues, and style problems. The assistant
  receives a code snippet and must produce a structured review.

task_model: "openai/glm-4.5-air"
judge_model: "openai/glm-4.5-air"
proposer_model: "openai/glm-4.5-air"
synth_model: "openai/glm-4.5-air"

max_iterations: 5
n_synthetic_inputs: 8
minibatch_size: 3
seed: 123


@@ -0,0 +1,26 @@
# PROMETHEUS Test Config — Real run with Z.AI
# =============================================

seed_prompt: |
  You are a helpful coding assistant. Help the user write clean, bug-free code.
  When reviewing code, identify bugs and suggest improvements.

task_description: |
  Code review and bug detection assistant. Reviews code snippets and
  identifies bugs, security issues, and style problems. The assistant
  receives a code snippet and must produce a structured review.

task_model: "openai/glm-4.5-air"
judge_model: "openai/glm-4.5-air"
proposer_model: "openai/glm-4.5-air"
synth_model: "openai/glm-4.5-air"

# API configuration for z.ai
api_base: "https://api.z.ai/api/paas/v4"
api_key_env: "GLM_API_KEY"

# Evolution parameters (reduced for quick test)
max_iterations: 3
n_synthetic_inputs: 3
minibatch_size: 2
seed: 42

34
examples/zai_config.yaml Normal file

@@ -0,0 +1,34 @@
# PROMETHEUS Configuration File — z.ai Backend
# ==================================
# REQUIRES env vars:
#   export OPENAI_API_KEY=<your_glm_key>
# (api_base is configured below)

# The initial prompt to optimize
seed_prompt: |
  You are an expert assistant in contract analysis.
  Analyze the provided text and identify potentially abusive clauses.
  Be precise and cite the relevant passages.

# Task description (used to generate synthetic inputs)
task_description: |
  Legal analysis of contracts to identify abusive clauses.
  The assistant must examine a contract text and flag
  any clause that could be considered abusive under
  French consumer protection law.

# LLM models (DSPy/litellm format with openai/ prefix for z.ai)
task_model: "openai/glm-4.5-air"
judge_model: "openai/glm-4.5-air"
proposer_model: "openai/glm-4.5-air"
synth_model: "openai/glm-4.5-air"

# API configuration for z.ai
api_base: "https://api.z.ai/api/paas/v4"
api_key_env: "OPENAI_API_KEY"

# Evolution parameters (reduced for functional testing)
max_iterations: 3
n_synthetic_inputs: 5
minibatch_size: 3
seed: 42

47
pyproject.toml Normal file

@@ -0,0 +1,47 @@
[project]
name = "prometheus"
version = "0.1.0"
description = "Prompt evolution without reference data"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"dspy>=2.6,<3.0",
"typer>=0.15,<0.20",
"pydantic>=2.10",
"pydantic-settings>=2.7",
"pyyaml>=6.0",
"rich>=13.9",
]
[project.optional-dependencies]
dev = [
"pytest>=8.3",
"pytest-cov>=6.0",
"ruff>=0.9",
"mypy>=1.14",
"types-pyyaml>=6.0.12.20250915",
]
[project.scripts]
prometheus = "prometheus.cli.app:app"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.ruff]
line-length = 100
target-version = "py312"
[tool.mypy]
python_version = "3.12"
strict = true
[[tool.mypy.overrides]]
module = ["dspy", "dspy.*"]
ignore_missing_imports = true
[[tool.mypy.overrides]]
module = ["prometheus.infrastructure.*", "prometheus.cli.app"]
disable_error_code = ["misc", "import-untyped"]

6
run_glm45air.sh Executable file

@@ -0,0 +1,6 @@
#!/bin/bash
set -a
source /home/debian/workspace/prometheus/.env_runtime
set +a
cd /home/debian/workspace/prometheus
uv run prometheus optimize -i config_glm45air.yaml -o result_glm45air.yaml -v

6
run_glm51.sh Normal file

@@ -0,0 +1,6 @@
#!/bin/bash
set -a
source /home/debian/workspace/prometheus/.env_runtime
set +a
cd /home/debian/workspace/prometheus
uv run prometheus optimize -i config_glm51.yaml -o result_glm51.yaml -v


@@ -0,0 +1,3 @@
"""PROMETHEUS — Prompt evolution without reference data."""
__version__ = "0.1.0"


@@ -0,0 +1,42 @@
"""
Bootstrap — synthetic input generation.

Creates a pool of test inputs from the task description.
This replaces the need for a labelled dataset.
"""

from __future__ import annotations

import random

from prometheus.domain.entities import SyntheticExample
from prometheus.domain.ports import SyntheticGeneratorPort


class SyntheticBootstrap:
    """Orchestrates synthetic input generation.

    Depends only on the abstract port, not on DSPy directly.
    """

    def __init__(self, generator: SyntheticGeneratorPort, seed: int = 42):
        self._generator = generator
        self._rng = random.Random(seed)

    def run(self, task_description: str, n_examples: int) -> list[SyntheticExample]:
        """Generate the synthetic pool in a single call.

        Single call minimizes LLM cost (1 call instead of N),
        and the LLM can ensure diversity in a single generation.
        """
        examples = self._generator.generate_inputs(task_description, n_examples)
        self._rng.shuffle(examples)
        return examples

    def sample_minibatch(
        self,
        pool: list[SyntheticExample],
        size: int,
    ) -> list[SyntheticExample]:
        """Sample a minibatch from the synthetic pool."""
        size = min(size, len(pool))
        return self._rng.sample(pool, size)
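
The seeded sampling can be checked in isolation. The sketch below is not the project's code: it reproduces only the `random.Random(seed)` logic of `sample_minibatch` on plain strings, and the free-function form is an assumption made for brevity.

```python
import random


def sample_minibatch(rng: random.Random, pool: list[str], size: int) -> list[str]:
    """Sample without replacement, clamping size to the pool length."""
    size = min(size, len(pool))
    return rng.sample(pool, size)


pool = [f"input-{i}" for i in range(5)]

# Two generators built from the same seed yield the same minibatch.
first = sample_minibatch(random.Random(42), pool, 3)
second = sample_minibatch(random.Random(42), pool, 3)
print(first == second)

# Asking for more items than the pool holds returns the whole pool, reordered.
oversized = sample_minibatch(random.Random(42), pool, 10)
print(sorted(oversized) == sorted(pool))
```

Because the class keeps a single `Random` instance, successive minibatches differ from one another while the whole sequence stays reproducible for a given seed.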


@@ -0,0 +1,47 @@
"""Data Transfer Objects — configuration and results."""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class OptimizationConfig:
    """Complete configuration for a PROMETHEUS run."""

    # --- Prompt ---
    seed_prompt: str
    task_description: str

    # --- Models ---
    task_model: str = "openai/gpt-4o-mini"
    judge_model: str = "openai/gpt-4o"
    proposer_model: str = "openai/gpt-4o"
    synth_model: str = "openai/gpt-4o"

    # --- Evolution parameters ---
    max_iterations: int = 30
    n_synthetic_inputs: int = 20
    minibatch_size: int = 5
    perfect_score: float = 1.0

    # --- Reproducibility ---
    seed: int = 42

    # --- Output ---
    output_path: str = "output.yaml"
    verbose: bool = False


@dataclass
class OptimizationResult:
    """Result of a complete optimization."""

    optimized_prompt: str
    initial_prompt: str
    iterations_used: int
    total_llm_calls: int
    initial_score: float
    final_score: float
    improvement: float
    history: list[dict[str, Any]] = field(default_factory=list)


@@ -0,0 +1,75 @@
"""
Evaluator — execution + judgement.

Produces a quality signal without ground truth.
Combines candidate prompt execution + LLM-as-Judge evaluation.
"""

from __future__ import annotations

from prometheus.domain.entities import (
    EvalResult,
    Prompt,
    SyntheticExample,
    Trajectory,
)
from prometheus.domain.ports import JudgePort, LLMPort


class PromptEvaluator:
    """Evaluates a prompt on a minibatch of synthetic inputs.

    Pipeline: execute → judge → build trajectories.
    Replaces GEPA's EvaluatorFn. Instead of comparing to ground truth,
    uses an LLM-as-Judge.
    """

    def __init__(self, executor: LLMPort, judge: JudgePort):
        self._executor = executor
        self._judge = judge

    def evaluate(
        self,
        prompt: Prompt,
        minibatch: list[SyntheticExample],
        task_description: str,
    ) -> EvalResult:
        """Evaluate the prompt on the minibatch.

        Steps:
        1. Execute the prompt on each input in the minibatch
        2. Judge each (input, output) pair
        3. Build trajectories with feedback
        """
        # Step 1: Execution
        outputs: list[str] = []
        for example in minibatch:
            raw_output = self._executor.execute(prompt, example.input_text)
            outputs.append(raw_output)

        # Step 2: Judgement
        pairs = [(ex.input_text, out) for ex, out in zip(minibatch, outputs)]
        judge_results = self._judge.judge_batch(task_description, pairs)

        # Step 3: Build trajectories
        scores: list[float] = []
        feedbacks: list[str] = []
        trajectories: list[Trajectory] = []
        for i, (example, output) in enumerate(zip(minibatch, outputs)):
            score, feedback = judge_results[i]
            scores.append(score)
            feedbacks.append(feedback)
            trajectories.append(
                Trajectory(
                    input_text=example.input_text,
                    output_text=output,
                    score=score,
                    feedback=feedback,
                    prompt_used=prompt.text,
                )
            )

        return EvalResult(
            scores=scores,
            feedbacks=feedbacks,
            trajectories=trajectories,
        )
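
A minimal sketch of the execute, judge, build-trajectories pipeline with deterministic stand-ins. `UpperCaseExecutor` and `LengthJudge` are hypothetical stubs, `Trajectory` is a trimmed-down copy, and prompts and examples are plain strings rather than the project's `Prompt` and `SyntheticExample` types.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    input_text: str
    output_text: str
    score: float
    feedback: str


class UpperCaseExecutor:
    """Hypothetical stand-in for the task model."""

    def execute(self, prompt: str, input_text: str) -> str:
        return input_text.upper()


class LengthJudge:
    """Hypothetical judge: full score when the output is non-empty."""

    def judge_batch(
        self, task_description: str, pairs: list[tuple[str, str]]
    ) -> list[tuple[float, str]]:
        return [(1.0, "ok") if out else (0.0, "empty output") for _, out in pairs]


def evaluate(executor, judge, prompt: str, minibatch: list[str], task: str) -> list[Trajectory]:
    # Step 1: run the prompt on every input.
    outputs = [executor.execute(prompt, text) for text in minibatch]
    # Step 2: judge all (input, output) pairs in one batch call.
    verdicts = judge.judge_batch(task, list(zip(minibatch, outputs)))
    # Step 3: attach score and feedback to each execution trace.
    return [
        Trajectory(text, out, score, feedback)
        for text, out, (score, feedback) in zip(minibatch, outputs, verdicts)
    ]


trajectories = evaluate(UpperCaseExecutor(), LengthJudge(), "Shout.", ["abc", ""], "demo")
for t in trajectories:
    print(t.score, t.feedback)
```

Stubs of this kind make the evaluator testable without any network access, since it only ever talks to the two ports.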


@@ -0,0 +1,174 @@
"""
Evolution loop — core PROMETHEUS engine.

Orchestrates the select → evaluate → propose → accept cycle.
Equivalent to GEPAEngine.run(), adapted to work without a valset.
"""

from __future__ import annotations

import logging

from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.domain.entities import (
    Candidate,
    OptimizationState,
    Prompt,
    SyntheticExample,
)
from prometheus.domain.ports import ProposerPort
from prometheus.domain.scoring import should_accept

logger = logging.getLogger(__name__)


class EvolutionLoop:
    """Main evolution loop.

    Design:
    - Keeps only the best candidate (no full population).
    - Simplifies vs GEPA (no Pareto, no merge).
    - Population support deferred to v2.
    """

    def __init__(
        self,
        evaluator: PromptEvaluator,
        proposer: ProposerPort,
        bootstrap: SyntheticBootstrap,
        max_iterations: int = 30,
        minibatch_size: int = 5,
        perfect_score: float = 1.0,
        verbose: bool = False,
    ):
        self._evaluator = evaluator
        self._proposer = proposer
        self._bootstrap = bootstrap
        self._max_iterations = max_iterations
        self._minibatch_size = minibatch_size
        self._perfect_score = perfect_score
        self._verbose = verbose

    def run(
        self,
        seed_prompt: Prompt,
        synthetic_pool: list[SyntheticExample],
        task_description: str,
    ) -> OptimizationState:
        """Execute the complete evolution loop."""
        state = OptimizationState()

        # Evaluate the seed
        initial_batch = self._bootstrap.sample_minibatch(
            synthetic_pool, self._minibatch_size
        )
        initial_eval = self._evaluator.evaluate(
            seed_prompt, initial_batch, task_description
        )
        # N executions + N judge calls, where N is the actual batch size
        # (sample_minibatch clamps it to the pool size).
        state.total_llm_calls += 2 * len(initial_batch)

        best_candidate = Candidate(
            prompt=seed_prompt,
            best_score=initial_eval.total_score,
            generation=0,
        )
        state.best_candidate = best_candidate
        state.candidates.append(best_candidate)
        self._log(f"Initial score: {initial_eval.total_score:.2f}")

        # Main loop
        for i in range(1, self._max_iterations + 1):
            state.iteration = i
            try:
                # 1. Sample a fresh minibatch
                batch = self._bootstrap.sample_minibatch(
                    synthetic_pool, self._minibatch_size
                )

                # 2. Evaluate the current candidate
                current_eval = self._evaluator.evaluate(
                    best_candidate.prompt, batch, task_description
                )
                state.total_llm_calls += 2 * len(batch)

                # 3. Skip if perfect
                if all(s >= self._perfect_score for s in current_eval.scores):
                    self._log(f"Iter {i}: All scores perfect, skipping.")
                    state.history.append(
                        {
                            "iteration": i,
                            "event": "skip_perfect",
                            "current_score": current_eval.total_score,
                        }
                    )
                    continue

                # 4. Propose a new prompt (reflective mutation)
                new_prompt = self._proposer.propose(
                    best_candidate.prompt,
                    current_eval.trajectories,
                    task_description,
                )
                state.total_llm_calls += 1  # 1 proposition call

                # 5. Evaluate the new prompt on the same minibatch
                new_eval = self._evaluator.evaluate(
                    new_prompt, batch, task_description
                )
                state.total_llm_calls += 2 * len(batch)

                # 6. Accept or reject
                if should_accept(current_eval, new_eval):
                    best_candidate = Candidate(
                        prompt=new_prompt,
                        best_score=new_eval.total_score,
                        generation=i,
                        parent_id=id(best_candidate),
                    )
                    state.best_candidate = best_candidate
                    state.candidates.append(best_candidate)
                    self._log(
                        f"Iter {i}: ACCEPTED "
                        f"({current_eval.total_score:.2f} -> {new_eval.total_score:.2f})"
                    )
                    state.history.append(
                        {
                            "iteration": i,
                            "event": "accepted",
                            "old_score": current_eval.total_score,
                            "new_score": new_eval.total_score,
                            "improvement": new_eval.total_score
                            - current_eval.total_score,
                        }
                    )
                else:
                    self._log(
                        f"Iter {i}: REJECTED "
                        f"({new_eval.total_score:.2f} <= {current_eval.total_score:.2f})"
                    )
                    state.history.append(
                        {
                            "iteration": i,
                            "event": "rejected",
                            "old_score": current_eval.total_score,
                            "new_score": new_eval.total_score,
                        }
                    )
            except Exception as exc:
                self._log(f"Iter {i}: ERROR — {exc}. Skipping iteration.")
                state.history.append(
                    {
                        "iteration": i,
                        "event": "error",
                        "error": str(exc),
                    }
                )
                continue

        return state

    def _log(self, msg: str) -> None:
        if self._verbose:
            logger.info("[PROMETHEUS] %s", msg)
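
Stripped of LLM calls, the cycle is greedy hill climbing with a strict acceptance test. The toy below is an illustration only: `score` and `propose` are hypothetical deterministic stand-ins for the judge and the proposer, and the "prompt" is an integer.

```python
def score(candidate: int) -> float:
    """Hypothetical judge: candidates closer to 7 score higher."""
    return -abs(candidate - 7)


def propose(candidate: int, iteration: int) -> int:
    """Hypothetical proposer: step up on odd iterations, down on even ones."""
    return candidate + (2 if iteration % 2 else -1)


def evolve(seed: int, max_iterations: int) -> tuple[int, list[str]]:
    best, events = seed, []
    for i in range(1, max_iterations + 1):
        current_score = score(best)
        new = propose(best, i)
        # Strict improvement is required, so ties keep the current best.
        if score(new) > current_score:
            best = new
            events.append("accepted")
        else:
            events.append("rejected")
    return best, events


best, events = evolve(seed=0, max_iterations=8)
print(best, events)
```

In this run the seventh proposal ties with the current best and is rejected, which is the same behaviour the strict `>` comparison gives the real loop.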


@@ -0,0 +1,77 @@
"""
Main use case — high-level orchestration.

Entry point for business logic. Coordinates bootstrap → evolution → result.
Contains no technical logic, only orchestration.
"""

from __future__ import annotations

from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.dto import OptimizationConfig, OptimizationResult
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.domain.entities import Prompt
from prometheus.domain.ports import ProposerPort


class OptimizePromptUseCase:
    """Single MVP use case.

    Injects dependencies via constructor (dependency injection).
    """

    def __init__(
        self,
        evaluator: PromptEvaluator,
        proposer: ProposerPort,
        bootstrap: SyntheticBootstrap,
    ):
        self._evaluator = evaluator
        self._proposer = proposer
        self._bootstrap = bootstrap

    def execute(self, config: OptimizationConfig) -> OptimizationResult:
        """Full pipeline:

        1. Bootstrap → generate synthetic inputs
        2. Evolution → optimization loop
        3. Return result
        """
        # Phase 0: Bootstrap
        synthetic_pool = self._bootstrap.run(
            task_description=config.task_description,
            n_examples=config.n_synthetic_inputs,
        )

        # Phase 1: Evolution
        loop = EvolutionLoop(
            evaluator=self._evaluator,
            proposer=self._proposer,
            bootstrap=self._bootstrap,
            max_iterations=config.max_iterations,
            minibatch_size=config.minibatch_size,
            perfect_score=config.perfect_score,
            verbose=config.verbose,
        )
        seed_prompt = Prompt(text=config.seed_prompt)
        state = loop.run(seed_prompt, synthetic_pool, config.task_description)

        # Phase 2: Result
        initial_score = (
            state.candidates[0].best_score if state.candidates else 0.0
        )
        final_score = state.best_candidate.best_score if state.best_candidate else 0.0

        return OptimizationResult(
            optimized_prompt=(
                state.best_candidate.prompt.text
                if state.best_candidate
                else config.seed_prompt
            ),
            initial_prompt=config.seed_prompt,
            iterations_used=state.iteration,
            total_llm_calls=state.total_llm_calls + 1,  # +1 for bootstrap
            initial_score=initial_score,
            final_score=final_score,
            improvement=final_score - initial_score,
            history=state.history,
        )
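
From the counters in the evolution loop and the `+1` added here for bootstrap, the worst-case number of counted LLM calls can be estimated before a run. This sketch assumes no iteration is skipped or fails and that the pool holds at least `minibatch_size` examples; `max_llm_calls` is a hypothetical helper, not part of the package.

```python
def max_llm_calls(max_iterations: int, minibatch_size: int) -> int:
    """Upper bound on LLM calls as counted by the evolution loop."""
    bootstrap = 1                           # one synthetic-generation call
    seed_eval = 2 * minibatch_size          # execute + judge per example
    per_iteration = 4 * minibatch_size + 1  # two evaluations + one proposal
    return bootstrap + seed_eval + max_iterations * per_iteration


# Parameters taken from config_glm45air.yaml and config_glm51.yaml.
print(max_llm_calls(max_iterations=8, minibatch_size=3))  # 1 + 6 + 8 * 13 = 111
print(max_llm_calls(max_iterations=3, minibatch_size=2))  # 1 + 4 + 3 * 9 = 32
```

Iterations that hit `skip_perfect` cost only one evaluation, so real runs can come in below this bound.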


168
src/prometheus/cli/app.py Normal file

@@ -0,0 +1,168 @@
"""
CLI — user entry point.

Typer interface with -i (input) and -o (output) options.
"""

from __future__ import annotations

import logging
import os
from dataclasses import asdict

import dspy
import typer
from rich.console import Console
from rich.panel import Panel
from rich.table import Table

from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.dto import OptimizationConfig, OptimizationResult
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.use_cases import OptimizePromptUseCase
from prometheus.infrastructure.file_io import YamlPersistence
from prometheus.infrastructure.judge_adapter import DSPyJudgeAdapter
from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
from prometheus.infrastructure.proposer_adapter import DSPyProposerAdapter
from prometheus.infrastructure.synth_adapter import DSPySyntheticAdapter

app = typer.Typer(
    name="prometheus",
    help="PROMETHEUS — Prompt evolution without reference data.",
    no_args_is_help=True,
)
console = Console()


@app.command()
def optimize(
    input: str = typer.Option(
        ...,
        "-i",
        "--input",
        help="Path to input YAML config file.",
        exists=True,
        readable=True,
    ),
    output: str = typer.Option(
        "output.yaml",
        "-o",
        "--output",
        help="Path to output YAML result file.",
    ),
    verbose: bool = typer.Option(
        False,
        "-v",
        "--verbose",
        help="Print detailed progress.",
    ),
) -> None:
    """Optimize a prompt without any reference data.

    Usage:
        prometheus optimize -i config.yaml -o result.yaml
    """
    # Configure verbose logging. The evolution loop already prefixes its
    # messages with "[PROMETHEUS]", so the format does not repeat the prefix.
    if verbose:
        logging.basicConfig(level=logging.INFO, format="%(message)s")

    console.print(
        Panel.fit(
            "PROMETHEUS — Prompt Evolution Engine",
            subtitle="No reference data required",
        )
    )

    # 1. Load config
    persistence = YamlPersistence()
    raw_config = persistence.read_config(input)
    config = OptimizationConfig(
        seed_prompt=raw_config["seed_prompt"],
        task_description=raw_config["task_description"],
        task_model=raw_config.get("task_model", "openai/gpt-4o-mini"),
        judge_model=raw_config.get("judge_model", "openai/gpt-4o"),
        proposer_model=raw_config.get("proposer_model", "openai/gpt-4o"),
        synth_model=raw_config.get("synth_model", "openai/gpt-4o"),
        max_iterations=raw_config.get("max_iterations", 30),
        n_synthetic_inputs=raw_config.get("n_synthetic_inputs", 20),
        minibatch_size=raw_config.get("minibatch_size", 5),
        seed=raw_config.get("seed", 42),
        output_path=output,
        verbose=verbose,
    )
    console.print(f"[dim]Task: {config.task_description[:80]}...[/dim]")
    console.print(f"[dim]Seed prompt: {config.seed_prompt[:80]}...[/dim]")

    # 2. Configure DSPy with optional api_base/api_key from config
    lm_kwargs: dict[str, str] = {}
    api_base = raw_config.get("api_base")
    api_key_env = raw_config.get("api_key_env")
    if api_base:
        lm_kwargs["api_base"] = api_base
    if api_key_env:
        lm_kwargs["api_key"] = os.environ.get(api_key_env, "")
    task_lm = dspy.LM(config.task_model, **lm_kwargs)
    dspy.configure(lm=task_lm)

    # 3. Build adapters (Dependency Injection)
    synth_adapter = DSPySyntheticAdapter()
    llm_adapter = DSPyLLMAdapter(model=config.task_model)
    judge_adapter = DSPyJudgeAdapter()
    proposer_adapter = DSPyProposerAdapter()
    bootstrap = SyntheticBootstrap(generator=synth_adapter, seed=config.seed)
    evaluator = PromptEvaluator(executor=llm_adapter, judge=judge_adapter)
    use_case = OptimizePromptUseCase(
        evaluator=evaluator,
        proposer=proposer_adapter,
        bootstrap=bootstrap,
    )

    # 4. Execute
    with console.status("[bold green]Evolving prompt..."):
        result = use_case.execute(config)

    # 5. Display results
    _display_result(result)

    # 6. Save
    _save_result(persistence, output, result)
    console.print(f"\n[green]Results saved to {output}[/green]")


def _display_result(result: OptimizationResult) -> None:
    """Display a Rich summary in the terminal."""
    console.print()
    console.print(
        Panel(
            f"[bold green]Optimized Prompt[/bold green]\n\n{result.optimized_prompt}",
            title="Result",
        )
    )
    table = Table(title="Metrics")
    table.add_column("Metric", style="cyan")
    table.add_column("Value", style="bold")
    table.add_row("Initial Score", f"{result.initial_score:.2f}")
    table.add_row("Final Score", f"{result.final_score:.2f}")
    table.add_row("Improvement", f"{result.improvement:+.2f}")
    table.add_row("Iterations", str(result.iterations_used))
    table.add_row("LLM Calls", str(result.total_llm_calls))
    console.print(table)


def _save_result(
    persistence: YamlPersistence,
    path: str,
    result: OptimizationResult,
) -> None:
    """Save the result as YAML."""
    persistence.write_result(path, asdict(result))


@app.command(hidden=True)
def _help() -> None:
    """Internal placeholder to force multi-command Typer behavior."""
    pass


if __name__ == "__main__":
    app()
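
Step 2 of `optimize` falls back to an empty string when the variable named by `api_key_env` is unset, which defers the failure to the first LLM call. A stricter variant might fail early instead. `resolve_api_key` below is a hypothetical helper, shown only as one possible approach; the `environ` parameter exists so the function can be exercised without touching the real environment.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def resolve_api_key(
    api_key_env: str | None,
    environ: Mapping[str, str] | None = None,
) -> str | None:
    """Return the key named by api_key_env, or None when no variable is configured.

    Raises RuntimeError if a variable is configured but unset or empty.
    """
    env = os.environ if environ is None else environ
    if not api_key_env:
        return None
    value = env.get(api_key_env, "")
    if not value:
        raise RuntimeError(f"Environment variable {api_key_env!r} is not set")
    return value


print(resolve_api_key("GLM_API_KEY", {"GLM_API_KEY": "example-key"}))
```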

12
src/prometheus/config.py Normal file
View File

@@ -0,0 +1,12 @@
"""Application settings."""
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AppSettings:
    """Non-sensitive settings, hardcoded for the MVP."""

    app_name: str = "prometheus"
    version: str = "0.1.0"


@@ -0,0 +1,87 @@
"""Domain entities — pure data, zero dependencies."""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Prompt:
    """Represents a candidate prompt.

    frozen=True → immutable, safe for Pareto tracking.
    """

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)

    def __len__(self) -> int:
        return len(self.text)


@dataclass(frozen=True)
class SyntheticExample:
    """A synthetic example: an input generated from the task description.

    No expected output — the judge will evaluate the output directly.
    """

    input_text: str
    category: str = "default"  # for future stratified sampling
    id: int = 0


@dataclass
class Trajectory:
    """Execution trace of a prompt on an input.

    Used by reflective mutation to understand failures.
    """

    input_text: str
    output_text: str
    score: float
    feedback: str  # textual feedback from the judge
    prompt_used: str


@dataclass
class EvalResult:
    """Result of an evaluation on a minibatch."""

    scores: list[float]
    feedbacks: list[str]
    trajectories: list[Trajectory]

    @property
    def total_score(self) -> float:
        return sum(self.scores)

    @property
    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


@dataclass
class Candidate:
    """A candidate in the evolution pool.

    Contains the prompt + its cumulative scores.
    """

    prompt: Prompt
    best_score: float = 0.0
    generation: int = 0  # at which iteration it was created
    parent_id: int | None = None


@dataclass
class OptimizationState:
    """Complete optimization state — serializable snapshot."""

    iteration: int = 0
    best_candidate: Candidate | None = None
    candidates: list[Candidate] = field(default_factory=list)
    synthetic_pool: list[SyntheticExample] = field(default_factory=list)
    history: list[dict[str, Any]] = field(default_factory=list)
    total_llm_calls: int = 0
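
`total_score` grows with the number of examples scored, while `mean_score` does not. The sketch below uses a trimmed-down copy of `EvalResult` (feedbacks and trajectories omitted for brevity) to show why totals are only comparable between evaluations of the same minibatch size.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Trimmed-down copy for illustration: scores only."""

    scores: list[float]

    @property
    def total_score(self) -> float:
        return sum(self.scores)

    @property
    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


small = EvalResult(scores=[1.0, 1.0])                  # perfect on 2 examples
large = EvalResult(scores=[0.5, 0.5, 0.5, 0.5, 0.5])   # mediocre on 5 examples

print(small.total_score, large.total_score)
print(small.mean_score, large.mean_score)
```

The weaker result has the higher total here, so a total-based comparison is only meaningful when both prompts are scored on the same batch.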


@@ -0,0 +1,85 @@
"""
Domain ports — abstract interfaces that infrastructure implements.
Uses ABC (abstract base classes) for the loose coupling.
"""
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any

from prometheus.domain.entities import Prompt, SyntheticExample, Trajectory


class LLMPort(ABC):
    """Port for executing a prompt on an input.

    Infrastructure will provide an implementation via DSPy.
    """

    @abstractmethod
    def execute(self, prompt: Prompt, input_text: str) -> str:
        """Execute the prompt on the input, return the raw response."""
        ...


class JudgePort(ABC):
    """Port for LLM-as-Judge evaluation.

    Takes (input, output) pairs + the task description.
    Returns a score + textual feedback per pair.
    """

    @abstractmethod
    def judge_batch(
        self,
        task_description: str,
        pairs: list[tuple[str, str]],
    ) -> list[tuple[float, str]]:
        """Evaluate a batch of (input, output) pairs.

        Returns a list of (score, feedback).
        """
        ...


class ProposerPort(ABC):
    """Port for proposing a new prompt.

    Uses evaluation trajectories to propose an improvement.
    """

    @abstractmethod
    def propose(
        self,
        current_prompt: Prompt,
        trajectories: list[Trajectory],
        task_description: str,
    ) -> Prompt:
        """Propose a new prompt based on failure trajectories."""
        ...


class SyntheticGeneratorPort(ABC):
    """Port for generating synthetic inputs."""

    @abstractmethod
    def generate_inputs(
        self,
        task_description: str,
        n_examples: int,
    ) -> list[SyntheticExample]:
        """Generate N diverse synthetic inputs."""
        ...


class PersistencePort(ABC):
    """Port for reading/writing files."""

    @abstractmethod
    def read_config(self, path: str) -> dict[str, Any]:
        ...

    @abstractmethod
    def write_result(self, path: str, data: dict[str, Any]) -> None:
        ...
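
As a rough illustration of how these ports keep the domain independent of DSPy, the sketch below implements a judge that needs no LLM at all. The `JudgePort` class is a trimmed local copy so the snippet runs on its own; `KeywordJudge` and its keyword-coverage scoring rule are hypothetical and not part of this commit.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class JudgePort(ABC):
    """Trimmed local copy of the port, so this sketch is self-contained."""

    @abstractmethod
    def judge_batch(
        self,
        task_description: str,
        pairs: list[tuple[str, str]],
    ) -> list[tuple[float, str]]:
        ...


class KeywordJudge(JudgePort):
    """Hypothetical deterministic judge: score = fraction of keywords present."""

    def __init__(self, keywords: list[str]) -> None:
        self._keywords = [k.lower() for k in keywords]

    def judge_batch(
        self,
        task_description: str,
        pairs: list[tuple[str, str]],
    ) -> list[tuple[float, str]]:
        results: list[tuple[float, str]] = []
        for _input_text, output_text in pairs:
            text = output_text.lower()
            missing = [k for k in self._keywords if k not in text]
            found = len(self._keywords) - len(missing)
            score = found / len(self._keywords) if self._keywords else 0.0
            feedback = "ok" if not missing else "missing: " + ", ".join(missing)
            results.append((score, feedback))
        return results


judge = KeywordJudge(["bug", "fix"])
verdicts = judge.judge_batch(
    "code review",
    [
        ("x = 1", "Found a bug, here is the fix."),
        ("y = 2", "Looks fine."),
    ],
)
```

Because application code would depend only on the abstract `judge_batch` contract, a stand-in like this could replace the DSPy adapter in tests without touching the evaluator.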


@@ -0,0 +1,21 @@
"""Scoring logic and acceptance criteria — pure domain."""
from __future__ import annotations

from prometheus.domain.entities import EvalResult


def should_accept(
    old_result: EvalResult,
    new_result: EvalResult,
    min_improvement: float = 0.0,
) -> bool:
    """Strict acceptance criterion.

    The new candidate's total score must exceed the old total score
    by more than min_improvement; a tie is rejected.
    """
    return new_result.total_score > old_result.total_score + min_improvement


def normalize_score(raw: float, min_val: float = 0.0, max_val: float = 1.0) -> float:
    """Clamp a score within [min_val, max_val]."""
    return max(min_val, min(max_val, raw))
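
To make the strictness of the acceptance criterion concrete, here is a minimal sketch. It assumes a stand-in `EvalResult` that carries only `scores`, and it copies `should_accept` locally so the snippet runs without the package; the score values are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EvalResult:
    """Stand-in with only the field the criterion needs."""

    scores: list[float]

    @property
    def total_score(self) -> float:
        return sum(self.scores)


def should_accept(
    old_result: EvalResult,
    new_result: EvalResult,
    min_improvement: float = 0.0,
) -> bool:
    # Local copy of the criterion: strictly greater, never equal.
    return new_result.total_score > old_result.total_score + min_improvement


old = EvalResult(scores=[0.5, 0.5])       # total 1.0
tie = EvalResult(scores=[0.5, 0.5])       # total 1.0
better = EvalResult(scores=[0.75, 0.5])   # total 1.25

tie_accepted = should_accept(old, tie)                             # equal totals
better_accepted = should_accept(old, better)                       # 1.25 vs 1.0
margin_accepted = should_accept(old, better, min_improvement=0.5)  # 1.25 vs 1.5
```

A tie is rejected, so a proposal that merely matches the current prompt never replaces it. Raising `min_improvement` adds a safety margin, which may help when judge scores are noisy.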


@@ -0,0 +1,92 @@
"""
DSPy Modules — signature composition.
Declarative LLM call orchestration via DSPy.
"""
from __future__ import annotations

import json
import re

import dspy

from prometheus.infrastructure.dspy_signatures import (
    GenerateSyntheticInputs,
    JudgeOutput,
    ProposeInstruction,
)


class SyntheticInputGenerator(dspy.Module):
    """Generates synthetic inputs in a single batch call.

    Uses ChainOfThought for better diversity.
    """

    def __init__(self) -> None:
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateSyntheticInputs)

    def forward(self, task_description: str, n_examples: int) -> dspy.Prediction:
        result = self.generate(
            task_description=task_description,
            n_examples=n_examples,
        )
        try:
            examples = json.loads(result.examples)
        except json.JSONDecodeError:
            examples = self._parse_fallback(result.examples)
        return dspy.Prediction(examples=examples)

    @staticmethod
    def _parse_fallback(text: str) -> list[str]:
        """Extract strings from non-JSON output."""
        matches = re.findall(r'"([^"]+)"', text)
        return matches if matches else [text]


class OutputJudge(dspy.Module):
    """Judges a single output. Called in batch by JudgeAdapter."""

    def __init__(self) -> None:
        super().__init__()
        self.judge = dspy.ChainOfThought(JudgeOutput)

    def forward(
        self, task_description: str, input_text: str, output_text: str
    ) -> dspy.Prediction:
        result = self.judge(
            task_description=task_description,
            input_text=input_text,
            output_text=output_text,
        )
        try:
            score = float(result.score)
        except (ValueError, TypeError):
            score = 0.5  # neutral fallback
        score = max(0.0, min(1.0, score))
        return dspy.Prediction(score=score, feedback=result.feedback)


class InstructionProposer(dspy.Module):
    """Proposes a new prompt from failure trajectories.

    Equivalent to GEPA's InstructionProposalSignature.
    """

    def __init__(self) -> None:
        super().__init__()
        self.propose = dspy.ChainOfThought(ProposeInstruction)

    def forward(
        self,
        current_instruction: str,
        task_description: str,
        failure_examples: str,
    ) -> dspy.Prediction:
        result = self.propose(
            current_instruction=current_instruction,
            task_description=task_description,
            failure_examples=failure_examples,
        )
        return dspy.Prediction(new_instruction=result.new_instruction)
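
The JSON-then-regex recovery used by `SyntheticInputGenerator` can be exercised without DSPy. The sketch below folds the two steps into one hypothetical helper, `parse_examples` (the name is an assumption; the module itself splits this between `forward` and `_parse_fallback`).

```python
from __future__ import annotations

import json
import re
from typing import Any


def parse_examples(raw: str) -> Any:
    """Try strict JSON first, then fall back to extracting quoted strings."""
    try:
        # Whatever json.loads yields is passed through unchanged,
        # mirroring the module above.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: collect every non-empty double-quoted substring.
        matches = re.findall(r'"([^"]+)"', raw)
        # If nothing is quoted, keep the raw text as a single example.
        return matches if matches else [raw]


clean = parse_examples('["q1", "q2"]')
messy = parse_examples('Here: "fallback item" and "another one"')
bare = parse_examples("no quotes at all")
```

One caveat worth noting: valid JSON that is not an array (a bare number or an object, say) is returned as-is rather than routed to the fallback, so a caller that requires a list of strings may want an extra `isinstance` check.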


@@ -0,0 +1,79 @@
"""
DSPy Signatures — declarative LLM contracts.
Defines WHAT each LLM call does, not HOW.
DSPy Signature = input_fields → output_fields + instruction.
DSPy handles prompting, parsing, and structuring.
"""
from __future__ import annotations

import dspy


class GenerateSyntheticInputs(dspy.Signature):
    """Generate diverse, realistic input examples for a given task."""

    task_description: str = dspy.InputField(
        desc="Description of the task the prompt should accomplish."
    )
    n_examples: int = dspy.InputField(
        desc="Number of examples to generate."
    )
    examples: str = dspy.OutputField(
        desc=(
            "A JSON array of strings, each being a realistic input "
            "for the task. Cover: normal cases, edge cases, long inputs, "
            "short inputs, ambiguous cases, and tricky scenarios."
        ),
    )


class JudgeOutput(dspy.Signature):
    """Evaluate the quality of an LLM output for a given task and input.

    Score: 0.0 (completely wrong) to 1.0 (perfect).
    Feedback: specific, actionable criticism.
    """

    task_description: str = dspy.InputField(
        desc="What the assistant is supposed to do."
    )
    input_text: str = dspy.InputField(
        desc="The input provided to the assistant."
    )
    output_text: str = dspy.InputField(
        desc="The assistant's response to evaluate."
    )
    score: float = dspy.OutputField(
        desc="Quality score from 0.0 (wrong) to 1.0 (perfect)."
    )
    feedback: str = dspy.OutputField(
        desc=(
            "Specific, actionable feedback explaining what's wrong "
            "with the output and how to improve it. Be critical."
        ),
    )


class ProposeInstruction(dspy.Signature):
    """Given a current prompt and examples of where it fails with feedback,
    propose an improved version of the prompt.

    The new prompt should address all the issues identified in the feedback.
    """

    current_instruction: str = dspy.InputField(
        desc="The current prompt/instruction to improve."
    )
    task_description: str = dspy.InputField(
        desc="Description of the task."
    )
    failure_examples: str = dspy.InputField(
        desc=(
            "Examples of inputs, outputs, scores, and feedback "
            "showing where the current instruction fails."
        ),
    )
    new_instruction: str = dspy.OutputField(
        desc="An improved version of the instruction."
    )


@@ -0,0 +1,25 @@
"""
File I/O — read/write config and result files.
Implements the PersistencePort with YAML.
"""
from __future__ import annotations

from typing import Any

import yaml

from prometheus.domain.ports import PersistencePort


class YamlPersistence(PersistencePort):
    """Reads and writes YAML files."""

    def read_config(self, path: str) -> dict[str, Any]:
        with open(path, encoding="utf-8") as f:
            data: dict[str, Any] = yaml.safe_load(f)
        return data

    def write_result(self, path: str, data: dict[str, Any]) -> None:
        with open(path, "w", encoding="utf-8") as f:
            yaml.dump(data, f, default_flow_style=False, allow_unicode=True)


@@ -0,0 +1,34 @@
"""
Adapter: LLM-as-Judge.
Implements the JudgePort via the DSPy OutputJudge module.
"""
from __future__ import annotations

from prometheus.domain.ports import JudgePort
from prometheus.infrastructure.dspy_modules import OutputJudge


class DSPyJudgeAdapter(JudgePort):
    """Evaluates a batch of (input, output) pairs by calling the Judge for each.

    Sequential for MVP. Future: parallelize via dspy.Parallel.
    """

    def __init__(self) -> None:
        self._judge = OutputJudge()

    def judge_batch(
        self,
        task_description: str,
        pairs: list[tuple[str, str]],
    ) -> list[tuple[float, str]]:
        results: list[tuple[float, str]] = []
        for input_text, output_text in pairs:
            pred = self._judge(
                task_description=task_description,
                input_text=input_text,
                output_text=output_text,
            )
            results.append((pred.score, pred.feedback))
        return results


@@ -0,0 +1,32 @@
"""
Adapter: Execute a prompt on an input.
Implements the LLMPort via DSPy.
"""
from __future__ import annotations

import dspy

from prometheus.domain.entities import Prompt
from prometheus.domain.ports import LLMPort


class DSPyLLMAdapter(LLMPort):
    """Executes a prompt using dspy.Predict with a simple signature."""

    class _ExecuteSignature(dspy.Signature):
        """Execute the instruction on the given input."""

        instruction: str = dspy.InputField(desc="The instruction/prompt to follow.")
        input_text: str = dspy.InputField(desc="The input to process.")
        output: str = dspy.OutputField(desc="The response following the instruction.")

    def __init__(self, model: str) -> None:
        # NOTE: `model` is accepted but not used in this constructor; the LM is
        # presumably configured globally (e.g. via dspy.configure) beforehand.
        self._predictor = dspy.Predict(self._ExecuteSignature)

    def execute(self, prompt: Prompt, input_text: str) -> str:
        result = self._predictor(
            instruction=prompt.text,
            input_text=input_text,
        )
        return str(result.output)


@@ -0,0 +1,47 @@
"""
Adapter: Reflective Mutation Proposer.
Implements the ProposerPort via the DSPy InstructionProposer.
Converts trajectories into readable format for the LLM proposer.
"""
from __future__ import annotations

from prometheus.domain.entities import Prompt, Trajectory
from prometheus.domain.ports import ProposerPort
from prometheus.infrastructure.dspy_modules import InstructionProposer


class DSPyProposerAdapter(ProposerPort):
    """Uses evaluation trajectories to build a failure report and propose a new prompt."""

    def __init__(self) -> None:
        self._proposer = InstructionProposer()

    def propose(
        self,
        current_prompt: Prompt,
        trajectories: list[Trajectory],
        task_description: str,
    ) -> Prompt:
        failure_examples = self._format_failures(trajectories)
        pred = self._proposer(
            current_instruction=current_prompt.text,
            task_description=task_description,
            failure_examples=failure_examples,
        )
        return Prompt(text=pred.new_instruction)

    @staticmethod
    def _format_failures(trajectories: list[Trajectory]) -> str:
        """Convert trajectories into a structured textual report."""
        sections: list[str] = []
        for i, t in enumerate(trajectories, 1):
            section = (
                f"# Example {i}\n"
                f"## Input\n{t.input_text}\n\n"
                f"## Generated Output\n{t.output_text}\n\n"
                f"## Score\n{t.score:.2f}\n\n"
                f"## Feedback\n{t.feedback}\n"
            )
            sections.append(section)
        return "\n---\n".join(sections)
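
To show what the proposer LLM actually receives, the following sketch reproduces the report format on two made-up trajectories. The `Trajectory` stand-in keeps only the fields the formatter reads, and the example inputs, scores, and feedback are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Trajectory:
    """Stand-in mirroring the fields the formatter reads."""

    input_text: str
    output_text: str
    score: float
    feedback: str


def format_failures(trajectories: list[Trajectory]) -> str:
    """Local copy of the report builder: one Markdown-style section per example."""
    sections: list[str] = []
    for i, t in enumerate(trajectories, 1):
        sections.append(
            f"# Example {i}\n"
            f"## Input\n{t.input_text}\n\n"
            f"## Generated Output\n{t.output_text}\n\n"
            f"## Score\n{t.score:.2f}\n\n"
            f"## Feedback\n{t.feedback}\n"
        )
    # Sections are separated by a horizontal-rule line.
    return "\n---\n".join(sections)


report = format_failures(
    [
        Trajectory("What is 2+2?", "5", 0.1, "Wrong arithmetic."),
        Trajectory("Capital of France?", "Paris, probably", 0.6, "Too hedged."),
    ]
)
print(report)
```

Note that every trajectory in the minibatch ends up in the report, including high-scoring ones; filtering to genuine failures, if desired, would have to happen before this step.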


@@ -0,0 +1,34 @@
"""
Adapter: Synthetic input generation.
Implements the SyntheticGeneratorPort via DSPy.
"""
from __future__ import annotations

from prometheus.domain.entities import SyntheticExample
from prometheus.domain.ports import SyntheticGeneratorPort
from prometheus.infrastructure.dspy_modules import SyntheticInputGenerator


class DSPySyntheticAdapter(SyntheticGeneratorPort):
    """Generates synthetic inputs in a single batch call via DSPy."""

    def __init__(self) -> None:
        self._generator = SyntheticInputGenerator()

    def generate_inputs(
        self,
        task_description: str,
        n_examples: int,
    ) -> list[SyntheticExample]:
        pred = self._generator(
            task_description=task_description,
            n_examples=n_examples,
        )
        return [
            SyntheticExample(
                input_text=text,
                id=i,
            )
            for i, text in enumerate(pred.examples[:n_examples])
        ]

26
test_config.yaml Normal file

@@ -0,0 +1,26 @@
# PROMETHEUS Test Config — Full Pipeline Test
# =============================================
seed_prompt: |
  You are a helpful coding assistant. Help the user write clean, bug-free code.
  When reviewing code, identify bugs and suggest improvements.

task_description: |
  Code review and bug detection assistant. Reviews code snippets and
  identifies bugs, security issues, and style problems. The assistant
  receives a code snippet and must produce a structured review.

task_model: "openai/glm-4.5-air"
judge_model: "openai/glm-4.5-air"
proposer_model: "openai/glm-4.5-air"
synth_model: "openai/glm-4.5-air"

# API configuration for z.ai (coding endpoint)
api_base: "https://api.z.ai/api/coding/paas/v4"
api_key_env: "GLM_API_KEY"

# Evolution parameters (small test)
max_iterations: 5
n_synthetic_inputs: 5
minibatch_size: 3
seed: 42

0
tests/__init__.py Normal file

93
tests/conftest.py Normal file

@@ -0,0 +1,93 @@
"""Shared test fixtures."""
from __future__ import annotations
from unittest.mock import MagicMock
import pytest
from prometheus.domain.entities import (
EvalResult,
Prompt,
SyntheticExample,
Trajectory,
)
@pytest.fixture
def seed_prompt() -> Prompt:
return Prompt(text="You are a helpful assistant. Answer the question.")
@pytest.fixture
def task_description() -> str:
return "Answer factual questions accurately and concisely."
@pytest.fixture
def synthetic_pool() -> list[SyntheticExample]:
return [
SyntheticExample(input_text=f"Test input {i}", id=i) for i in range(20)
]
@pytest.fixture
def mock_eval_result() -> EvalResult:
return EvalResult(
scores=[0.3, 0.5, 0.4, 0.6, 0.2],
feedbacks=[
"Incomplete answer",
"Missing key detail",
"Wrong format",
"Partially correct",
"Completely off topic",
],
trajectories=[
Trajectory(
input_text=f"Input {i}",
output_text=f"Output {i}",
score=s,
feedback=f,
prompt_used="test prompt",
)
for i, (s, f) in enumerate(
zip(
[0.3, 0.5, 0.4, 0.6, 0.2],
[
"Incomplete answer",
"Missing key detail",
"Wrong format",
"Partially correct",
"Completely off topic",
],
)
)
],
)
@pytest.fixture
def mock_llm_port() -> MagicMock:
"""Mock LLMPort that returns canned responses."""
port = MagicMock()
port.execute.return_value = "This is a mock response."
return port
@pytest.fixture
def mock_judge_port() -> MagicMock:
"""Mock JudgePort that returns moderate scores."""
port = MagicMock()
port.judge_batch.return_value = [
(0.5, "Moderate quality, needs improvement."),
] * 5
return port
@pytest.fixture
def mock_proposer_port() -> MagicMock:
"""Mock ProposerPort that returns a slightly modified prompt."""
port = MagicMock()
port.propose.return_value = Prompt(
text="You are a very helpful assistant. Answer the question precisely."
)
return port


@@ -0,0 +1,29 @@
"""Integration tests for DSPy adapters using DSPy mock LM."""
from __future__ import annotations
import dspy
import pytest
from prometheus.domain.entities import Prompt
from prometheus.infrastructure.llm_adapter import DSPyLLMAdapter
@pytest.fixture
def mock_lm() -> dspy.LM:
"""Create a DSPy mock LM that returns predictable responses."""
lm = dspy.utils.DummyLM(
[
{"output": "Mock output response"},
]
)
dspy.configure(lm=lm)
return lm
class TestDSPyLLMAdapter:
def test_execute_returns_response(self, mock_lm: dspy.LM) -> None:
adapter = DSPyLLMAdapter(model="openai/gpt-4o-mini")
prompt = Prompt(text="Answer the question.")
result = adapter.execute(prompt, "What is 2+2?")
assert isinstance(result, str)
assert len(result) > 0


@@ -0,0 +1,74 @@
"""End-to-end pipeline test with mocked LLM calls."""
from __future__ import annotations
from unittest.mock import MagicMock
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.dto import OptimizationConfig
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.use_cases import OptimizePromptUseCase
from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory
from prometheus.domain.ports import JudgePort, LLMPort, ProposerPort
def _make_eval(scores: list[float]) -> EvalResult:
return EvalResult(
scores=scores,
feedbacks=["feedback"] * len(scores),
trajectories=[
Trajectory(f"in{i}", f"out{i}", s, "feedback", "prompt")
for i, s in enumerate(scores)
],
)
class TestFullPipeline:
def test_pipeline_produces_result(self) -> None:
"""Full pipeline with mocked ports produces an OptimizationResult."""
mock_llm = MagicMock(spec=LLMPort)
mock_llm.execute.return_value = "mock response"
mock_judge = MagicMock(spec=JudgePort)
# Initial eval (low), then alternating current/new evals per iteration
eval_sequence = [
_make_eval([0.3, 0.3, 0.3, 0.3, 0.3]), # initial seed eval
]
for _ in range(5): # 5 iterations
eval_sequence.append(_make_eval([0.4, 0.4, 0.4, 0.4, 0.4])) # current eval
eval_sequence.append(_make_eval([0.6, 0.6, 0.6, 0.6, 0.6])) # new eval (accepted)
mock_judge.judge_batch.return_value = [(0.5, "ok")] * 5
mock_proposer = MagicMock(spec=ProposerPort)
mock_proposer.propose.return_value = Prompt(text="Improved prompt")
evaluator = PromptEvaluator(mock_llm, mock_judge)
evaluator.evaluate = MagicMock(side_effect=eval_sequence)
mock_gen = MagicMock()
mock_gen.generate_inputs.return_value = [
SyntheticExample(input_text=f"synth input {i}", id=i) for i in range(20)
]
bootstrap = SyntheticBootstrap(generator=mock_gen, seed=42)
use_case = OptimizePromptUseCase(
evaluator=evaluator,
proposer=mock_proposer,
bootstrap=bootstrap,
)
config = OptimizationConfig(
seed_prompt="Answer questions.",
task_description="Answer questions accurately.",
max_iterations=5,
n_synthetic_inputs=20,
minibatch_size=5,
seed=42,
)
result = use_case.execute(config)
assert result.initial_prompt == "Answer questions."
assert result.optimized_prompt == "Improved prompt"
assert result.iterations_used == 5
assert result.total_llm_calls > 0
assert result.final_score > result.initial_score

0
tests/unit/__init__.py Normal file

@@ -0,0 +1,50 @@
"""Unit tests for the bootstrap module."""
from __future__ import annotations
from unittest.mock import MagicMock
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.domain.entities import SyntheticExample
from prometheus.domain.ports import SyntheticGeneratorPort
class TestSyntheticBootstrap:
def test_run_returns_shuffled_examples(self) -> None:
mock_gen = MagicMock(spec=SyntheticGeneratorPort)
examples = [SyntheticExample(input_text=f"input {i}", id=i) for i in range(10)]
mock_gen.generate_inputs.return_value = examples
bootstrap = SyntheticBootstrap(generator=mock_gen, seed=42)
result = bootstrap.run("task desc", 10)
assert len(result) == 10
mock_gen.generate_inputs.assert_called_once_with("task desc", 10)
def test_sample_minibatch_returns_correct_size(self) -> None:
mock_gen = MagicMock(spec=SyntheticGeneratorPort)
pool = [SyntheticExample(input_text=f"input {i}", id=i) for i in range(20)]
bootstrap = SyntheticBootstrap(generator=mock_gen, seed=42)
batch = bootstrap.sample_minibatch(pool, 5)
assert len(batch) == 5
# All items should be from the pool
assert all(item in pool for item in batch)
def test_sample_minibatch_capped_at_pool_size(self) -> None:
mock_gen = MagicMock(spec=SyntheticGeneratorPort)
pool = [SyntheticExample(input_text=f"input {i}", id=i) for i in range(3)]
bootstrap = SyntheticBootstrap(generator=mock_gen, seed=42)
batch = bootstrap.sample_minibatch(pool, 10)
assert len(batch) == 3
def test_deterministic_with_same_seed(self) -> None:
mock_gen = MagicMock(spec=SyntheticGeneratorPort)
pool = [SyntheticExample(input_text=f"input {i}", id=i) for i in range(20)]
b1 = SyntheticBootstrap(generator=mock_gen, seed=42)
b2 = SyntheticBootstrap(generator=mock_gen, seed=42)
assert b1.sample_minibatch(pool, 5) == b2.sample_minibatch(pool, 5)
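
The properties these tests check (correct size, capping at the pool size, determinism for a fixed seed) can be obtained with a private `random.Random` instance. The sketch below is only an assumption about how such sampling might be done: `SeededSampler` is a hypothetical name, and the real `SyntheticBootstrap` implementation is not shown in this part of the commit.

```python
from __future__ import annotations

import random


class SeededSampler:
    """Hypothetical stand-in: draws minibatches from a private, seeded RNG."""

    def __init__(self, seed: int) -> None:
        # A dedicated Random instance avoids touching the global RNG state.
        self._rng = random.Random(seed)

    def sample_minibatch(self, pool: list[str], size: int) -> list[str]:
        # Cap the request so a small pool never raises ValueError.
        k = min(size, len(pool))
        return self._rng.sample(pool, k)


pool = [f"input {i}" for i in range(20)]
first = SeededSampler(42).sample_minibatch(pool, 5)
second = SeededSampler(42).sample_minibatch(pool, 5)
capped = SeededSampler(42).sample_minibatch(pool[:3], 10)
```

Two samplers built with the same seed yield the same first minibatch, which is what makes a run reproducible from the `seed` config value. Successive calls on one sampler still differ, because the RNG state advances.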


@@ -0,0 +1,198 @@
"""Unit tests for DSPy module parsing logic."""
from __future__ import annotations
import json
from unittest.mock import MagicMock
import dspy
import pytest
from prometheus.infrastructure.dspy_modules import (
InstructionProposer,
OutputJudge,
SyntheticInputGenerator,
)
class TestSyntheticInputGeneratorParseFallback:
"""Tests for _parse_fallback — regex-based JSON recovery."""
def test_extracts_quoted_strings(self) -> None:
text = 'Here are some: "first example" and "second example" done.'
result = SyntheticInputGenerator._parse_fallback(text)
assert result == ["first example", "second example"]
def test_single_quoted_string(self) -> None:
text = 'Just one: "hello world"'
result = SyntheticInputGenerator._parse_fallback(text)
assert result == ["hello world"]
def test_no_quotes_returns_raw_text(self) -> None:
text = "no quotes at all here"
result = SyntheticInputGenerator._parse_fallback(text)
assert result == ["no quotes at all here"]
def test_empty_string_returns_itself(self) -> None:
result = SyntheticInputGenerator._parse_fallback("")
assert result == [""]
def test_mixed_json_with_extra_text(self) -> None:
text = 'Results: "alpha", "beta", "gamma" — take your pick.'
result = SyntheticInputGenerator._parse_fallback(text)
assert result == ["alpha", "beta", "gamma"]
class TestOutputJudgeForward:
"""Tests for OutputJudge score parsing and clamping.
Mocks the internal ChainOfThought module to isolate parsing logic.
"""
@pytest.fixture
def judge(self) -> OutputJudge:
return OutputJudge()
def test_valid_numeric_score(self, judge: OutputJudge) -> None:
judge.judge = MagicMock(
return_value=dspy.Prediction(score="0.8", feedback="Good output.")
)
result = judge.forward("task", "input", "output")
assert result.score == 0.8
assert result.feedback == "Good output."
def test_non_numeric_score_falls_back_to_half(
self, judge: OutputJudge
) -> None:
judge.judge = MagicMock(
return_value=dspy.Prediction(
score="not-a-number", feedback="N/A"
)
)
result = judge.forward("task", "input", "output")
assert result.score == 0.5
def test_score_clamped_to_upper_bound(self, judge: OutputJudge) -> None:
judge.judge = MagicMock(
return_value=dspy.Prediction(score="1.5", feedback="Great!")
)
result = judge.forward("task", "input", "output")
assert result.score == 1.0
def test_score_clamped_to_lower_bound(self, judge: OutputJudge) -> None:
judge.judge = MagicMock(
return_value=dspy.Prediction(score="-0.3", feedback="Terrible.")
)
result = judge.forward("task", "input", "output")
assert result.score == 0.0
def test_empty_score_string_falls_back(self, judge: OutputJudge) -> None:
judge.judge = MagicMock(
return_value=dspy.Prediction(score="", feedback="No score.")
)
result = judge.forward("task", "input", "output")
assert result.score == 0.5
def test_boundary_score_one(self, judge: OutputJudge) -> None:
judge.judge = MagicMock(
return_value=dspy.Prediction(score="1.0", feedback="Perfect.")
)
result = judge.forward("task", "input", "output")
assert result.score == 1.0
def test_boundary_score_zero(self, judge: OutputJudge) -> None:
judge.judge = MagicMock(
return_value=dspy.Prediction(score="0.0", feedback="Wrong.")
)
result = judge.forward("task", "input", "output")
assert result.score == 0.0
def test_none_score_falls_back(self, judge: OutputJudge) -> None:
judge.judge = MagicMock(
return_value=dspy.Prediction(score=None, feedback="Missing.")
)
result = judge.forward("task", "input", "output")
assert result.score == 0.5
class TestSyntheticInputGeneratorForward:
"""Tests for SyntheticInputGenerator.forward JSON/fallback parsing.
Mocks the internal ChainOfThought module to isolate parsing logic.
"""
@pytest.fixture
def generator(self) -> SyntheticInputGenerator:
return SyntheticInputGenerator()
def test_valid_json_parsed_correctly(
self, generator: SyntheticInputGenerator
) -> None:
examples_json = json.dumps(["q1", "q2", "q3"])
generator.generate = MagicMock(
return_value=dspy.Prediction(examples=examples_json)
)
result = generator.forward("task desc", 3)
assert result.examples == ["q1", "q2", "q3"]
def test_malformed_json_triggers_fallback(
self, generator: SyntheticInputGenerator
) -> None:
generator.generate = MagicMock(
return_value=dspy.Prediction(
examples='Here: "fallback item" and "another one"'
)
)
result = generator.forward("task desc", 2)
assert result.examples == ["fallback item", "another one"]
def test_empty_json_array(self, generator: SyntheticInputGenerator) -> None:
generator.generate = MagicMock(
return_value=dspy.Prediction(examples="[]")
)
result = generator.forward("task desc", 0)
assert result.examples == []
class TestInstructionProposerForward:
"""Tests for InstructionProposer.forward."""
@pytest.fixture
def proposer(self) -> InstructionProposer:
return InstructionProposer()
def test_returns_new_instruction(self, proposer: InstructionProposer) -> None:
proposer.propose = MagicMock(
return_value=dspy.Prediction(
new_instruction="Be concise and accurate."
)
)
result = proposer.forward(
"Be helpful.", "Answer questions.", "Failed: too verbose"
)
assert result.new_instruction == "Be concise and accurate."
def test_passes_correct_arguments(
self, proposer: InstructionProposer
) -> None:
proposer.propose = MagicMock(
return_value=dspy.Prediction(new_instruction="improved")
)
proposer.forward("current", "task desc", "failures")
proposer.propose.assert_called_once_with(
current_instruction="current",
task_description="task desc",
failure_examples="failures",
)


@@ -0,0 +1,99 @@
"""Unit tests for domain entities."""
from __future__ import annotations
from prometheus.domain.entities import (
Candidate,
EvalResult,
OptimizationState,
Prompt,
SyntheticExample,
Trajectory,
)
class TestPrompt:
def test_prompt_text(self) -> None:
p = Prompt(text="Hello")
assert p.text == "Hello"
def test_prompt_len(self) -> None:
p = Prompt(text="Hello")
assert len(p) == 5
def test_prompt_frozen(self) -> None:
p = Prompt(text="Hello")
try:
p.text = "World" # type: ignore[misc]
raise AssertionError("Should have raised FrozenInstanceError")
except AttributeError:
pass
def test_prompt_default_metadata(self) -> None:
p = Prompt(text="Hello")
assert p.metadata == {}
def test_prompt_custom_metadata(self) -> None:
p = Prompt(text="Hello", metadata={"key": "value"})
assert p.metadata["key"] == "value"
class TestSyntheticExample:
def test_default_category(self) -> None:
ex = SyntheticExample(input_text="test")
assert ex.category == "default"
def test_default_id(self) -> None:
ex = SyntheticExample(input_text="test")
assert ex.id == 0
class TestEvalResult:
def test_total_score(self) -> None:
result = EvalResult(
scores=[0.3, 0.5, 0.4],
feedbacks=["a", "b", "c"],
trajectories=[],
)
assert abs(result.total_score - 1.2) < 1e-9
def test_mean_score(self) -> None:
result = EvalResult(
scores=[0.3, 0.5, 0.4],
feedbacks=["a", "b", "c"],
trajectories=[],
)
assert abs(result.mean_score - 0.4) < 1e-9
def test_mean_score_empty(self) -> None:
result = EvalResult(scores=[], feedbacks=[], trajectories=[])
assert result.mean_score == 0.0
class TestTrajectory:
def test_trajectory_fields(self) -> None:
t = Trajectory(
input_text="in",
output_text="out",
score=0.8,
feedback="good",
prompt_used="test",
)
assert t.input_text == "in"
assert t.score == 0.8
class TestCandidate:
def test_candidate_defaults(self) -> None:
c = Candidate(prompt=Prompt(text="test"))
assert c.best_score == 0.0
assert c.generation == 0
assert c.parent_id is None
class TestOptimizationState:
def test_default_state(self) -> None:
state = OptimizationState()
assert state.iteration == 0
assert state.best_candidate is None
assert state.candidates == []
assert state.total_llm_calls == 0


@@ -0,0 +1,121 @@
"""Unit tests for PromptEvaluator.evaluate()."""
from __future__ import annotations
from unittest.mock import MagicMock
import pytest
from prometheus.application.evaluator import PromptEvaluator
from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample
from prometheus.domain.ports import JudgePort, LLMPort
class TestPromptEvaluatorEvaluate:
"""Tests for the evaluate() pipeline: execute → judge → trajectories."""
@pytest.fixture
def executor(self) -> MagicMock:
return MagicMock(spec=LLMPort)
@pytest.fixture
def judge(self) -> MagicMock:
return MagicMock(spec=JudgePort)
@pytest.fixture
def evaluator(self, executor: MagicMock, judge: MagicMock) -> PromptEvaluator:
return PromptEvaluator(executor=executor, judge=judge)
def test_happy_path_builds_correct_trajectories(
self,
evaluator: PromptEvaluator,
executor: MagicMock,
judge: MagicMock,
) -> None:
prompt = Prompt(text="Answer the question.")
examples = [
SyntheticExample(input_text="What is 2+2?", id=0),
SyntheticExample(input_text="Capital of France?", id=1),
]
executor.execute.side_effect = ["4", "Paris"]
judge.judge_batch.return_value = [
(0.9, "Correct."),
(0.8, "Mostly correct."),
]
result = evaluator.evaluate(prompt, examples, "math and geography")
assert isinstance(result, EvalResult)
assert result.scores == [0.9, 0.8]
assert result.feedbacks == ["Correct.", "Mostly correct."]
assert len(result.trajectories) == 2
assert result.trajectories[0].input_text == "What is 2+2?"
assert result.trajectories[0].output_text == "4"
assert result.trajectories[0].score == 0.9
assert result.trajectories[0].feedback == "Correct."
assert result.trajectories[0].prompt_used == "Answer the question."
assert result.trajectories[1].prompt_used == "Answer the question."
def test_empty_minibatch_returns_empty_result(
self,
evaluator: PromptEvaluator,
executor: MagicMock,
judge: MagicMock,
) -> None:
prompt = Prompt(text="test")
result = evaluator.evaluate(prompt, [], "task")
assert result.scores == []
assert result.feedbacks == []
assert result.trajectories == []
executor.execute.assert_not_called()
# judge_batch is called with empty pairs list
judge.judge_batch.assert_called_once_with("task", [])
def test_executor_called_with_correct_prompt(
self,
evaluator: PromptEvaluator,
executor: MagicMock,
judge: MagicMock,
) -> None:
prompt = Prompt(text="Summarize this.")
examples = [SyntheticExample(input_text="Long text here", id=0)]
executor.execute.return_value = "Summary."
judge.judge_batch.return_value = [(0.7, "Good summary.")]
evaluator.evaluate(prompt, examples, "summarization")
executor.execute.assert_called_once_with(prompt, "Long text here")
def test_trajectories_prompt_used_matches_input_prompt(
self,
evaluator: PromptEvaluator,
executor: MagicMock,
judge: MagicMock,
) -> None:
prompt = Prompt(text="Translate to French.")
examples = [SyntheticExample(input_text="Hello", id=0)]
executor.execute.return_value = "Bonjour"
judge.judge_batch.return_value = [(1.0, "Perfect.")]
result = evaluator.evaluate(prompt, examples, "translation")
assert result.trajectories[0].prompt_used == "Translate to French."
def test_scores_feedbacks_trajectories_lists_sized_correctly(
self,
evaluator: PromptEvaluator,
executor: MagicMock,
judge: MagicMock,
) -> None:
prompt = Prompt(text="test prompt")
examples = [SyntheticExample(input_text=f"q{i}", id=i) for i in range(4)]
executor.execute.side_effect = [f"a{i}" for i in range(4)]
judge.judge_batch.return_value = [
(0.1 * i, f"fb{i}") for i in range(4)
]
result = evaluator.evaluate(prompt, examples, "task")
assert len(result.scores) == 4
assert len(result.feedbacks) == 4
assert len(result.trajectories) == 4


@@ -0,0 +1,147 @@
"""Unit tests for the evolution loop — with full mocking."""
from __future__ import annotations
from unittest.mock import MagicMock, patch
from prometheus.application.bootstrap import SyntheticBootstrap
from prometheus.application.evaluator import PromptEvaluator
from prometheus.application.evolution import EvolutionLoop
from prometheus.domain.entities import EvalResult, Prompt, SyntheticExample, Trajectory
class TestEvolutionLoop:
def test_accepts_improvement(
self,
seed_prompt: Prompt,
        synthetic_pool: list[SyntheticExample],
        task_description: str,
        mock_llm_port: MagicMock,
        mock_judge_port: MagicMock,
        mock_proposer_port: MagicMock,
    ) -> None:
        """When the new prompt improves the score, the best candidate is updated."""
        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
        initial_eval = EvalResult(
            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
            feedbacks=["bad"] * 5,
            trajectories=[
                Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
            ],
        )
        old_eval = EvalResult(
            scores=[0.3, 0.4, 0.3, 0.5, 0.2],
            feedbacks=["bad"] * 5,
            trajectories=[
                Trajectory(f"input{i}", f"output{i}", s, "bad", "prompt")
                for i, s in enumerate([0.3, 0.4, 0.3, 0.5, 0.2])
            ],
        )
        new_eval = EvalResult(
            scores=[0.8, 0.9, 0.7, 0.8, 0.9],
            feedbacks=["good"] * 5,
            trajectories=[],
        )
        evaluator.evaluate = MagicMock(side_effect=[initial_eval, old_eval, new_eval])
        loop = EvolutionLoop(
            evaluator=evaluator,
            proposer=mock_proposer_port,
            bootstrap=bootstrap,
            max_iterations=1,
            minibatch_size=5,
        )
        with patch.object(loop, "_log"):
            state = loop.run(seed_prompt, synthetic_pool, task_description)
        assert state.best_candidate is not None
        assert state.best_candidate.best_score > 0

    def test_rejects_regression(
        self,
        seed_prompt: Prompt,
        synthetic_pool: list[SyntheticExample],
        task_description: str,
        mock_llm_port: MagicMock,
        mock_judge_port: MagicMock,
        mock_proposer_port: MagicMock,
    ) -> None:
        """When the new prompt degrades the score, the best candidate stays unchanged."""
        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
        initial_eval = EvalResult(
            scores=[0.7, 0.8, 0.7, 0.8, 0.9],
            feedbacks=["ok"] * 5,
            trajectories=[
                Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
                for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
            ],
        )
        old_eval = EvalResult(
            scores=[0.7, 0.8, 0.7, 0.8, 0.9],
            feedbacks=["ok"] * 5,
            trajectories=[
                Trajectory(f"input{i}", f"output{i}", s, "ok", "prompt")
                for i, s in enumerate([0.7, 0.8, 0.7, 0.8, 0.9])
            ],
        )
        new_eval = EvalResult(
            scores=[0.2, 0.1, 0.3, 0.2, 0.1],
            feedbacks=["bad"] * 5,
            trajectories=[],
        )
        evaluator.evaluate = MagicMock(side_effect=[initial_eval, old_eval, new_eval])
        loop = EvolutionLoop(
            evaluator=evaluator,
            proposer=mock_proposer_port,
            bootstrap=bootstrap,
            max_iterations=1,
            minibatch_size=5,
        )
        with patch.object(loop, "_log"):
            state = loop.run(seed_prompt, synthetic_pool, task_description)
        assert state.best_candidate is not None
        assert state.best_candidate.prompt.text == seed_prompt.text

    def test_skips_perfect_scores(
        self,
        seed_prompt: Prompt,
        synthetic_pool: list[SyntheticExample],
        task_description: str,
        mock_llm_port: MagicMock,
        mock_judge_port: MagicMock,
        mock_proposer_port: MagicMock,
    ) -> None:
        """When all scores are perfect, no proposition is made."""
        evaluator = PromptEvaluator(mock_llm_port, mock_judge_port)
        bootstrap = MagicMock(spec=SyntheticBootstrap)
        bootstrap.sample_minibatch.return_value = synthetic_pool[:5]
        perfect_eval = EvalResult(
            scores=[1.0, 1.0, 1.0, 1.0, 1.0],
            feedbacks=["perfect"] * 5,
            trajectories=[
                Trajectory(f"input{i}", f"output{i}", 1.0, "perfect", "prompt")
                for i in range(5)
            ],
        )
        evaluator.evaluate = MagicMock(return_value=perfect_eval)
        loop = EvolutionLoop(
            evaluator=evaluator,
            proposer=mock_proposer_port,
            bootstrap=bootstrap,
            max_iterations=3,
            minibatch_size=5,
        )
        with patch.object(loop, "_log"):
            loop.run(seed_prompt, synthetic_pool, task_description)
        mock_proposer_port.propose.assert_not_called()
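
The three tests above pin down the accept/reject behaviour only through mocks. As a rough, stand-alone sketch of the control flow they appear to imply (this is not the project's `EvolutionLoop`; `evolve`, `evaluate`, and `propose` are hypothetical names, and comparing mean scores is an assumption), one might write:

```python
from statistics import fmean
from typing import Callable, Sequence


def evolve(
    seed: str,
    evaluate: Callable[[str], Sequence[float]],
    propose: Callable[[str], str],
    max_iterations: int,
) -> tuple[str, float]:
    """Hypothetical greedy loop: keep a candidate only if it beats the current best.

    Assumption: a candidate is compared with the current best via the mean
    of the scores returned by `evaluate`.
    """
    best = seed
    best_score = fmean(evaluate(seed))  # initial evaluation of the seed prompt
    for _ in range(max_iterations):
        old_scores = evaluate(best)  # re-score the current best
        if all(score >= 1.0 for score in old_scores):
            # Nothing left to improve in this round, so no proposal is requested.
            continue
        candidate = propose(best)
        new_scores = evaluate(candidate)
        if fmean(new_scores) > fmean(old_scores):
            # Strict improvement: the candidate becomes the new best.
            best, best_score = candidate, fmean(new_scores)
        # Otherwise the candidate is discarded and `best` stays unchanged.
    return best, best_score
```

With `max_iterations=1` this sketch calls `evaluate` exactly three times (seed, current best, candidate), which matches the three-element `side_effect` lists used in the tests.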


@@ -0,0 +1,99 @@
"""Unit tests for YamlPersistence file I/O."""

from __future__ import annotations

from pathlib import Path

import pytest
import yaml

from prometheus.infrastructure.file_io import YamlPersistence


class TestYamlPersistenceReadConfig:
    """Tests for read_config YAML loading."""

    def test_roundtrip_write_and_read(self, tmp_path: Path) -> None:
        persistence = YamlPersistence()
        data = {
            "seed_prompt": "You are helpful.",
            "task_description": "Answer questions.",
            "max_iterations": 30,
            "verbose": True,
        }
        config_file = tmp_path / "config.yaml"
        with open(config_file, "w") as f:
            yaml.dump(data, f)
        result = persistence.read_config(str(config_file))
        assert result == data

    def test_reads_nested_yaml(self, tmp_path: Path) -> None:
        persistence = YamlPersistence()
        data = {
            "model": {"name": "gpt-4o", "temperature": 0.7},
            "params": [1, 2, 3],
        }
        config_file = tmp_path / "nested.yaml"
        with open(config_file, "w") as f:
            yaml.dump(data, f)
        result = persistence.read_config(str(config_file))
        assert result["model"]["name"] == "gpt-4o"
        assert result["params"] == [1, 2, 3]

    def test_missing_file_raises_error(self, tmp_path: Path) -> None:
        persistence = YamlPersistence()
        missing = tmp_path / "nonexistent.yaml"
        with pytest.raises(FileNotFoundError):
            persistence.read_config(str(missing))

    def test_malformed_yaml_raises_error(self, tmp_path: Path) -> None:
        persistence = YamlPersistence()
        bad_file = tmp_path / "bad.yaml"
        bad_file.write_text(": [invalid: {yaml", encoding="utf-8")
        with pytest.raises(yaml.YAMLError):
            persistence.read_config(str(bad_file))


class TestYamlPersistenceWriteResult:
    """Tests for write_result YAML output."""

    def test_roundtrip_write_result(self, tmp_path: Path) -> None:
        persistence = YamlPersistence()
        data = {
            "optimized_prompt": "Improved prompt.",
            "initial_score": 0.4,
            "final_score": 0.85,
        }
        output_file = tmp_path / "result.yaml"
        persistence.write_result(str(output_file), data)
        with open(output_file) as f:
            loaded = yaml.safe_load(f)
        assert loaded == data

    def test_write_result_creates_valid_yaml(self, tmp_path: Path) -> None:
        persistence = YamlPersistence()
        data = {"key": "value", "number": 42}
        output_file = tmp_path / "out.yaml"
        persistence.write_result(str(output_file), data)
        content = output_file.read_text()
        assert "key: value" in content
        assert "number: 42" in content

    def test_write_result_handles_unicode(self, tmp_path: Path) -> None:
        persistence = YamlPersistence()
        data = {"prompt": "Répondez en français. 中文测试"}
        output_file = tmp_path / "unicode.yaml"
        persistence.write_result(str(output_file), data)
        with open(output_file, encoding="utf-8") as f:
            loaded = yaml.safe_load(f)
        assert loaded["prompt"] == "Répondez en français. 中文测试"


@@ -0,0 +1,54 @@
"""Unit tests for scoring logic."""

from __future__ import annotations

from prometheus.domain.entities import EvalResult, Trajectory
from prometheus.domain.scoring import normalize_score, should_accept


def _make_eval(scores: list[float]) -> EvalResult:
    return EvalResult(
        scores=scores,
        feedbacks=[""] * len(scores),
        trajectories=[
            Trajectory(f"in{i}", f"out{i}", s, "", "p")
            for i, s in enumerate(scores)
        ],
    )


class TestShouldAccept:
    def test_accepts_improvement(self) -> None:
        old = _make_eval([0.3, 0.4])
        new = _make_eval([0.8, 0.9])
        assert should_accept(old, new) is True

    def test_rejects_regression(self) -> None:
        old = _make_eval([0.8, 0.9])
        new = _make_eval([0.3, 0.4])
        assert should_accept(old, new) is False

    def test_rejects_equal(self) -> None:
        old = _make_eval([0.5, 0.5])
        new = _make_eval([0.5, 0.5])
        assert should_accept(old, new) is False

    def test_min_improvement_threshold(self) -> None:
        old = _make_eval([0.5])
        new = _make_eval([0.6])
        assert should_accept(old, new, min_improvement=0.2) is False
        assert should_accept(old, new, min_improvement=0.05) is True


class TestNormalizeScore:
    def test_clamps_high(self) -> None:
        assert normalize_score(1.5) == 1.0

    def test_clamps_low(self) -> None:
        assert normalize_score(-0.5) == 0.0

    def test_passes_within_range(self) -> None:
        assert normalize_score(0.7) == 0.7

    def test_custom_range(self) -> None:
        assert normalize_score(15.0, min_val=0.0, max_val=10.0) == 10.0
        assert normalize_score(-5.0, min_val=0.0, max_val=10.0) == 0.0
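
The scoring module itself is not shown in this part of the diff. As a minimal sketch of functions that would satisfy the assertions above (not the actual `prometheus.domain.scoring` code; `ScoredBatch` is a hypothetical stand-in for `EvalResult`, and both the mean-based comparison and the default argument values are assumptions inferred from the tests):

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class ScoredBatch:
    """Hypothetical stand-in for EvalResult; only the scores matter here."""

    scores: list[float]


def should_accept(old: ScoredBatch, new: ScoredBatch, min_improvement: float = 0.0) -> bool:
    """Accept only when the mean score rises by strictly more than `min_improvement`.

    Assumption: batches are compared by their mean score. Equal means are
    rejected because the comparison is strict.
    """
    return fmean(new.scores) - fmean(old.scores) > min_improvement


def normalize_score(score: float, min_val: float = 0.0, max_val: float = 1.0) -> float:
    """Clamp `score` into [min_val, max_val] without rescaling it."""
    return max(min_val, min(max_val, score))
```

Note that the `test_custom_range` case expects `10.0` for an input of `15.0` with `max_val=10.0`, which suggests clamping rather than rescaling into the unit interval; the sketch follows that reading.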

2534
uv.lock generated Normal file

File diff suppressed because it is too large