Evaluators

Async conversation judging (LLM + embeddings).

class afterimage.ConversationJudge(llm: LLMProvider, embedding_provider: EmbeddingProvider, monitor: GenerationMonitor | None = None, *, config: ConversationJudgeConfig | None = None)[source]

Bases: object

Configurable async judge: embedding metrics + LLM rubrics.

Produces EvaluatedConversationWithContext with EvaluationSchema suitable for storage and the generator auto-improve loop.

async aclose() None[source]

Release embedding provider resources when applicable.

async aevaluate_row(conversation: ConversationWithContext) EvaluatedConversationWithContext[source]

Evaluate one conversation asynchronously.

classmethod from_factory(llm: LLMProvider, *, key_pool: SmartKeyPool, model_provider_name: Literal['gemini', 'openai', 'deepseek'], embedding_provider_config: dict[str, Any] | None = None, monitor: GenerationMonitor | None = None, config: ConversationJudgeConfig | None = None) ConversationJudge[source]

Convenience constructor: builds the embedding provider from its config and a shared key pool.

class afterimage.ConversationJudgeConfig(min_acceptable_score: float = 0.58, aggregation_mode: AggregationMode = AggregationMode.MEAN, metric_weights: Dict[EvaluationMetric, float] | None = None, perfect_threshold: float = 0.88, good_threshold: float = 0.72, needs_improvement_threshold: float = 0.52, bad_threshold: float = 0.32)[source]

Bases: object

Tuning knobs for ConversationJudge.

Grade bands (overall score):

  • >= perfect_threshold → PERFECT

  • >= good_threshold → GOOD

  • >= needs_improvement_threshold → NEEDS_IMPROVEMENT

  • >= bad_threshold → BAD

  • otherwise → NOT_ACCEPTABLE

aggregation_mode: AggregationMode = 'mean'
bad_threshold: float = 0.32
good_threshold: float = 0.72
metric_weights: Dict[EvaluationMetric, float] | None = None
min_acceptable_score: float = 0.58
needs_improvement_threshold: float = 0.52
perfect_threshold: float = 0.88
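The banding rule above is a simple threshold cascade. A minimal sketch of that logic follows; the dataclass and function names here are illustrative, not part of the afterimage API, and only the default threshold values are taken from the config signature.

```python
from dataclasses import dataclass


@dataclass
class JudgeThresholds:
    # Defaults mirror ConversationJudgeConfig; field names are illustrative.
    perfect: float = 0.88
    good: float = 0.72
    needs_improvement: float = 0.52
    bad: float = 0.32


def grade(overall_score: float, t: JudgeThresholds = JudgeThresholds()) -> str:
    # Bands are checked from strictest to loosest, matching the grade-band
    # description: the first threshold the score clears decides the grade.
    if overall_score >= t.perfect:
        return "PERFECT"
    if overall_score >= t.good:
        return "GOOD"
    if overall_score >= t.needs_improvement:
        return "NEEDS_IMPROVEMENT"
    if overall_score >= t.bad:
        return "BAD"
    return "NOT_ACCEPTABLE"
```

Note that a score exactly equal to a threshold lands in the higher band, since every comparison is inclusive (>=).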
afterimage.default_embedding_provider_config(model_provider_name: Literal['gemini', 'openai', 'deepseek']) dict[str, Any][source]

Default embedding backend for auto-improve when none is supplied.

Uses the same API vendor as chat when possible; DeepSeek has no public embedding API in this stack, so a local SentenceTransformer model is used instead.

Parameters:

model_provider_name – Active LLM provider for generation.

Returns:

Config dict for EmbeddingProviderFactory.
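The vendor-fallback behaviour can be sketched as below. The dict keys ("backend", "model") are hypothetical placeholders: the real keys consumed by EmbeddingProviderFactory are not documented in this reference, and the model name shown is only an example of a local SentenceTransformer choice.

```python
from typing import Any, Dict


def sketch_default_embedding_config(model_provider_name: str) -> Dict[str, Any]:
    # Illustrative only: key names and values are assumptions, not the
    # actual contract of EmbeddingProviderFactory.
    if model_provider_name == "deepseek":
        # DeepSeek exposes no public embedding API in this stack,
        # so fall back to a locally hosted SentenceTransformer model.
        return {"backend": "sentence_transformer", "model": "all-MiniLM-L6-v2"}
    # Otherwise reuse the same API vendor as the chat provider.
    return {"backend": model_provider_name}
```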

Evaluation package (composite and metrics)

class afterimage.evaluation.AggregationMode(value)[source]

Bases: str, Enum

How per-metric scores are combined into a single overall score.

MEAN = 'mean'

Unweighted arithmetic mean over all reported metrics.

MIN = 'min'

Minimum of all metric scores (strictest).

WEIGHTED_MEAN = 'weighted_mean'

Weighted average using CompositeEvaluator.metric_weights (default weight 1.0).

class afterimage.evaluation.CompositeEvaluator(evaluators: List[tuple[BaseEvaluator, float]], min_acceptable_score: float = 0.6, aggregation_mode: AggregationMode = AggregationMode.MEAN, metric_weights: Dict[EvaluationMetric, float] | None = None)[source]

Bases: object

Runs multiple async evaluators in parallel and aggregates scores.

Weighted combination: for each sub-evaluator (E, w), metric m receives score_m * w (only metrics that E returns). If multiple evaluators emit the same metric, contributions are summed. The overall score is then computed from the combined per-metric map using aggregation_mode.

  • MEAN: sum(combined_scores.values()) / len(combined_scores)

  • WEIGHTED_MEAN: sum(s * metric_weights.get(m, 1) for m, s in combined_scores.items()) / sum(metric_weights.get(m, 1) for m in combined_scores)

  • MIN: min(combined_scores.values()) (or 0 if empty)

async aevaluate(conversation: ConversationWithContext) EvaluationResult[source]
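The three aggregation formulas above can be sketched over a combined per-metric score map as follows. The Mode enum mirrors AggregationMode's values; the aggregate function is an illustration of the documented math, not the library's implementation.

```python
from enum import Enum
from typing import Dict, Optional


class Mode(str, Enum):
    # Mirrors afterimage.evaluation.AggregationMode values.
    MEAN = "mean"
    MIN = "min"
    WEIGHTED_MEAN = "weighted_mean"


def aggregate(combined: Dict[str, float], mode: Mode,
              weights: Optional[Dict[str, float]] = None) -> float:
    # Empty metric map -> 0, matching the documented MIN edge case.
    if not combined:
        return 0.0
    weights = weights or {}
    if mode is Mode.MEAN:
        # Unweighted arithmetic mean over all reported metrics.
        return sum(combined.values()) / len(combined)
    if mode is Mode.WEIGHTED_MEAN:
        # Weighted average; missing metrics default to weight 1.0.
        total_w = sum(weights.get(m, 1.0) for m in combined)
        return sum(s * weights.get(m, 1.0) for m, s in combined.items()) / total_w
    # MIN: strictest mode, the weakest metric dominates.
    return min(combined.values())
```

A worked example: scores {coherence: 0.5, safety: 1.0} give 0.75 under MEAN, 0.5 under MIN, and with weight 3.0 on coherence, (0.5·3 + 1.0·1) / 4 = 0.625 under WEIGHTED_MEAN.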
class afterimage.evaluation.EvaluationResult(scores: Dict[EvaluationMetric, float], feedback: Dict[EvaluationMetric, str], overall_score: float, needs_regeneration: bool, regeneration_strategy: str | None = None, metadata: Dict[str, Any] = <factory>)[source]

Bases: object

Aggregated scores and feedback for one conversation.

feedback: Dict[EvaluationMetric, str]
property final_score: float

Alias for overall_score (monitoring and legacy call sites).

metadata: Dict[str, Any]
needs_regeneration: bool
overall_score: float
regeneration_strategy: str | None = None
scores: Dict[EvaluationMetric, float]
class afterimage.evaluation.EvaluationMetric(value)[source]

Bases: str, Enum

Available evaluation metrics.

COHERENCE = 'coherence'
FACTUALITY = 'factuality'
FORMATTING = 'formatting'
GROUNDING = 'grounding'
HELPFULNESS = 'helpfulness'
RELEVANCE = 'relevance'
SAFETY = 'safety'
class afterimage.evaluation.CoherenceEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, coherence_threshold: float = 0.65)[source]

Bases: object

Question–answer semantic coherence via embedding cosine similarity.

async aevaluate(conversation: ConversationWithContext) EvaluationResult[source]
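The cosine-similarity core of this evaluator can be sketched in plain Python. Only cosine similarity itself and the 0.65 default threshold come from this reference; the coherence_score helper and its below-threshold flag are assumptions about how the score might feed regeneration decisions.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coherence_score(question_emb: List[float], answer_emb: List[float],
                    threshold: float = 0.65) -> Tuple[float, bool]:
    # Hypothetical scoring: similarity doubles as the metric score, and a
    # below-threshold pair is flagged (the composite judge would decide
    # whether that triggers regeneration).
    sim = cosine_similarity(question_emb, answer_emb)
    return sim, sim < threshold
```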
class afterimage.evaluation.GroundingEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, grounding_threshold: float = 0.55)[source]

Bases: object

How well assistant answers align with the response context embedding.

async aevaluate(conversation: ConversationWithContext) EvaluationResult[source]
class afterimage.evaluation.RelevanceEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, relevance_threshold: float = 0.55)[source]

Bases: object

How relevant user questions are to the instruction context.

async aevaluate(conversation: ConversationWithContext) EvaluationResult[source]
class afterimage.evaluation.FactualityEvaluator(llm: LLMProvider, max_retries: int = 3, monitor: GenerationMonitor | None = None)[source]

Bases: AsyncLLMBaseEvaluator

LLM rubric for factual accuracy vs context.

async aevaluate(conversation: ConversationWithContext) EvaluationResult[source]
class afterimage.evaluation.HelpfulnessEvaluator(llm: LLMProvider, max_retries: int = 3, monitor: GenerationMonitor | None = None)[source]

Bases: AsyncLLMBaseEvaluator

LLM rubric for answer helpfulness.

async aevaluate(conversation: ConversationWithContext) EvaluationResult[source]