Evaluators
Async conversation judging (LLM + embeddings).
- class afterimage.ConversationJudge(llm: LLMProvider, embedding_provider: EmbeddingProvider, monitor: GenerationMonitor | None = None, *, config: ConversationJudgeConfig | None = None)[source]
Bases: object
Configurable async judge: embedding metrics + LLM rubrics.
Produces EvaluatedConversationWithContext with EvaluationSchema suitable for storage and the generator auto-improve loop.
- async aevaluate_row(conversation: ConversationWithContext) → EvaluatedConversationWithContext[source]
Evaluate one conversation asynchronously.
- classmethod from_factory(llm: LLMProvider, *, key_pool: SmartKeyPool, model_provider_name: Literal['gemini', 'openai', 'deepseek'], embedding_provider_config: dict[str, Any] | None = None, monitor: GenerationMonitor | None = None, config: ConversationJudgeConfig | None = None) → ConversationJudge[source]
Convenience: build embedding provider from config + shared key pool.
- class afterimage.ConversationJudgeConfig(min_acceptable_score: float = 0.58, aggregation_mode: AggregationMode = AggregationMode.MEAN, metric_weights: Dict[EvaluationMetric, float] | None = None, perfect_threshold: float = 0.88, good_threshold: float = 0.72, needs_improvement_threshold: float = 0.52, bad_threshold: float = 0.32)[source]
Bases: object
Tuning knobs for ConversationJudge.
Grade bands (overall score): >= perfect_threshold → PERFECT; >= good_threshold → GOOD; >= needs_improvement_threshold → NEEDS_IMPROVEMENT; >= bad_threshold → BAD; else NOT_ACCEPTABLE.
- aggregation_mode: AggregationMode = 'mean'
- bad_threshold: float = 0.32
- good_threshold: float = 0.72
- metric_weights: Dict[EvaluationMetric, float] | None = None
- min_acceptable_score: float = 0.58
- needs_improvement_threshold: float = 0.52
- perfect_threshold: float = 0.88
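The grade bands above can be sketched as a plain function. The `Grade` enum and `grade_for` helper are illustrative names (they are not part of the documented API); only the threshold defaults come from `ConversationJudgeConfig`.

```python
from enum import Enum


class Grade(str, Enum):
    # Illustrative names mirroring the documented bands.
    PERFECT = "perfect"
    GOOD = "good"
    NEEDS_IMPROVEMENT = "needs_improvement"
    BAD = "bad"
    NOT_ACCEPTABLE = "not_acceptable"


def grade_for(score: float,
              perfect: float = 0.88,
              good: float = 0.72,
              needs_improvement: float = 0.52,
              bad: float = 0.32) -> Grade:
    """Map an overall score onto the documented grade bands."""
    if score >= perfect:
        return Grade.PERFECT
    if score >= good:
        return Grade.GOOD
    if score >= needs_improvement:
        return Grade.NEEDS_IMPROVEMENT
    if score >= bad:
        return Grade.BAD
    return Grade.NOT_ACCEPTABLE
```

Note that `min_acceptable_score` (0.58) is a separate accept/reject cut, independent of the band labels.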
- afterimage.default_embedding_provider_config(model_provider_name: Literal['gemini', 'openai', 'deepseek']) → dict[str, Any][source]
Default embedding backend for auto-improve when none is supplied.
Uses the same API vendor as chat when possible; DeepSeek has no public embedding API in this stack, so a local SentenceTransformer is used instead.
- Parameters:
model_provider_name – Active LLM provider for generation.
- Returns:
Config dict for EmbeddingProviderFactory.
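The fallback logic can be sketched as follows. This is a minimal stand-in, not the library's implementation, and the dict keys (`backend`, `local`) are assumptions rather than the actual EmbeddingProviderFactory schema.

```python
from typing import Any, Dict


def sketch_default_embedding_config(model_provider_name: str) -> Dict[str, Any]:
    """Illustrative version of the documented fallback: reuse the chat
    vendor's embedding API when one exists; DeepSeek has no embedding
    endpoint in this stack, so fall back to a local SentenceTransformer.
    Dict keys here are assumptions, not the library's actual schema."""
    if model_provider_name in ("gemini", "openai"):
        return {"backend": model_provider_name}
    if model_provider_name == "deepseek":
        return {"backend": "sentence_transformer", "local": True}
    raise ValueError(f"unknown provider: {model_provider_name}")
```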
Evaluation package (composite and metrics)
- class afterimage.evaluation.AggregationMode(value)[source]
Bases: str, Enum
How per-metric scores are combined into a single overall score.
- MEAN = 'mean'
Unweighted arithmetic mean over all reported metrics.
- MIN = 'min'
Minimum of all metric scores (strictest).
- WEIGHTED_MEAN = 'weighted_mean'
Weighted average using CompositeEvaluator.metric_weights (default weight 1.0).
- class afterimage.evaluation.CompositeEvaluator(evaluators: List[tuple[BaseEvaluator, float]], min_acceptable_score: float = 0.6, aggregation_mode: AggregationMode = AggregationMode.MEAN, metric_weights: Dict[EvaluationMetric, float] | None = None)[source]
Bases: object
Runs multiple async evaluators in parallel and aggregates scores.
Weighted combination: for each sub-evaluator (E, w), metric m receives score_m * w (only for metrics that E returns). If multiple evaluators emit the same metric, contributions are summed. The overall score is then computed from the combined per-metric map using aggregation_mode:
- MEAN: sum(combined_scores.values()) / len(combined_scores)
- WEIGHTED_MEAN: sum(s * metric_weights.get(m, 1)) / sum(metric_weights.get(m, 1) for m in combined)
- MIN: min(combined_scores.values()) (or 0 if empty)
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
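The three aggregation formulas above can be reproduced over a plain score map. This sketch uses metric-name strings in place of the EvaluationMetric enum and is only an illustration of the documented arithmetic, not the CompositeEvaluator implementation.

```python
from typing import Dict, Mapping, Optional


def aggregate(combined_scores: Dict[str, float],
              mode: str = "mean",
              metric_weights: Optional[Mapping[str, float]] = None) -> float:
    """Combine per-metric scores per the documented aggregation modes."""
    if not combined_scores:
        return 0.0
    if mode == "mean":
        # Unweighted arithmetic mean over all reported metrics.
        return sum(combined_scores.values()) / len(combined_scores)
    if mode == "weighted_mean":
        # Missing metrics default to weight 1.0, as documented.
        weights = metric_weights or {}
        total_w = sum(weights.get(m, 1.0) for m in combined_scores)
        return sum(s * weights.get(m, 1.0)
                   for m, s in combined_scores.items()) / total_w
    if mode == "min":
        # Strictest mode: a single weak metric drags the whole score down.
        return min(combined_scores.values())
    raise ValueError(f"unknown aggregation mode: {mode}")
```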
- class afterimage.evaluation.EvaluationResult(scores: ~typing.Dict[~afterimage.evaluation.base.EvaluationMetric, float], feedback: ~typing.Dict[~afterimage.evaluation.base.EvaluationMetric, str], overall_score: float, needs_regeneration: bool, regeneration_strategy: str | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)[source]
Bases: object
Aggregated scores and feedback for one conversation.
- feedback: Dict[EvaluationMetric, str]
- property final_score: float
Alias for overall_score (monitoring and legacy call sites).
- metadata: Dict[str, Any]
- needs_regeneration: bool
- overall_score: float
- regeneration_strategy: str | None = None
- scores: Dict[EvaluationMetric, float]
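The result shape and its `final_score` alias can be mirrored with a small dataclass. `ResultSketch` is a hypothetical stand-in (string metric keys instead of EvaluationMetric), shown only to make the field layout and alias concrete.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ResultSketch:
    """Minimal stand-in for EvaluationResult, field-for-field."""
    scores: Dict[str, float]
    feedback: Dict[str, str]
    overall_score: float
    needs_regeneration: bool
    regeneration_strategy: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def final_score(self) -> float:
        # Alias kept for monitoring and legacy call sites.
        return self.overall_score
```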
- class afterimage.evaluation.EvaluationMetric(value)[source]
Bases: str, Enum
Available evaluation metrics.
- COHERENCE = 'coherence'
- FACTUALITY = 'factuality'
- FORMATTING = 'formatting'
- GROUNDING = 'grounding'
- HELPFULNESS = 'helpfulness'
- RELEVANCE = 'relevance'
- SAFETY = 'safety'
- class afterimage.evaluation.CoherenceEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, coherence_threshold: float = 0.65)[source]
Bases: object
Question–answer semantic coherence via embedding cosine similarity.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
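The core of the three embedding-based evaluators (coherence, grounding, relevance) is cosine similarity between two embedding vectors, gated by a threshold. A minimal sketch, assuming plain float vectors; in the real classes the vectors come from the injected EmbeddingProvider, and `passes_threshold` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0


def passes_threshold(question_vec: Sequence[float],
                     answer_vec: Sequence[float],
                     threshold: float = 0.65) -> bool:
    # Illustrative gate mirroring e.g. coherence_threshold=0.65.
    return cosine_similarity(question_vec, answer_vec) >= threshold
```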
- class afterimage.evaluation.GroundingEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, grounding_threshold: float = 0.55)[source]
Bases: object
How well assistant answers align with the response context embedding.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
- class afterimage.evaluation.RelevanceEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, relevance_threshold: float = 0.55)[source]
Bases: object
How relevant user questions are to the instruction context.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
- class afterimage.evaluation.FactualityEvaluator(llm: LLMProvider, max_retries: int = 3, monitor: GenerationMonitor | None = None)[source]
Bases: AsyncLLMBaseEvaluator
LLM rubric for factual accuracy vs context.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
- class afterimage.evaluation.HelpfulnessEvaluator(llm: LLMProvider, max_retries: int = 3, monitor: GenerationMonitor | None = None)[source]
Bases: AsyncLLMBaseEvaluator
LLM rubric for answer helpfulness.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]