Evaluators
Async conversation judging (LLM + embeddings).
- class afterimage.ConversationJudge(llm: LLMProvider, embedding_provider: EmbeddingProvider, monitor: GenerationMonitor | None = None, *, config: ConversationJudgeConfig | None = None)[source]
Bases: object
Configurable async judge: embedding metrics + LLM rubrics.
Produces EvaluatedConversationWithContext with EvaluationSchema suitable for storage and the generator auto-improve loop.
- async aevaluate_row(conversation: ConversationWithContext) → EvaluatedConversationWithContext[source]
Evaluate one conversation asynchronously.
- classmethod from_factory(llm: LLMProvider, *, key_pool: SmartKeyPool, model_provider_name: Literal['gemini', 'openai', 'deepseek'], embedding_provider_config: dict[str, Any] | None = None, monitor: GenerationMonitor | None = None, config: ConversationJudgeConfig | None = None) → ConversationJudge[source]
Convenience: build embedding provider from config + shared key pool.
- class afterimage.ConversationJudgeConfig(min_acceptable_score: float = 0.58, aggregation_mode: AggregationMode = AggregationMode.MEAN, metric_weights: Dict[EvaluationMetric, float] | None = None, perfect_threshold: float = 0.88, good_threshold: float = 0.72, needs_improvement_threshold: float = 0.52, bad_threshold: float = 0.32)[source]
Bases: object
Tuning knobs for ConversationJudge.
Grade bands (overall score): >= perfect_threshold → PERFECT; >= good_threshold → GOOD; >= needs_improvement_threshold → NEEDS_IMPROVEMENT; >= bad_threshold → BAD; else NOT_ACCEPTABLE.
- aggregation_mode: AggregationMode = 'mean'
- bad_threshold: float = 0.32
- good_threshold: float = 0.72
- metric_weights: Dict[EvaluationMetric, float] | None = None
- min_acceptable_score: float = 0.58
- needs_improvement_threshold: float = 0.52
- perfect_threshold: float = 0.88
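The grade bands above can be sketched as a plain function. The `Grade` enum and `grade_for` helper are illustrative names (they are not part of the documented API); only the threshold defaults come from `ConversationJudgeConfig`.

```python
from enum import Enum


class Grade(str, Enum):
    # Illustrative names mirroring the documented bands.
    PERFECT = "perfect"
    GOOD = "good"
    NEEDS_IMPROVEMENT = "needs_improvement"
    BAD = "bad"
    NOT_ACCEPTABLE = "not_acceptable"


def grade_for(score: float,
              perfect: float = 0.88,
              good: float = 0.72,
              needs_improvement: float = 0.52,
              bad: float = 0.32) -> Grade:
    """Map an overall score onto the documented grade bands."""
    if score >= perfect:
        return Grade.PERFECT
    if score >= good:
        return Grade.GOOD
    if score >= needs_improvement:
        return Grade.NEEDS_IMPROVEMENT
    if score >= bad:
        return Grade.BAD
    return Grade.NOT_ACCEPTABLE
```

Note that `min_acceptable_score` (0.58) is a separate accept/reject cut, independent of the band labels.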
- afterimage.default_embedding_provider_config(model_provider_name: Literal['gemini', 'openai', 'deepseek']) → dict[str, Any][source]
Default embedding backend for auto-improve when none is supplied.
Uses the same API vendor as chat when possible; DeepSeek has no public embedding API in this stack, so a local SentenceTransformer is used instead.
- Parameters:
model_provider_name – Active LLM provider for generation.
- Returns:
Config dict for EmbeddingProviderFactory.
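The fallback logic can be sketched as follows. This is a minimal stand-in, not the library's implementation, and the dict keys (`backend`, `local`) are assumptions rather than the actual EmbeddingProviderFactory schema.

```python
from typing import Any, Dict


def sketch_default_embedding_config(model_provider_name: str) -> Dict[str, Any]:
    """Illustrative version of the documented fallback: reuse the chat
    vendor's embedding API when one exists; DeepSeek has no embedding
    endpoint in this stack, so fall back to a local SentenceTransformer.
    Dict keys here are assumptions, not the library's actual schema."""
    if model_provider_name in ("gemini", "openai"):
        return {"backend": model_provider_name}
    if model_provider_name == "deepseek":
        return {"backend": "sentence_transformer", "local": True}
    raise ValueError(f"unknown provider: {model_provider_name}")
```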
Evaluation package (composite and metrics)
- class afterimage.evaluation.AggregationMode(value)[source]
Bases: str, Enum
How per-metric scores are combined into a single overall score.
- MEAN = 'mean'
Unweighted arithmetic mean over all reported metrics.
- MIN = 'min'
Minimum of all metric scores (strictest).
- WEIGHTED_MEAN = 'weighted_mean'
Weighted average using CompositeEvaluator.metric_weights (default weight 1.0).
- class afterimage.evaluation.CompositeEvaluator(evaluators: List[tuple[BaseEvaluator, float]], min_acceptable_score: float = 0.6, aggregation_mode: AggregationMode = AggregationMode.MEAN, metric_weights: Dict[EvaluationMetric, float] | None = None)[source]
Bases: object
Runs multiple async evaluators in parallel and aggregates scores.
Weighted combination: for each sub-evaluator (E, w), metric m receives score_m * w (only for metrics that E returns). If multiple evaluators emit the same metric, contributions are summed. The overall score is then computed from the combined per-metric map using aggregation_mode:
- MEAN: sum(combined_scores.values()) / len(combined_scores)
- WEIGHTED_MEAN: sum(s * metric_weights.get(m, 1)) / sum(metric_weights.get(m, 1) for m in combined)
- MIN: min(combined_scores.values()) (or 0 if empty)
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
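The three aggregation formulas above can be reproduced over a plain score map. This sketch uses metric-name strings in place of the EvaluationMetric enum and is only an illustration of the documented arithmetic, not the CompositeEvaluator implementation.

```python
from typing import Dict, Mapping, Optional


def aggregate(combined_scores: Dict[str, float],
              mode: str = "mean",
              metric_weights: Optional[Mapping[str, float]] = None) -> float:
    """Combine per-metric scores per the documented aggregation modes."""
    if not combined_scores:
        return 0.0
    if mode == "mean":
        # Unweighted arithmetic mean over all reported metrics.
        return sum(combined_scores.values()) / len(combined_scores)
    if mode == "weighted_mean":
        # Missing metrics default to weight 1.0, as documented.
        weights = metric_weights or {}
        total_w = sum(weights.get(m, 1.0) for m in combined_scores)
        return sum(s * weights.get(m, 1.0)
                   for m, s in combined_scores.items()) / total_w
    if mode == "min":
        # Strictest mode: a single weak metric drags the whole score down.
        return min(combined_scores.values())
    raise ValueError(f"unknown aggregation mode: {mode}")
```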
- class afterimage.evaluation.EvaluationResult(scores: ~typing.Dict[~afterimage.evaluation.base.EvaluationMetric, float], feedback: ~typing.Dict[~afterimage.evaluation.base.EvaluationMetric, str], overall_score: float, needs_regeneration: bool, regeneration_strategy: str | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)[source]
Bases: object
Aggregated scores and feedback for one conversation.
- feedback: Dict[EvaluationMetric, str]
- property final_score: float
Alias for overall_score (monitoring and legacy call sites).
- metadata: Dict[str, Any]
- needs_regeneration: bool
- overall_score: float
- regeneration_strategy: str | None = None
- scores: Dict[EvaluationMetric, float]
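The result shape and its `final_score` alias can be mirrored with a small dataclass. `ResultSketch` is a hypothetical stand-in (string metric keys instead of EvaluationMetric), shown only to make the field layout and alias concrete.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ResultSketch:
    """Minimal stand-in for EvaluationResult, field-for-field."""
    scores: Dict[str, float]
    feedback: Dict[str, str]
    overall_score: float
    needs_regeneration: bool
    regeneration_strategy: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def final_score(self) -> float:
        # Alias kept for monitoring and legacy call sites.
        return self.overall_score
```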
- class afterimage.evaluation.EvaluationMetric(value)[source]
Bases: str, Enum
Available evaluation metrics.
- COHERENCE = 'coherence'
- FACTUALITY = 'factuality'
- FORMATTING = 'formatting'
- GROUNDING = 'grounding'
- HELPFULNESS = 'helpfulness'
- RELEVANCE = 'relevance'
- SAFETY = 'safety'
- class afterimage.evaluation.CoherenceEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, coherence_threshold: float = 0.65)[source]
Bases: object
Question–answer semantic coherence via embedding cosine similarity.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
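The core of the three embedding-based evaluators (coherence, grounding, relevance) is cosine similarity between two embedding vectors, gated by a threshold. A minimal sketch, assuming plain float vectors; in the real classes the vectors come from the injected EmbeddingProvider, and `passes_threshold` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0


def passes_threshold(question_vec: Sequence[float],
                     answer_vec: Sequence[float],
                     threshold: float = 0.65) -> bool:
    # Illustrative gate mirroring e.g. coherence_threshold=0.65.
    return cosine_similarity(question_vec, answer_vec) >= threshold
```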
- class afterimage.evaluation.GroundingEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, grounding_threshold: float = 0.55)[source]
Bases: object
How well assistant answers align with the response context embedding.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
- class afterimage.evaluation.RelevanceEvaluator(embedding: EmbeddingProvider, monitor: GenerationMonitor | None = None, relevance_threshold: float = 0.55)[source]
Bases: object
How relevant user questions are to the instruction context.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
- class afterimage.evaluation.FactualityEvaluator(llm: LLMProvider, max_retries: int = 3, monitor: GenerationMonitor | None = None)[source]
Bases: AsyncLLMBaseEvaluator
LLM rubric for factual accuracy vs context.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]
- class afterimage.evaluation.HelpfulnessEvaluator(llm: LLMProvider, max_retries: int = 3, monitor: GenerationMonitor | None = None)[source]
Bases: AsyncLLMBaseEvaluator
LLM rubric for answer helpfulness.
- async aevaluate(conversation: ConversationWithContext) → EvaluationResult[source]