Evaluation Framework

Generating synthetic data is only half the battle. You also need to ensure that the data is high quality, faithful to your source material, and diverse. Afterimage provides an async evaluation stack built around ConversationJudge: embedding-based metrics plus LLM-as-judge rubrics, composed with configurable aggregation and grade thresholds.

Overview

  • ConversationJudge (afterimage.evaluator): Primary entry point. Runs CoherenceEvaluator, GroundingEvaluator, RelevanceEvaluator (via :class:~afterimage.providers.embedding_providers.EmbeddingProvider) and FactualityEvaluator, HelpfulnessEvaluator (via :class:~afterimage.providers.llm_providers.LLMProvider structured output). Returns :class:~afterimage.types.EvaluatedConversationWithContext with :class:~afterimage.types.EvaluationSchema.

  • CompositeEvaluator (afterimage.evaluation): Combines async sub-evaluators in parallel; supports :class:~afterimage.evaluation.base.AggregationMode (MEAN, WEIGHTED_MEAN, MIN).

  • Generator integration: ConversationGenerator(..., auto_improve=True) builds a judge automatically using default_embedding_provider_config(model_provider_name) when you do not pass embedding_provider or embedding_provider_config.
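To make the aggregation modes concrete, here is an illustrative sketch of how MEAN, WEIGHTED_MEAN, and MIN could combine per-metric scores. The function and weight names are hypothetical stand-ins, not Afterimage's internal implementation.

```python
from enum import Enum

class AggregationMode(Enum):
    MEAN = "mean"
    WEIGHTED_MEAN = "weighted_mean"
    MIN = "min"

def aggregate(scores, mode, weights=None):
    """Collapse per-metric scores (each in [0, 1]) into one composite score."""
    values = list(scores.values())
    if mode is AggregationMode.MEAN:
        return sum(values) / len(values)
    if mode is AggregationMode.MIN:
        # MIN is the strictest mode: one weak metric drags the whole row down.
        return min(values)
    # WEIGHTED_MEAN: scale each metric by its weight, normalise by total weight.
    weights = weights or {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

scores = {"coherence": 0.9, "grounding": 0.7, "factuality": 0.8}
print(aggregate(scores, AggregationMode.MIN))
```

MIN is useful when any single failing metric should force regeneration; WEIGHTED_MEAN lets you emphasise, say, grounding over coherence for RAG-style data.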

The legacy Gemini one-shot path and evaluator_method="simple" | "hybrid" have been removed in favor of this single pipeline.

Metrics

Coherence (CoherenceEvaluator)

  • Method: Cosine similarity between embeddings of each user turn and the following assistant turn.

  • Goal: Question and answer are semantically aligned.
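The cosine-similarity step behind this metric can be sketched as follows. The embedding model itself is out of scope here, so toy vectors stand in for the user-turn and assistant-turn embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# One (user turn, assistant turn) embedding pair per adjacent exchange;
# the coherence score for the pair is their cosine similarity.
user_vec = [0.2, 0.8, 0.1]
assistant_vec = [0.25, 0.75, 0.05]
print(round(cosine_similarity(user_vec, assistant_vec), 3))
```

Grounding and Relevance below use the same similarity, just with different vector pairs (context embedding vs. assistant reply, and context embedding vs. user question).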

Grounding (GroundingEvaluator)

  • Method: Cosine similarity between response_context embedding and each assistant reply.

  • Goal: Answers stay close to the provided response context (RAG-style).

Relevance (RelevanceEvaluator)

  • Method: Cosine similarity between instruction_context embedding and each user question.

  • Goal: Questions match the instruction-side context.

Factuality / Helpfulness (FactualityEvaluator, HelpfulnessEvaluator)

  • Method: LLM structured output (agenerate_structured) with per-item scores in [0, 1] and short feedback.
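A hypothetical shape for the structured judgment the LLM returns per item is sketched below; the real Afterimage schema may differ, but the score-plus-feedback pattern with range validation is the essential idea.

```python
from dataclasses import dataclass

@dataclass
class MetricJudgment:
    # Hypothetical fields mirroring "per-item scores in [0, 1] and short feedback".
    score: float
    feedback: str

    def __post_init__(self):
        # Reject out-of-range scores rather than silently clamping them,
        # so malformed judge output surfaces as an error.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {self.score}")

judgment = MetricJudgment(score=0.85, feedback="Claims match the provided context.")
```

Constraining the judge to a typed schema like this is what lets the composite score stay a simple numeric aggregation.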

Usage

Standalone judge

import asyncio
from afterimage import ConversationJudge, LLMFactory, SmartKeyPool
from afterimage.providers import EmbeddingProviderFactory

async def main():
    # A single key pool can back both the LLM and the embedding provider.
    pool = SmartKeyPool.from_single_key("YOUR_KEY")
    llm = LLMFactory.create("gemini", "gemini-2.0-flash", pool)
    embed = EmbeddingProviderFactory.create(
        {"type": "gemini", "model": "text-embedding-004"},
        key_pool=pool,
    )
    judge = ConversationJudge(llm=llm, embedding_provider=embed)
    row = ...  # ConversationWithContext
    out = await judge.aevaluate_row(row)
    print(out.evaluation.overall_grade, out.final_score)
    await judge.aclose()  # release provider resources

asyncio.run(main())

With ConversationGenerator (auto-improve)

Pass auto_improve=True. Optionally set embedding_provider, embedding_provider_config, or judge_config (:class:~afterimage.evaluator.ConversationJudgeConfig).
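The control flow behind an auto-improve loop can be sketched in isolation: regenerate until the judge score clears a threshold, keeping the best attempt. The names generate and judge below are stand-ins, not Afterimage's internal API, and the threshold default is illustrative.

```python
def auto_improve(generate, judge, min_acceptable_score=0.7, max_attempts=3):
    """Regenerate up to max_attempts times; return the best-scoring row."""
    best_row, best_score = None, float("-inf")
    for _ in range(max_attempts):
        row = generate()
        score = judge(row)
        if score > best_score:
            best_row, best_score = row, score
        if score >= min_acceptable_score:
            break  # good enough, stop regenerating
    return best_row, best_score
```

Capping attempts matters in practice: a hard-to-ground row would otherwise burn API quota indefinitely.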

Runnable example

See examples/conversation_judge_demo.py in the repository (requires GEMINI_API_KEY): it builds ConversationJudge with Gemini LLM + Gemini embeddings and scores two hand-written ConversationWithContext rows (one aligned with context, one weaker).

Interpreting results

  • Per-metric scores: 0.0 to 1.0, in evaluation.*.score.

  • overall_grade: Derived from the composite overall score and configurable thresholds on ConversationJudgeConfig.

  • needs_regeneration (internal): Composite result when overall score is below min_acceptable_score; future hooks may use this with regeneration strategies.
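The threshold logic above can be sketched as follows. The cutoff values are hypothetical; the real ones are configured on ConversationJudgeConfig.

```python
def to_grade(score, thresholds=((0.9, "A"), (0.75, "B"), (0.6, "C"))):
    """Map a composite score to a letter grade (thresholds sorted high to low)."""
    for cutoff, grade in thresholds:
        if score >= cutoff:
            return grade
    return "F"

def needs_regeneration(score, min_acceptable_score=0.7):
    # Illustrative: flag rows whose composite score falls below the floor.
    return score < min_acceptable_score
```

With these example cutoffs, a 0.8 composite earns a "B" and would not be flagged for regeneration, while a 0.5 earns an "F" and would be.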