Evaluation Framework
Generating synthetic data is only half the battle. You also need to ensure that the data is high quality, faithful to your source material, and diverse. Afterimage provides an async evaluation stack built around ConversationJudge: embedding-based metrics plus LLM-as-judge rubrics, composed with configurable aggregation and grade thresholds.
Overview
- `ConversationJudge` (`afterimage.evaluator`): Primary entry point. Runs `CoherenceEvaluator`, `GroundingEvaluator`, and `RelevanceEvaluator` (via `afterimage.providers.embedding_providers.EmbeddingProvider`) plus `FactualityEvaluator` and `HelpfulnessEvaluator` (via `afterimage.providers.llm_providers.LLMProvider` structured output). Returns `afterimage.types.EvaluatedConversationWithContext` with an `afterimage.types.EvaluationSchema`.
- `CompositeEvaluator` (`afterimage.evaluation`): Combines async sub-evaluators in parallel; supports `afterimage.evaluation.base.AggregationMode` (`MEAN`, `WEIGHTED_MEAN`, `MIN`).
- Generator integration: `ConversationGenerator(..., auto_improve=True)` builds a judge automatically using `default_embedding_provider_config(model_provider_name)` when you do not pass `embedding_provider` or `embedding_provider_config`.
The legacy Gemini one-shot path and `evaluator_method="simple" | "hybrid"` have been removed in favor of this single pipeline.
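The three aggregation modes can be illustrated with a small self-contained sketch. This is plain Python for intuition only, not the actual `CompositeEvaluator` implementation; the function and variable names here are illustrative:

```python
from statistics import fmean


def aggregate(scores, mode="mean", weights=None):
    """Combine per-evaluator scores in the spirit of the three modes.

    mode: "mean" averages equally, "weighted_mean" uses the given
    weights, "min" takes the worst score (the strictest gate).
    """
    if mode == "mean":
        return fmean(scores)
    if mode == "weighted_mean":
        total = sum(weights)
        return sum(s * w for s, w in zip(scores, weights)) / total
    if mode == "min":
        return min(scores)
    raise ValueError(f"unknown mode: {mode}")


# Example: hypothetical coherence, grounding, factuality scores.
scores = [0.9, 0.6, 0.8]
print(aggregate(scores, "mean"))                      # ≈ 0.767
print(aggregate(scores, "weighted_mean", [1, 2, 1]))  # 0.725
print(aggregate(scores, "min"))                       # 0.6
```

`MIN` is the conservative choice when a single failing metric should sink the row; the weighted mean lets you emphasize, say, grounding over coherence.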
Metrics
Coherence (CoherenceEvaluator)
Method: Cosine similarity between embeddings of each user turn and the following assistant turn.
Goal: Question and answer are semantically aligned.
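Mechanically, coherence reduces to pairing each user turn with the following assistant turn and averaging cosine similarities. A minimal sketch, assuming a simple turn-dict format; `embed` stands in for the configured embedding provider and is not part of the library API:

```python
import math


def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def coherence_score(turns, embed):
    """Average cosine similarity of each (user, assistant) turn pair.

    turns: ordered list of {"role": ..., "content": ...} dicts.
    embed: callable mapping text -> vector (placeholder for the provider).
    """
    sims = [
        cosine(embed(turns[i]["content"]), embed(turns[i + 1]["content"]))
        for i in range(len(turns) - 1)
        if turns[i]["role"] == "user" and turns[i + 1]["role"] == "assistant"
    ]
    return sum(sims) / len(sims) if sims else 0.0
```

Grounding and relevance (below) use the same cosine machinery, just comparing turns against a context embedding instead of against each other.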
Grounding (GroundingEvaluator)
Method: Cosine similarity between the `response_context` embedding and each assistant reply.
Goal: Answers stay close to the provided response context (RAG-style).
Relevance (RelevanceEvaluator)
Method: Cosine similarity between the `instruction_context` embedding and each user question.
Goal: Questions match the instruction-side context.
Factuality / Helpfulness (FactualityEvaluator, HelpfulnessEvaluator)
Method: LLM structured output (`agenerate_structured`) with per-item scores in `[0, 1]` and short feedback.
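The shape of that structured output can be sketched as follows. The actual schema lives in `afterimage.types`; the `JudgeItem` dataclass and `validate_items` helper below are assumptions for illustration, including the clamping step that keeps a model's out-of-range score inside `[0, 1]`:

```python
from dataclasses import dataclass


@dataclass
class JudgeItem:
    score: float   # expected in [0, 1]
    feedback: str  # short rationale from the model


def validate_items(raw):
    """Coerce parsed structured output, clamping scores to [0, 1].

    raw: list of {"score": ..., "feedback": ...} dicts, e.g. the
    payload parsed from the LLM's structured response.
    """
    items = []
    for entry in raw:
        score = min(1.0, max(0.0, float(entry["score"])))
        items.append(JudgeItem(score=score, feedback=str(entry["feedback"])))
    return items


items = validate_items([
    {"score": 1.3, "feedback": "accurate"},
    {"score": 0.4, "feedback": "partially grounded"},
])
print([i.score for i in items])  # [1.0, 0.4]
```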
Usage
Standalone judge
```python
import asyncio

from afterimage import ConversationJudge, LLMFactory, SmartKeyPool
from afterimage.providers import EmbeddingProviderFactory


async def main():
    pool = SmartKeyPool.from_single_key("YOUR_KEY")
    llm = LLMFactory.create("gemini", "gemini-2.0-flash", pool)
    embed = EmbeddingProviderFactory.create(
        {"type": "gemini", "model": "text-embedding-004"},
        key_pool=pool,
    )

    judge = ConversationJudge(llm=llm, embedding_provider=embed)

    row = ...  # ConversationWithContext
    out = await judge.aevaluate_row(row)
    print(out.evaluation.overall_grade, out.final_score)

    await judge.aclose()


asyncio.run(main())
```
With ConversationGenerator (auto-improve)
Pass `auto_improve=True`. Optionally set `embedding_provider`, `embedding_provider_config`, or `judge_config` (`afterimage.evaluator.ConversationJudgeConfig`).
Runnable example
See `examples/conversation_judge_demo.py` in the repository (requires `GEMINI_API_KEY`): it builds `ConversationJudge` with a Gemini LLM plus Gemini embeddings and scores two hand-written `ConversationWithContext` rows (one aligned with its context, one weaker).
Interpreting results
- Per-metric scores: `0.0`–`1.0` in `evaluation.*.score`.
- `overall_grade`: Derived from the composite overall score and configurable thresholds on `ConversationJudgeConfig`.
- `needs_regeneration` (internal): Set on the composite result when the overall score is below `min_acceptable_score`; future hooks may use this with regeneration strategies.
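The grade derivation can be pictured as a threshold ladder. The cutoffs and labels below are illustrative defaults, not the actual `ConversationJudgeConfig` values:

```python
def overall_grade(score, thresholds=((0.85, "excellent"),
                                     (0.70, "good"),
                                     (0.50, "fair"))):
    """Map a composite score in [0, 1] to a grade label.

    thresholds: (cutoff, label) pairs checked from highest to lowest;
    anything below the last cutoff falls through to "poor".
    """
    for cutoff, label in thresholds:
        if score >= cutoff:
            return label
    return "poor"


def needs_regeneration(score, min_acceptable_score=0.5):
    # Flag rows whose composite score falls below the floor.
    return score < min_acceptable_score


print(overall_grade(0.9))        # excellent
print(needs_regeneration(0.42))  # True
```

Tightening the thresholds trades throughput for quality: more rows get flagged, but the surviving dataset scores higher on every metric.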