# Overview

Afterimage is a powerful Python framework designed for generating high-quality synthetic conversation datasets using Large Language Models (LLMs). It provides a structured, scalable, and customizable way to simulate interactions between a user (Correspondent) and an AI assistant (Respondent), enabling the creation of diverse training and evaluation datasets for various applications.

Unlike simple prompt loops, Afterimage offers a sophisticated engine that manages personas, context retrieval (RAG), structured data extraction, and comprehensive quality evaluation, all while running asynchronously for maximum performance.

## Use Cases

Afterimage is built to support a wide range of synthetic data needs:

* **Domain-Specific QA**: Generate question-answer pairs based on your technical documentation to fine-tune models for expert-level support.
* **Enterprise Knowledge Training**: Create safe, synthetic datasets from internal corporate wikis or knowledge bases to train models without exposing raw sensitive data.
* **Customer Support Simulation**: Simulate realistic customer interaction scenarios, including angry, confused, or happy customers, to train support bots or human agents.
* **Data Extraction & Structuring**: Convert unstructured text into strictly formatted JSON data for downstream processing or database population.
* **Evaluation Datasets**: Produce "golden sets" of complex questions and ground-truth answers to benchmark the performance of your RAG pipelines.

## Key Features

* **⚡ Async & Parallel**: Built on `asyncio` to generate hundreds of conversations concurrently, maximizing API throughput.
* **🎭 Persona Simulation**: Automatically generate unique user personas (e.g., "Novice User", "Senior Engineer") to ensure dataset diversity.
* **🧠 RAG-Native**: First-class support for injecting context from vector databases (such as Qdrant) or local files into the generation process.
* **📋 Structured Generation**: Enforce Pydantic schemas on outputs to ensure generated data is machine-readable and valid.
* **⚖️ Evaluation Framework**: Built-in evaluators (both LLM-as-judge and embedding-based) measure coherence, factuality, and grounding.
* **📊 Monitoring**: Real-time observability into generation metrics such as token usage, meaningfulness, and error rates.

## Key Concepts

To use Afterimage effectively, it helps to understand its core abstractions:

### 1. Correspondent vs. Respondent

* **Correspondent (The User)**: The simulated agent that initiates the conversation. It asks questions, follows up, or gives instructions. Its behavior is driven by an **Instruction Generator**.
* **Respondent (The Assistant)**: The agent that replies. Its behavior is defined by your system prompt. This is usually the model you are trying to emulate or train.

### 2. Generators

The engines that run the simulation:

* **`ConversationGenerator`**: The primary engine for multi-turn chat.
* **`StructuredGenerator`**: A specialized engine for single-turn data extraction into JSON.

### 3. Callbacks

Hooks that customize the behavior of the agents:

* **Instruction Generator Callback**: Decides *what* the Correspondent asks. It can read from documents to ask relevant questions.
* **Respondent Prompt Modifier**: Dynamically changes the Respondent's system prompt (e.g., injecting the relevant context chunks for RAG).

### 4. Document Providers

The source of knowledge for the generation. Afterimage supports reading from:

* Local text/Markdown files.
* JSONL datasets.
* Vector databases (Qdrant).
* In-memory lists.

### 5. Personas

Profiles that shape the Correspondent's tone, vocabulary, and expertise level. Instead of a generic "User", Afterimage can simulate specific demographics to stress-test your model.
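To make these abstractions concrete, here is a minimal, self-contained sketch of the Correspondent/Respondent loop driven by a persona and an instruction-generator callback. This is illustrative only and does not use Afterimage's actual API: the names `Persona`, `instruction_generator`, `respond`, and `simulate_conversation` are assumptions for the example, and the LLM call is replaced by a stub so the snippet runs with the standard library alone. It also shows the `asyncio` pattern behind the "Async & Parallel" feature, where many conversations are simulated concurrently with `asyncio.gather`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Persona:
    """Shapes the Correspondent's tone and expertise level."""
    name: str
    style: str


def instruction_generator(persona: Persona, turn: int, document: str) -> str:
    # Instruction Generator Callback: decides *what* the Correspondent
    # asks, here trivially derived from the source document and persona.
    return f"[{persona.style}] question {turn} about: {document[:40]}"


async def respond(prompt: str) -> str:
    # Stand-in for a real LLM call; swap in your API client here.
    await asyncio.sleep(0)
    return f"answer to: {prompt}"


async def simulate_conversation(persona: Persona, document: str, turns: int) -> list[dict]:
    """One Correspondent/Respondent exchange loop producing chat messages."""
    history: list[dict] = []
    for t in range(1, turns + 1):
        question = instruction_generator(persona, t, document)
        answer = await respond(question)
        history.append({"role": "user", "content": question})
        history.append({"role": "assistant", "content": answer})
    return history


async def main() -> list[list[dict]]:
    personas = [Persona("novice", "simple words"), Persona("expert", "precise jargon")]
    doc = "Qdrant stores vectors for similarity search."
    # Run all simulated conversations concurrently, one per persona.
    return await asyncio.gather(
        *(simulate_conversation(p, doc, turns=2) for p in personas)
    )


conversations = asyncio.run(main())
```

Each conversation comes back as a list of `{"role", "content"}` messages, so two personas with two turns each yield two four-message transcripts. In a real pipeline, the stubbed `respond` would be an actual model call and the callback would draw questions from a document provider rather than a hard-coded string.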