# Structured Generation

Sometimes you don't want a conversation. You want to extract specific information from a document, or generate synthetic data that fits a strict schema (like a database row).

Afterimage supports **Structured Generation** using `StructuredGenerator` and Pydantic models.

## Concept

Structured generation forces the LLM to output valid JSON that matches a schema you define. This is useful for:

* **Data Extraction**: "Read this email and extract the sender, date, and sentiment."
* **Synthetic Database Rows**: "Generate 100 fake user profiles with names, ages, and bios."
* **Golden Sets for RAG**: "Generate a question, the correct answer, and the key facts" for evaluation.

## `StructuredGenerator` class

This generator works differently from the conversation generator. Instead of running a simulation loop (User <-> Assistant), it performs a single-turn interaction: `Instruction + Context -> Structured Output`.

### Initialization

The strategy callbacks (for instruction generation and prompt modification) should be configured at initialization.

```python
import os

from afterimage import StructuredGenerator
from pydantic import BaseModel, Field

# 1. Define your Output Schema
class CustomerFeedback(BaseModel):
    sentiment: str = Field(..., description="Positive, Negative, or Neutral")
    topics: list[str] = Field(..., description="List of topics mentioned (e.g., Pricing, UI)")
    summary: str = Field(..., description="One sentence summary")

# 2. Initialize Generator with Strategies
# my_instruction_gen and my_prompt_modifier are placeholders for callback
# instances you have created elsewhere.
generator = StructuredGenerator(
    output_schema=CustomerFeedback,
    respondent_prompt="You are an expert data analyst. Extract insights from the feedback.",
    api_key=os.getenv("GEMINI_API_KEY"),
    # Strategies are passed here
    instruction_generator_callback=my_instruction_gen,
    respondent_prompt_modifier=my_prompt_modifier
)
```

**Key Parameters:**

* `output_schema` (Type[BaseModel]): The Pydantic model defining the expected output structure.
* `respondent_prompt` (str): System prompt for the generation model.
* `instruction_generator_callback` (BaseInstructionGeneratorCallback, optional): Strategy to generate the input/instruction for each sample.
* `respondent_prompt_modifier` (BaseRespondentPromptModifierCallback, optional): Strategy to modify the system prompt per sample.
* `correspondent_prompt` (str, optional): A static prompt for the "user" side, if not using a callback.
* `storage` (BaseStorage, optional): Where to save results. Defaults to `JSONLStorage`.

### Generating Data

Use the `generate` method to produce samples. It is a coroutine, so it must be awaited.

```python
# Call this from inside an async function (see the full example below).
await generator.generate(
    num_samples=50,
    max_concurrency=4,
)
```

**Parameters:**

* `num_samples` (int, optional): Total number of samples to generate.
* `max_concurrency` (int): Maximum number of concurrent generations.
* `stopping_criteria` (List[BaseStoppingCallback], optional): Custom logic for stopping generation. If `num_samples` is set, a `FixedNumberStoppingCallback` is automatically added.

## Example: Data Extraction from Documents

Here is how to use `StructuredGenerator` to process a list of "raw" reviews and extract structured data from them.

```python
import asyncio
import os
from pydantic import BaseModel, Field
from afterimage import (
    StructuredGenerator,
    ContextualInstructionGeneratorCallback,
    InMemoryDocumentProvider
)

# 1. Schema
class ReviewAnalysis(BaseModel):
    product_name: str
    rating: int = Field(..., description="1-5 stars")
    is_spam: bool

# 2. Raw Data (The "Context")
raw_reviews = InMemoryDocumentProvider([
    "I loved the SuperWidget! 5 stars best purchase ever.",
    "Click here for free money! www.spam.com",
    "It broke after one day. Terrible quality. 1 star.",
])

async def main():
    api_key = os.getenv("GEMINI_API_KEY")

    # 3. Setup Instruction Generator
    # This will feed the raw reviews one by one as context
    instruction_gen = ContextualInstructionGeneratorCallback(
        api_key=api_key,
        documents=raw_reviews,
        num_random_contexts=1,
        # Just ask to analyze the context
        prompt="Analyze the review provided in the context."
    )

    # 4. Initialize Generator
    generator = StructuredGenerator(
        output_schema=ReviewAnalysis,
        respondent_prompt="Analyze the provided review.",
        api_key=api_key,
        instruction_generator_callback=instruction_gen
    )

    # 5. Run Extraction
    print("Extracting data...")
    await generator.generate(num_samples=3)
    print("Done. Data saved to JSONL.")

if __name__ == "__main__":
    asyncio.run(main())
```

The output will be saved to a `.jsonl` file where each line is a valid JSON object matching your `ReviewAnalysis` schema.
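
Before feeding the saved file into a downstream pipeline, it can be worth reading it back and checking each record. The following is a minimal sketch using only the Python standard library. It assumes, as described above, that each line holds one bare JSON object with the `ReviewAnalysis` fields. The names `EXPECTED_FIELDS`, `check_record`, and `load_records` are illustrative helpers and not part of Afterimage, and the sample lines are made up for the demo rather than real generator output.

```python
import json
import tempfile
from pathlib import Path

# Field names and types mirror the ReviewAnalysis schema above.
EXPECTED_FIELDS = {"product_name": str, "rating": int, "is_spam": bool}


def check_record(record):
    """Return a list of problems found in one parsed JSONL record."""
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        is_bool_as_int = expected_type is int and isinstance(value, bool)
        if not isinstance(value, expected_type) or is_bool_as_int:
            problems.append(f"wrong type for {name}")
    # The schema describes rating as 1-5 stars.
    rating = record.get("rating")
    if isinstance(rating, int) and not isinstance(rating, bool):
        if not 1 <= rating <= 5:
            problems.append("rating outside 1-5")
    return problems


def load_records(path):
    """Split a JSONL file into valid records and (line number, problems) rejects."""
    valid, rejected = [], []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                rejected.append((line_number, ["invalid JSON"]))
                continue
            problems = check_record(record)
            if problems:
                rejected.append((line_number, problems))
            else:
                valid.append(record)
    return valid, rejected


# Demo on made-up sample lines written to a temporary file (hypothetical data).
sample_lines = [
    '{"product_name": "SuperWidget", "rating": 5, "is_spam": false}',
    '{"product_name": "Unknown", "rating": 9, "is_spam": true}',
    '{"product_name": "SuperWidget", "is_spam": false}',
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "reviews.jsonl"
    sample_path.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")
    valid, rejected = load_records(sample_path)

print(len(valid), "valid;", len(rejected), "rejected")
```

If Pydantic v2 is available, calling `ReviewAnalysis.model_validate_json(line)` on each line is likely a simpler way to get the same type checks, since it reuses the schema directly. The range check on `rating` would still need to be added to the model, because a `description` alone does not enforce it.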