Structured Generation
Sometimes you don’t want a conversation. You want to extract specific information from a document or generate synthetic data that fits a strict schema (like a database row). Afterimage supports Structured Generation using StructuredGenerator and Pydantic models.
Concept
Structured generation forces the LLM to output valid JSON that matches a schema you define. This is useful for:
Data Extraction: “Read this email and extract the sender, date, and sentiment.”
Synthetic Database Rows: “Generate 100 fake user profiles with names, ages, and bios.”
Golden Sets for RAG: “Generate a question, the correct answer, and the key facts” for evaluation.
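To make "valid JSON that matches a schema" concrete, here is a minimal sketch using Pydantic on its own (assuming Pydantic v2). The EmailInfo model and the two response strings are hypothetical stand-ins for the email-extraction use case; how Afterimage hands the schema to the model internally is not shown here.

```python
from pydantic import BaseModel, Field, ValidationError


class EmailInfo(BaseModel):
    """Hypothetical schema for the email-extraction use case."""

    sender: str = Field(..., description="Email address of the sender")
    date: str = Field(..., description="Date the email was sent")
    sentiment: str = Field(..., description="Positive, Negative, or Neutral")


# JSON Schema describing the structure a response has to follow.
schema = EmailInfo.model_json_schema()

# Stand-ins for model responses; real ones would come from the LLM.
good_response = '{"sender": "ana@example.com", "date": "2024-05-01", "sentiment": "Positive"}'
bad_response = '{"sender": "ana@example.com"}'  # missing two required fields

# A conforming response parses into a typed object.
parsed = EmailInfo.model_validate_json(good_response)

# A non-conforming response raises ValidationError instead of passing through.
try:
    EmailInfo.model_validate_json(bad_response)
    bad_is_valid = True
except ValidationError:
    bad_is_valid = False
```

The point of the schema is the second case: a response that omits required fields is rejected rather than silently accepted as a partial record.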
StructuredGenerator class
This generator works differently from the conversation generator. Instead of running a multi-turn simulation loop (User <-> Assistant), it performs a single-turn interaction: Instruction + Context -> Structured Output.
Initialization
The strategy callbacks (for instructions and prompt modification) should be configured at initialization.
import os

from afterimage import StructuredGenerator
from pydantic import BaseModel, Field

# 1. Define your Output Schema
class CustomerFeedback(BaseModel):
    sentiment: str = Field(..., description="Positive, Negative, or Neutral")
    topics: list[str] = Field(..., description="List of topics mentioned (e.g., Pricing, UI)")
    summary: str = Field(..., description="One sentence summary")

# 2. Initialize Generator with Strategies
# my_instruction_gen and my_prompt_modifier are placeholders for callback
# instances that you create beforehand.
generator = StructuredGenerator(
    output_schema=CustomerFeedback,
    respondent_prompt="You are an expert data analyst. Extract insights from the feedback.",
    api_key=os.getenv("GEMINI_API_KEY"),
    # Strategies are passed here
    instruction_generator_callback=my_instruction_gen,
    respondent_prompt_modifier=my_prompt_modifier,
)
Key Parameters:
output_schema (Type[BaseModel]): The Pydantic model defining the expected output structure.
respondent_prompt (str): System prompt for the generation model.
instruction_generator_callback (BaseInstructionGeneratorCallback, optional): Strategy to generate the input/instruction for each sample.
respondent_prompt_modifier (BaseRespondentPromptModifierCallback, optional): Strategy to modify the system prompt per sample.
correspondent_prompt (str, optional): A static prompt for the “user” side, if not using a callback.
storage (BaseStorage, optional): Where to save results. Defaults to JSONLStorage.
Generating Data
Use the generate method to produce samples.
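# generate is a coroutine: call it from inside an async function
# (for example, one started with asyncio.run).
await generator.generate(
    num_samples=50,
    max_concurrency=4,
)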
Parameters:
num_samples (int, optional): Total number of samples to generate.
max_concurrency (int): Maximum concurrent generations.
stopping_criteria (List[BaseStoppingCallback], optional): Custom logic for stopping generation. If num_samples is set, a FixedNumberStoppingCallback is automatically added.
Example: Data Extraction from Documents
Here is how to use StructuredGenerator to process a list of “raw” reviews and extract structured data from them.
import asyncio
import os

from pydantic import BaseModel, Field
from afterimage import (
    StructuredGenerator,
    ContextualInstructionGeneratorCallback,
    InMemoryDocumentProvider,
)

# 1. Schema
class ReviewAnalysis(BaseModel):
    product_name: str
    rating: int = Field(..., description="1-5 stars")
    is_spam: bool

# 2. Raw Data (The "Context")
raw_reviews = InMemoryDocumentProvider([
    "I loved the SuperWidget! 5 stars best purchase ever.",
    "Click here for free money! www.spam.com",
    "It broke after one day. Terrible quality. 1 star.",
])

async def main():
    api_key = os.getenv("GEMINI_API_KEY")

    # 3. Setup Instruction Generator
    # This will feed the raw reviews one by one as context
    instruction_gen = ContextualInstructionGeneratorCallback(
        api_key=api_key,
        documents=raw_reviews,
        num_random_contexts=1,
        # Just ask to analyze the context
        prompt="Analyze the review provided in the context.",
    )

    # 4. Initialize Generator
    generator = StructuredGenerator(
        output_schema=ReviewAnalysis,
        respondent_prompt="Analyze the provided review.",
        api_key=api_key,
        instruction_generator_callback=instruction_gen,
    )

    # 5. Run Extraction
    print("Extracting data...")
    await generator.generate(num_samples=3)
    print("Done. Data saved to JSONL.")

if __name__ == "__main__":
    asyncio.run(main())
The output will be saved to a .jsonl file where each line is a valid JSON object matching your ReviewAnalysis schema.
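Reading the results back could look roughly like the sketch below, which uses only the standard library. The file name reviews.jsonl, the load_jsonl helper, and the sample records are assumptions made so the snippet runs on its own; the actual path and the exact record layout written by the storage backend are not specified above, and records may carry extra metadata around the schema fields.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file and return one dict per non-empty line."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# Stand-in file so the sketch is runnable; in practice, point load_jsonl
# at the file your storage backend produced.
sample_path = Path(tempfile.mkdtemp()) / "reviews.jsonl"
sample_path.write_text(
    '{"product_name": "SuperWidget", "rating": 5, "is_spam": false}\n'
    '{"product_name": "Unknown", "rating": 1, "is_spam": true}\n',
    encoding="utf-8",
)

rows = load_jsonl(sample_path)

# Example post-processing step: drop the records flagged as spam.
genuine = [row for row in rows if not row["is_spam"]]
```

If the records match the schema directly, each dict can also be passed to ReviewAnalysis.model_validate (Pydantic v2) to get typed objects back.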