Generators

class afterimage.ConversationGenerator(respondent_prompt: str, api_key: str | SmartKeyPool, correspondent_prompt: str | None = None, model_name: str | None = None, safety_settings: List[Dict[str, str]] | None = None, auto_improve: bool = False, evaluator_model_name: str | None = None, model_provider_name: Literal['gemini', 'openai', 'deepseek'] = 'gemini', embedding_provider: EmbeddingProvider | None = None, embedding_provider_config: dict[str, Any] | None = None, judge_config: ConversationJudgeConfig | None = None, storage: BaseStorage | None = None, monitor: GenerationMonitor | None = None, instruction_generator_callback: BaseInstructionGeneratorCallback | None = None, respondent_prompt_modifier: BaseRespondentPromptModifierCallback | None = None)[source]

Bases: BaseGenerator

Generates conversations between a correspondent (question generator) and a respondent (answer generator) asynchronously.

Parameters:
  • respondent_prompt – System prompt for the respondent, e.g., the assistant that you want to fine-tune on this dataset

  • api_key – Either a single API key string or a SmartKeyPool instance for LLM use

  • correspondent_prompt – System prompt for the correspondent, e.g., a model that roleplays a user of the assistant that you want to fine-tune on this dataset

  • model_name – Model name to use

  • safety_settings – Safety settings for the model

  • auto_improve – Whether to try to improve low-quality generations

  • evaluator_model_name – Model name for the evaluator LLM when auto_improve is True.

  • embedding_provider – Optional shared EmbeddingProvider for embedding metrics.

  • embedding_provider_config – JSON-style config for EmbeddingProviderFactory, used when embedding_provider is omitted; defaults are chosen based on the chat provider.

  • judge_config – Optional ConversationJudgeConfig (aggregation and grade thresholds).

  • model_provider_name – Provider used for accessing LLMs. Supported values are “gemini”, “openai”, and “deepseek”.

  • storage – Storage implementation for saving conversations. If None, creates JSONLStorage with datetime-based filename.

  • monitor – GenerationMonitor instance for tracking generation metrics. If None, a default one is created.

  • instruction_generator_callback – Callback for instruction generation. Can also be passed to generate() method (deprecated).

  • respondent_prompt_modifier – Callback to modify respondent prompts. Can also be passed to generate() method (deprecated).
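
For orientation, a minimal construction sketch follows. The prompt text and key are placeholders rather than values from this library, and the commented lines require afterimage to be installed with a valid API key:

```python
# Placeholder configuration for a ConversationGenerator (sketch only).
generator_kwargs = dict(
    respondent_prompt="You are a concise cooking assistant.",
    api_key="YOUR_API_KEY",        # or a SmartKeyPool instance
    model_provider_name="gemini",  # "gemini", "openai", or "deepseek"
    auto_improve=True,             # attempt to improve low-quality generations
)
# from afterimage import ConversationGenerator
# generator = ConversationGenerator(**generator_kwargs)
```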

async answer(respondent: ChatSession, question: str | ConversationEntry) ConversationEntry[source]

Generates an answer from the respondent based on the given question.

async ask(correspondent: ChatSession, answer: str | ConversationEntry) str[source]

Generates a question from the correspondent based on the given answer.

async create_correspondent_prompt(assistant_prompt: str) str[source]

Creates a correspondent prompt based on the assistant prompt.

async create_model(prompt: str) ChatSession[source]

Creates and initializes a chat model with the given prompt.

async generate(num_dialogs: int | None = None, max_turns: int = 1, stopping_criteria: List[BaseStoppingCallback] | None = None, instruction_generator_callback: BaseInstructionGeneratorCallback | None = None, respondent_prompt_modifier: BaseRespondentPromptModifierCallback | None = None, max_concurrency: int | None = None) None[source]

Generates multiple conversation dialogs until the stopping criteria are met.

Parameters:
  • num_dialogs – Number of dialogs to generate. Defaults to 5 if no other stopping criterion is specified.

  • max_turns – Maximum number of turns per dialog. The actual number of turns is randomly sampled from 1 to max_turns.

  • stopping_criteria – A list of callbacks that determine when to stop generation. If num_dialogs is specified, a FixedNumberStoppingCallback is added to this list automatically.

  • instruction_generator_callback – Callback for instruction generation. Deprecated: Pass this to the constructor instead. Defaults to None.

  • respondent_prompt_modifier – Callback to modify respondent prompts. Deprecated: Pass this to the constructor instead. Defaults to None.

  • max_concurrency – Number of concurrent generations. Defaults to 8 for DeepSeek and 4 for other providers.
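
The turn-sampling and concurrency defaults described above can be sketched as follows (a hypothetical illustration of the documented behaviour, not the library's actual implementation):

```python
import random

def sample_turns(max_turns: int) -> int:
    # The actual number of turns is drawn from 1..max_turns (inclusive).
    return random.randint(1, max_turns)

def default_concurrency(provider: str) -> int:
    # Per the parameter description: 8 for DeepSeek, 4 for other providers.
    return 8 if provider == "deepseek" else 4
```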

async generate_single(max_turns: int, check_for_near_duplicates: bool = False, instruction_generator_callback: BaseInstructionGeneratorCallback | None = None, respondent_prompt_modifier: BaseRespondentPromptModifierCallback | None = None) AsyncGenerator[EvaluatedConversationWithContext | Conversation, None][source]

Generates conversations for a single session and yields them.

async go(turns: int = 1, first_question: str | None = None, check_for_near_duplicates: bool = False, correspondent_prompt: str | None = None, respondent_prompt: str | None = None) List[ConversationEntry][source]

Simulates a multi-turn conversation between the correspondent and the respondent.
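
The alternation that go() performs between ask() and answer() can be illustrated with a simplified stand-in. This is a hypothetical sketch: the real methods take ChatSession objects and call an LLM, whereas the stubs below return fixed strings:

```python
import asyncio

# Hypothetical sketch of a question/answer loop; not the library's code.
async def dialog_sketch(ask, answer, turns: int) -> list[tuple[str, str]]:
    history = []
    last_answer = ""
    for _ in range(turns):
        question = await ask(last_answer)   # correspondent asks
        last_answer = await answer(question)  # respondent answers
        history.append((question, last_answer))
    return history

# Stub correspondent/respondent for illustration only:
async def stub_ask(prev_answer):
    return "Q"

async def stub_answer(question):
    return "A"
```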

class afterimage.StructuredGenerator(output_schema: Type[T], respondent_prompt: str, api_key: str | SmartKeyPool, model_name: str | None = None, safety_settings: List[Dict[str, str]] | None = None, model_provider_name: Literal['gemini', 'openai', 'deepseek'] = 'gemini', storage: BaseStorage | None = None, monitor: GenerationMonitor | None = None, instruction_generator_callback: BaseInstructionGeneratorCallback | None = None, respondent_prompt_modifier: BaseRespondentPromptModifierCallback | None = None, correspondent_prompt: str | None = None)[source]

Bases: BaseGenerator

Generates structured datasets where outputs strictly conform to a Pydantic schema.

async create_correspondent_prompt(respondent_prompt: str) str[source]
async generate(num_samples: int | None = None, stopping_criteria: list[BaseStoppingCallback] | None = None, instruction_generator_callback: BaseInstructionGeneratorCallback | None = None, respondent_prompt_modifier: BaseRespondentPromptModifierCallback | None = None, max_concurrency: int | None = None) None[source]

Generates structured samples and saves them to storage.

Parameters:
  • num_samples – Total number of samples to generate. Defaults to 5 if no other stopping criterion is specified.

  • stopping_criteria – A list of callbacks that determine when to stop generation. If num_samples is specified, a FixedNumberStoppingCallback is added to this list automatically.

  • instruction_generator_callback – Callback for instruction generation. Deprecated: Pass this to the constructor instead. Defaults to None.

  • respondent_prompt_modifier – Callback to modify respondent prompts. Deprecated: Pass this to the constructor instead. Defaults to None.

  • max_concurrency – Maximum number of concurrent tasks. Defaults to 8 for DeepSeek and 4 for other providers.
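
As an illustration, a StructuredGenerator's outputs conform to a Pydantic schema such as the one below. Recipe is a hypothetical example schema, and the commented construction assumes afterimage is installed with a valid API key:

```python
from pydantic import BaseModel

# Hypothetical example schema; generated outputs would conform to it.
class Recipe(BaseModel):
    title: str
    ingredients: list[str]
    steps: list[str]

# generator = StructuredGenerator(
#     output_schema=Recipe,
#     respondent_prompt="Generate diverse home-cooking recipes.",
#     api_key="YOUR_API_KEY",
# )
# await generator.generate(num_samples=20)

# Each generated row's payload validates against the schema, e.g.:
sample = Recipe(
    title="Pancakes",
    ingredients=["flour", "milk", "egg"],
    steps=["Mix the batter", "Fry on both sides"],
)
```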

async generate_single(instruction_generator_callback: BaseInstructionGeneratorCallback | None = None, respondent_prompt_modifier: BaseRespondentPromptModifierCallback | None = None) AsyncGenerator[StructuredGenerationRow[T], None][source]

Generates structured outputs for a single batch of instructions.

class afterimage.PersonaGenerator(api_key: str | SmartKeyPool, model_name: str | None = None, safety_settings: list[dict[str, str]] | None = None, model_provider_name: Literal['gemini', 'openai', 'deepseek'] = 'gemini', storage: BaseStorage | None = None, monitor: GenerationMonitor | None = None, max_concurrency: int | None = None)[source]

Bases: object

async agenerate_from_persona(persona: str, generation: int = 1) list[str][source]
async agenerate_from_text(text: str) list[str][source]
expected_persona_count(n_iterations: int) int[source]
async generate_from_documents(documents: DocumentProvider | list[str], max_docs: int | None = None, n_iterations: int | None = None, target_data_count: int | None = None, num_random_contexts: int = 1)[source]
generate_from_persona(persona: str, generation: int = 1) list[str][source]
generate_from_text(text: str) list[str][source]
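
A usage sketch for PersonaGenerator follows. The seed texts are hypothetical placeholders, and the commented calls require afterimage to be installed with a valid API key:

```python
# Placeholder seed documents for illustration only.
seed_documents = [
    "A blog post about urban beekeeping.",
    "Release notes for a command-line backup tool.",
]

# from afterimage import PersonaGenerator
# gen = PersonaGenerator(api_key="YOUR_API_KEY", model_provider_name="openai")
#
# personas = gen.generate_from_text(seed_documents[0])  # sync variant
# expanded = gen.generate_from_persona(personas[0], generation=2)
#
# generate_from_documents also accepts a plain list of strings:
# await gen.generate_from_documents(seed_documents, target_data_count=100)
```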