Generating DPO/RLHF Preference Data

AfterImage can generate preference pairs (chosen/rejected responses) directly from your documents — no manual labeling needed.

Quick start

afterimage preference -c config.yaml

Add a preference section to your existing AfterImage config:

# config.yaml
model:
  provider: openai
  model_name: gpt-4o-mini

respondent:
  system_prompt: "You are a helpful assistant."

preference:
  num_pairs: 100
  num_responses: 2
  strategy: temperature
  min_score_gap: 0.1
  output_format: dpo
  output_path: ./preferences.jsonl

How it works

  1. AfterImage generates a user prompt (with persona + document context, same as regular generation)

  2. For the same prompt, it generates multiple responses with controlled variation

  3. Each response is scored by the built-in quality judge

  4. Highest and lowest scored responses become the (chosen, rejected) pair

  5. Pairs with score gap below the threshold are discarded
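Steps 3–5 can be sketched in a few lines. This is an illustrative sketch, not AfterImage's actual internals; the function name select_pair and the (text, score) tuple shape are assumptions:

```python
# Illustrative sketch (not AfterImage's real implementation) of how a list
# of scored responses becomes a (chosen, rejected) pair.

def select_pair(scored, min_score_gap=0.1):
    """scored: list of (response_text, score) tuples for one prompt.

    Returns (chosen, rejected) or None when the gap is below threshold.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    chosen, chosen_score = ranked[0]      # step 4: highest-scored response
    rejected, rejected_score = ranked[-1]  # step 4: lowest-scored response
    if chosen_score - rejected_score < min_score_gap:
        return None  # step 5: discard pairs with too small a score gap
    return chosen, rejected

pair = select_pair([("detailed answer", 0.82), ("terse answer", 0.55)])
```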

Variation strategies

Temperature (default)

Generates responses at linearly-spaced temperatures. Lower temperature → more focused → usually higher quality.

preference:
  strategy: temperature
  num_responses: 2   # 2 responses: low temp (0.1) and high temp (0.9)

Best for: quick setup, works with any model including local (vLLM, Ollama).
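The linear spacing works out as below; a minimal sketch assuming the 0.1–0.9 range shown in the config comment (the function name is illustrative, not part of the AfterImage API):

```python
# Linearly-spaced temperatures between a low and high bound, one per response.
def spaced_temperatures(num_responses, low=0.1, high=0.9):
    if num_responses == 1:
        return [low]
    step = (high - low) / (num_responses - 1)
    return [round(low + i * step, 2) for i in range(num_responses)]
```

With num_responses=2 this yields the two endpoints; with 3, an evenly spaced middle value is added.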

Prompt variation

Modifies the system prompt: enhanced (“think step by step…”) vs degraded (“answer briefly…”).

preference:
  strategy: prompt
  num_responses: 2

Best for: teaching instruction-following quality differences.
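The two variants contrast roughly as follows. The exact wording AfterImage uses is not documented here; these strings are illustrative only:

```python
# Illustrative enhanced/degraded system-prompt pair -- AfterImage's actual
# prompt modifications may differ.
BASE_PROMPT = "You are a helpful assistant."

PROMPT_VARIANTS = {
    "enhanced": BASE_PROMPT + " Think step by step and ground your answer in the document.",
    "degraded": BASE_PROMPT + " Answer briefly.",
}
```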

Model variation

Uses different models for each response — one primary, one secondary.

preference:
  strategy: model
  secondary_model: gpt-4o     # stronger model for chosen responses
  num_responses: 2

Best for: largest quality spread, most realistic preference data.

Combined

Mixes temperature + prompt + model strategies for maximum diversity.

preference:
  strategy: combined
  secondary_model: gpt-4o   # optional
  num_responses: 3

Best for: production datasets, reward model training.
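One way to picture the mixing is a round-robin over the three axes. This is a hypothetical sketch of the idea only; AfterImage's real assignment logic is not shown in this document:

```python
# Hypothetical: assign one variation axis per response, cycling through the
# three strategies for diversity. Not AfterImage's actual implementation.
import itertools

AXES = ["temperature", "prompt", "model"]

def assign_axes(num_responses):
    cycle = itertools.cycle(AXES)
    return [next(cycle) for _ in range(num_responses)]
```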

Output formats

Format         Description                        Used by
-------------  ---------------------------------  ----------------------------
dpo            Standard prompt/chosen/rejected    TRL DPOTrainer
chat_dpo       Message lists with chat template   TRL with chat template
ultrafeedback  All responses with scores          UltraFeedback-style training
anthropic_hh   Human:/Assistant: format           Anthropic HH training
orpo           DPO schema + scores                ORPO training

afterimage preference -c config.yaml --format chat_dpo
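In the dpo format, each JSONL line follows the prompt/chosen/rejected schema that TRL's DPOTrainer expects. The field values below are illustrative:

```python
import json

# One illustrative line from preferences.jsonl in the dpo format.
record = {
    "prompt": "Summarize the attached policy document in two sentences.",
    "chosen": "The policy covers returns, refunds, and escalation paths ...",
    "rejected": "It's a policy about stuff.",
}
line = json.dumps(record)  # one line per pair in the output file
```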

Multi-turn preferences

Generate a shared conversation history and branch at the final turn:

preference:
  multi_turn: true
  num_pairs: 50

This produces pairs where shared_prefix contains identical conversation history and chosen/rejected differ only in the final assistant response.
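A multi-turn pair might look like the record below. The shared_prefix field name comes from the description above; the message contents are illustrative:

```python
# Illustrative multi-turn pair: shared_prefix holds the identical history,
# and chosen/rejected differ only in the final assistant message.
pair = {
    "shared_prefix": [
        {"role": "user", "content": "What does the document say about refunds?"},
        {"role": "assistant", "content": "Refunds are accepted within 30 days."},
        {"role": "user", "content": "Are there any exceptions?"},
    ],
    "chosen": {"role": "assistant", "content": "Yes: digital goods are excluded ..."},
    "rejected": {"role": "assistant", "content": "Maybe."},
}
```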

Full generation log

Save all scored responses (not just chosen/rejected) for analysis:

afterimage preference -c config.yaml --save-log

Or in config:

preference:
  save_log: true
  log_path: ./preferences_full_log.jsonl  # optional, defaults to output path + _log
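The log is useful for checking how close your responses score before pairs are discarded. A sketch of that analysis, assuming each log entry carries a responses list with per-response score fields (schema assumed, shown on in-memory records rather than the real log file):

```python
# Summarize score gaps across log entries to see how many pairs would be
# discarded at a given min_score_gap. The entry schema here is an assumption.
def score_gap(entry):
    scores = [r["score"] for r in entry["responses"]]
    return max(scores) - min(scores)

log = [
    {"responses": [{"score": 0.9}, {"score": 0.4}]},
    {"responses": [{"score": 0.62}, {"score": 0.58}]},
]
gaps = [score_gap(e) for e in log]
discarded = sum(g < 0.1 for g in gaps)  # entries below the gap threshold
```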

Using with training tools

TRL DPOTrainer

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

ds = load_dataset("json", data_files="preferences_dpo.jsonl", split="train")
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    train_dataset=ds,
    args=DPOConfig(output_dir="./dpo_output"),
)
trainer.train()

TRL with chat template (chat_dpo format)

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

ds = load_dataset("json", data_files="preferences_chat_dpo.jsonl", split="train")
# chat_dpo format already has message lists — works directly with DPOTrainer
trainer = DPOTrainer(model=model, train_dataset=ds, args=DPOConfig(output_dir="./dpo_output"))
trainer.train()

Unsloth + DPO

Export as chat_dpo format and use Unsloth’s DPO notebook directly.

Python API

Set OPENAI_API_KEY in the environment (or switch the provider settings to Gemini and use GEMINI_API_KEY throughout).

import asyncio
import os

from afterimage import ConversationGenerator
from afterimage.callbacks import ContextualInstructionGeneratorCallback
from afterimage.evaluator import ConversationJudge
from afterimage.key_management import SmartKeyPool
from afterimage.preference import PreferenceConfig
from afterimage.providers import InMemoryDocumentProvider, LLMFactory


async def main():
    pool = SmartKeyPool.from_single_key(os.environ["OPENAI_API_KEY"])
    llm = LLMFactory.create(provider="openai", model_name="gpt-4o-mini", api_key=pool)
    judge = ConversationJudge.from_factory(
        llm,
        key_pool=pool,
        model_provider_name="openai",
    )

    docs = InMemoryDocumentProvider(["Short context document text."])
    instruction_callback = ContextualInstructionGeneratorCallback(
        api_key=pool,
        documents=docs,
        model_provider_name="openai",
        model_name="gpt-4o-mini",
    )

    gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=pool,
        model_provider_name="openai",
        instruction_generator_callback=instruction_callback,
    )

    pref_gen = gen.to_preference_generator(
        judge=judge,
        config=PreferenceConfig(
            num_pairs=50,
            strategy="temperature",
            output_format="dpo",
            output_path="./preferences.jsonl",
        ),
    )

    pairs, analytics = await pref_gen.generate()
    pref_gen.save_pairs(pairs, analytics)
    print(f"Generated {len(pairs)} pairs, discard rate: {analytics.discard_rate:.0%}")
    await judge.aclose()


asyncio.run(main())

CLI reference

afterimage preference [OPTIONS]

Options:
  -c, --config PATH       Path to YAML config file.  [required]
  --dry-run               Print plan without generating.
  --num-pairs INTEGER     Override preference.num_pairs from config.
  --format [dpo|chat_dpo|ultrafeedback|anthropic_hh|orpo]
                          Override output format.
  -o, --output PATH       Override output file path.
  --save-log              Save full generation log with all scored responses.
  --help                  Show this message and exit.

Tips

  • Start small: use num_responses=2 to minimize API cost

  • Production datasets: use strategy=combined for maximum response diversity

  • High discard rate (>40%): lower min_score_gap or switch to a stronger strategy

  • One variation always wins: switch strategy — temperature may not create enough variation for your model

  • Local models: temperature strategy works with vLLM, Ollama, and llama.cpp — temperature is passed through to the provider

  • Analytics: check the stats printed after generation; warnings highlight degenerate patterns