# Generating DPO/RLHF Preference Data
AfterImage can generate preference pairs (chosen/rejected responses) directly from your documents — no manual labeling needed.
## Quick start

```bash
afterimage preference -c config.yaml
```
Add a `preference` section to your existing AfterImage config:

```yaml
# config.yaml
model:
  provider: openai
  model_name: gpt-4o-mini

respondent:
  system_prompt: "You are a helpful assistant."

preference:
  num_pairs: 100
  num_responses: 2
  strategy: temperature
  min_score_gap: 0.1
  output_format: dpo
  output_path: ./preferences.jsonl
```
## How it works

1. AfterImage generates a user prompt (with persona + document context, same as regular generation).
2. For the same prompt, it generates multiple responses with controlled variation.
3. Each response is scored by the built-in quality judge.
4. The highest- and lowest-scored responses become the (chosen, rejected) pair.
5. Pairs with a score gap below `min_score_gap` are discarded.
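The scoring and selection steps above can be sketched as follows. This is a minimal illustration, not AfterImage's internals: `select_pair` and its signature are hypothetical, and `scores` stands in for the built-in quality judge's output.

```python
def select_pair(responses, scores, min_score_gap=0.1):
    """Return (chosen, rejected) from scored responses, or None if the
    score gap falls below the threshold and the pair is discarded."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0])
    low_score, rejected = ranked[0]    # lowest-scored response
    high_score, chosen = ranked[-1]    # highest-scored response
    if high_score - low_score < min_score_gap:
        return None  # gap too small: pair discarded
    return chosen, rejected

# A clear gap yields a pair; a narrow gap yields None.
print(select_pair(["ok answer", "detailed answer"], [0.55, 0.82]))
print(select_pair(["a", "b"], [0.50, 0.55]))
```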
## Variation strategies

### Temperature (default)

Generates responses at linearly spaced temperatures. Lower temperature → more focused → usually higher quality.

```yaml
preference:
  strategy: temperature
  num_responses: 2  # 2 responses: low temp (0.1) and high temp (0.9)
```

Best for: quick setup; works with any model, including local ones (vLLM, Ollama).
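One way linearly spaced temperatures could be derived for `num_responses` samples is sketched below. The 0.1/0.9 endpoints come from the comment in the config above; the actual spacing logic is an AfterImage internal and may differ.

```python
def spaced_temperatures(num_responses, low=0.1, high=0.9):
    """Return num_responses temperatures evenly spaced in [low, high]."""
    if num_responses == 1:
        return [low]
    step = (high - low) / (num_responses - 1)
    return [round(low + i * step, 3) for i in range(num_responses)]

print(spaced_temperatures(2))  # [0.1, 0.9]
print(spaced_temperatures(3))  # [0.1, 0.5, 0.9]
```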
### Prompt variation

Modifies the system prompt: enhanced ("think step by step…") vs. degraded ("answer briefly…").

```yaml
preference:
  strategy: prompt
  num_responses: 2
```

Best for: teaching instruction-following quality differences.
### Model variation

Uses different models for each response: one primary, one secondary.

```yaml
preference:
  strategy: model
  secondary_model: gpt-4o  # stronger model for chosen responses
  num_responses: 2
```

Best for: the largest quality spread and the most realistic preference data.
### Combined

Mixes the temperature, prompt, and model strategies for maximum diversity.

```yaml
preference:
  strategy: combined
  secondary_model: gpt-4o  # optional
  num_responses: 3
```

Best for: production datasets, reward model training.
## Output formats

| Format | Description | Used by |
|---|---|---|
| `dpo` | Standard prompt/chosen/rejected | TRL `DPOTrainer` |
| `chat_dpo` | Message lists with chat template | TRL with chat template |
| `ultrafeedback` | All responses with scores | UltraFeedback-style training |
| `anthropic_hh` | Human:/Assistant: format | Anthropic HH training |
| `orpo` | DPO schema + scores | ORPO training |

Select a format on the command line:

```bash
afterimage preference -c config.yaml --format chat_dpo
```
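For orientation, a hypothetical record in the standard prompt/chosen/rejected shape described in the table above might look like the following. The field contents are invented, and the exact keys AfterImage emits may differ:

```python
import json

# Illustrative preference record (prompt/chosen/rejected fields as in
# the table above); preferences.jsonl holds one JSON object per line.
record = {
    "prompt": "Summarize the key points of the attached document.",
    "chosen": "The document covers three main points: ...",
    "rejected": "It talks about stuff.",
}
line = json.dumps(record)
print(line)
```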
## Multi-turn preferences

Generate a shared conversation history and branch at the final turn:

```yaml
preference:
  multi_turn: true
  num_pairs: 50
```

This produces pairs where `shared_prefix` contains the identical conversation history and `chosen`/`rejected` differ only in the final assistant response.
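A sketch of that multi-turn pair shape, with invented contents: `shared_prefix` holds the identical history ending on a user turn, and only the final assistant response differs between `chosen` and `rejected`.

```python
# Hypothetical multi-turn preference pair; field names beyond
# shared_prefix/chosen/rejected are illustrative.
pair = {
    "shared_prefix": [
        {"role": "user", "content": "What does the onboarding doc say about VPN access?"},
        {"role": "assistant", "content": "VPN access is requested through the IT portal."},
        {"role": "user", "content": "How long does approval take?"},
    ],
    "chosen": {"role": "assistant", "content": "Approval typically takes 1-2 business days."},
    "rejected": {"role": "assistant", "content": "No idea."},
}

# The branch point is always the final user turn in the shared prefix.
print(pair["shared_prefix"][-1]["role"])
```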
## Full generation log

Save all scored responses (not just chosen/rejected) for analysis:

```bash
afterimage preference -c config.yaml --save-log
```

Or in the config:

```yaml
preference:
  save_log: true
  log_path: ./preferences_full_log.jsonl  # optional; defaults to the output path + _log
```
## Using with training tools

### TRL DPOTrainer

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

ds = load_dataset("json", data_files="preferences_dpo.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    train_dataset=ds,
    args=DPOConfig(output_dir="./dpo_output"),
)
trainer.train()
```
### TRL with chat template (chat_dpo format)

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

ds = load_dataset("json", data_files="preferences_chat_dpo.jsonl", split="train")

# The chat_dpo format already contains message lists, so it works
# directly with DPOTrainer.
trainer = DPOTrainer(model=model, train_dataset=ds, args=DPOConfig(...))
```
### Unsloth + DPO

Export in the chat_dpo format and use Unsloth's DPO notebook directly.
## Python API

Set `OPENAI_API_KEY` in the environment (or switch the provider block to Gemini and `GEMINI_API_KEY` consistently).

```python
import asyncio
import os

from afterimage import ConversationGenerator
from afterimage.callbacks import ContextualInstructionGeneratorCallback
from afterimage.evaluator import ConversationJudge
from afterimage.key_management import SmartKeyPool
from afterimage.preference import PreferenceConfig
from afterimage.providers import InMemoryDocumentProvider, LLMFactory


async def main():
    pool = SmartKeyPool.from_single_key(os.environ["OPENAI_API_KEY"])
    llm = LLMFactory.create(provider="openai", model_name="gpt-4o-mini", api_key=pool)

    judge = ConversationJudge.from_factory(
        llm,
        key_pool=pool,
        model_provider_name="openai",
    )

    docs = InMemoryDocumentProvider(["Short context document text."])
    instruction_callback = ContextualInstructionGeneratorCallback(
        api_key=pool,
        documents=docs,
        model_provider_name="openai",
        model_name="gpt-4o-mini",
    )

    gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=pool,
        model_provider_name="openai",
        instruction_generator_callback=instruction_callback,
    )

    pref_gen = gen.to_preference_generator(
        judge=judge,
        config=PreferenceConfig(
            num_pairs=50,
            strategy="temperature",
            output_format="dpo",
            output_path="./preferences.jsonl",
        ),
    )

    pairs, analytics = await pref_gen.generate()
    pref_gen.save_pairs(pairs, analytics)
    print(f"Generated {len(pairs)} pairs, discard rate: {analytics.discard_rate:.0%}")

    await judge.aclose()


asyncio.run(main())
```
## CLI reference

```text
afterimage preference [OPTIONS]

Options:
  -c, --config PATH    Path to YAML config file.  [required]
  --dry-run            Print plan without generating.
  --num-pairs INTEGER  Override preference.num_pairs from config.
  --format [dpo|chat_dpo|ultrafeedback|anthropic_hh|orpo]
                       Override output format.
  -o, --output PATH    Override output file path.
  --save-log           Save full generation log with all scored responses.
  --help               Show this message and exit.
```
## Tips

- Start small: use `num_responses: 2` to minimize API cost.
- Best dataset: use `strategy: combined` for production datasets.
- High discard rate (>40%): lower `min_score_gap` or switch to a stronger strategy.
- One variation always wins: switch strategy; temperature may not create enough variation for your model.
- Local models: the temperature strategy works with vLLM, Ollama, and llama.cpp, since temperature is passed through to the provider.
- Analytics: check the stats printed after generation; warnings highlight degenerate patterns.