# Generating DPO/RLHF Preference Data

AfterImage can generate preference pairs (chosen/rejected responses) directly from your documents, with no manual labeling needed.

## Quick start

```bash
afterimage preference -c config.yaml
```

Add a `preference` section to your existing AfterImage config:

```yaml
# config.yaml
model:
  provider: openai
  model_name: gpt-4o-mini

respondent:
  system_prompt: "You are a helpful assistant."

preference:
  num_pairs: 100
  num_responses: 2
  strategy: temperature
  min_score_gap: 0.1
  output_format: dpo
  output_path: ./preferences.jsonl
```

## How it works

1. AfterImage generates a user prompt (with persona + document context, same as regular generation).
2. For the same prompt, it generates multiple responses with **controlled variation**.
3. Each response is scored by the built-in quality judge.
4. The highest- and lowest-scored responses become the (chosen, rejected) pair.
5. Pairs whose score gap falls below the `min_score_gap` threshold are discarded.

## Variation strategies

### Temperature (default)

Generates responses at linearly spaced temperatures. Lower temperatures produce more focused responses, which usually score higher.

```yaml
preference:
  strategy: temperature
  num_responses: 2  # 2 responses: low temp (0.1) and high temp (0.9)
```

Best for: quick setup; works with any model, including local ones (vLLM, Ollama).

### Prompt variation

Modifies the system prompt: **enhanced** ("think step by step...") vs. **degraded** ("answer briefly...").

```yaml
preference:
  strategy: prompt
  num_responses: 2
```

Best for: teaching instruction-following quality differences.

### Model variation

Uses a different model for each response: one primary, one secondary.

```yaml
preference:
  strategy: model
  secondary_model: gpt-4o  # stronger model for chosen responses
  num_responses: 2
```

Best for: the largest quality spread and the most realistic preference data.

### Combined

Mixes the temperature, prompt, and model strategies for maximum diversity.
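For intuition, the mixing can be pictured as sampling a distinct (temperature, prompt variant, model) combination for each response. This is an illustrative sketch with made-up axis values and a hypothetical helper name, not AfterImage's actual internals:

```python
import itertools
import random

# Illustrative variation axes only; the real values live inside AfterImage.
TEMPERATURES = [0.1, 0.9]                   # temperature strategy axis
PROMPT_VARIANTS = ["enhanced", "degraded"]  # prompt strategy axis
MODELS = ["primary", "secondary"]           # model strategy axis


def combined_variation_specs(num_responses: int, seed: int = 0):
    """Sample one distinct (temperature, prompt, model) combo per response."""
    grid = list(itertools.product(TEMPERATURES, PROMPT_VARIANTS, MODELS))
    return random.Random(seed).sample(grid, num_responses)


specs = combined_variation_specs(3)  # three distinct combinations
```

Because each response draws from a different point in this grid, the judge sees more varied quality than any single axis produces on its own.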
```yaml
preference:
  strategy: combined
  secondary_model: gpt-4o  # optional
  num_responses: 3
```

Best for: production datasets and reward model training.

## Output formats

| Format | Description | Used by |
|--------|-------------|---------|
| `dpo` | Standard prompt/chosen/rejected | TRL DPOTrainer |
| `chat_dpo` | Message lists with chat template | TRL with chat template |
| `ultrafeedback` | All responses with scores | UltraFeedback-style training |
| `anthropic_hh` | Human:/Assistant: format | Anthropic HH training |
| `orpo` | DPO schema + scores | ORPO training |

```bash
afterimage preference -c config.yaml --format chat_dpo
```

## Multi-turn preferences

Generate a shared conversation history and branch at the final turn:

```yaml
preference:
  multi_turn: true
  num_pairs: 50
```

This produces pairs where `shared_prefix` contains identical conversation history and `chosen`/`rejected` differ only in the final assistant response.

## Full generation log

Save all scored responses (not just chosen/rejected) for analysis:

```bash
afterimage preference -c config.yaml --save-log
```

Or in config:

```yaml
preference:
  save_log: true
  log_path: ./preferences_full_log.jsonl  # optional, defaults to output path + _log
```

## Using with training tools

### TRL DPOTrainer

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

ds = load_dataset("json", data_files="preferences_dpo.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    train_dataset=ds,
    args=DPOConfig(output_dir="./dpo_output"),
)
trainer.train()
```

### TRL with chat template (chat_dpo format)

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

ds = load_dataset("json", data_files="preferences_chat_dpo.jsonl", split="train")

# chat_dpo format already has message lists; works directly with DPOTrainer
trainer = DPOTrainer(model=model, train_dataset=ds, args=DPOConfig(...))
```

### Unsloth + DPO

Export in the `chat_dpo` format and use Unsloth's
DPO notebook directly.

## Python API

Set `OPENAI_API_KEY` in the environment (or switch the provider block to Gemini and `GEMINI_API_KEY` consistently).

```python
import asyncio
import os

from afterimage import ConversationGenerator
from afterimage.callbacks import ContextualInstructionGeneratorCallback
from afterimage.evaluator import ConversationJudge
from afterimage.key_management import SmartKeyPool
from afterimage.preference import PreferenceConfig
from afterimage.providers import InMemoryDocumentProvider, LLMFactory


async def main():
    pool = SmartKeyPool.from_single_key(os.environ["OPENAI_API_KEY"])
    llm = LLMFactory.create(provider="openai", model_name="gpt-4o-mini", api_key=pool)

    judge = ConversationJudge.from_factory(
        llm,
        key_pool=pool,
        model_provider_name="openai",
    )

    docs = InMemoryDocumentProvider(["Short context document text."])
    instruction_callback = ContextualInstructionGeneratorCallback(
        api_key=pool,
        documents=docs,
        model_provider_name="openai",
        model_name="gpt-4o-mini",
    )

    gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=pool,
        model_provider_name="openai",
        instruction_generator_callback=instruction_callback,
    )

    pref_gen = gen.to_preference_generator(
        judge=judge,
        config=PreferenceConfig(
            num_pairs=50,
            strategy="temperature",
            output_format="dpo",
            output_path="./preferences.jsonl",
        ),
    )

    pairs, analytics = await pref_gen.generate()
    pref_gen.save_pairs(pairs, analytics)
    print(f"Generated {len(pairs)} pairs, discard rate: {analytics.discard_rate:.0%}")

    await judge.aclose()


asyncio.run(main())
```

## CLI reference

```
afterimage preference [OPTIONS]

Options:
  -c, --config PATH    Path to YAML config file.  [required]
  --dry-run            Print plan without generating.
  --num-pairs INTEGER  Override preference.num_pairs from config.
  --format [dpo|chat_dpo|ultrafeedback|anthropic_hh|orpo]
                       Override output format.
  -o, --output PATH    Override output file path.
  --save-log           Save full generation log with all scored responses.
  --help               Show this message and exit.
```

## Tips

- **Start small**: use `num_responses=2` to minimize API cost.
- **Best dataset**: use `strategy=combined` for production datasets.
- **High discard rate** (>40%): lower `min_score_gap` or switch to a stronger strategy.
- **One variation always wins**: switch strategy; temperature may not create enough variation for your model.
- **Local models**: the temperature strategy works with vLLM, Ollama, and llama.cpp; temperature is passed through to the provider.
- **Analytics**: check the stats printed after generation; warnings highlight degenerate patterns.
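As a closing illustration, the scoring-and-selection rule from "How it works" (score all responses, pair the best with the worst, discard small gaps) reduces to a few lines. This is a sketch of the idea only; `select_pair` is a hypothetical helper and the scores are made up, not part of the AfterImage API:

```python
def select_pair(scored_responses, min_score_gap=0.1):
    """Pick (chosen, rejected) from a list of (response_text, score) tuples.

    Returns None when the score gap is below the threshold, mirroring
    the discard behavior described above.
    """
    ranked = sorted(scored_responses, key=lambda r: r[1], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen[1] - rejected[1] < min_score_gap:
        return None  # discarded: responses too close in quality
    return {"chosen": chosen[0], "rejected": rejected[0]}


pair = select_pair([("answer A", 0.82), ("answer B", 0.55), ("answer C", 0.61)])
# pair == {"chosen": "answer A", "rejected": "answer B"}
```

Raising `min_score_gap` trades dataset size for a cleaner preference signal, which is why a high discard rate is the first thing to check when tuning it.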