Export & Integration Guide

AfterImage can export generated datasets to all major fine-tuning formats with a single command.

Quick Start

# Export to ShareGPT format
afterimage export -i dataset.jsonl -f sharegpt

# Export to multiple formats at once
afterimage export -i dataset.jsonl -f sharegpt -f alpaca -f messages

# Export to all formats with train/val split
afterimage export -i dataset.jsonl --all --split 0.1

# List available formats
afterimage export --list-formats

Available Formats

Format          Key             Multi-turn   System Prompt   Used by
--------------  --------------  -----------  --------------  -------------------------------
ShareGPT        sharegpt        yes          yes             Unsloth, Axolotl, LLaMA-Factory
Alpaca          alpaca          no*          no              Stanford Alpaca, basic SFT
Messages        messages        yes          yes             TRL SFTTrainer, HuggingFace
Oumi            oumi            yes          yes             Oumi
LLaMA-Factory   llama_factory   yes          yes             LLaMA-Factory
OpenAI          openai          yes          yes             OpenAI fine-tuning API
DPO             dpo             no           yes             TRL DPOTrainer, RLHF
Raw             raw             yes          yes             AfterImage, custom pipelines

* Alpaca supports multi-turn via split_turns mode (programmatic API).

Output Schemas

ShareGPT

{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}

Roles are mapped: user → human, assistant → gpt. Consecutive same-role messages are merged with a newline separator.
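The role mapping and merge behavior can be sketched in a few lines (an illustrative rewrite of the rule above, not AfterImage's actual exporter code):

```python
ROLE_MAP = {"user": "human", "assistant": "gpt", "system": "system"}

def to_sharegpt(row):
    """Sketch: map roles and merge consecutive same-role turns."""
    turns = []
    for msg in row["conversations"]:
        who = ROLE_MAP.get(msg["role"], msg["role"])
        if turns and turns[-1]["from"] == who:
            # Consecutive same-role messages merge with a newline separator
            turns[-1]["value"] += "\n" + msg["content"]
        else:
            turns.append({"from": who, "value": msg["content"]})
    return {"conversations": turns}
```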

Alpaca

{"instruction": "user question", "input": "optional context", "output": "assistant response"}

Uses first user/assistant pair. The input field is populated from instruction_context if present in the source row.
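The first-pair extraction can be sketched as follows (`to_alpaca` is a hypothetical helper written for illustration, not the shipped exporter):

```python
def to_alpaca(row):
    """Sketch: take the first user/assistant pair, pull optional context."""
    msgs = row["conversations"]
    user = next(m["content"] for m in msgs if m["role"] == "user")
    assistant = next(m["content"] for m in msgs if m["role"] == "assistant")
    return {
        "instruction": user,
        "input": row.get("instruction_context", ""),
        "output": assistant,
    }
```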

Messages / Oumi

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Standard HuggingFace chat format. Oumi uses the same schema. Rows without an assistant message are skipped.
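The skip rule amounts to a simple filter (a sketch of the documented behavior, not the implementation):

```python
def has_assistant(row):
    """Keep only rows that contain at least one assistant turn."""
    return any(m["role"] == "assistant" for m in row["conversations"])

rows = [
    {"conversations": [{"role": "user", "content": "unanswered"}]},
    {"conversations": [{"role": "user", "content": "hi"},
                       {"role": "assistant", "content": "hello"}]},
]
kept = [r for r in rows if has_assistant(r)]  # only the second row survives
```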

LLaMA-Factory

{
  "instruction": "last user message",
  "input": "context",
  "output": "last assistant response",
  "history": [["user msg 1", "assistant msg 1"], ["user msg 2", "assistant msg 2"]],
  "system": "optional system prompt"
}

All turns except the last pair go into history as [user, assistant] arrays.
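The history-building rule can be sketched as follows (an illustration of the schema above; `to_llama_factory` is a hypothetical helper):

```python
def to_llama_factory(row):
    """Sketch: pair up turns; all but the last pair become history."""
    system, pairs, pending_user = "", [], None
    for m in row["conversations"]:
        if m["role"] == "system":
            system = m["content"]
        elif m["role"] == "user":
            pending_user = m["content"]
        elif m["role"] == "assistant" and pending_user is not None:
            pairs.append([pending_user, m["content"]])
            pending_user = None
    last_user, last_assistant = pairs[-1]
    return {
        "instruction": last_user,
        "input": "",
        "output": last_assistant,
        "history": pairs[:-1],   # every pair except the last
        "system": system,
    }
```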

OpenAI Fine-Tuning

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Same as Messages format. Requires at least one assistant message (OpenAI API requirement).

DPO (Preference Pairs)

{"prompt": "user instruction", "chosen": "best response", "rejected": "worst response"}

Requires final_score in source data (enable quality.auto_improve: true). Groups conversations by first user message and pairs highest/lowest scored responses. Minimum score gap: 0.2 (configurable via API).
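The grouping and pairing logic can be sketched as follows (assumes each row carries a final_score field, as noted above; this is an illustration, not AfterImage's source):

```python
from collections import defaultdict

def build_dpo_pairs(rows, min_gap=0.2):
    """Sketch: group by first user message, pair best vs. worst response."""
    groups = defaultdict(list)
    for row in rows:
        msgs = row["conversations"]
        prompt = next(m["content"] for m in msgs if m["role"] == "user")
        answer = next(m["content"] for m in msgs if m["role"] == "assistant")
        groups[prompt].append((row["final_score"], answer))
    pairs = []
    for prompt, scored in groups.items():
        scored.sort()  # ascending by score
        if len(scored) >= 2 and scored[-1][0] - scored[0][0] >= min_gap:
            pairs.append({"prompt": prompt,
                          "chosen": scored[-1][1],
                          "rejected": scored[0][1]})
    return pairs
```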

Raw

{"conversations": [...], "metadata": {...}, ...}

Passthrough — copies the entire source row as-is.

CLI Reference

afterimage export

Options:
  -i, --input PATH        Path to AfterImage JSONL dataset (required)
  -f, --format TEXT       Target format(s). Repeat for multiple
  --all                   Export to all available formats
  -o, --output-dir PATH   Output directory (default: same as input)
  --split FLOAT           Train/val split ratio (e.g. 0.1 = 10% validation)
  --shuffle/--no-shuffle  Shuffle before splitting (default: shuffle)
  --seed INT              Random seed for reproducible splits (default: 42)
  --system-prompt TEXT    System prompt to prepend
  --list-formats          Show all available formats

Output files are named {input_stem}_{format}.jsonl, or {input_stem}_{format}_train.jsonl / {input_stem}_{format}_val.jsonl when using --split.
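The documented --split/--shuffle/--seed semantics can be reproduced as follows (a sketch of the behavior, not the CLI's actual source):

```python
import random

def train_val_split(rows, split=0.1, seed=42, shuffle=True):
    """Sketch: seeded shuffle then slice, so splits are reproducible."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded, so deterministic
    n_val = int(len(rows) * split)
    return rows[n_val:], rows[:n_val]  # (train, val)
```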

afterimage push

Options:
  -i, --input PATH    Path to AfterImage JSONL dataset (required)
  -f, --format TEXT   Export format before pushing (default: messages)
  --repo TEXT         HuggingFace repo: username/dataset-name (required)
  --private           Create as private dataset
  --split FLOAT       Train/val split ratio (default: 0.1)

huggingface_hub is a core dependency; no extra install is required for push.

Auto-Export After Generation

Add an export section to your YAML config to automatically export after generation:

output:
  path: output/dataset.jsonl
  export:
    formats:
      - sharegpt
      - messages
    output_dir: output/exports   # optional, defaults to same directory
    split: 0.1                   # optional train/val split
    shuffle: true                # default
    seed: 42                     # default

Auto-export is fail-safe — if export fails, your generated dataset is unaffected.

Training Tool Integration

Unsloth

from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset_sharegpt.jsonl")["train"]

from unsloth import FastLanguageModel
from unsloth.chat_templates import standardize_sharegpt
# ... load model and tokenizer ...

# Convert ShareGPT "from"/"value" turns to "role"/"content"
dataset = standardize_sharegpt(dataset)
# Render each conversation into a single "text" column for SFT
dataset = dataset.map(lambda batch: {"text": [
    tokenizer.apply_chat_template(c, tokenize=False)
    for c in batch["conversations"]
]}, batched=True)

from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
)

Axolotl

# axolotl config
datasets:
  - path: dataset_sharegpt.jsonl
    type: sharegpt

TRL SFTTrainer

from datasets import load_dataset
dataset = load_dataset("json", data_files={
    "train": "dataset_messages_train.jsonl",
    "validation": "dataset_messages_val.jsonl",
})

from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)

TRL DPOTrainer

from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset_dpo.jsonl")

from trl import DPOTrainer
trainer = DPOTrainer(
    model=model,
    train_dataset=dataset["train"],
)

LLaMA-Factory

Register the exported file in LLaMA-Factory's dataset_info.json:

{
  "my_dataset": {
    "file_name": "dataset_llama_factory.jsonl",
    "formatting": "alpaca",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "history": "history",
      "system": "system"
    }
  }
}

Oumi

oumi train -c my_config.yaml \
  --training_data.datasets[0].dataset_name=dataset_oumi.jsonl

OpenAI Fine-Tuning API

# Upload and create fine-tuning job
openai api fine_tuning.jobs.create \
  -t dataset_openai.jsonl \
  -m gpt-4o-mini-2024-07-18

HuggingFace Hub

# Push directly from AfterImage
afterimage push -i dataset.jsonl --repo username/my-dataset --private

# Then use anywhere
from datasets import load_dataset
ds = load_dataset("username/my-dataset")

Programmatic API

from afterimage.integrations import get_exporter, list_formats

# List all formats
for fmt in list_formats():
    print(f"{fmt['name']}: {fmt['description']}")

# Export a single format
exporter = get_exporter("sharegpt")
result = exporter.export_file("input.jsonl", "output.jsonl")
print(f"Exported {result.total_output} rows, skipped {result.skipped}")

# Export with system prompt
result = exporter.export_file(
    "input.jsonl",
    "output.jsonl",
    system_prompt="You are a helpful assistant.",
)

# Convert a single conversation
row = {"conversations": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
]}
output_rows = exporter.convert_conversation(row)

Design Notes

  • Streaming: Exports process line-by-line. Memory usage is constant regardless of dataset size.

  • Graceful degradation: Unconvertible rows are skipped with warnings, never crashing mid-export.

  • Deterministic: Same input + seed always produces the same output.

  • No external dependencies for export: All core exporters use only the Python standard library. push uses huggingface_hub, which is installed with the package.
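
The streaming and warn-and-skip behavior described above can be sketched as follows (an illustration of the design, not AfterImage's source; export_stream is a hypothetical helper):

```python
import json

def export_stream(in_path, out_path, convert):
    """Sketch: line-by-line export with constant memory; bad rows skipped."""
    written = skipped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            try:
                out = convert(json.loads(line))
            except Exception:
                skipped += 1  # real code would also log a warning
                continue
            dst.write(json.dumps(out) + "\n")
            written += 1
    return written, skipped
```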