Export & Integration Guide
AfterImage can export generated datasets to all major fine-tuning formats with a single command.
Quick Start
# Export to ShareGPT format
afterimage export -i dataset.jsonl -f sharegpt
# Export to multiple formats at once
afterimage export -i dataset.jsonl -f sharegpt -f alpaca -f messages
# Export to all formats with train/val split
afterimage export -i dataset.jsonl --all --split 0.1
# List available formats
afterimage export --list-formats
Available Formats
| Format | Key | Multi-turn | System Prompt | Used by |
|---|---|---|---|---|
| ShareGPT | `sharegpt` | yes | yes | Unsloth, Axolotl, LLaMA-Factory |
| Alpaca | `alpaca` | no* | no | Stanford Alpaca, basic SFT |
| Messages | `messages` | yes | yes | TRL SFTTrainer, HuggingFace |
| Oumi | `oumi` | yes | yes | Oumi |
| LLaMA-Factory | `llama_factory` | yes | yes | LLaMA-Factory |
| OpenAI | `openai` | yes | yes | OpenAI fine-tuning API |
| DPO | `dpo` | no | yes | TRL DPOTrainer, RLHF |
| Raw | `raw` | yes | yes | AfterImage, custom pipelines |
* Alpaca supports multi-turn via split_turns mode (programmatic API).
Output Schemas
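ShareGPT
The ShareGPT schema is not spelled out elsewhere in this guide; for reference, this is the conventional ShareGPT layout consumed by Unsloth, Axolotl, and LLaMA-Factory. This assumes AfterImage emits the standard "from"/"value" keys; check the exporter output if in doubt.

```json
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
```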
Alpaca
{"instruction": "user question", "input": "optional context", "output": "assistant response"}
Uses the first user/assistant pair. The input field is populated from instruction_context if present in the source row.
Messages / Oumi
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Standard HuggingFace chat format. Oumi uses the same schema. Rows without an assistant message are skipped.
LLaMA-Factory
{
  "instruction": "last user message",
  "input": "context",
  "output": "last assistant response",
  "history": [["user msg 1", "assistant msg 1"], ["user msg 2", "assistant msg 2"]],
  "system": "optional system prompt"
}
All turns except the last pair go into history as [user, assistant] arrays.
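The history construction can be sketched in a few lines. This is an illustrative sketch of the documented mapping, not the exporter's actual code, and `to_llama_factory` is a hypothetical helper name:

```python
def to_llama_factory(messages, system=None):
    """Map a role/content message list to the LLaMA-Factory schema.

    Sketch: collect user/assistant pairs in order, emit the last pair
    as instruction/output and all earlier pairs as "history".
    """
    turns = [m for m in messages if m["role"] != "system"]
    # Pair strictly alternating user/assistant turns
    pairs = [
        [turns[i]["content"], turns[i + 1]["content"]]
        for i in range(0, len(turns) - 1, 2)
        if turns[i]["role"] == "user" and turns[i + 1]["role"] == "assistant"
    ]
    row = {
        "instruction": pairs[-1][0],
        "input": "",
        "output": pairs[-1][1],
        "history": pairs[:-1],
    }
    if system:
        row["system"] = system
    return row
```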
OpenAI Fine-Tuning
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Same as Messages format. Requires at least one assistant message (OpenAI API requirement).
DPO (Preference Pairs)
{"prompt": "user instruction", "chosen": "best response", "rejected": "worst response"}
Requires final_score in source data (enable quality.auto_improve: true). Groups conversations by first user message and pairs highest/lowest scored responses. Minimum score gap: 0.2 (configurable via API).
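The pairing logic described above can be sketched as follows. This is assumed behavior based on the description, not the exporter's actual code, and `group_dpo_pairs` is a hypothetical name:

```python
from collections import defaultdict

MIN_SCORE_GAP = 0.2  # minimum chosen/rejected score difference

def group_dpo_pairs(rows, min_gap=MIN_SCORE_GAP):
    """Group rows by their first user message and pair the highest-
    and lowest-scored assistant responses into DPO preference rows."""
    groups = defaultdict(list)
    for row in rows:
        convo = row["conversations"]
        prompt = next(m["content"] for m in convo if m["role"] == "user")
        answer = next(m["content"] for m in convo if m["role"] == "assistant")
        groups[prompt].append((row["final_score"], answer))
    pairs = []
    for prompt, scored in groups.items():
        scored.sort()  # ascending by score
        (lo_score, rejected), (hi_score, chosen) = scored[0], scored[-1]
        if hi_score - lo_score >= min_gap:
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```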
Raw
{"conversations": [...], "metadata": {...}, ...}
Passthrough — copies the entire source row as-is.
CLI Reference
afterimage export
Options:
-i, --input PATH Path to AfterImage JSONL dataset (required)
-f, --format TEXT Target format(s). Repeat for multiple
--all Export to all available formats
-o, --output-dir PATH Output directory (default: same as input)
--split FLOAT Train/val split ratio (e.g. 0.1 = 10% validation)
--shuffle/--no-shuffle Shuffle before splitting (default: shuffle)
--seed INT Random seed for reproducible splits (default: 42)
--system-prompt TEXT System prompt to prepend
--list-formats Show all available formats
Output files are named {input_stem}_{format}.jsonl, or {input_stem}_{format}_train.jsonl / {input_stem}_{format}_val.jsonl when using --split.
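The split semantics can be reproduced in a few lines, assuming a seeded shuffle followed by taking the tail fraction as validation. This is a sketch of the documented behavior, not the CLI's actual code:

```python
import random

def train_val_split(rows, split=0.1, shuffle=True, seed=42):
    """Seeded shuffle, then the last `split` fraction becomes validation."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)  # same seed -> same order
    n_val = int(len(rows) * split)
    if n_val == 0:
        return rows, []
    return rows[:-n_val], rows[-n_val:]
```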
afterimage push
Options:
-i, --input PATH Path to AfterImage JSONL dataset (required)
-f, --format TEXT Export format before pushing (default: messages)
--repo TEXT HuggingFace repo: username/dataset-name (required)
--private Create as private dataset
--split FLOAT Train/val split ratio (default: 0.1)
huggingface_hub is a core dependency; no extra install is required for push.
Auto-Export After Generation
Add an export section to your YAML config to automatically export after generation:
output:
  path: output/dataset.jsonl
  export:
    formats:
      - sharegpt
      - messages
    output_dir: output/exports  # optional, defaults to same directory
    split: 0.1                  # optional train/val split
    shuffle: true               # default
    seed: 42                    # default
Auto-export is fail-safe — if export fails, your generated dataset is unaffected.
Training Tool Integration
Unsloth
from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset_sharegpt.jsonl")
from unsloth import FastLanguageModel
# ... load model ...
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    dataset_text_field="conversations",
)
Axolotl
# axolotl config
datasets:
- path: dataset_sharegpt.jsonl
type: sharegpt
TRL SFTTrainer
from datasets import load_dataset
dataset = load_dataset("json", data_files={
    "train": "dataset_messages_train.jsonl",
    "validation": "dataset_messages_val.jsonl",
})
from trl import SFTTrainer
trainer = SFTTrainer(
model=model,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
)
TRL DPOTrainer
from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset_dpo.jsonl")
from trl import DPOTrainer
trainer = DPOTrainer(
    model=model,
    train_dataset=dataset["train"],
)
LLaMA-Factory
{
  "my_dataset": {
    "file_name": "dataset_llama_factory.jsonl",
    "formatting": "alpaca",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "history": "history",
      "system": "system"
    }
  }
}
Oumi
oumi train -c my_config.yaml \
  --training_data.datasets[0].dataset_name=dataset_oumi.jsonl
OpenAI Fine-Tuning API
# Upload and create fine-tuning job
openai api fine_tuning.jobs.create \
  -t dataset_openai.jsonl \
  -m gpt-4o-mini-2024-07-18
HuggingFace Hub
# Push directly from AfterImage
afterimage push -i dataset.jsonl --repo username/my-dataset --private
# Then use anywhere
from datasets import load_dataset
ds = load_dataset("username/my-dataset")
Programmatic API
from afterimage.integrations import get_exporter, list_formats
# List all formats
for fmt in list_formats():
    print(f"{fmt['name']}: {fmt['description']}")
# Export a single format
exporter = get_exporter("sharegpt")
result = exporter.export_file("input.jsonl", "output.jsonl")
print(f"Exported {result.total_output} rows, skipped {result.skipped}")
# Export with system prompt
result = exporter.export_file(
    "input.jsonl",
    "output.jsonl",
    system_prompt="You are a helpful assistant.",
)
# Convert a single conversation
row = {"conversations": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
]}
output_rows = exporter.convert_conversation(row)
Design Notes
Streaming: Exports process line-by-line. Memory usage is constant regardless of dataset size.
Graceful degradation: Unconvertible rows are skipped with warnings, never crashing mid-export.
Deterministic: Same input + seed always produces the same output.
No external dependencies: All core exporters use only the Python standard library.
push uses huggingface_hub, which is installed with the package.