# Export & Integration Guide

AfterImage can export generated datasets to all major fine-tuning formats with a single command.

## Quick Start

```bash
# Export to ShareGPT format
afterimage export -i dataset.jsonl -f sharegpt

# Export to multiple formats at once
afterimage export -i dataset.jsonl -f sharegpt -f alpaca -f messages

# Export to all formats with train/val split
afterimage export -i dataset.jsonl --all --split 0.1

# List available formats
afterimage export --list-formats
```

## Available Formats

| Format        | Key             | Multi-turn | System Prompt | Used by                         |
|---------------|-----------------|:----------:|:-------------:|---------------------------------|
| ShareGPT      | `sharegpt`      | yes        | yes           | Unsloth, Axolotl, LLaMA-Factory |
| Alpaca        | `alpaca`        | no*        | no            | Stanford Alpaca, basic SFT      |
| Messages      | `messages`      | yes        | yes           | TRL SFTTrainer, HuggingFace     |
| Oumi          | `oumi`          | yes        | yes           | Oumi                            |
| LLaMA-Factory | `llama_factory` | yes        | yes           | LLaMA-Factory                   |
| OpenAI        | `openai`        | yes        | yes           | OpenAI fine-tuning API          |
| DPO           | `dpo`           | no         | yes           | TRL DPOTrainer, RLHF            |
| Raw           | `raw`           | yes        | yes           | AfterImage, custom pipelines    |

\* Alpaca supports multi-turn via `split_turns` mode (programmatic API).

## Output Schemas

### ShareGPT

```json
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
```

Roles are mapped: `user` → `human`, `assistant` → `gpt`. Consecutive same-role messages are merged with a newline separator.

### Alpaca

```json
{"instruction": "user question", "input": "optional context", "output": "assistant response"}
```

Uses the first user/assistant pair. The `input` field is populated from `instruction_context` if present in the source row.

### Messages / Oumi

```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

Standard HuggingFace chat format; Oumi uses the same schema. Rows without an assistant message are skipped.
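The ShareGPT role mapping and same-role merging described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation; `to_sharegpt` is a hypothetical helper name:

```python
# Assumed mapping; any unknown role passes through unchanged.
ROLE_MAP = {"user": "human", "assistant": "gpt", "system": "system"}

def to_sharegpt(row):
    """Map roles and merge consecutive same-role messages with a newline."""
    turns = []
    for msg in row["conversations"]:
        role = ROLE_MAP.get(msg["role"], msg["role"])
        if turns and turns[-1]["from"] == role:
            # Merge a run of same-role messages into one turn.
            turns[-1]["value"] += "\n" + msg["content"]
        else:
            turns.append({"from": role, "value": msg["content"]})
    return {"conversations": turns}

row = {"conversations": [
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Are you there?"},
    {"role": "assistant", "content": "Hello!"},
]}
print(to_sharegpt(row))
```

The two consecutive user messages collapse into a single `human` turn, so the exported conversation always alternates roles.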
### LLaMA-Factory

```json
{
  "instruction": "last user message",
  "input": "context",
  "output": "last assistant response",
  "history": [["user msg 1", "assistant msg 1"], ["user msg 2", "assistant msg 2"]],
  "system": "optional system prompt"
}
```

All turns except the last pair go into `history` as `[user, assistant]` arrays.

### OpenAI Fine-Tuning

```json
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

Same as the Messages format. Requires at least one assistant message (an OpenAI API requirement).

### DPO (Preference Pairs)

```json
{"prompt": "user instruction", "chosen": "best response", "rejected": "worst response"}
```

Requires `final_score` in the source data (enable `quality.auto_improve: true`). Groups conversations by first user message and pairs the highest- and lowest-scored responses. Minimum score gap: 0.2 (configurable via the API).

### Raw

```json
{"conversations": [...], "metadata": {...}, ...}
```

Passthrough — copies the entire source row as-is.

## CLI Reference

### `afterimage export`

```
Options:
  -i, --input PATH         Path to AfterImage JSONL dataset (required)
  -f, --format TEXT        Target format(s). Repeat for multiple
  --all                    Export to all available formats
  -o, --output-dir PATH    Output directory (default: same as input)
  --split FLOAT            Train/val split ratio (e.g. 0.1 = 10% validation)
  --shuffle/--no-shuffle   Shuffle before splitting (default: shuffle)
  --seed INT               Random seed for reproducible splits (default: 42)
  --system-prompt TEXT     System prompt to prepend
  --list-formats           Show all available formats
```

Output files are named `{input_stem}_{format}.jsonl`, or `{input_stem}_{format}_train.jsonl` / `{input_stem}_{format}_val.jsonl` when using `--split`.
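The DPO pairing rule described above (group by first user message, pair the highest- and lowest-scored responses, require a minimum score gap) can be sketched as follows. This is an illustrative reconstruction, not AfterImage's actual code; `build_dpo_pairs` and its row layout are assumptions:

```python
def build_dpo_pairs(rows, min_gap=0.2):
    """Group rows by first user message; emit chosen/rejected pairs."""
    groups = {}
    for row in rows:
        convo = row["conversations"]
        prompt = next(m["content"] for m in convo if m["role"] == "user")
        answer = next(m["content"] for m in convo if m["role"] == "assistant")
        groups.setdefault(prompt, []).append((row["final_score"], answer))

    pairs = []
    for prompt, scored in groups.items():
        scored.sort()  # ascending by score
        (low, worst), (high, best) = scored[0], scored[-1]
        if high - low >= min_gap:  # skip near-ties (and single-row groups)
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```

A group with only one response has a gap of zero and is dropped automatically, which matches the format's "no multi-turn" behavior: each output row is one prompt with one preferred and one dispreferred completion.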
### `afterimage push`

```
Options:
  -i, --input PATH   Path to AfterImage JSONL dataset (required)
  -f, --format TEXT  Export format before pushing (default: messages)
  --repo TEXT        HuggingFace repo: username/dataset-name (required)
  --private          Create as private dataset
  --split FLOAT      Train/val split ratio (default: 0.1)
```

`huggingface_hub` is a core dependency; no extra install is required for `push`.

## Auto-Export After Generation

Add an `export` section to your YAML config to automatically export after generation:

```yaml
output:
  path: output/dataset.jsonl

export:
  formats:
    - sharegpt
    - messages
  output_dir: output/exports  # optional, defaults to same directory
  split: 0.1                  # optional train/val split
  shuffle: true               # default
  seed: 42                    # default
```

Auto-export is fail-safe — if export fails, your generated dataset is unaffected.

## Training Tool Integration

### Unsloth

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="dataset_sharegpt.jsonl")

from unsloth import FastLanguageModel
# ... load model ...
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    dataset_text_field="conversations",
)
```

### Axolotl

```yaml
# axolotl config
datasets:
  - path: dataset_sharegpt.jsonl
    type: sharegpt
```

### TRL SFTTrainer

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files={
    "train": "dataset_messages_train.jsonl",
    "validation": "dataset_messages_val.jsonl",
})

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
```

### TRL DPOTrainer

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="dataset_dpo.jsonl")

from trl import DPOTrainer

trainer = DPOTrainer(
    model=model,
    train_dataset=dataset["train"],
)
```

### LLaMA-Factory

```json
{
  "my_dataset": {
    "file_name": "dataset_llama_factory.jsonl",
    "formatting": "alpaca",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "history": "history",
      "system": "system"
    }
  }
}
```

### Oumi

```bash
oumi train -c my_config.yaml \
  --training_data.datasets[0].dataset_name=dataset_oumi.jsonl
```

### OpenAI Fine-Tuning API

```bash
# Upload and create fine-tuning job
openai api fine_tuning.jobs.create \
  -t dataset_openai.jsonl \
  -m gpt-4o-mini-2024-07-18
```

### HuggingFace Hub

```bash
# Push directly from AfterImage
afterimage push -i dataset.jsonl --repo username/my-dataset --private
```

Then use anywhere:

```python
from datasets import load_dataset

ds = load_dataset("username/my-dataset")
```

## Programmatic API

```python
from afterimage.integrations import get_exporter, list_formats

# List all formats
for fmt in list_formats():
    print(f"{fmt['name']}: {fmt['description']}")

# Export a single format
exporter = get_exporter("sharegpt")
result = exporter.export_file("input.jsonl", "output.jsonl")
print(f"Exported {result.total_output} rows, skipped {result.skipped}")

# Export with system prompt
result = exporter.export_file(
    "input.jsonl",
    "output.jsonl",
    system_prompt="You are a helpful assistant.",
)

# Convert a single conversation
row = {"conversations": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
]}
output_rows = exporter.convert_conversation(row)
```

## Design Notes

- **Streaming**: Exports process line-by-line. Memory usage is constant regardless of dataset size.
- **Graceful degradation**: Unconvertible rows are skipped with warnings, never crashing mid-export.
- **Deterministic**: Same input + seed always produces the same output.
- **No external dependencies**: All core exporters use only the Python standard library. `push` uses `huggingface_hub`, which is installed with the package.
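The streaming and graceful-degradation behavior in the notes above can be sketched with the standard library alone. This is a simplified model of such an export loop, not AfterImage's internals; `export_stream` is a hypothetical name:

```python
import json

def export_stream(in_path, out_path, convert):
    """Stream-convert a JSONL file row by row; skip bad rows with a warning."""
    written = skipped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:  # only one source row in memory at a time
            line = line.strip()
            if not line:
                continue
            try:
                for out_row in convert(json.loads(line)):
                    dst.write(json.dumps(out_row, ensure_ascii=False) + "\n")
                    written += 1
            except (json.JSONDecodeError, KeyError, ValueError) as exc:
                skipped += 1  # never crash mid-export
                print(f"warning: skipping row: {exc}")
    return written, skipped
```

Because rows are read, converted, and written one at a time, memory use stays flat even for multi-gigabyte datasets, and a single malformed line costs one skipped row rather than a failed export.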