# Context-to-Skill Tutorial
This tutorial shows how to run Afterimage’s context-to-skill workflow and how to compare generated skills against an original Ctx2Skill run.
The workflow has three parts:

1. Convert Ctx2Skill-style contexts into Afterimage documents.
2. Run Afterimage skill discovery.
3. Optionally compare Afterimage skills with Ctx2Skill skills and a no-skill baseline.
All commands below assume they are run from the Afterimage repository root unless otherwise stated.
## Requirements
Install the project dependencies:
```bash
uv sync
```
Set the API key expected by your config:
```bash
export GEMINI_API_KEY=...
```
If you use a different provider or environment variable, update the config's `model.api_key_env` field.
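As a quick pre-flight check, the following sketch verifies the key is set before a run (a hypothetical helper, not part of the repo; adjust `key_name` to match your `model.api_key_env`):

```python
import os

# Hypothetical pre-flight check: fail fast if the key the config expects is unset.
key_name = "GEMINI_API_KEY"  # match model.api_key_env in your config
if not os.environ.get(key_name):
    raise SystemExit(f"{key_name} is not set; export it before running Afterimage.")
```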
## Configure the Model
Use `examples/configs/context_to_skill_compare.yaml` as a working example.
The top-level `model` block is used for every stage unless a stage-specific override is provided:
```yaml
model:
  provider: gemini
  model_name: gemini-3.1-flash-lite-preview
  api_key_env: GEMINI_API_KEY
```
To configure each stage independently:
```yaml
skill:
  models:
    challenger: gemini-3.1-flash-lite-preview
    reasoner: gemini-3.1-flash-lite-preview
    judge: gemini-3.1-flash-lite-preview
    proposer: gemini-3.1-flash-lite-preview
    generator: gemini-3.1-flash-lite-preview
```
Stage meanings:
- `challenger` or `probe_generator`: generates probe tasks and rubrics.
- `reasoner`: answers probes during discovery.
- `judge`: grades answers against rubrics.
- `proposer`: alias for both proposer stages.
- `reasoner_proposer`: proposes respondent-side skills from failed probes.
- `challenger_proposer`: proposes challenger-side skills from solved probes.
- `generator`: alias for both generator stages.
- `reasoner_generator`: writes respondent-side skill markdown.
- `challenger_generator`: writes challenger-side skill markdown.
- `selector_reasoner`: answers replay probes during skill selection.
- `selector_judge` or `selector`: judges replay probes during skill selection.
A stage can also specify its own provider and API key:
```yaml
skill:
  models:
    judge:
      provider: gemini
      model_name: gemini-3.1-flash-lite-preview
      api_key_env: GEMINI_API_KEY
```
## Convert Ctx2Skill Contexts
Ctx2Skill stores contexts as chat-style messages. Afterimage skill discovery expects document rows with a `text` field. Convert the context JSONL first:
```bash
uv run python scripts/ctx2skill_to_afterimage_docs.py \
  --input ../Ctx2Skill/CL-bench-context-dedup.jsonl \
  --output /tmp/ctx2skill-afterimage-docs.jsonl \
  --max-rows 10
```
Arguments:
- `--input`: Source Ctx2Skill JSONL file.
- `--output`: Converted Afterimage-compatible JSONL file.
- `--max-rows`: Number of contexts to convert.
The converted rows contain:
- `id`: context id
- `text`: formatted conversation/context
- `metadata`: source metadata
- `rubrics`: source benchmark rubrics, when present
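To spot-check the conversion, a minimal sketch that prints the fields of the first converted row (it assumes only the output path used above):

```python
import json

# Peek at the first converted document row.
with open("/tmp/ctx2skill-afterimage-docs.jsonl") as f:
    row = json.loads(next(f))

print(sorted(row))        # expect: id, metadata, text (plus rubrics when present)
print(row["text"][:200])  # preview of the formatted context
```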
## Run Afterimage Skill Discovery
Make sure your config points at the converted JSONL:
```yaml
documents:
  provider: jsonl
  path: /tmp/ctx2skill-afterimage-docs.jsonl
  content_key: text
  preserve_ids: true
  include_metadata: true

skill:
  output_dir: ./skills-compare-afterimage
  iterations: 3
  probes_per_context: 1
  max_contexts: 10
  bootstrap_when_no_failures: true
  use_source_rubrics: true
```
Run discovery:
```bash
uv run afterimage skill discover \
  -c examples/configs/context_to_skill_compare.yaml
```
This command runs the full context-to-skill loop:
1. Generate context-grounded probes.
2. Answer probes with the reasoner.
3. Judge answers against rubrics.
4. Update reasoner skills from failed probes.
5. Update challenger skills from solved probes.
6. Replay candidate skills and select the best respondent-side skill.
Important config fields:
- `skill.output_dir`: Directory where generated skills are written.
- `skill.iterations`: Number of self-improvement rounds per context.
- `skill.probes_per_context`: Number of probes generated per round.
- `skill.max_contexts`: Number of contexts to process.
- `skill.bootstrap_when_no_failures`: Writes a skill even if all probes pass.
- `skill.use_source_rubrics`: Uses source rubrics as hints during probe generation.
Expected output:
```
skills-compare-afterimage/<context_id>/SKILL.md
```
Each context directory may contain:
- `probes.jsonl`: generated probe tasks and rubrics
- `results.jsonl`: probe answers and judge results
- `versions.jsonl`: respondent skill versions
- `challenger_versions.jsonl`: challenger skill versions
- `skill-iter-*.md`: respondent skill markdown by iteration
- `challenger-skill-iter-*.md`: challenger skill markdown by iteration
- `selection.json`: replay selection scores
- `SKILL.md`: final selected respondent skill
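To get a quick overview of what a run produced, a minimal sketch that walks the output directory (assuming the default `output_dir` from the config above):

```python
from pathlib import Path

# Summarize per-context artifacts under the skill output directory.
root = Path("skills-compare-afterimage")
for ctx_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    names = {p.name for p in ctx_dir.iterdir()}
    iters = sum(1 for n in names if n.startswith("skill-iter-"))
    final = "yes" if "SKILL.md" in names else "no"
    print(f"{ctx_dir.name}: SKILL.md={final}, respondent iterations={iters}")
```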
## Inspect Generated Skills
List final selected skills:
```bash
find skills-compare-afterimage -mindepth 2 -maxdepth 2 -name SKILL.md -print
```
Read one selected skill:
```bash
cat skills-compare-afterimage/<context_id>/SKILL.md
```
Read its selection result:
```bash
cat skills-compare-afterimage/<context_id>/selection.json
```
Useful manual checks:
- The skill should mention constraints, procedures, or failure modes from the context.
- The skill should be reusable, not just a memorized answer to one probe.
- The selected skill should be consistent with the replay scores in `selection.json`.
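To review the replay scores for every context in one pass, a sketch that only assumes each `selection.json` is valid JSON (the exact schema may vary between versions):

```python
import json
from pathlib import Path

# Gather every selection.json so replay scores can be compared across contexts.
for path in sorted(Path("skills-compare-afterimage").glob("*/selection.json")):
    data = json.loads(path.read_text())
    print(path.parent.name)
    print(json.dumps(data, indent=2)[:300])  # truncated preview
```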
## Optional: Run Original Ctx2Skill
If you want to compare Afterimage with the original Ctx2Skill implementation, run Ctx2Skill on the same number of contexts.
From the Ctx2Skill repository:
```bash
python selfplay_loop.py \
  --challenger-model gemini-3.1-flash-lite-preview \
  --reasoner-model gemini-3.1-flash-lite-preview \
  --judge-model gemini-3.1-flash-lite-preview \
  --proposer-model gemini-3.1-flash-lite-preview \
  --generator-model gemini-3.1-flash-lite-preview \
  --input ./CL-bench-context-dedup.jsonl \
  --output outputs/loop_data/compare_10ctx.jsonl \
  --num-iterations 3 \
  --num-tasks 1 \
  --skills-dir skills-compare-10ctx \
  --max-samples 10 \
  --workers 1
```
The generated Ctx2Skill respondent skills are expected under:
```
skills-compare-10ctx/reasoner/<context_id>/SKILL.md
```
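Before running the comparison script below, you can preview which context ids have a skill on both sides (a sketch assuming the default directories used in this tutorial):

```python
from pathlib import Path

# Preview the pairing the comparison report will be based on.
ctx2skill = {p.parent.name for p in Path("../Ctx2Skill/skills-compare-10ctx/reasoner").glob("*/SKILL.md")}
afterimage = {p.parent.name for p in Path("skills-compare-afterimage").glob("*/SKILL.md")}

print("paired:", sorted(ctx2skill & afterimage))
print("ctx2skill only:", sorted(ctx2skill - afterimage))
print("afterimage only:", sorted(afterimage - ctx2skill))
```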
## Compare Skill Texts
This step is a cheap lexical comparison between the selected `SKILL.md` files on each side.
```bash
uv run python scripts/compare_skill_outputs.py \
  --ctx2skill-root ../Ctx2Skill/skills-compare-10ctx/reasoner \
  --afterimage-root ./skills-compare-afterimage \
  --output output/skill_text_compare_10ctx.json
```
The report includes:
- `paired_contexts`: contexts with skills on both sides
- `ctx2skill_only`: contexts only present in Ctx2Skill output
- `afterimage_only`: contexts only present in Afterimage output
- `avg_token_jaccard`: token overlap between skill texts
- `avg_sequence_ratio`: character sequence similarity
Lexical similarity is only a sanity check. Two useful skills can have low text overlap if they express the same procedure differently.
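For intuition, the two similarity metrics can be computed roughly as below; this is a sketch of the general technique, not necessarily the script's exact implementation:

```python
from difflib import SequenceMatcher

def token_jaccard(a: str, b: str) -> float:
    """Overlap of unique whitespace-delimited tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] via difflib."""
    return SequenceMatcher(None, a, b).ratio()

print(token_jaccard("check each rubric first", "first check the rubric"))
print(sequence_ratio("check each rubric first", "first check the rubric"))
```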
## Evaluate Baseline vs Ctx2Skill vs Afterimage
Use pass-rate evaluation to test whether skills improve task performance.
```bash
uv run python scripts/evaluate_skill_pass_rates.py \
  --config examples/configs/context_to_skill_compare.yaml \
  --docs /tmp/ctx2skill-afterimage-docs.jsonl \
  --ctx2skill-root ../Ctx2Skill/skills-compare-10ctx/reasoner \
  --afterimage-root ./skills-compare-afterimage \
  --output output/base_vs_ctx2skill_vs_afterimage_eval_10ctx_mixed.json \
  --limit-contexts 10 \
  --limit-tasks 6
```
Variants:
- `baseline`: no skill injected
- `ctx2skill`: Ctx2Skill skill injected
- `afterimage`: Afterimage skill injected
Task sources:
- `ctx2skill-hard-set`: hard tasks generated by Ctx2Skill
- `afterimage-probe`: probes generated by Afterimage during discovery
Metrics:
- `tasks`: number of evaluated tasks
- `passed`: number of tasks where all rubrics passed
- `pass_rate`: `passed / tasks`
- `avg_score`: strict pass score averaged across tasks
- `avg_rubric_satisfaction`: average fraction of satisfied rubrics
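As a reading aid, the headline numbers reduce to simple aggregates over per-task judge results; the field names in this sketch are illustrative, not the report's actual schema:

```python
# Each evaluated task carries a list of per-rubric booleans from the judge.
results = [
    {"rubrics_passed": [True, True, True]},  # all rubrics satisfied -> strict pass
    {"rubrics_passed": [True, False]},       # one miss -> strict fail
]

tasks = len(results)
passed = sum(all(r["rubrics_passed"]) for r in results)
pass_rate = passed / tasks
avg_rubric_satisfaction = sum(
    sum(r["rubrics_passed"]) / len(r["rubrics_passed"]) for r in results
) / tasks

print(f"tasks={tasks} passed={passed} pass_rate={pass_rate:.3f} "
      f"avg_rubric_satisfaction={avg_rubric_satisfaction:.3f}")
```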
Example interpretation:
```
afterimage  tasks=59 passed=30 pass_rate=0.508 avg_rubric_satisfaction=0.811
baseline    tasks=59 passed=26 pass_rate=0.441 avg_rubric_satisfaction=0.776
ctx2skill   tasks=59 passed=30 pass_rate=0.508 avg_rubric_satisfaction=0.792

source              variant     tasks  passed  pass_rate
---------------------------------------------------------
afterimage-probe    afterimage     30      22      0.733
afterimage-probe    baseline       30      18      0.600
afterimage-probe    ctx2skill      30      20      0.667
ctx2skill-hard-set  afterimage     29       8      0.276
ctx2skill-hard-set  baseline       29       8      0.276
ctx2skill-hard-set  ctx2skill      29      10      0.345
```
This means Afterimage produced functional context-specific skills and improved over the no-skill baseline on the mixed task set. In this example, Afterimage matched Ctx2Skill on strict pass rate while achieving higher average rubric satisfaction. The source breakdown is important: Afterimage is strongest on `afterimage-probe` tasks, while Ctx2Skill is stronger on `ctx2skill-hard-set` tasks.
## Choosing Run Size
Start with a small run:
```yaml
skill:
  iterations: 1
  probes_per_context: 1
  max_contexts: 3
```
Then increase context count and iterations:
```yaml
skill:
  iterations: 3
  probes_per_context: 1
  max_contexts: 10
```
Evaluation call count is approximately:

```
contexts * tasks_per_context * variants * 2
```

The factor of 2 accounts for one answer call and one judge call per variant.
For example:

```
10 contexts * 2 tasks * 3 variants * 2 = 120 LLM calls
```
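If you want to budget a larger run, the estimate is easy to script (a hypothetical helper, not part of the repo):

```python
def estimate_llm_calls(contexts: int, tasks_per_context: int, variants: int = 3) -> int:
    """One answer call plus one judge call per task and variant."""
    return contexts * tasks_per_context * variants * 2

print(estimate_llm_calls(10, 2))  # 120
```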
Do not rely on lexical similarity alone. The pass-rate evaluation is the more important signal.