Simula / OpenSimula
Experimental Simula-style synthetic data pipeline (afterimage.simula).
See also
OpenSimula (afterimage.simula): narrative guide and monitoring notes
Monitoring & Observability: GenerationMonitor usage and export
- class afterimage.simula.OpenSimula(llm: LLMProvider, *, temperature: float = 0.4, monitor: GenerationMonitor | None = None)[source]
Bases: object
High-level API for Simula-style synthetic dataset mechanisms (experimental).
All structured LLM stages (taxonomy construction, strategy inference, meta-prompt diversification and complexification, requirement critics, double-critic probes for MCQ, and task JSON generation) accept an optional GenerationMonitor. When monitor is set, each call is wrapped with track_generation and metadata including component="opensimula" and a dotted operation label. Call shutdown() on the monitor when the run completes.
- Parameters:
llm – Provider used for every structured generation in this pipeline.
temperature – Base temperature; individual stages may clamp to their own ranges.
monitor – Optional monitor instance, or None to disable metric collection.
- async agenerate_single_qa_samples(*, instruction_y: str, bundle: TaxonomyBundle, spec: SamplingStrategySpec, n: int, K: int = 6, complexify_c: float = 0.0, sequential: bool = False, max_concurrency: int = 2, rng: Random | None = None, max_refine_rounds: int = 4) list[DataPointRecord | None][source]
Generate n single-QA datapoints with independent (mix, meta) draws each time.
Results are ordered by sample index 0 .. n-1. Concurrency is bounded by max_concurrency (each task still performs its own mix, meta-prompt, and critic loop). Each concurrent task draws a fresh RNG stream from rng so subsampling stays deterministic under asyncio without sharing one random.Random across tasks.
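The ordering, concurrency bound, and per-task RNG behaviour described above can be illustrated with a standard-library sketch. This is not the library's implementation: generate_ordered and fake_sample are hypothetical names, and the seed-derivation scheme (one 64-bit seed drawn per task, in index order) is an assumption chosen to make the idea concrete.

```python
import asyncio
import random


async def generate_ordered(n, make_sample, *, max_concurrency=2, rng=None):
    """Run n tasks with bounded concurrency; return results ordered by index."""
    parent = rng or random.Random()
    # Draw one seed per task up front, in index order, so the streams do not
    # depend on how asyncio happens to schedule the tasks.
    seeds = [parent.getrandbits(64) for _ in range(n)]
    gate = asyncio.Semaphore(max_concurrency)

    async def run_one(index):
        async with gate:
            # Each task owns a private random.Random; nothing is shared.
            return await make_sample(index, random.Random(seeds[index]))

    # gather() returns results in argument order, i.e. sample index 0 .. n-1.
    return await asyncio.gather(*(run_one(i) for i in range(n)))


async def fake_sample(index, task_rng):
    # Hypothetical stand-in for the mix / meta-prompt / critic loop.
    await asyncio.sleep(task_rng.random() * 0.01)
    return (index, task_rng.randint(0, 999))


first = asyncio.run(generate_ordered(5, fake_sample, rng=random.Random(7)))
second = asyncio.run(generate_ordered(5, fake_sample, rng=random.Random(7)))
```

With the same parent seed, both runs produce identical results even though the tasks finish in a timing-dependent order.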
- async aiter_single_qa_samples(*, instruction_y: str, bundle: TaxonomyBundle, spec: SamplingStrategySpec, n: int, K: int = 6, complexify_c: float = 0.0, sequential: bool = False, max_concurrency: int = 2, rng: Random | None = None, max_refine_rounds: int = 4) AsyncIterator[tuple[int, DataPointRecord | None]][source]
Like agenerate_single_qa_samples() but yield (index, record) as each task finishes.
Useful for appending to JSONL as samples complete. If the consumer stops early, unfinished tasks are cancelled.
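A minimal sketch of the yield-as-completed pattern with cancellation on early exit, using only asyncio. The names iter_as_completed and fake_sample are hypothetical, and the real method may manage its tasks differently.

```python
import asyncio


async def iter_as_completed(n, make_sample, *, max_concurrency=2):
    """Yield (index, result) as each task finishes; cancel the rest on early exit."""
    gate = asyncio.Semaphore(max_concurrency)

    async def run_one(index):
        async with gate:
            return index, await make_sample(index)

    tasks = [asyncio.create_task(run_one(i)) for i in range(n)]
    try:
        for finished in asyncio.as_completed(tasks):
            yield await finished
    finally:
        # Runs when the generator is exhausted or closed by the consumer.
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    async def fake_sample(index):
        # Hypothetical workload: lower indices take longer, so completion
        # order differs from index order.
        await asyncio.sleep(0.05 * (3 - index))
        return f"record-{index}"

    seen = []
    stream = iter_as_completed(3, fake_sample, max_concurrency=3)
    async for index, record in stream:
        seen.append((index, record))
        if len(seen) == 2:
            break
    await stream.aclose()  # triggers the cleanup in the generator's finally block
    return seen


seen = asyncio.run(demo())
```

Because each item carries its sample index, a consumer can write records to JSONL in completion order and still recover the original ordering later.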
- async build_taxonomy(instruction_y: str, *, document_provider: DocumentProvider | None = None, target_depth_D: int = 3, proposal_N: int = 3, max_factors: int = 4, max_children_per_node: int = 8, max_frontier_per_depth: int = 16, show_progress: bool = False) TaxonomyBundle[source]
Phase: global diversification — build factor taxonomies (Appendix B.4).
max_factors, max_children_per_node, and max_frontier_per_depth bound API cost. Without them, wide trees multiply into hundreds of sequential LLM calls, which can mean minutes with no visible output.
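To see how the caps limit cost, the following back-of-the-envelope sketch counts node expansions under a simplified cost model. The model is an assumption, not a description of what build_taxonomy actually does: it supposes one expansion per frontier node at each of target_depth_D levels, and calls_per_expansion is a hypothetical knob for stages that need several LLM calls per node.

```python
def expansion_call_bound(*, max_factors, max_children_per_node, target_depth_D,
                         max_frontier_per_depth=None, calls_per_expansion=1):
    """Upper bound on node-expansion calls under a simple, assumed cost model."""
    per_factor = 0
    frontier = 1  # each factor starts from a single root node
    for _ in range(target_depth_D):
        per_factor += frontier * calls_per_expansion
        # Every expanded node can contribute up to max_children_per_node children.
        frontier *= max_children_per_node
        if max_frontier_per_depth is not None:
            frontier = min(frontier, max_frontier_per_depth)
    return max_factors * per_factor


# Documented defaults: 4 factors, 8 children per node, depth 3, frontier cap 16.
capped = expansion_call_bound(max_factors=4, max_children_per_node=8,
                              target_depth_D=3, max_frontier_per_depth=16)
# Same tree shape with no frontier cap.
uncapped = expansion_call_bound(max_factors=4, max_children_per_node=8,
                                target_depth_D=3)
```

Under these assumptions the default caps give a bound of 100 expansions, against 292 with the frontier cap removed; the gap widens quickly as depth grows.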
- async draw_meta_prompt(*, instruction_y: str, bundle: TaxonomyBundle, mix: Mix, K: int = 4, complexify_c: float = 0.0, sequential: bool = False, rng: Random | None = None) MetaPrompt[source]
Local diversification (+ optional complexification).
- async generate_mcq_datapoint(*, instruction_y: str, bundle: TaxonomyBundle, mix: Mix, meta: MetaPrompt, num_choices: int = 4, max_refine_rounds: int = 4) DataPointRecord | None[source]
MCQ with requirement critic + double-critic gate.
- async generate_single_qa_datapoint(*, instruction_y: str, bundle: TaxonomyBundle, mix: Mix, meta: MetaPrompt, max_refine_rounds: int = 4) DataPointRecord | None[source]
Single QA with requirement-critic loop (no double-critic).
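The general shape of a critic loop that either accepts a draft or gives up and returns None can be sketched as follows. The actual control flow, critic output format, and meaning of a "round" in generate_single_qa_datapoint are not documented here, so refine_until_accepted, toy_critic, and toy_refine are hypothetical illustrations only.

```python
def refine_until_accepted(draft, critic, refine, *, max_refine_rounds=4):
    """Return an accepted draft, or None if the critic never accepts."""
    for round_no in range(max_refine_rounds + 1):
        ok, feedback = critic(draft)
        if ok:
            return draft
        if round_no < max_refine_rounds:
            # Use the critic's feedback to produce the next draft.
            draft = refine(draft, feedback)
    return None


def toy_critic(draft):
    # Hypothetical requirement check: both fields must be present.
    missing = [field for field in ("question", "answer") if field not in draft]
    return (not missing, missing)


def toy_refine(draft, feedback):
    # Hypothetical refinement: fix the first reported problem only.
    return {**draft, feedback[0]: "filled in"}


accepted = refine_until_accepted({}, toy_critic, toy_refine)
rejected = refine_until_accepted({}, toy_critic, toy_refine, max_refine_rounds=1)
```

Returning None rather than raising matches the DataPointRecord | None return type, which lets batch callers keep going and filter out failed samples afterwards.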
- async infer_strategies(bundle: TaxonomyBundle) SamplingStrategySpec[source]
Propose weighted joint-sampling strategies (paper §2.2).
- class afterimage.simula.SimulaInstructionGeneratorCallback(scenarios: list[tuple[str, dict[str, Any]]], *, context_prefix: str = 'simula')[source]
Bases: BaseInstructionGeneratorCallback
Yields precomputed Simula scenarios as user instructions (one per call).
Checkpointing and export
- class afterimage.simula.Checkpointer(checkpoint_root: Path | str, *, validate_taxonomies: bool = True, clear_stale_optional: bool = True)[source]
Bases: object
Collect OpenSimula artifacts under <root>/opensimula/ and write manifest.json on exit.
Typical usage:

    from afterimage.simula import Checkpointer, OpenSimulaRunConfig

    # bundle (TaxonomyBundle) and spec (SamplingStrategySpec) come from earlier pipeline steps.
    with Checkpointer("./run") as cp:
        bundle.save(cp)
        spec.save(cp)
        cp.write_run_config(OpenSimulaRunConfig(name="demo", model="gemini-2.5-flash"))
    # manifest.json is written when the with block exits, so the push happens afterwards.
    url = cp.push_to_hub("org/dataset-repo")

Call write_taxonomy_bundle() (or bundle.save(cp)) at least once before the context exits. Optional files are removed on enter when clear_stale_optional is true so omitted spec.save / write_run_config do not leave stale JSON.
- async apush_to_hub(repo_id: str, *, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', token: str | None = None, commit_message: str | None = None, private: bool = False, path_in_repo: str = 'opensimula', dataset_card: str | None = None) str[source]
Same as push_to_hub(), but runs blocking Hub I/O in a worker thread.
Prefer this from async code so uploads do not block the event loop.
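The worker-thread pattern can be illustrated with asyncio.to_thread from the standard library. Whether apush_to_hub uses this exact call is an assumption; blocking_upload is a hypothetical stand-in for a synchronous upload.

```python
import asyncio
import threading
import time


def blocking_upload(folder):
    # Hypothetical stand-in for a synchronous, network-bound upload call.
    time.sleep(0.05)
    on_main_thread = threading.current_thread() is threading.main_thread()
    return f"uploaded:{folder}", on_main_thread


async def main():
    ticks = 0

    async def heartbeat():
        # Keeps counting only if the event loop stays responsive.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    result, on_main_thread = await asyncio.to_thread(blocking_upload, "opensimula")
    beat.cancel()
    return result, on_main_thread, ticks


result, on_main_thread, ticks = asyncio.run(main())
```

The heartbeat keeps ticking while the upload sleeps, which shows the event loop stays free for other coroutines.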
- finalize() OpenSimulaManifest[source]
Write manifest.json immediately (usually you rely on context exit instead).
- property opensimula_dir: Path
- push_to_hub(repo_id: str, *, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', token: str | None = None, commit_message: str | None = None, private: bool = False, path_in_repo: str = 'opensimula', dataset_card: str | None = None) str[source]
Upload <root>/opensimula/ to the Hugging Face Hub (creates the repo if missing).
Requires manifest.json on disk, for example after the with block exits or after finalize().
dataset_card becomes the repository README.md at the Hub root. When omitted or blank, a default card is generated (YAML tags frontmatter plus a short introduction with links to AfterImage and the Simula paper / blog).
- write_run_config(config: OpenSimulaRunConfig) None[source]
Write run_config.json (call after write_taxonomy_bundle()).
- write_sampling_strategy(spec: SamplingStrategySpec) None[source]
Write sampling_strategy.json (call after write_taxonomy_bundle()).
- class afterimage.simula.SimulaCheckpoint(manifest: OpenSimulaManifest, bundle: TaxonomyBundle, sampling_strategy: SamplingStrategySpec | None, run_config: OpenSimulaRunConfig | None, root: Path)[source]
Bases: object
Loaded checkpoint: manifest + parsed models + optional extras.
- bundle: TaxonomyBundle
- manifest: OpenSimulaManifest
- root: Path
- run_config: OpenSimulaRunConfig | None
- sampling_strategy: SamplingStrategySpec | None
- class afterimage.simula.OpenSimulaRunConfig(*, name: str | None = None, description: str | None = None, model: str | None = None, temperature: float | None = None, target_depth_D: int | None = None, proposal_N: int | None = None, meta_prompt_K: int | None = None, complexify_c: float | None = None, max_factors: int | None = None, max_children_per_node: int | None = None, max_frontier_per_depth: int | None = None, num_choices: int | None = None, num_samples: int | None = None, max_concurrency: int | None = None, seed: int | None = None, data_jsonl: str | None = None, corpus_excerpt_count: int | None = None)[source]
Bases: BaseModel
Typed metadata and hyperparameters stored in run_config.json beside a checkpoint.
- complexify_c: float | None
- corpus_excerpt_count: int | None
- data_jsonl: str | None
- description: str | None
- max_children_per_node: int | None
- max_concurrency: int | None
- max_factors: int | None
- max_frontier_per_depth: int | None
- meta_prompt_K: int | None
- model: str | None
- model_config = {'extra': 'ignore'}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- name: str | None
- num_choices: int | None
- num_samples: int | None
- proposal_N: int | None
- seed: int | None
- target_depth_D: int | None
- temperature: float | None
- class afterimage.simula.OpenSimulaManifest(*, producer: Literal['afterimage'] = 'afterimage', format: Literal['opensimula'] = 'opensimula', format_version: str = '1.0', created_at: str, afterimage_version: str | None = None, instruction_y_sha256: str, taxonomy_bundle_sha256: str, sampling_strategy_sha256: str | None = None, taxonomy_bundle_file: str = 'taxonomy_bundle.json', sampling_strategy_file: str | None = None, run_config_file: str | None = None)[source]
Bases: BaseModel
Versioned checkpoint manifest (portable across tools that understand format).
- afterimage_version: str | None
- created_at: str
- format: Literal['opensimula']
- format_version: str
- instruction_y_sha256: str
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- producer: Literal['afterimage']
- run_config_file: str | None
- sampling_strategy_file: str | None
- sampling_strategy_sha256: str | None
- taxonomy_bundle_file: str
- taxonomy_bundle_sha256: str
- afterimage.simula.save_checkpoint(checkpoint_root: Path | str, *, bundle: TaxonomyBundle, sampling_strategy: SamplingStrategySpec | None = None, run_config: OpenSimulaRunConfig | None = None, validate_taxonomies: bool = True) OpenSimulaManifest[source]
Write opensimula/ under checkpoint_root and return the manifest.
Equivalent to using Checkpointer with bundle.save / spec.save / Checkpointer.write_run_config().
- afterimage.simula.load_checkpoint(checkpoint_root: Path | str, *, verify_digests: bool = True, validate_taxonomies: bool = True) SimulaCheckpoint[source]
Load opensimula/ from checkpoint_root.
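The idea behind verify_digests can be sketched with hashlib: record a SHA-256 digest in the manifest when writing, then recompute and compare when loading. How the library actually computes its digests (raw file bytes versus canonicalised JSON, for instance) is not stated here, so this sketch assumes raw bytes. The helper names and the "factors" payload are hypothetical; only the manifest field names are taken from OpenSimulaManifest.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    # Assumption: the digest covers the raw bytes of the file.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_manifest(folder):
    folder = Path(folder)
    manifest = {
        "taxonomy_bundle_file": "taxonomy_bundle.json",
        "taxonomy_bundle_sha256": sha256_of(folder / "taxonomy_bundle.json"),
    }
    (folder / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_manifest(folder):
    folder = Path(folder)
    manifest = json.loads((folder / "manifest.json").read_text())
    actual = sha256_of(folder / manifest["taxonomy_bundle_file"])
    return actual == manifest["taxonomy_bundle_sha256"]


with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / "taxonomy_bundle.json"
    bundle_path.write_text(json.dumps({"factors": []}))
    write_manifest(tmp)
    intact = verify_manifest(tmp)
    # Editing the bundle after the manifest was written breaks the digest.
    bundle_path.write_text(json.dumps({"factors": ["edited"]}))
    tampered = verify_manifest(tmp)
```

A check of this kind catches files that were edited or partially downloaded after the manifest was written.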
- afterimage.simula.push_checkpoint_to_hub(checkpoint_root: Path | str, repo_id: str, *, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', token: str | None = None, commit_message: str | None = None, private: bool = False, path_in_repo: str = 'opensimula', dataset_card: str | None = None) str[source]
Upload local opensimula/ to the Hub under path_in_repo (default opensimula).
Same as Checkpointer(checkpoint_root).push_to_hub(...). Returns the canonical repo URL.
- afterimage.simula.pull_checkpoint_from_hub(repo_id: str, checkpoint_root: Path | str, *, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str | None = None, token: str | None = None, path_in_repo: str = 'opensimula') Path[source]
Download path_in_repo/** from the Hub into checkpoint_root (merging with snapshot_download).
Returns opensimula_dir(checkpoint_root).
- afterimage.simula.append_datapoints_jsonl(path: Path | str, records: Iterable[DataPointRecord], *, mkdir: bool = True) int[source]
Append each record as one JSON line. Creates parent directories when mkdir is true.
Returns the number of lines written.
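A standard-library sketch of the same append-and-count behaviour, using plain dicts in place of DataPointRecord. How the real function serialises its records is not shown here, so json.dumps on dicts is an assumption and append_jsonl is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path


def append_jsonl(path, records, *, mkdir=True):
    """Append each record as one JSON line and return the number written."""
    path = Path(path)
    if mkdir:
        path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    # Mode "a" appends, so repeated calls extend the same file.
    with path.open("a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "data.jsonl"
    first_count = append_jsonl(target, [{"q": "a"}, {"q": "b"}])
    second_count = append_jsonl(target, [{"q": "c"}])
    lines = target.read_text(encoding="utf-8").splitlines()
```

Appending rather than rewriting means a crashed run keeps every record written so far.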
- afterimage.simula.configure_example_console(*, simula_level: int = 30, root_level: int = 30) None[source]
One-line setup for example scripts: quiet root, optional simula detail, no httpx spam.
Use simula_level=logging.INFO when you want afterimage.simula DEBUG/INFO without tqdm (e.g. show_progress=False on build_taxonomy).
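A rough equivalent using only the logging module, to show what the two level arguments control. The default of 30 in the signature is logging.WARNING. The logger names "afterimage.simula" and "httpx" are assumptions based on the description above, and quiet_console is a hypothetical name rather than the library function.

```python
import logging


def quiet_console(*, simula_level=logging.WARNING, root_level=logging.WARNING):
    """Quiet root logger, adjustable package logger, httpx capped at WARNING."""
    logging.basicConfig(level=root_level,
                        format="%(levelname)s %(name)s: %(message)s")
    # basicConfig is a no-op if handlers already exist, so set levels explicitly.
    logging.getLogger().setLevel(root_level)
    logging.getLogger("afterimage.simula").setLevel(simula_level)
    logging.getLogger("httpx").setLevel(logging.WARNING)


quiet_console(simula_level=logging.INFO)
simula_logger = logging.getLogger("afterimage.simula")
info_enabled = simula_logger.isEnabledFor(logging.INFO)
debug_enabled = simula_logger.isEnabledFor(logging.DEBUG)
```

Under standard logging semantics, a level of logging.INFO lets INFO and above through but still filters DEBUG records; pass logging.DEBUG to see those as well.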