Simula / OpenSimula

Experimental Simula-style synthetic data pipeline (afterimage.simula).

class afterimage.simula.OpenSimula(llm: LLMProvider, *, temperature: float = 0.4, monitor: GenerationMonitor | None = None)

Bases: object

High-level API for Simula-style synthetic dataset mechanisms (experimental).

All structured LLM stages (taxonomy construction, strategy inference, meta-prompt diversification and complexification, requirement critics, double-critic probes for MCQ, and task JSON generation) accept an optional GenerationMonitor. When monitor is set, each call is wrapped with track_generation and metadata including component="opensimula" and a dotted operation label. Call shutdown() on the monitor when the run completes.

Parameters:
  • llm – Provider used for every structured generation in this pipeline.

  • temperature – Base temperature; individual stages may clamp to their own ranges.

  • monitor – Optional monitor instance, or None to disable metric collection.

async agenerate_single_qa_samples(*, instruction_y: str, bundle: TaxonomyBundle, spec: SamplingStrategySpec, n: int, K: int = 6, complexify_c: float = 0.0, sequential: bool = False, max_concurrency: int = 2, rng: Random | None = None, max_refine_rounds: int = 4) -> list[DataPointRecord | None]

Generate n single-QA datapoints with independent (mix, meta) draws each time.

Results are ordered by sample index 0 .. n-1. Concurrency is bounded by max_concurrency (each task still performs its own mix, meta-prompt, and critic loop). Each concurrent task draws a fresh RNG stream from rng so subsampling stays deterministic under asyncio without sharing one random.Random across tasks.
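The ordering and determinism described above can be illustrated with a minimal sketch. It is not the library's implementation: run_bounded is a hypothetical name, the seeds-up-front scheme is one assumed way to give each task its own stream, and asyncio.sleep stands in for the LLM calls.

```python
import asyncio
import random


async def run_bounded(n, max_concurrency=2, rng=None):
    """Run n tasks with bounded concurrency; each task gets its own RNG."""
    rng = rng or random.Random()
    # Draw one seed per task up front, in index order, so the streams do not
    # depend on how asyncio happens to schedule the tasks.
    seeds = [rng.getrandbits(64) for _ in range(n)]
    gate = asyncio.Semaphore(max_concurrency)

    async def one(index):
        task_rng = random.Random(seeds[index])  # private stream for this task
        async with gate:  # at most max_concurrency tasks run this section
            await asyncio.sleep(0)  # stand-in for the LLM calls
            return index, task_rng.random()

    # gather() returns results in argument order, i.e. sample index 0 .. n-1.
    return await asyncio.gather(*(one(i) for i in range(n)))
```

Under these assumptions, two runs seeded identically return the same values regardless of max_concurrency, because no random.Random instance is shared between tasks.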

async aiter_single_qa_samples(*, instruction_y: str, bundle: TaxonomyBundle, spec: SamplingStrategySpec, n: int, K: int = 6, complexify_c: float = 0.0, sequential: bool = False, max_concurrency: int = 2, rng: Random | None = None, max_refine_rounds: int = 4) -> AsyncIterator[tuple[int, DataPointRecord | None]]

Like agenerate_single_qa_samples() but yield (index, record) as each task finishes.

Useful for appending to JSONL as samples complete. If the consumer stops early, unfinished tasks are cancelled.
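A rough sketch of the yield-as-finished pattern with cancellation on early exit, using only the standard library. The names iter_as_completed and first_two are hypothetical, and the sleeps stand in for sample generation; how the library itself schedules and cancels tasks is not shown in this reference.

```python
import asyncio


async def iter_as_completed(delays):
    """Yield (index, value) as each task finishes; cancel leftovers on exit."""

    async def work(index, delay):
        await asyncio.sleep(delay)  # stand-in for one sample's generation
        return index, delay

    tasks = [asyncio.ensure_future(work(i, d)) for i, d in enumerate(delays)]
    try:
        for finished in asyncio.as_completed(tasks):
            yield await finished
    finally:
        # Runs when the consumer stops early and the generator is closed.
        for task in tasks:
            task.cancel()  # no effect on tasks that already finished


async def first_two(delays):
    """Consume only the first two results, then stop."""
    seen = []
    agen = iter_as_completed(delays)
    async for item in agen:
        seen.append(item)
        if len(seen) == 2:
            break
    await agen.aclose()  # close explicitly so the cleanup runs right away
    return seen
```

Because results arrive in completion order, the index in each tuple is what lets a consumer restore sample order later.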

async build_taxonomy(instruction_y: str, *, document_provider: DocumentProvider | None = None, target_depth_D: int = 3, proposal_N: int = 3, max_factors: int = 4, max_children_per_node: int = 8, max_frontier_per_depth: int = 16, show_progress: bool = False) -> TaxonomyBundle

Phase: global diversification — build factor taxonomies (Appendix B.4).

max_factors, max_children_per_node, and max_frontier_per_depth bound API cost. Without them, wide trees multiply into hundreds of sequential LLM calls, which can mean minutes with no visible output.
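To see why the caps matter, here is a back-of-the-envelope bound under an assumed cost model: one LLM call per expanded node, each tree starting from a single root. The function name and the cost model are illustrative assumptions, not the library's accounting (proposal_N, for instance, is ignored).

```python
def max_expansion_calls(depth, max_factors, max_children, max_frontier):
    """Upper bound on node expansions under a simple, assumed cost model."""
    calls_per_tree = 0
    frontier = 1  # each factor tree starts at its root
    for _ in range(depth):
        expanded = min(frontier, max_frontier)  # frontier cap per depth
        calls_per_tree += expanded              # assumed: one call per node
        frontier = expanded * max_children      # widest possible next level
    return max_factors * calls_per_tree


capped = max_expansion_calls(3, 4, 8, 16)        # the documented defaults
uncapped = max_expansion_calls(3, 4, 8, 10**9)   # frontier cap disabled
```

Under this model the defaults give at most 100 expansions, against 292 when the frontier cap is effectively disabled; the gap widens quickly with depth or with more children per node.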

async draw_meta_prompt(*, instruction_y: str, bundle: TaxonomyBundle, mix: Mix, K: int = 4, complexify_c: float = 0.0, sequential: bool = False, rng: Random | None = None) -> MetaPrompt

Local diversification (+ optional complexification).

async generate_mcq_datapoint(*, instruction_y: str, bundle: TaxonomyBundle, mix: Mix, meta: MetaPrompt, num_choices: int = 4, max_refine_rounds: int = 4) -> DataPointRecord | None

MCQ with requirement critic + double-critic gate.

async generate_single_qa_datapoint(*, instruction_y: str, bundle: TaxonomyBundle, mix: Mix, meta: MetaPrompt, max_refine_rounds: int = 4) -> DataPointRecord | None

Single QA with requirement-critic loop (no double-critic).
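The general shape of a requirement-critic loop can be sketched as follows. This is a synchronous toy with hypothetical callables, not the library's code; in particular, whether max_refine_rounds counts the first draft and whether None signals an exhausted budget are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def refine_until_accepted(
    generate: Callable[[Optional[str]], str],
    critic: Callable[[str], Tuple[bool, str]],
    max_refine_rounds: int = 4,
) -> Optional[str]:
    """Return the first draft the critic accepts, or None if none passes."""
    feedback = None
    for _ in range(max_refine_rounds):
        draft = generate(feedback)          # first round has no feedback
        accepted, feedback = critic(draft)  # critic explains any rejection
        if accepted:
            return draft
    return None  # assumed meaning of a None result: the budget ran out
```

Callers of the real methods should likewise be prepared for None, since the return type is DataPointRecord | None.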

async infer_strategies(bundle: TaxonomyBundle) -> SamplingStrategySpec

Propose weighted joint-sampling strategies (paper §2.2).

sample_mix(bundle: TaxonomyBundle, spec: SamplingStrategySpec, rng: Random | None = None) -> Mix

Sample one mix from strategies.
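As a loose illustration of weighted joint sampling, the sketch below picks a strategy in proportion to its weight and then one leaf per factor named by that strategy. The dictionary layout, the "weight" and "factors" keys, and the function name are all hypothetical; the actual structure of SamplingStrategySpec and Mix is not described in this reference.

```python
import random


def sample_mix_sketch(strategies, leaves, rng=None):
    """Pick a strategy by weight, then one leaf for each factor it names."""
    rng = rng or random.Random()
    weights = [s["weight"] for s in strategies]
    chosen = rng.choices(strategies, weights=weights, k=1)[0]
    return {factor: rng.choice(leaves[factor]) for factor in chosen["factors"]}


strategies = [
    {"weight": 1.0, "factors": ["topic", "tone"]},  # hypothetical strategy
    {"weight": 0.0, "factors": ["topic"]},          # weight 0: never chosen
]
leaves = {"topic": ["algebra", "geometry"], "tone": ["formal", "casual"]}
mix = sample_mix_sketch(strategies, leaves, rng=random.Random(0))
```

Passing a seeded random.Random makes the draw reproducible, which mirrors the optional rng parameter in the signature above.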

static validate_taxonomy_bundle(bundle: TaxonomyBundle) -> None

Validate all factor trees (call after construction).

class afterimage.simula.SimulaInstructionGeneratorCallback(scenarios: list[tuple[str, dict[str, Any]]], *, context_prefix: str = 'simula')

Bases: BaseInstructionGeneratorCallback

Yields precomputed Simula scenarios as user instructions (one per call).

async agenerate(original_prompt: str) -> GeneratedInstructions

generate(original_prompt: str) -> GeneratedInstructions

reset() -> None

set_monitor(monitor) -> None

Checkpointing and export

class afterimage.simula.Checkpointer(checkpoint_root: Path | str, *, validate_taxonomies: bool = True, clear_stale_optional: bool = True)

Bases: object

Collect OpenSimula artifacts under <root>/opensimula/ and write manifest.json on exit.

Typical usage:

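from afterimage.simula import Checkpointer, OpenSimulaRunConfig

# bundle: a TaxonomyBundle (e.g. from build_taxonomy()); spec: a SamplingStrategySpec
with Checkpointer("./run") as cp:
    bundle.save(cp)
    spec.save(cp)
    cp.write_run_config(OpenSimulaRunConfig(name="demo", model="gemini-2.5-flash"))
# Push after the block exits, once manifest.json has been written.
url = cp.push_to_hub("org/dataset-repo")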

Call write_taxonomy_bundle() (or bundle.save(cp)) at least once before the context exits. When clear_stale_optional is true, optional files are removed on enter, so a run that skips spec.save or write_run_config does not leave stale JSON behind from an earlier run.

async apush_to_hub(repo_id: str, *, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', token: str | None = None, commit_message: str | None = None, private: bool = False, path_in_repo: str = 'opensimula', dataset_card: str | None = None) -> str

Same as push_to_hub(), but runs blocking Hub I/O in a worker thread.

Prefer this from async code so uploads do not block the event loop.
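The worker-thread pattern can be sketched with asyncio.to_thread (Python 3.9+). Here blocking_upload is a hypothetical stand-in that sleeps instead of contacting the Hub, and the returned URL is a placeholder; whether the library uses to_thread specifically is an assumption.

```python
import asyncio
import time


def blocking_upload(repo_id):
    """Stand-in for a synchronous upload; sleeps instead of doing network I/O."""
    time.sleep(0.1)
    return f"https://example.invalid/{repo_id}"


async def upload_in_thread(repo_id):
    # Run the blocking call in a worker thread so that other coroutines
    # keep running on the event loop while it waits.
    return await asyncio.to_thread(blocking_upload, repo_id)


async def demo():
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    beat = asyncio.ensure_future(heartbeat())
    url = await upload_in_thread("org/dataset-repo")
    beat.cancel()
    return url, ticks
```

The heartbeat keeps ticking during the upload, which it could not do if blocking_upload were called directly inside the coroutine.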

finalize() -> OpenSimulaManifest

Write manifest.json immediately (usually you rely on context exit instead).

property opensimula_dir: Path

push_to_hub(repo_id: str, *, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', token: str | None = None, commit_message: str | None = None, private: bool = False, path_in_repo: str = 'opensimula', dataset_card: str | None = None) -> str

Upload <root>/opensimula/ to the Hugging Face Hub (creates the repo if missing).

Requires manifest.json on disk—for example after the with block exits or after finalize().

dataset_card becomes the repository README.md at the Hub root. When omitted or blank, a default card is generated (YAML tags frontmatter plus a short introduction with links to AfterImage and the Simula paper / blog).

write_run_config(config: OpenSimulaRunConfig) -> None

Write run_config.json (call after write_taxonomy_bundle()).

write_sampling_strategy(spec: SamplingStrategySpec) -> None

Write sampling_strategy.json (call after write_taxonomy_bundle()).

write_taxonomy_bundle(bundle: TaxonomyBundle) -> None

Write taxonomy_bundle.json and record digests for the manifest.

class afterimage.simula.SimulaCheckpoint(manifest: OpenSimulaManifest, bundle: TaxonomyBundle, sampling_strategy: SamplingStrategySpec | None, run_config: OpenSimulaRunConfig | None, root: Path)

Bases: object

Loaded checkpoint: manifest + parsed models + optional extras.

bundle: TaxonomyBundle
manifest: OpenSimulaManifest
root: Path
run_config: OpenSimulaRunConfig | None
sampling_strategy: SamplingStrategySpec | None

class afterimage.simula.OpenSimulaRunConfig(*, name: str | None = None, description: str | None = None, model: str | None = None, temperature: float | None = None, target_depth_D: int | None = None, proposal_N: int | None = None, meta_prompt_K: int | None = None, complexify_c: float | None = None, max_factors: int | None = None, max_children_per_node: int | None = None, max_frontier_per_depth: int | None = None, num_choices: int | None = None, num_samples: int | None = None, max_concurrency: int | None = None, seed: int | None = None, data_jsonl: str | None = None, corpus_excerpt_count: int | None = None)

Bases: BaseModel

Typed metadata and hyperparameters stored in run_config.json beside a checkpoint.

complexify_c: float | None
corpus_excerpt_count: int | None
data_jsonl: str | None
description: str | None
max_children_per_node: int | None
max_concurrency: int | None
max_factors: int | None
max_frontier_per_depth: int | None
meta_prompt_K: int | None
model: str | None
model_config = {'extra': 'ignore'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str | None
num_choices: int | None
num_samples: int | None
proposal_N: int | None
seed: int | None
target_depth_D: int | None
temperature: float | None

class afterimage.simula.OpenSimulaManifest(*, producer: Literal['afterimage'] = 'afterimage', format: Literal['opensimula'] = 'opensimula', format_version: str = '1.0', created_at: str, afterimage_version: str | None = None, instruction_y_sha256: str, taxonomy_bundle_sha256: str, sampling_strategy_sha256: str | None = None, taxonomy_bundle_file: str = 'taxonomy_bundle.json', sampling_strategy_file: str | None = None, run_config_file: str | None = None)

Bases: BaseModel

Versioned checkpoint manifest (portable across tools that understand format).

afterimage_version: str | None
created_at: str
format: Literal['opensimula']
format_version: str
instruction_y_sha256: str
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

producer: Literal['afterimage']
run_config_file: str | None
sampling_strategy_file: str | None
sampling_strategy_sha256: str | None
taxonomy_bundle_file: str
taxonomy_bundle_sha256: str

afterimage.simula.save_checkpoint(checkpoint_root: Path | str, *, bundle: TaxonomyBundle, sampling_strategy: SamplingStrategySpec | None = None, run_config: OpenSimulaRunConfig | None = None, validate_taxonomies: bool = True) -> OpenSimulaManifest

Write opensimula/ under checkpoint_root and return the manifest.

Equivalent to using Checkpointer with bundle.save / spec.save / Checkpointer.write_run_config().

afterimage.simula.load_checkpoint(checkpoint_root: Path | str, *, verify_digests: bool = True, validate_taxonomies: bool = True) -> SimulaCheckpoint

Load opensimula/ from checkpoint_root.
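What a digest check of the kind implied by verify_digests might look like, as a minimal sketch. The field names taxonomy_bundle_file and taxonomy_bundle_sha256 come from OpenSimulaManifest; hashing the raw file bytes, the plain-dict manifest, and the helper names are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_bundle(directory, manifest):
    """Raise ValueError if the bundle file does not match its recorded digest."""
    bundle_path = Path(directory) / manifest["taxonomy_bundle_file"]
    if sha256_of(bundle_path) != manifest["taxonomy_bundle_sha256"]:
        raise ValueError(f"digest mismatch for {bundle_path.name}")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        bundle_path = Path(tmp) / "taxonomy_bundle.json"
        bundle_path.write_text('{"factors": []}', encoding="utf-8")
        manifest = {
            "taxonomy_bundle_file": bundle_path.name,
            "taxonomy_bundle_sha256": sha256_of(bundle_path),
        }
        verify_bundle(tmp, manifest)  # passes: file is unchanged
        bundle_path.write_text("{}", encoding="utf-8")
        try:
            verify_bundle(tmp, manifest)  # file was modified after hashing
        except ValueError:
            return "mismatch detected"
    return "no mismatch"
```

A check like this catches files edited or truncated after the manifest was written, which is presumably why verify_digests defaults to True.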

afterimage.simula.push_checkpoint_to_hub(checkpoint_root: Path | str, repo_id: str, *, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', token: str | None = None, commit_message: str | None = None, private: bool = False, path_in_repo: str = 'opensimula', dataset_card: str | None = None) -> str

Upload local opensimula/ to the Hub under path_in_repo (default opensimula).

Same as Checkpointer(checkpoint_root).push_to_hub(...). Returns the canonical repo URL.

afterimage.simula.pull_checkpoint_from_hub(repo_id: str, checkpoint_root: Path | str, *, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str | None = None, token: str | None = None, path_in_repo: str = 'opensimula') -> Path

Download path_in_repo/** from the Hub into checkpoint_root (merging with snapshot_download).

Returns opensimula_dir(checkpoint_root).

afterimage.simula.append_datapoints_jsonl(path: Path | str, records: Iterable[DataPointRecord], *, mkdir: bool = True) -> int

Append each record as one JSON line. Creates parent directories when mkdir is true.

Returns the number of lines written.
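The behaviour described above is roughly equivalent to the following standard-library sketch. It uses plain dicts in place of DataPointRecord, and append_jsonl is a hypothetical name; how the library serializes its records is not shown in this reference.

```python
import json
import tempfile
from pathlib import Path


def append_jsonl(path, records, mkdir=True):
    """Append each record as one JSON line and return the number written."""
    path = Path(path)
    if mkdir:
        path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with path.open("a", encoding="utf-8") as handle:  # "a": never truncates
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "out" / "data.jsonl"  # parent does not exist yet
        first = append_jsonl(target, [{"q": "a"}, {"q": "b"}])
        second = append_jsonl(target, [{"q": "c"}])
        lines = target.read_text(encoding="utf-8").splitlines()
    return first, second, [json.loads(line)["q"] for line in lines]
```

Opening in append mode is what makes repeated calls safe to combine with streaming generation: each call adds lines without touching earlier ones.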

afterimage.simula.configure_example_console(*, simula_level: int = 30, root_level: int = 30) -> None

One-line setup for example scripts: quiet root, optional simula detail, no httpx spam.

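Use simula_level=logging.INFO when you want afterimage.simula INFO messages without tqdm (e.g. show_progress=False on build_taxonomy); pass logging.DEBUG to include DEBUG detail as well. The default of 30 for both levels is logging.WARNING.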
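A comparable setup can be written with the standard logging module alone. This is a sketch, not the library's implementation: configure_console is a hypothetical name and the list of noisy loggers is an assumption.

```python
import logging


def configure_console(simula_level=logging.WARNING, root_level=logging.WARNING):
    """Quiet the root logger, set one package logger, mute HTTP client logs."""
    logging.basicConfig(format="%(levelname)s %(name)s: %(message)s")
    logging.getLogger().setLevel(root_level)
    logging.getLogger("afterimage.simula").setLevel(simula_level)
    for noisy in ("httpx", "httpcore"):  # assumed list of chatty loggers
        logging.getLogger(noisy).setLevel(logging.WARNING)


configure_console(simula_level=logging.INFO)
simula_log = logging.getLogger("afterimage.simula")
```

With the standard logging module, a logger set to INFO emits INFO and above but not DEBUG, so simula_log.isEnabledFor(logging.DEBUG) is False here while the INFO check is True.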

afterimage.simula.silence_noisy_third_party_loggers(level: int = 30) -> None

Turn down chatty HTTP and google-genai log lines during OpenSimula runs.