Runtime

The experiment runtime is the current launch surface for experiments. It accepts a typed `RunConfig`, writes the run manifest and event log, and dispatches the stage-specific work directly:

- fit / test → Lightning trainer launch with config-driven data/model instantiation
- cache → materialize the configured graph cache
- extract → feature extraction over the configured checkpoints and dataset
- analyze → per-checkpoint artifact generation through `graphids.core.artifacts.analyzer.Analyzer`
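The dispatch above amounts to a lookup from stage name to handler. A minimal sketch, assuming hypothetical handler functions (`_train_or_test`, `_cache`, `_extract`, `_analyze`) and a hypothetical `dispatch` helper that stand in for the real Lightning, graph-cache, feature-extraction, and `Analyzer` calls:

```python
from typing import Any, Callable


# Hypothetical handlers: the real runtime launches Lightning, builds the graph
# cache, extracts features, or runs the Analyzer. Here each just reports its stage.
def _train_or_test(stage: str) -> dict[str, Any]:
    return {"stage": stage, "kind": "lightning"}


def _cache(stage: str) -> dict[str, Any]:
    return {"stage": stage, "kind": "graph_cache"}


def _extract(stage: str) -> dict[str, Any]:
    return {"stage": stage, "kind": "features"}


def _analyze(stage: str) -> dict[str, Any]:
    return {"stage": stage, "kind": "artifacts"}


# One table maps every launchable stage name to its handler.
STAGE_HANDLERS: dict[str, Callable[[str], dict[str, Any]]] = {
    "fit": _train_or_test,
    "test": _train_or_test,
    "cache": _cache,
    "extract": _extract,
    "analyze": _analyze,
}


def dispatch(stage: str) -> dict[str, Any]:
    """Route a stage name to its handler, failing loudly on unknown stages."""
    try:
        handler = STAGE_HANDLERS[stage]
    except KeyError:
        raise ValueError(
            f"unknown stage {stage!r}; expected one of {sorted(STAGE_HANDLERS)}"
        ) from None
    return handler(stage)
```

With this table, `dispatch("fit")` returns `{"stage": "fit", "kind": "lightning"}`, and an unknown stage raises `ValueError` instead of silently doing nothing.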

The runtime module is intentionally narrow and replaces the old row/orchestrate chassis.

`graphids.exp.runtime`

runtime

Execution helpers for the new experiment seam.

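Hydra can attach here later. For now this module provides a single place to write manifests and events around a run body; Ray is already supported as an optional launch backend, with a fallback to local execution when Ray is not installed.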
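One common way to keep a manifest and an event log around a run is an atomically rewritten JSON manifest plus an append-only JSON Lines event file. The sketch below assumes that layout; the file names `manifest.json` and `events.jsonl`, the dict-based manifest, and this `EventRecord` dataclass are illustrative stand-ins, not the project's actual `write_manifest`, `append_event`, or `EventRecord`:

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class EventRecord:
    # Hypothetical mirror of the fields launch_run passes to its EventRecord.
    status: str
    stage: str
    message: str
    details: dict[str, Any] = field(default_factory=dict)


def write_manifest(run_dir: Path, manifest: dict[str, Any], name: str = "manifest.json") -> Path:
    """Overwrite the manifest by writing a temp file, then renaming it into place."""
    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / name
    fd, tmp = tempfile.mkstemp(dir=run_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(manifest, fh, indent=2)
    os.replace(tmp, target)  # atomic rename on POSIX filesystems
    return target


def append_event(run_dir: Path, event: EventRecord, name: str = "events.jsonl") -> Path:
    """Append one JSON object per line so earlier events are never rewritten."""
    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / name
    with target.open("a") as fh:
        fh.write(json.dumps(asdict(event)) + "\n")
    return target
```

Rewriting the manifest through a temporary file and `os.replace` means readers see either the old or the new manifest, never a half-written one, while appending events preserves the full status history.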

launch_run

launch_run(run: RunConfig) -> RunSummary

Run one launchable config with manifest/event tracking.

Source code in graphids/exp/runtime.py

```python
def launch_run(
    run: RunConfig,
) -> RunSummary:
    """Run one launchable config with manifest/event tracking."""
    if run.stage in {"fit", "test"} and run.resources.accelerator == "gpu":
        from graphids.runtime_checks import assert_pyg_cuda_extensions_match

        assert_pyg_cuda_extensions_match()

    backend = run.resources.backend
    if backend == "ray":
        try:
            import ray  # noqa: F401
        except ImportError:
            backend = "local"
    logger = _make_run_logger(run)
    logger.log_hyperparams(run.mlflow_hparams(backend=backend))
    manifest = run.journal_manifest(status="running")
    write_manifest(run.outputs.run_dir, manifest, name=run.outputs.manifest_name)
    append_event(
        run.outputs.run_dir,
        EventRecord(
            status="running",
            stage=run.stage,
            message="launch_started",
            details={"backend": backend},
        ),
        name=run.outputs.events_name,
    )

    try:
        if backend == "ray":
            import ray

            ray.init(ignore_reinit_error=True, include_dashboard=False)
            if logger.run_id is None:
                raise RuntimeError("MLflow logger did not create a run id before Ray launch")
            result = ray.get(ray.remote(_run_stage_with_existing_mlflow_run).remote(run, logger.run_id))
        else:
            result = run_stage(run, logger=logger)
        if logger.run_id is not None:
            logger.experiment.set_terminated(logger.run_id, status="FINISHED")
        append_event(
            run.outputs.run_dir,
            EventRecord(status="finished", stage=run.stage, message="run_finished", details=_payload(result)),
            name=run.outputs.events_name,
        )
        write_manifest(
            run.outputs.run_dir,
            manifest.model_copy(update={"status": "finished"}),
            name=run.outputs.manifest_name,
        )
        return RunSummary(
            run_dir=str(run.outputs.run_dir),
            status="finished",
            stage=run.stage,
            name=run.name,
            last_event="run_finished",
        )
    except BaseException as exc:  # noqa: BLE001 - record all failures, then re-raise
        failure = f"{type(exc).__name__}: {exc}"
        if logger.run_id is not None:
            logger.experiment.set_terminated(logger.run_id, status="FAILED")
        append_event(
            run.outputs.run_dir,
            EventRecord(
                status="failed",
                stage=run.stage,
                message="run_failed",
                details={"failure": failure},
            ),
            name=run.outputs.events_name,
        )
        write_manifest(
            run.outputs.run_dir,
            manifest.model_copy(update={"status": "failed", "failure": failure}),
            name=run.outputs.manifest_name,
        )
        raise
```
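The control flow of `launch_run` reduces to two ideas: fall back to local execution when Ray is unavailable, and record `running`, then `finished` or `failed`, around the stage body before re-raising any failure. A minimal sketch of both, using hypothetical `resolve_backend` and `run_with_status` helpers and a caller-supplied `record` callback in place of the manifest, event log, and MLflow calls:

```python
import importlib.util
from typing import Any, Callable


def resolve_backend(requested: str) -> str:
    """Fall back to local execution when Ray is requested but not installed."""
    if requested == "ray" and importlib.util.find_spec("ray") is None:
        return "local"
    return requested


def run_with_status(
    body: Callable[[], Any],
    record: Callable[[str, dict[str, Any]], None],
) -> Any:
    """Record running, then finished or failed, around `body`; re-raise failures."""
    record("running", {})
    try:
        result = body()
    except BaseException as exc:  # also catches KeyboardInterrupt, like launch_run
        record("failed", {"failure": f"{type(exc).__name__}: {exc}"})
        raise
    record("finished", {})
    return result
```

`importlib.util.find_spec` checks that Ray is installed without importing it, so unlike the `import ray` probe it will not detect a Ray install that is present but fails on import. The sketch also records `finished` outside the `try`, so an error while writing the success record is not relabelled as a run failure; `launch_run` does both inside the `try`.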

run_stage

run_stage(run: RunConfig, logger: Any | None = None) -> dict[str, Any] | None

Default stage dispatcher for experiment launches.

Fit/test, extract, and analyze all run directly from the typed experiment config objects.
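To illustrate what typed experiment config objects can look like, here is a sketch using plain dataclasses. The field names mirror the attributes `launch_run` reads (`name`, `stage`, `resources.accelerator`, `resources.backend`, `outputs.run_dir`), but the classes, defaults, and the `Stage` alias are assumptions; the real `RunConfig` carries more, such as output file names and MLflow hyperparameters:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Literal, get_args

# Hypothetical alias listing the launchable stages.
Stage = Literal["fit", "test", "cache", "extract", "analyze"]


@dataclass(frozen=True)
class Resources:
    accelerator: str = "cpu"
    backend: str = "local"


@dataclass(frozen=True)
class Outputs:
    run_dir: Path = Path("runs/default")


@dataclass(frozen=True)
class RunConfig:
    name: str
    stage: Stage
    resources: Resources = field(default_factory=Resources)
    outputs: Outputs = field(default_factory=Outputs)

    def __post_init__(self) -> None:
        # Literal is only enforced by static type checkers, so check it at runtime too.
        if self.stage not in get_args(Stage):
            raise ValueError(f"unsupported stage: {self.stage!r}")
```

Because the stage is validated when the config is built, `RunConfig(name="bad", stage="deploy")` fails with `ValueError` before any manifest or event is written.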