SLURM: Sizing

Optional walltime estimation from MLflow run history. scripts/run --time-from-history calls estimate_walltime_minutes to tighten the wall limit for (cluster, group, dataset) combinations with ≥3 prior FINISHED runs; None means fall back to the static per-length default in configs/resources/submit_profiles.json.

Formula: ceil(p95(elapsed_mins) × 1.5) clamped to [10, 7 days].
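With made-up elapsed times (not from any real run history), the formula works out like this:

```python
import math
import statistics

# Hypothetical elapsed times (minutes) from five prior FINISHED fit runs.
elapsed_mins = [12.0, 13.2, 14.5, 15.1, 40.0]

# p95 via the inclusive method, as in estimate_walltime_minutes below.
p95 = statistics.quantiles(elapsed_mins, n=100, method="inclusive")[94]

# ceil(p95 * 1.5), clamped to [10 minutes, 7 days].
walltime = max(10, min(math.ceil(p95 * 1.5), 7 * 24 * 60))
```

The single 40-minute outlier pulls p95 toward it, so the padded estimate (53 minutes here) lands well above the median run rather than tracking it.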

graphids.slurm.sizing

sizing

Optional walltime estimation from MLflow history.

Used by scripts/run --time-from-history to set a tighter wall limit for fit jobs on (cluster, group, dataset) combinations with ≥3 prior FINISHED runs. Returns None when there's nothing to estimate from; callers fall back to the static per-length default in submit_profiles.json.

estimate_walltime_minutes

estimate_walltime_minutes(cluster: str, group: str, dataset: str) -> int | None

ceil(p95(elapsed_mins) * 1.5) clamped to [10, 7 days].

None when MLflow is unreachable, the URI is unset, or fewer than 3 matching FINISHED runs exist. slurm.slurm_cluster_name is preferred over graphids.cluster: the former is always set by SLURM, while the latter can be empty when the submitter shell's GRAPHIDS_CLUSTER isn't exported into the job env.

Source code in graphids/slurm/sizing.py
def estimate_walltime_minutes(cluster: str, group: str, dataset: str) -> int | None:
    """``ceil(p95(elapsed_mins) * 1.5)`` clamped to ``[10, 7 days]``.

    ``None`` when MLflow is unreachable, the URI is unset, or fewer than 3
    matching FINISHED runs exist. ``slurm.slurm_cluster_name`` is preferred
    over ``graphids.cluster``: the former is always set by SLURM, while the
    latter can be empty when the submitter shell's ``GRAPHIDS_CLUSTER``
    isn't exported into the job env.
    """
    try:
        from graphids._mlflow import ensure_tracking_uri
    except ImportError:
        return None

    uri = ensure_tracking_uri()
    if uri is None:
        return None
    try:
        from mlflow.tracking import MlflowClient
    except ImportError:
        return None

    try:
        client = MlflowClient(tracking_uri=uri)
        experiments = [e.experiment_id for e in client.search_experiments()]
        if not experiments:
            return None
        filter_str = (
            f"tags.`slurm.slurm_cluster_name` = '{cluster}' "
            f"AND tags.`graphids.group` = '{group}' "
            f"AND tags.`graphids.dataset` = '{dataset}' "
            f"AND tags.`graphids.phase` = 'fit' "
            f"AND attributes.status = 'FINISHED'"
        )
        runs = client.search_runs(
            experiment_ids=experiments, filter_string=filter_str, max_results=50
        )
    except Exception:
        return None

    elapsed = [
        (r.info.end_time - r.info.start_time) / 60000
        for r in runs
        if r.info.end_time and r.info.start_time and r.info.end_time > r.info.start_time
    ]
    if len(elapsed) < 3:
        return None
    p95 = statistics.quantiles(elapsed, n=100, method="inclusive")[94]
    return max(10, min(int(math.ceil(p95 * 1.5)), 7 * 24 * 60))

format_hms

format_hms(minutes: float | int) -> str

Minutes → H:MM:SS sbatch-compatible duration (ceils to whole minutes).

Source code in graphids/slurm/sizing.py
def format_hms(minutes: float | int) -> str:
    """Minutes → ``H:MM:SS`` sbatch-compatible duration (ceils to whole minutes)."""
    total = int(math.ceil(float(minutes))) * 60
    return f"{total // 3600}:{(total % 3600) // 60:02d}:00"