Records API

Typed collection classes for benchmark runs.

LLMRun

mlenergy_data.records.runs.LLMRun dataclass

A single LLM benchmark run.

Attributes:

  • domain (str): Always "llm".
  • task (str): Benchmark task (e.g. "gpqa", "lm-arena-chat").
  • model_id (str): Full HF model identifier (e.g. "meta-llama/Llama-3.1-8B-Instruct").
  • nickname (str): Human-friendly display name from model_info.json.
  • architecture (str): Model architecture ("Dense" or "MoE").
  • total_params_billions (float): Total parameter count in billions.
  • activated_params_billions (float): Activated parameter count in billions (equals total for dense models).
  • weight_precision (str): Weight precision (e.g. "bfloat16", "fp8").
  • gpu_model (str): GPU model identifier (e.g. "H100", "B200").
  • num_gpus (int): Number of GPUs used.
  • max_num_seqs (int): Maximum number of concurrent sequences (batch size).
  • seed (int | None): Random seed used for the benchmark run.
  • num_request_repeats (int | None): Number of request repetitions.
  • tensor_parallel (int): Tensor parallelism degree.
  • expert_parallel (int): Expert parallelism degree.
  • data_parallel (int): Data parallelism degree.
  • steady_state_energy_joules (float): Total GPU energy during steady state in joules.
  • steady_state_duration_seconds (float): Duration of steady state in seconds.
  • energy_per_token_joules (float): Steady-state energy per output token in joules.
  • energy_per_request_joules (float | None): Estimated energy per request in joules.
  • output_throughput_tokens_per_sec (float): Steady-state output throughput in tokens/second.
  • request_throughput_req_per_sec (float | None): Steady-state request throughput in requests/second.
  • avg_power_watts (float): Average GPU power during steady state in watts.
  • total_output_tokens (float | None): Total output tokens generated over the full benchmark.
  • completed_requests (float | None): Number of completed requests over the full benchmark.
  • avg_output_len (float | None): Average output length in tokens.
  • mean_itl_ms (float): Mean inter-token latency in milliseconds.
  • median_itl_ms (float): Median inter-token latency in milliseconds.
  • p50_itl_ms (float): 50th percentile inter-token latency in milliseconds.
  • p90_itl_ms (float): 90th percentile inter-token latency in milliseconds.
  • p95_itl_ms (float): 95th percentile inter-token latency in milliseconds.
  • p99_itl_ms (float): 99th percentile inter-token latency in milliseconds.
  • avg_batch_size (float | None): Average number of concurrent sequences during steady state (from Prometheus).
  • is_stable (bool): Whether this run passed stability checks.
  • unstable_reason (str): Reason for instability (empty if stable).
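As a sanity check on the definitions above, the steady-state fields are related by simple arithmetic: average power is energy over duration, and energy per token is energy over tokens produced. The numbers below are made up for illustration, not real benchmark data:

```python
# Hypothetical steady-state measurements (illustration only).
steady_state_energy_joules = 90_000.0
steady_state_duration_seconds = 60.0
output_throughput_tokens_per_sec = 500.0

# avg_power_watts: energy divided by duration.
avg_power_watts = steady_state_energy_joules / steady_state_duration_seconds  # 1500.0

# Tokens generated during steady state.
steady_state_tokens = output_throughput_tokens_per_sec * steady_state_duration_seconds  # 30000.0

# energy_per_token_joules: energy divided by tokens.
energy_per_token_joules = steady_state_energy_joules / steady_state_tokens  # 3.0
```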

read_results_json()

Download (if needed) and return parsed results.json.

Caches the parsed dict per instance so repeated calls are free.

read_prometheus_json()

Download (if needed) and return parsed prometheus.json.

Caches the parsed dict per instance so repeated calls are free.

output_lengths(*, include_unsuccessful=False)

Return per-request output token lengths from results.json.

Parameters:

  • include_unsuccessful (bool, default False): If True, include requests that failed during benchmarking.

inter_token_latencies()

Return inter-token latency samples in seconds from results.json.

timelines(*, metric='power.device_instant')

Return power/temperature timeseries for this run.

Parameters:

  • metric (Literal['power.device_instant', 'power.device_average', 'temperature'], default 'power.device_instant'): Which timeline to extract.

Returns:

  • DataFrame with columns: timestamp, relative_time_s, value, metric.

LLMRuns

mlenergy_data.records.runs.LLMRuns

Immutable collection of LLM benchmark runs with fluent filtering.

Iterate to get individual LLMRun objects:

for r in runs.task("gpqa"):
    print(r.energy_per_token_joules, r.nickname)

best = min(runs.task("gpqa"), key=lambda r: r.energy_per_token_joules)

Example:

runs = LLMRuns.from_directory("/path/to/compiled/data")
best = min(runs.stable().task("gpqa"), key=lambda r: r.energy_per_token_joules)
energies = [r.energy_per_token_joules for r in runs.task("gpqa")]

download_raw_files(file=None)

Download all raw files for this collection in parallel.

Downloads results.json and prometheus.json for every run in the collection. Only useful when loaded from HF Hub; no-op for local sources.

The full unfiltered dataset is ~100 GB. Filter first to limit download size:

runs = LLMRuns.from_hf().task("gpqa").download_raw_files()

Parameters:

  • file (Literal['results', 'prometheus'] | None, default None): If specified, only download this file type; otherwise download both results.json and prometheus.json.

from_directory(root, *, stable_only=True) classmethod

Load runs from a compiled data directory (parquet-first).

Reads runs/llm.parquet from the compiled data repo. No raw file parsing or stability re-computation is performed.

Parameters:

  • root (str | Path, required): Compiled data directory containing runs/llm.parquet.
  • stable_only (bool, default True): If True, only return stable runs.

from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None, stable_only=True) classmethod

Load LLM runs from a Hugging Face dataset repository.

The default dataset is gated. Before calling this method:

  1. Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
  2. Set the HF_TOKEN environment variable to a Hugging Face access token.

Downloads only the parquet summary file (~few MB). Methods that need raw data (output_lengths(), timelines(), inter_token_latencies()) will automatically download the required files on first access.

Respects the HF_HOME environment variable for cache location.

Parameters:

  • repo_id (str, default 'ml-energy/benchmark-v3'): HF dataset repository ID.
  • revision (str | None, default None): Git revision (branch, tag, or commit hash).
  • stable_only (bool, default True): If True, only return stable runs. See from_raw_results for the definition of stability.

from_parquet(path, *, base_dir=None, stable_only=True) classmethod

Construct LLMRuns from a pre-built parquet file.

Parameters:

  • path (Path, required): Path to the parquet file.
  • base_dir (Path | None, default None): If provided, resolve relative path fields against this directory.
  • stable_only (bool, default True): If True, only return stable runs.

from_raw_results(*roots, tasks=None, config_dir=None, stable_only=True, n_workers=None) classmethod

Load runs from raw benchmark result directories.

Parses results.json files, computes stability, and returns the filtered collection.

A run is considered unstable if any of the following hold:

  • The steady-state duration is shorter than 20 seconds.
  • The energy-per-token value is missing or non-positive.
  • The average batch utilization during steady state is below 85% of the configured max_num_seqs.
  • Cascade rule: if any batch size for a (model, task, GPU, num_gpus) group is unstable, all larger batch sizes in the same group are also marked unstable.

Stability is computed jointly across all roots so the cascade rule works cross-root.
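The cascade rule can be sketched in plain Python (an illustration of the stated behavior, not the library's actual implementation). Within one (model, task, GPU, num_gpus) group, every batch size larger than the smallest unstable one is also marked unstable:

```python
def apply_cascade(group: dict[int, bool]) -> dict[int, bool]:
    """Given {max_num_seqs: is_stable} for one (model, task, GPU, num_gpus)
    group, mark every batch size above the smallest unstable one unstable."""
    unstable = [bs for bs, stable in group.items() if not stable]
    if not unstable:
        return dict(group)
    threshold = min(unstable)
    return {bs: stable and bs < threshold for bs, stable in group.items()}

# Batch size 64 failed its checks, so 128 and 256 cascade to unstable.
print(apply_cascade({16: True, 64: False, 128: True, 256: True}))
# {16: True, 64: False, 128: False, 256: False}
```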

Parameters:

  • roots (str | Path, variadic): One or more benchmark root directories (or results sub-directories).
  • tasks (set[str] | None, default None): If given, only load these tasks.
  • config_dir (str | Path | None, default None): Path to the LLM config directory (model_info.json, etc.).
  • stable_only (bool, default True): If True, only return stable runs. Pass False to include all runs; each run's is_stable and unstable_reason fields indicate its status.
  • n_workers (int | None, default None): Number of parallel workers (default: auto).

task(*tasks)

Filter to runs matching any of the given tasks.

model_id(*model_ids)

Filter to runs matching any of the given model IDs.

gpu_model(*gpu_models)

Filter to runs matching any of the given GPU models.

num_gpus(*counts, min=None, max=None)

Filter to runs matching given GPU counts or a range.

Parameters:

  • counts (int, variadic): Exact GPU counts to include.
  • min (int | None, default None): Minimum GPU count (inclusive).
  • max (int | None, default None): Maximum GPU count (inclusive).

max_num_seqs(*sizes, min=None, max=None)

Filter to runs matching given max_num_seqs values or a range.

Parameters:

  • sizes (int, variadic): Exact values to include.
  • min (int | None, default None): Minimum value (inclusive).
  • max (int | None, default None): Maximum value (inclusive).

precision(*prec)

Filter to runs matching any of the given weight precisions.

architecture(*arch)

Filter to runs matching any of the given architectures.

nickname(*nicknames)

Filter to runs matching any of the given nicknames.

stable()

Filter to stable runs only.

unstable()

Filter to unstable runs only.

Raises:

  • ValueError: If this collection was loaded with stable_only=True, since unstable runs were already filtered out at load time.

where(predicate)

Filter runs by an arbitrary predicate.

Parameters:

  • predicate (Callable[[LLMRun], bool], required): Function that takes an LLMRun and returns True to keep it.

group_by(*fields)

Group runs by one or more fields.

Parameters:

  • fields (str, variadic): One or more LLMRun field names to group by.

Returns:

  • dict[Any, LLMRuns]: For a single field, keys are field values: {value: LLMRuns, ...}. For multiple fields, keys are tuples: {(v1, v2, ...): LLMRuns, ...}.
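The key shapes can be illustrated with plain Python. This is a simplified stand-in where records are dicts and groups are lists, rather than LLMRun objects and LLMRuns collections:

```python
from collections import defaultdict

records = [
    {"task": "gpqa", "gpu_model": "H100"},
    {"task": "gpqa", "gpu_model": "B200"},
    {"task": "lm-arena-chat", "gpu_model": "H100"},
]

def group_by(records, *fields):
    """Single field -> scalar keys; multiple fields -> tuple keys."""
    groups = defaultdict(list)
    for r in records:
        key = r[fields[0]] if len(fields) == 1 else tuple(r[f] for f in fields)
        groups[key].append(r)
    return dict(groups)

print(sorted(group_by(records, "task")))
# ['gpqa', 'lm-arena-chat']
print(sorted(group_by(records, "task", "gpu_model")))
# [('gpqa', 'B200'), ('gpqa', 'H100'), ('lm-arena-chat', 'H100')]
```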

to_dataframe()

Convert to DataFrame with one row per run.

Private fields (path fields, HF metadata) are excluded.

output_lengths(*, include_unsuccessful=False)

Extract per-request output lengths.

Calls LLMRun.output_lengths() on each record, which handles downloading from HF Hub if needed.

Parameters:

  • include_unsuccessful (bool, default False): If True, include requests that failed during benchmarking (success=False in results.json); by default only successful requests are included.

Returns:

  • DataFrame with columns: task, model_id, num_gpus, max_num_seqs, output_len.

Raises:

  • FileNotFoundError: If raw results files are not available locally and the collection was not loaded from HF Hub.

inter_token_latencies()

Extract per-token inter-token latency samples.

Calls LLMRun.inter_token_latencies() on each record, which handles downloading from HF Hub if needed. Chunked-prefill artifacts (zero-valued ITL entries) are smoothed by spreading the accumulated latency across the covered tokens.

Returns:

  • DataFrame with columns: task, model_id, num_gpus, max_num_seqs, itl_s.

Raises:

  • FileNotFoundError: If raw results files are not available locally and the collection was not loaded from HF Hub.
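The smoothing described above can be sketched in plain Python (an illustration of the stated behavior, not the library's code): a run of zero-valued ITL entries plus the nonzero entry that follows it is treated as one chunk, and that chunk's accumulated latency is spread evenly over its tokens. Values below are chosen so the arithmetic is exact:

```python
def smooth_itls(itls: list[float]) -> list[float]:
    """Spread accumulated latency across runs of zero-valued ITL entries."""
    smoothed: list[float] = []
    zeros = 0
    for itl in itls:
        if itl == 0.0:
            zeros += 1          # part of a chunked-prefill artifact
        else:
            span = zeros + 1    # zero entries plus the accumulating one
            smoothed.extend([itl / span] * span)
            zeros = 0
    return smoothed

print(smooth_itls([0.5, 0.0, 0.0, 0.75, 0.5]))
# [0.5, 0.25, 0.25, 0.25, 0.5]
```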

timelines(*, metric='power.device_instant')

Extract power/temperature timeseries.

Calls LLMRun.timelines() on each record, which handles downloading from HF Hub if needed.

Parameters:

  • metric (Literal['power.device_instant', 'power.device_average', 'temperature'], default 'power.device_instant'): Which timeline to extract.

Returns:

  • DataFrame with columns: task, model_id, num_gpus, max_num_seqs, timestamp, relative_time_s, value, metric.

DiffusionRun

mlenergy_data.records.runs.DiffusionRun dataclass

A single diffusion model benchmark run.

Attributes:

  • domain (str): Always "diffusion".
  • task (str): Benchmark task ("text-to-image" or "text-to-video").
  • model_id (str): Full HF model identifier.
  • nickname (str): Human-friendly display name from model_info.json.
  • total_params_billions (float): Total parameter count in billions.
  • activated_params_billions (float): Activated parameter count in billions.
  • weight_precision (str): Weight precision (e.g. "bfloat16", "fp8").
  • gpu_model (str): GPU model identifier (e.g. "H100").
  • num_gpus (int): Number of GPUs used.
  • batch_size (int): Batch size.
  • inference_steps (int | None): Number of diffusion inference steps.
  • height (int): Output height in pixels.
  • width (int): Output width in pixels.
  • num_frames (int | None): Number of video frames (None for images).
  • fps (int | None): Video frames per second (None for images).
  • ulysses_degree (int | None): Ulysses sequence parallelism degree.
  • ring_degree (int | None): Ring attention parallelism degree.
  • use_torch_compile (bool | None): Whether torch.compile was enabled.
  • batch_latency_s (float): Average batch latency in seconds.
  • avg_power_watts (float): Average GPU power in watts.
  • energy_per_generation_joules (float): Energy per generated output (image or video) in joules.
  • throughput_generations_per_sec (float): Throughput in generations per second.
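If one batch draws roughly avg_power_watts for batch_latency_s seconds, the per-generation metrics follow by dividing across the batch. This relationship is an assumption made for illustration, not something the API documents, and the numbers are made up:

```python
# Hypothetical measurements (illustration only).
avg_power_watts = 600.0
batch_latency_s = 8.0
batch_size = 4

# Energy for one batch: power times time (assumed relationship).
energy_per_batch_joules = avg_power_watts * batch_latency_s          # 4800.0

# Split across the generations in the batch (assumed relationship).
energy_per_generation_joules = energy_per_batch_joules / batch_size  # 1200.0

# batch_size generations complete every batch_latency_s seconds.
throughput_generations_per_sec = batch_size / batch_latency_s        # 0.5
```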

is_text_to_image property

Whether this is a text-to-image run.

is_text_to_video property

Whether this is a text-to-video run.

read_results_json()

Download (if needed) and return parsed results.json.

Caches the parsed dict per instance so repeated calls are free.

timelines(*, metric='power.device_instant')

Return power/temperature timeseries for this run.

Diffusion runs do not have steady-state bounds, so the full timeline is returned.

Parameters:

  • metric (Literal['power.device_instant', 'power.device_average', 'temperature'], default 'power.device_instant'): Which timeline to extract.

Returns:

  • DataFrame with columns: timestamp, relative_time_s, value, metric.

DiffusionRuns

mlenergy_data.records.runs.DiffusionRuns

Immutable collection of diffusion model benchmark runs with fluent filtering.

Same collection pattern as LLMRuns, with diffusion-specific filters.

Iterate to get individual DiffusionRun objects:

for r in runs.task("text-to-image"):
    print(r.energy_per_generation_joules, r.nickname)

Example:

runs = DiffusionRuns.from_directory("/path/to/compiled/data")
powers = [r.avg_power_watts for r in runs.task("text-to-image")]

download_raw_files()

Download all raw files for this collection in parallel.

Downloads results.json for every run in the collection. Only useful when loaded from HF Hub; no-op for local sources.

The full unfiltered dataset is ~100 GB. Filter first to limit download size:

runs = DiffusionRuns.from_hf().task("text-to-image").download_raw_files()

from_directory(root) classmethod

Load runs from a compiled data directory (parquet-first).

Reads runs/diffusion.parquet from the compiled data repo.

Parameters:

  • root (str | Path, required): Compiled data directory containing runs/diffusion.parquet.

from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None) classmethod

Load diffusion runs from a Hugging Face dataset repository.

The default dataset is gated. Before calling this method:

  1. Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
  2. Set the HF_TOKEN environment variable to a Hugging Face access token.

Downloads only the parquet summary file (~few MB). Methods that need raw data (timelines()) will automatically download the required files on first access.

Respects the HF_HOME environment variable for cache location.

Parameters:

  • repo_id (str, default 'ml-energy/benchmark-v3'): HF dataset repository ID.
  • revision (str | None, default None): Git revision (branch, tag, or commit hash).

from_parquet(path, *, base_dir=None) classmethod

Construct DiffusionRuns from a pre-built parquet file.

Parameters:

  • path (Path, required): Path to the parquet file.
  • base_dir (Path | None, default None): If provided, resolve relative path fields against this directory.

from_raw_results(*roots, tasks=None, config_dir=None, n_workers=None) classmethod

Load runs from raw benchmark result directories.

Parses results.json files and returns the collection.

Parameters:

  • roots (str | Path, variadic): One or more benchmark root directories (or results sub-directories).
  • tasks (set[str] | None, default None): If given, only load these tasks.
  • config_dir (str | Path | None, default None): Path to the diffusion config directory.
  • n_workers (int | None, default None): Number of parallel workers (default: auto).

task(*tasks)

Filter to runs matching any of the given tasks.

model_id(*model_ids)

Filter to runs matching any of the given model IDs.

gpu_model(*gpu_models)

Filter to runs matching any of the given GPU models.

num_gpus(*counts, min=None, max=None)

Filter to runs matching given GPU counts or a range.

Parameters:

  • counts (int, variadic): Exact GPU counts to include.
  • min (int | None, default None): Minimum GPU count (inclusive).
  • max (int | None, default None): Maximum GPU count (inclusive).

nickname(*nicknames)

Filter to runs matching any of the given nicknames.

batch_size(*sizes, min=None, max=None)

Filter to runs matching given batch sizes or a range.

Parameters:

  • sizes (int, variadic): Exact batch sizes to include.
  • min (int | None, default None): Minimum batch size (inclusive).
  • max (int | None, default None): Maximum batch size (inclusive).

precision(*prec)

Filter to runs matching any of the given weight precisions.

where(predicate)

Filter runs by an arbitrary predicate.

Parameters:

  • predicate (Callable[[DiffusionRun], bool], required): Function that takes a DiffusionRun and returns True to keep it.

group_by(*fields)

Group runs by one or more fields.

Parameters:

  • fields (str, variadic): One or more DiffusionRun field names to group by.

Returns:

  • dict[Any, DiffusionRuns]: For a single field, keys are field values: {value: DiffusionRuns, ...}. For multiple fields, keys are tuples: {(v1, v2, ...): DiffusionRuns, ...}.

to_dataframe()

Convert to DataFrame with one row per run.

Private fields (path fields, HF metadata) are excluded.

timelines(*, metric='power.device_instant')

Extract power/temperature timeseries.

Calls DiffusionRun.timelines() on each record, which handles downloading from HF Hub if needed.

Parameters:

  • metric (Literal['power.device_instant', 'power.device_average', 'temperature'], default 'power.device_instant'): Which timeline to extract.

Returns:

  • DataFrame with columns: task, model_id, num_gpus, batch_size, timestamp, relative_time_s, value, metric.