Records API

Typed collection classes for benchmark runs.

LLMRun

mlenergy_data.records.runs.LLMRun dataclass

A single LLM benchmark run.

Attributes:

  • domain (str): Always "llm".
  • task (str): Benchmark task (e.g. "gpqa", "lm-arena-chat").
  • model_id (str): Full HF model identifier (e.g. "meta-llama/Llama-3.1-8B-Instruct").
  • nickname (str): Human-friendly display name from model_info.json.
  • architecture (str): Model architecture ("Dense" or "MoE").
  • total_params_billions (float): Total parameter count in billions.
  • activated_params_billions (float): Activated parameter count in billions (equals total for dense models).
  • weight_precision (str): Weight precision (e.g. "bfloat16", "fp8").
  • gpu_model (str): GPU model identifier (e.g. "H100", "B200").
  • num_gpus (int): Number of GPUs used.
  • max_num_seqs (int): Maximum number of concurrent sequences (batch size).
  • seed (int | None): Random seed used for the benchmark run.
  • num_request_repeats (int | None): Number of request repetitions.
  • tensor_parallel (int): Tensor parallelism degree.
  • expert_parallel (int): Expert parallelism degree.
  • data_parallel (int): Data parallelism degree.
  • steady_state_energy_joules (float): Total GPU energy during steady state in joules.
  • steady_state_duration_seconds (float): Duration of steady state in seconds.
  • energy_per_token_joules (float): Steady-state energy per output token in joules.
  • energy_per_request_joules (float | None): Estimated energy per request in joules.
  • output_throughput_tokens_per_sec (float): Steady-state output throughput in tokens/second.
  • request_throughput_req_per_sec (float | None): Steady-state request throughput in requests/second.
  • avg_power_watts (float): Average GPU power during steady state in watts.
  • total_output_tokens (float | None): Total output tokens generated over the full benchmark.
  • completed_requests (float | None): Number of completed requests over the full benchmark.
  • avg_output_len (float | None): Average output length in tokens.
  • mean_itl_ms (float): Mean inter-token latency in milliseconds.
  • median_itl_ms (float): Median inter-token latency in milliseconds.
  • p50_itl_ms (float): 50th percentile inter-token latency in milliseconds.
  • p90_itl_ms (float): 90th percentile inter-token latency in milliseconds.
  • p95_itl_ms (float): 95th percentile inter-token latency in milliseconds.
  • p99_itl_ms (float): 99th percentile inter-token latency in milliseconds.
  • avg_batch_size (float | None): Average number of concurrent sequences during steady state (from Prometheus).
  • is_stable (bool): Whether this run passed stability checks.
  • unstable_reason (str): Reason for instability (empty if stable).
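As a sanity check on the definitions above, the steady-state fields are related by simple arithmetic: average power is energy over duration, and energy per token is energy over tokens produced. The numbers below are made up for illustration, not real benchmark data:

```python
# Hypothetical steady-state measurements (illustration only).
steady_state_energy_joules = 90_000.0
steady_state_duration_seconds = 60.0
output_throughput_tokens_per_sec = 500.0

# avg_power_watts: energy divided by duration.
avg_power_watts = steady_state_energy_joules / steady_state_duration_seconds  # 1500.0

# Tokens generated during steady state.
steady_state_tokens = output_throughput_tokens_per_sec * steady_state_duration_seconds  # 30000.0

# energy_per_token_joules: energy divided by tokens.
energy_per_token_joules = steady_state_energy_joules / steady_state_tokens  # 3.0
```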

read_results_json()

Download (if needed) and return parsed results.json.

Caches the parsed dict per instance so repeated calls are free.

read_prometheus_json()

Download (if needed) and return parsed prometheus.json.

Caches the parsed dict per instance so repeated calls are free.

output_lengths(*, include_unsuccessful=False)

Return per-request output token lengths from results.json.

Parameters:

  • include_unsuccessful (bool, default False): If True, include requests that failed during benchmarking.

inter_token_latencies()

Return inter-token latency samples in seconds from results.json.

timelines(*, metric='power.device_instant')

Return power/temperature timeseries for this run.

Parameters:

  • metric (Literal['power.device_instant', 'power.device_average', 'temperature'], default 'power.device_instant'): Which timeline to extract.

Returns:

  • DataFrame with columns: timestamp, relative_time_s, value, metric.

LLMRuns

mlenergy_data.records.runs.LLMRuns

Immutable collection of LLM benchmark runs with fluent filtering.

Iterate to get individual LLMRun objects:

for r in runs.task("gpqa"):
    print(r.energy_per_token_joules, r.nickname)

best = min(runs.task("gpqa"), key=lambda r: r.energy_per_token_joules)

Example:

runs = LLMRuns.from_directory("/path/to/compiled/data")
best = min(runs.stable().task("gpqa"), key=lambda r: r.energy_per_token_joules)
energies = [r.energy_per_token_joules for r in runs.task("gpqa")]

download_raw_files(file=None)

Download all raw files for this collection in parallel.

Downloads results.json and prometheus.json for every run in the collection. Only useful when loaded from HF Hub; no-op for local sources.

The full unfiltered dataset is ~100 GB. Filter first to limit download size:

runs = LLMRuns.from_hf().task("gpqa").download_raw_files()

Parameters:

  • file (Literal['results', 'prometheus'] | None, default None): If specified, only download this file type; otherwise download both results.json and prometheus.json.

from_directory(root, *, stable_only=True) classmethod

Load runs from a compiled data directory (parquet-first).

Reads runs/llm.parquet from the compiled data repo. No raw file parsing or stability re-computation is performed.

Parameters:

  • root (str | Path, required): Compiled data directory containing runs/llm.parquet.
  • stable_only (bool, default True): If True, only return stable runs.

from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None, stable_only=True) classmethod

Load LLM runs from a Hugging Face dataset repository.

The default dataset is gated. Before calling this method:

  1. Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
  2. Set the HF_TOKEN environment variable to a Hugging Face access token.

Downloads only the parquet summary file (~few MB). Methods that need raw data (output_lengths(), timelines(), inter_token_latencies()) will automatically download the required files on first access.

Respects the HF_HOME environment variable for cache location.

Parameters:

  • repo_id (str, default 'ml-energy/benchmark-v3'): HF dataset repository ID.
  • revision (str | None, default None): Git revision (branch, tag, or commit hash).
  • stable_only (bool, default True): If True, only return stable runs. See from_raw_results for the definition of stability.

from_parquet(path, *, base_dir=None, stable_only=True) classmethod

Construct LLMRuns from a pre-built parquet file.

Parameters:

  • path (Path, required): Path to the parquet file.
  • base_dir (Path | None, default None): If provided, resolve relative path fields against this directory.
  • stable_only (bool, default True): If True, only return stable runs.

from_raw_results(*roots, tasks=None, config_dir=None, stable_only=True, n_workers=None) classmethod

Load runs from raw benchmark result directories.

Parses results.json files, computes stability, and returns the filtered collection.

A run is considered unstable if any of the following hold:

  • The steady-state duration is shorter than 20 seconds.
  • The energy-per-token value is missing or non-positive.
  • The average batch utilization during steady state is below 85% of the configured max_num_seqs.
  • Cascade rule: if any batch size for a (model, task, GPU, num_gpus) group is unstable, all larger batch sizes in the same group are also marked unstable.

Stability is computed jointly across all roots so the cascade rule works cross-root.
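The cascade rule can be sketched in plain Python (an illustration of the stated behavior, not the library's actual implementation). Within one (model, task, GPU, num_gpus) group, every batch size larger than the smallest unstable one is also marked unstable:

```python
def apply_cascade(group: dict[int, bool]) -> dict[int, bool]:
    """Given {max_num_seqs: is_stable} for one (model, task, GPU, num_gpus)
    group, mark every batch size above the smallest unstable one unstable."""
    unstable = [bs for bs, stable in group.items() if not stable]
    if not unstable:
        return dict(group)
    threshold = min(unstable)
    return {bs: stable and bs < threshold for bs, stable in group.items()}

# Batch size 64 failed its checks, so 128 and 256 cascade to unstable.
print(apply_cascade({16: True, 64: False, 128: True, 256: True}))
# {16: True, 64: False, 128: False, 256: False}
```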

Parameters:

  • roots (str | Path, variadic): One or more benchmark root directories (or results sub-directories).
  • tasks (set[str] | None, default None): If given, only load these tasks.
  • config_dir (str | Path | None, default None): Path to the LLM config directory (model_info.json, etc.).
  • stable_only (bool, default True): If True, only return stable runs. Pass False to include all runs; each run's is_stable and unstable_reason fields indicate its status.
  • n_workers (int | None, default None): Number of parallel workers (default: auto).

task(*tasks)

Filter to runs matching any of the given tasks.

model_id(*model_ids)

Filter to runs matching any of the given model IDs.

gpu_model(*gpu_models)

Filter to runs matching any of the given GPU models.

num_gpus(*counts, min=None, max=None)

Filter to runs matching given GPU counts or a range.

Parameters:

  • counts (int, variadic): Exact GPU counts to include.
  • min (int | None, default None): Minimum GPU count (inclusive).
  • max (int | None, default None): Maximum GPU count (inclusive).

max_num_seqs(*sizes, min=None, max=None)

Filter to runs matching given max_num_seqs values or a range.

Parameters:

  • sizes (int, variadic): Exact values to include.
  • min (int | None, default None): Minimum value (inclusive).
  • max (int | None, default None): Maximum value (inclusive).

precision(*prec)

Filter to runs matching any of the given weight precisions.

architecture(*arch)

Filter to runs matching any of the given architectures.

nickname(*nicknames)

Filter to runs matching any of the given nicknames.

stable()

Filter to stable runs only.

unstable()

Filter to unstable runs only.

Raises:

  • ValueError: If this collection was loaded with stable_only=True, since unstable runs were already filtered out at load time.

where(predicate)

Filter runs by an arbitrary predicate.

Parameters:

  • predicate (Callable[[LLMRun], bool], required): Function that takes an LLMRun and returns True to keep it.

group_by(*fields)

Group runs by one or more fields.

Parameters:

  • fields (str, variadic): One or more LLMRun field names to group by.

Returns:

  • dict[Any, LLMRuns]: For a single field, keys are field values: {value: LLMRuns, ...}. For multiple fields, keys are tuples: {(v1, v2, ...): LLMRuns, ...}.
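The key shapes can be illustrated with plain Python. This is a simplified stand-in where records are dicts and groups are lists, rather than LLMRun objects and LLMRuns collections:

```python
from collections import defaultdict

records = [
    {"task": "gpqa", "gpu_model": "H100"},
    {"task": "gpqa", "gpu_model": "B200"},
    {"task": "lm-arena-chat", "gpu_model": "H100"},
]

def group_by(records, *fields):
    """Single field -> scalar keys; multiple fields -> tuple keys."""
    groups = defaultdict(list)
    for r in records:
        key = r[fields[0]] if len(fields) == 1 else tuple(r[f] for f in fields)
        groups[key].append(r)
    return dict(groups)

print(sorted(group_by(records, "task")))
# ['gpqa', 'lm-arena-chat']
print(sorted(group_by(records, "task", "gpu_model")))
# [('gpqa', 'B200'), ('gpqa', 'H100'), ('lm-arena-chat', 'H100')]
```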

to_dataframe()

Convert to DataFrame with one row per run.

Private fields (path fields, HF metadata) are excluded.

output_lengths(*, include_unsuccessful=False)

Extract per-request output lengths.

Calls LLMRun.output_lengths() on each record, which handles downloading from HF Hub if needed.

Parameters:

  • include_unsuccessful (bool, default False): If True, include requests that failed during benchmarking (success=False in results.json); by default only successful requests are included.

Returns:

  • DataFrame with columns: task, model_id, num_gpus, max_num_seqs, output_len.

Raises:

  • FileNotFoundError: If raw results files are not available locally and the collection was not loaded from HF Hub.

inter_token_latencies()

Extract per-token inter-token latency samples.

Calls LLMRun.inter_token_latencies() on each record, which handles downloading from HF Hub if needed. Chunked-prefill artifacts (zero-valued ITL entries) are smoothed by spreading the accumulated latency across the covered tokens.

Returns:

  • DataFrame with columns: task, model_id, num_gpus, max_num_seqs, itl_s.

Raises:

  • FileNotFoundError: If raw results files are not available locally and the collection was not loaded from HF Hub.
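The smoothing described above can be sketched in plain Python (an illustration of the stated behavior, not the library's code): a run of zero-valued ITL entries plus the nonzero entry that follows it is treated as one chunk, and that chunk's accumulated latency is spread evenly over its tokens. Values below are chosen so the arithmetic is exact:

```python
def smooth_itls(itls: list[float]) -> list[float]:
    """Spread accumulated latency across runs of zero-valued ITL entries."""
    smoothed: list[float] = []
    zeros = 0
    for itl in itls:
        if itl == 0.0:
            zeros += 1          # part of a chunked-prefill artifact
        else:
            span = zeros + 1    # zero entries plus the accumulating one
            smoothed.extend([itl / span] * span)
            zeros = 0
    return smoothed

print(smooth_itls([0.5, 0.0, 0.0, 0.75, 0.5]))
# [0.5, 0.25, 0.25, 0.25, 0.5]
```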

timelines(*, metric='power.device_instant')

Extract power/temperature timeseries.

Calls LLMRun.timelines() on each record, which handles downloading from HF Hub if needed.

Parameters:

  • metric (Literal['power.device_instant', 'power.device_average', 'temperature'], default 'power.device_instant'): Which timeline to extract.

Returns:

  • DataFrame with columns: task, model_id, num_gpus, max_num_seqs, timestamp, relative_time_s, value, metric.

DiffusionRun

mlenergy_data.records.runs.DiffusionRun dataclass

A single diffusion model benchmark run.

Attributes:

  • domain (str): Always "diffusion".
  • task (str): Benchmark task ("text-to-image" or "text-to-video").
  • model_id (str): Full HF model identifier.
  • nickname (str): Human-friendly display name from model_info.json.
  • total_params_billions (float): Total parameter count in billions.
  • activated_params_billions (float): Activated parameter count in billions.
  • weight_precision (str): Weight precision (e.g. "bfloat16", "fp8").
  • gpu_model (str): GPU model identifier (e.g. "H100").
  • num_gpus (int): Number of GPUs used.
  • batch_size (int): Batch size.
  • inference_steps (int | None): Number of diffusion inference steps.
  • height (int): Output height in pixels.
  • width (int): Output width in pixels.
  • num_frames (int | None): Number of video frames (None for images).
  • fps (int | None): Video frames per second (None for images).
  • ulysses_degree (int | None): Ulysses sequence parallelism degree.
  • ring_degree (int | None): Ring attention parallelism degree.
  • use_torch_compile (bool | None): Whether torch.compile was enabled.
  • batch_latency_s (float): Average batch latency in seconds.
  • avg_power_watts (float): Average GPU power in watts.
  • energy_per_generation_joules (float): Energy per generated output (image or video) in joules.
  • throughput_generations_per_sec (float): Throughput in generations per second.
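If one batch draws roughly avg_power_watts for batch_latency_s seconds, the per-generation metrics follow by dividing across the batch. This relationship is an assumption made for illustration, not something the API documents, and the numbers are made up:

```python
# Hypothetical measurements (illustration only).
avg_power_watts = 600.0
batch_latency_s = 8.0
batch_size = 4

# Energy for one batch: power times time (assumed relationship).
energy_per_batch_joules = avg_power_watts * batch_latency_s          # 4800.0

# Split across the generations in the batch (assumed relationship).
energy_per_generation_joules = energy_per_batch_joules / batch_size  # 1200.0

# batch_size generations complete every batch_latency_s seconds.
throughput_generations_per_sec = batch_size / batch_latency_s        # 0.5
```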

is_text_to_image property

Whether this is a text-to-image run.

is_text_to_video property

Whether this is a text-to-video run.

read_results_json()

Download (if needed) and return parsed results.json.

Caches the parsed dict per instance so repeated calls are free.

timelines(*, metric='power.device_instant')

Return power/temperature timeseries for this run.

Diffusion runs do not have steady-state bounds, so the full timeline is returned.

Parameters:

  • metric (Literal['power.device_instant', 'power.device_average', 'temperature'], default 'power.device_instant'): Which timeline to extract.

Returns:

  • DataFrame with columns: timestamp, relative_time_s, value, metric.

DiffusionRuns

mlenergy_data.records.runs.DiffusionRuns

Immutable collection of diffusion model benchmark runs with fluent filtering.

Same collection pattern as LLMRuns, with diffusion-specific filters.

Iterate to get individual DiffusionRun objects:

for r in runs.task("text-to-image"):
    print(r.energy_per_generation_joules, r.nickname)

Example:

runs = DiffusionRuns.from_directory("/path/to/compiled/data")
powers = [r.avg_power_watts for r in runs.task("text-to-image")]

download_raw_files()

Download all raw files for this collection in parallel.

Downloads results.json for every run in the collection. Only useful when loaded from HF Hub; no-op for local sources.

The full unfiltered dataset is ~100 GB. Filter first to limit download size:

runs = DiffusionRuns.from_hf().task("text-to-image").download_raw_files()

from_directory(root) classmethod

Load runs from a compiled data directory (parquet-first).

Reads runs/diffusion.parquet from the compiled data repo.

Parameters:

  • root (str | Path, required): Compiled data directory containing runs/diffusion.parquet.

from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None) classmethod

Load diffusion runs from a Hugging Face dataset repository.

The default dataset is gated. Before calling this method:

  1. Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
  2. Set the HF_TOKEN environment variable to a Hugging Face access token.

Downloads only the parquet summary file (~few MB). Methods that need raw data (timelines()) will automatically download the required files on first access.

Respects the HF_HOME environment variable for cache location.

Parameters:

  • repo_id (str, default 'ml-energy/benchmark-v3'): HF dataset repository ID.
  • revision (str | None, default None): Git revision (branch, tag, or commit hash).

from_parquet(path, *, base_dir=None) classmethod

Construct DiffusionRuns from a pre-built parquet file.

Parameters:

  • path (Path, required): Path to the parquet file.
  • base_dir (Path | None, default None): If provided, resolve relative path fields against this directory.

from_raw_results(*roots, tasks=None, config_dir=None, n_workers=None) classmethod

Load runs from raw benchmark result directories.

Parses results.json files and returns the collection.

Parameters:

  • roots (str | Path, variadic): One or more benchmark root directories (or results sub-directories).
  • tasks (set[str] | None, default None): If given, only load these tasks.
  • config_dir (str | Path | None, default None): Path to the diffusion config directory.
  • n_workers (int | None, default None): Number of parallel workers (default: auto).

task(*tasks)

Filter to runs matching any of the given tasks.

model_id(*model_ids)

Filter to runs matching any of the given model IDs.

gpu_model(*gpu_models)

Filter to runs matching any of the given GPU models.

num_gpus(*counts, min=None, max=None)

Filter to runs matching given GPU counts or a range.

Parameters:

  • counts (int, variadic): Exact GPU counts to include.
  • min (int | None, default None): Minimum GPU count (inclusive).
  • max (int | None, default None): Maximum GPU count (inclusive).

nickname(*nicknames)

Filter to runs matching any of the given nicknames.

batch_size(*sizes, min=None, max=None)

Filter to runs matching given batch sizes or a range.

Parameters:

  • sizes (int, variadic): Exact batch sizes to include.
  • min (int | None, default None): Minimum batch size (inclusive).
  • max (int | None, default None): Maximum batch size (inclusive).

precision(*prec)

Filter to runs matching any of the given weight precisions.

where(predicate)

Filter runs by an arbitrary predicate.

Parameters:

  • predicate (Callable[[DiffusionRun], bool], required): Function that takes a DiffusionRun and returns True to keep it.

group_by(*fields)

Group runs by one or more fields.

Parameters:

  • fields (str, variadic): One or more DiffusionRun field names to group by.

Returns:

  • dict[Any, DiffusionRuns]: For a single field, keys are field values: {value: DiffusionRuns, ...}. For multiple fields, keys are tuples: {(v1, v2, ...): DiffusionRuns, ...}.

to_dataframe()

Convert to DataFrame with one row per run.

Private fields (path fields, HF metadata) are excluded.

timelines(*, metric='power.device_instant')

Extract power/temperature timeseries.

Calls DiffusionRun.timelines() on each record, which handles downloading from HF Hub if needed.

Parameters:

  • metric (Literal['power.device_instant', 'power.device_average', 'temperature'], default 'power.device_instant'): Which timeline to extract.

Returns:

  • DataFrame with columns: task, model_id, num_gpus, batch_size, timestamp, relative_time_s, value, metric.