Records API

Typed collection classes for benchmark runs.

LLMRun

mlenergy_data.records.runs.LLMRun dataclass

A single LLM benchmark run.

Attributes:

Name Type Description
domain str

Always "llm".

task str

Benchmark task (e.g. "gpqa", "lm-arena-chat").

model_id str

Full HF model identifier (e.g. "meta-llama/Llama-3.1-8B-Instruct").

nickname str

Human-friendly display name from model_info.json.

architecture str

Model architecture ("Dense" or "MoE").

total_params_billions float

Total parameter count in billions.

activated_params_billions float

Activated parameter count in billions (equals total for dense).

weight_precision str

Weight precision (e.g. "bfloat16", "fp8").

gpu_model str

GPU model identifier (e.g. "H100", "B200").

num_gpus int

Number of GPUs used.

max_num_seqs int

Maximum concurrent sequences (batch size).

seed int | None

Random seed used for the benchmark run.

num_request_repeats int | None

Number of request repetitions.

tensor_parallel int

Tensor parallelism degree.

expert_parallel int

Expert parallelism degree.

data_parallel int

Data parallelism degree.

steady_state_energy_joules float

Total GPU energy during steady state in joules.

steady_state_duration_seconds float

Duration of steady state in seconds.

energy_per_token_joules float

Steady-state energy per output token in joules.

energy_per_request_joules float | None

Estimated energy per request in joules.

output_throughput_tokens_per_sec float

Steady-state output throughput in tokens/second.

request_throughput_req_per_sec float | None

Steady-state request throughput in requests/second.

avg_power_watts float

Average GPU power during steady state in watts.

total_output_tokens float | None

Total output tokens generated (over the full benchmark).

completed_requests float | None

Number of completed requests (over the full benchmark).

avg_output_len float | None

Average output length in tokens.

mean_itl_ms float

Mean inter-token latency in milliseconds.

median_itl_ms float

Median inter-token latency in milliseconds.

p50_itl_ms float

50th percentile inter-token latency in milliseconds.

p90_itl_ms float

90th percentile inter-token latency in milliseconds.

p95_itl_ms float

95th percentile inter-token latency in milliseconds.

p99_itl_ms float

99th percentile inter-token latency in milliseconds.

avg_batch_size float | None

Average concurrent sequences during steady state (from Prometheus).

is_stable bool

Whether this run passed stability checks.

unstable_reason str

Reason for instability (empty if stable).

results_path str

Path to the raw results.json file.

prometheus_path str

Path to the prometheus.json file.

LLMRuns

mlenergy_data.records.runs.LLMRuns

Immutable collection of LLM benchmark runs with fluent filtering.

Supports chained filtering, grouping, iteration, and conversion to DataFrames. Two data access patterns are available:

Per-record (row) -- iterate to get individual LLMRun objects:

for r in runs.task("gpqa"):
    print(r.energy_per_token_joules, r.nickname)

best = min(runs.task("gpqa"), key=lambda r: r.energy_per_token_joules)

Per-field (column) -- use the data property for typed field arrays:

energies = runs.data.energy_per_token_joules  # list[float]
gpus = runs.data.num_gpus                     # list[int]

Example:

runs = LLMRuns.from_directory("/path/to/compiled/data")
best = min(runs.stable().task("gpqa"), key=lambda r: r.energy_per_token_joules)
energies = runs.task("gpqa").data.energy_per_token_joules

data property

Typed field accessor returning list[T] per field.

Provides column-oriented access to run fields with full type safety:

runs.data.energy_per_token_joules  # list[float]
runs.data.num_gpus                 # list[int]
runs.data.nickname                 # list[str]

Each property returns a plain list with one element per run, in the same order as iteration.

prefetch()

Eagerly download all raw files for this collection.

When loaded from HF Hub, downloads all raw results.json and prometheus.json files for every run in the collection. Useful when you know you'll need all raw data and want to pay the download cost upfront rather than lazily.

The full unfiltered dataset is ~100 GB. Filter first to limit download size:

runs = LLMRuns.from_hf().task("gpqa").prefetch()

from_directory(root, *, stable_only=True) classmethod

Load runs from a compiled data directory (parquet-first).

Reads runs/llm.parquet from the compiled data repo. No raw file parsing or stability re-computation is performed.

Parameters:

Name Type Description Default
root str | Path

Compiled data directory containing runs/llm.parquet.

required
stable_only bool

If True (default), only return stable runs.

True

from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None, stable_only=True) classmethod

Load LLM runs from a Hugging Face dataset repository.

The default dataset is gated. Before calling this method:

  1. Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
  2. Set the HF_TOKEN environment variable to a Hugging Face access token.

Downloads only the parquet summary file (a few MB). Methods that need raw data (output_lengths(), timelines(), inter_token_latencies()) will automatically download the required files on first access.

Respects the HF_HOME environment variable for cache location.
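A minimal setup sketch for the gated-access steps above (the token value is a placeholder; substitute your own access token):

```shell
# 1. Request access in the browser at
#    https://huggingface.co/datasets/ml-energy/benchmark-v3
# 2. Export your Hugging Face access token (placeholder shown).
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
# Optionally redirect the download cache.
export HF_HOME=/scratch/hf-cache

python -c "from mlenergy_data.records.runs import LLMRuns; print(len(list(LLMRuns.from_hf())))"
```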

Parameters:

Name Type Description Default
repo_id str

HF dataset repository ID.

'ml-energy/benchmark-v3'
revision str | None

Git revision (branch, tag, or commit hash).

None
stable_only bool

If True (default), only return stable runs. See from_raw_results for the definition of stability.

True

from_parquet(path, *, base_dir=None, stable_only=True) classmethod

Construct LLMRuns from a pre-built parquet file.

Parameters:

Name Type Description Default
path Path

Path to the parquet file.

required
base_dir Path | None

If provided, resolve relative results_path and prometheus_path against this directory.

None
stable_only bool

If True (default), only return stable runs.

True

from_raw_results(*roots, tasks=None, config_dir=None, stable_only=True, n_workers=None) classmethod

Load runs from raw benchmark result directories.

Parses results.json files, computes stability, and returns the filtered collection.

A run is considered unstable if any of the following hold:

  • The steady-state duration is shorter than 20 seconds.
  • The energy-per-token value is missing or non-positive.
  • The average batch utilization during steady state is below 85% of the configured max_num_seqs.
  • Cascade rule: if any batch size for a (model, task, GPU, num_gpus) group is unstable, all larger batch sizes in the same group are also marked unstable.

Stability is computed jointly across all roots so the cascade rule works cross-root.
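The cascade rule above can be illustrated with a small standalone sketch. The dict keys mirror the grouping fields named in the description; this is an illustration, not the library's implementation:

```python
from collections import defaultdict

def apply_cascade(runs):
    """Apply the cascade rule: within each (model, task, gpu, num_gpus)
    group, every batch size larger than the smallest unstable one is
    also marked unstable. Mutates is_stable in place and returns runs.
    """
    groups = defaultdict(list)
    for r in runs:
        key = (r["model"], r["task"], r["gpu"], r["num_gpus"])
        groups[key].append(r)

    for members in groups.values():
        # Smallest unstable batch size in this group, if any.
        unstable = [r["max_num_seqs"] for r in members if not r["is_stable"]]
        if not unstable:
            continue
        threshold = min(unstable)
        for r in members:
            if r["max_num_seqs"] > threshold:
                r["is_stable"] = False
    return runs
```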

Parameters:

Name Type Description Default
roots str | Path

One or more benchmark root directories (or results sub-dirs).

()
tasks set[str] | None

If given, only load these tasks.

None
config_dir str | Path | None

Path to LLM config directory (model_info.json, etc.).

None
stable_only bool

If True (default), only return stable runs. Pass False to include all runs; each run's is_stable and unstable_reason fields indicate its status.

True
n_workers int | None

Number of parallel workers (default: auto).

None

task(*tasks)

Filter to runs matching any of the given tasks.

model(*model_ids)

Filter to runs matching any of the given model IDs.

gpu(*gpu_models)

Filter to runs matching any of the given GPU models.

num_gpus(*counts, min=None, max=None)

Filter to runs matching given GPU counts or a range.

Parameters:

Name Type Description Default
counts int

Exact GPU counts to include.

()
min int | None

Minimum GPU count (inclusive).

None
max int | None

Maximum GPU count (inclusive).

None
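One plausible reading of the counts-or-range semantics shared by num_gpus() and batch() is sketched below as a plain predicate (a standalone illustration, not the library's code): exact values take precedence when given, otherwise the inclusive range applies.

```python
def matches(value, *exact, min=None, max=None):
    """Return True if value equals any exact count, or, when no exact
    counts are given, falls within the inclusive [min, max] range."""
    if exact:
        return value in exact
    if min is not None and value < min:
        return False
    if max is not None and value > max:
        return False
    return True
```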

batch(*sizes, min=None, max=None)

Filter to runs matching given batch sizes or a range.

Parameters:

Name Type Description Default
sizes int

Exact batch sizes to include.

()
min int | None

Minimum batch size (inclusive).

None
max int | None

Maximum batch size (inclusive).

None

precision(*prec)

Filter to runs matching any of the given weight precisions.

architecture(*arch)

Filter to runs matching any of the given architectures.

nickname(*nicknames)

Filter to runs matching any of the given nicknames.

stable()

Filter to stable runs only.

unstable()

Filter to unstable runs only.

Raises:

Type Description
ValueError

If this collection was loaded with stable_only=True, since unstable runs were already filtered out at load time.

where(predicate)

Filter runs by an arbitrary predicate.

Parameters:

Name Type Description Default
predicate Callable[[LLMRun], bool]

Function that takes an LLMRun and returns True to keep it.

required

group_by(*fields)

Group runs by one or more fields.

Parameters:

Name Type Description Default
fields str

One or more LLMRun field names to group by.

()

Returns:

Type Description
dict[Any, LLMRuns]

Single field: {value: LLMRuns, ...}. Multiple fields: {(v1, v2, ...): LLMRuns, ...}.
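The key shapes (scalar for a single field, tuple for multiple) can be sketched with a minimal standalone grouper over plain dicts (illustrative only):

```python
from collections import defaultdict

def group_by(records, *fields):
    """Group dicts by one field (scalar keys) or several (tuple keys)."""
    groups = defaultdict(list)
    for rec in records:
        key = rec[fields[0]] if len(fields) == 1 else tuple(rec[f] for f in fields)
        groups[key].append(rec)
    return dict(groups)
```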

to_dataframe()

Convert to DataFrame with one row per run.

output_lengths(*, include_unsuccessful=False)

Extract per-request output lengths.

When loaded from HF Hub, automatically downloads the raw files needed for the current (possibly filtered) collection.

Parameters:

Name Type Description Default
include_unsuccessful bool

If True, include requests that failed during benchmarking (success=False in results.json). Defaults to False (only successful requests).

False

Returns:

Type Description
DataFrame

DataFrame with columns: results_path, task, model_id, num_gpus, max_num_seqs, output_len, success.

Raises:

Type Description
FileNotFoundError

If raw results files are not available locally and the collection was not loaded from HF Hub.

inter_token_latencies()

Extract per-token inter-token latency samples.

Reads raw results files and extracts ITL values from each successful request. Chunked-prefill artifacts (zero-valued ITL entries) are smoothed by spreading the accumulated latency across the covered tokens.
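The smoothing described above can be sketched as follows: each run of zero-valued ITL entries plus the nonzero value that follows it is replaced by their average, spreading the accumulated latency evenly over the covered tokens. This is an illustrative sketch; the library's actual implementation may differ in details.

```python
def smooth_itls(itls):
    """Spread each accumulated (nonzero) ITL evenly across the run of
    preceding zero-valued entries it covers, plus itself."""
    out = []
    zeros = 0
    for itl in itls:
        if itl == 0:
            zeros += 1
        else:
            out.extend([itl / (zeros + 1)] * (zeros + 1))
            zeros = 0
    # Trailing zeros with no closing nonzero value are left as-is.
    out.extend([0.0] * zeros)
    return out
```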

When loaded from HF Hub, automatically downloads the raw files needed for the current (possibly filtered) collection.

Returns:

Type Description
DataFrame

DataFrame with columns: results_path, task, model_id, num_gpus, max_num_seqs, itl_s.

Raises:

Type Description
FileNotFoundError

If raw results files are not available locally and the collection was not loaded from HF Hub.

timelines(*, metric='power.device_instant')

Extract power/temperature timeseries.

When loaded from HF Hub, automatically downloads the raw files needed for the current (possibly filtered) collection.

Parameters:

Name Type Description Default
metric str

Which timeline to extract. Supported values: "power.device_instant", "power.device_average", "temperature".

'power.device_instant'

Returns:

Type Description
DataFrame

DataFrame with columns: results_path, domain, task, model_id, num_gpus, max_num_seqs, batch_size, timestamp, relative_time_s, value, metric.

DiffusionRun

mlenergy_data.records.runs.DiffusionRun dataclass

A single diffusion model benchmark run.

Attributes:

Name Type Description
domain str

Always "diffusion".

task str

Benchmark task ("text-to-image" or "text-to-video").

model_id str

Full HF model identifier.

nickname str

Human-friendly display name from model_info.json.

total_params_billions float

Total parameter count in billions.

activated_params_billions float

Activated parameter count in billions.

weight_precision str

Weight precision (e.g. "bfloat16", "fp8").

gpu_model str

GPU model identifier (e.g. "H100").

num_gpus int

Number of GPUs used.

batch_size int

Batch size.

inference_steps int | None

Number of diffusion inference steps.

height int

Output height in pixels.

width int

Output width in pixels.

num_frames int | None

Number of video frames (None for images).

fps int | None

Video frames per second (None for images).

ulysses_degree int | None

Ulysses sequence parallelism degree.

ring_degree int | None

Ring attention parallelism degree.

use_torch_compile bool | None

Whether torch.compile was enabled.

batch_latency_s float

Average batch latency in seconds.

avg_power_watts float

Average GPU power in watts.

energy_per_generation_joules float

Energy per generated output (image or video) in joules.

throughput_generations_per_sec float

Throughput in generations per second.

results_path str

Path to the raw results.json file.

is_text_to_image property

Whether this is a text-to-image run.

is_text_to_video property

Whether this is a text-to-video run.

DiffusionRuns

mlenergy_data.records.runs.DiffusionRuns

Immutable collection of diffusion model benchmark runs with fluent filtering.

Same collection pattern as LLMRuns, with diffusion-specific filters. Two data access patterns are available:

Per-record (row) -- iterate to get individual DiffusionRun objects:

for r in runs.task("text-to-image"):
    print(r.energy_per_generation_joules, r.nickname)

Per-field (column) -- use the data property for typed field arrays:

powers = runs.data.avg_power_watts  # list[float]

data property

Typed field accessor returning list[T] per field.

Provides column-oriented access to run fields with full type safety:

runs.data.avg_power_watts          # list[float]
runs.data.num_gpus                 # list[int]
runs.data.nickname                 # list[str]

Each property returns a plain list with one element per run, in the same order as iteration.

prefetch()

Eagerly download all raw files for this collection.

When loaded from HF Hub, downloads all raw results.json files for every run in the collection. Useful when you know you'll need all raw data and want to pay the download cost upfront rather than lazily.

The full unfiltered dataset is ~100 GB. Filter first to limit download size:

runs = DiffusionRuns.from_hf().task("text-to-image").prefetch()

from_directory(root) classmethod

Load runs from a compiled data directory (parquet-first).

Reads runs/diffusion.parquet from the compiled data repo.

Parameters:

Name Type Description Default
root str | Path

Compiled data directory containing runs/diffusion.parquet.

required

from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None) classmethod

Load diffusion runs from a Hugging Face dataset repository.

The default dataset is gated. Before calling this method:

  1. Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
  2. Set the HF_TOKEN environment variable to a Hugging Face access token.

Downloads only the parquet summary file (a few MB). Methods that need raw data (timelines()) will automatically download the required files on first access.

Respects the HF_HOME environment variable for cache location.

Parameters:

Name Type Description Default
repo_id str

HF dataset repository ID.

'ml-energy/benchmark-v3'
revision str | None

Git revision (branch, tag, or commit hash).

None

from_parquet(path, *, base_dir=None) classmethod

Construct DiffusionRuns from a pre-built parquet file.

Parameters:

Name Type Description Default
path Path

Path to the parquet file.

required
base_dir Path | None

If provided, resolve relative results_path against this directory.

None

from_raw_results(*roots, tasks=None, config_dir=None, n_workers=None) classmethod

Load runs from raw benchmark result directories.

Parses results.json files and returns the collection.

Parameters:

Name Type Description Default
roots str | Path

One or more benchmark root directories (or results sub-dirs).

()
tasks set[str] | None

If given, only load these tasks.

None
config_dir str | Path | None

Path to diffusion config directory.

None
n_workers int | None

Number of parallel workers (default: auto).

None

task(*tasks)

Filter to runs matching any of the given tasks.

model(*model_ids)

Filter to runs matching any of the given model IDs.

gpu(*gpu_models)

Filter to runs matching any of the given GPU models.

num_gpus(*counts, min=None, max=None)

Filter to runs matching given GPU counts or a range.

Parameters:

Name Type Description Default
counts int

Exact GPU counts to include.

()
min int | None

Minimum GPU count (inclusive).

None
max int | None

Maximum GPU count (inclusive).

None

nickname(*nicknames)

Filter to runs matching any of the given nicknames.

batch(*sizes, min=None, max=None)

Filter to runs matching given batch sizes or a range.

Parameters:

Name Type Description Default
sizes int

Exact batch sizes to include.

()
min int | None

Minimum batch size (inclusive).

None
max int | None

Maximum batch size (inclusive).

None

precision(*prec)

Filter to runs matching any of the given weight precisions.

where(predicate)

Filter runs by an arbitrary predicate.

Parameters:

Name Type Description Default
predicate Callable[[DiffusionRun], bool]

Function that takes a DiffusionRun and returns True to keep it.

required

group_by(*fields)

Group runs by one or more fields.

Parameters:

Name Type Description Default
fields str

One or more DiffusionRun field names to group by.

()

Returns:

Type Description
dict[Any, DiffusionRuns]

Single field: {value: DiffusionRuns, ...}. Multiple fields: {(v1, v2, ...): DiffusionRuns, ...}.

to_dataframe()

Convert to DataFrame with one row per run.

timelines(*, metric='power.device_instant')

Extract power/temperature timeseries.

When loaded from HF Hub, automatically downloads the raw files needed for the current (possibly filtered) collection.

Parameters:

Name Type Description Default
metric str

Which timeline to extract. Supported values: "power.device_instant", "power.device_average", "temperature".

'power.device_instant'

Returns:

Type Description
DataFrame

DataFrame with columns: results_path, domain, task, model_id, num_gpus, max_num_seqs, batch_size, timestamp, relative_time_s, value, metric.