Records API
Typed collection classes for benchmark runs.
LLMRun
mlenergy_data.records.runs.LLMRun
dataclass
A single LLM benchmark run.
Attributes:
| Name | Type | Description |
|---|---|---|
| domain | str | Always "llm". |
| task | str | Benchmark task (e.g. "gpqa"). |
| model_id | str | Full HF model identifier. |
| nickname | str | Human-friendly display name from model_info.json. |
| architecture | str | Model architecture. |
| total_params_billions | float | Total parameter count in billions. |
| activated_params_billions | float | Activated parameter count in billions (equals total for dense). |
| weight_precision | str | Weight precision. |
| gpu_model | str | GPU model identifier. |
| num_gpus | int | Number of GPUs used. |
| max_num_seqs | int | Maximum concurrent sequences (batch size). |
| seed | int \| None | Random seed used for the benchmark run. |
| num_request_repeats | int \| None | Number of request repetitions. |
| tensor_parallel | int | Tensor parallelism degree. |
| expert_parallel | int | Expert parallelism degree. |
| data_parallel | int | Data parallelism degree. |
| steady_state_energy_joules | float | Total GPU energy during steady state in joules. |
| steady_state_duration_seconds | float | Duration of steady state in seconds. |
| energy_per_token_joules | float | Steady-state energy per output token in joules. |
| energy_per_request_joules | float \| None | Estimated energy per request in joules. |
| output_throughput_tokens_per_sec | float | Steady-state output throughput in tokens/second. |
| request_throughput_req_per_sec | float \| None | Steady-state request throughput in requests/second. |
| avg_power_watts | float | Average GPU power during steady state in watts. |
| total_output_tokens | float \| None | Total output tokens generated (over full benchmark). |
| completed_requests | float \| None | Number of completed requests (over full benchmark). |
| avg_output_len | float \| None | Average output length in tokens. |
| mean_itl_ms | float | Mean inter-token latency in milliseconds. |
| median_itl_ms | float | Median inter-token latency in milliseconds. |
| p50_itl_ms | float | 50th percentile inter-token latency in milliseconds. |
| p90_itl_ms | float | 90th percentile inter-token latency in milliseconds. |
| p95_itl_ms | float | 95th percentile inter-token latency in milliseconds. |
| p99_itl_ms | float | 99th percentile inter-token latency in milliseconds. |
| avg_batch_size | float \| None | Average concurrent sequences during steady state (from Prometheus). |
| is_stable | bool | Whether this run passed stability checks. |
| unstable_reason | str | Reason for instability (empty if stable). |
| results_path | str | Path to the raw results.json file. |
| prometheus_path | str | Path to the prometheus.json file. |
LLMRuns
mlenergy_data.records.runs.LLMRuns
Immutable collection of LLM benchmark runs with fluent filtering.
Supports chained filtering, grouping, iteration, and conversion to DataFrames. Two data access patterns are available:
Per-record (row) -- iterate to get individual LLMRun objects:
for r in runs.task("gpqa"):
    print(r.energy_per_token_joules, r.nickname)

best = min(runs.task("gpqa"), key=lambda r: r.energy_per_token_joules)
Per-field (column) -- use the data property for typed field arrays:
energies = runs.data.energy_per_token_joules # list[float]
gpus = runs.data.num_gpus # list[int]
Example:
runs = LLMRuns.from_directory("/path/to/compiled/data")
best = min(runs.stable().task("gpqa"), key=lambda r: r.energy_per_token_joules)
energies = runs.task("gpqa").data.energy_per_token_joules
data
property
Typed field accessor returning list[T] per field.
Provides column-oriented access to run fields with full type safety:
runs.data.energy_per_token_joules # list[float]
runs.data.num_gpus # list[int]
runs.data.nickname # list[str]
Each property returns a plain list with one element per run,
in the same order as iteration.
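The column lists pair naturally with plain Python aggregation since they share the iteration order. A small sketch (the task filter is only an example):

gpqa = runs.stable().task("gpqa")
energies = gpqa.data.energy_per_token_joules   # list[float], one entry per run
names = gpqa.data.nickname                     # list[str], same order as energies

# Columns can be zipped positionally because they share the iteration order.
per_model = dict(zip(names, energies))
avg_energy = sum(energies) / len(energies) if energies else float("nan")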
prefetch()
Eagerly download all raw files for this collection.
When loaded from HF Hub, downloads all raw results.json and
prometheus.json files for every run in the collection. Useful
when you know you'll need all raw data and want to pay the download
cost upfront rather than lazily.
The full unfiltered dataset is ~100 GB. Filter first to limit download size:
runs = LLMRuns.from_hf().task("gpqa").prefetch()
from_directory(root, *, stable_only=True)
classmethod
Load runs from a compiled data directory (parquet-first).
Reads runs/llm.parquet from the compiled data repo. No raw file
parsing or stability re-computation is performed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str \| Path | Compiled data directory containing runs/llm.parquet. | required |
| stable_only | bool | If True (default), only return stable runs. | True |
from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None, stable_only=True)
classmethod
Load LLM runs from a Hugging Face dataset repository.
The default dataset is gated. Before calling this method:
- Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
- Set the HF_TOKEN environment variable to a Hugging Face access token.
Downloads only the parquet summary file (a few MB). Methods that need raw data (output_lengths(), timelines(), inter_token_latencies()) automatically download the required files on first access.
Respects the HF_HOME environment variable for cache location.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | HF dataset repository ID. | 'ml-energy/benchmark-v3' |
| revision | str \| None | Git revision (branch, tag, or commit hash). | None |
| stable_only | bool | If True (default), only return stable runs. See from_raw_results for the definition of stability. | True |
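A typical loading flow is sketched below; the token value is a placeholder and the task filter is only an example:

import os
from mlenergy_data.records.runs import LLMRuns

# Assumes access to the gated dataset has already been granted.
os.environ.setdefault("HF_TOKEN", "hf_...")   # placeholder; use your own token

runs = LLMRuns.from_hf()                      # downloads only the parquet summary
gpqa = runs.task("gpqa")                      # filter before touching any raw files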
from_parquet(path, *, base_dir=None, stable_only=True)
classmethod
Construct LLMRuns from a pre-built parquet file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Path to the parquet file. | required |
| base_dir | Path \| None | If provided, resolve relative results_path and prometheus_path against this directory. | None |
| stable_only | bool | If True (default), only return stable runs. | True |
from_raw_results(*roots, tasks=None, config_dir=None, stable_only=True, n_workers=None)
classmethod
Load runs from raw benchmark result directories.
Parses results.json files, computes stability, and returns
the filtered collection.
A run is considered unstable if any of the following hold:
- The steady-state duration is shorter than 20 seconds.
- The energy-per-token value is missing or non-positive.
- The average batch utilization during steady state is below 85% of the configured max_num_seqs.
- Cascade rule: if any batch size for a (model, task, GPU, num_gpus) group is unstable, all larger batch sizes in the same group are also marked unstable.
Stability is computed jointly across all roots so the cascade rule works cross-root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| roots | str \| Path | One or more benchmark root directories (or results sub-dirs). | () |
| tasks | set[str] \| None | If given, only load these tasks. | None |
| config_dir | str \| Path \| None | Path to LLM config directory (model_info.json, etc.). | None |
| stable_only | bool | If True (default), only return stable runs. Pass False to include all runs; each run's is_stable and unstable_reason fields record the stability result. | True |
| n_workers | int \| None | Number of parallel workers (default: auto). | None |
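A usage sketch follows; the directory paths are hypothetical and should be replaced with your own benchmark roots:

from mlenergy_data.records.runs import LLMRuns

# Hypothetical local result directories.
runs = LLMRuns.from_raw_results(
    "/data/benchmark/root_a",
    "/data/benchmark/root_b",
    tasks={"gpqa"},
    stable_only=False,   # keep everything; stability is recorded per run
)
unstable = [r for r in runs if not r.is_stable]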
task(*tasks)
Filter to runs matching any of the given tasks.
model(*model_ids)
Filter to runs matching any of the given model IDs.
gpu(*gpu_models)
Filter to runs matching any of the given GPU models.
num_gpus(*counts, min=None, max=None)
Filter to runs matching given GPU counts or a range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| counts | int | Exact GPU counts to include. | () |
| min | int \| None | Minimum GPU count (inclusive). | None |
| max | int \| None | Maximum GPU count (inclusive). | None |
|
batch(*sizes, min=None, max=None)
Filter to runs matching given batch sizes or a range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sizes | int | Exact batch sizes to include. | () |
| min | int \| None | Minimum batch size (inclusive). | None |
| max | int \| None | Maximum batch size (inclusive). | None |
|
precision(*prec)
Filter to runs matching any of the given weight precisions.
architecture(*arch)
Filter to runs matching any of the given architectures.
nickname(*nicknames)
Filter to runs matching any of the given nicknames.
stable()
Filter to stable runs only.
unstable()
Filter to unstable runs only.
Raises:
| Type | Description |
|---|---|
| ValueError | If the collection was loaded with stable_only=True, so unstable runs were dropped at load time. |
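Because every filter returns a new collection, the filters above chain freely. A small sketch (the filter values are illustrative):

subset = (
    runs.stable()
        .task("gpqa")
        .num_gpus(min=1, max=4)
        .batch(max=256)
)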
where(predicate)
Filter runs by an arbitrary predicate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predicate | Callable[[LLMRun], bool] | Function that takes an LLMRun and returns True to keep the run. | required |
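For criteria without a dedicated filter, where() accepts any predicate over LLMRun fields. An illustrative sketch (thresholds are examples):

# Keep runs whose steady state lasted at least one minute.
long_runs = runs.where(lambda r: r.steady_state_duration_seconds >= 60)

# Predicates can combine multiple fields, e.g. dense models only.
dense = runs.where(lambda r: r.activated_params_billions == r.total_params_billions)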
group_by(*fields)
Group runs by one or more fields.
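Assuming the return structure mirrors DiffusionRuns.group_by (a dict of sub-collections keyed by field value), a grouping sketch:

# Group by GPU model; each value is assumed to be an LLMRuns sub-collection.
by_gpu = runs.task("gpqa").group_by("gpu_model")
for gpu, sub in by_gpu.items():
    print(gpu, min(sub.data.energy_per_token_joules))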
to_dataframe()
Convert to DataFrame with one row per run.
output_lengths(*, include_unsuccessful=False)
Extract per-request output lengths.
When loaded from HF Hub, automatically downloads the raw files needed for the current (possibly filtered) collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| include_unsuccessful | bool | If True, include requests that failed during benchmarking (rows with success set to False). | False |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with columns: results_path, task, model_id, num_gpus, max_num_seqs, output_len, success. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If raw results files are not available locally and the collection was not loaded from HF Hub. |
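A usage sketch, assuming the return value behaves like a pandas DataFrame:

df = runs.task("gpqa").output_lengths()
# Mean output length per model (only successful requests are included by default).
mean_len = df.groupby("model_id")["output_len"].mean()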
inter_token_latencies()
Extract per-token inter-token latency samples.
Reads raw results files and extracts ITL values from each successful request. Chunked-prefill artifacts (zero-valued ITL entries) are smoothed by spreading the accumulated latency across the covered tokens.
When loaded from HF Hub, automatically downloads the raw files needed for the current (possibly filtered) collection.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with columns: results_path, task, model_id, num_gpus, max_num_seqs, itl_s. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If raw results files are not available locally and the collection was not loaded from HF Hub. |
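A sketch of summarizing the raw samples, assuming a pandas DataFrame and that itl_s is in seconds:

itl = runs.task("gpqa").inter_token_latencies()
# Tail latency per model, converted from seconds to milliseconds.
p99_ms = itl.groupby("model_id")["itl_s"].quantile(0.99) * 1000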
timelines(*, metric='power.device_instant')
Extract power/temperature timeseries.
When loaded from HF Hub, automatically downloads the raw files needed for the current (possibly filtered) collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metric | str | Which timeline to extract. Supported values include 'power.device_instant'. | 'power.device_instant' |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with columns: results_path, domain, task, model_id, num_gpus, max_num_seqs, batch_size, timestamp, relative_time_s, value, metric. |
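A plotting sketch, assuming a pandas DataFrame and matplotlib; the single-run selection is illustrative and the y-axis label assumes the power metric is reported in watts:

import matplotlib.pyplot as plt

tl = runs.task("gpqa").timelines()            # power.device_instant by default
first = tl["results_path"].iloc[0]            # pick one run to plot
one = tl[tl["results_path"] == first]
plt.plot(one["relative_time_s"], one["value"])
plt.xlabel("relative time (s)")
plt.ylabel("GPU power (W)")
plt.show()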
DiffusionRun
mlenergy_data.records.runs.DiffusionRun
dataclass
A single diffusion model benchmark run.
Attributes:
| Name | Type | Description |
|---|---|---|
| domain | str | Always "diffusion". |
| task | str | Benchmark task ("text-to-image" or "text-to-video"). |
| model_id | str | Full HF model identifier. |
| nickname | str | Human-friendly display name. |
| total_params_billions | float | Total parameter count in billions. |
| activated_params_billions | float | Activated parameter count in billions. |
| weight_precision | str | Weight precision. |
| gpu_model | str | GPU model identifier. |
| num_gpus | int | Number of GPUs used. |
| batch_size | int | Batch size. |
| inference_steps | int \| None | Number of diffusion inference steps. |
| height | int | Output height in pixels. |
| width | int | Output width in pixels. |
| num_frames | int \| None | Number of video frames (text-to-video only). |
| fps | int \| None | Video frames per second (text-to-video only). |
| ulysses_degree | int \| None | Ulysses sequence parallelism degree. |
| ring_degree | int \| None | Ring attention parallelism degree. |
| use_torch_compile | bool \| None | Whether torch.compile was enabled. |
| batch_latency_s | float | Average batch latency in seconds. |
| avg_power_watts | float | Average GPU power in watts. |
| energy_per_generation_joules | float | Energy per generated output (image or video) in joules. |
| throughput_generations_per_sec | float | Throughput in generations per second. |
| results_path | str | Path to the raw results.json file. |
is_text_to_image
property
Whether this is a text-to-image run.
is_text_to_video
property
Whether this is a text-to-video run.
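For example, these properties pair with DiffusionRuns.where() to split a collection by output modality (runs here is a DiffusionRuns collection):

# Split a DiffusionRuns collection by output modality.
t2i = runs.where(lambda r: r.is_text_to_image)
t2v = runs.where(lambda r: r.is_text_to_video)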
DiffusionRuns
mlenergy_data.records.runs.DiffusionRuns
Immutable collection of diffusion model benchmark runs with fluent filtering.
Same collection pattern as LLMRuns, with diffusion-specific filters.
Two data access patterns are available:
Per-record (row) -- iterate to get individual DiffusionRun objects:
for r in runs.task("text-to-image"):
    print(r.energy_per_generation_joules, r.nickname)
Per-field (column) -- use the data property for typed field arrays:
powers = runs.data.avg_power_watts # list[float]
data
property
Typed field accessor returning list[T] per field.
Provides column-oriented access to run fields with full type safety:
runs.data.avg_power_watts # list[float]
runs.data.num_gpus # list[int]
runs.data.nickname # list[str]
Each property returns a plain list with one element per run,
in the same order as iteration.
prefetch()
Eagerly download all raw files for this collection.
When loaded from HF Hub, downloads all raw results.json files
for every run in the collection. Useful when you know you'll need
all raw data and want to pay the download cost upfront rather than
lazily.
The full unfiltered dataset is ~100 GB. Filter first to limit download size:
runs = DiffusionRuns.from_hf().task("text-to-image").prefetch()
from_directory(root)
classmethod
Load runs from a compiled data directory (parquet-first).
Reads runs/diffusion.parquet from the compiled data repo.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str \| Path | Compiled data directory containing runs/diffusion.parquet. | required |
from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None)
classmethod
Load diffusion runs from a Hugging Face dataset repository.
The default dataset is gated. Before calling this method:
- Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
- Set the HF_TOKEN environment variable to a Hugging Face access token.
Downloads only the parquet summary file (a few MB). Methods that need raw data (timelines()) automatically download the required files on first access.
Respects the HF_HOME environment variable for cache location.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | HF dataset repository ID. | 'ml-energy/benchmark-v3' |
| revision | str \| None | Git revision (branch, tag, or commit hash). | None |
from_parquet(path, *, base_dir=None)
classmethod
Construct DiffusionRuns from a pre-built parquet file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Path to the parquet file. | required |
| base_dir | Path \| None | If provided, resolve relative results_path against this directory. | None |
from_raw_results(*roots, tasks=None, config_dir=None, n_workers=None)
classmethod
Load runs from raw benchmark result directories.
Parses results.json files and returns the collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| roots | str \| Path | One or more benchmark root directories (or results sub-dirs). | () |
| tasks | set[str] \| None | If given, only load these tasks. | None |
| config_dir | str \| Path \| None | Path to diffusion config directory. | None |
| n_workers | int \| None | Number of parallel workers (default: auto). | None |
task(*tasks)
Filter to runs matching any of the given tasks.
model(*model_ids)
Filter to runs matching any of the given model IDs.
gpu(*gpu_models)
Filter to runs matching any of the given GPU models.
num_gpus(*counts, min=None, max=None)
Filter to runs matching given GPU counts or a range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| counts | int | Exact GPU counts to include. | () |
| min | int \| None | Minimum GPU count (inclusive). | None |
| max | int \| None | Maximum GPU count (inclusive). | None |
nickname(*nicknames)
Filter to runs matching any of the given nicknames.
batch(*sizes, min=None, max=None)
Filter to runs matching given batch sizes or a range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sizes | int | Exact batch sizes to include. | () |
| min | int \| None | Minimum batch size (inclusive). | None |
| max | int \| None | Maximum batch size (inclusive). | None |
precision(*prec)
Filter to runs matching any of the given weight precisions.
where(predicate)
Filter runs by an arbitrary predicate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predicate | Callable[[DiffusionRun], bool] | Function that takes a DiffusionRun and returns True to keep the run. | required |
group_by(*fields)
Group runs by one or more fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fields | str | One or more DiffusionRun field names to group by. | () |
Returns:
| Type | Description |
|---|---|
| dict[Any, DiffusionRuns] | Single field: dict keyed by that field's value. Multiple fields: dict keyed by a tuple of field values. |
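For instance, grouping can compare energy per generation across GPU models (the grouping field and task are illustrative):

by_gpu = runs.task("text-to-image").group_by("gpu_model")
for gpu, sub in by_gpu.items():
    e = sub.data.energy_per_generation_joules
    print(gpu, sum(e) / len(e))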
to_dataframe()
Convert to DataFrame with one row per run.
timelines(*, metric='power.device_instant')
Extract power/temperature timeseries.
When loaded from HF Hub, automatically downloads the raw files needed for the current (possibly filtered) collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metric | str | Which timeline to extract. Supported values include 'power.device_instant'. | 'power.device_instant' |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with columns: results_path, domain, task, model_id, num_gpus, max_num_seqs, batch_size, timestamp, relative_time_s, value, metric. |