# Records API

Typed collection classes for benchmark runs.

## LLMRun

`mlenergy_data.records.runs.LLMRun` (dataclass)

A single LLM benchmark run.
Attributes:

| Name | Type | Description |
|---|---|---|
| `domain` | `str` | Always |
| `task` | `str` | Benchmark task (e.g. |
| `model_id` | `str` | Full HF model identifier (e.g. |
| `nickname` | `str` | Human-friendly display name from |
| `architecture` | `str` | Model architecture ( |
| `total_params_billions` | `float` | Total parameter count in billions. |
| `activated_params_billions` | `float` | Activated parameter count in billions (equals total for dense). |
| `weight_precision` | `str` | Weight precision (e.g. |
| `gpu_model` | `str` | GPU model identifier (e.g. |
| `num_gpus` | `int` | Number of GPUs used. |
| `max_num_seqs` | `int` | Maximum concurrent sequences (batch size). |
| `seed` | `int \| None` | Random seed used for the benchmark run. |
| `num_request_repeats` | `int \| None` | Number of request repetitions. |
| `tensor_parallel` | `int` | Tensor parallelism degree. |
| `expert_parallel` | `int` | Expert parallelism degree. |
| `data_parallel` | `int` | Data parallelism degree. |
| `steady_state_energy_joules` | `float` | Total GPU energy during steady state in joules. |
| `steady_state_duration_seconds` | `float` | Duration of steady state in seconds. |
| `energy_per_token_joules` | `float` | Steady-state energy per output token in joules. |
| `energy_per_request_joules` | `float \| None` | Estimated energy per request in joules. |
| `output_throughput_tokens_per_sec` | `float` | Steady-state output throughput in tokens/second. |
| `request_throughput_req_per_sec` | `float \| None` | Steady-state request throughput in requests/second. |
| `avg_power_watts` | `float` | Average GPU power during steady state in watts. |
| `total_output_tokens` | `float \| None` | Total output tokens generated (over the full benchmark). |
| `completed_requests` | `float \| None` | Number of completed requests (over the full benchmark). |
| `avg_output_len` | `float \| None` | Average output length in tokens. |
| `mean_itl_ms` | `float` | Mean inter-token latency in milliseconds. |
| `median_itl_ms` | `float` | Median inter-token latency in milliseconds. |
| `p50_itl_ms` | `float` | 50th percentile inter-token latency in milliseconds. |
| `p90_itl_ms` | `float` | 90th percentile inter-token latency in milliseconds. |
| `p95_itl_ms` | `float` | 95th percentile inter-token latency in milliseconds. |
| `p99_itl_ms` | `float` | 99th percentile inter-token latency in milliseconds. |
| `avg_batch_size` | `float \| None` | Average concurrent sequences during steady state (from Prometheus). |
| `is_stable` | `bool` | Whether this run passed stability checks. |
| `unstable_reason` | `str` | Reason for instability (empty if stable). |
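The steady-state fields are related by simple unit arithmetic. The helper below recomputes the derived fields from the raw totals (an illustrative sketch based on the field descriptions above, not library code; published values may differ by rounding):

```python
def derived_metrics(
    energy_joules: float, duration_seconds: float, output_tokens: float
) -> dict[str, float]:
    """Recompute derived steady-state metrics from raw steady-state totals."""
    return {
        # joules / seconds = watts
        "avg_power_watts": energy_joules / duration_seconds,
        # joules / tokens = joules per output token
        "energy_per_token_joules": energy_joules / output_tokens,
        # tokens / seconds = tokens per second
        "output_throughput_tokens_per_sec": output_tokens / duration_seconds,
    }
```

For example, a run that consumed 2000 J over 100 s of steady state while emitting 500 tokens averages 20 W and 4 J per token.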
### `read_results_json()`

Download (if needed) and return the parsed `results.json`. The parsed dict is cached per instance, so repeated calls are free.

### `read_prometheus_json()`

Download (if needed) and return the parsed `prometheus.json`. The parsed dict is cached per instance, so repeated calls are free.
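The per-instance caching can be pictured as a simple memo attribute (a minimal sketch, not the library's actual internals; the class and attribute names here are invented for illustration):

```python
import json
from pathlib import Path


class CachedResults:
    """Illustrative per-instance cache for a parsed JSON file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self._results_cache: dict | None = None  # filled on first read

    def read_results_json(self) -> dict:
        # Parse only on the first call; later calls return the cached dict.
        if self._results_cache is None:
            self._results_cache = json.loads(self.path.read_text())
        return self._results_cache
```

Because the cache is per instance, two `CachedResults` objects for the same path each parse once, and repeated calls on one object return the identical dict.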
### `output_lengths(*, include_unsuccessful=False)`

Return per-request output token lengths from `results.json`.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `include_unsuccessful` | `bool` | If True, include requests that failed during benchmarking. | `False` |
### `inter_token_latencies()`

Return inter-token latency samples in seconds from `results.json`.

### `timelines(*, metric='power.device_instant')`

Return the power/temperature timeseries for this run.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metric` | `Literal['power.device_instant', 'power.device_average', 'temperature']` | Which timeline to extract. | `'power.device_instant'` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with columns: timestamp, relative_time_s, value, metric. |
## LLMRuns

`mlenergy_data.records.runs.LLMRuns`

Immutable collection of LLM benchmark runs with fluent filtering.

Iterate to get individual `LLMRun` objects:

```python
for r in runs.task("gpqa"):
    print(r.energy_per_token_joules, r.nickname)

best = min(runs.task("gpqa"), key=lambda r: r.energy_per_token_joules)
```

Example:

```python
runs = LLMRuns.from_directory("/path/to/compiled/data")
best = min(runs.stable().task("gpqa"), key=lambda r: r.energy_per_token_joules)
energies = [r.energy_per_token_joules for r in runs.task("gpqa")]
```
### `download_raw_files(file=None)`

Download all raw files for this collection in parallel. Downloads `results.json` and `prometheus.json` for every run in the collection. Only useful when loaded from HF Hub; a no-op for local sources.

The full unfiltered dataset is ~100 GB, so filter first to limit download size:

```python
runs = LLMRuns.from_hf().task("gpqa").download_raw_files()
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `Literal['results', 'prometheus'] \| None` | If specified, only download this file type; otherwise download both. | `None` |
### `from_directory(root, *, stable_only=True)` (classmethod)

Load runs from a compiled data directory (parquet-first). Reads `runs/llm.parquet` from the compiled data repo; no raw file parsing or stability re-computation is performed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str \| Path` | Compiled data directory containing `runs/llm.parquet`. | required |
| `stable_only` | `bool` | If True (default), only return stable runs. | `True` |
### `from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None, stable_only=True)` (classmethod)

Load LLM runs from a Hugging Face dataset repository.

The default dataset is gated. Before calling this method:

- Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
- Set the `HF_TOKEN` environment variable to a Hugging Face access token.

Downloads only the parquet summary file (a few MB). Methods that need raw data (`output_lengths()`, `timelines()`, `inter_token_latencies()`) automatically download the required files on first access. Respects the `HF_HOME` environment variable for the cache location.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `repo_id` | `str` | HF dataset repository ID. | `'ml-energy/benchmark-v3'` |
| `revision` | `str \| None` | Git revision (branch, tag, or commit hash). | `None` |
| `stable_only` | `bool` | If True (default), only return stable runs. See `from_raw_results` for the definition of stability. | `True` |
### `from_parquet(path, *, base_dir=None, stable_only=True)` (classmethod)

Construct `LLMRuns` from a pre-built parquet file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the parquet file. | required |
| `base_dir` | `Path \| None` | If provided, resolve relative path fields against this directory. | `None` |
| `stable_only` | `bool` | If True (default), only return stable runs. | `True` |
### `from_raw_results(*roots, tasks=None, config_dir=None, stable_only=True, n_workers=None)` (classmethod)

Load runs from raw benchmark result directories. Parses `results.json` files, computes stability, and returns the filtered collection.

A run is considered unstable if any of the following hold:

- The steady-state duration is shorter than 20 seconds.
- The energy-per-token value is missing or non-positive.
- The average batch utilization during steady state is below 85% of the configured `max_num_seqs`.
- Cascade rule: if any batch size for a (model, task, GPU, num_gpus) group is unstable, all larger batch sizes in the same group are also marked unstable.

Stability is computed jointly across all roots so the cascade rule works cross-root.
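The cascade rule can be sketched as follows (an illustrative toy over plain dicts, not the library's implementation; field names mirror the documented attributes):

```python
from collections import defaultdict


def apply_cascade(runs: list[dict]) -> list[dict]:
    """Within each (model, task, gpu, num_gpus) group, mark every batch size
    larger than the smallest unstable one as unstable too."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for r in runs:
        groups[(r["model"], r["task"], r["gpu"], r["num_gpus"])].append(r)

    for group in groups.values():
        unstable_sizes = [r["max_num_seqs"] for r in group if not r["is_stable"]]
        if not unstable_sizes:
            continue
        threshold = min(unstable_sizes)
        for r in group:
            if r["max_num_seqs"] > threshold:
                r["is_stable"] = False
                r["unstable_reason"] = "cascade from smaller batch size"
    return runs
```

So if batch size 16 of a group fails a check, batch size 32 in the same group is marked unstable even if its own metrics looked fine, while batch size 8 is unaffected.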
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `roots` | `str \| Path` | One or more benchmark root directories (or results sub-dirs). | `()` |
| `tasks` | `set[str] \| None` | If given, only load these tasks. | `None` |
| `config_dir` | `str \| Path \| None` | Path to the LLM config directory (model_info.json, etc.). | `None` |
| `stable_only` | `bool` | If True (default), only return stable runs. Pass False to include all runs; each run's `is_stable` and `unstable_reason` fields record the verdict. | `True` |
| `n_workers` | `int \| None` | Number of parallel workers (default: auto). | `None` |
### `task(*tasks)`

Filter to runs matching any of the given tasks.

### `model_id(*model_ids)`

Filter to runs matching any of the given model IDs.

### `gpu_model(*gpu_models)`

Filter to runs matching any of the given GPU models.
### `num_gpus(*counts, min=None, max=None)`

Filter to runs matching given GPU counts or a range.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `counts` | `int` | Exact GPU counts to include. | `()` |
| `min` | `int \| None` | Minimum GPU count (inclusive). | `None` |
| `max` | `int \| None` | Maximum GPU count (inclusive). | `None` |
### `max_num_seqs(*sizes, min=None, max=None)`

Filter to runs matching given `max_num_seqs` values or a range.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sizes` | `int` | Exact values to include. | `()` |
| `min` | `int \| None` | Minimum value (inclusive). | `None` |
| `max` | `int \| None` | Maximum value (inclusive). | `None` |
### `precision(*prec)`

Filter to runs matching any of the given weight precisions.

### `architecture(*arch)`

Filter to runs matching any of the given architectures.

### `nickname(*nicknames)`

Filter to runs matching any of the given nicknames.

### `stable()`

Filter to stable runs only.

### `unstable()`

Filter to unstable runs only.

Raises:

| Type | Description |
|---|---|
| `ValueError` | If this collection was loaded with `stable_only=True`. |
### `where(predicate)`

Filter runs by an arbitrary predicate.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `predicate` | `Callable[[LLMRun], bool]` | Function that takes an `LLMRun` and returns whether to keep it. | required |

### `group_by(*fields)`

Group runs by one or more fields.

### `to_dataframe()`

Convert to a DataFrame with one row per run. Private fields (path fields, HF metadata) are excluded.
### `output_lengths(*, include_unsuccessful=False)`

Extract per-request output lengths. Calls `LLMRun.output_lengths()` on each record, which handles downloading from HF Hub if needed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `include_unsuccessful` | `bool` | If True, include requests that failed during benchmarking. | `False` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with columns: task, model_id, num_gpus, max_num_seqs, output_len. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If raw results files are not available locally and the collection was not loaded from HF Hub. |
### `inter_token_latencies()`

Extract per-token inter-token latency samples. Calls `LLMRun.inter_token_latencies()` on each record, which handles downloading from HF Hub if needed. Chunked-prefill artifacts (zero-valued ITL entries) are smoothed by spreading the accumulated latency across the covered tokens.

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with columns: task, model_id, num_gpus, max_num_seqs, itl_s. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If raw results files are not available locally and the collection was not loaded from HF Hub. |
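The smoothing step can be sketched as follows: each run of zero-valued ITL entries is assumed to belong to the chunk started by the preceding non-zero entry, and that entry's accumulated latency is spread evenly over the chunk (an illustrative sketch; the library's exact algorithm may differ):

```python
def smooth_itls(itls: list[float]) -> list[float]:
    """Spread each non-zero ITL over itself plus the zero entries that follow.

    Example: [3.0, 0.0, 0.0, 5.0] -> [1.0, 1.0, 1.0, 5.0]
    """
    out: list[float] = []
    i = 0
    while i < len(itls):
        j = i + 1
        # Zero entries following itls[i] belong to the same decode chunk.
        while j < len(itls) and itls[j] == 0.0:
            j += 1
        span = j - i
        out.extend([itls[i] / span] * span)
        i = j
    return out
```

The total latency is preserved; only its attribution across tokens changes, which keeps per-token percentile statistics from being skewed by artificial zeros.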
### `timelines(*, metric='power.device_instant')`

Extract power/temperature timeseries. Calls `LLMRun.timelines()` on each record, which handles downloading from HF Hub if needed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metric` | `Literal['power.device_instant', 'power.device_average', 'temperature']` | Which timeline to extract. | `'power.device_instant'` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with columns: task, model_id, num_gpus, max_num_seqs, timestamp, relative_time_s, value, metric. |
## DiffusionRun

`mlenergy_data.records.runs.DiffusionRun` (dataclass)

A single diffusion model benchmark run.
Attributes:

| Name | Type | Description |
|---|---|---|
| `domain` | `str` | Always |
| `task` | `str` | Benchmark task ( |
| `model_id` | `str` | Full HF model identifier. |
| `nickname` | `str` | Human-friendly display name from |
| `total_params_billions` | `float` | Total parameter count in billions. |
| `activated_params_billions` | `float` | Activated parameter count in billions. |
| `weight_precision` | `str` | Weight precision (e.g. |
| `gpu_model` | `str` | GPU model identifier (e.g. |
| `num_gpus` | `int` | Number of GPUs used. |
| `batch_size` | `int` | Batch size. |
| `inference_steps` | `int \| None` | Number of diffusion inference steps. |
| `height` | `int` | Output height in pixels. |
| `width` | `int` | Output width in pixels. |
| `num_frames` | `int \| None` | Number of video frames ( |
| `fps` | `int \| None` | Video frames per second ( |
| `ulysses_degree` | `int \| None` | Ulysses sequence parallelism degree. |
| `ring_degree` | `int \| None` | Ring attention parallelism degree. |
| `use_torch_compile` | `bool \| None` | Whether torch.compile was enabled. |
| `batch_latency_s` | `float` | Average batch latency in seconds. |
| `avg_power_watts` | `float` | Average GPU power in watts. |
| `energy_per_generation_joules` | `float` | Energy per generated output (image or video) in joules. |
| `throughput_generations_per_sec` | `float` | Throughput in generations per second. |
### `is_text_to_image` (property)

Whether this is a text-to-image run.

### `is_text_to_video` (property)

Whether this is a text-to-video run.
### `read_results_json()`

Download (if needed) and return the parsed `results.json`. The parsed dict is cached per instance, so repeated calls are free.

### `timelines(*, metric='power.device_instant')`

Return the power/temperature timeseries for this run. Diffusion runs do not have steady-state bounds, so the full timeline is returned.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metric` | `Literal['power.device_instant', 'power.device_average', 'temperature']` | Which timeline to extract. | `'power.device_instant'` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with columns: timestamp, relative_time_s, value, metric. |
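A frame of this shape can be assembled from raw (timestamp, value) samples like so (a sketch; interpreting `relative_time_s` as seconds since the first sample is an assumption based on the column names, not confirmed library behavior):

```python
import pandas as pd


def build_timeline(samples: list[tuple[float, float]], metric: str) -> pd.DataFrame:
    """Assemble a timeline frame from (unix_timestamp, value) samples."""
    df = pd.DataFrame(samples, columns=["timestamp", "value"])
    # Assumed semantics: elapsed seconds relative to the first sample.
    df["relative_time_s"] = df["timestamp"] - df["timestamp"].iloc[0]
    df["metric"] = metric
    return df[["timestamp", "relative_time_s", "value", "metric"]]
```

The `relative_time_s` column makes runs comparable on a shared time axis regardless of their absolute start times.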
## DiffusionRuns

`mlenergy_data.records.runs.DiffusionRuns`

Immutable collection of diffusion model benchmark runs with fluent filtering. Same collection pattern as `LLMRuns`, with diffusion-specific filters.

Iterate to get individual `DiffusionRun` objects:

```python
for r in runs.task("text-to-image"):
    print(r.energy_per_generation_joules, r.nickname)
```

Example:

```python
runs = DiffusionRuns.from_directory("/path/to/compiled/data")
powers = [r.avg_power_watts for r in runs.task("text-to-image")]
```

### `download_raw_files()`

Download all raw files for this collection in parallel. Downloads `results.json` for every run in the collection. Only useful when loaded from HF Hub; a no-op for local sources.

The full unfiltered dataset is ~100 GB, so filter first to limit download size:

```python
runs = DiffusionRuns.from_hf().task("text-to-image").download_raw_files()
```
### `from_directory(root)` (classmethod)

Load runs from a compiled data directory (parquet-first). Reads `runs/diffusion.parquet` from the compiled data repo.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str \| Path` | Compiled data directory containing `runs/diffusion.parquet`. | required |
### `from_hf(repo_id='ml-energy/benchmark-v3', *, revision=None)` (classmethod)

Load diffusion runs from a Hugging Face dataset repository.

The default dataset is gated. Before calling this method:

- Visit https://huggingface.co/datasets/ml-energy/benchmark-v3 and request access (granted automatically).
- Set the `HF_TOKEN` environment variable to a Hugging Face access token.

Downloads only the parquet summary file (a few MB). Methods that need raw data (`timelines()`) automatically download the required files on first access. Respects the `HF_HOME` environment variable for the cache location.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `repo_id` | `str` | HF dataset repository ID. | `'ml-energy/benchmark-v3'` |
| `revision` | `str \| None` | Git revision (branch, tag, or commit hash). | `None` |
### `from_parquet(path, *, base_dir=None)` (classmethod)

Construct `DiffusionRuns` from a pre-built parquet file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the parquet file. | required |
| `base_dir` | `Path \| None` | If provided, resolve relative path fields against this directory. | `None` |
### `from_raw_results(*roots, tasks=None, config_dir=None, n_workers=None)` (classmethod)

Load runs from raw benchmark result directories. Parses `results.json` files and returns the collection.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `roots` | `str \| Path` | One or more benchmark root directories (or results sub-dirs). | `()` |
| `tasks` | `set[str] \| None` | If given, only load these tasks. | `None` |
| `config_dir` | `str \| Path \| None` | Path to the diffusion config directory. | `None` |
| `n_workers` | `int \| None` | Number of parallel workers (default: auto). | `None` |
### `task(*tasks)`

Filter to runs matching any of the given tasks.

### `model_id(*model_ids)`

Filter to runs matching any of the given model IDs.

### `gpu_model(*gpu_models)`

Filter to runs matching any of the given GPU models.
### `num_gpus(*counts, min=None, max=None)`

Filter to runs matching given GPU counts or a range.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `counts` | `int` | Exact GPU counts to include. | `()` |
| `min` | `int \| None` | Minimum GPU count (inclusive). | `None` |
| `max` | `int \| None` | Maximum GPU count (inclusive). | `None` |
### `nickname(*nicknames)`

Filter to runs matching any of the given nicknames.

### `batch_size(*sizes, min=None, max=None)`

Filter to runs matching given batch sizes or a range.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sizes` | `int` | Exact batch sizes to include. | `()` |
| `min` | `int \| None` | Minimum batch size (inclusive). | `None` |
| `max` | `int \| None` | Maximum batch size (inclusive). | `None` |
### `precision(*prec)`

Filter to runs matching any of the given weight precisions.

### `where(predicate)`

Filter runs by an arbitrary predicate.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `predicate` | `Callable[[DiffusionRun], bool]` | Function that takes a `DiffusionRun` and returns whether to keep it. | required |
### `group_by(*fields)`

Group runs by one or more fields.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fields` | `str` | One or more run field names to group by. | `()` |

Returns:

| Type | Description |
|---|---|
| `dict[Any, DiffusionRuns]` | Single field: keys are field values. Multiple fields: keys are tuples of field values. |
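The grouping semantics can be sketched with plain dicts (a toy model of the documented behavior, not library code):

```python
from collections import defaultdict
from typing import Any


def group_by(runs: list[dict], *fields: str) -> dict[Any, list[dict]]:
    """Group runs by one field (key = value) or several (key = tuple of values)."""
    groups: dict[Any, list[dict]] = defaultdict(list)
    for r in runs:
        key = r[fields[0]] if len(fields) == 1 else tuple(r[f] for f in fields)
        groups[key].append(r)
    return dict(groups)
```

A single-field call like `group_by(runs, "task")` keys groups by task name, while `group_by(runs, "task", "num_gpus")` keys them by `(task, num_gpus)` tuples.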
### `to_dataframe()`

Convert to a DataFrame with one row per run. Private fields (path fields, HF metadata) are excluded.

### `timelines(*, metric='power.device_instant')`

Extract power/temperature timeseries. Calls `DiffusionRun.timelines()` on each record, which handles downloading from HF Hub if needed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metric` | `Literal['power.device_instant', 'power.device_average', 'temperature']` | Which timeline to extract. | `'power.device_instant'` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with columns: task, model_id, num_gpus, batch_size, timestamp, relative_time_s, value, metric. |