# Toolkit Guide

## Real-World Examples
For full working examples of the toolkit in production, see:
| Project | Description | Script |
|---|---|---|
| The ML.ENERGY Leaderboard | Builds the leaderboard JSON data from benchmark runs | Link |
| The ML.ENERGY Blog | Analysis for the blog post on the V3 benchmark results | Link |
| OpenG2G Simulation | Power traces and models for datacenter–grid simulation | Link |
## Dataset Access

The benchmark dataset (`ml-energy/benchmark-v3`) is gated on Hugging Face Hub.
Before loading data with `from_hf()`, you need to:

- Visit the dataset page and request access (granted automatically).
- Set the `HF_TOKEN` environment variable to a Hugging Face access token.
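As a minimal sketch, the token can also be supplied from Python before calling `from_hf()` (the token value below is a placeholder; alternatively, export `HF_TOKEN` in your shell or run `huggingface-cli login`):

```python
import os

# Placeholder token -- replace with your own Hugging Face access token.
# Setting this before any Hub call lets gated downloads authenticate.
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxxxxxxxxxx"
```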
## Loading Benchmark Runs

`LLMRuns` and `DiffusionRuns` are typed, immutable collections.
Each run is a frozen dataclass (`LLMRun` / `DiffusionRun`) with IDE autocomplete and type checking.
```python
from mlenergy_data.records import LLMRuns, DiffusionRuns

# Load from Hugging Face Hub
runs = LLMRuns.from_hf()

# Include unstable runs
runs = LLMRuns.from_hf(stable_only=False)

# Diffusion runs
diff = DiffusionRuns.from_hf()
```
Or load from a local compiled data directory:
```python
root = "/path/to/compiled/data"
runs = LLMRuns.from_directory(root)
runs = LLMRuns.from_directory(root, stable_only=False)
diff = DiffusionRuns.from_directory(root)
```
> **Note**
> A "compiled data directory" is one built by `data_publishing/build_hf_data.py` (or downloaded from HF Hub). It contains parquet summary files under `runs/`, raw result files under `llm/` and `diffusion/`, and benchmark config files under `configs/`.
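Based on that description, a compiled data directory looks roughly like this (illustrative layout, not an exhaustive listing):

```
compiled_data/
├── runs/        # parquet summary files
├── llm/         # raw LLM result files
├── diffusion/   # raw diffusion result files
└── configs/     # benchmark config files
```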
## Filtering
All filter methods return a new collection — chain freely:
```python
# Single filter
gpqa = runs.task("gpqa")
h100 = runs.gpu("H100")
fp8 = runs.precision("fp8")

# Chained filters
best_candidates = runs.task("gpqa").gpu("B200").precision("fp8")

# Multiple values (OR within a filter)
chat_or_gpqa = runs.task("gpqa", "lm-arena-chat")

# By nickname
deepseek = runs.nickname("DeepSeek R1")

# Architecture (LLM only)
moe_models = runs.architecture("MoE")

# Batch size: exact values or range
batch_128 = runs.batch(128)
large_batch = runs.batch(min=64)
mid_batch = runs.batch(min=16, max=128)

# GPU count: exact or range
single_gpu = runs.num_gpus(1)
multi_gpu = runs.num_gpus(min=2)

# Stability: relevant when you explicitly set stable_only=False at load
# time to include unstable runs. By default, only stable runs are loaded.
stable_only = runs.stable()
unstable_only = runs.unstable()

# Arbitrary predicate
big_models = runs.where(lambda r: r.total_params_billions > 70)
```
## Data Access

There are two ways to access run data:

**Per-record (row):** iterate the collection to get individual typed records.
```python
for r in runs.task("gpqa"):
    print(r.energy_per_token_joules, r.nickname)

best = min(runs.task("gpqa"), key=lambda r: r.energy_per_token_joules)
print(f"{best.nickname}: {best.energy_per_token_joules:.3f} J/tok")
```
**Per-field (column):** use the `data` property for typed field arrays.
```python
energies = runs.data.energy_per_token_joules  # list[float]
gpus = runs.data.num_gpus                     # list[int]
names = runs.data.nickname                    # list[str]
```
The `data` property provides full IDE autocomplete and type checking.
Each attribute returns a `list[T]` with one element per run, in iteration order.
```python
import matplotlib.pyplot as plt

plt.scatter(runs.data.max_num_seqs, runs.data.energy_per_token_joules)
plt.xlabel("Batch size")
plt.ylabel("Energy per token (J)")
```
Indexing and concatenation:
```python
first_run = runs[0]

# Concatenate collections
h100 = runs.gpu("H100")
b200 = runs.gpu("B200")
combined = h100 + b200
```
## Grouping

```python
# Group by task
for task, group in runs.group_by("task").items():
    print(f"{task}: {len(group)} runs")

# Group by multiple fields
for (model, batch), g in runs.group_by("model_id", "max_num_seqs").items():
    best = min(g, key=lambda r: r.energy_per_token_joules)
    print(f"{model} @ batch={batch}: {best.energy_per_token_joules:.3f} J/tok")
```
## Analysis Patterns

Python is the analysis layer — no special helper functions needed:

```python
# Compare GPU generations on a task: highest throughput per GPU model
for gpu, group in runs.task("lm-arena-chat").group_by("gpu_model").items():
    best = max(group, key=lambda r: r.output_throughput_tokens_per_sec)
    print(f"{gpu}: {best.nickname} @ {best.output_throughput_tokens_per_sec:.0f} tok/s")

# Compare GPUs for a specific model
llama70b = runs.model("meta-llama/Llama-3.1-70B-Instruct")
for gpu, g in llama70b.group_by("gpu_model").items():
    plt.scatter(g.data.max_num_seqs, g.data.energy_per_token_joules, label=gpu)
plt.legend()
```
## Bulk Data

These methods return pandas DataFrames for numerical analysis.

When loaded from HF Hub (`from_hf()`), they automatically download only the raw files needed for the current collection. The download scope is determined by your filters. HF Hub caches files locally, so repeated calls are fast.

To eagerly download all raw files upfront, use `prefetch()`:
```python
# Eagerly download all raw files for a filtered collection
runs = LLMRuns.from_hf().task("lm-arena-chat").gpu("H100").prefetch()
power_tl = runs.timelines(metric="power.device_instant")  # no download delay
```
```python
# Power timelines (long-form)
power_tl = runs.timelines(metric="power.device_instant")

# Temperature timelines
temp_tl = runs.timelines(metric="temperature")

# Output lengths
out_df = runs.output_lengths()

# Full DataFrame (one row per run, all fields as columns)
df = runs.to_dataframe()
```
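Long-form timeline frames can then be reduced with ordinary pandas. The column names `run_id` and `value` below are assumptions for illustration; check `power_tl.columns` for the actual schema:

```python
import pandas as pd

# Toy long-form frame standing in for runs.timelines(...) output.
# Column names ("run_id", "value") are assumed for illustration.
power_tl = pd.DataFrame({
    "run_id": ["a", "a", "b", "b"],
    "value":  [300.0, 320.0, 410.0, 430.0],  # watts
})

# Mean device power per run
avg_power = power_tl.groupby("run_id")["value"].mean()
```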
## Diffusion Runs

`DiffusionRuns` follows the same patterns:
```python
from mlenergy_data.records import DiffusionRuns

diff = DiffusionRuns.from_hf()

t2i = diff.task("text-to-image")
best = min(t2i, key=lambda r: r.energy_per_generation_joules)
print(f"{best.nickname}: {best.energy_per_generation_joules:.3f} J/image")

# Task field and convenience properties
r = diff[0]
r.task              # "text-to-image" or "text-to-video"
r.is_text_to_image  # True for text-to-image tasks
r.is_text_to_video  # True for text-to-video tasks

# Available filters: task(), model(), gpu(), nickname(), batch(),
# num_gpus(), precision(), where()
```
## Model Fitting

### Logistic Curves

`LogisticModel` models the four-parameter logistic `y = b0 + L * sigmoid(k * (x - x0))`, where `x = log2(batch_size)`:
```python
import numpy as np
from mlenergy_data.modeling import LogisticModel

# Fit from data
x = np.log2([8, 16, 32, 64, 128, 256])
y_power = np.array([200, 250, 320, 400, 480, 530])
fit = LogisticModel.fit(x, y_power)

# Evaluate at a specific batch size
predicted = fit.eval(batch=128)

# Serialize / deserialize
d = fit.to_dict()  # {"L": ..., "x0": ..., "k": ..., "b0": ...}
fit2 = LogisticModel.from_dict(d)
```
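To make the functional form concrete, here is a plain-NumPy sketch of the same four-parameter logistic (an illustration, not the library's implementation):

```python
import numpy as np

def logistic(x, L, x0, k, b0):
    """Four-parameter logistic: b0 + L * sigmoid(k * (x - x0))."""
    return b0 + L / (1.0 + np.exp(-k * (x - x0)))

# At x = x0 the sigmoid equals 0.5, so the curve passes through b0 + L / 2.
x = np.log2(128)  # x is log2(batch size)
y = logistic(x, L=400.0, x0=np.log2(128), k=1.2, b0=150.0)  # -> 350.0
```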
### ITL Latency Distributions

`ITLMixtureModel` fits a two-component lognormal mixture for inter-token latency (ITL):
```python
from mlenergy_data.modeling import ITLMixtureModel

# Fit from raw ITL samples (a 1-D array of seconds)
model = ITLMixtureModel.fit(itl_samples_s, max_samples=2048, seed=0)

# Analytical mean and variance
mean, var = model.mean_var()

# Simulate average ITL across replicas
rng = np.random.default_rng(0)
avg_itl = model.sample_avg(n_replicas=180, rng=rng)

# Serialize / deserialize
d = model.to_dict()
model2 = ITLMixtureModel.from_dict(d)
```
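The underlying distribution can be sketched in plain NumPy (parameter names here are illustrative, not the library's internal representation): each sample is drawn from one of two lognormals chosen by weight, and the mixture mean follows from the lognormal mean formula.

```python
import numpy as np

def sample_mixture(n, w, mu, sigma, rng):
    """Draw n samples from a two-component lognormal mixture.

    w: component weights (sum to 1); mu, sigma: per-component
    parameters of the underlying normal distributions.
    """
    comp = rng.choice(2, size=n, p=w)  # pick a component per sample
    return rng.lognormal(mean=np.take(mu, comp), sigma=np.take(sigma, comp))

rng = np.random.default_rng(0)
w, mu, sigma = [0.7, 0.3], [-4.0, -2.5], [0.3, 0.5]
samples = sample_mixture(200_000, w, mu, sigma, rng)

# Analytical mixture mean: sum_i w_i * exp(mu_i + sigma_i**2 / 2)
mean = sum(wi * np.exp(mi + si**2 / 2) for wi, mi, si in zip(w, mu, sigma))
```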