Getting Started

Zeus is an energy measurement and optimization toolbox for deep learning.

How it works

Zeus in action, integrated with Stable Diffusion fine-tuning (video).

Just measuring GPU time and energy

Prerequisites

If your NVIDIA GPU's architecture is Volta or newer, simply run the following in your Python environment:

pip install zeus-ml
and get going with ZeusMonitor.

Otherwise, we recommend using our Docker container:

  1. Set up your environment.
  2. Install Zeus.

ZeusMonitor

ZeusMonitor makes it very simple to measure the GPU time and energy consumption of arbitrary Python code blocks.

import torch

from zeus.monitor import ZeusMonitor

# All GPUs are measured simultaneously if `gpu_indices` is not given.
monitor = ZeusMonitor(gpu_indices=[torch.cuda.current_device()])

for epoch in range(100):
    monitor.begin_window("epoch")

    measurements = []
    for x, y in train_loader:
        monitor.begin_window("step")
        train_one_step(x, y)
        result = monitor.end_window("step")
        measurements.append(result)

    result = monitor.end_window("epoch")
    print(f"Epoch {epoch} consumed {result.time} s and {result.total_energy} J.")

    avg_time = sum(map(lambda m: m.time, measurements)) / len(measurements)
    avg_energy = sum(map(lambda m: m.total_energy, measurements)) / len(measurements)
    print(f"One step took {avg_time} s and {avg_energy} J on average.")

Optimizing a single training job's energy consumption

All candidate GPU power limits can be profiled quickly during training, and the best one is then applied to optimize the energy consumption of the training job.

Prerequisites

Changing the GPU's power limit requires the Linux SYS_ADMIN security capability. The easiest way to obtain it is to spin up a container and give it --cap-add SYS_ADMIN; we provide ready-to-go Docker images.

GlobalPowerLimitOptimizer

After going through the prerequisites, integrate GlobalPowerLimitOptimizer into your training script.

Refer to our integration example with ImageNet for complete running examples for single-GPU and multi-GPU data parallel training.

from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer

# Data parallel training with four GPUs.
# Omitting `gpu_indices` will use all GPUs, while respecting
# `CUDA_VISIBLE_DEVICES`.
monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
# The power limit optimizer profiles power limits during training
# using the `ZeusMonitor` instance.
plo = GlobalPowerLimitOptimizer(monitor)

for epoch in range(100):
    plo.on_epoch_begin()

    for x, y in train_dataloader:
        plo.on_step_begin()
        # Learn from x and y!
        plo.on_step_end()

    plo.on_epoch_end()

    # Validate the model if needed, but `plo` won't care.

Important

What is the optimal power limit? GlobalPowerLimitOptimizer supports multiple OptimumSelectors that choose one power limit among all the profiled power limits. Currently implemented selectors are Energy, Time, ZeusCost, and MaxSlowdownConstraint.
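
For example, a selector can be passed when constructing GlobalPowerLimitOptimizer. A minimal sketch, assuming the selector is accepted as the second constructor argument and that MaxSlowdownConstraint takes a factor keyword (both are assumptions; check the API reference for the exact signature):

from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer, MaxSlowdownConstraint

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

# Choose the lowest power limit whose training slowdown stays within 10%
# of the fastest profiled power limit. The selector argument position and
# the `factor` keyword are assumptions; consult the API reference.
plo = GlobalPowerLimitOptimizer(
    monitor,
    MaxSlowdownConstraint(factor=1.1),
)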

HFGlobalPowerLimitOptimizer

For easy use with HuggingFace 🤗 Transformers, HFGlobalPowerLimitOptimizer is a drop-in compatible HuggingFace 🤗 Trainer Callback. When initializing a HuggingFace 🤗 Trainer or a TRL SFTTrainer, initialize and pass in HFGlobalPowerLimitOptimizer as shown below:

from transformers import Trainer

from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer

monitor = ZeusMonitor()
optimizer = HFGlobalPowerLimitOptimizer(monitor)

# Also works with SFTTrainer.
trainer = Trainer(
    ...,
    callbacks=[optimizer], # Add the `HFGlobalPowerLimitOptimizer` callback
)

Refer to our HuggingFace 🤗 example integration for:

  • Transformers Trainer integration for causal language modeling (i.e., pre-training)
  • TRL SFTTrainer integration for Gemma 7b supervised fine-tuning with QLoRA

Large model training jobs

We created Perseus, which can optimize the energy consumption of large model training with practically no slowdown!

Recurring jobs

The cost-optimal batch size is found across multiple recurrences of the job using a Multi-Armed Bandit algorithm. First, go through the steps for non-recurring jobs; ZeusDataLoader will transparently optimize the GPU power limit for any given batch size. Then, you can use ZeusMaster to drive recurring jobs and batch size optimization.

This example will come in handy.
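
For a rough picture of how a single recurrence looks with ZeusDataLoader, here is a sketch. The import path, constructor arguments, and the epochs() and report_metric() methods shown are assumptions based on older Zeus releases; check the batch size optimizer documentation for the exact API.

# NOTE: Sketch only. The import path and every argument/method shown here
# (max_epochs, epochs(), report_metric()) are assumptions, not a verified API.
from zeus.run import ZeusDataLoader

train_loader = ZeusDataLoader(train_set, batch_size=256, max_epochs=100)
eval_loader = ZeusDataLoader(eval_set, batch_size=256)

for epoch in train_loader.epochs():
    for x, y in train_loader:
        train_one_step(x, y)

    # Report the target validation metric so Zeus knows whether the
    # job has reached its goal for this recurrence.
    accuracy = evaluate(eval_loader)
    train_loader.report_metric(accuracy, higher_is_better=True)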