Getting Started
Zeus is an energy measurement and optimization toolbox for deep learning.
How it works
[Demo: Zeus in action, integrated with Stable Diffusion fine-tuning]
Just measuring GPU time and energy
Prerequisites
If your NVIDIA GPU's architecture is Volta or newer, simply do the following in your Python environment:

```sh
pip install zeus-ml
```

and you are ready to use ZeusMonitor.
Otherwise, we recommend using our Docker container:
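A typical invocation might look like the following sketch (the `mlenergy/zeus` image name and tag are assumptions; check the Zeus documentation for the current image):

```sh
# Assumed image name and tag; --gpus all exposes the host GPUs to the container.
docker run -it --gpus all mlenergy/zeus:latest bash
```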
ZeusMonitor
ZeusMonitor makes it very simple to measure the GPU time and energy consumption of arbitrary Python code blocks.
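A minimal sketch of typical usage (`train_one_epoch`, `model`, and `train_loader` are placeholders for your own training code):

```python
from zeus.monitor import ZeusMonitor

# Monitor the GPUs you are using.
monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

# Mark the beginning of a measurement window.
monitor.begin_window("training")
for epoch in range(100):
    # Windows can be nested to measure finer-grained blocks.
    monitor.begin_window("epoch")
    train_one_epoch(model, train_loader)  # placeholder for your training code
    measurement = monitor.end_window("epoch")
    print(f"Epoch {epoch}: {measurement.time} s, {measurement.total_energy} J")

# End the window and retrieve the aggregate measurement.
measurement = monitor.end_window("training")
print(f"Training took {measurement.time} s and consumed {measurement.total_energy} J.")
```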
Optimizing a single training job's energy consumption
Zeus can quickly profile all supported GPU power limits during training and select the one that optimizes the energy consumption of the training job.
Prerequisites
In order to change the GPU's power limit, the process requires the Linux SYS_ADMIN security capability. The easiest way to obtain it is to spin up a container started with `--cap-add SYS_ADMIN`; we provide ready-to-go Docker images.
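For example, reusing the assumed `mlenergy/zeus` image from above:

```sh
# --cap-add SYS_ADMIN lets the container change GPU power limits.
docker run -it --gpus all --cap-add SYS_ADMIN mlenergy/zeus:latest bash
```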
GlobalPowerLimitOptimizer
After going through the prerequisites, integrate GlobalPowerLimitOptimizer into your training script.
Refer to our ImageNet integration example for complete, runnable examples of single-GPU and multi-GPU data parallel training.
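A sketch of the integration, using the optimizer's callback-style hooks (`train_dataloader` and the training step are placeholders for your own code):

```python
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer

monitor = ZeusMonitor(gpu_indices=[0])
plo = GlobalPowerLimitOptimizer(monitor)

for epoch in range(100):
    plo.on_epoch_begin()
    for x, y in train_dataloader:  # placeholder dataloader
        plo.on_step_begin()
        # Forward pass, backward pass, and optimizer step go here.
        plo.on_step_end()
    plo.on_epoch_end()
```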
Important: What is the optimal power limit?

The GlobalPowerLimitOptimizer supports multiple OptimumSelectors that choose one power limit among all the profiled power limits. The selectors currently implemented are Energy, Time, ZeusCost, and MaxSlowdownConstraint.
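For instance, a sketch of plugging in a selector (the `optimum_selector` parameter name and `factor` argument are assumptions based on recent Zeus versions):

```python
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer, MaxSlowdownConstraint

plo = GlobalPowerLimitOptimizer(
    monitor,
    # Pick the lowest power limit whose slowdown stays within 10%.
    optimum_selector=MaxSlowdownConstraint(factor=1.1),
)
```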
HFGlobalPowerLimitOptimizer
For easy use with HuggingFace 🤗 Transformers, HFGlobalPowerLimitOptimizer is a drop-in compatible HuggingFace 🤗 Trainer Callback. When initializing a HuggingFace 🤗 Trainer or a TRL SFTTrainer, initialize and pass in HFGlobalPowerLimitOptimizer as shown below:
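A minimal sketch (`model`, `training_args`, and `train_dataset` are placeholders for your own setup):

```python
from transformers import Trainer
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer

monitor = ZeusMonitor()  # monitor all visible GPUs
optimizer = HFGlobalPowerLimitOptimizer(monitor)

# Pass the optimizer as a Trainer callback; SFTTrainer works the same way.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[optimizer],
)
trainer.train()
```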
See our HuggingFace examples for:

- Transformers Trainer integration for causal language modeling (i.e., pre-training)
- TRL SFTTrainer integration for Gemma 7B supervised fine-tuning with QLoRA
Large model training jobs
We created Perseus, which can optimize the energy consumption of large model training with practically no slowdown!
Recurring jobs
Zeus finds the cost-optimal batch size across multiple runs of a recurring job using a Multi-Armed Bandit algorithm.
First, go through the steps for non-recurring jobs; ZeusDataLoader will transparently optimize the GPU power limit for any given batch size, as sketched below.
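A sketch of the pattern (method names follow older Zeus releases and may have changed; `train_set`, `eval_set`, `val_metric`, and the loop bodies are placeholders):

```python
from zeus.run import ZeusDataLoader

# Only the training dataloader takes max_epochs.
train_loader = ZeusDataLoader(train_set, batch_size=256, max_epochs=100)
eval_loader = ZeusDataLoader(eval_set, batch_size=256)

for epoch in train_loader.epochs():
    for batch in train_loader:
        ...  # learn from batch
    for batch in eval_loader:
        ...  # evaluate on batch
    # Report the validation metric so the job can stop
    # early once the target metric is reached.
    train_loader.report_metric(val_metric, higher_is_better=True)
```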
Then, you can use ZeusMaster to drive recurring jobs and batch size optimization.
This example will come in handy.