# Power Limit Optimizer
The power limit optimizer (`GlobalPowerLimitOptimizer`) finds the optimal GPU power limit for DNN training.
Users can customize how the optimal power limit is chosen based on their own criteria by implementing the `OptimumSelector` interface.
## Usage
Use cases currently supported are single-GPU training and data parallel training. For data parallel training, the power limits of all GPUs involved are changed together, since all GPUs have the same computation load.
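As a small illustration of the difference between the two cases (the GPU indices below are placeholders), the only thing that changes is which GPUs the monitor covers:

```python
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer

# Single-GPU training: monitor just the GPU being used (index 0 is a placeholder).
monitor = ZeusMonitor(gpu_indices=[0])

# Data parallel training: pass every GPU in the job instead, e.g.
#   monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
# The optimizer then applies the same power limit to all monitored GPUs,
# since they share the same computation load.

plo = GlobalPowerLimitOptimizer(monitor)
```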
### Upcoming
Distributed data parallel training support is planned (tracking issue).
### Extra system privileges needed
In order to optimize the GPU power limit, the power limit optimizer must be able to change the GPU's power limit, which requires extra system privileges. See here for details.
## `GlobalPowerLimitOptimizer`
You can use the power limit optimizer by integrating `GlobalPowerLimitOptimizer` into your training loop.
To inform the optimizer of epoch and training step boundaries, a couple of methods need to be called inside the training loop:
```python
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer

# Data parallel training with four GPUs.
monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
plo = GlobalPowerLimitOptimizer(monitor)

for epoch in range(100):
    plo.on_epoch_begin()
    for x, y in train_dataloader:
        plo.on_step_begin()
        # Learn from x and y
        plo.on_step_end()
    plo.on_epoch_end()
```
We provide integration examples for Torchvision & ImageNet single-GPU and data parallel training.
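To make the placement of those calls more concrete, here is a minimal runnable single-GPU sketch; the toy model, loss, optimizer, and dataloader are placeholders standing in for a real training setup and are not part of Zeus:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer

# Toy model, loss, optimizer, and data -- placeholders for a real DNN training job.
model = torch.nn.Linear(16, 2).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_dataloader = DataLoader(
    TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,))),
    batch_size=32,
)

monitor = ZeusMonitor(gpu_indices=[0])  # Single-GPU training.
plo = GlobalPowerLimitOptimizer(monitor)

for epoch in range(100):
    plo.on_epoch_begin()
    for x, y in train_dataloader:
        plo.on_step_begin()
        # One ordinary training step goes between the step callbacks.
        x, y = x.cuda(), y.cuda()
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        plo.on_step_end()
    plo.on_epoch_end()
```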
### What is the optimal power limit?
`GlobalPowerLimitOptimizer` accepts an optional `OptimumSelector` in its constructor, which defines how to choose one power limit among all the profiled power limits.
Built-in optimum selectors are `Energy`, `Time`, `ZeusCost`, and `MaxSlowdownConstraint`.
Users can inherit from `OptimumSelector` to implement their own custom optimum selector, as sketched below.
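For example, here is a sketch of a hypothetical custom selector that minimizes the energy-delay product (EDP) over the profiled power limits. The `select` method name and the `power_limit`, `energy`, and `time` fields on the profiling measurements are assumptions about the `OptimumSelector` interface; consult the API reference for the exact signatures.

```python
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import (
    GlobalPowerLimitOptimizer,
    MaxSlowdownConstraint,
    OptimumSelector,
)


class MinimumEDP(OptimumSelector):
    """Hypothetical selector: minimize the energy-delay product (EDP).

    Assumes each profiling measurement exposes the profiled power limit,
    the energy consumed, and the time taken over the profiling window.
    """

    def select(self, measurements):
        best = min(measurements, key=lambda m: m.energy * m.time)
        return best.power_limit


monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

# Pass the custom selector to the optimizer.
plo = GlobalPowerLimitOptimizer(monitor, MinimumEDP())

# Built-in alternative (assumed semantics: allow at most 10% slowdown
# relative to the fastest profiled power limit):
# plo = GlobalPowerLimitOptimizer(monitor, MaxSlowdownConstraint(factor=1.1))
```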
## `HFGlobalPowerLimitOptimizer`
For easy use with HuggingFace Transformers, `HFGlobalPowerLimitOptimizer` is implemented as a HuggingFace Trainer callback by inheriting from `TrainerCallback`.
When initializing a HuggingFace `Trainer` or a TRL `SFTTrainer`, initialize `HFGlobalPowerLimitOptimizer` and pass it in as a callback, as shown below:
```python
from transformers import Trainer

from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer

monitor = ZeusMonitor()
plo = HFGlobalPowerLimitOptimizer(monitor)

# Also works with trl.SFTTrainer.
trainer = Trainer(
    ...,
    callbacks=[plo],
)
```
Refer to our HuggingFace integration examples for:
- Transformers `Trainer` integration for causal language modeling (i.e., pre-training)
- TRL `SFTTrainer` integration for Gemma 7B supervised fine-tuning with QLoRA