Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

NSDI '23

Abstract

Training Deep Neural Networks (DNNs) is becoming more and more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency.

In this paper, we observe that common practices of DNN training can lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus does not require any offline profiling and can adapt to data drifts.

Why care about GPU energy?

Recent years have seen an increasing adoption of DNNs for intelligent applications. Large clusters of GPUs were created to support such growth, and the surge continues.

GPUs are power-hungry hardware: they consume about 70% of the power of the entire server when training DNNs.¹ At extreme scales, training the GPT-3 model just once consumes 1,287 MWh,² which is enough to power an average US household for 120 years.³

However, latency and throughput have been the primary targets of existing optimization techniques, devoid of any careful consideration of how such optimizations might impact energy efficiency. We argue that energy should be considered as the third dimension.

Opportunity for energy savings

We observe that common practices of DNN training can often lead to energy inefficiency.

To see this, we trained⁴ the same DNN multiple times using a sweep of possible batch sizes and GPU power limits.⁵

Potential energy savings on an NVIDIA V100 GPU.

The dotted baseline uses the default batch size from the model's publication and the default (maximum) GPU power limit. Choosing the best batch size and power limit instead can lead to large energy savings.

Tradeoff between time & energy

Is energy reduction free?

We discover that there is a tradeoff between DNN training time and energy consumption.

All (batch size, power limit) configurations and their time/energy consumption.

The energy-time Pareto frontier zoomed in.

These results are from training DeepSpeech2 on LibriSpeech with an NVIDIA V100 GPU. Notice the yellow Pareto frontier of efficient (time, energy) pairs, resulting from a set of efficient (batch size, power limit) knobs.
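As a rough illustration of what "efficient" means here, the sketch below filters a set of (batch size, power limit) runs down to those that no other run beats on both time and energy. The `Run` type, the `pareto_frontier` helper, and all numbers are made up for illustration; they are not the measurements behind the plots.

```python
from typing import NamedTuple


class Run(NamedTuple):
    batch_size: int
    power_limit: int  # watts
    time_s: float     # time to accuracy, in seconds
    energy_j: float   # energy to accuracy, in joules


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run beats on both time and energy."""
    frontier = []
    best_energy = float("inf")
    # Visit runs from fastest to slowest (ties broken by energy). A run is
    # efficient only if it uses strictly less energy than every run before it.
    for run in sorted(runs, key=lambda r: (r.time_s, r.energy_j)):
        if run.energy_j < best_energy:
            frontier.append(run)
            best_energy = run.energy_j
    return frontier


# Hypothetical measurements, for illustration only.
runs = [
    Run(32, 250, 100.0, 900.0),
    Run(32, 150, 130.0, 700.0),
    Run(64, 250, 90.0, 950.0),
    Run(64, 150, 120.0, 650.0),
    Run(128, 250, 140.0, 1000.0),
]
frontier = pareto_frontier(runs)
for run in frontier:
    print(run)
```

With these made-up numbers, three of the five configurations survive; the other two are both slower and more energy-hungry than some point on the frontier.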

Navigating the tradeoff

All points on the Pareto frontier are efficient, but which one is the best?

Different users will have different answers, because they have different preferences for how to trade off time and energy.⁶

To allow users to express their tradeoff preference, we define a simple cost metric:⁷

\[ \textrm{Cost} = \eta \cdot \textrm{Energy} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{Time}, \]

where the user picks the value of \(\eta\) between 0 and 1. Smaller values of \(\eta\) favor shorter training time, while larger values favor lower energy consumption.
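To make the effect of \(\eta\) concrete, here is a minimal sketch that evaluates the cost metric for two hypothetical configurations. The function name `zeus_cost` and all numbers are assumptions chosen for illustration.

```python
def zeus_cost(energy_j: float, time_s: float, max_power_w: float, eta: float) -> float:
    """Cost = eta * Energy + (1 - eta) * MaxPower * Time, in joules."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must be between 0 and 1")
    return eta * energy_j + (1.0 - eta) * max_power_w * time_s


MAX_POWER_W = 250.0  # hypothetical maximum power limit of the GPU

# Two hypothetical points on the Pareto frontier: (energy in J, time in s).
configs = {
    "fast": (20_000.0, 100.0),
    "frugal": (16_000.0, 120.0),
}


def best_config(eta: float) -> str:
    """Name of the configuration with the lowest cost for this eta."""
    return min(
        configs,
        key=lambda name: zeus_cost(*configs[name], MAX_POWER_W, eta),
    )


for eta in (0.0, 0.5, 1.0):
    print(eta, best_config(eta))
```

In this example, \(\eta = 0\) reduces the cost to \(\textrm{MaxPower} \cdot \textrm{Time}\) and picks the faster configuration, while \(\eta = 1\) reduces it to pure energy and picks the more frugal one.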

Finding the optimal knob

Given the user's preference via the value of \(\eta\), how do we find the best (batch size, power limit) knob on the Pareto frontier?

This is no easy problem. We only have the Pareto frontier in the previous plot because we trained all possible combinations of batch size and power limit until completion to characterize the tradeoff.⁸

Fortunately, DNN training jobs often recur in production GPU clusters,⁹ allowing us to explore, observe, and optimize across job recurrences.

This results in two main components in Zeus:

Just-In-Time energy profiler: Finds the optimal power limit via online profiling.
Multi-Armed Bandit + Thompson Sampling: Finds the optimal batch size across recurring training runs.
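To give a feel for the second component, the sketch below runs a generic Gaussian Thompson Sampling loop in which each candidate batch size is an arm and each job recurrence yields one observed cost. This is not Zeus's implementation: the prior, the known-noise-variance assumption, the class and function names, and the simulated costs are all assumptions made for illustration.

```python
import random


class GaussianArm:
    """Belief about one batch size's mean cost (known noise variance assumed)."""

    def __init__(self, prior_mean=0.0, prior_var=1e6, noise_var=1.0):
        self.mean = prior_mean
        self.var = prior_var
        self.noise_var = noise_var

    def sample(self, rng):
        # Draw one plausible value of the mean cost from the current belief.
        return rng.gauss(self.mean, self.var ** 0.5)

    def update(self, cost):
        # Conjugate normal update with known observation noise.
        precision = 1.0 / self.var + 1.0 / self.noise_var
        self.mean = (self.mean / self.var + cost / self.noise_var) / precision
        self.var = 1.0 / precision


def choose_batch_size(arms, rng):
    # Thompson Sampling: sample a mean cost per arm and pick the lowest sample.
    return min(arms, key=lambda bs: arms[bs].sample(rng))


rng = random.Random(0)
true_cost = {32: 15.0, 64: 9.0, 128: 13.0}  # hypothetical mean cost per batch size
arms = {bs: GaussianArm() for bs in true_cost}
counts = {bs: 0 for bs in true_cost}

for _ in range(200):  # each iteration stands for one recurrence of the job
    bs = choose_batch_size(arms, rng)
    observed_cost = rng.gauss(true_cost[bs], 1.0)  # simulated noisy cost
    arms[bs].update(observed_cost)
    counts[bs] += 1

print(counts)
```

The wide prior makes every batch size worth trying early on; as observations accumulate, the beliefs tighten and the loop spends most recurrences on the batch size with the lowest cost.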

Research reproducibility

Our trace-driven simulator is open-sourced here, with instructions.

Extending the Zeus simulator

Users can implement custom policies that optimize batch size and power limit, and plug them into the Zeus simulator. The simulator runs on training and energy traces for 6 different DNNs and 4 different NVIDIA GPU microarchitectures, which are available here.

Zeus defines two abstract classes, BatchSizeOptimizer and PowerLimitOptimizer, in zeus._legacy.policy.interface. They optimize the batch size and the power limit, respectively, of a recurring training job. As in our paper, the batch size optimizer is invoked first to decide which batch size to use, and then the power limit optimizer is invoked with both the job and the chosen batch size to decide which power limit to use. You can find examples of policy implementations in zeus._legacy.policy.optimizer.

The Zeus simulator (Simulator) accepts one BatchSizeOptimizer and one PowerLimitOptimizer in its constructor. A full example can be found here.
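The two-step invocation order described above can be sketched with stand-in classes. Everything below is hypothetical: `SketchBatchSizeOptimizer`, `SketchPowerLimitOptimizer`, their `predict` methods, and `decide_knobs` are illustrative names, not the actual interface in zeus._legacy.policy.interface, whose methods and signatures may differ.

```python
from abc import ABC, abstractmethod


# Stand-in interfaces for illustration only; they do not reproduce the real
# abstract classes in zeus._legacy.policy.interface.
class SketchBatchSizeOptimizer(ABC):
    @abstractmethod
    def predict(self, job: str) -> int:
        """Return the batch size to use for this recurrence of the job."""


class SketchPowerLimitOptimizer(ABC):
    @abstractmethod
    def predict(self, job: str, batch_size: int) -> int:
        """Return the power limit (in watts) for this job and batch size."""


class FixedBatchSize(SketchBatchSizeOptimizer):
    """Trivial policy: always use the same batch size."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size

    def predict(self, job: str) -> int:
        return self.batch_size


class LowestPowerLimit(SketchPowerLimitOptimizer):
    """Trivial policy: always use the lowest available power limit."""

    def __init__(self, power_limits: list[int]):
        self.power_limits = power_limits

    def predict(self, job: str, batch_size: int) -> int:
        return min(self.power_limits)


def decide_knobs(job, batch_size_optimizer, power_limit_optimizer):
    # The batch size is decided first; the power limit policy then sees
    # both the job and the chosen batch size.
    batch_size = batch_size_optimizer.predict(job)
    power_limit = power_limit_optimizer.predict(job, batch_size)
    return batch_size, power_limit


knobs = decide_knobs(
    "deepspeech2", FixedBatchSize(64), LowestPowerLimit([250, 200, 150])
)
print(knobs)
```

A real policy would also need to learn from the outcome of each recurrence; consult the interface module and the example policies for the methods that actually exist.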

¹ Jesse Dodge, Taylor Prewitt, Remi Tachet des Combes, Erika Odmark, Roy Schwartz, Emma Strubell, Alexandra Sasha Luccioni, Noah A. Smith, Nicole DeCario, and Will Buchanan. Measuring the carbon intensity of AI in cloud instances. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, pages 1877–1894, New York, NY, USA, 2022. Association for Computing Machinery.
² David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
³ How much electricity does an American home use? https://www.eia.gov/tools/faqs/faq.php?id=97&t=3.
⁴ In all cases of training, we train until the DNN reaches a specific target validation metric. Thus, when we say time, we mean TTA (Time To Accuracy). Likewise, energy means ETA (Energy To Accuracy). Please refer to our paper for the complete workload table.
⁵ It is possible to cap the maximum power draw of a GPU using NVML.
⁶ For instance, some production training jobs might have tight deadlines; they probably don't want to trade time for energy savings. On the other hand, exploratory training jobs may have more leeway; it might make sense for them to reduce energy consumption at the cost of longer training time.
⁷ \(\textrm{MaxPower}\) is the maximum possible power limit of the GPU. It is a constant introduced so that both terms of the cost have units of joules.
⁸ Doing this consumes so much time and energy that, if we did it for every future incoming job, it could offset or even exceed the energy savings from choosing the optimal knobs!
⁹ Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, et al. Applied machine learning at Facebook: A datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 620–629. IEEE, 2018.
