Deep Learning Energy Measurement and Optimization

Join the Zeus Slack workspace!

Zeus has surpassed 100k pulls on Docker Hub! The Zeus team is always happy to chat with Zeus users and help out. Reach out to us in the Zeus Slack workspace.


Project News

  • [2024/02] Zeus was selected as a 2024 Mozilla Technology Fund awardee. Thanks, Mozilla!
  • [2023/12] The preprint of the Perseus paper is out here!
  • [2023/10] We released Perseus, an energy optimizer for large model training. Get started here!
  • [2023/09] We moved under ml-energy! Please stay tuned for exciting new projects!
  • [2023/07] ZeusMonitor was used to profile GPU time and energy consumption for the ML.ENERGY leaderboard & Colosseum.
  • [2023/03] Chase, an automatic carbon optimization framework for DNN training, will appear at an ICLR'23 workshop.
  • [2022/11] Carbon-Aware Zeus won the second overall best solution award at Carbon Hack 22.

Zeus is a framework for (1) measuring GPU energy consumption and (2) optimizing energy and time for DNN training.

Measuring GPU energy

from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0,1,2,3])

monitor.begin_window("heavy computation")
# Four GPUs consuming energy like crazy!
measurement = monitor.end_window("heavy computation")

print(f"Energy: {measurement.total_energy} J")
print(f"Time  : {measurement.time} s")
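
A measurement's time and energy can be combined into per-GPU average power. The sketch below is a minimal illustration and does not use Zeus itself: `WindowResult` is a hypothetical stand-in class whose `time` and per-GPU `energy` fields mirror the shape of the `Measurement(time=..., energy={...})` output shown in the CLI section, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class WindowResult:
    """Hypothetical stand-in for a measurement window result (not a Zeus class)."""

    time: float                # elapsed seconds
    energy: Dict[int, float]   # joules consumed, keyed by GPU index

    @property
    def total_energy(self) -> float:
        """Sum of energy over all GPUs, in joules."""
        return sum(self.energy.values())


def average_power(result: WindowResult) -> Dict[int, float]:
    """Average power per GPU in watts: joules divided by seconds."""
    return {gpu: joules / result.time for gpu, joules in result.energy.items()}


# Made-up numbers purely for illustration.
result = WindowResult(time=2.0, energy={0: 400.0, 1: 420.0})
print(f"Total energy : {result.total_energy} J")
print(f"Average power: {average_power(result)} W")
```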

Finding the optimal GPU power limit

Zeus silently profiles different power limits during training and converges to the optimal one.

from zeus.monitor import ZeusMonitor
from zeus.optimizer import GlobalPowerLimitOptimizer

# Data parallel training with four GPUs
monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
plo = GlobalPowerLimitOptimizer(monitor)

plo.on_epoch_begin()

for x, y in train_dataloader:
    plo.on_step_begin()
    # Learn from x and y!
    plo.on_step_end()

plo.on_epoch_end()
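
To make "converges to the optimal one" more concrete, here is a rough sketch of what a selection step could look like once each candidate power limit has been profiled. Everything in it is an assumption for illustration: the function names `cost` and `pick_power_limit`, the weighting knob `eta`, the cost formula (a weighted sum of energy and time, with time scaled by the maximum power limit so both terms are in joules), and the profile numbers. It is not a description of Zeus's internals.

```python
def cost(energy_j, time_s, eta, max_power_w):
    """Hypothetical cost: blend energy (J) with time (s) scaled to joules.

    eta = 1.0 cares only about energy; eta = 0.0 cares only about time.
    """
    return eta * energy_j + (1.0 - eta) * max_power_w * time_s


def pick_power_limit(profiles, eta=0.5):
    """Return the power limit (W) with the lowest cost.

    profiles maps power limit in watts -> (energy in joules, time in seconds),
    measured over the same amount of work at each limit.
    """
    max_power = max(profiles)
    return min(
        profiles,
        key=lambda limit: cost(profiles[limit][0], profiles[limit][1], eta, max_power),
    )


# Made-up profiling results: lower limits save energy but slow training down.
profiles = {
    300: (3000.0, 10.0),
    250: (2600.0, 10.5),
    200: (2300.0, 12.0),
    150: (2200.0, 15.5),
}

for eta in (0.0, 0.5, 1.0):
    print(f"eta={eta}: chosen power limit = {pick_power_limit(profiles, eta)} W")
```

With a single knob like `eta`, the same profiling data can serve users who prioritize speed, energy, or a balance of the two.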

CLI power and energy monitor

$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}

$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
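
Power samples and energy are related by integration over time. As a sketch of that relationship, the snippet below applies the trapezoidal rule to the four GPU0 samples printed by the power monitor above. The helper `integrate_energy` is hypothetical and is not claimed to be how Zeus computes its totals; it also covers only the interval between the first and last sample, so its result should not be expected to match the total the tool reports.

```python
from datetime import datetime


def integrate_energy(samples):
    """Approximate energy in joules from (timestamp, watts) samples.

    Uses the trapezoidal rule between consecutive samples, which must be
    sorted by time. Fewer than two samples yields 0.0.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        dt = (t1 - t0).total_seconds()
        total += dt * (p0 + p1) / 2.0
    return total


FMT = "%Y-%m-%d %H:%M:%S.%f"

# GPU0 readings copied from the power monitor output above.
gpu0 = [
    (datetime.strptime("2023-08-22 22:40:00.800576", FMT), 66.176),
    (datetime.strptime("2023-08-22 22:40:01.842590", FMT), 66.078),
    (datetime.strptime("2023-08-22 22:40:02.845734", FMT), 66.078),
    (datetime.strptime("2023-08-22 22:40:03.848818", FMT), 66.177),
]

energy = integrate_energy(gpu0)
print(f"GPU0: {energy:.1f} J between the first and last sample")
```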

Please refer to our NSDI’23 paper and slides for details. Check out Overview for a summary.

Zeus is part of The ML.ENERGY Initiative.

Getting Started

Refer to Getting Started for instructions on environment setup, installation, and integration. We also provide integration examples in our GitHub repository.

Extending Zeus

You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.

Refer to Extending Zeus for details.