Measuring Energy
Zeus lets you measure time, power, and energy both programmatically in Python and from the command line. Measuring power and energy has low overhead, typically taking less than 10 ms per call.
Programmatic measurement
ZeusMonitor makes it simple to measure the GPU time and energy consumption of arbitrary Python code blocks.
A measurement window is defined by a code block wrapped with begin_window and end_window.
end_window returns a Measurement object, which holds the time and energy consumption of the window.
Users can specify and measure multiple measurement windows at the same time, and they can be arbitrarily nested or overlapping as long as they are given different names.
import torch
from zeus.monitor import ZeusMonitor

if __name__ == "__main__":
    # All GPUs are measured simultaneously if `gpu_indices` is not given.
    monitor = ZeusMonitor(gpu_indices=[torch.cuda.current_device()])

    # `train_loader` and `train_one_step` are assumed to be defined by your training script.
    for epoch in range(100):
        monitor.begin_window("epoch")

        steps = []
        for x, y in train_loader:
            monitor.begin_window("step")
            train_one_step(x, y)
            result = monitor.end_window("step")
            steps.append(result)

        mes = monitor.end_window("epoch")
        print(f"Epoch {epoch} consumed {mes.time} s and {mes.total_energy} J.")

        avg_time = sum(map(lambda m: m.time, steps)) / len(steps)
        avg_energy = sum(map(lambda m: m.total_energy, steps)) / len(steps)
        print(f"One step took {avg_time} s and {avg_energy} J on average.")
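To make the rule about window names concrete, here is a toy, time-only stand-in for named windows. It is not Zeus code: WindowTimer and its internals are hypothetical, and only the begin_window and end_window method names are borrowed from ZeusMonitor. It records wall-clock time rather than GPU energy, and it rejects a name that is already open, which is one way to read the requirement that concurrent windows have different names.

```python
import time


class WindowTimer:
    """Toy stand-in (hypothetical, not part of Zeus) that tracks named time windows."""

    def __init__(self):
        self._start = {}  # window name -> start timestamp

    def begin_window(self, name):
        if name in self._start:
            raise ValueError(f"Window '{name}' is already open")
        self._start[name] = time.monotonic()

    def end_window(self, name):
        # Windows are looked up by name, so they may be closed in any order.
        return time.monotonic() - self._start.pop(name)


timer = WindowTimer()
timer.begin_window("outer")
timer.begin_window("inner")     # nested inside "outer"
timer.begin_window("straddle")  # starts inside "outer" but will outlive it
inner_s = timer.end_window("inner")
outer_s = timer.end_window("outer")
straddle_s = timer.end_window("straddle")  # overlapping, not nested
```

Because each window is keyed by its name, closing "outer" before "straddle" causes no confusion about which window is being ended.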
PowerMonitor spawns a process that polls the instantaneous GPU power consumption API and exposes two methods: get_power and get_energy.
For older GPUs that do not support querying energy directly, ZeusMonitor automatically uses PowerMonitor internally.
Use of global variables on GPUs older than Volta
On older GPUs, do not instantiate ZeusMonitor as a global variable without protecting it with if __name__ == "__main__".
This is because the energy query API is only available on Volta or newer NVIDIA GPU microarchitectures; on older GPUs, a separate process that polls the power API (i.e., PowerMonitor) has to be spawned.
In this case, global code that spawns the process should be guarded with if __name__ == "__main__".
See the Python docs for more details.
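The guard pattern itself can be sketched with the standard multiprocessing module alone. This is only an illustration of why process-spawning code belongs under the guard, not how PowerMonitor is implemented; poll_once, read_from_child, and the reading they exchange are hypothetical.

```python
import multiprocessing as mp


def poll_once(conn):
    """Stand-in for a power-polling loop: send one made-up reading, then exit."""
    conn.send({"GPU0": 66.0})  # hypothetical value, not a real measurement
    conn.close()


def read_from_child():
    """Start the poller in a separate process and return what it sends back."""
    parent_end, child_end = mp.Pipe()
    proc = mp.Process(target=poll_once, args=(child_end,))
    proc.start()
    child_end.close()  # the parent keeps only its own end of the pipe
    reading = parent_end.recv()
    proc.join()
    return reading


# Under the "spawn" start method, the child process re-imports this module.
# Without the guard, that import would itself try to start another process.
if __name__ == "__main__":
    print(read_from_child())
```

Defining functions at module level is safe; only the code that actually starts a process needs to sit under the guard.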
gpu_indices and CUDA_VISIBLE_DEVICES
Zeus always respects CUDA_VISIBLE_DEVICES if set.
In other words, if CUDA_VISIBLE_DEVICES=1,3 and gpu_indices=[1], Zeus will understand that as GPU 3 in the system.
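The index translation described here can be sketched in a few lines. This is an illustration of the mapping only, not Zeus's implementation: resolve_gpu_indices is a hypothetical helper, and it assumes CUDA_VISIBLE_DEVICES holds a comma-separated list of integer indices.

```python
def resolve_gpu_indices(gpu_indices, cuda_visible_devices=None):
    """Map indices relative to CUDA_VISIBLE_DEVICES to system-wide GPU indices.

    Hypothetical helper for illustration; assumes integer device indices.
    """
    if not cuda_visible_devices:
        # Nothing is masked, so the indices already refer to system GPUs.
        return list(gpu_indices)
    visible = [int(i) for i in cuda_visible_devices.split(",")]
    return [visible[i] for i in gpu_indices]


system_indices = resolve_gpu_indices([1], "1,3")
print(system_indices)  # [3]
```

With CUDA_VISIBLE_DEVICES=1,3, index 0 refers to system GPU 1 and index 1 refers to system GPU 3, which matches the example in the text.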
gpu_indices and optimization
In general, energy optimizers measure the energy of the GPU through a ZeusMonitor instance that is passed to their constructor.
Thus, only the GPUs specified by gpu_indices will be the target of optimization.
CLI power and energy monitor
The energy monitor measures the total energy consumed by the GPU during the lifetime of the monitor process.
It's a simple wrapper around ZeusMonitor.
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.utils.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
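As a quick sanity check on output like this, average power per GPU is energy divided by elapsed time. The sketch below copies the values from the output above into plain Python variables; it does not use the Measurement class, and the variable names are illustrative.

```python
# Values copied from the sample output above.
elapsed_s = 3.4480526447296143
energy_j = {
    0: 224.2969999909401,
    1: 232.83799999952316,
    2: 233.3100000023842,
    3: 234.53700000047684,
}

# Energy summed across all measured GPUs.
total_energy_j = sum(energy_j.values())

# Average power (W) = energy (J) / time (s), per GPU.
avg_power_w = {gpu: joules / elapsed_s for gpu, joules in energy_j.items()}

for gpu, watts in avg_power_w.items():
    print(f"GPU {gpu}: {watts:.1f} W on average")
print(f"Total: {total_energy_j:.1f} J over {elapsed_s:.2f} s")
```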
The power monitor periodically prints out the GPU's power draw.
It's a simple wrapper around PowerMonitor.
$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
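Energy relates to polled power readings through integration over time. The sketch below applies the trapezoidal rule to the four GPU0 readings printed above. The choice of trapezoidal integration is an assumption made for illustration, since this page does not state how PowerMonitor computes energy, and integrate_energy is a hypothetical helper. The result covers only the interval between the first and last printed reading, so it should not be expected to match the total reported by the tool.

```python
from datetime import datetime

# (timestamp, GPU0 power in W) pairs copied from the sample output above.
samples = [
    ("2023-08-22 22:40:00.800576", 66.176),
    ("2023-08-22 22:40:01.842590", 66.078),
    ("2023-08-22 22:40:02.845734", 66.078),
    ("2023-08-22 22:40:03.848818", 66.177),
]


def integrate_energy(samples):
    """Trapezoidal integration of power (W) over time (s), giving energy (J)."""
    times = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f") for ts, _ in samples]
    powers = [watts for _, watts in samples]
    energy = 0.0
    for i in range(1, len(samples)):
        dt = (times[i] - times[i - 1]).total_seconds()
        energy += 0.5 * (powers[i] + powers[i - 1]) * dt
    return energy


gpu0_energy_j = integrate_energy(samples)
print(f"GPU0 energy between first and last reading: {gpu0_energy_j:.1f} J")
```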