power_limit

zeus.optimizer.power_limit

Optimizers that select the optimum power limit.

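This module contains the following pieces: the OptimumSelector base class with its concrete selectors (Energy, Time, ZeusCost, MaxSlowdownConstraint), the profiling state classes (Ready, Warmup, Profiling, Done), the PowerLimitMeasurement record, and the GlobalPowerLimitOptimizer callback.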

OptimumSelector

Bases: ABC

Base class for optimum power limit selectors.

Source code in zeus/optimizer/power_limit.py
class OptimumSelector(ABC):
    """Base class for optimum power limit selectors."""

    @abstractmethod
    def select(self, measurements: list[PowerLimitMeasurement]) -> int:
        """Select the optimal power limit (W) from measurements."""

select abstractmethod

select(measurements)

Select the optimal power limit (W) from measurements.
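
A custom selector only has to implement select. The sketch below is illustrative and not part of Zeus: Measurement and Selector are hypothetical stand-ins for PowerLimitMeasurement and OptimumSelector so that the snippet runs without the library, EnergyDelayProduct is an invented strategy, and the sample numbers (and the Joule and second units) are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical stand-in for PowerLimitMeasurement."""

    power_limit: int  # Watts
    energy: float  # Joules (assumed unit)
    time: float  # seconds (assumed unit)


class Selector(ABC):
    """Stand-in that mirrors the OptimumSelector interface."""

    @abstractmethod
    def select(self, measurements: list[Measurement]) -> int:
        """Return the chosen power limit in Watts."""


class EnergyDelayProduct(Selector):
    """Invented strategy: minimize energy multiplied by time."""

    def select(self, measurements: list[Measurement]) -> int:
        return min(measurements, key=lambda m: m.energy * m.time).power_limit


measurements = [
    Measurement(power_limit=300, energy=2900.0, time=10.0),  # product 29000
    Measurement(power_limit=250, energy=2500.0, time=10.5),  # product 26250
    Measurement(power_limit=200, energy=2300.0, time=12.0),  # product 27600
]
chosen = EnergyDelayProduct().select(measurements)
print(chosen)
```

With the real library, a class like this would presumably subclass OptimumSelector and be passed as the optimum_selector argument of GlobalPowerLimitOptimizer.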

Source code in zeus/optimizer/power_limit.py
@abstractmethod
def select(self, measurements: list[PowerLimitMeasurement]) -> int:
    """Select the optimal power limit (W) from measurements."""

Energy

Bases: OptimumSelector

Selects the power limit that minimizes energy consumption.

Source code in zeus/optimizer/power_limit.py
class Energy(OptimumSelector):
    """Selects the power limit that minimizes energy consumption."""

    def select(self, measurements: list[PowerLimitMeasurement]) -> int:
        """Select the optimal power limit (W) from measurements."""
        return min(measurements, key=lambda x: x.energy).power_limit

select

select(measurements)

Select the optimal power limit (W) from measurements.

Source code in zeus/optimizer/power_limit.py
def select(self, measurements: list[PowerLimitMeasurement]) -> int:
    """Select the optimal power limit (W) from measurements."""
    return min(measurements, key=lambda x: x.energy).power_limit
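
To make the selection rule concrete, here is a minimal sketch that applies the same min-by-energy expression to invented measurements. Measurement is a hypothetical stand-in for PowerLimitMeasurement, and the values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical stand-in for PowerLimitMeasurement."""

    power_limit: int  # Watts
    energy: float
    time: float


measurements = [
    Measurement(power_limit=300, energy=2900.0, time=10.0),
    Measurement(power_limit=250, energy=2500.0, time=10.5),
    Measurement(power_limit=200, energy=2300.0, time=12.0),
    Measurement(power_limit=150, energy=2350.0, time=15.0),
]

# Same expression as Energy.select: the entry with the smallest energy wins.
best = min(measurements, key=lambda x: x.energy).power_limit
print(best)
```

In this invented data the lowest power limit is not the most energy efficient one, because the longer run time at 150 W outweighs the lower power draw.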

Time

Bases: OptimumSelector

Selects the power limit that minimizes training time.

This does not necessarily choose the maximum power limit, because time profiling results can be slightly noisy. We consider that preferable: it means training time is nearly identical across the higher power limits, and among those a lower power limit consumes less power.

Source code in zeus/optimizer/power_limit.py
class Time(OptimumSelector):
    """Selects the power limit that minimizes training time.

    This does not necessarily choose the maximum power limit, because time
    profiling results can be slightly noisy. We consider that preferable: it
    means training time is nearly identical across the higher power limits,
    and among those a lower power limit consumes less power.
    """

    def select(self, measurements: list[PowerLimitMeasurement]) -> int:
        """Select the optimal power limit (W) from measurements."""
        return min(measurements, key=lambda x: x.time).power_limit

select

select(measurements)

Select the optimal power limit (W) from measurements.

Source code in zeus/optimizer/power_limit.py
def select(self, measurements: list[PowerLimitMeasurement]) -> int:
    """Select the optimal power limit (W) from measurements."""
    return min(measurements, key=lambda x: x.time).power_limit
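
The effect of measurement noise described for this selector can be illustrated with a small sketch. Measurement is a hypothetical stand-in for PowerLimitMeasurement, and the timings are invented so that the two highest power limits differ only by noise.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical stand-in for PowerLimitMeasurement."""

    power_limit: int  # Watts
    energy: float
    time: float


measurements = [
    Measurement(power_limit=300, energy=2900.0, time=10.02),  # noisy, slightly slower
    Measurement(power_limit=275, energy=2700.0, time=10.00),
    Measurement(power_limit=250, energy=2500.0, time=10.50),
    Measurement(power_limit=200, energy=2300.0, time=12.00),
]

# Same expression as Time.select: the entry with the smallest time wins.
fastest = min(measurements, key=lambda x: x.time).power_limit
print(fastest)
```

Here the selector settles on 275 W rather than the 300 W maximum, which costs essentially no time in this made-up data.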

ZeusCost

Bases: OptimumSelector

Selects the power limit that minimizes a linear Zeus time-energy cost function.

Cost function is \(C = \eta \cdot Energy + MaxPower \cdot (1 - \eta) \cdot Time\).
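
As a reading aid, substituting the endpoints and the midpoint of the knob into the cost function above gives the following special cases. Since MaxPower multiplied by Time has units of energy, both terms are on the same scale.

```latex
\begin{aligned}
\eta = 1   &: \quad C = Energy \\
\eta = 0   &: \quad C = MaxPower \cdot Time \\
\eta = 0.5 &: \quad C = 0.5 \cdot Energy + 0.5 \cdot MaxPower \cdot Time
\end{aligned}
```

So \(\eta = 1\) reduces to pure energy minimization and \(\eta = 0\) to pure time minimization.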

Source code in zeus/optimizer/power_limit.py
class ZeusCost(OptimumSelector):
    r"""Selects the power limit that minimizes a linear Zeus time-energy cost function.

    Cost function is $C = \eta \cdot Energy + MaxPower \cdot (1 - \eta) \cdot Time$.
    """

    def __init__(self, eta_knob: float, world_size: int = 1) -> None:
        r"""Initialize the selector.

        Args:
            eta_knob: The $0 \le \eta \le 1$ knob for the Zeus time-energy cost function.
            world_size: The number of GPUs in the training job. Defaults to 1.
        """
        if eta_knob < 0 or eta_knob > 1:
            raise ValueError("eta_knob must be between 0 and 1, inclusive both sides.")
        if world_size < 1:
            raise ValueError("world_size must be greater than or equal to 1.")

        self.eta_knob = eta_knob
        self.world_size = world_size

    def select(self, measurements: list[PowerLimitMeasurement]) -> int:
        """Select the optimal power limit (W) from measurements."""
        max_power = (
            max(measurement.power_limit for measurement in measurements)
            * self.world_size
        )
        zeus_cost_map = {
            measurement.power_limit: zeus_cost(
                energy=measurement.energy,
                time=measurement.time,
                eta_knob=self.eta_knob,
                max_power=max_power,
            )
            for measurement in measurements
        }
        return min(zeus_cost_map, key=lambda x: zeus_cost_map[x])

__init__

__init__(eta_knob, world_size=1)

Parameters:

eta_knob (float, required): The \(0 \le \eta \le 1\) knob for the Zeus time-energy cost function.
world_size (int, default: 1): The number of GPUs in the training job. Defaults to 1.

Source code in zeus/optimizer/power_limit.py
def __init__(self, eta_knob: float, world_size: int = 1) -> None:
    r"""Initialize the selector.

    Args:
        eta_knob: The $0 \le \eta \le 1$ knob for the Zeus time-energy cost function.
        world_size: The number of GPUs in the training job. Defaults to 1.
    """
    if eta_knob < 0 or eta_knob > 1:
        raise ValueError("eta_knob must be between 0 and 1, inclusive both sides.")
    if world_size < 1:
        raise ValueError("world_size must be greater than or equal to 1.")

    self.eta_knob = eta_knob
    self.world_size = world_size

select

select(measurements)

Select the optimal power limit (W) from measurements.

Source code in zeus/optimizer/power_limit.py
def select(self, measurements: list[PowerLimitMeasurement]) -> int:
    """Select the optimal power limit (W) from measurements."""
    max_power = (
        max(measurement.power_limit for measurement in measurements)
        * self.world_size
    )
    zeus_cost_map = {
        measurement.power_limit: zeus_cost(
            energy=measurement.energy,
            time=measurement.time,
            eta_knob=self.eta_knob,
            max_power=max_power,
        )
        for measurement in measurements
    }
    return min(zeus_cost_map, key=lambda x: zeus_cost_map[x])
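
A worked example may help show how eta_knob shifts the choice. The sketch below reimplements the cost formula from the class description rather than importing zeus_cost, so cost and select_by_cost are hypothetical helpers, Measurement is a stand-in for PowerLimitMeasurement, and the measurements are invented.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical stand-in for PowerLimitMeasurement."""

    power_limit: int  # Watts
    energy: float
    time: float


def cost(energy: float, time: float, eta_knob: float, max_power: int) -> float:
    """C = eta * Energy + MaxPower * (1 - eta) * Time, as stated in the docs."""
    return eta_knob * energy + max_power * (1 - eta_knob) * time


def select_by_cost(measurements, eta_knob: float, world_size: int = 1) -> int:
    """Mirror of the selection logic: lowest cost wins."""
    max_power = max(m.power_limit for m in measurements) * world_size
    costs = {
        m.power_limit: cost(m.energy, m.time, eta_knob, max_power)
        for m in measurements
    }
    return min(costs, key=lambda pl: costs[pl])


measurements = [
    Measurement(power_limit=300, energy=2900.0, time=10.0),
    Measurement(power_limit=250, energy=2500.0, time=10.5),
    Measurement(power_limit=200, energy=2300.0, time=12.0),
]

balanced = select_by_cost(measurements, eta_knob=0.5)  # costs: 2950, 2825, 2950
energy_only = select_by_cost(measurements, eta_knob=1.0)  # costs equal the energies
time_only = select_by_cost(measurements, eta_knob=0.0)  # costs: 3000, 3150, 3600
print(balanced, energy_only, time_only)
```

With these numbers the balanced setting picks the middle power limit, while the two extremes reproduce the behavior of the Energy and Time selectors.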

MaxSlowdownConstraint

Bases: OptimumSelector

Selects the minimum power limit that does not slow down training by more than the given factor.

Source code in zeus/optimizer/power_limit.py
class MaxSlowdownConstraint(OptimumSelector):
    """Selects the minimum power limit that does not slow down training by more than the given factor."""

    def __init__(self, factor: float) -> None:
        """Initialize the selector.

        Args:
            factor: The maximum allowed slowdown factor. Greater than or equal to 1.0.
        """
        if factor < 1.0:
            raise ValueError(
                f"max_slowdown_factor must be greater than or equal to 1.0. Got {factor}.",
            )

        self.factor = factor

    def select(self, measurements: list[PowerLimitMeasurement]) -> int:
        """Select the optimal power limit (W) from measurements."""
        feasible_power_limits = []
        max_power = max(measurement.power_limit for measurement in measurements)
        shortest_time = next(
            measurement.time
            for measurement in measurements
            if measurement.power_limit == max_power
        )
        for measurement in measurements:
            if measurement.time <= self.factor * shortest_time:
                feasible_power_limits.append(measurement.power_limit)
        return min(feasible_power_limits)

__init__

__init__(factor)

Parameters:

factor (float, required): The maximum allowed slowdown factor. Greater than or equal to 1.0.

Source code in zeus/optimizer/power_limit.py
def __init__(self, factor: float) -> None:
    """Initialize the selector.

    Args:
        factor: The maximum allowed slowdown factor. Greater than or equal to 1.0.
    """
    if factor < 1.0:
        raise ValueError(
            f"max_slowdown_factor must be greater than or equal to 1.0. Got {factor}.",
        )

    self.factor = factor

select

select(measurements)

Select the optimal power limit (W) from measurements.

Source code in zeus/optimizer/power_limit.py
def select(self, measurements: list[PowerLimitMeasurement]) -> int:
    """Select the optimal power limit (W) from measurements."""
    feasible_power_limits = []
    max_power = max(measurement.power_limit for measurement in measurements)
    shortest_time = next(
        measurement.time
        for measurement in measurements
        if measurement.power_limit == max_power
    )
    for measurement in measurements:
        if measurement.time <= self.factor * shortest_time:
            feasible_power_limits.append(measurement.power_limit)
    return min(feasible_power_limits)
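
The following sketch walks through the same steps on invented data: take the time measured at the highest power limit as the baseline, keep every power limit whose time is within factor times that baseline, and return the smallest one. Measurement and select_with_max_slowdown are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical stand-in for PowerLimitMeasurement."""

    power_limit: int  # Watts
    energy: float
    time: float


def select_with_max_slowdown(measurements, factor: float) -> int:
    """Lowest power limit whose time is within `factor` of the max-power time."""
    max_power = max(m.power_limit for m in measurements)
    baseline = next(m.time for m in measurements if m.power_limit == max_power)
    feasible = [m.power_limit for m in measurements if m.time <= factor * baseline]
    return min(feasible)


measurements = [
    Measurement(power_limit=300, energy=2900.0, time=10.0),  # baseline
    Measurement(power_limit=250, energy=2500.0, time=10.5),  # 5% slower
    Measurement(power_limit=200, energy=2300.0, time=12.0),  # 20% slower
]

no_slowdown = select_with_max_slowdown(measurements, factor=1.0)
ten_percent = select_with_max_slowdown(measurements, factor=1.1)
quarter = select_with_max_slowdown(measurements, factor=1.25)
print(no_slowdown, ten_percent, quarter)
```

Note that the baseline is the time at the maximum power limit, which is not necessarily the shortest measured time when profiling results are noisy.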

Ready

Bases: BaseModel

State for when we are ready to start measuring the next power limit.

Initial state of the state machine if no previous profiling results were given. Ready -> Warmup on the steps'th on_step_begin.

Source code in zeus/optimizer/power_limit.py
class Ready(BaseModel):
    """State for when we are ready to start measuring the next power limit.

    Initial state of the state machine if no previous profiling results were given.
    `Ready` -> `Warmup` on the `steps`'th `on_step_begin`.
    """

    next_power_limit: PositiveInt
    steps: PositiveInt

Warmup

Bases: BaseModel

State for when we are warming up for a power limit.

Warmup -> Profiling on the steps'th on_step_begin.
Warmup -> Ready on on_epoch_end before the steps'th on_step_begin.

Source code in zeus/optimizer/power_limit.py
class Warmup(BaseModel):
    """State for when we are warming up for a power limit.

    `Warmup` -> `Profiling` on the `steps`'th `on_step_begin`.
    `Warmup` -> `Ready` on `on_epoch_end` before `steps`'th `on_step_begin`.
    """

    current_power_limit: PositiveInt
    steps: PositiveInt

Profiling

Bases: BaseModel

State for when we are profiling a power limit.

Profiling -> Warmup after the steps'th on_step_begin, if there are still power limits left to profile.
Profiling -> Done after the steps'th on_step_begin, if there are no more power limits left to profile.
Profiling -> Ready on on_epoch_end before the steps'th on_step_begin.

Source code in zeus/optimizer/power_limit.py
class Profiling(BaseModel):
    """State for when we are profiling a power limit.

    `Profiling` -> `Warmup` after `steps`'th `on_step_begin` and
        there are still power limits left to profile.
    `Profiling` -> `Done` after `steps`'th `on_step_begin` and
        there are no more power limits left to profile.
    `Profiling` -> `Ready` on `on_epoch_end` before `steps`'th `on_step_begin`.
    """

    current_power_limit: PositiveInt
    steps: PositiveInt

Done

Bases: BaseModel

State for when we are done profiling all power limits.

Initial state of the state machine if previous profiling results were given. Final state of the state machine in any case.

Source code in zeus/optimizer/power_limit.py
class Done(BaseModel):
    """State for when we are done profiling all power limits.

    Initial state of the state machine if previous profiling results were given.
    Final state of the state machine in any case.
    """

    optimal_power_limit: PositiveInt
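
The transitions between the four states can be traced with a small GPU-free imitation. This is a sketch under stated assumptions, not the library's implementation: ToyStateMachine is hypothetical, plain dataclasses replace the pydantic models, power limits are in Watts, the final "optimal" value is a placeholder, and interruption by on_epoch_end is left out.

```python
from dataclasses import dataclass


@dataclass
class Ready:
    next_power_limit: int
    steps: int


@dataclass
class Warmup:
    current_power_limit: int
    steps: int


@dataclass
class Profiling:
    current_power_limit: int
    steps: int


@dataclass
class Done:
    optimal_power_limit: int


class ToyStateMachine:
    """Hypothetical imitation of the step-driven transitions described above."""

    def __init__(self, power_limits, wait_steps=1, warmup_steps=2, profile_steps=3):
        self.power_limits = power_limits
        self.warmup_steps = warmup_steps
        self.profile_steps = profile_steps
        self.profiled = []  # power limits whose profiling window completed
        self.state = Ready(next_power_limit=power_limits[0], steps=wait_steps + 1)

    def on_step_begin(self):
        s = self.state
        if isinstance(s, Done):
            return
        s.steps -= 1
        if s.steps != 0:
            return
        if isinstance(s, Ready):
            self.state = Warmup(s.next_power_limit, self.warmup_steps)
        elif isinstance(s, Warmup):
            self.state = Profiling(s.current_power_limit, self.profile_steps)
        else:  # Profiling window finished
            self.profiled.append(s.current_power_limit)
            index = self.power_limits.index(s.current_power_limit)
            if index == len(self.power_limits) - 1:
                # Placeholder: the real optimizer asks its selector here.
                self.state = Done(optimal_power_limit=self.profiled[-1])
            else:
                self.state = Warmup(self.power_limits[index + 1], self.warmup_steps)


machine = ToyStateMachine(power_limits=[300, 275])
trace = []
for _ in range(12):
    machine.on_step_begin()
    trace.append(type(machine.state).__name__)
print(trace)
```

With two power limits and these small step counts, the machine leaves Ready on the second step and reaches Done on the twelfth.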

PowerLimitMeasurement

Bases: BaseModel

POD for GPU energy and time measurements for one power limit (W).

Source code in zeus/optimizer/power_limit.py
class PowerLimitMeasurement(BaseModel):
    """POD for GPU energy and time measurements for one power limit (W)."""

    power_limit: PositiveInt  # In Watts.
    energy: PositiveFloat
    time: PositiveFloat

_PowerLimitMeasurementList

Bases: BaseModel

Proxy class to save and load a list of PowerLimitMeasurements.

Source code in zeus/optimizer/power_limit.py
class _PowerLimitMeasurementList(BaseModel):
    """Proxy class to save and load a list of `PowerLimitMeasurement`s."""

    measurements: list[PowerLimitMeasurement]
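
Because the proxy has a single measurements field, a saved profile should be a JSON object with one "measurements" array of power_limit, energy and time entries. The sketch below reproduces that shape with the standard library only. Measurement, dump_profile and load_profile are hypothetical names, and the real classes additionally validate that the values are positive.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Measurement:
    """Hypothetical stand-in for PowerLimitMeasurement (no validation)."""

    power_limit: int  # Watts
    energy: float
    time: float


def dump_profile(measurements) -> str:
    """Serialize to the {"measurements": [...]} layout implied by the proxy class."""
    return json.dumps({"measurements": [asdict(m) for m in measurements]}, indent=4)


def load_profile(text: str):
    """Parse the same layout back into Measurement objects."""
    return [Measurement(**item) for item in json.loads(text)["measurements"]]


original = [
    Measurement(power_limit=300, energy=2900.0, time=10.0),
    Measurement(power_limit=275, energy=2700.0, time=10.2),
]
text = dump_profile(original)
restored = load_profile(text)
print(text)
```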

GlobalPowerLimitOptimizer

Bases: Callback

Optimizer for the power limit knob.

This optimizer uses the JIT profiling log to determine the optimal power limit.

Source code in zeus/optimizer/power_limit.py
class GlobalPowerLimitOptimizer(Callback):
    """Optimizer for the power limit knob.

    This optimizer uses the JIT profiling log to determine the optimal power limit.
    """

    def __init__(
        self,
        monitor: ZeusMonitor,
        optimum_selector: OptimumSelector | None = None,
        wait_steps: int = 1,
        warmup_steps: int = 10,
        profile_steps: int = 40,
        pl_step: int = 25,
        profile_path: str | Path | None = None,
    ) -> None:
        r"""Initialize the optimizer.

        GPU indices to profile and optimize for are taken from `monitor.gpu_indices`.

        Args:
            monitor: `ZeusMonitor` instance used to profile GPU time and energy consumption.
            optimum_selector: The optimum selector to use. If not given, use `ZeusCost` with \eta=0.5.
            wait_steps: Number of steps to pass by before doing anything at the beginning.
                Useful if you have something like `torch.backends.cudnn.benchmark=True`,
                because the first iteration won't be representative of the rest of the iterations.
            warmup_steps: Number of warmup iterations for each power limit.
            profile_steps: Number of profiling iterations for each power limit.
            pl_step: The stride between power limits to explore, in units of Watts.
            profile_path: If the path points to an existing file, load the profile from the file
                and do not run any profiling. If the path points to a non-existing file, profile
                and save the profile to the file. If `None`, do not save or load any profile.
        """
        # Sanity checks.
        if wait_steps < 0:
            raise ValueError("wait_steps must be non-negative.")
        if warmup_steps < 0:
            raise ValueError("warmup_steps must be non-negative.")
        if profile_steps <= 0:
            raise ValueError("profile_steps must be positive.")
        if pl_step <= 0:
            raise ValueError("pl_step must be positive.")

        self.monitor = monitor
        self.optimum_selector = optimum_selector or ZeusCost(
            eta_knob=0.5,
            world_size=len(monitor.gpu_indices),
        )
        self.warmup_steps = warmup_steps
        self.profile_steps = profile_steps
        self.pl_step = pl_step * 1000  # Internally, we use milliWatts.
        self.profile_path = (
            Path(profile_path) if isinstance(profile_path, str) else profile_path
        )

        # Setup logging.
        self.logger = get_logger(type(self).__name__)

        # Set the range of power limits to explore.
        # Assert that supported power limits ranges are uniform across GPUs.
        gpus = get_gpus(ensure_homogeneous=True)
        pls = []
        for index in monitor.gpu_indices:
            pls.append(gpus.getPowerManagementLimitConstraints(index))
        if not all(pls[0] == pl for pl in pls):
            raise ValueError("Power limits ranges are not uniform across GPUs.")
        self.power_limits = list(
            range(pls[0][1], pls[0][0] - self.pl_step, -self.pl_step)
        )

        # Turn on persistence mode and set to the highest power limit.
        try:
            for index in monitor.gpu_indices:
                gpus.setPersistenceMode(index, enable=True)
        except ZeusGPUNoPermissionError as ze:
            raise RuntimeError(
                "SYS_ADMIN capability is required to modify GPU power limits. "
                "Using --cap-add SYS_ADMIN when running the Docker container "
                "is the easiest way to do this."
            ) from ze
        self.current_power_limit = 0

        # Store `Measurement` objects in a list, one for each power limit.
        self.measurements: list[PowerLimitMeasurement] = []

        # State for the profiler state machine.
        self.state: Ready | Warmup | Profiling | Done

        # Initialize JIT profiling states.
        if self.profile_path is None:
            self.logger.info("JIT profiling enabled.")
            self.logger.info("Will wait %d step(s) before profiling.", wait_steps)
            self.state = Ready(
                next_power_limit=self.power_limits[0], steps=wait_steps + 1
            )
            self.logger.info("Set power limit to the maximum before starting.")
            self._set_power_limit(max(self.power_limits))
        elif not self.profile_path.exists():
            self.logger.info(
                "JIT Profiling enabled. Profile will be saved to '%s'.",
                str(self.profile_path),
            )
            self.logger.info("Will wait %d step(s) before profiling.", wait_steps)
            self.state = Ready(
                next_power_limit=self.power_limits[0], steps=wait_steps + 1
            )
            self.logger.info("Set power limit to the maximum before starting.")
            self._set_power_limit(max(self.power_limits))
        else:
            self.measurements = _PowerLimitMeasurementList.parse_file(
                self.profile_path,
            ).measurements
            # self.measurements = _PowerLimitMeasurementList.model_validate_json(
            #     open(self.profile_path).read(),
            #     strict=True,
            # ).measurements
            self.logger.info(
                "Loaded previous profiling results from '%s'.", str(self.profile_path)
            )
            optimal_power_limit = self._compute_optimal_power_limit()
            self.logger.info(
                "Optimal power limit is %d W.", optimal_power_limit // 1000
            )
            self.state = Done(optimal_power_limit=optimal_power_limit)
            self._set_power_limit(self.state.optimal_power_limit)

        # Restore all GPUs back to their maximum power limit on exit.
        atexit.register(lambda: self._set_power_limit(max(self.power_limits)))

    def on_epoch_end(self) -> None:
        """Mark the end of a training epoch."""
        if isinstance(self.state, Ready):
            pass

        elif isinstance(self.state, (Warmup, Profiling)):
            # Warmup/Profiling stage interrupted by the end of an epoch.
            self.logger.info(
                "%s phase for %d W interrupted by the end of a training epoch.",
                type(self.state).__name__,
                self.state.current_power_limit // 1000,
            )
            if isinstance(self.state, Profiling):
                self.monitor.end_window(
                    f"__GlobalPowerLimitOptimizer_{self.state.current_power_limit // 1000}",
                    cancel=True,
                )
            self.state = Ready(next_power_limit=self.state.current_power_limit, steps=1)
            self._set_power_limit(max(self.power_limits))

        elif isinstance(self.state, Done):
            pass

    def on_step_begin(self) -> None:
        """Mark the beginning of a training step."""
        if isinstance(self.state, Ready):
            self.state.steps -= 1
            if self.state.steps == 0:
                self.logger.info(
                    "Starting warmup for power limit %d W.",
                    self.state.next_power_limit // 1000,
                )
                self._set_power_limit(self.state.next_power_limit)
                self.state = Warmup(
                    current_power_limit=self.state.next_power_limit,
                    steps=self.warmup_steps,
                )

        elif isinstance(self.state, Warmup):
            self.state.steps -= 1
            if self.state.steps == 0:
                self.logger.info(
                    "Starting actual profiling for power limit %d W.",
                    self.state.current_power_limit // 1000,
                )
                self.state = Profiling(
                    current_power_limit=self.state.current_power_limit,
                    steps=self.profile_steps,
                )
                self.monitor.begin_window(
                    f"__GlobalPowerLimitOptimizer_{self.state.current_power_limit // 1000}",
                )

        elif isinstance(self.state, Profiling):
            self.state.steps -= 1
            if self.state.steps == 0:
                measurement = self.monitor.end_window(
                    f"__GlobalPowerLimitOptimizer_{self.state.current_power_limit // 1000}",
                )
                self.logger.info(
                    "Finished profiling for power limit %d W.",
                    self.state.current_power_limit // 1000,
                )
                self.measurements.append(
                    PowerLimitMeasurement(
                        power_limit=self.state.current_power_limit // 1000,
                        energy=measurement.total_energy,
                        time=measurement.time,
                    )
                )
                # If we're done profiling all power limits, compute the optimal
                # power limit and transition to the Done state. Otherwise, move
                # on to the Warmup phase for the next power limit.
                current_power_limit_index = self.power_limits.index(
                    self.state.current_power_limit
                )
                if current_power_limit_index == len(self.power_limits) - 1:
                    self.state = Done(
                        optimal_power_limit=self._compute_optimal_power_limit(),
                    )
                    self._set_power_limit(self.state.optimal_power_limit)
                    self._save_profile()
                else:
                    next_power_limit = self.power_limits[current_power_limit_index + 1]
                    self.logger.info(
                        "Starting warmup for power limit %d W.",
                        next_power_limit // 1000,
                    )
                    self._set_power_limit(next_power_limit)
                    self.state = Warmup(
                        current_power_limit=next_power_limit,
                        steps=self.warmup_steps,
                    )

        elif isinstance(self.state, Done):
            pass

    def _set_power_limit(self, power_limit: int) -> None:
        """Set the power limit for all GPUs.

        Args:
            power_limit: The power limit to set, in milliWatts.
        """
        gpus = get_gpus()
        self.logger.info("Setting power limit to %d W.", power_limit // 1000)
        if self.current_power_limit == power_limit:
            return
        for index in self.monitor.gpu_indices:
            gpus.setPowerManagementLimit(index, power_limit)
        self.current_power_limit = power_limit

    def _compute_optimal_power_limit(self) -> int:
        """Compute the optimal power limit in milliWatts."""
        optimal_power_limit = self.optimum_selector.select(self.measurements) * 1000
        self.logger.info("Optimal power limit is %d W.", optimal_power_limit // 1000)
        return optimal_power_limit

    def _save_profile(self) -> None:
        """Save JIT profiling results and the optimal power limit to a JSON file."""
        if self.profile_path is None:
            return

        assert isinstance(self.state, Done)
        with self.profile_path.open("w", encoding="utf-8") as f:
            f.write(
                _PowerLimitMeasurementList(measurements=self.measurements).json(
                    indent=4
                ),
            )
        self.logger.info("JIT profiling results saved to '%s'.", str(self.profile_path))

__init__

__init__(
    monitor,
    optimum_selector=None,
    wait_steps=1,
    warmup_steps=10,
    profile_steps=40,
    pl_step=25,
    profile_path=None,
)

GPU indices to profile and optimize for are taken from monitor.gpu_indices.

Parameters:

monitor (ZeusMonitor, required): ZeusMonitor instance used to profile GPU time and energy consumption.
optimum_selector (OptimumSelector | None, default: None): The optimum selector to use. If not given, use ZeusCost with \(\eta = 0.5\).
wait_steps (int, default: 1): Number of steps to pass by before doing anything at the beginning. Useful if you have something like torch.backends.cudnn.benchmark=True, because the first iteration won't be representative of the rest of the iterations.
warmup_steps (int, default: 10): Number of warmup iterations for each power limit.
profile_steps (int, default: 40): Number of profiling iterations for each power limit.
pl_step (int, default: 25): The stride between power limits to explore, in units of Watts.
profile_path (str | Path | None, default: None): If the path points to an existing file, load the profile from the file and do not run any profiling. If the path points to a non-existing file, profile and save the profile to the file. If None, do not save or load any profile.

Source code in zeus/optimizer/power_limit.py
def __init__(
    self,
    monitor: ZeusMonitor,
    optimum_selector: OptimumSelector | None = None,
    wait_steps: int = 1,
    warmup_steps: int = 10,
    profile_steps: int = 40,
    pl_step: int = 25,
    profile_path: str | Path | None = None,
) -> None:
    r"""Initialize the optimizer.

    GPU indices to profile and optimize for are taken from `monitor.gpu_indices`.

    Args:
        monitor: `ZeusMonitor` instance used to profile GPU time and energy consumption.
        optimum_selector: The optimum selector to use. If not given, use `ZeusCost` with \eta=0.5.
        wait_steps: Number of steps to pass by before doing anything at the beginning.
            Useful if you have something like `torch.backends.cudnn.benchmark=True`,
            because the first iteration won't be representative of the rest of the iterations.
        warmup_steps: Number of warmup iterations for each power limit.
        profile_steps: Number of profiling iterations for each power limit.
        pl_step: The stride between power limits to explore, in units of Watts.
        profile_path: If the path points to an existing file, load the profile from the file
            and do not run any profiling. If the path points to a non-existing file, profile
            and save the profile to the file. If `None`, do not save or load any profile.
    """
    # Sanity checks.
    if wait_steps < 0:
        raise ValueError("wait_steps must be non-negative.")
    if warmup_steps < 0:
        raise ValueError("warmup_steps must be non-negative.")
    if profile_steps <= 0:
        raise ValueError("profile_steps must be positive.")
    if pl_step <= 0:
        raise ValueError("pl_step must be positive.")

    self.monitor = monitor
    self.optimum_selector = optimum_selector or ZeusCost(
        eta_knob=0.5,
        world_size=len(monitor.gpu_indices),
    )
    self.warmup_steps = warmup_steps
    self.profile_steps = profile_steps
    self.pl_step = pl_step * 1000  # Internally, we use milliWatts.
    self.profile_path = (
        Path(profile_path) if isinstance(profile_path, str) else profile_path
    )

    # Setup logging.
    self.logger = get_logger(type(self).__name__)

    # Set the range of power limits to explore.
    # Assert that supported power limits ranges are uniform across GPUs.
    gpus = get_gpus(ensure_homogeneous=True)
    pls = []
    for index in monitor.gpu_indices:
        pls.append(gpus.getPowerManagementLimitConstraints(index))
    if not all(pls[0] == pl for pl in pls):
        raise ValueError("Power limits ranges are not uniform across GPUs.")
    self.power_limits = list(
        range(pls[0][1], pls[0][0] - self.pl_step, -self.pl_step)
    )

    # Turn on persistence mode and set to the highest power limit.
    try:
        for index in monitor.gpu_indices:
            gpus.setPersistenceMode(index, enable=True)
    except ZeusGPUNoPermissionError as ze:
        raise RuntimeError(
            "SYS_ADMIN capability is required to modify GPU power limits. "
            "Using --cap-add SYS_ADMIN when running the Docker container "
            "is the easiest way to do this."
        ) from ze
    self.current_power_limit = 0

    # Store `Measurement` objects in a list, one for each power limit.
    self.measurements: list[PowerLimitMeasurement] = []

    # State for the profiler state machine.
    self.state: Ready | Warmup | Profiling | Done

    # Initialize JIT profiling states.
    if self.profile_path is None:
        self.logger.info("JIT profiling enabled.")
        self.logger.info("Will wait %d step(s) before profiling.", wait_steps)
        self.state = Ready(
            next_power_limit=self.power_limits[0], steps=wait_steps + 1
        )
        self.logger.info("Set power limit to the maximum before starting.")
        self._set_power_limit(max(self.power_limits))
    elif not self.profile_path.exists():
        self.logger.info(
            "JIT Profiling enabled. Profile will be saved to '%s'.",
            str(self.profile_path),
        )
        self.logger.info("Will wait %d step(s) before profiling.", wait_steps)
        self.state = Ready(
            next_power_limit=self.power_limits[0], steps=wait_steps + 1
        )
        self.logger.info("Set power limit to the maximum before starting.")
        self._set_power_limit(max(self.power_limits))
    else:
        self.measurements = _PowerLimitMeasurementList.parse_file(
            self.profile_path,
        ).measurements
        # self.measurements = _PowerLimitMeasurementList.model_validate_json(
        #     open(self.profile_path).read(),
        #     strict=True,
        # ).measurements
        self.logger.info(
            "Loaded previous profiling results from '%s'.", str(self.profile_path)
        )
        optimal_power_limit = self._compute_optimal_power_limit()
        self.logger.info(
            "Optimal power limit is %d W.", optimal_power_limit // 1000
        )
        self.state = Done(optimal_power_limit=optimal_power_limit)
        self._set_power_limit(self.state.optimal_power_limit)

    # Restore all GPUs back to their maximum power limit on exit.
    atexit.register(lambda: self._set_power_limit(max(self.power_limits)))
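
The three `profile_path` branches in the constructor above can be summarized with a small sketch. `initial_mode` and its return strings are hypothetical names used only for illustration; they are not part of Zeus.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def initial_mode(profile_path: Path | None) -> str:
    """Hypothetical helper mirroring the three constructor branches."""
    if profile_path is None:
        return "profile"           # JIT profile; nothing is saved or loaded
    if not profile_path.exists():
        return "profile-and-save"  # JIT profile, then write results to the path
    return "load"                  # skip profiling and reuse stored measurements


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "profile.json"
    existing.write_text("{}", encoding="utf-8")
    missing = Path(tmp) / "missing.json"
    modes = [initial_mode(None), initial_mode(missing), initial_mode(existing)]

print(modes)
```

In the first two cases the optimizer starts in the `Ready` state at the maximum power limit; in the third it goes straight to `Done` with the power limit computed from the loaded measurements.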

on_epoch_end

on_epoch_end()

Mark the end of a training epoch.

Source code in zeus/optimizer/power_limit.py
def on_epoch_end(self) -> None:
    """Mark the end of a training epoch."""
    if isinstance(self.state, Ready):
        pass

    elif isinstance(self.state, (Warmup, Profiling)):
        # Warmup/Profiling stage interrupted by the end of an epoch.
        self.logger.info(
            "%s phase for %d W interrupted by the end of a training epoch.",
            type(self.state).__name__,
            self.state.current_power_limit // 1000,
        )
        if isinstance(self.state, Profiling):
            self.monitor.end_window(
                f"__GlobalPowerLimitOptimizer_{self.state.current_power_limit // 1000}",
                cancel=True,
            )
        self.state = Ready(next_power_limit=self.state.current_power_limit, steps=1)
        self._set_power_limit(max(self.power_limits))

    elif isinstance(self.state, Done):
        pass
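
As a rough illustration of the interruption handling above, the sketch below uses stand-in dataclasses (not the Zeus state classes) and leaves out logging, the monitor window cancellation, and the GPU calls. It shows that an interrupted phase is rescheduled for the same power limit with `steps=1`, so the first `on_step_begin` of the next epoch starts warmup again.

```python
from dataclasses import dataclass


@dataclass
class Ready:
    next_power_limit: int  # milliWatts
    steps: int


@dataclass
class Warmup:
    current_power_limit: int  # milliWatts
    steps: int


@dataclass
class Profiling:
    current_power_limit: int  # milliWatts
    steps: int


def on_epoch_end(state):
    """Simplified transition: an interrupted phase restarts from Ready."""
    if isinstance(state, (Warmup, Profiling)):
        return Ready(next_power_limit=state.current_power_limit, steps=1)
    return state  # Ready (and Done, omitted here) are left untouched


# Profiling at 275 W is cut short with 17 steps still to go.
interrupted = Profiling(current_power_limit=275_000, steps=17)
restarted = on_epoch_end(interrupted)
print(restarted)
```

The partial measurement is discarded rather than kept, presumably so that a window spanning an epoch boundary does not skew the profile.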

on_step_begin

on_step_begin()

Mark the beginning of a training step.

Source code in zeus/optimizer/power_limit.py
def on_step_begin(self) -> None:
    """Mark the beginning of a training step."""
    if isinstance(self.state, Ready):
        self.state.steps -= 1
        if self.state.steps == 0:
            self.logger.info(
                "Starting warmup for power limit %d W.",
                self.state.next_power_limit // 1000,
            )
            self._set_power_limit(self.state.next_power_limit)
            self.state = Warmup(
                current_power_limit=self.state.next_power_limit,
                steps=self.warmup_steps,
            )

    elif isinstance(self.state, Warmup):
        self.state.steps -= 1
        if self.state.steps == 0:
            self.logger.info(
                "Starting actual profiling for power limit %d W.",
                self.state.current_power_limit // 1000,
            )
            self.state = Profiling(
                current_power_limit=self.state.current_power_limit,
                steps=self.profile_steps,
            )
            self.monitor.begin_window(
                f"__GlobalPowerLimitOptimizer_{self.state.current_power_limit // 1000}",
            )

    elif isinstance(self.state, Profiling):
        self.state.steps -= 1
        if self.state.steps == 0:
            measurement = self.monitor.end_window(
                f"__GlobalPowerLimitOptimizer_{self.state.current_power_limit // 1000}",
            )
            self.logger.info(
                "Finished profiling for power limit %d W.",
                self.state.current_power_limit // 1000,
            )
            self.measurements.append(
                PowerLimitMeasurement(
                    power_limit=self.state.current_power_limit // 1000,
                    energy=measurement.total_energy,
                    time=measurement.time,
                )
            )
            # If we're done profiling all power limits, compute the optimal
            # power limit and transition to the Done state. Otherwise, move
            # on to the Warmup phase for the next power limit.
            current_power_limit_index = self.power_limits.index(
                self.state.current_power_limit
            )
            if current_power_limit_index == len(self.power_limits) - 1:
                self.state = Done(
                    optimal_power_limit=self._compute_optimal_power_limit(),
                )
                self._set_power_limit(self.state.optimal_power_limit)
                self._save_profile()
            else:
                next_power_limit = self.power_limits[current_power_limit_index + 1]
                self.logger.info(
                    "Starting warmup for power limit %d W.",
                    next_power_limit // 1000,
                )
                self._set_power_limit(next_power_limit)
                self.state = Warmup(
                    current_power_limit=next_power_limit,
                    steps=self.warmup_steps,
                )

    elif isinstance(self.state, Done):
        pass
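
To see how the step counters interact, here is a minimal simulation of the state machine above. It is a sketch under assumptions: the dataclasses are stand-ins for the Zeus state classes, `simulate` is a hypothetical helper, the power limit values and their order are made up, and logging, monitor windows, and GPU calls are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Ready:
    next_power_limit: int
    steps: int


@dataclass
class Warmup:
    current_power_limit: int
    steps: int


@dataclass
class Profiling:
    current_power_limit: int
    steps: int


@dataclass
class Done:
    pass


def simulate(power_limits, wait_steps, warmup_steps, profile_steps):
    """Return the phase each training step runs in, until Done is reached."""
    state = Ready(next_power_limit=power_limits[0], steps=wait_steps + 1)
    trace = []
    while True:
        # Same transitions as on_step_begin, one call per training step.
        if isinstance(state, Ready):
            state.steps -= 1
            if state.steps == 0:
                state = Warmup(state.next_power_limit, warmup_steps)
        elif isinstance(state, Warmup):
            state.steps -= 1
            if state.steps == 0:
                state = Profiling(state.current_power_limit, profile_steps)
        elif isinstance(state, Profiling):
            state.steps -= 1
            if state.steps == 0:
                index = power_limits.index(state.current_power_limit)
                if index == len(power_limits) - 1:
                    state = Done()
                else:
                    state = Warmup(power_limits[index + 1], warmup_steps)
        if isinstance(state, Done):
            return trace
        trace.append(type(state).__name__)


trace = simulate([300_000, 275_000], wait_steps=1, warmup_steps=2, profile_steps=3)
print(len(trace), trace)
```

With these toy settings, 11 steps run before `Done` is reached: `wait_steps` steps in `Ready`, then `warmup_steps + profile_steps` steps for each of the two power limits.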

_set_power_limit

_set_power_limit(power_limit)

Set the power limit for all GPUs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| power_limit | int | The power limit to set, in milliWatts. | required |

Source code in zeus/optimizer/power_limit.py
def _set_power_limit(self, power_limit: int) -> None:
    """Set the power limit for all GPUs.

    Args:
        power_limit: The power limit to set, in milliWatts.
    """
    gpus = get_gpus()
    self.logger.info("Setting power limit to %d W.", power_limit // 1000)
    if self.current_power_limit == power_limit:
        return
    for index in self.monitor.gpu_indices:
        gpus.setPowerManagementLimit(index, power_limit)
    self.current_power_limit = power_limit
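
The sketch below illustrates two details of the method above: the argument is in milliWatts, and a repeated request for the current limit does not reach the GPUs. `FakeGPUs` and `PowerLimitSetter` are hypothetical stand-ins that only record calls; they are not Zeus classes.

```python
class FakeGPUs:
    """Stand-in for the object returned by get_gpus(); records calls only."""

    def __init__(self):
        self.calls = []

    def setPowerManagementLimit(self, index, power_limit_mw):
        self.calls.append((index, power_limit_mw))


class PowerLimitSetter:
    """Hypothetical holder for the caching logic of _set_power_limit."""

    def __init__(self, gpus, gpu_indices):
        self.gpus = gpus
        self.gpu_indices = gpu_indices
        self.current_power_limit = 0  # same initial value as the constructor

    def set(self, power_limit_mw):
        if self.current_power_limit == power_limit_mw:
            return  # already applied; skip the driver calls
        for index in self.gpu_indices:
            self.gpus.setPowerManagementLimit(index, power_limit_mw)
        self.current_power_limit = power_limit_mw


gpus = FakeGPUs()
setter = PowerLimitSetter(gpus, gpu_indices=[0, 1])
setter.set(250 * 1000)  # 250 W expressed in milliWatts
setter.set(250 * 1000)  # no-op: the limit is unchanged
print(gpus.calls)
```

Because `current_power_limit` starts at 0, the first request always goes through to every GPU in `gpu_indices`.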

_compute_optimal_power_limit

_compute_optimal_power_limit()

Compute the optimal power limit in milliWatts.

Source code in zeus/optimizer/power_limit.py
def _compute_optimal_power_limit(self) -> int:
    """Compute the optimal power limit in milliWatts."""
    optimal_power_limit = self.optimum_selector.select(self.measurements) * 1000
    self.logger.info("Optimal power limit is %d W.", optimal_power_limit // 1000)
    return optimal_power_limit
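
A small sketch of the unit handling above: selectors work with measurements whose power limit is stored in Watts, and the result is multiplied by 1000 before it is handed to the GPU-facing code. The `Measurement` dataclass is a stand-in for `PowerLimitMeasurement`, the numbers are invented for illustration, and the selection rule is the minimum-energy rule of the `Energy` selector.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    power_limit: int  # Watts
    energy: float     # Joules
    time: float       # seconds


# Made-up profiling results for three power limits.
measurements = [
    Measurement(power_limit=300, energy=5200.0, time=20.0),
    Measurement(power_limit=250, energy=4800.0, time=21.5),
    Measurement(power_limit=200, energy=5000.0, time=25.0),
]

# Minimum-energy selection, as in the Energy selector.
optimal_w = min(measurements, key=lambda m: m.energy).power_limit
optimal_mw = optimal_w * 1000  # convert Watts to milliWatts

print(optimal_w, optimal_mw)
```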

_save_profile

_save_profile()

Save JIT profiling results and the optimal power limit to a JSON file.

Source code in zeus/optimizer/power_limit.py
def _save_profile(self) -> None:
    """Save JIT profiling results and the optimal power limit to a JSON file."""
    if self.profile_path is None:
        return

    assert isinstance(self.state, Done)
    with self.profile_path.open("w", encoding="utf-8") as f:
        f.write(
            _PowerLimitMeasurementList(measurements=self.measurements).json(
                indent=4
            ),
        )
    self.logger.info("JIT profiling results saved to '%s'.", str(self.profile_path))
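
The save and load paths form a JSON round trip. The sketch below imitates it with the standard library only; it assumes the file holds a top-level `measurements` list with `power_limit`, `energy`, and `time` fields, which is what the model names above suggest but is not verified against the Zeus source. `save` and `load` are hypothetical helpers.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Measurement:
    power_limit: int  # Watts
    energy: float     # Joules
    time: float       # seconds


def save(path: Path, measurements: list) -> None:
    """Write measurements under a top-level 'measurements' key (assumed layout)."""
    payload = {"measurements": [asdict(m) for m in measurements]}
    path.write_text(json.dumps(payload, indent=4), encoding="utf-8")


def load(path: Path) -> list:
    """Read the measurements back into dataclass instances."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    return [Measurement(**m) for m in payload["measurements"]]


original = [Measurement(300, 5200.0, 20.0), Measurement(250, 4800.0, 21.5)]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "profile.json"
    save(path, original)
    restored = load(path)

print(restored == original)
```

Only the raw measurements need to be stored this way: the optimal power limit can be recomputed from them on load, which is what the constructor does when the profile file already exists.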

HFGlobalPowerLimitOptimizer

Bases: TrainerCallback

[Wrapped for Hugging Face Trainer Callback] Optimizer for the power limit knob.

This optimizer uses the JIT profiling log to determine the optimal power limit. See GlobalPowerLimitOptimizer for the underlying optimizer implementation.

Source code in zeus/optimizer/power_limit.py
class HFGlobalPowerLimitOptimizer(TrainerCallback):
    """[Wrapped for Hugging Face Trainer Callback] Optimizer for the power limit knob.

    This optimizer uses the JIT profiling log to determine the optimal power limit.
    See [`GlobalPowerLimitOptimizer`][zeus.optimizer.power_limit.GlobalPowerLimitOptimizer]
    for the underlying optimizer implementation.
    """

    def __init__(
        self,
        monitor: ZeusMonitor,
        optimum_selector: OptimumSelector | None = None,
        wait_steps: int = 1,
        warmup_steps: int = 10,
        profile_steps: int = 40,
        pl_step: int = 25,
        profile_path: str | Path | None = None,
    ) -> None:
        r"""Initialize the optimizer.

        GPU indices to profile and optimize for are taken from `monitor.gpu_indices`.

        Args:
            monitor: `ZeusMonitor` instance used to profile GPU time and energy consumption.
            optimum_selector: The optimum selector to use. If not given, use `ZeusCost` with \eta=0.5.
            wait_steps: Number of steps to pass by before doing anything at the beginning.
                Useful if you have something like `torch.backends.cudnn.benchmark=True`,
                because the first iteration won't be representative of the rest of the iterations.
            warmup_steps: Number of warmup iterations for each power limit.
            profile_steps: Number of profile iterations for each power limit.
            pl_step: The stride between power limits to explore, in units of Watts.
            profile_path: If the path points to an existing file, load the profile from the file
                and do not run any profiling. If the path points to a non-existing file, profile
                and save the profile to the file. If `None`, do not save or load any profile.
        """
        if not transformers_available:
            raise ImportError(
                "The transformers package is not installed. Please install it to use the HFGlobalPowerLimitOptimizer."
            )

        self.optimizer = GlobalPowerLimitOptimizer(
            monitor=monitor,
            optimum_selector=optimum_selector,
            wait_steps=wait_steps,
            warmup_steps=warmup_steps,
            profile_steps=profile_steps,
            pl_step=pl_step,
            profile_path=profile_path,
        )

    def on_epoch_end(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        model: PreTrainedModel,
        **kwargs,
    ) -> None:
        """Mark the end of a training epoch."""
        self.optimizer.on_epoch_end()

    def on_step_begin(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        model: PreTrainedModel,
        **kwargs,
    ) -> None:
        """Mark the beginning of a training step."""
        self.optimizer.on_step_begin()

__init__

__init__(
    monitor,
    optimum_selector=None,
    wait_steps=1,
    warmup_steps=10,
    profile_steps=40,
    pl_step=25,
    profile_path=None,
)

GPU indices to profile and optimize for are taken from monitor.gpu_indices.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| monitor | ZeusMonitor | ZeusMonitor instance used to profile GPU time and energy consumption. | required |
| optimum_selector | OptimumSelector \| None | The optimum selector to use. If not given, use ZeusCost with \eta=0.5. | None |
| wait_steps | int | Number of steps to pass by before doing anything at the beginning. Useful if you have something like torch.backends.cudnn.benchmark=True, because the first iteration won't be representative of the rest of the iterations. | 1 |
| warmup_steps | int | Number of warmup iterations for each power limit. | 10 |
| profile_steps | int | Number of profile iterations for each power limit. | 40 |
| pl_step | int | The stride between power limits to explore, in units of Watts. | 25 |
| profile_path | str \| Path \| None | If the path points to an existing file, load the profile from the file and do not run any profiling. If the path points to a non-existing file, profile and save the profile to the file. If None, do not save or load any profile. | None |

Source code in zeus/optimizer/power_limit.py
def __init__(
    self,
    monitor: ZeusMonitor,
    optimum_selector: OptimumSelector | None = None,
    wait_steps: int = 1,
    warmup_steps: int = 10,
    profile_steps: int = 40,
    pl_step: int = 25,
    profile_path: str | Path | None = None,
) -> None:
    r"""Initialize the optimizer.

    GPU indices to profile and optimize for are taken from `monitor.gpu_indices`.

    Args:
        monitor: `ZeusMonitor` instance used to profile GPU time and energy consumption.
        optimum_selector: The optimum selector to use. If not given, use `ZeusCost` with \eta=0.5.
        wait_steps: Number of steps to pass by before doing anything at the beginning.
            Useful if you have something like `torch.backends.cudnn.benchmark=True`,
            because the first iteration won't be representative of the rest of the iterations.
        warmup_steps: Number of warmup iterations for each power limit.
        profile_steps: Number of profile iterations for each power limit.
        pl_step: The stride between power limits to explore, in units of Watts.
        profile_path: If the path points to an existing file, load the profile from the file
            and do not run any profiling. If the path points to a non-existing file, profile
            and save the profile to the file. If `None`, do not save or load any profile.
    """
    if not transformers_available:
        raise ImportError(
            "The transformers package is not installed. Please install it to use the HFGlobalPowerLimitOptimizer."
        )

    self.optimizer = GlobalPowerLimitOptimizer(
        monitor=monitor,
        optimum_selector=optimum_selector,
        wait_steps=wait_steps,
        warmup_steps=warmup_steps,
        profile_steps=profile_steps,
        pl_step=pl_step,
        profile_path=profile_path,
    )

on_epoch_end

on_epoch_end(args, state, control, model, **kwargs)

Mark the end of a training epoch.

Source code in zeus/optimizer/power_limit.py
def on_epoch_end(
    self,
    args: TrainingArguments,
    state: TrainerState,
    control: TrainerControl,
    model: PreTrainedModel,
    **kwargs,
) -> None:
    """Mark the end of a training epoch."""
    self.optimizer.on_epoch_end()

on_step_begin

on_step_begin(args, state, control, model, **kwargs)

Mark the beginning of a training step.

Source code in zeus/optimizer/power_limit.py
def on_step_begin(
    self,
    args: TrainingArguments,
    state: TrainerState,
    control: TrainerControl,
    model: PreTrainedModel,
    **kwargs,
) -> None:
    """Mark the beginning of a training step."""
    self.optimizer.on_step_begin()
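
The wrapper's callbacks ignore the arguments passed by the Trainer and forward each event to the wrapped optimizer. The delegation pattern can be sketched without GPUs or the transformers package; `RecordingOptimizer` and `CallbackWrapper` are hypothetical stand-ins, not Zeus or Hugging Face classes.

```python
class RecordingOptimizer:
    """Stand-in for GlobalPowerLimitOptimizer that records received events."""

    def __init__(self):
        self.events = []

    def on_epoch_end(self):
        self.events.append("epoch_end")

    def on_step_begin(self):
        self.events.append("step_begin")


class CallbackWrapper:
    """Forwards Trainer-style callbacks to the argument-free optimizer hooks."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def on_epoch_end(self, args, state, control, model=None, **kwargs):
        self.optimizer.on_epoch_end()

    def on_step_begin(self, args, state, control, model=None, **kwargs):
        self.optimizer.on_step_begin()


inner = RecordingOptimizer()
callback = CallbackWrapper(inner)

# Imitate a one-epoch run with two training steps.
for _ in range(2):
    callback.on_step_begin(args=None, state=None, control=None)
callback.on_epoch_end(args=None, state=None, control=None)

print(inner.events)
```

Keeping the wrapper this thin means all profiling state lives in the wrapped optimizer, so the same state machine serves both plain training loops and the Trainer integration.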