gpu

zeus.device.gpu

Abstraction layer for GPU devices.

The main function of this module is get_gpus, which returns a GPU Manager object specific to the platform.

Important

In principle, any NVIDIA GPU is supported. For AMD GPUs, only ROCm 6.1 and later is currently supported.

Getting handles to GPUs

The main API exported from this module is the get_gpus function. It returns either NVIDIAGPUs or AMDGPUs depending on the platform.

from zeus.device import get_gpus
gpus = get_gpus()  

Calling GPU management APIs

GPU management library APIs are mapped to methods on GPU.

For example, with NVIDIA GPUs (which use pynvml), you would previously have called:

handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
constraints = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)

With the Zeus GPU abstraction layer, you would now call:

gpus = get_gpus() # returns an NVIDIAGPUs object
constraints = gpus.getPowerManagementLimitConstraints(gpu_index)
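
The mapping can be pictured with a small stand-in. The sketch below is not Zeus code: FakeGPU and FakeGPUs are hypothetical classes, and the power limits are made-up values. It only mirrors the shape described above, where the per-GPU object takes no index and the manager object takes gpu_index as its first argument and forwards the call.

```python
from __future__ import annotations

from typing import Sequence


class FakeGPU:
    """Hypothetical stand-in for one GPU; not part of Zeus."""

    def __init__(self, gpu_index: int, min_mw: int, max_mw: int) -> None:
        self.gpu_index = gpu_index
        self._limits = (min_mw, max_mw)

    def getPowerManagementLimitConstraints(self) -> tuple[int, int]:
        # A real implementation would call the vendor library here.
        return self._limits


class FakeGPUs:
    """Hypothetical stand-in for the manager object returned by get_gpus."""

    def __init__(self, gpus: Sequence[FakeGPU]) -> None:
        self._gpus = list(gpus)

    def __len__(self) -> int:
        return len(self._gpus)

    def getPowerManagementLimitConstraints(self, gpu_index: int) -> tuple[int, int]:
        # Forward the call to the GPU object at the given index.
        return self._gpus[gpu_index].getPowerManagementLimitConstraints()


gpus = FakeGPUs([FakeGPU(0, 100_000, 300_000), FakeGPU(1, 150_000, 400_000)])
constraints = gpus.getPowerManagementLimitConstraints(1)
```

With this layout, calling code never touches a vendor handle; it only needs the GPU index.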

Non-blocking calls

Some implementations of GPU support non-blocking calls to setters. If non-blocking calls are not supported, the _block argument is ignored and the call blocks. Check GPU.supports_nonblocking_setters to see whether non-blocking calls are supported. Note that a non-blocking call does not raise an exception even if the underlying operation fails.

Currently, only ZeusdNVIDIAGPU supports non-blocking calls to methods that set the GPU's power limit, GPU frequency, memory frequency, and persistence mode. This is possible because the Zeus daemon supports a block: bool parameter in HTTP requests, which can be set to False to make the call return immediately without checking the result.
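
A caller can consult supports_nonblocking_setters before deciding how to treat a setter call. The following is a minimal sketch under stated assumptions: BlockingOnlyGPU is a hypothetical stand-in for a GPU object without non-blocking support, and set_power_limit is an illustrative helper. Neither is part of Zeus; only the property name and the _block parameter come from the GPU class documented on this page.

```python
from __future__ import annotations


class BlockingOnlyGPU:
    """Hypothetical GPU object without non-blocking support; not Zeus code."""

    supports_nonblocking_setters = False

    def __init__(self) -> None:
        # Each entry records (power_limit_mw, call_was_blocking).
        self.calls: list[tuple[int, bool]] = []

    def setPowerManagementLimit(self, power_limit_mw: int, _block: bool = True) -> None:
        # _block is ignored here: the call always completes before returning.
        self.calls.append((power_limit_mw, True))


def set_power_limit(gpu, power_limit_mw: int, prefer_nonblocking: bool) -> bool:
    """Set the power limit and report whether the call was blocking."""
    nonblocking = prefer_nonblocking and gpu.supports_nonblocking_setters
    gpu.setPowerManagementLimit(power_limit_mw, _block=not nonblocking)
    return not nonblocking


gpu = BlockingOnlyGPU()
was_blocking = set_power_limit(gpu, 250_000, prefer_nonblocking=True)
```

Because a non-blocking call does not report failures, code that needs confirmation that a setting took effect would keep the default _block=True.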

Error handling

The following exceptions are defined in this module:

ZeusBaseGPUError

Bases: ZeusBaseError

Zeus base GPU exception class.

Source code in zeus/device/exception.py
class ZeusBaseGPUError(ZeusBaseError):
    """Zeus base GPU exception class."""

    def __init__(self, message: str) -> None:
        """Initialize Base Zeus Exception."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/exception.py
def __init__(self, message: str) -> None:
    """Initialize Base Zeus Exception."""
    super().__init__(message)
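
Since every GPU exception on this page derives from ZeusBaseGPUError, a caller can handle one specific failure and fall back to the base class for the rest. The sketch below redeclares a few of the classes locally so that it runs without Zeus installed; the stand-ins, the describe_failure helper, the two failing functions, and the assumption that ZeusBaseError derives from Exception are all illustrative.

```python
class ZeusBaseError(Exception):
    """Stand-in for the Zeus base exception (assumed to derive from Exception)."""


class ZeusBaseGPUError(ZeusBaseError):
    """Stand-in mirroring the base GPU exception documented above."""


class ZeusGPUNoPermissionError(ZeusBaseGPUError):
    """Stand-in mirroring one of the specific GPU exceptions."""


class ZeusGPUNotSupportedError(ZeusBaseGPUError):
    """Stand-in mirroring another specific GPU exception."""


def describe_failure(operation) -> str:
    """Run operation and summarize any GPU error it raises."""
    try:
        operation()
    except ZeusGPUNoPermissionError as err:
        # Most specific handler first.
        return f"no permission: {err}"
    except ZeusBaseGPUError as err:
        # Catches every other GPU exception subclass.
        return f"GPU error: {err}"
    return "ok"


def needs_privileges() -> None:
    # Hypothetical operation that fails for lack of privileges.
    raise ZeusGPUNoPermissionError("power limit change was rejected")


def unsupported() -> None:
    # Hypothetical operation the GPU does not support.
    raise ZeusGPUNotSupportedError("memory clock locking is unavailable")
```

Ordering the handlers from most to least specific keeps the base-class handler from shadowing the specific one.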

GPU

Bases: ABC

Abstract base class for managing one GPU.

For each method, child classes should call into vendor-specific GPU management libraries (e.g., NVML for NVIDIA GPUs).

Source code in zeus/device/gpu/common.py
class GPU(abc.ABC):
    """Abstract base class for managing one GPU.

    For each method, child classes should call into vendor-specific
    GPU management libraries (e.g., NVML for NVIDIA GPUs).
    """

    def __init__(self, gpu_index: int) -> None:
        """Initialize the GPU with a specified index."""
        self.gpu_index = gpu_index

    @property
    @abc.abstractmethod
    def supports_nonblocking_setters(self) -> bool:
        """Return True if the GPU object supports non-blocking configuration setters."""
        return False

    @abc.abstractmethod
    def getName(self) -> str:
        """Return the name of the GPU model."""
        pass

    @abc.abstractmethod
    def getPowerManagementLimitConstraints(self) -> tuple[int, int]:
        """Return the minimum and maximum power management limits. Units: mW."""
        pass

    @abc.abstractmethod
    def setPowerManagementLimit(self, power_limit_mw: int, _block: bool = True) -> None:
        """Set the GPU's power management limit. Unit: mW."""
        pass

    @abc.abstractmethod
    def resetPowerManagementLimit(self, _block: bool = True) -> None:
        """Reset the GPU's power management limit to the default value."""
        pass

    @abc.abstractmethod
    def setPersistenceMode(self, enabled: bool, _block: bool = True) -> None:
        """Set persistence mode."""
        pass

    @abc.abstractmethod
    def getSupportedMemoryClocks(self) -> list[int]:
        """Return a list of supported memory clock frequencies. Units: MHz."""
        pass

    @abc.abstractmethod
    def setMemoryLockedClocks(
        self, min_clock_mhz: int, max_clock_mhz: int, _block: bool = True
    ) -> None:
        """Lock the memory clock to a specified range. Units: MHz."""
        pass

    @abc.abstractmethod
    def resetMemoryLockedClocks(self, _block: bool = True) -> None:
        """Reset the locked memory clocks to the default."""
        pass

    @abc.abstractmethod
    def getSupportedGraphicsClocks(
        self, memory_clock_mhz: int | None = None
    ) -> list[int]:
        """Return a list of supported graphics clock frequencies. Units: MHz.

        Args:
            memory_clock_mhz: Memory clock frequency to use. Some GPUs have
                different supported graphics clocks depending on the memory clock.
        """
        pass

    @abc.abstractmethod
    def setGpuLockedClocks(
        self, min_clock_mhz: int, max_clock_mhz: int, _block: bool = True
    ) -> None:
        """Lock the GPU clock to a specified range. Units: MHz."""
        pass

    @abc.abstractmethod
    def resetGpuLockedClocks(self, _block: bool = True) -> None:
        """Reset the locked GPU clocks to the default."""
        pass

    @abc.abstractmethod
    def getAveragePowerUsage(self) -> int:
        """Return the average power usage of the GPU. Units: mW."""
        pass

    @abc.abstractmethod
    def getInstantPowerUsage(self) -> int:
        """Return the current power draw of the GPU. Units: mW."""
        pass

    @abc.abstractmethod
    def getAverageMemoryPowerUsage(self) -> int:
        """Return the average power usage of the GPU's memory. Units: mW."""
        pass

    @abc.abstractmethod
    def supportsGetTotalEnergyConsumption(self) -> bool:
        """Check if the GPU supports retrieving total energy consumption."""
        pass

    @abc.abstractmethod
    def getTotalEnergyConsumption(self) -> int:
        """Return the total energy consumption of the GPU since driver load. Units: mJ."""
        pass

supports_nonblocking_setters abstractmethod property

supports_nonblocking_setters

Return True if the GPU object supports non-blocking configuration setters.

__init__

__init__(gpu_index)
Source code in zeus/device/gpu/common.py
def __init__(self, gpu_index: int) -> None:
    """Initialize the GPU with a specified index."""
    self.gpu_index = gpu_index

getName abstractmethod

getName()

Return the name of the GPU model.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def getName(self) -> str:
    """Return the name of the GPU model."""
    pass

getPowerManagementLimitConstraints abstractmethod

getPowerManagementLimitConstraints()

Return the minimum and maximum power management limits. Units: mW.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def getPowerManagementLimitConstraints(self) -> tuple[int, int]:
    """Return the minimum and maximum power management limits. Units: mW."""
    pass
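
Because the constraints are reported in milliwatts as a (minimum, maximum) pair, a caller that thinks in watts needs a unit conversion and a range check before setting a limit. A minimal sketch of that step follows; clamp_power_limit_mw is a hypothetical helper, not part of Zeus, and the constraint values are made up for illustration.

```python
from __future__ import annotations


def clamp_power_limit_mw(requested_watts: float, constraints: tuple[int, int]) -> int:
    """Convert a requested limit from W to mW and clamp it to the allowed range."""
    min_mw, max_mw = constraints
    requested_mw = round(requested_watts * 1000)
    return max(min_mw, min(requested_mw, max_mw))


# Example constraints in mW, in the (min, max) order documented above.
example_constraints = (100_000, 300_000)
within_range = clamp_power_limit_mw(250, example_constraints)
too_low = clamp_power_limit_mw(50, example_constraints)
too_high = clamp_power_limit_mw(350, example_constraints)
```

The clamped value is already in milliwatts, the unit that setPowerManagementLimit expects for power_limit_mw.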

setPowerManagementLimit abstractmethod

setPowerManagementLimit(power_limit_mw, _block=True)

Set the GPU's power management limit. Unit: mW.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def setPowerManagementLimit(self, power_limit_mw: int, _block: bool = True) -> None:
    """Set the GPU's power management limit. Unit: mW."""
    pass

resetPowerManagementLimit abstractmethod

resetPowerManagementLimit(_block=True)

Reset the GPU's power management limit to the default value.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def resetPowerManagementLimit(self, _block: bool = True) -> None:
    """Reset the GPU's power management limit to the default value."""
    pass

setPersistenceMode abstractmethod

setPersistenceMode(enabled, _block=True)

Set persistence mode.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def setPersistenceMode(self, enabled: bool, _block: bool = True) -> None:
    """Set persistence mode."""
    pass

getSupportedMemoryClocks abstractmethod

getSupportedMemoryClocks()

Return a list of supported memory clock frequencies. Units: MHz.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def getSupportedMemoryClocks(self) -> list[int]:
    """Return a list of supported memory clock frequencies. Units: MHz."""
    pass

setMemoryLockedClocks abstractmethod

setMemoryLockedClocks(
    min_clock_mhz, max_clock_mhz, _block=True
)

Lock the memory clock to a specified range. Units: MHz.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def setMemoryLockedClocks(
    self, min_clock_mhz: int, max_clock_mhz: int, _block: bool = True
) -> None:
    """Lock the memory clock to a specified range. Units: MHz."""
    pass

resetMemoryLockedClocks abstractmethod

resetMemoryLockedClocks(_block=True)

Reset the locked memory clocks to the default.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def resetMemoryLockedClocks(self, _block: bool = True) -> None:
    """Reset the locked memory clocks to the default."""
    pass

getSupportedGraphicsClocks abstractmethod

getSupportedGraphicsClocks(memory_clock_mhz=None)

Return a list of supported graphics clock frequencies. Units: MHz.

Parameters:

memory_clock_mhz (int | None, default: None): Memory clock frequency to use. Some GPUs have different supported graphics clocks depending on the memory clock.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def getSupportedGraphicsClocks(
    self, memory_clock_mhz: int | None = None
) -> list[int]:
    """Return a list of supported graphics clock frequencies. Units: MHz.

    Args:
        memory_clock_mhz: Memory clock frequency to use. Some GPUs have
            different supported graphics clocks depending on the memory clock.
    """
    pass
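
The returned list is typically used to pick a frequency the hardware accepts before locking clocks. The sketch below shows one way to choose the supported clock closest to a target; nearest_supported_clock is a hypothetical helper, not part of Zeus, and the clock values are invented for the example.

```python
from __future__ import annotations


def nearest_supported_clock(supported_mhz: list[int], target_mhz: int) -> int:
    """Return the supported clock closest to target_mhz (lower clock wins ties)."""
    if not supported_mhz:
        raise ValueError("The list of supported clocks is empty.")
    # Sort key: distance to the target first, then the clock itself for ties.
    return min(supported_mhz, key=lambda clock: (abs(clock - target_mhz), clock))


# Example list in MHz, standing in for a getSupportedGraphicsClocks result.
example_clocks = [210, 705, 1200, 1410]
chosen = nearest_supported_clock(example_clocks, 1000)
```

Passing the chosen value as both min_clock_mhz and max_clock_mhz to setGpuLockedClocks would pin the GPU clock to that single frequency.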

setGpuLockedClocks abstractmethod

setGpuLockedClocks(
    min_clock_mhz, max_clock_mhz, _block=True
)

Lock the GPU clock to a specified range. Units: MHz.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def setGpuLockedClocks(
    self, min_clock_mhz: int, max_clock_mhz: int, _block: bool = True
) -> None:
    """Lock the GPU clock to a specified range. Units: MHz."""
    pass

resetGpuLockedClocks abstractmethod

resetGpuLockedClocks(_block=True)

Reset the locked GPU clocks to the default.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def resetGpuLockedClocks(self, _block: bool = True) -> None:
    """Reset the locked GPU clocks to the default."""
    pass

getAveragePowerUsage abstractmethod

getAveragePowerUsage()

Return the average power usage of the GPU. Units: mW.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def getAveragePowerUsage(self) -> int:
    """Return the average power usage of the GPU. Units: mW."""
    pass

getInstantPowerUsage abstractmethod

getInstantPowerUsage()

Return the current power draw of the GPU. Units: mW.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def getInstantPowerUsage(self) -> int:
    """Return the current power draw of the GPU. Units: mW."""
    pass

getAverageMemoryPowerUsage abstractmethod

getAverageMemoryPowerUsage()

Return the average power usage of the GPU's memory. Units: mW.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def getAverageMemoryPowerUsage(self) -> int:
    """Return the average power usage of the GPU's memory. Units: mW."""
    pass

supportsGetTotalEnergyConsumption abstractmethod

supportsGetTotalEnergyConsumption()

Check if the GPU supports retrieving total energy consumption.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def supportsGetTotalEnergyConsumption(self) -> bool:
    """Check if the GPU supports retrieving total energy consumption."""
    pass

getTotalEnergyConsumption abstractmethod

getTotalEnergyConsumption()

Return the total energy consumption of the GPU since driver load. Units: mJ.

Source code in zeus/device/gpu/common.py
@abc.abstractmethod
def getTotalEnergyConsumption(self) -> int:
    """Return the total energy consumption of the GPU since driver load. Units: mJ."""
    pass
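
Since the counter is cumulative since driver load, the energy used by a piece of work is the difference between two readings. The sketch below shows that arithmetic, plus one possible fallback for GPUs where supportsGetTotalEnergyConsumption returns False: summing evenly spaced power samples. Both helpers are hypothetical and not part of Zeus, and the fallback is an assumption about how a caller might proceed rather than a description of what Zeus does.

```python
from __future__ import annotations


def energy_joules(start_mj: int, end_mj: int) -> float:
    """Energy between two cumulative counter readings, converted from mJ to J."""
    if end_mj < start_mj:
        raise ValueError("End reading is smaller than start reading.")
    return (end_mj - start_mj) / 1000.0


def estimate_energy_joules(power_samples_mw: list[int], interval_s: float) -> float:
    """Approximate energy in J from power samples (mW) taken every interval_s seconds."""
    # Rectangle rule: each sample is assumed to hold for one full interval.
    return sum(power_samples_mw) / 1000.0 * interval_s


measured = energy_joules(1_000_000, 1_250_000)
estimated = estimate_energy_joules([200_000, 300_000], 0.5)
```

The estimate is only as good as the sampling rate, so the counter-based reading is preferable wherever it is supported.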

EmptyGPUs

Bases: GPUs

A concrete class implementing the GPUs abstract base class, but representing an empty collection of GPUs.

This class is used to represent a scenario where no GPUs are available or detected. Any method call attempting to interact with a GPU will raise a ValueError.

Source code in zeus/device/gpu/common.py
class EmptyGPUs(GPUs):
    """A concrete class implementing the GPUs abstract base class, but representing an empty collection of GPUs.

    This class is used to represent a scenario where no GPUs are available or detected.
    Any method call attempting to interact with a GPU will raise a ValueError.
    """

    def __init__(self, ensure_homogeneous: bool = False) -> None:
        """Initialize the EmptyGPUs class.

        Since this class represents an empty collection of GPUs, no actual initialization of GPU objects is performed.
        """
        pass

    def __del__(self) -> None:
        """Clean up any resources if necessary.

        As this class represents an empty collection of GPUs, no specific cleanup is required.
        """
        pass

    @property
    def gpus(self) -> Sequence["GPU"]:
        """Return an empty list as no GPUs are being tracked."""
        return []

    def __len__(self) -> int:
        """Return 0, indicating no GPUs are being tracked."""
        return 0

    def _ensure_homogeneous(self) -> None:
        """Raise a ValueError as no GPUs are being tracked."""
        raise ValueError("No GPUs available to ensure homogeneity.")

    def _warn_sys_admin(self) -> None:
        """Raise a ValueError as no GPUs are being tracked."""
        raise ValueError("No GPUs available to warn about SYS_ADMIN privileges.")

    def getName(self, gpu_index: int) -> str:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def getPowerManagementLimitConstraints(self, gpu_index: int) -> tuple[int, int]:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def setPowerManagementLimit(
        self, gpu_index: int, power_limit_mw: int, _block: bool = True
    ) -> None:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def resetPowerManagementLimit(self, gpu_index: int, _block: bool = True) -> None:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def setPersistenceMode(
        self, gpu_index: int, enabled: bool, _block: bool = True
    ) -> None:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def getSupportedMemoryClocks(self, gpu_index: int) -> list[int]:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def setMemoryLockedClocks(
        self,
        gpu_index: int,
        min_clock_mhz: int,
        max_clock_mhz: int,
        _block: bool = True,
    ) -> None:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def resetMemoryLockedClocks(self, gpu_index: int, _block: bool = True) -> None:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def getSupportedGraphicsClocks(
        self, gpu_index: int, memory_clock_mhz: int | None = None
    ) -> list[int]:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def setGpuLockedClocks(
        self,
        gpu_index: int,
        min_clock_mhz: int,
        max_clock_mhz: int,
        _block: bool = True,
    ) -> None:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def resetGpuLockedClocks(self, gpu_index: int, _block: bool = True) -> None:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def getInstantPowerUsage(self, gpu_index: int) -> int:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def supportsGetTotalEnergyConsumption(self, gpu_index: int) -> bool:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")

    def getTotalEnergyConsumption(self, gpu_index: int) -> int:
        """Raise a ValueError as no GPUs are available."""
        raise ValueError("No GPUs available.")
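
Because an empty collection reports a length of 0 and raises ValueError on every per-GPU call, code that loops over range(len(gpus)) is naturally safe, while code that assumes GPU 0 exists is not. The sketch below illustrates this with NoGPUs, a hypothetical stand-in that mirrors only __len__ and getName from the class above; gpu_names is likewise an illustrative helper.

```python
from __future__ import annotations


class NoGPUs:
    """Hypothetical stand-in mirroring the behavior of EmptyGPUs; not Zeus code."""

    def __len__(self) -> int:
        return 0

    def getName(self, gpu_index: int) -> str:
        raise ValueError("No GPUs available.")


def gpu_names(gpus) -> list[str]:
    """Collect the name of every GPU; yields an empty list when there are none."""
    return [gpus.getName(index) for index in range(len(gpus))]


names = gpu_names(NoGPUs())

try:
    NoGPUs().getName(0)
    direct_call_failed = False
except ValueError:
    # Indexing a GPU that does not exist raises, as described above.
    direct_call_failed = True
```

Checking len(gpus) up front is a cheap way to give users a clearer message than the generic ValueError.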

gpus property

gpus

Return an empty list as no GPUs are being tracked.

__init__

__init__(ensure_homogeneous=False)

Since this class represents an empty collection of GPUs, no actual initialization of GPU objects is performed.

Source code in zeus/device/gpu/common.py
def __init__(self, ensure_homogeneous: bool = False) -> None:
    """Initialize the EmptyGPUs class.

    Since this class represents an empty collection of GPUs, no actual initialization of GPU objects is performed.
    """
    pass

__del__

__del__()

Clean up any resources if necessary.

As this class represents an empty collection of GPUs, no specific cleanup is required.

Source code in zeus/device/gpu/common.py
def __del__(self) -> None:
    """Clean up any resources if necessary.

    As this class represents an empty collection of GPUs, no specific cleanup is required.
    """
    pass

__len__

__len__()

Return 0, indicating no GPUs are being tracked.

Source code in zeus/device/gpu/common.py
def __len__(self) -> int:
    """Return 0, indicating no GPUs are being tracked."""
    return 0

_ensure_homogeneous

_ensure_homogeneous()

Raise a ValueError as no GPUs are being tracked.

Source code in zeus/device/gpu/common.py
def _ensure_homogeneous(self) -> None:
    """Raise a ValueError as no GPUs are being tracked."""
    raise ValueError("No GPUs available to ensure homogeneity.")

_warn_sys_admin

_warn_sys_admin()

Raise a ValueError as no GPUs are being tracked.

Source code in zeus/device/gpu/common.py
def _warn_sys_admin(self) -> None:
    """Raise a ValueError as no GPUs are being tracked."""
    raise ValueError("No GPUs available to warn about SYS_ADMIN privileges.")

getName

getName(gpu_index)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def getName(self, gpu_index: int) -> str:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

getPowerManagementLimitConstraints

getPowerManagementLimitConstraints(gpu_index)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def getPowerManagementLimitConstraints(self, gpu_index: int) -> tuple[int, int]:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

setPowerManagementLimit

setPowerManagementLimit(
    gpu_index, power_limit_mw, _block=True
)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def setPowerManagementLimit(
    self, gpu_index: int, power_limit_mw: int, _block: bool = True
) -> None:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

resetPowerManagementLimit

resetPowerManagementLimit(gpu_index, _block=True)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def resetPowerManagementLimit(self, gpu_index: int, _block: bool = True) -> None:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

setPersistenceMode

setPersistenceMode(gpu_index, enabled, _block=True)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def setPersistenceMode(
    self, gpu_index: int, enabled: bool, _block: bool = True
) -> None:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

getSupportedMemoryClocks

getSupportedMemoryClocks(gpu_index)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def getSupportedMemoryClocks(self, gpu_index: int) -> list[int]:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

setMemoryLockedClocks

setMemoryLockedClocks(
    gpu_index, min_clock_mhz, max_clock_mhz, _block=True
)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def setMemoryLockedClocks(
    self,
    gpu_index: int,
    min_clock_mhz: int,
    max_clock_mhz: int,
    _block: bool = True,
) -> None:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

resetMemoryLockedClocks

resetMemoryLockedClocks(gpu_index, _block=True)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def resetMemoryLockedClocks(self, gpu_index: int, _block: bool = True) -> None:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

getSupportedGraphicsClocks

getSupportedGraphicsClocks(
    gpu_index, memory_clock_mhz=None
)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def getSupportedGraphicsClocks(
    self, gpu_index: int, memory_clock_mhz: int | None = None
) -> list[int]:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

setGpuLockedClocks

setGpuLockedClocks(
    gpu_index, min_clock_mhz, max_clock_mhz, _block=True
)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def setGpuLockedClocks(
    self,
    gpu_index: int,
    min_clock_mhz: int,
    max_clock_mhz: int,
    _block: bool = True,
) -> None:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

resetGpuLockedClocks

resetGpuLockedClocks(gpu_index, _block=True)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def resetGpuLockedClocks(self, gpu_index: int, _block: bool = True) -> None:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

getInstantPowerUsage

getInstantPowerUsage(gpu_index)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def getInstantPowerUsage(self, gpu_index: int) -> int:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

supportsGetTotalEnergyConsumption

supportsGetTotalEnergyConsumption(gpu_index)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def supportsGetTotalEnergyConsumption(self, gpu_index: int) -> bool:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

getTotalEnergyConsumption

getTotalEnergyConsumption(gpu_index)

Raise a ValueError as no GPUs are available.

Source code in zeus/device/gpu/common.py
def getTotalEnergyConsumption(self, gpu_index: int) -> int:
    """Raise a ValueError as no GPUs are available."""
    raise ValueError("No GPUs available.")

ZeusGPUInvalidArgError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Invalid Argument.

Source code in zeus/device/gpu/common.py
class ZeusGPUInvalidArgError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Invalid Argument."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUNotSupportedError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Not Supported Operation on GPU.

Source code in zeus/device/gpu/common.py
class ZeusGPUNotSupportedError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Not Supported Operation on GPU."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUNoPermissionError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps No Permission to perform GPU operation.

Source code in zeus/device/gpu/common.py
class ZeusGPUNoPermissionError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps No Permission to perform GPU operation."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUAlreadyInitializedError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Already Initialized GPU.

Source code in zeus/device/gpu/common.py
class ZeusGPUAlreadyInitializedError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Already Initialized GPU."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUNotFoundError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Not Found GPU.

Source code in zeus/device/gpu/common.py
class ZeusGPUNotFoundError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Not Found GPU."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUInsufficientSizeError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Insufficient Size.

Source code in zeus/device/gpu/common.py
class ZeusGPUInsufficientSizeError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Insufficient Size."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUInsufficientPowerError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Insufficient Power.

Source code in zeus/device/gpu/common.py
class ZeusGPUInsufficientPowerError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Insufficient Power."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUDriverNotLoadedError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Driver Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUDriverNotLoadedError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Driver Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUTimeoutError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Timeout Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUTimeoutError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Timeout Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUIRQError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps IRQ Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUIRQError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps IRQ Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPULibraryNotFoundError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Library Not Found Error.

Source code in zeus/device/gpu/common.py
class ZeusGPULibraryNotFoundError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Library Not Found Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUFunctionNotFoundError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Function Not Found Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUFunctionNotFoundError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Function Not Found Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUCorruptedInfoROMError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Corrupted Info ROM Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUCorruptedInfoROMError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Corrupted Info ROM Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPULostError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Lost GPU Error.

Source code in zeus/device/gpu/common.py
class ZeusGPULostError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Lost GPU Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUResetRequiredError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Reset Required Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUResetRequiredError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Reset Required Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUOperatingSystemError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Operating System Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUOperatingSystemError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Operating System Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPULibRMVersionMismatchError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps LibRM Version Mismatch Error.

Source code in zeus/device/gpu/common.py
class ZeusGPULibRMVersionMismatchError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps LibRM Version Mismatch Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUMemoryError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Insufficient Memory Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUMemoryError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Insufficient Memory Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUUnknownError

Bases: ZeusBaseGPUError

Zeus GPU exception that wraps Unknown Error.

Source code in zeus/device/gpu/common.py
class ZeusGPUUnknownError(ZeusBaseGPUError):
    """Zeus GPU exception that wraps Unknown Error."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)

ZeusGPUHeterogeneousError

Bases: ZeusBaseGPUError

Exception for when GPUs are not homogeneous.

Source code in zeus/device/gpu/common.py
class ZeusGPUHeterogeneousError(ZeusBaseGPUError):
    """Exception for when GPUs are not homogeneous."""

    def __init__(self, message: str) -> None:
        """Initialize the exception object."""
        super().__init__(message)

__init__

__init__(message)
Source code in zeus/device/gpu/common.py
def __init__(self, message: str) -> None:
    """Initialize the exception object."""
    super().__init__(message)
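
Since every exception listed above derives from ZeusBaseGPUError, a caller can handle specific failures first and fall back to the base class for the rest. The following is a minimal sketch under stated assumptions: the classes are stand-ins that mirror the hierarchy shown above so the example runs without Zeus installed, and `query_power_limit` and `describe` are hypothetical helpers, not part of Zeus.

```python
class ZeusBaseGPUError(Exception):
    """Stand-in for the Zeus base GPU exception (assumption, not the real class)."""

    def __init__(self, message: str) -> None:
        super().__init__(message)


class ZeusGPULostError(ZeusBaseGPUError):
    """Stand-in mirroring the Lost GPU exception."""


class ZeusGPUUnknownError(ZeusBaseGPUError):
    """Stand-in mirroring the Unknown Error exception."""


def query_power_limit(gpu_index: int) -> int:
    """Hypothetical helper that fails the way a GPU management call might."""
    if gpu_index == 1:
        raise ZeusGPULostError(f"GPU {gpu_index} is lost")
    if gpu_index == 2:
        raise ZeusGPUUnknownError(f"GPU {gpu_index}: unknown failure")
    return 300_000


def describe(gpu_index: int) -> str:
    """Handle the most specific exception first, then the shared base class."""
    try:
        return f"limit={query_power_limit(gpu_index)}"
    except ZeusGPULostError as err:
        return f"lost: {err}"
    except ZeusBaseGPUError as err:
        return f"gpu error: {err}"
```

Ordering matters here: placing the `ZeusBaseGPUError` handler first would shadow the more specific one.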

get_logger

get_logger(name, level=logging.INFO, propagate=False)

Get a logger with the given name with some formatting configs.

Source code in zeus/utils/logging.py
def get_logger(
    name: str,
    level: int = logging.INFO,
    propagate: bool = False,
) -> logging.Logger:
    """Get a logger with the given name with some formatting configs."""
    if name in logging.Logger.manager.loggerDict:
        return logging.getLogger(name)

    logger = logging.getLogger(name)
    logger.propagate = propagate
    logger.setLevel(os.environ.get("ZEUS_LOG_LEVEL", level))
    formatter = logging.Formatter(
        "[%(asctime)s] [%(name)s](%(filename)s:%(lineno)d) %(message)s"
    )
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger
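
The listing passes either the `ZEUS_LOG_LEVEL` environment variable (a string) or the integer default straight to `Logger.setLevel`, which accepts both integer levels and upper-case level names. The sketch below illustrates that behavior with the standard library only; `resolve_level` and the logger name are hypothetical and exist just for this example.

```python
import logging
import os


def resolve_level(default: int = logging.INFO) -> int:
    """Return the numeric level that results from the env-var-or-default lookup."""
    probe = logging.getLogger("zeus.example.probe")  # hypothetical logger name
    # Same expression as in the listing: a level name string or an int default.
    probe.setLevel(os.environ.get("ZEUS_LOG_LEVEL", default))
    return probe.level


# Without the variable, the integer default is used.
os.environ.pop("ZEUS_LOG_LEVEL", None)
level_default = resolve_level()

# With the variable set to a level name, the name is translated to its number.
os.environ["ZEUS_LOG_LEVEL"] = "DEBUG"
level_debug = resolve_level()
os.environ.pop("ZEUS_LOG_LEVEL", None)

# After the first getLogger call the name is registered, which is the
# condition the listing checks before returning early.
already_registered = "zeus.example.probe" in logging.Logger.manager.loggerDict
```

Note that `setLevel` raises `ValueError` for a string that is not a registered level name, so a value such as `debug` (lower case) would not be accepted.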

has_sys_admin cached

has_sys_admin()

Check if the current process has SYS_ADMIN capabilities.

Source code in zeus/device/common.py
@lru_cache(maxsize=1)
def has_sys_admin() -> bool:
    """Check if the current process has `SYS_ADMIN` capabilities."""
    # First try to read procfs.
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("CapEff"):
                    bitmask = int(line.strip().split()[1], 16)
                    has = bool(bitmask & (1 << 21))
                    logger.info(
                        "Read security capabilities from /proc/self/status -- SYS_ADMIN: %s",
                        has,
                    )
                    return has
    except Exception:
        logger.info("Failed to read capabilities from /proc/self/status", exc_info=True)

    # If that fails, try to use the capget syscall.
    class CapHeader(ctypes.Structure):
        _fields_ = [("version", ctypes.c_uint32), ("pid", ctypes.c_int)]

    class CapData(ctypes.Structure):
        _fields_ = [
            ("effective", ctypes.c_uint32),
            ("permitted", ctypes.c_uint32),
            ("inheritable", ctypes.c_uint32),
        ]

    # Attempt to load libc and set up capget
    try:
        libc = ctypes.CDLL("libc.so.6")
        capget = libc.capget
        capget.argtypes = [ctypes.POINTER(CapHeader), ctypes.POINTER(CapData)]
        capget.restype = ctypes.c_int
    except Exception:
        logger.info("Failed to load libc.so.6", exc_info=True)
        return False

    # Initialize the header and data structures
    header = CapHeader(version=0x20080522, pid=0)  # Use the current process
    data = CapData()

    # Call capget and check for errors
    if capget(ctypes.byref(header), ctypes.byref(data)) != 0:
        errno = ctypes.get_errno()
        logger.info(
            "capget failed with error: %s (errno %s)", os.strerror(errno), errno
        )
        return False

    bitmask = data.effective
    has = bool(bitmask & (1 << 21))
    logger.info("Read security capabilities from capget -- SYS_ADMIN: %s", has)
    return has
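
Both code paths above test bit 21 of the effective capability mask, which is the position of `CAP_SYS_ADMIN` on Linux. The sketch below isolates the procfs parsing step so it can be tried on sample text; `cap_eff_has_sys_admin` is a hypothetical helper and the mask values are illustrative samples rather than readings from a real process.

```python
CAP_SYS_ADMIN_BIT = 21  # bit position of CAP_SYS_ADMIN in the capability mask


def cap_eff_has_sys_admin(status_text: str) -> bool:
    """Check the CapEff line of /proc/<pid>/status-style text for CAP_SYS_ADMIN."""
    for line in status_text.splitlines():
        if line.startswith("CapEff"):
            # The second field is the effective capability set in hexadecimal.
            bitmask = int(line.split()[1], 16)
            return bool(bitmask & (1 << CAP_SYS_ADMIN_BIT))
    # Unlike the listing, this sketch has no capget fallback.
    return False


full_caps = "Name:\tpython\nCapEff:\t000001ffffffffff\n"
reduced_caps = "Name:\tpython\nCapEff:\t00000000a80425fb\n"
```

Here `cap_eff_has_sys_admin(full_caps)` is true because bit 21 is set in the sample mask, while the reduced sample mask leaves it clear.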

get_gpus

get_gpus(ensure_homogeneous=False)

Initialize and return a singleton object for GPU management.

This function returns a GPU management object that aims to abstract the underlying GPU vendor and their specific monitoring library (pynvml for NVIDIA GPUs and amdsmi for AMD GPUs). Management APIs are mapped to methods on the returned GPUs object.

GPU availability is checked in the following order:

  1. NVIDIA GPUs using pynvml
  2. AMD GPUs using amdsmi
  3. If both are unavailable, a ZeusGPUInitError is raised.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ensure_homogeneous | bool | If True, ensures that all tracked GPUs have the same name. | False |

Source code in zeus/device/gpu/__init__.py
def get_gpus(ensure_homogeneous: bool = False) -> GPUs:
    """Initialize and return a singleton object for GPU management.

    This function returns a GPU management object that aims to abstract
    the underlying GPU vendor and their specific monitoring library
    (pynvml for NVIDIA GPUs and amdsmi for AMD GPUs). Management APIs
    are mapped to methods on the returned [`GPUs`][zeus.device.gpu.GPUs] object.

    GPU availability is checked in the following order:

    1. NVIDIA GPUs using `pynvml`
    1. AMD GPUs using `amdsmi`
    1. If both are unavailable, a `ZeusGPUInitError` is raised.

    Args:
        ensure_homogeneous (bool): If True, ensures that all tracked GPUs have the same name.
    """
    global _gpus
    if _gpus is not None:
        return _gpus

    if nvml_is_available():
        _gpus = NVIDIAGPUs(ensure_homogeneous)
        return _gpus
    elif amdsmi_is_available():
        _gpus = AMDGPUs(ensure_homogeneous)
        return _gpus
    else:
        raise ZeusGPUInitError(
            "NVML and AMDSMI unavailable. Failed to initialize GPU management library."
        )
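
One consequence of the caching in the listing above is that the arguments of the first call win: once `_gpus` is set, later calls return the cached object without looking at `ensure_homogeneous`. The sketch below reproduces only that module-level singleton pattern; `FakeGPUs` and `get_fake_gpus` are hypothetical stand-ins so the example runs without a GPU or Zeus installed.

```python
from typing import Optional


class FakeGPUs:
    """Hypothetical stand-in for a GPU manager such as NVIDIAGPUs or AMDGPUs."""

    def __init__(self, ensure_homogeneous: bool = False) -> None:
        self.ensure_homogeneous = ensure_homogeneous


# Module-level cache, analogous to `_gpus` in the listing.
_fake_gpus: Optional[FakeGPUs] = None


def get_fake_gpus(ensure_homogeneous: bool = False) -> FakeGPUs:
    """Create the manager on first use and return the cached one afterwards."""
    global _fake_gpus
    if _fake_gpus is not None:
        # Cached: the argument of this call is not consulted.
        return _fake_gpus
    _fake_gpus = FakeGPUs(ensure_homogeneous)
    return _fake_gpus


first = get_fake_gpus()
second = get_fake_gpus(ensure_homogeneous=True)
```

In this sketch `first` and `second` are the same object and `second.ensure_homogeneous` is still `False`, so a caller that depends on the homogeneity check would want to request it on the very first call.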