metric

zeus.util.metric

Defines the energy-time cost metric function.

ZeusCostThresholdExceededError

Bases: Exception

Raised when the predicted cost of the next epoch exceeds the cost threshold.

This exception is used to terminate all processes during data parallel training with multiple processes. Only the master process predicts next_cost and checks it against the threshold, but once the predicted cost exceeds the threshold, all processes must stop. Currently this is achieved by raising an exception in the master process; the launching script then terminates all processes that are still alive.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| time_consumed | float | Time consumed until the current epoch. |
| energy_consumed | float | Energy consumed until the current epoch. |
| cost | float | Computed Zeus's energy-time cost metric until the current epoch. |
| next_cost | float | Predicted Zeus's energy-time cost metric after next epoch. |
| cost_thresh | float | The cost threshold. |

Source code in zeus/util/metric.py
class ZeusCostThresholdExceededError(Exception):
    """Raised when the predicted cost of the next epoch exceeds the cost threshold.

    This exception is used for terminating all the processes when doing data
    parallel training with multiple processes, because ONLY the master
    process will predict `next_cost` and do the threshold checking. However,
    once the predicted cost exceeds the threshold, we want to terminate ALL
    the processes. Currently this is achieved by throwing an exception at the
    master process. The launching script will terminate all the processes that
    are still alive.

    Attributes:
        time_consumed (float): Time consumed until the current epoch.
        energy_consumed (float): Energy consumed until the current epoch.
        cost (float): Computed Zeus's energy-time cost metric until the current epoch.
        next_cost (float): Predicted Zeus's energy-time cost metric after next epoch.
        cost_thresh (float): The cost threshold.
    """

    def __init__(
        self,
        time_consumed: float,
        energy_consumed: float,
        cost: float,
        next_cost: float,
        cost_thresh: float,
    ) -> None:
        """Initialize the exception."""
        msg = (
            f"Next expected cost {next_cost:.2f} exceeds cost threshold {cost_thresh:.2f}! "
            f"Stopping. Saved training results: time={time_consumed:.2f}, "
            f"energy={energy_consumed:.2f}, cost={cost:.2f}, reached=false"
        )
        super().__init__(msg)
        self.time_consumed = time_consumed
        self.energy_consumed = energy_consumed
        self.cost = cost
        self.next_cost = next_cost
        self.cost_thresh = cost_thresh

__init__

__init__(
    time_consumed,
    energy_consumed,
    cost,
    next_cost,
    cost_thresh,
)
Source code in zeus/util/metric.py
def __init__(
    self,
    time_consumed: float,
    energy_consumed: float,
    cost: float,
    next_cost: float,
    cost_thresh: float,
) -> None:
    """Initialize the exception."""
    msg = (
        f"Next expected cost {next_cost:.2f} exceeds cost threshold {cost_thresh:.2f}! "
        f"Stopping. Saved training results: time={time_consumed:.2f}, "
        f"energy={energy_consumed:.2f}, cost={cost:.2f}, reached=false"
    )
    super().__init__(msg)
    self.time_consumed = time_consumed
    self.energy_consumed = energy_consumed
    self.cost = cost
    self.next_cost = next_cost
    self.cost_thresh = cost_thresh
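
The attributes stored on the exception let a caller recover the training statistics at the point of early stopping. A minimal sketch of that pattern follows, under stated assumptions: the exception class is a local stand-in that mirrors the one documented above so the snippet runs without Zeus installed, and `check_threshold` and `run_epoch_check` are hypothetical helpers, not part of Zeus.

```python
class ZeusCostThresholdExceededError(Exception):
    """Local stand-in mirroring the documented class (not imported from zeus)."""

    def __init__(self, time_consumed, energy_consumed, cost, next_cost, cost_thresh):
        super().__init__(
            f"Next expected cost {next_cost:.2f} exceeds cost threshold {cost_thresh:.2f}!"
        )
        self.time_consumed = time_consumed
        self.energy_consumed = energy_consumed
        self.cost = cost
        self.next_cost = next_cost
        self.cost_thresh = cost_thresh


def check_threshold(time_consumed, energy_consumed, cost, next_cost, cost_thresh):
    """Hypothetical master-side check: raise if the predicted cost is over the threshold."""
    if next_cost > cost_thresh:
        raise ZeusCostThresholdExceededError(
            time_consumed, energy_consumed, cost, next_cost, cost_thresh
        )


def run_epoch_check(time_consumed, energy_consumed, cost, next_cost, cost_thresh):
    """Hypothetical wrapper: return a summary dict on early stop, or None to keep training."""
    try:
        check_threshold(time_consumed, energy_consumed, cost, next_cost, cost_thresh)
    except ZeusCostThresholdExceededError as err:
        # The exception carries the statistics, so no extra state is needed here.
        return {
            "time": err.time_consumed,
            "energy": err.energy_consumed,
            "cost": err.cost,
            "next_cost": err.next_cost,
            "cost_thresh": err.cost_thresh,
        }
    return None


# Illustrative numbers only: the predicted cost 45000 exceeds the threshold 40000.
summary = run_epoch_check(120.0, 30_000.0, 33_000.0, 45_000.0, 40_000.0)
print(summary)
```

In a real multi-process job the exception would typically be left uncaught in the master process so that the launching script can terminate the remaining processes, as described above; catching it here only serves to show which fields are available.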

zeus_cost

zeus_cost(energy, time, eta_knob, max_power)

Compute Zeus's energy-time cost metric.

Trades off ETA (energy to accuracy) and TTA (time to accuracy) based on the value of eta_knob. The caller is expected to do bounds checking for eta_knob: because eta_knob does not change frequently, the function does not validate it on every call.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| energy | float | Joules | required |
| time | float | seconds | required |
| eta_knob | float | Real number in [0, 1]. | required |
| max_power | int | The maximum power limit of the GPU. | required |

Returns:

| Type | Description |
| --- | --- |
| float | The cost of the DL training job. |

Source code in zeus/util/metric.py
def zeus_cost(energy: float, time: float, eta_knob: float, max_power: int) -> float:
    """Compute Zeus's energy-time cost metric.

    Trades off ETA and TTA based on the value of `eta_knob`.
    The caller is expected to do bound checking for `eta_knob`,
    because `eta_knob` does not change frequently.

    Args:
        energy: Joules
        time: seconds
        eta_knob: Real number in [0, 1].
        max_power: The maximum power limit of the GPU.

    Returns:
        The cost of the DL training job.
    """
    return eta_knob * energy + (1 - eta_knob) * max_power * time
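
To make the trade-off concrete, the sketch below evaluates the formula at the two extremes of `eta_knob` and at the midpoint. It copies the one-line formula from the listing above so it runs without Zeus installed. The input numbers are illustrative only, `validate_eta_knob` is a hypothetical caller-side helper, and treating `max_power` as watts is an assumption (it makes both terms come out in joules).

```python
def zeus_cost(energy: float, time: float, eta_knob: float, max_power: int) -> float:
    """Copy of the documented formula, reproduced here to keep the sketch self-contained."""
    return eta_knob * energy + (1 - eta_knob) * max_power * time


def validate_eta_knob(eta_knob: float) -> float:
    """Hypothetical caller-side bounds check, done once rather than on every cost call."""
    if not 0.0 <= eta_knob <= 1.0:
        raise ValueError(f"eta_knob must be in [0, 1], got {eta_knob}")
    return eta_knob


# Illustrative inputs: 30 kJ consumed over 120 s, with an assumed 300 W power limit.
energy, time, max_power = 30_000.0, 120.0, 300

# eta_knob = 1 counts only energy; eta_knob = 0 counts only time, scaled by max_power.
energy_only = zeus_cost(energy, time, validate_eta_knob(1.0), max_power)
time_only = zeus_cost(energy, time, validate_eta_knob(0.0), max_power)
balanced = zeus_cost(energy, time, validate_eta_knob(0.5), max_power)

print(energy_only, time_only, balanced)  # 30000.0 36000.0 33000.0
```

With `eta_knob = 0.5` the result is the average of the energy term (30000) and the time term (300 × 120 = 36000), which shows how the knob linearly interpolates between the two objectives.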