Extending Zeus

Warning

Content in this page pertains to Zeus when it was a research artifact of our NSDI paper. We will soon refactor the simulator and replace this page with something along the lines of "How to reproduce our research results." Track this issue here.

Users can implement custom policies that optimize the batch size and the power limit, and plug them into Zeus.

Interfaces

Zeus defines two abstract classes, BatchSizeOptimizer and PowerLimitOptimizer, in zeus.policy.interface. They optimize the batch size and the power limit of a recurring training job, respectively. As in our paper, the batch size optimizer is invoked first to decide which batch size to use. The power limit optimizer is then invoked with both the job and the chosen batch size to decide which power limit to use.

You can find examples of policy implementations in zeus.policy.optimizer.
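The two-stage invocation order described above can be illustrated with a minimal, self-contained sketch. It does not import Zeus: the two base classes below are stand-ins that only share their names with the classes in zeus.policy.interface, and the `predict` method name, its arguments, the toy policies, and the `decide` helper are all assumptions made for illustration. Check the actual interface for the methods a real policy must implement.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class BatchSizeOptimizer(ABC):
    """Stand-in for the abstract class of the same name (method name assumed)."""

    @abstractmethod
    def predict(self, job: str) -> int:
        """Return the batch size to use for the next run of `job`."""


class PowerLimitOptimizer(ABC):
    """Stand-in for the abstract class of the same name (method name assumed)."""

    @abstractmethod
    def predict(self, job: str, batch_size: int) -> int:
        """Return the power limit in watts for `job` at `batch_size`."""


class FixedBatchSize(BatchSizeOptimizer):
    """Toy policy: always choose the same batch size."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size

    def predict(self, job: str) -> int:
        return self.batch_size


class TablePowerLimit(PowerLimitOptimizer):
    """Toy policy: look up the power limit by batch size, else use a default."""

    def __init__(self, table: Dict[int, int], default: int) -> None:
        self.table = table
        self.default = default

    def predict(self, job: str, batch_size: int) -> int:
        return self.table.get(batch_size, self.default)


def decide(
    job: str,
    batch_size_optimizer: BatchSizeOptimizer,
    power_limit_optimizer: PowerLimitOptimizer,
) -> Tuple[int, int]:
    """Hypothetical helper: choose the batch size first, then the power limit."""
    batch_size = batch_size_optimizer.predict(job)
    # The power limit optimizer sees both the job and the chosen batch size.
    power_limit = power_limit_optimizer.predict(job, batch_size)
    return batch_size, power_limit
```

Because the power limit policy receives the chosen batch size, it can make a different decision for each batch size, which is what the lookup table in the toy policy stands for.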

Plugging it into Zeus

There are two ways to run Zeus: trace-driven and end-to-end.

Trace-driven Zeus

The Zeus simulator (Simulator) accepts one BatchSizeOptimizer and one PowerLimitOptimizer in its constructor. A full example can be found in examples/trace_driven.
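To make the constructor-injection pattern concrete, here is a toy sketch of a trace-driven loop. It is not the Zeus Simulator: `ToySimulator`, its `simulate` method, the trace layout, the two policy classes, and all numbers are hypothetical, and the policies use duck typing instead of subclassing the real abstract classes. Refer to examples/trace_driven for the actual usage.

```python
from typing import Dict, List, Tuple

# Assumed trace layout: (batch_size, power_limit) -> (time in s, energy in J).
Trace = Dict[Tuple[int, int], Tuple[float, float]]


class AlwaysBatchSize:
    """Toy batch size policy that always returns the same value."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size

    def predict(self, job: str) -> int:
        return self.batch_size


class AlwaysPowerLimit:
    """Toy power limit policy that always returns the same value."""

    def __init__(self, power_limit: int) -> None:
        self.power_limit = power_limit

    def predict(self, job: str, batch_size: int) -> int:
        return self.power_limit


class ToySimulator:
    """Replays recorded measurements instead of running real training."""

    def __init__(self, batch_size_optimizer, power_limit_optimizer, trace: Trace) -> None:
        self.bso = batch_size_optimizer
        self.plo = power_limit_optimizer
        self.trace = trace

    def simulate(self, job: str, num_recurrence: int) -> List[dict]:
        records = []
        for _ in range(num_recurrence):
            batch_size = self.bso.predict(job)
            power_limit = self.plo.predict(job, batch_size)
            # Look up the recorded outcome for this configuration.
            time_s, energy_j = self.trace[(batch_size, power_limit)]
            records.append(
                {
                    "batch_size": batch_size,
                    "power_limit": power_limit,
                    "time_s": time_s,
                    "energy_j": energy_j,
                }
            )
        return records


# Made-up trace with a single configuration.
trace = {(64, 200): (120.0, 21000.0)}
simulator = ToySimulator(AlwaysBatchSize(64), AlwaysPowerLimit(200), trace)
history = simulator.simulate("toy-job", num_recurrence=3)
```

Passing both policies through the constructor keeps the replay loop independent of any particular policy, so swapping in a custom optimizer requires no change to the simulator itself.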

End-to-end Zeus

There are two central components in end-to-end Zeus: ZeusMaster and ZeusDataLoader. The former takes charge of driving the entire optimization over recurring jobs, and accepts an instance of BatchSizeOptimizer in its constructor. The latter takes charge of JIT-profiling power in the background, determining the optimal power limit, and setting it. Hence, the functionality of JITPowerLimitOptimizer is already tightly integrated into ZeusDataLoader. Users will have to implement their own ZeusDataLoader in order to test another PowerLimitOptimizer policy.
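The profile-then-choose step that a custom data loader would need to reproduce can be sketched as a pure function. This is only an illustration under stated assumptions, not the logic inside ZeusDataLoader: the function name `pick_power_limit`, the `measure` callback, the knob `eta`, and the cost formula (a weighted mix of average power and the maximum power limit, divided by throughput) are assumptions, and the measurements are made up. Actually applying the chosen limit to the GPU is out of scope here.

```python
from typing import Callable, Sequence, Tuple


def pick_power_limit(
    power_limits: Sequence[int],
    measure: Callable[[int], Tuple[float, float]],
    eta: float = 0.5,
) -> int:
    """Return the candidate power limit with the lowest assumed cost.

    `measure(power_limit)` is a hypothetical callback that briefly runs
    training under that limit and returns (average power in W, throughput
    in samples/s). `eta` weighs energy (1.0) against time (0.0).
    """
    max_power = max(power_limits)
    best_limit = power_limits[0]
    best_cost = float("inf")
    for limit in power_limits:
        avg_power, throughput = measure(limit)
        # Assumed cost per sample: energy term plus time term.
        cost = (eta * avg_power + (1.0 - eta) * max_power) / throughput
        if cost < best_cost:
            best_limit, best_cost = limit, cost
    return best_limit


# Made-up profiling results: power limit -> (average power, throughput).
fake_profile = {300: (290.0, 100.0), 250: (245.0, 97.0), 200: (198.0, 85.0)}
chosen = pick_power_limit([300, 250, 200], lambda limit: fake_profile[limit])
```

With `eta` close to 0 the sketch favors the fastest configuration, and with `eta` close to 1 it favors the one that draws the least energy per sample. A different PowerLimitOptimizer policy would replace this selection rule inside the custom data loader.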