An Energy Optimization Framework for DNN Training
Join the Zeus Slack workspace!
Zeus has surpassed 100k pulls on Docker Hub! The Zeus team is always happy to chat with users and help out. Reach out to us in the Zeus Slack workspace.
Zeus automatically optimizes the energy and time of recurring DNN training jobs by finding the optimal batch size and GPU power limit.
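To make the idea concrete, here is a minimal sketch of choosing a batch size and GPU power limit by trading off energy against time. It is not Zeus's API or its actual algorithm: the `Measurement` class, the `cost` and `best_config` functions, the `eta` weight, and the weighting scheme are all hypothetical names and assumptions made for illustration. The sketch also picks from pre-recorded measurements, which may differ from how Zeus explores configurations across recurrences; see the paper for the real formulation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Measurement:
    """One observed training run (hypothetical record, not a Zeus type)."""
    batch_size: int
    power_limit_w: int   # GPU power limit in watts
    energy_j: float      # energy consumed to reach the target metric, in joules
    time_s: float        # time taken to reach the target metric, in seconds


def cost(m: Measurement, eta: float, max_power_w: float) -> float:
    """Weighted energy-time cost (an assumed form for this sketch).

    eta = 1.0 optimizes purely for energy, eta = 0.0 purely for time.
    Time is scaled by max_power_w so both terms are expressed in joules.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must be in [0, 1]")
    return eta * m.energy_j + (1.0 - eta) * max_power_w * m.time_s


def best_config(
    measurements: Iterable[Measurement], eta: float, max_power_w: float
) -> Measurement:
    """Return the measured configuration with the lowest cost."""
    return min(measurements, key=lambda m: cost(m, eta, max_power_w))


if __name__ == "__main__":
    # Made-up numbers, purely illustrative.
    runs = [
        Measurement(32, 150, 9.0e5, 7200.0),
        Measurement(32, 250, 1.1e6, 5400.0),
        Measurement(64, 150, 8.0e5, 6600.0),
        Measurement(64, 250, 1.0e6, 4800.0),
    ]
    chosen = best_config(runs, eta=0.5, max_power_w=250.0)
    print(chosen.batch_size, chosen.power_limit_w)
```

Sweeping `eta` between 0 and 1 in this sketch shows how the preferred configuration shifts from the fastest run to the most energy-efficient one.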
Please refer to our NSDI’23 paper and slides for details. Check out Overview for a summary.
Zeus is part of The ML.ENERGY Initiative.
Refer to Getting Started for instructions on environment setup, installation, and integration. We also provide integration examples:
- Integrating Zeus with Computer Vision
- Integrating Zeus with Natural Language Processing and Hugging Face
- Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace
You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.
Refer to Extending Zeus for details.