Getting Started

This page covers the most common setup steps. Some optimizers and examples require extra setup, which is described in their own documentation.

Installing the Python package

From PyPI

Install the Zeus Python package with:

pip install zeus-ml
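
After installing, it can be handy to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch, not part of Zeus: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the PyPI name `zeus-ml` used in the command above.

```python
from importlib import metadata


def installed_version(dist_name="zeus-ml"):
    """Return the installed version string of a distribution, or None if absent.

    `dist_name` defaults to the PyPI name from the install command above.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("zeus-ml is not installed in this environment")
    else:
        print(f"zeus-ml {version} is installed")
```

Running this with the same Python executable you ran `pip` with helps catch the common case where the package landed in a different environment.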

From source for development

You can also install Zeus from source by cloning our GitHub repository. For development, do an editable installation with the extra dev dependencies:

git clone https://github.com/ml-energy/zeus.git
cd zeus
pip install -e '.[dev]'

Using Docker

Dependencies

You should have the following already installed on your system:

Our Docker image should suit most use cases for Zeus. On top of the nvidia/cuda:11.8.0-base-ubuntu22.04 image, we add:

  • Miniconda 3, PyTorch, and Torchvision
  • A copy of the Zeus repo in /workspace/zeus
docker/Dockerfile
# Copyright (C) 2023 Jae-Won Chung <jwnchung@umich.edu>
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Build instructions
#   If you're building this image locally, make sure you specify `TARGETARCH`.
#   Currently, this image supports `amd64` and `arm64`. For instance:
#     docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 .

FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# Basic installs
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ='America/Detroit'
RUN apt-get update -qq \
    && apt-get -y --no-install-recommends install \
       build-essential software-properties-common wget git tar rsync cmake \
    && apt-get clean all \
    && rm -r /var/lib/apt/lists/*

# Install Miniconda3 23.3.1
ENV PATH="/root/.local/miniconda3/bin:$PATH"
ARG TARGETARCH
RUN if [ "$TARGETARCH" = "amd64" ]; then \
      export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-x86_64.sh"; \
    elif [ "$TARGETARCH" = "arm64" ]; then \
      export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-aarch64.sh"; \
    else \
      echo "Unsupported architecture ${TARGETARCH}" && exit 1; \
    fi \
    && mkdir -p /root/.local \
    && wget "https://repo.anaconda.com/miniconda/$CONDA_INSTALLER_PATH" \
    && mkdir /root/.conda \
    && bash "$CONDA_INSTALLER_PATH" -b -p /root/.local/miniconda3 \
    && rm -f "$CONDA_INSTALLER_PATH" \
    && ln -sf /root/.local/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

# Install PyTorch and CUDA Toolkit
RUN pip install --no-cache-dir torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118

# Place stuff under /workspace
WORKDIR /workspace

# Snapshot of Zeus
ADD . /workspace/zeus

# When an outside zeus directory is mounted, have it apply immediately.
RUN cd /workspace/zeus && pip install --no-cache-dir -e .

The default command to start a container is:

docker run -it \
    --gpus all \
    --cap-add SYS_ADMIN \
    --ipc host \
    mlenergy/zeus:latest \
    bash

  1. --gpus all: Mounts all GPUs into the Docker container.
  2. --cap-add SYS_ADMIN: The SYS_ADMIN capability is needed to change the GPU's power limit or frequency. See the System privileges section on this page.
  3. --ipc host: PyTorch DataLoader workers need enough shared memory for IPC. Without this, they may run out of shared memory and die.

Overriding Zeus installation

Inside the container, Zeus is installed in editable mode (pip install -e). You can therefore mount your locally modified Zeus repository at the matching path in the container (-v /path/to/zeus:/workspace/zeus), and your modifications take effect without running pip install again.
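
If you launch containers from a script, the flags above and the optional repository mount can be assembled programmatically. This is only an illustrative sketch: the function `docker_run_command` is a hypothetical helper, not something Zeus ships, and it just reproduces the `docker run` flags already shown on this page.

```python
import shlex


def docker_run_command(image="mlenergy/zeus:latest", zeus_repo=None):
    """Build the `docker run` argument list described on this page.

    If `zeus_repo` is given, that host path is mounted over /workspace/zeus
    so the editable install inside the container picks up local changes.
    """
    cmd = [
        "docker", "run", "-it",
        "--gpus", "all",            # expose all GPUs to the container
        "--cap-add", "SYS_ADMIN",   # needed to change power limit or frequency
        "--ipc", "host",            # shared memory for DataLoader workers
    ]
    if zeus_repo is not None:
        cmd += ["-v", f"{zeus_repo}:/workspace/zeus"]
    cmd += [image, "bash"]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command line; pass the list to subprocess.run to execute it.
    print(shlex.join(docker_run_command(zeus_repo="/path/to/zeus")))
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting concerns.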

Pulling from Docker Hub

Pre-built images are hosted on Docker Hub. There are three types of images available:

  • latest: The latest versioned release.
  • v*: Each versioned release.
  • master: The HEAD commit of Zeus. Usually stable enough, and you will get all the new features.

Building the image locally

Set TARGETARCH to either amd64 or arm64, depending on your environment:

git clone https://github.com/ml-energy/zeus.git
cd zeus
docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 -f docker/Dockerfile .
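
If you are unsure which value to pass, the machine name reported by the operating system can be mapped to a TARGETARCH value. The sketch below is an assumption-laden helper (the name `target_arch` and the mapping table are ours, not part of Zeus); it covers only the two architectures the Dockerfile supports.

```python
import platform

# Common `platform.machine()` values mapped to the TARGETARCH names
# accepted by the Dockerfile (amd64 and arm64 only).
_MACHINE_TO_TARGETARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def target_arch(machine=None):
    """Return the TARGETARCH value for `machine` (default: this host)."""
    name = (machine or platform.machine()).lower()
    try:
        return _MACHINE_TO_TARGETARCH[name]
    except KeyError:
        raise ValueError(f"Unsupported architecture {name!r}") from None


if __name__ == "__main__":
    try:
        print(f"--build-arg TARGETARCH={target_arch()}")
    except ValueError as err:
        print(err)
```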

System privileges

Never mind if you're just measuring GPU energy

No special system-level privileges are needed if you are only measuring GPU time and energy. However, some energy optimization methods change the GPU's power limit or SM frequency, and those do require special system-level privileges.

When are extra system privileges needed?

The Linux capability SYS_ADMIN is required in order to change the GPU's power limit or frequency. Specifically, this is needed by the GlobalPowerLimitOptimizer and the PipelineFrequencyOptimizer.
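
To see whether a process already holds this capability on Linux, you can inspect the effective capability mask in /proc/self/status. The following is a minimal sketch rather than anything Zeus provides: the function names are hypothetical, and it relies on CAP_SYS_ADMIN being capability number 21 in the Linux kernel headers.

```python
CAP_SYS_ADMIN = 21  # capability number from <linux/capability.h>


def parse_cap_eff(status_text):
    """Extract the effective capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_sys_admin(status_text):
    """Return True if the CAP_SYS_ADMIN bit is set in the effective mask."""
    return bool((parse_cap_eff(status_text) >> CAP_SYS_ADMIN) & 1)


def current_process_has_sys_admin(status_path="/proc/self/status"):
    """Check the calling process (Linux only)."""
    with open(status_path) as f:
        return has_sys_admin(f.read())


if __name__ == "__main__":
    try:
        print("SYS_ADMIN available:", current_process_has_sys_admin())
    except (OSError, ValueError) as err:
        print("Could not determine capabilities:", err)
```

Running this inside a container started with and without --cap-add SYS_ADMIN is a quick way to confirm the flag took effect.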

Option 1: Running applications in a Docker container

With Docker, you can pass --cap-add SYS_ADMIN to docker run. Since this significantly simplifies running Zeus, we recommend considering this option first. The same is possible for Kubernetes Pods by setting securityContext.capabilities.add in the container spec (see the Kubernetes documentation).

Option 2: Deploying the Zeus daemon (zeusd)

Granting SYS_ADMIN to an entire application just so that it can change the GPU's configuration grants far more privilege than necessary. Instead, Zeus provides the Zeus daemon, zeusd: a simple server process designed to run with admin privileges that exposes a minimal set of APIs wrapping the NVML methods for changing the GPU's configuration. An unprivileged application (i.e., one run normally by any user) can then ask zeusd over a Unix domain socket to change the local node's GPU configuration on its behalf.

To deploy zeusd:

# Install zeusd
cargo install zeusd

# Run zeusd with admin privileges
sudo zeusd \
    --socket-path /var/run/zeusd.sock \
    --socket-permissions 666

  1. --socket-path: The Unix domain socket path that zeusd listens on.
  2. --socket-permissions: Applications need write access to the socket to be able to talk to zeusd. This string is interpreted as UNIX file permissions.
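
To make the permission string concrete, the sketch below interprets it as octal UNIX permissions and checks who gets write access, then checks whether the current process can write to a socket path. Both helpers are hypothetical illustrations, not part of zeusd or Zeus; the default path is simply the one from the command above.

```python
import os
import stat


def everyone_can_write(permissions):
    """Return True if an octal permission string grants write access
    to the owner, the group, and all other users."""
    mode = int(permissions, 8)
    write_bits = (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH)
    return all(mode & bit for bit in write_bits)


def can_write_to_socket(socket_path="/var/run/zeusd.sock"):
    """Return True if `socket_path` is a Unix socket writable by this process."""
    try:
        mode = os.stat(socket_path).st_mode
    except OSError:
        return False  # missing path or inaccessible parent directory
    return stat.S_ISSOCK(mode) and os.access(socket_path, os.W_OK)


if __name__ == "__main__":
    print("666 lets every user write:", everyone_can_write("666"))
    print("660 lets every user write:", everyone_can_write("660"))
    print("zeusd socket usable:", can_write_to_socket())
```

A tighter setting such as 660 restricts access to the socket's owner and group, which may be preferable on shared machines if the applications run under a common group.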

Option 3: Running applications with sudo

This is probably the worst option. However, if none of the options above work, you can run your application with sudo, since a process run as root automatically has SYS_ADMIN.

Next Steps