
Getting Started

This page describes the most common setup steps. Some optimizers and examples require extra setup, which is described in their own documentation.

Installing the Python package

From PyPI

Install the Zeus Python package with:

pip install zeus-ml
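
To confirm that the installation succeeded, you can query the installed distribution's metadata. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of Zeus.

```python
from importlib import metadata


def installed_version(dist_name: str = "zeus-ml"):
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"zeus-ml {version}" if version else "zeus-ml is not installed")
```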

From source for development

You can also install Zeus from source by cloning our GitHub repository. For development, do an editable installation with the extra dev dependencies:

git clone https://github.com/ml-energy/zeus.git
cd zeus
pip install -e '.[dev]'

Using Docker

Dependencies

You should have the following already installed on your system:

Our Docker image should suit most of the use cases for Zeus. On top of the nvidia/cuda:11.8.0-base-ubuntu22.04 image, we add:

  • Miniconda 3, PyTorch, and Torchvision
  • A copy of the Zeus repo in /workspace/zeus
docker/Dockerfile
# Copyright (C) 2023 Jae-Won Chung <jwnchung@umich.edu>
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Build instructions
#   If you're building this image locally, make sure you specify `TARGETARCH`.
#   Currently, this image supports `amd64` and `arm64`. For instance:
#     docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 .

FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# Basic installs
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ='America/Detroit'
RUN apt-get update -qq \
    && apt-get -y --no-install-recommends install \
       build-essential software-properties-common wget git tar rsync cmake \
    && apt-get clean all \
    && rm -r /var/lib/apt/lists/*

# Install Miniconda3 23.3.1
ENV PATH="/root/.local/miniconda3/bin:$PATH"
ARG TARGETARCH
RUN if [ "$TARGETARCH" = "amd64" ]; then \
      export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-x86_64.sh"; \
    elif [ "$TARGETARCH" = "arm64" ]; then \
      export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-aarch64.sh"; \
    else \
      echo "Unsupported architecture ${TARGETARCH}" && exit 1; \
    fi \
    && mkdir -p /root/.local \
    && wget "https://repo.anaconda.com/miniconda/$CONDA_INSTALLER_PATH" \
    && mkdir /root/.conda \
    && bash "$CONDA_INSTALLER_PATH" -b -p /root/.local/miniconda3 \
    && rm -f "$CONDA_INSTALLER_PATH" \
    && ln -sf /root/.local/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

# Install PyTorch and CUDA Toolkit
RUN pip install --no-cache-dir torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118

# Place stuff under /workspace
WORKDIR /workspace

# Snapshot of Zeus
ADD . /workspace/zeus

# When an outside zeus directory is mounted, have it apply immediately.
RUN cd /workspace/zeus && pip install --no-cache-dir -e .

The default command is:

docker run -it \
    --gpus all \
    --cap-add SYS_ADMIN \
    --ipc host \
    mlenergy/zeus:latest \
    bash

  • --gpus all: Mounts all GPUs into the Docker container.
  • --cap-add SYS_ADMIN: The SYS_ADMIN capability is needed to change the GPU's power limit or frequency. See the System privileges section below.
  • --ipc host: PyTorch DataLoader workers need enough shared memory for IPC. Without this, they may run out of shared memory and die.

Overriding Zeus installation

Inside the container, Zeus is installed in editable mode (pip install -e). You can therefore mount your locally modified Zeus repository at the matching path in the container (-v /path/to/zeus:/workspace/zeus), and your modifications take effect without running pip install again.
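
If you launch containers from a script, the default command and the optional bind mount can be assembled programmatically. The sketch below only builds the argument list; the function name `zeus_docker_cmd` and its parameters are hypothetical, while the flags, image name, and container path are the ones given on this page.

```python
import shlex
from pathlib import Path


def zeus_docker_cmd(local_repo=None, image="mlenergy/zeus:latest"):
    """Build the `docker run` argument list for the Zeus image.

    Hypothetical helper: if `local_repo` is given, it is bind-mounted over
    /workspace/zeus so the editable install picks up local changes.
    """
    cmd = [
        "docker", "run", "-it",
        "--gpus", "all",            # expose all GPUs to the container
        "--cap-add", "SYS_ADMIN",   # needed to change power limit or frequency
        "--ipc", "host",            # shared memory for DataLoader workers
    ]
    if local_repo is not None:
        host_path = Path(local_repo).expanduser().resolve()
        cmd += ["-v", f"{host_path}:/workspace/zeus"]
    cmd += [image, "bash"]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command line instead of running Docker directly.
    print(shlex.join(zeus_docker_cmd("/path/to/zeus")))
```

To actually start the container, the returned list can be passed to `subprocess.run`.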

Pulling from Docker Hub

Pre-built images are hosted on Docker Hub. There are three types of images available:

  • latest: The latest versioned release.
  • v*: Each versioned release.
  • master: The HEAD commit of Zeus. Usually stable enough, and you will get all the new features.

Building the image locally

You should specify TARGETARCH to be one of amd64 or arm64 based on your environment:

git clone https://github.com/ml-energy/zeus.git
cd zeus
docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 -f docker/Dockerfile .
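
If you are unsure which value to pass, the host architecture can be mapped to the two supported TARGETARCH values. This is a small sketch, assuming you build on the same machine that will run the image; `docker_target_arch` and the alias table are hypothetical names for illustration.

```python
import platform

# Common machine names reported by platform.machine(), mapped to the
# TARGETARCH values the Dockerfile accepts.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def docker_target_arch(machine=None):
    """Return 'amd64' or 'arm64' for the given (or current) machine name."""
    name = (machine or platform.machine()).lower()
    try:
        return _ARCH_ALIASES[name]
    except KeyError:
        raise ValueError(f"Unsupported architecture {name!r}") from None
```

The result can then be supplied as `--build-arg TARGETARCH=<value>` in the `docker build` command above.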

System privileges

Never mind if you're just measuring

No special system-level privileges are needed if you are only measuring time and energy. However, if you are optimizing energy with a method that changes the GPU's power limit or SM frequency, special system-level privileges are required.

When are extra system privileges needed?

The Linux capability SYS_ADMIN is required in order to change the GPU's power limit or frequency. Specifically, this is needed by the GlobalPowerLimitOptimizer and the PipelineFrequencyOptimizer.
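
Before running one of these optimizers, it can help to check whether the current process actually holds the capability. The sketch below is not part of Zeus: it reads the effective capability mask from /proc/self/status on Linux and tests bit 21, which is CAP_SYS_ADMIN in linux/capability.h. The function names are hypothetical.

```python
CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in <linux/capability.h>


def parse_effective_caps(status_text):
    """Extract the effective capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")


def has_sys_admin(status_path="/proc/self/status"):
    """Return True if the current process has CAP_SYS_ADMIN in its effective set.

    Returns False when the status file is unavailable (e.g. non-Linux systems).
    """
    try:
        with open(status_path) as f:
            caps = parse_effective_caps(f.read())
    except (OSError, ValueError):
        return False
    return bool((caps >> CAP_SYS_ADMIN) & 1)


if __name__ == "__main__":
    print("SYS_ADMIN available:", has_sys_admin())
```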

Obtaining privileges with Docker

With Docker, you can pass --cap-add SYS_ADMIN to docker run. Since this significantly simplifies running Zeus, we recommend that users consider this option first. Also, since Zeus runs inside a container, there is less potential for damage if things go wrong.

Obtaining privileges with sudo

If you cannot use Docker, you can run your application with sudo. This is not recommended for security reasons, but it will work.

GPU management server

Granting SYS_ADMIN to the application arguably gives it too much privilege. We only need to change the GPU's power limit or frequency, not to give the process the ability to administer the system. Thus, to reduce the attack surface, we are considering solutions such as a separate GPU management server process on each node (tracking issue) that holds SYS_ADMIN. An unprivileged application process can then ask the GPU management server over a Unix domain socket (UDS) to change the GPU's configuration on its behalf.
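
To make the idea concrete, here is a rough sketch of how such a request could travel over a Unix domain socket. It is not Zeus's design: the newline-delimited JSON protocol, the field names, and every function here are assumptions for illustration, and the server-side callback only records the request instead of touching a GPU.

```python
import json
import os
import socket
import tempfile
import threading


def serve_once(server_sock, apply_power_limit):
    """Privileged side (hypothetical): handle one request on a listening UDS."""
    conn, _ = server_sock.accept()
    with conn:
        with conn.makefile("r") as reader:
            request = json.loads(reader.readline())
        try:
            apply_power_limit(request["gpu_index"], request["power_limit_mw"])
            reply = {"ok": True}
        except Exception as exc:  # report any failure back to the client
            reply = {"ok": False, "error": str(exc)}
        conn.sendall((json.dumps(reply) + "\n").encode())


def request_power_limit(socket_path, gpu_index, power_limit_mw):
    """Unprivileged side (hypothetical): ask the server to set a power limit."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(socket_path)
        message = {"gpu_index": gpu_index, "power_limit_mw": power_limit_mw}
        client.sendall((json.dumps(message) + "\n").encode())
        with client.makefile("r") as reader:
            return json.loads(reader.readline())


def demo():
    """Run server and client in one process; the callback just logs requests."""
    applied = []
    socket_path = os.path.join(tempfile.mkdtemp(), "gpu-mgmt.sock")
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(socket_path)
        server.listen(1)
        worker = threading.Thread(
            target=serve_once,
            args=(server, lambda idx, mw: applied.append((idx, mw))),
        )
        worker.start()
        reply = request_power_limit(socket_path, 0, 200_000)
        worker.join()
    return reply, applied
```

In a real deployment, only the server process would hold SYS_ADMIN, and access could be restricted further through file permissions on the socket path.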

Next Steps