
Getting Started

This page covers the most common setup steps. Some optimizers and examples need extra setup, which is described in their own documentation.

Installing the Python package

From PyPI

Install the Zeus Python package with:

pip install zeus-ml

From source for development

You can also install Zeus from source by cloning our GitHub repository. For development, do an editable installation with the extra dev dependencies:

git clone https://github.com/ml-energy/zeus.git
cd zeus
pip install -e '.[dev]'

Using Docker

Dependencies

You should already have Docker and the NVIDIA Container Toolkit installed on your system. The latter is what makes the --gpus flag of docker run work.

Our Docker image should suit most use cases for Zeus. On top of the nvidia/cuda:11.8.0-base-ubuntu22.04 image, we add:

  • Miniconda 3, PyTorch, and Torchvision
  • A copy of the Zeus repo in /workspace/zeus
docker/Dockerfile
# Build instructions
#   If you're building this image locally, make sure you specify `TARGETARCH`.
#   Currently, this image supports `amd64` and `arm64`. For instance:
#     docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 .

FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# Basic installs
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ='America/Detroit'
RUN apt-get update -qq \
    && apt-get -y --no-install-recommends install \
       build-essential software-properties-common wget git tar rsync cmake \
    && apt-get clean all \
    && rm -r /var/lib/apt/lists/*

# Install Miniconda3 23.3.1
ENV PATH="/root/.local/miniconda3/bin:$PATH"
ARG TARGETARCH
RUN if [ "$TARGETARCH" = "amd64" ]; then \
      export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-x86_64.sh"; \
    elif [ "$TARGETARCH" = "arm64" ]; then \
      export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-aarch64.sh"; \
    else \
      echo "Unsupported architecture ${TARGETARCH}" && exit 1; \
    fi \
    && mkdir -p /root/.local \
    && wget "https://repo.anaconda.com/miniconda/$CONDA_INSTALLER_PATH" \
    && mkdir /root/.conda \
    && bash "$CONDA_INSTALLER_PATH" -b -p /root/.local/miniconda3 \
    && rm -f "$CONDA_INSTALLER_PATH" \
    && ln -sf /root/.local/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

# Install PyTorch and CUDA Toolkit
RUN pip install --no-cache-dir torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118

# Place stuff under /workspace
WORKDIR /workspace

# Snapshot of Zeus
ADD . /workspace/zeus

# When an outside zeus directory is mounted, have it apply immediately.
RUN cd /workspace/zeus && pip install --no-cache-dir -e .

The default command to start a container is:

docker run -it \
    --gpus all \
    --cap-add SYS_ADMIN \
    --ipc host \
    -v /sys/class/powercap/intel-rapl:/zeus_sys/class/powercap/intel-rapl \
    mlenergy/zeus:latest \
    bash

  1. --gpus all: Mounts all GPUs into the Docker container. See the Docker docs for more about the --gpus argument.
  2. --cap-add SYS_ADMIN: The SYS_ADMIN Linux security capability is needed to change the GPU's power limit or frequency. See System privileges for details and alternatives.
  3. --ipc host: PyTorch DataLoader workers need enough shared memory for IPC. Without this, they may run out of shared memory and die.
  4. -v /sys/class/powercap/intel-rapl:/zeus_sys/class/powercap/intel-rapl: Zeus reads Intel RAPL metrics for CPU/DRAM energy measurement through the sysfs interface. Docker disables this by default, so we need to mount it into the container separately (under /zeus_sys).

In particular, --cap-add SYS_ADMIN lets Zeus change the GPU's power limit or frequency, and mounting /sys/class/powercap/intel-rapl at /zeus_sys/class/powercap/intel-rapl lets Zeus measure CPU/DRAM energy via Intel RAPL. See System privileges for details.
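If you launch containers from a script, the same invocation can be assembled as an argument list, which avoids shell line-continuation pitfalls. The following is a minimal sketch, not part of Zeus: the function name build_docker_run_args is hypothetical, and only the flags and paths come from the command above.

```python
import shlex
from typing import List

# Paths taken from the documented `docker run` command.
RAPL_HOST_DIR = "/sys/class/powercap/intel-rapl"
RAPL_CONTAINER_DIR = "/zeus_sys/class/powercap/intel-rapl"


def build_docker_run_args(image: str = "mlenergy/zeus:latest", command: str = "bash") -> List[str]:
    """Return the argv list for the documented `docker run` invocation.

    Hypothetical helper for illustration; it only builds the list and does not run Docker.
    """
    return [
        "docker", "run", "-it",
        "--gpus", "all",            # expose all GPUs to the container
        "--cap-add", "SYS_ADMIN",   # needed to change GPU power limit or frequency
        "--ipc", "host",            # shared memory for PyTorch DataLoader workers
        "-v", f"{RAPL_HOST_DIR}:{RAPL_CONTAINER_DIR}",  # RAPL sysfs for CPU/DRAM energy
        image,
        command,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line; nothing is executed here.
    print(shlex.join(build_docker_run_args()))
```

The list can be passed to subprocess.run as is. Drop the --cap-add or -v entries if you do not need GPU configuration changes or CPU/DRAM energy measurement.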

Pulling from Docker Hub

Pre-built images are hosted on Docker Hub under three kinds of tags:

  • latest: The latest versioned release.
  • v*: Each versioned release.
  • master: The HEAD commit of Zeus. Usually stable enough, and it includes all the latest features.

Building the image locally

Set TARGETARCH to either amd64 or arm64, depending on your machine:

git clone https://github.com/ml-energy/zeus.git
cd zeus
docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 -f docker/Dockerfile .
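If you are unsure which value applies to your machine, a small script can derive it from the CPU architecture that Python reports. This is a sketch under assumptions: guess_targetarch is a hypothetical helper, and the mapping covers only the common machine names for the two architectures the image supports.

```python
import platform
from typing import Optional

# Common values of platform.machine() mapped to the TARGETARCH names
# accepted by the Dockerfile (amd64 or arm64).
_MACHINE_TO_TARGETARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def guess_targetarch(machine: Optional[str] = None) -> str:
    """Map a machine name (default: this host's) to a TARGETARCH value.

    Raises ValueError for architectures the image does not support.
    """
    machine = (machine or platform.machine()).lower()
    try:
        return _MACHINE_TO_TARGETARCH[machine]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {machine}") from None
```

Calling guess_targetarch() with no argument inspects the current host. Pass the result to --build-arg TARGETARCH=... in the docker build command above.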

Verifying installation

After installing the Zeus package, run the following to check whether Zeus detects your packages and hardware correctly:

$ python -m zeus.show_env
================================================================================

Python version: 3.9.19

================================================================================

[2024-09-09 16:40:14,495] [zeus.utils.framework](framework.py:25) PyTorch with CUDA support is available.
[2024-09-09 16:40:14,496] [zeus.utils.framework](framework.py:45) JAX is not available

Package availability and versions:
  Zeus: 0.10.0
  PyTorch: 2.4.1+cu121
  JAX: not available

================================================================================

[2024-09-09 16:40:14,512] [zeus.device.gpu.nvidia](nvidia.py:46) pynvml is available and initialized.

GPU availability:
  GPU 0: NVIDIA A40

================================================================================

[2024-09-09 16:40:14,519] [zeus.device.cpu.rapl](rapl.py:136) RAPL is available.
[2024-09-09 16:40:14,519] [RaplWraparoundTracker](rapl.py:82) Monitoring wrap around of /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
[2024-09-09 16:40:14,528] [RaplWraparoundTracker](rapl.py:82) Monitoring wrap around of /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/energy_uj
[2024-09-09 16:40:14,533] [RaplWraparoundTracker](rapl.py:82) Monitoring wrap around of /sys/class/powercap/intel-rapl/intel-rapl:1/energy_uj
[2024-09-09 16:40:14,535] [RaplWraparoundTracker](rapl.py:82) Monitoring wrap around of /sys/class/powercap/intel-rapl/intel-rapl:1/intel-rapl:1:0/energy_uj

CPU availability:
  CPU 0:
    CPU measurements available (/sys/class/powercap/intel-rapl/intel-rapl:0)
    DRAM measurements available (/sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0)
  CPU 1:
    CPU measurements available (/sys/class/powercap/intel-rapl/intel-rapl:1)
    DRAM measurements available (/sys/class/powercap/intel-rapl/intel-rapl:1/intel-rapl:1:0)

================================================================================
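For scripted checks (for example in CI), the package-version part of this report can be approximated with the standard library alone. The sketch below is not how zeus.show_env is implemented. It assumes the PyPI distribution names zeus-ml, torch, and jax, and looks up their installed versions without importing them.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions based on the packages named on this page.
    for dist in ("zeus-ml", "torch", "jax"):
        print(f"{dist}: {installed_version(dist) or 'not available'}")
```

This only confirms that the packages are installed. GPU and RAPL detection still require running python -m zeus.show_env on the target machine.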

System privileges

When are extra system privileges needed?

  1. CPU energy measurement: Root privileges are needed to measure CPU energy through the Intel RAPL interface, because the Linux kernel restricts reading the RAPL energy counters to root in response to a side-channel security issue. Specifically, this applies if you want to measure CPU energy via ZeusMonitor with cpu_indices.
  2. GPU energy optimization: The Linux security capability SYS_ADMIN is required to change the GPU's power limit or frequency (root also works, since it is strictly stronger). Specifically, this is needed by the GlobalPowerLimitOptimizer and the PipelineFrequencyOptimizer.
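To see whether the current process can read the RAPL counters at all, you can probe the energy_uj files directly. The following is a minimal sketch rather than Zeus's own detection logic. The helper name rapl_counter_readability is hypothetical, and the two root directories are the host path and the in-container mount point used elsewhere on this page.

```python
import os
from pathlib import Path
from typing import Dict, Iterable

# Host sysfs location, and the mount point used inside the Docker container.
DEFAULT_RAPL_ROOTS = (
    "/sys/class/powercap/intel-rapl",
    "/zeus_sys/class/powercap/intel-rapl",
)


def rapl_counter_readability(roots: Iterable[str] = DEFAULT_RAPL_ROOTS) -> Dict[str, bool]:
    """Map each RAPL energy counter found under `roots` to whether it is readable."""
    result: Dict[str, bool] = {}
    for root in roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue
        # Package-level counters, then sub-domain counters (e.g. DRAM).
        patterns = ("intel-rapl:*/energy_uj", "intel-rapl:*/intel-rapl:*/energy_uj")
        for pattern in patterns:
            for counter in sorted(root_path.glob(pattern)):
                result[str(counter)] = os.access(counter, os.R_OK)
    return result


if __name__ == "__main__":
    found = rapl_counter_readability()
    if not found:
        print("No RAPL counters found.")
    for path, readable in found.items():
        print(f"{path}: {'readable' if readable else 'permission denied'}")
```

An empty result means no RAPL directory is visible (for example, it was not mounted into the container). Counters that exist but are not readable indicate that the process lacks root privileges.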

Option 1: Running applications in a Docker container

For CPU energy measurement, you are already root inside a Docker container, so you only need to mount the RAPL sysfs directory into the container. See Using Docker for instructions.

For GPU energy optimization, pass --cap-add SYS_ADMIN to docker run. Since this significantly simplifies running Zeus, we recommend considering this option first. The same is possible for Kubernetes Pods via securityContext.capabilities.add in the container spec (see the Kubernetes docs).
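As an illustration of the Kubernetes variant, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The pod and container names are placeholders, and GPU resource requests and the RAPL volume mount are deliberately left out. Only the securityContext.capabilities.add field corresponds to the text above.

```python
import json


def container_spec_with_sys_admin(name: str, image: str) -> dict:
    """Container spec that adds the SYS_ADMIN capability (names are placeholders)."""
    return {
        "name": name,
        "image": image,
        "securityContext": {"capabilities": {"add": ["SYS_ADMIN"]}},
    }


def pod_manifest(pod_name: str, image: str = "mlenergy/zeus:latest") -> dict:
    """Minimal Pod manifest wrapping a single Zeus container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {"containers": [container_spec_with_sys_admin("zeus", image)]},
    }


if __name__ == "__main__":
    # Print the manifest; it can be saved to a file and applied with kubectl.
    print(json.dumps(pod_manifest("zeus-example"), indent=2))
```

A real deployment would also request GPUs through your cluster's device plugin, which depends on the cluster setup and is not shown here.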

Option 2: Deploying the Zeus daemon (zeusd)

Granting SYS_ADMIN to an entire application just so it can change the GPU's configuration grants far more privilege than necessary. Instead, Zeus provides the Zeus daemon, zeusd, a simple server process designed to run with admin privileges and expose a minimal set of APIs wrapping the NVML methods that change the GPU's configuration. An unprivileged application (i.e., one run normally by any user) can then ask zeusd over a Unix domain socket to change the local node's GPU configuration on its behalf.

To deploy zeusd:

# Install zeusd
cargo install zeusd

# Run zeusd with admin privileges
sudo zeusd \
    --socket-path /var/run/zeusd.sock \
    --socket-permissions 666

  1. --socket-path: Unix domain socket path that zeusd listens to.
  2. --socket-permissions: Applications need write access to the socket to be able to talk to zeusd. This string is interpreted as UNIX file permissions.
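Before pointing an application at zeusd, it can help to confirm that the socket exists and that the current user may write to it. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and it assumes the permission string is read as an octal mode (so 666 means read and write for owner, group, and others).

```python
import os
import stat


def parse_socket_permissions(permissions: str) -> int:
    """Interpret a permission string such as "666" as an octal file mode (assumption)."""
    return int(permissions, 8)


def can_talk_to_socket(socket_path: str) -> bool:
    """Return True if `socket_path` is a Unix domain socket writable by this process."""
    try:
        mode = os.stat(socket_path).st_mode
    except OSError:
        # Missing path or no permission to stat it.
        return False
    return stat.S_ISSOCK(mode) and os.access(socket_path, os.W_OK)


if __name__ == "__main__":
    mode = parse_socket_permissions("666")
    # Show the mode in `ls -l` style for a socket file.
    print(stat.filemode(stat.S_IFSOCK | mode))
    print(can_talk_to_socket("/var/run/zeusd.sock"))
```

If the check fails while zeusd is running, the socket permissions are probably too restrictive for the user running the application.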

We're currently working on adding Intel RAPL support to the Zeus daemon (tracking issue). We plan to land this feature at the end of 2024.

Option 3: Running applications with sudo

This is probably the worst option. However, if none of the options above work, you can run your application with sudo, which is essentially root and automatically has SYS_ADMIN.

Next Steps