Setting Up the Environment

We encourage users to do everything inside a Docker container spawned with our pre-built Docker image.

Tip

Docker may not be an option for some users. In that case,

  • Python still needs the Linux SYS_ADMIN capability to change the GPU's power limit. A quick but crude workaround is to run Python with sudo.
  • Skim through our Dockerfile (shown below) and make sure everything it installs is also available on your system.
  • Follow the instructions in Installing Zeus.
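
When running outside Docker, it can help to confirm the capability before launching a job. The following is a minimal sketch and not part of Zeus. It assumes a Linux host where /proc/self/status exposes a CapEff line, and it relies on CAP_SYS_ADMIN being capability number 21 in the Linux capability list. The helper names are hypothetical.

```python
"""Check whether the current process holds CAP_SYS_ADMIN (Linux only)."""

CAP_SYS_ADMIN_BIT = 21  # capability number of CAP_SYS_ADMIN on Linux


def has_cap_sys_admin(cap_eff_hex: str) -> bool:
    """Return True if the CapEff hex mask has the CAP_SYS_ADMIN bit set."""
    return bool((int(cap_eff_hex, 16) >> CAP_SYS_ADMIN_BIT) & 1)


def read_effective_caps(status_path: str = "/proc/self/status") -> str:
    """Read this process's effective capability mask as a hex string."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError(f"No CapEff line found in {status_path}")


if __name__ == "__main__":
    if has_cap_sys_admin(read_effective_caps()):
        print("CAP_SYS_ADMIN is present.")
    else:
        print("CAP_SYS_ADMIN is missing; changing the GPU power limit will likely fail.")
```

Run it with the same interpreter and privileges you plan to use for training, for example with and without sudo, to see the difference.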

Zeus Docker image

We provide a pre-built Docker image on Docker Hub. On top of the nvidia/cuda:11.8.0-base-ubuntu22.04 image (the base image named in the Dockerfile below), the following are added:

  1. Miniconda3 23.3.1, PyTorch 2.0.1, torchvision 0.15.2
  2. A copy of the Zeus repo in /workspace/zeus.
  3. An editable install of the zeus package in /workspace/zeus/zeus. Users can override the copy of the repo by mounting the edited repo into the container. See instructions below.
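
If you are reproducing this environment by hand, a quick version check can catch mismatches early. This is a minimal sketch, assuming the package names torch and torchvision and the versions listed above. The function name check_versions is hypothetical, and the comparison ignores a local build suffix such as +cu118 that the installed version string may carry.

```python
"""Compare installed package versions against the ones the image ships."""
from importlib import metadata

# Versions listed for the Zeus Docker image above.
EXPECTED = {"torch": "2.0.1", "torchvision": "0.15.2"}


def check_versions(expected, get_version=metadata.version):
    """Map each package name to (installed version or None, matches expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore a local build suffix such as "+cu118" when comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (found, matches)
    return report


if __name__ == "__main__":
    for name, (found, matches) in check_versions(EXPECTED).items():
        status = "ok" if matches else "MISMATCH"
        print(f"{name}: expected {EXPECTED[name]}, found {found} [{status}]")
```
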
Dockerfile
# Copyright (C) 2023 Jae-Won Chung <jwnchung@umich.edu>
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Build instructions
#   If you're building this image locally, make sure you specify `TARGETARCH`.
#   Currently, this image supports `amd64` and `arm64`. For instance:
#     docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 .

FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# Basic installs
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ='America/Detroit'
RUN apt-get update -qq \
    && apt-get -y --no-install-recommends install \
       build-essential software-properties-common wget git tar rsync cmake \
    && apt-get clean all \
    && rm -r /var/lib/apt/lists/*

# Install Miniconda3 23.3.1
ENV PATH="/root/.local/miniconda3/bin:$PATH"
ARG TARGETARCH
RUN if [ "$TARGETARCH" = "amd64" ]; then \
      export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-x86_64.sh"; \
    elif [ "$TARGETARCH" = "arm64" ]; then \
      export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-aarch64.sh"; \
    else \
      echo "Unsupported architecture ${TARGETARCH}" && exit 1; \
    fi \
    && mkdir -p /root/.local \
    && wget "https://repo.anaconda.com/miniconda/$CONDA_INSTALLER_PATH" \
    && mkdir /root/.conda \
    && bash "$CONDA_INSTALLER_PATH" -b -p /root/.local/miniconda3 \
    && rm -f "$CONDA_INSTALLER_PATH" \
    && ln -sf /root/.local/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

# Install PyTorch and CUDA Toolkit
RUN pip install --no-cache-dir torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118

# Place stuff under /workspace
WORKDIR /workspace

# Snapshot of Zeus
ADD . /workspace/zeus

# When an outside zeus directory is mounted, have it apply immediately.
RUN cd /workspace/zeus && pip install --no-cache-dir -e .

Tip

If you want to build our Docker image locally, set TARGETARCH to either amd64 or arm64 to match your machine's architecture:

docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 .

Dependencies

  1. docker
  2. nvidia-docker2

Spawn the container

The default command is:

docker run -it \
    --gpus all \
    --cap-add SYS_ADMIN \
    --ipc host \
    mlenergy/zeus:latest \
    bash

  1. --gpus all mounts all GPUs into the Docker container. nvidia-docker2 provides this option.
  2. --cap-add SYS_ADMIN grants the SYS_ADMIN capability, which is needed to manage the power configurations of the GPU via NVML.
  3. --ipc host gives PyTorch DataLoader workers enough shared memory for IPC. Without it, they may run out of shared memory and die.
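
To see how much shared memory the container actually has, you can inspect /dev/shm from inside it. The sketch below is illustrative and not part of Zeus. It assumes /dev/shm is where shared memory is mounted and uses 64 MiB, Docker's default shared memory size when no IPC or shm option is given, as the threshold for "too small". The helper names are hypothetical.

```python
"""Report the size of /dev/shm as seen from inside the container."""
import shutil

DOCKER_DEFAULT_SHM_MIB = 64  # Docker's default shared memory size


def shm_total_mib(path: str = "/dev/shm") -> float:
    """Return the total size of the shared memory mount in MiB."""
    return shutil.disk_usage(path).total / 2**20


def is_small_shm(total_mib: float, threshold_mib: float = DOCKER_DEFAULT_SHM_MIB) -> bool:
    """Return True if shared memory is no larger than the threshold."""
    return total_mib <= threshold_mib


if __name__ == "__main__":
    total = shm_total_mib()
    print(f"/dev/shm total: {total:.0f} MiB")
    if is_small_shm(total):
        print("Shared memory looks like the Docker default; DataLoader workers may fail.")
```
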

Use the -v option to mount outside data into the container. For instance, to have changes you make to zeus/ outside the container take effect immediately inside it, mount the repository into the container. You can mount training data the same way.

# Working directory is repository root
docker run -it \
    --gpus all \
    --cap-add SYS_ADMIN \
    --ipc host \
    -v "$(pwd)":/workspace/zeus \
    -v /data/imagenet:/data/imagenet:ro \
    mlenergy/zeus:latest \
    bash

  1. --gpus all mounts all GPUs into the Docker container. nvidia-docker2 provides this option.
  2. --cap-add SYS_ADMIN grants the SYS_ADMIN capability, which is needed to manage the power configurations of the GPU via NVML.
  3. --ipc host gives PyTorch DataLoader workers enough shared memory for IPC. Without it, they may run out of shared memory and die.
  4. -v "$(pwd)":/workspace/zeus mounts the repository directory into the Docker container. Since the zeus installation inside the container is editable, changes you make outside the container apply immediately.
  5. -v /data/imagenet:/data/imagenet:ro mounts a training data directory (here, an example ImageNet path) into the container read-only.
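
To confirm that Python inside the container picks up the mounted repository, you can check where the zeus package resolves from. This is a minimal sketch using only the standard library. It assumes the package is importable as zeus and that the repository is mounted at /workspace/zeus as in the command above. The helper names are hypothetical.

```python
"""Check where Python resolves a package from, e.g. the mounted repository."""
from importlib import util
from pathlib import Path


def package_dir(name: str):
    """Return the directory a top-level package is loaded from, or None."""
    spec = util.find_spec(name)
    if spec is None or not spec.has_location or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent


def is_inside(path: Path, root: str) -> bool:
    """Return True if path is root itself or located somewhere below it."""
    path = Path(path).resolve()
    root_path = Path(root).resolve()
    return path == root_path or root_path in path.parents


if __name__ == "__main__":
    location = package_dir("zeus")
    if location is None:
        print("zeus is not importable in this environment.")
    else:
        inside = is_inside(location, "/workspace/zeus")
        print(f"zeus resolves to {location} (inside /workspace/zeus: {inside})")
```
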