Skip to content

Setting Up the Environment

We encourage users to do everything inside a Docker container spawned with our pre-built Docker image.

Zeus Docker image

We provide a pre-built Docker image in Docker Hub. On top of the nvidia/cuda:11.3.1-devel-ubuntu20.04 image, the following are provided:

  1. CMake 3.22.0
  2. Miniconda3 4.12.0, PyTorch 1.10.1, torchvision 0.11.2, cudatoolkit 11.3.1
  3. A copy of the Zeus repo in /workspace/zeus.
  4. An editable install of the zeus package in /workspace/zeus/zeus. Users can override the copy of the repo by mounting the edited repo into the container. See instructions below.
Dockerfile
Dockerfile
# Copyright (C) 2022 Jae-Won Chung <jwnchung@umich.edu>
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM nvidia/cuda:11.3.1-devel-ubuntu20.04

# Basic installs
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ='America/Detroit'
RUN apt-get update -qq \
    && apt-get -y --no-install-recommends install \
       build-essential software-properties-common wget git tar rsync \
    && apt-get clean all \
    && rm -r /var/lib/apt/lists/*

# Install cmake 3.22.0
RUN wget https://github.com/Kitware/CMake/releases/download/v3.22.0/cmake-3.22.0-linux-x86_64.tar.gz \
    && tar xzf cmake-* \
    && rsync -a cmake-*/bin /usr/local \
    && rsync -a cmake-*/share /usr/local \
    && rm -r cmake-*

# Install Miniconda3 4.12.0
ENV PATH="/root/.local/miniconda3/bin:$PATH"
RUN mkdir -p /root/.local \
    && wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh \
    && mkdir /root/.conda \
    && bash Miniconda3-py39_4.12.0-Linux-x86_64.sh -b -p /root/.local/miniconda3 \
    && rm -f Miniconda3-py39_4.12.0-Linux-x86_64.sh \
    && ln -sf /root/.local/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

# Install PyTorch and CUDA Toolkit
RUN conda install -y -c pytorch pytorch==1.10.1 torchvision==0.11.2 cudatoolkit==11.3.1

# Place stuff under /workspace
WORKDIR /workspace

# Snapshot of Zeus
ADD . /workspace/zeus

# When an outside zeus directory is mounted, have it apply immediately.
RUN pip install -e zeus

# Build and bake in the Zeus monitor.
RUN cd /workspace/zeus/zeus_monitor && cmake . && make && cp zeus_monitor /usr/local/bin/ && cd /workspace

Dependencies

  1. docker
  2. nvidia-docker2

Spawn the container

The default command would be:

docker run -it \
    --gpus all \                      # (1)!
    --cap-add SYS_ADMIN \           # (2)!
    --shm-size 64G \              # (3)!
    symbioticlab/zeus:latest \
    bash
  1. Mounts all GPUs into the Docker container. nvidia-docker2 provides this option.
  2. SYS_ADMIN capability is needed to manage the power configurations of the GPU via NVML.
  3. PyTorch DataLoader workers need enough shared memory for IPC. If the PyTorch training process dies with a Bus error, consider increasing this even more.

Use the -v option to mount outside data into the container. For instance, if you would like your changes to zeus/ outside the container to be immediately applied inside the container, mount the repository into the container. You can also mount training data into the container.

# Working directory is repository root
docker run -it \
    --gpus all \                               # (1)!
    --cap-add SYS_ADMIN \                    # (2)!
    --shm-size 64G \                       # (3)!
    -v $(pwd):/workspace/zeus \          # (4)!
    -v /data/imagenet:/data/imagenet:ro \
    symbioticlab/zeus:latest \
    bash
  1. Mounts all GPUs into the Docker container. nvidia-docker2 provides this option.
  2. SYS_ADMIN capability is needed to manage the power configurations of the GPU via NVML.
  3. PyTorch DataLoader workers need enough shared memory for IPC. If the PyTorch training process dies with a Bus error, consider increasing this even more.
  4. Mounts the repository directory into the Docker container. Since the zeus installation inside the container is editable, changes you made outside will apply immediately.