Setting Up the Environment
We encourage users to do everything inside a Docker container spawned with our pre-built Docker image.
Tip

Docker may not be an option for some users. In that case:

- Python still needs the Linux SYS_ADMIN capability to change the GPU's power limit. One quick-and-dirty way is to run Python with sudo.
- Skim through our Dockerfile (shown below) to make sure you have everything it installs.
- Follow the instructions in Installing Zeus.
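If you are unsure whether your current shell already has the capability Zeus needs, a quick check is possible via `/proc`. This is a minimal sketch for Linux, relying on the fact that CAP_SYS_ADMIN is capability bit 21 in the `CapEff` mask:

```shell
# Check whether the current process holds CAP_SYS_ADMIN
# (capability bit 21 in the CapEff mask of /proc/self/status).
capeff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
if [ $(( 0x$capeff >> 21 & 1 )) -eq 1 ]; then
    echo "SYS_ADMIN: yes"
else
    echo "SYS_ADMIN: no"
fi
```

Running Python with sudo sidesteps this because root processes hold the full capability set, so the check above prints `yes` under sudo.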
Zeus Docker image

We provide a pre-built Docker image on Docker Hub. On top of the nvidia/cuda:11.8.0-base-ubuntu22.04 image, the following are added:

- Miniconda3 23.3.1, PyTorch 2.0.1, and torchvision 0.15.2
- A copy of the Zeus repository in /workspace/zeus
- An editable install of the zeus package (source in /workspace/zeus/zeus). You can override the built-in copy of the repository by mounting your edited repository into the container; see the instructions below.
Dockerfile

```Dockerfile
# Copyright (C) 2023 Jae-Won Chung <jwnchung@umich.edu>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Build instructions
#
# If you're building this image locally, make sure you specify `TARGETARCH`.
# Currently, this image supports `amd64` and `arm64`. For instance:
#
#     docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 .

FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# Basic installs
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ='America/Detroit'
RUN apt-get update -qq \
    && apt-get -y --no-install-recommends install \
       build-essential software-properties-common wget git tar rsync cmake \
    && apt-get clean all \
    && rm -r /var/lib/apt/lists/*

# Install Miniconda3 23.3.1
ENV PATH="/root/.local/miniconda3/bin:$PATH"
ARG TARGETARCH
RUN if [ "$TARGETARCH" = "amd64" ]; then \
        export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-x86_64.sh"; \
    elif [ "$TARGETARCH" = "arm64" ]; then \
        export CONDA_INSTALLER_PATH="Miniconda3-py39_23.3.1-0-Linux-aarch64.sh"; \
    else \
        echo "Unsupported architecture ${TARGETARCH}" && exit 1; \
    fi \
    && mkdir -p /root/.local \
    && wget "https://repo.anaconda.com/miniconda/$CONDA_INSTALLER_PATH" \
    && mkdir /root/.conda \
    && bash "$CONDA_INSTALLER_PATH" -b -p /root/.local/miniconda3 \
    && rm -f "$CONDA_INSTALLER_PATH" \
    && ln -sf /root/.local/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

# Install PyTorch and CUDA Toolkit
RUN pip install --no-cache-dir torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118

# Place stuff under /workspace
WORKDIR /workspace

# Snapshot of Zeus
ADD . /workspace/zeus

# When an outside zeus directory is mounted, have it apply immediately.
RUN cd /workspace/zeus && pip install --no-cache-dir -e .
```
Tip

If you want to build our Docker image locally, specify TARGETARCH as either amd64 or arm64, matching your environment's architecture:

```sh
docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 .
```
Dependencies

- docker
- nvidia-docker2
Spawn the container

The default command would be:

```sh
docker run -it \
    --gpus all \            # (1)!
    --cap-add SYS_ADMIN \   # (2)!
    --ipc host \            # (3)!
    mlenergy/zeus:latest \
    bash
```

1. Mounts all GPUs into the Docker container. nvidia-docker2 provides this option.
2. The SYS_ADMIN capability is needed to manage the power configurations of the GPU via NVML.
3. PyTorch DataLoader workers need enough shared memory for IPC. Without this, they may run out of shared memory and die.
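To see why --ipc host matters, you can inspect the shared memory filesystem that DataLoader workers use for IPC. A minimal sketch, assuming a Linux host with /dev/shm mounted; the 64 MiB figure is Docker's default --shm-size:

```shell
# Show how much shared memory is available to this process.
# Inside a container spawned with default settings this is only 64M,
# which DataLoader workers can easily exhaust; --ipc host (or a larger
# --shm-size) avoids that.
df -h /dev/shm
```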
Use the -v option to mount outside data into the container. For instance, if you would like your changes to zeus/ outside the container to be immediately applied inside the container, mount the repository into the container. You can also mount training data into the container.
```sh
# Working directory is repository root
docker run -it \
    --gpus all \                          # (1)!
    --cap-add SYS_ADMIN \                 # (2)!
    --ipc host \                          # (3)!
    -v $(pwd):/workspace/zeus \           # (4)!
    -v /data/imagenet:/data/imagenet:ro \
    mlenergy/zeus:latest \
    bash
```

1. Mounts all GPUs into the Docker container. nvidia-docker2 provides this option.
2. The SYS_ADMIN capability is needed to manage the power configurations of the GPU via NVML.
3. PyTorch DataLoader workers need enough shared memory for IPC. Without this, they may run out of shared memory and die.
4. Mounts the repository directory into the Docker container. Since the zeus installation inside the container is editable, changes you make outside will apply immediately.
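The "apply immediately" behavior comes from pip's editable-install mode, which imports straight from the source tree instead of a copied install. The following toy demonstration sketches this with a made-up package (demo-pkg/demo_mod are hypothetical names, not part of Zeus), assuming pip and setuptools are available:

```shell
# Build a throwaway package and install it in editable mode.
tmp=$(mktemp -d)
printf 'from setuptools import setup\nsetup(name="demo-pkg", version="0.1", py_modules=["demo_mod"])\n' > "$tmp/setup.py"
printf 'VALUE = 1\n' > "$tmp/demo_mod.py"
python3 -m pip install -q --no-build-isolation -e "$tmp"
python3 -c 'import demo_mod; print(demo_mod.VALUE)'

# Edit the source in place; the change is visible on the next import with
# no reinstall -- exactly what mounting your zeus/ checkout over
# /workspace/zeus gives you.
printf 'VALUE = 2\n' > "$tmp/demo_mod.py"
python3 -c 'import demo_mod; print(demo_mod.VALUE)'
python3 -m pip uninstall -q -y demo-pkg
```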