Batch Size Optimizer
The batch size optimizer (BSO) finds the DNN training batch size that minimizes the cost

\[
\textrm{Cost} = \eta \cdot \textrm{ETA} + (1 - \eta) \cdot \textrm{MaxPower} \cdot \textrm{TTA},
\]

where ETA and TTA stand for Energy-to-Accuracy and Time-to-Accuracy, respectively, accuracy is the target validation metric of training, and MaxPower is the maximum power limit of the GPUs. Users can trade off between energy and time by setting the \(\eta\) parameter to a value between 0 and 1.
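As a quick illustration of how \(\eta\) weighs energy against time in this cost, here is a minimal sketch; the ETA, TTA, and MaxPower numbers are made up for the example.

```python
def cost(eta: float, energy_to_accuracy: float, time_to_accuracy: float, max_power: float) -> float:
    """Cost = eta * ETA + (1 - eta) * MaxPower * TTA."""
    return eta * energy_to_accuracy + (1 - eta) * max_power * time_to_accuracy

# Hypothetical measurements for a single training run.
etaj = 1.2e6   # ETA: energy consumed to reach the target metric (Joules)
ttas = 5.4e3   # TTA: time taken to reach the target metric (seconds)
pmax = 300.0   # Maximum GPU power limit (Watts)

print(cost(1.0, etaj, ttas, pmax))  # eta = 1.0: only energy matters
print(cost(0.5, etaj, ttas, pmax))  # eta = 0.5: balance energy and time
```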
Usage
In production environments, a single DNN is commonly trained and re-trained many times: as the data distribution drifts over time, the model must be re-trained to adapt to it. The batch size optimizer uses these recurring training jobs as opportunities to find the optimal batch size for the job, exploring the space of batch sizes with a Multi-Armed Bandit algorithm. For more details, please refer to the Zeus paper.
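To give a feel for what this exploration looks like, below is a toy, self-contained bandit over a handful of candidate batch sizes. It is only an illustration of the idea, not the server's actual algorithm; the candidate batch sizes, the epsilon-greedy policy, and the noisy cost model are all invented for the example.

```python
import random

# Toy multi-armed bandit: each candidate batch size is an "arm", and the
# measured cost of a recurring training run is the (noisy) feedback signal.
batch_sizes = [32, 64, 128, 256, 512]
observed_costs = {bs: [] for bs in batch_sizes}

def choose_batch_size(epsilon: float = 0.2) -> int:
    """Epsilon-greedy: usually exploit the cheapest arm so far, sometimes explore."""
    untried = [bs for bs in batch_sizes if not observed_costs[bs]]
    if untried:
        return random.choice(untried)
    if random.random() < epsilon:
        return random.choice(batch_sizes)
    return min(batch_sizes, key=lambda bs: sum(observed_costs[bs]) / len(observed_costs[bs]))

def report_cost(bs: int, cost: float) -> None:
    """Called once per recurring training run with the run's measured cost."""
    observed_costs[bs].append(cost)

# Simulate recurring training jobs with a made-up cost model.
for _ in range(20):
    bs = choose_batch_size()
    simulated_cost = abs(bs - 256) + random.gauss(0, 10)  # pretend 256 is best
    report_cost(bs, simulated_cost)

tried = [bs for bs in batch_sizes if observed_costs[bs]]
print("Best batch size so far:", min(tried, key=lambda bs: sum(observed_costs[bs]) / len(observed_costs[bs])))
```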
Constraints
Currently, the batch size optimizer only supports cases where the number and type of GPUs used for each recurring training job are always the same.
High-level architecture
```mermaid
sequenceDiagram;
    participant BSO server
    participant BSO client
    loop Every recurrent training
        BSO client->>BSO server: Register the training job and ask for the batch size to use
        BSO server->>BSO client: Return the batch size to use and an integer trial number
        loop Every epoch
            BSO client->>BSO server: At the end of each epoch, report time and energy consumption
            BSO server->>BSO client: Compute cost and determine whether the client should terminate training
        end
        BSO client->>BSO server: Notify the server that the trial finished
    end
```
In order to persist the state of the optimizer across all recurring training runs, we need to have a server that outlives individual training jobs. Therefore, the batch size optimizer consists of two parts: the server and the client. The server is a FastAPI server that manages the Multi-Armed Bandit algorithm and the database, while the client (BatchSizeOptimizer) is integrated into the training script.
Deploying the server
Deployment largely consists of three steps: (1) starting the database, (2) running migrations on the database, and (3) starting the BSO server.
Clone the Zeus repository
To get example Docker files and database migration scripts, clone the Zeus repository.

```sh
git clone https://github.com/ml-energy/zeus.git
```
Decide on the database
By default, our examples use MySQL. However, you can use any database supported by SQLAlchemy. Please make sure the database's corresponding async connection driver (e.g., asyncmy for MySQL, aiosqlite for SQLite) is installed. For instance, to adapt our examples, you can (1) add the corresponding pip install command to migration.Dockerfile and server.Dockerfile, and (2) change the db container specification in server-docker-compose.yaml.
Server configuration
You can configure the server using environment variables or the .env file. Below is the complete list of environment variables you can set, with example values.
```sh
ZEUS_BSO_DB_USER="me"
ZEUS_BSO_DB_PASSWORD="secret"
ZEUS_BSO_DATABASE_URL="mysql+asyncmy://me:secret@localhost:3306/Zeus"
ZEUS_BSO_ROOT_PASSWORD="secret*"
ZEUS_BSO_SERVER_PORT=8000
ZEUS_BSO_LOG_LEVEL="INFO"
ZEUS_BSO_ECHO_SQL="True"
```
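If you generate the .env file from a deployment script, a small helper like the following keeps the individual credentials and the composed ZEUS_BSO_DATABASE_URL consistent. The helper name and file layout are my own for illustration; they are not part of Zeus.

```python
from pathlib import Path

def write_bso_env(path: str, user: str, password: str,
                  host: str = "localhost", port: int = 3306, db: str = "Zeus") -> None:
    """Write a .env file for the BSO server, keeping the DB URL in sync with the credentials."""
    lines = [
        f'ZEUS_BSO_DB_USER="{user}"',
        f'ZEUS_BSO_DB_PASSWORD="{password}"',
        f'ZEUS_BSO_DATABASE_URL="mysql+asyncmy://{user}:{password}@{host}:{port}/{db}"',
        'ZEUS_BSO_SERVER_PORT=8000',
        'ZEUS_BSO_LOG_LEVEL="INFO"',
    ]
    Path(path).write_text("\n".join(lines) + "\n")

write_bso_env(".env", user="me", password="secret")
```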
With Docker Compose
```sh
cd docker/batch_size_optimizer
docker-compose -f ./server-docker-compose.yaml up
```

Docker Compose will first build the necessary images and then spin up the containers.
With Kubernetes
- Build the Docker images.

    ```sh
    # From the repository root
    docker build -f ./docker/batch_size_optimizer/server.Dockerfile -t bso-server .
    docker build -f ./docker/batch_size_optimizer/migration.Dockerfile -t bso-migration .
    ```

- Make sure Kubernetes has access to these built images. If you are using minikube locally, the images are already available. However, if you are using a cloud service such as AWS EKS, you should push the images to a registry and modify the image path in server-docker-compose.yaml so that Kubernetes can pull them.

- Convert the Docker Compose file to Kubernetes YAML files using Kompose.

    ```sh
    cd docker/batch_size_optimizer
    docker-compose -f server-docker-compose.yaml config > server-docker-compose-resolved.yaml
    kompose convert -f server-docker-compose-resolved.yaml -o ./kube/
    rm server-docker-compose-resolved.yaml
    ```

    This first resolves env files using docker-compose, and then converts the result into Kubernetes YAML files in ./kube/.

- Apply the Kubernetes YAML files.

    ```sh
    cd kube
    kubectl apply -f .
    ```
With just Python
You can also run the server without Docker or Kubernetes.

- Spin up the database of your choice.

- Set the server configuration environment variables described above.

- Perform the DB migration with Alembic.

    - Install the dependencies.

        ```sh
        # From the repository root
        pip install asyncmy cryptography
        pip install '.[migration]'
        ```

    - Create the migration script. This will create scripts in ./versions.

        ```sh
        cd zeus/optimizer/batch_size
        alembic revision --autogenerate -m "Create tables"
        ```

    - Apply the migration, either online or offline.

        ```sh
        # Online migration (applied directly to the database)
        alembic upgrade head

        # Offline migration (just generate the SQL)
        alembic upgrade head --sql
        ```

- Spin up the server using uvicorn.

    ```sh
    uvicorn zeus.optimizer.batch_size.server.router:app
    ```
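Once the server is up, whichever deployment route you chose, you can sanity-check that it is reachable before wiring it into training jobs. The snippet below is only a TCP-level port check against the host and port you configured; it does not assume any particular HTTP endpoint.

```python
import socket

def server_reachable(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("BSO server reachable:", server_reachable("localhost", 8000))
```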
Integrating BatchSizeOptimizer
In order for your recurring training job to communicate with the BSO server, you need to integrate the BatchSizeOptimizer class into your training script.
- Install the Zeus package, including the dependencies needed for the batch size optimizer.

    ```sh
    pip install zeus-ml[bso]
    ```

- Integrate BatchSizeOptimizer into your training script.

    ```python
    import os

    from torch.utils.data import DataLoader

    from zeus.monitor import ZeusMonitor
    from zeus.optimizer.batch_size import BatchSizeOptimizer, JobSpec

    monitor = ZeusMonitor()

    # On instantiation, the BSO will register the job to the BSO server.
    bso = BatchSizeOptimizer(
        monitor=monitor,
        server_url="http://bso-server:8000",
        job=JobSpec(
            job_id=os.environ.get("ZEUS_JOB_ID"),
            job_id_prefix="mnist",
            default_batch_size=256,
            batch_sizes=[32, 64, 256, 512, 1024, 4096, 2048],
            max_epochs=100,
        ),
    )

    # Get the batch size to use from the server.
    batch_size = bso.get_batch_size()
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size)

    # Start measuring the time and energy consumption of this training run.
    bso.on_train_begin()

    for epoch in range(100):
        for batch in train_dataloader:
            # Training loop
            pass

        # The BSO server needs to know whether training has converged.
        metric = evaluate(model, eval_dataloader)
        bso.on_evaluate(metric)

        # The BSO server will determine whether to stop training.
        if bso.training_finished:
            break
    ```
When does training stop?
The BSO server will determine whether to stop training, and this will be reflected in the training_finished attribute of the BatchSizeOptimizer instance.
If the DNN reaches the target validation metric, training should stop. However, a training run is considered to have failed if

- it failed to converge within the configured JobSpec.max_epochs epochs, or
- its cost exceeded the early stopping threshold configured by JobSpec.beta_knob.
In such failure cases, the optimizer will raise a ZeusBSOTrainFailError.
This means that the chosen batch size was not useful, and the BSO server will never try this batch size again.
The user should re-launch the training run in this case, and the BSO server will try another batch size.
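A launcher for the recurring job can treat this error as "discard this batch size and submit a fresh run." The sketch below shows one way to structure that as a simple retry loop; the train_once() helper, the retry bound, and the exception's import path are assumptions for illustration (check the Zeus API reference for the exact module), and in production the re-launch would typically be a new job submission rather than an in-process loop.

```python
# NOTE: the exact import path of ZeusBSOTrainFailError is an assumption here.
from zeus.optimizer.batch_size.exceptions import ZeusBSOTrainFailError

def train_once() -> None:
    """Hypothetical helper that runs one full training run, as shown in the example above
    (it constructs a fresh BatchSizeOptimizer, so each attempt asks the server for a batch size)."""
    ...

MAX_RELAUNCHES = 5  # arbitrary bound to avoid looping forever

for attempt in range(MAX_RELAUNCHES):
    try:
        train_once()
        break  # the run converged; the server has recorded its cost
    except ZeusBSOTrainFailError:
        # The chosen batch size did not work out. Re-launching lets the
        # server pick a different batch size for the next attempt.
        print(f"Attempt {attempt + 1} failed; re-launching with a new batch size.")
```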
Integration examples
Two full examples are given for the batch size optimizer:
- MNIST: Single-GPU and data parallel training, with integration examples with Kubeflow
- Sentiment Analysis: Full training example with HuggingFace transformers using the Capriccio dataset, a sentiment analysis dataset with data drift.