Tuning on AWS Batch

In this tutorial, we introduce step by step how to set up an AWS Batch environment to tune a large number of workloads in a distributed fashion on platforms supported by AWS EC2:

Steps

1. Set up an AWS account and credentials

To use AWS Batch (and other AWS services used by Lorien, such as DynamoDB and S3), you need to create an AWS account and an IAM user with 1) an access key and 2) a secret access key, which are used to configure the AWS credentials on the host as well as on the worker machines/instances. Configuring the AWS credentials authorizes the AWS client APIs (i.e., the boto3 Python package) to access AWS services on your behalf, so that Lorien can automate everything for you.

After creating an AWS account, please refer to https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html to get your access key and secret access key ready. We will use them to configure Lorien in the last step.
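
If you want to verify that the credentials are configured correctly before moving on, a quick sanity check is to call the STS service with boto3. The sketch below is not part of Lorien; it assumes you have already run aws configure (or exported AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) on the machine:

# check_credentials.py: a sanity-check sketch (not part of Lorien) that verifies
# boto3 can pick up your AWS credentials and reach AWS on your behalf.
import boto3

# STS returns the account and identity behind the configured credentials.
identity = boto3.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Identity ARN:", identity["Arn"])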

2. Prepare Container Images

Since AWS Batch runs every submitted job in a container to guarantee a unified environment, we need to specify an image ID when creating an AWS Batch compute environment. Please note that the Lorien Docker images on Docker Hub only include the dependencies (e.g., Python packages and TVM) but not Lorien itself, so you can use them as a base to create a working image. You can create your own Docker Hub account and push the image (as a private repository), or you can push your images to AWS ECR (Elastic Container Registry). To do so, you first need to authorize the Docker CLI to push your images to AWS ECR. The command for authorization can be generated by the AWS CLI:

# Generate docker login command.
aws ecr get-login --no-include-email
# Out: docker login -u AWS -p password https://<aws_account>.dkr.ecr.<region>.amazonaws.com

# Copy-paste the generated command to login.
docker login -u AWS -p password https://<aws_account>.dkr.ecr.<region>.amazonaws.com

# Change the tag of your image. Replace <repo>:<tag> with your local Docker repository name and tag.
docker tag <repo>:<tag> <aws_account>.dkr.ecr.<region>.amazonaws.com/<ecr_repo>:<tag>

# Push your image to AWS ECR.
docker push <aws_account>.dkr.ecr.<region>.amazonaws.com/<ecr_repo>:<tag>

Note that you can use the AWS ECR console to create a repository to maintain Lorien images.
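
If you prefer to create the ECR repository programmatically instead of using the console, boto3 exposes the same operation. The repository name and region below are examples; any name works as long as it matches the <ecr_repo> used in the docker tag/push commands above:

# create_ecr_repo.py: a sketch that creates a private ECR repository with boto3.
# The repository name "lorien" and the region are examples; use your own values.
import boto3

ecr = boto3.client("ecr", region_name="us-west-2")
resp = ecr.create_repository(repositoryName="lorien")
print("Repository URI:", resp["repository"]["repositoryUri"])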

3. Create an AWS Batch compute environment

Now we can start setting up AWS Batch. The first step is to create a compute environment, which describes the platform details of your submitted jobs. Go to https://<region>.console.aws.amazon.com/batch -> Compute environments -> Create environment and fill out the information. In this example, we create a C5 environment as follows (a boto3 sketch of the equivalent API call appears after this list):

  • Compute environment type: Managed.

  • Compute environment name: lorien-c5-env.

  • Service role: AWSBatchServiceRole. (If this is the first environment you create, leave this box blank and AWS Batch will create the AWSBatchServiceRole IAM role for you.)

  • Instance role: ecsInstanceRole. (If this is the first environment you create, leave this box blank and AWS Batch will create the ecsInstanceRole IAM instance profile for you.)

  • EC2 key pair: N/A. (Leave it blank because we will configure the AWS credentials when launching a container.)

  • Provisioning model: On-Demand. (Although spot instances are much cheaper, they risk being terminated by AWS at any time. Since Lorien tuning jobs usually need minutes to hours, we recommend on-demand instances.)

  • Allowed instance types: c5.4xlarge. (It is not recommended to specify more than one instance type in one compute environment, because in that case we cannot know which type of instance is being used to tune a workload. In other words, you need to create N environments if you need to tune workloads on N types of instances.)

  • Allocation strategy: BEST_FIT.

  • Launch template: N/A. (If your container image is too large, you may encounter job start failures when submitting jobs to AWS Batch. This is because the default Docker image size limit is 10 GiB. In this case, you could follow scripts/aws/create_launch_template.sh to create a launch template that raises the Docker image size limit.)

  • Launch template version: N/A.

  • Minimum vCPUs: 0 (Do not specify a number larger than 0; otherwise you will have that many cores running at all times, even when you do not submit any jobs.)

  • Maximum vCPUs: 256 (You have to do some math. For example, c5.4xlarge has 16 cores, so you will have at most 256/16=16 C5s working. You can refer to https://aws.amazon.com/ec2/instance-types/ for the number of cores on each instance.)

  • Enable user-specified Ami ID: N/A.

Note that we skip the network settings in this tutorial as they depend on your account. You may need to consult your account admins for this part.
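
For reference, the console steps above roughly correspond to the following boto3 call. This is only a sketch: the region, subnet, and security group IDs are placeholders for your own account settings, and the role names assume the defaults mentioned above:

# create_compute_env.py: a sketch of creating the lorien-c5-env compute
# environment with boto3 instead of the console.
import boto3

batch = boto3.client("batch", region_name="us-west-2")  # pick your region
batch.create_compute_environment(
    computeEnvironmentName="lorien-c5-env",
    type="MANAGED",
    state="ENABLED",
    serviceRole="arn:aws:iam::<aws_account>:role/AWSBatchServiceRole",
    computeResources={
        "type": "EC2",                      # on-demand instances
        "allocationStrategy": "BEST_FIT",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["c5.4xlarge"],
        "instanceRole": "ecsInstanceRole",  # ECS instance profile
        "subnets": ["subnet-xxxxxxxx"],         # placeholder: your subnet(s)
        "securityGroupIds": ["sg-xxxxxxxx"],    # placeholder: your security group(s)
    },
)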

4. Create an AWS Batch job queue

After we have the lorien-c5-env compute environment, we can go to Job queues on the AWS Batch console and click Create queue to create a job queue (a boto3 sketch appears after this list).

  • Queue name: lorien-c5-queue.

  • Priority: 1 (I usually use the same priority number for all queues.)

  • Connected compute environments for this queue: lorien-c5-env (You may need to wait for about a minute for the compute environment we just created to appear in the drop-down menu.)
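
The same queue can also be created programmatically. A minimal boto3 sketch, assuming the lorien-c5-env compute environment from the previous step already exists:

# create_job_queue.py: a sketch that attaches a job queue to lorien-c5-env.
import boto3

batch = boto3.client("batch", region_name="us-west-2")  # pick your region
batch.create_job_queue(
    jobQueueName="lorien-c5-queue",
    state="ENABLED",
    priority=1,  # the same priority number for all queues is fine
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "lorien-c5-env"},
    ],
)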

5. Create an AWS Batch job definition

Now we can create a job definition. Go to Job definitions on the AWS Batch console and click Create (a boto3 sketch appears after this list).

  • Job definition name: lorien-job-cpu (Job definitions are independent of job queues, so we can use one job definition for all CPU environments).

  • Job attempts: N/A.

  • Execution timeout: N/A.

  • Job role: N/A.

  • Container image: <aws_account>.dkr.ecr.<region>.amazonaws.com/<ecr_repo>/lorien:cpu-latest (Assuming you pushed your images to AWS ECR.)

  • Command: N/A (Lorien will override this field when submitting jobs.)

  • vCPUs: 16 (We suggest setting this to the number of cores on an instance so that one job occupies one instance.)

  • Memory (MiB): 25000 (You may need to do some math or experiments to find a proper memory capacity. Note that if the memory capacity you put here is larger than the total memory available on an instance, your job will be stuck at RUNNABLE and never start. We suggest putting at most 80% of the instance memory capacity.)

  • Number of GPUs: 0 (If you are working on a GPU job definition, put 1 here).
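
The corresponding boto3 call looks roughly like the following. This is a sketch: the image URI is the one you pushed in step 2, the region is a placeholder, and the vCPU/memory numbers mirror the values above:

# register_job_def.py: a sketch of registering the lorien-job-cpu job definition.
import boto3

batch = boto3.client("batch", region_name="us-west-2")  # pick your region
batch.register_job_definition(
    jobDefinitionName="lorien-job-cpu",
    type="container",
    containerProperties={
        "image": "<aws_account>.dkr.ecr.<region>.amazonaws.com/<ecr_repo>/lorien:cpu-latest",
        "vcpus": 16,      # one job occupies one c5.4xlarge
        "memory": 25000,  # MiB; keep below the instance's total memory
        # "command" is left out on purpose: Lorien overrides it at submission time.
    },
)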

6. Configure Lorien and start tuning

Finally, we configure Lorien accordingly to make use of the AWS Batch environment we just created. Note that if you have no idea how to prepare the workloads for tuning, you can refer to Tuning on Local.

# tune_batch.yaml
batch:
  target: llvm -mcpu=skylake-avx512
  job_queue: lorien-c5-queue
  job_def: lorien-job-cpu:1
tuner: random
ntrial: 3000
commit-table-name: lorien
commit-nbest: 3
commit-log-to: tuned-logs

python3 -m lorien tune @tune_batch.yaml @workloads.yaml

where lorien-job-cpu:1 means we are using Revision 1 of the lorien-job-cpu job definition. You may need to specify the proper revision if you refine the job definition multiple times.

commit-nbest indicates how many of the best configs we will commit to the database, and commit-log-to specifies the S3 bucket where we want to keep the complete tuning logs. Note that tuning logs will not be stored if commit-log-to is unset.
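
After a tuning run, you can check that the complete logs actually landed in S3. A small boto3 sketch, assuming commit-log-to points to an existing S3 bucket named tuned-logs in your account:

# list_tuning_logs.py: a sketch that lists the objects uploaded to the bucket
# given by commit-log-to. The bucket name "tuned-logs" follows the config above.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="tuned-logs")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])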

During the tuning process, in addition to tracking the progress directly with the progress bar Lorien displays, you can also log in to your AWS Batch console and watch the dashboard for live job status. If jobs fail for some reason, you can go to Jobs, click a job ID, and choose View logs for this job in CloudWatch console to see the logs (a boto3 sketch for doing the same from a script follows).
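
If you prefer to check job status and logs from a script rather than the console, the same information is available through boto3. The job ID and region below are placeholders; take a real job ID from the AWS Batch console or the Lorien output:

# inspect_job.py: a sketch that fetches the status and CloudWatch logs of one
# AWS Batch job. Replace <job_id> with a real job ID.
import boto3

batch = boto3.client("batch", region_name="us-west-2")  # pick your region
logs = boto3.client("logs", region_name="us-west-2")

job = batch.describe_jobs(jobs=["<job_id>"])["jobs"][0]
print("Status:", job["status"], job.get("statusReason", ""))

# AWS Batch containers write their stdout/stderr to the /aws/batch/job log group.
stream = job["container"].get("logStreamName")
if stream:
    events = logs.get_log_events(logGroupName="/aws/batch/job", logStreamName=stream)
    for event in events["events"]:
        print(event["message"])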

Meanwhile, all state changes of tuning jobs will be recorded in a lorien-tune-<timestamp>.trace file. In case the master is interrupted and you wish to resume the tuning, you can specify --trace-file in the command so that the tuning master will skip the finished jobs and keep tracking the state of the remaining tuning jobs.

python3 -m lorien tune @tune_batch.yaml @workloads.yaml --trace-file=<trace_file_path>