..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

.. _tune-on-aws-batch:

###################
Tuning on AWS Batch
###################

In this tutorial, we introduce step by step how to set up an AWS Batch environment
to tune a huge number of workloads in a distributed fashion on the platforms
supported by AWS EC2.

*****
Steps
*****

1. Set up an AWS account and credentials
----------------------------------------

To use AWS Batch (and other AWS services such as DynamoDB and S3 used by Lorien),
you need to create an AWS account and an IAM user with 1) an access key ID and
2) a secret access key ready, so that you can configure the AWS credentials on the
host as well as on the worker machines/instances. The purpose of configuring the
AWS credentials is to authorize the AWS client APIs (i.e., the ``boto3`` Python
package) to access AWS services on your behalf, so that Lorien can automate
everything for you. After creating an AWS account, please refer to the AWS IAM
documentation on managing access keys to obtain your access key ID and secret
access key. We will use them to configure Lorien in the last step.
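For example, one common way to make the credentials available to ``boto3`` (and
hence Lorien) is ``aws configure``, which writes ``~/.aws/credentials`` and
``~/.aws/config``. The following is a minimal sketch assuming the AWS CLI is
installed; the key values and the region are placeholders:

.. code-block:: bash

    # Interactive prompt that writes ~/.aws/credentials and ~/.aws/config.
    aws configure
    #   AWS Access Key ID [None]: <access_key_id>
    #   AWS Secret Access Key [None]: <secret_access_key>
    #   Default region name [None]: us-west-2
    #   Default output format [None]: json

    # Alternatively, export the credentials as environment variables so that
    # boto3 can pick them up, e.g., when launching a worker container.
    export AWS_ACCESS_KEY_ID=<access_key_id>
    export AWS_SECRET_ACCESS_KEY=<secret_access_key>
    export AWS_DEFAULT_REGION=us-west-2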
2. Prepare Container Images
---------------------------

Since AWS Batch runs every submitted job in a container to guarantee a unified
environment, we need to specify an image ID when creating an AWS Batch compute
environment. Please note that the Lorien docker images on Docker Hub only include
the dependencies (e.g., Python packages and TVM) but not Lorien itself, so you can
build a working image on top of them. You can create your own Docker Hub account
and push the image (as a private repository), or you can push your images to AWS
ECR (Elastic Container Registry). To do so, you first need to authorize the Docker
CLI to push your images to AWS ECR. The command for authorization can be generated
by the AWS CLI:

.. code-block:: bash

    # Generate the docker login command.
    aws ecr get-login --no-include-email
    # Out: docker login -u AWS -p <password> https://<account_id>.dkr.ecr.<region>.amazonaws.com

    # Copy-paste the generated command to log in.
    docker login -u AWS -p <password> https://<account_id>.dkr.ecr.<region>.amazonaws.com

    # Re-tag your image. Replace <repository>:<tag> with your local Docker repository name and tag.
    docker tag <repository>:<tag> <account_id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>

    # Push your image to AWS ECR.
    docker push <account_id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>

Note that you can use the AWS ECR console to create a repository for maintaining
Lorien images.

3. Create an AWS Batch compute environment
-------------------------------------------

Now we can start setting up AWS Batch. The first step is to create a compute
environment, which describes the platform details of your submitted jobs. Go to
:samp:`https://{region}.console.aws.amazon.com/batch` -> ``Compute environments``
-> ``Create environment`` and fill out the information. In this example, we create
a C5 environment as follows:

- **Compute environment type**: ``Managed``.
- **Compute environment name**: ``lorien-c5-env``.
- **Service role**: ``AWSBatchServiceRole``. (If this is the first environment you
  create, leave this box blank and AWS Batch will create the ``AWSBatchServiceRole``
  IAM role for you.)
- **Instance role**: ``ecsInstanceRole``. (If this is the first environment you
  create, leave this box blank and AWS Batch will create the ``ecsInstanceRole``
  IAM instance profile for you.)
- **EC2 key pair**: ``N/A``. (Leave it blank because we will configure the AWS
  credentials when launching a container.)
- **Provisioning model**: ``On-Demand``. (Although spot instances are much cheaper,
  they may be terminated by AWS at any time. Since a Lorien tuning job usually needs
  minutes to hours, we recommend on-demand instances.)
- **Allowed instance types**: ``c5.4xlarge``. (It is not recommended to specify more
  than one instance type in one compute environment, because then we cannot know
  which instance type is being used to tune a workload. In other words, you need to
  create ``N`` environments if you need to tune workloads on ``N`` instance types.)
- **Allocation strategy**: ``BEST_FIT``.
- **Launch template**: ``N/A``. (If your container image is too large, you may
  encounter job start failures when submitting jobs to AWS Batch, because the
  default docker image size limit is 10 GiB. In this case, you can follow
  ``scripts/aws/create_launch_template.sh`` to create a launch template that raises
  the docker image size limit.)
- **Launch template version**: ``N/A``.
- **Minimum vCPUs**: ``0``. (Do not specify a number larger than 0; otherwise the
  minimum number of cores keeps running at all times even if you do not submit any
  jobs.)
- **Maximum vCPUs**: ``256``. (You have to do some math. For example, ``c5.4xlarge``
  has ``16`` vCPUs, so you will have at most ``256/16=16`` C5 instances working.
  You can refer to the AWS EC2 instance types page for the number of vCPUs on each
  instance.)
- **Enable user-specified AMI ID**: ``N/A``.

Note that we skip the network settings in this tutorial as they depend on your
account. You may need to consult your account admin for this part.

4. Create an AWS Batch job queue
--------------------------------

Now that we have the ``lorien-c5-env`` compute environment, we can go to
``Job queues`` on the AWS Batch console and click ``Create queue`` to create a job
queue. (Both this step and the previous one can also be done with the AWS CLI, as
sketched after the following list.)

- **Queue name**: ``lorien-c5-queue``.
- **Priority**: ``1``. (We usually use the same priority for all queues.)
- **Connected compute environments for this queue**: ``lorien-c5-env``. (You may
  need to wait for about a minute for the compute environment we just created to
  appear in the drop-down menu.)
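The following is a minimal AWS CLI sketch that mirrors the console settings above.
It assumes the ``AWSBatchServiceRole`` and ``ecsInstanceRole`` roles already exist;
the subnet and security group IDs are account-specific placeholders you need to
fill in:

.. code-block:: bash

    # Create the managed, on-demand C5 compute environment described in step 3.
    # The role names may also be given as full IAM ARNs.
    aws batch create-compute-environment \
        --compute-environment-name lorien-c5-env \
        --type MANAGED \
        --service-role AWSBatchServiceRole \
        --compute-resources type=EC2,allocationStrategy=BEST_FIT,minvCpus=0,maxvCpus=256,instanceTypes=c5.4xlarge,instanceRole=ecsInstanceRole,subnets=<subnet_id>,securityGroupIds=<security_group_id>

    # Create the job queue and connect it to the compute environment.
    aws batch create-job-queue \
        --job-queue-name lorien-c5-queue \
        --priority 1 \
        --compute-environment-order order=1,computeEnvironment=lorien-c5-env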
5. Create an AWS Batch job definition
-------------------------------------

Now we can create a job definition. Go to ``Job definitions`` on the AWS Batch
console and click ``Create``.

- **Job definition name**: ``lorien-job-cpu``. (A job definition is independent of
  job queues, so we can use one job definition for all CPU environments.)
- **Job attempts**: ``N/A``.
- **Execution timeout**: ``N/A``.
- **Job role**: ``N/A``.
- **Container image**: ``<account_id>.dkr.ecr.<region>.amazonaws.com/<repository>/lorien:cpu-latest``.
  (Assuming you pushed your images to AWS ECR.)
- **Command**: ``N/A``. (Lorien will override this field when submitting jobs.)
- **vCPUs**: ``16``. (We suggest using the vCPU count of an instance so that one
  job occupies one instance.)
- **Memory (MiB)**: ``25000``. (You may need to do some math or experiments to find
  a proper memory capacity. Note that if the memory capacity you put here is larger
  than the total memory available on an instance, your job will be stuck at
  ``RUNNABLE`` and never get started. We suggest putting at most 80% of the instance
  memory capacity.)
- **Number of GPUs**: ``0``. (If you are working on a GPU job definition, put ``1``
  here.)

6. Configure Lorien and start tuning
------------------------------------

Finally, we configure Lorien accordingly to make use of the AWS Batch environment
we just created. Note that if you have no idea how to prepare the workloads for
tuning, you can refer to :ref:`tune-on-local`.

.. code-block:: yaml

    # tune_batch.yaml
    batch:
      target: llvm -mcpu=skylake-avx512
      job_queue: lorien-c5-queue
      job_def: lorien-job-cpu:1
    tuner: random
    ntrial: 3000
    commit-table-name: lorien
    commit-nbest: 3
    commit-log-to: tuned-logs

.. code-block:: bash

    python3 -m lorien tune @tune_batch.yaml @workloads.yaml

where ``lorien-job-cpu:1`` means we are using revision 1 of the ``lorien-job-cpu``
job definition. You may need to specify the proper revision if you refine the job
definition multiple times. ``commit-nbest`` indicates how many best configs we will
commit to the database, and ``commit-log-to`` specifies the S3 bucket in which we
want to keep the complete tuning logs. Note that tuning logs will not be stored if
``commit-log-to`` is unset.

During the tuning process, in addition to tracking the progress directly with the
progress bar Lorien displays, you can also log in to your AWS Batch console and
check the dashboard for live job status. If a job fails for some reason, you can go
to ``Jobs``, click the job ID, and choose ``View logs for this job in CloudWatch
console`` to see its logs (a CLI sketch for inspecting failed jobs is given at the
end of this tutorial). Meanwhile, all state changes of tuning jobs are recorded in
a ``lorien-tune-*.trace`` file. In case the master was interrupted and you wish to
resume the tuning, you can specify ``--trace-file`` in the command, so that the
tuning master will skip the finished jobs and keep tracking the state of the
remaining tuning jobs.

.. code-block:: bash

    python3 -m lorien tune @tune_batch.yaml @workloads.yaml --trace-file=<trace_file>
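If you prefer to inspect job states and logs without the console, the following is
a minimal sketch using the AWS CLI; ``<job_id>`` and ``<log_stream_name>`` are
placeholders taken from the outputs of the previous commands:

.. code-block:: bash

    # List the jobs in the queue that ended up in the FAILED state.
    aws batch list-jobs --job-queue lorien-c5-queue --job-status FAILED

    # Show the details of one job, including the status reason and the
    # CloudWatch log stream name.
    aws batch describe-jobs --jobs <job_id>

    # Fetch the job logs from CloudWatch (AWS Batch writes them to the
    # /aws/batch/job log group by default).
    aws logs get-log-events --log-group-name /aws/batch/job --log-stream-name <log_stream_name>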