
Serverless Inference with MMS on FARGATE

This is a self-contained, step-by-step guide that shows how to create, launch, and serve your deep learning models with MMS in a production setup. In this document you will learn how to launch MMS with AWS Fargate in order to achieve serverless inference.

Prerequisites

Even though this guide is fully self-contained, we do expect the reader to have some basic familiarity with Docker and AWS (in particular ECS and Fargate).

Since we are doing inference, we need a pre-trained model that we can use to run inference. For the sake of this article, we will be using the SqueezeNet model. In short, SqueezeNet is a model that allows you to recognize objects in a picture.

Now that we have chosen the model, let’s discuss at a high level what our purely container-based solution will look like:

(Architecture diagram: MMS tasks running on AWS Fargate behind an Application Load Balancer)

In this document we are going to walk you through all the steps of setting up MMS 1.0 on AWS Fargate. The steps in this process are as follows:

  1. Familiarize yourself with MMS containers
  2. Create a SqueezeNet task definition (with the docker container of MMS)
  3. Create AWS Fargate cluster
  4. Create Application Load Balancer
  5. Create Squeezenet Fargate service on the cluster
  6. Profit!

Let the show begin…

Familiarize Yourself With Our Containers

With the current release of MMS, 1.0, official pre-configured and optimized container images of MMS are provided on Docker Hub.

docker pull awsdeeplearningteam/multi-model-server

# For the GPU image, use the following command:
docker pull awsdeeplearningteam/multi-model-server:latest-gpu

In this article we are going to use the official CPU container image.

One major constraint of using the Fargate service is that there is currently no GPU support on Fargate.

The model-server container comes with a configuration file pre-baked inside the container. It is highly recommended that you understand all the parameters of the MMS configuration file. Familiarize yourself with the MMS configuration and the configuring MMS Container docs. When you want to launch and host your custom model, you will have to update this configuration.
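If you do need to change the configuration, one option is to supply your own config.properties and point MMS at it. The following is a minimal sketch, not a verified recipe: the keys shown are common MMS options and the --mms-config flag is part of the MMS CLI, but you should check the MMS configuration docs for the authoritative list of keys and the exact paths used by the official image.

# Write a custom MMS configuration (values shown are purely illustrative).
cat > config.properties <<'EOF'
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
default_workers_per_model=4
load_models=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
EOF

# Mount the file into the container and start MMS with --mms-config.
# As with the ENTRYPOINT used later, whether the container stays in the
# foreground depends on the image's entrypoint behavior.
docker run --rm -p 8080:8080 \
    -v "$(pwd)/config.properties:/tmp/config.properties" \
    awsdeeplearningteam/multi-model-server \
    multi-model-server --start --mms-config /tmp/config.properties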

In this tutorial, we will use the SqueezeNet model from the following S3 link.

https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar

Since MMS can consume model files directly from S3 buckets, we don’t need to bake the actual model files into the container.

The last question that we need to address is how to start MMS within the container. The answer is very simple: you just need to set the following ENTRYPOINT:

multi-model-server --start --models https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar

You will now have a running container serving the SqueezeNet model.
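Before moving to Fargate, you can optionally sanity-check the image and ENTRYPOINT on your own machine. The sketch below simply reuses the command above; whether the container stays up depends on how the image's entrypoint handles the backgrounded --start process, so treat it as illustrative and consult the MMS container docs if the container exits immediately.

# Run the same image and command locally, exposing port 8080.
docker run -d --name mms-test -p 8080:8080 \
    awsdeeplearningteam/multi-model-server \
    multi-model-server --start --models https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar

# Give MMS a few seconds to download the model archive, then check health:
curl http://127.0.0.1:8080/ping

# Clean up when you are done.
docker stop mms-test && docker rm mms-test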

At this point, you are ready to start creating the actual task definition.

Note: To start multiple models with the model server, you can pass multiple model URLs to the same command.

# Example: the following command starts the model server with the ResNet-18 and SqueezeNet V1 models
$ multi-model-server --start --models https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar https://s3.amazonaws.com/model-server/model_archive_1.0/resnet-18.mar
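MMS also accepts a name=URL form for --models, which sets the prediction endpoint name explicitly instead of deriving it from the archive file name (squeezenet_v1.1 in this walkthrough). A small sketch; verify the exact syntax against the MMS CLI docs:

# Optional: give the model an explicit endpoint name, so predictions are served
# at /predictions/squeezenet instead of /predictions/squeezenet_v1.1.
multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar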

Create an AWS Fargate task to serve the SqueezeNet model

This is the first step towards getting your own “inference service” up and running in a production setup.

  1. Log in to the AWS console, go to Elastic Container Service -> Task Definitions and click “Create new Task Definition”:


  2. Now you need to specify the type of the task; you will be using the FARGATE launch type:

  3. The task requires some configuration; let’s look at it step by step. First, set the name:

Now comes an important part: you need to create an IAM role that will be used to publish metrics to CloudWatch:

The containers are optimized for 8 vCPUs; however, in this example you are going to use a slightly smaller task with 4 vCPUs and 8 GB of RAM:

  4. Now it is time to configure the actual container that the task should be executing.


Note: If you are using a custom container, make sure to first upload your container to Amazon ECR or Docker Hub and replace the link in this step with the link to your uploaded container.

  5. The next task is to specify the port mapping. You need to expose container port 8080. This is the port that the MMS application inside the container listens on. If needed, it can be changed in the MMS configuration file.

Next, you will have to configure the health checks. This is the command that ECS runs to find out whether MMS is running within the container. MMS has a pre-configured endpoint /ping that can be used for health checks. Configure ECS to reach that endpoint at http://127.0.0.1:8080/ping using the curl command as shown below:

curl, http://127.0.0.1:8080/ping

The healthcheck portion of your container configuration should look like the image below:

After configuring the health checks, you can go on to configuring the environment, with the entry point that we discussed earlier:

Everything else can be left as default, so feel free to click Create to create your very first AWS Fargate task. If everything is OK, you should now be able to see your task in the list of task definitions.
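If you prefer to script this step, the same configuration can be captured as a task-definition JSON and registered with the AWS CLI. The sketch below mirrors the settings used above (Fargate compatibility, 4 vCPUs and 8 GB of memory, container port 8080, the /ping health check, and the ENTRYPOINT from the previous section); the family and container names, the account ID, and the execution role ARN are placeholders you would adjust for your own account.

# Hypothetical task definition mirroring the console steps above.
# Names, the account ID and the execution role ARN are placeholders.
cat > mms-task-def.json <<'EOF'
{
  "family": "mms-squeezenet",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "4096",
  "memory": "8192",
  "executionRoleArn": "arn:aws:iam::<account-id>:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "mms-squeezenet",
      "image": "awsdeeplearningteam/multi-model-server",
      "essential": true,
      "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
      "entryPoint": [
        "multi-model-server", "--start", "--models",
        "https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar"
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://127.0.0.1:8080/ping || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
EOF

# Register the task definition.
aws ecs register-task-definition --cli-input-json file://mms-task-def.json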

In ECS, services are created to run tasks. A service is in charge of running multiple tasks: making sure that the required number of tasks is always running, restarting unhealthy tasks, and adding more tasks when needed.

To make your inference service accessible over the Internet, you need to configure a load balancer (LB). This LB will be in charge of serving traffic from the Internet and redirecting it to the newly created tasks. Let’s create an Application Load Balancer now:

Create a Load Balancer

AWS supports several different types of load balancers: Application, Network, and Classic Load Balancers.

For your cluster you are going to use an Application Load Balancer.

  1. Log in to the EC2 Console.
  2. Go to the “Load balancers” section.
  3. Click on Create new Load Balancer.

  4. Choose Application Load Balancer.

  5. Set all the required details. Make a note of the LB’s VPC. This is important since the LB’s VPC and the ECS cluster’s VPC need to be the same for them to communicate with each other.

  6. Next is configuring the security group. This is also important: your security group should accept inbound traffic from the Internet on the load balancer’s listener port and be allowed to pass traffic on to the tasks in your cluster.

  7. The routing configuration is simple. Here you need to create a “target group”. In your case, however, the AWS Fargate service that you will create later will automatically create its own target group, so you will create a dummy “target group” here and delete it after the LB is created.

  8. Nothing needs to be done for the last two steps. Finish the creation and …
  9. Now you are ready to remove the dummy listener and target group (a rough AWS CLI equivalent of these steps is sketched below).
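Here is that rough AWS CLI equivalent. All IDs, names, and ARNs are placeholders; note that target groups used with Fargate tasks must have target type "ip".

# Create an Application Load Balancer in the same VPC/subnets the service will use.
aws elbv2 create-load-balancer \
    --name mms-inference-lb \
    --type application \
    --subnets subnet-aaaa1111 subnet-bbbb2222 \
    --security-groups sg-0123456789abcdef0

# Create the (dummy) target group; Fargate tasks require target-type "ip".
aws elbv2 create-target-group \
    --name mms-dummy-tg \
    --protocol HTTP --port 8080 \
    --target-type ip \
    --vpc-id vpc-0123456789abcdef0

# Add a listener on port 80 that forwards to the target group.
aws elbv2 create-listener \
    --load-balancer-arn <load-balancer-arn> \
    --protocol HTTP --port 80 \
    --default-actions Type=forward,TargetGroupArn=<target-group-arn>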

Now that you are done-done-done with the load balancer creation, let’s move on to creating our serverless inference service.

Creating an ECS Service to launch our AWS Fargate task

  1. Go to Elastic Container Service → Task Definitions and select the task definition’s name. Click on Actions and select Create Service.

  2. There are two important settings in the first step (apart from naming): the launch type, which should be FARGATE, and the number of tasks that the service should keep running.

  3. Now it is time to configure the VPC and the security group. You should use the same VPC that was used for the LB (and the same subnets!).

  4. As for the security group, it should be either the same security group that you used for the LB, or one that accepts traffic from the LB’s security group.

  5. Now you can connect your service to the LB that was created in the previous section. Select “Application Load Balancer” and set the LB name:

  6. Now you need to specify which port on the LB your service should be listening on:

  7. You are not going to use service discovery now, so uncheck it:

  8. In this document, we are not using the auto-scaling options. For an actual production system, it is advisable to have this configured.

  9. Now you are done-done-done creating a running service (a rough AWS CLI equivalent of the service creation is sketched below). You can move on to the final chapter of the journey, which is testing the service you created.
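Here is that CLI sketch. The cluster name, subnets, security group, target group ARN, and service name are placeholders, and the container name and port must match the ones in your task definition.

# Create the Fargate service behind the load balancer (values are placeholders).
aws ecs create-service \
    --cluster <your-fargate-cluster> \
    --service-name squeezenet-inference \
    --task-definition mms-squeezenet \
    --desired-count 2 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-aaaa1111,subnet-bbbb2222],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}" \
    --load-balancers "targetGroupArn=<target-group-arn>,containerName=mms-squeezenet,containerPort=8080"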

Test your service

First, find the DNS name of your LB. It is shown in the AWS console under Services -> EC2 -> Load Balancers; click on the LB that you created.

Now you can run the health check against this public DNS name of the load balancer to verify that your newly created service is working:

curl InfraLb-1624382880.us-east-1.elb.amazonaws.com/ping 
{
    "status": "Healthy!"
}

And now you are finally ready to run inference! Let’s download an example image:

curl -O https://s3.amazonaws.com/model-server/inputs/kitten.jpg

The downloaded image (kitten.jpg) is a picture of a kitten.

Run the prediction request below; the output of this query should be as follows:

curl -X POST InfraLb-1624382880.us-east-1.elb.amazonaws.com/predictions/squeezenet_v1.1 -F "data=@kitten.jpg"
{
      "prediction": [
    [
      {
        "class": "n02124075 Egyptian cat",
        "probability": 0.8515275120735168
      },
      {
        "class": "n02123045 tabby, tabby cat",
        "probability": 0.09674164652824402
      },
      {
        "class": "n02123159 tiger cat",
        "probability": 0.03909163549542427
      },
      {
        "class": "n02128385 leopard, Panthera pardus",
        "probability": 0.006105933338403702
      },
      {
        "class": "n02127052 lynx, catamount",
        "probability": 0.003104303264990449
      }
    ]
  ]
}

Instead of a Conclusion

There are a few useful things that we have not covered here, such as auto-scaling the service, tuning the MMS configuration, and serving your own custom models.

Each of these topics requires its own article, so stay tuned!

Authors