Skip to the content.

Management API

MMS provides a set of API allow user to manage models at runtime:

  1. Register a model
  2. Increase/decrease number of workers for specific model
  3. Describe a model’s status
  4. Unregister a model
  5. List registered models

Management API is listening on port 8081 and only accessible from localhost by default. To change the default setting, see MMS Configuration.

Similar as Inference API, Management API also provide a API description to describe management APIs with OpenAPI 3.0 specification.

Management APIs

Register a model

POST /models

curl -X POST "http://localhost:8081/models?url=https%3A%2F%2Fs3.amazonaws.com%2Fmodel-server%2Fmodel_archive_1.0%2Fsqueezenet_v1.1.mar"

{
  "status": "Model \"squeezenet_v1.1\" registered"
}

User may want to create workers while register, creating initial workers may take some time, user can choose between synchronous or synchronous call to make sure initial workers are created properly.

The asynchronous call will return before trying to create workers with HTTP code 202:

curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=false&url=https%3A%2F%2Fs3.amazonaws.com%2Fmodel-server%2Fmodel_archive_1.0%2Fsqueezenet_v1.1.mar"

< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 29cde8a4-898e-48df-afef-f1a827a3cbc2
< content-length: 33
< connection: keep-alive
< 
{
  "status": "Worker updated"
}

The synchronous call will return after all workers has be adjusted with HTTP code 200.

curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=true&url=https%3A%2F%2Fs3.amazonaws.com%2Fmodel-server%2Fmodel_archive_1.0%2Fsqueezenet_v1.1.mar"

< HTTP/1.1 200 OK
< content-type: application/json
< x-request-id: c4b2804e-42b1-4d6f-9e8f-1e8901fc2c6c
< content-length: 32
< connection: keep-alive
< 
{
  "status": "Worker scaled"
}

Scale workers

PUT /models/{model_name}

Use the Scale Worker API to dynamically adjust the number of workers to better serve different inference request loads.

There are two different flavour of this API, synchronous vs asynchronous.

The asynchronous call will return immediately with HTTP code 202:

curl -v -X PUT "http://localhost:8081/models/noop?min_worker=3"

< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 74b65aab-dea8-470c-bb7a-5a186c7ddee6
< content-length: 33
< connection: keep-alive
< 
{
  "status": "Worker updated"
}

The synchronous call will return after all workers has be adjusted with HTTP code 200.

curl -v -X PUT "http://localhost:8081/models/noop?min_worker=3&synchronous=true"

< HTTP/1.1 200 OK
< content-type: application/json
< x-request-id: c4b2804e-42b1-4d6f-9e8f-1e8901fc2c6c
< content-length: 32
< connection: keep-alive
< 
{
  "status": "Worker scaled"
}

Describe model

GET /models/{model_name}

Use the Describe Model API to get detail runtime status of a model:

curl http://localhost:8081/models/noop

{
  "modelName": "noop",
  "modelVersion": "snapshot",
  "modelUrl": "noop.mar",
  "engine": "MXNet",
  "runtime": "python",
  "minWorkers": 1,
  "maxWorkers": 1,
  "batchSize": 1,
  "maxBatchDelay": 100,
  "workers": [
    {
      "id": "9000",
      "startTime": "2018-10-02T13:44:53.034Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 89247744
    }
  ]
}

Unregister a model

DELETE /models/{model_name}

Use the Unregister Model API to free up system resources:

curl -X DELETE http://localhost:8081/models/noop

{
  "status": "Model \"noop\" unregistered"
}

List models

GET /models

Use the Models API to query current registered models:

curl "http://localhost:8081/models"

This API supports pagination:

curl "http://localhost:8081/models?limit=2&next_page_token=2"

{
  "nextPageToken": "4",
  "models": [
    {
      "modelName": "noop",
      "modelUrl": "noop-v1.0"
    },
    {
      "modelName": "noop_v0.1",
      "modelUrl": "noop-v0.1"
    }
  ]
}

API Description

OPTIONS /

To view a full list of inference and management APIs, you can use following command:

# To view all inference APIs:
curl -X OPTIONS http://localhost:8080

# To view all management APIs:
curl -X OPTIONS http://localhost:8081

The out is OpenAPI 3.0.1 json format. You use it to generate client code, see swagger codegen for detail.

Example outputs of the Inference and Management APIs: