Launching TensorFlow distributed training easily with Horovod or Parameter Servers in Amazon SageMaker

Amazon SageMaker supports all the popular deep learning frameworks, including TensorFlow. Over 85% of TensorFlow projects in the cloud run on AWS. Many of these projects already run in Amazon SageMaker. This is due to the many conveniences Amazon SageMaker provides for TensorFlow model hosting and training, including fully managed distributed training with Horovod and parameter servers.

Customers are increasingly interested in training models on large datasets, which can take a week or more. In these cases, you might be able to speed the process by distributing training on multiple machines or processes in a cluster. This post discusses how Amazon SageMaker helps you set up and launch distributed training with TensorFlow quickly, without the expense and difficulty of directly managing your training clusters.

Starting with TensorFlow version 1.11, you can use Amazon SageMaker prebuilt TensorFlow containers: Simply provide a Python training script, specify hyperparameters, and indicate your training hardware configuration. Amazon SageMaker does the rest, including spinning up a training cluster and tearing down the cluster when training ends. This feature is called “script mode.” Script mode currently supports two distributed training approaches out-of-the-box:

  • Option #1: TensorFlow’s native parameter server (TensorFlow versions 1.11 and above)
  • Option #2: Horovod (TensorFlow versions 1.12 and above)

In the following sections, we provide an overview of the steps required to enable these TensorFlow distributed training options in Amazon SageMaker script mode.

Option #1: Parameter servers

One common pattern in distributed training is to use one or more dedicated processes to collect gradients computed by “worker” processes, aggregate them, and then distribute the updated model parameters back to the workers in an asynchronous manner. These dedicated processes are known as parameter servers.

In a TensorFlow parameter server cluster in Amazon SageMaker script mode, each instance in the cluster runs one parameter server process and one worker process. Each parameter server communicates with all workers (“all-to-all”), as shown in the following diagram (from Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow):

In Amazon SageMaker script mode, the implementation of parameter servers is asynchronous: each worker computes gradients and submits gradient updates to the parameter servers independently, without waiting for the other workers’ updates.

In practice, asynchronous updates usually don’t have an overly adverse impact. Workers that fall behind might submit stale gradients, which can negatively affect training convergence. Generally, this can be managed by reducing the learning rate. On the plus side, because there is no waiting for other workers, asynchronous updates can result in faster training.
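To make the effect of stale gradients concrete, here is a toy simulation in plain Python. It is not TensorFlow or Amazon SageMaker code: it assumes a one-parameter loss f(w) = 0.5 * w^2 (whose gradient is simply w), and the ParameterServer class and function names are hypothetical.

```python
class ParameterServer:
    """Toy parameter server holding a single weight (illustrative only)."""

    def __init__(self, weight, learning_rate):
        self.weight = weight
        self.learning_rate = learning_rate

    def pull(self):
        # A worker fetches the current weight.
        return self.weight

    def push(self, gradient):
        # Apply a gradient as soon as it arrives, without waiting for other workers.
        self.weight -= self.learning_rate * gradient


def gradient(weight):
    # Gradient of the toy loss f(w) = 0.5 * w**2.
    return weight


def run_with_stale_worker(learning_rate, start=4.0):
    server = ParameterServer(start, learning_rate)
    w_a = server.pull()         # worker A reads the weight
    w_b = server.pull()         # worker B reads the same weight
    server.push(gradient(w_a))  # A's update arrives first
    server.push(gradient(w_b))  # B's gradient was computed on an outdated weight
    return server.weight


print(run_with_stale_worker(0.75))  # -2.0: the stale update overshoots the minimum at 0
print(run_with_stale_worker(0.25))  # 2.0: a smaller learning rate keeps the update safe
```

In this toy setting, the second worker's gradient no longer matches the weight it is applied to, and lowering the learning rate limits the damage, which mirrors the guidance above.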

If you use Amazon SageMaker script mode, you don’t have to set up and manage the parameter server cluster yourself. The Amazon SageMaker prebuilt TensorFlow container comes with a built-in script mode option for use with parameter servers. Using this option saves time and spares you the complexities of cluster management.

The following code example shows how to set up a parameter server cluster with script mode. Specify “parameter_server” as the value in the distributions parameter of an Amazon SageMaker TensorFlow Estimator object. Amazon SageMaker script mode then launches a parameter server thread on each instance in the training cluster and executes your training code in a separate worker thread on each instance. To run a distributed training job with multiple instances, set train_instance_count to a number larger than 1.

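from sagemaker.tensorflow import TensorFlow

ps_instance_type = 'ml.p3.2xlarge'
ps_instance_count = 2

distributions = {'parameter_server': {
                    'enabled': True}
                }

hyperparameters = {'epochs': 60, 'batch-size' : 256}

# entry_point and role are placeholders: supply your own training script and IAM role.
# Other Estimator arguments (such as the framework version) are omitted here.
estimator_ps = TensorFlow( base_job_name='ps-cifar10-tf',
                           entry_point='[your training script]',
                           role='[SageMaker-compatible IAM Role ARN]',
                           hyperparameters=hyperparameters,
                           train_instance_count=ps_instance_count,
                           train_instance_type=ps_instance_type,
                           distributions=distributions )

# start training; inputs can be in Amazon S3, Amazon EFS, or Amazon FSx for Lustre
# (inputs is assumed to be defined elsewhere as the location of your training data)
estimator_ps.fit(inputs)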

For an example of how to use parameter server-based distributed training with script mode, see our TensorFlow Distributed Training Options example on GitHub.

Option #2: Horovod

Horovod is an open source framework for distributed deep learning. It is available for use with TensorFlow and several other deep learning frameworks. As with parameter servers, Amazon SageMaker automates Horovod cluster setup and runs the appropriate commands to make sure that training goes smoothly without the need for you to manage clusters directly yourself.

Horovod’s cluster architecture differs from the parameter server architecture. Recall that the parameter server architecture uses the all-to-all communication model, where the amount of data sent is proportional to the number of processes. By contrast, Horovod uses Ring-AllReduce, where the amount of data sent is more nearly proportional to the number of cluster nodes, which can be more efficient when training with a cluster where each node has multiple GPUs (and thus multiple worker processes).

Additionally, whereas the parameter server update process described above is asynchronous, in Horovod updates are synchronous. After all processes have completed their calculations for the current batch, gradients calculated by each process circulate around the ring until every process has a complete set of gradients for the batch from all processes.

At that time, each process updates its local model weights, so every process has the same model weights before starting work on the next batch. The following diagram shows how Ring-AllReduce works (from Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow):

Horovod employs Message Passing Interface (MPI), a popular standard for managing communication between nodes in a high-performance cluster, and uses NVIDIA’s NCCL library for GPU-level communication.
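The ring exchange described above can be sketched as a single-process simulation in plain Python. This is not Horovod's implementation (which relies on MPI and NCCL); the function name and the chunking scheme are assumptions made for illustration. Each simulated worker starts with its own gradient list and, after two passes around the ring, every worker holds the element-wise sum.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient lists across workers using a simulated ring."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split every vector into n contiguous chunks; bounds[k]..bounds[k+1] is chunk k.
    bounds = [(k * length) // n for k in range(n + 1)]
    data = [list(g) for g in worker_grads]  # work on copies

    def chunk(k):
        return range(bounds[k], bounds[k + 1])

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            k = (i - step) % n
            sends.append((k, [data[i][j] for j in chunk(k)]))
        for i, (k, values) in enumerate(sends):
            right = (i + 1) % n  # each worker only talks to its right neighbor
            for j, v in zip(chunk(k), values):
                data[right][j] += v

    # Phase 2 (allgather): pass the finished chunks around the ring so that
    # every worker ends up with every summed chunk.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            k = (i + 1 - step) % n
            sends.append((k, [data[i][j] for j in chunk(k)]))
        for i, (k, values) in enumerate(sends):
            right = (i + 1) % n
            for j, v in zip(chunk(k), values):
                data[right][j] = v
    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)  # every worker ends with [111, 222, 333]
```

Dividing each summed value by the number of workers would give the averaged gradients. Note that in each step a worker sends only one chunk to one neighbor, rather than its full gradient set to every other process.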

The Horovod framework eliminates many of the difficulties of Ring-AllReduce cluster setup and works with several popular deep learning frameworks and APIs. For example, if you are using the popular Keras API, you can use either the reference Keras implementation or tf.keras directly with Horovod without converting to an intermediate API such as tf.Estimator.

In Amazon SageMaker script mode, Horovod is available for TensorFlow version 1.12 or newer. When you use Horovod in script mode, the Amazon SageMaker TensorFlow container sets up the MPI environment and executes the mpirun command to start jobs on the cluster nodes. To enable Horovod in script mode, you must change the Amazon SageMaker TensorFlow Estimator and your training script. To configure training with Horovod, specify the following fields in the distributions parameter of the Estimator:

  • enabled (bool): If set to True, MPI is set up and the mpirun command executes.
  • processes_per_host (int): Number of processes MPI should launch on each host. Set this flag for multi-GPU training.
  • custom_mpi_options (str): Any mpirun flags passed in this field are added to the mpirun command and executed by Amazon SageMaker for Horovod training.

The number of processes MPI launches on each host should not be greater than the available slots on the selected instance type.

For example, here’s how to create an Estimator object to launch Horovod distributed training on two hosts with one GPU/process each:

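from sagemaker.tensorflow import TensorFlow

hvd_instance_type = 'ml.p3.2xlarge'
hvd_processes_per_host = 1
hvd_instance_count = 2

distributions = {'mpi': {
                    'enabled': True,
                    'processes_per_host': hvd_processes_per_host,
                    'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'
                        }
                }

hyperparameters = {'epochs': 60, 'batch-size' : 256}

# entry_point and role are placeholders: supply your own training script and IAM role.
# Other Estimator arguments (such as the framework version) are omitted here.
estimator_hvd = TensorFlow(base_job_name='hvd-cifar10-tf',
                           entry_point='[your training script]',
                           role='[SageMaker-compatible IAM Role ARN]',
                           hyperparameters=hyperparameters,
                           train_instance_count=hvd_instance_count,
                           train_instance_type=hvd_instance_type,
                           distributions=distributions)

# start training; inputs can be in Amazon S3, Amazon EFS, or Amazon FSx for Lustre
# (inputs is assumed to be defined elsewhere as the location of your training data)
estimator_hvd.fit(inputs)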

Besides modifying the Estimator object, you also must make the following additions to the training script. You can make these changes conditional based on whether MPI is enabled.

  1. Run hvd.init().
  2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. With the typical setup of one GPU per process, you can set this to local rank. In that case, the first process on the server allocates the first GPU, second process allocates the second GPU, and so forth.
  3. Scale the learning rate by number of workers. Effective batch size in synchronous distributed training should scale by the number of workers. An increase in learning rate compensates for the increased batch size.
  4. Wrap the optimizer in hvd.DistributedOptimizer. The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce, and then applies those averaged gradients.
  5. Add the code hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes. This initial broadcast makes sure that all workers receive consistent initialization (with random weights or restored from a checkpoint) when training starts. Alternatively, if you’re not using MonitoredTrainingSession, you can execute the hvd.broadcast_global_variables op after global variables initialize.
  6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them. To do this, pass checkpoint_dir=None to tf.train.MonitoredTrainingSession if hvd.rank() != 0.
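The six steps above can be sketched as follows for a TensorFlow 1.x training script. This is a minimal outline under stated assumptions, not the code from the referenced example: build_loss is a hypothetical callable that constructs your model and returns a loss tensor, and the optimizer, base learning rate, and checkpoint path are illustrative choices. TensorFlow and Horovod are imported inside the function so that the two small helpers can be used on their own.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Step 3: scale the learning rate by the number of workers."""
    return base_lr * num_workers


def checkpoint_dir_for(rank, checkpoint_path):
    """Step 6: only worker 0 gets a checkpoint directory."""
    return checkpoint_path if rank == 0 else None


def build_training_session(build_loss, base_lr=0.01, checkpoint_path='/tmp/checkpoints'):
    """Wire steps 1-6 together; build_loss is a hypothetical user-supplied callable."""
    # Imported lazily so the helpers above work without TensorFlow or Horovod installed.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()  # Step 1

    # Step 2: pin one GPU per process, chosen by local rank.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    loss = build_loss()
    global_step = tf.train.get_or_create_global_step()

    # Steps 3 and 4: scaled learning rate, then wrap the optimizer.
    optimizer = tf.train.GradientDescentOptimizer(
        scaled_learning_rate(base_lr, hvd.size()))
    optimizer = hvd.DistributedOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=global_step)

    # Step 5: broadcast initial variable states from rank 0.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]

    # Step 6: checkpoint_dir is None on every worker except rank 0.
    session = tf.train.MonitoredTrainingSession(
        checkpoint_dir=checkpoint_dir_for(hvd.rank(), checkpoint_path),
        hooks=hooks,
        config=config)
    return session, train_op
```

A training loop would then run train_op inside the returned session until session.should_stop() is true.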

Find more details about Horovod at the Horovod GitHub Repository. For an example of Horovod usage with script mode, see our TensorFlow Distributed Training Options example on GitHub.

Choosing a distributed training option

Before moving to distributed training in a cluster, make sure that you have first tried scaling up on a single machine with multiple GPUs. Communication between multiple GPUs on a single machine is faster than communicating across a network between multiple machines. For more details, see the AWS whitepaper Power Machine Learning at Scale.

If you must scale out to a cluster instead of scaling up with more GPUs within a single machine, the next consideration is whether to choose the parameter server option or Horovod. This choice partly depends on the version of TensorFlow that you are using.

  • For TensorFlow versions 1.11 and newer in Amazon SageMaker script mode, you can use parameter servers.
  • To use Horovod, you must use TensorFlow versions 1.12 or newer.

The following chart summarizes some general guidelines regarding performance for each option. These rules aren’t absolute, and ultimately, the best choice depends on the specific use case. Typically, the performance significantly depends on how long it takes to share gradient updates during training. In turn, this is affected by the model size, gradients size, GPU specifications, and network speed.

Relatively long time to share gradients (larger number of gradients / bigger model size):

  • Better CPU performance: Parameter server
  • Better GPU performance: Parameter server, OR Horovod on a single instance with multiple GPUs

Relatively short time to share gradients (smaller number of gradients / smaller model size):

  • Better CPU performance: Parameter server
  • Better GPU performance: Horovod

Complexity is another consideration. Parameter servers are straightforward to use for one GPU per instance. However, to use multi-GPU instances, you must set up multiple towers, with each tower assigned to a different GPU. A “tower” is a function for computing inference and gradients for a single model replica, which in turn is a copy of a model training on a subset of the complete dataset. Towers involve a form of data parallelism. Horovod also employs data parallelism but abstracts away the implementation details.

Finally, cluster size makes a difference. Given larger clusters with many GPUs, parameter server all-to-all communication can overwhelm network bandwidth. Reduced scaling efficiency can result, among other adverse effects. In such situations, you might find Horovod a better option.

Additional considerations

The example code for this post uses one large TFRecord file containing the CIFAR-10 dataset, which is relatively small. However, larger datasets might require that you shard the data into multiple files, particularly if Pipe Mode is used (see the second bullet following). You can shard the data by specifying an Amazon S3 data source as a manifest file or by using ShardedByS3Key. Also, Amazon SageMaker provides other ways to make distributed training more efficient for very large datasets:

  • VPC training: Performing Horovod training inside a VPC improves the network latency between nodes, leading to higher performance and stability of Horovod training jobs. To learn how to conduct distributed training within a VPC, see the example notebook Horovod Distributed Training with Amazon SageMaker TensorFlow script mode.
  • Pipe Mode: For large datasets, using Pipe Mode reduces startup and training times. Pipe Mode streams training data from Amazon S3 directly to the algorithm (as a Linux FIFO), without saving to disk. For details about using Pipe Mode with TensorFlow in Amazon SageMaker, see Training with Pipe Mode using PipeModeDataset.
  • Amazon FSx for Lustre and Amazon EFS: performance on large datasets in File Mode may be improved in some circumstances using either Amazon FSx for Lustre or Amazon EFS. For more details, please refer to the related blog post.
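As a sketch of how sharding and Pipe Mode fit together, the following helper builds one input channel in the shape used by the InputDataConfig field of the CreateTrainingJob API. The function name, channel name, and bucket path are hypothetical; only the dictionary keys and the values 'Pipe', 'S3Prefix', and 'ShardedByS3Key' come from the API.

```python
def sharded_pipe_channel(channel_name, s3_prefix):
    """Build one InputDataConfig entry that shards S3 objects across instances."""
    return {
        'ChannelName': channel_name,
        # Stream data from S3 instead of downloading it to disk first.
        'InputMode': 'Pipe',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': s3_prefix,
                # Each instance receives a distinct subset of the objects.
                'S3DataDistributionType': 'ShardedByS3Key',
            }
        },
    }


# Hypothetical bucket and prefix, for illustration only.
channel = sharded_pipe_channel('train', 's3://your-bucket/cifar10/train/')
print(channel['DataSource']['S3DataSource']['S3DataDistributionType'])
```

With ShardedByS3Key, the dataset should be split into at least as many S3 objects as there are training instances, otherwise some instances receive no data.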


Amazon SageMaker provides multiple tools to make distributed training quicker and easier to use. If neither parameter servers nor Horovod fits your needs, you can always provide another distributed training option using a Bring Your Own Container (BYOC) approach. Amazon SageMaker gives you the flexibility to mix and match the tools best suited for your use case and dataset.

To get started with TensorFlow distributed training in script mode, go to the Amazon SageMaker console. Either create a new Amazon SageMaker notebook instance or open an existing one. Then, import the distributed training example referenced in this blog post, and compare the parameter server option with the Horovod option.

About the authors

Rama Thamman is R&D Manager on the AWS R&D and Innovation Solutions Architecture team. He works with customers to build scalable cloud and machine learning solutions on AWS.





Brent Rabowsky focuses on data science at AWS and uses his expertise to help AWS customers with their data science projects.


from AWS Machine Learning Blog

Performing batch inference with TensorFlow Serving in Amazon SageMaker

After you’ve trained and exported a TensorFlow model, you can use Amazon SageMaker to perform inferences using your model. You can either:

  • Deploy your model to an endpoint to obtain real-time inferences from your model.
  • Use batch transform to obtain inferences on an entire dataset stored in Amazon S3.

In the case of batch transform, it’s becoming increasingly necessary to perform fast, optimized batch inference on large datasets.

In this post, you learn how to use Amazon SageMaker batch transform to perform inferences on large datasets. The example in this post uses a TensorFlow Serving (TFS) container to do batch inference on a large dataset of images. You also see how to use the new pre- and post-processing feature of the Amazon SageMaker TFS container. This feature enables your TensorFlow model to make inferences directly on data in S3 and also save post-processed inferences to S3.

The dataset in this example is the “Challenge 2018/2019” subset of the Open Images V5 Dataset. This subset consists of 100,000 images in JPG format for a total of 10 GB. The model you use is an image-classification model based on the ResNet-50 architecture that has been trained on the ImageNet dataset and exported as a TensorFlow SavedModel. Use this model to predict the class of each image (for example, boat, car, bird). You write a pre- and post-processing script and package it with the SavedModel to perform inferences quickly, efficiently, and at scale.


When you run a batch transform job, you specify:

  • Where your input data is stored
  • Which Amazon SageMaker Model object (named “Model”) to use to transform your data
  • The number of cluster instances to use for your batch transform job

In our use case, a Model object is an HTTP server that serves a trained model artifact, a TensorFlow SavedModel, via the Amazon SageMaker TFS container. Amazon SageMaker batch transform distributes your input data among the instances.

Amazon SageMaker batch transform sends HTTP inference requests, each containing input data from S3, to the Model running on each instance in the cluster. Amazon SageMaker batch transform then saves the resulting inferences back to S3.

Data often must be converted from one format to another before it can be passed to a model for predictions. For example, images may be in PNG or JPG format, but they must be converted into a format the model can accept. Also, sometimes other preprocessing of the data, such as resizing of images, must be performed.

Using the Amazon SageMaker TFS container’s new pre- and post-processing feature, you can easily convert data that you have in S3 (such as raw image data) into a TFS request. Your TensorFlow SavedModel can then use this request for inference. Here’s what happens:

  1. Your pre-processing code runs in an HTTP server inside the TFS container and processes incoming requests before sending them to a TFS instance within the same container.
  2. Your post-processing code processes responses from TFS before they are saved to S3.

The following diagram illustrates this solution.

Performing inferences on raw image data with an Amazon SageMaker batch transform

Here are the three steps required for implementation:

  1. Write a pre- and post-processing script for JPEG input data.
  2. Package a Model for JPEG input data.
  3. Run an Amazon SageMaker batch transform job for JPEG input data.

To write a pre- and post-processing script for JPEG input data

To make inferences, you first preprocess your image data in S3 to match the serving signature of your TensorFlow SavedModel, which you can inspect using the saved_model_cli. The following is the serving signature of the ResNet-50 v2 (NCHW, JPEG) model:

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['image_bytes'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: input_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (-1)
        name: ArgMax:0
    outputs['probabilities'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1001)
        name: softmax_tensor:0
  Method name is: tensorflow/serving/predict

The Amazon SageMaker TFS container uses the model’s SignatureDef named serving_default, which is declared when the TensorFlow SavedModel is exported. This SignatureDef says that the model accepts a string of arbitrary length as input, and responds with classes and their probabilities. With your image classification model, the input string is a base64 encoded string representing a JPEG image, which your SavedModel decodes. The Python script that you use for pre- and post-processing, inference.py, is reproduced as follows:

import base64
import io
import json
import requests

def input_handler(data, context):
    """ Pre-process request input before it is sent to TensorFlow Serving REST API

    Args:
        data (obj): the request data stream
        context (Context): an object containing request and configuration details

    Returns:
        (dict): a JSON-serializable dict that contains request body and headers
    """
    if context.request_content_type == 'application/x-image':
        payload = data.read()
        encoded_image = base64.b64encode(payload).decode('utf-8')
        instance = [{"b64": encoded_image}]
        return json.dumps({"instances": instance})
    else:
        _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown'))

def output_handler(response, context):
    """Post-process TensorFlow Serving output before it is returned to the client.

    Args:
        response (obj): the TensorFlow serving response
        context (Context): an object containing request and configuration details

    Returns:
        (bytes, string): data to return to client, response content type
    """
    if response.status_code != 200:
        _return_error(response.status_code, response.content.decode('utf-8'))
    response_content_type = context.accept_header
    prediction = response.content
    return prediction, response_content_type

def _return_error(code, message):
    raise ValueError('Error: {}, {}'.format(str(code), message))

The input_handler intercepts inference requests, base64 encodes the request body, and formats the request body to conform to the TFS REST API. The return value of the input_handler function is used as the request body in the TensorFlow Serving request. Binary data must use key “b64”, according to the TFS REST API.

Because your serving signature’s input tensor has the suffix “_bytes,” the encoded image data under key “b64” is passed to the “image_bytes” tensor. Some serving signatures may accept a tensor of floats or integers instead of a base64 encoded string. However, for binary data (including image data), your model should accept a base64 encoded string, because JSON representations of binary data can be large.

Each incoming request originally contains a serialized JPEG image in its request body. After passing through the input_handler, the request body contains the following, which TFS accepts for inference:

{"instances": [{"b64":"[base-64 encoded JPEG image]"}]}
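To check this request format outside the container, you can build the body from a few bytes and decode it again. The following is a small standalone sketch; the function name build_tfs_request and the stand-in byte string are hypothetical, and no TFS server is involved.

```python
import base64
import json


def build_tfs_request(image_bytes):
    """Wrap raw bytes in the {"instances": [{"b64": ...}]} request body."""
    encoded = base64.b64encode(image_bytes).decode('utf-8')
    return json.dumps({"instances": [{"b64": encoded}]})


# Stand-in bytes; a real request would carry the contents of a JPEG file.
fake_jpeg = b'\xff\xd8\xff\xe0 not a real image'
body = build_tfs_request(fake_jpeg)

# Decoding the "b64" value recovers the original bytes exactly.
decoded = base64.b64decode(json.loads(body)["instances"][0]["b64"])
print(decoded == fake_jpeg)  # True
```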

The first field in the return value of output_handler is what Amazon SageMaker batch transform saves to S3 as this example’s prediction. In this case, output_handler passes the content on to S3 unmodified.

Pre- and post-processing functions let you perform inference with TFS on any data format, not just images. To learn more about the input_handler and output_handler, see the Amazon SageMaker TFS Container README.

Packaging a model for JPEG input data

After writing a pre- and post-processing script, package your TensorFlow SavedModel along with your script into a model.tar.gz file. Then, upload the file to S3 for the Amazon SageMaker TFS container to use. The following is an example of a packaged model:

├── code
│   ├── inference.py
│   └── requirements.txt
└── 1538687370
    ├── saved_model.pb
    └── variables
       ├── variables.data-00000-of-00001
       └── variables.index

The number 1538687370 refers to the model version number of the SavedModel, and this directory contains your SavedModel artifacts. The code directory contains your pre- and post-processing script, which must be named inference.py. It also contains an optional requirements.txt file, which is used to install dependencies (with pip) from the Python Package Index before the batch transform job starts. In this use case, you don’t need any additional dependencies.

To package your SavedModel and your code, create a GZIP tar file named model.tar.gz by running the following command:

tar -czvf model.tar.gz code --directory=resnet_v2_fp32_savedmodel_NCHW_jpg 1538687370

Use this model.tar.gz when you create a Model object to run batch transform jobs. To learn more about packaging a model, see the Amazon SageMaker TFS Container README.
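If you prefer to build the archive from Python, for example inside a notebook, the same layout can be produced with the standard tarfile module. This is a sketch assuming the directory names used in the tar command above; package_model is a hypothetical helper, not part of any AWS SDK.

```python
import os
import tarfile


def package_model(model_dir, version, code_dir, output_path='model.tar.gz'):
    """Create a gzip-compressed tar with 'code/' and '<version>/' at the top level."""
    with tarfile.open(output_path, 'w:gz') as archive:
        # Pre- and post-processing script (and optional requirements.txt).
        archive.add(code_dir, arcname='code')
        # SavedModel artifacts, stored under the model version number.
        archive.add(os.path.join(model_dir, version), arcname=version)
    return output_path


# Example call, mirroring the tar command above:
# package_model('resnet_v2_fp32_savedmodel_NCHW_jpg', '1538687370', 'code')
```

Setting arcname keeps the local parent directory out of the archive, which plays the same role as the --directory option in the tar command.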

After creating a model artifact package, upload your model.tar.gz and create a Model object that refers to your packaged model artifact and the Amazon SageMaker TFS container. Specify that you are running in Region us-west-2 using TFS 1.13.1 and GPU instances.

The following code examples use both the AWS CLI, which is helpful for scripting automated pipelines, and the Amazon SageMaker Python SDK. However, you can create models and Amazon SageMaker batch transform jobs using any AWS SDK.

timestamp() {
  date +%Y-%m-%d-%H-%M-%S
}

# Placeholder values: replace the bracketed text with your own.
MODEL_NAME="[model name]-$(timestamp)"
MODEL_DATA_URL="[S3 URL for model.tar.gz]"

aws s3 cp model.tar.gz $MODEL_DATA_URL

# Image URI of the Amazon SageMaker TFS container (us-west-2, TFS 1.13.1, GPU)
IMAGE="[Amazon SageMaker TFS container image URI]"

# See the following document for more on SageMaker Roles:
# https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html
ROLE_ARN="[SageMaker-compatible IAM Role ARN]"

aws sagemaker create-model \
    --model-name $MODEL_NAME \
    --primary-container Image=$IMAGE,ModelDataUrl=$MODEL_DATA_URL \
    --execution-role-arn $ROLE_ARN

Amazon SageMaker Python SDK

import os
import sagemaker
from sagemaker.tensorflow.serving import Model

sagemaker_session = sagemaker.Session()
# See the following document for more on SageMaker Roles:
# https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html
role = '[SageMaker-compatible IAM Role ARN]'
bucket = 'sagemaker-data-bucket'
prefix = 'sagemaker/high-throughput-tfs-batch-transform'
s3_path = 's3://{}/{}'.format(bucket, prefix)

model_data = sagemaker_session.upload_data('model.tar.gz',
                                           bucket,
                                           os.path.join(prefix, 'model'))
# framework_version selects the TFS 1.13 container, as stated above
tensorflow_serving_model = Model(model_data=model_data,
                                 role=role,
                                 framework_version='1.13',
                                 sagemaker_session=sagemaker_session)

Running an Amazon SageMaker transform job for JPEG input data

Now, use the Model object you created to run batch predictions with Amazon SageMaker batch transform. Specify the S3 input data, the content type of the input data, the output S3 bucket, and the instance type and count.

You must specify two additional parameters that affect performance: max-payload-in-mb and max-concurrent-transforms.

The max-payload-in-mb parameter determines how large request payloads can be when sending requests to your model. Because the largest object in your S3 input is less than one megabyte, set this parameter to 1.

The max-concurrent-transforms parameter determines how many concurrent requests to send to your model. The value that maximizes throughput varies according to your model and input data. For this post, it was set to 64 after experimenting with powers of two.
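As a quick way to pick the payload limit, you can round the size of your largest input object up to a whole number of megabytes. The helper below is a sketch: min_payload_mb is a hypothetical name, it treats one megabyte as 1024 * 1024 bytes (an assumption), and the sizes are the five sample objects from the S3 listing shown later in this post, not the full dataset.

```python
import math


def min_payload_mb(object_sizes_in_bytes):
    """Smallest whole-megabyte limit that fits the largest object (at least 1)."""
    largest = max(object_sizes_in_bytes)
    return max(1, math.ceil(largest / (1024 * 1024)))


# Sizes in bytes of five sample input objects.
sample_sizes = [129216, 118629, 154661, 163722, 117780]
print(min_payload_mb(sample_sizes))  # 1
```

Keep in mind that a pre-processing step such as base64 encoding enlarges the data after it has reached the container, so this estimate only covers the request that batch transform itself sends.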

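# This S3 prefix contains .jpg files.
# The values below are taken from the text of this post; the job name is a placeholder.
TRANSFORM_S3_INPUT="s3://your-sagemaker-input-data/jpeg-images/"
TRANSFORM_INPUT_DATA_SOURCE="{S3DataSource={S3DataType=S3Prefix,S3Uri=$TRANSFORM_S3_INPUT}}"
CONTENT_TYPE="application/x-image"
TRANSFORM_S3_OUTPUT="s3://your-sagemaker-output-data-bucket/output"
INSTANCE_TYPE="ml.p3.2xlarge"
INSTANCE_COUNT=2
MAX_PAYLOAD_IN_MB=1
MAX_CONCURRENT_TRANSFORMS=64
JOB_NAME="[transform job name]"

aws sagemaker create-transform-job \
    --model-name $MODEL_NAME \
    --transform-input DataSource=$TRANSFORM_INPUT_DATA_SOURCE,ContentType=$CONTENT_TYPE \
    --transform-output S3OutputPath=$TRANSFORM_S3_OUTPUT \
    --transform-resources InstanceType=$INSTANCE_TYPE,InstanceCount=$INSTANCE_COUNT \
    --max-payload-in-mb $MAX_PAYLOAD_IN_MB \
    --max-concurrent-transforms $MAX_CONCURRENT_TRANSFORMS \
    --transform-job-name $JOB_NAME

Amazon SageMaker Python SDK

output_path = 's3://your-sagemaker-output-data-bucket/output'
tensorflow_serving_transformer = tensorflow_serving_model.transformer(
                                     instance_count=2,
                                     instance_type='ml.p3.2xlarge',
                                     max_concurrent_transforms=64,
                                     max_payload=1,
                                     output_path=output_path)

input_path = 's3://your-sagemaker-input-data/jpeg-images/'
tensorflow_serving_transformer.transform(input_path, content_type='application/x-image')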

Your input data consists of 100,000 JPEG images. The S3 input data path looks like this:

2019-05-09 19:41:18     129216 00000b4dcff7f799.jpg
2019-05-09 19:41:18     118629 00001a21632de752.jpg
2019-05-09 19:41:18     154661 0000d67245642c5f.jpg
2019-05-09 19:41:18     163722 0001244aa8ed3099.jpg
2019-05-09 19:41:18     117780 000172d1dd1adce0.jpg

In tests, the batch transform job finished transforming the 100,000 images in 12 minutes on two ml.p3.2xlarge instances. You can easily scale your batch transform jobs to handle larger datasets or run faster by increasing the instance count.

After the batch transform job completes, inspect the output. In your output path, you find one S3 object per object in the input:

2019-05-16 05:46:05   12.7 KiB 00000b4dcff7f799.jpg.out
2019-05-16 05:46:04   12.6 KiB 00001a21632de752.jpg.out
2019-05-16 05:46:05   12.7 KiB 0000d67245642c5f.jpg.out
2019-05-16 05:46:05   12.7 KiB 0001244aa8ed3099.jpg.out
2019-05-16 05:46:04   12.7 KiB 000172d1dd1adce0.jpg.out

Inspecting one of the output objects, you see the prediction from the model:

{
    "predictions": [
        {
            "probabilities": [6.08312e-07, 9.68555e-07, ...],
            "classes": 576
        }
    ]
}

So that’s how to get inferences on a dataset consisting of JPEG images. In some cases, you may have converted your data to TFRecords for training or to encode multiple feature vectors in a single record. The following section shows how to perform batch inference with TFRecords.

Performing inferences on a TFRecord dataset with an Amazon SageMaker batch transform

TFRecord is a record-wrapping format commonly used with TensorFlow for storing multiple instances of tf.Example. With TFRecord, you can store multiple images (or other binary data) in a single S3 object, along with annotations and other metadata.

Amazon SageMaker batch transform can split an S3 object by the TFRecord delimiter, letting you perform inferences either on one example at a time or on batches of examples. Using Amazon SageMaker batch transform to perform inference on TFRecord data is similar to performing inference directly on image data, per the example earlier in this post.
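To see what splitting by the TFRecord delimiter means, recall that each record in a TFRecord file is framed as an 8-byte little-endian length, a 4-byte checksum of the length, the payload, and a 4-byte checksum of the payload. The sketch below walks that framing in plain Python. It is only an illustration, not how batch transform is implemented: it skips checksum validation, and frame_record writes zeroed checksum fields purely to create demo input, so real TFRecord readers would reject its output.

```python
import struct


def split_tfrecord(buffer):
    """Return the payload of every record in a TFRecord byte string."""
    records = []
    offset = 0
    while offset < len(buffer):
        (length,) = struct.unpack_from('<Q', buffer, offset)
        offset += 8 + 4                       # skip the length and its checksum
        records.append(buffer[offset:offset + length])
        offset += length + 4                  # skip the payload and its checksum
    return records


def frame_record(payload):
    """Demo-only framing with zeroed checksum fields (not a valid TFRecord)."""
    return struct.pack('<Q', len(payload)) + b'\x00' * 4 + payload + b'\x00' * 4


demo = frame_record(b'first example') + frame_record(b'second')
print(split_tfrecord(demo))  # [b'first example', b'second']
```

Each payload recovered this way would be one serialized tf.Example, which is what a single-record request carries.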

To write a pre- and post-processing script for TFRecord data

Assume that you converted the image data used earlier into the TFRecord format. Rather than performing inference on 100,000 separate S3 image objects, perform inference on 100 S3 objects, each containing 1,000 images bundled together as a TFRecord file. The images are stored in key “image/encoded” in a tf.Example and these tf.Examples are wrapped in the TFRecord format. You can perform inference on these TFRecords and output them in any data format, like JSON or CSV.

Now, tell Amazon SageMaker batch transform to split each object by a TFRecord header and do inference on a single record at a time, so that each request contains one serialized tf.Example. Use the following pre- and post-processing script to perform inference:

import base64
import io
import json
import requests
import tensorflow as tf
from google.protobuf.json_format import MessageToDict
from string import whitespace

def input_handler(data, context):
    """ Pre-process request input before it is sent to TensorFlow Serving REST API

    Args:
        data (obj): the request data stream
        context (Context): an object containing request and configuration details

    Returns:
        (dict): a JSON-serializable dict that contains request body and headers
    """
    if context.request_content_type == 'application/x-tfexample':
        payload = data.read()
        example = tf.train.Example()
        # Deserialize the request body into the tf.Example message.
        example.ParseFromString(payload)
        example_feature = MessageToDict(example.features)['feature']
        # MessageToDict renders bytes fields as base64 strings, so no extra encoding is needed.
        encoded_image = example_feature['image/encoded']['bytesList']['value'][0]
        instance = [{"b64": encoded_image}]
        return json.dumps({"instances": instance})
    else:
        _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown'))

def output_handler(response, context):
    """Post-process TensorFlow Serving output before it is returned to the client.

    Args:
        response (obj): the TensorFlow serving response
        context (Context): an object containing request and configuration details

    Returns:
        (bytes, string): data to return to client, response content type
    """
    if response.status_code != 200:
        _return_error(response.status_code, response.content.decode('utf-8'))
    response_content_type = context.accept_header
    # Remove whitespace from output JSON string.
    prediction = response.content.decode('utf-8').translate(dict.fromkeys(map(ord,whitespace)))
    return prediction, response_content_type

def _return_error(code, message):
    raise ValueError('Error: {}, {}'.format(str(code), message))

Your input handler extracts the image data from the value stored in the tf.Example’s key “image/encoded”. It then base64 encodes the image data for inference in TFS, just as in the previous example for JPEG input data.

Each output object contains 1000 predictions (one per image in the input object). These predictions are ordered in the same way they were in the input object. Your output_handler removes white space from the output JSON-formatted string that represents each TFS prediction. This way, the output S3 object can adhere to the JSONLines format, making it easier to parse. Later, configure your batch transform job to join each record with a newline character.
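As a small illustration of the whitespace stripping that output_handler performs, the sketch below applies the same `str.translate` call to a made-up, pretty-printed response. The sample response text and the function name are assumptions for demonstration only. Note that this approach removes every whitespace character, including any inside JSON string values, which is harmless for numeric predictions.

```python
import json
from string import whitespace

# Translation table that deletes every ASCII whitespace character.
_STRIP_WHITESPACE = dict.fromkeys(map(ord, whitespace))


def to_single_line(tfs_response_text):
    """Collapse a multi-line JSON response into one line, as output_handler does."""
    return tfs_response_text.translate(_STRIP_WHITESPACE)


# Made-up response text in the style of a pretty-printed TFS prediction.
pretty = '{\n    "predictions": [[0.1, 0.9]\n    ]\n}'
line = to_single_line(pretty)
print(line)  # {"predictions":[[0.1,0.9]]}
```

Because each prediction now occupies exactly one line, joining predictions with newline characters yields a valid JSONLines object.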

The pre-processing script requires additional dependencies to parse the input data. Add a dependency on TensorFlow to the requirements.txt file in the previous example so you can parse the serialized instances of tf.Example:

tensorflow==1.13.1

TensorFlow 1.13.1 is installed when each transform job starts. There are other options for including external dependencies, including options to avoid installing dependencies at runtime at all. For more information, see the Amazon SageMaker TFS Container README.

To package a model for TFRecord input data

Re-create your model.tar.gz to contain your new pre-processing script and then create a new Model object pointing to the new model.tar.gz. This step is identical to the previous example for JPEG input data, so refer to that section for details.

To run an Amazon SageMaker batch transform job for TFRecord input data

Now run a batch transform job against TFRecord files that contain the images:

AWS CLI
aws sagemaker create-transform-job \
    --model-name $MODEL_NAME \
    --transform-input DataSource=$DATA_SOURCE \
    --transform-output S3OutputPath=$TRANSFORM_S3_OUTPUT,AssembleWith=$ASSEMBLE_WITH \
    --transform-resources InstanceType=$INSTANCE_TYPE,InstanceCount=$INSTANCE_COUNT \
    --max-payload-in-mb $MAX_PAYLOAD_IN_MB \
    --max-concurrent-transforms $MAX_CONCURRENT_TRANSFORMS \
    --transform-job-name $JOB_NAME \
    --batch-strategy $BATCH_STRATEGY

Amazon SageMaker Python SDK
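# tensorflow_serving_model is the Model object created in the packaging step, and env is
# a dict of TFS container environment variables (the request batching settings).
output_path = 's3://sagemaker-output-data-bucket/tfrecord-output/'
tensorflow_serving_transformer = tensorflow_serving_model.transformer(
    output_path=output_path, env=env)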

input_path = 's3://your-sagemaker-input-data/tfrecord-images/'
tensorflow_serving_transformer.transform(input_path, content_type='application/x-tfexample')

The command is nearly identical to the one in the previous example for JPEG input data, with a few notable differences:

  • You specify SplitType as “TFRecord.” This is the delimiter that Amazon SageMaker batch transform uses. Other supported delimiters are “Line” for newline characters and “RecordIO” for the RecordIO data format.
  • You specify BatchStrategy as “SingleRecord” (rather than “MultiRecord”). This means that one record at a time is sent to the model after splitting by the TFRecord delimiter. In this case, choose “SingleRecord” to avoid having to strip records of the TFRecord header in your pre-processing script. If you choose “MultiRecord” instead, each request sent to your model contains up to 1 MB of records (since you chose MaxPayloadInMB to be 1 MB).
  • You specify AssembleWith as “Line.” This instructs your batch transform job to assemble the individual predictions in each object by newline characters rather than concatenate them.
  • You specify environment variables to be passed to the Amazon SageMaker TFS container. These particular environment variables enable request batching, a TFS feature that allows records from multiple requests to be batched together. The MaxConcurrentTransforms parameter is increased to 100, since TFS queues batches of requests. Request batching is an advanced feature that, when correctly configured, can significantly improve throughput, especially on GPU-enabled instances. For more information on configuring request batching, see the Amazon SageMaker TFS Container README.
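The settings in the list above can be collected into a single request for the SageMaker CreateTransformJob API. The following is a sketch only, not the exact job definition used in this post: the function name, job name, model name, and S3 paths are hypothetical, and the environment shows just SAGEMAKER_TFS_ENABLE_BATCHING from the TFS container README, leaving out the batching tuning values, which you would choose for your own workload.

```python
def build_transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri, environment):
    """Assemble keyword arguments for the SageMaker create_transform_job call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "BatchStrategy": "SingleRecord",   # one tf.Example per request
        "MaxPayloadInMB": 1,
        "MaxConcurrentTransforms": 100,    # raised because TFS queues batches of requests
        "Environment": environment,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "application/x-tfexample",
            "SplitType": "TFRecord",       # split each object by the TFRecord header
        },
        "TransformOutput": {
            "S3OutputPath": output_s3_uri,
            "AssembleWith": "Line",        # join predictions with newline characters
        },
        "TransformResources": {"InstanceType": "ml.p3.2xlarge", "InstanceCount": 2},
    }


request = build_transform_job_request(
    job_name="tfrecord-batch-demo",          # hypothetical name
    model_name="image-classification-tfs",   # hypothetical name
    input_s3_uri="s3://your-sagemaker-input-data/tfrecord-images/",
    output_s3_uri="s3://sagemaker-output-data-bucket/tfrecord-output/",
    environment={"SAGEMAKER_TFS_ENABLE_BATCHING": "true"},
)
# With AWS credentials configured, the request could be submitted with:
#   boto3.client("sagemaker").create_transform_job(**request)
```

Keeping the request as plain data makes it easy to review the SplitType, BatchStrategy, and AssembleWith choices before submitting the job.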

Your S3 input data consisted of 100 objects corresponding to TFRecord files:

2019-05-20 21:07:12   99.3 MiB train-00000-of-00100
2019-05-20 21:07:12  100.8 MiB train-00001-of-00100
2019-05-20 21:07:12  100.4 MiB train-00002-of-00100
2019-05-20 21:07:12   99.2 MiB train-00003-of-00100
2019-05-20 21:07:12  101.5 MiB train-00004-of-00100
2019-05-20 21:07:14   99.8 MiB train-00005-of-00100

Your batch transform job on the images in TFRecord format should finish in about 8 minutes on two ml.p3.2xlarge instances with request batching enabled. Now, your output S3 data path consists of one output object for each of the 100 input objects:

2019-05-20 23:21:23   11.3 MiB train-00000-of-00100.out
2019-05-20 23:21:35   11.4 MiB train-00001-of-00100.out
2019-05-20 23:21:30   11.4 MiB train-00002-of-00100.out
2019-05-20 23:21:40   11.3 MiB train-00003-of-00100.out
2019-05-20 23:21:35   11.4 MiB train-00004-of-00100.out

Each object in the output follows the JSONLines format and contains 1000 lines, with each line containing the TFS output for one image.


For each file, the order of the images in the output remains the same as in the input. In other words, the first image in each input file corresponds to the first line in the matching output file, the second image to the second line, and so on.
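Because the ordering is preserved, you can pair predictions with inputs purely by position. The following sketch assumes you have already downloaded an output object as text and have a list of image identifiers in the same order as the input file. The sample lines and identifiers are made up, and the function names are hypothetical.

```python
import json


def parse_jsonlines(text):
    """Return one parsed prediction per non-empty line, preserving order."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def pair_with_inputs(image_ids, predictions):
    """Match the i-th input image with the i-th output line."""
    if len(image_ids) != len(predictions):
        raise ValueError("input and output record counts differ")
    return list(zip(image_ids, predictions))


# Made-up output content: one compact JSON document per line.
sample_output = '{"predictions":[[0.1,0.9]]}\n{"predictions":[[0.8,0.2]]}\n'
pairs = pair_with_inputs(["image-0", "image-1"], parse_jsonlines(sample_output))
print(pairs[0][0], pairs[0][1]["predictions"])
```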


Amazon SageMaker batch transform can transform large datasets quickly and at scale. You saw how to use the Amazon SageMaker TFS container to perform inferences with GPU-accelerated instances on image data as well as TFRecord files.

While the Amazon SageMaker TFS container supports CSV and JSON data out-of-the-box, its new pre- and post-processing feature also lets you run batch transform jobs on data of any format. The same container can be used for real-time inference as well, using an Amazon SageMaker-hosted model endpoint.

Our example in the blog post text above uses the Open Images dataset. However, we’ve made several examples available to suit your time constraints, use case, and preferred workflow. They are available on GitHub (click the links below) and in SageMaker notebook instances.

  • CIFAR-10 example: the CIFAR-10 dataset is much smaller than Open Images, so this example is meant for a quick demonstration of the above features.
  • Open Images example, image data: this example uses the Open Images dataset, and performs inference on raw image files. Two versions are available: one uses the SageMaker Python SDK, while the other uses the AWS CLI.
  • Open Images example, TFRecord data: this example uses the Open Images dataset, and performs inference on data stored in the TensorFlow binary format, TFRecord. Two versions are available: one uses the SageMaker Python SDK, while the other uses the AWS CLI.

To get started with these examples, go to the Amazon SageMaker console, and either create a SageMaker notebook instance or open an existing one. Then go to the SageMaker Examples tab, and select the relevant examples from the SageMaker Batch Transform drop-down menu.

About the Authors

Andre Moeller is a Software Development Engineer at AWS AI. He focuses on developing scalable and reliable platforms and tools to help data scientists and engineers train, evaluate, and deploy machine learning models.




Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.





from AWS Machine Learning Blog

Optimizing TensorFlow model serving with Kubernetes and Amazon Elastic Inference


This post offers a deep dive into how to use Amazon Elastic Inference with Amazon Elastic Kubernetes Service. When you combine Elastic Inference with EKS, you can run low-cost, scalable inference workloads with your preferred container orchestration system.

Elastic Inference is an increasingly popular way to run low-cost inference workloads on AWS. It allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost of running deep learning inference by up to 75%. Amazon EKS is also of growing importance to companies of all sizes, from startups to enterprises, for running containers on AWS. For those who are looking to run inference workloads in a managed Kubernetes environment, using Elastic Inference and EKS together opens the door to running these workloads with accelerated yet low-cost compute resources.

The example in this post shows how to use Elastic Inference and EKS together to deliver a cost-optimized, scalable solution for performing object detection on video frames. More specifically, it does the following:

  • Runs pods in EKS that read a video from Amazon S3.
  • Preprocesses video frames.
  • Sends the frames for object detection to a TensorFlow Serving pod modified to work with Elastic Inference.

This computationally intensive use case showcases the advantages of using Elastic Inference and EKS to achieve accelerated inference at low cost within a scalable, containerized architecture. For more information about this code, see Optimizing scalable ML inference workloads with Amazon Elastic Inference and Amazon EKS on GitHub.

Elastic Inference overview

Research by AWS indicates that inference can drive as much as 90% of the cost of running machine learning workloads, a much higher percentage than training models. However, using a GPU instance for inference often is wasteful because you’re typically not fully utilizing the instance.

Elastic Inference solves this by allowing you to attach just the right amount of low-cost GPU-powered acceleration to any Amazon EC2 or Amazon SageMaker CPU-based instance. This reduces inference costs by up to 75% because you no longer need to over-provision GPU compute for inference.

You can configure EC2 instances or Amazon SageMaker endpoints with Elastic Inference accelerators using the AWS Management Console, AWS CLI, the AWS SDK, AWS CloudFormation, or Terraform. Launching an instance with Elastic Inference provisions an accelerator in the same Availability Zone behind a VPC endpoint. The accelerator then attaches to the instance over the network. There are two prerequisites:

  • First, provision an AWS PrivateLink VPC endpoint for the subnets where you plan to launch accelerators.
  • Second, provide an instance role with a policy that allows actions on Elastic Inference accelerators.

You also need to make sure your security groups are properly configured to allow traffic to and from the VPC endpoint and instance. For further details, see Setting Up to Launch Amazon EC2 with Elastic Inference.

Deep learning tools and frameworks enabled for Elastic Inference can automatically detect and offload the appropriate model computations to the attached accelerator. Elastic Inference supports TensorFlow, Apache MXNet, and ONNX models, with more frameworks coming soon. If you are a PyTorch user, you can convert your models to the ONNX format to enable usage of Elastic Inference. The example in this post uses TensorFlow.

EKS overview

Kubernetes is open-source software that allows you to deploy and manage containerized applications at scale. Kubernetes groups containers into logical groupings for management and discoverability, then launches them into clusters of EC2 instances. EKS makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.

More specifically, EKS provisions, manages, and scales the Kubernetes control plane for you. At a high level, Kubernetes consists of two major components: a cluster of worker nodes that run your containers, and the control plane that manages when and where containers start on your cluster and monitors their status.

Without EKS, you have to run both the Kubernetes control plane and the cluster of worker nodes yourself. By handling the control plane for you, EKS removes a substantial operational burden for running Kubernetes, and allows you to focus on building your application instead of managing infrastructure.

Additionally, EKS is secure by default, and runs the Kubernetes management infrastructure across multiple Availability Zones to eliminate a single point of failure. EKS is certified Kubernetes conformant so you can use existing tooling and plugins from partners and the Kubernetes community. Applications running on any standard Kubernetes environment are fully compatible, and you can easily migrate them to EKS.

Integrating Elastic Inference with EKS

This post’s example involves performing object detection on video frames. These frames are extracted from videos stored in S3. The following diagram shows the overall architecture:

Inference node container with TensorFlow Serving and Elastic Inference support

TensorFlow Serving (TFS) is the preferred way to serve TensorFlow models. Accordingly, the first step in this solution is to build an inference container with TFS and Elastic Inference support. AWS provides a TFS binary modified for Elastic Inference. Versions of the binary are available for several different versions of TFS, as well as Apache MXNet. For more information and links to the binaries, see Amazon Elastic Inference Basics.

In this solution, the inference Dockerfile gets the modified TFS binary from S3 and installs it, along with an object detection model. The model is a variant of MobileNet trained on the COCO dataset, published in the TensorFlow detection model zoo. The complete Dockerfile is available in the amazon-elastic-inference-eks GitHub repo, under the /Dockerfile_tf_serving directory.

Standard node container

In addition to an inference container with Elastic Inference-enabled TFS and an object detection model, you also need a separate standard node container which performs the bulk of the application tasks and gets predictions from the inference container. As a top-level summary, this standard node container performs the following tasks:

  • Polls an Amazon SQS queue for messages regarding the availability of videos.
  • Fetches the next available video from S3.
  • Converts the video to individual frames.
  • Batches some of the frames together and sends the batched frames to the model server container for inference.
  • Processes the returned predictions.

The only aspect of the code that isn’t straightforward is the need to enable EC2 instance termination protection while workers are processing videos, as shown in the following code example:

    DisableApiTermination={ 'Value': True },

After the job processes, a similar API call disables termination protection. This example application uses termination protection because the jobs are long-running, and you don’t want an EC2 instance terminated during a scale-in event if it is still processing a video.
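One way to keep the enable and disable calls paired is a context manager around the video processing step. This is a sketch rather than the code from the sample repository: the function name is hypothetical, and `ec2_client` is assumed to be a boto3 EC2 client (for example, `boto3.client("ec2")`), whose `modify_instance_attribute` call accepts the `DisableApiTermination` argument shown above.

```python
from contextlib import contextmanager


@contextmanager
def termination_protection(ec2_client, instance_id):
    """Enable termination protection for the duration of a block, then disable it."""
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )
    try:
        yield
    finally:
        # Runs even if processing raises, so protection is not left on by accident.
        ec2_client.modify_instance_attribute(
            InstanceId=instance_id,
            DisableApiTermination={"Value": False},
        )


# Hypothetical usage inside the worker loop:
#   with termination_protection(boto3.client("ec2"), instance_id):
#       process_video(video_key)
```

The `finally` clause is the main design point: protection is removed whether the job succeeds or fails, so a failed job does not leave an instance permanently protected.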

You can easily modify the inference code and optimize it for your use case, so this post doesn’t spend further time examining it. To review the Dockerfile for the inference code, see the amazon-elastic-inference-eks GitHub repo, under the /Dockerfile directory. The code itself is in the test.py file.

Kubernetes cluster details

The EKS cluster deployed in the sample CloudFormation template contains two distinct node groups by default. One node group contains M5 instances, which are currently the latest generation of general purpose instances, and the other node group contains C5 instances, which are currently the latest generation of compute-optimized instances. The instances in the C5 node group each have a single Elastic Inference accelerator attached.

Currently, Kubernetes doesn’t schedule pods using Elastic Inference accelerators. Accordingly, this example uses Kubernetes labels and selectors to distribute the inference workload to the resources in the cluster with attached Elastic Inference accelerators.

More specifically, to minimize the complexity of scheduling access to the Elastic Inference accelerator, the application and inference pods deploy as a DaemonSet with a selector, which ensures that each node with a defined label runs one copy of the application and inference on the instance. The sample application pulls job metadata from the SQS queue and then processes each one sequentially, so you don’t need to worry about multiple processes interacting with the Elastic Inference accelerator.

Additionally, the deployed cluster contains an Auto Scaling group that scales the nodes in the inference group in/out based upon the approximate depth of the SQS queue. Automatic scaling helps keep the inference node group sized appropriately to keep costs as low as possible. Depending on your workload, you also could consider using Spot Instances to keep your costs low.

Currently, SQS metrics update every five minutes, so you can trigger an AWS Lambda function using CloudWatch Events one time per minute to query the depth of the queue directly and update a custom CloudWatch metric.
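The Lambda function described above might look like the following sketch. It is not the function from the sample repository: the namespace, metric name, and function name are hypothetical, and the two client arguments are assumed to be boto3 SQS and CloudWatch clients.

```python
def publish_queue_depth(sqs_client, cloudwatch_client, queue_url,
                        namespace="Custom/VideoInference", metric_name="QueueDepth"):
    """Read the approximate SQS queue depth and publish it as a custom metric."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    depth = int(response["Attributes"]["ApproximateNumberOfMessages"])
    cloudwatch_client.put_metric_data(
        Namespace=namespace,
        MetricData=[{"MetricName": metric_name, "Value": depth, "Unit": "Count"}],
    )
    return depth


# Hypothetical Lambda entry point, invoked once per minute by a scheduled rule:
#   def handler(event, context):
#       publish_queue_depth(boto3.client("sqs"), boto3.client("cloudwatch"), QUEUE_URL)
```

A scaling policy for the inference node group can then track this custom metric instead of the slower built-in SQS metric.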

Launching with AWS CloudFormation

To create the resources described in this post, you must run several AWS CLI commands. For more information about running and launching these resources, see the associated Makefile on GitHub. For instructions to create these resources using a CloudFormation template, see the README file in the GitHub repository.

Comparing costs

Finally, you can see how much you saved on costs by using Elastic Inference rather than a full GPU instance.

By default, this solution uses a CPU instance of type c5.large with an attached accelerator of type eia1.medium for the inference nodes. The current On-Demand pricing for those resources is $0.085 per hour, plus $0.130 per hour, for a total of $0.215 per hour. The total cost compared to the pricing of the smallest current generation GPU instance is as follows:

  • Elastic Inference solution – $0.215 per hour
  • GPU instance p3.2xlarge – $3.06 per hour

To summarize, the Elastic Inference solution cost is less than 10% of the cost of using a full GPU instance.
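The arithmetic behind that comparison, using only the hourly prices quoted above, works out as follows:

```python
CPU_INSTANCE_PER_HOUR = 0.085   # c5.large, On-Demand price quoted in this post
ACCELERATOR_PER_HOUR = 0.130    # eia1.medium accelerator
GPU_INSTANCE_PER_HOUR = 3.06    # p3.2xlarge

elastic_inference_per_hour = CPU_INSTANCE_PER_HOUR + ACCELERATOR_PER_HOUR
cost_ratio = elastic_inference_per_hour / GPU_INSTANCE_PER_HOUR

print(f"Elastic Inference solution: ${elastic_inference_per_hour:.3f} per hour")
print(f"Share of GPU instance cost: {cost_ratio:.1%}")
```

The ratio comes to roughly 7%, which is where the "less than 10%" figure comes from.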

Despite its much lower cost, the Elastic Inference solution in these tests can do real-time inference, processing video at a rate of almost 30 frames per second. This result is impressive, especially considering that there is room to optimize the code further. For more information about the cost comparison of Elastic Inference versus GPU instances, see Optimizing costs in Amazon Elastic Inference with TensorFlow. To achieve even greater cost savings, you can use Spot Instances for the CPU instances for up to a 90% discount for those instances compared to On-Demand prices.


Elastic Inference enables low-cost accelerated inference, and EKS makes it easy to run Kubernetes on AWS. You can combine the two to create a powerful, low-touch solution for running inexpensive accelerated inference workloads in a managed, scalable, and highly available Kubernetes cluster.

Many variations of this solution are possible on AWS. For example, instead of using EKS, you could use Amazon ECS, another managed container orchestration service on AWS. Alternatively, you could run the Kubernetes control plane yourself directly on EC2, or run containers directly on EC2 without Kubernetes. The choice is yours. AWS enables you to build the architecture that best suits your use case, tooling, and workflow preferences.

To get started, go to the CloudFormation console and create a stack using the CloudFormation template for this blog post’s example solution. Details can be found in the Launching with AWS CloudFormation section above, and in the related GitHub repository linked there.

About the Authors

Ryan Nitz is a software engineer and architect working on the Startup Solutions Architecture team.





Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.









Tracking the throughput of your private labeling team through Amazon SageMaker Ground Truth


Launched at AWS re:Invent 2018, Amazon SageMaker Ground Truth helps you quickly build highly accurate training datasets for your machine learning models. Amazon SageMaker Ground Truth offers easy access to public and private human labelers, and provides them with built-in workflows and interfaces for common labeling tasks. Additionally, Amazon SageMaker Ground Truth can lower your labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently.

When using your own private workers to perform data labeling, you want to measure and track their throughput and efficiency. Amazon SageMaker Ground Truth now logs worker events (for example, when a labeler starts and submits a task) to Amazon CloudWatch. You can also use the built-in metrics feature of CloudWatch to measure and track throughput across a work team or for individual workers. In this blog post, we cover how to use the raw worker event logs and built-in metrics in your AWS account.

How to use worker activity logs

Once you set up a private team of workers and run a labeling job with Amazon SageMaker Ground Truth, worker activity logs are automatically emitted to CloudWatch. To learn how to set up a private team and kick off your first labeling job, reference this getting started blog post. Note: If you have previously created a private work team, you need to create a new private work team to set up the trust permissions between work teams and CloudWatch. You do not have to use that new private work team; creating it is simply a one-time setup step.

To view the logs, visit the CloudWatch console and click on Logs in the left-hand panel. Here, you should see a log group named /aws/sagemaker/groundtruth/WorkerActivity.

This log group contains logs for each task a worker accepts during an Amazon SageMaker Ground Truth labeling job, and we have included an example log below. You see the worker’s Amazon Cognito sub ID in the “cognito_sub_id” field. We will demonstrate how to tie this back to the worker’s identity through Amazon Cognito. In addition, you see the Amazon Resource Name (ARN) for the Amazon SageMaker Ground Truth labeling job in the “labeling_job_arn” field. This log also contains timestamps for when the worker begins the task (“task_accepted_time”) and when the worker either returns or submits the task (“task_returned_time” or “task_submitted_time”).

{
     "worker_id": "cd449a289e129409",
     "cognito_user_pool_id": "us-east-2_IpicJXXXX",
     "cognito_sub_id": "d6947aeb-0650-447a-ab5d-894db61017fd",
     "task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",
     "task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019",
     "task_returned_time": "",
     "workteam_arn": "arn:aws:sagemaker:us-east-2:############:workteam/private-crowd/Sample-labeling-team",
     "labeling_job_arn": "arn:aws:sagemaker:us-east-2:############:labeling-job/metrics-demo",
     "work_requester_account_id": "############",
     "job_reference_code": "############",
     "job_type": "Private",
     "event_type": "TasksSubmitted",
     "event_timestamp": "1565798464"
}

Learn more about using CloudWatch Logs from the developer documentation.
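If you export these log events, computing the time a worker spent on a task is a matter of parsing the two timestamps. The following sketch uses only fields from the example log above. The function names are hypothetical, and the timestamp format string is inferred from that single example, so treat it as an assumption.

```python
from datetime import datetime, timezone

# Format inferred from the example log, e.g. "Wed Aug 14 16:00:59 UTC 2019".
TIME_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def parse_time(value):
    """Parse a worker activity timestamp into a timezone-aware datetime."""
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def seconds_on_task(event):
    """Return seconds between accepting and submitting (or returning) a task."""
    end = event.get("task_submitted_time") or event.get("task_returned_time")
    if not end:
        return None  # the task is still in progress
    start = parse_time(event["task_accepted_time"])
    return (parse_time(end) - start).total_seconds()


event = {
    "task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",
    "task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019",
    "task_returned_time": "",
}
print(seconds_on_task(event))  # 5.0
```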

How to use worker activity metrics

You can also use the CloudWatch metrics capability to generate your own interesting statistics or graphs about the throughput of your private workers. You can begin by navigating to the Metrics tab and then the AWS/SageMaker/Workteam namespace.

Say you want to find the average amount of time workers spent on tasks for a specific labeling job. You would select the LabelingJob, Workteam option.

From here, you can calculate your own statistics. In the example below, we calculate the average time spent per submitted task for a specific labeling job. There were 14 tasks submitted that took a total of 2.28 minutes or, on average, 9.78 seconds per task.

Learn more about using CloudWatch metrics from the developer documentation.

How to link Amazon Cognito sub ID to worker information

You can link the outputted Amazon Cognito sub ID to identifiable worker information, such as user name. To do so, you can write a quick script using the Amazon Cognito ListUsers API. Alternatively, you can use the Amazon Cognito console by following these steps:

  1. Navigate to Manage User Pools in the AWS Region where you are running your labeling jobs.
  2. Select the sagemaker-ground-userpool (if you integrated your own Amazon Cognito user pool with Amazon SageMaker Ground Truth, select that user pool).
  3. From the left-hand panel, click Users and groups to see all of the users in your user pool.
  4. Choose any user to see that user’s sub ID.
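For the scripted route, a lookup with the Amazon Cognito ListUsers API could be sketched as follows. The function name is hypothetical, and `cognito_client` is assumed to be a boto3 `cognito-idp` client. The filter string uses the ListUsers filter syntax to match on the `sub` attribute.

```python
def find_username_by_sub(cognito_client, user_pool_id, sub_id):
    """Return the user name for a Cognito sub ID, or None if no user matches."""
    response = cognito_client.list_users(
        UserPoolId=user_pool_id,
        Filter=f'sub = "{sub_id}"',
        Limit=1,
    )
    users = response.get("Users", [])
    return users[0]["Username"] if users else None


# Hypothetical usage with values taken from a worker activity log event:
#   client = boto3.client("cognito-idp", region_name="us-east-2")
#   find_username_by_sub(client, event["cognito_user_pool_id"], event["cognito_sub_id"])
```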


In this post, I introduced how to measure and track the throughput of your private labeling team using CloudWatch Logs and metrics. In addition, I walked through how to link the outputted worker ID to identifiable worker information, such as a user name. Visit the AWS Management Console to get started.

As always, AWS welcomes feedback. Please submit comments or questions below.

About the Authors

Vikram Madan is the Product Manager for Amazon SageMaker Ground Truth. He focuses on delivering products that make it easier to build machine learning solutions. In his spare time, he enjoys running long distances and watching documentaries.



Pranav Sachdeva is a Software Development Engineer in AWS AI. He is passionate about building high performance distributed systems to solve real life problems. He is currently focused on innovating and building capabilities in the AWS AI ecosystem that allow customers to give AI the much needed human aspect.







Enable smart text analytics using Amazon Elasticsearch Search and Amazon Comprehend


We’re excited to announce an end-to-end solution that leverages natural language processing to analyze and visualize unstructured text in your Amazon Elasticsearch Service domain with Amazon Comprehend in the AWS Cloud. You can deploy this solution in minutes with an AWS CloudFormation template and visualize your data in a Kibana dashboard.

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that delivers Elasticsearch’s easy-to-use APIs and real-time capabilities along with the availability, scalability, and security required by production workloads. Amazon Comprehend is a fully managed natural language processing (NLP) service that enables text analytics to extract insights from the content of documents. Customers can now leverage Amazon ES and Amazon Comprehend to index and analyze unstructured text, and deploy a pre-configured Kibana dashboard to visualize extracted entities, key phrases, syntax, and sentiment from their documents.

As an example, a company might have large volumes of online customer feedback or transcribed customer calls. With this solution, you can visualize a time series of the sentiment of customer contacts, analyze a word cloud of the entities or key phrases in those contacts, search contacts for a specific product by sentiment, and much more. In this blog post, let’s look at an example Kibana dashboard that you can deploy to draw insights from your text data with Amazon ES and Amazon Comprehend. For detailed instructions, please visit the solution implementation guide.

This solution uses AWS CloudFormation to automate the deployment on the AWS Cloud. You can learn more about the solution by clicking this link and download the template here:

You can use this template to launch the solution and all associated components. Deploying this solution with the default parameters builds the following environment in the AWS Cloud.

The default configuration deploys Amazon API Gateway, AWS Lambda, Amazon Elasticsearch Service, and AWS Identity and Access Management roles and policies, but you can also customize the template based on your specific network needs. Once the solution is deployed, you get a fully compatible Amazon ES RESTful API that you can use to ingest documents to Amazon ES and automatically tag the documents with NLP-based text analytics from Amazon Comprehend. You can then use the pre-configured Kibana dashboard to visualize these insights. In the following example, the entity dashboard shows the word cloud for commercial items, organizations, people, locations, events, and titles from news content.

The sentiment dashboard below shows the sentiment over time, total counts of each sentiment and the top documents with positive and negative sentiment from unstructured text.

The Kibana dashboard is interactive and user-friendly, allowing you to dive deep into your unstructured text data. Try this solution now:

This solution is available in all Regions where Amazon ES and Amazon Comprehend are available. Please refer to the AWS Region Table for more information about Amazon Elasticsearch Service and Amazon Comprehend availability.

About the Author

Sameer Karnik is a Sr. Product Manager leading product for Amazon Comprehend, AWS’s natural language processing service.






Build a custom entity recognizer using Amazon Comprehend


Amazon Comprehend is a natural language processing service that can extract key phrases, places, names, organizations, events, sentiment, and more from unstructured text. Customers usually want to add their own entity types unique to their business, like proprietary part codes or industry-specific terms. In November 2018, enhancements to Amazon Comprehend added the ability to extend the default entity types to custom entities. In addition, a custom classification feature allows you to group documents into named categories. For example, you can now group support emails by department, social media posts by product, and analyst reports by business unit.


In this post, I cover how to build a custom entity recognizer. No prior machine learning knowledge is required. I demonstrate an example that requires you to wrangle, filter, and clean the data before you can train the custom entity recognizer. Otherwise, you can just adhere to the following step-by-step instructions. These instructions begin with the dataset already prepared.

In this example, I use the following dataset: Customer Support on Twitter hosted on Kaggle. The dataset is chiefly comprised of short utterances. This is a typical and common illustration of chat conversations between a customer and a support representative. Here are some sample utterances from the Twitter dataset:

@AppleSupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡

@SpotifyCares Thanks! Version armv7 on anker bluetooth speaker on Samsung Galaxy Tab A (2016) Model SM-T280 Does distance from speaker matter?

I filtered the data and kept only the tweets that contain “TMobileHelp” and “sprintcare” so that you can focus on one particular domain and context. Download the comprehend_blog_data.zip file and unzip the dataset onto your computer.


In this example, you create a custom entity recognizer to extract information regarding iPhones and Samsung Galaxy phones. Currently, Amazon Comprehend recognizes both devices as “commercial items.” In this use case, you should be more specific.

Because you must be able to extract smartphone devices in particular, it would be counterproductive to limit the extracted data to generic commercial items. With this capability, a service provider can then easily extract device information from a tweet and route the problem to the relevant technical support team.

In the Amazon Comprehend console, create a custom entity recognizer for devices. Choose Train Recognizer.

Provide a name and an Entity type label, such as DEVICE.

To train a custom entity recognition model, you can choose one of two ways to provide data to Amazon Comprehend:

  • Annotations: Uses an annotation list, which provides the location of your entities within a large number of documents. Amazon Comprehend can train from both the entity itself and its context.
  • Entity lists: Provides only a limited context. It only uses a selection from the specific entities list so that Amazon Comprehend can train on identifying the custom entity.

For simplicity, use the entity list method. The Annotation method can often lead to more refined results.

Provide a list of unique entities that have at least 1000 matches within a training dataset. Here is a list of devices included in the entity_list.csv file:

iPhone X,DEVICE 
Samsung Galaxy,DEVICE 
Samsung Note,DEVICE 

Split the initial dataset and hold out about 1000 records for testing purposes. This sample of records is used to test the model in a later step.

The rest of the data constitutes the training dataset (raw_txt.csv). As a general rule, you should include as much relevant data as possible. The more data that you add, the more context the model can have on which to train itself.
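The hold-out split described above can be sketched in a few lines of Python. This is a minimal illustration and not the preparation code used for the post: the function name, the fixed seed, and the synthetic stand-in records are all assumptions.

```python
import random

def split_holdout(records, holdout_size=1000, seed=42):
    """Shuffle a copy of the records and split off a held-out test set.

    Returns (training_records, test_records). The seed is fixed only so
    that the split is reproducible between runs.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]

# Hypothetical stand-in for the filtered tweets, one utterance per record.
tweets = [f"tweet number {i}" for i in range(5000)]

train_records, test_records = split_holdout(tweets)
print(len(train_records), len(test_records))
```

In practice, the training records would be written to raw_txt.csv and the held-out records kept aside for the test job in a later step.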

Upload the entity_list.csv and the raw_txt.csv files to an S3 bucket and provide the path for the entity list and training dataset locations.

To grant permissions to Amazon Comprehend to access your S3 bucket, create an IAM service-linked role, as shown in the screenshot below. Use AmazonComprehendServiceRole-role.

Choose Train. Amazon Comprehend then takes your custom entity recognizer, goes through a number of models, tunes hyperparameters, and checks cross-validation to make sure that your model is robust. These are the same activities that data scientists perform to ensure that their models are robust.
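If you prefer to script this step rather than use the console, the same training job can be submitted through the Amazon Comprehend CreateEntityRecognizer API. The following is a sketch under stated assumptions: the recognizer name, bucket name, account ID, and Region are placeholders, and the helper function names are hypothetical. Only the request fields themselves come from the API.

```python
def build_recognizer_request(name, role_arn, documents_s3_uri, entity_list_s3_uri,
                             entity_type="DEVICE", language_code="en"):
    """Assemble keyword arguments for Comprehend's CreateEntityRecognizer API."""
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": language_code,
        "InputDataConfig": {
            # One custom entity type, trained from an entity list.
            "EntityTypes": [{"Type": entity_type}],
            "Documents": {"S3Uri": documents_s3_uri},
            "EntityList": {"S3Uri": entity_list_s3_uri},
        },
    }

def submit(request, region_name="us-east-1"):
    """Send the request. Needs boto3, AWS credentials, and real S3 objects."""
    import boto3  # imported lazily so the builder above has no dependencies
    client = boto3.client("comprehend", region_name=region_name)
    return client.create_entity_recognizer(**request)["EntityRecognizerArn"]

# All values below are placeholders for illustration.
request = build_recognizer_request(
    name="device-recognizer",
    role_arn="arn:aws:iam::123456789012:role/AmazonComprehendServiceRole-role",
    documents_s3_uri="s3://my-bucket/raw_txt.csv",
    entity_list_s3_uri="s3://my-bucket/entity_list.csv",
)
print(request["InputDataConfig"]["EntityTypes"])
```

Calling submit(request) would start training and return the recognizer ARN. It is left uncalled here because it requires a real account and uploaded data.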

Test your model

Next, create a job and test your model, as shown in the screenshot below.

Provide an output folder where Amazon Comprehend saves the results.

Select the IAM role that you created in the previous step, and choose Create Job.

When your job analysis is complete, you have JSON files in your output S3 bucket path.

Now, to create a schema and to query your data, use AWS Glue and Amazon Athena, respectively. Follow the steps, provide the output path of your results, and create a database in AWS Glue. My AWS Glue crawler is shown in the following screenshot.

Next, run some queries in Athena and see which entities your custom annotator picks up.

SELECT col3, count(col3) 
FROM "comprehend - device"."202860692096_ner_e4f07c65cc5d7f1ca0c2a46ccd3e408c" 
group by col3;

You might notice that Amazon Comprehend has picked up additional words with varying spellings. This is to be expected when analyzing social media data, which contains typos and abbreviated spellings.
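The same count can also be produced directly from the job output without AWS Glue or Athena. The sketch below assumes that each line of the output is a JSON object with an "Entities" list whose items carry "Text" and "Type" fields. The sample records are made up for illustration and are not results from the dataset.

```python
import json
from collections import Counter

def count_entity_texts(json_lines, entity_type="DEVICE"):
    """Count how often each surface form of the given entity type appears."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for entity in record.get("Entities", []):
            if entity.get("Type") == entity_type:
                counts[entity["Text"]] += 1
    return counts

def _sample_line(line_number, texts):
    """Build one made-up output record (illustrative data only)."""
    entities = [{"Text": t, "Type": "DEVICE", "Score": 0.9} for t in texts]
    return json.dumps({"File": "test.csv", "Line": line_number, "Entities": entities})

sample_lines = [
    _sample_line(0, ["iPhone X"]),
    _sample_line(1, ["iphone x"]),
    _sample_line(2, []),
    _sample_line(3, ["iPhone X", "Samsung Galaxy"]),
]

counts = count_entity_texts(sample_lines)
for text, n in counts.most_common():
    print(text, n)
```

Grouping on the raw text, as the query does, keeps variant spellings such as "iphone x" separate, which makes them easy to spot.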


In this post, I demonstrated how to build a custom entity recognition model, run some validation, and query the results. You could follow this post without having to know any of the complex and intricate procedures that must be mastered to build an NLP model.

In a real-life scenario, a service provider monitoring these tweets could leverage the custom entity recognition capabilities of Amazon Comprehend to extract information about the types of device mentioned in the tweet. They might also extract and assess the tone or sentiment of the tweet using Amazon Comprehend’s built-in sentiment analysis API.

This machine learning application can provide important context and assessment of a customer’s intent, which then enables the service provider to make intelligent routing and remediation decisions. Overall, this process improves service and increases customer satisfaction.

Try custom entities now from the Amazon Comprehend console and get detailed instructions in the Amazon Comprehend documentation. This solution is available in all Regions where Amazon Comprehend is available. Please refer to the AWS Region Table for more information.

About the Authors

Phi Nguyen is a solution architect at AWS helping customers with their cloud journey with a special focus on data lake, analytics, semantics technologies and machine learning. In his spare time, you can find him biking to work, coaching his son’s soccer team or enjoying nature walks with his family.




Ro Mullier is a Sr. Solutions Architect at AWS helping customers run a variety of applications on AWS and machine learning workloads in particular. In his spare time, he enjoys spending time with family and friends, playing soccer and competing in machine learning competitions.


from AWS Machine Learning Blog

Power contextual bandits using continual learning with Amazon SageMaker RL

Amazon SageMaker is a modular, fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Training models is quick and easy using a set of built-in high-performance algorithms, pre-built deep learning frameworks, or using your own framework. To help select your machine learning (ML) algorithm, Amazon SageMaker comes with the most common ML algorithms that are pre-installed and performance-optimized.

In addition to building machine learning models using supervised and unsupervised learning techniques, you can also build reinforcement learning models in Amazon SageMaker using Amazon SageMaker RL. Amazon SageMaker RL includes pre-built RL libraries and algorithms that make it easy to get started with reinforcement learning. There are several examples in GitHub that show you how you can use Amazon SageMaker RL for training robots and autonomous vehicles, portfolio management, energy optimization, and automatic capacity scaling.

In this blog post, we are excited to show you how you can use Amazon SageMaker RL to implement contextual multi-armed bandits (or contextual bandits for short) to personalize content for users. The contextual bandits algorithm recommends various content options to the users (such as gamers or hiking enthusiasts) by learning from user responses to the recommendations such as clicking a recommendation or not. These algorithms require that the machine learning models be continually updated to adapt to changes in data, and we show you how to build an iterative training and deployment loop in Amazon SageMaker.

Contextual bandits

Many applications like personalized web services (content layout, ads, search, product recommendations, etc.) are continuously faced with decisions to make, often based on some contextual information. These applications need to personalize content for individuals by making use of both user and content information: for example, user information indicating that she is a gaming enthusiast and content information indicating that an item is a racing game. Machine learning systems that enable these applications face two challenges. First, the data to learn user preferences is sparse and biased (many users have little or no history and many products have never been recommended in the past). Second, new users and content are always being added to the system. Traditional Collaborative Filtering (CF) based approaches, used for personalization, build a static recommendation model for the sparse/biased dataset and for the current set of users and content. Contextual bandits, on the other hand, collect and augment data in a strategic manner by trading off between exploiting known information (recommending games to the gaming enthusiast) and exploring recommendations (recommending hiking gear to the gaming enthusiast) that may yield higher benefits. Bandit models also use user and content features, so they can make recommendations for new content and users based on the preferences of similar content and users.

Before we go any further, let us introduce some terminology. A contextual bandits algorithm is characterized by an iterative process. An agent chooses from a number of choices (known as arms or actions), each of which yields a stochastic reward. At the beginning of each round, the environment generates a state of fixed dimensionality (also called context) and rewards for each action, which are related to the state. The agent chooses an arm with a certain probability for that round, and the environment reveals the reward for that arm, but not for the others. The goal of the agent is to explore and exploit actions so that it learns a good model while minimizing use of actions that yield low rewards.
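To make the terminology concrete, here is a toy version of this loop in Python. It uses a simple epsilon-greedy agent with a lookup table, which is an assumption chosen for brevity and is not the algorithm used by the Amazon SageMaker RL bandits container. The environment, in which the rewarded action equals the context, is also made up for illustration.

```python
import random

def run_bandit(n_rounds=2000, n_contexts=3, n_actions=3, epsilon=0.1, seed=0):
    """Run a toy contextual bandit and return the <state, action, probability, reward> log."""
    rng = random.Random(seed)
    totals = [[0.0] * n_actions for _ in range(n_contexts)]
    counts = [[0] * n_actions for _ in range(n_contexts)]
    log = []
    for _ in range(n_rounds):
        # The environment generates a context (state) for this round.
        state = rng.randrange(n_contexts)
        # The agent's current reward estimate for every action in this state.
        means = [totals[state][a] / counts[state][a] if counts[state][a] else 0.0
                 for a in range(n_actions)]
        greedy = max(range(n_actions), key=means.__getitem__)
        # Explore with probability epsilon, otherwise exploit the best estimate.
        action = rng.randrange(n_actions) if rng.random() < epsilon else greedy
        # Probability with which this action was chosen under epsilon-greedy.
        probability = epsilon / n_actions + ((1 - epsilon) if action == greedy else 0.0)
        # Only the reward of the chosen arm is revealed (toy rule: match the context).
        reward = 1.0 if action == state else 0.0
        totals[state][action] += reward
        counts[state][action] += 1
        log.append((state, action, probability, reward))
    return log

log = run_bandit()
late_rewards = [reward for _, _, _, reward in log[-500:]]
print("mean reward over the last 500 rounds:", sum(late_rewards) / len(late_rewards))
```

Each logged tuple has the same shape as the <state, action, probability, reward> data discussed in this post. Recording the probability matters because offline evaluation needs to know how likely each logged action was.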

Amazon SageMaker RL contextual bandits solution

To implement the explore-exploit strategy in Amazon SageMaker RL, we developed an iterative training and deployment system that: (1) Presents the recommendations from the currently hosted contextual bandit model to the user, based on her features (context), (2) Captures the implicit feedback over time, and (3) Continuously re-trains the model with incremental interaction data.

In particular, the Amazon SageMaker RL bandits solution has the following features. Accompanying this blog, we are also releasing an Amazon SageMaker example Notebook demonstrating these features.

Amazon SageMaker RL Bandits Container: The Amazon SageMaker RL bandits container provides a library of contextual bandits algorithms from the Vowpal Wabbit (VW) project. In addition, it also provides support for hosting the trained bandit models for predictions.

Warm start: If there is historical data capturing user and content interactions, it can be used to create the initial model. In particular, data of the form <state, action, probability, reward> is needed. Presence of such data can help improve the model convergence times (number of training and deployment cycles). In the absence of such data, we can also initialize the model randomly. In the following code from our Amazon SageMaker example Notebook, we show how to warm start the bandits model with historical data.

bandits_experiment = ExperimentManager(config, experiment_id='demo-1')

(Simulated) Client Application and Reward Ingestion: Any real-world application (for example, a retail website serving recommendations to users), referred to as the Client Application in the figure above, pings the Amazon SageMaker hosted endpoint with user features (state) and receives recommendations (action) with an associated probability (probability) in return. In addition, the client application also receives a system-generated event_id. Data generated as a result of user interactions with the recommendations is used in the subsequent iteration of training. In particular, the user behavior of interest (such as clicks and purchases) is captured as the feedback or reward. The feedback may not be instantaneous (for example, a purchase a few hours after the recommendation), and the client application is expected to (1) associate the reward with the event_id and (2) upload the aggregated rewards data (<reward, event_id>) back to S3. We include code in the example notebook to demonstrate how such a client application can be implemented. The simulated application has a predictor object that has the logic to make HTTP requests to the Amazon SageMaker endpoint. The event_id is used to join inference data (<state, action, probability, event_id>) with the rewards data (<reward, event_id>).

predictor = bandits_experiment.predictor
sim_app = StatlogSimApp(predictor=predictor)

batch_size = 500 # collect 500 data instances
print("Collecting batch of experience data...")

# Generate experiences and log them
for i in range(batch_size):
    user_id, user_context = sim_app.choose_random_user()
    action, event_id, model_id, action_prob, sample_prob = predictor.get_action(obs=user_context.tolist())
    reward = sim_app.get_reward(user_id, action, event_id, model_id, action_prob, sample_prob, local_mode)
# Join (observation, action) with rewards (can be delayed) and upload the data to S3
print("Waiting for Amazon Kinesis Data Firehose to flush data to s3...")
rewards_s3_prefix = bandits_experiment.ingest_rewards(sim_app.rewards_buffer)

Inference logging: To use data generated from user interactions with the deployed contextual bandit models, we need to be able to capture data at the inference time (<state, action, probability, event_id>). Inference data logging happens automatically from the deployed Amazon SageMaker endpoint serving the bandits model. The data is captured and uploaded to an S3 bucket in the user account. Please refer to the notebook for details on the S3 locations where this data is stored.

Customizable joins: At every iteration, the training data is obtained by joining the inference data with the rewards data. By default, all of the specified rewards data and inference data are used for the join. The Amazon SageMaker RL bandits solution also lets customers specify a time window on which the inference data and rewards data can be joined (number of hours before the join).
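The join itself is conceptually simple. The following in-memory sketch shows the idea, including the optional time window. The record layout and the "timestamp" field are assumptions made for illustration, and the actual solution performs the join with AWS services rather than with Python dictionaries.

```python
from datetime import datetime, timedelta

def join_experience(inference_records, reward_records, now, window_hours=None):
    """Attach rewards to inference records that share an event_id.

    If window_hours is given, only records from the last window_hours
    before `now` take part in the join.
    """
    cutoff = None if window_hours is None else now - timedelta(hours=window_hours)
    rewards = {r["event_id"]: r["reward"] for r in reward_records
               if cutoff is None or r["timestamp"] >= cutoff}
    joined = []
    for record in inference_records:
        if cutoff is not None and record["timestamp"] < cutoff:
            continue
        if record["event_id"] in rewards:
            joined.append({**record, "reward": rewards[record["event_id"]]})
    return joined

now = datetime(2019, 8, 1, 12, 0)
inference = [
    {"event_id": "e1", "state": [1, 0], "action": 2, "probability": 0.9,
     "timestamp": now - timedelta(hours=1)},
    {"event_id": "e2", "state": [0, 1], "action": 1, "probability": 0.1,
     "timestamp": now - timedelta(hours=10)},
    {"event_id": "e3", "state": [1, 1], "action": 3, "probability": 0.9,
     "timestamp": now - timedelta(minutes=30)},  # no reward reported yet
]
rewards = [
    {"event_id": "e1", "reward": 1, "timestamp": now - timedelta(minutes=30)},
    {"event_id": "e2", "reward": 0, "timestamp": now - timedelta(hours=9)},
]

all_joined = join_experience(inference, rewards, now)
recent_joined = join_experience(inference, rewards, now, window_hours=2)
print([r["event_id"] for r in all_joined], [r["event_id"] for r in recent_joined])
```

Events without a reward (e3 here) simply drop out of the training data for that iteration.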

Iterative training and deployment (Continual Learning setup): The example notebook and accompanying code help demonstrate how to use Amazon SageMaker and other AWS services to create the iterative training and deployment loop to build and train bandit models. This is demonstrated in two parts. First, the notebook demonstrates each step individually (model initialization, deploying the first model, initializing the client application, reward ingestion, model re-training and re-deployment). These individual steps help during the development phase. Subsequently, an end-to-end loop demonstrates how bandits models can be deployed post development. The ExperimentManager class can be used for all the Bandits/RL and continual learning workflows. Similar to the estimators in the Amazon SageMaker Python SDK, ExperimentManager contains methods for training, deployment, and evaluation. It keeps track of the job status and reflects current progress in the workflow. It sets up an AWS CloudFormation stack of AWS resources like Amazon DynamoDB, Amazon Kinesis Data Firehose and Amazon Athena, that are required to support the continual learning loop, in addition to Amazon SageMaker.

Offline model evaluation and visualization: At every training and deployment iteration, we demonstrate how offline model evaluation can be used to aid the decision to update the deployed model. After every training cycle, we need to evaluate if the newly trained model is better than the one currently deployed. Using an evaluation dataset, we evaluate how the new model would have done had it been deployed compared to the model that is currently deployed. Amazon SageMaker RL supports offline evaluation by performing this counterfactual analysis (CFA). By default, we apply a doubly robust (DR) estimation method [1]. These evaluation scores are also sent to Amazon CloudWatch so that for long running cycles, users can visualize the progress over time.

# Evaluate the recently trained model
eval_score_last_trained_model = bandits_experiment.get_eval_score(...)

# Evaluate the deployed model
eval_score_last_hosted_model = bandits_experiment.get_eval_score(...)

# Deploy if trained model is better
if eval_score_last_trained_model <= eval_score_last_hosted_model:
    ...
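To give a feel for what the counterfactual analysis computes, here is a minimal sketch of a doubly robust value estimate for a deterministic policy, following the general form in [1]. It is not the implementation inside Amazon SageMaker RL. The function names, the constant reward model, and the four logged records are assumptions chosen so that the arithmetic is easy to check by hand.

```python
def doubly_robust_value(logged, policy, reward_model):
    """Estimate the average reward `policy` would have earned on logged data.

    logged: iterable of (state, action, probability, reward) tuples, where
        probability is the chance the logging policy gave to `action`.
    policy: function state -> action (the policy being evaluated).
    reward_model: function (state, action) -> predicted reward.
    """
    total = 0.0
    n = 0
    for state, action, probability, reward in logged:
        target_action = policy(state)
        # Direct-method term: the model's guess for the evaluated policy's action.
        estimate = reward_model(state, target_action)
        # Correction term: applies only when the logged action matches,
        # reweighted by the inverse of the logging probability.
        if action == target_action:
            estimate += (reward - reward_model(state, action)) / probability
        total += estimate
        n += 1
    return total / n

# Made-up log: two states, two actions, uniform logging policy (probability 0.5).
logged = [(0, 0, 0.5, 1.0), (0, 1, 0.5, 0.0), (1, 1, 0.5, 1.0), (1, 0, 0.5, 0.0)]

def constant_model(state, action):
    return 0.5  # a deliberately crude reward model

good_value = doubly_robust_value(logged, lambda s: s, constant_model)
bad_value = doubly_robust_value(logged, lambda s: 1 - s, constant_model)
print(good_value, bad_value)
```

Note that this sketch scores a policy by estimated reward, where higher is better. Whether a lower or higher number wins in a comparison depends on whether the score is defined as a reward or as a cost.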

Amazon SageMaker example notebook

To demonstrate the bandits application, we used the Statlog(Shuttle) dataset from the UCI Machine Learning repository [2]. It contains nine integer attributes (or features) related to indicators during a space shuttle flight, and the goal is to predict one of seven states of the radiator subsystem of the shuttle. For demonstrating the bandits solution, this multi-class classification problem is converted into a bandits problem. In the classification problem, the algorithm receives features and correct label per datapoint. In the bandit problem, the algorithm picks one of the label options given the features. If this matches the class in the original data point, a reward of one is assigned. If not, a reward of zero is assigned.

We create an offline dataset to showcase the warm-start feature. For this purpose, 100 data points are randomly selected. The features are considered as the context and an action is generated for each sample by selecting one class randomly from the seven (probability=1/7). During the training and deployment loop, the hosted bandits model generates a predicted class (action) and the associated probability. Again, the reward is assigned as one if the predicted class matches the actual class. Otherwise, it is set to zero. After every 500 data points, the accumulated data is used to re-train the model, and the new model is deployed based on its offline model evaluation.
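The conversion from classification data to warm-start bandit data can be sketched as follows. The feature rows and labels below are randomly generated placeholders rather than the Statlog (Shuttle) data, and the function name is hypothetical. Only the procedure (uniform random action, probability 1/7, reward of one on a match) follows the description above.

```python
import random

N_CLASSES = 7  # the seven radiator states, labeled 1..7 in this sketch

def make_warm_start(features, labels, n_samples=100, seed=0):
    """Turn labeled examples into <state, action, probability, reward> tuples."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(features)), n_samples)
    data = []
    for i in indices:
        action = rng.randrange(1, N_CLASSES + 1)    # pick a class uniformly at random
        reward = 1 if action == labels[i] else 0    # one on a match, zero otherwise
        data.append((features[i], action, 1.0 / N_CLASSES, reward))
    return data

# Placeholder data: nine integer attributes per row and a label in 1..7.
_rng = random.Random(1)
features = [[_rng.randrange(100) for _ in range(9)] for _ in range(1000)]
labels = [_rng.randrange(1, N_CLASSES + 1) for _ in range(1000)]

warm_start = make_warm_start(features, labels)
print(len(warm_start), warm_start[0])
```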

Local vs Amazon SageMaker modes

The explore/exploit strategy requires iterative training and model deployment cycles. For faster experimentation and development cycles, we used the Amazon SageMaker local mode. In this mode, model training, data joins, and deployment all happen in the Amazon SageMaker notebook instance, which aids faster iteration. With a single click, you can move from local mode to training in Amazon SageMaker for production use cases where you need to scale to high model throughput.

Comparing different Exploration strategies

We compare the rewards received in the Statlog (Shuttle) simulated environment between using a naive random strategy to explore the environment versus a bandit algorithm called online cover [3]. The figure below shows how the bandit algorithm explores different actions initially, learns from the received rewards and shifts to exploiting as time progresses. The agent receives a reward of one if the predicted action is the correct class and zero otherwise. The oracle always knows the right action to take for each state, and gets a perfect score of one. The experiment starts with a model warm started from 100 data points and updates the model every 500 interactions for a total of 7500 interactions. The rewards shown are a rolling mean over 100 data points. The rewards plot aligns with results reported in the literature [4].
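For readers reproducing the plot, the smoothing step can be sketched as below. This is one common way to compute a trailing rolling mean, offered as an assumption. The notebook may compute it differently, for example with a dataframe library.

```python
def rolling_mean(values, window=100):
    """Mean of the most recent `window` values at each step.

    The first window-1 outputs average over the values seen so far.
    """
    out = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out

smoothed = rolling_mean([1, 0, 1, 1], window=2)
print(smoothed)  # [1.0, 0.5, 0.5, 1.0]
```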


In this blog post, we showcased how you can use Amazon SageMaker RL and the Amazon SageMaker built-in bandits container to systematically train and deploy contextual bandit models. We explained how you can get started with training multi-armed contextual bandit models interacting with a live environment and updating the model along with efficient exploration. The accompanying Amazon SageMaker example notebook demonstrates how you can manage your own bandits workflow on top of all the benefits offered by the Amazon SageMaker managed service. To learn more about Amazon SageMaker RL, see the developer documentation.


  1. Dudik, M., Langford, J. and Li, L. (2011). Doubly Robust Policy Evaluation and Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011).
  2. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
  3. Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L. and Schapire, R.E. (2014). Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).
  4. Bietti, A., Agarwal, A. and Langford, J. (2018). A Contextual Bandit Bake-off.

About the Authors

Saurabh Gupta is an Applied Scientist with AWS Deep Learning. He did his MS in AI and Machine Learning from UC San Diego. His interests lie in Natural Language Processing and Reinforcement Learning algorithms, and in providing high performance practical implementations of the former, that are deployable in the real world.




Bharathan Balaji is a Research Scientist in AWS and his research interests lie in reinforcement learning systems and applications. He contributed to the launch of Amazon SageMaker RL and AWS DeepRacer. He received his Ph.D. in Computer Science and Engineering from University of California, San Diego.




Anna Luo is an Applied Scientist at AWS. She works on utilizing RL techniques for different domains including supply chain and recommender systems. She received her Ph.D. in Statistics from University of California, Santa Barbara.




Yijie Zhuang is a Software Engineer with AWS SageMaker. He did his MS in Computer Engineering from Duke. His interests lie in building scalable algorithms and reinforcement learning systems. He contributed to Amazon SageMaker Built-in Algorithms and Amazon SageMaker RL.




Siddhartha Agarwal is a Software Developer with the AWS Deep Learning team. He did his Masters in Computer Science from UC San Diego, and currently focuses on building Reinforcement Learning solutions on SageMaker. Prior to SageMaker, he worked on Amazon Comprehend, a natural language processing service on AWS. In his leisure time, he loves to cook and explore new places.




Vineet Khare is a Sciences Manager for AWS Deep Learning. He focuses on building Artificial Intelligence and Machine Learning applications for AWS customers using techniques that are at the forefront of research. In his spare time, he enjoys reading, hiking and spending time with his family.





Speed up training on Amazon SageMaker using Amazon EFS or Amazon FSx for Lustre file systems

Amazon SageMaker provides a fully-managed service for data science and machine learning workflows. One of the most important capabilities of Amazon SageMaker is its ability to run fully-managed training jobs to train machine learning models. Visit the service console to train machine learning models yourself on Amazon SageMaker.

Now, you can speed up your training job runs by training machine learning models from data stored in Amazon Elastic File System (EFS) or Amazon FSx for Lustre. Amazon EFS provides a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources. Amazon FSx for Lustre is a high-performance file system optimized for workloads such as machine learning, analytics, and high performance computing.

Training machine learning models requires providing the training datasets to the training job. When using Amazon Simple Storage Service (S3) as the training datasource in file input mode, all training data is downloaded from Amazon S3 to the EBS volumes attached to the training instances at the start of the training job. A distributed file system such as Amazon EFS or FSx for Lustre can speed up machine learning training by eliminating the need for this download step.

In this blog post, we go over the benefits of training your models using a file system, provide information to help you choose a file system, and show you how to get started.

Choosing a file system for training models on SageMaker

When considering whether you should train your machine learning models from a file system, the first thing to consider is: where does your training data reside now?

If your training data is already in Amazon S3 and your needs do not dictate a faster training time for your training jobs, you can get started with Amazon SageMaker with no need for data movement. However, if you need faster startup and training times we recommend that you take advantage of Amazon SageMaker’s integration with Amazon FSx for Lustre file system, which can speed up your training jobs by serving as a high-speed cache.

The first time you run a training job, if Amazon FSx for Lustre is linked to Amazon S3, it automatically loads data from Amazon S3 and makes it available to Amazon SageMaker at hundreds of gigabytes per second and submillisecond latencies. Additionally, subsequent iterations of your training job have instant access to the data in Amazon FSx. Because of this, Amazon FSx benefits most those training jobs that have several iterations requiring multiple downloads from Amazon S3, and workflows where training jobs must be run several times using different training algorithms or parameters to see which gives the best result.

If your training data is already in an Amazon EFS file system, we recommend choosing Amazon EFS as the file system data source. This choice has the benefit of directly launching your training jobs from the data in Amazon EFS with no data movement required, resulting in faster training start times. This is often the case in environments where data scientists have home directories in Amazon EFS, and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with which fields or labels to include. For example, a data scientist can use a Jupyter notebook to do initial cleansing on a training set, launch a training job from Amazon SageMaker, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.

Getting started with Amazon FSx for training on Amazon SageMaker

  1. Note your training data Amazon S3 bucket and path.
  2. Launch an Amazon FSx file system with the desired size and throughput, and reference the training data Amazon S3 bucket and path. Once created, note your file system id.
  3. Now, go to the Amazon SageMaker console and open the Training jobs page to create the training job, associate VPC subnets, security groups, and provide the file system as the data source for training.
  4. Create your training job:
    1. Provide the ARN for the IAM role with the required access control and permissions policy. Refer to AmazonSageMakerFullAccess for details.
    2. Specify a VPC that your training jobs and file system have access to. Also, verify that your security groups allow Lustre traffic over port 988 to control access to the training dataset stored in the file system. For more details, refer to Getting started with Amazon FSx.
    3. Choose file system as the data source and properly reference your file system id, path, and format.
  5. Launch your training job.

Getting started with Amazon EFS for training on Amazon SageMaker

  1. Put your training data in its own directory in Amazon EFS.
  2. Now go to the Amazon SageMaker console and open the Training jobs page to create the training job, associate VPC subnets, security groups, and provide the file system as the data source for training.
  3. Create your training job:
    1. Provide the ARN for the IAM role with the required access control and permissions policy.
    2. Specify a VPC that your training jobs and file system have access to. Also, verify that your security groups allow NFS traffic over port 2049 to control access to the training dataset stored in the file system.
    3. Choose file system as the data source and properly reference your file system id, path, and format.
  4. Launch your training job.
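For either file system, the data source you choose in the console corresponds to a FileSystemDataSource entry in the InputDataConfig of the CreateTrainingJob API. The sketch below only builds that entry as a plain dictionary. The helper name, file system IDs, and directory paths are placeholders, and the rest of the training job request (role, VPC configuration, algorithm, and so on) is omitted.

```python
# Ports that the security groups must allow, as listed in the steps above.
REQUIRED_PORT = {"FSxLustre": 988, "EFS": 2049}

def file_system_channel(file_system_id, file_system_type, directory_path,
                        channel_name="training", access_mode="ro"):
    """Build one InputDataConfig channel that reads from EFS or FSx for Lustre."""
    if file_system_type not in REQUIRED_PORT:
        raise ValueError("file_system_type must be 'EFS' or 'FSxLustre'")
    if access_mode not in ("ro", "rw"):
        raise ValueError("access_mode must be 'ro' or 'rw'")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "FileSystemAccessMode": access_mode,
                "DirectoryPath": directory_path,
            }
        },
    }

# Placeholder IDs and paths for illustration only.
fsx_channel = file_system_channel("fs-0123456789abcdef0", "FSxLustre", "/fsx/training-data")
efs_channel = file_system_channel("fs-01234567", "EFS", "/training-data")

for channel in (fsx_channel, efs_channel):
    source = channel["DataSource"]["FileSystemDataSource"]
    print(source["FileSystemType"], "needs port", REQUIRED_PORT[source["FileSystemType"]])
```

Read-only access ("ro") is usually sufficient for a training channel, since the job only needs to read the dataset.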

After your training job completes, you can view the status history of the training job to observe the faster download time when using a file system data source.


With the addition of Amazon EFS and Amazon FSx for Lustre as data sources for training machine learning models in Amazon SageMaker, you now have greater flexibility to choose a data source that is suited to your use case. In this blog post, we used a file system data source to train machine learning models, resulting in faster training start times by eliminating the data download step.

Start training machine learning models yourself on Amazon SageMaker, or refer to our sample notebook to train a linear learner model using a file system data source.


About the Authors

Vidhi Kastuar is a Sr. Product Manager for Amazon SageMaker, focusing on making machine learning and artificial intelligence simple, easy to use and scalable for all users and businesses. Prior to AWS, Vidhi was Director of Product Management at Veritas Technologies. For fun outside work, Vidhi loves to sketch and paint, work as a career coach, and spend time with his family and friends.



Will Ochandarena is a Principal Product Manager on the Amazon Elastic File System team, focusing on helping customers use EFS to modernize their application architectures. Prior to AWS, Will was Senior Director of Product Management at MapR.





Serving deep learning at Curalate with Apache MXNet, AWS Lambda, and Amazon Elastic Inference

This is a guest blog post by Jesse Brizzi, a computer vision research engineer at Curalate.

At Curalate, we’re always coming up with new ways to use deep learning and computer vision to find and leverage user-generated content (UGC) and activate influencers. Some of these applications, like Intelligent Product Tagging, require deep learning models to process images as quickly as possible. Other deep learning models must ingest hundreds of millions of images per month to generate useful signals and serve content to clients.

As a startup, Curalate had to find a way to do all of this at scale in a high-performance, cost-effective manner. Over the years, we’ve used every type of cloud infrastructure that AWS has to offer in order to host our deep learning models. In the process, we learned a lot about serving deep learning models in production and at scale.

In this post, I discuss the important factors that Curalate considered when designing our deep learning infrastructure, how API/service types prioritize these factors, and, most importantly, how various AWS products meet these requirements in the end.

Problem overview

Let’s say you have a trained MXNet model that you want to serve in your AWS Cloud infrastructure. How do you build it, and what solutions and architecture do you choose?

At Curalate, we’ve been working on an answer to this question for years. As a startup, we’ve always had to adapt quickly and try new options as they become available. We also roll our own code when building our deep learning services. Doing so allows for greater control and lets us work in our programming language of choice. 

In this post, I focus purely on the hardware options for deep learning services. If you’re also looking for model-serving solutions, there are options available from Amazon SageMaker.

The following are some questions we ask ourselves:

  • What type of service/API are we designing?
    • Is it user-facing and real-time, or is it an offline data pipeline processing service?
  • How does each AWS hardware serving option differ?
    • Performance characteristics
      • How fast can you run inputs through the model?
    • Ease of development
      • How difficult is it to engineer and code the service logic?
    • Stability
      • How is the hardware going to affect service stability?
    • Cost
      • How cost effective is one hardware option over the others?

GPU solutions

GPUs probably seem like the obvious solution. Developments in the field of machine learning are closely intertwined with GPU processing power. GPUs are the reason that true “deep” learning is possible in the first place.

GPUs have played a role in every one of our deep learning services. They are fast enough to power our user-facing apps and keep up with our image data pipeline.

AWS offers many GPU solutions in Amazon EC2, ranging from cost-effective g3s.xlarge instances to powerful (and expensive) p3dn.24xlarge instances.

Instance type | CPU memory | CPU cores | GPUs | GPU type | GPU memory | On-Demand cost
g3s.xlarge | 30.5 GiB | 4 vCPUs | 1 | Tesla M60 | 8 GiB | $0.750 hourly
p2.xlarge | 61.0 GiB | 4 vCPUs | 1 | Tesla K80 | 12 GiB | $0.900 hourly
g3.4xlarge | 122.0 GiB | 16 vCPUs | 1 | Tesla M60 | 8 GiB | $1.140 hourly
g3.8xlarge | 244.0 GiB | 32 vCPUs | 2 | Tesla M60 | 16 GiB | $2.280 hourly
p3.2xlarge | 61.0 GiB | 8 vCPUs | 1 | Tesla V100 | 16 GiB | $3.060 hourly
g3.16xlarge | 488.0 GiB | 64 vCPUs | 4 | Tesla M60 | 32 GiB | $4.560 hourly
p2.8xlarge | 488.0 GiB | 32 vCPUs | 8 | Tesla K80 | 96 GiB | $7.200 hourly
p3.8xlarge | 244.0 GiB | 32 vCPUs | 4 | Tesla V100 | 64 GiB | $12.240 hourly
p2.16xlarge | 768.0 GiB | 64 vCPUs | 16 | Tesla K80 | 192 GiB | $14.400 hourly
p3.16xlarge | 488.0 GiB | 64 vCPUs | 8 | Tesla V100 | 128 GiB | $24.480 hourly
p3dn.24xlarge | 768.0 GiB | 96 vCPUs | 8 | Tesla V100 | 256 GiB | $31.212 hourly

Prices as of August 2019.

There are plenty of GPU, CPU, and memory resource options from which to choose. Out of all of the AWS hardware options, GPUs offer the fastest model runtimes per input and provide memory options sizeable enough to support your large models and batch sizes. For instance, each of the Nvidia V100 GPUs in a p3dn.24xlarge instance provides 32 GB of GPU memory.

However, GPUs are also the most expensive option. Even the cheapest GPU instance, the g3s.xlarge, could cost you about $547 per month in On-Demand Instance costs. When you start scaling up your service, these costs add up. Even the smallest GPU instances pack a lot of compute resources into a single unit, and they are more expensive as a result. There are no micro, medium, or even large EC2 GPU options.

Consequently, it can be inefficient to scale your resources. Adding another GPU instance can take you from being under-provisioned and falling slightly behind to being massively over-provisioned, which is a waste of resources. It’s also an inefficient way to provide redundancy for your services. Running a minimum of two instances brings your base costs to over $1,000 per month. For most service loads, you likely will not even come close to fully using those two instances.
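The monthly figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 730-hour month (8,760 hours per year divided by 12); the function and constant names are illustrative and not part of any AWS tooling.

```python
# Assumption: an average month has 730 hours (8,760 hours per year / 12).
HOURS_PER_MONTH = 730

# Hourly On-Demand price for g3s.xlarge, taken from the table above (August 2019).
G3S_XLARGE_HOURLY = 0.750


def monthly_on_demand_cost(hourly_rate, instance_count=1, hours=HOURS_PER_MONTH):
    """Return the approximate monthly On-Demand cost in USD."""
    return hourly_rate * instance_count * hours


single = monthly_on_demand_cost(G3S_XLARGE_HOURLY)
redundant_pair = monthly_on_demand_cost(G3S_XLARGE_HOURLY, instance_count=2)

print(f"one g3s.xlarge: ${single:,.2f} per month")
print(f"two g3s.xlarge: ${redundant_pair:,.2f} per month")
```

With these inputs, a single instance comes out at roughly $547 per month and a redundant pair at just under $1,100, which matches the base cost described above.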

In addition to runtime costs, the development costs and challenges are what you would expect from creating a deep learning–based service. If you are rolling your own code and not using a model server like MMS, you have to manage access to your GPU from all the incoming parallel requests. This can be a bit challenging, as you can only fit a few models on your GPU at one time.

Even then, running inputs simultaneously through multiple models can lead to suboptimal performance and cause stability issues. In fact, at Curalate, we only send one request at a time to any of the models on the GPU.
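One way to enforce the one-request-at-a-time rule is to guard each model with a lock. The following is a minimal sketch in Python, not Curalate’s actual implementation: `SerializedModel` and `fake_forward_pass` are hypothetical names, and the forward pass is a stand-in for a real GPU inference call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SerializedModel:
    """Wraps a prediction function so only one request runs through it at a time."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn  # hypothetical: the real GPU forward pass
        self._lock = threading.Lock()

    def predict(self, model_input):
        # Parallel callers queue here; the model sees a single input at a time.
        with self._lock:
            return self._predict_fn(model_input)


# Counters used only to demonstrate that calls never overlap.
active = 0
max_active = 0


def fake_forward_pass(x):
    """Stand-in for a model call; records how many calls are in flight."""
    global active, max_active
    active += 1
    max_active = max(max_active, active)
    time.sleep(0.001)  # simulate inference time
    active -= 1
    return x * 2


model = SerializedModel(fake_forward_pass)

# Eight worker threads simulate parallel incoming requests.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(model.predict, range(16)))

print(results)
print("max concurrent forward passes:", max_active)
```

The trade-off is that requests queue behind the lock, so per-request latency grows with load even though the GPU itself stays stable.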

In addition, we use computer vision models. Consequently, we have to handle the downloading and preprocessing of input images. When you have hundreds of images coming in per second, it’s important to build memory and resource management considerations into your code to prevent your services from being overwhelmed.

Setting up AWS with GPUs is fairly trivial if you have previous experience setting up Elastic Load Balancing, Auto Scaling groups, or EC2 instances for other applications. The only difference is that you must ensure that your AMIs have the necessary Nvidia CUDA and cuDNN libraries installed for the code and MXNet to use. Beyond that consideration, you implement it on AWS just like any other cloud service or API.

Amazon Elastic Inference accelerator solution

What are these new Elastic Inference accelerators? They’re low-cost, GPU-powered accelerators that you can attach to an EC2 or Amazon SageMaker instance. AWS offers Elastic Inference accelerators from 4 GB all the way down to 1 GB of GPU memory. This is a fantastic development, as it solves the inefficiencies and scaling problems associated with using dedicated GPU instances.

| Accelerator type | FP32 throughput (TFLOPS) | FP16 throughput (TFLOPS) | Memory (GB) |
|---|---|---|---|
| eia1.medium | 1 | 8 | 1 |
| eia1.large | 2 | 16 | 2 |
| eia1.xlarge | 4 | 32 | 4 |

You can precisely pair the optimal Elastic Inference accelerator for your application with the optimal EC2 instance for the compute resources that it needs. Such pairing allows you to, for example, use 16 CPU cores to host a large service that runs a small model on an accelerator with 1 GB of memory. This also means that you can scale with finer granularity and avoid drastically over-provisioning your services. For scaling to your needs, a cluster of c5.large + eia1.medium instances is much more efficient than a cluster of g3s.xlarge instances.

Using the Elastic Inference accelerators with MXNet currently requires the use of a closed fork of the Python API or Scala API published by AWS and MXNet. These forks and other API languages will eventually merge with the open-source master branch of MXNet. You can load your MXNet model into the context of Elastic Inference accelerators, as with the GPU or CPU contexts. Consequently, the development experience is similar to developing deep learning services on a GPU-equipped EC2 instance. The same engineering challenges are there, and the overall code base and infrastructure should be nearly identical to the GPU-equipped options.

Thanks to the new Elastic Inference accelerators and access to an MXNet EIA fork in our API language of choice, Curalate has been able to bring our GPU usage down to zero. We moved all of our services that previously used EC2 GPU instances to various combinations of eia1.medium or eia1.large accelerators and c5.large or c5.xlarge EC2 instances. We chose each combination based on specific service needs, and the move required few to no code changes.

Setting up the infrastructure was a little more difficult, given that the Elastic Inference accelerators are fairly new and did not interact well with some of our cloud management tooling. However, if you know your way around the AWS Management Console, the cost savings are worth dealing with any challenges you may encounter during setup. After switching over, we’re saving between 35% and 65% on hosting costs, depending on the service.

The overall model and service processing latency has been just as fast as, or faster than, that of the EC2 GPU instances that we were using previously. Having access to the newer-generation C5 EC2 instances has made for significant improvements in network and CPU performance. The Elastic Inference accelerators themselves are just like any other AWS service that you can connect to over the network.

Because the accelerators are attached over the network rather than installed locally, they can introduce issues and overhead that local GPU hardware does not. That said, the increased stability has proven highly beneficial and has been equal to what we would expect out of any other AWS service.

AWS Lambda solution

You might think that because a single AWS Lambda function lacks a GPU and has tiny compute resources, it would be a poor choice for deep learning. While it’s true that Lambda functions are the slowest option available for running deep learning models on AWS, they offer many other advantages when working with serverless infrastructure.

When you break down the logic of your deep learning service into a single Lambda function for a single request, things become much simpler—even performant. You can forget all about the resource handling needed for the parallel requests coming into your model. Instead, a single Lambda function loads its own instance of your deep learning models, prepares the single input that comes in, then computes the output of the model and returns the result.

As long as traffic is high enough, the Lambda instance is kept alive to reuse for the next request. Keeping the instance alive stores the model in memory, meaning that the next request only has to prep and compute the new input. Doing so greatly simplifies the procedure and makes it much easier to deploy a deep learning service on Lambda.

In terms of performance, each Lambda function is only working with up to 3 GB of memory and one or two vCPUs. Per-input latency is higher than on a GPU but is largely acceptable for most applications.

However, the performance advantage that Lambda offers lies in its ability to automatically scale widely with the number of concurrent calls you can make to your Lambda functions. Each request always takes roughly the same amount of time. If you can make 50, 100, or even 500 parallel requests (all returning in the same amount of time), your overall throughput can easily surpass GPU instances with even the largest input batches.
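The throughput argument can be made concrete with a rough calculation: if every request takes about the same time, throughput is concurrency divided by latency. The following is a sketch only. The latency and batch figures are hypothetical placeholders chosen for illustration, not measurements from Curalate or AWS.

```python
def throughput_per_second(items_in_flight, latency_seconds):
    """Items completed per second when `items_in_flight` finish every `latency_seconds`."""
    return items_in_flight / latency_seconds


# Hypothetical figures, for illustration only.
LAMBDA_LATENCY_S = 2.0     # assumed time for one Lambda invocation to process one input
GPU_BATCH_SIZE = 64        # assumed batch size on a single GPU instance
GPU_BATCH_LATENCY_S = 0.8  # assumed time for the GPU to process one batch

gpu_throughput = throughput_per_second(GPU_BATCH_SIZE, GPU_BATCH_LATENCY_S)
print(f"GPU, batch of {GPU_BATCH_SIZE}: {gpu_throughput:.0f} inputs/s")

lambda_throughput = {}
for concurrency in (50, 100, 500):
    lambda_throughput[concurrency] = throughput_per_second(concurrency, LAMBDA_LATENCY_S)
    print(f"Lambda, {concurrency} concurrent calls: {lambda_throughput[concurrency]:.0f} inputs/s")
```

Under these assumed numbers, a single Lambda call is slower than the GPU per input, but several hundred concurrent calls deliver more inputs per second than the batched GPU. Where the break-even falls depends entirely on your own measured latencies and your account’s concurrency limits.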

These scaling characteristics also come with efficient cost-saving characteristics and stability. Lambda is serverless, so you only pay for the compute time and resources that you actually use. You’re never forced to waste money on unused resources, and you can accurately estimate your data pipeline processing costs based on the number of expected service requests.

In terms of stability, the parallel instances that run your functions are cycled often as they scale up and down. This means that there’s less of a chance that your service could be taken down by something like native library instability or a memory leak. If one of your Lambda function instances does go down, you have plenty of others still running that can handle the load.

Because of all of these advantages, we’ve been using Lambda to host a number of our deep learning models. It’s a great fit for some of our lower volume, data-pipeline services, and it’s perfect for trying out new beta applications with new models. The cost of developing a new service is low, and the cost of hosting a model is next to nothing because of the lower service traffic requirements present in beta.

Cost threshold

The following graphs display our nominal throughput of images per month through a single Elastic Inference accelerator-hosted model. They include the average runtime and cost of the same model on Lambda (ResNet 152, AlexNet, and MobileNet architectures).

The graphs should give you a rough idea of the circumstances during which it’s more efficient to run on Lambda than the Elastic Inference accelerators, and vice versa. These values are all dependent on your network architecture.

Given the differences in model depth and the overall number of parameters, certain model architectures can run more efficiently on GPUs or Lambda than others. The following three examples are all basic image classification models. The monthly cost estimate for the EC2 option is $220.82, which covers one c5.xlarge instance with an eia1.medium accelerator.

As an example, for the ResNet152 model, the crossover point is around 7,500,000 images per month, after which the C5 + EIA option becomes more cost-effective. We estimate that after you pass the bump at 40,000,000 images per month, you would require a second instance to meet demand and handle traffic spikes. If the load were perfectly distributed across the month, that rate would come out to about 15 images per second. Assuming you are trying to run as cheaply as possible, one instance could easily handle this with plenty of headroom, but real-world service traffic is rarely uniform.
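The numbers in this example can be checked with simple arithmetic. The following sketch assumes a 30-day month, a Lambda cost that grows linearly with the number of images, and a single fixed-price instance; the per-image Lambda cost is inferred from the stated crossover point and is not a published price.

```python
# Assumption: a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Figures quoted in the text for the ResNet152 example.
INSTANCE_MONTHLY_COST = 220.82     # c5.xlarge with one accelerator, USD per month
CROSSOVER_IMAGES = 7_500_000       # images per month where the two options cost the same
SECOND_INSTANCE_BUMP = 40_000_000  # images per month where a second instance is needed


def images_per_second(images_per_month):
    """Average rate if the monthly load were spread perfectly evenly."""
    return images_per_month / SECONDS_PER_MONTH


def implied_lambda_cost_per_image(instance_monthly_cost, crossover_images):
    """Per-image Lambda cost implied by the crossover, assuming linear Lambda pricing."""
    return instance_monthly_cost / crossover_images


def cheaper_option(images_per_month, instance_monthly_cost, lambda_cost_per_image):
    """Compare linear Lambda cost against one fixed-price instance."""
    lambda_cost = images_per_month * lambda_cost_per_image
    return "lambda" if lambda_cost < instance_monthly_cost else "c5+eia"


rate = images_per_second(SECOND_INSTANCE_BUMP)
per_image = implied_lambda_cost_per_image(INSTANCE_MONTHLY_COST, CROSSOVER_IMAGES)

print(f"{SECOND_INSTANCE_BUMP:,} images/month is about {rate:.1f} images/s")
print(f"implied Lambda cost per image: ${per_image:.7f}")
print("1M images/month:", cheaper_option(1_000_000, INSTANCE_MONTHLY_COST, per_image))
print("20M images/month:", cheaper_option(20_000_000, INSTANCE_MONTHLY_COST, per_image))
```

The comparison ignores the second instance needed past the 40,000,000 mark and any Auto Scaling behavior, so treat it as a first-order estimate rather than a billing model.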

Realistically, for this type of load, our C5 + Elastic Inference accelerator clusters automatically scale anywhere from 2 to 6 instances. That’s dependent on the current load on our processing streams for any single model—the largest of which processes ~250,000,000 images per month.


To power all of our deep learning applications and services on AWS, Curalate uses a combination of AWS Lambda and Elastic Inference accelerators.

In our production environment, if the app or service is user-facing and requires low latency, we power it with an Elastic Inference accelerator-equipped EC2 instance. We have seen hosting cost savings of 35-65% compared to GPU instances, depending on the service using the accelerators.

If we’re looking to do offline data-pipeline processing, we first deploy our models to Lambda functions. It’s best to do so while traffic is below a certain threshold or while we’re trying something new. After that specific data pipeline reaches a certain level, we find that it’s more cost-effective to move the model back onto a cluster of Elastic Inference accelerator-equipped EC2 instances. These clusters smoothly and efficiently handle streams of hundreds of millions of deep learning model requests per month.

About the author

Jesse Brizzi is a Computer Vision Research Engineer at Curalate where he focuses on solving problems with machine learning and serving them at scale. In his spare time, he likes powerlifting, gaming, transportation memes, and eating at international McDonald’s. Follow his work at www.jessebrizzi.com.





from AWS Machine Learning Blog

Making daily dinner easy with Deliveroo meals and Amazon Rekognition

When Software Engineer Florian Thomas describes Deliveroo, he is talking about a rapidly growing, highly in-demand company. Everyone must eat, after all, and Deliveroo is, in his words, “on a mission to transform the way you order food.”  Specifically, Deliveroo’s business is partnering with restaurants to bring customers their favorite eats, right to their doorsteps.

Deliveroo started in 2013 when Will Shu, the company’s founder and CEO, moved to London. He discovered a city full of great restaurants, but to his dismay, few of them delivered food. He made it his personal mission to bring the best local restaurants directly to people’s doors. Now, Deliveroo’s team is 2,000 strong and operates not only in the UK but also in 14 other global markets, including Australia, the United Arab Emirates, Hong Kong, and most of Europe.

As they’ve grown, Deliveroo has always kept customers at the center. Delivering their chosen meals in a convenient and timely way is not all that Deliveroo has to offer, though. They’re equally timely, responsive, and creative if something has gone awry with a customer’s order (such as a spilled item). Their service portal allows customers to share an image-based report of the issue.

“We’ve learned that, when things go wrong, customers don’t just want to tell us, they want to show us,” remarked Thomas. In addition to enabling the customer care team to provide a solution for each customer, these images are shared with Deliveroo’s restaurant partners to help them continue to improve customers’ experiences.

What Thomas and his team soon realized, though, was that not all of the images that customers uploaded were appropriate. To protect the customer care team from having to sift through any inappropriate images, Deliveroo uses Amazon Rekognition. This easy-to-use content moderation solution has become integral to Deliveroo’s customer care flow, as hundreds of photos per week (about 1.7% of all images submitted) are rejected.

“With Amazon Rekognition, we’re able to quickly and accurately process all those photos in real time, which helps us serve our customers promptly when real issues have arisen. That also lets us free our agents’ time so they can focus on the customer problems that matter,” Thomas explained. “Amazon Rekognition allows our agents to safely respond to important customer issues in a timely manner and ensures that legitimate customer claims are handled automatically.”

The choice to use Amazon Rekognition was a natural one for Deliveroo, as the company has been using AWS for a long time. The team originally selected AWS because of their trust in the service. Now, they use Amazon Simple Storage Service (Amazon S3) to store the photos that go into the customer service queue, which streamlines their flow into analysis with Amazon Rekognition. This flow is pictured in the diagram. In addition, the Deliveroo customer care team is using Amazon DynamoDB and AWS Lambda to achieve resolutions faster, as well as Amazon Aurora to manage customer issues.
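The moderation step in a flow like this can be sketched with the Amazon Rekognition `DetectModerationLabels` API, which accepts an image stored in Amazon S3. The following is a minimal illustration, not Deliveroo’s actual code: the bucket name, object key, confidence threshold, and helper names are all hypothetical, and a stub client stands in for `boto3.client("rekognition")` so that the example runs without AWS credentials.

```python
def moderate_image(rekognition_client, bucket, key, min_confidence=80.0):
    """Return (is_appropriate, label_names) for an image stored in Amazon S3."""
    response = rekognition_client.detect_moderation_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,  # hypothetical threshold; tune for your use case
    )
    labels = [label["Name"] for label in response["ModerationLabels"]]
    # No moderation labels at or above the threshold means the image can be shown.
    return len(labels) == 0, labels


class StubRekognitionClient:
    """Stand-in for boto3.client("rekognition") that returns canned responses."""

    def __init__(self, flagged_keys):
        self._flagged_keys = set(flagged_keys)

    def detect_moderation_labels(self, Image, MinConfidence):
        key = Image["S3Object"]["Name"]
        if key in self._flagged_keys:
            return {"ModerationLabels": [{"Name": "Explicit Nudity", "Confidence": 97.1}]}
        return {"ModerationLabels": []}


# In production you would pass boto3.client("rekognition") instead of the stub.
client = StubRekognitionClient(flagged_keys={"reports/bad-photo.jpg"})

ok_result = moderate_image(client, "example-care-photos", "reports/spilled-soup.jpg")
bad_result = moderate_image(client, "example-care-photos", "reports/bad-photo.jpg")
print(ok_result)
print(bad_result)
```

Passing the client in as a parameter keeps the function easy to test and lets the same code run inside an AWS Lambda function triggered by new uploads.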

Going forward, Deliveroo’s customer care team plans to use additional AWS machine learning services, such as Amazon Comprehend, to personalize the post-order care experience for each Deliveroo customer. “We’re hungry for what’s next,” Thomas said laughingly.


About the Author

Marisa Messina is on the AWS ML marketing team, where her job includes identifying the most innovative AWS-using customers and showcasing their inspiring stories. Prior to AWS, she worked on consumer-facing hardware and then university-facing cloud offerings at Microsoft. Outside of work, she enjoys exploring the Pacific Northwest hiking trails, cooking without recipes, and dancing in the rain.






from AWS Machine Learning Blog