Category: Open Source

How AWS is helping to open source the future of robotics

Our robot overlords may not take over anytime soon, but when they do, they’ll likely be running ROS. ROS, or Robot Operating System, was launched over a decade ago to unite developers in building “a collection of tools, libraries, and conventions that aim to simplify the task of creating complex and robust robot behavior across a wide variety of robotic platforms,” as the ROS project page describes. To succeed, ROS has required deep collaboration among a community of roboticist developers, including more recent but significant involvement from AWS engineers like Thomas Moulard and Miaofei Mei.

At AWS, our interest and involvement in the ROS community have evolved and deepened over time, in concert with our customers’ need for a robust, robot-oriented operating system.

Dōmo arigatō, Misutā Robotto

Over the past 10 years, ROS has become the industry’s most popular robot software development framework. According to ABI Research, by 2024 roughly 55% of the world’s robots will include a ROS package. However, getting there has not been and will not be easy.

The ideas behind ROS germinated at Stanford University in the mid-2000s, the brainchild of student researchers Keenan Wyrobek and Eric Berger. Their goal? Help robotics engineers stop reinventing the robot, as it were, given that there was “too much time dedicated to re-implementing the software infrastructure required to build complex robotics algorithms…[and] too little time dedicated to actually building intelligent robotics programs.” By late 2007/early 2008, they joined efforts with Scott Hassan’s Willow Garage, a robotics incubator, which began funding their robotics work under the auspices of its Personal Robotics Program.

The Robot Operating System (ROS) was born.

Given its academic origins, it’s perhaps not surprising that the original ROS (ROS 1) was mostly inspired by an academic and hobbyist community that grew to tens of thousands of developers. Though there were large industrial automation systems running ROS, ROS 1 didn’t yet provide the industrial-grade foundation that many customers required. Despite thousands of add-on modules that extended ROS 1 in creative ways, it lacked basic security measures and real-time communication, and only ran on Linux. Lacking alternatives, enterprises kept turning to ROS 1 to build large robotics businesses, though evolving ROS 1 to suit their purposes required considerable effort.

By the late 2010s, these industrial-grade demands were starting to strain ROS 1 as the world took ROS well beyond its initial intended scope, as ROS developer Brian Gerkey wrote in Why ROS 2? To push ROS forward into more commercial-grade applications, Open Robotics, the foundation that shepherds ROS development, kicked off ROS 2, a significant upgrade over ROS 1 with multi-platform support, real-time communications, multi-robot communication, support for small embedded devices, and more.

Open Robotics opted not to build these capabilities into ROS 1 because, Gerkey notes, “[G]iven the intrusive nature of the changes that would be required to achieve the benefits that we are seeking, there is too much risk associated with changing the current ROS system that is relied upon by so many people.” The move to ROS 2 therefore broke the API upon which the ROS developer community depended, so the ROS community had to start afresh to build out an ecosystem of modules to extend ROS 2, while also improving its stability, bug by bug.

That was when AWS engineers, along with a number of other collaborators, dug in.

While a variety of teams at Amazon had been using ROS for some time, customer interest in ROS motivated a new phase of our ROS engagement. Given the importance of ROS 2 to our customers’ success, the AWS Robotics team went to work on improving ROS. According to AWS engineer Miaofei Mei, an active contributor to ROS 2, this meant that we began in earnest to contribute features, bug fixes, and usability improvements to ROS 2.

Today, the AWS Robotics team actively contributes to ROS 2 and participates on the ROS 2 Technical Steering Committee. With over 250 merged pull requests to the rosdistro and ros2 organization repos, the AWS Robotics team’s contributions run wide and deep, and are not limited to the most recent ROS 2 release (Dashing). In the Dashing release, we have been fortunate to be able to contribute two major features: new concepts for Quality of Service (QoS) callbacks in rclcpp and rclpy, as well as new QoS features for Deadline, Lifespan, and Liveliness policies. In addition, we have made logging improvements (e.g., implementation of rosout and the ability to integrate a third-party logging solution like log4cxx or spdlog), added runtime analysis tools (ROS2 Sanitizer Report and Analysis), and contributed Secure-ROS2 (SROS 2) improvements (like policy generation for securing nodes), among others.

Helping customers, one robot at a time

While AWS actively contributes to ROS 2, we also benefit from others’ contributions. Indeed, the initial impetus to embrace ROS had everything to do with community, recalls AWS RoboMaker product manager Ray Zhu: “ROS has been around for over 10 years, with tens of thousands of developers building packages on it. For us to catch up and build something similar from the ground up would have been years of development work.” Furthermore, AWS customers kept asking AWS to help them with their robotics applications, and would prefer to build on an open industry standard.

To answer this customer request, we launched AWS RoboMaker at re:Invent 2018, with the goal of making it easy to “develop…code inside of a cloud-based development environment, test it in a Gazebo simulation, and then deploy…finished code to a fleet of one or more robots,” as AWS chief evangelist Jeff Barr wrote at RoboMaker’s launch. AWS RoboMaker extends ROS 2 with cloud services, making it easier for developers to extend their robot applications with AWS services such as Amazon Lex or Amazon Kinesis.

For example, AWS RoboMaker customer Robot Care Systems (RCS), which “use[s] robotics to help seniors, people with Parkinson’s disease, and people with disabilities move about more independently,” turned to AWS RoboMaker to extend its Lea robots. The company had ideas it wanted to implement (e.g., the ability to collect and share movement and behavioral data with patients’ physicians), but didn’t know how to connect to the cloud. According to Gabriel Lopes, a control and robotics scientist at RCS, “It was a revelation seeing how easily cloud connectivity could be accomplished with RoboMaker, just by configuring some scripts.”

Additionally, to help accelerate the development of analysis and validation tools in ROS 2 on behalf of AWS customers, the AWS RoboMaker team contracted established ROS community developers like PickNik to assist in porting RQT with enhancements to ROS 2, as detailed by PickNik co-founder Michael Lautman.

Contributing to our robot future

Such open source collaboration is by no means new to AWS, but it’s becoming more pronounced as we seek to improve how we serve customers. Working through Open Robotics means we cannot always move as fast as we (or our customers) would like. By working within the ROS 2 community we ensure that our customers remain firmly rooted in the open source software commons, which enables them to benefit from the combined innovations of many contributors. In the words of AWS engineer (and ROS contributor) Thomas Moulard, “It would be a hard sell for our customers to tell them to trust us and go proprietary with us. Instead, we’re telling them to embrace the biggest robotics middleware open source community. It helps them to trust us now and have confidence in the future.”

With AWS initiatives in the ROS community like the newly-announced ROS 2 Tooling Working Group (WG) led by Moulard and other AWS Robotics team members, AWS hopes to attract more partners, customers, and even competitors to join us on our open source journey. While there remain pockets of Amazon teams using ROS purely as customers, the bulk of our usage now drives a rising number of significant contributions upstream to ROS 2.

Why? Well, as mentioned, our contributing back offers significant benefits for customers. But it’s more than that: it’s also great for AWS employees. As Mei said, “It feels good to be part of a community, and not just a company.”

from AWS Open Source Blog

Deploying Spark Jobs on Amazon EKS

Kubernetes has gained a great deal of traction for deploying applications in containers in production, because it provides a powerful abstraction for managing container lifecycles, optimizing infrastructure resources, improving agility in the delivery process, and facilitating dependency management.

Now that a custom Spark scheduler for Kubernetes is available, many AWS customers are asking how to use Amazon Elastic Kubernetes Service (Amazon EKS) for their analytical workloads, especially for their Spark ETL jobs. This post explains the current status of the Kubernetes scheduler for Spark, covers the different use cases for deploying Spark jobs on Amazon EKS, and guides you through the steps to deploy a Spark ETL example job on Amazon EKS.

Available functionalities in Spark 2.4

Before the native integration of Spark in Kubernetes, developers used the Spark standalone deployment mode. In this configuration, the Spark cluster is long-lived and uses a Kubernetes Replication Controller. Because it’s static, the job parallelism is fixed and concurrency is limited, resulting in wasted resources.

With Spark 2.3, Kubernetes became a native Spark resource scheduler. Spark applications interact with the Kubernetes API to configure and provision Kubernetes objects, including Pods and Services for the different Spark components. This simplifies Spark cluster management by relying on Kubernetes’ native features for resiliency, scalability, and security.

The Spark Kubernetes scheduler is still experimental. It provides some promising capabilities, while still lacking some others. Spark version 2.4 currently supports:

  • Spark applications in client and cluster mode.
  • Launching drivers and executors in Kubernetes Pods with customizable configurations.
  • Interacting with the Kubernetes API server via TLS authentication.
  • Mounting Kubernetes secrets in drivers and executors for sensitive information.
  • Mounting Kubernetes volumes (hostPath, emptyDir, and persistentVolumeClaim).
  • Exposing the Spark UI via a Kubernetes service.
  • Integrating with Kubernetes RBAC, enabling the Spark job to run as a serviceAccount.

It does not support:

  • Driver resiliency. The Spark driver runs in a Kubernetes Pod. In case of a node failure, Kubernetes doesn’t reschedule this Pod to another node. The Kubernetes restartPolicy only refers to restarting the containers on the same Kubelet (same node). This issue can be mitigated with a Kubernetes Custom Controller monitoring the status of the driver Pod and applying a restart policy at the cluster level.
  • Dynamic Resource Allocation. Auto-scaling Spark applications requires an External Shuffle Service to persist shuffle data outside of the Spark executors, which is not yet available for the Kubernetes scheduler (see SPARK-24432).

For more information, refer to the official Spark documentation.

Additionally, for streaming application resiliency, Spark uses a checkpoint directory to store metadata and data so that it can recover its state. The checkpoint directory needs to be accessible from the Spark drivers and executors. A common approach is to use HDFS, but it is not natively available on Kubernetes. Spark deployments on Kubernetes can instead use PersistentVolumes, which survive Pod termination, with the ReadWriteMany access mode to allow concurrent access. Amazon EKS supports Amazon EFS and Amazon FSx for Lustre as PersistentVolume classes.
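For illustration, a Persistent Volume Claim suitable for a Spark checkpoint directory might look like the following minimal sketch; the claim name and the efs-sc StorageClass are assumptions rather than part of the demo project:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-checkpoint-pvc
spec:
  accessModes:
    - ReadWriteMany            # lets the driver and all executors mount the volume concurrently
  storageClassName: efs-sc     # assumes an EFS-backed StorageClass has already been created
  resources:
    requests:
      storage: 10Gi

Spark 2.4 can then mount such a claim into the driver and executor Pods through the spark.kubernetes.driver.volumes.persistentVolumeClaim.* and spark.kubernetes.executor.volumes.persistentVolumeClaim.* configuration properties described in the Spark documentation.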

Use cases for Spark jobs on Amazon EKS

Using Amazon EKS for running Spark jobs provides benefits for the following types of use cases:

  • Workloads with high availability requirements. The Amazon EKS control plane is deployed in multiple availability zones, as can be the Amazon EKS worker nodes.
  • Multi-tenant environments providing isolation between workloads and optimizing resource consumption. Amazon EKS supports Kubernetes Namespaces, Network Policies, Quotas, Pods Priority, and Preemption.
  • Development environment using Docker and existing workloads in Kubernetes. Spark on Kubernetes provides a unified approach between big data and already containerized workloads.
  • Focus on application development without worrying about sizing and configuring clusters. Amazon EKS is fully managed and simplifies the maintenance of clusters including Kubernetes version upgrades and security patches.
  • Spiky workloads with fast autoscaling response time. Amazon EKS supports the Kubernetes Cluster Autoscaler and can provide additional compute capacity in minutes.

Deploying Spark applications on EKS

In this section, I will run a demo to show you the steps to deploy a Spark application on Amazon EKS. The demo application is an ETL job written in Scala that processes New York taxi rides to calculate the most valuable zones for drivers by hour. It reads data from the public NYC Amazon S3 bucket, writes output to another Amazon S3 bucket, and creates a table in the AWS Glue Data Catalog to analyze the results via Amazon Athena. The demo application has the following prerequisites:

  • A copy of the source code, available from:

git clone https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample.git

  • Docker >17.05 on the computer that will be used to build the Docker image.
  • An existing Amazon S3 bucket name.
  • An AWS IAM policy with permissions to write to that Amazon S3 bucket.
  • An Amazon EKS cluster already configured with permissions to create a Kubernetes ServiceAccount with edit ClusterRole and a worker node instance role with the previous AWS IAM policy attached.

If you don’t have an Amazon EKS cluster, you can create one using the eksctl tool following the example below (after updating the AWS IAM policy ARN):

eksctl create cluster -f example/eksctl.yaml
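For reference, an eksctl cluster configuration such as example/eksctl.yaml generally takes the shape sketched below; the cluster name, node group sizing, and policy names are placeholders, and the repository’s file is the authoritative version:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-on-eks
  region: us-west-2
nodeGroups:
  - name: spark-nodes
    instanceType: m5.xlarge
    desiredCapacity: 3
    iam:
      attachPolicyARNs:
        # when attachPolicyARNs is set, the default worker node policies must be listed explicitly
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        # the AWS IAM policy created earlier that allows writing to your Amazon S3 bucket
        - arn:aws:iam::<account-id>:policy/<s3-write-policy-name>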

In the first step, I am packaging my Spark application in a Docker image. My Dockerfile is a multi-stage image: the first stage is used to compile and build my binary with the SBT tool, the second serves as a Spark base layer, and the last is my final deployable image. Neither SBT nor Apache Spark maintains a Docker base image, so I need to include the necessary files in my layers:

  • Build the Docker image for the application and push it to my repository:

docker build -t <REPO_NAME>/<IMAGE_NAME>:v1.0 .

docker push <REPO_NAME>/<IMAGE_NAME>:v1.0

Now that I have a package to deploy as a Docker image, I need to prepare the infrastructure for deployment:

  • Create the Kubernetes service account and the cluster role binding to give Kubernetes edit permissions to the Spark job:

kubectl apply -f example/kubernetes/spark-rbac.yaml
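Such an RBAC manifest typically boils down to a ServiceAccount plus a ClusterRoleBinding to the built-in edit ClusterRole, along the lines of the sketch below; the spark name and default namespace are assumptions, and the repository’s spark-rbac.yaml is the authoritative definition:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-edit
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: default
roleRef:
  kind: ClusterRole
  name: edit                   # built-in Kubernetes ClusterRole granting edit permissions
  apiGroup: rbac.authorization.k8s.io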

Finally, I need to deploy my Spark application to Amazon EKS. There are several ways to deploy Spark jobs to Kubernetes:

  • Use the spark-submit command from the server responsible for the deployment. Spark currently only supports Kubernetes authentication through SSL certificates, so this method is not compatible with Amazon EKS, which only supports IAM and bearer token authentication.
  • Use a Kubernetes Job that runs the spark-submit command from within the Kubernetes cluster and ensures that the job reaches a success state. In this approach, spark-submit runs from a Kubernetes Pod and authentication relies on Kubernetes RBAC, which is fully compatible with Amazon EKS (a sketch of this approach follows this list).
  • Use a Kubernetes custom controller (also called a Kubernetes Operator) to manage the Spark job lifecycle based on a declarative approach with Custom Resource Definitions (CRDs). This method is compatible with Amazon EKS and adds some functionality that isn’t currently supported in the Spark Kubernetes scheduler, such as driver resiliency and advanced resource scheduling.
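To make the second option concrete, a Kubernetes Job wrapping spark-submit could look roughly like the sketch below; the service account, main class, and jar path are assumed placeholders, and the repository’s spark-job.yaml is the authoritative version:

apiVersion: batch/v1
kind: Job
metadata:
  name: spark-on-eks
spec:
  template:
    spec:
      serviceAccountName: spark        # the service account bound to the edit ClusterRole (RBAC step above)
      restartPolicy: Never
      containers:
        - name: spark-submit
          image: <REPO_NAME>/spark:v2.4.4
          command: ["/opt/spark/bin/spark-submit"]
          args:
            - --master
            - k8s://https://kubernetes.default.svc
            - --deploy-mode
            - cluster
            - --name
            - spark-on-eks
            - --class
            - <MAIN_CLASS>                                   # assumed: the ETL job's Scala main class
            - --conf
            - spark.kubernetes.container.image=<REPO_NAME>/<IMAGE_NAME>:v1.0
            - --conf
            - spark.kubernetes.authenticate.driver.serviceAccountName=spark
            - local:///opt/spark/jars/<APPLICATION_JAR>      # assumed path of the application jar inside the image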

I will use the Kubernetes job method to avoid the additional dependency on a custom controller. For this, I need a Spark base image which contains Spark binaries, including the spark-submit command:

  • Build the Docker image for Spark and push it to my repository:

docker build --target=spark -t <REPO_NAME>/spark:v2.4.4 .

docker push <REPO_NAME>/spark:v2.4.4

  • Use kubectl to create the job based on the description file, modifying the target Amazon S3 bucket and the Docker repositories to pull the images from:

kubectl apply -f example/kubernetes/spark-job.yaml

  • Use Kubernetes port forwarding to monitor the job via the Spark UI hosted on the Spark driver. First get the driver pod name, then forward its port on localhost and access the UI with your browser pointing to http://localhost:4040:

kubectl get pods | awk '/spark-on-eks/{print $1}'

kubectl port-forward <SPARK_DRIVER_POD_NAME> 4040:4040

After the job has finished, I can check the results via the Amazon Athena console by crawling the output data to create the tables and then querying them:

  • Add a crawler in the AWS Glue console to crawl the data in the Amazon S3 bucket and create two tables, one for raw rides and one for ride statistics per location:
    • Enter a crawler name, and click Next.
    • Select Data stores and click Next.
    • Select the previously used Amazon S3 bucket and click Next.
    • Enter a name for the AWS Glue IAM role and click Next.
    • Select Run on demand and click Next.
    • Choose the database where you want to add the tables, select Create a single schema for each Amazon S3 path, click Next and then Finish.
  • Run the crawler and wait for completion.
  • After selecting the right database in the Amazon Athena console, execute the following queries to preview the raw rides and to see which zone is the most attractive for a taxi driver to pick up:

SELECT * FROM RAW_RIDES LIMIT 100;

SELECT locationid, yellow_avg_minute_rate, yellow_count, pickup_hour FROM "test"."value_rides" WHERE pickup_hour=TIMESTAMP '2018-01-01 5:00:00.000' ORDER BY yellow_avg_minute_rate DESC; 

Summary

In this blog post, you learned how to use Amazon EKS to run Apache Spark applications on top of Kubernetes. You learned the current state of the Amazon EKS and Kubernetes integration with Spark, the limits of this approach, and the steps to build and run your ETL job, including building the Docker images, configuring Amazon EKS RBAC, and configuring AWS services.


from AWS Open Source Blog

AWS’ Sponsorship of the Rust Project


We’re really excited to announce that AWS is sponsoring the Rust programming language! Rust is designed for writing and maintaining fast, reliable, and efficient code. It has seen considerable uptake since its first stable release four years ago, with companies like Google, Microsoft, and Mozilla all using Rust. Rust has also seen lots of growth in AWS, with services such as Lambda, EC2, and S3 all choosing to use Rust in performance-sensitive components. We’ve even open sourced the Firecracker microVM project!

Why Rust?

In the words of the Rust project’s maintainers:

  • Performance. Rust is blazingly fast and memory-efficient: with no runtime or garbage collector, it can power performance-critical services, run on embedded devices, and easily integrate with other languages.
  • Reliability. Rust’s rich type system and ownership model guarantee memory-safety and thread-safety — and enable you to eliminate many classes of bugs at compile-time.
  • Productivity. Rust has great documentation, a friendly compiler with useful error messages, and top-notch tooling — an integrated package manager and build tool, smart multi-editor support with auto-completion and type inspections, an auto-formatter, and more.

With its inclusive community and top-notch libraries like:

  • Serde, for serializing and deserializing data.
  • Rayon, for writing parallel & data race-free code.
  • Tokio/async-std, for writing non-blocking, low-latency network services.
  • tracing, for instrumenting Rust programs to collect structured, event-based diagnostic information.

…it’ll come as no surprise that Rust was voted Stack Overflow’s “Most Loved Language” four years running.

That’s why AWS is sponsoring the Rust project. The Rust project uses AWS services to:

  • Store release artifacts such as compilers, libraries, tools, and source code on S3.
  • Run ecosystem-wide regression tests with Crater on EC2.
  • Operate docs.rs, a website that hosts documentation for all packages published to the central crates.io package registry.

Getting started with Rust

To get started with the Rust programming language, check out Rust’s “Getting Started” page. To get started with Rust on AWS, consider using Rusoto, a community-driven AWS SDK. To use Rust on AWS Lambda, consider using the official AWS Lambda Runtime for Rust.

We’re using Rust ourselves, and we’re especially excited to see how it develops and what the broader community builds with it. We can’t wait to be an even bigger part of the Rust community.

from AWS Open Source Blog

How AWS Built a Production Service Using Serverless Technologies

Our customers are constantly seeking ways to increase agility and innovation in their organizations, and often reach out to us wanting to learn more about how we iterate quickly on behalf of our customers. As part of our unique culture (which Jeff Barr discusses at length in this keynote talk), we constantly explore ways to move faster by using new technologies.

Serverless computing is one technology that helps Amazon innovate faster than before. It frees up time for internal teams by eliminating the need to provision, scale, and patch infrastructure. We use that time to improve our products and services.

Today, we are excited to announce that we have published a new open source project designed to help customers learn from Amazon’s approach to building serverless applications. The project captures key architectural components, code structure, deployment techniques, testing approaches, and operational practices of the AWS Serverless Application Repository, a production-grade AWS service written mainly in Java and built using serverless technologies. We wanted to give customers an opportunity to learn more about serverless development through an open and well-documented code base that has years of Amazon best practices built in. The project was developed by the same team that launched the AWS Serverless Application Repository two years ago, and is available under the Apache 2.0 license on GitHub.

Open source has always been a critical part of AWS’ strategy, as the Firecracker example illustrates. Open source is a more efficient way to collaborate with the community, and we view our investments in open source as a way to enable customer innovation. While we have dozens of example serverless applications available on GitHub and many reference architectures, this is our first attempt to open source a set of serverless application components inspired by the implementation of a production AWS service. Because the focus is on learning, we have made only a few service operations available on GitHub. For example, to show how we developed a request and response architecture, we published code for the create-application operation of the service and excluded similar operations like put-policy or get-template; in developing this project, we drew motivation from the service’s real-world scenarios. You can study the code, make changes locally, deploy an inspired version of the service in your AWS account, and repeat this cycle to gain further insight into how we develop, test, and operate our production services using serverless technologies.

A quick note about the AWS Serverless Application Repository, and why we used it as a reference to develop this project:

The AWS Serverless Application Repository enables teams, organizations, and individual developers to store and share reusable serverless applications, and easily assemble and deploy serverless architectures in powerful new ways. The service implements common request/response patterns, makes use of event-driven systems for asynchronous processing, and uses a component architecture to reduce coupling and improve scaling dimensions. Learning how to implement these patterns, using best practices, is a common ask from our customers.

The diagram below represents the high-level architecture of the open source project, and closely resembles the production architecture of the service. Let’s learn more about the things we have made available for you to study on GitHub.

Main Architecture Diagram.

The open source project has four core components: a static website (front-end), a back-end, an operations component, and an asynchronous analytics component. This modular architecture helps minimize customer impact in the event of failures, and is key in iterating quickly and independently on components. Each component has its own designated folder to help organize its code, dependencies, and infrastructure-as-code template, closely resembling the layout of the production service. Within each folder, a suite of unit and integration tests is provided to help ensure that changes are thoroughly tested before being deployed. CI/CD pipeline templates are available for individual components to give you the flexibility of setting up a pipeline for the specific components you wish to deploy.

Note: While the project is designed for learning, each component is built with production quality in mind. The components extracted from the service can also be deployed independently as apps via the AWS Serverless Application Repository.

This post will provide a detailed explanation of the back-end component. A version of this is also available in the project wiki.

Back-end (request/response)

It is very common for applications to handle user requests, process data, and respond with success or error. Users must also be authenticated, which requires ways to communicate with APIs (usually via a web interface). For example, the create-application operation of the service is exposed as an Amazon API Gateway API. Customers interact with the service using the AWS SDK and the AWS console, where authentication and authorization are managed via AWS Identity and Access Management (IAM). Requests made to the service get routed to AWS Lambda, where the business logic is executed, and state is managed via Amazon DynamoDB. The open source project captures the request/response architecture of the production service almost identically, except that we show how to use Amazon Cognito for authentication and authorization. The following diagram captures the architecture for the back-end component released with this open source project.

Back-end component architecture diagram.
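Expressed with the AWS Serverless Application Model (SAM), which the project uses for deployment (more on that below), the core of such a back-end reduces to an API-fronted Lambda function and a DynamoDB table. The following is only an illustrative sketch with assumed names and a single route, not the project’s actual template:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  ApplicationsTable:
    Type: AWS::Serverless::SimpleTable          # DynamoDB table holding application state
  BackendApiFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: java8
      Handler: com.example.ApiLambdaHandler::handleRequest   # assumed handler class
      MemorySize: 512
      Timeout: 30
      Environment:
        Variables:
          TABLE_NAME: !Ref ApplicationsTable
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref ApplicationsTable
      Events:
        CreateApplication:
          Type: Api                             # API Gateway proxy integration
          Properties:
            Path: /applications
            Method: post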

The back-end component implements the following operations:

  • CreateApplication: creates an application
  • UpdateApplication: updates the application metadata
  • GetApplication: gets the details of an application
  • ListApplications: lists applications that you have created
  • DeleteApplication: deletes an application

We used the OpenAPI (Swagger) specification to define our APIs, and used the Swagger Codegen tool to generate server-side models (for input/output) and JAX-RS annotations for the APIs listed above.

Note: JAX-RS (Java API for RESTful Web Services) is a Java programming language API spec that provides support for creating web services according to the Representational State Transfer (REST) architectural pattern. Jersey, the reference implementation of JAX-RS, implements support for the annotations defined in JSR 311, making it easy for developers to build RESTful web services using the Java programming language.

Let’s walk through a few sections of code that show how requests get routed and then processed by the appropriate Java methods of the application. It will be helpful to see how we used existing Java frameworks to help developers stay focused on writing just the business logic of the application.

The code snippet below shows JAX-RS annotations defined for the create-application API in the ApplicationsApi Java Interface generated from the API spec.

Java Interface generated from the API spec.

The ApplicationService class contains all business logic for the APIs. This class implements the methods defined by the ApplicationsApi Java interface. The code snippet below shows the implementation of the create-application API: a simple Java method that accepts the CreateApplicationInput POJO as input to process the request for creating an application.

Implementation of create application API.

Finally, the code snippet below shows how Amazon API Gateway requests get routed (from the Lambda handler) to the appropriate Java methods in the ApplicationService class.

As mentioned earlier, the project uses the Jersey framework (an implementation of JAX-RS). The ApplicationService class is registered with ResourceConfig (a Jersey Application) so that Jersey can forward REST calls to the appropriate methods in ApplicationService class. The API requests get sent to Jersey via the JerseyLambdaContainerHandler middleware component that natively supports API Gateway’s proxy integration models for requests and responses. This middleware component is part of the open source AWS Serverless Java Container framework, which provides Java wrappers to run Jersey, Spring, Spark, and other Java-based framework apps inside AWS Lambda.

API Lambda handler. This is the entry point for the API Lambda.

Testing and deployment

Integration tests are defined using Cucumber, a popular Java testing framework that allows translating user requirements into test scenarios in plain English that all stakeholders can understand, review, and agree upon before implementation. These tests are inside the src/test folder and can be run as part of an automated CI/CD pipeline, or manually. Running integration tests deploys a test stack to your AWS account, runs the tests, and deletes the stack after the tests are complete.

For deployment to AWS, we use the AWS Serverless Application Model (SAM), which provides shorthand syntax to express functions, APIs, databases, and event source mappings. These templates can be deployed manually or as part of an automated pipeline. When organizing our templates, we adopted two key best practices learned over the years at AWS: nested stacks and parameter referencing:

1) Nested stacks make the templates reusable across different stacks. As your infrastructure grows, common patterns can be converted into templates, and these templates can be defined as resources in other templates. In this project, we have added a root-level SAM template called template.yaml for each component. Within that template, nested templates, such as database.template.yaml and api.template.yaml, are defined as resources. These nested templates, in turn, define the specific resources required by the component, such as an API Gateway API and a DynamoDB table.
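As an illustration of that layout, a root-level template.yaml can declare the nested templates as AWS::Serverless::Application resources, roughly as follows (a sketch; the logical names and the TableName parameter are assumptions):

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  Database:
    Type: AWS::Serverless::Application
    Properties:
      Location: database.template.yaml          # nested template defining the DynamoDB table
  Api:
    Type: AWS::Serverless::Application
    Properties:
      Location: api.template.yaml               # nested template defining the API and Lambda function
      Parameters:
        TableName: !GetAtt Database.Outputs.TableName   # assumes the nested stack exposes this output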

2) For parameter referencing, we used AWS Systems Manager Parameter Store, which allows configurable parameters to be stored centrally and referenced inside a SAM template. Parameter Store also allows these values to be read directly inside Lambda functions instead of being passed as environment variables, which lets you manage configuration values independently of your service deployment.
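For example, a SAM template can declare a parameter whose value is resolved from Parameter Store at deployment time; the parameter path below is a hypothetical example:

Parameters:
  ApplicationsTableName:
    Type: AWS::SSM::Parameter::Value<String>    # resolved from Parameter Store when the stack is deployed
    Default: /serverless-app/applications-table-name

At runtime, Lambda functions can also call the Systems Manager GetParameter API to read such values directly rather than receiving them as environment variables.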

Walkthrough

In the section above, we described some of the development techniques, testing practices, and deployment approaches of the project. Now, let us walk through building and deploying the application in your AWS account. We will build the static website (front-end) and the back-end components so that we can see an end-to-end demo of the working application. This should take less than 10 minutes.

Make sure you have the necessary prerequisites installed locally on your computer. Clone the GitHub repository, change into the directory of the project, and run the following commands to build the code:

cd static-website
npm install
npm run ci
cd ..
mvn clean compile

Note: The application supports Node LTS version 10.16.3, and OpenJDK version 8. You can use Amazon Corretto for a production-ready distribution of the OpenJDK.

Once the project is built successfully:

  • Create an Amazon S3 bucket to store the packaged artifacts of the application via aws s3 mb s3://<s3 bucket name>;
  • Package the application via mvn clean package -DpackageBucket=<s3 bucket name>;
  • Deploy the application via aws cloudformation deploy --template-file target/sam/app/packaged-template.yaml --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND --stack-name <name>

You should see an output similar to the following image:

Confirmation image for cloudformation stack creation.

Once the stack has been successfully created, navigate to the AWS CloudFormation console, locate the stack you just created, and go to the Outputs tab to find the website URL for the application. When you click on the link, you should see a page like this. The application is up and running!

Website screenshot.

You can click on Try the Demo to access the back-end APIs. As a first time user, you will be asked to sign up first. You can then create (publish), get, update, list, and delete applications.

Next steps

In this post we covered how we leveraged serverless to build a production-grade service, and how AWS developers are using serverless technologies to innovate ever faster on behalf of our customers. The post briefly covered the details of the project’s back-end (request/response) component, our testing and deployment practices, and a walkthrough of how you can deploy an instance of the front-end and back-end to your AWS account. As next steps, you can fork the repository and modify it. You can also set up a CI/CD pipeline to push your changes automatically. We have provided CI/CD templates for each of the components under sam/cicd/template.yaml. These templates can be easily deployed using:

aws cloudformation deploy --template-file target/sam/cicd/template.yaml --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND --stack-name <your-stack-name>

You can also learn more about the analytics component and the pipelines, operational alarms and dashboards. We hope you find this useful – if you do, please let us know by starring the GitHub project.

from AWS Open Source Blog

AWS Promotional Credits for Open Source Projects


AWS has always aimed to help make technology that was historically cost-prohibitive and difficult for many organizations to adopt much more accessible to a broader audience. This applies to open source technologies as well. We help customers to run a wide variety of open source operating systems on EC2. We offer managed services for open source databases such as MySQL, PostgreSQL, and Redis. Or you can use Amazon Elastic Kubernetes Service (Amazon EKS) to run your Docker containers using Kubernetes. We do the undifferentiated heavy lifting so that customers don’t have to download, install, manage, patch, and scale these open source packages. This allows you to focus on building your applications so you can innovate faster. We also continue to widen our open source collaboration by sponsoring foundations and events, increasing code contributions, open sourcing our own technology, and helping communities to sustain the overall health of open source.

Today, we are excited to further extend our support by offering AWS promotional credits to open source projects. Typically, these credits are used to perform upstream and performance testing, CI/CD, or storage of artifacts on AWS. We hope this program will free up resources for open source projects to further expand and innovate in their communities, as several projects are already doing.

The Rust language has seen lots of growth in AWS, with services such as Lambda, EC2, and S3 all choosing to use Rust in performance-sensitive components. We open-sourced Firecracker, which is also written using Rust. And it’s the most-loved language in the latest Stack Overflow survey. Alex Crichton, Rust Core Team Member says: “We’re thrilled that AWS, which the Rust project has used for years, is helping to sponsor Rust’s infrastructure. This sponsorship enables Rust to sustainably host infrastructure on AWS to ship compiler artifacts, provide crates.io crate downloads, and house automation required to glue all our processes together. These services span a myriad of AWS offerings from CloudFront to EC2 to S3. Diversifying the sponsorship of the Rust project is also critical to its long-term success, and we’re excited that AWS is directly aiding this goal.”

AdoptOpenJDK is the community-led effort to produce pre-built binaries from OpenJDK. Martijn Verburg, Director of AdoptOpenJDK, says: “The London Java Community is incredibly pleased to have Amazon Web Services sponsor AdoptOpenJDK. AWS is one of the leading cloud providers for Java workloads and its broad IaaS offering allows us to maintain a scalable and robust build farm for years to come. We thank AWS for being a great supporter of AdoptOpenJDK and the Java developer community.”

The Central Repository, more popularly known as Maven Central, is a critical piece of infrastructure for Java application development. “Sonatype has been leveraging the scalability of AWS services to provide key infrastructure to deliver The Central Repository (Maven Central), the world’s largest repository of open source Java components to the community. The AWS promotional credits will enable a combined higher level of investment in this critical community resource.” — Brian Fox, Co-founder & CTO, Sonatype.

Kubernetes, Prometheus, and Envoy are some of the projects in the Cloud Native Computing Foundation. “We’re very appreciative of AWS making this substantial contribution of cloud promotional credits to CNCF-hosted projects,” said Dan Kohn, executive director of the Cloud Native Computing Foundation (CNCF). “CNCF projects rely on cloud promotional credits for their continuous integration/continuous delivery (CI/CD), which is essential for maintaining their high development velocity.”

Julia is a programming language designed for high performance computing. It is used for data visualization, data science, machine learning, and scientific computing. Viral B. Shah, co-creator of the Julia language said: “The Julia project is thankful to AWS. A lot of the Julia community infrastructure runs on AWS, and the savings from receiving AWS Promotional Credits will allow us to spend money on other community activities, such as the Julia Season of Contributions (JSOC), improving diversity in the Julia community, and funding travel scholarships to JuliaCon.”

How do I apply?

To be eligible, you will need an active AWS Account, including a valid payment method on file, and no outstanding invoices (note that promotional credits cannot be applied retroactively). If you already have an account, you can find your account ID here. Generally, your project must also be licensed under an OSI-approved license, but please still apply if the commonly-used licenses in your project’s space (e.g. open data) are not OSI-approved!

Use the application form to provide complete details including your account number, the amount of credits being requested, and a brief description of how the credits will be used. The credits will expire after one year. Applications will be reviewed within 10-15 business days. Assuming that they meet the basic eligibility requirements above, we will be examining applications on the basis of their relevance to AWS and its customers. The Amazon Leadership Principles will be used as the guiding light to select the projects. We’ll also generally favor projects with maintainers from multiple entities or that are owned by foundations or non-profits. You will be notified via email if your application is approved, and then you will receive a welcome email once the promotional credit has been applied to your account.

Learn more about how AWS is investing in communities, contributing code, and hiring developers to work in open source at opensource.amazon.com.

from AWS Open Source Blog

Distributed TensorFlow Training Using Kubeflow on Amazon EKS

Training heavy-weight deep neural networks (DNNs) on large datasets ranging from tens to hundreds of GBs often takes an unacceptably long time. Business imperatives force us to search for solutions that can reduce the training time from days to hours. Distributed data-parallel training of DNNs using multiple GPUs on multiple machines is often the right answer to this problem. The main focus of this post is how to do such distributed training using open source frameworks and platforms on Amazon Web Services (AWS).

TensorFlow is an open source machine learning library. Kubernetes is an open source platform for managing containerized applications. Kubeflow is an open source toolkit that simplifies deploying machine learning workflows on Kubernetes. Amazon Elastic Kubernetes Service (Amazon EKS) makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS. Using Kubeflow on Amazon EKS, we can do highly scalable distributed TensorFlow training leveraging these open source technologies.

We will first provide an overview of the key concepts, then walk through the steps required to do distributed TensorFlow training using Kubeflow on EKS. An earlier blog post discussing Kubeflow on EKS offers a broader perspective on this topic.

Overview of concepts

While many of the distributed training concepts presented in this post are generally applicable across many types of TensorFlow models, to be concrete, we will focus on distributed TensorFlow training for the Mask R-CNN model on the Common Objects in Context (COCO) 2017 dataset.

Model

The Mask R-CNN model is used for object instance segmentation, whereby the model generates pixel-level masks (Sigmoid binary classification) and bounding-boxes (Smooth L1 regression) annotated with object-category (SoftMax classification) to delineate each object instance in an image. Some common use cases for Mask R-CNN include perception in autonomous vehicles, surface defect detection, and analysis of geospatial imagery.

There are three key reasons for selecting the Mask R-CNN model for this post:

  1. Distributed training of Mask R-CNN on large datasets compresses training time.
  2. There are many open source TensorFlow implementations available for the Mask R-CNN model. In this post, we will use the Tensorpack Mask/Faster-RCNN implementation as our primary example, but a highly optimized AWS Samples Mask-RCNN is also recommended.
  3. The Mask R-CNN model is submitted as part of MLPerf results as a heavyweight object detection model.

A schematic outline of the Mask R-CNN deep neural network (DNN) architecture is shown below:

Figure 1. Schematic of Mask R-CNN DNN architecture (see the Mask R-CNN paper: https://arxiv.org/pdf/1703.06870.pdf)

Synchronized All-reduce of gradients in distributed training

The central challenge in distributed DNN training is that the gradients computed during back-propagation on multiple GPUs need to be all-reduced (averaged) in a synchronized step before being applied to update the model weights on each GPU across all nodes.

The synchronized all-reduce algorithm needs to be highly efficient; otherwise, any training speedup gained from distributed data-parallel training would be lost to the inefficiency of the synchronized all-reduce step.

There are three key challenges to making the synchronized all-reduce algorithm highly efficient:

  1. The algorithm needs to scale with increasing numbers of nodes and GPUs in the distributed training cluster.
  2. The algorithm needs to exploit the topology of high-speed GPU-to-GPU inter-connects within a single node.
  3. The algorithm needs to efficiently interleave computations on a GPU with communications with other GPUs by efficiently batching the communications with other GPUs.

Uber’s open-source library Horovod was developed to address these challenges:

  1. Horovod offers a choice of highly efficient synchronized all-reduce algorithms that scale with increasing numbers of GPUs and nodes.
  2. The Horovod library leverages the Nvidia Collective Communications Library (NCCL) communication primitives that exploit awareness of Nvidia GPU topology.
  3. Horovod includes Tensor Fusion, which efficiently interleaves communication with computation by batching data communication for all-reduce.

Horovod supports many machine learning frameworks, including TensorFlow. TensorFlow distribution strategies also leverage NCCL and provide an alternative to using Horovod to do distributed TensorFlow training. In this post, we will use Horovod.

Amazon EC2 p3.16xlarge and p3dn.24xlarge instances, with eight Nvidia Tesla V100 GPUs, 128–256 GB of GPU memory, 25–100 Gbps network interconnect, and high-speed Nvidia NVLink GPU-to-GPU interconnect, are ideally suited for distributed TensorFlow training.

Kubeflow Message Passing Interface (MPI) Training

The next challenge in distributed TensorFlow training is appropriate placement of training algorithm worker processes across multiple nodes, and association of each worker process with a unique global rank. Message Passing Interface (MPI) is a widely used collective communication protocol for parallel computing and is very useful in managing a group of training algorithm worker processes across multiple nodes.

MPI is used to distribute training algorithm processes across multiple nodes and associate each algorithm process with a unique global and local rank. Horovod is used to logically pin an algorithm process on a given node to a specific GPU. The logical pinning of each algorithm process to a specific GPU is required for synchronized all-reduce of gradients.

The specific aspect of the Kubeflow machine learning toolkit that is relevant to this post is Kubeflow’s support for Message Passing Interface (MPI) training through Kubeflow’s MPI Job Custom Resource Definition (CRD) and MPI Operator Deployment. Kubeflow’s MPI Job and MPI Operator enable distributed TensorFlow training on Amazon EKS. TensorFlow training jobs are defined as Kubeflow MPI Jobs, and Kubeflow MPI Operator Deployment observes the MPI Job definition to launch Pods for distributed TensorFlow training across a multi-node, multi-GPU enabled Amazon EKS cluster. Because of our limited focus on using Kubeflow for MPI training, we do not need a full deployment of Kubeflow for this post.
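For orientation, an MPIJob resource has the rough shape sketched below. This sketch follows the kubeflow.org/v1alpha2 schema, with a launcher and a set of GPU workers; the image, command, and replica counts are placeholders, and the Helm chart used later in this post generates the actual resource:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: maskrcnn-training
spec:
  slotsPerWorker: 8                      # one MPI slot per GPU on an 8-GPU P3 instance
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: <ECR training image>                    # placeholder
              command: ["mpirun", "python3", "train.py"]     # placeholder entry point
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: worker
              image: <ECR training image>                    # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 8      # schedules each worker onto a node with 8 GPUs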

Kubernetes resource management

To do distributed TensorFlow training using Kubeflow on Amazon EKS, we need to manage Kubernetes resources that define MPI Job CRD, MPI Operator Deployment, and Kubeflow MPI Job training jobs. In this post, we will use Helm charts for managing Kubernetes resources defining distributed TensorFlow training jobs for Mask R-CNN models.

Step-by-step walk through

Below we walk through the steps required to do distributed TensorFlow DNN training using Kubeflow in EKS. We will start by creating an EKS cluster, then package code and frameworks into a Docker image, stage the COCO 2017 dataset on an Amazon Elastic File System (Amazon EFS) shared file system and, finally, launch the training job using Kubeflow in EKS.

Prerequisites

  1. Create and activate an AWS Account or use an existing AWS account.
  2. Subscribe to the EKS-optimized AMI with GPU Support from the AWS Marketplace.
  3. Manage your service limits so you can launch at least four EKS-optimized GPU-enabled Amazon EC2 P3 instances.
  4. Create an AWS Service role for an EC2 instance and add AWS managed policy for power user access to this IAM Role, or create a least-privileged role consistent with the IAM permissions required to execute the steps in this post.
  5. We need a build environment with AWS CLI and Docker installed. Launch an m5.xlarge Amazon EC2 instance from an AWS Deep Learning AMI (Ubuntu), using an EC2 instance profile containing the IAM Role created in Step 4. The root EBS volume of the EC2 instance must be at least 200 GB. All steps described below must be executed on this EC2 instance.
  6. Clone this GitHub repository in your build environment and execute the steps below. All paths are relative to the Git repository root. See the Git repository README for detailed instructions.
  7. Use any AWS Region that supports Amazon EKS, Amazon EFS, and EC2 P3 instances. Here we assume use of us-west-2 AWS region.
  8. Create an S3 bucket in your AWS region.

Create GPU-enabled Amazon EKS cluster and node group

The first step to enable distributed TensorFlow training using Kubeflow on EKS is, of course, to create an Amazon EKS cluster. There are multiple cloud infrastructure automation options that can be used to do this, including eksctl, Terraform, and others. Here we will use Terraform. A high-level understanding of Terraform may be helpful, but is not required. To get started, install Terraform in your build environment. While the latest version of Terraform may work, this post was tested with Terraform v0.12.6.

Install and configure Kubectl

Install kubectl and aws-iam-authenticator on a Linux machine, from the eks-cluster directory:

./install-kubectl-linux.sh

The script verifies that aws-iam-authenticator is working by displaying the help contents of aws-iam-authenticator.

Create EKS cluster and worker node group

In the eks-cluster/terraform/aws-eks-cluster-and-nodegroup directory in the accompanying Git repository, create an EKS cluster:

terraform init

For the azs variable below, as noted earlier, we are assuming use of AWS region us-west-2. If you select a different AWS region, modify the azs variable accordingly. Some AWS availability zones may not have the required EC2 P3 instances available, in which case the commands below will fail; retry with different availability zones.

You may specify the Kubernetes version using the k8s_version variable, as shown below. While the latest Kubernetes version is expected to work just as well, this post was developed with version 1.13. The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one and substitute the key pair name for <key-pair> in:


terraform apply -var="profile=default" -var="region=us-west-2" \
	-var="cluster_name=my-eks-cluster" \
	-var='azs=["us-west-2a","us-west-2b","us-west-2c"]' \
	-var="k8s_version=1.13" -var="key_pair=<key-pair>"

Save the summary output of the terraform apply command. Below is an example summary output that has been obfuscated:

EKS Cluster Summary:
vpc: vpc-xxxxxxxxxxxx
subnets: subnet-xxxxxxxxxxxx,subnet-xxxxxxxxxxxx,subnet-xxxxxxxxxxxx
cluster security group: sg-xxxxxxxxxxxxxx
endpoint: https://xxxxxxx.gr7.us-west-2.eks.amazonaws.com

EKS Cluster NodeGroup Summary:
node security group: sg-xxxxxx
node instance role arn: arn:aws:iam::xxxxxxx:role/quick-start-test-ng1-role

EFS Summary:
file system id: fs-xxxxxxxx
dns: fs-xxxxxxxx.efs.us-west-2.amazonaws.com

Create Persistent Volume and Persistent Volume Claim for EFS

As part of creating the Amazon EKS cluster, an Amazon EFS file system is also created. We will use this shared EFS file system to stage training and validation data. To access data from training jobs running in Pods, we need to define a Persistent Volume and a Persistent Volume Claim for EFS.

To create a new Kubernetes namespace named kubeflow:

kubectl create namespace kubeflow

You will need the summary output of the terraform apply command you saved in a previous step. In the eks-cluster directory, in the pv-kubeflow-efs-gp-bursting.yaml file, replace <EFS file-system id> with the EFS file system id from the summary output you saved, replace <AWS region> with the AWS region you are using (e.g., us-west-2), and execute:

kubectl apply -n kubeflow -f pv-kubeflow-efs-gp-bursting.yaml
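For reference, an EFS-backed Persistent Volume with ReadWriteMany access can be defined along these lines; this is a sketch using the NFS volume type and the EFS DNS name from the Terraform summary, while the manifest in the repository may instead rely on an EFS-specific StorageClass:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubeflow-efs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany                      # lets all training Pods mount the volume concurrently
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: fs-xxxxxxxx.efs.us-west-2.amazonaws.com   # EFS DNS name from the Terraform summary output
    path: "/"

The Persistent Volume Claim applied in the next step requests the same ReadWriteMany access mode so that it binds to this volume.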

Check to see that the Persistent Volume was successfully created:

kubectl get pv -n kubeflow

You should see output showing that the Persistent Volume is available.

Execute:

kubectl apply -n kubeflow -f pvc-kubeflow-efs-gp-bursting.yaml

to create an EKS Persistent Volume Claim. Verify that the Persistent Volume Claim was successfully bound to the Persistent Volume:

kubectl get pv -n kubeflow

Build Docker image

Next, we need to build a Docker image containing TensorFlow, the Horovod library, the Nvidia CUDA toolkit, the Nvidia cuDNN library, the NCCL library, the Open MPI toolkit, and the Tensorpack implementation of the Mask R-CNN training algorithm code. The Dockerfile used for building the container image uses the AWS deep learning container image as the base image. In the container/build_tools folder, customize the build_and_push.sh shell script for your AWS region. By default, this script pushes the image to the AWS region configured in your default AWS CLI profile. You can change that in the script and set the region to us-west-2. Execute:

./build_and_push.sh

to build and push the Docker image to Amazon Elastic Container Registry (ECR) in your AWS region.

Optimized Mask R-CNN

To use the optimized Mask R-CNN model, use the container-optimized/build_tools folder and customize and execute:

./build_and_push.sh

Stage COCO 2017 dataset

Next, we stage the COCO 2017 dataset needed for training the Mask R-CNN model. In the eks-cluster folder, customize the prepare-s3-bucket.sh shell script to specify your Amazon S3 bucket name in the S3_BUCKET variable and execute:

./prepare-s3-bucket.sh

This will download the COCO 2017 dataset and upload it to your Amazon S3 bucket. In the eks-cluster folder, customize the image and S3_BUCKET variables in stage-data.yaml. Use the ECR URL for the Docker image you created in the previous step as the value for image. Execute:

kubectl apply -f stage-data.yaml -n kubeflow

to stage data on the selected Persistent Volume Claim for EFS. Wait until the stage-data Pod started by the previous apply command is marked Completed. This can be checked by executing:

kubectl get pods -n kubeflow

To verify data has been staged correctly:


kubectl apply -f attach-pvc.yaml -n kubeflow 
kubectl exec attach-pvc -it -n kubeflow -- /bin/bash

You will be attached to a Pod with the mounted EFS Persistent Volume Claim. Verify that the COCO 2017 dataset is staged correctly under /efs/data on the attached Pod. Type exit when you are done verifying the dataset.

Create the Mask R-CNN training job

Before we proceed further, let us recap what we have covered so far. We have created the EKS cluster, EKS node group, Persistent Volume, and Persistent Volume Claim for EFS, and staged the COCO 2017 dataset on the Persistent Volume.

Next, we define a Kubeflow MPI Job that is used to launch the Mask R-CNN training job. We define the Kubeflow MPI Job using a Helm chart. Helm is an application package manager for Kubernetes. Before defining the job, we install and initialize Helm.

Install and initialize Helm

After installing Helm, initialize it as described below:

In eks-cluster folder, execute:

kubectl create -f tiller-rbac-config.yaml

You should see the following two messages:

serviceaccount "tiller" created  
clusterrolebinding "tiller" created

Execute:

helm init --service-account tiller --history-max 200

Define MPI Job CRD

First, we install the Helm chart that defines the Kubeflow MPI Job CRD by executing the following command in the charts folder:

helm install --debug --name mpijob ./mpijob/

Start training job

In charts/maskrcnn/values.yaml file, customize the image value:

image: # ECR image id

Use the ECR URL for the Docker image you created in a preceding step as the value for image. Start the training job by executing the following command in the charts folder:

helm install --debug --name maskrcnn ./maskrcnn/

You can monitor the status of training Pods by executing:

kubectl get pods -n kubeflow

You should see Worker Pods and a Launcher Pod. The Launcher Pod is created after all Worker Pods enter Running status. Once training is completed, Worker Pods will be destroyed automatically and the Launcher Pod will be marked Completed. You can inspect the Launcher Pod logs using kubectl to get a live output of training logs.

Optimized Mask-RCNN model training

To train the optimized Mask R-CNN model, customize the image value in the charts/maskrcnn-optimized/values.yaml file to the relevant ECR URL and execute:

helm install --debug --name maskrcnn-optimized ./maskrcnn-optimized/

Visualizing Tensorboard summaries

The Tensorboard summaries for the training job can be visualized through the Tensorboard service deployed in EKS:


kubectl get services -n kubeflow \
    -o=jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}{"\n"}'

Use the public DNS address of the Tensorboard service and open it in a browser (http://<Tensorboard service dns name>/) to visualize the summaries. Visualizing algorithm-specific metrics in Tensorboard while the Kubeflow job is running lets us verify that training metrics are converging in the right direction; if they indicate a problem, we can abort training early. Below, we illustrate Tensorboard graphs from an experiment running the Mask R-CNN training job using Kubeflow on EKS. These graphs show the Mask R-CNN-specific algorithm metrics over 24 epochs of training.

 

Figure 2. Mask R-CNN Bounding Box mAP

Figure 3. Mask R-CNN Segmentation mAP

Figure 4. Mask R-CNN Segmentation mAP

Figure 5. Mask R-CNN Loss

Figure 6. Mask R-CNN Loss

Cleanup

Once the training job is completed, the Worker Pods are destroyed automatically. To purge the training job from Helm:

helm del --purge maskrcnn
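
If you also launched the optimized chart earlier, purge it the same way:

helm del --purge maskrcnn-optimized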

To destroy the EKS cluster and worker node group, execute the following command in the eks-cluster/terraform/aws-eks-cluster-and-nodegroup directory, using the same argument values you used in the terraform apply command above:


terraform destroy -var="profile=default" -var="region=us-west-2" \
	-var="cluster_name=my-eks-cluster" \
	-var='azs=["us-west-2a","us-west-2b","us-west-2c"]' \
	-var="k8s_version=1.13" -var="key_pair=<key-pair>"

Conclusion

Distributed data-parallel training using multiple GPUs on multiple machines is often the best solution for reducing training time for heavyweight DNNs trained on large datasets.

Kubeflow on Amazon EKS provides a highly available, scalable, and secure machine learning environment based on open source technologies that can be used for all types of distributed TensorFlow training. In this post, we walked through a step-by-step tutorial on how to do distributed TensorFlow training using Kubeflow on Amazon EKS.

from AWS Open Source Blog

What Amazon Gets by Giving Back to Apache Lucene

What Amazon Gets by Giving Back to Apache Lucene

Lucene in action Amazon search screenshot.

 

At pretty much any scale, search is hard. It becomes dramatically harder, however, when searching at Amazon scale: think billions of products, complicated by millions of sellers constantly changing those products on a daily basis, with hundreds of millions of customers searching through that inventory at all hours. Although Amazon has powered its product search for years with a homegrown C++ search engine, today when you search for a new book or dishwasher on Amazon (or when you ask Alexa to search for you), you’re tapping the power of Apache Lucene (“Lucene”), an open source full-text search engine.

To get a deeper appreciation for Amazon’s embrace of Lucene, I caught up with Mike McCandless, a 12-year veteran of the Lucene community. McCandless, who joined Amazon in 2017, says that “the incredible challenge” of configuring Apache Lucene to run at Amazon scale was “too hard to resist”…

…so long as he could continue to contribute changes upstream, back to the open source Lucene project.

Why Apache Lucene?

In a Berlin Buzzwords 2019 talk, McCandless (and Amazon search colleague Mike Sokolov) walked through the reasons that Amazon, after years of success with a homegrown search engine, elected to embrace Lucene. In a follow-up discussion with me, McCandless stressed that the decision wasn’t trivial given our “very large, high-velocity catalog with exceptionally strong latency requirements and extremely peaky query rates.” Against such stringent demands, the product search team was unsure whether Lucene could keep up.

And yet it was worth an evaluation. Why?

First off, McCandless said, Lucene has attracted a massive community of passionate people who are constantly iterating on the technology. Second, while we might have worried about whether Lucene could meet our functionality and performance requirements, it’s not as if we’d be alone in using it at serious scale. Lucene “isn’t a toy,” McCandless declared, “It’s used in practice all over the place by companies like Twitter, Uber, LinkedIn, and Tinder.” Many other teams at Amazon have used Lucene for years across a variety of applications, though not previously for product search.

Ultimately, it’s that community of sophisticated users that makes Lucene hum, and which made it such an attractive option for Amazon’s product search team. Compared to Amazon’s internal product search service, McCandless argued, “Lucene has more features, is moving faster, has lots of developers working on it, offers a much bigger talent pool of experienced search developers, and more.”

All of which, while true, doesn’t necessarily explain why Amazon contributes to upstream Lucene.

Getting more by giving more

In pushing Lucene to its limits, Amazon developers uncovered “rough edges,” bugs, and other issues, according to McCandless. While the Apache License (Version 2.0) allows developers to modify the code without contributing changes back to the upstream community, Amazon chooses to actively contribute back to Lucene and other projects. Indeed, over time Amazon developers have steadily increased our participation in open source projects as a way to better serve customers, even in strategic areas like search that could yield competitive differentiation.

There are a few reasons for doing so.

First, as McCandless says, “The community is a fabulous resource: they suggest changes, and make the source code better.” By working with the Lucene community, we are better able to help our customers find the products they want, faster.

Second, we want to collaborate with that community to help bolster the main branch of innovation. Yes, at times a temporary branch is necessary, according to McCandless: “Sometimes we need to deal quickly with a short-term need. But then we take that change and propose it to the community. Once the change is merged upstream, we re-base our branch on top of the upstream version and switch back to a standard Lucene release or snapshot.” Keeping code branches as short-lived as possible has emerged as a software engineering best practice, in part thanks to the collaborative development processes pioneered by open source projects.

In this give-and-take of open source, Amazon developers have introduced several significant improvements to Lucene, including:

  • Concurrent updates and deletes. For those using Lucene for simple log analytics (i.e., appending new documents to a Lucene index, and never updating previously indexed documents), Lucene works great. For others, like Amazon, with update-heavy workloads, Lucene had a small but important single-threaded section in the code to resolve deleted IDs to their documents, which proved a major bottleneck for such use-cases. “Substantial, low-level changes to Lucene’s indexing code” were necessary, McCandless acknowledges, changes that McCandless and team contributed. “With this change,” he writes, “IndexWriter still buffers deletes and updates into packets, but whereas before, when each packet was also buffered for later single-threaded application, instead IndexWriter now immediately resolves the deletes and updates in that packet to the affected documents using the current indexing thread. So you gain as much concurrency as indexing threads you are sending through IndexWriter.” The result? “A gigantic speed-up on concurrent hardware” (e.g., 53% indexing throughput speedup when updating whole documents, and a 7.4X – 8.6X speedup when updating doc values).
  • Indexing custom term frequencies. Amazon needed to add the ability to fold behavioral signals into a ranking (i.e., what do customers do after they have searched for something?), a feature long requested in the Lucene community. McCandless’ proposed patch? Creation of “a new token attribute, TermDocFrequencyAttribute, and tweak[ing] the indexing chain to use that attribute’s value as the term frequency if it’s present, and if the index options are DOCS_AND_FREQS for that field.” Sounds simple, right? Not really. “Getting behavioral signals into Lucene was hard work,” McCandless notes. The work, however, was doubly worth it since “Once we added it, others went in and built on top of it.” That community collaboration is clearly visible in the discussion of how best to implement McCandless’ proposed patch.
  • Lazy loading of Finite State Transducers. Our third significant contribution came from Ankit Jain on the AWS Search Services team. Lucene’s finite state transducers (FSTs) are always loaded into heap memory during index open, which, as Jain describes, “caus[es] frequent JVM [out-of-memory] issues if the terms dictionary size is very large.” His solution was to move the FST off-heap and lazily load it using memory-mapped IO, thereby ensuring only required portions of the terms index would be loaded into memory. This AWS contribution resulted in substantial improvements in heap memory usage, without suffering much of a performance ding for hot indices.

With these improvements (not to mention concurrent query execution improvements) and a bevy of existing features the Lucene community has built (e.g., BM25F scoring, disjunctions, language-specific tokenizers), the Amazon search team is on track to run Lucene for all Amazon product searches in 2020. This is phenomenal progress just two-and-a-half years into our Lucene product search experiment, progress made possible by the exceptional community powering Lucene development. It’s enabling Amazon to improve the customer shopping experience with unique-to-Lucene features like dimensional points and query-time joins.

But there’s more to this story than how Amazon has improved Lucene for our customers’ benefit, important as that is.

Take, for example, the custom term frequencies contribution that Amazon pushed to Lucene. For Amazon, this enabled us to migrate our machine-learned ranking models to Lucene. Adding behavioral signals into Lucene rankings is “powerful,” says McCandless, and that new capability enables each of the companies “powered by Lucene,” and many others not listed on the Apache project page, to tap into this ability to better serve their customers. The Amazon Search team could have maintained the custom term frequencies feature in an internal code branch but, in addition to the ongoing costs of maintaining that branch over time, the team saw even more value in collaborating with the Lucene community to make the software better for everyone.

The same is true of the efficiency improvements for concurrent updates/deletes noted above. These help Amazon, of course, but the improvements also benefit any Lucene user who is doing heavy updates. Our goal is to directly improve our customer experience, while also making these powerful new enhancements available to the world.

For Amazon, whether in Lucene, ROS (Robot Operating System), Xen, or any number of other open source projects, we know that often the best way to deliver great customer experiences long-term requires investments in the open source software that is part of our systems, even when the software is invisible to those customers. Our contributions to Lucene illustrate the ongoing evolution in how we serve our customers by actively participating in open source software.

P.S. We're hiring

Are you a Lucene expert who enjoys big challenges? Come work with us.

from AWS Open Source Blog

Join us for the Open Distro Hack Day and More at All Things Open in Raleigh, NC

Join us for the Open Distro Hack Day and More at All Things Open in Raleigh, NC

2019 All Things Open main page.

We’re excited to return to one of our favorite open source community conferences, All Things Open, taking place October 13-15 in Raleigh, NC. AWS is a Presenting Sponsor, and we have some great events lined up – please join us and share in the fun!

The conference kicks off on Sunday, October 13th, at 1pm, with a special event on Inclusion in Open Source & Technology, which we are particularly delighted to sponsor.

Open Distro Hack Day

Open Distro is a downstream distribution of open source Elasticsearch with additional features like built-in security, alerting, performance diagnostics, machine learning, and more. The Open Distro for Elasticsearch team is hosting an all-day hackathon on October 14. The Open Distro Hack Day will be held in State Ballroom A at the Marriott, across the street from the Convention Center and within easy walking distance of registration.

We’ll work with fellow open source developers, enthusiasts, students, and professionals on real-world problems using the Open Distro for Elasticsearch codebase. At this event, you can build on existing Open Distro for Elasticsearch features. Your projects could include security integration for third-party authentication systems or alerting web hooks for Gitter. You could work on front-end components such as a React client for Performance Analyzer or new dashboard visualizations in Kibana. You could also work on UX design or contribute to the technical documentation.

We will have Open Distro experts and mentors available to help you every step of the way, along with lunch and snacks. We will be selecting top projects for prizes. We’ll also have swag for the winners. Space is limited, so sign up now. We look forward to seeing you there!

Open Source talks from AWS teams

MONDAY, October 14

TUESDAY, October 15

Stop by our booth and say hi!

We will be at the AWS booth detailing the latest features of Open Distro and other open source projects. Come say hello! Pick up some stickers and swag. Meet some AWS engineering managers—we’re hiring! See you soon.

from AWS Open Source Blog

Running Secure Workloads on EKS Using Polaris

Running Secure Workloads on EKS Using Polaris

Getting configurations right, especially at scale, can be a challenging task in cloud-native land. Automation helps you to make that task more manageable. In this guest post from EJ Etherington, CTO for Fairwinds, we look at an open source tool that allows you to check your EKS cluster setup, providing you with a graphical overview of the overall cluster state and detailed status, security, and health information.

Michael Hausenblas

Kubernetes with EKS

Amazon Elastic Kubernetes Service (Amazon EKS) makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS. As we all know, the easier it is to deploy workloads to Kubernetes, the easier it is to rush to production without understanding – let alone implementing – best practices for container security, pod resource configuration, etc. At Fairwinds, we wanted to provide an easy way to check all the most important aspects of our Kubernetes workload configurations, continually audit our workloads, block anyone from uploading a configuration that doesn’t adhere to approved guidelines, and use CI/CD to integrate into our deployment workflow. So we have built and open sourced a new project: Polaris.

Polaris from Fairwinds

Creating clusters is easy, but running stable, secure workloads is hard – this is the kind of thing Polaris can help with.

We’ve seen time after time that seemingly small missteps in deployment configuration can lead to much bigger issues – the kind that wake people up at night. Something as simple as forgetting to configure resource requests can break auto scaling or even cause workloads to run out of resources. Small configuration issues can balloon into production outages. Polaris aims to make these problems more easily identifiable and preventable.

As a company, Fairwinds looks to help companies succeed in running Kubernetes at scale, in production. Polaris is one way we make that easier for both our customers and now, as an open source project, for the community. Whether you’re a developer looking to improve the standards of your deployments, or you’re the head of operations looking to give insight to your technical leaders, Polaris provides the information you need.

It’s not just resources

How do you ensure that all your third-party Kubernetes packages are configured as securely and resiliently as possible? Polaris checks more than just resources: it also audits container health checks, image tags, networking, and security settings (to name a few).

Polaris can help you avoid configuration issues that affect the stability, reliability, scalability, and security of your applications. It provides a simple way to identify shortcomings in your deployment configuration and prevent future issues. With Polaris, you can sleep soundly knowing that your applications are deployed with a set of well-tested standards.

Polaris has four key modes:

  1. A dashboard that provides an overview of how well current deployments are configured within a cluster.
  2. A CLI utility that provides YAML output similar to the dashboard.
  3. An experimental validating webhook that can prevent any future deployments that do not live up to a configured standard.
  4. A YAML file check, handy for use when you aren't ready to commit to the webhook, but don't want any misconfigurations to leak through your CI/CD system.

The Polaris dashboard

Run Polaris in the cluster and view the dashboard to ensure you can see all pods.

First, point your kubeconfig to your EKS cluster:

aws --region region eks update-kubeconfig --name cluster_name --role-arn arn:aws:iam::aws_account_id:role/role_name

Then install the dashboard:

kubectl apply -f https://raw.githubusercontent.com/fairwindsops/polaris/master/deploy/dashboard.yaml

Alternatively, with Helm:

helm repo add reactiveops-stable https://charts.fairwinds.com/stable

helm install reactiveops-stable/polaris --namespace polaris

then use port-forward to access the dashboard:

kubectl -n polaris port-forward svc/polaris-dashboard 8080:80

and visit http://localhost:8080/ to view the dashboard.

Polaris dashboard.

Now you have a dynamically updating ‘grade’ for how well your cluster workloads are configured with regard to the checks listed above. You can work from this auto-updating list to improve your workload configurations and help ensure that they will be stable, scalable, and resilient. All you need to do is fix the errors shown in the dashboard, and your score will automatically refresh when you reload the page.

The dashboard includes a high-level summary of checks by category, annotated with helpful information:

Polaris results by category.

You will also see deployments broken out by namespace with specific misconfigurations listed:

Polaris deployments broken out by namespace.

A Note about Polaris’ out-of-the-box defaults

Polaris’ default settings for configuration analysis are very conservative, so don’t be surprised if your score is lower than you expected – a key goal for Polaris was to aim for great configuration by default. If the defaults we’ve included are too strict for your use case, you can easily adjust them as part of the deployment configuration to better suit your workloads.

In releasing Polaris, we’ve included thorough documentation for the checks we’ve chosen to include. Each check includes a link to corresponding documentation that explains why we think it is important, with links to further resources around the topic.

The Polaris CLI

What if you want to check your cluster workloads, but you don’t want to deploy another app into your cluster? The Polaris CLI is just what you need. With the CLI you can view the dashboard locally or get YAML output. Learn more about this in the Polaris Installation and Usage docs.

The Polaris webhook

While the dashboard and CLI provide an overview of the state of your current deployment configuration, the webhook mode provides a way to enforce a higher standard for all future deployments to your cluster.

Once you’ve had a chance to address any issues identified by the dashboard, you can deploy the Polaris webhook to ensure that configuration never slips below that standard again. When deployed in your cluster, the webhook will prevent any deployments that have any “error” level configuration violations.

Although we’re very excited about the potential for this webhook, we’re still working on more thorough testing before we’re ready to consider it production-ready. This is still an experimental feature, and part of a brand new open source project. Because it does have the potential to prevent updates to your deployments, use it with caution.

Use Polaris in the CI/CD pipelines that deploy to your EKS clusters

Install Polaris in your CI/CD image and run:

polaris -audit-path path/to/deployment/yaml

to see an overview of the health of what you’re about to deploy.
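
As a sketch of what that CI step might look like, assuming your deployment manifests live under a k8s/ directory in the repository (the paths and file names here are illustrative):

# Audit the manifests this pipeline is about to apply and keep the report as a build artifact
polaris -audit-path ./k8s/ > polaris-audit.yaml
cat polaris-audit.yaml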

Conclusion

Amazon EKS is a great way to start using Kubernetes, but to help ensure that your workloads are configured properly, you can run Polaris in the mode that works best for your use case. For more information, check out the Polaris project on GitHub. Fairwinds is always looking for help contributing to our open source projects as we look to improve Kubernetes adoption and build great tools for the ecosystem — reach out to Fairwinds if you have questions or want to get involved!

EJ Etherington.

EJ Etherington

With an MS in Molecular Biology, EJ has been involved in development and DevOps for over 10 years. He is currently CTO of Fairwinds. @ejether

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

from AWS Open Source Blog

AWS RoboMaker’s CloudWatch ROS Nodes with Offline Support

AWS RoboMaker’s CloudWatch ROS Nodes with Offline Support

Developers and roboticists alike use a variety of tools to monitor and diagnose remote systems. One such tool is Amazon CloudWatch, a monitoring and management service that enables users to collect performance and operational data in the form of logs and metrics from a single platform. AWS RoboMaker‘s CloudWatch extensions are open source Robot Operating System (ROS) packages that enable log and metric data uploads to the cloud from remote robots. Many of these robotic systems operate on the edge where Internet access is not reliable or consistent, resulting in interruptions to network connectivity to the cloud. The new offline data caching capability in the RoboMaker CloudWatch ROS extensions provides resilience to network connectivity issues and improves data durability.

This blog post introduces the AWS RoboMaker CloudWatch Logs and Metrics ROS nodes, launched last year at re:Invent 2018, and presents the newly-added offline data caching capability. We will go over the ROS node implementations, the new offline caching behavior, and how to run your nodes on a ROS system.

How it works

The RoboMaker cloudwatch_logger ROS node enables ROS-generated logs to be sent to Amazon CloudWatch Logs. Out of the box, this node subscribes to the /rosout_agg topic, where all ROS logs are aggregated, and uploads those logs to the Amazon CloudWatch Logs service. Logs can be sent to Amazon CloudWatch Logs selectively, based on log severity. The cloudwatch_logger node can also subscribe to other topics if logs are not sent to /rosout_agg, and it can unsubscribe from /rosout_agg if that topic is not needed for obtaining logs.

The RoboMaker cloudwatch_metrics_collector ROS node publishes your robot’s metrics to the cloud and enables you to easily track the health of a fleet of devices, e.g. robots, with the use of automated monitoring and actions for when a robot’s metrics show abnormalities. You can easily track historic trends and profile behavior such as resource usage. Out of the box, it provides a ROS interface to receive ROS monitoring messages and publish them to Amazon CloudWatch Metrics. All you need to do to get started is set up AWS Credentials and Permissions for your robot. The CloudWatch Metrics node can be used with any ROS node that publishes the ROS monitoring message; this message can be used to define any custom data structure you wish to publish with your own node implementation. For example, the AWS ROS1 Health Metrics Collector periodically measures system CPU and memory information and publishes the data as a metric using the ROS monitoring message.

Figure 1. Data flow: robot or edge device to Amazon CloudWatch.

Offline caching

Reliable network connectivity is necessary for the normal operation of services hosted in the cloud. However, for a variety of systems operating on the edge, network connectivity cannot be guaranteed. In order to avoid data loss in such an event, we have added offline data caching for the AWS RoboMaker CloudWatch Logs and Metrics extensions. If a network outage or other issue occurs during uploading, data in flight is saved to disk to be uploaded later (see Figure 1 above for an overview of the data flow).

When network connectivity is restored, the most recent data on disk is uploaded first. When all cached data in a file has been uploaded, that file is deleted. An important part of this feature is that the amount of disk space to be used by offline files, their location on disk, and the individual file size are all configurable parameters that can be specified in a ROS YAML configuration file. If the configurable disk space limit has been reached, the oldest cached data is deleted first. For details, please see the project GitHub READMEs (logs and metrics) and Figure 2 below.

Figure 2. Offline caching data flow to CloudWatch.

Installation and example

Prerequisites

The AWS RoboMaker CloudWatch ROS nodes are currently supported on ROS Kinetic (Ubuntu 16.04) and Melodic (Ubuntu 18.04), with planned support for ROS2 (CloudWatch-Metrics-ROS-2 & CloudWatch-Logs-ROS-2). To run these nodes you'll need a working ROS installation, sourced in your current shell. A Docker image with a working ROS installation can also be used.
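
For instance, the official ROS images on Docker Hub give you a shell with ROS preinstalled; a minimal sketch, with the tag chosen to match your target ROS distribution:

# Start an interactive container with ROS Melodic preinstalled
docker run -it --rm ros:melodic bash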

AWS credentials

Important: Before you begin, you must have the AWS CLI installed and configured.

The cloudwatch_logger node will require the following AWS account IAM role permissions:

[logs:PutLogEvents, logs:DescribeLogStreams, logs:CreateLogStream, logs:CreateLogGroup]

To add these permissions to an IAM role, create a file (for example, cloudwatch-iam-policy.json, the name used in the command below) containing the following JSON policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:CreateLogStream",
        "logs:CreateLogGroup"
      ],
      "Resource": [
        "arn:aws:logs:*:*:*"
      ]
    }
  ]
}

Then use this AWS CLI command to create the policy:

aws iam create-policy --policy-name cloudwatch-logs-policy --policy-document file://cloudwatch-iam-policy.json

If successful, the create-policy command will return a JSON confirmation similar to:

{
    "Policy": {
        "PolicyName": "cloudwatch-logs-policy",
        "PolicyId": "ANPAZOAGRWAWQOE6MYHEE",
        "Arn": "arn:aws:iam::648553803821:policy/cloudwatch-logs-policy",
        "Path": "/",
        "DefaultVersionId": "v1",
        "AttachmentCount": 0,
        "IsAttachable": true,
        "CreateDate": "CURRENT-TIME",
        "UpdateDate": "CURRENT-TIME"
    }
}
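
To grant these permissions to the IAM role your robot or development machine uses, attach the new policy to that role; a minimal sketch, where the role name and account ID are placeholders:

aws iam attach-role-policy --role-name <your-robot-role> \
    --policy-arn arn:aws:iam::<your-account-id>:policy/cloudwatch-logs-policy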

The cloudwatch_metrics_collector node will require the cloudwatch:PutMetricData AWS account IAM role permission.
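
You can create an analogous policy for the metrics node and attach it to the same role; a minimal sketch (the policy name is illustrative):

aws iam create-policy --policy-name cloudwatch-metrics-policy \
    --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"cloudwatch:PutMetricData","Resource":"*"}]}'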

Install CloudWatch ROS nodes via apt

sudo apt-get update
sudo apt-get install -y ros-$ROS_DISTRO-cloudwatch-logger
sudo apt-get install -y ros-$ROS_DISTRO-cloudwatch-metrics-collector

Note: You can also build from source.

Running the CloudWatch ROS nodes

Logging node instructions

With a launch file using parameters in .yaml format (an example is provided in the package source and repo):

roslaunch cloudwatch_logger sample_application.launch --screen

Without a launch file, using default values:

rosrun cloudwatch_logger cloudwatch_logger

Send a test log message:

rostopic pub -1 /rosout rosgraph_msgs/Log '{header: auto, level: 2, name: test_log, msg: test_cloudwatch_logger, function: test_logs, line: 1}'

Verify that the test log message was successfully sent to CloudWatch Logs:

  • Go to your AWS account.
  • Find CloudWatch and click into it.
  • In the upper right corner, change region to Oregon if you launched the node using the launch file (region: "us-west-2"), or change to N. Virginia if you launched the node without using the launch file.
  • Select Logs from the left menu.
  • With launch file: The name of the log group should be robot_application_name and the log stream should be the device name (example with launch file below).
  • Without launch file: The name of the log group should be ros_log_group and the log stream should be ros_log_stream.
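
You can also confirm the log group from the command line; a quick check, assuming the launch-file defaults above (log group robot_application_name in us-west-2):

aws logs describe-log-groups --log-group-name-prefix robot_application_name --region us-west-2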

After sending a few logs via the command line, CloudWatch’s logs console will look similar to this:

CloudWatch logs console example.

Note: To understand the configuration parameters, please see the parameter descriptions and sample configuration file.

Sending logs from your own node

As long as your node publishes to /rosout_agg, logs will be automatically picked up.

Here’s an example with a turtlesim node:

After launching the cloudwatch_logger node as directed in the previous example, run the turtlesim nodes below in separate terminals.

rosrun turtlesim turtlesim_node

rosrun turtlesim turtle_teleop_key

TurtleSim CloudWatch test example.

Note: If you have a topic that publishes messages of type rosgraph_msgs::Log, you can also upload them to CloudWatch by adding that topic to the topics list in sample_configuration.yaml.

Metrics node instructions

With a launch file using parameters in .yaml format (example provided in package source and repo):

roslaunch cloudwatch_metrics_collector sample_application.launch --screen

Send a test metric:

rostopic pub /metrics ros_monitoring_msgs/MetricList '[{header: auto, metric_name: "ExampleMetric", unit: "sec", value: 42, time_stamp: now, dimensions: []}, {header: auto, metric_name: "ExampleMetric2", unit: "count", value: 22, time_stamp: now, dimensions: [{name: "ExampleDimension", value: "1"}]}]'

Note: See the GitHub repo to understand the configuration file and parameters.

You can also use the bash script below to send a stream of metrics messages:

./metrics-script.sh metric_script_pub_demo 1 24

After publishing 24 sample metrics with the script, the CloudWatch Metrics console will look something like this:

CloudWatch metrics console example.
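
You can also confirm from the command line that the metrics arrived; a quick check, where the namespace is a placeholder for whatever metric namespace your configuration file sets:

aws cloudwatch list-metrics --namespace "<your-metric-namespace>" --region us-west-2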

An example shell script to publish multiple metrics messages:

#!/bin/bash
# Simple script to publish metrics

if [ $# -eq 0 ] || [ "$1" == "-h" ] || [ "$1" == "--help" ]; then
    echo "This script will publish a ros_monitoring_msgs/MetricList to /metrics."
    echo "Please provide the metric name, metric value lower bound, and metric value upper bound."
    echo "Example: $0 my_demo_metric 1 42"
    exit 0
fi

if [ "$1" == "" ]; then 
    echo "Please provide the metric name"
    exit 1
fi

if [ "$2" == "" ]; then 
    echo "Please provide the metric value lower bound"
    exit 1
fi

if [ "$3" == "" ]; then 
    echo "Please provide the metric value upper bound"
    exit 1
fi

echo "Publishing metrics with metric_name=$1, value start=$2, value end=$3"

for i in $(seq "$2" "$3")
do
  echo "Publishing metric $1 $i"
  rostopic pub -1 /metrics ros_monitoring_msgs/MetricList "[{header: auto, metric_name: '$1', unit: 'sec', value: $i, time_stamp: now, dimensions: [{name: 'Example-Script-Pub-Dimension', value: '$i'}]}]"
done

Summary

In this post we reviewed how you can publish log and metric data from a ROS node to Amazon CloudWatch and how the new offline caching feature works, with hands-on examples of how to use the nodes. This is a rich set of features that both developers and robot fleet managers can use to log, review, and experiment with robot-generated metrics and logs. We hope you have found this post useful; for any improvements or feature requests, please feel free to file an issue in our logs or metrics repositories, or, if you would like to contribute, open a pull request and we will review it!

from AWS Open Source Blog