Tag: AWS Compute Blog

Improving and securing your game-binaries distribution at scale

This post is contributed by Yahav Biran | Sr. Solutions Architect, AWS and Scott Selinger | Associate Solutions Architect, AWS 

Continuous integration and continuous deployment (CI/CD) processes enable game publishers to improve games throughout their lifecycle. One of the challenges that game publishers face when employing CI/CD processes is distributing updated game binaries in a scalable, secure, and cost-effective way.

Often, CI/CD jobs contain minor changes that cause the CI/CD processes to push a full set of game binaries over the internet. This is a suboptimal approach. It negatively affects the cost of development network resources, customer network resources (output and input bandwidth), and the time it takes for a game update to propagate.

This post proposes a method of optimizing the game integration and deployments. Specifically, this method improves the distribution of updated game binaries to various targets, such as game-server farms. The proposed mechanism also adds to the security model designed to include progressive layers, starting from the Amazon EC2 instance that runs the game server. It also improves security of the game binaries, the game assets, and the monitoring of the game server deployments across several AWS Regions.

Why CI/CD in gaming is hard today

Game server binaries are usually native applications that include components such as graphics, sound, network, and physics assets, as well as scripts and media files. Game servers are usually developed with game engines like Unreal, Amazon Lumberyard, and Unity. Game binaries typically take up tens of gigabytes, yet game developer teams modify only a few tens of kilobytes every day, so frequent distribution of the full set of binaries is wasteful.

For a standard global game deployment, distributing game binaries requires compressing the entire binaries set and transferring the compressed version to destinations, then decompressing it upon arrival. You can optimize the process by decoupling the various layers, pushing and deploying them individually.

In both cases, the continuous deployment process might be slow due to the compression and transfer durations. Also, distributing the image binaries incurs unnecessary data transfer costs, since data is duplicated. Other game-binary distribution methods may require the game publisher’s DevOps teams to install and maintain custom caching mechanisms.

This post demonstrates an optimal method for distributing game server updates. The solution uses containerized images stored in Amazon ECR and deployed using Amazon ECS or Amazon EKS to shorten the distribution duration and reduce network usage.

How can containers help?

Dockerized game binaries enable standard caching with no extra implementation work from the game publisher. They allow game publishers to stage their continuous build process in two ways:

  • Rebuild only the layer that was updated in a particular build and reuse the other cached layers.
  • Reassemble both packages into a deployable game server.

The use of ECR with either ECS or EKS takes care of the last mile deployment to the Docker container host.
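As a rough sketch of what that staging looks like in practice (the image and tag names here are illustrative, not the ones used later in this post), rebuilding after a change to only the game binaries reuses the cached engine layers, and the subsequent push uploads just the new layer:

# Rebuild the game-server image; unchanged engine layers come from the local
# Docker build cache, so only the updated binaries layer is rebuilt.
docker build -t game-server:latest .

# Push to ECR; layers that already exist in the registry are skipped, so only
# the small, updated layer crosses the network.
docker tag game-server:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/game-server:latest
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/game-server:latest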

Larger application binaries mean longer application loading times. To reduce the overall application initialization time, I decouple the deployment of the binaries and media files to allow the application to update faster. For example, updates in the application media files do not require the replication of the engine binaries or media files. This is achievable if the application binaries can be deployed in a separate directory structure. For example:

/opt/local/engine

/opt/local/engine-media

/opt/local/app

/opt/local/app-media

Containerized game servers deployment on EKS

The application server can be deployed as a single Kubernetes pod with multiple containers. The engine media (/opt/local/engine-media), the application (/opt/local/app), and the application media (/opt/local/app-media) spawn as Kubernetes initContainers and the engine binary (/opt/local/engine) runs as the main container.

apiVersion: v1
kind: Pod
metadata:
  name: my-game-app-pod
  labels:
    app: my-game-app
spec:
  volumes:
    - name: engine-media-volume
      emptyDir: {}
    - name: app-volume
      emptyDir: {}
    - name: app-media-volume
      emptyDir: {}
  initContainers:
    - name: engine-media
      image: the-engine-media-image
      imagePullPolicy: Always
      command:
        - "sh"
        - "-c"
        - "cp /* /opt/local/engine-media"
      volumeMounts:
        - name: engine-media-volume
          mountPath: /opt/local/engine-media
    - name: app
      image: the-app-image
      imagePullPolicy: Always
      command:
        - "sh"
        - "-c"
        - "cp /* /opt/local/app"
      volumeMounts:
        - name: app-volume
          mountPath: /opt/local/app
    - name: app-media
      image: the-app-media-image
      imagePullPolicy: Always
      command:
        - "sh"
        - "-c"
        - "cp /* /opt/local/app-media"
      volumeMounts:
        - name: app-media-volume
          mountPath: /opt/local/app-media
  containers:
    - name: the-engine
      image: the-engine-image
      imagePullPolicy: Always
      volumeMounts:
        - name: engine-media-volume
          mountPath: /opt/local/engine-media
        - name: app-volume
          mountPath: /opt/local/app
        - name: app-media-volume
          mountPath: /opt/local/app-media
      command: ['sh', '-c', '/opt/local/engine/start.sh']
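A minimal sketch of deploying and inspecting the pod above (the file name my-game-app-pod.yaml is an assumption):

kubectl apply -f my-game-app-pod.yaml       # create the pod; the initContainers run first
kubectl get pod my-game-app-pod --watch     # watch the init steps complete and the pod go Running
kubectl logs my-game-app-pod -c the-engine  # confirm the engine started from /opt/local/engine/start.sh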

Applying multi-stage game binaries builds

In this post, I use Docker multi-stage builds for containerizing the game asset builds. I use AWS CodeBuild to manage the build and to deploy the updates of game engines like Amazon Lumberyard as ready-to-play dedicated game servers.

Using this method, frequent changes in the game binaries require less than 1% of the data transfer typically required by full image replication to the nodes that run the game-server instances. This results in significant improvements in build and integration time.

I provide a deployment example for Amazon Lumberyard Multiplayer Sample that is deployed to an EKS cluster, but this can also be done using different container orchestration technology and different game engines. I also show that the image being deployed as a game-server instance is always the latest image, which allows centralized control of the code to be scheduled upon distribution.

This example shows an update of only 50 MB of game assets, whereas the full game-server binary is 3.1 GB. With only 1.5% of the content being updated, that speeds up the build process by 90% compared to non-containerized game binaries.

For security with EKS, apply the imagePullPolicy: Always option as part of the Kubernetes best practice container images deployment option. This option ensures that the latest image is pulled every time that the pod is started, thus deploying images from a single source in ECR, in this case.

Example setup

  • Read through the following sample, a multiplayer game sample, and see how to build and structure multiplayer games to employ the various features of the GridMate networking library.
  • Create an AWS CodeCommit or GitHub repository (multiplayersample-lmbr) that includes the game engine binaries, the game assets (.pak, .cfg and more), AWS CodeBuild specs, and EKS deployment specs.
  • Create a CodeBuild project that points to the CodeCommit repo. The build image uses aws/codebuild/docker:18.09.0, the built-in image maintained by CodeBuild, configured with 3 GB of memory and two vCPUs. The compute allocated for build capacity can be modified to trade off cost and build time.
  • Create an EKS cluster designated as a staging or an integration environment for the game title. In this case, it’s multiplayersample (see the sketch after this list).
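For that last step, here is a minimal sketch with eksctl; the cluster name, Region, and node count are assumptions for a small staging environment:

eksctl create cluster \
  --name multiplayersample \
  --region us-east-1 \
  --nodes 2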

The binaries build Git repository

The Git repository is composed of five core components ordered by their size:

  • The game engine binaries (for example, BinLinux64.Dedicated.tar.gz). This is the compressed version of the game engine artifacts that are not updated regularly, hence they are deployed as a compressed file. The maintenance of this file is usually done by a different team than the developers working on the game title.
  • The game binaries (for example, MultiplayerSample_pc_Paks_Dedicated). This directory is maintained by the game development team and managed as a standard multi-branch repository. The artifacts under this directory get updated on a daily or weekly basis, depending on the game development plan.
  • The build-related specifications (for example, buildspec.yml and Dockerfile). These files specify the build process. For simplicity, I only included the Docker build process to convey the speed of continuous integration. The process can easily be extended to include the game compilation and linking process as well.
  • The Docker artifacts for containerizing the game engine and the game binaries (for example, start.sh and start.py). These scripts usually are maintained by the game DevOps teams and updated outside of the regular game development plan. More details about these scripts can be found in a sample that describes how to deploy a game-server in Amazon EKS.
  • The deployment specifications (for example, eks-spec) specify the Kubernetes game-server deployment specs. This is for reference only, since the CD process usually runs in a separate set of resources like staging EKS clusters, which are owned and maintained by a different team.

The game build process

The build process starts with any Git push event on the Git repository. The build process includes three core phases, denoted by pre_build, build, and post_build in multiplayersample-lmbr/buildspec.yml.

  1. The pre_build phase unzips the game-engine binaries and logs in to the container registry (Amazon ECR) in preparation for the image build and push.
  2. The build phase executes the docker build command that includes the multi-stage build.
    • The Dockerfile spec file describes the multi-stage image build process. It starts by adding the game-engine binaries to the Linux OS, ubuntu:18.04 in this example.
    • FROM ubuntu:18.04
    • ADD BinLinux64.Dedicated.tar /
    • It continues by adding the necessary packages to the game server (for example, ec2-metadata, boto3, libc, and Python) and the necessary scripts for controlling the game server runtime in EKS. These packages are only required for the CI/CD process. Therefore, they are only added in the CI/CD process. This enables a clean decoupling between the necessary packages for development, integration, and deployment, and simplifies the process for both teams.
    • RUN apt-get install -y python python-pip
    • RUN apt-get install -y net-tools vim
    • RUN apt-get install -y libc++-dev
    • RUN pip install mcstatus ec2-metadata boto3
    • ADD start.sh /start.sh
    • ADD start.py /start.py
    • The second part is to copy the game engine from the previous stage --from=0 to the next build stage. In this case, you copy the game engine binaries with the two COPY Docker directives.
    • COPY --from=0 /BinLinux64.Dedicated/* /BinLinux64.Dedicated/
    • COPY --from=0 /BinLinux64.Dedicated/qtlibs /BinLinux64.Dedicated/qtlibs/
    • Finally, the game binaries are added as a separate layer on top of the game-engine layers, which concludes the build. It’s expected that constant daily changes are made to this layer, which is why it is packaged separately. If your game includes other abstractions, you can break this step into several discrete Docker image layers.
    • ADD MultiplayerSample_pc_Paks_Dedicated /BinLinux64.Dedicated/
  3. The post_build phase pushes the game Docker image to the centralized container registry for further deployment to the various regional EKS clusters. In this phase, tag and push the new image to the designated container registry in ECR.

- docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG

- docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
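For completeness, the ECR login performed in the pre_build phase (step 1 above) typically looks like the following with the AWS CLI version available at the time of writing; this is a sketch, not the exact line from the sample buildspec:

- $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION)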

The game deployment process in EKS

At this point, you’ve pushed the updated image to the designated container registry in ECR (/$IMAGE_REPO_NAME:$IMAGE_TAG). This image is scheduled as a game server in an EKS cluster as a game-server Kubernetes deployment, as described in the sample.

In this example, I use  imagePullPolicy: Always.


containers:
…
        image: /$IMAGE_REPO_NAME:$IMAGE_TAG/multiplayersample-build
        imagePullPolicy: Always
        name: multiplayersample
…

By using imagePullPolicy, you ensure that no one can circumvent Amazon ECR security, and you can securely make ECR the single source of truth with regards to scheduled binaries. However, pulling the whole image from ECR to the worker nodes via kubelet, the node agent, on every pod start could be expensive. Given the size of a whole image combined with the frequency with which it is pulled, that would amount to a significant additional cost to your project.

However, Docker layers allow you to update only the layers that were modified, preventing a whole image update. Also, they enable secure image distribution. In this example, only the layer MultiplayerSample_pc_Paks_Dedicated is updated.
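If you want to verify that behavior yourself, docker history lists each layer of an image with its size, so you can confirm that a new tag only adds a small game-binaries layer on top of the unchanged engine layers (the image URI below is a sketch using the same placeholder variables as the buildspec):

docker history $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG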

Proposed CI/CD process

The following diagram shows an example end-to-end architecture of a full-scale game-server deployment using EKS as the orchestration system, ECR as the container registry, and CodeBuild as the build engine.

Game developers merge changes to the Git repository that include both the preconfigured game-engine binaries and the game artifacts. Upon merge events, CodeBuild builds a multistage game-server image that is pushed to a centralized container registry hosted by ECR. At this point, DevOps teams in different Regions continuously schedule the image as a game server, pulling only the updated layer in the game server image. This keeps the entire game-server fleet running the same game binaries set, making for a secure deployment.

 

Try it out

I published two examples to guide you through the process of building an Amazon EKS cluster and deploying a containerized game server with large binaries.

Conclusion

Adopting CI/CD in game development improves the software development lifecycle by continuously deploying quality-based updated game binaries. CI/CD in game development is usually hindered by the cost of distributing large binaries, in particular, by cross-regional deployments.

Non-containerized paradigms require deployment of the full set of binaries, which is an expensive and time-consuming task. Containerized game-server binaries with AWS build tools and Amazon EKS-based regional clusters of game servers enable secure and cost-effective distribution of large binary sets to enable increased agility in today’s game development.

In this post, I demonstrated a reduction of more than 90% of the network traffic required by implementing an effective CI/CD system in a large-scale deployment of multiplayer game servers.

from AWS Compute Blog

Integrating AWS X-Ray with AWS App Mesh

This post is contributed by Lulu Zhao | Software Development Engineer II, AWS

 

AWS X-Ray helps developers and DevOps engineers quickly understand how an application and its underlying services are performing. When it’s integrated with AWS App Mesh, the combination makes for a powerful analytical tool.

X-Ray helps to identify and troubleshoot the root causes of errors and performance issues. It’s capable of analyzing and debugging distributed applications, including those based on a microservices architecture. It offers insights into the impact and reach of errors and performance problems.

In this post, I demonstrate how to integrate it with App Mesh.

Overview

App Mesh is a service mesh based on the Envoy proxy that makes it easy to monitor and control microservices. App Mesh standardizes how your microservices communicate, giving you end-to-end visibility and helping to ensure high application availability.

With App Mesh, it’s easy to maintain consistent visibility and network traffic control for services built across multiple types of compute infrastructure. App Mesh configures each service to export monitoring data and implements consistent communications control logic across your application.

A service mesh is like a communication layer for microservices. All communication between services happens through the mesh. Customers use App Mesh to configure a service mesh that contains virtual services, virtual nodes, virtual routers, and corresponding routes.

However, it’s challenging to visualize the way that request traffic flows through the service mesh while attempting to identify latency and other types of performance issues. This is particularly true as the number of microservices increases.

It’s in exactly this area where X-Ray excels. To show a detailed workflow inside a service mesh, I implemented a tracing extension called X-Ray tracer inside Envoy. With it, I ensure that I’m tracing all inbound and outbound calls that are routed through Envoy.

Traffic routing with color app

The following example shows how X-Ray works with App Mesh. I used the Color App, a simple demo application, to showcase traffic routing.

This app has two Go applications that are instrumented with the AWS X-Ray Go SDK: color-gateway and color-teller. The color-gateway application is exposed to external clients and responds to http://service-name:port/color, which retrieves the color from color-teller. I deployed the Color App using Amazon ECS. This image illustrates how color-gateway routes traffic into a virtual router and then into separate nodes using color-teller.

 

The following image shows client interactions with App Mesh in an X-Ray service map after requests have been made to the color-gateway and to color-teller.

Integration

There are two types of service nodes:

  • AWS::AppMesh::Proxy is generated by the X-Ray tracing extension inside Envoy.
  • AWS::ECS::Container is generated by the AWS X-Ray Go SDK.

The service graph arrows show the request workflow, which you may find helpful as you try to understand the relationships between services.

To send Envoy-generated segments into X-Ray, install the X-Ray daemon. The following code example shows the ECS task definition used to install the daemon into the container.

{
    "name": "xray-daemon",
    "image": "amazon/aws-xray-daemon",
    "user": "1337",
    "essential": true,
    "cpu": 32,
    "memoryReservation": 256,
    "portMappings": [
        {
            "hostPort": 2000,
            "containerPort": 2000,
            "protocol": "udp"
        }
    ]
}
After the Color app successfully launched, I made a request to color-gateway to fetch a color.

  • First, the Envoy proxy appmesh/colorgateway-vn in front of default-gateway received the request and routed it to the server default-gateway.
  • Then, default-gateway made a request to server default-colorteller-white to retrieve the color.
  • Instead of directly calling the color-teller server, the request went to the default-gateway Envoy proxy and the proxy routed the call to color-teller.

That’s the advantage of using the Envoy proxy. Envoy is a self-contained process that is designed to run in parallel with all application servers. All of the Envoy proxies form a transparent communication mesh through which each application sends and receives messages to and from localhost while remaining unaware of the broader network topology.

For App Mesh integration, the X-Ray tracer records the mesh name and virtual node name values and injects them into the segment JSON document. Here is an example:

"aws": {
    "app_mesh": {
        "mesh_name": "appmesh",
        "virtual_node_name": "colorgateway-vn"
    }
},

To enable X-Ray tracing through App Mesh inside Envoy, you must set two environment variable configurations:

  • ENABLE_ENVOY_XRAY_TRACING
  • XRAY_DAEMON_PORT

The first one enables X-Ray tracing using 127.0.0.1:2000 as the default daemon endpoint to which generated segments are sent. If the daemon you installed listens on a different port, you can specify a port value to override the default X-Ray daemon port by using the second configuration.
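To make the names and values concrete, here is a sketch of the two settings. In an ECS task definition they belong in the environment section of the Envoy container definition rather than in a shell, and the port override is only needed if your daemon does not listen on the default port 2000:

ENABLE_ENVOY_XRAY_TRACING=1   # turn on the X-Ray tracing extension inside Envoy
XRAY_DAEMON_PORT=2000         # override only if the daemon listens on a non-default port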

Conclusion

Currently, AWS X-Ray supports SDKs written in multiple languages (including Java, Python, Go, .NET and .NET Core, Node.js, and Ruby) to help you implement your services. For more information, see Getting Started with AWS X-Ray.

from AWS Compute Blog

Getting started with serverless

This post is contributed by Maureen Lonergan, Director, AWS Training and Certification

We consistently hear from customers that they’re interested in building serverless applications to take advantage of the increased agility and decreased total cost of ownership (TCO) that serverless delivers. But we also know that serverless may be intimidating for those who are more accustomed to using instances or containers for compute.

Since we launched AWS Lambda in 2014, our serverless portfolio has expanded beyond event-driven computing. We now have serverless databases, integration, and orchestration tools. This enables you to build end-to-end serverless applications—but it also means that you must learn how to build using a new serverless operational model.

For this reason, AWS Training and Certification is pleased to offer a new course through Coursera entitled AWS Fundamentals: Building Serverless Applications.

This scenario-based course, developed by the experts at AWS, will:

  • Introduce the AWS serverless framework and architecture in the context of a real business problem.
  • Provide the foundational knowledge to become more proficient in choosing and creating serverless solutions using AWS.
  • Provide demonstrations of the AWS services needed for deploying serverless solutions.
  • Help you develop skills in building and deploying serverless solutions using real-world examples of a serverless website and chatbot.

The syllabus allocates more than nine hours of video content and reading material over four weekly lessons. Each lesson has an estimated 2–3 hours per week of study time (though you can set your own pace and deadlines), with suggested exercises in the AWS Management Console. There is an end-of-course assessment that covers all the learning objectives and content.

The course is on-demand and 100% digital; you can even audit it for free. A completion certificate and access to the graded assessments are available for $49.

What can you expect?

In this course you will learn to use the AWS serverless portfolio to create a chatbot that answers the question, “Can I let my cat outside?” You will build an application using every one of the concepts and services discussed in the class.

At the end of the class, you can audibly interact with the application to ask that essential question, “Can my cat go out in Denver?” (See the conversation in the following screenshot.)

Serverless Coursera training app

Across the four weeks of the course, you learn:

  1. What serverless computing is and how to create a chatbot with Amazon Lex using an S3 bucket to host a web application.
  2. How to build a highly scalable API with API Gateway and use Amazon CloudFront as a content delivery network (CDN) for your site and API.
  3. How to use Lambda to build serverless functions that write data to DynamoDB.
  4. How to apply lessons from the previous weeks to extend and add functionality to the chatbot.

Serverless Coursera training

AWS Fundamentals: Building Serverless Applications is now available. This course complements other standalone digital courses by AWS Training and Certification, including the highly recommended Introduction to Serverless Development.

from AWS Compute Blog

Updated timeframe for the upcoming AWS Lambda and AWS Lambda@Edge execution environment update

On May 14th we announced an upcoming update to the AWS Lambda and AWS Lambda@Edge execution environments. In that announcement we shared that we are updating the execution environment to a more recent version of Amazon Linux. This newer execution environment brings updates that offer improvements in capabilities, performance, security, and updated packages that your application code might interface with. The previous post explained approaches to proactively testing against the new update, and methods to update your code to be compatible in the rare case you were impacted.

So far, we’ve heard from many customers that their functions have not been impacted when testing against the new execution environment via the opt-in mechanism. Those that have been impacted have been able to follow the guidance on rebuilding any dependencies against the new execution environment and have retested their functions successfully.

We also received feedback that customers wanted a longer time frame for validation, as well as more control over it. Based on this feedback, we’ve decided to modify the timeframe in two ways.

The first phase, Begin Testing, will be extended by three weeks: instead of ending May 21, it now ends June 10. This gives you more time to test your functions with the opt-in layer before any further changes to the platform kick in.

We are then taking the second phase, originally called Update/Create, and breaking it into two independent periods of time. The first, now referred to as the New Function Create phase, will be two weeks long; during this time, all newly created functions will use the new execution environment unless a delayed-update layer is configured. The second new phase, Existing Function Update, will be three weeks long; during this time, both newly created functions and existing functions that you update will use the new execution environment unless a delayed-update layer is configured.

The end result is that you now have 5 more weeks in total to test and potentially update your functions for this change before the General Update begins. As a reminder, starting at that time, all functions without a delayed-update layer configured will begin migrating to the new execution environment.

New update timeline

The following is the new timeline for the update, which is now broken up over five phases:

May 14, 2019 – Begin Testing: You can begin testing your functions for the new execution environment locally with AWS SAM CLI or using an Amazon EC2 instance running on Amazon Linux 2018.03. You can also proactively enable the new environment in AWS Lambda using the opt-in mechanism described in the original announcement post.
June 11, 2019 – New Function Create: All newly created functions will result in those functions running on the new execution environment, unless they have a delayed-update layer configured.
June 25, 2019 – Existing Function Update: All newly created functions or existing functions that you update will result in those functions running on the new execution environment, unless they have a delayed-update layer configured.
July 16, 2019 – General Update: Existing functions begin using the new execution environment on invoke, unless they have a delayed-update layer configured.
July 23, 2019 – Delayed Update End: All functions with a delayed-update layer configured start being migrated automatically.
July 29, 2019 – Migration End: All functions have been migrated over to the new execution environment.

Note that we have updated the original announcement post with this new timeline as well.

FAQ

We also wanted to take this chance to provide additional information on follow-up questions customers have had about the update.

Q. How does this relate to the recent Node.js v10 runtime launch?
A. The Node.js v10 launch is unrelated and is not impacted by this change. The Node.js v10 runtime is based on Amazon Linux 2 as its execution environment. Please see the AWS Lambda Runtimes section in the documentation for more information.

Q. Does this update change the execution environment for other runtimes to run on Amazon Linux 2?
A. No, this update brings the execution environment to the latest Amazon Linux 1 distribution release. In the future, new runtimes will launch on Amazon Linux 2, but all previous existing runtimes will continue to run on Amazon Linux 1.

Q. Was this update related to the recent Intel Quarterly Security Release (QSR) 2019.1?
A. No, this motion to begin updating the execution environment for Lambda and Lambda@Edge is unrelated to the Intel QSR. There is no action for Lambda or Lambda@Edge customers to take in relation to the QSR.

Next Steps

Your feedback greatly matters to us and we will continue to listen and learn from you. Please continue to contact us through AWS Support, the AWS forums, or AWS account teams.

from AWS Compute Blog

Updates to Amazon EKS Version Lifecycle

Contributed by Nathan Taber and Michael Hausenblas

At re:Invent 2017 we introduced the Amazon Elastic Container Service for Kubernetes, or Amazon EKS for short, along with a set of tenets for the service. We consider these tenets as valid today as they were at launch:

  • EKS is a platform to run production-grade workloads. This means that security and reliability are our first priority. After that we focus on doing the heavy lifting for you in the control plane, including life cycle-related things like version upgrades.
  • EKS provides a native and upstream Kubernetes experience. This means, with EKS you get vanilla, un-forked Kubernetes. Of course, in keeping with our first tenet, we ensure that the Kubernetes versions we run have security-related patches, even for older supported versions, as quickly as possible. However, in terms of portability there’s no special sauce and no lock-in.
  • If you want to use additional AWS services, the integrations are as seamless as possible.
  • The EKS team in AWS actively contributes to the upstream Kubernetes project, both on the technical level as well as community, from communicating good practices to participation in SIGs and working groups.

The first two tenets are highlighted and that is for a good reason: on the one hand we aim to go in lock-step with the upstream release cadence as much as possible, including outcomes of the SIG PM as well as the LTS Working Group. Given that running a service for production applications is our main focus, we want to make sure that you can rely on the Kubernetes we run for you. This includes, but is not limited to, security considerations around community support for ongoing bug fixes and patches for critical vulnerabilities and exposures (CVEs).

In this post, we want to give you a heads-up on upcoming changes to how Amazon EKS manages the lifecycle of Kubernetes versions, walk you through the process in general, and then look at a concrete example: Kubernetes version 1.10. This happens to be the first version that will be deprecated on Amazon EKS.

But why now?

Glad you asked. It’s really all about security. Past a certain point (usually 1 year), the Kubernetes community stops releasing bug and CVE patches. Additionally, the Kubernetes project does not encourage CVE submission for deprecated versions. This means that vulnerabilities specific to an older version of Kubernetes may not even be reported, leaving users exposed with no notice in the case of a vulnerability. We consider this to be an unacceptable security posture for our customers.

Earlier this year we announced support for Kubernetes 1.12 in EKS. That, together with our commitment to support three Kubernetes versions at any given point in time and the fact that 1.13 will land very soon in EKS means that we have to deprecate 1.10, after which the three supported versions, unsurprisingly, will be 1.11, 1.12, and (you guessed it) 1.13. OK, with that out of the way, let’s have a look at the options you have to move to the latest Kubernetes versions with Amazon EKS and then dive into the update and deprecation process in greater detail:

  • Ideally, you test a new version and move to one of the three supported ones, in time (details below).
  • If you are still on a version we deprecate, you will be upgraded automatically, after some time (details, again, below).
  • If you’re using a deprecated version beyond a certain point and we can’t upgrade the cluster, we may deactivate it.

A quick Kubernetes release cycle refresher

In a nutshell, the Kubernetes versioning and release regime roughly follows a four-releases-per-year pattern, with cadence varying between 70 and 130 days. It also lays out an expectation in terms of upgrades:

We expect users to stay reasonably up-to-date with the versions of Kubernetes they use in production, but understand that it may take time to upgrade, especially for production-critical components.

The formal API versioning allows for a strict deprecation policy which states, amongst other things, that stable (GA) API support is “12 months or 3 releases (whichever is longer)”.

Now that we’re on the same page how upstream Kubernetes releases are managed, let’s have a look at how we at AWS implement the process in EKS.

The EKS Process

In line with the Kubernetes community support for Kubernetes versions, Amazon EKS is committed to running at least three production-ready versions of Kubernetes at any given time, with a fourth version in deprecation. A new Kubernetes version is released as generally available by the Kubernetes project every 70 to 130 days (we take the average of 90 days for simplicity). New GA versions will be supported by EKS some time after GA release (typically at the first patch version release, 1.XX.1, but sometimes later). This means that the total time a version is in production with EKS should be roughly 270 days.

We will announce the deprecation of a given Kubernetes version (n) at least 60 days before the deprecation date and over time, will align the deprecation of a Kubernetes version on EKS to be on or after the date the Kubernetes project stops supporting the version upstream.

For example, we will announce deprecation of version 1.10 while 1.12 is available for EKS and complete the deprecation process after version 1.13 is available for EKS. We will announce the deprecation of 1.11 after 1.13 is available and complete the deprecation after 1.14 is available for EKS.

The following table shows how this will work:

EKS Version       | Today | Soon | About +90 days | About +180 days | About +270 days
Latest Available  | 1.12  | 1.13 | 1.14           | 1.15            | 1.16
Default           | 1.11  | 1.12 | 1.13           | 1.14            | 1.15
Oldest            | 1.10  | 1.11 | 1.12           | 1.13            | 1.14
In Deprecation    |       | 1.10 | 1.11           | 1.12            | 1.13

When we announce the deprecation, we will give customers a specific date when new cluster creation will be disabled for the version targeted for deprecation. On this date, EKS clusters running the version targeted for deprecation will begin to be updated to the next EKS-supported version of Kubernetes. This means that if the deprecated version is 1.10, clusters will be automatically updated to version 1.11. If a cluster is automatically updated by EKS, customers will need to update the version of their worker nodes after the update is complete. Kubernetes has compatibility between masters and workers for at least 2 versions, so 1.10 workers will continue to operate when orchestrated by a 1.11 control plane.

Upcoming deprecation of Kubernetes 1.10 in EKS

Amazon EKS will deprecate Kubernetes version 1.10 on July 22, 2019. On this day, you will no longer be able to create new 1.10 clusters and all EKS clusters running Kubernetes version 1.10 will be updated to the latest available platform version of Kubernetes version 1.11.

We recommend that all Amazon EKS customers update their 1.10 clusters to Kubernetes version 1.11 or 1.12 as soon as possible.
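A minimal sketch of that update with the AWS CLI (the cluster name is an assumption); remember to update your worker nodes to a matching AMI after the control-plane update completes:

aws eks describe-cluster --name my-cluster --query cluster.version       # confirm the current version
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.11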

 

Wrapping up

What can you do today to prepare? Well, first off, internalize the timeline and try to align internal processes with it. Our documentation has more information about the EKS Kubernetes version deprecation process and EKS updates. If you have any questions, send us a note on our version deprecation issue in the public containers roadmap on GitHub.

from AWS Compute Blog

ICYMI: Serverless Q1 2019

Welcome to the fifth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

If you didn’t see them, check our previous posts for what happened in 2018:

So, what might you have missed this past quarter? Here’s the recap.

Amazon API Gateway

Amazon API Gateway improved the experience for publishing APIs on the API Gateway Developer Portal. It also added features like a search capability, a feedback mechanism, and SDK-generation capabilities.

Last year, API Gateway announced support for WebSockets. As of early February 2019, it is now possible to build WebSocket-enabled APIs via AWS CloudFormation and AWS Serverless Application Model (AWS SAM). The following diagram shows an example application.

API Gateway is also now supported in AWS Config. This feature enhancement allows API administrators to track changes to their API configuration automatically. With the power of AWS Config, you can automate alerts—and even remediation—with triggered Lambda functions.

In early January, API Gateway also announced a service level agreement (SLA) of 99.95% availability.

AWS Step Functions

Step Functions Local

AWS Step Functions added the ability to tag Step Function resources and provide access control with tag-based permissions. With this feature, developers can use tags to define access via AWS Identity and Access Management (IAM) policies.

In addition to tag-based permissions, Step Functions was one of 10 additional services to have support from the Resource Group Tagging API, which allows a single central point of administration for tags on resources.

In early February, Step Functions released the ability to develop and test applications locally using a local Docker container. This new feature allows you to innovate faster by iterating faster locally.

In late January, Step Functions joined the family of services offering SLAs with an SLA of 99.9% availability. They also increased their service footprint to include the AWS China (Ningxia) and AWS China (Beijing) Regions.

AWS SAM Command Line Interface

AWS SAM Command Line Interface (AWS SAM CLI) released the AWS Toolkit for Visual Studio Code and the AWS Toolkit for IntelliJ. These toolkits are open source plugins that make it easier to develop applications on AWS. The toolkits provide an integrated experience for developing serverless applications in Node.js (Visual Studio Code) as well as Java and Python (IntelliJ), with more languages and features to come.

The toolkits help you get started fast with built-in project templates that leverage AWS SAM to define and configure resources. They also include an integrated experience for step-through debugging of serverless applications and make it easy to deploy your applications from the integrated development environment (IDE).

AWS Serverless Application Repository

AWS Serverless Application Repository applications can now be published to the application repository using AWS CodePipeline. This allows you to update applications in the AWS Serverless Application Repository with a continuous integration and continuous delivery (CICD) process. The CICD process is powered by a pre-built application that publishes other applications to the AWS Serverless Application Repository.

AWS Event Fork Pipelines

Event Fork Pipelines

AWS Event Fork Pipelines is now available in AWS Serverless Application Repository. AWS Event Fork Pipelines is a suite of nested open-source applications based on AWS SAM. You can deploy Event Fork Pipelines directly from AWS Serverless Application Repository into your AWS account. These applications help you build event-driven serverless applications by providing pipelines for common event-handling requirements.

AWS Cloud9

Cloud9

AWS Cloud9 announced that, in addition to Amazon Linux, you can now select Ubuntu as the operating system for your AWS Cloud9 environment. Before this announcement, you would have to stand up an Ubuntu server and connect AWS Cloud9 to the instance by using SSH. With native support for Ubuntu, you can take advantage of AWS Cloud9 features, such as instance lifecycle management for cost efficiency and preconfigured tooling environments.

AWS Cloud9 also added support for AWS CloudTrail, which allows you to monitor and react to changes made to your AWS Cloud9 environment.

Amazon Kinesis Data Analytics

Amazon Kinesis Data Analytics now supports CloudTrail logging. CloudTrail captures changes made to Kinesis Data Analytics and delivers the logs to an Amazon S3 bucket. This makes it easy for administrators to understand changes made to the application and who made them.

Amazon DynamoDB

Amazon DynamoDB removed the costs associated with DynamoDB Streams when used to replicate data globally. Because global tables use streams to replicate data between Regions, this translates to cost savings for global tables. However, DynamoDB streaming costs remain the same for your applications reading from a replica table’s stream.

DynamoDB added the ability to switch the encryption keys used to encrypt data. DynamoDB, by default, encrypts all data at rest. You can use the default encryption, which uses an AWS-owned customer master key (CMK), or the AWS managed CMK to encrypt data. It is now possible to change between the AWS-owned CMK and the AWS managed CMK without having to modify code or applications.
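As a sketch of what that switch looks like with the AWS CLI (the table name is an assumption), moving a table to the AWS managed CMK is a single update with no application changes:

aws dynamodb update-table \
  --table-name GameScores \
  --sse-specification Enabled=true,SSEType=KMS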

Amazon DynamoDB Local, a local installable version of DynamoDB, has added support for transactional APIs, on-demand capacity, and as many as 20 global secondary indexes per table.

AWS Amplify

Amplify Deploy

AWS Amplify added support for OAuth 2.0 Authorization Code Grant flows in the native (iOS and Android) and React Native libraries. Previously, you would have to use third-party libraries and handwritten logic to achieve these use cases.

Additionally, Amplify also launched the ability to perform instant cache invalidation and delta deployments on every code commit. To achieve this, Amplify creates unique references to all the build artifacts on each deploy. Amplify has also added the ability to detect and upload only modified artifacts at the time of release to help reduce deployment time.

Amplify also added features for multiple environments, custom resolvers, larger data models, and IAM roles, including multi-factor authentication (MFA).

AWS AppSync

AWS AppSync increased its availability footprint to the EU (London) Region.

Amazon Cognito

Amazon Cognito increased its service footprint to include the Canada (central) Region. It also published an SLA of 99.9% availability.

Amazon Aurora

Amazon Aurora Serverless increased performance visibility by publishing logs to Amazon CloudWatch.

AWS CodePipeline

CodePipeline

AWS CodePipeline announced support for deploying static files to Amazon S3. While this may not usually fall under serverless blogs and announcements, if you’re a developer who builds single-page applications or hosts static websites, this makes your life easier. Your static site can now be part of your CICD process without custom coding.

Serverless Posts

January:

February:

March:

Tech talks

We hold several AWS Online Tech Talks covering serverless tech talks throughout the year, so look out for them in the Serverless section of the AWS Online Tech Talks page. Here are the three tech talks that we delivered in Q1:

Whitepapers

Security Overview of AWS Lambda: This whitepaper presents a deep dive into the Lambda service through a security lens. It provides a well-rounded picture of the service, which can be useful for new adopters, as well as deepening understanding of Lambda for current users. Read the full whitepaper.

Twitch

AWS Launchpad Santa Clara

There is always something going on at our Twitch channel! Be sure to follow us so you don’t miss anything! For information about upcoming broadcasts and recent livestreams, keep an eye on AWS on Twitch for more Serverless videos and on the Join us on Twitch AWS page.

In other news

Building Happy Little APIs

Twitch Series: Building Happy Little APIs

In April, we started a 13-week deep dive into building APIs on AWS as part of our Twitch Build On series. The Building Happy Little APIs series covers the common and not-so-common use cases for APIs on AWS and the features available to customers as they look to build secure, scalable, efficient, and flexible APIs.

Twitch series: Build on Serverless: Season 2

Build On Serverless

Join Heitor Lessa across 14 weeks, nearly every Wednesday from April 24 – August 7 at 8AM PST/11AM EST/3PM UTC. Heitor is live-building a full-stack, serverless airline-booking application using a bunch of services: Lambda, Amplify, API Gateway, Amazon Cognito, AWS SAM, CloudWatch, AWS AppSync, and others. See the episode guide and sign up for stream reminders.

2019 AWS Summits

AWS Summit

The schedule is in full swing for the 2019 AWS Global Summits held in major cities around the world. These free events bring the cloud computing community together to connect, collaborate, and learn about AWS. They attract technologists from all industries and skill levels who want to discover how AWS can help them innovate quickly and deliver flexible, reliable solutions at scale. Get notified when to register and learn more at the AWS Global Summit Program website.

Still looking for more?

The Serverless landing page has lots of information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. Check it out!

from AWS Compute Blog

Upcoming updates to the AWS Lambda and AWS Lambda@Edge execution environment

AWS Lambda was first announced at AWS re:Invent 2014. Amazon CTO Werner Vogels highlighted the idea of needing to run no servers, no instances, nothing; you just write your code. In 2016, we announced the launch of Lambda@Edge, which lets you run Lambda functions to customize content that CloudFront delivers, executing the functions in AWS locations closer to the viewer.

At AWS, we talk often about “shared responsibility” models. Essentially, those are the places where there is a handoff between what we as a technology provider offer you and what you as the customer are responsible for. In the case of Lambda and Lambda@Edge, one of the key things that we manage is the “execution environment.” The execution environment is what your code runs inside of. It is composed of the underlying operating system, system packages, the runtime for your language (if a managed one), and common capabilities like environment variables. From the customer standpoint, your primary responsibility is for your application code and configuration.

In this post, we outline an upcoming change to the execution environment for Lambda and Lambda@Edge functions for all runtimes with the exception of Node.js v10. As with any update, some functionality could be affected. We encourage you to read through this post to understand the changes and any actions that you might need to take.

Update overview

AWS Lambda and AWS Lambda@Edge run on top of the Amazon Linux operating system distribution and maintain updates to both the core OS and managed language runtimes. We are updating the Lambda execution environment AMI to version 2018.03 of Amazon Linux. This newer AMI brings updates that offer improvements in capabilities, performance, security, and updated packages that your application code might interface with.

This does not apply to the recently announced Node.js v10 runtime which today runs on Amazon Linux 2.

The majority of functions will benefit seamlessly from the enhancements in this update without any action from you. However, in rare cases, package updates may introduce compatibility issues. Potentially impacted functions are those that contain libraries or application code compiled against specific underlying OS packages or other system libraries. If you are primarily using an AWS SDK or the AWS X-Ray SDK with no other dependencies, you will see no impact.

You have the following options in terms of next steps:

  • Take no action before the automatic update of the execution environment starting May 21 for all newly created/updated functions and June 11 for all existing functions.
  • Proactively test your functions against the new environment starting today.
  • Configure your functions to delay the execution environment update until June 18 to allow for a longer testing window.

In addition to the overall timeline for this change, this post also provides instructions on the following:

  • How to test your functions for this new execution environment locally and on Lambda/Lambda@Edge.
  • How to proactively update your functions.
  • How to extend the testing window by one week.

Update timeline

The following is the timeline for the update, which is broken up over four phases over the next several weeks:

May 14, 2019—Begin Testing: You can begin testing your functions for the new execution environment locally with AWS SAM CLI or using an Amazon EC2 instance running on Amazon Linux 2018.03. You can also proactively enable the new environment in AWS Lambda using the opt-in mechanism described later in this post.
May 21, 2019—Update/Create: All new function creates or function updates result in your functions running on the new execution environment.
June 11, 2019—General Update: Existing functions begin using the new execution environment on invoke unless they have a delayed-update layer configured.
June 18, 2019—Delayed Update End: All functions with a delayed-update layer configured start being migrated automatically.
June 24, 2019—Migration End: All functions have been migrated over to the new execution environment.

Recommended Approach

decision tree

You only have to act if your application uses dependencies that are compiled to work on the previous execution environment. Otherwise, you can continue to deploy new and updated Lambda functions without needing to perform any other testing steps. For those who aren’t sure if their functions use such dependencies, we encourage you to do a new deployment of your functions and to test their functionality.

There are two options for when you can start testing your functions on the new execution environment:

  • You can begin testing today using the opt-in mechanism described later.
  • Starting May 21, a new deploy or update of your functions uses the new execution environment.

If you confirm that your functions would be affected by the new execution environment, you can begin re-compiling or building your dependencies using the new reference AMI for the execution environment today and then repeat the testing. The final step is to redeploy your applications any time after May 21 to use the new execution environment.

Building your dependencies and application for the new execution environment

Because we are basing the environment off of an existing Amazon Linux AMI, you can start with building and testing your code against that AMI on EC2. With an updated EC2 instance running this AMI, you can compile and build your packages using your normal processes. For the list of AMI IDs in all public Regions, check the release notes. To start an EC2 instance running this AMI, follow the steps in the Launching an Instance Using the Launch Instance Wizard topic in the Amazon EC2 User Guide.
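As a minimal sketch, assuming a Python function with natively compiled dependencies (file names are assumptions), you would rebuild the package on an instance launched from that AMI so the binaries match the new execution environment:

pip install -r requirements.txt -t ./package       # compile/install dependencies against the 2018.03 environment
cd package && zip -r ../function.zip . && cd ..    # package the dependencies
zip -g function.zip lambda_function.py             # add your handler code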

Opt-in/Delayed-Update with Lambda layers

Some of you may want to begin testing as soon as you’ve read this announcement. Others know that they should postpone until later in the timeline.

To give you some control over testing, we’re releasing two special Lambda layers. Lambda layers can be used to provide shared resources, code, or data between Lambda functions and can simplify the deployment and update process. These layers don’t actually contain any data or code. Instead, they act as a special flag to Lambda to run your function executions either specifically on the new or old execution environment.

The Opt-In layer allows you to start testing today. You can use the Delayed-Update layer when you know that you must make updates to your function or its configuration after May 21, but aren’t ready to deploy to the new execution environment. The Delayed-Update layer extends the initial period available to you to deploy your functions by one week until the end of June 17, without changing the execution environment.

Neither layer brings any performance or runtime changes beyond this. After June 24, the layers will have no functionality. In a future deployment, you should remove them from any function configurations.

The ARNs for the two scenarios:

  • To OPT-IN to the update to the new execution environment, add the following layer:

arn:aws:lambda:::awslayer:AmazonLinux1803

  • To DELAY THE UPDATE to the new execution environment until June 18, add the following layer:

arn:aws:lambda:::awslayer:AmazonLinux1703

The action for adding a layer to your existing functions requires an update to the Lambda function’s configuration. You can do this with the AWS CLI, AWS CloudFormation or AWS SAM, popular third-party frameworks, the AWS Management Console, or an AWS SDK.
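For example, here is a minimal sketch with the AWS CLI (the function name is an assumption); note that --layers replaces the function’s existing layer list, so include any layers you already use:

aws lambda update-function-configuration \
  --function-name my-function \
  --layers arn:aws:lambda:::awslayer:AmazonLinux1803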

Validating your functions

There are several ways for you to test your function code and assure that it will work after the execution environment has been updated.

Local testing

We’re providing an update to the AWS SAM CLI to enable you to test your functions locally against this new execution environment. The AWS SAM CLI uses a Docker image that mirrors the live Lambda environment locally wherever you do development. To test against this new update, make sure that you have the most recent update to AWS SAM CLI version 0.16.0. You also should have an AWS SAM template configured for your function.

  1. Install or update the AWS SAM CLI:
    $ pip install --upgrade aws-sam-cli

    -Or-

    $ pip install aws-sam-cli
  2. Confirm that you have a valid AWS SAM template:
    $ sam validate -t <template file name>

    If you don’t have a valid AWS SAM template, you can begin with a basic template to test your functions. The following example represents the basic needs for running your function against a variety of event payloads. The Runtime value must be listed in the AWS Lambda Runtimes topic.

    AWSTemplateFormatVersion: 2010-09-09
    Transform: 'AWS::Serverless-2016-10-31'
    
    Resources:
      myFunction:
        Type: 'AWS::Serverless::Function'
        Properties:
          CodeUri: ./ 
          Handler: YOUR_HANDLER
          Runtime: YOUR_RUNTIME
  3. With a valid template, you can begin testing your function with mock event payloads. To generate a mock event payload, you can use the AWS SAM CLI local generate-event command. Here is an example of that command being run to generate an Amazon S3 notification type of event:
    sam local generate-event s3 put --bucket munns-test --key somephoto.jpeg
    {
      "Records": [
        {
          "eventVersion": "2.0", 
          "eventTime": "1970-01-01T00:00:00.000Z", 
          "requestParameters": {
            "sourceIPAddress": "127.0.0.1"
          }, 
          "s3": {
            "configurationId": "testConfigRule", 
            "object": {
              "eTag": "0123456789abcdef0123456789abcdef", 
              "sequencer": "0A1B2C3D4E5F678901", 
              "key": "somephoto.jpeg", 
              "size": 1024
            }, 
            "bucket": {
              "arn": "arn:aws:s3:::munns-test", 
              "name": "munns-test", 
              "ownerIdentity": {
                "principalId": "EXAMPLE"
              }
            }, 
            "s3SchemaVersion": "1.0"
          }, 
          "responseElements": {
            "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH", 
            "x-amz-request-id": "EXAMPLE123456789"
          }, 
          "awsRegion": "us-east-1", 
          "eventName": "ObjectCreated:Put", 
          "userIdentity": {
            "principalId": "EXAMPLE"
          }, 
          "eventSource": "aws:s3"
        }
      ]
    }

    You can then use the AWS SAM CLI local invoke command and pipe in the output from the previous command. Or, you can save the output from the previous command to a file and then pass in a reference to the file’s name and location with the -e flag. Here is an example of the pipe event method:

    sam local generate-event s3 put --bucket munns-test --key somephoto.jpeg | sam local invoke myFunction
    2019-02-19 18:45:53 Reading invoke payload from stdin (you can also pass it from file with --event)
    2019-02-19 18:45:53 Found credentials in shared credentials file: ~/.aws/credentials
    2019-02-19 18:45:53 Invoking index.handler (python2.7)
    
    Fetching lambci/lambda:python2.7 Docker container image......
    2019-02-19 18:45:53 Mounting /home/ec2-user/environment/forblog as /var/task:ro inside runtime container
    START RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34 Version: $LATEST
    {"Records": [{"eventVersion": "2.0", "eventTime": "1970-01-01T00:00:00.000Z", "requestParameters": {"sourceIPAddress": "127.0.0.1"}, "s3": {"configurationId": "testConfigRule", "object": {"eTag": "0123456789abcdef0123456789abcdef", "key": "somephoto.jpeg", "sequencer": "0A1B2C3D4E5F678901", "size": 1024}, "bucket": {"ownerIdentity": {"principalId": "EXAMPLE"}, "name": "munns-test", "arn": "arn:aws:s3:::munns-test"}, "s3SchemaVersion": "1.0"}, "responseElements": {"x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH", "x-amz-request-id": "EXAMPLE123456789"}, "awsRegion": "us-east-1", "eventName": "ObjectCreated:Put", "userIdentity": {"principalId": "EXAMPLE"}, "eventSource": "aws:s3"}]}
    END RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34
    REPORT RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34 Duration: 1 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 14 MB
    
    "Success! Parsed Events"

    You can see the full output of your function in the logs that follow the invoke command. In this example, the Python function prints out the event payload and then exits.

With the AWS SAM CLI, you can pass in valid test payloads that interface with data in other AWS services. You can also have your Lambda function talk to other AWS resources that exist in your account, for example Amazon DynamoDB tables, Amazon S3 buckets, and so on. You could also test an API interface using the local start-api command, provided that you have configured your AWS SAM template with events of the API type. Follow the full instructions for setting up and configuring the AWS SAM CLI in Installing the AWS SAM CLI. Find the full syntax guide for AWS SAM templates in the AWS Serverless Application Model documentation.
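
For example, if your AWS SAM template declares an event of the Api type on the function, a quick local check might look like the following sketch (the /hello path is an assumption for illustration):

sam local start-api
# In a second terminal, call the locally hosted endpoint:
curl http://127.0.0.1:3000/hello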

Testing in the Lambda console

After you have deployed your functions, either after the start of the Update/Create phase or with the Opt-In layer added, you can test them in the Lambda console.

  1. In the Lambda console, select the function to test.
  2. Select a test event and choose Test.
  3. If no test event exists, choose Configure test events.
    1. Choose Event template and select the relevant invocation service from which to test.
    2. Name the test event.
    3. Modify the event payload for your specific function.
    4. Choose Create and then return to step 2.

The results from the test are displayed.
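
If you prefer to script this check, you can also invoke the deployed function from the AWS CLI. The following is a sketch that assumes a function named my-function and a saved test event in event.json:

aws lambda invoke \
    --function-name my-function \
    --payload file://event.json \
    response.json
cat response.json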

Conclusion

With Lambda and Lambda@Edge, AWS allows developers to focus on application code without having to think about the work involved in managing the servers that run that code. We believe that the mechanisms provided and processes described in this post allow you to easily test and update your functions for this new execution environment.

Some of you may have questions about this process, and we are ready to help you. Please contact us through AWS Support, the AWS forums, or AWS account teams.


Creating an AWS Batch environment for mixed CPU and GPU genomics workflows

This post is courtesy of Lee Pang – AWS Technical Business Development 

I recently worked with a customer who needed to process a bunch of raw sequence files (FastQs) into Hi-C format (*.hic), which is used for the structural analysis of DNA/chromatin loops and sequence accessibility. The tooling they were interested in using was the Juicer suite and they needed a minimal workflow:

  • Align the sample to the reference using the juicer CLI utility.
  • Annotate loops using the HiCCUPS algorithm from the juicer-tools library.

Because they had many files to process, they wanted to do this as scalably as possible. The juicer step of the workflow was CPU and memory-intensive, while the HiCCUPS step needed GPU acceleration. So, they were interested in using AWS Step Functions and AWS Batch.

Since its launch, AWS Batch has made it possible to create scalable compute environments for processing a mixture of CPU- and memory-intensive jobs. This covers the needs of the majority of genomics workflows. So, how do you create a genomics workflow environment using AWS Batch that also includes GPUs? Thankfully, AWS Batch recently announced support for GPU resources!

In this post, I show you how to use these new features to execute mixed CPU and GPU genomics workflows. By the end of this post, you will be able to build the architecture shown below.

Configuring AWS Batch for running CPU and GPU jobs

To handle a mixture of CPU and GPU jobs, the recommended strategy is to create multiple compute environments and job queues:

  • GPU-only resources
    • Compute environments (using the ECS GPU Optimized AMI)
    • Spot Instances of the P2 and P3 instance family
    • On-Demand Instances of the P2 and P3 instance family
    • Job queue for GPU compute environments
  • CPU-only resources
    • Compute environments (using the default ECS Optimized AMI)
    • Spot Instances of the “optimal” instance family
    • On-Demand Instances of the “optimal” instance family
    • Job queue for CPU compute environments

With the above in place, you then point CPU jobs to the “CPU” queue and GPU jobs to the “GPU” queue.
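
For example, submitting a standalone job to the GPU queue with the AWS CLI looks like the following sketch. The queue and job definition are created later in this post, and the job name is an arbitrary placeholder:

aws batch submit-job \
    --job-name hiccups-test \
    --job-queue gpu \
    --job-definition <job-definition-name>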

Notice that CPU and GPU resources are kept separate with queues and compute environments. I don’t recommend creating a compute environment or job queue that mixes CPU and GPU optimized instance types. In a mixed compute environment or queue, there is a chance that CPU jobs could be placed on GPU instances when no GPU jobs are scheduled. This could result in few (or no) GPU instances available when GPU jobs must be run.

Create a GPU compute environment

Amazon EC2 has a wide variety of instance types, including the P3 family of instances that enable GPU-accelerated computing. Previously, the best option for using GPUs in AWS Batch was to create a compute environment based on the publicly available Deep Learning AMI used by AI/ML services such as Amazon SageMaker, which supports running containers and comes with NVIDIA/CUDA drivers pre-installed.

Earlier this year, the Amazon ECS team announced the availability of the ECS GPU-optimized AMI. It’s essentially the same as the existing Amazon ECS-optimized Amazon Linux 2 AMI but with pre-installed capabilities to provide Docker containers with access to GPU acceleration. For more information about what’s included, see Amazon ECS-optimized AMIs.

The key point is that this is a much more lightweight solution, and it makes creating GPU-specific AWS Batch compute environments much easier.

Creating an AWS Batch compute environment specifically for GPU jobs is similar to creating one for CPU jobs. The key difference is that you select only GPU instance families for the instance types.

For compute environments that use the P2, P3, and P3dn instance families, AWS Batch automatically associates the ECS GPU-optimized AMI. You don’t have to create a custom AMI for GPU jobs to run on these instances.

The G2 and G3 families use a different type of GPU and have different drivers. Compute environments that use G2 and G3 families need a custom AMI to take advantage of acceleration. Otherwise, they default to the ECS-optimized AMI.

To create a GPU-enabled compute environment with the AWS CLI, create a file called gpu-ce.json with the following contents:

{
    "computeEnvironmentName": "gpu",
    "type": "MANAGED",
    "state": "ENABLED",
    "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole",
    "computeResources": {
        "type": "EC2",
        "subnets": [
            "<subnet-id-1>",
            "<subnet-id-2>",
            "<subnet-id-3>"
        ],
        "tags": {
            "Name": "batch-gpu-worker"
        },
        "desiredvCpus": 0,
        "minvCpus": 0,
        "instanceTypes": [
            "p3",
            "p2"
        ],
        "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
        "maxvCpus": 256,
        "securityGroupIds": [
            "<security-group-1>",
            "<security-group-2>"
        ],
        "ec2KeyPair": "<keypair-name>"
    }
}

From the command line, run the following:

aws batch create-compute-environment --cli-input-json file://gpu-ce.json

Create a GPU job queue

When you have a GPU compute environment, you can associate it with a dedicated GPU job queue.

From the command line, run the following:

aws batch create-job-queue \
    --job-queue-name gpu \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order order=1,computeEnvironment=gpu
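
The cpu job queue referenced later in this post is created the same way. The following sketch assumes that you have already created a corresponding CPU compute environment named cpu:

aws batch create-job-queue \
    --job-queue-name cpu \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order order=1,computeEnvironment=cpu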

Specifying GPU resources in AWS Batch jobs

Creating job definitions in AWS Batch has not changed much. However, now there’s an additional field under Resource requirements that lets you specify how many GPUs the job should use.

The JSON for registering a job definition with GPU requirements looks like the following:

{
    "jobDefinitionName": "hiccups", 
    "type": "container", 
    "parameters": {
        "OutputS3Prefix": "s3://<bucket-name>/juicer/HIC003", 
        "InputHICS3Path": "s3://<bucket-name>/juicer/HIC003/aligned/inter.hic"
    }, 
    "containerProperties": {
        "mountPoints": [], 
        "image": "<docker-image-repository>/juicer-tools:latest", 
        "environment": [], 
        "vcpus": 8, 
        "command": [
            "hiccups", 
            "Ref::InputHICS3Path", 
            "Ref::OutputS3Prefix"
        ], 

        /* BEGIN NEW STUFF (delete comment before use) */
        "resourceRequirements" : [
            {
                "type" : "GPU",
                "value" : "1"
            }
        ],
        /* END NEW STUFF (delete comment before use) */
        
        "volumes": [], 
        "memory": 60000, 
        "ulimits": []
    }
}
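
After removing the comment markers, you can save the JSON to a file and register the job definition with the AWS CLI. The file name here is an assumption:

aws batch register-job-definition --cli-input-json file://hiccups-job-definition.json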

Containerization considerations

To ensure that your containerized task can use GPU acceleration, use nvidia/cuda base images when building the container image for your job. For example, for a CentOS-based image with CUDA 9.2, your Dockerfile should have the following:

FROM nvidia/cuda:9.2-devel-centos7

This could be at the top if you are building the container entirely from scratch, or later if you are using a multi-stage build.

In this case, I already had a CentOS-based image for the juicer utility that I could recycle for juicer-tools. So my Dockerfile looked like the following:

FROM juicer AS base
FROM nvidia/cuda:9.2-devel-centos7

COPY --from=base /opt/juicer /opt/juicer

RUN yum install -y awscli
RUN yum install -y java-1.8.0-openjdk

WORKDIR /opt/juicer/scripts
COPY juicer-tools.aws.sh .

WORKDIR /opt/juicer/work
ENTRYPOINT ["/opt/juicer/scripts/juicer-tools.aws.sh"]
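
To make the image available to AWS Batch, build it and push it to Amazon ECR. The following is a sketch that assumes an existing ECR repository named juicer-tools and uses the get-login command available in the AWS CLI at the time of writing:

$(aws ecr get-login --no-include-email --region <region>)
docker build -t juicer-tools .
docker tag juicer-tools:latest <account-id>.dkr.ecr.<region>.amazonaws.com/juicer-tools:latest
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/juicer-tools:latest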

Running a mixed CPU / GPU workflow

To run a workflow that contains a mixture of CPU– and GPU-based jobs, you can use AWS Step Functions. This blog channel previously covered how Step Functions and AWS Batch can be combined to run genomics workflows on AWS. Also, Step Functions and AWS Batch are now more tightly integrated, making scalable workflow solutions much easier to build.

With all the AWS Batch resources in place, it is a matter of pointing each task at the job queue with the right compute resources.

The solution I put together for the customer with the Juicer suite resulted in the following state machine:

{
    "Comment": "State machine for FASTQ to HIC with annotation using juicer and hiccups",
    "StartAt": "JuicerTask",
    "States": {
        "JuicerTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.juicer.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer:2",
                "JobName": "juicer",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/cpu",
                "Parameters.$": "$.juicer.parameters"
            },
            "Next": "HiccupsTask"
        },
        "HiccupsTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.hiccups.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer-tools:2",
                "JobName": "hiccups",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/gpu",
                "Parameters.$": "$.hiccups.parameters"
            },
            "End": true
        }
    }
}

JuicerTask is submitted to the cpu job queue, while HiccupsTask is submitted to the gpu job queue.
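
Each run of the workflow is then started with an input document that carries the parameters for both jobs. The following is a sketch; workflow-input.json is an assumed file containing juicer.parameters and hiccups.parameters objects that match the Ref:: parameters of the two job definitions:

aws stepfunctions start-execution \
    --state-machine-arn <state-machine-arn> \
    --input file://workflow-input.json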

Conclusion

With the ECS GPU-optimized AMI and AWS Batch support for GPU resources, you can easily build a scalable solution for running genomics workflows with both CPU and GPU resources.

Build on!


Running the most reliable choice for Windows workloads: Windows on AWS

Some of you may not know, but AWS began supporting Microsoft Windows workloads on AWS in 2008—over 11 years ago. Year over year, we have released exciting new services and enhancements based on feedback from customers like you. AWS License Manager and Amazon CloudWatch Application Insights for .NET and SQL Server are just some of the recent examples. The rate and pace of innovation is eye-popping.

In addition to innovation, one of the key areas that companies value is the reliability of the cloud platform. I recently chatted with David Sheehan, DevOps engineer at eMarketer. He told me, “Our move from Azure to AWS improved the performance and reliability of our microservices in addition to significant cost savings.” If a healthcare clinic can’t connect to the internet, then it’s possible that they can’t deliver care to their patients. If a bank can’t process transactions because of an outage, they could lose business.

In 2018, the next-largest cloud provider had almost 7x more downtime hours than AWS, according to data pulled directly from the public service health dashboards of the major cloud providers. It is one reason why companies like Edwards Lifesciences chose AWS. They are a global leader in patient-focused medical innovations for structural heart disease, as well as critical care and surgical monitoring. Rajeev Bhardwaj, the senior director for Enterprise Technology, recently told me, “We chose AWS for our data center workloads, including Windows, based on our assessment of the security, reliability, and performance of the platform.”

There are several reasons why AWS delivers a more reliable platform for Microsoft workloads, but I would like to focus on two here: designing for reliability and scaling within a Region.

Reason #1—It’s designed for reliability

AWS has significantly better reliability than the next largest cloud provider, due to our fundamentally better global infrastructure design based on Regions and Availability Zones. The AWS Cloud spans 64 zones within 21 geographic Regions around the world. We’ve announced plans for 12 more zones and four more Regions in Bahrain, Cape Town, Jakarta, and Milan.

Look at networking capabilities across five key areas: security, global coverage, performance, manageability, and availability. AWS has made deep investments in each of these areas over the past 12 years. We want to ensure that AWS has the networking capabilities required to run the world’s most demanding workloads.

There is no compression algorithm for experience. From running the most extensive, reliable, and secure global cloud infrastructure technology platform, we’ve learned that you care about the availability and performance of your applications. You want to deploy applications across multiple zones in the same Region for fault tolerance and latency.

I want to take a moment to emphasize that our approach to building our network is fundamentally different from that of our competitors, and that difference matters. Each of our Regions is fully isolated from all other Regions. Unlike virtually every other cloud provider, each AWS Region has multiple zones and data centers. Each zone is a fully isolated partition of our infrastructure that contains at a minimum two and up to eight separate data centers.

The zones are connected to each other with fast, private fiber-optic networking, enabling you to easily architect applications that automatically fail over between zones without interruption. With their own power infrastructure, the zones are physically separated by a meaningful distance, many kilometers, from any other zone. You can partition applications across multiple zones in the same Region to better isolate any issues and achieve high availability.

The AWS control plane (including APIs) and AWS Management Console are distributed across AWS Regions. They use a Multi-AZ architecture within each Region to deliver resilience and ensure continuous availability. This ensures that you avoid having a critical service dependency on a single data center.

While other cloud vendors claim to have Availability Zones, they do not have the same stringent requirements for isolation between zones, leading to impact across multiple zones. Furthermore, AWS has more zones and more Regions with support for multiple zones than any other cloud provider. This design is why the next largest cloud provider had almost 7x more downtime hours in 2018 than AWS.

Reason #2—Scale within a Region

We also designed our services into smaller cells that scale out within a Region, as opposed to a single-Region instance that scales up. This approach reduces the blast radius when there is a cell-level failure. It is why AWS—unlike other providers—has never experienced a network event spanning multiple Regions.

AWS also provides the most detailed information on service availability via the Service Health Dashboard, including Regions affected, services impacted, and downtime duration. AWS keeps a running log of all service interruptions for the past year. Finally, you can subscribe to an RSS feed to be notified of interruptions to each individual service.

Reliability matters

Running Windows workloads on AWS means that you not only get the most innovative cloud, but also the most reliable cloud.

For example, Mary Kay is one of the world’s leading direct sellers of skin care products and cosmetics. They have tens of thousands of employees and beauty consultants working outside the office, so the IT system is fundamental for the success of their company.

Mary Kay used Availability Zones and Microsoft Active Directory to architect their applications on AWS. AWS Managed Microsoft AD provided the features that enabled Mary Kay to deploy SQL Server Always On availability groups on Amazon EC2 Windows. This configuration gave Mary Kay the control to scale their deployment out to meet their performance requirements. They were able to deploy the service in multiple Regions to support users worldwide. Their on-premises users get the same experience when using Active Directory–aware services, either on-premises or in the AWS Cloud.

Now, with our cross-account and cross-VPC support, Mary Kay is looking at reducing their managed Active Directory infrastructure footprint, saving money and reducing complexity. But this identity management system must be reliable and scalable as well as innovative.

Fugro is a Dutch multinational public company headquartered in the Netherlands. They provide geotechnical, survey, subsea, and geoscience services for clients, typically oil and gas, telecommunications cable, and infrastructure companies. Fugro leverages the cloud to support the delivery of geo-intelligence and asset management services for clients globally in industries including onshore and offshore energy, renewables, power, and construction.

As I was chatting with Scott Carpenter, the global cloud architect for Fugro, he said, “Fugro is also now in the process of migrating a complex ESRI ArcGIS environment from an existing cloud provider to AWS. It is going to centralize and accelerate access from existing AWS hosted datasets, while still providing flexibility and interoperability to external and third-party data sources. The ArcGIS migration is driven by a focus on providing the highest level of operational excellence.”

With AWS, you don’t have to be concerned about reliability. AWS has the reliability and scale that drives innovation for Windows applications running in the cloud. And the reliability that makes AWS best for your Windows applications also makes it the best cloud for all your applications.

Let AWS help you assess how your company can get the most out of the cloud. Join all the AWS customers that trust us to run their most important applications in the best cloud. To have us create an assessment for your Windows applications or all your applications, email us at [email protected].


Enabling DNS resolution for Amazon EKS cluster endpoints

This post is contributed by Jeremy Cowan – Sr. Container Specialist Solution Architect, AWS

By default, when you create an Amazon EKS cluster, the Kubernetes cluster endpoint is public. While it is accessible from the internet, access to the Kubernetes cluster endpoint is restricted by AWS Identity and Access Management (IAM) and Kubernetes role-based access control (RBAC) policies.

At some point, you may need to configure the Kubernetes cluster endpoint to be private.  Changing your Kubernetes cluster endpoint access from public to private completely disables public access such that it can no longer be accessed from the internet.

In fact, a cluster that has been configured to only allow private access can only be accessed from the following:

  • The VPC where the worker nodes reside
  • Networks that have been peered with that VPC
  • A network that has been connected to AWS through AWS Direct Connect (DX) or a virtual private network (VPN)
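
You can switch the endpoint from public to private in the EKS console or with the AWS CLI. The following is a sketch, assuming a cluster named <cluster_name>:

aws eks update-cluster-config \
    --name <cluster_name> \
    --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true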

However, the name of the Kubernetes cluster endpoint is only resolvable from the worker node VPC, for the following reasons:

  • The Amazon Route 53 private hosted zone that is created for the endpoint is only associated with the worker node VPC.
  • The private hosted zone is created in a separate AWS managed account and cannot be altered.

For more information, see Working with Private Hosted Zones.

This post explains how to use Route 53 inbound and outbound endpoints to resolve the name of the cluster endpoints when a request originates outside the worker node VPC.

Route 53 inbound and outbound endpoints

Route 53 inbound and outbound endpoints allow you to simplify the configuration of hybrid DNS.  DNS queries for AWS resources are resolved by Route 53 resolvers and DNS queries for on-premises resources are forwarded to an on-premises DNS resolver. However, you can also use these Route 53 endpoints to resolve the names of endpoints that are only resolvable from within a specific VPC, like the EKS cluster endpoint.

The following steps describe how the solution works:

  • A Route 53 inbound endpoint is created in each worker node VPC and associated with a security group that allows inbound DNS requests from external subnets/CIDR ranges.
  • If the requests for the Kubernetes cluster endpoint originate from a peered VPC, those requests must be routed through a Route 53 outbound endpoint.
  • The outbound endpoint, like the inbound endpoint, is associated with a security group that allows inbound requests that originate from the peered VPC or from other VPCs in the Region.
  • A forwarding rule is created for each Kubernetes cluster endpoint.  This rule routes the request through the outbound endpoint to the IP addresses of the inbound endpoints in the worker node VPC, where it is resolved by Route 53.
  • The results of the DNS query for the Kubernetes cluster endpoint are then returned to the requestor.

If the request originates from an on-premises environment, you forego creating the outbound endpoints. Instead, you create a forwarding rule to forward requests for the Kubernetes cluster endpoint to the IP address of the Route 53 inbound endpoints in the worker node VPC.

Solution overview

For this solution, follow these steps:

  • Create an inbound endpoint in the worker node VPC.
  • Create an outbound endpoint in a peered VPC.
  • Create a forwarding rule for the outbound endpoint that sends requests to the Route 53 resolver for the worker node VPC.
  • Create a security group rule to allow inbound traffic from a peered network.
  • (Optional) Create a forwarding rule in your on-premises DNS for the Kubernetes cluster endpoint.

Prerequisites

EKS requires that you enable DNS hostnames and DNS resolution in each worker node VPC when you change the cluster endpoint access from public to private. It is also a prerequisite for this solution and for all solutions that use Route 53 private hosted zones.

In addition, you need a route that connects your on-premises network or VPC with the worker node VPC.  In a multi-VPC environment, this can be accomplished by creating a peering connection between two or more VPCs and updating the route table in those VPCs. If you’re connecting from an on-premises environment across a DX or an IPsec VPN, you need a route to the worker node VPC.
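
If these attributes are not already enabled, you can set them with the AWS CLI. This is a sketch; each attribute must be modified with a separate call:

aws ec2 modify-vpc-attribute --vpc-id <worker-node-vpc-id> --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id <worker-node-vpc-id> --enable-dns-hostnames "{\"Value\":true}"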

Configuring the inbound endpoint

When you provision an EKS cluster, EKS automatically provisions two or more cross-account elastic network interfaces onto two different subnets in your worker node VPC.  These network interfaces are primarily used when the control plane must initiate a connection with your worker nodes, for example, when you use kubectl exec or kubectl proxy. However, they can also be used by the workers to communicate with the Kubernetes API server.

When you change the EKS endpoint access to private, EKS associates a Route 53 private hosted zone with your worker node VPC.  Within this private hosted zone, EKS creates resource records for the cluster endpoint. These records correspond to the IP addresses of the two cross-account elastic network interfaces that were created in your VPC when you provisioned your cluster.

When the IP addresses of these cross-account elastic network interfaces change, for example, when EKS replaces unhealthy control plane nodes, the resource records for the cluster endpoint are automatically updated. This allows your worker nodes to continue communicating with the cluster endpoint when you switch to private access.  If you update the cluster to enable public access and disable private access, your worker nodes revert to using the public Kubernetes cluster endpoint.

By creating a Route 53 inbound endpoint in the worker node VPC, you allow DNS queries from outside the VPC to be sent to the VPC DNS resolver of the worker node VPC, which is capable of resolving the cluster endpoint.

Create an inbound endpoint in the worker node VPC

  1. In the Route 53 console, choose Inbound endpoints, Create Inbound endpoint.
  2. For Endpoint Name, enter a value such as <cluster_name>InboundEndpoint.
  3. For VPC in the Region, choose the VPC ID of the worker node VPC.
  4. For Security group for this endpoint, choose a security group that allows clients or applications from other networks to access this endpoint. For an example, see the Route 53 resolver diagram shown earlier in the post.
  5. Under the IP addresses section, choose an Availability Zone that corresponds to a subnet in your VPC.
  6. For IP address, choose Use an IP address that is selected automatically.
  7. Repeat steps 5 and 6 for the second IP address.
  8. Choose Submit.

Or, run the following AWS CLI command:

export DATE=$(date +%s)
export INBOUND_RESOLVER_ID=$(aws route53resolver create-resolver-endpoint --name \
<name> --direction INBOUND --creator-request-id $DATE --security-group-ids <sgs> \
--ip-addresses SubnetId=<subnetId>,Ip=<IP address> SubnetId=<subnetId>,Ip=<IP address> \
| jq -r .ResolverEndpoint.Id)
aws route53resolver list-resolver-endpoint-ip-addresses --resolver-endpoint-id \
$INBOUND_RESOLVER_ID | jq .IpAddresses[].Ip

This outputs the IP addresses assigned to the inbound endpoint.

When you are done creating the inbound endpoint, select the endpoint from the console and choose View details.  This shows you a summary of the configuration for the endpoint.  Record the two IP addresses that were assigned to the inbound endpoint, as you need them later when configuring the forwarding rule.

Connecting from a peered VPC

An outbound endpoint is used to send DNS requests that cannot be resolved “locally” to an external resolver based on a set of rules.

If you are connecting to the EKS cluster from a peered VPC, create an outbound endpoint and forwarding rule in that VPC or expose an outbound endpoint from another VPC. For more information, see Forwarding Outbound DNS Queries to Your Network.

Create an outbound endpoint

  1. In the Route 53 console, choose Outbound endpoints, Create outbound endpoint.
  2. For Endpoint name, enter a value such as <cluster_name>OutboundEndpoint.
  3. For VPC in the Region, select the VPC ID of the VPC where you want to create the outbound endpoint, for example the peered VPC.
  4. For Security group for this endpoint, choose a security group that allows clients and applications from this or other network VPCs to access this endpoint. For an example, see the Route 53 resolver diagram shown earlier in the post.
  5. Under the IP addresses section, choose an Availability Zone that corresponds to a subnet in the peered VPC.
  6. For IP address, choose Use an IP address that is selected automatically.
  7. Repeat steps 5 and 6 for the second IP address.
  8. Choose Submit.

Or, run the following AWS CLI command:

export DATE=$(date +%s)
export OUTBOUND_RESOLVER_ID=$(aws route53resolver create-resolver-endpoint --name \
<name> --direction OUTBOUND --creator-request-id $DATE --security-group-ids <sgs> \
--ip-addresses SubnetId=<subnetId>,Ip=<IP address> SubnetId=<subnetId>,Ip=<IP address> \
| jq -r .ResolverEndpoint.Id)
aws route53resolver list-resolver-endpoint-ip-addresses --resolver-endpoint-id \
$OUTBOUND_RESOLVER_ID | jq .IpAddresses[].Ip

This outputs the IP addresses that get assigned to the outbound endpoint.

Create a forwarding rule for the cluster endpoint

A forwarding rule is used to send DNS requests that cannot be resolved by the local resolver to another DNS resolver.  For this solution to work, create a forwarding rule for each cluster endpoint to resolve through the outbound endpoint. For more information, see Values That You Specify When You Create or Edit Rules.

  1. In the Route 53 console, choose Rules, Create rule.
  2. Give your rule a name, such as <cluster_name>Rule.
  3. For Rule type, choose Forward.
  4. For Domain name, type the name of the cluster endpoint for your EKS cluster.
  5. For VPCs that use this rule, select all of the VPCs to which this rule should apply.  If you have multiple VPCs that must access the cluster endpoint, include them in the list of VPCs.
  6. For Outbound endpoint, select the outbound endpoint to use to send DNS requests to the inbound endpoint of the worker node VPC.
  7. Under the Target IP addresses section, enter the IP addresses of the inbound endpoint that corresponds to the EKS endpoint that you entered in the Domain name field.
  8. Choose Submit.

Or, run the following AWS CLI command:

export DATE=$(date +%s)
aws route53resolver create-resolver-rule --name <name> --rule-type FORWARD \
--creator-request-id $DATE --domain-name <cluster_endpoint> --target-ips \
Ip=<IP of inbound endpoint>,Port=53 --resolver-endpoint-id <Id of outbound endpoint>
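
The console workflow associates the rule with the VPCs that you select in step 5. When you create the rule with the CLI, associate it with each VPC that must resolve the cluster endpoint in a separate call, for example:

aws route53resolver associate-resolver-rule \
    --resolver-rule-id <resolver-rule-id> \
    --vpc-id <peered-vpc-id>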

Accessing the cluster endpoint

After creating the inbound and outbound endpoints and the DNS forwarding rule, you should be able to resolve the name of the cluster endpoints from the peered VPC.

$ dig 9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com 

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.58.amzn1 <<>> 9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7168
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com. IN A
;; ANSWER SECTION:
9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com. 60 IN A 192.168.109.77
9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com. 60 IN A 192.168.74.42
;; Query time: 12 msec
;; SERVER: 172.16.0.2#53(172.16.0.2)
;; WHEN: Mon Apr 8 22:39:05 2019
;; MSG SIZE rcvd: 114

Before you can access the cluster endpoint, you must add the IP address range of the peered VPCs to the EKS control plane security group. For more information, see Tutorial: Creating a VPC with Public and Private Subnets for Your Amazon EKS Cluster.

Add a rule to the EKS cluster control plane security group

  1. In the EC2 console, choose Security Groups.
  2. Find the security group associated with the EKS cluster control plane.  If you used eksctl to provision your cluster, the security group is named as follows: eksctl-<cluster_name>-cluster/ControlPlaneSecurityGroup.
  3. Add a rule that allows port 443 inbound from the CIDR range of the peered VPC.
  4. Choose Save.
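
You can also add this rule from the command line. The following is a sketch, assuming the control plane security group ID and the CIDR range of the peered VPC:

aws ec2 authorize-security-group-ingress \
    --group-id <control-plane-security-group-id> \
    --protocol tcp \
    --port 443 \
    --cidr <peered-vpc-cidr>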

Run kubectl

With the proper security group rule in place, you should now be able to issue kubectl commands from a machine in the peered VPC against the cluster endpoint.

$ kubectl get nodes
NAME                             STATUS    ROLES     AGE       VERSION
ip-192-168-18-187.ec2.internal   Ready     <none>    22d       v1.11.5
ip-192-168-61-233.ec2.internal   Ready     <none>    22d       v1.11.5

Connecting from an on-premises environment

To manage your EKS cluster from your on-premises environment, configure a forwarding rule in your on-premises DNS to forward DNS queries to the inbound endpoint of the worker node VPCs. I’ve provided brief descriptions for how to do this for BIND, dnsmasq, and Windows DNS below.

Create a forwarding zone in BIND for the cluster endpoint

Add the following to the BIND configuration file:

zone "<cluster endpoint FQDN>" {
    type forward;
    forwarders { <inbound endpoint IP #1>; <inbound endpoint IP #2>; };
};

Create a forwarding zone in dnsmasq for the cluster endpoint

If you’re using dnsmasq, add the --server=/<cluster endpoint FQDN>/<inbound endpoint IP> flag to the startup options.

Create a forwarding zone in Windows DNS for the cluster endpoint

If you’re using Windows DNS, create a conditional forwarder.  Use the cluster endpoint FQDN for the DNS domain and the IPs of the inbound endpoints for the IP addresses of the servers to which to forward the requests.

Add a security group rule to the cluster control plane

Follow the steps in Adding A Rule To The EKS Cluster Control Plane Security Group. This time, use the CIDR of your on-premises network instead of the peered VPC.

Conclusion

When you configure the EKS cluster endpoint to be private only, its name can only be resolved from the worker node VPC. To manage the cluster from another VPC or your on-premises network, you can use the solution outlined in this post to create an inbound resolver for the worker node VPC.

This inbound endpoint is a feature that allows your DNS resolvers to easily resolve domain names for AWS resources. That includes the private hosted zone that gets associated with your VPC when you make the EKS cluster endpoint private. For more information, see Resolving DNS Queries Between VPCs and Your Network.  As always, I welcome your feedback about this solution.
