Tag: AWS Open Source Blog

Why Does AWS Contribute to Open Source? The Firecracker Example


Open source has long lived by the credo that “Every good work of software starts by scratching a developer’s personal itch.” At AWS, however, we’re not content with simply writing good software: we write software to meet customer needs. Over 90% of what we build is driven by customer demand, and the rest comes from intuiting customer needs and innovating on their behalf. This holds true in open source, and means that we’re very deliberate in how and where we invest in open source, as the Firecracker example helps illustrate.

Why Firecracker?

AWS has driven the cloud market in response to customer desires to stop managing servers and instead buy infrastructure as a service. More recently, to deliver further value to customers, we pioneered the concept of serverless, freeing customers from having to provision their infrastructure while providing greater security. As Amazon CTO Werner Vogels recently wrote,

[W]e anticipate that there will soon be a whole generation of developers who have never touched a server and only write business logic. The reason is simple. Whether you’re building net new applications or migrating legacy, using serverless primitives for compute, data, and integration enables you to benefit from the most agility that the cloud has to offer.

Firecracker is one way to help enable this customer innovation. Prior to Firecracker, customers told us that existing container security boundaries didn’t offer sufficient isolation between their applications when all containers must use a shared operating system kernel. We set out to work on Firecracker to offer best-in-class security as well as improve the performance and resource efficiency of customers’ container applications.

Open sourcing Firecracker made strong customer sense for at least two reasons.

First, when innovating on behalf of our customers in the context of open-source technology like Linux, it’s much easier to do this if we, too, are open. The Linux kernel runs inside Firecracker. To ensure this continues to work well for customers, we needed Linux kernel developers to have full visibility into how Firecracker works, including the special virtual devices it provides, and its somewhat unique environment (e.g., no BIOS, no ACPI tables, special “keyboard” with only one button). Making Firecracker open source reduces friction when submitting patches, among other benefits. It’s good for community and good for customers.

Second, in our quest for increasing the velocity of innovation for our customers, we believed open source gave us a more efficient way to collaborate with our community and realize compounded product gains from customer contributions over time. No, I don’t believe open source always accomplishes this. But in the case of Firecracker, and the need to fill a big void in the container ecosystem, open source was deemed our best way to meet customer needs.

More to come

Open source has always been a critical part of AWS’ strategy. Indeed, as longtime open source advocate (and senior AWS engineer) Wilson stresses, AWS has long been “a natural place to build and run open source software.” This started with Amazon.com, AWS’ customer zero. As Wilson puts it, “You could say that the shape of the first services that AWS launched back in 2006 was deeply formed from Amazon’s own experience building with open source software. Which is why open source worked so well in the cloud: the cloud was built for running it!”

Say that again? “AWS has been a huge boost for free and open source software, making it possible for many companies and communities to exist that invest in open source.” I definitely saw this through a variety of open source companies for which I’ve worked since 2002, companies like MongoDB, Alfresco, and Nodeable that ran some or all of their business on AWS.

And going forward?

In serving AWS customers, open source will play an increasingly large role. As AWS open source chief Adrian Cockcroft suggests, “Customers use more open source and care more about contributions now than they did a few years ago, which is the underlying reason why AWS is building more open source based products and contributing more now.” No, we will not engage in open source to score brownie points with analysts or pundits. As Wilson summarizes, “[T]he foundation of the case to open source [Firecracker wa]s because it ultimately benefits our customers.”

That is the constituency we focus on serving. And that is why you can expect more, not less, open source from AWS. Not because it’s popular but because it increasingly helps AWS to serve our customers.

To learn more about Firecracker, please see the launch announcement, as well as the more recent Firecracker open source update. Interested? Please contribute by submitting a pull request on GitHub.

from AWS Open Source Blog

EKS Support for the EBS CSI Driver


Today, we are announcing EKS support for the EBS Container Storage Interface driver, an initiative to create unified storage interfaces between container orchestrators such as Kubernetes and storage vendors like AWS.

A History of Storage in Kubernetes

As originally conceived, containers were a great fit for stateless applications. However, there was no provision for persistent storage, without which stateful workloads were not possible. On the other hand, many applications weren’t designed with containerization in mind, so to support migrating all types of applications to the cloud, container orchestrators built out support for storage, making stateful apps possible.

Kubernetes first introduced support for stateful workloads with in-tree volume plugins, meaning that the plugin code was part of the core Kubernetes code and shipped with the Kubernetes binaries. This model proved challenging: vendors wanting to add support for their storage systems to Kubernetes — or even just fix a bug in an existing volume plugin — were forced to align with the Kubernetes release process. This problem led to the development of the Container Storage Interface (CSI), a standard for exposing arbitrary block and file storage systems to containerized workloads on container orchestration systems like Kubernetes. CSI support was introduced as alpha in Kubernetes v1.9, moved to beta in Kubernetes v1.10, and went GA in Kubernetes v1.13. The CSI specification enables the container orchestrator and the storage provider to evolve independently in a modular way. By using a CSI driver, you benefit from the decoupling between the Kubernetes upstream release cycle and the CSI driver release cycle. Users can upgrade to the latest driver without waiting for new Kubernetes version releases.
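The decoupling is visible in the provisioner field of a storage class: the in-tree EBS plugin is addressed by a name that ships with Kubernetes itself, while the CSI driver registers under its own name. A minimal sketch for comparison (the class names here are illustrative):

```yaml
# In-tree plugin: volume code ships with the Kubernetes binaries
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-intree
provisioner: kubernetes.io/aws-ebs
---
# CSI driver: volume code releases and upgrades independently of Kubernetes
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-csi
provisioner: ebs.csi.aws.com
```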

Amazon Elastic Block Store (EBS)

Amazon EBS is a cloud block storage service that provides direct access from an EC2 instance to a dedicated storage volume. Support for EBS initially launched as an in-tree volume plugin in Kubernetes. When the CSI specification was published, we started developing a compatible driver for Amazon EBS. We made the EBS CSI Driver available as open source on GitHub as part of kubernetes-sigs. The driver has been available to use for self-managed Kubernetes installations on AWS since Kubernetes version 1.12 (driver version 0.2 using CSI 0.3 spec) and version 1.13 (driver version 0.3 using CSI 1.0 spec). However, prior to version 1.14, certain features such as CSINodeInfo and CSIBlockVolume were Kubernetes alpha features and thus not supported by Amazon EKS. With EKS support for Kubernetes version 1.14, users can now install the EBS CSI driver (v0.4.0 or greater) to their EKS 1.14 clusters.

The EBS CSI Driver in Action

The EBS CSI driver implements the basic operations required by the CSI specification to provision, attach, and mount a volume into a Pod, and the reverse operations to remove it. The existing in-tree workflow of provisioning storage for a pod is unchanged; all you need to do is define a new storage class that uses the CSI driver as a provisioner. The following is an example storage class manifest file:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer

The CSI driver supports all the storage class parameters supported by the current in-tree EBS volume driver. The parameters include type, csi.storage.k8s.io/fsType, iopsPerGB, encrypted, and kmsKeyId. Note that zone/zones is not included because it was deprecated in Kubernetes v1.12 in favor of allowedTopologies. If you are building a multi-zone application that requires provisioning volumes in different availability zones for a Pod to access, volume scheduling can be turned on by setting volumeBindingMode to WaitForFirstConsumer.
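Putting a few of these parameters together, a storage class that provisions encrypted io1 volumes might look like the following (a sketch; the class name and values are illustrative):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-sc-io1
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: io1
  iopsPerGB: "50"
  encrypted: "true"
  csi.storage.k8s.io/fsType: ext4
```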

There are four steps to using the EBS CSI driver for your Kubernetes cluster:

  1. Grant proper permission to worker nodes
  2. Install driver: kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
  3. Create a storage class with a sample manifest file like the above one.
  4. Create a persistent volume claim or persistent volume and consume the volume using the same workflow you have been using with the in-tree volume plugin.
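Steps 3 and 4 together might look like the following (a sketch; the claim, pod, and image names are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: csi-sc
  resources:
    requests:
      storage: 4Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: centos
    command: ["sleep", "3600"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: csi-pvc
```

Because the storage class uses WaitForFirstConsumer, the volume is not provisioned until the pod is scheduled, ensuring it lands in the same availability zone as the node.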

For more detailed installation instructions, see the EKS documentation. For more details on the EBS CSI driver, please refer to the README on the aws-ebs-csi-driver GitHub page.


To connect to an EBS volume, an application running in a pod on EKS must be on a node in the same availability zone as the EBS volume. Currently, an EBS volume can only be attached to a single Kubernetes node at a time.

Migrating to the EBS CSI Driver

In Kubernetes 1.14, CSIMigrationAWS launched as a Kubernetes alpha feature and added functionality that enables shims and translation logic to route volume operations from the EBS in-tree plugin to the EBS CSI plugin. Given its alpha status, this feature isn’t yet available for production usage in EKS. We are making upstream contributions to Kubernetes in order to move the feature to beta status in a future release. In the meantime, the existing in-tree EBS plugin is still supported.

Future Work

Upcoming work on the EBS CSI plugin will focus on several topics including stabilizing alpha features (for example volume resizing and volume snapshot), getting the driver ready for migration from the in-tree EBS plugin, adding performance tests, and more. If you are interested in more details, our milestones can be found on GitHub. We’re looking forward to hearing your feedback, and any contributions are welcome!

Fábio Bertinatto

Fábio Bertinatto is a senior software engineer at Red Hat where he has been working on OpenShift and Kubernetes for over a year. He is a contributor to the Kubernetes SIG Storage and is one of the co-authors of the EBS CSI driver.


Building Spinnaker Features for Amazon ECS


Spinnaker project logo.

For the past year, AWS Container Services has been contributing to Amazon ECS support in Spinnaker, the popular cloud-based continuous delivery platform. Originally open sourced by Netflix in 2015, Spinnaker has become a compelling CI/CD solution for customers looking to standardize their deployment process across multiple platforms and integrate with existing tools like Jenkins or TravisCI.

In early 2018, deploying to Amazon ECS with Spinnaker was possible thanks to contributions from Lookout and other community members who enabled deployment of a container image to Amazon ECS through a pipeline. Later that year, the Amazon ECS team began exploring a variety of open source tools, including Spinnaker, to find where we could help improve open source integrations with ECS. We engaged with the Spinnaker community and began asking our customers what they wanted to see in the ECS integrations with Spinnaker. Customers told us they needed a more fully-featured experience, something closely resembling what they could do when deploying to Amazon ECS directly. For example, we heard from customers who were eager to deploy their Amazon ECS services on AWS Fargate, but couldn’t yet configure the required settings from their Spinnaker pipeline. Customers also wanted more flexibility in how they ran their service, such as without a load balancer, or using placement constraints – options which were either missing or used hard-coded values at the time.

An Amazon ECS team member, AWS Principal Software Engineer Clare Liguori, began tackling some of these gaps in August of 2018. Since then, we’ve submitted 50+ pull requests across five Spinnaker repositories, adding support for service features like AWS Fargate, service discovery with AWS Cloud Map, resource tagging, and task placement constraints.

We’ve also added support for all task definition fields and multi-container applications through the use of task definition artifacts. This feature is especially exciting because it gives customers complete control over the contents of their task definition and makes it easier to automatically deploy new configurations. For example, I can now store my task definition as a JSON file in GitHub and set up my Spinnaker pipeline to trigger on changes to that repository. This means that when I edit my task definition – say, to increase the memory limit for one of my containers – and push that change, my Spinnaker pipeline automatically kicks off a new deployment to apply that change to my ECS service! I can also define up to 10 containers in my task definition vs. being limited to one in the pipeline configuration.
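Such a task definition artifact is just a regular ECS task definition in JSON form. A trimmed, hypothetical example with two containers (family, image, and container names are illustrative):

```json
{
  "family": "my-app",
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "myrepo/web:latest",
      "memory": 512,
      "essential": true
    },
    {
      "name": "sidecar",
      "image": "myrepo/metrics-agent:latest",
      "memory": 128,
      "essential": false
    }
  ]
}
```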

Using a file to store your task definition also means that the time between new Amazon ECS features being launched and your ability to use them in Spinnaker is greatly reduced. Prior to this work, only about 25% of the 100+ available task definition attributes were configurable in Spinnaker, with the potential for new attributes to be added as new features were released. Individually building support for each new and existing field would put substantial time and engineering effort between customers and their ability to use these features. By storing and retrieving task definitions as files, all Spinnaker maintainers have to do is bump the AWS SDK version, and customers can start adding the new features to their task definition and deploying them to their Amazon ECS services.

Thanks to these contributions, customers can now leverage all supported task definition attributes, adopt new features faster, run their Amazon ECS services on either EC2 or AWS Fargate, and take advantage of numerous service options, all within their existing Spinnaker pipeline.

Current work

We’re still working to bring all Amazon ECS service features into Spinnaker, including the recently-launched support for multiple target groups per service, and existing scheduling options like daemon sets and custom placement strategies.

But we’ve also realized that there’s more work to do outside of just adding missing service features. As more customers try out the ECS provider in Spinnaker, we’ve realized that we need to support larger accounts, with more resources, and address intermittent problems such as race conditions and caching inefficiencies, which have a bigger impact as your services grow. So we’ll be investing in performance enhancements to make it easier for Amazon ECS customers to scale up their Spinnaker-managed services along with the demands on their applications.

Finally, collaborating with customers and the larger Spinnaker community will continue to be essential. To make sure our work aligns with the priorities of the community, we have partnered with Armory and Netflix to form the Spinnaker AWS Special Interest Group (SIG). The goal of the AWS SIG is to share updates on in-progress work and provide another channel for the community to ask questions, give feedback, and see new feature demos. The AWS SIG meets monthly on Google Hangouts and anyone is welcome to join! Our team will use these meetings as a forum to discuss the Amazon ECS on Spinnaker roadmap and prioritize contributions based on customer needs.

Join us!

Amazon ECS contributions to Spinnaker are 100% driven by customers. If you’re already using Spinnaker to deploy to Amazon ECS and would like to collaborate with us on new features, please join us on Slack #ecs and tell us what you’re working on! We also welcome feature requests, bug reports, or other feedback via GitHub issues. Big thanks to all the Spinnaker community members who have already helped make the Amazon ECS provider what it is today!

If you’re new to Spinnaker or ECS and are interested in trying it out, take a look at the AWS Quickstart guide and Amazon ECS setup documentation to get started.


Introducing Fine-Grained IAM Roles for Service Accounts


Here at AWS we focus first and foremost on customer needs. In the context of access control in Amazon EKS, you asked in issue #23 of our public container roadmap for fine-grained IAM roles in EKS. To address this need, the community came up with a number of open source solutions, such as kube2iam, kiam, and Zalando’s IAM controller – which is a great development, allowing everyone to better understand the requirements and also the limitations of different approaches.

Now it’s time for an integrated, end-to-end solution that is flexible and easy to use. Our primary goal was to provide fine-grained roles at the pod level rather than the node level. The solution we came up with is also open source, so you can use it with Amazon EKS when provisioning a cluster with eksctl, where we are taking care of the setup, or you can use it with other Kubernetes DIY approaches, such as the popular kops setup.

Access control: IAM and RBAC

In Kubernetes on AWS, there are two complementary access control regimes at work. AWS Identity and Access Management (IAM) allows you to assign permissions to AWS services: for example, an app can access an S3 bucket. In the context of Kubernetes, the complementary system to define permissions towards Kubernetes resources is Kubernetes Role-based Access Control (RBAC). A complete end-to-end example might look as follows (we covered this in a previous post on Centralized Container Logging with Fluent Bit, where we introduced Fluent Bit output plugins):


NOTE If you want to brush up your knowledge, check out the IAM and RBAC terminology resource page we put together for this purpose.

Through the Kubernetes RBAC settings in the pod-log-reader role, the Fluent Bit plugin has permission to read the logs of the NGINX pods. Because it is running on an EC2 instance with the AWS IAM role eksctl-fluent-bit-demo-nodegroup-ng-2fb6f1a-NodeInstanceRole-P6QXJ5EYS6, which has an inline policy attached, it is also allowed to write the log entries to a Kinesis Data Firehose delivery stream.
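On the Kubernetes side, the RBAC permissions in that example could be expressed roughly as follows (a sketch; the actual role in the referenced post may differ):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-log-reader
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
```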

As you can tell from the figure above, the problem is that all pods running on a Kubernetes node share the same set of permissions. This setup violates the principle of least privilege and gives attackers a much larger attack surface than necessary.

Can we do better? Yes, we can. The community developed tooling such as kiam and kube2iam to address this issue. Our approach, IAM Roles for Service Accounts (IRSA), however, is different: we made pods first class citizens in IAM. Rather than intercepting the requests to the EC2 metadata API to perform a call to the STS API to retrieve temporary credentials, we made changes in the AWS identity APIs to recognize Kubernetes pods. By combining an OpenID Connect (OIDC) identity provider and Kubernetes service account annotations, you can now use IAM roles at the pod level.

Drilling further down into our solution: OIDC federation access allows you to assume IAM roles via the Secure Token Service (STS), enabling authentication with an OIDC provider, receiving a JSON Web Token (JWT), which in turn can be used to assume an IAM role. Kubernetes, on the other hand, can issue so-called projected service account tokens, which happen to be valid OIDC JWTs for pods. Our setup equips each pod with a cryptographically-signed token that can be verified by STS against the OIDC provider of your choice to establish the pod’s identity. Additionally, we’ve updated AWS SDKs with a new credential provider that calls sts:AssumeRoleWithWebIdentity, exchanging the Kubernetes-issued OIDC token for AWS role credentials.

The resulting solution is now available in EKS, where we manage the control plane and run the webhook responsible for injecting the necessary environment variables and projected volume. The solution is also available in a DIY Kubernetes setup on AWS; more on that option can be found below.

To benefit from the new IRSA feature, the necessary high-level steps are:

  1. Create a cluster with eksctl and OIDC provider setup enabled.
  2. Create an IAM role defining access to the target AWS services, for example S3, and annotate a service account with said IAM role.
  3. Finally, configure your pods to use the service account created in the previous step; they can then assume the IAM role.

Let’s now have a closer look at how exactly these steps look in the context of EKS. Here, we’ve taken care of all the heavy lifting, such as enabling IRSA or injecting the necessary token into the pod.

Setup with Amazon EKS and eksctl

So how do you use IAM Roles for Service Accounts (IRSA) in EKS? We tried to make it as simple as possible, so you can follow along in this generic walk-through. Further down, we provide a concrete end-to-end walk-through using an app that writes to S3.

1. Cluster and OIDC ID provider creation

First, create a new v1.13 EKS cluster using the following command:

$ eksctl create cluster irptest
[ℹ]  using region us-west-2

Now let’s set up the OIDC ID provider (IdP) in AWS:

Note: Make sure to use an eksctl version >= 0.5.0

$ eksctl utils associate-iam-oidc-provider \
               --name irptest \
               --approve
[ℹ]  using region us-west-2
[ℹ]  will create IAM Open ID Connect provider for cluster "irptest" in "us-west-2"
[✔]  created IAM Open ID Connect provider for cluster "irptest" in "us-west-2"

If you’re using eksctl, then you’re done and no further steps are required here. You can proceed directly to step 2.

Alternatively, for example when using CloudFormation for EKS cluster provisioning, you'll have to create the OIDC IdP yourself. Some systems, such as Terraform, provide first-class support for this; otherwise, you'll have to create the IdP manually as follows.

First, get the cluster’s identity issuer URL by executing:

$ ISSUER_URL=$(aws eks describe-cluster \
                       --name irptest \
                       --query cluster.identity.oidc.issuer \
                       --output text)

We capture the URL in an environment variable called ISSUER_URL since we will need it in the next step. Now, create the OIDC provider; the following shows this for the AWS (EKS) OIDC ID provider, but you can also use your own:

$ aws iam create-open-id-connect-provider \
          --url $ISSUER_URL \
          --thumbprint-list $ROOT_CA_FINGERPRINT \
          --client-id-list sts.amazonaws.com

NOTE How you obtain the ROOT_CA_FINGERPRINT is up to the OIDC provider; for AWS, see Obtaining the Root CA Thumbprint for an OpenID Connect Identity Provider in the docs.

2. Kubernetes service account and IAM role setup

Next, we create a Kubernetes service account and set up the IAM role that defines the access to the targeted services, such as S3 or DynamoDB. For this, implicitly, we also need to have an IAM trust policy in place, allowing the specified Kubernetes service account to assume the IAM role. The following command does all these steps conveniently at once:

$ eksctl create iamserviceaccount \
                --name my-serviceaccount \
                --namespace default \
                --cluster irptest \
                --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
                --approve
[ℹ]  1 task: { 2 sequential sub-tasks: { create addon stack "eksctl-irptest-addon-iamsa-default-my-serviceaccount", create ServiceAccount:default/my-serviceaccount } }
[ℹ]  deploying stack "eksctl-irptest-addon-iamsa-default-my-serviceaccount"
[✔]  create all roles and service account

Under the hood, the above command does two things:

    1. It creates an IAM role, something of the form eksctl-irptest-addon-iamsa-default-my-serviceaccount-Role1-U1Y90I1RCZWB and attaches the specified policy to it, in our case arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess.
    2. It creates a Kubernetes service account, my-serviceaccount here, and annotates the service account with said IAM role.

The following CLI command sequence is equivalent to the steps that eksctl create iamserviceaccount takes care of for you:

# STEP 1: create IAM role and attach the target policy:
$ ISSUER_HOSTPATH=$(echo $ISSUER_URL | cut -f 3- -d'/')
$ ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
$ PROVIDER_ARN="arn:aws:iam::$ACCOUNT_ID:oidc-provider/$ISSUER_HOSTPATH"
$ cat > irp-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "$PROVIDER_ARN"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${ISSUER_HOSTPATH}:sub": "system:serviceaccount:default:my-serviceaccount"
        }
      }
    }
  ]
}
EOF
$ ROLE_NAME=s3-reader
$ aws iam create-role \
          --role-name $ROLE_NAME \
          --assume-role-policy-document file://irp-trust-policy.json
$ aws iam update-assume-role-policy \
          --role-name $ROLE_NAME \
          --policy-document file://irp-trust-policy.json
$ aws iam attach-role-policy \
          --role-name $ROLE_NAME \
          --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
$ S3_ROLE_ARN=$(aws iam get-role \
                        --role-name $ROLE_NAME \
                        --query Role.Arn --output text)

# STEP 2: create Kubernetes service account and annotate it with the IAM role:
$ kubectl create sa my-serviceaccount
$ kubectl annotate sa my-serviceaccount eks.amazonaws.com/role-arn=$S3_ROLE_ARN

Now that we have the identity side of things covered, and have set up the service account and the IAM role, we can move on to using this setup in the context of a pod.

3. Pod setup

Remember that the service account is the identity of your app towards the Kubernetes API server, and the pod that hosts your app uses said service account.

In the previous step, we created a service account called my-serviceaccount, so let’s use that in a pod spec. The service account should look as follows (edited for readability):

$ kubectl get sa my-serviceaccount -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eksctl-irptest-addon-iamsa-default-my-serviceaccount-Role1-UCGG6NDYZ3UE
  name: my-serviceaccount
  namespace: default
secrets:
- name: my-serviceaccount-token-m5msn

Using serviceAccountName: my-serviceaccount in the deployment manifest, we can now make the pods it supervises use the service account we defined above. Looking at the deployment, you should find something like the following (edited for readability):

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: myapp
  name: myapp
spec:
  selector:
    matchLabels:
      run: myapp
  template:
    metadata:
      labels:
        run: myapp
    spec:
      serviceAccountName: my-serviceaccount
      containers:
      - image: myapp:1.2
        name: myapp

Now we can finally create the deployment with kubectl apply, and the resulting pods should look something like the following (again, edited for readability):

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  serviceAccountName: my-serviceaccount
  containers:
  - name: myapp
    image: myapp:1.2
    env:
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::123456789012:role/eksctl-irptest-addon-iamsa-default-my-serviceaccount-Role1-UCGG6NDYZ3UE
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    volumeMounts:
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token

In the above you can see that the mutating admission controller we run in EKS (via a webhook) automatically injected the environment variables AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE as well as the aws-iam-token volume. All you had to do was annotate the service account my-serviceaccount. Further, you can see that the temporary credentials from STS are by default valid for 86,400 seconds (i.e., 24h).

If you do not want the admission controller to modify your pods, you can manually add the environment variables AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN with the values of a projected service account token location and the role to assume. In addition, you will also need to add volume and volumeMounts parameters to the pod with a projected service account token; see the Kubernetes docs for reference.
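Done manually, those additions to the pod spec would look roughly like this fragment, mirroring what the webhook injects (the role ARN here is a placeholder):

```yaml
    env:
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::123456789012:role/my-role
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    volumeMounts:
    - name: aws-iam-token
      mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      readOnly: true
  volumes:
  - name: aws-iam-token
    projected:
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
```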

The final step necessary is that the pod, via its service account, assumes the IAM role. This works as follows: OIDC federation allows the user to assume IAM roles with the Secure Token Service (STS), effectively receiving a JSON Web Token (JWT) via an OAuth2 flow that can be used to assume an IAM role with an OIDC provider. In Kubernetes we then use projected service account tokens, which are valid OIDC JWTs, giving each pod a cryptographically-signed token which can be verified by STS against the OIDC provider for establishing identity. The AWS SDKs have been updated with a new credential provider that calls sts:AssumeRoleWithWebIdentity, exchanging the Kubernetes-issued OIDC token for AWS role credentials. For this feature to work correctly, you’ll need to use an SDK version greater than or equal to the values listed below:

In case you're not (yet) using one of the above SDK versions, or are not (yet) in a position to migrate, you can make your app IRSA-aware (in the pod) using the following recipe. As a prerequisite, you need the AWS CLI and jq installed, for example like so:

$ JQ=/usr/bin/jq && curl https://stedolan.github.io/jq/download/linux64/jq > $JQ && chmod +x $JQ
$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python get-pip.py
$ pip install awscli --upgrade

Now you can do the sts:AssumeRoleWithWebIdentity call manually:

$ aws sts assume-role-with-web-identity \
 --role-arn $AWS_ROLE_ARN \
 --role-session-name mh9test \
 --web-identity-token file://$AWS_WEB_IDENTITY_TOKEN_FILE \
 --duration-seconds 1000 > /tmp/irp-cred.txt
$ export AWS_ACCESS_KEY_ID="$(cat /tmp/irp-cred.txt | jq -r ".Credentials.AccessKeyId")"
$ export AWS_SECRET_ACCESS_KEY="$(cat /tmp/irp-cred.txt | jq -r ".Credentials.SecretAccessKey")"
$ export AWS_SESSION_TOKEN="$(cat /tmp/irp-cred.txt | jq -r ".Credentials.SessionToken")"
$ rm /tmp/irp-cred.txt

NOTE In the above case, the temporary STS credentials are valid for 1000 seconds, specified via --duration-seconds, and you’ll need to refresh them yourself. Also, note that the session name is arbitrary and each session is stateless and independent; that is: the token contains all the relevant data.

With the generic setup out of the way, let’s now have a look at a concrete end-to-end example, showing IRSA in action.

Example usage walkthrough

In this walkthrough we show you, end-to-end, how to use IRSA (IAM Roles for Service Accounts) for an app that takes input from stdin and writes the data to an S3 bucket, keyed by creation time.

To make the S3 Echoer demo app work on EKS, in a nutshell, we have to set up an IRSA-enabled cluster, create the S3 bucket, enable IRSA for the pod the app runs in, and then launch a pod that writes to the S3 bucket.

Let’s start with cloning the demo app repo into a local directory:

$ git clone https://github.com/mhausenblas/s3-echoer.git && cd s3-echoer

Next, we create the EKS cluster and enable IRSA in it:

$ eksctl create cluster --name s3echotest

$ eksctl utils associate-iam-oidc-provider --name s3echotest --approve

Now we define the necessary permissions for the app by creating an IAM role and annotating the service account that the pod will use with it:

$ eksctl create iamserviceaccount \
                --name s3-echoer \
                --cluster s3echotest \
                --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
                --approve

At this point we have all the pieces in place: now we create the target bucket we want to write to and launch the S3 Echoer app as a one-off Kubernetes job like so:

$ TARGET_BUCKET=irp-test-2019

$ aws s3api create-bucket \
            --bucket $TARGET_BUCKET \
            --create-bucket-configuration LocationConstraint=$(aws configure get region) \
            --region $(aws configure get region)

$ sed -e "s/TARGET_BUCKET/${TARGET_BUCKET}/g" s3-echoer-job.yaml.template > s3-echoer-job.yaml

$ kubectl apply -f s3-echoer-job.yaml

NOTE Make sure that you use a different value for $TARGET_BUCKET than shown here, since S3 bucket names must be globally unique.

Finally, to verify that the write to the bucket was successful, do the following:

$ aws s3api list-objects \
            --bucket $TARGET_BUCKET \
            --query 'Contents[].{Key: Key, Size: Size}'
[
    {
        "Key": "s3echoer-1565024447",
        "Size": 27
    }
]

Here’s how the different pieces from AWS IAM and Kubernetes all play together to realize IRSA in EKS (dotted lines are actions, solid ones are properties or relations):

There’s a lot going on in the figure above, so let’s take it step by step:

  1. When you launch the S3 Echoer app with kubectl apply -f s3-echoer-job.yaml, the YAML manifest is submitted to the API server with the Amazon EKS Pod Identity webhook configured, which is called in the mutating admission step.
  2. The Kubernetes job uses the service account s3-echoer, set via serviceAccountName.
  3. Because the service account has an eks.amazonaws.com/role-arn annotation, the webhook injects the necessary environment variables (AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE) and sets up the aws-iam-token projected volume in the pod that the job supervises.
  4. When the S3 Echoer app calls out to S3, attempting to write data into a bucket, the IRSA-enabled Go SDK we use here performs an sts:AssumeRoleWithWebIdentity call to assume the IAM role that has the arn:aws:iam::aws:policy/AmazonS3FullAccess managed policy attached. It receives temporary credentials that it uses to complete the S3 write operation.
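
To make step 3 concrete, here is a sketch of roughly what a pod spec looks like after the webhook has mutated it. The role ARN and container name are illustrative placeholders; the token path and audience follow the webhook’s documented defaults:

```yaml
# Excerpt of a pod spec after mutation by the Amazon EKS Pod Identity webhook
# (role ARN and container name are placeholders, not from the walkthrough)
spec:
  serviceAccountName: s3-echoer
  containers:
  - name: main
    env:
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::111122223333:role/s3-echoer-role   # placeholder
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    volumeMounts:
    - name: aws-iam-token
      mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      readOnly: true
  volumes:
  - name: aws-iam-token
    projected:
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
```

The AWS SDK’s web identity credential provider picks up the two environment variables automatically, so the application code needs no changes.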

If you want to explore the access control space yourself, learning how IAM roles, service accounts etc. are connected, you can use rbIAM, a tool we’ve written specifically for exploring the IAM/RBAC space in a unified manner. For example, for the S3 Echoer demo, an excerpt of rbIAM in action looks as follows:

That’s it! With the S3 Echoer app we’ve demonstrated how to use IRSA in EKS, and shown how the different entities in IAM and Kubernetes work together to realize IRSA. Don’t forget to clean up using kubectl delete job/s3-echoer.

Open source for the win: use with DIY Kubernetes on AWS

Now that you know how to use IRSA in EKS, you might be wondering if you can use it for a DIY Kubernetes on AWS, for example, if you’re managing your Kubernetes clusters with kops. We’ve open sourced our solution so you can, in addition to the managed solution with EKS, use it in your own setup: check out aws/amazon-eks-pod-identity-webhook, the Amazon EKS Pod Identity webhook, called by the API server in the mutating admission phase.

To start benefiting from IRSA in your own Kubernetes setup, follow the instructions in the Amazon EKS Pod Identity Webhook GitHub repo to set up the webhook and let us know via issues there how it goes.

Next steps

Given the demand, and because we’ve open sourced the necessary components, we’re naturally excited to share this with you and let you take it for a spin on your own clusters. We will continue to improve IRSA, addressing common asks from the community, including (but not limited to) support for cross-account roles, support for multiple profiles, and using tokens to talk to other systems, that is, non-AWS services, for example if you want to access Jenkins or Vault running in EKS.

Please do let us know if something doesn’t work the way you expect: leave a comment here, or open an issue on the AWS Containers Roadmap on GitHub.

from AWS Open Source Blog

Add Single Sign-On (SSO) to Open Distro for Elasticsearch Kibana Using SAML and Okta

Open Distro for Elasticsearch Security implements the web browser single sign-on (SSO) profile of the SAML 2.0 protocol. This enables you to configure federated access with any SAML 2.0 compliant identity provider (IdP). In a prior post, I discussed setting up SAML-based SSO using Microsoft Active Directory Federation Services (ADFS). In this post, I’ll cover the Okta-specific configuration.


User      Okta Group   Open Distro Security role
esuser1   ESAdmins     all_access
esuser2   ESUsers      readall
esuser3   N/A          N/A


Okta configuration

In your Okta account, click Applications -> Add Application -> Create New App.

add new application

In the next screen, choose Web app as type, SAML 2.0 as the authentication method, and click Create. In the next screen, type in an application name and click Next.

select integration type

In SAML settings, set Single sign on URL and the Audience URI (SP Entity ID). Enter the Kibana URL below as the Single sign on URL:

https://kibana_base_url:kibana_port/_opendistro/_security/saml/acs
Make sure to replace the kibana_base_url and kibana_port with your actual Kibana configuration as noted in the prerequisites. In my setup this is https://new-kibana.ad.example.com:5601/....

Add a string for the Audience URI. You can choose any name here. I used kibana-saml. You will use this name in the Elasticsearch Security plugin SAML config as the SP-entity-id.

saml settings

You will pass the user’s group memberships from Okta to Elasticsearch using Okta’s group attribute statements. Set the Name to “Roles”. The name you choose must match the roles_key defined in Open Distro Security’s configuration. Click Next and Finish.

roles to group mapping

On the Application Settings screen, click the Identity Provider metadata link to download the metadata XML file and copy it to the Elasticsearch config directory. Set the idp.metadata_file property in Open Distro Security’s config.yml file to the path of the XML file. The path must be specified relative to the config directory (you can also specify metadata_url instead of a file).

download idp metadata file

This metadata file contains the idp.entity_id.

metadata file showing entity id

To complete the configuration of Open Distro for Elasticsearch Security, refer to my prior post on adding single sign-on with ADFS. Follow the steps in that post to map Open Distro Security roles to Okta groups, update Open Distro Security configuration and Kibana configuration, and restart Kibana. My copy of the Security config file with Okta integration is as below:

      saml_auth_domain:
        http_enabled: true
        transport_enabled: true
        order: 1
        http_authenticator:
          type: saml
          challenge: true
          config:
            idp:
              metadata_file: okta-metadata.xml
              entity_id: http://www.okta.com/exksz5jfvfaUjGSuU356
            sp:
              entity_id: kibana-saml
            kibana_url: https://new-kibana.ad.example.com:5601/
            exchange_key: 'MIIDAzCCAeugAwIB...'
        authentication_backend:
          type: noop

Once you restart Kibana, you are ready to test the integration. You should observe the same behavior as covered in the ADFS post.

okta login screen


kibana esuser2 read screenshot


In this post, I covered SAML authentication for Kibana single sign-on with Okta. You can use a similar process to configure integration with any SAML 2.0 compliant Identity provider. Please refer to the Open Distro for Elasticsearch documentation for additional configuration options for Open Distro for Elasticsearch Security configuration with SAML.

Have an issue or a question? Want to contribute? You can get help and discuss Open Distro for Elasticsearch on our forums. You can file issues here.

Demystifying Elasticsearch Shard Allocation

At the core of Open Distro for Elasticsearch’s ability to provide a seamless scaling experience lies its ability to distribute its workload across machines. This is achieved via sharding. When you create an index, you set a primary and replica shard count for that index. Elasticsearch distributes your data and requests across those shards, and the shards across your data nodes.

The capacity and performance of your cluster depends critically on how Elasticsearch allocates shards on nodes. If all of your traffic goes to one or two nodes because they contain the active indexes in your cluster, those nodes will show high CPU, RAM, disk, and network use. You might have tens or hundreds of nodes in your cluster sitting idle while these few nodes melt down.

In this post, I will dig into Elasticsearch’s shard allocation strategy and discuss the reasons for “hot” nodes in your cluster. With this understanding, you can fix the root cause issues to achieve better performance and a more stable cluster.

Shard skew can result in cluster failure

In an optimal shard distribution, each machine has uniform resource utilization: every shard has the same storage footprint, every request is serviced by every shard, and every request uses CPU, RAM, disk, and network resources equally. As you scale vertically or horizontally, additional nodes contribute equally to performing the work of the cluster, increasing its capacity.

So much for the optimal case. In practice, you run more than one index in a cluster, the distribution of data is uneven, and requests are processed at different rates on different nodes. In a prior post, Jon Handler explained how storage usage can become skewed. When shard distribution is skewed, CPU, network, and disk bandwidth usage can also become skewed.

For example, let’s say you have a cluster with three indexes, each with four primary shards, deployed on six nodes as in the figure below. The shards for the square index have all landed on two nodes, while the circle and rounded-rectangle indexes are mixed on four nodes. If the square index is receiving ten times the traffic of the other two indexes, those nodes will likely need ten times the CPU, disk, network, and RAM of the other four nodes. You either need to overscale based on the requirements for the square index, or watch your cluster fall over if you have scaled for the other indexes.


Diagram of six elasticsearch nodes with three indexes showing uneven, skewed CPU, RAM, JVM, and I/O usage.

The correct allocation strategy should make intelligent decisions that respect system requirements. This is a difficult problem, and Elasticsearch does a good job of solving it. Let’s dive into Elasticsearch’s algorithm.

ShardsAllocator figures out where to place shards

The ShardsAllocator is an interface in Elasticsearch whose implementations are responsible for shard placement. When shards are unassigned for any reason, ShardsAllocator decides on which nodes in the cluster to place them.

ShardsAllocator engages to determine shard locations in the following conditions:

  • Index Creation – when you add an index to your cluster (or restore an index from snapshot), ShardsAllocator decides where to place its shards. When you increase replica count for an index, it decides locations for the new replica copies.
  • Node failure – if a node drops out of the cluster, ShardsAllocator figures out where to place the shards that were on that node.
  • Cluster resize – if nodes are added or removed from the cluster, ShardsAllocator decides how to rebalance the cluster.
  • Disk high water mark – when disk usage on a node hits the high water mark (90% full, by default), Elasticsearch engages ShardsAllocator to move shards off that node.
  • Manual shard routing – when you manually route shards, ShardsAllocator also moves other shards to ensure that the cluster stays balanced.
  • Routing-related setting updates — when you change cluster or index settings that affect shard routing, such as allocation awareness, excluding or including a node (by IP or node attribute), or filtering which nodes specific indexes may be placed on.

Shard placement strategy can be broken into two smaller subproblems: which shard to act on, and which target node to place it at. The default Elasticsearch implementation, BalancedShardsAllocator, divides its responsibilities into three major buckets: allocate unassigned shards, move shards, and rebalance shards. Each of these internally solves the primitive subproblems and decides an action for the shard: whether to allocate it on a specific node, move it from one node to another, or simply leave it as-is.

The overall placement operation, called reroute in Elasticsearch, is invoked when there are cluster state changes that can affect shard placement.

Node selection

Elasticsearch gets the list of eligible nodes by processing a series of Allocation Deciders. Node eligibility can vary depending on the shard and on the current allocations on the node. Not all nodes may be eligible to accept a particular shard. For example, Elasticsearch won’t put a replica shard on the same node as the primary. Or, if a node’s disk is full, Elasticsearch cannot place another shard on it.

Elasticsearch follows a greedy approach for shard placement: it makes locally optimal decisions, hoping to reach a global optimum. A node’s eligibility for a shard is abstracted out to a weight function; each shard is then allocated to the node that is currently most eligible to accept it. Think of this weight function as a mathematical function which, given some parameters, returns the weight of a shard on a node. The most eligible node for a shard is the one with minimum weight.
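
As an illustration of this greedy, minimum-weight selection, here is a hedged sketch (not the actual BalancedShardsAllocator code; the 0.45/0.55 factors are the defaults of the cluster.routing.allocation.balance.shard and .index settings):

```python
# Illustrative sketch of a weight-based node selection, not the actual
# Elasticsearch source. Lower weight = more eligible node.
SHARD_BALANCE = 0.45   # cluster.routing.allocation.balance.shard (default)
INDEX_BALANCE = 0.55   # cluster.routing.allocation.balance.index (default)

def weight(node_shards, node_index_shards, total_shards, index_shards, num_nodes):
    """Weight of a shard of a given index on a node; lower = more eligible."""
    theta0 = SHARD_BALANCE / (SHARD_BALANCE + INDEX_BALANCE)
    theta1 = INDEX_BALANCE / (SHARD_BALANCE + INDEX_BALANCE)
    avg_shards = total_shards / num_nodes        # average shards per node
    avg_index_shards = index_shards / num_nodes  # average for this one index
    return (theta0 * (node_shards - avg_shards)
            + theta1 * (node_index_shards - avg_index_shards))

def most_eligible(nodes, index, total_shards, index_shards):
    """Greedy step: pick the node with minimum weight for a shard of `index`.

    `nodes` maps node name -> {index name: shard count on that node}.
    """
    def node_weight(counts):
        return weight(sum(counts.values()), counts.get(index, 0),
                      total_shards, index_shards, len(nodes))
    return min(nodes, key=lambda name: node_weight(nodes[name]))
```

For example, with two shards of an index already crowded onto one node and an empty node in the cluster, the empty node wins the next placement.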


Allocate unassigned shards

The first operation that a reroute invocation undertakes is allocateUnassigned. Each time an index is created, its shards (both primary and replicas) are unassigned. When a node leaves the cluster, shards that were on that node are lost. For lost primary shards, their surviving replicas (if any) are promoted to primary (this is done by a different module), and the corresponding replicas are rendered unassigned. All of these are allocated to nodes in this operation.

For allocateUnassigned(), the BalancedShardsAllocator iterates through all unassigned shards, finds the subset of nodes eligible to accept the shard (Allocation Deciders), and out of these, picks the node with minimum weight.

There is a set order in which Elasticsearch picks unassigned shards for allocation. It picks primary shards first, allocating all shards for one index before moving on to the next index’s primaries. To choose indexes, it uses a comparator based on index priority, creation date, and index name (see PriorityComparator). This ensures that Elasticsearch assigns all primaries for as many indexes as possible, rather than creating several partially-assigned indexes. Once Elasticsearch has assigned all primaries, it moves to the first replica for each index. Then, it moves to the second replica for each index, and so on.
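
The ordering above can be sketched as follows. This is an illustration, not the actual PriorityComparator: index priority, creation date, and name are collapsed into a precomputed index_rank (lower rank is allocated earlier):

```python
# Illustrative sketch of the unassigned-shard allocation order.
def allocation_order(unassigned, index_rank):
    """Primaries first, index by index; then replicas in rounds (every index's
    first replica, then every index's second replica, and so on).

    unassigned: list of dicts like
        {"index": "logs", "shard": 0, "primary": True, "replica": 0}
    index_rank: dict mapping index name -> rank (lower = higher priority)
    """
    primaries = [s for s in unassigned if s["primary"]]
    replicas = [s for s in unassigned if not s["primary"]]
    primaries.sort(key=lambda s: (index_rank[s["index"]], s["shard"]))
    # Sorting replicas by their copy ordinal first produces the rounds:
    # round 1 holds every index's first replica, round 2 every second, ...
    replicas.sort(key=lambda s: (s["replica"], index_rank[s["index"]], s["shard"]))
    return primaries + replicas
```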

Move Shards

Consider a scenario where you are scaling down your cluster. Responding to seasonal variation in your workload, you have just passed a high-traffic season and are now back to moderate workloads. You want to right-size your cluster by removing some nodes. If you remove data-holding nodes too quickly, you might remove nodes that hold a primary and its replicas, permanently losing that data. A better approach is to exclude a subset of nodes, wait for all shards to move out, and then terminate them.

Or, consider a situation where a node has its disk full and some shards must be moved out to free up space. In such cases, a shard must be moved out of a node. This is handled by the moveShards() operation, triggered right after allocateUnassigned() completes.

For “move shards”, Elasticsearch iterates through each shard in the cluster, and checks whether it can remain on its current node. If not, it selects the node with minimum weight, from the subset of eligible nodes (filtered by deciders), as the target node for this shard. A shard relocation is then triggered from current node to target node.

The move operation only applies to STARTED shards; shards in any other state are skipped. To move shards uniformly from all nodes, moveShards uses a nodeInterleavedShardIterator. This iterator goes breadth first across nodes, picking one shard from each node, followed by the next shard, and so on. Thus all shards on all nodes are evaluated for move, without preferring one over the other.
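
That breadth-first interleaving can be sketched like this (an illustrative reimplementation for clarity, not the actual iterator source):

```python
from itertools import zip_longest

# Illustrative sketch of nodeInterleavedShardIterator's breadth-first order.
def node_interleaved(shards_by_node):
    """Yield the first shard of every node, then the second of every node,
    and so on, so no node's shards are all evaluated before another's."""
    for round_shards in zip_longest(*shards_by_node.values()):
        for shard in round_shards:
            if shard is not None:  # nodes with fewer shards run out earlier
                yield shard
```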

Rebalance shards

As you hit workload limits, you may decide to add more nodes to scale your cluster. Elasticsearch should automatically detect these nodes and relocate shards for better distribution. The addition or removal of nodes may not always require shard movement – what if the nodes had very few shards (say just one), and extra nodes were added only as a proactive scaling measure?

Elasticsearch generalizes this decision using the weight function abstraction in shard allocator. Given current allocations on a node, the weight function provides the weight of a shard on a node. Nodes with a high weight value are less suited to place the shard than nodes with a lower weight value. Comparing the weight of a shard on different nodes, we can decide if relocating can improve the overall weight distribution.

For rebalance decisions, Elasticsearch computes the weight for each index on every node, and the delta between the min and max possible weights for an index. (This can be done at the index level, since each shard in an index is treated equally in Elasticsearch.) Indexes are then processed in order, most unbalanced index first.

Shard movement is a heavy operation. Before actual relocation, Elasticsearch models shard weights pre- and post-rebalance; shards are relocated only if the operation leads to a more balanced distribution of weights.

Finally, rebalancing is an optimization problem. Beyond a threshold, the cost of moving shards begins to outweigh the benefits of balanced weights. In Elasticsearch, this threshold is currently a fixed value, configurable by the dynamic setting cluster.routing.allocation.balance.threshold. When the computed weight delta for an index — the difference between its min and max weights across nodes — is less than this threshold, the index is considered balanced.
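
As a rough illustration of this decision (with shard counts standing in for per-node weights; this is not the actual Elasticsearch implementation), the threshold check and the "only move if it helps" guard look like:

```python
# Simplified sketch of the rebalance threshold check. 1.0 mirrors the default
# of the cluster.routing.allocation.balance.threshold setting.
DEFAULT_THRESHOLD = 1.0

def needs_rebalance(index_weights, threshold=DEFAULT_THRESHOLD):
    """An index counts as balanced once max - min drops below the threshold."""
    return max(index_weights) - min(index_weights) >= threshold

def rebalance_step(index_weights, threshold=DEFAULT_THRESHOLD):
    """Model moving one shard from the heaviest node to the lightest node,
    keeping the move only if it strictly improves the balance."""
    if not needs_rebalance(index_weights, threshold):
        return index_weights
    proposed = list(index_weights)
    hi = proposed.index(max(proposed))
    lo = proposed.index(min(proposed))
    proposed[hi] -= 1
    proposed[lo] += 1
    # Relocation is expensive: keep the modeled move only if it helps.
    if max(proposed) - min(proposed) < max(index_weights) - min(index_weights):
        return proposed
    return index_weights
```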


In this post, we covered the algorithms that power shard placement and balancing decisions in Elasticsearch. Each reroute invocation goes through the process of allocating unassigned shards, moving shards that must be evacuated from their current nodes, and rebalancing shards wherever possible. Together, they maintain a stable balanced cluster.

In the next post, we will dive deep into the default weight function implementation, which is responsible for selecting one node over another for a given shard’s placement.

Using a Network Load Balancer with the NGINX Ingress Controller on EKS

Kubernetes Ingress is an API object that provides a collection of routing rules that govern how external/internal users access Kubernetes services running in a cluster. An ingress controller is responsible for reading the ingress resource information and processing it appropriately. As there are different ingress controllers that can do this job, it’s important to choose the right one for the type of traffic and load coming into your Kubernetes cluster. In this post, we will discuss how to use an NGINX ingress controller on Amazon EKS, and how to front it with a Network Load Balancer (NLB).

What is a Network Load Balancer?

An AWS Network Load Balancer functions at the fourth layer of the Open Systems Interconnection (OSI) model. It can handle millions of requests per second. After the load balancer receives a connection request, it selects a target from the target group for the default rule. It attempts to open a TCP connection to the selected target on the port specified in the listener configuration.

Exposing your application on Kubernetes

In Kubernetes, there are several different ways to expose your application; using Ingress to expose your service is one way of doing it. Ingress is not a service type, but it acts as the entry point for your cluster. It lets you consolidate your routing rules into a single resource, as it can expose multiple services under the same IP address.

This post will explain how to use an ingress resource and front it with a NLB (Network Load Balancer), with an example.

Ingress in Kubernetes

This image shows the flow of traffic from the outside world hitting the Ingress resource and being diverted to the required Kubernetes service, based on the path rules set up in the Ingress resource.


Kubernetes supports a high-level abstraction called Ingress, which allows simple host- or URL-based HTTP routing. An Ingress is a core concept (in beta) of Kubernetes. It is always implemented by a third-party proxy; these implementations are known as ingress controllers. An ingress controller is responsible for reading the ingress resource information and processing that data accordingly. Different ingress controllers have extended the specification in different ways to support additional use cases.

Typically, your Kubernetes services will impose additional requirements on your ingress. Examples of this include:

  • Content-based routing: e.g., routing based on HTTP method, request headers, or other properties of the specific request.
  • Resilience: e.g., rate limiting, timeouts.
  • Support for multiple protocols: e.g., WebSockets or gRPC.
  • Authentication.

An ingress controller is a daemon or deployment, deployed as a Kubernetes Pod, that watches the endpoint of the API server for updates to the Ingress resource. Its job is to satisfy requests for Ingresses. NGINX ingress is one such implementation. This blog post implements the ingress controller as a deployment with the default values. To suit your use case and for more availability, you can use it as a daemon or increase the replica count.

Why would I choose the NGINX ingress controller over the Application Load Balancer (ALB) ingress controller?

The ALB ingress controller is great, but there are certain use cases where the NLB with the NGINX ingress controller will be a better fit. I will discuss scenarios where you would need a NLB over the ALB later in this post, but first let’s discuss the ingress controllers.

By default, the NGINX Ingress controller will listen to all the ingress events from all the namespaces and add corresponding directives and rules into the NGINX configuration file. This makes it possible to use a centralized routing file which includes all the ingress rules, hosts, and paths.

With the NGINX Ingress controller you can also have multiple ingress objects for multiple environments or namespaces with the same network load balancer; with the ALB, each ingress object requires a new load balancer.

Furthermore, features like path-based routing can be added to the NLB when used with the NGINX ingress controller.

Why do I need a load balancer in front of an ingress?

Ingress is tightly integrated into Kubernetes, meaning that your existing workflows around kubectl will likely extend nicely to managing ingress. An Ingress controller does not typically eliminate the need for an external load balancer; it simply adds an additional layer of routing and control behind the load balancer.

Pods and nodes are not guaranteed to live for the whole lifetime that the user intends: pods are ephemeral and vulnerable to kill signals from Kubernetes during occasions such as:

  • Scaling.
  • Memory or CPU saturation.
  • Rescheduling for more efficient resource use.
  • Downtime due to outside factors.

The load balancer (Kubernetes service) is a construct that stands as a single, fixed-service endpoint for a given set of pods or worker nodes. To take advantage of the previously-discussed benefits of a Network Load Balancer (NLB), we create a Kubernetes service of type: LoadBalancer with the NLB annotations, and this load balancer sits in front of the ingress controller – which is itself a pod or a set of pods. In AWS, for a set of EC2 compute instances managed by an Autoscaling Group, there should be a load balancer that acts as both a fixed referable address and a load balancing mechanism.
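
To make this concrete, here is a minimal sketch of such a Service. The metadata names, selector, and ports are illustrative assumptions (the nlb-service.yaml applied later in this post plays this role); the annotation is the documented way to request an NLB instead of the default Classic Load Balancer:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx          # illustrative name
  namespace: ingress-nginx
  annotations:
    # Ask the AWS cloud provider for a Network Load Balancer
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx   # must match the controller pods
  ports:
  - name: http
    port: 80
    targetPort: http
  - name: https
    port: 443
    targetPort: https
```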

Ingress with load balancer


This image shows the flow of traffic from the outside world hitting the Network Load Balancer, then the Ingress resource, and being diverted to the required Kubernetes service based on the path rules set up in the Ingress resource.


The diagram above shows a Network Load Balancer in front of the Ingress resource. This load balancer will route traffic to a Kubernetes service (or Ingress) on your cluster that will perform service-specific routing. NLB with the Ingress definition provides the benefits of both a NLB and an Ingress resource.

What advantages does the NLB have over the Application Load Balancer (ALB)?

A Network Load Balancer is capable of handling millions of requests per second while maintaining ultra-low latencies, making it ideal for load balancing TCP traffic. NLB is optimized to handle sudden and volatile traffic patterns while using a single static IP address per Availability Zone. The benefits of using a NLB are:

  • Static IP/elastic IP addresses: For each Availability Zone (AZ) you enable on the NLB, you have a network interface. Each load balancer node in the AZ uses this network interface to get a static IP address. You can also use Elastic IP to assign a fixed IP address for each Availability Zone.
  • Scalability: Ability to handle volatile workloads and scale to millions of requests per second.
  • Zonal isolation: The Network Load Balancer can be used for application architectures within a Single Zone. Network Load Balancers attempt to route a series of requests from a particular source to targets in a single AZ while still providing automatic failover should those targets become unavailable.
  • Source/remote address preservation: With a Network Load Balancer, the original source IP address and source ports for the incoming connections remain unmodified. With Classic and Application load balancers, we had to use HTTP header X-Forwarded-For to get the remote IP address.
  • Long-lived TCP connections: Network Load Balancer supports long-running TCP connections that can be open for months or years, making it ideal for WebSocket-type applications, IoT, gaming, and messaging applications.
  • Reduced bandwidth usage: Most applications are bandwidth-bound and should see a cost reduction (for load balancing) of about 25% compared to Application or Classic Load Balancers.
  • SSL termination: SSL termination will need to happen at the backend, since SSL termination on NLB for Kubernetes is not yet available.

For any NLB usage, the backend security groups control access to the application (NLB does not have security groups of its own). The worker node security group handles the security for inbound/outbound traffic.

How to use a Network Load Balancer with the NGINX Ingress resource in Kubernetes

Start by creating the mandatory resources for NGINX Ingress in your cluster:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/mandatory.yaml

Create the NLB for the ingress controller:

kubectl apply -f https://raw.githubusercontent.com/cornellanthony/nlb-nginxIngress-eks/master/nlb-service.yaml

Now create two services (apple.yaml and banana.yaml) to demonstrate how the Ingress routes our request.  We’ll run two web applications that each output a slightly different response. Each of the files below has a service definition and a pod definition.

Create the resources:

$ kubectl apply -f https://raw.githubusercontent.com/cornellanthony/nlb-nginxIngress-eks/master/apple.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cornellanthony/nlb-nginxIngress-eks/master/banana.yaml

Defining the Ingress resource (with SSL termination) to route traffic to the services created above 

If you’ve purchased and configured a custom domain name for your server, you can use that certificate, otherwise you can still use SSL with a self-signed certificate for development and testing.

In this example, where we are terminating SSL on the backend, we will create a self-signed certificate.

Anytime we reference a TLS secret, we mean a PEM-encoded X.509, RSA (2048) secret. Now generate a self-signed certificate and private key with:

openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=anthonycornell.com/O=anthonycornell.com"

Then create the secret in the cluster:

kubectl create secret tls tls-secret --key tls.key --cert tls.crt

Now declare an Ingress to route requests to /apple to the first service, and requests to /banana to second service. Check out the Ingress’ rules field that declares how requests are passed along:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "false"
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  tls:
  - hosts:
    - anthonycornell.com
    secretName: tls-secret
  rules:
  - host: anthonycornell.com
    http:
      paths:
        - path: /apple
          backend:
            serviceName: apple-service
            servicePort: 5678
        - path: /banana
          backend:
            serviceName: banana-service
            servicePort: 5678

Create the Ingress in the cluster:

kubectl create -f https://raw.githubusercontent.com/cornellanthony/nlb-nginxIngress-eks/master/example-ingress.yaml

Set up Route 53 to have your domain pointed to the NLB (optional):

anthonycornell.com.  A  ALIAS abf3d14967d6511e9903d12aa583c79b-e3b2965682e9fbde.elb.us-east-1.amazonaws.com

Test your application:

curl  https://anthonycornell.com/banana -k
curl  https://anthonycornell.com/apple -k

Can I reuse a NLB with services running in different namespaces? In the same namespace?  

Install the NGINX ingress controller as explained above. In each of your namespaces, define an Ingress Resource.

Example for the test namespace:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: api-ingresse-test
  namespace: test
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: test.anthonycornell.com
    http:
      paths:
      - backend:
          serviceName: myApp
          servicePort: 80
        path: /

Suppose we have three namespaces – Test, Demo, and Staging. After creating the Ingress resource in each namespace, the NGINX ingress controller will process those resources as shown below:


This image shows traffic coming from three namespaces – Test, Demo, and Staging hitting the NLB and the NGINX ingress controller will process those requests to the respective namespace.



Delete the Ingress resource:

kubectl delete -f https://raw.githubusercontent.com/cornellanthony/nlb-nginxIngress-eks/master/example-ingress.yaml

Delete the services:

kubectl delete -f https://raw.githubusercontent.com/cornellanthony/nlb-nginxIngress-eks/master/apple.yaml 
kubectl delete -f https://raw.githubusercontent.com/cornellanthony/nlb-nginxIngress-eks/master/banana.yaml

Delete the NLB:

kubectl delete -f https://raw.githubusercontent.com/cornellanthony/nlb-nginxIngress-eks/master/nlb-service.yaml

Delete the NGINX ingress controller:

kubectl delete -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/mandatory.yaml

We hope this post was useful! Please let us know in the comments.

from AWS Open Source Blog

AWS API Gateway for HPC Job Submission

AWS API Gateway for HPC Job Submission

AWS ParallelCluster simplifies the creation and the deployment of HPC clusters. AWS API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale.

In this post we combine AWS ParallelCluster and AWS API Gateway to allow an HTTP interaction with the scheduler. You can submit, monitor, and terminate jobs using the API, instead of connecting to the master node via SSH. This makes it possible to integrate ParallelCluster programmatically with other applications running on premises or on AWS.

The API uses AWS Lambda and AWS Systems Manager to execute the user commands without granting direct SSH access to the nodes, thus enhancing the security of the whole cluster.

VPC configuration

The VPC used for this configuration can be created using the VPC Wizard. You can also use an existing VPC that respects the AWS ParallelCluster network requirements.


Launch VPC Wizard


In Select a VPC Configuration, choose VPC with Public and Private Subnets and then Select.


Select a VPC Configuration

Before starting the VPC Wizard, allocate an Elastic IP Address. This will be used to configure a NAT gateway for the private subnet. A NAT gateway is required to enable compute nodes in the AWS ParallelCluster private subnet to download the required packages and to access the AWS services public endpoints. See AWS ParallelCluster network requirements.

You can find more details about the VPC creation and configuration options in VPC with Public and Private Subnets (NAT).

The example below uses the following configuration:

IPv4 CIDR block:
VPC name: Cluster VPC
Public subnet’s IPv4 CIDR:
Availability Zone: eu-west-1a
Public subnet name: Public subnet
Private subnet’s IPv4 CIDR:
Availability Zone: eu-west-1b
Private subnet name: Private subnet
Elastic IP Allocation ID: <id of the allocated Elastic IP>
Enable DNS hostnames: yes

VPC with Public and Private Subnets

AWS ParallelCluster configuration

AWS ParallelCluster is an open source cluster management tool to deploy and manage HPC clusters in the AWS cloud; to get started, see Installing AWS ParallelCluster.

After the AWS ParallelCluster command line has been configured, create the cluster template file below in .parallelcluster/config. The master_subnet_id contains the id of the created public subnet and the compute_subnet_id contains the private one. The ec2_iam_role is the role that will be used for all the instances of the cluster. The steps to create this role will be explained below.

[aws]
aws_region_name = eu-west-1

[cluster slurm]
scheduler = slurm
compute_instance_type = c5.large
initial_queue_size = 2
max_queue_size = 10
maintain_initial_size = false
base_os = alinux
key_name = AWS_Ireland
vpc_settings = public
ec2_iam_role = parallelcluster-custom-role

[vpc public]
master_subnet_id = subnet-01fc20e143543f8af
compute_subnet_id = subnet-0b1ae2790497d83ec
vpc_id = vpc-0cdee679c5a6163bd

[global]
update_check = true
sanity_check = true
cluster_template = slurm

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

IAM custom Roles for SSM endpoints

To allow ParallelCluster nodes to call Lambda and SSM endpoints, it is necessary to configure a custom IAM Role.

See AWS Identity and Access Management Roles in AWS ParallelCluster for details on the default AWS ParallelCluster policy.

From the AWS console:

  • Access the AWS Identity and Access Management (IAM) service and click on Policies.
  • Choose Create policy and paste the following policy into the JSON section. Be sure to modify <REGION> and <AWS ACCOUNT ID> to match the values for your account, and also update the S3 bucket name from pcluster-scripts to the bucket you want to use to store the input/output data from jobs and save the output of SSM execution commands.
    "Version": "2012-10-17",
    "Statement": [
            "Resource": [
            "Action": [
            "Sid": "EC2",
            "Effect": "Allow"
            "Resource": [
            "Action": [
            "Sid": "DynamoDBList",
            "Effect": "Allow"
            "Resource": [
                "arn:aws:sqs:<REGION>:<AWS ACCOUNT ID>:parallelcluster-*"
            "Action": [
            "Sid": "SQSQueue",
            "Effect": "Allow"
            "Resource": [
            "Action": [
            "Sid": "Autoscaling",
            "Effect": "Allow"
            "Resource": [
                "arn:aws:dynamodb:<REGION>:<AWS ACCOUNT ID>:table/parallelcluster-*"
            "Action": [
            "Sid": "DynamoDBTable",
            "Effect": "Allow"
            "Resource": [
            "Action": [
            "Sid": "S3GetObj",
            "Effect": "Allow"
            "Resource": [
                "arn:aws:cloudformation:<REGION>:<AWS ACCOUNT ID>:stack/parallelcluster-*"
            "Action": [
            "Sid": "CloudFormationDescribe",
            "Effect": "Allow"
            "Resource": [
            "Action": [
            "Sid": "SQSList",
            "Effect": "Allow"
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Effect": "Allow",
            "Action": [
            "Resource": [

Choose Review policy and, in the next section, enter parallelcluster-custom-policy as the name and choose Create policy.

Now you can create the Role. Choose Role in the left menu and then Create role.

Select AWS service as type of trusted entity and EC2 as service that will use this role as shown here:


Create role


Choose Next: Permissions to proceed.

In the policy selection, select the parallelcluster-custom-policy that you just created.

Choose Next: Tags and then Next: Review.

In the Role name box, enter parallelcluster-custom-role and confirm by choosing Create role.

Slurm commands execution with AWS Lambda

AWS Lambda allows you to run your code without provisioning or managing servers. Lambda is used, in this solution, to execute the Slurm commands in the Master node. The AWS Lambda function can be created from the AWS console as explained in the Create a Lambda Function with the Console documentation.

For Function name, enter slurmAPI.

For Runtime, select Python 2.7.

Choose Create function to create it.


Create function

The code below should be pasted into the Function code section, which you can see by scrolling further down the page. The Lambda function uses AWS Systems Manager to execute the scheduler commands, preventing any SSH access to the node. Please modify <REGION> appropriately, and update the S3 bucket name from pcluster-data to the name you chose earlier.

import boto3
import time
import json
import random
import string

def lambda_handler(event, context):
    instance_id = event["queryStringParameters"]["instanceid"]
    selected_function = event["queryStringParameters"]["function"]
    if selected_function == 'list_jobs':
      command='squeue'
    elif selected_function == 'list_nodes':
      command='scontrol show nodes'
    elif selected_function == 'list_partitions':
      command='scontrol show partitions'
    elif selected_function == 'job_details':
      jobid = event["queryStringParameters"]["jobid"]
      command='scontrol show jobs %s'%jobid
    elif selected_function == 'submit_job':
      # Copy the job script from S3 to the master node and make it executable
      script_name = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(10)])
      jobscript_location = event["queryStringParameters"]["jobscript_location"]
      command = 'aws s3 cp s3://%s %s.sh; chmod +x %s.sh'%(jobscript_location,script_name,script_name)
      s3_tmp_out = execute_command(command,instance_id)
      # Optional submission parameters are passed in the "submitopts" header
      submitopts = ''
      try:
        submitopts = event["headers"]["submitopts"]
      except Exception as e:
        submitopts = ''
      command = 'sbatch %s %s.sh'%(submitopts,script_name)
    body = execute_command(command,instance_id)
    return {
        'statusCode': 200,
        'body': body
    }

def execute_command(command,instance_id):
    bucket_name = 'pcluster-data'
    ssm_client = boto3.client('ssm', region_name="<REGION>")
    username = 'ec2-user'  # default user for the alinux base_os
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    # Run the scheduler command on the master node through Systems Manager
    response = ssm_client.send_command(
                     InstanceIds=[instance_id],
                     DocumentName='AWS-RunShellScript',
                     OutputS3BucketName=bucket_name,
                     OutputS3KeyPrefix='ssm',
                     Parameters={'commands':['sudo su - %s -c "%s"'%(username,command)]})
    command_id = response['Command']['CommandId']
    # Poll until the command completes
    time.sleep(1)
    output = ssm_client.get_command_invocation(CommandId=command_id,InstanceId=instance_id)
    while output['Status'] != 'Success':
      time.sleep(1)
      output = ssm_client.get_command_invocation(CommandId=command_id,InstanceId=instance_id)
      if (output['Status'] == 'Failed') or (output['Status'] =='Cancelled') or (output['Status'] == 'TimedOut'):
        break
    # Systems Manager writes the command output to S3; read it back
    body = ''
    files = list(bucket.objects.filter(Prefix='ssm/%s/%s/awsrunShellScript/0.awsrunShellScript'%(command_id,instance_id)))
    for obj in files:
      body += obj.get()['Body'].read()
    return body

In the Basic settings section, set 10 seconds as Timeout.

Choose Save in the top right to save the function.

In the Execution role section, choose the role link (indicated by the red arrow in the image below) to open the function's role in the IAM console.


Execution role


In the newly-opened tab, choose Attach policies and then Create policy.


Permissions policies

This last action will open a new tab in your browser. From this new tab, choose Create policy and then JSON.


Attach Permissions

Create policy


Modify <REGION> and <AWS ACCOUNT ID> appropriately, and also update the S3 bucket name from pcluster-data to the name you chose earlier.


    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "arn:aws:ec2:<REGION>:<AWS ACCOUNT ID>:instance/*",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "arn:aws:ssm:<REGION>:<AWS ACCOUNT ID>:*"
            "Effect": "Allow",
            "Action": [
            "Resource": [

In the next section, enter ExecuteSlurmCommands as the Name and then choose Create policy.

Close the current tab and move to the previous one.

Refresh the list, select the ExecuteSlurmCommands policy and then Attach policy, as shown here:


Attach Permissions

Execute the AWS Lambda function with AWS API Gateway

The AWS API Gateway allows the creation of REST and WebSocket APIs that act as a “front door” for applications to access data, business logic, or functionality from your backend services like AWS Lambda.

Sign in to the API Gateway console.

If this is your first time using API Gateway, you will see a page that introduces you to the features of the service. Choose Get Started. When the Create Example API popup appears, choose OK.

If this is not your first time using API Gateway, choose Create API.

Create an empty API as follows and choose Create API:


Create API


You can now create the slurm resource by choosing the root resource (/) in the Resources tree and selecting Create Resource from the Actions dropdown menu as shown here:


Actions dropdown


The new resource can be configured as follows:

Configure as proxy resource: unchecked
Resource Name: slurm
Resource Path: /slurm
Enable API Gateway CORS: unchecked

To confirm the configuration, choose Create Resource.


New Child Resource

In the Resource list, choose /slurm and then Actions and Create method as shown here:


Create Method

Choose ANY from the dropdown menu, and choose the checkmark icon.

In the “/slurm – ANY – Setup” section, use the following values:

Integration type: Lambda Function
Use Lambda Proxy integration: checked
Lambda Region: eu-west-1
Lambda Function: slurmAPI
Use Default Timeout: checked

and then choose Save.


slurm - ANY - Setup

Choose OK when prompted with Add Permission to Lambda Function.

You can now deploy the API by choosing Deploy API from the Actions dropdown menu as shown here:


Deploy API


For Deployment stage choose [new stage], for Stage name enter slurm and then choose Deploy:


Deploy API

Take note of the API’s Invoke URL – it will be required for the API interaction.

Deploy the Cluster

The cluster can now be created using the following command line:

pcluster create -t slurm slurmcluster

-t slurm indicates which section of the cluster template to use.
slurmcluster is the name of the cluster that will be created.

For more details, see the AWS ParallelCluster Documentation. A detailed explanation of the pcluster command line parameters can be found in AWS ParallelCluster CLI Commands.

How to interact with the slurm API

The slurm API created in the previous steps requires some parameters:

  • instanceid – the instance id of the Master node.
  • function – the API function to execute. Accepted values are list_jobs, list_nodes, list_partitions, job_details and submit_job.
  • jobscript_location – the S3 location of the job script (required only when function=submit_job).
  • submitopts – the submission parameters passed to the scheduler (optional, can be used when function=submit_job).
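
A call is simply these parameters URL-encoded onto the API's /slurm resource. The small helper below is a hypothetical sketch (not part of the solution) showing how a client might assemble such a URL:

```python
# Hypothetical helper: build the URL for a slurm API call from the
# parameters described above. Works on Python 2 and 3.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2, matching the Lambda runtime

def build_slurm_api_url(invoke_url, instanceid, function, **params):
    """Combine the Invoke URL, the /slurm resource, and the query parameters."""
    query = {'instanceid': instanceid, 'function': function}
    query.update(params)
    # Sort the parameters for a deterministic query string
    return '%s/slurm?%s' % (invoke_url.rstrip('/'), urlencode(sorted(query.items())))

url = build_slurm_api_url(
    'https://<api-id>.execute-api.eu-west-1.amazonaws.com/slurm',
    'i-062155b00c02a6c8e',
    'job_details',
    jobid='11')
print(url)
```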

Here is an example of the interaction with the API:

#Submit a job
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=submit_job&jobscript_location=pcluster-data/job_script.sh" -H 'submitopts: --job-name=TestJob --partition=compute'
Submitted batch job 11

#List of the jobs
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=list_jobs"
                11   compute  TestJob ec2-user  R       0:14      1 ip-10-0-3-209

#Job details
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=job_details&jobid=11"
JobId=11 JobName=TestJob
   UserId=ec2-user(500) GroupId=ec2-user(500) MCS_label=N/A
   Priority=4294901759 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:06 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-06-26T14:42:09 EligibleTime=2019-06-26T14:42:09
   StartTime=2019-06-26T14:49:18 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=compute AllocNode:Sid=ip-10-0-1-181:28284
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

Access to the API can be controlled by following Controlling and Managing Access to a REST API in API Gateway in the documentation.


When you have finished your computation, the cluster can be destroyed using the following command:

pcluster delete slurmcluster

The additional resources you created can be deleted by following the official AWS documentation.


This post has shown you how to deploy a Slurm cluster using AWS ParallelCluster, and integrate it with the AWS API Gateway.

This solution uses the AWS API Gateway, AWS Lambda, and AWS Systems Manager to simplify interaction with the cluster without granting access to the command line of the Master node, improving overall security. You can extend the API by adding additional schedulers or interaction workflows, and integrate it with external applications.

from AWS Open Source Blog

Open Distro for Elasticsearch 1.1.0 Released

Open Distro for Elasticsearch 1.1.0 Released

We are happy to announce that Open Distro for Elasticsearch 1.1.0 is now available for download!

Version 1.1.0 includes the upstream open source versions of Elasticsearch 7.1.1, Kibana 7.1.1, and the latest updates for alerting, SQL, security, performance analyzer, and Kibana plugins, as well as the SQL JDBC driver. You can find details on enhancements, bug fixes, and more in the release notes for each plugin in their respective GitHub repositories. See Open Distro’s version history table for previous releases.

Download the latest packages

You can find Docker images for Open Distro for Elasticsearch 1.1.0 and Open Distro for Elasticsearch Kibana 1.1.0 on Docker Hub. Make sure your compose file specifies 1.1.0 or uses the latest tag. See our documentation on how to install Open Distro for Elasticsearch with RPMs and install Open Distro for Elasticsearch with Debian packages. You can find Open Distro for Elasticsearch's Security plugin artifacts on Maven Central.

We have updated our tools as well! You can download Open Distro for Elasticsearch’s PerfTop client, and Open Distro for Elasticsearch’s SQL JDBC driver.

For more detail, see our release notes for Open Distro for Elasticsearch 1.1.0.

New features in development

We’re also excited to pre-announce new plugins in development. We’ve made available pre-release alpha versions of these plugin artifacts for developers (see below for links) to integrate into their applications. We invite you to join in to submit issues and PRs on features, bugs, and tests you need or build.

k-NN Search

Open Distro for Elasticsearch’s k-nearest neighbor (k-NN) search plugin will enable high-scale, low-latency nearest neighbor search on billions of documents across thousands of dimensions with the same ease as running any regular Elasticsearch query. The k-NN plugin relies on the Non-Metric Space Library (NMSLIB). It will power use cases such as recommendations, fraud detection, and related document search. We are extending the Apache Lucene codec to introduce a new file format to store vector data. k-NN search uses the standard Elasticsearch mapping and query syntax: to designate a field as a k-NN vector you simply map it to the new k-NN field type provided by the k-NN plugin.
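
As a sketch of what such a mapping might look like (index and field names here are illustrative, and the final syntax may differ in the released plugin):

```
PUT my-knn-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}
```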

Index management

Open Distro for Elasticsearch Index Management will enable you to run periodic operations on your indexes, eliminating the need to build and manage external systems for these tasks. You will define custom policies to optimize and move indexes, applied based on wildcard index patterns. Policies are finite-state automata. Policies define states and transitions (Actions). The first release of Index Management will support force merge, delete, rollover, snapshot, replica_count, close/open, read_only/read_write actions, and more. Index Management will be configurable via REST or the associated Kibana plugin. We’ve made artifacts of the alpha version of Open Distro for Elasticsearch Index Management and Open Distro for Elasticsearch Kibana Index Management available on GitHub.

Job scheduler

Open Distro for Elasticsearch’s Job Scheduler plugin is a library that enables you to build plugins that can run periodic jobs on your cluster. You can use Job Scheduler for a variety of use cases, from taking snapshots once per hour, to deleting indexes more than 90 days old, to providing scheduled reports. Read our announcement page for Open Distro for Elasticsearch Job Scheduler for more details.

SQL Kibana UI

Open Distro for Elasticsearch’s Kibana UI for SQL will make it easier for you to run SQL queries and explore your data. This plugin will support SQL syntax highlighting and output results in the familiar tabular format. The SQL Kibana UI will support nested documents, allowing you to expand columns with these documents and drill down into the nested data. You will also be able to translate your SQL query to Elasticsearch query DSL with a single click and download results of the query as a CSV file.


Please feel free to ask questions on the Open Distro for Elasticsearch community discussion forum.

Report a bug or request a feature

You can file a bug, request a feature, or propose new ideas to enhance Open Distro for Elasticsearch. If you find bugs or want to propose a feature for a particular plug-in, you can go to the specific repo and file an issue on the plug-in repo.

Getting Started

If you’re getting started on building your open source contribution karma, you can select an issue tagged as a “Good First Issue” to start contributing to Open Distro for Elasticsearch. Read the Open Distro technical documentation on the project website to help you get started.

Go develop! And contribute to Open Distro 🙂

from AWS Open Source Blog

Use Elasticsearch’s _rollover API For Efficient Storage Use

Use Elasticsearch’s _rollover API For Efficient Storage Use

Many Open Distro for Elasticsearch users manage data life cycle in their clusters by creating an index based on a standard time period, usually one index per day. This pattern has many advantages: ingest tools like Logstash support index rollover out of the box; defining a retention window is straightforward; and deleting old data is as simple as dropping an index.

If your workload has multiple data streams with different data sizes per stream, you can run into problems: Your resource usage, especially your storage per node, can become unbalanced, or “skewed.” When that happens, some nodes will become overloaded or run out of storage before other nodes, and your cluster can fall over.

You can use the _rollover API to manage the size of your indexes. You call _rollover on a regular schedule, with a threshold that defines when Elasticsearch should create a new index and start writing to it. That way, each index is as close to the same size as possible. When Elasticsearch distributes the shards for your index to nodes in your cluster, you use storage from each node as evenly as possible.

What is skew?

Elasticsearch distributes shards to nodes based primarily on the count of shards on each node (it’s more complicated than that, but that’s a good first approximation). When you have a single index, because the shards are all approximately the same size, you can ensure even distribution of data by making your shard count divisible by your node count. For example, if you have five primaries and one replica, or ten total shards, and you deploy two nodes, you will have five shards on each node (Elasticsearch always places a primary and its first replica on different nodes).


Two Open Distro for Elasticsearch nodes with balanced shard usage


When you have multiple indexes, you get heterogeneity in the storage per node. For example, say your application is generating one GB of log data per day and your VPC Flow Logs are ten GB per day. For both of these data streams, you use one primary shard and one replica, following the best practice of up to 50 GB per shard. Further, assume you have six nodes in your cluster. After seven days, each index has 14 total shards (one primary and one replica per day). Your cluster might look like the following – in the best case, you have even distribution of data:


An Open Distro for Elasticsearch cluster with balanced resource usage

In the worst case, assume you have five nodes. Then your shard count is indivisible by your node count, so larger shards can land together on one node, as in the image below. The nodes with larger shards use ten times more storage than the nodes with smaller shards.

An Open Distro for Elasticsearch cluster with unbalanced resource usage

While this example is somewhat manufactured, it represents a real problem that Elasticsearch users must solve.
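
The arithmetic can be sketched with a toy allocator (purely illustrative; Elasticsearch's real allocator weighs more factors than shard count):

```python
def allocate_by_shard_count(shard_sizes_gb, node_count):
    """Place each shard on the node that currently holds the fewest shards."""
    nodes = [[] for _ in range(node_count)]
    for size in shard_sizes_gb:
        min(nodes, key=len).append(size)
    return [sum(n) for n in nodes]

# Seven days of two daily indexes, each with one primary and one replica:
# fourteen 1 GB shards (1 GB/day stream) and fourteen 10 GB shards (10 GB/day).
shards = [1.0] * 14 + [10.0] * 14
per_node = allocate_by_shard_count(shards, 5)
print(per_node)
print('storage skew: %.1fx' % (max(per_node) / min(per_node)))
```

Even though each node holds nearly the same number of shards, the per-node storage diverges once shard sizes differ – exactly the skew described above.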

Rollover instead!

The _rollover API creates a new index when you hit a threshold that you define in the call. First, you create an _alias for reading and writing the current index. Then you use cron or other scheduling tool to call the _rollover API on a regular basis, e.g. every minute. When your index exceeds the threshold, Elasticsearch creates a new index behind the alias, and you continue writing to that alias.
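
For example, a cron entry along these lines (host, alias, and threshold assumed here; adjust for your cluster and authentication) attempts a rollover every minute:

```
* * * * * curl -s -X POST 'http://localhost:9200/weblogs/_rollover' -H 'Content-Type: application/json' -d '{"conditions": {"max_size": "10gb"}}'
```

The call is cheap when no condition is met, so polling every minute keeps index sizes close to the threshold.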

To create an alias for your index, call the _aliases API:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "weblogs-000001",
        "alias": "weblogs",
        "is_write_index": true
      }
    }
  ]
}

You must set is_write_index to true to tell _rollover which index it needs to update.

When you call the _rollover API:

POST /weblogs/_rollover
{
  "conditions": {
    "max_size": "10gb"
  }
}

You will receive a response that details which of the conditions, if any, is true and whether Elasticsearch created a new index as a result of the call. If you name your indexes with a trailing number (e.g. -000001), Elasticsearch increments the number for the next index it creates. In either case, you can continue to write to the alias, uninterrupted.

Elasticsearch 7.x accepts three conditions: max_age, max_docs, and max_size. If you call _rollover with the same max_size across all of your indexes, they will all roll over at approximately the same size. [Note: Size is difficult to nail down in a distributed system. Don’t expect that you will hit exactly the same size. Variation is normal. In fact, earlier versions of Elasticsearch don’t accept max_size as a condition. For those versions, you can use max_docs, normalizing for your document size.]
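
The two behaviors described above – checking a max_size threshold and incrementing a trailing-number index name – can be sketched as follows (helper names are mine, not an Elasticsearch API):

```python
import re

# Byte multipliers for Elasticsearch-style size strings.
UNITS = {'b': 1, 'kb': 1024, 'mb': 1024**2, 'gb': 1024**3, 'tb': 1024**4}

def parse_size(threshold):
    """Parse a size string such as '10gb' into bytes."""
    match = re.match(r'^(\d+(?:\.\d+)?)([kmgt]?b)$', threshold.lower())
    if not match:
        raise ValueError('unrecognized size: %r' % threshold)
    return int(float(match.group(1)) * UNITS[match.group(2)])

def should_rollover(index_size_bytes, max_size='10gb'):
    """True once the index meets or exceeds the configured threshold."""
    return index_size_bytes >= parse_size(max_size)

def next_index_name(name):
    """'weblogs-000001' -> 'weblogs-000002', preserving the zero padding."""
    prefix, number = name.rsplit('-', 1)
    return '%s-%0*d' % (prefix, len(number), int(number) + 1)
```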

The one significant tradeoff is in lifecycle management. Returning to our prior example, let's say you roll over on ten GB of index. The data stream with ten GB daily will roll over every day. The data stream with one GB of index daily will roll over every ten days. You need to manage these indexes at different times, based on their size. Data in the lower-volume indexes will persist longer than data in higher-volume indexes.


When running an Elasticsearch cluster with multiple data streams of different sizes, typically for log analytics, you use the _rollover API to maintain a more nearly even distribution of data in your cluster’s nodes. This prevents skew in your storage usage and results in a more stable cluster.

Have an issue or question? Want to contribute? You can get help and discuss Open Distro for Elasticsearch on our forums. You can file issues here.

from AWS Open Source Blog