Tag: Infrastructure

Optimizing AWS Control Tower For Multiple AWS Accounts And Teams

One of the major benefits of Amazon Web Services (AWS) is that it comes with an extensive set of tools for managing deployments and user identities. Most organizations can meticulously manage how their cloud environment is set up, and how users access different parts of that environment, through AWS IAM.

However, there are times when even the most extensive IAM and other management tools just aren’t enough. For larger corporations, or for businesses that are scaling their cloud deployment to a higher level, setting up multiple AWS accounts run by different teams is often the solution.

Amazon has not ignored the need for multi-account AWS environments. In fact, the company has introduced AWS Control Tower, whose sole purpose is to make setting up new multi-account AWS environments easy.

Quick Environment Setup With AWS Control Tower

As the name suggests, AWS Control Tower gives you a comprehensive bird’s-eye view of multiple cloud environments. It is designed to make deploying, managing, and monitoring multiple AWS accounts and teams easy.

Rather than going through the setup process of new AWS accounts manually, you can now automate the creation of multiple AWS accounts and environments using Control Tower.

First, you need to define the blueprint that will be used by all of the environments; this is very similar to setting up a base operating system for OEM devices.

Blueprints are designed to make sure that the new AWS environments comply with best practices and are set up correctly from the beginning. Any customization can then be made on a per-account basis, giving the organization maximum flexibility with their cloud environments.

Among the things that the AWS Control Tower blueprints provide are identity management, access management, centralized logging, and cross-account security audits. Provisioning of cloud resources and network configurations are also included in the blueprints. You even have the ability to customize the blueprint you use to specific requirements.

Easy Monitoring Of Environments

Since AWS Control Tower is designed as a centralization tool from the beginning, you can also expect easy monitoring and maintenance of multiple AWS accounts and teams from this platform. There are guardrails added to the blueprints of AWS environments, so you know your environments are secure from the beginning. All you need to do is enforce the security policies; even that is easy and centralized.

Guardrails are implemented in part as service control policies (SCPs), and compliance with them is monitored constantly. When the configuration of an environment doesn’t comply with the required policies, warnings are triggered and you are informed immediately. Every new account created using AWS Control Tower uses the same set of policies, leading to a more standardized cloud environment as a whole.

What’s interesting about the SCPs is the fact that you can dig deep into details—particularly details about accounts that don’t comply with the predefined security policies—and make adjustments as necessary. You always know the kind of information security and policy violations you are dealing with and you know exactly who to address to get the issues fixed.
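
To make the idea of an SCP concrete, here is a minimal sketch of what one looks like when written as Terraform configuration. This is an illustration only: Control Tower creates and manages its own guardrail policies, and the policy name, the denied actions, and the workloads_ou_id variable below are assumptions chosen for the example, not anything Control Tower ships.

```hcl
# Hypothetical input: the ID of the organizational unit to protect.
variable "workloads_ou_id" {
  type = string
}

# An SCP is an IAM-style JSON policy attached at the organization level.
# This example denies tampering with CloudTrail logging in member accounts.
resource "aws_organizations_policy" "deny_cloudtrail_changes" {
  name = "deny-cloudtrail-changes" # illustrative name
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "DenyCloudTrailChanges"
        Effect   = "Deny"
        Action   = ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"]
        Resource = "*"
      }
    ]
  })
}

# Attaching the policy to an OU applies it to every account in that OU.
resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.deny_cloudtrail_changes.id
  target_id = var.workloads_ou_id
}
```

Because the policy is attached to an organizational unit rather than to individual accounts, every account placed in that unit picks it up automatically, which is the standardization effect described above.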

As an added bonus, AWS Control Tower provides extensive reports, including reports on governance of workloads, service control policies, and the state of the cloud environments in general. The tool goes beyond setting up a landing zone based on best practices; it also helps you monitor those landing zones meticulously.

Automation Is The Key

From the previous explanation, it is easy to see how useful AWS Control Tower is for organizations that need to set up multiple cloud environments. The tool allows top administrators and business owners to keep an eye on their cloud deployment while maintaining high visibility into individual environments, deployments, and users.

That said, AWS Control Tower doesn’t stop there. It adds one crucial element that positions Amazon as the leader in this specific market segment: automation. Account provisioning, resource provisioning, and even the complete setup of landing zones can be fully automated with ‘recipes’ that are defined in blueprints.

Ibexlabs, for example, already uses AWS Control Tower on behalf of current clients and has designed an onboarding process that applies the tool to new enterprises, too. As well as creating a landing zone with log archive and audit accounts, the team uses Control Tower to launch VPCs and subnets for the organization, in addition to portfolio setup.

Ibexlabs also scripts the installation of a comprehensive suite of other tools to enhance client usage of AWS, including Jenkins, CircleCI, Datadog, New Relic, OpenVPN, and VPC peering within the accounts. On top of all this, Ibexlabs uses CloudFormation with launch configurations and autoscaling, as well as other app services according to the clients’ needs.

Automation eliminates countless mundane tasks associated with setting up and securing a new cloud environment. What used to be a tedious process that could take hours—if not days—to complete is now one or two clicks away. Automation makes the whole system more robust and flexible since customizations can now be done on a specific deployment level.

We really have to see the implementation of automation in AWS Control Tower as part of a bigger trend. Amazon has been automating many of its AWS components in recent years, signaling a serious shift beyond DevOps. As it gets easier for even the most complex organizations to maintain their cloud environments, the days of developers running their own environments may soon be here.

Regardless of the shift, AWS Control Tower is a step in the right direction. Organizations that require multiple AWS accounts can now get the resources they need without jumping through the hoops of setting up those environments manually.

This post was originally published here.

from DZone Cloud Zone

Flux7 and AWS Present High Performance Computing Immersion Day

from Flux7 DevOps Blog

Adopting Kubernetes? Here’s What You Should Definitely Not Do

Kubernetes is changing the infrastructure landscape by enabling platform teams to scale much more effectively. Rather than convincing every team to secure VMs properly and manage network devices, and then following up to make sure they’ve done so, platform teams can now hide all of these details behind a K8s abstraction. This lets both application and platform teams move more quickly: application teams because they don’t need to know all the details, and platform teams because they are free to change them.

There are some great tutorials and online courses on the Kubernetes website. If you’re new to Kubernetes, you should definitely check these out. But it’s also helpful to understand what not to do.

Not Budgeting for Maintenance

The first big failure mode with Kubernetes is not budgeting for maintenance. Just because K8s hides a lot of details from application developers doesn’t mean that those details aren’t there. Someone still needs to allocate time for upgrades, setting up monitoring, and thinking about provisioning new nodes.

You should budget for (at least) quarterly upgrades of master and node infrastructure. However frequently you were upgrading VM images, make sure you are now doing the same for Kubernetes infrastructure as well.

Who should do these upgrades? If you’re going to roll out K8s in a way that is standard across your org (which you should be doing!), this needs to be your infrastructure team.

Moving Too Fast

The second big failure mode is that teams move so quickly they forget that adopting a new paradigm for orchestrating services creates new challenges around observability. Not only is a move to K8s often coincident with a move to microservices (which necessitates new observability tools) but pods and other K8s abstractions are often shorter-lived than traditional VMs, meaning that the way that telemetry is gathered from the application also needs to change.

The solution is to build in observability from day one. This includes instrumenting code in such a way that you can take both a user-centric and an infrastructure-centric view of requests, and understanding how instrumentation data is transmitted, aggregated, and analyzed. Waiting until you’ve experienced an outage is too late to address these issues, as it will be virtually impossible to get the data you need to understand and remediate that outage.

Not Accounting for Infrastructure

With all the hype around Kubernetes (and, of course, its many, many benefits), it’s easy to assume that it will magically solve all your infrastructure problems. What’s great about K8s is that it goes a long way toward isolating those problems (so that platform teams can solve them more effectively), but they’ll still be there, in need of a solution.

So in addition to managing OS upgrades, vulnerability scans, and patches, your infrastructure team will also need to run, monitor, and upgrade K8s master components (API server, etcd) as well as all of the node components (docker, kubelet). If you choose a managed K8s solution, then a lot of that work will be taken care of for you, but you still need to initiate master and node upgrades. And even if they are easy, node upgrades can still be disruptive: You’ll want to make sure you have enough capacity to move services around during the upgrade. While it’s good news that application developers no longer need to think about these issues, the platform team (or someone else) still does.

Not Embracing the K8s Community

The K8s community is an incredible resource that really can’t be overvalued. Kubernetes is certainly not the first open source orchestration tool, but it’s got a vibrant and quickly growing community. This is really what’s powering the continued development of K8s, as it continues to turn out new features.

The platform boasts thousands of contributors, including collaboration with all major cloud providers and dozens of tech companies. If you have questions or need help, it’s almost guaranteed that you can find the answer on GitHub or Slack, or find someone who can point you in the right direction.

And last, but certainly not least, contributing to and being a part of the community can be a great way to meet other developers who might one day become members of your team.

Not Thinking Through “Matters of State”

Of course, how you divide your application into smaller services is a critical decision to get right. But for K8s specifically, it’s really important to think about how you are going to handle state: whether it’s using StatefulSets, leveraging your provider’s block storage devices, or moving to a completely managed storage solution, implementing stateful services correctly the first time around is going to save you huge headaches.

It’s all too easy to get burned by a corrupted shard in a database or other storage system, and recovering from these sorts of failures is by definition more complex when running on K8s. Needless to say, make sure you are testing disaster recovery for stateful services as they are deployed on your cluster (and not just trusting that it will just work like it did before you moved to K8s).

Not Accounting for Migration

Another important item to address is what your application looked like before you began implementing K8s. Did it already have hundreds of services? If so, your biggest concern should be understanding how to migrate those services in an incremental but seamless way.

Are you just breaking the first few services off of your monolith? Making sure you have the infrastructure to support an influx of services is going to be critical to a successful implementation.

Wrapping Up: The Need for Distributed Tracing

K8s and other orchestration and coordination tools like service meshes are really only half the story. They provide flexibility in how services are deployed as well as the ability to react quickly, but they don’t offer insight into what’s actually happening in your services.

The other half is about building that insight and understanding how performance is impacting your users. That’s where LightStep comes in: we enable teams to understand how a bad deployment in one service is affecting users 5, 10, or 100 services away.

from DZone Cloud Zone

Flux7 Case Study: Technology’s Role in the Agile Enterprise

The transition to becoming an Agile Enterprise is one that touches every part of the organization, from strategy to structure and process to technology. In our journey to share the story of how we at Flux7 have moved through the process, today we will discuss how we have adopted specific supporting technologies to further our agile goals. (In case you missed them, check out our first two articles on choosing a Flatarchy and our OKR journey.)

While achieving an Agile Enterprise must be rooted in the business and must be accompanied by an agile culture (more on that in our next article in the series), a technology platform that supports agility can be a key lever to successful Agile transformation.

At Flux7, this means both technologies that support communication and learning for teams to be agile and agile technology automation. Flux7 uses a variety of tools, each with its own specialty for helping us communicate, collaborate and stay transparent. We’ll first take a look at each of these tools and the role it plays, and then we’ll share a couple of ways in which some of these tools come together to create agility.

Agile Communication

As a 100% remote organization, communication is vital to corporate success. As a result, we use several tools to communicate, share files, documents, ideas and more.

  • Slack enables us to communicate in near real-time sharing files, updates, links and so much more. Slack is a go-to resource for everything from quick questions to team updates and accolades.
  • OfficeVibe allows employees to communicate feedback to the organization anonymously. At Flux7 we take feedback gathered from the OfficeVibe LeoBot very seriously and aim for top scores as a measure of our success in creating a thriving culture.
  • Gmail is used for less real-time communication needs and for communicating with external parties (though we also use Slack channels with our customers); Google Calendar communicates availability; and Google Meet is used widely for internal and external meetings.

Agile Collaboration

Working closely together from a distance may sound antithetical, but with the help of several tools, our teams are able to collaborate effectively, boosting efficiency and productivity. Our favored tools for collaboration are:

  • Trello helps us collaborate on OKRs and customer engagements and is where our teams are able to visualize, plan, organize and facilitate short term and long term tasks.
  • Google Drive allows us to collaborate in real-time as our documents are automatically saved so that nothing can ever be lost. In fact, Flux7 has a main door to Google Drive called the Flux7 Library, which is where all of our non-personnel resources and documents are stored. This is just one way we ensure resources are at employees’ fingertips, helping us to stay transparent, agile and innovative.
  • Zapier automates workflows for us. For example, we make extensive use of its Trello PowerUps to automate things like creating new Trello cards from Gmail messages or updating Trello cards with new Jira issues.
  • GitHub Repositories host and track changes to our code-related project files and GitHub’s Wiki tools allow us to host documentation for others to use and contribute. In fact, the Flux7 Wiki page is hosted in a Git Wiki. The Flux7 Wiki is home to a wide variety of resources — from a Flux7 Glossary to book reviews, PeopleOps tools and more.
  • HubSpot is a marketing automation and CRM solution where sales and marketing teams communicate and collaborate on everything from new sales leads to sharing sales collateral.

Agile Metrics and Measurements

At Flux7 our mantra is to experiment more, fail cheap, and measure results accurately. Helping us to measure accurately are:

  • Google Analytics gives Flux7 valuable detail about our website visitors, giving us clear insights into what our visitors care most about. HubSpot analytics also gives us website data. Because HubSpot is also our CRM, this data can be paired with sales pipeline activity data, which gives us an incredibly rich view of the customer journey and helps us hone business strategy.
  • Slack analytics give Flux7 insight into how the team uses Slack: for example, how many messages were sent over the last 30 days, where conversations are happening, and more.

Agile Management & More

Continuous learning and growth are central to Flux7’s culture and values of innovation, humbleness, and transparency. As such, we also have the technology to facilitate ongoing learning with:

  • The Flux7 internal e-book library, where employees can check out e-books and audiobooks for ongoing education. Flux7 utilizes OverDrive to secure our online Internal Library. Topics range from Marketing and Business to IT and DevOps Instructional Resources. (For more on our Library, please refer to our blog: Flux7 Library Drives Culture of Learning, Sharing.)
  • Flux7 also uses BambooHR to store peer feedback; anyone can initiate feedback by asking another peer to provide it. The feedback is stored in BambooHR and only the recipient can see it and turn the feedback into actionable results. BambooHR also contains important files like team assignments, who is on vacation, and recorded All-Hands meetings.
  • We use Okta for single sign-on, LastPass for password management, Hubstaff for tracking time on projects, QuickBooks for finance, and more.

Bringing It All Together

IT automation is core to all we do at Flux7 and is instrumental in bringing together many of these tools. To give you an example, forecasting data from HubSpot is automatically sent to Slack with a Zapier integration that allows us to automatically see just-in-time forecasting data. We can share newly closed deals with the broader Flux7 team over Slack this way, too.

We have also integrated Git with Trello such that change notifications are sent as updates to the appropriate Trello card(s). Trello, in turn, notifies all relevant team members of the updated card information, automatically keeping the right people up to date.

At Flux7, we believe in the value of cloud computing and removing levels of management from the process altogether. In fact, we are extremely serverless as a company — with only one server for our website — which allows us to focus less on IT tasks like managing servers and more on delivering value to our customers and employees.

While there are many elements to becoming an Agile Enterprise, technology plays a pivotal role in communication, collaboration, and productivity. As the pace of the market continues to accelerate, agility can only be driven through flexible technologies that help us better anticipate and react to change. Don’t miss the fourth article in our series on the role of culture in building an Agile Enterprise.

from Flux7 DevOps Blog

What’s New in Terraform V0.12

Over recent years, infrastructure as code has made creating and managing a complex cloud environment, and making the necessary ongoing changes to that environment, significantly more streamlined. Rather than having to handle infrastructure management separately, developers can push infrastructure changes and updates alongside updates to application code.

Terraform from HashiCorp has been a leading tool in the charge to make infrastructure as code more accessible. While early versions admittedly involved a steep learning curve, the tool has become more and more workable with each upgrade. The latest version, Terraform v0.12, introduces a number of changes that make using the configuration language inside the tool much simpler.

Command Improvements

Terraform v0.12 is a major update: a lot of commands and features have changed over the last three months. HCL, the underlying language used by Terraform, has been updated too; we will get to this in a bit.

The use of first-class expressions is perhaps the most notable update in the new version. Rather than wrapping expressions in interpolation sequences inside double quotes, expressions can now be used natively, which is the form most developers are used to from other languages. Developers can now write a variable reference as var.conf rather than using the old format of “${var.conf}”.
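
As a minimal sketch of the difference, the configuration below shows the old interpolation form in a comment next to the v0.12 form. The variable names and the instance type are assumptions chosen for illustration.

```hcl
# Hypothetical inputs for the example.
variable "ami_id" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

resource "aws_instance" "example" {
  # Terraform 0.11 and earlier needed interpolation syntax here:
  #   ami = "${var.ami_id}"
  # In 0.12 the expression can be written directly.
  ami           = var.ami_id
  instance_type = var.instance_type

  # Interpolation is still the right tool when building a larger string.
  tags = {
    Name = "example-${var.instance_type}"
  }
}
```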

Expressions are also strengthened by the introduction of for expressions, which iterate over lists and maps to transform their values. A for expression can also act as a filter by adding an if clause; this makes advanced infrastructure manipulation based on conditions easier to achieve.
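
A short sketch of for expressions, including the filtering form, might look like this. The users variable and its contents are made up for the example.

```hcl
# Hypothetical input: user name => role.
variable "users" {
  type = map(string)
  default = {
    alice = "admin"
    bob   = "developer"
    carol = "admin"
  }
}

locals {
  # Transform every element of a list.
  upper_names = [for name in keys(var.users) : upper(name)]

  # Filter with an if clause: keep only the names whose role is "admin".
  admins = [for name, role in var.users : name if role == "admin"]

  # Curly braces and => produce a map instead of a list.
  role_by_upper_name = { for name, role in var.users : upper(name) => role }
}
```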

There is also a generalized splat operator. The resource.*.field syntax is no longer limited to resources that have count set; the same operator now works with any list value in your code.
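
The sketch below shows the operator (usually called the splat operator) in both its older attribute-only spelling and the bracketed [*] spelling that v0.12 adds, applied first to a counted resource and then to an ordinary list variable. The resource and variable names are assumptions for the example.

```hcl
variable "ami_id" {
  type = string
}

# Hypothetical list of plain objects, not tied to any resource.
variable "disks" {
  type = list(object({
    name    = string
    size_gb = number
  }))
  default = []
}

resource "aws_instance" "web" {
  count         = 3
  ami           = var.ami_id
  instance_type = "t3.micro"
}

# The older spelling still works on a counted resource.
output "instance_ids_legacy" {
  value = aws_instance.web.*.id
}

# The bracketed spelling added in 0.12.
output "instance_ids" {
  value = aws_instance.web[*].id
}

# The same operator applied to an ordinary list value.
locals {
  disk_names = var.disks[*].name
}
```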

At Caylent, we love the use of 1:1 JSON mapping. Every HCL construct now has a defined JSON equivalent, so configuration can be converted to and from JSON reliably. This may look like a small update at first, but it is a simple change that will make life significantly smoother for developers who manage their own infrastructure using Terraform.

More Updates in Terraform v0.12

Other changes are just as interesting. As mentioned before, HashiCorp updated the HCL language extensively in this version. HCL had limitations, and we developers have been relying on workarounds to make things work. Terraform v0.12, as the previous points demonstrate, eliminates a number of bottlenecks we’ve been facing all along.

You can also now use conditional operators to configure infrastructure. First of all, I’m glad to inform you that null is now a recognized value. When you assign null to a parameter or field, Terraform treats it as unset, so the default value set by the provider is used instead.

The conditional operator (condition ? a : b) is also handy, and you can now rely on lazy evaluation of its results: only the selected branch is evaluated. Returning to the previous example, fields that contain null are omitted from the upstream API calls altogether.
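
Here is a small sketch that combines null with the conditional operator. The variable names, instance types, and fallback subnet ID are hypothetical.

```hcl
variable "ami_id" {
  type = string
}

variable "environment" {
  type    = string
  default = "dev"
}

# Optional input: null means "not set".
variable "key_name" {
  type    = string
  default = null
}

variable "subnet_ids" {
  type    = list(string)
  default = []
}

resource "aws_instance" "app" {
  ami = var.ami_id

  # Conditional operator: choose a size per environment.
  instance_type = var.environment == "prod" ? "m5.large" : "t3.micro"

  # When key_name is null the argument is treated as omitted,
  # so the provider's default behavior applies.
  key_name = var.key_name
}

locals {
  # Only the selected branch is evaluated, so indexing an empty list
  # in the unselected branch does not raise an error.
  first_subnet = length(var.subnet_ids) > 0 ? var.subnet_ids[0] : "subnet-fallback"
}
```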

Complex lists and maps are also now supported. This is perhaps the biggest change in this version. While Terraform supported only simple values in the past, you can now use complex lists and maps in both inputs and outputs, including the inputs and outputs of modules.
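
As an illustration of complex values on both the input and the output side, the sketch below takes a list of objects and returns a map. The variable names, CIDR ranges, and availability zones are assumptions made for the example.

```hcl
variable "vpc_id" {
  type = string
}

# A list of objects as an input, which 0.11 could not express.
variable "subnets" {
  type = list(object({
    name = string
    cidr = string
    az   = string
  }))
  default = [
    { name = "public-a", cidr = "10.0.1.0/24", az = "us-east-1a" },
    { name = "public-b", cidr = "10.0.2.0/24", az = "us-east-1b" },
  ]
}

resource "aws_subnet" "this" {
  count             = length(var.subnets)
  vpc_id            = var.vpc_id
  cidr_block        = var.subnets[count.index].cidr
  availability_zone = var.subnets[count.index].az

  tags = {
    Name = var.subnets[count.index].name
  }
}

# A map as an output: subnet name => subnet ID.
output "subnet_ids_by_name" {
  value = { for i, s in var.subnets : s.name => aws_subnet.this[i].id }
}
```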

Another upgrade worth mentioning is the new way of accessing resource details through remote state outputs. In v0.12 the terraform_remote_state data source changes slightly: all the remote state outputs are now available as a single map value (the outputs attribute), whereas previously they were top-level attributes.
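
In practice the change looks roughly like the sketch below. The bucket name, state key, region, and the subnet_id output are hypothetical; the point is the extra outputs step in the reference.

```hcl
variable "ami_id" {
  type = string
}

data "terraform_remote_state" "network" {
  backend = "s3"

  # Hypothetical backend settings for the example.
  config = {
    bucket = "example-terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Terraform 0.11 exposed outputs as top-level attributes:
  #   subnet_id = "${data.terraform_remote_state.network.subnet_id}"
  # In 0.12 they live under the single outputs map.
  subnet_id = data.terraform_remote_state.network.outputs.subnet_id
}
```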

It’s also worth commenting on the new, improved “Context-Rich Error Messages” which make debugging much easier than in previous versions. Not all messages will include the following elements but most will conform to this structure:

  • A short problem description
  • A specific configuration construct reference
  • Any reference values necessary
  • A longer, more thorough problem description and possible solutions

Last but certainly not least, Terraform now recognizes references as first-class values. Yes, no more quoting and nesting references when coding updates for your infrastructure. Resource identifiers can also be output or used as parameters, depending on how your infrastructure is set up.
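
A brief sketch of references used as values, with the old quoted forms shown in comments. The bucket name and resource names are made up for the example.

```hcl
variable "ami_id" {
  type = string
}

resource "aws_s3_bucket" "logs" {
  bucket = "example-app-logs" # hypothetical bucket name
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # 0.11 took quoted strings here: depends_on = ["aws_s3_bucket.logs"]
  depends_on = [aws_s3_bucket.logs]

  lifecycle {
    # Likewise a bare attribute reference rather than "tags".
    ignore_changes = [tags]
  }
}

# A whole resource can be returned as a single object value.
output "app_instance" {
  value = aws_instance.app
}
```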

Real-World Challenges

Terraform has seen many big updates since the release of v0.12.0 in May 2019, and the engineers behind the tool worked really hard to make sure that no configuration changes are needed. In fact, the average Terraform user will not have to make any changes when updating to Terraform v0.12.

However, some breaking changes are to be expected. If you do have to change your configuration because of the upgrade to Terraform v0.12, HashiCorp is working on an automated tool that helps with the process. You can use the tool to check compatibility and anticipate potential issues that may stem from the upgrade.

One last thing to note about this latest update: it’s only the beginning. Terraform v0.12 is a signal from the engineers behind this tool that they are serious about pushing Terraform (and HCL) further. Upcoming releases and iterations are already in the pipeline, with refinements like better versioning and seamless upgrades among the major changes.

It is inspiring to see engineers work so hard to ensure that there are few to no breaking changes with every version they push. Terraform may well be the IaC language of the future if it continues to bring positive updates and useful tools to its framework at the current pace.

This post was originally published here.

from DZone Cloud Zone

TechTalks With Tom Smith: Tools and Techniques for Migration

To understand the current state of migrating legacy apps to microservices, we spoke to IT executives from 18 different companies. We asked, “What are the most effective techniques and tools for migrating legacy apps to microservices?” Here’s what we learned:

  • We did everything with Docker and Kubernetes. The datastore was on PostgreSQL. It depends on the use case. We went through an era of container wars, and Docker/Kubernetes has won. It’s less about the technology you’re going to use and more about how you are going to get there and get your team aligned. It’s more about the journey than the tooling. Alignment and culture are the hardest part.
  • Spring and Java tend to be the most highly regarded stacks. We are seeing a shift to serverless with Lambda and K8s in the cloud. CloudWatch can watch all your containers for you. Use best-of-class CI/CD tools.
  • A manual process is required. In addition to decomposing legacy apps into components, there is a challenge in understanding how the legacy app acquires its configuration and how to pass the identical configuration to the microservice. If you use K8s as your target platform, it supports two of three ways of mapping to a legacy app. It requires thinking and design work; it is not easily automated.
  • K8s is the deployment vehicle of choice. On the app side, we are seeing features or sub-services move into cloud-native design one-by-one. People start with the least critical and get confidence in running in a dual hybrid mode. Start with one or two services and quickly get to 10, 20, and more. Scalability, fault tolerance, and geographic distributions. Day two operations are quite different than with a traditional, legacy app.
  • Docker and K8s are the big ones. All are derivatives on the infrastructure side.
  • Containerizing our workloads and standardizing all runtime aspects on K8s primitives is one of the major factors in this effort. These include the environment variable override scheme, secrets handling, service discovery, load balancing, automatic failover procedures, network encryption, access control, process scheduling, joining nodes to a cluster, monitoring, and more. Each of these aspects had been handled in a bespoke way in the past for each service. Abstracting the infrastructure further away from the operating system has also contributed to the portability of the product. We’re employing a message delivery system to propagate events as the main asynchronous interaction between services. The event payload is encoded using a Protocol Buffers schema. This gives us an easy-to-maintain and easy-to-evolve contract model between the services. A nice property of that technique is also the type safety and ease of use that come with using the generated model. We primarily use Java as our runtime of choice. Adopting Spring Boot has helped to standardize how we externalize and consume configuration and allowed us to hook into an existing ecosystem of available integrations (like Micrometer, gRPC, etc.). We adopted Project Reactor as our reactive programming library of choice. The learning curve was steep, but it has helped us to apply common principled solutions to very complex problems. It greatly contributes to the resilience of the system.
  • Containers (Docker) and container orchestration platforms help a lot in managing the microservices architecture and its dependencies. Tools for generating microservice client and server APIs also help; examples would be Swagger for REST APIs and Google’s Protocol Buffers for internal microservice-to-microservice APIs.
  • K8s; debugging and tracing tools like Prometheus; controllers; logging. Getting traffic into K8s, and efficient communication within K8s, with Envoy as a sidecar proxy for networking functions (routing, caching). API gateway metering, API management, API monitoring, and a dev portal. A gateway that’s lightweight, flexible, and portable, taking into account east-west traffic with micro-gateways.
  • When going to microservices, you need to think about scale. There are so many APIs; how do you keep up with them and their metrics? You might want to use Prometheus, or CloudWatch if you are going to Lambda. When going to microservices you have to bring in OpenTelemetry, tracing, and logging. How do you debug? With a monolithic application, as a developer I can attach a debugger to a binary; I can’t do that with microservices. With microservices, every outage is like a murder mystery. This is a big issue for serverless and microservices.
  • There are a couple. It comes down to accepting CI/CD; without it you’re going to have trouble. Consider the build and delivery pipeline as part of the microservice itself. In a monolith, everything is in one place. With microservices, you scatter functionality all over the place, and services are more independent and less tightly bound. The pipeline becomes an artifact. You need to provide test coverage across microservices as permutations grow. Automation begins at the developer’s desk.
  • There are many tools that help the migration to microservices, such as Spring Boot and MicroProfile, both of which simplify the development of standalone apps in a microservices architecture. Stream processing technologies are becoming increasingly popular for building microservices, as there is a lot of overlap between a “stream-oriented” architecture and a “microservices-oriented” architecture. In-memory technology platforms are useful when high-performance microservices are a key part of the architecture.
  • Legacy apps should be migrated to independent, loosely coupled services through gradual decomposition, by splitting off capabilities using a strangler pattern. Select functionality based on a domain with clear boundaries that need to be modified or scaled independently. Make sure your teams are organized to build and support their portions of the application, avoid dependencies and bottlenecks that minimize the benefits of microservice adoption, and take advantage of infrastructure-level tools for monitoring, log management, and other supporting capabilities.
  • There are many tools available such as Amazon’s AWS Application Discovery Services, which will help IT understand your application and workloads, while others help IT understand server and database migration. Microsoft Azure has tools that help you define the business case and understand your current application and workloads. There is an entire ecosystem of partners who provide similar tools that may fit your specific needs for your migration that help you map dependencies, optimize workload and determine the best cloud computing model, so it may be prudent to look at others if you have specific needs or requirements. We provide the ability to monitor your applications in real-time both from an app performance as well as business performance perspective, providing data you need to see your application in action, validate the decisions and money spent on the migration, improve your user experience and provide the ability to rapidly change your development and release cycles.
  • Have a service mesh in place before you start making the transition. API ingress management platform (service control platform). An entry point for every connect in the monolith. Implement security and observability. Existing service mesh solutions are super hyper-focused on greenfield and K8s. Creates an enclosure around the monolith during the transitions.
  • There are a couple of approaches that work well when moving to microservices. Break the application down into logical components that each fit well in a single microservice, and then build it again as component pieces, which will minimize code changes while fully taking advantage of running smaller autonomous application components that are easier to deploy and manage separately. Build data-access-as-a-service for the application to use for all data read and write calls. This moves data complexity into its own domain, decoupling the data from the application components. It’s essential to embrace containers and container orchestration and to use DevOps tools, integrating security into your processes and automating throughout.
  • You need a hybrid integration platform that supports direct legacy to microservice communication so that you can choose the ideal digital environment without compromises based on your original IT debt.
  • Design Patterns such as the Strangler pattern are effective in migrating components of legacy apps to the microservices architecture. Saga, Circuit Breaker, Chassis and Contract are other design patterns that can help. Another technique/practice is to decompose by domain models so that microservices reflect business functionality. A third aspect is to have data segregation within each microservice or for a group of related microservices, supplemented by a more traditional permanent data store for some applications. Non-relational databases, lightweight message queues, API gateways, and serverless platforms are speeding up migrations. Server-side JavaScript and newer languages such as Go are fast becoming the programming platforms of choice for developing self-sufficient services.
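
The strangler pattern mentioned in several of the answers above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the facade class, the handler functions, and the URL prefixes are all hypothetical names, and a real migration would usually put this routing in an API gateway or service mesh rather than in application code.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_handler(path: str) -> str:
    """Stand-in for the existing monolith (hypothetical)."""
    return f"legacy:{path}"


def orders_service(path: str) -> str:
    """Stand-in for a newly extracted microservice (hypothetical)."""
    return f"orders-service:{path}"


class StranglerFacade:
    """Sends a request to a migrated service when one is registered for
    the path prefix, and falls back to the legacy app otherwise."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        # Registering a prefix "strangles" that capability out of the monolith.
        self._migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # Longest matching prefix wins, so nested capabilities can be split off later.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        return self._legacy(path)


facade = StranglerFacade(legacy_handler)
facade.migrate("/orders", orders_service)
```

Calling facade.handle("/orders/42") reaches the new service, while any path that has not been migrated yet still goes to the legacy handler, which is what allows the decomposition to be gradual.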

Here’s who shared their insights:

Further Reading

TechTalks With Tom Smith: What Devs Need to Know About Kubernetes

A Guide to Cloud Migration

from DZone Cloud Zone

Reimagining Experimentation Analysis at Netflix

Toby Mao, Sri Sri Perangur, Colin McFarland

Another day, another custom script to analyze an A/B test. Maybe you’ve done this before and have an old script lying around. If it’s new, it’s probably going to take some time to set up, right? Not at Netflix.

ABlaze: The standard view of analyses in the XP UI

Suppose you’re running a new video encoding test and theorize that the two new encodes should reduce play delay, a metric describing how long it takes for a video to play after you press the start button. You can look at ABlaze (our centralized A/B testing platform) and take a quick look at how it’s performing.

Simulated dataset that shows what the distribution of play delay may look like. Note that the new encodes perform well in the lower quantiles but worse in the higher ones

You notice that the first new encode (Cell 2 — Encode 1) increased the mean of the play delay but decreased the median!
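
To see how a mean and a median can move in opposite directions, consider the small sketch below. The numbers are made up for illustration and are not taken from any Netflix test: most plays in the hypothetical new encode start faster, but two plays in the tail are much slower.

```python
import statistics

# Made-up play delays in milliseconds, purely illustrative.
control = [900, 950, 1000, 1050, 1100, 1150, 1200, 1250, 1300, 1350]
# Most plays start faster, but two plays in the tail are much slower.
encode_1 = [700, 750, 800, 850, 900, 950, 1000, 1050, 3000, 4000]

control_mean = statistics.mean(control)
control_median = statistics.median(control)
encode_mean = statistics.mean(encode_1)
encode_median = statistics.median(encode_1)

print(f"control:  mean={control_mean}, median={control_median}")
print(f"encode 1: mean={encode_mean}, median={encode_median}")
```

Here the mean rises from 1125 ms to 1400 ms while the median falls from 1125 ms to 925 ms, because the mean is pulled up by the two slow plays and the median is not.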

After recreating the dataset, you can plot the raw numbers and perform custom analyses to understand the distribution of the data across test cells.

With our new platform for experimentation analysis, it’s easy for scientists to perfectly recreate analyses on their laptops in a notebook. They can then choose from a library of statistics and visualizations or contribute their own to get a deeper understanding of the metrics.

Extending the same view of ABlaze with other contributed models and visualizations

Why it Matters

Netflix runs on an A/B testing culture: nearly every decision we make about our product and business is guided by member behavior observed in test. At any point, a Netflix user is in many different A/B tests orchestrated through ABlaze. This enables us to optimize their experience at speed. Our A/B tests range across UI, algorithms, messaging, marketing, operations, and infrastructure changes. A user might be in a title artwork test, a personalization algorithm test, or a video encoding test, or all three at the same time.

The analysis reports tell us whether or not a new experience made statistically significant changes to relevant metrics, such as member behavior, or technical metrics that describe streaming video quality. However, the default reports only provide a summary view of the data with some powerful but limited filtering options. Our data scientists often want to apply their knowledge of the business and statistics to fully understand the outcome of an experiment.

Instead of relying on engineers to productionize scientific contributions, we’ve made a strategic bet to build an architecture that enables data scientists to easily contribute.

The two main challenges with this approach are establishing an easy contribution framework and handling Netflix’s scale of data. When dealing with ‘big data’, it’s common to perform computation on frameworks like Apache Spark or Map Reduce. In order to reduce the learning curve of contributing analyses, we’ve decided to take an alternative path by performing all of our analyses on one machine. Due to compression and high performance computing, scientists can analyze billions of rows of raw data on their laptops using languages and statistical libraries they are familiar with like Python and R.

Challenges with Pre-existing Infrastructure

Netflix’s well-known experimentation culture was fueled by our previous infrastructure: an optimized framework that scaled to the wide variety of use cases across Netflix. But as our experimentation culture grew, so too did our product areas, users, and ambitions around more sophisticated methodology on measurement.

Our data scientists faced numerous challenges in our previous infrastructure. Complex business logic was embedded directly into the ETL pipelines by data engineers. In order to replicate results, scientists had to delve deep into the data, code, and documentation. Due to Netflix’s scale of over 150 million subscribers, scientists also frequently encountered issues while fetching data and performing custom statistical models in Python or R.

To offer new methods to the community and overcome any existing engineering barriers, scientists would have to run custom scripts outside of the centralized platform. Heavily used or high value scripts were sometimes converted into Shiny apps, allowing easy access to these novel features. However, because these apps lived separately from the platform, they could be difficult to maintain as the underlying data and platform evolved. Also, since these apps were generally written for specific use cases, they were difficult to generalize and graduate back into the platform.

Our scientists come from many backgrounds, such as neuroscience, biostatistics, economics, and physics; each of these backgrounds has a meaningful contribution to how experiments should be analyzed. Instead of spending their time wrangling data and conducting the same ad-hoc analyses multiple times, we would like our data scientists to focus on contributing new and innovative techniques for analyzing tests, such as Interleaving, Quantile Bootstrapping, Quasi Experiments, Quantile Regression, and Heterogeneous Treatment Effects. Additionally, as these new techniques are contributed, we want them to be effortlessly leveraged across the Netflix experimentation community.

Previous XP architecture: all systems are engineering-owned and not easily introspectable

Reimagining our Infrastructure: Democratization Across 3 Tracks

We are reimagining new infrastructure that makes the scientific development experience better. We’ve chosen to break down the contribution framework into 3 steps.

1. Getting Data with the Metrics Repo
2. Computing Statistics with Causal Models
3. Rendering Visualizations with Plotly

Democratization across 3 tracks: Metrics, Stats, Viz

The new architecture employs a modular design that permits data scientists to contribute using SQL, Python, and R, the tools of their trade. Users can contribute metrics and methods directly, without needing to master data engineering tools. We’ve also made sure that both production and local workflows use the same code base, so reproducibility is a given and promotion to production is just a pull request away.

New XP architecture: Systems highlighted in red are introspectable and contributable by data scientists

Getting data with Metrics Repo

Metrics Repo is an in-house Python framework where users define programmatically generated SQL queries and metric definitions. It centralizes metrics definitions which used to be scattered across many teams. Previously, many teams at Netflix had their own pipelines to calculate success metrics which caused a lot of fragmentation and discrepancies in calculations.

A key design decision of Metrics Repo is that it moves the last mile of metric computation away from engineering owned ETL pipelines into dynamically generated SQL. This allows scientists to add metrics and join arbitrary tables. The new architecture is much more flexible compared to the previous Spark based jobs. Views of reports are only calculated on demand and take a couple minutes to execute, so there are no migrations or backfills when making changes or updates to metrics. Adding a new metric is as easy as adding a new field or joining a different table in SQL. By leveraging PyPika, we represent each table as a Python class that can be customized with filters and additional joins. The code is self documenting and serializes to JSON so it can be easily exposed as an API.
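
The idea of representing a table as a Python class that can be customized with filters and joins, and that serializes to JSON, can be illustrated with a small sketch. This is not the Metrics Repo API and it does not use PyPika; the class, the method names, and the table and column names are all hypothetical, and the SQL is assembled with plain string formatting only to keep the example self-contained.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class MetricTable:
    """Hypothetical table wrapper that renders itself to SQL."""

    name: str
    columns: Tuple[str, ...]
    joins: Tuple[Tuple[str, str], ...] = ()  # (table, ON clause)
    filters: Tuple[str, ...] = ()

    def join(self, table: str, on: str) -> "MetricTable":
        # Each customization returns a new object, so base definitions stay reusable.
        return replace(self, joins=self.joins + ((table, on),))

    def where(self, condition: str) -> "MetricTable":
        return replace(self, filters=self.filters + (condition,))

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        for table, on in self.joins:
            sql += f" JOIN {table} ON {on}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

    def to_json(self) -> str:
        # The definition is plain data, so it can be exposed through an API.
        return json.dumps(asdict(self))


# Hypothetical table and column names.
playback = MetricTable(
    "playback_sessions",
    ("allocations.cell_id", "playback_sessions.play_delay_ms"),
)
query = playback.join(
    "allocations", "playback_sessions.account_id = allocations.account_id"
).where("allocations.test_id = 123")
sql = query.to_sql()
```

Because the SQL is generated on demand from the definition, adding a metric in a design like this is a change to the column tuple or the joins rather than a pipeline migration.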

Calculating Statistics with Causal Models

Causal Models is an in-house Python library that allows scientists to contribute generic models for causal inference. Previously, the centralized platform only had T-Test and Mann-Whitney while advanced statistical tests were only available via scripts or Shiny apps. Scientists can now add their statistical models by overriding two functions in a model subclass. Many of the models are simple wrappers over Scipy, but it’s flexible enough to do arbitrarily complex calculations. The library also provides helper methods which abstract accessing compressed or raw data. We use rpy2 so that models can be written in either R or Python.
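
A contribution of this kind might look roughly like the sketch below. The post does not name the two functions a model overrides, so the base class and its `fit` and `summary` hooks are assumptions made for illustration, and the statistic is computed with the Python standard library rather than Scipy to keep the example self-contained.

```python
import math
import statistics
from typing import Dict, Sequence


class CausalModel:
    """Hypothetical base class; the two hooks below are illustrative names."""

    def fit(self, control: Sequence[float], treatment: Sequence[float]) -> None:
        raise NotImplementedError

    def summary(self) -> Dict[str, float]:
        raise NotImplementedError


class WelchTTest(CausalModel):
    """Difference in means with Welch's t statistic (unequal variances)."""

    def fit(self, control: Sequence[float], treatment: Sequence[float]) -> None:
        self.diff = statistics.mean(treatment) - statistics.mean(control)
        # Squared standard error of the difference, using sample variances.
        se_squared = (
            statistics.variance(control) / len(control)
            + statistics.variance(treatment) / len(treatment)
        )
        self.t_stat = self.diff / math.sqrt(se_squared)

    def summary(self) -> Dict[str, float]:
        return {"difference_in_means": self.diff, "t_statistic": self.t_stat}


model = WelchTTest()
model.fit(control=[10.0, 12.0, 11.0, 13.0], treatment=[14.0, 15.0, 13.0, 16.0])
result = model.summary()
```

The point of the design is that the contributor only writes the comparison between one control and one treatment sample; how the data is fetched and how results are displayed stay outside the model.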

We do not want data scientists to have to go outside of their comfort zone by writing Spark Scala or Map Reduce jobs. We also want to leverage the large ecosystem of statistical libraries written in Python and R. However, many analyses have raw datasets that don’t fit on one machine. So, we’ve implemented an optional compression layer that drastically reduces the size of the data. Depending on the statistic, the compression can be either lossless or tunably lossy. Additionally, we’ve structured the API so that model implementors don’t need to distinguish between compressed and uncompressed data. When contributing a new statistical test, the data scientist only needs to think about one comparison computation at a time. We take the functions that they’ve written and parallelize it for them through multi-processing.
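
One simple way to picture lossless versus tunably lossy compression is to collapse raw rows into counts per distinct value. The sketch below is an assumption about how such a layer could work, not a description of the actual implementation; the function names and the bucketing scheme are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def compress(values: Iterable[float], bucket: float = 0) -> Dict[float, int]:
    """Collapse raw rows into a value -> count mapping.

    bucket == 0 keeps exact values (lossless for statistics such as the mean);
    bucket > 0 rounds each value to the nearest multiple of bucket
    (lossy, with the bucket width as the tuning knob).
    """
    if bucket > 0:
        values = (round(v / bucket) * bucket for v in values)
    return dict(Counter(values))


def weighted_mean(compressed: Dict[float, int]) -> float:
    """Mean computed from the compressed form only."""
    total = sum(compressed.values())
    return sum(value * count for value, count in compressed.items()) / total


raw = [100, 100, 100, 200, 200, 300] * 1000  # 6,000 raw rows
compressed = compress(raw)                    # 3 distinct values
lossy = compress([101, 149, 151], bucket=100)
```

In this example the 6,000 rows shrink to three entries and the mean computed from the compressed form matches the mean of the raw rows, while the bucketed variant trades precision for an even smaller representation.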

Sometimes statistical models are expensive to run even on compressed data. It can be difficult to efficiently perform linear algebra operations in native Python or R. In those cases, our mathematical engineering team writes custom C++ in order to speed through those bottlenecks. Our scientists can then reference them easily in Python via pybind11 or in R via Rcpp.

As a result, innovative methods like Quantile Bootstrapping and OLS with heterogeneous effects are no longer confined to notebooks and scripts that are not under version control. The barrier to entry for developing on the production system is very low, and sharing methods across metrics and business areas is effortless.
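
For readers unfamiliar with quantile bootstrapping, the sketch below shows a plain percentile bootstrap for the difference in a quantile between two cells. It is a textbook version written with the standard library, not the method as implemented in the platform; the function name, the nearest-rank quantile rule, and the simulated data are all assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_quantile_diff(
    control: Sequence[float],
    treatment: Sequence[float],
    q: float = 0.5,
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return a 95% percentile-bootstrap interval for
    quantile_q(treatment) - quantile_q(control)."""
    rng = random.Random(seed)

    def quantile(sample: Sequence[float]) -> float:
        ordered = sorted(sample)
        # Simple nearest-rank style index, clamped to the last element.
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    diffs = []
    for _ in range(n_resamples):
        # Resample each cell with replacement, keeping the original sizes.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(quantile(t) - quantile(c))

    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper


# Simulated play delays in milliseconds (illustrative only).
control_delays = list(range(1000, 1100))
treatment_delays = list(range(800, 900))
lower, upper = bootstrap_quantile_diff(control_delays, treatment_delays, q=0.5)
```

If the whole interval lies below zero, as it does for this simulated data, the chosen quantile of play delay is lower in the treatment cell.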

Rendering Visualizations with Plotly

In the old model, visualizations in the experimentation platform were created by UI engineers in React. The new architecture is still based on React, but we allow data scientists to contribute arbitrary graphs and plots using Plotly. We chose to use Plotly because it has a JSON specification that is implemented in many different frameworks and languages, including R and Python. Scientists can pick and choose from a wide variety of pre-made visualizations or create their own for others to use.
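
Because a Plotly figure is ultimately a JSON document with a list of traces under `data` and display options under `layout`, a contribution can be expressed as plain data. The sketch below builds such a spec with the standard library only; the function name, the cell labels, and the values are hypothetical.

```python
import json
from typing import Dict, Sequence


def play_delay_figure(cells: Sequence[str], medians_ms: Sequence[float]) -> Dict:
    """Return a Plotly-style figure spec as plain, JSON-serializable data."""
    return {
        "data": [
            {"type": "bar", "x": list(cells), "y": list(medians_ms)},
        ],
        "layout": {
            "title": {"text": "Median play delay by cell"},
            "yaxis": {"title": {"text": "Play delay (ms)"}},
        },
    }


# Illustrative values only.
figure_json = json.dumps(play_delay_figure(["Cell 1", "Cell 2"], [1125, 925]))
```

The same JSON string can be produced from R or Python and handed to whichever Plotly renderer the front end uses, which is the property that makes the format convenient for sharing visualizations across languages.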

This work kickstarted an initiative called Netflix Vizkit to create a cross-library shared design that lowers the barrier for a unified look and feel in contributions.

Many scientists at Netflix primarily use notebooks for day-to-day development, so we wanted to make sure they could perform A/B test analysis in them as well. To ensure that the analysis shown in ABlaze can be replicated in a notebook, we run the exact same code in both environments, even the visualizations!

Now scientists can easily introspect the data and extend it in an ad-hoc analysis. They can develop new metrics, statistical models, and visualizations in their notebooks and contribute them to the platform, knowing the results will be identical because their exact code will be running in production. As a result, anyone at Netflix looking at ABlaze can now view these new contributions when looking at test analyses.

XP: Combining contributions into analyses

Next Steps

We aim to accelerate research in causal inference methodology, expedite product innovation, and ultimately delight our members. We’re looking forward to enhancing our frameworks to tackle experimentation automation. This is an ongoing journey. If you are passionate about the field, we have opportunities to join our dream team!

Reimagining Experimentation Analysis at Netflix was originally published in Netflix TechBlog on Medium.

from Netflix TechBlog – Medium https://medium.com/netflix-techblog/reimagining-experimentation-analysis-at-netflix-71356393af21?source=rss—-2615bd06b42e—4

Top Resources for API Architects and Developers

We hope you’ve enjoyed reading our series on API architecture and development. We wrote about best practices for REST APIs with Amazon API Gateway and GraphQL APIs with AWS AppSync. This post will cover the top resources that all API developers should be aware of.

Tech Talks, Webinars, and Twitch Live Stream

The technical staff at AWS have produced a variety of digital media that cover new service launches, best practices, and customer questions. Be sure to review these videos for tips and tricks on building APIs:

  • Happy Little APIs: This is a multi-part series produced by our awesome Developer Advocate, Eric Johnson. He leads a series of talks that demonstrate how to build a real-world API.
  • API Gateway’s WebSocket webinar: API Gateway now supports real-time APIs with WebSockets. This webinar covers how to use this feature and why you should let API Gateway manage your real-time APIs.
  • Best practices for building enterprise grade APIs: API Gateway reduces the time it takes to build and deploy REST APIs, but there are strategies that can make development, security, and management easier.
  • An Intro to AWS AppSync and GraphQL: AppSync helps you build sophisticated data applications with realtime and offline capabilities.

Gain Experience With Hands-On Workshops and Examples

One of the easiest ways to get started with Serverless REST API development is to use the Serverless Application Model (SAM). SAM lets you run APIs and Lambda functions locally on your machine for easy development and testing.

For example, you can configure API Gateway as an Event source for Lambda with just a few lines of code:

Type: Api
Properties:
  Path: /photos
  Method: post
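
A Lambda function behind such an event could look like the following minimal Python sketch. The event and response shapes assume API Gateway’s Lambda proxy integration for REST APIs; the `name` field in the request body, the status codes chosen, and the response payload are illustrative assumptions rather than part of any AWS example.

```python
import json


def lambda_handler(event, context):
    """Handle POST /photos from an API Gateway proxy integration."""
    if event.get("httpMethod") != "POST":
        return {"statusCode": 405, "body": json.dumps({"error": "method not allowed"})}

    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON body"})}

    # Hypothetical: a real function would store the photo metadata here.
    return {
        "statusCode": 201,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"received": payload.get("name")}),
    }
```

With SAM, a handler like this can be exercised locally before deployment, which is the main appeal of the local development workflow described above.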

There are many great examples on our GitHub page to help you get started with Authorization (IAM, Cognito), Request, Response, various policies, and CORS configurations for API Gateway.

If you’re working with GraphQL, you should review the Amplify Framework. This is an official AWS project that helps you quickly build Web Applications with built in AuthN and backend APIs using REST or GraphQL. With just a few lines of code, you can have Amplify add all required configurations for your GraphQL API. You have two options to integrate your application with an AppSync API:

  1. Directly using the Amplify GraphQL Client
  2. Using the AWS AppSync SDK

An excellent walk through of the Amplify toolkit is available here, including an example showing how to create a single page web app using ReactJS powered by an AppSync GraphQL API.

Finally, if you are interested in a full hands on experience, take a look at:

  • The Amazon API Gateway WildRydes workshop. This workshop teaches you how to build a functional single page web app with a REST backend, powered by API Gateway.
  • The AWS AppSync GraphQL Photo Workshop. This workshop teaches you how to use Amplify to quickly build a Photo sharing web app, powered by AppSync.

Useful Documentation

The official AWS documentation is the source of truth for architects and developers. Get started with the API Gateway developer guide. API Gateway currently has two APIs (V1 and V2) for managing the service. Here is where you can view the SDK and CLI reference.

Get started with the AppSync developer guide, and review the AppSync management API.


As an API architect, your job is not only to design and implement the best API for your use case, but your job is also to figure out which type of API is most cost effective for your product. For example, an application with high request volume (“chatty“) may benefit from a GraphQL implementation instead of REST.

API Gateway currently charges $3.50 / million requests and provides a free tier of 1 Million requests per month. There is tiered pricing that will reduce your costs as request volume rises. AppSync currently charges $4.00 / million for Query and Mutation requests.

While AppSync pricing per request is slightly higher, keep in mind that the nature of GraphQL APIs typically results in significantly fewer overall requests.
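
The trade-off can be made concrete with a little arithmetic. The sketch below uses only the list prices quoted above and deliberately ignores tiered pricing, data transfer, and any other charges; the request volumes, and the assumption that three REST calls collapse into one GraphQL query, are hypothetical.

```python
API_GATEWAY_PER_MILLION = 3.50        # USD, from the figures quoted above
API_GATEWAY_FREE_REQUESTS = 1_000_000  # free requests per month
APPSYNC_PER_MILLION = 4.00            # USD, Query and Mutation requests


def api_gateway_cost(requests: int) -> float:
    """Monthly REST cost at the flat list price, after the free tier."""
    billable = max(0, requests - API_GATEWAY_FREE_REQUESTS)
    return billable / 1_000_000 * API_GATEWAY_PER_MILLION


def appsync_cost(requests: int) -> float:
    """Monthly GraphQL cost at the flat list price."""
    return requests / 1_000_000 * APPSYNC_PER_MILLION


# Hypothetical "chatty" client: 30M REST calls per month, where every
# three REST calls could be served by a single GraphQL query.
rest_requests = 30_000_000
graphql_requests = rest_requests // 3

rest_cost = api_gateway_cost(rest_requests)        # 29M billable requests
graphql_cost = appsync_cost(graphql_requests)      # 10M requests
```

Under these assumptions the REST design costs $101.50 per month and the GraphQL design $40.00, so the higher per-request price is more than offset by the lower request count. With a less chatty client the comparison can easily go the other way.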

Finally, we encourage you to join us in the coming weeks — we will be starting a series of posts covering messaging best practices.

About the Author

George Mao is a Specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. George is responsible for helping customers design and operate Serverless applications using services like Lambda, API Gateway, Cognito, and DynamoDB. He is a regular speaker at AWS Summits, re:Invent, and various tech events. George is a software engineer and enjoys contributing to open source projects, delivering technical presentations at technology events, and working with customers to design their applications in the Cloud. George holds a Bachelor of Computer Science and Masters of IT from Virginia Tech.

from AWS Architecture Blog

TechTalks With Tom Smith: VMworld Hybrid Cloud and Multicloud Conversations

Toys having conversation

Take a seat and see what everyone’s talking about

In addition to meeting with VMware executives and covering the keynotes, I was able to meet with a number of other IT executives and companies during the conference to learn what they and their companies were doing to make developers’ lives simpler and easier. 

You may also enjoy:  TechTalks With Tom Smith: VMware Courts Developers

Here’s what I learned:

Jim Souders, CEO of Adaptiva, announced the VMware edition of its OneSite peer-to-peer content distribution product, which works with VMware’s Workspace ONE product to distribute software from the cloud across enterprise endpoints with speed and scale. According to Jim, this will help developers to automate vulnerable processes, eliminating the need for developers to build scripts and tools so they can focus on DevOps rather than security.

Shashi Kiran, Chief Marketing Officer of Aryaka, shared key findings from their State of the WAN report, for which they surveyed more than 800 network and IT practitioners worldwide. Nearly 50% of enterprises are implementing a multi-cloud strategy and are leveraging 5+ providers and SaaS apps, with 15% having more than 1,000 apps deployed. The implication for developers is to develop with multi-cloud in mind.

Tom Barsi, SVP, Business & Corporate Development, Carbon Black, said the company currently monitors 15 million endpoints, but when it’s incorporated into VMware’s vSphere and Workspace ONE, the number of endpoints will grow by an order of magnitude for even greater endpoint detection and response, freeing developers to focus on building applications without concern for endpoint security.

Don Foster, Senior Director Worldwide Solutions Marketing, Commvault — with cloud migrations, ransomware attacks, privacy regulations and a multi-cloud world, Commvault is helping clients to be “more than ready” for the new era of IT. They are also humanizing the company as they hosted a Data Therapy Dog Park during the conference.

Jonathan Ellis, CTO & Co-founder, and Kathryn Erickson, Director of Strategic Partnerships, DataStax —  spoke on the ease of deploying new applications in a hyper-converged infrastructure (HCI) with the key benefits of management, availability, better security, and consistent operations across on-premises and cloud environments. If DSE/Cassandra is on K8s or VMware, developers can access additional resources without going through DBAs or the infrastructure team.

Scott Johnson, Chief Product Officer, Docker — continuing to give developers a choice of languages and platforms while providing operations a unified pipeline. Compose helps developers assemble multiple container operations without rewriting code. Docker also helps improve the security of K8s deployments by installing with smart defaults along with image scanning and digital signatures.

Mark Jamensky, E.V.P. Products, Embotics, said cloud management platforms help with DevOps pipelines and accelerate application development by providing automation, governance, and speed under control. They enable developers to go fast with guardrails.

Duan van der Westhuizen, V.P. of Marketing, Faction — shared the key findings of their first VMware Cloud on AWS survey. 29% of respondents plan to start running or increase workloads in the next 12 months. Tech services, financial services, education, and healthcare are the industries showing the most interest. The key drivers are scalability, strategic IT initiative, and cost savings and the top use cases are data center extension, disaster recovery, and cloud migration.

Ambuj Kumar, CEO and Nishank Vaish, Product Management, Fortanix — discussed developers’ dislike of hardware security modules (HSM) and how self-defending key management services (SDKMS) can securely generate, store, and use cryptographic key and certificates, as well as secrets like passwords, API keys, tokens, and blobs of data to achieve a consistent level of high performance. 

Sam Kumasamy, Senior Product Marketing Manager, Gigamon — owns 38% of all network visibility market share, giving them the ability to see all traffic across virtual and physical environments and to provide visibility and analytics for digital apps and services in any cloud, container, K8s cluster, or Docker presence, enabling applications to run fast while remaining secure.

Stan Zaffos, Senior V.P. Product Marketing, and Gregory Touretsky, Technical Product Manager and Solutions Architect, Infinidat, are helping clients achieve multi-petabyte scale as more applications on more devices are generating more data. They are building out their developer portal to enable developers to leverage APIs and share code, use cases, and solutions.

Rich Petersen, President/Co-founder, JetStream Software, is helping move virtual machines to the cloud for running applications, storage systems, and disaster recovery platforms. The I/O filters provide the flexibility to copy data to a physical device for shipment and to send only newly written data over the network, resulting in a near-zero recovery time objective (RTO) and recovery point objective (RPO).

Josh Epstein, CMO, Kaminario — is providing a Storage-as-a-Service (STaaS) platform enabling developers to think about stored service as a box with shared storage arrays with data reduction/compression, deduplication, data mobility, data replication, orchestration, and the ability to spin up storage instances using K8s at a traditional data center or in a public cloud with the public API framework.

Kevin Deierling, Vice President, Marketing, Mellanox Technologies — provides remote direct memory access (RDMA) networking solutions to enable virtualized machine learning (ML) solutions that achieve higher GPU use and efficiency. Hardware compute accelerators boost app performance in virtualized deployments.

Adam Hicks, Senior Solution Architect, Morpheus — provides a multi-cloud management platform for hybrid IT and DevOps automation for unified multi-cloud container management. The platform reduces the number of tools developers and operations need to automate development and deployment. Compliance is achieved with role-based access, approvals, quotas, and policy enforcement. Agile DevOps with self-service provisioning with APIs. Manage day-2 operations: scaling, logging, monitoring, backup, and migration.

Ingo Fuchs, Chief Technologist, Cloud and DevOps, NetApp —  as new cloud environments promote greater collaboration between developers and IT operations, NetApp is enabling the cloning of standardized developer workspaces so developers can get up and running quickly, new workspace versions can be rolled out simultaneously, standard datasets can be made available for DevTest, and developers are able to go back to a previous state with a single API call if they need to correct a mistake. 

Kamesh Pemmaraju, Head of Product Marketing, Platform9 — provides a SaaS-managed Hybrid Cloud solution that delivers fully automated day-2 operations with a 99.9% SLA for K8s, bare-metal, and VM-based environments. Developers get a public-cloud experience with databases, data services, open-source frameworks, Spark, and more deployed with a single click.

Mike Condy, System Consultant, Quest, is focusing on monitoring, operations, and cloud from an IT management perspective. They are optimized for, and provide support for, K8s and Swarm and enable clients to compare on-premises to cloud to identify the optimal placement of workloads from a performance and cost perspective. They enable clients to see how containers are performing, interacting, and scaling up or down with heat maps and optimization recommendations.

Peter FitzGibbon, V.P. Product Alliances, Rackspace — consistent with the move to hybrid-cloud environments, Rackspace is supporting customers and developers by providing new offerings around managed VMware Cloud on AWS, K8s and container services, cloud-native support, managed security, and integration and API management assessment. Peter also felt that the Tanzu announcement would help to bring technology and people together.

Roshan Kumar, Senior Product Marketing Manager, Redis Labs — developers tend to be the first adopters of Redis for the cloud or Docker since the database can hold data in specific structures with no objects or relational standing. Data is always in state. The database works with 62 programming languages. It addresses DevOps and Ops concerns for backup and disaster recovery with high availability, reliability, and scalability. While AWS has had more than 2,000 node failures, no data has been lost on Redis due to their primary and backup servers.

Chris Wahl, Chief Technologist and Rebecca Fitzhugh, Principal Technologist, Rubrik — have focused on developers for the last three to four years and provide an environment to serve the needs of different application environments. Rubrik Build is an open-source community helping to build the future of cloud data management and supporting a programmatic approach to automation for developers and infrastructure engineers with an API-first architecture.

Mihir Shah, CEO and Surya Varanasi, CTO, StorCentric for Nexsan — provide purpose-built storage for backup, databases, and secure archive to ensure compliance, protect against ransomware and hackers, and provide litigation support. Maintain data integrity by using a combination of two cryptographic hashes for unique identification. Enables developers to move seamlessly to the cloud and automate data compliance using APIs.

Mario Blandini, CMO & Chief Evangelist, Tintri by DDN — Virtualization is the new normal. More than 75 percent of new workloads are now virtualized, and companies are beginning to make significant investments in virtual desktop infrastructure (VDI). Tintri enables developers to manage their virtual machines through automation, isolate traffic between virtual machines, and use self-service automation in the development phase.

Danny Allen, V.P., Product Strategy, Veeam — moving to a subscription model to provide cloud data availability and backup which is critical for applications and containers. Agility to move workloads from infrastructure to cloud-agnostic portable data storage. Acceleration of business data and backup to containers. Enables DevOps to iterate on existing workloads with the ability to run scripts to mask data.

Steve Athanas, President, VMware User Group (VMUG) — The active community of more than 150,000 members works to connect users, share stories, and solve problems. Steve says, “if you have VMware in your stack, there’s probably a VMUG member in your company.” He would love for more developers to become involved in VMUG to help others understand how dev and ops can work better together, address each others’ pain, and solve business problems. 

Nelson Nahum, Co-founder & CEO and Greg Newman, V.P., Marketing, Zadara — offers NVMe-as-a-Service to make storage as simple as possible. It includes optimized on-premises object storage-as-a-service for big data analytics, AI/ML, and video-on-demand. Developers can try Zadara risk-free.

Further Reading

3 Pitfalls Everyone Should Avoid with Hybrid Multicloud (Part 1)

Solving for Endpoint Compliance in a Cloud-First Landscape

from DZone Cloud Zone