Category: DevOps

Major Wholesaler Grows Uptime by Refactoring eComm Apps for AWS DevOps

A recent IDC survey of the Fortune 1000 found that the average cost of an infrastructure failure is $100,000 per hour and that the average total cost of unplanned application downtime per year is between $1.25 billion and $2.5 billion. Our most recent customer relies heavily on its eCommerce site for business and, knowing how costly an infrastructure failure would be, turned to cloud-based DevOps. The firm sought to increase uptime, scalability, and security for its eCommerce applications by refactoring them for AWS DevOps.

What is Refactoring?

Refactoring is the process of re-architecting, and often re-coding, some portion of an existing application to take advantage of cloud-native frameworks and functionality. While this approach can be time-consuming and resource-intensive, it tends to yield a low monthly cloud spend, because organizations that refactor can modify their applications and infrastructure to take full advantage of cloud-native features and thereby maximize operational cost efficiencies in the cloud.

AWS DevOps Refactoring

The company engaged the DevOps consulting team at Flux7 to help architect and build a DevOps platform solution. The team’s first goal was to ensure that the applications were architected for high availability at all levels in order to meet the company’s aggressive SLA goals. The first step was to build a common DevOps platform for the company’s eCommerce applications and migrate the underlying technology to a common stack consisting of ECS, CloudFormation, and GoCD, an open source build and release tool from ThoughtWorks. (In the process, the team migrated one of the two applications from Kubernetes and Terraform to the new technology stack.)

As business-critical applications for the future of the retailer, the eCommerce applications needed to provide greater uptime, scalability, and data security than the legacy, on-premises applications from which they were refactored. As a result, the AWS experts at Flux7 built a CI/CD platform using AWS DevOps best practices, effectively reducing manual tasks and thereby increasing the team’s ability to focus on strategic work.

Further, the Flux7 DevOps team worked alongside the retailer’s team to:

  • Migrate the refactored applications to new AWS Accounts using the new CI/CD platform;
  • Automate remediation, recovering from failures faster;
  • Create AWS Identity and Access Management (IAM) resources as infrastructure as code (IaC);
  • Deliver the new applications in a Docker container-based microservices environment;
  • Deploy CloudWatch and Splunk for security and log management; and
  • Create DR procedures for the new applications to further ensure uptime and availability.

Moving forward, application updates will be rolled out via a blue-green deployment process that Flux7 helped the firm establish in order to achieve its zero downtime goals.

Business Benefits

While the customer team is a very advanced developer team, they were able to further their skills, learning through Flux7 knowledge transfer sessions how to enable DevOps best practices and continue to accelerate the new AWS DevOps platform adoption. At an estimated downtime cost of 6x the industry average, this firm couldn’t withstand the financial or reputational impact of a downtime event. As a result, the team is happy to report that it is meeting its zero downtime SLA objectives, enabling continuous system availability and with it growing customer satisfaction.

Subscribe to the Flux7 Blog

from Flux7 DevOps Blog

AWS Case Study: Energy Leader Digitizes Library for Analytics, Compliance

The oil and gas industry has a rich history and one that is deeply intertwined with regulation, with Federal and State rules that regulate everything from exploration to production and transportation to workplace safety. As a result, our latest customer had amassed millions of paper documents to ensure its ability to prove compliance. It also maintained files with vast amounts of geological data that served as the backbone of its intellectual property.

With over seven million physical documents saved and filed in deep storage, this oil and gas industry leader called the AWS consulting services team at Flux7 for its help digitizing its vast document library. In the process, it also wanted to make it easy to archive documents moving forward, and ensure that its operators could easily search for and find data.

Read the full AWS Case Study here.

Working with AWS Consulting Partner Flux7, the company created a working plan to digitize and catalog its vast document library. AWS had recently announced at re:Invent a new tool, Amazon Textract, which although still in preview mode, was the ideal tool for the task.

What is Textract?

For those of you unfamiliar with Amazon Textract, it is a new service that uses machine learning to automatically extract text and data from scanned documents. Unlike Optical Character Recognition (OCR) solutions, it also identifies the contents of fields in forms and information stored in tables, which allows users to conduct full data analytics on documents once they are digitized.

The Textract Proof of Concept

The proof of concept included several dozen physical documents that were scanned and uploaded to S3. Each upload triggered Lambda functions, which in turn launched Textract. The extracted data is presented to users in Kibana, along with URLs for the specific documents.
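
The trigger chain described above can be sketched roughly as follows. This is a minimal illustration, not the customer's implementation: it assumes the standard S3 event notification shape and boto3's Textract `analyze_document` call, while the function names, the injectable `textract` parameter, and the choice to return only LINE text are assumptions made for the example. Indexing the results for Kibana is left out.

```python
from urllib.parse import unquote_plus


def extract_lines(bucket, key, textract):
    """Run Textract on one S3 object and return the lines of text it found."""
    response = textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],  # also detect form fields and tables
    )
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def handler(event, context, textract=None):
    """Hypothetical Lambda entry point for S3 'object created' notifications."""
    if textract is None:
        import boto3  # lazy import so the sketch can be exercised without AWS

        textract = boto3.client("textract")
    results = {}
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results[key] = extract_lines(bucket, key, textract)
    return results
```

Passing the client in as a parameter keeps the handler easy to test with a stub. A large archive of multi-page scans would more likely need Textract's asynchronous operations than this synchronous call.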

As Amazon Textract automatically detects the key elements in a document or data relationships in forms and tables, it is able to extract data within the context it was originally created. With a core set of key parameters, such as revision date, extracted by Textract, operators will be able to search by key business parameters.

Analytics and Compliance

Interfacing with the data via Kibana, end users can now create smart search indexes which allow them to quickly and easily find key business data. Moreover, operators can build automated approval workflows and better meet document archival rules for regulatory compliance. In addition, the company no longer needs to send an employee by car to retrieve files from the warehouse, saving time on a labor-intensive task.

At Flux7, we relish the ability to help organizations apply automation and free their employees from manual tasks, giving them time to focus on strategic, business-impacting work. Read more Energy industry AWS case studies for best practices in cloud-based DevOps automation for enterprise agility.

For five tips on how to apply DevOps in your Oil, Gas or Energy enterprise, check out this article our CEO, Dr. Suleman, recently wrote for Oilman magazine. (Note that a free subscription is required.) Or, download the full case study here today.

from Flux7 DevOps Blog

IT Modernization and DevOps News Week in Review

The Uptime Institute announced findings of its ninth annual Data Center Survey, unveiling several interesting — and important — data points. Underscoring what many in the industry are feeling about the skill gap, the survey found that 61% of respondents said they had difficulty retaining or recruiting staff — up from 55% a year earlier. And, according to the synopsis, “while the lack of women working in data centers is well-known, the extent of the imbalance is notable” with one-quarter of respondents saying they had no women at all on their design, build or operations teams.

When it comes to downtime, outages continue to cause significant problems. Showing little improvement over the prior year, 34% of respondents said they had an outage or severe IT service degradation in the past year, and 10% said their most significant outage cost more than $1 million. When it comes to public cloud, 20% of operators reported that they would be more likely to put workloads in a public cloud if there were more visibility, while 50% of respondents already using public cloud for mission-critical applications said that they do not have adequate visibility.

DevOps News

  • Atlassian has announced Status Embed, a service designed to boost customer experience and communication by displaying the current state of services where customers are most likely to see it, such as your homepage, app or help center.
  • GitHub has brought to market repository templates to make boilerplate code management and distribution a “first-class citizen” on GitHub, according to the company.
  • HashiCorp announced the availability of HashiCorp Nomad 0.9.2, a workload orchestrator for deploying containerized and legacy apps across multiple regions or cloud providers. Nomad 0.9.2 includes preemption capabilities for service and batch jobs.
  • SDXCentral reports that, “VMware is developing a multi-cloud management tool that Joe Kinsella, chief technology officer of CloudHealth at VMware, describes as ‘Google docs for IT management, which is the ability to collaborate and share across an organization.’”

AWS News

  • Amazon announced that AWS Organizations now supports tagging and untagging of AWS accounts, allowing operators to assign custom attributes, or tags, to the AWS accounts they manage with AWS Organizations. According to AWS, the ability to attach tags such as owner name, project, business group, cost center, environment, and other values directly to an AWS account makes it easier for people in the organization to get information on particular AWS accounts without having to refer to a separate spreadsheet or other out-of-band method for tracking them.
  • Also introduced this week is AWS Systems Manager OpsCenter which is designed to help operators view, investigate, and resolve operational issues related to their environment from a central location.
  • Amazon has launched a new service to enhance recovery. Host Recovery for Amazon EC2 will now automatically restart instances on a new host in the event of an unexpected hardware failure on a Dedicated Host. Host Recovery will reduce the need for manual intervention, minimize recovery time and lower the operational burden for instances running on Dedicated Hosts. As a bonus, it has built-in integration with AWS License Manager to automatically track and manage licenses. There are no additional EC2 charges for using Host Recovery.
  • Last, our AWS Consulting team thought this foundational blog on Getting started with serverless was a good read for those of you looking to build serverless applications to take advantage of its agility and reduced TCO.

Flux7 News

  • Join AWS and Flux7 as they present a one day workshop on how Serverless Technology is impacting business now (and what you need to get started). Serverless technology on AWS is enabling companies by building modern applications with increased agility and lower total cost of ownership. Find additional information and register here.
  • Read CEO Dr. Suleman’s InformationWeek article, Five-Step Action Plan for DevOps at Scale in which he discusses how DevOps is achievable at enterprise scale if you start small, create a dedicated team and effectively use technology patterns and platforms.
  • Also published this week is Dr. Suleman’s take on Servant Leadership, as published in Forbes. In Why CIOs Should Have A Servant-Leadership Approach he shares why CIOs shouldn’t be in a position where they end up needing to justify their efforts. Read the article for the reason why. (No, it isn’t the brash conclusion you might think it is.)

Written by Flux7 Labs

Flux7 is the only Sherpa on the DevOps journey that assesses, designs, and teaches while implementing a holistic solution for its enterprise customers, thus giving its clients the skills needed to manage and expand on the technology moving forward. Not a reseller or an MSP, Flux7 recommendations are 100% focused on customer requirements and creating the most efficient infrastructure possible that automates operations, streamlines and enhances development, and supports specific business goals.

from Flux7 DevOps Blog

Digital Transformation & The Agile Enterprise in Oil and Gas

According to the World Economic Forum, digital transformation could unlock approximately $1.6 trillion of value for the Oil and Gas industry, its customers and society. This value is derived from greater productivity, better system efficiency, savings from reduced resource usage, and fewer spills and emissions. Yet, the journey to these digital transformation benefits begins with a proverbial first step which can be elusive for large oil and gas enterprises who have vast legacy technologies and complicated organizational structures to navigate.

At Flux7, we are proponents of the Agile Enterprise. While much work has been put into defining what makes an enterprise agile, we are fans of the research by McKinsey, which found a set of five disciplines that agile enterprises share. Defined by their practices more than anything else, these agile organizations deploy an agile culture and agile technology to effectively support their digital transformation initiatives.

Becoming an Agile Enterprise is critically important within the oil and gas industries where unparalleled transformation is happening in rapid fashion. From new extraction methods to IoT and changing customer expectations, the industry is evolving quickly. For long-term, scalable success, digital efforts must be a cornerstone as organizations transition to becoming an Agile Enterprise.

DevOps for Oil and Gas

Equal parts people, process and technology, DevOps is a key component of marrying digital and agile. With a solid cloud-based DevOps platform, automation to streamline processes and ensure they are followed, and a Center of Excellence in place to help train teams, oil and gas enterprises have a roadmap to digital transformation success with DevOps.

For a more detailed road map to DevOps success across the enterprise, please download our white paper:

5 Steps to Enterprise DevOps at Scale

Let’s explore a few examples of organizations in the energy industry that have applied DevOps best practices to facilitate digital transformation and reach greater enterprise agility:

TechnipFMC, a world leader in project management, engineering, and construction for the energy industry, was looking to ensure compliance and security for cloud computing for its global sites and the perimeter networks that support its client-facing applications. To help accomplish this goal, TechnipFMC wanted to create a consistent, self-service solution to enable its global IT employees to easily provision cloud infrastructure and migrate externally facing Microsoft SharePoint sites to the cloud. With templates and automation, TechnipFMC can now enforce security and compliance standards in every deployment, which enhances overall perimeter network security. In addition, TechnipFMC is expecting to reduce operational costs while growing operational effectiveness. Listen as TechnipFMC’s John Hutchinson shares the experience at re:Invent or read the full Technip story.

A renewable energy leader had two parallel goals: it wanted to use an AWS cloud migration strategy as an opportunity to overhaul its business systems and, in the process, build standardization. Moreover, it aimed to increase developer agility, grow global access for its workers and decrease capital expenses. Based on its application portfolio TCO analysis, a lift and shift migration approach was pursued. With 80% of its applications now defined by a small number of templates, the company has standardized its software builds, ensuring security best practices are followed by default. The enterprise has shortened its time to innovation and improved its speed to market and operational efficiencies. Preview their story here.

Fugro, which collects and provides highly specialized interpretation of oceanic geological data, is able to keep skilled staff onshore using an Internet of Things (IoT) platform model. Called OARS, its cloud-based project provides faster interpretation of data and decisions. With continuous delivery of code, its vessels are sure to always have the newest software features at their fingertips. And, new environments which previously took weeks to build, now launch in a matter of hours, providing better access to information across global regions. Read the full Fugro case study here.

A global oil field services company was looking to embrace digitalization with a SaaS model solution that would integrate data and business process management and, in the process, address operational workflows to achieve greater scalability and more efficient delivery. The firm implemented a pipeline for delivering AMIs that are provisioned using Ansible and Docker containers, thereby streamlining complex workflows, allowing the firm to reap efficiencies of scale from automation, meet tight deadlines and ensure SOC2 compliance. Now the firm has pipelines for delivering resources and processes to build and deploy current and future solutions, ensuring digital transformation in the short and long term.

We are living in an uncertain, complex and constantly changing world. To stay competitive, oil and gas enterprises are expected to react to changes at unprecedented speed, which has ushered in a strong focus on becoming an agile enterprise. Effectively balance stability with ever-evolving customer needs, technologies, and overall market conditions with DevOps best practices as your foundation to scalable digital transformation.

For five tips on how to apply DevOps in your Oil, Gas or Energy enterprise, check out this article our CEO, Dr. Suleman, recently wrote for Oilman magazine. (Note that a free subscription is required.)  Or, you can find additional resources on our Energy resource page.

from Flux7 DevOps Blog

Upskill Your Team to Address the Cloud, Kubernetes Skills Gap

This article originally appeared on Forbes.

According to CareerBuilder’s Mid Year Job Forecast, 63% of U.S. employers planned to hire full-time, permanent workers in the second half of 2018. This growing demand coupled with low unemployment is driving a real talent shortage. The technology field, in particular, is experiencing acute pain when it comes to finding skilled talent. Indeed, more than five million IT jobs are expected to be added globally by 2027, reports BusinessInsider.

Of these five million jobs, the two most requested tech skills, according to research by DICE, are Kubernetes and Terraform. The company also found that DevOps Engineer has quickly moved up the ranks of the top-paid IT careers. As companies invest in IT modernization with approaches like Agile and DevOps and technologies like cloud computing and containers, skills to support these initiatives are in increasing demand.

The problem is not set to get better in the near or mid-term with many companies reporting that it’s taking longer to find candidates with the right technology and business skills for driving digital innovation. A survey by OpsRamp found that 94% of HR departments take at least 30 days to fill an empty position and 25% report taking 90 days or more. With internal pressures for innovation that won’t wait out a protracted hiring process, I encourage leaders to look internally, using two key levers to help grow innovation.

Upskill Your Team

One way to work around a skills gap within the organization is to upskill the team. Rather than hiring a new headcount that is already difficult to find, a solution is to train your existing team. (Or a few members of the team who can in turn train others.) While there are a variety of training options — from classroom training to virtual classes and more — at Flux7, our experience has shown that hands-on training works best for technical skills like Terraform or Kubernetes. 

Specifically, a successful model consists of the following:

  • Find a coach that can work hand-in-hand with your team
  • Identify a small but impactful project for the coach and team to work on together with the goal of having the coach train the team along the way
  • Start the project with the coach taking the initial lead sharing what they are doing, why and how with your team shadowing
  • Slowly transition over the course of the project to the coach assigning tasks to your team, with your employees ultimately leading tasks and checking in with the coach as needed.

In this way, teams are able to learn in a practical, hands-on manner, taking ownership of the environment as they learn and grow — all while having access to an expert who can guide, correct and reinforce learning.

In addition to gaining much-needed skills in-house, upskilling your existing team has retention benefits. In a survey of tech professionals by DICE, 71% said that training and education are important to them, yet only 40% currently have company-paid training and education. Underscoring the importance of training to technologists, 45% who are satisfied with their job receive training; conversely, only 28% of those who are dissatisfied with their job receive training.

Grow Productivity with Automation

In addition to upskilling your team, automation is important to continue to expand your capacity. Approaches like DevOps embrace the use of automation to create continuous integration and delivery, in the process reducing handoffs and speeding time to market. In addition, the use of automation can keep employees from working on tactical, repeatable tasks and instead keep them focused on strategic, business-impacting work.

Let me give you an example. I recently had the opportunity to work with a large semiconductor company that sought to bolster its team’s cloud, container and Kubernetes talents in order to support a new AWS initiative. Working hands-on in the cloud to automate its pipelines and other processes, the company was able to cut tasks that formerly took days down to mere minutes.

In addition to working elbow-to-elbow with a cloud coach on the project, the company also initiated weekly knowledge transfer sessions to the team to ensure everyone had received the same level of training and was ready for the next week’s work. At the end of the project, the team was ready to train others in the organization and felt confident that they were building better products faster as their time was focused less on tactical work and more on making a strategic impact. Another benefit to the team — and company as a whole — is that by taking a cross-functional DevOps approach, employees felt that communication improved making their work more enjoyable.

In a recent poll of over 70,000 developers, HackerRank found that salary wasn’t the lead driver of what they look for in a job. Rather, the most important factors for developers, across all job levels and functions, were the opportunity for professional growth and the opportunity to work on interesting problems. The application of automation not only increases developer productivity and code throughput but also provides the space to work on interesting projects, which leads to greater job satisfaction and retention.

With competition growing for employees skilled in Kubernetes, Terraform, DevOps and more, growing your own is an increasingly attractive approach. UC Berkeley found that the average cost to hire a new professional employee may be as high as $7,000 (while replacement costs can be as great as 2.5x salary) not to mention lost opportunity costs as organizations place projects on hold as they vie to find skilled talent. Upskilling employees, combined with greater automation, can increase code throughput and get more projects to market faster, maximizing near-term opportunity. Just as importantly, presenting employees with new skills and the opportunity to work on interesting work has proven to increase job satisfaction and retention.

Learn more about addressing the skills gap, building cloud-native infrastructure and more on the Flux7 DevOps blog. Subscribe today:

Subscribe to the Flux7 Blog

from Flux7 DevOps Blog

Predictive CPU isolation of containers at Netflix

By Benoit Rostykus, Gabriel Hartmann

Noisy Neighbors

We’ve all had noisy neighbors at one point in our life. Whether it’s at a cafe or through a wall of an apartment, it is always disruptive. The need for good manners in shared spaces turns out to be important not just for people, but for your Docker containers too.

When you’re running in the cloud your containers are in a shared space; in particular they share the CPU’s memory hierarchy of the host instance.

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. However, the key insight here is that these caches are partially shared among the CPUs, which means that perfect performance isolation of co-hosted containers is not possible. If the container running on the core next to your container suddenly decides to fetch a lot of data from the RAM, it will inevitably result in more cache misses for you (and hence a potential performance degradation).

Linux to the rescue?

Traditionally it has been the responsibility of the operating system’s task scheduler to mitigate this performance isolation problem. In Linux, the current mainstream solution is CFS (Completely Fair Scheduler). Its goal is to assign running processes to time slices of the CPU in a “fair” way.

CFS is widely used and therefore well tested, and Linux machines around the world run with reasonable performance. So why mess with it? As it turns out, for the large majority of Netflix use cases, its performance is far from optimal. Titus is Netflix’s container platform. Every month, we run millions of containers on thousands of machines on Titus, serving hundreds of internal applications and customers. These applications range from critical low-latency services powering our customer-facing video streaming service, to batch jobs for encoding or machine learning. Maintaining performance isolation between these different applications is critical to ensuring a good experience for internal and external customers.

We were able to meaningfully improve both the predictability and performance of these containers by taking some of the CPU isolation responsibility away from the operating system and moving towards a data driven solution involving combinatorial optimization and machine learning.

The idea

CFS operates by very frequently (every few microseconds) applying a set of heuristics which encapsulate a general concept of best practices around CPU hardware use.

Instead, what if we reduced the frequency of interventions (to every few seconds) but made better data-driven decisions regarding the allocation of processes to compute resources in order to minimize collocation noise?

One traditional way of mitigating CFS performance issues is for application owners to manually cooperate through the use of core pinning or nice values. However, we can automatically make better global decisions by detecting collocation opportunities based on actual usage information. For example if we predict that container A is going to become very CPU intensive soon, then maybe we should run it on a different NUMA socket than container B which is very latency-sensitive. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.

Optimizing placements through combinatorial optimization

What the OS task scheduler is doing is essentially solving a resource allocation problem: I have X threads to run but only Y CPUs available, how do I allocate the threads to the CPUs to give the illusion of concurrency?

As an illustrative example, let’s consider a toy instance of 16 hyperthreads. It has 8 physical hyperthreaded cores, split on 2 NUMA sockets. Each hyperthread shares its L1 and L2 caches with its neighbor, and shares its L3 cache with the 7 other hyperthreads on the socket:

If we want to run container A on 4 threads and container B on 2 threads on this instance, we can look at what “bad” and “good” placement decisions look like:

The first placement is intuitively bad because we potentially create collocation noise between A and B on the first 2 cores through their L1/L2 caches, and on the socket through the L3 cache while leaving a whole socket empty. The second placement looks better as each CPU is given its own L1/L2 caches, and we make better use of the two L3 caches available.

Resource allocation problems can be efficiently solved through a branch of mathematics called combinatorial optimization, used for example for airline scheduling or logistics problems.

We formulate the problem as a Mixed Integer Program (MIP). Given a set of K containers each requesting a specific number of CPUs on an instance possessing d threads, the goal is to find a binary assignment matrix M of size (d, K) such that each container gets the number of CPUs it requested. The loss function and constraints contain various terms expressing a priori good placement decisions such as:

  • avoid spreading a container across multiple NUMA sockets (to avoid potentially slow cross-sockets memory accesses or page migrations)
  • don’t use hyper-threads unless you need to (to reduce L1/L2 thrashing)
  • try to even out pressure on the L3 caches (based on potential measurements of the container’s hardware usage)
  • don’t shuffle things too much between placement decisions
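
To make the placement terms above concrete, here is a minimal sketch that scores placements on the toy 16-hyperthread instance and finds the best one by exhaustive search. It is not the MIP itself: the weights, the function names, and the use of thread counts as a stand-in for L3 pressure are all assumptions made for illustration, the search assumes each hyperthread goes to at most one container, and the "don't shuffle" term is omitted because it needs a previous placement to compare against.

```python
from itertools import combinations

SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE = 2, 4, 2
THREADS_PER_SOCKET = CORES_PER_SOCKET * THREADS_PER_CORE
ALL_THREADS = range(SOCKETS * THREADS_PER_SOCKET)  # 16 hyperthreads

# Illustrative weights only; the article does not give the real loss terms.
W_SOCKET, W_HYPERTHREAD, W_L3 = 10, 5, 1


def core_of(thread):
    return thread // THREADS_PER_CORE


def socket_of(thread):
    return thread // THREADS_PER_SOCKET


def cost(placement):
    """Score a placement {container: tuple of hyperthread ids}; lower is better."""
    used = [t for threads in placement.values() for t in threads]
    # 1. Penalize containers spread across NUMA sockets.
    spread = sum(len({socket_of(t) for t in ts}) - 1 for ts in placement.values())
    # 2. Penalize cores whose two hyperthreads are both busy (L1/L2 sharing).
    cores = [core_of(t) for t in used]
    shared_cores = len(cores) - len(set(cores))
    # 3. Penalize uneven load across sockets (a crude proxy for L3 pressure).
    per_socket = [sum(1 for t in used if socket_of(t) == s) for s in range(SOCKETS)]
    imbalance = max(per_socket) - min(per_socket)
    return W_SOCKET * spread + W_HYPERTHREAD * shared_cores + W_L3 * imbalance


def placements(requests, free):
    """Yield every way to give each (name, n_cpus) request n distinct free threads."""
    if not requests:
        yield {}
        return
    (name, n), rest = requests[0], requests[1:]
    for threads in combinations(free, n):
        remaining = [t for t in free if t not in threads]
        for tail in placements(rest, remaining):
            yield {name: threads, **tail}


def best_placement(requests):
    return min(placements(requests, list(ALL_THREADS)), key=cost)


best = best_placement([("A", 4), ("B", 2)])
print(best, cost(best))
```

With these weights the search puts A on four separate cores of one socket and B on two separate cores of the other. Exhaustive search only works because the instance is tiny, which is why a real system would reach for a solver.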

Given the low-latency and low-compute requirements of the system (we certainly don’t want to spend too many CPU cycles figuring out how containers should use CPU cycles!), can we actually make this work in practice?


We decided to implement the strategy through Linux cgroups since they are fully supported by CFS, by modifying each container’s cpuset cgroup based on the desired mapping of containers to hyper-threads. In this way a user-space process defines a “fence” within which CFS operates for each container. In effect we remove the impact of CFS heuristics on performance isolation while retaining its core scheduling capabilities.
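
As a rough sketch of what "modifying a container's cpuset" involves, the snippet below formats a set of hyperthread ids in the kernel's CPU list format and writes it to a cgroup's `cpuset.cpus` file. The helper names and the directory passed in are hypothetical; where a container's cgroup lives depends on the runtime and cgroup version, and nothing here is taken from titus-isolate itself.

```python
from pathlib import Path


def to_cpu_list(cpus):
    """Format CPU ids in kernel list format, collapsing consecutive ids into ranges."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current range
        else:
            ranges.append([cpu, cpu])  # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def apply_cpuset(cgroup_dir, cpus):
    """Write the allowed CPUs to <cgroup_dir>/cpuset.cpus and return that path."""
    path = Path(cgroup_dir) / "cpuset.cpus"
    path.write_text(to_cpu_list(cpus))
    return path
```

For example, `to_cpu_list([0, 1, 2, 8])` gives `"0-2,8"`. Writing to a real cgroup directory requires appropriate privileges.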

This user-space process is a Titus subsystem called titus-isolate which works as follows. On each instance, we define three events that trigger a placement optimization:

  • add: A new container was allocated by the Titus scheduler to this instance and needs to be run
  • remove: A running container just finished
  • rebalance: CPU usage may have changed in the containers so we should reevaluate our placement decisions

We periodically enqueue rebalance events when no other event has recently triggered a placement decision.
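
One way to picture this event logic is the small sketch below, in which add and remove events always trigger an optimization and a periodic tick triggers a rebalance only after a quiet period. The class name, the 30-second default, and the callback interface are assumptions for illustration; the article does not describe how titus-isolate implements this.

```python
import time


class PlacementTrigger:
    """Runs `optimize(reason)` on add/remove events and on idle rebalances."""

    def __init__(self, optimize, rebalance_after=30.0, clock=time.monotonic):
        self.optimize = optimize
        self.rebalance_after = rebalance_after  # seconds of quiet; assumed value
        self.clock = clock
        self.last_run = clock()

    def _run(self, reason):
        self.optimize(reason)
        self.last_run = self.clock()

    def add(self, container_id):
        self._run(f"add:{container_id}")

    def remove(self, container_id):
        self._run(f"remove:{container_id}")

    def tick(self):
        """Call periodically; rebalances only if nothing else ran recently."""
        if self.clock() - self.last_run >= self.rebalance_after:
            self._run("rebalance")
            return True
        return False
```

Injecting the clock keeps the timing logic testable without sleeping.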

Every time a placement event is triggered, titus-isolate queries a remote optimization service (running as a Titus service, hence also isolating itself… turtles all the way down) which solves the container-to-threads placement problem.

This service then queries a local GBRT (gradient boosted regression trees) model, retrained every couple of hours on weeks of data collected from the whole Titus platform, which predicts the P95 CPU usage of each container in the coming 10 minutes (conditional quantile regression). The model uses both contextual features (metadata associated with the container: who launched it, image, memory and network configuration, app name…) and time-series features extracted from the last hour of historical CPU usage of the container, collected regularly by the host from the kernel CPU accounting controller.
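
Quantile regression differs from ordinary regression mainly in its loss function. The sketch below is not the GBRT model, only the quantile (pinball) loss such a model would typically minimize, applied to made-up usage samples to show why a P95 target pushes predictions toward the high end of observed usage.

```python
def pinball_loss(actual, predicted, quantile):
    """Quantile loss for one observation: under-prediction costs `quantile`
    per unit of error, over-prediction costs `1 - quantile`."""
    error = actual - predicted
    return quantile * error if error >= 0 else (quantile - 1) * error


def mean_pinball_loss(samples, predicted, quantile):
    return sum(pinball_loss(y, predicted, quantile) for y in samples) / len(samples)


# Made-up CPU usage samples (percent of one core), not Titus data.
samples = list(range(1, 51))
best_constant = min(samples, key=lambda c: mean_pinball_loss(samples, c, 0.95))
print(best_constant)
```

At the 0.95 quantile, underestimating usage is penalized 19 times more heavily than overestimating it, so the best constant prediction for these samples is 48, far above their mean of 25.5.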

The predictions are then fed into a MIP which is solved on the fly. We’re using cvxpy as a nice generic symbolic front-end to represent the problem, which can then be fed into various open-source or proprietary MIP solver backends. Since MIPs are NP-hard, some care needs to be taken. We impose a hard time budget on the solver to drive the branch-and-cut strategy into a low-latency regime, with guardrails around the MIP gap to control the overall quality of the solution found.
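The time-budget idea can be illustrated without a MIP solver. The toy below swaps branch-and-cut for a plain exhaustive search that stops at a deadline and returns the best assignment found so far. The cost function (sum of squared per-core load), the function name and the inputs are all assumptions for illustration, and the MIP gap guardrail is not modeled.

```python
import itertools
import time
from typing import Optional, Sequence, Tuple


def place_with_budget(usage: Sequence[float], n_cores: int, budget_s: float
                      ) -> Tuple[Optional[Tuple[int, ...]], float, bool]:
    """Try container-to-core assignments until the deadline passes.

    usage[i] is the predicted CPU usage of container i. Returns the best
    assignment found, its cost, and whether the search ran to completion.
    """
    deadline = time.monotonic() + budget_s
    best, best_cost = None, float("inf")
    for assignment in itertools.product(range(n_cores), repeat=len(usage)):
        # Always evaluate at least one candidate before honoring the deadline.
        if best is not None and time.monotonic() > deadline:
            return best, best_cost, False
        load = [0.0] * n_cores
        for container, core in enumerate(assignment):
            load[core] += usage[container]
        cost = sum(l * l for l in load)  # toy stand-in for contention
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost, True
```

With a generous budget the search finishes and returns the optimum of the toy objective; with a tight one it still returns a valid, if worse, placement, which is the property a low-latency placement service needs.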

The service then returns the placement decision to the host, which executes it by modifying the cpusets of the containers.

For example, at any moment in time, an r4.16xlarge with 64 logical CPUs might look like this (the color scale represents CPU usage):


The first version of the system led to surprisingly good results. We reduced overall runtime of batch jobs by multiple percent on average while most importantly reducing job runtime variance (a reasonable proxy for isolation), as illustrated below. Here we see a real-world batch job runtime distribution with and without improved isolation:

Notice how we mostly made the problem of long-running outliers disappear. The right tail of unlucky noisy-neighbor runs is now gone.

For services, the gains were even more impressive. One specific Titus middleware service serving the Netflix streaming service saw a capacity reduction of 13% (a decrease of more than 1000 containers) needed at peak traffic to serve the same load with the required P99 latency SLA! We also noticed a sharp reduction of the CPU usage on the machines, since far less time was spent by the kernel in cache invalidation logic. Our containers are now faster and more predictable, and the machines are less heavily used! It’s not often that you can have your cake and eat it too.

Next Steps

We are excited about the strides made so far in this area. We are working on multiple fronts to extend the solution presented here.

We want to extend the system to support CPU oversubscription. Most of our users have trouble knowing how to properly size the number of CPUs their app needs, and in fact this number varies during the lifetime of their containers. Since we already predict future CPU usage of the containers, we want to automatically detect and reclaim unused resources. For example, one could decide to auto-assign a specific container to a shared cgroup of underutilized CPUs, to improve overall isolation and machine utilization, if we can detect the sensitivity threshold of our users along the various axes of the following graph.

We also want to leverage kernel PMC events to more directly optimize for minimal cache noise. One possible avenue is to use the Intel based bare metal instances recently introduced by Amazon that allow deep access to performance analysis tools. We could then feed this information directly into the optimization engine to move towards a more supervised learning approach. This would require a proper continuous randomization of the placements to collect unbiased counterfactuals, so we could build some sort of interference model (“what would be the performance of container A in the next minute, if I were to colocate one of its threads on the same core as container B, knowing that there’s also C running on the same socket right now?”).


If any of this piques your interest, reach out to us! We’re looking for ML engineers to help us push the boundary of container performance and “machine learning for systems”, and for systems engineers for our core infrastructure and compute platform.

Predictive CPU isolation of containers at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

from Netflix TechBlog – Medium

Making our Android Studio Apps Reactive with UI Components & Redux

By Juliano Moraes, David Henry, Corey Grunewald & Jim Isaacs

Recently Netflix has started building mobile apps to bring technology and innovation to our Studio Physical Productions, the portion of the business responsible for producing our TV shows and movies.

Our very first mobile app is called Prodicle and was built for Android & iOS using the same reactive architecture in both platforms, which allowed us to build 2 apps from scratch in 3 months with 4 software engineers.

The app helps production crews organize their shooting days through shooting milestones and keeps everyone in a production informed about what is currently happening.

Here is a shooting day for Glow Season 3.

We’ve been experimenting with an idea to use reactive components on Android for the last two years. While there are some frameworks that implement this, we wanted to stay very close to the Android native framework. It was extremely important to the team that we did not completely change the way our engineers write Android code.

We believe reactive components are the key foundation to achieve composable UIs that are scalable, reusable, unit testable and AB test friendly. Composable UIs contribute to fast engineering velocity and produce fewer side-effect bugs.

Our current player UI in the Netflix Android app uses our first iteration of this componentization architecture. We took the opportunity of building Prodicle to improve upon what we learned with the Player UI and to build the app from scratch using Redux, Components, and 100% Kotlin.

Overall Architecture

Fragments & Activities

— Fragment is not your view.

Having large Fragments or Activities causes all sorts of problems: it makes the code hard to read, maintain, and extend. Keeping them small helps with code encapsulation and better separation of concerns; the presentation logic should be inside a component or a class that represents a view, not in the Fragment.

This is how a clean Fragment looks in our app: there is no business logic. During onViewCreated we pass pre-inflated view containers and the global Redux store’s dispatch function.

UI Components

Components are responsible for owning their own XML layout and inflating themselves into a container. They implement a single render(state: ComponentState) interface and have their state defined by a Kotlin data class.

A component’s render method is a pure function that can easily be tested by creating permutations of possible state variations.

Dispatch functions are the way components fire actions to change app state, make network requests, communicate with other components, etc.

A component defines its own state as a data class at the top of the file. That is the state its render() function will be invoked with by the render loop.

It receives a ViewGroup container that will be used to inflate the component’s own layout file, R.layout.list_header in this example.

All the Android views are instantiated using a lazy approach and the render function is the one that will set all the values in the views.
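The shape of such a component can be sketched outside of Android. The example below is in Python rather than the Kotlin used in the app, and replaces the real view with a plain stand-in object; the class names, the action dictionary and the field names are hypothetical, chosen only to mirror the description above (immutable state class, lazily created view, a single render(state) entry point, and a dispatch function for firing actions).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ListHeaderState:
    """Immutable state, playing the role of the Kotlin data class."""
    title: str
    visible: bool = True


class FakeTextView:
    """Stand-in for an Android TextView so the sketch runs anywhere."""
    def __init__(self) -> None:
        self.text = ""
        self.visible = False


class ListHeaderComponent:
    def __init__(self, dispatch: Callable[[dict], None]) -> None:
        self._dispatch = dispatch
        self._title_view: Optional[FakeTextView] = None  # created lazily

    @property
    def title_view(self) -> FakeTextView:
        if self._title_view is None:
            self._title_view = FakeTextView()
        return self._title_view

    def render(self, state: ListHeaderState) -> None:
        # The view content depends only on the state passed in.
        self.title_view.text = state.title
        self.title_view.visible = state.visible

    def on_click(self) -> None:
        # Components never change app state directly; they fire actions.
        self._dispatch({"type": "HEADER_CLICKED"})
```

Because render depends only on its argument, a test can construct the component, call render with a made-up state, and inspect the view, with no Activity or Context involved.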


All of these components are independent by design, which means they do not know anything about each other, but somehow we need to lay out our components within our screens. The architecture is very flexible and provides different ways of achieving it:

  1. Self Inflation into a Container: A Component receives a ViewGroup as a container in the constructor and inflates itself using LayoutInflater. Useful when the screen has a skeleton of containers or is a LinearLayout.
  2. Pre-inflated Views: A Component accepts a View in its constructor, so there is no need to inflate it. This is used when the layout is owned by the screen in a single XML.
  3. Self Inflation into a Constraint Layout: Components inflate themselves into a ConstraintLayout passed to their constructor and expose a getMainViewId to be used by the parent to set constraints programmatically.


Redux provides an event driven unidirectional data flow architecture through a global and centralized application state that can only be mutated by Actions followed by Reducers. When the app state changes it cascades down to all the subscribed components.

Having a centralized app state makes disk persistence very simple using serialization. It also provides the ability to rewind actions that have affected the state for free. After persisting the current state to the disk, the next app launch will put the user in exactly the same state they were in before. This removes the requirement for all the boilerplate associated with Android’s onSaveInstanceState() and onRestoreInstanceState().

The Android FragmentManager has been abstracted away in favor of Redux managed navigation. Actions are fired to Push, Pop, and Set the current route. Another Component, NavigationComponent, listens to changes to the backStack and handles the creation of new Screens.
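The unidirectional flow can be shown in a few lines. This is a generic Redux-style sketch in Python, not the store used in the app: the Store class, the action dictionaries ("PUSH", "POP", "SET") and the back_stack key are assumptions chosen to mirror the navigation actions described above.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Action = Dict[str, Any]


class Store:
    """Holds the single app state; only dispatch() can change it."""

    def __init__(self, reducer: Callable[[State, Action], State],
                 initial_state: State) -> None:
        self._reducer = reducer
        self._state = initial_state
        self._subscribers: List[Callable[[State], None]] = []

    @property
    def state(self) -> State:
        return self._state

    def subscribe(self, listener: Callable[[State], None]) -> Callable[[], None]:
        self._subscribers.append(listener)
        return lambda: self._subscribers.remove(listener)  # unsubscribe handle

    def dispatch(self, action: Action) -> None:
        self._state = self._reducer(self._state, action)
        for listener in list(self._subscribers):
            listener(self._state)  # state changes cascade to subscribers


def route_reducer(state: State, action: Action) -> State:
    """Pure function: returns a new state and never mutates the old one."""
    stack = state["back_stack"]
    if action["type"] == "PUSH":
        return {**state, "back_stack": stack + [action["route"]]}
    if action["type"] == "POP":
        return {**state, "back_stack": stack[:-1]}
    if action["type"] == "SET":
        return {**state, "back_stack": [action["route"]]}
    return state
```

Since the whole state is one plain value produced by pure reducers, persisting it is a matter of serializing that value, and rewinding is a matter of keeping earlier values around.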

The Render Loop

Render Loop is the mechanism which loops through all the components and invokes component.render() if it is needed.

Components need to subscribe to changes in the App State to have their render() called. For optimization purposes, they can specify a transformation function containing the portion of the App State they care about — using selectWithSkipRepeats prevents unnecessary render calls if a part of the state changes that the component does not care about.
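The skip-repeats idea is small enough to sketch directly. The Python function below is a hypothetical analogue of selectWithSkipRepeats, not its real API: it wraps a listener so that it fires only when the selected slice of the state actually changes.

```python
from typing import Any, Callable, Dict, List


def select_with_skip_repeats(selector: Callable[[Dict[str, Any]], Any],
                             listener: Callable[[Any], None]
                             ) -> Callable[[Dict[str, Any]], None]:
    """Return a state callback that skips calls when the selection is unchanged."""
    last: List[Any] = []  # empty until the first selection has been seen

    def on_state(state: Dict[str, Any]) -> None:
        selected = selector(state)
        if last and last[0] == selected:
            return  # the part this component cares about did not change
        last[:] = [selected]
        listener(selected)

    return on_state
```

A component subscribed this way is re-rendered when its own slice changes and left alone when some unrelated part of the app state is updated.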

The ComponentManager is responsible for subscribing and unsubscribing Components. It extends Android ViewModel to persist state on configuration change, and has a 1:1 association with Screens (Fragments). It is lifecycle aware and unsubscribes all the components when onDestroy is called.

Below is our fragment with its subscriptions and transformation functions:

ComponentManager code is below:

Recycler Views

Components should be flexible enough to work inside and outside of a list. To work together with Android’s RecyclerView implementation we’ve created a UIComponent and a UIComponentForList; the only difference is that the second extends a ViewHolder and does not subscribe directly to the Redux Store.

Here is how all the pieces fit together.


The Fragment initializes a MilestoneListComponent subscribing it to the Store and implements its transformation function that will define how the global state is translated to the component state.

List Component:

A List Component uses a custom adapter that supports multiple component types, provides async diffing on a background thread through the adapter.update() interface, and invokes each item component’s render() function during onBind() of the list item.

Item List Component:

Item List Components can be used outside of a list. They look like any other component, except that UIComponentForList extends Android’s ViewHolder class. Like any other component, each implements the render function based on a state data class it defines.

Unit Tests

Unit tests on Android are generally hard to implement and slow to run. Somehow we need to mock all the dependencies — Activities, Context, Lifecycle, etc in order to start to test the code.

Since our components’ render methods are pure functions, we can easily test them by making up states without any additional dependencies.

In this unit test example we initialize a UI Component inside the before() and for every test we directly invoke the render() function with a state that we define. There is no need for activity initialization or any other dependency.

Conclusion & Next Steps

The first version of our app using this architecture was released a couple months ago and we are very happy with the results we’ve achieved so far. It has proven to be composable, reusable and testable — currently we have 60% unit test coverage.

Using a common architecture approach allows us to move very fast by having one platform implement a feature first and the other one follow. Once the data layer, business logic and component structure are figured out, it becomes very easy for the following platform to implement the same feature by translating the code from Kotlin to Swift or vice versa.

To fully embrace this architecture we’ve had to think a bit outside of the platform’s provided paradigms. The goal is not to fight the platform, but instead to smooth out some rough edges.

Making our Android Studio Apps Reactive with UI Components & Redux was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

from Netflix TechBlog – Medium

IT Modernization and DevOps News Week in Review

Palo Alto Networks made the most of a short week by announcing its plan to acquire container security company Twistlock for $410 million. It also announced plans to acquire serverless security company PureSec and launched Prisma, its new cloud security service. With cloud and container security top of mind for many, the acquisitions will prove to be valuable assets as enterprises seek to build security in.

DevOps News

  • Red Hat Ansible Tower 3.5 is now generally available. The release now includes support for RHEL 8, external credential vaults via credential plugins, and Become plugins. In addition, Red Hat noted that the Ansible Tower 3.5 release saw over 160 issues closed.
  • Red Hat Ansible Engine 2.8 is now available. In addition to several enhancements, the release includes new features such as Ansible content (Collections), BECOME as the default privilege escalation path, BECOME plugins, and removal of the dependency on paramiko, along with other notable improvements and changes.
  • TeamCity 2019.1, the first major release of this year, is here. The release features a redesigned UI, native GitLab integration, and support for GitLab and Bitbucket server pull requests as well as token-based authentication, detection and reporting of Go tests, faster build agent upgrades, and AWS Spot Fleet requests.

AWS News

Flux7 News

  • Join AWS and Flux7 as they present a one day workshop on how Serverless Technology is impacting business now (and what you need to get started). Serverless technology on AWS is enabling companies to build modern applications with increased agility and lower total cost of ownership. Find additional information and register here.
  • Flux7 has been ranked by Growjo as one of the fastest growing companies in the Austin area. Read more about Flux7’s customer and business momentum.

Written by Flux7 Labs

Flux7 is the only Sherpa on the DevOps journey that assesses, designs, and teaches while implementing a holistic solution for its enterprise customers, thus giving its clients the skills needed to manage and expand on the technology moving forward. Not a reseller or an MSP, Flux7 recommendations are 100% focused on customer requirements and creating the most efficient infrastructure possible that automates operations, streamlines and enhances development, and supports specific business goals.

from Flux7 DevOps Blog

Growjo Ranks Flux7 Among Fastest Growing Austin Companies

Growjo is on a mission to identify the top growing companies across regions of the US, and we’re excited to announce that Flux7 has been ranked among the fastest growing companies in the Austin area. Flux7’s rank of #88 is based on growth indicators and a predictive analysis algorithm unique to Growjo that not only creates the most complete list of the fastest growing companies but is also a great predictor of future growth.

In addition to the Austin ranking, the Flux7 DevOps consulting services firm has been named to Growjo’s Tech Services, State of Texas, and overall 10k list of fastest growing companies. Calculated from high growth indicators that include employee size, brand awareness, funding, acquisitions, hiring plans, new locations and additional trigger events, the Growjo formula predicts that Flux7 is both growing at an increased rate and is poised to grow significantly through 2019 and beyond.

In response to the ranking, Aater Suleman, Flux7 co-founder and CEO, said “Flux7 succeeds when our customers succeed. We seek to make it possible for organizations to experiment more, fail cheap, and measure results accurately through an innovation lab strategy. Today’s ranking illustrates the power of this approach combined with Flux7 values of humbleness, transparency, and innovation to solve business challenges.”

At Flux7, we view customer growth as a significant vote of confidence; this year we are humbled to have so many new and repeat customers loudly affirming their confidence in our employees and approach to solving business challenges. We are truly honored to be an integral part of our customers’ digital transformations as we saw customer contracts grow 247% year-over-year in the first quarter of 2019. This 2019 growth closely follows our 2018 year-ending cumulative three-year revenue growth of 547%.

Since its inception, Flux7 has established itself as a thought leader and valuable partner for enterprise and midmarket businesses aiming to modernize their IT practices and retain management of their own systems. Flux7 has been able to establish a unique position in the market by filling a need for enterprises to make rapid modernization progress while learning new technical skills for greater business agility.

With its Enterprise DevOps Framework, Flux7 helps organizations apply DevOps methodologies to reap benefits like greater innovation, enhanced security, increased scalability and more.

According to Growjo, inclusion in the Growjo 10000 is a better indicator of success than any other “fast company list”. Want to grow with us? Check out our career opportunities. Interested in having our DevOps consulting team help with your IT modernization project? Reach out to us today.

from Flux7 DevOps Blog

IT Modernization and DevOps News Week in Review

At its ChefConf 2019 held last week in Seattle, Chef announced several enhancements to its Chef Enterprise Automation Stack (EAS). New features include comprehensive Application Operations Dashboards which Chef describes as providing end-to-end visibility of the application lifecycle; new Migration Accelerators; and new versions of Chef Infra and Chef InSpec that use Chef Habitat to make it even easier to deploy, update and manage the EAS regardless of environment.

DevOps News

GitHub announced several noteworthy news items at its Satellite developer conference last week, notably:

  • It has acquired Dependabot which will give GitHub the ability to monitor and automatically open pull requests for dependencies with known security vulnerabilities.
  • GitHub has partnered with WhiteSource, an open-source security company, to help developers more easily detect open-source vulnerabilities in their GitHub repos.
  • GitHub Enterprise has been updated with a slew of new features such as enterprise accounts, a new account type that connects organizations; two new user roles, Triage and Maintain, that teams can use to grow and scale securely, addressing their access control needs; the ability for cloud administrators to now access audit log events using a new GraphQL API; security vulnerability alerts, token scanning, and more.
  • GitLab 11.11 has been released. It features multi-assignment for merge requests, the ability to automatically push an alert to Slack and/or Mattermost so your team knows when a deployment occurs, support for Windows Container Executor for GitLab Runners (enabling Docker containers on Windows), and more.
  • HashiCorp released Terraform 0.12 which is focused on improvements to the Terraform language. The aim: to make configurations for more complex situations more readable, and improve the usability of re-usable modules.
  • Our team enjoyed this blog, Using Infoblox As A Dynamic Inventory In Red Hat Ansible Tower, in which Victor da Costa shares how dynamic inventory can replace headaches associated with tracking Configuration Items (CIs) — whether you’re using a CMDB or spreadsheet.
  • A fun read our DevOps team also enjoyed was the NY Times self-published case study on how it built a Slack bot to keep track of Reddit conversations around New York Times articles.

AWS News

  • Our DevOps consulting team welcomed a new feature that makes encrypted Amazon EBS (Elastic Block Store) volumes even easier to use: they can now specify whether new EBS volumes should be created in encrypted form and, if so, whether to use their own key or a default AWS key.
  • Our AWS consulting services team was excited to see that starting August 1st AWS Config rules will switch to a pay-per-use pricing model, which means a lower bill for almost all existing AWS Config rules customers. AWS Config helps operators maintain AWS configuration compliance.
  • Amazon announced preview availability of Amazon CloudWatch Container Insights, which allows operators to monitor, isolate, and diagnose their containerized applications and microservices environments.
  • In separate CloudWatch news, Amazon has announced that CloudWatch Logs now support percentiles in metric filters, allowing operators to turn log data into numerical CloudWatch metrics that can be graphed.

Flux7 News

  • Join AWS and Flux7 as they present a one day workshop on how Serverless Technology is impacting business now (and what you need to get started). Serverless technology on AWS is enabling companies to build modern applications with increased agility and lower total cost of ownership. Find additional information and register here.
  • For additional reading on building modern applications with a strong cloud foundation, check out our blog on the benefits of pairing a Landing Zone with CI/CD. Spoiler alert: together they multiply the business’s ability to grow efficiency, productivity, security and time to market.

from Flux7 DevOps Blog