Reimagining Experimentation Analysis at Netflix

Toby Mao, Sri Sri Perangur, Colin McFarland

Another day, another custom script to analyze an A/B test. Maybe you’ve done this before and have an old script lying around. If it’s new, it’s probably going to take some time to set up, right? Not at Netflix.

ABlaze: The standard view of analyses in the XP UI

Suppose you’re running a new video encoding test and theorize that the two new encodes should reduce play delay, a metric describing how long it takes for a video to play after you press the start button. You can look at ABlaze (our centralized A/B testing platform) and take a quick look at how it’s performing.

Simulated dataset that shows what the distribution of play delay may look like. Note that the new encodes perform well in the lower quantiles but worse in the higher ones

You notice that the first new encode (Cell 2 — Encode 1) increased the mean of the play delay but decreased the median!

After recreating the dataset, you can plot the raw numbers and perform custom analyses to understand the distribution of the data across test cells.

With our new platform for experimentation analysis, it’s easy for scientists to perfectly recreate analyses on their laptops in a notebook. They can then choose from a library of statistics and visualizations or contribute their own to get a deeper understanding of the metrics.

Extending the same view of ABlaze with other contributed models and visualizations

Why it Matters

Netflix runs on an A/B testing culture: nearly every decision we make about our product and business is guided by member behavior observed in test. At any point a Netflix user is in many different A/B tests orchestrated through ABlaze. This enables us to optimize their experience at speed. Our A/B tests range across UI, algorithms, messaging, marketing, operations, and infrastructure changes. A user might be in a title artwork test, a personalization algorithm test, or a video encoding test, or all three at the same time.

The analysis reports tell us whether or not a new experience made statistically significant changes to relevant metrics, such as member behavior, or technical metrics that describe streaming video quality. However, the default reports only provide a summary view of the data with some powerful but limited filtering options. Our data scientists often want to apply their knowledge of the business and statistics to fully understand the outcome of an experiment.

Instead of relying on engineers to productionize scientific contributions, we’ve made a strategic bet to build an architecture that enables data scientists to easily contribute.

The two main challenges with this approach are establishing an easy contribution framework and handling Netflix’s scale of data. When dealing with ‘big data’, it’s common to perform computation on frameworks like Apache Spark or Map Reduce. In order to reduce the learning curve of contributing analyses, we’ve decided to take an alternative path by performing all of our analyses on one machine. Due to compression and high performance computing, scientists can analyze billions of rows of raw data on their laptops using languages and statistical libraries they are familiar with like Python and R.

Challenges with Pre-existing Infrastructure

Netflix’s well-known experimentation culture was fueled by our previous infrastructure: an optimized framework that scaled to the wide variety of use cases across Netflix. But as our experimentation culture grew, so too did our product areas, users, and ambitions around more sophisticated methodology on measurement.

Our data scientists faced numerous challenges in our previous infrastructure. Complex business logic was embedded directly into the ETL pipelines by data engineers. In order to replicate results, scientists had to delve deep into the data, code, and documentation. Due to Netflix’s scale of over 150 million subscribers, scientists also frequently encountered issues while fetching data and performing custom statistical models in Python or R.

To offer new methods to the community and overcome any existing engineering barriers, scientists would have to run custom scripts outside of the centralized platform. Heavily used or high value scripts were sometimes converted into Shiny apps, allowing easy access to these novel features. However, because these apps lived separately from the platform, they could be difficult to maintain as the underlying data and platform evolved. Also, since these apps were generally written for specific use cases, they were difficult to generalize and graduate back into the platform.

Our scientists come from many backgrounds, such as neuroscience, biostatistics, economics, and physics; each of these backgrounds has a meaningful contribution to how experiments should be analyzed. Instead of spending their time wrangling data and conducting the same ad-hoc analyses multiple times, we would like our data scientists to focus on contributing new and innovative techniques for analyzing tests, such as Interleaving, Quantile Bootstrapping, Quasi Experiments, Quantile Regression, and Heterogeneous Treatment Effects. Additionally, as these new techniques are contributed, we want them to be effortlessly leveraged across the Netflix experimentation community.

Previous XP architecture: all systems are engineering-owned and not easily introspectable

Reimagining our Infrastructure: Democratization Across 3 Tracks

We are reimagining our infrastructure to make the scientific development experience better. We’ve chosen to break the contribution framework down into 3 steps.

1. Getting Data with the Metrics Repo
2. Computing Statistics with Causal Models
3. Rendering Visualizations with Plotly

Democratization across 3 tracks: Metrics, Stats, Viz

The new architecture employs a modular design that permits data scientists to contribute using SQL, Python, and R, the tools of their trade. Users can contribute metrics and methods directly, without needing to master data engineering tools. We’ve also made sure that both production and local workflows use the same code base, so reproducibility is a given and promotion to production is just a pull request away.

New XP architecture: Systems highlighted in red are introspectable and contributable by data scientists

Getting data with Metrics Repo

Metrics Repo is an in-house Python framework where users define programmatically generated SQL queries and metric definitions. It centralizes metrics definitions which used to be scattered across many teams. Previously, many teams at Netflix had their own pipelines to calculate success metrics which caused a lot of fragmentation and discrepancies in calculations.

A key design decision of Metrics Repo is that it moves the last mile of metric computation away from engineering owned ETL pipelines into dynamically generated SQL. This allows scientists to add metrics and join arbitrary tables. The new architecture is much more flexible compared to the previous Spark based jobs. Views of reports are only calculated on demand and take a couple minutes to execute, so there are no migrations or backfills when making changes or updates to metrics. Adding a new metric is as easy as adding a new field or joining a different table in SQL. By leveraging PyPika, we represent each table as a Python class that can be customized with filters and additional joins. The code is self documenting and serializes to JSON so it can be easily exposed as an API.
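
As a rough sketch of that idea (the class shape and the table and column names here are hypothetical, not the actual Metrics Repo API), a metric backed by a PyPika-generated query might look like this:

# Hypothetical sketch of a Metrics Repo-style metric definition using PyPika.
# The class structure and table/column names are invented for illustration.
from pypika import Query, Table, functions as fn


class PlayDelay:
    """Average play delay per test cell, joined against test allocations."""

    playback = Table("playback_sessions")
    allocations = Table("test_allocations")

    def query(self, test_id: int) -> str:
        q = (
            Query.from_(self.playback)
            .join(self.allocations)
            .on(self.playback.account_id == self.allocations.account_id)
            .where(self.allocations.test_id == test_id)
            .groupby(self.allocations.cell)
            .select(
                self.allocations.cell,
                fn.Avg(self.playback.play_delay).as_("avg_play_delay"),
            )
        )
        return q.get_sql()


print(PlayDelay().query(test_id=12345))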

Calculating Statistics with Causal Models

Causal Models is an in-house Python library that allows scientists to contribute generic models for causal inference. Previously, the centralized platform only had T-Test and Mann-Whitney while advanced statistical tests were only available via scripts or Shiny apps. Scientists can now add their statistical models by overriding two functions in a model subclass. Many of the models are simple wrappers over Scipy, but it’s flexible enough to do arbitrarily complex calculations. The library also provides helper methods which abstract accessing compressed or raw data. We use rpy2 so that models can be written in either R or Python.
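
The internal base class isn’t public, but the contribution pattern described above looks roughly like the following sketch, where the Model base class, its two hook methods, and the subclass are assumptions for illustration and the statistics come from SciPy:

# Illustrative sketch only: the Model base class and its hook names are
# hypothetical stand-ins for the internal Causal Models API.
from abc import ABC, abstractmethod

import numpy as np
from scipy import stats


class Model(ABC):
    """Hypothetical contract: contributors override two methods."""

    @abstractmethod
    def fit(self, control: np.ndarray, treatment: np.ndarray) -> None: ...

    @abstractmethod
    def summary(self) -> dict: ...


class WelchTTest(Model):
    """A thin wrapper over SciPy, analogous to the built-in T-Test model."""

    def fit(self, control, treatment):
        self.result = stats.ttest_ind(treatment, control, equal_var=False)

    def summary(self):
        return {"statistic": self.result.statistic, "p_value": self.result.pvalue}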

We do not want data scientists to have to go outside of their comfort zone by writing Spark Scala or Map Reduce jobs. We also want to leverage the large ecosystem of statistical libraries written in Python and R. However, many analyses have raw datasets that don’t fit on one machine. So, we’ve implemented an optional compression layer that drastically reduces the size of the data. Depending on the statistic, the compression can be either lossless or tunably lossy. Additionally, we’ve structured the API so that model implementors don’t need to distinguish between compressed and uncompressed data. When contributing a new statistical test, the data scientist only needs to think about one comparison computation at a time. We take the functions that they’ve written and parallelize them through multiprocessing.
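
Conceptually, a contributor writes a function that handles a single control-versus-treatment comparison and the framework fans it out across cells; a simplified sketch of that idea (not the production code) follows:

# Simplified sketch of fanning single-comparison functions out over cells.
from multiprocessing import Pool

import numpy as np
from scipy import stats


def compare(args):
    """One comparison at a time: this is all a contributor has to write."""
    control, treatment = args
    stat, p_value = stats.mannwhitneyu(treatment, control, alternative="two-sided")
    return {"statistic": stat, "p_value": p_value}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    control = rng.exponential(2.0, size=100_000)  # e.g. play delay in the control cell
    treatments = [rng.exponential(1.9, size=100_000) for _ in range(3)]  # cells 2-4

    # The platform parallelizes the comparisons on behalf of the contributor.
    with Pool() as pool:
        results = pool.map(compare, [(control, t) for t in treatments])
    print(results)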

Sometimes statistical models are expensive to run even on compressed data. It can be difficult to efficiently perform linear algebra operations in native Python or R. In those cases, our mathematical engineering team writes custom C++ in order to speed through those bottlenecks. Our scientists can then reference them easily in Python via pybind11 or in R via Rcpp.

As a result, innovative methods like Quantile Bootstrapping and OLS with heterogeneous effects are no longer confined to notebooks and scripts outside of version control. The barrier to entry for developing on the production system is very low, and sharing methods across metrics and business areas is effortless.

Rendering Visualizations with Plotly

In the old model, visualizations in the experimentation platform were created by UI engineers in React. The new architecture is still based on React, but we allow data scientists to contribute arbitrary graphs and plots using Plotly. We chose to use Plotly because it has a JSON specification that is implemented in many different frameworks and languages, including R and Python. Scientists can pick and choose from a wide variety of pre-made visualizations or create their own for others to use.
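
For instance, a contributed visualization can be as small as a function that builds a Plotly figure and emits its JSON specification for the React frontend to render; this is a generic Plotly sketch rather than the platform’s actual contribution API:

# Generic Plotly sketch: build a figure in Python and emit its JSON spec,
# which a React frontend can render with plotly.js.
import numpy as np
import plotly.graph_objects as go

rng = np.random.default_rng(1)
fig = go.Figure()
for cell in ["Cell 1 - Default", "Cell 2 - Encode 1", "Cell 3 - Encode 2"]:
    fig.add_trace(go.Box(y=rng.exponential(2.0, size=1000), name=cell))
fig.update_layout(title="Play delay by test cell", yaxis_title="Play delay (s)")

spec = fig.to_json()  # JSON specification consumed by the UI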

This work kickstarted an initiative called Netflix Vizkit to create a cross-library shared design that lowers the barrier for a unified look and feel in contributions.

Many scientists at Netflix primarily use notebooks for day-to-day development, so we wanted to make sure they could perform A/B test analysis in them as well. To ensure that the analysis shown in ABlaze can be replicated in a notebook, we run the exact same code in both environments, even for the visualizations!

Now scientists can easily introspect the data and extend it in an ad-hoc analysis. They can develop new metrics, statistical models, and visualizations in their notebooks and contribute them to the platform, knowing the results will be identical because their exact code will be running in production. As a result, anyone at Netflix looking at ABlaze can now view these new contributions when looking at test analyses.

XP: Combining contributions into analyses

Next Steps

We aim to accelerate research in causal inference methodology, expedite product innovation, and ultimately delight our members. We’re looking forward to enhancing our frameworks to tackle experimentation automation. This is an ongoing journey. If you are passionate about the field, we have opportunities to join our dream team!



from Netflix TechBlog – Medium https://medium.com/netflix-techblog/reimagining-experimentation-analysis-at-netflix-71356393af21?source=rss—-2615bd06b42e—4

AWS Tech Talk: Infrastructure is Code with the AWS CDK

If you missed the Infrastructure is Code with the AWS Cloud Development Kit (AWS CDK) Tech Talk last week, you can now watch it online on the AWS Online Tech Talks channel. Elad and I had a ton of fun building this little URL shortener sample app to demo the AWS CDK for Python. If you aren’t a Python developer, don’t worry! The Python code we use is easy to understand and translates directly to other languages. Plus, you can learn a lot about the AWS CDK and get a tour of the AWS Construct Library.

Specifically, in this tech talk, you can see us:

  • Build a URL shortener service using AWS Constructs for AWS Lambda, Amazon API Gateway and Amazon DynamoDB (a minimal sketch follows this list).
  • Demonstrate the concept of shared infrastructure through a base CDK stack class that includes APIs for accessing shared resources such as a domain name and a VPC.
  • Use AWS Fargate to create a custom construct for a traffic generator.
  • Use a 3rd party construct library which automatically defines a monitoring dashboard and alarms for supported resources.
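
To give a flavor of that first bullet, here is a minimal, illustrative CDK (v1-era) Python stack; the construct IDs, table name, and the ./lambda handler asset are assumptions, and this is not the exact sample app from the talk:

# Minimal illustrative AWS CDK Python stack for a URL shortener.
from aws_cdk import core
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_dynamodb as dynamodb
from aws_cdk import aws_lambda as _lambda


class UrlShortenerStack(core.Stack):
    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Table mapping short IDs to target URLs.
        table = dynamodb.Table(
            self, "UrlMappings",
            partition_key=dynamodb.Attribute(name="id", type=dynamodb.AttributeType.STRING),
        )

        # Lambda handler code is assumed to live in ./lambda/handler.py (hypothetical).
        handler = _lambda.Function(
            self, "ShortenerHandler",
            runtime=_lambda.Runtime.PYTHON_3_7,
            code=_lambda.Code.from_asset("lambda"),
            handler="handler.main",
            environment={"TABLE_NAME": table.table_name},
        )
        table.grant_read_write_data(handler)

        # API Gateway proxying all requests to the Lambda function.
        apigw.LambdaRestApi(self, "ShortenerApi", handler=handler)


app = core.App()
UrlShortenerStack(app, "url-shortener")
app.synth()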

If you’re hungry for more after watching the Tech Talk, check out these resources to learn more about the AWS CDK:

  • https://cdkworkshop.com — This fun, online workshop takes an hour or two to complete and is available in both TypeScript and Python. You learn how to work with the AWS CDK by building an application using AWS Lambda, Amazon DynamoDB, and Amazon API Gateway.
  • https://www.github.com/aws-samples/aws-cdk-examples — After you know the basics of the AWS CDK, you’re ready to jump into the AWS CDK Examples repo! Learn more about the Constructs available in the AWS Construct Library. There are examples available for TypeScript, Python, C#, and Java. You can find the URL shortener sample app from our Tech Talk here, too.

I hope you enjoy learning about the AWS CDK. Let us know what other types of examples, apps, or construct libraries you want to see us build in demos or sample code!

Happy constructing!

Jason and Elad

 

from AWS Developer Blog https://aws.amazon.com/blogs/developer/aws-tech-talk-infrastructure-is-code-with-the-aws-cdk/

Applying Netflix DevOps Patterns to Windows

Baking Windows with Packer

By Justin Phelps and Manuel Correa

Customizing Windows images at Netflix was a manual, error-prone, and time consuming process. In this blog post, we describe how we improved the methodology, which technologies we leveraged, and how this has improved service deployment and consistency.

Artisan Crafted Images

In the Netflix full cycle DevOps culture the team responsible for building a service is also responsible for deploying, testing, infrastructure, and operation of that service. A key responsibility of Netflix engineers is identifying gaps and pain points in the development and operation of services. Though the majority of our services run on Linux Amazon Machine Images (AMIs), there are still many services critical to the Netflix Playback Experience running on Windows Elastic Compute Cloud (EC2) instances at scale.

We looked at our process for creating a Windows AMI and discovered it was error-prone and full of toil. First, an engineer would launch an EC2 instance and wait for the instance to come online. Once the instance was available, the engineer would use a remote administration tool like RDP to log in to the instance to install software and customize settings. This image was then saved as an AMI and used in an Auto Scale Group to deploy a cluster of instances. Because this process was time-consuming and painful, our Windows instances were usually missing the latest security updates from Microsoft.

Last year, we decided to improve the AMI baking process. The challenges with service management included:

  • Stale documentation
  • OS Updates
  • High cognitive overhead
  • A lack of continuous testing

Scaling Image Creation

Our existing AMI baking tool Aminator does not support Windows so we had to leverage other tools. We had several goals in mind when trying to improve the baking methodology:

Configuration as Code

The first part of our new Windows baking solution is Packer. Packer allows you to describe your image customization process as a JSON file. We make use of the amazon-ebs Packer builder to launch an EC2 instance. Once online, Packer uses WinRM to copy files and run PowerShell scripts against the instance. If all of the configuration steps are successful then Packer saves a new AMI. The configuration file, referenced scripts, and artifact dependency definitions all live in an internal git repository. We now have the software and instance configuration as code. This means changes can be tracked and reviewed like any other code change.
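
A stripped-down template along these lines might look like the following; the region, source AMI, names, and script paths are placeholders, not our actual configuration:

{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "us-west-2",
      "instance_type": "m5.large",
      "source_ami": "ami-EXAMPLE",
      "ami_name": "windows-base-{{timestamp}}",
      "communicator": "winrm",
      "winrm_username": "Administrator",
      "user_data_file": "scripts/bootstrap-winrm.txt"
    }
  ],
  "provisioners": [
    {
      "type": "powershell",
      "scripts": ["scripts/install-software.ps1", "scripts/configure-settings.ps1"]
    }
  ]
}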

Packer requires specific information for your baking environment and extensive AWS IAM permissions. In order to simplify the use of Packer for our software developers, we bundled Netflix-specific AWS environment information and helper scripts. Initially, we did this with a git repository and Packer variable files. There was also a special EC2 instance where Packer was executed as Jenkins jobs. This setup was better than manually baking images but we still had some ergonomic challenges. For example, it became cumbersome to ensure users of Packer received updates.

The last piece of the puzzle was finding a way to package our software for installation on Windows. This would allow for reuse of helper scripts and infrastructure tools without requiring every user to copy that solution into their Packer scripts. Ideally, this would work similarly to how applications are packaged in the Aminator process. We solved this by leveraging Chocolatey, the package manager for Windows. Chocolatey packages are created and then stored in an internal artifact repository. This repository is added as a source for the choco install command. This means we can create and reuse packages that help integrate Windows into the Netflix ecosystem.
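
In practice, that boils down to a handful of choco commands; the package name and internal repository URL here are placeholders:

# Package a nuspec and publish it to the internal artifact repository (placeholder URL).
choco pack netflix-helpers.nuspec
choco push netflix-helpers.1.0.0.nupkg --source https://artifacts.example.com/chocolatey

# On the instance being baked: register the internal source and install the package.
choco source add --name=internal --source=https://artifacts.example.com/chocolatey
choco install netflix-helpers --yes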

Leverage Spinnaker for Continuous Delivery

Flow chart showing how Docker image inheritance is used in the creation of a Windows AMI.
The Base Dockerfile allows updates of Packer, helper scripts, and environment configuration to propagate through the entire Windows Baking process.

To make the baking process more robust we decided to create a Docker image that contains Packer, our environment configuration, and helper scripts. Downstream users create their own Docker images based on this base image. This means we can update the base image with new environment information and helper scripts, and users get these updates automatically. With their new Docker image, users launch their Packer baking jobs using Titus, our container management system. The Titus job produces a property file as part of a Spinnaker pipeline. The resulting property file contains the AMI ID and is consumed by later pipeline stages for deployment. Running the bake in Titus removed the single EC2 instance limitation, allowing for parallel execution of the jobs.
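
Conceptually, the two layers might look like the following pair of Dockerfiles; the image names, paths, and template file are illustrative, not our actual setup:

# Dockerfile for the base image, maintained centrally; it bundles Packer,
# helper scripts, and environment configuration (names are illustrative).
FROM amazonlinux:2
COPY packer /usr/local/bin/packer
COPY helper-scripts/ /opt/baking/helper-scripts/
COPY environment/ /opt/baking/environment/

# Dockerfile for a downstream team's image: it inherits everything above
# and adds only the team's own Packer template.
FROM registry.example.com/windows-baking-base:latest
COPY my-service.json /opt/baking/templates/
CMD ["packer", "build", "/opt/baking/templates/my-service.json"]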

Now each change in the infrastructure is tested, canaried, and deployed like any other code change. This process is automated via a Spinnaker pipeline:

Screenshot of an example Spinnaker pipeline showing Docker image, Windows AMI, Canary Analysis, and Deployment stages.
Example Spinnaker pipeline showing the bake, canary, and deployment stages.

In the canary stage, Kayenta is used to compare metrics between a baseline (current AMI) and the canary (new AMI). The canary stage will determine a score based on metrics such as CPU, threads, latency, and GC pauses. If this score is within a healthy threshold the AMI is deployed to each environment. Running a canary for each change and testing the AMI in production allows us to capture insights around impact on Windows updates, script changes, tuning web server configuration, among others.

Eliminate Toil

Automating these tedious operational tasks allows teams to move faster. Our engineers no longer have to manually update Windows, Java, Tomcat, IIS, and other services. We can easily test server tuning changes, software upgrades, and other modifications to the runtime environment. Every code and infrastructure change goes through the same testing and deployment pipeline.

Reaping the Benefits

Changes that used to require hours of manual work are now easy to modify, test, and deploy. Other teams can quickly deploy secure and reproducible instances in an automated fashion. Services are more reliable, testable, and documented. Changes to the infrastructure are now reviewed like any other code change. This removes unnecessary cognitive load and documents tribal knowledge. Removing toil has allowed the team to focus on other features and bug fixes. All of these benefits reduce the risk of a customer-affecting outage. Adopting the Immutable Server pattern for Windows using Packer and Chocolatey has paid big dividends.



from Netflix TechBlog – Medium https://medium.com/netflix-techblog/applying-netflix-devops-patterns-to-windows-2a57f2dbbf79?source=rss—-2615bd06b42e—4

Setting up an Android application with AWS SDK for C++

The AWS SDK for C++ can build and run on many different platforms, including Android. In this post, I walk you through building and running a sample application on an Android device.

Overview

I cover the following topics:

  • Building and installing the SDK for C++ and creating a desktop application that implements the basic functionality.
  • Setting up the environment for Android development.
  • Building and installing the SDK for C++ for Android and transplanting the desktop application to Android platform with the cross-compiled SDK.

Prerequisites

To get started, you need the following resources:

  • An AWS account
  • A GitHub environment to build and install AWS SDK for C++

To set up the application on the desktop

Follow these steps to set up the application on your desktop:

  • Create an Amazon Cognito identity pool
  • Build and test the desktop application

Create an Amazon Cognito identity pool

For this demo, I use unauthenticated identities, which typically belong to guest users. To learn about unauthenticated and authenticated identities and choose the one that fits your business, check out Using Identity Pools.

In the Amazon Cognito console, choose Manage identity pools, Create new identity pool. Enter an identity pool name, like “My Android App with CPP SDK”, and choose Enable access to unauthenticated identities.

Next, choose Create Pool, View Details. Two Role Summary sections should display, one for authenticated and the other for unauthenticated identities.

For the unauthenticated identities, choose View Policy Document, Edit. Under Action, add the following line:

s3:ListAllMyBuckets

After you have completed the preceding steps, the policy should read as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "mobileanalytics:PutEvents",
                "cognito-sync:*"
            ],
            "Resource": "*"
        }
    ]
}

To finish the creation, choose Allow.

On the next page, in the Get AWS Credentials section, copy the identity pool ID and keep it somewhere to use later. Or, you can find it after you choose Edit identity pool in the Amazon Cognito console. The identity pool ID has the following format:

<region>:<uuid>

Build and test the desktop application

Before building an Android application, you can build a regular application with the SDK for C++ in your desktop environment for testing purposes. Later, you modify the source code and CMake script to switch the build target to Android.

Here’s how to build and install the SDK for C++ statically:

cd <workspace>
git clone https://github.com/aws/aws-sdk-cpp.git
mkdir build_sdk_desktop
cd build_sdk_desktop
cmake ../aws-sdk-cpp \
    -DBUILD_ONLY="identity-management;s3" \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX="<workspace>/install_sdk_desktop"
cmake --build .
cmake --build . --target install

If you install the SDK for C++ successfully, you can find libaws-cpp-sdk-*.a (or aws-cpp-sdk-*.lib for Windows) under <workspace>/install_sdk_desktop/lib/ (or <workspace>/install_sdk_desktop/lib64).

Next, build the application and link it to the library that you built. Create a folder <workspace>/app_list_all_buckets and place two files under this directory:

  • main.cpp (source file)
  • CMakeLists.txt (CMake file)
// main.cpp
#include <iostream>
#include <aws/core/Aws.h>
#include <aws/core/utils/Outcome.h>
#include <aws/core/utils/logging/AWSLogging.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/core/auth/AWSCredentialsProvider.h>
#include <aws/identity-management/auth/CognitoCachingCredentialsProvider.h>
#include <aws/s3/S3Client.h>

using namespace Aws::Auth;
using namespace Aws::CognitoIdentity;
using namespace Aws::CognitoIdentity::Model;

static const char ALLOCATION_TAG[] = "ListAllBuckets";
static const char ACCOUNT_ID[] = "your-account-id";
static const char IDENTITY_POOL_ID[] = "your-cognito-identity-id";

int main()
{
    Aws::SDKOptions options;
    options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Debug;
    Aws::InitAPI(options);

    Aws::Client::ClientConfiguration config;
    auto cognitoIdentityClient = Aws::MakeShared<CognitoIdentityClient>(ALLOCATION_TAG, Aws::MakeShared<AnonymousAWSCredentialsProvider>(ALLOCATION_TAG), config);
    auto cognitoCredentialsProvider = Aws::MakeShared<CognitoCachingAnonymousCredentialsProvider>(ALLOCATION_TAG, ACCOUNT_ID, IDENTITY_POOL_ID, cognitoIdentityClient);

    Aws::S3::S3Client s3Client(cognitoCredentialsProvider, config);
    auto listBucketsOutcome = s3Client.ListBuckets();
    Aws::StringStream ss;
    if (listBucketsOutcome.IsSuccess())
    {
        ss << "Buckets:" << std::endl;
        for (auto const& bucket : listBucketsOutcome.GetResult().GetBuckets())
        {
            ss << "  " << bucket.GetName() << std::endl;
        }
    }
    else
    {
        ss << "Failed to list buckets." << std::endl;
    }
    std::cout << ss.str() << std::endl;
    Aws::ShutdownAPI(options);
    return 0;
}
# CMakeLists.txt
cmake_minimum_required(VERSION 3.3)
set(CMAKE_CXX_STANDARD 11)

project(list_all_buckets LANGUAGES CXX)
find_package(AWSSDK REQUIRED COMPONENTS s3 identity-management)
add_executable(${PROJECT_NAME} "main.cpp")
target_link_libraries(${PROJECT_NAME} ${AWSSDK_LINK_LIBRARIES})

Build and test this desktop application with the following commands:

cd <workspace>
mkdir build_app_desktop
cd build_app_desktop
cmake ../app_list_all_buckets \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_PREFIX_PATH="<workspace>/install_sdk_desktop"
cmake --build .
./list_all_buckets # or ./Debug/list_all_buckets.exe for Windows

The output should read as follows:

Buckets:
  <bucket_1>
  <bucket_2>
  ...

Now you have a desktop application. You’ve accomplished this without touching anything related to Android. The next section covers Android instructions.

To set up the application on Android with AWS SDK for C++

Follow these steps to set up the application on Android with the SDK for C++:

  • Set up Android Studio
  • Cross-compile the SDK for C++ and the library
  • Build and run the application in Android Studio

Set up Android Studio

First, download and install Android Studio. For more detailed instructions, see the Android Studio Install documentation.

Next, open Android Studio and create a new project. On the Choose your project screen, as shown in the following screenshot, choose Native C++, Next.

Complete all fields. In the following example, you build the SDK for C++ with Android API level 19, so the Minimum API Level is “API 19: Android 4.4 (KitKat)”.

Choose C++ 11 for C++ Standard and choose Finish for the setup phase.

The first time that you open Android Studio, you might see “missing NDK and CMake” errors during automatic installation. You can ignore these warnings for the moment and install the Android NDK and CMake manually, or you can accept the license to install the NDK and CMake within Android Studio, which should suppress the warnings.

After you choose Finish, you should get a sample application with Android Studio. This application publishes some messages on the screens of your devices. For more details, see Create a new project with C++.

Starting from this sample, take the following steps to build your application:

  1. Cross-compile the SDK for C++ for Android.
  2. Modify the source code and CMake script to build list_all_buckets as a shared object library (liblist_all_buckets.so) rather than an executable. This library exposes a listAllBuckets() function that outputs all buckets.
  3. Specify the path to the library in the module’s build.gradle file so that the Android application can find it.
  4. Load this library in MainActivity with System.loadLibrary("list_all_buckets"), so that the Android application can use the listAllBuckets() function.
  5. Call the listAllBuckets() function in the onCreate() function of MainActivity.

The following sections give more details for each step.

Cross-compile the SDK for C++ and the library

Use the Android NDK to cross-compile the SDK for C++. This example uses NDK version r19c. To find out whether Android Studio has already downloaded the NDK, check the following default locations:

  • Linux: ~/Android/Sdk/ndk-bundle
  • MacOS: ~/Library/Android/sdk/ndk-bundle
  • Windows: C:\Users\<username>\AppData\Local\Android\Sdk\ndk\<version>

Alternately, download the Android NDK directly.

To cross-compile SDK for C++, run the following code:

cd <workspace>
mkdir build_sdk_android
cd build_sdk_android
cmake ../aws-sdk-cpp -DNDK_DIR="<path-to-android-ndk>" \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_BUILD_TYPE=Release \
    -DCUSTOM_MEMORY_MANAGEMENT=ON \
    -DTARGET_ARCH=ANDROID \
    -DANDROID_NATIVE_API_LEVEL=19 \
    -DBUILD_ONLY="identity-management;s3" \
    -DCMAKE_INSTALL_PREFIX="<workspace>/install_sdk_android"
cmake --build . --target CURL # This step is only required on Windows.
cmake --build .
cmake --build . --target install

On Windows, you might see the error message: “CMAKE_SYSTEM_NAME is ‘Android’ but ‘NVIDIA Nsight Tegra Visual Studio Edition’ is not installed.” In that case, install Ninja and change the generator from Visual Studio to Ninja by passing -GNinja as another parameter to your CMake command.

To build list_all_buckets as an Android-targeted dynamic object library, you must change the source code and CMake script. More specifically, you must alter the source code as follows:

Replace main() function with Java_com_example_mynativecppapplication_MainActivity_listAllBuckets(). In the Android application, the Java code calls this function through JNI (Java Native Interface). You may have a different function name, based on your package name and activity name. For this demo, the package name is com.example.mynativecppapplication, the activity name is MainActivity, and the actual function called by Java code is called listAllBuckets().

Enable LogcatLogSystem, so that you can debug your Android application and see the output in the logcat console.

Your Android devices or emulators may be missing CA certificates, so you should push them to your devices and specify the path in the client configuration. In this example, use CA certificates extracted from Mozilla in PEM format.

Download the certificate bundle.

Push this file to your Android devices:

# Change directory to the location of adb
cd <path-to-android-sdk>/platform-tools
# Replace "com.example.mynativecppapplication" with your package name
./adb shell mkdir -p /sdcard/Android/data/com.example.mynativecppapplication/certs
# push the PEM file to your devices
./adb push cacert.pem /sdcard/Android/data/com.example.mynativecppapplication/certs

Specify the path in the client configuration:

config.caFile = "/sdcard/Android/data/com.example.mynativecppapplication/certs/cacert.pem";

The complete source code looks like the following:

// main.cpp
#if __ANDROID__
#include <android/log.h>
#include <jni.h>
#include <aws/core/platform/Android.h>
#include <aws/core/utils/logging/android/LogcatLogSystem.h>
#endif
#include <iostream>
#include <aws/core/Aws.h>
#include <aws/core/utils/Outcome.h>
#include <aws/core/utils/logging/AWSLogging.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/core/auth/AWSCredentialsProvider.h>
#include <aws/identity-management/auth/CognitoCachingCredentialsProvider.h>
#include <aws/s3/S3Client.h>

using namespace Aws::Auth;
using namespace Aws::CognitoIdentity;
using namespace Aws::CognitoIdentity::Model;

static const char ALLOCATION_TAG[] = "ListAllBuckets";
static const char ACCOUNT_ID[] = "your-account-id";
static const char IDENTITY_POOL_ID[] = "your-cognito-identity-id";

#ifdef __ANDROID__
extern "C" JNIEXPORT jstring JNICALL
Java_com_example_mynativecppapplication_MainActivity_listAllBuckets(JNIEnv* env, jobject classRef, jobject context)
#else
int main()
#endif
{
    Aws::SDKOptions options;
#ifdef __ANDROID__
    AWS_UNREFERENCED_PARAM(classRef);
    AWS_UNREFERENCED_PARAM(context);
    Aws::Utils::Logging::InitializeAWSLogging(Aws::MakeShared<Aws::Utils::Logging::LogcatLogSystem>(ALLOCATION_TAG, Aws::Utils::Logging::LogLevel::Debug));
#else
    options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Debug;
#endif
    Aws::InitAPI(options);

    Aws::Client::ClientConfiguration config;
#ifdef __ANDROID__
    config.caFile = "/sdcard/Android/data/com.example.mynativecppapplication/certs/cacert.pem";
#endif
    auto cognitoIdentityClient = Aws::MakeShared<CognitoIdentityClient>(ALLOCATION_TAG, Aws::MakeShared<AnonymousAWSCredentialsProvider>(ALLOCATION_TAG), config);
    auto cognitoCredentialsProvider = Aws::MakeShared<CognitoCachingAnonymousCredentialsProvider>(ALLOCATION_TAG, ACCOUNT_ID, IDENTITY_POOL_ID, cognitoIdentityClient);

    Aws::S3::S3Client s3Client(cognitoCredentialsProvider, config);
    auto listBucketsOutcome = s3Client.ListBuckets();
    Aws::StringStream ss;
    if (listBucketsOutcome.IsSuccess())
    {
        ss << "Buckets:" << std::endl;
        for (auto const& bucket : listBucketsOutcome.GetResult().GetBuckets())
        {
            ss << "  " << bucket.GetName() << std::endl;
        }
    }
    else
    {
        ss << "Failed to list buckets." << std::endl;
    }

#if __ANDROID__
    std::string allBuckets(ss.str().c_str());
    Aws::ShutdownAPI(options);
    return env->NewStringUTF(allBuckets.c_str());
#else
    std::cout << ss.str() << std::endl;
    Aws::ShutdownAPI(options);
    return 0;
#endif
}

Next, make the following changes for the CMake script:

  • Set the default values for the parameters used for the Android build, including:
    • The default Android API Level is 19
    • The default Android ABI is armeabi-v7a
    • Use libc++ as the standard library by default
    • Use android.toolchain.cmake supplied by Android NDK by default
  • Build list_all_buckets as a library rather than an executable
  • Link to the external libraries built in the previous step: zlib, ssl, crypto, and curl
# CMakeLists.txt
cmake_minimum_required(VERSION 3.3)
set(CMAKE_CXX_STANDARD 11)

if(TARGET_ARCH STREQUAL "ANDROID")
    if(NOT NDK_DIR)
        set(NDK_DIR $ENV{ANDROID_NDK})
    endif()
    if(NOT IS_DIRECTORY "${NDK_DIR}")
        message(FATAL_ERROR "Could not find Android NDK (${NDK_DIR}); either set the ANDROID_NDK environment variable or pass the path in via -DNDK_DIR=..." )
    endif()

    if(NOT CMAKE_TOOLCHAIN_FILE)
        set(CMAKE_TOOLCHAIN_FILE "${NDK_DIR}/build/cmake/android.toolchain.cmake")
    endif()

    if(NOT ANDROID_ABI)
        set(ANDROID_ABI "armeabi-v7a")
        message(STATUS "Android ABI: none specified, defaulting to ${ANDROID_ABI}")
    else()
        message(STATUS "Android ABI: ${ANDROID_ABI}")
    endif()

    if(BUILD_SHARED_LIBS)
        set(ANDROID_STL "c++_shared")
    else()
        set(ANDROID_STL "c++_static")
    endif()

    if(NOT ANDROID_NATIVE_API_LEVEL)
        set(ANDROID_NATIVE_API_LEVEL "android-19")
        message(STATUS "Android API Level: none specified, defaulting to ${ANDROID_NATIVE_API_LEVEL}")
    else()
        message(STATUS "Android API Level: ${ANDROID_NATIVE_API_LEVEL}")
    endif()

    list(APPEND CMAKE_FIND_ROOT_PATH ${CMAKE_PREFIX_PATH})
endif()

project(list_all_buckets LANGUAGES CXX)
find_package(AWSSDK REQUIRED COMPONENTS s3 identity-management)
if(TARGET_ARCH STREQUAL "ANDROID")
    set(SUFFIX so)
    add_library(zlib STATIC IMPORTED)
    set_target_properties(zlib PROPERTIES IMPORTED_LOCATION ${EXTERNAL_DEPS}/zlib/lib/libz.a)
    add_library(ssl STATIC IMPORTED)
    set_target_properties(ssl PROPERTIES IMPORTED_LOCATION ${EXTERNAL_DEPS}/openssl/lib/libssl.a)
    add_library(crypto STATIC IMPORTED)
    set_target_properties(crypto PROPERTIES IMPORTED_LOCATION ${EXTERNAL_DEPS}/openssl/lib/libcrypto.a)
    add_library(curl STATIC IMPORTED)
    set_target_properties(curl PROPERTIES IMPORTED_LOCATION ${EXTERNAL_DEPS}/curl/lib/libcurl.a)
    add_library(${PROJECT_NAME} "main.cpp")
else()
    add_executable(${PROJECT_NAME} "main.cpp")
endif()
target_link_libraries(${PROJECT_NAME} ${AWSSDK_LINK_LIBRARIES})

Finally, build this library with the following command:

cd <workspace>
mkdir build_app_android
cd build_app_android
cmake ../app_list_all_buckets \
    -DNDK_DIR="<path-to-android-ndk>" \
    -DBUILD_SHARED_LIBS=ON \
    -DTARGET_ARCH=ANDROID \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_PREFIX_PATH="<workspace>/install_sdk_android" \
    -DEXTERNAL_DEPS="<workspace>/build_sdk_android/external"
cmake --build .

This build results in the shared library liblist_all_buckets.so under <workspace>/build_app_android/. It’s time to switch to Android Studio.

Build and run the application in Android Studio

First, the application must find the library (liblist_all_buckets.so) that you built and the standard library (libc++_shared.so). The default search path for JNI libraries is app/src/main/jniLibs/<android-abi>. Create a directory called: <your-android-application-root>/app/src/main/jniLibs/armeabi-v7a/ and copy the following files to this directory:

<workspace>/build_app_android/liblist_all_buckets.so

  • For Linux: <android-ndk>/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/arm-linux-androideabi/libc++_shared.so
  • For MacOS: <android-ndk>/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/lib/arm-linux-androideabi/libc++_shared.so
  • For Windows: <android-ndk>/toolchains/llvm/prebuilt/windows-x86_64/sysroot/usr/lib/arm-linux-androideabi/libc++_shared.so

Next, open the build.gradle file for your module and remove the externalNativeBuild{} block, because you are using prebuilt libraries, instead of building the source with the Android application.

Then, edit MainActivity.java, which is under app/src/main/java/<package-name>/. Replace all occurrences of native-lib with list_all_buckets and replace all occurrences of stringFromJNI() with listAllBuckets(). The whole Java file looks like the following code example:

// MainActivity.java
package com.example.mynativecppapplication;

import android.support.v7.app.AppCompatActivity;
import android.os.Bundle;
import android.widget.TextView;

public class MainActivity extends AppCompatActivity {

    // Used to load the 'list_all_buckets' library on application startup.
    static {
        System.loadLibrary("list_all_buckets");
    }

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        // Example of a call to a native method
        TextView tv = findViewById(R.id.sample_text);
        tv.setText(listAllBuckets());
    }

    /**
    * A native method that is implemented by the 'list_all_buckets' native library,
    * which is packaged with this application.
    */
    public native String listAllBuckets();
}

Finally, don’t forget to grant internet access permission to your application by adding the following lines in the AndroidManifest.xml, located at app/src/main/:

<manifest xmlns:android="http://schemas.android.com/apk/res/android"
package="com.example.mynativecppapplication">
    <uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />
    ...
</manifest>

To run the application on Android emulator, make sure that the CPU/ABI is armeabi-v7a for the system image. That’s what you specified when you cross-compiled the SDK and list_all_buckets library for the Android platform.

Run this application by choosing the Run icon or choosing Run, Run [app]. You should see that the application lists all buckets, as shown in the following screenshot.

Summary

With Android Studio and its included tools, you can cross-compile the AWS SDK for C++ and build a sample Android application to get temporary Amazon Cognito credentials and list all S3 buckets. Starting from this simple application, AWS hopes to see more exciting integrations with the SDK for C++.

As always, AWS welcomes all feedback and comments. Feel free to open an issue on GitHub if you have questions, or submit a pull request to contribute.

from AWS Developer Blog https://aws.amazon.com/blogs/developer/setting-up-an-android-application-with-aws-sdk-for-c/

Configuring boto to validate HTTPS certificates

We strongly recommend upgrading from boto to boto3, the latest major version of the AWS SDK for Python. The previous major version, boto, does not default to validating HTTPS certificates for Amazon S3 when you are:

  1. Using a Python version less than 2.7.9 or
  2. Using Python 2.7.9 or greater and are connecting to S3 through a proxy

If you are unable to upgrade to boto3, you should configure boto to always validate HTTPS certificates. Be sure to test these changes. You can force HTTPS certificate validation by either:

  1. Setting https_validate_certificates to True in your boto config file (see the example after this list). For more information on how to use the boto config file, please refer to its documentation, or
  2. Setting validate_certs to True when instantiating an S3Connection:
    >>> from boto.s3.connection import S3Connection
    >>> conn = S3Connection(validate_certs=True)
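
For the first option, a minimal boto config file (for example, ~/.boto) would contain:

[Boto]
https_validate_certificates = True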

To get the best experience, we always recommend remaining up-to-date with the latest version of the AWS SDKs and runtimes.

from AWS Developer Blog https://aws.amazon.com/blogs/developer/configure-boto-to-validate-https-certificates/

Introducing the ‘aws-rails-provisioner’ gem developer preview

AWS is happy to announce that the aws-rails-provisioner gem for Ruby is now in developer preview and available for you to try!

What is aws-rails-provisioner?

The new aws-rails-provisioner gem is a tool that helps you define and deploy your containerized Ruby on Rails applications on AWS. It currently only supports AWS Fargate.

aws-rails-provisioner is a command line tool using the configuration file aws-rails-provisioner.yml to generate AWS Cloud Development Kit (CDK) stacks on your behalf. It automates provisioning AWS resources to run your containerized Ruby on Rails applications on AWS Fargate with a few commands. It can also generate a CI/CD AWS CodePipeline pipeline for your applications when you enable its CI/CD option.

Why use aws-rails-provisioner?

Moving a local Ruby on Rails application to a well-configured web application running on the cloud is a complicated task. While aws-rails-provisioner doesn’t change this into a “one-click” task, it helps ease the most monotonous and detail-oriented aspects of the job.

The aws-rails-provisioner gem shifts your primary focus to component-oriented definitions inside a concise aws-rails-provisioner.yml file. This file defines the AWS resources your application needs, such as container image environment, a database cluster engine, or Auto Scaling strategies.

The new gem handles default details—like VPC configuration, subnet placement, inbound traffic rules between databases and applications—for you. With CI/CD opt-in, aws-rails-provisioner can also generate and provision a predefined CI/CD pipeline, including a database migration phase.

For containerized Ruby on Rails applications that you already maintain on AWS, aws-rails-provisioner helps to keep the AWS infrastructure for your application as code in a maintainable way. This ease allows you to focus more on application development.

Prerequisites

Before using the preview gem, you must have the following resources:
• A Ruby on Rails application with Dockerfile
• A Docker daemon set up locally
  • The AWS CDK installed (requires Node.js >= 8.11.x): npm i -g aws-cdk

Using aws-rails-provisioner

Getting started with aws-rails-provisioner is fast and easy.

Step 1: Install aws-rails-provisioner

You can download the aws-rails-provisioner preview gem from RubyGems.

To install the gem, run the following command:


gem install 'aws-rails-provisioner' -v 0.0.0.rc1

Step 2: Define your aws-rails-provisioner.yml file

The aws-rails-provisioner.yml configuration file allows you to define how and what components you want aws-rails-provisioner to provision to run your application image on Fargate.

An aws-rails-provisioner.yml file looks like the following format:

version: '0'
vpc:
  max_azs: 2
services:
  rails_foo:
    source_path: ../sandbox/rails_foo
    fargate:
      desired_count: 3
      public: true
      envs:
        PORT: 80
        RAILS_LOG_TO_STDOUT: true
    db_cluster:
      engine: aurora-postgresql
      db_name: myrails_db
      instance: 2
    scaling:
      max_capacity: 2
      on_cpu:
        target_util_percent: 80
        scale_in_cool_down: 300
  rails_bar:
    ...

aws-rails-provisioner.yml file overview

The aws-rails-provisioner.yml file contains two parts: vpc and services. VPC defines networking settings, such as Amazon VPC hosting your applications and databases. It can be as simple as:

vpc:
  max_az: 3
  cidr: '10.0.0.0/21'
  enable_dns: true

Left unmodified, aws-rails-provisioner defines a default VPC with three Availability Zones containing public, private, and isolated subnets with a CIDR range of 10.0.0.0/21 and DNS enabled. If these default settings don’t meet your needs, you can configure settings yourself, such as in the following example, which defines the subnets with details:

vpc:
  subnets:
    application: # a subnet name
      cidr_mark: 24
      type: private
    ...

You can review the full range of VPC configuration options to meet your exact needs.

The services portion of aws-rails-provisioner.yml allows you to define your Rails applications, database clusters, and Auto Scaling policies. For every application, you add an entry under an identifier like:

services:
  my_awesome_rails_app:
    source_path: ../path/to/awesome_app # relative path from `aws-rails-provisioner.yml`
    ...
  my_another_awesome_rails_app:
    source_path: ./path/to/another_awesome_app # relative path from `aws-rails-provisioner.yml`
    ...

When you run aws-rails-provisioner commands later, it takes each service's configuration values under fargate:, db_cluster:, and scaling: and provisions a Fargate service fronted by an Application Load Balancer (the database cluster and Auto Scaling policies are optional for a service).

Database cluster

The db_cluster portion of aws-rails-provisioner.yml defines database settings for your Rails application. It currently supports Aurora PostgreSQL, Aurora MySQL, and Aurora. You can specify the engine version by setting engine_version. You can also choose to provide a username for your databases; if you don't, aws-rails-provisioner automatically generates a username and password and stores them in AWS Secrets Manager.

To enable storage encryption for the Amazon RDS database cluster, provide kms_key_arn with the AWS KMS key ARN you use for storage encryption:

my_awesome_rails_app:
  source_path: ../path/to/awesome_app
  db_cluster:
    engine: aurora-postgresql
    db_name: awesome_db
    username: myadmin

You can review the full list of db_cluster: configuration options to meet your specific needs.

AWS Fargate

The fargate: portion of aws-rails-provisioner.yml defines the Fargate service and tasks that run your application image, for example:

my_awesome_rails_app:
  source_path: ../path/to/awesome_app
  fargate:
    public: true
    memory: 512
    cpu: 256
    container_port: 80
    envs:
      RAILS_ENV: ...
      RAILS_LOG_TO_STDOUT: ...
  ...

For HTTPS applications, you can provide a certificate ARN from AWS Certificate Manager under certificate. This automatically associates the certificate with the Application Load Balancer and sets container_port to 443. You can also provide a domain name and domain zone for your application under domain_name and domain_zone. If you don't provide these elements, the system provides a default DNS address from the Application Load Balancer.

When providing environment variables for your application image, you don’t have to define DATABASE_URL by yourself; aws-rails-provisioner computes the value based on your db_cluster configuration. Make sure to update the config/database.yml file for your Rails application to recognize the DATABASE_URL environment variable.
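
For example, a minimal production entry that picks up the injected value could look like this (shown for a PostgreSQL-compatible Aurora cluster):

# config/database.yml
production:
  adapter: postgresql
  encoding: unicode
  pool: 5
  url: <%= ENV['DATABASE_URL'] %>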

You can review the full list of fargate: configuration options to meet your specific needs.

Scaling

You can also configure the Auto Scaling setting for your service. In this prototype stage, you can configure scaling policies on_cpu, on_metric, on_custom_metric, on_memory, on_request, or on_schedule.

my_awesome_rails_app:
  source_path: ../path/to/awesome_app
  scaling:
    max_capacity: 10
    on_memory:
      target_util_percent: 80
      scale_out_cool_down: 200
    on_request:
      requests_per_target: 100000
      disable_scale_in: true
    ...

You can review the full list of scaling: configuration options to meet your specific needs.

Step 3: Build and deploy

With aws-rails-provisioner.yml defined, you can run a build command. Doing so bootstraps AWS CDK stacks in code, defining all the necessary AWS resources and connections for you.

Run the following:


aws-rails-provisioner build

This command initializes and builds a CDK project with stacks—installing all required CDK packages—leaving a deploy-ready project. By default, it generates an InitStack that defines the VPC and Amazon ECS cluster, hosting Fargate services. It also generates a FargateStack that defines a database cluster and a load-balanced, scaling Fargate service for each service entry.

When you enable --with-cicd, the aws-rails-provisioner also provides a pipeline stack containing source, build, database migration, and deploy stages for each service defined for you. You can enable CI/CD with the following command:


aws-rails-provisioner build --with-cicd

After the build completes, run the following deploy command to deploy all defined AWS resources:


aws-rails-provisioner deploy

Instead of deploying everything all at the same time, you can deploy stack by stack or application by application:


# only deploys the stack that creates the VPC and ECS cluster
aws-rails-provisioner deploy --init

# deploys the fargate service and database cluster when defined
aws-rails-provisioner deploy --fargate

# deploy the CI/CD stack
aws-rails-provisioner deploy --cicd

# deploy only `my_awesome_rails_app` application
aws-rails-provisioner deploy --fargate --service my_awesome_rails_app

You can check on the status of your stacks by logging in to the AWS console and navigating to AWS CloudFormation. Deployment can take several minutes.

Completing deployment leaves your applications running on AWS Fargate, fronted with the Application Load Balancer.

Step 4: View AWS resources

To see your database cluster, log in to the Amazon RDS console.

To see an ECS cluster created, with Fargate Services and tasks running, you can also check the Amazon ECS console.

To view the application via DNS address or domain address, check the Application Load Balancing dashboard.

Any application with a database needs Rails migrations to work. The generated CI/CD stack contains a migration phase; the following CI/CD section contains additional details.

To view all the aws-rails-provisioner command line options, run:


aws-rails-provisioner -h

Step 5: Trigger the CI/CD pipeline

To trigger the pipeline that aws-rails-provisioner provisioned for you, you must commit your application source code and Dockerfile with AWS CodeBuild build specs into an AWS CodeCommit repository. The aws-rails-provisioner gem automatically creates this repository for you.

To experiment with application image build and database migration on your own, try these example build spec files from the aws-rails-provisioner GitHub repo.

Conclusion

Although the aws-rails-provisioner gem is currently in developer preview, it provides you with a powerful, time-saving tool. I would love for you to try it out and return with feedback on how AWS can improve this asset before its final launch. As always, you can leave your thoughts and feedback on GitHub.

from AWS Developer Blog https://aws.amazon.com/blogs/developer/introducing-the-aws-rails-provisioner-gem-developer-preview/

Evolution of Netflix Conductor: v2.0 and beyond

By Anoop Panicker and Kishore Banala

Conductor is a workflow orchestration engine developed and open-sourced by Netflix. If you’re new to Conductor, this earlier blogpost and the documentation should help you get started and acclimatized to Conductor.

Netflix Conductor: A microservices orchestrator

In the last two years since inception, Conductor has seen wide adoption and is instrumental in running numerous core workflows at Netflix. Many of the Netflix Content and Studio Engineering services rely on Conductor for efficient processing of their business flows. The Netflix Media Database (NMDB) is one such example.

In this blog, we would like to present the latest updates to Conductor, address some of the frequently asked questions and thank the community for their contributions.

How we’re using Conductor at Netflix

Deployment

Conductor is one of the most heavily used services within Content Engineering at Netflix. Of the multitude of modules that can be plugged into Conductor as shown in the image below, we use the Jersey server module, Cassandra for persisting execution data, Dynomite for persisting metadata, DynoQueues as the queuing recipe built on top of Dynomite, Elasticsearch as the secondary datastore and indexer, and Netflix Spectator + Atlas for metrics. Our cluster ranges from 12 to 18 AWS EC2 m4.4xlarge instances, typically running at ~30% capacity.

Components of Netflix Conductor
* — Cassandra persistence module is a partial implementation.

We do not maintain an internal fork of Conductor within Netflix. Instead, we use a wrapper that pulls in the latest version of Conductor and adds Netflix infrastructure components and libraries before deployment. This allows us to proactively push changes to the open source version while ensuring that the changes are fully functional and well-tested.

Adoption

As of writing this blog, Conductor orchestrates 600+ workflow definitions owned by 50+ teams across Netflix. While we’re not (yet) actively measuring the nth percentiles, our production workloads speak for Conductor’s performance. Below is a snapshot of our Kibana dashboard which shows the workflow execution metrics over a typical 7-day period.

Dashboard with typical Conductor usage over 7 days
Typical Conductor usage at Netflix over a 7 day period.

Use Cases

Some of the use cases served by Conductor at Netflix can be categorized under:

  • Content Ingest and Delivery
  • Content Quality Control
  • Content Localization
  • Encodes and Deployments
  • IMF Deliveries
  • Marketing Tech
  • Studio Engineering

What’s New

gRPC Framework

One of the key features in v2.0 was the introduction of the gRPC framework as an alternative/auxiliary to REST. This was contributed by our counterparts at GitHub, thereby strengthening the value of community contributions to Conductor.

Cassandra Persistence Layer

To enable horizontal scaling of the datastore for large volume of concurrent workflow executions (millions of workflows/day), Cassandra was chosen to provide elastic scaling and meet throughput demands.

External Payload Storage

External payload storage was implemented to prevent the usage of Conductor as a data persistence system and to reduce the pressure on its backend datastore.

Dynamic Workflow Executions

For use cases that need to execute a large or arbitrary number of varying workflow definitions, or to run a one-time ad hoc workflow for testing or analytical purposes, registering definitions with the metadata store just to execute them once adds a lot of additional overhead. The ability to dynamically create and execute workflows removes this friction. This was another great addition that stemmed from our collaboration with GitHub.
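
To make this concrete, here is a minimal, hedged Java sketch of starting a one-off workflow by embedding its definition directly in the start request. The WorkflowClient and StartWorkflowRequest usage, the endpoint URI, and the task names are assumptions based on the Conductor Java client and may differ in your version.

// Sketch only: execute an ad hoc workflow without registering its definition first.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.netflix.conductor.client.http.WorkflowClient;
import com.netflix.conductor.common.metadata.workflow.StartWorkflowRequest;
import com.netflix.conductor.common.metadata.workflow.WorkflowDef;
import com.netflix.conductor.common.metadata.workflow.WorkflowTask;

public class AdHocWorkflowExample {

    public static void main(String[] args) {
        WorkflowTask task = new WorkflowTask();
        task.setName("encode_video");                    // hypothetical worker task
        task.setTaskReferenceName("encode_video_ref");

        WorkflowDef def = new WorkflowDef();
        def.setName("adhoc_encode_test");
        def.setTasks(Collections.singletonList(task));

        StartWorkflowRequest request = new StartWorkflowRequest();
        request.setName(def.getName());
        request.setWorkflowDef(def);                     // inline definition, no prior registration

        Map<String, Object> input = new HashMap<>();
        input.put("videoId", "12345");                   // hypothetical input
        request.setInput(input);

        WorkflowClient client = new WorkflowClient();
        client.setRootURI("http://localhost:8080/api/"); // hypothetical Conductor endpoint
        String workflowId = client.startWorkflow(request);
        System.out.println("Started ad hoc workflow: " + workflowId);
    }
}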

Workflow Status Listener

Conductor can be configured to publish notifications to external systems or queues upon completion/termination of workflows. The workflow status listener provides hooks to connect to any notification system of your choice. The community has contributed an implementation that publishes a message on a dyno queue based on the status of the workflow. An event handler can be configured on these queues to trigger workflows or tasks to perform specific actions upon the terminal state of the workflow.

Bulk Workflow Management

There has always been a need for bulk operations at the workflow level from an operability standpoint. When running at scale, it becomes essential to perform workflow-level operations in bulk, for example when a bad downstream dependency in the worker processes causes widespread task failures or bad task executions. Bulk APIs enable operators to have macro-level control over the workflows executing within the system.

Decoupling Elasticsearch from Persistence

Previously, the Elasticsearch indexer was tightly coupled with the primary persistence layer. This inter-dependency was removed by moving the indexing layer into separate persistence modules and exposing a property (workflow.elasticsearch.instanceType) to choose the type of indexing engine. Further, the indexer and persistence layer have been decoupled by moving this orchestration from within the primary persistence layer to a service layer through the ExecutionDAOFacade.

ES5/6 Support

Support for Elasticsearch versions 5 and 6 has been added as part of the major version upgrade to v2.x. This addition also provides the option to use the Elasticsearch RestClient instead of the Transport Client, which was enforced in the previous version. This opens the route to using a managed Elasticsearch cluster (for example, on AWS) as part of the Conductor deployment.

Task Rate Limiting & Concurrent Execution Limits

Task rate limiting helps achieve bounded scheduling of tasks. The task definition parameter rateLimitFrequencyInSeconds sets the duration window, while rateLimitPerFrequency defines the number of tasks that can be scheduled within that window. concurrentExecLimit, by contrast, limits scheduling independently of any time window: the total number of tasks scheduled at any given time stays under concurrentExecLimit. These parameters can be used in tandem to achieve the desired throttling and rate limiting.
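
For illustration, here is a minimal, hedged Java sketch of a task definition that combines all three parameters. The TaskDef class and setter names follow the Conductor Java client, and the task name and values are hypothetical; treat the exact API as an assumption that may differ across versions.

// Sketch only: a task definition combining a rate-limit window with a concurrency cap.
import com.netflix.conductor.common.metadata.tasks.TaskDef;

public class EncodeTaskDefinition {

    public static TaskDef build() {
        TaskDef def = new TaskDef();
        def.setName("encode_video");               // hypothetical task name
        def.setRetryCount(3);
        def.setRateLimitFrequencyInSeconds(60);    // duration window: 60 seconds
        def.setRateLimitPerFrequency(100);         // at most 100 schedulings per window
        def.setConcurrentExecLimit(10);            // at most 10 executions in flight at once
        return def;
    }
}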

API Validations

Validation was one of the core features missing in Conductor 1.x. To improve usability and operability, we added validations, which in practice have greatly helped find bugs during the creation of workflow and task definitions. Validations require users to create and register their task definitions before registering the workflow definitions that use those tasks. They also ensure that a workflow definition is well-formed, with correct wiring of inputs and outputs across the tasks within the workflow. Any anomalies found are reported to the user with a detailed error message describing the reason for the failure.

Developer Labs, Logging and Metrics

We have been continually improving logging and metrics, and revamped the documentation to reflect the latest state of Conductor. To provide a smooth onboarding experience, we have created developer labs, which guide the user through creating task and workflow definitions, managing a workflow lifecycle, configuring advanced workflows with eventing, and a brief introduction to the Conductor API, UI, and other modules.

New Task Types

System tasks have proven to be very valuable in defining the Workflow structure and control flow. As such, Conductor 2.x has seen several new additions to System tasks, mostly contributed by the community:

Lambda

The Lambda task executes ad hoc logic at workflow run time using the Nashorn JavaScript engine. Instead of creating workers for simple evaluations, the Lambda task lets users express them inline as simple JavaScript expressions.
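
As an illustration, here is a hedged Java sketch of how such a task might be declared. The WorkflowTask class and the scriptExpression input parameter follow the Conductor documentation and Java client, but the task names and the expression itself are hypothetical.

// Sketch only: an inline LAMBDA task evaluated by Nashorn at run time.
import java.util.HashMap;
import java.util.Map;

import com.netflix.conductor.common.metadata.workflow.WorkflowTask;

public class LambdaTaskExample {

    public static WorkflowTask build() {
        WorkflowTask lambda = new WorkflowTask();
        lambda.setName("check_play_delay");              // hypothetical name
        lambda.setTaskReferenceName("check_play_delay_ref");
        lambda.setType("LAMBDA");

        Map<String, Object> input = new HashMap<>();
        input.put("playDelayMs", "${workflow.input.playDelayMs}"); // wired from workflow input
        input.put("scriptExpression",
                "if ($.playDelayMs > 1000) { return {acceptable: false}; } "
              + "else { return {acceptable: true}; }");
        lambda.setInputParameters(input);
        return lambda;
    }
}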

Terminate

The Terminate task is useful when workflow logic should end the workflow with a given output. For example, if a decision task evaluates to false and we do not want to execute the remaining tasks in the workflow, then instead of defining a DECISION task with a list of tasks in one case and an empty list in the other, a Terminate task placed inside the decision branch can end the workflow execution at that point.
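
Here is a hedged Java sketch of how a Terminate task could be declared inside a decision branch. The terminationStatus and workflowOutput parameter names are taken from the Conductor documentation as best understood; the names, values, and output wiring are assumptions that may vary by version.

// Sketch only: a TERMINATE task placed in a decision branch to end the workflow early.
import java.util.HashMap;
import java.util.Map;

import com.netflix.conductor.common.metadata.workflow.WorkflowTask;

public class TerminateTaskExample {

    public static WorkflowTask build() {
        WorkflowTask terminate = new WorkflowTask();
        terminate.setName("stop_if_not_eligible");       // hypothetical name
        terminate.setTaskReferenceName("stop_if_not_eligible_ref");
        terminate.setType("TERMINATE");

        Map<String, Object> input = new HashMap<>();
        input.put("terminationStatus", "COMPLETED");          // end the workflow as COMPLETED
        input.put("workflowOutput", "${decide_ref.output}");  // hypothetical output wiring
        terminate.setInputParameters(input);
        return terminate;
    }
}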

ExclusiveJoin

The Exclusive Join task helps capture task output from a DECISION task’s flow. This is useful for wiring task inputs from the outputs of one of the cases within a decision flow. This data is only available during workflow execution time, and the ExclusiveJoin task can be used to collect the output from one of the tasks in any of the decision branches.

For in-depth implementation details of the new additions, please refer to the documentation.

What’s next

There are many features and enhancements we would like to add to Conductor. The wish list below can be considered a long-term roadmap. It is by no means exhaustive, and we very much welcome ideas and contributions from the community. In no particular order, these include:

Advanced Eventing with Event Aggregation and Distribution

At the moment, event generation and processing is a very simple implementation. An event task can create only one message, and a task can wait for only one event.

We envision an Event Aggregation and Distribution mechanism that would open up Conductor to a multitude of use-cases. A coarse idea is to allow a task to wait for multiple events, and to progress several tasks based on one event.

UI Improvements

While the current UI provides a neat way to visualize and track workflow executions, we would like to enhance this with features like:

  • Creating metadata objects from UI
  • Support for starting workflows
  • Visualize execution metrics
  • Admin dashboard to show outliers

New Task types like Goto, Loop etc.

Conductor has been using a Directed Acyclic Graph (DAG) structure to define a workflow. Goto and Loop tasks are valid use cases that would deviate from the DAG structure. We would like to add support for these tasks without violating the existing workflow execution rules. This would help unlock several other use cases, like streaming flow of data to tasks and others that require repeated execution of a set of tasks within a workflow.

Support for reusable commonly used tasks like Email, DatabaseQuery etc.

Similarly, we’ve seen the value of shared, reusable tasks that each do one specific thing. In Netflix’s internal deployment of Conductor, we’ve added tasks specific to our services that users can leverage instead of recreating them from scratch. For example, we provide a TitusTask which enables our users to launch a new Titus container as part of their workflow execution.

We would like to extend this idea such that Conductor can offer a repository of commonly used tasks.

Push based task scheduling interface

The current Conductor architecture is based on workers polling the server for tasks to execute. We would like to enhance the gRPC modules to leverage the bidirectional channel to push tasks to workers as and when they are scheduled, thus reducing network traffic, load on the server, and redundant client calls.

Validating Task inputKeys and outputKeys

This is to provide type safety for tasks and define a parameterized interface for task definitions such that tasks are completely re-usable within Conductor once registered. This provides a contract allowing the user to browse through available task definitions to use as part of their workflow where the tasks could have been implemented by another team/user. This feature would also involve enhancing the UI to display this contract.

Implementing MetadataDAO in Cassandra

As mentioned above, the Cassandra module provides a partial implementation that persists only workflow executions. A metadata persistence implementation is not available yet and is something we are looking to add soon.

Pluggable Notifications on Task completion

Similar to the Workflow status listener, we would like to provide extensible interfaces for notifications on task execution.

Python client in Pypi

We have seen wide adoption of the Python client within the community. However, there is no official Python client on PyPI, and the existing client lacks some of the newer additions present in the Java client. We would like to achieve feature parity, publish a client from the Conductor GitHub repository, and automate its release to PyPI.

Removing Elasticsearch from critical path

While Elasticsearch is greatly useful in Conductor, we would like to make it optional for users who do not have an Elasticsearch setup. This means removing Elasticsearch from the critical execution path of a workflow and using it as an opt-in layer.

Pluggable authentication and authorization

Conductor doesn’t currently support authentication and authorization for the API or UI. Adding it is something we feel would provide great value, and it is a frequent request from the community.

Validations and Testing

Dry runs, i.e., the ability to evaluate workflow definitions without actually running them through worker processes and all the relevant setup, would make it much easier to test and debug execution paths.

If you would like to be a part of the Conductor community and contribute to one of the Wishlist items or something that you think would provide a great value add, please read through this guide for instructions or feel free to start a conversation on our Gitter channel, which is Conductor’s user forum.

We also highly encourage you to polish, genericize, and share with the community any customizations that you may have built on top of Conductor.

We really appreciate and are extremely proud of our community, whose members have made several important contributions to Conductor. We would like to take this further and make Conductor widely adopted, with strong community backing.

Netflix Conductor is maintained by the Media Workflow Infrastructure team. If you like the challenges of building distributed systems and are interested in building the Netflix Content and Studio ecosystem at scale, connect with Charles Zhao to get the conversation started.

Thanks to Alexandra Pau, Charles Zhao, Falguni Jhaveri, Konstantinos Christidis and Senthil Sayeebaba.


Evolution of Netflix Conductor: v2.0 and beyond was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

from Netflix TechBlog – Medium https://medium.com/netflix-techblog/evolution-of-netflix-conductor-16600be36bca?source=rss—-2615bd06b42e—4

Real-time streaming transcription with the AWS C++ SDK

Real-time streaming transcription with the AWS C++ SDK

Today, I’d like to walk you through how to use the AWS C++ SDK to leverage Amazon Transcribe streaming transcription. This service allows you to do speech-to-text processing in real time. Streaming transcription uses HTTP/2 technology to communicate efficiently with clients.

In this walkthrough, you build a command line application that captures audio from the computer’s microphone, sends it to Amazon Transcribe streaming, and prints out transcribed text as you speak. You use PortAudio (a third-party library) to capture and sample audio. PortAudio is a free, cross-platform library, so you should be able to build this on Windows, macOS, and Linux.

Note

Amazon Transcribe streaming transcription has a separate API from Amazon Transcribe, which also allows you to do speech-to-text, albeit not in real time.

Prerequisites

You must have the following tools installed to build the application:

  • CMake (preferably a recent version 3.11 or later)
  • A modern C++ compiler that supports C++11, a minimum of GCC 5.0, Clang 4.0, or Visual Studio 2015
  • Git
  • An HTTP/2 client
    • On *nix, you must have libcurl with HTTP/2 support installed on the system. To ensure that the version of libcurl you have supports HTTP/2, run the following command:
      • $ curl --version
      • You should see HTTP2 listed as one of the features.
    • On Windows, you must be running Windows 10.
  • An AWS account configured with the CLI

Walkthrough

The first step is to download and install PortAudio from source. If you’re using Linux or macOS, you can use the system’s package manager to install the library (for example: apt, yum, or Homebrew).

  1. Browse to http://www.portaudio.com/download.html and download the latest stable release.
  2. Unzip the archive to a PortAudio directory.
  3. If you’re running Windows, run the following commands to build and install the library.
$ cd portaudio
$ mkdir build
$ cd build
$ cmake ..
$ cmake --build . --config Release

Those commands should build both a DLL and a static library. PortAudio does not define an install target when building on Windows. In the Release directory, copy the file named portaudio_static_x64.lib and the file named portaudio.h to another temporary directory. You need both files for the subsequent steps.

4. If you’re running on Linux or macOS, run the following commands instead.

$ cd portaudio
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ cmake --build .
$ cmake --build . --target install

For this demonstration, you can safely ignore any warnings you see while PortAudio builds.

The next step is to download and install the Amazon Transcribe Streaming C++ SDK. You can use vcpkg or Homebrew for this step, but here I show you how to build the SDK from source.

$ git clone https://github.com/aws/aws-sdk-cpp.git
$ cd aws-sdk-cpp
$ mkdir build
$ cd build
$ cmake .. -DBUILD_ONLY="transcribestreaming" -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
$ cmake --build . --config Release
$ sudo cmake --build . --config Release --target install

Now, you’re ready to write the command line application.

Before diving into the code, I’d like to explain a few things about the structure of the application.

First, you need to tell PortAudio to capture audio from the microphone and write the sampled bits to a stream. Second, you want to simultaneously consume that stream and send the bits you captured to the service. To do those two operations concurrently, the application uses multiple threads. The SDK and PortAudio create and manage the threads. PortAudio is responsible for the audio capturing thread, and the SDK is responsible for the API communication thread.

If you have used the C++ SDK asynchronous APIs before, then you might notice the new pattern introduced with the Amazon Transcribe Streaming API. Specifically, in addition to the callback parameter invoked on request completion, there’s now another callback parameter. The new callback is invoked when the stream is ready to be written to by the application. More on that later.

The application can be a single C++ source file. However, I split the logic into two source files: one file contains the SDK-related logic, and the other file contains the PortAudio specific logic. That way, you can easily swap out the PortAudio specific code and replace it with any other audio source that fits your use case. So, go ahead and create a new directory for the demo application and save the following files into it.

The first file (main.cpp) contains the logic of using the Amazon Transcribe Streaming SDK:

// main.cpp
#include <aws/core/Aws.h>
#include <aws/core/utils/threading/Semaphore.h>
#include <aws/transcribestreaming/TranscribeStreamingServiceClient.h>
#include <aws/transcribestreaming/model/StartStreamTranscriptionHandler.h>
#include <aws/transcribestreaming/model/StartStreamTranscriptionRequest.h>
#include <cstdio>

using namespace Aws;
using namespace Aws::TranscribeStreamingService;
using namespace Aws::TranscribeStreamingService::Model;

int SampleRate = 16000; // 16 kHz
int CaptureAudio(AudioStream& targetStream);

int main()
{
    Aws::SDKOptions options;
    options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Trace;
    Aws::InitAPI(options);
    {
        Aws::Client::ClientConfiguration config;
#ifdef _WIN32
        config.httpLibOverride = Aws::Http::TransferLibType::WIN_INET_CLIENT;
#endif
        TranscribeStreamingServiceClient client(config);
        StartStreamTranscriptionHandler handler;
        handler.SetTranscriptEventCallback([](const TranscriptEvent& ev) {
            for (auto&& r : ev.GetTranscript().GetResults()) {
                if (r.GetIsPartial()) {
                    printf("[partial] ");
                } else {
                    printf("[Final] ");
                }
                for (auto&& alt : r.GetAlternatives()) {
                    printf("%s\n", alt.GetTranscript().c_str());
                }
            }
        });

        StartStreamTranscriptionRequest request;
        request.SetMediaSampleRateHertz(SampleRate);
        request.SetLanguageCode(LanguageCode::en_US);
        request.SetMediaEncoding(MediaEncoding::pcm);
        request.SetEventStreamHandler(handler);

        auto OnStreamReady = [](AudioStream& stream) {
            CaptureAudio(stream);
            stream.flush();
            stream.Close();
        };

        Aws::Utils::Threading::Semaphore signaling(0 /*initialCount*/, 1 /*maxCount*/);
        auto OnResponseCallback = [&signaling](const TranscribeStreamingServiceClient*,
                  const Model::StartStreamTranscriptionRequest&,
                  const Model::StartStreamTranscriptionOutcome&,
                  const std::shared_ptr<const Aws::Client::AsyncCallerContext>&) { signaling.Release(); };

        client.StartStreamTranscriptionAsync(request, OnStreamReady, OnResponseCallback, nullptr /*context*/);
        signaling.WaitOne(); // prevent the application from exiting until we're done
    }

    Aws::ShutdownAPI(options);

    return 0;
}

The second file (audio-capture.cpp) contains the logic related to capturing audio from the microphone.

// audio-capture.cpp
#include <aws/core/utils/memory/stl/AWSVector.h>
#include <aws/core/utils/threading/Semaphore.h>
#include <aws/transcribestreaming/model/AudioStream.h>
#include <csignal>
#include <cstdio>
#include <portaudio.h>

using SampleType = int16_t;
extern int SampleRate;
int Finished = paContinue;
Aws::Utils::Threading::Semaphore pasignal(0 /*initialCount*/, 1 /*maxCount*/);

static int AudioCaptureCallback(const void* inputBuffer, void* outputBuffer, unsigned long framesPerBuffer,
    const PaStreamCallbackTimeInfo* timeInfo, PaStreamCallbackFlags statusFlags, void* userData)
{
    auto stream = static_cast<Aws::TranscribeStreamingService::Model::AudioStream*>(userData);
    const auto beg = static_cast<const unsigned char*>(inputBuffer);
    const auto end = beg + framesPerBuffer * sizeof(SampleType);

    (void)outputBuffer; // Prevent unused variable warnings
    (void)timeInfo;
    (void)statusFlags;
 
    Aws::Vector<unsigned char> bits { beg, end };
    Aws::TranscribeStreamingService::Model::AudioEvent event(std::move(bits));
    stream->WriteAudioEvent(event);

    if (Finished == paComplete) {
        pasignal.Release(); // signal the main thread to close the stream and exit
    }

    return Finished;
}

void interruptHandler(int)
{
    Finished = paComplete;
}
 
int CaptureAudio(Aws::TranscribeStreamingService::Model::AudioStream& targetStream)
{
 
    signal(SIGINT, interruptHandler); // handle ctrl-c
    PaStreamParameters inputParameters;
    PaStream* stream;
    PaError err = paNoError;
 
    err = Pa_Initialize();
    if (err != paNoError) {
        fprintf(stderr, "Error: Failed to initialize PortAudio.\n");
        return -1;
    }

    inputParameters.device = Pa_GetDefaultInputDevice(); // default input device
    if (inputParameters.device == paNoDevice) {
        fprintf(stderr, "Error: No default input device.\n");
        Pa_Terminate();
        return -1;
    }

    inputParameters.channelCount = 1;
    inputParameters.sampleFormat = paInt16;
    inputParameters.suggestedLatency = Pa_GetDeviceInfo(inputParameters.device)->defaultHighInputLatency;
    inputParameters.hostApiSpecificStreamInfo = nullptr;

    // start the audio capture
    err = Pa_OpenStream(&stream, &inputParameters, nullptr, /* &outputParameters, */
        SampleRate, paFramesPerBufferUnspecified,
        paClipOff, // you don't output out-of-range samples so don't bother clipping them.
        AudioCaptureCallback, &targetStream);

    if (err != paNoError) {
        fprintf(stderr, "Failed to open stream.\n");        
        goto done;
    }

    err = Pa_StartStream(stream);
    if (err != paNoError) {
        fprintf(stderr, "Failed to start stream.\n");
        goto done;
    }
    printf("=== Now recording!! Speak into the microphone. ===\n");
    fflush(stdout);

    if ((err = Pa_IsStreamActive(stream)) == 1) {
        pasignal.WaitOne();
    }
    if (err < 0) {
        goto done;
    }

    Pa_CloseStream(stream);

done:
    Pa_Terminate();
    return 0;
}

There is one line in the audio-capture.cpp file that is related to the Amazon Transcribe Streaming SDK. That is the line in which you wrap the audio bits in an AudioEvent object and write the event to the stream. That is required regardless of the audio source.

And finally, here’s a simple CMake script to build the application:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.11)
set(CMAKE_CXX_STANDARD 11)
project(demo LANGUAGES CXX)

find_package(AWSSDK COMPONENTS transcribestreaming)

add_executable(${PROJECT_NAME} "main.cpp" "audio-capture.cpp")

target_link_libraries(${PROJECT_NAME} PRIVATE ${AWSSDK_LINK_LIBRARIES})

if(MSVC)
    target_include_directories(${PROJECT_NAME} PRIVATE "portaudio")
    target_link_directories(${PROJECT_NAME} PRIVATE "portaudio")
    target_link_libraries(${PROJECT_NAME} PRIVATE "portaudio_static_x64") # might have _x86 suffix instead
    target_compile_options(${PROJECT_NAME} PRIVATE "/W4" "/WX")
else()
    target_compile_options(${PROJECT_NAME} PRIVATE "-Wall" "-Wextra" "-Werror")
    target_link_libraries(${PROJECT_NAME} PRIVATE portaudio)
endif()

If you are building on Windows, copy the two files you saved earlier from PortAudio’s build output (portaudio.h and portaudio_static_x64.lib) into a directory named “portaudio” under the demo directory, as shown:

├── CMakeLists.txt
├── audio-capture.cpp
├── build
├── main.cpp
└── portaudio
    ├── portaudio.h
    └── portaudio_static_x64.lib

If you’re building on macOS or Linux, skip this step.

Now, “cd” into that directory and run the following commands:

$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ cmake --build . --config Release

These commands should create the executable ‘demo’ under the build directory. Go ahead and execute it, then start speaking into the microphone when prompted. The transcribed words should start appearing on the screen.

Amazon Transcribe streaming transcription sends partial results while it is receiving audio input. You might notice the partial results changing as you speak. If you pause long enough, it recognizes your silence as the end of a statement and sends a final result. As soon as you start speaking again, streaming transcription resumes sending partial results.

Clarifications

If you’ve used the asynchronous APIs in the C++ SDK previously, you might be wondering why the stream is passed through a callback instead of setting it as a property on the request directly.

The reason is that Amazon Transcribe Streaming is one of the first AWS services to use a new binary serialization and deserialization protocol. The protocol uses the notion of an “event” to group application-defined chunks of data. Each event is signed using the signature of the previous event as a seed. The first event is seeded using the request’s signature. Therefore, the stream must be “primed” by seeding it with the initial request’s signature before you can write to it.

Summary

HTTP/2 opens new opportunities for using the web and AWS in real-time scenarios. With the support of the AWS C++ SDK, you can write applications that translate speech to text on the fly.

Amazon Transcribe Streaming follows the same pricing model as Amazon Transcribe. For more details, see Amazon Transcribe Pricing.

I’m excited to see what you build using this new technology. Send AWS your feedback on GitHub and on Twitter (#awssdk).

from AWS Developer Blog https://aws.amazon.com/blogs/developer/real-time-streaming-transcription-with-the-aws-c-sdk/

Re-Architecting the Video Gatekeeper

Re-Architecting the Video Gatekeeper

By Drew Koszewnik

This is the story about how the Content Setup Engineering team used Hollow, a Netflix OSS technology, to re-architect and simplify an essential component in our content pipeline — delivering a large amount of business value in the process.

The Context

Each movie and show on the Netflix service is carefully curated to ensure an optimal viewing experience. The team responsible for this curation is Title Operations. Title Operations will confirm, among other things:

  • We are in compliance with the contracts — date ranges and places where we can show a video are set up correctly for each title
  • Videos with captions, subtitles, and secondary audio “dub” assets are sourced, translated, and made available to the right populations around the world
  • Title name and synopsis are available and translated
  • The appropriate maturity ratings are available for each country

When a title meets all of the minimum requirements above, it is allowed to go live on the service. Gatekeeper is the system at Netflix responsible for evaluating the “liveness” of videos and assets on the site. A title doesn’t become visible to members until Gatekeeper approves it — and if it can’t validate the setup, then it will assist Title Operations by pointing out what’s missing from the baseline customer experience.

Gatekeeper accomplishes its prescribed task by aggregating data from multiple upstream systems, applying some business logic, then producing an output detailing the status of each video in each country.

The Tech

Hollow, an OSS technology we released a few years ago, has been best described as a total high-density near cache:

  • Total: The entire dataset is cached on each node — there is no eviction policy, and there are no cache misses.
  • High-Density: encoding, bit-packing, and deduplication techniques are employed to optimize the memory footprint of the dataset.
  • Near: the cache exists in RAM on any instance which requires access to the dataset.

One exciting thing about the total nature of this technology — because we don’t have to worry about swapping records in-and-out of memory, we can make assumptions and do some precomputation of the in-memory representation of the dataset which would not otherwise be possible. The net result is, for many datasets, vastly more efficient use of RAM. Whereas with a traditional partial-cache solution you may wonder whether you can get away with caching only 5% of the dataset, or if you need to reserve enough space for 10% in order to get an acceptable hit/miss ratio — with the same amount of memory Hollow may be able to cache 100% of your dataset and achieve a 100% hit rate.

And obviously, if you get a 100% hit rate, you eliminate all I/O required to access your data — and can achieve orders of magnitude more efficient data access, which opens up many possibilities.

The Status-Quo

Until very recently, Gatekeeper was a completely event-driven system. When a change for a video occurred in any one of its upstream systems, that system would send an event to Gatekeeper. Gatekeeper would react to that event by reaching into each of its upstream services, gathering the necessary input data to evaluate the liveness of the video and its associated assets. It would then produce a single-record output detailing the status of that single video.

Old Gatekeeper Architecture

This model had several problems associated with it:

  • This process was completely I/O bound and put a lot of load on upstream systems.
  • Consequently, these events would queue up throughout the day and cause processing delays, which meant that titles might not actually go live on time.
  • Worse, events would occasionally get missed, meaning titles wouldn’t go live at all until someone from Title Operations realized there was a problem.

The mitigation for these issues was to “sweep” the catalog so videos matching specific criteria (e.g., scheduled to launch next week) would get events automatically injected into the processing queue. Unfortunately, this mitigation added many more events into the queue, which exacerbated the problem.

Clearly, a change in direction was necessary.

The Idea

We decided to employ a total high-density near cache (i.e., Hollow) to eliminate our I/O bottlenecks. For each of our upstream systems, we would create a Hollow dataset which encompasses all of the data necessary for Gatekeeper to perform its evaluation. Each upstream system would now be responsible for keeping its cache updated.

New Gatekeeper Architecture

With this model, liveness evaluation is conceptually separated from the data retrieval from upstream systems. Instead of reacting to events, Gatekeeper would continuously process liveness for all assets in all videos across all countries in a repeating cycle. The cycle iterates over every video available at Netflix, calculating liveness details for each of them. At the end of each cycle, it produces a complete output (also a Hollow dataset) representing the liveness status details of all videos in all countries.

We expected that this continuous processing model was possible because a complete removal of our I/O bottlenecks would mean that we should be able to operate orders of magnitude more efficiently. We also expected that by moving to this model, we would realize many positive effects for the business.

  • A definitive solution for the excess load on upstream systems generated by Gatekeeper.
  • A complete elimination of liveness processing delays and missed go-live dates.
  • A reduction in the time the Content Setup Engineering team spends on performance-related issues.
  • Improved debuggability and visibility into liveness processing.

The Problem

Hollow can also be thought of like a time machine. As a dataset changes over time, it communicates those changes to consumers by breaking the timeline down into a series of discrete data states. Each data state represents a snapshot of the entire dataset at a specific moment in time.

Hollow is like a time machine

Usually, consumers of a Hollow dataset are loading the latest data state and keeping their cache updated as new states are produced. However, they may instead point to a prior state — which will revert their view of the entire dataset to a point in the past.

The traditional method of producing data states is to maintain a single producer which runs a repeating cycle. During that cycle, the producer iterates over all records from the source of truth. As it iterates, it adds each record to the Hollow library. Hollow then calculates the differences between the data added during this cycle and the data added during the last cycle, then publishes the state to a location known to consumers.

Traditional Hollow usage
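
To make the traditional cycle concrete, here is a minimal, hedged Java sketch using the open-source Hollow producer API. The filesystem publisher and announcer, the paths, and the fetchAllVideoRecords() helper are illustrative assumptions, not the actual Gatekeeper implementation.

// Sketch only: one traditional producer cycle that re-adds every record from the source of truth.
import java.nio.file.Paths;
import java.util.Collections;

import com.netflix.hollow.api.producer.HollowProducer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemAnnouncer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemPublisher;

public class TraditionalProducerCycle {

    public static void main(String[] args) {
        HollowProducer producer = HollowProducer
                .withPublisher(new HollowFilesystemPublisher(Paths.get("/tmp/hollow/blobs")))
                .withAnnouncer(new HollowFilesystemAnnouncer(Paths.get("/tmp/hollow/announced")))
                .build();

        // Each cycle iterates the entire source of truth; Hollow diffs it against the
        // previous cycle and publishes the resulting state for consumers.
        producer.runCycle(writeState -> {
            for (Object record : fetchAllVideoRecords()) {   // hypothetical, potentially hours-long scan
                writeState.add(record);
            }
        });
    }

    private static Iterable<Object> fetchAllVideoRecords() {
        return Collections.emptyList(); // placeholder for the real source-of-truth iteration
    }
}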

The problem with this total-source-of-truth iteration model is that it can take a long time. In the case of some of our upstream systems, this could take hours. This data-propagation latency was unacceptable — we can’t wait hours for liveness processing if, for example, Title Operations adds a rating to a movie that needs to go live imminently.

The Improvement

What we needed was a faster time machine — one which could produce states with a more frequent cadence, so that changes could be more quickly realized by consumers.

Incremental Hollow is like a faster time machine

To achieve this, we created an incremental Hollow infrastructure for Netflix, leveraging earlier work in the Hollow library that was pioneered in production usage by the Streaming Platform Team at Target (and is now a public, non-beta API).

With this infrastructure, each time a change is detected in a source application, the updated record is encoded and emitted to a Kafka topic. A new component that is not part of the source application, the Hollow Incremental Producer service, performs a repeating cycle at a predefined cadence. During each cycle, it reads all messages which have been added to the topic since the last cycle and mutates the Hollow state engine to reflect the new state of the updated records.

If a message from the Kafka topic contains the exact same data as already reflected in the Hollow dataset, no action is taken.

Hollow Incremental Producer Service
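
Here is a hedged Java sketch of the shape of such a cycle: drain the updates that arrived on the Kafka topic since the last cycle, apply only those mutations, and publish a new state. The topic name, the decodeRecord() helper, and the HollowIncrementalProducer calls are assumptions for illustration; this is not the actual Gatekeeper service.

// Sketch only: an incremental producer loop that applies Kafka-delivered updates to the Hollow state.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import com.netflix.hollow.api.producer.HollowIncrementalProducer;
import com.netflix.hollow.api.producer.HollowProducer;

public class IncrementalProducerLoop {

    public static void run(HollowProducer baseProducer, Properties kafkaProps) {
        HollowIncrementalProducer incremental = new HollowIncrementalProducer(baseProducer);

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(kafkaProps)) {
            consumer.subscribe(Collections.singletonList("video-metadata-updates")); // hypothetical topic
            while (true) {
                // Drain everything that arrived since the last cycle (~30 second cadence).
                for (ConsumerRecord<String, byte[]> msg : consumer.poll(Duration.ofSeconds(30))) {
                    incremental.addOrModify(decodeRecord(msg.value()));
                }
                // Publish a state that reflects only the mutated records.
                incremental.runCycle();
            }
        }
    }

    private static Object decodeRecord(byte[] bytes) {
        throw new UnsupportedOperationException("decode the upstream record here"); // placeholder
    }
}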

To mitigate issues arising from missed events, we implemented a sweep mechanism that periodically iterates over an entire source dataset. As it iterates, it emits the content of each record to the Kafka topic. In this way, any updates which may have been missed will eventually be reflected in the Hollow dataset. Additionally, because this is not the primary mechanism by which updates are propagated to the Hollow dataset, the sweep does not have to run as quickly or frequently as the source-iterating cycle does in traditional Hollow usage.

The Hollow Incremental Producer is capable of reading a great many messages from the Kafka topic and mutating its Hollow state internally very quickly — so we can configure its cycle times to be very short (we are currently defaulting this to 30 seconds).

This is how we built a faster time machine. Now, if Title Operations adds a maturity rating to a movie, within 30 seconds, that data is available in the corresponding Hollow dataset.

The Tangible Result

With the data propagation latency issue solved, we were able to re-implement the Gatekeeper system to eliminate all I/O boundaries. With the prior implementation of Gatekeeper, re-evaluating all assets for all videos in all countries would have been unthinkable — it would tie up the entire content pipeline for more than a week (and we would then still be behind by a week since nothing else could be processed in the meantime). Now we re-evaluate everything in about 30 seconds — and we do that every minute.

There is no such thing as a missed or delayed liveness evaluation any longer, and the disablement of the prior Gatekeeper system reduced the load on our upstream systems — in some cases by up to 80%.

Load reduction on one upstream system

In addition to these performance benefits, we also get a resiliency benefit. In the prior Gatekeeper system, if one of the upstream services went down, we were unable to evaluate liveness at all because we were unable to retrieve any data from that system. In the new implementation, if one of the upstream systems goes down, it stops publishing, but we can still gate using the most recently published (possibly stale) data for its corresponding dataset while all the others continue to make progress. So for example, if the translated synopsis system goes down, we can still bring a movie on-site in a region if it was held back for, and then receives, the correct subtitles.

The Intangible Result

Perhaps even more beneficial than the performance gains has been the improvement in our development velocity in this system. We can now develop, validate, and release changes in minutes which might have before taken days or weeks — and we can do so with significantly increased release quality.

The time-machine aspect of Hollow means that every deterministic process which uses Hollow exclusively as input data is 100% reproducible. For Gatekeeper, this means that an exact replay of what happened at time X can be accomplished by reverting all of our input states to time X, then re-evaluating everything again.
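
As a hedged Java illustration, the open-source Hollow consumer API can pin a dataset to a specific past version, which is the building block for this kind of replay. The blob retriever, path, and version value below are assumptions, not Gatekeeper's actual configuration.

// Sketch only: revert a consumer's view of an input dataset to the version used at time X.
import java.nio.file.Paths;

import com.netflix.hollow.api.consumer.HollowConsumer;
import com.netflix.hollow.api.consumer.fs.HollowFilesystemBlobRetriever;

public class ReplayAtTimeX {

    public static void main(String[] args) {
        HollowConsumer consumer = HollowConsumer
                .withBlobRetriever(new HollowFilesystemBlobRetriever(Paths.get("/tmp/hollow/blobs")))
                .build();

        long versionAtTimeX = 20190601123000001L;  // hypothetical state version captured at time X
        consumer.triggerRefreshTo(versionAtTimeX); // revert this input to its state at time X
        System.out.println("Now reading version: " + consumer.getCurrentVersionId());
    }
}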

We use this fact to iterate quickly on changes to the Gatekeeper business logic. We maintain a PREPROD Gatekeeper instance which “follows” our PROD Gatekeeper instance. PREPROD is also continuously evaluating liveness for the entire catalog, but publishing its output to a different Hollow dataset. At the beginning of each cycle, the PREPROD environment will gather the latest produced state from PROD, and set each of its input datasets to the exact same versions which were used to produce the PROD output.

The PREPROD Gatekeeper instance “follows” the PROD instance

When we want to make a change to the Gatekeeper business logic, we do so and then publish it to our PREPROD cluster. The subsequent output state from PREPROD can be diffed with its corresponding output state from PROD to view the precise effect that the logic change will cause. In this way, at a glance, we can validate that our changes have precisely the intended effect, and zero unintended consequences.

A Hollow diff shows exactly what changes

This, coupled with some iteration on the deployment process, has resulted in the ability for our team to code, validate, and deploy impactful changes to Gatekeeper in literally minutes — at least an order of magnitude faster than in the prior system — and we can do so with a higher level of safety than was possible in the previous architecture.

Conclusion

This new implementation of the Gatekeeper system opens up opportunities to capture additional business value, which we plan to pursue over the coming quarters. Additionally, this is a pattern that can be replicated to other systems within the Content Engineering space and elsewhere at Netflix — already a couple of follow-up projects have been launched to formalize and capitalize on the benefits of this n-hollow-input, one-hollow-output architecture.

Content Setup Engineering is an exciting space right now, especially as we scale up our pipeline to produce more content with each passing quarter. We have many opportunities to solve real problems and provide massive value to the business — and to do so with a deep focus on computer science, using and often pioneering leading-edge technologies. If this kind of work sounds appealing to you, reach out to Ivan to get the ball rolling.


Re-Architecting the Video Gatekeeper was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

from Netflix TechBlog – Medium https://medium.com/netflix-techblog/re-architecting-the-video-gatekeeper-f7b0ac2f6b00?source=rss—-2615bd06b42e—4

Announcing AWS Toolkit for Visual Studio Code

Announcing AWS Toolkit for Visual Studio Code

Visual Studio Code has become an enormously popular tool for serverless developers, partly due to the intuitive user interface. It’s also because of the rich ecosystem of extensions that can customize and automate so much of the development experience. We are excited to announce that the AWS Toolkit for Visual Studio Code extension is now generally available, making it even easier for the development community to build serverless projects using this editor.

The AWS Toolkit for Visual Studio Code entered developer preview in November of 2018 and is open-sourced on GitHub, allowing builders to make their contributions to the code base and feature set. The toolkit enables you to easily develop serverless applications, including creating a new project, local debugging, and deploying your project—all conveniently from within the editor. The toolkit supports Node.js, Python, and .NET.

Using the AWS Toolkit for Visual Studio Code, you can:

  • Test your code locally with step-through debugging in a Lambda-like environment.
  • Deploy your applications to the AWS Region of your choice.
  • Invoke your Lambda functions locally or remotely.
  • Specify function configurations such as an event payload and environment variables.

We’re distributing the AWS Toolkit for Visual Studio Code under the open source Apache License, version 2.0.

Installation

From Visual Studio Code, choose the Extensions icon on the Activity Bar. In the Search Extensions in Marketplace box, enter AWS Toolkit and then choose AWS Toolkit for Visual Studio Code as shown below. This opens a new tab in the editor showing the toolkit’s installation page. Choose the Install button in the header to add the extension.

After you install the AWS Toolkit for Visual Studio Code, you must complete a few additional setup steps to access most of its features. To use the toolkit to develop serverless applications with AWS, you must also install supporting tools on the local machine where you installed the toolkit.

For complete setup instructions, see Setting Up the AWS Toolkit for Visual Studio Code in the AWS Toolkit for Visual Studio Code User Guide.

Building a serverless application with the AWS Toolkit for Visual Studio Code

In this example, you set up a Hello World application using Node.js:

1. Open the Command Palette by choosing Command Palette from the View menu. Type AWS to see all the commands available from the toolkit. Choose AWS: Create new AWS SAM Application.

2. Choose the nodejs10.x runtime, specify a folder, and name the application Hello World. Press Enter to confirm these settings. The toolkit uses AWS SAM to create the application files, which appear in the Explorer panel.

3. To run the application locally, open the app.js file in the editor. A CodeLens appears above the handler function, showing options to Run Locally, Debug Locally, or Configure the function. Choose Run Locally.

This uses Docker to run the function on your local machine. You can see the Hello World output in the console window.

Step-through debugging of the application

One of the most exciting features in the toolkit is also one of the most powerful in helping you find problems in your code: step-through debugging.

The AWS Toolkit for Visual Studio Code brings the power of step-through debugging to serverless development. It lets you set breakpoints in your code and evaluate variable values and object states. You can activate this by choosing Debug Locally in the CodeLens that appears above your functions.

This enables the Debug view, displaying information related to debugging the current application. It also shows a command bar with debugging options and launch configuration settings. As your function executes and stops at a breakpoint, the editor shows current object state when you hover over the code and allows data inspection in the Variables panel.

For developers familiar with the power of the debugging support within Visual Studio Code, the toolkit now lets you use those features with serverless development of your Lambda functions. This makes it easier to diagnose problems quickly and iteratively, without needing to build and deploy functions remotely or rely exclusively on verbose logging to isolate issues.

Deploying the application to AWS

The deployment process requires an Amazon S3 bucket with a globally unique name in the same Region where your application runs.

1. To create an S3 bucket from the terminal window, enter:

aws s3 mb s3://your-unique-bucket-name --region your-region

This requires the AWS CLI, which you can install by following the instructions detailed in the documentation. Alternatively, you can create a new S3 bucket in the AWS Management Console.

2. Open the Command Palette by choosing Command Palette from the View menu. Type AWS to see all the commands available from the toolkit. Choose AWS: Deploy a SAM Application.

3. Choose the suggested YAML template, select the correct Region, and enter the name of the bucket that you created in step 1. Provide a name for the stack, then press Enter. The process may take several minutes to complete. After it does, the Output panel shows that the deployment completed.

Invoking the remote function

With your application deployed to the AWS Cloud, you can invoke the remote function directly from the AWS Toolkit for Visual Studio Code:

  1. From the Activity Bar, choose the AWS icon to open AWS Explorer.
  2. Choose Click to add a region to view functions… and choose a Region from the list.
  3. Click to expand CloudFormation and Lambda. Explorer shows the CloudFormation stacks and the Lambda functions in this Region.
  4. Right-click the HelloWorld Lambda function in the Explorer panel and choose Invoke on AWS. This opens a new tab in the editor.
  5. Choose Hello World from the payload template dropdown and choose Invoke. The output panel displays the ‘hello world’ JSON response from the remote function.

Congratulations! You successfully deployed a serverless application to production using the AWS Toolkit for Visual Studio Code. The User Guide includes more information on the toolkit’s other development features.

Conclusion

In this post, I demonstrated how to deploy a simple serverless application using the AWS Toolkit for Visual Studio Code. Using this toolkit, developers can test and debug locally before deployment, and modify AWS resources defined in the AWS SAM template.

For example, by using this toolkit to modify your AWS SAM template, you can add an S3 bucket to store images or documents, add a DynamoDB table to store user data, or change the permissions used by your functions. It’s simple to create or update stacks, allowing you to quickly iterate until you complete your application.

AWS SAM empowers developers to build serverless applications more quickly by simplifying AWS CloudFormation and automating the deployment process. The AWS Toolkit for Visual Studio Code takes the next step, allowing you to manage the entire edit, build, and deploy process from your preferred development environment. We are excited to see what you can build with this tool!

from AWS Developer Blog https://aws.amazon.com/blogs/developer/announcing-aws-toolkit-for-visual-studio-code/