Netflix Studio Hack Day — May 2019

By Tom Richards, Carenina Garcia Motion, and Marlee Tart

Hack Days are a big deal at Netflix. They’re a chance to bring together employees from all our different disciplines to explore new ideas and experiment with emerging technologies.

For the most recent hack day, we channeled our creative energy towards our studio efforts. The goal remained the same: team up with new colleagues and have fun while learning, creating, and experimenting. We know even the silliest idea can spur something more.

The most important value of hack days is that they support a culture of innovation. We believe in this work, even if it never ships, and love to share the creativity and thought put into these ideas.

Below, you can find videos made by the hackers of some of our favorite hacks from this event.

Project Rumble Pak

You’re watching your favorite episode of Voltron when, after a suspenseful pause, there’s a huge explosion — and your phone starts to vibrate in your hands.

The Project Rumble Pak hack day project explores how haptics can enhance the content you’re watching. With every explosion, sword clank, and laser blast, you get force feedback to amp up the excitement.

For this project, we synchronized Netflix content with haptic effects using Immersion Corporation technology.

By Hans van de Bruggen and Ed Barker

The Voice of Netflix

Introducing The Voice of Netflix. We trained a neural net to spot words in Netflix content and reassemble them into new sentences on demand. For our stage demonstration, we hooked this up to a speech recognition engine to respond to our verbal questions in the voice of Netflix’s favorite characters. Try it out yourself at blogofsomeguy.com/v!

By Guy Cirino and Carenina Garcia Motion

TerraVision

TerraVision re-envisions the creative process and revolutionizes the way our filmmakers can search for and discover filming locations. Filmmakers can drop a photo of a look they like into an interface and find the closest visual matches from our centralized library of location photos. We use a computer vision model trained to recognize places to build reverse image search functionality. The model converts each image into a low-dimensional vector, and matches are obtained by computing the nearest neighbors of the query.
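As a rough illustration of the nearest-neighbor step, here is a minimal sketch assuming each photo has already been converted to an embedding vector (the embed() helper and the library array are hypothetical stand-ins, not the hack's actual code):

```python
import numpy as np

def nearest_locations(query_vec, library_vecs, k=5):
    """Return the indices of the k location photos closest to the query embedding."""
    # Normalize so a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
    similarities = lib @ q
    return np.argsort(-similarities)[:k]

# matches = nearest_locations(embed(dropped_photo), location_embeddings)
```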

By Noessa Higa, Ben Klein, Jonathan Huang, Tyler Childs, Tie Zhong, and Kenna Hasson

Get Out!

Have you ever found yourself needing to give the Evil Eye™ to colleagues who are hogging your conference room after their meeting has ended?

Our hack is a simple web application that allows employees to select a Netflix meeting room anywhere in the world, and press a button to kick people out of their meeting room if they have overstayed their meeting. First, the app looks up calendar events associated with the room and finds the latest meeting in the room that should have already ended. It then automatically calls in to that meeting and plays walk-off music, similar to the Oscars, to not-so-subtly encourage your colleagues to Get Out! We built this hack using Java (Spring Boot framework), the Google OAuth and Calendar APIs (for finding rooms), and the Twilio API (for calling into the meeting), and deployed it on AWS.
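For illustration, here is a rough sketch of the same flow in Python (the hack itself was built in Java with Spring Boot); the calendar ID, phone numbers and TwiML URL are placeholders:

```python
from datetime import datetime, timezone

from googleapiclient.discovery import build
from twilio.rest import Client

def meetings_so_far(creds, room_calendar_id):
    """List the room's events that have already started, oldest first."""
    service = build("calendar", "v3", credentials=creds)
    now = datetime.now(timezone.utc).isoformat()
    events = service.events().list(
        calendarId=room_calendar_id,
        timeMax=now,                 # only meetings that have already started
        singleEvents=True,
        orderBy="startTime",
    ).execute()
    return events.get("items", [])

def play_walkoff_music(account_sid, auth_token, room_dial_in_number):
    """Dial into the room and play walk-off music described by a TwiML document."""
    client = Client(account_sid, auth_token)
    client.calls.create(
        to=room_dial_in_number,
        from_="+15005550006",                    # placeholder caller number
        url="https://example.com/walkoff.xml",   # TwiML that plays the music
    )
```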

By Abi Seshadri and Rachel Rivera

You can also check out highlights from our past events: November 2018, March 2018, August 2017, January 2017, May 2016, November 2015, March 2015, February 2014 & August 2014.

Thanks to all the teams who put together a great round of hacks in 24 hours.



Predictive CPU isolation of containers at Netflix

By Benoit Rostykus, Gabriel Hartmann

Noisy Neighbors

We’ve all had noisy neighbors at one point in our lives. Whether it’s at a cafe or through an apartment wall, it is always disruptive. The need for good manners in shared spaces turns out to be important not just for people, but for your Docker containers too.

When you’re running in the cloud, your containers are in a shared space; in particular, they share the memory hierarchy of the host instance’s CPUs.

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. However, the key insight here is that these caches are partially shared among the CPUs, which means that perfect performance isolation of co-hosted containers is not possible. If the container running on the core next to your container suddenly decides to fetch a lot of data from the RAM, it will inevitably result in more cache misses for you (and hence a potential performance degradation).

Linux to the rescue?

Traditionally it has been the responsibility of the operating system’s task scheduler to mitigate this performance isolation problem. In Linux, the current mainstream solution is CFS (Completely Fair Scheduler). Its goal is to assign running processes to time slices of the CPU in a “fair” way.

CFS is widely used and therefore well tested, and Linux machines around the world run with reasonable performance. So why mess with it? As it turns out, for the large majority of Netflix use cases, its performance is far from optimal. Titus is Netflix’s container platform. Every month, we run millions of containers on thousands of machines on Titus, serving hundreds of internal applications and customers. These applications range from critical low-latency services powering our customer-facing video streaming service, to batch jobs for encoding or machine learning. Maintaining performance isolation between these different applications is critical to ensuring a good experience for internal and external customers.

We were able to meaningfully improve both the predictability and performance of these containers by taking some of the CPU isolation responsibility away from the operating system and moving towards a data driven solution involving combinatorial optimization and machine learning.

The idea

CFS operates by very frequently (every few microseconds) applying a set of heuristics which encapsulate a general concept of best practices around CPU hardware use.

Instead, what if we reduced the frequency of interventions (to every few seconds) but made better data-driven decisions regarding the allocation of processes to compute resources in order to minimize collocation noise?

One traditional way of mitigating CFS performance issues is for application owners to manually cooperate through the use of core pinning or nice values. However, we can automatically make better global decisions by detecting collocation opportunities based on actual usage information. For example if we predict that container A is going to become very CPU intensive soon, then maybe we should run it on a different NUMA socket than container B which is very latency-sensitive. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.

Optimizing placements through combinatorial optimization

What the OS task scheduler is doing is essentially solving a resource allocation problem: I have X threads to run but only Y CPUs available, how do I allocate the threads to the CPUs to give the illusion of concurrency?

As an illustrative example, let’s consider a toy instance with 16 hyperthreads. It has 8 physical hyperthreaded cores, split across 2 NUMA sockets. Each hyperthread shares its L1 and L2 caches with its neighbor, and shares its L3 cache with the 7 other hyperthreads on the socket:

If we want to run container A on 4 threads and container B on 2 threads on this instance, we can look at what “bad” and “good” placement decisions look like:

The first placement is intuitively bad because we potentially create collocation noise between A and B on the first 2 cores through their L1/L2 caches, and on the socket through the L3 cache while leaving a whole socket empty. The second placement looks better as each CPU is given its own L1/L2 caches, and we make better use of the two L3 caches available.

Resource allocation problems can be efficiently solved through a branch of mathematics called combinatorial optimization, used for example for airline scheduling or logistics problems.

We formulate the problem as a Mixed Integer Program (MIP). Given a set of K containers each requesting a specific number of CPUs on an instance possessing d threads, the goal is to find a binary assignment matrix M of size (d, K) such that each container gets the number of CPUs it requested. The loss function and constraints contain various terms expressing a priori good placement decisions such as:

  • avoid spreading a container across multiple NUMA sockets (to avoid potentially slow cross-sockets memory accesses or page migrations)
  • don’t use hyper-threads unless you need to (to reduce L1/L2 thrashing)
  • try to even out pressure on the L3 caches (based on potential measurements of the container’s hardware usage)
  • don’t shuffle things too much between placement decisions

Given the low-latency and low-compute requirements of the system (we certainly don’t want to spend too many CPU cycles figuring out how containers should use CPU cycles!), can we actually make this work in practice?

Implementation

We decided to implement the strategy through Linux cgroups since they are fully supported by CFS, by modifying each container’s cpuset cgroup based on the desired mapping of containers to hyper-threads. In this way a user-space process defines a “fence” within which CFS operates for each container. In effect we remove the impact of CFS heuristics on performance isolation while retaining its core scheduling capabilities.
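A minimal sketch of that mechanism, assuming the cgroup-v1 cpuset layout used by Docker (the real titus-isolate integrates with the Titus agent rather than writing sysfs files directly):

```python
def apply_placement(container_id, cpu_ids):
    """Confine a container to a set of hyper-threads by rewriting its cpuset."""
    cpuset = ",".join(str(c) for c in sorted(cpu_ids))
    path = f"/sys/fs/cgroup/cpuset/docker/{container_id}/cpuset.cpus"
    with open(path, "w") as f:
        f.write(cpuset)

# e.g. keep a latency-sensitive container on one NUMA socket's threads:
# apply_placement("abc123", [0, 1, 2, 3, 32, 33, 34, 35])
```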

This user-space process is a Titus subsystem called titus-isolate which works as follows. On each instance, we define three events that trigger a placement optimization:

  • add: A new container was allocated by the Titus scheduler to this instance and needs to be run
  • remove: A running container just finished
  • rebalance: CPU usage may have changed in the containers so we should reevaluate our placement decisions

We periodically enqueue rebalance events when no other event has recently triggered a placement decision.

Every time a placement event is triggered, titus-isolate queries a remote optimization service (running as a Titus service, hence also isolating itself… turtles all the way down) which solves the container-to-threads placement problem.

This service then queries a local GBRT model (retrained every couple of hours on weeks of data collected from the whole Titus platform) predicting the P95 CPU usage of each container in the coming 10 minutes (conditional quantile regression). The model contains both contextual features (metadata associated with the container: who launched it, image, memory and network configuration, app name…) and time-series features extracted from the last hour of historical CPU usage of the container, collected regularly by the host from the kernel CPU accounting controller.
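As a toy illustration of the quantile-regression step, here is a P95 predictor built with scikit-learn's gradient boosting on synthetic data (the real model, features and training pipeline are internal to Titus):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-ins: 8 features mixing container metadata with aggregates of
# the last hour of CPU usage; y is "CPU cores used over the next 10 minutes".
rng = np.random.default_rng(0)
X = rng.random((1000, 8))
y = 4 * X[:, 0] + rng.gamma(2.0, 0.5, size=1000)

p95_model = GradientBoostingRegressor(loss="quantile", alpha=0.95)  # conditional P95
p95_model.fit(X, y)
print(p95_model.predict(X[:3]))  # predicted P95 CPU usage for three containers
```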

The predictions are then fed into a MIP which is solved on the fly. We’re using cvxpy as a nice generic symbolic front-end to represent the problem, which can then be fed into various open-source or proprietary MIP solver backends. Since MIPs are NP-hard, some care needs to be taken. We impose a hard time budget on the solver to drive the branch-and-cut strategy into a low-latency regime, with guardrails around the MIP gap to control the overall quality of the solution found.
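A stripped-down cvxpy version of that MIP, for the 16-hyper-thread toy instance from earlier: only the "don't use hyper-threads unless you need to" term is modeled, as a linear cost, and a MIP-capable solver backend (e.g. GLPK_MI or CBC) is assumed to be installed. The real loss also carries NUMA, L3-pressure and churn terms.

```python
import cvxpy as cp
import numpy as np

d = 16                                   # hyper-threads on the instance
requests = np.array([4, 2])              # CPUs requested by containers A and B
M = cp.Variable((d, len(requests)), boolean=True)   # binary assignment matrix

constraints = [
    cp.sum(M, axis=0) == requests,       # each container gets what it asked for
    cp.sum(M, axis=1) <= 1,              # a hyper-thread hosts at most one container
]
# Linear proxy for "avoid sibling hyper-threads": the second thread of each
# physical core is more expensive to use than the first.
sibling_cost = np.tile([0.0, 1.0], d // 2)
problem = cp.Problem(cp.Minimize(sibling_cost @ cp.sum(M, axis=1)), constraints)
problem.solve()
print(np.rint(M.value))                  # 0/1 placement of containers onto threads
```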

The service then returns the placement decision to the host, which executes it by modifying the cpusets of the containers.

For example, at any moment in time, an r4.16xlarge with 64 logical CPUs might look like this (the color scale represents CPU usage):

Results

The first version of the system led to surprisingly good results. We reduced the overall runtime of batch jobs by several percent on average while, most importantly, reducing job runtime variance (a reasonable proxy for isolation), as illustrated below. Here we see a real-world batch job runtime distribution with and without improved isolation:

Notice how we mostly made the problem of long-running outliers disappear. The right-tail of unlucky noisy-neighbors runs is now gone.

For services, the gains were even more impressive. One specific Titus middleware service serving the Netflix streaming service saw a capacity reduction of 13% (a decrease of more than 1000 containers) needed at peak traffic to serve the same load with the required P99 latency SLA! We also noticed a sharp reduction of the CPU usage on the machines, since far less time was spent by the kernel in cache invalidation logic. Our containers are now more predictable and faster, and the machines are less heavily used! It’s not often that you can have your cake and eat it too.

Next Steps

We are excited about the strides made so far in this area. We are working on multiple fronts to extend the solution presented here.

We want to extend the system to support CPU oversubscription. Most of our users have challenges knowing how to properly size the number of CPUs their app needs, and in fact this number varies during the lifetime of their containers. Since we already predict future CPU usage of the containers, we want to automatically detect and reclaim unused resources. For example, one could decide to auto-assign a specific container to a shared cgroup of underutilized CPUs, to improve overall isolation and machine utilization, if we can detect the sensitivity threshold of our users along the various axes of the following graph.

We also want to leverage kernel PMC events to more directly optimize for minimal cache noise. One possible avenue is to use the Intel based bare metal instances recently introduced by Amazon that allow deep access to performance analysis tools. We could then feed this information directly into the optimization engine to move towards a more supervised learning approach. This would require a proper continuous randomization of the placements to collect unbiased counterfactuals, so we could build some sort of interference model (“what would be the performance of container A in the next minute, if I were to colocate one of its threads on the same core as container B, knowing that there’s also C running on the same socket right now?”).

Conclusion

If any of this piques your interest, reach out to us! We’re looking for ML engineers to help us push the boundary of container performance and “machine learning for systems”, and systems engineers for our core infrastructure and compute platform.



Making our Android Studio Apps Reactive with UI Components & Redux

By Juliano Moraes, David Henry, Corey Grunewald & Jim Isaacs

Recently Netflix has started building mobile apps to bring technology and innovation to our Studio Physical Productions, the portion of the business responsible for producing our TV shows and movies.

Our very first mobile app is called Prodicle and was built for Android & iOS using the same reactive architecture on both platforms, which allowed us to build 2 apps from scratch in 3 months with 4 software engineers.

The app helps production crews organize their shooting days through shooting milestones and keeps everyone in a production informed about what is currently happening.

Here is a shooting day for Glow Season 3.

We’ve been experimenting with an idea to use reactive components on Android for the last two years. While there are some frameworks that implement this, we wanted to stay very close to the Android native framework. It was extremely important to the team that we did not completely change the way our engineers write Android code.

We believe reactive components are the key foundation for achieving composable UIs that are scalable, reusable, unit testable and A/B test friendly. Composable UIs contribute to fast engineering velocity and produce fewer side-effect bugs.

Our current player UI in the Netflix Android app uses our first iteration of this componentization architecture. We took the opportunity while building Prodicle to improve upon what we learned with the Player UI, and to build the app from scratch using Redux, Components, and 100% Kotlin.

Overall Architecture

Fragments & Activities

— Fragment is not your view.

Having large Fragments or Activities causes all sorts of problems: it makes the code hard to read, maintain, and extend. Keeping them small helps with code encapsulation and better separation of concerns — the presentation logic should be inside a component or a class that represents a view, not in the Fragment.

This is how a clean Fragment looks in our app: there is no business logic. During onViewCreated we pass pre-inflated view containers and the global Redux store’s dispatch function.

UI Components

Components are responsible for owning their own XML layout and inflating themselves into a container. They implement a single render(state: ComponentState) interface and have their state defined by a Kotlin data class.

A component’s render method is a pure function that can easily be tested by creating permutations of possible states.

Dispatch functions are the way components fire actions to change app state, make network requests, communicate with other components, etc.

A component defines its own state as a data class at the top of the file. That’s how its render() function is going to be invoked by the render loop.

It receives a ViewGroup container that will be used to inflate the component’s own layout file, R.layout.list_header in this example.

All the Android views are instantiated lazily, and the render function is responsible for setting the values on those views.

Layout

All of these components are independent by design, which means they do not know anything about each other, but somehow we need to lay out our components within our screens. The architecture is very flexible and provides different ways of achieving it:

  1. Self-inflation into a container: a Component receives a ViewGroup as a container in its constructor and inflates itself using a LayoutInflater. Useful when the screen has a skeleton of containers or is a LinearLayout.
  2. Pre-inflated views: a Component accepts a View in its constructor, so there is no need to inflate it. This is used when the layout is owned by the screen in a single XML.
  3. Self-inflation into a ConstraintLayout: Components inflate themselves into a ConstraintLayout available in their constructor and expose a getMainViewId to be used by the parent to set constraints programmatically.

Redux

Redux provides an event driven unidirectional data flow architecture through a global and centralized application state that can only be mutated by Actions followed by Reducers. When the app state changes it cascades down to all the subscribed components.

Having a centralized app state makes disk persistence very simple using serialization. It also provides the ability to rewind actions that have affected the state for free. After persisting the current state to the disk the next app launch will put the user in exactly the same state they were before. This removes the requirement for all the boilerplate associated with Android’s onSaveInstanceState() and onRestoreInstanceState().

The Android FragmentManager has been abstracted away in favor of Redux-managed navigation. Actions are fired to Push, Pop, and Set the current route. Another component, NavigationComponent, listens to changes to the backStack and handles the creation of new Screens.
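For readers new to the pattern, here is a minimal, language-agnostic sketch of that Redux loop (the app itself is written in Kotlin; the store, reducer and action names below are illustrative):

```python
class Store:
    """Single source of truth: state is only mutated by dispatching actions."""
    def __init__(self, reducer, initial_state):
        self.reducer = reducer
        self.state = initial_state
        self.subscribers = []

    def subscribe(self, render):
        self.subscribers.append(render)

    def dispatch(self, action):
        self.state = self.reducer(self.state, action)   # reducer returns a new state
        for render in self.subscribers:
            render(self.state)                          # cascade to subscribed components

def navigation_reducer(state, action):
    if action["type"] == "PUSH":
        return {**state, "backStack": state["backStack"] + [action["route"]]}
    if action["type"] == "POP":
        return {**state, "backStack": state["backStack"][:-1]}
    return state

store = Store(navigation_reducer, {"backStack": ["home"]})
store.subscribe(lambda s: print("render:", s["backStack"]))
store.dispatch({"type": "PUSH", "route": "shootingDay"})   # render: ['home', 'shootingDay']
```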

The Render Loop

The render loop is the mechanism that loops through all the components and invokes component.render() when needed.

Components need to subscribe to changes in the App State to have their render() called. For optimization purposes, they can specify a transformation function containing the portion of the App State they care about — using selectWithSkipRepeats prevents unnecessary render calls if a part of the state changes that the component does not care about.

The ComponentManager is responsible for subscribing and unsubscribing Components. It extends Android ViewModel to persist state on configuration change, and has a 1:1 association with Screens (Fragments). It is lifecycle aware and unsubscribes all the components when onDestroy is called.

Below is our fragment with its subscriptions and transformation functions:

ComponentManager code is below:

Recycler Views

Components should be flexible enough to work inside and outside of a list. To work with Android’s RecyclerView implementation we’ve created UIComponent and UIComponentForList; the only difference is that the latter extends a ViewHolder and does not subscribe directly to the Redux store.

Here is how all the pieces fit together.

Fragment:

The Fragment initializes a MilestoneListComponent, subscribes it to the Store, and implements the transformation function that defines how the global state is translated into the component’s state.

List Component:

A List Component uses a custom adapter that supports multiple component types, provides async diffing on a background thread through its adapter.update() interface, and invokes each item component’s render() function during onBind() of the list item.

Item List Component:

Item list components can be used outside of a list; they look like any other component except that UIComponentForList extends Android’s ViewHolder class. Like any other component, it implements the render function based on a state data class it defines.

Unit Tests

Unit tests on Android are generally hard to implement and slow to run. We need to mock all the dependencies (Activities, Context, Lifecycle, etc.) just to start testing the code.

Because our components’ render methods are pure functions, we can easily test them by constructing states, without any additional dependencies.

In this unit test example we initialize a UI Component inside the before() and for every test we directly invoke the render() function with a state that we define. There is no need for activity initialization or any other dependency.

Conclusion & Next Steps

The first version of our app using this architecture was released a couple months ago and we are very happy with the results we’ve achieved so far. It has proven to be composable, reusable and testable — currently we have 60% unit test coverage.

Using a common architecture approach allows us to move very fast by having one platform implement a feature first and the other one follow. Once the data layer, business logic and component structure are figured out, it becomes very easy for the other platform to implement the same feature by translating the code from Kotlin to Swift or vice versa.

To fully embrace this architecture we’ve had to think a bit outside of the platform’s provided paradigms. The goal is not to fight the platform, but instead to smooth out some rough edges.



Android Rx onError Guidelines

By Ed Ballot

“Creating a good API is hard.” — anyone who has created an API used by others

As with any API, wrapping your data stream in an Rx observable requires consideration for reasonable error handling and intuitive behavior. The following guidelines are intended to help developers create consistent and intuitive APIs.

Since we frequently create Rx Observables in our Android app, we needed a common understanding of when to use onNext() and when to use onError() to make the API more consistent for subscribers. The divergent understanding is partially because the name “onError” is a bit misleading. The item emitted by onError() is not a simple error, but a throwable that can cause significant damage if not caught. Our app has a global handler that prevents it from crashing outright, but an uncaught exception can still leave parts of the app in an unpredictable state.

TL;DR — Prefer onNext() and only use onError() for exceptional cases.

Considerations for onNext / onError

The following are points to consider when determining whether to use onNext() versus onError().

The Contract

First here are the definitions of the two from the ReactiveX contract page:

OnNext
conveys an item that is emitted by the Observable to the observer

OnError
indicates that the Observable has terminated with a specified error condition and that it will be emitting no further items

As pointed out in the above definition, a subscription is automatically disposed after onError(), just like after onComplete(). Because of this, onError() should only be used to signal a fatal error and never to signal an intermittent problem where more data is expected to stream through the subscription after the error.

Treat it like an Exception

Reserve onError() for exceptional circumstances, when you’d also consider throwing an Error or Exception. The reasoning is that the onError() parameter is a Throwable. An example for differentiating: a database query returning zero results is typically not an exception. The database returning zero results because it was forcibly closed (or otherwise put in a state that cancels the running query) would be an exceptional condition.

Be Consistent

Do not make your observable emit a mix of both deterministic and non-deterministic errors. Something is deterministic if the same input always results in the same output, such as dividing by 0, which will fail every time. Something is non-deterministic if the same inputs may result in different outputs, such as a network request which may time out or may return results before the timeout. Rx has convenience methods built around error handling, such as retry() (and our retryWithBackoff()). The primary use of retry() is to automatically re-subscribe an observable that has non-deterministic errors. When an observable mixes the two types of errors, it makes retrying less obvious, since retrying a deterministic failure doesn’t make sense — or is wasteful, since the retry is guaranteed to fail. (Two notes: 1. retry can also be used in certain deterministic cases like user login attempts, where the failure is caused by incorrectly entering credentials. 2. For mixed errors, retryWhen() could be used to only retry the non-deterministic errors.) If you find your observable needs to emit both types of errors, consider whether there is an appropriate separation of concerns. It may be that the observable can be split into several observables that each have a more targeted purpose.

Be Consistent with Underlying APIs

When wrapping an asynchronous API in Rx, consider maintaining consistency with the underlying API’s error handling. For example, if you are wrapping a touch event system that treats moving off the device’s touchscreen as an exception and terminates the touch session, then it may make sense to emit that error via onError(). On the other hand, if it treats moving off the touchscreen as a data event and allows the user to drag their finger back onto the screen, it makes sense to emit it via onNext().

Avoid Business Logic

Related to the previous point: avoid adding business logic that interprets the data and converts it into errors. The code that the observable is wrapping should have the appropriate logic to perform these conversions. In the rare case that it does not, consider adding an abstraction layer that encapsulates this logic (for both normal and error cases) rather than building it into the observable.

Passing Details in onError()

If your code is going to use onError(), remember that the throwable it emits should include appropriate data for the subscriber to understand what went wrong and how to handle it.

For example, our Falcor response handler uses a FalcorError class that includes the Status from the callback. Repositories could also throw an extension of this class, if extra details need to be included.



Engineering a Studio Quality Experience With High-Quality Audio at Netflix

by Guillaume du Pontavice, Phill Williams and Kylee Peña (on behalf of our Streaming Algorithms, Audio Algorithms, and Creative Technologies teams)

Remember the epic opening sequence of Stranger Things 2? The thrill of that car chase through Pittsburgh not only introduced a whole new set of mysteries, but it returned us to a beloved and dangerous world alongside Dustin, Lucas, Mike, Will and Eleven. Maybe you were one of the millions of people who watched it in HDR, experiencing the brilliant imagery as it was meant to be seen by the creatives who dreamt it up.

Imagine this scene without the sound. Even taking away one part of the soundtrack — the brilliant synth-pop score or the perfectly mixed soundscape of a high speed chase — is the story nearly as thrilling and emotional?

Most conversations about streaming quality focus on video. In fact, Netflix has led the charge for most of the video technology that drives these conversations, from visual quality improvements like 4K and HDR, to behind-the-scenes technologies that make the streaming experience better for everyone, like adaptive streaming, complexity-based encoding, and AV1.

We’re really proud of the improvements we’ve brought to the video experience, but the focus on those makes it easy to overlook the importance of sound, and sound is every bit as important to entertainment as video. Variances in sound can be extremely subtle, but their impact on how the viewer perceives a scene is often measurable. For example, have you ever seen a TV show where the video and audio were a little out of sync?

Among those who understand the vital nature of sound are the Duffer brothers. In late 2017, we received some critical feedback from the brothers on the Stranger Things 2 audio mix: in some scenes, there was a reduced sense of where sounds are located in the 5.1-channel stream, as well as audible degradation of high frequencies.

Our engineering team and Creative Technologies sound expert joined forces to quickly solve the issue, but a larger conversation about higher quality audio continued. Series mixes were getting bolder and more cinematic with tight levels between dialog, music and effects elements. Creative choices increasingly tested the limits of our encoding quality. We needed to support these choices better.

At Netflix, we work hard to bring great audio to our members. We began streaming 5.1 surround audio in 2010, and began streaming Dolby Atmos in 2016, but we wanted to bring studio quality sound to our members around the world. We want your experience to be brilliant even if you aren’t listening with a state-of-the-art home theater system. Just as we support initiatives like HDR and Netflix Calibrated Mode to maintain creative intent in the picture we stream, we wanted to do the same for the sound. That’s why we developed and launched high-quality audio.

To learn more about the people and inspiration behind this effort, check out this video. In this tech blog, we’ll dive deep into what high-quality audio is, how we deliver it to members worldwide, and why it’s so important to us.

What do we mean by “studio quality” sound?

If you’ve ever been in a professional recording studio, you’ve probably noted the difference in how things sound. One reason for that is the files used in mastering sessions are 24-bit 48 kHz with a bitrate of around 1 Mbps per channel. Studio mixes are uncompressed, which is why we consider them to be the “master” version.

Our high-quality sound feature is not lossless, but it is perceptually transparent. That means that while the audio is compressed, it is indistinguishable from the original source. Based on internal listening tests, listening test results provided by Dolby, and scientific studies, we determined that for Dolby Digital Plus at and above 640 kbps, the audio coding quality is perceptually transparent. Beyond that, we would be sending you files that have a higher bitrate (and take up more bandwidth) without bringing any additional value to the listening experience.

In addition to deciding that 640 kbps — a 10:1 compression ratio when compared to a 24-bit 5.1 channel studio master — was the perceptually transparent threshold for audio, we set up a bitrate ladder for 5.1-channel audio ranging from 192 up to 640 kbps. This ranges from “good” audio to “transparent” — there aren’t any bad audio experiences when you stream!
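The 10:1 figure follows from simple arithmetic on the master format:

```python
# Uncompressed 24-bit, 48 kHz, 5.1 (6-channel) studio master vs. 640 kbps DD+.
master_kbps = 24 * 48_000 * 6 / 1000   # ≈ 6912 kbps, i.e. ≈ 1.15 Mbps per channel
print(round(master_kbps / 640, 1))     # ≈ 10.8, roughly the 10:1 ratio quoted above
```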

At the same time, we revisited our Dolby Atmos bitrates and increased the highest offering to 768 kbps. We expect these bitrates to evolve over time as we get more efficient with our encoding techniques.

Our high-quality sound is a great experience for our members even if they aren’t audiophiles. Sound helps to tell the story subconsciously, shaping our experience through subtle cues like the sharpness of a phone ring or the way the chirps of a very dense flock of birds can increase anxiety in a scene. Although variances in sound can be nuanced, the impact on the viewing and listening experience is often measurable.

And perhaps most of all, our “studio quality” sound is faithful to what the mixers are creating on the mix stage. For many years in the film and television industry, creatives would spend days on the stage perfecting the mix only to have it significantly degraded by the time it was broadcast to viewers. Sometimes critical sound cues might even be lost to the detriment of the story. By delivering studio quality sound, we’re preserving the creative intent from the mix stage.

Adaptive Streaming for Audio

Since we began streaming, we’ve used static audio streaming at a constant bitrate. This approach selects the audio bitrate based on network conditions at the start of playback. However, we have spent years optimizing our adaptive streaming engine for video, so we know adaptive streaming has obvious benefits. Until now, we’ve only used adaptive streaming for video.

Adaptive streaming is a technology designed to deliver media to the user in the most optimal way for their network connection. Media is split into many small segments (chunks) and each chunk contains a few seconds of playback data. Media is provided in several qualities.

An adaptive streaming algorithm’s goal is to provide the best overall playback experience — even under a constrained environment. A great playback experience should provide the best overall quality, considering both audio and video, and avoid buffer starvation which leads to a rebuffering event — or playback interruption.

Constrained environments can be due to changing network conditions and device performance limitations. Adaptive streaming has to take all these into account. Delivering a great playback experience is difficult.

Let’s first look at how static audio streaming paired with adaptive video operates in a session with variable network conditions — in this case, a sudden throughput drop during the session.

The top graph shows both the audio and video bitrate, along with the available network throughput. The audio bitrate is fixed and has been selected at playback start whereas video bitrate varies and can adapt periodically.

The bottom graph shows audio and video buffer evolution: if we are able to fill the buffer faster than we play out, our buffer will grow. If not, our buffer will shrink.

In the first session above, the adaptive streaming algorithm for video has reacted to the throughput drop and was able to quickly stabilize both the audio and video buffer level by down-switching the video bitrate.

In the second scenario below, under the same network conditions we used a static high-quality audio bitrate at session start instead.

Our adaptive streaming logic for video reacts, but in this case the available throughput becomes less than the sum of the audio and video bitrates, and our buffer starts draining. This ultimately leads to a rebuffer.

In this scenario, the video bitrate dropped below the audio bitrate, which might not provide the best playback experience.

This simple example highlights that static audio streaming can lead to suboptimal playback experiences with fluctuating network conditions. This motivated us to use adaptive streaming for audio.

By using adaptive streaming for audio, we allow audio quality to adjust during playback to bandwidth capabilities, just like we do for video.

Let’s consider a playback session with exactly the same network conditions (a sudden throughput drop) to illustrate the benefit of adaptive streaming for audio.

In this case we were able to select a higher audio bitrate when network conditions supported it, and we were able to gracefully switch down the audio bitrate and avoid a rebuffer event by maintaining healthy audio and video buffer levels. Moreover, we were able to maintain a higher video bitrate when compared to the previous example.
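A toy simulation captures the intuition of the two sessions described above. The bitrates, buffer sizes and switching rule are illustrative only, not our production algorithm:

```python
def simulate(adaptive_audio, seconds=120):
    """Track buffer level (in seconds of media) through a sudden throughput drop."""
    buffer_s = 10.0
    for t in range(seconds):
        bw = 4000 if t < 20 else 1000                  # throughput drops at t = 20s
        audio = 640 if (not adaptive_audio or bw > 2000) else 192
        video = min(3000, max(600, bw - audio))        # adaptive video, 600 kbps floor
        buffer_s += bw / (audio + video) - 1.0         # media seconds gained per second
        if buffer_s <= 0:
            return f"rebuffer at t={t}s"
    return f"no rebuffer (buffer {buffer_s:.0f}s)"

print("static audio:  ", simulate(adaptive_audio=False))   # buffer drains, rebuffers
print("adaptive audio:", simulate(adaptive_audio=True))    # audio steps down, stays healthy
```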

The benefits are obvious in this simple case, but extending it to our broad streaming ecosystem was another challenge. There were many questions we had to answer in order to move forward with adaptive streaming for audio.

What about device reach? We have hundreds of millions of TV devices in the field, with different CPU, network and memory profiles, and adaptive audio has never been certified. Do these devices even support audio stream switching?

  • We had to assess this by testing adaptive audio switching on all Netflix supported devices.
  • We also added adaptive audio testing in our certification process so that every new certified device can benefit from it.

Once we knew that adaptive streaming for audio was achievable on most of our TV devices, we had to answer the following questions as we designed the algorithm:

  • How could we guarantee that we can improve audio subjective quality without degrading video quality and vice-versa?
  • How could we guarantee that we won’t introduce additional rebuffers or increase the startup delay with high-quality audio?
  • How could we guarantee that this algorithm will gracefully handle devices with different performance characteristics?

We answered these questions via experimentation that led to fine-tuning the adaptive streaming for audio algorithm in order to increase audio quality without degrading the video experience. After a year of work, we were able to answer these questions and implement adaptive audio streaming on a majority of TV devices.

Enjoying a Higher Quality Experience

By using our listening tests and scientific data to choose an optimal “transparent” bitrate, and designing an adaptive audio algorithm that could serve it based on network conditions, we’ve been able to enable this feature on a wide variety of devices with different CPU, network and memory profiles: the vast majority of our members using 5.1 should be able to enjoy new high-quality audio.

And it won’t have any negative impact on the streaming experience. The adaptive bitrate switching happens seamlessly during a streaming experience, with the available bitrates ranging from good to transparent, so you shouldn’t notice a difference other than better sound. If your network conditions are good, you’ll be served up the best possible audio, and it will now likely sound like it did on the mixing stage. If your network has an issue — your sister starts a huge download or your cat unplugs your router — our adaptive streaming will help you out.

After years perfecting our adaptive video switching, we’re thrilled that a similar approach can enable studio quality sound to make it to members’ households, ensuring that every detail of the mix is preserved. Uniquely combining creative technology with engineering teams at Netflix, we’ve been able to not only solve a problem, but use that problem to improve the quality of audio for millions of our members worldwide.

Preserving the original creative intent of the hard-working people who make shows like Stranger Things is a top priority, and we know it enhances your viewing — and listening — experience for many more moments of joy. Whether you’ve fallen into the Upside Down or you’re being chased by the Demogorgon, get ready for a sound experience like never before.



Python at Netflix

By Pythonistas at Netflix, coordinated by Amjith Ramanujam and edited by Ellen Livengood

As many of us prepare to go to PyCon, we wanted to share a sampling of how Python is used at Netflix. We use Python through the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members. We use and contribute to many open-source Python packages, some of which are mentioned below. If any of this interests you, check out the jobs site or find us at PyCon. We have donated a few Netflix Originals posters to the PyLadies Auction and look forward to seeing you all there.

Open Connect

Open Connect is Netflix’s content delivery network (CDN). An easy, though imprecise, way of thinking about Netflix infrastructure is that everything that happens before you press Play on your remote control (e.g., are you logged in? what plan do you have? what have you watched so we can recommend new titles to you? what do you want to watch?) takes place in Amazon Web Services (AWS), whereas everything that happens afterwards (i.e., video streaming) takes place in the Open Connect network. Content is placed on the network of servers in the Open Connect CDN as close to the end user as possible, improving the streaming experience for our customers and reducing costs for both Netflix and our Internet Service Provider (ISP) partners.

Various software systems are needed to design, build, and operate this CDN infrastructure, and a significant number of them are written in Python. The network devices that underlie a large portion of the CDN are mostly managed by Python applications. Such applications track the inventory of our network gear: what devices, of which models, with which hardware components, located in which sites. The configuration of these devices is controlled by several other systems, including a source of truth, application of configurations to devices, and backup. Device interaction for the collection of health and other operational data is yet another Python application. Python has long been a popular programming language in the networking space because it’s an intuitive language that allows engineers to quickly solve networking problems. Subsequently, many useful libraries get developed, making the language even more desirable to learn and use.

Demand Engineering

Demand Engineering is responsible for Regional Failovers, Traffic Distribution, Capacity Operations and Fleet Efficiency of the Netflix cloud. We are proud to say that our team’s tools are built primarily in Python. The service that orchestrates failover uses numpy and scipy to perform numerical analysis, boto3 to make changes to our AWS infrastructure, rq to run asynchronous workloads and we wrap it all up in a thin layer of Flask APIs. The ability to drop into a bpython shell and improvise has saved the day more than once.
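For flavor, here is the general shape of a tiny Flask API that hands a slow task to an rq worker; the route, function and region names are made up for illustration and are not the actual failover service:

```python
from flask import Flask, jsonify
from redis import Redis
from rq import Queue

app = Flask(__name__)
queue = Queue(connection=Redis())

def shift_traffic(from_region, to_region):
    # placeholder for the numerical analysis (numpy/scipy) and boto3 calls
    return f"moved traffic from {from_region} to {to_region}"

@app.route("/failover/<from_region>/<to_region>", methods=["POST"])
def failover(from_region, to_region):
    job = queue.enqueue(shift_traffic, from_region, to_region)   # async workload
    return jsonify({"job_id": job.id}), 202
```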

We are heavy users of Jupyter Notebooks and nteract to analyze operational data and prototype visualization tools that help us detect capacity regressions.

CORE

The CORE team uses Python in our alerting and statistical analytical work. We lean on many of the statistical and mathematical libraries (numpy, scipy, ruptures, pandas) to help automate the analysis of thousands of related signals when our alerting systems indicate problems. We’ve developed a time series correlation system used both inside and outside the team, as well as a distributed worker system to parallelize large amounts of analytical work to deliver results quickly.
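A small sketch of that style of analysis, using numpy to rank correlated signals and ruptures to locate change points (function names and the penalty value are illustrative):

```python
import numpy as np
import ruptures as rpt

def top_correlated(target, signals, k=5):
    """Rank named signals by absolute correlation with the signal that alerted."""
    scores = {name: abs(np.corrcoef(target, s)[0, 1]) for name, s in signals.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def change_points(signal):
    """Return indices where the signal's behavior shifts (PELT change-point search)."""
    return rpt.Pelt(model="rbf").fit(signal).predict(pen=10)
```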

Python is also a tool we typically use for automation tasks, data exploration and cleaning, and as a convenient source for visualization work.

Monitoring, alerting and auto-remediation

The Insight Engineering team is responsible for building and operating the tools for operational insight, alerting, diagnostics, and auto-remediation. With the increased popularity of Python, the team now supports Python clients for most of their services. One example is the Spectator Python client library, a library for instrumenting code to record dimensional time series metrics. We build Python libraries to interact with other Netflix platform level services. In addition to libraries, the Winston and Bolt products are also built using Python frameworks (Gunicorn + Flask + Flask-RESTPlus).

Information Security

The information security team uses Python to accomplish a number of high leverage goals for Netflix: security automation, risk classification, auto-remediation, and vulnerability identification to name a few. We’ve had a number of successful Python open-source projects, including Security Monkey (our team’s most active open source project). We leverage Python to protect our SSH resources using Bless. Our Infrastructure Security team leverages Python to help with IAM permission tuning using Repokid. We use Python to help generate TLS certificates using Lemur.

Some of our more recent projects include Prism: a batch framework to help security engineers measure paved road adoption, risk factors, and identify vulnerabilities in source code. We currently provide Python and Ruby libraries for Prism. The Diffy forensics triage tool is written entirely in Python. We also use Python to detect sensitive data using Lanius.

Personalization Algorithms

We use Python extensively within our broader Personalization Machine Learning Infrastructure to train some of the Machine Learning models for key aspects of the Netflix experience: from our recommendation algorithms to artwork personalization to marketing algorithms. For example, some algorithms use TensorFlow, Keras, and PyTorch to learn Deep Neural Networks, XGBoost and LightGBM to learn Gradient Boosted Decision Trees or the broader scientific stack in Python (numpy, scipy, sklearn, matplotlib, pandas, cvxpy, …). Since we’re constantly trying out new approaches, we use Jupyter Notebooks to drive many of our experiments. We have also developed a number of higher-level libraries to help integrate these with the rest of our ecosystem (e.g. data access, fact logging and feature extraction, model evaluation and publishing).

Machine Learning Infrastructure

Besides personalization, Netflix applies machine learning to hundreds of use cases across the company. Many of these applications are powered by Metaflow, a Python framework that makes it easy to execute ML projects from the prototype stage to production.

Metaflow pushes the limits of Python: We leverage well parallelized and optimized Python code to fetch data at 10Gbps, handle hundreds of millions of data points in memory, and orchestrate computation over tens of thousands of CPU cores.
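A minimal Metaflow flow looks like the following; the steps and artifacts are placeholders, but the FlowSpec/@step structure is the standard Metaflow pattern:

```python
from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.rows = 1_000_000            # artifacts are persisted between steps
        self.next(self.train)

    @step
    def train(self):
        self.model = f"model trained on {self.rows} rows"   # stand-in for real training
        self.next(self.end)

    @step
    def end(self):
        print(self.model)

if __name__ == "__main__":
    TrainFlow()   # run locally with: python train_flow.py run
```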

Notebooks

We are avid users of Jupyter notebooks at Netflix, and we’ve written about the reasons for and nature of this investment before.

But Python plays a huge role in how we provide those services. Python is a primary language when we need to develop, debug, explore and prototype different interactions with the Jupyter ecosystem. We use Python to build custom extensions to the Jupyter server that allow us to manage tasks like logging, archiving, publishing and cloning notebooks on behalf of our users.
We provide many flavors of Python to our users via different Jupyter kernels, and manage the deployment of those kernel specifications using Python.

Orchestration

The Big Data Orchestration team is responsible for providing all of the services and tooling to schedule and execute ETL and ad hoc pipelines.

Many of the components of the orchestration service are written in Python, starting with our scheduler, which uses Jupyter Notebooks with papermill to provide templatized job types (Spark, Presto, …). This gives our users a standardized and easy way to express work that needs to be executed. You can see some deeper details on the subject here. We have also been using notebooks as real runbooks for situations where human intervention is required, e.g., restart everything that has failed in the last hour.
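Executing one of those templatized notebook jobs with papermill boils down to a single call; the notebook paths and parameters below are hypothetical:

```python
import papermill as pm

# The executed output notebook doubles as the run's record/runbook.
pm.execute_notebook(
    "templates/spark_job.ipynb",
    "runs/spark_job_2019-05-01.ipynb",
    parameters={"table": "playback_events", "date": "2019-05-01"},
)
```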

Internally, we also built an event-driven platform that is fully written in Python. We have created streams of events from a number of systems that get unified into a single tool. This allows us to define conditions to filter events, and actions to react to or route them. As a result, we have been able to decouple microservices and get visibility into everything that happens on the data platform.

Our team also built the pygenie client which interfaces with Genie, a federated job execution service. Internally, we have additional extensions to this library that apply business conventions and integrate with the Netflix platform. These libraries are the primary way users interface programmatically with work in the Big Data platform.

Finally, our team is committed to contributing to the papermill and scrapbook open-source projects. Our work there has been for both our own and external use cases. These efforts have been gaining a lot of traction in the open-source community and we’re glad to be able to contribute to these shared projects.

Experimentation Platform

The scientific computing team for experimentation is creating a platform for scientists and engineers to analyze A/B tests and other experiments. Scientists and engineers can contribute new innovations on three fronts: data, statistics, and visualizations.

The Metrics Repo is a Python framework based on PyPika that allows contributors to write reusable parameterized SQL queries. It serves as an entry point into any new analysis.
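A reusable, parameterized query built with PyPika might look like this (the table, columns and helper are illustrative; the actual Metrics Repo conventions are internal):

```python
from pypika import Query, Table, Field
from pypika.functions import Avg

def mean_metric_by_cell(table_name, metric, test_id):
    """Build SQL for a metric's per-cell mean in one A/B test."""
    t = Table(table_name)
    query = (
        Query.from_(t)
        .select(t.cell, Avg(Field(metric)).as_("mean_value"))
        .where(t.test_id == test_id)
        .groupby(t.cell)
    )
    return query.get_sql()

# print(mean_metric_by_cell("allocations", "play_delay_ms", 12345))
```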

The Causal Models library is a Python & R framework for scientists to contribute new models for causal inference. It leverages PyArrow and RPy2 so that statistics can be calculated seamlessly in either language.

The Visualizations library is based on Plotly. Since Plotly is a widely adopted visualization spec, there are a variety of tools that allow contributors to produce an output that is consumable by our platforms.

Partner Ecosystem

The Partner Ecosystem group is expanding its use of Python for testing Netflix applications on devices. Python is forming the core of a new CI infrastructure, including controlling our orchestration servers, controlling Spinnaker, test case querying and filtering, and scheduling test runs on devices and containers. Additional post-run analysis is being done in Python using TensorFlow to determine which tests are most likely to show problems on which devices.

Video Encoding and Media Cloud Engineering

Our team takes care of encoding (and re-encoding) the Netflix catalog, as well as leveraging machine learning for insights into that catalog.
We use Python for ~50 projects such as vmaf and mezzfs, we build computer vision solutions using a media map-reduce platform called Archer, and we use Python for many internal projects.
We have also open sourced a few tools to ease development/distribution of Python projects, like setupmeta and pickley.

Netflix Animation and NVFX

Python is the industry standard for all of the major applications we use to create Animated and VFX content, so it goes without saying that we are using it very heavily. All of our integrations with Maya and Nuke are in Python, and the bulk of our Shotgun tools are also in Python. We’re just getting started on getting our tooling in the cloud, and anticipate deploying many of our own custom Python AMIs/containers.

Content Machine Learning, Science & Analytics

The Content Machine Learning team uses Python extensively for the development of machine learning models that are the core of forecasting audience size, viewership, and other demand metrics for all content.



Introducing SVT-AV1: a scalable open-source AV1 framework

by Andrey Norkin, Joel Sole, Kyle Swanson, Mariana Afonso, Anush Moorthy, Anne Aaron

Netflix Headquarters, Winchester Circle.

Netflix headquarters circa 2014. It’s a nice building with good architecture! This was the primary home of Netflix for a number of years during the company’s growth, but at some point Netflix had outgrown its home and needed more space. One approach to solving this problem would have been to extend the building by attaching new rooms and hallways and rebuilding the older ones. However, a more scalable approach would be to start with a new foundation and construct a new building. Below you can see the new Netflix headquarters in Los Gatos, California. The facilities are modern, spacious and scalable. The new campus started with two buildings, connected together, and was further extended with more buildings when more space was needed. What does this example have to do with software development and video encoding? When you are building an encoder, sometimes you need to start with a clean slate too.

New Netflix Buildings in Los Gatos.

What is SVT-AV1?

Intel and Netflix announced their collaboration on a software video encoder implementation called SVT-AV1 on April 8, 2019. Scalable Video Technology (SVT) is Intel’s open source framework that provides high-performance software video encoding libraries for developers of visual cloud technologies. In this tech blog, we describe the relevance of this partnership to the industry and cover some of our own experiences so far. We also describe how you can become a part of this development.

A brief look into the history of video standards

Historically, video compression standards have been developed by two international standardization organizations, ITU-T and MPEG (ISO). The first successful digital video standard was MPEG-2, which truly enabled digital transmission of video. That success was repeated by H.264/AVC, currently the most ubiquitous video compression standard supported by modern devices, often in hardware. On the other hand, there are examples of video codecs developed by companies, such as Microsoft’s VC-1 and Google’s VPx codecs. The advantage of adopting a video compression standard is interoperability. The standard specification describes in minute detail how a video bitstream should be processed in order to produce displayable video frames. This allows device manufacturers to independently work on their decoder implementations. When content providers encode their video according to the standard, this guarantees that all compliant devices are able to decode and display the video.

Recently, the adoption of the newest video codec standardized by ITU-T and ISO has been slow in light of widespread licensing uncertainty. A group of companies formed the Alliance for Open Media (AOM) with the goal of creating a modern, royalty-free video codec that would be widely adopted and supported by a plethora of devices. The AOM board currently includes Amazon, Apple, ARM, Cisco, Facebook, Google, IBM, Intel, Microsoft, Mozilla, Netflix, Nvidia, and Samsung, and many companies have joined as promoter members. In 2018, AOM published a specification for the AV1 video codec.

Decoder specification is frozen, encoder being improved for years

As mentioned earlier, a standard specifies how the compressed bitstream is to be interpreted to produce displayable video, which means that encoders can vary in their characteristics, such as computational performance and achievable quality for a given bitrate. Encoders can typically be improved for years after the standard has been frozen, including offering different speed and quality trade-offs. An example of such development is the x264 encoder, which kept improving for years after the H.264 standard was finalized.

To develop a conformant decoder, the standard specification should be sufficient. However, to guide codec implementers, the standardization committee also issues reference software, which includes a compliant decoder and encoder. Reference software serves as the basis for standard development: a framework in which the performance of video coding tools is evaluated. The reference software typically evolves along with the development of the standard. In addition, when standardization is completed, the reference software can help to kickstart implementations of compliant decoders and encoders.

AOM has produced the reference software for AV1, called libaom, which is available online. libaom was built upon the codebase of VP9, VP8, and previous generations of VPx video codecs, and it was further developed by the AOM video codec group during the AV1 standardization effort.

Netflix interest in SVT-AV1

Reference software typically focuses on the best possible compression at the expense of encoding speed. It is well known that encoding time of reference software for modern video codecs can be rather long.

One of Intel’s goals with SVT-AV1 development was to create a production-grade AV1 encoder that offers performance and scalability. SVT-AV1 uses parallelization at several stages of the encoding process, which allows it to adapt to the number of available cores, including the newest servers with high core counts. This makes it possible for SVT-AV1 to decrease encoding time while still maintaining compression efficiency.

In August 2018, Netflix’s Video Algorithms team and Intel’s Visual Cloud team decided to join forces on SVT-AV1 development. Since then, the Intel and Netflix teams have collaborated closely on SVT-AV1 development, discussing architectural decisions, implementing new tools, and improving compression efficiency. Netflix’s main interest in SVT-AV1 was somewhat different from, and complementary to, Intel’s intention of building a production-grade, highly scalable encoder.

At Netflix, we believe that the AV1 ecosystem would benefit from an alternative clean and efficient open-source encoder implementation. At least one other open-source AV1 encoder exists, rav1e. However, rav1e is written in the Rust programming language, whereas an encoder written in C has a much broader base of potential developers. The open-source encoder should also enable easy experimentation and serve as a platform for testing new coding tools. Consequently, our requirements for the AV1 software are as follows:

  • Easy to understand code with a low entry barrier and a test framework
  • Competitive compression efficiency on par with the reference implementation
  • Complete toolset and a decoder implementation sharing common code with the encoder, which simplifies experiments on new coding tools
  • Decreased encoder runtime that enables quicker turn-around when testing new ideas

We believe that if SVT-AV1 meets these characteristics, it can be used as a platform both for the development of future video coding standards, such as the research and development efforts towards the AV2 video codec, and for improved AV1 encoding.

Thus, Netflix and Intel approach SVT-AV1 with complementary goals. Encoder speed helps innovation, as it is faster to run experiments. Clean code helps adoption in the open-source community, which is crucial for the success of an open-source project. It can be argued that extensive parallelization may come with compression efficiency trade-offs, but it also allows testing more encoding options. Moreover, we expect multi-core platforms to be used prevalently for video encoding in the future, which makes it important to test new tools in an architecture that supports many threads.

Our progress so far

We have accomplished the following milestones to achieve the goals of making SVT-AV1 an excellent experimentation platform and AV1 reference:

  • Open-sourced SVT-AV1 on GitHub https://github.com/OpenVisualCloud/SVT-AV1/ with a BSD + patent license.
  • Added a continuous integration (CI) framework for Linux, Windows, and macOS.
  • Added a unit test framework based on Google Test. An external contractor is adding unit tests to achieve sufficient coverage of the code already developed. Furthermore, unit tests will cover new code.
  • Added other types of testing to the CI framework, such as automatic encoding and Valgrind tests.
  • Started a decoder project that shares common parts of AV1 algorithms with the encoder.
  • Introduced style guidelines and formatted the existing code accordingly.

SVT-AV1 is currently a work in progress: it is still missing the implementation of some coding tools and therefore has an average gap of about 14% in PSNR BD-rate relative to the libaom encoder in 1-pass mode. The following features are planned to be added and will decrease the BD-rate gap:

  • Multi-reference pictures
  • ALTREF pictures
  • Eighth-pel motion compensation (1/8-pel)
  • Global motion compensation
  • OBMC
  • Wedge prediction
  • TMVP
  • Palette prediction
  • Adaptive transform block sizes
  • Trellis Quantized Coefficient Optimization
  • Segmentation
  • 4:2:2 support
  • Rate control (ABR, VBR, CBR)
  • 2-pass encoding mode

There is still much work ahead, and we are committed to making the SVT-AV1 project satisfy the goal of being an excellent experimentation platform, as well as viable for production applications. You can track the SVT-AV1 performance progress on the beta of the AWCY (AreWeCompressedYet) website. AWCY was the framework used to evaluate AV1 tools during its development. In the figure below, you can see a comparison of two versions of the SVT-AV1 codec, the blue plot representing the SVT-AV1 version from March 15, 2019, and the green one from March 19, 2019.

Screenshot of the AreWeCompressedYet codec comparison page.

SVT-AV1 already stands out in its speed. SVT-AV1 does not reach the compression efficiency of libaom at the slowest speed settings, but it performs encoding significantly faster than the fastest libaom mode. Currently, SVT-AV1 in its slowest mode uses about 13.5% more bits compared to the libaom encoder in 1-pass mode with cpu_used=1 (the second slowest mode of libaom), while being about 4 times faster*. The BD-rate gap with 2-pass libaom encoding is wider, and we are planning to address this by implementing 2-pass encoding in SVT-AV1. One could also note that faster encoding settings of SVT-AV1 decrease encoding times even more dramatically, providing significant additional speed-up.
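For readers who want to reproduce this kind of comparison themselves, below is a minimal sketch of the standard Bjøntegaard delta-rate (BD-rate) calculation behind numbers like the ones above, assuming four (bitrate, PSNR) points per encoder. The sample numbers are made up for illustration and are not taken from the AWCY runs.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Approximate Bjontegaard delta rate: the average bitrate difference (%)
    between two rate/PSNR curves at equal quality."""
    # Fit log(rate) as a cubic polynomial of PSNR for each curve.
    p_anchor = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_test = np.polyfit(psnr_test, np.log(rate_test), 3)

    # Integrate both fits over the overlapping PSNR range.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_anchor = np.polyint(p_anchor)
    int_test = np.polyint(p_test)
    avg_anchor = (np.polyval(int_anchor, hi) - np.polyval(int_anchor, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)

    # Convert the average log-rate difference back to a percentage.
    return (np.exp(avg_test - avg_anchor) - 1) * 100

# Hypothetical rate (kbps) / PSNR (dB) points for an anchor and a test encoder.
print(bd_rate([400, 800, 1600, 3200], [33.0, 36.1, 38.9, 41.2],
              [380, 760, 1500, 3000], [33.1, 36.2, 39.0, 41.3]))
```

A positive result means the test encoder needs more bits than the anchor to reach the same PSNR, which is how the gap with libaom quoted above should be read.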

Open-source video coding needs you!

If you are interested in helping us build SVT-AV1, you can contribute on GitHub https://github.com/OpenVisualCloud/SVT-AV1/ with your suggestions, comments, and, of course, your code.

*These results have been obtained for 8-bit encodes on the set of AOM test video sequences, Objective-1-fast.



Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and Efficiency

By: Di Lin, Girish Lingappa, Jitender Aswani

Imagine yourself in the role of a data-inspired decision maker, staring at a metric on a dashboard, about to make a critical business decision, but pausing to ask a question: “Can I run a check myself to understand what data is behind this metric?”

Now, imagine yourself in the role of a software engineer responsible for a microservice which publishes data consumed by a few critical customer-facing services (e.g., billing). You are about to make structural changes to the data and want to know who and what downstream of your service will be impacted.

Finally, imagine yourself in the role of a data platform reliability engineer tasked with providing advance lead time to data pipeline (ETL) owners by proactively identifying issues upstream of their ETL jobs. You are designing a learning system to forecast Service Level Agreement (SLA) violations and want to factor in all upstream dependencies and their corresponding historical states.

At Netflix, user stories centered on understanding data dependencies, like the ones shared above and countless more across the Detection & Data Cleansing, Retention & Data Efficiency, Data Integrity, Cost Attribution, and Platform Reliability subject areas, inspired the Data Engineering and Infrastructure (DEI) team to envision a comprehensive data lineage system and embark on a development journey a few years ago. We adopted the following mission statement to guide our investments:

“Provide a complete and accurate data lineage system enabling decision-makers to win moments of truth.”

In the rest of this blog, we will a) touch on the complexity of the Netflix cloud landscape, b) discuss lineage design goals, ingestion architecture, and the corresponding data model, c) share the challenges we faced and the learnings we picked up along the way, and d) close with “what’s next” on this journey.

Netflix Data Landscape

Freedom & Responsibility (F&R) is the lynchpin of Netflix’s culture, empowering teams to move fast to deliver on innovation and operate with the freedom to satisfy their mission. Central engineering teams provide paved paths (secure, vetted, and supported options) and guardrails to help reduce variance in the choices of tools and technologies available to support the development of scalable technical architectures. Nonetheless, the Netflix data landscape (see below) is complex, and many teams collaborate to share the responsibility of managing our data systems. Therefore, building a complete and accurate data lineage system to map out all data artifacts (including in-motion and at-rest data repositories, Kafka topics, apps, reports and dashboards, interactive and ad-hoc analysis queries, and ML and experimentation models) is a monumental task that requires a scalable architecture, robust design, a strong engineering team, and, above all, amazing cross-functional collaboration.

Data Landscape

Design Goals

At the project inception stage, we defined a set of design goals to help guide the architecture and development work for data lineage to deliver a complete, accurate, reliable and scalable lineage system mapping Netflix’s diverse data landscape. Let’s review a few of these principles:

  • Ensure data integrity — Accurately capture the relationships in data from disparate data sources to establish trust with users, because without absolute trust, lineage data may do more harm than good.
  • Enable seamless integration — Design the system to integrate with a growing list of data tools and platforms, including those that do not have built-in metadata instrumentation from which to derive data lineage.
  • Design a flexible data model — Represent a wide range of data artifacts and the relationships among them using a generic data model to enable a wide variety of business use cases.

Ingestion-at-scale

The data movement at Netflix does not necessarily follow a single paved path, since engineers have the freedom to choose (and the responsibility to manage) the best available data tools and platforms to achieve their business goals. As a result, there is no single consolidated and centralized source of truth that can be leveraged to derive data lineage. Therefore, the ingestion approach for data lineage is designed to work with many disparate data sources.

Our data ingestion approach, in a nutshell, is classified broadly into two buckets — push or pull. Today, we are operating using a pull-heavy model. In this model, we scan system logs and metadata generated by various compute engines to collect the corresponding lineage data. For example, we leverage Inviso to list Pig jobs and then Lipstick to fetch tables and columns from those Pig scripts. For the Spark compute engine, we leverage Spark plan information, and for Snowflake, admin tables capture the same information. In addition, we derive lineage information from scheduled ETL jobs by extracting workflow definitions and runtime metadata using Meson scheduler APIs.
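To make the pull model concrete, here is a toy sketch of how source and target tables might be pulled out of a textual Spark physical plan. The plan text, regular expressions, and table names are simplified illustrations and not Netflix’s actual parser.

```python
import re

# A simplified stand-in for a Spark physical plan as it might appear in logs.
SAMPLE_PLAN = """
Execute InsertIntoHadoopFsRelationCommand s3://warehouse/dse/playback_summary, false, Parquet
+- *(1) Project [account_id, title_id, view_seconds]
   +- *(1) FileScan parquet dse.playback_events[account_id,title_id,view_seconds]
"""

def tables_from_plan(plan_text):
    """Extract source tables (FileScan entries) and target paths (insert command)."""
    sources = set(re.findall(r"FileScan \w+ ([\w.]+)\[", plan_text))
    targets = set(re.findall(r"InsertIntoHadoopFsRelationCommand (\S+),", plan_text))
    return sources, targets

print(tables_from_plan(SAMPLE_PLAN))
# ({'dse.playback_events'}, {'s3://warehouse/dse/playback_summary'})
```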

In the push model, various platform tools, such as the data transportation layer, reporting tools, and Presto, publish lineage events to a set of lineage-related Kafka topics, making data ingestion relatively easy to scale and improving the scalability of the data lineage system.
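As an illustration of the push model, the sketch below emits a lineage event to a Kafka topic using the open-source kafka-python client. The broker address, topic name, and event schema are hypothetical stand-ins for the internal ones.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client

# Hypothetical broker and topic; the actual Netflix topics and schema are internal.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

lineage_event = {
    "event_time": datetime.now(timezone.utc).isoformat(),
    "producer_tool": "reporting_tool",           # which platform tool emitted this
    "source_entities": ["warehouse.table_a"],    # entities read
    "target_entities": ["dashboard.report_42"],  # entities written or produced
}

producer.send("data-lineage-events", value=lineage_event)
producer.flush()
```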

Data Enrichment

Lineage data, when enriched with entity metadata and associated relationships, becomes more valuable for delivering on a rich set of business cases. We leverage Metacat, our internal metadata store and service, to enrich lineage data with additional table metadata. We also leverage metadata from Genie, our internal job and resource manager, to add job metadata (such as job owner, cluster, and scheduler metadata) to lineage data. The ingestion (ETL) pipelines transform the enriched datasets into a common data model (designed as a graph structure stored as vertices and edges) to serve lineage use cases. The lineage data, along with the enriched information, is accessed through many interfaces: SQL against the warehouse, and Gremlin and a REST lineage service against a graph database populated from the lineage data described above.
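The enrichment step can be pictured as joining raw lineage edges with metadata keyed by entity and job identifiers. The sketch below stubs the Metacat (table metadata) and Genie (job metadata) lookups with plain dictionaries; the real services are queried through internal APIs, and the field names here are illustrative assumptions.

```python
# Stubbed lookups standing in for Metacat (table metadata) and Genie (job metadata).
table_metadata = {
    "warehouse.playback_summary": {"owner": "dse-team", "retention_days": 90},
}
job_metadata = {
    "job_1234": {"owner": "alice", "cluster": "spark-prod", "scheduler": "meson"},
}

def enrich_edge(edge):
    """Attach entity and job metadata to a raw lineage edge (illustrative only)."""
    enriched = dict(edge)
    enriched["target_meta"] = table_metadata.get(edge["target"], {})
    enriched["job_meta"] = job_metadata.get(edge["job_id"], {})
    return enriched

raw_edge = {"source": "warehouse.playback_events",
            "target": "warehouse.playback_summary",
            "job_id": "job_1234"}
print(enrich_edge(raw_edge))
```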

Challenges

We faced a diverse set of challenges spread across many layers in the system. Netflix’s diverse data landscape made it challenging to capture all the right data and conform it to a common data model. In addition, the ingestion layer, designed to address several ingestion patterns, added to operational complexity. Spark is the primary big-data compute engine at Netflix, and with pretty much every Spark upgrade, the Spark plan changed as well, springing continuous and unexpected surprises on us.

We defined a generic data model to store lineage information and are now conforming the entities and associated relationships from various data sources to this data model. We are loading the lineage data into a graph database to enable seamless integration with a REST data lineage service that addresses business use cases. To improve data accuracy, we decided to leverage AWS S3 access logs to identify entity relationships that had not been captured by our traditional ingestion process.
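As a simplified illustration of the S3 access log idea, the sketch below pulls the operation and object key out of a log line and turns them into a read or write relationship for a job. Real S3 server access logs contain many more fields and quoting rules, and the job attribution shown here is assumed rather than derived.

```python
import re

# Simplified: only the operation (e.g. REST.GET.OBJECT / REST.PUT.OBJECT) and the
# object key are extracted; everything else in the log line is ignored.
LOG_PATTERN = re.compile(r"\s(REST\.[A-Z]+\.[A-Z]+)\s(\S+)\s")

def access_to_edge(log_line, job_id):
    """Turn one simplified access-log line into a read/write lineage edge."""
    match = LOG_PATTERN.search(log_line)
    if not match:
        return None
    operation, key = match.groups()
    direction = "reads" if "GET" in operation else "writes" if "PUT" in operation else None
    return {"job": job_id, direction: key} if direction else None

sample = "... REST.PUT.OBJECT warehouse/playback_summary/part-0000.parquet ..."
print(access_to_edge(sample, "job_1234"))
# {'job': 'job_1234', 'writes': 'warehouse/playback_summary/part-0000.parquet'}
```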

We are continuing to address the ingestion challenges by adopting a system-level instrumentation approach for Spark, other compute engines, and data transport tools. We are also designing a CRUD layer exposed as REST APIs to make it easier for anyone to publish lineage data to our pipelines.

We are taking a mature and comprehensive data lineage system and extending its coverage far beyond traditional data warehouse realms, with the goal of building universal data lineage that represents all data artifacts and their corresponding relationships. We are tackling a bunch of very interesting known unknowns with exciting initiatives in the fields of data catalog and asset inventory. Mapping microservice interactions, entities from real-time infrastructure, and ML infrastructure and other non-traditional data stores are a few such examples.

Lineage Architecture and Data Model

Data Flow

As illustrated in the diagram above, various systems have their own independent data ingestion processes in place, leading to many different data models that store entity and relationship data at varying granularities. This data needed to be stitched together to accurately and comprehensively describe the Netflix data landscape, and it required a set of conformance processes before the data could be delivered to a wider audience.

During the conformance process, the data collected from different sources is transformed to make sure that all entities in our data flow, such as tables, jobs, and reports, are described in a consistent format and stored in a generic data model for further usage.

Based on a standard data model at the entity level, we built a generic relationship model that describes the dependencies between any pair of entities. Using this approach, we are able to build a unified data model and repository that deliver the right leverage to enable multiple use cases such as data discovery, the SLA service, and data efficiency.
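To illustrate what such a generic entity-and-relationship (vertex/edge) model could look like, here is a minimal sketch in Python. The class and field names are assumptions for illustration, not Netflix’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Entity:
    """A vertex in the lineage graph: a table, job, report, Kafka topic, etc."""
    entity_id: str
    entity_type: str                           # e.g. "table", "job", "report"
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class Relationship:
    """A directed edge describing a dependency between two entities."""
    source_id: str
    target_id: str
    relation_type: str                         # e.g. "produces", "consumes"
    attributes: Dict[str, str] = field(default_factory=dict)

# One job consuming one table, expressed as two vertices and one edge.
events = Entity("warehouse.playback_events", "table")
job = Entity("job_1234", "job", {"scheduler": "meson"})
edge = Relationship(job.entity_id, events.entity_id, "consumes")
```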

Current Use Cases

Big Data Portal, a visual interface to data management at Netflix, has been the primary consumer of lineage data thus far. Many features benefit from lineage data, including ranking of search results, table column usage for downstream jobs, deriving upstream dependencies in workflows, and building visibility into jobs writing to downstream tables.

Our most recent focus has been on powering (a) a data lineage service (REST based) leveraged by the SLA service and (b) the data efficiency (to support data lifecycle management) use cases. The SLA service relies on the job dependencies defined in ETL workflows to alert on potential SLA misses. It also proactively alerts on potential delays in a few critical reports due to job delays or failures anywhere upstream of them.
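The upstream-dependency check at the heart of such alerting can be pictured as a graph traversal. The toy sketch below uses networkx on an in-memory graph to flag a report job as at risk when any of its upstream ancestors is delayed; the production system instead queries a graph database through the lineage REST service, and the job names here are hypothetical.

```python
import networkx as nx

# Toy lineage graph: an edge points from an upstream job to the job that depends on it.
graph = nx.DiGraph()
graph.add_edges_from([
    ("ingest_events", "build_summary"),
    ("build_summary", "exec_report"),
    ("dim_refresh", "exec_report"),
])

delayed_jobs = {"ingest_events"}

def sla_at_risk(job):
    """A downstream job is at risk if any of its upstream ancestors is delayed."""
    upstream = nx.ancestors(graph, job)
    return sorted(upstream & delayed_jobs)

print(sla_at_risk("exec_report"))  # -> ['ingest_events']
```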

The data efficiency use case leverages visibility into entities and their relationships to drive cost and attribution gains and to automatically clean up unused data in the warehouse.

What’s next?

Our journey to extend the value of lineage data to new frontiers has just begun, and we have a long way to go to achieve the overarching goal of providing universal data lineage representing all entities and corresponding relationships for all data at Netflix. In the short to medium term, we are planning to onboard more data platforms and to leverage the graph database, a lineage REST service, and a GraphQL interface to enable more business use cases, including improving developer productivity. We also plan to increase our investment in data detection initiatives by integrating lineage data and detection tools to efficiently scan our data and further improve data hygiene.

Please share your experience by adding your comments below, and stay tuned for more on data lineage at Netflix in follow-up blog posts.

Credits: We want to extend our sincere thanks to the many esteemed colleagues from the data platform team (part of the Data Engineering and Infrastructure org) at Netflix who pioneered this topic before us, who continue to extend our thinking on it with their valuable insights, and who are building many useful services on lineage data.

We will be at Strata San Francisco on March 27th in room 2001, delivering a tech session on this topic. Please join us and share your experiences.

If you would like to be part of our small, impactful, and collaborative team — come join us.

