How Netflix Built Their AI Infrastructure

This fireside chat between Prasanna Padmanabhan, ML Platform Director at Netflix, and Gideon Mendels, CEO at Comet, delves into the journey of building an MLOps team at Netflix with the help of Comet. In this engaging session, they will give you a glimpse into the intricate architecture of Netflix’s machine learning infrastructure and share valuable insights into the intricacies of leading MLOps teams in today’s ever-evolving tech landscape.

Expect to learn:

Strategies and considerations for building an MLOps team
An inside look at the infrastructure that powers Machine Learning at Netflix
Challenges and opportunities in MLOps team management and infrastructure development

A transcript is available below if you prefer to read through the interview.

Gideon Mendels:

Hi everyone. I’m super excited to invite our speaker, Prasanna Padmanabhan, Netflix machine learning platform director, who’s going to share their story of how Netflix built their machine learning infrastructure.

Prasanna Padmanabhan:

Thanks, Gideon. I’m excited to be here. As Gideon said, my name is Prasanna. I lead the ML platform teams within Netflix. Excited to be here and happy to see everybody here.

Gideon Mendels:

Thank you again. Maybe to start, if you can share a little bit about your journey into engineering in general, machine learning. What brought you to that? What did you do before?

Prasanna Padmanabhan:

All right. A pretty deep question to start with, maybe a soul-searching way of identifying this. Where do I start? Look, as everybody here in this room, as a kid growing up I still vividly remember the first time I had my hands on a computer and I was really fascinated by the power, the magic that computer brings in. But over a period of time I think I was pretty intrigued by data specifically, the patterns that you see in data. The decisions that you can make with analyzing data I think is super powerful.

I did my masters with database specialization. Then when I joined Yahoo, it really started helping me understand how you build large-scale distributed systems. With the advent of things like Spark, you really make data engineering super easy for everybody to try and analyze things. If I combine things like distributed systems and data infrastructure, that’s where I started to get into ML data infrastructure, and then slowly moved into the training side of things with the advent of PyTorch and TensorFlow, which make it easier for anyone to try things out on training. Maybe the short answer is, I think it’s all about data that got me excited about coming into the field of machine learning.

Gideon Mendels:

That’s amazing. I know you’ve been at Netflix for almost 10 years now, right? I know a lot of us, I spoke to people before, a lot of us use and watch Netflix a lot. Can you share a little bit, on a high level, what are some of the projects that Netflix utilizes ML for?

Prasanna Padmanabhan:

Look, ML is practiced in many different places at Netflix. Internally we have something we call pitch to play. What I mean by that is, from the time when our creatives come to us with a pitch to the time where we green light that pitch; get into identifying the casting; doing things like shooting the actual video, whether it’s a TV series or a movie; doing things like pre-production and then finally post-production, and then it comes to the service. What you are all well familiar with is personalization, which is at the very end, when the product is there for you to watch. Then there’s lots of things we do in personalization.

But ML is also used in many different places across what we call pitch to play. We build models to do things like content demand understanding. What I mean by that again is if we green light a particular show, how many members will watch that particular show? That’s a good signal for our creatives to decide if we should green light a show. Then even during content creation, we started to leverage machine learning for creating content. Again, not by way of replacing our creatives, but leveraging machine learning to make our creatives more creative, so that they can focus more on creative tasks and less on mundane tasks.

Gideon Mendels:

I think obviously there’s a lot of use cases, and assuming there’s more stuff on the business side as well. But I know when we last spoke we talked about these two key focus areas for your team or your internal customers, personalization and then content generation. Which at first I think might be surprising for some people, especially with your philosophy of not trying to replace the creator but really try to assist them. Maybe we can cover both of these areas and then jump into a little bit more media engineering stuff. On the personalization side, curious, maybe if you can describe a little bit what kind of models do you guys use, of course the level you feel comfortable sharing? Then how do you measure these models post-production? How do we make sure that… What is the north star metric that you guys are optimizing for?

Prasanna Padmanabhan:

Yeah. Look, our north star vision for personalization is your folks go to your home, switch on your TVs and your absolute best content for that is already started playing for you. That’s our vision. Are we there? We are nowhere close to that. Show of hands, how many people watch Netflix or try to watch Netflix and then spend like, I don’t know, 10, 15 minutes and you get tired of identifying which one to see? There you go. There’s lots of work that we need to do in personalization. One of the favorite quotes from Reed Hastings, our CEO, is that, “We suck where we are today compared to where we will be in the next several months to days to years.” I think we are doing well in certain areas in personalization, but not that great in many areas of personalization as well. But making sure that we are able to recommend the right title to the right member, at the right device, at the right time is super critical for our business.

How we measure that is, if I look at some business metrics, how many hours did our members stream Netflix? It’s a good indication of whether they are really enjoying the content. As a subscription company, whether members retain is another key business metric that we look at to see if our members are enjoying the shows and the various content that we have on Netflix.

But in terms of models, look, again, there are a slew of machine learning models that power the Netflix homepage. There are models that figure out what videos to show within a row. There are models that figure out what rows to show to our members in the first place. There are models to figure out which artwork should we show for a given member in a video combination. For example, you and I may like Stranger Things, but the artwork that we show for that could be different based on our taste and things like that.

Search is also a recommendation, but we also think of search as a failure of homepage recommendation. Why would you go to search, except that you don’t see what you want in the videos we are recommending on the homepage? These are all the different types of models that we leverage within personalization. But again, it’s our bread and butter. We say choosing is a very critical component of Netflix’s business, and personalization models help our members choose the right content as fast as possible. But as you can see here, I think our work will be done when, next time I come and ask people, they’re like, “No, it’s not that much. We’re able to choose videos much faster.”

Gideon Mendels:

That’s super, super exciting. I know the other area we talked about that I think is very exciting, especially as I think the industry is very excited, some would say a little bit hyped, around generative models. You talked a little bit about content generation. Maybe you can share a little bit, why is that something that’s important for Netflix? What are some of the models or projects that you guys built in this space? Again, I think what’s always interesting is how do you measure success with these things?

Prasanna Padmanabhan:

Okay. We spend tens of billions of dollars in content creation, like I think $15 billion or so last year. So you can imagine how many TV shows and movies will be released almost on a daily basis with that budget. If you think about content creation, there’s a lot of mundane tasks that our creatives need to do, which are so not creative. Take an example of trailer generation. You’ve seen Mission Impossible, for example, and you want to create a trailer for Mission Impossible. It’s a very strenuous task. Our creatives look at each scene, try to figure out which scene is worthy enough to be in a trailer, and then they try to… The whole technical process is called Breakdown Assistant, where they break down the whole video into multiple segments and tag each of those segments so that they are searchable at a later point in time. Then they make sure there’s a diversity of shots in that. At the same time, you need to make sure that you don’t reveal any secrets, so to speak, as part of the trailer.

This is just one part of content creation. What we’re trying to now do is to see if we can leverage machine learning models which can understand content better and help our creatives to provide some candidate, say, scenes which we think the model thinks could be worthy enough to be in a trailer. Again, this is not for replacing creatives, this is useful for creatives to be more creative and focus less on the mundane activities. That’s one area of where we have been using machine learning for content creation.

Artwork generation is another classic example, too. Today we spend millions of dollars to identify what’s the right artwork to generate for different content. Again, with hundreds and thousands of videos in our catalog, doing that in a manual fashion is not going to scale as we grow. Rather, can we look at the frames that are there in a video and see which of them could be candidates for our creatives to then add more touch-ups to and make more aesthetic, so that our members may like or want to play a particular video? Those are some of the models that we are starting to use in ML for content creation. There’s more in virtual production, like when you want to remove some unwanted objects in your scenes. Can we leverage machine learning models that can do these kinds of things in an automated fashion, or at least in a semi-automated fashion, rather than doing it on a very manual basis?

Gideon Mendels:

That’s exciting. You definitely have a very large catalog of videos and images you can train massive models on, probably one of the biggest ones. So very, very exciting. Thank you for sharing. I think it’s very helpful to understand a little bit some of the business problems. One of the things that at least really impressed me, and this is not the first time we’re chatting of course, but how versed you are with the business and the specifics and the creators’ work and what they’re actually doing. Even though theoretically you’re more on the machine learning platform side.

But what we see with a lot of our customers, and generally with customers that have seen success with ML, is that there isn’t this disconnect between what the business or the end user needs and the actual people building the models and the infrastructure. That was just great to see. With that in mind, maybe we switch gears a little bit. You guys have been doing machine learning in production for a very, very long time. Could you walk us through a little bit, start from either side you want, but on a high level end-to-end, what does the machine learning platform look like at Netflix?

Prasanna Padmanabhan:

Yeah, how much time do we have? But look, let me try and see if I can summarize that. If you take a machine learning pipeline, you typically start with data and features. You need the infrastructure to make discovering data as simple as possible. But even if I take a step back before I go into the details, what’s the success criteria for a machine learning platform team? You are enabling ML practitioners to try out A/B tests or offline experiments as fast as possible. The time that it takes from ideation to productization, that’s the key thing that you’re trying to optimize for. If your infrastructure is good, that time is obviously much lower. Again, your machine learning models can only be as good as the data that we provide to them. So it’s super critical for us to make sure that accessing data and discovering data is as simple as possible.

The first piece of infrastructure that we built within Netflix is data infrastructure for machine learning. The running joke among Netflix data engineers is that Netflix is a logging company that also happens to stream movies, because so much data gets logged into our data platform so that we can better understand our members’ taste. The first part is around, as I said, data infrastructure, making sure that we can easily discover data, making sure it’s easy to access data. Not just in personalization, but also for content creation, accessing media data in a much simpler fashion in a notebook setting. We leverage notebooks as the core IDE for ML practitioners. Then as you get data, how do you make sure your features are generated in a simple fashion? How do you standardize features? How do you store them in the right formats that are more efficient for training? How do you discover features and make sharing easier across ML domains, not just within a particular domain?

Then once that part is done, how do you train things faster? How do you make access to GPUs as simple as possible? I’m sure every company has things to… Making access to GPUs, it’s not easy because it’s more expensive. Once you have GPUs, how do you want to train things in a distributed fashion as simple as possible? Then look at your offline model metrics and be able to see if that experiment is good or not.

Again, essentially how do you close the loop from offline experimentation to productization and learning from the product as fast as possible? Essentially it means building tools in each of these different layers, right from notebooks for offline experimentation, to good workflow schedulers to schedule workflows on a regular basis, to tools to figure out the quality of data in different areas as well, from the raw data to your features. Making sure that those features are clean before you train your models, and that the model variant that you just trained today is better than what is already in production so that you can then turn on that model in production. Those are the different stages, and essentially we are building tools and infrastructure for all these different stages.
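The last two checks described here (clean features before training, and a newly trained variant that beats production before it is turned on) can be pictured as a small gate. The following is only a minimal sketch: `ModelVariant`, `features_are_clean`, `should_promote`, and the `min_gain` threshold are hypothetical names chosen for illustration, not Netflix's tooling, and "clean" is reduced here to "no missing values".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelVariant:
    name: str
    offline_metric: float  # assumed: higher is better on a shared held-out set


def features_are_clean(rows: List[Dict[str, Optional[float]]]) -> bool:
    """Toy data-quality check: every feature value must be present."""
    return all(value is not None for row in rows for value in row.values())


def should_promote(candidate: ModelVariant,
                   production: ModelVariant,
                   min_gain: float = 0.0) -> bool:
    """Promote only if the candidate beats production by more than min_gain."""
    return candidate.offline_metric - production.offline_metric > min_gain


training_rows = [{"hours_watched": 3.5, "thumbs_up": 2.0},
                 {"hours_watched": 0.0, "thumbs_up": 0.0}]
production = ModelVariant("in-production", offline_metric=0.71)
candidate = ModelVariant("trained-today", offline_metric=0.74)

# Gate the rollout on both checks.
decision = features_are_clean(training_rows) and should_promote(
    candidate, production, min_gain=0.01)
```

In a real pipeline the metric comparison would more likely feed an A/B test decision than an automatic switch; the sketch only shows where such a gate sits.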

Gideon Mendels:

If you don’t mind, this is fascinating, maybe we can double click. I know there’s some things that you guys… You solved some really hard problems. I know you talked a little bit about notebooks being the IDE. How do you make these features accessible? Let’s start with offline training, and we’ll figure out how we connect it in production. I’m sure that’s not easy as well. How do you make these features accessible to a data scientist who just wants to experiment, try different things? What does that look like?

Prasanna Padmanabhan:

Yeah. All right. If you think about… Okay, I’m going to talk about personalization because I think many people can connect here. What are some of the common signals you think are useful for personalization? Let’s make this more conversational. Any guesses? Great. What you watched in the past is a great indication of what you want to watch in the future. We have a little thing called thumbs up and thumbs down. Thumbs up is a good signal as well. Lots of explicit signals like that. Then there are lots of implicit signals too, like what videos you have not watched is also a great indication of what you like and what you don’t like. Providing access to that data in a simple fashion is the most important thing for making feature generation, or feature engineering, as simple as possible.

You don’t want your ML practitioners to have to know where the data is sitting. That could be in your big data platform in some Iceberg tables or Hive tables. You really want a programmatic way to access that data and not worry about where it is stored, in what format it is stored, and things like that. If you fire up a notebook, there should be simple APIs to say, “Oh, get me the watch history data for a particular cohort of members.” Then you should be able to write features or feature definitions on that data, and compute them within the notebook environment. Again, things like Spark make it super easy for us to do distributed data processing on… Some of these are embarrassingly parallel ETL jobs. Feature engineering or label engineering is largely ETL. So having access to these kinds of libraries and having a standard way to create features is also critical. Then once you’ve done that, having a repository of features and making that easily discoverable is another thing too.
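To make the idea of a programmatic data API concrete, here is a minimal sketch. Everything in it is hypothetical: `DataClient`, `get_watch_history`, the cohort names, and the in-memory backend are stand-ins invented for illustration, and a real backend would read from tables such as Iceberg or Hive (typically through Spark) instead of a Python list.

```python
from typing import Dict, Iterable, List, Protocol, Set


class WatchHistoryBackend(Protocol):
    """Anything that can yield watch-history rows; storage details stay hidden."""
    def scan(self) -> Iterable[dict]: ...


class InMemoryBackend:
    """Stand-in for a table in a big data platform."""
    def __init__(self, rows: List[dict]) -> None:
        self._rows = list(rows)

    def scan(self) -> Iterable[dict]:
        return iter(self._rows)


class DataClient:
    """Hypothetical notebook-facing API: ask for data by meaning, not by location."""
    def __init__(self, backend: WatchHistoryBackend,
                 cohorts: Dict[str, Set[str]]) -> None:
        self._backend = backend
        self._cohorts = cohorts

    def get_watch_history(self, cohort: str) -> List[dict]:
        members = self._cohorts[cohort]
        return [row for row in self._backend.scan() if row["member_id"] in members]


def minutes_watched_per_member(rows: Iterable[dict]) -> Dict[str, int]:
    """A simple feature definition written against the returned rows."""
    totals: Dict[str, int] = {}
    for row in rows:
        totals[row["member_id"]] = totals.get(row["member_id"], 0) + row["minutes"]
    return totals


backend = InMemoryBackend([
    {"member_id": "m1", "video_id": "v1", "minutes": 30},
    {"member_id": "m1", "video_id": "v2", "minutes": 45},
    {"member_id": "m2", "video_id": "v1", "minutes": 20},
    {"member_id": "m3", "video_id": "v9", "minutes": 60},
])
client = DataClient(backend, cohorts={"new_members": {"m1", "m2"}})
features = minutes_watched_per_member(client.get_watch_history("new_members"))
```

The practitioner only names the cohort and the kind of data; swapping the backend does not change the feature code.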

Within Netflix we have something called a fact store, and I know fact is not a very commonly used industry-standard term. Think of a fact as raw data from which you can create a machine learning feature. For example, what you call watch history is good fact data. These are immutable data. With our fact store we snapshot data from various online microservices. We have a viewing history service or a watch history service, which captures all the videos that our members are watching, and which is useful for the product as well. The videos you have thumbed up are something you can see in our app too. These are typically powered by our microservices. Our fact store snapshots data from these different microservices on a daily basis, and then provides ways to do time travel and to compute features for any arbitrary time in the past. Things like easier discoverability of data, easier access to data, and being able to do time travel make feature generation a lot easier at Netflix.
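The snapshot-plus-time-travel idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the actual fact store: the `FactStore` class, its `snapshot` and `as_of` methods, and the in-memory layout are all hypothetical.

```python
import bisect
from datetime import date
from typing import Dict, List, Sequence, Tuple

Facts = Dict[str, Tuple[str, ...]]  # member_id -> video ids watched so far


class FactStore:
    """Toy fact store: immutable daily snapshots with as-of lookups."""

    def __init__(self) -> None:
        self._dates: List[date] = []          # snapshot dates, kept sorted
        self._snapshots: Dict[date, Facts] = {}

    def snapshot(self, day: date, watch_history: Dict[str, Sequence[str]]) -> None:
        """Record what the online service held on the given day."""
        if day not in self._snapshots:
            bisect.insort(self._dates, day)
        self._snapshots[day] = {m: tuple(v) for m, v in watch_history.items()}

    def as_of(self, day: date) -> Facts:
        """Time travel: return the latest snapshot taken on or before `day`."""
        index = bisect.bisect_right(self._dates, day)
        if index == 0:
            raise KeyError(f"no snapshot on or before {day}")
        return self._snapshots[self._dates[index - 1]]


def num_titles_watched(facts: Facts, member_id: str) -> int:
    """Feature computed from raw facts, valid for whichever day the facts describe."""
    return len(set(facts.get(member_id, ())))


store = FactStore()
store.snapshot(date(2024, 1, 1), {"m1": ["v1", "v2"]})
store.snapshot(date(2024, 1, 3), {"m1": ["v1", "v2", "v3"]})

# A feature for 2 January uses the 1 January snapshot, never later data.
feature_jan_2 = num_titles_watched(store.as_of(date(2024, 1, 2)), "m1")
feature_jan_3 = num_titles_watched(store.as_of(date(2024, 1, 3)), "m1")
```

Because the stored facts are raw and immutable, a feature invented today can still be computed for any past date that has a snapshot.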

Gideon Mendels:

Interesting. I think one of the things, I mean you talked about fact store, but from what I’m hearing it’s actually quite different from a feature store, right?

Prasanna Padmanabhan:

Yeah.

Gideon Mendels:

But is the platform team providing the computed features? Or do the data scientists just get raw data and compute them themselves? What does that look like from the user experience?

Prasanna Padmanabhan:

Yeah. Until recently we only supported a fact store and then let our ML practitioners generate whatever features they wanted to. But over a period of time we expanded machine learning across several domains, like now we are into games. I don’t know how many people play games? I don’t play. There’s not many games in Netflix yet, but this is an ongoing investment that we have started and will continue over the next few years. But again, as I said, as we start to embark on machine learning in different domains, we felt it’s important to share features across ML domains, and not just features, embeddings too. Because it’s inherently a time-intensive operation and a resource-intensive operation. You don’t want to reinvent the wheel every time. So a feature store along with a fact store is really helping our ML practitioners to cold start new models easily, so that you can look at existing features and reuse them. At the same time, you can create new features easily with a fact store.

Again, many companies also do things like feature logging, which is a very reasonable approach to feature generation for offline model training. But the problem that we have seen within Netflix with feature logging is that typically you want to train your models on large data sets. If your training window size is, say, I don’t know, a couple of weeks for example, then you really have to create a feature, deploy that in production, and wait for the data to be collected before you can train. Unless your scale is so big that you only need to wait a few hours to have enough training samples. I don’t think Netflix is there at this point in time, or at that level. So we feel a combination of a fact store and a feature store really makes feature engineering much simpler to do.

Gideon Mendels:

Fascinating. Well, I’m curious to hear what does it look like when eventually this model makes it to production. But maybe we’ll follow how you covered the… We talked a little bit about the data side, making the data accessible. Let’s talk a little bit about training. I’m curious to hear recommendation engines, and from what I’m hearing, the work you’re doing about generative models sounds like very different needs there. Maybe you can talk a little bit what does training look like for these two different use cases and how do they differ?

Prasanna Padmanabhan:

Interestingly, they’re not that different in some ways. Again, I want to also clarify when we say generative AI, at this point in time we’re not building models to create content, and AI-generated content is not what you’ve seen in our app. I just want to clarify that. What we are doing is to leverage, as I said, machine learning for making content creation easier. I can’t emphasize more that we are not here to replace our creatives, but more to help our creatives be more creative. But in the case of training, what we have seen in ML for content creation is that those models are inherently big. I’ve seen some folks who are into computer vision here; these models have many more parameters, several millions of parameters if not billions. How you train those models from scratch is slightly different from typical foundational personalization models, which are maybe smaller with respect to model size, but the data size is humongous.

For both these types of models, you want to enable things like distributed training to make model training faster. You want to make access to data faster here too. What I mean by that is, in the case of media data, data access throughput is the slowest part that we have seen. You may have the fastest GPUs, but if you’re not able to optimize the data loading time, your training is slower. This is where things like… Again, we are an AWS shop, and we have leveraged things like FSx for Lustre from AWS, which has really sped up how we can load data into TensorFlow and PyTorch, and leverage GPUs better for some of those use cases.

Data loading is slightly different between personalization and ML for content creation, because media data is inherently much bigger. So how you optimize data loading for these training jobs is different. But at the end of the day, everything else relies more on distributed training. Whether you use data parallelism or model parallelism differs from case to case, or whether it’s fully sharded data parallel or model parallelism. That’s kind of the difference that I see between recommendation models and what we call ML for content creation. Most of the rest is similar in many cases.
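For readers less familiar with the terms, the core of data parallelism fits in a few lines: each worker computes a gradient on its own shard of the batch, and the averaged result matches the full-batch gradient. The sketch below is a plain-Python illustration with a one-parameter model and made-up numbers; it assumes equal-sized shards and uses a simple average as a stand-in for the all-reduce step a real framework would perform. Model parallelism and fully sharded data parallel differ in that they also split the model's parameters across workers, which this sketch does not show.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def gradient(w: float, batch: Batch) -> float:
    """Gradient of mean squared error for the model y ~ w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w: float, batch: Batch, num_workers: int) -> float:
    """Each worker handles one equal-sized shard; gradients are then averaged."""
    shard_size = len(batch) // num_workers  # assumes the batch divides evenly
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    local_gradients = [gradient(w, shard) for shard in shards]
    return sum(local_gradients) / num_workers  # stand-in for an all-reduce mean


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = gradient(1.5, batch)
parallel = data_parallel_gradient(1.5, batch, num_workers=2)
```

Every worker here holds the whole model, which is why this approach suits smaller models trained on very large data, while very large models push toward sharding the parameters as well.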

Gideon Mendels:

Very similar. Okay. I know we talked a little bit about some… Because of the low-latency requirements in recommendation, it’s more of a JVM stack in some cases, whereas most of the foundational work is Python and PyTorch, but that’s super helpful. Maybe on the same topic of the training side of things, can you talk a little bit about what offline training and offline experimentation look like at Netflix? What tools do you guys use? Is there any difference between the recommendation and the content generation side of things?

Prasanna Padmanabhan:

Yeah. Look, as I said, our team’s north star, or success criteria, is about how fast we can do experiments. Comet has actually helped in some of those cases as well. We leverage Comet more for experimentation tracking, where our ML practitioners use Comet to keep track of what data they used, what kind of model architectures they used, and what the model metrics were. Comet provides all of that so that other engineers can easily reproduce what one ML engineer did. I think that’s a pretty good value add for us in leveraging Comet for experimentation tracking. But again, generally speaking, we leverage the same infrastructure that we use for productizing our ML workflows in experimentation too. In fact, it’s important to not have different stacks for offline experimentation and productization so that we don’t run into any new issues around that.

But with offline experimentation, as I said, you leverage notebooks more. You want to keep feature generation and so on as reactive as possible so that you can work in a more iterative fashion. Maybe for offline experimentation you train on smaller data sets first, and then put it in a pipeline to see how it works at a large scale. For the most part, offline experimentation leverages the same infrastructure that we use for productizing a training pipeline, but leans more on IDEs like notebooks, and looks more at offline metrics to really see if the experiment is worth an A/B test. More often than not, many of the offline experiments don’t see the light of an A/B test, which is great. A failure is a great learning, more so than an actual success. So you want to run those offline experiments, look at all these offline model metrics, and make that process as iterative as possible.

Gideon Mendels:

Thanks for sharing. Super fascinating. If we walk through, I think we covered data, experimentation, and training, and then I think we started a conversation where you talked about, just on the homepage, a few tens of models per user, and I don’t know how many of them are personalized models and how many of them are serving the same thing. But maybe we talk a little bit about that side, like deployment and serving. I’m specifically really curious about what it looks like if I’m a data scientist and I built my models using the fact store API and the feature store API, which I know is a newer concept. What does it look like in production? Is it the same code that pulls these features? Is it separate? What does that process look like?

Prasanna Padmanabhan:

Yeah, great question. Look, it’s super critical, especially for personalization models to not have training-serving skew. The way that we ensure there’s no training-serving skew is that we make sure that the data that we use for inferencing and training is exactly the same, and the code that is used to generate features in online and offline is also exactly the same. When we talk about fact stores, what we essentially do is at the time of inferencing, whatever raw data that you use to create features during inference time is what is logged into the fact store. Then we use that for next day’s model training.

At the same time, whatever code you use to generate features offline is also what we use on the online side of things. Which is why we leverage things like Scala for offline experimentation and offline feature generation with Spark. For the online side of things, we are still a JVM-heavy shop, as you can imagine. Personalization needs to serve our members with low latency, and Java works much better compared to Python for these kinds of low-latency applications. Leveraging this JVM stack for both offline experimentation and online serving helps. Did I miss any part of the question?
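The two guarantees described here (log the raw facts seen at inference time, and run one feature definition both online and offline) can be sketched as follows. The interview describes a JVM and Scala stack; this illustration uses Python only for brevity, and `compute_features`, `serve`, `build_training_rows`, and the in-memory `fact_log` are hypothetical names, not Netflix's code.

```python
import json
from typing import Callable, Dict, List


def compute_features(facts: dict) -> Dict[str, float]:
    """The single feature definition shared by serving and training."""
    watched = facts["watched_minutes"]
    return {
        "num_sessions": float(len(watched)),
        "avg_minutes": sum(watched) / len(watched) if watched else 0.0,
        "thumbs_up_count": float(facts["thumbs_up"]),
    }


fact_log: List[str] = []  # stand-in for writing inference-time facts to a fact store


def serve(facts: dict, score_fn: Callable[[Dict[str, float]], float]) -> float:
    """Online path: log the raw facts, then featurize and score."""
    fact_log.append(json.dumps(facts, sort_keys=True))
    return score_fn(compute_features(facts))


def build_training_rows() -> List[Dict[str, float]]:
    """Offline path: replay logged facts through the same feature code."""
    return [compute_features(json.loads(line)) for line in fact_log]


request_facts = {"watched_minutes": [30, 45], "thumbs_up": 2}
score = serve(request_facts, score_fn=lambda f: f["avg_minutes"])
training_rows = build_training_rows()
```

Since the training rows are derived from exactly what the model saw when serving, and through the same function, the features cannot drift between the two paths.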

Gideon Mendels:

I think you covered it. Yeah. No, I think you covered it. I mean, I can ask you a million more questions, but just being mindful of time. Obviously this industry’s moving very, very fast, both MLOps and machine learning in general. What are some of the things that you’re most excited about and following in the industry? Then specifically, do you think any of those things will impact the media and entertainment industry?

Prasanna Padmanabhan:

What do you think is the answer? It’s a small three-letter word, GPT. It’s fascinating how much this industry has evolved in the last year or so. Things like ChatGPT have literally democratized machine learning and novel, innovative ML techniques for the common people. As a company that does things like content creation in the entertainment business, you want to ensure that you are using the latest technologies to help creators create content. The advent of GPTs, or simple text to images and text to videos, is a great advancement in the field. There are various open source models too, like Eclipse model, which does this kind of open source text to video. But imagine, if you are a creator and you are trying to do things like animation movies for example, they do spend a lot of time on what a background should look like for an animation movie. What if you tell your LLMs, “Hey, show me a background image with rainfalls and fantastically rosy flowers and things like that.” Maybe it helps them to enhance their vision of their creative process.

Generative AI is a super exciting area for Netflix to look at and see how can we make our creatives more creative. Again, I can’t emphasize more, it’s a creative process. We don’t expect AIs to… At least never say never, but we don’t expect it at least in the foreseeable future for us to show AI generated content at Netflix. But again, as a creative process, can we make our creatives more creative? Maybe these kind of things of say a background generation can help our creatives really go into their vision of what do they really want to see in the screen, and how can we make that better, and how can they make that visually more aesthetic? Generative AI is a great place for us to keep an eye on.

Gideon Mendels:

Thank you so much, Prasanna. I’m sure a lot of you have questions, so feel free. You can ask and I’ll repeat the question so we’ll have it on the recording. Go ahead.

Audience Member 1:

I’m leading the machine learning platform team at Glassdoor, and something that we struggle with a lot is the build versus buy versus borrow type of question. Did that ever happen? Did you guys have the conversation, should we buy a platform?

Prasanna Padmanabhan:

Absolutely. In fact, it’s critical for any company to look at build versus buy, or borrow as you said, like open source things, on a regular basis. In fact, at Netflix, whenever we are trying to invest in a new area, we ask that question, “Is it more cost-efficient for us to maybe buy something before we build?” We’ve done that in many different things. Like we leverage open source left and right across our infrastructure. We’ve also open sourced things like Metaflow, which is one of our core ML frameworks and has been open source for a few years now. We also open sourced our notebook solution called Polynote, which works well for both the Scala and Python side of things. But yes, it’s a very important question to ask on a daily basis. With the advent of things like SageMaker and GCP’s offerings, almost everything is now a commodity, right? Like feature stores are table stakes. There are so many feature stores available in the market, both vendor-supported and open source.

But again, what you want to really do is, are those open source ones and vendors really plug and play for your infrastructure? Sometimes the answer is yes, sometimes the answer is no. If the answer is no, try and figure out why and what are the niche things that it’s better off building. Long story short, I would imagine we do… I mean we do that for any new initiative and I think it’s critical for us to do that. Even if you’re building things, just looking at what are the options that’s there today will help you be better at what you want to build and what you don’t want to build as well. So having that question is super critical.

Audience Member 2:

Yeah. Okay. My question is about something that all ML companies need to do, which is to have easy-to-define ETLs and data pipelines for ML practitioners. But when we’re talking about very large-scale data sets, it can take… or at least in my experience, I’ve seen it take many hours for that to happen. Sometimes you realize that, “Oh, the query was slightly wrong or that’s not exactly the data I was looking for.” Maybe it’s too generic a question, but I’m interested in how you have solved this problem, so that ML practitioners can very efficiently or quickly access large data to train their models on. Is it just a simple application of Spark, or is it something that Netflix has done differently than other companies?

Prasanna Padmanabhan:

It’s a good question. I’m trying to see what Netflix has done that’s unique in the industry. Look, we leverage Spark, we leverage Presto, like other companies. It’s true that you may be executing things for a couple of hours as you access large data sets and you end up finding that something was bad in your query, or something was bad in your Spark SQL query. I think one of the things that we have done reasonably well in the past is doing better at unit testing. You want to ensure that you’re able to execute your queries in a local environment in the first place. You want to ensure that you have some representative copy of the data available in your development environments as well, so that you can run your queries locally to check that they’re functionally correct before you execute them at larger scale.

I think that’s mostly what we have been doing. I don’t think we’ve done anything unique that has fundamentally changed how you can make that process better. Look, if there are bugs in your query, there’s not much that the Spark or Presto engines can do. But being more disciplined in software engineering helps. At the end of the day, a lot of this is general software engineering discipline, and you need to ensure that you have written good unit tests and good integration tests before doing things at a large scale.
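To illustrate the local-testing discipline described above, here is a minimal sketch. It is not Netflix’s tooling: it assumes an in-memory SQLite database as a stand-in for a Spark SQL or Presto engine, and the table `playback_events`, its columns, and the query itself are hypothetical. The idea is only that a query is checked against a tiny, representative sample before it is submitted against the full data set.

```python
import sqlite3

# Hypothetical query under test: total watch minutes per profile.
WATCH_MINUTES_SQL = """
    SELECT profile_id, SUM(minutes) AS total_minutes
    FROM playback_events
    WHERE minutes > 0
    GROUP BY profile_id
    ORDER BY profile_id
"""


def run_query_on_sample(rows):
    """Load a small sample into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE playback_events "
            "(profile_id TEXT, title TEXT, minutes INTEGER)"
        )
        conn.executemany("INSERT INTO playback_events VALUES (?, ?, ?)", rows)
        return conn.execute(WATCH_MINUTES_SQL).fetchall()
    finally:
        conn.close()


# A tiny, hand-made sample that covers the edge case we care about.
SAMPLE_ROWS = [
    ("p1", "show_a", 30),
    ("p1", "show_b", 15),
    ("p2", "show_a", 0),   # zero-minute event: must be filtered out
    ("p3", "show_c", 50),
]


def test_watch_minutes_query():
    # p1 is summed across two events, p2 disappears, p3 is unchanged.
    assert run_query_on_sample(SAMPLE_ROWS) == [("p1", 45), ("p3", 50)]


if __name__ == "__main__":
    test_watch_minutes_query()
    print("query behaves as expected on the sample")
```

A test like this runs in milliseconds, so a wrong filter or grouping shows up before a multi-hour job is started. It does not catch dialect differences between the local engine and the production one, which is where integration tests come in.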

Audience Member 2:

Thank you.

Audience Member 3:

Hello. I’m just wondering, when you manage your ML teams’ model training jobs using GPU resources, and I know you run your AI infrastructure on the AWS cloud, where the cloud is an on-demand, pay-as-you-go service: does that mean any engineer who needs to run a job can request whatever number of GPUs and your platform just allows it? Or do you manage a queue, with a fixed number of GPUs?

Prasanna Padmanabhan:

Good question too. Look, as any AWS users here know, there are two concepts. There’s reserved capacity and then there is spot capacity. We try and make sure we leverage as much reserved capacity as possible. At Netflix there’s a separate team which does capacity planning, which is super critical. You don’t want to overinvest, reserve too much capacity from AWS, and just waste those resources. What I’m trying to say is that things like GPUs, especially when they’re super expensive, are mostly reserved capacity. We tell AWS in advance if we want to increase our fleet capacity, even if it’s by 10% or so, so that we can better manage the cost that we get billed by AWS.

For the most part, when ML practitioners want to access GPUs, whether it’s for training and stuff, those requests all go through the reserved capacity, and not much through spot capacity. Yes, we have an ability to do that, but again, this goes to our fundamental culture of what we call freedom and responsibility. Our ML practitioners have the freedom to try that out, but then they also have to be responsible for, “Is it worth the money for doing spot instances as opposed to maybe waiting and scaling it up with reserved capacity?”
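The capacity-planning trade-off described above can be made concrete with a rough sketch. Everything here is an assumption for illustration: the rates are made-up numbers that do not reflect AWS pricing, demand is collapsed into total GPU-hours (ignoring peaks), and `plan_cost` is a hypothetical helper, not anything Netflix has described. The only point it shows is that reserved capacity is billed whether or not it is used, so both under- and over-reserving cost money.

```python
def plan_cost(reserved_gpus, demand_gpu_hours, period_hours,
              reserved_rate, overflow_rate):
    """Estimate the bill for one planning period.

    reserved_rate  -- hypothetical price per reserved GPU-hour (always billed)
    overflow_rate  -- hypothetical price per GPU-hour for demand that does
                      not fit in the reserved fleet
    """
    reserved_gpu_hours = reserved_gpus * period_hours
    reserved_bill = reserved_gpu_hours * reserved_rate

    # Demand beyond the reserved fleet has to be served some other way.
    overflow_gpu_hours = max(0, demand_gpu_hours - reserved_gpu_hours)
    overflow_bill = overflow_gpu_hours * overflow_rate

    # Reserved hours nobody used are paid for anyway.
    idle_gpu_hours = max(0, reserved_gpu_hours - demand_gpu_hours)

    return {
        "total": reserved_bill + overflow_bill,
        "idle_reserved_gpu_hours": idle_gpu_hours,
    }


if __name__ == "__main__":
    # 1000 GPU-hours of demand over a 100-hour period, made-up rates.
    for fleet in (8, 10, 15):
        print(fleet, plan_cost(fleet, 1000, 100, reserved_rate=2, overflow_rate=3))
```

With these invented numbers, a fleet of 10 GPUs is cheapest: 8 GPUs pushes 200 GPU-hours to the more expensive overflow rate, while 15 GPUs leaves 500 reserved GPU-hours idle.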

Audience Member 3:

Okay.

Prasanna Padmanabhan:

Does that answer your question?

Audience Member 3:

Yes. Yeah. Also, for the training data, you mentioned you need to use data from production users. We know a lot of training is offline: you collect the data and train outside of production. But do you have a scenario where you need to retrain the model online, which needs access to the users’ data, while the engineer is not able to access the production environment? How do you handle production user data when the engineer is not able to access it?

Prasanna Padmanabhan:

Right. Look, first of all, when we do things like offline model training, you want to make sure that you’re not hitting the online services. You don’t want to bring down your online services when you’re running high-throughput training jobs. If you remember, that’s where we snapshot the data into our fact store, which is essentially in our big data platforms. At the time of training, you want to access this data from data lakes, like even S3 for that matter, and not hit these online systems. What we have done within Netflix, on the data platform side, is that as users interact with our app, as I mentioned, there’s a lot of data platform infrastructure that was built to log all of that data into our data warehouse. Offline training always looks at the data from the big data warehouse.

Audience Member 2:

Just a follow-up to my previous question, again on data access. Let’s say the fact store holds raw data. If you are looking at past histories over many, many days, which most personalization models need to do, it is very expensive to compute over long timelines. Do you use some sort of automatic aggregate creation, that is, creating aggregate tables and making them available in the fact store, so that the job at that time is not super expensive?

Prasanna Padmanabhan:

Yeah, good question. When I say raw, it’s not super raw. You can think of it as transformed raw to a certain extent. Yes, there are various transformation jobs that happen, and we snapshot that transformed data as well so that we don’t have to have each ML practitioner do those kinds of transformations. Yes, absolutely, you do want to have your ML practitioners reuse those kinds of common data transforms and not reinvent the wheel every time.
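The idea of snapshotting shared, pre-transformed data can be sketched as follows. This is only an illustration under stated assumptions: the event shape `(day, profile_id, minutes)`, the function names, and the use of plain Python dictionaries in place of warehouse tables are all hypothetical. Raw events are rolled up once into a daily aggregate, and every long-window feature is then computed from that small aggregate rather than by rescanning raw history.

```python
from collections import defaultdict
from datetime import date, timedelta


def build_daily_aggregates(events):
    """Roll raw events up once into {(day, profile_id): total_minutes}."""
    daily = defaultdict(int)
    for day, profile_id, minutes in events:
        daily[(day, profile_id)] += minutes
    return dict(daily)


def minutes_over_window(daily, profile_id, end_day, window_days):
    """Sum the precomputed daily rows for a trailing window (end_day inclusive)."""
    total = 0
    for offset in range(window_days):
        day = end_day - timedelta(days=offset)
        total += daily.get((day, profile_id), 0)
    return total


if __name__ == "__main__":
    raw_events = [
        (date(2023, 1, 1), "p1", 100),
        (date(2023, 1, 8), "p1", 20),
        (date(2023, 1, 9), "p1", 30),
        (date(2023, 1, 9), "p1", 10),
        (date(2023, 1, 10), "p2", 5),
    ]
    daily = build_daily_aggregates(raw_events)  # computed once, shared by all jobs
    print(minutes_over_window(daily, "p1", date(2023, 1, 10), 3))   # 60
    print(minutes_over_window(daily, "p1", date(2023, 1, 10), 10))  # 160
```

The cost of a window lookup now grows with the number of days in the window, not with the number of raw events, and every practitioner reads the same aggregate instead of re-deriving it.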

Audience Member 4:

I think it’s really interesting to hear that there’s an undercurrent of model success criteria tracking to business value, which ultimately drives value for your subscribers. How do practitioners stay in line with their business counterparts on aligning what those business metrics are when they’re thinking about a modeling task?

Prasanna Padmanabhan:

Good question. Look, again, as I said, if you look at business metrics for personalization, we talked about how many streaming hours this particular model is enabling, and what retention this model is enabling. Then, when we do A/B testing, there is a cohort of users who are experiencing this new model, versus what’s happening in production. Then we can figure out, for those cohorts of members, are they streaming more and are they retaining more? That’s the business side of things. But at the end of the day, for you to stream more, you have to be able to easily discover the content that you’d like to watch. That’s where the personalization model comes into the picture. From an ML practitioner’s perspective, you’re optimizing the model to find the next right content, or the next content that you want to watch very soon.

But at the same time, you don’t want to be too focused on the short term. You want to also focus your models on long-term benefits and not just on short-term benefits. For example, if you’re watching a particular series and, say, you are on season two, episode three, Continue Watching is the first row that we typically show in the Netflix app because we know that you’re binge-watching, so to speak, and we want to actually show that. But if we optimize just for that, you won’t be able to discover new content that we think you’ll want to watch after. So from an ML practitioner’s perspective, it’s all about, how do we build models that can help our members find the video that they will enjoy as soon as possible? That correlates reasonably well with our business metrics about streaming hours and retention.
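The cohort comparison described above can be sketched roughly as below. This is a simplified illustration, not Netflix’s experimentation platform: the member records, field names, and helper functions are hypothetical, and a real A/B analysis would also need a statistical significance test and far larger cohorts before drawing conclusions.

```python
from statistics import mean


def cohort_metrics(members):
    """Summarize one cohort.

    members -- list of dicts with 'streaming_hours' (number) and
               'retained' (bool), a hypothetical record shape.
    """
    return {
        "avg_streaming_hours": mean(m["streaming_hours"] for m in members),
        "retention_rate": sum(m["retained"] for m in members) / len(members),
    }


def compare_cohorts(control, treatment):
    """Return treatment-minus-control difference for each business metric."""
    control_metrics = cohort_metrics(control)
    treatment_metrics = cohort_metrics(treatment)
    return {
        name: treatment_metrics[name] - control_metrics[name]
        for name in control_metrics
    }


if __name__ == "__main__":
    # Control sees the production model, treatment sees the new model.
    control = [
        {"streaming_hours": 10, "retained": True},
        {"streaming_hours": 12, "retained": True},
        {"streaming_hours": 8, "retained": False},
        {"streaming_hours": 10, "retained": True},
    ]
    treatment = [
        {"streaming_hours": 12, "retained": True},
        {"streaming_hours": 14, "retained": True},
        {"streaming_hours": 10, "retained": True},
        {"streaming_hours": 12, "retained": True},
    ]
    print(compare_cohorts(control, treatment))
```

Keeping both metrics in the same report reflects the point about not over-optimizing for the short term: streaming hours can move quickly, while retention only shows up over a longer window.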

Audience Member 4:

Just as a follow-up, is it fair to say that practitioners are actively looking at these success metrics as they’re thinking about initial modeling tasks, retraining tasks, and things of that nature? Or do they find themselves siloed and they’re really just thinking about model creation?

Prasanna Padmanabhan:

No. Look, as I said, when models graduate to an A/B test, there’s a good dashboard that they look at, which shows those business metrics right in the UI, where you can see how the model is contributing to your business metrics. If they are far away, as Gideon said, you really can’t move your business in a meaningful way. I think it’s important to make sure that they’re well aligned with what the business is trying to move towards.

Gideon Mendels:

All right. Prasanna, thank you so much for coming here today. Was that… Do you have another question?

Audience Member 5:

Yeah.

Gideon Mendels:

Yeah, of course.

Audience Member 5:

First of all, thanks for your time. Fascinating to hear about the entire ML pipeline at Netflix. I’m really curious to know about Netflix’s ML investments towards gaming, like the new stuff. How are you looking at ML investments towards either content generation or increasing customer engagement, since it’s a new initiative that’s come up in the last few months or a year or so?

Prasanna Padmanabhan:

Yeah. Look, we’re still day zero in gaming. Well, if we’re day zero in personalization, imagine where we are in gaming. We have a handful of games launched in our service. At this point in time, what we are really trying to do is to make the discoverability of games more prominent. Many of our members, many of my own friends or family members, don’t even know that there are games within Netflix, or don’t play them. At this point in time, it’s more about personalizing those things. For example, we have a game that’s based on Stranger Things. If you are a Stranger Things fan, we need to maybe show that game more prominently for members who stream Stranger Things or stream videos that are related to Stranger Things. We’re not at a place where we’re thinking about leveraging machine learning for game creation and things, but it’s more about how do we make personalization of games better? Especially as we launch more games in the next several years, it then becomes similar to videos: there’ll be a lot more games, and how you personalize the right games to the right members will be a good challenge.
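As a toy illustration of the Stranger Things example above, the sketch below boosts games whose related titles overlap with what a member has streamed. It is purely hypothetical: the game names, `base_score` values, the additive `boost`, and the `rank_games` function are invented for illustration and say nothing about how Netflix actually ranks games.

```python
def rank_games(games, watched_titles, boost=1.0):
    """Order games so that those tied to a member's viewing surface first.

    games          -- list of dicts with 'name', 'base_score' and
                      'related_titles' (hypothetical record shape)
    watched_titles -- titles the member has streamed
    boost          -- made-up additive bonus for a related game
    """
    watched = set(watched_titles)

    def score(game):
        is_related = bool(watched & set(game["related_titles"]))
        return game["base_score"] + (boost if is_related else 0.0)

    return [g["name"] for g in sorted(games, key=score, reverse=True)]


if __name__ == "__main__":
    catalog = [
        {"name": "stranger_things_game", "base_score": 0.5,
         "related_titles": ["Stranger Things"]},
        {"name": "puzzle_game", "base_score": 0.9, "related_titles": []},
        {"name": "racing_game", "base_score": 0.7, "related_titles": []},
    ]
    # A Stranger Things viewer sees the tie-in game first.
    print(rank_games(catalog, ["Stranger Things"]))
    # A member with no related viewing gets the default ordering.
    print(rank_games(catalog, ["Some Other Show"]))
```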

Gideon Mendels:

All right. Prasanna, thank you again so much for coming here today and sharing your knowledge with us. I know I’ve learned a lot. Really, really fascinating to hear. Even though I’ve known you for a few years, every time we chat I learn something new about your approach to the platform team and to solving these really, really difficult engineering problems. Thank you again for sharing your knowledge with us.

Prasanna Padmanabhan:

Thank you for having me. It’s been a great conversation. Thanks for the intriguing questions as well. I enjoyed my time here.
