Deep Neural Networks for YouTube Recommendations

Paul Covington, Jay Adams, Emre Sargin
Proceedings of the 10th ACM Conference on Recommender Systems - RecSys '16

excerpt YouTube is the world’s largest platform for creating, sharing and discovering video content. YouTube recommendations are responsible for helping more than a billion users discover personalized content from an ever-growing corpus of videos. 1 on 2/29/2020, 11:06:40 PM

excerpt Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. We rarely obtain the ground truth of user satisfaction and instead model noisy implicit feedback signals. Furthermore, metadata associated with content is poorly structured without a well defined ontology. Our algorithms need to be robust to these particular characteristics of our training data. 1 on 2/29/2020, 11:07:06 PM

excerpt The overall structure of our recommendation system is illustrated in Figure 2. The system is comprised of two neural networks: one for candidate generation and one for ranking. 2 on 2/29/2020, 11:07:31 PM
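
A minimal Python sketch of this two-stage funnel; `candidate_model`, `ranker`, and their methods are hypothetical stand-ins for illustration, not interfaces from the paper:

    def recommend(user, corpus, candidate_model, ranker,
                  n_candidates=500, n_results=20):
        # Stage 1: candidate generation winnows the corpus of
        # millions of videos down to a few hundred broadly
        # relevant ones.
        candidates = candidate_model.top_k(user, corpus, k=n_candidates)
        # Stage 2: ranking scores each candidate with a richer
        # feature set and keeps the best few for display.
        scored = [(ranker.score(user, video), video) for video in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [video for _, video in scored[:n_results]]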

tag-as YouTube on 2/29/2020, 11:07:46 PM

excerpt The candidate generation network takes events from the user’s YouTube activity history as input and retrieves a small subset (hundreds) of videos from a large corpus. These candidates are intended to be generally relevant to the user with high precision. The candidate generation network only provides broad personalization via collaborative filtering. The similarity between users is expressed in terms of coarse features such as IDs of video watches, search query tokens and demographics. 2 on 2/29/2020, 11:08:09 PM

excerpt During candidate generation, the enormous YouTube corpus is winnowed down to hundreds of videos that may be relevant to the user. 2 on 2/29/2020, 11:08:58 PM
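
One way to realize this winnowing, consistent with the coarse features named above but with assumed function names and a brute-force scan (the paper serves this step with approximate nearest-neighbor search over learned embeddings):

    import numpy as np

    def user_vector(watch_embs, search_embs):
        # Coarse user representation: mean of the embeddings of
        # recent watches and search tokens (demographic features
        # could be concatenated here as well).
        return np.vstack([watch_embs, search_embs]).mean(axis=0)

    def generate_candidates(user_vec, video_embs, k=300):
        # Score every corpus video by dot product with the user
        # vector and keep the top k as candidates.
        scores = video_embs @ user_vec
        top = np.argpartition(-scores, k)[:k]
        return top[np.argsort(-scores[top])]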

excerpt Although explicit feedback mechanisms exist on YouTube (thumbs up/down, in-product surveys, etc.) we use the implicit feedback [16] of watches to train the model, where a user completing a video is a positive example. This choice is based on the orders of magnitude more implicit user history available, allowing us to produce recommendations deep in the tail where explicit feedback is extremely sparse. 3 on 2/29/2020, 11:09:21 PM

excerpt Machine learning systems often exhibit an implicit bias towards the past because they are trained to predict future behavior from historical examples. 3 on 2/29/2020, 11:10:56 PM

excerpt 50 recent watches and 50 recent searches. 5 on 2/29/2020, 11:11:56 PM

excerpt The categorical features we use vary widely in their cardinality - some are binary (e.g. whether the user is logged-in) while others have millions of possible values (e.g. the user’s last search query). Features are further split according to whether they contribute only a single value (“univalent”) or a set of values (“multivalent”). An example of a univalent categorical feature is the video ID of the impression being scored, while a corresponding multivalent feature might be a bag of the last N video IDs the user has watched. 6 on 2/29/2020, 11:13:24 PM
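
A minimal sketch of the univalent/multivalent split, assuming PyTorch (not the paper's stack): a univalent feature becomes one embedding lookup per example, while a multivalent bag of watched video IDs is mean-pooled into one fixed-width vector, matching the paper's practice of averaging multivalent embeddings before feeding the network:

    import torch
    import torch.nn as nn

    NUM_VIDEOS, EMB_DIM = 1_000_000, 256

    # Univalent: the single video ID of the impression being scored.
    impression_emb = nn.Embedding(NUM_VIDEOS, EMB_DIM)
    # Multivalent: the bag of the last N watched video IDs,
    # looked up and mean-pooled in one step.
    watch_bag_emb = nn.EmbeddingBag(NUM_VIDEOS, EMB_DIM, mode="mean")

    impression_ids = torch.tensor([42])            # one ID per example
    watched_ids = torch.tensor([[7, 99, 42, 13]])  # N IDs per example

    net_input = torch.cat([impression_emb(impression_ids),
                           watch_bag_emb(watched_ids)], dim=1)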

excerpt We observe that the most important signals are those that describe a user’s previous interaction with the item itself and other similar items, matching others’ experience in ranking ads [7]. As an example, consider the user’s past history with the channel that uploaded the video being scored - how many videos has the user watched from this channel? When was the last time the user watched a video on this topic? These continuous features describing past user actions on related items are particularly powerful because they generalize well across disparate items. 7 on 2/29/2020, 11:14:15 PM
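
An illustrative computation of two such features; the `Watch` record and its field names are hypothetical, since the paper does not publish its exact feature definitions:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Watch:               # hypothetical watch-history record
        channel_id: str
        topic_id: str
        timestamp: datetime

    def interaction_features(history, channel_id, topic_id, now):
        # "How many videos has the user watched from this channel?"
        channel_count = sum(1 for w in history if w.channel_id == channel_id)
        # "When was the last time the user watched a video on this topic?"
        topic_times = [w.timestamp for w in history if w.topic_id == topic_id]
        secs_since_topic = ((now - max(topic_times)).total_seconds()
                            if topic_times else None)
        return {"channel_watch_count": channel_count,
                "seconds_since_topic_watch": secs_since_topic}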

excerpt Our goal is to predict expected watch time given training examples that are either positive (the video impression was clicked) or negative (the impression was not clicked). Positive examples are annotated with the amount of time the user spent watching the video. To predict expected watch time we use the technique of weighted logistic regression, which was developed for this purpose. 8 on 2/29/2020, 11:14:59 PM

excerpt Logistic regression was modified by weighting training examples with watch time for positive examples and unity for negative examples, allowing us to learn odds that closely model expected watch time. This approach performed much better on watch-time weighted ranking evaluation metrics compared to predicting click-through rate directly. 7 on 2/29/2020, 11:16:43 PM
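
With k clicked impressions among N examples and watch times T_i, the odds learned under this weighting are (sum of T_i) / (N - k), which is approximately E[T](1 + P) and hence close to E[T], since the click probability P is small. A sketch of the loss in PyTorch (an assumption; the paper does not publish code):

    import torch
    import torch.nn.functional as F

    def weighted_watch_time_loss(logits, clicked, watch_time):
        # Positive impressions are weighted by watch time, negatives
        # by unity, so the learned odds approximate expected watch time.
        weights = torch.where(clicked.bool(), watch_time,
                              torch.ones_like(watch_time))
        return F.binary_cross_entropy_with_logits(
            logits, clicked.float(), weight=weights)

    def predicted_watch_time(logits):
        # At serving time, the odds e^(Wx + b) are used directly as
        # the watch-time estimate.
        return torch.exp(logits)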

cites Practical Lessons from Predicting Clicks on Ads at Facebook on 2/29/2020, 11:17:23 PM

cites Beyond clicks: Dwell time for personalization on 2/29/2020, 11:22:38 PM