Valentin Bazarevsky and Andrei Tkachenka, Software Engineers, Google Research
Video segmentation is a widely used technique that enables movie directors and video content creators to separate the foreground of a scene from the background, and treat them as two different visual layers. By modifying or replacing the background, creators can convey a particular mood, transport themselves to a fun location or enhance the impact of the message. However, this operation has traditionally been performed as a time-consuming manual process (e.g. an artist rotoscoping every frame) or requires a studio environment with a green screen for real-time background removal (a technique referred to as chroma keying). In order to enable users to create live in the viewfinder we designed a new technique that is suitable for mobile phones.
Today, we are excited to bring precise, real-time, on-device mobile video segmentation to the YouTube app by integrating this technology into stories. Currently in limited beta, stories is YouTube’s new lightweight video format, designed specifically for YouTube creators. Our new segmentation technology allows creators to replace and modify the background, effortlessly increasing videos’ production value without specialized equipment.
|Neural network video segmentation in YouTube stories.|
To achieve this, we leverage machine learning to solve a semantic segmentation task using convolutional neural networks. In particular, we designed a network architecture and training procedure suitable for mobile phones focusing on the following requirements and constraints:
To provide high quality data for our machine learning pipeline, we annotated ten of thousands of images that captured a wide spectrum of foreground poses and background settings. Annotations consisted of pixel accurate locations of foreground elements such as hair, glass, neck, skin, lips, etc. and a general background label achieving a cross-validation result of 98% Intersection-Over-Union (IOU) of human annotator quality.
|An example image from our dataset carefully annotated with nine labels – foreground elements are overlaid over the image.|
Our specific segmentation task is to compute a binary mask separating foreground from background for every input frame (three channels, RGB) of the video. Achieving temporal consistency of the computed masks across frames is key. Current methods that utilize LSTMs or GRUs to realize this are too computationally expensive for real-time applications on mobile phones. Instead we first pass the computed mask from the previous frame as a prior by concatenating it as a fourth channel to the current RGB input frame to achieve temporal consistency, as shown below:
|The original frame (left) is separated in its three color channels and concatenated with the previous mask (middle). This is used as input to our neural network to predict the mask for the current frame (right).|
In video segmentation we need to achieve frame-to-frame temporal continuity, while also accounting for temporal discontinuities such as people suddenly appearing in the field of view of the camera. To train our model to robustly handle those use cases, we transform the annotated ground truth of each photo in several ways and use it as a previous frame mask:
|Our real-time video segmentation in action.|
With that modified input/output, we build on a standard hourglass segmentation network architecture by adding the following improvements:
|Hourglass segmentation network w/ skip connections.|
|ResNet bottleneck with large squeeze factor.|
The end result of these modifications is that our network runs remarkably fast on mobile devices, achieving 100+ FPS on iPhone 7 and 40+ FPS on Pixel 2 with high accuracy (realizing 94.8% IOU on our validation dataset), delivering a variety of smooth running and responsive effects in YouTube stories.
Our immediate goal is to use the limited rollout in YouTube stories to test our technology on this first set of effects. As we improve and expand our segmentation technology to more labels, we plan to integrate it into Google’s broader Augmented Reality services.
A thank you to our team members who worked on the tech and this launch with us: Andrey Vakunov, Yury Karthynnik, Artsiom Ablavatski, Ivan Grishchenko, Matsvei Zhdanovich, Andrei Kulik, Camillol Lugaresi, John Kim, Ryan Bolyard, Wendy Huang, Michael Chang, Aaron La Lau, Willi Geiger, Tomer Margolin, John Nack and Matthias Grundmann.