Generative AI refers to artificial intelligence models that, once trained on massive datasets, can generate text, images, audio, and video by predicting the next word or pixel. The simplest input to generative AI, called a prompt, is a text description. Based on that text description, a generative pre-trained transformer (GPT) can write a paragraph, a text-to-image model such as Stable Diffusion can create a picture, MusicLM can create music, and Imagen Video can create a video. This technology will democratize all kinds of content creation. For video creation, it could level the playing field even more than smartphones and social video platforms already have. It will also fundamentally change the video content industry.
Consider Netflix, TikTok, and YouTube, the stars in this domain. Although each is unique in terms of content type and business model, all three platforms operate by incentivizing creators to develop engaging content, matching the right content to the right consumer, and identifying what content drives engagement. These elements build on one another to create a flywheel that has helped all three platforms gain viewers at high speed. But that flywheel is beginning to lose momentum. Generative AI will make their problems worse by creating a new video content creation value chain.
Why Netflix, TikTok, and YouTube are in trouble.
Netflix, TikTok, and YouTube have done well due to their ability to determine content relevance and engagement. They all have enormous amounts of data about who watches what and how. Despite their success, determining the “what” still presents two serious challenges:
Extracting useful, precise features. If a video is commissioned (as happens at Netflix), the categories it falls into are known: genre, cast, duration, etc. But those are broad and sometimes subjective labels, which makes it difficult for an algorithm to learn from them. Of course, many of the video’s features can be specified; the script, shot list, and other production features are known precisely. Attempts to use this data, however, lead to the other extreme: there can be too much information to describe just one video.
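To make the two extremes concrete, here is a minimal Python sketch. It is an illustration only, not how any of these platforms actually represent videos. The genre list, the `coarse_features` and `script_features` helpers, and the script excerpt are all hypothetical. The sketch encodes one title as a handful of broad labels, then as a word count over its script.

```python
from collections import Counter

# Hypothetical catalog genres; a real catalog would differ.
GENRES = ["drama", "comedy", "thriller", "documentary"]


def coarse_features(genre: str, duration_min: int) -> list[float]:
    """Broad labels: a one-hot genre plus duration in hours."""
    one_hot = [1.0 if genre == g else 0.0 for g in GENRES]
    return one_hot + [duration_min / 60.0]


def script_features(script: str) -> Counter:
    """Precise production data: one dimension per distinct word in the script."""
    words = [w.strip(".,!?;:").lower() for w in script.split()]
    return Counter(w for w in words if w)


# Made-up three-sentence excerpt standing in for a full script.
script = "The detective opens the door. The room is empty. The detective waits."

coarse = coarse_features("thriller", 95)
fine = script_features(script)

print(len(coarse), "dimensions from broad labels")
print(len(fine), "dimensions from a three-sentence script excerpt")
```

The broad labels give a fixed, short vector that says little about what is on screen. The script-based representation adds a dimension for every new word, so a full screenplay would produce thousands of dimensions for a single video.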