
Video Models: The New Frontier

by Pondering Durian
07.07.2025
60 min read
Photo by thaxnay kapdee

If the work of the city is the remaking or translating of man into a more suitable form than his nomadic ancestors achieved, then might not our current translation of our entire lives into the spiritual form of information seem to make of the entire globe, and of the human family, a single consciousness?

MARSHALL McLUHAN, Understanding Media (1964)

A Living Internet

LLMs have repackaged the collective knowledge of humanity: a mirror which can be prompted and prodded to return PhD-level outputs for pennies. This was possible, in large part, due to the status-seeking behavior of humans online, now rendering ~403 million terabytes of data daily (~181 zettabytes projected for 2025) on which large language models can hone their predictions of the next token.

Source: International Telecommunication Union (via World Bank) | OurWorldinData.org/internet | CC BY

The same networks which brought our status games online and provided the fuel for LLMs will now be joined by the LLMs themselves, piped into agentic limbs molded by the output of our collective vanity and ultimately surpassing us: pushing the number of agents from ~5 billion humans to >100 billion machines in the global hunt for knowledge, the battle for attention, the competition for capital, and the war for hearts and minds that will decide who controls the institutions governing how these technologies are deployed.

The same process is underway in video.

Source: Amount of data created daily | Fabio Duarte

Humans are visual creatures. The phonetic alphabet has been a core technology of "civilized man" but is alien to our true nature - our tribal evolution. While the post-cold war liberal world order promised a techno-capitalist monoculture on the back of open networks of trade, information and talent, the AI video revolution may bring its inverse: greater fragmentation - a return to tribes based on networks of shared values as opposed to blood and soil. Tribes now conducting their memetic warfare at the speed of light, forged in superclusters, ferried by fiber optics, curated by algorithms, and injected into billions of eyeballs in minutes...

Paleolithic emotions... medieval institutions... God-like technology, indeed.

In past decades, as the infrastructure matured, video quality first vaulted upwards (big-budget Hollywood studios / special effects), followed by an explosion outwards (UGC with algorithmic curation), democratizing production through new tools like the iPhone and enhanced bandwidth.

As generative AI fuses with the internet, we will have both simultaneously. A Hollywood studio linked up to every keyboard.

Today, video is perhaps ~2-4 years behind its text-based counterparts. Models are still glitchy and overly expensive but are clearly approaching an inflection. Increasing hardware specialization, larger clusters, algorithmic breakthroughs, and the brutal efficiency of the market all point to a coming mainstream adoption.

Source: Epoch ai, Artificial Analysis

What will this mean for media and entertainment? Marketing and advertising? Education? These industries - like accounting, software development, writing and more - will soon enter the vertical portion of the adoption curve.

Yet, that may only be the beginning.

Multimodal breakthroughs are catalyzing an investment boom in robotics. Real-world simulations paired with RL loops may prove crucial in more generalized approaches to embodied AI, helping to unlock markets from humanoids and self-driving to drones and defense. The promise of feedback loops cementing first movers (i.e. the flywheel between scaled hardware deployment, data collection, and model enhancement) only raises the stakes. The biggest bottleneck to date is the quantity and quality of data on which to train.

Yet, unlike text data, an even greater portion of video data appears gated within the web2 giants that were early to provide free infrastructure for our mobile-era status games - Google, ByteDance, Meta, Tencent, Kuaishou, etc. - not to mention integrators like Tesla, BYD, Xiaomi, and more. This may mean that open source will have a harder time keeping up in multimodal.

So how big is this market? How wide are its tentacles? Who is best positioned? The web2 giants with rich video assets? The integrated software / hardware plays? The new entrants? Can fast-moving video-model startups break out like OpenAI in text, or will incumbents pull up the ladder behind them? If so, can they stay ahead of open source? If not, where will value accrue?

These questions, and more, are explored below.

We are on the brink of a new renaissance. A renaissance we have collectively summoned, built on the extremes of human ingenuity and vanity. Built on the backbone of an internet we are reforging for an era of accelerated compute. A renaissance which will make every keyboard warrior a Hollywood producer. Which will remake media, education and art. Which may catalyze a robotics revolution and transform labor as we know it. The global consciousness of the internet is shifting from mere information dissemination to synthetic creation. Our institutions and polities and sense-making apparatuses are unprepared.

And yet, perhaps, for those of us willing to master the new tools, that is the beauty…

The 101

Video models essentially extend image generation into the time dimension, learning not only what things look like but also how they move over time - ideally in ways that mimic reality. For example, a video model might use an image generator for still frames and additional modules to animate those frames by learning motion patterns from real video data.
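To make those "additional modules" concrete, below is a minimal, illustrative sketch in PyTorch (not any vendor's actual architecture) of the factorized space-time attention idea many video transformers use: spatial attention over patches within each frame, then temporal attention across frames for each patch position.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Toy block: attention over patches within each frame (spatial),
    then attention over frames for each patch position (temporal)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) -- patch tokens for every frame
        b, t, p, d = x.shape
        # Spatial pass: each frame attends over its own patches.
        xs = x.reshape(b * t, p, d)
        q = self.norm_s(xs)
        xs = xs + self.spatial_attn(q, q, q)[0]
        x = xs.reshape(b, t, p, d)
        # Temporal pass: each patch position attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        q = self.norm_t(xt)
        xt = xt + self.temporal_attn(q, q, q)[0]
        return xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

# 2 clips, 8 frames, 16 patches per frame, 64-dim tokens.
out = FactorizedSpaceTimeBlock(64)(torch.randn(2, 8, 16, 64))
print(out.shape)  # torch.Size([2, 8, 16, 64])
```

Production systems add far more (noise schedules, text conditioning, causal VAEs), but the core move is the same: reuse an image-style backbone and bolt on a mechanism that shares information across time.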

The leading approaches have been evolving rapidly, passing through several different architectures in just ~5 years: from GANs and VAEs to transformer architectures to diffusion models, before consolidating around hybrid diffusion-transformer techniques - an approach that entered the mainstream with Sora's impressive initial release.

While a deep dive into the architectural intricacies is beyond the scope of this paper, The Video Model Explosion by Yen-Chen Lin is an excellent read for those keen to dive deeper into the technical evolution and nuance.

Until recently, however, the technology had remained at a proof-of-concept stage, roughly akin to AI image generation in 2022, reserved primarily for early adopters and held back by a few stubborn bottlenecks:

  • Limited Data: High-quality, diverse video data for training is harder to come by than image or text data. Publicly available video datasets (like WebVid) are relatively small and often have sparse or generic captions (though large scale open data sets like OpenVid-1M, VidGen-1M, and MiraData are helping to assuage this somewhat). Historically, this shortage had made it difficult for models to learn the spatial-temporal relations and infer the physics necessary to generate longer form videos.
  • Compute and Cost: Ensuring spatial and temporal consistency is computationally expensive. Even after training, generating a lengthy, high-res video via diffusion involves many inference steps per frame. Existing hardware constraints are part of the reason video models are converging towards transformer architectures.
  • Technical Hurdles:
    • Temporal coherence: keeping objects and characters consistent over many frames is challenging. Models can drift, causing glitches like random color-shifts or shape morphs between frames (a toy consistency check follows this list).
    • Understanding physics and causality: models can struggle with complex physics and logical consistency (objects moving through each other, mismatched shadows, etc.) which again largely ties back to a shortage of high-quality data.
    • Length limitations: most models can only generate seconds of video at a time (often 10–20 seconds at decent resolution currently, though Sora and Veo3 are testing 60s with privileged partners). Extending lengths with a coherent story arc requires continued research in sequence modeling or memory mechanisms (i.e. transformer-diffusion models are basically compressing along the spatial and temporal dimensions, so there are tradeoffs between length and quality / consistency).
  • Integration and Usability: For commercialization, these models must be packaged into tools that creators or consumers can easily use: building user-friendly interfaces, workflows, reasonable generation times, and often combining the video models with other modalities.
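As a concrete illustration of the temporal-coherence hurdle above, here is a toy sanity check (illustrative only, run on a synthetic clip) that flags abrupt frame-to-frame changes of the kind drift and color-shift glitches produce:

```python
import numpy as np

def frame_to_frame_change(video: np.ndarray) -> np.ndarray:
    """video: (frames, H, W, 3) uint8 array; returns per-transition mean absolute pixel change."""
    frames = video.astype(np.float32)
    return np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))

# Fake 16-frame clip with an artificial glitch at frame 8.
clip = np.full((16, 64, 64, 3), 128, dtype=np.uint8)
clip[8:] += 60  # sudden brightness jump, mimicking a color-shift glitch
print(frame_to_frame_change(clip).round(1))  # the spike at transition 7 -> 8 stands out
```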

And still, the prize of true multimodality merits the sustained investment in compute scale up, research, and optimized inference to make video AI economically viable at scale.

2025-2027 in video appears poised to be the 2022 - 2024 of text.

The models are about to get… very good.

Size of the Prize

Narrowly defined, today's market for AI-generated video models is quite modest but growing quickly. The serviceable addressable market would include revenues from specialized text-to-video tools, AI video generation platforms, and related services like subscriptions to Runway ML, Synthesia, and others.

Source: Grandview Research

However, as the underlying models and tooling progress, the addressable market should mushroom quickly, expanding to include the surrounding analytics, orchestration, and applications: from end-to-end AI-driven content creation / editing to personalized advertising and surveillance analytics.

Source: mix of Grandview Research and o3 estimates

US$42.5b may be under-selling the potential. Today, video accounts for well over 50% of internet traffic (with some estimates going as high as ~65%). Bloomberg Intelligence sizes the Gen AI market at ~US$1.3t by 2032. Over a long-enough horizon, video should carve out significant share as models become truly multimodal.

Disruption across sizable swaths of the economy seems obvious.

  • Content Creation & Media Production: significantly accelerate creative workflows like:
    • Content repurposing – automatically creating different versions of a video for various platforms (e.g. turning a horizontal YouTube video into a vertical TikTok format with relevant cuts and edits)
    • Automated video editing: analyze raw footage, pick out highlights, and assemble a rough cut or trailer without human labor
    • Automated Content Generation: completely novel content generation
  • Marketing and Advertising: Global digital video ad spend is in the hundreds of billions annually, and AI-generated videos can enable mass personalization. For example, generating 1000 variant product promo videos, each tailored to a different demographic or consumer to increase engagement and conversion
  • Enterprise Training, Education, and Communication: Enterprises and educational institutions are increasingly using video for training modules, curricula, and corporate communications at a fraction of the cost. Examples include Synthesia and HeyGen: AI avatar video platforms for generating training videos of a talking presenter without film crews or studios
  • Robotics and Simulation: Outside of media, video models and high-fidelity simulations will play a role in robotics and autonomous systems, providing synthetic training data for the near-infinite number of edge cases embodied AI may encounter in warehouses, cars, drones, or humanoids
  • Surveillance and Security: AI in video surveillance is already a multi-billion dollar market. In many markets in East Asia, there is a camera every 20 feet. For better or worse, real-time surveillance with enhanced analytics will receive a significant boost - at home and abroad. Smart cities are an adjacent trend which will suck in capital.
  • Interactive Entertainment: while early, generative video should prove an essential unlock for the now derided "Metaverse", populating digital worlds with on-the-fly content to match player actions or even immersive, personalized movies.

As my colleague Yan is fond of reminding us, the 21st will be the century of drugs and screens.

Competitive Landscape

The race for multimodal primacy is bifurcated by the Pacific. The American and Chinese juggernauts are leading the pack: competition spanning tech incumbents, leading labs, and upstarts all vying for their slice of this rapidly growing pie. The US still lays claim to the world's leading AI labs and boasts global titans like Google and Facebook with video repositories spanning billions of users. China, on the other hand, has a population 1.4b strong and the world's most dynamic consumer internet ecosystem (one that was early to both video and voice), with global champions of its own in ByteDance, Tencent, and more - while emerging labs within DeepSeek, Alibaba, and Baidu have taken up the open-source mantle. China has other advantages in data: a ubiquitous domestic surveillance apparatus, more centralized government repositories, and a clear lead in automation and scaled manufacturing, all of which should benefit from low-cost multimodal intelligence. In a world where software is becoming commoditized, the complements matter more.

The race is on.

In a multimodal future, video will be a crucial pillar, and judging by the Artificial Analysis leaderboard, the US has work to do…

Artificial Analysis Video Arena Leaderboard

Source: Artificial Analysis

While Artificial Analysis is extremely thorough, subjective human assessment is also helpful in something as taste-conscious as video generation, so I will provide examples of outputs from competing vendors. To compare models, I grabbed a random paragraph from a novel I'm working on as the input prompt:

“Please depict this as realistically as possible: '[The Filipina teen] walked to the window, casting out bloodshot eyes to meet the afternoon hue. Dragonflies dotted the sky. Hundreds. Maybe thousands. Nine out of ten hovered in place: wings glistening, life bestowed on the group by the ten percent darting to and fro with epileptic spasms.'”
U.S.A. (Big Tech & Incumbent AI Labs)

OpenAI (Sora)

OpenAI's release of Sora was not the first latent diffusion-transformer (Google's VDM / Imagen Video lay claim), but its commercial launch kicked off palpable excitement for video models and sparked a host of R&D efforts around similar architectures. However, today's rankings point to a dethroning by ByteDance, Google, and Kuaishou. While OpenAI has the talent, and increasingly the distribution, it still lacks the sizable video assets of other platforms like Google, Meta, ByteDance, and xAI - which may be why it's toying with a social network.

Sora Review on Test Prompt: B+

  • Strengths: Cutting-edge research capabilities, strong brand as an AI leader, access to compute/talent, strong consumer distribution, great fundraising capabilities
  • Weaknesses: Closed stance might slow ecosystem adoption. Lacks the data assets of other large ecosystem players.

Anthropic:

As of now, Anthropic has not publicly released a video generation model. Their efforts have been centered on their Claude language model, coding use cases, and text/multimodal understanding.

  • Strengths: World-class AI research talent, lots of funding (from Amazon, Google, etc.), and a focus on reliable AI which could be an asset if others falter on safety
  • Weaknesses: No established product in the video space yet, currently still on the sidelines. Amazon, their early Big Tech benefactor, has backed Luma, indicating they may branch out for other modalities.

Google DeepMind (Veo2 / Veo3):

The video model race is likely Google's to lose. It has proposed numerous foundational models in video generation - Video Diffusion Models (VDM) in 2022, Imagen Video and Phenaki in 2022–2023, etc. - but until 2024 had remained cautious. In late 2024, Google moved aggressively towards productization, releasing market leader Veo 2, which users complimented for its intricate understanding of physics and motion, saying it even “speaks the language of cinematography” at up to 4k resolution, 60 fps.

Veo2 Review: A-

While Veo2 is clearly the most realistic, particularly in rendering humans, the video receives an A- largely because it wouldn't generate until I removed references to “bloodshot” and “epileptic”, which I thought was super lame, pedantic, and representative of the own goals Google continues to have despite its very real technical and distribution advantages… and yet…

Veo3

On May 20th, DeepMind dropped the gauntlet yet again, introducing Veo3 which, until the recent drop of Seedance 1.0, was the undisputed leader in text-to-video-and-audio generation.

At US$249 per month, I have yet to splurge to become a subscriber, but the examples on the site are… impressive.

  • Strengths: immense compute resources, probably the largest repository of video data in the world (YouTube), and an extremely deep bench of AI research talent. Their offering is uniquely verticalized spanning research (DeepMind/Brain), silicon (TPUs), cloud deployment (GCP / Vertex), and strong distribution across a host of applications (YouTube, Android, Google Maps / satellite, Gmail). Few can match Google's multimodal capabilities if they execute.
  • Weaknesses: Historically, Google appeared slow and reticent to release products, wrestling with a classic innovator's dilemma and bloated legal departments stifling innovation. Now under threat, that appears to be changing…

Meta AI:

With >3bn users globally across a host of properties - most notably Facebook, Reels, and WhatsApp - Meta is well positioned in both distribution and UGC data assets. ROI on Meta's ad platform should also benefit from increasing AI video personalization.

User Review: Unrated (Movie Gen poised to launch for mainstream users in H2)

  • Strengths: distribution, data assets, social network, ad network
  • Weaknesses: to date, Meta has remained cautious in implementing video tools into its products, likely due to political / regulatory considerations. Products like Make-A-Video are behind the competition. However, Movie Gen's expected H2 2025 release may change the status quo.

Microsoft:

Microsoft's approach to generative video has primarily relied on OpenAI to date, but the company will likely invest in its own capabilities as the honeymoon phase comes to an end. MSFT developed the NUWA family of models in 2022 / 2023, and its recent collaboration on the Step-Video-T2V model indicates the company remains active on the research side:

  • Strengths: Huge enterprise customer base and distribution (which could drive adoption of generative video in business contexts), deep pockets for investment, access to OpenAI’s tech, Xbox assets, and research capabilities
  • Weaknesses: OpenAI dependency. Outside of Xbox on the gaming side, lacks a true consumer video platform

CHINA (Big Tech)

China's mobile revolution, which inflected in the 2010s, brought smartphones to 1.4 billion people. Mandarin characters can be more cumbersome on mobile compared with phonetic letters, pushing Chinese users to be pioneers of voice / video not only across messaging apps like WeChat but across mobile applications from live-streaming, to shorts, to eCommerce promotion.

These sizable domestic data sets, large tech ecosystems, and latent AI talent spanning giants like Bytedance, Kuaishou, Tencent, Baba, Baidu, BiliBili, and others - not to mention a thriving smart manufacturing ecosystem bubbling out from Shenzhen in FSD, drones, and humanoids - make China a true contender in video and multimodal.

Alibaba (Wan 2.1):

Alibaba has been active in video models since 2022, beginning with ModelScope Text2Video, and continues to have chart-topping releases like Wan 2.1. Their strategy is twofold: (1) incorporate AI into its ecosystem (e.g., helping Taobao sellers auto-generate product showcase videos, or aiding its media streaming site Youku with AI content), and (2) offer it as a service on Alibaba Cloud.

Wan 2.1 Review: A

Pretty solid all around, though still not as lifelike as Veo2 imho.

  • Strengths: Strong cash flow generating business with sizable cloud arm, access to large proprietary datasets (large catalogs of product images/videos, user-generated content on e-commerce platform, often tied to payments data) from >1 billion users. Releases continue to impress
  • Weaknesses: Alibaba’s primary business is commerce/cloud/payments – so not purely focused on consumer social media – lacking a platform like TikTok / Douyin, Kuaishou or the social embedding of Tencent.

Tencent (Hunyuan Video):

Tencent provides the de facto operating system (WeChat) for many aspects of life in China and beyond, with offerings that span social, streaming, gaming, payments, cloud and more, not to mention its thriving ecosystem of mini-apps. These assets provide unparalleled access to data and distribution in China and will inevitably integrate multimodal capabilities. Tencent's latest release, HunyuanVideo (open source), has cracked the top 15 and signals its support for commoditizing its complements.

User Review: Poor man’s Wan 2.1

  • Strengths: Massive distribution, broad digital ecosystem with data sets across text, voice, and video, significant R&D capacity and cashflows to reinvest, backed by its own cloud computing arm with a strong position in one of the world's largest and most important markets
  • Weaknesses: Outside of its investment portfolio, Tencent lacks a large presence outside of China (excluding the Chinese diaspora), will face strong regulatory oversight in integrating generative AI into its products (slowing innovation), and appears to have been less aggressive than global peers in its own AI buildout

ByteDance:

ByteDance, the company behind both Douyin and TikTok, is, with the possible exception of Google, the company best positioned globally in AI video. ByteDance's entire business revolves around short-form video content with proven AI capabilities infused into its curation algorithm. The massive data sets, distribution advantage, and proven talent and positioning all mean ByteDance has been the dark horse to watch - making a big splash in June with Seedance 1.0.

  • Strengths: Unparalleled access to user-generated video data / trends, a massive global user base, and proven consumer AI capabilities.
  • Weaknesses: geopolitically-driven fragmentation, sanction-driven GPU constraints

Kuaishou (Kling) / XiaoHongShu:

Kuaishou and XiaoHongShu retain many of the advantages of ByteDance, but at a smaller scale. Kuaishou, like Douyin, broke out in live-streaming but generally catered to more rural / tier 3-4 cities in China, while XiaoHongShu focused on video recommendations for a more affluent / educated user base. Both have substantial video data and distribution to tap. Kling, recently released by Kuaishou, is one of the top video models globally.

Kling 1.6 Rating (sorry I was too cheap to pay for another subscription to Kling 2.1 which is clearly better than the below): B-

Not bad, but I'm not sure what is going on with the dragonflies, which look more like weird birds! Many users rate Kling 2.1 as likely the best video model outside of Veo3 and Seedance 1.0.

Baidu:

China's search giant is also an AI contender, investing across text (ERNIE) and image (its ERNIE-ViLG image generator), with diverse video assets from iQiyi to autonomous driving (Apollo). Baidu is kind of like a poor man's Google with an interesting assortment of assets, but without the same advantages in video:

  • Strengths: Deep AI research with strong historic emphasis on computer vision and a diverse ecosystem across search, cloud, and content.
  • Weaknesses: consumer influence has waned relative to Tencent/ByteDance in recent years. Less obvious advantages in video. XiaoHongShu carved out a good chunk of the search market with its video-friendly approach to recommendations.

Specialty Challengers (Leading Startups and Pure-Plays)

Runway ML (USA):

Runway was a first mover in video AI, specializing in providing models and AI tools for creators, designers, and filmmakers, positioning itself as the Adobe of AI. My suspicion is Runway will ultimately not go head-to-head at the model layer, but will focus on user experience and tooling as a differentiator, competing on seamless integrations / tooling.

User Review: Runway has now come out with Gen 4, which is supposed to be quite strong. However, the user experience on the site is pretty poor. I signed up to get credits and then was unable to use them to generate my sample video which was very frustrating…

  • Strengths: carving out a brand among digital artists with first mover advantage and strong, multimodal tool suite
  • Weaknesses: no real data / compute advantages. Risk of commoditization by large players' open-source releases offered as part of a bundled offering, monetized elsewhere in the ecosystem.
  • Notable Metrics: close to US$600m in raised capital, >US$3b post-money valuation, ~US$84m in ARR as of Dec 2024 according to Sacra, “targeting ~US$300m in 2025”

Luma Labs (USA):

Luma initially became known for its AI-powered 3D capture (using NeRF technology) before introducing Dream Machine, a generative video/3D tool focusing on high-quality, photorealistic outputs and consistent scenes/characters. After the release of Ray 2, Luma AI has integrated their video-model offerings into Amazon Bedrock, a promising signal that Amazon may be leaning towards "anointing" Luma AI as its primary video partner. Luma offers more advanced features such as reference character images (for consistency), specified start and end frames to control a shot, and style guidance, targeting professional users like filmmakers, VFX artists, and possibly AR/VR content creators (given its 3D roots).

User Review: Sadly free tier does not come with ability to test models

  • Strengths: Strong technical chops (cutting edge of NeRF and generative fusion) and "anointing" from Amazon as both an investor and tech partner are promising signals on competence, distribution, and future access to capital in a compute intensive end market
  • Weaknesses: sub-scale on their own. Targeting high-end production means longer sales cycles and smaller user base compared to a viral consumer app. Big Tech / open source competition will impact margins.
  • Notable Metrics: Raised ~US$90m series B at >US$900m post-money

Stability AI (UK):

Stability AI is an example of both the power and downsides of open source. It became a household name after the release of its Stable Diffusion image model before pushing into video offerings. While the project quickly rose to prominence, high burn and the lack of a concrete monetization strategy forced a restructuring and cash injection in 2024, highlighting the difficulty of sustainable open source without a clear monetization path.

  • Strengths: A passionate open-source community driving experimentation, rapid feedback, and product iteration. Strong brand given first mover advantage in images and open source DNA
  • Weaknesses: Lacking the compute, data, and funding of large tech companies. Still need to prove the business model can be self-sustaining in what is an extremely cutthroat industry.
  • Notable Metrics: ~US$300m raised, ~US$1b valuation

Synthesia (UK):

Synthesia is a UK-based startup that focuses on generating videos with human avatars from text (commonly used for corporate training, marketing, news summaries), providing an example of a more application-specific approach to video generation, focused on a particular niche. It uses a combination of real footage of actors and AI to synthesize speech and subtle mouth movements, leading to strong initial monetization. They have reportedly raised ~US$330m, with a most recent valuation of US$2.1b, and have recently crossed the US$100m ARR threshold.

  • Strengths: Focus on a profitable niche with proprietary acting data to make avatars, and a head start in enterprise sales / relationships.
  • Weaknesses: Over time, I would expect the underlying technology to be commoditized by large video models and open source. They will need to compete on BD / integrations more than superior technology.

MidJourney (USA)

MidJourney was a primarily bootstrapped project with reportedly ~US$200m in annual revenues in 2023, projecting ~US$300m in 2024. The company distinguishes itself by running the world's largest AI-art community, tapping Discord for frictionless distribution with a strong shipping cadence and aggressive personalization.

Pika Labs (USA)

Pika Labs is similar to Runway in its focus on fusing video generation and frame-level editing in a single web-plus-Discord product, aiming for Hollywood-style tricks (style-shift, extend, canvas-expand). It initially gained popularity on Twitter for stylish AI video loops. The company has raised a total of US$135m, with the last recorded valuation at ~US$470m.

MiniMax, Zhipu AI & ShengShu (China)

Minimax and Zhipu are the most prominent of "the four AI tigers" in video; all have received sizable investment from China's domestic incumbents.

Minimax is the leading contender, with an ex-SenseTime-led team backed by Tencent and Alibaba and a strong track record of product releases (e.g. Talkie and Hailuo). The company has raised a monstrous US$850m, most recently valued at US$2.5b in 2024, with estimated revenues of ~US$70m annualized at that time. Of the group, Minimax seems the most likely of the Chinese AI video startups to reach consumer scale, potentially pulling off an OpenAI-style flywheel of usage → feedback → funding → better models → more usage, though it is clearly early days.

  • Strengths: Talented team, significant funding, and local market understanding in a large market barricaded from external competition with privileged access to leading distribution / cloud partners like Tencent / Baba
  • Weaknesses: Stiff competition from open source / domestic titans who will potentially have conflicts of interest similar to the rift between OpenAI and Microsoft

Rival Zhipu AI has also raised a colossal amount (~US$950m), most recently valuing the company at ~US$3b. The company was founded by renowned Tsinghua faculty and researchers and, despite limited revenue traction, boasts an impressive track record of open-source models from CogVideoX to the IPOC family of models (#2 and #4 on VBench, another leaderboard). Though my China-based giga-brain friends (like @moonshot6666) tell me Zhipu is falling behind.

The other company worth mentioning, based on the latest VBench rankings (before they stopped updating), is ShengShu. While ShengShu is materially smaller than Minimax and Zhipu, having raised only a few tens of millions with limited revenue to date, its latest model Vidu Q1, released in April, managed to top the latest VBench rankings.

User Review: B

Vidu receives a B largely due to its forced animation against specific instructions for realism - despite my opting for "general mode" (not animation) - which… well… didn't really pan out.

Needless to say, the competitive set is crowded. Yet it seems the deck is stacked in favor of a few, larger names.

PICKING WINNERS

The criteria for winning this race are fairly clear:

  • Data: Access to large scale video data and the infrastructure to process it is critical, and public data sets, especially labeled ones, are more limited than text or images.
  • Compute: Training and serving video models demands enormous computation. The next frontier of video will provide another leg up in demand.
  • Talent: The expertise to build and refine these models is scarce. Though, people tend to follow the data and compute...
  • Distribution: Distribution often outcompetes tech. Getting a product into the hands of millions of users is non-trivial.

In light of these factors, the best positioned players should not be a huge surprise:

  • Google: After a surprisingly slow start, Google is beginning to flex, donning its infinity stones one by one. Veo 3 is very impressive, and yet it still feels like Google has only started to scratch the surface of what should be possible given their data, talent, compute, and distribution advantages.
  • ByteDance: The company has been the most aggressive of the Chinese tech companies in investing in compute and is the one to watch in AI video with a string of 2025 releases which have likely laid waste to a whole generation of startups. The Seedance 1.0 release is just a sign of things to come.
  • OpenAI: OpenAI has now made the leap to "consumer internet giant, with AI lab characteristics", boasting ~800m active users. Bundling in Sora with other products makes it a real contender for multimodal leadership based on brand, distribution, and research talent.
  • Meta: despite a dearth of impressive video model releases, Meta has a track record as a ruthless fast-follower: from Snapchat Stories, to Instagram Reels, to the Llama series, Zuck will have competitive AI video products piped to >3b users. It is a matter of time. The aggressive recruiting of the Superintelligence team is a sign that Zuck means business.
  • Baba/Tencent: to me, China's tech giants are some of the biggest winners from ubiquitous, cheap generative AI plugged into their massive ecosystems across all modalities, with strong network effects and tame valuations. Amidst the geopolitical tensions, China's export economy will face headwinds, and the country will require greater domestic stimulus to offset the growth hit. The public markets have underpriced Tencent / Baba as key beneficiaries of domestic AI investment and consumption tailwinds.
  • Integrators: In industrial use cases, the feedback loop between active deployments and data collection favors scaled, integrated players like Tesla, BYD, Unitree, and others... but they will need large scale video data and real2sim environments to even kickstart the flywheel

There is little "alpha" in the above predictions given their sizable market caps. However, most of these companies are "hedged" in that they have the ability to monetize either:

  1. Cutting edge model capabilities that outpace open source or
  2. Strong product ecosystems that can benefit from ubiquitous, cheap generative AI.

Like text, the video model layer will be a difficult arena for start-ups to compete in: stuck between the closed-source giants and the open-source swarm. And yet, growth-stage investors have no shortage of ways to play an obvious trend.

Value Accrual: The Pending Bifurcation

OpenAI raising US$40b at US$300b post and Anthropic raising at >US$60b clearly indicate at least some smart investors expect material value extraction from model layer leadership - whether via APIs or vertical integration with the product layer over time. OpenAI's transformation into a consumer tech company is emblematic of how vertical integration might play out.

After DeepSeek's distillation, closed labs will be more savvy in protecting their endpoints, likely keeping SOTA models in house for continuous training. Any acceleration away from open source in performance would disrupt the current "commoditization of the model layer" narrative altogether, making model-layer performance an extreme differentiator.

However, to date, open-source models have remained stubbornly competitive in video, just like in other modalities. China, in particular, has been flooding the market with extremely high-quality, cost-competitive open models. With China's large digital ecosystem, tech giants armed with extensive video assets, and a clear path to monetizing open models in their existing product suites, it's likely we see model value capture squeezed towards the bookends: to low-cost infra providers and demand aggregation.

The opportunity set spans chip / networking supply chains, the hyperscalers who assemble them, the software to orchestrate workloads, all the way to the end products themselves. Importantly, the competitive landscape is multi-layered, laced with overlapping partnerships, horizontal expansion, and vertical integration efforts, not to mention a geopolitical environment now bifurcating global value chains: one for an American-led west and another for a Sino-led global south.

The investment landscape is incredibly dynamic.

Infra: The 20,000 Foot View

Given the significantly increased data load - adding a temporal dimension to already data-intensive images - the compute needs for video are enormous. As a rough heuristic, a short video can require ~10-50x more FLOPs than an image model and 50-100x more than a text-only prompt. Real-time video analysis for extended periods can quickly spiral higher.
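To see where a multiple like that can come from, here is a quick back-of-envelope sketch. Every number below is an assumption for illustration, not a benchmark of any particular model:

```python
# Back-of-envelope sketch of the ~10-50x heuristic above (illustrative assumptions only).
image_step_flops = 5e11            # assumed cost of one denoising step for one image
denoise_steps = 30                 # typical diffusion step count

frames = 120                       # 5 seconds at 24 fps
temporal_compression = 8           # assumed: the video VAE compresses ~8x along the time axis
latent_frames = frames / temporal_compression

image_cost = image_step_flops * denoise_steps
video_cost = image_step_flops * denoise_steps * latent_frames

print(f"one image : {image_cost:.1e} FLOPs")
print(f"5s video  : {video_cost:.1e} FLOPs  (~{video_cost / image_cost:.0f}x the image)")
# Without temporal compression the multiple balloons toward the raw frame count (~120x),
# which is why latent compression and step-count reduction matter so much for cost.
```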

To quote AnyScale CEO Robert Nishihara, "we are moving from a world that is compute intensive to one that is heavily compute and heavily data intensive, and that is going to break a lot of stuff".

Large-scale video, audio, and truly multimodal models simultaneously put pressure on FLOPs, memory bandwidth, network fabric, and throughput, requiring continued investment across the stack. The winners will be firms that can assist in shipping massively parallel compute coupled to high-bandwidth memory, stitched together by low-latency networks in power-dense data centers with retrieval-optimized data pipelines. A new backbone for a world where machines can digest massive amounts of multimodal data, assess, and create.

Below are a few pockets levered to the trend worth highlighting.

Chips:

Leading players like Nvidia / AMD will no doubt be at the forefront of new designs catering to video workloads, but they will not be alone. In-house hyperscaler efforts now span GPUs, TPUs, FPGAs, and ASICs - not to mention startup accelerators from Groq to Cerebras to SambaNova to d-Matrix and more. All are gunning for Jensen's crown.

Greater fragmentation feels inevitable as chip designers optimize along different constraints for different workloads. Video will be a big one.

Memory & Interconnect:

High-bandwidth memory suppliers like SK Hynix, Samsung, and Micron are also well-positioned, with the HBM TAM projected to expand from US$35b to US$100b by 2030.

On the networking side, Nvidia, Broadcom, and Arista stand out.

Deployment:

Semianalysis just dropped the de facto bible on neocloud infra, so I'm not going to rehash that:

Source: SemiAnalysis - The GPU Cloud ClusterMAX Rating System

Outside of the hyperscalers, it is surprising that numerous startups have carved out positions in an extremely capital-intensive industry through focus and speed. While we expect consolidation in the years ahead, niche players who specialize in serving video workloads may carve out meaningful chunks of the market with novel infra setups.

Orchestration:

Orchestration platforms provide a more asset-light means of exposure, providing a software layer to help optimize the scaling and deployment of AI workloads across often heterogeneous hardware to maximize performance and minimize cost and vendor lock-in.

AnyScale, mentioned earlier, is just one example of many. The company has raised >US$250m to commercialize the Ray distributed compute framework for sizable ML workloads across some of the world's largest organizations, increasingly focused on large scale video / multimodal demand.
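For a flavor of what that looks like in practice, here is a minimal sketch using Ray's public task API to fan video-generation jobs out across GPU workers. `generate_clip` is a hypothetical placeholder for whatever model call sits inside each job, and the sketch assumes a cluster with GPU workers is available:

```python
import ray

ray.init()  # connect to an existing cluster, or start a local one

@ray.remote(num_gpus=1)  # assumption: GPU workers are available in the cluster
def generate_clip(prompt: str) -> str:
    # Placeholder: in a real pipeline, load a video model here, render the prompt,
    # and return a path / URI to the finished clip.
    return f"clips/{abs(hash(prompt))}.mp4"

prompts = [f"product promo, variant {i}" for i in range(100)]
futures = [generate_clip.remote(p) for p in prompts]  # schedule 100 parallel jobs
clips = ray.get(futures)                              # block until all clips are rendered
print(len(clips), "clips rendered")
```

The appeal is that the same few lines scale from a laptop to a multi-node GPU cluster, which is exactly the kind of heavy, bursty workload video generation produces.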

The category is wide ranging including:

  • Inference Orchestration and MLOps: Deploying a generative model involves many components – containerization, model repositories, GPUs or other accelerators, load balancers, feature preprocessors, and monitoring.
  • Dynamic Resource Scheduling: For example, Run:ai (acquired by Nvidia for US$700m) offers a virtualization layer that lets multiple AI jobs share GPUs or dynamically borrow unused GPU memory/compute from each other
  • Model Optimization Software: platforms, like OctoML, focus on making models run faster on existing hardware.
  • Network and Pipeline Optimization: Beyond compute, platforms need to address the dataflow pipeline (e.g. MosaicML Streaming from Databricks)
  • Integrated APIs: higher-level services like Fal.ai that abstract the various components, specializing in serving inference for media workloads and offering a unified API to run hundreds of different generative models across modalities (image, video, audio). Others, like Baseten and Replicate, similarly allow developers to deploy models with one line of code or via a web platform.

We are likely heading towards a future where virtually every software product integrates generative media, and many will prefer to call an API rather than maintain their own infrastructure – similar to how Twilio succeeded by offering easy telecom APIs.

While these specialist solutions have clear tailwinds, they will face increasing competition from bundled hyperscaler offerings who squeeze efficiency gains from the entire stack.

The App Layer

Beyond the obvious (but often priced in) structural tailwinds at the infrastructure layer, the application layer is increasingly the battleground of new value capture, where raw AI capability is transformed into usable solutions: intuitive interfaces, editing tools, integrations, and domain-specific features.

Whether vertically integrated approaches - like Google in the extreme (infra + model + apps) or OpenAI (model + product) - outcompete modular approaches (i.e. cloud + model of choice + standalone product) is one of the trillion dollar questions in AI right now.

Investing here is tricky. "Picking up pennies in front of a steam roller" is an ever-present risk. The graveyard of startups created from the ChatGPT -> o3 product march is deep and wide.

Yet, the bull case for the app layer, and the modular stack more broadly, can be made through companies like Cursor, which has built a ~US$10b business atop foundation models: in this case, in software development. Cursor helps harness raw LLMs to provide a full developer co-pilot by supplying the right context, enforcing safe edits, automating editor-level workflows, etc. - capabilities that the model APIs alone do not (yet) deliver - substantially improving the developer experience.

In their current state, video models still require similar "meta-layers" to help harness their raw capabilities into products end users can better utilize. From easy templates to prompt scaffolding to character consistency to proper context to precision editing tools to broader integrations, and more, video models still require significant help in terms of UX, and those workflows will look different depending on the specific use case in question: casual or expert, consumer or enterprise, ads or education. This fragmentation provides cracks between the base model and the end user in which many applications aim to insert themselves, playing a "cursor-esque" role for media generation in different use cases.
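As a purely hypothetical illustration of such a meta-layer (not any named product's actual code), a few lines of scaffolding can already encode a reusable character sheet for consistency and per-shot defaults that the raw model would otherwise need the user to hand-craft on every prompt:

```python
from dataclasses import dataclass

@dataclass
class CharacterSheet:
    name: str
    description: str  # reused verbatim across shots to keep the character consistent

@dataclass
class ShotSpec:
    action: str
    camera: str = "medium shot, eye level"
    style: str = "photorealistic, natural light"
    duration_s: int = 8

def build_prompt(character: CharacterSheet, shot: ShotSpec) -> str:
    """Compose the structured prompt a raw model would otherwise need the user to hand-craft."""
    return (
        f"{shot.style}. {shot.camera}. "
        f"{character.name}: {character.description}. "
        f"Action: {shot.action}. Duration: {shot.duration_s}s."
    )

hero = CharacterSheet("Mara", "a Filipina teen in a denim jacket, tired eyes")
print(build_prompt(hero, ShotSpec(action="walks to the window as dragonflies fill the afternoon sky")))
```

Real products layer far more on top (editing timelines, shot-to-shot memory, brand templates), but the shape of the value-add is the same: translate loose intent into the structured inputs the model actually rewards.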

While they target different use cases, there are similarities in their approaches:

  • Vertical Specialization: many tailor features to specific industry use cases, creating deep domain expertise (e.g. Synthesia and HeyGen going deep on corporate avatars)
  • Enhanced Feature Set & Workflow Integration: combining video generation with intensive editing and post-production tools to offer an end-to-end pipeline (e.g. Runway or Luma gunning for the "comprehensive AI creative studio" vibe).
  • Community & Network Effects: trying to leverage virality for distribution (e.g. OpenAI with its Ghiblification campaign or MidJourney's fantastic community engagement, sharing prompts and templates to enhance the product and expand the user network)
  • Partnerships: symbiotic relationships to expand data / usage loops (e.g. Synthesia integrating with Learning Management Systems like WorkRamp or Runway partnering with Lionsgate and Getty for exclusive data)

On the consumer side, Runway, Higgsfield, Luma Labs, Skade, Pollo AI are great examples, some vertically integrated, others pure aggregators or workflow / tooling providers atop a host of models. On the enterprise side, Synthesia, HeyGen, Aeon and Pictory are a few worth watching.

The opportunity is large, but precarious. Each will have to move quickly to capture as many users as possible before subsequent model layer releases cannibalize their added functionality. The race is on: can large labs and tech companies vertically integrate up the stack to add these UX enhancements in the largest use cases or can startups land grab a broad enough slice of demand to begin fine-tuning their own models, using proximity to the customer to expand their own defensibility over time?

We will almost certainly have a mix of outcomes, but, in the largest use cases, the risk of models expanding upwards appears greater than the inverse.

Lastly, consumer internet platforms with network effects will be big winners. Meta, Baba, Tencent, and ByteDance all have massive ecosystems which can leverage these tools: optimizing products to enhance time spent, ad targeting, and conversion.

If the network effects are built in, the big will get bigger.

Embodied AI

The last category worth mentioning given the interplay and sheer size of the TAM is embodied AI. Sufficient video data is one crucial bottleneck given the near-infinite edge cases which can emerge in the real world. As Lyn Alden points out, we should be more cautious about the timelines to deploy in the world of atoms vs. the world of bits. Level 5 FSD has been “two years” away for the last fifteen and bottlenecks to scaled humanoid deployment may prove similarly stubborn.

Yet the rapid advances from scaling up transformers, the plummeting costs, and the promise of RL all point to a world in which video data, combined with simulated environments that mimic the real world with close-enough fidelity, can provide useful training data for the edge cases a humanoid may encounter in unconstrained environments. The question becomes: can we reach the "good enough" tipping point via video data and simulated environments to kickstart real-world deployment and the ensuing data feedback loop which would then take hold?

In many respects, China is well positioned here as the robot supply chain has substantial overlap with the automotive and EV supply chains. Shenzhen’s hardware agglomeration effects will be difficult for other regions to replicate.

The size of the prize is enormous, and there are several pockets to look to for exposure:

  1. Large public companies with strong hardware and software integration: like Tesla, Amazon, Xpeng, Xiaomi, BYD, Google etc., but obviously these companies tend to already be well into the >US$100b range.
  2. Large public suppliers like MCHP, IFX, RRX, and ON, trading at cyclical lows from the automotive cycle but with upside exposure to an acceleration in robotics (Citrini Research)
  3. Secondaries into leading private humanoid integrators - like Figure or Unitree.
  4. Hardware-first privates: scan for software leaders with strong video assets and AI chops, and scope for partnership candidates on the hardware side. For instance, DeepMind's partnership with Apptronik. Nvidia, Meta, Baba, Tencent, Baidu etc. will all have partners of choice on the hardware side to compete with the US and China's strong stable of integrators.
  5. Software-first privates: Physical Intelligence - which recently raised US$400m from Bezos and OpenAI - is probably the poster child here, aiming to craft software that can work with any robot. Nvidia is another pioneer, releasing Gr00t in March, its open-source foundation model for robotics.

The Edge of Tomorrow

Today, video models are having their ChatGPT moment. Advancements over the next three years will upend industries from marketing and advertising to media and entertainment, education and more. The costs will crater and the supply will explode.

UGC but on an unparalleled scale. The Tiktokification of Hollywood-caliber entertainment, ads, courses, and more... constrained only by the number of eyeballs on planet earth and the twenty four hours in a day they can consume.

And while that alone is exciting (or anxiety inducing) for entrepreneurs, investors, and apparatchiks, it still feels like a "skeuomorphic" view of how these tools will be used.

Video and computer use (report coming soon!) are key unlocks for truly multimodal models that will interact with the world; a tipping point in the shift towards the "era of experience" (link to Paul's article). RL loops today are limited by their textual confines but may soon find new horizons. My colleague Lexy says it well, referencing Ex Machina: "only when (Mary) steps outside and experiences the color red directly does she truly understand it."

Maybe the consumption of novel entertainment will not be Gen AI video's ultimate legacy. Fun and wild and culturally transformational and almost certainly an accelerant of "The Great Weirding", but really just an appetizer. The something that "looks like a toy" that generally precedes something much more significant.

Video models will start as a plaything. But their very existence portends a future where synthetic intelligence comprehends the laws of physics, and the commands necessary to recreate and manipulate them. A dense cocktail of training data from existing web videos, private data sets, high-fidelity simulated environments, real-world robotics data, and RL experimentation which, when consumed, helps synthetic intelligence to learn how the world outside of text actually works.

Models are going multimodal. They will escape the confines of cyberspace and merge with the world of atoms.

You are here.

Source: Wait But Why

***

Many thanks to Alex L. (@moonshot6666) for his insightful feedback on this essay.