Guest posts

DeAI II: Seizing the Means of Production

Exploring how DeAI II unlocks decentralized AI infrastructure by empowering community compute, data ecosystems and GPU marketplaces.

Pondering Durian
by Pondering Durian
18.07.2024
42 min read
photo by thaxnay kapdee

This report has been made freely available to all readers thanks to Olas, the headline sponsor of Delphi’s Crypto x AI Month. Olas enables everyone to own a share of AI, specifically autonomous agent economies.

AI x Crypto (Part 2)

Part 1 – The Tower & the Square
Part 2 – Seizing the Means of Production (Infrastructure)
Part 3 – Composable Compute (Middleware)
Part 4 – The Agentic Economy (Apps)

Disclaimer: I cannot possibly cover all of deAI infra in a single report. This is my best effort to cover core areas and projects, but undoubtedly will not be comprehensive.

The Means of Production

“The people will use their political supremacy to wrest, by degrees, all data and compute from Big Tech, to decentralise all instruments of production in the hands of the Network … and to increase the total of productive forces as rapidly as possible.”

-The Crypto-Anarchist Manifesto, @ponderingdurian 2024

The inputs which underpin economic activity are shifting. The networks of capital and labor which drove the 20th century are rapidly giving way to those of the 21st: networks of data and compute. Data is the raw material. And compute is increasingly the means by which it is converted into a “finished product”. In services-driven economies, intelligence is the product.

As Carl Shulman recently pointed out, the energy required to power one human mind (~20 watts) may soon sustain 50,000 human-brain equivalents of AI cognitive labor – a 50,000x decrease in the price of cognition. Consider a high-performing professional at, say, $100 per hour, at 100% employment, who never sleeps or takes time off: close to $1 million in annualized wage equivalents today. Tomorrow, the same energy budget could sustain roughly US$50b worth of wage equivalents.
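The back-of-the-envelope math, using only the figures above (20 watts, $100/hour, 50,000 brain equivalents):

```python
# Rough wage-equivalent arithmetic for a ~20W "brain budget"
HOURS_PER_YEAR = 24 * 365          # always on: no sleep, no vacation
wage_per_hour = 100                # high-performing professional, US$/hour
brain_equivalents = 50_000         # AI cognitive workers per ~20W, per the estimate above

annual_wage_equivalent = wage_per_hour * HOURS_PER_YEAR      # ~US$0.9m per worker
total_value = annual_wage_equivalent * brain_equivalents     # ~US$44b per 20W budget

print(f"One tireless worker: ~US${annual_wage_equivalent / 1e6:.2f}m per year")
print(f"{brain_equivalents:,} AI equivalents: ~US${total_value / 1e9:.0f}b per year")
```

Call it US$40–50b of wage equivalents per 20 watts, depending on how generously you round.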

There is a not insignificant chance the value of human cognitive labor implodes over the next decade. On the flip side, the value of capital – specifically capital channeled into data and compute – will reap the benefits of those gains, swallowing the share which previously would have gone to labor.

Realistically, I see two potential solutions (assuming one’s p(doom) isn’t too high…):

  1. The State
    This is the likely path: Extremely-Redistributive Democratic Socialism. Accelerating inequality leads to populist pressure for redistribution. Politics and the state increasingly replace markets as the tool by which (hopefully now less) scarce resources are allocated.
    This can work in rich nations like the United States, which have a fairly abundant capital base and house many of the Tech Giants set to own the 21st century’s means of production.
    In the emerging world, where the traditional manufacturing -> urbanization -> services growth model is rapidly closing due to automation, the path is much less clear.
  2. A Digital Renaissance
    Instead of trying to assuage the symptoms of a broken system through redistribution (with many countries not having much to redistribute), deAI aims to attack the root cause: laying a new foundation which allows individuals, increasingly displaced by automation, a way to contribute to, and have ownership of, the new means of production. Digital property rights and universally accessible compute.
    The shape of economic activity is increasingly a mesh directed by algorithms. And the algorithms – fueled by data and compute – are increasingly writing themselves.

The key question facing the 21st century is oddly unchanged from the prior two: Who will own the means of production?

Compute

Because AI moats are actually shallower than web2 network effects, Big Tech is hoping to retain their oligopoly through regulatory capture and cornering the market for the essential inputs: talent, data, and compute.

Perhaps compute is the most blatant.

By locking up the supply of high-performing chips, Big Tech can either:

  1. Craft intelligences superior to others in the market and rent them out at elevated margins or
  2. Rent out the underlying hardware at elevated margins

Cloud at 60 – 70% gross margins is representative of the oligopoly profits these companies have attained through scale and capital moats.

Fortunately, a host of GPU marketplaces are emerging to challenge this oligopoly by coordinating latent hardware globally to compete with Big Tech’s projected US$1T data center build out.

Source: https://www.topology.vc/deai-map by Casey

My analysis cannot cover all of the above players, but sticks to several leaders in the category, covering a mix across:

  • Hardware Quality: Enterprise Grade vs. Consumer Grade
  • End Markets: Rendering, Gaming, Compute, Artificial Intelligence
  • And even within AI: a focus on training vs. inference

Decentralized Training

Projects like Gensyn and Prime Intellect are tackling what is arguably the most challenging portion of the decentralized intelligence stack: training.

The consensus view – that decentralized training is still largely a “pipedream” – has merit.

Microsoft, Meta and xAI are all racing to build clusters with over 100k H100 GPUs. As noted by SemiAnalysis, these clusters would cost in excess of US$4b in server capex alone, require >150MW of capacity – roughly 1.59 terawatt-hours a year, or about US$124m at a standard rate of US$0.078/kWh – and provide the next step-function in foundational model capabilities.

Source: Semianalysis.com

So far, Big Tech has managed to find several facilities capable of >150MW, which means single 100k H100 clusters are coming. That would be roughly a 31.5x increase in computation over the ~20,000 A100s used to train GPT-4!
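For those who want to sanity-check those claims, the arithmetic is simple. The average-draw and per-GPU performance figures below are my assumptions, chosen to be consistent with the numbers cited above; everything else is multiplication:

```python
# Sanity check on the 100k H100 cluster figures above
avg_power_mw = 181              # average draw implied by 1.59 TWh/yr (>150MW IT load plus overhead)
hours_per_year = 8760
energy_twh = avg_power_mw * hours_per_year / 1e6        # ~1.59 TWh
annual_power_cost = energy_twh * 1e9 * 0.078            # kWh x US$0.078/kWh ≈ US$124m

gpu_ratio = 100_000 / 20_000    # 100k H100s vs. ~20k A100s for GPT-4
perf_ratio = 6.3                # assumed per-GPU H100 vs. A100 training throughput
print(f"Energy: {energy_twh:.2f} TWh/yr, power cost: ~US${annual_power_cost / 1e6:.0f}m")
print(f"Compute uplift: ~{gpu_ratio * perf_ratio:.1f}x")
```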

Even that may be underestimating the step functions coming to a mega data-center near you…

There are reasons for this massive co-location of GPUs: training >1T parameter LLMs requires parallelism – usually using a combination of three methods:

  1. Data parallelism
  2. Tensor parallelism
  3. Pipeline parallelism

Data parallelism: each GPU holds an entire copy of the model weights, and the data is sharded. For data parallelism to work, each GPU needs enough memory to store the model weights, and for >1 trillion parameter LLMs the weights plus optimizer state can run to >10 terabytes of memory, leading to…

Tensor parallelism: helps to overcome GPU memory constraints by spreading model weights across multiple GPUs, which requires very high bandwidth and low latency. “In effect, every GPU in the domain works together on every layer with every other GPU as if they were all one giant GPU”. Latency matters.

Pipeline parallelism is another technique to ease memory constraints, sharding model weights and the optimizer state by layer, with each subset of GPUs passing activations on to the next.

To maximize GPU utilization, mega-clusters combine all three techniques: tensor parallelism within H100 servers, pipeline parallelism between nodes in the same island, and data parallelism (the lowest communication volume) between islands (networking between islands is ~7x slower than within an island, and even 2km of distance can cause >10x price increases for optical transceivers).
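A minimal sketch of the memory math that forces this layering (the model size, byte counts, and parallelism degrees are illustrative assumptions, not figures from SemiAnalysis):

```python
# Why data parallelism alone breaks down at the trillion-parameter scale
params = 1.0e12                # illustrative 1T parameter model
bytes_per_param = 16           # fp16 weights/grads + fp32 Adam optimizer state (rough rule of thumb)
gpu_memory_gb = 80             # H100 HBM capacity

total_state_tb = params * bytes_per_param / 1e12
print(f"Weights + optimizer state: ~{total_state_tb:.0f} TB")   # ~16 TB; no single GPU comes close

# Tensor + pipeline parallelism shard that state across one "model replica":
tensor_parallel = 8            # GPUs per server sharing each layer (NVLink/NVSwitch domain)
pipeline_parallel = 32         # servers, each holding a slice of the layers
gpus_per_replica = tensor_parallel * pipeline_parallel
per_gpu_gb = total_state_tb * 1e3 / gpus_per_replica
print(f"Per-GPU state across {gpus_per_replica} GPUs: ~{per_gpu_gb:.0f} GB of {gpu_memory_gb} GB")

# Data parallelism then replicates this 256-GPU unit across islands, where only
# gradient synchronization (the cheapest communication pattern) crosses the slow links.
```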

Realistically, the slower interconnect, non-homogeneous hardware, and more limited consistency and reliability are all real constraints meaning frontier models, under the current training paradigm, will continue to be produced by co-located superclusters.

However, distributed training should not be discounted too heavily. The advantages are two fold:

  1. Openness / Censorship Resistance: not all models need to be multi-trillion parameter LLMs to be useful. Three labs dictating what types of intelligences can be trained is incredibly restrictive. As we mentioned last month, Zuckerberg will not always be there for open source. Laying the infrastructure for a collective training apparatus is essential to the future of open intelligence and is aligned with the barbell thesis outlined later on.
  2. Long-term Performance: as energy constraints continue to bite, the generation after 100k clusters will further highlight the tension in doubling down on ever larger single clusters vs. pushing the boundaries of distributed training techniques.

Alex Long makes a compelling argument in this write up, reminding us what is possible with distributed incentives:

“Bitcoin PoW mining consumption was estimated at 150 TWh in 2022, or 17.12 GW on average and approximately 0.5% of total worldwide energy consumption. While this figure is approximate, the core fact remains striking; compute two orders of magnitude larger than the largest pools of centrally controlled compute has already been assembled under a single protocol.”

Perhaps the future is not up, but out…

SOTA in Decentralized Training

To date, cutting-edge foundational models have been honed with bandwidth-intensive backpropagation techniques – driving capex investment into co-located clusters. However, as other constraints like power, regulations, and data center capacity begin to bite, distributed approaches which require less intensive inter-node communication should gain in mindshare.

Below summarizes a fantastic write-up from Prime Intellect, outlining some of the distributed techniques beginning to show promise. Combined with distributed incentives on offer in crypto, we may be in the early innings of a decentralized training renaissance:

Distributed Low-Communication Training (DiLoCo)

From Google DeepMind, this approach enables training on islands of poorly connected devices – reducing inter-worker communication by only requiring synchronization (of pseudo-gradients) every ~500 local steps.

Strengths:

  • Minimal communication between instances
  • Robust to changes in number of workers / compute power in distributed network

Limitations:

  • Still subject to data parallelism constraints based on individual GPU memory
  • More difficult in an async setting
  • Only tested at model sizes of ~400M–1.1B parameters

That upper bound was just raised: this week, Prime Intellect released OpenDiLoCo, reproducing DeepMind’s original work across three countries and scaling it to 3x the size of the original model.
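A minimal sketch of the DiLoCo-style inner/outer loop, simulated sequentially on one machine (PyTorch-flavored; the real method uses AdamW inner optimizers and Nesterov momentum on the outer step, and the hyperparameters here are illustrative):

```python
import copy
import torch

def diloco_round(model, workers_data, inner_steps=500, outer_lr=0.7):
    """One outer round: each island trains locally for `inner_steps`, then only the
    averaged "pseudo-gradient" (global weights minus local weights) is communicated."""
    global_params = [p.detach().clone() for p in model.parameters()]
    pseudo_grads = [torch.zeros_like(p) for p in global_params]

    for local_batches in workers_data:              # one iterable of (x, y) batches per island
        local_model = copy.deepcopy(model)          # every island starts from the global weights
        opt = torch.optim.AdamW(local_model.parameters(), lr=1e-4)
        for _, (x, y) in zip(range(inner_steps), local_batches):
            loss = torch.nn.functional.mse_loss(local_model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # accumulate this island's contribution to the averaged pseudo-gradient
        for pg, gp, lp in zip(pseudo_grads, global_params, local_model.parameters()):
            pg += (gp - lp.detach()) / len(workers_data)

    # outer update: the only cross-island communication in the entire round
    with torch.no_grad():
        for p, gp, pg in zip(model.parameters(), global_params, pseudo_grads):
            p.copy_(gp - outer_lr * pg)
```

With ~500 local steps between synchronizations, the communication requirement drops by roughly that factor versus synchronizing every step – which is what makes training across islands (or countries) plausible.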

Distributed Path Composition (DiPaCo)

DiPaCo is another technique using sparse models (MoEs) to train larger models while keeping training costs and inter-node communication low, using a coarse routing mechanism to shard the data across individual workers.

SWARM

SWARM is a model-parallel training algorithm designed for poor connectivity and varying reliability, which continuously allocates more work to faster workers to reduce idle GPU time – capable of training >1 billion parameter transformer models with less than 200 Mb/s of bandwidth (a toy scheduling sketch follows the limitations below).

Strengths:

  • Fault-tolerant training on cheaper spot instances
  • Heterogeneous device support

Limitations:

  • 10b parameter training is still unproven
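The toy sketch referenced above – not SWARM’s actual implementation, just the scheduling intuition: each pipeline stage is served by several unreliable peers, and microbatches flow to whichever peers are faster, so slow or dead devices simply receive less work:

```python
import random
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    throughput: float        # measured microbatches/sec (updated online in the real system)
    alive: bool = True
    processed: int = 0

def route_microbatches(stage_peers: list, n_microbatches: int) -> None:
    """Throughput-weighted routing for one pipeline stage: faster peers get
    proportionally more work; dead peers get none."""
    for _ in range(n_microbatches):
        candidates = [p for p in stage_peers if p.alive]
        if not candidates:
            raise RuntimeError("stage has no live peers - rebalance devices from other stages")
        chosen = random.choices(candidates, weights=[p.throughput for p in candidates], k=1)[0]
        chosen.processed += 1

# One stage served by a datacenter GPU and two flaky consumer cards
stage = [Peer("A100", 10.0), Peer("RTX 3090", 4.0), Peer("RTX 3060", 2.0, alive=False)]
route_microbatches(stage, 1000)
print({p.name: p.processed for p in stage})   # roughly a 10:4:0 split
```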

GEMINI ULTRA: A sign of things to come?

While Google has world-class proprietary networking infrastructure, the fact that even Google ran into single-facility constraints in training its flagship Gemini Ultra is telling:

Assuming Gemini Ultra is a 2T parameter MoE model, Prime Intellect estimates a 100-day training run at ~60% GPU utilization would have taken about 18 super-pods spread across different data centers.

This is the direction Gensyn and Prime Intellect are betting on. While today, Fortune 500 CEOs will likely pony up for the more expensive AI-specific clouds or hyperscalers with multi-year commitments, there are other, more cost-conscious customers with less forecastable demand: startups, academics, and hobbyists who also need access to compute.

In my opinion, decentralized training may be the hardest problem in DeAI to solve. Many in crypto have written it off, relegating training to centralized super-clusters and advocating resources be channeled towards tuning or “verifiable compute” in recompense for this unsolvable “original sin”. That view may prove short-sighted. We may look back a decade from now and realize that this – decentralized training – was, in fact, crypto’s killer use case.

Like most contrarian bets, it’s a long shot, but the payoff would be tremendous.

Not Quite “Airbnb for GPUs”

Of the distributed compute networks I have reviewed (an admittedly incomplete set), I find io.net and Aethir the most interesting near-term because of their focus on co-location, enterprise-grade hardware, and proven ability to onboard supply.

The thesis is essentially “AirBnB for GPUs” which, given the mismatch in surging demand and latent supply, appears sound.

On the demand side, the market is exploding.

While on the supply side, the market has been cornered by bulk purchases from the hyperscalers and other specialized clouds. This leaves smaller, price-conscious entities – without corporate balance sheets or a willingness to lock in extended contracts – out in the cold when it comes to spot demand for the ~25–100 GPU clusters needed to run serious training or inference jobs.

At the same time, there is latent supply in non-tier-one data centers and unprofitable miners. In a recent Delphi interview, Tory Green noted that roughly 1,000 data centers in the US operate at under 20% utilization.

While this might seem to be a classic matching problem perfectly suited for a marketplace, successfully meeting this demand is much more complex than simply aggregating disparate GPUs…

Source: Semianalysis, single pod example

Building a modern supercomputer is a herculean task integrating hardware, software, power management, and data center logistics. There are the chip makers, OEMs and ODMs, their suppliers, and specialized cloud providers who all provide expertise in ensuring the supercomputer is optimized for maximum performance and uptime. Given the dollars invested, impeccable service and performance are essential. AWS, Azure, GCP, CoreWeave, Lambda and a host of others are investing heavily in fully-managed offerings which integrate hardware, software, and expertise – not to mention privileged hardware relationships – at a level with which marketplaces will struggle to compete.

Carving out a niche in this highly competitive ecosystem will not be easy, but Aethir and io.net appear best positioned today due to:

  1. Availability of Co-located, Enterprise-Grade Clusters.
    Let’s face it, enhancing / automating human labor is by far the largest use case. This market is set to receive the vast majority of AI server market spend and, for many enterprises, simply “aggregating” hardware is not enough.
    The hardware type, availability, proximity, and orchestration all matter tremendously to performance, which is why io.net has built on top of the open-source framework Ray to better manage heterogeneous clusters, dynamic task scheduling, and efficient distribution of tasks across GPUs – maximizing utilization and decreasing latency, often across different co-located clusters (see the sketch after this list).
  2. Inference-Aligned
    Near-term, the distributed model dove-tails more naturally with inference than training.
    Lambda and MSFT have recently claimed to see >50% of their load used for inference, a number likely to increase steadily as the market shifts from one-off costs of training to the recurring costs of servicing live models.
    While latency is a constraint in distributed training, having more widely dispersed clusters – closer to the end customer – can actually be a boon in inference: almost like a “CDN for inference”.
  3. First Movers in Kick-Starting the Flywheel
    These markets are large and marketplaces enjoy power laws. Network liquidity is an important metric future supply and demand will evaluate – often leading to flywheels behind the first players to inflect.
    Aethir has 43,000+ GPUs, including 4,000 H100s, with aims to reach 50,000 over the course of 2024. According to a Mythos Research Report, Aethir has signed US$20m in ARR as of Q1 2024, largely with cloud gaming customers – demonstrating real demand.
    Io.net’s initial success in onboarding supply has been its primary differentiator to date: reaching ~294k GPUs verified on the network. The willingness of Render, Aethir, Filecoin, and Akash to make devices available on io.net is also telling. By choosing to onboard their supply, these networks see io.net as a potential demand funnel – a privileged position for any aspiring marketplace.
    With only ~US$1.1m in GMV (not annualized nor counting off-chain contracts), io.net will need to pivot from supply side aggregation to ramping up demand, but if it can succeed in establishing itself as the demand aggregator for GPU compute, inflection could start a flywheel worth monitoring.
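The sketch referenced in point 1: a minimal illustration of why Ray is a natural fit for heterogeneous GPU pools. This is purely illustrative – io.net’s actual orchestration layer is its own – and it assumes a Ray cluster with GPUs attached; the paths and batch sizes are made up. Tasks declare the resources they need, and Ray’s scheduler packs them onto whatever mix of machines is available:

```python
import ray

ray.init()  # on a real cluster: ray.init(address="auto") to join an existing head node

@ray.remote(num_gpus=1)
def run_inference(batch):
    # placeholder: load a model shard and serve the batch on whichever GPU Ray assigns
    return f"processed {len(batch)} items"

@ray.remote(num_gpus=4)
def run_finetune(dataset_path):
    # heavier job: only scheduled where 4 GPUs are actually free
    return f"fine-tuned on {dataset_path}"

# Heterogeneous work submitted to one logical pool; the scheduler handles placement,
# queuing, and retries across very different machines (and co-located clusters).
futures = [run_inference.remote(list(range(32))) for _ in range(8)]
futures.append(run_finetune.remote("s3://example-bucket/dataset"))  # hypothetical path
print(ray.get(futures))
```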

Why Crypto?

An astute investor might ask, why crypto? Can’t this just be a marketplace like AirBnB? Isn’t this just… vast.ai?

Fair questions. The answer appears to be fourfold:

  1. Distributed incentives in onboarding supply (classic), but also
  2. Cross Border Payments: both demand for and supply of GPUs are global
  3. Traceability / Micropayments: tracing and splitting micro-payments between, say, the model owner, app creator, and io.net for servicing a single inference, cross-border. Realistically, only ultra-fast, cheap chains like Solana can enable this functionality (a toy example of such a split follows this list)
  4. Permissionless: instead of a multi-week KYC, users can spin up a cluster in ~90 seconds
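The toy example referenced in point 3 – the parties and percentages are invented for illustration; on-chain this would be a program splitting a single payment, not a Python function:

```python
# Hypothetical per-inference revenue split (all parties and rates are illustrative)
def split_inference_fee(fee_usd: float) -> dict:
    splits = {
        "model_owner": 0.60,       # owns the fine-tuned weights
        "app_creator": 0.25,       # front end that originated the request
        "compute_network": 0.15,   # the GPU marketplace serving the job
    }
    return {party: round(fee_usd * share, 8) for party, share in splits.items()}

print(split_inference_fee(0.0004))
# {'model_owner': 0.00024, 'app_creator': 0.0001, 'compute_network': 6e-05}
# Fractions of a cent, attributed and settled cross-border on every request - only
# viable when fees and confirmation times are near zero.
```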

Whether these advantages prove enough to outcompete web2 leaders, only time can tell.

Utilization / Valuation

The narrative appears strong, but like everything in crypto, the AI hype will die down and valuations will need to find a cushion in fundamentals. While these markets are large, DePINs focused on the enterprise have tended to disappoint. Filecoin (data storage) and Akash (cloud computing) are examples of exciting bull market narratives targeting colossal markets which, so far, have seen limited traction. Product-market fit is not found by simply subsidizing supply, but by meeting real enterprise demand.

Aethir has found its first toehold in cloud gaming, while io.net is targeting the under-served seed and Series A startups out of the gate. However, caution on valuation is warranted.

CoreWeave recently received a US$19b valuation on what is projected to be US$2.3b in 2024 revenues (up ~5x from 2023) and contracts worth US$7b through 2026. Even with that ~5x YoY growth, the implied revenue multiple is ~8.2x. Even if Aethir managed to triple ARR by year-end 2024 (which would be extremely impressive), its implied multiple would still be ~50x.
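Backing those multiples out of the figures above (the ~US$3b in the last line is what a ~50x multiple on tripled ARR implies, not a disclosed Aethir valuation):

```python
# Revenue multiples implied by the figures cited above
coreweave_valuation = 19e9
coreweave_2024_rev = 2.3e9
print(f"CoreWeave: ~{coreweave_valuation / coreweave_2024_rev:.2f}x 2024E revenue")

aethir_arr_tripled = 20e6 * 3        # Q1-2024 ARR of ~US$20m, tripled by year end
implied_multiple = 50                # per the comparison above
print(f"~{implied_multiple}x on US${aethir_arr_tripled / 1e6:.0f}m ARR implies a "
      f"~US${implied_multiple * aethir_arr_tripled / 1e9:.0f}b valuation")
```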

In high growth eCommerce marketplaces, valuations can stretch to ~1x GMV in hot markets. Comparing the single digit millions in quarterly GMV flowing through these GPU marketplaces today vs. the billion dollar valuations, there is a disconnect…

On the other hand, we are early. In most industries, there has been a role for first-party managed services as well as marketplace models, and I suspect GPUs will be no different. These markets are colossal. I see these projects as call options on a mega-trend to which, outside of a few trillion-dollar companies, it is difficult for retail to gain pure-play exposure.

Render Network

In my opinion, Render is an interesting project but overvalued.

The use case is compelling with a market structure well-suited for crypto incentives and utilization of non-uniform hardware. Render specifically targets creatives, designers, and architect types who are more cost-conscious and often more sporadic in their compute demands. Unlike frontier LLMs, massive clusters of H100s are not required. Tapping into the excess supply of GPUs – allowing suppliers to make a bit of extra income and users a more cost-effective solution – feels like a win-win.

This is an old chart, but shows RNDR as the true low cost option against competing farms:

Source: @OmegaFoxxx

My biggest concerns, however, are market size and valuation.

According to GMI, the global rendering market is expected to explode from US$4.4b in 2023 to $33b by 2032. However, the majority of value capture is expected to go to software tooling which can direct customers to their own in-house clouds or to the host of render farms competing on cost.

The render farm industry appears extremely fragmented, with many “leaders” appearing to do only single-digit or low-double-digit millions in revenues. One of the largest – Shenzhen-based Fox RenderFarm – appears to be doing just ~$30m topline. This leads me to believe the market for outsourced rendering is surprisingly small today.

OTOY, the parent company of the Render Network, provides rendering software tooling and integrations which allow creatives to easily utilize the Render Network. According to CB Insights, its last equity fundraise was in 2016 at a valuation of US$300m, and today revenues appear to be around US$11m.

This, along with the ~US$5m in quarterly GMV, means that the US$3.5b valuation feels excessive.

How might I be wrong?

We finally seem to be at a tipping point in AR / VR. On the integrated side, Apple Vision Pro is truly a step change in entertainment. Meta has responded by opening up its software stack to external hardware OEMs, promising similar experiences at a mass-market price point. Multi-modal is the logical next battleground in the hair-on-fire race to scale LLMs towards AGI. Sora’s demo of text-to-video was mind-blowing. The world has 3b gamers, many with low-end hardware, who could benefit from “real-time rendering” at the edge. The world we are heading towards will be one of intense personalization and visualization. Life-like NPCs. Real-time augmentation. Entire 3D worlds crafted to the spontaneous tastes of individual users. In many ways, multi-modal LLMs may be the key unlock towards a credible metaverse with near-infinite, on-demand content…

I’m not disagreeing. This is the world we are sprinting towards. I’m simply saying that even Aethir – another player targeting rendering use cases – cites the global real-time rendering market at US$4b by 2033… the bull case for this market, nine years out, is roughly equal to the (AI-narrative-inflated) FDV of Render today.

All that being said, I like the market Render is going after. If AI revenues disappoint, more hardware will find its way into cost-competitive networks. I think crypto has a credible role to play in providing lower-cost solutions at the edge. I think Render is the leading contender in this promising niche today. And I think this niche will end up being larger than most market estimates project.

And yet… I will want to see more than ~US$1m in annualized burns before I pay US$3.5b to own a slice of it…

Cloud Convergence

Filecoin and Arweave have moved into compute with the FVM and AO computer (great write up by Teng), respectively. Akash has added storage capabilities. All three are hoping to make a push into AI. Aethir has both rendering and virtual cloud offerings. Io.net is exploring horizontal market expansion into gaming / zk or vertical expansion into model marketplaces, bringing it into conflict with the models / orchestrators we plan to discuss next month.

Just as we saw in the collision course of leading DEXes, stablecoins, and lending market protocols, decentralized infrastructure will see a convergence. Almost no one plans to sit in their niche; nor should they.

Integrated solutions make sense: enabling compute and AI workflows over data where it resides is logical. However, the lack of real traction of Filecoin, Arweave, and Akash in their original, large markets of storage and compute should give reason to pause when considering their right to win in the latest hot adjacency.

On the compute side, AI ambitions are frustrated by the shift in underlying hardware from CPUs to GPUs – and specifically to co-located, enterprise-grade clusters. Io.net and Aethir are like the CoreWeaves and Lambdas of the distributed AI cloud, but in a universe where the hyperscalers failed to dominate the first cloud transition. Unlike GCP, Azure, and AMZN, the web3 OGs cannot leverage large customer bases, distribution, and cashflows to “bundle” new offerings as effectively.

On the storage side, emerging challengers like 0g promise more performant ingestion and retrieval capabilities, purpose-built to serve AI use cases on-chain.

The race remains wide open.

However, bickering about share between web3 players is almost a distraction at this stage. The real money is in the entire web3 market siphoning share from existing infrastructure providers: a thriving network of agents on-chain would almost certainly mean spiking demand across all distributed infrastructure.

The race is only just beginning.

Data (and Talent)

Before we dive into the deAI data-stack, we should understand the evolving role of data as a constraint in model building.

To put it bluntly, we are running out of public data sets online to train frontier LLMs. Llama 3 was trained on 15 trillion tokens, and a de-duplicated version of Common Crawl (effectively a public scraping of the internet) comes out to roughly 30 trillion tokens. The low-hanging fruit has been picked.

How the problem is solved could very well shape the future and distribution of intelligence.

Scaling the Data Wall

Even more than scaling hardware (chips, bandwidth, power), the next step-change in model capabilities will come from algorithmic efficiencies:

These increases in performance will likely come from a few areas:

  1. Synthetic Data & Self-Play
  2. AI Search
  3. Other “Unhobblings”

Synthetic Data

This is big. On the data side, I was initially bearish on open source’s ability to compete with Big Tech because of the size of their proprietary data advantage. Realistically, Facebook, Google, Amazon and Apple are the only companies with the data scale to materially enhance that of the public commons.

However, synthetic data / simulation / GANs could theoretically remove the current data constraint: the future bounded only by the cost to generate new tokens (i.e. compute, which itself benefits from Moore’s Law, scaling up, and algorithmic efficiencies). Just like AlphaGo leapfrogged previous capabilities by playing millions of games against itself, LLMs may be similarly improved.

Andrew Ng, one of the World’s top AI researchers, thinks so:

The TLDR: LLMs wrapped in agentic workflows can generate more high-quality outputs which can be used recursively in further training – capped largely by the cost to generate new tokens.
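A bare-bones sketch of what such a loop looks like (the `generate`, `critique`, `revise`, and `score` callables are stand-ins for LLM calls – an illustration of the pattern, not any particular lab’s pipeline):

```python
import random

def synthesize_dataset(generate, critique, revise, score,
                       seed_prompts, rounds=2, keep_threshold=0.8):
    """Toy agentic synthetic-data loop: draft -> critique -> revise, then keep only
    high-scoring samples as new training data. Cost scales with tokens generated."""
    dataset = []
    for prompt in seed_prompts:
        candidate = generate(prompt)
        for _ in range(rounds):                      # agentic refinement
            candidate = revise(candidate, critique(candidate))
        if score(candidate) >= keep_threshold:       # quality gate before reuse in training
            dataset.append({"prompt": prompt, "completion": candidate})
    return dataset

# Dummy stand-ins so the sketch runs; in practice these are model calls
kept = synthesize_dataset(
    generate=lambda p: f"draft answer to: {p}",
    critique=lambda c: "be more specific",
    revise=lambda c, feedback: c + " (revised)",
    score=lambda c: random.random(),
    seed_prompts=["What is tensor parallelism?", "Explain DiLoCo."],
)
print(len(kept), "samples kept for further training")
```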

These sentiments were echoed by Dario, CEO of Anthropic: “if you look at it very naively, we’re not that far from running out of data… My guess is that this will not be a blocker… There’s just many different ways to do it”.

Given the size of the prize, it seems likely one or several of the top labs cracks the data wall with clever algorithmic techniques. This would provide a further step change in capabilities and likely provide separation from the other Big Tech firms + Open source.

The impact is difficult to assess: on the one hand, it would give one party – likely a leading lab – an incredible advantage but, on the other, it remains to be seen how defensible algorithmic secrets would remain. Talent is porous and projects like FirstBatch Labs are already working on bringing these capabilities to crypto.

In many ways, data may prove to be an illusory constraint: one solved by a collection of talent and compute.

AI Search

On the other hand, we may not even need to scale up compute to reach AGI levels of performance. Perhaps all we need is more targeted allocation. As Aidan writes in “The Bitter-er Lesson”, adding “search” (i.e. the ability to think for longer) in specific domains could substantially boost performance without further scaling up training compute.

Instead of spending all of our compute on training 1.8T parameter models of the entire internet, what if specific entities – say a drug discovery company – could shift some of the general training spend towards inference spend in their targeted domain? What if the relevant token set is only a few hundred million?

We wouldn’t need to wait for the step-change in foundational model capabilities to see a step change in performance in specific domains today.
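A minimal illustration of the “think for longer” idea – best-of-N sampling against a domain verifier; the `model` and `verifier` callables are stand-ins, and real systems use far more sophisticated search:

```python
import random

def best_of_n(model, verifier, prompt, n=32):
    """Trade inference compute for quality: sample N candidates and keep the one the
    verifier scores highest. N is the knob that converts extra compute into better
    answers in a narrow domain - no additional training required."""
    candidates = [model(prompt) for _ in range(n)]
    return max(candidates, key=verifier)

# Stand-ins: a real setup pairs an LLM with a domain verifier (e.g. a test runner for
# code, or a binding-affinity predictor for drug discovery)
answer = best_of_n(
    model=lambda p: f"candidate #{random.randint(0, 999)}",
    verifier=lambda c: random.random(),
    prompt="Propose a molecule that inhibits target X",
)
print(answer)
```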

Other “Unhobblings”

Aside from simulation, GANs, RLHF, and search / CoT capabilities, perhaps the biggest area of low-hanging fruit is the scaffolding and tooling around LLMs which provide additional context.

Source: Situational Awareness, Leopold Aschenbrenner

This category, in particular, is where open source / deAI solutions are likely to be able to play a role, enhancing underlying models with better tooling, app specific data, and context to win in specific domains.

Intelligence is a Barbell

My latest thinking is the future of intelligence is a barbell: a few massive foundational models and an ocean of smaller, cost-effective app / enterprise-specific models built on top of open source.

Menlo Ventures has an excellent write-up on how the AI stack is evolving within the enterprise, shifting from closed-source API calls, to closed source + RAG, and on to outright customized open source:

This marks a movement towards more tailored experiences leveraging enterprise and customer specific data. To justify their existence, many companies will need to become more than wrappers around closed source models. The bet is open source + incremental context from enterprise / customer specific data can outcompete foundational models on either performance or cost in the narrow domains in which they operate.

Many enterprises are already enhancing closed-source intelligence with RAG at the prompt-level, but the next obvious step is to push towards customized fine-tuning. From operating systems to database markets, open source software and its many variations are regularly leveraged within the enterprise. The model layer should prove similar with many enterprises already using a stable of models today: serving queries based on a combination of performance and cost.
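For the unfamiliar, a stripped-down sketch of the prompt-level RAG pattern described above (the `embed` and `llm` callables are stand-ins; production deployments use a vector database and a hosted or self-hosted model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rag_answer(question, documents, embed, llm, k=3):
    """Prompt-level RAG: retrieve the k most relevant enterprise documents and stuff
    them into the prompt, so a general model answers with private context it was
    never trained on. Fine-tuning is the next step once this stops being enough."""
    q_vec = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(embed(d), q_vec), reverse=True)
    context = "\n".join(ranked[:k])
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```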

Given this trend, it’s reasonable to expect smaller, fine-tuned, task-specific models to grow in utilization, replacing frontier models where they prove unwieldy or expensive. As Sam at Slow Ventures points out:

This seems like a win for team open source / deAI / model fragmentation. It also has ramifications for value capture in crypto.

Agentic Protocols: Unbundling the Enterprise

While cliché, the “data economy” is set to inflect as automated intelligences begin diffusing through the economy.

In a barbell world, there will be a role for large data pools and smaller, app-specific data assets.

Crypto is well-suited to incentivize diffuse networks of data collection – both on and offline. Monetization of these data assets would largely take two forms:

  1. Data marketplaces: sales to enterprises, foundational models (open or closed), or larger DataDAO “aggregators”
  2. Upstream expansion into fine-tuned models which propel “agentic protocols”

Agentic protocols are essentially on-chain enterprises with extremely low-overhead which will be able to outcompete web2-era enterprises in many domains. Outside of money and finance, I believe they will come to be one of the most valuable categories in crypto.

Today, however, they suffer from a material disadvantage.

Enterprises have operational history. Customer profiles, user behavior, and large data lakes filled over decades provide both raw material and important context for fine-tuned intelligences in specific applications.

Agentic protocols will need to overcome this.

Financial incentives may be the one thing powerful enough to change established user behaviors. By offering a financial reward for different data streams, dataDAOs may just be the unlock needed to unbundle many enterprises into agentic protocols.

DataDAOs: Fueling an Agentic Future?

DataDAOs are emerging in a host of categories. Below is a mere sampling of the hundreds of projects beginning to sprout.

Mapping:

Surveillance is particularly in vogue with a host of projects tackling different strata to compete with Big Tech’s mapping oligopoly.

Mapping and geospatial data provides services beyond navigation. These data sets can help direct transport routes, manage city zoning, assuage congestion, monitor foot traffic, provide dynamically priced parking, etc.

Analysts forecast this market at >US$250b by 2028.

Hivemapper:

Hivemapper is attacking this opportunity from the street view. Billions of users leverage maps daily. Millions of businesses pay for mapping APIs. 1.5 billion vehicles are steadily adding more and more self-driving capabilities. These services are crucial for industries from transportation and logistics to real estate to utilities and federal and municipal governments.

Hivemapper is crowd-sourcing a network of driver dashboard cameras which map terrain as drivers go about their routes.

So far, Hivemapper estimates it has covered 24% of the globe, with 270m km mapped in total (14m unique km) – primarily in North America, Western Europe, Japan, Australia / New Zealand, and Southeast Asia.

Natix:

Natix is another DePIN entrant targeting geospatial mapping but through a different lens: tapping into the ~44 billion cameras across mobile phones, drones, cars, etc worldwide.

The one billion CCTV cameras deployed globally cost more than US$6 trillion to install, operate, and maintain. Natix is hoping to offer its proprietary, privacy-preserving software to those 44 billion devices to create a collectively owned “internet of cameras”, giving businesses, communities, and local governments access to real-time data on cars, bikes, foot traffic, potholes, etc.

Spexi

Spexi is taking the data fight to the air, aiming to leverage a swarm of drones to deliver aerial data at much higher quality and lower cost than satellites. Fly-to-earn!

The best commercially available satellite imagery is collected at resolutions of just 30cm per pixel; drones are ~10x better and cost dramatically less, at <US$500.

Enhanced aerial data could be used to prep for disasters, enable smart cities, monitor infrastructure / natural resources, all at a fraction of today’s cost.

Climate / Agriculture:

More than a third of the global economy is weather-sensitive. According to the World Meteorological Organization (yes, a real thing), “highly weather-sensitive sectors such as agriculture, energy, transport and construction, and disaster risk management can benefit by over US$160b per year from potential improvements in weather forecasting capabilities”.

WeatherXM:

WeatherXM aims to form a dense mesh of weather stations, focused particularly on areas with poor weather infrastructure – open-sourcing its hardware specs to a range of manufacturers.

The data could be used to better manage energy production at power plants, schedule flights, optimize shipping routes, assist farmers, and better predict fallout from climate risk.

PlanetWatch:

Air pollution causes over 7 million premature deaths per year (WHO) and costs US$225b in lost work hours, not to mention a potential US$5.1T in inflated healthcare / climate costs over the long-term.

PlanetWatch aims to deploy a dense network of indoor and outdoor sensors to deliver real-time, hyper-local data quickly and cost-effectively to enable smart cities and municipal governments to combat the ills of elevated air pollution.

Dimitra:

Dimitra is another example: not really a “dataDAO” but more of a SaaS platform with several components integrated on-chain, which uses a combination of satellite, genetic, weather, and other local sensor data to help small-plot farmers make better decisions.

Healthcare

Personal Health dataDAOs are still very early and will be slowed by data regulations in many jurisdictions but the industry is massive (~20% of US GDP) and notoriously fragmented, paper-based, and wasteful.

The future of healthcare is personalized care: based on genetic, behavioral, microbiome, and other sensor data which will be leveraged by more “hands-on” consumers to assist their healthcare journey.

From Big Pharma, to insurers, to providers, to wellness companies, personal healthcare data is likely to be the most sensitive and the most valuable.

So far, no crypto projects have really cracked this segment, but a few are tangential:

  • BrainStem: aims to monetize wearables data
  • StepN: lifestyle “move-to-earn” fitness app which, if successful, could expand into other adjacencies like fitness data monetization (assuming user permissions)
  • DeSci: while not really “dataDAOs” in the traditional sense, a host of DeSci projects like vitaDAO, and moleculeDAO are springing up which are in the business of monetizing research IP

Finance:

The intersection of finance and data is obvious. Hedge Funds and other institutional money managers pay handsomely for any datasets which may provide an edge. Financial markets are built on information asymmetry.

Delphia

Delphia is a great example. The thesis is simple: users will willingly contribute their commerce, device, search and social media data in exchange for tokens. Out of the gate, data can be sold to third-parties, but at scale the more profitable path would be to use the data advantage to craft proprietary models.

The platforms you use today are already selling your attention – your search, your commerce history, your order flow… You might as well try to get a cut while you’re at it…

Numerai

Numerai was one of the earliest experiments at the intersection of hedge funds, crypto, and AI. The platform hosts weekly tournaments for data scientists to submit models with market predictions; leading models receive token rewards in pursuit of outperformance.

Crowd-sourced, automated market intelligence has barely scratched the surface. The future of finance is not individual traders but individuals with fractional ownership (tokens) in models which trade on their behalf. This is effectively money management today (financial advisors, hedge funds, private equity firms), just with fewer people (and reduced management fees). The future of finance is co-owned automated intelligences going PvP, differentiated by their data inputs and model architectures.

Others:

DIMO:

DIMO is striving to make every car smart and programmable through an open, connected network containing vehicle identity, purchase history, credentials, loans, titles, monitoring software, etc.

The end vision is ambitious: an AI mechanic in your pocket, autonomous parking, an on-chain DMV, app alerts when your kid is speeding after curfew, a digital marketplace for car purchases / financing with full health check, all the way through to real-time communication between self-driving cars while on the road.

Grass:

Grass has already onboarded >2m users contributing spare bandwidth towards scraping the public internet. Just this month, the team announced that >600m scraped Reddit posts are now openly available for AI training:

Worth roughly US$60m per year, judging by what Alphabet reportedly pays Reddit for similar access.

DePIN: The Future of Labor?

In a world of AGI – potentially coming as early as 2027 – the value of labor will dramatically diminish relative to GPU cycles.

Perhaps the future of humanity is simply DePIN: installing sensors and reporting online & offline activity to a range of dataDAOs who monetize it on the user’s behalf. Google, TikTok, and Amazon essentially do this today, but the upside goes to their shareholders as opposed to the users.

Crypto’s end state is effectively a rebundling of today’s (very effective) corporate governance and capital markets and legal systems around novel always-on, composable, transparent, and agent-infused infrastructure with more effective attribution and (hopefully) more widely-distributed and aligned ownership.

If I’m going to get mined, I might as well load my iPhone (or future AR glasses) up with sensors to the hilt, so I can at least get paid. I suspect these solutions will be even more popular in developing areas where the potential gain is a greater relative share of income.

It’s important to note that DataDAOs are still very early and far from a panacea. They may not even work at all. As Li Jin notes:

  • Incentives corrupt user behaviors
  • Allocating rewards based on contributions is challenging
  • Verification of data integrity is essential and non-trivial
  • Market sizes may not incentivize participation at broad enough scale (For example, Facebook average ARPU is only ~$13)

And still, compute-driven intelligence is poised to eat away at the second pillar of the capitalist engine. What else will those not blessed with capital have left to contribute?

The future of on-chain enterprises is not DAOs but agentic protocols. And agentic protocols will require data sets to fuel the intelligence which guides their profit motive.

And yet these data sets don’t just magically morph into model ready inputs… they need to be refined.

Data Ecosystem Enablers

The toolkit enterprises use to refine raw data into model-ready inputs is robust and expanding.

Source: BessemerVP

Servicing the piping from the dataDAO funnel to model intelligence in deAI will require a similar proliferation of tooling: from cleaning and labeling on the front-end to storage, warehousing, indexing and retrieval through analytics and marketplaces.

Source: https://www.topology.vc/deai-map by Casey

Data Labeling

The first stop in the pipeline – data cleaning and labeling – is small today, but poised to grow at a ~25% CAGR to >US$5b by 2030.

The landscape is already quite competitive, with web2 companies benefitting from an early lead and a successful network of clients, but web3 upstarts like Liqhtworks and Sapien are hoping to use crypto incentives to tap different pools of expertise as the non-automated data sets become increasingly complex.

Source: @notdegenamy, ocular.vc

Fraction.AI is another player, championing itself as “the data layer L2 for AI – agents and humans working together to create the highest quality data, used to train specialized AI models”.

Like Wikipedia replacing the Encyclopedia Britannica, a crowd sourced solution should be able to tap into the diverse skills and scale necessary to outcompete siloed companies on curation and maintenance of large data sets… especially with the right incentives.

But after they have been cleaned, labelled, and rated, these data sets would need to find their highest and best use…

Decentralized DataHubs / Marketplaces

With data gaining in importance as fuel for the new economy, data marketplaces have emerged as dynamic hubs where users can buy, sell, exchange, or license data and data streams.

To date, these marketplaces and hubs have been dominated by a mix of web2 companies and open source: Dawex, Explorium, Narrative, Eagle Alpha, Snowflake, Nomad Data, Datarade, Drop, Google DataSet Search, Kaggle, AWS Data Exchange, Data.world, Hugging Face etc.

These marketplaces will likely continue to dominate exchange between enterprises, but upstarts are re-envisioning what data exchange might entail in a web3 world:

Interestingly, almost all the leaders like Masa, Ocean (now Fetch), and Sahara AI, whether organically or inorganically, see value in combining different layers within the data pipeline: from attribution and cleaning through exchange all the way to model building and agents. (More next month…)

However, all the data assets in the world won’t help if they are not well-organized and readily available to inform decisions. Providing dApps and smart contracts with the right data and relevant context at the right time is essential to unlocking real value….

Data “Lakehouses”: Providing Essential Context

In an AI-enabled world, superior data aggregation and infrastructure can provide an essential competitive advantage. Web2 giants know this: perfecting the art of user interactions, preferences, and behavioral insights to increase engagement, drive conversion, and build category leaders.

Essential to this process are the databases, warehouses, datalakes, lakehouses (Ya, I’m losing track too) which optimize data ingestion and processing for data-intensive applications. Web3 needs its own Snowflake, Databricks, and Pinecone equivalents to deliver similar levels of personalization and context-enabled, actually-“smart” contracts.

Two projects leading the charge in this space are Hyperline and Space and Time.

Hyperline aims to build the “unified data layer for Web3 applications” enabling simplified data management through a centralized repository of structured and unstructured data. Applications can tap directly into a comprehensive pool of blockchain and non-blockchain data and focus their efforts on product innovation.

Essentially, a “shared” data layer could be amortized across a host of projects instead of everyone building out their own infrastructure.

Space and Time has a similar vision: a “community-owned” data warehouse enabling smart contracts to query data using a “query coprocessor” which sits next to major chains and supplements the limited compute (data processing) capacity of smart contracts on-chain. However, Space and Time adds greater emphasis on decentralization vs. convenience, leveraging zk-proofs for verifiability of inputs. The decentralized data warehouse would be combined with a “Proof of SQL” protocol to allow smart contracts to query onchain and offchain data and verify the result in a trustless manner: to essentially “ask ZK-proven questions” enabling smart contracts to do things like:

  • Liquidity Pools TVL: “Show me all liquidity pools with a TVL greater than $1m that were deployed at least one month ago”
  • Avg Lending Rates Onchain: “Show me the volume-weighted average lending rate for USDC on Aave, Maker and Compound right now”
  • Airdrop Criteria: “Roll up the wallet transaction histories of all wallets that meet the following criteria for my airdrop…”
  • Gaming Achievements: “Show me all gamer wallets that have at least 2 hours of playtime in-game, have minted our NFT, and played with ‘xyz’ weapon”

etc.

By making indexed blockchain and other verified off-chain data easily accessible, Space and Time can provide the context for smart contracts and apps to truly level-up.

Shared data lakehouses and “query co-processors” are both exciting solutions to enable applications and smart contracts with the relevant context to make more nuanced, personalized actions on-chain. However, both are born out of the fundamental limitations of blockchain throughput – needing to find a way to inform on-chain interactions with additional on or off-chain data.

Today’s consensus view is this: blockchains will never have the throughput to enable data-intensive applications – like AI – on-chain. Based on today’s MB/s bandwidth speeds, they are correct.

However, this may be changing.

Exponential Storage: GB/s, not MB/s

As opposed to informing on-chain transactions with verified “off-chain” data, some projects in the storage space – like 0g – aim to push the limits of what can be done on-chain: pushing out the bandwidth frontier “from MB/s to GB/s”.

Because of scalability constraints, most models today are hosted on AWS or other centralized providers. As Michael, CEO of 0g has said: “if you want to serve a million inference traces per minute and each one is 100 kb or so, then you would need a DA capacity of ~1.6GB per second”. Today, no on-chain solutions are close to this throughput.

Source: Near

Median bandwidth speeds on IPFS appear to fall in a similar range.

In distributed training contexts, inter-node bandwidth can range from as low as 200 Mb/s up to 100 Gb/s, while node-to-node communication in co-located clusters with InfiniBand can reach up to 800 Gb/s.

At 50, 100, or 200 GB per second (and assuming continued advances in the async / batch training techniques discussed earlier), on-chain training – while still meaningfully behind co-located, InfiniBand-laced speeds – would be a reality, especially as we move towards a network of smaller, more specialized models.

0g is hoping to bring this throughput to both data storage and data availability, significantly upgrading performance and under-cutting costs of existing DA solutions like Danksharding and Celestia, but also (more interesting in my view) providing a highly-performant storage layer for scaled applications like gaming, high-throughput DeFi, data marketplaces, inference – and even, one day, decentralized training.

Using a novel architecture leveraging erasure coding, 0g hopes to deliver 10 MB/s per node in a horizontally scalable network (i.e. 5,000 nodes ≈ 50 GB/s of aggregate throughput). The network will also be customizable, with clients able to dictate replication, geographical constraints, persistence, and different data types.
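The bandwidth arithmetic behind both the inference requirement quoted above and the horizontal-scaling target is straightforward:

```python
# DA capacity needed for high-volume inference (per the 0g quote above)
traces_per_min = 1_000_000
trace_size_kb = 100
required_gb_s = traces_per_min * trace_size_kb / 60 / 1e6   # KB/min -> GB/s
print(f"Required throughput: ~{required_gb_s:.1f} GB/s")     # ~1.7 GB/s

# Horizontal scaling target: per-node throughput x node count
node_mb_s = 10                                               # 0g's stated per-node target
for nodes in (1_000, 5_000, 20_000):
    print(f"{nodes:>6} nodes -> {node_mb_s * nodes / 1_000:.0f} GB/s aggregate")
```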

The challenges are known:

  • “Is horizontal scaling possible without compromising decentralization?”
  • “Can 0g find PMF on-chain where Arweave and Filecoin have been slow to ramp?”
  • “Isn’t this pretty similar to EigenDA on the data availability side?”
  • “Isn’t it easier to just train / host in a centralized setting and then “verify” the inputs when they interact on chain?”

These are fair questions which 0g will need to answer with execution.

However, on-chain attribution may prove an essential unlock in catalyzing individuals to participate in DePIN networks. In a world where labor is rapidly replaced by GPUs, I suspect the “human sensor” data income streams will be increasingly relevant.

The end vision 0g posits is compelling: more seamless “interoperability” and attribution of underlying data and models. Instead of the data-siloes which exist today, a fully “on-chain data economy” would know the exact training data used, how it was cleaned, the contributors, the model weights, any fine-tuning post training, and real-time monitoring. Given the importance in decisions many of these systems will make, having real-time auditability of all of the components may prove to be blockchain’s most important role in safely integrating AI into societal workflows.

I like 0g because they are taking a big swing. The most-likely path forward is continued fragmentation and efforts to integrate AI x Crypto further up the stack. 0g is hoping to remove these constraints at the base layer. A vision that blockchains can be used for scaled applications; that a thriving deAI ecosystem requires scale at the base layer. Everything on chain.

If they are right, it could upend the current paradigm across the stack… including retrieval.

Retrieval

Theoretically, decentralized networks should be well-suited to compete at the edge. One of the “mega-trends” in data centers and content delivery is pushing context and code closer to the user to reduce latency and save on bandwidth. Data held and replicated in a network of widely dispersed nodes should be well-suited to compete with CDNs.

Logically, this makes sense, and we have seen a range of players – from Fleek to Filecoin Saturn to Theta Edge to Aethir Edge (just to name a few) – keen to push into this space.

The ~US$30b gorilla they will run up against is, of course, Cloudflare which provides an extremely compelling solution for most SMBs and developers: with 2024E revenues of ~US$1.7b, growing ~30% YoY and ~20% of internet traffic flowing through its network. Akamai is also an established competitor on the enterprise side.

The reality is CDNs are largely a commoditized market with much of Cloudflare’s rise attributable to its security offerings: with customers attracted to its Anycast Network design (natural protection against DDoS attacks) and willing to pay up for its CDN + WAAP (Web and Application Security) offering. NET has continuously layered on additional security features and seems poised to continue taking share in the large CDN + internet security market (expected to be ~US$40b by 2030, WallStreet Estimates).

The potential shakeup in this status quo is generative AI. Personalized web experiences, augmented / virtual reality, real-time cloud gaming, and a proliferation of bots – both helpful and malicious – all point to an evolution of the internet fueled by more specialized hardware. Just as the hyperscalers have needed to revamp their cloud offerings (remember the US$166b spend on AI server market…), perhaps a similar revamp is overdue at the edge. With latency an important consideration for inference and rendering jobs (a sort of “CDN for inference”), there may be white space for decentralized offerings to carve out share during the transition.

Competing against behemoths like Amazon Lambda Edge and Cloudflare is no cake walk, but the logic of distributed systems in inference, rendering, and content delivery makes the opportunity worth tracking.

Digital Liberalism

Digital liberalism is less about “owning one’s own data” and more about proper incentivization and attribution of contributions. Data and compute are poised to swallow much of labor’s share in the decades to come. Just as capitalist incentives helped unleash the industrial revolution, catalyzing participants – no matter how small – to contribute their capital and labor towards solutions which propelled quality of life globally, crypto can incentivize broader participation into the production inputs set to dominate the AI revolution: data and compute.

Markets in capital and labor were a crucial driver in increasing GDP per Capita by >30x in many areas over the last 200 years.

And yet, increasingly these gains are accruing largely to the former…

A trend set to accelerate dramatically as AI reaches human level competence…

Source: Stanford AI Index

So far, one of crypto’s primary appeals has been its role as “outside money”: a tool to push hard-won property beyond the reaches of increasingly profligate and debasement-prone states, flailing to mollify the lopsided gains as labor erodes.

In seizing the means of production, deAI offers a very different vision: a vision of individuals not merely as economic liabilities in a post-cognition society, but as essential inputs – competing to generate the best new data sets and most efficient compute networks possible to feed the intelligences set to swallow global economic activity.

“The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man”

-G.B. Shaw

Fortunately, we have no shortage of them in deAI.

*Source: https://www.topology.vc/deai-map*


Thanks to @Shaughnessy119 and @ceterispar1bus for their review.

*@PonderingDurian signing off*