Deep dive · Infrastructure

The unspoken bottleneck reshaping artificial intelligence

The narrative around AI has long revolved around compute power. Stories of massive GPU clusters, breakthrough chip designs, and soaring demand for high-bandwidth memory dominated headlines through 2025. As we enter 2026, a quieter shift is underway. The limiting factor is no longer just how quickly we can process data, but how effectively we can store and retrieve it at scale.

By Dylan Bristot · 22 min read

The thesis

This transition from compute-centric to data-centric infrastructure will define the next phase of AI evolution, influencing everything from model capabilities to investment returns. The companies solving retrieval at scale, not just processing speed, will capture the lasting value.

  • 15 GW: maximum AI data center capacity the DRAM supply can support
  • 577%: Sandisk's returns in 2025
  • 2x: NAND prices since February 2025
  • 2027: SK Hynix's prediction for how long the shortage lasts

The coming data deluge

Data generation has always grown exponentially, but artificial intelligence is accelerating it into uncharted territory. Global estimates place the datasphere at approximately 180 zettabytes in 2025, with projections toward 540 zettabytes by 2029 as generative workloads compound the cycle. Those numbers are abstract until you consider the mechanics driving them.

A single insight from Seagate's 2025 investor conference crystallizes the shift: a typical one-minute video consumes roughly 100 times more storage than a high-definition image. As models move beyond text and static images into native video understanding and generation, this multiplier becomes profound.
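To make that multiplier concrete, here is a rough back-of-the-envelope comparison. The image size and video bitrate below are assumptions chosen as typical values, not figures from Seagate:

```python
# Rough check of the ~100x multiplier, using assumed "typical" sizes.
hd_image_mb = 0.5                                   # assumed: a compressed high-definition JPEG
video_bitrate_mbps = 6                              # assumed: 1080p H.264 at a common bitrate
one_minute_video_mb = video_bitrate_mbps * 60 / 8   # megabits/s x seconds / 8 = megabytes

print(f"one-minute video: ~{one_minute_video_mb:.0f} MB")                          # ~45 MB
print(f"multiplier vs. one HD image: ~{one_minute_video_mb / hd_image_mb:.0f}x")   # ~90x
```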

Short clips of five to ten seconds already strain systems. Extend that to fifteen minutes, thirty, or a full hour of coherent long-form video, and both compute and storage demands curve sharply upward. Generative models capable of producing hour-long content remain on the horizon, yet their arrival will demand petabytes of accessible training data and rapid retrieval for inference.

The surveillance preview

Surveillance systems offer an early glimpse of this future. Right here in New York City, roughly 20,000 cameras feed into platforms that generate approximately four petabytes of data every single day. That's the equivalent of filling about 220 high-capacity hard drives daily. The challenge is no longer how many cameras are deployed; it's how you process and store what they see.
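The drive math is simple once you assume a per-drive capacity; 18 TB below is an illustrative figure for a common enterprise hard drive class, not a number from the platforms themselves:

```python
# Assumed drive size: 18 TB, a common enterprise HDD capacity class.
daily_data_tb = 4_000          # four petabytes per day, expressed in terabytes
drive_capacity_tb = 18
print(round(daily_data_tb / drive_capacity_tb), "drives filled per day")   # ~222
```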

And agentic AI compounds the growth: more sensors, more streams, more replication across edge locations for real-time decisions. Richer content, higher resolution, and relentless duplication drive the cycle. Data is no longer static; it is transformed, augmented, and redistributed almost instantly.

Let's walk through one historical comparison. It took roughly 150 years of photography for humanity to produce its first 15 billion images; AI image generation matched that number in just the last year and a half, at steadily climbing resolution. And the frontier is no longer text and still images but video, where, as noted earlier, a single minute of footage carries roughly 100 times the data of one high-definition image.

Regulations around data sovereignty further complicate matters, requiring local storage and reducing global pooling efficiencies. The result is not linear growth but a compounding explosion few infrastructures are prepared to handle.

Agentic systems and multimodal horizons

Agentic artificial intelligence marks the leap from responsive tools to proactive partners. These systems plan, execute, and iterate across multi-step tasks, often coordinating swarms of specialized agents. In 2026, they move from prototypes to production, managing logistics, codebases, and scientific workflows end-to-end.

Multimodal integration fuels this advance. Models now reason natively across text, images, audio, and video, with benchmarks showing frontier performance in video understanding nearing human levels. This convergence demands vast, diverse datasets. Training on video alone multiplies storage needs by orders of magnitude compared to text corpora. Inference adds another layer: long contexts preserve state across extended interactions, ballooning memory requirements.

When your agent maintains context over days or weeks of interactions, when it needs instant access to generated artifacts, retrieved documents, and historical context, the storage requirements compound in ways that traditional infrastructure simply was not designed to handle.

The KV cache challenge in large language models

At the heart of modern transformer-based models lies the key-value cache, a mechanism that stores prior computations during sequence generation to avoid redundant work. This cache scales linearly with context length and model size, becoming the dominant memory consumer in inference.

For flagship models, the cache often exceeds what fits on a single node equipped with high-bandwidth memory. The math is straightforward but sobering: memory usage is approximately 2 × layers × heads × head_dimension × context_length × 2 bytes at 16-bit floating-point precision. For frontier models with hundred-thousand- or million-token contexts, this dominates the total footprint.
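Plugging representative numbers into that formula shows why the cache dominates. The layer, head, and dimension counts below are illustrative assumptions, not any specific model's published configuration:

```python
# KV cache size = 2 (K and V) x layers x heads x head_dim x context_length x bytes_per_value
layers, heads, head_dim = 80, 64, 128      # assumed frontier-scale configuration
context_length = 128_000                   # tokens
bytes_per_value = 2                        # FP16

kv_bytes = 2 * layers * heads * head_dim * context_length * bytes_per_value
print(f"KV cache per sequence: {kv_bytes / 1e9:.0f} GB")   # ~336 GB at a 128k-token context
# Even a handful of such sequences served concurrently exceeds an 80-141 GB HBM
# budget. Grouped-query attention and quantization shrink the constant factors,
# but the linear scaling with context length is the point.
```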

The memory hierarchy problem

Traditional solutions scaled horizontally across servers linked by high-speed networking, prioritizing training bandwidth over inference latency. Yet real-world deployment favors low tail latency: users expect instant responses, not networked delays.

Adding more HBM

Raises costs and reduces yields without proportional bandwidth gains. NVIDIA's H100 carries 80GB of HBM3 and the H200 pushes that to 141GB, but capacity scaling hits physical limits.

Networking across nodes

Optimized for bandwidth but introduces latency hops fatal to real-time inference tail percentiles. Training works; serving does not.

The counterintuitive winner

Flash storage directly attached to nodes. Access times beat cross-server networking in many regimes, and clever DMA overlapping hides remaining delays.

As contexts lengthen and mixture-of-experts architectures grow, flash becomes essential for cost-effective, low-latency serving. Video models exacerbate this: extended contexts plus rapid access to generated or retrieved clips tilt the balance further toward dense, fast storage.

NVIDIA standardized this approach in early 2026, enabling clusters to spill cold portions of the cache to NVMe while keeping hot prefixes resident. Vast Data and others integrate it with high-speed networking for agentic-era workloads. Flash does not replace high-bandwidth memory for active computation. It extends viable context lengths and model scales cost-effectively, turning a hard wall into a manageable tier.

The supply chain that cannot scale

The most striking analysis comes from Macquarie's recent memory report. Their calculation is sobering: over the next two years, the DRAM industry can only support 15 gigawatts of AI data centers.

⚠️ Macquarie's calculation

  • A 1GW-scale AI data center configured with 400,000 GPUs (each with a TDP of 1,700W) requires more than 18,000 DRAM wafers per month, including both HBM and main memory.
  • New DRAM supply is limited to roughly 250,000 wafers per month globally.
  • Maximum AI data center capacity that can be supported without cannibalizing the existing DRAM market: 15GW.
  • The projected 40% CAGR for AI chips and plans to build 40 to 50GW of AI data centers over the next three years are exposed to significant risk.
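Dividing the two supply figures gives a rough sense of where the ceiling comes from; Macquarie's full assumptions are not public, so treat this only as a ballpark check:

```python
# Simplified division of Macquarie's two headline figures.
wafers_per_gw = 18_000     # DRAM wafers/month (HBM + main memory) per 1 GW of AI data center
new_supply = 250_000       # new DRAM wafers/month available globally
print(f"~{new_supply / wafers_per_gw:.0f} GW if every new wafer went to AI")   # ~14 GW
# In the same ballpark as the ~15 GW ceiling, and far short of the 40-50 GW
# of announced buildouts.
```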

Memory constraints are likely to cause delays and timeline adjustments in AI data center projects, which could further exacerbate the supply shortage. The industry announced ambitious buildout plans. The semiconductor supply chain cannot physically deliver the memory to fill them.

OpenAI's Stargate project illustrates the scale of the problem. Initial deals with Samsung and SK Hynix would require up to 900,000 wafers per month by 2029. That's approximately double current global monthly HBM production. As SK Group chairman Chey Tae-won noted at an industry forum: "These days, we're receiving requests for memory supplies from so many companies that we're worried about how we'll be able to handle all of them."

Investment shifts beneath the surface

The market caught this shift abruptly in 2025. Sandisk, spun out as a pure-play flash provider, delivered returns exceeding 577 percent in its first year, briefly leading major indices. The stock continued gaining into 2026.

Contract prices for NAND flash doubled from mid-2025 lows. Suppliers sold out multi-year allocations. For the first time in decades, simultaneous shortages hit DRAM, NAND, and hard drives. This is no transient crunch.

Metric | Early 2025 | January 2026 | Change
DRAM supplier inventory (weeks) | 13-17 | 2-4 | -77%
NAND contract prices | Baseline | 2x | +100%
Enterprise SSD demand vs. supply | Balanced | 5-7% deficit | Shortage
Sandisk stock performance | IPO | +577% | Leading index

Hyperscalers cannot compress raw training corpora indefinitely. Quantization and efficiency gains help at the margins, but foundational data volume grows unchecked. Storage pure plays, freed from commoditized legacies, capture pricing power as demand outstrips supply growth by wide margins.

SK Hynix has told analysts that the memory shortfall will last through late 2027. Both SK Hynix and Samsung have announced that their chips are sold out for 2026, while new factories for conventional chips won't come online until 2027 or 2028.

The scramble is real

The Reuters investigation paints a vivid picture of what's happening behind the scenes. In October 2025, Google, Amazon, Microsoft, and Meta asked Micron for open-ended orders, telling the company they would take as much as it could deliver, irrespective of price.

China's Alibaba, ByteDance, and Tencent dispatched executives to visit Samsung and SK Hynix in October and November to lobby for allocation. As one source told Reuters: "Everyone is begging for supply."

The ripple effects

The squeeze spans almost every type of memory, from flash chips used in USB drives and smartphones to advanced high-bandwidth memory that feeds AI chips in data centers. The fallout reaches beyond tech.

  • Japanese electronics stores have begun limiting how many hard-disk drives shoppers can buy.
  • Chinese smartphone makers Xiaomi and Realme warn they may have to raise handset prices by 20-30%.
  • Tokyo's Akihabara district sees 32GB DDR5 memory jump from 17,000 yen to 47,000 yen in weeks.
  • Secondhand markets boom as buyers flee new product pricing.
  • Hong Kong intermediaries buy up recycled chips from decommissioned data centers for resale to Chinese clients.

The memory shortage has graduated from a component-level concern to a macroeconomic risk. As Sanchit Vir Gogia, CEO of Greyhound Research, put it: "The AI buildout is colliding with a supply chain that cannot meet its physical requirements."

The HBM supply chain

Understanding where the bottleneck sits requires mapping the supply chain. Three companies dominate HBM production: SK Hynix, Samsung Electronics, and Micron. These suppliers serve a concentrated customer base that splits into two categories.

[Figure: HBM supplier/customer relationships. SK Hynix, Samsung, and Micron supply both GPU makers (NVIDIA, AMD) and the custom ASIC designers (Broadcom, Marvell, Alchip, GUC) serving hyperscalers. Source: J.P. Morgan estimates, company data]

GPU makers NVIDIA and AMD source HBM directly for their accelerators. The more interesting development is the rise of custom ASIC designers serving hyperscalers directly. Broadcom designs Google's TPU. Marvell works on Amazon's Trainium and Inferentia chips. Alchip serves Microsoft's MAIA accelerator. GUC (Global Unichip Corporation) handles Meta's MTIA.

Each of these custom silicon programs requires guaranteed HBM allocation. As the hyperscalers diversify away from NVIDIA dependency, they're not reducing demand on the memory supply chain. They're fragmenting it across more design partners while total volume continues climbing.

GPU customers

  • NVIDIA (H100, H200, B100 series)
  • AMD (MI300X, MI400 series)

Direct HBM integration, highest bandwidth requirements

US CSP custom silicon

  • Google TPU (via Broadcom)
  • Amazon Trainium/Inferentia (via Marvell)
  • Microsoft MAIA (via Alchip)
  • Meta MTIA (via GUC)

Growing share of HBM demand, diversified design partnerships

China adds another dimension. Chinese ASIC designers serve domestic cloud providers building AI infrastructure despite export restrictions. The supply chain becomes a geopolitical pressure point alongside a technical bottleneck.

Flash memory market projections

The flash memory market tells a parallel story with one critical insight: inference workloads are driving the growth curve. TechInsights and Samsung Securities project the overall flash market at approximately 970 exabytes in 2025, growing at a 20% compound annual rate to reach 2,044 exabytes by 2029.

[Figure: Flash memory market growth, 2025-2029. The total market grows from 970 EB to 2,044 EB (20% CAGR), with inference compounding at 69% CAGR while non-AI storage grows at roughly 6%. Source: TechInsights, Samsung Securities estimate, Kioxia]

But the composition matters more than the headline number. Non-AI storage grows at roughly 6% annually. Training workloads contribute 26% growth. Inference, the deployment side of AI that touches every user interaction, compounds at 69% annually.
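A quick consistency check on the headline figures (970 EB in 2025, 2,044 EB in 2029):

```python
# 970 EB in 2025 growing to 2,044 EB in 2029 implies the quoted ~20% CAGR.
start_eb, end_eb, years = 970, 2_044, 4
cagr = (end_eb / start_eb) ** (1 / years) - 1
print(f"implied CAGR: {cagr:.1%}")   # ~20.5%
```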

Flash memory market breakdown (EB)

2025E: 970
2026E: 1,120
2027E: 1,350
2028E: 1,660
2029E: 2,044

Segments: Non-AI (+6% CAGR) · Training (+26% CAGR) · Inference (+69% CAGR)

Source: TechInsights, Samsung Securities Estimate, Kioxia

This 69% inference growth rate is the number that should anchor investment theses. Every deployed model, every API call, every agentic workflow generates inference demand. As these systems move from research previews to production deployments serving millions of users, the storage requirements follow exponentially.

Technical deep dive: why flash wins for inference

The case for flash in AI inference rests on a counterintuitive insight: for many access patterns, directly attached NVMe storage beats network-distributed HBM.

Consider the KV cache for a model serving long-context requests. The cache has two distinct access patterns. Hot entries, recently generated or frequently referenced, need nanosecond access. Cold entries, earlier context that might be referenced but usually isn't, can tolerate microsecond latency.

Keeping everything in HBM guarantees speed but wastes expensive, scarce memory on rarely-accessed data. Distributing across networked nodes maintains capacity but introduces hop latency that spikes tail percentiles. Direct NVMe access offers a middle path: microsecond reads for cold data without network overhead.
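A minimal sketch of that middle path, assuming a block-granular KV cache in which recently used blocks stay in fast memory and the least recently used ones spill to a local NVMe-backed file. The class and its thresholds are illustrative, not any vendor's API:

```python
import collections

class TieredKVCache:
    """Illustrative sketch only: hot KV blocks live in fast memory, the
    least-recently-used blocks spill to a local NVMe-backed file."""

    def __init__(self, hot_capacity_blocks: int, spill_path: str = "/tmp/kv_spill.bin"):
        self.hot = collections.OrderedDict()    # block_id -> bytes (stand-in for HBM)
        self.cold = {}                          # block_id -> (offset, length) in spill file
        self.hot_capacity = hot_capacity_blocks
        self.spill = open(spill_path, "wb+")    # stand-in for a directly attached NVMe drive

    def put(self, block_id, data: bytes) -> None:
        self.hot[block_id] = data
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            victim, payload = self.hot.popitem(last=False)   # evict least recently used
            self.spill.seek(0, 2)                            # append at end of spill file
            self.cold[victim] = (self.spill.tell(), len(payload))
            self.spill.write(payload)

    def get(self, block_id) -> bytes:
        if block_id in self.hot:                  # hot path: nanoseconds in real HBM
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        offset, length = self.cold.pop(block_id)  # cold path: one NVMe read, ~10-20 us
        self.spill.seek(offset)
        data = self.spill.read(length)
        self.put(block_id, data)                  # promote back into the hot tier
        return data                               # (stale file copies are simply abandoned here)
```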

Access latency comparison

HBM3 (hot path): ~100 ns
NVMe Gen5 SSD: ~10-20 μs
Cross-node via NVLink: ~1-5 μs + hop variance
Cross-rack via InfiniBand: ~5-50 μs + congestion

For cold KV cache entries accessed infrequently, NVMe's consistent microsecond latency beats networked access patterns with unpredictable tail latencies.

The key technique is DMA (Direct Memory Access) prefetching with computation overlap. While the GPU processes hot cache entries, background DMA transfers speculatively load likely-needed cold entries from flash. When done correctly, the flash latency hides behind active computation.
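Here is the scheduling pattern in plain Python, reusing the tiered cache sketched above. Production systems drive this with GPU DMA engines and CUDA streams rather than threads, and would add locking this sketch omits:

```python
import concurrent.futures

def serve_step(hot_block_ids, predicted_cold_ids, cache, compute_fn):
    """Overlap compute on hot KV blocks with speculative loads of cold blocks."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        # Kick off background reads for blocks the scheduler predicts will be
        # needed next (stands in for DMA transfers from NVMe into device memory).
        prefetch = [pool.submit(cache.get, b) for b in predicted_cold_ids]

        # Meanwhile the accelerator works on the entries that are already hot.
        output = compute_fn([cache.get(b) for b in hot_block_ids])

        # If the prediction was right, these reads finished while compute ran,
        # and the flash latency stayed hidden behind useful work.
        cold_blocks = [f.result() for f in prefetch]
    return output, cold_blocks
```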

NVIDIA's 2026 software stack standardizes this pattern. Their inference runtime automatically tiers KV cache between HBM, system DRAM, and NVMe based on access recency. The programmer sees a unified memory space; the runtime manages placement.

Risks and enduring hurdles

Progress carries friction. Energy demands for data centers already strain grids, with replication multiplying footprints. A 1GW data center is not just about the chips; it's about the cooling, the power distribution, the physical space.

Sovereignty laws fragment global pools, raising costs. The European Union, China, India, and others require local data residency for certain applications. Each jurisdiction needs its own storage infrastructure, eliminating the pooling efficiencies that made cloud economics work.

Provenance and verification become critical as agents handle sensitive decisions, yet tracing multimodal data flows remains immature. When an agent makes a financial recommendation based on retrieved documents and generated analysis, auditing that decision chain requires storage systems designed for accountability, not just throughput.

The counterarguments

  • Not every workload needs petabyte contexts or hour-long video
  • Specialized models and aggressive compression mitigate some pressures
  • Flash remains orders of magnitude slower than HBM for hot access patterns
  • Short-context chatbots thrive on existing memory tiers
  • Grouped-query attention, quantization, and paged KV management shrink cache footprints

Why the thesis holds

  • Efficiency gains apply to a growing base, not a shrinking one
  • Agentic workflows are extending context requirements, not reducing them
  • Video-native models are shipping, not hypothetical
  • Supply constraints are physical, not software-solvable
  • Hyperscalers are pre-ordering years of capacity regardless of price

Flash adoption is pragmatic, not revolutionary. It extends current paradigms rather than inventing new ones. The directional bet holds: artificial intelligence thrives on data abundance. Systems that store, retrieve, and transform it efficiently will compound advantages over time.

The broader infrastructure play

Beyond memory chips themselves, the storage thesis extends to adjacent infrastructure. Decentralized storage networks gain traction where centralized clouds hit capacity limits. Edge replication systems position data closer to inference endpoints. Specialized retrieval databases optimize for the vector similarity searches that RAG architectures demand.

The opportunity lies not just in the silicon but in the software and systems that orchestrate petabyte-scale retrieval with millisecond latency requirements. Database architectures designed for OLTP workloads in the 2000s don't map cleanly to embedding search at scale. New purpose-built systems capture value by solving this mismatch.

Consider the inference pipeline for a multimodal agent. User query arrives. Embedding generated. Vector similarity search across document corpus. Relevant context retrieved. KV cache populated with historical conversation. Model generates response, potentially triggering tool calls. Each step has storage implications, and optimizing the full pipeline means rethinking assumptions at every layer.
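Sketched as code, the pipeline might look like the following. Every component here (embedder, vector index, document store, KV cache manager, model runtime, artifact store) is a hypothetical stand-in, named only to make the storage touchpoints explicit:

```python
def answer(query: str, session_id: str, *, embedder, vector_index, doc_store,
           kv_cache, model, artifact_store) -> str:
    """Hypothetical agent-serving step; dependencies are injected so each
    storage touchpoint is visible."""
    # 1. Embed the query (compute-bound, tiny output).
    query_vec = embedder.embed(query)

    # 2. Vector similarity search over the document corpus (index + storage I/O).
    doc_ids = vector_index.search(query_vec, top_k=8)
    documents = [doc_store.fetch(d) for d in doc_ids]    # blob/object storage reads

    # 3. Restore conversation state: this session's KV cache, possibly spilled
    #    to flash between turns (tiered HBM / DRAM / NVMe).
    kv_state = kv_cache.load(session_id)

    # 4. Generate; tool calls and generated artifacts add further writes.
    response, new_kv = model.generate(query, documents, kv_state)
    kv_cache.store(session_id, new_kv)                   # the write path matters too
    artifact_store.save(session_id, response)
    return response
```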

Practical guidance for builders and investors

For teams building AI applications, the storage constraint changes planning assumptions. Token limits matter less than effective retrieval. Caching strategies become first-class architectural decisions. Provider selection should weight storage tiering capabilities alongside model quality.

Consideration | Implication | Action
Long context costs | Per-token pricing understates true cost for extended contexts | Benchmark actual workload costs, not theoretical rates
Retrieval architecture | RAG quality depends on storage layer performance | Invest in vector database selection and tuning
Provider tiering | Flash-tiered inference becomes standard | Evaluate providers on memory architecture, not just model access
Multi-region requirements | Sovereignty laws fragment storage pools | Build for data residency from day one, not as an afterthought
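To see why per-token pricing understates long-context costs (the first row above), consider a rough illustration with hypothetical prices and token counts:

```python
# Hypothetical round numbers, not any provider's actual rates:
price_per_million_input = 3.00       # dollars per million input tokens
turns, tokens_per_turn = 50, 2_000   # a 50-turn agent session, ~2k new tokens per turn

# Without prefix caching, each turn re-sends the entire accumulated context.
resent_tokens = sum(t * tokens_per_turn for t in range(1, turns + 1))   # ~2.55M tokens
new_tokens_only = turns * tokens_per_turn                               # 100k tokens

print(f"naive re-send:      ${resent_tokens / 1e6 * price_per_million_input:.2f}")    # ~$7.65
print(f"ideal prefix cache: ${new_tokens_only / 1e6 * price_per_million_input:.2f}")  # ~$0.30
# Real providers charge a discounted (non-zero) rate for cached tokens, so the
# true figure lands somewhere between these two bounds.
```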

For investors, the storage thesis suggests looking beyond the obvious chip plays. Memory suppliers SK Hynix, Samsung, and Micron benefit from the demand surge, but capacity constraints limit their upside to pricing gains. The more interesting opportunities may lie in:

  • Pure-play flash providers like Sandisk that captured 577% returns in 2025
  • Enterprise SSD manufacturers scaling capacity
  • Vector database companies solving retrieval at scale
  • Infrastructure software that orchestrates tiered memory hierarchies
  • Edge storage systems positioning data closer to inference endpoints

The hyperscaler capex budgets are public. The memory supply constraints are physical. The gap between demand curves and supply curves represents either delayed AI adoption or pricing power for those who control scarce capacity.

Looking forward

The future of artificial intelligence belongs less to those who compute fastest and more to those who can access the richest datasets with the least friction. Agentic, multimodal systems will generate and consume data at scales we are only beginning to price in. Storage, long dismissed as commodity plumbing, reclaims architectural centrality.

SK Hynix expects the shortage to persist through late 2027. New memory fab capacity won't deliver meaningful supply until 2028. The companies building AI products during this window face a constraint that software optimization cannot fully solve. Physical limits on wafer production translate directly into limits on model deployment scale.

The bottom line

For builders and investors alike, the opportunity lies in recognizing this pivot early. The next breakthroughs will not come solely from larger clusters but from infrastructures that make exponential data feel effortless. The AI narrative is shifting from "who has the most GPUs" to "who can feed them fast enough."

In that world, the quiet enablers often capture the lasting value. Memory is the new moat.

Data sourced from Macquarie Research, Reuters, TrendForce, Seagate 2025 Investor Conference, TechInsights, Samsung Securities. See our interactive model comparison for the latest on model capabilities and pricing.

Cite this analysis

If you're referencing this analysis in your work:

Bristot, D. (2026, January 8). The unspoken bottleneck reshaping artificial intelligence. What LLM. https://whatllm.org/blog/memory-bottleneck-reshaping-ai

Primary sources: Macquarie Research (Memory Report 2025), Reuters Investigation (December 2025), SK Hynix/Samsung investor communications, Seagate Investor Conference 2025, TechInsights market projections