Essay - 2026-05-14

The von Neumann bottleneck: why AI's real constraint is memory, not compute

Eighty years after von Neumann, the bus is still the bus, and modern transformer inference has stretched it past the point where engineering can hide it.

In 1945, John von Neumann circulated his First Draft of a Report on the EDVAC, which described a computer architecture in which the processor and the memory live in separate physical units, connected by a bus. The processor fetches an instruction or a piece of data, performs an operation, and writes the result back. Every step of a computation requires a trip across the bus.

That bus is the von Neumann bottleneck. It is the gap between how fast the processor can compute and how fast the memory can feed it. Every general-purpose computer built since the 1950s has lived inside this architecture. So has every GPU, every TPU, every custom AI accelerator shipping in 2026.

The reason this matters now, eighty years after the original paper, is that AI workloads have stretched the bottleneck past the point where conventional engineering can hide it. The price of intelligence in 2026 is not set by how many floating-point operations per second a chip can do. It is set by how fast memory can be moved into the chip's compute units.

I work on inference at scale. The numbers below are the ones I see in the bill. None of it is investment advice. All of it is what a working engineer looks at when they price a token.

What the von Neumann bottleneck actually is

Take a modern AI accelerator. An H100 SXM has roughly 1,979 trillion floating-point operations per second of theoretical FP16 tensor throughput with structured sparsity, or about half that (roughly 989 trillion) on dense math. It has 3.35 terabytes per second of memory bandwidth to its HBM3 stack.

Those two numbers do not match. At the headline figure, the chip can perform roughly 590 floating-point operations in the time it takes to fetch a single byte from HBM, or about 295 at the dense rate. To keep the compute units busy, every byte fetched from memory has to be reused, on average, hundreds of times before another byte arrives.

The measure of that reuse is arithmetic intensity: the ratio of compute operations performed to bytes moved from memory. If the workload's arithmetic intensity is below the ratio of the chip's compute throughput to its memory bandwidth, the chip stalls. The compute units wait for data. The chip you paid $30,000 for runs at 5 to 15% of its theoretical performance.
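The stall condition above is the standard roofline argument, and it fits in a few lines. What follows is a minimal sketch, not a benchmark: the two constants are the headline H100 figures quoted in this essay, and the function names (ridge_point, attainable_flops, utilization) are my own labels for illustration.

```python
# Roofline sketch: attainable throughput is capped by whichever is lower,
# peak compute or (arithmetic intensity x memory bandwidth).
PEAK_FLOPS = 1979e12  # FLOP/s, headline FP16 figure quoted in the essay
MEM_BW = 3.35e12      # bytes/s, HBM3 bandwidth quoted in the essay


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """FLOPs per byte a workload needs before it stops being memory-bound."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     mem_bw: float = MEM_BW) -> float:
    """Best-case FLOP/s for a workload with the given FLOPs-per-byte ratio."""
    return min(peak_flops, intensity * mem_bw)


def utilization(intensity: float) -> float:
    """Fraction of peak compute the workload can reach at best."""
    return attainable_flops(intensity) / PEAK_FLOPS


print(f"ridge point: {ridge_point(PEAK_FLOPS, MEM_BW):.0f} FLOPs per byte")
for intensity in (1, 32, 128, 1000):
    print(f"intensity {intensity:>4}: {utilization(intensity):.1%} of peak")
```

Anything to the left of the ridge point is paying for compute units it cannot feed. This is an upper bound under ideal conditions; real kernels land below it.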

Modern transformer inference, particularly autoregressive generation of long sequences, has terrible arithmetic intensity. The model has to load every parameter from memory for every token it generates, and read the entire growing KV cache. The compute per parameter is small. The bandwidth required is enormous.
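The consequence is a hard ceiling on decode speed that no amount of compute can lift. Here is a rough sketch of that ceiling, assuming a hypothetical 70-billion-parameter FP16 model (140 GB of weights), a single request, and one full read of weights and cache per generated token. The 20 GB cache figure is an illustrative placeholder, and the function name is my own. It also pretends the whole model sits behind one 3.35 TB/s memory system, which a real deployment would shard across devices.

```python
def max_tokens_per_second(weight_bytes: float,
                          kv_cache_bytes: float,
                          mem_bw: float) -> float:
    """Upper bound on single-request decode speed when each generated token
    streams the full weights and the full KV cache from memory once."""
    bytes_per_step = weight_bytes + kv_cache_bytes
    if bytes_per_step <= 0:
        raise ValueError("bytes moved per step must be positive")
    return mem_bw / bytes_per_step


MEM_BW = 3.35e12        # bytes/s, HBM3 bandwidth quoted in the essay
WEIGHTS_FP16 = 140e9    # assumption: 70B parameters at 2 bytes each

weights_only = max_tokens_per_second(WEIGHTS_FP16, 0.0, MEM_BW)
with_cache = max_tokens_per_second(WEIGHTS_FP16, 20e9, MEM_BW)

print(f"weights only:       {weights_only:.1f} tokens/s at best")
print(f"plus 20 GB of cache: {with_cache:.1f} tokens/s at best")
```

Note that FLOPs do not appear anywhere in the bound. At batch size 1, the compute side of the chip is not the term that sets the speed.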

That is the von Neumann bottleneck speaking through the GPU. The architecture has not changed since 1945. The workload has.

Why AI made the bottleneck visible again

For decades, the von Neumann bottleneck was a textbook problem that real engineers worked around. CPUs got bigger caches. Compilers got better at prefetching. Memory hierarchies grew layers (L1, L2, L3, DRAM). The bus got faster.

AI workloads broke all of those workarounds at the same time, for a stack of reasons.

The first reason is model size. A 70-billion-parameter model in FP16 is 140 gigabytes. That does not fit in any single layer of any conventional cache hierarchy. It has to live in HBM, and for a dense model every forward pass has to stream essentially all of it through the compute units.
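The 140-gigabyte figure is just parameters times bytes per parameter, and the same arithmetic shows what lower precision buys. A small sketch, assuming 80 GB of HBM per accelerator (the H100 SXM capacity) and counting weights only, with no room reserved for cache or activations. The helper names are hypothetical.

```python
import math

PARAMS = 70e9            # parameter count from the essay's example
HBM_PER_DEVICE = 80e9    # assumption: 80 GB of HBM per accelerator


def weight_bytes(params: float, bits_per_param: int) -> float:
    """Bytes needed to hold the weights at a given precision."""
    return params * bits_per_param / 8


def devices_for_weights(params: float, bits_per_param: int,
                        hbm_per_device: float = HBM_PER_DEVICE) -> int:
    """Minimum accelerators needed just to hold the weights."""
    return math.ceil(weight_bytes(params, bits_per_param) / hbm_per_device)


for bits in (16, 8, 4):
    size_gb = weight_bytes(PARAMS, bits) / 1e9
    count = devices_for_weights(PARAMS, bits)
    print(f"{bits:>2}-bit weights: {size_gb:.0f} GB, at least {count} device(s)")
```

Every halving of precision halves the bytes that have to cross the bus per token, which is why quantization shows up on the bill as a bandwidth saving rather than a compute saving.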

The second reason is the KV cache. Every transformer keeps a running record of the keys and values for every token it has seen. For a long context window, this state can balloon to tens of gigabytes per session. A modern agent loop that runs for thirty minutes against a hundred-thousand-token context burns more memory bandwidth on cache reads than on parameter reads.

The third reason is batch size. Training amortizes parameter loads across thousands of samples in a batch, so arithmetic intensity stays high. Inference, especially production inference with low latency requirements, often runs at batch size 1 or batch size 32, where the arithmetic intensity collapses. The chip stalls.
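To see why batch size matters so much, count the work done per byte of weights fetched. A rough sketch under simplifying assumptions: each parameter contributes one multiply and one add per sequence in the batch, weights are read once per step, and activation and KV cache traffic are ignored. The function names are mine.

```python
import math

PEAK_FLOPS = 1979e12  # FLOP/s, headline FP16 figure quoted in the essay
MEM_BW = 3.35e12      # bytes/s, HBM3 bandwidth quoted in the essay


def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per weight byte in one decode step: 2 FLOPs per parameter per
    sequence, divided by the bytes fetched for that parameter."""
    return 2.0 * batch_size / bytes_per_param


def min_batch_for_compute_bound(peak_flops: float, mem_bw: float,
                                bytes_per_param: float = 2.0) -> int:
    """Smallest batch whose weight-only intensity reaches the ridge point."""
    return math.ceil(peak_flops / mem_bw * bytes_per_param / 2.0)


for batch in (1, 32, 512):
    print(f"batch {batch:>3}: about {decode_intensity(batch):.0f} FLOPs per byte")
print("batch needed to saturate compute:",
      min_batch_for_compute_bound(PEAK_FLOPS, MEM_BW))
```

In FP16 the intensity is simply the batch size, so batch 1 delivers one FLOP per byte against a chip that wants hundreds. Including KV cache reads would make the picture worse, because the cache is read per sequence and does not amortize across the batch.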

The fourth reason is reasoning models. A reasoning model burns roughly 10x the output tokens of a non-reasoning model, because it spends most of its tokens thinking out loud before answering. Every one of those tokens is another generation step. Every generation step pulls the entire parameter set through HBM again.

Multiply those four reasons together and you get a workload that is bottlenecked, top to bottom, on the bus that von Neumann described in 1945.

The memory wall, in dollars

The cost curve of AI inference in 2026 is not a compute story. It is a memory story.

The 100x reduction in token prices over the last year was achieved by some combination of model distillation, speculative decoding, KV cache quantization, and continuous batching. Three out of those four techniques are, at bottom, ways to move fewer bytes across the bus per generated token. The fourth, continuous batching, is a way to raise arithmetic intensity inside a batch.
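Speculative decoding is the least obvious of the four, so here is a back-of-envelope model of what it does to weight traffic. It uses the usual expected-value formula under a strong assumption: each drafted token is accepted independently with the same probability. It also ignores the cost of running the draft model. The acceptance rate and draft length below are illustrative values, not measurements.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per full pass of the large model when it
    verifies draft_len drafted tokens, each accepted independently with
    probability accept_rate (a simplifying assumption)."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def weight_reads_per_token(accept_rate: float, draft_len: int) -> float:
    """Large-model weight sweeps charged to each emitted token."""
    return 1.0 / expected_tokens_per_pass(accept_rate, draft_len)


tokens = expected_tokens_per_pass(0.8, 4)
print(f"tokens per large-model pass: {tokens:.2f}")
print(f"weight sweeps per token:     {weight_reads_per_token(0.8, 4):.2f}")
```

Plain decoding pays one full weight sweep per token. Under these assumptions the large model pays roughly a third of that, which is the sense in which the technique is a bandwidth optimization.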

The names that have re-rated the hardest in the AI buildout, on a fundamentals basis, are not the names with the highest FLOPs per dollar. They are the names that own the memory and packaging stack.

$MU (Micron) trades like a different company than it did in 2023, because HBM4 is essentially sold out and Micron is the only Western producer at scale. SK Hynix has become the purest HBM exposure on the global tape. Samsung is the catch-up option. The HBM trio has captured roughly half of every dollar of AI margin expansion over the last eighteen months, and the contracts now extend to 2028.

The packaging layer captures the second derivative. $TSM owns CoWoS, the advanced packaging process that bonds HBM stacks onto GPUs. CoWoS capacity has been sold out through 2026 for over a year. $BESI is the cleanest hybrid-bonding convexity, with sequential unit orders more than doubling in Q1 2026. $ONTO and $CAMT split 3D metrology between them. $LRCX owns the TSV etch step.

If the von Neumann bottleneck is the disease, this is the medicine. Every dollar spent narrowing the gap between processor and memory flows through this stack.

How architecture is trying to escape

The conventional response to the memory wall has been to put more HBM next to more compute. HBM3 became HBM3E. HBM3E becomes HBM4 in 2026. HBM4E follows in 2027. Stack heights climb from 8-high to 12-high to 16-high. Bandwidth per stack climbs by roughly 1.5x to 2x each generation.

This works, in the sense that it keeps the compute units fed. It does not work, in the sense that it is not a long-term escape from von Neumann. The bus is still the constraint. You are just making the bus wider and shorter.

The real architectural escape attempts fall into three categories.

Compute-in-memory. Push the arithmetic into the memory array itself, so the data never has to travel. This is the dream that has been around since the 1960s. Mythic, Untether, and a handful of others have built analog and digital versions. The yield problems are real. The compiler problems are real. None of them have displaced HBM at hyperscale.

Co-packaged optics (CPO). Replace the copper bus between racks with light. This does not solve the bottleneck inside a single chip, but it does solve the bottleneck inside a single cluster, which is structurally the same problem at a larger scale. The CPO market goes from roughly zero to $91 billion inside 18 months starting in H2 2026, according to Goldman. The substrate layer ($AXTI, $SOI, $IQE) is where that money flows first.

Wafer-scale integration. Build the whole accelerator as a single wafer-sized die, so the memory and the compute are physically adjacent. Cerebras has been doing this for years. Tesla has done variants of it. The economics are hard. The thermal management is harder. But the architectural argument is sound: the only way to truly escape von Neumann is to dissolve the boundary between compute and memory.

None of these escapes is going to displace the HBM-plus-GPU stack inside the 2026 to 2028 window. The buildout that is happening now is a buildout on top of the existing architecture, with HBM4 doing the heavy lifting and CPO solving the cluster-scale version of the same bottleneck.

The KV cache: the silent monster

The piece of the memory story that retail investors and most analysts miss is the KV cache.

The KV cache is the running memory state of an LLM conversation or agent loop. Every token the model reads or generates adds an entry to this cache. The size of the cache scales linearly with the context length, the number of key-value heads, the head dimension, and the model depth.

For a 70-billion-parameter model with a 32,000-token context, the KV cache is roughly 20 gigabytes per session. For a 128,000-token context, it is more like 80 gigabytes. For an agentic workflow with a million-token context, it can exceed the size of the model weights themselves. The exact figures depend on the attention layout and the precision of the cache.
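The sizing rule is short enough to write down. The sketch below assumes a Llama-style 70B shape (80 layers, head dimension 128) with an FP16 cache, and compares two attention layouts: grouped-query attention with 8 key-value heads and full multi-head attention with 64. Those shapes are my assumptions for illustration, not a description of any particular production model. The round numbers quoted above fall between the two layouts.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one session: a key and a value vector
    per token, per layer, per key-value head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


LAYERS, HEAD_DIM = 80, 128        # assumption: Llama-style 70B shape
WEIGHTS_FP16 = 140e9              # 70B parameters at 2 bytes each

for label, kv_heads in (("grouped-query, 8 KV heads", 8),
                        ("multi-head, 64 KV heads", 64)):
    print(label)
    for tokens in (32_000, 128_000, 1_000_000):
        size = kv_cache_bytes(tokens, LAYERS, kv_heads, HEAD_DIM)
        ratio = size / WEIGHTS_FP16
        print(f"  {tokens:>9,} tokens: {size / 1e9:7.1f} GB "
              f"({ratio:.2f}x the weights)")
```

Two things stand out. The cache grows with every token while the weights stay fixed, so at a million tokens it passes the weights even under the leaner layout. And unlike the weights, it is per session, so ten concurrent sessions carry ten caches.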

This is not a cost you can quantize away. Quantizing the cache shrinks it, but the cache still has to live in HBM. Every generation step still has to read all of it. Every new token still adds to it.

The implications stack quickly. CPU-to-GPU ratios are shifting from 1:8 in training to 1:1 in agentic inference, because long-context inference is increasingly CPU-bound on memory management. Google has split its TPU line in two, with a dedicated inference chip carrying tripled SRAM specifically for KV cache. NVIDIA's roadmap is following the same logic.

If you want to understand why your AI bill went up even though tokens got cheaper, this is the answer. The cache, not the weights, is the price of thought.

The companies that get paid

The von Neumann bottleneck has a tickered expression in 2026. Almost every name that matters fits into one of three buckets.

The memory makers. $MU, $000660.KS (SK Hynix), $005930.KS (Samsung). These names capture the front-end of every AI margin dollar. Hyperscalers have placed HBM preorders out to 2028. Lip-Bu Tan said on a recent earnings call that there is no supply relief until 2028. The cycle is the longest in modern semiconductor history.

The packagers and tool vendors. $TSM, $BESI, $ONTO, $CAMT, $LRCX, $KLIC. These names own the physical processes that bond memory onto logic. The hybrid-bonding transition from pilot to volume in 2026-27 is the single highest-convexity public expression of HBM4 ramp. BESI just printed a record order quarter at €269.7 million.

The interconnect layer. $ALAB, $CRDO, $APH, $COHR, $LITE, $CIEN. These names own the buses that link the racks together once the on-chip memory bottleneck has been pushed out to the cluster level. Every PCIe Gen 6 platform needs a retimer. Every 1.6T optical link needs an EML. Every CPO engine starts on indium phosphide.

That is the full stack of the von Neumann response. The compute is the visible part. The memory is the invisible part. The packaging is the connective tissue. The interconnect is the layer that lets you build a cluster instead of a chip.

If you understand why the four-letter ticker on the cluster matters less than the two-letter ticker on the HBM stack, you understand the most important architectural fact of the AI buildout.

What the bottleneck is teaching us

The first thing the von Neumann bottleneck is teaching us is humility about architecture. Eighty years after the original paper, the bus is still the constraint. The buildout in 2026 is not escaping von Neumann. It is paying a tax to live inside it.

The second thing is that the next layer of AI products (agents, deep research, long-context reasoning, persistent memory) consumes orders of magnitude more memory bandwidth than the chat interfaces it is replacing. The total compute bill is going up not because compute is more expensive, but because each query crosses the bus thousands of times instead of dozens.

The third thing is that the companies who figure out the layer beneath the API (KV cache management, speculative decoding, quantization, routing, the kind of vertical integration that lets you stop paying full bandwidth tax) are the companies that will keep their margins.

The fourth thing, and the one I find genuinely interesting, is that the bottleneck is not going away. It is just being rebalanced. Every architectural improvement that narrows the gap between processor and memory creates demand for new workloads that re-widen it. Reasoning models did this. Agents did it again. The next generation will do it harder.

William Stanley Jevons observed in 1865 that making coal burning more efficient did not reduce coal consumption. It increased it. The same thing is happening with memory bandwidth right now, and it is happening faster than any analogous historical cycle.

Cheaper tokens, more memory traffic, same coal as 1865.

The takeaway

The von Neumann bottleneck is not a problem to be solved. It is a structural feature of how computers work, and AI has stretched it to the point where the entire economics of inference flow through it.

If you are building an AI product, the bottleneck shows up as your gross margin. If you are buying compute, it shows up as your bill. If you are watching this from a distance and trying to understand where the next chokepoints form, the answer is everywhere downstream of "more memory state, always on, with massive KV cache per session."

The names that matter are the ones that own the memory, the packaging, and the interconnect. The narrative that matters is that the architecture is older than the chip industry, and the chip industry has not figured out how to escape it.

Eighty years in, the bus is still the bus.

Educational, not investment advice.