Ethan wrote Part 1 of this series about why long-context models usually beat retrieval-augmented generation for single-vessel deployments. I agree with all of it. The hard part is not deciding that context wins. The hard part is paying for the GPUs to actually serve a million tokens of context on your own hardware, in your own engine room, with your own cooling and power budget.
This is Part 2. I am going to do the unsexy work of walking through the memory math for what "1M context on a vessel" actually costs in VRAM, for four model sizes, across three real deployment scenarios. If you read Part 1 and thought "great, we will just run Llama 3.3 70B at 1M context locally," this is the post that tells you what that actually requires.
Spoiler: the weights are the small number. The KV cache is the big one. And the KV cache is the reason the cloud players can offer 1M context and your on-vessel deployment today probably cannot.
The two numbers that matter
When you load a model onto a GPU for inference, there are two separate chunks of memory you have to budget for:
1. Model weights. This is the parameter data itself. It is a fixed cost. Load it once, it stays resident, it does not grow or shrink based on how many users you have or how long the conversations get. For a 70 billion parameter model in FP16, weights are about 140 GB. Quantize to 4-bit and they drop to about 40 GB. That is straightforward.
2. KV cache. This is the part nobody talks about in the model card. Every transformer layer has to remember the key and value tensors from every token it has processed so far in the current sequence, so it does not recompute them on every new token. That remembered data is the KV cache. It scales linearly with sequence length and linearly with batch size (concurrent sessions). It is the real memory tax of long-context inference.
Ignore either number and you will buy the wrong hardware. Most people who do the back-of-envelope math for a local deployment only count the weights. That is how you end up with a "70B on a single H100 80GB" plan that falls over the moment you try to use more than 40K of context.
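The weights side of the budget really is one line of arithmetic. A minimal sketch (plain Python, decimal gigabytes; real quantized files such as Q4_K_M land somewhat heavier because some tensors stay at higher precision):

```python
def weights_gb(params_billions, bits_per_param):
    """Weight memory in decimal GB: params x bits / 8 bytes per param."""
    return params_billions * bits_per_param / 8

print(weights_gb(70, 16))  # 140.0 GB for a 70B model in FP16
print(weights_gb(70, 4))   # 35.0 GB at a flat 4 bits; Q4_K_M lands nearer 40 GB
```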
KV cache math, from scratch
Let me walk through how to calculate KV cache for a specific model. I am going to use Llama 3.3 70B as the running example because it is the most popular open-weights model people ask us to deploy. The math generalizes.
Llama 3.3 70B has:
- 80 transformer layers
- 8 KV attention heads (grouped-query attention, so fewer KV heads than Q heads)
- 128 dimensions per head
- FP16 activations by default (2 bytes per element)
For every token in the context, every layer has to store both a key vector and a value vector, for every KV head. That is:
Per-token KV footprint = 2 (K and V) × 80 layers × 8 heads × 128 dim × 2 bytes = 327,680 bytes
Round that to 320 KB per token in FP16. Now multiply by context length.
| Context length | KV cache (FP16) |
|---|---|
| 32K tokens | 10 GB |
| 128K tokens | 40 GB |
| 256K tokens | 80 GB |
| 512K tokens | 160 GB |
| 1M tokens | 320 GB |
Read the bottom row. Three hundred and twenty gigabytes, per active session, in addition to the 140 GB of model weights. That is your KV cache bill for one user running one conversation with Llama 3.3 70B at full 1M context in FP16.
A single H100 SXM5 has 80 GB of HBM. You would need four of them just for the KV cache of one user, before you even loaded the weights.
This is not a quantization problem with the weights. You can take a 70B model weights down to 4-bit and fit them in 40 GB. The KV cache does not change. The KV cache is the same 320 GB whether your weights are FP16 or 4-bit, because the KV cache is about the activations flowing through the model at inference time, not the parameters.
That is the reality check most people are missing.
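If you want to sanity-check the table yourself, the per-token formula and the 1M row reduce to a few lines of Python, using the Llama 3.3 70B shape from above (GiB, i.e. powers of two, matching the rounding in the table):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # leading factor of 2: one key vector and one value vector per layer per head
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

# Llama 3.3 70B: 80 layers, 8 KV heads, 128-dim heads, FP16 (2 bytes)
print(kv_cache_bytes(80, 8, 128, 1))                # 327680 bytes, the 320 KB per token
print(kv_cache_bytes(80, 8, 128, 1 << 20) / 2**30)  # 320.0 GiB at 1M tokens
```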
Four model sizes, four KV cache bills
The KV cache math depends on the model's layer count, head count, and head dimension. Smaller models have smaller per-token footprints. Bigger models have bigger ones. Here is the table for the four sizes you are most likely to be looking at.
| Model | Layers | KV heads | Head dim | Per-token KV (FP16) | KV @ 1M |
|---|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 128 | 128 KB | 128 GB |
| 13B class (GQA, hypothetical) | 40 | 8 | 128 | 160 KB | 160 GB |
| Llama 3.3 70B | 80 | 8 | 128 | 320 KB | 320 GB |
| Llama 3.1 405B | 126 | 8 | 128 | 504 KB | 504 GB |
A few notes on this table:
- All four rows assume grouped-query attention with 8 KV heads. (The Llama 3.x lineup skips 13B; that row is a hypothetical GQA model at Llama 2 13B scale, included for comparison.) GQA is the single biggest reason Meta's models are tractable at long context. The older models with 32 full KV heads had 4x the cache footprint per token. GQA is not optional at long context, it is table stakes.
- Head dim is 128 across the board. This is standard.
- The per-token KV scales with layer count. That is why Llama 405B is worse than Llama 70B by a factor of 126/80 = 1.575, exactly matching the layer ratio.
- None of this scales with the weights quantization. 4-bit weights do not shrink the KV cache.
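The whole table falls out of the same formula with only the layer count changing, since the head configuration is fixed at 8 KV heads of 128 dims. A quick reproduction (layer counts as published in each model's config; the 13B row is the hypothetical above):

```python
def per_token_kv_kb(layers, kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem / 1024

LAYER_COUNTS = {"8B": 32, "13B": 40, "70B": 80, "405B": 126}
for name, layers in LAYER_COUNTS.items():
    kb = per_token_kv_kb(layers)
    # handy coincidence: N KB per token works out to exactly N GiB at 1M (2^20) tokens
    print(f"{name}: {kb:.0f} KB/token, {kb:.0f} GiB @ 1M")
```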
Now add a column for the weights so you can see both bills at once, assuming you are running 4-bit quantized weights (Q4_K_M or equivalent) for maximum compression:
| Model | Weights (Q4) | KV @ 32K | KV @ 128K | KV @ 1M | Total @ 1M |
|---|---|---|---|---|---|
| Llama 3.1 8B | 5 GB | 4 GB | 16 GB | 128 GB | 133 GB |
| 13B class (hypothetical) | 8 GB | 5 GB | 20 GB | 160 GB | 168 GB |
| Llama 3.3 70B | 40 GB | 10 GB | 40 GB | 320 GB | 360 GB |
| Llama 3.1 405B | 230 GB | 16 GB | 63 GB | 504 GB | 734 GB |
These are the numbers that matter. A 405B model at 1M context needs three quarters of a terabyte of VRAM before you have served a single token to a second user. Even the 8B model, which nobody thinks of as a big model, needs 133 GB of VRAM to run at full 1M context.
This is the math the cloud providers do not show you because on their side, they are spreading the cache across thousands of GPUs with clever batching and attention tricks. On a vessel, you do not have thousands of GPUs.
Scaling by user count
Everything above assumes one user. The KV cache is per session, which means every concurrent user adds another full KV cache to the bill. Here is where the numbers get genuinely ugly for anyone with more than one person on the boat.
Let me walk through three realistic vessel deployment scenarios.
Scenario A: One user, one session (the owner)
A single-user deployment is actually the only way 1M context on a large model is currently feasible on a yacht. One session, one full KV cache, one set of weights. For Llama 3.3 70B:
- Weights: 40 GB (Q4)
- KV at 1M: 320 GB
- Total: 360 GB of VRAM
You need five H100 80GB cards (400 GB) to serve this, or eight L40S 48GB cards (384 GB; even seven, at 336 GB, falls short of 360), or three H200 141GB cards (423 GB, finally enough), or two Blackwell B200 192GB cards (384 GB, enough with margin). Note that a classic 4x H100 SXM5 node, about $100K in silicon and $250K-$320K installed, does not even cover it. The Blackwell generation makes this a two-card conversation for the first time, but at roughly $35,000 per B200 plus a power envelope around 1,000W per card, you are still in serious-hardware territory.
For one user at 1M context on a 70B model. Let that sink in.
Scenario B: Workforce only (ten concurrent crew sessions)
Ten crew members using a voice assistant on the bridge, in the engine control room, in the galley, in the stewardess pantry. Each one has their own session, each one gets their own KV cache, but in practice most of them will not be running at the full 1M context. They will be asking focused questions, with maybe 50K tokens of relevant context loaded at a time.
Let me do the math at 128K per session, which is a more realistic steady-state:
- Weights: 40 GB (shared, loaded once)
- KV per session: 40 GB
- 10 concurrent sessions: 400 GB of KV cache
- Total: 440 GB of VRAM
That is the same ballpark as the single-user 1M case. Six H100s (480 GB of cards against a 440 GB bill). A real server rack, real power draw, real cooling.
At 32K per session, the math is gentler:
- KV per session: 10 GB
- 10 sessions: 100 GB
- Total with weights: 140 GB
Two H100s. A normal vessel-scale deployment. This is why most production on-vessel deployments today serve 32K, not 1M. It is not because the model cannot handle more context, it is because the KV cache budget does not exist.
Scenario C: Workforce plus guests (twenty concurrent sessions)
Add ten guests to the charter. They are asking their concierge agent about excursions, restaurants, dietary preferences, the history of the vessel. Twenty concurrent sessions, total.
At 32K per session:
- Weights: 40 GB
- KV: 20 × 10 GB = 200 GB
- Total: 240 GB
Three H100s at zero headroom, so four in practice. A beefy rack but still feasible.
At 128K per session:
- Weights: 40 GB
- KV: 20 × 40 GB = 800 GB
- Total: 840 GB
You are now looking at an 11x H100 deployment (840 GB needs eleven 80 GB cards), more than a full eight-GPU DGX H100. Hundreds of thousands of dollars of GPUs in a cabinet, drawing on the order of 10 kilowatts, and you are still at 128K context, not 1M. Your dream of a long-context concierge agent for every guest just collided with a power budget that no yacht is going to approve.
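All three scenarios are the same two-term formula, weights once plus one KV cache per session, and card counts are plain ceiling division. A sketch (zero-headroom best case; a production deployment would reserve margin):

```python
import math

def total_vram_gb(weights_gb, kv_gb_per_session, sessions):
    # weights are loaded once and shared; every concurrent session pays its own KV cache
    return weights_gb + kv_gb_per_session * sessions

def h100s_needed(total_gb, card_gb=80):
    return math.ceil(total_gb / card_gb)  # no headroom reserved

print(total_vram_gb(40, 320, 1), h100s_needed(360))  # Scenario A: 360 GB, 5 cards
print(total_vram_gb(40, 10, 10), h100s_needed(140))  # Scenario B @ 32K: 140 GB, 2 cards
print(total_vram_gb(40, 40, 20), h100s_needed(840))  # Scenario C @ 128K: 840 GB, 11 cards
```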
The three honest ways out
If you look at these numbers and think "that cannot be right," I am here to tell you it is. But there are three real ways to make on-vessel long-context deployments tractable. In order of how much they help:
1. Use a smaller model. Llama 3.1 8B with 4-bit weights takes 5 GB for weights and 16 GB for a 128K KV cache. That is a single L40S card. You lose some capability but the context math becomes friendly. This is the right default for concierge agents where a smaller model is already good enough.
2. Cap context more aggressively. 32K per session is plenty for most operational queries. If you have a "load everything" use case like Ethan described in Part 1, serve that as a single session and let other users run at 32K. Mixed-context serving is well-supported in vLLM and SGLang.
3. Quantize the KV cache itself. This is the lever that actually changes the economics, and it has improved a lot in the last two years. FP8 KV cache is already production-grade in vLLM and TensorRT-LLM. INT4 is experimental but working. KIVI (Rice/CMU, February 2024) and KVQuant (Berkeley, January 2024) pushed 2-bit KV cache into the plausibility zone with asymmetric per-channel K and per-token V schemes. Two weeks ago, Google Research published TurboQuant (arXiv 2504.19874, ICLR 2026), the first training-free, data-oblivious scheme that hits 3 bits per element with no measurable accuracy loss on the long-context benchmarks that matter for this conversation.
The third option is where this series is going. If you can get the KV cache from FP16 down to 3-bit TurboQuant or 2-bit KIVI-style, you cut the bill by 5x to 8x. 320 GB becomes 40 GB to 64 GB. That is the difference between "impossible on a yacht" and "possible on a pair of H100s."
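The savings are a straight bits-per-element rescale of the same formula. A quick check for the 70B-at-1M case (the precision ladder here is illustrative; which bit widths are actually usable without quality loss is Part 3's subject):

```python
def kv_gb_at_bits(layers, kv_heads, head_dim, seq_len, bits_per_elem):
    total_bits = 2 * layers * kv_heads * head_dim * seq_len * bits_per_elem
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB

# FP16, FP8, INT4, TurboQuant-style 3-bit, KIVI-style 2-bit
for bits in (16, 8, 4, 3, 2):
    print(f"{bits}-bit KV @ 1M: {kv_gb_at_bits(80, 8, 128, 1 << 20, bits):.0f} GiB")
```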
Ethan is writing Part 3 on the details. The short version is that TurboQuant is the current frontier, KIVI is what got us there, and all of these schemes are now either in production inference engines or on their way in. Without any of them, what I have described in this post is the ceiling. With them, 1M context on a vessel becomes a real conversation with a real price tag that a real client can approve.
What I would actually buy today
If a client walked in and said "I want 1M context on my 85-meter yacht, for one primary user, running everything on-vessel," here is what I would actually spec, today.
The L40S answer (still the default for most deployments):
- Model: Llama 3.1 8B (or a 13B-class model) at Q4_K_M quantization. The 70B dream has to wait on this hardware tier.
- Context cap: 512K per session, not 1M.
- Hardware: Two L40S 48GB cards (96 GB total), with the model and cache split across both. For the 8B, weights (5 GB) plus a 512K FP16 KV cache (64 GB) come to about 69 GB, which fits with headroom; a 13B-class option at 512K (8 GB + 80 GB = 88 GB) is tight.
- Power: 700W sustained.
- Cost: Under $20K in silicon.
The H200 answer (the new sensible default if you want 70B at 1M for a single user):
- Model: Llama 3.3 70B at Q4_K_M, the real 70B dream without compromises.
- Context cap: 1M. Actual 1M.
- Hardware: One NVIDIA H200 at 141 GB HBM3e. Weights are 40 GB, but FP8 KV cache at 1M is still 160 GB, which blows the budget on its own. You need either FP8 KV with a tighter context cap (say 512K, for 40 + 80 = 120 GB), or a lower-precision KV quant like TurboQuant 3-bit, at which point 70B at full 1M (40 + 60 = 100 GB) fits in a single H200 with comfortable headroom. See Part 3 for that math.
- Power: 700W sustained.
- Cost: About $32K in silicon for one card, closer to $50K for a redundant pair.
The Blackwell answer (for flagship yachts and commercial operators):
- Model: Llama 3.3 70B or 405B with future upgrade headroom.
- Context cap: 1M with concurrency for a workforce.
- Hardware: One or two Blackwell B200s at 192 GB HBM3e. A single B200 comes up just short of 70B Q4 weights plus a full 1M FP8 KV cache (40 + 160 = 200 GB against 192), so plan on a slightly tighter KV precision or a second card; two cards serve the workforce scenario comfortably.
- Power: 1000W per card sustained. This is the first recommendation in this post that requires you to think about your rack cooling as seriously as you think about your GPU budget.
- Cost: $30K to $40K per B200 card; a redundant pair lands around $70-80K in silicon alone.
For the classic 85-meter single-user yacht deployment, the L40S answer is still the right one. Not because the B200 would not be better, but because it would cost four times as much and give you maybe 2x the capability for workloads that do not need the ceiling. For a vessel, the smaller deployment wins on price, power, and redundancy, the three metrics that actually matter.
When you should upgrade the recommendation: if the workload is materially larger (multiple users, concurrent sessions, true 1M context reads, or a model above 70B), step up to H200 as the new modern default. If you are building the ceiling for a flagship or cruise operator, step up to Blackwell. Otherwise two L40S cards and call it done.
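The card-count logic behind all three answers is the same ceiling division, so it is easy to rerun for your own budget. A sketch over the four VRAM capacities discussed in this post (the 100 GB budget assumes 70B Q4 weights plus 3-bit KV at 1M, per the H200 answer above; this sizes capacity only, not compute):

```python
import math

CARD_GB = {"L40S": 48, "H100": 80, "H200": 141, "B200": 192}

def cards_needed(total_gb, card):
    return math.ceil(total_gb / CARD_GB[card])  # capacity only; no headroom

# 70B at 1M with 3-bit KV: 40 GB weights + 60 GB cache = 100 GB
for card in CARD_GB:
    print(card, cards_needed(100, card))
```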
KV cache quantization is already landing in production inference engines. FP8 has been there for a year. Google's TurboQuant result at 3-bit changes the ceiling again. The 70B dream is not dead, and Part 3 explains why.
Speaking of which, Ethan is up next. He is going to walk through exactly how TurboQuant's PolarQuant plus Johnson-Lindenstrauss construction works, how it compares to the KIVI/KVQuant lineage that preceded it, where the quality tradeoffs really lie on long-context tasks, and how much memory you actually save in practice. If Part 1 of this series convinced you that context beats retrieval, and Part 2 scared you about the hardware bill, Part 3 is where it all starts to add up.
Trying to size a long-context local deployment for your vessel and running into the KV cache wall? Talk to us. We have done this math more times than we would like, and we will tell you honestly what the right ceiling is for your specific workload.
