TurboQuant and the KV Cache Unlock: Why 1M Context Just Got Cheap (Part 3 of 3)

Ethan Marsh · 17 min read

This is Part 3 of a three-part series. In Part 1 I argued that loading everything into context usually beats retrieval for single-vessel deployments, citing Vercel's recent eval where a compressed, always-loaded doc file outperformed on-demand skills 100% to 53%. In Part 2 James walked through the VRAM math and landed on the ugly number that stops most people cold: running Llama 3.3 70B at 1M tokens of context costs 320 GB just for the KV cache, per session, before you even load the weights.

This post is about how to get that number down to somewhere between 40 and 64 gigabytes without wrecking the model's output quality, and about the specific piece of research from Google, published two weeks ago, that reset the ceiling on what a vessel rack can actually hold.

The research is called TurboQuant (arXiv:2504.19874), and it will be presented at ICLR 2026 at the end of this month. It is not the first KV cache quantization paper; it is not even the first one we have deployed in experiments. But it is the first one I have read where the combination of "training-free, data-oblivious, 3-bit, no measurable accuracy loss" is all in the same sentence. That is a significant delta, and the chip stocks felt it the morning it landed. SK Hynix fell 6% in Seoul. Micron and SanDisk dropped 5 to 8%. The memory industry understood what it meant faster than most AI engineers did.

Here is what TurboQuant is, how it is different from what came before, and what it means for an on-vessel deployment you are speccing today.

A quick history, so the delta makes sense

KV cache quantization is not new. It is the opposite of new. Here is the rough timeline, because context matters:

  • 2023: FP8 KV cache starts shipping in production inference stacks. Half-precision was the obvious first move, and NVIDIA Hopper's native FP8 support made it cheap. vLLM, TensorRT-LLM, and SGLang all exposed FP8 KV cache by late 2023 or early 2024. Memory savings: 2x over FP16. Quality cost: negligible.
  • January 2024: KVQuant from Hooper et al. at UC Berkeley. First to demonstrate 3-bit KV cache with under 0.1 perplexity degradation on Wikitext-2 and C4. Showed LLaMA-7B running at 1M context on a single A100-80GB. Required calibration data and per-channel K plus per-token V grouping.
  • February 2024: KIVI from Liu et al. at Rice/CMU. Tuning-free 2-bit KV cache with asymmetric grouping. Reported 2.6x less peak memory, up to 4x larger batch size, and 2.35x to 3.47x throughput improvement on Llama-2-7B, Falcon-7B, and Mistral-7B.
  • 2024 throughout: These techniques land in HuggingFace transformers' QuantizedCache class, in vLLM via community PRs, and in research forks of TensorRT-LLM. 4-bit KV cache becomes "the thing you turn on when you want to serve more concurrent users per GPU." 2-bit remains experimental.
  • March 24, 2026: Google Research publishes TurboQuant. Training-free, data-oblivious, 3-bit with no accuracy loss on LongBench, Needle-In-A-Haystack, RULER, ZeroSCROLLS, and L-Eval. 6x memory reduction over FP16, up to 8x speedup on H100 for the attention logit computation. Gemma and Mistral benchmarks.

That is the arc. KVQuant and KIVI proved you could get to 2 to 3 bits with clever outlier handling. TurboQuant removed the calibration step and the model-specific tuning and made 3-bit essentially a free drop-in on any LLM. That is the delta.

How TurboQuant actually works

I am going to give you the version I would give a senior engineer over coffee. For the full math, read the paper. For the version that will help you understand why it works and what it cannot do, read this.

TurboQuant is a two-stage scheme. Both stages are training-free, which means they run on any model checkpoint without any calibration data or per-model tuning.

Stage 1: PolarQuant. Instead of quantizing each KV vector in Cartesian coordinates (X, Y, Z style), TurboQuant converts each vector into polar coordinates. That splits it into two pieces: a radius (how strong the vector is) and a set of angles (the direction the vector points). The insight is that the radius and angle have very different distributional properties. The radius is a scalar magnitude that benefits from one kind of quantization. The angle lives on a unit sphere and benefits from a different kind. Separating them lets you quantize each optimally instead of applying one coarse scheme to the mashed-together vector.

The practical effect is that PolarQuant maps data onto a fixed, predictable "circular" grid instead of having to compute per-vector scale factors the way traditional quantization schemes do. No per-vector scale factor means no calibration data. That is most of the training-free property right there.
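To make the split concrete, here is a toy sketch in Python. This is my illustration of the radius/direction idea, not the paper's actual construction (which quantizes hyperspherical angles): the magnitude goes onto a fixed scalar grid and the unit direction onto a fixed per-component grid, so nothing per-vector needs to be stored. `r_max` and the bit widths are made-up illustration parameters.

```python
import numpy as np

def polar_quantize(v, r_bits=8, dir_bits=3, r_max=32.0):
    """Toy radius/direction quantizer: both grids are fixed up
    front, so no per-vector scale factor is ever computed."""
    r = float(np.linalg.norm(v))
    u = v / (r + 1e-12)                        # unit direction
    # magnitude on a fixed scalar grid over [0, r_max]
    r_levels = 2 ** r_bits - 1
    r_q = round(min(r / r_max, 1.0) * r_levels) / r_levels * r_max
    # direction components on a fixed grid over [-1, 1]
    d_levels = 2 ** dir_bits - 1
    u_q = np.round((u + 1.0) / 2.0 * d_levels) / d_levels * 2.0 - 1.0
    u_q = u_q / (np.linalg.norm(u_q) + 1e-12)  # snap back onto the sphere
    return r_q * u_q

v = np.array([3.0, 4.0, 0.0, -2.0])
v_hat = polar_quantize(v)
```

The point of the sketch is the absence of a calibration step: the same grids work for every vector, which is where most of the training-free property comes from.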

Stage 2: Quantized Johnson-Lindenstrauss (QJL). This is the error correction layer. JL transforms are a classical trick from randomized algorithms: you project a high-dimensional vector down to a lower dimension while preserving the pairwise distances between vectors with high probability. TurboQuant uses a specific JL variant that reduces each projected vector element to a single sign bit (+1 or -1). Because the transform preserves distances, the attention score computation (which is fundamentally a bunch of dot products) still works after the projection, with bounded error.

The clever part is that QJL's memory overhead is essentially zero. You are storing a single bit per element, and the bit acts as error correction for the PolarQuant step. Put the two stages together and you get 3-bit KV cache with no accuracy loss on long-context benchmarks. That is the headline result.
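A quick demo of why distance preservation is enough. The sketch below is the classic sign-based random projection estimator (SimHash), which I am using as a stand-in for the paper's exact QJL construction: keep one sign bit per projected coordinate, and recover an approximate cosine from the fraction of disagreeing bits.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 8192                 # head dim, number of sign bits
P = rng.standard_normal((m, d))  # random projection, shared by all vectors

def sign_bits(v):
    # store only the sign of each projected coordinate: 1 bit each
    return (P @ v) > 0

def approx_cosine(bits_a, bits_b):
    # fraction of disagreeing bits -> angle between vectors -> cosine
    theta = np.pi * np.mean(bits_a != bits_b)
    return float(np.cos(theta))

a, b = rng.standard_normal(d), rng.standard_normal(d)
true_cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
est_cos = approx_cosine(sign_bits(a), sign_bits(b))
```

Attention needs unnormalized dot products, so in practice you also keep each vector's norm as a cheap extra scalar; the estimated cosine times the two norms recovers the dot product with bounded error.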

Where it differs from KIVI and KVQuant

KIVI and KVQuant noticed that keys and values have different distributional properties and therefore want different quantization grouping axes. Keys have outlier channels, so per-channel grouping. Values have scattered outliers, so per-token grouping. That insight is still correct, and TurboQuant does not contradict it.

What TurboQuant does differently is reach the same 3-bit compression without the outlier-specific machinery. The Cartesian-to-polar transformation changes the shape of the data distribution so that the outlier problem no longer dominates the error budget. The JL projection then handles whatever residual error is left. Because it works from the geometry of the vectors rather than the statistics of any particular model, the scheme generalizes without calibration.

Put another way: KIVI/KVQuant is "here is how to be smart about the outliers in this specific model." TurboQuant is "here is a geometric construction that makes outlier handling unnecessary." Both get to 2 to 3 bits. TurboQuant is easier to deploy because it does not need a calibration pass. In a production inference engine, that is the difference between "drop it in and turn it on" and "profile the model, compute channel statistics, store them alongside the weights, maintain them when the weights change."

For a vessel deployment, "drop it in and turn it on" is the feature. We cannot run a calibration job at 3am when Starlink is down.

What it does NOT do

The paper is careful about its claims, and I will be too. A few things worth being honest about.

It is K and V together, not K-only or V-only. Some earlier work (and some confusion in public discussion) hints at asymmetric bit widths where you aggressively quantize one side and leave the other in full precision. TurboQuant does not do that. Both K and V are passed through the same PolarQuant + QJL pipeline at the same bit width. The 3-bit result applies to the full KV cache, not half of it.

The benchmarks are long-context reading comprehension, not tool-calling. LongBench, Needle-In-A-Haystack, RULER, ZeroSCROLLS, L-Eval. These are the right benchmarks for testing whether a model can find and reason over information buried deep in a long context, which is the primary use case we care about for vessel deployments (the "load the whole manual" scenario). They are not the right benchmarks for testing whether a model can reliably emit valid JSON or call a tool with the exact right argument schema. The quality claim in the TurboQuant paper does not extend to structured output correctness, code generation strictness, or agent orchestration reliability. For those use cases, you still need to run your own evals.

3-bit is the accuracy sweet spot. TurboQuant reports clean results at 3 bits. At 4 bits you keep the speedup, but the memory savings shrink. Below 3 bits, quality starts to degrade on the harder benchmarks. The headline number is the 3-bit case. Do not expect magic at 2 bits or 1 bit without retraining.

The Google blog reports results on Gemma and Mistral. Not Llama, not Qwen, not DeepSeek. I expect TurboQuant to generalize because the scheme is training-free and geometry-based, but "I expect" is not "I have evals for." If you are deploying Llama 3.3 70B on your vessel tomorrow, run the benchmarks on your own checkpoints before you bet the engine room on the published numbers.

Weight quantization times KV quantization: the gotcha nobody warns you about

Here is the part of this post I think is actually the most useful if you are deploying these techniques. In theory, weight quantization and KV cache quantization are orthogonal. Weights are static parameters loaded once. KV cache is activations generated at inference time. You should be able to combine any weight quant scheme with any KV quant scheme freely.

In practice, the two interact in a few places, and you will save yourself a debugging afternoon if you understand where.

1. Error compounding through attention. Your quantized weights W_Q, W_K, W_V produce slightly noisy K and V tensors because the multiplication itself introduces quantization error. When you then quantize those K and V tensors again for the cache, you are stacking two lossy steps. Aggressive combinations like 4-bit weights plus 2-bit KV compound that error faster than either step alone. KV quantization papers (including TurboQuant) typically report quality assuming FP16 or FP8 weights. The quality delta is worse on top of INT4 or lower weights. Budget for it.

2. SmoothQuant and AWQ change the activation distribution. AWQ (Activation-aware Weight Quantization) and its cousins migrate activation outliers into the weight quantization, which means the K and V tensors feeding the cache have a flatter distribution than they would on a vanilla FP16 model. Flatter distributions are friendlier to per-token KV quantization. So AWQ + INT4 KV tends to behave better than GPTQ + INT4 KV at matched bit widths, because GPTQ does not do the activation smoothing and leaves the outlier problem for the KV quantizer to deal with. TurboQuant is less sensitive to this because its geometry-based construction handles the outliers inherently, but the error compounding from weight quant is still real.

3. FP8 weights pair naturally with FP8 KV cache. If you are running an FP8-weighted checkpoint on Hopper or Blackwell, both vLLM and TensorRT-LLM expose FP8 KV cache as the matched default. Same numeric format, same scale bookkeeping, no extra conversion steps at attention time. This is the "safe production pairing" if you are not ready to experiment with lower KV bit widths yet.

4. Llama.cpp's Q4_K_M is the exception that proves the rule. In llama.cpp the recommended KV cache precision for Q4_K_M weights is F16 or Q8_0. Dropping KV to Q4_0 with Q4_K_M weights is a combination that degrades noticeably on long contexts. This is a known issue that the community has documented in discussions for the better part of a year. It is the right combination for short-context chat; it is the wrong combination for 256K+ context. If you are running llama.cpp on a vessel, pair Q4_K_M weights with Q8_0 KV and you will be fine up through most context lengths your crew will actually hit.

5. The safe production default, today. If you are deploying right now and do not want to do your own eval work, the safe pairing is AWQ or GPTQ weights at 4-bit, plus FP8 KV cache. That is a 2x KV savings over FP16, a 4x weight savings, and a combination that has been validated on every popular checkpoint. If you have eval budget, try TurboQuant at 3-bit KV on top of the same weights. If the eval passes on your workloads, you just cut another 50% off the KV bill.
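Point 1 above is easy to see in a toy simulation. The quantizer below is a crude per-tensor uniform rounding scheme of my own, not any production kernel; the point is only that quantizing the key tensor on top of already-quantized weights stacks two lossy steps:

```python
import numpy as np

rng = np.random.default_rng(1)

def fake_quant(x, bits):
    # crude per-tensor uniform quantizer: round onto a fixed grid
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

d = 512
W_k = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in key projection
x = rng.standard_normal(d)                      # one hidden state

k_ref = W_k @ x                   # full-precision reference key
k_w4 = fake_quant(W_k, 4) @ x     # 4-bit weights, full-precision cache
k_w4_kv3 = fake_quant(k_w4, 3)    # 4-bit weights, then 3-bit cache

rel = lambda k: np.linalg.norm(k - k_ref) / np.linalg.norm(k_ref)
err_weights, err_both = rel(k_w4), rel(k_w4_kv3)
```

On this toy setup `err_both` comes out larger than `err_weights`: the two error sources are roughly independent and add in quadrature, which is exactly the compounding that the papers' FP16-weight baselines do not account for.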

The new numbers, for the scenarios James walked through in Part 2

Let me close with the table that actually matters. These are the totals for Llama 3.3 70B with 4-bit quantized weights, across the same context lengths James used in Part 2, under three KV cache schemes: FP16 baseline, FP8 production default, and 3-bit TurboQuant.

| Context | FP16 KV | FP8 KV | TurboQuant 3-bit KV |
| --- | --- | --- | --- |
| 32K | 10 GB | 5 GB | 1.9 GB |
| 128K | 40 GB | 20 GB | 7.5 GB |
| 256K | 80 GB | 40 GB | 15 GB |
| 512K | 160 GB | 80 GB | 30 GB |
| 1M | 320 GB | 160 GB | 60 GB |
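The table above can be reproduced from Llama 3.3 70B's published attention shape: 80 layers, 8 grouped-query KV heads, head dimension 128, and two cached tensors (K and V) per layer per token. The sketch treats "1M" as 2^20 tokens and reports binary gigabytes, which is the convention that makes the FP16 number land exactly on 320 GB:

```python
def kv_cache_gb(context_tokens: int, bits_per_value: float) -> float:
    """KV cache size in GiB for a Llama 3.3 70B-shaped model."""
    n_layers, n_kv_heads, head_dim = 80, 8, 128
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    total_bits = elems_per_token * context_tokens * bits_per_value
    return total_bits / 8 / 2**30

for bits, name in [(16, "FP16"), (8, "FP8"), (3, "TurboQuant 3-bit")]:
    print(f"{name} @ 1M: {kv_cache_gb(2**20, bits):.0f} GB")
# prints 320, 160, and 60 GB respectively
```

Plugging in 128K (2^17) tokens gives the 40 / 20 / 7.5 GB row the same way.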

Add the 40 GB of Q4 weights and you get the total VRAM bill:

| Context | Total @ FP16 KV | Total @ FP8 KV | Total @ 3-bit KV |
| --- | --- | --- | --- |
| 128K | 80 GB | 60 GB | 47.5 GB |
| 256K | 120 GB | 80 GB | 55 GB |
| 512K | 200 GB | 120 GB | 70 GB |
| 1M | 360 GB | 200 GB | 100 GB |

One hundred gigabytes of VRAM for Llama 3.3 70B at 1M context on-vessel. Six months ago that was a four-H100 deployment. Today, it fits inside a single NVIDIA H200 (141 GB HBM3e) with 40 GB of headroom left over for embedding models, Whisper, or a second smaller LLM running alongside. It fits inside a single Blackwell B200 (192 GB HBM3e) with room to serve multiple concurrent sessions. The hardware that used to be a four-card rack is now a one-card decision.

For the workforce-plus-guests scenario James walked through (20 concurrent sessions at 128K each), TurboQuant at 3-bit drops the total from 840 GB down to about 190 GB. On Blackwell that is a single B200 192 GB plus spillover to a second card for margin. On the Hopper generation it is a pair of H200s. In either case it is no longer a DGX-class deployment; it is a single rack with one or two cards in it, drawing 700W to 2000W depending on which generation you picked. That is absolutely something a large yacht can accommodate, power-wise and thermally, though Blackwell's 1000W-per-card envelope requires rack cooling design that some older vessels will not have out of the box. See James's GPU selection guide for which card is actually right for your specific scenario.

This is what "the math got smaller" means in practice. Same model, same context, same concurrency, 1/4 the hardware bill.

Production availability, today

Here is where we are, as of this post:

  • TurboQuant in vLLM: There is an open feature request for native integration. Independent implementations exist (the tonbistudio/turboquant-pytorch repo has a working PyTorch version; a few forks have Triton kernels). Official Google implementation is expected in Q2 2026.
  • TurboQuant in TensorRT-LLM: Not there yet. NVIDIA will almost certainly add it, but the timeline is their call.
  • TurboQuant in llama.cpp: Discussion thread open, community contributors are prototyping. MLX and Apple Silicon implementations exist for experimentation.
  • FP8 KV cache: Already in vLLM, TensorRT-LLM, SGLang production releases. This is the "safe default" I would turn on today while waiting for TurboQuant to land upstream.
  • HuggingFace QuantizedCache: Already supports 2-bit and 4-bit KV via quanto and HQQ backends. Good for experimentation on a laptop. Not the path I would ship a production vessel deployment on.

If I were speccing a yacht deployment today that wanted to be ready for TurboQuant the moment it lands upstream, I would buy the hardware now (two H100 80 GB cards), ship with AWQ 4-bit weights plus FP8 KV cache as the production baseline, and plan to turn on TurboQuant 3-bit KV in an update within the next 90 days. The hardware does not change. The software does.
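Concretely, that baseline is a few lines in vLLM. The model path below is a placeholder, and you should verify that your vLLM build and GPU generation support `kv_cache_dtype="fp8"` before shipping:

```python
from vllm import LLM, SamplingParams

# AWQ 4-bit weights + FP8 KV cache: the "safe default" pairing.
# The model path is a placeholder for whatever AWQ checkpoint you ship.
llm = LLM(
    model="path/to/llama-3.3-70b-awq",
    quantization="awq",
    kv_cache_dtype="fp8",   # 2x KV savings over FP16
    max_model_len=131072,   # 128K context per session
)
out = llm.generate(
    ["Summarize section 4 of the engine manual."],
    SamplingParams(max_tokens=256),
)
```

When TurboQuant lands upstream, the plan is that this config changes by one argument and the hardware does not change at all.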

What this actually means for on-vessel AI

Zoom out from the numbers for a moment. What TurboQuant represents is the pattern that keeps showing up in the on-vessel AI story: frontier capability that initially seems impossible on a constrained platform becomes possible six to twelve months later because a specific piece of research closes the gap. The cloud players got 1M context first because they have the GPUs. On-vessel deployments get 1M context second because the math got smaller.

We have seen this happen before with model quantization (70B weights used to need a DGX, now they fit on an L40S), with inference engines (vLLM and SGLang closed most of the throughput gap to proprietary stacks), with long-context attention mechanisms (FlashAttention, sparse attention, sliding window). KV cache quantization is the next chapter. Two weeks ago the gap between "feasible on a yacht" and "feasible in a hyperscaler" was 8x in memory. Today it is closer to 2x.

That is the pattern. It is going to keep happening. Build for it.

The series playbook

One last summary, because I want to leave you with something concrete from all three parts. Here is the practical playbook:

1. Build context-first, not RAG-first (Part 1). Load the full vessel corpus into the model's context at session start. Use a retriever only as a fallback for fleet-wide or deep-historical data that will not fit. Vercel proved the pattern works; the logic applies even harder at sea.

2. Size your GPUs for the KV cache, not just the weights (Part 2). The KV cache is the memory tax of long context and scales with concurrent users. Count it explicitly in every capacity plan.

3. Turn on quantized KV cache from day one (this post). Start with FP8, which is production-grade everywhere. Move to TurboQuant 3-bit as soon as it lands in your inference engine. Run the combination against your own evals before production.

4. Mind the weight-quant gotchas. AWQ plus FP8 KV or TurboQuant is the safest pairing. GPTQ needs more care. Q4_K_M in llama.cpp wants Q8_0 KV, not Q4_0. FP8 weights want FP8 KV. Match the quant schemes and you avoid the footguns.

5. Keep the retrieval infrastructure around as an emergency fallback. Even with 1M context running locally, you still want the ability to fall back to retrieval over historical data that predates the current session. Small, targeted retriever, not the primary lookup path.

6. Re-evaluate every six months. This field moves fast. TurboQuant itself is two weeks old as of this writing. Six months from now there will be a new paper, a new unlock, a new ceiling. Build an architecture that can absorb that.

Closing thought

I started writing this series thinking the story was going to be "1M context on a yacht is technically possible if you squint hard enough." By the time I finished reading the TurboQuant paper, the story changed. It is not "technically possible if you squint." It is "possible with two H100s, an AWQ checkpoint, and an inference engine that is about to ship the upstream integration." That is a completely different sentence. The first one is a research curiosity. The second one is a client conversation I want to have next week.

The cloud players get the headlines. The sovereign deployments get the research that makes the impossible feasible, six months later. That is fine. The link is going to drop anyway, and we are going to be the ones still running.


Interested in deploying TurboQuant or any of the other KV cache quantization schemes on your on-vessel AI stack, and not sure which combination is safe for your workload? Talk to us. We are running the evals on our own workloads right now and we will tell you which pairings we trust for production and which ones are still experiments. That is the conversation that actually matters for a vessel deployment that cannot fail at 3am in heavy weather.
