Claude Opus 4.6 ships with a 1 million token context window. So does GPT-5. So does Gemini 2.5. If you are building anything serious in AI right now, you are already working with a million-token context budget. That is not news anymore. It is table stakes.
The interesting question, the one the industry is actually arguing about, is a harder one: when should you just load everything into the context, and when should you retrieve?
That tension is the whole game for anyone deploying on-vessel AI. It is also the first thing I want to write about in a three-part series on what it actually takes to run long-context models at sea. This is Part 1. Part 2, from James, covers the real GPU and memory math for running 1M context locally. Part 3 covers Google's TurboQuant research, published two weeks ago, which changes the feasibility conversation entirely by compressing the KV cache to 3 bits with no accuracy loss.
Let me start with the surprising eval that kicked this series off.
## Vercel ran the experiment nobody else bothered to run
On January 27, Jude Gao at Vercel published a benchmark comparing how well AI coding agents handle new Next.js 16 APIs that are not in any model's training data. The APIs in question: `connection()`, `'use cache'`, `cacheLife()`, `forbidden()`, and the new async `cookies()` and `headers()` primitives. These are all post-training-cutoff, which means the model has no prior knowledge of any of them. It either learns from what you put in front of it, or it gets the code wrong.
Vercel tested four configurations:
- Baseline: no docs. Model just uses what it remembers.
- Skill (default): docs wrapped in an agentskills.io skill, available for the agent to invoke on demand.
- Skill with explicit "use this" instructions: same skill, but the system prompt tells the agent to use it.
- AGENTS.md: an 8KB compressed markdown index (down from 40KB of full docs) embedded directly in the context, with pointers to retrievable `.next-docs/` files.
The results (pass rates on their Build / Lint / Test / Overall suite):
| Configuration | Overall Pass |
|---|---|
| Baseline (no docs) | 53% |
| Skill (default) | 53% |
| Skill + explicit instructions | 79% |
| AGENTS.md (always loaded) | 100% |
Read the second row again. The skill, when made available but not explicitly invoked, scored exactly the same as no docs at all. In 56% of cases, the agent simply did not reach for the skill it was given. The information was there. The agent did not use it.
The "dumb" approach of just loading the compressed docs straight into context won outright.
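To make the winning configuration concrete, here is a hypothetical sketch of what a compressed AGENTS.md-style index might look like. The API names come from Vercel's eval; the file layout, descriptions, and `.next-docs/` paths are my illustration of the pattern, not their actual file:

```markdown
# Next.js 16 API index (compressed)

## connection()
Opts a component into dynamic rendering at request time.
Full docs: .next-docs/connection.md

## 'use cache' / cacheLife()
Directive-based caching; cacheLife() sets the revalidation window.
Full docs: .next-docs/use-cache.md

## cookies() / headers()
Now async: `const c = await cookies()`. Same for headers().
Full docs: .next-docs/request-apis.md
```

The point of the structure: the index itself is small enough to always load, and each entry tells the agent exactly which file to pull when it needs the details.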
## Why this is the only eval result that matters for vessel AI
The Vercel finding lands particularly hard for on-vessel deployments because the failure modes of retrieval-on-demand are a lot worse when you are in an engine room at 3am with a Starlink link that is dropping packets. The Vercel agents had perfect infrastructure. They failed anyway, because the agent itself chose not to retrieve.
On a vessel, you get to add:
- The retrieval path has its own failure modes. If your RAG pipeline depends on a vector database, that database has to be running, have enough memory, and return results inside your latency budget. One more thing to monitor, one more thing to fail.
- Stale embeddings are invisible until they bite you. If you re-indexed last month but added a new service bulletin yesterday, the retriever will confidently serve you the old one and the model will confidently answer with it.
- Degraded connectivity does not degrade gracefully. If your agent decides not to retrieve, or retrieves the wrong chunk, the user does not see "retrieval failed." They see a confident wrong answer. That is the worst failure mode there is.
The honest version of the RAG-vs-context debate is that retrieval adds infrastructure, infrastructure adds failure modes, and the agent might not even use it. Loading everything once, at session start, costs more tokens but removes an entire category of "why did it not find that" post-mortems.
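If you do keep a retriever in the loop, one cheap mitigation is to make retrieval failure loud instead of silent, so the model can say "I could not check" rather than answering confidently from nothing. A minimal sketch; the `store.search` interface is a stand-in, not any specific vector-database client:

```python
from dataclasses import dataclass

@dataclass
class RetrievalResult:
    chunks: list[str]
    ok: bool          # False when the retrieval path itself failed
    note: str = ""    # surfaced to the model so it can admit the gap

def retrieve_or_admit(store, query: str, timeout_s: float = 2.0) -> RetrievalResult:
    """Wrap retrieval so failures become an explicit signal, not a silent gap."""
    try:
        chunks = store.search(query, timeout=timeout_s)
        if not chunks:
            return RetrievalResult([], False, "retrieval returned no matches")
        return RetrievalResult(chunks, True)
    except Exception as exc:  # network down, index missing, latency blown, etc.
        return RetrievalResult([], False, f"retrieval unavailable: {exc}")
```

Feeding `note` into the prompt when `ok` is false turns "confident wrong answer" into "correct answer with a caveat," which is the failure mode you can actually live with at sea.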
## When context actually beats RAG
Here is the part that surprised me when I started running the numbers for our deployments. For a single-vessel knowledge base, the entire thing fits in 1M tokens almost every time.
Consider what a real "everything we know about this vessel" corpus looks like:
- The OEM operating manual for the vessel: ~400 pages, call it 200,000 tokens
- Two years of daily maintenance logs: ~50,000 tokens
- The crew handbook and SMS manual: ~80,000 tokens
- Every service bulletin, class notice, and safety alert: ~30,000 tokens
- Charter guest history for the last three seasons: ~20,000 tokens
- Vendor spec sheets for every major piece of equipment: ~100,000 tokens
That is ~480,000 tokens. Under half of a 1M context budget. Everything the model could possibly need to answer a crew question about the vessel, loaded once, referenced forever in the session.
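As a sanity check, the arithmetic above in a few lines. The token counts are the rough estimates from the list, not measurements:

```python
# Rough single-vessel corpus estimate, in tokens (figures from the list above)
corpus = {
    "OEM operating manual":               200_000,
    "two years of maintenance logs":       50_000,
    "crew handbook + SMS manual":          80_000,
    "service bulletins / class notices":   30_000,
    "charter guest history":               20_000,
    "vendor spec sheets":                 100_000,
}

total = sum(corpus.values())
budget = 1_000_000
print(f"{total:,} tokens, {total / budget:.0%} of a 1M budget")
# 480,000 tokens, 48% of a 1M budget
```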
You lose the cleverness of retrieval. You gain the reliability of not needing it.
For fleet-scale knowledge (fifty vessels, millions of pages), you obviously still need retrieval. Nobody is arguing for cramming fleet-wide data into one context. But the single-vessel use case that makes up most of what we actually build for clients? Context wins, and Vercel just gave us the receipts.
## The asterisk: what Vercel's result does NOT say
A few things worth being honest about before anyone takes this as "RAG is dead":
It was a coding eval. Vercel was testing whether an agent could write Next.js 16 code using new APIs. That is a narrow, well-defined task with clear success criteria. Your "what should I do about the generator alarm" question at sea is messier, and the answer might benefit from retrieval over specific incident reports that do not fit in context.
The compressed docs were 8KB. That is small. Their 100% result was with a tightly curated index that pointed to larger files loaded on demand. This is not "load all 40KB of Next.js docs into every prompt." It is "load a pointer structure and let the agent pull specifics." That middle path matters.
Agents got worse with too much context in other studies. Research on "lost in the middle" and context rot shows that past a certain point, stuffing the context hurts recall. The Vercel study hit the sweet spot. Yours might not.
What the result really says is: when your agent needs a specific piece of information to not hallucinate, the most reliable way to get it in front of the model is to put it in the context from the start. On-demand retrieval is a second-best pattern unless you genuinely cannot fit the corpus.
## What this means for on-vessel AI deployments
I have been rewiring our reference architecture in my head ever since that Vercel post landed. Here is the shift:
Old default: embed the vessel corpus in a vector database, put a retriever in front of the LLM, chunk strategy tuned to top-5 recall. RAG-first.
New default: load the full vessel corpus into the model's context at session start. Keep a lightweight retriever for the edge cases (fleet-wide data, historical incidents the model cannot keep in active context across session restarts). Context-first, retrieval as a fallback.
The reason the new default works is the one thing Vercel's numbers prove: agents do not reliably know when to retrieve. A human thinks "oh, I need to look that up"; an agent does not. An agent will happily answer from memory, from pre-cutoff training data, from the last thing it read, from nothing at all. The only way to guarantee the right facts are in front of the model is to not make retrieval the model's decision.
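In code, the shift is small. Here is a sketch of the context-first bootstrap under stated assumptions: the corpus lives as markdown files on disk, and the 4-characters-per-token estimate is a placeholder for a real tokenizer:

```python
from pathlib import Path

def build_session_context(corpus_dir: str, budget_tokens: int = 1_000_000) -> str:
    """Load the whole vessel corpus into the prompt at session start.

    Raises instead of silently truncating: if the corpus outgrows the
    budget, you want to know, and fall back to retrieval deliberately.
    """
    parts, used = [], 0
    for doc in sorted(Path(corpus_dir).glob("**/*.md")):
        text = doc.read_text()
        cost = len(text) // 4  # crude estimate; use a real tokenizer in production
        if used + cost > budget_tokens:
            raise ValueError(f"corpus exceeds budget at {doc.name}; fall back to retrieval")
        parts.append(f"## {doc.name}\n{text}")
        used += cost
    return "\n\n".join(parts)
```

The deliberate design choice is the `raise`: overflow is an architecture decision (move to retrieval), not something a loader should paper over by dropping documents.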
On a moving vessel, that guarantee is worth the extra tokens.
## The catch nobody is going to tell you
Here is where this series stops being fun and starts being hard. Running 1M context on an Anthropic-hosted model is easy. You pay per token, Anthropic runs the GPUs, the latency is fine. Running 1M context locally, on a GPU in your engine room, on a 70B parameter open-weights model that you own end-to-end, is a completely different problem. The KV cache alone for Llama 3.3 70B at 1M tokens in FP16 is 320 gigabytes. Not the model weights. Just the key-value cache for the attention mechanism.
Before anyone jumps in: yes, people have been quantizing the KV cache for a while. This is not new. vLLM has shipped FP8 KV cache in production since 2024. TensorRT-LLM supports INT8 and FP8 KV cache with separate K/V scale factors. Llama.cpp lets you flip KV cache precision independently from weight precision. KIVI from Rice and CMU and KVQuant from Berkeley were the first to push toward 2-bit with clever per-channel / per-token grouping. FP8 KV cuts the 320 GB number in half. INT4 cuts it to 80 GB. KIVI-style 2-bit schemes take it toward 40 GB.
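Those numbers all fall out of one formula. A sketch using Llama 3.3 70B's published GQA shape (80 layers, 8 KV heads, head dim 128); the decimal-GB results land a shade above the round numbers in the text (~328 GB at FP16 versus the 320 GB shorthand), which is the usual GB-vs-GiB rounding:

```python
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bits_per_elem: int = 16) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens elements."""
    bytes_total = 2 * layers * kv_heads * head_dim * tokens * bits_per_elem / 8
    return bytes_total / 1e9  # decimal gigabytes

for bits, label in [(16, "FP16"), (8, "FP8"), (4, "INT4"), (2, "2-bit")]:
    print(f"{label}: {kv_cache_gb(1_000_000, bits_per_elem=bits):.0f} GB")
# FP16: 328 GB / FP8: 164 GB / INT4: 82 GB / 2-bit: 41 GB
```

Note what is absent from the formula: the 70B of weights. The cache scales linearly with context length and is a separate bill on top of whatever the weights cost you.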
That is the state of play before the thing I actually want to point at. Two weeks ago, on March 24, Google Research published TurboQuant (arXiv:2504.19874, ICLR 2026). It is a training-free, data-oblivious KV cache quantization scheme that hits 3 bits per element with no accuracy loss on Gemma and Mistral across LongBench, Needle-In-A-Haystack, RULER, ZeroSCROLLS, and L-Eval. The reported numbers are 6x less KV cache memory and up to 8x speedup on H100 GPUs for the attention logit computation. That is the kind of delta that actually resets the conversation about what a vessel rack can hold.
That number is why the cloud players can promise 1M context today and most local deployments still top out at 32K or 128K in production. The cloud players are running the math across thousands of GPUs in a data center. You are running the same math on a rack in a lazarette. TurboQuant is the first result I have seen that closes enough of that gap to matter for us.
James is going to walk through the memory math in Part 2. How much VRAM do you actually need for 70B at 1M context, before and after KV quantization? At 405B? What changes when it is one user versus the whole crew versus the whole crew plus every guest? The answer is less depressing than it was six months ago.
Part 3 is the TurboQuant deep dive. How the two-stage PolarQuant plus Johnson-Lindenstrauss construction actually works, why it is training-free where everything before it needed calibration data, how it relates to the KIVI/KVQuant line of research that preceded it, and where the honest quality tradeoffs lie. The headline is that 1M context on Llama 3.3 70B stops requiring a 4x H100 node and starts being a pair-of-H100s conversation. Not because the cloud got cheaper, but because the math got smaller.
For now, the takeaway from Part 1 is small but real. Context beats retrieval more often than the RAG-first orthodoxy wants to admit. Vercel proved it for coding agents. The same logic applies, with extra force, to anyone deploying AI in an environment where "the agent just did not retrieve" is the difference between a correct answer and a dangerous one.
If you are building on-vessel AI, the right question is not "how do I make RAG more reliable?" The right question is "how much of this actually needs to be retrieved, and what happens if I just put it all in context?"
For a single vessel, the answer is usually: all of it, and it works better than you think.
Deploying AI on a vessel and wrestling with when to retrieve vs. when to load? Talk to us. We have strong opinions on this architecture and we have the receipts. Part 2 of this series covers the real cost of running 1M context locally, Part 3 covers the quantization research that makes it feasible.
