For the last year I have been writing about on-vessel AI like it is a conversation that starts with an H100 and ends with a Blackwell. That was the right framing when the only real options were data-center GPUs and the prices were $15,000 to $40,000 a card. It is not the right framing anymore, and two pieces of hardware that shipped in the last six months are the reason.
The first is NVIDIA DGX Spark, a desktop-class AI workstation built around the GB10 Grace Blackwell Superchip. It has 128 GB of coherent unified memory between its CPU and GPU. It costs about $3,999. It fits on a shelf.
The second is AMD Strix Halo, marketed as the Ryzen AI Max+ 395, which brings up to 128 GB of unified memory to an APU-class chip with an integrated RDNA 3.5 GPU and an XDNA 2 NPU. Products built on it (Framework Desktop, HP ZBook Ultra G1a, several mini-PC vendors) land between $1,800 and $4,000 depending on memory configuration. It also fits on a shelf.
Both of these are running 70B-class models locally, right now, in production deployments. Neither of them would have been plausible hardware for on-vessel AI twelve months ago. Both of them deserve a serious look for a significant fraction of the vessel deployments we quote today.
This post is about what unified memory actually means, what the real tradeoffs are, and when a $4,000 box on a shelf is a better answer than a $40,000 rack card for your vessel. It is not a replacement for the main GPU selection guide; it is the section of that guide I should have written six months ago and did not because the hardware did not exist yet.
What "unified memory" actually means
The short version: CPU and GPU share the same physical memory pool. No PCIe transfers to move data between them. When the GPU needs a tensor, it just reads it from the address in DRAM where the CPU put it. When the CPU needs a result, it reads it from the same address.
This sounds boring. It is not boring. On a discrete-GPU system, moving a 50 GB model from CPU RAM to GPU VRAM over PCIe takes meaningful time and doubles the memory you need (you have to have both copies while the transfer happens). Unified memory collapses that: your 50 GB model lives in one place and both the CPU and the GPU see it.
For LLM inference specifically, this architecture has two big consequences:
- Your memory ceiling is bigger. A consumer GPU tops out at 24 to 32 GB of VRAM, a prosumer card at 48 GB, an enterprise card at 80 to 192 GB. A unified memory system with 128 GB lets your "VRAM-equivalent" pool be almost as large as the enterprise cards, at a fraction of the price, because the memory is DDR5 or LPDDR5X rather than HBM.
- Your memory bandwidth is much smaller. HBM3e on an H200 runs at 4.8 TB/s. HBM3 on an H100 runs at 3.35 TB/s. The unified memory on DGX Spark runs at 273 GB/s. The unified memory on Strix Halo runs at about 256 GB/s in the best configurations. That is roughly 12 to 18 times slower than HBM.
Memory size lets you hold bigger models. Memory bandwidth determines how fast tokens come out. You get the first benefit and pay for it with the second.
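You can put a rough upper bound on the bandwidth side of this tradeoff. At batch size 1, every generated token has to stream essentially the full set of quantized weights through the memory bus, so bandwidth divided by weight size bounds single-stream tokens per second. A back-of-the-envelope sketch using the bandwidth figures above (the batch-1, no-speculative-decoding simplification is mine, and KV cache reads are ignored):

```python
# Rough single-stream decode ceiling: tok/s <= bandwidth / bytes read per token.
# Assumption: batch size 1, no speculative decoding, so each token reads
# approximately all quantized weights once; KV cache traffic is ignored.

def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

WEIGHTS_70B_Q4_GB = 40  # Llama 3.3 70B at Q4, as quoted in this post

for name, bw in [("H200 (HBM3e)", 4800), ("L40S (GDDR6)", 864),
                 ("DGX Spark (LPDDR5X)", 273), ("Strix Halo (LPDDR5X)", 256)]:
    print(f"{name}: <= {decode_ceiling_tok_s(bw, WEIGHTS_70B_Q4_GB):.1f} tok/s")
```

Real systems land well under these ceilings, which is expected; the point is that the ordering of the four systems is fixed by memory bandwidth alone, and no amount of extra compute changes it.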
NVIDIA DGX Spark in specifics
DGX Spark is NVIDIA's attempt to put a Grace Blackwell system on a developer's desk. It is built around the GB10 Grace Blackwell Superchip, which fuses a 20-core ARM CPU (10 Cortex-X925 performance cores plus 10 Cortex-A725 efficiency cores) with a Blackwell GPU in a single package. The CPU and GPU are connected by NVLink-C2C running at 900 GB/s internal bandwidth.
The specs that matter for LLM inference:
- 128 GB unified LPDDR5X memory, shared between CPU and GPU
- 273 GB/s memory bandwidth (this is the ceiling on token throughput)
- 1 PFLOP sparse FP4 tensor performance on the Blackwell GPU
- Desktop form factor, ships as a Founders Edition at $3,999
The LMSYS team did a real review with actual benchmarks, and Tom's Hardware published a direct comparison against Strix Halo. The headline is that DGX Spark runs quantized 70B models at a few tokens per second, can hold a 200B-parameter model in memory (which is something you cannot do on any discrete GPU at any price without multi-card NVLink), and draws about 240W at full load.
Two hundred and forty watts. That is for a box running a quantized 70B model locally, while the data-center servers doing roughly the same job draw kilowatts, on the order of ten times the power.
The GPU compute itself is in the neighborhood of an RTX 5070 to 5070 Ti, which is why throughput is modest. But for workloads that are memory-size-bound rather than compute-bound (the "load the whole vessel corpus and ask questions about it" use case Ethan wrote about in Part 1 of the context series), memory size is the binding constraint and DGX Spark's 128 GB ceiling is exactly what you wanted.
AMD Strix Halo in specifics
AMD's Ryzen AI Max+ 395 (codename Strix Halo) is the other interesting option, and it comes at this problem from a completely different direction. AMD built what is essentially a laptop APU and scaled it up to workstation class: 16 Zen 5 cores, RDNA 3.5 integrated GPU with 40 compute units (about 60 TFLOPS), XDNA 2 NPU for local AI acceleration, and a wide 256-bit LPDDR5X memory bus that tops out at 128 GB of system memory. Of that, up to 96 GB can be carved out as dedicated GPU VRAM via AMD Variable Graphics Memory, and the integrated GPU can actually access up to 115-120 GB via the Graphics Translation Table.
You can buy Strix Halo today in several form factors:
- Framework Desktop: $1,999 base, $2,699 with 128 GB unified memory. Modular desktop designed to be hackable and serviced over its lifetime, which matches vessel maintenance realities better than most hardware.
- HP ZBook Ultra G1a: $3,000 to $4,500 depending on configuration. Laptop form factor, not ideal for rack deployment but excellent as a portable "bring the AI to the meeting" device.
- Several mini-PC vendors (GMKTec, Beelink, Minisforum, and others) shipping small form factor boxes for $1,800 to $3,200.
On LLM performance: community benchmarks show Strix Halo running smaller models (7B-13B at Q4) at 34-38 tokens per second, which is real production throughput. On 70B at Q4, it lands around 5 tokens per second. Not great for interactive chat, but perfectly reasonable for async workloads, summarization, report generation, and the long-running analysis tasks that vessels actually have more of than they have interactive ones.
AMD has posted a guide showing Strix Halo systems clustered to run trillion-parameter models locally. That is the extreme case and not what most vessels need, but it tells you something about the ceiling of where this architecture is going.
The tradeoff table
Here is where I put the numbers side by side, because this is the part that matters when you are actually making a decision. Let me compare a single Llama 3.3 70B Q4_K_M deployment across three configurations: an L40S (the current yacht default), an H200 (the new Hopper refresh), and a DGX Spark-class unified memory system.
| | L40S | H200 | DGX Spark (GB10) |
|---|---|---|---|
| Memory | 48 GB GDDR6 | 141 GB HBM3e | 128 GB unified LPDDR5X |
| Bandwidth | 864 GB/s | 4.8 TB/s | 273 GB/s |
| Power | 350W | 700W | 240W |
| Form factor | Full server rack card | Full server rack card | Desktop box |
| Cost | $8,500 | $32,000 | $3,999 |
| 70B Q4 weights fit? | Tight (40 GB weights) | Yes with huge margin | Yes with margin |
| 70B Q4 + 128K context? | With FP8 KV cache, yes | Yes, comfortably | Yes |
| 70B Q4 + 1M context? | No | With KV quant, yes | Technically yes (KV quant), practically slow |
| Throughput on 70B | ~15 tok/s per card | ~60 tok/s per card | ~5 tok/s |
| Concurrent sessions | 2-4 | 10+ | 1-2 |
Read the last three rows carefully. The L40S and H200 are fast but expensive and physically large. The DGX Spark is cheap, small, low-power, can hold the model, but generates tokens at about one-twelfth the speed. For a single user reading long-context replies, 5 tok/s is usable (a fast human reads at maybe 4 tok/s of English). For a crew member asking quick operational questions, 5 tok/s feels laggy. For a guest concierge agent responding to "book me a table," it is borderline unacceptable.
This is the whole tradeoff. Unified memory buys you model size. Discrete GPUs buy you throughput. Pick your workload and the answer falls out.
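The context-length rows in the table come down to KV cache arithmetic. A sketch using the published Llama 3 70B architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128); the per-context totals are my arithmetic, not vendor figures:

```python
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
# Shape figures are the published Llama 3 70B architecture (GQA with 8 KV heads).

def kv_gb(context_tokens: int, bytes_per_elem: float,
          layers: int = 80, kv_heads: int = 8, head_dim: int = 128) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1e9

print(f"128K @ FP8 KV:        {kv_gb(128_000, 1.0):.0f} GB")    # ~21 GB
print(f"1M   @ FP8 KV:        {kv_gb(1_000_000, 1.0):.0f} GB")  # ~164 GB
print(f"1M   @ ~2-bit KV:     {kv_gb(1_000_000, 0.25):.0f} GB") # ~41 GB
```

On a 128 GB unified-memory box already holding ~40 GB of Q4 weights, 1M context only fits once the KV cache is quantized well below 8 bits per element, which is what the "technically yes (KV quant)" cell in the table is really saying.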
When to buy a unified memory system for a vessel
Here is my honest recommendation matrix, after speccing these systems into real quotes over the last two months.
Buy a unified memory system when:
- Your workload is primarily memory-bound, not throughput-bound. Loading the full vessel corpus and asking occasional questions. Batch document analysis. Overnight incident reports. Report generation. Research-style queries from the owner or captain. These are workloads where the answer can take ten seconds without anyone noticing.
- You have tight power or space constraints. A 65-meter yacht with no dedicated rack space cannot install an H200 box. It absolutely can install a DGX Spark on a shelf in a locker that already has some UPS-backed power available. For vessels where rack space and cooling capacity are the binding constraints, unified memory wins by default.
- You want N-of-M redundancy at low total cost. Four DGX Sparks at $4,000 each is $16,000 in hardware. That is about the cost of two L40S cards, and you get four independent compute nodes that can fail independently. For a vessel where "the AI must not go offline entirely" is more important than "any single query must be fast," four small boxes beat one big card decisively.
- You are piloting a capability you are not sure will stick. A unified memory box is a real system you can install, run production on, and rip out in six months if the workload does not materialize. A rack of H200s is a $100K commitment. The unified memory systems let you prove the use case before you spend real money.
- You want to run experiments without consuming the production rack. Even on vessels that already have a GPU rack, adding a DGX Spark as a "dev and experimentation box" is the cheapest possible way to let your AI engineer actually try things without fighting for production GPU time.
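The redundancy point above is cheap to exploit in software. A minimal failover sketch, assuming four identical inference boxes behind hypothetical hostnames (real deployments would add HTTP health checks and retry logic):

```python
# Minimal N-of-M failover sketch. Node URLs are hypothetical placeholders;
# a production version would probe each node's health endpoint over HTTP.
import random

NODES = ["http://ai-node-1:8080", "http://ai-node-2:8080",
         "http://ai-node-3:8080", "http://ai-node-4:8080"]

def pick_node(healthy: set) -> str:
    """Route a request to any healthy node; fail only if all are down."""
    candidates = [n for n in NODES if n in healthy]
    if not candidates:
        raise RuntimeError("all inference nodes offline")
    return random.choice(candidates)
```

The design point is that with four independent boxes the failure mode is gradual capacity loss, whereas a single rack card is a single point of failure no matter how fast it is.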
Do not buy a unified memory system when:
- You need interactive latency on a large model. A captain talking to a voice assistant about engine room readings cannot tolerate 5 tok/s. They need 20+ tok/s, and that means HBM memory bandwidth, which means H100 or better.
- You need concurrent sessions. Unified memory systems degrade badly when you try to run multiple inference streams on the same model, because the memory bandwidth is not high enough to serve multiple attention queries in parallel. An L40S at 864 GB/s can serve 2 to 4 concurrent sessions on a 70B. A DGX Spark at 273 GB/s can reasonably serve 1 to 2. The unified memory architecture is fundamentally a single-heavy-workload architecture.
- You are running computer vision pipelines at frame rate. CV is compute-bound, not memory-bound. You need the TFLOPs, not the memory pool. Unified memory systems are bad at this.
- You need the inference stack to match your existing cloud infrastructure. vLLM, TensorRT-LLM, and SGLang all run well on discrete NVIDIA GPUs. Strix Halo works, but it relies on ROCm with llama.cpp-style stacks instead, and the ergonomics are different. If your team has existing operational familiarity with the NVIDIA side, there is a learning curve on Strix Halo that matters.
The specific recommendation matrix I am giving clients right now
Putting it all together, here is the flowchart I am actually using with clients this month.
"I want AI on my boat and I do not have strong opinions on what": DGX Spark or Strix Halo mini-PC, $4K in hardware, one week to commission, runs a quantized 70B locally with all the memory you need for long context. This is my new default recommendation for smaller vessels (under 60 meters) and for any deployment where the use case is "the owner and maybe one other person will talk to it occasionally."
"I want AI on my boat for the crew during daily operations": Two L40S cards in a proper server rack, same as the existing recommendation in the main GPU guide. This is still the right call for workloads that need real throughput for multiple concurrent users with reasonable latency.
"I want 1M context on 70B for one primary user": One H200. The new default for this specific scenario. Single card, fits the workload, handles the context length.
"I want the ceiling, I am a commercial operator or cruise line, and I have real rack space": Blackwell B100 or B200. Actually makes sense now.
"I want to run Claude-scale frontier models locally and I do not care about cost": You cannot. That is not yet feasible at any sane price point on a vessel. Use the cloud for this specific workload and fall back to a local 70B when the link drops.
"I want to experiment before committing": DGX Spark, always, until you know what you actually need. $4K to find out if the use case sticks is the best deal in AI infrastructure right now.
The quiet shift this represents
Six months ago, I would have told you that "AI on a yacht" meant a server rack with an enterprise GPU, power conditioning, liquid cooling for the bigger configurations, and an integrator who knew what they were doing. That was the only way to run a useful local model.
Today, "AI on a yacht" can also mean a single desktop-sized box with 128 GB of shared memory, 240W of power draw, and a price tag lower than a single new mainsail. That is a meaningful democratization of what used to be hyperscaler-class hardware. More importantly for our customers, it is a new price point that opens on-vessel AI to vessels that were priced out of the enterprise GPU conversation entirely.
The high end still exists, and Blackwell is still the right call for operators who need the ceiling. But the bottom of the market just got a lot more interesting, and some of the quotes I am writing this month look nothing like the quotes I was writing last year. The hardware categories multiplied, the price floor dropped, and the total addressable market for sovereign AI at sea expanded by a meaningful amount.
Combined with the KV cache quantization research Ethan wrote about in the TurboQuant piece, the feasibility frontier has moved significantly in the last six months. More memory per dollar on the hardware side. Smaller KV cache per token on the software side. Long-context 70B workloads that were impossible on anything short of a DGX rack last year are now feasible on a desktop-sized box this year.
That is the pattern. That is why the small form factor matters. And that is why, for the first time, I would seriously recommend a $4,000 box over a $40,000 rack for a specific class of vessel deployment.
What I would actually ship today
If you are reading this because you are actively speccing a vessel deployment and you want my honest short-list of what I would put on a boat in April 2026:
Under 60 meters, single primary user, knowledge-work use case: Framework Desktop or DGX Spark, 128 GB unified memory, Llama 3.3 70B Q4_K_M, FP8 KV cache, 512K context cap, llama.cpp or vLLM serving stack. Total installed cost under $8,000 including a proper UPS and enclosure.
60 to 90 meters, small crew plus occasional guest use: Two L40S in a server rack. Llama 3.3 70B Q4, 256K context, vLLM serving multiple concurrent sessions. Total installed cost around $45,000.
Over 90 meters, flagship use case, long-context 70B for a single primary user: One H200 in a server rack, Llama 3.3 70B at full 1M context with TurboQuant KV cache. Total installed cost around $60,000.
Commercial cruise operator or fleet owner: Blackwell B200 pair. Llama 3.3 70B or 405B, high concurrency, FP4 precision. Total installed cost $150,000+ per vessel, scales with fleet size.
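For the smallest profile above, the serving side is a single process. One plausible llama.cpp launch invocation, purely as a sketch: the model path and port are illustrative, the flag names match recent llama.cpp builds, and q8_0 is llama.cpp's 8-bit KV cache type (its closest equivalent to the FP8 KV cache named in the profile; quantizing the V cache may additionally require enabling flash attention depending on the build):

```shell
# Hypothetical llama-server launch for the small-vessel profile:
#   -m    path to the Q4_K_M 70B GGUF (illustrative path)
#   -ngl  offload all layers to the GPU (unified memory holds everything)
#   -c    context cap in tokens (512K)
#   --cache-type-k/v  8-bit quantized KV cache
llama-server \
  -m /models/llama-3.3-70b-instruct-q4_k_m.gguf \
  -ngl 99 \
  -c 524288 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8080
```

That is the whole stack for the single-user profile: one binary, one model file, an OpenAI-compatible endpoint on the local network, and nothing that needs a rack.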
Each of these is a real deployment profile I have quoted in the last six weeks. None of them require compromising the core narrative we have been writing about for a year: the AI should keep working when the link drops, the data should stay on the vessel, and the architecture should degrade gracefully rather than fail catastrophically. Unified memory systems do not change any of that. They just give us another way to hit those requirements, at a price point that opens the market to vessels that were not in the conversation before.
Considering a DGX Spark, Strix Halo system, or full enterprise GPU rack for your vessel and not sure which one fits your workload? Talk to us. We have been deploying all three categories in the last two months and we will tell you honestly which one is the right spend for your specific use case. For most vessels under 60 meters, the answer is probably not the $40,000 card. For most flagships, it probably is. The only way to know is to talk through the actual workload.
