
Last month a charter operator asked me to scope an AI system for a 70-meter motor yacht. The brief was simple: the owner wanted a private AI assistant that could answer anything, handle guest requests in six languages, and never touch the internet. Not for privacy theater. Because he had watched the concierge chatbot on a previous charter go dark for eleven hours during a satellite handover gap south of Crete, and he never wanted to see that again.

The model class that makes that brief possible is 70 billion parameters. Not 7B (too limited for multi-turn reasoning across domains), not 405B (too large for a sane vessel power budget). 70B is where you get genuine depth of knowledge, strong multilingual capability, and reliable instruction-following, all in a package that fits on hardware you can actually install on a yacht.

Here is what the complete deployment looks like. Hardware, quantization, serving stack, benchmarks, and the parts that only matter when your data center is floating.

Why 70B is the vessel sweet spot

The 70B parameter class (Llama 3.1 70B, Llama 3.3 70B Instruct, Qwen 2.5 72B, Mistral Large) hits a specific set of capabilities that smaller models do not. For on-vessel use cases, three of those capabilities are non-negotiable.

Multilingual fluency. A 7B or 8B model can handle basic prompts in major languages. A 70B model handles nuanced multi-turn conversation in French, German, Italian, Spanish, Mandarin, and Arabic without quality degradation. On a superyacht where guests and crew collectively speak a dozen languages, this is the difference between a useful tool and a toy.

Complex reasoning under long context. Guest concierge queries look simple on the surface ("book dinner at 8pm"), but in practice they chain: check the calendar, check dietary restrictions from the guest profile, check which restaurant has availability, confirm in the guest's language, handle a follow-up question. A 70B model with 128K tokens of context handles that chain reliably. Smaller models start dropping context or hallucinating steps.

Knowledge depth. The owner who wants to ask about marine diesel maintenance procedures, Mediterranean port regulations, or the history of the island they are anchoring off does not want a shallow answer. 70B models trained on 15+ trillion tokens carry enough parametric knowledge to give substantive responses across domains without retrieval, which matters when you are operating without a cloud fallback.

That last point is the whole thesis of sovereign AI on a vessel. All of humanity's knowledge, distilled into weights that live on hardware you control, running when the satellite link is down, when you are in a coverage gap, when you simply do not want your data leaving the hull. The 70B class is where the model is finally good enough to make that promise real.

The VRAM math: what actually fits where

A 70B parameter model in FP16 (full precision, 2 bytes per parameter) weighs about 140 GB. Short of the very newest data-center parts (an H200's 141 GB would hold the weights with nothing left over for cache), no single GPU holds 140 GB of VRAM. You either use multiple GPUs with tensor parallelism, or (more practically for a vessel) you quantize.

Quantization compresses the model weights from 16-bit floating point down to 4-bit or 8-bit integers. The weights get smaller, the model gets faster, and the quality loss is surprisingly small. Here is the real size table for Llama 3.1 70B at common quantization levels:

Quantization  | Format | Weight size | Quality vs FP16
FP16 / BF16   | Native | ~140 GB     | Baseline
INT8 (Q8_0)   | GGUF   | ~74 GB      | Virtually lossless
Q6_K          | GGUF   | ~58 GB      | Near-FP16
Q5_K_M        | GGUF   | ~50 GB      | Minimal degradation
Q4_K_M        | GGUF   | ~42.5 GB    | Less than 2% perplexity increase
AWQ INT4      | AWQ    | ~35 GB      | ~95% quality retention
GPTQ INT4     | GPTQ   | ~35 GB      | ~90% quality retention

The numbers that matter for hardware selection: at 4-bit quantization, the model weights fit in 35–42 GB depending on the format. That is what you design around.
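The arithmetic behind that table is worth sanity-checking yourself. A minimal sketch — parameter count and bits per weight are the only real inputs; the overhead multiplier for quantization scales and non-quantized layers is an assumption, not a published figure:

```python
def weight_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.0) -> float:
    """Approximate VRAM footprint of model weights.

    params_b: parameter count in billions
    bits_per_weight: effective bits per weight (mixed-precision formats like
        Q4_K_M land between 4.5 and 5 effective bits)
    overhead: multiplier for scales/zero-points and unquantized layers (assumption)
    """
    bytes_total = params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

print(weight_size_gb(70, 16))  # 140.0  (FP16 baseline)
print(weight_size_gb(70, 4))   # 35.0   (flat 4-bit floor)
```

The spread between the 35 GB floor and the ~42.5 GB of Q4_K_M is exactly that effective-bits-per-weight difference: mixed-precision formats spend extra bits on the layers that hurt most when compressed.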

But weights are only half the memory story. The other half is the KV cache, the memory the model uses to remember previous tokens in the conversation. James covered this in detail in the GPU math post, but the short version is: KV cache scales linearly with context length and concurrency. At 32K context with a single user on a 70B model, the KV cache in FP16 is about 10 GB. At 128K context, it is about 40 GB. With KV cache quantization (FP8 or the newer 3-bit TurboQuant from Google), those numbers drop by 2x to 5x.
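Those KV-cache figures fall out of the model's architecture. Llama 3.1 70B uses grouped-query attention with 80 layers, 8 KV heads, and a head dimension of 128, so the per-token cache cost is fixed and the rest is multiplication — a quick sketch:

```python
def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache: 2 (K and V) x layers x kv_heads x head_dim
    x bytes, for every token in context. Defaults match Llama 3.1 70B;
    bytes_per_elem=2 is FP16, 1 is FP8."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token_bytes / 1e9

print(round(kv_cache_gb(32_768), 1))                     # 10.7  (32K, FP16)
print(round(kv_cache_gb(131_072), 1))                    # 42.9  (128K, FP16)
print(round(kv_cache_gb(32_768, bytes_per_elem=1), 1))   # 5.4   (32K, FP8)
```

Multiply the single-sequence number by your expected concurrency and you have the cache side of the VRAM budget.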

So the total VRAM budget for a quantized 70B deployment looks like this:

  • Model weights (AWQ INT4): ~35 GB
  • KV cache (32K context, FP8, single user): ~5 GB
  • Framework overhead: ~3–5 GB
  • Total: ~43–45 GB

That is the number you compare against your GPU's VRAM.

GPU hardware: what goes in the rack

James wrote the full GPU selection guide and I am not going to repeat it here. What I will do is show you exactly which configurations work for a 70B deployment and which do not.

Single H100 SXM5 (80 GB HBM3, 3.35 TB/s bandwidth). A quantized 70B at AWQ INT4 fits with about 35–45 GB of headroom for KV cache, framework overhead, and a secondary model (embedding or speech-to-text). This is the comfortable single-card option. Single-user decode speed on H100 with vLLM is in the 35–40 tokens per second range, which is fast enough that the response feels instant.

Single L40S (48 GB GDDR6, 864 GB/s bandwidth). AWQ INT4 weights fit (~35 GB), but you only have about 13 GB left for everything else. That is enough for short contexts and a single concurrent user, but it gets tight fast. At 32K context in FP16 KV cache, you are already over budget. With FP8 KV cache quantization, it works, but there is no room for a second model. This config is viable for a dedicated LLM-only card in a multi-card setup, but marginal as a standalone.

Dual L40S with tensor parallelism (96 GB total). This is the configuration we deploy most often for 70B on yachts. You get 96 GB of combined VRAM, the quantized model splits across both cards via vLLM's tensor parallelism (--tensor-parallel-size 2), and you have ample room for KV cache, an embedding model, and Whisper for speech-to-text. Total silicon cost is $15,000–$19,000 versus $25,000–$35,000 for a single H100. Two cards also give you a redundancy story: if one card fails, you can reconfigure to run a smaller model (8B or 30B) on the surviving card while you wait for a replacement. On a vessel, that matters.

Single A100 80 GB. The previous generation, but still capable. AWQ INT4 fits with similar headroom to the H100. The bandwidth is lower (2.0 TB/s HBM2e vs. 3.35 TB/s HBM3) so decode speed is noticeably slower. If you already have A100 hardware on board, it works. If you are buying new, the L40S pair is a better purchase at a similar price point.

Quantization: choosing the right method

Not all 4-bit quantizations are the same. The two methods that matter for production vessel deployment are AWQ and GGUF (Q4_K_M).

AWQ (Activation-Aware Weight Quantization) is the standard for GPU-based serving with vLLM. It runs natively on CUDA, integrates with vLLM's quantization pipeline, and pre-quantized models are available on HuggingFace (search for Meta-Llama-3.1-70B-Instruct-AWQ-INT4). In benchmarks, AWQ at 4-bit retains about 95% of FP16 quality, with only a small perplexity penalty (6.84 for AWQ vs. 6.74 for GGUF Q4_K_M in comparative tests).

GGUF Q4_K_M is the standard for llama.cpp and CPU-offloading scenarios. It uses mixed precision (not a flat 4-bit across all layers) and generally produces slightly better quality than AWQ at the same bit budget. The tradeoff: GGUF inference through llama.cpp is significantly slower than AWQ through vLLM on GPU hardware. In benchmarks on dual A100s with 32 concurrent users, vLLM delivered 649 tokens per second versus 196 for llama.cpp. That is a 3.3x throughput difference.

The practical recommendation: if you have dedicated NVIDIA GPUs (which you should for a vessel deployment), use AWQ with vLLM. If you are running on a DGX Spark, a Strix Halo box, or any unified-memory system where llama.cpp's CPU+GPU hybrid mode is the only option, use GGUF Q4_K_M. James covered the unified memory hardware options in a separate post.

The serving stack: vLLM in production

vLLM is the inference engine we deploy on every GPU-based vessel installation. It was originally built at UC Berkeley's Sky Computing Lab, it is now maintained by 2,000+ contributors, and it has become the default for production LLM serving across the industry. Three features make it the right choice for vessel deployments specifically.

PagedAttention. vLLM manages the KV cache the way an operating system manages virtual memory: in non-contiguous pages, allocated on demand, freed when sessions end. This eliminates the memory fragmentation that kills concurrent inference on simpler serving stacks. On a vessel where you might have a guest concierge, a crew operations assistant, and a maintenance knowledge base all hitting the same GPU, PagedAttention is the difference between "it works" and "it crashes at the third concurrent session."

Continuous batching. Traditional batch inference waits for a full batch before processing. vLLM replaces completed sequences with new ones at every iteration, so the GPU stays busy even when request arrival is uneven (which on a vessel, it always is).

Tensor parallelism. Split the model across 2 or 4 GPUs with a single flag. No custom code, no manual sharding.

Here is what the launch command looks like for a dual-L40S deployment:

vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000

That gives you an OpenAI-compatible API endpoint on port 8000. Your application layer (concierge app, crew assistant, voice pipeline) talks to it over HTTP on the ship's internal network. Every request stays on the vessel. Nothing leaves the hull.
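Any OpenAI-compatible client can talk to that endpoint. A minimal stdlib-only sketch of the application side — the ship-internal address and system prompt here are placeholders, not from a real installation:

```python
import json
from urllib import request

# Ship-internal vLLM endpoint (placeholder address)
VLLM_URL = "http://10.0.0.5:8000/v1/chat/completions"

def build_request(prompt: str,
                  system: str = "You are the vessel concierge.",
                  max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

def ask(prompt: str) -> str:
    """POST to the local endpoint and return the assistant's reply."""
    req = request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The point of the OpenAI-compatible surface is exactly this: the concierge app, crew assistant, and voice pipeline are ordinary HTTP clients, and none of them need to know or care that the model behind the endpoint is local.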

For deployments where multi-turn conversations share long system prompts (which is every vessel deployment, because you are loading vessel-specific context into every session), look at SGLang as an alternative. Its RadixAttention feature automatically reuses KV cache across requests that share the same prefix, which translates to about 29% higher throughput on shared-context workloads. We are evaluating it for crew operations use cases where every request starts with the same 8,000-token vessel operations manual.

What "offline" actually means in practice

Running a 70B model on-vessel is not just a hardware exercise. There are operational details that only surface when the deployment is genuinely disconnected.

Model updates happen in port. You are not pulling a 35 GB model file over a 50 Mbps Starlink link. Model weights, quantized checkpoints, and embedding indexes get synced to the vessel's local NVMe storage when you are dockside on high-bandwidth shore power internet. We script this as a systemd timer that runs a differential sync (rsync over SSH) whenever the vessel's network gateway detects a wired ethernet connection. In practice, this means your models update every port call, which for most yacht itineraries is weekly or biweekly.
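The timer-plus-service pattern is standard systemd. A sketch of the pair — the interface name, shore host, and paths are all placeholders, not a real installation:

```ini
# /etc/systemd/system/model-sync.service (sketch)
[Unit]
Description=Differential model sync from shore-side storage

[Service]
Type=oneshot
# Skip silently unless the wired shore link is up (interface name is an assumption)
ExecCondition=/bin/sh -c 'ip link show eth0 | grep -q "state UP"'
ExecStart=/usr/bin/rsync -az --partial sync@shore-host:/staging/models/ /srv/models/

# /etc/systemd/system/model-sync.timer (sketch)
[Unit]
Description=Check for model updates periodically

[Timer]
OnBootSec=5min
OnUnitActiveSec=30min

[Install]
WantedBy=timers.target
```

The timer fires every half hour; ExecCondition makes the service a no-op at sea, so the sync only actually runs once the gateway sees wired ethernet dockside.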

Inference telemetry stays local. In a cloud deployment you would send metrics to Datadog or Prometheus Cloud. On a vessel, you run a local Prometheus instance and Grafana dashboard accessible from the crew network. Token throughput, latency percentiles, GPU utilization, VRAM pressure, error rates. All on-vessel. When the ship reaches port, the telemetry syncs to shore-side storage for trend analysis. We wrote more about this operational pattern in the context of the knowledge-ark model: the vessel is self-contained, self-monitoring, and reports home when it can, not when it must.
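vLLM exports Prometheus metrics at /metrics on its serving port, so the local scrape setup is a few lines of config. A sketch, with a placeholder hostname:

```yaml
# prometheus.yml scrape config (sketch; hostname is a placeholder)
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["llm-host.ship.local:8000"]
```

Grafana then reads from the local Prometheus instance; nothing in the telemetry path touches the satellite link.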

Failover is local. If the primary GPU fails, the serving stack restarts on the secondary card with a smaller model (Llama 3.1 8B Instruct, which fits on any card in the rack with room to spare). Response quality drops, but the system stays up. The crew sees a notification. The guest sees nothing. That is the standard we design to.

RAG indexes are pre-built. For vessel-specific knowledge (deck plans, menus, port guides, operations manuals), we build the vector index shore-side and ship it as a file alongside the model weights. The embedding model (typically a small 400M-parameter model like BGE or GTE) runs on the same GPU. No external API calls. The whole retrieval-augmented generation pipeline is local.
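The retrieval half of that pipeline is small once the index is shipped as a file. A minimal pure-Python sketch of the lookup step, assuming the document vectors were embedded shore-side (names are illustrative):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list, index: list, k: int = 3) -> list:
    """index: list of (doc_id, vector) pairs loaded from the pre-built file.
    Returns the k most similar document ids."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

In production you would use a proper vector store and the GPU-resident embedding model for the query side; the point here is only that the lookup is a local computation over a local file, with no network dependency.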

A complete vessel deployment: bill of materials

For the 70-meter yacht from the opening paragraph, here is what we actually quoted:

  • 2x NVIDIA L40S (48 GB each, 350W each): primary inference and redundancy
  • 1x AMD EPYC 9124 server (16 cores, 128 GB RAM): host system, runs vLLM, Prometheus, the application layer
  • 2x 2TB NVMe (mirrored): model weights, RAG indexes, telemetry, logs
  • 1x rack-mount UPS (3 kVA): bridge between generator faults and genset restart
  • Vibration-isolated rack enclosure with positive-pressure filtered cooling
  • Software: vLLM (AWQ INT4), Whisper large-v3 for voice input, BGE-base for embeddings, custom concierge application

Total power draw under full inference load: about 900W (both GPUs at peak plus the host system). That is less than a residential hair dryer.

The system runs Llama 3.1 70B Instruct at AWQ INT4, serves an OpenAI-compatible API to the guest app and crew tablet app, handles voice input in six languages via Whisper, and answers questions using a combination of parametric knowledge and local RAG over vessel-specific documents. Response latency is under 800ms for a typical concierge interaction. The whole stack fits in a 12U rack enclosure.

This is what it looks like when you build AI that runs where the guests actually are, not in a data center 3,000 miles away on the other side of a satellite link that might not be there when you need it.


Planning a 70B deployment for your vessel? Talk to us. We will walk through your specific requirements, model selection, and hardware sizing. No generic demo, just the deployment that makes sense for your vessel and your guests.