
Twelve nationalities on a crew of nineteen. The captain is South African, the chief stew is Croatian, the engineer is Filipino, and the two deckhands rotating onto the bow for a stern-to docking speak French and Indonesian. The captain calls a command over the radio. Two people understand it immediately. The rest work from context, body language, and repetition.

This is a normal Tuesday on a superyacht.

Now put an AI voice assistant on that same vessel. A guest asks for the spa schedule in Russian. A deckhand requests the weather forecast in Tagalog. The chef confirms a provisioning order in Italian. None of those requests leave the hull. None require the satellite link that dropped twenty minutes ago.

The question every operator asks us: do you build the voice stack from open-source components, buy a commercial platform, or self-host a vendor's models on your own hardware? The answer depends on your language count, your latency budget, and whether you are willing to lock into a vendor's licensing for a system that has to work when the vendor is unreachable.

The Language Map: What Actually Gets Spoken

Before you architect a voice pipeline, you need to know what languages it has to handle. Not "99 languages" (the number Whisper technically supports). The ones that actually show up on your vessel.

Crew languages vary by department. Deck and engineering skew toward Filipino, Croatian, Indonesian, and increasingly Romanian and Brazilian Portuguese. Interior and service lean British, Australian, and South African English with French and Italian mixed in. On a 90-meter-plus vessel with 36 crew, you might have 15 nationalities aboard. That is not unusual. One captain we work with described it as "running a small nation where nobody shares a first language."

Guest languages are a shorter list but higher stakes. Mediterranean charters run heavily on English, French, Italian, and Russian. Middle Eastern clients bring Arabic. Asian itineraries add Mandarin. Caribbean and US-based vessels are almost exclusively English.

For a production voice assistant, the practical set is roughly ten languages:

  • Crew-facing: English, Filipino/Tagalog, French, Croatian, Indonesian, Romanian
  • Guest-facing: English, French, Italian, Russian, Spanish, Arabic, Mandarin

If your STT and TTS models cannot handle at least those, you are building a monolingual system and calling it multilingual.
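One way to make that test concrete is a small coverage check. The sketch below encodes the language set listed above as ISO 639-1 codes and reports what a candidate engine is missing. The function name and the example engine's coverage are hypothetical, chosen for illustration only.

```python
# ISO 639-1 codes for the practical language set listed above.
CREW_LANGUAGES = {"en", "tl", "fr", "hr", "id", "ro"}
GUEST_LANGUAGES = {"en", "fr", "it", "ru", "es", "ar", "zh"}
REQUIRED_LANGUAGES = CREW_LANGUAGES | GUEST_LANGUAGES


def missing_languages(supported: set[str],
                      required: set[str] = REQUIRED_LANGUAGES) -> list[str]:
    """Return the required language codes an engine does not cover, sorted."""
    return sorted(required - supported)


if __name__ == "__main__":
    # Hypothetical engine coverage, not a statement about any real product.
    example_engine = {"en", "fr", "it", "es", "ru", "zh"}
    print(missing_languages(example_engine))
```

Run the same check against the published language list of every STT and TTS engine you are evaluating before committing to it.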

Speech-to-Text: Whisper Is the Only Serious Option

For on-vessel STT, OpenAI's Whisper is the answer. Not because it is perfect, but because nothing else combines 99-language support, open weights, and offline operation in one package.

Here is the model lineup that matters:

| Model | Parameters | VRAM | Relative Speed |
|---|---|---|---|
| large-v3 | 1.55B | ~10 GB | 1x (baseline) |
| large-v3-turbo | 809M | ~6 GB | ~8x |
| medium | 769M | ~5 GB | ~2x |

The large-v3 model was trained on over 5 million hours of audio and supports 99 languages with 128 Mel frequency bins (up from 80 in v2). Word error rate on clean English is around 2.8% on the LibriSpeech benchmark. For the languages that matter on superyachts (French, Italian, Russian, Tagalog), accuracy is lower but still serviceable, typically in the 8–15% WER range depending on accent and background noise.

Do not run vanilla Whisper in production. Use faster-whisper, the CTranslate2 implementation by SYSTRAN. It is up to 4x faster than the original at the same accuracy and uses less VRAM. On an NVIDIA RTX 3070 Ti with CUDA 12.4, faster-whisper processes a 13-minute audio file in about 63 seconds versus 143 seconds for vanilla Whisper, roughly a 2.3x speedup in that benchmark. With INT8 quantization, VRAM drops from 4,525 MB to 2,926 MB.
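To see what the benchmark figures quoted above imply, the arithmetic can be worked through directly. This is only a calculation on the numbers in the paragraph; the helper names are illustrative.

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


AUDIO_SECONDS = 13 * 60  # the 13-minute benchmark file

print(round(speedup(143, 63), 2))            # 2.27x over vanilla Whisper
print(round(AUDIO_SECONDS / 63, 1))          # 12.4x faster than real time
print(round(reduction_pct(4525, 2926), 1))   # 35.3% less VRAM with INT8
```

Your own hardware and audio will give different numbers, so treat these as a sanity check rather than a guarantee.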

For real-time voice assistant interactions, the large-v3-turbo variant is the practical choice. At 809M parameters and 6 GB of VRAM, it runs at roughly 8x the speed of the full large-v3 while maintaining strong multilingual accuracy. That is fast enough for conversational latency on a single GPU.

from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")
segments, info = model.transcribe("crew_request.wav", language=None)
# language=None enables auto-detection
for segment in segments:
    print(f"[{info.language}] {segment.text}")

Setting language=None triggers auto-detection. In our testing, Whisper correctly identifies the spoken language about 93% of the time for the top-ten superyacht languages. The main failure case is short utterances (under two seconds), where it sometimes mistakes Romanian for Italian or Filipino for Indonesian.
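One possible mitigation for those confusions is to restrict detection to the languages actually spoken aboard and to reject low-confidence guesses. The sketch below assumes you can obtain per-language probabilities as (code, probability) pairs; recent faster-whisper versions expose these as info.all_language_probs, but check your installed version. The function name and the 0.5 threshold are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# ISO 639-1 codes for the languages expected aboard (an assumption per vessel).
VESSEL_LANGUAGES = {"en", "tl", "fr", "hr", "id", "ro", "it", "ru", "es", "ar", "zh"}


def pick_language(
    language_probs: Sequence[Tuple[str, float]],
    allowed: set[str] = VESSEL_LANGUAGES,
    min_probability: float = 0.5,
) -> Optional[str]:
    """Return the most probable allowed language, or None if too uncertain."""
    candidates = [(lang, p) for lang, p in language_probs if lang in allowed]
    if not candidates:
        return None
    best_lang, best_p = max(candidates, key=lambda item: item[1])
    return best_lang if best_p >= min_probability else None


if __name__ == "__main__":
    # A short utterance where Romanian and Italian are nearly tied.
    print(pick_language([("it", 0.48), ("ro", 0.45), ("pt", 0.07)]))  # None
    print(pick_language([("tl", 0.80), ("id", 0.15)]))                # tl
```

When the function returns None, the assistant can ask the speaker to repeat or fall back to a default language instead of transcribing in the wrong one.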

Text-to-Speech: Three Contenders, Three Tradeoffs

STT is a solved problem for on-vessel use. TTS is where the decisions get interesting.

Piper is the workhorse. It is VITS-based, runs as ONNX inference, supports 35+ languages with pre-trained voices, and the medium-quality models are 60–150 MB each. Piper runs on a Raspberry Pi. On a vessel GPU, latency is sub-100 ms. The voices are not going to fool anyone into thinking they are human, but they are clear, consistent, and available in the languages you need (including Filipino, Croatian, and Indonesian). For crew-facing announcements and confirmations, Piper is the right choice. The project transitioned to a community-maintained fork under GPL-3.0 in 2025, so development continues, but GPL-3.0 is a copyleft license: review its terms before bundling Piper into a proprietary product.

Orpheus TTS is the new entrant worth watching. Built on a Llama-3B backbone and trained on over 100,000 hours of English speech data, it delivers genuinely human-sounding speech with emotion control and zero-shot voice cloning. Streaming latency is about 200 ms (reducible to 100 ms with input streaming). A multilingual research preview shipped in April 2025 covering French, German, Spanish, Italian, Mandarin, Korean, and Hindi (24 voices total). The license is Apache 2.0. The catch: English is the only language at production maturity. If your primary use case is an English guest concierge, Orpheus is compelling. For a ten-language crew assistant, it is not ready yet.

XTTS-v2 (from the now-shuttered Coqui AI) supports 17 languages and offers voice cloning from a 6-second audio sample. Streaming latency is under 150 ms on a consumer GPU. The quality is strong. The problem is licensing: the Coqui Public Model License restricts commercial use. For a production deployment on a paying client's vessel, that is a non-starter unless you negotiate a separate agreement. Coqui AI shut down in 2024, so there is nobody to negotiate with. Community forks exist, but the licensing ambiguity is a real risk for a deployed maritime system.

Our recommendation for today: Piper for multilingual crew interactions, Orpheus for English guest-facing experiences. Revisit in six months when Orpheus multilingual models mature.
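That recommendation reduces to a simple routing rule. A minimal sketch, assuming each request is already tagged with an audience ("crew" or "guest") and an ISO 639-1 language code; the function name and tag values are hypothetical.

```python
def select_tts_engine(audience: str, language: str) -> str:
    """Route English guest-facing speech to Orpheus, everything else to Piper."""
    if audience == "guest" and language == "en":
        return "orpheus"
    return "piper"


if __name__ == "__main__":
    print(select_tts_engine("guest", "en"))  # orpheus
    print(select_tts_engine("guest", "ru"))  # piper
    print(select_tts_engine("crew", "en"))   # piper
```

Keeping the rule in one place makes it easy to widen the Orpheus branch later as its multilingual models mature.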

Build, Buy, or Self-Host

Three paths to a voice stack on a vessel. Each one is viable. None of them is universally correct.

Build from open source. You assemble Whisper (STT) + Piper/Orpheus (TTS) + a local LLM for intent processing on your own hardware. Total software cost: zero. Total integration cost: significant. You own every component, you can run it completely air-gapped, and you update on your schedule. This is what we deploy at ShipboardAI for most single-vessel installations, and it is the sovereign approach: no license server to phone home to, no vendor whose business decisions can disable your voice assistant mid-charter.

Buy a commercial platform. Deepgram is the most viable option for self-hosted commercial STT. Their on-premise offering runs in Docker or Kubernetes, requires dedicated NVIDIA GPUs (no fractional/MIG support), and a single T4 or A10 handles 50–100 concurrent streams with sub-300 ms latency. Deepgram's Nova-3 achieves roughly 5.3% WER on batch and 6.8% on streaming across production audio domains. The catch: enterprise pricing is not public, the license is per-GPU, and you depend on Deepgram for model updates. If the company changes direction, your vessel's voice stack is affected. Deepgram also supports 36 languages versus Whisper's 99, which matters if your crew speaks Croatian or Tagalog.

Self-host a vendor's models. The middle path: you run vendor-provided models on your hardware under a commercial license. NVIDIA's NIM platform supports Whisper and several TTS models in this mode. The tradeoff is lock-in to the NVIDIA container stack and NIM licensing terms.

| Factor | Open Source (Build) | Deepgram (Buy) | NIM (Self-Host) |
|---|---|---|---|
| Upfront SW cost | $0 | Enterprise contract | NIM license |
| Languages (STT) | 99 (Whisper) | 36 (Nova-3) | 99 (Whisper) |
| Languages (TTS) | 35+ (Piper) | Limited | Varies |
| Offline capable | Fully | Yes, self-hosted | Yes, self-hosted |
| Vendor dependency | None | High | Medium |
| Integration effort | High | Medium | Medium |

For a vessel where the entire point is sovereign operation, where your knowledge ark does not call home, open source is the defensible choice. You control the weights, the runtime, and the update cycle. Vendor risk is not theoretical. Coqui AI shut down. Others will follow.

The Full Pipeline on Vessel Hardware

Here is what the deployed architecture looks like:

[Mic Array] → [VAD: Silero] → [STT: faster-whisper turbo]
    → [LLM: vLLM / Llama 3.3 70B] → [TTS: Piper or Orpheus]
    → [Speaker Output]

The voice activity detection stage matters more than people think. Silero VAD is a 1.5 MB model that decides "is someone talking or is that the generator?" Without good VAD, your STT model wastes GPU cycles transcribing ambient noise into hallucinated text. On a vessel with diesel generators, HVAC systems, and waves hitting the hull, VAD is not optional.
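The gating role of VAD in the pipeline can be sketched as plain function composition. This is an illustrative skeleton, not our production code: each stage is an injected callable, and the class and field names are hypothetical. In a real deployment the callables would wrap Silero VAD, faster-whisper, the LLM server, and Piper or Orpheus.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """Chains the stages from the diagram; each stage is an injected callable."""
    is_speech: Callable[[bytes], bool]   # VAD stage
    transcribe: Callable[[bytes], str]   # STT stage
    respond: Callable[[str], str]        # LLM intent + response stage
    synthesize: Callable[[str], bytes]   # TTS stage

    def handle(self, audio: bytes) -> Optional[bytes]:
        # Drop non-speech audio before it reaches the GPU-backed stages.
        if not self.is_speech(audio):
            return None
        text = self.transcribe(audio).strip()
        if not text:
            return None
        return self.synthesize(self.respond(text))


if __name__ == "__main__":
    # Stub stages stand in for the real models.
    pipeline = VoicePipeline(
        is_speech=lambda audio: len(audio) > 0,
        transcribe=lambda audio: "weather forecast",
        respond=lambda text: f"Request received: {text}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(pipeline.handle(b"\x01\x02"))
    print(pipeline.handle(b""))
```

Because the VAD check runs first, generator hum that fails it never costs an STT or LLM call.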

Latency budget for a conversational interaction:

| Stage | Target | Actual (dual-L40S) |
|---|---|---|
| VAD + audio capture | < 50 ms | ~30 ms |
| STT (turbo, 5s utterance) | < 300 ms | ~180 ms |
| LLM intent + response | < 500 ms | ~350 ms |
| TTS (Piper, single sentence) | < 100 ms | ~60 ms |
| Total end-to-end | < 950 ms | ~620 ms |
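The budget is easy to turn into an automated check that flags any stage exceeding its target. The sketch below uses the figures from the table; the stage keys and function name are illustrative.

```python
# Per-stage latency in milliseconds, taken from the table above.
TARGET_MS = {"vad_capture": 50, "stt": 300, "llm": 500, "tts": 100}
MEASURED_MS = {"vad_capture": 30, "stt": 180, "llm": 350, "tts": 60}


def over_budget(measured: dict[str, int], target: dict[str, int]) -> list[str]:
    """Return the stages whose measured latency exceeds the target."""
    return [stage for stage, ms in measured.items() if ms > target[stage]]


total_target = sum(TARGET_MS.values())
total_measured = sum(MEASURED_MS.values())

print(total_target)                         # 950
print(total_measured)                       # 620
print(total_target - total_measured)        # 330 ms of headroom
print(over_budget(MEASURED_MS, TARGET_MS))  # []
```

Running a check like this against live measurements after every model or driver update catches regressions before the crew does.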

On a dual-L40S setup (our standard recommendation for single-vessel GPU deployments), one GPU handles the LLM inference while the second handles STT, TTS, the embedding model, and auxiliary workloads. Whisper turbo at INT8 uses about 3 GB of VRAM. Piper uses negligible GPU resources (it runs fine on CPU). Against the L40S's 48 GB, that leaves roughly 44 GB on the second card for your embedding model, reranker, and headroom once the rest of the voice pipeline is loaded.

If you have already deployed a 70B model for your concierge, the voice pipeline adds about 4 GB of VRAM on the secondary GPU. No additional hardware required.

Failure Modes That Only Show Up at Sea

We have shipped voice assistants on three vessels. Here is what broke that we did not anticipate in the lab.

Engine room noise. Whisper's noise robustness is good but not infinite. Below an engine room hatch with generators running, word error rate on English climbed from 4% to over 25%. The fix: a two-microphone beamforming array per station and a noise gate in the audio preprocessing pipeline. Without hardware-level noise rejection, crew stopped trusting the system within a week.
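The software half of that fix, the noise gate, can be illustrated with a simple RMS threshold. This is a deliberately simplified stand-in, not our production preprocessing and not a substitute for beamforming hardware. It assumes mono float samples in the range -1.0 to 1.0 at 16 kHz, so a 320-sample frame is 20 ms; the threshold value is hypothetical and needs tuning per station.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def noise_gate(samples: Sequence[float],
               frame_size: int = 320,
               threshold: float = 0.05) -> List[float]:
    """Zero out frames whose RMS level falls below the threshold."""
    gated: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if rms(frame) >= threshold:
            gated.extend(frame)
        else:
            gated.extend([0.0] * len(frame))
    return gated


if __name__ == "__main__":
    quiet = [0.01] * 320   # low-level background hum
    loud = [0.5] * 320     # speech-level signal
    out = noise_gate(quiet + loud)
    print(out[0], out[320])  # 0.0 0.5
```

A fixed threshold only helps when speech is louder than the background, which is exactly why the microphone array matters in an engine room.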

Short utterances in non-English languages. "Oui" and "Da" and "Oo" (Filipino for yes) are all under one second of audio. Whisper's language auto-detection fails on sub-two-second clips at a high rate. The fix: once a user authenticates (badge tap or voice enrollment), pin their preferred language and skip auto-detection. Fall back to auto-detect only for unregistered guests.
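The pinning logic is small. A minimal sketch, assuming a user profile record with an optional preferred language; the UserProfile class and its fields are hypothetical. The return value is meant to be passed as the language argument to transcribe, where None means auto-detect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserProfile:
    user_id: str
    preferred_language: Optional[str] = None  # ISO 639-1 code, e.g. "tl"


def language_for_request(profile: Optional[UserProfile]) -> Optional[str]:
    """Return a pinned language code, or None to let Whisper auto-detect."""
    if profile is not None and profile.preferred_language:
        return profile.preferred_language
    return None


if __name__ == "__main__":
    deckhand = UserProfile(user_id="crew-014", preferred_language="tl")
    print(language_for_request(deckhand))  # tl
    print(language_for_request(None))      # None (unregistered guest)
```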

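Accent stacking. A Croatian deckhand speaking English with a heavy accent while the French bosun responds in accented English. Whisper handles this better than expected, but greedy decoding is not reliable here, so set the beam search parameters explicitly:

segments, info = model.transcribe(
    "crew_radio.wav",
    beam_size=5,    # beam search width; 1 means greedy decoding
    best_of=5,      # candidates sampled when decoding falls back to a non-zero temperature
    language="en",  # pin the language; both speakers are using English
    condition_on_previous_text=False  # do not feed earlier output back in as a prompt
)

faster-whisper already defaults to beam_size=5, but if you have dropped it to 1 (greedy decoding) to save latency, that is too aggressive for accented speech. Setting condition_on_previous_text=False reduces the hallucination loops where the model reinforces its own errors.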

TTS pronunciation of proper nouns. Piper mispronounces port names, yacht names, and guest names consistently. The fix is a pronunciation lexicon that maps known proper nouns to phonetic representations. Tedious to build, essential for guest-facing interactions.
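A lexicon of this kind can start as a plain text substitution applied before the string reaches the TTS engine. The sketch below assumes respelling is enough for your voice; the entries and respellings are hypothetical placeholders and would need tuning by ear, and a phoneme-level lexicon may work better if your TTS setup supports one.

```python
import re

# Hypothetical entries: written form -> respelling the voice pronounces acceptably.
LEXICON = {
    "Portofino": "Porto-feeno",
    "Hvar": "Hwar",
}


def apply_lexicon(text: str, lexicon: dict[str, str] = LEXICON) -> str:
    """Replace known proper nouns with their respellings before synthesis."""
    if not lexicon:
        return text
    # Longest names first so multi-word entries win over their substrings.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): spoken for name, spoken in lexicon.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


if __name__ == "__main__":
    print(apply_lexicon("We arrive in Portofino at noon."))
```

Word boundaries keep the substitution from firing inside longer words, and the case-insensitive match covers names typed in lowercase by a provisioning system.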

What to Ship First

If you are evaluating voice AI for your vessel, here is the sequence that works:

  1. Start with crew-facing STT only. Deploy faster-whisper with the turbo model for hands-free logging and radio transcription. This has immediate operational value (searchable radio logs, incident documentation) and zero guest-facing risk.

  2. Add TTS for announcements. Wire Piper into your PMS for automated multilingual announcements: weather updates, meal times, safety briefings. Low risk, high visibility.

  3. Build the full concierge. Once you have validated STT accuracy and TTS quality on your specific vessel (every hull is acoustically different), layer in the agentic concierge workflow with LLM-powered intent routing. This is where the system goes from useful tool to guest experience differentiator.

Each phase can be deployed independently. Each one works offline. And each one reinforces the core value of your knowledge ark: when the satellite link drops, your crew and guests do not notice. The voice assistant keeps working because every component (VAD, STT, LLM, TTS) lives on the vessel. Nothing phones home. Nothing degrades.

That is the system we are building at ShipboardAI. If you want to talk through the voice stack for your vessel, reach out.