The moment a guest taps the cabin screen and says "book me a table at the Italian restaurant for four at eight tonight," your system is not running a chatbot. It is running a four-tool agent: an intent classifier, a calendar availability check, a reservation API call, and a confirmation generator. Each one is a point where the interaction can break. On a charter yacht with intermittent satellite connectivity, at least one of them will break in the first week.
That is the gap between a concierge demo and a concierge deployment. Demos handle the happy path. Deployments handle the 504 from the galley PMS, the guest who switches from English to Russian mid-sentence, and the total connectivity blackout during a Tyrrhenian Sea crossing.
This post is the engineering walkthrough. We will cover the agent architecture, the tool schemas, the multi-language voice pipeline, and the failure-handling patterns that make an on-vessel concierge actually shippable. If you want the "why local AI matters" framing, start with Why Cloud AI Fails at Sea. If you want the general concierge pitch, James wrote a solid overview. This is the builder's guide.
What Charter Guests Actually Ask For
Before you design an agent, you need a task taxonomy. We categorize guest requests into four tiers based on how many tools they require:
Tier 1: Knowledge retrieval (zero tools). "What time does the pool bar close?" "Is the sauna available?" These are RAG lookups against the vessel's operational knowledge base. No external calls, no state changes. The LLM retrieves context from the embedded document store and responds.
Tier 2: Single-tool actions. "Book me a spa appointment at 3pm." "Send a message to the captain." One intent, one API call, one confirmation. The model identifies the right tool, builds the parameters, executes, and returns the result.
Tier 3: Multi-tool orchestration. "Plan a shore excursion in Portofino tomorrow with lunch, a car, and a guide who speaks French." This requires a planner that decomposes the request into subtasks: availability lookup, restaurant reservation, ground transport booking, guide scheduling. Four or five tools, sequenced, with dependency awareness: you cannot book the car until you know the lunch location. A plan of this shape is sketched after the tier list.
Tier 4: Ongoing context. "Change tomorrow's dinner to the upper deck and let the chef know we want the seafood tasting menu instead." The agent must resolve references to a previously booked event, identify the two modifications (location change + menu change), update the right records, and notify the galley. This is where most cloud chatbots fall apart because they lose session state across interactions.
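To make Tier 3 concrete, here is the shape a decomposed plan takes. This is an illustrative sketch, not production code; the Subtask structure is an assumption, and the tool names anticipate the tool set defined later in this post.

```python
# Illustrative shape of a Tier 3 plan: subtasks with explicit dependencies,
# so ground transport waits on the lunch reservation.
from dataclasses import dataclass, field

@dataclass
class Subtask:
    id: str
    tool: str                                  # which tool executes this step
    depends_on: list[str] = field(default_factory=list)

# "Plan a shore excursion in Portofino tomorrow with lunch, a car, and a
# French-speaking guide" decomposes into:
portofino_plan = [
    Subtask("lunch", tool="create_reservation"),
    Subtask("car", tool="book_excursion", depends_on=["lunch"]),  # needs lunch location
    Subtask("guide", tool="book_excursion", depends_on=["lunch"]),
    Subtask("tender", tool="request_tender", depends_on=["car", "guide"]),
]
```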
Charter concierge data from operators like Fraser and Northrop & Johnson shows that roughly 60% of guest interactions are Tier 1, 25% are Tier 2, and 15% are Tier 3 or 4. The good news: you can deliver a useful concierge with just Tier 1 and Tier 2 support. The great news: the same architecture scales to Tier 3 with one additional component.
The Agent Architecture: Planner, Tools, and Evaluator
The pattern we use is a modified ReAct loop (Reasoning + Acting, from the Yao et al. 2022 paper). The core cycle is Thought, Action, Observation:
- Thought. The LLM analyzes the guest's request and its conversation history, then decides what to do next.
- Action. The model emits a structured tool call (function name + JSON arguments).
- Observation. The tool executes and returns a result. The model incorporates the result and decides whether to call another tool or respond to the guest.
For Tier 1 and Tier 2 requests, the loop usually completes in one cycle. For Tier 3, it runs multiple iterations, and that is where you need a planner.
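Stripped to its core, the loop looks like this. A minimal sketch against vLLM's OpenAI-compatible endpoint; the URL, the served model name, and the execute_tool dispatcher are illustrative assumptions, not our production code.

```python
import json
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; endpoint and model name assumed here.
# execute_tool is your dispatcher mapping tool names to real implementations.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def run_agent(messages, tool_schemas, execute_tool, max_steps=8):
    """Thought -> Action -> Observation, until the model answers in text."""
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="llama-3.3-70b-instruct",
            messages=messages,
            tools=tool_schemas,  # structured output keeps Actions valid JSON
        )
        msg = response.choices[0].message
        if not msg.tool_calls:        # no Action requested: final answer
            return msg.content
        messages.append(msg)          # keep the Thought/Action in history
        for call in msg.tool_calls:   # Action -> Observation
            result = execute_tool(call.function.name,
                                  json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "I need to hand this one to the crew."  # step-limit safety valve
```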
The planner is not a separate model. It is a system prompt behavior. When the LLM detects a multi-step request, it generates a plan (a numbered list of subtasks) in its reasoning trace before executing the first action. After each tool result, it checks the plan and moves to the next step. If a step fails, it re-plans around the failure.
Here is the serving stack:
- LLM: Llama 3.3 70B Instruct, quantized to Q4_K_M, served via vLLM with the Llama 3.x tool parser enabled
- Context window: 128K tokens (enough for a full day of guest conversation history plus the vessel knowledge base)
- Tool calling: vLLM's native structured output support constrains the model to emit valid JSON matching the tool schemas, eliminating malformed calls
- Evaluator: A lightweight post-processing step that validates tool outputs before surfacing them to the guest (did the reservation API return a confirmation ID? Is the time slot actually available?)
Llama 3.3 70B scores approximately 77% on the Berkeley Function Calling Leaderboard (BFCL v2), which puts it in the same range as models three times its size from a year ago. For a constrained domain like yacht concierge (fewer than 15 tools, well-defined schemas), that accuracy is more than sufficient.
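The evaluator can be as simple as a validation function that gates what the guest sees. A minimal sketch; the result field names (error, confirmation_id) match the tool conventions described in the next section but are otherwise illustrative:

```python
# Minimal evaluator sketch: gate tool results before they reach the guest.
def validate_reservation_result(result: dict) -> tuple[bool, str]:
    """Return (ok, reason); the agent re-plans or escalates when not ok."""
    if result.get("error"):
        return False, f"tool error: {result['error']}"
    if not result.get("confirmation_id"):
        return False, "missing confirmation ID"
    return True, "ok"
```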
Designing Tool Schemas for a Yacht
A tool is a function the LLM can call. Each tool has a name, a description, and a JSON schema defining its parameters. The quality of your tool schemas determines whether the model picks the right tool and fills the right arguments.
Here is a practical set for a 60-meter charter yacht:
- search_knowledge_base: query the RAG index (menus, schedules, policies, itinerary)
- create_reservation: book a table, spa slot, or activity (params: type, datetime, party_size, preferences)
- modify_reservation: change an existing booking (params: reservation_id, changes)
- cancel_reservation: cancel with optional reason
- check_availability: query open slots for a resource (params: resource_type, date_range)
- send_crew_message: route a request to galley, deck, housekeeping, or bridge (params: department, message, priority)
- get_weather: pull from the onboard weather station or cached forecast
- get_itinerary: return today's or tomorrow's planned stops and events
- request_tender: schedule the tender for a shore trip (params: time, destination, pax_count)
- book_excursion: arrange a shore-side activity (params: location, activity_type, datetime, language)
Ten tools. That is enough to cover Tier 1 through Tier 3. Each tool description must be specific enough that the model can discriminate. Bad: "handles reservations." Good: "creates a new reservation for dining, spa, or activities. Requires type, datetime, and party_size. Returns a confirmation ID or an error with available alternatives."
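In the OpenAI function-calling format that vLLM accepts, the "good" description above becomes a schema like this. The enum values and parameter details are illustrative for this vessel:

```python
# One tool schema in the OpenAI function-calling format.
CREATE_RESERVATION = {
    "type": "function",
    "function": {
        "name": "create_reservation",
        "description": (
            "Creates a new reservation for dining, spa, or activities. "
            "Requires type, datetime, and party_size. Returns a "
            "confirmation ID or an error with available alternatives."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "type": {"type": "string",
                         "enum": ["dining", "spa", "activity"]},
                "datetime": {"type": "string",
                             "description": "ISO 8601, vessel local time"},
                "party_size": {"type": "integer", "minimum": 1},
                "preferences": {"type": "string",
                                "description": "free-text guest notes"},
            },
            "required": ["type", "datetime", "party_size"],
        },
    },
}
```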
One pattern worth calling out: always return structured errors with alternatives. When create_reservation fails because the 8pm slot is full, the tool should return {"error": "slot_unavailable", "alternatives": ["7:30pm", "8:30pm"]}. The model can then present the alternatives to the guest without making another tool call. This cuts latency in half for the most common failure case.
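On the implementation side, the pattern is a sketch like the following, where check_slot, find_alternatives, and book_slot stand in for hypothetical booking-system calls:

```python
# Structured-error pattern: return alternatives inline so the model can
# offer them to the guest without a second round trip.
def create_reservation(type: str, datetime: str, party_size: int,
                       preferences: str = "") -> dict:
    if not check_slot(type, datetime, party_size):
        return {
            "error": "slot_unavailable",
            "alternatives": find_alternatives(type, datetime, party_size),
        }
    confirmation_id = book_slot(type, datetime, party_size, preferences)
    return {"confirmation_id": confirmation_id}
```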
Multi-Language Voice: Whisper to LLM to TTS
Charter yachts serve a global clientele. The largest share of superyacht owners is American (roughly 24%), but Mediterranean charters see heavy traffic from French, Italian, German, Russian, and increasingly Arabic-speaking and Mandarin-speaking guests. Your concierge needs to handle at least eight languages to cover the realistic guest population.
The voice pipeline has three stages:
Speech-to-text: Whisper large-v3. OpenAI's open-source model. 1.55 billion parameters, trained on 5 million hours of audio, supports 100 languages with an average word error rate around 10%. The critical spec for vessel deployment: it runs inference in about 3GB of VRAM using FasterWhisper with INT8 quantization. That is small enough to share a GPU with the embedding model and the text-to-speech stack.
Latency matters for voice. FasterWhisper (built on CTranslate2) runs up to 4x faster than the reference PyTorch implementation. On an H100, you get real-time transcription of a 10-second utterance in under 300ms. Guests do not perceive a delay.
Language routing. Whisper detects the spoken language as part of its inference pass. You do not need a separate language identification step. The detected language tag flows to the LLM (so it knows to respond in French) and to the TTS engine (so it selects the right voice model).
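With faster-whisper, transcription and language detection come out of a single call. A minimal sketch, with the model size and compute type matching the figures above:

```python
from faster_whisper import WhisperModel

# INT8 quantization keeps large-v3 around the ~3GB VRAM figure cited above.
stt = WhisperModel("large-v3", device="cuda", compute_type="int8")

def transcribe(audio_path: str) -> tuple[str, str]:
    segments, info = stt.transcribe(audio_path, beam_size=5)
    text = " ".join(seg.text.strip() for seg in segments)
    # info.language is the detected language tag, e.g. "fr" -- it flows to
    # both the LLM (respond in French) and the TTS voice selection.
    return text, info.language
```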
Text-to-speech. This is the one component where you have a build-vs-buy decision. Open-source options like Piper TTS cover the major European languages with acceptable quality. For premium voice quality in Mandarin or Arabic, a fine-tuned XTTS-v2 model is the better choice. Either way, the TTS model runs on the same GPU as Whisper, because they are never active simultaneously (the guest is either speaking or listening, not both).
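Voice routing then reduces to a lookup table keyed by Whisper's language tag. A hedged sketch using Piper's command-line interface; the voice names follow Piper's published catalog, but the exact set installed on board is an assumption:

```python
import subprocess

# Map Whisper's detected language tag to an installed Piper voice model.
PIPER_VOICES = {
    "en": "en_US-lessac-medium",
    "fr": "fr_FR-siwis-medium",
    "it": "it_IT-riccardo-x_low",
    "de": "de_DE-thorsten-medium",
}

def speak(text: str, language: str, out_path: str = "reply.wav") -> str:
    voice = PIPER_VOICES.get(language, PIPER_VOICES["en"])
    subprocess.run(
        ["piper", "--model", voice, "--output_file", out_path],
        input=text.encode("utf-8"), check=True,  # Piper reads text on stdin
    )
    return out_path
```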
The full voice round-trip on local hardware: guest speaks (Whisper transcribes in 200–300ms), the LLM processes and calls tools (300–500ms), TTS generates audio (150–250ms). Total latency: roughly 650ms to just over a second. Compare that to a cloud-based voice assistant traversing a satellite link that adds 500ms or more of round-trip latency before the model even starts processing.
When Tools Go Offline: Graceful Degradation
This is the section that separates a deployment from a demo. On a vessel, things go offline. The galley PMS reboots during dinner service. The weather station loses its GPS fix. The satellite link drops during a crossing, taking shore-side excursion APIs with it.
The design principle: every tool must define a fallback behavior, and the agent must know what to do when a tool returns an error.
Three patterns we use (the first two are sketched in code after the list):
1. Cached fallback. Tools that fetch relatively static data (menus, schedules, itineraries) maintain a local cache. When the live source is unreachable, the tool returns cached data with a staleness timestamp. The LLM includes the caveat: "Based on the schedule as of this morning. I will confirm with the crew."
2. Deferred execution. When a transactional tool (reservation, crew message) fails, the agent queues the action and tells the guest: "I have logged your request. The crew will confirm within 15 minutes." A background process retries the queue when the tool comes back online.
3. Human handoff. For Tier 3 requests that require multiple tools and one of them is down, the agent switches to a graceful escalation: "I can help with part of this now, but I will need to involve the crew for the restaurant booking. Would you like me to proceed with what I can do?"
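Here are the first two patterns as code. fetch_live, post_to_pms, cache, and retry_queue are illustrative stand-ins for the vessel's actual integration layer:

```python
import time

def get_itinerary(day: str) -> dict:
    """Cached fallback: serve stale data with a staleness timestamp."""
    try:
        data = fetch_live("itinerary", day, timeout=3)
        cache.set(("itinerary", day), {"data": data, "as_of": time.time()})
        return {"data": data, "stale": False}
    except ConnectionError:
        entry = cache.get(("itinerary", day))
        if entry is None:
            return {"error": "unavailable_offline"}
        # The LLM turns the stale flag into the spoken caveat for the guest.
        return {"data": entry["data"], "stale": True, "as_of": entry["as_of"]}

def send_crew_message(department: str, message: str, priority: str) -> dict:
    """Deferred execution: queue the action, promise a follow-up."""
    try:
        return post_to_pms(department, message, priority, timeout=3)
    except ConnectionError:
        retry_queue.put({"tool": "send_crew_message",
                         "args": [department, message, priority]})
        return {"status": "queued",
                "guest_note": "Logged; the crew will confirm within 15 minutes."}
```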
The key insight: guests on a yacht are not interacting with a support ticket system. They expect responsiveness. A concierge that says "something went wrong, please try again" is worse than no concierge at all. A concierge that says "the 8pm slot looks full, but I can try 7:30 or check with the chef about a special seating" feels human. The difference is entirely in how you handle tool failures.
This is the core argument for the knowledge ark approach. When your AI runs on the vessel, you control the failure modes. When it runs in the cloud, a satellite outage turns your concierge into a loading spinner, and you control nothing.
The Full Stack on Two GPUs
Here is how it all fits on two NVIDIA H100 GPUs (80GB VRAM each), the same hardware profile we detailed in the 70B deployment guide:
GPU 1 (primary inference):
- Llama 3.3 70B Q4_K_M via vLLM: ~42GB VRAM
- Overhead for KV cache and batch processing: ~30GB
- Total: ~72GB
GPU 2 (supporting models + RAG):
- Whisper large-v3 (INT8): ~3GB
- Embedding model for RAG (e.g., BGE-large): ~2GB
- Reranker model: ~1.5GB
- TTS model (Piper or XTTS-v2): ~2GB
- Vector index (loaded into GPU memory for fast retrieval): ~4GB
- Headroom for concurrent requests: ~67GB available
The division is intentional. GPU 1 handles the computationally intensive LLM inference. GPU 2 handles everything else, and none of those workloads are concurrent with each other in a typical voice interaction (Whisper runs, then the LLM runs, then TTS runs). You can serve 10–15 concurrent guest sessions before you need to think about a third GPU, and most charter yachts have 12–16 guests at maximum capacity.
Power draw for the full stack: two H100 SXM cards at typical inference load pull around 400–500W combined, plus cooling and networking overhead. Under 1kW total. On a vessel with generators producing hundreds of kilowatts, this is a rounding error on the hotel load. For hardware selection details, see the GPU selection guide.
From Architecture to Deployment
The gap between an agent that works on your laptop and one that works on a moving vessel in the Aegean is not the model. It is the tool reliability layer, the voice pipeline, and the failure handling. Get those right, and you have a concierge that feels like a well-trained chief steward who happens to speak eight languages and never sleeps.
The sovereign AI framing matters here more than anywhere else in the ShipboardAI stack. A guest concierge is the most visible AI touchpoint on the vessel. If it fails because a satellite link dropped, every guest notices. If it works flawlessly 200 nautical miles from the nearest cell tower, it becomes the feature that defines the charter experience. All of humanity's knowledge, every language, every recommendation, running on hardware bolted to the vessel's rack. No cloud dependency. No loading spinners. No excuses.
If you are building a concierge for a charter fleet or a private yacht, we would like to hear about it. We help operators design the agent architecture, select the hardware, and ship a system that works when the link drops.