Edge AI Observability Without Phoning Home

By Ethan Marsh · 12 min read

Your 70B model has been running inference for six weeks. Guests are asking it questions. Crew are querying the knowledge base. The concierge agent is booking restaurants and checking weather. Everything looks fine from the outside.

But you have no idea if time-to-first-token has drifted from 400ms to 1200ms. You do not know that KV cache pressure is forcing evictions on 30% of multi-turn conversations. You cannot tell that one of your two GPUs hit 89°C last Tuesday at 2am because the intake filter clogged with galley grease. The system kept running. Nobody noticed. Nobody had the data to notice.

This is the observability gap on most vessel AI deployments. The hardware is there. The models are running. But the instrumentation that tells you whether things are actually working well, or just working, does not exist. And the default answer from most monitoring vendors ("just send your metrics to our cloud") is exactly the pattern that sovereign AI rejects by design.

You need full observability that runs entirely on the vessel, requires zero outbound connectivity, and syncs data shore-side only when you choose to.

Here is how to build it.

The Local Observability Stack: Prometheus, Grafana, and Loki

The foundation is three open-source tools that together give you metrics, logs, and visualization without a single external dependency.

Prometheus handles time-series metrics. It scrapes HTTP endpoints at a configurable interval (15 seconds is a common choice; the built-in default is one minute), stores the data in its local TSDB, and evaluates alerting rules against that data. Everything lives on disk. The default retention is 15 days, configurable via --storage.tsdb.retention.time and --storage.tsdb.retention.size. For a vessel deployment, 90 days of retention at roughly 15 GB of SSD storage covers a typical inference workload with headroom.
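As a rough check on that storage figure, the Prometheus documentation gives a rule of thumb: retention time in seconds, times ingested samples per second, times bytes per sample (typically one to two bytes after compression). The sketch below applies it. The series count and the 2-byte figure are illustrative assumptions, not measurements from a real vessel.

```python
def tsdb_disk_bytes(active_series: int, scrape_interval_s: float,
                    retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Estimate Prometheus TSDB disk usage with the rule-of-thumb formula."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 14,000 active series scraped every 15 s, kept 90 days.
estimate = tsdb_disk_bytes(active_series=14_000, scrape_interval_s=15, retention_days=90)
print(f"{estimate / 1e9:.1f} GB")  # about 14.5 GB under these assumptions
```

Series count is the number to watch. Doubling the labels on your inference metrics can double the disk footprint without changing the retention setting.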

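Prometheus keeps the most recent samples in an in-memory head block, protected by a write-ahead log (WAL) on disk, and cuts an immutable block to disk roughly every two hours. Background compaction later merges those two-hour blocks into larger ones. This design tolerates power interruptions well. If the vessel loses shore power and the UPS runs out, the persisted blocks survive intact, and on restart Prometheus replays the WAL to rebuild the head block, so the exposure is limited to samples that had not yet reached the WAL.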

Grafana provides the dashboards. It connects to Prometheus as a data source and renders real-time panels for GPU temperature, inference latency, token throughput, queue depth, whatever you need. It runs as a single binary or container with no internet dependency. Preconfigure your dashboards during commissioning, export them as JSON, and deploy them as part of your infrastructure-as-code. When crew open the monitoring page on the vessel's internal network, they see live data with zero cloud round-trips.

Loki handles structured logs. Unlike Elasticsearch (which indexes every word and consumes substantial resources), Loki only indexes labels and stores log lines as compressed chunks. The resource footprint is small enough to run on a Raspberry Pi, though on a vessel with dedicated GPU hardware, you have far more headroom. Loki captures inference errors, agent tool-call failures, model loading events, and anything your serving stack emits to stdout.

In air-gapped environments, standard service discovery (Consul, Kubernetes API, EC2) is unavailable. Prometheus supports file-based service discovery instead. You write a JSON or YAML file listing your scrape targets, and Prometheus watches it for changes. On a vessel with a fixed set of services, this is simpler and more predictable than dynamic discovery anyway.

# /etc/prometheus/targets.json
[
  {
    "targets": ["inference-server:8000", "dcgm-exporter:9400"],
    "labels": {
      "vessel": "MY-AURORA",
      "environment": "production"
    }
  }
]
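If a commissioning script generates this file, it is worth writing it in one step so Prometheus never picks up a half-written version. The following is a minimal sketch of that idea; the function name and the scratch directory are hypothetical, and a real deployment would point it at /etc/prometheus/targets.json.

```python
import json
import os
import tempfile


def write_targets(path: str, targets: list[str], labels: dict[str, str]) -> None:
    """Write a file_sd target group, replacing any old file in one step."""
    payload = [{"targets": targets, "labels": labels}]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename it over the
    # destination, so a reader sees either the old file or the new one.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, indent=2)
        os.chmod(tmp_path, 0o644)  # mkstemp defaults to owner-only permissions
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Usage, pointed at a scratch directory rather than /etc/prometheus.
scratch = tempfile.mkdtemp()
targets_path = os.path.join(scratch, "targets.json")
write_targets(
    targets_path,
    ["inference-server:8000", "dcgm-exporter:9400"],
    {"vessel": "MY-AURORA", "environment": "production"},
)
```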

GPU Health Monitoring with NVIDIA DCGM

Your GPUs are the most expensive, most power-hungry, and most thermally sensitive components in the vessel compute rack. Monitoring them is not optional.

NVIDIA Data Center GPU Manager (DCGM) provides deep telemetry for datacenter GPUs. The dcgm-exporter component runs as a standalone container and exposes GPU metrics at port 9400 in Prometheus-compatible format.

docker run -d --gpus all --cap-add SYS_ADMIN \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:4.5.3-4.8.2-distroless

Once running, curl localhost:9400/metrics returns every GPU metric you need:

  • DCGM_FI_DEV_GPU_TEMP: Core temperature in Celsius. Alert above 83°C for sustained periods.
  • DCGM_FI_DEV_POWER_USAGE: Real-time power draw in watts. Baseline your GPUs at idle and under typical inference load, then alert on deviations.
  • DCGM_FI_DEV_SM_CLOCK: Streaming multiprocessor clock frequency in MHz. Thermal throttling drops this below your baseline.
  • DCGM_FI_DEV_MEM_CLOCK: Memory clock frequency. Same throttling concern.
  • DCGM_FI_DEV_MEMORY_TEMP: HBM temperature. On HBM-equipped cards such as the H100, this is often the thermal bottleneck before the core.

The default metrics configuration lives at /etc/dcgm-exporter/default-counters.csv. You can customize which fields to expose by providing your own CSV file. On a vessel, add GPU fan speed, PCIe bandwidth utilization, and ECC error counts. ECC errors in particular are an early warning of hardware degradation from vibration or thermal cycling.

The critical Grafana panels for GPU health:

  1. Temperature time series per GPU, with a horizontal threshold line at 83°C
  2. Power draw overlaid against the vessel's allocated power budget for compute
  3. SM clock vs. base clock (any sustained delta means throttling is occurring)
  4. ECC error counter (single-bit correctable errors are worth watching; double-bit uncorrectable errors require immediate action)
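For a quick spot check outside Grafana, the exporter's text output can be parsed directly. The sketch below is a simplified parser, not a full implementation of the Prometheus exposition format, and it assumes each sample carries a gpu label. It flags instantaneous readings only; the "sustained" part of the 83°C guidance still belongs in an alerting rule with a for: clause.

```python
import re

# Matches lines like: DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-..."} 71
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def hot_gpus(metrics_text: str, limit_c: float = 83.0) -> dict[str, float]:
    """Return {gpu_index: temperature} for GPUs currently above limit_c."""
    hot = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comments
        match = _SAMPLE.match(line)
        if not match or match["name"] != "DCGM_FI_DEV_GPU_TEMP":
            continue
        labels = dict(_LABEL.findall(match["labels"] or ""))
        temperature = float(match["value"])
        if temperature > limit_c:
            hot[labels.get("gpu", "?")] = temperature
    return hot


# Made-up sample in the shape the exporter emits; UUIDs are placeholders.
sample = """# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaa"} 71
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-bbb"} 89
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-bbb"} 412.5
"""
print(hot_gpus(sample))
```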

LLM Inference Metrics: What vLLM Exposes

If you are running a large model on-vessel with vLLM, you already have a Prometheus endpoint built in. vLLM exposes production-grade metrics at /metrics on its API server port with no additional configuration.

The metrics that matter for vessel deployments:

Gauges (current state):

  • vllm:num_requests_running: How many requests are actively generating tokens right now. Sustained high values mean your hardware is saturated.
  • vllm:gpu_cache_usage_perc: Fraction of KV cache blocks currently occupied. This is your canary. When it approaches 1.0, new requests get queued or existing contexts get evicted.

Counters (totals over time):

  • vllm:prompt_tokens_total: Total prompt tokens processed since server start.
  • vllm:generation_tokens_total: Total tokens generated.
  • vllm:request_success_total: Completed requests by finish reason (stop token, length limit, abort). A rising abort count signals context-length pressure or timeout issues.
  • vllm:prefix_cache_hits / vllm:prefix_cache_queries: Cache hit ratio for prefix matching. If your concierge agent asks similar questions repeatedly, this ratio should be high. If it is not, your prompt templates may need restructuring.
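Because these are counters, the useful number is the change between two scrapes, not the raw total. The sketch below shows one way to derive a prefix-cache hit ratio from two readings, including the reset that happens when the server restarts. The function names are hypothetical, and in practice PromQL's rate() does this work for you.

```python
from typing import Optional


def counter_increase(previous: float, current: float) -> float:
    """Increase of a counter between two scrapes, tolerating a reset to zero."""
    # After a server restart the counter starts over, so the new value
    # is itself the increase since the reset.
    return current if current < previous else current - previous


def prefix_cache_hit_ratio(prev_hits: float, hits: float,
                           prev_queries: float, queries: float) -> Optional[float]:
    """Hit ratio over the interval between two scrapes; None if nothing was queried."""
    delta_queries = counter_increase(prev_queries, queries)
    if delta_queries == 0:
        return None
    return counter_increase(prev_hits, hits) / delta_queries


# Hypothetical readings from two consecutive scrapes.
print(prefix_cache_hit_ratio(prev_hits=100, hits=160, prev_queries=200, queries=300))
```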

Histograms (latency distributions):

  • vllm:time_to_first_token_seconds: The metric guests actually feel. Anything above 800ms feels slow for a conversational interface.
  • vllm:inter_token_latency_seconds: Streaming speed. Determines how natural the response feels while generating.
  • vllm:e2e_request_latency_seconds: Total wall-clock time from request to final token.

Set up a Grafana dashboard with panels for P50, P95, and P99 latency. The P99 tells you about tail latency, which is where guest experience degrades first. A P50 of 500ms with a P99 of 3 seconds means one in a hundred requests feels broken, even though the median looks healthy.
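It helps to know where those percentiles come from. A Prometheus histogram stores cumulative bucket counts, and the quantile is estimated by linear interpolation inside the bucket that contains the target rank. The sketch below mirrors that approach for classic histograms; the bucket boundaries and counts are invented for illustration, not vLLM's actual bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The highest bucket must have an upper bound of +Inf, as in Prometheus.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lands in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented time-to-first-token buckets: (upper bound in seconds, cumulative count).
ttft = [(0.25, 300), (0.5, 520), (1.0, 900), (2.5, 985), (5.0, 998), (math.inf, 1000)]
for q in (0.50, 0.95, 0.99):
    print(f"P{int(q * 100)}: {histogram_quantile(q, ttft):.2f}s")
```

The estimate is only as fine as the bucket boundaries, so a P99 that lands in a wide bucket is an approximation, not a measurement.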

OpenTelemetry for Offline Telemetry Pipelines

Metrics are the foundation, but distributed tracing tells you why a specific request was slow. If your on-vessel stack has multiple services (inference server, embedding model, reranker, speech-to-text, tool-execution layer), OpenTelemetry provides the correlation between them.

The OpenTelemetry Collector runs as a local agent. Its architecture is a pipeline: receivers accept data (OTLP, Prometheus scrape, or others), processors transform it (batching, filtering, attribute enrichment), and exporters send it somewhere.

In an offline deployment, "somewhere" is local storage. The collector supports persistent file-backed buffering using the file_storage extension, which writes a Write-Ahead Log (WAL) to disk using bbolt (an embedded key-value store). Each queued batch gets a unique key, and batches are deleted from disk only after successful delivery to their destination.

Here is what an offline-first collector configuration looks like:

extensions:
  file_storage/buffer:
    directory: /var/lib/otel/buffer
    timeout: 10s

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    send_batch_size: 512
    timeout: 5s
  resource:
    attributes:
      - key: vessel.name
        value: "MY-AURORA"
        action: upsert

exporters:
  file/local:
    path: /var/lib/otel/traces.jsonl
  otlp/shore:
    endpoint: shore-collector.shipboardai.com:4317
    sending_queue:
      storage: file_storage/buffer
      queue_size: 5000

service:
  extensions: [file_storage/buffer]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [file/local, otlp/shore]

The dual-exporter pattern is the key. file/local writes every trace to disk immediately (your local audit trail). otlp/shore attempts to send traces to the shore-side collector, but when the link is down, the persistent queue buffers them on disk. When connectivity returns, the collector drains the queue automatically. The default queue holds 1000 batches; on a vessel you can size it to 5000 or more depending on available disk.
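The contract behind the persistent queue is simple: write each batch to disk before attempting delivery, and delete it only after the destination confirms receipt. The toy sketch below illustrates that contract with one JSON file per batch. It is not how the collector implements its queue (which uses bbolt), and the class and method names are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable


class DiskQueue:
    """Toy store-and-forward queue: one file per batch, removed only after delivery."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        existing = self._keys()
        # Resume numbering after a restart so ordering is preserved.
        self._next = int(existing[-1].split(".")[0]) + 1 if existing else 0

    def _keys(self) -> list[str]:
        return sorted(n for n in os.listdir(self.directory) if n.endswith(".json"))

    def enqueue(self, batch: dict) -> str:
        key = f"{self._next:012d}.json"
        self._next += 1
        tmp_path = os.path.join(self.directory, key + ".tmp")
        with open(tmp_path, "w") as handle:
            json.dump(batch, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the batch survives a power cut
        os.replace(tmp_path, os.path.join(self.directory, key))
        return key

    def drain(self, send: Callable[[dict], bool]) -> int:
        """Deliver queued batches in order; stop at the first failure."""
        delivered = 0
        for key in self._keys():
            path = os.path.join(self.directory, key)
            with open(path) as handle:
                batch = json.load(handle)
            if not send(batch):
                break  # link is down: keep this batch and everything after it
            os.remove(path)
            delivered += 1
        return delivered


# Usage: the link is "down" for the first drain and "up" for the second.
queue = DiskQueue(tempfile.mkdtemp())
queue.enqueue({"trace": 1})
queue.enqueue({"trace": 2})
while_down = queue.drain(lambda batch: False)
received = []
while_up = queue.drain(lambda batch: received.append(batch) is None)
```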

A custom minimal collector binary (built with only the components you need) is typically 30-50 MB. It runs as a DaemonSet or a single systemd service alongside your inference stack.

Alerting Without a Cloud Backend

Metrics without alerting are just history. You need to know when things go wrong now, not when someone looks at a dashboard tomorrow.

Prometheus Alertmanager handles this entirely locally. You define alerting rules in Prometheus (expressed as PromQL queries), and when those rules fire, Alertmanager routes the notification.

groups:
  - name: vessel-ai
    rules:
      - alert: GPUThermalThrottling
        expr: DCGM_FI_DEV_SM_CLOCK < 1200 and DCGM_FI_DEV_GPU_TEMP > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} is thermal throttling"

      - alert: KVCachePressure
        expr: vllm:gpu_cache_usage_perc > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "KV cache above 90%, requests may queue or evict"

      - alert: InferenceLatencyDegraded
        expr: histogram_quantile(0.95, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m]))) > 2
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "P95 inference latency exceeded 2 seconds"

For notification delivery without internet, Alertmanager supports local webhooks. Point it at a lightweight service on 127.0.0.1 that writes alerts to the vessel's internal messaging system, triggers a bridge alarm panel indicator, or sends to an internal SMTP server for crew email. On larger vessels with an operational bridge, integrate alerts into the existing alarm management system via a simple webhook-to-NMEA bridge.
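The lightweight service can be very small. The sketch below uses only the Python standard library to accept Alertmanager's webhook POST and turn each alert into a one-line message. The port number and the print call are placeholders; a real deployment would replace the print with a call into the vessel's messaging or alarm system.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload: dict) -> list[str]:
    """Turn an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        status = alert.get("status", "unknown").upper()
        severity = labels.get("severity", "none")
        name = labels.get("alertname", "?")
        lines.append(f"[{status}] {severity} {name}: {summary}")
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line, flush=True)  # placeholder for the vessel's alarm hook
        self.send_response(200)
        self.end_headers()


def serve(port: int = 9099) -> None:
    """Listen on loopback only; port 9099 is an arbitrary choice for this sketch."""
    HTTPServer(("127.0.0.1", port), AlertHandler).serve_forever()


# Example payload, trimmed to the fields this sketch reads.
example = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "KVCachePressure", "severity": "critical"},
        "annotations": {"summary": "KV cache above 90%, requests may queue or evict"},
    }],
}
print(summarize(example)[0])
```

Call serve() from a systemd unit or container entrypoint, and set the Alertmanager webhook receiver's url to the matching loopback address.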

No PagerDuty. No Slack. No cloud dependency. The knowledge ark stays self-contained, and so does its alarm system.

Shore-Side Sync: Selective Telemetry on Your Terms

Full sovereignty does not mean full isolation forever. When the satellite link is healthy, you want to push aggregated data shore-side for fleet-level analytics, long-term capacity planning, and model evaluation across voyages.

The architecture for this is a sync layer that sits between your local observability stack and the shore-side analytics platform. It runs in scheduled batches rather than as a continuous stream, is authenticated and encrypted, runs only during designated sync windows, and sends only what you choose.

What to sync:

  • Aggregated metrics (hourly rollups of latency percentiles, daily token throughput, weekly GPU health summaries). Not raw 15-second samples.
  • Anonymized traces for model evaluation. Strip guest PII before export.
  • Alert history so shore-side teams can audit vessel health without real-time access.

What not to sync:

  • Raw request/response payloads (guest data never leaves the hull).
  • Real-time streams (defeats the purpose of local-first).
  • Anything that creates a runtime dependency (if sync fails, the vessel keeps operating normally).
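To make "aggregated, not raw" concrete, the sketch below collapses raw latency samples into per-hour summaries before export. It is an illustration of the rollup idea only: the function names and the output shape are hypothetical, and in a Prometheus-based stack you would more likely compute these with recording rules.

```python
import math
from collections import defaultdict


def _nearest_rank(sorted_values: list[float], q: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    index = max(0, math.ceil(q * len(sorted_values)) - 1)
    return sorted_values[index]


def hourly_rollup(samples: list[tuple[float, float]]) -> dict[int, dict[str, float]]:
    """Collapse (unix_timestamp, latency_seconds) samples into per-hour summaries."""
    by_hour = defaultdict(list)
    for timestamp, value in samples:
        by_hour[int(timestamp // 3600) * 3600].append(value)
    rollup = {}
    for hour, values in sorted(by_hour.items()):
        values.sort()
        rollup[hour] = {
            "count": len(values),
            "p50": _nearest_rank(values, 0.50),
            "p95": _nearest_rank(values, 0.95),
            "max": values[-1],
        }
    return rollup


# Invented samples: four requests in the first hour, one in the second.
raw = [(0, 0.1), (10, 0.2), (20, 0.3), (30, 0.4), (3600, 1.0)]
print(hourly_rollup(raw))
```

Percentiles are computed from the raw samples inside each hour. Averaging percentiles across hours afterwards would give misleading numbers, so keep the count alongside each summary.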

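Prometheus supports remote-write for forwarding metrics to a shore-side receiver. It reads from the local write-ahead log and retries failed sends with backoff, so short link drops drain automatically once the link returns. Long outages are a different matter: remote-write can only replay what is still in the WAL (on the order of two hours by default), so samples older than that may never reach shore. They remain in the local TSDB, bounded by your retention and disk settings, which is why aggregated rollups exported during a sync window suit multi-day gaps better than raw remote-write. Either way, the vessel never loses observability because the shore-side receiver is unreachable.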

This is the pattern that makes sovereign AI operationally real: all capability runs locally, all data stays local by default, and shore-side integration is an optional, selective, authenticated overlay. Not a dependency.

Putting It All Together: Reference Architecture

A complete vessel observability deployment looks like this:

Component       Resource               Port   Storage
Prometheus      2 vCPU, 4 GB RAM       9090   15-50 GB SSD (90-day retention)
Grafana         1 vCPU, 1 GB RAM       3000   500 MB (dashboards and config)
Loki            1 vCPU, 2 GB RAM       3100   10-30 GB SSD (log retention)
Alertmanager    0.5 vCPU, 256 MB RAM   9093   Minimal
DCGM Exporter   0.5 vCPU, 256 MB RAM   9400   None (stateless)
OTel Collector  1 vCPU, 1 GB RAM       4317   2-10 GB (queue buffer)

Total: roughly 6 vCPU and 8.5 GB RAM. On a vessel that already has dedicated compute hardware for inference, this fits on the same infrastructure with room to spare. You can run the entire observability stack on a single low-power node (an Intel NUC or equivalent) separate from your GPU servers, or containerize everything alongside your inference workload.

Deploy with Docker Compose or a lightweight Kubernetes distribution (k3s works well in air-gapped environments). Version your entire configuration in Git. When you commission a new vessel, clone the repo, run the deploy script, and you have production observability in minutes.

What You Actually Ship

Start with three things: DCGM for GPU hardware, the vLLM /metrics endpoint for inference performance, and Prometheus with a 30-day retention. Add Grafana dashboards for the engineering team. That alone puts you ahead of every maritime AI deployment I have seen in production.

Then layer in Alertmanager for proactive notification, Loki for structured log analysis, and OpenTelemetry if your architecture spans multiple services. Each layer adds value incrementally without requiring the one before it to be perfect first.

The goal is not perfection. The goal is knowing when your AI is degrading before your guests notice. On a vessel, 2000 nautical miles from the nearest systems engineer, that knowledge is the difference between a quiet fix and a reputation-damaging outage.

Your knowledge ark keeps running when the link drops. Your observability stack should too.

