Moonshot AI shipped Kimi K2.6 today. One trillion total parameters, 32 billion active per token, 262K context window, open weights under a modified MIT license. The detail that got my attention: the weights ship natively in INT4, not as a community quant applied after the fact. Here is what that actually means, how it stacks up, and what it takes to put this model on a vessel.
What Moonshot actually shipped
Kimi K2.6 is a Mixture-of-Experts model built for agentic work. 384 experts, 8 selected plus one shared per token, 61 layers, MLA attention. A separate 400M MoonViT vision encoder handles image input.
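To make the "8 selected plus one shared" arithmetic concrete, here is a minimal top-k gating sketch. This is an illustration of the general MoE routing pattern, not Moonshot's actual router: the real gate uses learned weights, load-balancing terms, and runs per layer, none of which is shown here.

```python
import random

NUM_EXPERTS = 384  # routed experts, per the K2.6 spec
TOP_K = 8          # routed experts selected per token (plus 1 shared, always on)

def route(gate_logits):
    """Pick the TOP_K highest-scoring routed experts for one token."""
    assert len(gate_logits) == NUM_EXPERTS
    return sorted(range(NUM_EXPERTS), key=lambda i: gate_logits[i], reverse=True)[:TOP_K]

random.seed(0)
token_logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
chosen = route(token_logits)

active_mlps = TOP_K + 1  # 9 expert MLPs actually run per token
print(active_mlps, f"{active_mlps / (NUM_EXPERTS + 1):.1%} of experts touched")
```

This sparsity is why a 1T-parameter model can activate only 32B parameters per token: each forward pass touches a small, token-dependent slice of the expert pool.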
Think of it as a direct successor to Kimi K2 Thinking (November 2025) and K2.5 (January 2026). K2.5 is where the native-INT4 story actually started. K2.6 inherits that approach and keeps tuning.
The headline capability is not raw intelligence. It is the agentic stack. Moonshot has pushed swarm orchestration up to around 300 agents interleaving thinking and tool calls. That is exactly the shape of workload you want for a vessel concierge that has to compose answers across reservations, crew scheduling, weather, and engineering telemetry without losing state.
Native INT4, not community INT4
Most open-weights models ship in BF16 or FP8 and rely on the community to produce quantized builds via llama.cpp, AWQ, GPTQ, or ExLlama. Those quants work, but they are post-training quantizations. You take a full-precision model and squeeze it afterward, which almost always costs some quality even with careful calibration.
K2.6 does it differently. Moonshot applied Quantization-Aware Training to the MoE weights after base training. The model is trained to live in INT4. Activations stay at higher precision, but the bulk of the weights, which is where the memory cost lives, are native INT4 in compressed-tensors format.
The practical result is a model that weighs about 594 GB on disk at INT4, compared to roughly 2 TB in BF16. Moonshot claims roughly 2x generation speedup with near-FP16 quality. That is a meaningful shift. Previous rounds of "can this open model replace the frontier" hit a wall at inference cost. Native INT4 compresses the biggest line item in the budget by 4x without the usual quality tax.
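The back-of-envelope arithmetic behind those numbers is simple. A sketch, using decimal gigabytes: 1T parameters at 4 bits is 500 GB of raw weight data, and the shipped ~594 GB presumably reflects the pieces kept at higher precision (attention, embeddings, the vision encoder) plus format overhead.

```python
TOTAL_PARAMS = 1.0e12  # 1T parameters

def weights_gb(bits_per_param):
    """Raw weight storage in decimal GB at a given precision."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9

bf16 = weights_gb(16)  # full precision
int4 = weights_gb(4)   # native INT4 MoE weights
print(f"BF16: {bf16:.0f} GB, INT4: {int4:.0f} GB, ratio: {bf16 / int4:.0f}x")
```

That 4x ratio is the whole economic argument: it is the difference between a model that needs a multi-node cluster just to load and one that fits on a single dense server.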
How it stacks up
Moonshot's own numbers put K2.6 in the same conversation as Claude Opus 4.7 and GPT-5.4. A few that matter:
- HLE-Full with tools: 54.0 (Opus 4.7: 53.0, GPT-5.4: 52.1)
- SWE-Bench Pro: 58.6 (GPT-5.4: 57.7, Opus 4.7: 53.4)
- SWE-Bench Verified: 80.2 (Opus 4.7: 80.8)
- Terminal-Bench 2.0: 66.7
- LiveCodeBench v6: 89.6
- BrowseComp: 83.2, rising to 86.3 in swarm mode
- AIME 2026: 96.4 (GPT-5.4 leads at 99.2)
The pattern is clear. Agentic work, long-context reasoning, tool use, real-world software engineering: K2.6 is at the frontier. Closed-form math is a relative weakness. For an on-vessel assistant that has to run a multi-tool workflow without dropping context, this mix is closer to ideal than any open-weights model I have tested.
What it actually takes to run
This is where the story gets expensive.
Native INT4 compresses memory by 4x, which is huge. But the model itself is still 1T parameters. At INT4 you need about 594 GB just for the weights, plus KV cache. A 262K context at INT4 KV adds roughly another 100 GB depending on attention pattern.
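To see where a ~100 GB KV figure can come from, here is a sizing sketch. MLA caches a compressed latent per token per layer rather than full K/V heads; the latent width below is my assumption for illustration, not a published K2.6 number, and the batch size is chosen to show how a single-digit-GB per-sequence cache grows into the article's range under concurrent serving.

```python
LAYERS = 61
CONTEXT = 262_144       # 262K tokens
LATENT_DIM = 576        # ASSUMED MLA latent width, for illustration only
BYTES_PER_ELEM = 0.5    # INT4 KV cache

def kv_cache_gb(batch_size):
    """Approximate MLA KV cache in decimal GB for full-context sequences."""
    per_token_bytes = LAYERS * LATENT_DIM * BYTES_PER_ELEM
    return batch_size * CONTEXT * per_token_bytes / 1e9

print(f"{kv_cache_gb(1):.1f} GB per full-context sequence")
print(f"{kv_cache_gb(22):.1f} GB for a batch of 22")
```

The exact figure moves with the true latent width, cache precision, and batch depth, but the shape of the math holds: context cost scales linearly in tokens and concurrent sequences, and it sits on top of the 594 GB weight floor.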
Moonshot's deployment guide pins the minimum viable unit at 8 GPUs tensor-parallel, targeting H200 (141 GB), B200, or B300 class hardware. A single H100 80 GB, an L40S 48 GB, or even a dual H100 rig will not hold this model in memory. That is not a quantization problem you can solve. It is a floor.
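A quick fit check makes the floor obvious. The weight and cache figures come from the text above; the 10% overhead allowance for activations, runtime buffers, and fragmentation is my assumption, and if anything it is generous to the smaller rigs.

```python
WEIGHTS_GB = 594    # native INT4 checkpoint, from the text
KV_CACHE_GB = 100   # full-context serving budget, from the text
OVERHEAD = 1.10     # ASSUMED allowance for activations and fragmentation

def fits(num_gpus, hbm_per_gpu_gb):
    """Does aggregate HBM cover weights + KV cache + overhead?"""
    needed = (WEIGHTS_GB + KV_CACHE_GB) * OVERHEAD
    return num_gpus * hbm_per_gpu_gb >= needed

print(fits(8, 141))  # 8x H200
print(fits(2, 80))   # dual H100
print(fits(1, 48))   # single L40S
```

Only the 8xH200 row clears the bar, and not by an enormous margin once you want batch depth. No precision trick closes a 600 GB gap on an 80 GB card.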
The unified-memory shortcuts do not save you either. A Mac Studio M3 Ultra with 512 GB unified memory fits Q3 or Q4 community GGUFs with essentially no room for context. DGX Spark or GB10 configurations need five to eight nodes to match what a single 8xH200 server does natively. The Strix Halo boxes at 128 GB are not in the conversation.
For context, our reference 70B offline deployment calls for a dual H100 rig. K2.6 is a different compute class entirely. If you want this model on a vessel, you are sizing a full H200 cluster, not a single 4U appliance.
Serving-stack support is thin but real. vLLM, SGLang, and KTransformers all have working paths on nightly builds with the kimi_k2 parsers wired up. Community GGUFs are landing on Unsloth for llama.cpp and Ollama. TensorRT-LLM and MLX are not officially supported yet; there are workarounds, but I would not build a production deployment around them this week.
Does it belong on a yacht
My honest read, one day in: on most vessels, no. Not yet.
The power, cooling, and acquisition cost of an 8xH200 node does not fit inside a luxury yacht's hotel-load budget. It is a small datacenter, and you only sign that check if K2.6 is doing something no smaller model can do for you.
Where it does belong: the handful of flagship builds with the power, space, and thermal envelope for real compute, and the operational profile to match. A 100m-plus build with full-time IT staff, running a guest concierge, an engineering co-pilot for the ETOs, and a set of background agents that have to compose reasoning across tools. That workload is where K2.6's agentic strength earns its hardware bill.
For everyone else, the sovereign AI story still runs on 30B to 70B open models with KV cache quantization and dual H100s. Those fit on a vessel today, and they do not fall over when the satellite link drops.
Kimi K2.6 is the model that proves open weights can match the frontier. It is not yet the model that fits most vessels. Both of those things can be true at once, and the gap is closing faster than the hardware is.
Weighing whether a flagship build should carry a model this size? Let's talk. We spec vessel AI deployments from sovereignty-first principles, and we will tell you honestly when the right move is a smaller model.