crabNemoClaw with Local Inference

Hardware: NVIDIA RTX 6000 Ada (48 GB VRAM)

Model: Qwen3.5-35B-A3B (MoE, Q6_K quantization, ~138 tok/s)

Stack: llama.cpp → OpenShell inference routing → OpenClaw agent

Tested: March23 2026, NemoClaw 2026.3.11, OpenShell 0.0.13


How It Works

Before diving in, it helps to understand the data flow. The OpenClaw sandbox is an isolated Kubernetes pod — it cannot reach the host directly. All outbound traffic is routed through an OpenShell proxy at 10.200.0.1:3128, which intercepts https://inference.local and forwards it to whatever backend you have configured.

OpenClaw agent
      ↓
https://inference.local
      ↓  (intercepted by OpenShell proxy at 10.200.0.1:3128)
OpenShell gateway  (privacy router — injects credentials, rewrites model)
      ↓
llama-server on host  (your GPU, your model)
      ↓
Qwen3.5-35B-A3B  (MoE, 3B active params per token)

Two things matter here that are easy to get wrong:

  1. Do not bypass the proxy. inference.local only resolves through the OpenShell proxy. Using --noproxy or pointing OpenClaw directly at a host IP will time out.

  2. Use https://, not http://. The proxy only intercepts HTTPS traffic on inference.local.


Prerequisites

REQUIREMENT

NOTES

OS

Ubuntu 22.04 or 24.04

GPU

NVIDIA RTX 6000 Ada (48 GB VRAM)

CUDA

12.0+

Docker

Engine 24+ with NVIDIA Container Toolkit

Node.js

v18+

Disk

~50 GB free

NVIDIA account


Part 1 — Install NemoClaw

Run the onboarding wizard. It will ask for your NVIDIA API key and a cloud model — pick anything, you will replace it with the local model in Part 5.

If you hit a port conflict on 18789:

At step 4, just choose a random model here as we'll use a local model by configuring manually.

Accept the suggested presets (pypi, npm) when prompted. When onboarding finishes:


Part 2 — Build llama.cpp

RTX 6000 Ada is Ada Lovelace — compile with sm_89. The GGML_CUDA_FA_ALL_QUANTS flag is required for Flash Attention to work with quantized KV caches.

Compilation takes 5–10 minutes. Done when you see [100%] Built target llama-server.

Architecture reference: Blackwell (RTX 5090) → 120, Ampere (A100/A40/RTX 3090) → 80, Turing (RTX 2080) → 75


Part 3 — Download the Model

With 48 GB of VRAM, Q6_K gives you near-FP16 accuracy at ~27 GB. Q4_K_XL is faster if you want to trade a bit of quality for throughput.

If you use a RTX 4090, use the Q4_K_XL to avoid OOM problems.


Part 4 — Start llama-server

Run this in a tmux session so it survives terminal disconnects.

Inside tmux, start the server. The --jinja flag is critical — it enables the model's built-in Jinja chat template, which outputs OpenAI-compatible JSON tool calls. Without it, any agent request that triggers a tool call will fail with a parse error.

For Q4_K_XL, use this command:

circle-info

--jinja is an important variable!

Detach from tmux with Ctrl+B D. The server keeps running in the background.

Wait for the model to load (you will see server is listening on http://0.0.0.0:9090), then verify:

You should get back JSON listing the model. If it times out, give it another 30 seconds — first load takes a moment.


Part 5 — Register the Local Inference Provider

This is where OpenShell learns about your llama-server. Use the host's public IP — 127.0.0.1 does not work because requests originate from inside the gateway container.

Set the inference route. Use --no-verify because host.openshell.internal resolves differently inside the gateway than from your host shell.

Confirm it saved:


Part 6 — Configure OpenClaw

OpenClaw reads its config from /sandbox/.openclaw/openclaw.json inside the pod. The file is owned by root and read-only from inside the sandbox, so write it from the host via kubectl.

First, grab your gateway auth token:

Then write the updated config — replace YOUR_GATEWAY_TOKEN with the value above:

Replace Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf with Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf if you use a RTX 4090.

Verify the write:


Part 7 — Test

Connect to the sandbox:

Test the inference endpoint directly.

circle-info

Do not use --noproxy — the proxy needs to intercept this request.

You should get a JSON response from Qwen3.5 within a few seconds. Then test the OpenClaw agent:

Expected output:

If this returns a proper response without a Failed to parse input error, the full stack is working.


Performance

Measured on RTX 6000 Ada (48 GB, Ada Lovelace sm_89):

QUANTIZATION

MODEL SIZE

SPEED

QUALITY

Q4_K_XL

~19 GB

~120–140 tok/s

Good

Q6_K

~27 GB

~120–140 tok/s

Very good (recommended)

Q8_0

~36 GB

~50–70 tok/s

Excellent

Q6_K and Q4_K_XL are close in speed because Qwen3.5-35B-A3B is a MoE model — only 3B parameters activate per token regardless of the 35B total. The bottleneck is memory bandwidth, not parameter count.


Known Issues and Workarounds

Tool call parse error: Failed to parse input at pos N: <tool_call>

Qwen3.5 defaults to its own XML-style tool call format. OpenClaw expects OpenAI-compatible JSON. Fix: add --jinja to the llama-server startup command. This enables the model's built-in Jinja template which outputs proper JSON tool calls.

inference.local returns DNS resolution error

You used --noproxy with curl, or accessed inference.local over http:// instead of https://. The OpenShell proxy only intercepts HTTPS. Drop --noproxy and use https://.

openshell inference set times out during verification

The gateway resolves host.openshell.internal differently than your host shell does. Use --no-verify to skip the probe and save the config anyway.

openshell policy set fails with "filesystem policy cannot be removed"

Known Alpha bug in OpenShell 0.0.13. Dynamic policy updates always fail with this error even when the YAML only contains network_policies. Workaround: edit the static policy file at ~/.nemoclaw/source/nemoclaw-blueprint/policies/openclaw-sandbox.yaml and re-run nemoclaw onboard.

openclaw config set inference.* fails with "Unrecognized key"

OpenClaw 2026.3.11 does not recognize the inference config key. Write openclaw.json directly using the kubectl method in Part 6.

Provider base URL with 127.0.0.1 or localhost does not work

The provider request originates from inside the gateway container. Use the host's actual public IP.


Configuration Reference

llama-server flags

FLAG

VALUE

WHY

--host

0.0.0.0

Must bind all interfaces, not just loopback

--port

9090

Any free port works

-ngl

99

Offload all layers to GPU

--ctx-size

32768

32K context; increase with Q4 KV cache for longer contexts

--cache-type-k

q8_0

High-quality KV cache; use q4_0 for very long contexts

--cache-type-v

q8_0

Same

--flash-attn on

Required for quantized KV cache

--jinja

Required — enables OpenAI-compatible JSON tool call format

openclaw.json provider block

cmake flags by GPU architecture


Quick Reference

Last updated

Was this helpful?