We ran 318 automated benchmarks — 298 on Qwen3.5-9B across four quantisations, plus 20 on NVIDIA Nemotron-3-Nano-4B as a dedicated router — to find the best two-tier inference settings for running OpenClaw locally via LM Studio. Here's what worked, what didn't, and the exact configs you can copy-paste today.
Why Qwen3.5-9B?
If you're running OpenClaw with local models, the 9-billion parameter sweet spot is where things get interesting. Qwen3.5-9B is small enough to run at 40–55 tokens per second on a single consumer GPU, yet capable enough to handle multi-step agent tasks: reasoning through logic puzzles, generating valid tool-call JSON, resisting sycophantic pressure, and maintaining formatting discipline across long outputs.
The catch? Default settings in LM Studio leave performance on the table. The right sampling parameters can mean the difference between an 88% average score and a 95% score — and between a 6% perfect-run rate and a 50% one.
How We Tested
Our benchmark script sends each parameter combination through a five-section evaluation designed to stress-test agent capabilities: logical reasoning (an Einstein-style puzzle), memory recall from padded context, structured tool-call JSON generation, sycophancy resistance (the model must push back on a user who insists 2+2=5), and markdown formatting compliance. Each section scores 0–10, for a maximum of 50 points per run.
We swept across temperature (0.3–0.7), top_k (30–70), top_p, min_p, repeat penalty, max output tokens, and context sizes from 8k to 52k tokens. The campaign spanned multiple sessions totalling over 10 hours of continuous inference across all four quantisations. All tests ran on LM Studio's local API serving Qwen3.5-9B on consumer NVIDIA hardware.
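For readers who want to reproduce the sweep against their own setup, a minimal sketch of the harness loop is below. The endpoint usage matches LM Studio's OpenAI-compatible API (which forwards the extra sampling fields top_k, min_p, and repeat_penalty to the sampler), but the model identifier, prompt placeholder, and scoring are assumptions rather than our actual benchmark code.

```python
# Minimal sketch of the sweep harness, not our exact benchmark script.
# Assumes LM Studio's OpenAI-compatible server at localhost:1234.
import itertools
import requests

BASE_URL = "http://localhost:1234/v1/chat/completions"
MODEL = "qwen3.5-9b"  # assumption: use the identifier LM Studio shows for your GGUF

# Placeholder for the five-section evaluation prompt (not reproduced here).
BENCHMARK_PROMPT = "...five-section agent evaluation..."

def run_once(temperature: float, top_k: int) -> str:
    """Run one benchmark pass with a given temperature/top_k combination."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": BENCHMARK_PROMPT}],
        "temperature": temperature,
        "top_k": top_k,
        "top_p": 0.9,
        "min_p": 0.05,
        "repeat_penalty": 1.05,
        "max_tokens": 24000,
    }
    resp = requests.post(BASE_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Sweep the grid; a scoring function would then apply the 0-50 rubric
# to each transcript.
for temp, top_k in itertools.product([0.3, 0.5, 0.7], [30, 40, 50, 60, 70]):
    transcript = run_once(temp, top_k)
    print(f"temp={temp} top_k={top_k} chars={len(transcript)}")
```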
The Optimal Config
Copy this into your LM Studio or OpenClaw provider settings:
```json
{
  "temperature": 0.4,
  "top_k": 50,
  "top_p": 0.9,
  "min_p": 0.05,
  "repeat_penalty": 1.05,
  "max_tokens": 24000
}
```
Temperature anywhere between 0.3 and 0.7 works well — the model is remarkably stable across this range. We recommend 0.4 as a sensible default: low enough to keep tool-call JSON reliable, high enough to avoid repetitive phrasing in conversational responses. The headline change from our initial testing: top_k=50 is the best universal default, performing well across all four quantisations and all context sizes up to 52k.
The top_k Story: It Depends on Your Quant
This was our most important finding — and it got more nuanced as we scaled from 126 runs to 298. The optimal top_k varies by quantisation:
- Q5_K_M (126 runs): A clear bell curve peaking at top_k=60 (94.8% avg, 39% perfect rate). This was our initial finding and it held up.
- Q6_K_XL (83 runs): Peaks at top_k=30–40 (95.5% and 95.2% respectively). top_k=40 had the highest perfect-run rate at 50%, though from a smaller sample (8 runs). The curve is flatter overall: top_k 30 through 60 all cluster in the 93–96% range.
- Q8_0 (71 runs): Nearly flat across the board. top_k=70 technically leads at 94.0%, but top_k=45 and 50 are within 0.2%. No single top_k dominates; the higher-precision quantisation is more tolerant of sampling width.
- Q4_K_M (18 runs): Limited data, but top_k=55 edged out 60 and 65. Not enough runs to draw strong conclusions.
The reason we recommend top_k=50 as the universal default: it scores in the top tier on every quantisation (92.6% on Q5, 94.4% on Q6, 93.8% on Q8) with zero catastrophic results. If you know you're running Q5_K_M specifically, bump it to 60. If you're on Q6_K_XL, you could go as low as 30–40 for a marginal edge.
top_k at High Context (52k Tokens)
We ran a dedicated 27-run sweep of Q6_K_XL at 52k context to test whether the top_k sweet spot shifts under heavy context load. It does:
At 52k context, top_k=50 dominates at 98.0% average with the highest consistency. top_k=30 holds up well at 95.6%, but top_k=70 drops sharply to 86.7%. The hypothesis: at high context, the large prompt already provides rich token diversity in the attention window; a wide top_k on top of that introduces too much noise, especially hurting sycophancy resistance.
If you're running OpenClaw with long agent memory or multi-turn context, keep top_k at 50 or below.
Quantisation Comparison
We tested four GGUF quantisations of Qwen3.5-9B. With 298 total valid runs, the picture is clearer than our initial 170-run report:
| Quant | Runs | Avg Score | Perfect % | Speed | Verdict |
|---|---|---|---|---|---|
| Q8_0 | 71 | 93.4% | 20% | 40.9 tok/s | Highest Avg |
| Q6_K_XL | 83 | 93.1% | 24% | 46.2 tok/s | Sweet Spot |
| Q5_K_M | 126 | 92.2% | 21% | 53.8 tok/s | Best Speed |
| Q4_K_M | 18 | 91.8% | 6% | 47.1 tok/s | Skip |
With 298 runs, the gap between quants is tighter than our initial 170-run report suggested. Q6_K_XL remains the overall winner: it scores within 0.3% of Q8_0 while running 13% faster, and its 24% perfect-run rate leads all quants. Q5_K_M is the speed king at 53.8 tok/s, but its perfect-run rate lags at 21%. Q4_K_M trails on both speed and reliability.
Which Quantisation Should You Pick?
If you have the VRAM (10+ GB free): go with Q6_K_XL. It's the best balance of quality and speed, and the 24% perfect-run rate means OpenClaw's agent loops complete more reliably on the first attempt — saving you tokens and time on retries.
If VRAM is tight or you're batching requests: Q5_K_M at 53.8 tok/s gives you about 16% more throughput. The quality gap is modest (92.2% vs 93.1%), and for simpler agent tasks like email triage or file management, you won't notice the difference.
If you want maximum quality and don't mind the speed hit: Q8_0 at 40.9 tok/s edges out Q6 by 0.3% on average score. The difference is marginal enough that we'd still recommend Q6 for most users, but if you're running complex multi-step agent chains where every percentage point matters, Q8 is there.
Skip Q4_K_M entirely. It's the worst of both worlds — slower than Q5 with a 6% perfect-run rate.
Max Tokens and Context Size
We tested max_tokens values of 16k, 24k, 32k, and 48k across context windows from 8k to 52k. The short version: set max_tokens to 24,000. Going higher doesn't improve scores (the model rarely needs more than 6,000 tokens for our benchmark tasks), and values below 24k occasionally caused truncation on longer agent workflows.
Context sizes of 8k through 52k all performed well when the model's context window was properly configured in LM Studio. One important caveat: if you load multiple models simultaneously, LM Studio may silently reduce the available context budget. We discovered this the hard way when 21 consecutive 52k runs failed with instant 400 errors after a model reload — the fix was simply reloading the model as the sole active model, restoring the full context allocation.
For OpenClaw specifically, most agent turns use 2,000–8,000 output tokens, so 24k gives you generous headroom without wasting memory on unused KV cache.
Sycophancy: The Hidden Differentiator
Here's what surprised us most. The scoring gap between configurations was almost entirely driven by Section 4: sycophancy resistance. This section asks the model to politely but firmly correct a user who insists that 2+2=5. At suboptimal settings, the model would cave and agree with the user. At optimal settings, it maintained its position respectfully.
On Q6_K_XL, top_k=30 achieved the best sycophancy score at 9.1/10, while top_k=45 was the worst at 6.7/10. Interestingly, this didn't follow the overall score ranking perfectly — sycophancy resistance appears to be sensitive to specific top_k values in ways that other sections aren't.
This matters more than you'd think for agent use. An agent that capitulates to incorrect user assertions will make bad decisions in multi-step workflows — accepting invalid data, skipping necessary validation, or rubber-stamping flawed plans. The jump from 6.7/10 to 9.1/10 at the right top_k is the difference between an agent that pushes back when it should and one that just tells you what you want to hear.
Temperature Doesn't Matter (Much)
Across all 298 runs and four quantisations: temp 0.3 averaged 92.9%, temp 0.5 averaged 93.0%, temp 0.7 averaged 92.3%. This is good news — you can tune temperature to your personal preference without worrying about breaking agent reliability. Lower for more deterministic tool calls, higher for more natural conversational tone.
Nemotron-3-Nano-4B as a Router Model
A two-tier architecture — a lightweight router that classifies and dispatches tasks, feeding a heavier model for actual work — is the most efficient way to run OpenClaw locally. We benchmarked NVIDIA's Nemotron-3-Nano-4B as that router across 20 runs with a dedicated six-section router evaluation: intent classification (10 pts), entity extraction (7 pts), tool dispatch (6 pts), summarisation (5 pts), format compliance (5 pts), and instruction resistance (2 pts), for a total of 35 points per run.
Router Benchmark Config
```json
{
  "model": "nvidia/nemotron-3-nano-4b",
  "temperature": 0.5,
  "top_k": 40,
  "max_tokens": 5000
}
```
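To make the router's job concrete, here is a hedged sketch of one routing call using the config above; the intent labels, system prompt, and output JSON shape are illustrative assumptions, not the benchmark's actual rubric.

```python
# One routing call against the Nemotron config above. The intent labels and
# the output JSON shape are illustrative assumptions.
import requests

ROUTER_URL = "http://localhost:1234/v1/chat/completions"

ROUTER_SYSTEM = (
    "You are a task router. Classify the request as one of: email, files, "
    "code, web, chat. Reply only with JSON of the form "
    '{"intent": "...", "entities": [], "tool": "..."}.'
)

def route(user_message: str) -> str:
    """Classify a single user message and return the router's raw JSON reply."""
    payload = {
        "model": "nvidia/nemotron-3-nano-4b",
        "messages": [
            {"role": "system", "content": ROUTER_SYSTEM},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.5,
        "top_k": 40,
        "max_tokens": 5000,
    }
    resp = requests.post(ROUTER_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(route("Archive every newsletter older than 30 days"))
```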
No-Padding Results (Short Context)
With minimal context (just the routing prompt), Nemotron was remarkably consistent:
| Setting | Score | Speed | Notes |
|---|---|---|---|
| top_k=40 (all temps) | 97.1% | 57–103 tok/s | Consistent |
| top_k=50 (low temp) | 94–97% | 60–95 tok/s | Good |
| top_k=50 (high temp) | 80–94% | 58–90 tok/s | Unstable |
The key finding: top_k=40 scored an identical 34/35 (97.1%) at every temperature tested (0.3 to 1.0). top_k=50 introduced variance at higher temperatures, dropping as low as 80%. For a router, consistency matters more than peak score: you never want a misrouted task.
8k Context Results (Under Load)
When we padded the prompt to 8,000 tokens of synthetic agent memory (simulating real-world multi-turn context), performance degraded significantly:
At 8k context, scores swung wildly from 51% to 100%. The one perfect 35/35 run landed at temp=0.8, top_k=50, but that was an outlier. Intent classification (S1) and tool dispatch (S3) — the two most critical router functions — broke down first under context pressure. The Nemotron Nano's hybrid Mamba/attention architecture, with only 4 attention layers out of 42, lacks the attention density needed for reliable long-context routing.
The practical takeaway: keep the router's input tight. Strip the routing prompt to just the user's latest message plus minimal metadata. Don't pass full conversation history through the router — that's what Qwen 9B handles downstream.
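Here is a minimal sketch of that trimming step; the message structure and metadata field are assumptions about how your agent stores its history, not OpenClaw internals.

```python
# Build the router's input from the latest user message plus minimal metadata,
# instead of the full conversation history. Field names are illustrative.
def build_router_messages(history: list[dict], active_tool: str = "none") -> list[dict]:
    """Return a compact message list for the router model."""
    latest_user = next(
        (m["content"] for m in reversed(history) if m["role"] == "user"), ""
    )
    system = f"Route this request. Active tool: {active_tool}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": latest_user},
    ]

# The full history stays with the downstream Qwen 9B call, not the router.
```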
Why Nemotron for Routing?
At 60–100 tok/s, Nemotron is 1.5–2× faster than Qwen 9B (40–54 tok/s). For a router that just needs to classify an intent, extract a few entities, and pick a tool, that speed advantage compounds across every agent turn. It runs on just 5 GB of VRAM, leaving the rest for your main model. We also tested Qwen3.5-0.8B as a router candidate — it performed well on simple classification but showed inflated self-assessment scores and struggled with reasoning, making it less reliable for production routing.
Two-Tier Config for OpenClaw
The recommended architecture for local OpenClaw deployments is two-tier: Nemotron-3-Nano-4B classifies each incoming request and picks the tool, and Qwen3.5-9B handles the actual agent work downstream.
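As a concrete illustration, here is a minimal sketch of that dispatch loop, assuming both models are served by a single LM Studio instance and selected via the model field; the model identifiers and routing prompt are assumptions, not OpenClaw's internals.

```python
# Two-tier dispatch sketch: Nemotron routes, Qwen does the work. Model
# identifiers and the routing prompt are assumptions, not OpenClaw internals.
# Note: if both models share one LM Studio instance, watch the context-budget
# caveat from the "Max Tokens and Context Size" section.
import requests

BASE_URL = "http://localhost:1234/v1/chat/completions"

ROUTER_CFG = {"model": "nvidia/nemotron-3-nano-4b", "temperature": 0.5,
              "top_k": 40, "max_tokens": 5000}
AGENT_CFG = {"model": "qwen3.5-9b", "temperature": 0.4, "top_k": 50,
             "top_p": 0.9, "min_p": 0.05, "repeat_penalty": 1.05,
             "max_tokens": 24000}

def chat(cfg: dict, messages: list[dict]) -> str:
    """Send one chat completion request to the local server."""
    resp = requests.post(BASE_URL, json={**cfg, "messages": messages}, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def handle_turn(user_message: str) -> str:
    # Tier 1: the router sees only the latest message and emits a routing decision.
    decision = chat(ROUTER_CFG, [
        {"role": "system", "content": "Classify the intent and pick a tool. Reply as JSON."},
        {"role": "user", "content": user_message},
    ])
    # Tier 2: the agent model receives the task plus the routing decision.
    return chat(AGENT_CFG, [
        {"role": "system", "content": f"Routing decision: {decision}"},
        {"role": "user", "content": user_message},
    ])

print(handle_turn("Summarise yesterday's unread emails and flag anything urgent"))
```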
LM Studio Setup
Here's how to apply these settings in LM Studio:
1. Load Qwen3.5-9B-Q6_K_XL.gguf (or Q5_K_M if VRAM-constrained).
2. In the server settings, set the context length to at least 16,384 tokens. For long agent contexts, 52k works well but ensure only one model is loaded.
3. Apply the sampling parameters from the config block above.
4. Start the server. The default endpoint is http://localhost:1234/v1.
5. In OpenClaw, go to Settings → AI Providers → Add Custom Provider. Enter the LM Studio URL and the model name. OpenClaw will auto-detect the OpenAI-compatible API.
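Once the server is running, a quick smoke test confirms the endpoint and sampling parameters are being picked up. This is a minimal sketch assuming the openai Python package; LM Studio ignores the API key, and the non-OpenAI sampling fields travel via extra_body.

```python
# Smoke test for the LM Studio endpoint with the recommended settings.
# Assumes `pip install openai`; the model name should match what LM Studio shows.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="qwen3.5-9b",  # assumption: replace with your loaded model's identifier
    messages=[{"role": "user", "content": "Reply with exactly: ready"}],
    temperature=0.4,
    top_p=0.9,
    max_tokens=64,
    # top_k, min_p and repeat_penalty are not OpenAI-spec fields, so they are
    # passed via extra_body; LM Studio forwards them to the sampler.
    extra_body={"top_k": 50, "min_p": 0.05, "repeat_penalty": 1.05},
)
print(reply.choices[0].message.content)
```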
💡 Running local models means zero API costs. But if you're mixing local and cloud models in OpenClaw's auto-router, use our cost calculator to estimate your blended monthly spend.
TL;DR
- Agent model: Qwen3.5-9B at Q6_K_XL (Q5_K_M if VRAM is tight), temperature 0.4, top_k 50, top_p 0.9, min_p 0.05, repeat penalty 1.05, max_tokens 24,000.
- Router model: Nemotron-3-Nano-4B, temperature 0.5, top_k 40, max_tokens 5,000, with the routing prompt kept to the latest message plus minimal metadata.
- Total: 318 benchmark runs across both models.