Hmm… Based on known community findings, there do seem to be several settings worth improving:
Your setup is not obviously wrong — Ubuntu + llama.cpp/SYCL + 2× Arc Pro B70 + Qwen3.6-27B is a reasonable direction for local coding agents. However, Intel Arc/Battlemage + llama.cpp/SYCL + Qwen3.6 has several known performance traps, so I would first build a clean baseline and then change one variable at a time.
The most important distinction is:
- prefill / prompt processing: shown as prompt eval time; very important for coding agents because they repeatedly send repo context, diffs, tool results, logs, etc.
- decode / generation: shown as eval time; important for normal token-by-token response speed
- stability / memory behavior: especially important with multi-GPU SYCL, MTP, large context, and long-running agents
Relevant upstream docs/issues:
TL;DR: highest-value things I would test first
| Area | Current-looking setting | What I would test | Why |
|---|---|---|---|
| Measurement | no explicit perf logging | add --perf | Without prompt eval vs eval, tuning is guesswork |
| GPU split | --split-mode layer --tensor-split 1,1 | try --split-mode none --main-gpu 0 | 27B Q5 may fit on one B70; dual-GPU split may add latency |
| Threads | --threads 24 | try --threads 8 --threads-batch 16 | GPU-offloaded inference often does not benefit from many CPU threads |
| NUMA | --numa distribute | remove it first | likely not useful on a normal single-socket workstation |
| KV cache | q8_0/q8_0 | compare f16, q8_0, q4_0 | Arc/B70 quant paths can behave very differently |
| Quant | Q5_K_M only | compare Q4_K_M vs Q5_K_M | Q4 may be much better latency on B70 |
| Flash Attention | unspecified/auto | test -fa on vs auto | often relevant for long-context workloads |
| Build | moving target? | pin/compare builds | known server-intel-b9159 regression exists |
| Continue | one big model for everything? | split autocomplete to smaller model | autocomplete should be latency-optimized |
| MTP | tempting | leave it for later | Qwen3.6 MTP + SYCL still has sharp edges |
1. Add --perf before changing anything
First, keep your current command but add:
--perf
Then look for:
prompt eval time
eval time
tokens per second
Interpretation:
- slow prompt eval = context/prefill problem
- slow eval = generation/quant/backend/split problem
- slow in Continue but not in llama-bench = likely agentic-context or client-side request pattern problem
For coding agents, prompt eval is often the hidden bottleneck. A model can look fine on short prompts or tg128, but feel bad in Continue because every agent step re-sends large context.
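To make the prefill vs decode comparison less manual, something like the following could pull the two throughput numbers out of a saved server log. This is only a sketch: the function name perf_summary is mine, and it assumes the timing lines have the usual shape "prompt eval time = ... ( ... ms per token, N tokens per second)" and "eval time = ...", which may differ between llama.cpp builds.

```shell
# Average prefill and decode throughput from a llama-server log read on stdin.
# Assumption: timing lines end with "... <N> tokens per second)".
perf_summary() {
  awk '
    /eval time =/ && /tokens per second/ {
      tps = ""
      # The throughput is the field right before "tokens per second)".
      for (i = 1; i <= NF; i++)
        if ($i == "tokens" && $(i+1) == "per" && $(i+2) ~ /^second/) tps = $(i-1)
      if (tps == "") next
      # "prompt eval time" is prefill; a plain "eval time" line is decode.
      if ($0 ~ /prompt eval time/) { p_sum += tps; p_n++ } else { e_sum += tps; e_n++ }
    }
    END {
      if (p_n) printf "prompt_eval_tps=%.2f\n", p_sum / p_n
      if (e_n) printf "eval_tps=%.2f\n", e_sum / e_n
    }'
}
```

Usage would be along the lines of `perf_summary < server.log`, run once per configuration so the two numbers can be compared side by side.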
2. Test single GPU before dual-GPU layer split
Your current setup appears to use:
--split-mode layer \
--tensor-split 1,1
I would absolutely compare that with single-GPU mode:
--split-mode none \
--main-gpu 0
Optionally also pin the SYCL device:
export ONEAPI_DEVICE_SELECTOR=level_zero:0
Why this may help:
- a 27B Q5_K_M model may fit on a single 32 GB B70
- layer split helps capacity, but does not guarantee better single-user latency
- decode often does not scale well across multiple GPUs
- multi-GPU SYCL may increase host-memory pressure
- Intel multi-GPU SYCL has a known host-side GTT mirror behavior: SYCL multi-GPU GTT mirror issue
If single GPU is faster or similarly fast, I would use the second B70 for another service instead:
GPU 0: Qwen3.6-27B for chat/edit/agent
GPU 1: Qwen2.5-Coder 1.5B/7B for autocomplete, or embeddings/reranking
For a single-user coding workstation, two independent services can feel better than one model split over two GPUs.
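A rough sketch of the two-service layout, written as a dry run that only prints the commands so they can be reviewed before anything is started. Assumptions on my side: the function name, the autocomplete model filename (a placeholder), and that ONEAPI_DEVICE_SELECTOR=level_zero:1 selects the second B70 on your system (check sycl-ls first).

```shell
# Print (do not run) one llama-server command per GPU.
# $1 = large chat/agent model, $2 = small autocomplete model (placeholder default).
print_service_cmds() {
  big_model=${1:-"$HOME/models/Qwen3.6-27B-Q4_K_M.gguf"}
  small_model=${2:-"$HOME/models/autocomplete-model.gguf"}  # hypothetical filename
  # Each process is restricted to one device, so inside that process the
  # visible device should be index 0 (hence --main-gpu 0 in both commands).
  echo "ONEAPI_DEVICE_SELECTOR=level_zero:0 ./build/bin/llama-server -m $big_model --split-mode none --main-gpu 0 --parallel 1 --port 8080"
  echo "ONEAPI_DEVICE_SELECTOR=level_zero:1 ./build/bin/llama-server -m $small_model --split-mode none --main-gpu 0 --parallel 1 --port 8081"
}
```

Running `print_service_cmds` with no arguments prints both lines; each can then be pasted into its own terminal or service unit.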
3. Reduce CPU threads and set batch threads separately
If you currently use:
--threads 24
I would compare:
--threads 8 \
--threads-batch 16
and:
--threads 4 \
--threads-batch 16
and:
--threads 8 \
--threads-batch 8
--threads and --threads-batch are separate llama.cpp server knobs. --threads is more relevant to generation-side CPU work, while --threads-batch matters for batch/prompt processing. See the official server option docs.
With most layers on GPU, more CPU threads are not always better. Too many threads can add scheduling overhead or just not help. For coding agents, --threads-batch can matter more because large prompt ingestion is common.
4. Remove --numa distribute unless this is really a NUMA machine
If this is a normal single-socket desktop/workstation system, I would remove:
--numa distribute
Baseline should probably be no NUMA setting. Only test NUMA modes later if you know the machine is actually NUMA-relevant.
5. Do not assume q8_0 KV cache is fastest
Your command uses:
--cache-type-k q8_0 \
--cache-type-v q8_0
That may be good for VRAM, but it should be measured. Compare:
--cache-type-k f16 \
--cache-type-v f16
--cache-type-k q8_0 \
--cache-type-v q8_0
--cache-type-k q4_0 \
--cache-type-v q4_0
The B70-specific reason to test this is that quantized paths on Battlemage can behave surprisingly. The clearest known example is Q8_0 being ~4× slower than Q4_K_M on Arc Pro B70. That issue is about model weights, not KV cache, so it does not prove q8_0 KV is bad. But it does prove that on B70, “higher bit = safer/faster” is not a reliable assumption.
The same issue also notes that -DGGML_SYCL_F16=ON improved prompt processing by about 2.4× in one Q4_K_M case, while not improving token generation. That is another clue that prefill and decode must be tuned separately.
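For the KV cache comparison, llama-bench gives cleaner numbers than eyeballing the server. The sketch below only prints the commands. Assumptions: the function name is mine, the prompt/generation sizes are arbitrary, and the llama-bench short flags (-ngl, -fa, -sm, -mg, -ctk, -ctv, -p, -n) are as I remember them, so check `llama-bench --help` on your build. In particular, -fa may expect on/off instead of 1 on some builds. Flash Attention is switched on here because, as far as I know, a quantized V cache needs it.

```shell
# Print one llama-bench command per KV cache type (dry run).
# $1 = path to the GGUF model file.
kv_sweep_cmds() {
  model=$1
  for kv in f16 q8_0 q4_0; do
    # -p 2048 exercises prompt processing, -n 128 exercises generation;
    # both sizes are arbitrary picks, not recommendations.
    echo "./build/bin/llama-bench -m $model -ngl 99 -fa 1 -sm none -mg 0 -ctk $kv -ctv $kv -p 2048 -n 128"
  done
}
```

If the printed commands look right, `kv_sweep_cmds ~/models/Qwen3.6-27B-Q4_K_M.gguf | sh` would run them one after another.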
6. Test Q4_K_M against Q5_K_M
Q5_K_M is reasonable, but for local coding latency I would compare:
Qwen3.6-27B-Q4_K_M
Qwen3.6-27B-Q5_K_M
Qwen3.6-27B-Q6_K
Suggested order:
- Q4_K_M baseline
- Q5_K_M quality comparison
- Q6_K only if you still have enough speed/VRAM
On B70, Q4_K_M may be a better practical latency/quality point than Q5_K_M. The B70 Q8_0 issue is the strongest warning that quant performance on this architecture is not always intuitive: B70 Q8_0 kernel efficiency issue.
7. Explicitly test Flash Attention
Try:
-fa on
and compare with:
-fa auto
and maybe:
-fa off
For long-context coding workloads, Flash Attention can matter, but it should still be measured. The option is documented in the llama.cpp server README.
8. Prefer --n-gpu-layers all over 999
If the intention is “offload everything possible,” use:
--n-gpu-layers all
instead of:
--n-gpu-layers 999
This is mostly clarity, not a guaranteed performance change. The server docs support auto, all, and numeric values.
9. Pin builds; avoid a moving latest tag
There is a very relevant Intel build regression report:
That issue is Qwen3.6-35B-A3B-MTP on Arc Pro B50, not exactly your 27B dense setup, so it is not proof that your setup is affected. But it is close enough to justify build pinning and A/B testing.
Record:
./build/bin/llama-server --version
sycl-ls
uname -a
If using Docker, compare pinned images rather than latest:
ghcr.io/ggml-org/llama.cpp:server-intel-b9144
ghcr.io/ggml-org/llama.cpp:server-intel-b9159
With Intel Arc + SYCL, performance can depend on:
- llama.cpp commit
- Intel compute-runtime
- oneAPI version
- Linux kernel / driver stack
- Docker image contents
- whether i915 or xe is used
- ReBAR / Above 4G / PCIe platform behavior
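Since so many moving parts are involved, I would capture the version commands listed under "Record" above into a file next to each benchmark log. A minimal sketch, assuming it is run from the llama.cpp checkout; the function name and output layout are just my choice, and tools that are not installed are marked instead of aborting the capture.

```shell
# Write the output of the version/identification commands to the file in $1.
record_env() {
  out=$1
  {
    echo "# captured: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    for cmd in "./build/bin/llama-server --version" "sycl-ls" "uname -a"; do
      echo "## $cmd"
      # Unquoted on purpose so the string splits into command + arguments.
      $cmd 2>&1 || echo "(unavailable)"
    done
  } > "$out"
}
```

For example, `record_env env-b9144.txt` before a run against one pinned build and `record_env env-b9159.txt` before the other makes the A/B comparison traceable later.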
10. Check platform basics: ReBAR, PCIe, driver stack
If performance is much lower than other B70 reports, I would verify platform-level things too.
Useful checks:
lspci -vv | grep -i -E "Resizable BAR|Region|prefetchable" -A3
lspci -nnk | grep -i -E "VGA|Display|3D" -A4
sycl-ls
uname -a
Things to confirm:
Above 4G Decoding: enabled
Resizable BAR: enabled
PCIe link width/speed: expected width/speed
driver stack: i915 vs xe
Intel compute-runtime version
oneAPI version
kernel version
There is also a relevant report of very poor SYCL performance on an older DDR4 / PCIe 3.0 platform with Battlemage: brutally bad SYCL performance on Battlemage. Your system sounds much newer, but it is still worth verifying ReBAR/PCIe/driver basics.
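For the PCIe link width/speed check, a small filter over lspci -vv output can flag links that negotiated below their capability. This is a sketch under assumptions: the function name is mine, and it relies on lspci printing a LnkSta: line with a "(downgraded)" annotation, which depends on the pciutils version. Note that lspci usually needs root to show the capability sections at all.

```shell
# Read `lspci -vv` output on stdin and print one line per LnkSta entry,
# prefixed with DOWNGRADED or ok, plus the device it belongs to.
pcie_link_report() {
  awk '
    /^[0-9a-f:.]+ / { dev = $0 }          # remember the current device header
    /LnkSta:/ {
      status = ($0 ~ /downgraded/) ? "DOWNGRADED" : "ok"
      sub(/^[ \t]+/, "")                   # trim leading indentation
      printf "%s | %s | %s\n", status, dev, $0
    }'
}
```

Typical use would be `sudo lspci -vv | pcie_link_report`, then looking at the lines for the two B70 cards and the bridges above them.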
11. Treat MTP as a later experiment, not the first fix
Qwen3.6 MTP is interesting, but I would not add it until the non-MTP baseline is clean.
Relevant issues:
If you test MTP later, start conservatively:
--parallel 1 \
--spec-type draft-mtp \
--spec-draft-n-max 2
Avoid assuming MTP is faster just because draft acceptance is high. On SYCL/Intel Arc, one issue specifically reports correct output but no speed gain, with per-kernel dispatch overhead identified as the remaining bottleneck.
Also watch for:
draft acceptance rate
VRAM before/after requests
forcing full prompt re-processing
create context checkpoint
OOM after multiple requests
12. Watch for full prompt re-processing
For Qwen3.6 and agentic use, look for:
forcing full prompt re-processing
Relevant issue:
If this appears often, the problem may not be raw GPU throughput. It may be prompt cache invalidation, slot reuse behavior, hybrid attention/recurrent-memory behavior, client request shape, or MTP interaction.
For a single local coding-agent user, I would initially force:
--parallel 1
and only increase parallelism after the baseline is stable.
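To see how often that message shows up, counting it in a saved server log is usually enough. A minimal sketch, assuming the log was written to a file and the message text matches the string quoted above; the function name is mine.

```shell
# Count "forcing full prompt re-processing" lines in the log file given as $1.
count_full_reprocess() {
  # grep -c prints 0 but exits non-zero when nothing matches, so mask the status.
  grep -c 'forcing full prompt re-processing' "$1" || true
}
```

If the count is close to the number of agent requests in the same log, the cache is effectively not being reused, and raw GPU throughput is probably not the first thing to fix.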
13. Continue.dev: separate autocomplete from chat/agent
Continue has different model roles: Chat, Edit, Apply, Autocomplete, Embedding, Reranker, etc. See Continue model roles.
For autocomplete specifically, Continue recommends smaller/faster models such as Qwen Coder 2.5 1.5B or 7B: Continue autocomplete docs. The docs also note that thinking-type models are generally not recommended for autocomplete because they generate more slowly.
A practical split:
localhost:8080 -> Qwen3.6-27B-Q4_K_M or Q5_K_M for chat/edit/agent
localhost:8081 -> Qwen2.5-Coder-1.5B or 7B for autocomplete
This can improve perceived responsiveness a lot. Autocomplete should not be queued behind a large 27B agent request.
14. Suggested clean baseline
I would start with something like this:
#!/bin/bash
source /opt/intel/oneapi/setvars.sh --force
export ZES_ENABLE_SYSMAN=1
export ONEAPI_DEVICE_SELECTOR=level_zero:0
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
cd ~/llama.cpp
./build/bin/llama-server \
-m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
-a Roboto \
-c 32768 \
-fa on \
--cache-type-k f16 \
--cache-type-v f16 \
--n-gpu-layers all \
-b 2048 \
-ub 512 \
--threads 8 \
--threads-batch 16 \
--host 0.0.0.0 \
--port 8080 \
--split-mode none \
--main-gpu 0 \
--parallel 1 \
--perf
This is not guaranteed to be best. It is just a cleaner baseline:
- one GPU
- no NUMA complication
- explicit Flash Attention
- explicit KV type
- explicit threads vs threads-batch
- explicit parallel 1
- perf logging
- Q4_K_M as a latency-first starting point
Then change only one thing at a time.
15. Suggested A/B order
Step 0: current setup + perf
Add only:
--perf
Save the logs.
Step 1: single GPU
Change:
--split-mode layer \
--tensor-split 1,1
to:
--split-mode none \
--main-gpu 0
If this is faster, dual-GPU split is probably not helping latency.
Step 2: threads
Try:
--threads 8 \
--threads-batch 16
instead of:
--threads 24
Step 3: remove NUMA
Remove:
--numa distribute
Step 4: KV cache
Compare:
--cache-type-k f16 --cache-type-v f16
--cache-type-k q8_0 --cache-type-v q8_0
--cache-type-k q4_0 --cache-type-v q4_0
Step 5: quant
Compare:
Qwen3.6-27B-Q4_K_M
Qwen3.6-27B-Q5_K_M
Step 6: Flash Attention
Compare:
-fa on
-fa auto
-fa off
Step 7: context size
Compare:
-c 16384
-c 24576
-c 32768
Bigger context is not free. For coding agents, having 32K available is useful, but repeatedly filling it can make the system feel slow.
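On the memory side, the cost of context can be estimated with the usual K/V cache formula for attention layers: 2 (K and V) × layers × KV heads × head dimension × context tokens × bytes per element. The sketch below does that arithmetic in the shell. Everything model-specific is a placeholder: I am not quoting Qwen3.6-27B's real dimensions, and for a hybrid model only the layers that actually keep a K/V cache should be counted. The cache type is passed as bytes per 32-element block (f16 = 64, q8_0 = 34, q4_0 = 18).

```shell
# kv_cache_bytes <attn_layers> <kv_heads> <head_dim> <ctx_tokens> <bytes_per_32_elems>
kv_cache_bytes() {
  # 2x for K and V; dividing by 32 converts block size to per-element cost.
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 32 ))
}

# Example with made-up dimensions (32 layers, 8 KV heads, head_dim 128, 32K context):
#   kv_cache_bytes 32 8 128 32768 64   # f16  -> 4294967296 (4 GiB)
#   kv_cache_bytes 32 8 128 32768 34   # q8_0 -> 2281701376
```

The real dimensions are printed in the model metadata when llama-server loads the GGUF, so the placeholders can be replaced with those before trusting the result.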
Step 8: pinned builds
Compare known builds / commits, especially if using Docker or recent server-intel images.
Step 9: only then try MTP
Only after non-MTP is stable, try:
--parallel 1 \
--spec-type draft-mtp \
--spec-draft-n-max 2
and compare it against non-MTP.
16. What I would put in the log comparison table
Something like this:
Config name:
Model:
Quant:
Backend:
llama.cpp version:
Intel compute-runtime:
oneAPI:
Kernel:
Driver stack:
GPU split:
KV type:
Context size:
Batch / ubatch:
Threads / threads-batch:
Flash Attention:
Parallel slots:
prompt eval t/s:
eval t/s:
VRAM used:
System RAM:
GTT mirror, if checked:
Notes:
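If filling that in by hand gets tedious, a tiny helper can append one CSV row per run. This is only a sketch: the function name and the reduced column set are mine, and values must not contain commas.

```shell
# log_run <csv_file> <config> <kv_type> <ctx> <prompt_eval_tps> <eval_tps> <notes>
log_run() {
  file=$1
  shift
  # Write the header once, when the file is missing or empty.
  [ -s "$file" ] || echo "config,kv_type,ctx,prompt_eval_tps,eval_tps,notes" > "$file"
  # Join the remaining arguments with commas and append them as one row.
  ( IFS=,; echo "$*" ) >> "$file"
}
```

For example, `log_run results.csv baseline f16 32768 500.00 25.00 "single GPU"` after each test keeps all configurations in one file that opens directly in a spreadsheet.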
For multi-GPU SYCL memory behavior, this may help:
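# Pick one llama-server process (the oldest) so $PID is a single value
PID=$(pgrep -o llama-server)
# Print the GTT and VRAM totals reported for each open DRM file descriptor
for fd in /proc/"$PID"/fdinfo/*; do
  grep -H "drm-total-gtt\|drm-total-vram" "$fd" 2>/dev/null
done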
17. My best guess
My guess is that the biggest practical wins will come from:
- single-GPU baseline instead of dual-GPU layer split
- lower CPU thread count with explicit --threads-batch
- removing --numa distribute
- testing Q4_K_M vs Q5_K_M
- testing KV f16 vs q8_0
- pinning known-good llama.cpp / server-intel builds
- separating Continue autocomplete onto a smaller model
I would not start by chasing MTP, huge context sizes, or experimental split modes. First make the normal Qwen3.6-27B path fast and reproducible.