What should I change to optimize locally hosted AI?

I have a server with the following hardware:

Intel Ultra 7 270K Plus

64 GB RAM

2x Intel Arc B70 (32 GB VRAM each)

I'm running Ubuntu Server with llama.cpp.

I'm using it for local agentic coding with the continue.dev plugin for VS Code.

My startllm.sh file looks like this:

#!/bin/bash
source /opt/intel/oneapi/setvars.sh --force
export ZES_ENABLE_SYSMAN=1

cd ~/llama.cpp

ZES_ENABLE_SYSMAN=1 ./build/bin/llama-server \
    -m ~/models/Qwen3.6-27B-Q5_K_M.gguf \
    -a Roboto \
    -c 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --n-gpu-layers 999 \
    -b 2048 \
    -ub 512 \
    --threads 24 \
    --host 0.0.0.0 \
    --port 8080 \
    --split-mode layer \
    --tensor-split 1,1 \
    --numa distribute

It still feels slow to respond. Which parameters should I change?

Hmm… Based on known community findings, there do seem to be several settings worth improving:


Your setup is not obviously wrong — Ubuntu + llama.cpp/SYCL + 2× Arc Pro B70 + Qwen3.6-27B is a reasonable direction for local coding agents. However, Intel Arc/Battlemage + llama.cpp/SYCL + Qwen3.6 has several known performance traps, so I would first build a clean baseline and then change one variable at a time.

The most important distinction is:

  • prefill / prompt processing: shown as prompt eval time; very important for coding agents because they repeatedly send repo context, diffs, tool results, logs, etc.
  • decode / generation: shown as eval time; important for normal token-by-token response speed
  • stability / memory behavior: especially important with multi-GPU SYCL, MTP, large context, and long-running agents

Relevant upstream docs/issues:

TL;DR: highest-value things I would test first

| Area | Current-looking setting | What I would test | Why |
|---|---|---|---|
| Measurement | no explicit perf logging | add --perf | Without prompt eval vs eval, tuning is guesswork |
| GPU split | --split-mode layer --tensor-split 1,1 | try --split-mode none --main-gpu 0 | 27B Q5 may fit on one B70; dual-GPU split may add latency |
| Threads | --threads 24 | try --threads 8 --threads-batch 16 | GPU-offloaded inference often does not benefit from many CPU threads |
| NUMA | --numa distribute | remove it first | likely not useful on a normal single-socket workstation |
| KV cache | q8_0/q8_0 | compare f16, q8_0, q4_0 | Arc/B70 quant paths can behave very differently |
| Quant | Q5_K_M only | compare Q4_K_M vs Q5_K_M | Q4 may be much better latency on B70 |
| Flash Attention | unspecified/auto | test -fa on vs auto | often relevant for long-context workloads |
| Build | moving target? | pin/compare builds | known server-intel-b9159 regression exists |
| Continue | one big model for everything? | split autocomplete to smaller model | autocomplete should be latency-optimized |
| MTP | tempting | leave it for later | Qwen3.6 MTP + SYCL still has sharp edges |

1. Add --perf before changing anything

First, keep your current command but add:

--perf

Then look for:

prompt eval time
eval time
tokens per second

Interpretation:

  • slow prompt eval = context/prefill problem
  • slow eval = generation/quant/backend/split problem
  • slow in Continue but not in llama-bench = likely agentic-context or client-side request pattern problem

For coding agents, prompt eval is often the hidden bottleneck. A model can look fine on short prompts or tg128, but feel bad in Continue because every agent step re-sends large context.
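To make the prefill/decode split concrete, here is a minimal sketch that averages the tokens-per-second figures from a saved server log. It assumes the timing lines look like the samples in the comments (the shape llama.cpp has printed in the builds I have seen); the function name summarize_timings is made up for this example, and the average is a simple unweighted mean across requests.

```shell
# Summarise llama.cpp timing lines from a saved log (file argument or stdin).
# Assumed line shapes (check your own log, the format may differ by build):
#   prompt eval time = 100.00 ms / 50 tokens ( 2.00 ms per token, 500.00 tokens per second)
#          eval time = 1000.00 ms / 100 tokens ( 10.00 ms per token, 100.00 tokens per second)
summarize_timings() {
  awk '
    /eval time/ && /tokens per second/ {
      # "prompt eval time" is prefill; a bare "eval time" is decode.
      kind = ($0 ~ /prompt eval time/) ? "prefill" : "decode"
      for (i = 2; i + 2 <= NF; i++) {
        # The number right before "tokens per second" is the throughput.
        if ($i == "tokens" && $(i + 1) == "per" && $(i + 2) ~ /^second/) {
          sum[kind] += $(i - 1)
          n[kind]++
        }
      }
    }
    END {
      for (k in sum) printf "%s avg_tps=%.1f samples=%d\n", k, sum[k] / n[k], n[k]
    }
  ' "$@" | sort
}

# Usage (hypothetical log path):
#   ./startllm.sh 2>&1 | tee ~/llama-server.log
#   summarize_timings ~/llama-server.log
```

If prefill throughput drops sharply once Continue is driving the server, that points at context size or prompt re-processing rather than raw generation speed.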

2. Test single GPU before dual-GPU layer split

Your current setup uses:

--split-mode layer \
--tensor-split 1,1

I would absolutely compare that with single-GPU mode:

--split-mode none \
--main-gpu 0

Optionally also pin the SYCL device:

export ONEAPI_DEVICE_SELECTOR=level_zero:0

Why this may help:

  • a 27B Q5_K_M model may fit on a single 32 GB B70
  • layer split helps capacity, but does not guarantee better single-user latency
  • decode often does not scale well across multiple GPUs
  • multi-GPU SYCL may increase host-memory pressure
  • Intel multi-GPU SYCL has a known host-side GTT mirror behavior: SYCL multi-GPU GTT mirror issue

If single GPU is faster or similarly fast, I would use the second B70 for another service instead:

GPU 0: Qwen3.6-27B for chat/edit/agent
GPU 1: Qwen2.5-Coder 1.5B/7B for autocomplete, or embeddings/reranking

For a single-user coding workstation, two independent services can feel better than one model split over two GPUs.
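Before switching to single-GPU mode, a rough fit check can save a failed load. The sketch below only compares the GGUF file size against the card's VRAM minus a headroom you choose for KV cache and compute buffers. The function name fits_on_one_gpu and the 8 GiB headroom in the usage comment are assumptions for illustration, not measured values.

```shell
# fits_on_one_gpu MODEL_BYTES VRAM_GIB HEADROOM_GIB
# Succeeds (exit 0) if the model file fits in VRAM after reserving headroom
# for KV cache and compute buffers. This is a rough size check only.
fits_on_one_gpu() {
  local model_bytes=$1 vram_gib=$2 headroom_gib=$3
  local gib=$((1024 * 1024 * 1024))
  local budget=$(( (vram_gib - headroom_gib) * gib ))
  [ "$model_bytes" -le "$budget" ]
}

# Usage (GNU stat; the headroom value is a guess to adjust):
#   size=$(stat -c %s ~/models/Qwen3.6-27B-Q5_K_M.gguf)
#   if fits_on_one_gpu "$size" 32 8; then echo "try --split-mode none"; fi
```

The real answer still comes from loading the model with your context size and watching VRAM, but this catches the obviously-too-large cases early.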

3. Reduce CPU threads and set batch threads separately

If you currently use:

--threads 24

I would compare:

--threads 8 \
--threads-batch 16

and:

--threads 4 \
--threads-batch 16

and:

--threads 8 \
--threads-batch 8

--threads and --threads-batch are separate llama.cpp server knobs. --threads is more relevant to generation-side CPU work, while --threads-batch matters for batch/prompt processing. See the official server option docs.

With most layers on GPU, more CPU threads are not always better. Too many threads can add scheduling overhead or just not help. For coding agents, --threads-batch can matter more because large prompt ingestion is common.

4. Remove --numa distribute unless this is really a NUMA machine

If this is a normal single-socket desktop/workstation system, I would remove:

--numa distribute

Baseline should probably be no NUMA setting. Only test NUMA modes later if you know the machine is actually NUMA-relevant.
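A quick way to check whether the machine exposes more than one NUMA node is to count the node directories in sysfs. This is a small sketch; the function name numa_node_count is made up, and the optional argument exists only so the path can be overridden.

```shell
# Count NUMA nodes the kernel exposes. 0 or 1 means there is nothing
# for --numa to distribute across.
numa_node_count() {
  local root=${1:-/sys/devices/system/node}
  local count=0 d
  for d in "$root"/node[0-9]*; do
    if [ -d "$d" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Usage:
#   numa_node_count
```

On a typical single-socket desktop this should report a single node, which supports dropping the flag from the baseline.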

5. Do not assume q8_0 KV cache is fastest

Your command uses:

--cache-type-k q8_0 \
--cache-type-v q8_0

That may be good for VRAM, but it should be measured. Compare:

--cache-type-k f16 \
--cache-type-v f16

--cache-type-k q8_0 \
--cache-type-v q8_0

--cache-type-k q4_0 \
--cache-type-v q4_0

The B70-specific reason to test this is that quantized paths on Battlemage can behave surprisingly. The clearest known example is Q8_0 being ~4× slower than Q4_K_M on Arc Pro B70. That issue is about model weights, not KV cache, so it does not prove q8_0 KV is bad. But it does prove that on B70, “higher bit = safer/faster” is not a reliable assumption.

The same issue also notes that -DGGML_SYCL_F16=ON improved prompt processing by about 2.4× in one Q4_K_M case, while not improving token generation. That is another clue that prefill and decode must be tuned separately.
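To see what each KV type costs in memory, the arithmetic can be sketched as below. It uses the standard estimate for a plain full-attention transformer (K and V each store layers × KV heads × head dim × context elements) and the ggml block sizes: f16 is 2 bytes per element, q8_0 is 34 bytes per 32 elements, q4_0 is 18 bytes per 32 elements. The model dimensions in the usage comment are placeholders, not Qwen3.6-27B's real values, and a hybrid architecture would need less than this estimate.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX TYPE
# Prints an estimated KV cache size in MiB for a full-attention model.
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=$5
  local num den
  case "$type" in
    f16)  num=2;  den=1  ;;  # 2 bytes per element
    q8_0) num=34; den=32 ;;  # 34-byte block per 32 elements
    q4_0) num=18; den=32 ;;  # 18-byte block per 32 elements
    *) echo "unknown cache type: $type" >&2; return 1 ;;
  esac
  # Factor 2 covers K and V.
  local elems=$((2 * layers * kv_heads * head_dim * ctx))
  echo $(( elems * num / den / 1048576 ))
}

# Usage with placeholder dimensions (64 layers, 8 KV heads, head dim 128):
#   for t in f16 q8_0 q4_0; do
#     echo "$t: $(kv_cache_mib 64 8 128 32768 "$t") MiB"
#   done
```

With those placeholder dimensions at 32K context, f16 comes to 8192 MiB, q8_0 to 4352 MiB, and q4_0 to 2304 MiB, which shows why f16 KV is only practical if the weights leave enough room.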

6. Test Q4_K_M against Q5_K_M

Q5_K_M is reasonable, but for local coding latency I would compare:

Qwen3.6-27B-Q4_K_M
Qwen3.6-27B-Q5_K_M
Qwen3.6-27B-Q6_K

Suggested order:

  1. Q4_K_M baseline
  2. Q5_K_M quality comparison
  3. Q6_K only if you still have enough speed/VRAM

On B70, Q4_K_M may be a better practical latency/quality point than Q5_K_M. The B70 Q8_0 issue is the strongest warning that quant performance on this architecture is not always intuitive: B70 Q8_0 kernel efficiency issue.

7. Explicitly test Flash Attention

Try:

-fa on

and compare with:

-fa auto

and maybe:

-fa off

For long-context coding workloads, Flash Attention can matter, but it should still be measured. The option is documented in the llama.cpp server README.

8. Prefer --n-gpu-layers all over 999

If the intention is “offload everything possible,” use:

--n-gpu-layers all

instead of:

--n-gpu-layers 999

This is mostly clarity, not a guaranteed performance change. The server docs support auto, all, and numeric values.

9. Pin builds; avoid moving latest

There is a very relevant Intel build regression report:

That issue is Qwen3.6-35B-A3B-MTP on Arc Pro B50, not exactly your 27B dense setup, so it is not proof that your setup is affected. But it is close enough to justify build pinning and A/B testing.

Record:

./build/bin/llama-server --version
sycl-ls
uname -a

If using Docker, compare pinned images rather than latest:

ghcr.io/ggml-org/llama.cpp:server-intel-b9144
ghcr.io/ggml-org/llama.cpp:server-intel-b9159

With Intel Arc + SYCL, performance can depend on:

  • llama.cpp commit
  • Intel compute-runtime
  • oneAPI version
  • Linux kernel / driver stack
  • Docker image contents
  • whether i915 or xe is used
  • ReBAR / Above 4G / PCIe platform behavior
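Since so many layers of the stack can change the result, it helps to snapshot them next to each benchmark run. The sketch below writes what it can find into a text file and skips tools that are not installed. The function name record_env and the output path are made up for this example.

```shell
# record_env OUTFILE
# Save kernel, llama.cpp build, SYCL devices and GPU driver binding to a file.
# Missing tools are skipped rather than treated as errors.
record_env() {
  local out=$1
  {
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "kernel: $(uname -a)"
    if [ -x ./build/bin/llama-server ]; then
      echo "--- llama-server --version"
      ./build/bin/llama-server --version 2>&1 || true
    fi
    if command -v sycl-ls >/dev/null 2>&1; then
      echo "--- sycl-ls"
      sycl-ls 2>&1 || true
    fi
    if command -v lspci >/dev/null 2>&1; then
      echo "--- GPU driver binding (i915 vs xe)"
      lspci -nnk 2>/dev/null | grep -i -E -A3 'VGA|Display|3D' || true
    fi
  } > "$out"
}

# Usage (run from ~/llama.cpp after sourcing setvars.sh):
#   record_env ~/bench/env-$(date +%Y%m%d).txt
```

Keeping one such file per benchmark session makes it much easier to tell a llama.cpp regression from a driver or kernel update.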

10. Check platform basics: ReBAR, PCIe, driver stack

If performance is much lower than other B70 reports, I would verify platform-level things too.

Useful checks:

sudo lspci -vv | grep -i -E -A3 "Resizable BAR|Region|prefetchable"
lspci -nnk | grep -i -E "VGA|Display|3D" -A4
sycl-ls
uname -a

Things to confirm:

Above 4G Decoding: enabled
Resizable BAR: enabled
PCIe link width/speed: expected width/speed
driver stack: i915 vs xe
Intel compute-runtime version
oneAPI version
kernel version

There is also a relevant report of very poor SYCL performance on an older DDR4 / PCIe 3.0 platform with Battlemage: brutally bad SYCL performance on Battlemage. Your system sounds much newer, but it is still worth verifying ReBAR/PCIe/driver basics.
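For the PCIe link check, the negotiated and maximum link values can be read from sysfs without root. This sketch lists every display-class PCI device with its current and maximum link. The function names are made up for this example. One caveat: on cards with an on-card bridge, the GPU function itself may report a misleading link, so the upstream port can be the more meaningful entry.

```shell
# Print an attribute file, or "?" if it is missing or unreadable.
read_attr() {
  if [ -r "$1" ]; then cat "$1"; else echo '?'; fi
}

# pcie_link_report [SYSFS_ROOT]
# List display-class PCI devices with current vs maximum link speed and width.
pcie_link_report() {
  local root=${1:-/sys/bus/pci/devices}
  local dev class
  for dev in "$root"/*; do
    [ -r "$dev/class" ] || continue
    class=$(cat "$dev/class")
    case "$class" in
      0x03*) ;;         # PCI base class 0x03 = display controller
      *) continue ;;
    esac
    printf '%s current=%s x%s max=%s x%s\n' \
      "${dev##*/}" \
      "$(read_attr "$dev/current_link_speed")" \
      "$(read_attr "$dev/current_link_width")" \
      "$(read_attr "$dev/max_link_speed")" \
      "$(read_attr "$dev/max_link_width")"
  done
}

# Usage:
#   pcie_link_report
```

A card that negotiates fewer lanes or a lower speed than its maximum while under load is worth investigating before tuning any llama.cpp flags.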

11. Treat MTP as a later experiment, not the first fix

Qwen3.6 MTP is interesting, but I would not add it until the non-MTP baseline is clean.

Relevant issues:

If you test MTP later, start conservatively:

--parallel 1 \
--spec-type draft-mtp \
--spec-draft-n-max 2

Avoid assuming MTP is faster just because draft acceptance is high. On SYCL/Intel Arc, one issue specifically reports correct output but no speed gain, with per-kernel dispatch overhead identified as the remaining bottleneck.

Also watch for:

draft acceptance rate
VRAM before/after requests
forcing full prompt re-processing
create context checkpoint
OOM after multiple requests

12. Watch for full prompt re-processing

For Qwen3.6 and agentic use, look for:

forcing full prompt re-processing

Relevant issue:

If this appears often, the problem may not be raw GPU throughput. It may be prompt cache invalidation, slot reuse behavior, hybrid attention/recurrent-memory behavior, client request shape, or MTP interaction.

For a single local coding-agent user, I would initially force:

--parallel 1

and only increase parallelism after the baseline is stable.
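To judge how often this happens, count the message against the number of requests in a saved log. The sketch below assumes each request prints one "prompt eval time" line, which is how I would expect the server timing output to look; the function name reprocess_summary is made up.

```shell
# reprocess_summary LOGFILE
# Compare forced full re-processing events against the number of requests.
# Assumes one "prompt eval time" line is logged per request.
reprocess_summary() {
  local log=$1 forced requests
  # grep -c exits non-zero on zero matches, so "|| true" keeps this usable
  # under "set -e".
  forced=$(grep -c -F 'forcing full prompt re-processing' "$log" || true)
  requests=$(grep -c -F 'prompt eval time' "$log" || true)
  echo "forced_reprocess=$forced requests=$requests"
}

# Usage (hypothetical log path):
#   reprocess_summary ~/llama-server.log
```

If the forced count is close to the request count, prompt caching is effectively not working for your client's request pattern.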

13. Continue.dev: separate autocomplete from chat/agent

Continue has different model roles: Chat, Edit, Apply, Autocomplete, Embedding, Reranker, etc. See Continue model roles.

For autocomplete specifically, Continue recommends smaller/faster models such as Qwen Coder 2.5 1.5B or 7B: Continue autocomplete docs. The docs also note that thinking-type models are generally not recommended for autocomplete because they generate more slowly.

A practical split:

localhost:8080 -> Qwen3.6-27B-Q4_K_M or Q5_K_M for chat/edit/agent
localhost:8081 -> Qwen2.5-Coder-1.5B or 7B for autocomplete

This can improve perceived responsiveness a lot. Autocomplete should not be queued behind a large 27B agent request.
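On the Continue side, the split could look roughly like the config written by the sketch below. The field names (models, provider, apiBase, roles) reflect my understanding of Continue's config.yaml format and should be checked against the Continue docs. The model name qwen2.5-coder for the second server is a placeholder alias, and the function writes to a path you pass in rather than touching your real config.

```shell
# write_continue_example OUTFILE
# Write an example Continue config that sends chat/edit/apply to the 27B
# server and autocomplete to a second, smaller server.
# Field names are an assumption; verify them against the Continue docs.
write_continue_example() {
  cat > "$1" <<'EOF'
name: Local llama.cpp
version: 0.0.1
schema: v1
models:
  - name: Qwen3.6 27B (chat/edit/agent)
    provider: llama.cpp
    model: Roboto            # alias set with -a in startllm.sh
    apiBase: http://localhost:8080
    roles:
      - chat
      - edit
      - apply
  - name: Qwen2.5-Coder (autocomplete)
    provider: llama.cpp
    model: qwen2.5-coder     # placeholder alias for the second server
    apiBase: http://localhost:8081
    roles:
      - autocomplete
EOF
}

# Usage: write to a scratch file, review it, then merge by hand.
#   write_continue_example /tmp/continue-example.yaml
```

The second llama-server instance would need its own port and, if you want it on the other card, ONEAPI_DEVICE_SELECTOR=level_zero:1.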

14. Suggested clean baseline

I would start with something like this:

#!/bin/bash
source /opt/intel/oneapi/setvars.sh --force

export ZES_ENABLE_SYSMAN=1
export ONEAPI_DEVICE_SELECTOR=level_zero:0
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1

cd ~/llama.cpp

./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
  -a Roboto \
  -c 32768 \
  -fa on \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --n-gpu-layers all \
  -b 2048 \
  -ub 512 \
  --threads 8 \
  --threads-batch 16 \
  --host 0.0.0.0 \
  --port 8080 \
  --split-mode none \
  --main-gpu 0 \
  --parallel 1 \
  --perf

This is not guaranteed to be best. It is just a cleaner baseline:

  • one GPU
  • no NUMA complication
  • explicit Flash Attention
  • explicit KV type
  • explicit threads vs threads-batch
  • explicit parallel 1
  • perf logging
  • Q4_K_M as a latency-first starting point

Then change only one thing at a time.
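One way to keep the "one change at a time" discipline is to derive every variant from the same baseline argument list. The helper below is a small bash sketch (the name with_flag is made up) that prints the baseline with a single flag's value replaced, or appended if the flag was absent. It only handles flags that take exactly one value.

```shell
# with_flag FLAG VALUE ARGS...
# Print ARGS with the value after FLAG replaced by VALUE; append FLAG VALUE
# if FLAG does not occur. Only suitable for flags that take one value.
with_flag() {
  local flag=$1 value=$2
  shift 2
  local out=() found=0
  while [ $# -gt 0 ]; do
    if [ "$1" = "$flag" ] && [ $# -ge 2 ]; then
      out+=("$flag" "$value")
      found=1
      shift 2
    else
      out+=("$1")
      shift
    fi
  done
  if [ "$found" -eq 0 ]; then
    out+=("$flag" "$value")
  fi
  printf '%s\n' "${out[*]}"
}

# Usage: print variants to review before running them.
#   base=(-c 32768 --threads 8 --threads-batch 16 --cache-type-k f16)
#   with_flag --threads 4 "${base[@]}"
#   with_flag --cache-type-k q8_0 "${base[@]}"
```

Printing the variants first and then pasting the one you want into the start script keeps each run traceable to exactly one difference.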

15. Suggested A/B order

Step 0: current setup + perf

Add only:

--perf

Save the logs.

Step 1: single GPU

Change:

--split-mode layer \
--tensor-split 1,1

to:

--split-mode none \
--main-gpu 0

If this is faster, dual-GPU split is probably not helping latency.

Step 2: threads

Try:

--threads 8 \
--threads-batch 16

instead of:

--threads 24

Step 3: remove NUMA

Remove:

--numa distribute

Step 4: KV cache

Compare:

--cache-type-k f16 --cache-type-v f16
--cache-type-k q8_0 --cache-type-v q8_0
--cache-type-k q4_0 --cache-type-v q4_0

Step 5: quant

Compare:

Qwen3.6-27B-Q4_K_M
Qwen3.6-27B-Q5_K_M

Step 6: Flash Attention

Compare:

-fa on
-fa auto
-fa off

Step 7: context size

Compare:

-c 16384
-c 24576
-c 32768

Bigger context is not free. For coding agents, having 32K available is useful, but repeatedly filling it can make the system feel slow.

Step 8: pinned builds

Compare known builds / commits, especially if using Docker or recent server-intel images.

Step 9: only then try MTP

Only after non-MTP is stable, try:

--parallel 1 \
--spec-type draft-mtp \
--spec-draft-n-max 2

and compare it against non-MTP.

16. What I would put in the log comparison table

Something like this:

Config name:
Model:
Quant:
Backend:
llama.cpp version:
Intel compute-runtime:
oneAPI:
Kernel:
Driver stack:
GPU split:
KV type:
Context size:
Batch / ubatch:
Threads / threads-batch:
Flash Attention:
Parallel slots:

prompt eval t/s:
eval t/s:
VRAM used:
System RAM:
GTT mirror, if checked:
Notes:
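If filling that in by hand gets tedious, a few of the fields can be appended to a CSV after each run. This is a minimal sketch with a reduced column set. The function name log_result and the column names are made up, and free-text notes must not contain commas because nothing is escaped.

```shell
# log_result CSV CONFIG PROMPT_EVAL_TPS EVAL_TPS [NOTES]
# Append one benchmark row, writing the header first if the file is empty.
# Values are written as-is, so avoid commas in CONFIG and NOTES.
log_result() {
  local csv=$1 config=$2 prefill_tps=$3 decode_tps=$4 notes=${5:-}
  if [ ! -s "$csv" ]; then
    echo 'config,prompt_eval_tps,eval_tps,notes' > "$csv"
  fi
  printf '%s,%s,%s,%s\n' "$config" "$prefill_tps" "$decode_tps" "$notes" >> "$csv"
}

# Usage (numbers are placeholders, not measurements):
#   log_result ~/bench/results.csv single-gpu-q4 500.0 100.0 "fa on"
```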

For multi-GPU SYCL memory behavior, this may help:

# -o picks the oldest match, in case more than one llama-server is running
PID=$(pgrep -o llama-server)

for fd in /proc/$PID/fdinfo/*; do
  grep -H "drm-total-gtt\|drm-total-vram" "$fd" 2>/dev/null
done

17. My best guess

My guess is that the biggest practical wins will come from:

  1. single-GPU baseline instead of dual-GPU layer split
  2. lower CPU thread count with explicit --threads-batch
  3. removing --numa distribute
  4. testing Q4_K_M vs Q5_K_M
  5. testing KV f16 vs q8_0
  6. pinning known-good llama.cpp / server-intel builds
  7. separating Continue autocomplete onto a smaller model

I would not start by chasing MTP, huge context sizes, or experimental split modes. First make the normal Qwen3.6-27B path fast and reproducible.