What should I change to optimize locally hosted AI

UpDownUpDownLeftRigh · May 29, 2026, 9:25am

I have a server with the following hardware:

Intel Ultra 7 270K Plus

64 GB RAM

2x Intel Arc B70 32 GB VRAM

I'm running Ubuntu Server with llama.cpp.

I'm using it for local agentic coding with the continue.dev plugin for VS Code.

My startllm.sh file looks like this:

#!/bin/bash
source /opt/intel/oneapi/setvars.sh --force
export ZES_ENABLE_SYSMAN=1

cd ~/llama.cpp

ZES_ENABLE_SYSMAN=1 ./build/bin/llama-server \
    -m ~/models/Qwen3.6-27B-Q5_K_M.gguf \
    -a Roboto \
    -c 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --n-gpu-layers 999 \
    -b 2048 \
    -ub 512 \
    --threads 24 \
    --host 0.0.0.0 \
    --port 8080 \
    --split-mode layer \
    --tensor-split 1,1 \
    --numa distribute

It still feels slow to respond. Which parameters should I change?

John6666 · May 30, 2026, 6:48am

Hmm… Based on known community findings, there do seem to be several settings worth improving:

Your setup is not obviously wrong — Ubuntu + llama.cpp/SYCL + 2× Arc Pro B70 + Qwen3.6-27B is a reasonable direction for local coding agents. However, Intel Arc/Battlemage + llama.cpp/SYCL + Qwen3.6 has several known performance traps, so I would first build a clean baseline and then change one variable at a time.

The most important distinction is:

prefill / prompt processing: shown as prompt eval time; very important for coding agents because they repeatedly send repo context, diffs, tool results, logs, etc.
decode / generation: shown as eval time; important for normal token-by-token response speed
stability / memory behavior: especially important with multi-GPU SYCL, MTP, large context, and long-running agents

TL;DR: highest-value things I would test first

| Area | Current-looking setting | What I would test | Why |
|---|---|---|---|
| Measurement | no explicit perf logging | add `--perf` | Without `prompt eval` vs `eval`, tuning is guesswork |
| GPU split | `--split-mode layer --tensor-split 1,1` | try `--split-mode none --main-gpu 0` | 27B Q5 may fit on one B70; dual-GPU split may add latency |
| Threads | `--threads 24` | try `--threads 8 --threads-batch 16` | GPU-offloaded inference often does not benefit from many CPU threads |
| NUMA | `--numa distribute` | remove it first | likely not useful on a normal single-socket workstation |
| KV cache | `q8_0/q8_0` | compare `f16`, `q8_0`, `q4_0` | Arc/B70 quant paths can behave very differently |
| Quant | Q5_K_M only | compare Q4_K_M vs Q5_K_M | Q4 may be much better latency on B70 |
| Flash Attention | unspecified/auto | test `-fa on` vs `auto` | often relevant for long-context workloads |
| Build | moving target? | pin/compare builds | known `server-intel-b9159` regression exists |
| Continue | one big model for everything? | split autocomplete to smaller model | autocomplete should be latency-optimized |
| MTP | tempting | leave it for later | Qwen3.6 MTP + SYCL still has sharp edges |

1. Add `--perf` before changing anything

First, keep your current command but add:

--perf

Then look for:

prompt eval time
eval time
tokens per second

Interpretation:

slow prompt eval = context/prefill problem
slow eval = generation/quant/backend/split problem
slow in Continue but not in llama-bench = likely agentic-context or client-side request pattern problem

For coding agents, prompt eval is often the hidden bottleneck. A model can look fine on short prompts or tg128, but feel bad in Continue because every agent step re-sends large context.
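To make the prefill vs decode split concrete, here is a minimal sketch that averages the two rates from a saved server log. The function name `summarize_timings` is hypothetical, and the timing-line layout shown in the comment is an assumption about what your llama-server build prints, so check it against a real log first.

```shell
#!/bin/bash
# Summarize prefill vs decode speed from llama-server timing lines on stdin.
# ASSUMED line shape (verify against your own log output):
#   prompt eval time = 1234.56 ms / 2048 tokens ( 0.60 ms per token, 1658.89 tokens per second)
#          eval time = 5000.00 ms /  128 tokens (39.06 ms per token,   25.60 tokens per second)
summarize_timings() {
  awk '
    /eval time/ && /tokens per second/ {
      tps = ""
      # Take the field immediately before the words "tokens per second".
      for (i = 2; i <= NF - 2; i++) {
        if ($i == "tokens" && $(i + 1) == "per" && $(i + 2) ~ /^second/) {
          tps = $(i - 1)
        }
      }
      if (tps == "") next
      if ($0 ~ /prompt eval time/) { p_sum += tps; p_n++ }
      else                         { e_sum += tps; e_n++ }
    }
    END {
      if (p_n) printf "prefill_tps_avg=%.2f requests=%d\n", p_sum / p_n, p_n
      if (e_n) printf "decode_tps_avg=%.2f requests=%d\n", e_sum / e_n, e_n
    }
  '
}
```

Usage would be something like `summarize_timings < server.log` after capturing the server output with `2>&1 | tee server.log`. A low prefill average with a normal decode average points at the context/prefill side.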

2. Test single GPU before dual-GPU layer split

Your current setup uses:

--split-mode layer \
--tensor-split 1,1

I would absolutely compare that with single-GPU mode:

--split-mode none \
--main-gpu 0

Optionally also pin the SYCL device:

export ONEAPI_DEVICE_SELECTOR=level_zero:0

Why this may help:

a 27B Q5_K_M model may fit on a single 32 GB B70
layer split helps capacity, but does not guarantee better single-user latency
decode often does not scale well across multiple GPUs
multi-GPU SYCL may increase host-memory pressure
Intel multi-GPU SYCL has a known host-side GTT mirror behavior: SYCL multi-GPU GTT mirror issue

If single GPU is faster or similarly fast, I would use the second B70 for another service instead:

GPU 0: Qwen3.6-27B for chat/edit/agent
GPU 1: Qwen2.5-Coder 1.5B/7B for autocomplete, or embeddings/reranking

For a single-user coding workstation, two independent services can feel better than one model split over two GPUs.
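As a rough way to sanity-check the "may fit on one B70" idea before testing, the sketch below compares the model file size plus a reserve against one GPU's VRAM. The function name `fits_on_one_gpu` and the reserve value are hypothetical; the reserve is only a guess at KV cache and compute buffers, and actual usage has to be measured.

```shell
#!/bin/bash
# Rough single-GPU fit check. All sizes are in MiB.
#   model_mib   : size of the GGUF file
#   vram_mib    : VRAM of one GPU (32768 for a 32 GiB card)
#   reserve_mib : ASSUMED headroom for KV cache and compute buffers
fits_on_one_gpu() {
  local model_mib=$1 vram_mib=$2 reserve_mib=$3
  if [ $(( model_mib + reserve_mib )) -le "$vram_mib" ]; then
    echo "fits"
  else
    echo "split"
  fi
}
```

On Ubuntu the model size could be taken from `stat -c %s` on the GGUF file divided by 1048576. A "fits" result only says the single-GPU test is worth running, not that it will be faster.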

3. Reduce CPU threads and set batch threads separately

If you currently use:

--threads 24

I would compare:

--threads 8 \
--threads-batch 16

and:

--threads 4 \
--threads-batch 16

and:

--threads 8 \
--threads-batch 8

--threads and --threads-batch are separate llama.cpp server knobs. --threads is more relevant to generation-side CPU work, while --threads-batch matters for batch/prompt processing. See the official server option docs.

With most layers on GPU, more CPU threads are not always better. Too many threads can add scheduling overhead or just not help. For coding agents, --threads-batch can matter more because large prompt ingestion is common.

4. Remove `--numa distribute` unless this is really a NUMA machine

If this is a normal single-socket desktop/workstation system, I would remove:

--numa distribute

Baseline should probably be no NUMA setting. Only test NUMA modes later if you know the machine is actually NUMA-relevant.
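One quick way to see whether the machine exposes more than one NUMA node is to count the node directories in sysfs. This is a minimal sketch; the function name `numa_node_count` is hypothetical, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/bash
# Count NUMA nodes by looking at /sys/devices/system/node/node<N>.
# Prints 0 if the kernel exposes no node directories at all.
numa_node_count() {
  local root=${1:-/sys/devices/system/node}
  local count=0 d
  for d in "$root"/node[0-9]*; do
    if [ -d "$d" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

If this prints 0 or 1, there is only one memory domain to distribute across, so `--numa distribute` is unlikely to be doing anything useful.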

5. Do not assume `q8_0` KV cache is fastest

Your command uses:

--cache-type-k q8_0 \
--cache-type-v q8_0

That may be good for VRAM, but it should be measured. Compare:

--cache-type-k f16 \
--cache-type-v f16

--cache-type-k q8_0 \
--cache-type-v q8_0

--cache-type-k q4_0 \
--cache-type-v q4_0

The B70-specific reason to test this is that quantized paths on Battlemage can behave surprisingly. The clearest known example is Q8_0 being ~4× slower than Q4_K_M on Arc Pro B70. That issue is about model weights, not KV cache, so it does not prove q8_0 KV is bad. But it does prove that on B70, “higher bit = safer/faster” is not a reliable assumption.

The same issue also notes that -DGGML_SYCL_F16=ON improved prompt processing by about 2.4× in one Q4_K_M case, while not improving token generation. That is another clue that prefill and decode must be tuned separately.
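For the VRAM side of the KV trade-off, the arithmetic can be sketched as below. The function name `kv_cache_mib` is hypothetical, and the layer count, KV head count, and head size in the usage example are placeholders, not the real Qwen3.6-27B dimensions. It also assumes every layer keeps a full K and V cache, which may overestimate for hybrid architectures. The block sizes are the ggml layouts: 34 bytes per 32 values for q8_0 and 18 bytes per 32 values for q4_0.

```shell
#!/bin/bash
# Estimate KV cache size in MiB for a given cache type.
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT TYPE
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=$5
  local block_bytes
  case "$type" in
    f16)  block_bytes=64 ;;  # 32 values x 2 bytes
    q8_0) block_bytes=34 ;;  # 32 int8 values + one f16 scale
    q4_0) block_bytes=18 ;;  # 32 4-bit values + one f16 scale
    *) echo "unknown cache type: $type" >&2; return 1 ;;
  esac
  # Factor 2 covers K and V; each block stores 32 values.
  local values=$(( 2 * layers * kv_heads * head_dim * ctx ))
  echo $(( values / 32 * block_bytes / 1048576 ))
}
```

With placeholder dimensions of 64 layers, 8 KV heads, and head size 128 at `-c 32768`, this gives 8192 MiB for f16, 4352 MiB for q8_0, and 2304 MiB for q4_0. That shows what q8_0 saves in memory, but says nothing about speed, which still has to be measured.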

6. Test Q4_K_M against Q5_K_M

Q5_K_M is reasonable, but for local coding latency I would compare:

Qwen3.6-27B-Q4_K_M
Qwen3.6-27B-Q5_K_M
Qwen3.6-27B-Q6_K

Suggested order:

Q4_K_M baseline
Q5_K_M quality comparison
Q6_K only if you still have enough speed/VRAM

On B70, Q4_K_M may be a better practical latency/quality point than Q5_K_M. The B70 Q8_0 issue is the strongest warning that quant performance on this architecture is not always intuitive: B70 Q8_0 kernel efficiency issue.

7. Explicitly test Flash Attention

Try:

-fa on

and compare with:

-fa auto

and maybe:

-fa off

For long-context coding workloads, Flash Attention can matter, but it should still be measured. The option is documented in the llama.cpp server README.

8. Prefer `--n-gpu-layers all` over `999`

If the intention is “offload everything possible,” use:

--n-gpu-layers all

instead of:

--n-gpu-layers 999

This is mostly clarity, not a guaranteed performance change. The server docs support auto, all, and numeric values.

9. Pin builds; avoid moving `latest`

There is a very relevant Intel build regression report:

Qwen3.6 on server-intel: b9144 32.8 t/s, b9159 25.8 t/s

That issue is Qwen3.6-35B-A3B-MTP on Arc Pro B50, not exactly your 27B dense setup, so it is not proof that your setup is affected. But it is close enough to justify build pinning and A/B testing.

Record:

./build/bin/llama-server --version
sycl-ls
uname -a
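To keep that record next to each benchmark run, something like the following sketch could write it to a file. The function name `record_env` is hypothetical; it only calls tools that are present and skips the rest.

```shell
#!/bin/bash
# Write a small environment snapshot to a file.
# Usage: record_env OUTPUT_FILE [PATH_TO_LLAMA_SERVER]
record_env() {
  local out=$1 server=${2:-./build/bin/llama-server}
  {
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "kernel: $(uname -r)"
    if [ -x "$server" ]; then
      echo "--- llama-server --version"
      "$server" --version 2>&1
    fi
    if command -v sycl-ls >/dev/null 2>&1; then
      echo "--- sycl-ls"
      sycl-ls 2>&1
    fi
  } > "$out"
}
```

For example, `record_env env-b9144.txt` before a run makes it easier to tell later whether a speed change came from a flag or from a different build or driver.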

If using Docker, compare pinned images rather than latest:

ghcr.io/ggml-org/llama.cpp:server-intel-b9144
ghcr.io/ggml-org/llama.cpp:server-intel-b9159

With Intel Arc + SYCL, performance can depend on:

llama.cpp commit
Intel compute-runtime
oneAPI version
Linux kernel / driver stack
Docker image contents
whether i915 or xe is used
ReBAR / Above 4G / PCIe platform behavior

10. Check platform basics: ReBAR, PCIe, driver stack

If performance is much lower than other B70 reports, I would verify platform-level things too.

Useful checks:

lspci -vv | grep -i -E "Resizable BAR|Region|prefetchable" -A3
lspci -nnk | grep -i -E "VGA|Display|3D" -A4
sycl-ls
uname -a

Things to confirm:

Above 4G Decoding: enabled
Resizable BAR: enabled
PCIe link width/speed: expected width/speed
driver stack: i915 vs xe
Intel compute-runtime version
oneAPI version
kernel version

There is also a relevant report of very poor SYCL performance on older DDR4 / PCIe 3.0 platform with Battlemage: brutally bad SYCL performance on Battlemage. Your system sounds much newer, but it is still worth verifying ReBAR/PCIe/driver basics.
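For the PCIe part, `lspci -vv` marks links that run below their capability with `(downgraded)` on the `LnkSta` line. The sketch below filters for those; the function name `downgraded_links` is hypothetical. Treat the result as a hint only: links can drop speed while idle for power saving, and some discrete GPUs sit behind an on-card bridge, so the upstream bridge entry may be the one that matters.

```shell
#!/bin/bash
# Read `lspci -vv` output on stdin and print each device whose
# LnkSta line is flagged as downgraded, followed by that line.
downgraded_links() {
  awk '
    /^[0-9a-f]+:[0-9a-f]+/ { dev = $0 }   # device header starts at column 0
    /LnkSta:/ && /downgraded/ { print dev; print $0 }
  '
}
```

Typical use would be `sudo lspci -vv | downgraded_links`, ideally while the GPU is under load so idle power states do not confuse the picture.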

11. Treat MTP as a later experiment, not the first fix

Qwen3.6 MTP is interesting, but I would not add it until the non-MTP baseline is clean.

If you test MTP later, start conservatively:

--parallel 1 \
--spec-type draft-mtp \
--spec-draft-n-max 2

Avoid assuming MTP is faster just because draft acceptance is high. On SYCL/Intel Arc, one issue specifically reports correct output but no speed gain, with per-kernel dispatch overhead identified as the remaining bottleneck.

Also watch for:

draft acceptance rate
VRAM before/after requests
forcing full prompt re-processing
create context checkpoint
OOM after multiple requests

12. Watch for full prompt re-processing

For Qwen3.6 and agentic use, look for:

forcing full prompt re-processing

Relevant issue:

Qwen3.6 27B full prompt re-processing / cache behavior
Related generic server prompt-cache issue: server forces full prompt re-processing on subsequent prompts

If this appears often, the problem may not be raw GPU throughput. It may be prompt cache invalidation, slot reuse behavior, hybrid attention/recurrent-memory behavior, client request shape, or MTP interaction.

For a single local coding-agent user, I would initially force:

--parallel 1

and only increase parallelism after the baseline is stable.
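A simple way to see how often this happens during an agent session is to count the message in the saved server log. This is a minimal sketch; the function name `count_full_reprocess` is hypothetical and it matches only the message text quoted above.

```shell
#!/bin/bash
# Count log lines on stdin that report a full prompt re-processing.
# grep -c exits non-zero when the count is 0, so mask that status.
count_full_reprocess() {
  grep -c "forcing full prompt re-processing" || true
}
```

Running `count_full_reprocess < server.log` after a typical Continue session gives a number to compare between configurations. If it is close to the number of requests, cache reuse is the first thing to look at.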

13. Continue.dev: separate autocomplete from chat/agent

Continue has different model roles: Chat, Edit, Apply, Autocomplete, Embedding, Reranker, etc. See Continue model roles.

For autocomplete specifically, Continue recommends smaller/faster models such as Qwen Coder 2.5 1.5B or 7B: Continue autocomplete docs. The docs also note that thinking-type models are generally not recommended for autocomplete because they generate more slowly.

A practical split:

localhost:8080 -> Qwen3.6-27B-Q4_K_M or Q5_K_M for chat/edit/agent
localhost:8081 -> Qwen2.5-Coder-1.5B or 7B for autocomplete

This can improve perceived responsiveness a lot. Autocomplete should not be queued behind a large 27B agent request.

14. Suggested clean baseline

I would start with something like this:

#!/bin/bash
source /opt/intel/oneapi/setvars.sh --force

export ZES_ENABLE_SYSMAN=1
export ONEAPI_DEVICE_SELECTOR=level_zero:0
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1

cd ~/llama.cpp

./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
  -a Roboto \
  -c 32768 \
  -fa on \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --n-gpu-layers all \
  -b 2048 \
  -ub 512 \
  --threads 8 \
  --threads-batch 16 \
  --host 0.0.0.0 \
  --port 8080 \
  --split-mode none \
  --main-gpu 0 \
  --parallel 1 \
  --perf

This is not guaranteed to be best. It is just a cleaner baseline:

one GPU
no NUMA complication
explicit Flash Attention
explicit KV type
explicit threads vs threads-batch
explicit parallel 1
perf logging
Q4_K_M as a latency-first starting point

Then change only one thing at a time.
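To keep the "one thing at a time" rule honest, each variant can be derived from the baseline by replacing a single flag. The helper below is a sketch; `override_flag` is a hypothetical name, and it only handles flags that take exactly one value.

```shell
#!/bin/bash
# Print the baseline arguments with the value of one flag replaced.
# If the flag is not present in the baseline, it is appended.
# Usage: override_flag FLAG NEW_VALUE BASELINE_ARG...
override_flag() {
  local flag=$1 value=$2 replaced=0
  shift 2
  local out=()
  while [ $# -gt 0 ]; do
    if [ "$1" = "$flag" ]; then
      out+=("$flag" "$value")
      replaced=1
      shift                        # drop the flag
      if [ $# -gt 0 ]; then shift; fi  # drop its old value, if any
    else
      out+=("$1")
      shift
    fi
  done
  if [ "$replaced" -eq 0 ]; then
    out+=("$flag" "$value")
  fi
  printf '%s\n' "${out[*]}"
}
```

For example, `override_flag --threads 4 -c 32768 --threads 8 --threads-batch 16` prints `-c 32768 --threads 4 --threads-batch 16`, which can be logged as the exact variant that was tested.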

15. Suggested A/B order

Step 0: current setup + perf

Add only:

--perf

Save the logs.

Step 1: single GPU

Change:

--split-mode layer \
--tensor-split 1,1

to:

--split-mode none \
--main-gpu 0

If this is faster, dual-GPU split is probably not helping latency.

Step 2: threads

Try:

--threads 8 \
--threads-batch 16

instead of:

--threads 24

Step 3: remove NUMA

Remove:

--numa distribute

Step 4: KV cache

Compare:

--cache-type-k f16 --cache-type-v f16
--cache-type-k q8_0 --cache-type-v q8_0
--cache-type-k q4_0 --cache-type-v q4_0

Step 5: quant

Compare:

Qwen3.6-27B-Q4_K_M
Qwen3.6-27B-Q5_K_M

Step 6: Flash Attention

Compare:

-fa on
-fa auto
-fa off

Step 7: context size

Compare:

-c 16384
-c 24576
-c 32768

Bigger context is not free. For coding agents, having 32K available is useful, but repeatedly filling it can make the system feel slow.

Step 8: pinned builds

Compare known builds / commits, especially if using Docker or recent server-intel images.

Step 9: only then try MTP

Only after non-MTP is stable, try:

--parallel 1 \
--spec-type draft-mtp \
--spec-draft-n-max 2

and compare it against non-MTP.

16. What I would put in the log comparison table

Something like this:

Config name:
Model:
Quant:
Backend:
llama.cpp version:
Intel compute-runtime:
oneAPI:
Kernel:
Driver stack:
GPU split:
KV type:
Context size:
Batch / ubatch:
Threads / threads-batch:
Flash Attention:
Parallel slots:

prompt eval t/s:
eval t/s:
VRAM used:
System RAM:
GTT mirror, if checked:
Notes:
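If a plain file is easier than a hand-written table, the key numbers could be appended to a CSV after each run. This is a sketch with a reduced column set; the function name `append_result` and the column names are hypothetical, and values containing commas would need quoting.

```shell
#!/bin/bash
# Append one benchmark result to a CSV file, writing the header first
# if the file is missing or empty.
# Usage: append_result FILE CONFIG_NAME PROMPT_EVAL_TPS EVAL_TPS [NOTES]
append_result() {
  local csv=$1 config=$2 prompt_tps=$3 eval_tps=$4 notes=${5:-}
  if [ ! -s "$csv" ]; then
    echo "timestamp,config,prompt_eval_tps,eval_tps,notes" > "$csv"
  fi
  printf '%s,%s,%s,%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$config" "$prompt_tps" "$eval_tps" "$notes" >> "$csv"
}
```

A call such as `append_result results.csv single-gpu-q4 410.5 27.1 "fa on"` (numbers made up for illustration) adds one row per configuration, so the runs can be sorted and compared later.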

For multi-GPU SYCL memory behavior, this may help:

# Print GTT and VRAM totals for every running llama-server process.
for pid in $(pgrep -x llama-server); do
  for fd in /proc/"$pid"/fdinfo/*; do
    grep -H "drm-total-gtt\|drm-total-vram" "$fd" 2>/dev/null
  done
done

17. My best guess

My guess is that the biggest practical wins will come from:

single-GPU baseline instead of dual-GPU layer split
lower CPU thread count with explicit --threads-batch
removing --numa distribute
testing Q4_K_M vs Q5_K_M
testing KV f16 vs q8_0
pinning known-good llama.cpp / server-intel builds
separating Continue autocomplete onto a smaller model

I would not start by chasing MTP, huge context sizes, or experimental split modes. First make the normal Qwen3.6-27B path fast and reproducible.
