====== Running InstructLab on a Lenovo Thinkpad X1 Carbon Gen 12 ======

This notebook's hardware is advertised as [[https://www.lenovo.com/us/en/p/laptops/thinkpad/thinkpadx1/thinkpad-x1-carbon-gen-12-14-inch-intel/len101t0083|Powered by Intel® Core™ Ultra processors, with integrated AI]],
and indeed dedicated devices for that purpose seem to be present:

  % lspci
  [...]
  00:08.0 System peripheral: Intel Corporation Meteor Lake-P Gaussian & Neural-Network Accelerator (rev 20)
  [...]
  00:0b.0 Processing accelerators: Intel Corporation Meteor Lake NPU (rev 04)
  [...]

For the second device above there is a kernel driver, enabled by the
//CONFIG_DRM_ACCEL_IVPU// symbol. In the config menu it sits under:
  > Device Drivers > Compute Acceleration Framework
When built as a module, it is called //intel_vpu.ko//.

Once the driver is loaded, a ///dev/accel/accel0// node appears in //devtmpfs//
and //sysfs// gains a new //class/accel/accel0// symlink pointing at the PCI
device. Interesting attributes in ///sys/class/accel/accel0/device//:
^ Attribute ^ Meaning ^
| npu_busy_time_us | Time the NPU has spent executing jobs (in µs) |
| npu_memory_utilization | Memory currently used (in bytes) |
| npu_current_frequency_mhz | Current clock frequency (in MHz) |
| npu_max_frequency_mhz | Maximum clock frequency (in MHz) |
(The latter three are available since linux-6.15.)
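
To see whether the NPU actually does anything during a given workload, these
counters can simply be sampled before and after it. A small Python sketch
(plain //sysfs// reads, nothing InstructLab-specific; it assumes a kernel new
enough to expose the //npu_*// attributes):
  import time
  from pathlib import Path
  SYSFS = Path("/sys/class/accel/accel0/device")
  def read_npu_counters():
      """Read the NPU activity counters exposed by the intel_vpu driver."""
      return {attr: (SYSFS / attr).read_text().strip()
              for attr in ("npu_busy_time_us",
                           "npu_memory_utilization",
                           "npu_current_frequency_mhz")}
  before = read_npu_counters()
  time.sleep(10)   # run the workload of interest in the meantime
  after = read_npu_counters()
  for attr in before:
      print(f"{attr}: {before[attr]} -> {after[attr]}")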

====== A first look at InstructLab ======

The [[https://github.com/instructlab/instructlab|GitHub page]] has installation
instructions, but they offer only four choices:

  * Install with Apple Metal (for the accelerators in recent MacBooks)
  * Install with AMD ROCm (to utilize AMD GPUs)
  * Install with NVIDIA CUDA (to utilize NVIDIA GPUs)
  * Install without acceleration (utilizing the CPU only)

After choosing the last variant and following the basic setup guide, serving a
model and chatting with it basically works:
  >>> How are you today?                                                    [S][default]
  ╭──────────────────────────── granite-7b-lab-Q4_K_M.gguf ────────────────────────────╮
  │ Thank you for asking! I'm doing well today. I'm an AI language model, so I don'  │
  │ have feelings or emotions, but I'm here and ready to help you with any questions   │
  │ or tasks you might have. How can I assist you today?                               │
  ╰──────────────────────────────────────────────────────────── elapsed 7.078 seconds ─╯
Attempting to train the model shows weird behaviour, though: the busy ''ilab
data generate'' command seems to read filesystem contents outside of the
(modified) taxonomy repository and, moreover, it seems to follow symlinks, with
unfortunate results:
  % strace -fxp <ilab PID>
  [...]
-  [pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-amd-mp2-plat.c", {st_mode=S_IFREG|0644, st_size=9621, ...}) = 0 
-  [pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-at91.h", {st_mode=S_IFREG|0644, st_size=6823, ...}) = 0 
-  [pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-parport.c", {st_mode=S_IFREG|0644, st_size=10747, ...}) = 0 
-  [pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-cadence.c", {st_mode=S_IFREG|0644, st_size=46715, ...}) = 0 
-  [pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-npcm7xx.c", {st_mode=S_IFREG|0644, st_size=71028, ...}) = 0 
-  [pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-mv64xxx.c", {st_mode=S_IFREG|0644, st_size=31017, ...}) = 0 
Apparently it has found the //build// symlink typically present in kernel module
install directories. In this case that symlink sits in a subdirectory of the
very directory it points back to, so a crawler that follows it recurses
endlessly, and this one is obviously ignorant of that. While it is busy
following symlinks, the command does not react to the CTRL-C key combination,
but it does terminate when sent SIGTERM via ''kill'', at least.
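
How InstructLab's crawler is actually implemented was not investigated here,
but the problem as such is easy to avoid: Python's ''os.walk()'' does not follow
directory symlinks unless explicitly asked to. A minimal sketch of a loop-free
walk (illustrative only, not InstructLab code; the taxonomy path is just an
example):
  import os
  def walk_tree(root):
      """Yield regular files below root without following directory symlinks."""
      # followlinks=False (the default) prevents loops through symlinks such
      # as the 'build' link in kernel module install directories.
      for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
          for name in filenames:
              path = os.path.join(dirpath, name)
              if not os.path.islink(path):
                  yield path
  for path in walk_tree(os.path.expanduser("~/.local/share/instructlab/taxonomy")):
      print(path)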

====== Backends of Backends ======

Leaving model training aside for now, a closer look at the ''ilab model serve
--help'' output reveals there are two possible backends to use:
[[https://github.com/vllm-project/vllm|vLLM]] and
[[https://github.com/ggml-org/llama.cpp|llama.cpp]].
===== vLLM =====

The former claims to support Intel GPUs: its
[[https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html|install
page]] has a tab named "Intel XPU". One has to build the package from source,
but apart from vague requirements to install Intel GPU drivers and oneAPI, the
instructions are pretty straightforward. As it turns out, installing the
//intel-compute-runtime// package via the distribution's package manager seems
to suffice.

Interestingly, the //requirements/xpu.txt// file in the repository, which the
instructions point at, references XPU-enabled builds of ''pytorch''. This offers
a quick way of checking whether the stack is happy with the system so far:
  % . /tmp/my_venv/bin/activate
  (my_venv) % python
  >>> import torch
  >>> torch.xpu.is_available()
  True
On Fedora 42, for instance, the module complains and returns False:
  >>> torch.xpu.is_available()
  /home/me/ilab_venv/lib64/python3.12/site-packages/torch/xpu/__init__.py:60: UserWarning: XPU device count is zero! (Triggered internally at /pytorch/c10/xpu/XPUFunctions.cpp:115.)
    return torch._C._xpu_getDeviceCount()
  False
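
When the check succeeds, a few more calls narrow down what PyTorch actually
sees there. A quick sketch (assuming the XPU-enabled ''pytorch'' build pulled
in via //requirements/xpu.txt//):
  import torch
  if torch.xpu.is_available():
      print("XPU devices:", torch.xpu.device_count())
      print("Device name:", torch.xpu.get_device_name(0))
      # A tiny tensor operation confirms the device is actually usable.
      x = torch.ones(1024, 1024, device="xpu")
      print("Matmul checksum:", (x @ x).sum().item())
  else:
      print("No XPU device visible to PyTorch")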

Another simple health check is the ''clinfo'' tool. If the //intel-compute-runtime//
package is installed correctly, it should find the local GPU:
  % clinfo -l
  Platform #0: Intel(R) OpenCL Graphics
   `-- Device #0: Intel(R) Graphics
This also works on Fedora 42, though, so it is obviously not sufficient as a
check for accelerator availability.

If things look fine, one may try to serve the model using the vLLM backend and
see what happens. The output is quite verbose, so the following listing omits
large parts:
  (my_venv) % ilab model serve --backend vllm
  WARNING 2025-06-27 00:22:00,347 instructlab.model.backends.backends:96: The serving backend 'vllm' was configured explicitly, but the provided model is not compatible with it. The model was detected as 'llama-cpp, reason: model is a GGUF file.'.
  The backend startup sequence will continue with the configured backend but might fail.
  [...]
  DEBUG 06-27 00:22:07 [__init__.py:138] Checking if XPU platform is available.
  [W627 00:22:08.949943771 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
    Overriding a previously registered kernel for the same operator and the same dispatch key
    operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
      registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
    dispatch key: XPU
    previous kernel: registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:37
         new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:186 (function operator())
  DEBUG 06-27 00:22:09 [__init__.py:146] Confirmed XPU platform is available.
  [...]
  WARNING 06-27 00:22:23 [_logger.py:68] device type=xpu is not supported by the V1 Engine. Falling back to V0.
  WARNING 06-27 00:22:23 [_logger.py:68] Unknown device name intel(r) graphics, always use float16
  WARNING 06-27 00:22:23 [_logger.py:68] bfloat16 is only supported on Intel Data Center GPU, Intel Arc GPU is not supported yet. Your device is Intel(R) Graphics, which is not supported. will fallback to float16
  WARNING 06-27 00:22:23 [_logger.py:68] CUDA graph is not supported on XPU, fallback to the eager mode.
  ERROR 06-27 00:22:23 [xpu.py:108] Both start methods (spawn and fork) have issue on XPU if you use mp backend, setting it to ray instead.
  [...]
  WARNING 06-27 00:23:20 [_logger.py:68] No existing RAY instance detected. A new instance will be launched with current node resources.
  [...]
  ERROR 06-27 00:23:42 [worker_base.py:622] NotImplementedError: The operator 'vllm::_apply_gguf_embedding' is not currently implemented for the XPU device. Please open a feature on https://github.com/intel/torch-xpu-ops/issues. You can set the environment variable `PYTORCH_ENABLE_XPU_FALLBACK=1` to use the CPU implementation as a fallback for XPU unimplemented operators. WARNING: this will bring unexpected performance compared with running natively on XPU.
  [...]
  RuntimeError: Engine process failed to start. See stack trace for the root cause.
A few things to notice from that:
  * Maybe a different model is required for vLLM
  * From vLLM's point of view, XPU devices seem to be pretty restricted (or maybe just the consumer one in this notebook?)
  * There is a CPU fallback for unsupported operations. In this case it does not help, though; the call fails with: ''NotImplementedError: Could not run 'vllm::_apply_gguf_embedding' with arguments from the 'CPU' backend.''

Next try with a model in Safetensors format:
  (my_venv) % ilab model serve --backend vllm --model-path ~/.cache/instructlab/models/instructlab/granite-7b-lab
  [...]
  (raylet) [2025-06-27 01:05:09,708 E 20006 20006] (raylet) node_manager.cc:3193: 14 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: e8d0da19e18cda1181e90e67d93f6cb3cc3a6ebbbad9c52ea82cfea1, IP: 192.168.0.11) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.0.11`
The OOM condition seems like a dead end.

===== llama.cpp =====

The GitHub page lists a number of
[[https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#supported-backends|supported backends]];
the interesting one here is
[[https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md|SYCL]],
as it is described as "primarily designed for Intel GPUs".

To build with SYCL support, Intel's proprietary //icx// and //icpx// compilers
need to be present. These come as a
[[https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html|self-extracting archive with a binary installer]],
basically a worst-case scenario for anyone interested in system security.

A convenient way to rebuild the library is to reinstall the
//llama-cpp-python// wheel using pip:
  (my_venv) % pip cache remove llama_cpp_python
  (my_venv) % . /opt/intel/oneapi/setvars.sh
  (my_venv) % CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install --verbose --force-reinstall 'llama-cpp-python[server]'
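
Before going back through ilab, the rebuilt wheel can also be exercised
directly from Python. A short smoke test using the ''llama_cpp'' API (the model
path is just the GGUF file ilab downloaded earlier; adjust as needed):
  from llama_cpp import Llama
  # n_gpu_layers=-1 asks llama.cpp to offload all layers to the SYCL device;
  # verbose=True prints the same load_tensors/device messages that ilab shows.
  llm = Llama(
      model_path="/home/me/.cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf",
      n_gpu_layers=-1,
      n_ctx=2048,
      verbose=True,
  )
  out = llm("Q: How are you today? A:", max_tokens=64)
  print(out["choices"][0]["text"])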

With the freshly built library in place, GPU offloading can be verified by inspecting the debug output printed by ilab with the //--verbose// option:
  (my_venv) % ilab --verbose model serve
  [...]
  load_tensors: loading model tensors, this can take a while... (mmap = true)
  load_tensors: layer   0 assigned to device SYCL0, is_swa = 0
  load_tensors: layer   1 assigned to device SYCL0, is_swa = 0
  [...]
  load_tensors: layer  31 assigned to device SYCL0, is_swa = 0
  load_tensors: layer  32 assigned to device SYCL0, is_swa = 0
  load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type SYCL_Host, using CPU instead
  load_tensors: offloading 32 repeating layers to GPU
  load_tensors: offloading output layer to GPU
  load_tensors: offloaded 33/33 layers to GPU
Response time when chatting with the model increased, though:
  >>> How are you today?                                                    [S][default]
  ╭──────────────────────────── granite-7b-lab-Q4_K_M.gguf ────────────────────────────╮
  │ Thank you for asking! I'm doing well today. I'm an AI language model, so I don'  │
  │ have feelings or emotions, but I'm fully operational and ready to assist you with  │
  │ any questions or tasks you might have. How can I help you today?                   │
  ╰─────────────────────────────────────────────────────────── elapsed 19.407 seconds ─╯
This does not seem right. Also, the contents of the various
///sys/class/accel/accel0/device/npu_*// files remain unchanged. So either the
offloading does not work as intended, or it is simply not used for this
specific use-case; in the latter case, though, there should not be a difference
in performance at all.
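
One way to narrow this down is to take ilab out of the picture and time the
same prompt with and without offloading, directly through //llama-cpp-python//.
A rough sketch (model path and prompt are only examples, and a single run is
of course a noisy measurement):
  import time
  from llama_cpp import Llama
  MODEL = "/home/me/.cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf"
  PROMPT = "Q: How are you today? A:"
  def time_completion(n_gpu_layers):
      """Load the model with the given offload setting and time one completion."""
      llm = Llama(model_path=MODEL, n_gpu_layers=n_gpu_layers,
                  n_ctx=2048, verbose=False)
      start = time.perf_counter()
      llm(PROMPT, max_tokens=64)
      return time.perf_counter() - start
  print("CPU only     :", time_completion(0))
  print("SYCL offload :", time_completion(-1))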

====== Summary ======

While all involved software components allegedly support offloading to the
notebook's Intel GPU, doing so leads to a (slightly) worse user experience in
the best case and breaks functionality in the worst case.

Many questions remain, though, and more investigation is needed for a better
picture. The most promising direction seems to be the //llama.cpp// backend:
figure out why the NPU performance counters do not increase, whether the NPU is
used at all when it should be, and which use-case would actually leverage it.