====== Running InstructLab on a Lenovo ThinkPad X1 Carbon Gen 12 ======

This notebook's hardware is advertised as [[https://www.lenovo.com/us/en/p/laptops/thinkpad/thinkpadx1/thinkpad-x1-carbon-gen-12-14-inch-intel/len101t0083|Powered by Intel® Core™ Ultra processors, with integrated AI]],
and indeed there seem to be dedicated devices present for that purpose:

% lspci
[...]
00:08.0 System peripheral: Intel Corporation Meteor Lake-P Gaussian & Neural-Network Accelerator (rev 20)
[...]
00:0b.0 Processing accelerators: Intel Corporation Meteor Lake NPU (rev 04)
[...]

For the second one above there is a kernel driver, enabled by the
//CONFIG_DRM_ACCEL_IVPU// symbol. In the config menu, it sits in:
> Device Drivers > Compute Acceleration Framework
When built as a module, it is called //intel_vpu.ko//.
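
Whether that driver actually ended up loaded on a given system can be checked
without digging through dmesg; a minimal sketch that only parses
///proc/modules// and is not InstructLab-specific:

  # Sketch: check whether the intel_vpu module is currently loaded.
  # Note: a driver built into the kernel (=y) will not show up here.
  from pathlib import Path

  def intel_vpu_loaded() -> bool:
      return any(line.split()[0] == "intel_vpu"
                 for line in Path("/proc/modules").read_text().splitlines())

  print("intel_vpu loaded:", intel_vpu_loaded())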

Once it is loaded, a ///dev/accel/accel0// node appears in //devtmpfs// and
//sysfs// gains a new //class/accel/accel0// symlink pointing at the PCI
device. Interesting attributes in ///sys/class/accel/accel0/device//:
| npu_busy_time_us | The time this NPU spent executing jobs (in us) |
| npu_memory_utilization | Memory currently used (in bytes) |
| npu_current_frequency_mhz | Current clock frequency (in MHz) |
| npu_max_frequency_mhz | Maximum clock frequency (in MHz) |
(The latter three are available since linux-6.15.)
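
These counters come in handy later for checking whether the NPU is actually
being used. A small helper to dump them, just a sketch reading the sysfs files
listed above:

  # Sketch: print the NPU counters from sysfs (most need linux >= 6.15).
  from pathlib import Path

  NPU_SYSFS = Path("/sys/class/accel/accel0/device")

  for attr in sorted(NPU_SYSFS.glob("npu_*")):
      print(f"{attr.name}: {attr.read_text().strip()}")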

====== A first look at InstructLab ======

The [[https://github.com/instructlab/instructlab|GitHub page]] has installation
instructions, but they offer only four choices:

* Install with Apple Metal (accelerators in recent MacBooks)
* Install with AMD ROCm (to utilize AMD GPUs)
* Install with NVIDIA CUDA (utilizing NVIDIA GPUs)
* Install without acceleration (utilizing the CPU only)

After choosing the last variant and following the basic setup guide, serving a
model and chatting with it basically works:
>>> How are you today? [S][default]
╭──────────────────────────── granite-7b-lab-Q4_K_M.gguf ────────────────────────────╮
│ Thank you for asking! I'm doing well today. I'm an AI language model, so I don't │
│ have feelings or emotions, but I'm here and ready to help you with any questions │
│ or tasks you might have. How can I assist you today? │
╰──────────────────────────────────────────────────────────── elapsed 7.078 seconds ─╯
Attempting to train the model shows weird behaviour, though: the busy ''ilab
data generate'' command seems to read filesystem contents outside of the
(modified) taxonomy repository, and moreover it seems to follow symlinks, with
unintended results:
% strace -fxp <ilab PID>
[...]
[pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-amd-mp2-plat.c", {st_mode=S_IFREG|0644, st_size=9621, ...}) = 0 | |
[pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-at91.h", {st_mode=S_IFREG|0644, st_size=6823, ...}) = 0 | |
[pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-parport.c", {st_mode=S_IFREG|0644, st_size=10747, ...}) = 0 | |
[pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-cadence.c", {st_mode=S_IFREG|0644, st_size=46715, ...}) = 0 | |
[pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-npcm7xx.c", {st_mode=S_IFREG|0644, st_size=71028, ...}) = 0 | |
[pid 21254] stat("./git/linux-minime/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/modules/lib/modules/6.16.0-rc1-00201-g746c9b4f6a27/build/drivers/i2c/busses/i2c-mv64xxx.c", {st_mode=S_IFREG|0644, st_size=31017, ...}) = 0 | |
Apparently it has found the //build// symlink typically present in kernel module
install directories. In this case, that symlink sits in a subdirectory of the
directory it points at, and the crawler is obviously ignorant of that. While it
is busy following symlinks, the command does not react to the CTRL-C key
combination; it does terminate when sent SIGTERM via ''kill'', at least.
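
For comparison, a crawler that insists on following symlinks would have to
remember which directories it has already visited (keyed by device and inode)
to avoid exactly this kind of loop. A rough sketch, not taken from InstructLab:

  # Sketch: walk a tree while following symlinks, but never descend into a
  # directory that was already visited (identified by st_dev/st_ino).
  import os

  def safe_walk(top):
      seen = set()
      for dirpath, dirnames, filenames in os.walk(top, followlinks=True):
          st = os.stat(dirpath)
          key = (st.st_dev, st.st_ino)
          if key in seen:
              dirnames[:] = []  # prune: already seen via another path
              continue
          seen.add(key)
          yield dirpath, filenames

  # Example: count files below ~/git without getting stuck in symlink cycles.
  print(sum(len(files) for _, files in safe_walk(os.path.expanduser("~/git"))))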

====== Backends of Backends ======

Leaving model training aside for now, a closer look at ''ilab model serve
--help'' output reveals there are two possible backends to use:
[[https://github.com/vllm-project/vllm|vLLM]] and
[[https://github.com/ggml-org/llama.cpp|llama.cpp]].

===== vLLM =====

The former claims to support Intel GPUs: its
[[https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html|install
page]] has a tab named "Intel XPU". One has to build the package from source,
but apart from vague requirements to install Intel GPU drivers and oneAPI, the
instructions are pretty straightforward. As it turns out, installing the
//intel-compute-runtime// package via the distribution's package manager seems
to suffice.

Interestingly, the repository's //requirements/xpu.txt// file, which the
instructions point at, references XPU-enabled builds of ''pytorch''. There is a
quick way of checking whether it is happy with the system so far:
% . /tmp/my_venv/bin/activate
(my_venv) % python
>>> import torch
>>> torch.xpu.is_available()
True
On Fedora 42, for instance, the module complains and returns False:
>>> torch.xpu.is_available()
/home/me/ilab_venv/lib64/python3.12/site-packages/torch/xpu/__init__.py:60: UserWarning: XPU device count is zero! (Triggered internally at /pytorch/c10/xpu/XPUFunctions.cpp:115.)
return torch._C._xpu_getDeviceCount()
False
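
When the check does succeed, a couple more calls show which device ''pytorch''
has actually picked up; just a quick sanity check in the same venv, assuming the
XPU-enabled build referenced by //requirements/xpu.txt//:

  # Sketch: list the XPU devices an XPU-enabled pytorch build can see.
  import torch

  if torch.xpu.is_available():
      for i in range(torch.xpu.device_count()):
          print(i, torch.xpu.get_device_name(i))
  else:
      print("no XPU device available")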

Another simple health check is via the ''clinfo'' tool. If the //intel-compute-runtime//
package is correctly installed, it should find the local GPU:
% clinfo -l
Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Graphics
This is the case on Fedora 42 as well, so this check alone is obviously not
sufficient to verify accelerator availability.
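
The same listing can be reproduced from Python if the third-party //pyopencl//
package happens to be installed; it is not required by anything here, just
another way to poke at the OpenCL stack:

  # Sketch: enumerate OpenCL platforms/devices, roughly what `clinfo -l` shows.
  import pyopencl as cl  # third-party package, not part of the ilab setup

  for p_idx, platform in enumerate(cl.get_platforms()):
      print(f"Platform #{p_idx}: {platform.name}")
      for d_idx, device in enumerate(platform.get_devices()):
          print(f"  Device #{d_idx}: {device.name}")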

If things look fine, one may try to serve the model using the vLLM backend to see
what happens. The output is quite verbose, so the following listing omits
large parts:
(my_venv) % ilab model serve --backend vllm
WARNING 2025-06-27 00:22:00,347 instructlab.model.backends.backends:96: The serving backend 'vllm' was configured explicitly, but the provided model is not compatible with it. The model was detected as 'llama-cpp, reason: model is a GGUF file.'.
The backend startup sequence will continue with the configured backend but might fail.
[...]
DEBUG 06-27 00:22:07 [__init__.py:138] Checking if XPU platform is available.
[W627 00:22:08.949943771 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:37
new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:186 (function operator())
DEBUG 06-27 00:22:09 [__init__.py:146] Confirmed XPU platform is available.
[...]
WARNING 06-27 00:22:23 [_logger.py:68] device type=xpu is not supported by the V1 Engine. Falling back to V0.
WARNING 06-27 00:22:23 [_logger.py:68] Unknown device name intel(r) graphics, always use float16
WARNING 06-27 00:22:23 [_logger.py:68] bfloat16 is only supported on Intel Data Center GPU, Intel Arc GPU is not supported yet. Your device is Intel(R) Graphics, which is not supported. will fallback to float16
WARNING 06-27 00:22:23 [_logger.py:68] CUDA graph is not supported on XPU, fallback to the eager mode.
ERROR 06-27 00:22:23 [xpu.py:108] Both start methods (spawn and fork) have issue on XPU if you use mp backend, setting it to ray instead.
[...]
WARNING 06-27 00:23:20 [_logger.py:68] No existing RAY instance detected. A new instance will be launched with current node resources.
[...]
ERROR 06-27 00:23:42 [worker_base.py:622] NotImplementedError: The operator 'vllm::_apply_gguf_embedding' is not currently implemented for the XPU device. Please open a feature on https://github.com/intel/torch-xpu-ops/issues. You can set the environment variable `PYTORCH_ENABLE_XPU_FALLBACK=1` to use the CPU implementation as a fallback for XPU unimplemented operators. WARNING: this will bring unexpected performance compared with running natively on XPU.
[...]
RuntimeError: Engine process failed to start. See stack trace for the root cause.
A few things to notice from that:
* Maybe a different model is required for vLLM
* From vLLM's point of view, XPU devices seem to be pretty restricted (or maybe just the consumer one in this notebook?)
* There is a CPU fallback for unsupported operators. In this case it does not help, though: the call fails with ''NotImplementedError: Could not run 'vllm::_apply_gguf_embedding' with arguments from the 'CPU' backend.''

The next try is with a model in Safetensors format:
(my_venv) % ilab model serve --backend vllm --model-path ~/.cache/instructlab/models/instructlab/granite-7b-lab
[...]
(raylet) [2025-06-27 01:05:09,708 E 20006 20006] (raylet) node_manager.cc:3193: 14 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: e8d0da19e18cda1181e90e67d93f6cb3cc3a6ebbbad9c52ea82cfea1, IP: 192.168.0.11) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.0.11`
The OOM condition seems like a dead end.
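
The numbers make that plausible: a dense 7B-parameter model at the float16
precision vLLM fell back to needs roughly 13 GiB for the weights alone, before
the KV cache and whatever vLLM preallocates; on a laptop where CPU and GPU share
the same memory that does not leave much headroom. A quick back-of-the-envelope
check:

  # Rough estimate of weight memory for granite-7b-lab served at float16.
  params = 7e9          # ~7 billion parameters
  bytes_per_param = 2   # float16, as per the warnings in the vLLM log above
  print(f"~{params * bytes_per_param / 2**30:.1f} GiB for the weights alone")
  # prints: ~13.0 GiB for the weights alone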

===== llama.cpp =====

The GitHub page lists a number of
[[https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#supported-backends|supported backends]];
the interesting one is
[[https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md|SYCL]],
as it is described as "primarily designed for Intel GPUs".

To build with SYCL support, Intel's proprietary //icx// and //icpx// compilers
need to be present. These come in a
[[https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html|self-extracting archive with a binary installer]],
so basically a worst-case scenario for anyone interested in system security.

A convenient way to recompile the library is to reinstall the
//llama-cpp-python// wheel using pip:
(my_venv) % pip cache remove llama_cpp_python
(my_venv) % . /opt/intel/oneapi/setvars.sh
(my_venv) % CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install --verbose --force-reinstall 'llama-cpp-python[server]'
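
Whether the reinstalled wheel was actually built with offload support can be
checked directly, without going through ilab. A sketch using the low-level
binding; ''llama_supports_gpu_offload()'' mirrors the corresponding llama.cpp
C API function and is assumed to be exposed by the installed
//llama-cpp-python// version:

  # Sketch: ask the freshly built llama.cpp library whether it can offload.
  import llama_cpp

  print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())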

With the freshly built library in place, GPU offloading can be verified by
inspecting the debug output ilab prints when given the //--verbose// option:
(my_venv) % ilab --verbose model serve
[...]
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device SYCL0, is_swa = 0
load_tensors: layer 1 assigned to device SYCL0, is_swa = 0
[...]
load_tensors: layer 31 assigned to device SYCL0, is_swa = 0
load_tensors: layer 32 assigned to device SYCL0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type SYCL_Host, using CPU instead
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
Response time when chatting with the model increased, though:
>>> How are you today? [S][default]
╭──────────────────────────── granite-7b-lab-Q4_K_M.gguf ────────────────────────────╮
│ Thank you for asking! I'm doing well today. I'm an AI language model, so I don't │
│ have feelings or emotions, but I'm fully operational and ready to assist you with │
│ any questions or tasks you might have. How can I help you today? │
╰─────────────────────────────────────────────────────────── elapsed 19.407 seconds ─╯
This does not seem right. Also, the contents of the various
///sys/class/accel/accel0/device/npu_*// files remain unchanged. So either the
offloading is not working as intended, or it is simply not used for this
specific use-case; in the latter case, though, there should not be a difference
in performance at all.
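
One way to narrow this down is to drive the served model directly over its
OpenAI-compatible HTTP API while sampling the NPU counter from above around the
request. A rough sketch; the listen address below assumes ilab's default of
''127.0.0.1:8000'', and the model name in the request body is just illustrative:

  # Sketch: time one chat completion and check whether npu_busy_time_us moves.
  import json, time, urllib.request
  from pathlib import Path

  BUSY = Path("/sys/class/accel/accel0/device/npu_busy_time_us")

  def ask(prompt):
      payload = {"model": "granite-7b-lab-Q4_K_M.gguf",   # assumed model name
                 "messages": [{"role": "user", "content": prompt}]}
      req = urllib.request.Request(
          "http://127.0.0.1:8000/v1/chat/completions",    # assumed ilab default
          data=json.dumps(payload).encode(),
          headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req) as resp:
          return json.load(resp)

  busy_before = int(BUSY.read_text())
  start = time.monotonic()
  ask("How are you today?")
  elapsed = time.monotonic() - start
  busy_after = int(BUSY.read_text())
  print(f"elapsed {elapsed:.3f}s, npu_busy_time_us delta {busy_after - busy_before}")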

====== Summary ======

While all involved software components allegedly support offloading to the
notebook's Intel GPU, doing so leads to a (slightly) worse user experience in
the best case and breaks functionality in the worst case.

Many questions remain, though, and more investigation is needed for a better
picture. The most promising direction seems to be the //llama.cpp// backend:
finding out why the NPU performance counters do not increase, whether the NPU
is used at all when it should be, and which use-case would actually leverage
it.
| |