Closed
With the new 6.8.5, 6.8.6 and 6.6.27 (LTS) kernels, compute workloads can no longer run on the GPU.
The application detects the GPU and tries to run on it, but hangs with a single CPU core at 100% usage.
This happens with any OpenCL or SYCL application. (Kernel 6.8 is using the workaround provided in this thread.)
You can downgrade to Linux 6.8.4 on Arch Linux with these packages:
linux 6.8.4: https://archive.archlinux.org/packages/l/linux/linux-6.8.4.arch1-1-x86_64.pkg.tar.zst
linux-headers 6.8.4: https://archive.archlinux.org/packages/l/linux-headers/linux-headers-6.8.4.arch1-1-x86_64.pkg.tar.zst
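Before downgrading, it may help to confirm whether the running kernel is one of the versions named in this issue. The following is a minimal sketch in Python, assuming only the versions reported here (6.8.5, 6.8.6 and 6.6.27 broken; 6.8.4 working). The helper names `kernel_version` and `classify` are hypothetical, and any version not mentioned in this thread is reported as "unknown" rather than guessed.

```python
import platform
import re
from typing import Tuple

# Kernel versions mentioned in this issue only; not an exhaustive list.
REPORTED_BROKEN = {(6, 8, 5), (6, 8, 6), (6, 6, 27)}
REPORTED_WORKING = {(6, 8, 4)}


def kernel_version(release: str) -> Tuple[int, int, int]:
    """Parse the leading X.Y[.Z] of a `uname -r` string such as '6.8.5-301.fc40.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '6.9-rc1') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def classify(release: str) -> str:
    """Return 'reported broken', 'reported working' or 'unknown' for a release string."""
    version = kernel_version(release)
    if version in REPORTED_BROKEN:
        return "reported broken"
    if version in REPORTED_WORKING:
        return "reported working"
    return "unknown"


if __name__ == "__main__":
    # On Linux, platform.release() returns the same string as `uname -r`.
    release = platform.release()
    print(release, "->", classify(release))
```

The parser only looks at the numeric prefix, so distribution suffixes such as `-arch1-1` or `-301.fc40.x86_64` are ignored.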
This happens to me as well on both llama.cpp and clpeak.
clpeak output:
$ NEOReadDebugKeys=1 PrintDebugMessages=1 LogWaitingForCompletion=1 EventsDebugEnable=1 PrintKmdTimes=1 LogZEInfo=1 clpeak
WARNING: Failed to request OCL Turbo Boost
computeUnitsUsedForScratch: 4096
hwInfo: {512, 4096}: (16, 1, 32)
Platform: Intel(R) OpenCL Graphics
Device: Intel(R) Arc(TM) A770 Graphics
Driver version : 24.09.28717.17 (Linux x64)
Compute units : 512
Clock frequency : 2400 MHz
DeviceBinaryFormat::zebin : Unhandled SHT_NOTE section : .note.intelgt.metrics currently supports only : .note.intelgt.compat.
DeviceBinaryFormat::zebin::.ze_info : Minor version : 40 is newer than available in decoder : 39 - some features may be skipped
Global memory bandwidth (GBPS)
Waiting for task count 0 at location 0x7fb1fbd65000 with timeout 0. Current value: 0
Waiting completed. Current value: 0
Waiting for task count 1 at location 0x7fb1fbd5f000 with timeout 0. Current value: 0
Waiting completed. Current value: 1
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
float : DIM:1 GWS:(33550336, 1, 1) ELWS:(256, 1, 1) Offset:(0, 0, 0) AGWS:(33550336, 1, 1) LWS:(256, 1, 1) TWGS:(131056, 1, 1) NWGS:(131056, 1, 1) SWGS:(0, 0, 0)
devMode = 3, taskMode = 3.
devMode = 3, taskMode = 3.
preemption = 3.
DIM:1 GWS:(33550336, 1, 1) ELWS:(256, 1, 1) Offset:(0, 0, 0) AGWS:(33550336, 1, 1) LWS:(256, 1, 1) TWGS:(131056, 1, 1) NWGS:(131056, 1, 1) SWGS:(0, 0, 0)
devMode = 3, taskMode = 3.
devMode = 3, taskMode = 3.
preemption = 3.
Waiting for task count 2 at location 0x7fb1fbd65000 with timeout 96. Current value: 0
Waiting completed. Current value: 0
Waiting for task count 2 at location 0x7fb1fbd65000 with timeout 0. Current value: 0
It then gets stuck here, and the clpeak process consumes one CPU core (100% usage).
perf record -a, run while it is stuck, reports:
Samples: 537K of event 'cycles:P', Event count (approx.): 456413846417
Overhead Samples Command Shared Object Symbol
9.08% 36420 clpeak [kernel.kallsyms] [k] clear_bhb_loop
5.43% 21754 clpeak [kernel.kallsyms] [k] __schedule
4.73% 18984 clpeak libc.so.6 [.] __sched_yield
4.65% 18636 clpeak [kernel.kallsyms] [k] _raw_spin_lock
4.64% 18584 clpeak [vdso] [.] __vdso_clock_gettime
4.61% 18482 clpeak [kernel.kallsyms] [k] native_sched_clock
4.48% 17981 clpeak [kernel.kallsyms] [k] psi_account_irqtime
4.01% 16069 clpeak [kernel.kallsyms] [k] update_curr
3.71% 14894 clpeak [kernel.kallsyms] [k] syscall_exit_to_user_mode
3.69% 14808 clpeak [kernel.kallsyms] [k] __calc_delta.constprop.0
3.66% 14687 clpeak [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
3.50% 14043 clpeak [kernel.kallsyms] [k] pick_next_task_fair
3.23% 12965 clpeak libigdrcl.so [.] 0x000000000005f724
2.76% 11064 clpeak [kernel.kallsyms] [k] pick_eevdf
2.61% 10453 clpeak [kernel.kallsyms] [k] do_syscall_64
2.40% 9623 clpeak [kernel.kallsyms] [k] entry_SYSCALL_64
2.25% 9021 clpeak [kernel.kallsyms] [k] update_min_vruntime
1.83% 7352 clpeak [kernel.kallsyms] [k] update_curr_se
1.78% 7126 clpeak [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
1.56% 6269 clpeak [kernel.kallsyms] [k] __cgroup_account_cputime
1.52% 6079 clpeak [kernel.kallsyms] [k] record_times
1.51% 6064 clpeak [kernel.kallsyms] [k] update_rq_clock
1.49% 5981 clpeak [kernel.kallsyms] [k] do_sched_yield
1.11% 4453 clpeak [kernel.kallsyms] [k] rcu_note_context_switch
1.01% 4037 clpeak libigdrcl.so [.] 0x000000000005f726
0.91% 3651 clpeak [kernel.kallsyms] [k] yield_task_fair
0.83% 3346 clpeak [kernel.kallsyms] [k] entry_SYSCALL_64_safe_stack
0.82% 3285 clpeak [kernel.kallsyms] [k] raw_spin_rq_lock_nested
0.78% 3113 clpeak [kernel.kallsyms] [k] syscall_return_via_sysret
0.75% 3008 clpeak [kernel.kallsyms] [k] schedule
0.72% 64724 swapper [kernel.kallsyms] [k] intel_idle
0.70% 2792 clpeak libc.so.6 [.] clock_gettime@@GLIBC_2.17
0.58% 2338 clpeak [kernel.kallsyms] [k] cpuacct_charge
0.53% 2107 clpeak [kernel.kallsyms] [k] sched_clock
0.51% 2060 clpeak [kernel.kallsyms] [k] sched_clock_cpu
0.45% 1805 clpeak libstdc++.so.6.0.33 [.] std::chrono::_V2::system_clock::now()
0.42% 1690 clpeak [kernel.kallsyms] [k] _raw_spin_unlock
0.39% 1567 clpeak libigdrcl.so [.] 0x0000000000534467
0.39% 1563 clpeak [kernel.kallsyms] [k] x64_sys_call
0.37% 1469 clpeak [kernel.kallsyms] [k] __list_add_valid_or_report
0.35% 1418 clpeak [kernel.kallsyms] [k] check_cfs_rq_runtime
0.35% 1402 clpeak [kernel.kallsyms] [k] pick_next_entity
0.34% 1368 clpeak [kernel.kallsyms] [k] cgroup_rstat_updated
0.34% 1364 clpeak [kernel.kallsyms] [k] __list_del_entry_valid_or_report
0.27% 1098 clpeak [kernel.kallsyms] [k] syscall_exit_to_user_mode_prepare
0.23% 1856 pgrep libc.so.6 [.] __strncpy_evex
0.22% 882 clpeak libigdrcl.so [.] 0x000000000056e33e
0.19% 775 clpeak [kernel.kallsyms] [k] __x64_sys_sched_yield
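The perf profile above is dominated by __sched_yield, __vdso_clock_gettime, std::chrono::_V2::system_clock::now() and scheduler/syscall paths, plus a few hot addresses in libigdrcl.so. That pattern is consistent with a user-space loop that polls a completion value, yields the CPU and re-reads the clock, which would fit the last log line ("Waiting for task count 2 ... Current value: 0") never completing. The Python sketch that follows is a hypothetical illustration of such a loop, not the compute runtime's actual code; the name `wait_for_task_count`, the `read_tag` callback and the choice of `timeout_s=None` meaning "wait forever" are all assumptions made for the example.

```python
import os
import time
from typing import Callable, Optional


def wait_for_task_count(read_tag: Callable[[], int], target: int,
                        timeout_s: Optional[float]) -> bool:
    """Hypothetical busy-wait: poll read_tag() until it reaches target.

    Returns True once the value reaches target, or False if timeout_s
    (in seconds) elapses first. timeout_s=None waits forever (an
    assumption for this sketch, not documented driver behaviour).
    """
    start = time.monotonic()  # reads the clock via clock_gettime on Linux
    while read_tag() < target:
        if timeout_s is not None and time.monotonic() - start >= timeout_s:
            return False
        # Give up the CPU but stay runnable: if the GPU never writes the
        # new task count, this loop spins forever and one core sits at 100%
        # in sched_yield()/clock_gettime(), like the profile above.
        os.sched_yield()
    return True
```

With no timeout and a value that never advances (as with task count 2 in the log), the call never returns, which is what a hang with one core fully busy would look like from the outside.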
System information:
kernel 6.8.5-301.fc40.x86_64
Fedora Kinoite 40.20240419.n.0
intel-compute-runtime 24.09.28717.17-1.fc40.x86_64 (this also happens with the latest 24.13.29138.7, installed in an Ubuntu 22.04 container)
CPU: Intel Core i9-10940X
GPU: Intel Arc A770 16GB