Describe the bug
While testing our project (https://github.com/acts-project/vecmem) with the HIP backend of the compiler, I found a failure in the unit tests with atomic operations.
[bash][pcadp04]:build > ./bin/vecmem_test_sycl
Running main() from /afs/cern.ch/user/k/krasznaa/work/projects/vecmem/build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 19 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 5 tests from sycl_containers_test
[ RUN ] sycl_containers_test.shared_memory
[ OK ] sycl_containers_test.shared_memory (8 ms)
[ RUN ] sycl_containers_test.device_memory
[ OK ] sycl_containers_test.device_memory (19 ms)
[ RUN ] sycl_containers_test.atomic_memory
/afs/cern.ch/user/k/krasznaa/work/projects/vecmem/vecmem/tests/sycl/test_sycl_containers.sycl:210: Failure
Expected equality of these values:
value
Which is: 800
ITERATIONS * 4
Which is: 400
...
/afs/cern.ch/user/k/krasznaa/work/projects/vecmem/vecmem/tests/sycl/test_sycl_containers.sycl:210: Failure
Expected equality of these values:
value
Which is: 800
ITERATIONS * 4
Which is: 400
[ FAILED ] sycl_containers_test.atomic_memory (7 ms)
[ RUN ] sycl_containers_test.extendable_memory
[ OK ] sycl_containers_test.extendable_memory (13 ms)
[ RUN ] sycl_containers_test.array_memory
[ OK ] sycl_containers_test.array_memory (4 ms)
[----------] 5 tests from sycl_containers_test (54 ms total)
[----------] 5 tests from sycl_jagged_containers_test
[ RUN ] sycl_jagged_containers_test.mutate_in_kernel
[ OK ] sycl_jagged_containers_test.mutate_in_kernel (9 ms)
[ RUN ] sycl_jagged_containers_test.set_in_kernel
[ OK ] sycl_jagged_containers_test.set_in_kernel (14 ms)
[ RUN ] sycl_jagged_containers_test.set_in_contiguous_kernel
[ OK ] sycl_jagged_containers_test.set_in_contiguous_kernel (13 ms)
[ RUN ] sycl_jagged_containers_test.filter
[ OK ] sycl_jagged_containers_test.filter (15 ms)
[ RUN ] sycl_jagged_containers_test.zero_capacity
[ OK ] sycl_jagged_containers_test.zero_capacity (20 ms)
[----------] 5 tests from sycl_jagged_containers_test (73 ms total)
[----------] 3 tests from sycl_memory_resource_tests/memory_resource_test_basic
[ RUN ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/device_resource
[ OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/device_resource (0 ms)
[ RUN ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/host_resource
[ OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/host_resource (5 ms)
[ RUN ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/shared_resource
[ OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/shared_resource (5 ms)
[----------] 3 tests from sycl_memory_resource_tests/memory_resource_test_basic (10 ms total)
[----------] 6 tests from sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/host_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/host_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/shared_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/shared_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/host_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/host_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/shared_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/shared_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/host_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/host_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/shared_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/shared_resource (0 ms)
[----------] 6 tests from sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible (0 ms total)
[----------] Global test environment tear-down
[==========] 19 tests from 4 test suites ran. (139 ms total)
[ PASSED ] 18 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] sycl_containers_test.atomic_memory
1 FAILED TEST
The reason I wanted to show these results is that this seems to be the only remaining failure with this backend. 😄
To Reproduce
I made the following example to reproduce the issue in a standalone way:
// SYCL include(s).
#include <CL/sycl.hpp>

// System include(s).
#include <iostream>
#include <memory>

int main() {

    // Create a queue.
    sycl::queue queue;
    std::cout << "Using device: "
              << queue.get_device().get_info<sycl::info::device::name>()
              << std::endl;

    // Allocate a small shared memory buffer.
    static constexpr std::size_t BUFFER_SIZE = 10;
    int* buffer = static_cast<int*>(
        sycl::malloc_shared(BUFFER_SIZE * sizeof(int), queue));

    // Set all elements of it to some well defined value.
    for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
        buffer[i] = 100;
    }

    // The type of atomic reference to use.
    using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                        sycl::memory_scope::device,
                                        sycl::access::address_space::global_space>;

    // Run a kernel that would decrement each value atomically.
    queue.submit([buffer](sycl::handler& h) {
        h.parallel_for<class atomic_test>(BUFFER_SIZE,
                                          [buffer](sycl::item<1> id) {
                                              atomic_ref a(buffer[id.get_linear_id()]);
                                              a.fetch_sub(5);
                                          });
    }).wait_and_throw();

    // Check whether the decrement succeeded.
    for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
        if (buffer[i] != 95) {
            std::cerr << "buffer[" << i << "] = " << buffer[i] << std::endl;
        }
    }

    // Free the buffer.
    sycl::free(buffer, queue);

    // Return gracefully.
    return 0;
}
With this source, I see:
[bash][pcadp04]:sycl-atomics > clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx1031 sycl-atomics.cpp
warning: linking module '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc': Linking two modules of different target triples: '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc' is 'amdgcn-unknown-amdhsa' whereas 'sycl-atomics.cpp' is 'amdgcn-amd-amdhsa'
[-Wlinker-warnings]
1 warning generated.
[bash][pcadp04]:sycl-atomics > ./a.out
Using device: AMD Radeon RX 6700 XT
buffer[0] = 100
buffer[1] = 100
buffer[2] = 100
buffer[3] = 100
buffer[4] = 100
buffer[5] = 100
buffer[6] = 100
buffer[7] = 100
buffer[8] = 100
buffer[9] = 100
[bash][pcadp04]:sycl-atomics >
Other backends produce the correct results. For instance:
[bash][pcadp04]:sycl-atomics > clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda sycl-atomics.cpp
clang-16: warning: CUDA version is newer than the latest supported version 11.5 [-Wunknown-cuda-version]
warning: linking module '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-nvptx64--nvidiacl.bc': Linking two modules of different target triples: '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-nvptx64--nvidiacl.bc' is 'nvptx64-unknown-nvidiacl' whereas 'sycl-atomics.cpp' is 'nvptx64-nvidia-cuda'
[-Wlinker-warnings]
1 warning generated.
[bash][pcadp04]:sycl-atomics > ./a.out
Using device: NVIDIA RTX A5000
[bash][pcadp04]:sycl-atomics >
A couple of things to note:
- When using sycl::buffer for memory management, the issue did not show up. It seems to be limited to using USM.
- I only tried it with "shared" USM; I did not try what happens with "device" USM.
- I didn't test every atomic operation possible, but for instance fetch_add(...) does work, even with USM. It's fetch_sub(...) that notably doesn't.
Environment (please complete the following information):
- OS: CentOS Stream 9, although I've seen the same on Ubuntu 20.04 before as well. (I don't think the exact Linux version matters.)
- Target device and vendor: AMD GPU
- DPC++ version: 2022-09
- Dependencies version: In this latest test I used ROCm-5.3.0, but previously I've seen the same behaviour with ROCm-4.2.0 as well.
Pinging @ivorobts.