Describe the bug
While testing our project (https://github.com/acts-project/vecmem) with the HIP backend of the compiler, I found a failure in the unit tests with atomic operations.
[bash][pcadp04]:build > ./bin/vecmem_test_sycl
Running main() from /afs/cern.ch/user/k/krasznaa/work/projects/vecmem/build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 19 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 5 tests from sycl_containers_test
[ RUN ] sycl_containers_test.shared_memory
[ OK ] sycl_containers_test.shared_memory (8 ms)
[ RUN ] sycl_containers_test.device_memory
[ OK ] sycl_containers_test.device_memory (19 ms)
[ RUN ] sycl_containers_test.atomic_memory
/afs/cern.ch/user/k/krasznaa/work/projects/vecmem/vecmem/tests/sycl/test_sycl_containers.sycl:210: Failure
Expected equality of these values:
value
Which is: 800
ITERATIONS * 4
Which is: 400
...
/afs/cern.ch/user/k/krasznaa/work/projects/vecmem/vecmem/tests/sycl/test_sycl_containers.sycl:210: Failure
Expected equality of these values:
value
Which is: 800
ITERATIONS * 4
Which is: 400
[ FAILED ] sycl_containers_test.atomic_memory (7 ms)
[ RUN ] sycl_containers_test.extendable_memory
[ OK ] sycl_containers_test.extendable_memory (13 ms)
[ RUN ] sycl_containers_test.array_memory
[ OK ] sycl_containers_test.array_memory (4 ms)
[----------] 5 tests from sycl_containers_test (54 ms total)
[----------] 5 tests from sycl_jagged_containers_test
[ RUN ] sycl_jagged_containers_test.mutate_in_kernel
[ OK ] sycl_jagged_containers_test.mutate_in_kernel (9 ms)
[ RUN ] sycl_jagged_containers_test.set_in_kernel
[ OK ] sycl_jagged_containers_test.set_in_kernel (14 ms)
[ RUN ] sycl_jagged_containers_test.set_in_contiguous_kernel
[ OK ] sycl_jagged_containers_test.set_in_contiguous_kernel (13 ms)
[ RUN ] sycl_jagged_containers_test.filter
[ OK ] sycl_jagged_containers_test.filter (15 ms)
[ RUN ] sycl_jagged_containers_test.zero_capacity
[ OK ] sycl_jagged_containers_test.zero_capacity (20 ms)
[----------] 5 tests from sycl_jagged_containers_test (73 ms total)
[----------] 3 tests from sycl_memory_resource_tests/memory_resource_test_basic
[ RUN ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/device_resource
[ OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/device_resource (0 ms)
[ RUN ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/host_resource
[ OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/host_resource (5 ms)
[ RUN ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/shared_resource
[ OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/shared_resource (5 ms)
[----------] 3 tests from sycl_memory_resource_tests/memory_resource_test_basic (10 ms total)
[----------] 6 tests from sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/host_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/host_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/shared_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/shared_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/host_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/host_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/shared_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/shared_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/host_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/host_resource (0 ms)
[ RUN ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/shared_resource
[ OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/shared_resource (0 ms)
[----------] 6 tests from sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible (0 ms total)
[----------] Global test environment tear-down
[==========] 19 tests from 4 test suites ran. (139 ms total)
[ PASSED ] 18 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] sycl_containers_test.atomic_memory
1 FAILED TEST
The reason I wanted to show these results is that this seems to be the only remaining failure with this backend. 😄
To Reproduce
I made the following example to reproduce the issue in a standalone way:
// SYCL include(s).
#include <CL/sycl.hpp>

// System include(s).
#include <iostream>
#include <memory>

int main() {

    // Create a queue.
    sycl::queue queue;
    std::cout << "Using device: "
              << queue.get_device().get_info<sycl::info::device::name>()
              << std::endl;

    // Allocate a small shared memory buffer.
    static constexpr std::size_t BUFFER_SIZE = 10;
    int* buffer = static_cast<int*>(
        sycl::malloc_shared(BUFFER_SIZE * sizeof(int), queue));

    // Set all elements of it to some well defined value.
    for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
        buffer[i] = 100;
    }

    // The type of atomic reference to use.
    using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                        sycl::memory_scope::device,
                                        sycl::access::address_space::global_space>;

    // Run a kernel that would decrement each value atomically.
    queue.submit([buffer](sycl::handler& h) {
        h.parallel_for<class atomic_test>(BUFFER_SIZE,
                                          [buffer](sycl::item<1> id) {
                                              atomic_ref a(buffer[id.get_linear_id()]);
                                              a.fetch_sub(5);
                                          });
    }).wait_and_throw();

    // Check whether the decrement succeeded.
    for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
        if (buffer[i] != 95) {
            std::cerr << "buffer[" << i << "] = " << buffer[i] << std::endl;
        }
    }

    // Free the buffer.
    sycl::free(buffer, queue);

    // Return gracefully.
    return 0;
}
With this source, I see:
[bash][pcadp04]:sycl-atomics > clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx1031 sycl-atomics.cpp
warning: linking module '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc': Linking two modules of different target triples: '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc' is 'amdgcn-unknown-amdhsa' whereas 'sycl-atomics.cpp' is 'amdgcn-amd-amdhsa'
[-Wlinker-warnings]
1 warning generated.
[bash][pcadp04]:sycl-atomics > ./a.out
Using device: AMD Radeon RX 6700 XT
buffer[0] = 100
buffer[1] = 100
buffer[2] = 100
buffer[3] = 100
buffer[4] = 100
buffer[5] = 100
buffer[6] = 100
buffer[7] = 100
buffer[8] = 100
buffer[9] = 100
[bash][pcadp04]:sycl-atomics >
Other backends produce the correct results. For instance:
[bash][pcadp04]:sycl-atomics > clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda sycl-atomics.cpp
clang-16: warning: CUDA version is newer than the latest supported version 11.5 [-Wunknown-cuda-version]
warning: linking module '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-nvptx64--nvidiacl.bc': Linking two modules of different target triples: '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-nvptx64--nvidiacl.bc' is 'nvptx64-unknown-nvidiacl' whereas 'sycl-atomics.cpp' is 'nvptx64-nvidia-cuda'
[-Wlinker-warnings]
1 warning generated.
[bash][pcadp04]:sycl-atomics > ./a.out
Using device: NVIDIA RTX A5000
[bash][pcadp04]:sycl-atomics >
A couple of things to note:
- When using sycl::buffer for memory management, the issue did not show up. It seems to be limited to using USM.
- I only tried it with "shared" USM; I did not try what happens with "device" USM.
- I didn't test every atomic operation possible, but for instance fetch_add(...) does work, even with USM. It's fetch_sub(...) that notably doesn't.
Environment (please complete the following information):
- OS: CentOS Stream 9, although I've seen the same on Ubuntu 20.04 before as well. (I don't think the exact Linux version matters.)
- Target device and vendor: AMD GPU
- DPC++ version: 2022-09
- Dependencies version: In this latest test I used ROCm-5.3.0, but previously I've seen the same behaviour with ROCm-4.2.0 as well.
Pinging @ivorobts.