@StrongSpoon
Contributor

Training a DeepSeek model with triton_native_sparse_attention fails with the error shown below. I suspect the cause is an illegal memory access inside the backward_store_dk_dv function.

  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/native_sparse_attention_pytorch/triton_native_sparse_attention.py", line 1905, in backward
    native_sparse_attn_backward(
  File "/usr/local/lib/python3.12/dist-packages/native_sparse_attention_pytorch/triton_native_sparse_attention.py", line 1747, in native_sparse_attn_backward
    backward_kernel[grid](
  File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 347, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 395, in run
    return self.fn.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 591, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
  File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/driver.py", line 529, in __call__
    self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered

To fix this, this pull request adds a mask_n, which keeps every memory access in bounds even when offs_n is less than zero. With this change, model training runs correctly on my device.
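
For readers unfamiliar with the pattern, here is a minimal pure-Python sketch of what a masked store does. It is not the code from this pull request: `masked_store`, `seq_len`, `block_start`, and the exact bounds check are hypothetical, and the real kernel would pass the mask to Triton's `tl.store(..., mask=...)` rather than loop in Python. Only the names `offs_n` and `mask_n` come from the description above.

```python
def masked_store(buffer, offsets, values, mask):
    """Write values[i] to buffer[offsets[i]] only where mask[i] is True.

    Hypothetical helper that imitates the effect of a masked store.
    """
    for off, val, keep in zip(offsets, values, mask):
        if keep:
            buffer[off] = val


seq_len = 4                # assumed length of the dk/dv buffer
dk = [0.0] * seq_len

block_start = -2           # assumed: a block that begins before the buffer
block_size = 4
offs_n = [block_start + i for i in range(block_size)]   # [-2, -1, 0, 1]

# Assumed form of the mask: keep only offsets that fall inside the buffer.
mask_n = [0 <= o < seq_len for o in offs_n]

values = [1.0, 2.0, 3.0, 4.0]
masked_store(dk, offs_n, values, mask_n)

print(dk)  # [3.0, 4.0, 0.0, 0.0]
```

Without the mask, the two negative offsets would also be written. In Python that silently wraps around to the end of the list; on the GPU the equivalent write lands outside the allocation, which is the kind of access that can raise the illegal memory access error above.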

@lucidrains
Owner

@StrongSpoon thank you! very cute cat 🐱

@lucidrains merged commit 0c98cbf into lucidrains:main on Aug 15, 2025