@gatbontonpc gatbontonpc commented Dec 23, 2025

Add metal count equal op

This PR extends the CPU implementations of count_equal to Metal.

The current implementation dispatches a single threadgroup, but the kernel supports multiple threadgroups if that changes. Like the CPU and CUDA implementations, it only accepts int32 for src0 and src1. The kernel accumulates the count with atomic_fetch_add_explicit, which here operates on an int32 value, similar to the atomic add used in the CUDA version. This limits the size of the input buffers to 2^31 - 1 elements.
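To make the 2^31 - 1 limit concrete: in the worst case every element matches, so the count equals the number of elements, and that number has to fit in the int32 counter. A minimal sketch of such a guard is below; `count_fits_in_i32` is a hypothetical helper for illustration, not a function from this PR or from ggml.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of ggml): returns true when the worst-case
// count, where every element compares equal, still fits in an int32 counter.
bool count_fits_in_i32(int64_t n_elements) {
    assert(n_elements >= 0); // an element count is never negative
    return n_elements <= std::numeric_limits<int32_t>::max();
}
```

A check like this could be applied where the backend decides whether it supports the op for a given tensor, but where exactly it belongs is an assumption.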

The docs have been updated.

Codex-generated summary:

Summary

This PR introduces a Metal implementation for COUNT_EQUAL on int32 tensors that uses SIMD-group reduction to efficiently compute per-threadgroup partial counts and accumulate the result into the destination buffer using atomic operations.

The change improves parallel efficiency over a naïve per-element atomic approach by:

  • Performing the equality comparison per thread
  • Reducing results within a SIMD group via simd_sum
  • Emitting a single atomic update per SIMD group
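The three steps above can be modeled on the CPU to show why the number of atomic updates drops. This is only a sketch under stated assumptions: `count_equal_simulated` and `CountResult` are hypothetical names, the inner loop stands in for `simd_sum`, and the SIMD-group width of 32 is assumed rather than taken from the PR.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
struct CountResult {
    int32_t count;      // total number of positions where a[i] == b[i]
    int     atomic_ops; // number of atomic adds issued
};

// CPU model of the reduction: each "SIMD group" of simd_width lanes sums its
// per-lane comparison results, then issues one atomic add for the whole group.
CountResult count_equal_simulated(const std::vector<int32_t>& a,
                                  const std::vector<int32_t>& b,
                                  std::size_t simd_width = 32) {
    assert(a.size() == b.size());
    assert(simd_width > 0);

    std::atomic<int32_t> dst{0}; // the destination must start at zero
    int atomic_ops = 0;

    for (std::size_t base = 0; base < a.size(); base += simd_width) {
        const std::size_t end = std::min(base + simd_width, a.size());

        // Stand-in for simd_sum: add up the 0/1 comparison result of each lane.
        int32_t group_sum = 0;
        for (std::size_t i = base; i < end; ++i) {
            group_sum += (a[i] == b[i]) ? 1 : 0;
        }

        // One atomic update per group instead of one per element.
        dst.fetch_add(group_sum, std::memory_order_relaxed);
        ++atomic_ops;
    }
    return {dst.load(), atomic_ops};
}
```

For 100 elements and a width of 32, this model issues 4 atomic adds, where a per-element approach would issue up to 100.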

Key Changes

  • Added a templated Metal kernel kernel_count_equal<int32_t>
  • Uses shared memory (shmem_i32) and SIMD intrinsics (simd_sum) to aggregate counts
  • Emits a single atomic_fetch_add_explicit per SIMD group
  • Registers kernel under the exported symbol:
    kernel_count_equal_i32

@github-actions github-actions bot added the documentation, ggml, and Apple Metal labels Dec 23, 2025
Comment on lines 4137 to 4140
const size_t smem = pipeline.smem;
int64_t z = 0;
ggml_backend_tensor_set(op, &z, 0, sizeof(z));

Member

This does not work. You need to call a separate kernel that fills the buffer with zeros.

Author

@gatbontonpc gatbontonpc Dec 23, 2025

Added a new kernel that memsets a buffer to a value. It is similar to fill, but has a simpler pipeline and only takes the buffer and the value.
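Because the count is accumulated with atomic adds, the destination has to hold zero when the counting kernel runs, otherwise whatever was left in the buffer is added to the result. A minimal CPU sketch of the two-pass order follows; `fill_i32` and `accumulate_count_equal` are hypothetical stand-ins for the two kernels, not the actual Metal code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the fill pass: sets every element of buf to value.
void fill_i32(std::vector<int32_t>& buf, int32_t value) {
    for (int32_t& x : buf) {
        x = value;
    }
}

// Hypothetical stand-in for the counting pass: adds the number of equal pairs
// onto dst without resetting it first, as an accumulating kernel would.
void accumulate_count_equal(const std::vector<int32_t>& a,
                            const std::vector<int32_t>& b,
                            int32_t& dst) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        dst += (a[i] == b[i]) ? 1 : 0;
    }
}
```

Running the fill pass first, with a value of 0, and the counting pass second gives the expected count. Skipping the fill pass leaves any stale value in the result.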
