
Buggy optimization of vfmaddcsh intrinsics #98306

Closed
sayantn opened this issue Jul 10, 2024 · 2 comments · Fixed by #118071

@sayantn

sayantn commented Jul 10, 2024

The llvm.x86.avx512fp16.maskz.vfmadd.csh intrinsic (and, as a result, _mm_maskz_fmadd_sch) is being optimized incorrectly. The following snippet reproduces the problem:

#include<immintrin.h>
#include<stdio.h>

int main() {
    __m128h a, b, c, r;
    _Float16 array[8];

    a = _mm_setr_ph(0.0, 1.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0);
    b = _mm_setr_ph(0.0, 2.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0);
    c = _mm_setr_ph(0.0, 3.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0);

    r = _mm_maskz_fmadd_sch(0, a, b, c);
    _mm_storeu_ph(array, r);

    for (int i = 0; i < 8; i++){
        printf("%f\n", (float) array[i]);
    }

    return 0;
}

With clang, the unoptimized and optimized builds produce different output. The unoptimized output is the correct one according to Intel. gcc gives the correct output in both cases.

image
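
For readers without the screenshot, the semantics at stake can be sketched as a scalar reference model. This is an assumption based on a reading of Intel's documented behavior for `_mm_maskz_fmadd_sch`, not something stated in this issue: the low complex pair (lanes 0 and 1) is computed as a*b + c when mask bit 0 is set and zeroed otherwise, and lanes 2 to 7 are copied from `a`. The helper name `ref_maskz_fmadd_sch` is hypothetical, and `float` stands in for `_Float16` so the sketch builds without AVX512-FP16 support.

```c
#include <stdint.h>

/*
 * Hypothetical scalar model of _mm_maskz_fmadd_sch (assumed semantics,
 * see the note above). Lanes 0 and 1 hold the real and imaginary parts
 * of the low complex number. float is used in place of _Float16, so
 * rounding can differ from hardware for values that are not exactly
 * representable in half precision.
 */
void ref_maskz_fmadd_sch(uint8_t k, const float a[8], const float b[8],
                         const float c[8], float dst[8])
{
    if (k & 1) {
        /* Complex multiply-accumulate on the low pair. */
        dst[0] = a[0] * b[0] - a[1] * b[1] + c[0];
        dst[1] = a[1] * b[0] + a[0] * b[1] + c[1];
    } else {
        /* Zero-masking: mask bit 0 clear zeroes the low pair. */
        dst[0] = 0.0f;
        dst[1] = 0.0f;
    }

    /* Upper six lanes are passed through from a (assumed). */
    for (int i = 2; i < 8; i++)
        dst[i] = a[i];
}
```

Under this model, the inputs from the snippet with mask 0 give 0, 0, 10, 11, 12, 13, 14, 15. Any optimized build whose result differs from what the model predicts for the same inputs would be a sign of the miscompilation described here, provided the assumed semantics hold.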

System specification:

  • mingw-w64-x86_64-gcc 14.2.0
  • mingw-w64-x86_64-clang 18.1.8
  • Intel Software Development Emulator v9.44.0
@llvmbot
Member

llvmbot commented Jul 10, 2024

@llvm/issue-subscribers-backend-x86

Author: Sayantan Chakraborty (sayantn)

The `llvm.x86.avx512fp16.maskz.vfmadd.csh` intrinsic (and due to that, `_mm_maskz_fmadd_sch`) is being incorrectly optimized. This code snippet

#include<immintrin.h>
#include<stdio.h>

int main() {
    __m128h a, b, c, r;
    _Float16 array[8];

    a = _mm_setr_ph(0.0, 1.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0);
    b = _mm_setr_ph(0.0, 2.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0);
    c = _mm_setr_ph(0.0, 3.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0);

    r = _mm_maskz_fmadd_sch(0, a, b, c);
    _mm_storeu_ph(array, r);

    for (int i = 0; i < 8; i++){
        printf("%f\n", (float) array[i]);
    }

    return 0;
}

With clang, the unoptimized and optimized builds produce different output. The unoptimized output is the correct one according to Intel. gcc gives the correct output in both cases.

image

System specification:

  • mingw-w64-x86_64-gcc 14.1.0-3
  • mingw-w64-x86_64-clang 18.1.8-1
  • Intel Software Development Emulator v9.33.0

@RKSimon
Collaborator

RKSimon commented Aug 11, 2024

CC @phoebewang @KanRobert
