Open
Description
Bugzilla Link | 38280 |
Version | 6.0 |
OS | Windows NT |
Depends On | #40224 |
Reporter | LLVM Bugzilla Contributor |
CC | @davidbolvansky,@DMG862,@fhahn,@hfinkel,@LebedevRI,@RKSimon,@rotateright |
Extended Description
Example C++ code for x86, simplified from a more complex use case:
// ---- begin
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

// neg_offs <= -8 required
void apply_delta(uint8_t *dst, const uint8_t *src, ptrdiff_t neg_offs, size_t count)
{
    // Just provided for context
    while (count >= 8)
    {
        __m128i src_bytes = _mm_loadl_epi64((const __m128i *) src);
        __m128i pred_bytes = _mm_loadl_epi64((const __m128i *) (dst + neg_offs));
        __m128i sum = _mm_add_epi8(src_bytes, pred_bytes);
        _mm_storel_epi64((__m128i *) dst, sum);
        dst += 8;
        src += 8;
        count -= 8;
    }

    // This is the loop in question
    while (count--)
    {
        *dst = *src + dst[neg_offs];
        dst++;
        src++;
    }
}
// ---- end
The bottom (tail) loop gets expanded into a giant monstrosity that attempts to process 64 bytes at once, with various special-case paths: its own tail processing, a fallback for neg_offs > -64 (where the obvious 64-elements-at-a-time loop would not work, because the loads from dst + neg_offs would overlap bytes stored in the same iteration), etc.
The full code can be viewed at https://godbolt.org/g/yRThcs; I won't post it here. :)
All of which is completely pointless, because the first loop only exits once count < 8, so the tail loop will (as is easy to see) only ever see count <= 7.
This is an extreme example, but I'm seeing this general pattern (a scalar tail loop for a manually vectorized loop getting pointlessly auto-vectorized) a lot.
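For reference, here is a minimal sketch of one possible source-level workaround, assuming Clang's loop hint pragma is acceptable in the code base. The function name apply_delta_novec_tail and the two assert calls are hypothetical additions for illustration and are not part of the original code. The pragma only asks the loop vectorizer to leave the tail loop alone; it does not address the underlying issue that the trip count bound (count <= 7) is not being used, and other passes may still transform the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>

// Hypothetical variant of apply_delta with a vectorization hint on the tail.
// neg_offs <= -8 required
void apply_delta_novec_tail(uint8_t *dst, const uint8_t *src, ptrdiff_t neg_offs, size_t count)
{
    assert(neg_offs <= -8); // precondition stated in the report

    // Manually vectorized main loop, 8 bytes per iteration (unchanged)
    while (count >= 8)
    {
        __m128i src_bytes = _mm_loadl_epi64((const __m128i *) src);
        __m128i pred_bytes = _mm_loadl_epi64((const __m128i *) (dst + neg_offs));
        __m128i sum = _mm_add_epi8(src_bytes, pred_bytes);
        _mm_storel_epi64((__m128i *) dst, sum);
        dst += 8;
        src += 8;
        count -= 8;
    }

    // The loop above only exits once count < 8
    assert(count <= 7);

    // Clang-specific hint: do not auto-vectorize or interleave this loop.
    // Compilers that do not know this pragma ignore it (possibly with a warning).
    #pragma clang loop vectorize(disable) interleave(disable)
    while (count--)
    {
        *dst = static_cast<uint8_t>(*src + dst[neg_offs]);
        dst++;
        src++;
    }
}
```

This is only a per-loop band-aid: every manually vectorized routine with a scalar tail would need the same annotation, which is why deriving the bound in the optimizer would be preferable.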