
Pointless loop unroll / vectorization #37628

Open
@llvmbot

Description

Bugzilla Link: 38280
Version: 6.0
OS: Windows NT
Depends On: #40224
Reporter: LLVM Bugzilla Contributor
CC: @davidbolvansky, @DMG862, @fhahn, @hfinkel, @LebedevRI, @RKSimon, @rotateright

Extended Description

Example C++ code for x86, simplified from a more complex use case:

// ---- begin

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

// neg_offs <= -8 required, so the 8-byte load from dst + neg_offs never overlaps the 8-byte store to dst
void apply_delta(uint8_t *dst, const uint8_t *src, ptrdiff_t neg_offs, size_t count)
{
    // Just provided for context
    while (count >= 8)
    {
        __m128i src_bytes = _mm_loadl_epi64((const __m128i *) src);
        __m128i pred_bytes = _mm_loadl_epi64((const __m128i *) (dst + neg_offs));
        __m128i sum = _mm_add_epi8(src_bytes, pred_bytes);
        _mm_storel_epi64((__m128i *) dst, sum);

        dst += 8;
        src += 8;
        count -= 8;
    }

    // This is the loop in question
    while (count--)
    {
        *dst = *src + dst[neg_offs];
        dst++;
        src++;
    }
}

// ---- end

The bottom (tail) loop gets expanded into a giant monstrosity that tries to process 64 bytes at once. It has various special-case paths for the remainder and for the case neg_offs > -64, where the obvious 64-elements-at-a-time loop would not be correct.

The full code can be viewed at https://godbolt.org/g/yRThcs; I won't post it here. :)
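The following sketch illustrates why a W-wide version of the tail loop needs a separate path for small negative offsets. It is not LLVM's actual output. It emulates computing the loop W elements at a time: all W `dst[i + neg_offs]` bytes of a chunk are loaded before any result is stored. It then compares the result with the plain scalar loop. All function names here are hypothetical.

The two versions agree when neg_offs <= -W or neg_offs >= 0. For -W < neg_offs < 0, a chunk reads bytes that the scalar loop would already have overwritten within that same chunk.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Reference semantics of the tail loop: each output byte is src[i] plus
// another byte of dst at distance neg_offs (a byte-wise recurrence).
static void apply_delta_scalar(uint8_t *dst, const uint8_t *src,
                               ptrdiff_t neg_offs, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = (uint8_t)(src[i] + dst[(ptrdiff_t)i + neg_offs]);
}

// Hypothetical emulation of a W-wide vector loop: within each chunk, every
// load of dst[i + neg_offs] happens before any store, like load/add/store.
static void apply_delta_chunked(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t neg_offs, size_t count, size_t W)
{
    assert(W > 0);
    std::vector<uint8_t> pred(W);
    size_t i = 0;
    for (; i + W <= count; i += W) {
        for (size_t k = 0; k < W; ++k)   // "vector load" of the whole chunk
            pred[k] = dst[(ptrdiff_t)(i + k) + neg_offs];
        for (size_t k = 0; k < W; ++k)   // "vector add + store"
            dst[i + k] = (uint8_t)(src[i + k] + pred[k]);
    }
    // Fewer than W bytes left: finish with the scalar loop.
    apply_delta_scalar(dst + i, src + i, neg_offs, count - i);
}

// True if the chunked version produces the same buffer as the scalar loop.
// dst starts `margin` bytes into the buffer so dst + neg_offs stays in bounds.
static bool chunked_matches_scalar(ptrdiff_t neg_offs, size_t count, size_t W)
{
    const size_t margin = 128;
    assert(neg_offs >= -(ptrdiff_t)margin && neg_offs <= (ptrdiff_t)margin);
    std::vector<uint8_t> src(count), a(2 * margin + count), b;
    for (size_t i = 0; i < count; ++i) src[i] = (uint8_t)(3 * i + 1);
    for (size_t i = 0; i < a.size(); ++i) a[i] = (uint8_t)(7 * i);
    b = a;  // identical starting buffers
    apply_delta_scalar(a.data() + margin, src.data(), neg_offs, count);
    apply_delta_chunked(b.data() + margin, src.data(), neg_offs, count, W);
    return a == b;
}
```

For example, `chunked_matches_scalar(-64, 200, 64)` holds, while `chunked_matches_scalar(-8, 200, 64)` does not. With W = 8, the width of the manual SSE2 loop, an offset of -8 is fine. This matches the `neg_offs <= -8` requirement in the original code.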

All of this is completely pointless, because the tail loop will only ever see count <= 7: the loop above it exits only once count < 8.
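One possible source-level workaround is sketched below. It assumes Clang is the compiler and has not been checked against LLVM 6.0 or the Compiler Explorer link above. The idea is to move the tail into its own loop whose trip count is visibly at most 7, and to ask Clang not to vectorize, interleave, or unroll that loop via `#pragma clang loop`. The helper name `apply_delta_tail` is hypothetical. Whether `count & 7` alone is enough to make the vectorizer skip the loop is an assumption, not something verified here.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// Hypothetical replacement for the tail loop of apply_delta(). The 8-byte
// main loop only exits once count < 8, so at most 7 bytes remain.
static void apply_delta_tail(uint8_t *dst, const uint8_t *src,
                             ptrdiff_t neg_offs, size_t count)
{
    assert(count < 8);
    // "& 7" is a no-op when the precondition holds, but it spells out the
    // upper bound of 7 iterations in a form the optimizer can see.
    size_t n = count & 7;
#if defined(__clang__)
    // Clang-only loop hint; other compilers never see this pragma.
#pragma clang loop vectorize(disable) interleave(disable) unroll(disable)
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)(src[i] + dst[(ptrdiff_t)i + neg_offs]);
}
```

In apply_delta(), the `while (count--)` loop would then become `apply_delta_tail(dst, src, neg_offs, count);`. On compilers other than Clang the pragma is compiled out, so only the `& 7` bound remains.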

This is an extreme example, but I'm seeing this general pattern (a scalar tail loop for a manually vectorized loop getting pointlessly auto-vectorized) a lot.

Metadata


Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
