Add test & fix for unusual pitch in surface.premul_alpha()
#2882
Conversation
ankith26
left a comment
IMHO, we don't need to make the sse2 version handle the unusual pitches, and should probably just fall back to the non-simd version.
As long as the avx2 and sse2 code can handle the pitch-is-multiple-of-bpp case, we should be good to go (and cover all real use cases?)
IG the same approach should probably be taken on all simd surface manipulations, not just premul.
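As a rough illustration of the suggested dispatch (a hypothetical sketch with invented names, not pygame's actual code): take a SIMD path only when the pitch is a whole number of pixels, and route everything else to the scalar fallback.

```c
/* Hypothetical sketch of the suggested dispatch; names are illustrative
   and do not come from the pygame sources. */
#include <stdbool.h>

typedef struct {
    int w, h;
    int pitch; /* bytes per row, possibly padded */
    int bpp;   /* bytes per pixel, 4 for 32-bit surfaces */
} ToySurface;

static bool pitch_is_pixel_aligned(const ToySurface *s)
{
    return (s->pitch % s->bpp) == 0;
}

static void premul_dispatch(ToySurface *s,
                            void (*premul_simd)(ToySurface *),
                            void (*premul_scalar)(ToySurface *))
{
    if (pitch_is_pixel_aligned(s)) {
        premul_simd(s);   /* per-row pixel-sized advances stay aligned */
    }
    else {
        premul_scalar(s); /* unusual pitch: take the non-SIMD fallback */
    }
}
```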
itzpr3d4t0r
left a comment
LGTM!
📝 Walkthrough

Adds pitch-aware row advancement to premultiply-alpha routines and tight-packing guards for AVX2. Updates SSE2 and non-SIMD paths to compute per-row skips from pitch and advance pointers accordingly. Adds a test verifying premultiplication correctness on surfaces with non-standard pitch.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Py as Python caller
participant Premul as premul_alpha()
participant AVX2 as AVX2 path
participant SSE2 as SSE2 path
participant Scalar as Non-SIMD path
Py->>Premul: request premultiply (src, dst)
Premul->>Premul: Check format, dims, pitch
alt Tight-packed and AVX2 available
Premul->>AVX2: process with AVX2
note right of AVX2: Guard: pitch == width*bpp
AVX2-->>Premul: done
else AVX2 not used
alt SSE2 available
Premul->>SSE2: process row by row
note right of SSE2: Compute srcskip/dstskip = pitch - width*bpp<br/>Advance pointers by skips each row
SSE2-->>Premul: done
else
Premul->>Scalar: process row by row
note right of Scalar: Compute srcskip/dstskip and advance each row
Scalar-->>Premul: done
end
end
Premul-->>Py: result surface
```
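To make the per-row skip arithmetic in the walkthrough concrete, here is a small standalone toy (not the pygame source; buffer layout and pixel values are invented for illustration). With width 2, 4 bytes per pixel and a 10-byte pitch, each row carries 2 padding bytes that the end-of-row skip has to jump over:

```c
/* Standalone illustration of per-row skip arithmetic; not the pygame code.
   Premultiplies an RGBA buffer whose rows are padded (pitch > width * 4). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void premul_rgba_rows(uint8_t *pixels, int width, int height, int pitch)
{
    int skip = pitch - width * 4; /* padding bytes at the end of each row */
    uint8_t *row = pixels;
    for (int y = 0; y < height; y++) {
        uint8_t *px = row;
        for (int x = 0; x < width; x++) {
            uint8_t a = px[3];
            /* plain (c * a) / 255 truncation; pygame's rounding can differ by 1 */
            px[0] = (uint8_t)((px[0] * a) / 255);
            px[1] = (uint8_t)((px[1] * a) / 255);
            px[2] = (uint8_t)((px[2] * a) / 255);
            px += 4;
        }
        row = px + skip; /* advance by the leftover bytes, not a whole pixel */
    }
}

int main(void)
{
    /* 2x5 RGBA image, 10-byte pitch: 8 pixel bytes + 2 pad bytes per row */
    uint8_t buf[10 * 5];
    memset(buf, 0, sizeof(buf));
    for (int y = 0; y < 5; y++) {
        for (int x = 0; x < 2; x++) {
            uint8_t *p = buf + y * 10 + x * 4;
            p[0] = 120; p[1] = 50; p[2] = 70; p[3] = 200;
        }
    }
    premul_rgba_rows(buf, 2, 5, 10);
    printf("first pixel: %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
    /* prints: first pixel: 94 39 54 200 */
    return 0;
}
```

If the skip were ignored or treated as a whole pixel, the loop would start reading padding bytes as pixel data partway down the image, which is the kind of misbehaviour an unusual pitch can trigger.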
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Assessment against linked issues
Out-of-scope changes: None found.
# Conflicts:
#   src_c/alphablit.c
#   src_c/simd_blitters_sse2.c
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src_c/simd_blitters_sse2.c (1)
809-850: Unaligned 32-bit scalar loads/stores become UB on odd-pitch rows; use byte pointers + memcpy for the scalar path.

After moving to byte-granular row skips, row starts can be misaligned (e.g., pitch=10 → row 1 starts at +2). The scalar path dereferences Uint32* (e.g., “*srcp” / “*dstp”), which is undefined behavior on misaligned addresses and can fault on some architectures. The vector path already uses _mm_loadu_si128, which is safe.
Refactor the scalar portion to operate on Uint8* and use memcpy for 32-bit loads/stores; keep the vector portion unchanged but seed its pointers from the byte pointers. This avoids UB with minimal perf impact (only affects ≤3 pixels per row).
Here’s a focused diff for premul_surf_color_by_alpha_sse2 showing the minimal change set:
```diff
@@
-    Uint32 *srcp = (Uint32 *)src->pixels;
+    Uint8 *src8 = (Uint8 *)src->pixels;
@@
-    Uint32 *dstp = (Uint32 *)dst->pixels;
+    Uint8 *dst8 = (Uint8 *)dst->pixels;
@@
-        LOOP_UNROLLED4(
-            {
-                Uint32 alpha = *srcp & amask;
+        LOOP_UNROLLED4(
+            {
+                Uint32 src32, dst32, alpha;
+                memcpy(&src32, src8, sizeof(src32));
+                alpha = src32 & amask;
                 if (alpha == 0) {
                     /* do nothing */
                 }
                 else if (alpha == amask) {
-                    *dstp = *srcp;
+                    memcpy(&dst32, src8, sizeof(dst32));
+                    memcpy(dst8, &dst32, sizeof(dst32));
                 }
                 else {
-                    src1 = _mm_cvtsi32_si128(
-                        *srcp); /* src(ARGB) -> src1 (000000000000ARGB) */
+                    src1 = _mm_cvtsi32_si128(
+                        (int)src32); /* src(ARGB) -> src1 (000000000000ARGB) */
                     src1 = _mm_unpacklo_epi8(
                         src1, mm_zero); /* 000000000A0R0G0B -> src1 */
-                    dst1 = _mm_cvtsi32_si128(
-                        *dstp); /* dst(ARGB) -> dst1 (000000000000ARGB) */
+                    memcpy(&dst32, dst8, sizeof(dst32));
+                    dst1 = _mm_cvtsi32_si128(
+                        (int)dst32); /* dst(ARGB) -> dst1 (000000000000ARGB) */
                     dst1 = _mm_unpacklo_epi8(
                         dst1, mm_zero); /* 000000000A0R0G0B -> dst1 */
@@
-                    *dstp = _mm_cvtsi128_si32(dst1);
+                    dst32 = (Uint32)_mm_cvtsi128_si32(dst1);
+                    memcpy(dst8, &dst32, sizeof(dst32));
                 }
-                ++srcp;
-                ++dstp;
+                src8 += 4;
+                dst8 += 4;
             },
             n, width);
         /* *INDENT-ON* */
-        srcp += srcskip;
-        dstp += dstskip;
+        src8 += srcskip;
+        dst8 += dstskip;
     }
```

And just before the 128-bit loop in the same function, seed the vector pointers from the byte pointers:
```diff
-    srcp128 = (__m128i *)srcp;
-    dstp128 = (__m128i *)dstp;
+    srcp128 = (__m128i *)src8;
+    dstp128 = (__m128i *)dst8;
```

This makes all scalar accesses safe across odd/aligned/unaligned row starts, while keeping the SIMD fast path intact via _mm_loadu_si128/_mm_storeu_si128.
Also applies to: 850-852
♻️ Duplicate comments (1)
src_c/alphablit.c (1)
1717-1719: AVX2 fast-path now gated on tight packing — good catch.

Requiring src->pitch == src->w * bpp before taking the AVX2 path is the right safety constraint and addresses the “multiple-of-bpp but not equal” pitfall raised earlier.
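To make the “multiple-of-bpp but not equal” distinction concrete, here is a tiny hypothetical check (illustrative names, not the variables used in alphablit.c): a pitch can be a clean multiple of the pixel size and still include padding.

```c
/* Illustrative only; names do not match alphablit.c. */
#include <stdbool.h>
#include <stdio.h>

/* No padding at all: the whole surface is one contiguous run of pixels,
   which is what a whole-surface AVX2 fast path needs. */
static bool tightly_packed(int pitch, int w, int bpp) { return pitch == w * bpp; }

/* Weaker property: rows start on pixel boundaries but may still be padded. */
static bool pixel_aligned(int pitch, int bpp) { return pitch % bpp == 0; }

int main(void)
{
    /* w = 2, bpp = 4: pitch 8 is tight, 16 is pixel-aligned but padded,
       10 is neither. Only pitch 8 may take the contiguous fast path. */
    int pitches[] = {8, 16, 10};
    for (int i = 0; i < 3; i++) {
        printf("pitch %2d: tight=%d pixel_aligned=%d\n", pitches[i],
               tightly_packed(pitches[i], 2, 4), pixel_aligned(pitches[i], 4));
    }
    return 0;
}
```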
🧹 Nitpick comments (4)
src_c/simd_blitters_sse2.c (2)
797-800: Good: pitch-aware skips added (but name clarity would help).

Switching to byte-granular row skip for src/dst is the right fix for non-pixel-aligned pitches. Consider renaming to srcskip_bytes/dstskip_bytes (or add a short comment) to avoid confusion with the many existing “skip in pixels” variables elsewhere in this file/macros.
Apply this small doc tweak:
```diff
-    int srcskip = src->pitch - width * PG_SURF_BytesPerPixel(src);
+    // bytes to advance at end-of-row (pitch - width*bpp)
+    int srcskip = src->pitch - width * PG_SURF_BytesPerPixel(src);
 ...
-    int dstskip = dst->pitch - width * PG_SURF_BytesPerPixel(dst);
+    // bytes to advance at end-of-row (pitch - width*bpp)
+    int dstskip = dst->pitch - width * PG_SURF_BytesPerPixel(dst);
```
70-141: Heads-up: SETUP_SSE2_BLITTER still assumes 32-bit-aligned s/d_skip (divide by 4).

Unrelated to this PR’s target but worth tracking: the shared SSE2 blitter macro derives srcskip/dstskip as “info->s_skip / 4” and advances Uint32* pointers. That will be incorrect when pitch − width*bpp is not divisible by 4 (the exact scenario fixed in premul_*). Consider a follow-up PR to:
- Track skips in bytes (srcskip_bytes/dstskip_bytes),
- Advance row pointers via Uint8*,
- Keep 128-bit loads/stores with _mm_loadu_si128.
This aligns the whole SIMD suite with the fix you applied here and with ankith26’s comment about applying the approach broadly.
I can draft a safe, byte-granular version of SETUP_SSE2_BLITTER and RUN_SSE2_BLITTER that preserves current performance for aligned cases while fixing odd-pitch rows. Want me to put that together?
test/surface_test.py (2)
4003-4013: Prefer a writable buffer for frombuffer to avoid accidental copying.

Using bytes creates an immutable buffer; some backends may fall back to copying rather than sharing or could reject writes. Using bytearray makes the intent explicit and ensures the pitch math is exercised on shared memory.
Apply this small change:
```diff
-        byte_data = bytes(byte_width * surf_height)  # 50 bytes
+        byte_data = bytearray(byte_width * surf_height)  # 50 bytes
```
4014-4022: Great regression test — consider adding an odd stride and premul_alpha_ip coverage.

10-byte pitch hits the “half-pixel (2-byte) pad” case. To also cover the harder misalignment cases and the in-place variant:
- Add a 9-byte pitch (1-byte pad) case,
- Duplicate the assertions for premul_alpha_ip().
Here’s a concise extension sketch:
```diff
@@
-        test_surf = create_surface_from_byte_width(10)
-        test_surf = test_surf.premul_alpha()
+        # 2-byte pad per row
+        test_surf = create_surface_from_byte_width(10)
+        test_surf = test_surf.premul_alpha()
 ...
+        # 1-byte pad per row
+        test_surf = create_surface_from_byte_width(9)
+        test_surf = test_surf.premul_alpha()
+        for y in range(test_surf.get_height()):
+            for x in range(test_surf.get_width()):
+                self.assertEqual(test_surf.get_at((x, y)),
+                                 pygame.Color(94, 39, 55, 200))
+
+        # In-place variant
+        test_surf_ip = create_surface_from_byte_width(10)
+        test_surf_ip.premul_alpha_ip()
+        for y in range(test_surf_ip.get_height()):
+            for x in range(test_surf_ip.get_width()):
+                self.assertEqual(test_surf_ip.get_at((x, y)),
+                                 pygame.Color(94, 39, 55, 200))
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src_c/alphablit.c (3 hunks)
- src_c/simd_blitters_sse2.c (2 hunks)
- test/surface_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
test/surface_test.py (2)
buildconfig/stubs/pygame/image.pyi (1)
frombuffer (324-365)
buildconfig/stubs/pygame/surface.pyi (5)
fill (348-373)
premul_alpha (970-1009)
get_height (745-749)
get_width (739-743)
get_at (526-548)
src_c/alphablit.c (1)
src_c/simd_blitters_avx2.c (1)
pg_has_avx2(17-27)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: arm64
- GitHub Check: debug_coverage (ubuntu-24.04, 3.13.5)
- GitHub Check: debug_coverage (ubuntu-24.04, 3.9.23)
- GitHub Check: debug_coverage (ubuntu-24.04, 3.14.0rc1)
- GitHub Check: dev-check
- GitHub Check: x86_64
- GitHub Check: AMD64
- GitHub Check: aarch64
- GitHub Check: i686
- GitHub Check: x86
- GitHub Check: Debian (Bookworm - 12) [ppc64le]
ankith26
left a comment
I somehow missed that my review was requested on this PR, and just now noticed it.
LGTM, thanks for the PR 🎉
fixes #2750
From what I understand, these unusual, non-pixel-aligned surface pitches won't practically come up on any modern desktop systems (possibly no modern systems), so I prioritised fixing it for the least performance-sensitive versions of this function - sse2 & the non-SIMD fallback. These are also the versions I originally wrote, so I had a better idea of how they were supposed to work.
The basic fix is to add in the standard 'skip' value that all the blitters use to handle pitch differences between two surfaces - usually these skips are pixel-aligned. In the SSE2 case, to deal with the half-pixel (2-byte) overlap in the pitch, we have to cast the skip value down to Uint8 to get to 'channel', or single-byte, level pointer incrementing, as we only want to skip two channels' worth of a pixel (2 bytes) rather than a whole pixel (4 bytes).
I think that makes sense anyway.
This probably needs feedback from @itzpr3d4t0r and @Starbuck5 to see if they think what I've changed makes sense and if we need to do anything else here.
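A minimal sketch of the end-of-row advance being described (simplified and hypothetical, not the actual simd_blitters_sse2.c code):

```c
/* Minimal, hypothetical sketch of the byte-level row skip described above;
   not the real simd_blitters_sse2.c code. */
#include <stdint.h>

typedef uint8_t Uint8;
typedef uint32_t Uint32;

void premul_rows_sketch(Uint8 *pixels, int width, int height, int pitch)
{
    /* skip is measured in BYTES: with width = 2, bpp = 4 and pitch = 10 it
       is 2, i.e. half a pixel, so it cannot be expressed as a pixel count */
    int skip = pitch - width * 4;
    Uint32 *px = (Uint32 *)pixels;

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* ... premultiply the pixel at px in place ... */
            px++;
        }
        /* cast down to Uint8* so the pointer advances by single bytes
           (channels), then back to Uint32* for the next row's pixels */
        px = (Uint32 *)((Uint8 *)px + skip);
    }
}
```

Note that, as the review comment above points out, after a byte-sized skip the next row can start on a misaligned address, so real code has to make the per-pixel accesses tolerant of that (byte pointers plus memcpy, or unaligned SIMD loads).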