Improved perfs for vectorized bilinear interpolate cpu uint8 RGB-case (channels last) #96848


Closed
wants to merge 10 commits

Conversation

vfdev-5
Contributor

@vfdev-5 vfdev-5 commented Mar 15, 2023

Stack from ghstack (oldest at bottom):

Description

- Based on #96651
  - Improved perf for vectorized **bilinear** interpolate, uint8 RGB-case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment and added more comments (addressing reviews from #96651)

Results

  • Pillow (9.0.0.post1) == Pillow-SIMD
```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)
```

Note: there is no perf regression for the other cases. A few of them (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.
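The Speed-up column is simply the nightly time divided by the PR time. As a sanity check on the first row of the table (timings in microseconds):

```python
# Speed-up = nightly time / PR time, using the first row:
# (256, 256) -> (32, 32), aa=True.
nightly_us = 131.033  # torch nightly
pr_us = 57.591        # this PR
speedup = nightly_us / pr_us
print(round(speedup, 3))  # -> 2.275, as reported
```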

Source

Context

- #90771

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @datumbox @pmeier
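For readers unfamiliar with antialiased bilinear resampling: each output pixel is a normalized weighted sum of input pixels under a triangle filter, and with `aa=True` the filter support is stretched by the scale factor when downscaling. A minimal pure-Python sketch of the weight computation in the Pillow style; `bilinear_weights` is an illustrative helper, not the actual ATen kernel code:

```python
def bilinear_weights(out_idx, in_size, out_size):
    """Triangle-filter weights of the input pixels contributing to `out_idx`."""
    scale = in_size / out_size
    support = max(1.0, scale)          # antialias: widen the filter on downscale
    center = (out_idx + 0.5) * scale   # filter center in input coordinates
    lo = max(int(center - support + 0.5), 0)
    hi = min(int(center + support + 0.5), in_size)
    weights = [max(0.0, 1.0 - abs((i + 0.5 - center) / support))
               for i in range(lo, hi)]
    total = sum(weights)
    return lo, [w / total for w in weights]  # normalized to sum to 1

# 256 -> 32 downscale: each output pixel blends 12 input pixels.
lo, ws = bilinear_weights(0, 256, 32)
print(lo, len(ws), round(sum(ws), 6))  # -> 0 12 1.0
```

Without antialiasing (`aa=False`) the support stays at 1.0 regardless of scale, so at most two input pixels contribute, which is why the aa=False rows above are markedly faster for large downscales.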

pytorch-bot bot commented Mar 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96848

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit caaf0a5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vfdev-5 added a commit that referenced this pull request Mar 15, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: d961810
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 17, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 4ee5e45
Pull Request resolved: #96848
Member

@NicolasHug NicolasHug left a comment


Thanks a lot for working on this @vfdev-5 ! I mostly just have questions below, for my own understanding.

For future reference, it might be worth clarifying in the PR description that these improvements concern only:

  • the bilinear mode
  • channels_last RGB CPU tensors

Regarding the benchmarks, could you please clarify that we're comparing against Pillow-SIMD, since the current table only shows Pillow (9.0.0.post1). Also, it would be interesting to look at more upscaling results; right now mostly downscaling cases are reported.

Finally, what is the plan w.r.t. testing the correctness of this new implementation?

@vfdev-5 vfdev-5 changed the title Improved perfs for vectorized interpolate cpu uint8 RGB-case Improved perfs for vectorized bilinear interpolate cpu uint8 RGB-case (channels last) Mar 20, 2023
vfdev-5 added a commit that referenced this pull request Mar 20, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: #96848
vfdev-5 added a commit to vfdev-5/pytorch that referenced this pull request Mar 21, 2023
- Based on pytorch#96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: pytorch#96848


## Description

- Based on #96651
  - Improved perf for vectorized **bilinear** interpolate, uint8 RGB-case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment and added more comments (addressing reviews from #96651)
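The unification bullet above means the kernel now walks interleaved 3-byte RGB pixels in place instead of first copying them into a padded 4-byte RGBA buffer. A scalar pure-Python toy of one horizontal bilinear pass (no antialias) over an interleaved RGB row; `resize_row_rgb` is an illustrative helper, not the vectorized ATen kernel:

```python
def resize_row_rgb(row, in_w, out_w):
    # One horizontal bilinear pass over an interleaved RGB row,
    # reading the 3-byte pixels directly (stride 3, not 4): no RGBA copy.
    out = bytearray(out_w * 3)
    scale = in_w / out_w
    for ox in range(out_w):
        center = (ox + 0.5) * scale - 0.5
        x0 = max(int(center), 0)
        x1 = min(x0 + 1, in_w - 1)
        w1 = min(max(center - x0, 0.0), 1.0)  # weight of the right neighbor
        w0 = 1.0 - w1
        for c in range(3):                    # R, G, B
            v = w0 * row[x0 * 3 + c] + w1 * row[x1 * 3 + c]
            out[ox * 3 + c] = int(v + 0.5)
    return bytes(out)

# Averaging two RGB pixels down to one:
print(list(resize_row_rgb(bytes([0, 0, 0, 200, 100, 50]), 2, 1)))  # -> [100, 50, 25]
```

In the actual SIMD code the awkward part is that 3-byte pixels don't tile 128/256-bit registers evenly, which is what the unified RGB/RGBA load-store handling addresses.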

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitc005105) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.670 (+-0.445)    |         57.366 (+-0.799)        |          132.147 (+-1.236)           |      2.304 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         37.825 (+-0.417)        |          111.789 (+-1.175)           |      2.955 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.898 (+-1.335)    |        153.081 (+-2.346)        |          302.518 (+-2.632)           |      1.976 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        141.695 (+-1.415)        |          286.663 (+-2.494)           |      2.023 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.735 (+-2.054)    |        210.613 (+-3.116)        |          439.375 (+-4.014)           |      2.086 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        207.601 (+-1.639)        |          438.537 (+-4.143)           |      2.112 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.679 (+-1.321)    |        130.863 (+-1.987)        |          446.804 (+-3.283)           |      3.414 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         57.968 (+-0.270)        |          374.244 (+-13.598)          |      6.456 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.398 (+-3.485)    |        322.986 (+-1.947)        |          720.197 (+-3.467)           |      2.230 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        231.625 (+-2.006)        |          592.834 (+-3.903)           |      2.559 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.711 (+-1.666)    |        201.069 (+-2.182)        |          787.868 (+-3.648)           |      3.918 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.975 (+-0.696)        |          651.016 (+-3.926)           |      8.569 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.236 (+-6.021)    |        451.486 (+-3.939)        |         1123.923 (+-14.988)          |      2.489 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        299.597 (+-1.887)        |          915.347 (+-4.486)           |      3.055 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.751 (+-0.285)    |         78.538 (+-1.282)        |          170.465 (+-1.830)           |      2.170 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.619 (+-2.035)    |        159.614 (+-1.587)        |          330.971 (+-3.249)           |      2.074 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   950.243 (+-10.641)   |        891.369 (+-17.946)       |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.771 (+-0.961)    |         72.253 (+-1.020)        |          135.933 (+-1.625)           |      1.881 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.107 (+-2.143)    |        165.844 (+-2.177)        |          321.112 (+-2.904)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   691.470 (+-9.566)    |        764.942 (+-11.192)       |         2050.880 (+-22.188)          |      2.681 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.375 (+-1.345)        |          169.646 (+-1.640)           |      2.193 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.115 (+-3.935)        |          329.754 (+-2.590)           |      2.072 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        877.248 (+-5.736)        |         2815.870 (+-22.589)          |      3.210 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         53.120 (+-0.316)        |          112.024 (+-1.225)           |      2.109 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        147.330 (+-1.871)        |          299.152 (+-3.353)           |      2.030 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        472.182 (+-10.785)       |         1698.601 (+-16.785)          |      3.597 (+-0.000)    
```

Note: for the other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230320-160044-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

vfdev-5 added a commit that referenced this pull request Mar 21, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: f807362
Pull Request resolved: #96848
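On the "Fixed mem pointer alignment" item in the commit message above: SIMD kernels typically want working buffers to start on a vector-width boundary (e.g. 32 bytes for AVX2 loads). A generic illustration of rounding an address up to such a boundary, not the actual ATen code:

```python
def align_up(addr, alignment=32):
    # Round `addr` up to the next multiple of `alignment`,
    # which must be a power of two (e.g. 32 bytes for AVX2).
    assert alignment & (alignment - 1) == 0
    return (addr + alignment - 1) & ~(alignment - 1)

print(hex(align_up(0x1003)))  # -> 0x1020
print(align_up(64))           # -> 64 (already aligned)
```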
@vfdev-5 vfdev-5 requested a review from peterbell10 March 21, 2023 14:04
Collaborator

@peterbell10 peterbell10 left a comment


Code looks reasonable, but I'll wait for a response to NicolasHug's questions on testing and more benchmarks for upsampling. Also, some benchmarks showing the 4-channel case hasn't regressed would be nice.

@vfdev-5
Contributor Author

vfdev-5 commented Mar 21, 2023

@peterbell10 I already added more upsampling benchmarks; see the description after the line "# More test-cases from #90771". @NicolasHug can you confirm that those benchmarks are sufficient? Thanks

      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.525 (+-1.225)        |          329.754 (+-2.590)           |      2.067 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        890.106 (+-3.358)        |         2815.870 (+-22.589)          |      3.164 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.399 (+-0.314)        |          112.024 (+-1.225)           |      2.138 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        148.780 (+-1.282)        |          299.152 (+-3.353)           |      2.011 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        479.273 (+-3.432)        |         1698.601 (+-16.785)          |      3.544 (+-0.000)    
```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest, the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 22, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 6132906
Pull Request resolved: #96848
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved perf for the vectorized **bilinear** interpolation uint8 RGB case, **channels last**
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA-case perf is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment and added more comments (addressing reviews from #96651)
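For context, antialiased bilinear resizing is typically implemented as two separable 1D passes, each driven by precomputed per-output-pixel source ranges and weights. Below is a minimal pure-Python sketch of one such pass (illustrative of the algorithm family only, with hypothetical names; the PR's actual kernel is vectorized C++ operating on uint8 channels-last data):

```python
def linear_weights(in_size, out_size):
    """Per-output-pixel (start index, normalized weights) for an
    antialiased linear (triangle) filter along one axis."""
    scale = in_size / out_size
    support = max(scale, 1.0)  # widen the filter when downscaling (antialias)
    weights = []
    for i in range(out_size):
        center = (i + 0.5) * scale
        lo = max(int(center - support), 0)
        hi = min(int(center + support + 1), in_size)
        w = [max(0.0, 1.0 - abs((j + 0.5 - center) / support)) for j in range(lo, hi)]
        total = sum(w)
        weights.append((lo, [x / total for x in w]))
    return weights

def resize_1d(row, out_size):
    """Apply the precomputed weights along one axis of a row of pixel values."""
    out = []
    for lo, w in linear_weights(len(row), out_size):
        out.append(sum(row[lo + k] * wk for k, wk in enumerate(w)))
    return out
```

A full 2D resize would run this pass horizontally and then vertically; the C++ kernel additionally quantizes the weights to integers and processes the three RGB channels with SIMD lanes.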

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitce4be01) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.548 (+-0.280)    |         57.536 (+-0.210)        |          132.147 (+-1.236)           |      2.297 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         38.532 (+-0.219)        |          111.789 (+-1.175)           |      2.901 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.689 (+-1.348)    |        156.262 (+-1.213)        |          302.518 (+-2.632)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        145.483 (+-1.077)        |          286.663 (+-2.494)           |      1.970 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   178.117 (+-1.956)    |        215.053 (+-1.470)        |          439.375 (+-4.014)           |      2.043 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        211.340 (+-2.239)        |          438.537 (+-4.143)           |      2.075 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.593 (+-1.266)    |        130.414 (+-1.633)        |          446.804 (+-3.283)           |      3.426 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         58.767 (+-0.203)        |          374.244 (+-13.598)          |      6.368 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.210 (+-2.937)    |        324.157 (+-1.895)        |          720.197 (+-3.467)           |      2.222 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        239.800 (+-2.492)        |          592.834 (+-3.903)           |      2.472 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.255 (+-1.629)    |        204.834 (+-1.496)        |          787.868 (+-3.648)           |      3.846 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         77.335 (+-0.341)        |          651.016 (+-3.926)           |      8.418 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.286 (+-2.439)    |        443.934 (+-2.899)        |         1123.923 (+-14.988)          |      2.532 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        312.220 (+-2.307)        |          915.347 (+-4.486)           |      2.932 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.611 (+-0.337)    |         80.849 (+-1.780)        |          170.465 (+-1.830)           |      2.108 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   132.971 (+-1.624)    |        164.892 (+-1.426)        |          330.971 (+-3.249)           |      2.007 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.467 (+-3.179)    |        891.414 (+-5.282)        |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.539 (+-0.327)    |         72.471 (+-0.367)        |          135.933 (+-1.625)           |      1.876 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.669 (+-1.867)    |        168.628 (+-1.213)        |          321.112 (+-2.904)           |      1.904 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.933 (+-3.175)    |        746.911 (+-2.985)        |         2050.880 (+-22.188)          |      2.746 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.347 (+-0.338)        |          169.646 (+-1.640)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        162.194 (+-1.089)        |          329.754 (+-2.590)           |      2.033 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        894.476 (+-2.738)        |         2815.870 (+-22.589)          |      3.148 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.728 (+-0.406)        |          112.024 (+-1.225)           |      2.125 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        151.560 (+-1.128)        |          299.152 (+-3.353)           |      1.974 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        500.053 (+-4.288)        |         1698.601 (+-16.785)          |      3.397 (+-0.000)    
```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest, the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 93bd276
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 29, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 8b90000
Pull Request resolved: #96848
@vfdev-5
Contributor Author

vfdev-5 commented Mar 29, 2023

From your benchmarks: ....
Are these swapped? 2.1x speedup for 4 channels, 1.1x speedup for 3 channels.

@peterbell10 @NicolasHug it turned out that this is due to noisy measurements on my machine for the "channels first (1024, 1024) -> (256, 256)" cases. I measured just these cases on nightly and on this PR and could not get any reliable results. Overall the numbers look similar, so I do not expect any improvement for these cases.
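One common way to make such short-kernel measurements more robust to machine noise is to take the median over several repeated timing runs. A stdlib-only sketch (the lambda workload below is a trivial stand-in, not the resize kernel):

```python
import timeit
from statistics import median

def bench_us(fn, repeats=11, number=100):
    """Median over several timeit repeats, in microseconds per call.
    The median is less sensitive to scheduler/turbo noise than the mean."""
    times = timeit.repeat(fn, repeat=repeats, number=number)
    return median(times) / number * 1e6

# Trivial stand-in workload (not the interpolation kernel):
print(f"{bench_us(lambda: sum(range(1000))):.2f} us/call")
```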

@vfdev-5 vfdev-5 closed this Mar 29, 2023
@vfdev-5 vfdev-5 reopened this Mar 29, 2023
@vfdev-5 vfdev-5 requested a review from peterbell10 March 29, 2023 21:14
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 30, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@vfdev-5 vfdev-5 added the release notes: nn release notes category label Mar 30, 2023
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud


Failing merge rule: Core Maintainers

@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved perf for the vectorized **bilinear** interpolation uint8 RGB case, **channels last**
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA-case perf is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment and added more comments (addressing reviews from #96651)
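The mem pointer alignment fix refers to the general requirement that SIMD loads/stores prefer addresses that are multiples of the vector width. The usual round-up computation looks like this (a generic sketch of the technique, not the PR's actual C++ code):

```python
def align_up(addr, alignment=32):
    """Round an address up to the next multiple of `alignment`
    (which must be a power of two) -- the standard trick used
    before issuing aligned SIMD loads/stores."""
    assert alignment & (alignment - 1) == 0, "alignment must be a power of two"
    return (addr + alignment - 1) & ~(alignment - 1)
```

In C++ the same bit trick is applied to a `uintptr_t` cast of the raw pointer, so a scalar prologue can handle the bytes before the first aligned boundary.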

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)    

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)    

```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)
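For context, the operation being benchmarked above can be sketched in scalar form as below. This is a minimal pure-Python illustration of bilinear resize on a uint8 RGB image stored channels-last (H, W, C), not the PyTorch implementation; the function name and the exact pixel-center sampling convention are assumptions. The PR vectorizes the equivalent inner loop in the CPU kernel.

```python
# Illustrative scalar bilinear resize for a uint8, channels-last RGB image.
# Not the PyTorch kernel: names and sampling details here are assumptions.

def resize_bilinear_u8(src, out_h, out_w):
    in_h, in_w = len(src), len(src[0])
    channels = len(src[0][0])
    scale_y = in_h / out_h
    scale_x = in_w / out_w
    out = []
    for oy in range(out_h):
        # pixel-center sampling, clamped at the borders
        fy = max((oy + 0.5) * scale_y - 0.5, 0.0)
        y0 = min(int(fy), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for ox in range(out_w):
            fx = max((ox + 0.5) * scale_x - 0.5, 0.0)
            x0 = min(int(fx), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            px = []
            for c in range(channels):  # channels-last: RGB is the innermost loop
                v = (src[y0][x0][c] * (1 - wy) * (1 - wx)
                     + src[y0][x1][c] * (1 - wy) * wx
                     + src[y1][x0][c] * wy * (1 - wx)
                     + src[y1][x1][c] * wy * wx)
                px.append(min(255, max(0, round(v))))  # clamp back to uint8 range
            row.append(px)
        out.append(row)
    return out

# 32x32 RGB gradient, downsampled to 4x4
img = [[[x * 8 % 256] * 3 for x in range(32)] for _ in range(32)]
small = resize_bilinear_u8(img, 4, 4)
print(len(small), len(small[0]), len(small[0][0]))
```

Because the three channel values of a pixel are contiguous in channels-last layout, the innermost per-channel loop is the part that maps naturally onto SIMD lanes, which is what makes the RGB channels-last case profitable to vectorize.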


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/vfdev-5/2/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96848)

pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 6c30da9
Pull Request resolved: #96848
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
