@Nicoshev Nicoshev commented Nov 4, 2025

Summary:
Adding NEON translation of FloatOrHalfToFused8BitRowwiseQuantizedSBFloat, used by Ads
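
For readers unfamiliar with the format, here is a minimal scalar sketch in Python of what a fused 8-bit rowwise quantization with float scale and bias does to one row. It is not the NEON kernel from this PR. The names `quantize_row_8bit` and `dequantize_row_8bit`, the epsilon value, and the byte layout (uint8 codes followed by a float32 scale and bias) are assumptions made for illustration.

```python
import struct

def quantize_row_8bit(row, eps=1e-8):
    """Map one row of floats to uint8 codes, then append float32 scale and bias."""
    lo, hi = min(row), max(row)
    value_range = hi - lo
    scale = value_range / 255.0                  # step between adjacent codes
    inverse_scale = 255.0 / (value_range + eps)  # eps guards a constant row
    codes = bytes(
        min(255, max(0, round((x - lo) * inverse_scale))) for x in row
    )
    # Assumed layout: codes first, then scale and bias as little-endian float32.
    return codes + struct.pack("<ff", scale, lo)

def dequantize_row_8bit(packed):
    """Invert quantize_row_8bit up to the rounding error of one code step."""
    codes = packed[:-8]
    scale, bias = struct.unpack("<ff", packed[-8:])
    return [code * scale + bias for code in codes]

row = [0.0, 0.5, 1.0, 2.0]
packed = quantize_row_8bit(row)
restored = dequantize_row_8bit(packed)
print(len(packed))  # 12: four codes plus eight bytes of scale and bias
```

Each row needs a min/max reduction followed by an elementwise scale-and-round pass, which is the kind of work that maps naturally onto SIMD lanes.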

Throughput improves by roughly 5x to 19x depending on the shape (about an order of magnitude for most shapes):

Before:

bit_rate,   rows,  cols,  elems_per_usec,  GB/Sec
   8,   100,     16,          378.68,       1.51
   8,   100,     64,          286.91,       1.15
   8,   100,    128,          262.06,       1.05
   8,   100,    256,          251.34,       1.01
   8,   100,    512,          244.92,       0.98
   8,   100,   1024,          237.35,       0.95
   8,   100,   2048,          230.83,       0.92
   8,   120,     16,          378.70,       1.51
   8,   120,     64,          286.72,       1.15
   8,   120,    128,          263.40,       1.05
   8,   120,    256,          251.58,       1.01
   8,   120,    512,          245.30,       0.98
   8,   120,   1024,          238.17,       0.95
   8,   120,   2048,          230.69,       0.92
   8,  1000,     16,          392.85,       1.57
   8,  1000,     64,          294.35,       1.18
   8,  1000,    128,          264.35,       1.06
   8,  1000,    256,          252.13,       1.01
   8,  1000,    512,          245.50,       0.98
   8,  1000,   1024,          241.61,       0.97
   8,  1000,   2048,          231.39,       0.93

After:

bit_rate,   rows,  cols,  elems_per_usec,  GB/Sec
   8,   100,     16,         1855.59,       7.42
   8,   100,     64,         2615.43,      10.46
   8,   100,    128,         3134.34,      12.54
   8,   100,    256,         2610.72,      10.44
   8,   100,    512,         3065.20,      12.26
   8,   100,   1024,         3535.29,      14.14
   8,   100,   2048,         3757.66,      15.03
   8,   120,     16,         1991.94,       7.97
   8,   120,     64,         2971.25,      11.89
   8,   120,    128,         3403.37,      13.61
   8,   120,    256,         2750.87,      11.00
   8,   120,    512,         3272.63,      13.09
   8,   120,   1024,         3618.98,      14.48
   8,   120,   2048,         3848.59,      15.39
   8,  1000,     16,         2329.11,       9.32
   8,  1000,     64,         3068.76,      12.28
   8,  1000,    128,         3678.86,      14.72
   8,  1000,    256,         4440.37,      17.76
   8,  1000,    512,         4558.70,      18.23
   8,  1000,   1024,         4620.94,      18.48
   8,  1000,   2048,         3898.84,      15.60
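
The two throughput columns can be cross-checked with a short calculation. The sketch below assumes that GB/Sec is elems_per_usec multiplied by 4 bytes per element. That is an inference from the numbers, not something the PR states. The helper name `gb_per_sec` is hypothetical, and only two shapes are copied over.

```python
def gb_per_sec(elems_per_usec, bytes_per_elem=4):
    # elements/us * bytes/element = bytes/us = MB/s; divide by 1000 for GB/s
    return elems_per_usec * bytes_per_elem / 1000.0

# (rows, cols) -> elems_per_usec, copied from the tables above
before = {(100, 16): 378.68, (1000, 1024): 241.61}
after = {(100, 16): 1855.59, (1000, 1024): 4620.94}

for shape in before:
    speedup = after[shape] / before[shape]
    print(shape, f"{gb_per_sec(after[shape]):.2f} GB/s", f"{speedup:.1f}x")
```

These two shapes are the smallest and largest speedups in the tables.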

Reviewed By: mcfi

Differential Revision: D86236406

netlify bot commented Nov 4, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 17ea372
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/690b586d653b25000852ed2f
😎 Deploy Preview https://deploy-preview-5089--pytorch-fbgemm-docs.netlify.app

meta-codesync bot commented Nov 4, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86236406.

@meta-cla meta-cla bot added the cla signed label Nov 4, 2025
Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 5, 2025
…#5089)

Summary:
X-link: facebookresearch/FBGEMM#2098


Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 5, 2025
…#5089)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2098

Pull Request resolved: pytorch#5089

@Nicoshev Nicoshev force-pushed the export-D86236406 branch 2 times, most recently from 00b3d72 to 9c5601b Compare November 5, 2025 03:46

meta-codesync bot commented Nov 5, 2025

This pull request has been merged in 16aa87b.

Bernard-Liu pushed a commit to ROCm/FBGEMM that referenced this pull request Nov 11, 2025
…#5089)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2098

Pull Request resolved: pytorch#5089

fbshipit-source-id: 12c20cbdbbc9b0674ccca8e1aa598b7de144dea9