
Conversation

@NickNickGo NickNickGo commented Sep 24, 2020

This PR involves three optimizations:

  1. parallelizing ngram blocking across all samples within a batch;
  2. parallelizing ngram blocking across all ngrams within a sample;
  3. accessing consecutive words from shared memory instead of global memory.

Transformers BART large, batch size 128, 1k samples: throughput improves from 9.1 to 11.8 (including model load time).
Fairseq BART large, batch size 128, 1k samples: throughput improves from 13.7 to 15.5; generation time drops from 48.4 to 39.7.
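
For illustration, here is a minimal sketch of how these three optimizations can map onto a CUDA kernel. This is a simplified illustration, not the exact kernel in this PR; the names tokens, lprobs, step, and no_repeat_ngram follow the snippets quoted later in this review, and the tensor shapes are assumptions.

```cuda
// Sketch: one thread block per hypothesis (optimization 1), one thread per
// candidate ngram start position (optimization 2), and the hypothesis tokens
// staged in shared memory (optimization 3).
#include <math.h>

__global__ void ban_repeated_ngrams_sketch(
    const long* __restrict__ tokens,  // [batch, max_len] generated tokens
    float* __restrict__ lprobs,       // [batch, vocab] next-token log-probs
    int max_len, int vocab_size,
    int step, int no_repeat_ngram) {
  extern __shared__ long shared_tokens[];  // holds step + 1 tokens
  const long* row = tokens + (long)blockIdx.x * max_len;

  // Copy this hypothesis' generated tokens into shared memory once, so the
  // comparisons below read fast shared memory instead of global memory.
  for (int i = threadIdx.x; i <= step; i += blockDim.x) {
    shared_tokens[i] = row[i];
  }
  __syncthreads();

  // Each thread checks whether the ngram starting at `start` matches the
  // (no_repeat_ngram - 1)-token suffix ending at position `step`.
  int start = threadIdx.x;
  if (start > step - no_repeat_ngram + 1) return;

  bool match = true;
  for (int k = 0; k < no_repeat_ngram - 1; ++k) {
    if (shared_tokens[start + k] !=
        shared_tokens[step - no_repeat_ngram + 2 + k]) {
      match = false;
      break;
    }
  }
  if (match) {
    // Ban the token that followed this ngram the last time it appeared.
    long banned = shared_tokens[start + no_repeat_ngram - 1];
    lprobs[(long)blockIdx.x * vocab_size + banned] = -INFINITY;
  }
}
```

A launch along the lines of `ban_repeated_ngrams_sketch<<<batch_size, step - no_repeat_ngram + 2, (step + 1) * sizeof(long)>>>(...)` would match the thread-count and shared-memory sizing quoted in the snippet below.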

@NickNickGo NickNickGo requested a review from a team September 24, 2020 22:59
@feihugis feihugis left a comment


Thanks @NickNickGo for implementing this operation! I have not looked into the details yet; here are some general comments:

  1. add unit tests to make sure the op works as expected for different cases;
  2. add a benchmarking unit test for this op;
  3. check that the inputs are valid, either in the Python API or in the backend;
  4. add the license header to each file;
  5. add docs for both the Python and C++ code.

Looking forward to the performance numbers!

@NickNickGo NickNickGo requested a review from a team September 25, 2020 07:07
@yuyan2do yuyan2do left a comment


Overall it looks good to me. I added some comments about parameter naming.

Here is a tutorial on CUDA programming that explains blocks, threads, and shared memory:
https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

```cpp
int no_repeat_ngram = 3;
// One thread per candidate ngram start position in tokens[0..step].
int threads = step - no_repeat_ngram + 2;
// Shared memory holds the step + 1 tokens generated so far for a hypothesis.
int shared_mem_size = (step + 1) * sizeof(long);
if (threads <= 0) return lprobs;
```

Put the return check at the beginning of this function.

@JiushengChen

> Overall it looks good to me. I added some comments about parameter naming.
>
> Here is a tutorial on CUDA programming that explains blocks, threads, and shared memory:
> https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

This page explains the various CUDA architectures well: https://en.wikipedia.org/wiki/CUDA
Key facts about V100:

  1. V100 has 80 SMs.
  2. Each SM supports at most 32 resident blocks, 2048 resident threads, and 96 KB of shared memory.
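
If it helps, these limits can also be queried at runtime instead of hard-coded. A small host-side sketch using the standard cudaDeviceProp fields (an illustration, not part of this PR):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  std::printf("SMs: %d\n", prop.multiProcessorCount);
  std::printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
  std::printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
  std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  return 0;
}
```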

Comment on lines 52 to 53
torch::Tensor tokens,
torch::Tensor lprobs,

Do we need to check the dimensions of tokens and lprobs? For example, is the 0th dim of tokens assumed to be the batch dimension?

Could tokens be constant, i.e. const torch::Tensor& tokens?
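
For example, the kind of validation being suggested could look something like this. This is a hedged sketch using the TORCH_CHECK macro from the PyTorch C++ extension headers; the function name and exact signature are assumptions, not the ones in this PR.

```cpp
#include <torch/extension.h>

// Hypothetical wrapper illustrating input validation; the real entry point
// in this PR may be named and shaped differently.
torch::Tensor ngram_repeat_block_forward(
    const torch::Tensor& tokens,   // passed by const reference, as suggested
    torch::Tensor lprobs,
    int64_t step,
    int64_t no_repeat_ngram_size) {
  TORCH_CHECK(tokens.dim() == 2, "tokens is expected to be 2-D [batch, seq_len]");
  TORCH_CHECK(lprobs.dim() == 2, "lprobs is expected to be 2-D [batch, vocab]");
  TORCH_CHECK(tokens.size(0) == lprobs.size(0),
              "tokens and lprobs must have the same batch dimension");
  TORCH_CHECK(tokens.is_cuda() && lprobs.is_cuda(),
              "tokens and lprobs must be CUDA tensors");
  // ... launch the CUDA kernel here ...
  return lprobs;
}
```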

@NickNickGo (Author)

Before:

| Model | W/O FastSeq (in samples/s) | W/ FastSeq (in samples/s) | Speedup |
|---|---|---|---|
| ProphetNet | 2.7 | 10.3 | 3.8x |
| Bart (fs) | 2.7 | 12.5 | 4.6x |
| Bart (hf) | 3.4 | 8.1 | 2.4x |
| DistilBart (hf) | 4.0 | 8.5 | 2.1x |
| T5 (hf) | 4.8 | 7.5 | 1.6x |
| WMT16 En-De (fs) | 84.0 | 122.0 | 1.5x |

After:

| Model | W/O FastSeq (in samples/s) | W/ FastSeq (in samples/s) | Speedup |
|---|---|---|---|
| ProphetNet | 2.7 | 10.3 | 3.8x |
| Bart (fs) | 2.7 | 13.3 | 5x |
| Bart (hf) | 3.4 | 9.9 | 2.9x |
| DistilBart (hf) | 4.0 | 11.9 | 3x |
| T5 (hf) | 4.8 | 11.0 | 2.3x |
| WMT16 En-De (fs) | 84.0 | 124.0 | 1.5x |

README.md Outdated
Comment on lines 23 to 25
- ## How it works?
- We developped a wide range of speedup techniques, including improving beam search efficiency, reducing memory footprint, speeding up calculation for key operations etc, IO speedup etc. To seamlessly connect with community, they were applied to existing models from Fairseq and Huggingface Transformers in the backend, while keeping model interface and usage same as before.
-

Why add "-" here? This is supposed to be an independent section :)

@NickNickGo (Author)

Removed

@NickNickGo NickNickGo merged commit 6b3c0cb into microsoft:main Nov 13, 2020