
Conversation

@NickNickGo NickNickGo commented Sep 24, 2020

This PR involves three optimizations:

  1. parallelizing ngram blocking across all samples within a batch;
  2. parallelizing ngram blocking across all ngrams within a sample;
  3. accessing consecutive words from shared memory instead of global memory.

Transformers BART large, batch size 128, 1k samples: throughput improves from 9.1 to 11.8 (including model load time).
Fairseq BART large, batch size 128, 1k samples: throughput improves from 13.7 to 15.5; generation time drops from 48.4 to 39.7.
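
For illustration, here is a minimal sketch of how these three optimizations can map onto a CUDA kernel. This is a simplified illustration, not the exact kernel in this PR; the names tokens, lprobs, step, and no_repeat_ngram follow the snippets quoted later in this review, and the tensor shapes are assumptions.

```cuda
// Sketch: one thread block per hypothesis (optimization 1), one thread per
// candidate ngram start position (optimization 2), and the hypothesis tokens
// staged in shared memory (optimization 3).
#include <math.h>

__global__ void ban_repeated_ngrams_sketch(
    const long* __restrict__ tokens,  // [batch, max_len] generated tokens
    float* __restrict__ lprobs,       // [batch, vocab] next-token log-probs
    int max_len, int vocab_size,
    int step, int no_repeat_ngram) {
  extern __shared__ long shared_tokens[];  // holds step + 1 tokens
  const long* row = tokens + (long)blockIdx.x * max_len;

  // Copy this hypothesis' generated tokens into shared memory once, so the
  // comparisons below read fast shared memory instead of global memory.
  for (int i = threadIdx.x; i <= step; i += blockDim.x) {
    shared_tokens[i] = row[i];
  }
  __syncthreads();

  // Each thread checks whether the ngram starting at `start` matches the
  // (no_repeat_ngram - 1)-token suffix ending at position `step`.
  int start = threadIdx.x;
  if (start > step - no_repeat_ngram + 1) return;

  bool match = true;
  for (int k = 0; k < no_repeat_ngram - 1; ++k) {
    if (shared_tokens[start + k] !=
        shared_tokens[step - no_repeat_ngram + 2 + k]) {
      match = false;
      break;
    }
  }
  if (match) {
    // Ban the token that followed this ngram the last time it appeared.
    long banned = shared_tokens[start + no_repeat_ngram - 1];
    lprobs[(long)blockIdx.x * vocab_size + banned] = -INFINITY;
  }
}
```

A launch along the lines of `ban_repeated_ngrams_sketch<<<batch_size, step - no_repeat_ngram + 2, (step + 1) * sizeof(long)>>>(...)` would match the thread-count and shared-memory sizing quoted in the snippet below.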

@NickNickGo NickNickGo requested a review from a team September 24, 2020 22:59
@feihugis feihugis left a comment


Thanks @NickNickGo for implementing this operation! I have not looked into the details yet; here are some general comments:

  1. add unit tests to make sure the op works as expected for different cases;
  2. add a benchmarking unit test for this op;
  3. check that the inputs are valid, either in the Python API or in the backend;
  4. add the license header to each file;
  5. add docs for both the Python and C++ code.

Looking forward to the performance numbers!

@NickNickGo NickNickGo requested a review from a team September 25, 2020 07:07
@yuyan2do yuyan2do left a comment


Overall it looks good to me. I added some comments about parameter naming.

Here is a tutorial on CUDA programming that explains blocks, threads, and shared memory:
https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

```cpp
int no_repeat_ngram = 3;
// One thread per candidate ngram start position in tokens[0..step].
int threads = step - no_repeat_ngram + 2;
// Shared memory holds the step + 1 tokens generated so far for a hypothesis.
int shared_mem_size = (step + 1) * sizeof(long);
if (threads <= 0) return lprobs;
```

Put the return check at the beginning of this function.

@JiushengChen

> Overall it looks good to me. I added some comments about parameter naming.
>
> Here is a tutorial on CUDA programming that explains blocks, threads, and shared memory:
> https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

This page explains the various CUDA architectures well: https://en.wikipedia.org/wiki/CUDA
Key facts about V100:

  1. V100 has 80 SMs.
  2. Each SM supports at most 32 resident blocks, 2048 resident threads, and 96 KB of shared memory.
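
If it helps, these limits can also be queried at runtime instead of hard-coded. A small host-side sketch using the standard cudaDeviceProp fields (an illustration, not part of this PR):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  std::printf("SMs: %d\n", prop.multiProcessorCount);
  std::printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
  std::printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
  std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  return 0;
}
```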

Comment on lines 52 to 53
torch::Tensor tokens,
torch::Tensor lprobs,

Do we need to check the dimensions of tokens and lprobs? For example, is the 0th dim of tokens assumed to be the batch dimension?

Could tokens be constant, i.e. const torch::Tensor& tokens?
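
For example, the kind of validation being suggested could look something like this. This is a hedged sketch using the TORCH_CHECK macro from the PyTorch C++ extension headers; the function name and exact signature are assumptions, not the ones in this PR.

```cpp
#include <torch/extension.h>

// Hypothetical wrapper illustrating input validation; the real entry point
// in this PR may be named and shaped differently.
torch::Tensor ngram_repeat_block_forward(
    const torch::Tensor& tokens,   // passed by const reference, as suggested
    torch::Tensor lprobs,
    int64_t step,
    int64_t no_repeat_ngram_size) {
  TORCH_CHECK(tokens.dim() == 2, "tokens is expected to be 2-D [batch, seq_len]");
  TORCH_CHECK(lprobs.dim() == 2, "lprobs is expected to be 2-D [batch, vocab]");
  TORCH_CHECK(tokens.size(0) == lprobs.size(0),
              "tokens and lprobs must have the same batch dimension");
  TORCH_CHECK(tokens.is_cuda() && lprobs.is_cuda(),
              "tokens and lprobs must be CUDA tensors");
  // ... launch the CUDA kernel here ...
  return lprobs;
}
```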

@NickNickGo (Author)

Before:

| Model | W/O FastSeq (in samples/s) | W/ FastSeq (in samples/s) | Speedup |
|---|---|---|---|
| ProphetNet | 2.7 | 10.3 | 3.8x |
| Bart (fs) | 2.7 | 12.5 | 4.6x |
| Bart (hf) | 3.4 | 8.1 | 2.4x |
| DistilBart (hf) | 4.0 | 8.5 | 2.1x |
| T5 (hf) | 4.8 | 7.5 | 1.6x |
| WMT16 En-De (fs) | 84.0 | 122.0 | 1.5x |

After:

| Model | W/O FastSeq (in samples/s) | W/ FastSeq (in samples/s) | Speedup |
|---|---|---|---|
| ProphetNet | 2.7 | 10.3 | 3.8x |
| Bart (fs) | 2.7 | 13.3 | 5x |
| Bart (hf) | 3.4 | 9.9 | 2.9x |
| DistilBart (hf) | 4.0 | 11.9 | 3x |
| T5 (hf) | 4.8 | 11.0 | 2.3x |
| WMT16 En-De (fs) | 84.0 | 124.0 | 1.5x |

README.md Outdated
Comment on lines 23 to 25
- ## How it works?
- We developped a wide range of speedup techniques, including improving beam search efficiency, reducing memory footprint, speeding up calculation for key operations etc, IO speedup etc. To seamlessly connect with community, they were applied to existing models from Fairseq and Huggingface Transformers in the backend, while keeping model interface and usage same as before.
-

Why add "-" here? This is supposed to be an independent section :)

@NickNickGo (Author)

Removed

@NickNickGo NickNickGo merged commit 6b3c0cb into microsoft:main Nov 13, 2020