Cuda op for ngram repeat blocking #40
Conversation
Thanks @NickNickGo for implementing this operation! I have not looked into the details yet, but here are some general comments:
- add unit tests to make sure the op works as expected for different cases (a sketch follows this list);
- add a benchmarking unit test for this op;
- check whether the inputs are valid, either in the Python API or in the backend;
- add the license header to each file;
- add docs for both the Python and C++ code.

Looking forward to the performance numbers!
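To make the first bullet concrete, here is a minimal sketch of a correctness test, assuming the op implements the usual fairseq-style no-repeat-ngram rule (if the last n-1 generated tokens match an earlier n-1-gram, ban the token that completed that earlier n-gram). The binding name `ngram_repeat_block_cuda_forward` and its argument order are placeholders, not the PR's actual API.

```cpp
#include <torch/torch.h>
#include <cassert>
#include <cmath>
#include <limits>

// Plain CPU reference used as the test oracle.
torch::Tensor reference_block(const torch::Tensor& tokens, const torch::Tensor& lprobs,
                              int step, int no_repeat_ngram) {
  auto out = lprobs.clone();
  auto tok = tokens.accessor<int64_t, 2>();
  for (int b = 0; b < tokens.size(0); ++b) {
    for (int i = 0; i + no_repeat_ngram - 1 <= step; ++i) {
      bool match = true;
      for (int k = 0; k < no_repeat_ngram - 1; ++k) {
        if (tok[b][i + k] != tok[b][step - no_repeat_ngram + 2 + k]) { match = false; break; }
      }
      if (match) {
        // Ban the token that previously completed this n-gram.
        out[b][tok[b][i + no_repeat_ngram - 1]] =
            -std::numeric_limits<float>::infinity();
      }
    }
  }
  return out;
}

int main() {
  const int vocab = 16, step = 5, ngram = 3;
  // Hypothesis 0 is about to repeat the bigram (2, 3); hypothesis 1 has no repeat.
  auto tokens = torch::tensor({{2, 3, 7, 1, 2, 3},
                               {4, 5, 6, 7, 8, 9}}, torch::kLong);
  auto lprobs = torch::rand({2, vocab});
  auto expected = reference_block(tokens, lprobs, step, ngram);
  // Token 7 previously followed (2, 3), so it must be banned for hypothesis 0.
  assert(std::isinf(expected[0][7].item<float>()));
  // The CUDA op (hypothetical binding name) should match the reference:
  // auto actual = ngram_repeat_block_cuda_forward(tokens.cuda(), lprobs.cuda(), step, ngram);
  // assert(torch::allclose(actual.cpu(), expected));
  return 0;
}
```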
yuyan2do left a comment
Overall it looks good to me. Added some comments about parameter naming.
A tutorial on CUDA programming that explains blocks, threads, and shared memory:
https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
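For quick orientation, here is a minimal, self-contained CUDA example (not taken from this PR) that exercises the three concepts those slides cover: a grid of blocks, threads within a block, and dynamic shared memory.

```cuda
// Each block stages a tile of the input in dynamic shared memory,
// then every thread writes its element back doubled.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void double_elements(const float* in, float* out, int n) {
    extern __shared__ float tile[];                 // dynamic shared memory, sized at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index: block offset + thread offset
    if (i < n) tile[threadIdx.x] = in[i];           // each thread loads one element
    __syncthreads();                                // whole block waits until the tile is filled
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

int main() {
    const int n = 1 << 10, threads = 256;
    const int blocks = (n + threads - 1) / threads; // enough blocks to cover n elements
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);
    // <<<grid, block, dynamic shared memory bytes>>>
    double_elements<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[3] = %f\n", out[3]);                // expect 6.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```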
int no_repeat_ngram = 3;
int threads = step - no_repeat_ngram + 2;
int shared_mem_size = (step + 1) * sizeof(long);
if (threads <= 0) return lprobs;
Put the return check at the beginning of this function.
This page gives a good overview of the CUDA architecture: https://en.wikipedia.org/wiki/CUDA
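A sketch of the suggested reordering, using the variable names from the snippet above; the function name and signature are illustrative, not the PR's actual code.

```cpp
torch::Tensor ngram_repeat_block_forward(const torch::Tensor& tokens,
                                         torch::Tensor lprobs,
                                         int step, int no_repeat_ngram) {
  // Too few generated tokens to contain a repeated n-gram: nothing to block,
  // so return before computing any launch configuration.
  const int threads = step - no_repeat_ngram + 2;
  if (threads <= 0) return lprobs;

  const int shared_mem_size = (step + 1) * sizeof(long);
  // ... set up grid dimensions and launch the blocking kernel here ...
  (void)tokens; (void)shared_mem_size;  // silence unused warnings in this sketch
  return lprobs;
}
```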
torch::Tensor tokens,
torch::Tensor lprobs,
Do we need to check the dimensions of tokens and lprobs? For example, is the 0th dim of tokens assumed to be the batch dimension?
Could tokens be constant, i.e. const torch::Tensor& tokens?
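A sketch of what such checks could look like with TORCH_CHECK from the PyTorch C++ extension API; the assumed layout (batch-first tokens, 2-D lprobs) and the function signature are guesses for illustration, not the PR's actual code.

```cpp
#include <torch/extension.h>

// Hypothetical signature; the input validation is the point here.
torch::Tensor ngram_repeat_block_forward(const torch::Tensor& tokens,  // const ref: the op only reads tokens
                                         torch::Tensor lprobs,
                                         int step, int no_repeat_ngram) {
  TORCH_CHECK(tokens.dim() == 2, "tokens must be 2-D (batch*beam, seq_len)");
  TORCH_CHECK(lprobs.dim() == 2, "lprobs must be 2-D (batch*beam, vocab_size)");
  TORCH_CHECK(tokens.size(0) == lprobs.size(0),
              "tokens and lprobs must have the same 0th (batch) dimension");
  TORCH_CHECK(tokens.is_cuda() && lprobs.is_cuda(), "expected CUDA tensors");
  TORCH_CHECK(no_repeat_ngram > 0 && step >= 0, "invalid step / ngram size");
  // ... launch the CUDA kernel ...
  return lprobs;
}
```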
Before:
After:
README.md
Outdated
- ## How it works?
- We developped a wide range of speedup techniques, including improving beam search efficiency, reducing memory footprint, speeding up calculation for key operations etc, IO speedup etc. To seamlessly connect with community, they were applied to existing models from Fairseq and Huggingface Transformers in the backend, while keeping model interface and usage same as before.
-
Why add "-" here? This is supposed to be an independent section :)
Removed
This PR involves 3 optimizations.
Transformers BART large, batch size 128, 1k samples: throughput improves from 9.1 to 11.8 (including model load time).
Fairseq BART large, batch size 128, 1k samples: throughput improves from 13.7 to 15.5, and generation time drops from 48.4 to 39.7.