Avoid calls to tokenizer.added_tokens_decoder #12473
Merged
tokenizer.added_tokens_decoder rebuilds and returns a fresh dict on every access, and each call is relatively slow (~0.04s on average), which results in massive slowdowns when there is a huge number of added tokens:
https://github.com/huggingface/transformers/blob/9be4728af8bec48073ae841881d7f4e2ac3521d1/src/transformers/tokenization_utils_fast.py#L264
Typically this slowdown is imperceptible, but for a model like ByteCraft with 100,000 added tokens and two property accesses per token, it suddenly adds 0.04 * 2 * 100,000 = 8,000 seconds of extra time spent processing the tokens: https://huggingface.co/SamsungSAILMontreal/ByteCraft/blob/main/added_tokens.json
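For illustration, a minimal sketch of how the per-access cost can be observed (the model name comes from the link above; the timing loop and variable names here are mine, not part of this PR):

```python
import timeit

from transformers import AutoTokenizer

# Loading this tokenizer is itself slow (~2 minutes per the description below).
tokenizer = AutoTokenizer.from_pretrained("SamsungSAILMontreal/ByteCraft")

# Each access rebuilds the dict of added tokens from scratch, so the cost
# scales with the number of added tokens (~100,000 here).
per_access = timeit.timeit(lambda: tokenizer.added_tokens_decoder, number=10) / 10
print(f"~{per_access:.3f}s per access to added_tokens_decoder")
```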
This fix removes the slowdown entirely by calling the property only once at the start and reusing the result (the initial tokenizer load is still slow at ~2 minutes, but that is at least workable).
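The pattern boils down to hoisting the property access out of any per-token loop and doing lookups against a single snapshot; the sketch below is illustrative, not the actual diff in tokenization_utils_fast.py:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SamsungSAILMontreal/ByteCraft")
token_ids = [0, 1, 2]  # hypothetical ids to look up

# Before: a fresh dict is rebuilt on every iteration -> len(token_ids) rebuilds.
# tokens = [tokenizer.added_tokens_decoder.get(i) for i in token_ids]

# After: snapshot the dict once, then do cheap lookups against it.
added_tokens_decoder = tokenizer.added_tokens_decoder
tokens = [added_tokens_decoder.get(i) for i in token_ids]
```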