Avoid calls to tokenizer.added_tokens_decoder #12473

Merged 1 commit on Mar 20, 2025

Conversation

bartowski1182
Contributor

@bartowski1182 bartowski1182 commented Mar 20, 2025

tokenizer.added_tokens_decoder builds and returns a fresh dict on every access, which is relatively slow (~0.04s on average). This results in massive slowdowns when a model has a huge number of added tokens:

https://github.com/huggingface/transformers/blob/9be4728af8bec48073ae841881d7f4e2ac3521d1/src/transformers/tokenization_utils_fast.py#L264

Typically this slowdown is imperceptible, but for a model like ByteCraft with 100,000 added tokens, it adds up to 0.04 * 2 * 100,000 = 8000 seconds of extra time to process the tokens: https://huggingface.co/SamsungSAILMontreal/ByteCraft/blob/main/added_tokens.json

This fix removes the slowdown entirely by reading tokenizer.added_tokens_decoder only once at the start and reusing the result. (The initial tokenizer load is still slow at 2 minutes, but that's at least workable.)
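
To illustrate the pattern, here is a minimal sketch. It does not use the real transformers tokenizer: `FakeTokenizer`, `collect_slow`, and `collect_fast` are hypothetical names made up for this example, and the stand-in property simply copies a dict on each access to mimic the behavior described above.

```python
from __future__ import annotations


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer whose added_tokens_decoder
    property rebuilds its dict on every access."""

    def __init__(self, num_added_tokens: int) -> None:
        self._added = {i: f"<tok_{i}>" for i in range(num_added_tokens)}
        self.property_calls = 0  # counts how often the property is evaluated

    @property
    def added_tokens_decoder(self) -> dict[int, str]:
        self.property_calls += 1
        # A fresh dict is built each time, so every access costs O(size).
        return dict(self._added)


def collect_slow(tokenizer: FakeTokenizer, n: int) -> list[str]:
    # Reads the property inside the loop: one full rebuild per token.
    return [tokenizer.added_tokens_decoder[i] for i in range(n)]


def collect_fast(tokenizer: FakeTokenizer, n: int) -> list[str]:
    # Reads the property once up front, then reuses the cached dict.
    added_tokens_decoder = tokenizer.added_tokens_decoder
    return [added_tokens_decoder[i] for i in range(n)]


slow_tok = FakeTokenizer(1000)
fast_tok = FakeTokenizer(1000)
slow_result = collect_slow(slow_tok, 1000)
fast_result = collect_fast(fast_tok, 1000)

# Same output, but the property is evaluated 1000 times vs. once.
print(slow_tok.property_calls, fast_tok.property_calls)
```

Both functions return the same tokens; only the number of property evaluations differs, which is where the per-token cost comes from.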

@github-actions github-actions bot added the python python script changes label Mar 20, 2025
@ggerganov ggerganov merged commit 732b5fb into ggml-org:master Mar 20, 2025
5 checks passed
Ivy233 pushed a commit to Ivy233/llama.cpp that referenced this pull request Mar 23, 2025