[deepseek] fix mismatch of model embeddings vs tokenizer size (128815), sync to 671B vocab size, avoids cudaErrorAssert and run termination. #1495
Conversation
```diff
 ),
 "16B": DeepSeekV3ModelArgs(
-    vocab_size=102400,
+    vocab_size=128815,
```
Is there a target checkpoint we could load, e.g. DeepSeek MoE 16B?
If so, we probably want to align with that; otherwise we should align with 671B.
cc: @wwwjn
Not sure if there is a checkpoint for 16B, but for consistency we can move to 129,280 (the 671B size), which is only ~465 tokens larger, and then we have consistent sizing across all three variants.
Updated the PR to match 671B.
This also addresses my point above about wanting a size divisible by 8 for better perf, so syncing to 129,280 is an extra bonus.
To summarize, I think the ideal way to solve this issue is: use a correct tokenizer for the 16B model and keep the 16B model's parameters as they are (users might not realize the vocab size was changed by us and no longer matches the DeepSeek-AI official params). Because we don't have an official 16B model tokenizer, we could use our test tokenizer (vocab size 2000) in train_configs, or the tokenizer from https://huggingface.co/deepseek-ai/deepseek-moe-16b-base/blob/main/config.json
I agree with @wwwjn. Assuming the 16B model stays around, it would be better not to increase the model's vocab size. Otherwise, loading checkpoints like https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat will fail without some kind of custom handling of embedding weights.
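The checkpoint-loading failure described above can be sketched in miniature (illustrative sizes, not the real 102,400/128,815 dimensions): PyTorch's strict `state_dict` loading rejects an embedding whose shape differs from the checkpoint's, which is why enlarging the model's vocab breaks loading official weights.

```python
import torch.nn as nn

# Hypothetical: a checkpoint saved with the original vocab size...
ckpt_model = nn.Embedding(num_embeddings=2000, embedding_dim=8)
state = ckpt_model.state_dict()

# ...loaded into a model whose vocab size was enlarged.
new_model = nn.Embedding(num_embeddings=2005, embedding_dim=8)
try:
    new_model.load_state_dict(state)  # strict=True by default
except RuntimeError as e:
    print("load failed:", "size mismatch" in str(e))
```

Working around this would require custom handling, e.g. copying the first 2000 rows and initializing the rest, rather than a plain `load_state_dict`.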
Ah, I see. I didn't realize there was an 'actual' 16B model with checkpoints available (I thought this was just a scaled-down dev model meant to run within a single node for development/testing).
Opened a new PR which uses the 16B tokenizer here:
Based on the discussion in this PR (#1495), the conclusion was to ensure that 16B uses the proper tokenizer, to avoid the cudaErrorAssert in the current config that comes from a mismatch between the embedding size and the tokenizer vocab. Thus, this PR:

1. adds a line to the README enabling users to pull the 16B-chat tokenizer,
2. updates the 16B toml config to point to the 16B tokenizer under /assets/tokenizer/deepseek-moe-16b-chat.

With that, the vocab size of 102400 already in the toml now works flawlessly.

**Testing:**
- run download tokenizer
- run 20 iters with 16B without issue.

<img width="1255" height="201" alt="Screenshot 2025-07-30 at 12 46 38 PM" src="https://github.com/user-attachments/assets/e33556bf-51c6-4fa0-ab71-d1b02ef74d99" />
Closing, as we resolved this in a better way in the linked PR.
When trying to run DeepSeek v3 16B, I encounter the following cudaErrorAssert:
The root cause is that the spec'ed model vocab size of 102,400 is smaller than the tokenizer's vocabulary of 128,000 tokens plus 818 added special tokens, i.e. a total vocab of 128,815 (max token id + 1).
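The failure mode can be reproduced in miniature (small illustrative sizes, not the real vocab dims): indexing an embedding table with a token id at or beyond `num_embeddings` raises an `IndexError` on CPU, and surfaces as a device-side assert (cudaErrorAssert) on GPU.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=100, embedding_dim=4)  # stand-in for vocab_size=102400

ok = emb(torch.tensor([99]))        # max valid token id is num_embeddings - 1
try:
    emb(torch.tensor([128]))        # stand-in for a tokenizer id >= model vocab_size
except IndexError:
    print("out-of-range token id")  # on CUDA this shows up as a device-side assert
```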
This PR thus:
- updates the model config to use the correct vocab_size of 128815 in both the model args and `__init__.py`.
Update: let's match the 671B vocab size of 129,280 for consistency; this is also divisible by 8, which addresses my perf concern below at the same time.
*Note: there's a separate discussion to be had about rounding this size to the next power of 2 or similar (divisible by 8) for better perf.
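The rounding idea above can be sketched as a small helper (a hypothetical utility, not torchtitan code): pad the vocab size up to the next multiple of 8 so the embedding dimensions align well for GPU kernels.

```python
def pad_vocab(vocab_size: int, multiple: int = 8) -> int:
    """Round vocab_size up to the next multiple (no-op if already aligned)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab(128815))  # 128816: next multiple of 8 above the tokenizer vocab
print(pad_vocab(129280))  # 129280: the 671B size is already divisible by 8
print(129280 - 128815)    # 465: the extra rows added by syncing to the 671B size
```

Syncing to 129,280 gets both properties at once: consistency across model variants and 8-aligned embedding rows.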
This lets users run successfully and avoids the crash from the cudaErrorAssert.