Feat: SGL backend for online SD training #564

h-guo18 · 2025-11-14T22:30:37Z

What does this PR do?

Type of change: new feature

Overview:

Adds a new trainer for online training that supports several base model backends.
- The base model can now run on SGLang or on HF with TP, while the student model trains with DDP on a separate set of devices.
- The trainer is based on the earlier experimental PR352.
- The SGL wrapper is based on specforge PR239.
Other improvements:
- Move train_acc.item() out of the eagle forward pass to avoid a CUDA graph break under torch.compile.
- Optionally turn off the AR validation progress bar to avoid log overflow.
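
The train_acc.item() change can be illustrated with a minimal sketch. It assumes the accuracy is computed as a 0-dim tensor inside the forward pass; the function name `forward_no_sync` and the toy tensors are hypothetical and are not taken from the PR's code. The idea is that the forward returns a tensor, and the tensor-to-Python conversion happens in the training loop, outside any region handed to torch.compile.

```python
import torch


def forward_no_sync(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Return the accuracy as a 0-dim tensor. Calling .item() here would copy
    # the value to the host inside the forward pass, which is what the PR
    # moves out of the eagle forward.
    return (logits.argmax(dim=-1) == labels).float().mean()


# Toy batch: 4 samples, 2 classes (illustrative values only).
logits = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = torch.tensor([1, 0, 0, 0])

# In a real run the forward would typically be wrapped with torch.compile;
# it is called eagerly here so the sketch runs on any machine with PyTorch.
train_acc = forward_no_sync(logits, labels)

# The host sync happens here, in the training loop, after the forward returns.
train_acc_value = train_acc.item()
print(f"train_acc={train_acc_value:.2f}")
```

Keeping the conversion in the loop also makes it easy to log only every N steps, so the sync cost is paid less often.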

Usage

export NCCL_CUMEM_ENABLE=1;
python train.py  \
 --out_path $OUT \
 --data_path $DATA \
 --model_path $MODEL \
 --teacher_backend <"sglang" or "hf"> \
 --teacher_devices 0,1,2,3 \
 --student_devices 4,5,6,7 \
 --teacher_ep_size 1  #sglang backend only
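
As a rough sketch of how the flags above could be parsed, the following uses argparse with the flag names from the usage command. Everything else is an assumption made for illustration: the defaults, the helper names `parse_devices` and `parse_args`, and the two validation checks (disjoint device sets, EP size restricted to the sglang backend). The actual train.py in this PR may handle these differently.

```python
import argparse


def parse_devices(text: str) -> list[int]:
    """Turn a comma-separated string such as "0,1,2,3" into [0, 1, 2, 3]."""
    return [int(part) for part in text.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the usage command; defaults are assumptions.
    parser = argparse.ArgumentParser(description="Online SD training (sketch)")
    parser.add_argument("--out_path", required=True)
    parser.add_argument("--data_path", required=True)
    parser.add_argument("--model_path", required=True)
    parser.add_argument("--teacher_backend", choices=["sglang", "hf"], default="sglang")
    parser.add_argument("--teacher_devices", type=parse_devices, default=[0])
    parser.add_argument("--student_devices", type=parse_devices, default=[1])
    parser.add_argument("--teacher_ep_size", type=int, default=1)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed check: the base model and the student run on different devices.
    overlap = set(args.teacher_devices) & set(args.student_devices)
    if overlap:
        parser.error(f"teacher and student devices overlap: {sorted(overlap)}")
    # Assumed check: the usage comment marks EP size as sglang-only.
    if args.teacher_backend != "sglang" and args.teacher_ep_size != 1:
        parser.error("--teacher_ep_size only applies to the sglang backend")
    return args


demo = parse_args([
    "--out_path", "out", "--data_path", "data", "--model_path", "model",
    "--teacher_backend", "sglang",
    "--teacher_devices", "0,1,2,3",
    "--student_devices", "4,5,6,7",
])
print(demo.teacher_devices, demo.student_devices)
```

Failing early on overlapping device lists surfaces a misconfiguration before any model is loaded.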

Testing

Parallelism

Tested base model TP on the HF backend.
Tested base model TP and EP on the SGLang backend.

Training Quality Test

Compared the previous HF trainer, the new trainer with the HF backend, and the new trainer with the SGL backend.
Setting: Llama3.2-1B, magpie, bs=8, lr=1e-4, seqlen=1k.

The slight differences are a result of:
- a different random seed for data loading;
- the SGL backend using a bf16 lm head instead of fp32.
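
To give a feel for why a bf16 lm head yields slightly different logits, here is a small standard-library sketch that emulates bfloat16 rounding (round-to-nearest-even on the upper 16 bits of a float32). The helper `to_bf16` and the toy vectors are hypothetical and only illustrate the precision effect; they do not reproduce the PR's measurements.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to the nearest bfloat16, returned as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even before dropping the low 16 bits.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]


# One toy row of an lm head applied to one hidden state.
hidden = [0.3, -1.7, 2.2, 0.05]
weight = [1.006, 0.503, -0.251, 3.14]

# Reference logit in Python's native double precision.
logit_full = sum(h * w for h, w in zip(hidden, weight))

# Same dot product with inputs and output rounded to bf16.
logit_bf16 = to_bf16(sum(to_bf16(h) * to_bf16(w) for h, w in zip(hidden, weight)))

print(f"full={logit_full:.6f} bf16={logit_bf16:.6f} diff={abs(logit_full - logit_bf16):.6f}")
```

bf16 keeps about 8 bits of significand, so each logit carries a small rounding error that can shift the loss and accuracy curves slightly.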

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed.
Is this change backward compatible?: Yes/No
Did you write any new necessary tests?: Yes/No
Did you add or update any necessary documentation?: Yes/No
Did you update Changelog?: Yes/No

Additional Information

copy-pr-bot · 2025-11-14T22:30:41Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot · 2025-11-17T00:40:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

codecov · 2025-11-17T01:35:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.45%. Comparing base (1aaa77d) to head (70515dc).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #564   +/-   ##
=======================================
  Coverage   74.45%   74.45%           
=======================================
  Files         182      182           
  Lines       18250    18250           
=======================================
  Hits        13588    13588           
  Misses       4662     4662


examples/speculative_decoding/train.py

yeyu-nvidia · 2025-11-18T18:39:37Z

examples/speculative_decoding/trainer/sgl_wrapper.py

+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# MIT License


Is this code borrowed from OSS? If so, you will need to follow the OSS code procedure: file a ticket and get approval.

yeyu-nvidia · 2025-11-18T18:41:41Z

According to the figures, there is a big accuracy/loss gap between the SGL trainer and our current trainer. We need to find the cause before this can be merged. Also, since this PR may fundamentally change our training accuracy, we need to train models larger than Llama 1B.

Signed-off-by: h-guo18 <[email protected]>

h-guo18 force-pushed the haoguo/sgl-backend branch 2 times, most recently from f5bb1ae to 8f00ea5 Compare November 17, 2025 01:22

h-guo18 self-assigned this Nov 17, 2025

h-guo18 changed the title ~~Feat: SGL backend for SD training~~ Feat: SGL backend for online SD training Nov 17, 2025

h-guo18 marked this pull request as ready for review November 17, 2025 02:57

h-guo18 requested a review from a team as a code owner November 17, 2025 02:57

h-guo18 requested review from ChenhanYu and yeyu-nvidia November 17, 2025 02:57

yeyu-nvidia reviewed Nov 18, 2025

examples/speculative_decoding/train.py (outdated)

yeyu-nvidia reviewed Nov 18, 2025

This comment was marked as resolved.

h-guo18 added 3 commits November 20, 2025 23:11

squash: new trainer with HF and SGL backend

1df1bd0

Signed-off-by: h-guo18 <[email protected]>

add license

6df419f

Signed-off-by: h-guo18 <[email protected]>

debug:sgl backend; use torchrun

a22a948

Signed-off-by: h-guo18 <[email protected]>

h-guo18 force-pushed the haoguo/sgl-backend branch from 6ea8d57 to a22a948 Compare November 20, 2025 23:11

h-guo18 requested a review from yeyu-nvidia November 20, 2025 23:41

minor fix

70515dc

Signed-off-by: h-guo18 <[email protected]>
