-
Notifications
You must be signed in to change notification settings - Fork 611
Titan changes to use DCP ZOC instead of titan default Async + Pinned Memory #1287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly look good. We should have a benchmark to understand the performance impact. I will expect DCP version should be faster (I didn't optimize the original logic well).
|
Synced offline, we can land this PR after 1) performance is on par or better, 2) loss curve matches the training without any checkpointing. |
Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287 Differential Revision: D72391401
Summary: Pull Request resolved: #156207 This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287 Differential Revision: D72391401
Summary: X-link: meta-pytorch/tnt#1010 This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287 Add new UT Reviewed By: diego-urgell Differential Revision: D72391401
Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287 Differential Revision: D72391401 Pull Request resolved: #156207 Approved by: https://github.com/teja-rao
|
@pytorchbot rebase |
b655098 to
0d583e4
Compare
fegin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…Memory (#1287) Test Titan changes to use DCP ZOC instead of titan default ### Loss Curve DCP + ZOC. Traing with DCP ZOC Until Step 500. Delete Last Checkpoint (since it's sync save). Then Run Load and Run Training Until Step 1000 <img width="1126" alt="Screenshot 2025-06-25 at 3 59 54 PM" src="https://github.com/user-attachments/assets/d5c7986d-9a03-402f-a664-239807285117" /> ### Loss Curve TorchTitan Async + Pinned Memory Without DCP ZOC <img width="936" alt="Screenshot 2025-06-25 at 4 05 50 PM" src="https://github.com/user-attachments/assets/0f616460-076b-4ef9-ba53-e0e285c030e7" /> ### DCP + Titan Training on LLAMA3 8B Model For 500 Steps ``` CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh + NGPU=8 + export LOG_RANK=0 + LOG_RANK=0 + CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml + overrides= + '[' 0 -ne 0 ']' + TORCHFT_LIGHTHOUSE=http://localhost:29510 + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + TORCHFT_LIGHTHOUSE=http://localhost:29510 + torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] ***************************************** W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] ***************************************** [rank0]:[titan] 2025-06-25 12:55:00,591 - root - INFO - Starting job: Llama 3 8B training [rank0]:[titan] 2025-06-25 12:55:02,714 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:[titan] 2025-06-25 12:55:02,716 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8] [rank0]:[titan] 2025-06-25 12:55:02,717 - root - INFO - [GC] Initial GC collection. 0.00 seconds. [rank0]:NCCL version 2.27.3+cuda12.4 [rank0]:[rank0]:[W625 12:55:10.376686804 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 12:55:11,222 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001 [rank0]:[titan] 2025-06-25 12:55:11,223 - root - INFO - Preparing c4 dataset from allenai/c4 [rank0]:[titan] 2025-06-25 12:55:17,789 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001) [rank0]:[titan] 2025-06-25 12:55:17,927 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1255 [rank0]:[titan] 2025-06-25 12:55:17,928 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory [rank0]:[titan] 2025-06-25 12:55:17,943 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-25 12:55:17,944 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-25 12:55:18,010 - root - INFO - Applied FSDP to the model [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[rank0]:[W625 12:55:18.048917586 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 500 (warmup 200). [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Training starts at step 1. [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-25 12:55:24,823 - root - INFO - step: 1 loss: 12.2478 memory: 42.16GiB(44.38%) tps: 1,191 tflops: 68.97 mfu: 6.97% [rank0]:[titan] 2025-06-25 12:55:24,824 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 12:55:24,842 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.04 seconds. [rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[0] 2025-06-25 12:55:32,097 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[W625 12:55:32.282661499 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Checkpoint background process is running... [rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Waiting for checkpoint save request... [rank0]:[0] 2025-06-25 12:55:37,346 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-1 [rank0]:[titan] 2025-06-25 12:55:55,589 - root - INFO - step: 10 loss: 9.9493 memory: 49.66GiB(52.28%) tps: 2,397 tflops: 138.80 mfu: 14.03% [rank0]:[titan] 2025-06-25 12:56:09,436 - root - INFO - step: 20 loss: 8.5802 memory: 49.66GiB(52.28%) tps: 5,917 tflops: 342.67 mfu: 34.65% [rank0]:[titan] 2025-06-25 12:56:23,343 - root - INFO - step: 30 loss: 7.7218 memory: 49.66GiB(52.28%) tps: 5,892 tflops: 341.21 mfu: 34.50% [rank0]:[titan] 2025-06-25 12:56:37,275 - root - INFO - step: 40 loss: 7.3309 memory: 49.66GiB(52.28%) tps: 5,881 tflops: 340.60 mfu: 34.44% [rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-1', uuid='320c8c66-1115-416f-a22a-50d86e094b3a') [rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 12:56:49,706 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 12:56:51,223 - root - INFO - step: 50 loss: 7.1024 memory: 49.66GiB(52.28%) tps: 5,874 tflops: 340.20 mfu: 34.40% [rank0]:[titan] 2025-06-25 12:57:05,194 - root - INFO - step: 60 loss: 6.8512 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.61 mfu: 34.34% [rank0]:[titan] 2025-06-25 12:57:19,180 - root - INFO - step: 70 loss: 6.9118 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.30 mfu: 34.31% [rank0]:[titan] 2025-06-25 12:57:33,158 - root - INFO - step: 80 loss: 6.6411 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-25 12:57:47,135 - root - INFO - step: 90 loss: 6.6820 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.50 mfu: 34.33% [rank0]:[titan] 2025-06-25 12:58:00,303 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - step: 100 loss: 6.4539 memory: 49.66GiB(52.28%) tps: 5,555 tflops: 321.72 mfu: 32.53% [rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 12:58:01,884 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 12:58:01,904 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 12:58:02,647 - root - INFO - Dumping profiler traces at step 100 [rank0]:[titan] 2025-06-25 12:58:02,863 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[0] 2025-06-25 12:58:03,725 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-100 [rank0]:[titan] 2025-06-25 12:58:17,301 - root - INFO - step: 110 loss: 6.6926 memory: 49.66GiB(52.28%) tps: 5,314 tflops: 307.75 mfu: 31.12% [rank0]:[titan] 2025-06-25 12:58:31,306 - root - INFO - step: 120 loss: 6.3697 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.81 mfu: 34.26% [rank0]:[titan] 2025-06-25 12:58:45,299 - root - INFO - step: 130 loss: 6.4435 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-25 12:58:59,293 - root - INFO - step: 140 loss: 6.3808 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.05 mfu: 34.28% [rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-100', uuid='324a8712-2df4-4b32-ae9f-eb26516ce3ac') [rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 12:59:11,719 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 12:59:13,294 - root - INFO - step: 150 loss: 6.0557 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.92 mfu: 34.27% [rank0]:[titan] 2025-06-25 12:59:27,291 - root - INFO - step: 160 loss: 6.3298 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.01 mfu: 34.28% [rank0]:[titan] 2025-06-25 12:59:41,291 - root - INFO - step: 170 loss: 6.0259 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.92 mfu: 34.27% [rank0]:[titan] 2025-06-25 12:59:55,283 - root - INFO - step: 180 loss: 6.0728 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.13 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:00:09,269 - root - INFO - step: 190 loss: 6.7391 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.26 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:00:21,734 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - step: 200 loss: 6.1409 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.79 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:00:23,275 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:00:23,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:00:23,801 - root - INFO - Dumping profiler traces at step 200 [rank0]:[titan] 2025-06-25 13:00:24,009 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[0] 2025-06-25 13:00:24,833 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-200 [rank0]:[titan] 2025-06-25 13:00:38,511 - root - INFO - step: 210 loss: 6.0412 memory: 49.66GiB(52.28%) tps: 5,377 tflops: 311.41 mfu: 31.49% [rank0]:[titan] 2025-06-25 13:00:52,498 - root - INFO - step: 220 loss: 6.0333 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.22 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:06,488 - root - INFO - step: 230 loss: 5.9720 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.18 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:20,477 - root - INFO - step: 240 loss: 5.8235 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.21 mfu: 34.30% [rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-200', uuid='9a1e278a-8372-4bf5-b7f7-a7fd2b80cc5e') [rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:01:32,911 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds. [rank0]:[titan] 2025-06-25 13:01:34,464 - root - INFO - step: 250 loss: 5.8334 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:49,456 - root - INFO - step: 260 loss: 5.7444 memory: 49.66GiB(52.28%) tps: 5,465 tflops: 316.51 mfu: 32.00% [rank0]:[titan] 2025-06-25 13:02:03,466 - root - INFO - step: 270 loss: 5.7170 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.69 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:02:17,468 - root - INFO - step: 280 loss: 5.6201 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.87 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:02:31,468 - root - INFO - step: 290 loss: 5.7282 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.93 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:02:43,898 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:02:45,474 - root - INFO - step: 300 loss: 5.7546 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.76 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:02:45,475 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:02:45,476 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:02:45,496 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:02:46,012 - root - INFO - Dumping profiler traces at step 300 [rank0]:[titan] 2025-06-25 13:02:46,220 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[0] 2025-06-25 13:02:47,106 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-300 [rank0]:[titan] 2025-06-25 13:03:00,674 - root - INFO - step: 310 loss: 5.5701 memory: 49.66GiB(52.28%) tps: 5,390 tflops: 312.17 mfu: 31.56% [rank0]:[titan] 2025-06-25 13:03:14,668 - root - INFO - step: 320 loss: 5.5060 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.09 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:03:28,667 - root - INFO - step: 330 loss: 5.4723 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.96 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:03:42,657 - root - INFO - step: 340 loss: 5.5342 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.18 mfu: 34.29% [rank0]:[0] 2025-06-25 13:03:45,717 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-300', uuid='c7b82f66-2746-49c4-b626-f527beec83cb') [rank0]:[0] 2025-06-25 13:03:45,718 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:03:55,085 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:03:56,649 - root - INFO - step: 350 loss: 5.5262 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:04:10,642 - root - INFO - step: 360 loss: 5.4645 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:04:24,639 - root - INFO - step: 370 loss: 5.3705 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.02 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:04:38,629 - root - INFO - step: 380 loss: 5.3711 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.18 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:04:52,611 - root - INFO - step: 390 loss: 5.2803 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.37 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:05:05,085 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 13:05:06,613 - root - INFO - step: 400 loss: 5.4034 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:05:06,614 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:05:06,615 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:05:06,642 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.03 seconds. [rank0]:[titan] 2025-06-25 13:05:07,153 - root - INFO - Dumping profiler traces at step 400 [rank0]:[titan] 2025-06-25 13:05:07,351 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:05:08,248 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-400 [rank0]:[titan] 2025-06-25 13:05:21,858 - root - INFO - step: 410 loss: 5.2031 memory: 49.66GiB(52.28%) tps: 5,374 tflops: 311.24 mfu: 31.47% [rank0]:[titan] 2025-06-25 13:05:35,829 - root - INFO - step: 420 loss: 5.3476 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.65 mfu: 34.34% [rank0]:[titan] 2025-06-25 13:05:49,793 - root - INFO - step: 430 loss: 5.2579 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.80 mfu: 34.36% [rank0]:[titan] 2025-06-25 13:06:03,755 - root - INFO - step: 440 loss: 5.2553 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.85 mfu: 34.36% [rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-400', uuid='22d8ae0c-f9c7-4424-ac4b-ab4fdaa4655f') [rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:06:16,214 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 13:06:17,743 - root - INFO - step: 450 loss: 5.1627 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:06:31,746 - root - INFO - step: 460 loss: 5.1879 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:06:45,740 - root - INFO - step: 470 loss: 5.0814 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.07 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:06:59,746 - root - INFO - step: 480 loss: 5.0594 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.80 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:07:13,748 - root - INFO - step: 490 loss: 5.1015 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.88 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:07:26,234 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:07:27,814 - root - INFO - step: 500 loss: 5.1793 memory: 49.66GiB(52.28%) tps: 5,824 tflops: 337.30 mfu: 34.11% [rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving a full checkpoint at last step, step 500. [rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.69 seconds. [rank0]:[titan] 2025-06-25 13:08:18,133 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Finished dumping profiler traces in 0.24 seconds [rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-25 13:08:20,369 - root - INFO - Training completed [rank0]:[titan] 2025-06-25 13:08:20,370 - root - INFO - Destroying the purge thread. [rank0]:[titan] 2025-06-25 13:08:24,141 - root - INFO - Process group destroyed. ``` ### Delete Checkpoint at 500th Step (Since this was sync saved). Run LLAMA3 8B model training to 1000 steps ``` CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh + NGPU=8 + export LOG_RANK=0 + LOG_RANK=0 + CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml + overrides= + '[' 0 -ne 0 ']' + TORCHFT_LIGHTHOUSE=http://localhost:29510 + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + TORCHFT_LIGHTHOUSE=http://localhost:29510 + torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] ***************************************** W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] ***************************************** [rank0]:[titan] 2025-06-25 13:10:27,930 - root - INFO - Starting job: Llama 3 8B training [rank0]:[titan] 2025-06-25 13:10:32,443 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:[titan] 2025-06-25 13:10:32,445 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8] [rank0]:[titan] 2025-06-25 13:10:32,447 - root - INFO - [GC] Initial GC collection. 0.00 seconds. [rank0]:NCCL version 2.27.3+cuda12.4 [rank0]:[rank0]:[W625 13:10:37.522041468 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001 [rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - Preparing c4 dataset from allenai/c4 [rank0]:[titan] 2025-06-25 13:10:43,788 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001) [rank0]:[titan] 2025-06-25 13:10:43,930 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1310 [rank0]:[titan] 2025-06-25 13:10:43,931 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory [rank0]:[titan] 2025-06-25 13:10:43,946 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-25 13:10:43,947 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-25 13:10:44,010 - root - INFO - Applied FSDP to the model [rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[rank0]:[W625 13:10:45.257529860 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200). [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Loading the checkpoint at step 400. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - [GC] GC collection for checkpoint loading. 0.03 seconds. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Finished loading the checkpoint in 11.31 seconds. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Training starts at step 401. [rank0]:[titan] 2025-06-25 13:10:56,822 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-25 13:11:13,774 - root - INFO - step: 410 loss: 5.2593 memory: 49.66GiB(52.28%) tps: 2,746 tflops: 159.06 mfu: 16.08% [rank0]:[titan] 2025-06-25 13:11:27,654 - root - INFO - step: 420 loss: 5.4461 memory: 49.66GiB(52.28%) tps: 5,903 tflops: 341.86 mfu: 34.57% [rank0]:[titan] 2025-06-25 13:11:41,595 - root - INFO - step: 430 loss: 5.3764 memory: 49.66GiB(52.28%) tps: 5,877 tflops: 340.34 mfu: 34.41% [rank0]:[titan] 2025-06-25 13:11:55,562 - root - INFO - step: 440 loss: 5.3640 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.74 mfu: 34.35% [rank0]:[titan] 2025-06-25 13:12:07,987 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:12:09,550 - root - INFO - step: 450 loss: 5.2764 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:12:23,550 - root - INFO - step: 460 loss: 5.2871 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.91 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:12:37,552 - root - INFO - step: 470 loss: 5.1846 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.90 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:12:51,565 - root - INFO - step: 480 loss: 5.1532 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.61 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:13:05,570 - root - INFO - step: 490 loss: 5.2000 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.82 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:13:19,252 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:13:20,824 - root - INFO - step: 500 loss: 5.2609 memory: 49.66GiB(52.28%) tps: 5,371 tflops: 311.03 mfu: 31.45% [rank0]:[titan] 2025-06-25 13:13:20,825 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:13:20,826 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:13:20,844 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:13:21,785 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-25 13:13:22,009 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[0] 2025-06-25 13:13:30,315 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process [rank0]:[W625 13:13:30.320876720 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Checkpoint background process is running... [rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Waiting for checkpoint save request... [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[0] 2025-06-25 13:13:35,906 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-500 [rank0]:[titan] 2025-06-25 13:13:55,786 - root - INFO - step: 510 loss: 5.0734 memory: 49.66GiB(52.28%) tps: 2,343 tflops: 135.71 mfu: 13.72% [rank0]:[titan] 2025-06-25 13:14:09,750 - root - INFO - step: 520 loss: 5.2781 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.80 mfu: 34.36% [rank0]:[titan] 2025-06-25 13:14:23,740 - root - INFO - step: 530 loss: 5.1984 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.15 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:14:37,748 - root - INFO - step: 540 loss: 5.0945 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.73 mfu: 34.25% [rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-500', uuid='851c2b13-e42f-4630-9a50-edb23b6a52ac') [rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:14:50,219 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-25 13:14:51,767 - root - INFO - step: 550 loss: 5.1208 memory: 49.66GiB(52.28%) tps: 5,844 tflops: 338.47 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:15:05,767 - root - INFO - step: 560 loss: 5.1587 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.93 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:15:19,775 - root - INFO - step: 570 loss: 4.9740 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.73 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:15:33,792 - root - INFO - step: 580 loss: 5.0386 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.50 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:15:47,811 - root - INFO - step: 590 loss: 5.0741 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.48 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:16:00,259 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - step: 600 loss: 4.9624 memory: 49.66GiB(52.28%) tps: 5,844 tflops: 338.43 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:16:01,833 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:16:01,853 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:16:02,368 - root - INFO - Dumping profiler traces at step 600 [rank0]:[titan] 2025-06-25 13:16:02,566 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:16:03,466 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-600 [rank0]:[titan] 2025-06-25 13:16:17,066 - root - INFO - step: 610 loss: 5.0695 memory: 49.66GiB(52.28%) tps: 5,378 tflops: 311.46 mfu: 31.49% [rank0]:[titan] 2025-06-25 13:16:31,050 - root - INFO - step: 620 loss: 5.0808 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:16:45,041 - root - INFO - step: 630 loss: 4.9225 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.14 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:16:59,065 - root - INFO - step: 640 loss: 4.8939 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.36 mfu: 34.21% [rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-600', uuid='05657add-86bb-4707-b3d6-3494d586c34d') [rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:17:11,847 - root - INFO - [GC] Peforming periodical GC collection. 0.23 seconds. [rank0]:[titan] 2025-06-25 13:17:13,311 - root - INFO - step: 650 loss: 5.0611 memory: 49.66GiB(52.28%) tps: 5,751 tflops: 333.05 mfu: 33.68% [rank0]:[titan] 2025-06-25 13:17:27,609 - root - INFO - step: 660 loss: 4.9348 memory: 49.66GiB(52.28%) tps: 5,731 tflops: 331.88 mfu: 33.56% [rank0]:[titan] 2025-06-25 13:17:41,811 - root - INFO - step: 670 loss: 4.8798 memory: 49.66GiB(52.28%) tps: 5,769 tflops: 334.13 mfu: 33.78% [rank0]:[titan] 2025-06-25 13:17:56,001 - root - INFO - step: 680 loss: 4.9749 memory: 49.66GiB(52.28%) tps: 5,774 tflops: 334.40 mfu: 33.81% [rank0]:[titan] 2025-06-25 13:18:10,004 - root - INFO - step: 690 loss: 4.7231 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:18:22,447 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:18:24,013 - root - INFO - step: 700 loss: 4.8765 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.69 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:18:24,014 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:18:24,015 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:18:24,034 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:18:24,557 - root - INFO - Dumping profiler traces at step 700 [rank0]:[titan] 2025-06-25 13:18:24,762 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:18:25,616 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-700 [rank0]:[titan] 2025-06-25 13:18:39,233 - root - INFO - step: 710 loss: 4.8973 memory: 49.66GiB(52.28%) tps: 5,383 tflops: 311.76 mfu: 31.52% [rank0]:[titan] 2025-06-25 13:18:53,245 - root - INFO - step: 720 loss: 4.8671 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.67 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:19:07,252 - root - INFO - step: 730 loss: 5.1143 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.76 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:19:21,270 - root - INFO - step: 740 loss: 4.9568 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.51 mfu: 34.23% [rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-700', uuid='40580106-cb24-4788-a99d-b10ff483e779') [rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:19:33,755 - root - INFO - [GC] Peforming periodical GC collection. 0.11 seconds. [rank0]:[titan] 2025-06-25 13:19:35,288 - root - INFO - step: 750 loss: 4.6943 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.49 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:19:49,292 - root - INFO - step: 760 loss: 4.7441 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:20:03,668 - root - INFO - step: 770 loss: 4.7597 memory: 49.66GiB(52.28%) tps: 5,699 tflops: 330.05 mfu: 33.37% [rank0]:[titan] 2025-06-25 13:20:18,428 - root - INFO - step: 780 loss: 4.7751 memory: 49.66GiB(52.28%) tps: 5,551 tflops: 321.47 mfu: 32.50% [rank0]:[titan] 2025-06-25 13:20:32,447 - root - INFO - step: 790 loss: 4.7187 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.49 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:20:44,878 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - step: 800 loss: 4.7215 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.65 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:20:46,459 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:20:46,479 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:20:47,058 - root - INFO - Dumping profiler traces at step 800 [rank0]:[titan] 2025-06-25 13:20:47,260 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:20:48,131 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-800 [rank0]:[titan] 2025-06-25 13:21:01,733 - root - INFO - step: 810 loss: 4.7147 memory: 49.66GiB(52.28%) tps: 5,364 tflops: 310.63 mfu: 31.41% [rank0]:[titan] 2025-06-25 13:21:15,731 - root - INFO - step: 820 loss: 4.6326 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.97 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:21:29,735 - root - INFO - step: 830 loss: 4.7056 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:21:43,727 - root - INFO - step: 840 loss: 4.6464 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.13 mfu: 34.29% [rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-800', uuid='a5a2aa11-d40c-4d6e-8f58-a21181f08eaf') [rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:21:56,225 - root - INFO - [GC] Peforming periodical GC collection. 0.13 seconds. [rank0]:[titan] 2025-06-25 13:21:57,742 - root - INFO - step: 850 loss: 4.8118 memory: 49.66GiB(52.28%) tps: 5,846 tflops: 338.58 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:22:11,742 - root - INFO - step: 860 loss: 4.6158 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.94 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:22:25,735 - root - INFO - step: 870 loss: 4.5595 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.12 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:22:39,726 - root - INFO - step: 880 loss: 4.4991 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.17 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:22:53,710 - root - INFO - step: 890 loss: 4.7063 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.34 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:23:06,134 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - step: 900 loss: 4.7203 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.74 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:23:07,718 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:23:07,738 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:23:08,289 - root - INFO - Dumping profiler traces at step 900 [rank0]:[titan] 2025-06-25 13:23:08,491 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:23:09,410 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-900 [rank0]:[titan] 2025-06-25 13:23:22,968 - root - INFO - step: 910 loss: 4.4820 memory: 49.66GiB(52.28%) tps: 5,372 tflops: 311.11 mfu: 31.46% [rank0]:[titan] 2025-06-25 13:23:36,961 - root - INFO - step: 920 loss: 4.6184 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:23:50,963 - root - INFO - step: 930 loss: 4.6420 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.90 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:24:04,987 - root - INFO - step: 940 loss: 4.5612 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.35 mfu: 34.21% [rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-900', uuid='13493b71-2fef-4543-94dc-65dd4c30d299') [rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:24:17,438 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds. [rank0]:[titan] 2025-06-25 13:24:18,991 - root - INFO - step: 950 loss: 4.6412 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.84 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:24:33,004 - root - INFO - step: 960 loss: 4.5748 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.60 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:24:47,007 - root - INFO - step: 970 loss: 4.6774 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:25:01,005 - root - INFO - step: 980 loss: 4.5435 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.99 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:25:15,028 - root - INFO - step: 990 loss: 4.6884 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.36 mfu: 34.21% [rank0]:[titan] 2025-06-25 13:25:27,494 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - step: 1000 loss: 4.6401 memory: 49.66GiB(52.28%) tps: 5,828 tflops: 337.50 mfu: 34.13% [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving a full checkpoint at last step, step 1000. [rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.21 seconds. [rank0]:[titan] 2025-06-25 13:26:18,296 - root - INFO - Checkpointer is deleting ./outputs/checkpoint/step-1. [rank0]:[titan] 2025-06-25 13:26:19,208 - root - INFO - Dumping profiler traces at step 1000 [rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-25 13:26:21,065 - root - INFO - Checkpointer deleted ./outputs/checkpoint/step-1 in 2.77 seconds. [rank0]:[titan] 2025-06-25 13:26:21,433 - root - INFO - Training completed [rank0]:[titan] 2025-06-25 13:26:21,434 - root - INFO - Destroying the purge thread. [rank0]:[titan] 2025-06-25 13:26:28,980 - root - INFO - Process group destroyed. ``` ### TorchTitan Training (Async+ Pinned Mem) Without DCP for 1000 steps on LLAMA3 8B Model ``` [rank0]:[titan] 2025-06-16 22:26:05,616 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-16 22:26:05,617 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-16 22:26:05,680 - root - INFO - Applied FSDP to the model [rank0]:[titan] 2025-06-16 22:26:05,942 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-16 22:26:05,943 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200). [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Training starts at step 1. [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - step: 1 loss: 12.2552 memory: 42.16GiB(44.38%) tps: 1,512 tflops: 87.59 mfu: 8.86% [rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[titan] 2025-06-16 22:26:24,113 - root - INFO - step: 10 loss: 10.2261 memory: 49.66GiB(52.28%) tps: 5,638 tflops: 326.50 mfu: 33.01% [rank0]:[titan] 2025-06-16 22:26:37,933 - root - INFO - step: 20 loss: 8.5252 memory: 49.66GiB(52.28%) tps: 5,928 tflops: 343.32 mfu: 34.71% [rank0]:[titan] 2025-06-16 22:26:51,817 - root - INFO - step: 30 loss: 7.7284 memory: 49.66GiB(52.28%) tps: 5,902 tflops: 341.78 mfu: 34.56% [rank0]:[titan] 2025-06-16 22:27:05,726 - root - INFO - step: 40 loss: 7.3651 memory: 49.66GiB(52.28%) tps: 5,890 tflops: 341.14 mfu: 34.49% [rank0]:[titan] 2025-06-16 22:27:18,093 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-16 22:27:19,646 - root - INFO - step: 50 loss: 7.1575 memory: 49.66GiB(52.28%) tps: 5,886 tflops: 340.87 mfu: 34.47% [rank0]:[titan] 2025-06-16 22:27:33,572 - root - INFO - step: 60 loss: 6.8193 memory: 49.66GiB(52.28%) tps: 5,883 tflops: 340.72 mfu: 34.45% [rank0]:[titan] 2025-06-16 22:27:47,509 - root - INFO - step: 70 loss: 6.9294 memory: 49.66GiB(52.28%) tps: 5,879 tflops: 340.47 mfu: 34.43% [rank0]:[titan] 2025-06-16 22:28:01,442 - root - INFO - step: 80 loss: 6.6544 memory: 49.66GiB(52.28%) tps: 5,880 tflops: 340.56 mfu: 34.44% [rank0]:[titan] 2025-06-16 22:28:15,387 - root - INFO - step: 90 loss: 6.6736 memory: 49.66GiB(52.28%) tps: 5,875 tflops: 340.27 mfu: 34.41% [rank0]:[titan] 2025-06-16 22:28:28,168 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:28:29,755 - root - INFO - step: 100 loss: 6.4444 memory: 49.66GiB(52.28%) tps: 5,702 tflops: 330.24 mfu: 33.39% [rank0]:[titan] 2025-06-16 22:28:30,372 - root - INFO - Dumping profiler traces at step 100 [rank0]:[titan] 2025-06-16 22:28:30,573 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:28:44,564 - root - INFO - step: 110 loss: 6.6835 memory: 49.66GiB(52.28%) tps: 5,532 tflops: 320.39 mfu: 32.40% [rank0]:[titan] 2025-06-16 22:28:58,531 - root - INFO - step: 120 loss: 6.3605 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.72 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:29:12,500 - root - INFO - step: 130 loss: 6.4930 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.69 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:29:26,476 - root - INFO - step: 140 loss: 6.4257 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.53 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:29:38,860 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:29:40,448 - root - INFO - step: 150 loss: 6.0684 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.61 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:29:54,415 - root - INFO - step: 160 loss: 6.3202 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.72 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:30:08,386 - root - INFO - step: 170 loss: 5.9970 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.64 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:30:22,357 - root - INFO - step: 180 loss: 6.0888 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.65 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:30:36,333 - root - INFO - step: 190 loss: 6.7146 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.52 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:30:48,735 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:30:50,322 - root - INFO - step: 200 loss: 6.1384 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:30:50,758 - root - INFO - Dumping profiler traces at step 200 [rank0]:[titan] 2025-06-16 22:30:50,976 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-16 22:31:04,970 - root - INFO - step: 210 loss: 6.0388 memory: 49.66GiB(52.28%) tps: 5,593 tflops: 323.92 mfu: 32.75% [rank0]:[titan] 2025-06-16 22:31:18,934 - root - INFO - step: 220 loss: 6.0348 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.82 mfu: 34.36% [rank0]:[titan] 2025-06-16 22:31:32,895 - root - INFO - step: 230 loss: 5.9670 memory: 49.66GiB(52.28%) tps: 5,869 tflops: 339.88 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:31:46,869 - root - INFO - step: 240 loss: 5.8311 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.59 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:31:59,261 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:32:00,846 - root - INFO - step: 250 loss: 5.8383 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.48 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:32:15,834 - root - INFO - step: 260 loss: 5.7541 memory: 49.66GiB(52.28%) tps: 5,467 tflops: 316.60 mfu: 32.01% [rank0]:[titan] 2025-06-16 22:32:29,803 - root - INFO - step: 270 loss: 5.7202 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.66 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:32:43,772 - root - INFO - step: 280 loss: 5.6326 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.71 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:32:57,734 - root - INFO - step: 290 loss: 5.7467 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.85 mfu: 34.36% [rank0]:[titan] 2025-06-16 22:33:10,148 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-16 22:33:11,718 - root - INFO - step: 300 loss: 5.7704 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:33:12,173 - root - INFO - Dumping profiler traces at step 300 [rank0]:[titan] 2025-06-16 22:33:12,371 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:33:26,359 - root - INFO - step: 310 loss: 5.6014 memory: 49.66GiB(52.28%) tps: 5,596 tflops: 324.09 mfu: 32.77% [rank0]:[titan] 2025-06-16 22:33:40,337 - root - INFO - step: 320 loss: 5.5262 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.47 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:33:54,315 - root - INFO - step: 330 loss: 5.4930 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.45 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:34:08,294 - root - INFO - step: 340 loss: 5.5682 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.44 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:34:20,693 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:34:22,279 - root - INFO - step: 350 loss: 5.5599 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.32 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:34:36,250 - root - INFO - step: 360 loss: 5.5070 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.63 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:34:50,242 - root - INFO - step: 370 loss: 5.4143 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:35:04,230 - root - INFO - step: 380 loss: 5.4226 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:35:18,206 - root - INFO - step: 390 loss: 5.3208 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.55 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:35:30,661 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:35:32,251 - root - INFO - step: 400 loss: 5.4494 memory: 49.66GiB(52.28%) tps: 5,833 tflops: 337.83 mfu: 34.16% [rank0]:[titan] 2025-06-16 22:35:32,714 - root - INFO - Dumping profiler traces at step 400 [rank0]:[titan] 2025-06-16 22:35:32,916 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:35:46,913 - root - INFO - step: 410 loss: 5.2551 memory: 49.66GiB(52.28%) tps: 5,588 tflops: 323.63 mfu: 32.72% [rank0]:[titan] 2025-06-16 22:36:00,884 - root - INFO - step: 420 loss: 5.4090 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.64 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:36:14,870 - root - INFO - step: 430 loss: 5.3344 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:36:28,864 - root - INFO - step: 440 loss: 5.3106 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:36:41,255 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:36:42,843 - root - INFO - step: 450 loss: 5.2298 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.47 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:36:56,829 - root - INFO - step: 460 loss: 5.2365 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.26 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:37:10,821 - root - INFO - step: 470 loss: 5.1352 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.15 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:37:24,810 - root - INFO - step: 480 loss: 5.1063 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.20 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:37:38,791 - root - INFO - step: 490 loss: 5.1559 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.40 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:37:51,229 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-16 22:37:52,788 - root - INFO - step: 500 loss: 5.2053 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 339.00 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:37:53,282 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-16 22:37:53,479 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:38:07,509 - root - INFO - step: 510 loss: 5.0252 memory: 49.66GiB(52.28%) tps: 5,565 tflops: 322.31 mfu: 32.59% [rank0]:[titan] 2025-06-16 22:38:22,243 - root - INFO - step: 520 loss: 5.2369 memory: 49.66GiB(52.28%) tps: 5,561 tflops: 322.06 mfu: 32.56% [rank0]:[titan] 2025-06-16 22:38:36,217 - root - INFO - step: 530 loss: 5.1380 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.56 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:38:50,196 - root - INFO - step: 540 loss: 5.0538 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.44 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:39:02,608 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:39:04,186 - root - INFO - step: 550 loss: 5.0703 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:39:18,173 - root - INFO - step: 560 loss: 5.1104 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:39:32,166 - root - INFO - step: 570 loss: 4.9368 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.12 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:39:46,148 - root - INFO - step: 580 loss: 5.0037 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.38 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:40:00,139 - root - INFO - step: 590 loss: 5.0303 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:40:12,566 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:40:14,145 - root - INFO - step: 600 loss: 4.9233 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.78 mfu: 34.25% [rank0]:[titan] 2025-06-16 22:40:14,671 - root - INFO - Dumping profiler traces at step 600 [rank0]:[titan] 2025-06-16 22:40:14,871 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:40:28,831 - root - INFO - step: 610 loss: 5.0338 memory: 49.66GiB(52.28%) tps: 5,579 tflops: 323.10 mfu: 32.67% [rank0]:[titan] 2025-06-16 22:40:42,800 - root - INFO - step: 620 loss: 5.0437 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.68 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:40:56,779 - root - INFO - step: 630 loss: 4.8868 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:41:10,768 - root - INFO - step: 640 loss: 4.8542 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.19 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:41:23,176 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:41:24,756 - root - INFO - step: 650 loss: 5.0227 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:41:38,741 - root - INFO - step: 660 loss: 4.8901 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.32 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:41:52,723 - root - INFO - step: 670 loss: 4.8374 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.36 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:06,707 - root - INFO - step: 680 loss: 4.9357 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.33 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:20,691 - root - INFO - step: 690 loss: 4.6904 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.33 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:33,106 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:42:34,686 - root - INFO - step: 700 loss: 4.8426 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.02 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:42:35,202 - root - INFO - Dumping profiler traces at step 700 [rank0]:[titan] 2025-06-16 22:42:35,411 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:42:49,414 - root - INFO - step: 710 loss: 4.8630 memory: 49.66GiB(52.28%) tps: 5,563 tflops: 322.17 mfu: 32.58% [rank0]:[titan] 2025-06-16 22:43:03,389 - root - INFO - step: 720 loss: 4.8307 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.54 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:43:17,366 - root - INFO - step: 730 loss: 5.0704 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.48 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:43:31,346 - root - INFO - step: 740 loss: 4.9245 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.40 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:43:43,762 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-16 22:43:45,331 - root - INFO - step: 750 loss: 4.6614 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:43:59,308 - root - INFO - step: 760 loss: 4.7145 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.50 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:44:13,608 - root - INFO - step: 770 loss: 4.7226 memory: 49.66GiB(52.28%) tps: 5,730 tflops: 331.83 mfu: 33.55% [rank0]:[titan] 2025-06-16 22:44:28,628 - root - INFO - step: 780 loss: 4.7430 memory: 49.66GiB(52.28%) tps: 5,455 tflops: 315.93 mfu: 31.94% [rank0]:[titan] 2025-06-16 22:44:42,611 - root - INFO - step: 790 loss: 4.6875 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.35 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:44:55,024 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:44:56,605 - root - INFO - step: 800 loss: 4.6901 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.06 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:44:57,138 - root - INFO - Dumping profiler traces at step 800 [rank0]:[titan] 2025-06-16 22:44:57,344 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:45:11,360 - root - INFO - step: 810 loss: 4.6838 memory: 49.66GiB(52.28%) tps: 5,552 tflops: 321.56 mfu: 32.51% [rank0]:[titan] 2025-06-16 22:45:25,305 - root - INFO - step: 820 loss: 4.6000 memory: 49.66GiB(52.28%) tps: 5,876 tflops: 340.28 mfu: 34.41% [rank0]:[titan] 2025-06-16 22:45:39,251 - root - INFO - step: 830 loss: 4.6733 memory: 49.66GiB(52.28%) tps: 5,875 tflops: 340.25 mfu: 34.40% [rank0]:[titan] 2025-06-16 22:45:53,208 - root - INFO - step: 840 loss: 4.6178 memory: 49.66GiB(52.28%) tps: 5,870 tflops: 339.98 mfu: 34.38% [rank0]:[titan] 2025-06-16 22:46:05,611 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:46:07,194 - root - INFO - step: 850 loss: 4.7753 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:46:21,172 - root - INFO - step: 860 loss: 4.5818 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.45 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:46:35,159 - root - INFO - step: 870 loss: 4.5332 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:46:49,143 - root - INFO - step: 880 loss: 4.4733 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.34 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:47:03,125 - root - INFO - step: 890 loss: 4.6788 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.38 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:47:15,614 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:47:17,195 - root - INFO - step: 900 loss: 4.6943 memory: 49.66GiB(52.28%) tps: 5,823 tflops: 337.23 mfu: 34.10% [rank0]:[titan] 2025-06-16 22:47:17,789 - root - INFO - Dumping profiler traces at step 900 [rank0]:[titan] 2025-06-16 22:47:18,004 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-16 22:47:31,972 - root - INFO - step: 910 loss: 4.4559 memory: 49.66GiB(52.28%) tps: 5,544 tflops: 321.10 mfu: 32.47% [rank0]:[titan] 2025-06-16 22:47:45,950 - root - INFO - step: 920 loss: 4.5920 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:47:59,923 - root - INFO - step: 930 loss: 4.6219 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.58 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:48:13,889 - root - INFO - step: 940 loss: 4.5429 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.76 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:48:26,271 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:48:27,849 - root - INFO - step: 950 loss: 4.6160 memory: 49.66GiB(52.28%) tps: 5,870 tflops: 339.93 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:48:41,809 - root - INFO - step: 960 loss: 4.5495 memory: 49.66GiB(52.28%) tps: 5,869 tflops: 339.91 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:48:55,764 - root - INFO - step: 970 loss: 4.6507 memory: 49.66GiB(52.28%) tps: 5,871 tflops: 340.03 mfu: 34.38% [rank0]:[titan] 2025-06-16 22:49:09,734 - root - INFO - step: 980 loss: 4.5222 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.66 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:49:23,717 - root - INFO - step: 990 loss: 4.6672 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.37 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:49:36,204 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-16 22:49:37,745 - root - INFO - step: 1000 loss: 4.6165 memory: 49.66GiB(52.28%) tps: 5,840 tflops: 338.24 mfu: 34.20% [rank0]:[titan] 2025-06-16 22:49:38,344 - root - INFO - Dumping profiler traces at step 1000 [rank0]:[titan] 2025-06-16 22:49:38,552 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:49:38,553 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-16 22:49:40,553 - root - INFO - Training completed [rank0]:[titan] 2025-06-16 22:49:41,030 - root - INFO - Process group destroyed. ``` --------- Co-authored-by: saiteja64 <[email protected]>
…Memory (pytorch#1287) Test Titan changes to use DCP ZOC instead of titan default ### Loss Curve DCP + ZOC. Traing with DCP ZOC Until Step 500. Delete Last Checkpoint (since it's sync save). Then Run Load and Run Training Until Step 1000 <img width="1126" alt="Screenshot 2025-06-25 at 3 59 54 PM" src="https://github.com/user-attachments/assets/d5c7986d-9a03-402f-a664-239807285117" /> ### Loss Curve TorchTitan Async + Pinned Memory Without DCP ZOC <img width="936" alt="Screenshot 2025-06-25 at 4 05 50 PM" src="https://github.com/user-attachments/assets/0f616460-076b-4ef9-ba53-e0e285c030e7" /> ### DCP + Titan Training on LLAMA3 8B Model For 500 Steps ``` CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh + NGPU=8 + export LOG_RANK=0 + LOG_RANK=0 + CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml + overrides= + '[' 0 -ne 0 ']' + TORCHFT_LIGHTHOUSE=http://localhost:29510 + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + TORCHFT_LIGHTHOUSE=http://localhost:29510 + torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] ***************************************** W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] ***************************************** [rank0]:[titan] 2025-06-25 12:55:00,591 - root - INFO - Starting job: Llama 3 8B training [rank0]:[titan] 2025-06-25 12:55:02,714 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:[titan] 2025-06-25 12:55:02,716 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8] [rank0]:[titan] 2025-06-25 12:55:02,717 - root - INFO - [GC] Initial GC collection. 0.00 seconds. [rank0]:NCCL version 2.27.3+cuda12.4 [rank0]:[rank0]:[W625 12:55:10.376686804 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 12:55:11,222 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001 [rank0]:[titan] 2025-06-25 12:55:11,223 - root - INFO - Preparing c4 dataset from allenai/c4 [rank0]:[titan] 2025-06-25 12:55:17,789 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001) [rank0]:[titan] 2025-06-25 12:55:17,927 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1255 [rank0]:[titan] 2025-06-25 12:55:17,928 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory [rank0]:[titan] 2025-06-25 12:55:17,943 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-25 12:55:17,944 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-25 12:55:18,010 - root - INFO - Applied FSDP to the model [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[rank0]:[W625 12:55:18.048917586 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 500 (warmup 200). [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Training starts at step 1. [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-25 12:55:24,823 - root - INFO - step: 1 loss: 12.2478 memory: 42.16GiB(44.38%) tps: 1,191 tflops: 68.97 mfu: 6.97% [rank0]:[titan] 2025-06-25 12:55:24,824 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 12:55:24,842 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.04 seconds. [rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[0] 2025-06-25 12:55:32,097 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[W625 12:55:32.282661499 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Checkpoint background process is running... [rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Waiting for checkpoint save request... [rank0]:[0] 2025-06-25 12:55:37,346 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-1 [rank0]:[titan] 2025-06-25 12:55:55,589 - root - INFO - step: 10 loss: 9.9493 memory: 49.66GiB(52.28%) tps: 2,397 tflops: 138.80 mfu: 14.03% [rank0]:[titan] 2025-06-25 12:56:09,436 - root - INFO - step: 20 loss: 8.5802 memory: 49.66GiB(52.28%) tps: 5,917 tflops: 342.67 mfu: 34.65% [rank0]:[titan] 2025-06-25 12:56:23,343 - root - INFO - step: 30 loss: 7.7218 memory: 49.66GiB(52.28%) tps: 5,892 tflops: 341.21 mfu: 34.50% [rank0]:[titan] 2025-06-25 12:56:37,275 - root - INFO - step: 40 loss: 7.3309 memory: 49.66GiB(52.28%) tps: 5,881 tflops: 340.60 mfu: 34.44% [rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-1', uuid='320c8c66-1115-416f-a22a-50d86e094b3a') [rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 12:56:49,706 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 12:56:51,223 - root - INFO - step: 50 loss: 7.1024 memory: 49.66GiB(52.28%) tps: 5,874 tflops: 340.20 mfu: 34.40% [rank0]:[titan] 2025-06-25 12:57:05,194 - root - INFO - step: 60 loss: 6.8512 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.61 mfu: 34.34% [rank0]:[titan] 2025-06-25 12:57:19,180 - root - INFO - step: 70 loss: 6.9118 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.30 mfu: 34.31% [rank0]:[titan] 2025-06-25 12:57:33,158 - root - INFO - step: 80 loss: 6.6411 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-25 12:57:47,135 - root - INFO - step: 90 loss: 6.6820 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.50 mfu: 34.33% [rank0]:[titan] 2025-06-25 12:58:00,303 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - step: 100 loss: 6.4539 memory: 49.66GiB(52.28%) tps: 5,555 tflops: 321.72 mfu: 32.53% [rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 12:58:01,884 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 12:58:01,904 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 12:58:02,647 - root - INFO - Dumping profiler traces at step 100 [rank0]:[titan] 2025-06-25 12:58:02,863 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[0] 2025-06-25 12:58:03,725 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-100 [rank0]:[titan] 2025-06-25 12:58:17,301 - root - INFO - step: 110 loss: 6.6926 memory: 49.66GiB(52.28%) tps: 5,314 tflops: 307.75 mfu: 31.12% [rank0]:[titan] 2025-06-25 12:58:31,306 - root - INFO - step: 120 loss: 6.3697 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.81 mfu: 34.26% [rank0]:[titan] 2025-06-25 12:58:45,299 - root - INFO - step: 130 loss: 6.4435 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-25 12:58:59,293 - root - INFO - step: 140 loss: 6.3808 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.05 mfu: 34.28% [rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-100', uuid='324a8712-2df4-4b32-ae9f-eb26516ce3ac') [rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 12:59:11,719 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 12:59:13,294 - root - INFO - step: 150 loss: 6.0557 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.92 mfu: 34.27% [rank0]:[titan] 2025-06-25 12:59:27,291 - root - INFO - step: 160 loss: 6.3298 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.01 mfu: 34.28% [rank0]:[titan] 2025-06-25 12:59:41,291 - root - INFO - step: 170 loss: 6.0259 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.92 mfu: 34.27% [rank0]:[titan] 2025-06-25 12:59:55,283 - root - INFO - step: 180 loss: 6.0728 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.13 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:00:09,269 - root - INFO - step: 190 loss: 6.7391 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.26 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:00:21,734 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - step: 200 loss: 6.1409 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.79 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:00:23,275 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:00:23,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:00:23,801 - root - INFO - Dumping profiler traces at step 200 [rank0]:[titan] 2025-06-25 13:00:24,009 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[0] 2025-06-25 13:00:24,833 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-200 [rank0]:[titan] 2025-06-25 13:00:38,511 - root - INFO - step: 210 loss: 6.0412 memory: 49.66GiB(52.28%) tps: 5,377 tflops: 311.41 mfu: 31.49% [rank0]:[titan] 2025-06-25 13:00:52,498 - root - INFO - step: 220 loss: 6.0333 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.22 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:06,488 - root - INFO - step: 230 loss: 5.9720 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.18 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:20,477 - root - INFO - step: 240 loss: 5.8235 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.21 mfu: 34.30% [rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-200', uuid='9a1e278a-8372-4bf5-b7f7-a7fd2b80cc5e') [rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:01:32,911 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds. [rank0]:[titan] 2025-06-25 13:01:34,464 - root - INFO - step: 250 loss: 5.8334 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:49,456 - root - INFO - step: 260 loss: 5.7444 memory: 49.66GiB(52.28%) tps: 5,465 tflops: 316.51 mfu: 32.00% [rank0]:[titan] 2025-06-25 13:02:03,466 - root - INFO - step: 270 loss: 5.7170 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.69 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:02:17,468 - root - INFO - step: 280 loss: 5.6201 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.87 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:02:31,468 - root - INFO - step: 290 loss: 5.7282 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.93 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:02:43,898 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:02:45,474 - root - INFO - step: 300 loss: 5.7546 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.76 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:02:45,475 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:02:45,476 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:02:45,496 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:02:46,012 - root - INFO - Dumping profiler traces at step 300 [rank0]:[titan] 2025-06-25 13:02:46,220 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[0] 2025-06-25 13:02:47,106 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-300 [rank0]:[titan] 2025-06-25 13:03:00,674 - root - INFO - step: 310 loss: 5.5701 memory: 49.66GiB(52.28%) tps: 5,390 tflops: 312.17 mfu: 31.56% [rank0]:[titan] 2025-06-25 13:03:14,668 - root - INFO - step: 320 loss: 5.5060 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.09 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:03:28,667 - root - INFO - step: 330 loss: 5.4723 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.96 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:03:42,657 - root - INFO - step: 340 loss: 5.5342 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.18 mfu: 34.29% [rank0]:[0] 2025-06-25 13:03:45,717 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-300', uuid='c7b82f66-2746-49c4-b626-f527beec83cb') [rank0]:[0] 2025-06-25 13:03:45,718 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:03:55,085 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:03:56,649 - root - INFO - step: 350 loss: 5.5262 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:04:10,642 - root - INFO - step: 360 loss: 5.4645 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:04:24,639 - root - INFO - step: 370 loss: 5.3705 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.02 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:04:38,629 - root - INFO - step: 380 loss: 5.3711 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.18 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:04:52,611 - root - INFO - step: 390 loss: 5.2803 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.37 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:05:05,085 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 13:05:06,613 - root - INFO - step: 400 loss: 5.4034 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:05:06,614 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:05:06,615 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:05:06,642 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.03 seconds. [rank0]:[titan] 2025-06-25 13:05:07,153 - root - INFO - Dumping profiler traces at step 400 [rank0]:[titan] 2025-06-25 13:05:07,351 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:05:08,248 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-400 [rank0]:[titan] 2025-06-25 13:05:21,858 - root - INFO - step: 410 loss: 5.2031 memory: 49.66GiB(52.28%) tps: 5,374 tflops: 311.24 mfu: 31.47% [rank0]:[titan] 2025-06-25 13:05:35,829 - root - INFO - step: 420 loss: 5.3476 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.65 mfu: 34.34% [rank0]:[titan] 2025-06-25 13:05:49,793 - root - INFO - step: 430 loss: 5.2579 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.80 mfu: 34.36% [rank0]:[titan] 2025-06-25 13:06:03,755 - root - INFO - step: 440 loss: 5.2553 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.85 mfu: 34.36% [rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-400', uuid='22d8ae0c-f9c7-4424-ac4b-ab4fdaa4655f') [rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:06:16,214 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 13:06:17,743 - root - INFO - step: 450 loss: 5.1627 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:06:31,746 - root - INFO - step: 460 loss: 5.1879 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:06:45,740 - root - INFO - step: 470 loss: 5.0814 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.07 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:06:59,746 - root - INFO - step: 480 loss: 5.0594 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.80 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:07:13,748 - root - INFO - step: 490 loss: 5.1015 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.88 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:07:26,234 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:07:27,814 - root - INFO - step: 500 loss: 5.1793 memory: 49.66GiB(52.28%) tps: 5,824 tflops: 337.30 mfu: 34.11% [rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving a full checkpoint at last step, step 500. [rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.69 seconds. [rank0]:[titan] 2025-06-25 13:08:18,133 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Finished dumping profiler traces in 0.24 seconds [rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-25 13:08:20,369 - root - INFO - Training completed [rank0]:[titan] 2025-06-25 13:08:20,370 - root - INFO - Destroying the purge thread. [rank0]:[titan] 2025-06-25 13:08:24,141 - root - INFO - Process group destroyed. ``` ### Delete Checkpoint at 500th Step (Since this was sync saved). Run LLAMA3 8B model training to 1000 steps ``` CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh + NGPU=8 + export LOG_RANK=0 + LOG_RANK=0 + CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml + overrides= + '[' 0 -ne 0 ']' + TORCHFT_LIGHTHOUSE=http://localhost:29510 + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + TORCHFT_LIGHTHOUSE=http://localhost:29510 + torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] ***************************************** W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] ***************************************** [rank0]:[titan] 2025-06-25 13:10:27,930 - root - INFO - Starting job: Llama 3 8B training [rank0]:[titan] 2025-06-25 13:10:32,443 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:[titan] 2025-06-25 13:10:32,445 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8] [rank0]:[titan] 2025-06-25 13:10:32,447 - root - INFO - [GC] Initial GC collection. 0.00 seconds. [rank0]:NCCL version 2.27.3+cuda12.4 [rank0]:[rank0]:[W625 13:10:37.522041468 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001 [rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - Preparing c4 dataset from allenai/c4 [rank0]:[titan] 2025-06-25 13:10:43,788 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001) [rank0]:[titan] 2025-06-25 13:10:43,930 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1310 [rank0]:[titan] 2025-06-25 13:10:43,931 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory [rank0]:[titan] 2025-06-25 13:10:43,946 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-25 13:10:43,947 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-25 13:10:44,010 - root - INFO - Applied FSDP to the model [rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[rank0]:[W625 13:10:45.257529860 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200). [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Loading the checkpoint at step 400. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - [GC] GC collection for checkpoint loading. 0.03 seconds. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Finished loading the checkpoint in 11.31 seconds. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Training starts at step 401. [rank0]:[titan] 2025-06-25 13:10:56,822 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-25 13:11:13,774 - root - INFO - step: 410 loss: 5.2593 memory: 49.66GiB(52.28%) tps: 2,746 tflops: 159.06 mfu: 16.08% [rank0]:[titan] 2025-06-25 13:11:27,654 - root - INFO - step: 420 loss: 5.4461 memory: 49.66GiB(52.28%) tps: 5,903 tflops: 341.86 mfu: 34.57% [rank0]:[titan] 2025-06-25 13:11:41,595 - root - INFO - step: 430 loss: 5.3764 memory: 49.66GiB(52.28%) tps: 5,877 tflops: 340.34 mfu: 34.41% [rank0]:[titan] 2025-06-25 13:11:55,562 - root - INFO - step: 440 loss: 5.3640 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.74 mfu: 34.35% [rank0]:[titan] 2025-06-25 13:12:07,987 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:12:09,550 - root - INFO - step: 450 loss: 5.2764 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:12:23,550 - root - INFO - step: 460 loss: 5.2871 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.91 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:12:37,552 - root - INFO - step: 470 loss: 5.1846 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.90 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:12:51,565 - root - INFO - step: 480 loss: 5.1532 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.61 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:13:05,570 - root - INFO - step: 490 loss: 5.2000 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.82 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:13:19,252 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:13:20,824 - root - INFO - step: 500 loss: 5.2609 memory: 49.66GiB(52.28%) tps: 5,371 tflops: 311.03 mfu: 31.45% [rank0]:[titan] 2025-06-25 13:13:20,825 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:13:20,826 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:13:20,844 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:13:21,785 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-25 13:13:22,009 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[0] 2025-06-25 13:13:30,315 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process [rank0]:[W625 13:13:30.320876720 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Checkpoint background process is running... [rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Waiting for checkpoint save request... [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[0] 2025-06-25 13:13:35,906 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-500 [rank0]:[titan] 2025-06-25 13:13:55,786 - root - INFO - step: 510 loss: 5.0734 memory: 49.66GiB(52.28%) tps: 2,343 tflops: 135.71 mfu: 13.72% [rank0]:[titan] 2025-06-25 13:14:09,750 - root - INFO - step: 520 loss: 5.2781 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.80 mfu: 34.36% [rank0]:[titan] 2025-06-25 13:14:23,740 - root - INFO - step: 530 loss: 5.1984 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.15 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:14:37,748 - root - INFO - step: 540 loss: 5.0945 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.73 mfu: 34.25% [rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-500', uuid='851c2b13-e42f-4630-9a50-edb23b6a52ac') [rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:14:50,219 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-25 13:14:51,767 - root - INFO - step: 550 loss: 5.1208 memory: 49.66GiB(52.28%) tps: 5,844 tflops: 338.47 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:15:05,767 - root - INFO - step: 560 loss: 5.1587 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.93 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:15:19,775 - root - INFO - step: 570 loss: 4.9740 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.73 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:15:33,792 - root - INFO - step: 580 loss: 5.0386 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.50 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:15:47,811 - root - INFO - step: 590 loss: 5.0741 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.48 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:16:00,259 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - step: 600 loss: 4.9624 memory: 49.66GiB(52.28%) tps: 5,844 tflops: 338.43 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:16:01,833 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:16:01,853 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:16:02,368 - root - INFO - Dumping profiler traces at step 600 [rank0]:[titan] 2025-06-25 13:16:02,566 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:16:03,466 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-600 [rank0]:[titan] 2025-06-25 13:16:17,066 - root - INFO - step: 610 loss: 5.0695 memory: 49.66GiB(52.28%) tps: 5,378 tflops: 311.46 mfu: 31.49% [rank0]:[titan] 2025-06-25 13:16:31,050 - root - INFO - step: 620 loss: 5.0808 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:16:45,041 - root - INFO - step: 630 loss: 4.9225 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.14 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:16:59,065 - root - INFO - step: 640 loss: 4.8939 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.36 mfu: 34.21% [rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-600', uuid='05657add-86bb-4707-b3d6-3494d586c34d') [rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:17:11,847 - root - INFO - [GC] Peforming periodical GC collection. 0.23 seconds. [rank0]:[titan] 2025-06-25 13:17:13,311 - root - INFO - step: 650 loss: 5.0611 memory: 49.66GiB(52.28%) tps: 5,751 tflops: 333.05 mfu: 33.68% [rank0]:[titan] 2025-06-25 13:17:27,609 - root - INFO - step: 660 loss: 4.9348 memory: 49.66GiB(52.28%) tps: 5,731 tflops: 331.88 mfu: 33.56% [rank0]:[titan] 2025-06-25 13:17:41,811 - root - INFO - step: 670 loss: 4.8798 memory: 49.66GiB(52.28%) tps: 5,769 tflops: 334.13 mfu: 33.78% [rank0]:[titan] 2025-06-25 13:17:56,001 - root - INFO - step: 680 loss: 4.9749 memory: 49.66GiB(52.28%) tps: 5,774 tflops: 334.40 mfu: 33.81% [rank0]:[titan] 2025-06-25 13:18:10,004 - root - INFO - step: 690 loss: 4.7231 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:18:22,447 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:18:24,013 - root - INFO - step: 700 loss: 4.8765 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.69 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:18:24,014 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:18:24,015 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:18:24,034 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:18:24,557 - root - INFO - Dumping profiler traces at step 700 [rank0]:[titan] 2025-06-25 13:18:24,762 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:18:25,616 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-700 [rank0]:[titan] 2025-06-25 13:18:39,233 - root - INFO - step: 710 loss: 4.8973 memory: 49.66GiB(52.28%) tps: 5,383 tflops: 311.76 mfu: 31.52% [rank0]:[titan] 2025-06-25 13:18:53,245 - root - INFO - step: 720 loss: 4.8671 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.67 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:19:07,252 - root - INFO - step: 730 loss: 5.1143 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.76 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:19:21,270 - root - INFO - step: 740 loss: 4.9568 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.51 mfu: 34.23% [rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-700', uuid='40580106-cb24-4788-a99d-b10ff483e779') [rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:19:33,755 - root - INFO - [GC] Peforming periodical GC collection. 0.11 seconds. [rank0]:[titan] 2025-06-25 13:19:35,288 - root - INFO - step: 750 loss: 4.6943 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.49 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:19:49,292 - root - INFO - step: 760 loss: 4.7441 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:20:03,668 - root - INFO - step: 770 loss: 4.7597 memory: 49.66GiB(52.28%) tps: 5,699 tflops: 330.05 mfu: 33.37% [rank0]:[titan] 2025-06-25 13:20:18,428 - root - INFO - step: 780 loss: 4.7751 memory: 49.66GiB(52.28%) tps: 5,551 tflops: 321.47 mfu: 32.50% [rank0]:[titan] 2025-06-25 13:20:32,447 - root - INFO - step: 790 loss: 4.7187 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.49 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:20:44,878 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - step: 800 loss: 4.7215 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.65 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:20:46,459 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:20:46,479 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:20:47,058 - root - INFO - Dumping profiler traces at step 800 [rank0]:[titan] 2025-06-25 13:20:47,260 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:20:48,131 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-800 [rank0]:[titan] 2025-06-25 13:21:01,733 - root - INFO - step: 810 loss: 4.7147 memory: 49.66GiB(52.28%) tps: 5,364 tflops: 310.63 mfu: 31.41% [rank0]:[titan] 2025-06-25 13:21:15,731 - root - INFO - step: 820 loss: 4.6326 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.97 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:21:29,735 - root - INFO - step: 830 loss: 4.7056 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:21:43,727 - root - INFO - step: 840 loss: 4.6464 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.13 mfu: 34.29% [rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-800', uuid='a5a2aa11-d40c-4d6e-8f58-a21181f08eaf') [rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:21:56,225 - root - INFO - [GC] Peforming periodical GC collection. 0.13 seconds. [rank0]:[titan] 2025-06-25 13:21:57,742 - root - INFO - step: 850 loss: 4.8118 memory: 49.66GiB(52.28%) tps: 5,846 tflops: 338.58 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:22:11,742 - root - INFO - step: 860 loss: 4.6158 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.94 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:22:25,735 - root - INFO - step: 870 loss: 4.5595 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.12 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:22:39,726 - root - INFO - step: 880 loss: 4.4991 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.17 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:22:53,710 - root - INFO - step: 890 loss: 4.7063 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.34 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:23:06,134 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - step: 900 loss: 4.7203 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.74 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:23:07,718 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:23:07,738 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:23:08,289 - root - INFO - Dumping profiler traces at step 900 [rank0]:[titan] 2025-06-25 13:23:08,491 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:23:09,410 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-900 [rank0]:[titan] 2025-06-25 13:23:22,968 - root - INFO - step: 910 loss: 4.4820 memory: 49.66GiB(52.28%) tps: 5,372 tflops: 311.11 mfu: 31.46% [rank0]:[titan] 2025-06-25 13:23:36,961 - root - INFO - step: 920 loss: 4.6184 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:23:50,963 - root - INFO - step: 930 loss: 4.6420 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.90 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:24:04,987 - root - INFO - step: 940 loss: 4.5612 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.35 mfu: 34.21% [rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-900', uuid='13493b71-2fef-4543-94dc-65dd4c30d299') [rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:24:17,438 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds. [rank0]:[titan] 2025-06-25 13:24:18,991 - root - INFO - step: 950 loss: 4.6412 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.84 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:24:33,004 - root - INFO - step: 960 loss: 4.5748 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.60 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:24:47,007 - root - INFO - step: 970 loss: 4.6774 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:25:01,005 - root - INFO - step: 980 loss: 4.5435 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.99 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:25:15,028 - root - INFO - step: 990 loss: 4.6884 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.36 mfu: 34.21% [rank0]:[titan] 2025-06-25 13:25:27,494 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - step: 1000 loss: 4.6401 memory: 49.66GiB(52.28%) tps: 5,828 tflops: 337.50 mfu: 34.13% [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving a full checkpoint at last step, step 1000. [rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.21 seconds. [rank0]:[titan] 2025-06-25 13:26:18,296 - root - INFO - Checkpointer is deleting ./outputs/checkpoint/step-1. [rank0]:[titan] 2025-06-25 13:26:19,208 - root - INFO - Dumping profiler traces at step 1000 [rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-25 13:26:21,065 - root - INFO - Checkpointer deleted ./outputs/checkpoint/step-1 in 2.77 seconds. [rank0]:[titan] 2025-06-25 13:26:21,433 - root - INFO - Training completed [rank0]:[titan] 2025-06-25 13:26:21,434 - root - INFO - Destroying the purge thread. [rank0]:[titan] 2025-06-25 13:26:28,980 - root - INFO - Process group destroyed. ``` ### TorchTitan Training (Async+ Pinned Mem) Without DCP for 1000 steps on LLAMA3 8B Model ``` [rank0]:[titan] 2025-06-16 22:26:05,616 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-16 22:26:05,617 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-16 22:26:05,680 - root - INFO - Applied FSDP to the model [rank0]:[titan] 2025-06-16 22:26:05,942 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-16 22:26:05,943 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200). [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Training starts at step 1. [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - step: 1 loss: 12.2552 memory: 42.16GiB(44.38%) tps: 1,512 tflops: 87.59 mfu: 8.86% [rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[titan] 2025-06-16 22:26:24,113 - root - INFO - step: 10 loss: 10.2261 memory: 49.66GiB(52.28%) tps: 5,638 tflops: 326.50 mfu: 33.01% [rank0]:[titan] 2025-06-16 22:26:37,933 - root - INFO - step: 20 loss: 8.5252 memory: 49.66GiB(52.28%) tps: 5,928 tflops: 343.32 mfu: 34.71% [rank0]:[titan] 2025-06-16 22:26:51,817 - root - INFO - step: 30 loss: 7.7284 memory: 49.66GiB(52.28%) tps: 5,902 tflops: 341.78 mfu: 34.56% [rank0]:[titan] 2025-06-16 22:27:05,726 - root - INFO - step: 40 loss: 7.3651 memory: 49.66GiB(52.28%) tps: 5,890 tflops: 341.14 mfu: 34.49% [rank0]:[titan] 2025-06-16 22:27:18,093 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-16 22:27:19,646 - root - INFO - step: 50 loss: 7.1575 memory: 49.66GiB(52.28%) tps: 5,886 tflops: 340.87 mfu: 34.47% [rank0]:[titan] 2025-06-16 22:27:33,572 - root - INFO - step: 60 loss: 6.8193 memory: 49.66GiB(52.28%) tps: 5,883 tflops: 340.72 mfu: 34.45% [rank0]:[titan] 2025-06-16 22:27:47,509 - root - INFO - step: 70 loss: 6.9294 memory: 49.66GiB(52.28%) tps: 5,879 tflops: 340.47 mfu: 34.43% [rank0]:[titan] 2025-06-16 22:28:01,442 - root - INFO - step: 80 loss: 6.6544 memory: 49.66GiB(52.28%) tps: 5,880 tflops: 340.56 mfu: 34.44% [rank0]:[titan] 2025-06-16 22:28:15,387 - root - INFO - step: 90 loss: 6.6736 memory: 49.66GiB(52.28%) tps: 5,875 tflops: 340.27 mfu: 34.41% [rank0]:[titan] 2025-06-16 22:28:28,168 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:28:29,755 - root - INFO - step: 100 loss: 6.4444 memory: 49.66GiB(52.28%) tps: 5,702 tflops: 330.24 mfu: 33.39% [rank0]:[titan] 2025-06-16 22:28:30,372 - root - INFO - Dumping profiler traces at step 100 [rank0]:[titan] 2025-06-16 22:28:30,573 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:28:44,564 - root - INFO - step: 110 loss: 6.6835 memory: 49.66GiB(52.28%) tps: 5,532 tflops: 320.39 mfu: 32.40% [rank0]:[titan] 2025-06-16 22:28:58,531 - root - INFO - step: 120 loss: 6.3605 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.72 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:29:12,500 - root - INFO - step: 130 loss: 6.4930 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.69 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:29:26,476 - root - INFO - step: 140 loss: 6.4257 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.53 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:29:38,860 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:29:40,448 - root - INFO - step: 150 loss: 6.0684 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.61 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:29:54,415 - root - INFO - step: 160 loss: 6.3202 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.72 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:30:08,386 - root - INFO - step: 170 loss: 5.9970 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.64 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:30:22,357 - root - INFO - step: 180 loss: 6.0888 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.65 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:30:36,333 - root - INFO - step: 190 loss: 6.7146 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.52 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:30:48,735 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:30:50,322 - root - INFO - step: 200 loss: 6.1384 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:30:50,758 - root - INFO - Dumping profiler traces at step 200 [rank0]:[titan] 2025-06-16 22:30:50,976 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-16 22:31:04,970 - root - INFO - step: 210 loss: 6.0388 memory: 49.66GiB(52.28%) tps: 5,593 tflops: 323.92 mfu: 32.75% [rank0]:[titan] 2025-06-16 22:31:18,934 - root - INFO - step: 220 loss: 6.0348 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.82 mfu: 34.36% [rank0]:[titan] 2025-06-16 22:31:32,895 - root - INFO - step: 230 loss: 5.9670 memory: 49.66GiB(52.28%) tps: 5,869 tflops: 339.88 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:31:46,869 - root - INFO - step: 240 loss: 5.8311 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.59 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:31:59,261 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:32:00,846 - root - INFO - step: 250 loss: 5.8383 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.48 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:32:15,834 - root - INFO - step: 260 loss: 5.7541 memory: 49.66GiB(52.28%) tps: 5,467 tflops: 316.60 mfu: 32.01% [rank0]:[titan] 2025-06-16 22:32:29,803 - root - INFO - step: 270 loss: 5.7202 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.66 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:32:43,772 - root - INFO - step: 280 loss: 5.6326 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.71 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:32:57,734 - root - INFO - step: 290 loss: 5.7467 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.85 mfu: 34.36% [rank0]:[titan] 2025-06-16 22:33:10,148 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-16 22:33:11,718 - root - INFO - step: 300 loss: 5.7704 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:33:12,173 - root - INFO - Dumping profiler traces at step 300 [rank0]:[titan] 2025-06-16 22:33:12,371 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:33:26,359 - root - INFO - step: 310 loss: 5.6014 memory: 49.66GiB(52.28%) tps: 5,596 tflops: 324.09 mfu: 32.77% [rank0]:[titan] 2025-06-16 22:33:40,337 - root - INFO - step: 320 loss: 5.5262 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.47 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:33:54,315 - root - INFO - step: 330 loss: 5.4930 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.45 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:34:08,294 - root - INFO - step: 340 loss: 5.5682 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.44 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:34:20,693 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:34:22,279 - root - INFO - step: 350 loss: 5.5599 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.32 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:34:36,250 - root - INFO - step: 360 loss: 5.5070 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.63 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:34:50,242 - root - INFO - step: 370 loss: 5.4143 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:35:04,230 - root - INFO - step: 380 loss: 5.4226 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:35:18,206 - root - INFO - step: 390 loss: 5.3208 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.55 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:35:30,661 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:35:32,251 - root - INFO - step: 400 loss: 5.4494 memory: 49.66GiB(52.28%) tps: 5,833 tflops: 337.83 mfu: 34.16% [rank0]:[titan] 2025-06-16 22:35:32,714 - root - INFO - Dumping profiler traces at step 400 [rank0]:[titan] 2025-06-16 22:35:32,916 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:35:46,913 - root - INFO - step: 410 loss: 5.2551 memory: 49.66GiB(52.28%) tps: 5,588 tflops: 323.63 mfu: 32.72% [rank0]:[titan] 2025-06-16 22:36:00,884 - root - INFO - step: 420 loss: 5.4090 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.64 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:36:14,870 - root - INFO - step: 430 loss: 5.3344 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:36:28,864 - root - INFO - step: 440 loss: 5.3106 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:36:41,255 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:36:42,843 - root - INFO - step: 450 loss: 5.2298 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.47 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:36:56,829 - root - INFO - step: 460 loss: 5.2365 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.26 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:37:10,821 - root - INFO - step: 470 loss: 5.1352 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.15 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:37:24,810 - root - INFO - step: 480 loss: 5.1063 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.20 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:37:38,791 - root - INFO - step: 490 loss: 5.1559 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.40 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:37:51,229 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-16 22:37:52,788 - root - INFO - step: 500 loss: 5.2053 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 339.00 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:37:53,282 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-16 22:37:53,479 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:38:07,509 - root - INFO - step: 510 loss: 5.0252 memory: 49.66GiB(52.28%) tps: 5,565 tflops: 322.31 mfu: 32.59% [rank0]:[titan] 2025-06-16 22:38:22,243 - root - INFO - step: 520 loss: 5.2369 memory: 49.66GiB(52.28%) tps: 5,561 tflops: 322.06 mfu: 32.56% [rank0]:[titan] 2025-06-16 22:38:36,217 - root - INFO - step: 530 loss: 5.1380 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.56 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:38:50,196 - root - INFO - step: 540 loss: 5.0538 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.44 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:39:02,608 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:39:04,186 - root - INFO - step: 550 loss: 5.0703 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:39:18,173 - root - INFO - step: 560 loss: 5.1104 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:39:32,166 - root - INFO - step: 570 loss: 4.9368 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.12 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:39:46,148 - root - INFO - step: 580 loss: 5.0037 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.38 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:40:00,139 - root - INFO - step: 590 loss: 5.0303 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:40:12,566 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:40:14,145 - root - INFO - step: 600 loss: 4.9233 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.78 mfu: 34.25% [rank0]:[titan] 2025-06-16 22:40:14,671 - root - INFO - Dumping profiler traces at step 600 [rank0]:[titan] 2025-06-16 22:40:14,871 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:40:28,831 - root - INFO - step: 610 loss: 5.0338 memory: 49.66GiB(52.28%) tps: 5,579 tflops: 323.10 mfu: 32.67% [rank0]:[titan] 2025-06-16 22:40:42,800 - root - INFO - step: 620 loss: 5.0437 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.68 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:40:56,779 - root - INFO - step: 630 loss: 4.8868 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:41:10,768 - root - INFO - step: 640 loss: 4.8542 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.19 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:41:23,176 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:41:24,756 - root - INFO - step: 650 loss: 5.0227 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:41:38,741 - root - INFO - step: 660 loss: 4.8901 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.32 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:41:52,723 - root - INFO - step: 670 loss: 4.8374 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.36 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:06,707 - root - INFO - step: 680 loss: 4.9357 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.33 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:20,691 - root - INFO - step: 690 loss: 4.6904 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.33 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:33,106 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:42:34,686 - root - INFO - step: 700 loss: 4.8426 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.02 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:42:35,202 - root - INFO - Dumping profiler traces at step 700 [rank0]:[titan] 2025-06-16 22:42:35,411 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:42:49,414 - root - INFO - step: 710 loss: 4.8630 memory: 49.66GiB(52.28%) tps: 5,563 tflops: 322.17 mfu: 32.58% [rank0]:[titan] 2025-06-16 22:43:03,389 - root - INFO - step: 720 loss: 4.8307 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.54 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:43:17,366 - root - INFO - step: 730 loss: 5.0704 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.48 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:43:31,346 - root - INFO - step: 740 loss: 4.9245 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.40 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:43:43,762 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-16 22:43:45,331 - root - INFO - step: 750 loss: 4.6614 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:43:59,308 - root - INFO - step: 760 loss: 4.7145 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.50 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:44:13,608 - root - INFO - step: 770 loss: 4.7226 memory: 49.66GiB(52.28%) tps: 5,730 tflops: 331.83 mfu: 33.55% [rank0]:[titan] 2025-06-16 22:44:28,628 - root - INFO - step: 780 loss: 4.7430 memory: 49.66GiB(52.28%) tps: 5,455 tflops: 315.93 mfu: 31.94% [rank0]:[titan] 2025-06-16 22:44:42,611 - root - INFO - step: 790 loss: 4.6875 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.35 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:44:55,024 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:44:56,605 - root - INFO - step: 800 loss: 4.6901 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.06 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:44:57,138 - root - INFO - Dumping profiler traces at step 800 [rank0]:[titan] 2025-06-16 22:44:57,344 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:45:11,360 - root - INFO - step: 810 loss: 4.6838 memory: 49.66GiB(52.28%) tps: 5,552 tflops: 321.56 mfu: 32.51% [rank0]:[titan] 2025-06-16 22:45:25,305 - root - INFO - step: 820 loss: 4.6000 memory: 49.66GiB(52.28%) tps: 5,876 tflops: 340.28 mfu: 34.41% [rank0]:[titan] 2025-06-16 22:45:39,251 - root - INFO - step: 830 loss: 4.6733 memory: 49.66GiB(52.28%) tps: 5,875 tflops: 340.25 mfu: 34.40% [rank0]:[titan] 2025-06-16 22:45:53,208 - root - INFO - step: 840 loss: 4.6178 memory: 49.66GiB(52.28%) tps: 5,870 tflops: 339.98 mfu: 34.38% [rank0]:[titan] 2025-06-16 22:46:05,611 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:46:07,194 - root - INFO - step: 850 loss: 4.7753 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:46:21,172 - root - INFO - step: 860 loss: 4.5818 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.45 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:46:35,159 - root - INFO - step: 870 loss: 4.5332 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:46:49,143 - root - INFO - step: 880 loss: 4.4733 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.34 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:47:03,125 - root - INFO - step: 890 loss: 4.6788 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.38 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:47:15,614 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:47:17,195 - root - INFO - step: 900 loss: 4.6943 memory: 49.66GiB(52.28%) tps: 5,823 tflops: 337.23 mfu: 34.10% [rank0]:[titan] 2025-06-16 22:47:17,789 - root - INFO - Dumping profiler traces at step 900 [rank0]:[titan] 2025-06-16 22:47:18,004 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-16 22:47:31,972 - root - INFO - step: 910 loss: 4.4559 memory: 49.66GiB(52.28%) tps: 5,544 tflops: 321.10 mfu: 32.47% [rank0]:[titan] 2025-06-16 22:47:45,950 - root - INFO - step: 920 loss: 4.5920 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:47:59,923 - root - INFO - step: 930 loss: 4.6219 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.58 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:48:13,889 - root - INFO - step: 940 loss: 4.5429 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.76 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:48:26,271 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:48:27,849 - root - INFO - step: 950 loss: 4.6160 memory: 49.66GiB(52.28%) tps: 5,870 tflops: 339.93 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:48:41,809 - root - INFO - step: 960 loss: 4.5495 memory: 49.66GiB(52.28%) tps: 5,869 tflops: 339.91 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:48:55,764 - root - INFO - step: 970 loss: 4.6507 memory: 49.66GiB(52.28%) tps: 5,871 tflops: 340.03 mfu: 34.38% [rank0]:[titan] 2025-06-16 22:49:09,734 - root - INFO - step: 980 loss: 4.5222 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.66 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:49:23,717 - root - INFO - step: 990 loss: 4.6672 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.37 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:49:36,204 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-16 22:49:37,745 - root - INFO - step: 1000 loss: 4.6165 memory: 49.66GiB(52.28%) tps: 5,840 tflops: 338.24 mfu: 34.20% [rank0]:[titan] 2025-06-16 22:49:38,344 - root - INFO - Dumping profiler traces at step 1000 [rank0]:[titan] 2025-06-16 22:49:38,552 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:49:38,553 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-16 22:49:40,553 - root - INFO - Training completed [rank0]:[titan] 2025-06-16 22:49:41,030 - root - INFO - Process group destroyed. ``` --------- Co-authored-by: saiteja64 <[email protected]>
#1287 refactors async_with_pinned_mem to use DCP's implementation but that PR didn't set staging correctly. This PR fixes it.
Test Titan changes to use DCP ZOC instead of titan default
Loss Curve DCP + ZOC. Traing with DCP ZOC Until Step 500. Delete Last Checkpoint (since it's sync save). Then Run Load and Run Training Until Step 1000
Loss Curve TorchTitan Async + Pinned Memory Without DCP ZOC
DCP + Titan Training on LLAMA3 8B Model For 500 Steps
Delete Checkpoint at 500th Step (Since this was sync saved). Run LLAMA3 8B model training to 1000 steps
TorchTitan Training (Async+ Pinned Mem) Without DCP for 1000 steps on LLAMA3 8B Model