Commit 671c30c
Titan changes to use DCP ZOC instead of titan default Async + Pinned Memory (#1287)
Test Titan changes to use DCP ZOC instead of titan default
### Loss Curve DCP + ZOC. Training with DCP ZOC Until Step 500. Delete
Last Checkpoint (since it was a sync save). Then Load and Run Training
Until Step 1000
<img width="1126" alt="Screenshot 2025-06-25 at 3 59 54 PM"
src="https://github.com/user-attachments/assets/d5c7986d-9a03-402f-a664-239807285117"
/>
### Loss Curve TorchTitan Async + Pinned Memory Without DCP ZOC
<img width="936" alt="Screenshot 2025-06-25 at 4 05 50 PM"
src="https://github.com/user-attachments/assets/0f616460-076b-4ef9-ba53-e0e285c030e7"
/>
### DCP + Titan Training on LLAMA3 8B Model For 500 Steps
```
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
+ NGPU=8
+ export LOG_RANK=0
+ LOG_RANK=0
+ CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml
W0625 12:54:54.267000 3608822 torch/distributed/run.py:774]
W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] *****************************************
W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] *****************************************
[rank0]:[titan] 2025-06-25 12:55:00,591 - root - INFO - Starting job: Llama 3 8B training
[rank0]:[titan] 2025-06-25 12:55:02,714 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:[titan] 2025-06-25 12:55:02,716 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
[rank0]:[titan] 2025-06-25 12:55:02,717 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
[rank0]:NCCL version 2.27.3+cuda12.4
[rank0]:[rank0]:[W625 12:55:10.376686804 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[titan] 2025-06-25 12:55:11,222 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:[titan] 2025-06-25 12:55:11,223 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:[titan] 2025-06-25 12:55:17,789 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001)
[rank0]:[titan] 2025-06-25 12:55:17,927 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1255
[rank0]:[titan] 2025-06-25 12:55:17,928 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-06-25 12:55:17,943 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:[titan] 2025-06-25 12:55:17,944 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-25 12:55:18,010 - root - INFO - Applied FSDP to the model
[rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:[rank0]:[W625 12:55:18.048917586 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 500 (warmup 200).
[rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:[titan] 2025-06-25 12:55:24,823 - root - INFO - step: 1 loss: 12.2478 memory: 42.16GiB(44.38%) tps: 1,191 tflops: 68.97 mfu: 6.97%
[rank0]:[titan] 2025-06-25 12:55:24,824 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 12:55:24,842 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds.
[rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.04 seconds.
[rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[0] 2025-06-25 12:55:32,097 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process
[rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[rank0]:[W625 12:55:32.282661499 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Checkpoint background process is running...
[rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Waiting for checkpoint save request...
[rank0]:[0] 2025-06-25 12:55:37,346 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-1
[rank0]:[titan] 2025-06-25 12:55:55,589 - root - INFO - step: 10 loss: 9.9493 memory: 49.66GiB(52.28%) tps: 2,397 tflops: 138.80 mfu: 14.03%
[rank0]:[titan] 2025-06-25 12:56:09,436 - root - INFO - step: 20 loss: 8.5802 memory: 49.66GiB(52.28%) tps: 5,917 tflops: 342.67 mfu: 34.65%
[rank0]:[titan] 2025-06-25 12:56:23,343 - root - INFO - step: 30 loss: 7.7218 memory: 49.66GiB(52.28%) tps: 5,892 tflops: 341.21 mfu: 34.50%
[rank0]:[titan] 2025-06-25 12:56:37,275 - root - INFO - step: 40 loss: 7.3309 memory: 49.66GiB(52.28%) tps: 5,881 tflops: 340.60 mfu: 34.44%
[rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-1', uuid='320c8c66-1115-416f-a22a-50d86e094b3a')
[rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 12:56:49,706 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds.
[rank0]:[titan] 2025-06-25 12:56:51,223 - root - INFO - step: 50 loss: 7.1024 memory: 49.66GiB(52.28%) tps: 5,874 tflops: 340.20 mfu: 34.40%
[rank0]:[titan] 2025-06-25 12:57:05,194 - root - INFO - step: 60 loss: 6.8512 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.61 mfu: 34.34%
[rank0]:[titan] 2025-06-25 12:57:19,180 - root - INFO - step: 70 loss: 6.9118 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.30 mfu: 34.31%
[rank0]:[titan] 2025-06-25 12:57:33,158 - root - INFO - step: 80 loss: 6.6411 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32%
[rank0]:[titan] 2025-06-25 12:57:47,135 - root - INFO - step: 90 loss: 6.6820 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.50 mfu: 34.33%
[rank0]:[titan] 2025-06-25 12:58:00,303 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - step: 100 loss: 6.4539 memory: 49.66GiB(52.28%) tps: 5,555 tflops: 321.72 mfu: 32.53%
[rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 12:58:01,884 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 12:58:01,904 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 12:58:02,647 - root - INFO - Dumping profiler traces at step 100
[rank0]:[titan] 2025-06-25 12:58:02,863 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[0] 2025-06-25 12:58:03,725 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-100
[rank0]:[titan] 2025-06-25 12:58:17,301 - root - INFO - step: 110 loss: 6.6926 memory: 49.66GiB(52.28%) tps: 5,314 tflops: 307.75 mfu: 31.12%
[rank0]:[titan] 2025-06-25 12:58:31,306 - root - INFO - step: 120 loss: 6.3697 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.81 mfu: 34.26%
[rank0]:[titan] 2025-06-25 12:58:45,299 - root - INFO - step: 130 loss: 6.4435 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29%
[rank0]:[titan] 2025-06-25 12:58:59,293 - root - INFO - step: 140 loss: 6.3808 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.05 mfu: 34.28%
[rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-100', uuid='324a8712-2df4-4b32-ae9f-eb26516ce3ac')
[rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 12:59:11,719 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 12:59:13,294 - root - INFO - step: 150 loss: 6.0557 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.92 mfu: 34.27%
[rank0]:[titan] 2025-06-25 12:59:27,291 - root - INFO - step: 160 loss: 6.3298 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.01 mfu: 34.28%
[rank0]:[titan] 2025-06-25 12:59:41,291 - root - INFO - step: 170 loss: 6.0259 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.92 mfu: 34.27%
[rank0]:[titan] 2025-06-25 12:59:55,283 - root - INFO - step: 180 loss: 6.0728 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.13 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:00:09,269 - root - INFO - step: 190 loss: 6.7391 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.26 mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:00:21,734 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds.
[rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - step: 200 loss: 6.1409 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.79 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:00:23,275 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:00:23,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:00:23,801 - root - INFO - Dumping profiler traces at step 200
[rank0]:[titan] 2025-06-25 13:00:24,009 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[0] 2025-06-25 13:00:24,833 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-200
[rank0]:[titan] 2025-06-25 13:00:38,511 - root - INFO - step: 210 loss: 6.0412 memory: 49.66GiB(52.28%) tps: 5,377 tflops: 311.41 mfu: 31.49%
[rank0]:[titan] 2025-06-25 13:00:52,498 - root - INFO - step: 220 loss: 6.0333 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.22 mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:01:06,488 - root - INFO - step: 230 loss: 5.9720 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.18 mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:01:20,477 - root - INFO - step: 240 loss: 5.8235 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.21 mfu: 34.30%
[rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-200', uuid='9a1e278a-8372-4bf5-b7f7-a7fd2b80cc5e')
[rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:01:32,911 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds.
[rank0]:[titan] 2025-06-25 13:01:34,464 - root - INFO - step: 250 loss: 5.8334 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:01:49,456 - root - INFO - step: 260 loss: 5.7444 memory: 49.66GiB(52.28%) tps: 5,465 tflops: 316.51 mfu: 32.00%
[rank0]:[titan] 2025-06-25 13:02:03,466 - root - INFO - step: 270 loss: 5.7170 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.69 mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:02:17,468 - root - INFO - step: 280 loss: 5.6201 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.87 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:02:31,468 - root - INFO - step: 290 loss: 5.7282 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.93 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:02:43,898 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 13:02:45,474 - root - INFO - step: 300 loss: 5.7546 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.76 mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:02:45,475 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:02:45,476 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:02:45,496 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:02:46,012 - root - INFO - Dumping profiler traces at step 300
[rank0]:[titan] 2025-06-25 13:02:46,220 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[0] 2025-06-25 13:02:47,106 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-300
[rank0]:[titan] 2025-06-25 13:03:00,674 - root - INFO - step: 310 loss: 5.5701 memory: 49.66GiB(52.28%) tps: 5,390 tflops: 312.17 mfu: 31.56%
[rank0]:[titan] 2025-06-25 13:03:14,668 - root - INFO - step: 320 loss: 5.5060 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.09 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:03:28,667 - root - INFO - step: 330 loss: 5.4723 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.96 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:03:42,657 - root - INFO - step: 340 loss: 5.5342 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.18 mfu: 34.29%
[rank0]:[0] 2025-06-25 13:03:45,717 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-300', uuid='c7b82f66-2746-49c4-b626-f527beec83cb')
[rank0]:[0] 2025-06-25 13:03:45,718 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:03:55,085 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-25 13:03:56,649 - root - INFO - step: 350 loss: 5.5262 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:04:10,642 - root - INFO - step: 360 loss: 5.4645 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:04:24,639 - root - INFO - step: 370 loss: 5.3705 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.02 mfu: 34.28%
[rank0]:[titan] 2025-06-25 13:04:38,629 - root - INFO - step: 380 loss: 5.3711 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.18 mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:04:52,611 - root - INFO - step: 390 loss: 5.2803 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.37 mfu: 34.31%
[rank0]:[titan] 2025-06-25 13:05:05,085 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds.
[rank0]:[titan] 2025-06-25 13:05:06,613 - root - INFO - step: 400 loss: 5.4034 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:05:06,614 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:05:06,615 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:05:06,642 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.03 seconds.
[rank0]:[titan] 2025-06-25 13:05:07,153 - root - INFO - Dumping profiler traces at step 400
[rank0]:[titan] 2025-06-25 13:05:07,351 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:05:08,248 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-400
[rank0]:[titan] 2025-06-25 13:05:21,858 - root - INFO - step: 410 loss: 5.2031 memory: 49.66GiB(52.28%) tps: 5,374 tflops: 311.24 mfu: 31.47%
[rank0]:[titan] 2025-06-25 13:05:35,829 - root - INFO - step: 420 loss: 5.3476 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.65 mfu: 34.34%
[rank0]:[titan] 2025-06-25 13:05:49,793 - root - INFO - step: 430 loss: 5.2579 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.80 mfu: 34.36%
[rank0]:[titan] 2025-06-25 13:06:03,755 - root - INFO - step: 440 loss: 5.2553 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.85 mfu: 34.36%
[rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-400', uuid='22d8ae0c-f9c7-4424-ac4b-ab4fdaa4655f')
[rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:06:16,214 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds.
[rank0]:[titan] 2025-06-25 13:06:17,743 - root - INFO - step: 450 loss: 5.1627 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:06:31,746 - root - INFO - step: 460 loss: 5.1879 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:06:45,740 - root - INFO - step: 470 loss: 5.0814 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.07 mfu: 34.28%
[rank0]:[titan] 2025-06-25 13:06:59,746 - root - INFO - step: 480 loss: 5.0594 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.80 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:07:13,748 - root - INFO - step: 490 loss: 5.1015 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.88 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:07:26,234 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-25 13:07:27,814 - root - INFO - step: 500 loss: 5.1793 memory: 49.66GiB(52.28%) tps: 5,824 tflops: 337.30 mfu: 34.11%
[rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving a full checkpoint at last step, step 500.
[rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.69 seconds.
[rank0]:[titan] 2025-06-25 13:08:18,133 - root - INFO - Dumping profiler traces at step 500
[rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Finished dumping profiler traces in 0.24 seconds
[rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-25 13:08:20,369 - root - INFO - Training completed
[rank0]:[titan] 2025-06-25 13:08:20,370 - root - INFO - Destroying the purge thread.
[rank0]:[titan] 2025-06-25 13:08:24,141 - root - INFO - Process group destroyed.
```
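
To compare the loss values in a log like the one above against another run (for example, a run resumed from a checkpoint), the step, loss, and tps values can be extracted with a small script. This is a minimal sketch and not part of the commit: the function name `parse_step_metrics` and the regular expression are assumptions based only on the metric line format shown in the log, and the two sample lines are copied from it.

```python
import re

# Assumed format, taken from the metric lines in the log above, e.g.:
#   ... INFO - step: 410 loss: 5.2031 memory: 49.66GiB(52.28%) tps: 5,374 tflops: 311.24 mfu: 31.47%
# Lines such as "Dumping profiler traces at step 400" have no "step:" and are skipped.
STEP_RE = re.compile(
    r"step:\s*(?P<step>\d+)\s+loss:\s*(?P<loss>[\d.]+).*?tps:\s*(?P<tps>[\d,]+)"
)


def parse_step_metrics(log_text):
    """Return {step: (loss, tps)} for every metric line found in log_text."""
    metrics = {}
    for line in log_text.splitlines():
        match = STEP_RE.search(line)
        if match:
            step = int(match.group("step"))
            loss = float(match.group("loss"))
            tps = int(match.group("tps").replace(",", ""))  # "5,374" -> 5374
            metrics[step] = (loss, tps)
    return metrics


# Sample lines copied from the 500-step run above.
sample = """\
[rank0]:[titan] 2025-06-25 13:05:06,613 - root - INFO - step: 400 loss: 5.4034 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:05:07,153 - root - INFO - Dumping profiler traces at step 400
[rank0]:[titan] 2025-06-25 13:05:21,858 - root - INFO - step: 410 loss: 5.2031 memory: 49.66GiB(52.28%) tps: 5,374 tflops: 311.24 mfu: 31.47%
"""

metrics = parse_step_metrics(sample)
for step in sorted(metrics):
    loss, tps = metrics[step]
    print(f"step={step} loss={loss} tps={tps}")
```

Running the same function over two saved logs gives two dictionaries keyed by step, which makes it straightforward to line up overlapping steps from separate runs.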
### Delete Checkpoint at 500th Step (since it was a sync save). Resume
LLAMA3 8B model training from the step-400 checkpoint and run to 1000 steps
```
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
+ NGPU=8
+ export LOG_RANK=0
+ LOG_RANK=0
+ CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml
W0625 13:10:21.529000 4085013 torch/distributed/run.py:774]
W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] *****************************************
W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] *****************************************
[rank0]:[titan] 2025-06-25 13:10:27,930 - root - INFO - Starting job: Llama 3 8B training
[rank0]:[titan] 2025-06-25 13:10:32,443 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:[titan] 2025-06-25 13:10:32,445 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
[rank0]:[titan] 2025-06-25 13:10:32,447 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
[rank0]:NCCL version 2.27.3+cuda12.4
[rank0]:[rank0]:[W625 13:10:37.522041468 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:[titan] 2025-06-25 13:10:43,788 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001)
[rank0]:[titan] 2025-06-25 13:10:43,930 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1310
[rank0]:[titan] 2025-06-25 13:10:43,931 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-06-25 13:10:43,946 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:[titan] 2025-06-25 13:10:43,947 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-25 13:10:44,010 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[rank0]:[rank0]:[W625 13:10:45.257529860 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200).
[rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Loading the checkpoint at step 400.
[rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - [GC] GC collection for checkpoint loading. 0.03 seconds.
[rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Finished loading the checkpoint in 11.31 seconds.
[rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Training starts at step 401.
[rank0]:[titan] 2025-06-25 13:10:56,822 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:[titan] 2025-06-25 13:11:13,774 - root - INFO - step: 410 loss: 5.2593 memory: 49.66GiB(52.28%) tps: 2,746 tflops: 159.06 mfu: 16.08%
[rank0]:[titan] 2025-06-25 13:11:27,654 - root - INFO - step: 420 loss: 5.4461 memory: 49.66GiB(52.28%) tps: 5,903 tflops: 341.86 mfu: 34.57%
[rank0]:[titan] 2025-06-25 13:11:41,595 - root - INFO - step: 430 loss: 5.3764 memory: 49.66GiB(52.28%) tps: 5,877 tflops: 340.34 mfu: 34.41%
[rank0]:[titan] 2025-06-25 13:11:55,562 - root - INFO - step: 440 loss: 5.3640 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.74 mfu: 34.35%
[rank0]:[titan] 2025-06-25 13:12:07,987 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-25 13:12:09,550 - root - INFO - step: 450 loss: 5.2764 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.23 mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:12:23,550 - root - INFO - step: 460 loss: 5.2871 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.91 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:12:37,552 - root - INFO - step: 470 loss: 5.1846 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.90 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:12:51,565 - root - INFO - step: 480 loss: 5.1532 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.61 mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:13:05,570 - root - INFO - step: 490 loss: 5.2000 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.82 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:13:19,252 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 13:13:20,824 - root - INFO - step: 500 loss: 5.2609 memory: 49.66GiB(52.28%) tps: 5,371 tflops: 311.03 mfu: 31.45%
[rank0]:[titan] 2025-06-25 13:13:20,825 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:13:20,826 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:13:20,844 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:13:21,785 - root - INFO - Dumping profiler traces at step 500
[rank0]:[titan] 2025-06-25 13:13:22,009 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[0] 2025-06-25 13:13:30,315 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process
[rank0]:[W625 13:13:30.320876720 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Checkpoint background process is running...
[rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Waiting for checkpoint save request...
[rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[rank0]:[0] 2025-06-25 13:13:35,906 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-500
[rank0]:[titan] 2025-06-25 13:13:55,786 - root - INFO - step: 510 loss: 5.0734 memory: 49.66GiB(52.28%) tps: 2,343 tflops: 135.71 mfu: 13.72%
[rank0]:[titan] 2025-06-25 13:14:09,750 - root - INFO - step: 520 loss: 5.2781 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.80 mfu: 34.36%
[rank0]:[titan] 2025-06-25 13:14:23,740 - root - INFO - step: 530 loss: 5.1984 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.15 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:14:37,748 - root - INFO - step: 540 loss: 5.0945 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.73 mfu: 34.25%
[rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-500', uuid='851c2b13-e42f-4630-9a50-edb23b6a52ac')
[rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:14:50,219 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds.
[rank0]:[titan] 2025-06-25 13:14:51,767 - root - INFO - step: 550 loss: 5.1208 memory: 49.66GiB(52.28%) tps: 5,844 tflops: 338.47 mfu: 34.22%
[rank0]:[titan] 2025-06-25 13:15:05,767 - root - INFO - step: 560 loss: 5.1587 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.93 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:15:19,775 - root - INFO - step: 570 loss: 4.9740 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.73 mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:15:33,792 - root - INFO - step: 580 loss: 5.0386 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.50 mfu: 34.23%
[rank0]:[titan] 2025-06-25 13:15:47,811 - root - INFO - step: 590 loss: 5.0741 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.48 mfu: 34.22%
[rank0]:[titan] 2025-06-25 13:16:00,259 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - step: 600 loss: 4.9624 memory: 49.66GiB(52.28%) tps: 5,844 tflops: 338.43 mfu: 34.22%
[rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:16:01,833 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:16:01,853 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:16:02,368 - root - INFO - Dumping profiler traces at step 600
[rank0]:[titan] 2025-06-25 13:16:02,566 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:16:03,466 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-600
[rank0]:[titan] 2025-06-25 13:16:17,066 - root - INFO - step: 610 loss: 5.0695 memory: 49.66GiB(52.28%) tps: 5,378 tflops: 311.46 mfu: 31.49%
[rank0]:[titan] 2025-06-25 13:16:31,050 - root - INFO - step: 620 loss: 5.0808 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31%
[rank0]:[titan] 2025-06-25 13:16:45,041 - root - INFO - step: 630 loss: 4.9225 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.14 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:16:59,065 - root - INFO - step: 640 loss: 4.8939 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.36 mfu: 34.21%
[rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-600', uuid='05657add-86bb-4707-b3d6-3494d586c34d')
[rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:17:11,847 - root - INFO - [GC] Peforming periodical GC collection. 0.23 seconds.
[rank0]:[titan] 2025-06-25 13:17:13,311 - root - INFO - step: 650 loss: 5.0611 memory: 49.66GiB(52.28%) tps: 5,751 tflops: 333.05 mfu: 33.68%
[rank0]:[titan] 2025-06-25 13:17:27,609 - root - INFO - step: 660 loss: 4.9348 memory: 49.66GiB(52.28%) tps: 5,731 tflops: 331.88 mfu: 33.56%
[rank0]:[titan] 2025-06-25 13:17:41,811 - root - INFO - step: 670 loss: 4.8798 memory: 49.66GiB(52.28%) tps: 5,769 tflops: 334.13 mfu: 33.78%
[rank0]:[titan] 2025-06-25 13:17:56,001 - root - INFO - step: 680 loss: 4.9749 memory: 49.66GiB(52.28%) tps: 5,774 tflops: 334.40 mfu: 33.81%
[rank0]:[titan] 2025-06-25 13:18:10,004 - root - INFO - step: 690 loss: 4.7231 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:18:22,447 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-25 13:18:24,013 - root - INFO - step: 700 loss: 4.8765 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.69 mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:18:24,014 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:18:24,015 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:18:24,034 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:18:24,557 - root - INFO - Dumping profiler traces at step 700
[rank0]:[titan] 2025-06-25 13:18:24,762 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:18:25,616 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-700
[rank0]:[titan] 2025-06-25 13:18:39,233 - root - INFO - step: 710 loss: 4.8973 memory: 49.66GiB(52.28%) tps: 5,383 tflops: 311.76 mfu: 31.52%
[rank0]:[titan] 2025-06-25 13:18:53,245 - root - INFO - step: 720 loss: 4.8671 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.67 mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:19:07,252 - root - INFO - step: 730 loss: 5.1143 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.76 mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:19:21,270 - root - INFO - step: 740 loss: 4.9568 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.51 mfu: 34.23%
[rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-700', uuid='40580106-cb24-4788-a99d-b10ff483e779')
[rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:19:33,755 - root - INFO - [GC] Peforming periodical GC collection. 0.11 seconds.
[rank0]:[titan] 2025-06-25 13:19:35,288 - root - INFO - step: 750 loss: 4.6943 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.49 mfu: 34.23%
[rank0]:[titan] 2025-06-25 13:19:49,292 - root - INFO - step: 760 loss: 4.7441 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:20:03,668 - root - INFO - step: 770 loss: 4.7597 memory: 49.66GiB(52.28%) tps: 5,699 tflops: 330.05 mfu: 33.37%
[rank0]:[titan] 2025-06-25 13:20:18,428 - root - INFO - step: 780 loss: 4.7751 memory: 49.66GiB(52.28%) tps: 5,551 tflops: 321.47 mfu: 32.50%
[rank0]:[titan] 2025-06-25 13:20:32,447 - root - INFO - step: 790 loss: 4.7187 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.49 mfu: 34.23%
[rank0]:[titan] 2025-06-25 13:20:44,878 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - step: 800 loss: 4.7215 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.65 mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:20:46,459 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:20:46,479 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:20:47,058 - root - INFO - Dumping profiler traces at step 800
[rank0]:[titan] 2025-06-25 13:20:47,260 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:20:48,131 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-800
[rank0]:[titan] 2025-06-25 13:21:01,733 - root - INFO - step: 810 loss: 4.7147 memory: 49.66GiB(52.28%) tps: 5,364 tflops: 310.63 mfu: 31.41%
[rank0]:[titan] 2025-06-25 13:21:15,731 - root - INFO - step: 820 loss: 4.6326 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.97 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:21:29,735 - root - INFO - step: 830 loss: 4.7056 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:21:43,727 - root - INFO - step: 840 loss: 4.6464 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.13 mfu: 34.29%
[rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-800', uuid='a5a2aa11-d40c-4d6e-8f58-a21181f08eaf')
[rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:21:56,225 - root - INFO - [GC] Peforming periodical GC collection. 0.13 seconds.
[rank0]:[titan] 2025-06-25 13:21:57,742 - root - INFO - step: 850 loss: 4.8118 memory: 49.66GiB(52.28%) tps: 5,846 tflops: 338.58 mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:22:11,742 - root - INFO - step: 860 loss: 4.6158 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.94 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:22:25,735 - root - INFO - step: 870 loss: 4.5595 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.12 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:22:39,726 - root - INFO - step: 880 loss: 4.4991 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.17 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:22:53,710 - root - INFO - step: 890 loss: 4.7063 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.34 mfu: 34.31%
[rank0]:[titan] 2025-06-25 13:23:06,134 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - step: 900 loss: 4.7203 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.74 mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:23:07,718 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:23:07,738 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:23:08,289 - root - INFO - Dumping profiler traces at step 900
[rank0]:[titan] 2025-06-25 13:23:08,491 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:23:09,410 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-900
[rank0]:[titan] 2025-06-25 13:23:22,968 - root - INFO - step: 910 loss: 4.4820 memory: 49.66GiB(52.28%) tps: 5,372 tflops: 311.11 mfu: 31.46%
[rank0]:[titan] 2025-06-25 13:23:36,961 - root - INFO - step: 920 loss: 4.6184 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:23:50,963 - root - INFO - step: 930 loss: 4.6420 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.90 mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:24:04,987 - root - INFO - step: 940 loss: 4.5612 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.35 mfu: 34.21%
[rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-900', uuid='13493b71-2fef-4543-94dc-65dd4c30d299')
[rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:24:17,438 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds.
[rank0]:[titan] 2025-06-25 13:24:18,991 - root - INFO - step: 950 loss: 4.6412 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.84 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:24:33,004 - root - INFO - step: 960 loss: 4.5748 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.60 mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:24:47,007 - root - INFO - step: 970 loss: 4.6774 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:25:01,005 - root - INFO - step: 980 loss: 4.5435 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.99 mfu: 34.28%
[rank0]:[titan] 2025-06-25 13:25:15,028 - root - INFO - step: 990 loss: 4.6884 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.36 mfu: 34.21%
[rank0]:[titan] 2025-06-25 13:25:27,494 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - step: 1000 loss: 4.6401 memory: 49.66GiB(52.28%) tps: 5,828 tflops: 337.50 mfu: 34.13%
[rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving a full checkpoint at last step, step 1000.
[rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.21 seconds.
[rank0]:[titan] 2025-06-25 13:26:18,296 - root - INFO - Checkpointer is deleting ./outputs/checkpoint/step-1.
[rank0]:[titan] 2025-06-25 13:26:19,208 - root - INFO - Dumping profiler traces at step 1000
[rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-25 13:26:21,065 - root - INFO - Checkpointer deleted ./outputs/checkpoint/step-1 in 2.77 seconds.
[rank0]:[titan] 2025-06-25 13:26:21,433 - root - INFO - Training completed
[rank0]:[titan] 2025-06-25 13:26:21,434 - root - INFO - Destroying the purge thread.
[rank0]:[titan] 2025-06-25 13:26:28,980 - root - INFO - Process group destroyed.
```
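To compare the two runs without reading the logs by eye, the per-step metric lines can be parsed into tuples. The sketch below is a minimal, hypothetical helper (the names `STEP_RE`, `parse_steps`, and `relative_dip` are not part of torchtitan); it assumes the `step: N loss: X ... tps: Y` layout seen in the log above and uses three lines copied from it. It measures the throughput drop in the first logging interval after the step-800 checkpoint. The baseline log further down shows a similar drop after every profiler dump, so the number should not be read as checkpoint overhead alone.

```python
import re

# Matches the per-step metric lines, e.g.
# "... step: 800 loss: 4.7215 memory: ... tps: 5,847 tflops: ..."
# Assumption: this layout is inferred from the log above, not from a documented format.
STEP_RE = re.compile(
    r"step:\s*(?P<step>\d+)\s+loss:\s*(?P<loss>[\d.]+)\s+.*?tps:\s*(?P<tps>[\d,]+)"
)


def parse_steps(lines):
    """Return (step, loss, tps) tuples for every metric line; other lines are skipped."""
    rows = []
    for line in lines:
        m = STEP_RE.search(line)
        if m:
            tps = int(m["tps"].replace(",", ""))  # "5,847" -> 5847
            rows.append((int(m["step"]), float(m["loss"]), tps))
    return rows


def relative_dip(before_tps, after_tps):
    """Fractional throughput drop from one logging interval to the next."""
    return (before_tps - after_tps) / before_tps


# Lines copied verbatim from the DCP ZOC run above.
sample = [
    "[rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - step: 800 loss: 4.7215 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.65 mfu: 34.24%",
    "[rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - Saving the checkpoint (or staging if async is enabled).",
    "[rank0]:[titan] 2025-06-25 13:21:01,733 - root - INFO - step: 810 loss: 4.7147 memory: 49.66GiB(52.28%) tps: 5,364 tflops: 310.63 mfu: 31.41%",
    "[rank0]:[titan] 2025-06-25 13:21:15,731 - root - INFO - step: 820 loss: 4.6326 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.97 mfu: 34.27%",
]

rows = parse_steps(sample)
dip = relative_dip(rows[0][2], rows[1][2])
print(rows)
print(f"tps drop after the step-800 checkpoint: {dip:.1%}")
```

Running the same helper over both full logs would give two `(step, loss, tps)` series that can be plotted or diffed step by step.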
### TorchTitan Training (Async + Pinned Mem) Without DCP for 1000 Steps on LLAMA3 8B Model
```
[rank0]:[titan] 2025-06-16 22:26:05,616 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:[titan] 2025-06-16 22:26:05,617 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-16 22:26:05,680 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-16 22:26:05,942 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-16 22:26:05,943 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200).
[rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - step: 1 loss: 12.2552 memory: 42.16GiB(44.38%) tps: 1,512 tflops: 87.59 mfu: 8.86%
[rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-06-16 22:26:24,113 - root - INFO - step: 10 loss: 10.2261 memory: 49.66GiB(52.28%) tps: 5,638 tflops: 326.50 mfu: 33.01%
[rank0]:[titan] 2025-06-16 22:26:37,933 - root - INFO - step: 20 loss: 8.5252 memory: 49.66GiB(52.28%) tps: 5,928 tflops: 343.32 mfu: 34.71%
[rank0]:[titan] 2025-06-16 22:26:51,817 - root - INFO - step: 30 loss: 7.7284 memory: 49.66GiB(52.28%) tps: 5,902 tflops: 341.78 mfu: 34.56%
[rank0]:[titan] 2025-06-16 22:27:05,726 - root - INFO - step: 40 loss: 7.3651 memory: 49.66GiB(52.28%) tps: 5,890 tflops: 341.14 mfu: 34.49%
[rank0]:[titan] 2025-06-16 22:27:18,093 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-16 22:27:19,646 - root - INFO - step: 50 loss: 7.1575 memory: 49.66GiB(52.28%) tps: 5,886 tflops: 340.87 mfu: 34.47%
[rank0]:[titan] 2025-06-16 22:27:33,572 - root - INFO - step: 60 loss: 6.8193 memory: 49.66GiB(52.28%) tps: 5,883 tflops: 340.72 mfu: 34.45%
[rank0]:[titan] 2025-06-16 22:27:47,509 - root - INFO - step: 70 loss: 6.9294 memory: 49.66GiB(52.28%) tps: 5,879 tflops: 340.47 mfu: 34.43%
[rank0]:[titan] 2025-06-16 22:28:01,442 - root - INFO - step: 80 loss: 6.6544 memory: 49.66GiB(52.28%) tps: 5,880 tflops: 340.56 mfu: 34.44%
[rank0]:[titan] 2025-06-16 22:28:15,387 - root - INFO - step: 90 loss: 6.6736 memory: 49.66GiB(52.28%) tps: 5,875 tflops: 340.27 mfu: 34.41%
[rank0]:[titan] 2025-06-16 22:28:28,168 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds.
[rank0]:[titan] 2025-06-16 22:28:29,755 - root - INFO - step: 100 loss: 6.4444 memory: 49.66GiB(52.28%) tps: 5,702 tflops: 330.24 mfu: 33.39%
[rank0]:[titan] 2025-06-16 22:28:30,372 - root - INFO - Dumping profiler traces at step 100
[rank0]:[titan] 2025-06-16 22:28:30,573 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:28:44,564 - root - INFO - step: 110 loss: 6.6835 memory: 49.66GiB(52.28%) tps: 5,532 tflops: 320.39 mfu: 32.40%
[rank0]:[titan] 2025-06-16 22:28:58,531 - root - INFO - step: 120 loss: 6.3605 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.72 mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:29:12,500 - root - INFO - step: 130 loss: 6.4930 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.69 mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:29:26,476 - root - INFO - step: 140 loss: 6.4257 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.53 mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:29:38,860 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds.
[rank0]:[titan] 2025-06-16 22:29:40,448 - root - INFO - step: 150 loss: 6.0684 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.61 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:29:54,415 - root - INFO - step: 160 loss: 6.3202 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.72 mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:30:08,386 - root - INFO - step: 170 loss: 5.9970 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.64 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:30:22,357 - root - INFO - step: 180 loss: 6.0888 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.65 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:30:36,333 - root - INFO - step: 190 loss: 6.7146 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.52 mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:30:48,735 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:30:50,322 - root - INFO - step: 200 loss: 6.1384 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:30:50,758 - root - INFO - Dumping profiler traces at step 200
[rank0]:[titan] 2025-06-16 22:30:50,976 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[titan] 2025-06-16 22:31:04,970 - root - INFO - step: 210 loss: 6.0388 memory: 49.66GiB(52.28%) tps: 5,593 tflops: 323.92 mfu: 32.75%
[rank0]:[titan] 2025-06-16 22:31:18,934 - root - INFO - step: 220 loss: 6.0348 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.82 mfu: 34.36%
[rank0]:[titan] 2025-06-16 22:31:32,895 - root - INFO - step: 230 loss: 5.9670 memory: 49.66GiB(52.28%) tps: 5,869 tflops: 339.88 mfu: 34.37%
[rank0]:[titan] 2025-06-16 22:31:46,869 - root - INFO - step: 240 loss: 5.8311 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.59 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:31:59,261 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:32:00,846 - root - INFO - step: 250 loss: 5.8383 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.48 mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:32:15,834 - root - INFO - step: 260 loss: 5.7541 memory: 49.66GiB(52.28%) tps: 5,467 tflops: 316.60 mfu: 32.01%
[rank0]:[titan] 2025-06-16 22:32:29,803 - root - INFO - step: 270 loss: 5.7202 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.66 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:32:43,772 - root - INFO - step: 280 loss: 5.6326 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.71 mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:32:57,734 - root - INFO - step: 290 loss: 5.7467 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.85 mfu: 34.36%
[rank0]:[titan] 2025-06-16 22:33:10,148 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-16 22:33:11,718 - root - INFO - step: 300 loss: 5.7704 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:33:12,173 - root - INFO - Dumping profiler traces at step 300
[rank0]:[titan] 2025-06-16 22:33:12,371 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:33:26,359 - root - INFO - step: 310 loss: 5.6014 memory: 49.66GiB(52.28%) tps: 5,596 tflops: 324.09 mfu: 32.77%
[rank0]:[titan] 2025-06-16 22:33:40,337 - root - INFO - step: 320 loss: 5.5262 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.47 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:33:54,315 - root - INFO - step: 330 loss: 5.4930 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.45 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:34:08,294 - root - INFO - step: 340 loss: 5.5682 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.44 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:34:20,693 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:34:22,279 - root - INFO - step: 350 loss: 5.5599 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.32 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:34:36,250 - root - INFO - step: 360 loss: 5.5070 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.63 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:34:50,242 - root - INFO - step: 370 loss: 5.4143 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:35:04,230 - root - INFO - step: 380 loss: 5.4226 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.23 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:35:18,206 - root - INFO - step: 390 loss: 5.3208 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.55 mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:35:30,661 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds.
[rank0]:[titan] 2025-06-16 22:35:32,251 - root - INFO - step: 400 loss: 5.4494 memory: 49.66GiB(52.28%) tps: 5,833 tflops: 337.83 mfu: 34.16%
[rank0]:[titan] 2025-06-16 22:35:32,714 - root - INFO - Dumping profiler traces at step 400
[rank0]:[titan] 2025-06-16 22:35:32,916 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:35:46,913 - root - INFO - step: 410 loss: 5.2551 memory: 49.66GiB(52.28%) tps: 5,588 tflops: 323.63 mfu: 32.72%
[rank0]:[titan] 2025-06-16 22:36:00,884 - root - INFO - step: 420 loss: 5.4090 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.64 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:36:14,870 - root - INFO - step: 430 loss: 5.3344 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:36:28,864 - root - INFO - step: 440 loss: 5.3106 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:36:41,255 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:36:42,843 - root - INFO - step: 450 loss: 5.2298 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.47 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:36:56,829 - root - INFO - step: 460 loss: 5.2365 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.26 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:37:10,821 - root - INFO - step: 470 loss: 5.1352 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.15 mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:37:24,810 - root - INFO - step: 480 loss: 5.1063 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.20 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:37:38,791 - root - INFO - step: 490 loss: 5.1559 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.40 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:37:51,229 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-16 22:37:52,788 - root - INFO - step: 500 loss: 5.2053 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 339.00 mfu: 34.28%
[rank0]:[titan] 2025-06-16 22:37:53,282 - root - INFO - Dumping profiler traces at step 500
[rank0]:[titan] 2025-06-16 22:37:53,479 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:38:07,509 - root - INFO - step: 510 loss: 5.0252 memory: 49.66GiB(52.28%) tps: 5,565 tflops: 322.31 mfu: 32.59%
[rank0]:[titan] 2025-06-16 22:38:22,243 - root - INFO - step: 520 loss: 5.2369 memory: 49.66GiB(52.28%) tps: 5,561 tflops: 322.06 mfu: 32.56%
[rank0]:[titan] 2025-06-16 22:38:36,217 - root - INFO - step: 530 loss: 5.1380 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.56 mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:38:50,196 - root - INFO - step: 540 loss: 5.0538 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.44 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:39:02,608 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:39:04,186 - root - INFO - step: 550 loss: 5.0703 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:39:18,173 - root - INFO - step: 560 loss: 5.1104 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:39:32,166 - root - INFO - step: 570 loss: 4.9368 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.12 mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:39:46,148 - root - INFO - step: 580 loss: 5.0037 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.38 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:40:00,139 - root - INFO - step: 590 loss: 5.0303 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:40:12,566 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:40:14,145 - root - INFO - step: 600 loss: 4.9233 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.78 mfu: 34.25%
[rank0]:[titan] 2025-06-16 22:40:14,671 - root - INFO - Dumping profiler traces at step 600
[rank0]:[titan] 2025-06-16 22:40:14,871 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:40:28,831 - root - INFO - step: 610 loss: 5.0338 memory: 49.66GiB(52.28%) tps: 5,579 tflops: 323.10 mfu: 32.67%
[rank0]:[titan] 2025-06-16 22:40:42,800 - root - INFO - step: 620 loss: 5.0437 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.68 mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:40:56,779 - root - INFO - step: 630 loss: 4.8868 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:41:10,768 - root - INFO - step: 640 loss: 4.8542 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.19 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:41:23,176 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:41:24,756 - root - INFO - step: 650 loss: 5.0227 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.23 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:41:38,741 - root - INFO - step: 660 loss: 4.8901 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.32 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:41:52,723 - root - INFO - step: 670 loss: 4.8374 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.36 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:42:06,707 - root - INFO - step: 680 loss: 4.9357 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.33 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:42:20,691 - root - INFO - step: 690 loss: 4.6904 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.33 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:42:33,106 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:42:34,686 - root - INFO - step: 700 loss: 4.8426 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.02 mfu: 34.28%
[rank0]:[titan] 2025-06-16 22:42:35,202 - root - INFO - Dumping profiler traces at step 700
[rank0]:[titan] 2025-06-16 22:42:35,411 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[titan] 2025-06-16 22:42:49,414 - root - INFO - step: 710 loss: 4.8630 memory: 49.66GiB(52.28%) tps: 5,563 tflops: 322.17 mfu: 32.58%
[rank0]:[titan] 2025-06-16 22:43:03,389 - root - INFO - step: 720 loss: 4.8307 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.54 mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:43:17,366 - root - INFO - step: 730 loss: 5.0704 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.48 mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:43:31,346 - root - INFO - step: 740 loss: 4.9245 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.40 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:43:43,762 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-16 22:43:45,331 - root - INFO - step: 750 loss: 4.6614 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:43:59,308 - root - INFO - step: 760 loss: 4.7145 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.50 mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:44:13,608 - root - INFO - step: 770 loss: 4.7226 memory: 49.66GiB(52.28%) tps: 5,730 tflops: 331.83 mfu: 33.55%
[rank0]:[titan] 2025-06-16 22:44:28,628 - root - INFO - step: 780 loss: 4.7430 memory: 49.66GiB(52.28%) tps: 5,455 tflops: 315.93 mfu: 31.94%
[rank0]:[titan] 2025-06-16 22:44:42,611 - root - INFO - step: 790 loss: 4.6875 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.35 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:44:55,024 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:44:56,605 - root - INFO - step: 800 loss: 4.6901 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.06 mfu: 34.28%
[rank0]:[titan] 2025-06-16 22:44:57,138 - root - INFO - Dumping profiler traces at step 800
[rank0]:[titan] 2025-06-16 22:44:57,344 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[titan] 2025-06-16 22:45:11,360 - root - INFO - step: 810 loss: 4.6838 memory: 49.66GiB(52.28%) tps: 5,552 tflops: 321.56 mfu: 32.51%
[rank0]:[titan] 2025-06-16 22:45:25,305 - root - INFO - step: 820 loss: 4.6000 memory: 49.66GiB(52.28%) tps: 5,876 tflops: 340.28 mfu: 34.41%
[rank0]:[titan] 2025-06-16 22:45:39,251 - root - INFO - step: 830 loss: 4.6733 memory: 49.66GiB(52.28%) tps: 5,875 tflops: 340.25 mfu: 34.40%
[rank0]:[titan] 2025-06-16 22:45:53,208 - root - INFO - step: 840 loss: 4.6178 memory: 49.66GiB(52.28%) tps: 5,870 tflops: 339.98 mfu: 34.38%
[rank0]:[titan] 2025-06-16 22:46:05,611 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:46:07,194 - root - INFO - step: 850 loss: 4.7753 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:46:21,172 - root - INFO - step: 860 loss: 4.5818 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.45 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:46:35,159 - root - INFO - step: 870 loss: 4.5332 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:46:49,143 - root - INFO - step: 880 loss: 4.4733 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.34 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:47:03,125 - root - INFO - step: 890 loss: 4.6788 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.38 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:47:15,614 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:47:17,195 - root - INFO - step: 900 loss: 4.6943 memory: 49.66GiB(52.28%) tps: 5,823 tflops: 337.23 mfu: 34.10%
[rank0]:[titan] 2025-06-16 22:47:17,789 - root - INFO - Dumping profiler traces at step 900
[rank0]:[titan] 2025-06-16 22:47:18,004 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[titan] 2025-06-16 22:47:31,972 - root - INFO - step: 910 loss: 4.4559 memory: 49.66GiB(52.28%) tps: 5,544 tflops: 321.10 mfu: 32.47%
[rank0]:[titan] 2025-06-16 22:47:45,950 - root - INFO - step: 920 loss: 4.5920 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:47:59,923 - root - INFO - step: 930 loss: 4.6219 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.58 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:48:13,889 - root - INFO - step: 940 loss: 4.5429 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.76 mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:48:26,271 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:48:27,849 - root - INFO - step: 950 loss: 4.6160 memory: 49.66GiB(52.28%) tps: 5,870 tflops: 339.93 mfu: 34.37%
[rank0]:[titan] 2025-06-16 22:48:41,809 - root - INFO - step: 960 loss: 4.5495 memory: 49.66GiB(52.28%) tps: 5,869 tflops: 339.91 mfu: 34.37%
[rank0]:[titan] 2025-06-16 22:48:55,764 - root - INFO - step: 970 loss: 4.6507 memory: 49.66GiB(52.28%) tps: 5,871 tflops: 340.03 mfu: 34.38%
[rank0]:[titan] 2025-06-16 22:49:09,734 - root - INFO - step: 980 loss: 4.5222 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.66 mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:49:23,717 - root - INFO - step: 990 loss: 4.6672 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.37 mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:49:36,204 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds.
[rank0]:[titan] 2025-06-16 22:49:37,745 - root - INFO - step: 1000 loss: 4.6165 memory: 49.66GiB(52.28%) tps: 5,840 tflops: 338.24 mfu: 34.20%
[rank0]:[titan] 2025-06-16 22:49:38,344 - root - INFO - Dumping profiler traces at step 1000
[rank0]:[titan] 2025-06-16 22:49:38,552 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[titan] 2025-06-16 22:49:38,553 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-16 22:49:40,553 - root - INFO - Training completed
[rank0]:[titan] 2025-06-16 22:49:41,030 - root - INFO - Process group destroyed.
```
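The `mfu` column in both logs appears to be the logged `tflops` divided by the "Peak FLOPS used for computing MFU" value (9.890e+14). The sketch below checks that relationship against three logged pairs. The function name `mfu_percent` is hypothetical, and the formula is inferred from the numbers in the logs rather than taken from the torchtitan source.

```python
# Peak FLOPS as printed in the log: "Peak FLOPS used for computing MFU: 9.890e+14".
PEAK_FLOPS = 9.890e14


def mfu_percent(tflops, peak_flops=PEAK_FLOPS):
    """Model FLOPS utilization in percent, assuming mfu = achieved FLOPS / peak FLOPS."""
    return 100.0 * tflops * 1e12 / peak_flops


# (tflops, logged mfu %) pairs copied from the logs:
# step 1 and step 1000 of the baseline run, step 1000 of the DCP ZOC run.
logged = [(87.59, 8.86), (338.24, 34.20), (337.50, 34.13)]

for tflops, mfu in logged:
    recomputed = mfu_percent(tflops)
    # The logged tflops are rounded to two decimals, so allow a small tolerance.
    print(f"tflops={tflops:7.2f}  logged={mfu:5.2f}%  recomputed={recomputed:5.2f}%")
```

Because the two runs report nearly identical `tflops` in steady state, their `mfu` values differ by well under one percentage point away from the checkpoint and profiler-dump intervals.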
---------
Co-authored-by: saiteja64 <[email protected]>
1 parent 7104125 commit 671c30c
File tree: 2 files changed, +46 -128 lines changed
- tests/unit_tests
- torchtitan/components