
reimplement metric diagnosis, combine step with tensor metrics #1525


Open · wants to merge 1 commit into master

Conversation

majieyue (Collaborator)

What changes were proposed in this pull request?

We enable the step collector for the atorch trainer, so the start and end of each step are reported to the master. If a step has begun but has not finished before the timeout, it may be hanging. We also keep collecting key metrics such as tensor utilization; if tensor utilization drops to zero for a period of time while the step is also stuck, the job is regarded as hung and we restart the processes to retry.
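As a rough illustration of the heuristic described above, here is a minimal sketch. The names (HANG_TIMEOUT, UTIL_ZERO_WINDOW, the event fields) and the thresholds are assumptions for illustration, not the PR's actual code:

import time

HANG_TIMEOUT = 1800      # seconds a step may stay open before it is suspect (assumed value)
UTIL_ZERO_WINDOW = 300   # seconds of zero tensor utilization that count as stuck (assumed value)

def is_job_hang(step_event, tensor_util_samples, now=None):
    # step_event: latest step with begin_timestamp set; end_timestamp is None while running.
    # tensor_util_samples: list of (timestamp, util) pairs, newest last.
    now = now or time.time()

    # Signal 1: a step began but has not ended within the timeout.
    step_stuck = (
        step_event.end_timestamp is None
        and now - step_event.begin_timestamp > HANG_TIMEOUT
    )

    # Signal 2: tensor utilization has been zero for the whole window.
    recent = [u for ts, u in tensor_util_samples if now - ts <= UTIL_ZERO_WINDOW]
    util_flat = bool(recent) and all(u == 0 for u in recent)

    # Only when both signals agree is the job treated as hung and restarted.
    return step_stuck and util_flat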

Why are the changes needed?

To improve training availability

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT


codecov bot commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 93.83117% with 19 lines in your changes missing coverage. Please review.

Project coverage is 83.75%. Comparing base (72f664d) to head (9fc6e30).
Report is 3 commits behind head on master.

Files with missing lines                               Patch %   Lines
dlrover/python/common/event/context.py                 87.35%    11 Missing ⚠️
...lrover/python/master/diagnosis/diagnosis_master.py  68.18%     7 Missing ⚠️
dlrover/python/master/main.py                          80.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1525      +/-   ##
==========================================
+ Coverage   83.62%   83.75%   +0.13%     
==========================================
  Files         267      268       +1     
  Lines       27672    27868     +196     
==========================================
+ Hits        23140    23342     +202     
+ Misses       4532     4526       -6     

☔ View full report in Codecov by Sentry.

@@ -68,6 +70,23 @@ def get_first_step_event(self):
key = keys[0]
return self._step_events[key]

def last_steps_avg_time(self, last_steps):
Review comment (Collaborator):

steps -> step
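For context, a minimal sketch of what this helper might compute, assuming self._step_events maps keys to events carrying begin_timestamp and end_timestamp (field names inferred from the surrounding diff, not confirmed):

def last_steps_avg_time(self, last_steps):
    # Average wall-clock duration of the most recent `last_steps` completed steps.
    keys = sorted(self._step_events.keys())[-last_steps:]
    durations = [
        self._step_events[k].end_timestamp - self._step_events[k].begin_timestamp
        for k in keys
        if self._step_events[k].end_timestamp is not None
    ]
    if not durations:
        return 0.0
    return sum(durations) / len(durations)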


elif event.type == EventTypeName.END:
if not len(keys):
logger.error(f"invalid ckpt step without BEGIN: {event}")
Review comment (Collaborator):

replace with warning

)
return
if event.timestamp < last_event.end_timestamp:
logger.error(
Review comment (Collaborator):

replace with warning

logger.error(f"invalid ckpt step: {step_event}, {event}")
return
if step_event.begin_timestamp > event.timestamp:
logger.error(
Review comment (Collaborator):

There are too many error logs of this type. There should be a unified event-validation step here, with validation failures logged as warnings. Note that, in the current code, only event logs that impact core processes use the error level.
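A unified validation helper along these lines might look like the following sketch. The event fields and logger come from the diff fragments above; the helper name itself is an assumption:

def _validate_step_end(step_event, event):
    # Returns True when `event` can legally close `step_event`;
    # otherwise logs a warning (not an error) and returns False.
    if step_event is None:
        logger.warning(f"invalid step END without BEGIN: {event}")
        return False
    if step_event.begin_timestamp > event.timestamp:
        logger.warning(f"step END precedes its BEGIN: {step_event}, {event}")
        return False
    return True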

)
step_event.step = event.step
step_event.event_state = TrainEventState.TRAIN_EVT_END
step_event.localtime = int(datetime.now().timestamp())
Review comment (Collaborator):

int(datetime.now().timestamp()) -> int(time.time())
Since we are already using Unix timestamps here, there's no need to add an extra conversion through datetime.
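Both expressions yield the same Unix epoch seconds; time.time() simply skips the intermediate datetime object:

import time
from datetime import datetime

# Equivalent results, modulo a possible clock tick between the two calls:
a = int(datetime.now().timestamp())
b = int(time.time())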

and event_type == EventTypeName.BEGIN
):
logger.info(
f"Collect first step since last rendezvous: {step}"
Review comment (Collaborator):

There are currently too many similar logs. Many of them should use the debug level; only essential points, such as outputs, should use the info level.

Has this actually been tested so far? It is worth evaluating whether the increase in info logs might reduce efficiency during routine log investigations.

hang_downtime = _dlrover_context.hang_downtime
if self._is_observing_paused:
logger.info(
f"Pause _metric_diagnose thread due to "
Review comment (Collaborator):

Pause _metric_diagnose thread due to paused status

action_type=DiagnosisActionType.RESTART_WORKER,
instance=DiagnosisConstant.ANY_INSTANCE,
if step_hang is True:
logger.info("Restart worker-0 all processes")
Review comment (Collaborator):

Why does only worker-0 need to restart?

@@ -300,6 +300,11 @@ def _join_rendezvous(self, request: comm.JoinRendezvousRequest):
RendezvousName.ELASTIC_TRAINING
]
training_manager.clear_waiting_nodes()

# Pause hang diagnosis during rendezvous
if node_rank == 0:
Review comment (Collaborator):

This implementation should not be placed here. It should be directly implemented under rdzv_manager.
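As a sketch of the suggested placement: pause_observing is inferred from the _is_observing_paused flag in this diff, continue_observing appears in the hunk below, and the class shape is an assumption rather than the actual dlrover API:

class RendezvousManager:
    def __init__(self, diagnosis_manager):
        self._diagnosis_manager = diagnosis_manager

    def join_rendezvous(self, node_rank):
        # A stalled step during rendezvous is expected, not a hang,
        # so pause hang diagnosis while nodes regroup.
        if node_rank == 0:
            self._diagnosis_manager.pause_observing()
        # ... existing join logic ...

    def complete_rendezvous(self):
        # Training is about to resume, so hang diagnosis resumes too.
        self._diagnosis_manager.continue_observing()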

@@ -334,6 +339,9 @@ def _get_comm_world(self, request: comm.CommWorldRequest):
rdzv_round = rdzv_manager.get_rdzv_round()
metrics = {CustomMetricKeys.RDZV_ROUND: rdzv_round}
self._job_metric_collector.collect_custom_data(metrics)
# Finish elastic training rendezvous so we continue diagnosis
self._diagnosis_manager.continue_observing()
Review comment (Collaborator):

Same for this part: it should also be implemented under rdzv_manager.
