Skip to content

[grid] stop the health check of a restarted node #15011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 2, 2025
Merged

Conversation

joerg1985
Copy link
Member

@joerg1985 joerg1985 commented Jan 2, 2025

User description

Description

This PR will stop the old health check of a restarted node and remove the old node from the model.
The NodeRestartedEvent will now contain the node status of the old node, this will allow to cleanup related resources.
The status of a newly added node is DOWN in any case, so sending the node status of the new node with the event was kind of useless anyway. This is also the reason for removing the old LocalDistributer.handleNodeRestarted code.

Motivation and Context

This will speed up the cleanup and should fix a leak, as the node is removed from the model before raising the event.
But removing it from the model will stop the dead node detection, so the old health check should never be stopped.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

PR Type

Bug fix, Enhancement


Description

  • Updated NodeRestartedEvent to use previous node status for cleanup.

  • Improved logging with detailed node information (ID and URI).

  • Removed redundant handleNodeRestarted method in LocalDistributor.

  • Simplified node removal logic across multiple session map implementations.


Changes walkthrough 📝

Relevant files
Enhancement
GridModel.java
Refactor node restart and removal logic                                   

java/src/org/openqa/selenium/grid/distributor/GridModel.java

  • Updated NodeRestartedEvent to use previous node status.
  • Enhanced logging to include node ID and URI.
  • Adjusted logic for removing unhealthy or inactive nodes.
  • +13/-4   
    LocalDistributor.java
    Simplify node restart handling in LocalDistributor             

    java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java

  • Removed handleNodeRestarted method.
  • Simplified node removal logic.
  • Updated event listener for NodeRestartedEvent.
  • +5/-25   
    Bug fix
    JdbcBackedSessionMap.java
    Update session cleanup for restarted nodes in JDBC             

    java/src/org/openqa/selenium/grid/sessionmap/jdbc/JdbcBackedSessionMap.java

    • Updated NodeRestartedEvent listener to use previous node status.
    +2/-1     
    LocalSessionMap.java
    Enhance session cleanup for restarted nodes locally           

    java/src/org/openqa/selenium/grid/sessionmap/local/LocalSessionMap.java

  • Updated NodeRestartedEvent listener to use previous node status.
  • Improved session cleanup logic for restarted nodes.
  • +3/-2     
    RedisBackedSessionMap.java
    Improve session cleanup for restarted nodes in Redis         

    java/src/org/openqa/selenium/grid/sessionmap/redis/RedisBackedSessionMap.java

    • Updated NodeRestartedEvent listener to use previous node status.
    +2/-1     

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link
    Contributor

    qodo-merge-pro bot commented Jan 2, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Potential Race Condition

    The node variable is used after being removed from the nodes map. If another thread modifies the map between removal and usage, this could cause a NullPointerException.

    Node node = nodes.remove(nodeId);
    model.remove(nodeId);
    allChecks.remove(nodeId);
    
    if (node instanceof RemoteNode) {
      ((RemoteNode) node).close();
    }
    Early Break

    The break statement inside the unhealthy node check loop means only one unhealthy node will be removed per purge cycle, potentially leaving other unhealthy nodes in the system.

    toRemove.add(node);
    break;

    Copy link
    Contributor

    qodo-merge-pro bot commented Jan 2, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Remove premature loop termination to ensure all unhealthy nodes are properly processed

    Remove the unnecessary break statement in the purgeDeadNodes() method which
    prematurely exits the loop after removing the first unhealthy node, preventing other
    unhealthy nodes from being removed.

    java/src/org/openqa/selenium/grid/distributor/GridModel.java [228-234]

     if (nodeHealthCount.getOrDefault(id, 0) > UNHEALTHY_THRESHOLD) {
         LOG.info(
             String.format(
                 "Removing Node %s (uri: %s), unhealthy threshold has been reached",
                 node.getNodeId(), node.getExternalUri()));
         toRemove.add(node);
    -    break;
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: The suggestion fixes a significant logical error where only the first unhealthy node would be removed due to the premature break statement, leaving other unhealthy nodes in the system. This could impact system stability and resource management.

    9
    Add defensive null check to prevent potential NullPointerException when handling node removal

    Add a null check before attempting to close the node to prevent potential
    NullPointerException, since nodes.remove() could return null if the node was already
    removed.

    java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java [484-490]

     Node node = nodes.remove(nodeId);
     model.remove(nodeId);
     allChecks.remove(nodeId);
     
    -if (node instanceof RemoteNode) {
    +if (node != null && node instanceof RemoteNode) {
         ((RemoteNode) node).close();
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion addresses a potential NullPointerException that could occur if the node was already removed. This is a critical defensive programming practice that enhances the code's robustness and prevents runtime crashes.

    8

    @joerg1985 joerg1985 merged commit 4ddb16c into trunk Jan 2, 2025
    33 checks passed
    @joerg1985 joerg1985 deleted the stop-health-check branch January 2, 2025 21:05
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants