-
Notifications
You must be signed in to change notification settings - Fork 9.4k
[Backport] Implement Better Error Handling and Fix Waits on Null PIDs in Parallel SCD Execution #22610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Backport] Implement Better Error Handling and Fix Waits on Null PIDs in Parallel SCD Execution #22610
Conversation
… in Parallel SCD Execution Relates to magento#22607 magento#21852 magento#22563
|
Hi @davidalger. Thank you for your contribution
For more details, please, review the Magento Contributor Assistant documentation |
|
Moving this PR on hold till original #22607 will be merged |
|
|
||
| unset($this->inProgress[$package->getPath()]); | ||
| return pcntl_wexitstatus($status) === 0; | ||
| } else if ($result === -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use elseif instead of else if there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch. That was a mistake. I'll get it fixed.
|
Why didn't you added changes in |
Honestly? Because the changes were originally developed as a patch applied to the 2.3.1 and 2.2.6 releases inside a CI/CD pipeline. The I'll update the method on this back port though. Actually there is really no reason not to just pull back the entire updated class from 2.3 to include the strict types and phpcs:ignores, etc as well. Thanks for the notes! |
|
@ihor-sviziev PR updated. Let me know if there is anything I can do to help move this (and the PR to 2.3-develop) forward. Thanks again! |
|
Hi @sidolov, thank you for the review. |
|
Hi @ihor-sviziev, thank you for the review. |
|
Hi @davidalger, thank you for your contribution! |
… on Null PIDs in Parallel SCD Execution #22610
Original Pull Request
#22607
Description of Issue and Resolution
I've been building a Concourse CI/CD pipeline and using a project which has 14 themes to deploy given the multi-site setup. Around 50% of the jobs running in Concourse would fail with the following error (prior to the changes made in this PR) which is precisely the issue recorded on issue #21852:
Every time this error occured, the SCD would reach 100% completion and then hang until the 400 second timeout (as defined on
\Magento\Deploy\Process\Queue::DEFAULT_MAX_EXEC_TIME) elapses and the queue exits and attempts to cleanup. The error came from__destructbut the output indicates all packages had reached 100% and there are actually no child processes still present on the system by the time this occurs:With the better error logging and messaging in
__destruct(without fixes in other areas of this class), theRuntimeExceptionthrown reveals what I expected: the__destructmethod is attempting to wait for a child process that has already exited and been reaped by a prior call topcntl_waitpidin\Magento\Deploy\Process\Queue::isDeployedresulting in aPCNTL_ECHILD(errno 10) error:In the above execution, PID 540 was the parent process execution of
bin/magento. PID 543 (the one that tripped up__destructcall) is one of the first child processes that was spawned. Watching the process list inside the container this was running in showed this PID had cleaned up long before the procesess got close to the__destructcall in which theQueueimplementationn is attempting to wait for it to exit. This indicated for some reason the PID was remaining ininProgress(by all appearances, possibly a race condition given the sporadicness of the error on most environment, except this one where I can reproduce it 50% of the time)Similar behaviour of
pcntl_waitpidcan also be observed by running the following single-line command:Errno 10 is equivelant to the constant
PCNTL_ECHILDwhich per the linux man page (https://linux.die.net/man/2/waitpid) indicates the following:From here I added the same error logging I'd setup in
__destructto the usage ofpcntl_waitpidin\Magento\Deploy\Process\Queue::isDeployedas there was no error checking there, just the assumption that it was succeeding with a valid PID or returning because no child had exited (per theWNOHANGoption).After adding error handling to
isDeployedit was revealaed thatisDeployedwas callingpcntl_waitpidwithnullincorrectly and this resulted in a silent error on 2.2x which was not caught due to lack of error checking on thepcntl_waitpidcall (note the missing PID value):On
2.3-developafterdeclare(strict_types=1);was added to theQueueclass file, this resulted in aPHP Fatal error: Uncaught TypeErrorsincepcntl_waitpidis expecting anintand not anullargument. Complicating things worse on2.3-developwas commit 7421dfb which inadvertantly changed the behaviour ofgetPidto return a boolean vs actually returning the child PID.This resolves the issue with
getPidintroduced in commit 7421dfb as well as the issue with passing anullvalue topcntl_waitpidthat ultimately resulted in the SCD process inconsistently failing when parallel execution is used under specific circumstances (such as having a large number of themes).This PR resolves both #22563 and #21852 reliably in my test case. At the beginning of this description I stated I was reproducing the issue on about 50% of builds. After applying this patch to the build I have now run over 75+ builds succesfully without SCD hanging until the timeout elapses and without resulting in any
RuntimeExceptionerrors.Contribution checklist (*)