Fix multinode tests script #1631
…nd of the script
Thanks for the PR @fco-dv ! Looks good to me 👍
Let me also try it on my infra.
Do you think we should integrate it into Circle CI?
As for the failing tests, let's decorate … What are the problems with …st_multinode_distrib_cpu for … ?
OK, it is a precision issue, let's add a tol option as done for XLA.
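To illustrate the idea of a `tol` option, here is a minimal sketch of a comparison helper with an absolute tolerance. The name `assert_close` is hypothetical and this is not the code used in the ignite tests; only the default `tol=1e-6` is taken from the commit list of this PR.

```python
def assert_close(actual, expected, tol=1e-6):
    """Fail unless |actual - expected| <= tol (hypothetical helper)."""
    diff = abs(actual - expected)
    assert diff <= tol, f"{actual} != {expected} (diff={diff:.3e}, tol={tol:.0e})"


# Values that differ only by floating-point rounding pass the check,
# whereas a strict `==` comparison between them would fail.
assert_close(0.1 + 0.2, 0.3)
```

A distributed run may reduce values in a different order than a single-process run, so small rounding differences of this kind are expected, and an exact comparison is too strict.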
@fco-dv have you tackled the issue concerning …
@sdesrozis not yet, my guess is that the process group is not destroyed at the end of …
@fco-dv what is the issue about?
@vfdev-5 when running with GPU enabled, …
Maybe we can try to do the same as here: https://github.com/pytorch/ignite/blob/master/tests/ignite/conftest.py#L104 ?
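For readers unfamiliar with the problem being discussed, the general pattern is to always destroy the default process group once a test is done. The sketch below is an assumption-laden illustration, not the code at the link above and not the fix applied in this PR: `fresh_process_group` is a hypothetical helper, and `dist` is assumed to be `torch.distributed` or an object exposing the same `is_initialized`, `init_process_group` and `destroy_process_group` functions.

```python
from contextlib import contextmanager


@contextmanager
def fresh_process_group(dist, **init_kwargs):
    """Run a block inside its own default process group (hypothetical helper)."""
    # A group left over from a previous test would make the next init fail
    # with "trying to initialize the default process group twice!".
    if dist.is_initialized():
        dist.destroy_process_group()
    dist.init_process_group(**init_kwargs)
    try:
        yield
    finally:
        # Always tear down, even if the test body raised.
        dist.destroy_process_group()
```

Under these assumptions, a test body could be wrapped as `with fresh_process_group(torch.distributed, backend="gloo"): ...`, provided the launcher has set the usual rendezvous environment variables.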
Managed to fix the GPU tests with: …
Seems OK now for the default configuration (nnodes | nproc_per_node | gpu): 2 | 4 | 0
and with GPU: 2 | 1 | 1
@vfdev-5 for the CI integration, would you like me to create another PR or continue on this one? Thanks!
@fco-dv Thanks! Let's merge it like that. As for Circle CI, I'll enable it on PRs and let's integrate it in another PR.
Thanks for the PR @fco-dv !
* fix run_multinode_tests_in_docker.sh : run tests with docker python version
* add missing modules
* build an image with test env and add 'nnodes' 'nproc_per_node' 'gpu' as parameters
* pytorch#1615 : change nproc_per_node default to 4
* pytorch#1615 : fix for gpu enabled tests / container rm step at the end of the script
* add xfail decorator for tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu
* fix script gpu_options
* add default tol=1e-6 for _test_distrib_compute_on_criterion
* fix for "RuntimeError: trying to initialize the default process group twice!"
* tolerance for test_multinode_distrib_cpu case only
* fix assert None error
* autopep8 fix

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: fco-dv <[email protected]>
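As a rough illustration of how the three script parameters named in the commits above could drive the container setup and the final cleanup step, here is a sketch that only builds the command lines. It is not the content of `run_multinode_tests_in_docker.sh`: the function name, image name, container names, environment variable names and the `test_cmd` default are all hypothetical, and the defaults 2 | 4 | 0 assume the order `nnodes` | `nproc_per_node` | `gpu`.

```python
def build_commands(nnodes=2, nproc_per_node=4, gpu=0,
                   image="multinode-tests", test_cmd=("pytest", "tests")):
    """Build docker command lines for a simulated multi-node test run.

    Returns (start, run_tests, cleanup): three lists holding one argv
    list per node. Nothing is executed here.
    """
    # Expose GPUs to the containers only when requested.
    gpu_options = ["--gpus", "all"] if gpu else []
    start, run_tests, cleanup = [], [], []
    for rank in range(nnodes):
        name = f"node{rank}"  # hypothetical container name
        start.append(["docker", "run", "-d", "-t", *gpu_options,
                      "--name", name, image])
        # Hypothetical variable names carrying the topology to the tests.
        env = {"NNODES": nnodes, "NPROC_PER_NODE": nproc_per_node,
               "NODE_RANK": rank}
        env_flags = [arg for key, value in env.items()
                     for arg in ("-e", f"{key}={value}")]
        run_tests.append(["docker", "exec", *env_flags, name, *test_cmd])
        # Removing the containers at the end keeps reruns from failing
        # on a name conflict.
        cleanup.append(["docker", "rm", "-f", name])
    return start, run_tests, cleanup
```

Each returned argv list could then be passed to `subprocess.run`, with the cleanup commands placed in a `finally` block so the containers are removed even when the tests fail.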
* Recall/Precision metrics for ddp : average == false and multilabel == true
* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5… (#1612)
* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5.0 to v0.4.3
* Update ignite/contrib/metrics/regression/canberra_metric.py
Co-authored-by: vfdev <[email protected]>
* Update ignite/contrib/metrics/regression/manhattan_distance.py
Co-authored-by: vfdev <[email protected]>
* Update ignite/contrib/metrics/regression/r2_score.py
Co-authored-by: vfdev <[email protected]>
* Update ignite/handlers/checkpoint.py
Co-authored-by: vfdev <[email protected]>
* address PR comments
Co-authored-by: vfdev <[email protected]>
* added TimeLimit handler with its test and doc (#1611)
* added TimeLimit handler with its test and doc
* fixed documentation
* fixed docstring and formatting
* flake8 fix trailing whitespace :)
* modified class logger , default value and tests
* changed rounding to nearest integer
* tests refactored , docs modified
* fixed default value , removed global logger
* fixing formatting
* Added versionadded
* added test for engine termination
Co-authored-by: vfdev <[email protected]>
* Update handlers to use setup_logger (#1617)
* Fixes #1614 - Updated handlers EarlyStopping and TerminateOnNan - Replaced `logging.getLogger` with `setup_logger` in the mentioned handlers
* Updated `TimeLimit` handler. Replaced use of `logger.getLogger` with `setup_logger` from `ignite.utils`
Co-authored-by: Pradyumna Rahul K <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
* Managing Deprecation using decorators (#1585)
* Starter code for managing deprecation
* Make functions deprecated using the `@deprecated` decorator
* Add arguments to the @deprecated decorator to customize it for each function
* Improve `@deprecated` decorator and add tests
* Replaced the `raise` keyword with added `warnings`
* Added tests several possibilities of the decorator usage
* Removing the test deprecation to check tests
* Add static typing, fix mypy errors
* Make `@deprecated` to raise Exceptions or Warning
* The `@deprecated` decorator will now always emit warning unless explicitly asked to raise an Exception
* Fix mypy errors
* Fix mypy errors (hopefully)
* Fix the test `test_deprecated_setup_any_logging`
* Change the test to work with the `@deprecated` decorator
* Change to snake_case, handle mypy ignores
* Improve Type Annotations
* Update common.py
* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5… (#1612)
* For v0.4.3 - Add more versionadded, versionchanged tags - Change v0.5.0 to v0.4.3
* Update ignite/contrib/metrics/regression/canberra_metric.py
Co-authored-by: vfdev <[email protected]>
* Update ignite/contrib/metrics/regression/manhattan_distance.py
Co-authored-by: vfdev <[email protected]>
* Update ignite/contrib/metrics/regression/r2_score.py
Co-authored-by: vfdev <[email protected]>
* Update ignite/handlers/checkpoint.py
Co-authored-by: vfdev <[email protected]>
* address PR comments
Co-authored-by: vfdev <[email protected]>
* `version` -> version
Co-authored-by: vfdev <[email protected]>
Co-authored-by: François COKELAER <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
* Create documentation.md
* Distributed tests on Windows should be skipped until fixed. (#1620)
* modified CONTRIBUTING.md
* bash instead of sh
* Added Checkpoint.get_default_score_fn (#1621)
* Added Checkpoint.get_default_score_fn to simplify best_model_handler creation
* Added score_sign argument
* Updated docs
* Update about.rst
* Update pre-commit hooks and CONTRIBUTING.md (#1622)
* Change pre-commit config and CONTRIBUTING.md - Update hook versions - Remove seed-isort-config - Add black profile to isort
* Fix files based on new pre-commit config
* Add meaningful exclusions to prettier - Also update actions workflow files to match local pre-commit
* added requirements.txt and updated readme.md (#1624)
* added requirements.txt and updated readme.md
* Update examples/contrib/cifar10/README.md
Co-authored-by: vfdev <[email protected]>
* Update examples/contrib/cifar10/requirements.txt
Co-authored-by: vfdev <[email protected]>
Co-authored-by: vfdev <[email protected]>
* Replace relative paths with raw.githubusercontent (#1629)
* Updated cifar10 example (#1632)
* Updates for cifar10 example
* Updates for cifar10 example
* More updates
* Updated code
* Fixed code-formatting
* Fixed failling CI and typos for cifar10 examples (#1633)
* Updates for cifar10 example
* Updates for cifar10 example
* More updates
* Updated code
* Fixed code-formatting
* Fixed typo and failing CI
* Fixed hvd spawn fail and better synced qat code
* Removed temporary hack to install pth 1.7.1 (#1638) - updated default pth image for gpu tests - updated TORCH_CUDA_ARCH_LIST - fixed /merge -> /head in trigger ci pipeline
* [docker] Pillow -> Pillow-SIMD (#1509) (#1639)
* [docker] Pillow -> Pillow-SIMD (#1509)
* [docker] Pillow -> Pillow-SIMD
* replace pillow with pillow-simd in base docker files
* chore(docker): apt-get autoremove after pillow-simd installation
* apt-get install at once, autoremove g++
* install g++ in pillow installation layer
Co-authored-by: Sylvain Desroziers <[email protected]>
* Fix g++ install issue
Co-authored-by: Jeff Yang <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
* Fix multinode tests script (#1631)
* fix run_multinode_tests_in_docker.sh : run tests with docker python version
* add missing modules
* build an image with test env and add 'nnodes' 'nproc_per_node' 'gpu' as parameters
* #1615 : change nproc_per_node default to 4
* #1615 : fix for gpu enabled tests / container rm step at the end of the script
* add xfail decorator for tests/ignite/engine/test_deterministic.py::test_multinode_distrib_cpu
* fix script gpu_options
* add default tol=1e-6 for _test_distrib_compute_on_criterion
* fix for "RuntimeError: trying to initialize the default process group twice!"
* tolerance for test_multinode_distrib_cpu case only
* fix assert None error
* autopep8 fix
Co-authored-by: vfdev <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: fco-dv <[email protected]>
* remove warning for average=False and is_multilabel=True
* update docstring and {precision, recall} tests according to test_multilabel_input_NCHW
Co-authored-by: vfdev <[email protected]>
Co-authored-by: Ahmed Omar <[email protected]>
Co-authored-by: Pradyumna Rahul <[email protected]>
Co-authored-by: Pradyumna Rahul K <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: Devanshu Shah <[email protected]>
Co-authored-by: Debojyoti Chakraborty <[email protected]>
Co-authored-by: Jeff Yang <[email protected]>
Co-authored-by: fco-dv <[email protected]>
Fixes #1627
Description: Try to fix the run_multinode_tests_in_docker.sh script: add `nnodes` | `nproc_per_node` | `gpu` as parameters, and `docker rm` steps at the end.
Check list: