Memory error in a nested meta-workflow #2622
Comments
@dPys could you try setting the memory of the node directly?
Thanks @mgxd for looking into this and for the quick reply. Unfortunately, I had already tried setting mem_gb on the node interface, with no luck. I also tried various approaches to explicitly setting the number of threads (see the PyNets workflows.py module). Since posting this issue earlier, I've actually managed to home in on the error a bit further. It appears to be a nibabel error equivalent to what @oesteban reported a few months ago when debugging a memory leak in fmriprep (nipreps/fmriprep#766). It seems that when a function node loads a NIfTI file using nibabel (as is done within the NiftiSpheresMasker function), and this is performed iterably with the MultiProc plugin such that many 4D NIfTIs are loaded simultaneously, memory usage temporarily spikes and the workflow fails. This can occur even if I explicitly set extract_ts_wb_node.mem_gb to 20 GB or more! It is totally dependent on how many MultiProc threads are running and on the size of the NIfTI file inputs. I'm actually not sure how easily this can be resolved, since the memory requirements of the function node scale rather unpredictably as a result. Also, the np.asarray(img.dataobj) trick described once upon a time by @GaelVaroquaux won't work here, since NiftiSpheresMasker takes a file path input (i.e. as opposed to a nibabel image object/data array). Other thoughts?
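For reference, the trick being referred to looks like this in nibabel; a minimal sketch with a hypothetical file name, illustrating why the two access patterns differ in memory behavior:

```python
import numpy as np
import nibabel as nib

# 'func_4d.nii.gz' is a hypothetical 4D functional image.
img = nib.load('func_4d.nii.gz')

# get_data() loads the full array AND caches it on the image object,
# so that memory stays allocated for as long as img is referenced.
data_cached = img.get_data()
img.uncache()  # explicitly drop the cached array again

# np.asarray(img.dataobj) materializes the array without caching it,
# so the memory is released as soon as this array goes out of scope.
data_uncached = np.asarray(img.dataobj)
```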
Derek, you can reduce memory usage by making sure the input file is uncompressed.
Hi @effigies, thanks for looking into this too. So I just tried this (i.e. decompressing all NIfTI inputs that feed into this function node, both within and outside of the workflow) and still no luck.
Other thoughts? This is actually starting to look less and less like a nipype or nibabel issue, and more and more like a nilearn issue. I wonder if it would help to revise this one-liner in NiftiSpheresMasker, from niimg.get_data().reshape([-1, niimg.shape[3]]).T to np.asarray(niimg.dataobj).reshape([-1, niimg.shape[3]]).T. (Testing this now on a forked version...)
And the above tweak to NiftiSpheresMasker fails as well.
Seems to really be a problem of loading too many images into memory at once when MultiProc is used. As a band-aid fix, I could add sleep(randint(1, 4)) or something similar within the node before the function call to stagger the image file loads, but that would be sloppy and not a good long-term solution... Is nibabel's img.get_data() designed to scale with nipype's MultiProc (or vice versa)?
@dPys - I'm going to bring @oesteban into this discussion. As far as I understand, if you inform each node about how much memory it will need, the MultiProc plugin will take this into account. However, you really need to be fairly precise about how much memory. The amount of memory will depend on the process, so I would suggest monitoring a single process by running the entire workflow in linear mode to determine the consumption of each process.
Thanks @satra! Will try linear mode, monitor consumption, and report back. Is nipype.utils.profiler.log_nodes_cb with the callback logger still the tool to use for profiling? If so, I'm guessing that I should do this at each workflow layer separately (i.e. 'init_single_subject_wf', the nested meta-workflow 'meta', and the nested meta-nested workflow 'wb_functional_connectometry_wf')? Each of these layers currently uses MultiProc. -derek
@dPys have you tried enabling the resource monitor config option for your master workflow? It may help isolate the memory consumption |
Thanks @mgxd , wasn't aware of the resource monitor config option. Will incorporate this and see what happens. |
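A sketch of the profiling setup being discussed, assuming an already-built workflow object `wf`; the log file name and the n_procs/memory_gb values are placeholders, and the resource-monitor toggle follows nipype's config interface:

```python
import logging
from nipype import config
from nipype.utils.profiler import log_nodes_cb

# Turn on nipype's per-node resource monitor.
config.enable_resource_monitor()

# Send the 'callback' logger used by log_nodes_cb to a file.
callback_logger = logging.getLogger('callback')
callback_logger.setLevel(logging.DEBUG)
callback_logger.addHandler(logging.FileHandler('run_stats.log'))

# First, run serially to measure each node's true consumption...
wf.run(plugin='Linear')

# ...then go back to MultiProc with the callback logger attached.
wf.run(plugin='MultiProc',
       plugin_args={'n_procs': 8,
                    'memory_gb': 32,
                    'status_callback': log_nodes_cb})
```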
Okay, so after some testing, there doesn't seem to be any obvious issue with scaling the NiftiSpheresMasker function itself (see nilearn/nilearn#1663, where this is simultaneously being addressed). Using very high levels of parallelism is probably not what NiftiSpheresMasker was originally intended for, and I've yet to figure out clear-cut heuristics for how its memory requirements scale with the number/size of network nodes, but what is clear from a first pass of profiling is that it tends to consume more memory than the other function nodes in the PyNets workflow. For this reason, I think this particular memory error is just exposing a broader issue with how memory is handled in PyNets in general at the moment. The most obvious culprit may be that the NiftiSpheresMasker function Node (i.e. extract_ts_wb_node) lives in a nested meta-workflow that is:
and, relatedly:
Are there ways I might work around this? @oesteban?
Hi, I guess that a solution to 1 will need to wait for nipype 2.0, so there's little to do there. Regarding 2, I would configure the nested workflow to run with the Linear plugin. You can combine that (or use it as an alternative) with setting explicit resource limits on the heavy nodes. Can you elaborate more on the "unpredictable behavior with the nipype scheduler"? If there is a problem there, we need to identify it and fix it.
Thanks @oesteban. Along the lines of what you mentioned, I think I've found a working solution. Limiting the nested workflow to the Linear plugin was not really an option, since that would greatly restrict the parallelization of the workflow as a whole (the majority of iterables are in the meta-wf's nested workflow). Instead, I figured out that it was possible to limit the num_threads and mem_gb attributes of the nested workflow's nodes at the level of the meta-workflow. For some reason, when using MultiProc with a meta-workflow, nipype's default scheduling behavior seems to be to allocate fractions of a thread and of a GB of memory to each node. The easiest solution was therefore to explicitly set num_threads and mem_gb in two places: 1) on the function Nodes within the meta-workflow's nested workflow; and 2) on the meta-workflow's 'meta-nodes', as I've done below:
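(A rough reconstruction of that pattern, not the actual PyNets code: the node names come from this thread, while the placeholder function, the values, and the constructor-based form are my own assumptions.)

```python
from nipype.pipeline import engine as pe
from nipype.interfaces import utility as niu


def _placeholder(in_file):
    # Stand-in for the real time-series extraction / estimation functions.
    return in_file


# 1) On the function Nodes within the nested workflow, declare explicit
#    thread/memory budgets for the MultiProc scheduler to honor.
extract_ts_wb_node = pe.Node(
    niu.Function(input_names=['in_file'], output_names=['out'],
                 function=_placeholder),
    name='extract_ts_wb_node',
    mem_gb=6,     # memory (GB) reserved for this node
    n_procs=2)    # threads reserved for this node

# 2) Do the same on the meta-workflow's 'meta-nodes' (e.g. imp_est), so the
#    outer scheduler does not hand them fractional thread/GB allocations.
imp_est = pe.Node(
    niu.Function(input_names=['in_file'], output_names=['out'],
                 function=_placeholder),
    name='imp_est',
    mem_gb=6,
    n_procs=2)
```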
The other thing that probably needed to be done was to reserve a single core and a GB of memory for the meta-wf and the master-wf, to keep them running as low-resource background processes:
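Presumably something along these lines; a sketch in which the use of psutil to query total memory, the exact plugin_args, and the assumption of a built `meta_wf` object are all mine, not the actual PyNets code:

```python
import multiprocessing

import psutil  # assumed here just to query total system memory

# Leave one core and roughly one GB free so the meta-wf / master-wf
# orchestration processes keep running as low-resource background tasks.
n_procs = max(multiprocessing.cpu_count() - 1, 1)
memory_gb = max(int(psutil.virtual_memory().total / (1024 ** 3)) - 1, 1)

meta_wf.run(plugin='MultiProc',
            plugin_args={'n_procs': n_procs, 'memory_gb': memory_gb})
```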
...These tweaks appear to have fixed the issue, but I am still testing so will let you know for sure shortly. Cheers everyone, |
One further improvement would be to switch to the forkserver start method for multiprocessing. In fmriprep we found this problem very often and made changes along those lines:
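For context, switching the start method is a one-liner from the standard library (this is not fmriprep's exact code); it must run once, early in the entry point, before any workflows or process pools are created:

```python
import multiprocessing

if __name__ == '__main__':
    # Forkserver children inherit far less state than plain fork()ed
    # children, which keeps the memory footprint of each worker small.
    multiprocessing.set_start_method('forkserver')
    # ... build and run the nipype workflow as usual from here ...
```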
Yup! But with a 1:4 ratio of threads:mem on the extract_ts_wb_coords_node 🚀
Also, great suggestion about the forkserver. Will try this in the coming weeks! |
Make that a 1:3 ratio of threads:mem. Issue closed! |
Hi @oesteban, PyNets is now optimized to scale using any number of cores for single-subject workflows. When I run it in multi-subject mode, however, I've found that the memory allocation snowballs into memory errors, as you mentioned, if I allocate anything more than the resources of a single node (i.e. beyond shared memory). What would be the easiest way to get started transitioning from MultiProc to a forkserver? -Derek
In fmriprep, one line sets the forkserver mode. Then, a separate section builds the workflow in its own process: the workflow is returned, and the process finishes and is cleaned up (i.e. freeing all the memory it took). That way you would start your forks with the minimal memory possible. Then you just need to use MultiProc in the traditional way. This all assumes that the forkserver mode still works after #2598. Otherwise, only the second part of this suggestion (creating the workflow in a separate process) will be useful, although pretty limited.
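Put together, the pattern looks roughly like this; a sketch, not fmriprep's or PyNets' actual code, where `_build_workflow` is a placeholder for the real graph-construction function and the plugin_args values are arbitrary:

```python
import multiprocessing as mp


def _build_workflow(retval):
    # All of the memory-hungry graph construction happens in this child
    # process; only the finished workflow object is passed back.
    from nipype.pipeline import engine as pe
    retval['workflow'] = pe.Workflow(name='meta')  # placeholder build


if __name__ == '__main__':
    # 1) Set the forkserver start method before anything else forks.
    mp.set_start_method('forkserver')

    # 2) Build the workflow in a throwaway process so that its peak memory
    #    is released before MultiProc starts spawning workers.
    with mp.Manager() as mgr:
        retval = mgr.dict()
        proc = mp.Process(target=_build_workflow, args=(retval,))
        proc.start()
        proc.join()
        wf = retval['workflow']

    # 3) Run with MultiProc in the traditional way, from a lean parent.
    wf.run(plugin='MultiProc', plugin_args={'n_procs': 8, 'memory_gb': 32})
```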
Thanks @oesteban , will try this! |
Thanks again for the tip on the forkserver. I've now employed this for PyNets, and the added stability for memory handling has been pretty unreal. -derek
Great to hear it was helpful :) Please reopen if you experience further problems. |
Summary
A memory error and, relatedly, a second error which happens downstream and gives me a message to post it as a nipype issue.
Actual behavior
Memory error
Expected behavior
No memory error
How to replicate the behavior
Complex to replicate completely, but it occurs when extract_ts_wb_node is parallelized to ~20 or more threads. I have also attempted to explicitly restrict memory usage on the node in a couple of ways, with no luck; it doesn't change anything.
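The restrictions attempted were along these lines; I can't confirm the exact original snippets, so treat this as a sketch with arbitrary values, using names that appear elsewhere in this thread:

```python
# A per-node memory hint read by the MultiProc scheduler
# (the same attribute assignment mentioned elsewhere in this thread):
extract_ts_wb_node.mem_gb = 20

# Or a hard cap on the whole MultiProc run:
wb_functional_connectometry_wf.run(
    plugin='MultiProc',
    plugin_args={'memory_gb': 20, 'n_procs': 20})
```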
Here's the actual function in the node (two versions, one with caching and one without, both of which cause the workflow to break); see the graphestimation.py link under Script/Workflow details below.
Script/Workflow details
Meta-workflow ('meta') is triggered as a nested workflow in the imp_est node of single_subject_wf:
https://github.com/dPys/PyNets/blob/master/pynets/pynets_run.py
which further calls the wb_functional_connectometry workflow from:
https://github.com/dPys/PyNets/blob/master/pynets/workflows.py
and the whole thing breaks when it hits extract_ts_wb_node (line 110) of:
https://github.com/dPys/PyNets/blob/master/pynets/graphestimation.py
Platform details:
{'pkg_path': '/opt/conda/lib/python3.6/site-packages/nipype', 'commit_source': 'installation', 'commit_hash': 'fed0bd94f', 'nipype_version': '1.0.4', 'sys_version': '3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:39:56) \n[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]', 'sys_executable': '/opt/conda/bin/python', 'sys_platform': 'linux', 'numpy_version': '1.14.3', 'scipy_version': '1.1.0', 'networkx_version': '2.1', 'nibabel_version': '2.3.0', 'traits_version': '4.6.0'}
1.0.4
Execution environment
I've tried logging with the 'callback' logger and it doesn't even log anything since this bug seems to occur at several layers deep. Please help.