
Memory error in a nested meta-workflow #2622


Closed
dPys opened this issue Jun 20, 2018 · 21 comments

@dPys

dPys commented Jun 20, 2018

Summary

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 68, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 480, in run
    result = self._run_interface(execute=True)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 564, in _run_interface
    return self._run_command(execute)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 644, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/opt/conda/lib/python3.6/site-packages/nipype/interfaces/base/core.py", line 521, in run
    runtime = self._run_interface(runtime)
  File "/opt/conda/lib/python3.6/site-packages/nipype/interfaces/utility/wrappers.py", line 144, in _run_interface
    out = function_handle(**args)
  File "<string>", line 13, in extract_ts_coords
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 275, in fit_transform
    return self.fit().transform(imgs, confounds=confounds)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/base_masker.py", line 176, in transform
    return self.transform_single_imgs(imgs, confounds)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 321, in transform_single_imgs
    verbose=self.verbose)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 483, in __call__
    return self._cached_call(args, kwargs)[0]
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 430, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 675, in call
    output = self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/base_masker.py", line 66, in filter_and_extract
    imgs = _utils.check_niimg(imgs, atleast_4d=True, ensure_ndim=4)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/_utils/niimg_conversions.py", line 271, in check_niimg
    niimg = load_niimg(niimg, dtype=dtype)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/_utils/niimg.py", line 116, in load_niimg
    dtype = _get_target_dtype(niimg.get_data().dtype, dtype)
  File "/opt/conda/lib/python3.6/site-packages/nibabel/dataobj_images.py", line 202, in get_data
    data = np.asanyarray(self._dataobj)
  File "/opt/conda/lib/python3.6/site-packages/numpy/core/numeric.py", line 544, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/opt/conda/lib/python3.6/site-packages/nibabel/arrayproxy.py", line 293, in __array__
    raw_data = self.get_unscaled()
  File "/opt/conda/lib/python3.6/site-packages/nibabel/arrayproxy.py", line 288, in get_unscaled
    mmap=self._mmap)
  File "/opt/conda/lib/python3.6/site-packages/nibabel/volumeutils.py", line 523, in array_from_file
    data_bytes = bytearray(n_bytes)
MemoryError

and relatedly:

	Node: meta.wb_functional_connectometry.extract_ts_wb_coords_node
	Interface: nipype.interfaces.utility.wrappers.Function
	Traceback:
Traceback (most recent call last):

  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/plugins/base.py", line 338, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 303, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 497, in _get_hashval
    self._get_inputs()

  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 524, in _get_inputs
    results = loadpkl(results_file)

  File "/opt/conda/lib/python3.6/site-packages/nipype/utils/filemanip.py", line 646, in loadpkl
    unpkl = pickle.load(pkl_file)

MemoryError

which happens downstream and produces a message telling me to post it as a nipype issue.

Actual behavior

Memory error

Expected behavior

No memory error

How to replicate the behavior

It is complex to replicate completely, but it occurs when extract_ts_wb_node is parallelized across ~20 or more threads. I have also attempted to explicitly restrict memory usage on the node, with no luck:

extract_ts_wb_node.interface.mem_gb = 2
extract_ts_wb_node.interface.num_threads = 1

or even:

extract_ts_wb_node.interface.mem_gb = 20
extract_ts_wb_node.interface.num_threads = 2

Neither changes anything.

Here's the actual call in the node (two versions, one with caching and one without; both cause the workflow to break):

version A) spheres_masker = input_data.NiftiSpheresMasker(seeds=coords, radius=float(node_size), allow_overlap=True, standardize=True, verbose=1, memory="%s%s" % ('SpheresMasker_cache_', str(ID)), memory_level=2)

version B) spheres_masker = input_data.NiftiSpheresMasker(seeds=coords, radius=float(node_size), allow_overlap=True, standardize=True, verbose=1)
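
For context, a stripped-down, self-contained version of what that call does in isolation (the file path, coordinates, and radius below are hypothetical placeholders):

from nilearn import input_data

# Hypothetical inputs standing in for the node's actual arguments
func_file = '/tmp/filtered_func_data.nii'                  # 4D functional image (a file path, not an in-memory image)
coords = [(0, -52, 18), (-46, -68, 32), (46, -68, 32)]     # seed coordinates in MNI space (mm)
node_size = 4                                              # sphere radius in mm

# Version B (no joblib caching); fit_transform() loads the full 4D image into memory
spheres_masker = input_data.NiftiSpheresMasker(seeds=coords, radius=float(node_size),
                                               allow_overlap=True, standardize=True, verbose=1)
ts_within_nodes = spheres_masker.fit_transform(func_file)  # array of shape (n_volumes, n_seeds)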

Script/Workflow details

Meta-workflow ('meta') is triggered as a nested workflow in the imp_est node of single_subject_wf:
https://github.com/dPys/PyNets/blob/master/pynets/pynets_run.py

which further calls the wb_functional_connectometry workflow from:
https://github.com/dPys/PyNets/blob/master/pynets/workflows.py

and the whole thing breaks when it hits extract_ts_wb_node (line 110) of:
https://github.com/dPys/PyNets/blob/master/pynets/graphestimation.py

Platform details:

{'pkg_path': '/opt/conda/lib/python3.6/site-packages/nipype', 'commit_source': 'installation', 'commit_hash': 'fed0bd94f', 'nipype_version': '1.0.4', 'sys_version': '3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:39:56) \n[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]', 'sys_executable': '/opt/conda/bin/python', 'sys_platform': 'linux', 'numpy_version': '1.14.3', 'scipy_version': '1.1.0', 'networkx_version': '2.1', 'nibabel_version': '2.3.0', 'traits_version': '4.6.0'}
1.0.4

Execution environment

  • Container: Singularity container
  • My python environment inside container: python3.6
  • My python environment outside container: None

I've tried logging with the 'callback' logger and it doesn't even log anything, since this bug seems to occur several layers deep. Please help.

@dPys dPys changed the title Memory errors in a meta-workflow Memory error in a nested-meta-nested workflow Jun 20, 2018
@dPys dPys changed the title Memory error in a nested-meta-nested workflow Memory error in a nested meta-workflow Jun 20, 2018
@mgxd
Member

mgxd commented Jun 20, 2018

@dPys could you try setting the memory of the node directly?
extract_ts_wb_node.mem_gb = 2
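
For example, something along these lines, where the per-node hints are read by MultiProc relative to the pool-wide limits in plugin_args (a sketch; the node and function names are placeholders):

import nipype.pipeline.engine as pe
import nipype.interfaces.utility as niu

def extract_ts_coords_stub(func_file):
    # placeholder for the real time-series extraction function
    return func_file

# Resource hints given when the Node is constructed (sketch)
extract_ts_wb_node = pe.Node(niu.Function(input_names=['func_file'], output_names=['out'],
                                          function=extract_ts_coords_stub),
                             name='extract_ts_wb_coords_node',
                             mem_gb=2,     # estimated peak memory for this node, in GB
                             n_procs=1)    # threads this node is expected to use

# MultiProc schedules a node only when its hints fit under these pool-wide ceilings
# wf.run(plugin='MultiProc', plugin_args={'n_procs': 8, 'memory_gb': 16})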

@dPys
Author

dPys commented Jun 20, 2018

Thanks @mgxd for looking into this and for the quick reply. Unfortunately I had tried setting mem_gb on the node interface previously with no luck. I also tried various approaches to explicitly set the number of threads (see the PyNets workflows.py module).

Since posting this issue earlier, I've actually managed to home in on the error a bit further. It appears to be a nibabel error equivalent to what @oesteban reported a few months ago when debugging a memory leak in fmriprep (nipreps/fmriprep#766).

It seems that when a function node loads a NIfTI file using nibabel (as is done inside NiftiSpheresMasker), and this happens in parallel under the MultiProc plugin so that many 4D NIfTIs are loaded simultaneously, memory usage temporarily spikes and the workflow fails. This can occur even if I explicitly set extract_ts_wb_node.mem_gb = 20 or more! It appears to depend entirely on how many MultiProc threads are running and on the size of the NIfTI file inputs...

I'm actually not really sure how easily this can be resolved, since the memory requirements of the function node scale rather unpredictably as a result... Also, the np.asarray(img.dataobj) trick described once upon a time by @GaelVaroquaux won't work here, since NiftiSpheresMasker takes a file path as input (i.e. as opposed to a nibabel image object/data array).

Other thoughts?

@effigies
Member

Derek, you can reduce memory usage by making sure the input file is uncompressed.

@dPys
Author

dPys commented Jun 21, 2018

Hi @effigies , thanks for looking into this too.

So I just tried this (i.e. decompressing all nifti inputs that feed into this function node, both within and outside of the workflow) and still no luck.

180621-03:35:03,61 workflow ERROR:
	 Node extract_ts_wb_coords_node.bI.b1.c16 failed to run on host c455-052.stampede2.tacc.utexas.edu.
180621-03:35:03,63 workflow ERROR:
	 Saving crash info to /scratch/04171/dpisner/pynets_test_data/func/010002/01_AP/Default_k250/crash-20180621-033503-dpisner-extract_ts_wb_coords_node.bI.b1.c16-4f1bd5d1-2812-49c1-bafc-56fbbba89b09.pklz
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 68, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 480, in run
    result = self._run_interface(execute=True)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 564, in _run_interface
    return self._run_command(execute)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 644, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/opt/conda/lib/python3.6/site-packages/nipype/interfaces/base/core.py", line 521, in run
    runtime = self._run_interface(runtime)
  File "/opt/conda/lib/python3.6/site-packages/nipype/interfaces/utility/wrappers.py", line 144, in _run_interface
    out = function_handle(**args)
  File "<string>", line 15, in extract_ts_coords
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 275, in fit_transform
    return self.fit().transform(imgs, confounds=confounds)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/base_masker.py", line 176, in transform
    return self.transform_single_imgs(imgs, confounds)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 321, in transform_single_imgs
    verbose=self.verbose)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 483, in __call__
    return self._cached_call(args, kwargs)[0]
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 430, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 675, in call
    output = self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/base_masker.py", line 98, in filter_and_extract
    memory_level=memory_level)(imgs)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 483, in __call__
    return self._cached_call(args, kwargs)[0]
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 430, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 675, in call
    output = self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 135, in __call__
    mask_img=self.mask_img)):
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 111, in _iter_signals_from_spheres
    mask_img=mask_img)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 40, in _apply_mask_and_get_affinity
    X = niimg.get_data().reshape([-1, niimg.shape[3]]).T
MemoryError

Other thoughts?

This is actually starting to look less and less like a nipype or nibabel issue, and more and more like a nilearn issue.

I wonder if it would help to revise this one-liner in NiftiSpheresMasker:

niimg.get_data().reshape([-1, niimg.shape[3]]).T

to:

np.asarray(niimg.dataobj).reshape([-1, niimg.shape[3]]).T

(Testing this now on a forked version...)
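
For context on why I expected that to help: get_data() caches the full array on the image object for the image's lifetime, whereas reading through dataobj does not, e.g. (hypothetical path):

import numpy as np
import nibabel as nib

img = nib.load('/tmp/func.nii')       # hypothetical 4D image

data = np.asarray(img.dataobj)        # reads the data without caching it on the image
del data                              # this memory can be reclaimed immediately

cached = img.get_data()               # reads AND caches the array on the image object
del cached                            # img still holds a reference, so nothing is freed
img.uncache()                         # the cached copy must be dropped explicitly

Either way, though, the full 4D array still has to be materialized at least once, so this may not be enough to prevent the spike.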

@dPys
Author

dPys commented Jun 21, 2018

And the above tweak to NiftiSpheresMasker fails as well:

180621-04:06:06,628 workflow ERROR:
	 Node extract_ts_wb_coords_node.bI.b0.c00 failed to run on host c455-052.stampede2.tacc.utexas.edu.
180621-04:06:06,630 workflow ERROR:
	 Saving crash info to /scratch/04171/dpisner/pynets_test_data/func/010002/01_AP/Default_k250/crash-20180621-040606-dpisner-extract_ts_wb_coords_node.bI.b0.c00-e33fd3b3-5267-497a-b9a1-89da6cacebf9.pklz
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 68, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 480, in run
    result = self._run_interface(execute=True)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 564, in _run_interface
    return self._run_command(execute)
  File "/opt/conda/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 644, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/opt/conda/lib/python3.6/site-packages/nipype/interfaces/base/core.py", line 521, in run
    runtime = self._run_interface(runtime)
  File "/opt/conda/lib/python3.6/site-packages/nipype/interfaces/utility/wrappers.py", line 144, in _run_interface
    out = function_handle(**args)
  File "<string>", line 15, in extract_ts_coords
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 275, in fit_transform
    return self.fit().transform(imgs, confounds=confounds)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/base_masker.py", line 176, in transform
    return self.transform_single_imgs(imgs, confounds)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/nifti_spheres_masker.py", line 321, in transform_single_imgs
    verbose=self.verbose)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 483, in __call__
    return self._cached_call(args, kwargs)[0]
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 430, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 675, in call
    output = self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/input_data/base_masker.py", line 66, in filter_and_extract
    imgs = _utils.check_niimg(imgs, atleast_4d=True, ensure_ndim=4)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/_utils/niimg_conversions.py", line 271, in check_niimg
    niimg = load_niimg(niimg, dtype=dtype)
  File "/opt/conda/lib/python3.6/site-packages/nilearn/_utils/niimg.py", line 116, in load_niimg
    dtype = _get_target_dtype(niimg.get_data().dtype, dtype)
  File "/opt/conda/lib/python3.6/site-packages/nibabel/dataobj_images.py", line 202, in get_data
    data = np.asanyarray(self._dataobj)
  File "/opt/conda/lib/python3.6/site-packages/numpy/core/numeric.py", line 544, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/opt/conda/lib/python3.6/site-packages/nibabel/arrayproxy.py", line 293, in __array__
    raw_data = self.get_unscaled()
  File "/opt/conda/lib/python3.6/site-packages/nibabel/arrayproxy.py", line 288, in get_unscaled
    mmap=self._mmap)
  File "/opt/conda/lib/python3.6/site-packages/nibabel/volumeutils.py", line 509, in array_from_file
    offset=offset)
  File "/opt/conda/lib/python3.6/site-packages/numpy/core/memmap.py", line 264, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 12] Cannot allocate memory

This really seems to be a problem of loading too many images into memory at once when MultiProc is used. As a band-aid fix, I could add sleep(randint(1, 4)) or something similar within the node before the function call to stagger the image loads, but that would be sloppy and not a good long-term solution...

Is nibabel's img.get_data() designed to scale with nipype's MultiProc (or vice versa)?

@satra
Member

satra commented Jun 27, 2018

@dPys - i'm going to bring @oesteban into this discussion. as far as i understand, if you inform each node about how much memory it will need, the multiproc plugin will take this into account. however, you need to be reasonably precise about how much memory that is.

the amount of memory will depend on the process, so i would suggest monitoring a single process by running the entire workflow in linear mode to determine the consumption of each process.

@dPys
Author

dPys commented Jun 27, 2018

Thanks @satra !

Will try linear mode, monitor consumption, and report back. Is nipype.utils.profiler.log_nodes_cb with the 'callback' logger still the tool to use for profiling? If so, I'm guessing that I should do this at each workflow layer separately (i.e. 'init_single_subject_wf', the nested meta-workflow 'meta', and the nested meta-nested workflow 'wb_functional_connectometry_wf')? Each of these layers is currently using MultiProc.
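
(For reference, this is roughly the setup I have in mind, following the nipype profiling docs; the log path is a placeholder:)

import logging
from nipype.utils.profiler import log_nodes_cb

# Route per-node runtime stats through the 'callback' logger to a file
callback_log_path = '/tmp/run_stats.log'
logger = logging.getLogger('callback')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.FileHandler(callback_log_path))

# Hand the callback to MultiProc so each node is logged as it finishes
# wf.run(plugin='MultiProc',
#        plugin_args={'n_procs': 8, 'memory_gb': 16, 'status_callback': log_nodes_cb})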

derek;

@mgxd
Member

mgxd commented Jun 27, 2018

@dPys have you tried enabling the resource monitor config option for your master workflow? It may help isolate the memory consumption
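
Something along these lines should turn it on (a sketch; double-check the option names against the nipype config docs):

from nipype import config

# Enable per-node CPU/memory sampling before building/running the workflow
config.enable_resource_monitor()

# Optionally tune the sampling interval via the 'monitoring' config section
config.update_config({'monitoring': {'sample_frequency': '0.5'}})

If I remember correctly, each node's runtime then records its peak memory and CPU usage, which should tell you which nodes actually need larger mem_gb estimates.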

@dPys
Author

dPys commented Jun 27, 2018

Thanks @mgxd , wasn't aware of the resource monitor config option. Will incorporate this and see what happens.

@dPys
Author

dPys commented Jun 28, 2018

Okay, so after some testing, there doesn't seem to be any obvious issue with scaling the NiftiSpheresMasker function itself (see issue nilearn/nilearn#1663, where this is simultaneously being addressed). Although using very high levels of parallelism is probably not what NiftiSpheresMasker was originally intended for, and I've yet to figure out clear-cut heuristics for how its memory requirements scale with the number and size of network nodes, a first pass of profiling makes one thing clear: it tends to consume more memory than the other function nodes in the pynets workflow. For this reason, I think this particular memory error is just exposing a broader issue with how memory is currently handled in pynets in general.

The most obvious culprit may be that the NiftiSpheresMasker function Node (i.e. extract_ts_wb_node) lives in a nested meta-workflow that is:

  1. disconnected from its master workflow(s), a problem that remains difficult to solve given the complex array of iterfields that needs to be compiled in a custom manner (hence the need for a meta-workflow at all; see Best practices for linking outputs of nested workflows into parent workflows? #2423); and, relatedly:

  2. configured to use MultiProc, which is redundant with the MultiProc that runs the master workflow and leads to unpredictable behavior with the nipype scheduler.

Ways I might work around this? @oesteban ?

@oesteban
Contributor

oesteban commented Jul 2, 2018

Hi

I guess that a solution to 1 will need to wait for nipype 2.0. So there's little to do there.

Regarding 2, I would configure the nested workflow to use the Linear plugin. As far as I understand, the problem is several NiftiSpheresMasker instances running at the same time. Alternatively, you can have the nested workflow run with MultiProc and set a very large mem_gb property on the node, say mem_gb=128, so that you ensure only one NiftiSpheresMasker is run at a given time.

You can combine that with (or alternatively use) the num_procs property of the Node. I'm guessing that NiftiSpheresMasker uses joblib for parallelization, so (in theory) it should be compatible.

Can you elaborate more on the "unpredictable behavior with the nipype scheduler"? If there is a problem there we need to identify it and fix it.
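
For concreteness, the two options would look roughly like this (using the node/workflow names from this thread; treat it as a sketch):

# Option A: run the nested workflow serially, so only one NiftiSpheresMasker is in memory at a time
wb_functional_connectometry_wf.run(plugin='Linear')

# Option B: keep MultiProc, but make the masker node "cost" the whole memory budget so the
# scheduler never runs two of them concurrently
extract_ts_wb_node.mem_gb = 128
meta_wf.run(plugin='MultiProc', plugin_args={'n_procs': 8, 'memory_gb': 128})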

@dPys
Author

dPys commented Jul 2, 2018

Thanks @oesteban .

Along the lines of what you mentioned, I think I've found a working solution. Limiting the nested workflow to the Linear plugin was not really an option, since that would greatly restrict the parallelization of the workflow as a whole (the majority of iterables are in the meta-wf's nested workflow). Instead, I figured out that it was possible to limit the num_threads and mem_gb attributes of the nested workflow's nodes at the level of the meta-workflow. For some reason, when using MultiProc with a meta-workflow, nipype's default scheduling behavior is to allocate fractions of a thread and of a GB of memory to each node? The easiest solution was therefore to explicitly set num_threads and mem_gb in two places: 1) on the function Nodes within the meta-workflow's nested workflow; and 2) on the meta-workflow's 'meta-nodes', as I've done below:

meta_wf.get_node("%s%s%s" % ('wb_functional_connectometry_', ID, '.WB_fetch_nodes_and_labels_node'))._num_threads = 1
meta_wf.get_node("%s%s%s" % ('wb_functional_connectometry_', ID, '.WB_fetch_nodes_and_labels_node'))._mem_gb = 2
meta_wf.get_node("%s%s%s" % ('wb_functional_connectometry_', ID, '.extract_ts_wb_coords_node'))._num_threads = 1
meta_wf.get_node("%s%s%s" % ('wb_functional_connectometry_', ID, '.extract_ts_wb_coords_node'))._mem_gb = 2
meta_wf.get_node("%s%s%s" % ('wb_functional_connectometry_', ID, '.thresh_and_fit_node'))._num_threads = 1
meta_wf.get_node("%s%s%s" % ('wb_functional_connectometry_', ID, '.thresh_and_fit_node'))._mem_gb = 2

The other thing that probably needed to be done was to reserve a single core and a GB of memory for the meta-wf and master-wf, so that they can keep operating as low-resource background processes:

plugin_args = {'n_procs': int(procmem[0])-1, 'memory_gb': int(procmem[1])-1}
egg = meta_wf.run(plugin='MultiProc', plugin_args=plugin_args)

...These tweaks appear to have fixed the issue, but I am still testing so will let you know for sure shortly.

Cheers everyone,
derek;

@oesteban
Contributor

oesteban commented Jul 3, 2018

One further improvement would be to switch to forkserver.

In fmriprep we ran into this problem very often and made the following changes:

  • Generate the workflow graph in a separate process, then wipe out that process (freeing any memory that is not needed anymore).
  • Use the forkserver mode of multiprocessing (sketched just below). With the default mode (fork), workers start with a memory allocation equal to the parent's current memory footprint, and with large workflows we've seen that this allocation can easily snowball. In forkserver mode, workers start with a memory allocation equal to the footprint at the time the server process was created (i.e. right after calling workflow.run()).
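
In code, the second point is a single call at the very top of the entry point, before any process pools exist (standard library multiprocessing; a sketch):

import multiprocessing as mp

if __name__ == '__main__':
    # Must be called once, before any Pool/Process is created
    mp.set_start_method('forkserver')
    # ... build and run the nipype workflow after this point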

@dPys
Author

dPys commented Jul 3, 2018

yup! but with a 1:4 ratio of threads:mem on the extract_ts_wb_coords_node 🚀

@dPys
Author

dPys commented Jul 3, 2018

Also, great suggestion about the forkserver. Will try this in the coming weeks!

@dPys
Author

dPys commented Jul 3, 2018

Make that a 1:3 ratio of threads:mem. Issue closed!

@dPys dPys closed this as completed Jul 3, 2018
@dPys
Author

dPys commented Jul 20, 2018

Hi @oesteban ,

PyNets is now optimized to scale across any number of cores for single-subject workflows. When I run it in multi-subject mode, however, I've found that the memory allocation snowballs into memory errors, as you mentioned, if I allocate anything more than the resources of a single node (i.e. beyond shared memory). What would be the easiest way to get started transitioning from MultiProc to a forkserver?

-Derek

@dPys dPys reopened this Jul 22, 2018
@oesteban
Contributor

This line sets the forkserver mode:

https://github.com/poldracklab/fmriprep/blob/85c06a3dc64efc3c6e231ad21911a44e047f2587/fmriprep/cli/run.py#L237

Then, this section:

https://github.com/poldracklab/fmriprep/blob/85c06a3dc64efc3c6e231ad21911a44e047f2587/fmriprep/cli/run.py#L266-L282

Builds the workflow in a separate process. The workflow is returned, and the process finishes and is cleaned up (i.e. freeing all the memory it took). That way you start your forks with the minimum memory possible.

Then you just need to use multiproc in the traditional way.

This is all assuming that the forkserver mode still works after #2598. Otherwise, only the second part of this suggestion will be useful (creating the workflow in a separate process), although pretty limited.
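
Roughly, the pattern behind those fmriprep lines is the following (a sketch with hypothetical builder/function names, not PyNets' actual code):

from multiprocessing import Process, Manager

def build_workflow(retval):
    # Hypothetical builder: do the heavy imports and graph construction here,
    # so their memory cost disappears when this process exits
    from pynets import workflows                      # hypothetical import
    retval['workflow'] = workflows.build_meta_wf()    # hypothetical builder call

if __name__ == '__main__':
    with Manager() as mgr:
        retval = mgr.dict()
        p = Process(target=build_workflow, args=(retval,))
        p.start()
        p.join()                                      # builder exits; its memory is freed
        wf = retval['workflow']
    # Then run with MultiProc as usual (ideally combined with the forkserver start method set earlier)
    wf.run(plugin='MultiProc', plugin_args={'n_procs': 8, 'memory_gb': 16})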

@dPys
Author

dPys commented Jul 24, 2018

Thanks @oesteban , will try this!

@mgxd mgxd added the question label Jul 27, 2018
@dPys
Author

dPys commented Aug 9, 2018

@oesteban ,

Thanks again for the tip on the forkserver. I've now employed this for pynets and the added stability for memory handling has been pretty unreal. See:
forkserver call

derek;

@oesteban
Contributor

oesteban commented Aug 9, 2018

Great to hear it was helpful :)

Please reopen if you experience further problems.
