Very slow slicing with indexed_gzip #558

Closed

effigies opened this issue Sep 28, 2017 · 18 comments

@effigies
Member

To reproduce, you can test on an OpenFMRI dataset:

datalad install -r ///openfmri/ds000114
cd ds000114
git annex get sub-07/ses-test/anat/sub-07_ses-test_T1w.nii.gz

Install indexed_gzip (e.g. pip install indexed_gzip), then in Python:

>>> import nibabel as nib
>>> img = nib.load('ds000114/sub-07/ses-test/anat/sub-07_ses-test_T1w.nii.gz')
>>> img.shape
(256, 156, 256)
>>> img.dataobj[0]

I haven't timed it precisely, but this takes minutes. Slicing along the last dimension is fast, so the problem isn't simply reading the file.

I think the fix will be to disable indexed_gzip if nibabel.arrayproxy.KEEP_FILE_OPEN_DEFAULT is False.

effigies added the bug label Sep 28, 2017
@matthew-brett
Member

How long does it take if indexed_gzip is not installed? I'd expect that to be very slow on a gzipped file in general, because you're doing lots of little reads far apart.

@effigies
Member Author

In [23]: %time img.dataobj[0, :, :]
CPU times: user 3min 21s, sys: 3.79 s, total: 3min 25s
Wall time: 3min 25s
Out[23]: 
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)
In [26]: nib.openers.HAVE_INDEXED_GZIP=False

In [27]: x = nib.load('/data/bids/ds000114/sub-07/ses-test/anat/sub-07_ses-test_T1w.nii.gz')

In [28]: %time x.dataobj[0, :, :]
CPU times: user 376 ms, sys: 4 ms, total: 380 ms
Wall time: 380 ms
Out[28]: 
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)

@matthew-brett
Member

I hate to ask, but what happens if you run the one without indexed_gzip first?

@effigies
Member Author

New session:

In [1]: import nibabel as nib

In [2]: nib.openers.HAVE_INDEXED_GZIP=False

In [3]: img = nib.load('/data/bids/ds000114/sub-07/ses-test/anat/sub-07_ses-test_T1w.nii.gz')

In [4]: %time img.dataobj[0, :, :]
CPU times: user 420 ms, sys: 4 ms, total: 424 ms
Wall time: 424 ms
Out[4]: 
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)

@matthew-brett
Member

Does the keep file open flag make a difference here? I thought that call would result in a read with a single open file-handle anyway?

Is it possible that indexed gzip is just much slower doing these types of read than normal gzip?

@effigies
Member Author

The first version I showed was with KEEP_FILE_OPEN_DEFAULT == 'auto', but I get the same result with it set to False. To be sure, I re-ran both in fresh sessions:

In [1]: import nibabel as nib

In [2]: nib.arrayproxy.KEEP_FILE_OPEN_DEFAULT
Out[2]: False

In [3]: img = nib.load('/data/bids/ds000114/sub-07/ses-test/anat/sub-07_ses-test_T1w.nii.gz')

In [4]: %time img.dataobj[0]
CPU times: user 3min 24s, sys: 3.89 s, total: 3min 28s
Wall time: 3min 28s
Out[4]: 
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)
In [1]: import nibabel as nib

In [2]: nib.arrayproxy.KEEP_FILE_OPEN_DEFAULT
Out[2]: False

In [3]: nib.arrayproxy.KEEP_FILE_OPEN_DEFAULT='auto'

In [4]: img = nib.load('/data/bids/ds000114/sub-07/ses-test/anat/sub-07_ses-test_T1w.nii.gz')

In [5]: %time img.dataobj[0]
CPU times: user 3min 26s, sys: 3.81 s, total: 3min 30s
Wall time: 3min 30s
Out[5]: 
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)

@matthew-brett
Member

Right - and I guess you get the same result for nib.arrayproxy.KEEP_FILE_OPEN_DEFAULT=True? So it is a feature of indexed_gzip reads rather than the nibabel scaffolding?

@effigies
Member Author

Yes and yes.

I don't know the internals well enough to say why it's so much slower, but this is roughly a 500x slowdown (about 3.5 minutes vs. 400 ms) on a slice operation, and it happens by default when you upgrade to master. I'm tempted to revert for the coming release; then we can do some benchmarking and figure out where this penalty is coming from.

@matthew-brett
Member

@pauldmccarthy - any thoughts here? Is this slowdown expected? I'm sure you know, but for my own benefit, the slice that Chris is doing will read byte 0 then skip X-1 bytes (where X is the length of the first axis) and read another byte, then skip another X-1 bytes, and so on, doing this a total of Y * Z times (where Y and Z are the lengths of the second and third axes).
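To make that concrete, here is a hypothetical sketch (not nibabel's actual code) of the reads implied by img.dataobj[0] on this (256, 156, 256) float32 image, where fileobj stands in for the open compressed stream and offset for the start of the data block:

X, Y, Z = 256, 156, 256
itemsize = 4  # bytes per float32 voxel

def read_first_x_slice(fileobj, offset):
    # NIfTI data is Fortran-ordered, so img.dataobj[0] means one tiny
    # read per (y, z) pair: Y * Z = 39936 reads in total, with
    # consecutive reads X * itemsize bytes apart.
    values = []
    for n in range(Y * Z):
        fileobj.seek(offset + n * X * itemsize)
        values.append(fileobj.read(itemsize))  # a single voxel
    return values

Each read is small and far from the previous one, which is the pathological case for a compressed stream.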

@pauldmccarthy
Contributor

Hey guys, sorry for the delay - I just got back from holiday. I can reproduce this - indexed_gzip is taking much longer than gzip for this slicing. I had not seen this before, as in FSLeyes I only use indexed_gzip for images bigger than around 1GB uncompressed.

As I see it, the main problem here is that if indexed_gzip is present, it gets used, even if keep_file_open is False. This is probably sub-optimal (and I think @effigies raised it in the PR discussion, but I forgot to address it).

I will open a new PR such that indexed_gzip only gets used if keep_file_open is True. This will need a flag to be passed through to Opener.__init__, so it knows which class to instantiate (GzipFile or IndexedGzipFile).
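A minimal sketch of that logic (hypothetical helper name and simplified constructor; the actual change will live in nibabel's Opener machinery):

import gzip

try:
    from indexed_gzip import SafeIndexedGzipFile
    HAVE_INDEXED_GZIP = True
except ImportError:
    HAVE_INDEXED_GZIP = False

def open_gzipped(filename, mode='rb', keep_open=False):
    # Only pay for indexed_gzip's index building when the handle will
    # be kept open and reused; otherwise fall back to the stdlib reader.
    if HAVE_INDEXED_GZIP and keep_open and mode == 'rb':
        return SafeIndexedGzipFile(filename)
    return gzip.GzipFile(filename, mode)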

In the meantime, I'll do some debugging to see if I can get any speed up in indexed_gzip, or if this is just the best that it can do.

@effigies
Member Author

effigies commented Oct 2, 2017

In addition to disabling SafeIndexedGzipFile for keep_file_open=False, would it be possible to add a cache to SafeIndexedGzipFile to reduce redundant decompression?

A reasonable minimum might be a 256^3 volume of float32s (64 MiB + <1 KiB header), as that's about the largest structural image you're likely to get. I'd suggest 2 or 4 times that to capture a reasonable block of functional volumes, but that might be too large once you're dealing with many files, unless you have a library-wide cache.
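As a quick sanity check on that sizing (plain arithmetic, not nibabel code):

>>> 256 ** 3 * 4 / 2 ** 20  # voxel count times 4 bytes per float32, in MiB
64.0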

@pauldmccarthy
Contributor

Hey @effigies, I will look into building a cache into indexed_gzip. In the meantime, I have started a PR (#562) which hopefully provides an interim solution - adjusting the default behaviour so that indexed_gzip is only ever used if it is present and requested.

@pauldmccarthy
Contributor

Update: Wrapping the SafeIndexedGzipFile with an io.BufferedReader seems to resolve this problem entirely, giving performance comparable to the built-in gzip.GzipFile class. I'll post some more quantitative benchmarks later today, and will add this functionality into indexed_gzip and cut a new release later this week.
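For illustration, the wrapping amounts to something like the sketch below, assuming SafeIndexedGzipFile exposes the raw, readable interface that io.BufferedReader expects:

import io
from indexed_gzip import SafeIndexedGzipFile

# Buffering means runs of small, nearby reads are served from an
# in-memory buffer instead of each one hitting the decompressor.
fobj = io.BufferedReader(SafeIndexedGzipFile('sub-07_ses-test_T1w.nii.gz'),
                         buffer_size=1048576)  # 1 MiB buffer

data = fobj.read(348)  # e.g. the fixed NIfTI-1 header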

@matthew-brett
Member

Paul - great! - well done. Any chance of putting the benchmarks into the PR so we can run them from time to time?

@pauldmccarthy
Contributor

@matthew-brett great idea - I will add some unit tests into #562 as well :)

@pauldmccarthy
Contributor

I get promising results using the benchmark script below; times are averages over 100 iterations of slicing dataobj[0] and dataobj[:, :, 0]:

gzip         (keep_file_open=False, slice=(0,)):                                                  0 minutes, 0.5148 seconds (niters: 100, hash: 6f0e0dc99855ada093e69c7b602b8d2a)
indexed_gzip (keep_file_open=True,  slice=(0,)):                                                  0 minutes, 0.6076 seconds (niters: 100, hash: 6f0e0dc99855ada093e69c7b602b8d2a)
indexed_gzip (keep_file_open=auto,  slice=(0,)):                                                  0 minutes, 0.5853 seconds (niters: 100, hash: 6f0e0dc99855ada093e69c7b602b8d2a)
gzip         (keep_file_open=False, slice=(slice(None, None, None), slice(None, None, None), 0)): 0 minutes, 0.5011 seconds (niters: 100, hash: 6f0e0dc99855ada093e69c7b602b8d2a)
indexed_gzip (keep_file_open=True,  slice=(slice(None, None, None), slice(None, None, None), 0)): 0 minutes, 0.5856 seconds (niters: 100, hash: 6f0e0dc99855ada093e69c7b602b8d2a)
indexed_gzip (keep_file_open=auto,  slice=(slice(None, None, None), slice(None, None, None), 0)): 0 minutes, 0.5847 seconds (niters: 100, hash: 6f0e0dc99855ada093e69c7b602b8d2a)

This is running against my #562 branch of nibabel and the master branch of pauldmccarthy/indexed_gzip. I have updated indexed_gzip a bit to make buffering possible, so I will do a new release today or tomorrow, and will drop a comment here when it is ready!

import time
import hashlib
import nibabel as nib

filename = 'sub-07_ses-test_T1w.nii.gz'

# (label for summary, keep_file_open,  sliceobj)
tests = [
    ('gzip',          False, (0, )),
    ('indexed_gzip',  True,  (0, )),
    ('indexed_gzip', 'auto', (0, )),
    ('gzip',          False, (slice(None), slice(None), 0)),
    ('indexed_gzip',  True,  (slice(None), slice(None), 0)),
    ('indexed_gzip', 'auto', (slice(None), slice(None), 0)),
]

for label, keep_open, sliceobj in tests:

    niters     = 100
    start      = time.time()
    expectHash = None

    for i in range(niters):
        hashobj = hashlib.md5()
        image   = nib.load(filename, keep_file_open=keep_open)

        # hash the slice under test so we can check reads are consistent
        hashobj.update(image.dataobj[sliceobj].ravel())

        if expectHash is None:
            expectHash = hashobj.hexdigest()
        elif hashobj.hexdigest() != expectHash:
            print('WARNING: {} / {} failed hash check'.format(label, i))

    end      = time.time()
    elapsed  = end - start
    elapsed /= niters

    print('{} (keep_file_open={}, slice={}): {} minutes, '
          '{:0.4f} seconds (niters: {}, hash: {})'.format(
              label,
              keep_open,
              sliceobj,
              int(elapsed / 60),
              elapsed % 60.0,
              niters,
              expectHash))

@effigies
Member Author

effigies commented Oct 3, 2017

Those numbers look great. Thanks a lot for this, @pauldmccarthy!

I assume you're also benchmarking on large time series, to make sure we're not regressing in time or memory?

@pauldmccarthy
Contributor

I've released a new version of indexed_gzip (0.6.1), which should resolve this problem. I'm still working on a benchmark script, which I will add to PR #562.
