
Only use indexed_gzip when explicitly requested #562


Merged: effigies merged 10 commits into nipy:master from pauldmccarthy:indexed_gzip_usage on Oct 6, 2017

Conversation

pauldmccarthy
Contributor

This PR aims to address issue #558 - it should have been a part of PR #552

For small files, the current version of indexed_gzip is much slower to use than the built-in GzipFile class. Therefore, indexed_gzip should not be used unless it is requested (via the keep_file_open flag, which is interpreted by the nibabel.arrayproxy.ArrayProxy class). In all other circumstances, the built-in gzip.GzipFile class should be used instead.

I have made this change by having the ArrayProxy pass a "hint" to the Opener class, indicating whether the file handle is going to be kept open for multiple accesses, or whether it is just for a one-time access. The Opener class (actually, the nibabel.openers._gzip_open function) then decides whether or not to use indexed_gzip.
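A minimal sketch of the decision described above, not the code in this PR. `pick_gzip_class` and its `have_igzip` parameter are hypothetical names used for illustration, and the rule that indexed_gzip is only considered for read mode is inferred from the test cases quoted later in this thread.

```python
import gzip

try:
    # Optional dependency; many systems will not have it installed.
    from indexed_gzip import SafeIndexedGzipFile
    HAVE_INDEXED_GZIP = True
except ImportError:
    SafeIndexedGzipFile = None
    HAVE_INDEXED_GZIP = False


def pick_gzip_class(mode='rb', keep_open=False, have_igzip=HAVE_INDEXED_GZIP):
    """Return the class that would be used to open a gzip file.

    Hypothetical helper: indexed_gzip is chosen only when it is available,
    the caller hints that the handle will be kept open, and the file is
    opened for reading. Everything else falls back to gzip.GzipFile.
    """
    if have_igzip and keep_open and mode == 'rb':
        return SafeIndexedGzipFile
    return gzip.GzipFile


# One-time access never uses indexed_gzip, even if it is installed.
print(pick_gzip_class(mode='rb', keep_open=False).__name__)
```

The point of the hint is that the caller, not the opener, knows whether the cost of building a seek index will be repaid by repeated reads.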

@effigies
Member

effigies commented Oct 2, 2017

Thanks for this. It looks good to me on a first pass, but I don't think I'm thinking optimally this morning, so I'll have a more detailed look later, unless Matthew beats me to it.

@coveralls

Coverage increased (+0.0008%) to 96.272% when pulling 198d903 on pauldmccarthy:indexed_gzip_usage into dbe74e1 on nipy:master.

@effigies
Member

effigies commented Oct 2, 2017

The failing Travis tests are related to #556. Don't worry about those.

@codecov-io

codecov-io commented Oct 2, 2017

Codecov Report

Merging #562 into master will decrease coverage by 0.01%.
The diff coverage is 86.11%.


@@            Coverage Diff             @@
##           master     #562      +/-   ##
==========================================
- Coverage   94.33%   94.32%   -0.02%     
==========================================
  Files         177      177              
  Lines       24680    24689       +9     
  Branches     2635     2638       +3     
==========================================
+ Hits        23283    23288       +5     
- Misses        921      925       +4     
  Partials      476      476
Impacted Files                      Coverage Δ
nibabel/arrayproxy.py               98.11% <100%> (ø) ⬆️
nibabel/tests/test_arrayproxy.py    100% <100%> (ø) ⬆️
nibabel/tests/test_openers.py       98.61% <100%> (+0.03%) ⬆️
nibabel/pkg_info.py                 27.58% <50%> (ø) ⬆️
nibabel/openers.py                  80% <69.23%> (-2.08%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fa76141...39a2963.

@pauldmccarthy
Contributor Author

In light of a discussion over at #557, I've also added a check on the installed version of indexed_gzip - if it is older than 0.6.0, it is not used. 0.6.0 is currently the latest version, and is also the first version which supports Windows.

@coveralls

Coverage decreased (-0.01%) to 96.26% when pulling 33f6774 on pauldmccarthy:indexed_gzip_usage into dbe74e1 on nipy:master.


if StrictVersion(version) < StrictVersion("0.6.0"):
    raise ImportError('indexed_gzip is present, but too old '
                      '(>= 0.6.0 required): {})'.format(version))
Member


This warning won't actually be presented to the user due to the except. I'd do:

try:
    from indexed_gzip import SafeIndexedGzipFile
    HAVE_INDEXED_GZIP = True
except ImportError:
    HAVE_INDEXED_GZIP = False
else:
    from indexed_gzip import __version__ as igzip_version
    if StrictVersion(igzip_version) < StrictVersion("0.6.0"):
        warnings.warn(...)
        HAVE_INDEXED_GZIP = False
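
For reference, a self-contained variant of the check suggested above. This is a sketch under assumptions: `igzip_usable` and `parse_version` are hypothetical helpers, a plain tuple comparison stands in for `StrictVersion`, version strings are assumed to be purely numeric (such as `0.6.1`), and the warning text is borrowed from the diff under review.

```python
import warnings

MIN_VERSION = (0, 6, 0)


def parse_version(text):
    """Turn a purely numeric version string such as '0.6.1' into a tuple."""
    return tuple(int(part) for part in text.split('.')[:3])


def igzip_usable(installed_version):
    """Decide whether indexed_gzip should be used.

    installed_version is the version string, or None when the package
    could not be imported. An old version triggers a warning instead of
    an ImportError, so the message is not swallowed by an except clause.
    """
    if installed_version is None:
        return False
    if parse_version(installed_version) < MIN_VERSION:
        warnings.warn('indexed_gzip is present, but too old '
                      '(>= 0.6.0 required): {}'.format(installed_version))
        return False
    return True


print(igzip_usable('0.6.1'))
```

Raising `ImportError` inside the `try` body would be caught by the same `except ImportError` that handles a missing package, which is why the warning is moved into the `else` branch.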

@coveralls

Coverage decreased (-0.01%) to 96.256% when pulling a7cc3c8 on pauldmccarthy:indexed_gzip_usage into dbe74e1 on nipy:master.

Member

@effigies effigies left a comment


This looks good. One question and a minor suggestion.

@@ -117,6 +130,9 @@ class Opener(object):
    default_compresslevel = 1
    #: whether to ignore case looking for compression extensions
    compress_ext_icase = True
    #: hint which tells us whether the file handle will be kept open for
    # multiple reads/writes, or just for one-time access.
    default_keep_open = False
Member


Is this class variable intended to be changed by a power user? Under what circumstances might I want to change that?

Contributor Author


I'm not actually sure - I was just following the convention set by the other kwargs. I can't really think of a use-case for changing it, so I will change it to default to False in __init__.

# Default keep_open hint
if 'keep_open' in arg_names:
    if 'keep_open' not in kwargs:
        kwargs['keep_open'] = self.default_keep_open
Member


Can replace the internal if block with:

kwargs.setdefault('keep_open', self.default_keep_open)
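
A small illustration of why the two forms are interchangeable. The function names are hypothetical and a plain `default` argument stands in for `self.default_keep_open`.

```python
def with_if_block(kwargs, default=False):
    """Original form: assign the default only when the key is missing."""
    kwargs = dict(kwargs)  # copy so the caller's dict is left untouched
    if 'keep_open' not in kwargs:
        kwargs['keep_open'] = default
    return kwargs


def with_setdefault(kwargs, default=False):
    """Suggested form: dict.setdefault does the same check-and-assign."""
    kwargs = dict(kwargs)
    kwargs.setdefault('keep_open', default)
    return kwargs


for case in ({}, {'keep_open': True}, {'mode': 'rb'}):
    assert with_if_block(case) == with_setdefault(case)
```

In both forms an explicitly passed `keep_open` value wins over the default.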

(True, {'mode' : 'rb', 'keep_open' : False}, GzipFile),
(True, {'mode' : 'wb', 'keep_open' : True}, GzipFile),
(True, {'mode' : 'wb', 'keep_open' : False}, GzipFile),
]
Member


Nice.

@effigies
Member

effigies commented Oct 5, 2017

Since 0.6.1 was released, now getting issues with this line:

from indexed_gzip import IndexedGzipFile

Should probably make it SafeIndexedGzipFile.

@pauldmccarthy
Contributor Author

Aah true - I changed the inheritance hierarchy so that SafeIndexedGzipFile is no longer a sub-class of IndexedGzipFile (but rather, a subclass of io.BufferedReader).

You were testing against the master branch, right? I didn't pick this up on the PR branch, because indexed_gzip will no longer get used unless requested.

I've been quite busy this week, so still have a bit of work to do on the benchmarking code (and have had to make a few changes to other benchmark modules to get them to run). But I should be able to finalise the outstanding issues before the end of the week.

@pauldmccarthy
Contributor Author

Ok, sorry for the delay ... I've knocked up a benchmarking script which compares ArrayProxy-based slicing on:

  • GzipFile/keep_file_open=False
  • GzipFile/keep_file_open=True
  • SafeIndexedGzipFile/keep_file_open=True

Each of these is compared against the time taken to perform an equivalent ArrayProxy-based slice on an uncompressed/mem-mapped image.

In its current form the script takes about 30 minutes to run. Some representative results are as follows:

+-------------------------------------------------------+------------------+------------------+------------------+------------------+
|                                                       |       Time       |  Baseline time   |    Time ratio    | Memory deviation |
+=======================================================+==================+==================+==================+==================+
| Type gzip, keep_open False, slice [?, :, :, :]        |  6.48            |  2.28            |  2.84            |  4.89            |
| Type gzip, keep_open False, slice [:, :, :, ?]        |  0.51            |  0.00            | 683.49           |  0.00            |
| Type gzip, keep_open False, slice [?, ?, ?, :]        |  1.01            |  0.00            | 1588.11          |  0.00            |
| Type gzip, keep_open True, slice [?, :, :, :]         |  6.53            |  2.27            |  2.88            |  0.06            |
| Type gzip, keep_open True, slice [:, :, :, ?]         |  0.36            |  0.00            | 476.13           |  0.00            |
| Type gzip, keep_open True, slice [?, ?, ?, :]         |  1.00            |  0.00            | 1639.86          |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, :, :, :] |  4.26            |  2.25            |  1.89            | 16.61            |
| Type indexed_gzip, keep_open True, slice [:, :, :, ?] |  0.02            |  0.00            | 18.37            |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, ?, ?, :] |  0.62            |  0.00            | 975.92           |  0.00            |
+-------------------------------------------------------+------------------+------------------+------------------+------------------+
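
The `Time ratio` column appears to be `Time` divided by `Baseline time`, computed before rounding, which would explain how rows with a displayed baseline of 0.00 still get a finite ratio. That reading is an inference from the numbers, not something stated by the script. A quick check on the rows where the baseline is large enough to survive rounding:

```python
# (label, time, baseline time) copied from the table above
rows = [
    ("gzip, keep_open False, slice [?, :, :, :]", 6.48, 2.28),
    ("gzip, keep_open True, slice [?, :, :, :]", 6.53, 2.27),
    ("indexed_gzip, keep_open True, slice [?, :, :, :]", 4.26, 2.25),
]

# Ratio of each measurement to the uncompressed baseline, to two decimals.
ratios = [round(time / baseline, 2) for _, time, baseline in rows]

for (label, _, _), ratio in zip(rows, ratios):
    print("{}: {:.2f}".format(label, ratio))
```

The computed values agree with the 2.84, 2.88 and 1.89 shown in the table.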

I found it interesting that slices of the form [?, ?, ?, :] (e.g. time series for one voxel) are faster, but not that much faster, with indexed_gzip. I might be able to get better results by tuning some parameters (e.g. seek point spacing, readbuffer sizes).

The last commit in this push contains some changes that I had to make to get all of the existing nibabel/benchmarks scripts running on my system. I'm not too sure about these changes, as I don't know the typical environment in which the benchmarks are executed. If these are not necessary, let me know and I'll delete this commit.

@effigies
Member

effigies commented Oct 5, 2017

Yeah, I would see if upping your buffer size can improve performance a bit. On my system, the default is 8KiB, while we set (when possible) the GzipFile max read chunk at 100MiB.
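
To see what default applies on a given system, and how a larger buffer can be requested, here is a stdlib-only sketch. It assumes the default in question is `io.DEFAULT_BUFFER_SIZE`; `buffered` is a hypothetical helper and the in-memory stream stands in for a real file.

```python
import io

ONE_MIB = 1024 * 1024


def buffered(raw, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """Wrap a readable binary stream in a BufferedReader of a chosen size."""
    return io.BufferedReader(raw, buffer_size=buffer_size)


# The platform default; the value differs between systems and versions.
print("default buffer size:", io.DEFAULT_BUFFER_SIZE)

payload = bytes(range(256)) * 64
reader = buffered(io.BytesIO(payload), buffer_size=ONE_MIB)
data = reader.read()
```

A larger buffer reduces the number of underlying reads, at the cost of reading more data than a small slice may need.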

@pauldmccarthy
Contributor Author

Although that fix is only applied to python < 3.4 - in 3.5 and above, the GzipFile uses io.BufferedReader with a default buffer size (also 8KiB on my system). In indexed_gzip 0.6.1 I set this buffer size to 1MB (this can be customised via SafeIndexedGzipFile.__init__).

Here are results after changing _gzip_open to do this (confusing parameter names to __init__, I know):

gzip_file = SafeIndexedGzipFile(filename,
                                readbuf_size=GZIP_MAX_READ_CHUNK,
                                buffer_size=GZIP_MAX_READ_CHUNK)

+-------------------------------------------------------+------------------+------------------+------------------+------------------+
|                                                       |       Time       |  Baseline time   |    Time ratio    | Memory deviation |
+=======================================================+==================+==================+==================+==================+
| Type gzip, keep_open False, slice [?, :, :, :]        |  6.42            |  2.27            |  2.83            |  5.57            |
| Type gzip, keep_open False, slice [:, :, :, ?]        |  0.51            |  0.00            | 648.70           |  0.00            |
| Type gzip, keep_open False, slice [?, ?, ?, :]        |  1.01            |  0.00            | 1469.82          |  0.00            |
| Type gzip, keep_open True, slice [?, :, :, :]         |  6.46            |  2.31            |  2.80            | -3.82            |
| Type gzip, keep_open True, slice [:, :, :, ?]         |  0.34            |  0.00            | 402.97           |  0.00            |
| Type gzip, keep_open True, slice [?, ?, ?, :]         |  1.03            |  0.00            | 1426.10          |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, :, :, :] |  2.89            |  2.25            |  1.28            | 115.96           |
| Type indexed_gzip, keep_open True, slice [:, :, :, ?] |  0.13            |  0.00            | 179.51           |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, ?, ?, :] |  0.71            |  0.00            | 1041.26          |  0.00            |
+-------------------------------------------------------+------------------+------------------+------------------+------------------+

It's made volume access slower! I guess because way more data is being read than necessary.

@effigies
Member

effigies commented Oct 5, 2017

Hmm. Okay. Still, falling back to standard GzipFile speeds isn't bad. As long as we're not significantly under-performing vanilla gzip for common operations, I'm okay with "This only improves some operations."

@coveralls

Coverage decreased (-0.01%) to 96.256% when pulling bc7b6f7 on pauldmccarthy:indexed_gzip_usage into dbe74e1 on nipy:master.

@pauldmccarthy
Contributor Author

pauldmccarthy commented Oct 5, 2017

Well, it does improve performance on all of the tested slices, just not as much as I would like :).

It's not shown directly in the benchmark output, but the most interesting comparison is between gzip/keep_file_open=True, and indexed_gzip/keep_file_open=True. Here are the speed-ups when using indexed_gzip:

slice [?, :, :, :] gzip / indexed_gzip: 6.53 / 4.26 = 1.53
slice [:, :, :, ?] gzip / indexed_gzip: 0.36 / 0.02 = 18.0
slice [?, ?, ?, :] gzip / indexed_gzip: 1.00 / 0.62 = 1.61

The biggest gain is for volume access (x18 speed-up, what I originally wrote indexed_gzip for). We get improvements in the other slice types, but they're not quite as impressive. Another point to note is that these figures are specific to the current number of iterations - they will only get better with more iterations. I've just started a new run with 200 iterations to substantiate this claim (hopefully :) ).

I'll try and put aside a few days over Christmas to do some more digging, and see if I can get some bigger speed-ups.

@pauldmccarthy
Contributor Author

pauldmccarthy commented Oct 5, 2017

Well I am going to have to eat my words there - no change in results with more iterations - these results were with 200 iterations, on a shape of (100, 100, 100, 100):

+-------------------------------------------------------+------------------+------------------+------------------+------------------+
|                                                       |       Time       |  Baseline time   |    Time ratio    | Memory deviation |
+=======================================================+==================+==================+==================+==================+
| Type gzip, keep_open False, slice [?, :, :, :]        |  6.48            |  2.26            |  2.86            |  7.24            |
| Type gzip, keep_open False, slice [:, :, :, ?]        |  0.50            |  0.00            | 622.59           |  0.00            |
| Type gzip, keep_open False, slice [?, ?, ?, :]        |  1.00            |  0.00            | 1407.09          |  0.00            |
| Type gzip, keep_open True, slice [?, :, :, :]         |  6.54            |  2.28            |  2.87            |  0.09            |
| Type gzip, keep_open True, slice [:, :, :, ?]         |  0.34            |  0.00            | 467.24           |  0.00            |
| Type gzip, keep_open True, slice [?, ?, ?, :]         |  1.01            |  0.00            | 1523.59          |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, :, :, :] |  4.22            |  2.22            |  1.90            | 16.72            |
| Type indexed_gzip, keep_open True, slice [:, :, :, ?] |  0.02            |  0.00            | 16.78            |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, ?, ?, :] |  0.64            |  0.00            | 1037.29          |  0.00            |
+-------------------------------------------------------+------------------+------------------+------------------+------------------+


slice [?, :, :, :] gzip / indexed_gzip: 6.54 / 4.22 =  1.55
slice [:, :, :, ?] gzip / indexed_gzip: 0.34 / 0.02 = 17.00
slice [?, ?, ?, :] gzip / indexed_gzip: 1.01 / 0.64 =  1.58

But if I run the benchmark on a bigger image (50 iterations, (200, 200, 200, 100) float32, ~ 3GiB uncompressed) things look a bit more promising:

+-------------------------------------------------------+------------------+------------------+------------------+------------------+
|                                                       |       Time       |  Baseline time   |    Time ratio    | Memory deviation |
+=======================================================+==================+==================+==================+==================+
| Type gzip, keep_open False, slice [?, :, :, :]        | 31.33            | 10.00            |  3.13            | 124.61           |
| Type gzip, keep_open False, slice [:, :, :, ?]        |  4.46            |  0.01            | 466.79           |  0.25            |
| Type gzip, keep_open False, slice [?, ?, ?, :]        |  7.45            |  0.00            | 11706.92         |  0.00            |
| Type gzip, keep_open True, slice [?, :, :, :]         | 30.54            | 10.55            |  2.89            | 15.26            |
| Type gzip, keep_open True, slice [:, :, :, ?]         |  3.17            |  0.01            | 328.21           |  0.25            |
| Type gzip, keep_open True, slice [?, ?, ?, :]         |  8.42            |  0.00            | 13253.83         |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, :, :, :] | 27.22            | 10.82            |  2.52            |  1.68            |
| Type indexed_gzip, keep_open True, slice [:, :, :, ?] |  0.07            |  0.01            |  6.82            |  0.21            |
| Type indexed_gzip, keep_open True, slice [?, ?, ?, :] |  0.69            |  0.00            | 920.69           |  0.00            |
+-------------------------------------------------------+------------------+------------------+------------------+------------------+

slice [?, :, :, :] gzip / indexed_gzip: 30.54 / 27.22 =  1.12
slice [:, :, :, ?] gzip / indexed_gzip:  3.17 /  0.07 = 45.29
slice [?, ?, ?, :] gzip / indexed_gzip:  8.42 /  0.69 = 12.20
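
The speed-ups above can be reproduced from the table with a few lines. This is only an arithmetic check on the rounded timings shown in this thread, with the gzip figures taken from the keep_open True rows.

```python
# slice type -> (gzip time, indexed_gzip time), both with keep_open True,
# copied from the (200, 200, 200, 100) table above
timings = {
    "[?, :, :, :]": (30.54, 27.22),
    "[:, :, :, ?]": (3.17, 0.07),
    "[?, ?, ?, :]": (8.42, 0.69),
}

# Speed-up is the gzip time divided by the indexed_gzip time.
speedups = {
    slice_type: round(gzip_time / igzip_time, 2)
    for slice_type, (gzip_time, igzip_time) in timings.items()
}

for slice_type, factor in speedups.items():
    print("slice {} speed-up: {:.2f}".format(slice_type, factor))
```

Because the inputs are already rounded to two decimals, the very small timings (such as 0.07) carry a large relative uncertainty, so the biggest ratios are the least precise.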

Performance for the first slice type [?, :, :, :] has gone down - it's still better than gzip, but not by much. But for the slice types that matter ([:, :, :, ?] == volume, and [?, ?, ?, :] == time course), we get big improvements in performance.
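
One plausible reading of why the volume slice benefits most: if the image data is stored with the first axis varying fastest (Fortran order, as in NIfTI files), a volume is one contiguous block on disk, so a seekable reader can jump to it, while the other slice types are scattered across the whole file. The sketch below only computes element offsets under that assumption. `slice_footprint` is a hypothetical helper, and the shape is the (100, 100, 100, 100) one from the earlier benchmark.

```python
def slice_footprint(shape, fixed):
    """Describe which elements a slice touches in a Fortran-ordered array.

    fixed holds one entry per axis: an int for a fixed index, or None for
    a full slice. Returns (element count, first offset, last offset), with
    offsets counted in elements from the start of the data block.
    """
    count, first, last = 1, 0, 0
    stride = 1
    for size, index in zip(shape, fixed):
        if index is None:
            count *= size
            last += (size - 1) * stride
        else:
            first += index * stride
            last += index * stride
        stride *= size  # the next axis varies more slowly
    return count, first, last


shape = (100, 100, 100, 100)
for name, fixed in [("volume      [:, :, :, 5]", (None, None, None, 5)),
                    ("time course [1, 2, 3, :]", (1, 2, 3, None)),
                    ("plane stack [7, :, :, :]", (7, None, None, None))]:
    count, first, last = slice_footprint(shape, fixed)
    span = last - first + 1
    print("{}: {} elements spread over {} positions".format(name, count, span))
```

When the span equals the element count the slice is contiguous, which is the case for the volume only.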

@effigies
Member

effigies commented Oct 6, 2017

I'm good with this if you want to merge or rebase to master to fix the tests.

@effigies effigies mentioned this pull request Oct 6, 2017
@coveralls

Coverage decreased (-0.01%) to 96.253% when pulling cc724ef on pauldmccarthy:indexed_gzip_usage into fa76141 on nipy:master.

@coveralls

Coverage decreased (-0.01%) to 96.253% when pulling 39a2963 on pauldmccarthy:indexed_gzip_usage into fa76141 on nipy:master.

@effigies effigies merged commit 8c1d0bc into nipy:master Oct 6, 2017
@effigies
Member

effigies commented Oct 6, 2017

Thanks for the quick work, @pauldmccarthy.
