Only use `indexed_gzip` when explicitly requested #562

pauldmccarthy · 2017-10-02T14:03:33Z

This PR aims to address issue #558 - it should have been a part of PR #552

For small files, the current version of indexed_gzip is much slower to use than the built-in GzipFile class. Therefore, indexed_gzip should not be used unless it is requested (via the keep_file_open flag, which is interpreted by the nibabel.arrayproxy.ArrayProxy class). In all other circumstances, the built-in gzip.GzipFile class should be used instead.

I have made this change by having the ArrayProxy pass a "hint" to the Opener class, indicating whether the file handle is going to be kept open for multiple accesses, or whether it is just for a one-time access. The Opener class (actually, the nibabel.openers._gzip_open function) then decides whether or not to use indexed_gzip.
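The decision described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the function name `gzip_open_sketch` and the `HAVE_INDEXED_GZIP` flag are hypothetical, and it assumes the installed indexed_gzip exposes `SafeIndexedGzipFile` (the class name used later in this thread).

```python
import gzip

try:
    # Optional dependency; the class name is taken from later in this thread.
    from indexed_gzip import SafeIndexedGzipFile
    HAVE_INDEXED_GZIP = True
except ImportError:
    HAVE_INDEXED_GZIP = False


def gzip_open_sketch(filename, mode='rb', compresslevel=9, keep_open=False):
    """Hypothetical opener: use indexed_gzip only for read-only handles
    that the caller says will be kept open for repeated access."""
    if HAVE_INDEXED_GZIP and keep_open and mode == 'rb':
        return SafeIndexedGzipFile(filename)
    # One-off access, writing, or indexed_gzip unavailable: built-in class.
    return gzip.GzipFile(filename, mode, compresslevel)
```

With `keep_open=False` (the default here) or any write mode, the sketch always returns a plain `gzip.GzipFile`, which matches the behaviour this PR is aiming for.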

effigies · 2017-10-02T14:13:08Z

Thanks for this. It looks good to me on a first pass, but I don't think I'm thinking optimally this morning, so I'll have a more detailed look later, unless Matthew beats me to it.

coveralls · 2017-10-02T14:33:32Z

Coverage increased (+0.0008%) to 96.272% when pulling 198d903 on pauldmccarthy:indexed_gzip_usage into dbe74e1 on nipy:master.

effigies · 2017-10-02T14:36:06Z

The failing Travis tests are related to #556. Don't worry about those.

codecov-io · 2017-10-02T15:44:08Z

Codecov Report

Merging #562 into master will decrease coverage by 0.01%.
The diff coverage is 86.11%.

@@            Coverage Diff             @@
##           master     #562      +/-   ##
==========================================
- Coverage   94.33%   94.32%   -0.02%     
==========================================
  Files         177      177              
  Lines       24680    24689       +9     
  Branches     2635     2638       +3     
==========================================
+ Hits        23283    23288       +5     
- Misses        921      925       +4     
  Partials      476      476

Impacted Files	Coverage Δ
nibabel/arrayproxy.py	`98.11% <100%> (ø)`	⬆️
nibabel/tests/test_arrayproxy.py	`100% <100%> (ø)`	⬆️
nibabel/tests/test_openers.py	`98.61% <100%> (+0.03%)`	⬆️
nibabel/pkg_info.py	`27.58% <50%> (ø)`	⬆️
nibabel/openers.py	`80% <69.23%> (-2.08%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fa76141...39a2963.

pauldmccarthy · 2017-10-02T15:47:01Z

In light of a discussion over at #557, I've also added a check on the installed version of indexed_gzip - if it is older than 0.6.0, it is not used. 0.6.0 is the current latest version, and is also the first version which supports Windows.

coveralls · 2017-10-02T16:11:10Z

Coverage decreased (-0.01%) to 96.26% when pulling 33f6774 on pauldmccarthy:indexed_gzip_usage into dbe74e1 on nipy:master.

effigies · 2017-10-02T15:50:05Z

nibabel/openers.py

+
+    if StrictVersion(version) < StrictVersion("0.6.0"):
+        raise ImportError('indexed_gzip is present, but too old '
+                          '(>= 0.6.0 required): {})'.format(version))


This warning won't actually be presented to the user due to the except. I'd do:

    try:
        from indexed_gzip import SafeIndexedGzip
        HAVE_INDEXED_GZIP = True
    except ImportError:
        HAVE_INDEXED_GZIP = False
    else:
        from indexed_gzip import __version__ as igzip_version
        if StrictVersion(igzip_version) < StrictVersion("0.6.0"):
            warnings.warn(...)
            HAVE_INDEXED_GZIP = False
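The warn-instead-of-raise idea can be tried out in isolation with something like the sketch below. The helper `version_ok` and the constant `MIN_VERSION` are hypothetical names, and the comparison uses plain integer tuples rather than `StrictVersion`, so it assumes a simple three-part numeric version string such as "0.6.1".

```python
import warnings

MIN_VERSION = (0, 6, 0)  # minimum indexed_gzip version named in this thread


def version_ok(version, minimum=MIN_VERSION):
    """Hypothetical helper: return True if `version` (e.g. "0.6.1") is at
    least `minimum`; otherwise emit a warning and return False."""
    parts = tuple(int(p) for p in version.split('.')[:3])
    if parts < minimum:
        # A warning reaches the user, unlike an ImportError that is
        # swallowed by a surrounding `except ImportError`.
        warnings.warn('indexed_gzip is present, but too old '
                      '(>= 0.6.0 required): {}'.format(version))
        return False
    return True
```

A caller would set its availability flag from the return value, so an old installation is reported once and then ignored.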

coveralls · 2017-10-02T17:01:39Z

Coverage decreased (-0.01%) to 96.256% when pulling a7cc3c8 on pauldmccarthy:indexed_gzip_usage into dbe74e1 on nipy:master.

effigies

This looks good. One question and a minor suggestion.

effigies · 2017-10-03T02:33:38Z

nibabel/openers.py

@@ -117,6 +130,9 @@ class Opener(object):
    default_compresslevel = 1
    #: whether to ignore case looking for compression extensions
    compress_ext_icase = True
+    #: hint which tells us whether the file handle will be kept open for
+    #  multiple reads/writes, or just for one-time access.
+    default_keep_open = False


Is this class variable intended to be changed by a power user? Under what circumstances might I want to change that?

pauldmccarthy · 2017-10-03

I'm not actually sure - I was just going with the convention set by the other kwargs. I can't really think of a use-case for changing it, so I will change it to default to False in __init__.

effigies · 2017-10-03T02:35:00Z

nibabel/openers.py

+        # Default keep_open hint
+        if 'keep_open' in arg_names:
+            if 'keep_open' not in kwargs:
+                kwargs['keep_open'] = self.default_keep_open


Can replace the internal if block with:

kwargs.setdefault('keep_open', self.default_keep_open)
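For illustration, here is a small standalone sketch of what `setdefault` does in this situation. The function `with_default` is a hypothetical wrapper, not part of nibabel; it only shows that an explicit `keep_open` is preserved while a missing one is filled in.

```python
def with_default(kwargs, default_keep_open=False):
    """Fill in 'keep_open' only when the caller did not supply it."""
    kwargs = dict(kwargs)  # copy so the caller's dict is left untouched
    kwargs.setdefault('keep_open', default_keep_open)
    return kwargs


print(with_default({'mode': 'rb'}))
print(with_default({'mode': 'rb', 'keep_open': True}))
```

The first call gains `keep_open=False`; the second keeps the caller's `True`, which is the same outcome as the nested `if` block.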

effigies · 2017-10-03T02:35:13Z

nibabel/tests/test_openers.py

+            (True,  {'mode' : 'rb', 'keep_open' : False}, GzipFile),
+            (True,  {'mode' : 'wb', 'keep_open' : True},  GzipFile),
+            (True,  {'mode' : 'wb', 'keep_open' : False}, GzipFile),
+        ]


effigies · 2017-10-05T01:56:09Z

Since 0.6.1 was released, now getting issues with this line:

nibabel/nibabel/tests/test_openers.py

Line 241 in dbe74e1

from indexed_gzip import IndexedGzipFile

Should probably make it SafeIndexedGzipFile.

pauldmccarthy · 2017-10-05T08:28:19Z

Aah true - I changed the inheritance hierarchy so that SafeIndexedGzipFile is no longer a sub-class of IndexedGzipFile (but rather, a subclass of io.BufferedReader).

You were testing against the master branch, right? I didn't pick this up on the PR branch, because indexed_gzip will no longer get used unless requested.

I've been quite busy this week, so still have a bit of work to do on the benchmarking code (and have had to make a few changes to other benchmark modules to get them to run). But I should be able to finalise the outstanding issues before the end of the week.

pauldmccarthy · 2017-10-05T15:00:19Z

Ok, sorry for the delay ... I've knocked up a benchmarking script which compares ArrayProxy-based slicing on:

GzipFile/keep_file_open=False
GzipFile/keep_file_open=True
SafeIndexedGzipFile/keep_file_open=True

Each of these is compared against the time taken to perform an equivalent ArrayProxy-based slice on an uncompressed/mem-mapped image.
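The "time ratio against an uncompressed baseline" idea can be sketched with the standard library alone. This is not the benchmark script from this PR: it does not use nibabel or ArrayProxy, and the helpers `time_ratio` and `read_chunk` are hypothetical names. It only times reading a chunk near the end of a gzipped file against the same read on an uncompressed copy.

```python
import gzip
import os
import tempfile
import timeit


def time_ratio(candidate, baseline, repeats=5):
    """Return (candidate_time, baseline_time, ratio) using the best of
    `repeats` single runs of each zero-argument callable."""
    t_cand = min(timeit.repeat(candidate, number=1, repeat=repeats))
    t_base = min(timeit.repeat(baseline, number=1, repeat=repeats))
    return t_cand, t_base, t_cand / t_base


def read_chunk(opener, path, offset, size):
    """Open `path` with `opener`, seek to `offset`, and read `size` bytes."""
    with opener(path, 'rb') as f:
        f.seek(offset)
        return f.read(size)


# Build a small test file in both uncompressed and gzipped form.
tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, 'data.bin')
gz_path = raw_path + '.gz'
payload = os.urandom(1024) * 2048          # 2 MiB of repeating data
with open(raw_path, 'wb') as f:
    f.write(payload)
with gzip.open(gz_path, 'wb') as f:
    f.write(payload)

offset, size = len(payload) - 4096, 4096   # a chunk near the end of the file
t_gz, t_raw, ratio = time_ratio(
    lambda: read_chunk(gzip.open, gz_path, offset, size),
    lambda: read_chunk(open, raw_path, offset, size))
print('gzip {:.5f}s, uncompressed {:.5f}s, ratio {:.1f}'.format(
    t_gz, t_raw, ratio))
```

The absolute numbers depend on the machine and file size, so only the ratio is worth comparing between runs.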

In its current form the script takes about 30 minutes to run. Some representative results are as follows:

+-------------------------------------------------------+------------------+------------------+------------------+------------------+
|                                                       |       Time       |  Baseline time   |    Time ratio    | Memory deviation |
+=======================================================+==================+==================+==================+==================+
| Type gzip, keep_open False, slice [?, :, :, :]        |  6.48            |  2.28            |  2.84            |  4.89            |
| Type gzip, keep_open False, slice [:, :, :, ?]        |  0.51            |  0.00            | 683.49           |  0.00            |
| Type gzip, keep_open False, slice [?, ?, ?, :]        |  1.01            |  0.00            | 1588.11          |  0.00            |
| Type gzip, keep_open True, slice [?, :, :, :]         |  6.53            |  2.27            |  2.88            |  0.06            |
| Type gzip, keep_open True, slice [:, :, :, ?]         |  0.36            |  0.00            | 476.13           |  0.00            |
| Type gzip, keep_open True, slice [?, ?, ?, :]         |  1.00            |  0.00            | 1639.86          |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, :, :, :] |  4.26            |  2.25            |  1.89            | 16.61            |
| Type indexed_gzip, keep_open True, slice [:, :, :, ?] |  0.02            |  0.00            | 18.37            |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, ?, ?, :] |  0.62            |  0.00            | 975.92           |  0.00            |
+-------------------------------------------------------+------------------+------------------+------------------+------------------+

I found it interesting that slices of the form [?, ?, ?, :] (e.g. time series for one voxel) are faster, but not that much faster, with indexed_gzip. I might be able to get better results by tuning some parameters (e.g. seek point spacing, readbuffer sizes).

The last commit in this push contains some changes that I had to make to get all of the existing nibabel/benchmarks scripts running on my system. I'm not too sure about these changes, as I don't know the typical environment in which the benchmarks are executed. If these are not necessary, let me know and I'll delete this commit.

effigies · 2017-10-05T15:16:14Z

Yeah, I would see if upping your buffer size can improve performance a bit. On my system, the default is 8KiB, while we set (when possible) the GzipFile max read chunk at 100MiB.

pauldmccarthy · 2017-10-05T16:01:09Z

Although that fix is only applied to python < 3.4 - in 3.5 and above, the GzipFile uses io.BufferedReader with a default buffer size (also 8KiB on my system). In indexed_gzip 0.6.1 I set this buffer size to 1MB (this can be customised via SafeIndexedGzipFile.__init__).

Here are results after changing _gzip_open to do this (confusing parameter names to __init__, I know):

        gzip_file = SafeIndexedGzipFile(filename,
                                        readbuf_size=GZIP_MAX_READ_CHUNK,
                                        buffer_size=GZIP_MAX_READ_CHUNK)

+-------------------------------------------------------+------------------+------------------+------------------+------------------+
|                                                       |       Time       |  Baseline time   |    Time ratio    | Memory deviation |
+=======================================================+==================+==================+==================+==================+
| Type gzip, keep_open False, slice [?, :, :, :]        |  6.42            |  2.27            |  2.83            |  5.57            |
| Type gzip, keep_open False, slice [:, :, :, ?]        |  0.51            |  0.00            | 648.70           |  0.00            |
| Type gzip, keep_open False, slice [?, ?, ?, :]        |  1.01            |  0.00            | 1469.82          |  0.00            |
| Type gzip, keep_open True, slice [?, :, :, :]         |  6.46            |  2.31            |  2.80            | -3.82            |
| Type gzip, keep_open True, slice [:, :, :, ?]         |  0.34            |  0.00            | 402.97           |  0.00            |
| Type gzip, keep_open True, slice [?, ?, ?, :]         |  1.03            |  0.00            | 1426.10          |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, :, :, :] |  2.89            |  2.25            |  1.28            | 115.96           |
| Type indexed_gzip, keep_open True, slice [:, :, :, ?] |  0.13            |  0.00            | 179.51           |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, ?, ?, :] |  0.71            |  0.00            | 1041.26          |  0.00            |
+-------------------------------------------------------+------------------+------------------+------------------+------------------+

It's made volume access slower! I guess because way more data is being read than necessary.
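One way to see how a larger buffer can cause more data to be read than necessary is the stdlib-only sketch below. It says nothing about indexed_gzip's internals; `CountingRaw` and `bytes_pulled` are hypothetical names, and the sketch only shows how many bytes an `io.BufferedReader` pulls from its underlying stream to satisfy a small read.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory raw stream that records how many bytes are pulled from it."""

    def __init__(self, data):
        super().__init__()
        self._src = io.BytesIO(data)
        self.bytes_read = 0

    def readable(self):
        return True

    def readinto(self, b):
        n = self._src.readinto(b)
        self.bytes_read += n
        return n


def bytes_pulled(buffer_size, request=16, total=4 * 1024 * 1024):
    """Read `request` bytes through a BufferedReader with the given
    buffer size and report how many bytes the raw stream supplied."""
    raw = CountingRaw(bytes(total))
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    reader.read(request)
    return raw.bytes_read


print(bytes_pulled(8 * 1024))       # small buffer
print(bytes_pulled(1024 * 1024))    # large buffer
```

If the underlying stream is a decompressor, every extra byte pulled is a byte that had to be decompressed, which is consistent with the guess above.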

effigies · 2017-10-05T16:21:10Z

Hmm. Okay. Still, falling back to standard GzipFile speeds isn't bad. As long as we're not significantly under-performing vanilla gzip for common operations, I'm okay with "This only improves some operations."

coveralls · 2017-10-05T16:23:24Z

Coverage decreased (-0.01%) to 96.256% when pulling bc7b6f7 on pauldmccarthy:indexed_gzip_usage into dbe74e1 on nipy:master.

pauldmccarthy · 2017-10-05T16:30:10Z

Well, it does improve performance on all of the tested slices, just not as much as I would like :).

It's not shown directly in the benchmark output, but the most interesting comparison is between gzip/keep_file_open=True, and indexed_gzip/keep_file_open=True. Here are the speed-ups when using indexed_gzip:

slice [?, :, :, :] gzip / indexed_gzip: 6.53 / 4.26 =  1.53
slice [:, :, :, ?] gzip / indexed_gzip: 0.36 / 0.02 = 18.00
slice [?, ?, ?, :] gzip / indexed_gzip: 1.00 / 0.62 =  1.61

The biggest gain is for volume access (x18 speed-up, what I originally wrote indexed_gzip for). We get improvements in the other slice types, but they're not quite as impressive. Another point to note is that these figures are specific to the current number of iterations - they will only get better with more iterations. I've just started a new run with 200 iterations to substantiate this claim (hopefully :) ).

I'll try and put aside a few days over Christmas to do some more digging, and see if I can get some bigger speed-ups.

pauldmccarthy · 2017-10-05T21:57:52Z

Well I am going to have to eat my words there - no change in results with more iterations - these results were with 200 iterations, on a shape of (100, 100, 100, 100):

+-------------------------------------------------------+------------------+------------------+------------------+------------------+
|                                                       |       Time       |  Baseline time   |    Time ratio    | Memory deviation |
+=======================================================+==================+==================+==================+==================+
| Type gzip, keep_open False, slice [?, :, :, :]        |  6.48            |  2.26            |  2.86            |  7.24            |
| Type gzip, keep_open False, slice [:, :, :, ?]        |  0.50            |  0.00            | 622.59           |  0.00            |
| Type gzip, keep_open False, slice [?, ?, ?, :]        |  1.00            |  0.00            | 1407.09          |  0.00            |
| Type gzip, keep_open True, slice [?, :, :, :]         |  6.54            |  2.28            |  2.87            |  0.09            |
| Type gzip, keep_open True, slice [:, :, :, ?]         |  0.34            |  0.00            | 467.24           |  0.00            |
| Type gzip, keep_open True, slice [?, ?, ?, :]         |  1.01            |  0.00            | 1523.59          |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, :, :, :] |  4.22            |  2.22            |  1.90            | 16.72            |
| Type indexed_gzip, keep_open True, slice [:, :, :, ?] |  0.02            |  0.00            | 16.78            |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, ?, ?, :] |  0.64            |  0.00            | 1037.29          |  0.00            |
+-------------------------------------------------------+------------------+------------------+------------------+------------------+


slice [?, :, :, :] gzip / indexed_gzip: 6.54 / 4.22 =  1.55
slice [:, :, :, ?] gzip / indexed_gzip: 0.34 / 0.02 = 17.00
slice [?, ?, ?, :] gzip / indexed_gzip: 1.01 / 0.64 =  1.58

But if I run the benchmark on a bigger image (50 iterations, (200, 200, 200, 100) float32, ~ 3GiB uncompressed) things look a bit more promising:

+-------------------------------------------------------+------------------+------------------+------------------+------------------+
|                                                       |       Time       |  Baseline time   |    Time ratio    | Memory deviation |
+=======================================================+==================+==================+==================+==================+
| Type gzip, keep_open False, slice [?, :, :, :]        | 31.33            | 10.00            |  3.13            | 124.61           |
| Type gzip, keep_open False, slice [:, :, :, ?]        |  4.46            |  0.01            | 466.79           |  0.25            |
| Type gzip, keep_open False, slice [?, ?, ?, :]        |  7.45            |  0.00            | 11706.92         |  0.00            |
| Type gzip, keep_open True, slice [?, :, :, :]         | 30.54            | 10.55            |  2.89            | 15.26            |
| Type gzip, keep_open True, slice [:, :, :, ?]         |  3.17            |  0.01            | 328.21           |  0.25            |
| Type gzip, keep_open True, slice [?, ?, ?, :]         |  8.42            |  0.00            | 13253.83         |  0.00            |
| Type indexed_gzip, keep_open True, slice [?, :, :, :] | 27.22            | 10.82            |  2.52            |  1.68            |
| Type indexed_gzip, keep_open True, slice [:, :, :, ?] |  0.07            |  0.01            |  6.82            |  0.21            |
| Type indexed_gzip, keep_open True, slice [?, ?, ?, :] |  0.69            |  0.00            | 920.69           |  0.00            |
+-------------------------------------------------------+------------------+------------------+------------------+------------------+

slice [?, :, :, :] gzip / indexed_gzip: 30.54 / 27.22 =  1.12
slice [:, :, :, ?] gzip / indexed_gzip:  3.17 /  0.07 = 45.29
slice [?, ?, ?, :] gzip / indexed_gzip:  8.42 /  0.69 = 12.20

Performance for the first slice type [?, :, :, :] has gone down - it's still better than gzip, but not by much. But for the slice types that matter ([:, :, :, ?] == volume, and [?, ?, ?, :] == time course), we get big improvements in performance.
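The speed-up figures for the larger image are plain ratios of the keep_open=True timings in the table above, and can be reproduced with a few lines. The dictionary and the `speedup` helper are illustrative names; the numbers are copied from the table.

```python
# (gzip time, indexed_gzip time) in seconds, keep_open=True rows,
# (200, 200, 200, 100) image.
timings = {
    '[?, :, :, :]': (30.54, 27.22),
    '[:, :, :, ?]': (3.17, 0.07),
    '[?, ?, ?, :]': (8.42, 0.69),
}


def speedup(gzip_time, igzip_time):
    """Ratio > 1 means indexed_gzip was faster than plain gzip."""
    return gzip_time / igzip_time


for slc, (g, i) in timings.items():
    print('slice {} gzip / indexed_gzip: {:5.2f}'.format(slc, speedup(g, i)))
```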

effigies · 2017-10-06T02:44:41Z

I'm good with this if you want to merge or rebase to master to fix the tests.

coveralls · 2017-10-06T10:25:09Z

Coverage decreased (-0.01%) to 96.253% when pulling cc724ef on pauldmccarthy:indexed_gzip_usage into fa76141 on nipy:master.

coveralls · 2017-10-06T15:47:06Z

Coverage decreased (-0.01%) to 96.253% when pulling 39a2963 on pauldmccarthy:indexed_gzip_usage into fa76141 on nipy:master.

effigies · 2017-10-06T19:14:23Z

Thanks for the quick work, @pauldmccarthy.

pauldmccarthy mentioned this pull request Oct 2, 2017

Very slow slicing with indexed_gzip #558

Closed

pauldmccarthy mentioned this pull request Oct 2, 2017

Gzip file load error on latest master #557

Closed

effigies reviewed Oct 2, 2017

effigies approved these changes Oct 3, 2017

effigies mentioned this pull request Oct 6, 2017

2.2.0 release prep #555

Merged

28 tasks

pauldmccarthy added 8 commits October 6, 2017 08:59

RF: ArrayProxy tells Opener whether it intends to keep the file open for

b92fc39

multiple accesses, so Opener knows whether it should use indexed_gzip or not.

TEST: Adjusted tests which make sure that Opener uses GzipFile/Indexe…

73971bb

…dGzipFile classes as appropriate

RF: Restrict the use of indexed_gzip to versions >= 0.6.0

83f5c8e

RF: Changed internally raised and eaten indexed_gzip version ImportEr…

2e1a046

…ror to a warning, so it will be shown to users.

RF: Removed Opener default_keep_open class attribute - default value …

1e80412

…is hard coded in __init__

TEST: Use SafeIndexedGzipFile instead of IndexedGzipFile

e2c7809

TEST: Benchmark script for slicing gzipped files using ArrayProxy

92e4b90

BF: Fixes to get benchmarking scripts running

e2c560a

BF: Syntax error in test case

cc724ef

pauldmccarthy force-pushed the indexed_gzip_usage branch from bc7b6f7 to cc724ef Compare October 6, 2017 08:00

PL: Bowing to the style gods

39a2963

effigies merged commit 8c1d0bc into nipy:master Oct 6, 2017

This was referenced Oct 6, 2017

ENH: Add image slicing #550

Merged

ENH: Add manual value limits to OrthoSlicer3D/SpatialImages.orthoview #491

Closed

pauldmccarthy deleted the indexed_gzip_usage branch March 26, 2018 09:53
