Skip to content

ENH: Take advantage of IndexedGzipFile drop_handles flag #614

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jun 4, 2018

Conversation

pauldmccarthy
Copy link
Contributor

@pauldmccarthy pauldmccarthy commented Mar 26, 2018

This is an attempt to address #573.

  • Deprecate the keep_file_open parameter
  • Always use indexed_gzip if it is present (this can be disabled by overwriting the nibabel.openers.HAVE_INDEXED_GZIP flag)
  • Change ArrayProxy so that, if indexed_gzip is present, and it is given a .gz file, it creates and uses a single ImageOpener. Otherwise it creates a new ImageOpener on every access.

@coveralls
Copy link

coveralls commented Mar 26, 2018

Coverage Status

Coverage decreased (-0.03%) to 91.798% when pulling e96d71c on pauldmccarthy:rf/deprecate_keep_file_open into 2edca76 on nipy:master.

keep_file_open)
# If using indexed_gzip, we use a single ImageOpener. Otherwise, we
# create a new ImageOpener on each file access
self._persist_opener = openers.HAVE_INDEXED_GZIP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main thought here is that we only want to use a single file if we have indexed_gzip and the file is a .gz file. Quick read-through suggests that the un-compressed case might not be handled correctly, but I may have just missed it.

I'll try to have a more detailed look soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah yes, you're right. I'll change the condition to something like HAVE_INDEXED_GZIP and isinstance(file_like, str) and file_like.endswith('.gz').

@codecov-io
Copy link

codecov-io commented Mar 28, 2018

Codecov Report

Merging #614 into master will decrease coverage by 0.04%.
The diff coverage is 62.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #614      +/-   ##
==========================================
- Coverage   88.87%   88.83%   -0.05%     
==========================================
  Files          92       92              
  Lines       11274    11278       +4     
  Branches     1847     1848       +1     
==========================================
- Hits        10020    10019       -1     
- Misses        921      925       +4     
- Partials      333      334       +1
Impacted Files Coverage Δ
nibabel/arrayproxy.py 100% <100%> (ø) ⬆️
nibabel/openers.py 76.47% <33.33%> (-3.53%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2edca76...e96d71c. Read the comment docs.

@pauldmccarthy pauldmccarthy changed the title [WIP] RF: Deprecate keep_file_open, always use indexed_gzip if present RF: Deprecate keep_file_open, always use indexed_gzip if present Mar 28, 2018
@matthew-brett
Copy link
Member

I hate to jump in without understanding. But, would the flag continue to be useful in the following situation, without indexed_gzip installed?

img = nib.load('foo.nii.gz', keep_file_open=True)
for volume_no in range(img.shape[-1]):
    vol_data = img.dataobj[..., volume_no]

@pauldmccarthy
Copy link
Contributor Author

pauldmccarthy commented Mar 30, 2018

Hey @matthew-brett, thanks for the comment. This PR would break that example
(i.e. keep_file_open would be ignored, and the file would be decompressed from
its beginning on each iteration of the loop).

Based on discussions between myself and @effigies (#573), I formed the
impression that keep_file_open didn't have any uses outside of
indexed_gzip-based access.

But your example suggests otherwise - is this a real use case, and are you
suggesting that you would like to retain the ability to keep_file_open
(apologies for the grammar)?

@matthew-brett
Copy link
Member

i think I'm right in saying that the example above would be much faster with keep_file_open - is that right?

@pauldmccarthy
Copy link
Contributor Author

Yep, that's correct. So it would be useful if you need to scan through a .gz file sequentially.

@effigies
Copy link
Member

Yes, there's definitely a huge speed-up for keeping files open in that case, so it makes sense to me to keep the option around, given we went through the trouble of putting it in.

So at this point, I think I'd want the logic to be:

IndexedGzipFile keep_file_open .gz Behavior
True True True Keep open
True True False Keep open
True False True Keep open
True False False Re-open
False True True Keep open
False True False Keep open
False False True Re-open
False False False Re-open

So keep open if keep_file_open or IndexedGzipFile and .gz. And I'm pretty sure the default behavior I'd want at this point is equivalent to the keep_file_open == False subset of the table, so we can still deprecate the 'auto' and None options, and make them equivalent to False.

That said, if we're not doing a major refactor, do we want to continue to permit indexed_gzip >= 0.6.0, with the old behavior for 'auto' and None? Unfortunately, NeuroDebian doesn't seem to be updating to 0.7+, so it looks like it might be useful to continue supporting the 0.6 series that for a while longer.

If so, then I think the logic becomes a bit annoying to put in a truth table:

if keep_file_open is None:
    keep_file_open = arrayproxy.KEEP_FILE_OPEN_DEFAULT
if keep_file_open is True:
    return True
if indexed_gzip.__version__ < 0.6:  # I treat missing as 0
    return False
if indexed_gzip.__version__ >= 0.7:
    return fname.endswith('.gz')
return keep_file_open and fname.endswith('.gz')

What do y'all think? (And sorry for initiating this refactor, if we're going to end up throwing most of it away.)

@matthew-brett
Copy link
Member

Sorry, I know this is annoying, but it seems to me keep_file_open=False leading to the file being kept open anyway (indexed_gzip present, .gz file) is a bit odd. Can't we just change the default to 'auto'?

@pauldmccarthy
Copy link
Contributor Author

So how about we just preserve the existing behaviour, but change the default value of keep_file_open to 'auto'?

@matthew-brett
Copy link
Member

Yes, that's my suggestion - but I'm afraid I didn't follow the previous discussion. Does that suit everyone?

@matthew-brett
Copy link
Member

Specifically, I think we should change the default value in nibabel.arrayproxy.KEEP_FILE_OPEN_DEFAULT - and leave None as the default value in the keyword arguments. That way, if anyone is customizing the default behavior there, they will see no change in the default.

@effigies
Copy link
Member

effigies commented Mar 31, 2018 via email

@pauldmccarthy
Copy link
Contributor Author

@effigies I see your logic - for indexed_gzip >= 0.7, the keep_file_open
flag is not actually controlling whether the file handle is kept open, but
rather whether a single Opener (and hence an IndexedGzipFile object) is
kept alive across accesses.

So semantically the name keep_file_open is a bit inaccurate (and this is why
in the current state of the PR, I changed the internal wording in ArrayProxy
to _persist_opener, rather than _keep_file_open).

But in the interest of minimising both confusion and the need to refactor the
code, I'm sort of inclined to go with @matthew-brett's suggestion, and keep
the code more or less as-is. So the keep_file_open flag will simply dictate
whether an ArrayProxy uses one Opener across accesses, or one for each
access.

Happy to be overruled though!

@effigies
Copy link
Member

effigies commented Mar 31, 2018 via email

@pauldmccarthy
Copy link
Contributor Author

pauldmccarthy commented Mar 31, 2018

I think for v0.6 (did we decide whether to keep supporting it?)

Probably safe to, given neurodebian.

What about the following:

  1. Clarify the semantics of the keep_file_open parameter and the KEEP_FILE_OPEN_DEFAULT constant:
  • True: A single Opener and file object is used across accesses, regardless of file type

  • False: An Opener and file object is created for each access, regardless of file type

  • 'auto': If file is .gz and indexed_gzip is present, as for True. Otherwise as for False.

  1. Set KEEP_FILE_OPEN_DEFAULT depending on the presence of, and version of indexed_gzip:
indexed_gzip KEEP_FILE_OPEN_DEFAULT
Not installed, or version < 0.7 False
Version >= 0.7 'auto'
  1. (purely cosmetic) Move the KEEP_FILE_OPEN_DEFAULT constant into the openers module, and initialise it in the same place that HAVE_INDEXED_GZIP is initialised. This just has the advantage that we wouldn't need to test for indexed_gzip presence/version in both openers and ArrayProxy.

@effigies
Copy link
Member

That seems fine, if that's the consensus. I just don't think anybody cares about Openers who doesn't hack on nibabel, and changing the value of a constant based on dependency versions feels less user friendly than fiddling with implementation details.

But I'm okay with your approach. Just making sure we are all deciding based on a full picture.

But I don't think we can move the constant, since anybody modifying it will expect it to continue working.

@pauldmccarthy
Copy link
Contributor Author

Sigh, if only there was a nicer way than this to use @property at the module level ... because we could then easily deprecate arrayproxy.KEEP_FILE_OPEN_DEFAULT.

@matthew-brett
Copy link
Member

Sorry I'm on my phone. Are you saying that indexed_gzip will close the open file we pass ?. So the flag says keep open but in fact it gets closed?

@pauldmccarthy
Copy link
Contributor Author

@matthew-brett - yes, as of 0.7.0, an IndexedGzipFile will, by default, close/re-open the file handle on each access. Just like what nibabel does by default. Hence the complexity of this conversation :)

@matthew-brett
Copy link
Member

Sorry to be ignorant here, but do we not have a way of using the final result of keep_file_open to tell indexed_gzip whether to keep the file open?

Is the point that, there is no advantage to keeping the file open with 0.7? If so, can't we deal with that by making auto change its behavior for 0.7?

@effigies
Copy link
Member

effigies commented Mar 31, 2018 via email

@effigies
Copy link
Member

effigies commented Apr 1, 2018

Oh, actually, my mistake. IndexedGzipFile(..., drop_handles=False) is possible (and, of course, passing an open filehandle will not drop it, since there's no way to reopen), but we ran benchmarks, and there is no penalty to closing the file, because the index is so helpful. Thus drop_handles=True is the default. I would not object to passing IndexedGzipFile(..., drop_handles=keep_file_open is not True). Our tests indicate would have negligible performance benefits, but it would be very true to user intent.

For reference, though it's not exactly light reading: pauldmccarthy/indexed_gzip@v0.6.1...v0.7.0

This was referenced May 29, 2018
@pauldmccarthy pauldmccarthy changed the title RF: Deprecate keep_file_open, always use indexed_gzip if present ENH: Take advantage of IndexedGzipFile drop_handles flag Jun 1, 2018
@pauldmccarthy pauldmccarthy force-pushed the rf/deprecate_keep_file_open branch from 76f6614 to 6d93be9 Compare June 1, 2018 12:31
@pauldmccarthy
Copy link
Contributor Author

Hey @effigie and @matthew-brett, sorry for the delay, and apologies if this is too late for 2.3.0.

Here is my new take on how nibabel should behave. If keep_file_open in (True, False), the behaviour is what you would intuitively expect from the name, regardless of whether gzip or indexed_gzip is used. If indexed_gzip is present, and keep_file_open is True, the drop_handles feature of indexed_gzip is not used (i.e. the file handle is kept open).

What do you guys think?

HAVE_INDEXED_GZIP keep_file_open Library used ArrayProxy persists opener indexed_gzip.drop_handles
False False gzip No na
False 'auto' gzip No na
False True gzip Yes na
True False gzip No na
True 'auto' indexed_gzip Yes True
True True indexed_gzip Yes False

@pauldmccarthy pauldmccarthy force-pushed the rf/deprecate_keep_file_open branch from 2b70f5e to d5ad97a Compare June 3, 2018 08:17
@pauldmccarthy
Copy link
Contributor Author

Updated to always use indexed_gzip if present. And if we're always using indexed_gzip, should we get rid of the 'auto' option too? e.g. deprecate it and make it equivalent to False.

Then the ArrayProxy would use if HAVE_INDEXED_GZIP or keep_file_open to determine whether to persist a single ImageOpener.

If we're all happy with this, I will try and finish this PR off tonight (UK time).

@effigies
Copy link
Member

effigies commented Jun 3, 2018

I think that makes sense, as the default was already False, and should simplify the logic. I might wait a version before actively producing a warning for 'auto', though. DeprecationWarnings can be pretty noisy for users when an app developer wants to support multiple versions of a library. (Willing to be overruled for an immediate DeprecationWarning.)

@matthew-brett
Copy link
Member

Just working through the logic, 'auto' means keep_file_open when indexed_gzip is present. But, with your investigations, you concluded that keeping the file open didn't actually have much effect on performance. So 'auto' won't have much effect on performance?

I guess there may be situations where it does? NFS mounts? Slow hard disks?

@effigies
Copy link
Member

effigies commented Jun 3, 2018

'auto' means keep_file_open when indexed_gzip

What it means is, if we update the above table to the following:

HAVE_INDEXED_GZIP keep_file_open Library used ArrayProxy persists opener indexed_gzip.drop_handles
False False gzip No na
False 'auto' gzip No na
False True gzip Yes na
True False indexed_gzip Yes True
True 'auto' indexed_gzip Yes True
True True indexed_gzip Yes False

Then in all cases 'auto' is equivalent to False.

@effigies
Copy link
Member

effigies commented Jun 3, 2018

(The fourth row is the one we've been discussing, for reference.)

@matthew-brett
Copy link
Member

Right - but dropping the file handles for auto is new, right? Why not retain the file handles for auto?

@effigies
Copy link
Member

effigies commented Jun 3, 2018

It has been the case that 'auto' dropped file handles for gzip and not for indexed_gzip. It is not proposed to change that.

Now that we're pushing file-handle dropping into indexed_gzip itself, then a persistent Opener object can hold an indexed_gzip file that itself drops the file handles. So the behavior for keep_file_open==False becomes identical to that of 'auto'.

@matthew-brett
Copy link
Member

Sorry again for not following, but am I right in thinking that auto with indexed_gzip available, used to cause keep_file_open to be set to True? If that logic does not change, then that will now cause drop_handles to be False?

@effigies
Copy link
Member

effigies commented Jun 3, 2018

No worries.

[Am] I right in thinking that auto with indexed_gzip available, used to cause keep_file_open to be set to True?

Yes.

If that logic does not change, then that will now cause drop_handles to be False?

Ah, that logic is changing. The issue here is that we've been conflating Openers with file handles, because when indexed_gzip did not have drop_handles (equivalent to drop_handles = False), they were equivalent. However, we did see, when adding the drop_handles parameter, that it had no performance impact, and thus set the default to True. Therefore, without change in nibabel, the behavior of 'auto' has become to drop file handles when possible.

The benefit of keep_file_open=False has been to preserve the old nibabel behavior of dropping file handles opportunistically to avoid hitting quotas. 'auto' now meets that criterion, and we've been discussing whether to make keep_file_open=False drop an Opener (and thus ignore indexed_gzip) or use indexed_gzip with drop_handles=True.

To address your earlier questions:

But, with your investigations, you concluded that keeping the file open didn't actually have much effect on performance.

We've found two things.

  1. Dropping file-handles within indexed_gzip had no effect on performance. Agreed that this was local, and we did not try over NFS or similar. However, this argues for users to set keep_file_open=True, not False or 'auto', so that decision should be unaffected by the 'auto'/False collapse.

  2. There is a 4%-5% overhead for a single-shot, random-access volume selection with indexed_gzip over gzip. However, disabling indexed_gzip where keep_file_open=False does not seem like an intuitive use of the parameter, and would reflect history more than describe the behavior of the library.

@matthew-brett
Copy link
Member

Right - so I agree that disabling indexed_zip for keep_file_open=False is a bad idea, but I'm wondering whether we should preserve the behavior of auto, causing file handles to be preserved, on the basis that it may sometimes make a difference to performance. Anyone using auto will already have had the chance of running out of filehandles. It's pretty easy to explain that auto is equivalent to keep_file_open=True when indexed_gzip is available, but we can also say that it will likely have little impact on performance.

@effigies
Copy link
Member

effigies commented Jun 3, 2018

I'll start by saying: That's fine. If that makes sense to people, okay.

That said, I don't think it's very intuitive that 'auto' would come to mean "keep files open for indexed_gzip but not gzip". In every case where there's a performance benefit for indexed_gzip, there will be one for gzip (e.g., if closing the file leads to higher latencies over the network). So we would be changing behavior based on the Python environment for historical reasons that were pragmatic and are now speculative.

But again, it's fine, especially if documented as you suggest. I'll always use False (or True in cases where keeping a FH open is a known benefit), so I have no personal stake in 'auto'.

@matthew-brett
Copy link
Member

Yes, I see the argument (that keeping files open = True is always faster regardless of whether you have indexed_gzip. How about warning in the docstring about deprecation this release, and deprecating next release, as you suggested a bit further up?

@pauldmccarthy
Copy link
Contributor Author

pauldmccarthy commented Jun 3, 2018

Ok, if I've followed the conversation correctly:

  • if keep_file_open == 'auto' and HAVE_INDEXED_GZIP, we should keep files open (i.e. drop_handles=False)
  • if keep_file_open == 'auto' and not HAVE_INDEXED_GZIP, we should drop file handles (by way of the ArrayProxy dropping ImageOpener instances)
  • Add to the docs that keep_file_open = 'auto' will be deprecated in 2.4, and removed in ... 3.0?
HAVE_INDEXED_GZIP keep_file_open Library used ArrayProxy persists opener indexed_gzip.drop_handles
False False gzip No na
False 'auto' gzip No na
False True gzip Yes na
True False indexed_gzip Yes True
True 'auto' indexed_gzip Yes False
True True indexed_gzip Yes False

Sound good?

I guess I'll also change the default behaviour (dictated by arrayproxy.KEEP_FILE_OPEN_DEFAULT) back to False, so that the default behaviour is to drop file handles.

@effigies
Copy link
Member

effigies commented Jun 3, 2018

  • if keep_file_open == 'auto' and HAVE_INDEXED_GZIP, we should keep files open (i.e. drop_handles=False)
  • if keep_file_open == 'auto' and not HAVE_INDEXED_GZIP, we should drop file handles (by way of the ArrayProxy dropping ImageOpener instances)

I understood Matthew to mean we shouldn't change the behavior of 'auto', so we would end up back to 'auto' == False, but again, I don't care here. Matthew, feel free to clarify your preference. Otherwise, I'm okay with whichever Paul chooses.

  • Add to the docs that keep_file_open = 'auto' will be deprecated in 2.4, and removed in ... 3.0?

Sounds good to me.

I guess I'll also change the default behaviour (dictated by arrayproxy.KEEP_FILE_OPEN_DEFAULT) back to False, so that the default behaviour is to drop file handles.

Yes, please.

indexed_gzip is used, or dropped if gzip is used. Default value for
keep_file_open is now False. Warn that 'auto' will be deprecated soon.
@matthew-brett
Copy link
Member

Paul - yes - that's what I meant - keep the file open, and don't drop file handles, for auto and indexed_gzip. The docstring from current master has:

    If ``'auto'``, and the optional ``indexed_gzip`` dependency is
    present, a single file handle is created and persisted. If
    ``indexed_gzip`` is not available, behaviour is the same as if
    ``keep_file_open is False``

... so I thought (keep file open, don't drop handles, if indexed_gzip present), was as close as possible to that. Chris - did I get that wrong?

@effigies
Copy link
Member

effigies commented Jun 3, 2018

Nope, you're right. I'm 👍 to merge. Any final issues?

@effigies
Copy link
Member

effigies commented Jun 4, 2018

@matthew-brett Sorry to bug, but if you're happy with this, I'd like to merge ASAP.

@matthew-brett
Copy link
Member

Sorry - yes - go ahead - please do merge.

@effigies effigies merged commit dfd113c into nipy:master Jun 4, 2018
effigies added a commit that referenced this pull request Jun 4, 2018
@effigies
Copy link
Member

effigies commented Jun 4, 2018

Thanks, Matthew.

@pauldmccarthy
Copy link
Contributor Author

Thanks guys!

yarikoptic added a commit to yarikoptic/nibabel that referenced this pull request Jun 27, 2018
* commit '2.2.1-261-g12da3be2':
  DOC: Update changelog
  RF: remove duplicate test
  BF: fix example for get_fdata and array images
  BF: array images return array if OK float type
  RF: rewrite return of array / proxy test images.
  RF: refactor image API tests
  DOC: use get_fdata in docs
  DOC: Update changelog to include nipygh-614
  RF: keep_file_open == 'auto' now causes file handles to be kept open if indexed_gzip is used, or dropped if gzip is used. Default value for keep_file_open is now False. Warn that 'auto' will be deprecated soon.
  TEST: Updated to expect indexed_gzip if present
  RF: Always use indexed_gzip for read access to gz files
  TEST,STY: Fixes to opener tests, unused import in benchmark.
  TEST: Further adjustment to arrayproxy benchmark
  RF: Make minimum required indexed_gzip version 0.7.
  TEST: Change unit test arrayproxy mocks - no longer necessary.
  RF: arrayproxy imports openers module, rather than importing individual items from the openers module.
  TEST: Make sure non-gzip file handles are dropped when keep_file_open == 'auto'. Updates to benchmark functions.
  RF,STY: Make sure that non-gzip file handles are dropped when keep_file_open == 'auto'.
  TEST: Update ImageOpener/ArrayProxy unit tests
  RF: ArrayProxy.KEEP_FILE_OPEN default value changed to 'auto'. Opener keep_open flag passed through to indexed gzip drop_handles flag. indexed_gzip versions > 0.6 all supported.
@pauldmccarthy pauldmccarthy deleted the rf/deprecate_keep_file_open branch October 12, 2018 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants