ENH: Take advantage of IndexedGzipFile drop_handles flag #614
nibabel/arrayproxy.py
keep_file_open)
# If using indexed_gzip, we use a single ImageOpener. Otherwise, we
# create a new ImageOpener on each file access
self._persist_opener = openers.HAVE_INDEXED_GZIP
Main thought here is that we only want to use a single file if we have indexed_gzip and the file is a .gz file. Quick read-through suggests that the un-compressed case might not be handled correctly, but I may have just missed it.
I'll try to have a more detailed look soon.
Aah yes, you're right. I'll change the condition to something like HAVE_INDEXED_GZIP and isinstance(file_like, str) and file_like.endswith('.gz').
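For illustration, the proposed condition can be wrapped in a small helper. This is only a sketch: `should_persist_opener` and its `have_indexed_gzip` argument are hypothetical names standing in for the `openers.HAVE_INDEXED_GZIP` flag, not code from the PR.

```python
def should_persist_opener(file_like, have_indexed_gzip):
    """Return True when a single opener should be reused across accesses.

    Hypothetical helper: reuse one opener only if indexed_gzip is
    available AND we were given a file name that ends in '.gz'.
    """
    return (have_indexed_gzip
            and isinstance(file_like, str)
            and file_like.endswith('.gz'))
```

With this check an uncompressed file name, or an already-open file object, falls back to creating a new opener on each access.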
Codecov Report
@@            Coverage Diff             @@
##           master     #614      +/-   ##
==========================================
- Coverage   88.87%   88.83%   -0.05%
==========================================
  Files          92       92
  Lines       11274    11278       +4
  Branches     1847     1848       +1
==========================================
- Hits        10020    10019       -1
- Misses        921      925       +4
- Partials      333      334       +1
keep_file_open, always use indexed_gzip if present
I hate to jump in without understanding. But, would the flag continue to be useful in the following situation, without
Hey @matthew-brett, thanks for the comment. This PR would break that example. Based on discussions between myself and @effigies (#573), I formed the But your example suggests otherwise - is this a real use case, and are you
I think I'm right in saying that the example above would be much faster with
Yep, that's correct. So it would be useful if you need to scan through a |
Yes, there's definitely a huge speed-up for keeping files open in that case, so it makes sense to me to keep the option around, given we went through the trouble of putting it in. So at this point, I think I'd want the logic to be:
So keep open if That said, if we're not doing a major refactor, do we want to continue to permit If so, then I think the logic becomes a bit annoying to put in a truth table:
if keep_file_open is None:
    keep_file_open = arrayproxy.KEEP_FILE_OPEN_DEFAULT
if keep_file_open is True:
    return True
if indexed_gzip.__version__ < 0.6:  # I treat missing as 0
    return False
if indexed_gzip.__version__ >= 0.7:
    return fname.endswith('.gz')
return keep_file_open and fname.endswith('.gz')
What do y'all think? (And sorry for initiating this refactor, if we're going to end up throwing most of it away.)
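The pseudocode above compares `indexed_gzip.__version__` against floats, which would not work on a real version string. Below is a runnable sketch of the same truth table, under a few assumptions: the version is passed in as a string (or `None` when indexed_gzip is missing), `parse_version` and `should_keep_open` are hypothetical helper names, and `KEEP_FILE_OPEN_DEFAULT` is a local stand-in for `arrayproxy.KEEP_FILE_OPEN_DEFAULT`.

```python
import re

# Stand-in for arrayproxy.KEEP_FILE_OPEN_DEFAULT (assumed value).
KEEP_FILE_OPEN_DEFAULT = False


def parse_version(version):
    """Turn a string such as '0.7.0' into (0, 7, 0); missing counts as (0,)."""
    if version is None:
        return (0,)
    numbers = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break  # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers) or (0,)


def should_keep_open(fname, keep_file_open=None, igzip_version=None):
    """Hypothetical version of the truth table proposed in the comment."""
    if keep_file_open is None:
        keep_file_open = KEEP_FILE_OPEN_DEFAULT
    if keep_file_open is True:
        return True
    version = parse_version(igzip_version)
    if version < (0, 6):
        # indexed_gzip missing or too old
        return False
    is_gz = fname.endswith('.gz')
    if version >= (0, 7):
        # 0.7+ manages the file handle itself, so always persist for .gz
        return is_gz
    # 0.6.x: follow the user's flag ('auto' is truthy, False is not)
    return bool(keep_file_open) and is_gz
```

Comparing tuples of integers avoids the pitfalls of comparing floats or raw strings (for example, `'0.10'` sorting before `'0.7'`).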
Sorry, I know this is annoying, but it seems to me |
So how about we just preserve the existing behaviour, but change the default value of |
Yes, that's my suggestion - but I'm afraid I didn't follow the previous discussion. Does that suit everyone? |
Specifically, I think we should change the default value in |
The reason to "keep open" for 0.7.0+ is that indexed gzip now closes the
actual filehandle when not performing file operations. My opinion is that
as long as the filehandle gets closed, it doesn't matter whether we use a
persistent Opener object. That's just the means to the end when we're
directly holding the filehandle, or something that acts equivalently.
@effigies I see your logic - for So semantically the name But in the interest of minimising both confusion and the need to refactor the Happy to be overruled though! |
_persist_opener is a hidden variable, so it isn't what's relevant to user
expectations, and most will not know what an Opener is or that the normal
behavior is to throw it away. keep_file_open is a good description of our
goal and its impact on OS resource limits.
The whole point of the exercise, to my mind, is to give users the advantage
of indexed_gzip for free when doing so doesn't change OS resource
consumption, which is to say, for version 0.7+.
I think for v0.6 (did we decide whether to keep supporting it?), the old
default should be maintained, for all of the reasons we set it to False.
And if we consider 0.7's internal closing to be in the correct spirit, then
False remains the best default across versions.
Probably safe to, given neurodebian. What about the following:
That seems fine, if that's the consensus. I just don't think anybody cares about Openers who doesn't hack on nibabel, and changing the value of a constant based on dependency versions feels less user friendly than fiddling with implementation details. But I'm okay with your approach. Just making sure we are all deciding based on a full picture. But I don't think we can move the constant, since anybody modifying it will expect it to continue working. |
Sigh, if only there was a nicer way than this to use |
Sorry, I'm on my phone. Are you saying that indexed_gzip will close the open file we pass? So the flag says keep open but in fact it gets closed?
@matthew-brett - yes, as of 0.7.0, an |
Sorry to be ignorant here, but do we not have a way of using the final result of Is the point that, there is no advantage to keeping the file open with 0.7? If so, can't we deal with that by making |
In 0.7, the file is always closed between actions. There is no benefit whatsoever to keeping it open, and thus no way to pass the option was provided. In effect, it's doing what ArrayProxy is doing when keep_file_open is False, but at a lower level. The only effect of closing an IndexedGzipFile in 0.7 is to throw out the index.
Thus my reasoning is that nobody ever wants to throw out the index except to close filehandles, which is the straightforward meaning of keep_file_open=False. If we're able to give them the index and close filehandles, why bother killing the index over the implementation detail of persistent Openers?
Again, I'm okay with changing the auto constant, if that's the consensus, but it feels less user friendly to me.
We could also switch Opener to persistent, pass the keep open parameter to it, and have it handle the opening and closing, rather than ArrayProxy, if that's cleaner.
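To make the distinction between a persistent object and an open filehandle concrete, here is a minimal, stdlib-only sketch. It is not indexed_gzip's implementation: `HandleDroppingReader` is a hypothetical class whose `offset` attribute plays the role of the index, surviving between reads while the operating-system file handle is opened and closed around each one.

```python
import os
import tempfile


class HandleDroppingReader:
    """Hypothetical reader that keeps state but drops its file handle."""

    def __init__(self, path):
        self.path = path
        self.offset = 0      # persistent state, analogous to the index
        self.open_count = 0  # how many times the file was (re)opened

    def read(self, nbytes):
        # The handle only exists for the duration of this call.
        with open(self.path, 'rb') as handle:
            self.open_count += 1
            handle.seek(self.offset)
            data = handle.read(nbytes)
            self.offset = handle.tell()
        return data


def demo():
    """Read twice from a temporary file and report what happened."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as handle:
            handle.write(b'abcdefgh')
        reader = HandleDroppingReader(path)
        first = reader.read(3)
        second = reader.read(3)
        return first, second, reader.open_count
    finally:
        os.remove(path)
```

Throwing the reader object away between reads would also discard `offset`, which is the analogue of losing the index when a new opener is created on every access.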
Oh, actually, my mistake. For reference, though it's not exactly light reading: pauldmccarthy/indexed_gzip@v0.6.1...v0.7.0 |
keep_file_open, always use indexed_gzip if present
76f6614 to 6d93be9
Hey @effigies and @matthew-brett, sorry for the delay, and apologies if this is too late for 2.3.0. Here is my new take on how What do you guys think?
2b70f5e to d5ad97a
Updated to always use Then the If we're all happy with this, I will try and finish this PR off tonight (UK time). |
I think that makes sense, as the default was already |
Just working through the logic, 'auto' means keep_file_open when indexed_gzip is present. But, with your investigations, you concluded that keeping the file open didn't actually have much effect on performance. So 'auto' won't have much effect on performance? I guess there may be situations where it does? NFS mounts? Slow hard disks? |
What it means is, if we update the above table to the following:
Then in all cases |
(The fourth row is the one we've been discussing, for reference.) |
Right - but dropping the file handles for |
It has been the case that Now that we're pushing file-handle dropping into |
Sorry again for not following, but am I right in thinking that |
No worries.
Yes.
Ah, that logic is changing. The issue here is that we've been conflating The benefit of To address your earlier questions:
We've found two things.
Right - so I agree that disabling |
I'll start by saying: That's fine. If that makes sense to people, okay. That said, I don't think it's very intuitive that But again, it's fine, especially if documented as you suggest. I'll always use |
Yes, I see the argument (that keeping files open = True is always faster regardless of whether you have indexed_gzip). How about warning in the docstring about deprecation this release, and deprecating next release, as you suggested a bit further up?
Ok, if I've followed the conversation correctly:
Sound good? I guess I'll also change the default behaviour (dictated by |
I understood Matthew to mean we shouldn't change the behavior of
Sounds good to me.
Yes, please. |
RF: keep_file_open == 'auto' now causes file handles to be kept open if indexed_gzip is used, or dropped if gzip is used. Default value for keep_file_open is now False. Warn that 'auto' will be deprecated soon.
Paul - yes - that's what I meant - keep the file open, and don't drop file handles, for
... so I thought (keep file open, don't drop handles, if indexed_gzip present), was as close as possible to that. Chris - did I get that wrong? |
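The behaviour agreed here, as summarised in the commit message, might be sketched as follows. `resolve_keep_file_open` and its arguments are hypothetical names for illustration only and are not taken from the nibabel source.

```python
def resolve_keep_file_open(keep_file_open, fname, have_indexed_gzip):
    """Hypothetical resolution of the keep_file_open flag.

    True / False are honoured as given. 'auto' keeps the file open only
    when indexed_gzip would be used, i.e. it is available and the file
    name ends in '.gz'; otherwise the handle is dropped.
    """
    if keep_file_open not in (True, False, 'auto'):
        raise ValueError("keep_file_open must be True, False or 'auto'")
    if keep_file_open == 'auto':
        return have_indexed_gzip and fname.endswith('.gz')
    return keep_file_open
```

Under this reading, `False` remains the default, so users who never touch the flag see no change in open file handles.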
Nope, you're right. I'm 👍 to merge. Any final issues? |
@matthew-brett Sorry to bug, but if you're happy with this, I'd like to merge ASAP. |
Sorry - yes - go ahead - please do merge. |
Thanks, Matthew. |
Thanks guys! |
* commit '2.2.1-261-g12da3be2':
  DOC: Update changelog
  RF: remove duplicate test
  BF: fix example for get_fdata and array images
  BF: array images return array if OK float type
  RF: rewrite return of array / proxy test images.
  RF: refactor image API tests
  DOC: use get_fdata in docs
  DOC: Update changelog to include nipygh-614
  RF: keep_file_open == 'auto' now causes file handles to be kept open if indexed_gzip is used, or dropped if gzip is used. Default value for keep_file_open is now False. Warn that 'auto' will be deprecated soon.
  TEST: Updated to expect indexed_gzip if present
  RF: Always use indexed_gzip for read access to gz files
  TEST,STY: Fixes to opener tests, unused import in benchmark.
  TEST: Further adjustment to arrayproxy benchmark
  RF: Make minimum required indexed_gzip version 0.7.
  TEST: Change unit test arrayproxy mocks - no longer necessary.
  RF: arrayproxy imports openers module, rather than importing individual items from the openers module.
  TEST: Make sure non-gzip file handles are dropped when keep_file_open == 'auto'. Updates to benchmark functions.
  RF,STY: Make sure that non-gzip file handles are dropped when keep_file_open == 'auto'.
  TEST: Update ImageOpener/ArrayProxy unit tests
  RF: ArrayProxy.KEEP_FILE_OPEN default value changed to 'auto'. Opener keep_open flag passed through to indexed gzip drop_handles flag. indexed_gzip versions > 0.6 all supported.
This is an attempt to address #573.
keep_file_open parameter
indexed_gzip if it is present (this can be disabled by overwriting the nibabel.openers.HAVE_INDEXED_GZIP flag)
ArrayProxy so that, if indexed_gzip is present, and it is given a .gz file, it creates and uses a single ImageOpener. Otherwise it creates a new ImageOpener on every access.