
Fix for issue 362: nibabel fails to stream gzipped files > 4GB (uncompressed) in Python 3.5 #383


Merged: 9 commits, Nov 20, 2015
Changes from all commits

32 changes: 32 additions & 0 deletions doc/source/devel/advanced_testing.rst
@@ -0,0 +1,32 @@
.. -*- mode: rst -*-
.. ex: set sts=4 ts=4 sw=4 et tw=79:
### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ###
#
# See COPYING file distributed along with the NiBabel package for the
# copyright and license terms.
#
### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ###

.. _advanced_testing:

****************
Advanced Testing
****************

Setup
-----

Before running the advanced tests, please update all submodules of nibabel by running ``git submodule update --init``.


Long-running tests
------------------

Long-running tests are not enabled by default, and can be resource-intensive. To run these tests:

* Set environment variable ``NIPY_EXTRA_TESTS=slow``
* Run ``nosetests``.

Note that some tests may require a machine with >4GB of RAM.
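
For example (a minimal sketch, assuming ``nibabel`` is importable and that the
nose-based ``nibabel.test()`` runner is available), the environment variable can
also be set from within Python::

    import os

    # Equivalent to running: NIPY_EXTRA_TESTS=slow nosetests nibabel
    os.environ['NIPY_EXTRA_TESTS'] = 'slow'

    import nibabel
    nibabel.test()  # nose-based runner; the 'slow' tests are now included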

.. include:: ../links_names.txt
1 change: 1 addition & 0 deletions doc/source/devel/index.rst
@@ -14,3 +14,4 @@ Developer documentation page
add_image_format
devdiscuss
make_release
advanced_testing
11 changes: 10 additions & 1 deletion doc/source/installation.rst
@@ -114,7 +114,11 @@ Just install the modules by invoking::
If sudo is not configured (or even installed) you might have to use
``su`` instead.

Now fire up Python and try importing the module to see if everything is fine.

Validating your install
-----------------------

For a basic test of your installation, fire up Python and try importing the module to see if everything is fine.
It should look something like this::

Python 2.7.8 (v2.7.8:ee879c0ffa11, Jun 29 2014, 21:07:35)
@@ -123,4 +127,9 @@ It should look something like this::
>>> import nibabel
>>>


To run the nibabel test suite from the terminal, run ``nosetests nibabel`` or ``python -c "import nibabel; nibabel.test()"``.

To run an extended test suite that validates ``nibabel`` for long-running and resource-intensive cases, please see :ref:`advanced_testing`.

.. include:: links_names.txt
56 changes: 51 additions & 5 deletions nibabel/openers.py
@@ -9,19 +9,65 @@
""" Context manager openers for various fileobject types
"""

from os.path import splitext
import gzip
import bz2

Member: Have we checked to make sure large .bz2 files aren't affected by the same upstream bug?

Contributor Author: Nope. What file types allow .bz? .nii.bz didn't seem to work.

Contributor Author: Duh, .bz2. Trying it now.

Contributor Author: Yes, fails :( (screenshot attached)

Member: That looks like 2.6, and so is probably a different issue from this PR's concern. Possibly worth looking into in a new PR, but low priority in my mind. (Nobody seems to be using large .bz2 files, and who's working on HCP data with seriously old Python?)

What about 3.5?

Contributor Author: That was Python 2.7, but good point--I didn't use Python3. Python 3 is hanging, using a ton of resources.

Member: Okay. bz2 sounds like its own issue, not something we can quickly take care of while we're here.

Contributor Author: OK, it finally returned successfully in Python 3.5

import gzip
import sys
from os.path import splitext


# The largest memory chunk that gzip can use for reads
GZIP_MAX_READ_CHUNK = 100 * 1024 * 1024 # 100Mb


class BufferedGzipFile(gzip.GzipFile):
    """GzipFile able to readinto buffer >= 2**32 bytes.

    This class only differs from gzip.GzipFile
    in Python 3.5.0.

    This works around a known issue in Python 3.5.
    See https://bugs.python.org/issue25626"""

    # This helps avoid defining readinto in Python 2.6,
    # where it is undefined on gzip.GzipFile.
    # It also helps limit the exposure to this code.
    if sys.version_info[:3] == (3, 5, 0):
        def __init__(self, fileish, mode='rb', compresslevel=9,
                     buffer_size=2**32-1):
            super(BufferedGzipFile, self).__init__(fileish, mode=mode,
                                                   compresslevel=compresslevel)
            self.buffer_size = buffer_size

        def readinto(self, buf):
            """Uses self.buffer_size to do a buffered read."""
            n_bytes = len(buf)
            if n_bytes < 2 ** 32:
                return super(BufferedGzipFile, self).readinto(buf)

            # This works around a known issue in Python 3.5.
            # See https://bugs.python.org/issue25626
            mv = memoryview(buf)

Member: I prefer keeping try .. except blocks as short as possible to make sure that it's clear what is being trapped and why. So here, something like:

    try:
        mv = memoryview(buf)
    except NameError:
        n_read = super(BufferedGzipFile, self).readinto(buf)
    else:
        n_read = 0
        ...

Member: Does it make sense to invert?

    try:
        n_read = super(BufferedGzipFile, self).readinto(buf)
    except OverflowError:
        # Workaround for known issue in Python 3.5
        ...

Or is the memoryview strategy preferred except in Python 2.6?

Contributor Author: I am fine on the try...catch...else. I wrote it the way I did because separating as above breaks into slightly weird chunks.

I believe that the gzip failure is not that fast, which is why I avoided the gzip failure by default. But it's hard to tell. I don't believe either of these methods is "preferred"; I believe readinto is deprecated.

Member: Point taken about waiting on an OverflowError. What about making the check explicitly for Python 3.5.0, instead of testing for memoryview? 3.5.1 should fix the issue, and I'm not super comfortable with using a hand-rolled buffering loop by default on 2.7+ because of a 3.5.0-specific bug.

And my understanding is that readinto hasn't been deprecated; it's just been pushed up the hierarchy.

            n_read = 0
            max_read = 2 ** 32 - 1  # Max for unsigned 32-bit integer
            while (n_read < n_bytes):
                n_wanted = min(n_bytes - n_read, max_read)
                n_got = super(BufferedGzipFile, self).readinto(
                    mv[n_read:n_read + n_wanted])
                n_read += n_got
                if n_got != n_wanted:
                    break
            return n_read


 def _gzip_open(fileish, *args, **kwargs):
-    # open gzip files with faster reads on large files using larger chunks
+    gzip_file = BufferedGzipFile(fileish, *args, **kwargs)
+
+    # Speedup for #209; attribute not present in Python 3.5
+    # open gzip files with faster reads on large files using larger chunks
     # See https://github.com/nipy/nibabel/pull/210 for discussion
-    gzip_file = gzip.open(fileish, *args, **kwargs)
-    gzip_file.max_read_chunk = GZIP_MAX_READ_CHUNK
+    if hasattr(gzip_file, 'max_chunk_read'):
+        gzip_file.max_read_chunk = GZIP_MAX_READ_CHUNK
+
     return gzip_file
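
As a usage sketch (the file name and buffer size here are hypothetical, not taken from this PR): destination buffers smaller than 4 GiB go straight to gzip.GzipFile.readinto, and only a buffer of 4 GiB or more on Python 3.5.0 takes the chunked memoryview path:

    from nibabel.openers import BufferedGzipFile

    # Hypothetical example: a destination buffer larger than 4 GiB, so that
    # readinto exercises the 2**32 - 1 byte chunking on Python 3.5.0.
    buf = bytearray(5 * 2 ** 30)
    with BufferedGzipFile('huge.nii.gz', mode='rb') as gz:  # hypothetical file
        n_read = gz.readinto(buf)  # bytes decompressed into buf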


11 changes: 11 additions & 0 deletions nibabel/testing/__init__.py
@@ -9,12 +9,14 @@
''' Utilities for testing '''
from __future__ import division, print_function

import os
import sys
import warnings
from os.path import dirname, abspath, join as pjoin

import numpy as np

from numpy.testing.decorators import skipif
# Allow failed import of nose if not now running tests
try:
    from nose.tools import (assert_equal, assert_not_equal,
@@ -164,3 +166,12 @@ def __init__(self, *args, **kwargs):
        warnings.warn('catch_warn_reset is deprecated and will be removed in '
                      'nibabel v3.0; use nibabel.testing.clear_and_catch_warnings.',
                      FutureWarning)


EXTRA_SET = os.environ.get('NIPY_EXTRA_TESTS', '').split(',')


def runif_extra_has(test_str):
"""Decorator checks to see if NIPY_EXTRA_TESTS env var contains test_str"""
return skipif(test_str not in EXTRA_SET,
"Skip {0} tests.".format(test_str))
41 changes: 29 additions & 12 deletions nibabel/tests/test_nifti1.py
@@ -13,17 +13,18 @@

import numpy as np

from ..externals.six import BytesIO
from ..casting import type_info, have_binary128
from ..tmpdirs import InTemporaryDirectory
from ..spatialimages import HeaderDataError
from ..eulerangles import euler2mat
from ..affines import from_matvec
from .. import nifti1 as nifti1
from ..nifti1 import (load, Nifti1Header, Nifti1PairHeader, Nifti1Image,
                      Nifti1Pair, Nifti1Extension, Nifti1Extensions,
                      data_type_codes, extension_codes, slice_order_codes)

from nibabel import nifti1 as nifti1

Member: Reason to avoid relative imports?

Contributor Author: I thought the standard is to use absolute imports in tests. At least, @effigies and I have been migrating to that standard as we touch test files.

Member: Any reason for that that you know?

Contributor Author: I got it from nilearn: nilearn/nilearn#384
Who got it from scikit-learn: http://scikit-learn.org/stable/developers/contributing.html#coding-guidelines

    Unit tests are an exception to the previous rule; they should use absolute imports, exactly as client code would. A corollary is that, if sklearn.foo exports a class or function that is implemented in sklearn.foo.bar.baz, the test should import it from sklearn.foo.

Member: OK - I don't feel strongly either way.

from nibabel.affines import from_matvec
from nibabel.casting import type_info, have_binary128
from nibabel.eulerangles import euler2mat
from nibabel.externals.six import BytesIO
from nibabel.nifti1 import (load, Nifti1Header, Nifti1PairHeader, Nifti1Image,
                            Nifti1Pair, Nifti1Extension, Nifti1Extensions,
                            data_type_codes, extension_codes,
                            slice_order_codes)
from nibabel.openers import ImageOpener

Member: Unused import.

from nibabel.spatialimages import HeaderDataError
from nibabel.tmpdirs import InTemporaryDirectory
from ..freesurfer import load as mghload

from .test_arraywriters import rt_err_estimate, IUINT_TYPES
@@ -35,7 +36,7 @@
from nose.tools import (assert_true, assert_false, assert_equal,
                        assert_raises)

from ..testing import data_path, suppress_warnings
from ..testing import data_path, suppress_warnings, runif_extra_has

from . import test_analyze as tana
from . import test_spm99analyze as tspm
@@ -1242,3 +1243,19 @@ def test_rt_bias(self):
        # Hokey use of max_miss as a std estimate
        bias_thresh = np.max([max_miss / np.sqrt(count), eps])
        assert_true(np.abs(bias) < bias_thresh)


@runif_extra_has('slow')
def test_large_nifti1():
    image_shape = (91, 109, 91, 1200)
    img = Nifti1Image(np.ones(image_shape, dtype=np.float32),
                      affine=np.eye(4))
    # Dump and load the large image.
    with InTemporaryDirectory():
        img.to_filename('test.nii.gz')
        del img
        data = load('test.nii.gz').get_data()
    # Check that the data are all ones
    assert_equal(image_shape, data.shape)
    n_ones = np.sum((data == 1.))
    assert_equal(np.prod(image_shape), n_ones)