Skip to content

Push loadsave logic into individual classes, remove logic duplication #317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bcipolli opened this issue Jun 24, 2015 · 14 comments
Closed

Comments

@bcipolli
Copy link
Contributor

In modifying loadsave.py for CIFTI file format, I found three design choices that made it more challenging for me to understand and modify code:

  1. The logic for whether a file is of each Image type is contained in a single function, not modularized into each image (yet the load/save functionality is).
  2. Image type / file extension associations is duplicated between load and save functions.
  3. The list of available file formats is duplicated (in imports, then again in the explicit logic of load and save, each).

I propose the following redesign (which I think is relatively simple):

  • Each Image class should declare class variable valid_ext (valid filename extensions) and class function is_image(filename) (which returns True if the given filename is of the Image class.
  • loadsave.py should have a global tuple of all image classes. load and save should simply loop through this list, should not have any image-specific code.

Pros:

  • This should improve code modularity, readability, maintainability, and avoid errors in adding new formats.
  • It gives a clear interface for determining file Image type.

Cons:

  • This design will have a small time cost in opening files when a file extension is ambiguous about the file type, and the file type isn't the most common for that file extension.
  • It gives more structure to determining image type, which could be detrimental in the future.

Thoughts?

@effigies
Copy link
Member

Not writing image classes, myself, I can only give an aesthetic opinion. I think having each image type directly perform the work of establishing whether a file is that type is much neater than a global function that figures out the type.

To go further, perhaps is_image should instead be a function that returns a class or None? e.g.

Minc1Image.test_image('test.nii') == None
Minc1Image.test_image('minc1.mnc') == Minc1Image
Minc1Image.test_image('minc2.mnc') == Minc2Image

Since Minc2Image is a subclass of Minc1Image, the test_image code would only need to be written once and only entered once regardless of which class is tested first. Though if it were to go that far, why not a try_image function that returns either a loaded image or None?

I think those options would address con one. Regarding the second, is it really overly structured to give a class the ability to inspect a file to determine if it is that type? (Admittedly I'm thinking of this in terms of load, not save.

Finally, a looping master tuple seems a little crude when you've got a string that should be able to index a dictionary. Why not a global type map from extensions to images that use that extension? Either classes add themselves to it as they are loaded such as with a decorator:

@valid_exts('.ex1', '.ex')
class Ex1Image

Or loadsave.py constructs such a map from the klass.valid_ext tuples. (It looks like there's something kind of like this in imageclasses that doesn't allow for multiple types associated with the same extension.)

Anyway, expect @matthew-brett will soon be along to explain why these exact ideas were scrapped long ago. :-)

@bcipolli
Copy link
Contributor Author

+1 @effigies , I like all those suggestions. I was hoping there were some superclasses to do exactly what you mentioned, but am still unfamiliar with the class structure. Nice idea on the global type map, that's much more elegant.

@matthew-brett
Copy link
Member

To keep the conversation rolling.

The current code could certainly do with some improvement. The current situation is just what happened in the initial fast get-it-working phase, and often does not have a deep justification.

I think we are primarily looking at:

The use-case I was paying most attention to in the initial sketch was that of Analyze vs Nifti for img extensions. This ended up with a single function because I wanted to make sure that the load routine did only one read of the data to determine what type it was. The other use-cases that came up were nifti1 vs nifti2 and minc1 vs minc2 - where we have to read some of the file to know what type it is. The alternative is something like what y'all are describing, where we first establish the characteristic extension, then fetch (from a dictionary or something) all classes that can load files with this extension,


valid_image_types = {'img': [Nifti1Pair, Nifti2Pair, Spm2Analyze, Spm99Analyze, Analyze],
                                   'hdr', [Nifti1Pair, Nifti2Pair, Spm2Analyze, Spm99Analyze, Analyze],
}
ext = get_ext(filename)  # Logic here for removing .gz, bz2?  Or in dictionary?
potential_classes = valid_img_types[ext]
if len(potential_classes) == 1:
     return potential_classes[0].load(filename)
for cls in valid_image_types[ext]:
       if cls.is_valid_file(filename):
           return cls.load(filename)
raise TypeError('No valid image type found')

Here I guess each potential class would need to sniff the metadata for the image. It would also need to determine where the metadata was - for example Nifti and other Analyze types may have img for a file extension, but the metadata is in a file called hdr. I guess this could be added to the valid_image_types dictionary somehow.

I wonder if there is a good way to sniff the file only once - maybe doing a 1K or less read from the relevant file (e.g. img or hdr) at the top of the loop, and passing the sniff to the is_valid_file routine instead of / as well as the filename.

@effigies
Copy link
Member

I wonder if there is a good way to sniff the file only once - maybe doing a 1K or less read from the relevant file (e.g. img or hdr) at the top of the loop, and passing the sniff to the is_valid_file routine instead of / as well as the filename.

I was thinking about this. Would disk/OS caching not be sufficient to make the cost of multiple reads negligible? I'm just curious how much you gain by shaving off the odd double read of a kilobyte or so. If time(load) >> time(is_valid_file), then the cost in complexity is probably greater than the savings in time, especially when considering the img/hdr problem (the load function needing to know enough to read the hdr file to pass to is_valid_file).

@matthew-brett
Copy link
Member

Once the file is open, I am sure 1K reads would be a tiny cost, but opening the file for a read might be slow on a network drive.

@matthew-brett
Copy link
Member

At the moment, we have to deal with the following image data / metadata file pairs:

img, hdr : Analyze-like files including nifti1, 2
par, rec : Philips format

I guess we might also have to think about images in directories like DICOM, and Bruker format : http://imaging.mrc-cbu.cam.ac.uk/imaging/FormatBruker

@matthew-brett
Copy link
Member

Another option is to pass in the sniff if we have it, otherwise None, maybe something like:

sniff = None
for cls in valid_image_types[ext]:
    is_mine, sniff = cls.check_filename(filename, sniff)

@bcipolli
Copy link
Contributor Author

Another option is to pass in the sniff if we have it, otherwise None

I was thinking similarly.

@effigies
Copy link
Member

For the image/metadata pairs, are all of these cases in which they are never unified? What's driving my concern is whether in some cases we'll want to sniff img and other cases hdr, so using a simple sniff variable might hit a conflict.

Regarding network file systems: I know that CIFS, NFS and AFS all employ caching. Are there other filesystems that need to be accounted for? I'm not trying to be contrary, but without a specific reason to be concerned about network latency of multiple openings, it strikes me as dipping into concerns that are more appropriately the domain of the filesystem. The use case in which that seems likely to become the concern of nibabel is if you can reasonably expect enough simultaneous validity checks (including the multiple process case) to overwhelm the caches.

If we always know what file we want to sniff, then the above bit is mostly a non-issue as calling the header file can be done in the comfort of the individual image class. But it would force us into the situation where, if we begin supporting a file where it's not known by the extension whether we need to look at a separate metadata file, then all is_valid_files would have to return a sniff = None and we're back where we started.

@matthew-brett
Copy link
Member

Older versions of SPM liked to have one file per volume, so this could make for thousands of files to open.

You are right that this may not be a problem, but it may be hard to find out - we need to test on some slow file-systems.

At the moment we might like to sniff for img, hdr pairs, mnc files and and nii files, of which only img, hdr pairs have metadata in separate files. I guess when we find a more confusing format, we could have sniff always returned as None, so fall back to the inefficient behavior when necessary.

@matthew-brett
Copy link
Member

While I am thinking about it.

We could go for something like this:

class SpatialImage:
    ...
    def load_if_valid(self, filename, sniff=None):
        if sniff is None:
            sniff = self.sniff_filename(filename)
        if self.is_valid_filename(filename) and self.is_valid_sniff(sniff):
            return self.__class__.load(filename), sniff
        return None, sniff

Then:

def load(filename):
    froot, ext, trailing = splitext_addext(filename, ('.gz', '.bz2'))
    lext = ext.lower()
    try:
        img_classes = IMAGE_MAP[lext]
    except KeyError:
        raise ImageFileError('Cannot work out file type of "%s"' %
                             filename)
    sniff = None
    for img_class in img_classes:
        # This assumes the sniff will be valid for any file in this list
        # We could put a SniffInvalidator in the sequence to deal with this if
        # we wanted: if img_class is SniffInvalidator: sniff=None; continue
        img, sniff = img_class.load_if_valid(filename, sniff)
        if img is not None:
            return img
    raise ImageFileError(
        'No file classes in {img_classes} accept {filename} as valid'.format(
            locals())

@bcipolli
Copy link
Contributor Author

bcipolli commented Jul 7, 2015

I'll play with this a bit; I think the ideas here are an improvement over what currently exists.

matthew-brett added a commit that referenced this issue Aug 27, 2015
MRG: More graceful handling of compressed file Opening

This PR deprecates BinOpener, instead exposing a decorator for creating arbitrary file extension / opener methods (similar to that proposed in #317 and created in #319).
@coolgirl43
Copy link

that is a very helpful tip, it really worked for me. I even asked my teacher in my individual classes at http://preply.com/en and he didn't know. Good thing, you had a solution :)

matthew-brett added a commit that referenced this issue Oct 6, 2015
MRG: Migrate load, save, class_map, ext_map to class properties and methods

This is to address #317 and #323 (and aims to supercede PR #319), both issues around image-specific code being distributed across files, rather than localized in the class definition.

Changes here include:
* Removing image-specific code from `nib.load` and `nib.save` functions, moving the logic into `ImageKlass.is_image` and `HeaderKlass.is_header` class functions.
* Deprecating `class_map` and `ext_map` objects, and moving necessary info to properties defined on the image classes themselves.
@effigies
Copy link
Member

Closed by #329.

grlee77 pushed a commit to grlee77/nibabel that referenced this issue Mar 15, 2016
MRG: More graceful handling of compressed file Opening

This PR deprecates BinOpener, instead exposing a decorator for creating arbitrary file extension / opener methods (similar to that proposed in nipy#317 and created in nipy#319).
grlee77 pushed a commit to grlee77/nibabel that referenced this issue Mar 15, 2016
MRG: Migrate load, save, class_map, ext_map to class properties and methods

This is to address nipy#317 and nipy#323 (and aims to supercede PR nipy#319), both issues around image-specific code being distributed across files, rather than localized in the class definition.

Changes here include:
* Removing image-specific code from `nib.load` and `nib.save` functions, moving the logic into `ImageKlass.is_image` and `HeaderKlass.is_header` class functions.
* Deprecating `class_map` and `ext_map` objects, and moving necessary info to properties defined on the image classes themselves.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants