Skip to content

Setting gzip mtime to zero on image save #1023

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
effigies opened this issue Jun 3, 2021 · 12 comments · Fixed by #1024
Closed

Setting gzip mtime to zero on image save #1023

effigies opened this issue Jun 3, 2021 · 12 comments · Fixed by #1024

Comments

@effigies
Copy link
Member

effigies commented Jun 3, 2021

Right now ImageOpener() does not allow us to pass an mtime, which means that saving identical images as .nii.gz 1s apart will produce two images with two different checksums.

Is there any value in saving a non-zero mtime? I see a few options, in my order of preference, most preferred first:

  1. Always save as 0.
  2. Default to 0, allow user to override.
  3. Default to a timestamp (current), allow user to override (new).

Note that the way we use GzipFile, it also sets the filename, which means saving the a.nii.gz and b.nii.gz with the same timestamp would still not have identical checksums. I would also be inclined to open GzipFiles in such a way as to not set filenames.

@yarikoptic
Copy link
Member

this is odd -- I thought nibabel already did timestamp annihilation (now I have difficulty quickly finding the project(s) where we had done that in the past, but I know we did! ;)). I think 2. would be the best.

Didn't even think about filename - indeed should not be in the header regardless of the user's desires IMHO ;-)

@effigies
Copy link
Member Author

effigies commented Jun 3, 2021

I'm fine with 2, though in the short term I think it would be a parameter to ImageOpener, which is a reasonably constrained scope of work. We still don't expose those parameters (like compresslevel, see #382) to img.to_filename() and nibabel.save(), which is its own thing that needs doing.

@ltetrel
Copy link

ltetrel commented Jun 3, 2021

The only advantage I see of keeping the mtime and filename in the gzip header is to simplify data carving, in case you accidentally deleted your files...

@effigies
Copy link
Member Author

effigies commented Jun 3, 2021

In principle, we could set filename to something constant, e.g nibabel-gzip, without exposing potentially sensitive information or losing determinism. Setting the mtime by default does not seem likely to help with data carving because you would need to know that mtime.

@yarikoptic
Copy link
Member

In principle, we could set filename to something constant, e.g nibabel-gzip, without exposing potentially sensitive information or losing determinism. Setting the mtime by default does not seem likely to help with data carving because you would need to know that mtime.

and then someone doing "data carving" @ltetrel suggested overwrites all different files with a single "last winner" ;) IMHO not worth it.

@ltetrel
Copy link

ltetrel commented Jun 15, 2021

So I was reading the gzip spec, and the last 8 bytes can actually be used to check data integrity (CRC-32 code followed by size of the object).
https://docs.fileformat.com/compression/gz/#gz-file-footer
This nullify the need of trying to have an identical shasum, since checking the CRC-32 is actually the right way to check for file integrity (and comparing files). And this is much faster than shasum-ing all the file.
wdyt @yarikoptic @effigies

@yarikoptic
Copy link
Member

does python library support adding that data integrity check? if does -- I guess it should not hurt but worth checking if it doesn't ;)

@ltetrel
Copy link

ltetrel commented Jun 15, 2021

I checked and the gzip python library does not (which I found quite weird)..
https://docs.python.org/3/library/gzip.html
But maybe there is another library that is doing that type of check ? Anyway it is not a huge deal to code it, but maintaining it could be more problematic (if the spec changes)...

@effigies
Copy link
Member Author

The spec hasn't changed in 25 years...

@effigies
Copy link
Member Author

I checked and the gzip python library does not (which I found quite weird)..
https://docs.python.org/3/library/gzip.html

I may be misreading, but the gzip in Python's stdlib does check the footer:

    def _read_eof(self):
        # We've read to the end of the file
        # We check that the computed CRC and size of the
        # uncompressed data matches the stored values.  Note that the size
        # stored is the true file size mod 2**32.
        crc32, isize = struct.unpack("<II", self._read_exact(8))
        if crc32 != self._crc:
            raise BadGzipFile("CRC check failed %s != %s" % (hex(crc32),
                                                             hex(self._crc)))
        elif isize != (self._stream_size & 0xffffffff):
            raise BadGzipFile("Incorrect length of data produced")

@ltetrel
Copy link

ltetrel commented Jun 15, 2021

@effigies Oh yes but I mean it is private not public so we cannot use this function member, but again it is simple to just copy it.

@effigies
Copy link
Member Author

Ah, right. You can get it easily with:

try:
    _ = gzipfile.read()
    valid_crc = True
except BadGzipFile:
    valid_crc = False

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants