
FIX: Disable direct creation of non-conformant GiftiDataArrays #1199


Merged
effigies merged 8 commits into nipy:master from fix/gifti_dtypes on Mar 13, 2023

Conversation

effigies
Member

GIFTI only defines three valid datatypes. Nibabel has generally allowed users to use any defined NIFTI_TYPE. In limiting this scope, there's a balance to strike between:

  1. Making it easy/default to generate spec-conforming GIFTI files
  2. Permitting loading of non-conforming GIFTI files (including some previously generated with nibabel)
  3. Respecting the user's intent when made explicit, even when it violates the spec

(1) is the primary goal here. For (2), making ourselves intentionally unable to read a file seems like a very bad idea.

(3) seems like an open question. We do not uniformly allow users to violate standards; for example, we do not allow fixed-length strings as the voxel data in NIfTI, even though numpy would permit it. So it could be fine to simply refuse to write files with invalid DataArray types. On the other hand, you could imagine a pipeline of functions that take in a GiftiImage and return a GiftiImage so that metadata follows the data, and it's only at write time that you would really want to enforce a precision reduction. (In any event, it would be very difficult to entirely prevent the construction in memory of non-conformant images, due to Python's object model.)


My initial thought was that we could auto-convert to the nearest dtype, but that could silently introduce unexpected results for users, so @matthew-brett suggested raising errors. This is a fairly simple API change.
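
For concreteness, a rough sketch of what raising at construction time could look like (the exact exception type and message are assumptions for illustration, not the final API):

import numpy as np
import nibabel as nb

data = np.zeros((5,), dtype=np.float64)

# Constructing a data array with a datatype the GIFTI spec does not define
# would raise instead of silently producing a non-conformant array.
try:
    darr = nb.gifti.GiftiDataArray(data)
except ValueError as err:
    print(err)

# Casting up front keeps the array spec-conformant.
darr = nb.gifti.GiftiDataArray(data.astype(np.float32))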

Because the objects are constructed piecemeal by the parser, this does not prevent loading files. We might still want to find a way to warn on load.

A couple options for making it harder to write a non-conforming GIFTI:

  1. Raise error during GiftiImage.to_xml() if any array dtype is not conforming. This could be overridden with a keyword argument force.
  2. Raise warning during GiftiImage.__init__() if any data arrays have non-conforming dtypes, or if they are added through API mechanisms (GiftiImage.add_gifti_data_array()).

Closes #1198.

@codecov

codecov bot commented Feb 17, 2023

Codecov Report

Patch coverage: 93.93% and project coverage change: -0.02 ⚠️

Comparison is base (b8b57cc) 92.15% compared to head (cd1a39a) 92.14%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1199      +/-   ##
==========================================
- Coverage   92.15%   92.14%   -0.02%     
==========================================
  Files          97       97              
  Lines       12334    12360      +26     
  Branches     2534     2544      +10     
==========================================
+ Hits        11367    11389      +22     
- Misses        645      648       +3     
- Partials      322      323       +1     
Impacted Files Coverage Δ
nibabel/gifti/gifti.py 93.26% <93.93%> (-0.08%) ⬇️
nibabel/viewers.py 95.49% <0.00%> (-0.65%) ⬇️


@neurolabusc

The primary strength of GIfTI is that it is the standard interchange format for our community, and it was designed from the outset to support domain-specific needs such as describing anatomical structures, labeling, and spatial transforms. However, GIfTI is inherently slow to read and write and produces large files.

My sense is that the datatypes of GIfTI were carefully chosen to match the needs of our community (e.g. the SNR of our data is low) and the capabilities of our hardware (APIs and GPUs are limited to about 2 billion vertices, with vertex precision limited to float32). While a user might intentionally request extreme datatypes believing that more is better, I suspect they will experience a lot of unintended consequences.

I appreciate @satra's ability to see the big picture, while I tend to focus on implementation details. Perhaps he can share his thoughts.

@effigies
Member Author

Yes, GIFTI is an interchange format, but the nibabel objects are not very heavy. So for internal use, it's not unreasonable for someone to create a float64 GiftiDataArray or even GiftiImage and pass it from function to function, and only want to cast down to float32 at serialization time.

I can also see a use for wanting to inspect the object without conversion, possibly through a schema-agnostic XML viewer.

So I would suggest something like the following on GiftiImage:

class GiftiImage:
    def to_xml(self, enc='utf-8', *, mode='strict'):  # Called by to_filename()
        if mode == 'strict':
            if any(arr.datatype not in ('uint8', 'int32', 'float32') for arr in self.darrays):
                raise ValueError(
                    'GiftiImage contains data arrays with invalid data types; '
                    'use mode="compat" to automatically cast to conforming types'
                )
        elif mode == 'compat':
            gii = copy(self)
            # Convert unsupported float/int types to float32 or int32 if possible
            return gii.to_xml(enc=enc, mode='strict')
        elif mode != 'force':
            raise TypeError(f'Unknown mode {mode}')
        ...
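
To illustrate, the proposed modes would be used roughly like this (names taken from the sketch above; this is the proposal, not a guarantee of the final API):

img.to_xml()                # 'strict': error if any data array has a non-conforming datatype
img.to_xml(mode='compat')   # cast e.g. float64 -> float32, int64 -> int32 before serializing
img.to_xml(mode='force')    # serialize as-is, producing a non-conformant (but still readable) file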

Combined with explicit datatypes, building out a non-conformant image by hand would require two specific overrides:

import numpy as np
import nibabel as nb

darr = nb.gifti.GiftiDataArray(np.zeros((5,), dtype=np.float64), datatype='NIFTI_TYPE_FLOAT64')
gii = nb.GiftiImage(darrays=[darr])
gii.to_filename('64bit.gii', mode='force')

This would allow someone to create GiftiDataArrays internally and pass them around freely without premature casts. If they attempt to write them to disk without casting, they will get an error unless the datatypes are already conformant, so we avoid silent changes in behavior (@matthew-brett's concern). They can then choose whether to coerce the data to conform to the specification or to force writing to disk.

If the force write use case is still uncompelling and two overrides are insufficient friction, I would be okay with dropping it. For loading 64-bit GIFTIs in another nibabel instance, pickling will be much faster:

In [3]: img = nb.load('/data/openneuro/ds002790/.git/annex/objects/2j/3z/MD5E-s3485910--1942b389e2cc
   ...: f637934bf1d62c6f20c2.surf.gii/MD5E-s3485910--1942b389e2ccf637934bf1d62c6f20c2.surf.gii')

In [4]: from pickle import dumps, loads

In [5]: %timeit loads(dumps(img))
2.27 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit img.from_bytes(img.to_bytes())
308 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

And of course, you can still inspect a data array by hand without putting it in a <GIFTI Version="1.0"></GIFTI> wrapper.

In [7]: _ = img.darrays[0].to_xml()

@effigies
Member Author

Went ahead and pushed a proposal since I wanted to test it. Not intended to preempt discussion; happy to consider alternatives.

@alexisthual

I went through the discussion and I agree with your views @effigies 🙂 (namely, making it possible for users to write non-standard gifti files at the price of setting mode="force", and allowing users to load non-standard gifti files, in which case we display a warning and point to a way to make the file standard).

@effigies
Member Author

Thanks for the comments, @alexisthual. Happy to see further discussion or a code review. In the absence of both, I'll try to look at this with fresh eyes some time this week and plan to merge next Monday.


@alexisthual left a comment


I'm not familiar with the codebase so I might be missing a lot of important points.
What is missing thus far? Some warnings for non-conformant legacy images?

@@ -834,20 +852,45 @@ def _to_xml_element(self):
             GIFTI.append(dar._to_xml_element())
         return GIFTI

-    def to_xml(self, enc='utf-8') -> bytes:
+    def to_xml(self, enc='utf-8', *, mode='strict') -> bytes:


Just curious, what is the * for here? Do you expect this function to be called with more kwargs?

Member Author


The * means you have to call mode= as a keyword argument. So to_xml('utf-8', 'force') will fail, to_xml('utf-8', mode='force') will pass. I would be inclined to make enc keyword-only as well, I just didn't want to make it part of this PR...
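
For example, a generic illustration of the keyword-only marker (plain Python, not nibabel-specific behavior):

def to_xml(enc='utf-8', *, mode='strict'):
    return f'{enc}/{mode}'

to_xml('utf-8', mode='force')  # OK: mode passed by keyword
to_xml('utf-8', 'force')       # TypeError: takes 1 positional argument but 2 were given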


That's interesting, thanks!

Comment on lines +866 to +875
if arr.datatype not in GIFTI_DTYPES:
    arr = copy(arr)
    # TODO: Better typing for recoders
    dtype = cast(np.dtype, data_type_codes.dtype[arr.datatype])
    if np.issubdtype(dtype, np.floating):
        arr.datatype = data_type_codes['float32']
    elif np.issubdtype(dtype, np.integer):
        arr.datatype = data_type_codes['int32']
    else:
        raise ValueError(f'Cannot convert {dtype} to float32/int32')


I get how this part corrects the darray's datatype attribute, but I don't understand where the actual data is cast to the new datatype.

Member Author


The data is only cast at write time. Serialization to XML will call DataArray._to_xml_element():

def _to_xml_element(self):
    # fix endianness to machine endianness
    self.endian = gifti_endian_codes.code[sys.byteorder]

    # All attribute values must be strings
    data_array = xml.Element(
        'DataArray',
        attrib={
            'Intent': intent_codes.niistring[self.intent],
            'DataType': data_type_codes.niistring[self.datatype],
            'ArrayIndexingOrder': array_index_order_codes.label[self.ind_ord],
            'Dimensionality': str(self.num_dim),
            'Encoding': gifti_encoding_codes.specs[self.encoding],
            'Endian': gifti_endian_codes.specs[self.endian],
            'ExternalFileName': self.ext_fname,
            'ExternalFileOffset': str(self.ext_offset),
        },
    )
    for di, dn in enumerate(self.dims):
        data_array.attrib['Dim%d' % di] = str(dn)

    if self.meta is not None:
        data_array.append(self.meta._to_xml_element())
    if self.coordsys is not None:
        data_array.append(self.coordsys._to_xml_element())
    # write data array depending on the encoding
    data_array.append(
        _data_tag_element(
            self.data,
            gifti_encoding_codes.specs[self.encoding],
            data_type_codes.dtype[self.datatype],
            self.ind_ord,
        )
    )
    return data_array

Which calls _data_tag_element():

def _data_tag_element(dataarray, encoding, dtype, ordering):
    """Creates data tag with given `encoding`, returns as XML element"""
    import zlib

    order = array_index_order_codes.npcode[ordering]
    enclabel = gifti_encoding_codes.label[encoding]
    if enclabel == 'ASCII':
        da = _arr2txt(dataarray, KIND2FMT[dtype.kind])
    elif enclabel in ('B64BIN', 'B64GZ'):
        out = np.asanyarray(dataarray, dtype).tobytes(order)
        if enclabel == 'B64GZ':
            out = zlib.compress(out)
        da = base64.b64encode(out).decode()
    elif enclabel == 'External':
        raise NotImplementedError('In what format are the external files?')
    else:
        da = ''

    data = xml.Element('Data')
    data.text = da
    return data

L380 is the one that finally does it: out = np.asanyarray(dataarray, dtype).tobytes(order)
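
A minimal standalone illustration of that cast, using plain numpy rather than nibabel internals:

import numpy as np

data = np.arange(5, dtype=np.float64)
out = np.asanyarray(data, np.float32).tobytes('C')  # the cast happens here, at serialization time
print(len(data.tobytes()), len(out))                # 40 20 -- half the bytes after float64 -> float32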

@effigies merged commit 40e31e8 into nipy:master Mar 13, 2023
@effigies deleted the fix/gifti_dtypes branch March 13, 2023 00:56
@effigies
Member Author

Thanks for the review, @alexisthual!

Successfully merging this pull request may close these issues.

Nibabel creates invalid GIfTI files