FIX: Disable direct creation of non-conformant GiftiDataArrays #1199
Conversation
Codecov Report

Patch coverage:

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #1199      +/-   ##
==========================================
- Coverage   92.15%   92.14%   -0.02%
==========================================
  Files          97       97
  Lines       12334    12360      +26
  Branches     2534     2544      +10
==========================================
+ Hits        11367    11389      +22
- Misses        645      648       +3
- Partials      322      323       +1
```
The primary strength of GIfTI is that it is the standard interchange format for our community, and it was designed from the outset to support domain-specific needs such as describing anatomical structures, labels, and spatial transforms. However, GIfTI is inherently slow to read and write and produces large files. My sense is that the GIfTI datatypes were carefully chosen to match the needs of our community (e.g., the SNR of our data is low) and the capabilities of our hardware (APIs and GPUs limit us to 2 billion vertices, and vertex precision is limited to float32). While a user might intentionally request extreme datatypes believing that […], I appreciate @satra's ability to see the big picture, while I tend to focus on implementation details. Perhaps he can share his thoughts.
Yes, GIFTI is an interchange format, but the nibabel objects are not very heavy. So for internal use, it's not unreasonable for someone to create a non-conformant `GiftiImage`. I can also see a use for wanting to inspect the object without conversion, possibly through a schema-agnostic XML viewer. So I would suggest something like the following on `GiftiImage`:

```python
from copy import copy

class GiftiImage:
    def to_xml(self, enc='utf-8', *, mode='strict'):  # Called by to_filename()
        if mode == 'strict':
            if any(arr.datatype not in ('uint8', 'int32', 'float32') for arr in self.darrays):
                raise ValueError(
                    'GiftiImage contains data arrays with invalid data types; '
                    'use mode="compat" to automatically cast to conforming types'
                )
        elif mode == 'compat':
            gii = copy(self)
            # Convert unsupported float/int types to float32 or int32 if possible
            return gii.to_xml(enc=enc, mode='strict')
        elif mode != 'force':
            raise TypeError(f'Unknown mode {mode}')
        ...
```

Combined with explicit datatypes, building out a non-conformant image by hand would require two specific overrides:

```python
import numpy as np
import nibabel as nb

darr = nb.gifti.GiftiDataArray(np.zeros((5,), dtype=np.float64), datatype='NIFTI_TYPE_FLOAT64')
gii = nb.GiftiImage(darrays=[darr])
gii.to_filename('64bit.gii', mode='force')
```

This would allow someone to create non-conformant files deliberately. If the force write use case is still uncompelling and two overrides are insufficient friction, I would be okay with dropping it.

For loading 64-bit GIFTIs in another nibabel instance, pickling will be much faster:

```
In [3]: img = nb.load('/data/openneuro/ds002790/.git/annex/objects/2j/3z/MD5E-s3485910--1942b389e2cc
   ...: f637934bf1d62c6f20c2.surf.gii/MD5E-s3485910--1942b389e2ccf637934bf1d62c6f20c2.surf.gii')

In [4]: from pickle import dumps, loads

In [5]: %timeit loads(dumps(img))
2.27 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit img.from_bytes(img.to_bytes())
308 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

And of course, you can still inspect a data array by hand without putting it in a `GiftiImage`:

```
In [7]: _ = img.darrays[0].to_xml()
```
Went ahead and pushed a proposal since I wanted to test it. Not intended to preempt discussion; happy to consider alternatives.
Force-pushed from 8919154 to b400dd5.
I went through the discussion and I agree with your views @effigies 🙂 (namely, making it possible for users to write non-standard gifti files at the price of setting `mode='force'`).
Thanks for the comments, @alexisthual. Happy to see further discussion or a code review. In the absence of both, I'll try to look at this with fresh eyes some time this week and plan to merge next Monday.
I'm not familiar with the codebase so I might be missing a lot of important points.
What is missing thus far? Some warnings for non-conformant legacy images?
```diff
@@ -834,20 +852,45 @@ def _to_xml_element(self):
             GIFTI.append(dar._to_xml_element())
         return GIFTI

-    def to_xml(self, enc='utf-8') -> bytes:
+    def to_xml(self, enc='utf-8', *, mode='strict') -> bytes:
```
Just curious, what is the `*` for here? Do you expect this function to be called with more kwargs?
The `*` means you have to call `mode=` as a keyword argument. So `to_xml('utf-8', 'force')` will fail, while `to_xml('utf-8', mode='force')` will pass. I would be inclined to make `enc` keyword-only as well; I just didn't want to make it part of this PR...
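For anyone following along, here is a minimal standalone sketch (a toy function, not the nibabel implementation) of how a bare `*` makes the parameters after it keyword-only:

```python
def to_xml(enc='utf-8', *, mode='strict'):
    """Toy stand-in: `mode` can only be passed by keyword."""
    return enc, mode

print(to_xml('utf-8', mode='force'))  # ('utf-8', 'force')

try:
    to_xml('utf-8', 'force')  # positional `mode` is rejected
except TypeError as exc:
    print(exc)  # to_xml() takes from 0 to 1 positional arguments but 2 were given
```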
That's interesting, thanks!
```python
if arr.datatype not in GIFTI_DTYPES:
    arr = copy(arr)
    # TODO: Better typing for recoders
    dtype = cast(np.dtype, data_type_codes.dtype[arr.datatype])
    if np.issubdtype(dtype, np.floating):
        arr.datatype = data_type_codes['float32']
    elif np.issubdtype(dtype, np.integer):
        arr.datatype = data_type_codes['int32']
    else:
        raise ValueError(f'Cannot convert {dtype} to float32/int32')
```
I get how this part corrects the darray's `datatype` attribute, but I don't understand where the actual data is cast to the new datatype.
The data is only cast at write time. Serialization to XML will call `DataArray._to_xml_element()` (`nibabel/nibabel/gifti/gifti.py`, lines 487 to 522 in 3a4cc5e):
```python
def _to_xml_element(self):
    # fix endianness to machine endianness
    self.endian = gifti_endian_codes.code[sys.byteorder]
    # All attribute values must be strings
    data_array = xml.Element(
        'DataArray',
        attrib={
            'Intent': intent_codes.niistring[self.intent],
            'DataType': data_type_codes.niistring[self.datatype],
            'ArrayIndexingOrder': array_index_order_codes.label[self.ind_ord],
            'Dimensionality': str(self.num_dim),
            'Encoding': gifti_encoding_codes.specs[self.encoding],
            'Endian': gifti_endian_codes.specs[self.endian],
            'ExternalFileName': self.ext_fname,
            'ExternalFileOffset': str(self.ext_offset),
        },
    )
    for di, dn in enumerate(self.dims):
        data_array.attrib['Dim%d' % di] = str(dn)
    if self.meta is not None:
        data_array.append(self.meta._to_xml_element())
    if self.coordsys is not None:
        data_array.append(self.coordsys._to_xml_element())
    # write data array depending on the encoding
    data_array.append(
        _data_tag_element(
            self.data,
            gifti_encoding_codes.specs[self.encoding],
            data_type_codes.dtype[self.datatype],
            self.ind_ord,
        )
    )
    return data_array
```
Which calls `_data_tag_element()` (`nibabel/nibabel/gifti/gifti.py`, lines 371 to 391 in 3a4cc5e):
```python
def _data_tag_element(dataarray, encoding, dtype, ordering):
    """Creates data tag with given `encoding`, returns as XML element"""
    import zlib

    order = array_index_order_codes.npcode[ordering]
    enclabel = gifti_encoding_codes.label[encoding]
    if enclabel == 'ASCII':
        da = _arr2txt(dataarray, KIND2FMT[dtype.kind])
    elif enclabel in ('B64BIN', 'B64GZ'):
        out = np.asanyarray(dataarray, dtype).tobytes(order)
        if enclabel == 'B64GZ':
            out = zlib.compress(out)
        da = base64.b64encode(out).decode()
    elif enclabel == 'External':
        raise NotImplementedError('In what format are the external files?')
    else:
        da = ''

    data = xml.Element('Data')
    data.text = da
    return data
```
L380 is the one that finally does it: `out = np.asanyarray(dataarray, dtype).tobytes(order)`
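As a quick standalone illustration (plain numpy, not nibabel code) of why that line performs the cast only at serialization time:

```python
import numpy as np

data = np.arange(5, dtype=np.float64)                # in-memory array stays float64
out = np.asanyarray(data, np.float32).tobytes('C')   # cast happens here, when bytes are produced

print(data.dtype)  # float64 -- the original array is untouched
print(len(out))    # 20 bytes: 5 values * 4-byte float32, not 40 bytes of float64
```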
Thanks for the review, @alexisthual!
MRG: Fix test creating unsupported double-precision GiftiDataArray (see: nipy/nibabel#1198, nipy/nibabel#1199, nilearn/nilearn#3649)
GIFTI only defines three valid datatypes. Nibabel has generally allowed users to use any defined `NIFTI_TYPE`. In limiting this scope, there's a balance to strike between:

1. ensuring that the files we write conform to the standard,
2. still being able to read files that do not, and
3. deciding how strictly in-memory objects must conform.

(1) is the primary goal here. For (2), making ourselves intentionally unable to read a file seems like a very bad idea.

(3) seems like an open question. We do not uniformly allow users to violate standards; for example, we do not allow fixed-length strings as the voxel data in NIfTI, even though numpy would permit it. So it could be fine to simply refuse to write files with invalid `DataArray` types. On the other hand, you could imagine a pipeline of functions that take in a `GiftiImage` and return a `GiftiImage`, where you want metadata to follow the data, and it's only at write time that you would really want to enforce a precision reduction. (It would in any event be very difficult to entirely prevent the construction in memory of non-conformant images, due to Python's object model.)

My initial thought was that we could auto-convert to the nearest dtype, but that could silently introduce unexpected results for users, so @matthew-brett suggested raising errors. This is a fairly simple API change.
Because the objects are constructed piecemeal by the parser, this does not prevent loading files. We might still want to find a way to warn on load.
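A minimal sketch of what such a load-time warning could look like (a hypothetical helper, not part of this PR; it assumes the three conforming datatypes and the `data_type_codes` recoder from `nibabel.nifti1`):

```python
import warnings

import nibabel as nb
from nibabel.nifti1 import data_type_codes

GIFTI_DTYPES = {'uint8', 'int32', 'float32'}  # the three datatypes GIFTI defines

def warn_on_nonconformant(img):
    """Hypothetical helper: warn about data arrays with non-GIFTI datatypes."""
    for i, darr in enumerate(img.darrays):
        label = data_type_codes.label[darr.datatype]
        if label not in GIFTI_DTYPES:
            warnings.warn(f'darrays[{i}] has non-GIFTI datatype {label}', stacklevel=2)

# Usage (path is illustrative):
# warn_on_nonconformant(nb.load('some_surface.gii'))
```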
A couple options for making it harder to write a non-conforming GIFTI:

1. Error in `GiftiImage.to_xml()` if any array dtype is not conforming. This could be overridden with a keyword argument `force`.
2. Error in `GiftiImage.__init__()` if any data arrays have non-conforming dtypes, or if they are added through API mechanisms (`GiftiImage.add_gifti_data_array()`). A toy sketch of this option follows below.

Closes #1198.
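For illustration, a rough, self-contained sketch of the second option using toy stand-in classes (the class names and error messages are illustrative, not nibabel's API):

```python
from dataclasses import dataclass, field

GIFTI_DTYPES = {'NIFTI_TYPE_UINT8', 'NIFTI_TYPE_INT32', 'NIFTI_TYPE_FLOAT32'}

@dataclass
class ToyDataArray:  # stand-in for GiftiDataArray
    datatype: str

@dataclass
class ToyImage:  # stand-in for GiftiImage
    darrays: list = field(default_factory=list)

    def add_gifti_data_array(self, darr):
        # Reject non-conforming datatypes at the point where arrays enter the image
        if darr.datatype not in GIFTI_DTYPES:
            raise ValueError(
                f'{darr.datatype} is not a GIFTI-conformant datatype; '
                'cast to uint8/int32/float32 first'
            )
        self.darrays.append(darr)

img = ToyImage()
img.add_gifti_data_array(ToyDataArray('NIFTI_TYPE_FLOAT32'))      # accepted
try:
    img.add_gifti_data_array(ToyDataArray('NIFTI_TYPE_FLOAT64'))  # rejected under this option
except ValueError as exc:
    print(exc)
```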