-
Notifications
You must be signed in to change notification settings - Fork 262
BF: Make sure xml is encoded as utf-8 #354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -192,7 +193,7 @@ def data_tag(dataarray, encoding, datatype, ordering): | |||
raise NotImplementedError("In what format are the external files?") | |||
else: | |||
da = '' | |||
return "<Data>" + da + "</Data>\n" | |||
return ("<Data>" + da + "</Data>\n").encode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess da
is always a str
(for Pythons 2 and 3)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; it is computed above, not passed it, so we can be confident.
Sorry - I have a feeling of exhaustion thinking about this, but I know that's not very helpful. I guess the rule is that stuff output from the code in XML should always be bytes encoded in UTF-8, and stuff output as text from methods (not XML) should always be unicode / Python 3 str decoded from UTF-8? |
If you call |
You've put quite a few |
|
Sorry, got my encode / decodes the wrong way round. I think it's clear what to do for the stuff that is going out to XML, that should always be bytes. I'm worrying about the consistency of stuff that is coming in from XML as return arguments from methods - so I should have asked about the 'decode' calls - where you are returning unicode in Python 2. Sorry if I'm not thinking clearly though. |
@matthew-brett I don't see any methods calling Or am I misunderstanding the concern? |
return result | ||
self.ind_ord).decode('utf-8') | ||
result = result + self.to_xml_close().decode('utf-8') | ||
return result.encode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire function is an encoded concatenation of decoded bytestrings. Can't all of these changes be dropped?
If you simply want to explicitly label this function as dealing in bytes
and not unicode
, you could use a BytesIO
to accumulate:
def to_xml(self):
# fix endianness to machine endianness
self.endian = gifti_endian_codes.code[sys.byteorder]
result = BytesIO(self.to_xml_open())
# write metadata
if not self.meta is None:
result.write(self.meta.to_xml())
# write coord sys
if not self.coordsys is None:
result.write(self.coordsys.to_xml())
# write data array depending on the encoding
dt_kind = data_type_codes.dtype[self.datatype].kind
result.write(data_tag(self.data,
gifti_encoding_codes.specs[self.encoding],
KIND2FMT[dt_kind],
self.ind_ord))
result.write(self.to_xml_close())
return result.getvalue()
While decoding and encoding will be slightly more expensive, writing to a BytesIO object will be slightly less.
Does it make sense in the |
How about using the If this looks good, we can use something similar for CIFTI. |
I think the idea of using an XML library rather than hand-generating it is a solid one, especially if it doesn't introduce a new dependency. I haven't looked deeply at this diff, though. Is there a way to verify that the output of this and the previous output are the same (at least to the extent that parsing the files leads to identical data structures)? |
I believe the I counted on the GIFTI tests. I could beef those up if they're not sufficient. There were definitely errors when I screwed things up, but no guarantee they'd fully cover all cases. |
It looks like |
name = xml.SubElement(md, 'Name') | ||
value = xml.SubElement(md, 'Value') | ||
name.text = ele.name | ||
value.text = ele.value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we lose the CDATA
bit here? Maybe:
name.text = '<![CDATA[{0}]]>'.format(ele.name)
value.text = '<![CDATA[{0}]]>'.format(ele.value)
Edit: Or possibly we don't care, because xml
will handle any escaping that needs to be done? (I am not very familiar with XML.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the CDATA
bit is a way to avoid encoding strings as XML-safe. That trick isn't necessary now; the library will XML encode.
In addition, if you do as you suggest, I believe the <
and >
characters will be escaped, and you'll get something you don't want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Just did a little reading and playing in ipython.
Okay, I've looked through and all of the changes make sense to me, and I do agree it looks much cleaner this way. I will note that this does again run into encode-decode-encode loops, but in XML instead of Unicode. My impression of these functions is that this is plumbing, not API (but I could be wrong and the
|
Thanks @effigies ! I agree the current encode/decode isn't the best. I considered the three solutions you gave, but did it this way as the current interface has been published. It would be great to migrate the interface (since I suppose a fourth option would be to have an internal |
Fair. That last option also makes sense, but these are presumably rare enough operations that it's more a decision about aesthetics than efficiency, so I'd be inclined to leave it as is, unless it seems like a good idea to deprecate the public |
I don't think I also wouldn't want a public function to return an object type like I'm leaning more towards having a |
@effigies I pushed two changes:
Maybe it's overkill; I can back out those commits if y'all don't like. |
""" Creates the data tag depending on the required encoding """ | ||
def _data_tag_element(dataarray, encoding, datatype, ordering): | ||
""" Creates the data tag depending on the required encoding, | ||
returns as bytes""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns as *XML Element
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
To be clear, I meant rare as in "probably no more than one per second". But that interface doesn't strike me as overkill, especially if it's usable in CIFTI. |
@effigies Cool. Yes, this will be directly applicable to CIFTI, which is also XML-based. |
@matthew-brett Any ETA on when you'd be able to look over this PR? I believe @effigies looked overthe current design and didn't have any major objections. This is one of two PRs I need in for a GIFTI Thanks! |
Ben - so sorry - I have been completely swamped these last few days with various things. I see you're in mid-refactor, and that I'm slowing you up. Chris - would you consider being the the main reviewer for Ben's set of PRs on the XML-related code? I mean, can you take responsibility for the PRs, and merge them when all comments are in? In order not to hold Ben up? |
@matthew-brett I can, if you don't mind occasional pinging about specific questions (especially API issues). My current model of API is: if it's not prefaced with a As to this PR, I have two questions that I was assuming would get answered when you reviewed:
@bcipolli Conservatively (absent a contrary opinion from Matthew), I'd say we should address my first question by deprecating |
@effigies I believe it's |
As for These utilities could live in an |
I agree about the deprecation stuff - no By I'll try and pitch in when I can, but I'm getting ready to head out for a trip, so I may be a bit slow to reply. Please do ping me though, past the point where you think it's polite, I'll do my best. |
Chris, Ben - I've added you as maintainers - so you can both merge if you want to. Chris - can I leave you to be the responsible reviewer on this set of PRs from Ben? Ben, maybe you can be the responsible reviewer for Chris' PRs if he's also working on this stuff? |
@matthew-brett Yup, I'll be responsible reviewer on these. (Just realized you'd asked, not assigned.) |
@bcipolli Do you want to rebase these PRs? |
Will do, thanks! |
Overlap with #353. Another rebase needed. |
@effigies rebased, Travis tests pass. |
) | ||
|
||
def to_xml_close(self): | ||
return "</DataArray>\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stupid, but do you think we should have deprecated to_xml_open
and to_xml_close
? They were public, even if they were kind of weird. If so, I think the easiest way is probably just to restore them verbatim, since neither corresponds to a full XML element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's painful. But you're right, good call! Will work on this.
da = GiftiDataArray.from_array(np.ones((1,)), 'triangle') | ||
with clear_and_catch_warnings() as w: | ||
warnings.filterwarnings('always', category=DeprecationWarning) | ||
assert_true(isinstance(da.to_xml_open(), string_types)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string_types
needs to be imported from externals.six.
Rebase and I think we're almost done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Sorry for all the errors; I'm juggling changes three branches (trying to keep things simple/clean #fail) and I'm doing a poor job :)
Fixed, rebased, and pushed up. Looking forward to getting this one in; the next PR is the most interesting one! Looking forward to discussing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. Thanks for your patience. Heading out of town in a couple hours, so the next PR might have to wait for Monday, but I do want to get this in so you don't have to juggle too much more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all your work. No worries--that one is a WIP and will take some time. I've just been looking forward to beginning that discussion!
I made the change requested. Just more sloppiness from my side. Reordered & diff looks cleaner now.
@property | ||
def metadata(self): | ||
""" Returns metadata as dictionary """ | ||
return self.meta.metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this reordering intentional? PR would be a little simpler if print_summary
and {get_,}metadata
went back below to_xml_*
.
RF: Gifti images are XML serializable Eliminate (as much as possible) hand-written XML nibabel.xmlutils.XmlSerializable interface provides to_xml functions Deprecate helper functions that should not have been public
There we go. Thanks for all your work on these. Good luck with the next stage! |
RF: Gifti images are XML serializable Eliminate (as much as possible) hand-written XML nibabel.xmlutils.XmlSerializable interface provides to_xml functions Deprecate helper functions that should not have been public
Valid XML has to be utf-8--in bytes. Current code doesn't explicitly encode or decode, and so different versions of Python return bytes (Python 2) or Unicode (Python 3).
This caused test errors in the
cifti
PR (and was fixed there); a similar issue needs to be fixed for GIFTI.