BF: Make sure xml is encoded as utf-8 #354

Merged
merged 7 commits into from
Oct 17, 2015

Conversation

bcipolli
Contributor

@bcipolli bcipolli commented Oct 7, 2015

Valid XML has to be utf-8--in bytes. Current code doesn't explicitly encode or decode, and so different versions of Python return bytes (Python 2) or Unicode (Python 3).

This caused test errors in the cifti PR (and was fixed there); a similar issue needs to be fixed for GIFTI.

@@ -192,7 +193,7 @@ def data_tag(dataarray, encoding, datatype, ordering):
         raise NotImplementedError("In what format are the external files?")
     else:
         da = ''
-    return "<Data>" + da + "</Data>\n"
+    return ("<Data>" + da + "</Data>\n").encode('utf-8')
Member

I guess da is always a str (for Pythons 2 and 3)?

Contributor Author

Yes; it is computed above, not passed in, so we can be confident.

@matthew-brett
Member

Sorry - I have a feeling of exhaustion thinking about this, but I know that's not very helpful.

I guess the rule is that stuff output from the code in XML should always be bytes encoded in UTF-8, and stuff output as text from methods (not XML) should always be unicode / Python 3 str decoded from UTF-8?

@bcipolli
Contributor Author

bcipolli commented Oct 7, 2015

> I guess the rule is that stuff output from the code in XML should always be bytes encoded in UTF-8, and stuff output as text from methods (not XML) should always be unicode / Python 3 str decoded from UTF-8?

If you call to_xml(), you should get proper XML, which is UTF-8-encoded bytes. For other methods, I think you get bytes in Python 2, unicode in Python 3. I think that's in line with the rest of the code: we don't force into unicode whenever we return a string type.

@matthew-brett
Member

You've put quite a few .encode calls in, right? These are always going to return unicode I think. So the return type will sometimes (depending on the method) be Python 2 str, sometimes Python 2 unicode?

@bcipolli
Contributor Author

bcipolli commented Oct 7, 2015

.encode should return bytes; .decode should return unicode. I .decode to combine strings, then .encode on the way out of .to_xml to make sure I get utf-8 bytes. I'm pretty confident that's right.
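
The decode-to-combine, encode-on-the-way-out pattern described here can be sketched as follows. This is a minimal illustration rather than the PR's code; `join_xml_fragments` is a hypothetical helper name, and the fragments are made-up values.

```python
def join_xml_fragments(fragments):
    """Hypothetical helper: combine UTF-8 byte fragments into one UTF-8 byte string."""
    # Decode every fragment to text so the pieces can be concatenated as text
    text = u"".join(fragment.decode("utf-8") for fragment in fragments)
    # Encode once on the way out so the caller always receives bytes
    return text.encode("utf-8")


fragments = [
    u"<Data>".encode("utf-8"),
    u"caf\u00e9".encode("utf-8"),   # non-ASCII content survives the round trip
    u"</Data>\n".encode("utf-8"),
]
result = join_xml_fragments(fragments)
```

Here `.decode` yields text (unicode in Python 2, str in Python 3) and `.encode` yields bytes, so the return type is the same on both versions.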

@matthew-brett
Member

Sorry, got my encode / decodes the wrong way round.

I think it's clear what to do for the stuff that is going out to XML, that should always be bytes.

I'm worrying about the consistency of stuff that is coming in from XML as return arguments from methods - so I should have asked about the 'decode' calls - where you are returning unicode in Python 2.

Sorry if I'm not thinking clearly though.

@effigies
Member

effigies commented Oct 7, 2015

@matthew-brett I don't see any methods calling .to_xml() except other .to_xml() methods and the one situation in giftiio.py where the result is written to a binary-mode filehandle. So mixing .to_xml() bytes with unicode/Py3 str doesn't seem to be a risk.

Or am I misunderstanding the concern?
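
To illustrate why bytes output pairs naturally with a binary-mode filehandle, here is a small sketch. The file name and XML content are placeholders, not taken from giftiio.py, and the text-mode behaviour shown is that of Python 3.

```python
import os
import tempfile

# Placeholder XML content, already encoded as UTF-8 bytes
xml_bytes = u"<GIFTI>\u00e9</GIFTI>".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "example.gii")

# A binary-mode handle writes the bytes exactly as produced
with open(path, "wb") as f:
    f.write(xml_bytes)

with open(path, "rb") as f:
    round_tripped = f.read()

# In Python 3 a text-mode handle refuses bytes with a TypeError
try:
    with open(path, "w") as f:
        f.write(xml_bytes)
    text_mode_accepted_bytes = True
except TypeError:
    text_mode_accepted_bytes = False
```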

return result
self.ind_ord).decode('utf-8')
result = result + self.to_xml_close().decode('utf-8')
return result.encode('utf-8')
Member

This entire function is an encoded concatenation of decoded bytestrings. Can't all of these changes be dropped?

If you simply want to explicitly label this function as dealing in bytes and not unicode, you could use a BytesIO to accumulate:

import sys
from io import BytesIO

def to_xml(self):
    # fix endianness to machine endianness
    self.endian = gifti_endian_codes.code[sys.byteorder]
    # start empty and write each piece: BytesIO(initial_bytes) leaves the
    # position at 0, so later writes would overwrite the opening tag
    result = BytesIO()
    result.write(self.to_xml_open())
    # write metadata
    if self.meta is not None:
        result.write(self.meta.to_xml())
    # write coord sys
    if self.coordsys is not None:
        result.write(self.coordsys.to_xml())
    # write data array depending on the encoding
    dt_kind = data_type_codes.dtype[self.datatype].kind
    result.write(data_tag(self.data,
                          gifti_encoding_codes.specs[self.encoding],
                          KIND2FMT[dt_kind],
                          self.ind_ord))
    result.write(self.to_xml_close())
    return result.getvalue()

While decoding and encoding will be slightly more expensive, writing to a BytesIO object will be slightly less.
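
One detail worth checking when accumulating into a BytesIO: a buffer constructed with initial bytes starts with its position at 0, so subsequent writes overwrite that content instead of appending to it. A short sketch with placeholder tags:

```python
from io import BytesIO

# Constructing with initial bytes leaves the stream position at 0 ...
prefilled = BytesIO(b"<DataArray>")
prefilled.write(b"<Data/>")
overwritten = prefilled.getvalue()  # the opening tag is partly overwritten

# ... whereas starting empty and writing every piece appends in order
buf = BytesIO()
buf.write(b"<DataArray>")
buf.write(b"<Data/>")
buf.write(b"</DataArray>\n")
appended = buf.getvalue()
```

Writing the opening tag explicitly, as in the second half, avoids the problem.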

@effigies
Member

effigies commented Oct 7, 2015

Does it make sense in the .to_xml functions (and helpers like _arr2txt and data_tag) to explicitly label Unicode strings with u'' notation? In Python 2, you'll coerce up to unicode when needed, and in Python 3 it's just a string, but it might be a good tag, for hygiene.

@bcipolli
Contributor Author

bcipolli commented Oct 8, 2015

How about using the xml library to dump the XML? I think it cleans up the code nicely; @effigies @matthew-brett ?

If this looks good, we can use something similar for CIFTI.

@effigies
Member

effigies commented Oct 8, 2015

I think the idea of using an XML library rather than hand-generating it is a solid one, especially if it doesn't introduce a new dependency.

I haven't looked deeply at this diff, though. Is there a way to verify that the output of this and the previous output are the same (at least to the extent that parsing the files leads to identical data structures)?

@bcipolli
Contributor Author

bcipolli commented Oct 8, 2015

I believe the xml library is a standard one, and the interface I used wraps whatever parser is available (Expat by default) https://docs.python.org/2/library/xml.html
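
For readers unfamiliar with the standard-library interface, a minimal sketch of building and dumping XML with `xml.etree.ElementTree` might look like this. `metadata_element` is a hypothetical helper and the tag layout is only modelled on the metadata snippet in this PR; it is not the actual nibabel code.

```python
import xml.etree.ElementTree as ET


def metadata_element(pairs):
    """Hypothetical helper: build a <MetaData> element from (name, value) pairs."""
    root = ET.Element("MetaData")
    for name, value in pairs:
        md = ET.SubElement(root, "MD")
        ET.SubElement(md, "Name").text = name
        ET.SubElement(md, "Value").text = value
    return root


# Passing an explicit encoding makes tostring return bytes
xml_bytes = ET.tostring(metadata_element([("Description", "example")]), "utf-8")
```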

I counted on the GIFTI tests. I could beef those up if they're not sufficient. There were definitely errors when I screwed things up, but no guarantee they'd fully cover all cases.

@effigies
Member

effigies commented Oct 8, 2015

It looks like tests/test_giftiio.py does save-load loops and checks consistency. I think that's probably fine. It doesn't test the XML structure itself, but if it produces the same numpy arrays, that seems like as good a verification as any.
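
The idea of a save-load loop can be sketched without nibabel: serialize, parse the result back, and compare data structures rather than XML text. The `to_xml` and `from_xml` functions below are hypothetical stand-ins, not the GIFTI implementation.

```python
import xml.etree.ElementTree as ET


def to_xml(pairs):
    """Hypothetical serializer: (name, value) pairs to UTF-8 XML bytes."""
    root = ET.Element("MetaData")
    for name, value in pairs:
        md = ET.SubElement(root, "MD")
        ET.SubElement(md, "Name").text = name
        ET.SubElement(md, "Value").text = value
    return ET.tostring(root, "utf-8")


def from_xml(xml_bytes):
    """Hypothetical parser: UTF-8 XML bytes back to (name, value) pairs."""
    root = ET.fromstring(xml_bytes)
    return [(md.find("Name").text, md.find("Value").text)
            for md in root.findall("MD")]


pairs = [("Subject", u"caf\u00e9"), ("Note", "a < b & c")]
# The check compares parsed data, not the exact XML text
round_trip_ok = from_xml(to_xml(pairs)) == pairs
```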

name = xml.SubElement(md, 'Name')
value = xml.SubElement(md, 'Value')
name.text = ele.name
value.text = ele.value
Member

Did we lose the CDATA bit here? Maybe:

name.text = '<![CDATA[{0}]]>'.format(ele.name)
value.text = '<![CDATA[{0}]]>'.format(ele.value)

Edit: Or possibly we don't care, because xml will handle any escaping that needs to be done? (I am not very familiar with XML.)

Contributor Author

I believe the CDATA bit is a way to avoid encoding strings as XML-safe. That trick isn't necessary now; the library will XML encode.

In addition, if you do as you suggest, I believe the < and > characters will be escaped, and you'll get something you don't want.
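
Both points can be checked quickly with the standard library. The values below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Plain text: the library escapes the special characters itself
value = ET.Element("Value")
value.text = "a < b & c"
escaped = ET.tostring(value, "utf-8")

# Manually wrapping in CDATA: the wrapper is treated as text and escaped too
wrapped = ET.Element("Value")
wrapped.text = "<![CDATA[{0}]]>".format("a < b & c")
double_wrapped = ET.tostring(wrapped, "utf-8")

# Parsing the first form recovers the original string
recovered = ET.fromstring(escaped).text
```

In the second case the output contains `&lt;![CDATA[` as literal text rather than a CDATA section, which is the unwanted result described above.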

Member

You're right. Just did a little reading and playing in ipython.

@effigies
Member

effigies commented Oct 8, 2015

Okay, I've looked through and all of the changes make sense to me, and I do agree it looks much cleaner this way.

I will note that this does again run into encode-decode-encode loops, but in XML instead of Unicode. My impression of these functions is that this is plumbing, not API (but I could be wrong and the bytes output is non-negotiable), so would it make sense to reorganize so that .to_xml() returns XML Element objects? If so, that leaves three options that I see:

  1. GiftiImage.to_xml() is an exception to the rule, and continues returning a bytes object.
  2. Rename GiftiImage.to_xml() to GiftiImage.to_xml_string() (or .to_bytes?)
  3. Replace the write command in giftiio like so: f.write(xml.tostring(img.to_xml(), 'utf-8')).

@bcipolli
Contributor Author

bcipolli commented Oct 8, 2015

Thanks @effigies ! I agree the current encode/decode isn't the best. I considered the three solutions you gave, but did it this way as the current interface has been published.

It would be great to migrate the interface (since CIFTI support should keep a similar nomenclature). I'm just not sure how to migrate the return type...

I suppose a fourth option would be to have an internal _to_xml_element object that to_xml calls into.

@effigies
Member

effigies commented Oct 8, 2015

Fair. That last option also makes sense, but these are presumably rare enough operations that it's more a decision about aesthetics than efficiency, so I'd be inclined to leave it as is, unless it seems like a good idea to deprecate the public .to_xml interface anyway.

@bcipolli
Contributor Author

bcipolli commented Oct 8, 2015

I don't think GIFTI.save() should be too rare, so efficiency would be nice. Especially because I'd like to reuse the logic here for CIFTI, which could have great use (HCP data).

I also wouldn't want a public function to return an object type like xml.etree.ElementTree.Element when others could work just as well; having public functions be string makes sense too.

I'm leaning more towards having a _to_xml_element private function, which is documented to use the specific xml object model. The implementation was simple (I have it on my machine).

@bcipolli
Contributor Author

bcipolli commented Oct 8, 2015

@effigies I pushed two changes:

  • Add _to_xml_element and an XmlSerializable interface
  • Move code to xml.py, so that CIFTI support can follow and so that all references to xml.etree.ElementTree will be contained there. It would make changing the XML parser potentially easier.

Maybe it's overkill; I can back out those commits if y'all don't like.
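
A rough sketch of how such an interface could look, assuming a private element-returning method and a public bytes-returning wrapper. The class bodies and the two example subclasses are hypothetical illustrations, not the code pushed in this PR.

```python
import xml.etree.ElementTree as ET


class XmlSerializable(object):
    """Sketch of the interface: subclasses build elements, to_xml serializes."""

    def _to_xml_element(self):
        # Private hook: return an xml.etree.ElementTree.Element
        raise NotImplementedError

    def to_xml(self, enc="utf-8"):
        # Public API keeps returning encoded bytes
        return ET.tostring(self._to_xml_element(), enc)


class ExampleMD(XmlSerializable):
    """Hypothetical leaf object holding one name/value pair."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def _to_xml_element(self):
        md = ET.Element("MD")
        ET.SubElement(md, "Name").text = self.name
        ET.SubElement(md, "Value").text = self.value
        return md


class ExampleMetaData(XmlSerializable):
    """Hypothetical container that nests child elements directly."""

    def __init__(self, items):
        self.items = items

    def _to_xml_element(self):
        root = ET.Element("MetaData")
        for item in self.items:
            # Children are composed as elements; nothing is encoded until to_xml
            root.append(item._to_xml_element())
        return root


xml_bytes = ExampleMetaData([ExampleMD("a", "b")]).to_xml()
```

With this split, containers compose child elements and serialization happens once at the outermost `to_xml` call, which avoids the encode-decode-encode loops mentioned earlier in the thread.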

""" Creates the data tag depending on the required encoding """
def _data_tag_element(dataarray, encoding, datatype, ordering):
""" Creates the data tag depending on the required encoding,
returns as bytes"""
Member

returns as *XML Element

Contributor Author

Fixed!

@bcipolli bcipolli changed the title Make sure xml is encoded as utf-8 BF: Make sure xml is encoded as utf-8 Oct 9, 2015
@effigies
Member

effigies commented Oct 9, 2015

To be clear, I meant rare as in "probably no more than one per second". But that interface doesn't strike me as overkill, especially if it's usable in CIFTI.

@bcipolli
Contributor Author

bcipolli commented Oct 9, 2015

@effigies Cool. Yes, this will be directly applicable to CIFTI, which is also XML-based.

@bcipolli
Contributor Author

@matthew-brett Any ETA on when you'd be able to look over this PR? I believe @effigies looked over the current design and didn't have any major objections.

This is one of two PRs I need in for a GIFTI load/save PR, and also for the CIFTI PR (#353 is the other).

Thanks!

@matthew-brett
Member

Ben - so sorry - I have been completely swamped these last few days with various things.

I see you're in mid-refactor, and that I'm slowing you up.

Chris - would you consider being the main reviewer for Ben's set of PRs on the XML-related code? I mean, can you take responsibility for the PRs, and merge them when all comments are in? In order not to hold Ben up?

@effigies
Member

@matthew-brett I can, if you don't mind occasional pinging about specific questions (especially API issues). My current model of API is: if it's not prefaced with a _, it can't be removed without deprecation.

As to this PR, I have two questions that I was assuming would get answered when you reviewed:

  1. The removal of GiftiCoordSystem.data_tag for GiftiCoordSystem._data_tag_element: Should this be a deprecation?

  2. Are you comfortable with the xmlutils going into nibabel? That's a big decision for me to make for not-my-project.

@bcipolli Conservatively (absent a contrary opinion from Matthew), I'd say we should address my first question by deprecating data_tag with a stub that just runs xml.to_string(self._data_tag_element(...), 'utf-8'). It can get removed in a major release.
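
A deprecation stub along these lines could be sketched as below. The signatures are simplified to a single argument for illustration (the real data_tag takes dataarray, encoding, datatype, ordering), the warning text is made up, and the standard library's `ET.tostring` stands in for whatever serializer the PR uses.

```python
import warnings
import xml.etree.ElementTree as ET


def _data_tag_element(text):
    """Simplified stand-in for the new element-returning helper."""
    data = ET.Element("Data")
    data.text = text
    return data


def data_tag(text):
    """Deprecated public wrapper: warns, then returns serialized XML bytes."""
    warnings.warn("data_tag is deprecated and may be removed in a major release",
                  DeprecationWarning, stacklevel=2)
    return ET.tostring(_data_tag_element(text), "utf-8")


# Record warnings to confirm the stub both warns and still works
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = data_tag("AAAA")
```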

@bcipolli
Contributor Author

@effigies I believe it's gifti.data_tag - I don't believe that function belongs to any object. But regardless, it's so easy to deprecate and I think you're right we should do so... so I've updated! :)

@bcipolli
Contributor Author

As for xmlutils.py, it could also be named as xmlimages.py; in the next iteration I refactor SpatialImage to extract out the file-based bits that are common across spatial and non-spatial images (which I call FileBasedImage). CIFTI and GIFTI could inherit XmlImage, which inherits from FileBasedImage.

These utilities could live in an xmlimages.py, if that better fits the nibabel file schema.

@matthew-brett
Member

I agree about the deprecation stuff - no _ - then some kind of deprecation warning - as a guideline - exceptions allowed with a good argument.

By xmlutils - you mean the wrapper that is only a couple of methods at the moment? No problem in general, your call. Whatever leads to the greatest degree of simplicity and readability.

I'll try and pitch in when I can, but I'm getting ready to head out for a trip, so I may be a bit slow to reply. Please do ping me though, past the point where you think it's polite, I'll do my best.

@matthew-brett
Member

Chris, Ben - I've added you as maintainers - so you can both merge if you want to.

Chris - can I leave you to be the responsible reviewer on this set of PRs from Ben? Ben, maybe you can be the responsible reviewer for Chris' PRs if he's also working on this stuff?

@effigies
Member

@matthew-brett Yup, I'll be responsible reviewer on these. (Just realized you'd asked, not assigned.)

@effigies
Member

@bcipolli Do you want to rebase these PRs?

@bcipolli
Contributor Author

Will do, thanks!

@effigies
Member

Overlap with #353. Another rebase needed.

@bcipolli
Contributor Author

@effigies rebased, Travis tests pass.

)

def to_xml_close(self):
return "</DataArray>\n"
Member

Stupid, but do you think we should have deprecated to_xml_open and to_xml_close? They were public, even if they were kind of weird. If so, I think the easiest way is probably just to restore them verbatim, since neither corresponds to a full XML element.

Contributor Author

It's painful. But you're right, good call! Will work on this.

da = GiftiDataArray.from_array(np.ones((1,)), 'triangle')
with clear_and_catch_warnings() as w:
warnings.filterwarnings('always', category=DeprecationWarning)
assert_true(isinstance(da.to_xml_open(), string_types))
Member

string_types needs to be imported from externals.six.

Rebase and I think we're almost done.

Contributor Author

Thanks. Sorry for all the errors; I'm juggling changes across three branches (trying to keep things simple/clean #fail) and I'm doing a poor job :)

Fixed, rebased, and pushed up. Looking forward to getting this one in; the next PR is the most interesting one! Looking forward to discussing!

Member

No worries. Thanks for your patience. Heading out of town in a couple hours, so the next PR might have to wait for Monday, but I do want to get this in so you don't have to juggle too much more.

Contributor Author

Thanks for all your work. No worries--that one is a WIP and will take some time. I've just been looking forward to beginning that discussion!

I made the change requested. Just more sloppiness from my side. Reordered & diff looks cleaner now.

@property
def metadata(self):
""" Returns metadata as dictionary """
return self.meta.metadata
Member

Is this reordering intentional? PR would be a little simpler if print_summary and {get_,}metadata went back below to_xml_*.

effigies added a commit that referenced this pull request Oct 17, 2015
RF: Gifti images are XML serializable

Eliminate (as much as possible) hand-written XML
nibabel.xmlutils.XmlSerializable interface provides to_xml functions
Deprecate helper functions that should not have been public
@effigies effigies merged commit 63f9ef2 into nipy:master Oct 17, 2015
@effigies
Member

There we go. Thanks for all your work on these. Good luck with the next stage!

@effigies effigies mentioned this pull request Oct 17, 2015
@bcipolli bcipolli deleted the gifti-fix3 branch October 21, 2015 14:58
grlee77 pushed a commit to grlee77/nibabel that referenced this pull request Mar 15, 2016
RF: Gifti images are XML serializable

Eliminate (as much as possible) hand-written XML
nibabel.xmlutils.XmlSerializable interface provides to_xml functions
Deprecate helper functions that should not have been public