BUG: iterparse of read_xml not parsing duplicate element and attribute names #47414

Merged
merged 8 commits into from
Jun 21, 2022

Conversation

ParfaitG
Contributor

row[col] = elem.text.strip() if elem.text else None
if col in elem.attrib:
    row[col] = elem.attrib[col]
if self.names:
Member

Could you share this?

Contributor Author

Good point. I refactored so the shared code lives once in the parsers' base class.

Member

This is a bit confusing now. Could you split the refactoring and the bug fix into two PRs? It is hard to review like this.

Contributor Author

I will revert to the bug fix only and, in a subsequent PR, refactor to remove the repetitive code between parsers.

Member

Thanks.

I was not sure initially whether you could share more, but if you can share most of the code, then we can separate these.

@mroeschke mroeschke added this to the 1.5 milestone Jun 21, 2022
@mroeschke mroeschke added the IO XML read_xml, to_xml label Jun 21, 2022
@mroeschke mroeschke merged commit 10967ce into pandas-dev:main Jun 21, 2022
@mroeschke
Member

Thanks @ParfaitG

@ParfaitG ParfaitG deleted the iterparse_names branch June 22, 2022 03:09
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
…e names (pandas-dev#47414)

* BUG: iterparse of read_xml not parsing duplicate element and attribute names

* Refactor duplicative code in each parser to shared base class

* Add lxml preceding-sibling iterparse cleanup

* Revert code refactoring back to bug fix only

* Remove whatsnew bug fix note on unreleased version feature
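
For readers unfamiliar with the overwrite this PR's title refers to, here is a minimal standard-library sketch of how it can arise. It is not the pandas implementation: `collect_rows`, `keep_first`, and the sample XML are hypothetical names made up for illustration, and the "first value wins" guard is only one possible way to avoid the clobbering, not necessarily what the merged commit does.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the row element carries a "sides" attribute and
# also has a child element named "sides".
SAMPLE = b"""<shapes>
  <shape sides="4">
    <name>square</name>
    <sides>four</sides>
  </shape>
</shapes>"""


def collect_rows(xml_bytes, row_tag, cols, keep_first=False):
    """Build one dict per row_tag element while streaming with iterparse.

    With keep_first=False, a later attribute with the same name silently
    replaces an earlier element text. With keep_first=True (an assumed
    guard for this sketch), the first value stored for a column is kept.
    """
    rows, row = [], None
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            if elem.tag == row_tag:
                row = {}  # a new row begins
            continue
        if row is None:
            continue  # "end" event outside any row element
        for col in cols:
            # Element text for a matching tag name.
            if elem.tag == col and not (keep_first and col in row):
                row[col] = elem.text.strip() if elem.text else None
            # Attribute with the same name on the current element.
            if col in elem.attrib and not (keep_first and col in row):
                row[col] = elem.attrib[col]
        if elem.tag == row_tag:
            rows.append(row)  # the row element has closed
            row = None
    return rows


naive = collect_rows(SAMPLE, "shape", ["name", "sides"])
guarded = collect_rows(SAMPLE, "shape", ["name", "sides"], keep_first=True)
print(naive)
print(guarded)
```

In the unguarded run, the `sides` attribute on `<shape>` is seen when the row element closes, after the child `<sides>` text has been stored, so the attribute value replaces the text. The guarded run keeps the child text instead.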
Labels
IO XML read_xml, to_xml
Development

Successfully merging this pull request may close these issues.

BUG: iterparse on read_xml overwrites with attributes on child elements
3 participants