json_normalize Support for Generators #26647

Closed
msteijaert opened this issue Jun 4, 2019 · 7 comments
Labels
IO JSON read_json, to_json, json_normalize

Comments

@msteijaert

The data argument in json_normalize must be a dictionary or a list of dictionaries. If an iterator is passed, unexpected behavior can occur. In particular, the first row is lost. See https://stackoverflow.com/questions/56362810/missing-first-document-when-loading-multi-document-yaml-file-in-pandas-dataframe.

I think this should be caught, either by throwing an error (as proposed in #26646) or by ensuring that the output for the iterator is the same as for the list created by "materializing" the iterator.

@WillAyd
Member

WillAyd commented Jun 4, 2019

I think the issue can be expressed more succinctly. Basically this works:

In [22]: data = [{'id': 1, 'name': {'first': 'foo'}}, {'id': 2, 'name': {'first': 'bar'}}]
In [23]: json_normalize(data)
Out[23]:
   id name.first
0   1        foo
1   2        bar

But this doesn't; the first row is dropped:

In [24]: json_normalize(x for x in data)
Out[24]:
   id name.first
0   2        bar

I think this could be supported.

@WillAyd WillAyd added the "IO JSON" (read_json, to_json, json_normalize) and "Needs Discussion" (requires discussion from core team before further action) labels Jun 4, 2019
@WillAyd WillAyd changed the title from "Missing first document when loading multi-document yaml file in pandas dataframe" to "json_normalize Support for Generator Expressions" Jun 4, 2019
@WillAyd WillAyd changed the title from "json_normalize Support for Generator Expressions" to "json_normalize Support for Generators" Jun 4, 2019
@WillAyd WillAyd removed the "Needs Discussion" (requires discussion from core team before further action) label Jun 5, 2019
@WillAyd
Member

WillAyd commented Jun 5, 2019

If you'd like to try a PR to support this, I think that would be OK.

@msteijaert
Author

If you'd like to try a PR to support this I think would be OK

What would be preferable? Raising an error or turning data into a list if data is an iterator? The latter is pretty straightforward. My (naive) solution would be the following:

if data is iter(data):
    data = [x for x in data]
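
For readers unfamiliar with the check above: it relies on the iterator protocol, under which calling iter() on an iterator returns the iterator itself, while calling it on a list returns a new iterator object. The sketch below illustrates that behavior and shows an equivalent check using collections.abc.Iterator. The helper name materialize is hypothetical and not part of pandas; both variants are still a form of type inference.

```python
from collections.abc import Iterator


def materialize(data):
    """Hypothetical helper: turn a one-shot iterator into a list.

    Lists, dicts and other re-iterable containers are returned unchanged.
    """
    if isinstance(data, Iterator):
        return list(data)
    return data


records = [{"id": 1}, {"id": 2}]
gen = (r for r in records)

# An iterator returns itself from iter(); a list returns a fresh iterator.
gen_is_own_iterator = iter(gen) is gen
list_is_own_iterator = iter(records) is records

from_generator = materialize(gen)       # consumed into a new list
from_list = materialize(records)        # passed through untouched
```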

@WillAyd
Member

WillAyd commented Jun 5, 2019

I don't think using `is` is a good approach. If you can check the source to see where the first entry is getting dropped, it would be preferable to refactor without inference.

@zifeo

zifeo commented Apr 15, 2020

@WillAyd The issue is here (pulling at least one element out of the iterator), not sure this can be so easily refactored. Do you see any trouble data = list(data) could cause?
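
To illustrate how inspecting the data can pull an element out of an iterator, here is a minimal stdlib-only sketch. The function has_nested_dict is a hypothetical stand-in for whatever check runs before the DataFrame is built, assumed here to be a short-circuiting any(); it is not copied from the pandas source.

```python
data = [
    {"id": 1, "name": {"first": "foo"}},
    {"id": 2, "name": {"first": "bar"}},
]


def has_nested_dict(records):
    # Hypothetical pre-check: any() stops at the first nested dict it finds,
    # but by then it has already pulled that record out of the iterator.
    return any(isinstance(v, dict) for rec in records for v in rec.values())


# With a generator, the check consumes the first record for good.
gen = (x for x in data)
found_in_gen = has_nested_dict(gen)
remaining = list(gen)

# Materializing first makes the check repeatable and loses nothing.
materialized = list(x for x in data)
found_in_list = has_nested_dict(materialized)
kept = list(materialized)
```

After the check, remaining holds only the second record, which matches the missing first row reported above, while kept still holds both.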

@WillAyd
Member

WillAyd commented Apr 17, 2020

closed by #33585

@WillAyd WillAyd closed this as completed Apr 17, 2020
@kylekeppler
Contributor

Note that #33585 does not add support for generators; it clarifies the docstrings and raises an exception if the input is not a list (which is great; this is not intended as a complaint). Has the team decided it is undesirable to allow generators?

Would something like the following be a reasonable approach?

if isinstance(obj, typing.Generator):
    obj = RepeatsFirstItemGenerator(obj)

Where RepeatsFirstItemGenerator is a class that wraps a generator, caches, and repeats the first item?
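
A minimal sketch of what such a wrapper might look like, assuming it only needs to expose the cached first item and then replay it once during a single pass. The class body, the first attribute, and the _MISSING sentinel are illustrative assumptions, not an existing pandas API. Note also that isinstance(obj, typing.Generator) matches generator objects but not other iterators such as map objects.

```python
import itertools


class RepeatsFirstItemGenerator:
    """Hypothetical wrapper: peek at the first item without losing it."""

    _MISSING = object()  # sentinel for an empty source

    def __init__(self, source):
        self._source = iter(source)
        # Pull the first item eagerly and cache it for inspection.
        self.first = next(self._source, self._MISSING)

    def __iter__(self):
        if self.first is self._MISSING:
            return iter(())
        # Replay the cached item, then continue with the rest of the source.
        return itertools.chain([self.first], self._source)


wrapped = RepeatsFirstItemGenerator(x for x in [{"id": 1}, {"id": 2}])
peeked = wrapped.first          # inspect the first record
all_records = list(wrapped)     # the first record is still included
```

The wrapper stays lazy but is single-pass: a second iteration would yield only the cached first item, so a consumer that iterates more than once would still need list(data).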

LankyCyril added a commit to LankyCyril/GLOpenAPI that referenced this issue Nov 14, 2020