json_normalize Support for Generators #26647

msteijaert · 2019-06-04T15:06:31Z

The data argument in json_normalize must be a dictionary or a list of dictionaries. If an iterator is passed instead, unexpected behavior can occur; in particular, the first row is lost. See https://stackoverflow.com/questions/56362810/missing-first-document-when-loading-multi-document-yaml-file-in-pandas-dataframe.

I think this should be caught, either by throwing an error (as proposed in #26646) or by ensuring that the output for the iterator is the same as for the list created by "materializing" the iterator.

WillAyd · 2019-06-04T17:36:08Z

I think the issue can be expressed more succinctly. Basically this works:

In [22]: data = [{'id': 1, 'name': {'first': 'foo'}}, {'id': 2, 'name': {'first': 'bar'}}]
In [23]: json_normalize(data)
Out[23]:
   id name.first
0   1        foo
1   2        bar

But this doesn't:

In [24]: json_normalize(x for x in data)
Out[24]:
   id name.first
0   2        bar

I think this could be supported.

WillAyd · 2019-06-05T01:03:39Z

If you'd like to try a PR to support this, I think that would be OK.

msteijaert · 2019-06-05T07:04:53Z

> If you'd like to try a PR to support this, I think that would be OK.

What would be preferable? Raising an error or turning data into a list if data is an iterator? The latter is pretty straightforward. My (naive) solution would be the following:

if data is iter(data):
    data = [x for x in data]
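
The identity check above relies on the iterator protocol: calling iter() on an iterator returns that same object, while calling it on a list returns a new iterator. A minimal sketch of the idea follows; the helper name materialize_if_iterator is hypothetical and not part of pandas.

```python
def materialize_if_iterator(data):
    """Return a list if `data` is a one-shot iterator, else return it unchanged."""
    # iter() on an iterator (e.g. a generator) returns the iterator itself;
    # iter() on a list returns a fresh list iterator, so the check is False.
    if data is iter(data):
        return list(data)
    return data


records = [{"id": 1, "name": {"first": "foo"}}, {"id": 2, "name": {"first": "bar"}}]

# A list is passed through untouched; a generator is turned into a list.
same_list = materialize_if_iterator(records)
from_generator = materialize_if_iterator(x for x in records)
```

After this, both same_list and from_generator can be iterated more than once, which is what the function body needs.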

WillAyd · 2019-06-05T12:20:23Z

I don't think using `is` is a good approach. If you can check the source to see where the first entry is getting dropped, it would be preferable to refactor without inference.

zifeo · 2020-04-15T21:51:41Z

@WillAyd The issue is here (pulling at least one element out of the iterator), not sure this can be so easily refactored. Do you see any trouble data = list(data) could cause?
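
To illustrate how pulling one element out of an iterator can drop a row, here is a small stand-alone sketch. It is not the pandas source; the function probe_then_collect is a hypothetical stand-in that first inspects the input with a short-circuiting any() and then iterates it again.

```python
def probe_then_collect(data):
    """Inspect `data` once, then collect whatever is still left in it."""
    # any() stops at the first truthy result, so exactly one element is
    # pulled from a one-shot iterator during this probe.
    any(isinstance(row, dict) for row in data)
    # A list can be iterated again from the start; a generator cannot.
    return list(data)


records = [{"id": 1}, {"id": 2}]

from_list = probe_then_collect(records)
from_generator = probe_then_collect(x for x in records)
```

With a list, both records survive; with a generator, the record consumed by the probe is gone, which matches the symptom reported above.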

WillAyd · 2020-04-17T00:17:16Z

closed by #33585

kylekeppler · 2020-08-08T21:15:16Z

Note that #33585 does not add support for generators; it clarifies the docstrings and raises an exception if the input is not a list (which is great; this is not intended as a complaint). Has the team decided it is undesirable to allow generators?

Would something like the following be a reasonable approach?

if isinstance(obj, typing.Generator):
    obj = RepeatsFirstItemGenerator(obj)

Where RepeatsFirstItemGenerator is a class that wraps a generator, caches, and repeats the first item?
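
A rough sketch of what such a wrapper might look like, purely as an assumption about the intended behavior (the attribute names first and has_first are made up for illustration):

```python
class RepeatsFirstItemGenerator:
    """Hypothetical wrapper: caches the first item and replays it on iteration."""

    def __init__(self, gen):
        self._gen = iter(gen)
        try:
            # Pull the first item once and keep it, so callers can inspect
            # it via `.first` without consuming it from the stream.
            self.first = next(self._gen)
            self.has_first = True
        except StopIteration:
            self.first = None
            self.has_first = False

    def __iter__(self):
        # Replay the cached first item, then continue with the remainder.
        # Note: the remainder is still single-pass, so a second iteration
        # yields only the cached first item.
        if self.has_first:
            yield self.first
        yield from self._gen


records = [{"id": 1}, {"id": 2}]
wrapped = RepeatsFirstItemGenerator(x for x in records)
peeked = wrapped.first       # inspecting the first item does not lose it
collected = list(wrapped)    # full sequence, first item included
```

This keeps the input lazy, at the cost of only supporting one full pass; data = list(data) is simpler if the input fits in memory.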


msteijaert mentioned this issue Jun 4, 2019

catch iterator input in io.json.json_normalize #26646

Closed

4 tasks

WillAyd added the IO JSON and Needs Discussion labels Jun 4, 2019

WillAyd changed the title ~~Missing first document when loading multi-document yaml file in pandas dataframe~~ json_normalize Support for Generator Expressions Jun 4, 2019

WillAyd changed the title ~~json_normalize Support for Generator Expressions~~ json_normalize Support for Generators Jun 4, 2019

WillAyd removed the Needs Discussion label Jun 5, 2019

WillAyd closed this as completed Apr 17, 2020

LankyCyril added a commit to LankyCyril/GLOpenAPI that referenced this issue Nov 14, 2020

Fix loss of first row in json_normalize; see:

8a2e877

pandas-dev/pandas#26647
