Robustness improvement for normalize.py #26328


Closed

Marky0 wants to merge 8 commits

Conversation

@Marky0 Marky0 commented May 9, 2019

Improves robustness of json_normalize for inconsistently formatted json.
Fixes all observations in #26284

Marky0 added 3 commits May 10, 2019 00:24
Improves robustness of json_normalize for inconsistently formatted json pandas-dev#26284
Adds new tests for deeply nested json which is inconsistently formatted pandas-dev#26284
Update to json_normalize to improve robustness

pep8speaks commented May 9, 2019

Hello @Marky0! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 63:14: E127 continuation line over-indented for visual indent
Line 64:15: E127 continuation line over-indented for visual indent
Line 308:17: E126 continuation line over-indented for hanging indent

Comment last updated at 2019-05-11 16:28:34 UTC


codecov bot commented May 10, 2019

Codecov Report

Merging #26328 into master will decrease coverage by 51.33%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26328       +/-   ##
===========================================
- Coverage   92.04%   40.71%   -51.34%     
===========================================
  Files         175      175               
  Lines       52297    52306        +9     
===========================================
- Hits        48138    21296    -26842     
- Misses       4159    31010    +26851
Flag Coverage Δ
#multiple ?
#single 40.71% <0%> (-0.18%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/normalize.py 7.47% <0%> (-89.47%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17247ed...7b64d1b. Read the comment docs.


codecov bot commented May 10, 2019

Codecov Report

Merging #26328 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26328      +/-   ##
==========================================
- Coverage   92.04%   92.04%   -0.01%     
==========================================
  Files         175      175              
  Lines       52297    52301       +4     
==========================================
  Hits        48138    48138              
- Misses       4159     4163       +4
Flag Coverage Δ
#multiple 90.59% <100%> (ø) ⬆️
#single 40.72% <0%> (-0.16%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/normalize.py 97.19% <100%> (+0.25%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/compat/__init__.py 93.93% <0%> (-0.35%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️
pandas/core/sparse/frame.py 95.48% <0%> (-0.01%) ⬇️
pandas/core/computation/expr.py 96.7% <0%> (-0.01%) ⬇️
pandas/core/sparse/scipy_sparse.py 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17247ed...d0dd982. Read the comment docs.

@Marky0 Marky0 changed the title from Robustness improvement for json_normalize.py to Robustness improvement for normalize.py on May 10, 2019
@@ -111,6 +111,8 @@ def json_normalize(data, record_path=None, meta=None,
record_path : string or list of strings, default None
Path in each object to list of records. If not passed, data will be
assumed to be an array of records
For arrays of inconsistantly formatted json records, first record
Member

Inconsistently

Member

Generally not sure of the wording here. Inconsistent isn't really the right term as I think it implies that the JSON is malformed or invalid, which isn't the case.

What distinction are you trying to make with the first record?

Author

Updated to
For an array of objects with missing key-value pairs in each record,
the first record needs to include all key-value pairs

The point being that if all the keys aren't seen in the first record, it's not straightforward to go back and fill previously loaded records with NaNs. It's easier for the first record to provide a 'template' of all keys, so that as each subsequent record is parsed, the dataframe can easily be filled with NaNs wherever a key does not exist.
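The "first record as template" idea described above can be sketched in plain Python. This is a hypothetical illustration, not the actual pandas implementation; `normalize_records` is an invented name:

```python
import math

def normalize_records(records):
    # Hypothetical sketch of the approach described above: use the first
    # record's keys as the column template; later records missing a key
    # contribute NaN for that column, so no back-filling is ever needed.
    template = list(records[0])            # column order from first record
    columns = {key: [] for key in template}
    for rec in records:
        for key in template:
            columns[key].append(rec.get(key, math.nan))
    return columns

# Second record is missing "b"; it becomes NaN rather than an error.
cols = normalize_records([{"a": 1, "b": 2}, {"a": 3}])
```

If the first record lacked a key that appears later, that key would be silently dropped, which is exactly why the docstring requires the first record to include all key-value pairs.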

Member

@WillAyd WillAyd left a comment

Thanks for the PR! As a general comment I would advise against trying to solve multiple issues in one PR. I think here you are attempting to both handle record_path pointing to JSON objects while also accounting for error handling with missing keys.

If that's correct I think you should just focus on solving the first issue and we can come back to the latter in a subsequent PR. Otherwise the review and update process may take much longer


if not(isinstance(result, list)):
# Allows collection of single object into dataframe GH26284
result = [result]
except KeyError:
Member

We have an errors parameter which this ignores. I think we'll need to be aware of that here in some way, though per my main comment I don't think we should try to tackle that here.

Author

Not sure I understand this. The only error we can catch here is for a missing key? Any other error would happen in the existing baseline?

Member

IIUC this assumes that the user always wants to silently ignore missing keys, which is not desirable and makes for a confusing API since we have an "errors" parameter that controls that behavior for the meta

Author

Ah, understood. If we want to give control there are clearly two ways:
(i) redefine errors='ignore' to cover both meta and record_path
(ii) introduce another error flag to differentiate between meta and record_path
Is there a convention or a preference in pandas before I implement?

Member

I think option one
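Option (i) could look something like the following sketch. `pull_field` here is a hypothetical helper, not the actual pandas code; it only illustrates reusing the existing `errors` parameter for the record_path lookup:

```python
def pull_field(obj, key, errors="raise"):
    # Hypothetical sketch of option (i): a missing record_path key either
    # raises the usual KeyError (errors="raise") or is treated as
    # "no records here" (errors="ignore") instead of failing silently.
    try:
        return obj[key]
    except KeyError:
        if errors == "ignore":
            return []            # empty record set -> row of NaNs downstream
        raise

present = pull_field({"a": [1, 2]}, "a")              # [1, 2]
missing = pull_field({}, "a", errors="ignore")        # []
```

With errors="raise" (the default) the behavior matches the existing API, so only users who opt in get the lenient handling.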

@@ -241,6 +251,12 @@ def _recursive_extract(data, path, seen_meta, level=0):
else:
for obj in data:
recs = _pull_field(obj, path[0])
if recs == {}:
Member

Same comment around missing key - need to be cognizant of the errors parameter

Author

Not sure I understand. Setting recs to an empty dictionary if there is a missing key (on line 199) is a convenient flag to then load the output with NaNs on line 258. How does the errors parameter come into this?
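The empty-dict sentinel being discussed can be illustrated with a minimal sketch. `extract_records` is an invented name for illustration, not the pandas implementation:

```python
def extract_records(objs, key):
    # Sketch of the sentinel approach: instead of raising on a missing key,
    # emit an empty dict, which the caller later expands into a row of NaNs.
    out = []
    for obj in objs:
        try:
            out.append(obj[key])
        except KeyError:
            out.append({})   # sentinel: fill this row with NaN later
    return out

# Second object has no "x"; the sentinel {} marks it for NaN filling.
recs = extract_records([{"x": {"a": 1}}, {"y": 2}], "x")
```

The reviewer's point is that this sketch always swallows the KeyError, whereas the errors parameter should decide whether the sentinel or an exception is produced.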

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label May 10, 2019

Marky0 commented May 11, 2019

I think here you are attempting to both handle record_path pointing to JSON objects while also accounting for error handling with missing keys.

If that's correct I think you should just focus on solving the first issue and we can come back to the latter in a subsequent PR. Otherwise the review and update process may take much longer

@WillAyd you are correct that I have addressed two issues here. The primary one is missing keys; the secondary allows record_path to parse a single object. The latter only needs two lines of code, 194-196 (to check whether the parsed result is anything other than a list, and if not, make it a list, because it is then correctly parsed into the dataframe).

I can move this to a new PR, but it seems a lot of effort for two lines? Let me know your preference.
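The two-line fix referred to above amounts to wrapping a non-list result. A self-contained sketch (with a hypothetical function name, outside the context of normalize.py):

```python
def ensure_list(result):
    # Sketch of the two-line fix: if record_path points at a single JSON
    # object rather than an array of records, wrap it in a list so the
    # downstream flattening code handles both shapes uniformly.
    if not isinstance(result, list):
        result = [result]
    return result

single = ensure_list({"k": 1})               # wrapped into a one-element list
already = ensure_list([{"k": 1}, {"k": 2}])  # left unchanged
```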


Marky0 commented May 11, 2019

Can someone explain why the pandas-dev.pandas (Windows py37_np141) test is failing?

Cython-generated file 'pandas/_libs/algos.c' not found.


WillAyd commented May 11, 2019

The latter only needs two lines of code 194-196 (to check if the parsed result is anything other than a list, and if not make it a list because it is then correctly parsed into the dataframe).

The fix may be two lines of code but the main problem is that the tests are now ambiguously mixed between the two changes so would still be preferable to have those as separate PRs.


WillAyd commented May 11, 2019

Can someone explain why the pandas-dev.pandas (Windows py37_np141) test is failing?

Cython-generated file 'pandas/_libs/algos.c' not found.

Have seen this on a few other PRs; something must have changed on the CI side with Windows lately. Not caused by this PR.

Contributor

@jreback jreback left a comment

this needs some work


jreback commented Jun 8, 2019

@Marky0 if you want to merge master and address the comments


jreback commented Jun 27, 2019

Closing as stale; if you want to continue, please ping.

@jreback jreback closed this Jun 27, 2019
Labels
IO JSON read_json, to_json, json_normalize
Development

Successfully merging this pull request may close these issues.

json_normalize should raise when record_path doesn't point to an array
5 participants