-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
Robustness improvement for normalize.py #26328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Improves robustness of json_normalize for inconsistently formatted json pandas-dev#26284
Adds new tests for deeply nested json which is inconsistently formatted pandas-dev#26284
Update to json_normalize to improve robustness
Hello @Marky0! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2019-05-11 16:28:34 UTC |
Codecov Report
@@ Coverage Diff @@
## master #26328 +/- ##
===========================================
- Coverage 92.04% 40.71% -51.34%
===========================================
Files 175 175
Lines 52297 52306 +9
===========================================
- Hits 48138 21296 -26842
- Misses 4159 31010 +26851
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26328 +/- ##
==========================================
- Coverage 92.04% 92.04% -0.01%
==========================================
Files 175 175
Lines 52297 52301 +4
==========================================
Hits 48138 48138
- Misses 4159 4163 +4
Continue to review full report at Codecov.
|
pandas/io/json/normalize.py
Outdated
@@ -111,6 +111,8 @@ def json_normalize(data, record_path=None, meta=None, | |||
record_path : string or list of strings, default None | |||
Path in each object to list of records. If not passed, data will be | |||
assumed to be an array of records | |||
For arrays of inconsistantly formatted json records, first record |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally not sure of the wording here. Inconsistent isn't really the right term as I think it implies that the JSON is malformed or invalid, which isn't the case.
What distinction are you trying to make with the first record?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to
For an array of objects with missing key-value pairs in each record,
the first record needs to include all key-value pairs
The point being that if all the keys aren't seen in the first record, its not straightforward to then fill in previous records that have been loaded with NaNs. Easier for the first record to provide a 'template' of all keys, so that as each subsequent record is parsed, the dataframe can be easily loaded with NaNs if the key does not exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! As a general comment I would advise against trying to solve multiple issues in one PR. I think here you are attempting to both handle record_path
pointing to JSON objects while also accounting for error handling with missing keys.
If that's correct I think you should just focus on solving the first issue and we can come back to the latter in a subsequent PR. Otherwise the review and update process may take much longer
pandas/io/json/normalize.py
Outdated
@@ -111,6 +111,8 @@ def json_normalize(data, record_path=None, meta=None, | |||
record_path : string or list of strings, default None | |||
Path in each object to list of records. If not passed, data will be | |||
assumed to be an array of records | |||
For arrays of inconsistantly formatted json records, first record |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally not sure of the wording here. Inconsistent isn't really the right term as I think it implies that the JSON is malformed or invalid, which isn't the case.
What distinction are you trying to make with the first record?
if not(isinstance(result, list)): | ||
# Allows collection of single object into dataframe GH26284 | ||
result = [result] | ||
except KeyError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have an errors
parameter which this ignores. I think we'll need to be aware of that here in some way, though from my mine comment I don't think we should try and tackle that here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand this. The only error we can catch here is for a missing key ? Any other error would happen in the existing baseline ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this assumes that the user always wants to silently ignore missing keys, which is not desirable and makes for a confusing API since we have an "errors" parameter that controls that behavior for the meta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah understood. If we want to give control there are clearly two ways....
(i) redefine errors = 'ignore'
to cover both meta
and record path
(ii) introduce another error flag to differentiate between meta
and record path
Is there a convention or a preference in pandas before I implement ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think option one
@@ -241,6 +251,12 @@ def _recursive_extract(data, path, seen_meta, level=0): | |||
else: | |||
for obj in data: | |||
recs = _pull_field(obj, path[0]) | |||
if recs == {}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment around missing key - need to be cognizant of the errors
parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand. Setting recs
to an empty dictionary if there is a missing key (on line 199), is a convenient flag to then load the output with NaNs on line 258. How does the errors
parameter come into this ?
@WillAyd you are correct that I have addressed two issues here. The primary one being missing keys, the secondary allowing I can move this to a new PR, but it seems a lot of effort for two lines ? Let me know your preference. |
Can someone explain why pandas-dev.pandas (Windows py37_np141) test is failing ? Cython-generated file 'pandas_libs/algos.c' not found. |
The fix may be two lines of code but the main problem is that the tests are now ambiguously mixed between the two changes so would still be preferable to have those as separate PRs. |
Have seen this on a few other PRs something must have changed on the CI side with Windows lately. Not caused by this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs some work
@Marky0 if you want to merge master and address the comments |
closing as stale, if you want to continue, pls ping. |
git diff upstream/master -u -- "*.py" | flake8 --diff
Improves robustness of json_normalize for inconsistently formatted json.
Fixes all observations in #26284