Robustness improvement for normalize.py #26328


Closed

Marky0 wants to merge 8 commits

Conversation

@Marky0 Marky0 commented May 9, 2019

Improves robustness of json_normalize for inconsistently formatted json.
Fixes all observations in #26284

Marky0 added 3 commits May 10, 2019 00:24
Improves robustness of json_normalize for inconsistently formatted json pandas-dev#26284
Adds new tests for deeply nested json which is inconsistently formatted pandas-dev#26284
Update to json_normalize to improve robustness

pep8speaks commented May 9, 2019

Hello @Marky0! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 63:14: E127 continuation line over-indented for visual indent
Line 64:15: E127 continuation line over-indented for visual indent
Line 308:17: E126 continuation line over-indented for hanging indent

Comment last updated at 2019-05-11 16:28:34 UTC


codecov bot commented May 10, 2019

Codecov Report

Merging #26328 into master will decrease coverage by 51.33%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26328       +/-   ##
===========================================
- Coverage   92.04%   40.71%   -51.34%     
===========================================
  Files         175      175               
  Lines       52297    52306        +9     
===========================================
- Hits        48138    21296    -26842     
- Misses       4159    31010    +26851
Flag Coverage Δ
#multiple ?
#single 40.71% <0%> (-0.18%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/normalize.py 7.47% <0%> (-89.47%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17247ed...7b64d1b. Read the comment docs.


codecov bot commented May 10, 2019

Codecov Report

Merging #26328 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26328      +/-   ##
==========================================
- Coverage   92.04%   92.04%   -0.01%     
==========================================
  Files         175      175              
  Lines       52297    52301       +4     
==========================================
  Hits        48138    48138              
- Misses       4159     4163       +4
Flag Coverage Δ
#multiple 90.59% <100%> (ø) ⬆️
#single 40.72% <0%> (-0.16%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/normalize.py 97.19% <100%> (+0.25%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/compat/__init__.py 93.93% <0%> (-0.35%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️
pandas/core/sparse/frame.py 95.48% <0%> (-0.01%) ⬇️
pandas/core/computation/expr.py 96.7% <0%> (-0.01%) ⬇️
pandas/core/sparse/scipy_sparse.py 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17247ed...d0dd982. Read the comment docs.

@Marky0 Marky0 changed the title from Robustness improvement for json_normalize.py to Robustness improvement for normalize.py on May 10, 2019
@@ -111,6 +111,8 @@ def json_normalize(data, record_path=None, meta=None,
record_path : string or list of strings, default None
Path in each object to list of records. If not passed, data will be
assumed to be an array of records
For arrays of inconsistantly formatted json records, first record
Member

Inconsistently

Member

Generally not sure of the wording here. Inconsistent isn't really the right term as I think it implies that the JSON is malformed or invalid, which isn't the case.

What distinction are you trying to make with the first record?

Author

Updated to
For an array of objects with missing key-value pairs in each record,
the first record needs to include all key-value pairs

The point being that if all the keys aren't seen in the first record, it's not straightforward to go back and fill previously loaded records with NaNs. It's easier for the first record to provide a 'template' of all keys, so that as each subsequent record is parsed, the dataframe can easily be filled with NaNs wherever a key does not exist.
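The "first record as template" idea described above can be sketched in plain Python. This is a hypothetical illustration, not the actual pandas implementation; `normalize_records` is an invented name:

```python
import math

def normalize_records(records):
    # Hypothetical sketch of the approach described above: use the first
    # record's keys as the column template; later records missing a key
    # contribute NaN for that column, so no back-filling is ever needed.
    template = list(records[0])            # column order from first record
    columns = {key: [] for key in template}
    for rec in records:
        for key in template:
            columns[key].append(rec.get(key, math.nan))
    return columns

# Second record is missing "b"; it becomes NaN rather than an error.
cols = normalize_records([{"a": 1, "b": 2}, {"a": 3}])
```

If the first record lacked a key that appears later, that key would be silently dropped, which is exactly why the docstring requires the first record to include all key-value pairs.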

Member

@WillAyd WillAyd left a comment

Thanks for the PR! As a general comment I would advise against trying to solve multiple issues in one PR. I think here you are attempting to both handle record_path pointing to JSON objects while also accounting for error handling with missing keys.

If that's correct I think you should just focus on solving the first issue and we can come back to the latter in a subsequent PR. Otherwise the review and update process may take much longer


if not(isinstance(result, list)):
# Allows collection of single object into dataframe GH26284
result = [result]
except KeyError:
Member

We have an errors parameter which this ignores. I think we'll need to be aware of that here in some way, though per my main comment I don't think we should try to tackle that here.

Author

Not sure I understand this. The only error we can catch here is for a missing key? Any other error would happen in the existing baseline?

Member

IIUC this assumes that the user always wants to silently ignore missing keys, which is not desirable and makes for a confusing API since we have an "errors" parameter that controls that behavior for the meta

Author

Ah, understood. If we want to give control there are clearly two ways:
(i) redefine errors='ignore' to cover both meta and record_path
(ii) introduce another error flag to differentiate between meta and record_path
Is there a convention or a preference in pandas before I implement?

Member

I think option one
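Option (i) could look something like the following sketch. `pull_field` here is a hypothetical helper, not the actual pandas code; it only illustrates reusing the existing `errors` parameter for the record_path lookup:

```python
def pull_field(obj, key, errors="raise"):
    # Hypothetical sketch of option (i): a missing record_path key either
    # raises the usual KeyError (errors="raise") or is treated as
    # "no records here" (errors="ignore") instead of failing silently.
    try:
        return obj[key]
    except KeyError:
        if errors == "ignore":
            return []            # empty record set -> row of NaNs downstream
        raise

present = pull_field({"a": [1, 2]}, "a")              # [1, 2]
missing = pull_field({}, "a", errors="ignore")        # []
```

With errors="raise" (the default) the behavior matches the existing API, so only users who opt in get the lenient handling.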

@@ -241,6 +251,12 @@ def _recursive_extract(data, path, seen_meta, level=0):
else:
for obj in data:
recs = _pull_field(obj, path[0])
if recs == {}:
Member

Same comment around missing key - need to be cognizant of the errors parameter

Author

Not sure I understand. Setting recs to an empty dictionary if there is a missing key (on line 199) is a convenient flag to then load the output with NaNs on line 258. How does the errors parameter come into this?
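The empty-dict sentinel being discussed can be illustrated with a minimal sketch. `extract_records` is an invented name for illustration, not the pandas implementation:

```python
def extract_records(objs, key):
    # Sketch of the sentinel approach: instead of raising on a missing key,
    # emit an empty dict, which the caller later expands into a row of NaNs.
    out = []
    for obj in objs:
        try:
            out.append(obj[key])
        except KeyError:
            out.append({})   # sentinel: fill this row with NaN later
    return out

# Second object has no "x"; the sentinel {} marks it for NaN filling.
recs = extract_records([{"x": {"a": 1}}, {"y": 2}], "x")
```

The reviewer's point is that this sketch always swallows the KeyError, whereas the errors parameter should decide whether the sentinel or an exception is produced.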

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label May 10, 2019

Marky0 commented May 11, 2019

I think here you are attempting to both handle record_path pointing to JSON objects while also accounting for error handling with missing keys.

If that's correct I think you should just focus on solving the first issue and we can come back to the latter in a subsequent PR. Otherwise the review and update process may take much longer

@WillAyd you are correct that I have addressed two issues here. The primary one is missing keys; the secondary allows record_path to parse a single object. The latter only needs two lines of code, 194-196 (to check whether the parsed result is anything other than a list, and if not, make it a list, because it is then correctly parsed into the dataframe).

I can move this to a new PR, but it seems a lot of effort for two lines? Let me know your preference.
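The two-line fix referred to above amounts to wrapping a non-list result. A self-contained sketch (with a hypothetical function name, outside the context of normalize.py):

```python
def ensure_list(result):
    # Sketch of the two-line fix: if record_path points at a single JSON
    # object rather than an array of records, wrap it in a list so the
    # downstream flattening code handles both shapes uniformly.
    if not isinstance(result, list):
        result = [result]
    return result

single = ensure_list({"k": 1})               # wrapped into a one-element list
already = ensure_list([{"k": 1}, {"k": 2}])  # left unchanged
```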


Marky0 commented May 11, 2019

Can someone explain why the pandas-dev.pandas (Windows py37_np141) test is failing?

Cython-generated file 'pandas/_libs/algos.c' not found.


WillAyd commented May 11, 2019

The latter only needs two lines of code 194-196 (to check if the parsed result is anything other than a list, and if not make it a list because it is then correctly parsed into the dataframe).

The fix may be two lines of code but the main problem is that the tests are now ambiguously mixed between the two changes so would still be preferable to have those as separate PRs.


WillAyd commented May 11, 2019

Can someone explain why the pandas-dev.pandas (Windows py37_np141) test is failing?

Cython-generated file 'pandas/_libs/algos.c' not found.

Have seen this on a few other PRs; something must have changed on the CI side with Windows lately. Not caused by this PR.

Contributor

@jreback jreback left a comment

this needs some work


jreback commented Jun 8, 2019

@Marky0 if you want to merge master and address the comments


jreback commented Jun 27, 2019

Closing as stale; if you want to continue, please ping.

@jreback jreback closed this Jun 27, 2019
Labels
IO JSON read_json, to_json, json_normalize
Development

Successfully merging this pull request may close these issues.

json_normalize should raise when record_path doesn't point to an array
5 participants