Enhancement: include or exclude keys in json_normalize #27262


Conversation

bhavaniravi
Contributor

Member

@WillAyd WillAyd left a comment


Just mulling this over a bit more. Should we be giving consideration to use_keys instead?

Most other IO methods work by inclusion instead of exclusion (see usecols parameter) and I'm wondering if that wouldn't be more applicable here as well. Curious to hear your thoughts
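For context on the inclusion style referenced above, the usecols parameter in read_csv selects columns by listing what to keep, which is the behavior use_keys would mirror:

```python
import io

import pandas as pd

# usecols selects columns by inclusion; use_keys would play the same
# role for json_normalize keys
csv = io.StringIO("a,b,c\n1,2,3")
df = pd.read_csv(csv, usecols=["a", "c"])
print(list(df.columns))  # ['a', 'c']
```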

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Jul 7, 2019
@bhavaniravi
Contributor Author

Just mulling this over a bit more. Should we be giving consideration to use_keys instead?

Most other IO methods work by inclusion instead of exclusion (see usecols parameter) and I'm wondering if that wouldn't be more applicable here as well. Curious to hear your thoughts

Yes, I thought about it too. I come from a use case where I had a large nested dict with n keys and wanted to ignore certain keys from flattening at all levels. It is easier to list the keys to ignore than the keys to include.

What do you think?

@WillAyd
Member

WillAyd commented Jul 9, 2019

How about use_keys and it can accept a Callable to determine whether or not something can be included? That callable could provide the exclusion behavior you are after and it would be consistent with usecols behavior for other read methods

@bhavaniravi bhavaniravi reopened this Jul 10, 2019
@bhavaniravi
Contributor Author

A few things I thought about. I understand the consistency argument, but use_keys doesn't quite fit: my first intuition on seeing it would be inclusion.

We can do two things:

  1. Name the callable something like column_selector
  2. Have two parameters, use_keys and exclude_keys

@WillAyd
Member

WillAyd commented Jul 10, 2019

Option 2 is wonky from an API perspective. What is the problem with the following, though?

use_keys = lambda x: x not in {'exclude1', 'exclude2'}

I'm not tied to the use_keys moniker per se but I don't think we should deviate far from the similar functionality in the rest of the IO interface

@bhavaniravi
Contributor Author

Option 2 is wonky from an API perspective. What is the problem with the following, though?

use_keys = lambda x: x not in {'exclude1', 'exclude2'}

I'm not tied to the use_keys moniker per se but I don't think we should deviate far from the similar functionality in the rest of the IO interface

Okay. This makes sense. Let me do this.

Member

@WillAyd WillAyd left a comment


Haven't reviewed in depth yet, but a general comment on approach.

@WillAyd
Member

WillAyd commented Jul 17, 2019

Sorry, I didn't mean to actually change the name to usecols but rather to support the same inputs, namely a string, a list of strings, and a callable. This was previously only supporting a callable.

Does that make sense?

@bhavaniravi
Contributor Author

bhavaniravi commented Jul 17, 2019 via email

@WillAyd
Member

WillAyd commented Jul 17, 2019 via email

@bhavaniravi
Contributor Author

bhavaniravi commented Jul 17, 2019 via email

@datapythonista
Member

@bhavaniravi in general it's probably better if you push your changes once they are in a state to be reviewed. There are a few reasons for it:

  • Every push sends an email to the people subscribed to it, or to the whole project
  • Every push triggers a build in the CI (and unnecessary builds can make the builds of other PRs slower)
  • We don't know when you're done and we should review

None of the above is a big deal, so feel free to push when you need to. But if there is no reason to push before the end, it's worth being aware of the points above.

@bhavaniravi
Contributor Author

@bhavaniravi in general it's probably better if you push your changes once they are in a state to be reviewed. There are a few reasons for it:

  • Every push sends an email to the people subscribed to it, or to the whole project
  • Every push triggers a build in the CI (and unnecessary builds can make the builds of other PRs slower)
  • We don't know when you're done and we should review

None of the above is a big deal, so feel free to push when you need to. But if there is no reason to push before the end, it's worth being aware of the points above.

Sure. Will keep that in mind.

@bhavaniravi
Contributor Author

@bhavaniravi can you also add a test case that uses nested_input_data and selects the Name key? That key, in particular, is nested a few levels deeper so would be good to validate

@WillAyd I am kind of stuck at this point. When the nesting level is > 1 and use_keys is an inclusion method, should the key be a full path like CreatedBy.Name.FirstName? Mentioning just CreatedBy flattens only one level, and just Name could be confused with a Name key at other levels:

data = {
    "Name": "PO001",
    "CreatedBy": {"Name": {"FirstName": "Name001"}},
}
json_normalize(data, use_keys="CreatedBy.Name.FirstName")

Am I thinking right?

@WillAyd
Member

WillAyd commented Aug 1, 2019

So to summarize the two options given:

[{"a": 1},
 {"level_0": {"a": 1}},
 {"level_0": {"level_1": {"a": 1}}}]

Option 1 assuming use_keys="a" would include all of those entries, so that the evaluation of the expression is done solely against the object key without any real consideration for its place in the hierarchy.

Option 2 as I think you mentioned is to require the full hierarchy, so you'd have to do something like use_key="level_0.level_1.a" to get the last level.

Thinking out loud, Option 1 seems easier from a usage perspective, whereas Option 2 is more explicit. I might have a slight preference for Option 2 as it's a more conservative approach, and it's usually easier to expand API functionality than to limit it, conceding that I still think Option 1 is more user friendly.

Open to your thoughts on it for sure
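The two matching strategies above can be sketched with a pair of hypothetical helpers (names are illustrative):

```python
def match_by_leaf(keypath, key):
    # Option 1: match on the key name alone,
    # regardless of where it sits in the hierarchy
    return keypath.split(".")[-1] == key


def match_by_path(keypath, key):
    # Option 2: require the full dotted path
    return keypath == key


print(match_by_leaf("level_0.level_1.a", "a"))  # True
print(match_by_path("level_0.level_1.a", "a"))  # False
```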

@bhavaniravi
Contributor Author

@WillAyd I lean toward being explicit about the levels too. The current implementation doesn't support that; let me fix it and get back.

@bhavaniravi
Contributor Author

@WillAyd It's getting way more complicated than I thought and I need your help sorting it out.

Currently, we support 3 types for the use_keys param: function, list, and str. Each of these is converted into a callable (checking in or not in) inside _parse_use_keys, which is later called to decide whether or not to flatten a key.

Now, incorporating levels into all 3 types is getting way too complicated:

  1. use_key="level_0.level_1.a"
  2. use_key=["level_0.level_1.a", "key2.level1"]
  3. use_key= lambda x: x not in ["level_0.level_1.a"]

@WillAyd
Member

WillAyd commented Aug 23, 2019

Hey @bhavaniravi - I've been on vacation for a few weeks so sorry for lack of response. I'll be back in a few days and can review what you have again in more detail then

@WillAyd
Member

WillAyd commented Aug 24, 2019

@bhavaniravi just reading through your last comment again. Is your concern that having a user specify the entire path to a key is too rigid? If so, agreed it's a little tough from a usage perspective, but that goes back to some of the comments in #27262 (comment)

So maybe as a third option we could explicitly document that use_key only works on top-level keys of the JSON structure for now, and leave it to a future enhancement to have it work further down in the hierarchy. I don't think that requires a lot of extra effort on top of what you have here, but it is explicit about the limitations. Do you think that might be easier?

@WillAyd
Member

WillAyd commented Sep 5, 2019

@bhavaniravi is this still active? Thoughts on previous comment?

@@ -207,6 +261,23 @@ def json_normalize(
1 130 60 NaN Mose Reg
2 130 60 2.0 Faye Raker

Normalizes nested data up to level 1.
Contributor


can you add an example with a str / list-of-str? are these useful? better to just make this a callable only?

Contributor Author


The initial proposal was to provide a list of keys to ignore. After a discussion with @WillAyd, he suggested str and list of str to be consistent with the inclusion behavior in other modules, which made sense.

With callable only, I'm not sure we can achieve the multi-level support we are talking about in
#27262 (comment)

Let me know your thoughts

@bhavaniravi
Contributor Author

@WillAyd In #27262 (comment)

Option 1 assuming use_keys="a" would include all of those entries, so that the evaluation of the expression is done solely against the object key without any real consideration for its place in the hierarchy.

whereas the current implementation only normalizes the a at level=0, since it looks at the level_0 key, which is not in the list, and proceeds without normalizing.

I think we both agree on users being explicit for options 1 and 2, leaving the callable for a future implementation. Is that right?

@WillAyd
Member

WillAyd commented Sep 9, 2019

If the callable is a hang-up then feel free to ignore it for now, but I am under the impression the issue is more about dealing with nested levels. Can you be sure to add a test case like:

data = {
  "key1": {
    "should_match": 1,
    "no_match": 2
  },
  "should_match": 3
}

And make sure that should_match gets found in all cases? Right now the tests only make assertions about items at the top of the JSON hierarchy.
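For reference, the baseline (unfiltered) flattening of that data can be checked with today's pandas; a use_keys test would then assert on the filtered subset of these columns:

```python
import pandas as pd

data = {
    "key1": {"should_match": 1, "no_match": 2},
    "should_match": 3,
}

# Without filtering, every nested key is flattened into a dotted column name
df = pd.json_normalize(data)
print(sorted(df.columns))  # ['key1.no_match', 'key1.should_match', 'should_match']
```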

@bhavaniravi
Contributor Author

@WillAyd Made the changes you asked for. I had to change and swap a few things to get it right. Now use_keys can be level specific, e.g. Level0.Level1.Level2.Key.


if is_list_like(use_keys):
    return any(
        key.split(".")[-len(i.split(".")):] == i.split(".") for i in use_keys
    )
Member


Can you break this up into a few lines instead of a single generator expression? I think it would be more readable that way.
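For example, the generator expression could be unrolled along these lines (an illustrative equivalent of is_key_match, not the PR's exact code):

```python
def is_key_match(key, use_keys):
    # Unrolled version of the one-line generator expression:
    # a key matches if any pattern in use_keys is a trailing
    # dotted segment of the key path
    key_parts = key.split(".")
    for pattern in use_keys:
        pattern_parts = pattern.split(".")
        if key_parts[-len(pattern_parts):] == pattern_parts:
            return True
    return False
```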

# current keypath matches the config in use_keys
# only dicts get recursively flattened
if (
    is_key_match(newkey, use_keys)
Member


I don't think this validation is necessary for the change; can you remove is_key_match?

@alimcmaster1
Member

@bhavaniravi - could you merge master whenever you have some time?

@WillAyd
Member

WillAyd commented Dec 17, 2019

@bhavaniravi closing as I think this has gone stale, but certainly ping if you'd like to pick it back up and I can reopen.

@WillAyd WillAyd closed this Dec 17, 2019
Labels
IO JSON read_json, to_json, json_normalize
Development

Successfully merging this pull request may close these issues.

ENH: Ignore flattening certain keys in json_normalize
6 participants