Enhancement: include or exclude keys in json_normalize #27262


Conversation

bhavaniravi
Contributor

Member

@WillAyd WillAyd left a comment


Just mulling this over a bit more. Should we be giving consideration to use_keys instead?

Most other IO methods work by inclusion instead of exclusion (see usecols parameter) and I'm wondering if that wouldn't be more applicable here as well. Curious to hear your thoughts
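For context on the inclusion style referenced above, the usecols parameter in read_csv selects columns by listing what to keep, which is the behavior use_keys would mirror:

```python
import io

import pandas as pd

# usecols selects columns by inclusion; use_keys would play the same
# role for json_normalize keys
csv = io.StringIO("a,b,c\n1,2,3")
df = pd.read_csv(csv, usecols=["a", "c"])
print(list(df.columns))  # ['a', 'c']
```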

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Jul 7, 2019
@bhavaniravi
Contributor Author

Just mulling this over a bit more. Should we be giving consideration to use_keys instead?

Most other IO methods work by inclusion instead of exclusion (see usecols parameter) and I'm wondering if that wouldn't be more applicable here as well. Curious to hear your thoughts

Yes, I thought about it too. I come from a use case where I had a large nested dict with n keys and wanted to ignore certain keys from flattening at all levels. It is easier to list the keys to ignore than the keys to include.

What do you think?

@WillAyd
Member

WillAyd commented Jul 9, 2019

How about use_keys and it can accept a Callable to determine whether or not something can be included? That callable could provide the exclusion behavior you are after and it would be consistent with usecols behavior for other read methods

@bhavaniravi bhavaniravi reopened this Jul 10, 2019
@bhavaniravi
Contributor Author

A few things I thought about. I understand the consistency argument, but use_keys doesn't quite fit: my first intuition on seeing it would be inclusion.

We can do two things:

  1. Name the callable something like column_selector
  2. Have two parameters, use_keys and exclude_keys

@WillAyd
Member

WillAyd commented Jul 10, 2019

Option 2 is wonky from an API perspective. What is the problem with the following, though?

use_keys = lambda x: x not in {'exclude1', 'exclude2'}

I'm not tied to the use_keys moniker per se but I don't think we should deviate far from the similar functionality in the rest of the IO interface

@bhavaniravi
Contributor Author

Option 2 is wonky from an API perspective. What is the problem with the following, though?

use_keys = lambda x: x not in {'exclude1', 'exclude2'}

I'm not tied to the use_keys moniker per se but I don't think we should deviate far from the similar functionality in the rest of the IO interface

Okay. This makes sense. Let me do this.

Member

@WillAyd WillAyd left a comment


Haven't reviewed in depth yet, but a general comment on approach.

@WillAyd
Member

WillAyd commented Jul 17, 2019

Sorry, I didn't mean to actually change the name to usecols but rather to support the same inputs, namely a string, a list of strings, and a callable. This was previously only supporting a callable.

Does that make sense?

@bhavaniravi
Contributor Author

bhavaniravi commented Jul 17, 2019 via email

@WillAyd
Member

WillAyd commented Jul 17, 2019 via email

@bhavaniravi
Contributor Author

bhavaniravi commented Jul 17, 2019 via email

@datapythonista
Member

@bhavaniravi in general it's probably better if you push your changes once they are in a state to be reviewed. There are a few reasons for it:

  • Every push sends an email to the people subscribed to it, or to the whole project
  • Every push triggers a build in the CI (and unnecessary builds can make the builds of other PRs slower)
  • We don't know when you're done and we should review

None of the above is a big deal, so feel free to push when you need to. But if there is no reason to push before the end, it's worth being aware of the points above.

@bhavaniravi
Contributor Author

@bhavaniravi in general it's probably better if you push your changes once they are in a state to be reviewed. There are a few reasons for it:

  • Every push sends an email to the people subscribed to it, or to the whole project
  • Every push triggers a build in the CI (and unnecessary builds can make the builds of other PRs slower)
  • We don't know when you're done and we should review

None of the above is a big deal, so feel free to push when you need to. But if there is no reason to push before the end, it's worth being aware of the points above.

Sure. Will keep that in mind.

@bhavaniravi
Contributor Author

@bhavaniravi can you also add a test case that uses nested_input_data and selects the Name key? That key, in particular, is nested a few levels deeper so would be good to validate

@WillAyd I am kind of stuck at this point. When the nesting level is > 1 and use_keys is an inclusion method, should the key be a full path like CreatedBy.Name.FirstName? Mentioning just CreatedBy flattens only one level, and just Name could be confused with a Name key at other levels:

data = {
    "Name": "PO001",
    "CreatedBy": {"Name": {"FirstName": "Name001"}},
}
json_normalize(data, use_keys="CreatedBy.Name.FirstName")

Am I thinking right?

@WillAyd
Member

WillAyd commented Aug 1, 2019

So to summarize the two options given:

[{"a": 1},
 {"level_0": {"a": 1}},
 {"level_0": {"level_1": {"a": 1}}}]

Option 1 assuming use_keys="a" would include all of those entries, so that the evaluation of the expression is done solely against the object key without any real consideration for its place in the hierarchy.

Option 2 as I think you mentioned is to require the full hierarchy, so you'd have to do something like use_key="level_0.level_1.a" to get the last level.

Thinking out loud, Option 1 seems easier from a usage perspective, whereas Option 2 is more explicit. I might have a slight preference for Option 2 as it's a more conservative approach, and it's usually easier to expand API functionality than to limit it, conceding that I still think Option 1 is more user friendly.

Open to your thoughts on it for sure
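The two matching strategies above can be sketched with a pair of hypothetical helpers (names are illustrative):

```python
def match_by_leaf(keypath, key):
    # Option 1: match on the key name alone,
    # regardless of where it sits in the hierarchy
    return keypath.split(".")[-1] == key


def match_by_path(keypath, key):
    # Option 2: require the full dotted path
    return keypath == key


print(match_by_leaf("level_0.level_1.a", "a"))  # True
print(match_by_path("level_0.level_1.a", "a"))  # False
```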

@bhavaniravi
Contributor Author

@WillAyd I lean toward being explicit about the levels too. The current implementation doesn't support that; let me fix it and get back.

@bhavaniravi
Contributor Author

@WillAyd It's getting way more complicated than I thought and I need your help sorting it out.

Currently, we support 3 types for the use_keys param: function, list, and str. Each of these is converted into a callable (checking in or not in) inside _parse_use_keys, which is later called to decide whether or not to flatten a key.

Now, incorporating levels into all 3 types is getting way too complicated:

  1. use_key="level_0.level_1.a"
  2. use_key=["level_0.level_1.a", "key2.level1"]
  3. use_key= lambda x: x not in ["level_0.level_1.a"]

@WillAyd
Member

WillAyd commented Aug 23, 2019

Hey @bhavaniravi - I've been on vacation for a few weeks so sorry for lack of response. I'll be back in a few days and can review what you have again in more detail then

@WillAyd
Member

WillAyd commented Aug 24, 2019

@bhavaniravi just reading through your last comment again. Is your concern that having a user specify the entire path to a key is too rigid? If so, agreed it's a little tough from a usage perspective, but that goes back to some of the comments in #27262 (comment)

So maybe as a third option we could explicitly document that use_key only works on top-level keys of the JSON structure for now, and leave it to a future enhancement to have it work further down in the hierarchy. I don't think that requires a lot of extra effort on top of what you have here, but it is explicit about the limitations. Do you think that might be easier?

@WillAyd
Member

WillAyd commented Sep 5, 2019

@bhavaniravi is this still active? Thoughts on previous comment?

@@ -207,6 +261,23 @@ def json_normalize(
1 130 60 NaN Mose Reg
2 130 60 2.0 Faye Raker

Normalizes nested data up to level 1.
Contributor


can you add an example with a str / list-of-str? are these useful? better to just make this a callable only?

Contributor Author


The initial proposal was to provide a list of keys to ignore. After a discussion with @WillAyd, he suggested str and list of str to be consistent with the inclusion behavior in other modules, which made sense.

With callable only, I'm not sure we can achieve the multi-level support we are talking about in
#27262 (comment)

Let me know your thoughts

@bhavaniravi
Contributor Author

@WillAyd In #27262 (comment)

Option 1 assuming use_keys="a" would include all of those entries, so that the evaluation of the expression is done solely against the object key without any real consideration for its place in the hierarchy.

whereas the current implementation only normalizes the a at level=0, since it looks at the level_0 key, which is not in the list, and proceeds without normalizing.

I think we both agree on users being explicit for options 1 and 2, leaving the callable for a future implementation. Is that right?

@WillAyd
Member

WillAyd commented Sep 9, 2019

If the callable is a hang-up then feel free to ignore it for now, but I am under the impression the issue is more about dealing with nested levels. Can you be sure to add a test case like:

data = {
  "key1": {
    "should_match": 1,
    "no_match": 2
  },
  "should_match": 3
}

And make sure that should_match gets found in all cases? Right now the tests only make assertions about items at the top of the JSON hierarchy.
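For reference, the baseline (unfiltered) flattening of that data can be checked with today's pandas; a use_keys test would then assert on the filtered subset of these columns:

```python
import pandas as pd

data = {
    "key1": {"should_match": 1, "no_match": 2},
    "should_match": 3,
}

# Without filtering, every nested key is flattened into a dotted column name
df = pd.json_normalize(data)
print(sorted(df.columns))  # ['key1.no_match', 'key1.should_match', 'should_match']
```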

@bhavaniravi
Contributor Author

@WillAyd Made the changes you asked for. I had to change and swap a few things to get it right. Now use_keys can be level specific, e.g. Level0.Level1.Level2.Key.


if is_list_like(use_keys):
    return any(
        key.split(".")[-len(i.split(".")):] == i.split(".") for i in use_keys
    )
Member


Can you break this up into a few lines instead of a single generator expression? I think it would be more readable that way.
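For example, the generator expression could be unrolled along these lines (an illustrative equivalent of is_key_match, not the PR's exact code):

```python
def is_key_match(key, use_keys):
    # Unrolled version of the one-line generator expression:
    # a key matches if any pattern in use_keys is a trailing
    # dotted segment of the key path
    key_parts = key.split(".")
    for pattern in use_keys:
        pattern_parts = pattern.split(".")
        if key_parts[-len(pattern_parts):] == pattern_parts:
            return True
    return False
```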

# current keypath matches the config in use_keys
# only dicts get recursively flattened
if (
    is_key_match(newkey, use_keys)
Member


I don't think this validation is necessary for the change; can you remove is_key_match?

@alimcmaster1
Member

@bhavaniravi - could you merge master whenever you have some time?

@WillAyd
Member

WillAyd commented Dec 17, 2019

@bhavaniravi closing as I think this has gone stale, but certainly ping if you'd like to pick it back up and I can reopen.

@WillAyd WillAyd closed this Dec 17, 2019
Labels
IO JSON read_json, to_json, json_normalize
Development

Successfully merging this pull request may close these issues.

ENH: Ignore flattening certain keys in json_normalize
6 participants