Enhancement include or exclude keys in json normalize #27262

Closed
51 commits
4c7da78
add ignore_keys param to json_normalize
bhavaniravi Jul 6, 2019
502210e
add ignore_keys and respective documentation to json_normalize
bhavaniravi Jul 6, 2019
c3268fd
renamed fixture not reflected
bhavaniravi Jul 6, 2019
1dc0c0c
fix black lint and documentatio issue
bhavaniravi Jul 6, 2019
0fdad82
fix documetnation issues
bhavaniravi Jul 7, 2019
cbdaca0
convert ignore_keys implementation to
bhavaniravi Jul 13, 2019
3458717
Merge branch 'master' of git://github.com/pandas-dev/pandas into enh/…
bhavaniravi Jul 13, 2019
4d0082b
Merge branch 'master' of git://github.com/pandas-dev/pandas into enh/…
bhavaniravi Jul 13, 2019
80c1011
Merge branch 'master' of git://github.com/pandas-dev/pandas into enh/…
bhavaniravi Jul 15, 2019
8215729
fix black liniting and rst failure
bhavaniravi Jul 15, 2019
49d8334
Merge branch 'enh/ignore_keys_in_json_normalize' of github.com:bhavan…
bhavaniravi Jul 15, 2019
7f14091
documentation failure fix
bhavaniravi Jul 16, 2019
e601693
change 'use_key' to 'usecols' to match csv like implementation
bhavaniravi Jul 17, 2019
b348038
revert usecols to use_keys, preserving support for list and str
bhavaniravi Jul 17, 2019
e13aece
parametrize test, fix test case failure
bhavaniravi Jul 19, 2019
ddea67a
change 0.25.0 to 1.0
bhavaniravi Jul 21, 2019
7e8da34
Merge branch 'master' of git://github.com/pandas-dev/pandas into enh/…
bhavaniravi Jul 21, 2019
b78797c
move docs to 1.0.0
bhavaniravi Jul 21, 2019
73c1d8d
Update doc/source/user_guide/io.rst
bhavaniravi Jul 25, 2019
7bb81d7
Update pandas/io/json/_normalize.py
bhavaniravi Jul 25, 2019
d02d70d
Update pandas/io/json/_normalize.py
bhavaniravi Jul 25, 2019
a28de10
simplify _parse_use_keys
bhavaniravi Jul 25, 2019
e075305
update postive and negative testcases
bhavaniravi Jul 25, 2019
47f9f24
Merge branch 'master' of git://github.com/pandas-dev/pandas into enh/…
bhavaniravi Jul 25, 2019
0f4a4bc
move exclude testcase seperately
bhavaniravi Jul 25, 2019
b64411b
Merge branch 'master' of git://github.com/pandas-dev/pandas into enh/…
bhavaniravi Jul 25, 2019
59b139b
Merge branch 'enh/ignore_keys_in_json_normalize' of github.com:bhavan…
bhavaniravi Jul 25, 2019
fe245e5
Merge branch 'enh/ignore_keys_in_json_normalize' of github.com:bhavan…
bhavaniravi Jul 25, 2019
1efeb50
Merge branch 'enh/ignore_keys_in_json_normalize' of github.com:bhavan…
bhavaniravi Jul 25, 2019
ab125c3
change use_keys in docs and returns True by default
bhavaniravi Jul 26, 2019
9ce4212
Update pandas/io/json/_normalize.py
bhavaniravi Jul 28, 2019
2733bfd
Update pandas/io/json/_normalize.py
bhavaniravi Jul 28, 2019
a871818
Update pandas/io/json/_normalize.py
bhavaniravi Jul 28, 2019
1ff357a
Update pandas/io/json/_normalize.py
bhavaniravi Jul 28, 2019
ef72c0a
Update pandas/io/json/_normalize.py
bhavaniravi Jul 28, 2019
7ae7065
Update pandas/io/json/_normalize.py
bhavaniravi Jul 28, 2019
57959d3
Update pandas/io/json/_normalize.py
bhavaniravi Jul 28, 2019
fcacc1b
fix code formatting issues
bhavaniravi Jul 28, 2019
30d80a5
Merge branch 'master' of git://github.com/pandas-dev/pandas into enh/…
bhavaniravi Jul 28, 2019
cb5fa8b
Merge branch 'enh/ignore_keys_in_json_normalize' of github.com:bhavan…
bhavaniravi Jul 28, 2019
303460b
Merge branch 'enh/ignore_keys_in_json_normalize' of github.com:bhavan…
bhavaniravi Jul 28, 2019
f4bc66d
Merge branch 'enh/ignore_keys_in_json_normalize' of github.com:bhavan…
bhavaniravi Jul 28, 2019
8060574
Merge branch 'master' of github.com:pandas-dev/pandas into enh/ignore…
bhavaniravi Sep 12, 2019
6fb13c3
- Modify normalize_json to work for use_keys at multiple levels
bhavaniravi Oct 5, 2019
4ba74ff
Merge branch 'master' of github.com:pandas-dev/pandas into enh/ignore…
bhavaniravi Oct 5, 2019
d35aeaa
fix linting issues using black formatting
bhavaniravi Oct 5, 2019
d0a077f
fix wrong var name in whatsnew change ignore_key to use_keys
bhavaniravi Oct 6, 2019
7c94e51
Merge branch 'master' of github.com:pandas-dev/pandas into enh/ignore…
bhavaniravi Oct 6, 2019
bc7ac76
remove commented tests, add valid name for function & variable
bhavaniravi Oct 8, 2019
a15c463
simplify list comprehension
bhavaniravi Oct 8, 2019
b0bfa75
add annotations to inner functions
bhavaniravi Oct 19, 2019
15 changes: 15 additions & 0 deletions doc/source/user_guide/io.rst
@@ -2188,6 +2188,21 @@ With max_level=1 the following snippet normalizes until 1st nesting level of the
}]
json_normalize(data, max_level=1)

.. versionadded:: 1.0.0

The ``use_keys`` parameter allows you to include or exclude specific keys while normalizing.
With ``use_keys`` the following snippet normalizes all keys except "Image".

.. ipython:: python

data = [{'CreatedBy': {'Name': 'User001'},
'Lookup': {'TextField': 'Some text',
'UserField': {'Id': 'ID001',
'Name': 'Name001'}},
'Image': {'a': 'b'}
}]
json_normalize(data, use_keys=lambda key: key not in ["Image"])

.. _io.jsonl:

Line delimited json
1 change: 0 additions & 1 deletion doc/source/whatsnew/v0.25.0.rst
@@ -181,7 +181,6 @@ The repr now looks like this:
}]
json_normalize(data, max_level=1)


.. _whatsnew_0250.enhancements.explode:

Series.explode to split list-like values to rows
22 changes: 22 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -94,6 +94,28 @@ of the Series or columns of a DataFrame will also have string dtype.
We recommend explicitly using the ``string`` data type when working with strings.
See :ref:`text.types` for more.

.. _whatsnew_1000.enhancements.json_normalize_with_use_keys:

Json normalize with use_keys param support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`json_normalize` normalizes all keys of a given nested dictionary.
The new ``use_keys`` parameter provides more control over
which keys to include or exclude while normalizing (:issue:`27241`):

For example:

.. ipython:: python

from pandas.io.json import json_normalize
data = [{
'CreatedBy': {'Name': 'User001'},
'Lookup': {'TextField': 'Some text',
'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
'Image': {'a': 'b'}
}]
json_normalize(data, use_keys=lambda key: key not in ["Image"])

.. _whatsnew_1000.enhancements.other:

Other enhancements
174 changes: 148 additions & 26 deletions pandas/io/json/_normalize.py
@@ -3,13 +3,14 @@

from collections import defaultdict
import copy
from typing import DefaultDict, Dict, List, Optional, Union
from typing import Callable, DefaultDict, Dict, Generator, List, Optional, Union

import numpy as np

from pandas._libs.writers import convert_json_to_lines

from pandas import DataFrame
from pandas.api.types import is_list_like


def convert_to_line_delimits(s):
@@ -26,12 +27,44 @@ def convert_to_line_delimits(s):
return convert_json_to_lines(s)


def _parse_use_keys(use_keys: Optional[Union[str, List, Callable]]) -> Callable:
"""
Converts different types of use_keys into a callable.

Parameters
----------
use_keys : str, list or callable
Returns true or false depending on whether to include or exclude a key.

.. versionadded:: 1.0.0

Returns
-------
callable
Decides whether to include a key in processing.
"""
if callable(use_keys):
return use_keys

if is_list_like(use_keys):
return lambda x: x in use_keys # type: ignore

if isinstance(use_keys, str):
return lambda x: x == use_keys

if use_keys is None:
return lambda x: True

Contributor:

what is the else here? None?

Contributor Author:

Yup

raise TypeError("`use_keys` must be a str, list or a callable")
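
The dispatch above reduces every accepted `use_keys` form to a single key → bool predicate. A standalone sketch of the same logic (a hypothetical `parse_use_keys`, using `isinstance` checks in place of pandas' `is_list_like`) behaves like this:

```python
def parse_use_keys(use_keys):
    # Mirror of _parse_use_keys above: normalize str / list / callable / None
    # into a predicate that says whether a key should be flattened.
    if callable(use_keys):
        return use_keys
    if isinstance(use_keys, (list, tuple, set)):  # stand-in for is_list_like
        return lambda key: key in use_keys
    if isinstance(use_keys, str):
        return lambda key: key == use_keys
    if use_keys is None:
        return lambda key: True  # default: flatten everything
    raise TypeError("`use_keys` must be a str, list or a callable")

print(parse_use_keys("fitness")("fitness"))             # True
print(parse_use_keys(["a", "b"])("c"))                  # False
print(parse_use_keys(lambda k: k != "Image")("Image"))  # False
print(parse_use_keys(None)("anything"))                 # True
```

Whatever the caller passes, downstream code only ever sees a callable, which keeps the flattening loop free of type checks.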


def nested_to_record(
ds,
prefix: str = "",
sep: str = ".",
level: int = 0,
max_level: Optional[int] = None,
use_keys: Optional[Union[str, List[str], Callable]] = None,
):
"""
A simplified json_normalize
@@ -49,14 +82,19 @@

.. versionadded:: 0.20.0

level: int, optional, default: 0
level : int, optional, default: 0
The number of levels in the json string.

max_level: int, optional, default: None
max_level : int, optional, default: None
The max depth to normalize.

.. versionadded:: 0.25.0

use_keys : str, list or callable, optional
Criteria for inclusion of a particular JSON object (matches on key).

.. versionadded:: 1.0.0

Returns
-------
d - dict or list of dicts, matching `ds`
@@ -74,10 +112,39 @@
'nested.e.c': 1,
'nested.e.d': 2}
"""

def is_key_match(key, use_keys):
if callable(use_keys):
return use_keys(key)

if use_keys is None:
return lambda x: True

if isinstance(use_keys, str):
use_keys = [use_keys]

if is_list_like(use_keys):
return key in use_keys

raise TypeError("`use_keys` must be a str, list or a callable")

def flatten_deeper(
_dict: dict, level: int, prev_keys: Optional[List] = None
) -> Generator:
prev_keys = prev_keys if prev_keys else []
for key, val in _dict.items():
if isinstance(val, dict) and (max_level is None or level < max_level):
yield from flatten_deeper(
_dict=val, prev_keys=prev_keys + [key], level=level + 1
)
else:
yield prev_keys + [key], val
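
`flatten_deeper` walks a nested dict depth-first and yields `(keypath, value)` pairs, recursing only while the current level is below `max_level`; the caller then joins each keypath with `sep`. A self-contained sketch of the same generator:

```python
def flatten_deeper(d, level=1, max_level=None, prev_keys=None):
    # Yield (list-of-keys, value) pairs, descending into nested dicts
    # until max_level is reached (None means no depth limit).
    prev_keys = prev_keys or []
    for key, val in d.items():
        if isinstance(val, dict) and (max_level is None or level < max_level):
            yield from flatten_deeper(val, level + 1, max_level, prev_keys + [key])
        else:
            yield prev_keys + [key], val

lookup = {"TextField": "Some text",
          "UserField": {"Id": "ID001", "Name": "Name001"}}
flat = {".".join(["Lookup", *keys]): val for keys, val in flatten_deeper(lookup)}
print(flat)
# {'Lookup.TextField': 'Some text',
#  'Lookup.UserField.Id': 'ID001',
#  'Lookup.UserField.Name': 'Name001'}
```

With `max_level=1` the same call yields the top-level keys untouched, since no level satisfies `level < max_level`.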

singleton = False
if isinstance(ds, dict):
ds = [ds]
singleton = True

new_ds = []
for d in ds:
new_d = copy.deepcopy(d)
@@ -90,20 +157,29 @@ def nested_to_record(
else:
newkey = prefix + sep + k

# flatten if type is dict and
# current dict level < maximum level provided and
# only dicts gets recurse-flattened
# only at level>1 do we rename the rest of the keys
if not isinstance(v, dict) or (
max_level is not None and level >= max_level
# flatten if
# current dict level <= maximum level provided and
# current keypath matches the config in use_keys
# only dicts get recursively flattened
if (
is_key_match(newkey, use_keys)
Member:

I don't think this validation is necessary for the change; can you remove is_key_match?

and (max_level is None or level < max_level)
and isinstance(v, dict)
):
if level != 0: # so we skip copying for top level, common case
v = new_d.pop(k)
new_d[newkey] = v
continue
# pop the current key
new_d.pop(k)
# Flatten the value and update it at the current level
for inner_keys, val in flatten_deeper(v, level=level + 1):
new_d[sep.join([k, *inner_keys])] = val

else:
v = new_d.pop(k)
new_d.update(nested_to_record(v, newkey, sep, level + 1, max_level))
if isinstance(v, dict) and (max_level is None or level < max_level):
new_d[k] = nested_to_record(
v, newkey, sep, level + 1, max_level, use_keys
)
else:
new_d[k] = v

new_ds.append(new_d)

if singleton:
@@ -120,6 +196,7 @@ def json_normalize(
errors: Optional[str] = "raise",
sep: str = ".",
max_level: Optional[int] = None,
use_keys: Optional[Union[Callable, str, List[str]]] = None,
):
"""
Normalize semi-structured JSON data into a flat table.
@@ -161,6 +238,11 @@

.. versionadded:: 0.25.0

use_keys : str, list of str, or callable, optional, default None
Includes or excludes keys while normalizing, based on the given condition.

.. versionadded:: 1.0.0

Returns
-------
frame : DataFrame
@@ -179,6 +261,8 @@
1 NaN NaN Regner NaN Mose NaN
2 2.0 Faye Raker NaN NaN NaN NaN

Normalizes a list of dicts into a flattened data frame.

>>> data = [{'id': 1,
... 'name': "Cole Volk",
... 'fitness': {'height': 130, 'weight': 60}},
@@ -187,12 +271,12 @@
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> json_normalize(data, max_level=0)
fitness id name
0 {'height': 130, 'weight': 60} 1.0 Cole Volk
1 {'height': 130, 'weight': 60} NaN Mose Reg
2 {'height': 130, 'weight': 60} 2.0 Faye Raker
id name fitness
0 1.0 Cole Volk {'height': 130, 'weight': 60}
1 NaN Mose Reg {'height': 130, 'weight': 60}
2 2.0 Faye Raker {'height': 130, 'weight': 60}

Normalizes nested data upto level 1.
Normalizes nested data up to level 0.

>>> data = [{'id': 1,
... 'name': "Cole Volk",
@@ -202,10 +286,42 @@
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> json_normalize(data, max_level=1)
fitness.height fitness.weight id name
0 130 60 1.0 Cole Volk
1 130 60 NaN Mose Reg
2 130 60 2.0 Faye Raker
id name fitness.height fitness.weight
0 1.0 Cole Volk 130 60
1 NaN Mose Reg 130 60
2 2.0 Faye Raker 130 60

Normalizes nested data up to level 1.
Contributor:

can you add an example with a str / list-of-str? are these useful? better to just make this a callable only?

Contributor Author:

The initial proposal was to provide a list of keys to ignore. After a discussion with @WillAyd he suggested list and list of str be consistent with the inclusion in other modules which made sense.

With callable only I'm not sure if we can achieve the multi-level support we are talking about in
#27262 (comment)

Let me know your thoughts


>>> data = [{'id': 1,
... 'name': "Cole Volk",
... 'fitness': {'height': 130, 'weight': 60}},
... {'name': "Mose Reg",
... 'fitness': {'height': 130, 'weight': 60}},
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> json_normalize(data, use_keys=lambda key: key not in ["fitness"])
id name fitness
0 1.0 Cole Volk {'height': 130, 'weight': 60}
1 NaN Mose Reg {'height': 130, 'weight': 60}
2 2.0 Faye Raker {'height': 130, 'weight': 60}

Ignores specific keys from being flattened

>>> data = [{'id': 1,
... 'name': "Cole Volk",
... 'fitness': {'height': 130, 'weight': 60}},
... {'name': "Mose Reg",
... 'fitness': {'height': 130, 'weight': 60}},
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> json_normalize(data, use_keys="fitness")
id name fitness.height fitness.weight
0 1.0 Cole Volk 130 60
1 NaN Mose Reg 130 60
2 2.0 Faye Raker 130 60

Flattens specific set of selected keys

>>> data = [{'state': 'Florida',
... 'shortname': 'FL',
@@ -254,6 +370,8 @@ def _pull_field(js, spec):
if isinstance(data, dict):
data = [data]

use_key = _parse_use_keys(use_keys)

if record_path is None:
if any([isinstance(x, dict) for x in y.values()] for y in data):
# naive normalization, this is idempotent for flat records
@@ -263,7 +381,9 @@ def _pull_field(js, spec):
#
# TODO: handle record value which are lists, at least error
# reasonably
data = nested_to_record(data, sep=sep, max_level=max_level)
data = nested_to_record(
ds=data, sep=sep, max_level=max_level, use_keys=use_key
)
return DataFrame(data)
elif not isinstance(record_path, list):
record_path = [record_path]
@@ -296,7 +416,9 @@ def _recursive_extract(data, path, seen_meta, level=0):
for obj in data:
recs = _pull_field(obj, path[0])
recs = [
nested_to_record(r, sep=sep, max_level=max_level)
nested_to_record(
ds=r, sep=sep, max_level=max_level, use_keys=use_key
)
if isinstance(r, dict)
else r
for r in recs
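
Since this PR was closed and `use_keys` never shipped in pandas, the end-to-end behavior can be illustrated without pandas at all. The hypothetical `nested_to_record_sketch` below is a simplification of the patched `nested_to_record`: a recursive flattener that consults the predicate before descending (it omits the lazy top-level copying and the key-renaming of excluded subtrees in the real patch):

```python
def nested_to_record_sketch(d, prefix="", sep=".", level=0,
                            max_level=None, use_keys=None):
    # Flatten a nested dict into dotted keys, but only descend into
    # keys the use_keys predicate accepts; rejected subtrees stay intact.
    match = use_keys if callable(use_keys) else (lambda key: True)
    out = {}
    for k, v in d.items():
        newkey = k if level == 0 else prefix + sep + k
        if (isinstance(v, dict)
                and (max_level is None or level < max_level)
                and match(newkey)):
            out.update(nested_to_record_sketch(v, newkey, sep, level + 1,
                                               max_level, use_keys))
        else:
            out[newkey] = v
    return out

data = {"CreatedBy": {"Name": "User001"},
        "Lookup": {"TextField": "Some text",
                   "UserField": {"Id": "ID001", "Name": "Name001"}},
        "Image": {"a": "b"}}
print(nested_to_record_sketch(data, use_keys=lambda key: key not in ["Image"]))
# {'CreatedBy.Name': 'User001', 'Lookup.TextField': 'Some text',
#  'Lookup.UserField.Id': 'ID001', 'Lookup.UserField.Name': 'Name001',
#  'Image': {'a': 'b'}}
```

This mirrors the io.rst example in the diff: every subtree is flattened except `"Image"`, which survives as an unflattened dict in the result.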