
ENH: Add an option to json_normalize() to protect nested object(s) against flattening #40432


Closed
swiss-knight opened this issue Mar 14, 2021 · 1 comment
Labels
Duplicate Report Duplicate issue or pull request Enhancement IO JSON read_json, to_json, json_normalize Needs Discussion Requires discussion from core team before further action Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).

Comments

@swiss-knight

Is your feature request related to a problem?

Not really. I simply wish I could use pandas to protect a nested structure / object / dict against flattening when using pd.json_normalize() on a JSON object, for example an API response.

Describe the solution you'd like

Something like this (see below for a complete example):

df = pd.DataFrame(pd.json_normalize(data, protect="foo.bar.baz"))

# or this, if only one object has that name
# (if several objects share the name, the option should apply to all of them):
df = pd.DataFrame(pd.json_normalize(data, protect="baz"))

API breaking implications

I have no detailed idea what this could/would break.

Describe alternatives you've considered

My current workaround is to build two DataFrames from the same data, one without normalizing and one with. I keep only the protected column(s) from the non-normalized one, concatenate them to the normalized one, and then remove the redundant flattened columns from the final DataFrame.
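A minimal sketch of that workaround, assuming the protected keys sit at the top level of each record. The sample `records` list and the `protected` name are illustrative only and do not come from pandas:

```python
import pandas as pd

# Illustrative records; "geometry" is the nested object to protect.
records = [
    {
        "id": "17",
        "geometry": {"type": "Polygon", "crs": 4326},
        "properties": {"label": "Coniferous"},
    },
]
protected = ["geometry"]  # hypothetical name for the keys to keep nested

raw = pd.DataFrame(records)        # nested dicts stay as Python objects
flat = pd.json_normalize(records)  # nested dicts become dotted columns

# Flattened columns that came from a protected key are redundant.
redundant = [
    col
    for col in flat.columns
    if any(col.startswith(f"{key}.") for key in protected)
]

# Keep the flattened columns that are still wanted, plus the intact nested ones.
df = pd.concat([flat.drop(columns=redundant), raw[protected]], axis=1)
```

Both DataFrames are built from the same list, so they share the same default index and `pd.concat(..., axis=1)` aligns them row by row.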

Additional context

Here's a dummy example:

import json

import pandas as pd

response = '''
    {
      "results": [
        {
          "geometry": {
            "type": "Polygon",
            "crs": 4326,
            "coordinates": 
              [[
                  [6.0, 49.0],
                  [6.0, 40.0],
                  [7.0, 40.0],
                  [7.0, 49.0],
                  [6.0, 49.0]
              ]]
          },
          "attribute": "layer.metadata",
          "bbox": [6, 40, 7, 49],
          "featureName": "Coniferous_Trees",
          "layerName": "State_Forests",
          "type": "Feature",
          "id": "17",
          "properties": {
            "resolution": "100",
            "Year": "2020",
            "label": "Coniferous"
          }
        }
      ]
    }
'''

data = json.loads(response)['results']
df = pd.DataFrame(pd.json_normalize(data))

Then:

>>> print(df.columns)
Index(['attribute', 'bbox', 'featureName', 'layerName', 'type', 'id',
       'geometry.type', 'geometry.crs', 'geometry.coordinates', # <-- the geometry has been flattened along with all the other objects
       'properties.resolution', 'properties.Year', 'properties.label'],
      dtype='object')

Desired behaviour:

df = pd.DataFrame(pd.json_normalize(data, protect="results.geometry"))

which would lead to:

>>> print(df.columns)

Index(['attribute', 'bbox', 'featureName', 'layerName', 'type', 'id',
       'geometry', # <-- the geometry element has been protected, it stays as a nested JSON structure in its own column in the DataFrame.
       'properties.resolution',  'properties.Year', 'properties.label'],
      dtype='object')
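Until such an option exists, the desired result can be approximated by removing the protected keys before normalizing and re-attaching them afterwards. The sketch below assumes the protected keys are top-level keys of each record; `normalize_protected` and its `protect` parameter are hypothetical names, not part of the pandas API:

```python
import pandas as pd


def normalize_protected(records, protect):
    """Flatten records with json_normalize, keeping `protect` keys nested.

    Hypothetical helper (not a pandas function); handles top-level keys only.
    """
    # Normalize a copy of each record with the protected keys left out.
    stripped = [
        {key: value for key, value in record.items() if key not in protect}
        for record in records
    ]
    df = pd.json_normalize(stripped)
    # Re-attach each protected object untouched, one column per key.
    for key in protect:
        df[key] = [record.get(key) for record in records]
    return df


sample = [
    {
        "id": "17",
        "geometry": {"type": "Polygon", "crs": 4326},
        "properties": {"label": "Coniferous"},
    },
]
result = normalize_protected(sample, protect=["geometry"])
```

Unlike the desired output above, this sketch appends the protected columns at the end instead of keeping them in their original position.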

Thanks for reading.

@swiss-knight swiss-knight added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 14, 2021
@jbrockmendel jbrockmendel added the IO JSON read_json, to_json, json_normalize label Mar 23, 2021
@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2021
@simonjayhawkins simonjayhawkins added the Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). label Jun 2, 2022
@simonjayhawkins
Member

Thanks @swiss-knight for the suggestion. This looks like a duplicate of #27241, so closing.

Feel free to add suggestions regarding the API to the discussion there.

@simonjayhawkins simonjayhawkins added the Duplicate Report Duplicate issue or pull request label Jun 2, 2022

4 participants