BUG: explode() raises ValueError #1223

Open
martinfleis opened this issue Nov 26, 2019 · 17 comments

@martinfleis
Member

Hi,

As reported in https://github.com/martinfleis/momepy/issues/123, in certain situations gdf.explode() raises ValueError: Shape of passed values is (132850, 183), indices imply (132842, 183). The data was retrieved from OSM using OSMnx. (Warning: the Vancouver gdf is large.)

import geopandas as gpd
import osmnx as ox

gdf = ox.footprints.footprints_from_place(place='Vancouver, Canada')
gdf_projected = ox.project_gdf(gdf)
exploded = gdf_projected.explode()

I tried to save a small set to geojson, but after loading back to geopandas it does not cause the error 🤔

@jorisvandenbossche
Member

Smaller reproducer:

Most of the entries don't get exploded; only a few are actual MultiPolygons with multiple parts. Taking the first row plus one that does get exploded (found via gdf.geometry.explode(); on the GeoSeries it works) still gives the error:

In [26]: subset = gdf.loc[[23253981, 4761998], :] 

In [27]: subset  
Out[27]: 
                                                      nodes                                           geometry     building addr:housenumber  ... check_date bridge opening_date          type
23253981  [251629948, 3607852090, 3607852091, 251629949,...  POLYGON ((-123.0727049 49.2147746, -123.073652...       school              NaN  ...        NaN    NaN          NaN           NaN
4761998                                                 NaN  (POLYGON ((-123.1615685 49.2642942, -123.16157...  residential             2475  ...        NaN    NaN          NaN  multipolygon

[2 rows x 182 columns]

In [28]: subset.explode() 
...
ValueError: Shape of passed values is (3, 183), indices imply (2, 183)

Restricting further to the first five columns:

In [30]: subset = subset[subset.columns[:5]].copy() 

What I noticed when debugging this is that the 2D object block doesn't get reshaped correctly:

In [32]: subset.explode()   
...
ValueError: Shape of passed values is (3, 6), indices imply (2, 6)

In [33]: %debug
> /home/joris/miniconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py(1718)construction_error()
   1716         raise ValueError("Empty data passed with indices specified.")
   1717     raise ValueError(
-> 1718         "Shape of passed values is {0}, indices imply {1}".format(passed, implied)
   1719     )
   1720 

ipdb> u  
> /home/joris/miniconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py(345)_verify_integrity()
    343         for block in self.blocks:
    344             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 345                 construction_error(tot_items, block.shape[1:], self.axes)
    346         if len(self.items) != tot_items:
    347             raise AssertionError(

ipdb> p self.blocks
(ObjectBlock: slice(0, 4, 1), 4 x 2, dtype: object, IntBlock: slice(4, 5, 1), 1 x 3, dtype: int64, ObjectBlock: slice(5, 6, 1), 1 x 3, dtype: object)
#                                 |-> 2 rows                                      |-> 3 rows                                        |-> 3 rows  

And the original dataframe is also all object dtype (the geometry column as well, but that's just because I am debugging on geopandas 0.5 where I had osmnx installed):

In [34]: subset.dtypes                                                                                                                                                                                             
Out[34]: 
nodes               object
geometry            object
building            object
addr:housenumber    object
addr:street         object
dtype: object

So let's see whether changing some columns to a non-object dtype helps. It doesn't:

In [38]:  subset['addr:housenumber'] = subset['addr:housenumber'].astype(float) 

In [39]: subset[['addr:housenumber', 'geometry']].explode() 
...
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)

Another specific thing about this dataset is that it has large integer index labels (not the default 0, 1, 2, ..., n):

In [42]: subset[['addr:housenumber', 'geometry']].reset_index(drop=True).explode()
Out[42]: 
     addr:housenumber                                           geometry
0 0               NaN  POLYGON ((-123.0727049 49.2147746, -123.073652...
1 0            2475.0  POLYGON ((-123.1615685 49.2642942, -123.161570...
  1            2475.0  POLYGON ((-123.1622072 49.2643049, -123.162209...

That seems to fix it! And it also does fix it on the original data:

In [45]: gdf.reset_index(drop=True).explode()
Out[45]: 
                                                      nodes       building addr:housenumber        addr:street  ... bridge opening_date          type                                           geometry
0      0  [251629948, 3607852090, 3607852091, 251629949,...         school              NaN                NaN  ...    NaN          NaN           NaN  POLYGON ((-123.0727049 49.2147746, -123.073652...
1      0  [268527777, 472917394, 268527778, 3099866715, ...        stadium              777  Pacific Boulevard  ...    NaN          NaN           NaN  POLYGON ((-123.1135167 49.2763119, -123.113285...
2      0  [1845869695, 1845869693, 268527967, 3714369280...        stadium              800      Griffiths Way  ...    NaN          NaN           NaN  POLYGON ((-123.109011 49.278442, -123.1088138 ...
3      0  [366639854, 1578563638, 1578563641, 1578563640...  train_station             1150     Station Street  ...    NaN          NaN           NaN  POLYGON ((-123.0981085 49.2741719, -123.098080...
4      0  [370490167, 5577882816, 5577882808, 5577882809...            yes             1661      Parker Street  ...    NaN          NaN           NaN  POLYGON ((-123.0709845 49.276187, -123.0710625...
...                                                     ...            ...              ...                ...  ...    ...          ...           ...                                                ...
132837 0                                                NaN     commercial              312        Main Street  ...    NaN          NaN  multipolygon  POLYGON ((-123.0994331 49.2817602, -123.099421...
132838 0                                                NaN            yes              NaN                NaN  ...    NaN          NaN  multipolygon  POLYGON ((-123.1289575 49.227361, -123.1287171...
132839 0                                                NaN            yes              NaN                NaN  ...    NaN          NaN  multipolygon  POLYGON ((-123.096785 49.2618756, -123.0967933...
132840 0                                                NaN            yes              966   West 14th Avenue  ...    NaN          NaN  multipolygon  POLYGON ((-123.1258587 49.2585495, -123.125811...
132841 0                                                NaN     apartments             3736  Commercial Street  ...    NaN          NaN  multipolygon  POLYGON ((-123.0679012 49.2515969, -123.067492...

[132850 rows x 182 columns]

So at least, that gives the original reporter a workaround.
And hopefully those pointers can also help us find the cause ;)
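
Note that reset_index(drop=True) discards the original labels (here the OSM ids). If they are needed afterwards, one variant of the workaround is to move them into a column first. The sketch below uses plain pandas only: a list column and DataFrame.explode stand in for multi-part geometries and GeoDataFrame.explode, and the names osmid and parts are made up for illustration.

```python
import pandas as pd

# Two rows with large, unsorted labels; lists stand in for multi-part geometries.
df = pd.DataFrame(
    {
        "building": ["school", "residential"],
        "parts": [["poly_a"], ["poly_b", "poly_c"]],
    },
    index=pd.Index([23253981, 4761998], name="osmid"),
)

# reset_index() without drop=True keeps the labels as an ordinary column
# and gives the frame a clean 0..n-1 index.
flat = df.reset_index()

# Explode on the clean index; the osmid value is repeated once per part.
exploded = flat.explode("parts")

# Put the original labels back as the index.
restored = exploded.set_index("osmid")
print(restored)
```

With a real GeoDataFrame the same pattern would be gdf.reset_index().explode(), assuming the index has a name or that the default column name created by reset_index is acceptable.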

@martinfleis
Member Author

It is related to some rather odd behaviour of pd.concat. The following works (simulating the behaviour of our explode):

import pandas as pd

df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george'], ['a', 'b']],
                   columns=['animal', 'name'])
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])

df4.index = [0, 1, 1]

pd.concat([df1, df4], axis=1)

	letter	number	animal	name
0	a	1	bird	polly
1	b	2	monkey	george
1	b	2	a	b

But if the order of index values is the opposite, it raises ValueError:

df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george'], ['a', 'b']],
                   columns=['animal', 'name'])
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])

df4.index = [1, 0, 0]

pd.concat([df1, df4], axis=1)

ValueError: Shape of passed values is (3, 4), indices imply (2, 4)

For some reason, the index has to be sorted (that is why it works if you do reset_index). Using subset from above:

sort = subset.sort_index()
sort.explode()

               building                                           geometry
4761998  0  residential  POLYGON ((-123.16157 49.26429, -123.16157 49.2...
         1  residential  POLYGON ((-123.16221 49.26430, -123.16221 49.2...
23253981 0       school  POLYGON ((-123.07270 49.21477, -123.07365 49.2...

Explode works as intended. I will fix that by storing the original order and restoring it at the end.
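
The idea of sorting first and putting the rows back in their original order afterwards could look roughly like this. It is a plain-pandas sketch, not the actual geopandas patch: explode_keeping_order and list_explode are hypothetical helpers, a list column stands in for multi-part geometries, and the row labels are assumed to be unique.

```python
import pandas as pd


def explode_keeping_order(df, explode):
    """Explode an index-sorted copy of df, then restore df's row order.

    `explode` is any callable returning a frame whose outer index level
    holds the original row labels (assumed unique).
    """
    original = df.index.tolist()
    result = explode(df.sort_index())
    # Position of each label in the original (unsorted) frame.
    rank = {label: pos for pos, label in enumerate(original)}
    outer = result.index.get_level_values(0).tolist()
    # sorted() is stable, so the parts of one row keep their relative order.
    positions = sorted(range(len(outer)), key=lambda i: rank[outer[i]])
    return result.iloc[positions]


def list_explode(d):
    """Stand-in for GeoDataFrame.explode: one row per list element,
    indexed by (original label, part number)."""
    out = d.explode("parts")
    part_no = out.groupby(level=0).cumcount().to_numpy()
    out.index = pd.MultiIndex.from_arrays([out.index, part_no])
    return out


df = pd.DataFrame(
    {
        "building": ["school", "residential"],
        "parts": [["poly_a"], ["poly_b", "poly_c"]],
    },
    index=[23253981, 4761998],
)
result = explode_keeping_order(df, list_explode)
print(result)
```

The reordering is done positionally with iloc so that it does not itself depend on index alignment.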

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 24, 2020

"But if the order of index values is the opposite, it raises ValueError:"

Hmm, that seems like a bug. Can you report that to pandas?

@martinfleis
Member Author

Will be fixed in pandas-dev/pandas#31113, closing here. Workaround for now is to sort_index() before exploding.
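
To check whether a given frame is in the situation identified above (row labels not in increasing order), the index can be inspected before exploding. This is only a minimal sketch: index_report is a hypothetical helper name, and a sorted index is what made the examples in this thread work, not a guarantee.

```python
import pandas as pd


def index_report(df):
    """Summarise the index properties that mattered in this thread."""
    idx = df.index
    return {
        "unique": bool(idx.is_unique),
        "sorted": bool(idx.is_monotonic_increasing),
    }


unsorted = pd.DataFrame(
    {"building": ["school", "residential"]},
    index=[23253981, 4761998],
)

print(index_report(unsorted))               # {'unique': True, 'sorted': False}
print(index_report(unsorted.sort_index()))  # {'unique': True, 'sorted': True}
```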

@seizethedata

seizethedata commented Mar 7, 2020

@martinfleis unfortunately, sort_index() didn't work for me. I still get the error ValueError: Shape of passed values is (3103, 2), indices imply (3089, 2)

@martinfleis
Member Author

@seizethedata Try reset_index() before exploding.

@seizethedata

seizethedata commented Mar 7, 2020

@martinfleis I did that too.

My code is:

blocks_saintp_cl = blocks_saintp_clean.reset_index(drop=True) 
blocks = blocks_saintp_cl.sort_index()
sp_blocks = blocks.explode().reset_index(drop=True)

@martinfleis
Member Author

martinfleis commented Mar 7, 2020

You have to reset_index before exploding, not after. If you reset the index, you don't need to sort it.

sp_blocks = blocks.reset_index(drop=True).explode()

edit: you changed the code in the meantime. The one above should work, so it looks like a different issue. Can you make a minimal reproducible example or share the data by any chance?

@seizethedata

seizethedata commented Mar 7, 2020

Still the same, unfortunately.

edit: Sorry, was inserting with code brackets wrong.

@seizethedata

@martinfleis I can share the data privately, if that's possible!

@martinfleis
Member Author

Send them to [email protected].

@seizethedata

I've sent the geojson to you

@martinfleis
Member Author

@seizethedata This bug is super strange with your data. I wasn't able to figure out what happens there, nor to find a workaround with the current version of geopandas. But I was able to patch explode to work with your data: #1319.

@martinfleis
Member Author

Update: if you want to try a properly working patch, use #1251. I am reopening this issue to keep an eye on it, as it was supposed to be fixed in pandas but that has not happened yet.

If we get close to a release, I'll merge #1251 as a temporary patch until pandas fixes it.

@martinfleis reopened this Mar 7, 2020

@seizethedata

@martinfleis thanks!

@Sieboldianus

I appear to have the same bug after using dissolve on a specific country from the naturalearth_lowres dataset shipped with geopandas. To reproduce:

import geopandas as gp

# Mollweide projection epsg code
EPSG_CODE = 54009
# note: Mollweide defined by _esri_
# in epsg.io's database
CRS_PROJ = f"esri:{EPSG_CODE}"
CRS_WGS = "epsg:4326"

world = gp.read_file(
    gp.datasets.get_path('naturalearth_lowres'),
    crs=CRS_WGS)
world = world.to_crs(CRS_PROJ)
uk = world[world['name'] == "United Kingdom"]
fr = world[world['name'] == "France"]
# remove polygon from French Guiana
# and join back together as multipolygon
fr = fr.explode().iloc[1:].dissolve(by='name')

# the following works:
uk.explode()
# but not on France:
fr.explode()
# however, explode works with the workaround from martinfleis:
exploded_geom = fr.geometry.explode().reset_index(level=-1)
exploded_index = exploded_geom.columns[0]
fr_exploded = fr.drop(fr._geometry_column_name, axis=1).join(exploded_geom)

@jack-tuna

I have the same issue. A workaround for me was to use QGIS to convert the vector layer to single parts and export it.
