Unexpected pd.concat/reindex_axis behaviour for MultiIndexed dataframes with > 10,000 rows #26573

Closed
pepicello opened this issue May 30, 2019 · 2 comments
Labels
Needs Info Clarification about behavior needed to assess issue

Comments

@pepicello
Contributor

Code Sample, a copy-pastable example if possible

Sadly, I am having trouble creating a small reproducible example of this issue, because the problem seems to disappear if I pickle the dataframes and reload them. I can only observe it in a pdb session, and only for large MultiIndexed dataframes, which makes it hard to analyze.

Given two dataframes, a and b, with identical indices and different columns, and a unique level for each row called unique_level:

import pandas as pd

# a and b are the two MultiIndexed dataframes described above (not constructed here)
a_level = a.index.get_level_values('unique_level')
b_level = b.index.get_level_values('unique_level')
small_slice_idx = a_level.tolist()[:10000]  # 10,000 labels: result is correct
big_slice_idx = a_level.tolist()[:10001]  # 10,001 labels: columns of b become NaN
right = pd.concat([a.loc[a_level.isin(small_slice_idx)], b.loc[b_level.isin(small_slice_idx)]], axis=1)
wrong = pd.concat([a.loc[a_level.isin(big_slice_idx)], b.loc[b_level.isin(big_slice_idx)]], axis=1)

Problem description

Whenever I concatenate two MultiIndexed dataframes with over 10,000 rows (10 levels in the index, 1 level in the columns), the result has the expected shape, but the values in the columns of the second dataframe are replaced by NaN.

I noticed that if I slice the dataframes as shown above (I do not use iloc because the rows are not ordered) down to 10,000 rows or fewer, this does not happen; that case is the right dataframe. The issue appears again for larger dataframes, such as wrong.

I noticed the same issue with reindex_axis, e.g.:

wrong = a.reindex_axis(b.index)
right = a.iloc[:10000].reindex_axis(b.index)

The indexes of a and b appear to be identical apart from their order, but I suspect that their underlying structure is somehow different, as in issue #20565, which causes trouble when concatenating/reindexing.
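One way to test the suspicion that the two indexes differ structurally is to compare them directly before concatenating. The following is a minimal sketch, not a diagnosis of this report: `index_report` is a hypothetical helper, the toy frames stand in for the real `a` and `b`, and it uses `reindex` because `reindex_axis` was deprecated in pandas 0.21 and removed in 1.0.

```python
import numpy as np
import pandas as pd


def index_report(a, b):
    """Summarise how the MultiIndexes of two frames relate (hypothetical helper)."""
    return {
        "same_length": len(a.index) == len(b.index),
        "a_unique": a.index.is_unique,
        "b_unique": b.index.is_unique,
        # Labels present in one index but not in the other.
        "missing_from_b": len(a.index.difference(b.index)),
        "missing_from_a": len(b.index.difference(a.index)),
        # A dtype mismatch in any level can make equal-looking labels fail to align.
        "level_dtypes_match": [
            la.dtype == lb.dtype for la, lb in zip(a.index.levels, b.index.levels)
        ],
    }


# Toy stand-ins: identical labels, different row order, different columns.
idx = pd.MultiIndex.from_arrays(
    [["x", "x", "x", "y", "y", "y"], [0, 1, 2, 3, 4, 5]],
    names=["group", "unique_level"],
)
a = pd.DataFrame({"col_a": np.arange(6.0)}, index=idx)
b = pd.DataFrame({"col_b": np.arange(6.0) * 10}, index=idx[::-1])

report = index_report(a, b)
# Reindexing b onto a's index should introduce no NaN if the labels truly match.
nan_after_reindex = int(b.reindex(a.index).isna().sum().sum())
print(report, nan_after_reindex)
```

If any of the `missing_from_*` counts is non-zero, or a level dtype differs, on the real data, that would point to the labels themselves rather than to the row count.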

It is also very odd that the issue disappeared once I pickled and unpickled the dataframes.
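Since a pickle round trip makes the problem go away, it may be worth checking whether rebuilding the index in memory does the same. This sketch shows both operations side by side; `rebuild_index` and `pickle_round_trip` are hypothetical helper names, and whether rebuilding from tuples has the same effect as pickling on the affected data is an assumption, not something confirmed in this thread.

```python
import pickle

import numpy as np
import pandas as pd


def rebuild_index(df):
    """Return a copy of df whose MultiIndex is reconstructed from its tuples."""
    out = df.copy()
    out.index = pd.MultiIndex.from_tuples(df.index.tolist(), names=df.index.names)
    return out


def pickle_round_trip(df):
    """Serialise and deserialise df in memory, without touching the disk."""
    return pickle.loads(pickle.dumps(df))


# Toy frame standing in for the real data.
idx = pd.MultiIndex.from_arrays(
    [["x", "x", "y", "y"], [0, 1, 2, 3]], names=["group", "unique_level"]
)
df = pd.DataFrame({"col_a": np.arange(4.0)}, index=idx)

rebuilt = rebuild_index(df)
reloaded = pickle_round_trip(df)
print(rebuilt.equals(df), reloaded.equals(df))
```

Running the failing concat on `rebuild_index(a)` and `rebuild_index(b)` inside the pdb session would show whether the fresh index objects behave differently from the originals.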

I could not try this with the latest version of pandas as it is the result of calculations for which I need the exact version I am using, but I am open to any suggestions to find the smallest working example that I can try on the latest version.
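A possible starting point for a small example is to generate synthetic frames with the same shape as described (10 index levels, one of them unique per row, rows in different order) and compare the result at 10,000 and 10,001 rows. This is only a sketch: `make_frames` and `nan_count_after_concat` are hypothetical names, the level contents are invented, and nothing here is known to trigger the behaviour reported above.

```python
import numpy as np
import pandas as pd


def make_frames(n_rows, n_levels=10, seed=0):
    """Build two frames with the same MultiIndex labels in different row order."""
    rng = np.random.default_rng(seed)
    # Nine low-cardinality levels plus one level that is unique per row.
    arrays = [np.arange(n_rows) % (k + 2) for k in range(n_levels - 1)]
    arrays.append(np.arange(n_rows))
    names = ["level_%d" % k for k in range(n_levels - 1)] + ["unique_level"]
    idx = pd.MultiIndex.from_arrays(arrays, names=names)
    a = pd.DataFrame({"col_a": rng.random(n_rows)}, index=idx)
    b = pd.DataFrame({"col_b": rng.random(n_rows)}, index=idx)
    # Shuffle b so the indexes match as sets but not in order.
    b = b.iloc[rng.permutation(n_rows)]
    return a, b


def nan_count_after_concat(n_rows):
    """Concatenate column-wise and return (shape, number of NaN cells)."""
    a, b = make_frames(n_rows)
    merged = pd.concat([a, b], axis=1)
    return merged.shape, int(merged.isna().sum().sum())


for n in (10000, 10001):
    print(n, nan_count_after_concat(n))
```

If the synthetic frames behave correctly on the pinned pandas version, the next step would be to make them progressively more like the real data (level dtypes, duplicated values, how the frames were produced) until the NaNs appear.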

Thanks!

Expected Output

Same as the output for a dataframe with fewer than 10,001 rows.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 19.1
setuptools: 40.6.3
Cython: 0.28.5
numpy: 1.14.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.1
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.6
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.6.0

@WillAyd
Member

WillAyd commented May 30, 2019

Do you have any way of reproducing the error? Without that, it's almost impossible to give guidance.

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label May 30, 2019
@TomAugspurger
Contributor

@pablojim let us know if you can provide a reproducible example and we'll reopen. http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

3 participants