BUG: concat in pandas 1.4.2 unexpectedly drops rows or duplicates rows if multiIndex has pd.NA #48852
Comments
Hi, thanks for your report. Did you try this on 1.5.0? Could you please trim your example to only use the operations that are absolutely necessary to reproduce the bug? Also please reduce the number of rows in your DataFrame to the minimal amount possible.
Hi phofl, thanks for your comment. Regarding the second issue, I'm still trying to generate two input DataFrames (results of groupby) that reproduce the duplication of rows once concatenated. In my code I currently have four DataFrames, each the result of a groupby, which are then concatenated. I can extract a subpart of this data and reproduce the duplicated rows, but so far I haven't succeeded in generating from scratch two DataFrames which, once concatenated, duplicate rows; I think a groupby has to be involved in the process. I will update the second issue described above a bit later. For the time being, the first issue should be easily reproducible. Regarding your question, I went through the release notes for versions 1.4.3, 1.4.4, and 1.5.0, and I did not find anything related to my issue. I could test, but I doubt it's fixed in 1.5.0. Will do, and update the post accordingly.
I think you forgot to update the expected outputs? Yes, please try on 1.5.0.
I can confirm that the first issue reported above also occurs with pandas v1.5.0, so very likely the second one does too. Will illustrate and demonstrate the second one asap.
Thanks for this. This comes down to the union call under the hood, which again comes down to #37222
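To make the maintainer's point concrete: `concat(axis=1)` aligns rows by taking the union of the input indexes, and that union is where missing values in a MultiIndex have historically been mishandled. A minimal sketch of that underlying call, using hypothetical MultiIndexes (not the reporter's data):

```python
import pandas as pd

# concat(axis=1) aligns rows via an index union under the hood.
# Hypothetical MultiIndexes, one of which contains a missing key:
left = pd.MultiIndex.from_tuples([(1.0, 2.0), (pd.NA, 3.0)])
right = pd.MultiIndex.from_tuples([(1.0, 2.0), (4.0, 2.0)])

# How NA entries are treated here is the crux of GH#37222; the
# non-NA tuples are merged as expected.
union = left.union(right)
```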
Thanks for your handling, phofl. This issue was noticed for the first time back in 2020 by yourself, if I'm not mistaken? Don't you think there should be at least a warning in the doc, somewhere on https://pandas.pydata.org/docs/reference/api/pandas.concat.html , until it's fixed? If dataframes to concatenate have pd.NA in their indexes, then these pd.NA should be replaced prior to using concat. I would have appreciated a warning in the doc about that. |
We have lots of open issues. Things get fixed when someone opens a PR. In general, we don't warn about known bugs.
Thanks for your answers. |
We have 3400 open issues. Documenting every bug would cost far too many resources and would complicate the documentation enormously. Also, we would have to keep everything in sync. This is just not doable.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
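The reporter's original code block did not survive here. As a purely hypothetical sketch of the kind of setup described below (two groupby results whose MultiIndex contains missing keys, concatenated along `axis=1`), not the actual data:

```python
import pandas as pd

# Hypothetical stand-ins for test_K / test_L: each is the result of a
# groupby over two key columns that contain missing values.
raw_1 = pd.DataFrame({
    "a": [4.0, pd.NA, 4.0],
    "b": [2.0, 3.0, 2.0],
    "x": [7, 14, 7],
})
test_K = raw_1.groupby(["a", "b"], dropna=False).sum()

raw_2 = pd.DataFrame({
    "a": [4.0, pd.NA],
    "b": [2.0, 3.0],
    "y": [1, 2],
})
test_L = raw_2.groupby(["a", "b"], dropna=False).sum()

# Concatenating along axis=1 is where the misalignment was observed:
df = pd.concat([test_K, test_L], axis=1)
```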
Issue Description
Regarding the first issue,
test_K is:
test_L is:
The output df is:
Question 1: Why is the row with index (pd.NA, 3.0) dropped? And why does "14" appear on the first row?!
Regarding the second issue:
given that uuu_1 is:
and uuu_2 is:
then,
pd.concat([uuu_1, uuu_2], axis=1).sort_index()
gives:
Question 2: Why are the last two rows duplicated?
A NaN is different from another NaN. Yet when concat is involved, pandas is nevertheless able to put the values of a (pd.NA, pd.NA) row from input A and input B on the same row. Hence, here, we get "(pd.NA, pd.NA) 12" and "(pd.NA, pd.NA) 15" on the same row in the final output. I would have expected another behaviour, but so far that's fine.
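The NaN-equality point above can be illustrated directly: scalar comparisons never treat a missing value as equal to itself, but index-level comparison in pandas does treat NaN in matching positions as equal, which is why concat can line up two (pd.NA, pd.NA) rows at all.

```python
import numpy as np
import pandas as pd

# Scalar comparisons: NaN is not equal to itself, and pd.NA propagates.
print(np.nan == np.nan)          # False
print((pd.NA == pd.NA) is pd.NA)  # True: comparison returns pd.NA

# Index-level comparison: NaN in matching positions counts as equal.
left = pd.Index([1.0, np.nan])
right = pd.Index([1.0, np.nan])
print(left.equals(right))        # True
```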
What I do not get is why I end up with duplicated rows in the output.
If I replace the NaN by strings, the problem is gone. But why do I get duplicated rows if I keep the pd.NA? I would like to understand this.
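The string-replacement workaround mentioned above can be sketched as follows. `fill_index_na` is a hypothetical helper name (not a pandas API), and the inputs are illustrative stand-ins shaped like uuu_1 / uuu_2:

```python
import numpy as np
import pandas as pd

def fill_index_na(df, fill_value="<NA>"):
    # Hypothetical helper: move the index into columns, replace
    # missing keys with a sentinel string, and restore the
    # (Multi)Index so concat aligns on exact keys.
    names = list(df.index.names)
    out = df.reset_index()
    out[names] = out[names].astype(object).fillna(fill_value)
    return out.set_index(names)

# Illustrative inputs with a fully missing index entry.
idx = pd.MultiIndex.from_tuples([(1.0, 2.0), (np.nan, np.nan)],
                                names=["a", "b"])
uuu_1 = pd.DataFrame({"v1": [10, 12]}, index=idx)
uuu_2 = pd.DataFrame({"v2": [13, 15]}, index=idx)

# With the sentinel in place, no rows are dropped or duplicated.
result = pd.concat([fill_index_na(uuu_1), fill_index_na(uuu_2)], axis=1)
```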
Expected Behavior
Regarding the first issue, shouldn't we get an output like this:
I don't get why the (pd.NA, 3.0) row is dropped.
Since test_K and test_L are results of groupby, getting a final output in which some rows show NaN values can be critical: it does not convey the same message as an output from which rows full of pd.NA or NaN values have been dropped (which seems to be the case here).
And I would expect to see "NaN" rather than "14" on the first row (index (4.0, 2.0)) as well.
Regarding the second issue, shouldn't we end with either:
or
Installed Versions
pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 28.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None