Skip to content

BUG: concat in pandas 1.4.2 unexpectedly drops rows or duplicates rows if multiIndex has pd.NA #48852

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
Stephano120 opened this issue Sep 29, 2022 · 9 comments
Closed
3 tasks done
Labels
Duplicate Report Duplicate issue or pull request Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@Stephano120
Copy link

Stephano120 commented Sep 29, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

First issue: unexpected dropped rows (concat issue), and values on the wrong row (??)

test_K  = pd.DataFrame([13, pd.NA], columns=['predK'], index=pd.MultiIndex.from_tuples([(4.0, 2.0),(pd.NA, pd.NA)], names=["first", "second"]))
test_L  = pd.DataFrame([14, 13], columns=['predL'], index=pd.MultiIndex.from_tuples([(pd.NA, pd.NA), (pd.NA, 3.0)], names=["first", "second"]))

pd.concat([test_K, test_L], axis=1).sort_index()
_____________________________________________________________________________________________________________________

Second issue: duplicated rows when multiIndex is like (pd.NA, pd.NA)

uuu_1
one  two
4.0  3.0    22
     4.0    44
NaN  1.0    14
     2.0     2
     4.0     6
     NaN    15
Name: providerA, dtype: int64


uuu_2
one  two
3.0  3.0     30
     4.0     53
4.0  2.0      8
     3.0     20
     4.0     99
NaN  NaN     12
Name: providerB, dtype: int64


pd.concat([uuu_1, uuu_2], axis=1).sort_index()
         providerA  providerB
one two                      
3.0 3.0        NaN       30.0
    4.0        NaN       53.0
4.0 2.0        NaN        8.0
    3.0       22.0       20.0
    4.0       44.0       99.0
NaN 1.0       14.0        NaN
    2.0        2.0        NaN
    4.0        6.0        NaN
    NaN       15.0       12.0
    NaN       15.0       12.0

Issue Description

Regarding the first issue,

test_K is:

             predK
first second      
4.0   2.0       13
NaN   NaN     <NA>

test_L is:

              predL
first second       
NaN   NaN        14
      3.0        13

The output df is:

             predK  predL
first second             
4.0   2.0       13     **14**
NaN   NaN     <NA>     14

Question 1: Why is the row with index (pd.NA, 3.0) dropped ? And why does "14" appear on the first row?!


Regarding the second issue:
given that uuu_1 is:

one  two
4.0  3.0    22
     4.0    44
NaN  1.0    14
     2.0     2
     4.0     6
     NaN    15
Name: providerA, dtype: int64

and uuu_2 is:

one  two
3.0  3.0     30
     4.0     53
4.0  2.0      8
     3.0     20
     4.0     99
NaN  NaN     12
Name: providerB, dtype: int64

then ,
pd.concat([uuu_1, uuu_2], axis=1).sort_index()

gives:

         providerA  providerB
one two                      
3.0 3.0        NaN       30.0
    4.0        NaN       53.0
4.0 2.0        NaN        8.0
    3.0       22.0       20.0
    4.0       44.0       99.0
NaN 1.0       14.0        NaN
    2.0        2.0        NaN
    4.0        6.0        NaN
    **NaN       15.0       12.0**
    **NaN       15.0       12.0**

Question 2: Why are are the two last rows duplicated?

A NaN is different from another NaN. When concat is involved, it seems that pandas is nevertheless able to put the values on a (pd.Na, pd.NA) row from input A and input B on the same row. Hence, here, we get "(pd.NA, pd.NA) 12" and "(pd.NA, pd.NA) 15" on the same row in the final output. I would have expected another behaviour but so far it's fine.

What I do not get, is why I end with duplicated rows in the output.

If I replace the NaN by strings, the problem is gone. But why do I get duplicated rows if I keep the pd.NA ? I would like to understand this.

Expected Behavior

Regarding the first issue, shouldn't we get an output like this:

             predK  predL
first second             
4.0   2.0       13     **NaN**
NaN   NaN     <NA>     14
NaN   3.0        <NA>     13

I dont get why the (pd.NA, 3.0) is dropped.
Since test_K and test_L are results of groupby, getting a final output in which there might be rows showing NaN values can be critical. It does not convey the same message as an output from which rows full of pd.NA or NaN values are dropped (which seems the case here)
And I would expect to see "NaN" rather than "14" on the first row (index(4.0. 2.0)) neither.


Regarding the second issue, shouldn't we end with either:

         providerA  providerB
one two                      
3.0 3.0        NaN       30.0
    4.0        NaN       53.0
4.0 2.0        NaN        8.0
    3.0       22.0       20.0
    4.0       44.0       99.0
NaN 1.0       14.0        NaN
    2.0        2.0        NaN
    4.0        6.0        NaN
    **NaN       15.0       12.0**

or

         providerA  providerB
one two                      
3.0 3.0        NaN       30.0
    4.0        NaN       53.0
4.0 2.0        NaN        8.0
    3.0       22.0       20.0
    4.0       44.0       99.0
NaN 1.0       14.0        NaN
    2.0        2.0        NaN
    4.0        6.0        NaN
    **NaN       15.0       Nan**
    **NaN       NaN       12.0**

Installed Versions

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 28.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@Stephano120 Stephano120 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 29, 2022
@phofl phofl added Needs Info Clarification about behavior needed to assess issue and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 29, 2022
@phofl
Copy link
Member

phofl commented Sep 29, 2022

Hi, thanks for your report.

Did you try this on 1.5.0?

Could you please trim your example to only use the operations that are absolutely necessary to reproduce the bug? Also please reduce the number of rows in your DataFrame to the minimal amount possible.

@Stephano120
Copy link
Author

Hi phofl, thanks for your comment.
I trimmed the first example provided for the first issue. (test_K, test_L), and edited my ticket.

Regarding the second issue, I'm still trying to generate 2 "input dataframes" (results of groupby) that can reproduce the duplication of rows once concatenated.

In my code I currently have 4 dataframes, each one being the results of a groupby, which are then concatenated. I can extract subpart of this dataframe, and reproduce the issue of duplicated rows. But so far I didnt succeed in generating at least 2 dataframes from scratch, which, once concatenated, duplicates rows. I think I have to involve a groupby in the process. Will update the second issue described above, a bit later. For the time being. first issue should be easily reproductible.

Regarding your question, I went through the doc describing the updates of version 1.4.3, 1.4.4, and 1.5.0, and I did not find anything related to my issue. I could test, but I doubt it's fixed in v.1.5.0. Will do. And update the post accordingly.

@phofl
Copy link
Member

phofl commented Sep 29, 2022

I think you forgot to update the expected outputs?

Yes please try on 1.5.0

@Stephano120
Copy link
Author

I can confirm that the first issue reported above also occurs with pandas v.1.5.0
You can reproduce it extremely quickly with just this:

test_K  = pd.DataFrame([13, pd.NA], columns=['predK'], index=pd.MultiIndex.from_tuples([(4.0, 2.0),(pd.NA, pd.NA)], names=["first", "second"]))
test_L  = pd.DataFrame([14, 13], columns=['predL'], index=pd.MultiIndex.from_tuples([(pd.NA, pd.NA), (pd.NA, 3.0)], names=["first", "second"]))
print(test_K)
print(test_L)
print(pd.concat([test_K, test_L], axis=1).sort_index())

So, very likely the second one too. Will illustrate and demonstrate the second one asap.

@phofl
Copy link
Member

phofl commented Sep 29, 2022

Thanks for this.

This comes down to the union call under the hood which again comes down to #37222

@phofl phofl closed this as completed Sep 29, 2022
@phofl phofl added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode MultiIndex Duplicate Report Duplicate issue or pull request and removed Needs Info Clarification about behavior needed to assess issue labels Sep 29, 2022
@Stephano120
Copy link
Author

Thanks for your handling, phofl.

This issue was noticed for the first time back in 2020 by yourself, if I'm not mistaken?

Don't you think there should be at least a warning in the doc, somewhere on https://pandas.pydata.org/docs/reference/api/pandas.concat.html , until it's fixed?

If dataframes to concatenate have pd.NA in their indexes, then these pd.NA should be replaced prior to using concat.

I would have appreciated a warning in the doc about that.

@phofl
Copy link
Member

phofl commented Sep 29, 2022

We have lots of open issues. Things get fixed when someone opens a pr. In General, we don’t warm about know bugs

@Stephano120
Copy link
Author

Thanks for your answers.
But why sharing knowledge on what works, and not on what doesn't ?

@phofl
Copy link
Member

phofl commented Sep 29, 2022

We have 3400 open issues. It would cost way too much resources to document every bug and would complicate the documentation incredibly. Also, we would have to keep everything in sync. This is just not doable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

2 participants