isin fails with large series/lists of tuples #17910
Comments
There has been a related issue and fix (#16012, #16969), so this may already be fixed in master.
I did a fresh install of Python 3.6, upgraded to 0.21.0-rc1, and can confirm the issue is no longer there; it has been solved. For reference:
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-72-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0rc1
I am closing the issue as it is already solved; feel free to reopen if needed! Thank you!
Thanks for testing!
Code Sample
Problem description
There are several ways to trigger the bug, each of them resulting in `isin` returning all `False` where some indexes should be `True`.
Take the example above: the tuple `(-1,-2,-3)` is repeated twice, and it can be checked that `counts` and `counts_idxs` are `2` and `(-1,-2,-3)`, respectively. So, independently of the rest of the `products`, the dataset obtained by taking the `idxs` from `isin` should have at least 2 items. Calling the function as is, it does not. Explanation, causes and possible solutions are given below.
Manually importing `from pandas.core.algorithms import isin` and setting `idxs = isin(df['products'], counts[counts >= 2].index)` results in exactly the same behaviour.
I've tried to reproduce this behaviour without using tuples at all and have not succeeded.
Proposed solution
This seems to be a regression in `0.20.x`, as the latest `0.19.x` (0.19.2) works perfectly fine. Indeed, manually copying `isin` from `0.19.x` and using it in place of the `0.20.x` version works. One can see that a particular `if` was reversed/erased by comparing
https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L414
with
https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/algorithms.py#L144
https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/algorithms.py#L161
As a result, `0.20.x` relies on `numpy.in1d` whereas `0.19.x` used `lib.ismember`, which is equivalent to `htable.ismember_object` in `0.20.x`. One can confirm this because the `htable.ismember_object` path works fine, whereas the `numpy.in1d` path silently fails.
Now, either this is temporarily fixed in pandas by not relying on `in1d`, or an issue is submitted to numpy (which I will do once I can take a look at `in1d` and see what's happening). Alternatively, one can work around it by not using tuples at all, for example by applying `hash` beforehand.
I've narrowed the problem down further, and it depends not only on `n` but also on `prodmax`.
Any combination with `n > 1000000 && prodmax > 1986` produces an empty dataframe, whereas having `n <= 1000000` or `prodmax <= 1986` works just fine. Parameter values have been deduced as follows: `n` from https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L414 and `prodmax` by binary search.
Output of pd.show_versions()
pandas: 0.20.3
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None
This has been confirmed and tested on multiple PCs and environments, always with Python 3.x.
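The binary search used to locate the `prodmax` threshold is not shown in the report, so here is a rough sketch of how such a search could look. The helper `find_threshold` and the predicate `isin_fails` are hypothetical names. In a real run the predicate would rebuild the dataframe with the given `prodmax` and check whether the `isin` selection comes back empty; here it is replaced by a stand-in that encodes the reported boundary.

```python
def find_threshold(fails, lo, hi):
    """Return the smallest value in (lo, hi] for which fails(value) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that fails switches
    from False to True exactly once in between.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # failure already present at mid: look lower
        else:
            lo = mid  # still working at mid: look higher
    return hi


def isin_fails(prodmax):
    # Hypothetical stand-in for "build the data with this prodmax and check
    # whether isin returns an empty selection". It simply encodes the
    # boundary reported in the issue (works up to 1986).
    return prodmax > 1986


threshold = find_threshold(isin_fails, lo=1, hi=10_000)
print(threshold)      # smallest failing value
print(threshold - 1)  # largest value that still works
```

Each probe of the real predicate would need a dataset of more than a million rows, so halving the search interval keeps the number of expensive runs small.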