At least for large datasets, `DataFrame.duplicated` returns incorrect results.
Consider the MovieLens 10M data (this code will automatically download the data from the GroupLens website):
```python
import pandas as pd
from requests import get
from StringIO import StringIO
from zipfile import ZipFile  # this import was missing in the original snippet

zip_file_url = 'http://files.grouplens.org/datasets/movielens/ml-10m.zip'
zip_response = get(zip_file_url)
zip_contents = StringIO(zip_response.content)

with ZipFile(zip_contents) as zfile:
    zdata = zfile.read('ml-10M100K/ratings.dat')

delimiter = ';'
zdata = zdata.replace('::', delimiter)  # makes the data compatible with the pandas C engine

mldata = pd.read_csv(StringIO(zdata), sep=delimiter, header=None, engine='c',
                     names=['userid', 'movieid', 'rating', 'timestamp'],
                     usecols=['userid', 'movieid', 'rating'])
```
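(Not part of the original report: on Python 3, where the `StringIO` module no longer exists, roughly the same steps might look like the sketch below; `io.BytesIO`/`io.StringIO` and the explicit `decode` call are the assumptions.)

```python
# Sketch of the same download/load step for Python 3 (not from the original
# report, which targets Python 2.7). io.BytesIO/io.StringIO replace the
# Python 2 StringIO module, and the zip payload must be decoded to text.
import io
from zipfile import ZipFile

import pandas as pd
import requests

zip_file_url = 'http://files.grouplens.org/datasets/movielens/ml-10m.zip'
zip_bytes = io.BytesIO(requests.get(zip_file_url).content)

with ZipFile(zip_bytes) as zfile:
    zdata = zfile.read('ml-10M100K/ratings.dat').decode('utf-8')

delimiter = ';'
zdata = zdata.replace('::', delimiter)  # the '::' separator is not supported by the C engine

mldata = pd.read_csv(io.StringIO(zdata), sep=delimiter, header=None, engine='c',
                     names=['userid', 'movieid', 'rating', 'timestamp'],
                     usecols=['userid', 'movieid', 'rating'])
```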
The data (the `mldata` variable) contains no duplicates, which can be verified:

```python
>>> (mldata.groupby(['userid', 'movieid']).size() > 1).any()
False
>>> mldata.set_index(['userid', 'movieid']).index.is_unique
True
```
However, `DataFrame.duplicated` gives:

```python
>>> dups = mldata.duplicated(['userid', 'movieid'], keep=False)
>>> print dups.any()
True
>>> print dups.sum()
12127
```
Expected:

```
False
0
```
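As a cross-check (an editorial sketch, not part of the original report), the rows flagged by `duplicated` can be compared against per-group counts; since no `(userid, movieid)` group has more than one row, every flagged row is a false positive:

```python
# Cross-check sketch (not from the original report): compare the flagged rows
# against the per-(userid, movieid) group sizes computed via groupby.
group_sizes = mldata.groupby(['userid', 'movieid'])['rating'].transform(len)
print (dups & (group_sizes > 1)).sum()  # 0: none of the flagged rows has a real duplicate
```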
Output of `pd.show_versions()`:

```
INSTALLED VERSIONS
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: 0.23.3
numpy: 1.10.0
scipy: 0.16.0
statsmodels: None
IPython: 3.2.1
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
```
thanks for the report, a dupe of #11376
this was already fixed in #11403
and will be in the forthcoming 0.17.1 (it's in master now)
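(Editorial note, not a maintainer-endorsed workaround: for anyone stuck on 0.17.0 until the release, a version guard with a groupby-based fallback, in the spirit of the verification code above, could stand in for `duplicated(keep=False)`.)

```python
# Sketch (assumption: not suggested in this thread): only rely on duplicated()
# here on pandas >= 0.17.1, where the fix from #11403 is included; otherwise
# fall back to flagging rows whose (userid, movieid) pair occurs more than once.
from distutils.version import LooseVersion
import pandas as pd

if LooseVersion(pd.__version__) >= LooseVersion('0.17.1'):
    dups = mldata.duplicated(['userid', 'movieid'], keep=False)
else:
    dups = mldata.groupby(['userid', 'movieid'])['rating'].transform(len) > 1
```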