PERF: pd.util.hash_pandas_object slower on string[pyarrow] than object dtypes #48964

Open
2 of 3 tasks
jrbourbeau opened this issue Oct 5, 2022 · 9 comments
Labels: Arrow (pyarrow functionality), hashing (hash_pandas_object), Performance (Memory or execution speed performance), Strings (String extension data type and string data)

Comments

@jrbourbeau
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

When investigating Dask's shuffle performance on string[pyarrow] data, I observed that pd.util.hash_pandas_object was slightly slower on string[pyarrow] data than on regular object-dtype data (Python objects). This surprised me, as I would have expected hashing pyarrow-backed data to be faster than hashing Python objects.

In [1]: import pandas as pd

In [2]: s = pd.Series(range(2_000))

In [3]: s_object = s.astype(object)

In [4]: s_pyarrow = s.astype("string[pyarrow]")

In [5]: %timeit pd.util.hash_pandas_object(s_object)
859 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit pd.util.hash_pandas_object(s_pyarrow)
1.01 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
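The same comparison can be run outside IPython. The following is a minimal sketch using the standard-library `timeit` module; the helper name `seconds_per_call` and the loop count are illustrative choices, not part of the original report, and the pyarrow case is skipped if pyarrow is not installed.

```python
import timeit

import pandas as pd


def seconds_per_call(series, number=200):
    """Average wall-clock seconds for one hash_pandas_object call."""
    total = timeit.timeit(lambda: pd.util.hash_pandas_object(series), number=number)
    return total / number


s = pd.Series(range(2_000))
candidates = {"object": s.astype(object)}
try:
    candidates["string[pyarrow]"] = s.astype("string[pyarrow]")
except ImportError:
    # pyarrow is not installed; only the object-dtype timing is reported
    pass

timings = {name: seconds_per_call(ser) for name, ser in candidates.items()}
for name, secs in timings.items():
    print(f"{name:>16}: {secs * 1e6:8.1f} µs per call")
```

Absolute numbers will differ by machine and pandas version; the ratio between the two rows is the quantity of interest.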

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python           : 3.10.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.0
numpy            : 1.23.3
pytz             : 2022.4
dateutil         : 2.8.2
setuptools       : 65.4.1
pip              : 22.2.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 8.5.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 9.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Prior Performance

No response

@jrbourbeau added the Needs Triage (Issue that has not been reviewed by a pandas team member) and Performance (Memory or execution speed performance) labels on Oct 5, 2022
@rhshadrach
Member

I'm seeing the same behavior on 1.4.x

503 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
594 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Faster on my machine, but same ratio (~1.18).

@rhshadrach added the hashing (hash_pandas_object) and Strings (String extension data type and string data) labels on Oct 5, 2022
@mroeschke
Member

mroeschke commented Oct 5, 2022

Just noting that the string[python] dtype performs comparably to string[pyarrow], though pyarrow is still slightly slower. Hashing is still done in terms of NumPy arrays, so I imagine that is adding overhead.

In [6]: s_str = s.astype("string[python]")

In [7]: %timeit pd.util.hash_pandas_object(s_object)
856 µs ± 3.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit pd.util.hash_pandas_object(s_str)
952 µs ± 3.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit pd.util.hash_pandas_object(s_pyarrow)
977 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
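One rough way to probe the overhead hypothesis is to time the conversion to a NumPy object array separately from the hashing of that array. This is a sketch only: the two-step split approximates, and is not taken from, what pandas does internally, and the helper `string_series` with its fallback to python storage is an assumption added so the example runs without pyarrow.

```python
import timeit

import pandas as pd


def string_series(n=2_000):
    """Build a string Series, preferring pyarrow storage when it is installed."""
    base = pd.Series(range(n))
    try:
        return base.astype("string[pyarrow]")
    except ImportError:
        # Assumption: fall back to python storage so the sketch still runs
        return base.astype("string[python]")


s_string = string_series()

# Step 1: materialize the extension array as a NumPy object array
as_objects = s_string.to_numpy(dtype=object)

# Step 2: hash the already-converted NumPy array
hashed = pd.util.hash_array(as_objects)

number = 200
convert_time = timeit.timeit(lambda: s_string.to_numpy(dtype=object), number=number) / number
hash_time = timeit.timeit(lambda: pd.util.hash_array(as_objects), number=number) / number

print(f"storage: {s_string.dtype.storage}")
print(f"convert to NumPy objects: {convert_time * 1e6:8.1f} µs per call")
print(f"hash NumPy object array:  {hash_time * 1e6:8.1f} µs per call")
```

If the conversion row accounts for a large share of the total, that would be consistent with the overhead explanation, although it does not prove that pandas takes exactly this path.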

EDIT: One more interesting observation: categorize=True, the default here, goes through a categorical routine that is supposed to provide a speedup when there are duplicates, of which there are none in this example.

In [8]: %timeit pd.util.hash_pandas_object(s_object, categorize=True)
843 µs ± 3.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit pd.util.hash_pandas_object(s_object, categorize=False)
986 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: %timeit pd.util.hash_pandas_object(s_pyarrow, categorize=True)
986 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: %timeit pd.util.hash_pandas_object(s_pyarrow, categorize=False)
515 µs ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
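To see how `categorize` interacts with duplicates, the timings can be repeated on data that does contain repeated values. This is a minimal sketch on object-dtype Python strings; the series names, the `key-` values, and the choice of 20 distinct keys are all illustrative assumptions rather than data from this thread.

```python
import timeit

import pandas as pd

# 2,000 distinct strings versus 2,000 strings drawn from only 20 distinct values
unique = pd.Series([f"key-{i}" for i in range(2_000)], dtype=object)
repeated = pd.Series([f"key-{i % 20}" for i in range(2_000)], dtype=object)


def seconds_per_call(series, categorize, number=200):
    """Average seconds for one hash_pandas_object call with the given flag."""
    total = timeit.timeit(
        lambda: pd.util.hash_pandas_object(series, categorize=categorize),
        number=number,
    )
    return total / number


results = {}
for label, ser in [("unique", unique), ("repeated", repeated)]:
    for categorize in (True, False):
        results[(label, categorize)] = seconds_per_call(ser, categorize)
        print(f"{label:>8} categorize={categorize!s:<5}: "
              f"{results[(label, categorize)] * 1e6:8.1f} µs per call")

# The flag is documented as an efficiency option, so the hashes are compared too
hashes_true = pd.util.hash_pandas_object(repeated, categorize=True)
hashes_false = pd.util.hash_pandas_object(repeated, categorize=False)
print("hashes match:", hashes_true.equals(hashes_false))
```

Whether `categorize=True` wins on the repeated data depends on the pandas version and the dtype being hashed, so the output is best read as a comparison on one machine rather than a general result.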

@rhshadrach removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Oct 6, 2022
@mrocklin
Contributor

mrocklin commented Oct 6, 2022

@jorisvandenbossche (or anyone else in Arrow), do you know if Arrow has nice elementwise hashing operations? I did a quick search of the API docs and couldn't find anything.

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 7, 2022

If I understand correctly, what would be needed here is https://issues.apache.org/jira/browse/ARROW-8991? (and there is an open PR for it: apache/arrow#13487)

@drin

drin commented Oct 14, 2022

Just coming back around to working on that PR. It is almost ready; I'm just trying to add some extra test coverage. I am not sure it will make it into 10.0.0, but if I can finish today it may have a chance.

@jrbourbeau
Contributor Author

Thanks @drin!

@jbrockmendel
Member

We now have EA._hash_pandas_object, so we can do an Arrow-specific implementation in ArrowExtensionArray.

@drin

drin commented Mar 6, 2023

I hit a snag on ARROW-8991 and have been kicking that can down the road for a bit too long. I'm hoping to get back to it soon.

For clarification, @jbrockmendel, if Arrow implements a way of doing hashing, would EA._hash_pandas_object delegate to that functionality for any pandas element (in a Series, DataFrame, etc.) that is an Arrow object (an element in an Arrow Array, for example)?

@jbrockmendel
Member

If arrow/pyarrow implement something, we'd update ArrowExtensionArray to use that, which should handle any case where you have a Series/Index or DataFrame column backed by a pyarrow array.


7 participants