PERF: pd.util.hash_pandas_object slower on string[pyarrow] than object dtypes
#48964
Comments
I'm seeing the same behavior on 1.4.x
Faster on my machine, but same ratio (~1.18).
Just noting that
EDIT: One more interesting observation:
@jorisvandenbossche (or anyone else in arrow) do you know if arrow has nice elementwise hashing operations? I did a quick search of the API docs and couldn't find anything.
If I understand correctly, what would be needed here is https://issues.apache.org/jira/browse/ARROW-8991? (and there is an open PR for it: apache/arrow#13487)
Just coming back around to working on that PR. It is almost ready, I'm just trying to add some extra test coverage. I am not sure it would make it into 10.0.0, but maybe if I can finish today it has a chance.
Thanks @drin!
We now have EA._hash_pandas_object, so we can do an arrow-specific implementation in ArrowExtensionArray.
I had hit a snag for ARROW-8991 and have been kicking that can down the road for a bit too long. I'm hoping to get back to it again soon. For clarification, @jbrockmendel: if Arrow implements a way of doing hashing, would EA._hash_pandas_object delegate to that functionality for any pandas element (in a Series, DataFrame, etc.) that is an Arrow object (an element in an Arrow Array, for example)?
If arrow/pyarrow implement something, we'd update ArrowExtensionArray to use that, which should handle any case where you have a Series/Index or DataFrame column backed by a pyarrow array.
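For reference, the three cases mentioned (Series, Index, DataFrame column) can all be reached through the public pd.util.hash_pandas_object entry point. The sketch below is only an illustration: the helper name hash_containers and the sample values are assumptions, and it says nothing about which internal code path pandas takes for a given version.

```python
import pandas as pd


def hash_containers(dtype="string[pyarrow]"):
    """Hash a Series, an Index and a DataFrame built on the given dtype.

    `hash_containers` is a hypothetical helper for illustration only.
    The default dtype requires pyarrow to be installed.
    """
    ser = pd.Series(["a", "b", "a"], dtype=dtype)
    idx = pd.Index(["a", "b", "c"], dtype=dtype)
    df = pd.DataFrame({"key": ser, "n": range(len(ser))})
    return {
        "series": pd.util.hash_pandas_object(ser),
        "index": pd.util.hash_pandas_object(idx),
        "frame": pd.util.hash_pandas_object(df),
    }


if __name__ == "__main__":
    try:
        results = hash_containers()
    except ImportError:
        print("pyarrow is not installed; falling back to object dtype")
        results = hash_containers(dtype=object)
    for name, hashed in results.items():
        # Each result is a Series holding one uint64 hash per row/element.
        print(name, hashed.dtype, len(hashed))
```

Each call returns one uint64 hash per row, so any arrow-specific speedup would be transparent to callers such as Dask's shuffle.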
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
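The snippet originally posted here is not preserved in this copy. As a stand-in, here is a minimal sketch of the kind of comparison the report describes. It is not the reporter's code: the helper names make_strings and time_hash, the data shape, and the row count are all assumptions chosen for illustration, and timings will vary with the machine and the pandas/pyarrow versions.

```python
import timeit

import numpy as np
import pandas as pd


def make_strings(n, seed=0):
    """Build an object-dtype Series of n short strings (hypothetical test data)."""
    rng = np.random.default_rng(seed)
    values = rng.integers(0, 1_000_000, size=n).astype(str)
    return pd.Series(values, dtype=object)


def time_hash(series, repeat=5):
    """Return the best of `repeat` wall-clock timings (seconds) for one hash pass."""
    timer = timeit.Timer(lambda: pd.util.hash_pandas_object(series))
    return min(timer.repeat(repeat=repeat, number=1))


if __name__ == "__main__":
    s_object = make_strings(1_000_000)
    t_object = time_hash(s_object)
    print(f"object:          {t_object:.3f} s")
    try:
        # Same values, stored in a pyarrow-backed string array.
        s_arrow = s_object.astype("string[pyarrow]")
    except ImportError:
        print("pyarrow is not installed; skipping the string[pyarrow] timing")
    else:
        t_arrow = time_hash(s_arrow)
        print(f"string[pyarrow]: {t_arrow:.3f} s")
        print(f"ratio (pyarrow / object): {t_arrow / t_object:.2f}")
```

Taking the minimum over several repeats reduces noise from other processes; a ratio above 1 would correspond to the slowdown reported here.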
When investigating Dask's shuffle performance on string[pyarrow] data, I observed pd.util.hash_pandas_object was slightly less performant on string[pyarrow] data than on regular object Python objects. This surprised me as I would have expected hashing pyarrow-backed data to be faster than Python objects.

Installed Versions
Prior Performance
No response