PERF: ArrowExtensionArray.factorize #49177

Merged
mroeschke merged 6 commits into pandas-dev:main from the perf/arrow/factorize branch on Oct 19, 2022

Conversation

mroeschke
Member

@mroeschke mroeschke commented Oct 18, 2022

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: data = [1, 2, 3] * 5000 + [None] * 5000

In [4]: arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [5]: %timeit arr.factorize()  # pr
138 µs ± 795 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [5]: %timeit arr.factorize()  # main
665 µs ± 9.17 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
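
For readers unfamiliar with what is being timed: `factorize` maps each value to an integer code and returns the distinct values in first-seen order, with missing values coded as -1 by default. The sketch below is a pure-Python reference model of that behaviour on the benchmark data. It is only an illustration; `factorize_reference` is a hypothetical helper, not the code in this PR and not how pandas or pyarrow implement it.

```python
def factorize_reference(values, na_sentinel=-1):
    """Illustrative model of factorize semantics (hypothetical helper)."""
    first_seen = {}  # value -> integer code, in order of first appearance
    codes = []
    for value in values:
        if value is None:
            # Missing values get the sentinel and are left out of the uniques.
            codes.append(na_sentinel)
        else:
            # Reuse the existing code, or assign the next unused one.
            codes.append(first_seen.setdefault(value, len(first_seen)))
    return codes, list(first_seen)


# Same data as in the benchmark above.
data = [1, 2, 3] * 5000 + [None] * 5000
codes, uniques = factorize_reference(data)
```

Here `uniques` is `[1, 2, 3]`, the first 15,000 codes cycle through `0, 1, 2`, and the last 5,000 codes are `-1`.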

@mroeschke mroeschke added the Performance (memory or execution speed performance) and Arrow (pyarrow functionality) labels Oct 18, 2022
Member

@WillAyd WillAyd left a comment

nice change

@mroeschke mroeschke added this to the 2.0 milestone Oct 19, 2022
@mroeschke mroeschke merged commit bbb1cdf into pandas-dev:main Oct 19, 2022
@mroeschke mroeschke deleted the perf/arrow/factorize branch October 19, 2022 04:42
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* Iterate on simplification

* Complete refactor

* add whatsnew

* Add whatsnew note

* Rename