PERF: pd.concat with EA-backed indexes #49128

Merged: 5 commits, Oct 17, 2022

Conversation

lukemanley
Member

Performance improvement for pd.concat when the objects being concatenated have ExtensionArray (EA)-backed indexes. The bottleneck was EA.tolist. Concatenation is still relatively slow compared with non-EA indexes, but this is an improvement.

$ asv continuous -f 1.1 upstream/main ea-tolist -b join_merge.ConcatIndexDtype

       before           after         ratio
     [90b4add7]       [77a21f8d]
     <main>           <ea-tolist>
-      19.4±0.2ms      12.0±0.07ms     0.62  join_merge.ConcatIndexDtype.time_concat_series('Int64', 1, True, False)
-      14.6±0.2ms       6.78±0.2ms     0.46  join_merge.ConcatIndexDtype.time_concat_series('Int64', 1, True, True)
-      14.2±0.2ms       6.57±0.1ms     0.46  join_merge.ConcatIndexDtype.time_concat_series('Int64', 1, False, False)
-      14.1±0.2ms      6.47±0.09ms     0.46  join_merge.ConcatIndexDtype.time_concat_series('Int64', 1, False, True)
-      23.9±0.1ms       10.3±0.1ms     0.43  join_merge.ConcatIndexDtype.time_concat_series('string[python]', 1, True, False)
-      19.7±0.3ms      5.53±0.05ms     0.28  join_merge.ConcatIndexDtype.time_concat_series('string[python]', 1, True, True)
-      19.0±0.2ms      5.16±0.03ms     0.27  join_merge.ConcatIndexDtype.time_concat_series('string[python]', 1, False, True)
-      18.5±0.2ms      4.82±0.03ms     0.26  join_merge.ConcatIndexDtype.time_concat_series('string[python]', 1, False, False)
-         108±1ms       24.5±0.4ms     0.23  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 1, True, False)
-      98.4±0.6ms       17.0±0.3ms     0.17  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 1, True, True)
-      97.6±0.8ms       15.9±0.3ms     0.16  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 1, False, True)
-      99.2±0.5ms       16.0±0.3ms     0.16  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 1, False, False)
$ asv continuous -f 1.1 upstream/main ea-tolist -b array.ArrowStringArray

       before           after         ratio
     [90b4add7]       [77a21f8d]
     <main>           <ea-tolist>
-      16.5±0.2ms          399±6μs     0.02  array.ArrowStringArray.time_tolist(True)
-      17.1±0.2ms          404±6μs     0.02  array.ArrowStringArray.time_tolist(False)
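
For anyone who wants to try the scenario locally without asv, here is a rough sketch of the kind of workload measured above: concatenating Series whose indexes are EA-backed, plus a direct tolist call. It is not the asv benchmark itself. The helper name make_series, the sizes, and the repeat count are made up for illustration, and absolute timings will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_series(dtype, n=10_000):
    """Build three Series whose EA-backed indexes overlap but differ in length.

    Hypothetical helper for illustration only; not part of pandas or asv.
    """
    if dtype.startswith("string"):
        values = [f"i-{i}" for i in range(n)]
    else:
        values = np.arange(n)
    idx = pd.Index(values, dtype=dtype)  # index backed by an ExtensionArray
    return [pd.Series(float(i), index=idx[: n - i]) for i in range(1, 4)]


if __name__ == "__main__":
    for dtype in ["Int64", "string[python]"]:
        series_list = make_series(dtype)

        # Time the concat along axis=1, which has to combine the indexes.
        t_concat = timeit.timeit(
            lambda: pd.concat(series_list, axis=1), number=5
        )

        # Time tolist on the underlying index, the bottleneck named above.
        idx = series_list[0].index
        t_tolist = timeit.timeit(idx.tolist, number=5)

        print(f"{dtype}: concat {t_concat:.4f}s, tolist {t_tolist:.4f}s (5 runs)")
```

Running this on main and on the branch should show the same direction of change as the asv output, though the numbers are not comparable with the table.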

lukemanley added the Performance, Reshaping, and ExtensionArray labels Oct 16, 2022
Member

jbrockmendel left a comment

LGTM

phofl added this to the 2.0 milestone Oct 17, 2022
phofl merged commit 4583a04 into pandas-dev:main Oct 17, 2022
Member

phofl commented Oct 17, 2022

thx @lukemanley

lukemanley deleted the ea-tolist branch October 26, 2022 10:18
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
3 participants