ENH/PERF: cache sort/mask per column #3539

Closed
jreback opened this issue May 7, 2013 · 3 comments
Labels
Closing Candidate (may be closeable, needs more eyeballs), Enhancement, Performance (memory or execution speed performance)

Comments

@jreback
Contributor

jreback commented May 7, 2013

Along the lines of wes's answer to this question:

http://stackoverflow.com/questions/16384332/how-to-speed-up-pandas-row-filtering-by-string-matching

I think it is possible to have a dictionary recording certain parameters for a series (or a column in a frame), something like

conditions = dict(sorted=False, nulls=False, unique=True)

that would allow certain operations to be sped up. Of course, these conditions
would have to be updated in various scenarios, e.g. when sorting by a certain
column you could set that column's sorted condition to True (and invalidate it when
sorting by other columns). However, this might be a bit complicated to determine,
in which case you could just set sorted = None, meaning "I don't know".

But many operations could preserve these conditions (e.g. reindexing with a monotonic index will preserve the sort, but will invalidate the nulls condition if the new index is not identical to the current index).
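A minimal sketch of the idea, assuming a hypothetical `FlaggedColumn` wrapper (not a pandas class) around a NumPy array with no nulls. Each condition starts as `None` ("I don't know"), is computed lazily on request, is set directly by operations that guarantee it, and is reset by any write:

```python
import numpy as np
import pandas as pd


class FlaggedColumn:
    """Hypothetical wrapper: values plus a dict of cached conditions."""

    def __init__(self, values):
        self.values = np.asarray(values)
        # None means "I don't know"; the flag is computed on first request.
        self.conditions = dict(sorted=None, nulls=None, unique=None)

    def _compute(self, name):
        s = pd.Series(self.values)
        if name == "sorted":
            return bool(s.is_monotonic_increasing)
        if name == "nulls":
            return bool(s.isna().any())
        if name == "unique":
            return bool(s.is_unique)
        raise KeyError(name)

    def condition(self, name):
        # Return the cached answer, computing and storing it if unknown.
        if self.conditions[name] is None:
            self.conditions[name] = self._compute(name)
        return self.conditions[name]

    def sort(self):
        self.values = np.sort(self.values)
        # Sorting guarantees this flag (assuming no nulls); nulls and
        # unique are unaffected by a reordering, so they stay cached.
        self.conditions["sorted"] = True

    def set_value(self, i, value):
        self.values[i] = value
        # Conservative choice: any write invalidates every flag.
        self.conditions = dict.fromkeys(self.conditions, None)
```

Resetting everything on write is the simplest policy; a real implementation would want finer-grained rules, which is where the complexity comes from.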

Non-trivial, but might be worth it.

E.g. if I have already computed nulls and know that I don't need the null check, I can go directly to numpy land.

Or, if it's already sorted, I can use searchsorted.
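To illustrate the searchsorted case, here is a rough sketch that builds the equality mask with two binary searches instead of a full scan. The helper name `eq_mask_sorted` is made up for this example, and it assumes the array is already sorted ascending and contains no nulls:

```python
import numpy as np


def eq_mask_sorted(values, key):
    """Boolean mask for values == key; values must be sorted ascending, no nulls."""
    # All entries equal to key sit in the contiguous run [lo, hi).
    lo = np.searchsorted(values, key, side="left")
    hi = np.searchsorted(values, key, side="right")
    mask = np.zeros(len(values), dtype=bool)
    mask[lo:hi] = True
    return mask


values = np.array(["A0001", "A0002", "A0003", "A0003", "A0004"])
mask = eq_mask_sorted(values, "A0003")
```

The two lookups are O(log n), although allocating the mask is still O(n); returning the `(lo, hi)` slice bounds instead would avoid that.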

In [6]: %timeit values=='A0003'
10 loops, best of 3: 164 ms per loop

In [7]: %timeit pd.lib.scalar_compare(values,'A0003',operator.eq)
1 loops, best of 3: 255 ms per loop

In [8]: %timeit values.sort()
1 loops, best of 3: 2.29 s per loop

In [9]: %timeit pd.isnull(values)
10 loops, best of 3: 105 ms per loop
@jbrockmendel
Member

I could imagine this being implemented on a bespoke EA, but implementing on say Block sounds like a giant PITA at this point

@jbrockmendel
Member

The existence of views makes cache invalidation way too hard. Closeable?
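A small NumPy illustration of the view problem, using a hypothetical `cache` dict standing in for the per-column conditions. A write through a view changes the base array, but nothing notifies the cache attached to the base:

```python
import numpy as np

base = np.array([1, 2, 3, 4])
cache = {"sorted": True}   # hypothetical cached flag describing `base`

view = base[1:]            # basic slicing returns a view sharing memory
view[0] = 10               # writes through to `base`; `cache` is not told

assert np.shares_memory(base, view)

# Recomputing from scratch shows the cached flag is now stale.
actually_sorted = bool(np.all(base[:-1] <= base[1:]))
```

Here `cache["sorted"]` still says True while `actually_sorted` is False, so every view would need a way to invalidate its parent's flags.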

jbrockmendel added the Closing Candidate label Sep 22, 2020
@mroeschke
Member

Sounds like the non-triviality might make this not worth it. Going to close.
