ENH/PERF: cache sort/mask per column #3539

jreback · 2013-05-07T16:48:48Z

Along the lines of wes's answer to this question:

http://stackoverflow.com/questions/16384332/how-to-speed-up-pandas-row-filtering-by-string-matching

I think it is possible to have a dictionary recording certain parameters for a series (or a column in a frame), something like

conditions = dict(sorted=False, nulls=False, unique=True)

that would allow certain operations to be sped up. Of course these conditions
would have to be updated in various scenarios, e.g. when sorting by a certain
column you could set the sorted condition = True (and invalidate it when
sorting by other columns). However, this might be a bit complicated to determine
(in which case you could just set sorted = None, meaning I don't know).

But many operations could preserve these conditions (e.g. reindexing with a monotonic index will preserve the sort, but will invalidate the nulls if it's not identical to the current index).
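A minimal sketch of how such tri-state conditions might be tracked. Everything here (`ColumnConditions`, `CachedColumn`, `set_value`) is hypothetical and not part of pandas; a plain NumPy float array stands in for a column, and `None` means "I don't know".

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class ColumnConditions:
    """Hypothetical per-column flags: True, False, or None (unknown)."""
    sorted: Optional[bool] = None
    nulls: Optional[bool] = None
    unique: Optional[bool] = None

    def invalidate(self) -> None:
        # Fall back to "unknown" for every condition.
        self.sorted = self.nulls = self.unique = None


class CachedColumn:
    """Hypothetical wrapper pairing an array with its cached conditions."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)
        self.conditions = ColumnConditions()

    def sort(self) -> None:
        # Sorting in place makes `sorted` known; nulls/unique are unaffected.
        self.values.sort()
        self.conditions.sorted = True

    def has_nulls(self) -> bool:
        # Compute the null check once, then reuse the cached answer.
        if self.conditions.nulls is None:
            self.conditions.nulls = bool(np.isnan(self.values).any())
        return self.conditions.nulls

    def set_value(self, i: int, value: float) -> None:
        # Any write may break every condition, so reset them all.
        self.values[i] = value
        self.conditions.invalidate()
```

Resetting everything on a write is the conservative choice; a finer-grained version would need to reason about each operation separately, which is where the complexity comes from.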

non-trivial, but might be worth it

e.g. using the fact that I already computed nulls, I can go directly to numpy land if I already know I don't need to do the null check

or if it's already sorted, then I can use searchsorted
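As a rough illustration of the sorted fast path: the helper below (`eq_mask` and its `is_sorted` argument are made-up names, not pandas API) builds the same boolean mask as `values == key`, but uses two binary searches when the caller already knows the array is sorted.

```python
import numpy as np


def eq_mask(values, key, is_sorted=None):
    """Boolean mask of values == key.

    is_sorted is a hypothetical cached flag: True enables the
    binary-search path, False/None falls back to a full scan.
    """
    if is_sorted:
        # In a sorted array all matches form one contiguous run [lo, hi).
        lo = np.searchsorted(values, key, side="left")
        hi = np.searchsorted(values, key, side="right")
        mask = np.zeros(len(values), dtype=bool)
        mask[lo:hi] = True
        return mask
    return values == key


values = np.array(["A0001", "A0002", "A0003", "A0003", "A0004"])
mask = eq_mask(values, "A0003", is_sorted=True)
# mask is [False, False, True, True, False]
```

The fast path only pays off if the sorted flag is already known; as the timings below suggest, sorting just to answer one comparison costs far more than the scan it replaces.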

In [6]: %timeit values=='A0003'
10 loops, best of 3: 164 ms per loop

In [7]: %timeit pd.lib.scalar_compare(values,'A0003',operator.eq)
1 loops, best of 3: 255 ms per loop

In [8]: %timeit values.sort()
1 loops, best of 3: 2.29 s per loop

In [9]: %timeit pd.isnull(values)
10 loops, best of 3: 105 ms per loop

jbrockmendel · 2020-06-09T20:58:02Z

I could imagine this being implemented on a bespoke EA, but implementing on say Block sounds like a giant PITA at this point

jbrockmendel · 2020-09-22T00:06:46Z

The existence of views makes cache invalidation way too hard. Closeable?
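A small sketch of the problem, using plain NumPy arrays as a stand-in for pandas internals and a hypothetical `flags` dict as the cache: a write through a view changes the base array without the base's cached condition ever being told.

```python
import numpy as np

base = np.array([1, 2, 3, 4, 5, 6])
flags = {"sorted": True}  # hypothetical cached condition for `base`

view = base[::2]          # a view: shares memory with `base`
view[0] = 10              # writes through, so base[0] is now 10

# The cache still claims the data is sorted, but it no longer is.
actually_sorted = bool(np.all(base[:-1] <= base[1:]))
```

Here `flags["sorted"]` is still `True` while `actually_sorted` is `False`; keeping the two in sync would require every view to know about, and invalidate, the caches of everything it shares memory with.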

mroeschke · 2021-04-11T02:00:17Z

Sounds like the non-triviality might make this not worth it. Going to close.

jreback mentioned this issue Feb 18, 2015

Support pad/backfill/nearest reindexing even for unsorted indexes by storing a sorted index? #9510

Closed

jreback added the Enhancement and Performance (memory or execution speed performance) labels Feb 19, 2015

TomAugspurger added Difficulty Intermediate labels Jul 8, 2017

jbrockmendel removed Effort Medium labels Oct 21, 2019

jbrockmendel added the Closing Candidate (may be closeable, needs more eyeballs) label Sep 22, 2020

mroeschke closed this as completed Apr 11, 2021
