ENH: Faster merge_asof() through a single-pass algo #13902

Closed

chrisaycock opened this issue Aug 4, 2016 · 4 comments
Labels: Performance (Memory or execution speed performance), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

chrisaycock commented Aug 4, 2016

Out of curiosity, I took a crack at a single-pass merge_asof(). My sample passes the existing regression tests but is "wrong" in that it works only for a single object-type "by" parameter. I use PyObjectHashTable while scanning through the right DataFrame to cache the most recently found row for each "by" object.

I could add a little type differentiation if there is interest. I see that Tempita is getting some use in pandas. The main question is whether I can use multiple columns in the "by" parameter, which would be useful for matching things like ['ticker', 'exchange']. Still investigating.
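
A minimal pure-Python sketch of the approach (the real code is Cython and uses PyObjectHashTable for the cache; the function name and signature here are illustrative):

def asof_join_by(left, right, on, by):
    """Backward asof join in a single pass: for each left row, find the
    most recent right row (by the `on` column) with a matching `by` key.
    Assumes both frames are already sorted on `on`."""
    on_left, by_left = left[on].values, left[by].values
    on_right, by_right = right[on].values, right[by].values

    last_seen = {}  # by-key -> index of the most recently scanned right row
    right_pos = 0
    matches = []    # for each left row: index of the matching right row, or None

    for on_val, by_val in zip(on_left, by_left):
        # Advance through the right frame up to the current left timestamp,
        # caching the latest row seen for each by-key.
        while right_pos < len(on_right) and on_right[right_pos] <= on_val:
            last_seen[by_right[right_pos]] = right_pos
            right_pos += 1
        matches.append(last_seen.get(by_val))
    return matches

Each frame is scanned exactly once, so the work is linear in the combined length of the two tables.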

$ asv continuous master -b "join_merge.merge_asof_*"
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash c4302949:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                               41.07ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           12.90ms
[ 50.00%] · For pandas commit hash 97de42ab:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                              608.08ms
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           81.03ms
   before     after       ratio
  [97de42ab] [c4302949]
-   81.03ms    12.90ms      0.16  join_merge.merge_asof_noby.time_merge_asof_noby
-  608.08ms    41.07ms      0.07  join_merge.merge_asof_by.time_merge_asof_by

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback commented Aug 4, 2016

So you don't actually need Tempita here. You factorize things, and so only need to deal with int64s.

jreback added the Performance and Reshaping labels Aug 4, 2016

chrisaycock commented Aug 4, 2016

@jreback What makes this single-pass is that I'm no longer doing the factorizing. I'm comparing the values in the "on" column directly, which is fine since timestamps are stored as integers anyway. But if I ever want to compare floats, then I assume I'll need proper type differentiation.
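
(A quick check of that claim, in the same vein as the sessions below; assumes pandas imported as pd:)

In [1]: t = pd.date_range('2016-08-04', periods=3).values

In [2]: t.dtype, t.view('i8').dtype
Out[2]: (dtype('<M8[ns]'), dtype('int64'))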

I've issued a PR with the sample code to show how I did it. As I describe at the top of the message there, do not merge it in its current state...


jreback commented Aug 4, 2016

@chrisaycock you can use the groupby factorization (it's quite cheap to do this)

In [5]: df = pd.DataFrame({'A' : pd.date_range('20130101',periods=3), 'B' : list('aab'), 'C' : range(3)})

In [6]: g = df.groupby(['A', 'B'])

In [7]: g.grouper.group_info
Out[7]: (array([0, 1, 2], dtype=int64), array([0, 2, 5], dtype=int64), 3)
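
A sketch of how that could feed the join in the two-frame case (left and right here are hypothetical frames; pd.factorize over the concatenated keys yields one shared int64 label space):

codes, uniques = pd.factorize(pd.concat([left['key'], right['key']],
                                        ignore_index=True))
left_codes = codes[:len(left)]    # int64 labels for the left frame's by-keys
right_codes = codes[len(left):]   # int64 labels for the right frame's by-keys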

chrisaycock commented

Using the setup from the join_merge.merge_asof_by benchmark:

In [39]: %timeit pd.merge_asof(df1, df2, on='time', by='key')
10 loops, best of 3: 41.4 ms per loop

But the factorization takes way longer than that and we haven't even gotten to the actual joining logic:

In [40]: %timeit df1.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 28.2 ms per loop

In [41]: %timeit df2.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 177 ms per loop

The fastest possible approach is a single-pass algorithm. (And if we want this function to be remotely competitive with q/kdb+'s aj[], then we need to pay attention to performance.)
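
For anyone wanting to reproduce the comparison, a rough stand-in for that setup (the sizes and key cardinality here are guesses, not the actual benchmark parameters):

import numpy as np
import pandas as pd

np.random.seed(0)
n = 100000
df1 = pd.DataFrame({'time': np.sort(np.random.randint(0, 10 * n, n)),
                    'key': np.random.choice(list('abcde'), n),
                    'value1': np.random.randn(n)})
df2 = pd.DataFrame({'time': np.sort(np.random.randint(0, 10 * n, n)),
                    'key': np.random.choice(list('abcde'), n),
                    'value2': np.random.randn(n)})
result = pd.merge_asof(df1, df2, on='time', by='key')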

jreback pushed a commit that referenced this issue Aug 8, 2016
ENH: Faster merge_asof() performs a single pass when joining tables (#13902)

This version passes existing regression tests but is ultimately wrong
because it requires the "by" column to be a single object. A proper
version would handle int (and possibly float) columns through type
differentiation.

Author: Christopher C. Aycock <[email protected]>

Closes #13903 from chrisaycock/master and squashes the following commits:

f0d0165 [Christopher C. Aycock] ENH: Faster merge_asof() performs a single pass when joining tables (#13902)