ENH: Faster merge_asof() through a single-pass algo #13902

Closed

chrisaycock opened this issue Aug 4, 2016 · 4 comments
Labels: Performance (Memory or execution speed performance), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

chrisaycock commented Aug 4, 2016

Out of curiosity, I took a crack at a single-pass merge_asof(). My sample passes the existing regression tests but is "wrong" in that it works only for a single object-type "by" parameter. I use PyObjectHashTable while scanning through the right DataFrame to cache the most recently found row for each "by" object.

I could add a little type differentiation if there is interest. I see that Tempita is getting some use in pandas. The main question is whether I can use multiple columns in the "by" parameter, which would be useful for matching things like ['ticker', 'exchange']. Still investigating.
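
A minimal pure-Python sketch of the approach (the real code is Cython and uses PyObjectHashTable for the cache; the function name and signature here are illustrative):

def asof_join_by(left, right, on, by):
    """Backward asof join in a single pass: for each left row, find the
    most recent right row (by the `on` column) with a matching `by` key.
    Assumes both frames are already sorted on `on`."""
    on_left, by_left = left[on].values, left[by].values
    on_right, by_right = right[on].values, right[by].values

    last_seen = {}  # by-key -> index of the most recently scanned right row
    right_pos = 0
    matches = []    # for each left row: index of the matching right row, or None

    for on_val, by_val in zip(on_left, by_left):
        # Advance through the right frame up to the current left timestamp,
        # caching the latest row seen for each by-key.
        while right_pos < len(on_right) and on_right[right_pos] <= on_val:
            last_seen[by_right[right_pos]] = right_pos
            right_pos += 1
        matches.append(last_seen.get(by_val))
    return matches

Each frame is scanned exactly once, so the work is linear in the combined length of the two tables.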

$ asv continuous master -b "join_merge.merge_asof_*"
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash c4302949:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                               41.07ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           12.90ms
[ 50.00%] · For pandas commit hash 97de42ab:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                              608.08ms
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           81.03ms
   before     after       ratio
  [97de42ab] [c4302949]
-   81.03ms    12.90ms      0.16  join_merge.merge_asof_noby.time_merge_asof_noby
-  608.08ms    41.07ms      0.07  join_merge.merge_asof_by.time_merge_asof_by

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback commented Aug 4, 2016

So you don't actually need Tempita here. You factorize things, and so only need to deal with int64s.

jreback added the Performance and Reshaping labels Aug 4, 2016

chrisaycock commented Aug 4, 2016

@jreback What makes this single-pass is that I'm no longer doing the factorizing. I'm comparing the values in the "on" column directly, which is fine since timestamps are stored as integers anyway. But if I ever want to compare floats, then I assume I'll need proper type differentiation.
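
(A quick check of that claim, in the same vein as the sessions below; assumes pandas imported as pd:)

In [1]: t = pd.date_range('2016-08-04', periods=3).values

In [2]: t.dtype, t.view('i8').dtype
Out[2]: (dtype('<M8[ns]'), dtype('int64'))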

I've issued a PR with the sample code to show how I did it. As I describe at the top of the message there, do not merge it in its current state...


jreback commented Aug 4, 2016

@chrisaycock you can use the groupby factorization (it's quite cheap to do this)

In [5]: df = pd.DataFrame({'A' : pd.date_range('20130101',periods=3), 'B' : list('aab'), 'C' : range(3)})

In [6]: g = df.groupby(['A', 'B'])

In [7]: g.grouper.group_info
Out[7]: (array([0, 1, 2], dtype=int64), array([0, 2, 5], dtype=int64), 3)
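
A sketch of how that could feed the join in the two-frame case (left and right here are hypothetical frames; pd.factorize over the concatenated keys yields one shared int64 label space):

codes, uniques = pd.factorize(pd.concat([left['key'], right['key']],
                                        ignore_index=True))
left_codes = codes[:len(left)]    # int64 labels for the left frame's by-keys
right_codes = codes[len(left):]   # int64 labels for the right frame's by-keys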

chrisaycock commented

Using the setup from the join_merge.merge_asof_by benchmark:

In [39]: %timeit pd.merge_asof(df1, df2, on='time', by='key')
10 loops, best of 3: 41.4 ms per loop

But the factorization takes way longer than that and we haven't even gotten to the actual joining logic:

In [40]: %timeit df1.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 28.2 ms per loop

In [41]: %timeit df2.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 177 ms per loop

The fastest possible approach is a single-pass algorithm. (And if we want this function to be remotely competitive with q/kdb+'s aj[], then we need to pay attention to performance.)
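
For anyone wanting to reproduce the comparison, a rough stand-in for that setup (the sizes and key cardinality here are guesses, not the actual benchmark parameters):

import numpy as np
import pandas as pd

np.random.seed(0)
n = 100000
df1 = pd.DataFrame({'time': np.sort(np.random.randint(0, 10 * n, n)),
                    'key': np.random.choice(list('abcde'), n),
                    'value1': np.random.randn(n)})
df2 = pd.DataFrame({'time': np.sort(np.random.randint(0, 10 * n, n)),
                    'key': np.random.choice(list('abcde'), n),
                    'value2': np.random.randn(n)})
result = pd.merge_asof(df1, df2, on='time', by='key')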

jreback pushed a commit that referenced this issue Aug 8, 2016
ENH: Faster merge_asof() performs a single pass when joining tables (#13902)

This version passes existing regression tests but is ultimately wrong
because it requires the "by" column to be a single object. A proper
version would handle int (and possibly float) columns through type
differentiation.

Author: Christopher C. Aycock <[email protected]>

Closes #13903 from chrisaycock/master and squashes the following commits:

f0d0165 [Christopher C. Aycock] ENH: Faster merge_asof() performs a single pass when joining tables (#13902)