ENH: Faster merge_asof() through a single-pass algo #13902
Comments
So you don't actually need Tempita here. You factorize things, and so only need to deal with int64s.
@jreback The single-pass nature of this is that I'm not doing the factorizing anymore. I'm comparing the values in the "on" column directly, which is fine since timestamps are stored as integers anyway. But if I ever want to compare floats, then I assume I'll need proper type differentiation. I've issued a PR for the sample code to show how I did it. As I describe at the top of the message there, do not merge in its current state...
@chrisaycock You can use the groupby factorization (it's quite cheap to do this).
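For readers unfamiliar with the term, factorization here means replacing each distinct "by" label with a small integer code so that later comparisons only touch integers. The sketch below is a pure-Python illustration of that idea, not pandas' internal implementation; the function name `factorize` and the sample tickers are hypothetical and chosen only for this example.

```python
def factorize(labels):
    """Map each distinct label to an integer code, in order of first appearance.

    Illustrative only: this mimics the idea of factorization, not the
    hash-table code pandas uses internally.
    """
    codes = []      # one integer code per input label
    uniques = []    # distinct labels, in order of first appearance
    code_of = {}    # label -> code lookup
    for label in labels:
        if label not in code_of:
            code_of[label] = len(uniques)
            uniques.append(label)
        codes.append(code_of[label])
    return codes, uniques


codes, uniques = factorize(["AAPL", "MSFT", "AAPL", "GOOG", "MSFT"])
print(codes)    # [0, 1, 0, 2, 1]
print(uniques)  # ['AAPL', 'MSFT', 'GOOG']
```

Note that this needs one full pass over the labels before any joining starts, which is the overhead being debated in this thread.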
Using the setup from the
But the factorization takes way longer than that and we haven't even gotten to the actual joining logic:
The fastest possible approach is a single-pass algorithm. (And if we want this function to be remotely competitive with q/kdb+'s |
…13902) This version passes existing regression tests but is ultimately wrong because it requires the "by" column to be a single object. A proper version would handle int (and possibly float) columns through type differentiation. Author: Christopher C. Aycock <[email protected]> Closes #13903 from chrisaycock/master and squashes the following commits: f0d0165 [Christopher C. Aycock] ENH: Faster merge_asof() performs a single pass when joining tables (#13902)
Out of curiosity, I took a crack at a single-pass `merge_asof()`. My sample passes the existing regression tests but is "wrong" in that it works only for a single object-type "by" parameter. I use `PyObjectHashTable` while scanning through the right DataFrame to cache the most recently found row for each "by" object.

I could add a little type differentiation if there is interest. I see that Tempita is getting some use in pandas. The main question is whether I can use multiple columns in the "by" parameter, which would be useful for matching things like `['ticker', 'exchange']`. Still investigating.
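To make the idea concrete, here is a minimal pure-Python sketch of a single-pass backward as-of join. It is not the Cython code from the PR: a plain `dict` stands in for `PyObjectHashTable`, the function name `asof_join_backward` and its arguments are hypothetical, and both "on" sequences are assumed to be sorted ascending.

```python
def asof_join_backward(left_on, left_by, right_on, right_by):
    """For each left row, find the latest right row with the same "by" key
    whose "on" value is <= the left row's "on" value.

    Returns a list of right-row positions, with -1 where no match exists.
    Assumes left_on and right_on are each sorted in ascending order.
    """
    last_seen = {}   # "by" key -> position of most recent right row scanned
    result = []
    r = 0            # cursor into the right table; it only moves forward
    n_right = len(right_on)
    for on_val, by_val in zip(left_on, left_by):
        # Consume every right row that is not later than this left row,
        # remembering the newest row seen for each "by" key.
        while r < n_right and right_on[r] <= on_val:
            last_seen[right_by[r]] = r
            r += 1
        result.append(last_seen.get(by_val, -1))
    return result


left_on = [1, 5, 10]
left_by = ["a", "b", "a"]
right_on = [1, 2, 3, 6, 7]
right_by = ["a", "b", "a", "b", "a"]
print(asof_join_backward(left_on, left_by, right_on, right_by))  # [0, 1, 4]
```

Each table is scanned exactly once, so the work is linear in the combined number of rows. In this sketch any hashable key works, so a multi-column "by" could be passed as tuples such as `('ticker', 'exchange')` pairs; whether that carries over to the typed hash tables in pandas is the open question above.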