Parallel join technique documentation #81
04952cc to 1317fe7
WARNING: Performing a join outside of KS relinquishes ordering efforts KS applies to populating each side of the join - i.e. there is no effort to apply any ordering to the corresponding sides of this join.
When joining within KS, this is taken care of for you.
Be careful using this technique if your operation is sensitive to the order in which data is populated in the state store vs arriving from the event stream.
Time synchronization of joins in KS isn't done for GlobalKTables (and interactive queries?) anyway 🤔
"Another important difference between a KTable and a GlobalKTable is time synchronization: while processing KTable records is time synchronized based on record timestamps to all other streams, a GlobalKTable is not time synchronized."
https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/
Do you know if there is any way to make sure that a GlobalKTable has been populated before parallel-consumer processing starts, to avoid or minimize the chance of join-misses? GlobalKTables are populated in a separate thread. It's usually super-fast, but I have had some issues with join-misses in KS for stream-GlobalKTable joins during startup when the GlobalKTable is very large.
GKTs work a bit differently: they "fully" hydrate on application start. That is, the head offset of the input topic is marked at start, the topic is then loaded up to that point, and after that it's just whatever hits first. So there is some effort there; I'm just trying to point out that it's different, and should be understood.
Ah yes, trying to sync with PC separately... 🤔 That's a great question. You could do something janky like read the head message and wait until it can be queried from the GKT. Otherwise, ideally you could hook into an event listener system to know when the GKT bootstrap has finished. I guess there isn't anything for this currently. Oh, actually, it might be represented in the run state of KS; that's worth looking into. Thanks for the heads up!
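For illustration, a minimal sketch of the "janky" option above: poll a lookup for the key of the head message, and only start parallel-consumer processing once it can be read. Everything here is hypothetical and not part of parallel-consumer or Kafka Streams: `GlobalTableGate`, `awaitKey`, and the `lookup` function are made-up names, and `lookup` is assumed to wrap whatever read access you have to the GKT's state store (for example a `get` on a queryable key-value store).

```java
import java.time.Duration;
import java.util.function.Function;

/**
 * Hypothetical helper: blocks until a known key becomes readable from a
 * lookup, or gives up after a timeout. The lookup stands in for a read
 * against the GlobalKTable's state store (an assumption of this sketch).
 */
final class GlobalTableGate {

    private GlobalTableGate() {}

    /**
     * @param lookup       returns the stored value for a key, or null if absent
     * @param headKey      key of the head message read from the input topic
     * @param timeout      maximum time to wait before giving up
     * @param pollInterval pause between lookups
     * @return true if the key became readable, false on timeout or interrupt
     */
    static <K, V> boolean awaitKey(Function<K, V> lookup, K headKey,
                                   Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (lookup.apply(headKey) != null) {
                return true; // the table has caught up at least to the head message
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out; the caller decides whether to start anyway
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

The caller would invoke `GlobalTableGate.awaitKey(store::get, headKey, Duration.ofSeconds(30), Duration.ofMillis(100))` before starting to poll with PC. One caveat of this approach: if the head message is a tombstone, or its key is deleted later in the topic, the key never becomes readable and the gate only returns on timeout.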
Needs some suggestions on synchronising with an external KS app, as @JorgenRingen says.
5c841c3 to bc85ba3
6312a34 to 3ad49c2
4c62a9f to b5b166f
Rendered docs: https://github.com/astubbs/parallel-consumer/tree/parallel-join-technique#parallel-joins