Is producer.flush() a must? #137
Can anyone please help me with this? |
produce() is asynchronous, all it does is enqueue the message on an internal queue which is later (>= queue.buffering.max.ms) served by internal threads and sent to the broker (if a leader is available, else wait some more). What this means in practice is that if you do:
Then most likely neither message 1 nor message 2 will have actually been sent to the broker, much less acknowledged by it. Adding flush() before exiting makes the client wait for any outstanding messages to be delivered to the broker (this takes around queue.buffering.max.ms, plus latency). Let me reconstruct your loop to be more performant, look for FIXME comments:
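A minimal sketch of the situation described in this comment, not the original snippet. It assumes a producer object exposing confluent-kafka-python's produce() and flush() methods; the helper name, topic name and payloads are placeholders.

```python
def send_two_and_exit(producer, topic="mytopic"):
    """Enqueue two messages, then wait for delivery before returning."""
    # produce() only appends to the client's internal queue and returns at once.
    producer.produce(topic, value=b"message 1")
    producer.produce(topic, value=b"message 2")

    # Without this call the process could exit while both messages are still
    # queued. flush() blocks until the queue is drained or the timeout
    # (in seconds) expires, and returns the number of messages still waiting.
    remaining = producer.flush(10)
    return remaining
```

With a real client, the producer would be created along the lines of `Producer({"bootstrap.servers": "localhost:9092"})`, where the broker address is an assumption. A non-zero return value means some messages were still undelivered when the timeout expired.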
|
Thanks a ton for your solution. I have a few questions:
|
1749 * 6MB is about 10 GB. Producing that amount in 3 minutes gives a rate of about 60MB/s or almost 500 Mbit/s. |
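The arithmetic behind that estimate can be checked in a few lines. The variable names are illustrative, and decimal units (1 GB = 1000 MB, 1 byte = 8 bits) are assumed.

```python
units = 1749          # count taken from the estimate above
unit_size_mb = 6      # size of each unit in MB
elapsed_s = 3 * 60    # three minutes

total_mb = units * unit_size_mb      # 10494 MB, about 10 GB
rate_mb_s = total_mb / elapsed_s     # about 58 MB/s
rate_mbit_s = rate_mb_s * 8          # about 466 Mbit/s

print(f"{total_mb / 1000:.1f} GB, {rate_mb_s:.0f} MB/s, {rate_mbit_s:.0f} Mbit/s")
```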
I'm having a similar issue with flush() not completing. I'm consuming 464k messages from topic A and producing into B. All messages appear to be in topic B. I tried following your recommendations to snehamj with no luck. Output
Code
|
@edenhill what is the time unit in poll/flush? The code for flush, at least, multiplies the value by 1000, but here you are passing it in milliseconds. The docs don't indicate a unit. Thanks! |
@isaacdd The timeout unit is in seconds, but I mistakenly wrote it as milliseconds above. |
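Because the timeouts are in seconds, sub-second waits are written as fractions. A small sketch, assuming a producer exposing confluent-kafka-python's poll() and flush(); the helper name and the default values are illustrative.

```python
def drain(producer, poll_s=0.1, flush_s=10.0):
    """Serve pending callbacks briefly, then wait for the queue to empty."""
    # poll(0.1) waits at most 100 milliseconds for delivery reports and
    # returns the number of events (callbacks) it served.
    served = producer.poll(poll_s)
    # flush(10.0) waits at most 10 seconds and returns the number of
    # messages still in the queue.
    remaining = producer.flush(flush_s)
    return served, remaining
```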
@edenhill
When I use the above snippet, I see the delivery_callback for the last message, which is perfect.
When I use the above snippet, I see the delivery_callback for every message (that is, as soon as I produce to Kafka). Why is that? I don't clearly understand. confluent-kafka version: 0.11.0 |
produce() is asynchronous, it simply enqueues your message on an internal queue for later transmission to the broker. When you call poll(0) directly after produce() it is highly unlikely that the message you just produced has been sent to the broker, processed, and an ack response received from the broker in that short time. So you will most likely only see delivery reports from previous messages at that point. |
@edenhill Thanks for the info. poll(0) is fine, but what about poll(10)? What will that do? |
@edenhill Let me explain clearly what I am doing. We have a Kafka consumer that reads messages, does some stuff, and publishes to a Kafka topic again using the script below. Producer config:
I haven't set any other configuration, like
I am assuming these will all take their default values. My understanding: when the internal queue reaches either of
My producer snippet:
From this post I understand that by calling flush() after every message, the producer becomes a sync producer. If I change the above snippet to
will performance improve? Can you confirm my understanding? Thanks |
the produce() is still asynchronous and all it does is put the message on an internal queue. By calling flush() after each produce() you effectively turn it into a sync produce, which is slow, see here: https://github.com/edenhill/librdkafka/wiki/FAQ#why-is-there-no-sync-produce-interface Calling poll(0) after produce() will serve delivery reports of already sent and acked messages, which is typically never the message you just produced. |
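The difference between the two patterns can be sketched as follows. This is an illustration rather than the snippets posted in the thread; it assumes a producer exposing confluent-kafka-python's produce(), poll() and flush(), and the function names are made up for the example.

```python
def on_delivery(err, msg):
    # Invoked from poll() or flush(), once per message, with the final outcome.
    if err is not None:
        print(f"delivery failed: {err}")


def produce_sync(producer, topic, payloads):
    """Slow: each message waits for its own round trip to the broker."""
    for payload in payloads:
        producer.produce(topic, value=payload, on_delivery=on_delivery)
        producer.flush()  # blocks until this message is delivered or fails


def produce_async(producer, topic, payloads):
    """Fast: messages are batched and the wait happens only once."""
    for payload in payloads:
        producer.produce(topic, value=payload, on_delivery=on_delivery)
        # Non-blocking: serves delivery reports that have already arrived,
        # which are usually for earlier messages, not the one just produced.
        producer.poll(0)
    producer.flush()  # wait for everything still in flight
```

For N messages, produce_sync() blocks N times while produce_async() blocks once, which is where the speed difference comes from.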
Thanks @edenhill. I have changed my code from:
to
In the first scenario, the snippet used to take ~45 ms because it was waiting for the internal queue to empty. Thanks |
What if our app fails before flush() is called? Will we lose our data? |
Depends on if produce()d messages have made it to the broker yet, or are still in the producer queue or in socket buffers. |
@edenhill |
A message is considered in-queue until you've polled for its delivery report, which means that if you don't call poll the internal producer queue (queue.buffering.max.messages) will eventually fill up and reject further messages. If you know that your producer is short-lived and will not produce more messages than queue.buffering.max.messages you can skip poll() and just go for the final flush(), but for long-running producers you should call poll() frequently. |
Also note that poll() will trigger error callbacks (error_cb), stats, etc. |
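For a long-running producer, that advice might look like the sketch below. It assumes a producer exposing confluent-kafka-python's produce(), poll() and flush(), where produce() raises BufferError once the local queue is full; the function name and the one-second wait are illustrative choices.

```python
def produce_with_backpressure(producer, topic, payloads):
    """Produce every payload, retrying whenever the local queue is full.

    Returns the number of times produce() had to be retried.
    """
    retries = 0
    for payload in payloads:
        while True:
            try:
                producer.produce(topic, value=payload)
                break
            except BufferError:
                # The local queue is full (queue.buffering.max.messages or
                # queue.buffering.max.kbytes reached). Block for up to one
                # second serving delivery reports to free space, then retry.
                retries += 1
                producer.poll(1)
        # Regular non-blocking poll so delivery reports, error callbacks
        # and stats callbacks keep being served.
        producer.poll(0)
    # Final wait for whatever is still queued or in flight.
    producer.flush()
    return retries
```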
If I call poll() right after every produce(), does that mean I send data to broker record by record, not in batch mode? |
No, if you call poll() with a timeout of 0 it will serve any delivery reports from previous produce-calls, or none, without blocking. |
I used poll(0) instead of flush(), but no messages are produced to the topics. Am I missing something? |
Hi, @ArmanAjdani can you share your code snippet? |
This is my produce function:
|
The producer does not send a message to the broker immediately; it buffers messages for up to queue.buffering.max.ms and then sends them as a batch. The queue limits (queue.buffering.max.messages and queue.buffering.max.kbytes) only cap how much can be buffered locally. Calling flush() immediately after produce() publishes all queued messages to the broker regardless of these settings. Keep the poll() call and try publishing more messages; you will then receive the callback for the previous message. |
I have producer code that uses a callback, but my message does not trigger the callback or raise an error even if no producer. It always shows success. If Kafka is up, the message is produced.
import json
logger = get_logger()
class SimpleProducer(config):
configuration = {'linger.ms': 30, 'acks': 'All', 'request.timeout.ms': 10000, 'message.timeout.ms': 90000}
count = 1
except BufferError as e: |
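One possible explanation, consistent with the earlier replies in this thread: produce() returning without an exception only means the message was queued, and the delivery callback runs only from poll() or flush(). With message.timeout.ms set to 90000, a failure may not be reported until roughly 90 seconds have passed. A hedged sketch of checking the callback's err argument; DeliveryTracker, produce_and_confirm and the 95-second wait are hypothetical names and values, not part of the code above.

```python
class DeliveryTracker:
    """Callable delivery callback that records the outcome of each message."""

    def __init__(self):
        self.delivered = 0
        self.failed = []

    def __call__(self, err, msg):
        # err is None on success; otherwise it describes why delivery failed
        # (for example a local timeout after message.timeout.ms).
        if err is not None:
            self.failed.append(str(err))
        else:
            self.delivered += 1


def produce_and_confirm(producer, topic, payload, tracker, wait_s=95.0):
    """Return True only if the broker acknowledged the message."""
    producer.produce(topic, value=payload, on_delivery=tracker)
    # flush() serves the callback; the wait is chosen to exceed the assumed
    # message.timeout.ms of 90 seconds so a timeout error can surface.
    remaining = producer.flush(wait_s)
    return remaining == 0 and not tracker.failed
```

Blocking on every message like this is slow; it is shown only to make the success and failure paths visible.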
The configuration "conf" avoids the "buffering.max.message" error. Credit to: confluentinc/confluent-kafka-python#137 (comment)
Similar code, handling around 4.5 million records (70 MB in total), takes around 4 hours to produce them all. |
My multi-threaded producer doesn't seem to be sending any messages if flush is NOT included in the end. This is my script:
Adding flush() increases the run time drastically. Is it a must? Is there any other way I can make sure all the messages have reached the topics?
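A sketch of one way to structure a multi-threaded producer so that flush() is paid for only once. This is not the script from the question; it assumes a single producer shared between threads (the confluent-kafka-python README describes the Producer as thread safe) and uses hypothetical names throughout.

```python
import threading


def produce_from_threads(producer, topic, chunks):
    """Produce each chunk of payloads from its own thread, then flush once."""

    def worker(chunk):
        for payload in chunk:
            producer.produce(topic, value=payload)
            # Non-blocking: serve delivery reports that are already available.
            producer.poll(0)

    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # One blocking wait at the very end, instead of one per message or per
    # thread. Returns the number of messages still queued (0 when all done).
    return producer.flush()
```

The final flush() is still needed to be sure everything reached the broker before the process exits, but calling it once after all threads finish costs far less than calling it inside the loop.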