Re-subscribe to a topic after disconnection/reconnection #59
The underlying library should automatically reconnect and re-subscribe once the group coordinator broker comes back up, but there were some corner-case bugs around this in librdkafka 0.9.1 which have been fixed. We'll be releasing librdkafka 0.9.2 within a week, but if you want to try out the fix before that, please download and build librdkafka master or the v0.9.2-RC1 tag.
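If it helps, here is a small sketch (not from the thread) for confirming which librdkafka build the Python client actually picked up after installing the RC or master, using the standard confluent_kafka version helpers:

```python
# Sanity check: print the confluent-kafka-python version and the librdkafka
# version it is linked against, to confirm the RC/master build is in use.
import confluent_kafka

print('confluent-kafka-python:', confluent_kafka.version())
print('librdkafka:', confluent_kafka.libversion())
```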
So I compiled 0.9.2-RC1 and I'm getting a way worse problem.
@patrickviet And that's without the re-subscribe-every-10s thing, right? Can you reproduce this with debug logging enabled?
Output without the re-subscribe every 10 sec, from when I restart the kafka broker:

And output with the re-subscribe, because I find it interesting too:
For the second one, the restart of the kafka broker happens right in the middle; you can see the lost connection. My producer is a simple loop which publishes the 'kwak' message every 5 sec.
Thank you, one last request: can you try the latest master (instead of 0.9.2-RC1)?
When reproducing, do it with debug "cgrp,topic,protocol,broker".
Code:

```python
from confluent_kafka import Consumer, KafkaError, KafkaException
import sys
import time

c = Consumer({
    'bootstrap.servers': 'kafkahost',
    'group.id': 'patricklaptop',
    'default.topic.config': {'auto.offset.reset': 'smallest'},
    'debug': 'cgrp,protocol,topic,broker',
})
c.subscribe(['patricktest'])

last_resubscribe = 0
while True:
    # if time.time() - last_resubscribe > 10:
    #     print("resubscribing - just in case")
    #     c.subscribe(['patricktest'])
    #     last_resubscribe = time.time()
    msg = c.poll(timeout=1.0)
    if msg is None:
        print("msg is None")
        # time.sleep(1)
        continue
    if msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            # Not an error really: just reached the end of the partition
            sys.stderr.write('%% %s [%d] reached end at offset %d\n' %
                             (msg.topic(), msg.partition(), msg.offset()))
        else:
            print("raising exception")
            raise KafkaException(msg.error())
    else:
        # Good stuff here.
        print('Received message: %s' % msg.value().decode('utf-8'))
    print("end of while cycle")
c.close()
```

Full output:
Extra info, no idea if it's relevant: running the latest Kafka release as of this date (https://www.apache.org/dyn/closer.cgi?path=/kafka/0.10.1.0/kafka_2.11-0.10.1.0.tgz) on CentOS 6.8 with Oracle JVM 1.8.0u92.
You just have the one broker, right?
Yes, just one broker for this test.
I think this is what messes it up:
MetadataResponse returns error 38 (INVALID_REPLICATION_FACTOR) in some transient state after the broker comes back up. Since the partition count in the response is set to 0, there can't be a leader broker for the partition that librdkafka is consuming, so librdkafka undelegates the current leader for that partition and stops consuming. The situation should then be remedied within half a minute. Can you give it a shot?
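For reference, the workaround named when this issue was closed is to lower topic.metadata.refresh.interval.ms so the consumer refreshes topic metadata (and re-discovers the partition leader) sooner after the broker returns. A rough sketch, with an arbitrary 20-second interval and the broker/group/topic names from this thread:

```python
from confluent_kafka import Consumer

c = Consumer({
    'bootstrap.servers': 'kafkahost',
    'group.id': 'patricklaptop',
    # Refresh topic metadata more often than the librdkafka default (5 minutes)
    # so a lost partition leader is re-discovered quickly after a broker restart.
    'topic.metadata.refresh.interval.ms': 20000,  # arbitrary example value
    'default.topic.config': {'auto.offset.reset': 'smallest'},
})
c.subscribe(['patricktest'])
```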
Thanks so much! Note: I managed to crash master after a few kafka restarts:
*** rdkafka_cgrp.c:2420:rd_kafka_cgrp_metadata_update_check: assert: thrd_is_current(rkcg->rkcg_rk->rk_thread) ***
Would it make any sense to have a lower default for the topic metadata refresh when the replication factor is zero? A sort of "retry faster if it all failed" option?
Great! Yeah, it needs to handle this somehow; there is a fast metadata refresher that kicks in for various other errors, but not for this code path. I'll look into it. You wouldn't happen to have a core file from that crash?
Sadly, I looked through the full filesystem of this virtual machine and didn't find any core file. Thanks again.
Created upstream issue; closing this one since a workaround is provided (set topic.metadata.refresh.interval.ms).
@edenhill I'm using version 1.6.0 and facing this issue. Our entire cluster went down recently, which took down connectivity with all kafka brokers. The connection was re-established, but the consumer didn't start pulling messages even after 5 minutes, until I restarted my service in all the environments several hours later. Does the kafka consumer automatically re-subscribe to the topics once brokers come online? This issue showed up on the front page when I searched for a query similar to the title. This StackOverflow answer says it reconnects automatically but doesn't resubscribe after reconnecting.
The consumer will resume fetching (and/or rejoin the group if it has expired) when the cluster is up again and the client has refreshed its metadata, which may take a minute or so depending on configuration. If you have any logs (preferably debug logs) from this situation, please create a new issue and provide config, logs, etc.
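If you just need visibility into what the client is doing while the cluster is down, one option (a sketch, not specific to this thread) is to register an error_cb and watch for the all-brokers-down error code; the broker address, group id and topic below are placeholders:

```python
import logging
from confluent_kafka import Consumer, KafkaError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('consumer')

def on_error(err):
    # Called for global (non-message) client errors, e.g. broker transport
    # failures; the client keeps retrying in the background on its own.
    if err.code() == KafkaError._ALL_BROKERS_DOWN:
        log.warning('All brokers down, waiting for the cluster to come back: %s', err)
    else:
        log.error('Kafka error: %s', err)

c = Consumer({
    'bootstrap.servers': 'kafkahost',
    'group.id': 'example-group',
    'error_cb': on_error,
})
c.subscribe(['patricktest'])
```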
Hello, I am using almost identical infrastructure to the one in the original question; only my logs are somewhat different and I do not understand them. I do know that the kafka broker was unavailable and, as a result, the consumer was logging errors. Could you kindly clarify these error messages?
Here is my test scenario. It's limited to the bare basics to isolate my problem:
I have a single kafka broker, a single topic with a single partition.
Producer creates random messages to topic patricktest
Consumer reads the messages from topic patricktest
Producer
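The original producer snippet isn't reproduced here; as a rough sketch of the behaviour described above (a loop publishing the 'kwak' message to patricktest every 5 seconds, using the same kafkahost broker as the consumer), it could look like this:

```python
import time
from confluent_kafka import Producer

p = Producer({'bootstrap.servers': 'kafkahost'})

while True:
    # Publish a test message and wait for delivery before sleeping.
    p.produce('patricktest', value='kwak')
    p.flush()
    time.sleep(5)
```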
Consumer
So as you can see, it's pretty basic. In my test scenario, I run the consumer, and it will read the messages, then print "msg is None" quite a bit, until I produce more messages by running the producer code.
Now here is the problem: if I stop kafka while the consumer is running, then restart kafka, the consumer will reconnect BUT it won't re-subscribe to the topic.
I have not found ANY WAY to know that I was disconnected. All I know is that poll returns None. That's it.
For now, the workaround that I have found is to add this at the beginning of the loop, as sketched below.
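That is the commented-out block visible in the consumer code quoted earlier in the thread: blindly re-subscribe every 10 seconds. Roughly, with `c` being the Consumer created above:

```python
import time

last_resubscribe = 0
while True:
    # Blind workaround: re-subscribe to the topic every 10 seconds.
    if time.time() - last_resubscribe > 10:
        print("resubscribing - just in case")
        c.subscribe(['patricktest'])
        last_resubscribe = time.time()

    msg = c.poll(timeout=1.0)
    # ... handle msg exactly as in the consumer loop above ...
```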
But sadly I also get this kind of thing every 10 seconds on the kafka server:
kafka/server.out
Is there any way to catch the reconnect so that I can re-subscribe to a topic?
Or maybe have a way to query the consumer object to know my current status?
Or maybe have the client automatically resubscribe to the topics itself?
While my workaround is functional, it's pretty awful AND produces extra load/logs on the broker.
I suspect that it would even trigger a full rebalance and other things if I had several consumers / partitions rather than this minimal one-consumer / one-partition / one-topic scenario...
Thanks!
--Patrick