Description
Ubuntu 14.04.4 LTS / Python 2.7.x / Kafka 0.10 (Confluent Platform 3) / Python client (latest)
- app servers: 4 core, 32GB RAM, SATA (running python scripts)
- db servers: 8 core, 64GB RAM, SSD (5-node Kafka/Cassandra cluster + 3-node ES cluster)
- 10Gb/s private NIC and bare metal stack
I'm seeing a large number of BufferError [Local] Queue full errors in the logs for the Producer client. I searched for the error yesterday and found a librdkafka issue from 2014 that was resolved by changing a few configuration parameters. I posted in that issue and changed my config, and the initial errors went away, but as the program ran overnight a flood of errors filled the logs. Out of 500,000 messages consumed from the source topics, more than 100,000 are missing from the subsequent topic.
I have a Python stream processor that instantiates both the Consumer and Producer classes. It consumes from 6 topics, performs a diff/upsert against the matching record in the Cassandra cluster (if one exists), and then publishes the diffed object to another topic (...ListingEditEvent). Messages are getting lost when it publishes to that topic. A transformer program then picks up from the ListingEditEvent topic, converts each record to our schema, and publishes to the ListingEditTransformed topic, which Logstash consumes into Elasticsearch. I'm seeing differences between the records in ES and those in the Kafka topics and am trying to resolve them. I'd appreciate any tips on how to solve this, or better configuration values.
I edited the config for the Producer client to the following:
conf = {
    'bootstrap.servers': ','.join(map(str, self.config.get('hosts'))),
    'queue.buffering.max.messages': 500000,  # is this too small?
    'queue.buffering.max.ms': 60000,  # is this too long?
    'batch.num.messages': 100,  # is this too small?
    'log.connection.close': False,
    'client.id': socket.gethostname(),
    'default.topic.config': {'acks': 'all'}
}
I'm thinking of reducing the max buffering time and increasing the batch size and max messages: perhaps 5000 ms for queue.buffering.max.ms, 250 for batch.num.messages, and 1 million for queue.buffering.max.messages?
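Spelled out as a config dict, the values I'm considering would look roughly like this. It is only a sketch of the proposal, not something I've verified fixes the problem; the hosts list is a hypothetical placeholder for what my code reads from self.config.get('hosts').

```python
import socket

# Hypothetical broker list; in my code this comes from self.config.get('hosts').
hosts = ['kafka1:9092', 'kafka2:9092']

proposed_conf = {
    'bootstrap.servers': ','.join(map(str, hosts)),
    'queue.buffering.max.messages': 1000000,  # up from 500000
    'queue.buffering.max.ms': 5000,           # down from 60000
    'batch.num.messages': 250,                # up from 100
    'log.connection.close': False,
    'client.id': socket.gethostname(),
    'default.topic.config': {'acks': 'all'},
}
```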
The errors aren't constant, so the producer must exceed the buffer while processing, recover, and then exceed it again:
2016-07-07 09:58:42,952 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160002361]
2016-07-07 10:02:55,094 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009744]
2016-07-07 10:02:55,106 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160009744]
2016-07-07 10:02:55,189 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160010014]
2016-07-07 10:02:55,199 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160010014]
2016-07-07 10:02:57,466 - DEBUG - diff_processor.py - Error with lat [None], lon [None] for listing [160009744]
2016-07-07 10:02:57,475 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009744]
2016-07-07 10:08:03,292 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:121]
2016-07-07 10:08:03,311 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:9]
2016-07-07 10:08:04,807 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:1549]
2016-07-07 10:08:04,822 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:8199]
2016-07-07 10:08:08,017 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009089]
2016-07-07 10:08:09,728 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:140009614]
2016-07-07 10:13:17,459 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009935]
2016-07-07 10:13:17,468 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160009935]
2016-07-07 10:13:17,541 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009962]
2016-07-07 10:13:17,550 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160009962]
2016-07-07 10:13:17,565 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160010015]
2016-07-07 10:18:25,977 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160004679]
2016-07-07 10:18:25,985 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160004679]
2016-07-07 10:18:26,012 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160007175]
2016-07-07 10:18:26,021 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160007175]
2016-07-07 10:18:26,044 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160008663]
2016-07-07 10:18:26,053 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160008663]
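To check how bursty the errors are, the log lines above can be tallied per minute and per topic. This is a rough sketch that assumes the log format shown in this excerpt; the function name tally_buffer_errors and the regular expression are my own and only match lines shaped like the ones above.

```python
import re
from collections import Counter

# Matches lines such as:
# 2016-07-07 10:02:55,094 - ERROR - stream_producer.pyc - BufferError publishing topic [topic]:...
LINE_RE = re.compile(
    r'^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} - ERROR - \S+ - '
    r'BufferError publishing topic \[(?P<topic>[^\]]+)\]'
)


def tally_buffer_errors(lines):
    """Count BufferError log lines per minute and per topic."""
    per_minute = Counter()
    per_topic = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip DEBUG lines and anything that is not a BufferError
        per_minute[match.group('minute')] += 1
        per_topic[match.group('topic')] += 1
    return per_minute, per_topic


# A few lines copied from the log above.
sample = [
    '2016-07-07 09:58:42,952 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160002361]',
    '2016-07-07 10:02:55,094 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009744]',
    '2016-07-07 10:02:55,106 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160009744]',
    '2016-07-07 10:02:57,466 - DEBUG - diff_processor.py - Error with lat [None], lon [None] for listing [160009744]',
]
per_minute, per_topic = tally_buffer_errors(sample)
```

Running it over the full log file (for example, passing an open file object as lines) would show whether the errors cluster at particular times.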
My producer class doesn't call flush() like your example client does, since the calling module connects and keeps publishing. I also don't call poll(0) like the example does, but I'm unsure whether that matters.
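In case it helps the discussion, here is a minimal sketch of what I understand the poll() pattern to be. It assumes that produce() raises the built-in BufferError when the local queue is full and that poll() serves delivery reports so queue space can be reclaimed. The helper name produce_with_poll and its max_retries and poll_timeout parameters are my own, not part of the client API, and the producer argument is any object exposing produce() and poll().

```python
def produce_with_poll(producer, topic, value, key=None,
                      max_retries=5, poll_timeout=1.0):
    """Enqueue one message, polling when the local queue is full.

    Returns True if the message was enqueued, or False if the queue
    was still full after max_retries retries.
    """
    for _ in range(max_retries + 1):
        try:
            producer.produce(topic, value=value, key=key)
        except BufferError:
            # Local queue is full: wait in poll() so delivery reports
            # are served, then try the same message again.
            producer.poll(poll_timeout)
            continue
        # Enqueued: do a non-blocking poll to serve any pending
        # delivery reports without slowing down the caller.
        producer.poll(0)
        return True
    return False
```

With this shape the caller can decide what to do when False comes back (log, sleep, or stop consuming) instead of silently dropping the message. Calling flush() once at shutdown would still be needed to drain whatever remains queued.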