BufferError [Local] Queue full for producer even after changing librdkafka config #16

Closed
@mikesparr

Ubuntu 14.04.4 LTS / Python 2.7.x / Kafka 0.10 (Confluent Platform 3) / Python client (latest)

  • app servers: 4 core, 32GB RAM, SATA (running python scripts)
  • db servers: 8 core, 64GB RAM, SSD (5-node Kafka/Cassandra cluster + 3-node ES cluster)
  • 10Gb/s private NIC and bare metal stack

I'm seeing a large number of BufferError [Local] Queue full errors in the logs for the Producer client. I searched for the error yesterday and found a librdkafka issue from 2014 that was resolved by changing a few configuration parameters. I posted in that issue and changed my config. The initial errors went away, but as the program ran overnight, a flood of errors filled the logs. Out of 500,000 messages consumed from the topics, over 100,000 are missing from the subsequent topic.

I have a Python stream processor that instantiates both the Consumer and Producer classes and consumes from 6 topics. For each message it performs a diff/upsert against the matching record in the Cassandra cluster, if one exists, and then publishes the diff'ed object to another topic (...ListingEditEvent). Messages are getting lost when it publishes to that subsequent topic. A transformer program then picks up from the ListingEditEvent topic, converts each record to our schema, and publishes to the ListingEditTransformed topic, which Logstash consumes into Elasticsearch. I'm seeing differences between the records in ES and the Kafka topics and am trying to resolve them. I'd appreciate any tips on how to solve this, or better configuration values.

I edited the config for Producer client to the following:

            conf = {
                'bootstrap.servers': ','.join(map(str, self.config.get('hosts'))),
                'queue.buffering.max.messages': 500000, # is this too small?
                'queue.buffering.max.ms': 60000, # is this too long?
                'batch.num.messages': 100, # is this too small?
                'log.connection.close': False,
                'client.id': socket.gethostname(),
                'default.topic.config': {'acks': 'all'}
            }

I'm thinking of reducing the max time and increasing the max messages: perhaps 5000 ms for queue.buffering.max.ms, 250 for batch.num.messages, and 1 million for queue.buffering.max.messages?
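
For reference, the proposed values would look roughly like this as a config dict. This is only a sketch: the helper name build_producer_conf and its hosts argument are hypothetical stand-ins for the class code above, and whether these values actually remove the errors is untested.

```python
import socket


def build_producer_conf(hosts):
    """Hypothetical helper: the config above with the proposed values swapped in."""
    return {
        'bootstrap.servers': ','.join(map(str, hosts)),
        'queue.buffering.max.messages': 1000000,  # proposed: 1 million (was 500000)
        'queue.buffering.max.ms': 5000,           # proposed: 5000 ms (was 60000)
        'batch.num.messages': 250,                # proposed: 250 (was 100)
        'log.connection.close': False,
        'client.id': socket.gethostname(),
        'default.topic.config': {'acks': 'all'},
    }
```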

The errors are not constant, so the producer must exceed the buffer while processing, recover, and then exceed it again:

2016-07-07 09:58:42,952 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160002361]
2016-07-07 10:02:55,094 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009744]
2016-07-07 10:02:55,106 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160009744]
2016-07-07 10:02:55,189 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160010014]
2016-07-07 10:02:55,199 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160010014]
2016-07-07 10:02:57,466 - DEBUG - diff_processor.py - Error with lat [None], lon [None] for listing [160009744]
2016-07-07 10:02:57,475 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009744]
2016-07-07 10:08:03,292 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:121]
2016-07-07 10:08:03,311 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:9]
2016-07-07 10:08:04,807 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:1549]
2016-07-07 10:08:04,822 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:8199]
2016-07-07 10:08:08,017 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009089]
2016-07-07 10:08:09,728 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:140009614]
2016-07-07 10:13:17,459 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009935]
2016-07-07 10:13:17,468 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160009935]
2016-07-07 10:13:17,541 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160009962]
2016-07-07 10:13:17,550 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160009962]
2016-07-07 10:13:17,565 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160010015]
2016-07-07 10:18:25,977 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160004679]
2016-07-07 10:18:25,985 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160004679]
2016-07-07 10:18:26,012 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160007175]
2016-07-07 10:18:26,021 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160007175]
2016-07-07 10:18:26,044 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.ListingEditEvent]:[None]-[nnrmls:160008663]
2016-07-07 10:18:26,053 - ERROR - stream_producer.pyc - BufferError publishing topic [rets.nnrmls.PhotoEditEvent]:[None]-[nnrmls:160008663]
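
In the excerpt above the errors arrive in bursts roughly five minutes apart (09:58, 10:02, 10:08, 10:13, 10:18). A small script along these lines could tally them per minute and per topic across the full log to confirm the pattern. It is a sketch that assumes every line follows the exact format shown above; the function name tally_buffer_errors is made up for illustration.

```python
import re
from collections import Counter

# Matches lines like:
# 2016-07-07 10:02:55,094 - ERROR - stream_producer.pyc - BufferError publishing topic [rets...]:...
LINE_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} - ERROR - \S+ - '
    r'BufferError publishing topic \[([^\]]+)\]'
)


def tally_buffer_errors(lines):
    """Count BufferError log lines per minute and per topic."""
    per_minute = Counter()
    per_topic = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # non-matching lines (e.g. DEBUG output) are skipped
            per_minute[match.group(1)] += 1
            per_topic[match.group(2)] += 1
    return per_minute, per_topic
```

Usage would be something like passing an open log file to tally_buffer_errors and printing the per-minute counts in sorted order.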

My producer class doesn't call flush() like your example client, since the calling module connects and keeps publishing. I also don't call poll(0) like the example does, but I'm unsure whether that matters.
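
For illustration, here is a minimal sketch of what calling poll() around produce() could look like. As far as I understand the Python client, produce() raises BufferError when the local queue is full and delivery reports are only served from poll() or flush(), so never calling either may well be why the queue stays full; treat that as an assumption rather than a confirmed diagnosis. The wrapper name produce_with_backpressure and its max_retries and poll_timeout parameters are hypothetical, not part of the library.

```python
def produce_with_backpressure(producer, topic, value, key=None,
                              max_retries=5, poll_timeout=1.0):
    """Enqueue one message; on BufferError, poll to drain the queue and retry.

    `producer` is expected to behave like confluent_kafka.Producer, i.e. to
    offer produce(topic, value=..., key=...) and poll(timeout).
    Returns True if the message was enqueued, False if every attempt failed.
    """
    for _ in range(max_retries + 1):
        try:
            producer.produce(topic, value=value, key=key)
            # Serve any pending delivery callbacks without blocking.
            producer.poll(0)
            return True
        except BufferError:
            # Local queue is full: wait in poll() so queued messages can be
            # delivered and their reports served, then try again.
            producer.poll(poll_timeout)
    return False
```

A caller would log or re-queue the message when this returns False instead of dropping it, and call flush() once at shutdown so anything still buffered gets sent.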
