BufferError: Local: Queue full #18624

Closed · stayallive opened this issue Mar 31, 2020 · 15 comments · Fixed by #18644

@stayallive

I am receiving this error once every 2-4 days and I need to restart Sentry to fix it. This started after moving to the Docker version of Sentry.

I never noticed this being an issue on 9.1.2, which also had Clickhouse and Snuba running, but without Kafka.

https://observ.app/share/issue/4e4f208a500d48cc898770930706959a/

I am not sure where to look/poke/monitor to see the queue in question, or how I can flush or enlarge it if needed.

`sentry queues list` showed all zeroes, so it doesn't look like there is a massive backlog of events.

Any help is appreciated!

@stayallive
Author

It seems like `.poll()` should be called somewhere to flush this queue, according to confluentinc/confluent-kafka-dotnet#703 (comment).

This may already be happening in other parts of the code, but currently it seems like I have to restart Sentry every X messages.

@petrprikryl

I am facing exactly the same thing after 40 hours of uptime. Every event produces this error. All queues from `sentry queues list` are empty.

```
06:37:35 [ERROR] sentry: Local: Queue full
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 406, in dispatch
    request, helper, project_config, origin=origin, *args, **kwargs
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 492, in _dispatch
    **kwargs
  File "/usr/local/lib/python2.7/site-packages/django/views/generic/base.py", line 88, in dispatch
    return handler(request, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 559, in post
    event_id = self.process(request, data=data, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 638, in process
    event_manager, project, key, remote_addr, helper, attachments, project_config
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 245, in process_event
    category=data_category,
  File "/usr/local/lib/python2.7/site-packages/sentry/utils/outcomes.py", line 162, in track_outcome
    "quantity": quantity,
  File "/usr/local/lib/python2.7/site-packages/sentry/utils/pubsub.py", line 75, in publish
    self.producer.produce(topic=channel, value=value, key=key)
BufferError: Local: Queue full
```

So could anybody add a `poll()` call here: https://github.com/getsentry/sentry/blob/master/src/sentry/utils/pubsub.py#L75?

Some more metadata from the last frame:

```
channel | 'outcomes'
key | None
self | <sentry.utils.pubsub.KafkaPublisher object at 0x7fb9e0ef4e90>
value | '{"category":0,"reason":"key_quota","outcome":2,"key_id":2,"event_id":"f394b43b73bf4e6994f4d34b18a60ecc","timestamp":"2020-04-08T06:01:25.732754Z","project_id":2,"org_id":1,"quantity":1}'
```

@mattrobenolt
Contributor

I’m not entirely sure why you’re getting it, but it’s being misdiagnosed. The local queue can’t flush because it’s unable to write to Kafka. Why it can’t write to Kafka, I don’t know. But that’s the issue. The only reason that queue will get full is if the broker is down or unable to accept data or something along those lines. So there’s nothing wrong here or anything wrong with what we’re doing.

@mattrobenolt
Contributor

Also, the “queue” in this case refers to an in-memory queue that is flushed/written to Kafka, since writes are async. Nothing will show up in `sentry queues list` since that is explicitly for the Celery queues, i.e. RabbitMQ or Redis.

@petrprikryl

> I’m not entirely sure why you’re getting it, but it’s being misdiagnosed. The local queue can’t flush because it’s unable to write to Kafka. Why it can’t write to Kafka, I don’t know. But that’s the issue. The only reason that queue will get full is if the broker is down or unable to accept data or something along those lines. So there’s nothing wrong here or anything wrong with what we’re doing.

I don't see any error logs in the Kafka container, and restarting only the web container resolves the problem temporarily. So the problem must be in the web container, in the Kafka client's internal queue described in confluentinc/confluent-kafka-dotnet#703 (comment):

> ...in python you need to do this (this is automatic in .net), otherwise the internal queue will fill up with delivery reports, and they'll never get removed.

Another thing I have noticed is that all events are processed by Sentry just fine: I can see them on the Sentry issue detail page. So from within Sentry everything looks like it is working, even though the Queue full error is bombarding the web container logs.
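
A minimal sketch of that failure mode, assuming a reachable broker at `localhost:9092` and a hypothetical topic name (this is not Sentry's actual producer setup):

```python
# Minimal repro sketch of the failure mode described above.
# Assumes a reachable broker at localhost:9092 and a hypothetical topic;
# this is not Sentry's actual producer code.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    # Keep the local queue small so the error shows up quickly;
    # with the default limit it can take days of uptime, as reported above.
    "queue.buffering.max.messages": 1000,
})

for i in range(100_000):
    try:
        # Without ever calling producer.poll(), delivery reports for messages
        # that Kafka has already accepted are never served, so they keep
        # occupying the local queue until produce() raises BufferError.
        producer.produce("outcomes-test", value=b"payload %d" % i)
    except BufferError:
        print("BufferError: Local: Queue full after %d messages" % i)
        break
```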

@petrprikryl

I have added `self.producer.poll(0)` above the `self.producer.produce(topic=channel, value=value, key=key)` line and the problem seems to be resolved. No more buffer errors.

Btw. https://github.com/getsentry/sentry/blob/master/src/sentry/eventstream/kafka/backend.py#L52
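
For reference, a sketch of where the call goes. The names are modeled loosely on `sentry/utils/pubsub.py`'s `KafkaPublisher`, so treat it as an illustration rather than the exact patch:

```python
# Sketch of the fix: serve pending delivery reports before each produce.
# Loosely modeled on sentry/utils/pubsub.py's KafkaPublisher; the constructor
# signature here is an assumption for illustration only.
from confluent_kafka import Producer

class KafkaPublisher:
    def __init__(self, connection):
        # `connection` is a librdkafka config dict, e.g.
        # {"bootstrap.servers": "localhost:9092"}
        self.producer = Producer(connection)

    def publish(self, channel, value, key=None):
        # The added line: poll(0) is non-blocking and serves any queued
        # delivery-report callbacks, freeing space in the local queue.
        self.producer.poll(0)
        self.producer.produce(topic=channel, value=value, key=key)
```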

@BYK BYK reopened this Apr 22, 2020
@BYK BYK self-assigned this Apr 22, 2020
@BYK
Member

BYK commented Apr 22, 2020

@petrprikryl can you tell us if you see any outcomes data on your instance? You should see something under the path /organizations/<org name>/stats/. You can also click on the "Stats" link in the UI.

@lucas-zimerman
Contributor

lucas-zimerman commented Apr 29, 2020

@BYK it happened to us today. The errors came exclusively from a specific project, and it's the only one that uses filters to reject events. At the moment it happened there were 13,096 filtered events vs 12,862 queue-full events.

Note: We are using an Error Message inbound filter to filter the messages.

@lucas-zimerman
Contributor

@BYK I validated it again (because it happened again), and removing the filters seems to have fixed it, at the cost of receiving spam. I feel like using the Discarded Feature will "fix" it, but it would be nice to see why such errors happen when you get something around 3.5K filtered events within minutes.

@BYK BYK transferred this issue from getsentry/self-hosted May 5, 2020
@BYK
Member

BYK commented May 5, 2020

@ibm5155 it seems like this is indeed a missing `poll(0)` call, as @petrprikryl suggested above. I'll verify and submit a patch.

@mattrobenolt
Contributor

Why don’t we have an issue with this ourselves in production? I’d be skeptical of slapping down a patch without understanding that.

cc @getsentry/sns

@tkaemming
Contributor

We do see this in Snuba from time to time, but only with a producer that follows a similar usage pattern to the one in Sentry. We haven't seen it in any producers that explicitly call flush (the main consumer calls flush immediately after producing batches of replacements; subscriptions continually call poll in a separate thread).

@BYK
Member

BYK commented May 6, 2020

> Why don’t we have an issue with this ourselves in production?

We simply may have more resources at hand. Also, AFAIK Relay would mitigate most of the issue for outcomes. /cc @jan-auer

> I’d be skeptical of slapping down a patch without understanding that.

There are multiple issues on the Kafka client repo itself strongly recommending the use of `poll(0)` before or after a publish.

> We haven't seen this in any producers that explicitly call flush (the main consumer calls flush immediately after producing batches of replacements)

This essentially makes the call synchronous, as `flush()` calls `poll()` until the pending queue is empty, so that makes sense.

> subscriptions continually calls poll in a separate thread.

This is also what the Rust lib does, from what I heard from @jan-auer, which also supports adding the `poll(0)` call.
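
For illustration, a rough sketch of that background-poll pattern as a hypothetical wrapper (only the `confluent_kafka` Producer calls are the real API; the class itself is not code from Sentry, Snuba, or Relay):

```python
# Hypothetical wrapper showing the "poll in a background thread" pattern
# described above; not code from Sentry, Snuba, or Relay.
import threading
from confluent_kafka import Producer

class PollingProducer:
    def __init__(self, config):
        self.producer = Producer(config)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll_loop, daemon=True)
        self._thread.start()

    def _poll_loop(self):
        # Continuously serve delivery reports so the local queue never
        # fills up, no matter how the producer is used.
        while not self._stop.is_set():
            self.producer.poll(0.1)

    def produce(self, topic, value, key=None):
        self.producer.produce(topic, value=value, key=key)

    def close(self):
        self._stop.set()
        self._thread.join()
        # The synchronous alternative: flush() keeps calling poll() until the
        # pending queue drains, which is why producers that flush after each
        # batch never hit this error.
        self.producer.flush()
```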

BYK added a commit that referenced this issue May 6, 2020
Fixes #18624.

Kafka needs `poll()` to be called at regular intervals to clear its in-memory buffer and trigger any callbacks for producers. This patch adds the missing `poll(0)` call, which is essentially free, to the main `KafkaProducer` class, mainly affecting the `track_outcomes` producer, as the other user uses the synchronous mode, calling `flush()`, which already calls `poll()` behind the scenes.
@jan-auer
Member

jan-auer commented May 6, 2020

FWIW, the Kafka library we use in Relay also calls poll at regular intervals from a background thread.

> AFAIK relay would mitigate most of the issue for outcomes.

Tiny clarification and disclaimer: I do not know what the root cause of this is, that is, why sending the message is rejected. With the introduction of Relay to onpremise, we've also added an outcomes consumer. So if the problem is that Kafka no longer accepts messages because it fills up or the offset gets too large, then that added consumer will help.

@BYK
Member

BYK commented May 6, 2020

> So if the problem is that Kafka no longer accepts messages because it fills up or the offset gets too large, then that added consumer will help.

I think this is happening even with that consumer.

@BYK BYK closed this as completed in #18644 May 6, 2020
@github-actions github-actions bot locked and limited conversation to collaborators Dec 18, 2020