BufferError: Local: Queue full #18624

Closed · stayallive opened this issue Mar 31, 2020 · 15 comments · Fixed by #18644

@stayallive

I am receiving this error once every 2-4 days and I need to restart Sentry to fix it. This started after moving to the Docker version of Sentry.

I never noticed this being an issue on 9.1.2, which also had Clickhouse and Snuba running, but without Kafka.

https://observ.app/share/issue/4e4f208a500d48cc898770930706959a/

I am not sure where to look/poke/monitor to see the queue in question, or how I can flush or enlarge it if needed.

`sentry queues list` showed all zeroes, so it doesn't look like there is a massive backlog of events.

Any help is appreciated!

@stayallive
Author

It seems like `.poll()` should be called somewhere to flush this queue, according to confluentinc/confluent-kafka-dotnet#703 (comment).

This may already be happening in other parts of the code, but currently it seems like I have to restart Sentry every X messages.

@petrprikryl

I am facing exactly the same thing after 40 hours of uptime. Every event produces this error. All queues from `sentry queues list` are empty.

```
06:37:35 [ERROR] sentry: Local: Queue full
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 406, in dispatch
    request, helper, project_config, origin=origin, *args, **kwargs
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 492, in _dispatch
    **kwargs
  File "/usr/local/lib/python2.7/site-packages/django/views/generic/base.py", line 88, in dispatch
    return handler(request, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 559, in post
    event_id = self.process(request, data=data, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 638, in process
    event_manager, project, key, remote_addr, helper, attachments, project_config
  File "/usr/local/lib/python2.7/site-packages/sentry/web/api.py", line 245, in process_event
    category=data_category,
  File "/usr/local/lib/python2.7/site-packages/sentry/utils/outcomes.py", line 162, in track_outcome
    "quantity": quantity,
  File "/usr/local/lib/python2.7/site-packages/sentry/utils/pubsub.py", line 75, in publish
    self.producer.produce(topic=channel, value=value, key=key)
BufferError: Local: Queue full
```

So could anybody add a `poll()` call here: https://github.com/getsentry/sentry/blob/master/src/sentry/utils/pubsub.py#L75?

Some more metadata from the last frame:

```
channel | 'outcomes'
key | None
self | <sentry.utils.pubsub.KafkaPublisher object at 0x7fb9e0ef4e90>
value | '{"category":0,"reason":"key_quota","outcome":2,"key_id":2,"event_id":"f394b43b73bf4e6994f4d34b18a60ecc","timestamp":"2020-04-08T06:01:25.732754Z","project_id":2,"org_id":1,"quantity":1}'
```

@mattrobenolt
Contributor

I’m not entirely sure why you’re getting it, but it’s being misdiagnosed. The local queue can’t flush because it’s unable to write to Kafka. Why it can’t write to Kafka, I don’t know. But that’s the issue. The only reason that queue will get full is if the broker is down or unable to accept data or something along those lines. So there’s nothing wrong here or anything wrong with what we’re doing.

@mattrobenolt
Contributor

Also, the “queue” in this case refers to an in-memory queue that is flushed/written to Kafka, since writes are async. Nothing will show up in `sentry queues list` since that is explicitly for the Celery queues, i.e. RabbitMQ or Redis.

@petrprikryl

> I’m not entirely sure why you’re getting it, but it’s being misdiagnosed. The local queue can’t flush because it’s unable to write to Kafka. Why it can’t write to Kafka, I don’t know. But that’s the issue. The only reason that queue will get full is if the broker is down or unable to accept data or something along those lines. So there’s nothing wrong here or anything wrong with what we’re doing.

I don't see any error logs in the Kafka container, and restarting only the web container resolves the problem temporarily. So the problem must be in the web container, in the Kafka client's internal queue described in confluentinc/confluent-kafka-dotnet#703 (comment):

> ...in python you need to do this (this is automatic in .net), otherwise the internal queue will fill up with delivery reports, and they'll never get removed.

Another thing I have noticed is that all events are processed by Sentry just fine: I can see them on the Sentry issue detail page. So from within Sentry everything looks like it is working, even though the Queue full error is bombarding the web container logs.
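
A minimal sketch of that failure mode, assuming a reachable broker at `localhost:9092` and a hypothetical topic name (this is not Sentry's actual producer setup):

```python
# Minimal repro sketch of the failure mode described above.
# Assumes a reachable broker at localhost:9092 and a hypothetical topic;
# this is not Sentry's actual producer code.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    # Keep the local queue small so the error shows up quickly;
    # with the default limit it can take days of uptime, as reported above.
    "queue.buffering.max.messages": 1000,
})

for i in range(100_000):
    try:
        # Without ever calling producer.poll(), delivery reports for messages
        # that Kafka has already accepted are never served, so they keep
        # occupying the local queue until produce() raises BufferError.
        producer.produce("outcomes-test", value=b"payload %d" % i)
    except BufferError:
        print("BufferError: Local: Queue full after %d messages" % i)
        break
```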

@petrprikryl

I have added `self.producer.poll(0)` above the `self.producer.produce(topic=channel, value=value, key=key)` line and the problem seems to be resolved. No more buffer errors.

Btw. https://github.com/getsentry/sentry/blob/master/src/sentry/eventstream/kafka/backend.py#L52
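
For reference, a sketch of where the call goes. The names are modeled loosely on `sentry/utils/pubsub.py`'s `KafkaPublisher`, so treat it as an illustration rather than the exact patch:

```python
# Sketch of the fix: serve pending delivery reports before each produce.
# Loosely modeled on sentry/utils/pubsub.py's KafkaPublisher; the constructor
# signature here is an assumption for illustration only.
from confluent_kafka import Producer

class KafkaPublisher:
    def __init__(self, connection):
        # `connection` is a librdkafka config dict, e.g.
        # {"bootstrap.servers": "localhost:9092"}
        self.producer = Producer(connection)

    def publish(self, channel, value, key=None):
        # The added line: poll(0) is non-blocking and serves any queued
        # delivery-report callbacks, freeing space in the local queue.
        self.producer.poll(0)
        self.producer.produce(topic=channel, value=value, key=key)
```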

@BYK BYK reopened this Apr 22, 2020
@BYK BYK self-assigned this Apr 22, 2020
@BYK
Member

BYK commented Apr 22, 2020

@petrprikryl can you tell us if you see any outcomes data on your instance? You should see something under the path /organizations/<org name>/stats/. You can also click on the "Stats" link in the UI.

@lucas-zimerman
Contributor

lucas-zimerman commented Apr 29, 2020

@BYK it happened to us today. The errors came exclusively from a specific project, and it's the only one that uses filters to reject events. At the moment it happened there were 13,096 filtered events vs 12,862 queue-full events.

Note: We are using an Error Message inbound filter to filter the messages.

@lucas-zimerman
Contributor

@BYK I validated it again (because it happened again), and removing the filters seems to have fixed it, at the cost of receiving spam. I feel like using the Discarded Feature will "fix" it, but it would be nice to see why such errors happen when you get something around 3.5K filtered events within minutes.

@BYK BYK transferred this issue from getsentry/self-hosted May 5, 2020
@BYK
Member

BYK commented May 5, 2020

@ibm5155 it seems like this is indeed a missing `poll(0)` call, as @petrprikryl suggested above. I'll verify and submit a patch.

@mattrobenolt
Contributor

Why don’t we have an issue with this ourselves in production? I’d be skeptical of slapping down a patch without understanding that.

cc @getsentry/sns

@tkaemming
Contributor

We do see this in Snuba from time to time, but only with a producer that follows a similar usage pattern to the one in Sentry. We haven't seen it in any producers that explicitly call flush (the main consumer calls flush immediately after producing batches of replacements; subscriptions continually call poll in a separate thread).

@BYK
Member

BYK commented May 6, 2020

> Why don’t we have an issue with this ourselves in production?

We simply may have more resources at hand. Also, AFAIK Relay would mitigate most of the issue for outcomes. /cc @jan-auer

> I’d be skeptical of slapping down a patch without understanding that.

There are multiple issues on the Kafka client repo itself strongly recommending the use of `poll(0)` before or after a publish.

> We haven't seen this in any producers that explicitly call flush (the main consumer calls flush immediately after producing batches of replacements)

This essentially makes the call synchronous, as `flush()` calls `poll()` until the pending queue is empty, so that makes sense.

> subscriptions continually calls poll in a separate thread.

This is also what the Rust lib does, from what I heard from @jan-auer, which also supports adding the `poll(0)` call.
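
For illustration, a rough sketch of that background-poll pattern as a hypothetical wrapper (only the `confluent_kafka` Producer calls are the real API; the class itself is not code from Sentry, Snuba, or Relay):

```python
# Hypothetical wrapper showing the "poll in a background thread" pattern
# described above; not code from Sentry, Snuba, or Relay.
import threading
from confluent_kafka import Producer

class PollingProducer:
    def __init__(self, config):
        self.producer = Producer(config)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll_loop, daemon=True)
        self._thread.start()

    def _poll_loop(self):
        # Continuously serve delivery reports so the local queue never
        # fills up, no matter how the producer is used.
        while not self._stop.is_set():
            self.producer.poll(0.1)

    def produce(self, topic, value, key=None):
        self.producer.produce(topic, value=value, key=key)

    def close(self):
        self._stop.set()
        self._thread.join()
        # The synchronous alternative: flush() keeps calling poll() until the
        # pending queue drains, which is why producers that flush after each
        # batch never hit this error.
        self.producer.flush()
```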

BYK added a commit that referenced this issue May 6, 2020
Fixes #18624.

Kafka needs `poll()` to be called at regular intervals to clear its in-memory buffer and trigger any callbacks for producers. This patch adds the missing `poll(0)` call, which is essentially free, to the main `KafkaProducer` class, mainly affecting the `track_outcomes` producer, as the other user uses the synchronous mode, calling `flush()`, which already calls `poll()` behind the scenes.
@jan-auer
Member

jan-auer commented May 6, 2020

FWIW, the Kafka library we use in Relay also calls poll at regular intervals from a background thread.

> AFAIK relay would mitigate most of the issue for outcomes.

Tiny clarification and disclaimer: I do not know what the root cause of this is, that is, why sending the message is rejected. With the introduction of Relay to onpremise, we've also added an outcomes consumer. So if the problem is that Kafka no longer accepts messages because it fills up or the offset gets too large, then that added consumer will help.

@BYK
Member

BYK commented May 6, 2020

> So if the problem is that Kafka no longer accepts messages because it fills up or the offset gets too large, then that added consumer will help.

I think this is happening even with that consumer.

@BYK BYK closed this as completed in #18644 May 6, 2020
@github-actions github-actions bot locked and limited conversation to collaborators Dec 18, 2020