Retry flush metrics from ThreadStats to Datadog over RemoteDisconnected errors. #138
What does this PR do?
Retry metric flushing from ThreadStats to Datadog on RemoteDisconnected errors. The root issue, https://hg.python.org/cpython/rev/eba80326ba53, has been discussed extensively for years, but from what I can tell no fix was implemented at any level. The main reason not to fix it (e.g., by retrying at the HTTP client level) seems to be the concern that POST requests are often not idempotent and not always safe to retry, which is generally true. But according to my tests (one submission per minute for 2 days), all the failures resulted in missing data and were therefore safe to retry. The testing function has been running for a day and I have not seen any duplicate data points.
Based on the lengthy discussion on the internet over this particular issue, I don't think we can really wait for a true/better fix in the underlying libraries. The benefits of moving forward with a retry far outweigh the potential concerns over idempotency.
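As a rough illustration of the retry idea described above, here is a minimal sketch. It is not the code in this PR: `flush_with_retry`, its `flush` callable, and `max_attempts` are hypothetical names, and it assumes the error surfaces directly as `http.client.RemoteDisconnected` (an HTTP library may wrap it in its own exception type, in which case the check would need to unwrap it).

```python
import logging
from http.client import RemoteDisconnected

logger = logging.getLogger(__name__)


def flush_with_retry(flush, max_attempts=2):
    """Call flush(); retry only when the server dropped the connection.

    `flush` is a hypothetical zero-argument callable that submits metrics.
    Returns True if a call succeeded, False otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            flush()
            return True
        except RemoteDisconnected:
            # The server closed the connection without sending a response;
            # log it and let the loop try again if attempts remain.
            logger.debug("Flush attempt %d hit RemoteDisconnected", attempt)
        except Exception:
            # Any other failure is not retried; log it for debugging.
            logger.exception("Flush failed with a non-retryable error")
            return False
    return False
```

Only `RemoteDisconnected` is retried here because the idempotency concern applies to POST failures in general; limiting the retry to this one error keeps the change narrow.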
Specific changes are:
Retry on RemoteDisconnected, otherwise log the exception for debugging
Motivation
Metric data points are lost sporadically when being submitted synchronously from the Datadog Lambda library to the Datadog API (i.e., without using the Forwarder Lambda or Lambda Extension). The specific error reads as follows:
Testing Guidelines
Additional Notes
Types of Changes
Check all that apply