fix: prevent sub-batch 413's from blocking whole batch #972

Merged
merged 3 commits into logstash-plugins:master from sub-batch-failures on Apr 6, 2021

Conversation

yaauie
Contributor

@yaauie yaauie commented Oct 6, 2020

The Http Client breaks batches of actions into sub-batches that are up to 20MB
in size, sending larger actions as batches-of-one, and zips the responses
together to emulate a single batch response from the Elasticsearch API.

When an individual HTTP request is rejected by Elasticsearch (or by an
intermediate proxy or load-balancer) with an HTTP/1.1 413, we can emulate
the error response instead of blowing an exception through to the whole batch.
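To make the mechanism concrete, here is a minimal sketch of the idea, not the plugin's actual implementation (the helper names, the `[op, params, source]` action shape, and the exact fields of the per-action error entry are assumptions):

```ruby
require 'json'

# Build a bulk-style response that marks every action in the rejected
# sub-batch with status 413, so it can be zipped together with the responses
# of the other sub-batches as if Elasticsearch itself had returned it.
def emulate_413_response(actions)
  items = actions.map do |op, _params, _source|
    { op.to_s => { "status" => 413,
                   "error"  => { "message" => "Payload Too Large (rejected before reaching the bulk API)" } } }
  end
  { "errors" => true, "items" => items }
end

# Send one sub-batch; on a 413 rejection, return the emulated response instead
# of raising, so only these actions are routed into the retry logic.
def send_sub_batch(http_client, body, actions)
  response = http_client.post("_bulk", body)
  return emulate_413_response(actions) if response.code == 413
  JSON.parse(response.body)
end
```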

Contributor

@andsel andsel left a comment

I was wondering if it's feasible to add a unit test that verifies the new behavior introduced here: overriding TARGET_BULK_BYTES and doubling (stubbing) the LogStash::Outputs::ElasticSearch::HttpClient::Pool client to return a 413 on the post request.
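For illustration, a rough sketch of the kind of spec being suggested might look something like this (the constant's namespace, the response double, and the assertion wording are assumptions, not the spec that was eventually added):

```ruby
describe "bulk sub-batching when a request is rejected with HTTP 413" do
  let(:pool) { instance_double(LogStash::Outputs::ElasticSearch::HttpClient::Pool) }
  let(:rejecting_response) { double("response", :code => 413, :body => "") }

  before do
    # shrink the sub-batch threshold so a handful of events spans several sub-batches
    stub_const("LogStash::Outputs::ElasticSearch::HttpClient::TARGET_BULK_BYTES", 1_024)
    allow(pool).to receive(:post).and_return(rejecting_response)
  end

  it "reports the rejected actions as failed instead of raising" do
    # exercise the client's bulk path with the doubled pool, then assert that
    # the combined response marks the affected actions with status 413
  end
end
```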

@yaauie yaauie force-pushed the sub-batch-failures branch 3 times, most recently from c6a1006 to 2cd4843 on October 30, 2020 15:25
@yaauie
Contributor Author

yaauie commented Oct 30, 2020

@andsel I've added specs and improved logging some more.

@yaauie yaauie force-pushed the sub-batch-failures branch 2 times, most recently from 6475020 to 0f601cd on November 6, 2020 21:13
Contributor

@andsel andsel left a comment

I've noticed only a minor repetition in the Changelog.
The changes seem OK to me, but I don't understand why it fails on Travis.

@yaauie yaauie force-pushed the sub-batch-failures branch from 0f601cd to 9a43f31 on November 11, 2020 15:51
@andsel andsel self-requested a review November 12, 2020 08:19
Contributor

@andsel andsel left a comment

LGTM

@andsel andsel assigned yaauie and unassigned andsel Jan 15, 2021
yaauie added 2 commits April 2, 2021 15:08
The Http Client breaks batches of actions into sub-batches that are up to 20MB
in size, sending larger actions as batches-of-one, and zips the responses
together to emulate a single batch response from the Elasticsearch API.

When an individual HTTP request is rejected by Elasticsearch (or by an
intermediate proxy or load-balancer) with an HTTP/1.1 413, we can emulate
the error response instead of blowing an exception through to the whole batch.
This allows only the offending events/actions to be subject to retry logic.

Along the way, we improve logging at the `debug` level for sub-batches, and
emit clear `warn`-level logs with payload sizes when we hit HTTP 413 rejections.
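As a rough illustration of that last point, the warn-level log might carry the payload size alongside the rejection; the message text and field names below are illustrative, not the plugin's actual log keys:

```ruby
@logger.warn("Bulk request rejected with HTTP 413 Payload Too Large; emulating per-action failures",
             :action_count => actions.size,
             :payload_size_bytes => body.bytesize)
```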
@yaauie yaauie force-pushed the sub-batch-failures branch from d970253 to b2ef074 on April 2, 2021 17:24
@yaauie
Contributor Author

yaauie commented Apr 2, 2021

Reviving, and expanding the scope to address a situation where we could send a <20MB payload that expanded beyond Elasticsearch's 100MB buffers (#823). Bulk grouping is now entirely determined by decompressed size.
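A minimal sketch of what grouping by decompressed size means in practice (the threshold constant and helper name are assumptions; compression, where enabled, would only be applied to each sub-batch after the grouping decision has been made):

```ruby
TARGET_BULK_BYTES = 20 * 1024 * 1024 # illustrative threshold

# Group serialized bulk lines into sub-batches whose *uncompressed* size stays
# under the target, so a highly compressible payload can no longer slip past
# the size check and then expand beyond Elasticsearch's buffers.
def group_into_sub_batches(serialized_actions)
  batches, current, current_bytes = [], [], 0
  serialized_actions.each do |line|
    if current_bytes + line.bytesize > TARGET_BULK_BYTES && !current.empty?
      batches << current
      current, current_bytes = [], 0
    end
    current << line
    current_bytes += line.bytesize
  end
  batches << current unless current.empty?
  batches
end
```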

@yaauie yaauie requested a review from andsel April 2, 2021 17:26
@yaauie
Contributor Author

yaauie commented Apr 3, 2021

Failing tests on 8.0-SNAPSHOT are unrelated -- it looks like the integration tests clean up ILM policies without first cleaning up the indices governed by those policies, which will be an error in 8.x. I'll see whether the fix is trivial; otherwise I'll file a separate issue to chase that down outside this PR.

Contributor

@andsel andsel left a comment

LGTM

The default value of Elasticsearch's `action.destructive_requires_name` has
changed to true in elastic/elasticsearch#66908, which causes our integration
tests' wildcard deletes to fail. By specifying this config explicitly, we
ensure the desired behaviour is selected.
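For reference, pinning the pre-8.x behaviour explicitly amounts to a one-line setting; exactly where it lives in this plugin's integration-test environment isn't shown in this thread, so the placement below is an assumption:

```yaml
# elasticsearch.yml (or the equivalent environment override in the test setup):
# restore the pre-8.x default so wildcard index deletes keep working.
action.destructive_requires_name: false
```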
@yaauie
Contributor Author

yaauie commented Apr 6, 2021

Theoretically we have a fix in place for the failing integration tests against the ES 8 snapshot -- they recently changed the default value of action.destructive_requires_name to true in elastic/elasticsearch#66908, which causes our attempt to wildcard-delete the indices prior to removing the ILM policies to fail.

Once CI goes green, I plan to merge.

@yaauie yaauie merged commit b9c66b5 into logstash-plugins:master Apr 6, 2021
@yaauie yaauie deleted the sub-batch-failures branch April 6, 2021 23:09