Use standard AWS SDK retry policy for CloudWatch/Logs/Events clients #196

wbingli · 2019-11-14T07:16:26Z

Issue #, if available: #195

Description of changes: Same as #194: the CloudWatch clients retry up to 16 times. With 16 retries, the backoff can take up to 0.5 + 1 + 2 + 4 + 8 + 16 + 20 + 9*20 = 231.5 seconds (about 4 minutes). This can also cause the Lambda function to time out during any CloudWatch throttling or outage.
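
The arithmetic above can be checked with a small sketch. It assumes a capped exponential schedule (500 ms throttled base delay, doubling per retry, 20 s cap) and ignores jitter, so it gives a worst-case total. The class and method names are hypothetical and are not part of the AWS SDK.

```java
import java.time.Duration;

/** Sums the worst-case wait of a capped exponential backoff schedule (no jitter). */
final class WorstCaseBackoff {

    /** Delay ceiling before retry `attempt` (0-based): base * 2^attempt, capped at maxBackoff. */
    static Duration ceiling(int attempt, Duration base, Duration maxBackoff) {
        // Clamp the shift so large attempt numbers cannot overflow the long.
        long factor = 1L << Math.min(attempt, 30);
        long millis = Math.min(base.toMillis() * factor, maxBackoff.toMillis());
        return Duration.ofMillis(millis);
    }

    /** Total of the delay ceilings over `retries` retries. */
    static Duration total(int retries, Duration base, Duration maxBackoff) {
        Duration sum = Duration.ZERO;
        for (int attempt = 0; attempt < retries; attempt++) {
            sum = sum.plus(ceiling(attempt, base, maxBackoff));
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumed inputs: 16 retries, 500 ms base, 20 s cap (the figures quoted in this PR).
        Duration total = total(16, Duration.ofMillis(500), Duration.ofSeconds(20));
        System.out.println(total.toMillis() / 1000.0 + " seconds"); // prints "231.5 seconds"
    }
}
```

Six retries fall below the cap (0.5 s through 16 s) and the remaining ten sit at the 20 s cap, which is where the 231.5 second figure comes from.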

SDK Default retry policy: https://sdk.amazonaws.com/java/api/2.0.0/software/amazon/awssdk/core/internal/retry/SdkDefaultRetrySetting.html

Strategy: FullJitterBackoffStrategy
Base delay: Duration.ofMillis(100L)
Throttled base delay (THROTTLED_BASE_DELAY): Duration.ofMillis(500L)
Max backoff: Duration.ofMillis(20000L)
Num retries: 3
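
For readers unfamiliar with full jitter, here is a minimal sketch of the general technique, assuming the delay is a uniform random draw between zero and a capped exponential ceiling. This illustrates the idea only; it is not the SDK's FullJitterBackoffStrategy source, and the class and method names are hypothetical.

```java
import java.util.Random;

/** Illustrative full-jitter delay: uniform in [0, min(cap, base * 2^attempt)]. */
final class FullJitterSketch {

    static long jitteredDelayMillis(int attempt, long baseMillis, long maxBackoffMillis, Random random) {
        // Exponential ceiling, with the shift clamped to avoid overflow, then capped.
        long ceiling = Math.min(baseMillis << Math.min(attempt, 30), maxBackoffMillis);
        // nextDouble() is in [0, 1), so the result is in [0, ceiling].
        return (long) (random.nextDouble() * (ceiling + 1));
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Settings listed above: 3 retries, 100 ms base or 500 ms throttled base, 20 s cap.
        for (int attempt = 0; attempt < 3; attempt++) {
            System.out.printf("retry %d: normal %d ms, throttled %d ms%n",
                    attempt + 1,
                    jitteredDelayMillis(attempt, 100, 20_000, random),
                    jitteredDelayMillis(attempt, 500, 20_000, random));
        }
    }
}
```

Under this assumption each ceiling is an upper bound, so the actual wait per retry is usually shorter than the ceiling.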

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

rjlohan · 2019-11-14T14:08:14Z

I think this is insufficient retries; we should also reconsider #194.

The default will back off 100, 200, 400 milliseconds, which doesn't even let the handler wait for 1 second before it fails hard on throttling errors. 8 retries might be a better sweet spot, giving us ~25 seconds of backoff per handler before it fails with throttling errors.

wbingli · 2019-11-15T06:37:28Z

> I think this is insufficient retries; we should also reconsider #194.
>
> The default will back off 100, 200, 400 milliseconds, which doesn't even let the handler wait for 1 second before it fails hard on throttling errors. 8 retries might be a better sweet spot, giving us ~25 seconds of backoff per handler before it fails with throttling errors.

The base delay is 100 ms, but the default throttling delay is 500 ms (I forgot to add this to the default retry policy above). So for throttling, the backoff will be (0.5 + 1 + 2) seconds plus some jitter time, which is around 4 seconds.
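
To put the numbers from this thread side by side, the sketch below sums the delay ceilings for the three cases discussed: 3 retries at the 100 ms base, 3 retries at the 500 ms throttled base, and 8 retries at the 100 ms base. It assumes doubling per retry and a 20 s cap. The class and method names are hypothetical.

```java
/** Compares upper bounds on total backoff for the retry settings discussed in this thread. */
final class RetryBudgetComparison {

    /** Sum of capped exponential delay ceilings over `retries` retries, in milliseconds. */
    static long upperBoundMillis(int retries, long baseMillis, long maxBackoffMillis) {
        long total = 0;
        long ceiling = baseMillis;
        for (int attempt = 0; attempt < retries; attempt++) {
            total += Math.min(ceiling, maxBackoffMillis);
            // Stop doubling once the ceiling has passed the cap, so it cannot overflow.
            if (ceiling < maxBackoffMillis) {
                ceiling *= 2;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long cap = 20_000;
        int[][] cases = { {3, 100}, {3, 500}, {8, 100} }; // {retries, base delay in ms}
        for (int[] c : cases) {
            System.out.printf("%d retries at %d ms base: up to %.1f s%n",
                    c[0], c[1], upperBoundMillis(c[0], c[1], cap) / 1000.0);
        }
    }
}
```

These sums are 0.7 s, 3.5 s, and 25.5 s respectively. If the jitter is a uniform draw between zero and each ceiling, the average total wait would be roughly half of these figures.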

I don't think more retries make for a better policy. The risk is that they increase load, which can make matters significantly worse and delay service recovery. Also, the service has multiple layers and each layer already retries, so in most cases an additional retry doesn't help. Even for throttling, retrying more times is selfish for a single resource: the overall success rate across all resources could be even lower, because the increased load can prolong throttling for all of them.

Use standard AWS SDK retry policy for CloudWatch/Logs/Events clients

361f501

wbingli requested review from johnttompkins and rjlohan November 14, 2019 07:16

johnttompkins approved these changes Nov 14, 2019

rjlohan closed this Nov 15, 2019
