@wbingli commented Nov 14, 2019

Issue #, if available: #195

Description of changes: As in #194, the CloudWatch clients are configured with 16 retries. With 16 retries, the backoff alone can take up to 0.5 + 1 + 2 + 4 + 8 + 16 + 20 + 9*20 = 231.5 seconds (about 4 minutes). That can also cause the Lambda function to time out during any CloudWatch throttling or outage.
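The arithmetic above can be reproduced with a small standalone sketch. It assumes each retry's delay ceiling doubles from the 500 ms throttled base delay and is capped at the 20 s max backoff; the class and method names (`BackoffCeiling`, `delayCeilingsMillis`, `totalMillis`) are hypothetical and are not part of the SDK.

```java
import java.util.ArrayList;
import java.util.List;

final class BackoffCeiling {
    // Upper bound on each retry delay: base * 2^attempt, capped at maxMillis.
    static List<Long> delayCeilingsMillis(long baseMillis, long maxMillis, int retries) {
        List<Long> ceilings = new ArrayList<>();
        long delay = baseMillis;
        for (int i = 0; i < retries; i++) {
            ceilings.add(Math.min(delay, maxMillis));
            // Stop doubling once the cap is passed, so the value cannot overflow.
            if (delay < maxMillis) {
                delay *= 2;
            }
        }
        return ceilings;
    }

    // Sum of all delay ceilings, i.e. the worst-case time spent backing off.
    static long totalMillis(List<Long> delays) {
        long total = 0;
        for (long d : delays) {
            total += d;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed inputs: 500 ms throttled base delay, 20 s cap, 16 retries.
        List<Long> ceilings = delayCeilingsMillis(500L, 20_000L, 16);
        System.out.println(ceilings);
        System.out.println(totalMillis(ceilings) + " ms"); // 231500 ms
    }
}
```

This counts backoff sleeps only; the time spent in each failed request would come on top of it.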

SDK Default retry policy: https://sdk.amazonaws.com/java/api/2.0.0/software/amazon/awssdk/core/internal/retry/SdkDefaultRetrySetting.html

Strategy: FullJitterBackoffStrategy
Base delay: Duration.ofMillis(100L)
Throttled base delay (THROTTLED_BASE_DELAY): Duration.ofMillis(500L)
Max backoff: Duration.ofMillis(20000L)
Num retries: 3
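For readers unfamiliar with full jitter, the following is a minimal sketch of the general technique using only the JDK: the delay ceiling grows exponentially up to a cap, and the actual delay is drawn at random between 0 and that ceiling. It is not the SDK's implementation; `FullJitterSketch` and its methods are hypothetical names, and the constants are the ones listed above.

```java
import java.util.Random;

final class FullJitterSketch {
    // Exponential ceiling for a given attempt: base * 2^retriesAttempted, capped at maxMillis.
    static long delayCeilingMillis(long baseMillis, long maxMillis, int retriesAttempted) {
        long ceiling = baseMillis;
        for (int i = 0; i < retriesAttempted && ceiling < maxMillis; i++) {
            ceiling *= 2;
        }
        return Math.min(ceiling, maxMillis);
    }

    // Full jitter: pick a delay uniformly from 0 up to and including the ceiling.
    static long fullJitterDelayMillis(long baseMillis, long maxMillis,
                                      int retriesAttempted, Random random) {
        long ceiling = delayCeilingMillis(baseMillis, maxMillis, retriesAttempted);
        return (long) (random.nextDouble() * (ceiling + 1));
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Assumed settings: throttled base delay 500 ms, max backoff 20 s, 3 retries.
        for (int attempt = 0; attempt < 3; attempt++) {
            long ceiling = delayCeilingMillis(500L, 20_000L, attempt);
            long delay = fullJitterDelayMillis(500L, 20_000L, attempt, random);
            System.out.println("retry " + (attempt + 1) + ": ceiling=" + ceiling
                    + " ms, jittered delay=" + delay + " ms");
        }
    }
}
```

Because each delay is random, the printed delays differ from run to run; only the ceilings (500, 1000, 2000 ms) are fixed.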

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rjlohan (Contributor) commented Nov 14, 2019

I think this is too few retries; we should also reconsider #194.

The default backs off 100, 200, 400 milliseconds, which doesn't even let the handler wait 1 second before it fails hard on throttling errors. 8 retries might be a better sweet spot, giving us ~25 seconds of backoff per handler before it fails with throttling errors.

@wbingli (Author) commented Nov 15, 2019

> I think this is too few retries; we should also reconsider #194.
>
> The default backs off 100, 200, 400 milliseconds, which doesn't even let the handler wait 1 second before it fails hard on throttling errors. 8 retries might be a better sweet spot, giving us ~25 seconds of backoff per handler before it fails with throttling errors.

The base delay is 100 ms, but the default throttling base delay is 500 ms (I forgot to add this to the default retry policy notes). So for throttling, the backoff is (0.5 + 1 + 2) seconds plus some jitter, around 4 seconds in total.
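To put the figures from both comments side by side, here is a small sketch that sums the delay ceilings for the three settings discussed in this thread. It assumes plain doubling from the base delay with a 20 s cap; `RetryBudget` and `ceilingTotalMillis` are hypothetical names, not SDK APIs.

```java
final class RetryBudget {
    // Sum of per-retry delay ceilings: base, 2*base, 4*base, ... each capped at maxMillis.
    static long ceilingTotalMillis(long baseMillis, long maxMillis, int retries) {
        long total = 0;
        long delay = baseMillis;
        for (int i = 0; i < retries; i++) {
            total += Math.min(delay, maxMillis);
            if (delay < maxMillis) {
                delay *= 2;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long max = 20_000L; // max backoff from the PR description
        // 3 retries at the 100 ms base delay
        System.out.println("3 retries, 100 ms base: " + ceilingTotalMillis(100L, max, 3) + " ms");
        // 3 retries at the 500 ms throttled base delay
        System.out.println("3 retries, 500 ms base: " + ceilingTotalMillis(500L, max, 3) + " ms");
        // 8 retries at the 100 ms base delay
        System.out.println("8 retries, 100 ms base: " + ceilingTotalMillis(100L, max, 8) + " ms");
    }
}
```

Under these assumptions the totals are 700 ms, 3500 ms, and 25500 ms. If full jitter draws each delay from below its ceiling, the actual waits would not exceed these totals.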

I don't think more retries make for a better policy. The risk is that they increase load, which can make matters significantly worse and delay service recovery. The service also has multiple layers, and each layer already retries, so in most cases another retry doesn't help. Even for throttling, retrying more times is selfish for a single resource: the overall success rate across resources can end up lower, because the added load prolongs throttling for all of them.

@rjlohan closed this Nov 15, 2019