-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Do not abort SAR loop on throttling #2240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
except ClientError as e: | ||
LOG.exception(e) | ||
raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the exception get logged if this is raised and not caught? Any specific reason we added this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered the same thing -- I added this here to preserve the existing behavior. I removed LOG.exception
from _sar_service_call
as throttles are not really exceptions for GetCloudFormation
. I can do a bit more digging to see if we need that extra log.
) | ||
except ClientError as e: | ||
error_code = e.response["Error"]["Code"] | ||
if error_code == "TooManyRequestsException": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only catch this here? Can't this happen in the create_cloud_formation_template
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably can -- we just haven't seen it in any availability dips. There's no loop around the create at the moment so we'd either have to add a loop (to _sar_service_call
if we want it to be common) or consider throttling an error.
while (time() - start_time) < self.TEMPLATE_WAIT_TIMEOUT_SECONDS: | ||
temp = self._in_progress_templates | ||
self._in_progress_templates = [] | ||
throttled = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we introduce some jitter?
Should we update our Boto config to be standard (default is legacy)? Seems like that provides a better default retry behavior and I think standard is the newer recommendation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should -- I'd suggest doing that in a future PR so we're introducing one change at a time. I might have misread the botocore code but I thought even the legacy retry had some jitter built into it.
""" | ||
if self._wait_for_template_active_status and not self._validate_only: | ||
start_time = time() | ||
while (time() - start_time) < self.TEMPLATE_WAIT_TIMEOUT_SECONDS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This resets the throttle
flag. Is the idea that the 2s sleep will give us enough break from calling that we can safely call SAR again? Might be worth commenting about line 319 to make it clear. This method is pretty hard to parse (lots of nesting).
def setUp(self): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove this, since it is only a pass
.
Codecov Report
@@ Coverage Diff @@
## develop #2240 +/- ##
===========================================
+ Coverage 93.58% 94.47% +0.89%
===========================================
Files 90 95 +5
Lines 6124 6610 +486
Branches 1260 1333 +73
===========================================
+ Hits 5731 6245 +514
+ Misses 183 170 -13
+ Partials 210 195 -15
Continue to review full report at Codecov.
|
if not throttled: | ||
response = self._sar_service_call( | ||
get_cfn_template, application_id, application_id, template_id | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since now we manually skip boto3 calls when throttled, it looks like it will go to next iteration immediately, which is likely to result in another throttle. Do you think we should add sleep
at the end of the iteration if throttled
is True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should (will?) refactor this code to make it easier to read. Once throttled
is True
, we will continue looping but will not make any more SAR calls. We continue looping to in order to call _handle_get_cfn_template_response
. Once the for loop is done, we will sleep(self.SLEEP_TIME_SECONDS)
before going through the while
again, which resets Throttled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, just found the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is it possible we can tune the boto3 sar client's retry strategy instead of having a retrying loop.
What's the difference between these two?
- checking SAR A -> B -> C -> A -> B -> C -> A -> B -> C
- checking SAR A -> A -> A -> B -> B -> B -> C -> C -> C
"template.".format(application_id, template_id, status) | ||
) | ||
raise InvalidResourceException(application_id, message) | ||
self._in_progress_templates.append((application_id, template_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one was hidden here, I like the new change
Discussed the changes with Jacob and his comments are either addressed or can be addressed in future iterations. Jacob is now on vacation, and there are two additional PR approvals in place.
Issue #, if available:
n/a
Description of changes:
Currently, if an API call to Serverless Application Repository (SAR) is throttled while querying for the status of an application, the transform fails. This change allows the transform to sleep and then retry, up to the time limit configured in code.
Description of how you validated changes:
Unit tests were added.
Checklist:
Intrinsic Functions- n/amake pr
passesUpdate documentation- n/aVerify transformed template deploys and application functions as expectedno changes to transformation logicExamples?