marekaiv
Contributor

Issue #, if available:
n/a

Description of changes:
Currently, if an API call to Serverless Application Repository (SAR) is throttled while querying for the status of an application, the transform fails. This change allows the transform to sleep and then retry, up to the time limit configured in code.

Description of how you validated changes:
Unit tests were added.

Checklist:

  • Add/update tests using:
    • Correct values
    • Bad/wrong values (None, empty, wrong type, length, etc.)
    • Intrinsic Functions - n/a
  • make pr passes
  • Update documentation - n/a
  • Verify transformed template deploys and application functions as expected - no changes to transformation logic

Examples?

jfuss previously requested changes Nov 23, 2021
Comment on lines 183 to 185
except ClientError as e:
    LOG.exception(e)
    raise e
Contributor

Doesn't the exception get logged if this is raised and not caught? Any specific reason we added this here?

Contributor Author

I wondered the same thing -- I added this here to preserve the existing behavior. I removed LOG.exception from _sar_service_call as throttles are not really exceptions for GetCloudFormation. I can do a bit more digging to see if we need that extra log.

)
except ClientError as e:
    error_code = e.response["Error"]["Code"]
    if error_code == "TooManyRequestsException":
Contributor

Why only catch this here? Can't this happen in the create_cloud_formation_template?

Contributor Author

It probably can -- we just haven't seen it in any availability dips. There's no loop around the create at the moment so we'd either have to add a loop (to _sar_service_call if we want it to be common) or consider throttling an error.
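As a rough illustration of the "common loop" option mentioned above, here is a minimal sketch of a wrapper that retries only on throttling and re-raises everything else. It is not the code in this PR: `FakeClientError` is a hypothetical stand-in for botocore's `ClientError`, and `call_with_throttle_retry`, `timeout_seconds`, and `sleep_seconds` are names and defaults assumed for the example.

```python
import time


class FakeClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""

    def __init__(self, code):
        super().__init__(code)
        # Mirrors the response shape read in the diff: e.response["Error"]["Code"]
        self.response = {"Error": {"Code": code}}


def call_with_throttle_retry(
    service_call,
    *args,
    timeout_seconds=30.0,  # assumed value, not taken from the PR
    sleep_seconds=2.0,     # assumed value, not taken from the PR
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call service_call(*args), retrying on throttles until the deadline."""
    deadline = clock() + timeout_seconds
    while True:
        try:
            return service_call(*args)
        except FakeClientError as e:
            code = e.response["Error"]["Code"]
            # Anything other than a throttle is a real error: re-raise it.
            # Also give up if another sleep would overrun the deadline.
            if code != "TooManyRequestsException" or clock() + sleep_seconds > deadline:
                raise
            sleep(sleep_seconds)
```

Passing `sleep` and `clock` as parameters keeps the sketch testable without real waiting; whether create calls should be retried this way at all is the open question in this thread.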

while (time() - start_time) < self.TEMPLATE_WAIT_TIMEOUT_SECONDS:
    temp = self._in_progress_templates
    self._in_progress_templates = []
    throttled = False
Contributor

Should we introduce some jitter?

Should we update our Boto config to be standard (default is legacy)? Seems like that provides a better default retry behavior and I think standard is the newer recommendation.

Contributor Author

Yes, we should -- I'd suggest doing that in a future PR so we're introducing one change at a time. I might have misread the botocore code but I thought even the legacy retry had some jitter built into it.
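For reference, one common way to add jitter to a sleep between retries is the "full jitter" scheme: pick a random delay between zero and an exponentially growing cap. The sketch below is a generic illustration only; it does not describe what botocore's legacy or standard retry modes do internally, and `full_jitter_delay`, `base_seconds`, and `cap_seconds` are hypothetical names and values.

```python
import random


def full_jitter_delay(attempt, base_seconds=2.0, cap_seconds=30.0, rng=random.random):
    """Return a random delay in [0, min(cap_seconds, base_seconds * 2**attempt)).

    attempt is the zero-based retry number; rng must return a float in [0, 1).
    """
    upper = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng() * upper
```

A caller would use it as `sleep(full_jitter_delay(attempt))` in place of a fixed sleep, so that concurrent transforms do not all retry at the same instant.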

"""
if self._wait_for_template_active_status and not self._validate_only:
    start_time = time()
    while (time() - start_time) < self.TEMPLATE_WAIT_TIMEOUT_SECONDS:
Contributor

This resets the throttle flag. Is the idea that the 2s sleep will give us enough break from calling that we can safely call SAR again? Might be worth commenting about line 319 to make it clear. This method is pretty hard to parse (lots of nesting).

Comment on lines 260 to 261
def setUp(self):
    pass
Contributor

I think you can remove this, since it is only a pass.

@codecov-commenter
codecov-commenter commented Nov 29, 2021

Codecov Report

Merging #2240 (292e9f4) into develop (e7a1496) will increase coverage by 0.89%.
The diff coverage is 98.18%.

@@             Coverage Diff             @@
##           develop    #2240      +/-   ##
===========================================
+ Coverage    93.58%   94.47%   +0.89%     
===========================================
  Files           90       95       +5     
  Lines         6124     6610     +486     
  Branches      1260     1333      +73     
===========================================
+ Hits          5731     6245     +514     
+ Misses         183      170      -13     
+ Partials       210      195      -15     
Impacted Files                                      Coverage Δ
samtranslator/model/lambda_.py                      93.10% <ø> (ø)
samtranslator/plugins/globals/globals.py            99.05% <ø> (ø)
samtranslator/translator/logical_id_generator.py    100.00% <ø> (+9.09%) ⬆️
samtranslator/region_configuration.py               77.77% <63.63%> (-22.23%) ⬇️
samtranslator/model/api/api_generator.py            93.24% <90.00%> (-1.13%) ⬇️
samtranslator/model/eventsources/pull.py            92.89% <97.77%> (+14.20%) ⬆️
samtranslator/swagger/swagger.py                    93.30% <98.24%> (-0.07%) ⬇️
samtranslator/__init__.py                           100.00% <100.00%> (ø)
samtranslator/feature_toggle/dialup.py              100.00% <100.00%> (ø)
samtranslator/feature_toggle/feature_toggle.py      100.00% <100.00%> (+12.16%) ⬆️
... and 39 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e816e50...292e9f4.

Comment on lines 335 to 338
if not throttled:
    response = self._sar_service_call(
        get_cfn_template, application_id, application_id, template_id
    )
Contributor

Since we now manually skip boto3 calls when throttled, it looks like the loop will go to the next iteration immediately, which is likely to result in another throttle. Do you think we should add a sleep at the end of the iteration if throttled is True?

Contributor Author

I should (will?) refactor this code to make it easier to read. Once throttled is True, we will continue looping but will not make any more SAR calls. We continue looping in order to call _handle_get_cfn_template_response. Once the for loop is done, we will sleep(self.SLEEP_TIME_SECONDS) before going through the while again, which resets throttled.
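The control flow described in this reply can be sketched roughly as follows. This is a simplified model, not the plugin's code: `Throttled`, `poll_templates`, `get_template`, `is_still_pending`, and `max_rounds` are hypothetical, and a fixed number of rounds replaces the time-based `while` condition so the example is deterministic.

```python
class Throttled(Exception):
    """Hypothetical marker for a throttled SAR call."""


def poll_templates(in_progress, get_template, is_still_pending, sleep, max_rounds=5):
    """Poll each pending item once per round; stop calling after a throttle."""
    for _ in range(max_rounds):
        if not in_progress:
            break
        current, in_progress = in_progress, []
        throttled = False  # reset at the start of every outer pass
        for item in current:
            response = None
            if not throttled:
                try:
                    response = get_template(item)
                except Throttled:
                    throttled = True  # no more service calls in this pass
            # Items that were throttled or skipped are re-queued for the next pass.
            if response is None or is_still_pending(response):
                in_progress.append(item)
        if in_progress:
            sleep()  # a single pause per pass, before the flag is reset
    return in_progress
```

The point of continuing the inner loop after a throttle is only to re-queue the remaining items; the pause happens once, after the inner loop finishes.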

Contributor

Oh I see, just found the line.

Contributor

@aahung aahung left a comment

Question: is it possible to tune the boto3 SAR client's retry strategy instead of having a retry loop?

What's the difference between these two?

  1. checking SAR A -> B -> C -> A -> B -> C -> A -> B -> C
  2. checking SAR A -> A -> A -> B -> B -> B -> C -> C -> C
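To make the two orderings concrete, here is a small sketch that generates each call sequence. The function names are hypothetical; ordering 1 corresponds to an outer polling loop over all applications, and ordering 2 to retrying each call to completion before moving on, which is roughly what client-level retries would produce.

```python
def round_robin_order(apps, attempts):
    """Ordering 1: one call per app per pass, repeated for each pass."""
    return [app for _ in range(attempts) for app in apps]


def per_app_order(apps, attempts):
    """Ordering 2: all attempts for one app before moving to the next."""
    return [app for app in apps for _ in range(attempts)]


print(round_robin_order(["A", "B", "C"], 3))
print(per_app_order(["A", "B", "C"], 3))
```

In ordering 1, calls for the other applications sit between two attempts for the same application, while in ordering 2 the attempts for one application are back to back.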

"template.".format(application_id, template_id, status)
)
raise InvalidResourceException(application_id, message)
self._in_progress_templates.append((application_id, template_id))
Contributor

this one was hidden here, I like the new change

@marekaiv marekaiv dismissed jfuss’s stale review December 14, 2021 01:55

Discussed the changes with Jacob and his comments are either addressed or can be addressed in future iterations. Jacob is now on vacation, and there are two additional PR approvals in place.

@marekaiv marekaiv merged commit 6daf706 into aws:develop Dec 14, 2021
6 participants