
Conversation

@DP19 (Contributor) commented Aug 20, 2020

Issue #, if available:
#229
Description of changes:
Added a for loop for getting scheduled maintenance events and spot instance events. Set the retry limit to 1.

I'm not sure of the best way to add this to the test suite. I'm also curious whether throwing a panic to solve the second item in the issue is the best approach here.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov-commenter commented Aug 20, 2020

Codecov Report

Merging #244 into master will increase coverage by 0.16%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #244      +/-   ##
==========================================
+ Coverage   81.43%   81.60%   +0.16%     
==========================================
  Files          10       10              
  Lines         792      799       +7     
==========================================
+ Hits          645      652       +7     
  Misses        131      131              
  Partials       16       16              
Impacted Files Coverage Δ
pkg/ec2metadata/ec2metadata.go 97.01% <100.00%> (+0.16%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 24ff89c...f459e12.

@bwagner5 (Contributor) left a comment

Thanks for jumping in to help out with these code changes!!

I think a better spot for the retry logic would be in the Request() function so that we don't need to duplicate the 401 retry logic in the individual paths.

I would recommend moving the retry for-loop around this block

if e.v2Token == "" || e.tokenTTL <= secondsBeforeTTLRefresh {
	e.Lock()
	token, ttl, err := e.getV2Token()
	if err != nil {
		e.v2Token = ""
		e.tokenTTL = -1
		log.Log().Msgf("Unable to retrieve an IMDSv2 token, continuing with IMDSv1: %v", err)
	} else {
		e.v2Token = token
		e.tokenTTL = ttl
	}
	e.Unlock()
}
if e.v2Token != "" {
	req.Header.Add(tokenRequestHeader, e.v2Token)
}
httpReq := func() (*http.Response, error) {
	return e.httpClient.Do(req)
}
resp, err := retry(e.tries, 2*time.Second, httpReq)
if err != nil {
	return nil, fmt.Errorf("Unable to get a response from IMDS: %w", err)
}
Then check for a 401 at the end and decide whether to break, or unset e.v2Token and set e.tokenTTL = 0 to retry the request while fetching a new token.
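
Roughly, the shape could look something like this (a sketch only, reusing the names from the block above; maxTokenRetries is a placeholder constant, and Set is used instead of Add so a retried request doesn't accumulate duplicate token headers):

// Sketch: retry the request with a token refresh when IMDS returns 401.
// maxTokenRetries is a placeholder; the other names come from the snippet above.
var resp *http.Response
var err error
for i := 0; i <= maxTokenRetries; i++ {
	if e.v2Token == "" || e.tokenTTL <= secondsBeforeTTLRefresh {
		e.Lock()
		token, ttl, tokenErr := e.getV2Token()
		if tokenErr != nil {
			e.v2Token = ""
			e.tokenTTL = -1
			log.Log().Msgf("Unable to retrieve an IMDSv2 token, continuing with IMDSv1: %v", tokenErr)
		} else {
			e.v2Token = token
			e.tokenTTL = ttl
		}
		e.Unlock()
	}
	if e.v2Token != "" {
		req.Header.Set(tokenRequestHeader, e.v2Token)
	}
	httpReq := func() (*http.Response, error) {
		return e.httpClient.Do(req)
	}
	resp, err = retry(e.tries, 2*time.Second, httpReq)
	if err != nil {
		return nil, fmt.Errorf("Unable to get a response from IMDS: %w", err)
	}
	if resp.StatusCode != http.StatusUnauthorized {
		// Success or a non-token-related failure; no point refetching the token.
		break
	}
	// 401: the token is likely stale or rejected. Clear it so the next
	// iteration fetches a fresh one and retries the request.
	e.Lock()
	e.v2Token = ""
	e.tokenTTL = 0
	e.Unlock()
}
return resp, nil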

The panic should occur in the main pkg. An error returned by the GetSpotITNEvent() or GetScheduledMaintenanceEvents() functions will propagate up to this monitoring loop:

for _, fn := range monitoringFns {
	go func(monitor monitor.Monitor) {
		log.Log().Msgf("Started monitoring for %s events", monitor.Kind())
		for range time.Tick(time.Second * 2) {
			err := monitor.Monitor()
			if err != nil {
				log.Log().Msgf("There was a problem monitoring for %s events: %v", monitor.Kind(), err)
				metrics.ErrorEventsInc(monitor.Kind())
			}
		}
	}(fn)
}
From here, we can count the errors and panic when the count breaches a threshold. I think it should also require consecutive errors, so that intermittent errors can't build up over time and cause a restart just because one more intermittent error happens to breach the threshold after the handler has been running for months.
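
For example, something along these lines (a sketch only; duplicateErrThreshold is a placeholder value, not an existing constant in the project):

// Sketch: panic only after several consecutive monitoring errors.
const duplicateErrThreshold = 3

for _, fn := range monitoringFns {
	go func(monitor monitor.Monitor) {
		log.Log().Msgf("Started monitoring for %s events", monitor.Kind())
		consecutiveErrs := 0
		for range time.Tick(time.Second * 2) {
			err := monitor.Monitor()
			if err != nil {
				log.Log().Msgf("There was a problem monitoring for %s events: %v", monitor.Kind(), err)
				metrics.ErrorEventsInc(monitor.Kind())
				consecutiveErrs++
				if consecutiveErrs >= duplicateErrThreshold {
					panic(fmt.Sprintf("Stopping after %d consecutive errors monitoring for %s events", consecutiveErrs, monitor.Kind()))
				}
				continue
			}
			// A successful poll resets the counter, so intermittent errors
			// spread over a long run can't accumulate into a restart.
			consecutiveErrs = 0
		}
	}(fn)
}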

DP19 added 2 commits August 20, 2020 10:04
move Panic to main function.
add tests for 401 retries.
@DP19 (Contributor, Author) commented Aug 20, 2020

@bwagner5 - thanks for the feedback! I've moved this logic to the Request function and added two new local vars to the monitor loop to track the previous error; if the same error occurs 3 times in a row, it panics. This way it covers not just the 401 error but any error that repeats.

@bwagner5 (Contributor) left a comment

Awesome! This looks great!

  • Can you also run make fmt to correct the go report card issue?

@bwagner5 (Contributor) left a comment

Great work! Thanks!!

@bwagner5 bwagner5 merged commit 92fb0c2 into aws:master Aug 20, 2020