
Conversation

@DP19 (Contributor) commented Aug 20, 2020

Issue #, if available:
#229
Description of changes:
Added a for loop for getting scheduled maintenance events and spot instance events. Set the retry limit to 1.

I'm not sure of the best way to add this to the test suite. I'm also curious whether throwing a panic to solve the second item in the issue is the best approach here.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov-commenter commented Aug 20, 2020

Codecov Report

Merging #244 into master will increase coverage by 0.16%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #244      +/-   ##
==========================================
+ Coverage   81.43%   81.60%   +0.16%     
==========================================
  Files          10       10              
  Lines         792      799       +7     
==========================================
+ Hits          645      652       +7     
  Misses        131      131              
  Partials       16       16              
Impacted Files Coverage Δ
pkg/ec2metadata/ec2metadata.go 97.01% <100.00%> (+0.16%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 24ff89c...f459e12.

@bwagner5 (Contributor) left a comment

Thanks for jumping in to help out with these code changes!!

I think a better spot for the retry logic would be in the Request() function so that we don't need to duplicate the 401 retry logic in the individual paths.

I would recommend moving the retry for-loop around this block

if e.v2Token == "" || e.tokenTTL <= secondsBeforeTTLRefresh {
	e.Lock()
	token, ttl, err := e.getV2Token()
	if err != nil {
		e.v2Token = ""
		e.tokenTTL = -1
		log.Log().Msgf("Unable to retrieve an IMDSv2 token, continuing with IMDSv1: %v", err)
	} else {
		e.v2Token = token
		e.tokenTTL = ttl
	}
	e.Unlock()
}
if e.v2Token != "" {
	req.Header.Add(tokenRequestHeader, e.v2Token)
}
httpReq := func() (*http.Response, error) {
	return e.httpClient.Do(req)
}
resp, err := retry(e.tries, 2*time.Second, httpReq)
if err != nil {
	return nil, fmt.Errorf("Unable to get a response from IMDS: %w", err)
}
Then check for a 401 at the end and decide whether to break, or unset e.v2Token and set e.tokenTTL = 0 to retry the request while fetching a new token.
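
Roughly, the shape could look something like this (a sketch only, reusing the names from the block above; maxTokenRetries is a placeholder constant, and Set is used instead of Add so a retried request doesn't accumulate duplicate token headers):

// Sketch: retry the request with a token refresh when IMDS returns 401.
// maxTokenRetries is a placeholder; the other names come from the snippet above.
var resp *http.Response
var err error
for i := 0; i <= maxTokenRetries; i++ {
	if e.v2Token == "" || e.tokenTTL <= secondsBeforeTTLRefresh {
		e.Lock()
		token, ttl, tokenErr := e.getV2Token()
		if tokenErr != nil {
			e.v2Token = ""
			e.tokenTTL = -1
			log.Log().Msgf("Unable to retrieve an IMDSv2 token, continuing with IMDSv1: %v", tokenErr)
		} else {
			e.v2Token = token
			e.tokenTTL = ttl
		}
		e.Unlock()
	}
	if e.v2Token != "" {
		req.Header.Set(tokenRequestHeader, e.v2Token)
	}
	httpReq := func() (*http.Response, error) {
		return e.httpClient.Do(req)
	}
	resp, err = retry(e.tries, 2*time.Second, httpReq)
	if err != nil {
		return nil, fmt.Errorf("Unable to get a response from IMDS: %w", err)
	}
	if resp.StatusCode != http.StatusUnauthorized {
		// Success or a non-token-related failure; no point refetching the token.
		break
	}
	// 401: the token is likely stale or rejected. Clear it so the next
	// iteration fetches a fresh one and retries the request.
	e.Lock()
	e.v2Token = ""
	e.tokenTTL = 0
	e.Unlock()
}
return resp, nil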

The panic should occur in the main pkg. An error returned by the GetSpotITNEvent() or GetScheduledMaintenanceEvents() functions will propagate up to this monitoring loop:

for _, fn := range monitoringFns {
	go func(monitor monitor.Monitor) {
		log.Log().Msgf("Started monitoring for %s events", monitor.Kind())
		for range time.Tick(time.Second * 2) {
			err := monitor.Monitor()
			if err != nil {
				log.Log().Msgf("There was a problem monitoring for %s events: %v", monitor.Kind(), err)
				metrics.ErrorEventsInc(monitor.Kind())
			}
		}
	}(fn)
}
From here, we can count the errors and panic when the count breaches a threshold. I think it should also require consecutive errors, so that intermittent errors can't build up over time and cause a restart just because one more intermittent error happens to breach the threshold after the handler has been running for months.
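
For example, something along these lines (a sketch only; duplicateErrThreshold is a placeholder value, not an existing constant in the project):

// Sketch: panic only after several consecutive monitoring errors.
const duplicateErrThreshold = 3

for _, fn := range monitoringFns {
	go func(monitor monitor.Monitor) {
		log.Log().Msgf("Started monitoring for %s events", monitor.Kind())
		consecutiveErrs := 0
		for range time.Tick(time.Second * 2) {
			err := monitor.Monitor()
			if err != nil {
				log.Log().Msgf("There was a problem monitoring for %s events: %v", monitor.Kind(), err)
				metrics.ErrorEventsInc(monitor.Kind())
				consecutiveErrs++
				if consecutiveErrs >= duplicateErrThreshold {
					panic(fmt.Sprintf("Stopping after %d consecutive errors monitoring for %s events", consecutiveErrs, monitor.Kind()))
				}
				continue
			}
			// A successful poll resets the counter, so intermittent errors
			// spread over a long run can't accumulate into a restart.
			consecutiveErrs = 0
		}
	}(fn)
}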

DP19 added 2 commits August 20, 2020 10:04
move Panic to main function.
add tests for 401 retries.
@DP19 (Contributor, Author) commented Aug 20, 2020

@bwagner5 - thanks for the feedback! I've moved this logic to the Request function and added two new local vars to the monitor loop to track the previous error; if the same error occurs 3 times in a row, it panics. This way it covers not just the 401 error but any error that repeats.

@bwagner5 (Contributor) left a comment

Awesome! This looks great!

  • Can you also run make fmt to correct the go report card issue?

@bwagner5 (Contributor) left a comment

Great work! Thanks!!

@bwagner5 bwagner5 merged commit 92fb0c2 into aws:master Aug 20, 2020