Skip to content

Conversation

blakestoddard
Copy link
Contributor

@blakestoddard blakestoddard commented Nov 12, 2020

Removes the NTHManagedASG tag constant in favor of a configurable ManagedAsgTag option to allow you to run NTH in queue mode across multiple clusters in an account. EC2 instance events don't have any filtering ability currently, which will cause NTH to drop into a crash loop when it runs into an error while trying to cordon a node that doesn't exist in the cluster. See here.

Fixes #272

@blakestoddard blakestoddard changed the title Allow ManagedAsgTag to be configurable Allow the NTHManagedAsg tag to be configurable Nov 12, 2020
Copy link
Contributor

@brycahta brycahta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like build is failing due to unit tests:

  • pkg/monitor/sqsevent/sqs-monitor_internal_test.go:51: undefined: NTHManagedASG
  • pkg/monitor/sqsevent/sqs-monitor_test.go:542: undefined: NTHManagedASG

@blakestoddard
Copy link
Contributor Author

I wasn't sure the best way to really fix that because passing a string to mockIsManagedTrue during struct initialization feels just as gross as just updating the constant reference in the function with a new string, so that's what I did.

Copy link
Contributor

@brycahta brycahta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like unit tests are still failing. After some deep dive:

  • Looks like you need to initialize SQSMonitor with a ManagedAsgTag set to the default/tag value; otherwise the if *tag.Key == m.ManagedAsgTag fails

Copy link
Contributor

@brycahta brycahta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also update the Helm artifacts in this folder including ReadMe so that ManagedAsgTag can also be configurable as a Helm param?

the files should be:

edit: removing daemonset configs

@blakestoddard
Copy link
Contributor Author

I don't see any of the related queue processor variables (queueURL, checkASGTagBeforeDraining, etc.) defined in the two daemonset configs, so I didn't add those. Should they be?

@brycahta
Copy link
Contributor

I don't see any of the related queue processor variables (queueURL, checkASGTagBeforeDraining, etc.) defined in the two daemonset configs, so I didn't add those. Should they be?

Ahh sorry that's my mistake. queue-processor vars shouldn't be in daemonset configs at all; i'll remove from comment

@codecov-io
Copy link

codecov-io commented Nov 17, 2020

Codecov Report

Merging #287 (b05d777) into main (7402427) will decrease coverage by 2.34%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #287      +/-   ##
==========================================
- Coverage   77.10%   74.76%   -2.35%     
==========================================
  Files          17       17              
  Lines        1162     1169       +7     
==========================================
- Hits          896      874      -22     
- Misses        226      253      +27     
- Partials       40       42       +2     
Impacted Files Coverage Δ
pkg/config/config.go 93.75% <100.00%> (+0.13%) ⬆️
pkg/monitor/sqsevent/sqs-monitor.go 91.66% <100.00%> (-3.71%) ⬇️
pkg/monitor/sqsevent/asg-lifecycle-event.go 41.86% <0.00%> (-37.21%) ⬇️
pkg/monitor/sqsevent/spot-itn-event.go 52.00% <0.00%> (-12.00%) ⬇️
...monitor/sqsevent/rebalance-recommendation-event.go 44.00% <0.00%> (-12.00%) ⬇️
pkg/ec2metadata/ec2metadata.go 95.23% <0.00%> (-1.99%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7402427...b05d777. Read the comment docs.

@blakestoddard
Copy link
Contributor Author

Are those failures in Travis related? Seems like they're from prometheus-operator.

@brycahta brycahta requested a review from bwagner5 November 17, 2020 19:31
@brycahta
Copy link
Contributor

Are those failures in Travis related? Seems like they're from prometheus-operator.

Correct-- these failures aren't related to your changes.

Copy link
Contributor

@brycahta brycahta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍🏼

Copy link
Contributor

@bwagner5 bwagner5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failing to drain/cordon causes CrashLoopBackOff
4 participants