
Reduced resource consumption of async credential providers. #3275


Merged: 10 commits, Jul 9, 2022

Conversation

@millems (Contributor) commented Jun 29, 2022

Fixes #3259.

  1. Share thread pools across async credential providers (anything using CachedSupplier's NonBlocking prefetch strategy).
  2. Log a warning if an extreme number of concurrent refreshes are happening, to help users detect when they're not closing their credential providers.

Even though this is an increase in resource sharing, it should not cause increased availability risks. Because these threads are only used for background refreshes, if one particular type of credential provider has availability problems (e.g. SSO or STS high latency), it only disables background refreshes, not prefetches or synchronous fetches.
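The two changes above can be sketched roughly as follows. This is a minimal illustration with hypothetical names and a hypothetical warning threshold, not the SDK's actual internals:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: one shared scheduler for all background credential refreshes. */
final class SharedRefreshPool {
    // Hypothetical threshold; the real SDK's value may differ.
    private static final int WARN_THRESHOLD = 100;

    private static final AtomicInteger ACTIVE_REFRESHERS = new AtomicInteger();

    // A single shared pool instead of one thread per credential provider.
    private static final ScheduledExecutorService POOL =
        Executors.newScheduledThreadPool(1, r -> {
            Thread t = new Thread(r, "shared-credential-refresh");
            t.setDaemon(true); // background refreshes never block JVM shutdown
            return t;
        });

    /** Called when a provider starts using background refreshes. */
    static ScheduledExecutorService acquire() {
        int active = ACTIVE_REFRESHERS.incrementAndGet();
        if (active > WARN_THRESHOLD) {
            System.err.println("WARNING: " + active + " active credential refreshers. "
                + "Are you closing your credential providers?");
        }
        return POOL;
    }

    /** Called from the provider's close(). */
    static void release() {
        ACTIVE_REFRESHERS.decrementAndGet();
    }

    static int activeCount() {
        return ACTIVE_REFRESHERS.get();
    }
}
```

Because the pool is only used for background refreshes, a slow or failing provider type merely delays those refreshes; synchronous fetches are unaffected.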

@millems millems requested a review from a team as a code owner June 29, 2022 23:41
@millems millems force-pushed the millem/reduce-background-credential-threads branch 2 times, most recently from 4c41d77 to bf7ef79 on June 29, 2022 23:52
@millems millems force-pushed the millem/reduce-background-credential-threads branch 2 times, most recently from a48f917 to 069261e on June 30, 2022 19:14
@millems millems force-pushed the millem/reduce-background-credential-threads branch 2 times, most recently from 3cb87af to 075f34f on July 6, 2022 20:17
@cloudshiftchris

There are a few places where JavaDoc could be tweaked to reflect these changes:

SsoCredentialsProvider, StsCredentialsProvider - suggested text for prefetch attribute on the Builder:

/**
 * Configure the amount of time, relative to STS token expiration, that the cached credentials are considered
 * close to stale and should be updated. 
 *
 * Prefetch updates will occur between the specified time and the stale time of the provider, based on demand
 * and load distribution.  Prefetch updates may be asynchronous; see {@link #asyncCredentialUpdateEnabled}.
 *
 * <p>By default, this is 5 minutes.</p>
 */

StsCredentialsProvider speaks to updating credentials asynchronously (existing JavaDoc below); this appears incorrect, as the default for asyncCredentialUpdateEnabled is false. The same issue applies to StsAssumeRoleCredentialsProvider and SsoCredentialsProvider.

/**
 * An implementation of {@link AwsCredentialsProvider} that is extended within this package to provide support for periodically-
 * updating session credentials. When credentials get close to expiration, this class will attempt to update them asynchronously
 * using {@link #getUpdatedCredentials(StsClient)}. If the credentials end up expiring, this class will block all calls to
 * {@link #resolveCredentials()} until the credentials can be updated.
 */

Additionally for StsAssumeRoleCredentialsProvider, it erroneously states:

 * This provider creates a thread in the background to periodically update credentials. If this provider is no longer needed,
 * the background thread can be shut down using {@link #close()}.

Perhaps remove and defer to asyncCredentialUpdateEnabled.

@cloudshiftchris

StsCredentialsProvider erroneously states that the sessionCache is async (it is strategized based on PrefetchStrategy)

    /**
     * The session cache that will update the credentials asynchronously in the background when they get close to expiring.
     */
    private final CachedSupplier<SessionCredentialsHolder> sessionCache;

Perhaps: "The session cache that will update the credentials when they get close to expiring."

@millems (Contributor, Author) commented Jul 7, 2022

Yeah, it's a bit more constrained now. The goal of capping it at 5 minutes was to make log dives easier, because you know that the refresh will happen within 5 minutes of the calculated prefetch time (which is usually less than 5 minutes before the expiration time). If it were unbounded, people would be more likely to have to search through more logs to find the info.

I just picked 5 minutes because most of our default prefetch times are within 5 minutes of the stale times, and most service throttling limits are windowed at one second. Some are longer, though, like IMDS's, which can be up to 30 minutes.

What would be an ideal max time in your mind?

@cloudshiftchris

I think it's a cleaner story to say that prefetch happens in the window from the prefetch time to the stale time (with a bit of padding). Providing a larger prefetch window simply jitters the prefetch over that larger duration.

Where E = token expiration, effective prefetch = rand([E - prefetch, E - stale - 1m]) (with the validity checks for overlaps etc)

Practically speaking, most SDK consumers won't change the prefetch (or have a need to change the prefetch), leaving them with reasonable defaults (and hence a constrained window for log analysis, etc).

In the case of larger numbers of multi-tenant cred providers one may wish to spread that over a larger window (say 10m).

I don't see a practical use case for bloated/unbounded prefetch windows; after, say, 10m you're starting to get into "fetching too frequently" territory. Sure, folks can configure a prefetch of 3500s on a 3600s token expiration if they wish. If we want to prevent that, perhaps the prefetch could be validated (at configuration time) to be at most max(5m, 10% of token expiration duration). Alternately, the JavaDoc could be updated with that guidance, something along the lines of "There are diminishing returns for prefetch windows > 10% of token duration".
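The jittered window described in this thread could be sketched like this. This is a hypothetical helper derived from the formula above, not the SDK's actual implementation; the 1-minute padding and the clamping behavior for degenerate windows are assumptions:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of jittered prefetch scheduling: rand([E - prefetch, E - stale - 1m]). */
final class PrefetchJitter {

    /**
     * Pick a refresh time uniformly in [expiration - prefetch, expiration - stale - 1m].
     * If the window is empty or inverted, fall back to the prefetch time itself
     * (the "validity checks for overlaps" mentioned above).
     */
    static Instant nextRefreshTime(Instant expiration, Duration prefetch, Duration stale) {
        Instant windowStart = expiration.minus(prefetch);
        Instant windowEnd = expiration.minus(stale).minus(Duration.ofMinutes(1));

        if (!windowEnd.isAfter(windowStart)) {
            return windowStart; // degenerate window: refresh at the prefetch time
        }

        long spanMillis = Duration.between(windowStart, windowEnd).toMillis();
        long offset = ThreadLocalRandom.current().nextLong(spanMillis + 1);
        return windowStart.plusMillis(offset);
    }
}
```

With the defaults discussed above (stale ~1m, prefetch ~5m), the jitter window stays small, so log analysis remains bounded; a larger configured prefetch simply spreads refreshes over the larger window.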

@millems (Contributor, Author) commented Jul 7, 2022

You've convinced me, I think. Let me remove the jitter maximum.

@millems millems force-pushed the millem/reduce-background-credential-threads branch 2 times, most recently from ae9a2e7 to 9e1ce1b on July 8, 2022 18:21
@millems (Contributor, Author) commented Jul 8, 2022

Alright, I believe I've resolved your comments, including the JavaDoc updates (good catch; wish there were a mechanism to keep those correct).

@cloudshiftchris

Thanks; reviewed everything earlier - no further comments, this addresses the goals. Great work on this.

@millems millems force-pushed the millem/reduce-background-credential-threads branch from 0107254 to 0f060fd on July 8, 2022 22:10
@millems millems enabled auto-merge (squash) July 8, 2022 22:14
@millems millems merged commit f1fd88e into master Jul 9, 2022
@sonarqubecloud (bot) commented Jul 9, 2022

SonarCloud Quality Gate failed.

  - Bugs: 0 (rating A)
  - Vulnerabilities: 0 (rating A)
  - Security Hotspots: 1 (rating E)
  - Code Smells: 10 (rating A)
  - Coverage: 89.0%
  - Duplication: 0.0%

dagnir added a commit that referenced this pull request Sep 2, 2022
There were some issues after the merge with master related to timing because of the CachedSupplier update from #3275.
@millems millems deleted the millem/reduce-background-credential-threads branch October 19, 2022 19:35
Merging this pull request closed #3259, "StsCredentialsProvider uses excessive number of threads in multi-tenant setup". (3 participants)