
Intermittent issue on sdk v2 - Unable to load credentials from service endpoint #3448


Closed
striker50 opened this issue Sep 28, 2022 · 23 comments
Labels
bug This issue is a bug. no-auto-closure

Comments

@striker50

striker50 commented Sep 28, 2022

Describe the bug

I am running a Spark application on an EMR 5.30.1 cluster. We are seeing an intermittent credential access issue after upgrading the AWS SDK from 1.11.297 to 2.17.11.
We upgraded from v1 to v2 because v1 doesn't have stable support for configuring custom VPC endpoints, as recommended in aws/aws-sdk-java#2135 (comment).

After moving to SDK v2, VPC endpoint access works fine, but we are seeing INTERMITTENT SQS sendMessage() failures: loading credentials fails because the connection times out.
Following https://docs.amazonaws.cn/en_us/sdk-for-java/latest/developer-guide/migration-client-credentials.html, I also enabled the async credential refresher during SQS client initialization using the code below, but the issue still occurs.

Please advise on how to fix the intermittent credential access issue on AWS SDK v2. We get the credentials from InstanceProfileCredentialsProvider.

Expected Behavior

Consistent credential access for the SQSClient when sending messages

Current Behavior

code snippet:

URI endpointURI = new URI(sqsVPCEndpoint);
InstanceProfileCredentialsProvider provider = InstanceProfileCredentialsProvider.builder()
        .asyncCredentialUpdateEnabled(true)
        .build();
SqsClient sqs = SqsClient.builder()
        .region(Region.of(awsRegion))
        .endpointOverride(endpointURI)
        .credentialsProvider(provider)
        .build();

Error message:
java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from service endpoint.
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    ......
    at java.lang.Thread.run(Thread.java:750)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from service endpoint.
    at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
    at software.amazon.awssdk.auth.credentials.HttpCredentialsProvider.refreshCredentials(HttpCredentialsProvider.java:110)
    at software.amazon.awssdk.utils.cache.CachedSupplier.refreshCache(CachedSupplier.java:132)
    at software.amazon.awssdk.utils.cache.CachedSupplier.get(CachedSupplier.java:89)
    at java.util.Optional.map(Optional.java:215)
    at software.amazon.awssdk.auth.credentials.HttpCredentialsProvider.resolveCredentials(HttpCredentialsProvider.java:146)
    at software.amazon.awssdk.awscore.client.handler.AwsClientHandlerUtils.createExecutionContext(AwsClientHandlerUtils.java:79)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.createExecutionContext(AwsSyncClientHandler.java:68)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:99)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:169)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:95)
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:55)
    at software.amazon.awssdk.services.sqs.DefaultSqsClient.sendMessage(DefaultSqsClient.java:1528)
    .......
    .......
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1207)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
    at software.amazon.awssdk.regions.internal.util.ConnectionUtils.connectToEndpoint(ConnectionUtils.java:45)
    at software.amazon.awssdk.regions.util.HttpResourcesUtils.readResource(HttpResourcesUtils.java:112)
    at software.amazon.awssdk.regions.util.HttpResourcesUtils.readResource(HttpResourcesUtils.java:91)
    at software.amazon.awssdk.auth.credentials.HttpCredentialsProvider.refreshCredentials(HttpCredentialsProvider.java:79)
    ... 21 more

Reproduction Steps

code snippet for AWS SDK v2:

URI endpointURI = new URI(sqsVPCEndpoint);
InstanceProfileCredentialsProvider provider = InstanceProfileCredentialsProvider.builder()
        .asyncCredentialUpdateEnabled(true)
        .build();
SqsClient sqs = SqsClient.builder()
        .region(Region.of(awsRegion))
        .endpointOverride(endpointURI)
        .credentialsProvider(provider)
        .build();

Possible Solution

No response

Additional Information/Context

No response

AWS Java SDK version used

2.17.11

JDK version used

1.8

Operating System and version

EMR clusters

@striker50 striker50 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 28, 2022
@striker50 striker50 changed the title (short issue description) Intermittent issue on sdk v2 - Unable to load credentials from service endpoint Sep 28, 2022
@yasminetalby yasminetalby self-assigned this Sep 28, 2022
@yasminetalby

Hello @striker50

Thank you very much for your submission.

Here, the "Unable to load credentials from service endpoint" error indicates that the InstanceProfileCredentialsProvider attempted to refresh the credentials but could not connect to the service endpoint before the timeout.

After some research, I found other users' issue submissions reporting high credential-refresh latency.
This latency issue seems to be present in later versions of v1 as well. It started in 1.11.678, which is when the new Instance Metadata Service v2 was released, and we have seen various reports of increased latency on the service side. This is why I believe you did not encounter this issue before upgrading from 1.11.297 to 2.17.11.

Since you are using EMR, you can adjust the hop limit using the modify-instance-metadata-options command if you need to make it larger. You can find more information on the use of IMDSv2 as well as hop-limit configuration here.

Note: This seems to be a bug submission for the AWS Java SDK V2 rather than for the AWS Java SDK V1. To make it easier for other users to find, I will update the label accordingly and transfer the submission to the appropriate repository.

Best,

Yasmine

@yasminetalby yasminetalby added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. and removed needs-triage This issue or PR still needs to be triaged. labels Sep 29, 2022
@yasminetalby yasminetalby transferred this issue from aws/aws-sdk-java Sep 29, 2022
@striker50
Author

striker50 commented Oct 4, 2022

@yasminetalby We created a new EMR cluster with the configuration below, but we are still seeing the same issue.

EMR cluster config

Please let us know if anything is misconfigured here. This is on EMR 5.32.1. We need this configuration to apply to core and task nodes as well.
Here is the code snippet. Should I make any change to the code or the IAM role policy?

URI endpointURI = new URI(sqsVPCEndpoint);
InstanceProfileCredentialsProvider provider = InstanceProfileCredentialsProvider.builder()
        .asyncCredentialUpdateEnabled(true)
        .build();
SqsClient sqs = SqsClient.builder()
        .region(Region.of(awsRegion))
        .endpointOverride(endpointURI)
        .credentialsProvider(provider)
        .build();

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 4, 2022
@skumarstrike02

@yasminetalby Is there any update on this issue? We are blocked from deploying our services to prod because of it. Can you please escalate it?

@yasminetalby

Hello @skumarstrike02 , @striker50 ,

Apologies for the delay.
To confirm the origin of the behavior, could you please provide the full stack trace associated with the issue as well as the verbose wire log?

Please make sure to remove any sensitive information.

It seems that the EC2 team behind the Metadata Service is still investigating the latency issue mentioned above.

Best,

Yasmine

@yasminetalby

I was also wondering if you have enabled the SDK metrics.
Do you notice a high CredentialsFetchDuration, or higher latency for a specific request?

Best,

Yasmine

@yasminetalby yasminetalby added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 11, 2022
@skumarstrike02

Hi @yasminetalby, we ran the service with updated logging settings. Below are the logs around the error:
2022-10-12 02:21:28 DEBUG software.amazon.awssdk.request:84 - Sending Request: DefaultSdkHttpFullRequest(httpMethod=POST, protocol=https, host=<VPC ENDPOINT>, encodedPath=, headers=[amz-sdk-invocation-id, Content-Length, Content-Type, User-Agent], queryParameters=[])
2022-10-12 02:21:28 DEBUG software.amazon.awssdk.request:84 - Received successful response: 200
2022-10-12 02:21:28 DEBUG software.amazon.awssdk.request:84 - Sending Request: DefaultSdkHttpFullRequest(httpMethod=POST, protocol=https, host=<VPC ENDPOINT>, encodedPath=, headers=[amz-sdk-invocation-id, Content-Length, Content-Type, User-Agent], queryParameters=[])
2022-10-12 02:21:28 ERROR <code error message>
java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from service endpoint.
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    ......
    at java.lang.Thread.run(Thread.java:750)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from service endpoint.
    at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
    at software.amazon.awssdk.auth.credentials.HttpCredentialsProvider.refreshCredentials(HttpCredentialsProvider.java:110)
    at software.amazon.awssdk.utils.cache.CachedSupplier.refreshCache(CachedSupplier.java:132)
    at software.amazon.awssdk.utils.cache.CachedSupplier.get(CachedSupplier.java:89)
    at java.util.Optional.map(Optional.java:215)
    at software.amazon.awssdk.auth.credentials.HttpCredentialsProvider.resolveCredentials(HttpCredentialsProvider.java:146)
    at software.amazon.awssdk.awscore.client.handler.AwsClientHandlerUtils.createExecutionContext(AwsClientHandlerUtils.java:79)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.createExecutionContext(AwsSyncClientHandler.java:68)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:99)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:169)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:95)
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:55)
    at software.amazon.awssdk.services.sqs.DefaultSqsClient.sendMessage(DefaultSqsClient.java:1528)
    ....
    at java.lang.Iterable.forEach(Iterable.java:75)
    ....
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1207)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
    at software.amazon.awssdk.regions.internal.util.ConnectionUtils.connectToEndpoint(ConnectionUtils.java:45)
    at software.amazon.awssdk.regions.util.HttpResourcesUtils.readResource(HttpResourcesUtils.java:112)
    at software.amazon.awssdk.regions.util.HttpResourcesUtils.readResource(HttpResourcesUtils.java:91)
    at software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider$InstanceProviderCredentialsEndpointProvider.endpoint(InstanceProfileCredentialsProvider.java:150)
    at software.amazon.awssdk.regions.util.HttpResourcesUtils.readResource(HttpResourcesUtils.java:112)
    at software.amazon.awssdk.regions.util.HttpResourcesUtils.readResource(HttpResourcesUtils.java:91)
    at software.amazon.awssdk.auth.credentials.HttpCredentialsProvider.refreshCredentials(HttpCredentialsProvider.java:79)
    ... 21 more
2022-10-12 02:21:28 DEBUG software.amazon.awssdk.request:84 - Received successful response: 200

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 12, 2022
@yasminetalby

Hello @skumarstrike02 ,

Thank you very much for the extra information and for your collaboration.
Could you also provide the SDK metrics requested above for the requests with high latency?

Best,

Yasmine

@yasminetalby yasminetalby added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 13, 2022
@skumarstrike02

Hi @yasminetalby,
We don't have SDK metrics enabled. Please let me know if you want any further info or logs on this issue.

Thanks
Sumeet

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 13, 2022
@yasminetalby

Hello Sumeet,

Is it possible for you to enable the metrics? This would allow us to see if there is any configuration change that can be made that will help in your case.

Best,

Yasmine

@yasminetalby yasminetalby added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 14, 2022
@skumarstrike02

Hi @yasminetalby, can it be enabled on a running EMR cluster? If yes, please point me to the steps to enable it.

Thanks
Sumeet

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 14, 2022
@yasminetalby

Hello @skumarstrike02 ,

You can find out how to enable SDK metrics in the documentation provided in earlier comments: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics.html

The metric of interest in this case is CredentialsFetchDuration (see https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics-list.html).

Here you can see a similar issue where the customer provides the metrics: #1667 (comment)

Best,

Yasmine

@yasminetalby yasminetalby added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 14, 2022
@skumarstrike02

skumarstrike02 commented Oct 19, 2022

Hi @yasminetalby,
Please find below the dashboard that shows the CredentialsFetchDuration metric:
https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#dashboards:name=Orion-CredentialsFetchDuration;start=PT3H

Is there any update on the issue from the SDK team?

credentils_fetch

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 19, 2022
@yasminetalby

Hello @skumarstrike02 ,

Thank you very much for providing the documentation.
From the CredentialsFetchDuration metrics, we can confirm that the behavior occurs because the credential fetch is reaching the 1-second connection timeout.

There is an internal ticket open to investigate this case. Have you tried updating the SDK to the latest version? If so, have you seen any improvement in the behavior?

Best,

Yasmine
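
For readers who want to observe the timeout outside the SDK, here is a minimal diagnostic sketch using only the JDK. It is not part of the AWS SDK: the class name `ImdsConnectProbe` and its methods are hypothetical, the 1-second timeout is simply the figure quoted in the comment above, and 169.254.169.254:80 is the standard EC2 instance metadata address. It only measures how long a plain TCP connect takes, which is the step that fails with "connect timed out" in the stack traces.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical diagnostic helper: times a TCP connect with a fixed timeout. */
class ImdsConnectProbe {

    /** Outcome of one connect attempt. */
    static final class Result {
        final boolean connected;
        final long elapsedMillis;
        final String error; // null on success

        Result(boolean connected, long elapsedMillis, String error) {
            this.connected = connected;
            this.elapsedMillis = elapsedMillis;
            this.error = error;
        }

        @Override
        public String toString() {
            return (connected ? "connected" : "failed (" + error + ")") + " after " + elapsedMillis + " ms";
        }
    }

    /** Tries a plain TCP connect to host:port and gives up after timeoutMillis. */
    static Result probe(String host, int port, int timeoutMillis) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return new Result(true, elapsedMillis(start), null);
        } catch (IOException e) {
            // A SocketTimeoutException here matches "connect timed out" in the traces above.
            return new Result(false, elapsedMillis(start), e.getClass().getSimpleName() + ": " + e.getMessage());
        }
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    /** Loopback sanity check: connects to a listening port, then fails once it is closed. */
    static boolean selfCheck() {
        try {
            int port;
            Result open;
            try (ServerSocket server = new ServerSocket(0)) {
                port = server.getLocalPort();
                open = probe("127.0.0.1", port, 1000);
            }
            Result closed = probe("127.0.0.1", port, 1000);
            return open.connected && !closed.connected;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Off EC2 this address is normally unreachable, so failures are expected there.
        for (int i = 0; i < 5; i++) {
            System.out.println(probe("169.254.169.254", 80, 1000));
        }
    }
}
```

Running `main` repeatedly on an affected node and comparing the elapsed times against the 1-second limit may help show whether the connects are slow in general or only fail under load.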

@skumarstrike02

skumarstrike02 commented Oct 20, 2022

Hi @yasminetalby,
We tried the latest version of the AWS SDK (2.17.295). The issue still occurs.

2022-10-20 03:03:16 ERROR [taskmanager-0]  - Failure in managed task with taskId [27a4b647-4527-470f-92f6-f8819922c344]
java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException: Failed to load credentials from IMDS.
	

@yasminetalby

Hello @skumarstrike02 ,

Thank you very much for the update.
We have forwarded this information to the service team for further investigation regarding the connectivity issue with IMDS.
I am currently working internally to support this case with AWS Support.
We believe it might be easier for you to share information about this case directly on the AWS Support case, so that you do not have to duplicate communication.
Would it be OK if I closed this GitHub issue in favor of the support communication platform?

Thank you very much for your collaboration.

Sincerely,

Yasmine

@yasminetalby yasminetalby added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 20, 2022
@skumarstrike02

Hello @yasminetalby,
On the support case, the engineer told us to follow up on the GitHub issue, so I am not sure which platform is the right one for follow-up. Their reply:

"I checked the GitHub issue that you have raised and could see that the issue has already been assigned to the SDK team. Hence, I informed you to kindly allow a couple of days for the SDK team to look into the issue, replicate and reply back. The team will be replying directly on the GitHub issue. You can always work with the team/follow up on the issue via the GitHub link. If you do not get any update even after following up with the team, feel free to write back to me. I will be happy to reach out to the team and request them to provide an update."

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 20, 2022
@yasminetalby

Hello @skumarstrike02 ,

Thank you very much for bringing this to my attention. I apologize for any confusing guidance provided.

I am happy to keep this GitHub issue open if this is your preferred medium of communication.
I have checked with AWS Support and they have confirmed that the Support Case would be the best medium to provide information regarding this case and will be checked on regularly for updates as well.

To provide the latest update, the behavior was raised to the service team to investigate the connectivity issue. We are currently waiting on an update from IMDS regarding this case.

Thank you very much for your time and collaboration. Please let me know which is your preferred medium of communication. I will keep on providing updates here based on your communication preferences and the latest service team update.

Best regards,
Yasmine

@yasminetalby yasminetalby added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Oct 20, 2022
@github-actions

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

@github-actions github-actions bot added the closing-soon This issue will close in 4 days unless further comments are made. label Oct 26, 2022
@skumarstrike02

Please keep the issue open until we get a resolution from aws support and/or sdk team.

@github-actions github-actions bot removed closing-soon This issue will close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Oct 27, 2022
@skumarstrike02

The AWS SDK team helped us optimize our code by making the SQS client static. That change seems to be working for us and resolved the credentials throttling issue.

Thanks, everyone. This issue can be closed.
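
The thread does not show the actual change, so the following is only a sketch of what "making the SQS client static" commonly means: build one client per JVM and reuse it, instead of building a client (and with it a credentials provider) for every task. Presumably each per-task provider was fetching credentials from the metadata service on its own. Everything here is hypothetical and uses only the JDK: `FakeClient` stands in for `SqsClient`, and the counter stands in for the number of credential providers created.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical illustration: one shared client per JVM versus one client per task. */
class SharedClientDemo {

    /** Counts constructed clients (a stand-in for client plus credentials-provider construction). */
    static final AtomicInteger CREATED = new AtomicInteger();

    /** Stand-in for an SDK client; a real one would own a credentials provider and its cache. */
    static final class FakeClient {
        final int id = CREATED.incrementAndGet();

        String send(String body) {
            return "sent " + body + " via client#" + id;
        }
    }

    /** Initialization-on-demand holder: the JVM builds INSTANCE once, thread-safely, on first use. */
    private static final class Holder {
        static final FakeClient INSTANCE = new FakeClient();
    }

    /** The "static client" approach: every caller gets the same instance. */
    static FakeClient sharedClient() {
        return Holder.INSTANCE;
    }

    /** For comparison: a fresh client on every call. */
    static FakeClient newClientPerCall() {
        return new FakeClient();
    }

    /** Runs the given number of concurrent tasks and returns how many clients they created. */
    static int run(int tasks, Supplier<FakeClient> source) {
        int before = CREATED.get();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            final int n = i;
            pool.submit(() -> {
                try {
                    source.get().send("msg-" + n);
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } finally {
            pool.shutdown();
        }
        return CREATED.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("per-call clients created: " + run(50, SharedClientDemo::newClientPerCall));
        System.out.println("shared clients created:   " + run(50, SharedClientDemo::sharedClient));
    }
}
```

In a Spark job the same idea would apply per executor JVM: a static holder is initialized once per JVM, whereas a client created inside a task or closure is rebuilt for each task. A static holder like this is one way of implementing a singleton.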

@yasminetalby

Hello @skumarstrike02 ,

We are happy to hear that the fix worked. Thank you very much for your collaboration and for letting us know that this issue could be resolved.

Sincerely,

Yasmine

@github-actions

github-actions bot commented Nov 9, 2022

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@venkat4p

@skumarstrike02 What was the change here? What do you mean by a static SQS client? Is it different from a singleton?
