Seeing curl_easy_perform stuck at aws-sdk 1.7.336 #1861

Closed
sihanwang41 opened this issue Feb 11, 2022 · 1 comment
Labels
closed-for-staleness guidance Question that needs advice or information. third-party This issue is related to third-party libraries or applications.

Comments

@sihanwang41

Describe the issue

We are using TensorFlow 2.6, which uses aws-sdk-cpp 1.7.336 by default.

The issue doesn't always happen, but it happens quite often on some of the hosts in one big cluster. Setting httpRequestTimeoutMs to 10 s and retrying up to 10 times helps work around the issue.

Hundreds of hosts (500 to 1000) query the same object at nearly the same time.
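For readers who construct the S3 client themselves, the workaround described above might look roughly like the configuration fragment below. This is a sketch under assumptions, not what TensorFlow does: TensorFlow builds its own client internally, so this fragment does not apply to it directly. The function name `MakeS3ClientWithTimeouts`, the allocation tag, and the 3 s connect timeout are made up for illustration; the 10 s `httpRequestTimeoutMs` and the 10 retries are the values reported in this issue.

```cpp
#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/core/client/DefaultRetryStrategy.h>
#include <aws/s3/S3Client.h>

// Hypothetical helper: builds an S3 client whose requests cannot hang forever.
// Must be called between Aws::InitAPI() and Aws::ShutdownAPI().
Aws::S3::S3Client MakeS3ClientWithTimeouts() {
    Aws::Client::ClientConfiguration config;

    // Upper bound on one whole HTTP request (value taken from this issue).
    config.httpRequestTimeoutMs = 10000;

    // Assumed value: fail fast if the TCP connection cannot be established.
    config.connectTimeoutMs = 3000;

    // Retry up to 10 times (value taken from this issue).
    config.retryStrategy =
        Aws::MakeShared<Aws::Client::DefaultRetryStrategy>("s3-retry-sketch", 10);

    return Aws::S3::S3Client(config);
}
```

With an overall request timeout in place, a transfer that stalls inside `curl_easy_perform` should return an error instead of blocking in `poll()`, and the retry strategy then decides whether to try again.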

Thread 123 (Thread 0x7f954c3cc700 (LWP 321)):
#0 0x00007f9b01954cb9 in poll () from ./libc.so.6
#1 0x0000557e95c5bec2 in Curl_poll () at /usr/include/c++/8/ext/new_allocator.h:86
#2 0x0000557e95c56e89 in multi_wait.part () at /usr/include/c++/8/ext/new_allocator.h:86
#3 0x0000557e95c57079 in curl_multi_poll () at /usr/include/c++/8/ext/new_allocator.h:86
#4 0x0000557e95c4b3b3 in curl_easy_perform () at /usr/include/c++/8/ext/new_allocator.h:86
#5 0x0000557e95a7ab4b in Aws::Http::CurlHttpClient::MakeRequestInternal(Aws::Http::HttpRequest&, std::shared_ptr<Aws::Http::Standard::StandardHttpResponse>&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const () at /usr/include/c++/8/ext/new_allocator.h:86
#6 0x0000557e95a7cc59 in Aws::Http::CurlHttpClient::MakeRequest(std::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const () at /usr/include/c++/8/ext/new_allocator.h:86
#7 0x0000557e95bfcda6 in Aws::Client::AWSClient::AttemptOneRequest(std::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*) const ()
at /usr/include/c++/8/ext/new_allocator.h:86
#8 0x0000557e95bfd3f4 in Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*) const ()
at /usr/include/c++/8/ext/new_allocator.h:86
#9 0x0000557e95bfe245 in Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*) const ()
at /usr/include/c++/8/ext/new_allocator.h:86
#10 0x0000557e95ad9b4f in Aws::S3::S3Client::GetObject(Aws::S3::Model::GetObjectRequest const&) const () at /usr/include/c++/8/ext/new_allocator.h:86
#11 0x0000557e95a573dd in tensorflow::(anonymous namespace)::S3RandomAccessFile::ReadS3Client (scratch=0x7f92b8c01580 "", result=0x7f954c3b4b80, n=, offset=,
this=0x7f8c07542d10) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#12 tensorflow::(anonymous namespace)::S3RandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits<char> >*, char*) const ()
at external/org_tensorflow/tensorflow/core/platform/s3/s3_file_system.cc:255
#13 0x0000557e95a17d0f in tensorflow::retrying_internals::RetryingRandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits<char> >*, char*) const::{lambda()#1}::operator()() const (__closure=) at /usr/include/c++/8/bits/unique_ptr.h:345
#14 std::_Function_handler<tensorflow::Status (), tensorflow::retrying_internals::RetryingRandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits<char> >*, char*) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/8/bits/std_function.h:283
#15 0x0000557e95a41e7a in std::function<tensorflow::Status ()>::operator()() const (this=0x7f954c3b4a10) at /usr/include/c++/8/bits/std_function.h:682
#16 tensorflow::RetryingUtils::CallWithRetries(std::function<tensorflow::Status ()> const&, std::function<void (long)> const&, tensorflow::RetryConfig const&) (f=..., sleep_usec=..., config=...)
at external/org_tensorflow/tensorflow/core/platform/retrying_utils.cc:54
#17 0x0000557e95a42512 in tensorflow::RetryingUtils::CallWithRetries(std::function<tensorflow::Status ()> const&, tensorflow::RetryConfig const&) (f=..., config=...) at /usr/include/c++/8/new:169
#18 0x0000557e95a18c89 in tensorflow::retrying_internals::RetryingRandomAccessFile::Read (this=, offset=955128096, n=83425632, result=0x7f954c3b4b80, scratch=0x7f92b8c01580 "")
at /usr/include/c++/8/bits/std_function.h:87
#19 0x0000557e91773cbd in tensorflow::BundleReader::GetValue (this=this@entry=0x7f954c3b5570, entry=..., val=val@entry=0x7f8c08633820)
at bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/core/protobuf/tensor_bundle.pb.h:641
#20 0x0000557e9177dc9d in tensorflow::BundleReader::Lookup(std::basic_string_view<char, std::char_traits<char> >, tensorflow::Tensor*) ()
at external/org_tensorflow/tensorflow/core/util/tensor_bundle/tensor_bundle.cc:947
#21 0x0000557e8dc587f1 in tensorflow::(anonymous namespace)::RestoreOp::run(tensorflow::BundleReader*) () at external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorMorphing.h:653

Steps to Reproduce

No response

Current behavior

No response

AWS CPP SDK version used

1.7.336

compiler and version used

6.5.0

Operating System and version

Ubuntu 18.04

@sihanwang41 sihanwang41 added guidance Question that needs advice or information. needs-triage This issue or PR still needs to be triaged. labels Feb 11, 2022
@KaibaLopez
Contributor

Hi @sihanwang41,
It's hard to help when you are using such an old version of the SDK, and in conjunction with a third-party library.
But yes, increasing the request timeout and the number of retries would be the proposed workaround for this.
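To illustrate the shape of that workaround independent of any SDK, here is a minimal standard-library sketch of a bounded retry loop with jittered exponential backoff. Everything in it (`AttemptResult`, `RetryPolicy`, `BackoffDelay`, `CallWithRetries`, and the delay values) is hypothetical and is not the SDK's or TensorFlow's implementation. It assumes each attempt already enforces its own timeout and reports `TimedOut` rather than blocking. The random jitter is an addition that is not mentioned in the thread; it is meant to keep many hosts from retrying in lockstep.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <random>

// Hypothetical outcome of one attempt; only TimedOut is treated as retryable.
enum class AttemptResult { Ok, TimedOut, Failed };

// Hypothetical policy. max_attempts mirrors the "10 retries" from the issue;
// the delays are assumed values.
struct RetryPolicy {
    int max_attempts = 10;
    std::chrono::milliseconds base_delay{100};
    std::chrono::milliseconds max_delay{5000};
};

// Full-jitter backoff: a uniformly random delay in
// [0, min(max_delay, base_delay * 2^attempt)].
std::chrono::milliseconds BackoffDelay(const RetryPolicy& policy, int attempt,
                                       std::mt19937_64& rng) {
    const int shift = std::min(attempt, 20);  // keep the shift from overflowing
    const std::int64_t cap = std::min<std::int64_t>(
        policy.max_delay.count(), policy.base_delay.count() << shift);
    std::uniform_int_distribution<std::int64_t> dist(0, cap);
    return std::chrono::milliseconds(dist(rng));
}

// Runs `attempt` until it returns something other than TimedOut or the
// attempt budget is used up. `sleep_fn` is injected so callers (and tests)
// control how the waiting happens.
AttemptResult CallWithRetries(
    const std::function<AttemptResult()>& attempt, const RetryPolicy& policy,
    const std::function<void(std::chrono::milliseconds)>& sleep_fn,
    std::mt19937_64& rng) {
    assert(policy.max_attempts > 0);
    AttemptResult last = AttemptResult::Failed;
    for (int i = 0; i < policy.max_attempts; ++i) {
        last = attempt();
        if (last != AttemptResult::TimedOut) {
            return last;  // success or a non-retryable failure
        }
        if (i + 1 < policy.max_attempts) {
            sleep_fn(BackoffDelay(policy, i, rng));  // wait before retrying
        }
    }
    return last;  // still TimedOut after the final attempt
}
```

In real use, `sleep_fn` would typically be a lambda that calls `std::this_thread::sleep_for`, and `attempt` would wrap a single request that has its own timeout.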

@KaibaLopez KaibaLopez added closing-soon This issue will automatically close in 4 days unless further comments are made. third-party This issue is related to third-party libraries or applications. and removed needs-triage This issue or PR still needs to be triaged. labels Feb 11, 2022
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Feb 17, 2022