Seeing curl_easy_perform stuck at aws-sdk 1.7.336 #1861
Labels
closed-for-staleness
guidance: Question that needs advice or information.
third-party: This issue is related to third-party libraries or applications.
Describe the issue
We are using TensorFlow 2.6, which by default uses aws-sdk-cpp 1.7.336. Threads reading from S3 intermittently hang inside curl_easy_perform (stack trace below).
The issue does not always occur, but it happens quite often on some hosts in one large cluster. Setting httpRequestTimeoutMs to 10 s and retrying up to 10 times helps work around the issue.
Hundreds of hosts (500-1000) query the same object at nearly the same time.
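For anyone hitting the same hang, the workaround above (a bounded per-request timeout plus a fixed number of retries) can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the SDK's or TensorFlow's retry code: `RetryWithJitter` is a hypothetical helper, and it assumes the per-attempt timeout is enforced by the HTTP client itself (for example via httpRequestTimeoutMs), so that a stuck attempt eventually returns failure. The randomized backoff is an addition that the report does not mention; it is included only because many hosts retrying in lockstep against the same object may make contention worse.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// Hypothetical helper, not part of aws-sdk-cpp or TensorFlow.
// Calls `attempt` until it returns true or `max_attempts` is reached,
// sleeping a random delay between attempts. The upper bound of that
// delay doubles each round and is clamped to `max_delay`.
// Returns the number of attempts used on success, or 0 if all failed.
inline int RetryWithJitter(const std::function<bool()>& attempt,
                           int max_attempts,
                           std::chrono::milliseconds base_delay,
                           std::chrono::milliseconds max_delay) {
  std::mt19937 rng(std::random_device{}());
  for (int i = 1; i <= max_attempts; ++i) {
    if (attempt()) return i;        // success on the i-th attempt
    if (i == max_attempts) break;   // no sleep after the final failure
    // Upper bound for this round: base * 2^(i-1), clamped to max_delay.
    const long long cap = std::min<long long>(
        max_delay.count(), base_delay.count() << std::min(i - 1, 20));
    std::uniform_int_distribution<long long> dist(0, cap);
    std::this_thread::sleep_for(std::chrono::milliseconds(dist(rng)));
  }
  return 0;
}
```

A caller would wrap the actual read in the `attempt` callback, returning false when the request times out or fails, and pass 10 as `max_attempts` to mirror the settings described above.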
Thread 123 (Thread 0x7f954c3cc700 (LWP 321)):
#0 0x00007f9b01954cb9 in poll () from ./libc.so.6
#1 0x0000557e95c5bec2 in Curl_poll () at /usr/include/c++/8/ext/new_allocator.h:86
#2 0x0000557e95c56e89 in multi_wait.part () at /usr/include/c++/8/ext/new_allocator.h:86
#3 0x0000557e95c57079 in curl_multi_poll () at /usr/include/c++/8/ext/new_allocator.h:86
#4 0x0000557e95c4b3b3 in curl_easy_perform () at /usr/include/c++/8/ext/new_allocator.h:86
#5 0x0000557e95a7ab4b in Aws::Http::CurlHttpClient::MakeRequestInternal(Aws::Http::HttpRequest&, std::shared_ptr<Aws::Http::Standard::StandardHttpResponse>&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const () at /usr/include/c++/8/ext/new_allocator.h:86
#6 0x0000557e95a7cc59 in Aws::Http::CurlHttpClient::MakeRequest(std::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const () at /usr/include/c++/8/ext/new_allocator.h:86
#7 0x0000557e95bfcda6 in Aws::Client::AWSClient::AttemptOneRequest(std::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*) const ()
at /usr/include/c++/8/ext/new_allocator.h:86
#8 0x0000557e95bfd3f4 in Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*) const ()
at /usr/include/c++/8/ext/new_allocator.h:86
#9 0x0000557e95bfe245 in Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*) const ()
at /usr/include/c++/8/ext/new_allocator.h:86
#10 0x0000557e95ad9b4f in Aws::S3::S3Client::GetObject(Aws::S3::Model::GetObjectRequest const&) const () at /usr/include/c++/8/ext/new_allocator.h:86
#11 0x0000557e95a573dd in tensorflow::(anonymous namespace)::S3RandomAccessFile::ReadS3Client (scratch=0x7f92b8c01580 "", result=0x7f954c3b4b80, n=<optimized out>, offset=<optimized out>,
this=0x7f8c07542d10) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#12 tensorflow::(anonymous namespace)::S3RandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits<char> >*, char*) const ()
at external/org_tensorflow/tensorflow/core/platform/s3/s3_file_system.cc:255
#13 0x0000557e95a17d0f in tensorflow::retrying_internals::RetryingRandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits<char> >*, char*) const::{lambda()#1}::operator()() const (__closure=<optimized out>) at /usr/include/c++/8/bits/unique_ptr.h:345
#14 std::_Function_handler<tensorflow::Status (), tensorflow::retrying_internals::RetryingRandomAccessFile::Read(unsigned long, unsigned long, std::basic_string_view<char, std::char_traits<char> >*, char*) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/8/bits/std_function.h:283
#15 0x0000557e95a41e7a in std::function<tensorflow::Status ()>::operator()() const (this=0x7f954c3b4a10) at /usr/include/c++/8/bits/std_function.h:682
#16 tensorflow::RetryingUtils::CallWithRetries(std::function<tensorflow::Status ()> const&, std::function<void (long)> const&, tensorflow::RetryConfig const&) (f=..., sleep_usec=..., config=...)
at external/org_tensorflow/tensorflow/core/platform/retrying_utils.cc:54
#17 0x0000557e95a42512 in tensorflow::RetryingUtils::CallWithRetries(std::function<tensorflow::Status ()> const&, tensorflow::RetryConfig const&) (f=..., config=...) at /usr/include/c++/8/new:169
#18 0x0000557e95a18c89 in tensorflow::retrying_internals::RetryingRandomAccessFile::Read (this=<optimized out>, offset=955128096, n=83425632, result=0x7f954c3b4b80, scratch=0x7f92b8c01580 "")
at /usr/include/c++/8/bits/std_function.h:87
#19 0x0000557e91773cbd in tensorflow::BundleReader::GetValue (this=this@entry=0x7f954c3b5570, entry=..., val=val@entry=0x7f8c08633820)
at bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/core/protobuf/tensor_bundle.pb.h:641
#20 0x0000557e9177dc9d in tensorflow::BundleReader::Lookup(std::basic_string_view<char, std::char_traits<char> >, tensorflow::Tensor*) ()
at external/org_tensorflow/tensorflow/core/util/tensor_bundle/tensor_bundle.cc:947
#21 0x0000557e8dc587f1 in tensorflow::(anonymous namespace)::RestoreOp::run(tensorflow::BundleReader*) () at external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorMorphing.h:653
Steps to Reproduce
No response
Current behavior
No response
AWS CPP SDK version used
1.7.336
compiler and version used
6.5.0
Operating System and version
Ubuntu 18.04