Unbalanced InitializeGlobalStatics/CleanupGlobalStatics() calls in AWSClient and race in implementation #815

crusader-mike · 2018-02-28T19:14:03Z

Platform: Windows
SDK version: 1.3.31

I am chasing a very rare crash (deep in the AWS SDK) that happens in a GetObject() call. After spending more than a day trying to reproduce it, I finally caught one instance in a debugger. Unfortunately the nuget package doesn't contain .pdb files and the symbols package has not been refreshed since 2016.

If you have them lying around -- can you provide me with pdb files for the x64 v140 Debug 1.3.31 build of the -core and -s3 dlls, please?

Also, it would be really helpful if .symbols nuget packages were released along with redist packages (or maybe even included into them).

Code is very simple and I suspect a bug in SDK itself:

Aws::S3::Model::GetObjectRequest q;
q.SetKey( stream_info.obj_key );
static_assert(sizeof(unsigned long long) >= sizeof(streamoff), "");
q.SetRange(common::str_printf("bytes=%llu-%llu", static_cast<unsigned long long>(read_offset), static_cast<unsigned long long>(read_offset + chunk_size - 1)));
auto r = s3c.GetObject(q);                  // <-- this is where it happens

Stack:

 	aws-cpp-sdk-core.dll!000007fed638169e()	Unknown
 	aws-cpp-sdk-core.dll!000007fed6382a0f()	Unknown
 	aws-cpp-sdk-core.dll!000007fed639a9a8()	Unknown
 	aws-cpp-sdk-core.dll!000007fed639a516()	Unknown
 	aws-cpp-sdk-s3.dll!000007fed66f45b1()	Unknown
 	aws-cpp-sdk-core.dll!000007fed639503f()	Unknown
 	aws-cpp-sdk-core.dll!000007fed6393c7b()	Unknown
 	aws-cpp-sdk-core.dll!000007fed63958d6()	Unknown
 	aws-cpp-sdk-s3.dll!000007fed66e73e5()	Unknown
>	blah-blah-blah::download_file_(...)

JonathanHenson · 2018-02-28T19:36:01Z

Dumb question, are you calling InitApi() before you do any of this?

crusader-mike · 2018-02-28T19:39:48Z

:-) Yes. It is possible that this logic is invoked outside of "aws init" guard object scope. But I've checked -- there are 8 guard objects active at this moment -- SDK is definitely initialized.

There are at least a dozen parallel threads retrieving other objects using the same download_file_() function right now. Could be some weird memory corruption, but it is hard to investigate anything without debug symbols.

One more note: the crash is an access violation at address 0x00000020 -- i.e. something somewhere is NULL when it shouldn't be. Offending code:

000007FED6381660  mov         qword ptr [rsp+8],rcx  
000007FED6381665  push        rdi  
000007FED6381666  sub         rsp,0B0h  
000007FED638166D  mov         rdi,rsp  
000007FED6381670  mov         ecx,2Ch  
000007FED6381675  mov         eax,0CCCCCCCCh  
000007FED638167A  rep stos    dword ptr [rdi]  
000007FED638167C  mov         rcx,qword ptr [rsp+0C0h]  
000007FED6381684  mov         rax,qword ptr [7FED662E5F0h]  
000007FED638168B  xor         rax,rsp  
000007FED638168E  mov         qword ptr [rsp+0A0h],rax  
000007FED6381696  mov         rax,qword ptr [rsp+0C0h]  
000007FED638169E  cmp         qword ptr [rax+8],0                      <--- here, rax is 24
000007FED63816A3  jne         000007FED63816A6  
000007FED63816A5  int         3  
000007FED63816A6  mov         rax,qword ptr [rsp+0C0h]  
000007FED63816AE  cmp         qword ptr [rax+10h],0  
000007FED63816B3  jne         000007FED63816B6  
000007FED63816B5  int         3  
000007FED63816B6  mov         rax,qword ptr [rsp+0C0h]  
000007FED63816BE  mov         eax,dword ptr [rax]  
000007FED63816C0  and         eax,100h  
000007FED63816C5  test        eax,eax

JonathanHenson · 2018-02-28T19:56:05Z

Also, nuget defaults don’t upgrade everything when you pull in a new dependency, can you check that all of the versions match?

marcomagdy · 2018-02-28T20:02:06Z

We don't keep pdb files around from earlier versions unfortunately. Also, Microsoft's debugger does some annoying timestamp checking to ensure the pdb file was created for the corresponding dll. So even if the code is the same it won't let you use the pdb.

Your best bet is to build from source.
Also, aren't you using a custom streambuf? Did you rule out your implementation?

crusader-mike · 2018-02-28T20:18:38Z

@JonathanHenson

> Also, nuget defaults don’t upgrade everything when you pull in a new dependency, can you check that all of the versions match?

Everything seems in order:

AWSSDKCPP-Core                      {1.3.31}
AWSSDKCPP-Core.redist               {1.3.31}
AWSSDKCPP-S3                        {1.3.20060301.31}
AWSSDKCPP-S3.redist                 {1.3.20060301.31}

@marcomagdy

> Your best bet is to build from source.

Sigh... I'll have to figure out how to do it -- we were always using nuget packages on windows. Too bad this problem popped up like literally one day before major release. :-\

Even a .map file can be useful here -- I'd be able to find in which function this happens using its address.

> Also, aren't you using a custom streambuf? Did you rule out your implementation?

Custom streambuf is used only for upload -- for download I opted for piece-by-piece download using SetRange() with default stream (std::stringstream, afair).

marcomagdy · 2018-02-28T20:55:54Z

@crusader-mike what credentials provider are you using?

crusader-mike · 2018-02-28T21:06:48Z

@marcomagdy I pass credentials directly into S3Client ctor:

Aws::S3::S3Client(Aws::Auth::AWSCredentials(s3_access_key_id, s3_secret_key),
                  awsClientConfiguration(s3_endpoint, s3_auth_region, opts_.s3_request_timeout_ms, opts_.s3_connect_timeout_ms, opts_.s3_max_connections));

crusader-mike · 2018-02-28T23:05:07Z

Huh... Upgrading SDK to 1.4.1 caused the problem to go away, it seems -- I can no longer reproduce it. What is more interesting -- CPU usage went down ~2 times and network utilization went up ~5 times (debug build).

On the bad side -- our QA will chew my head off. They've just finished multi-week regression tests.

crusader-mike · 2018-03-01T00:51:02Z

Actually, taking it back... With latest SDK it shows up in another place and is even harder to reproduce.

I spotted something though -- AWSClient ctor/dtor call InitializeGlobalStatics/CleanupGlobalStatics. And everything seems fine... Until you realize that generated move ctor is not gonna do that! Guess what -- my code moves S3Client instances. I am not sure that this is the root cause, but I'll keep digging.

Also, InitializeGlobalStatics/CleanupGlobalStatics implementation looks very weird to me (and I've done a fair amount of lockless programming) -- I highly doubt it is correct.

marcomagdy · 2018-03-01T00:57:17Z

The compiler won't generate a move ctor if there's a user-defined dtor; and in this case, there is one defined.
shameless plug https://stackoverflow.com/questions/4943958/conditions-for-automatic-generation-of-default-copy-move-ctor-and-copy-move-assi/38257488#38257488

Which means we have a different kind of bug. When the client is copied, we don't increment the refcount.

crusader-mike · 2018-03-01T01:02:33Z

... but it will generate copy ctor, right? which won't call InitializeGlobalStatics()...

marcomagdy · 2018-03-01T01:03:45Z

Yes, correct. That's probably the bug. See my updated comment above.
You can avoid this problem now by avoiding copying the client.
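
The imbalance described above can be reproduced outside the SDK. The following is a minimal sketch, not the SDK's code: `g_refCount`, `LeakyClient`, `BalancedClient` and `refcount_after_copy_dies` are hypothetical names, and a plain integer stands in for whatever InitializeGlobalStatics/CleanupGlobalStatics manage. It is single-threaded on purpose, so it shows only the copy problem and says nothing about the race.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the SDK's global reference count.
static int g_refCount = 0;

struct LeakyClient {
    LeakyClient() { ++g_refCount; }    // plays the role of InitializeGlobalStatics()
    ~LeakyClient() { --g_refCount; }   // plays the role of CleanupGlobalStatics()
    // No copy ctor is written: the compiler generates one that never touches
    // g_refCount. The user-declared dtor suppresses the implicit move ctor,
    // so std::move() on a LeakyClient falls back to that copy ctor.
};

struct BalancedClient {
    BalancedClient() { ++g_refCount; }
    BalancedClient(const BalancedClient&) { ++g_refCount; }  // copies are counted too
    BalancedClient& operator=(const BalancedClient&) = default;  // assignment keeps the count
    ~BalancedClient() { --g_refCount; }
};

// Returns the count observed while `original` is still alive but after a
// copy of it has already been destroyed.
template <class Client>
int refcount_after_copy_dies() {
    g_refCount = 0;
    Client original;
    assert(g_refCount == 1);
    {
        Client copy(std::move(original));  // Leaky: count stays 1. Balanced: 2.
    }                                      // copy's dtor runs here
    return g_refCount;                     // Leaky: 0. Balanced: 1.
}
```

With `LeakyClient` the count reaches zero while `original` is still alive, which is the situation the client cache above can end up in.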

crusader-mike · 2018-03-01T01:09:33Z

Working on getting rid of "S3Client move" logic in our code and testing. (Edit: nope, crash is still there) Pretty sure there are 2 bugs here -- InitializeGlobalStatics/CleanupGlobalStatics() implementation is not good, I feel there is a race there.

Some background -- I have a S3Client cache:

    using s3cache_t = map<tuple<string, string, string, string>, Aws::S3::S3Client>;    // { (endpoint, key, secret, region) -> S3Client }
    s3cache_t s3cache_;

    Aws::S3::S3Client& get_client(opt_string const& endpoint, opt_string const& access_key_id, opt_string const& secret_key, opt_string const& auth_region) override
    {
        lock_guard<mutex> lock{mutex_};     // it is ok to reuse this mutex for cache protection

        auto& s3_endpoint       = endpoint      ? *endpoint         : opts_.s3_endpoint     ;
        auto& s3_access_key_id  = access_key_id ? *access_key_id    : opts_.s3_access_key_id;
        auto& s3_secret_key     = secret_key    ? *secret_key       : opts_.s3_secret_key   ;
        auto& s3_auth_region    = auth_region   ? *auth_region      : opts_.s3_auth_region  ;

        auto key = make_tuple(s3_endpoint, s3_access_key_id, s3_secret_key, s3_auth_region);
        auto it = s3cache_.lower_bound(key);
        if (it == s3cache_.end() || it->first != key)
            it = s3cache_.emplace_hint(it, key, Aws::S3::S3Client(Aws::Auth::AWSCredentials(s3_access_key_id, s3_secret_key),
                                                                  awsClientConfiguration(s3_endpoint, s3_auth_region, opts_.s3_request_timeout_ms, opts_.s3_connect_timeout_ms, opts_.s3_max_connections)));
        return it->second;
    }

As you see -- after all is done, I could end up with "global statics" not being initialized even though I have live instances of S3Client in cache. I think my crashes happen during destruction of this cache.

crusader-mike · 2018-03-01T02:20:40Z

@marcomagdy What cmake command line should I use to build the same binary as the one you publish in nuget? I tried different combinations -- every time I end up with a dll size that differs from the dll in the nuget package. (my download speed ends up being 4 times slower too)

crusader-mike · 2018-03-01T02:37:15Z

Caught it -- same problem as #781

>	aws-cpp-sdk-core.dll!Aws::External::tinyxml2::StrPair::GetStr() Line 199	C++
 	aws-cpp-sdk-core.dll!Aws::External::tinyxml2::XMLNode::Value() Line 635	C++
 	aws-cpp-sdk-core.dll!Aws::Utils::Xml::XmlNode::GetName() Line 59	C++
 	aws-cpp-sdk-core.dll!Aws::Client::XmlErrorMarshaller::ContainsError(const Aws::Http::HttpResponse & httpResponse) Line 124	C++
 	aws-cpp-sdk-s3.dll!Aws::S3::S3Client::DoesResponseGenerateError(const std::shared_ptr<Aws::Http::HttpResponse> & response) Line 2801	C++
 	aws-cpp-sdk-core.dll!Aws::Client::AWSClient::AttemptOneRequest(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 295	C++
 	aws-cpp-sdk-core.dll!Aws::Client::AWSClient::AttemptExhaustively(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 174	C++
 	aws-cpp-sdk-core.dll!Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(const Aws::Http::URI & uri, const Aws::AmazonWebServiceRequest & request, Aws::Http::HttpMethod method, const char * signerName) Line 341	C++
 	aws-cpp-sdk-s3.dll!Aws::S3::S3Client::GetObject(const Aws::S3::Model::GetObjectRequest & request) Line 1441	C++

crusader-mike · 2018-03-01T22:16:57Z

To summarize:

AWSClient ctor/dtor call InitializeGlobalStatics/CleanupGlobalStatics(), but (if you copy/move S3Client object) generated copy ctor -- doesn't.
I believe InitializeGlobalStatics/CleanupGlobalStatics() is not good and has races in it. My suggestion is to stay away from lockless programming -- it is exceptionally tricky (and AWS SDK switches off exceptions :) )
I would appreciate it if someone could tell me the cmake line used by the official nuget build -- to be able to build the same binaries locally (with debug symbols).

Thank you.

marcomagdy · 2018-05-22T23:39:30Z

Fixed in 1.4.55
see 832c839 for more details.

crusader-mike · 2018-05-23T01:17:33Z

This fix isn't good -- there are still race conditions here. For example, imagine a case of two threads calling AWSClient ctor and the first one (one that observed s_refCount++ == 0 being true) is preempted (or suspended) for a long time. Second thread will happily continue and eventually crash because global state isn't initialized (even though s_refCount is > 0).

Similarly, there is a race between dtor and ctor.

Edit: I can't reopen this issue (and too lazy to file a new one), gotta leave this on your conscience :-)

marcomagdy · 2018-05-23T02:22:39Z

Whoops, that’s embarrassing. We’ll get that fixed. Thanks for spotting it.

marcomagdy · 2018-05-23T03:47:38Z

@crusader-mike can you take a look at #879 see if I missed a use case?

crusader-mike · 2018-05-23T08:07:09Z

Nope. You can't use an atomic variable here to properly protect global state (unless your entire state is an atomic variable). Think about it -- for other threads at any point in time there should be only two observable states: "s_refCount == 0 and global state is completely uninitialized" and "s_refCount != 0 and global state is completely initialized". You are guaranteed to break this invariant by modifying s_refCount and global state separately (from another thread's perspective). You have to use a mutex here to "merge" these modifications into one (in the eyes of observers).

P.S. I am probably repeating myself -- but stay away from lockless programming unless you are ready to sink significant time into it. Even if you get to the point when you can comfortably write such code -- necessary knowledge will take so much space in your brain that it will probably push out most of other C++-related stuff :)
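
The invariant above can be sketched as follows. This is only an illustration of the mutex-based approach, not the fix the SDK adopted: `GlobalState`, `g_mutex`, `acquire_global_state`, `release_global_state` and `count_invariant_violations` are hypothetical names. The point is that the count and the state only ever change while the same mutex is held, so no thread can observe one updated without the other.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical global state; stands in for whatever the SDK initializes.
struct GlobalState { int value = 42; };

static std::mutex g_mutex;                    // guards both variables below
static int g_refCount = 0;
static std::unique_ptr<GlobalState> g_state;

void acquire_global_state() {
    std::lock_guard<std::mutex> lock(g_mutex);
    if (g_refCount++ == 0)
        g_state.reset(new GlobalState());     // count and state change together
}

void release_global_state() {
    std::lock_guard<std::mutex> lock(g_mutex);
    assert(g_refCount > 0);
    if (--g_refCount == 0)
        g_state.reset();                      // torn down under the same lock
}

// Each worker repeatedly acquires a reference, checks that the state exists,
// and releases. Returns how many times a worker held a reference without
// seeing fully initialized state.
int count_invariant_violations(int threads, int iterations) {
    std::vector<int> violations(threads, 0);  // one slot per worker, no sharing
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&violations, t, iterations] {
            for (int i = 0; i < iterations; ++i) {
                acquire_global_state();
                // While this thread holds a reference the count cannot reach 0,
                // so nobody can reset g_state underneath it.
                if (!g_state || g_state->value != 42) ++violations[t];
                release_global_state();
            }
        });
    }
    for (auto& w : workers) w.join();
    int total = 0;
    for (int v : violations) total += v;
    return total;
}
```

A stress loop like `count_invariant_violations` cannot prove the absence of races; the argument for correctness is that every transition happens under `g_mutex`.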

marcomagdy · 2018-05-23T16:39:17Z

I think I'll take @singku's suggestion and initialize this container in Aws::init. This whole ordeal was about trying to optimize that container so that it doesn't grow unbounded. But that reasoning is not strong enough to complicate the constructor's logic (and definitely doesn't warrant a mutex).

> stay away from lockless programming unless you are ready to sink significant time into it.

I appreciate your concern. But if I'm going to stay away from hard programming paradigms, I might as well go program in JavaScript or Ruby 😄.
But in all seriousness, I usually have my race-conditions covered well, and we have good engineers on the team that catch the things I miss. But obviously this case is an exception. I dropped the ball, but I'm glad you caught it.

crusader-mike · 2018-05-23T18:14:44Z

It isn't just a concern -- I think every time I stumbled upon an atomic variable in SDK code there was a race condition nearby.

You can achieve similar effect using static variables or std::call_once -- though afaik they are implemented using mutex.
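
For illustration, the static-variable option mentioned above could look like the sketch below. `GlobalState`, `g_initCount` and `global_state` are hypothetical names, not SDK identifiers. Since C++11 the initialization of a function-local static is guaranteed to run exactly once even when several threads reach it concurrently; std::call_once with a std::once_flag gives the same once-only guarantee for code that cannot be expressed as an initializer.

```cpp
#include <cassert>

// Hypothetical global state; stands in for whatever the SDK initializes.
struct GlobalState { int value = 42; };

// Counts how often the initializer ran (demo only).
static int g_initCount = 0;

GlobalState& global_state() {
    // The lambda runs once, on the first call; later calls (from any thread)
    // skip it and return the same object.
    static GlobalState state = [] {
        ++g_initCount;
        return GlobalState{};
    }();
    return state;
}
```

The trade-off is that the object lives until program exit, so this covers the "initialize once" half of the problem but not a reference-counted teardown.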

Yes, moving this into AwsInit sounds like a good idea.

Atomic variables aren't necessarily better than a mutex -- each atomic variable has a "mutex" associated with it (google MESI protocol) and the price of using it grows with the number of processors in your system and gets especially bad on modern servers (due to NUMA). A lot of wait-free algorithms turned out to be slower than straightforward mutex-protected code simply because with a mutex you pay this price only twice (lock + unlock) but with atomic variables -- on every access.

If you are interested in lockless programming I suggest reading/watching these:

there were two presentations Herb Sutter did about C++11 memory model -- it should not be hard to find them in google. Don't trust everything he says, though -- I think there were few mistakes made :)
Memory Consistency Models for Shared-Memory Multiprocessors -- afaik this is the foundation on which the current Java/C++ memory model is built. It is pointless to learn everything in this paper, but it makes sense to familiarize yourself with it.
ABA problem
Futexes are tricky
Wait-Free Synchronization
A methodology for implementing highly concurrent data objects

JonathanHenson · 2018-05-23T18:26:25Z

IIRC this code was written before aws init existed.

WRT luckless stuff, I agree.

Another aspect here in hindsight, lockless algorithms aren’t great for long term code viability. Even if they are correct when written, it increases cognitive load for all future developers and will inevitably be broken by future changes.

crusader-mike · 2018-05-23T21:39:08Z

luckless™ code is amazing for generating cognitive load, lol. And sometimes it takes people years to realize that "it broke".

marcomagdy · 2018-05-23T23:17:33Z

> simply because with mutex you pay this price only twice (lock + unlock) but with atomic variables -- on every access.

That is not true. An atomic read of register-wide variables, like pointers and integers (at least on x86) is simply a memory load. Locking a mutex is a system call.

crusader-mike · 2018-05-23T23:29:55Z

My low-lvl x86 is a bit rusty, but I am sure there is a fence on read -- even if you don't observe in generated asm because x86 is naturally consistent (and acquire-read is just a read), it still exists on higher level (compiler can't reorder certain things around it).

Also, afaik, on modern systems locking mutex is system call (or better to say -- switch to kernel mode) only if you have to wait. Grabbing mutex that isn't locked is inexpensive. In pre-futex times on Linux mutex was expensive, yes.

JonathanHenson · 2018-05-24T00:16:51Z

Unfortunately I’ve had to read a ton of clang and llvm source code lately and pthread_mutex using the fast mutex init is only a system call under contention. The same is true for SRWLocks on windows... std::mutex simply calls into those apis.

The days of critical sections and full mutex objects on windows are, thankfully, behind us..

JonathanHenson · 2018-05-24T00:20:11Z

Also, fwiw, you’re screwed on ARM, SPARC, and PowerPC regardless of which you use.

crusader-mike · 2018-05-24T06:29:38Z

@JonathanHenson Are you referring to this or the fact that on these platforms memory model is more relaxed than C++11 model (which means you can't use C++ to squeeze the maximum performance-wise)?

marcomagdy · 2018-05-29T17:20:27Z

Fixed in v1.4.57

crusader-mike · 2018-05-29T21:05:03Z

Small note: since you've left ~AWSClient() in place it probably makes sense to add move ctor and op=? E.g. smth like AWSClient(AWSClient&&) = default;. Not sure if this will force you to do the same for copy ctor/op=.

marcomagdy · 2018-05-29T21:13:13Z

I have to leave the dtor so it can be marked as virtual. I did have defaulted copy/move ctors and assignment operators, but the code failed to build on VS2013, which unfortunately we still support. It is by far the worst C++11 compliant compiler.

crusader-mike · 2018-05-29T21:45:41Z

Yeah, VS2013 was rough around the edges... Can you work around that like this:

struct goddamitbill { virtual ~goddamitbill() = default; };

class AwsClient : public goddamitbill { ... };

?
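
A cleaned-up sketch of the base-class idea above, with hypothetical names (`VirtualDtorBase`, `ClientSketch`): because the derived class declares no destructor of its own, a conforming C++11 compiler still generates its move constructor, while the destructor is virtual through the base. Whether VS2013 accepts this is not verified here; it is only an illustration on a standard-conforming compiler.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// The only job of this base is to make the destructor virtual.
struct VirtualDtorBase {
    virtual ~VirtualDtorBase() = default;
};

// No user-declared dtor, copy or move operations here, so the implicit
// move ctor and move assignment are generated by the compiler.
class ClientSketch : public VirtualDtorBase {
public:
    explicit ClientSketch(int v) : payload_(new int(v)) {}
    const int* payload() const { return payload_.get(); }

private:
    std::unique_ptr<int> payload_;  // move-only member: a copy would not compile
};

static_assert(std::has_virtual_destructor<ClientSketch>::value,
              "dtor is virtual via the base class");
static_assert(std::is_move_constructible<ClientSketch>::value,
              "implicit move ctor is available");
```

The move-only member makes the check meaningful: if the move constructor were suppressed, `ClientSketch` would be neither copyable nor movable and the second static_assert would fail.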

wps132230 added the help wanted label Feb 28, 2018

marcomagdy added bug This issue is a bug. and removed help wanted labels Mar 1, 2018

crusader-mike changed the title ~~Debug symbols for nuget packages~~ Unbalanced InitializeGlobalStatics/CleanupGlobalStatics() calls in AWSClient and race in implementation Mar 1, 2018

marcomagdy closed this as completed May 22, 2018

marcomagdy reopened this May 23, 2018

marcomagdy closed this as completed May 29, 2018

svart-riddare pushed a commit to svart-riddare/aws-sdk-cpp that referenced this issue Apr 10, 2025

Add copy and move constructors to AWSClient

832c839

addresses the issue raised in aws#815

svart-riddare pushed a commit to svart-riddare/aws-sdk-cpp that referenced this issue Apr 10, 2025

Remove refcounted EnumOverflowContainer

d4d44a5

see discussion at aws#815
