most requests fail with curl code 77 error #530

Closed
crusader-mike opened this issue May 17, 2017 · 21 comments
Labels
help wanted We are asking the community to submit a PR to resolve this issue.

Comments

@crusader-mike

crusader-mike commented May 17, 2017

My code was working just fine until the decision was made to start linking the AWS SDK statically. Now most requests (~80%) fail with (error 99) 'Unable to connect to endpoint' and the log shows this:
[ERROR] 2017-05-16 20:05:18 CurlHttpClient [140198444779264] Curl returned error code 77

I spent about half a day trying to figure it out, but so far no luck. Google is full of 'curl code 77' problems related to certificate storage access, but that can't explain why about 20% of requests work just fine (or why static linking would cause this).

I suspect it might be related to the linker options I am using (is the order of some libs wrong?). As of now the end of that line looks like this:
-laws-cpp-sdk-s3 -laws-cpp-sdk-core -lcurl -lcrypto

I had to add curl and crypto after switching to static linking.

Here is how I built the SDK (note we use v1.0.59):

- install AWS SDK:
	- install cmake v3.0+:
		cd ~
		wget https://cmake.org/files/v3.7/cmake-3.7.2.tar.gz
		tar xzf cmake-3.7.2.tar.gz
		cd cmake-3.7.2
		./bootstrap
		make -j
		sudo make install
		hash -r
	- install AWS C++ SDK v1.0.59 (and related dependencies):
		sudo yum -y install libcurl-devel openssl-devel libuuid-devel
		git clone https://github.com/aws/aws-sdk-cpp.git
		cd aws-sdk-cpp
		git reset --hard 1.0.59
		mkdir ~/aws-sdk-cpp-build
		cd ~/aws-sdk-cpp-build
		cmake -DCUSTOM_MEMORY_MANAGEMENT=0 -DBUILD_SHARED_LIBS=0 -DMINIMIZE_SIZE=1 -DBUILD_ONLY=s3 -DENABLE_TESTING=OFF -DCMAKE_INSTALL_LIBDIR=lib64 ~/aws-sdk-cpp
		make -j
		sudo make install
		sudo ldconfig

OS: CentOS 7 (minimal installation + a few package groups such as 'Development Tools', etc.)

Help!

@JonathanHenson
Contributor

side-note: you don't need libuuid anymore.

Can you send me a trace output log that contains a request succeeding, and some failing?

@JonathanHenson
Contributor

Also, this just started happening? Was this perchance S3 in region SA-EAST-1 ?

@crusader-mike
Author

crusader-mike commented May 17, 2017

> you don't need libuuid anymore.

Since which SDK version?

> Can you send me a trace output log that contains a request succeeding, and some failing?

Here: aws_sdk_2017-05-16-20.zip

Here you can see multiple threads executing PutObject. Most of them fail. All of them share the same S3Client. These requests are parts of bigger tasks that execute undo actions (DeleteObject requests) if one of the subtasks fails. Setting verifySSL to false makes the problem go away.

> Also, this just started happening? Was this perchance S3 in region SA-EAST-1 ?

No, it is a Webscale demo account. I highly doubt it has anything to do with the server side, since I switched to static linking over the course of a few hours (it was working before the switch).

@JonathanHenson
Contributor

doesn't look like it starts failing until we make our 7th tcp connection. Can you try setting your client config to only have 6 connections? (This is for debugging, not for the solution).

@JonathanHenson
Contributor

6 / 30 connections == 20% success rate.

That is curious indeed.

@JonathanHenson
Contributor

We removed libuuid ages ago and moved to just hitting /dev/urandom. Unfortunately, we didn't remove it from the build dependencies on linux until this week.

@JonathanHenson
Contributor

hmmm max file handles on your CA file?

@crusader-mike
Author

The log is probably a concatenation of multiple runs. Sometimes an entire run is successful. Each test run has 3 files to upload (3 tasks --> 6 PutObject requests). Each upload is (typically) executed on a new thread.

Another thing I wanted to mention -- on the linker line I ended up with both -lcrypt and -lcrypto. Probably unrelated (I'm not sure exactly what these libs do).

@crusader-mike
Author

crusader-mike commented May 17, 2017

> hmmm max file handles on your CA file?

Can you elaborate? Not sure what you mean. Also, why does it happen only when statically linking the SDK?

@JonathanHenson
Contributor

It fails when it opens the 7th tcp connection and all other connections will fail after that. So you keep succeeding 20% of the time, but fail on all others, because the original 6 are left open.

error code 77 indicates libcurl wasn't able to read the CA_FILE... just brainstorming here
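
A minimal sketch for testing the CA-file theory. libcurl documents code 77 as CURLE_SSL_CACERT_BADFILE (a problem reading the CA certificate file). `check_ca_bundle` is a hypothetical helper name, and the default path is an assumption: it is the usual CA bundle location on CentOS 7, which may differ from the one your libcurl was built with.

```shell
# Hypothetical helper: report whether a CA bundle file is readable by the current user.
# The default path is an assumption (the usual CentOS 7 location).
check_ca_bundle() {
    bundle="${1:-/etc/pki/tls/certs/ca-bundle.crt}"
    if [ -r "$bundle" ]; then
        echo "readable: $bundle"
    else
        echo "NOT readable: $bundle"
        return 1
    fi
}

# Check the default bundle; do not abort if it is missing on this machine.
check_ca_bundle || true

# Soft limit on open file descriptors for this shell and the processes it starts.
echo "open-file limit: $(ulimit -n)"
```

If `curl-config` is installed, `curl-config --ca` prints the bundle path libcurl was configured with, which can be passed as the first argument.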

@JonathanHenson
Contributor

I should note, we open a new connection for each concurrent request until we reach the maximum number of connections in the pool. Once a thread finishes, the connection is returned (still open) to the pool.

@crusader-mike
Author

Pretty sure I've seen it fail on the first connection, as well as not fail at all.

@JonathanHenson
Contributor

You should be able to use:

`pkg-config libcurl --libs` `pkg-config openssl --libs` for your linker line.

@crusader-mike
Author

Thanks. I'll give it a try tomorrow.

@JonathanHenson
Contributor

ok, i'll think on this some more. Maybe something will come to me.

My best guess is you are linking to something that has different behavior than before.

Like maybe libnss ?

-lcrypto wouldn't be the culprit since that isn't where TLS resides.

LibCurl is likely the culprit. My guess is libcurl.a is linked against libnss (which is the spawn of satan) while libcurl.so is linked against openssl.
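
One way to check the guess about which TLS library libcurl is built against: the first line of `curl --version` lists the libraries the curl tool was linked with. The sketch below pulls out the TLS token from that line. `tls_backend` is a hypothetical helper name, and the sketch assumes the curl command-line tool uses the same libcurl as the application, which does not hold for a statically linked libcurl.

```shell
# Hypothetical helper: print the TLS library named in a `curl --version` banner line.
# Prints "unknown" and returns 1 when none of the recognised tokens is present.
tls_backend() {
    for token in $1; do
        case "$token" in
            NSS/*|OpenSSL/*|GnuTLS/*|LibreSSL/*|mbedTLS/*|wolfSSL/*)
                echo "${token%%/*}"
                return 0
                ;;
        esac
    done
    echo "unknown"
    return 1
}

# Inspect the installed curl, if there is one.
if command -v curl >/dev/null 2>&1; then
    tls_backend "$(curl --version | head -n 1)" || true
fi
```

For the application itself, `ldd` on the binary shows which shared libcurl (if any) it resolves to at load time.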

@crusader-mike
Author

crusader-mike commented May 17, 2017

> My best guess is you are linking to something that has different behavior than before.

I was thinking maybe MINIMIZE_SIZE could cause a change in behaviour -- it combines all cpp files into one to compile. C++ doesn't like stuff like that.

> My guess is libcurl.a is linked against libnss

I am pretty sure I link against libcurl.so. Will check tomorrow.

> libnss (which is the spawn of satan)

:-) Once I had to write some logic that was calling into openssl -- after a few days I was thinking something along the same lines.

P.S. pretty sure I don't have -lnss on my linker line.

@crusader-mike
Author

crusader-mike commented May 18, 2017

Spent the entire day today dealing with this. It looks like a race condition in libnss.so. Here is what I've found:

  • the problem happens regardless of how the SDK is linked

  • I reduced my workload to just one PutObject call -- it works. Well, a few calls still fail with curl code 28 (timeout), but that is fixed by increasing the connection timeout

  • I changed the workload to execute two PutObject calls in parallel on threads other than the one that initialized the SDK -- the problem now manifests in the majority of cases. In fact it is hard (but possible) to get both of them to succeed. File size in question -- zero bytes.

  • I read up on libcurl a bit and checked the SDK code for MT-related issues. Everything seems OK. I discovered CURL tracing code that was commented out, put it back in, recompiled the SDK, and ran my tests. Logs are here:
    success_but_long_sleep.txt
    bad.txt
    good.txt

  • attempts to debug it show that we often sit in _nss_dns_gethostbyaddr4 (or something similar), and from time to time it takes ~6 seconds for this function to return.

  • I've used strace and checked every file being touched -- they are all accessible under my account. Running the test as root yielded the same result.
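
The strace check in the last bullet can be sketched roughly as follows, assuming the trace was written to a file with `strace -o`. `failed_file_calls` is a hypothetical helper name and `./my_app` is a placeholder for the binary under test; neither comes from the thread.

```shell
# Hypothetical helper: list syscalls in a strace log that returned -1 with an errno,
# e.g. ENOENT, EACCES or EMFILE.
failed_file_calls() {
    grep -E '= -1 E[A-Z]+' "$1"
}

# Usage (commented out because ./my_app is only a placeholder):
#   strace -f -e trace=file -o trace.log ./my_app   # -f follows threads, trace=file limits to file-name syscalls
#   failed_file_calls trace.log
```

An empty result means no traced file access failed, which matches the observation that every touched file was accessible.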

I am running out of ideas here, tbh... Google is full of pages mentioning race conditions in libnss, but they all seem to be fixed already and my packages seem fresh enough:

$ yum list installed | grep curl
curl.x86_64                          7.29.0-35.el7.centos           @base
libcurl.x86_64                       7.29.0-35.el7.centos           @base
libcurl-devel.x86_64                 7.29.0-35.el7.centos           @base
python-pycurl.x86_64                 7.19.0-17.el7                  @anaconda
$ yum list installed | grep nss
jansson.x86_64                       2.4-6.el7                      @anaconda
nss.x86_64                           3.28.4-1.0.el7_3               @updates
nss-softokn.x86_64                   3.16.2.3-14.4.el7              @base
nss-softokn-freebl.i686              3.16.2.3-14.4.el7              @base
nss-softokn-freebl.x86_64            3.16.2.3-14.4.el7              @base
nss-sysinit.x86_64                   3.28.4-1.0.el7_3               @updates
nss-tools.x86_64                     3.28.4-1.0.el7_3               @updates
nss-util.x86_64                      3.28.4-1.0.el7_3               @updates
openssh.x86_64                       6.6.1p1-22.el7                 @anaconda
openssh-clients.x86_64               6.6.1p1-22.el7                 @anaconda
openssh-server.x86_64                6.6.1p1-22.el7                 @anaconda
openssl.x86_64                       1:1.0.1e-60.el7_3.1            @updates
openssl-devel.x86_64                 1:1.0.1e-60.el7_3.1            @updates
openssl-libs.x86_64                  1:1.0.1e-60.el7_3.1            @updates

Help!

P.S. Checked all this against AWS (instead of Webscale) -- result is the same.

@crusader-mike
Author

crusader-mike commented May 18, 2017

Huh... I might have found it:
https://curl.haxx.se/mail/lib-2016-08/0119.html
https://bugzilla.mozilla.org/show_bug.cgi?id=1297397

If this is the same bug -- updating libcurl to v7.54+ should fix it (see curl patch here: curl/curl@3a5d5de9)
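
A small sketch for checking whether an installed libcurl is at or above the 7.54 threshold mentioned above. `version_ge` is a hypothetical helper and relies on GNU `sort -V`. Treat the result as a hint only: distribution packages often backport fixes without changing the upstream version number, so an older version string does not prove the fix is missing.

```shell
# Hypothetical helper: succeed when version $1 >= version $2 (requires GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# curl-config ships with the libcurl development package and prints e.g. "libcurl 7.29.0".
if command -v curl-config >/dev/null 2>&1; then
    installed="$(curl-config --version | awk '{print $2}')"
    if version_ge "$installed" 7.54.0; then
        echo "libcurl $installed: at or above 7.54.0"
    else
        echo "libcurl $installed: older than 7.54.0"
    fi
fi
```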

@JonathanHenson
Contributor

JonathanHenson commented May 18, 2017 via email

@crusader-mike
Author

I am not particularly familiar with the art of deployment on Linux. What do I need to do to ensure my app ends up using libcurl+openssl on the client's machine? Linking against them statically is likely possible but isn't something I am looking forward to -- it means security patches etc. won't apply to my app...

Also, any ideas about the mysterious 6-second delays I observe?

@crusader-mike
Author

crusader-mike commented May 19, 2017

Ended up building the latest libcurl (v7.54) from source (with OpenSSL -- libNSS refused to compile due to some missing NSPR headers :D). Linked both the AWS SDK and libcurl statically (this app is not going to talk to the Internet, so security patches are not particularly necessary). The problem is gone!

It turned out there is another one, which was largely obscured before. :-\ I'll submit it as a separate issue.
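
The build described above might look roughly like the sketch below. The exact commands were not posted, so everything here is an assumption: the download URL, the install prefix, the configure flags, and the helper names `run` and `build_static_curl`. It also assumes openssl-devel is already installed, as in the original setup steps.

```shell
# Hypothetical helper: echo the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi
}

# Sketch: build a static libcurl 7.54.0 against OpenSSL and install it into a private prefix.
# Note: without DRY_RUN this changes the current directory of the calling shell.
build_static_curl() {
    prefix="${1:-$HOME/curl-openssl}"   # assumed install location
    run wget https://curl.haxx.se/download/curl-7.54.0.tar.gz &&
    run tar xzf curl-7.54.0.tar.gz &&
    run cd curl-7.54.0 &&
    run ./configure --prefix="$prefix" --with-ssl --without-nss \
        --disable-shared --enable-static &&
    run make -j4 &&
    run make install
}

# Preview the steps without executing anything.
DRY_RUN=1 build_static_curl
```

With a private prefix, the application's link line would also need `-L<prefix>/lib` so that `-l:libcurl.a` resolves to the new archive rather than a system copy.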

P.S. portion of my linker's cmdline: -laws-cpp-sdk-s3 -laws-cpp-sdk-core -l:libcurl.a -lssl -lz -lcrypto
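
To confirm what a binary linked this way still loads dynamically, the `ldd` output can be filtered for the HTTP and TLS libraries. This is a sketch: `shared_tls_deps` is a hypothetical helper and `./my_app` is a placeholder path. With `-l:libcurl.a` on the link line, libcurl and libnss3 would be expected to disappear from the list, while libssl and libcrypto remain if they are linked as shared objects.

```shell
# Hypothetical helper: keep only the HTTP/TLS libraries from `ldd` output read on stdin.
shared_tls_deps() {
    grep -E 'libcurl|libnss3|libssl|libcrypto' || true
}

app="${1:-./my_app}"   # placeholder path for the linked binary
if [ -x "$app" ]; then
    ldd "$app" | shared_tls_deps
fi
```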

@justnance added the help wanted label (We are asking the community to submit a PR to resolve this issue.) and removed the help wanted label on Apr 19, 2019