net/http: TestTransportMaxConnsPerHost still flaky #31982
Comments
This is a bit interesting: we get the right number of dials and TLS handshakes, but the connection returned is not marked as reused when it should be.
The problem is that GotConn for http2 (h2_bundle:6878) is called before we know whether we are creating a new stream (h2_bundle:7449), which is what determines reuse. This is not a flaky test but a real bug.
Using nextStreamID is flawed anyhow. If you allow HTTP, we start at 3, which means it's immediately reused.
If we move the GotConn call into getClientConn (h2_bundle:749), we can distinguish between a new connection and a used connection.
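For reference, a minimal sketch of observing GotConn and its Reused field through the standard net/http/httptrace API; the target URL is a placeholder, and this only illustrates the callback the test relies on:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
)

func main() {
	// Placeholder URL; any HTTPS server works for illustration.
	req, err := http.NewRequest("GET", "https://example.com/", nil)
	if err != nil {
		panic(err)
	}

	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			// Reused reports whether this connection has already served a
			// request; this is the field the flaky test is checking.
			fmt.Printf("got conn: reused=%v wasIdle=%v\n", info.Reused, info.WasIdle)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}
```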
➜ http go test -run TestTransportMaxConnsPerHost -count=100 -race
Change https://golang.org/cl/176720 mentions this issue:
I submitted a fix which does highlight the issue with http2. I did some other testing and found that some of the other ClientTrace calls, like TLSHandshakeXXX, are missing.
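For reference, the TLSHandshakeStart/TLSHandshakeDone hooks mentioned above are standard httptrace.ClientTrace fields; a small sketch of wiring them up (the helper name newTLSTrace is hypothetical):

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http/httptrace"
)

// newTLSTrace is a hypothetical helper; the two hook fields and their
// signatures are the real httptrace.ClientTrace ones referred to above.
func newTLSTrace() *httptrace.ClientTrace {
	return &httptrace.ClientTrace{
		TLSHandshakeStart: func() {
			log.Println("TLS handshake started")
		},
		TLSHandshakeDone: func(state tls.ConnectionState, err error) {
			log.Printf("TLS handshake done: proto=%q err=%v", state.NegotiatedProtocol, err)
		},
	}
}

func main() {
	// Attach to a request with httptrace.WithClientTrace, as in the
	// earlier GotConn sketch.
	_ = newTLSTrace()
}
```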
I just got another TryBot flake on this:
CC @bradfitz
nextStreamID was used as a means to determine if the connection was being reused. Multiple requests can see a new connection because the nextStreamID is updated after a ClientTrace reports it is being reused.

Updates golang/go#31982

Change-Id: Iaa4b62b217f015423cddb99fd86de75a352f8320
Reviewed-on: https://go-review.googlesource.com/c/net/+/176720
Reviewed-by: Brad Fitzpatrick <[email protected]>
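As a rough illustration of the approach the CL describes, tracking reuse with an explicit per-connection flag rather than inferring it from nextStreamID might look like the following; this is a simplified sketch with made-up names (conn, markUsed), not the actual x/net/http2 code:

```go
package main

import (
	"fmt"
	"sync"
)

// conn is an illustrative stand-in for an HTTP/2 client connection; it is
// not the real x/net/http2 type.
type conn struct {
	mu     sync.Mutex
	reused bool // set once the first request has been issued on this conn
}

// markUsed reports whether the connection had already served a request and
// then marks it used. An explicit flag avoids inferring reuse from
// nextStreamID, which only advances after the trace callback has run.
func (c *conn) markUsed() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	wasReused := c.reused
	c.reused = true
	return wasReused
}

func main() {
	c := &conn{}
	fmt.Println(c.markUsed()) // false: first request on this connection
	fmt.Println(c.markUsed()) // true: later requests see it as reused
}
```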
Change https://golang.org/cl/177037 mentions this issue:
Whoops, I'd forgotten to submit 177037. Done. Hopefully it's better now.
For:

    http2: track reused connections
    https://golang.org/cl/176720 (updates #31982)

Some x/sys/unix updates come along for the ride too.

I filed #32031 for making the bundling process less difficult and error-prone in the future.

Change-Id: Ic822080991ffa2d50352c5f613e45648a327cf16
Reviewed-on: https://go-review.googlesource.com/c/go/+/177037
Run-TryBot: Brad Fitzpatrick <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
In the other direction, it appears that
Closing as duplicate of #32336 now that we understand how severe that one is.
Change https://golang.org/cl/184262 mentions this issue:
There were at least three races in the implementation of the pool of idle HTTP connections before this CL.

The first race is that HTTP/2 connections can be shared for many requests, but each requesting goroutine would take the connection out of the pool and then immediately return it before using it; this created unnecessary, tiny little race windows during which another goroutine might dial a second connection instead of reusing the first. This CL changes the idle pool to just leave the HTTP/2 connection in the pool permanently (until there is reason to close it), instead of doing the take-it-out-put-it-back dance race.

The second race is that “is there an idle connection?” and “register to wait for an idle connection” were implemented as two separate steps, in different critical sections. So a client could end up registered to wait for an idle connection and be waiting or perhaps dialing, not having noticed the idle connection sitting in the pool that arrived between the two steps.

The third race is that t.getIdleConnCh assumes that the inability to send on the channel means the client doesn't need the result, when it could mean that the client has not yet entered the select. That is, the main dial does:

	idleConnCh := t.getIdleConnCh(cm)
	select {
	case v := <-dialc:
		...
	case pc := <-idleConnCh:
		...
	...
	}

But then tryPutIdleConn does:

	waitingDialer := t.idleConnCh[key] // what getIdleConnCh(cm) returned
	select {
	case waitingDialer <- pconn:
		// We're done ...
		return nil
	default:
		if waitingDialer != nil {
			// They had populated this, but their dial won
			// first, so we can clean up this map entry.
			delete(t.idleConnCh, key)
		}
	}

If the client has returned from getIdleConnCh but not yet reached the select, tryPutIdleConn will be unable to do the send, incorrectly conclude that the client does not care anymore, and put the connection in the idle pool instead, again leaving the client dialing unnecessarily while a connection sits in the idle pool. (It's also odd that the success case does not clean up the map entry, and also that the map has room for only a single waiting goroutine for a given host.)

None of these races mattered too much before Go 1.11: at most they meant that connections were not reused quite as promptly as possible, or a few more than necessary would be created. But Go 1.11 added Transport.MaxConnsPerHost, which limited the number of connections created for a given host. The default is 0 (unlimited), but if a user did explicitly impose a low limit (2 is common), all these misplaced conns could easily add up to the entire limit, causing a deadlock. This was causing intermittent timeouts in TestTransportMaxConnsPerHost.

The addition of the MaxConnsPerHost support added its own races. For example, here t.incHostConnCount could increment the count and return a channel ready for receiving, and then the client would not receive from it nor ever issue the decrement, because the select need not evaluate these two cases in order:

	select {
	case <-t.incHostConnCount(cmKey):
		// count below conn per host limit; proceed
	case pc := <-t.getIdleConnCh(cm):
		if trace != nil && trace.GotConn != nil {
			trace.GotConn(httptrace.GotConnInfo{Conn: pc.conn, Reused: pc.isReused()})
		}
		return pc, nil
	...
	}

Obviously, unmatched increments are another way to get to a deadlock.
TestTransportMaxConnsPerHost deadlocked approximately 100% of the time with a small random sleep added between incHostConnCount and the select:

	ch := t.incHostConnCount(cmKey)
	time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
	select {
	case <-ch:
		// count below conn per host limit; proceed
	case pc := <-t.getIdleConnCh(cm):
		...
	}

The limit also did not properly apply to HTTP/2, because of the decrement being attached to the underlying net.Conn.Close and net/http not having access to the underlying HTTP/2 conn. The alternate decrements for HTTP/2 may also have introduced spurious decrements (discussion in #29889). Perhaps those spurious decrements or other races caused the other intermittent non-deadlock failures in TestTransportMaxConnsPerHost, in which the HTTP/2 phase created too many connections (#31982).

This CL replaces the buggy, racy code with new code that is hopefully neither buggy nor racy.

Fixes #29889.
Fixes #31982.
Fixes #32336.

Change-Id: I0dfac3a6fe8a6cdf5f0853722781fe2ec071ac97
Reviewed-on: https://go-review.googlesource.com/c/go/+/184262
Run-TryBot: Russ Cox <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
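The second race described above suggests the general shape of a fix: check for an idle connection and register as a waiter inside a single critical section, so a connection returned to the pool is either handed directly to a waiter or parked, never lost in between. A much-simplified sketch of that pattern (connPool, get, and put are illustrative names, not the real net/http implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// connPool is an illustrative stand-in for the Transport's idle pool;
// it is not the real net/http implementation.
type connPool struct {
	mu      sync.Mutex
	idle    []string      // idle connections, represented by name here
	waiters []chan string // goroutines waiting for a connection
}

// get returns an idle connection if one exists; otherwise it registers a
// waiter. Both happen under one lock, so no connection can slip by between
// the "is there an idle conn?" check and the "wait for one" registration.
func (p *connPool) get() (string, <-chan string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.idle); n > 0 {
		c := p.idle[n-1]
		p.idle = p.idle[:n-1]
		return c, nil
	}
	ch := make(chan string, 1) // buffered so a later put never blocks
	p.waiters = append(p.waiters, ch)
	return "", ch
}

// put hands the connection directly to the oldest waiter if there is one,
// otherwise parks it in the idle list.
func (p *connPool) put(c string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.waiters) > 0 {
		ch := p.waiters[0]
		p.waiters = p.waiters[1:]
		ch <- c
		return
	}
	p.idle = append(p.idle, c)
}

func main() {
	p := &connPool{}
	_, wait := p.get()  // pool is empty: we are registered as a waiter
	p.put("conn1")      // a finished request returns its connection
	fmt.Println(<-wait) // delivered straight to the waiter; nothing is lost
}
```

Because get registers the waiter under the same lock that guards the idle list, a put can never observe an empty waiter list while a getter is about to wait.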
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
From the repository root:
$ cd src/net/http
$ go test -run TestTransportMaxConnsPerHost -count=100 -race
What did you expect to see?
The test passes.
What did you see instead?