Context cancel hangs if there is no network connectivity #620

banks · 2017-05-31T20:29:36Z

Hi,

Not sure if this is a bug or if it is effectively the same as #584. But it appears to be a significant Gotcha if I'm understanding correctly.

Setup

package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://127.0.0.1:5432/postgres?sslmode=disable&connect_timeout=3")
	if err != nil {
		panic(err)
	}

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		rows, err := db.QueryContext(ctx, "SELECT 1")
		if err != nil {
			fmt.Println("ERROR:", err)
		}
		rows.Next()
		var i int
		err = rows.Scan(&i)
		rows.Close()
		cancel()
		if err != nil {
			panic(err)
		}
		fmt.Println("Got", i)
		time.Sleep(1 * time.Second)
	}
}

Behaviour

Code above works fine with a local postgres instance up
If you stop the PG instance the connect will be reset and it will fail loudly as expected
Do any of the following to simulate failure:
simulate process busy/hung by finding the PID of the connected postgres process and then kill -stop <pid>
simulate packet loss by adding echo "block drop quick on lo0 proto tcp from any to any port = 5432" | sudo pfctl -ef - on OS X or equivalent iptables rule to drop packets to the local database
Expected behavior: that the client's timeout after 5 seconds with an error and print to screen before retrying
** Actual Behaviour**: the client remains hung indefinitiely in either case until the "failure" is resolved

Cause

It seems to be due to the fact that the code handling Context cancelation (https://github.com/lib/pq/blob/master/conn_go18.go#L67) attempts to send a cancel op to the database and waits forever for a response
The cancel method even calls dial internally but this is not even respecting the connect_timeout option, possibly because it's returning a pooled connection? Even my trivial test example doesn't timeout though and there should only be a single connection made.

Ramifications

In my case the motivation for wanting to use Contexts in the first place is that I need tight bounds on execution time and can't afford to have clients hang indefinitely on DB whatever the error condition is in the real-world (both packet loss and hung server processes happen).

I could implement this externally presumably with my own watcher on the context in another Goroutine that tears down the whole connection, but it seems like this mechanism should work for my case without re-inventing it outside.

Is there an alternative option?

The text was updated successfully, but these errors were encountered:

tamird · 2017-05-31T21:05:43Z

Yep, this is a bug! We need to set connection deadlines using net.Conn.Set{,Read}Deadline in our cancel function and close the underlying connection (the primary, not the secondary one used for cancellation) when they time out. cc @mjibson

maddyblue · 2017-05-31T21:06:11Z

I'll work on this soon.

maddyblue · 2017-06-06T07:51:13Z

I'm no longer certain this is a bug. If the connect_timeout option is passed to the driver, then both connections should set their deadlines correctly. It is unclear if we should always set a deadline on the cancellation connection, but I think it should be the same as the normal connection, and so should use the same options.

The point about the cancel method waiting for a response is accurate, but I think it's also correct. If we didn't wait for a response then we wouldn't know if the server received the cancellation request (this may be ok, since the server won't send anything back, however I'm not certain it's ok to just open a connection, send some bytes, then immediately close it. It is really important that things happen in a correct order (see all the finished chan stuff in watchCancel) to make sure that we do not return a connection to the pool if that connection could get a cancel request later on. So I'm not sure if we are able to safely return very soon after a context has been cancelled because I think the connection could be incorrectly cancelled in the future.

I'm also not sure how to test this, because we need something that doesn't require messing with firewall rules. Maybe a network proxy setup for this function that stops relaying data on a trigger?

banks · 2017-06-06T08:38:43Z

You could test if you control the Postgres process by sending sigstop and sigcont. But yes there are networking options to simulate failures in testing that might be simpler. I'd say it's at least surprising behaviour as it is and an important missing feature if there is no way to have a deadline on queries that is actually respected in case of a slow database or network failure. FWIW I did have connect timeout set to 5 seconds and this was _not_ respected in the failure cases described. Perhaps that is the bug?

…

On 6 Jun 2017, at 08:51, Matt Jibson ***@***.***> wrote: I'm no longer certain this is a bug. If the connect_timeout option is passed to the driver, then both connections should set their deadlines correctly. It is unclear if we should always set a deadline on the cancellation connection, but I think it should be the same as the normal connection, and so should use the same options. The point about the cancel method waiting for a response is accurate, but I think it's also correct. If we didn't wait for a response then we wouldn't know if the server received the cancellation request (this may be ok, since the server won't send anything back, however I'm not certain it's ok to just open a connection, send some bytes, then immediately close it. It is really important that things happen in a correct order (see all the finished chan stuff in watchCancel) to make sure that we do not return a connection to the pool if that connection could get a cancel request later on. So I'm not sure if we are able to safely return very soon after a context has been cancelled because I think the connection could be incorrectly cancelled in the future. I'm also not sure how to test this, because we need something that doesn't require messing with firewall rules. Maybe a network proxy setup for this function that stops relaying data on a trigger? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

banks · 2017-06-06T08:48:11Z

If the concern is that it might be unsafe to return a connection to pool after cancelling, would it be simplest to treat this as a special case where a connection that hit timeout before getting cancellation response is just discarded as an error/unknown connection state (I.e. Not returned to pool) or left to run for some longish time in background to see if cancellation gets processed eventually? I'm not intimately familiar with the interface database/sql requires from the driver so that might preclude just discarding connections as I suggest.

…

On 6 Jun 2017, at 08:51, Matt Jibson ***@***.***> wrote: I'm no longer certain this is a bug. If the connect_timeout option is passed to the driver, then both connections should set their deadlines correctly. It is unclear if we should always set a deadline on the cancellation connection, but I think it should be the same as the normal connection, and so should use the same options. The point about the cancel method waiting for a response is accurate, but I think it's also correct. If we didn't wait for a response then we wouldn't know if the server received the cancellation request (this may be ok, since the server won't send anything back, however I'm not certain it's ok to just open a connection, send some bytes, then immediately close it. It is really important that things happen in a correct order (see all the finished chan stuff in watchCancel) to make sure that we do not return a connection to the pool if that connection could get a cancel request later on. So I'm not sure if we are able to safely return very soon after a context has been cancelled because I think the connection could be incorrectly cancelled in the future. I'm also not sure how to test this, because we need something that doesn't require messing with firewall rules. Maybe a network proxy setup for this function that stops relaying data on a trigger? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

lattamore · 2017-06-06T21:49:05Z

I have the same issue, but the thread waits forever for the cn.query in QueryContext. There are no client-side network timeouts. If for whatever reason the TCP FIN never makes it to the client the driver is at the mercy of the OS to signal the socket as disconnected. This takes 15+ minutes on my ubuntu box. I am currently using the workaround with github.com/Kount/pq-timeouts. Are there any reasons the driver does not have client side timeouts implemented (IE: databse/sql)?

banks · 2017-08-07T12:47:56Z

FYI I worked around this issue immediately by using https://github.com/Kount/pq-timeouts which wraps this driver to implement timeouts better.

chrispassas · 2017-11-28T15:50:19Z

I'm currently experencing this issue as described by @banks. Using pq-timeouts won't help in my case because I'm using a long lived connection pool and have different timeout values per-query. pq-timeouts is only a solution if all queries can use the same timeout value or you create a new connection per-query.

It seems like a bug that if the database your querying is for whatever reason unable to respond that it will hang forever. The whole point of context is to terminate things that run to long.

In my case of PostgreSQL databases are so loaded that at times they can't respond to the cancel query request so db.QueryContext() hangs forever.

Even if I wrap this in a switch statement with timeout I'll be leaking connections that never close.

mwuertinger · 2018-05-07T14:06:05Z

Any news on this? Sounds like a serious issue and it's almost a year since it was first reported.

importnil · 2019-04-08T14:34:15Z

Another year has passed, and I've stumbled upon this issue as well.

ShawnXxy · 2019-05-07T06:11:41Z

There might be a workaround for this:

Adding another layer of PG Bouncer side car above Golang PG client.
And set the following TCP keepalive parameters actively detect broken TCP connection:
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=5
net.ipv4.tcp_keepalive_probes=3
After this change, the connection can fail fast and reconnect when the current TCP connection is broken.

Updates:
Below settings work more effective
tcp_keepcnt = 3
tcp_keepidle = 1
tcp_keepintvl = 1

t33m · 2020-02-01T10:27:03Z

Almost 3 years passed and still no support context.WithTimeout() for query...

vtolstov · 2020-02-01T10:49:38Z

may be https://github.com/jackc/pgx does not have such bug?

narcis96 · 2023-04-21T08:22:50Z

Hello !

Are there any updated on this issue? We are still facing this critical issue in our system.
We would like to know if there is a plan for fixing this issue or we should start using other driver.
Any response is more than welcomed.

Gunni · 2025-04-25T12:19:39Z

This bug is now 8 years old, how about some progress?

maddyblue self-assigned this May 31, 2017

banks mentioned this issue Aug 7, 2017

conn.go panics if TLS connection drops #644

Closed

georgysavva mentioned this issue Mar 27, 2020

pq library alternatives. cockroachdb/cockroach-go#59

Closed

maddyblue removed their assignment Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Context cancel hangs if there is no network connectivity #620

Context cancel hangs if there is no network connectivity #620

banks commented May 31, 2017

tamird commented May 31, 2017

Uh oh!

maddyblue commented May 31, 2017

Uh oh!

maddyblue commented Jun 6, 2017

Uh oh!

banks commented Jun 6, 2017 via email

Uh oh!

banks commented Jun 6, 2017 via email

Uh oh!

lattamore commented Jun 6, 2017

Uh oh!

banks commented Aug 7, 2017

Uh oh!

chrispassas commented Nov 28, 2017 •

edited

Loading

Uh oh!

mwuertinger commented May 7, 2018

Uh oh!

importnil commented Apr 8, 2019

Uh oh!

ShawnXxy commented May 7, 2019 •

edited

Loading

Uh oh!

t33m commented Feb 1, 2020 •

edited

Loading

Uh oh!

vtolstov commented Feb 1, 2020

Uh oh!

narcis96 commented Apr 21, 2023

Uh oh!

Gunni commented Apr 25, 2025

Uh oh!

Context cancel hangs if there is no network connectivity #620

Context cancel hangs if there is no network connectivity #620

Comments

banks commented May 31, 2017

Setup

Behaviour

Cause

Ramifications

tamird commented May 31, 2017

Uh oh!

maddyblue commented May 31, 2017

Uh oh!

maddyblue commented Jun 6, 2017

Uh oh!

banks commented Jun 6, 2017 via email

Uh oh!

banks commented Jun 6, 2017 via email

Uh oh!

lattamore commented Jun 6, 2017

Uh oh!

banks commented Aug 7, 2017

Uh oh!

chrispassas commented Nov 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mwuertinger commented May 7, 2018

Uh oh!

importnil commented Apr 8, 2019

Uh oh!

ShawnXxy commented May 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

t33m commented Feb 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vtolstov commented Feb 1, 2020

Uh oh!

narcis96 commented Apr 21, 2023

Uh oh!

Gunni commented Apr 25, 2025

Uh oh!

chrispassas commented Nov 28, 2017 •

edited

Loading

ShawnXxy commented May 7, 2019 •

edited

Loading

t33m commented Feb 1, 2020 •

edited

Loading