Skip to content

Context cancel hangs if there is no network connectivity #620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
banks opened this issue May 31, 2017 · 15 comments
Open

Context cancel hangs if there is no network connectivity #620

banks opened this issue May 31, 2017 · 15 comments

Comments

@banks
Copy link

banks commented May 31, 2017

Hi,

Not sure if this is a bug or if it is effectively the same as #584. But it appears to be a significant Gotcha if I'm understanding correctly.

Setup

package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://127.0.0.1:5432/postgres?sslmode=disable&connect_timeout=3")
	if err != nil {
		panic(err)
	}

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		rows, err := db.QueryContext(ctx, "SELECT 1")
		if err != nil {
			fmt.Println("ERROR:", err)
		}
		rows.Next()
		var i int
		err = rows.Scan(&i)
		rows.Close()
		cancel()
		if err != nil {
			panic(err)
		}
		fmt.Println("Got", i)
		time.Sleep(1 * time.Second)
	}
}

Behaviour

  • Code above works fine with a local postgres instance up
  • If you stop the PG instance the connect will be reset and it will fail loudly as expected
  • Do any of the following to simulate failure:
  • simulate process busy/hung by finding the PID of the connected postgres process and then kill -stop <pid>
  • simulate packet loss by adding echo "block drop quick on lo0 proto tcp from any to any port = 5432" | sudo pfctl -ef - on OS X or equivalent iptables rule to drop packets to the local database
  • Expected behavior: that the client's timeout after 5 seconds with an error and print to screen before retrying
  • ** Actual Behaviour**: the client remains hung indefinitiely in either case until the "failure" is resolved

Cause

  • It seems to be due to the fact that the code handling Context cancelation (https://github.com/lib/pq/blob/master/conn_go18.go#L67) attempts to send a cancel op to the database and waits forever for a response
  • The cancel method even calls dial internally but this is not even respecting the connect_timeout option, possibly because it's returning a pooled connection? Even my trivial test example doesn't timeout though and there should only be a single connection made.

Ramifications

In my case the motivation for wanting to use Contexts in the first place is that I need tight bounds on execution time and can't afford to have clients hang indefinitely on DB whatever the error condition is in the real-world (both packet loss and hung server processes happen).

I could implement this externally presumably with my own watcher on the context in another Goroutine that tears down the whole connection, but it seems like this mechanism should work for my case without re-inventing it outside.

Is there an alternative option?

@tamird
Copy link
Collaborator

tamird commented May 31, 2017

Yep, this is a bug! We need to set connection deadlines using net.Conn.Set{,Read}Deadline in our cancel function and close the underlying connection (the primary, not the secondary one used for cancellation) when they time out. cc @mjibson

@maddyblue
Copy link
Collaborator

I'll work on this soon.

@maddyblue maddyblue self-assigned this May 31, 2017
@maddyblue
Copy link
Collaborator

I'm no longer certain this is a bug. If the connect_timeout option is passed to the driver, then both connections should set their deadlines correctly. It is unclear if we should always set a deadline on the cancellation connection, but I think it should be the same as the normal connection, and so should use the same options.

The point about the cancel method waiting for a response is accurate, but I think it's also correct. If we didn't wait for a response then we wouldn't know if the server received the cancellation request (this may be ok, since the server won't send anything back, however I'm not certain it's ok to just open a connection, send some bytes, then immediately close it. It is really important that things happen in a correct order (see all the finished chan stuff in watchCancel) to make sure that we do not return a connection to the pool if that connection could get a cancel request later on. So I'm not sure if we are able to safely return very soon after a context has been cancelled because I think the connection could be incorrectly cancelled in the future.

I'm also not sure how to test this, because we need something that doesn't require messing with firewall rules. Maybe a network proxy setup for this function that stops relaying data on a trigger?

@banks
Copy link
Author

banks commented Jun 6, 2017 via email

@banks
Copy link
Author

banks commented Jun 6, 2017 via email

@lattamore
Copy link

I have the same issue, but the thread waits forever for the cn.query in QueryContext. There are no client-side network timeouts. If for whatever reason the TCP FIN never makes it to the client the driver is at the mercy of the OS to signal the socket as disconnected. This takes 15+ minutes on my ubuntu box. I am currently using the workaround with github.com/Kount/pq-timeouts. Are there any reasons the driver does not have client side timeouts implemented (IE: databse/sql)?

@banks
Copy link
Author

banks commented Aug 7, 2017

FYI I worked around this issue immediately by using https://github.com/Kount/pq-timeouts which wraps this driver to implement timeouts better.

@chrispassas
Copy link

chrispassas commented Nov 28, 2017

I'm currently experencing this issue as described by @banks. Using pq-timeouts won't help in my case because I'm using a long lived connection pool and have different timeout values per-query. pq-timeouts is only a solution if all queries can use the same timeout value or you create a new connection per-query.

It seems like a bug that if the database your querying is for whatever reason unable to respond that it will hang forever. The whole point of context is to terminate things that run to long.

In my case of PostgreSQL databases are so loaded that at times they can't respond to the cancel query request so db.QueryContext() hangs forever.

Even if I wrap this in a switch statement with timeout I'll be leaking connections that never close.

@mwuertinger
Copy link

Any news on this? Sounds like a serious issue and it's almost a year since it was first reported.

@importnil
Copy link

Another year has passed, and I've stumbled upon this issue as well.

@ShawnXxy
Copy link

ShawnXxy commented May 7, 2019

There might be a workaround for this:

Adding another layer of PG Bouncer side car above Golang PG client.
And set the following TCP keepalive parameters actively detect broken TCP connection:
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=5
net.ipv4.tcp_keepalive_probes=3
After this change, the connection can fail fast and reconnect when the current TCP connection is broken.

Updates:
Below settings work more effective
tcp_keepcnt = 3
tcp_keepidle = 1
tcp_keepintvl = 1

@t33m
Copy link

t33m commented Feb 1, 2020

Almost 3 years passed and still no support context.WithTimeout() for query...

@vtolstov
Copy link

vtolstov commented Feb 1, 2020

may be https://github.com/jackc/pgx does not have such bug?

@narcis96
Copy link

Hello !

Are there any updated on this issue? We are still facing this critical issue in our system.
We would like to know if there is a plan for fixing this issue or we should start using other driver.
Any response is more than welcomed.

@Gunni
Copy link

Gunni commented Apr 25, 2025

This bug is now 8 years old, how about some progress?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests