Merge pull request #1114 from haaawk/stream_ids_fix
Stop reusing stream ids of requests that have timed out due to client-side timeout (#1114)
* ResponseFuture: do not return the stream ID on client timeout
When a timeout occurs, the ResponseFuture associated with the query
returns its stream ID to the associated connection's free stream ID pool
- so that the stream ID can be immediately reused by another query.
However, that is incorrect and dangerous. If query A times out before it
receives a response from the cluster, a different query B might be
issued on the same connection and stream. If the response for query A
arrives earlier than the response for query B, the first one might be
misinterpreted as the response for query B.
This commit changes the logic so that stream IDs are not returned on
timeout - now, they are only returned after receiving a response.
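The rule above can be illustrated with a minimal sketch. ToyConnection and all of its methods are hypothetical names invented for this example, not the driver's actual classes; the sketch only assumes that each connection hands out stream IDs from a fixed pool.

```python
from collections import deque


class ToyConnection:
    """Hypothetical stand-in for a connection; not the driver's real class."""

    def __init__(self, max_streams=2):
        self.free_stream_ids = deque(range(max_streams))
        self.pending = {}  # stream ID -> query awaiting a response (None if orphaned)

    def send(self, query):
        stream_id = self.free_stream_ids.popleft()
        self.pending[stream_id] = query
        return stream_id

    def on_client_timeout(self, stream_id):
        # The caller gives up, but the stream ID stays reserved because
        # a late response for this query may still arrive on it.
        self.pending[stream_id] = None

    def on_response(self, stream_id):
        # Only a response from the server returns the stream ID to the pool.
        query = self.pending.pop(stream_id)
        self.free_stream_ids.append(stream_id)
        return query  # None means the response was late and is discarded


conn = ToyConnection()
a = conn.send("A")
conn.on_client_timeout(a)     # A times out on the client side
b = conn.send("B")            # B cannot be given A's stream ID
late = conn.on_response(a)    # late response for A is discarded, ID freed
answer = conn.on_response(b)  # response for B is matched to B
```

Because the ID used by A is still reserved when B is sent, the late response for A cannot be mistaken for the response to B.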
* Connection: fix tracking of in_flight requests
This commit fixes tracking of in_flight requests. Before it, in case of
a client-side timeout, the stream ID was not returned to the pool, but
the in_flight counter was decremented anyway. This counter is used to
determine if there is a need to wait for stream IDs to be freed -
without this patch, it could happen that the driver thought that it can
initiate another request due to in_flight counter being low, but there
weren't any free stream IDs to allocate, so an assertion was triggered
and the connection was defuncted and opened again.
Now, requests timed out on the client side are tracked in the
orphaned_request_ids field, and the in_flight counter is decremented
only after the response is received.
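A minimal sketch of the accounting described above, assuming a simplified tracker object. Only the names in_flight and orphaned_request_ids come from this commit message; InFlightTracker, its methods and free_ids are hypothetical.

```python
class InFlightTracker:
    """Hypothetical model of per-connection request accounting."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.free_ids = list(range(capacity))
        self.in_flight = 0
        self.orphaned_request_ids = set()

    def can_send(self):
        # Safe only while in_flight + len(free_ids) == capacity holds.
        return self.in_flight < self.capacity

    def start_request(self):
        self.in_flight += 1
        return self.free_ids.pop()

    def client_timeout(self, request_id):
        # Remember the orphan; in_flight is deliberately left unchanged
        # because the ID is still reserved.
        self.orphaned_request_ids.add(request_id)

    def response_received(self, request_id):
        # The ID is freed and in_flight decremented together, whether or
        # not the request had already timed out on the client side.
        self.orphaned_request_ids.discard(request_id)
        self.free_ids.append(request_id)
        self.in_flight -= 1


tracker = InFlightTracker(capacity=2)
first = tracker.start_request()
second = tracker.start_request()
tracker.client_timeout(first)
blocked = not tracker.can_send()   # no free IDs, and the counter agrees
tracker.response_received(first)   # late response frees the ID
admitted = tracker.can_send()
```

Decrementing the counter only when the ID is actually freed keeps the counter and the free ID pool in agreement, so the driver never tries to allocate from an empty pool.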
* Connection: notify owning pool about released orphaned streams
Before this patch, the following situation could occur:
1. On a single connection, multiple requests are spawned up to the
maximum concurrency,
2. We want to issue more requests but we need to wait on a condition
variable because requests spawned in 1. took all stream IDs and we
need to wait until some of them are freed,
3. All requests from point 1. time out on the client side - we cannot
free their stream IDs until the database node responds,
4. Responses for requests issued in point 1. arrive, but the Connection
class has no access to the condition variable mentioned in point 2.,
so no requests from point 2. are admitted,
5. Requests from point 2. waiting on the condition variable time out
even though stream IDs are available.
This commit adds an _on_orphaned_stream_released field to the Connection
class. When a timed-out request receives a late response and its stream
ID is freed, the connection now notifies the owning pool by calling the
_on_orphaned_stream_released callback.
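The notification path can be sketched as follows. This is an illustration under stated assumptions, not the driver's code: ToyConnection, ToyPool, late_response and borrow_stream_id are hypothetical, and only the callback name _on_orphaned_stream_released is taken from this commit message.

```python
import threading


class ToyConnection:
    """Hypothetical connection that reports freed orphaned streams."""

    def __init__(self, on_orphaned_stream_released):
        self.free_stream_ids = []
        self.orphaned_stream_ids = set()
        self._on_orphaned_stream_released = on_orphaned_stream_released

    def late_response(self, stream_id):
        # A response arrived for a request that had already timed out.
        self.orphaned_stream_ids.discard(stream_id)
        self.free_stream_ids.append(stream_id)
        self._on_orphaned_stream_released()


class ToyPool:
    """Hypothetical pool that owns the condition variable."""

    def __init__(self):
        self._stream_available = threading.Condition()
        self.connection = ToyConnection(self._notify_waiters)

    def _notify_waiters(self):
        with self._stream_available:
            self._stream_available.notify()

    def borrow_stream_id(self, timeout):
        with self._stream_available:
            ready = self._stream_available.wait_for(
                lambda: bool(self.connection.free_stream_ids), timeout)
            return self.connection.free_stream_ids.pop() if ready else None


pool = ToyPool()
pool.connection.orphaned_stream_ids.add(7)
result = []
waiter = threading.Thread(
    target=lambda: result.append(pool.borrow_stream_id(timeout=5)))
waiter.start()
pool.connection.late_response(7)  # wakes the waiter through the callback
waiter.join()
```

Passing the callback into the connection lets the pool keep ownership of the condition variable while still being woken when a stream ID is freed.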
* HostConnection: implement replacing overloaded connections
In a situation of very high overload or poor networking conditions, it
might happen that there is a large number of outstanding requests on a
single connection. Each request reserves a stream ID which cannot be
reused until a response for it arrives, even if the request already
timed out on the client side. Because the pool of available stream IDs
for a single connection is limited, such a situation might cause the set
of free stream IDs to shrink to a very small size (including zero),
which will drastically reduce the available concurrency on
the connection, or even render it unusable for some time.
In order to prevent this, the following strategy is adopted: when the
number of orphaned stream IDs reaches a certain threshold (e.g. 75% of
all available stream IDs), the connection is marked as overloaded.
Meanwhile, a new connection is opened - when it becomes available, it
replaces the old one, and the old connection is moved to "trash" where
it waits until all its outstanding requests either respond or time out.
This feature is implemented for HostConnection but not for
HostConnectionPool, which means that it will only work for clusters
which use protocol v3 or newer.
This fix is heavily inspired by the fix for JAVA-1519.
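The replacement strategy for overloaded connections can be sketched roughly as below. Everything here is hypothetical (ToyConnection, ToyHostPool, the 0.75 default and the method names), and two simplifications are assumed: the replacement connection is opened synchronously, and a trashed connection is dropped as soon as it has no orphaned stream IDs left.

```python
import itertools


class ToyConnection:
    """Hypothetical connection with a fixed number of stream IDs."""

    def __init__(self, name, max_stream_ids=8, orphan_threshold=0.75):
        self.name = name
        self.max_stream_ids = max_stream_ids
        self.orphan_threshold = orphan_threshold
        self.orphaned_stream_ids = set()

    @property
    def overloaded(self):
        limit = self.orphan_threshold * self.max_stream_ids
        return len(self.orphaned_stream_ids) >= limit


class ToyHostPool:
    """Hypothetical single-connection pool with a trash list."""

    def __init__(self, connection_factory):
        self._connection_factory = connection_factory
        self.connection = connection_factory()
        self.trash = []

    def on_request_orphaned(self, stream_id):
        conn = self.connection
        conn.orphaned_stream_ids.add(stream_id)
        if conn.overloaded:
            # Assumption: real code would open the replacement in the
            # background and swap only once it is ready.
            self.trash.append(conn)
            self.connection = self._connection_factory()

    def on_late_response(self, conn, stream_id):
        conn.orphaned_stream_ids.discard(stream_id)
        if conn in self.trash and not conn.orphaned_stream_ids:
            self.trash.remove(conn)  # fully drained, safe to close


counter = itertools.count()
pool = ToyHostPool(lambda: ToyConnection("conn-%d" % next(counter)))
old = pool.connection
for stream_id in range(5):
    pool.on_request_orphaned(stream_id)
still_active = pool.connection is old  # 5 of 8 orphaned: below 75%
pool.on_request_orphaned(5)            # 6 of 8 orphaned: threshold reached
replaced = pool.connection is not old and old in pool.trash
for stream_id in range(6):
    pool.on_late_response(old, stream_id)
```

Keeping the old connection in the trash rather than closing it immediately lets late responses drain without being routed to the new connection.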
Co-authored-by: Piotr Dulikowski <[email protected]>
# this is used in conjunction with the connection streams. Not using the connection lock because the connection can be replaced in the lifetime of the pool.