graphql, node, rpc: improve HTTP write timeout handling #25457

Merged: 46 commits into ethereum:master, Dec 7, 2022

Conversation

@s1na (Contributor) commented Aug 1, 2022

A work-around for golang/go#47229. Trigger the timeout a bit before the stdlib closes the connection, so users see a proper error message. Note that I still kept the WriteTimeout of the http.Server.

I'm not sure what happens if the method handler and the timeout statement both write at the same time.
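
A hedged standalone sketch of the idea (address, margin, and payloads are illustrative, not the PR's actual code). The select ensures only one of the two paths writes a response:

package main

import (
	"context"
	"net/http"
	"time"
)

const writeTimeout = 30 * time.Second

func main() {
	srv := &http.Server{
		Addr:         ":8545",
		WriteTimeout: writeTimeout, // net/http kills the connection after this
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Expire slightly before WriteTimeout so there is still time
			// to flush a proper JSON-RPC error to the client.
			ctx, cancel := context.WithTimeout(r.Context(), writeTimeout-50*time.Millisecond)
			defer cancel()

			select {
			case <-time.After(35 * time.Second): // stand-in for a slow RPC method
				w.Write([]byte(`{"jsonrpc":"2.0","id":1,"result":"ok"}`))
			case <-ctx.Done():
				http.Error(w, `{"jsonrpc":"2.0","id":1,"error":{"code":408,"message":"request timed out"}}`,
					http.StatusRequestTimeout)
			}
		}),
	}
	srv.ListenAndServe()
}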

Fixes #21430

@s1na s1na requested a review from fjl as a code owner August 1, 2022 12:50
@s1na s1na requested a review from gballet as a code owner August 1, 2022 15:18
@s1na s1na requested a review from holiman as a code owner August 8, 2022 17:15
@s1na (Contributor, Author) commented Aug 9, 2022

I implemented a different approach that we roughly discussed with @fjl:

The rpc handler now runs the method in a goroutine. It waits for that goroutine to finish OR for the context deadline to expire, in which case it returns an error. The method keeps running in the background until it completes (a possible leak), so it's important that long-running methods respect the context deadline. I made sure this happens in getLogs, which has received most of the tickets for this issue.

The timeout is set in the outer layer (node/rpcstack). Alternatively, we could pass it through to the handler and let it set the timeout.

Note: the error type I added feels awkward alongside the official json-rpc errors.
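
A minimal sketch of the goroutine pattern described above, assuming a simplified jsonrpcMessage type (the real rpc package types differ):

package rpc

import (
	"context"
	"errors"
)

// jsonrpcMessage is a simplified stand-in for the real message type.
type jsonrpcMessage struct {
	Result string
	Err    error
}

// runWithDeadline runs call in a goroutine and returns either its answer
// or a timeout error once ctx expires. The call keeps running in the
// background after a timeout (the possible leak mentioned above).
func runWithDeadline(ctx context.Context, call func() *jsonrpcMessage) *jsonrpcMessage {
	// Buffered so the goroutine can always deliver its answer and exit,
	// even if the timeout path has already returned (see the review
	// comment below about buffering).
	answerCh := make(chan *jsonrpcMessage, 1)
	go func() { answerCh <- call() }()

	select {
	case answer := <-answerCh:
		return answer
	case <-ctx.Done():
		return &jsonrpcMessage{Err: errors.New("request timed out")}
	}
}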

node/rpcstack.go Outdated
@@ -198,6 +198,9 @@ func (h *httpServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
// if http-rpc is enabled, try to serve request
rpc := h.httpHandler.Load().(*rpcHandler)
if rpc != nil {
ctx, cancel := context.WithTimeout(r.Context(), h.timeouts.ReadTimeout-(50*time.Millisecond))
@s1na (Author) commented Aug 9, 2022

Also note this line: 50 ms is a totally arbitrary number; lower values like 20 ms should also be fine. The important thing is that the timeout hits before the stdlib's http library kills the connection.

@holiman (Contributor) commented Aug 30, 2022

This seems to work well. With a test method like this:

diff --git a/internal/ethapi/api.go b/internal/ethapi/api.go
index 90322033b9..b90ac5048d 100644
--- a/internal/ethapi/api.go
+++ b/internal/ethapi/api.go
@@ -2035,3 +2035,10 @@ func toHexSlice(b [][]byte) []string {
 	}
 	return r
 }
+
+func (api *DebugAPI) TimeMeOut(ctx context.Context, seconds uint64) (string, error) {
+	log.Info("TimeMeOut sleeping...", "seconds", seconds)
+	time.Sleep(time.Second * time.Duration(seconds))
+	log.Info("TimeMeOut waking up!")
+	return "Oll korrekt!", nil
+}

The 25-second call works fine, but the 35-second call times out:

[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[25],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"result":"Oll korrekt!"}
[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[35],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"error":{"code":408,"message":"request timed out"}}

And the timeout triggers after 30s:

INFO [08-30|18:47:55.105] TimeMeOut sleeping...                    seconds=35
WARN [08-30|18:48:25.056] Served debug_timeMeOut                   conn=127.0.0.1:44346 reqid=1 duration=29.950591423s err="request timed out"
INFO [08-30|18:48:30.106] TimeMeOut waking up! 

@fjl (Contributor) left a comment

What I don't like about this PR is that the RPC handler will keep running in case of a timeout, because it now executes the RPC call on a background goroutine.

It would be nicer to leave this out and rely on the individual RPC handlers to exit promptly when the context is canceled.
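
For illustration, a long-running method that cooperates with cancellation might check the context between units of work. Names and types here are hypothetical, not the eth/filters code:

import "context"

// Hypothetical sketch: check ctx between per-block units of work so the
// method aborts soon after the deadline instead of running to completion.
func collectLogs(ctx context.Context, blocks []uint64, readLogs func(uint64) []string) ([]string, error) {
	var out []string
	for _, n := range blocks {
		select {
		case <-ctx.Done():
			return nil, ctx.Err() // deadline exceeded or canceled
		default:
		}
		out = append(out, readLogs(n)...)
	}
	return out, nil
}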

rpc/handler.go Outdated
@@ -334,8 +334,19 @@ func (h *handler) handleCall(cp *callProc, msg *jsonrpcMessage) *jsonrpcMessage
return msg.errorResponse(&invalidParamsError{err.Error()})
}
start := time.Now()
answer := h.runMethod(cp.ctx, msg, callb, args)
answerCh := make(chan *jsonrpcMessage)

This channel needs to be buffered, otherwise the handler goroutine will leak when there is a timeout.

@holiman (Contributor) commented Aug 31, 2022

> What I don't like about this PR is that the RPC handler will keep running in case of a timeout, because it now executes the RPC call on a background goroutine.

I agree. If you have a DoS RPC request that you use to crash some remote node (or Infura), then this feature would make the DoS even easier: you could just fire requests sequentially, fire-and-forget, and get a multithreaded impact.

@holiman (Contributor) commented Nov 5, 2022

TODO Triage discussion: is this worth doing, or is it dead in the water due to the inherent limitations of this approach?

@fjl (Contributor) commented Nov 15, 2022

Alternative approach:

When handling an RPC call, launch a timer using time.AfterFunc(timeout, callback), where the callback delivers the error response if the response has not been sent yet. This can be done using a 1-buffered channel, for example:

responseSlot := make(chan struct{}, 1)
responseSlot <- struct{}{}

responded := make(chan struct{})

timeoutResponseTimer := time.AfterFunc(timeout, func() {
    select {
    case <-responseSlot:
        // The timeout occurred and the method has not responded yet:
        // send the timeout error response.
        close(responded)
    case <-responded:
        // The method responded first.
    }
})

response := runMethod()
timeoutResponseTimer.Stop()
select {
case <-responseSlot:
    // Send the response.
    close(responded)
case <-responded:
    // The timeout error response was already sent.
}
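
In this sketch, the single token in responseSlot ensures that exactly one of the two writers (the timer callback or the method path) gets to send a response, while responded unblocks whichever side loses the race.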

@fjl (Contributor) commented Nov 15, 2022

Actually, it can also be done with fewer channels, using sync.Once:

var respondOnce sync.Once
timeoutResponseTimer := time.AfterFunc(timeout, func() {
    respondOnce.Do(func() {
        // send timeout error
    })
})

response := runMethod()
timeoutResponseTimer.Stop()
respondOnce.Do(func() {
    // send response
})
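
A self-contained demo of this pattern (function names and the string-based response are illustrative only):

package main

import (
	"fmt"
	"sync"
	"time"
)

// respondWithTimeout writes exactly one response: the method's result,
// or a timeout error if the timer fires first.
func respondWithTimeout(timeout time.Duration, runMethod func() string, write func(string)) {
	var respondOnce sync.Once
	timer := time.AfterFunc(timeout, func() {
		respondOnce.Do(func() { write("error: request timed out") })
	})
	defer timer.Stop()

	result := runMethod()
	respondOnce.Do(func() { write(result) })
}

func main() {
	slow := func() string { time.Sleep(2 * time.Second); return "result" }
	// Prints the timeout error after 1s; the late result is discarded.
	respondWithTimeout(time.Second, slow, func(s string) { fmt.Println(s) })
}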

@fjl fjl changed the title node: rpc write timeout work-around graphql, node, rpc: HTTP write timeout handling Dec 6, 2022
@fjl fjl changed the title graphql, node, rpc: HTTP write timeout handling graphql, node, rpc: improve HTTP write timeout handling Dec 6, 2022
@fjl fjl added this to the 1.11.0 milestone Dec 6, 2022
@fjl (Contributor) commented Dec 6, 2022

I have decided to remove the changes in eth/filters and eth/tracers from this PR. We can submit them in a separate change.

@fjl (Contributor) commented Dec 6, 2022

RPC method handler changes resubmitted in #26320

@fjl fjl merged commit f20eba4 into ethereum:master Dec 7, 2022
shekhirin pushed a commit to shekhirin/go-ethereum that referenced this pull request Jun 6, 2023

Here we add special handling for sending an error response when the write timeout of the
HTTP server is just about to expire. This is surprisingly difficult to get right, since it
must be ensured that all output is fully flushed in time, which needs support from
multiple levels of the RPC handler stack:

The timeout response can't use chunked transfer-encoding because there is no way to write
the final terminating chunk. net/http writes it when the topmost handler returns, but the
timeout will already be over by the time that happens. We decided to disable chunked
encoding by setting content-length explicitly.

Gzip compression must also be disabled for timeout responses because we don't know the
true content-length before compressing all output, i.e. compression would reintroduce
chunked transfer-encoding.
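
To illustrate the first point: with Content-Length set, net/http sends a fixed-length body instead of chunked encoding, so everything is flushed by the time Write returns. A hedged sketch (status code and error payload are illustrative, not the PR's exact response):

import (
	"net/http"
	"strconv"
)

// writeTimeoutError sends a fixed-length (non-chunked) error body so that
// nothing remains to be flushed after the handler returns.
func writeTimeoutError(w http.ResponseWriter) {
	body := []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":408,"message":"request timed out"}}`)
	w.Header().Set("Content-Type", "application/json")
	// Setting Content-Length explicitly disables chunked transfer-encoding.
	w.Header().Set("Content-Length", strconv.Itoa(len(body)))
	w.WriteHeader(http.StatusRequestTimeout)
	w.Write(body)
}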
gzliudan pushed a commit to gzliudan/XDPoSChain that referenced this pull request Apr 24, 2025
gzliudan added a commit to XinFinOrg/XDPoSChain that referenced this pull request Apr 25, 2025
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Apr 25, 2025
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Apr 25, 2025
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Apr 27, 2025
gzliudan added a commit to gzliudan/XDPoSChain that referenced this pull request Apr 30, 2025
pdrobnjak pushed a commit to Tenderly/net-scroll-geth that referenced this pull request Apr 30, 2025
Development

Successfully merging this pull request may close these issues:

Calls for block/transaction information sometimes don't return (#21430)