-
Notifications
You must be signed in to change notification settings - Fork 38.5k
WebFlux with Tomcat, intermittent timeout when creating response #23096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
As another data point, if I change the Controller
Running load against the endpoint for 1 hour produced the following output: From wrk (showing 13 timeout events):
And from the application console output:
This output appears to indicate that no timeouts occurred (even with the much tighter 50ms timeout) on the Mono instances returned by the Controller |
I tried changing the demo project dependencies to use the Netty based server and the Jetty server, and neither saw any repros in hour-long runs. I also tried updating the demo project dependencies to build against the |
First, on WebFlux we make sure the Servlet container doesn't timeout. So you do need to insert a The stacktrace shows Tomcat is performing an async dispatch. This happens only in case of a request handling error after the response is committed. In that case we have to dispatch briefly back into the container in order to raise the error from the container thread. The problem I think is that the onError handler where we dispatch sets the internal This is all a little convoluted but it's how the Servlet API works. It's possible that Jetty just handles it better and Reactor Netty is completely different. |
Interesting. Thank you for taking a look! For reference, having run the demo many more times with various small changes now, I have only very rarely seen the |
Actually the current behavior is based on #20600. The issue was with a chunked response that fails in the middle of being written, and the client doesn't realize there was a failure because the response just ends with a zero chunk. The change made is explained here:
At this point I'm back to not being sure what root cause is. Perhaps Tomcat isn't performing the async dispatch or failing to dispatch (as in the above stacktrace) leaving the response incomplete. Turning up logging for |
I've updated to used a logger in the WebFilter and added a couple new log calls as well as enabling TRACE level logging for
Based on the timing, it looks like the only relevant log line from Regarding your question about the metric where the mentioned apparent connection leak was observed, this was in a metric emitted by the application with a value based on Tomcat's connectionCount mbean. Again, this apparent leak went away when the WebFilter with timeout was added to ensure request handling eventually completed. |
So perhaps that's all it is then, i.e. making sure request handling does complete? It would be useful to know what part of the chain doesn't complete but aside from that nothing to fix here then. |
I'm not sure I follow. As I understand it, the logging seems to indicate that the Flux returned by the handler method in the Controller completed immediately following the handler method being called - but then ~10 seconds later the |
Okay I see what you mean. Meanwhile I uncovered #23175 coincidentally and I think it may be related to this issue. |
I wasn't able to dig deeper, and since I don't know that the cause is or whether it's feasible to apply to 5.1.x, I've moved it to 5.2 RC2 for now. |
Alright. Thank you for the update. I will try to find some time to dig deeper from my end to try to better understand what is going wrong. |
I suspect something something in the Servlet 3.1 reactive bridge, i.e. the classes used by |
Thanks for the pointer. I finally had some time yesterday to dig deeper here, and I believe I have identified the race that is responsible for this issue. It occurs in A quick fix I applied via synchronization to hopefully make the issue clear can be seen here: danielra@02dfda2 With this fix in place, I ran load against my demo repro project for 15 hours (1,186,638,734 requests) without any repros of the response timeout issue. Whereas in contrast without this fix I typically see multiple repros within a 5 minute run. |
Indeed it looks like there is a race. Thanks for narrowing it down! I think we can avoid the synchronization blocks. I'll experiment with that. There are also a few similar flags in other related classes, on the reading side and for writing with flush boundaries. Those will also need similar updates. |
Okay, I've come up with something. Give it a try with 5.1.10 snapshots once the build is complete. I ran for 10 minutes with no timeout errors (previously 5 min was giving a timeout consistently). |
Great. I'll this out once a snapshot version is available with this change included. Thanks! |
I had a chance to test this out. I ran a 15 hour test against the 5.1.10.BUILD-SNAPSHOT version with no response timeout occurrences. So, this looks good to me. Thanks! I am looking forward to the release! |
Thanks for the time to report, debug, and test! |
Affects: \5.1.7.RELEASE
I am running a spring-boot application which is using WebFlux with Tomcat as the underlying http server. In my real application with fairly high request rate, I noticed in metrics what appeared to be a slow leak in tomcat connections. I have since added a WebFilter to implement a response timeout, and this confirmed that a low volume of responses were never being completed. With the WebFilter in place, the observed connection leak is fixed - but I would like to resolve the response timeouts.
I have created a relatively simple isolated demo project which reproduces the issue here: https://github.com/danielra/webflux_response_timeout_repro
The project can be built via
./gradlew clean build
and then run viajava -jar ./build/libs/demo-0.0.1-SNAPSHOT.jar
. To reproduce the issue with the demo project I have then been usingwrk
( https://github.com/wg/wrk ) to throw load against the running application.For example:
Note that for this 5 minutes of load, 1 timeout was observed. The corresponding application console output was as follows:
The line
Response timeout after 10001 milliseconds for GET request with uri 'http://localhost:8080/get'. Response status code was already committed: '200 OK'.
is from the WebFilter included in the demo project. In this case there was also an apparently corresponding IllegalStateException logged, though in other runs I don't always see this sort of exception logged when the WebFilter's log line appears.For quick reference, the Controller method which handles the relevant requests in the demo project can be seen here: https://github.com/danielra/webflux_response_timeout_repro/blob/master/src/main/java/com/example/demo/controller/DemoController.java#L30
And the WebFilter that implements the response timeout can be seen here: https://github.com/danielra/webflux_response_timeout_repro/blob/master/src/main/java/com/example/demo/filter/DemoWebFilter.java#L25
Note that while the timeout is set to 10 seconds in this example, the responses seem to never be completed regardless of how long is waited. Apparently leaked tomcat connections could be observed building up for more than a week in my real application's metrics.
Please let me know if there is anything additional from my end that would be helpful in understanding and resolving this issue. Thank you for your time!
The text was updated successfully, but these errors were encountered: