-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Incorrect Assumption There Is A TCP Connection When There May Not Be #2750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Does it still respond to ping when its stuck in this loop forever state for you? if so, I have seen this issue where only a power cycle will get it back. Kicking it from the AP and having it reconnect does not help wake it back up. it does respond to ping. |
@mtnbrit that is interesting. Yes it will still respond to ping. The packets can be lost anywhere between the remote machine and the ESP device, in my case most often in the Telco mobile network. If the packets are lost beyond the router then it will work fine when pinged from the local network and the packet loss can be temporary so it could be pinged from elsewhere as well once the network is working again. However, a failed TCP connection will not be healed so it will remain in the endless loop referred to regardless of the network healing. What is interesting is where @mtnbrit says:-
because without a WiFi network it should know the TCP connection is broken. How do you kick it from the AP, is it by a command on the router? Also how long is it disconnected from the router because it should be disconnected for a while before it gives up. The TCP layer can deal with intermittent packet loss and needs a few seconds to try and recover from a short term failure before it classifies packet loss as a failed TCP connection. You could try something like blocking the ESP MAC address from connecting to your router for 10 seconds or so and see if it is still stuck in an endless loop or the TCP connection has failed gracefully. |
Any progress on this one? WiFi down should definitely trigger a connection closed signal on a device where WiFi is the only connectivity. |
Disconnection from an AP does indeed close all active WiFiClient connections. The point from the original issue is a valid one. My assumption was that lwip will kill inactive PCBs after a while. I will check whether this indeed happens. |
I'm thrilled to get a response from @igrr. In a real system that is expected to run for long periods this is a major flaw so I was a little disappointed there was so little interest when I first raised the issue. It seems to me to be so major that commercial products would be insufficiently reliable to be viable with this library. I have had devices run for weeks in the stuck condition so I'm confident lwip does not pick up the absence of a remote socket. In the TCP layer there is a concept of keep alive packets but these are often not implemented at all and the relevant RFS says they shouldn't be sent at less than 2 hour intervals see http://ltxfaq.custhelp.com/app/answers/detail/a_id/1512/~/tcp-keepalives-explained . You will see all over the internet people stressing the importance of large capacitors with ESP8266's to keep them reliable. I think that claim is not in general true and mostly large capacitors are used to overcome this software error. Radio transmit increases current draw a lot which can cause a voltage drop sufficient for the processor to continue to work but the radio to fail temporarily. Then you can get into the state described above where the device is stuck waiting for a packet that will never come. I can reproduce this reliably by powering an ESP8266 that sends data at 1 minute intervals off alkaline batteries until they become flat. Alkaline batteries have a high internal resistance making the voltage sensitive to current draw and while the exact battery voltage where the ESP will get stuck is unpredictable it almost always does, long before the batteries are flat and usually from about 2/3 flat with 3 cells in series. Rebooting makes the ESP functional again on the same batteries. I'm not a hundred percent sure of this because I am assuming the processor can recover from a brownout satisfactorily. What I have checked though is that the processor is still running when in this state as the current draw fluctuates and it appears to be fluctuating enough to be still using the radio and it will still flash the led on a NodeMCU occasionally. Most applications will work OK with an occasional reboot but stuckness is fatal and that is what, as far as I can tell, most large capacitors are added to avoid. Looking some more, I came across the statement "LWIP_TCP_KEEPALIVE controls compiling in support for TCP keepalives and by default each connection has keepalives off." So it seems keep alives can be enabled in lwip but I hope the interval is much less than 2 hours. |
Thanks for this information, @kentaylor. I have been experiencing TCP connections from node MCU units getting into a bad state (unit thinks it is still connected via WiFiClient.connected() API, but it cannot get any more data from server) where a power cycle of the unit, or dropping the connection from the server, has been the only way to make communication work again. On server side the connections are eventually detected to be broken (exact exception in python is TimeoutException). My workaround so far is dropping TCP connections from node MCU units after awhile, for example never keeping it for more than five minutes, by closing it on the server. It does feel really bad though when TCP sockets are supposed to be super reliable. The read on keep alive packets was good, it seems like a good balance between reliability and low transmission overhead. |
We can check for (in)activity on any given PCB and close it after certain timeout. This however would break applications which use persistent TCP connections without application-level keepalives (like MQTT does). For instance, I have an application with a TCP connection which is expected to stay up for hours, without any data being passed in either direction. Would it be best to add some reasonable timeouts at application layer? For instance HTTPClient library can have some timeout associated with the |
Updating the WiFiclient example to include connection reliability would be good. What do you mean by "at the application layer"? |
@kentaylor "Most applications will work OK with an occasional reboot but stuckness is fatal and that is what, as far as I can tell, most large capacitors are added to avoid." - as far as i remember from some posts, is that one possible issue is that the Flash operates on 3V on most of the devices. So using 2 alkaline batteries is already on it's low level. |
@igrr My application does not use battery for power (status lamps in software development offices). So I prefer reliability in favour of low power consumption or low bytes-per-minute.
|
Does anyone know how I can enable keepalive for TCP communications with the ESP8266/Arduino ? I see some mention by @igrr and @kentaylor in this chain. My issue is I have multiple clients connecting and sometimes a client can drop off and the TCP socket remains open. If I could implement a keepalive function, I could free the socket. I have found "espconn_set_keepalive" function in the espconn.h, but I have no idea how to implement it in the Arduino environment. |
Basically your clients need to keep sending something say every 5 sec if nothing else is happening and your server needs to reset a timer on data received. When the timer times out, the server should close the socket on its end. See the code in this project Of course your clients must send the keepalive messages :-) |
Unfortunately I have no control over the attached client and I have no way of knowing if it drops from the network. The ESP8266 as the server to the client needs to send the keepalive message every minute or so to see if the client has dropped from the network so I can stop the client connection. The client may be attached for 1 minute or 1 or mores hours without sending any data, so setting a timer for lack of communications would be unpredictable. Without releasing the sockets when a client drops off the network unexpectedly I run out of sockets and the clients cannot attach to me. |
I seems to be able to reproduce this issue easily. |
Yes. Keepalives do that. Here is a better reference.
I don't think it will, An application will see a failed socket and should be robust to this. To quote from the reference "Keepalive is non-invasive, and in most cases, if you're in doubt, you can turn it on without the risk of doing something wrong." The only downside is increased network traffic.
An application like this will usually fail when it is behind a router that does network address translation (NAT) without keepalives. You need them in this case for a different reason to that described above. The NAT will drop the mapping after a while and the device on the internet side of the NAT will no longer be able to initiate messages to the device behind the NAT. See reference.
Assuming the application layer includes libraries like http client then yes, timeouts should be used rather than relying only on the WiFiClient::connected() to indicate there is no longer a connection to the remote machine. ESTABLISHED, without keepalives, means there has been a connection to the remote machine that hasn't been closed but it may nevertheless, no longer exist. In my view keepalives should also be used as they will maintain NAT mapping, cause sockets open a long time to lose their ESTABLISHED status when the far end is lost, and would be sufficient on their own to eliminate stuckness caused by the existing libraries. User programs that made the same errors as the current libraries would also benefit. The downside is extra network traffic. In the MQTT application mentioned by @igrr, if messages were sent rarely and the interval is short the keepalives could be most of the traffic. On wired networks the extra traffic is insignificant. On mobile phone networks data is expensive. The Telcos in Australia charge in 1KB increments, they used to round to 1MB and some resellers still do. They start a new increment if the time between data transmissions is too long (I'm not sure how long). I find this annoying because I've found on some low bandwidth applications most messages sent, even a few bytes count as 1KB, previously 1MB, and data volume as measured by the Telco is much higher than expected. It's becoming less of an issue as data costs decline though. A keepalive interval shorter than the Telco new connection charging interval could actually reduce data costs on mobile networks. |
@kentaylor is this issue still valid with latest git, or better yet, with #3215 ? |
Keepalives are enabled in LwIP, but the timeout is still the default one:
We can manipulate keepidle value for each TCP connection. The question is, where should the timeout come from? Should we introduce yet another ESP8266-specific API to WiFiClient class, or let applications retrieve |
@igrr my humble opinion: The fact that a specific API is added isn't a problem, as long as it's not necessary to call it with normal usage. That means that there is a default set under the hood. In this case, I it would be called only by those users who actually need it a value other than the default. |
Keepalives have been added but the initial problem I reported "Where there is an unreliable network connection the ESP8266 using the Arduino Core can get stuck in an endless loop waiting for a message that will never arrive" still exists according to my testing. I have been testing over a mobile phone data connection where the the poor connection quality triggers this problem regularly. In the tcp_keepalive function I can see that, when debugging the keep alive acknowledgements are printed. I don't fully understand the logic but this comment implies keepalives are only intended to be sent over an idle connection. Stuckness is happening when a socket is waiting for input but will never receive it. The decision to close a socket in the absence of a keepalive acknowlegement is made here but the pcb->tmr is set equal to tcp_ticks here while the socket is waiting for data from the remote socket. Therefore, the socket is never closed while waiting for data from a remote socket even though it may never come. I'm finding the logic in the code difficult to follow but Espressif is struggling too. It is now 12 months since I reported the problem, 9 months since @igrr acknowledged it and Espressif's attempts to fix it have not yet suceeded. |
Have you tried latest master with lwip-v2 enabled ? |
@d-a-v I'm not sure if I've used the latest implementation and I thought keepalives are enabled by default. Aren't they? I updated to the latest code in the latter half of October and tested it for a bit over a month over a Vodafone mobile networked WIFi access point. It failed regularly at first but has not suffered stuckness in the last two weeks. Should I update again? Previously I have found a stuckness failure every few months on a wired internet connection, frequent stuckness on the Telstra network over a period of a couple of weeks and on the Vodafone Telco network it has run for weeks without failure then regular failure for a few weeks then back to no failure for a while. I suspect reliability depends on the amount of traffic in the Telco cell but I speculate. Whatever the cause in the Telco network, it shouldn't matter, it should not get stuck. There is a work around of using an additional hardware based watchdog timer in the application layer as described here. There is a fundamental issue that Espressif seems to be ignoring. It is impossible to know whether a remote machine will respond in the expected way. Therefore implementations need to be robust to any response/non response from the remote end and this needs timeouts whenever waiting for something. Some of this robustness can be provided in the libraries and where possible it should be. But some can only be provided in the application layer. For example where a HTTP client is waiting for a response there should be a timer that will cause the HTTP request to fail after a reasonable, ideally configurable, amount of time but where an application is maintaining an open TCP connection it is impossible for the library to know what is a reasonable time to wait. In that case the application should have some periodical interaction with the remote end to ensure it is still behaving as expected. Keepalives provide assurance that the socket is still open as well as maintaining routes through NATS but they don't provide assurance that the application using that socket will send what it should or send anything at all. Keepalives operate below the TCP layer and do not interfere with the message flow in TCP so their use should not depend on what the TCP connection is doing at any moment. According to this good reference "Keepalive is non-invasive, and in most cases, if you're in doubt, you can turn it on without the risk of doing something wrong." |
About stuckness problem, and because of your:
In your original post above (sorry I did not read that thread before), I think that this is similar to #2925 and addressed by lwip2 which is only available in master. (I was not talking about keepalive, I rekon my message was confusing) |
The problem identified in this issue is not similar to #2995 - "New Simple Easily Reproducable Catastrophic TCP Stack Failure Results in Low Heap Crash shown with WiFiServer, Latency > 100ms, and Larger Send Bytes". This is a problem where the ESP8266 can get stuck in an endless loop waiting for a message that will never arrive as more fully described in the first post of this issue. It is caused by not timing out when the ESP is waiting for a response from a webserver. |
As pointed out by @EnricoElCartmanez, here, keepalives are not enabled by default but must be turned on for each socket. Therefore, the testing reported above is invalid as keepalives were not enabled Looking again at the test of whether a socket should be closed due to failure to receive a keep alive acknowledgement, I expect it will succeed in eliminating stuckness. However, keepalives should not be required to fix stuckness. The root cause is that the ESP is not timing out while waiting for a response that will never come. For example this loop will never return if the ESP thinks the socket exists even though the socket may no longer exist on the remote server. As @d-a-v is a contributor could this be fixed? |
As I tried to explain, keepalive code is enabled with defines, then must be set for every TCP connection in lwip's raw pcb in our case, I guess ClientContext.h is the common location for that. @kentaylor do we all agree? I am a contributor because I submitted a pull request which, by chance, was accepted. I have no more rights than you have. I can do the fix, you can too, but before that we need to know if such a pull-request has a chance to be accepted by members of this repo. Enabled keepalive is already in sources. Setting automatically it for all connection is not, but it can be done by hand in sketches, so to prove the fix works as expected and is necessary. We also need to know if this must always be enabled, and what keep-alive values would have to be set, or what new API is desired for setting parameters Proofs and coherency are needed for members to accept such a PR. You can also try lwip2 at no cost even if your problem is not similar to #2995, even if automatic keep-alive may be interesting or needed for this project. As I say lwip2 addresses problems with tcp stuckness that I experimented, not only me. TCP stuckness problems is the primary reason why lwip2 has been ported here. |
About stuckness problem, and it is to my mind not exactly but strongly related with your references to the test and loop above and your original issue too: lwip2 compared to previously-default-lwip1.4 also fixes the following behaviour:
You can reproduce that with the TelnetSerial example. |
That is broken behaviour. I'm quite surprised it's not happening more in general...? Is there a clear way to reproduce the behaviour? |
I don't agree with the use case, or I don't understand it. |
I think in the case of HTTPClient library this may not happen, as at this point the client is not sending anything to the server. Assume that the client has received a packet, sent an ack, and is waiting for the next packet. At this point, plug is pulled from the server. The client will faithfully sit waiting for the next packet, which will never arrive. This can be either solved using keepalives, or by fixing the logic of HTTPClient (i.e. taking some timeout into account). I would probably consider the second option safer (less possible side effects) at the moment. |
Thankyou @igrr. Yes, that is what happens. I would only add that fixing the logic of HTTPClient should be done regardless of whether keepalives are enabled even though enabling keepalives should fix the problem as well. The principle is, that it is not possible to know if a TCP connection still exists so when awaiting data there always needs to be a timeout. Therefore a timeout should be added to this loop. This is not the only place in the Arduino code where the device does not have a timeout while waiting for data from a TCP socket but fixing this one would be a great start. In response to @objarni, yes it is broken behaviour and it has been broken for a long time. In this thread I have linked to lots of reports where the problem being reported is most likely caused by this error. It is also, I suspect, why lots of people think large capacitors are needed to make the esp8266 reliable for the reasons explained here. I can reproduce the problem quite reliably when I power a esp8266 from 3 alkaline batteries as described, I see it often when using mobile phone networks to connect to the internet, rarely on wired internet connections and once when the Thingspeak server was faulty devices connected to Thingspeak that were kilometers apart on different networks all got stuck at the same time. Apart from powering with alkaline batteries I can't reproduce stuckness reliably as it requires dropping IP packets at the correct moment and I don't know how to do that. |
@kentaylor thanks for keeping at this very real issue. I hope some kind of fix which does not involve "turn the world upside down" is merged is in place for 2.4 release. An explanation of why my application - a internet connected "information radiator/orb" (physical pixel?) RGB led - does not suffer is because we do not use HTTPClient. Instead, we use raw TCP sockets and control connection state ourselves (reboot device if 90 seconds without any information from server seen). |
Alright, in the words of my country: Somebody Has To Cut The Cake (paraphrase), so I will. Conclusion: About keep-alives being disabled: If during a comm one end waits for a packet, but the other end dies unexpectedly, then the packet may never arrive. This wait state is not a transmit-wait-for-ack, where a retransmit algorithm within the tcp layer would help. Consider the following sequence (simplified):
The waiting is at above socket level, while the packet send-ack receive is at tcp level. In this case, if the low level tcp layer does not have an inactivity timer (which I do believe is our case), the app will wait forever. The connection has been lost, the tcp layer won't detect it, because there is nothing pending to transmit or ack pending to be received. Here, _tcp-> connected() returns true, which means that:
wold return true, which means that a loop that depends on HTTPClient::connected() would be stuck forever. I believe @kentaylor refers to this case. In that case, a timeout of some sort is needed. And, in fact, in the while(connected()) loop that @kentaylor links to, there is already a timeout implemented:
And so, in that case, there won't be an infinite loop. There other cases, such as the ThinkSpeakClient in the esp8266-weather-station implementation:
where there is no timeout implemented, hence the danger of infinite loop: client.connected() returns true, client.available() will always return 0, because the other side died. Conclusion: The arguable point is: in which level should it be implemented? Assuming I'm correct about what I see in the code, I see the following possibilities:
Conclusion: Is everyone ok with this plan? Now, having said all of the above, @d-a-v is correct about lwip2. There are a whole lot of issues in lwip1.4 that cause what seem to be infinite loops, or at least some form of s(t)uckness during a socket communication. I've lost count of how many issues have been addressed by it. Therefore, his request to try building with lwip2 is very valid, independently of whether this issue is really true or not. |
@devyte |
@aaro668 uninstall from board manager, then install git. Instructions are on readthedocs. Then restart Arduino IDE. In the menu, you should see lwip options. |
Thanks @devyte , this did not fix my problem though. It seems that wifiClient.connected() is returning true for a while even after I unplug the AP router.... |
@aaro668 I assume you mean that you still have an unreliable connection even with lwjp2. |
Thanks for the lengthy response. Where @devyte.says:
Yes that is correct.
I was taken aback when @devyte correctly pointed out I had linked to a loop that does have a timeout and is therefore OK. It is interesting that @devyte mentioned the
as having this problem because that was where I first discovered it. Here is a list of locations in the Arduino repository where I think a loop depends on the connected() function and will, under the circumstances confirmed by @devyte, get stuck forever.
Have I got these correct? @devyte says
, provides 3 alternatives and favours option 1 as a final solution. I'd argue that options 2 and 3 are best. Option 1 is undesirable for the @devyte reason
plus
Options 2 and 3 aren't exclusive so should both be done. I will test with lwip2. I expect it will take a few weeks as stuckness only occurs in certain circumstrances. In any case the problems identified here should be fixed. |
As @kentaylor shows it, every sketch currently uses
I'm not sure that long connections - idle or not - would be affected by TCP keepalives. To make things short:
|
@d-a-v @kentaylor I sort of lost track on this. I understand that the keepalive is merged. Is the issue resolved? Is there anything pending, or can this be closed? |
The issue is not resolved, at least not for the default case. Using 2.4.1, I periodically get a lockup running this code: I'd also like to second the notion that it is a cardinal sin for a library function to stroke the watchdog inside a loop that might block indefinitely; such behavior makes the entire platform unreliable for embedded applications. |
@devyte I had lost track of it too. @dalbert2 Are you able to reproduce this bug with latest git version (which is pre-2.4.2) ? @ALL
All of these timeouts beeing linked to |
Thank you for the quick response and the amazing level of help you provide in this forum! I am happy to test things, but unfortunately, I am not able to reliably reproduce the failure yet; in fact I saw it for the first time today in code that had been running for a long time; I found this thread when looking into it. My app can (currently) only build with LWIP 1.4 due to shortage of IRAM with LWIP2; if I can reproduce reliably, would the results be useful with LWIP 1.4? Thanks to Ken for identifying library calls that can loop indefinitely without timeout; if it's at all possible, those should be eliminated; no library code that can block should ever (effectively) disable the watchdog. |
What's stated above is not specific to lwip2, so yes you can try keepalive. |
For those struggling with this issue, an easy work-around is to implement a software watchdog using Ticker. Because the interrupt system is still running in this case, if the ticker expires without the watchdog getting stroked (e.g. because it is stuck inside LWIP), your ticker ISR can reset the system. Not perfect, but far better than a permanent lockup. If what I'm describing is not obvious, someone did a nice write-up here: |
@dalbert2 Have you had chance to try |
@d-a-v I'm sorry, I lost track of this issue and just now got back to it. I'm using core 2_4_2 with LWIP 2.0.3 (low memory version). I added a call to keepAlive(5,5,5) to my code. Unfortunately, it just causes an Exception 28 on the call to keepAlive. The problem is that keepAlive() de-references the client's tcp protocol control block which doesn't exist until after the call to connect(). Unfortunately, the call to connect() is where things are getting hung-up. It works most of the time, but sometimes the call locks up the ESP8266 for hours. millis() doesn't advance, but it does eventually recover. So I've since moved the keepAlive() call after the connect() call and it might help with subsequent GETs, but it seems that the problem is either in the call to dns_gethostbyname() or tcp_connect(). I don't see an obvious problem with the code leading up to either call. I've been trying to eliminate more code (e.g. using WiFiClient instead of HttpClient) to narrow down the problem. Next, I'm going to try using the server IP address instead of the hostname to count out the problem being in the DNS lookup.
|
A bit more information: this happens with either LWIP2 configuration and it seems to only happen when connection requests to the server are failing (which points back to the original problem in this post). After several failed connection attempts, a connection attempt will just hang either for hours or indefinitely. I'm able to reproduce it more often now and hope to soon be able to repeat it reliably. |
@kentaylor I think this issue was addressed with the keepalive feature. |
If you had a cheap pc with inbuilt NIC that supported W.O.L or ''magic packet message'' you could implement what i did with my wifi monitor to reboot my open wrt router supply to hard reset it when my complicated network loaded up to much at home. I came across same timing loop issue which i solved with turning an old open wrt wifi router into a wifi repeater and to repeat magic packet message to NIC when node mcu lost connection.and the nic's output gate relay toggled a relay rebooting mains power to my nbn router |
Basic Infos
Where there is an unreliable network connection the ESP8266 using the Arduino Core can get stuck in an endless loop waiting for a message that will never arrive.
Hardware
Hardware: ESP-12E
Core Version: All
Description
The effect of the problem is described in another thread that links to several instance reports. It happens often for me when the ESP is connected to a WiFi AP that uses a mobile phone network for its internet connectivity.
The failure is a consequence of an incorrect assumption that the ESP8266 knows whether it has a TCP connection. The assumption is correct when the TCP connection is created and destroyed in accordance with this state diagram
but can be invalid if there is packet loss once the TCP connection gets to the ESTABLISHED state shown in green in the diagram.
When lost packets are recovered in the TCP layer the state diagram is followed and when packets sent from the ESP are not acknowledged the ESP should know the TCP connection is broken but when the remote socket is closed incorrectly, perhaps through a power outage or packets have been sent from a remote socket but not received by the ESP it can not know. If packets are sent from the remote socket but not received, the remote machine will retry and eventually close its socket. If the ESP is waiting for incoming packets it will still think the TCP connection exists when it doesn't.
The Arduino core relies on the ESP knowing whether it has a TCP connection. An example is WiFiClient::connected() in WiFiClient.cpp where it concludes that there is a TCP connection when the TCP state is ESTABLISHED but ESTABLISHED means there has been a TCP connection, not necesarily that it still exists. In the WiFiClient.ino example client.available() is used to decide if a TCP connection exists and this function can return true when the TCP connection has been lost. Similarly in BasicHttpClient.ino it calls http.GET() which calls sendRequest which calls handleHeaderResponse() in which a loop is controlled with the erroneous connected() function that may return true when there is no TCP connection and hence loop forever.
The text was updated successfully, but these errors were encountered: