Board programmed with example 16 randomly stops sending GPS updates via Iridium #11

Open
adrian-bodenmann opened this issue Apr 15, 2025 · 7 comments

@adrian-bodenmann

We are using the SparkFun WRL-18712 board programmed with the code in example 16 and it sometimes stops transmitting GPS messages.

When monitoring the serial communication from the board, it eventually sends the following lines:
*** modem.clearBuffers failed with error 3 ***

followed by
*** modem.begin failed with error 5 ***, which is then sent repeatedly, and it never recovers until the board is restarted.

This happens randomly: sometimes within tens of minutes, sometimes within hours, and sometimes not at all (tested for several days). We now have three SparkFun WRL-18712 boards and it happens on all of them. We originally had one, which we noticed had stopped transmitting GPS. We initially suspected an electrical issue and bought two new boards, but the new ones have the same issue.

I did a lot of debugging, and it looks like the serial replies from the Iridium 9603 modem are sometimes missing the line feed (LF), and sometimes also the carriage return (CR), at the end of the terminator, i.e. "OK[CR]" or "OK" instead of "OK[CR][LF]".

waitForATResponse() in IridiumSBD.cpp then times out because it does not see the terminator it is expecting. That then causes an ISBD_PROTOCOL_ERROR (3) and afterwards an ISBD_NO_MODEM_DETECTED (5) error.

(Note: The missing "[LF]" or the "[CR][LF]" characters are sent at the start of the next transmission from the Iridium 9603 modem, but that does not happen until a new prompt (e.g. "AT") is sent to it).
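To illustrate why a late [LF] leads to a timeout, here is a minimal sketch of a strict terminator matcher. It is not the IridiumSBD code: the function names `sawTerminator` and `checkTerminatorExamples` are hypothetical, and it uses `std::string` so that it can be run on a PC rather than on the board.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative stand-in, NOT the IridiumSBD implementation.
// Returns true once every byte of `terminator` has been seen, in order
// and contiguously, somewhere in the received bytes `rx`.
bool sawTerminator(const std::string &rx, const std::string &terminator)
{
    if (terminator.empty())
        return true;

    std::size_t matched = 0; // number of terminator bytes matched so far
    for (char c : rx)
    {
        if (c == terminator[matched])
        {
            if (++matched == terminator.size())
                return true;
        }
        else
        {
            // Simple restart: adequate for terminators such as "OK\r\n",
            // whose first byte does not recur later in the terminator.
            matched = (c == terminator[0]) ? 1 : 0;
        }
    }
    return false; // a real caller would keep reading until its timeout
}

// Reply shapes as described in this thread, with [CR]/[LF] written as \r/\n.
// The last case is a constructed illustration of a reply ending in "O".
void checkTerminatorExamples()
{
    assert(sawTerminator("AT\r\r\nOK\r\n", "OK\r\n"));  // healthy reply
    assert(!sawTerminator("\nAT\r\r\nOK\r", "OK\r\n")); // [LF] arrives late
    assert(sawTerminator("\nAT\r\r\nOK\r", "OK"));      // shorter terminator matches
    assert(!sawTerminator("AT\r\r\nO", "OK"));          // but not once only "O" arrives
}
```

In this sketch the late [LF] shows up as the first byte of the next reply, which is why the strict "OK[CR][LF]" match never completes within a single reply.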

@adrian-bodenmann
Author

adrian-bodenmann commented Apr 15, 2025

Small update / clarification
The Iridium 9603 modem seems to shift what it sends further and further back, i.e. it starts by terminating messages with
"...OK[CR][LF]", then
"...OK[CR]", then
"...OK", then
"...O"
This always seems to happen first in the reply message to the prompt "AT+CGMR".

Previously I thought of using "OK" as terminator string instead of "OK[CR][LF]" as a workaround, but this now does not seem to be an option either.

To run these tests I enabled (uncommented) #define noTX and #define skipGNSS and set the intervals as follows:

#define DEF_WAKEINT   30
#define DEF_ALARMINT  1
#define DEF_TXINT     1

I then just let it run while monitoring the communication on the serial line. This should work indefinitely. However, most of the time (but not always) it stopped after tens of minutes to hours with the above errors.

@PaulZC

PaulZC commented Apr 15, 2025

Just copying my reply from the Forum:

I have used the 9603N a lot over the years, and I don’t think I’ve ever seen the [LF] or [CR][LF] being truncated. My vague guess is that the modem is being powered off while it is outputting its reply? Maybe because the code thinks the battery voltage is low? Did you have a LiPo battery attached during your tests? Maybe try reducing DEF_LOWBATT just to be sure?

Please make sure you are using version 2.1.0 of the SparkFun Apollo3 core / board package, and have selected the RedBoard Artemis ATP. I have seen some UART badness with 2.2.1.

@PaulZC

PaulZC commented Apr 15, 2025

Hi Adrian (@adrian-bodenmann ),

Please try enabling the extra diagnostic messages:

#define DIAGNOSTICS true // Change this to true to see IridiumSBD diagnostics

They may provide more clues?

Best,
Paul

@PaulZC

PaulZC commented Apr 15, 2025

Please also tell me how you are powering the board:

  • Do you have a LiPo battery attached? Is it fully charged?
  • Do you have USB attached? Is it providing power?

Do you have the Serial Monitor open? Please copy and paste the complete output from before and after the failure.

Thank you,
Paul

@adrian-bodenmann
Author

Thank you for your reply.
I am using version 2.2.1 of the Apollo3 core, so it looks like this is what causes the issue.

The reason I used 2.2.1 is that 2.1.0 causes another issue, and I assumed that the issue from 2.1.1 would have been fixed by 2.2.1.

Version 2.1.0 causes the following issue: If the program is (re)started while the Iridium modem is on (red LED on), it gets stuck at modem.begin() forever. This does not happen with version 2.2.1.

The boards are connected by the USB cable. No extra batteries attached.

I've been running the code with DIAGNOSTICS enabled, and I added or adapted some console prints (e.g. printing [CR] instead of the invisible \r).
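The [CR]/[LF] substitution can be done with a small helper along these lines. This is only a sketch: the names `showControlChars` and `checkShowControlChars` are hypothetical, and it uses `std::string` so it can run off-target, whereas code on the board would more likely work with Arduino's `String` or print byte by byte.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replace the invisible CR and LF bytes with the
// visible markers [CR] and [LF]; all other bytes are copied unchanged.
std::string showControlChars(const std::string &raw)
{
    std::string out;
    for (char c : raw)
    {
        if (c == '\r')
            out += "[CR]";
        else if (c == '\n')
            out += "[LF]";
        else
            out += c;
    }
    return out;
}

// Usage: a reply as received on the wire, made readable for the log.
void checkShowControlChars()
{
    assert(showControlChars("AT\r\r\nOK\r\n") == "AT[CR][CR][LF]OK[CR][LF]");
}
```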

This is an example of the code running well to start with, using version 2.2.1 of the Apollo3 core:

2025-04-23 10:33:36.378 INFO  Received on serial: >> AT
2025-04-23 10:33:36.378 INFO  Received on serial: Waiting for response OK
2025-04-23 10:33:36.378 INFO  Received on serial: << AT[CR][CR][LF]OK[CR][LF]
2025-04-23 10:33:36.378 INFO  Received on serial: >> ATE1
2025-04-23 10:33:36.378 INFO  Received on serial: Waiting for response OK
2025-04-23 10:33:36.393 INFO  Received on serial: << ATE1[CR][CR][LF]OK[CR][LF]

And this is 6 minutes later, where the line feed ([LF]) character is read late, presumably due to the bug in version 2.2.1 of the Apollo3 core:

2025-04-23 10:39:23.139 INFO  Received on serial: >> AT
2025-04-23 10:39:23.139 INFO  Received on serial: Waiting for response OK
2025-04-23 10:39:23.139 INFO  Received on serial: << [LF]AT[CR][CR][LF]OK[CR]
2025-04-23 10:39:52.704 INFO  Received on serial: waitForATResponse timed out
2025-04-23 10:39:53.704 INFO  Received on serial: >> AT
2025-04-23 10:39:53.706 INFO  Received on serial: Waiting for response OK
2025-04-23 10:39:53.706 INFO  Received on serial: << [LF]AT[CR][CR][LF]OK[CR]

In this occurrence the error happened after 6 minutes, but it varies from minutes to hours to not occurring at all, with exactly the same setup.

I am a bit surprised that the UART bug in the Apollo3 core has been known for such a long time and is not fixed, because I am sure it causes very hard-to-debug issues for others as well. Are there plans to fix this in a future release?

For the above-mentioned reason, rolling back to 2.1.0 is not really a viable solution.

@PaulZC

PaulZC commented Apr 23, 2025

Hi Adrian,

Thank you for the update.

I won't be able to run any hardware tests for ~two weeks. I will investigate why re-begin fails on 2.1.0 when I have access to hardware again.

With version 2.2.1 of the core, UnbufferedSerial and serial_api especially require patching. We had to do this on the OpenLog_Artemis. Details are here:

https://github.com/sparkfun/OpenLog_Artemis/blob/main/COMPILE_BINARY.md#patch-the-apollo3-core

Paulvha provided the patch. His notes are here:

sparkfun/OpenLog_Artemis#117 (comment)

I realize it's a big ask but if you have the time please try patching the 2.2.1 core and re-run your tests. I'm hoping the issue goes away...?

Best wishes,
Paul

@adrian-bodenmann
Author

adrian-bodenmann commented Apr 23, 2025

Hi Paul,
Thank you for the instructions.
I applied the patch for the relevant files.

Unfortunately the code now crashes. I believe this is due to an unnecessary assert in the patched serial_api.c.
This is the output of my serial logger:

2025-04-23 17:57:03.817 INFO  Received on serial: Putting the 9603N to sleep.
2025-04-23 17:57:03.817 INFO  Received on serial: custom IridiumSBD::endSerialPort
2025-04-23 17:57:03.817 INFO  Received on serial: Powering off modem...
2025-04-23 17:57:03.817 INFO  Received on serial: setSleepPin: sleepPin set LOW
2025-04-23 17:57:03.817 INFO  Received on serial: Getting ready to put the Apollo3 into deep sleep...
2025-04-23 17:57:03.831 INFO  Received on serial: custom IridiumSBD::endSerialPort
2025-04-23 17:57:03.831 INFO  Received on serial: Disabling 9603N power...
2025-04-23 17:57:03.831 INFO  Received on serial: Disabling the supercapacitor charger...
2025-04-23 17:57:03.831 INFO  Received on serial: Powering down the GNSS...
2025-04-23 17:57:03.831 INFO  Received on serial: Going into deep sleep until next WAKEINT (30 seconds).
2025-04-23 17:57:12.824 INFO  Received on serial: 
2025-04-23 17:57:12.824 INFO  Received on serial: 
2025-04-23 17:57:12.824 INFO  Received on serial: Artemis Global Tracker
2025-04-23 17:57:12.824 INFO  Received on serial: Software Version: 2.1
2025-04-23 17:57:12.824 INFO  Received on serial: 
2025-04-23 17:57:12.824 INFO  Received on serial: 
2025-04-23 17:57:12.824 INFO  Received on serial: Ready to accept configuration settings via Serial...
2025-04-23 17:57:12.824 INFO  Received on serial: 
2025-04-23 17:57:12.824 INFO  Received on serial: 
2025-04-23 17:57:12.840 INFO  Received on serial: Getting the PHT readings...
2025-04-23 17:57:12.845 INFO  Received on serial: *** Could not detect the MS8607 sensor. Trying again... ***
2025-04-23 17:57:13.851 INFO  Received on serial: Pressure (mbar): 1008
2025-04-23 17:57:13.851 INFO  Received on serial: Temperature (C * 10^-2): 2426
2025-04-23 17:57:13.855 INFO  Received on serial: Humidity (%RH * 10^-2): 3242
2025-04-23 17:57:13.855 INFO  Received on serial: Getting ready to put the Apollo3 into deep sleep...
2025-04-23 17:57:13.857 INFO  Received on serial: custom IridiumSBD::endSerialPort
2025-04-23 17:57:13.857 INFO  Received on serial: Disabling 9603N power...
2025-04-23 17:57:13.859 INFO  Received on serial: Disabling the supercapacitor charger...
2025-04-23 17:57:13.859 INFO  Received on serial: 
2025-04-23 17:57:13.859 INFO  Received on serial: 
2025-04-23 17:57:13.859 ERROR Received on serial: ++ MbedOS Error Info ++
2025-04-23 17:57:13.861 ERROR Received on serial: Error Status: 0x80FF0144 Code: 324 Module: 255
2025-04-23 17:57:13.861 ERROR Received on serial: Error Message: Assertion failed: obj->serial.uart_control->handle != NULL
2025-04-23 17:57:13.861 INFO  Received on serial: Location: 0x2F2E5
2025-04-23 17:57:13.861 INFO  Received on serial: File: mbed-os/targets/TARGET_Ambiq_Micro/TARGET_Apollo3/device/serial_api.c+158
2025-04-23 17:57:13.861 ERROR Received on serial: Error Value: 0x0
2025-04-23 17:57:13.866 INFO  Received on serial: Current Thread: main Id: 0x10007224 Entry: 0x30435 StackSize: 0x1000 StackMem: 0x10006200 SP: 0x100070D4
2025-04-23 17:57:13.866 ERROR Received on serial: For more info, visit: https://mbed.com/s/error?error=0x80FF0144&tgt=SFE_ARTEMIS_ATP
2025-04-23 17:57:13.866 ERROR Received on serial: -- MbedOS Error Info --

I believe this happens when calling Serial1.end() without having opened the port first.

The serial_free() in the patched serial_api.c starts with

void serial_free(serial_t *obj) {

    MBED_ASSERT(obj->serial.uart_control->handle != NULL);

    //
    // check for initialized
    //
    if (! obj->serial.uart_control->handle){
        return;
    }

However, MBED_ASSERT(obj->serial.uart_control->handle != NULL); does not seem to be necessary, and in fact causes the crash. The handle is checked in the if-clause below it, but execution obviously never reaches that line if the handle is NULL. I tried commenting out the assert, but the program keeps crashing with the same error message, so it looks like my change is not being picked up, even if I close the Arduino IDE (version 2.3.4), re-open it, compile the sketch with the tick button and upload it with the arrow button. Do you know why that is? Is it by any chance because of libmbed-os.a, i.e. does this file need to be rebuilt? If so, how?
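To make the ordering point concrete, here is a minimal sketch of a free routine that relies on the early return alone. It is not the patched serial_api.c: `FakeUart`, `safeFree` and `checkSafeFree` are hypothetical stand-ins, and the real mbed structures are simplified to a single handle pointer.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the UART control block.
struct FakeUart
{
    void *handle = nullptr; // nullptr means "never initialised or already freed"
};

// Returns true if the port was released, false if there was nothing to do.
// There is no assertion ahead of the check, so freeing an unopened port
// is a harmless no-op instead of a fatal error.
bool safeFree(FakeUart &uart)
{
    if (uart.handle == nullptr)
        return false;

    // A real implementation would disable and power down the peripheral here.
    uart.handle = nullptr;
    return true;
}

// Usage: ending a port that was never begun, then a normal begin/end cycle.
void checkSafeFree()
{
    FakeUart uart;
    assert(!safeFree(uart)); // never opened: nothing happens

    int dummyPeripheral = 0;
    uart.handle = &dummyPeripheral; // pretend the port was initialised
    assert(safeFree(uart));         // first free releases it
    assert(!safeFree(uart));        // second free is again a no-op
}
```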
