After few days connection stops working on Android #405

@IgorCordasProGlove

Describe the bug

We have an app that runs as a foreground service on Android and should keep a persistent connection to AWS IoT Core for multiple days or weeks.
The connection is provisioned using fleet provisioning.
We do not send much data over this connection; a standard use case is expected to send 100-1000 messages per device per day.
This mostly works, but on some devices, after a few days (usually 3), we start receiving strange errors and the device no longer connects properly. This is the part of the log that we use to detect the issue:
[2023.03.30 19:20:24.082] I: Publishing to topic <custom_topic>
[2023.03.30 19:21:54.519] I: Cloud connection interrupted: AWS_ERROR_MQTT_UNEXPECTED_HANGUP - The connection was closed unexpectedly.
[2023.03.30 19:21:54.520] I: New cloud connection state: REGISTERED_NOT_CONNECTED
[2023.03.30 19:21:55.887] I: Cloud connection reestablished. Resumed session: true
[2023.03.30 19:21:55.887] I: New cloud connection state: REGISTERED_CONNECTED

The strange thing is not that the connection broke, but that from this point on we receive AWS_ERROR_MQTT_UNEXPECTED_HANGUP in a loop, repeating every 90 seconds, until we restart the app process.
While the connection is in this state, we can still publish some messages, but our subscriptions stop working.
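
To make the loop easier to spot in collected logs, the check can be automated. The following is a minimal sketch, assuming log lines in the format shown above (`[yyyy.MM.dd HH:mm:ss.SSS]` prefix followed by the message). `HangupLoopDetector` and its methods are hypothetical helper names, not part of the AWS IoT Device SDK.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an SDK class): spots a periodic hang-up loop in app logs. */
final class HangupLoopDetector {
    // Matches the timestamp format used in the log excerpt above.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("uuuu.MM.dd HH:mm:ss.SSS");
    private static final String MARKER = "AWS_ERROR_MQTT_UNEXPECTED_HANGUP";

    /** Extracts the timestamp of every log line that reports an unexpected hang-up. */
    static List<LocalDateTime> hangupTimes(List<String> logLines) {
        List<LocalDateTime> times = new ArrayList<>();
        for (String line : logLines) {
            int close = line.indexOf(']');
            if (line.startsWith("[") && close > 0 && line.contains(MARKER)) {
                times.add(LocalDateTime.parse(line.substring(1, close), TS));
            }
        }
        return times;
    }

    /**
     * Returns true when at least minRepeats (2 or more) consecutive hang-ups are
     * each separated by roughly periodSeconds, within toleranceSeconds.
     */
    static boolean isLooping(List<String> logLines, long periodSeconds,
                             long toleranceSeconds, int minRepeats) {
        List<LocalDateTime> times = hangupTimes(logLines);
        int run = 1; // number of hang-ups in the current evenly spaced run
        for (int i = 1; i < times.size(); i++) {
            long gap = Duration.between(times.get(i - 1), times.get(i)).getSeconds();
            run = Math.abs(gap - periodSeconds) <= toleranceSeconds ? run + 1 : 1;
            if (run >= minRepeats) {
                return true;
            }
        }
        return false;
    }
}
```

For the behaviour described here, a call such as `HangupLoopDetector.isLooping(lines, 90, 5, 3)` would flag a device once three hang-ups in a row arrive about 90 seconds apart; the tolerance and repeat count are arbitrary choices for illustration.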

We investigated the CloudWatch logs for IoT Core, and there is a single error before this connection error loop starts:
DUPLICATE_CLIENTID
According to the documentation: "The client is using a client ID that is already in use. In this case, the client that is already connected will be disconnected with this disconnect reason."
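
The takeover rule in that quote can be illustrated with a toy model. This is only a sketch of the documented behaviour, not IoT Core code and not a diagnosis of our issue; `ClientIdTakeoverModel` and the connection names passed to it are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of client-ID takeover; an illustration only, not broker code. */
final class ClientIdTakeoverModel {
    // client ID -> name of the connection that currently owns it
    private final Map<String, String> owners = new HashMap<>();
    private final List<String> disconnectLog = new ArrayList<>();

    /** A CONNECT with a client ID that is already in use evicts the existing connection. */
    void connect(String clientId, String connectionName) {
        String previous = owners.put(clientId, connectionName);
        if (previous != null) {
            disconnectLog.add(previous + " disconnected: DUPLICATE_CLIENTID");
        }
    }

    /** Name of the connection that currently holds the client ID, or null if none. */
    String ownerOf(String clientId) {
        return owners.get(clientId);
    }

    /** Disconnect events recorded so far, oldest first. */
    List<String> disconnects() {
        return disconnectLog;
    }
}
```

In this model, if two connections named "A" and "B" both use the same client ID and each one reconnects after being evicted, every `connect` call disconnects the other, which produces an endless disconnect/reconnect cycle. Whether anything like that happens inside our single process is an open question.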

However, we are not using a duplicate client ID anywhere: according to the logs, all communication for that client ID comes from a single IP, and since the process runs constantly, it uses a single connection.
After that error, the client ends up in the 90-second reconnection loop.

This does not happen every time; some devices run for 5+ days without the issue, but it reproduces quite often.

Doze mode and battery optimizations seem to have no effect: some users exclude our app from battery optimization and the issue still occurs.

Expected Behavior

The connection should be able to run continuously (with reconnects), and no 90-second disconnect/reconnect loops should be observed.

Current Behavior

The connection gets into a half-broken state and loops through disconnects and reconnects indefinitely.

Reproduction Steps

We cannot provide reproduction steps at this point because the issue reproduces only on some devices.

Possible Solution

No response

Additional Information/Context

No response

SDK version used

software.amazon.awssdk.iotdevicesdk:aws-iot-device-sdk-android:1.11.3, software.amazon.awssdk.crt:aws-crt-android:0.16.6

Environment details (OS name and version, etc.)

Android (multiple versions experience the issue)

Labels: feature-request (A feature should be added or improved.), p2 (This is a standard priority issue)