DNS Resolver: dns_resolve_close() fails to clean up server slots when context is INACTIVE #99089

@zafersn

Describe the bug

dns_resolve_close() fails to clean up DNS server slots when the context state is DNS_RESOLVE_CONTEXT_INACTIVE. This leaves stale server configuration data in memory, causing subsequent dns_resolve_init() calls to incorrectly skip socket creation, permanently breaking DNS resolution after network reconnection events.

Expected behavior:
DNS resolution should work normally at step 8 of the steps to reproduce.

Actual behavior:
DNS resolution permanently hangs/times out. All getaddrinfo() calls fail silently.

Impact

This bug affects any application that needs to reconfigure DNS after network state changes:

  • Cellular modem power cycling scenarios
  • VPN connection/disconnection cycles
  • Any system with dynamic network topology requiring DNS reconfiguration
  • Applications calling dns_resolve_close()/dns_resolve_init() multiple times

Root Cause Analysis

In subsys/net/lib/dns/resolve.c, the function dns_resolve_close_locked() contains this code:

static int dns_resolve_close_locked(struct dns_resolve_context *ctx)
{
    int i, ret;

    if (ctx->state != DNS_RESOLVE_CONTEXT_ACTIVE) {
        return -ENOENT;  //  Bug: returns immediately without cleanup!
    }
    
    // ... cleanup code that never executes when state is INACTIVE ...
}

Problem flow:

  1. System boots → net_init() auto-creates DNS context in INACTIVE state
  2. Application calls dns_resolve_init() → state becomes ACTIVE, server slot populated
  3. DNS works
  4. Network disconnects
  5. Application calls dns_resolve_close() → context transitions to INACTIVE
  6. Server slot still has: sa_family=1, IP address="141.1.1.1" (NOT cleared because close did nothing!)
  7. Network reconnects
  8. Application calls dns_resolve_init() with same DNS server
  9. is_server_name_found() checks server[0] with sa_family=1 → finds "server already exists"
  10. Code skips socket creation via continue statement
  11. dns_dispatcher_register() never called → socket service restart never triggered
  12. Socket service keeps polling old (closed) socket FD
  13. DNS responses arrive but are never delivered to application
  14. DNS permanently broken
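
To make the flow above easier to follow, here is a minimal standalone sketch of it in plain C. It is a model, not the Zephyr code: the `bug_*` types and functions are hypothetical stand-ins, and `bug_link_down()` is an assumption covering whatever leaves the context INACTIVE with a populated slot (the logs show state=3 already set when close is called).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-ins for the resolver types; not the Zephyr definitions. */
enum bug_state { BUG_ACTIVE, BUG_INACTIVE };

struct bug_slot {
    int sa_family; /* 0 means the slot is free */
    char addr[16]; /* dotted-quad server address */
    int sock;      /* -1 means no socket is open */
};

struct bug_ctx {
    enum bug_state state;
    struct bug_slot server;
    int sockets_created; /* how many times "socket creation" ran */
};

/* Mirrors the early return quoted above: no cleanup unless ACTIVE. */
static int bug_close(struct bug_ctx *ctx)
{
    if (ctx->state != BUG_ACTIVE) {
        return -ENOENT;
    }
    ctx->server.sa_family = 0;
    ctx->server.sock = -1;
    ctx->state = BUG_INACTIVE;
    return 0;
}

/* Mirrors the init path: a server that is "already present" skips socket setup. */
static int bug_init(struct bug_ctx *ctx, const char *addr)
{
    assert(ctx != NULL && addr != NULL);

    int found = ctx->server.sa_family != 0 &&
                strcmp(ctx->server.addr, addr) == 0;

    if (!found) {
        ctx->server.sa_family = 1;
        snprintf(ctx->server.addr, sizeof(ctx->server.addr), "%s", addr);
        ctx->server.sock = 3 + ctx->sockets_created; /* pretend FD */
        ctx->sockets_created++;
    }
    ctx->state = BUG_ACTIVE;
    return 0;
}

/* Assumption: something marks the context INACTIVE and drops the socket
 * while leaving the server entry in place.
 */
static void bug_link_down(struct bug_ctx *ctx)
{
    ctx->server.sock = -1;
    ctx->state = BUG_INACTIVE;
}

/* Runs init -> link down -> close -> init and returns the final context. */
static struct bug_ctx bug_reconnect_cycle(void)
{
    struct bug_ctx ctx = { .state = BUG_INACTIVE, .server = { .sock = -1 } };

    bug_init(&ctx, "141.1.1.1"); /* slot free, socket created */
    bug_link_down(&ctx);         /* INACTIVE, slot still populated */
    (void)bug_close(&ctx);       /* returns -ENOENT, slot untouched */
    bug_init(&ctx, "141.1.1.1"); /* server "found", no new socket */
    return ctx;
}
```

In this model the second `bug_init()` leaves `sockets_created` at 1 and `sock` at -1, which matches the `sa_family=1 sock=-1` state in the logs.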

What have you tried to diagnose or workaround this issue?

  1. Added extensive logging to DNS resolver to trace server slot states through close/init cycles
  2. Confirmed that dns_resolve_close_locked() returns immediately when state is INACTIVE
  3. Verified that server slots retain sa_family and IP address data after close
  4. Traced socket service behavior and confirmed it never receives restart notification on second init
  5. Tested workaround: Modified dns_resolve_close_locked() to always clean up server slots regardless of state → DNS resolution works correctly
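
For anyone repeating the diagnosis, a helper along these lines may be useful for flagging suspicious slots while logging. This is only a sketch: `struct slot_view` and both functions are hypothetical and not part of the resolver, and the field names simply follow the debug output in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical view of one server slot; field names follow the log output. */
struct slot_view {
    int sa_family; /* 0 = free slot */
    int sock;      /* -1 = no open socket */
};

/* A slot is suspicious when it still names a server but owns no socket. */
static bool slot_is_stale(const struct slot_view *s)
{
    assert(s != NULL);
    return s->sa_family != 0 && s->sock < 0;
}

/* Formats one slot in the style of the debug lines and returns buf. */
static const char *slot_describe(int idx, const struct slot_view *s,
                                 char *buf, size_t len)
{
    snprintf(buf, len, "server[%d]: sa_family=%d sock=%d%s", idx,
             s->sa_family, s->sock, slot_is_stale(s) ? " (stale)" : "");
    return buf;
}
```

Calling `slot_describe()` on every slot before and after a close/init cycle should make a leftover entry stand out in the trace.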

Proposed Fix

Modify dns_resolve_close_locked() to clean up server slots even when context is not ACTIVE:

static int dns_resolve_close_locked(struct dns_resolve_context *ctx)
{
    int i, ret;

    if (ctx->state != DNS_RESOLVE_CONTEXT_ACTIVE) {
        /* Even if context is not ACTIVE, we should still close any open sockets
         * to ensure proper cleanup. This handles cases where context is INACTIVE
         * but server slots still have valid data from previous initialization.
         */
        for (i = 0; i < SERVER_COUNT; i++) {
            if (ctx->servers[i].sock >= 0 || 
                ctx->servers[i].dns_server.sa_family != 0) {
                ret = dns_server_close(ctx, i);
                if (ret < 0 && ret != -ENOENT) {
                    NET_DBG("Cannot close DNS server %d (%d)", i, ret);
                }
            }
        }
        
        ctx->state = DNS_RESOLVE_CONTEXT_INACTIVE;
        return 0;  // Success, not error
    }

    // ... existing ACTIVE state cleanup code unchanged ...
}
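
The intent of the fix can be checked outside Zephyr with a small standalone model. This is a sketch under assumptions: the `model_*` names are hypothetical, `MODEL_SERVER_COUNT` stands in for `SERVER_COUNT`, and `model_server_close()` assumes that `dns_server_close()` resets both the socket and the server entry. For brevity it folds the ACTIVE and non-ACTIVE branches into one loop, which the patch above does not do.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SERVER_COUNT 2 /* illustrative; the real code uses SERVER_COUNT */

enum model_state { MODEL_ACTIVE, MODEL_INACTIVE };

struct model_slot {
    int sa_family; /* 0 = free slot */
    int sock;      /* -1 = no open socket */
};

struct model_ctx {
    enum model_state state;
    struct model_slot servers[MODEL_SERVER_COUNT];
};

/* Stand-in for dns_server_close(): forget the socket and the server entry. */
static void model_server_close(struct model_ctx *ctx, int i)
{
    ctx->servers[i].sock = -1;
    ctx->servers[i].sa_family = 0;
}

/* Close that resets every populated slot, whatever the context state is. */
static int model_close_fixed(struct model_ctx *ctx)
{
    assert(ctx != NULL);

    for (int i = 0; i < MODEL_SERVER_COUNT; i++) {
        if (ctx->servers[i].sock >= 0 || ctx->servers[i].sa_family != 0) {
            model_server_close(ctx, i);
        }
    }
    ctx->state = MODEL_INACTIVE;
    return 0;
}

/* Builds a context with one populated slot, closes it, counts leftovers. */
static int stale_slots_after_close(enum model_state initial, int sock)
{
    struct model_ctx ctx = {
        .state = initial,
        .servers = { { .sa_family = 1, .sock = sock },
                     { .sa_family = 0, .sock = -1 } },
    };
    int stale = 0;

    (void)model_close_fixed(&ctx);
    for (int i = 0; i < MODEL_SERVER_COUNT; i++) {
        stale += (ctx.servers[i].sa_family != 0);
    }
    return stale;
}
```

With this close, `stale_slots_after_close()` reports no populated slots for either starting state, so a later lookup of the same server would no longer match a leftover entry.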

Regression

  • This is a regression.

Steps to reproduce

Steps to reproduce the behavior:

  1. Boot system with cellular modem or any network interface
  2. Call dns_resolve_init() with a DNS server (e.g., "141.1.1.1")
  3. Verify DNS resolution works correctly
  4. Simulate network disconnect/reconnect (e.g., modem power cycle)
  5. Call dns_resolve_close() to clean up DNS context
  6. Network reconnects with same or different IP address
  7. Call dns_resolve_init() again with the same DNS server
  8. Attempt DNS resolution

Relevant log output

## Logs and console output

**Working scenario (First DNS init):**

[00:01:06.105] <dbg> net_dns_resolve: is_server_name_found: Checking server[0]: sa_family=0 sock=-1
[00:01:06.105] <dbg> net_dns_resolve: is_server_name_found: Server 141.1.1.1 NOT found
[00:01:06.105] <inf> net_dns_resolve: get_free_slot: Found free slot at index 0
[socket creation proceeds normally]
[00:01:07.181] <dbg> modem_hl7812: offload_bind: offload_bind
[00:01:07.181] <dbg> modem_hl7812: offload_recv: offload_recv
[00:01:07.181] <dbg> net_sock_svc: Socket service: triggering restart via eventfd for service 0x8094e34 with 1 fds
[00:01:07.181] <inf> modem_hl7812: DNS resolver initialized successfully


**Broken scenario (Second DNS init after reconnect):**

[00:02:23.560] <inf> modem_hl7812: Closing DNS context (state=3) to unregister from socket service
[00:02:23.560] <dbg> net_dns_resolve: dns_resolve_close_locked: state=3
[00:02:23.560] <inf> modem_hl7812: DNS context state after close: 3 (expecting 3=INACTIVE)
[00:02:23.560] <inf> modem_hl7812: Initializing DNS resolver with server 141.1.1.1
[00:02:23.560] <dbg> net_dns_resolve: is_server_name_found: Checking server[0]: sa_family=1 sock=-1  ← BUG!
[00:02:23.560] <dbg> net_dns_resolve: Server[0] addr=141.1.1.1, comparing with 141.1.1.1
[00:02:23.560] <inf> net_dns_resolve: Server 141.1.1.1 FOUND at index 0 (sock=-1, sa_family=1)
[00:02:23.560] <dbg> net_dns_resolve: Server 141.1.1.1 already exists
[socket creation SKIPPED - offload_bind/recv never called]
[socket service restart NEVER triggered]
[00:02:24.633] <inf> modem_hl7812: DNS resolver initialized successfully, new socket registered with service


**Note:** `sa_family=1` (AF_INET) in the broken scenario indicates the server slot was never cleared, even though the socket was closed (`sock=-1`).

Impact

Annoyance – Minor irritation; no significant impact on usability or functionality.

Environment

Target Platform:

  • Board: Custom board with STM32U5 + Sierra Wireless HL7812 cellular modem
  • Zephyr version: v4.3.0-rc2 (likely affects all versions)
  • OS: Windows 11
  • Toolchain: zephyr-sdk-0.17.2
  • Last commit from Zephyr:
Commit: 80be486c8779725605cca3a85472b0861e6f6b04
Parents: 9a455998a4ffae941f1023e32e1f29f7cb43449e
Author: BUDKE Gerson Fernando
Author Date: Tue Nov 04 2025 10:10:42 GMT+0000 (Greenwich Mean Time)

Additional Context

No response

Labels

area: Networking, bug
