Skip to content

Conversation

@rlubos
Copy link
Contributor

@rlubos rlubos commented Jul 2, 2025

Fix issues reported by UBSAN. For details, please refer to individual commit messages.

This PR combined with #92317, #92304 and #92341 gave me an almost clear run over tests/net with UBSAN enabled (using gcc). Just two tests failed due to an issue apparently unrealted to the networking code:

lib/os/cbprintf_complete.c:1678:36: runtime error: negation of -9223372036854775808 cannot be represented in type 'long long int'; cast to an unsigned type to negae this value to itself

rlubos added 8 commits July 2, 2025 14:42
Some bytes were compared twice and one comparison used wrong index.

Signed-off-by: Robert Lubos <[email protected]>
To address the misaligned access issues reported by UBSAN, introduce raw
variant of certain IPv6 functions used in the critical data path of the
network stack, operating on plain uint8_t buffers in stead of IPv6
address struct.

Signed-off-by: Robert Lubos <[email protected]>
Refactor local functions to work with byte buffers instead of struct
in6_addr and use switch to use raw variants of functions operating on
IPv6 addresses.

Signed-off-by: Robert Lubos <[email protected]>
IPv6 Neighbor Discovery interfaces modules like neighbor or routing
tables - converting them to raw variants seems futile. Therefore, for
IPv6 ND case, copy the raw IP address from the packet into the in6_addr
structure, and then pass it to respective functions. Performance
overhead should not be a big problem in such case as those actions are
only performed if a respective ND packet is received.

Signed-off-by: Robert Lubos <[email protected]>
Rework the rest of the IPv6-related code to avoid casting. Use raw
variants of IPv6-related functions whenever possible (especially on the
critical data path). For the routing case, use a copy of the address to
avoid massive rework of the routing module.

Signed-off-by: Robert Lubos <[email protected]>
To address the misaligned access issues reported by UBSAN, introduce raw
variant of certain IPv4 functions used in the critical data path of the
network stack, operating on plain uint8_t buffers in stead of IPv4
address struct.

Signed-off-by: Robert Lubos <[email protected]>
Rework the IPv4-related code to avoid casting. Use raw variants of
IPv4-related functions whenever possible (especially on the critical
data path).

Signed-off-by: Robert Lubos <[email protected]>
To avoid misalignment errors when casting between
sockaddr_storage/sockaddr and specialied sockaddr_* variants, specify
alignment for the former to match the alignment of the others.

The issue was reported by UBSAN:

  utils.c:802:8: runtime error: member access within misaligned
  address 0xf4aff186 for type 'struct sockaddr_in6', which requires
  4 byte alignment

Signed-off-by: Robert Lubos <[email protected]>
@rlubos rlubos requested review from carlescufi and jukkar July 2, 2025 12:49
@rlubos
Copy link
Contributor Author

rlubos commented Jul 2, 2025

CC @tpambor

@tpambor
Copy link
Contributor

tpambor commented Jul 2, 2025

With #92304, #92317, #92341, #92535 and this PR, I get the following remaining UBSAN errors using clang 19:

/home/user/west_workspace/zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c:1046:18: runtime error: load of value 225, which is not a valid value for type 'bool'
/home/user/west_workspace/zephyr/lib/os/cbprintf_complete.c:1678:36: runtime error: negation of -9223372036854775808 cannot be represented in type 'sint_value_type' (aka 'long long'); cast to an unsigned type to negate this value to itself
/home/user/west_workspace/modules/lib/zcbor/src/zcbor_common.c:308:13: runtime error: call to function decode_lwm2m_senml through pointer to incorrect function type '_Bool (*)(zcbor_state_t *, void *)'
/home/user/west_workspace/zephyr/lib/os/fdtable.c:402:10: runtime error: call to function spair_close through pointer to incorrect function type 'int (*)(void *, int)'
Backtraces

tests/net/lib/lwm2m_rd_client/net.lwm2m.lwm2m_rd_client_dtls:

/home/user/west_workspace/zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c:1046:18: runtime error: load of value 225, which is not a valid value for type 'bool'
    #0 0x0809b1a0 in sm_do_registration /home/user/west_workspace/zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c:1046:18
    #1 0x0809b1a0 in lwm2m_rd_client_service /home/user/west_workspace/zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c:1472:4

SUMMARY: UndefinedBehaviorSanitizer: invalid-bool-load /home/user/west_workspace/zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c:1046:18 

But this is a major step toward passing the network subsystem tests with UBSAN enabled.

@rlubos rlubos force-pushed the net/ip-addr-casting-cleanup branch from 364266f to e2a6af1 Compare July 2, 2025 13:31
@github-actions github-actions bot requested review from de-nordic and nordicjm July 2, 2025 13:32
nordicjm
nordicjm previously approved these changes Jul 2, 2025
rlubos added 9 commits July 2, 2025 16:34
As struct sockaddr have now alignment of 4 bytes, net_ipaddr_copy()
gives the following error if used for sockaddr:

  error: alignment 1 of ‘struct <anonymous>’ is less than 4
  [-Werror=packed-not-aligned]

Just use memcpy() instead, net_ipaddr_copy() was intended to use with IP
addresses, not socket related structs.

Signed-off-by: Robert Lubos <[email protected]>
cmsgbuf pointer can be NULL, therefore verify that before calling
memset() on it.

Signed-off-by: Robert Lubos <[email protected]>
This was caught by UBSAN:

  zvfs_select.c:70:2: runtime error: left shift of 1 by 31 places
  cannot be represented in type 'int'

Signed-off-by: Robert Lubos <[email protected]>
In order to prevent an overflow warning from UBSAN when bitshifting,
cast to uint64_t first before shifting, and then back to uint32_t.

Signed-off-by: Robert Lubos <[email protected]>
In order to avoid alignment issues when casting void pointers to
in(6)_addr structures, create a properly aligned copy of the ip(v6)
address on stack.

Signed-off-by: Robert Lubos <[email protected]>
Cast uint8_t variable to uint32_t explicitly to avoid implicit cast to
int, and thus potentially undefined behavior, reported by UBSAN:

  net_pkt.c:1946:17: runtime error: left shift of 239 by 24 places
  cannot be represented in type 'int'

Signed-off-by: Robert Lubos <[email protected]>
The iface test suite uses two different interface types with two
different context types and iid tests mixed up the two. It attempted
to create a mac address for Ethernet interface (which uses
struct eth_fake_context), however the net_iface_get_mac() function
only works with Dummy interface context (struct net_if_test). In
result the Ethernet interface context was corrupted (mac address
overwritten the promisc_mode flag).

Fix this by extracting mac generation code into a common function, and
use it in the Ethernet iface initialization phase instead of directly in
the test.

This was reported by UBSAN:
  tests/net/iface/src/main.c:239:34: runtime error: load of value 11,
  which is not a valid value for type '_Bool'

Signed-off-by: Robert Lubos <[email protected]>
In case address mode in a packet is none, the address pointer within mhr
struct will not be set. Therefore, the pointer should not be used before
address mode is verified inside ieee802154_check_dst_addr().

This was reported by UBSAN:

  subsys/net/l2/ieee802154ieee802154.c:296:41: runtime error: member
  access within null pointer of type 'struct ieee802154_address_field'

Signed-off-by: Robert Lubos <[email protected]>
socketpair fd regsiters ZVFS_MODE_IFSOCK therefore it should register a
close2() function instead of close().

Signed-off-by: Robert Lubos <[email protected]>
@rlubos rlubos force-pushed the net/ip-addr-casting-cleanup branch from e2a6af1 to a31a5a1 Compare July 2, 2025 14:35
@rlubos
Copy link
Contributor Author

rlubos commented Jul 2, 2025

/home/user/west_workspace/zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c:1046:18: runtime error: load of value 225, which is not a valid value for type 'bool'

This doesn't reproduce easily, need to investigate this further.

/home/user/west_workspace/zephyr/lib/os/cbprintf_complete.c:1678:36: runtime error: negation of -9223372036854775808 cannot be represented in type 'sint_value_type' (aka 'long long'); cast to an unsigned type to negate this value to itself

Not sure how to tackle this, not really my turf... This seems to work:

index 18552ceb977..c8bb4a8f516 100644
--- a/lib/os/cbprintf_complete.c
+++ b/lib/os/cbprintf_complete.c
@@ -1675,7 +1675,7 @@ int z_cbvprintf_impl(cbprintf_cb __out, void *ctx, const char *fp,
                        sint = value->sint;
                        if (sint < 0) {
                                sign = '-';
-                               value->uint = (uint_value_type)-sint;
+                               value->uint = (uint_value_type)~sint + 1;
                        } else {
                                value->uint = (uint_value_type)sint;
                        }

/home/user/west_workspace/modules/lib/zcbor/src/zcbor_common.c:308:13: runtime error: call to function decode_lwm2m_senml through pointer to incorrect function type '_Bool (*)(zcbor_state_t *, void *)'

This will be tricky, as the failure is caused by autogenerated code. And from what I've checked, the problem is not isolated to LwM2M, similar issue could be observed in mcumgr for instance (probably any zcbor user would be affected I think).

/home/user/west_workspace/zephyr/lib/os/fdtable.c:402:10: runtime error: call to function spair_close through pointer to incorrect function type 'int (*)(void *, int)'

This should be fixed with the latest commit.

@tpambor
Copy link
Contributor

tpambor commented Jul 2, 2025

/home/user/west_workspace/zephyr/lib/os/fdtable.c:402:10: runtime error: call to function spair_close through pointer to incorrect function type 'int (*)(void *, int)'

This should be fixed with the latest commit.

I can confirm that this has been fixed and no new errors have occurred.

@rlubos
Copy link
Contributor Author

rlubos commented Jul 2, 2025

/home/user/west_workspace/zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c:1046:18: runtime error: load of value 225, which is not a valid value for type 'bool'

This doesn't reproduce easily, need to investigate this further.

I think I've identified the problem with the help of address sanitizer, pushed the fix to the LwM2M-specific PR #92317 (a11a335). @tpambor I'd appreciate if you could confirm if it fixes the error on your side.

@tpambor
Copy link
Contributor

tpambor commented Jul 2, 2025

/home/user/west_workspace/zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c:1046:18: runtime error: load of value 225, which is not a valid value for type 'bool'

This doesn't reproduce easily, need to investigate this further.

I think I've identified the problem with the help of address sanitizer, pushed the fix to the LwM2M-specific PR #92317 (a11a335). @tpambor I'd appreciate if you could confirm if it fixes the error on your side.

I can confirm that this is now fixed and no new errors have occurred.

jukkar
jukkar previously approved these changes Jul 3, 2025
Socket dispatcher (and offloaded implementations in tests) register
fd as ZVFS_MODE_IFSOCK therefore they should register a  close2(
function instead of close().

Signed-off-by: Robert Lubos <[email protected]>
@rlubos
Copy link
Contributor Author

rlubos commented Jul 3, 2025

Pushed one more commit with a fix for the socket dipatcher (same issue as with socketpair).

@sonarqubecloud
Copy link

sonarqubecloud bot commented Jul 3, 2025

Copy link
Contributor

@dkalowsk dkalowsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has taken a lot longer than I expected to walk through. Thanks @rlubos for the fixes on UBSAN! Do you feel this should be part of the 4.2 release?

@rlubos
Copy link
Contributor Author

rlubos commented Jul 3, 2025

Do you feel this should be part of the 4.2 release?

Majority of those changes are rather neutral and low risk in my opinion, so if it could get it it would be nice. But on the other hand, AFAICT more significant bugs were only identified in the test code, so if it's too late it shouldn't be a big problem. @jukkar @carlescufi What do you think?

@tpambor
Copy link
Contributor

tpambor commented Jul 3, 2025

Do you feel this should be part of the 4.2 release?

Zephyr 4.2 will be the first release to have CONFIG_UBSAN. I imagine users will be eager to try it out and will likely expect it to work across the board.

Currently, enabling CONFIG_UBSAN is problematic for any network-related functionality due to unaligned access issues with in_addr etc. (see #90882), which could be a significant showstopper for adoption. Therefore, getting this fix in for 4.2 would be fantastic.

If it's too late to include this in the 4.2 release, I would suggest we update the Kconfig help text for CONFIG_UBSAN to mark it as experimental. This would help manage user expectations and prevent a potential influx of issue reports from users who expect it to work out-of-the-box.

@jukkar jukkar added this to the v4.2.0 milestone Jul 4, 2025
@jukkar
Copy link
Member

jukkar commented Jul 4, 2025

I suggest getting it to 4.2

@danieldegrasse danieldegrasse merged commit aa4a878 into zephyrproject-rtos:main Jul 4, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants