Skip to content

Investigate the rcl struct drop order #369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
maspe36 opened this issue Mar 12, 2024 · 4 comments
Closed

Investigate the rcl struct drop order #369

maspe36 opened this issue Mar 12, 2024 · 4 comments

Comments

@maspe36
Copy link
Collaborator

maspe36 commented Mar 12, 2024

As discussed in the March 11th monthly meetup, we have some occasional unit test failures. These failures have seemingly been narrowed down to rclrs_tests and appear to be related to DDS usage. See #364 for more details.

With that in mind, it is imperative that we take multiple approaches to thoroughly investigate and analyze the root causes of these failures.

For this issue, we agreed that it would be beneficial to look at the drop order for our structs. Since we interface with C, it is possible that we are not properly dropping owned resources (i.e. not in the right order). This is a rather subtle potential problem and will require lots of attention to detail. See Destructors from the rust language reference for more details.

@maspe36
Copy link
Collaborator Author

maspe36 commented Mar 16, 2024

While trying to produce an occasionally observed segfault in our unit tests, I managed to get graph_tests.rs to actually panic. This seems like I was able to drop node while calling NodeBuilder::build?

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'map is not initialized, at /home/sam/ros2_iron/src/ros2/rcutils/src/hash_map.c:535'

with this new error message:

  'Unable to fini type cache for node., at /home/sam/ros2_iron/src/ros2/rcl/rcl/src/rcl/node.c:399'

rcutils_reset_error() should be called after error handling to avoid this.
<<<

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'Unable to fini type cache for node., at /home/sam/ros2_iron/src/ros2/rcl/rcl/src/rcl/node.c:399'

with this new error message:

  'node argument is null, at /home/sam/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_node.cpp:80'

rcutils_reset_error() should be called after error handling to avoid this.
<<<

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'node argument is null, at /home/sam/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_node.cpp:80, at /home/sam/ros2_iron/src/ros2/rcl/rcl/src/rcl/node.c:404'

with this new error message:

  'guard_condition argument is null, at /home/sam/ros2_iron/src/ros2/rcl/rcl/src/rcl/guard_condition.c:126'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
thread 'graph_tests::test_publishers' panicked at /home/sam/ros2_iron/install/rclrs/share/rclrs/rust/src/node.rs:24:43:
called `Result::unwrap()` on an `Err` value: RclError { code: Error, msg: Some(RclErrorMsg("guard_condition argument is null, at /home/sam/ros2_iron/src/ros2/rcl/rcl/src/rcl/guard_condition.c:126")) }
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1649:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1073:23
   4: rclrs::node::<impl core::ops::drop::Drop for rclrs::rcl_bindings::rcl_node_s>::drop
             at /home/sam/ros2_iron/install/rclrs/share/rclrs/rust/src/node.rs:24:18
   5: core::ptr::drop_in_place<rclrs::rcl_bindings::rcl_node_s>
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ptr/mod.rs:507:1
   6: rclrs::node::builder::NodeBuilder::build
             at /home/sam/ros2_iron/install/rclrs/share/rclrs/rust/src/node/builder.rs:303:5
   7: rclrs_tests::graph_tests::construct_test_graph
             at ./src/graph_tests.rs:16:16
   8: rclrs_tests::graph_tests::test_publishers
             at ./src/graph_tests.rs:28:17
   9: rclrs_tests::graph_tests::test_publishers::{{closure}}
             at ./src/graph_tests.rs:26:25
  10: core::ops::function::FnOnce::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
  11: core::ops::function::FnOnce::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread 'graph_tests::test_publishers' panicked at library/core/src/panicking.rs:163:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Signal: SIGABRT (signal SIGABRT)
lost connection

Process finished with exit code -1

@luca-della-vedova
Copy link
Collaborator

I believe this might be caused by the drop order in the Node, it is included in #342 specifically here but that PR has not really been reviewed / tested, if there is no effort in that direction we can just open a small PR that addresses that specific issue.

@luca-della-vedova
Copy link
Collaborator

luca-della-vedova commented Mar 18, 2024

I am iterating on #342 and got the following stack trace from gdb:

Thread 12 "node::graph::te" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff65fd640 (LWP 927519)]
0x00007ffff4e6b238 in eprosima::fastrtps::rtps::TimedEvent::cancel_timer() () from /opt/ros/iron/lib/libfastrtps.so.2.10
(gdb) 
(gdb) where
#0  0x00007ffff4e6b238 in eprosima::fastrtps::rtps::TimedEvent::cancel_timer() () from /opt/ros/iron/lib/libfastrtps.so.2.10
#1  0x00007ffff4e891fc in eprosima::fastrtps::rtps::StatefulWriter::matched_reader_remove(eprosima::fastrtps::rtps::GUID_t const&) () from /opt/ros/iron/lib/libfastrtps.so.2.10
#2  0x00007ffff4ea91ef in eprosima::fastrtps::rtps::WriterProxy::~WriterProxy() () from /opt/ros/iron/lib/libfastrtps.so.2.10
#3  0x00007ffff4eaa0c5 in eprosima::fastrtps::rtps::StatefulReader::~StatefulReader() () from /opt/ros/iron/lib/libfastrtps.so.2.10
#4  0x00007ffff4eaa1ed in eprosima::fastrtps::rtps::StatefulReader::~StatefulReader() () from /opt/ros/iron/lib/libfastrtps.so.2.10
#5  0x00007ffff4ed1df4 in eprosima::fastrtps::rtps::RTPSParticipantImpl::deleteUserEndpoint(eprosima::fastrtps::rtps::GUID_t const&) () from /opt/ros/iron/lib/libfastrtps.so.2.10
#6  0x00007ffff4ee0914 in eprosima::fastrtps::rtps::RTPSDomainImpl::removeRTPSReader(eprosima::fastrtps::rtps::RTPSReader*) () from /opt/ros/iron/lib/libfastrtps.so.2.10
#7  0x00007ffff4f228be in eprosima::fastdds::dds::DataReaderImpl::stop() () from /opt/ros/iron/lib/libfastrtps.so.2.10
#8  0x00007ffff4f229e6 in eprosima::fastdds::dds::DataReaderImpl::~DataReaderImpl() () from /opt/ros/iron/lib/libfastrtps.so.2.10
#9  0x00007ffff5108fb7 in ?? () from /opt/ros/iron/lib/libfastrtps.so.2.10
#10 0x00007ffff4f232a6 in eprosima::fastdds::dds::SubscriberImpl::delete_datareader(eprosima::fastdds::dds::DataReader const*) () from /opt/ros/iron/lib/libfastrtps.so.2.10
#11 0x00007ffff78b5c12 in rmw_fastrtps_shared_cpp::destroy_subscription(char const*, CustomParticipantInfo*, rmw_subscription_s*, bool) () from /opt/ros/iron/lib/librmw_fastrtps_shared_cpp.so
#12 0x00007ffff789acfb in rmw_fastrtps_shared_cpp::decrement_context_impl_ref_count(rmw_context_s*) () from /opt/ros/iron/lib/librmw_fastrtps_shared_cpp.so
#13 0x00007ffff790d4a7 in rmw_destroy_node () from /opt/ros/iron/lib/librmw_fastrtps_cpp.so
#14 0x00007ffff7f2ebf5 in rcl_node_fini () from /opt/ros/iron/lib/librcl.so
#15 0x000055555559f16f in rclrs::node::{impl#0}::drop (self=0x7fffec0fb4b8) at src/node.rs:24
#16 0x0000555555592e6a in core::ptr::drop_in_place<rclrs::rcl_bindings::rcl_node_s> () at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ptr/mod.rs:507
#17 0x0000555555594f8a in core::ptr::drop_in_place<core::cell::UnsafeCell<rclrs::rcl_bindings::rcl_node_s>> () at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ptr/mod.rs:507
#18 0x00005555555950ce in core::ptr::drop_in_place<std::sync::mutex::Mutex<rclrs::rcl_bindings::rcl_node_s>> () at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ptr/mod.rs:507
#19 0x000055555560f32e in alloc::sync::Arc<std::sync::mutex::Mutex<rclrs::rcl_bindings::rcl_node_s>, alloc::alloc::Global>::drop_slow<std::sync::mutex::Mutex<rclrs::rcl_bindings::rcl_node_s>, alloc::alloc::Global> (self=0x7fffec0fc2c8) at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/sync.rs:1752
#20 0x0000555555596ef1 in alloc::sync::{impl#33}::drop<std::sync::mutex::Mutex<rclrs::rcl_bindings::rcl_node_s>, alloc::alloc::Global> (self=0x7fffec0fc2c8)
    at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/sync.rs:2407
#21 0x000055555558f6da in core::ptr::drop_in_place<alloc::sync::Arc<std::sync::mutex::Mutex<rclrs::rcl_bindings::rcl_node_s>, alloc::alloc::Global>> ()
    at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ptr/mod.rs:507
#22 0x0000555555592590 in core::ptr::drop_in_place<rclrs::node::Node> () at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ptr/mod.rs:507
#23 0x000055555560ed3e in alloc::sync::Arc<rclrs::node::Node, alloc::alloc::Global>::drop_slow<rclrs::node::Node, alloc::alloc::Global> (self=0x7ffff65faef0)
    at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/sync.rs:1752
#24 0x0000555555597301 in alloc::sync::{impl#33}::drop<rclrs::node::Node, alloc::alloc::Global> (self=0x7ffff65faef0) at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/sync.rs:2407
#25 0x000055555559366a in core::ptr::drop_in_place<alloc::sync::Arc<rclrs::node::Node, alloc::alloc::Global>> () at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ptr/mod.rs:507
#26 0x00005555555e3c76 in rclrs::node::graph::tests::test_graph_empty () at src/node/graph.rs:522
#27 0x0000555555647bf7 in rclrs::node::graph::tests::test_graph_empty::{closure#0} () at src/node/graph.rs:465
#28 0x000055555558ed36 in core::ops::function::FnOnce::call_once<rclrs::node::graph::tests::test_graph_empty::{closure_env#0}, ()> ()
    at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250
#29 0x0000555555681e6f in core::ops::function::FnOnce::call_once<fn() -> core::result::Result<(), alloc::string::String>, ()> () at library/core/src/ops/function.rs:250
#30 test::__rust_begin_short_backtrace<core::result::Result<(), alloc::string::String>, fn() -> core::result::Result<(), alloc::string::String>> () at library/test/src/lib.rs:627
#31 0x0000555555680bf1 in test::run_test_in_process::{closure#0} () at library/test/src/lib.rs:650
#32 core::panic::unwind_safe::{impl#23}::call_once<core::result::Result<(), alloc::string::String>, test::run_test_in_process::{closure_env#0}> () at library/core/src/panic/unwind_safe.rs:272
#33 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>, core::result::Result<(), alloc::string::String>> () at library/std/src/panicking.rs:552
#34 std::panicking::try<core::result::Result<(), alloc::string::String>, core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>> () at library/std/src/panicking.rs:516
#35 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>, core::result::Result<(), alloc::string::String>> () at library/std/src/panic.rs:142
#36 test::run_test_in_process () at library/test/src/lib.rs:650
#37 test::run_test::{closure#0} () at library/test/src/lib.rs:573
#38 0x0000555555647f26 in test::run_test::{closure#1} () at library/test/src/lib.rs:601
#39 std::sys_common::backtrace::__rust_begin_short_backtrace<test::run_test::{closure_env#1}, ()> () at library/std/src/sys_common/backtrace.rs:155
#40 0x000055555564cf87 in std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure#0}<test::run_test::{closure_env#1}, ()> () at library/std/src/thread/mod.rs:529
#41 core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>> () at library/core/src/panic/unwind_safe.rs:272
#42 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>>, ()> ()
    at library/std/src/panicking.rs:552
#43 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>>> ()
    at library/std/src/panicking.rs:516
#44 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>>, ()> ()
    at library/std/src/panic.rs:142
#45 std::thread::{impl#0}::spawn_unchecked_::{closure#1}<test::run_test::{closure_env#1}, ()> () at library/std/src/thread/mod.rs:528
#46 core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<test::run_test::{closure_env#1}, ()>, ()> () at library/core/src/ops/function.rs:250
#47 0x00005555556c45e5 in alloc::boxed::{impl#47}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:2015
#48 alloc::boxed::{impl#47}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:2015
#49 std::sys::unix::thread::{impl#2}::new::thread_start () at library/std/src/sys/unix/thread.rs:108
#50 0x00007ffff7a94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#51 0x00007ffff7b26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

It does seem to point to the drop implementation, I'll investigate further.

@maspe36
Copy link
Collaborator Author

maspe36 commented Mar 20, 2024

Since your PR @luca-della-vedova I am not seeing this. I think you're right that these specific intermittent failures were indeed caused by the drop order!

Looking at the git history, we had the old drop order since roughly August 2022, which would also explain why you can see intermittent failures for develop builds in Github actions going back as far as we can see.

Thanks for you work here. I'm going to close this for now. We can revisit if we see failures in CI again.

@maspe36 maspe36 closed this as completed Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants