Skip to content

Nodes should exit cleanly #188

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nnmm opened this issue Jun 3, 2022 · 27 comments
Open

Nodes should exit cleanly #188

nnmm opened this issue Jun 3, 2022 · 27 comments
Assignees

Comments

@nnmm
Copy link
Contributor

nnmm commented Jun 3, 2022

Currently, sending Ctrl-C to a node makes it exit with code 130 (on a Linux machine). It should instead exit cleanly.

If we follow the rclcpp example, this would be achieved by a signal handler which calls rcl_shutdown().

@jhdcs jhdcs self-assigned this Jun 13, 2022
@jhdcs
Copy link
Collaborator

jhdcs commented Jun 13, 2022

rclcpp appears to achieve this by creating a global signal handler that gets initialized when rclcpp::init is called. We do not have a similar rclrs::init function that creates a global context.

As such, should I make it so that each context has its own signal handler? Or should I work on creating an rclrs::init function as well?

@esteve
Copy link
Collaborator

esteve commented Jun 14, 2022

Adding a signal handler for every context seems tricky and a source of problems (what if we have more than one context, will all of them handle the exit signal?). Perhaps a rclrs::init makes sense after all, not a huge fan of global variables, but maybe that's the most pragmatic solution. All solutions I've seen in the past for signal handling (in ROS2 or otherwise) involved some form of a global variable in the end.

@esteve
Copy link
Collaborator

esteve commented Jun 14, 2022

The ctrlc crate uses a form of global state, it has an static atomic that just makes sure that set_handler is only called once:

https://github.com/Detegr/rust-ctrlc/blob/master/src/lib.rs#L60=

And then init_os_handler also uses a static variable:

https://github.com/Detegr/rust-ctrlc/blob/master/src/platform/unix/mod.rs#L76=

https://github.com/Detegr/rust-ctrlc/blob/master/src/platform/unix/mod.rs#L14=

In our case, it'd be akin to having a global default context where we'd store the signal handler

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 14, 2022

I had actually been looking at the signal-hook crate: https://crates.io/crates/signal-hook

It would allow us to handle more signals (if needed) than just SIGINT. I'm not sure if that's overkill or not, though. But if I go with that, that would trap us into needing a global default context, right?

@nnmm
Copy link
Contributor Author

nnmm commented Jun 14, 2022

Edit: Removed a part.

There's no fundamental reason we couldn't install a signal handler the first time a context is initialized, by a Once for instance. So I'd try to treat this problem as unrelated from a global default context (which seems very un-Rusty indeed) as far as possible.

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 15, 2022

But then the question in my mind becomes: why didn't rclcpp do that? Why did they gravitate toward a solution that uses a single default context, rather than a signal handler for each potential context?

I'm concerned about the potential possibility of multiple signal handlers somehow potentially stepping on each others' toes. From everything I've read, these things are hard to implement correctly, and the risk of race conditions is high.

Also, when are multiple contexts used? Since rclcpp and rclpy seem to be limited to just one (that I can see), what do we gain from the ability to potentially have multiple?

@nnmm
Copy link
Contributor Author

nnmm commented Jun 15, 2022

I think I heard once that it was because of backwards compatibility with ROS 1, but not sure anymore. I very much agree it's important to understand why other packages do things a certain way.

I'm concerned about the potential possibility of multiple signal handlers somehow potentially stepping on each others' toes. From everything I've read, these things are hard to implement correctly, and the risk of race conditions is high.

I think we wouldn't want to have multiple signal handlers. Since the signal handling crates abstract away things for you, I think there is no more risk of race conditions and it's not hard to implement things correctly. E.g. from taking a quick glance at signal-hook, it seems pretty straightforward to adapt this example to do what we want: Call the signal_hook::flag::register() function inside a Once or so when creating a context, make the term variable a static, and read it inside Context::ok().

Good question about the use case of multiple contexts. But maybe this should be discussed in a new issue?

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 15, 2022

Good question about the use case of multiple contexts. But maybe this should be discussed in a new issue?

Perhaps, but it's tangentially related here - if we want to have a signal handler initialized for each context, wouldn't that make multiple signal handlers need to be created? The way the signal-hook crate shows things being constructed, for example, there would be a flag variable that the processes would have to query. We would either have to create a separate flag variable for each process (which would prevent us from using Once, I think), or we would have to have to have them all share the same flag... Which would be basically a global context, in my mind.

Or am I misunderstanding?

@nnmm
Copy link
Contributor Author

nnmm commented Jun 15, 2022

No, I'm suggesting to only have one global signal handler that gets installed when the first context gets created. We could also uninstall it when the last context gets destroyed. With that in mind, my previous comment hopefully makes more sense.

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 15, 2022

And where would that live? In the other client libraries, they created a global context to store it.

@BrettRD
Copy link

BrettRD commented Jun 16, 2022

rclcpp allows developers to create multiple ROS contexts, this is important when you partition the ROS network with multiple domain IDs, or run loggers from multiple nodes in the same process. The rclcpp global context is still a thing but you can use it indirectly through a local context.
I use multiple rclcpp::Context local-variables in ros-gst-bridge, and I've added the same to my port of fuse, I'd be surprised if it wasn't needed by people implementing interfaces for ros_control

why didn't rclcpp do that?

The rclcpp Node / Excutor / CallbackGroup model had a substantial refactor in Foxy, most of that was shaking off ROS1 baggage of one-process-one-node. I think if you can ditch globals in rclrs, rclcpp will try to adopt your model.

I'll be following this thread with interest because I have a race condition on sigint between rclcpp and GLib

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 16, 2022

All right. With that information, I feel more comfortable about us splitting off from how rclcpp does things.

@nnmm, I will try to see what I can do to get your suggestion working. Let's see if we can get this done without globals!

@nnmm
Copy link
Contributor Author

nnmm commented Jun 16, 2022

@BrettRD Thanks for chiming in. Can you detail what you mean by "you can use it indirectly through a local context" (I thought the global context is an alternative to a local context, where one doesn't use the other) and "ditch globals" (do you mean the global default context?)?

Also, of what nature is the signal handling race condition? Which signal handler is installed first?

@jhdcs My guess is that you'll need at least one static variable. But that's okay imo.

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 16, 2022

I was thinking that, but the static variable would probably be the flag... Which would mean it'd have to be mutable...

And accessing static mut is unsafe... I'm not sure if we can get around that or not, especially because I don't think you're allowed to have a mutex inside a signal handler (so I can't make the flag a static mut term_sig : Arc<Mutex<AtomicBool>>)

While the scope of the problem would be limited to when the program is shutting down, I'd rather the problem not exist at all... Trying to think of a way around it... Perhaps use Once to initialize the signal handler, then modify the Context struct to have a flag field for a terminal signal, and register it with the signal handler each time a Context is created?

@nnmm
Copy link
Contributor Author

nnmm commented Jun 16, 2022

@jhdcs Atomics do not need to be mutable, they can be modified through a shared reference.

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 16, 2022

Oh. Right. Silly me!

@BrettRD
Copy link

BrettRD commented Jun 20, 2022

@BrettRD Thanks for chiming in. Can you detail what you mean by "you can use it indirectly through a local context" (I thought the global context is an alternative to a local context, where one doesn't use the other) and "ditch globals" (do you mean the global default context?)?

The RCLCPP API has a couple of globals lurking around, and they make it unclear how much your node is exposed to other nodes in the same process. I would like to see rcl* libraries eliminate globals because I run into a lot of system integration decisions that need to break assumptions.
I made a bunch of changes to RCLC to reduce the reliance on global variables too (allowing one callback function to service multiple subscriptions in C where std::bind doesn't exist)

In RCLCPP, there are a bunch of ways of creating a node with options and context. The old (default?) way is to rclcpp::init() the default context which installs the signal handler, default construct a node, and it automatically picks up the default context.
The more explicit way is to create a context with for your node options, and then make a node with those options. https://github.com/BrettRD/ros-gst-bridge/blob/ros2/gst_bridge/src/rosbasesrc.cpp#L270-L274. You still have to rclcpp::init() (or similar) to install the signal handler

This allows all nodes in the process to use a common signal handler which is held by the default context, but there's no longer much of a reason to have a fully-fledged rclcpp::context sitting around at global scope, except to hold the signal handler.

Also, of what nature is the signal handling race condition? Which signal handler is installed first?

I have a hastily thrown-together threading arrangement with blocking waits etc, I should be doing things differently / safer.
The GLib sigint handler is queued in the GLib main loop which is busy waiting for a notification from a thread where ROS is waiting for a message.
ROS gets to its sigint handler first because it's processing callbacks on its executor waiting for a message.
ROS closes out, leaving the condition variable unfulfilled, and throwing GLib into deadlock.
ROS Launch escalates to sigkill, and Sunday lunch continues.
If there's lots of incoming ros messages, GLib isn't waiting very often and it exits cleanly, mopping up its ros nodes along the way. (Looking at it more closely now, I don't think I ever actually installed the ROS signal handler, oops)

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 21, 2022

So perhaps what we do is have an rclrs::init() function that creates a default signal handler struct, and require any created contexts to have a signal handler struct passed into them?

Though I'm still trying to figure out the proper way to have things like nodes self-terminate on a signal. Ownership is making it a bit tricky. I'm currently checking the signal each time a node spins, but that means that we'd have to pass ownership of both the context and the node being spun to the spin functions in order for it to kill them...

@BrettRD
Copy link

BrettRD commented Jun 22, 2022

rclcpp also has bool install_signal_handlers() (returns false if the signal handlers have already been installed) and bool rclcpp::signal_handlers_installed() (checks if the signal handlers are installed)
You can call these instead of rclcpp::init()

You can be a bit flexible in which structures do which task, in rclcpp, it got really modular very quickly:

  • Nodes have one or more CallbackGroups
  • Subscriptions are added to CallbackGroups, (node.create_subscription is a shorthand to use the default callback group)
  • CallbackGroups are added to Executors
  • Executors spin the CallbackGroups.

rclcpp::spin(Node) and node.spin() are helper functions that create a default executor, load it with the default CallbackGroup, and spin the executor.

I'm not familiar enough with the Rust ownership model, can the signal handler be a Static member of the context class?
That way it could be installed once and used (atomically) by all contexts in the process. Then you don't need a global variable at all. I think that's what @nnmm was saying here: #188 (comment)

@nnmm
Copy link
Contributor Author

nnmm commented Jun 22, 2022

So perhaps what we do is have an rclrs::init() function that creates a default signal handler struct, and require any created contexts to have a signal handler struct passed into them?

@jhdcs I think such a function would be a good apprach, though it should be named differently imo since it's not an analogue of rclcpp::init(), which also creates a context. Maybe rclrs::install_signal_handlers() like in rclcpp. And I wouldn't make it return anything to be passed it into the context, since signal handling should be optional.

Another way, like I described above, is to make contexts "share" the signal handler by enabling it when at least one context exists. That can be done with static variables. The drawback is that there's not a good way to make the signal handling optional in that case.

Though I'm still trying to figure out the proper way to have things like nodes self-terminate on a signal. Ownership is making it a bit tricky. I'm currently checking the signal each time a node spins, but that means that we'd have to pass ownership of both the context and the node being spun to the spin functions in order for it to kill them...

Have you checked out what rclcpp does? I think they have a guard condition that is added to the wait set which is triggered when the signal is received. Plus rclcpp::ok() obviously.

I'm not familiar enough with the Rust ownership model, can the signal handler be a Static member of the context class?
That way it could be installed once and used (atomically) by all contexts in the process. Then you don't need a global variable at all. I think that's what @nnmm was saying here: #188 (comment)

@BrettRD Basically, yes. Rust is a little different, structs can't have static fields, but we'd have one or more static variables that are private to the context module, and which conceptually belong to the Context struct.

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 22, 2022

Have you checked out what rclcpp does? I think they have a guard condition that is added to the wait set which is triggered when the signal is received. Plus rclcpp::ok() obviously.

I'm triple-checking this, but if a guard condition is how rclcpp handles things, would that mean that this issue would have to wait until we've implemented guard conditions?

@nnmm
Copy link
Contributor Author

nnmm commented Jun 28, 2022

I'm triple-checking this, but if a guard condition is how rclcpp handles things, would that mean that this issue would have to wait until we've implemented guard conditions?

I'm not sure if the guard conditions are solely an optimization that reduce latency by not waiting for the current wait set to time out. The part that checks the atomic boolean in rclcpp::ok() doesn't need guard conditions, so it could be done independently of them.

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 28, 2022

I thought the guard conditions were making it so that the waitset could be stopped early, which is admittedly similar to what you said, but it makes it so that you're less likely to get the user spamming Ctrl + C multiple times to kill everything. At the moment, it's sort of looking like the time you'd be checking the atomic boolean at the start of each spin call, which if you're dealing with a frozen node, might never happen.

Maybe it's just a case of me over-thinking things, but I figured that this should be able to gracefully shut down even frozen nodes...

@nnmm
Copy link
Contributor Author

nnmm commented Jun 28, 2022

What do you mean by "frozen"?

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 28, 2022

A node not responding to commands or inputs.

Though that might be a case where a graceful shutdown isn't warranted.

@nnmm
Copy link
Contributor Author

nnmm commented Jun 28, 2022

But why – is a subscription callback not returning? If so, adding a guard condition won't help, because the node is not currently waiting in a wait set, right?

@jhdcs
Copy link
Collaborator

jhdcs commented Jun 28, 2022

Not sure. I think I need to do some more reading on wait sets, guard conditions, and signal handling...

@nnmm nnmm mentioned this issue Jul 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants