Added timeout for VMshutdown #272


Merged

Conversation

Zyqsempai
Contributor

Issue #247

Added timeout for VM Shutdown
Added special type of error for case when timeout exceeded

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Zyqsempai
Contributor Author

@sipsma PTAL

@Zyqsempai Zyqsempai force-pushed the 247-add-timeout-for-vm-shutdown branch from f2ac15e to 7e4c8c1 Compare October 2, 2019 13:33
@sipsma sipsma left a comment

Looks great overall, thanks! Just some minor comments.

@@ -67,6 +67,7 @@ const (
defaultVsockPort = 10789
minVsockIOPort = uint32(11000)
firecrackerStartTimeout = 5 * time.Second
stopVMTimeoutSeconds = 5

nit: For this variable, I think it should be a time.Duration (so just 5 * time.Second) so it can stay agnostic of particular units.

Also, can this be named more like defaultStopVMTimeout? Just to make it clear it's a default value that can be optionally overridden.

_, err = s.Shutdown(requestCtx, &taskAPI.ShutdownRequest{Now: true})
if err != nil {
return nil, err
for {

I'm not sure we need the for statement wrapping the select, unless I'm missing something we should be good with just the select here


Actually, I took a second look and think we will need a little bit more than just removing the for. We want to make sure that if we hit the timeout while blocked on the Shutdown call, the StopVM method doesn't stay blocked. I think the simplest way to do this will be running Shutdown in its own goroutine. Roughly sketched out:

shutdownCh := make(chan error)
go func() {
    defer close(shutdownCh)
    _, err = s.Shutdown(...)
    shutdownCh <- err
}()

select {
case <-ctx.Done():
    ...
case err = <-shutdownCh:
    if err != nil {
        return nil, err
    }
    return Empty{}, nil
}

@@ -36,6 +36,7 @@ message CreateVMRequest {

message StopVMRequest {
string VMID = 1;
uint32 VMShutDownTimeOut = 2;

For this variable, units matter, so can we name it more like VMShutdownTimeoutSeconds?

@sipsma
Contributor

sipsma commented Oct 2, 2019

Oh also, the tests are failing just because your commit Added timeout for VMshutdown doesn't have a DCO (your other commit does), so please just amend that commit with a signoff when updating.

@Zyqsempai Zyqsempai force-pushed the 247-add-timeout-for-vm-shutdown branch from 7e4c8c1 to 577335b Compare October 3, 2019 08:47
@sipsma
Contributor

sipsma commented Oct 4, 2019

@Zyqsempai Saw in the test runs of your latest commits that the DeadlineExceeded error was being returned unexpectedly in the test case, which didn't make any sense since the code you wrote looked good. It took a while, but I think I tracked down why. It's really really tricky (but fortunately easy to fix I think).

The fix should be to not create a new ctx derived from the requestCtx in StopVM. Instead, just use the existing requestCtx unmodified. Then, use a time.Timer to enforce the timeout, something along the lines of:

func (s *service) StopVM(...) {
    ...
    timer := time.NewTimer(timeout)
    ...
    select {
    case <-timer.C:
        ...
    case err = <-shutdownCh:
        ...
    }
}

To explain what I think the actual problem is, containerd gives our code a context object during initialization in the NewService method, which we decided to call shimCtx. This context represents the execution of the shim process as a whole; when it is cancelled that means the process is shutting down (and thus all the services we run need to shut down too).

One place we give the shimCtx to is the server for the firecracker-control API, which is what serves the StopVM API (among others). The TTRPC server implementation uses that context to decide when to shutdown, but it also derives the requestCtx objects for all its methods from that context.

As a result, the requestCtx provided to StopVM is a child context of shimCtx. This by itself is reasonable; if the shim is shutting down then ongoing requests should get cancelled too. However, inside the Shutdown method, we cancel the shimCtx in order to actually start the full shutdown process, which is also reasonable by itself.

Putting all the above together, canceling shimCtx cancels requestCtx, which in turn cancels the new ctx your code added, so then when we hit the select statement in the new code we always hit the <-ctx.Done() case first and return DeadlineExceeded. Getting rid of that new extra ctx should thus fix the problem (and keep things from getting even more complicated than they already are). Let me know if this makes sense, it's obviously not straightforward at all.

@Zyqsempai
Contributor Author

@sipsma Yep, it's really tricky, but it totally makes sense.

@sipsma sipsma left a comment

Looks good, just one more minor comment. Also, can you squash the separate commits into a single one? That should be sufficient for this change. Then it should be good for me to ship!

shutdownCh := make(chan error)
go func() {
defer close(shutdownCh)
_, err = s.Shutdown(requestCtx, &taskAPI.ShutdownRequest{Now: true})

just noticed this, let's use _, err := s.Shutdown(...) (:= instead of =), just to avoid the confusion that can result from overwriting a named return val from within a separate goroutine.

@Zyqsempai Zyqsempai force-pushed the 247-add-timeout-for-vm-shutdown branch from c8cc629 to cab9097 Compare October 7, 2019 08:25
@Zyqsempai
Contributor Author

Zyqsempai commented Oct 7, 2019

@sipsma Done, thanks for your comments. PTAL.

@sipsma sipsma left a comment

LGTM, thanks again!

@kzys
Contributor

kzys commented Oct 8, 2019

Two (a bit bikeshed-y) questions:

  • If a customer does want to shut down the VM regardless of how long it takes, what would be the right way? Giving a large enough timeout? We are removing the way to say "no timeout", which may not be bad, though.
  • Do we want to have a server-side timeout for all operations? It is not clear why StopVM can have a timeout but the others don't.

@sipsma
Contributor

sipsma commented Oct 8, 2019

* If a customer does want to shut down the VM regardless of the time it takes, what would be the right way? Giving a large enough timeout?

Yes, this change essentially prevents StopVM from ever being blocked forever waiting for the shutdown, which I think is preferable as I can't think of any situation in which being blocked literally forever is the desired behavior for stopping the VM.

* Do we want to have a server-side timeout for all operations? It is not clear why StopVM can have a timeout but the others don't.

I think adding it to StopVM is particularly valuable because it ensures that users don't get into a situation where they are unable to clean up resources on their machine (once the timeout is hit, we just send SIGKILL).

The cases where other operations besides StopVM get blocked forever are still not good, but the worst case scenario is the user doesn't create a VM or get information about their VM, which is somewhat less harmful than being unable to remove resources from their machine. That being said, I agree that adding timeouts to those other operations could still be valuable and is something to consider in the future (perhaps just not in this PR).

@kzys
Contributor

kzys commented Oct 8, 2019

Thanks. I agree that blocking literally forever won't be what customers want.

How about renaming the field from VMShutdownTimeoutSeconds to TimeoutSeconds or simply Timeout, given that we may add the field to other requests?

@xibz
Contributor

xibz commented Oct 8, 2019

I wonder if it may be best to add a grpc.ClientConn constructor function. You can set a timeout for the individual methods there. I wonder if that is preferable to adding a timeout field, and it would allow us to make use of the context that is being passed in.

Can this not be handled by setting a context deadline when making the request? Rather than an API change?

@sipsma
Contributor

sipsma commented Oct 8, 2019

Can this not be handled by setting a context deadline when making the request? Rather than an API change?

I think having the client specify the timeout in the context of their requests raises a bunch of questions that make the behavior harder to reason about: if the timeout is hit, then will the client be unblocked right away before any response is received (as opposed to receiving a timeout error from the server)? What if the context timeout is applied to other parts of the client connection, could that mean that the user is unable to distinguish between a dial timeout and a timeout shutting down the VM? How would the user determine if SIGKILL was sent in that case?

Also see this comment; from a purely internal code perspective, using context timeouts here creates more confusion than it's worth IMO.

I think in practice it ends up being simpler to just have it be part of the API.

@sipsma
Contributor

sipsma commented Oct 8, 2019

Thanks. I agree that blocking literally forever won't be what customers want.

How about renaming the field from VMShutdownTimeoutSeconds to TimeoutSeconds or simply Timeout, if we may add the field in other requests?

Agree, @Zyqsempai I think that's worth making before we merge here.

@IRCody IRCody left a comment

Agree with changing the var name but LGTM otherwise.

@xibz xibz left a comment

Had one minor nit comment, but otherwise looks good.

@@ -499,6 +500,13 @@ func (s *service) createVM(requestCtx context.Context, request *proto.CreateVMRe
// created yet and the timeout is hit waiting for it to exist, an error will be returned but the shim will
// continue to shutdown.
func (s *service) StopVM(requestCtx context.Context, request *proto.StopVMRequest) (_ *empty.Empty, err error) {
var timeout time.Duration
@xibz xibz Oct 8, 2019

I think this is a little bit cleaner

timeout := defaultStopVMTimeout
if request.VMShutdownTimeoutSeconds > 0 {
    timeout = time.Duration(request.VMShutdownTimeoutSeconds) * time.Second
}

@Zyqsempai Zyqsempai force-pushed the 247-add-timeout-for-vm-shutdown branch from cab9097 to 1ac8f63 Compare October 9, 2019 08:37
@Zyqsempai
Contributor Author

Done!

@sipsma sipsma merged commit a28cf67 into firecracker-microvm:master Oct 9, 2019
fangn2 pushed a commit to fangn2/firecracker-containerd that referenced this pull request Mar 23, 2023