Added timeout for VMshutdown #272


Merged

Conversation

Zyqsempai
Contributor

Issue #247

Added timeout for VM Shutdown
Added special type of error for case when timeout exceeded

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Zyqsempai
Contributor Author

@sipsma PTAL

@Zyqsempai Zyqsempai force-pushed the 247-add-timeout-for-vm-shutdown branch from f2ac15e to 7e4c8c1 Compare October 2, 2019 13:33
@sipsma sipsma left a comment

Looks great overall, thanks! Just some minor comments.

@@ -67,6 +67,7 @@ const (
defaultVsockPort = 10789
minVsockIOPort = uint32(11000)
firecrackerStartTimeout = 5 * time.Second
stopVMTimeoutSeconds = 5

nit: For this variable, I think it should be a time.Duration (so just 5 * time.Second) so it can stay agnostic of particular units.

Also, can this be named more like defaultStopVMTimeout? Just to make it clear it's a default value that can be optionally overridden.

_, err = s.Shutdown(requestCtx, &taskAPI.ShutdownRequest{Now: true})
if err != nil {
return nil, err
for {

I'm not sure we need the for statement wrapping the select, unless I'm missing something we should be good with just the select here


Actually, I took a second look and think we will need a little bit more than just removing the for. We want to make sure that if we hit the timeout while blocked on the Shutdown call, the StopVM method doesn't stay blocked. I think the simplest way to do this will be running Shutdown in its own goroutine. Roughly sketched out:

shutdownCh := make(chan error)
go func() {
    defer close(shutdownCh)
    _, err = s.Shutdown(...)
    shutdownCh <- err
}()

select {
case <-ctx.Done():
    ...
case err = <-shutdownCh:
    if err != nil {
        return nil, err
    }
    return Empty{}, nil
}

@@ -36,6 +36,7 @@ message CreateVMRequest {

message StopVMRequest {
string VMID = 1;
uint32 VMShutDownTimeOut = 2;

For this variable, units matter, so can we name it more like VMShutdownTimeoutSeconds?

@sipsma
Contributor

sipsma commented Oct 2, 2019

Oh also, the tests are failing just because your commit Added timeout for VMshutdown doesn't have a DCO (your other commit does), so please just amend that commit with a signoff when updating.

@Zyqsempai Zyqsempai force-pushed the 247-add-timeout-for-vm-shutdown branch from 7e4c8c1 to 577335b Compare October 3, 2019 08:47
@sipsma
Contributor

sipsma commented Oct 4, 2019

@Zyqsempai Saw in the test runs of your latest commits that the DeadlineExceeded error was being returned unexpectedly in the test case, which didn't make any sense since the code you wrote looked good. It took a while, but I think I tracked down why. It's really really tricky (but fortunately easy to fix I think).

The fix should be to not create a new ctx derived from the requestCtx in StopVM. Instead, just use the existing requestCtx unmodified. Then, use a time.Timer to enforce the timeout, something along the lines of:

func (s *service) StopVM(...) {
    ...
    timer := time.NewTimer(timeout)
    ...
    select {
    case <-timer.C:
        ...
    case err = <-shutdownCh:
        ...
    }
}

To explain what I think the actual problem is, containerd gives our code a context object during initialization in the NewService method, which we decided to call shimCtx. This context represents the execution of the shim process as a whole; when it is cancelled that means the process is shutting down (and thus all the services we run need to shut down too).

One place we give the shimCtx to is the server for the firecracker-control API, which is what serves the StopVM API (among others). The TTRPC server implementation uses that context to decide when to shutdown, but it also derives the requestCtx objects for all its methods from that context.

As a result, the requestCtx provided to StopVM is a child context of shimCtx. This by itself is reasonable; if the shim is shutting down then ongoing requests should get cancelled too. However, inside the Shutdown method, we cancel the shimCtx in order to actually start the full shutdown process, which is also reasonable by itself.

Putting all the above together, canceling shimCtx cancels requestCtx, which in turn cancels the new ctx your code added, so then when we hit the select statement in the new code we always hit the <-ctx.Done() case first and return DeadlineExceeded. Getting rid of that new extra ctx should thus fix the problem (and keep things from getting even more complicated than they already are). Let me know if this makes sense, it's obviously not straightforward at all.

@Zyqsempai
Contributor Author

@sipsma Yep, it's really tricky, but it totally makes sense.

@sipsma sipsma left a comment

Looks good, just one more minor comment. Also, can you squash the separate commits into a single one? That should be sufficient for this change. Then it should be good for me to ship!

shutdownCh := make(chan error)
go func() {
defer close(shutdownCh)
_, err = s.Shutdown(requestCtx, &taskAPI.ShutdownRequest{Now: true})

just noticed this, let's use _, err := s.Shutdown(...) (:= instead of =), just to avoid the confusion that can result from overwriting a named return val from within a separate goroutine.

@Zyqsempai Zyqsempai force-pushed the 247-add-timeout-for-vm-shutdown branch from c8cc629 to cab9097 Compare October 7, 2019 08:25
@Zyqsempai
Contributor Author

Zyqsempai commented Oct 7, 2019

@sipsma Done, thanks for your comments. PTAL.

@sipsma sipsma left a comment

LGTM, thanks again!

@kzys
Contributor

kzys commented Oct 8, 2019

Two (a bit bikeshed-y) questions:

  • If a customer does want to shut down the VM regardless of how long it takes, what would be the right way? Giving a large enough timeout? We are removing the way to say "no timeout", which may not be bad, though.
  • Do we want to have a server-side timeout for all operations? It is not clear why StopVM can have a timeout but the others don't.

@sipsma
Contributor

sipsma commented Oct 8, 2019

* If a customer does want to shut down the VM regardless of the time it takes, what would be the right way? Giving a large enough timeout?

Yes, this change essentially prevents StopVM from ever being blocked forever waiting for the shutdown, which I think is preferable as I can't think of any situation in which being blocked literally forever is the desired behavior for stopping the VM.

* Do we want to have a server-side timeout for all operations? It is not clear why StopVM can have a timeout but the others don't.

I think adding it to StopVM is particularly valuable because it ensures that users don't get into a situation where they are unable to clean up resources on their machine (once the timeout is hit, we just send SIGKILL).

The cases where other operations besides StopVM get blocked forever are still not good, but the worst case scenario is the user doesn't create a VM or get information about their VM, which is somewhat less harmful than being unable to remove resources from their machine. That being said, I agree that adding timeouts to those other operations could still be valuable and is something to consider in the future (perhaps just not in this PR).

@kzys
Contributor

kzys commented Oct 8, 2019

Thanks. I agree that blocking literally forever won't be what customers want.

How about renaming the field from VMShutdownTimeoutSeconds to TimeoutSeconds or simply Timeout, given that we may add the field to other requests?

@xibz
Contributor

xibz commented Oct 8, 2019

I wonder if it may be best to add a grpc.ClientConn constructor function. You can set a timeout for the individual methods there. I wonder if that is preferable to adding a timeout field, and it would allow us to make use of the context that is being passed in.

Can this not be handled by setting a context deadline when making the request? Rather than an API change?

@sipsma
Contributor

sipsma commented Oct 8, 2019

Can this not be handled by setting a context deadline when making the request? Rather than an API change?

I think having the client specify the timeout in the context of their requests raises a bunch of questions that make the behavior harder to reason about: if the timeout is hit, then will the client be unblocked right away before any response is received (as opposed to receiving a timeout error from the server)? What if the context timeout is applied to other parts of the client connection, could that mean that the user is unable to distinguish between a dial timeout and a timeout shutting down the VM? How would the user determine if SIGKILL was sent in that case?

Also see this comment; from a purely internal code perspective, using context timeouts here creates more confusion than it's worth IMO.

I think in practice it ends up being simpler to just have it be part of the API.

@sipsma
Contributor

sipsma commented Oct 8, 2019

Thanks. I agree that blocking literally forever won't be what customers want.

How about renaming the field from VMShutdownTimeoutSeconds to TimeoutSeconds or simply Timeout, if we may add the field in other requests?

Agree, @Zyqsempai I think that's worth making before we merge here.

@IRCody IRCody left a comment

Agree with changing the var name but LGTM otherwise.

@xibz xibz left a comment

Had one minor nit comment, but otherwise looks good.

@@ -499,6 +500,13 @@ func (s *service) createVM(requestCtx context.Context, request *proto.CreateVMRe
// created yet and the timeout is hit waiting for it to exist, an error will be returned but the shim will
// continue to shutdown.
func (s *service) StopVM(requestCtx context.Context, request *proto.StopVMRequest) (_ *empty.Empty, err error) {
var timeout time.Duration
@xibz xibz Oct 8, 2019

I think this is a little bit cleaner

timeout := defaultStopVMTimeout
if request.VMShutdownTimeoutSeconds > 0 {
    timeout = time.Duration(request.VMShutdownTimeoutSeconds) * time.Second
}

@Zyqsempai Zyqsempai force-pushed the 247-add-timeout-for-vm-shutdown branch from cab9097 to 1ac8f63 Compare October 9, 2019 08:37
@Zyqsempai
Contributor Author

Done!

@sipsma sipsma merged commit a28cf67 into firecracker-microvm:master Oct 9, 2019
fangn2 pushed a commit to fangn2/firecracker-containerd that referenced this pull request Mar 23, 2023