Conversation

@jordanhendricks (Contributor) commented Mar 27, 2023

This PR makes use of the new bhyve VMM timing data interfaces introduced in the accompanying bhyve change (CR link). See RFD 358 for additional background.

Closes #337.

Overview

This change adds the following functionality for guests that are migrated:

  • the guest TSC continues functioning at the same frequency, adjusted for live migration time
  • device timers continue functioning properly

To do so, propolis-server makes use of the new bhyve VMM timing data interfaces. On the source side, we read the timing data before the lengthier parts of migration begin; toward the end, we send this data over. The destination side calculates the migration time from the source and destination wall clocks, adjusts the guest TSC based on that migration time, and adjusts the boot_hrtime (the value used for device timers) based on the guest uptime, the migration time, and the hrtime of its new host.

A load-bearing assumption in our adjustment calculations is that the wall clocks of the source and destination hosts are synchronized, which is an assumption we are making in other parts of the product.
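To make the arithmetic concrete, here is a minimal sketch of the destination-side adjustment, assuming synchronized wall clocks as above (names and signatures are illustrative, not the actual propolis code):

const NS_PER_SEC: u128 = 1_000_000_000;

fn adjust_time_data(
    guest_freq_hz: u64,     // guest TSC frequency
    src_guest_tsc: u64,     // guest TSC read on the source
    src_uptime_ns: i64,     // source hrtime - source boot_hrtime
    src_wallclock_ns: u128, // source wall clock at export
    dst_wallclock_ns: u128, // destination wall clock at import
    dst_hrtime_ns: i64,     // destination high-res clock at import
) -> (u64, i64) {
    // Perceived migration time, clamped to zero if the clocks are
    // slightly skewed (see the review discussion below).
    let delta_ns = dst_wallclock_ns.saturating_sub(src_wallclock_ns);

    // Advance the guest TSC by the ticks that elapsed while migrating.
    let ticks = (delta_ns * guest_freq_hz as u128) / NS_PER_SEC;
    let new_guest_tsc = src_guest_tsc + ticks as u64;

    // Rebase boot_hrtime against the destination's clock so that
    // (hrtime - boot_hrtime) still reflects uptime plus migration time.
    let new_boot_hrtime = dst_hrtime_ns - src_uptime_ns - delta_ns as i64;

    (new_guest_tsc, new_boot_hrtime)
}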

Bhyve VMM timing data interface details

On a read, the timing data interface provides the following:

  • current guest TSC
  • guest TSC frequency
  • guest boot_hrtime (hrtime of the host when the guest was booted)
  • host wall clock
  • host hrtime (high res clock)

Callers writing the data provide the guest-related fields, as well as the hrtime and wall clock time of the host at the moment the caller sent the request. This allows bhyve to make additional adjustments to account for latency between the userspace write and the kernel-side receipt of the data.
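For illustration, the exchanged data might be grouped roughly as below; the field names follow the "VMM Time Data" log line shown later in this PR, though the actual kernel-facing layout may differ:

pub struct VmTimeData {
    pub guest_freq: u64,  // guest TSC frequency (Hz)
    pub guest_tsc: u64,   // current guest TSC
    pub boot_hrtime: i64, // host hrtime when the guest was booted
    pub hrtime: i64,      // host high-resolution clock
    pub hres_sec: u64,    // host wall clock: seconds...
    pub hres_ns: u64,     // ...and nanoseconds
}

On a write, the hrtime/hres_sec/hres_ns fields carry the caller's clocks at send time, which is what allows bhyve to compensate for the userspace-to-kernel latency.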

Timing Adjustment D script
This PR adds a couple of probes where we adjust the timing data, to help validate that the data we are seeing is correct.

Some example output of this script, run for a VM with no load on my otherwise not-busy test machine:

$ pfexec ./scripts/time-adjustments.d 527
tracing pid 527...
timing data imported; press CTRL+C for summary
^C
GUEST FREQ (Hz)        UPTIME (usec)    TOTAL TIME (usec)
     2649924837              1431678              2818798

EVENT           DURATION (usec)            GUEST TSC          BOOT HRTIME
Migration               2818767           3793847219        3899256607629
[adjustment]                              7469522214          -6033009351
Propolis                     12          11263369433        3893223598278
[adjustment]                                   82044                31221
Kernel                       17          11263451477        3893223629499

Notes for reviewers

  • There are some TODOs in here; I wouldn't mind feedback on most of them (except the one about writing tests).
  • I still need to write some automated tests and write up a bunch of testing notes, but I wanted to start getting eyes on this.

@jordanhendricks force-pushed the migrate-timing branch 11 times, most recently from d7786a9 to 5d1c492 on March 29, 2023 at 23:17
@jordanhendricks marked this pull request as ready for review on March 29, 2023 at 23:33
this->total / 1000);
printf("\n");

printf("%-10s %20s %20s %20s\n",
@jordanhendricks (Contributor, Author):
Noticed that the output ordering here is a bit confusing -- will fix in a follow up

@gjcolombo (Contributor) left a comment:

Thanks for putting this together! Left a few initial thoughts (hopefully nothing too off-base with the bhyve changes, which I've skimmed but not looked at terribly closely yet).

// TODO: A hacky way for now to hang onto the VMM timing data between
// migration phases.
//
// We want to read the VMM timing data as soon as we can after we pause the
@gjcolombo (Contributor):

For better or for worse (possibly for worse), the state driver pauses entities after vCPUs. Entity pause might be expensive if, say, an entity wants to drain a queue or wait for an outstanding operation to finish before completing its pause request. If the goal is to capture this state as soon as the CPUs stop ticking, we should consider one of these approaches:

  1. Reorder operations in the state driver so that entities pause before vCPUs
  2. Teach the state driver to capture this architectural state at the right time and send it back to the migration task

Of these two, I would strongly prefer option 1 (option 2 is not as hard as it sounds, but it would be pretty messy). To implement option 1, though, we have to be confident that all our entities won't care that the CPUs are still running at the time they're asked to pause. (But see below for a third possible choice.)

@pfmooney (Collaborator):

> We want to read the VMM timing data as soon as we can after we pause the
> source, and make adjustments on the destination as close as we can to
> the end of the migration.

With the way things are structured, I'm not sure that's necessary for accuracy?

As Greg notes, we could change some of the ordering around when userspace device entities are paused vs the vCPUs. I imagine we'll end up evolving those events with more time (such as adding a stage when disk/network IO is throttled, but the devices themselves aren't paused).

@jordanhendricks (Contributor, Author):

this is all restructured as of cdbe3ef, though I still kept around a field to hang on to time data state between export and send in the protocol

@jordanhendricks (Contributor, Author):

as of d09938f, there is only one time data phase of the protocol

I realized through review and discussion that having two phases more closely mapped to my initial model of how to correct the guest TSC (namely, measure how long migration takes and add that to the guest TSC). But it doesn't actually matter for correctness when we export/import the data, since the guest TSC is calculated from host hrtime. As long as we import the time data before the device state, we should be in good shape.

// source, and make adjustments on the destination as close as we can to
// the end of the migration.
//
// This lets us hang on to the data between export in the pause phase,
@gjcolombo (Contributor):

How deeply do we care about which protocol phase sends this data? The destination has to compute the total amount of time the VM spent paused in any case, and has to account for time that elapses on post-migration tasks like starting entities.

If it doesn't matter when we send this (only when we collect it), then perhaps an even better solution to the above is to add phases: instead of going Pause -> RamPush -> DeviceState, have PauseCpus -> ArchState -> PauseDevices -> RamPush -> DeviceState. Then the migration context doesn't have to cart this data around between phases--it just gets captured in the phase immediately after CPUs are paused.

(This assumes we don't have to pause entities to have a consistent architectural state. But even if we do, we can still solve this problem by adding that phase back: Pause -> ArchState -> RamPush -> DeviceState.)

@pfmooney (Collaborator):

> The destination has to compute the total amount of time the VM spent paused in any case

I'm not sure it so much has to calculate how much time the VM was paused, but rather how much time passed between when the reading of timing (TSC, etc) information was taken, and when its adjusted value(s) are being loaded on the destination. The destination VM could remain paused for additional time past that, but it would appear no different than a case where vCPU threads are not being scheduled for some reason.

Contextually, it is important that the timing information be loaded first, before other device state, since the adjusted boot_hrtime value is used to offset the other timer values in the emulated device state.

@jordanhendricks (Contributor, Author) commented Apr 12, 2023:

> How deeply do we care about which protocol phase sends this data?

As Patrick said, it needs to be sent before the device state is imported (which this current change does not do), but also we are trying to push most of the adjustments made into userspace, which implies we need to collect the data before any lengthier pieces of migration, where possible. Perhaps I was a bit too constraining when describing this in terms of when we pause the vCPUs, but it's been my mental model for thinking about it. I can change some of the comments here to reflect that.

I had avoided reintroducing the ArchState phase in the hopes that it would be cleaner given that we split up the collecting of time data and the sending of it. But perhaps that isn't true, as it might be harder to grok that this data must be imported before the device state is imported.

I could add the ArchState phase back as such: Pause (with time data collection) -> RamPush -> ArchState (time data send) -> DeviceState...

How does that sound?

@jordanhendricks (Contributor, Author) commented Apr 18, 2023:

In 803081, I added:

  • a TimeDataRead phase on the source side
  • a TimeData phase for both sides to exchange the data

@jordanhendricks (Contributor, Author):

see: #359 (comment)

I consolidated down to one protocol phase


/// Find the perceived wall clock time difference between when timing data
/// was read on the source versus the current time on the destination.
// TODO(#357): Right now we throw up our hands if the delta is negative.
@gjcolombo (Contributor):

Could we avoid this problem by trying to measure elapsed migration time from the point of view of just one of the actors involved?

That is, suppose the Pause phase included some kind of mutual acknowledgement in the protocol:

  • Source sends an "I'm pausing" message
  • Target records the time it received this message and immediately sends an "OK to pause" message
  • Only then does the source start pausing

(This could also be an "I read timing data"/"OK you read the timing data" message pair if that's what we want; I'll probably work it out as I keep reviewing this.)

This can produce a "total migration time" estimate on the target that's always at least the amount of time the VM was actually paused. It might be an overestimate because it also includes the time it took for the source to receive the "OK to pause now" message and actually start pausing things. We could possibly correct for that further by having the source measure some length of time (e.g. from when it sent its "am I OK to pause?" message to when it actually paused things) and then send that to the target to use to correct its view of the total migration time.

Anyway: I haven't totally worked out the arithmetic here, so I could be totally wrong about all this, but I'd be interested in seeing if there's an approach that doesn't rely on real time synchronization between the two machines involved.

@pfmooney (Collaborator):

Using the wallclock difference (and assuming the respective clocks on either side are adequately synced via NTP) seemed like a simple way to get this off the ground. Ultimately we want some means of judging the passage of time (from some global observer) between the point at which the timing measurement (TSC, boot_hrtime, etc) is taken on the source, and when its adjusted values are to be loaded on the destination.

It's been suggested that we could do more sophisticated things here in the future, including something like a little mini implementation of the NTP algo between the two propolis instances, but cheating with wallclock in the short term seemed like a reasonable starting point.

@jordanhendricks (Contributor, Author):

FWIW, the problem I ran into with a negative wall clock delta was in an earlier prototype where the time data was sent immediately after it was collected. In practice, my hope is that this won't matter, if only because of the longer delay between reading and sending, but hopefully also because NTP is better behaved than on the lab machines I was using. I think I'm comfortable with leaving this behavior for now; when we begin to optimize things such that migrations are faster, we may want to reconsider it (and I'm happy to leave #357 as a placeholder for that).

@jordanhendricks (Contributor, Author):

In d09938f, which has a much tighter window between export and import than previous versions of this PR, I clamped the migrate delta to 0 if for some reason we see a negative delta.
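A minimal sketch of that clamping, assuming the wall-clock readings are carried as std::time::Duration values (as the dst_wc parameter in the adjustment function suggests):

use std::time::Duration;

// Treat a (presumably small) negative wall-clock delta as zero rather
// than failing the migration.
fn migrate_delta(src_wc: Duration, dst_wc: Duration) -> Duration {
    dst_wc.checked_sub(src_wc).unwrap_or(Duration::ZERO)
}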


@jordanhendricks (Contributor, Author):

Got through most of the review feedback in 803081, and changed the bhyve interfaces and naming conventions to reflect the current state of the bhyve bits. I added a new phase for exchanging the time data. I want to do a bit more refactoring there to only import the time data pieces (and ignore the arch entries).

async fn time_data_read(&mut self) -> Result<(), MigrateError> {
    let instance_guard = self.vm_controller.instance().lock();
    let vmm_hdl = &instance_guard.machine().hdl;
    let raw = vmm_hdl.export_vm()?;
@gjcolombo (Contributor):

There's a subtlety in here that I think would be good to fix: export_vm returns a MigrateStateError, which has a From<std::io::Error> impl that converts all errors into MigrateStateError::ImportFailed, which is confusing since we're exporting here. (This hasn't been a problem before because entity state export is infallible.)

I'm not sure of the best way to fix this. We could remove the From impl but that would break all the uses of ? in the import routines. Maybe we need a StateExportError enum for export_vm to use?

@jordanhendricks (Contributor, Author) commented May 6, 2023:

Yeah, this is a great point. I took a stab at making error handling a bit better in cdbe3ef; there is now a MigrateError::TimeData type that we convert all relevant TimeData{Read} phase errors into. Let me know what you think.
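A sketch of that shape, assuming a thiserror-style enum (the TimeData variant matches the code quoted later in this review; everything else is illustrative):

use thiserror::Error;

#[derive(Debug, Error)]
pub enum MigrateError {
    // ... other variants elided ...
    #[error("guest time data error: {0}")]
    TimeData(String),
}

Export-side failures can then be mapped into it explicitly, e.g. with .map_err(|e| MigrateError::TimeData(e.to_string())), rather than being funneled through MigrateStateError::ImportFailed.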

@jordanhendricks (Contributor, Author) commented May 6, 2023

cdbe3ef is a medium-sized refactor and addresses a hodgepodge of other review feedback. Specifically:

  • 627be13 ripped out the migrate module for VmmHdl and suggested an alternate way of handling some of the global VM data; I did something similar here by creating a VmTimeData struct to represent the abstract notion of guest time data. This is what gets serialized/deserialized in the protocol, and there are explicit functions to export/import this state that we can build versioning into as needed.
  • The logic in the former time_adjust module was moved to live with the VmTimeData state in lib/propolis/src/vmm/time.rs; basically all of the guest-time-related utilities (import/export, adjustment calculations) live in that module now.
  • The logic around the destination side reading the source time data, adjusting it, and importing it now lives with the rest of the protocol code, calling into the time module to make the adjustments and do the import. I think this fits better: the migration code is the consumer of the import/export interfaces, so it makes sense for it to be responsible for this (as opposed to the import function, as things were structured before).

@jordanhendricks (Contributor, Author) commented May 6, 2023

I have a couple of other small review comments to wrap up still, but wanted to point out that the PHD live migration tests pass again 🎉

@pfmooney (Collaborator) left a comment:

Will take a more thorough pass when you've pushed pending updates

@jordanhendricks (Contributor, Author) commented May 12, 2023

Per recent discussion about landing this ASAP, I squashed my commits and force-pushed after rebasing onto master, to minimize merge conflicts with other recent changes in master.

Additional notes:

  • I removed the change that turned PHD migration tests back on
  • This change is clippy clean

Testing notes:

  • I ran a handful of migrations back and forth between physical machines; everything migrated successfully.
  • cargo test passes on my linux laptop and helios dev machine
  • PHD tests pass locally (including the migration ones, but excluding the Crucible ones, as I did not have Crucible configured).

Miscellaneous output for posterity:

live-migration-times.d (source)

jordan@maxwell ~/propolis/scripts $ pfexec ./live-migration-times.d 1533 v                             
tracing live migration protocol times for pid 1533...                                                  
                                                                                                       
PHASE                                        TIMESTAMP                                                 
Sync         BEGIN                      17437355109672
Sync         END                        17437355230924
RamPushPrePause BEGIN                      17437355249524      
RamPushPrePause END                        17446758097054
RamPushPostPause BEGIN                      17446758114444                                             
RamPushPostPause END                        17446960120090                                             
TimeData     BEGIN                      17446960141050
TimeData     END                        17446960297073 
DeviceState  BEGIN                      17446960307583                                                 
DeviceState  END                        17447021682763                                                 
RamPull      BEGIN                      17447021689593
RamPull      END                        17447079199443                                                 
ServerState  BEGIN                      17447079210023         
ServerState  END                        17447143771862                    
Finish       BEGIN                      17447143778262
Finish       END                        17447199125099
^C                                                                                                     
                                                                                                       
                                                                                                       
PHASE                           TIME ELAPSED (usec)                                                    
Sync                                            121                                                    
RamPushPrePause                             9402847                                                    
RamPushPostPause                             202005                                                    
TimeData                                        156                                                    
DeviceState                                   61375                                                    
RamPull                                       57509                                                    
ServerState                                   64561                                                    
Finish                                        55346                                                    
TOTAL                                       9843924     

time-adjustments.d (dest)

jordan@yuban ~/propolis/scripts $ pfexec ./time-adjustments.d 2160
tracing pid 2160...
time data imported; press CTRL+C for summary
^C
GUEST FREQ (Hz)        UPTIME (usec)    TOTAL TIME (usec)
     2649958945            879497608                  148

EVENT           DURATION (usec)            GUEST TSC          BOOT HRTIME
Migration                   107        2330632597459       17148534696948
[adjustment]                                  285087          -9235619709
Propolis                     22        2330632882546       17139299077239
[adjustment]                                   84400                    0
Kernel                       18        2330632966946       17139299077239

example log messages of time adjustments (dest):

May 12 00:18:33.582 INFO VMM Time Data: "(guest_freq:2649958945,guest_tsc:2330632597459,hrtime:18028032305249,hres_sec:1683850713,hres_ns:581965192,boot_hrtime:17148534696948)", component: vm_controller
May 12 00:18:33.582 INFO Time data adjustments:                                                        
- guest TSC freq: 2649958945 Hz = 2.649958945 GHz                                                      
- guest uptime ns: 879497608301                                                                        
- migration time delta: 107.582µs                                                                      
- guest_tsc adjustment = 2330632597459 + 285087 = 2330632882546                                        
- boot_hrtime adjustment = 17148534696948 ---> 18018796793122 - 879497715883 = 17139299077239          
- dest highres clock time: 18018796793122                                                              
- dest wall clock time: 1683850713.582072774s, component: vm_controller  

@gjcolombo (Contributor) left a comment:

A couple of small bits of cleanup, but generally LGTM. Thanks for putting the work into this!

Comment on lines 346 to 352
let mut deserializer = ron::Deserializer::from_str(&raw)
    .map_err(codec::ProtocolError::from)?;
let deserializer =
    &mut <dyn erased_serde::Deserializer>::erase(&mut deserializer);
let time_data_src: vmm::time::VmTimeData =
    erased_serde::deserialize(deserializer).map_err(|e| {
        MigrateError::TimeData(format!(
            "VMM Time Data deserialization error: {}",
            e
        ))
    })?;
@gjcolombo (Contributor):

Since the destination type is known here, I think this can just be written as let time_data_src: VmTimeData = ron::from_str(&raw) (with an appropriate error mapping)--no need to do any of the type erasure/boxing.
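That is, something like the following, keeping the existing error mapping:

let time_data_src: vmm::time::VmTimeData = ron::from_str(&raw)
    .map_err(|e| {
        MigrateError::TimeData(format!(
            "VMM Time Data deserialization error: {}",
            e
        ))
    })?;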

@jordanhendricks (Contributor, Author):

fixed in 505caa0

    this->total += this->d_rpush_post;
}
if (this->d_td != 0) {
    printf("%-15s %30d\n", this->td, this->d_td / 1000);
@gjcolombo (Contributor):

nit: same width specifiers as the surrounding prints? (If the output aligns properly, then no worries; I just remember having to mess around with this to get all the headers to line up right when I added the post-pause phase.)

@jordanhendricks (Contributor, Author):

will fix this (and my own D script nits: #359 (comment)) in a follow-on PR

    dst_hrt: i64,
    dst_wc: Duration,
) -> Result<(VmTimeData, VmTimeDataAdjustments), TimeAdjustError> {
    // Basic validation: there is no reason system hrtime should be negative,
@gjcolombo (Contributor):

Optional/nit: would it be valid to check this when the timestamp is fetched and convert it to an unsigned value there, or do we use it in signed form in other places?

@jordanhendricks (Contributor, Author):

chatted about this offline -- in earlier versions of this code I did use unsigned values for hrtime quantities that really should never be negative (such as uptime). I found that it was kind of a pain to go back and forth between signed and unsigned, since the overflow checks are more constrained for signed values (and thus I would have to convert any unsigned values back to signed anyway, or otherwise handle overflow for them). For my own sanity, I chose to keep all hrtime values signed throughout. Maybe there's a nicer way to write this code, but I would rather not do this for now.
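A sketch of the tradeoff being described, keeping everything in i64 so the standard checked operations apply directly (names are illustrative):

// Uptime should never be negative; with signed values the range check
// falls out of the ordinary checked subtraction.
fn guest_uptime_ns(src_hrtime: i64, src_boot_hrtime: i64) -> Option<i64> {
    src_hrtime.checked_sub(src_boot_hrtime)
}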

//

// guest_uptime_ns = source hrtime - source boot_hrtime
let guest_uptime_ns: i64 = src
@gjcolombo (Contributor):

I'm having trouble squaring this with the comment on line 181. If the exported timer value is normalized, hasn't the source boot_hrtime been subtracted from src.hrtime already?

@jordanhendricks (Contributor, Author):

We chatted about this offline -- there are no timers being normalized/denormalized here; this is part of the calculation of the new boot_hrtime. This code and comment could make what is happening here more clear.

@jordanhendricks merged commit 7ed4808 into oxidecomputer:master May 12, 2023
@jordanhendricks deleted the migrate-timing branch May 12, 2023 22:21