Migrate guest time-related data #359
Conversation
Force-pushed from d7786a9 to 5d1c492
this->total / 1000);
printf("\n");

printf("%-10s %20s %20s %20s\n",
Noticed that the output ordering here is a bit confusing -- will fix in a follow-up
Thanks for putting this together! Left a few initial thoughts (hopefully nothing too off-base with the bhyve changes, which I've skimmed but not looked at terribly closely yet).
// TODO: A hacky way for now to hang onto the VMM timing data between
// migration phases.
//
// We want to read the VMM timing data as soon as we can after we pause the
For better or for worse (possibly for worse) the state driver pauses entities after vCPUs. Entity pause might be expensive if, say, an entity wants to drain a queue or wait for an outstanding operation to finish before completing its pause request. If the goal is to capture this state as soon as the CPUs stop ticking, we should consider one of these approaches:
- Reorder operations in the state driver so that entities pause before vCPUs
- Teach the state driver to capture this architectural state at the right time and send it back to the migration task
Of these two, I would strongly prefer option 1 (option 2 is not as hard as it sounds, but it would be pretty messy). To implement option 1, though, we have to be confident that all our entities won't care that the CPUs are still running at the time they're asked to pause. (But see below for a third possible choice.)
We want to read the VMM timing data as soon as we can after we pause the
source, and make adjustments on the destination as close as we can to
the end of the migration.
With the way things are structured, I'm not sure that's necessary for accuracy?
As Greg notes, we could change some of the ordering around when userspace device entities are paused vs the vCPUs. I imagine we'll end up evolving those events with more time (such as adding a stage when disk/network IO is throttled, but the devices themselves aren't paused).
this is all restructured as of cdbe3ef, though I still kept around a field to hang on to time data state between export and send in the protocol
as of d09938f, there is only one time data phase of the protocol
I realized through review and discussion that having two phases more closely mapped to my initial model of how to correct the guest TSC (namely, measure how long migration takes and add that to the guest TSC). But it doesn't actually matter for correctness when we export/import the data, since the guest TSC is calculated from host hrtime. As long as we import the time data before the device state, we should be in good shape.
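(As a rough illustration of that model, and only a model: the guest TSC here can be thought of as a function of host hrtime relative to boot_hrtime, scaled by the guest's TSC frequency. The names below are assumptions for illustration, not the bhyve or propolis implementation.)

// Simplified model only: guest TSC derived from host hrtime and boot_hrtime.
const NS_PER_SEC: i128 = 1_000_000_000;

fn guest_tsc_model(host_hrtime_ns: i64, boot_hrtime_ns: i64, guest_freq_hz: u64) -> u64 {
    // Guest "uptime" in ns, converted to TSC ticks at the guest frequency.
    let uptime_ns = host_hrtime_ns as i128 - boot_hrtime_ns as i128;
    ((uptime_ns * guest_freq_hz as i128) / NS_PER_SEC) as u64
}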
// source, and make adjustments on the destination as close as we can to
// the end of the migration.
//
// This lets us hang on to the data between export in the pause phase,
How deeply do we care about which protocol phase sends this data? The destination has to compute the total amount of time the VM spent paused in any case, and has to account for time that elapses on post-migration tasks like starting entities.
If it doesn't matter when we send this (only when we collect it), then perhaps an even better solution to the above is to add phases: instead of going Pause -> RamPush -> DeviceState, have PauseCpus -> ArchState -> PauseDevices -> RamPush -> DeviceState. Then the migration context doesn't have to cart this data around between phases--it just gets captured in the phase immediately after CPUs are paused.
(This assumes we don't have to pause entities to have a consistent architectural state. But even if we do, we can still solve this problem by adding that phase back: Pause -> ArchState -> RamPush -> DeviceState.)
The destination has to compute the total amount of time the VM spent paused in any case
I'm not sure it so much has to calculate how much time the VM was paused, but rather how much time passed between when the timing (TSC, etc.) information was read and when its adjusted value(s) are loaded on the destination. The destination VM could remain paused for additional time past that, but it would appear no different than a case where vCPU threads are not being scheduled for some reason.
Contextually, it is important that the timing information be loaded first, before other device state, since the adjusted boot_hrtime value is used to offset the other timer values in the emulated device state.
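(To make that ordering constraint concrete, a minimal sketch with assumed names rather than the actual propolis code: importing a normalized device timer needs the adjusted boot_hrtime first.)

// Sketch: if exported timer deadlines are stored relative to boot_hrtime, the
// destination must apply its adjusted boot_hrtime before importing them.
fn denormalize_deadline(adjusted_boot_hrtime_ns: i64, normalized_deadline_ns: i64) -> i64 {
    adjusted_boot_hrtime_ns + normalized_deadline_ns
}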
How deeply do we care about which protocol phase sends this data?
As Patrick said, it needs to be sent before the device state is imported (which this current change does not do), but also we are trying to push most of the adjustments made into userspace, which implies we need to collect the data before any lengthier pieces of migration, where possible. Perhaps I was a bit too constraining when describing this in terms of when we pause the vCPUs, but it's been my mental model for thinking about it. I can change some of the comments here to reflect that.
I had avoided reintroducing the ArchState phase in the hopes that it would be cleaner given that we split up the collecting of time data and the sending of it. But perhaps that isn't true, as it might be harder to grok that this data must be imported before the device state is imported.
I could add the ArchState phase back as such: Pause (with time data collection) -> RamPush -> ArchState (time data send) -> DeviceState...
How does that sound?
In 803081, I added:
- a TimeDataRead phase on the source side
- a TimeData phase for both sides to exchange the data
see: #359 (comment)
I consolidated down to one protocol phase
lib/propolis/src/vmm/time_adjust.rs
Outdated
/// Find the perceived wall clock time difference between when timing data
/// was read on the source versus the current time on the destination.
// TODO(#357): Right now we throw up our hands if the delta is negative.
Could we avoid this problem by trying to measure elapsed migration time from the point of view of just one of the actors involved?
That is, suppose the Pause phase included some kind of mutual acknowledgement in the protocol:
- Source sends an "I'm pausing" message
- Target records the time it received this message and immediately sends an "OK to pause" message
- Only then does the source start pausing
(This could also be an "I read timing data"/"OK you read the timing data" message pair if that's what we want; I'll probably work it out as I keep reviewing this.)
This can produce a "total migration time" estimate on the target that's always at least the amount of time the VM was actually paused. It might be an overestimate because it also includes the time it took for the source to receive the "OK to pause now" message and actually start pausing things. We could possibly correct for that further by having the source measure some length of time (e.g. from when it sent its "am I OK to pause?" message to when it actually paused things) and then send that to the target to use to correct its view of the total migration time.
Anyway: I haven't totally worked out the arithmetic here, so I could be totally wrong about all this, but I'd be interested in seeing if there's an approach that doesn't rely on real time synchronization between the two machines involved.
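(For illustration, a sketch of the single-observer estimate described above; the message names and types are hypothetical, not part of the protocol in this PR. Every timestamp is taken on the target, so no cross-host clock synchronization is required.)

use std::time::{Duration, Instant};

// Sketch: the target starts the clock when it receives "I'm pausing" and
// replies "OK to pause"; it stops the clock just before applying the imported
// time data. The result is at least the real paused time.
struct PausedTimeEstimate {
    ok_to_pause_sent: Instant,
}

impl PausedTimeEstimate {
    fn on_pause_request() -> Self {
        Self { ok_to_pause_sent: Instant::now() }
    }

    // source_pause_delay is the source-reported gap between receiving
    // "OK to pause" and actually pausing; subtracting it tightens the
    // overestimate.
    fn migration_time(&self, source_pause_delay: Duration) -> Duration {
        self.ok_to_pause_sent.elapsed().saturating_sub(source_pause_delay)
    }
}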
Using the wallclock difference (and assuming the respective clocks on either side are adequately synced via NTP) seemed like a simple way to get this off the ground. Ultimately we want some means of judging the passage of time (from some global observer) between the point at which the timing measurement (TSC, boot_hrtime, etc) is taken on the source, and when its adjusted values are to be loaded on the destination.
It's been suggested that we could do more sophisticated things here in the future, including something like a little mini implementation of the NTP algo between the two propolis instances, but cheating with wallclock in the short term seemed like a reasonable starting point.
FWIW, the problem I ran into with a negative wall clock delta was in an earlier prototype where the time data was sent immediately after collecting it. In practice (if only because of the longer delay between reading and sending, but hopefully also because NTP is better behaved than it was on the lab machines I was on), my hope is that this won't matter, and I think I'm comfortable with leaving this behavior for now. When we begin to optimize things so that migrations are faster, we may want to reconsider it (and I'm happy to leave #357 as a placeholder for that).
In d09938f, which has a much tighter window between export and import than previous versions of this PR, I clamped the migrate delta to 0 if for some reason we see a negative delta.
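(A minimal sketch of that clamping, assuming wall-clock readings are carried as Durations since the epoch, as the src_wc/dst_wc parameters elsewhere in this PR suggest.)

use std::time::Duration;

// Sketch: checked_sub returns None when dst_wc < src_wc; treat a negative
// wall-clock delta as a zero migration delta.
fn migrate_delta(src_wc: Duration, dst_wc: Duration) -> Duration {
    dst_wc.checked_sub(src_wc).unwrap_or(Duration::ZERO)
}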
Force-pushed from 8ab12e6 to 803081f
Got through most of the review feedback in
async fn time_data_read(&mut self) -> Result<(), MigrateError> {
    let instance_guard = self.vm_controller.instance().lock();
    let vmm_hdl = &instance_guard.machine().hdl;
    let raw = vmm_hdl.export_vm()?;
There's a subtlety in here that I think would be good to fix: export_vm returns a MigrateStateError, which has a From<std::io::Error> impl that converts all errors into MigrateStateError::ImportFailed, which is confusing since we're exporting here. (This hasn't been a problem before because entity state export is infallible.)
I'm not sure of the best way to fix this. We could remove the From impl, but that would break all the uses of ? in the import routines. Maybe we need a StateExportError enum for export_vm to use?
Yeah, this is a great point. I took a stab at making error handling a bit better in cdbe3ef; there is now a MigrateError::TimeData type that we convert all relevant TimeData{Read} phase errors into. Let me know what you think.
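(Roughly the shape this takes; only the TimeData variant is grounded in this PR, and the helper below is made up for illustration.)

// Sketch: a dedicated variant for time-data phase failures, so an export
// error is no longer mislabeled as an import failure.
#[derive(Debug)]
enum MigrateError {
    TimeData(String),
    // ... other variants elided
}

// Hypothetical helper: wrap any displayable error from the time-data phases.
fn time_data_err<E: std::fmt::Display>(e: E) -> MigrateError {
    MigrateError::TimeData(format!("VMM time data phase error: {}", e))
}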
Force-pushed from a310faf to cdbe3ef
cdbe3ef is a medium-sized refactor and addresses a hodgepodge of other review feedback. Specifically:
I have a couple of other small review comments to wrap up still, but wanted to point out that the PHD live migration tests pass again 🎉
Will take a more thorough pass when you've pushed pending updates
Force-pushed from 1ca482c to d09938f
Per recent discussion about landing this ASAP, I squashed my commits and force-pushed after rebasing onto master, to minimize merge conflicts with other recent changes. Additional notes:
Testing notes:
Miscellaneous output for posterity: live-migration-times.d (source)
time-adjustments.d (dest)
example log messages of time adjustments (dest):
A couple of small bits of cleanup, but generally LGTM. Thanks for putting the work into this!
let mut deserializer = ron::Deserializer::from_str(&raw)
    .map_err(codec::ProtocolError::from)?;
let deserializer =
    &mut <dyn erased_serde::Deserializer>::erase(&mut deserializer);
let time_data_src: vmm::time::VmTimeData =
    erased_serde::deserialize(deserializer).map_err(|e| {
        MigrateError::TimeData(format!(
            "VMM Time Data deserialization error: {}",
            e
        ))
    })?;
Since the destination type is known here, I think this can just be written as let time_data_src: VmTimeData = ron::from_str(&raw) (with an appropriate error mapping) -- no need to do any of the type erasure/boxing.
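(Something like the following sketch of that suggestion; the exact error mapping is up to the implementation.)

// Sketch: deserialize straight into the known concrete type.
let time_data_src: vmm::time::VmTimeData = ron::from_str(&raw).map_err(|e| {
    MigrateError::TimeData(format!("VMM Time Data deserialization error: {}", e))
})?;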
fixed in 505caa0
    this->total += this->d_rpush_post;
}
if (this->d_td != 0) {
    printf("%-15s %30d\n", this->td, this->d_td / 1000);
nit: same width specifiers as the surrounding prints? (If the output aligns properly, then no worries; I just remember having to mess around with this to get all the headers to line up right when I added the post-pause phase.)
will fix this (and my own D script nits #359 (comment)) in a follow-on PR
    dst_hrt: i64,
    dst_wc: Duration,
) -> Result<(VmTimeData, VmTimeDataAdjustments), TimeAdjustError> {
    // Basic validation: there is no reason system hrtime should be negative,
Optional/nit: would it be valid to check this when the timestamp is fetched and convert it to an unsigned value there, or do we use it in signed form in other places?
chatted about this offline -- I did use unsigned values for hrtime values that really should never be negative (such as uptime) in earlier versions of writing this code. I found that it was kind of a pain to go back and forth between signed and unsigned, since there are more constrained overflow checks for signed values (and thus I would have to convert any unsigned values back to signed anyway, or otherwise handle that for overflow checks). For my own sanity, I chose to keep all hrtime values signed throughout. Maybe there's a nicer way to write this code, but I would rather not do this for now.
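(A tiny illustration, not the PR's code, of the convenience being described: checked arithmetic on i64 covers both overflow and a negative result without round-tripping through unsigned types.)

// Sketch: one checked_sub plus a sign check handles both failure modes that
// separate u64/i64 conversions would otherwise have to deal with.
fn guest_uptime_ns(src_hrtime: i64, boot_hrtime: i64) -> Option<i64> {
    src_hrtime.checked_sub(boot_hrtime).filter(|uptime| *uptime >= 0)
}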
//
// guest_uptime_ns = source hrtime - source boot_hrtime
let guest_uptime_ns: i64 = src
I'm having trouble squaring this with the comment on line 181. If the exported timer value is normalized, hasn't the source boot_hrtime been subtracted from src.hrtime already?
We chatted about this offline -- there are no timers being normalized/denormalized here; this is part of the calculation of the new boot_hrtime. This code and comment could make what is happening here more clear.
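(For context, a rough sketch of the boot_hrtime calculation being discussed, as I understand it from this thread and the PR description; names are illustrative.)

// Sketch: pick a destination boot_hrtime such that (dst hrtime - boot_hrtime)
// covers the guest's uptime on the source plus the time spent migrating.
fn new_boot_hrtime(dst_hrt: i64, guest_uptime_ns: i64, migrate_delta_ns: i64) -> i64 {
    dst_hrt - (guest_uptime_ns + migrate_delta_ns)
}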
Force-pushed from c1f53d1 to 4893e3b
This PR makes use of the new bhyve VMM timing data interfaces introduced in the accompanying bhyve change (CR link). See RFD 358 for additional background.
Closes #337.
Overview
This change adds the following functionality for guests that are migrated:
To do so, propolis-server makes use of the new bhyve VMM timing data interfaces. On the source side, we read the timing data before the lengthier parts of migration begin. Toward the end, we send this data over. The destination side calculates migration time based on the source and destination wall clocks, adjusts the guest TSC based on migration time, and adjusts the boot_hrtime (the value used for device timers) based on the guest uptime, migration time, and the hrtime of its new host.

A load-bearing assumption in our adjustment calculations is that the wall clocks of the source and destination hosts are synchronized, which is an assumption we are making in other parts of the product.
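As a rough sketch of the TSC side of that adjustment (illustrative names; the actual calculation lives in this PR's vmm time-adjustment code):

// Sketch: advance the guest TSC by the ticks that would have elapsed at the
// guest's TSC frequency over the migration delta.
const NS_PER_SEC: u128 = 1_000_000_000;

fn adjusted_guest_tsc(src_guest_tsc: u64, guest_freq_hz: u64, migrate_delta_ns: u64) -> u64 {
    let ticks = (migrate_delta_ns as u128 * guest_freq_hz as u128) / NS_PER_SEC;
    // Wrapping add: the TSC is a free-running counter that wraps.
    src_guest_tsc.wrapping_add(ticks as u64)
}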
Bhyve VMM timing data interface details
On a read, the timing data interface provides the following:
- boot_hrtime (hrtime of the host when the guest was booted)

Callers writing the data provide the guest-related fields, as well as the hrtime and wall clock time of the host when the caller sent the request. This allows bhyve to make additional adjustments to account for latency between the userspace write and kernel-side receipt of the data.
Timing Adjustment D script
This PR adds a couple of probes where we adjust the timing data, to help validate that the data we are seeing is correct.
Some example output of this script, run for a VM with no load, on my otherwise not-busy test machine:
Notes for reviewers