runtime/trace: TestTraceStress fails while unpark #11320
/cc @aclements for triage |
@alexbrainman how frequently does it happen? Please provide links to builder logs; I can't find them. |
Dmitry, the failure is on the windows-amd64-2003kardianos builder (the middle one of the three windows builders). Recently it fails for nearly every build - the failures are slightly different, but all are about trace parsing. The builder belongs to @kardianos - perhaps he can help you with testing. The test never fails on my PCs here. Thank you for looking into it. Alex |
What is that builder? I don't see it here: |
I really don't know. All I know is that it is one of the oldest ones we have OS-wise. You won't be able to run try-bot on it either. It is running an old builder program. Let's wait for @kardianos - I am sure he will help you to test whatever you like there. Alex |
I'm running it mainly to test the runtime and cgo interface for older OS versions. We could disable this test in some manner, or I could send you the product. |
@kardianos thanks, I see |
FWIW, this happened a few times back in April, but only started happening reliably a few days ago: 2015-04-21T20:50:23-e589e08/windows-amd64-2003kardianos I see e72f5f6 (runtime: fix tracing of syscallexit) on the 18th. Could that have made us more susceptible to this failure, or did it just make us able to detect it? |
@dvyukov Last time we approached this we left it at this: I'll do some testing later to see if I can detect if time goes backwards reliably. Let me know if you don't think that is worthwhile. |
What is your plan? Do you want to detect this in test, or in runtime? |
It looked like Brad wanted to try to put it in the runtime, though I'm not sure about that. If it can be detected cheaply then maybe the runtime too. I'll try to quantify the cost of detection and get back. |
We don't have a way to disable tests per builder yet, nor is there anything in the environment to key off of. I filed #11346. |
I tried to detect "time goes backwards" with something like this:
Output:
But I was unable to detect time going backwards, regardless of how I got "time". Is there a better way to do this? I'm sure I'm missing something. |
Cputicks do not go backwards; they are unsynchronized on different CPUs. So you would need to create a pair of threads, bind them to every possible pair of CPUs, and measure the minimal roundtrip time as perceived by one CPU and by the other. Then figure out from those numbers whether cputicks are unsynchronized. However, there can be various restrictions on binding of threads to CPUs; in such cases the kernel will silently ignore your affinity mask and bind the thread to something else. Here it is important to not end up with two threads bound to the same CPU, because otherwise the roundtrip will be determined by thread context switch time, which is wrong. And, of course, the syscalls for determining affinity masks and actually binding threads are OS-specific. This is going to be messy. |
Are we sure this is an unsynchronized TSCs problem? Why would it have only started up on the 18th? |
I used the following program to test TSC on a machine: |
Just to note, the VM only has one core allocated to it and I pinned the VM threads to a single core with taskset -p 1 X, and the pprof test is still failing. I ran the detection program on the host and it didn't detect a problem with the CPU itself. I'm out of my depth here by a long way. If someone wants to play around on the box that is failing I can spin up access to it, or try something myself. Let me know. |
CL https://golang.org/cl/11288 mentions this issue. |
CL 11288 doesn't prevent these two pprof tests from failing, but it does appear to reduce the occurrence of failures. |
@kardianos then probably RDTSC ticks with very low frequency or episodically inside of the VM... |
@dvyukov I'm guessing that is the case. I would like to propose something like the following:
On my windows 7 box (HW) nextDiffMilli always returns 1. On my linux box it returns 0. On my windows XP VM it returns somewhere between 45 and 15 if unloaded, if loaded it gets worse. |
@kardianos your program prints:
on my windows xp. Windows XP (and maybe others) cannot do ticks better than 15ms. I expect all windows PCs have either 1ms or 15ms tick resolution. Maybe you can look for "missing ticks". Do a set of measurements like yours and see if there are any variations. Keep in mind that the Go profiler is not running by default - it is started on request. But when started, it runs on a separate thread with high priority (see the profileloop1 func in runtime). Mind you, now that I am looking at this code again, it used to work like that when it was in C. Now that the runtime is written in Go, I am not sure that profileloop1 is not rescheduled onto a different thread as it runs. Even worse, the thread that profileloop1 started on (with high priority) might be running some things that should not be running with high priority. @dvyukov what do you think? Alex |
@kardianos @alexbrainman wait, I am talking about RDTSC ticks, not OS time ticks. RDTSC is processor instruction on x86 processors, it is not affected/dependent on OS. |
I just created https://go-review.googlesource.com/11478 which, I thought, should be unrelated to our discussion. But I can reliably see similar failures now (on my windows/amd64 pc):
Why? I guess I made a mistake somewhere, but where? Alex |
Profileloop1 is not a real goroutine; it is always bound to a single thread. I would expect that your change is a no-op. Also, CPU profiling is not enabled in trace tests. Are you sure that it is that change that makes a difference for trace tests? |
@dvyukov could you point me to the code that reads RDTSC ticks? I was able to find code for windows that reads the OS ticks, but not a CPU instruction. @alexbrainman Were those failures on a WinXP box on hardware? On WinXP, when we heavily load the CPU (like the pprof stress tests do), the per-thread time only has a resolution of 450 milliseconds, which is going to break pprof.
Output:
|
@kardianos Here is an example of how to call it: rdtsc.go
rdtsc.s
put both files into a single package and build. |
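The rdtsc.go and rdtsc.s files themselves weren't preserved in the thread. A plausible reconstruction for GOARCH=amd64 (an assumption modeled on how runtime·cputicks is implemented, not the original files) would be, for rdtsc.go:

```go
package rdtsc

// Cputicks is implemented in rdtsc.s using the RDTSC instruction.
func Cputicks() int64
```

and for rdtsc.s:

```asm
#include "textflag.h"

// func Cputicks() int64
// RDTSC returns the 64-bit time-stamp counter split across EDX:EAX;
// combine the halves into one int64 result.
TEXT ·Cputicks(SB), NOSPLIT, $0-8
	RDTSC
	SHLQ	$32, DX
	ORQ	DX, AX
	MOVQ	AX, ret+0(FP)
	RET
```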
@dvyukov you are correct - runtime/pprof tests fail regardless of my changes - the test is flaky. I am not sure if you wanted me to test this:
Regarding https://go-review.googlesource.com/11478 why do you say that the change is a no-op? Why is profileloop1 bound to a single thread? Are you saying that none of the functions we could potentially call from profileloop1 could result in execution being rescheduled onto a different thread? Alex |
Something similar needs to be tested on the @kardianos builder. But the diff must not include println time, e.g. collect diffs in an array first, then print.
AFAIR profileloop1 is not a goroutine; it is created directly with the CreateThread API call. So it always has its own thread. The scheduler preempts only real goroutines running on scheduler threads. |
Thank you Dmitry.
Alex |
@alexbrainman are those numbers from the machine with flaky trace tests? On my machine it prints:
Your numbers look quite high, but I don't see how that can make trace tests flaky. On the contrary, too-low numbers (in particular 0) would make trace tests flaky... |
Yes, the shown output is from my windows amd64 pc with flaky tests. It seems I can easily reproduce them. If you have something for me to try on it, I would be happy to. Thank you. Alex |
with flaky trace tests or with flaky CPU profiler tests? |
Well, the failing TestTraceStress (see my post from yesterday above). That's what we are trying to fix, isn't it? Alex |
@alexbrainman ah, OK, I was confused by the cpu profiler change. |
@dvyukov When I look for low diffs between ticks on my VM I can find them in the range of 1-40 (never zero). The host machine is always 97. This is what I used to determine this:
|
Sorry, holdover code. I had done other tests to show that it is never zero. But modified, it still never shows zero:
|
It is VMware. Inside the VM it says Intel Xeon CPU E5-2680 0 @ 2.70GHz, Windows 7.
It looks like my pc has 1 CPU only. Your program seems like it needs multiple CPUs, doesn't it? Can I use Go? Alex |
@alexbrainman On both our setups where we can reproduce this we only have one CPU. In both cases we never see the same RDTSC value (it never appears to go backwards on our single core). What I do see is under load and on idle the RDTSC has 100x variability in the time difference between ticks. On HW I may get 100 ticks diff each call or 300 ticks diff. But on the VM it varies between 1-30 to 300000 ticks. @dvyukov I wasn't able to find any code that relied on the calculated tick frequency; it looks like the RDTSC time is directly encoded into the field and directly read before the events are sorted. But maybe I missed something. |
How was that proved? |
Run the following program when the host is loaded:

```go
package main

func cputicks() int64 // implemented in assembly, e.g. the rdtsc.s mentioned earlier

func main() {
	for {
		t0, t1 := cputicks(), cputicks()
		if t1-t0 <= 0 {
			println(t0, t1, t1-t0)
			return
		}
	}
}
```
|
I've had this running for half an hour and nothing printed or exited. |
@kardianos @alexbrainman OK, please dump a bad trace to file and attach it here. I will take a look as to what exactly is wrong there. |
From the trace that @kardianos sent me offline:
The last two events happen at the same time. Please check if https://go-review.googlesource.com/#/c/11834/ fixes the tests. |
CL https://golang.org/cl/11834 mentions this issue. |
On some VMs two events can happen at the same time. For example:

179827399 GoStart p=2 g=11 off=936359 g=11
179827399 GoUnblock p=2 g=0 off=936355 g=11

If we do a non-stable sort, the events can be reordered, making the trace inconsistent. Do a stable sort instead. Batches are dumped in FIFO order, so if these same-time events are split into separate batches, stable sort still works. Events on different CPUs go into different batches and can be reordered, but the intention is that causally-related events on different CPUs will have larger (non-zero) time diffs.

Update #11320
Change-Id: Id1df96af41dff68ea1782ab4b23d5afd63b890c9
Reviewed-on: https://go-review.googlesource.com/11834
Reviewed-by: Brad Fitzpatrick <[email protected]>
CL 11834 fixes this issue on my VM. |
CL 11834 fixes this issue on my pc too. Thank you @dvyukov and @kardianos for seeing it till the end. Alex |
From recent windows-amd64-2003kardianos builder log:
@dvyukov how would I debug that? I cannot reproduce the failure here.
Thank you.
Alex