Nondeterminism in llvm-libc++-static.cfg.in :: std/time/time.zone/time.zone.timezone/time.zone.members/sys_info.zdump.pass.cpp #89629
@llvm/issue-subscribers-bug Author: Paul Kirth (ilovepi)
We're seeing some non-deterministic failures in llvm-libc++-static.cfg.in :: std/time/time.zone/time.zone.timezone/time.zone.members/sys_info.zdump.pass.cpp in our CI (https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8750473375533143361/overview).
It's unclear why this is happening, and why this test sometimes passes w/o any (known) changes to our CI configuration/infrastructure or to libcxx. This appears related to 1fda177, so I'm CCing @mordante. Below is the failing diff; the two outputs differ in only 2 lines.
vs.
|
Thanks for the report. This looks really odd. These tests should be deterministic. Did I miss the link? Or can you share the log? |
The full logs are listed below under the |
Also, it seems like it's possible this was some transient thing in our infrastructure, since the test hasn't failed in a while. Why don't we leave this open for another day, and I can close it tomorrow if we don't see more issues? For your investigation, I'd think you're likely fine to ignore this for the time being. I'll be sure to update you if this crops up again. |
I had a quick look. The zdump information on my system matches the output of your zdump. I manually verified the contents of the database, and that too matches what zdump does. So if this is a real issue, libc++ is wrong. Based on your comment above I'll ignore the issue for now. Please let me know if the problem persists. |
Well, it still seems to be happening. I've asked our infra folks and the bots in this pool should be completely stable (same OS versions, packages, VM images, etc.). We suspected this was an issue w/ the region, but have also ruled that out, since passes and failures happen randomly across regions. The diffs from the latest failure (https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8749770631515774145/overview) also look the same as before, so at least the failure mode is consistent. Do you have any ideas about what would be wrong here? |
Oh, just an FYI, our bots use the just built compiler to build libc++. Is it possible there is some nondeterminism in clang itself that was introduced recently? It seems odd that it would only affect this one test though... but that's about the only thing I can think of ATM. |
@mordante We're still seeing this happen with some regularity. I took a deeper look at the patch, and have a suspicion that there may be something subtly wrong w/ some of the merging logic. I'm not that sure though, since it seems most of that logic is driven by errata in the spec. Have you had any success reproducing this? With nondeterminism, I've had success using LD_PRELOAD to change the default allocator, since in the compiler it's usually related to something ordered by pointer value being iterated over. That doesn't seem to be the case here, from a quick look, but I'd be remiss if I didn't share the only thing I've found to make nondeterminism bugs reproduce more reliably. |
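For readers unfamiliar with the failure class mentioned above (something ordered by pointer value being iterated over), here is a minimal sketch of it. It is not taken from libc++ or from the patch under discussion; `Rule`, `make_rules`, `names_by_pointer`, and `names_by_key` are hypothetical names used only for illustration. The first function's output depends on where the allocator happened to place each object, which is why swapping allocators can expose such bugs; the second orders by a stable key instead.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical record type, for illustration only; not a libc++ type.
struct Rule {
    std::string name;
};

// Allocate one Rule per name, in the order given.
std::vector<std::unique_ptr<Rule>> make_rules(const std::vector<std::string>& names) {
    std::vector<std::unique_ptr<Rule>> rules;
    for (const auto& n : names) rules.push_back(std::make_unique<Rule>(Rule{n}));
    return rules;
}

// Fragile: std::set<const Rule*> orders its elements by address, so the
// iteration order depends on where the allocator placed each object.
std::vector<std::string> names_by_pointer(const std::vector<std::unique_ptr<Rule>>& rules) {
    std::set<const Rule*> ordered;
    for (const auto& r : rules) ordered.insert(r.get());
    std::vector<std::string> out;
    for (const Rule* r : ordered) out.push_back(r->name);
    return out;
}

// Stable: order by a key that does not depend on addresses.
std::vector<std::string> names_by_key(const std::vector<std::unique_ptr<Rule>>& rules) {
    std::vector<const Rule*> ordered;
    for (const auto& r : rules) ordered.push_back(r.get());
    std::sort(ordered.begin(), ordered.end(),
              [](const Rule* a, const Rule* b) { return a->name < b->name; });
    std::vector<std::string> out;
    for (const Rule* r : ordered) out.push_back(r->name);
    assert(std::is_sorted(out.begin(), out.end()));  // postcondition of the sort
    return out;
}
```

Calling `names_by_pointer` on the same logical input may give different orders from run to run or under a different allocator, while `names_by_key` gives the same order regardless of allocation order. Whether anything like this is involved in the failing test is not established in this thread.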
@mordante We've had to disable this test in our CI, since it's been flaky. We'd like to remove that workaround in our toolchain build ASAP. Have you had success reproducing the nondeterminism? Our CI uses a stock Debian 11.8 image as the base, so I wouldn't expect there to be issues w/in the image itself, and other than using the runtimes build and therefore the latest compiler, I'm not aware of anything in our build/test environment that is significant. Further, other than some kind of pointer ordering making merge decisions nondeterministic, I can't see how we'd consistently get a single pair of Date/Times wrong. Do you have time to look into this? I feel like this is going to become a larger issue once more people start using the latest libc++ revisions. |
I have been quite busy last week, so I had little time to look into this. I did some tests, but I can't reproduce the issue locally (using Debian 12). Also, during development I executed this test quite often without "random" issues. (Obviously I ran into issues, but they were deterministic and were errors in the implementation.) Did you happen to investigate whether all failures are with the same time zone? If you didn't investigate, can you provide links to additional failures? |
Yes, I've gone through all the logs for both the AArch64 and x86_64 builds and we always see the same diff (the one posted above). We had something similar happen w/ debug info a while back. In that case two entries were reordered based on an iteration order of some container. The debug info wasn't wrong in that case though, just ordered differently. Several maintainers ran our test for hours w/o reproduction, but our bots would reliably repro it, almost every time. Ultimately, I got it to start failing deterministically by running once w/ the system allocator, and then [...] Another thought is that maybe it's possible to swap one of the internal containers w/ [...] I'm not that familiar with your patch, but from what I can tell, just about the only place this could crop up is in the merging logic, since I think that's the only way you'd interpret a value differently. It's entirely possible I'm wrong on that detail though. I've set aside some time tomorrow to try and reproduce the issue w/ the above approaches. |