-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Open
Labels
A-jobserverArea: jobserver, concurrency, parallelismArea: jobserver, concurrency, parallelismC-bugCategory: bugCategory: bugS-triageStatus: This issue is waiting on initial triage.Status: This issue is waiting on initial triage.
Description
Problem
cargo test
randomly hang in Cloud Hypervisor
AArch64 unit test building stage.
Here is a recorded failing job in our CI: https://cloud-hypervisor-jenkins.westus.cloudapp.azure.com/blue/organizations/jenkins/cloud-hypervisor/detail/PR-3236/4/pipeline/248/
cargo version: cargo 1.55.0 (32da73a 2021-08-23)
kernel version: 5.10.0
The problem is highly random, and I saw this problem on only one AArch64 server.
I reproduced the problem and debugged with GDB, here is some observation:
gdb.txt
##################### Backtrace of cargo process: #####################
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
__libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
26 ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) info threads
Id Target Id Frame
* 1 Thread 0xffffa2264ac0 (LWP 665544) "cargo" __libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
2 Thread 0xffff9dea8030 (LWP 666949) "cargo" futex_wait_cancelable (private=0, expected=0, futex_word=0xaaab143ff6dc) at ../sysdeps/nptl/futex-internal.h:183
3 Thread 0xfffeb5dce030 (LWP 671236) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb5dccfb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
4 Thread 0xfffeb0da6030 (LWP 673261) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb0da4fb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
5 Thread 0xfffeb05a2030 (LWP 673755) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb05a0fb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
6 Thread 0xfffeb03a1030 (LWP 673756) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb039ffb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
7 Thread 0xfffeb01a0030 (LWP 673852) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb019efb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
8 Thread 0xfffeaff9f030 (LWP 673932) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeaff9dfb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
9 Thread 0xfffeafd9e030 (LWP 673992) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeafd9cfb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
(gdb) bt
#0 __libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
#1 __libc_write (fd=<optimized out>, buf=0xfffffcee5a7c, nbytes=1) at ../sysdeps/unix/sysv/linux/write.c:24
#2 0x0000aaaad282df14 in std::sys::unix::fd::FileDesc::write () at library/std/src/sys/unix/fd.rs:146
#3 std::sys::unix::fs::File::write () at library/std/src/sys/unix/fs.rs:845
#4 <&std::fs::File as std::io::Write>::write () at library/std/src/fs.rs:680
#5 0x0000aaaad2813b34 in jobserver::imp::Client::release () at library/std/src/panicking.rs:541
#6 0x0000aaaad2814dd0 in <jobserver::Acquired as core::ops::drop::Drop>::drop () at library/std/src/panicking.rs:541
#7 0x0000aaaad21b9da0 in cargo::core::compiler::job_queue::DrainState::drain_the_queue () at library/std/src/panicking.rs:541
#8 0x0000aaaad21f9460 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () at library/std/src/panicking.rs:541
#9 0x0000aaaad21fe4cc in crossbeam_utils::thread::scope () at library/std/src/panicking.rs:541
#10 0x0000aaaad21b8c9c in cargo::core::compiler::job_queue::JobQueue::execute () at library/std/src/panicking.rs:541
#11 0x0000aaaad2411b3c in cargo::core::compiler::context::Context::compile () at library/std/src/panicking.rs:541
#12 0x0000aaaad23692e8 in cargo::ops::cargo_compile::compile_ws () at library/std/src/panicking.rs:541
#13 0x0000aaaad23690f4 in cargo::ops::cargo_compile::compile () at library/std/src/panicking.rs:541
#14 0x0000aaaad23f44a0 in cargo::ops::cargo_test::run_tests () at library/std/src/panicking.rs:541
#15 0x0000aaaad208a1cc in cargo::commands::test::exec () at library/std/src/panicking.rs:541
#16 0x0000aaaad2081abc in cargo::cli::main () at library/std/src/panicking.rs:541
#17 0x0000aaaad208ab6c in cargo::main () at library/std/src/panicking.rs:541
#18 0x0000aaaad20d26f4 in std::sys_common::backtrace::__rust_begin_short_backtrace () at library/std/src/panicking.rs:541
#19 0x0000aaaad208c860 in std::rt::lang_start::{{closure}} () at library/std/src/panicking.rs:541
#20 0x0000aaaad283ca7c in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ops/function.rs:259
#21 std::panicking::try::do_call () at library/std/src/panicking.rs:401
#22 std::panicking::try () at library/std/src/panicking.rs:365
#23 std::panic::catch_unwind () at library/std/src/panic.rs:434
#24 std::rt::lang_start_internal () at library/std/src/rt.rs:34
#25 0x0000aaaad208c608 in main () at library/std/src/panicking.rs:541
(gdb) l
21 in ../sysdeps/unix/sysv/linux/write.c
(gdb)
############# Some rustc processes were also seen: ##########################
root@45e3b95cdbdc:/cloud-hypervisor# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 3612 2564 ? Ss 03:21 0:00 /bin/bash ./scripts/run_unit_tests.sh --hypervisor kvm
root 9 0.6 0.0 4017272 73332 ? Sl 03:21 0:14 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/cargo test --target aarch64-unknown-linux-g
root 2986 0.4 0.1 897372 263324 ? Sl 03:22 0:08 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet_packet /usr/local/r
root 4463 0.1 0.0 899068 169872 ? Sl 03:22 0:02 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name arch --edition=2018 arch
root 4909 0.0 0.0 585264 177472 ? Sl 03:22 0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name vm_allocator --edition=2
root 4910 0.0 0.0 584320 200320 ? Sl 03:22 0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name devices --edition=2018 d
root 4987 0.0 0.0 763788 157652 ? Sl 03:22 0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet_transport /usr/loca
root 5071 0.0 0.0 591956 230064 ? Sl 03:22 0:01 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pci --edition=2018 pci/s
root 5115 0.0 0.0 407344 75080 ? Sl 03:22 0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet /usr/local/rust/reg
r
################## I checked the backtrace of some of them, all like: ###################
(gdb) attach 673941
Attaching to process 673941
[New LWP 673988]
[New LWP 674062]
[New LWP 674063]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
__pthread_clockjoin_ex (threadid=281473285418112, thread_return=0x0, clockid=0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) bt
#0 __pthread_clockjoin_ex (threadid=281473285418112, thread_return=0x0, clockid=0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1 0x0000ffff9b63832c in std::sys::unix::thread::Thread::join () at library/std/src/sys/unix/thread.rs:177
#2 0x0000ffff9c2284bc in std::thread::JoinHandle<T>::join ()
from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#3 0x0000ffff9c241688 in rustc_interface::util::setup_callbacks_and_run_in_thread_pool_with_globals ()
from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#4 0x0000ffff9c295d3c in rustc_driver::RunCompiler::run ()
from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#5 0x0000ffff9c21fce8 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#6 0x0000ffff9c29bb0c in rustc_driver::main () from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#7 0x0000aaaaba0e0c30 in rustc_main::main ()
#8 0x0000aaaaba0e0be4 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#9 0x0000aaaaba0e0bfc in std::rt::lang_start::{{closure}} ()
#10 0x0000ffff9b62da04 in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ops/function.rs:259
#11 std::panicking::try::do_call () at library/std/src/panicking.rs:401
#12 std::panicking::try () at library/std/src/panicking.rs:365
#13 std::panic::catch_unwind () at library/std/src/panic.rs:434
#14 std::rt::lang_start_internal () at library/std/src/rt.rs:34
#15 0x0000aaaaba0e0c5c in main ()
(gdb)
Steps
- git clone https://github.com/cloud-hypervisor/cloud-hypervisor.git
- cd cloud-hypervisor
- cargo test --target aarch64-unknown-linux-gnu --workspace --no-default-features --features kvm
Possible Solution(s)
No response
Notes
No response
Version
cargo 1.55.0 (32da73ab1 2021-08-23)
Metadata
Metadata
Assignees
Labels
A-jobserverArea: jobserver, concurrency, parallelismArea: jobserver, concurrency, parallelismC-bugCategory: bugCategory: bugS-triageStatus: This issue is waiting on initial triage.Status: This issue is waiting on initial triage.