Skip to content

cargo test command hang in build stage on AArch64 #10007

@michael2012z

Description

@michael2012z

Problem

cargo test randomly hang in Cloud Hypervisor AArch64 unit test building stage.

Here is a recorded failing job in our CI: https://cloud-hypervisor-jenkins.westus.cloudapp.azure.com/blue/organizations/jenkins/cloud-hypervisor/detail/PR-3236/4/pipeline/248/

cargo version: cargo 1.55.0 (32da73a 2021-08-23)

kernel version: 5.10.0

The problem is highly random, and I saw this problem on only one AArch64 server.

I reproduced the problem and debugged with GDB, here is some observation:
gdb.txt

##################### Backtrace of cargo process: #####################

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
__libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
26	../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) info threads
  Id   Target Id                                  Frame 
* 1    Thread 0xffffa2264ac0 (LWP 665544) "cargo" __libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
  2    Thread 0xffff9dea8030 (LWP 666949) "cargo" futex_wait_cancelable (private=0, expected=0, futex_word=0xaaab143ff6dc) at ../sysdeps/nptl/futex-internal.h:183
  3    Thread 0xfffeb5dce030 (LWP 671236) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb5dccfb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  4    Thread 0xfffeb0da6030 (LWP 673261) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb0da4fb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  5    Thread 0xfffeb05a2030 (LWP 673755) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb05a0fb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  6    Thread 0xfffeb03a1030 (LWP 673756) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb039ffb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  7    Thread 0xfffeb01a0030 (LWP 673852) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb019efb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  8    Thread 0xfffeaff9f030 (LWP 673932) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeaff9dfb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  9    Thread 0xfffeafd9e030 (LWP 673992) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeafd9cfb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
(gdb) bt
#0  __libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __libc_write (fd=<optimized out>, buf=0xfffffcee5a7c, nbytes=1) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x0000aaaad282df14 in std::sys::unix::fd::FileDesc::write () at library/std/src/sys/unix/fd.rs:146
#3  std::sys::unix::fs::File::write () at library/std/src/sys/unix/fs.rs:845
#4  <&std::fs::File as std::io::Write>::write () at library/std/src/fs.rs:680
#5  0x0000aaaad2813b34 in jobserver::imp::Client::release () at library/std/src/panicking.rs:541
#6  0x0000aaaad2814dd0 in <jobserver::Acquired as core::ops::drop::Drop>::drop () at library/std/src/panicking.rs:541
#7  0x0000aaaad21b9da0 in cargo::core::compiler::job_queue::DrainState::drain_the_queue () at library/std/src/panicking.rs:541
#8  0x0000aaaad21f9460 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () at library/std/src/panicking.rs:541
#9  0x0000aaaad21fe4cc in crossbeam_utils::thread::scope () at library/std/src/panicking.rs:541
#10 0x0000aaaad21b8c9c in cargo::core::compiler::job_queue::JobQueue::execute () at library/std/src/panicking.rs:541
#11 0x0000aaaad2411b3c in cargo::core::compiler::context::Context::compile () at library/std/src/panicking.rs:541
#12 0x0000aaaad23692e8 in cargo::ops::cargo_compile::compile_ws () at library/std/src/panicking.rs:541
#13 0x0000aaaad23690f4 in cargo::ops::cargo_compile::compile () at library/std/src/panicking.rs:541
#14 0x0000aaaad23f44a0 in cargo::ops::cargo_test::run_tests () at library/std/src/panicking.rs:541
#15 0x0000aaaad208a1cc in cargo::commands::test::exec () at library/std/src/panicking.rs:541
#16 0x0000aaaad2081abc in cargo::cli::main () at library/std/src/panicking.rs:541
#17 0x0000aaaad208ab6c in cargo::main () at library/std/src/panicking.rs:541
#18 0x0000aaaad20d26f4 in std::sys_common::backtrace::__rust_begin_short_backtrace () at library/std/src/panicking.rs:541
#19 0x0000aaaad208c860 in std::rt::lang_start::{{closure}} () at library/std/src/panicking.rs:541
#20 0x0000aaaad283ca7c in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
    at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ops/function.rs:259
#21 std::panicking::try::do_call () at library/std/src/panicking.rs:401
#22 std::panicking::try () at library/std/src/panicking.rs:365
#23 std::panic::catch_unwind () at library/std/src/panic.rs:434
#24 std::rt::lang_start_internal () at library/std/src/rt.rs:34
#25 0x0000aaaad208c608 in main () at library/std/src/panicking.rs:541
(gdb) l
21	in ../sysdeps/unix/sysv/linux/write.c
(gdb) 

############# Some rustc processes were also seen: ##########################

root@45e3b95cdbdc:/cloud-hypervisor# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   3612  2564 ?        Ss   03:21   0:00 /bin/bash ./scripts/run_unit_tests.sh --hypervisor kvm
root           9  0.6  0.0 4017272 73332 ?       Sl   03:21   0:14 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/cargo test --target aarch64-unknown-linux-g
root        2986  0.4  0.1 897372 263324 ?       Sl   03:22   0:08 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet_packet /usr/local/r
root        4463  0.1  0.0 899068 169872 ?       Sl   03:22   0:02 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name arch --edition=2018 arch
root        4909  0.0  0.0 585264 177472 ?       Sl   03:22   0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name vm_allocator --edition=2
root        4910  0.0  0.0 584320 200320 ?       Sl   03:22   0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name devices --edition=2018 d
root        4987  0.0  0.0 763788 157652 ?       Sl   03:22   0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet_transport /usr/loca
root        5071  0.0  0.0 591956 230064 ?       Sl   03:22   0:01 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pci --edition=2018 pci/s
root        5115  0.0  0.0 407344 75080 ?        Sl   03:22   0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet /usr/local/rust/reg
r

################## I checked the backtrace of some of them, all like: ###################

(gdb) attach 673941
Attaching to process 673941
[New LWP 673988]
[New LWP 674062]
[New LWP 674063]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
__pthread_clockjoin_ex (threadid=281473285418112, thread_return=0x0, clockid=0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145	pthread_join_common.c: No such file or directory.
(gdb) bt
#0  __pthread_clockjoin_ex (threadid=281473285418112, thread_return=0x0, clockid=0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1  0x0000ffff9b63832c in std::sys::unix::thread::Thread::join () at library/std/src/sys/unix/thread.rs:177
#2  0x0000ffff9c2284bc in std::thread::JoinHandle<T>::join ()
   from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#3  0x0000ffff9c241688 in rustc_interface::util::setup_callbacks_and_run_in_thread_pool_with_globals ()
   from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#4  0x0000ffff9c295d3c in rustc_driver::RunCompiler::run ()
   from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#5  0x0000ffff9c21fce8 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
   from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#6  0x0000ffff9c29bb0c in rustc_driver::main () from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#7  0x0000aaaaba0e0c30 in rustc_main::main ()
#8  0x0000aaaaba0e0be4 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#9  0x0000aaaaba0e0bfc in std::rt::lang_start::{{closure}} ()
#10 0x0000ffff9b62da04 in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
    at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ops/function.rs:259
#11 std::panicking::try::do_call () at library/std/src/panicking.rs:401
#12 std::panicking::try () at library/std/src/panicking.rs:365
#13 std::panic::catch_unwind () at library/std/src/panic.rs:434
#14 std::rt::lang_start_internal () at library/std/src/rt.rs:34
#15 0x0000aaaaba0e0c5c in main ()
(gdb)

Steps

  1. git clone https://github.com/cloud-hypervisor/cloud-hypervisor.git
  2. cd cloud-hypervisor
  3. cargo test --target aarch64-unknown-linux-gnu --workspace --no-default-features --features kvm

Possible Solution(s)

No response

Notes

No response

Version

cargo 1.55.0 (32da73ab1 2021-08-23)

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-jobserverArea: jobserver, concurrency, parallelismC-bugCategory: bugS-triageStatus: This issue is waiting on initial triage.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions