Command inside container hangs forever #253


Closed
fooock opened this issue Aug 29, 2019 · 11 comments · Fixed by #383
Labels
kind/bug Something isn't working upstream/firecracker

Comments


fooock commented Aug 29, 2019

I'm able to run a new busybox container using this command:

sudo firecracker-ctr --address /run/firecracker-containerd/containerd.sock run \
--snapshotter firecracker-naive \
--runtime aws.firecracker \
--rm --env HELLO=world --tty docker.io/library/busybox:latest busybox-1

Once inside, I run these commands:

/ # echo $HELLO
world
/ # ls
bin   dev   etc   home  proc  root  run   sys   tmp   usr   var
/ # ls /sys/block/

The ls command hangs forever. The task can't be killed, paused or destroyed. Logs from firecracker-containerd:

DEBU[2019-08-29T19:09:50.989475168+02:00] prepare snapshot                              key=busybox-1 parent="sha256:0d315111b4847e8cd50514ca19657d1e8d827f4e128d172ce8b2f76a04f3faea"
DEBU[2019-08-29T19:09:51.569003261+02:00] event published                               ns=default topic=/snapshot/prepare type=containerd.events.SnapshotPrepare
DEBU[2019-08-29T19:09:51.572614670+02:00] get snapshot mounts                           key=busybox-1
DEBU[2019-08-29T19:09:55.283637876+02:00] event published                               ns=default topic=/containers/create type=containerd.events.ContainerCreate
DEBU[2019-08-29T19:09:55.288290532+02:00] get snapshot mounts                           key=busybox-1
time="2019-08-29T19:09:55.329489962+02:00" level=debug msg=StartShim runtime=aws.firecracker task_id=busybox-1
time="2019-08-29T19:09:55.329619889+02:00" level=info msg="will start a single-task VM since no VMID has been provided" runtime=aws.firecracker task_id=busybox-1 vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
DEBU[2019-08-29T19:09:55.330150426+02:00] create VM request: VMID:"ccf256ab-2ad4-4380-acc9-4cd6acae2c34" ContainerCount:1 ExitAfterAllTasksDeleted:true  
DEBU[2019-08-29T19:09:55.330178269+02:00] using namespace: default                     
DEBU[2019-08-29T19:09:55.330532743+02:00] waiting on shim process                       vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.351619410+02:00] starting signal loop                          namespace=default path=/run/firecracker-containerd/default/ccf256ab-2ad4-4380-acc9-4cd6acae2c34 pid=13909
INFO[2019-08-29T19:09:55.351810128+02:00] creating new VM                               runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.386745469+02:00] Called startVMM(), setting up a VMM on /var/run/firecracker-containerd/default/ccf256ab-2ad4-4380-acc9-4cd6acae2c34/firecracker.sock  runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.399817508+02:00] refreshMachineConfiguration: [GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate:T2 HtEnabled:0xc00047819b MemSizeMib:0xc000478190 VcpuCount:0xc000478188}  runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.400159781+02:00] PutGuestBootSource: [PUT /boot-source][204] putGuestBootSourceNoContent   runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.400185833+02:00] Attaching drive /var/run/firecracker-containerd/default/ccf256ab-2ad4-4380-acc9-4cd6acae2c34/stub0, slot stub0, root false.  runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.400653874+02:00] Attached drive /var/run/firecracker-containerd/default/ccf256ab-2ad4-4380-acc9-4cd6acae2c34/stub0: [PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent   runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.400676438+02:00] Attaching drive /var/lib/firecracker-containerd/runtime/default-rootfs.img, slot root_drive, root true.  runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.401031501+02:00] Attached drive /var/lib/firecracker-containerd/runtime/default-rootfs.img: [PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent   runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.410509688+02:00] startInstance successful: [PUT /actions][204] createSyncActionNoContent   runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:55.410539751+02:00] calling agent                                 runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:56.211042010+02:00] successfully started the VM                   runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
DEBU[2019-08-29T19:09:56.211351012+02:00] event forwarded                               ns=default topic=/firecracker-vm/start type=VMStart
INFO[2019-08-29T19:09:56.212777900+02:00] PatchGuestDrive successful                    runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
INFO[2019-08-29T19:09:56.342161558+02:00] successfully created task                     ExecID= TaskID=busybox-1 pid_in_vm=742 runtime=aws.firecracker vmID=ccf256ab-2ad4-4380-acc9-4cd6acae2c34
DEBU[2019-08-29T19:09:56.342434728+02:00] event forwarded                               ns=default topic=/tasks/create type=containerd.events.TaskCreate
DEBU[2019-08-29T19:09:56.348199604+02:00] event forwarded                               ns=default topic=/tasks/start type=containerd.events.TaskStart

If I stop and restart firecracker-containerd, these lines appear in the log:

DEBU[2019-08-29T19:43:39.599290608+02:00] loading tasks in namespace                    namespace=default
WARN[2019-08-29T19:43:39.599402490+02:00] cleaning up after shim disconnected           id=busybox-1 namespace=default
INFO[2019-08-29T19:43:39.599413056+02:00] cleaning up dead shim                        
WARN[2019-08-29T19:43:39.622662715+02:00] failed to clean up after shim disconnected    error="aws.firecracker: failed to connect: dial unix /run/firecracker-containerd/containerd.sock.ttrpc: connect: connection refused\n: exit status 1" id=busybox-1 namespace=default
DEBU[2019-08-29T19:43:39.622741381+02:00] event published                               ns=default topic=/tasks/exit type=containerd.events.TaskExit

Why does the ls command hang forever? What can I do to avoid this? Maybe set up a container timeout?

@sipsma sipsma added the kind/bug Something isn't working label Aug 29, 2019

sipsma commented Aug 29, 2019

I just tried this and am able to reproduce it, though it seems inconsistent (sometimes a command will hang almost right away, other times you can run a bunch of commands, including ls /sys/block, before a hang occurs).

This seems to be a bug. The stdio is going from ctr->fifo->shim->vsock->agent->fifo->container and back, so there are a lot of places this could be breaking. Will need more investigation.


sipsma commented Aug 29, 2019

I locally changed the code to have the VM agent write all output to /dev/console and to forward all VM stdout+stderr to containerd logs. I got similar behavior to happen, though this time the command only hung for a little bit (maybe 15-30ish seconds, I wasn't counting) before the task just exited.

The logs from the VM are actually showing what looks to be corruption in /dev/vdb (the container's rootfs block device) and a subsequent kernel crash:

time="2019-08-29T18:51:42.432583477Z" level=info msg="[   16.796616] EXT4-fs error (device vdb): ext4_validate_inode_bitmap:99: comm sh: Corrupt inode bitmap - block_group = 0, inode_bitmap = 137" runtime=aws.firecracker vmID=c67de549-ad55-4094-9c14-170bfca3e616
time="2019-08-29T18:51:42.436447051Z" level=info msg="[   16.800483] EXT4-fs error (device vdb): ext4_validate_inode_bitmap:99: comm sh: Corrupt inode bitmap - block_group = 1, inode_bitmap = 138" runtime=aws.firecracker vmID=c67de549-ad55-4094-9c14-170bfca3e616
time="2019-08-29T18:52:31.569909673Z" level=info msg="[   65.936647] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008" runtime=aws.firecracker vmID=c67de549-ad55-4094-9c14-170bfca3e616

Here's the full containerd logs that snippet was extracted from, which has the full kernel stack trace: containerd.log

So the stdio being hung may be a side effect of whatever underlying issue is happening with the block device, though it's worth reproducing again to make sure the block device corruption is the only issue.


sipsma commented Oct 5, 2019

I'm growing a bit more concerned about this issue. I have been hitting it again when doing unrelated manual tests, including when I do absolutely no interaction with /sys/block.

Out of curiosity, I tried running it with the devmapper snapshotter to see if it's something specific to the naive snapshotter, but the kernel panic still happens.


sipsma commented Oct 7, 2019

I am able to reproduce this in an automated test case now by just doing repeated writes to the filesystem, see my fork here: https://github.com/sipsma/firecracker-containerd/blob/kernelpanic/runtime/service_integ_test.go#L791-L824

When the linked test case runs, it just hangs indefinitely, but if you look at runtime/logs/containerd.out you can see the kernel panic from the guest VM (forwarded from the guest's journald to the host's containerd logs).

So this appears to have nothing to do with /sys/block; any writes to the filesystem will trigger it. Our previous automated tests don't make many, if any, writes to the filesystem, which is why this went unnoticed.

Here's containerd logs with the relevant output from a run of the test:

time="2019-10-07T18:50:31.645464292Z" level=debug msg="event forwarded" ns=default topic=/tasks/start type=containerd.events.TaskStart
time="2019-10-07T18:50:31.735877692Z" level=info msg="[    1.088090] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x21272259925, max_idle_ns: 440795269411 ns" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:32.315266052Z" level=info msg="\r" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:32.318642684Z" level=info msg="Debian GNU/Linux 9 microvm ttyS0" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:32.318855723Z" level=info runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.789783732Z" level=info msg="microvm login: [   16.142995] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.791268329Z" level=info msg="[   16.145131] IP: jbd2_journal_add_journal_head+0xec/0x140" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.791957103Z" level=info msg="[   16.146595] PGD 0 P4D 0 " runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.792900354Z" level=info msg="[   16.147292] Oops: 0002 [#1] SMP PTI" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.793732845Z" level=info msg="[   16.148233] Modules linked in:" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.795653664Z" level=info msg="[   16.149064] CPU: 0 PID: 799 Comm: dd Not tainted 4.14.55-84.37.amzn2.x86_64 #1" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.797211764Z" level=info msg="[   16.150982] task: ffff88000674b700 task.stack: ffffc900005f0000" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.798774685Z" level=info msg="[   16.152548] RIP: 0010:jbd2_journal_add_journal_head+0xec/0x140" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.799673359Z" level=info msg="[   16.154108] RSP: 0018:ffffc900005f3b10 EFLAGS: 00010206" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.800688839Z" level=info msg="[   16.154996] RAX: 0000000000000000 RBX: ffff880006fccf08 RCX: 0000000000000000" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.801686779Z" level=info msg="[   16.156011] RDX: ffff880006fccf0b RSI: ffff880006fccf08 RDI: ffff880006fccf08" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.802699130Z" level=info msg="[   16.157006] RBP: ffffc900005f3b18 R08: ffff880006fccf08 R09: 0000000000997040" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.803689718Z" level=info msg="[   16.158007] R10: ffffea00000e05c0 R11: ffffc900005f3a48 R12: ffff880006fec000" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.804692640Z" level=info msg="[   16.159010] R13: ffffffff81a26cd0 R14: ffff880006fec000 R15: ffff880006f7c000" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.805822770Z" level=info msg="[   16.160015] FS:  00007f4617bd8d48(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.806615513Z" level=info msg="[   16.161147] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.807611381Z" level=info msg="[   16.161940] CR2: 0000000000000008 CR3: 000000000661a004 CR4: 00000000003606b0" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.808593270Z" level=info msg="[   16.162933] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.809594327Z" level=info msg="[   16.163913] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.809965617Z" level=info msg="[   16.164921] Call Trace:" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.810658580Z" level=info msg="[   16.165284]  jbd2_journal_get_write_access+0x3c/0x60" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.811415684Z" level=info msg="[   16.165985]  __ext4_journal_get_write_access+0x2e/0x60" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.812008668Z" level=info msg="[   16.166743]  __ext4_new_inode+0x602/0x1370" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.812515930Z" level=info msg="[   16.167337]  ext4_create+0xa4/0x1b0" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.813050220Z" level=info msg="[   16.167848]  path_openat+0x13c5/0x16e0" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.813607733Z" level=info msg="[   16.168386]  ? __perf_sw_event+0x66/0x90" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.814112007Z" level=info msg="[   16.168935]  do_filp_open+0x96/0x110" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.814626616Z" level=info msg="[   16.169439]  ? __alloc_fd+0xaf/0x160" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.815142938Z" level=info msg="[   16.169953]  do_sys_open+0x1b5/0x250" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.815683911Z" level=info msg="[   16.170468]  ? do_sys_open+0x1b5/0x250" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.816149635Z" level=info msg="[   16.171011]  SyS_open+0x19/0x20" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.816690698Z" level=info msg="[   16.171476]  do_syscall_64+0x67/0x100" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.817395895Z" level=info msg="[   16.172018]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.817924121Z" level=info msg="[   16.172722] RIP: 0033:0x7f4617b9b080" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.819007228Z" level=info msg="[   16.173246] RSP: 002b:00007ffd70f6f018 EFLAGS: 00000246 ORIG_RAX: 0000000000000002" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.820024699Z" level=info msg="[   16.174334] RAX: ffffffffffffffda RBX: 00007f4617bd8d48 RCX: 00007f4617b9b080" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.821035056Z" level=info msg="[   16.175350] RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007ffd70f6ff71" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.822064963Z" level=info msg="[   16.176361] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.823077767Z" level=info msg="[   16.177388] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.824114244Z" level=info msg="[   16.178404] R13: 000056231581a880 R14: 000056231581a880 R15: 0000000000000000" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.826746947Z" level=info msg="[   16.179439] Code: f7 c2 00 00 00 01 75 f3 e9 3c ff ff ff 48 8b 53 10 48 85 d2 74 0b 48 83 7a 08 00 0f 85 4d ff ff ff 0f 0b 48 8b 4b 40 48 8d 53 03 <83> 41 08 01 3e 80 22 fe 48 85 c0 74 08 48 89 c7 e8 ef d7 ff ff " runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.827768471Z" level=info msg="[   16.182075] RIP: jbd2_journal_add_journal_head+0xec/0x140 RSP: ffffc900005f3b10" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.828248824Z" level=info msg="[   16.183093] CR2: 0000000000000008" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a
time="2019-10-07T18:50:46.828910376Z" level=info msg="[   16.183576] ---[ end trace ddd07bdb86724ef2 ]---" runtime=aws.firecracker vmID=4ff80912-6597-4790-a2b0-207ffc6a695a

nmeyerhans commented:

Just tested an up-to-date Linux 4.14.147 kernel with the same results:

time="2019-10-07T22:00:24.422599683Z" level=info msg="microvm login: [   16.136814] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.424568018Z" level=info msg="[   16.139814] IP: jbd2_journal_add_journal_head+0xf0/0x130" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.425528939Z" level=info msg="[   16.141767] PGD 0 P4D 0 " runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.426830121Z" level=info msg="[   16.142727] Oops: 0002 [#1] SMP PTI" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.427965564Z" level=info msg="[   16.144020] Modules linked in:" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.430035095Z" level=info msg="[   16.145163] CPU: 0 PID: 833 Comm: dd Not tainted 4.14.147 #1" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.431409213Z" level=info msg="[   16.147235] task: ffff88800665a940 task.stack: ffffc9000079c000" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.432761013Z" level=info msg="[   16.148602] RIP: 0010:jbd2_journal_add_journal_head+0xf0/0x130" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.433956120Z" level=info msg="[   16.149938] RSP: 0018:ffffc9000079fb28 EFLAGS: 00010206" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.435574502Z" level=info msg="[   16.151141] RAX: 0000000000000000 RBX: ffff888006ff5208 RCX: 0000000000000000" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.437208790Z" level=info msg="[   16.152764] RDX: ffff888006ff520b RSI: ffff888006ff5208 RDI: ffff888006ff5208" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.438819897Z" level=info msg="[   16.154396] RBP: ffffc9000079fb30 R08: ffff888006ff5208 R09: 00000000009af040" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.440461674Z" level=info msg="[   16.156012] R10: ffff888006f68dd8 R11: 0000000000000040 R12: 00000000000003b0" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.441670699Z" level=info msg="[   16.157642] R13: ffffffff81a219e0 R14: ffff888004a5f000 R15: ffff888006ff5208" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.442916110Z" level=info msg="[   16.158856] FS:  00007f6c32144d48(0000) GS:ffff888007c00000(0000) knlGS:0000000000000000" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.443815797Z" level=info msg="[   16.160098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.444910132Z" level=info msg="[   16.160993] CR2: 0000000000000008 CR3: 00000000066c4002 CR4: 00000000003606b0" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.446003050Z" level=info msg="[   16.162094] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.447116353Z" level=info msg="[   16.163193] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.447498292Z" level=info msg="[   16.164296] Call Trace:" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.448273634Z" level=info msg="[   16.164682]  jbd2_journal_get_write_access+0x27/0x60" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.449059674Z" level=info msg="[   16.165461]  __ext4_journal_get_write_access+0x2c/0x70" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.449692092Z" level=info msg="[   16.166247]  __ext4_new_inode+0x3ba/0x14e0" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.450228471Z" level=info msg="[   16.166879]  ext4_create+0xe0/0x1c0" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.450818911Z" level=info msg="[   16.167418]  path_openat+0x1344/0x16d0" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.451459170Z" level=info msg="[   16.168006]  ? filemap_map_pages+0x231/0x320" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.451952060Z" level=info msg="[   16.168636]  do_filp_open+0x8e/0x100" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.452475862Z" level=info msg="[   16.169141]  ? __alloc_fd+0xad/0x140" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.452982363Z" level=info msg="[   16.169661]  do_sys_open+0x17f/0x210" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.453423365Z" level=info msg="[   16.170158]  SyS_open+0x19/0x20" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.453941496Z" level=info msg="[   16.170598]  do_syscall_64+0x67/0x100" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.454648263Z" level=info msg="[   16.171119]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.455154584Z" level=info msg="[   16.171828] RIP: 0033:0x7f6c32107080" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.456187495Z" level=info msg="[   16.172330] RSP: 002b:00007ffd1ea7cdb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.457155823Z" level=info msg="[   16.173362] RAX: ffffffffffffffda RBX: 00007f6c32144d48 RCX: 00007f6c32107080" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.458133747Z" level=info msg="[   16.174332] RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007ffd1ea7df71" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.459118026Z" level=info msg="[   16.175321] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.460102493Z" level=info msg="[   16.176296] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.461083935Z" level=info msg="[   16.177278] R13: 0000563aabb17880 R14: 0000563aabb17880 R15: 0000000000000000" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.463693372Z" level=info msg="[   16.178270] Code: f7 c2 00 00 00 01 75 f3 e9 38 ff ff ff 48 8b 53 10 48 85 d2 74 0b 48 83 7a 08 00 0f 85 49 ff ff ff 0f 0b 48 8b 4b 40 48 8d 53 03 <83> 41 08 01 3e 80 22 fe 48 85 c0 74 08 48 89 c7 e8 cb d9 ff ff " runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.464734206Z" level=info msg="[   16.180872] RIP: jbd2_journal_add_journal_head+0xf0/0x130 RSP: ffffc9000079fb28" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.465196248Z" level=info msg="[   16.181907] CR2: 0000000000000008" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd
time="2019-10-07T22:00:24.465830254Z" level=info msg="[   16.182373] ---[ end trace fa21c3a14971e3f0 ]---" runtime=aws.firecracker vmID=f8cc548b-4746-4393-b653-b61432ba26dd

nmeyerhans commented:

I think there's an issue with the stub drive patching. In @sipsma's test, we try to write 128 1 MB files. However, if we remove the loop and only try to write a single 1 MB file, we still crash. We also see the following in the kernel output prior to the crash, which doesn't look right:

time="2019-10-07T22:27:49.925693639Z" level=info msg="[    0.846179] virtio_blk virtio1: new size: 0 512-byte logical blocks (0 B/0 B)" runtime=aws.firecracker vmID=c1bc0ee1-402f-4979-ba54-061f75c6c169                                                    
time="2019-10-07T22:27:49.927660833Z" level=info msg="[    0.848864] vdb: detected capacity change from 512 to 0" runtime=aws.firecracker vmID=c1bc0ee1-402f-4979-ba54-061f75c6c169                                                                          
time="2019-10-07T22:27:49.933996689Z" level=info msg="[    0.853626] agent[759]: time=\"2019-10-07T22:27:49Z\" level=info msg=create ExecID= TaskID=blockwriter" runtime=aws.firecracker vmID=c1bc0ee1-402f-4979-ba54-061f75c6c169                           
time="2019-10-07T22:27:49.939360510Z" level=info msg="[    0.860731] EXT4-fs (vdb): mounted filesystem with ordered data mode. Opts: (null)" runtime=aws.firecracker vmID=c1bc0ee1-402f-4979-ba54-061f75c6c169                                               

nmeyerhans commented:

If I manually patch a drive, the right thing happens (this replaces a 512 B drive with a 100 MB drive):

[  444.228864] virtio_blk virtio2: new size: 204800 512-byte logical blocks (105 MB/100 MiB)                                 
[  444.231925] vdc: detected capacity change from 512 to 104857600                                                           

nmeyerhans commented:

But if the backing store for the drive changes from a plain file to a block device, we see the capacity drop to zero, as we see during the panic-inducing test:

[  587.545650] virtio_blk virtio2: new size: 0 512-byte logical blocks (0 B/0 B)                             
[  587.548366] vdc: detected capacity change from 104857600 to 0 


kzys commented Oct 7, 2019

When firecracker-containerd patches a drive, Firecracker inspects the size of the backing file by calling std::fs::metadata, which ultimately calls stat() on Unix.

However, since our stub drive is backed by a loopback device, the size reported by stat() is always zero. Luckily, there is an ioctl() we can use to get the size of a block device (thanks @sipsma!). Firecracker should call this ioctl() instead of relying on stat() when the backing file is a block device.


kzys commented Nov 18, 2019

The fix has been merged! Waiting for Firecracker's next release. There are no further actions on our end, assuming Firecracker 0.20 doesn't have any breaking changes :)


kzys commented Nov 29, 2019

Seems 0.20.0 is coming soon.

firecracker-microvm/firecracker#1433

kzys closed this as completed in #383 on Feb 18, 2020