Panic in the Upstairs leaves an instance in a zombie coma #1652

@leftwo

While testing core files for this issue, I deliberately made the crucible upstairs panic. After the panic I did get a core file, and I can see that the propolis-server service has restarted:

Aug 29 17:15:26.270 INFO accepted connection, remote_addr: [fd00:1122:3344:101::1]:63263, local_addr: [fd00:1122:3344:101::c]:12400
Aug 29 17:15:26.271 INFO request completed, response_code: 101, uri: /instance/serial, method: GET, req_id: a759e184-bbb3-4243-b7d9-0626bac14639, remote_addr: [fd00:1122:3344:101::1]:63263, local_addr: [fd00:1122:3344:101::c]:12400
Aug 29 17:15:55.387 INFO rdmsr, msr: 3221291675, sync_task: vcpu-1, component: vmm
Aug 29 17:15:55.387 INFO rdmsr, msr: 3221291673, sync_task: vcpu-1, component: vmm
Scrub at offset 335616/4194304 sp:335616
thread 'tokio-runtime-worker' panicked at 'We are going to panic now!', /home/alan/ws/crucible/upstairs/src/volume.rs:402:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Aug 29 10:17:39 Stopping because all processes in service exited. ]
[ Aug 29 10:17:39 Executing stop method (:kill). ]
[ Aug 29 10:17:39 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml [fd00:1122:3344:101::c]:12400 --metric-addr [fd00:1122:3344:101::3]:12221 &"). ]
[ Aug 29 10:17:39 Method "start" exited with status 0. ]
Aug 29 17:17:39.784 ERRO could not query reservoir Os { code: 1, kind: PermissionDenied, message: "Not owner" }
Aug 29 17:17:39.784 INFO Metrics server will use InstanceMetricsConfig { propolis_addr: [fd00:1122:3344:101::c]:12400, metric_addr: [fd00:1122:3344:101::3]:12221 }
Aug 29 17:17:39.784 INFO Starting server...
Aug 29 17:17:39.785 INFO listening, local_addr: [fd00:1122:3344:101::c]:12400

However, my instance did not come back. The console and API think it's running:

alan@atrium:omicron-files$ oxide instance view -o myorg -p myproj debian
 time_run_state_updated | 3 minutes ago                        
 time_modified          | 3 minutes ago                        
 time_created           | 3 minutes ago                        
 run_state              | running                              
 project_id             | 637caf31-5b5b-41d7-ab16-6175dd1b98a5 
 ncpus                  | 2                                    
 memory                 | 1073741824                           
 hostname               | debian                               
 description            | debian                               
 name                   | debian                               
 id                     | f71d0d33-7b1c-47dd-888f-dcf3aa5b5b85 

I attempted to stop it, but the stop request fails:

alan@atrium:omicron-files$ oxide instance stop -o myorg -p myproj debian
Type debian to confirm stop:: debian
✘ Oxide API internal error: Internal Server Error

The propolis log reports:

Aug 29 17:20:25.096 INFO request completed, error_message_external: Internal Server Error, error_message_internal: Server not initialized (no instance), response_code: 500, uri: /instance/state, method: PUT, req_id: 05b58734-2f88-4864-b279-123d33ffc83f, remote_addr: [fd00:1122:3344:101::1]:53026, local_addr: [fd00:1122:3344:101::c]:12400
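
The mismatch so far: the API records the instance as `running`, while the restarted propolis-server answers the state request with a 500 and "Server not initialized (no instance)". A minimal sketch of a check that would flag this combination follows. The names `Verdict` and `classify` are hypothetical and this is not Omicron or Propolis code; only the error string and the `run_state` value are taken from the output above.

```python
from enum import Enum


class Verdict(Enum):
    CONSISTENT = "consistent"  # server answered the request without an error
    ZOMBIE = "zombie"          # recorded as running, but the server has no instance
    UNKNOWN = "unknown"        # some other failure; needs a closer look


# Internal error string copied from the propolis log line above.
NO_INSTANCE = "Server not initialized (no instance)"


def classify(recorded_run_state: str,
             response_code: int,
             error_message_internal: str = "") -> Verdict:
    """Compare the recorded run state with what the server reported.

    Hypothetical helper: the inputs mirror the `run_state` field from
    `oxide instance view` and the `response_code` and
    `error_message_internal` fields from the propolis log.
    """
    if response_code < 400:
        return Verdict.CONSISTENT
    if recorded_run_state == "running" and NO_INSTANCE in error_message_internal:
        return Verdict.ZOMBIE
    return Verdict.UNKNOWN
```

With the values from this report, `classify("running", 500, "Server not initialized (no instance)")` yields `Verdict.ZOMBIE`.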

Another attempt to stop it results in the command hanging (I gave up after 10 minutes):

alan@atrium:omicron-files$ oxide instance stop --confirm -o myorg -p myproj debian
⠁ Waiting for instance status to be `stopped`
