diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
index feeab637eb..5314f48bed 100644
--- a/tests/graceful-recovery/graceful-recovery.md
+++ b/tests/graceful-recovery/graceful-recovery.md
@@ -34,18 +34,18 @@ Ensure that NGF can recover gracefully from container failures without any user
 3. Check out the latest tag (unless you are installing the edge version from the main branch).
 4. Go into `deploy/manifests/nginx-gateway.yaml` and change `runAsNonRoot` from `true` to `false`.
 This allows us to insert our ephemeral container as root which enables us to restart the nginx-gateway container.
-5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
+5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/site/content/installation/installing-ngf/manifests.md)
 to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
 6. In a separate terminal track NGF logs.
 
     ```console
-    kubectl -n nginx-gateway logs -f deploy/nginx-gateway
+    kubectl -n nginx-gateway logs -f deploy/nginx-gateway -c nginx-gateway
     ```
 
 7. In a separate terminal track NGINX container logs.
 
     ```console
-    kubectl -n nginx-gateway logs -f -c nginx
+    kubectl -n nginx-gateway logs -f deploy/nginx-gateway -c nginx
     ```
 
 8. In a separate terminal Exec into the NGINX container inside the NGF pod.
@@ -56,9 +56,7 @@ to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalan
 
 9. In a different terminal, deploy the
 [https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
-10. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and check `http.conf` and `config-version.config` to see
-if the configuration and version were correctly updated.
-11. Send traffic through the example application and ensure it is working correctly.
+10. Send traffic through the example application and ensure it is working correctly.
 
 ### Run the tests
 
@@ -80,25 +78,22 @@ if the configuration and version were correctly updated.
 4. Check for errors in the NGF and NGINX container logs.
 5. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
 6. Open up the NGF and NGINX container logs and check for errors.
-7. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
-8. Send traffic through the example application and ensure it is working correctly.
-9. Check that NGF can still process changes of resources.
+7. Send traffic through the example application and ensure it is working correctly.
+8. Check that NGF can still process changes of resources.
     1. Delete the HTTPRoute resources.
 
         ```console
         kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
         ```
 
-    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-    4. Apply the HTTPRoute resources.
+    2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+    3. Apply the HTTPRoute resources.
 
         ```console
         kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
         ```
 
-    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+    4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
 
 #### Restart NGINX container
 
@@ -113,24 +108,21 @@ if the configuration and version were correctly updated.
 
 4. When NGINX container is back up, ensure traffic flows through the example application correctly.
 5. Open up the NGINX container logs and check for errors.
-6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
-7. Check that NGF can still process changes of resources.
+6. Check that NGF can still process changes of resources.
     1. Delete the HTTPRoute resources.
 
         ```console
         kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
         ```
 
-    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-    4. Apply the HTTPRoute resources.
+    2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+    3. Apply the HTTPRoute resources.
 
         ```console
         kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
         ```
 
-    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+    4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
 
 #### Restart Node with draining
 
@@ -156,26 +148,23 @@ if the configuration and version were correctly updated.
     docker restart kind-control-plane
     ```
 
-7. Open up both NGF and NGINX container logs and check for errors.
-8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
-9. Send traffic through the example application and ensure it is working correctly.
-10. Check that NGF can still process changes of resources.
+7. Check the logs of the old and new NGF and NGINX containers for errors.
+8. Send traffic through the example application and ensure it is working correctly.
+9. Check that NGF can still process changes of resources.
     1. Delete the HTTPRoute resources.
 
         ```console
         kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
         ```
 
-    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-    4. Apply the HTTPRoute resources.
+    2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+    3. Apply the HTTPRoute resources.
 
         ```console
         kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
         ```
 
-    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+    4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
 
 #### Restart Node without draining
 
diff --git a/tests/graceful-recovery/results/1.1.0/1.1.0.md b/tests/graceful-recovery/results/1.1.0/1.1.0.md
new file mode 100644
index 0000000000..107aa8c7ac
--- /dev/null
+++ b/tests/graceful-recovery/results/1.1.0/1.1.0.md
@@ -0,0 +1,139 @@
+# Results for v1.1.0
+
+
+- [Results for v1.1.0](#results-for-v110)
+  - [Summary](#summary)
+  - [Versions](#versions)
+  - [Tests](#tests)
+    - [Restart nginx-gateway container](#restart-nginx-gateway-container)
+    - [Restart NGINX container](#restart-nginx-container)
+    - [Restart Node with draining](#restart-node-with-draining)
+    - [Restart Node without draining](#restart-node-without-draining)
+  - [Future Improvements](#future-improvements)
+
+
+
+## Summary
+
+- No new issues since 1.0.
+- One new error in the [Restart Node with draining](#restart-node-with-draining) test, but it is not actionable.
+
+## Versions
+
+NGF version:
+
+
+```text
+commit: d6bbdba28a0f9ae3f75864855b76b0fb34bee3e5
+date: 2023-12-05T18:43:51Z
+version: edge
+```
+
+with NGINX:
+
+```text
+nginx/1.25.3
+built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
+OS: Linux 5.15.49-linuxkit-pr
+```
+
+
+Kubernetes:
+
+```text
+Server Version: version.Info{Major:"1", Minor:"28",
+GitVersion:"v1.28.0",
+GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d",
+GitTreeState:"clean", BuildDate:"2023-08-15T21:26:40Z",
+GoVersion:"go1.20.7", Compiler:"gc",
+Platform:"linux/arm64"}
+```
+
+## Tests
+
+### Restart nginx-gateway container
+
+No errors.
+
+### Restart NGINX container
+
+The NGF Pod was unable to recover after sending a SIGKILL signal to the NGINX master process.
+The following appeared in the NGINX logs:
+
+```text
+2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
+2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
+2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
+2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 22:18:45 [notice] 116#116: try again to bind() after 500ms
+```
+
+NGF cannot update NGINX after this and logs the following error:
+
+```text
+{
+  "level": "error",
+  "ts": "2023-12-05T22:19:53Z",
+  "logger": "eventLoop.eventHandler",
+  "msg": "Failed to update NGINX configuration",
+  "batchID": 22,
+  "error": "failed to reload NGINX: open /proc/19/task/19/children: no such file or directory",
+  "stacktrace": "github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:116\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"
+}
+```
+
+Known issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108
+
+
+### Restart Node with draining
+
+Previous NGF container error:
+
+```json
+{
+  "level": "error",
+  "ts": "2023-12-05T21:43:31Z",
+  "logger": "eventLoop.eventHandler",
+  "msg": "Failed to update NGINX configuration",
+  "batchID": 11,
+  "error": "failed to reload NGINX: could not get expected config version 7: error getting client: Get \"http://config-version/version\": dial unix /var/run/nginx/nginx-config-version.sock: connect: no such file or directory",
+  "stacktrace": "github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:116\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"
+}
+```
+
+This error is likely due to NGINX terminating during a reload attempt and does not consistently occur on a node restart.
+
+No errors in previous NGINX container.
+No errors in new NGF/NGINX containers.
+
+### Restart Node without draining
+
+The NGF Pod was unable to recover the majority of times after running `docker restart kind-control-plane`.
+
+The following appeared in the NGINX logs:
+
+```text
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: still could not bind()
+```
+
+The following appeared in the NGF logs:
+
+```text
+failed to start control loop: cannot create nginx metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
+```
+
+Known issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108
+
+## Future Improvements
+
+- None