# Longevity Test

This document describes how we test NGF for longevity.

<!-- TOC -->

- [Longevity Test](#longevity-test)
  - [Goals](#goals)
  - [Test Environment](#test-environment)
  - [Steps](#steps)
    - [Start](#start)
    - [Check the Test is Running Correctly](#check-the-test-is-running-correctly)
    - [End](#end)
  - [Analyze](#analyze)
  - [Results](#results)

<!-- TOC -->
## Goals

- Ensure that NGF successfully processes both control plane and data plane transactions over a period of time much
  greater than in our other tests.
- Catch bugs that only appear over longer periods of time (such as resource leaks).
## Test Environment

- A Kubernetes cluster with 3 nodes on GKE
  - Node: e2-medium (2 vCPU, 4GB memory)
  - GKE logging enabled.
  - GKE Cloud Monitoring enabled with the managed Prometheus service, collecting:
    - system metrics.
    - kube state metrics - pods, deployments.
- Tester VMs:
  - Configuration:
    - Debian
    - Installed packages: tmux, wrk
  - Location - same zone as the Kubernetes cluster.
  - First VM - for sending HTTP traffic
  - Second VM - for sending HTTPS traffic
- NGF
  - Deployment with 1 replica
  - Exposed via a Service with type LoadBalancer, private IP
  - Gateway with two listeners - HTTP and HTTPS (see the sketch after this list)
  - Two apps:
    - Coffee - 3 replicas
    - Tea - 3 replicas
  - Two HTTPRoutes
    - Coffee (HTTP)
    - Tea (HTTPS)
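
A minimal sketch of the Gateway and the coffee HTTPRoute described above, assuming the `nginx` GatewayClass; the resource names and the TLS Secret are assumptions (the actual manifests are the ones applied in the Steps below), and older Gateway API installations use `v1beta1` instead of `v1`:

```yaml
# Sketch only - resource and Secret names are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway
spec:
  gatewayClassName: nginx
  listeners:
  - name: http
    port: 80
    protocol: HTTP
  - name: https
    port: 443
    protocol: HTTPS
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: cafe-secret
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: coffee
spec:
  parentRefs:
  - name: gateway
    sectionName: http
  hostnames:
  - cafe.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /coffee
    backendRefs:
    - name: coffee
      port: 80
# The tea HTTPRoute is analogous: it attaches to the https listener and matches /tea.
```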

## Steps

### Start

Test duration - 4 days.

1. Create a Kubernetes cluster on GKE.
2. Deploy NGF.
3. Expose NGF via a LoadBalancer Service with the `"networking.gke.io/load-balancer-type":"Internal"` annotation to
   allocate an internal load balancer, as sketched below.
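
   A minimal sketch of such a Service, assuming the default `nginx-gateway` namespace; the selector label is an assumption and must match the labels on the NGF Pod:

   ```yaml
   # Sketch only - the selector label is an assumption; match it to the NGF Pod labels.
   apiVersion: v1
   kind: Service
   metadata:
     name: nginx-gateway
     namespace: nginx-gateway
     annotations:
       networking.gke.io/load-balancer-type: "Internal"
   spec:
     type: LoadBalancer
     selector:
       app.kubernetes.io/name: nginx-gateway
     ports:
     - name: http
       port: 80
       targetPort: 80
       protocol: TCP
     - name: https
       port: 443
       targetPort: 443
       protocol: TCP
   ```
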
4. Apply the manifests which will:
   1. Deploy the coffee and tea backends.
   2. Configure HTTP and HTTPS listeners on the Gateway.
   3. Expose coffee via the HTTP listener and tea via the HTTPS listener.
   4. Create two CronJobs that periodically re-roll out the backends (see the sketch after the apply command):
      1. Coffee - every minute for an hour, every 6 hours.
      2. Tea - every minute for an hour, every 6 hours, offset 3 hours from coffee.
   5. Configure Prometheus on GKE to pick up NGF metrics.

   ```shell
   kubectl apply -f files
   ```
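
   A minimal sketch of what such a CronJob can look like, assuming a `kubectl`-capable image and a ServiceAccount with permission to restart the Deployments (RBAC omitted); the image, ServiceAccount name, and schedule expression are assumptions:

   ```yaml
   # Sketch only - image, ServiceAccount, and schedule are assumptions; RBAC is omitted.
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: coffee-rollout-mgr
   spec:
     # Every minute during hours 0, 6, 12, 18 (the tea CronJob would use hours 3, 9, 15, 21).
     schedule: "* 0,6,12,18 * * *"
     jobTemplate:
       spec:
         template:
           spec:
             serviceAccountName: rollout-mgr
             restartPolicy: Never
             containers:
             - name: kubectl
               image: bitnami/kubectl
               command:
               - kubectl
               - rollout
               - restart
               - deployment/coffee
   ```
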

5. In Tester VMs, update `/etc/hosts` to have an entry with the External IP of the NGF Service (`10.128.0.10` in this
   case):

   ```text
   10.128.0.10 cafe.example.com
   ```

6. In Tester VMs, start a tmux session (this is needed so that even if you disconnect from the VM, any launched command
   will keep running):

   ```shell
   tmux
   ```

7. In First VM, start wrk for 4 days for coffee via HTTP:

   ```shell
   wrk -t2 -c100 -d96h http://cafe.example.com/coffee
   ```

8. In Second VM, start wrk for 4 days for tea via HTTPS:

   ```shell
   wrk -t2 -c100 -d96h https://cafe.example.com/tea
   ```

Notes:

- The updated coffee and tea backends in cafe.yaml include extra configuration for zero-downtime upgrades, so that
  wrk in the Tester VMs doesn't get 502s from NGF. Based on https://learnk8s.io/graceful-shutdown
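
The snippet below is a sketch of the kind of graceful-shutdown configuration that article describes - a readiness probe, a `preStop` sleep, and a termination grace period longer than the sleep. The names, image, and timing values are illustrative assumptions; check cafe.yaml for the actual configuration:

```yaml
# Sketch only - names, image, and timing values are assumptions; see cafe.yaml for the real settings.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coffee
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coffee
  template:
    metadata:
      labels:
        app: coffee
    spec:
      # Longer than the preStop sleep, so NGINX can finish in-flight requests.
      terminationGracePeriodSeconds: 40
      containers:
      - name: coffee
        image: nginxdemos/nginx-hello:plain-text
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
        lifecycle:
          preStop:
            exec:
              # Keep the old Pod serving while its endpoint is removed from the data plane.
              command: ["/bin/sleep", "30"]
```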

### Check the Test is Running Correctly

Check that you don't see any errors:

1. Traffic is flowing - look at the access logs of NGINX (see the sketch after this list).
2. Check that the CronJobs can run:

   ```shell
   kubectl create job --from=cronjob/coffee-rollout-mgr coffee-test
   kubectl create job --from=cronjob/tea-rollout-mgr tea-test
   ```

3. Check that GKE exports logs and Prometheus metrics.
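For example, you can tail the NGINX access logs directly from the NGF Pod; the namespace, Pod label, and `nginx` container name below are assumptions based on a default NGF deployment:

```shell
# Sketch only - adjust the namespace, label selector, and container name to your deployment.
kubectl -n nginx-gateway logs -l app.kubernetes.io/name=nginx-gateway -c nginx -f
```
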
In case of errors, double-check that you prepared the environment and launched the test correctly.

### End

- Remove the CronJobs.

## Analyze

- Traffic
  - Tester VMs (clients)
    - When wrk stops, it prints its results upon termination. To connect to the tmux session with wrk,
      run `tmux attach -t 0`
    - Check for errors, latency, RPS
- Logs
  - Check the logs for errors in Google Cloud Operations Logging.
    - NGF
    - NGINX
- Check metrics in Google Cloud Monitoring (see the example queries after this list).
  - NGF
    - CPU usage
      - NGINX
      - NGF
    - Memory usage
      - NGINX
      - NGF
    - NGINX metrics
      - Reloads
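
As one example, if cAdvisor/kubelet container metrics are available in your Prometheus setup, per-container CPU and memory for the NGF Pod can be queried as below; the namespace and container names are assumptions, GKE Cloud Monitoring also offers equivalent built-in Kubernetes container metrics, and the NGINX reload counters should be taken from the NGF metrics documentation:

```text
# Sketch only - namespace and container names are assumptions; adjust to your deployment.

# CPU usage per container in the NGF Pod:
rate(container_cpu_usage_seconds_total{namespace="nginx-gateway", container=~"nginx|nginx-gateway"}[5m])

# Memory (working set) per container in the NGF Pod:
container_memory_working_set_bytes{namespace="nginx-gateway", container=~"nginx|nginx-gateway"}
```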

## Results

- [1.0.0](results/1.0.0.md)