@@ -25,8 +25,17 @@ time, and delayed merges.
25
25
26
26
#### Debuggability ####
27
27
28
- If your test fails, it should provide as detailed as possible reasons
29
- for the failure in its output. "Timeout" is not a useful error
28
+ If your test fails, it should provide as detailed as possible reasons for the
29
+ failure in its failure message. The failure message is the string that gets
30
+ passed (directly or indirectly) to ` ginkgo.Fail(f) ` . That text is what gets
31
+ shown in the overview of failed tests for a Prow job and what gets aggregated
32
+ by https://go.k8s.io/triage .
33
+
34
+ A good failure message:
35
+ - identifies the test failure
36
+ - has enough details to provide some initial understanding of what went wrong
37
+
38
+ "Timeout" is not a useful error
30
39
message. "Timed out after 60 seconds waiting for pod xxx to enter
31
40
running state, still in pending state" is much more useful to someone
32
41
trying to figure out why your test failed and what to do about it.
@@ -38,17 +47,233 @@ like the following generates rather useless errors:
38
47
Expect(err).NotTo(HaveOccurred())
39
48
```
40
49
41
- Rather
50
+ Errors returned by client-go can be very detailed. A better way to test for
51
+ errors is with ` framework.ExpectNoError ` because it logs the full error and
52
+ then includes only the shorter ` err.Error() ` in the failure message. To make
53
+ that failure message more informative,
42
54
[ annotate] ( https://onsi.github.io/gomega/#annotating-assertions ) your assertion with something like this:
43
55
44
56
```
45
- Expect(err).NotTo(HaveOccurred() , "Failed to create %d foobars, only created %d", foobarsReqd, foobarsCreated)
57
+ framework.ExpectNoError(err , "tried creating %d foobars, only created %d", foobarsReqd, foobarsCreated)
46
58
```
47
59
48
60
On the other hand, overly verbose logging, particularly of non-error conditions, can make
49
61
it unnecessarily difficult to figure out whether a test failed and if
50
62
so why? So don't log lots of irrelevant stuff either.
51
63
64
+ Except for this special case, using Gomega assertions directly is
65
+ encouraged. They are more versatile than the (few) wrappers that were added to
66
+ the E2E framework. Use assertions that match the check in the test. Using Go
67
+ code to evaluate some condition and then checking the result often isn't
68
+ informative. For example this check should be avoided:
69
+
70
+ ```
71
+ gomega.Expect(strings.Contains(actualStr, expectedSubStr)).To(gomega.Equal(true))
72
+ ```
73
+
74
+ Better pass both values to Gomega, which will automatically include them in the
75
+ failure message. Add an annotation that explains what the assertion is about:
76
+
77
+ ```
78
+ gomega.Expect(actualStr).To(gomega.ContainSubstring("xyz"), "checking log output")
79
+ ```
80
+
81
+ This produces the following failure message:
82
+ ```
83
+ [FAILED] checking log output
84
+ Expected
85
+ <string>: hello world
86
+ to contain substring
87
+ <string>: xyz
88
+ ```
89
+
90
+ If there is no suitable Gomega assertion, call ` ginkgo.Failf ` directly:
91
+ ```
92
+ import "github.com/onsi/gomega/format"
93
+
94
+ ok := someCustomCheck(abc)
95
+ if !ok {
96
+ ginkgo.Failf("check xyz failed for object:\n%s", format.Object(abc))
97
+ }
98
+ ```
99
+
100
+ Dumping structs with ` format.Object ` is recommended. Starting with Kubernetes
101
+ 1.26, ` format.Object ` will pretty-print Kubernetes API objects or structs [ as
102
+ YAML and omit unset
103
+ fields] ( https://github.com/kubernetes/kubernetes/pull/113384 ) , which is more
104
+ readable than other alternatives like ` fmt.Sprintf("%+v") ` .
105
+
106
+ Consider writing a [ Gomega
107
+ matcher] ( https://onsi.github.io/gomega/#adding-your-own-matchers ) . It combines
108
+ the custom check and generating the failure messages and can make tests more
109
+ readable.
110
+
111
+ It is good practice to include details like the object that failed some
112
+ assertion in the failure message because then a) the information is available
113
+ when analyzing a failure that occurred in the CI and b) it only gets logged
114
+ when some assertion fails. Always dumping objects via log messages can make the
115
+ test output very large and may distract from the relevant information.
116
+
117
+ #### Recovering from test failures ####
118
+
119
+ All tests should ensure that a cluster is restored to the state that it was in
120
+ before the test ran. [ ` ginkgo.DeferCleanup `
121
+ ] ( https://pkg.go.dev/github.com/onsi/ginkgo/v2#DeferCleanup ) is recommended for
122
+ this because it can be called similar to ` defer ` directly after setting up
123
+ something. It is better than ` defer ` because Ginkgo will show additional
124
+ details about which cleanup code is running and (if possible) handle timeouts
125
+ for that code (see next section). Is is better than ` ginkgo.AfterEach ` because
126
+ it is not necessary to define additional variables and because
127
+ ` ginkgo.DeferCleanup ` executes code in the more useful last-in-first-out order,
128
+ i.e. things that get set up first get removed last.
129
+
130
+ Objects created in the test namespace do not need to be deleted because
131
+ deleting the namespace will also delete them. However, if deleting an object
132
+ may fail, then explicitly cleaning it up is better because then failures or
133
+ timeouts related to it will be more obvious.
134
+
135
+ In cases where the test may have removed the object, ` framework.IgnoreNotFound `
136
+ can be used to ignore the "not found" error:
137
+ ```
138
+ podClient := f.ClientSet.CoreV1().Pods(f.Namespace.Name)
139
+ pod, err := podClient.Create(ctx, testPod, metav1.CreateOptions{})
140
+ framework.ExpectNoError(err, "create test pod")
141
+ ginkgo.DeferCleanup(framework.IgnoreNotFound(podClient.Delete), pod.Name, metav1.DeleteOptions{})
142
+ ```
143
+
144
+ #### Interrupting tests and timeouts ####
145
+
146
+ When aborting a manual ` gingko ./test/e2e ` invocation with CTRL-C or a signal,
147
+ the currently running test(s) should stop immediately. This gets achieved by
148
+ accepting a ` ctx context.Context ` as first parameter in the Ginkgo callback
149
+ function and then passing that context through to all code that might
150
+ block. When Ginkgo notices that it needs to shut down, it will cancel that
151
+ context and all code trying to use it will immediately return with a `context
152
+ canceled` error. Cleanup callbacks get a new context which will time out
153
+ eventually to ensure that tests don't get stuck. For a detailed description,
154
+ see https://onsi.github.io/ginkgo/#interrupting-aborting-and-timing-out-suites .
155
+ Most of the E2E tests [ were update to use the Ginkgo
156
+ context] ( https://github.com/kubernetes/kubernetes/pull/112923 ) at the start of
157
+ the 1.27 development cycle.
158
+
159
+ When waiting for something to happen, use a reasonable timeout. Without it, a
160
+ test might keep running until the entire test suite gets killed by the
161
+ CI. Beware that the CI under load may take a lot longer to complete some
162
+ operation compared to running the same test locally.
163
+
164
+ On the other hand, a too long timeout can be annoying when trying to debug
165
+ tests locally. Some tips for writing and debugging long-running tests:
166
+
167
+ - Use ` ginkgo.By ` to record individual steps. Ginkgo will use that information
168
+ when describing where a test timed out.
169
+
170
+ - Invoke the ` ginkgo ` CLI with ` --poll-progress-after=30s ` or some other
171
+ suitable duration to [ be informed
172
+ early] ( https://onsi.github.io/ginkgo/#getting-visibility-into-long-running-specs )
173
+ why a test doesn't complete and where it is stuck. A SIGINFO or SIGUSR1
174
+ signal can be sent to the CLI and/or e2e.test processes to trigger an
175
+ immediate progress report.
176
+
177
+ - Use ` gomega.Eventually ` to wait for some condition. When it times out or
178
+ gets stuck, the last failed assertion will be included in the report
179
+ automatically. A good way to invoke it is:
180
+
181
+ gomega.Eventually(ctx, func(ctx context.Context) (book Book, err error) {
182
+ // Retrieve book from API server and return it.
183
+ ...
184
+ }).WithPolling(5 * time.Second).WithTimeout(30 * time.Second).
185
+ Should(gomega.HaveField("Author.DOB.Year()", BeNumerically("<", 1900)))
186
+
187
+ Avoid testing for some condition inside the callback and returning a boolean
188
+ because then failure messages are not informative (see above).
189
+
190
+ Some of the E2E framework sub-packages have helper functions that wait for
191
+ certain domain-specific conditions. Currently most of these functions don't
192
+ follow best practices (not using gomega.Eventually, error messages not very
193
+ informative). [ Work is
194
+ planned] ( https://github.com/kubernetes/kubernetes/issues/106575 ) in that
195
+ area, so beware that these APIs [ may
196
+ change] ( https://github.com/kubernetes/kubernetes/pull/113298 ) at some point.
197
+
198
+ - Use ` gomega.Consistently ` to ensure that some condition is true
199
+ for a while.
200
+
201
+ - Both ` gomega.Consistently ` and ` gomega.Eventually ` can be aborted early via
202
+ [ ` gomega.StopPolling ` ] ( https://onsi.github.io/gomega/#bailing-out-early---polling-functions ) .
203
+
204
+ - Avoid polling with functions that don't take a context (` wait.Poll ` ,
205
+ ` wait.PollImmediate ` , ` wait.Until ` , ...) and replace with their counterparts
206
+ that do (` wait.PollWithContext ` , ` wait.PollImmediateWithContext ` ,
207
+ ` wait.UntilWithContext ` , ...) or even better, with ` gomega.Eventually ` .
208
+
209
+ There are some gotchas:
210
+
211
+ - Don't use the ` ctx ` passed into ` ginkgo.It ` in a ` ginkgo.DeferCleanup `
212
+ callback because the context will be canceled when the cleanup code
213
+ runs. This is wrong:
214
+
215
+ ginkgo.It("something", func(ctx context.Context) {
216
+ ...
217
+ ginkgo.DeferCleanup(func() {
218
+ // do something with ctx
219
+ })
220
+ })
221
+
222
+ Instead, register a function which accepts a new context:
223
+
224
+ ginkgo.DeferCleanup(func(ctx context.Context) {
225
+ // do something with the new ctx
226
+ })
227
+
228
+ Anonymous functions can be avoided by passing some existing function and its
229
+ parameters directly to ` ginkgo.DeferCleanup ` . Again, beware to * not* pass the
230
+ wrong ` ctx ` . This is wrong:
231
+
232
+ ginkgo.It("something", func(ctx context.Context) {
233
+ ...
234
+ ginkgo.DeferCleanup(myDeleteFunc, ctx, objName)
235
+ })
236
+
237
+ Instead, just pass the other parameters and let ` ginkgo.DeferCleanup `
238
+ add a new context:
239
+
240
+ ginkgo.DeferCleanup(myDeleteFunc, objName)
241
+
242
+ - When starting some background goroutine in a ` ginkgo.BeforeEach ` callback,
243
+ use ` context.WithCancel(context.Background()) ` . The context passed into the
244
+ callback will get canceled when the callback returns, which would cause the
245
+ background goroutine to stop before the test runs. This works:
246
+
247
+ backgroundCtx, cancel := context.WithCancel(context.Background())
248
+ ginkgo.DeferCleanup(cancel)
249
+ _, controller = cache.NewInformer( ... )
250
+ go controller.Run(backgroundCtx.Done())
251
+
252
+ - When adding a timeout to the context for one particular operation,
253
+ beware of overwriting the ` ctx ` variable. This code here applies
254
+ the timeout to the next call and everything else that follows:
255
+
256
+ ctx, cancel := context.WithTimeout(ctx, 5 * time.Second)
257
+ defer cancel()
258
+ someOperation(ctx)
259
+ ...
260
+ anotherOperation(ctx)
261
+
262
+ Better use some other variable name:
263
+
264
+ timeoutCtx, cancel := context.WithTimeout(ctx, 5 * time.Second)
265
+ defer cancel()
266
+ someOperation(timeoutCtx)
267
+
268
+ When the intention is to set a timeout for the entire callback, use
269
+ [ ` ginkgo.NodeTimeout ` ] ( https://pkg.go.dev/github.com/onsi/ginkgo/v2#NodeTimeout ) :
270
+
271
+ ginkgo.It("something", ginkgo.NodeTimeout(30 * time.Second), func(ctx context.Context) {
272
+ })
273
+
274
+ There is also a ` ginkgo.SpecTimeout ` , but that then also includes the time
275
+ taken for ` BeforeEach ` , ` AfterEach ` and ` DeferCleanup ` callbacks.
276
+
52
277
#### Ability to run in non-dedicated test clusters ####
53
278
54
279
To reduce end-to-end delay and improve resource utilization when
0 commit comments