Skip to content

Commit bf8b4d5

Browse files
committed
writing-good-e2e-tests.md: polling, contexts and failure messages
This reflects the recent changes that came with Ginkgo v2.
1 parent 0380da7 commit bf8b4d5

File tree

1 file changed

+316
-4
lines changed

1 file changed

+316
-4
lines changed

contributors/devel/sig-testing/writing-good-e2e-tests.md

Lines changed: 316 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -25,8 +25,22 @@ time, and delayed merges.
2525

2626
#### Debuggability ####
2727

28-
If your test fails, it should provide as detailed as possible reasons
29-
for the failure in its output. "Timeout" is not a useful error
28+
If your test fails, it should provide as detailed as possible reasons for the
29+
failure in its failure message. The failure message is the string that gets
30+
passed (directly or indirectly) to `ginkgo.Fail(f)`. That text is what gets
31+
shown in the overview of failed tests for a Prow job and what gets aggregated
32+
by https://go.k8s.io/triage.
33+
34+
A good failure message:
35+
- identifies the test failure
36+
- has enough details to provide some initial understanding of what went wrong
37+
38+
It's okay for it to contain information that changes during each test
39+
run. Aggregation [simplifies the failure message with regular
40+
expressions](https://github.com/kubernetes/test-infra/blob/d56bc333ae8acf176887a3249f750e7a8e0377f0/triage/summarize/text.go#L39-L69)
41+
before looking for similar failures.
42+
43+
"Timeout" is not a useful error
3044
message. "Timed out after 60 seconds waiting for pod xxx to enter
3145
running state, still in pending state" is much more useful to someone
3246
trying to figure out why your test failed and what to do about it.
@@ -38,17 +52,315 @@ like the following generates rather useless errors:
3852
Expect(err).NotTo(HaveOccurred())
3953
```
4054

41-
Rather
55+
Errors returned by client-go can be very detailed. A better way to test for
56+
errors is with `framework.ExpectNoError` because it logs the full error and
57+
then includes only the shorter `err.Error()` in the failure message. To make
58+
that failure message more informative,
4259
[annotate](https://onsi.github.io/gomega/#annotating-assertions) your assertion with something like this:
4360

4461
```
45-
Expect(err).NotTo(HaveOccurred(), "Failed to create %d foobars, only created %d", foobarsReqd, foobarsCreated)
62+
framework.ExpectNoError(err, "tried creating %d foobars, only created %d", foobarsReqd, foobarsCreated)
4663
```
4764

4865
On the other hand, overly verbose logging, particularly of non-error conditions, can make
4966
it unnecessarily difficult to figure out whether a test failed and if
5067
so why? So don't log lots of irrelevant stuff either.
5168

69+
Except for this special case, using Gomega assertions directly is
70+
encouraged. They are more versatile than the (few) wrappers that were added to
71+
the E2E framework. Use assertions that match the check in the test. Using Go
72+
code to evaluate some condition and then checking the result often isn't
73+
informative. For example this check should be avoided:
74+
75+
```
76+
gomega.Expect(strings.Contains(actualStr, expectedSubStr)).To(gomega.Equal(true))
77+
```
78+
79+
Better pass both values to Gomega, which will automatically include them in the
80+
failure message. Add an annotation that explains what the assertion is about:
81+
82+
```
83+
gomega.Expect(actualStr).To(gomega.ContainSubstring("xyz"), "checking log output")
84+
```
85+
86+
This produces the following failure message:
87+
```
88+
[FAILED] checking log output
89+
Expected
90+
<string>: hello world
91+
to contain substring
92+
<string>: xyz
93+
```
94+
95+
If there is no suitable Gomega assertion, call `ginkgo.Failf` directly:
96+
```
97+
import "github.com/onsi/gomega/format"
98+
99+
ok := someCustomCheck(abc)
100+
if !ok {
101+
ginkgo.Failf("check xyz failed for object:\n%s", format.Object(abc))
102+
}
103+
```
104+
105+
[Comparing a boolean](https://github.com/kubernetes/kubernetes/issues/105678)
106+
like this against `true` or `false` with `gomega.Equal` or
107+
`framework.ExpectEqual` is not useful because dumping the actual and expected
108+
value just distracts from the underlying failure reason.
109+
110+
Dumping structs with `format.Object` is recommended. Starting with Kubernetes
111+
1.26, `format.Object` will pretty-print Kubernetes API objects or structs [as
112+
YAML and omit unset
113+
fields](https://github.com/kubernetes/kubernetes/pull/113384), which is more
114+
readable than other alternatives like `fmt.Sprintf("%+v")`.
115+
116+
import (
117+
"fmt"
118+
"k8s.io/api/core/v1"
119+
"k8s.io/kubernetes/test/utils/format"
120+
)
121+
122+
var pod v1.Pod
123+
fmt.Printf("Printf: %+v\n\n", pod)
124+
fmt.Printf("format.Object:\n%s", format.Object(pod, 1 /* indent one level */))
125+
126+
=>
127+
128+
Printf: {TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name: GenerateName: Namespace: SelfLink: UID: ResourceVersion: Generation:0 CreationTimestamp:0001-01-01 00:00:00 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[] ManagedFields:[]} Spec:{Volumes:[] InitContainers:[] Containers:[] EphemeralContainers:[] RestartPolicy: TerminationGracePeriodSeconds:<nil> ActiveDeadlineSeconds:<nil> DNSPolicy: NodeSelector:map[] ServiceAccountName: DeprecatedServiceAccount: AutomountServiceAccountToken:<nil> NodeName: HostNetwork:false HostPID:false HostIPC:false ShareProcessNamespace:<nil> SecurityContext:nil ImagePullSecrets:[] Hostname: Subdomain: Affinity:nil SchedulerName: Tolerations:[] HostAliases:[] PriorityClassName: Priority:<nil> DNSConfig:nil ReadinessGates:[] RuntimeClassName:<nil> EnableServiceLinks:<nil> PreemptionPolicy:<nil> Overhead:map[] TopologySpreadConstraints:[] SetHostnameAsFQDN:<nil> OS:nil HostUsers:<nil> SchedulingGates:[] ResourceClaims:[]} Status:{Phase: Conditions:[] Message: Reason: NominatedNodeName: HostIP: PodIP: PodIPs:[] StartTime:<nil> InitContainerStatuses:[] ContainerStatuses:[] QOSClass: EphemeralContainerStatuses:[] Resize:}}
129+
130+
format.Object:
131+
<v1.Pod>:
132+
metadata:
133+
creationTimestamp: null
134+
spec:
135+
containers: null
136+
status: {}
137+
138+
Consider writing a [Gomega
139+
matcher](https://onsi.github.io/gomega/#adding-your-own-matchers). It combines
140+
the custom check and generating the failure messages and can make tests more
141+
readable.
142+
143+
It is good practice to include details like the object that failed some
144+
assertion in the failure message because then a) the information is available
145+
when analyzing a failure that occurred in the CI and b) it only gets logged
146+
when some assertion fails. Always dumping objects via log messages can make the
147+
test output very large and may distract from the relevant information.
148+
149+
#### Recovering from test failures ####
150+
151+
All tests should ensure that a cluster is restored to the state that it was in
152+
before the test ran. [`ginkgo.DeferCleanup`
153+
](https://pkg.go.dev/github.com/onsi/ginkgo/v2#DeferCleanup) is recommended for
154+
this because it can be called similar to `defer` directly after setting up
155+
something. It is better than `defer` because Ginkgo will show additional
156+
details about which cleanup code is running and (if possible) handle timeouts
157+
for that code (see next section). Is is better than `ginkgo.AfterEach` because
158+
it is not necessary to define additional variables and because
159+
`ginkgo.DeferCleanup` executes code in the more useful last-in-first-out order,
160+
i.e. things that get set up first get removed last.
161+
162+
Objects created in the test namespace do not need to be deleted because
163+
deleting the namespace will also delete them. However, if deleting an object
164+
may fail, then explicitly cleaning it up is better because then failures or
165+
timeouts related to it will be more obvious.
166+
167+
In cases where the test may have removed the object, `framework.IgnoreNotFound`
168+
can be used to ignore the "not found" error:
169+
```
170+
podClient := f.ClientSet.CoreV1().Pods(f.Namespace.Name)
171+
pod, err := podClient.Create(ctx, testPod, metav1.CreateOptions{})
172+
framework.ExpectNoError(err, "create test pod")
173+
ginkgo.DeferCleanup(framework.IgnoreNotFound(podClient.Delete), pod.Name, metav1.DeleteOptions{})
174+
```
175+
176+
#### Interrupting tests ####
177+
178+
When aborting a manual `gingko ./test/e2e` invocation with CTRL-C or a signal,
179+
the currently running test(s) should stop immediately. This gets achieved by
180+
accepting a `ctx context.Context` as first parameter in the Ginkgo callback
181+
function and then passing that context through to all code that might
182+
block. When Ginkgo notices that it needs to shut down, it will cancel that
183+
context and all code trying to use it will immediately return with a `context
184+
canceled` error. Cleanup callbacks get a new context which will time out
185+
eventually to ensure that tests don't get stuck. For a detailed description,
186+
see https://onsi.github.io/ginkgo/#interrupting-aborting-and-timing-out-suites.
187+
Most of the E2E tests [were update to use the Ginkgo
188+
context](https://github.com/kubernetes/kubernetes/pull/112923) at the start of
189+
the 1.27 development cycle.
190+
191+
There are some gotchas:
192+
193+
- Don't use the `ctx` passed into `ginkgo.It` in a `ginkgo.DeferCleanup`
194+
callback because the context will be canceled when the cleanup code
195+
runs. This is wrong:
196+
197+
ginkgo.It("something", func(ctx context.Context) {
198+
...
199+
ginkgo.DeferCleanup(func() {
200+
// do something with ctx
201+
})
202+
})
203+
204+
Instead, register a function which accepts a new context:
205+
206+
ginkgo.DeferCleanup(func(ctx context.Context) {
207+
// do something with the new ctx
208+
})
209+
210+
Anonymous functions can be avoided by passing some existing function and its
211+
parameters directly to `ginkgo.DeferCleanup`. Again, beware to *not* pass the
212+
wrong `ctx`. This is wrong:
213+
214+
ginkgo.It("something", func(ctx context.Context) {
215+
...
216+
ginkgo.DeferCleanup(myDeleteFunc, ctx, objName)
217+
})
218+
219+
Instead, just pass the other parameters and let `ginkgo.DeferCleanup`
220+
add a new context:
221+
222+
ginkgo.DeferCleanup(myDeleteFunc, objName)
223+
224+
- When starting some background goroutine in a `ginkgo.BeforeEach` callback,
225+
use `context.WithCancel(context.Background())`. The context passed into the
226+
callback will get canceled when the callback returns, which would cause the
227+
background goroutine to stop before the test runs. This works:
228+
229+
backgroundCtx, cancel := context.WithCancel(context.Background())
230+
ginkgo.DeferCleanup(cancel)
231+
_, controller = cache.NewInformer( ... )
232+
go controller.Run(backgroundCtx.Done())
233+
234+
- When adding a timeout to the context for one particular operation,
235+
beware of overwriting the `ctx` variable. This code here applies
236+
the timeout to the next call and everything else that follows:
237+
238+
ctx, cancel := context.WithTimeout(ctx, 5 * time.Second)
239+
defer cancel()
240+
someOperation(ctx)
241+
...
242+
anotherOperation(ctx)
243+
244+
Better use some other variable name:
245+
246+
timeoutCtx, cancel := context.WithTimeout(ctx, 5 * time.Second)
247+
defer cancel()
248+
someOperation(timeoutCtx)
249+
250+
When the intention is to set a timeout for the entire callback, use
251+
[`ginkgo.NodeTimeout`](https://pkg.go.dev/github.com/onsi/ginkgo/v2#NodeTimeout):
252+
253+
ginkgo.It("something", ginkgo.NodeTimeout(30 * time.Second), func(ctx context.Context) {
254+
})
255+
256+
There is also a `ginkgo.SpecTimeout`, but that then also includes the time
257+
taken for `BeforeEach`, `AfterEach` and `DeferCleanup` callbacks.
258+
259+
#### Polling and timeouts ####
260+
261+
When waiting for something to happen, use a reasonable timeout. Without it, a
262+
test might keep running until the entire test suite gets killed by the
263+
CI. Beware that the CI under load may take a lot longer to complete some
264+
operation compared to running the same test locally. On the other hand, a too
265+
long timeout can be annoying when trying to debug tests locally.
266+
267+
The framework provides some [common
268+
timeouts](https://github.com/kubernetes/kubernetes/blob/eba98af1d8b19b120e39f3/test/e2e/framework/timeouts.go#L44-L109)
269+
through the [framework
270+
instance](https://github.com/kubernetes/kubernetes/blob/1e84987baccbccf929eba98af1d8b19b120e39f3/test/e2e/framework/framework.go#L122-L123).
271+
When writing a test, check whether one of those fits before defining a custom
272+
timeout in the test.
273+
274+
Good code that waits for something to happen meets the following criteria:
275+
- accepts a context for test timeouts
276+
- informative during interactive use (i.e. intermediate reports, either
277+
periodically or on demand)
278+
- little to no output during a CI run except when it fails
279+
- full explanation when it fails: when it observes some state and then
280+
encounters errors reading the state, then dumping both the latest
281+
observed state and the latest error is useful
282+
- extension mechanism for writing custom checks
283+
- early abort when condition cannot be reached anymore
284+
285+
[`gomega.Eventually`](https://pkg.go.dev/github.com/onsi/gomega#Eventually)
286+
satisfies all of these criteria and therefore is recommended, but not required.
287+
In https://github.com/kubernetes/kubernetes/pull/113298,
288+
[test/e2e/framework/pods/wait.go](https://github.com/kubernetes/kubernetes/blob/222f65506252354da012c2e9d5457a6944a4e681/test/e2e/framework/pod/wait.go)
289+
and the framework were modified to use gomega. Typically, `Eventually` is
290+
passed a function which gets an object or lists several of them, then `Should`
291+
checks against the expected result. Errors and retries specific to Kubernetes
292+
are handled by [wrapping client-go
293+
functions](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/get.go).
294+
295+
Using gomega assertions in helper packages is problematic for two reasons:
296+
- The stacktrace associated with the failure starts with the helper unless
297+
extra care is take to pass in a stack offset.
298+
- Additional explanations for a potential failure must be prepared beforehand
299+
and passed in.
300+
301+
The E2E framework therefore uses a different approach:
302+
- [`framework.Gomega()`](https://github.com/kubernetes/kubernetes/blob/222f65506252354da012c2e9d5457a6944a4e681/test/e2e/framework/expect.go#L80-L101)
303+
offers similar functions as the `gomega` package, except that they return a
304+
normal error instead of failing the test.
305+
- That error gets wrapped with `fmt.Errorf("<explanation>: %w)` to
306+
add additional information, just as in normal Go code.
307+
- Wrapping the error (`%w` instead of `%v`) is important because then
308+
`framework.ExpectNoError` directly uses the error message as failure without
309+
additional boiler plate text. It also is able to log the stacktrace where
310+
the error occurred and not just where it was finally treated as a test
311+
failure.
312+
313+
Some tips for writing and debugging long-running tests:
314+
315+
- Use `ginkgo.By` to record individual steps. Ginkgo will use that information
316+
when describing where a test timed out.
317+
318+
- Invoke the `ginkgo` CLI with `--poll-progress-after=30s` or some other
319+
suitable duration to [be informed
320+
early](https://onsi.github.io/ginkgo/#getting-visibility-into-long-running-specs)
321+
why a test doesn't complete and where it is stuck. A SIGINFO or SIGUSR1
322+
signal can be sent to the CLI and/or e2e.test processes to trigger an
323+
immediate progress report.
324+
325+
- Use [`gomega.Eventually`](https://pkg.go.dev/github.com/onsi/gomega#Eventually)
326+
to wait for some condition. When it times out or
327+
gets stuck, the last failed assertion will be included in the report
328+
automatically. A good way to invoke it is:
329+
330+
gomega.Eventually(ctx, func(ctx context.Context) (book Book, err error) {
331+
// Retrieve book from API server and return it.
332+
...
333+
}).WithPolling(5 * time.Second).WithTimeout(30 * time.Second).
334+
Should(gomega.HaveField("Author.DOB.Year()", BeNumerically("<", 1900)))
335+
336+
Avoid testing for some condition inside the callback and returning a boolean
337+
because then failure messages are not informative (see above). See
338+
https://github.com/kubernetes/kubernetes/pull/114640 for an example where
339+
[gomega/gcustom](https://pkg.go.dev/github.com/onsi/[email protected]/gcustom)
340+
was used to write assertions.
341+
342+
Some of the E2E framework sub-packages have helper functions that wait for
343+
certain domain-specific conditions. Currently most of these functions don't
344+
follow best practices (not using gomega.Eventually, error messages not very
345+
informative). [Work is
346+
planned](https://github.com/kubernetes/kubernetes/issues/106575) in that
347+
area, so beware that these APIs may
348+
change at some point.
349+
350+
- Use `gomega.Consistently` to ensure that some condition is true
351+
for a while. As with `gomega.Eventually`, make assertions about the
352+
value instead of checking the value with Go code and then asserting
353+
that the code returns true.
354+
355+
- Both `gomega.Consistently` and `gomega.Eventually` can be aborted early via
356+
[`gomega.StopPolling`](https://onsi.github.io/gomega/#bailing-out-early---polling-functions).
357+
358+
- Avoid polling with functions that don't take a context (`wait.Poll`,
359+
`wait.PollImmediate`, `wait.Until`, ...) and replace with their counterparts
360+
that do (`wait.PollWithContext`, `wait.PollImmediateWithContext`,
361+
`wait.UntilWithContext`, ...) or even better, with `gomega.Eventually`.
362+
363+
52364
#### Ability to run in non-dedicated test clusters ####
53365

54366
To reduce end-to-end delay and improve resource utilization when

0 commit comments

Comments
 (0)