
Annotate test cases which are timing sensitive #130363


Open
rossburton opened this issue Feb 20, 2025 · 19 comments
Labels
tests (Tests in the Lib/test dir), type-feature (A feature request or enhancement)

Comments

@rossburton
Contributor

rossburton commented Feb 20, 2025

Feature or enhancement

Proposal:

Many tests in the CPython test suite are sensitive to timing and assume an entirely unloaded system. For example, the TimerfdTests check for 1 ms accuracy in expected durations. Some people (e.g. people building and testing Python) may be running the test suite on build machines which are potentially heavily loaded, so this sort of expectation isn't feasible.

A good solution would be to annotate the timing-sensitive tests in some way so that they can easily be skipped.

For reference, we're currently patching Python by hand to skip the tests that fail under load, but we now have to rebase these patches on every upgrade.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

rossburton added the type-feature label on Feb 20, 2025
tomasr8 added the tests label on Feb 20, 2025
@encukou
Member

encukou commented Feb 21, 2025

Sounds like it could be a new resource, see python -m test --help and test.support.requires.
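
For context, a minimal sketch of how the existing resource machinery is used from test code; the resource name below is a placeholder, not one that exists today:

import unittest
from test import support

class LoadSensitiveTests(unittest.TestCase):
    def setUp(self):
        # Under `python -m test` this raises test.support.ResourceDenied
        # (treated as a skip) unless the resource was enabled with -u;
        # outside regrtest all resources are considered enabled.
        support.requires('timing')  # placeholder resource name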

@tomasr8
Member

tomasr8 commented Feb 22, 2025

I think it makes sense to hide the tests that clearly depend on timings (e.g. test_process_time) behind a test resource. Though for some of the other tests (e.g. test_input_no_stdout_fileno) it's not immediately obvious to me why they would fail, so perhaps instead of skipping them we could look into making them more reliable?

@rossburton
Contributor Author

Agreed - we have a short list of known flaky tests and need to triage it to identify the ones that are genuinely timing-sensitive versus those with subtle race conditions that should be fixed.

I will prototype a new resource and see how it works.

@tomasr8
Member

tomasr8 commented Feb 22, 2025

If it helps, to add a new resource, you just need to add it here:

ALL_RESOURCES = ('audio', 'curses', 'largefile', 'network',
                 'decimal', 'cpu', 'subprocess', 'urlfetch', 'gui', 'walltime')

And then you can annotate the tests with @support.requires_resource('your_new_resource')
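
As a concrete illustration, a sketch of an annotated test; 'your_new_resource' is the placeholder name from the comment above and the test body is invented, not an existing CPython test:

import time
import unittest
from test import support

class TimingSensitiveTests(unittest.TestCase):
    # Skipped unless the placeholder resource is enabled,
    # e.g. with `python -m test -u your_new_resource`.
    @support.requires_resource('your_new_resource')
    def test_sleep_is_not_early(self):
        t0 = time.monotonic()
        time.sleep(0.1)
        # Lower bound only; it is upper bounds that break on loaded machines.
        self.assertGreaterEqual(time.monotonic() - t0, 0.05)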

@picnixz
Member

picnixz commented Feb 22, 2025

Though for some of the other tests (e.g. test_input_no_stdout_fileno) it's not immediately obvious to me why they would fail, so perhaps instead of skipping them we could look into making them more reliable?

How about a flaky resource? At least then we'd know which tests are expected to be flaky and which are not. And for the CI, we could have flaky tests rerun automatically on failure so that contributors don't need to wait for a triager/core dev to rerun a failed workflow.

I think it makes sense to hide the tests that clearly depend on timings (e.g. test_process_time) behind a test resource

I suggest adding a stats resource which could cover all tests that gather timing data and check that some condition is statistically verified. Other possible names that are more or less generic: timeprec, measures, experiments. I don't really have better names =/

@tomasr8
Member

tomasr8 commented Feb 22, 2025

I like the flaky test idea! For timing-related tests, I don't have a good suggestion either, though stats seems a bit too generic to me (but I guess it depends on how generic we want this resource to be). timeprec is more self-evident, or maybe timing or timings?

@picnixz
Member

picnixz commented Feb 22, 2025

I like the flaky test idea!

Let's open a separate issue for the flaky tests as it could be orthogonal to tests that require an idle system due to stats gathering.

I don't have a good suggestion either, though stats seems a bit too generic to me (but I guess how generic we want this resource to be). timeprec is more self-evident, or maybe timing or timings?

I suggested "stats" because I don't think we should have a test that doesn't check against a population that some hypothesis is verified. However, I'm not sure we have any test that would test something else than timings like this. Memory cannot really be flaky unless the kernel is doing some weird stuff and network can fail but generally it's a timeout issue, not a data issue itself. So timings looks quite promising.

@tomasr8
Member

tomasr8 commented Feb 22, 2025

Let's open a separate issue for the flaky tests as it could be orthogonal to tests that require an idle system due to stats gathering.

Issue's here: #130474

I suggested "stats" because I don't think we should have a test that doesn't check against a population that some hypothesis is verified. However, I'm not sure we have any test that would test something else than timings like this.

Yeah, I'm not sure we assert anything else besides timings that is similarly stochastic... maybe we can stick with some time-related name for now and adjust it if something else comes up?

rossburton added a commit to rossburton/cpython that referenced this issue Feb 24, 2025
…system

Some tests are very sensitive to timing and will fail on a loaded system,
typically because there are small windows for timeouts to trigger in.

Some of these are poorly implemented tests which can be improved, but
others may genuinely have strict timing requirements.

Add a test resource so that these tests can be marked as such and only
run when the system is known to be idle.
@rossburton
Contributor Author

Posted a minimal PR: #130508. This just adds an "idle" resource.

@vstinner
Member

I'm not comfortable with this issue. "Skip tests because they fail on our CI" sounds too specific to me. IMO a "skip list" for a specific CI is perfectly fine.

Your patches use the description:

This test hangs frequently when run on the Autobuilder. Disable it in testing until the cause can be determined.

Well, yes, the cause should be determined.

0001-test_storlines-skip-due-to-load-variability.patch: it would be interesting to investigate why this specific test "intermittently fails on the Yocto AB when a worker is under heavy load".

We already have the "walltime" resource for slow tests. You might start by skipping that resource?

@rossburton
Contributor Author

Slow tests aren't the problem; the problem is tests that take 1s but have precise timing restrictions.

@rossburton
Contributor Author

And I will absolutely review every skip we have to identify why it fails under load before tagging it with 'idle'.

@vstinner
Member

Slow tests aren't the problem; the problem is tests that take 1s but have precise timing restrictions.

In general, tests should not fail just because they take longer than expected: tests should not assert a maximum duration, so that they behave well on heavily loaded systems.

TimerfdTests, which has a hardcoded precision, deserves its own solution. I dislike blindly skipping TimerfdTests just because the CI may be too slow.

@rossburton
Contributor Author

rossburton commented Feb 24, 2025

What would the solution for TimerfdTests be?

Another recent failure that we're not yet patching out is test.test_concurrent_futures.test_wait.ProcessPoolForkWaitTest.test_timeout:

FAIL: test_timeout (test.test_concurrent_futures.test_wait.ProcessPoolForkWaitTest.test_timeout)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.13/test/test_concurrent_futures/test_wait.py", line 129, in test_timeout
    self.assertEqual(set([CANCELLED_AND_NOTIFIED_FUTURE,
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                          EXCEPTION_FUTURE,
                          ^^^^^^^^^^^^^^^^^
                          SUCCESSFUL_FUTURE]),
                          ^^^^^^^^^^^^^^^^^^^^
                     finished)
                     ^^^^^^^^^
AssertionError: Items in the second set but not the first:
<Future at 0x7fdbee570d10 state=finished returned NoneType>

@vstinner
Member

What would the solution for TimerfdTests be?

For example, add a command line option to specify the timer precision. Or change the test to only use a precision of 1 second. I'm not sure what the best solution is here.

@rossburton
Contributor Author

From experience, when you say "let's just make this timeout longer", you're actually saying "let's just have this test fail less frequently". A heavily loaded system will surprise you with exceptional delays at the most annoying moments.

@vstinner
Member

I'm open to rework timerfd tests to remove "maximum timing" tests.
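
For illustration, a hedged sketch of what a lower-bound-only timerfd check could look like, assuming the Linux-only os.timerfd_* API added in 3.13; this is not the existing test, just the shape of the assertion:

import os
import time
import unittest

@unittest.skipUnless(hasattr(os, "timerfd_create"), "requires os.timerfd_create")
class TimerfdLowerBoundTest(unittest.TestCase):
    def test_timer_does_not_fire_early(self):
        fd = os.timerfd_create(time.CLOCK_MONOTONIC)
        self.addCleanup(os.close, fd)
        t0 = time.monotonic()
        os.timerfd_settime(fd, initial=0.25)  # one-shot timer, 250 ms
        os.read(fd, 8)                        # blocks until the timer expires
        elapsed = time.monotonic() - t0
        # Assert only that the timer did not fire early; there is no maximum
        # duration, so a heavily loaded machine cannot make this fail.
        self.assertGreaterEqual(elapsed, 0.25)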

@rossburton
Contributor Author

Another example that our CI just failed on:

ERROR: test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessPidfdWatcherTests.test_kill_issue43884)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.13/test/test_asyncio/test_subprocess.py", line 229, in test_kill_issue43884
    proc.kill()
    ~~~~~~~~~^^
  File "/usr/lib/python3.13/asyncio/subprocess.py", line 146, in kill
    self._transport.kill()
    ~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/asyncio/base_subprocess.py", line 172, in kill
    self.send_signal(signal.SIGKILL)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/asyncio/base_subprocess.py", line 162, in send_signal
    self._check_proc()
    ~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/asyncio/base_subprocess.py", line 146, in _check_proc
    raise ProcessLookupError()
ProcessLookupError

Given that the test code contains a hardcoded short sleep:

self.loop.run_until_complete(asyncio.sleep(1))

I'm guessing the problem here is that the test assumes one second is enough for the forked process to be ready, and it normally is, but under heavy load at just the right time it isn't.

What would the right solution be here? Make the sleep(1) something more like sleep(60)?

@vstinner
Member

vstinner commented Mar 7, 2025

ERROR: test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessPidfdWatcherTests.test_kill_issue43884)

A solution for subprocess synchronization would be to create a pipe and use reads/writes on that pipe to synchronize: for example, the child process can write to it once it's ready. See Lib/test/_test_eintr.py for examples.
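
A minimal sketch of that pattern with plain subprocess, showing only the synchronization idea rather than the asyncio test itself:

import subprocess
import sys

# The child prints a marker once it is fully set up, then blocks;
# the parent waits for that marker instead of sleeping for a fixed time.
child_code = "import time; print('ready', flush=True); time.sleep(3600)"

proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdout=subprocess.PIPE, text=True)
assert proc.stdout.readline().strip() == "ready"  # blocks until the child is up
proc.kill()  # the process is guaranteed to exist and be ready at this point
proc.wait()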
