
Conversation

@dario-piotrowicz
Member

As I've been running the repo's tests, I noticed that on my machine¹ test-fs-stat-bigint seems rather flaky. The flakiness is caused by the allowable deltas in the test, which generally seem to be very low (1 or 2), so here I'm proposing to increase such values by 50².

Besides the above, I've applied some minor code refactoring.

Note

I'd like to point out that I am adding this 50 magic number even though a previous magic number of 5 was removed in #31726. Naturally, I agree that magic numbers are not great, but I'd argue that the presence of a magic number is still preferable to flakiness.

Footnotes

  1. macOS Sequoia 15.4 Darwin XXX 24.4.0 Darwin Kernel Version 24.4.0: Wed Mar 19 21:16:34 PDT 2025; root:xnu-11417.101.15~1/RELEASE_ARM64_T6000 arm64

  2. 50 is a completely arbitrary number. I tried various other values, such as 5 and 10, and they significantly reduced the flakiness but did not completely get rid of it, so I figured the number might as well be increased enough to eliminate the flakiness entirely. My guess is that, in the grand scheme of things, something like 50 milliseconds is a very small and acceptable delta anyway, but this is really a wild guess on my part, and I'm waiting to see if anyone pushes back on it.

the current allowable delta values cause
some flakiness in the test-fs-stat-bigint
tests, so the changes here increment such
values by an arbitrary number to fix such
flakiness
@nodejs-github-bot nodejs-github-bot added needs-ci PRs that need a full CI run. test Issues and PRs related to the tests. labels Apr 26, 2025
@codecov

codecov bot commented Apr 26, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.26%. Comparing base (b673c69) to head (e6057b9).
Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #58043      +/-   ##
==========================================
- Coverage   90.28%   90.26%   -0.02%     
==========================================
  Files         630      630              
  Lines      186150   186328     +178     
  Branches    36468    36503      +35     
==========================================
+ Hits       168059   168186     +127     
- Misses      10975    11013      +38     
- Partials     7116     7129      +13     

see 33 files with indirect coverage changes


Member

@RaisinTen RaisinTen left a comment


Adding a magic number feels more like a mitigation than a fix.

Logically, it does make sense to expect the numbers to be between the start and end times.

I'm curious whether this was flaky for all of these subtests or whether it can be isolated to only some of these:

  • fs calls like stat / fstat / lstat
  • conditions like sync / promise / callback
  • stat object properties like atime / mtime / ctime / birthtime

Investigating that might help in figuring out what's really causing this flakiness.

// this number is computed by taking the difference between the start and end
// times of the stat calls, converting it from nanoseconds to milliseconds,
// and adding a small buffer to account for potential timing variations
const allowableTimeDelta = Math.ceil(Number(endTime - startTime) / 1e6) + 50;
Member


The magic number needs to be passed through common.platformTimeout() to translate it to equivalent numbers on other platforms.
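For illustration, a minimal sketch of that suggestion (assuming it would live in a test under test/parallel/, where ../common resolves to the Node.js test helpers, and reusing the 50 ms buffer from this PR):

const common = require('../common');
const fs = require('node:fs');

const startTime = process.hrtime.bigint();
const stats = fs.statSync(__filename);
const endTime = process.hrtime.bigint();

// Same computation as in the snippet above, but the extra 50 ms buffer goes
// through common.platformTimeout() so it scales up on slower platforms/builds.
const allowableTimeDelta =
  Math.ceil(Number(endTime - startTime) / 1e6) + common.platformTimeout(50);
console.log('allowable delta (ms):', allowableTimeDelta, stats.atimeMs);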

Member

@RaisinTen RaisinTen left a comment


Can you check if measuring the start time before file creation (the getFilename() call creates the file) gets rid of the flakiness?
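For clarity, a minimal sketch of what that check would look like (illustrative only; the file path below is a stand-in for what the test's getFilename() creates):

const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Measure the start time BEFORE the file is created, so the allowable window
// also covers the file creation itself.
const startTime = process.hrtime.bigint();
const filename = path.join(os.tmpdir(), 'stat-bigint-sketch');  // stand-in for getFilename()
fs.writeFileSync(filename, 'test');
const bigintStats = fs.statSync(filename, { bigint: true });
const numStats = fs.statSync(filename);
const endTime = process.hrtime.bigint();
const allowableDelta = Math.ceil(Number(endTime - startTime) / 1e6);
console.log('allowable delta (ms):', allowableDelta,
            'atimeMs diff:', Math.abs(Number(bigintStats.atimeMs) - numStats.atimeMs));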

@dario-piotrowicz
Member Author

Can you check if measuring the start time before file creation (the getFilename() call creates the file) gets rid of the flakiness?

Sure, I've just checked 👍 (on the sync test); it does not seem to help with the flakiness 🤔

@dario-piotrowicz
Member Author

I'm curious whether this was flaky for all of these subtests or whether it can be isolated to only some of these:

  • fs calls like stat / fstat / lstat
  • conditions like sync / promise / callback
  • stat object properties like atime / mtime / ctime / birthtime

Thanks for the helpful suggestion 🙂, however I've tested all the isolated possibilities above and could not find any version that was free of flakiness 😓

I've run subsets of the tests/checks (only the promise tests, only the callback tests, only the stat tests, only the fstat tests, only the atime checks, only the mtime checks, etc.) with a high repetition count (300). I did not run combinations of them (only atime checks in fstat promise tests, only mtime checks in lstat callback tests, etc.) since, given the above, I don't imagine that combinations would yield any different result 😕
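For reference, the repeated runs can be scripted with something along these lines (an illustrative sketch, not part of the repo; it assumes a built ./node binary in the repository root, and repeat-test.js is just a hypothetical helper name):

// repeat-test.js (hypothetical helper): run the test file N times and count
// the failures to get a rough feel for the flake rate.
const { spawnSync } = require('node:child_process');

const iterations = 300;
let failures = 0;
for (let i = 0; i < iterations; i++) {
  const result = spawnSync('./node', ['test/parallel/test-fs-stat-bigint.js'],
                           { stdio: 'ignore' });
  if (result.status !== 0) failures++;
}
console.log(`${failures}/${iterations} runs failed`);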

I've also not measured the level of flakiness of each isolated version; however, by running them and manually looking at the results, it feels to me like there is not much difference in flakiness between the versions (or the full test file as a whole).

@dario-piotrowicz
Member Author

@RaisinTen, could we query Jenkins results to understand if the flakiness is present for all platforms or just a subset? (or if it's even failing in CI)

@RaisinTen
Member

RaisinTen commented Apr 27, 2025

The Jenkins results are logged in the reliability repo, but I wasn't able to find anything useful there because the linked CI logs are pretty old, so the data is gone. However, there are some interesting bits of information about this in the main issue tracker. Check out #45612 and the discussions on James' attempted fix, which is similar to yours.

Based on the data there, the time-based flakiness in this test is known to repro only on M1 macOS due to the unexpectedly large difference in the 2 atimeMs values. Can you confirm that you're also seeing flakiness because of differences in the (birth|c|m)time values?

@dario-piotrowicz
Member Author

Ah, sorry, I forgot to check whether there was an issue already tracking this 😓

Also, sorry, I made a mistake in my testing earlier; it does seem that the flakiness only applies to the atimeMs values 🙇

My testing

Basically with this change:

function verifyStats(bigintStats, numStats, allowableDelta) {
  // allowableDelta: It's possible that the file stats are updated between the
  // two stat() calls so allow for a small difference.
  for (const key of Object.keys(numStats)) {
    const val = numStats[key];
+   if (key === 'atimeMs') continue;
    if (isDate(val)) {
      const time = val.getTime();
      const time2 = bigintStats[key].getTime();
      assert(

I can see that the flakiness goes away

@dario-piotrowicz
Member Author

One solution here could be to check whether the key is atimeMs and the platform is darwin, and if so increase the allowable delta 🤔
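For illustration, the idea would look roughly like this (a sketch only: the helper name is made up, the extra 50 ms is the same arbitrary buffer from this PR, and it would plug into the verifyStats() loop shown earlier):

// Hypothetical helper: widen the allowable window only for atime-related keys
// on macOS, where the flakiness was observed.
function allowableDeltaFor(key, baseDelta) {
  const isAtimeKey = key === 'atime' || key === 'atimeMs';
  if (isAtimeKey && process.platform === 'darwin') {
    return baseDelta + 50;  // same arbitrary buffer discussed above
  }
  return baseDelta;
}

console.log(allowableDeltaFor('atimeMs', 2));  // 52 on darwin, 2 elsewhere
console.log(allowableDeltaFor('mtimeMs', 2));  // 2 everywhere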

But again that might be seen as a mitigation rather than an actual fix.

@dario-piotrowicz
Member Author

I feel like this might just be highly dependent on the OS/platform scheduling and therefore something outside of Node's control 😕

@RaisinTen
Member

Based on the docs (I hope it's the same on your M1 macOS too), these are the causes for the atime to change:

$ man 2 stat
...
     st_atime         Time when file data last accessed.  Changed by the mknod(2), utimes(2) and read(2) system
                      calls.

Can you check what's causing the atime to go outside the expected window? You can try to wildly simplify the test locally, make sure the flake repros and use commands like fs_usage, dtruss, etc. to see what's causing this change.
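For reference, a "wildly simplified" local repro could look something like the following sketch (illustrative only, not the actual test; the file path and iteration count are arbitrary), which could then be observed under fs_usage/dtruss:

// Stat the same file repeatedly and report whenever the bigint and number
// atimeMs values disagree by more than the measured time window.
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

const filename = path.join(os.tmpdir(), 'stat-bigint-atime-repro');
fs.writeFileSync(filename, 'test');

for (let i = 0; i < 10000; i++) {
  const start = process.hrtime.bigint();
  const bigintStats = fs.statSync(filename, { bigint: true });
  const numStats = fs.statSync(filename);
  const end = process.hrtime.bigint();
  const allowableDelta = Math.ceil(Number(end - start) / 1e6);
  const diff = Math.abs(Number(bigintStats.atimeMs) - numStats.atimeMs);
  if (diff > allowableDelta) {
    console.log(`iteration ${i}: atimeMs differs by ${diff}ms (allowed ${allowableDelta}ms)`);
  }
}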

@dario-piotrowicz
Member Author

Based on the docs (I hope it's the same on your M1 macOS too), these are the causes for the atime to change:

$ man 2 stat
...
     st_atime         Time when file data last accessed.  Changed by the mknod(2), utimes(2) and read(2) system
                      calls.

Can you check what's causing the atime to go outside the expected window? You can try to wildly simplify the test locally, make sure the flake repros and use commands like fs_usage, dtruss, etc. to see what's causing this change.

Thank you so very much for the fantastic suggestion @RaisinTen 🫶

Of course, I will look into this as soon as I can

Anyway, what do you think of this current PR? Should I maybe close it so we can keep the conversation in #45612? (I'm asking just to make sure that this PR doesn't create unnecessary noise in the repo 🙂)

@RaisinTen
Member

Since the current approach of this PR is similar to James' approach in #45925 which was closed, I don't think this should land in its current state.

The investigation happening here is very useful and I think we can figure out the root cause and solve this. I don't have any strong opinions on whether the conversation should happen here or in #45612, as long as it happens somewhere. So I'm fine with whichever channel you prefer.

Thanks a lot for looking into this. :)

@dario-piotrowicz
Member Author

Since the current approach of this PR is similar to James' approach in #45925 which was closed, I don't think this should land in its current state.

Yes, I think the content of this PR would likely change significantly anyway, so I might as well close it and reopen one with the agreed approach once we come up with one.

The investigation happening here is very useful and I think we can figure out the root cause and solve this. I don't have any strong opinions on whether the conversation should happen here or in #45612, as long as it happens somewhere. So I'm fine with whichever channel you prefer.

Ok 🙂, if that's all right with you, I'll just close this one and continue the conversation in #45612 when I have updates 🙂
(I've made a note of this and will look into it properly in the next few days; I'm not going to forget about it 👍)

Thanks a lot for looking into this. :)

It's my pleasure 😄 Thank you so very much for all the support and helpful suggestions! 🫶

@dario-piotrowicz dario-piotrowicz deleted the dario/test-fs-stat-bigint-flakiness branch April 28, 2025 21:00
