Skip to content

Conversation

lovesegfault
Copy link
Member

Motivation

Attempting to revive the extremely useful #6585

Context


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@github-actions github-actions bot added the with-tests Issues related to testing. PRs with tests have some priority label Oct 11, 2025
Previously, `CycleEdgeScanSink::operator()` copied the entire
`getResult()` `StringSet` twice on every 64KB chunk to detect newly
found hashes. For large files, this created O(n * chunks) overhead.

Now we track which hashes have been recorded for the current file using
`recordedForCurrentFile`, avoiding the set copies. The insert() returns
true only for newly seen hashes, making this O(1) per hash found.
The `hashPathMap` was being passed to `CycleEdgeScanSink` and stored as
a member variable, but was never actually used. The sink only needs the
hash strings for detection via `RefScanSink`, not the full `StorePath`
mapping.
Moved the `edges` member variable from public to private section for
better encapsulation. Access is provided through the `getEdges()`
method.
Replaced raw `read()` with `readFull()` helper, which properly
handles partial reads and `EINTR`. The previous code manually checked
for errors but didn't handle the case where `read()` returns fewer
bytes than requested.
After refactoring to use `RefScanSink`, we no longer manually search for
hashes in buffers, so the `refLength` constant (hash length) is unused.
`RefScanSink` handles this internally.
Comment on lines 94 to 98
auto st = lstat(path);

debug("scanForCycleEdges2: scanning path = %s", path);

if (S_ISREG(st.st_mode)) {
Copy link
Member

@Ericson2314 Ericson2314 Oct 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would want to do this with std::filesystem stuff, so it is portable to Windows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we reuse NarAccessor to avoid the filesystem stuff completely?

Copy link
Member

@Ericson2314 Ericson2314 Oct 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. In generally I do get the sense that there is a lot of deduplication that can still be done here. I just haven't looked into the details yet.

(Also, FWIW, my plan with CA derivations is to make cycles impossible by construction.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored it to use NarAccessor

The single-pass greedy algorithm could fail to connect all edges if
they arrived in certain orders. For example, edges A→B, C→D, D→A, B→C
would result in two paths [D→A→B→C] and [C→D] instead of one complete
cycle.

Added a second pass that repeatedly tries to merge existing multiedges
with each other until no more merges are possible. This ensures we find
complete cycle paths regardless of edge discovery order.
…File APIs

Replaced POSIX-specific file operations with portable alternatives to
improve Windows compatibility.
Added explicit `#include "nix/util/archive.hh"` since we use the
caseHackSuffix constant from it.
Replace string concatenation for path joining with type-safe
`std::filesystem::path operator/`.
…anPath

The "2" suffix was unclear and didn't communicate the function's purpose.
The new name better describes what it does: walks the filesystem tree and
scans each file using the provided sink.
Changed `walkAndScanPath` to take `std::filesystem::path` instead of `string`,
eliminating repeated string→path conversions throughout the function.
…acOS only

Since caseHackSuffix is only used inside `#ifdef __APPLE__` blocks, gate
the archive.hh include with the same condition.
Move `find-cycles.hh` to `src/libstore/include/nix/store/build/` to
ensure it is properly installed as a public header and can be used by
tests and other consumers of the library.
Replace the multi-pass O(n²) algorithm with a single-pass O(n) approach
using hashmaps to track path endpoints.

Before:
- First pass: O(n*m) to join each edge with existing multiedges
- Second pass: O(m²) repeated merging until no more merges possible
- Required multiple iterations through the entire dataset

Now:
- Single O(n) pass through edges
- Two hashmaps (pathStartingAt, pathEndingAt) enable O(1) lookups
- Immediately identifies connections without linear searches
- Handles all cases in one pass:
  * Extending existing paths (prepend/append)
  * Joining two separate paths
  * Forming cycles when a path connects to itself
Make the cyclic outputs test compatible with older daemons that don't
have the enhanced cycle detection. The test now accepts both:
- New behavior: detailed cycle analysis with file paths
- Old behavior: basic cycle error message

This ensures daemon compatibility tests pass when running the new test
suite against older daemon versions (e.g., 2.28.4).
Replace direct `std::filesystem` operations with `SourceAccessor`..
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

with-tests Issues related to testing. PRs with tests have some priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants