New publish concurrent with yanks is missed #26

Closed
Nemo157 opened this issue Nov 17, 2022 · 9 comments · Fixed by #29

Nemo157 commented Nov 17, 2022

From investigation in rust-lang/docs.rs#1912 it looks like this is a crates-index-diff issue. The crate had a publish and two yanks in between two checks and we only see the yank event.

The relevant commit range contains just this publish and the two yanks:

> git log --oneline --reverse b49672ff6a2d40123a593cfbca9d05219346e398~1..92c18bdf30a4872d355e4a5b3a7f7c6c75323cf7
b49672ff6a2 Updating crate `aegis#0.2.4`
da97cd0243b Updating crate `ansi-color-codec#0.3.11`
1533f8e863a Yanking crate `ansi-color-codec#0.3.4`
92c18bdf30a Yanking crate `ansi-color-codec#0.3.5`

I wrote a little test program to verify this:

use crates_index_diff::{Change, CrateVersion};

fn main() -> anyhow::Result<()> {
    let index = crates_index_diff::Index::from_path_or_cloned("index")?;
    let mut args = std::env::args().skip(1);
    let start = git_hash::ObjectId::from_hex(args.next().unwrap().as_bytes())?;
    let end = git_hash::ObjectId::from_hex(args.next().unwrap().as_bytes())?;
    for change in index.changes_between_commits(start, end)? {
        match change {
            Change::Added(CrateVersion { name, version, .. }) => println!("added {name} {version}"),
            Change::Yanked(CrateVersion { name, version, .. }) => println!("yanked {name} {version}"),
            Change::Deleted { name } => println!("deleted {name}"),
        }
    }
    Ok(())
}

For the full range it shows the same behaviour:

> cargo run --quiet --release -- b49672ff6a2d40123a593cfbca9d05219346e398 92c18bdf30a4872d355e4a5b3a7f7c6c75323cf7
yanked ansi-color-codec 0.3.4
yanked ansi-color-codec 0.3.5

If I only include the first one or two commits it behaves correctly:

> cargo run --quiet --release -- b49672ff6a2d40123a593cfbca9d05219346e398 da97cd0243bd297e7b0e4b11040e531af2e256d1
added ansi-color-codec 0.3.11

> cargo run --quiet --release -- b49672ff6a2d40123a593cfbca9d05219346e398 1533f8e863aee3fd5340513acfeef9d42816cd08
yanked ansi-color-codec 0.3.4
added ansi-color-codec 0.3.11
Byron self-assigned this Nov 18, 2022
Byron commented Nov 18, 2022

Thanks a lot for the excellent analysis, which will help tremendously to produce a test for a fix. I am pretty sure it has something to do with these lines, which exist to ensure the normalization tests work (and don't register as a change). Let's see.

Byron added a commit that referenced this issue Nov 18, 2022
Byron added a commit that referenced this issue Nov 18, 2022
Previously, when multiple diffs for one crate were distributed over multiple
commits and rightfully showed up as multiple hunks of modified and added
lines, only the modified lines were registered, not the added ones
(or the deleted ones, for that matter).

This would cause updates or removals to be missed.

Now hunks of changes are exhausted properly, fixing this issue.
Byron added a commit that referenced this issue Nov 18, 2022
Byron added a commit that referenced this issue Nov 18, 2022
Byron added a commit that referenced this issue Nov 18, 2022
Byron commented Nov 18, 2022

A new bugfix release has been created.

The issue was that I forgot to exhaust the whole diff; the algorithm would stop after the first hunk of modifications. Here there were modifications and an addition.

That's probably one of the disadvantages of operating only on a sample, and I wonder if the baseline could be made exhaustive despite a constantly changing index (in terms of git history). Maybe it's enough to get any clone of the crates index and ensure that all changes obtained through this library, when aggregated, match all iterable crates.
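A minimal sketch of the bug described above, with made-up types rather than the crate's actual code: if a file's diff contains several hunks, collecting changes only from the first hunk misses the publish that landed in a later hunk.

```rust
// Hypothetical simplification: a diff of one crate file, split into hunks.
#[derive(Clone)]
struct Hunk {
    lines: Vec<String>, // changed index lines in this hunk
}

// The buggy behaviour: stop after the first hunk.
fn first_hunk_only(hunks: &[Hunk]) -> Vec<String> {
    hunks.first().map(|h| h.lines.clone()).unwrap_or_default()
}

// The fix: exhaust all hunks of the diff.
fn exhaustive(hunks: &[Hunk]) -> Vec<String> {
    hunks.iter().flat_map(|h| h.lines.clone()).collect()
}

fn main() {
    // Two yanks land in one hunk, the new publish in a second hunk,
    // mirroring the commit range from this issue.
    let hunks = vec![
        Hunk { lines: vec!["yank 0.3.4".into(), "yank 0.3.5".into()] },
        Hunk { lines: vec!["add 0.3.11".into()] },
    ];
    assert_eq!(first_hunk_only(&hunks).len(), 2); // the publish is missed
    assert_eq!(exhaustive(&hunks).len(), 3); // all three changes seen
    println!("exhaustive: {:?}", exhaustive(&hunks));
}
```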

Byron added a commit that referenced this issue Nov 18, 2022
This improves performance slightly when dealing with a lot of versions,
like when all versions are obtained from the beginning of time.
Byron added a commit that referenced this issue Nov 18, 2022
…eleted versions. (#26)

That way no information is lost; previously the exact
version information was discarded.

Keeping it makes it possible to reproduce the current state by
aggregating all changes.
Byron added a commit that referenced this issue Nov 18, 2022
Byron added a commit that referenced this issue Nov 18, 2022
Byron added a commit that referenced this issue Nov 18, 2022
But what we would have to do is to step through a couple of the changes
at a time and aggregate from these.
Byron added a commit that referenced this issue Nov 18, 2022
Byron added a commit that referenced this issue Nov 18, 2022
When stepping through the changes in multiple steps, we end up with
more crates than there actually are, even though we identify them by
checksum and consider deletions. Yanking doesn't remove them from
the iteration either.
Byron added a commit that referenced this issue Nov 18, 2022
Byron added a commit that referenced this issue Nov 18, 2022
…but besides completely failing the normalization test, which I don't
understand, it also doesn't manage to get the correct number of
versions.
Byron added a commit that referenced this issue Nov 18, 2022
Byron commented Nov 18, 2022

I did manage to create a baseline comparison, but it only works if a single diff is used. When trying to chunk up the diffs into iterations so it takes multiple steps to complete, I couldn't get it to line up: the diffed state would end up with about 13 thousand changes fewer than expected. I couldn't figure out whether it's due to a faulty test implementation (the step-wise diffing) or whether it's related to the diff implementation or logic itself. I did try to adjust it to make more sense to me, and that also didn't work while breaking the normalization tests, so unfortunately I think I am still missing something here. The big question is whether the results can possibly match with a step-wise diffing strategy; I thought they should, but maybe that's a wrong assumption.

CC @pascalkuthe for more theoretical (and practical) diffing expertise. Should a diff A | C be the same changes as A | B and B | C? I think so, but maybe that's wrong?

pascalkuthe commented Nov 18, 2022

> I did manage to create a baseline comparison but it only works if a single diff is used. When trying to chunk up the diffs into iterations so it takes multiple steps to complete, I couldn't get it to line up. The diffed state would end up with about 13 thousand changes less than expected. I couldn't figure out if it's due to a faulty tests implementation (the step-wise diffing) or if it's related to the diff implementation or logic itself. I did try to adjust it to make more sense to me and that also didn't work while breaking normalization tests, so unfortunately I think I am still missing something here. The big question is if the results can possibly match with a step-wise diffing strategy, and I thought it should but maybe that's a wrong assumption.
>
> CC @pascalkuthe for more theoretical (and practical) diffing expertise. Should a diff A | C be the same changes as A | B and B | C? I think so, but maybe that's wrong?

I am not sure what you define as being the same changes. But in general I think that is not true.
Take the following example:

A:

foo

B:

foo
bar

C:

foo
foo

A diff A | C would yield:

 foo
+foo

Diffs A | B and B | C would yield:

 foo
+bar
 foo
-bar
+foo

I am not sure if you would define that as equivalent, but I wouldn't.
Even if you created some sort of smart algorithm to merge these diffs into each other, the result would generally not be the same. This is harder to demonstrate with a simple example, but let me try to explain it more generally:

Diffs are called edit sequences in the literature because what they provide is a linear sequence of edits that transforms sequence A into sequence B. That means the following property always holds (pseudo code, it does not actually compile):

let hunks = diff(A, B);
for hunk in hunks {
   A.remove_lines(hunk.before);
   A.insert_lines_at(hunk.before.start, B.get_lines(hunk.after));
}
assert_eq!(A, B);

Beyond that there are no guarantees made by imara-diff.
In fact, the following would be a perfectly valid diff: [Hunk { before: 0..A.len(), after: 0..B.len() }].

Of course a diffing algorithm that always returned a full replacement would not be particularly useful.
Therefore an unmodified version of the Myers algorithm further guarantees that the generated edit script is minimal.
That means that all other possible edit scripts change the same number of tokens (lines) or more. However, there can be multiple minimal edit scripts, and which one you get is essentially random/depends on the implementation. The algorithms are usually implemented such that the output is as intuitive to humans as possible. For example, all implementations of the Myers algorithm prefer deletions over insertions, which means that diffs are also not symmetric with regard to their input.

This unmodified version (available in imara-diff as Algorithm::MyersMinimal) has a runtime complexity of O(N^2), and hence imara-diff (and git diff and GNU diff) use some heuristics to abort the search for a minimal edit script once a good-enough edit script is found, so it has O(N) complexity. This still yields pretty good diffs in almost all cases, because Myers diff is implemented as a divide-and-conquer algorithm and the approximation of the divide point made by these heuristics is usually pretty good/close to minimal. This is usually not a problem in practice, because a non-minimal diff is often more intuitive to humans (which is what the Histogram algorithm exploits).

Overall that means that you cannot rely on any property of the diff apart from it being a valid edit script and a pretty good approximation of a minimal edit script (if you are not using MyersMinimal) that is intuitive to humans. However, when splitting a diff up into multiple steps, a different edit sequence might be chosen by the heuristic. Even using MyersMinimal would not really help here, because there can be multiple different minimal edit sequences and the algorithm might choose a different one when applied to the full file. Also, I believe that the composition of minimal edit sequences is not necessarily a minimal edit sequence, but I might be wrong about that.
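The edit-sequence property described above (applying every hunk of diff(A, C) transforms A into C) can be sketched in runnable Rust. The Hunk type here is a simplification made up for this example, not imara-diff's actual API, and the hunk is hardcoded from the A | C example in this thread:

```rust
// A hunk replaces the line range `before` in A with the range `after` from C.
// Ranges are (start, end), end-exclusive.
#[derive(Clone, Copy)]
struct Hunk {
    before: (usize, usize),
    after: (usize, usize),
}

// Apply all hunks of a diff(a, c) to `a`, yielding `c`.
fn apply_hunks(a: &[&str], c: &[&str], hunks: &[Hunk]) -> Vec<String> {
    let mut out: Vec<String> = a.iter().map(|s| s.to_string()).collect();
    // Apply in reverse order so earlier line indices stay valid.
    for h in hunks.iter().rev() {
        let replacement: Vec<String> =
            c[h.after.0..h.after.1].iter().map(|s| s.to_string()).collect();
        out.splice(h.before.0..h.before.1, replacement);
    }
    out
}

fn main() {
    let a = ["foo"];
    let c = ["foo", "foo"];
    // The single hunk of diff(A, C) from the example: insert one more "foo".
    let hunks = [Hunk { before: (1, 1), after: (1, 2) }];
    let result = apply_hunks(&a, &c, &hunks);
    assert_eq!(result, c); // the edit sequence reproduces C exactly
    println!("A transformed into C: {result:?}");
}
```

This is exactly the guarantee a valid edit script gives, and nothing more: the intermediate hunks of a chunked diff may differ, but each chunk still transforms its start state into its end state.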

Byron commented Nov 18, 2022

Thanks so much! From what I see, I am not crazy for thinking that my baseline should work (but it doesn't), as edits that change A to B, no matter how they are chosen, always lead to the desired state. So whether I go from A to C or from A to B to C doesn't matter; I should arrive at C no matter what edit sequence the diffing comes up with.

But there lies the problem. In my baseline.rs I am able to aggregate all changes from A to C (where A is empty) and get C, as compared to C itself (the baseline, provided by crates-index). C is the set of all versions of all crates, each of which is identifiable by a unique hash that doesn't change whether it's yanked or not. However, for some reason, when I aggregate all changes from A to B and from B to C, I do not end up with all versions of C, but have about 14k fewer. That number also varies depending on how many steps I take to C (even though the example doesn't really support further subdivision 😅).

This baseline implementation goes from A to C directly and shows that it can work, but we essentially only aggregate additions.

This other baseline subdivides the path to C further, but it ends up with fewer versions, and I really couldn't figure out why, even after ruthlessly refactoring the way the diffing works.

And that's the problem here: I am not sure what the problem is, and I just can't seem to make it work. I am probably hoping you will look at it and quickly see what's wrong, even though what I probably have to do is use a smaller sample and debug it from there. Clearly it can work and it should work, and if it doesn't, there is still something wrong with the algorithm somewhere, which will eventually lead to docs.rs skipping builds, which is quite unacceptable. I mean, this library exists for this singular purpose, and thus far it actually seemed to have worked. Now certain weaknesses appear despite having better tests and more capabilities than ever, so I feel like I am cheering myself on here to not give up and make it work, because it can and it should :D. But not tonight 😅.

Byron commented Nov 18, 2022

Potential ways forward…

  • check the baseline not against commit 50k, but against commit 5, to reduce the set of crates to work with
  • don't use imara-diff, but instead use a maximal diff by adding all lines of the new file and removing all lines of the old file, identified by checksum
  • use a sub-directory of crates.io to reduce the number of crates and versions to look at

Byron commented Nov 19, 2022

Pascal and I aligned on removing imara-diff from the equation, for the sole reason of simplifying the implementation, and leveraging the presence of the checksum field instead, which globally identifies each crate version. With it, a diff between old and new becomes a set operation between all versions (lines) in the old and modified versions of a crate file. Note that to see yanks, we will have to make our own id that is the hash of the checksum and the yank field. Doing this should naturally pass the normalization tests as well, without the strange workarounds that are currently present.
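The set operation described above might look roughly like this. This is a sketch with hypothetical types, not the crate's actual implementation; the key idea is that folding the yank flag into the identity makes a yank visible as one removal plus one addition of the same checksum:

```rust
use std::collections::HashSet;

// Hypothetical id for one version line of a crate file: the checksum
// globally identifies the version, and the yank flag is folded in so
// that yanks register as changes.
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
struct VersionId {
    checksum: String,
    yanked: bool,
}

// Diff two states of a crate file as a set operation, returning
// (added, removed) version ids in their original line order.
fn diff_versions(old: &[VersionId], new: &[VersionId]) -> (Vec<VersionId>, Vec<VersionId>) {
    let old_set: HashSet<&VersionId> = old.iter().collect();
    let new_set: HashSet<&VersionId> = new.iter().collect();
    let added = new.iter().filter(|v| !old_set.contains(*v)).cloned().collect();
    let removed = old.iter().filter(|v| !new_set.contains(*v)).cloned().collect();
    (added, removed)
}

fn main() {
    let v = |c: &str, yanked| VersionId { checksum: c.into(), yanked };
    // A yank shows up as the same checksum leaving with yanked=false
    // and re-appearing with yanked=true; "def" is a fresh publish.
    let old = vec![v("abc", false)];
    let new = vec![v("abc", true), v("def", false)];
    let (added, removed) = diff_versions(&old, &new);
    assert_eq!(added, vec![v("abc", true), v("def", false)]);
    assert_eq!(removed, vec![v("abc", false)]);
    println!("added: {added:?}, removed: {removed:?}");
}
```

Unlike a line-based diff, this comparison is exhaustive by construction, so no hunk can be left unprocessed.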

@pascalkuthe already started digging into a solution like the one described above, and I believe it will perform as well as or better than what's currently there.

While we're at it, there is another shortcoming in the current implementation: the loss of ordering between crates. For docs.rs this means that more recently published crates may take precedence over less recently published ones merely by sorting alphabetically before them. This should be addressable by allowing single-stepping through the chain of ancestors that connects two given commits. With that, and knowing that only manual deletions of files produce more than one change per commit, we know we keep the order of the changes we emit, which makes the queue fair. Note that it may be, and is likely, that docs.rs already puts the queue in some order, but with the proposed change it could do so more fairly as well. Note that when doing any stepping, it must be possible to fall back to comparing without intermediate steps if there is no connection between the commits anymore (as happens when crates.io is squashed).

I should finish the work on maintaining order this weekend and adjust the baseline test so that…

  • …it uses up to X big steps (without single-stepping) to simulate more massive changes to then…
  • …take Y steps from a close-to-end commit to the last one in single-step-mode to simulate the then typical usage of docs.rs

…expecting that the final set of versions obtained through that matches an iteration of versions with crates-index::Index::crates().versions().

So if the list of commits of crates.io (50k) looks like this:

-----------------------------------------------------------------------

then the step order of the baseline test would roughly be, with each step marked with an 'x':

-----x-----x-----x-----x-----x-----x-----x-----x-----x-----xxxxxxxxxxxx

Once a baseline test like this works, I am confident that the implementation doesn't hold any surprises for us to discover in the future, and we can 'be done' with it :D.
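The stepping schedule sketched in the diagram above could be generated like this. The function and its parameters are made up for illustration; X big steps are spread evenly over most of the history, then the final stretch is walked one commit at a time:

```rust
// Hypothetical sketch of the stepped-baseline schedule: `big_steps`
// evenly spaced jumps over most of the history, then single steps
// over the last `single_steps` commits. Returns 1-based commit indices
// at which a diff is taken.
fn step_schedule(total_commits: usize, big_steps: usize, single_steps: usize) -> Vec<usize> {
    let big_span = total_commits - single_steps;
    let stride = big_span / big_steps;
    let mut steps: Vec<usize> = (1..=big_steps).map(|i| i * stride).collect();
    // Walk the remaining commits one at a time.
    steps.extend(big_span..=total_commits);
    steps.dedup(); // the last big step may coincide with the first single step
    steps
}

fn main() {
    // A toy history: 70 commits, 3 big steps, then 12 single steps.
    let steps = step_schedule(70, 3, 12);
    assert_eq!(
        steps,
        vec![19, 38, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70]
    );
    println!("{steps:?}");
}
```

Running the baseline diff between each consecutive pair of these commits, and aggregating the changes, should then reproduce the final state of the index.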

Byron commented Nov 19, 2022

Once the above works, I think another improvement is to set up the baseline tests to also incorporate the yanked status. Right now it only sees versions, and it isn't known whether these are yanked or not, even though the changes emitted by this library should make it possible to arrive at the correct yanked state as well.

This leaves us with the following:

Tasks

  1. refactor the diffing logic to pass all tests and the current baseline (@pascalkuthe): reproduce and fix #26 (#27)
  2. use the stepped-baseline approach to validate it really works now (like in ---x----x----x--x)

Everything else was offloaded to #30 to remain within budgetary constraints.

With all these in place I am confident that the crate finally works as it should without surprises haunting us in the future.

pascalkuthe pushed a commit to pascalkuthe/crates-index-diff-rs that referenced this issue Nov 21, 2022
- typos and form
- improve docs and docs consistency.
Byron closed this as completed in #29 Nov 21, 2022
Byron commented Nov 21, 2022

A new release has been created, the fastest and provably most correct version of this crate yet 🎉.

@pascalkuthe is looking forward to opening a PR over on docs.rs to integrate this latest release, as it has some breaking changes that need to be tackled. This will happen in the next week or two.

Cheers
