New publish concurrent with yanks is missed #26
Thanks a lot for the excellent analysis, which will help tremendously in producing a test for a fix. I am pretty sure it has something to do with these lines, which exist to ensure the normalization tests work (and don't register as a change). Let's see.
Previously, when the changes to one crate were spread over multiple commits and therefore showed up as multiple hunks of modified and added lines, only the modified lines were registered, not the new ones (or the deleted ones, for that matter). This would cause updates or removals to be missed. Now hunks of changes are exhausted properly, fixing this issue.
A new bugfix release has been created. The issue was that I forgot to exhaust the whole diff, so the algorithm would stop after the first hunk of modifications. Here there were modifications and an addition. That's probably one of the disadvantages of operating only on a sample, and I wonder if the baseline could be made exhaustive despite a constantly changing index (in terms of git history). Maybe it's enough to get any clone of the crates index and ensure that all changes obtained through this library, when aggregated, match all iterable crates.
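To make the failure mode above concrete, here is a minimal sketch of what exhausting every hunk means. The `Hunk` and `LineChange` types and the `changes_from_hunks` function are hypothetical and only illustrate the idea; they are not this crate's actual types or its diff code.

```rust
/// Hypothetical hunk: the lines it removes from the old file and the lines it adds.
#[derive(Debug, Clone, PartialEq)]
pub struct Hunk {
    pub removed: Vec<String>,
    pub added: Vec<String>,
}

/// Hypothetical line-level change derived from a hunk.
#[derive(Debug, Clone, PartialEq)]
pub enum LineChange {
    Modified { from: String, to: String },
    Added(String),
    Deleted(String),
}

/// Visit every hunk (not just the first one). Removed and added lines are
/// paired up as modifications; whatever is left over is a pure deletion or
/// a pure addition.
pub fn changes_from_hunks(hunks: &[Hunk]) -> Vec<LineChange> {
    let mut out = Vec::new();
    for hunk in hunks {
        let paired = hunk.removed.len().min(hunk.added.len());
        for i in 0..paired {
            out.push(LineChange::Modified {
                from: hunk.removed[i].clone(),
                to: hunk.added[i].clone(),
            });
        }
        for line in &hunk.removed[paired..] {
            out.push(LineChange::Deleted(line.clone()));
        }
        for line in &hunk.added[paired..] {
            out.push(LineChange::Added(line.clone()));
        }
    }
    out
}
```

With one hunk that modifies two lines (the two yanks) followed by a second hunk that only adds a line (the publish), this yields two `Modified` entries and one `Added` entry. A loop that returned after the first hunk would report only the yanks.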
This improves performance slightly when dealing with a lot of versions, like when all versions are obtained from the beginning of time.
…eleted versions. (#26) That way no information is lost; previously the exact version information was discarded. Keeping it makes it possible to reproduce the current state by aggregating all changes.
But what we would have to do is to step through a couple of the changes at a time and aggregate from these.
When stepping through the changes in multiple steps, we end up with more crates than there are, even though we identify them by checksum and consider deletions. Yanking doesn't remove them from the iteration either.
…but besides completely failing the normalization test, which I don't understand, it also doesn't manage to get the correct number of versions.
I did manage to create a baseline comparison, but it only works if a single diff is used. When trying to chunk up the diffs into iterations so it takes multiple steps to complete, I couldn't get it to line up. The diffed state would end up with about 13 thousand fewer changes than expected. I couldn't figure out if it's due to a faulty test implementation (the step-wise diffing) or if it's related to the diff implementation or logic itself. I did try to adjust it to make more sense to me, and that also didn't work while breaking normalization tests, so unfortunately I think I am still missing something here. The big question is whether the results can possibly match with a step-wise diffing strategy; I thought they should, but maybe that's a wrong assumption. CC @pascalkuthe for more theoretical (and practical) diffing expertise. Should a diff A | C be the same changes as A | B and B | C? I think so, but maybe that's wrong?
I am not sure what you define as being the same changes. But in general I think that is not true. A:
B:
C:
A diff
A diff
I am not sure if you would define that as equivalent, but I wouldn't. Diffs are called edit sequences in the literature because what they provide is a linear sequence of edits that transform sequence A to sequence B. That means the following property always holds (pseudo code, does not actually compile):
let hunks = diff(A, B);
for hunk in hunks {
    A.remove_lines(hunk.before);
    A.insert_lines_at(hunk.before.start, B.get_lines(hunk.after));
}
assert_eq!(A, B);
Beyond that there are no guarantees made by Of course a diffing algorithm that would just always return a full replacement is not particularly useful. This unmodified version (available in Overall that means that you cannot rely on any property of the diff apart from being a valid edit script and being a pretty good approximation of a minimal edit script (if you are not using
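A runnable sketch of the edit-script property from the pseudo code above might look like this. The `Hunk` type with `before`/`after` ranges mirrors the pseudo code and is hypothetical, not a specific diffing library's API. The sketch assumes hunks are sorted by position and do not overlap, and it applies them back to front so that the `before` ranges of earlier hunks stay valid while the sequence is being edited.

```rust
use std::ops::Range;

/// Hypothetical hunk: `before` is a line range in A, `after` a line range in B.
#[derive(Debug, Clone)]
pub struct Hunk {
    pub before: Range<usize>,
    pub after: Range<usize>,
}

/// Apply an edit script to `a`, taking replacement lines from `b`.
/// Assumes `hunks` is sorted ascending and non-overlapping.
pub fn apply_hunks(a: &[&str], b: &[&str], hunks: &[Hunk]) -> Vec<String> {
    let mut out: Vec<String> = a.iter().map(|s| s.to_string()).collect();
    // Back to front: editing later positions first leaves earlier indices untouched.
    for hunk in hunks.iter().rev() {
        let replacement = b[hunk.after.clone()].iter().map(|s| s.to_string());
        // `splice` replaces the `before` range with the `after` lines once dropped.
        drop(out.splice(hunk.before.clone(), replacement));
    }
    out
}
```

For A = `["a", "b", "c"]` and B = `["a", "x", "c", "d"]`, both a small script (`1..2 -> 1..2` plus `3..3 -> 3..4`) and a full replacement (`0..3 -> 0..4`) turn A into B. The two scripts contain different hunks, which is why only the resulting state is comparable, not the hunks themselves.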
Thanks so much! From what I see, I am not crazy thinking that my baseline should work (but doesn't), as edits that change A to B, no matter how, always lead to the desired state. So whether I go from A to C or from A to B to C doesn't matter; I should arrive at C no matter what edit sequence the diffing comes up with. But there lies the problem. In my This baseline implementation goes from A to C directly and shows that it can work, but we essentially only aggregate additions. This other baseline subdivides the path to C further, but it ends up with fewer versions, and I really couldn't figure out why, even after ruthlessly refactoring the way the diffing works. And that's the problem here. I am not sure what the problem is, and just can't seem to make it work. Probably I am hoping you look at it and quickly see what's wrong, even though what I probably have to do is use a smaller sample and debug it from there. Clearly, it can work and it should work, and if it doesn't, there is still something wrong with the algorithm somewhere, which will eventually lead to docs.rs skipping builds, which is quite unacceptable. I mean, this library exists for this singular purpose, and thus far it actually seemed to have worked. Now certain weaknesses appear despite having better tests and more capabilities than ever, so I feel like I am cheering myself on here to not give up and make it work, because it can and it should :D. But not tonight 😅.
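The expectation described above, that aggregating changes step by step ends in the same state as going there directly, can be sketched roughly as follows. The `Version`, `Change`, and `State` types and the `apply` function are hypothetical stand-ins for illustration, not the types this crate exposes. Versions are keyed by checksum, yanking keeps a version in the state, and only a deletion removes it.

```rust
use std::collections::BTreeMap;

/// Hypothetical version record, identified by its checksum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Version {
    pub name: String,
    pub version: String,
    pub checksum: String,
    pub yanked: bool,
}

/// Hypothetical change kinds; each carries the version as it looks after the change.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Change {
    Added(Version),
    Yanked(Version),
    Unyanked(Version),
    Deleted(Version),
}

/// Aggregated state: checksum -> latest known version record.
pub type State = BTreeMap<String, Version>;

/// Fold a batch of changes into the state, in order.
pub fn apply(state: &mut State, changes: &[Change]) {
    for change in changes {
        match change {
            // Yanked versions stay in the state; only their record is updated.
            Change::Added(v) | Change::Yanked(v) | Change::Unyanked(v) => {
                state.insert(v.checksum.clone(), v.clone());
            }
            Change::Deleted(v) => {
                state.remove(&v.checksum);
            }
        }
    }
}
```

A baseline test in this style would call `apply` once per step and compare the final `State` with the one obtained from a single A to C batch. If the two differ, the step-wise batches are missing or duplicating changes somewhere.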
Potential ways forward…
Pascal and I aligned on removing @pascalkuthe already started digging into a solution like the one described above, and I believe it will perform as well as or better than what's currently there. While at it, there is another shortcoming in the current implementation, which is the loss of ordering between crates. For The work about maintaining order I should finish this weekend and adjust the baseline test so that…
…expecting that the final set of versions obtained through that matches an iteration of versions with So if the list of commits of crates.io (50k) looks like this:
then the step order of the baseline test would roughly be, with each step marked with an 'x':
Once a baseline test like this works I am confident that the implementation doesn't hold any surprises anymore for us to discover in the future and 'be done' with it :D.
Once the above works, I think another improvement is to set up baseline tests to also incorporate the This leaves us with the following: Tasks
Everything else was offloaded to #30 to remain within budgetary constraints. With all these in place I am confident that the crate finally works as it should without surprises haunting us in the future.
- typos and form - improve docs and docs consistency.
A new release has been created for the fastest and provably most correct version of this crate yet 🎉. @pascalkuthe is looking forward to opening a PR over on docs.rs to integrate this latest release, as it has some breaking changes that need to be tackled. This will happen in the next week or two. Cheers
From investigation in rust-lang/docs.rs#1912 it looks like this is a crates-index-diff issue. The crate had a publish and two yanks in between two checks and we only see the yank event. The relevant commit range contains just this publish and the two yanks:
I wrote a little test program to verify this:
For the full range it shows the same behaviour:
If I only include the first one or two commits it behaves correctly: