Added File Churn Metric #1071


Merged: 31 commits merged into main on Jun 9, 2023
Conversation

o2sh
Owner

@o2sh o2sh commented Jun 4, 2023

This pull request, part of #1059, introduces a new info line to onefetch called "Churn." This info line displays the files with the most modifications (commits), providing valuable insights into code volatility and potential hotspots.

Calculating the churn is computationally expensive 😢 , as it requires comparing each commit's tree with its parent tree to obtain the diff and see which files were modified. Since git (and gitoxide) does not store deltas, this process becomes resource-intensive the more commits there are in the git history.

To optimize performance, I added a limit on the number of commits used to compute the file churns. This limit can be configured via a CLI flag:

--churn-commit-limit <NUM>     NUM of commits from HEAD used to compute file churns [default: 100]
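The idea can be sketched without any git plumbing: walk the most recent commits, collect the files each one touched, and count occurrences per path. A minimal Python illustration (the function name, toy history, and `limit`/`top` parameters are all hypothetical, not onefetch's actual Rust/gix code):

```python
from collections import Counter

def compute_churn(commits, limit=100, top=3):
    """Count how often each file appears in the diffs of the
    most recent `limit` commits and return the `top` hotspots."""
    churn = Counter()
    for files_changed in commits[:limit]:
        churn.update(files_changed)
    return churn.most_common(top)

# Toy history, newest first: each entry is the set of files a commit touched.
history = [
    {"src/info/churn.rs", "src/cli.rs"},
    {"src/info/churn.rs"},
    {"README.md"},
    {"src/info/churn.rs", "README.md"},
]
print(compute_churn(history, limit=100, top=2))
```

The expensive part in the real implementation is producing each `files_changed` set, which requires diffing a commit's tree against its parent's tree; the counting itself is cheap.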

Here is what it looks like on onefetch:

[image: onefetch output showing the new Churn info line]

@o2sh o2sh requested a review from spenserblack as a code owner June 4, 2023 00:10
@o2sh o2sh requested review from Byron and spenserblack and removed request for spenserblack June 4, 2023 00:11
* delay conversion to String for filepaths to the last moment. That way, only the paths that
  are displayed will be converted in an operation that isn't free.
* change diff implementation to decode parents only once, instead of three times in the common case.
* setup an object cache in the `Repository` for faster traversals and much faster diffs.
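The first bullet can be illustrated outside of Rust: keep churn counts keyed by raw byte paths and decode only the few entries that actually end up on screen. A hedged Python sketch (all names and data are made up for illustration):

```python
from collections import Counter

# Churn counts keyed by raw byte paths, as git stores them (made-up data).
churn = Counter({b"src/main.rs": 42, b"Cargo.toml": 7, b"docs/na\xc3\xafve.md": 3})

def top_churn_display(churn, top=2):
    # Decoding to str isn't free, so only the paths actually shown are decoded.
    return [(path.decode("utf-8", errors="replace"), count)
            for path, count in churn.most_common(top)]

print(top_churn_display(churn))
```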
@Byron
Collaborator

Byron commented Jun 4, 2023

Thanks for reeling me in :)!

I like the implementation even though I'd love it if there was a way to 'more smoothly' deal with diffs against the first tree, which is the empty tree.

Besides that, I hope you'll excuse me pushing improvements directly into this branch, as I think it's easier to see what I mean that way. Disclaimer: these commits are suggestions, so please feel free to change or remove them as you see fit, no questions asked.

With all that out of the way, the commit here already improves performance quite a bit:

linux ( master)
❯ hyperfine --warmup 1 /Users/byron/dev/github.com/o2sh/onefetch/onefetch /Users/byron/dev/github.com/o2sh/onefetch/onefetch-optimized
Benchmark 1: /Users/byron/dev/github.com/o2sh/onefetch/onefetch
  Time (mean ± σ):     20.285 s ±  0.455 s    [User: 22.086 s, System: 2.658 s]
  Range (min … max):   19.617 s … 21.018 s    10 runs

Benchmark 2: /Users/byron/dev/github.com/o2sh/onefetch/onefetch-optimized
  Time (mean ± σ):     14.635 s ±  0.311 s    [User: 16.232 s, System: 2.659 s]
  Range (min … max):   13.949 s … 15.044 s    10 runs

Summary
  '/Users/byron/dev/github.com/o2sh/onefetch/onefetch-optimized' ran
    1.39 ± 0.04 times faster than '/Users/byron/dev/github.com/o2sh/onefetch/onefetch'

That's not all though, as I am working on a faster form of traversal which can make use of commitgraph data-structures - stay tuned (a new gix release is necessary) :).

@o2sh
Owner Author

o2sh commented Jun 5, 2023

BTW, @spenserblack, what do you think about this new info line? Do you believe it provides value for users? I apologize for not discussing it beforehand 😔

Although the performance impact has been significantly reduced, thanks to @Byron, it may still be noticeable.

@spenserblack
Collaborator

If we're concerned about performance, we could consider this an experimental feature that defaults to being disabled 🙂

But this is definitely a cool idea! 👍

@Byron
Collaborator

Byron commented Jun 6, 2023

From a performance perspective, I think it will be fine for 99.5% of the cases. After all, it's just 100 commits to compute deltas for, and thanks to the performance improvements onefetch now has 'won' some time credits to put towards a feature like this.

However, there are repositories with huge trees, and those inevitably take longer to diff, as the tree traversal has to touch so many objects. I have seen speeds as low as 4 deltas/s per core, which means 100 commit-diffs could easily take 25s or so. It would be great if this case could be handled automatically, but even that would cost time that would always add to the overall runtime. One idea was to load the index and apply a heuristic to estimate how long the diffs would probably take, but that seems tricky to get right.

Another option would be to run the diffs on another thread, send it only the first X commits as before, but cancel it as soon as all commits have been iterated. That way, one wouldn't spend any additional wall-clock time computing diffs and could abort long-running diffs fairly swiftly. The worst-case diffs I have seen take 250ms, so at worst the operation would take 250ms longer, as the thread only responds after a diff is done.

Let me give that a try as this would allow the feature to be toggled on unconditionally without measurable impact.
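The thread-with-cancellation idea above could look roughly like this in Python (a simulation with toy data, not the gix-based implementation; the worker checks the cancellation flag only between diffs, and always produces at least one result):

```python
import queue
import threading

def churn_worker(commits, cancel, results):
    """Diff commits one by one; honor cancellation only between diffs."""
    churn = {}
    for files_changed in commits:
        # Always produce at least one diff, even if cancelled immediately.
        if cancel.is_set() and churn:
            break
        for path in files_changed:  # stand-in for the expensive tree diff
            churn[path] = churn.get(path, 0) + 1
    results.put(churn)

commits = [{"a.rs"}, {"a.rs", "b.rs"}, {"b.rs"}]
cancel = threading.Event()
results = queue.Queue()
worker = threading.Thread(target=churn_worker, args=(commits, cancel, results))
worker.start()
# ... meanwhile the main thread finishes iterating the commit graph ...
cancel.set()   # stop the worker after whatever diff it is currently on
worker.join()
final = results.get()
print(final)
```

Depending on timing, `final` may cover anywhere from one commit to all of them, which is exactly the non-determinism discussed below.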

That way, we could even use higher amounts of diffs if we wanted to, or could
warn if there was not enough time to reach the desired amount of diffs.
@Byron
Collaborator

Byron commented Jun 6, 2023

I'd think this was a success :D, another ~10% faster now.

linux ( master) +369 -819 [!?]
❯ hyperfine --warmup 1 /Users/byron/dev/github.com/o2sh/onefetch/onefetch /Users/byron/dev/github.com/o2sh/onefetch/onefetch-optimized
Benchmark 1: /Users/byron/dev/github.com/o2sh/onefetch/onefetch
  Time (mean ± σ):     13.395 s ±  0.518 s    [User: 15.469 s, System: 2.551 s]
  Range (min … max):   12.952 s … 14.707 s    10 runs

Benchmark 2: /Users/byron/dev/github.com/o2sh/onefetch/onefetch-optimized
  Time (mean ± σ):     12.083 s ±  0.146 s    [User: 16.981 s, System: 2.482 s]
  Range (min … max):   11.833 s … 12.374 s    10 runs

Summary
  '/Users/byron/dev/github.com/o2sh/onefetch/onefetch-optimized' ran
    1.11 ± 0.04 times faster than '/Users/byron/dev/github.com/o2sh/onefetch/onefetch'

In the commit message, there are some ideas for further improvements related to how the feature is presented. I think it could be polished to show the number of commits actually used to create the churn summary in case that number is lower than expected because there wasn't enough time.


Please note that I also took an extended look at all the allocations that seem to be happening and (once again) tracked them down to zlib inflate. It's well known to me that it allocates while inflating, and these allocations cause a lot of 'thrashing', which drives the allocator to a high peak memory footprint. For instance, on the linux kernel, it peaks at about 1GB even though 99.9% or so of these allocations are transient. It's a bit sad to see, as gix tries very hard to reuse memory while at its core it calls into code that does the exact opposite. On the bright side, I found a way to save a couple of these forced zlib allocations, which seems to have fixed a performance regression as well.

That way, there is always some data to work with.
This is important in case the repo is very small and the thread needs some time
to start-up and finish.
@Byron
Collaborator

Byron commented Jun 6, 2023

The recent test failure has shown that it's probably a good idea to indicate how many commits were used for the churn rates if that number is below the desired value, as very small repos might also trigger this case (as happened in the test). At least that failing test prompted the mitigation that the thread will always produce at least one diff.

o2sh added 2 commits June 7, 2023 19:12
The churn_pool_size option allows the user to force onefetch to be deterministic
in the number of commits used to create the churn summary
@o2sh
Owner Author

o2sh commented Jun 8, 2023

Thanks to @Byron's optimizations, the impact on wall clock time has been reduced to nearly zero. As you suggested @Byron, I added the number of commits actually used to create the churn summary:

[image: onefetch output showing the number of commits used for the churn summary]

For a more deterministic approach, users can specify the churn-pool-size:

[images: onefetch output with the churn-pool-size option set]

And thanks for the review @spenserblack

Byron added 3 commits June 9, 2023 07:29
* simplify `should_break()`
…of its buffer.

The delta-processing happens by referring to a commit, and previously we could
send the whole commit buffer (which is expensive) as the overall amount of buffers
in flight would be bounded.

Now that the bound was removed, it's necessary to limit the costs of the commit,
and we do this by referring to it by id instead.

That way, on the linux kernel, we get these values for memory consumption:

* bounded: 960MB
* unbounded, sending buffers: 2156MB
* unbounded, sending ids: 1033MB
Collaborator

@Byron Byron left a comment


I love the idea of being non-deterministic by default using the available time for churn-computation, while optionally offering a desired amount that should be met if set.

However, this comes at a great cost: running onefetch with these semantics on the linux kernel easily doubles its real-memory footprint.

[screenshot 2023-06-09 07:36:35: real-memory footprint comparison]

The reason for this is that there is no bound on how many detached objects are stuffed into the channel; previously there was such a bound, which is the only reason I deemed it correct to send the whole object buffer.

With these semantics, one has to reduce the costs of the commits sent, by re-retrieving the commit in the thread. This itself comes at a cost though, even though it is probably negligible.

I have implemented the alternative that sends IDs instead of data, which does bring down overall memory usage (while definitely still being noticeable), and now the feature seems to cost ~70MB on the linux kernel. It's probably acceptable.
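The id-versus-buffer trade-off can be simulated: messages on the unbounded channel carry only a small object id, and the worker re-retrieves the (large) buffer from the object store before diffing. A hypothetical Python sketch (the `ODB` dict stands in for git's object database; none of these names come from onefetch or gix):

```python
import queue
import threading

# Hypothetical object store: object id -> (large) commit buffer.
ODB = {oid: b"x" * 1024 for oid in range(5)}

ids = queue.Queue()   # unbounded channel: messages are cheap ids, not buffers
processed = []

def worker():
    while True:
        oid = ids.get()
        if oid is None:            # sentinel: producer is done
            break
        buf = ODB[oid]             # re-retrieve the commit in the worker thread
        processed.append((oid, len(buf)))

t = threading.Thread(target=worker)
t.start()
for oid in ODB:
    ids.put(oid)                   # each message costs a few bytes, not ~1KB
ids.put(None)
t.join()
print(len(processed))
```

The extra lookup per commit is the cost mentioned above; the saving is that the queue's memory footprint no longer scales with the size of the commit buffers in flight.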

@Byron
Collaborator

Byron commented Jun 9, 2023

I really can't wait for this commit to be merged, as I think I have a neat follow-up in the works: This PR over at gitoxide adds support for commit-graphs during iteration, and if done right it's possible to multi-thread the author-map which is currently relying on single-threaded iteration. As this workload should scale perfectly per core, I'd expect the linux kernel to finish in 2s or so :) (at least if someone ran git commit-graph write --no-progress --reachable which those with huge repos probably do).

@o2sh o2sh merged commit 1955153 into main Jun 9, 2023
@o2sh o2sh deleted the feat/churn branch June 9, 2023 18:43