Added File Churn Metric #1071
Conversation
* delay conversion to String for filepaths to the last moment. That way, only the paths that are displayed will be converted in an operation that isn't free.
* change diff implementation to decode parents only once, instead of three times in the common case.
* setup an object cache in the `Repository` for faster traversals and much faster diffs.
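As a rough illustration of the object-cache point, here is a minimal sketch assuming gix's `Repository::object_cache_size` setter; the exact API and a sensible cache size are assumptions, not onefetch's actual code:

```rust
// Minimal sketch: enable an in-memory object cache so repeated tree
// traversals during diffing can reuse decoded objects instead of
// re-inflating them from the pack every time.
// Assumption: gix exposes `Repository::object_cache_size`; check the gix
// docs for the current signature before relying on this.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut repo = gix::discover(".")?;
    // Assumed cache size of ~64 MB; the right value depends on the repository.
    repo.object_cache_size(Some(64 * 1024 * 1024));
    // ... traverse commits and diff trees here ...
    Ok(())
}
```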
Thanks for reeling me in :)! I like the implementation, even though I'd love it if there was a way to deal more smoothly with diffs against the first tree, which is the empty tree. Besides that, I hope you excuse me pushing improvements directly into this branch, as I think it's easier to see what I mean that way, with the disclaimer that these commits are suggestions - please feel free to change or remove them as you see fit, no questions asked. With all that out of the way, the commit here already improves performance quite a bit:
That's not all though, as I am working on a faster form of traversal which can make use of commitgraph data structures - stay tuned.
BTW, @spenserblack, what do you think about this new info line? Do you believe it provides value for users? I apologize for not discussing it beforehand 😔 Although the performance impact has been significantly reduced, thanks to @Byron, it may still be noticeable.
If we're concerned about performance, we could consider this an experimental feature that defaults to being disabled 🙂 But this is definitely a cool idea! 👍
From a performance perspective, I think it will be fine for 99.5% of the cases. After all, it's just 100 commits to compute deltas for, which should be quick thanks to the performance improvements.

However, there are repositories with huge trees, and those inevitably take longer to create diffs for as the tree-traversal has to touch so many objects. I have seen speeds as low as 4 deltas/s per core, which means 100 commit-diffs could easily take 25s or so. It would be great if this case could be handled automatically, but even that would cost time that now always adds to the overall time. The idea was to load the index and apply some heuristic to estimate how long diffs would probably take, but it seems tricky to get right.

Another option would be to run diffs on another thread, send it only the first X commits as before, but cancel it as soon as all commits have been iterated. That way, one wouldn't use any additional wall-clock time to compute diffs and could abort long-running diffs pretty swiftly - the worst-case diffs I have seen take 250ms, so at worst the operation would take 250ms longer, as the thread will only respond after a diff was done. Let me give that a try, as this would allow the feature to be toggled on unconditionally without measurable impact.
That way, we could even use a higher number of diffs if we wanted to, or warn if there was not enough time to reach the desired number of diffs.
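A schematic sketch of that idea, using only the standard library and illustrative stand-ins rather than onefetch's actual types: diffing happens on a worker thread, and once the main commit iteration is done the worker is asked to stop, so churn computation adds essentially no wall-clock time beyond the last in-flight diff.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

fn main() {
    // Stand-in for "a commit to diff against its parent".
    let (tx, rx) = mpsc::channel::<u32>();
    let stop = Arc::new(AtomicBool::new(false));

    let worker = {
        let stop = Arc::clone(&stop);
        thread::spawn(move || {
            let mut diffs_done = 0usize;
            while let Ok(commit) = rx.recv() {
                // Only respond to cancellation between diffs, and always
                // produce at least one diff so tiny repos still get data.
                if stop.load(Ordering::Relaxed) && diffs_done > 0 {
                    break;
                }
                // Hypothetical expensive work: tree-diff `commit` vs. its parent.
                let _ = commit;
                diffs_done += 1;
            }
            diffs_done
        })
    };

    // Main thread: iterate commits as usual, feeding the first X to the worker.
    for commit in 0..100u32 {
        tx.send(commit).ok();
    }
    drop(tx); // no more work will arrive
    stop.store(true, Ordering::Relaxed); // iteration finished: wind the worker down
    let commits_used = worker.join().expect("worker thread panicked");
    println!("churn summary computed from {commits_used} commit diffs");
}
```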
I'd say this was a success :D - another ~10% faster now.
In the commit message, there are some ideas for further improvements related to how the feature is presented - I think it could be polished to show the number of commits actually used to create the churn summary in case it was lower than expected because there wasn't enough time.

Please note that I also took an extended look at all these allocations that seem to be happening and (once again) tracked it down to zlib inflate. It's well known to me that it allocates while inflating, and these allocations cause a lot of 'thrashing' which leads to a high peak memory footprint in the allocator. For instance, on the linux kernel it peaks at about 1GB even though 99.9% or so of these allocations are transient. It's a bit sad to see.
That way, there is always some data to work with. This is important in case the repo is very small and the thread needs some time to start up and finish.
The recent test failure has shown that it's probably a good idea to indicate how many commits were used for churn rates if the number is below the desired value, as very small repos might also trigger this case (as happened in the test). That failing test at least led to the mitigation that the thread will always produce at least one diff.
The churn_pool_size option allows the user to force onefetch to be deterministic in the number of commits used to create the churn summary.
Thanks to @Byron's optimizations, the impact on wall clock time has been reduced to nearly zero.

As you suggested @Byron, I added the number of commits actually used to create the churn summary. For a more deterministic approach, users can specify the churn-pool-size.

And thanks for the review @spenserblack
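For illustration, a deterministic pool size could be wired up roughly like this with clap's derive API - a hedged sketch rather than onefetch's actual CLI definition, with the flag name taken from the discussion above:

```rust
use clap::Parser;

/// Hypothetical, trimmed-down CLI showing how a `--churn-pool-size`
/// option could be declared; onefetch's real definition may differ.
#[derive(Parser)]
struct Cli {
    /// Minimum number of commits used to compute the churn summary.
    /// When unset, churn is computed from however many commits fit in
    /// the available time (non-deterministic, but essentially free).
    #[arg(long)]
    churn_pool_size: Option<usize>,
}

fn main() {
    let cli = Cli::parse();
    println!("churn pool size: {:?}", cli.churn_pool_size);
}
```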
…of its buffer.

The delta-processing happens by referring to a commit, and previously we could send the whole commit buffer (which is expensive) as the overall number of buffers in flight would be bounded. Now that the bound was removed, it's necessary to limit the cost of the commit, and we do this by referring to it by id instead. That way, on the linux kernel, we get these values for memory consumption:

* bounded: 960MB
* unbounded buffer: 2156MB
* unbounded id: 1033MB
I love the idea of being non-deterministic by default, using the available time for churn computation, while optionally offering a desired amount that should be met if set.
However, this comes at a great cost: running onefetch with these semantics on the linux kernel easily doubles its real-memory footprint.

The reason for this is that there is no longer a bound on how many detached objects are stuffed into the channel; previously such a bound existed, which is the only reason I deemed it correct to send the whole object buffer.
With these semantics, one has to reduce the cost of the commits sent by re-retrieving the commit in the thread. That itself comes at a cost, though it is probably negligible.
I have implemented the alternative that sends IDs instead of data, which does bring down overall memory usage (while definitely still being noticeable), and now the feature seems to cost ~70MB on the linux kernel. It's probably acceptable.
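A small illustration of the trade-off described here, with stand-in types rather than gix's real ones: the channel carries only fixed-size object ids, and the diffing thread re-retrieves each commit, so the queue can never hold an unbounded amount of large buffers.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for gix::ObjectId: a small, fixed-size value that is cheap to queue.
type ObjectId = [u8; 20];

// Hypothetical re-retrieval: read and inflate the commit from the object
// database again, paying a little CPU per commit instead of holding memory.
fn lookup_commit_buffer(_id: &ObjectId) -> Vec<u8> {
    Vec::new()
}

fn main() {
    let (tx, rx) = mpsc::channel::<ObjectId>();

    let differ = thread::spawn(move || {
        while let Ok(id) = rx.recv() {
            // Only one large buffer is alive at a time inside the worker,
            // no matter how many ids are queued in the channel.
            let _buffer = lookup_commit_buffer(&id);
            // ... diff the commit's tree against its parent tree here ...
        }
    });

    for _ in 0..100 {
        tx.send([0u8; 20]).ok();
    }
    drop(tx);
    differ.join().unwrap();
}
```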
I really can't wait for this commit to be merged, as I think I have a neat follow-up in the works: this PR over at …
This pull request, part of #1059, introduces a new info line to onefetch called "Churn." This info line displays the files with the most modifications (commits), providing valuable insights into code volatility and potential hotspots.
Calculating the churn is computationally expensive 😢, as it requires comparing each commit's tree with its parent's tree to obtain the diff and see which files were modified. Since git (and gitoxide) does not store deltas, this process becomes increasingly resource-intensive the more commits there are in the git history.
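In pseudocode-like Rust, the computation looks roughly like this; `commits_newest_first` and `changed_paths` are hypothetical placeholders for the commit walk and the tree diff done with gitoxide:

```rust
use std::collections::HashMap;

// Count, over the most recent `limit` commits, how often each file path was
// touched, and return the paths ranked by modification count.
fn churn(limit: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for commit in commits_newest_first().take(limit) {
        // Diff the commit's tree against its parent's tree and record every
        // path that was added, removed or modified.
        for path in changed_paths(&commit) {
            *counts.entry(path).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<_> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1)); // most-modified files first
    ranked
}

// Hypothetical stand-ins so the sketch compiles on its own.
struct Commit;
fn commits_newest_first() -> impl Iterator<Item = Commit> {
    std::iter::empty::<Commit>()
}
fn changed_paths(_c: &Commit) -> Vec<String> {
    Vec::new()
}

fn main() {
    for (path, n) in churn(100).into_iter().take(3) {
        println!("{path}: {n} commits");
    }
}
```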
To optimize performance, I added a limit on the number of commits used to compute the file churns. This limit can be configured via a CLI flag:
Here is what it looks like on onefetch: