Added File Churn Metric #1071
Conversation
* delay conversion to String for filepaths to the last moment. That way, only the paths that are displayed will be converted in an operation that isn't free.
* change diff implementation to decode parents only once, instead of three times in the common case.
* setup an object cache in the `Repository` for faster traversals and much faster diffs.
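As a rough illustration of the object-cache point, here is a minimal sketch assuming gix's `Repository::object_cache_size` setter; the exact API and a sensible cache size are assumptions, not onefetch's actual code:

```rust
// Minimal sketch: enable an in-memory object cache so repeated tree
// traversals during diffing can reuse decoded objects instead of
// re-inflating them from the pack every time.
// Assumption: gix exposes `Repository::object_cache_size`; check the gix
// docs for the current signature before relying on this.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut repo = gix::discover(".")?;
    // Assumed cache size of ~64 MB; the right value depends on the repository.
    repo.object_cache_size(Some(64 * 1024 * 1024));
    // ... traverse commits and diff trees here ...
    Ok(())
}
```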
Thanks for reeling me in :)! I like the implementation, even though I'd love it if there was a way to deal more smoothly with diffs against the first tree, which is the empty tree. Besides that, I hope you excuse me pushing improvements directly into this branch, as I think it's easier to see what I mean that way, with the disclaimer that these commits are suggestions - please feel free to change or remove them as you see fit, no questions asked. With all that out of the way, the commit here already improves performance quite a bit:
That's not all though, as I am working on a faster form of traversal which can make use of commitgraph data structures - stay tuned.
BTW, @spenserblack, what do you think about this new info line? Do you believe it provides value for users? I apologize for not discussing it beforehand 😔 Although the performance impact has been significantly reduced, thanks to @Byron, it may still be noticeable.
If we're concerned about performance, we could consider this an experimental feature that defaults to being disabled 🙂 But this is definitely a cool idea! 👍
From a performance perspective, I think it will be fine for 99.5% of the cases. After all, it's just 100 commits to compute deltas for, which should be quick thanks to the performance improvements.

However, there are repositories with huge trees, and those inevitably take longer to create diffs for as the tree-traversal has to touch so many objects. I have seen speeds as low as 4 deltas/s per core, which means 100 commit-diffs could easily take 25s or so. It would be great if this case could be handled automatically, but even that would cost time that now always adds to the overall time. The idea was to load the index and apply some heuristic to estimate how long diffs would probably take, but it seems tricky to get right.

Another option would be to run diffs on another thread, send it only the first X commits as before, but cancel it as soon as all commits have been iterated. That way, one wouldn't use any additional wall-clock time to compute diffs and could abort long-running diffs pretty swiftly - the worst-case diffs I have seen take 250ms, so at worst the operation would take 250ms longer, as the thread will only respond after a diff was done. Let me give that a try, as this would allow the feature to be toggled on unconditionally without measurable impact.
That way, we could even use a higher number of diffs if we wanted to, or warn if there was not enough time to reach the desired number of diffs.
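A schematic sketch of that idea, using only the standard library and illustrative stand-ins rather than onefetch's actual types: diffing happens on a worker thread, and once the main commit iteration is done the worker is asked to stop, so churn computation adds essentially no wall-clock time beyond the last in-flight diff.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

fn main() {
    // Stand-in for "a commit to diff against its parent".
    let (tx, rx) = mpsc::channel::<u32>();
    let stop = Arc::new(AtomicBool::new(false));

    let worker = {
        let stop = Arc::clone(&stop);
        thread::spawn(move || {
            let mut diffs_done = 0usize;
            while let Ok(commit) = rx.recv() {
                // Only respond to cancellation between diffs, and always
                // produce at least one diff so tiny repos still get data.
                if stop.load(Ordering::Relaxed) && diffs_done > 0 {
                    break;
                }
                // Hypothetical expensive work: tree-diff `commit` vs. its parent.
                let _ = commit;
                diffs_done += 1;
            }
            diffs_done
        })
    };

    // Main thread: iterate commits as usual, feeding the first X to the worker.
    for commit in 0..100u32 {
        tx.send(commit).ok();
    }
    drop(tx); // no more work will arrive
    stop.store(true, Ordering::Relaxed); // iteration finished: wind the worker down
    let commits_used = worker.join().expect("worker thread panicked");
    println!("churn summary computed from {commits_used} commit diffs");
}
```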
I'd say this was a success :D - another ~10% faster now.
In the commit message, there are some ideas for further improvements related to how the feature is presented - I think it could be polished to show the number of commits actually used to create the churn summary in case it was lower than expected because there wasn't enough time.

Please note that I also took an extended look at all these allocations that seem to be happening and (once again) tracked it down to zlib inflate. It's well known to me that it allocates while inflating, and these allocations cause a lot of 'thrashing' which leads to a high peak memory footprint in the allocator. For instance, on the linux kernel it peaks at about 1GB even though 99.9% or so of these allocations are transient. It's a bit sad to see.
That way, there is always some data to work with. This is important in case the repo is very small and the thread needs some time to start up and finish.
The recent test failure has shown that it's probably a good idea to indicate how many commits were used for churn rates if the number is below the desired value, as very small repos might also trigger this case (as happened in the test). That failing test at least led to the mitigation that the thread will always produce at least one diff.
The churn_pool_size option allows the user to force onefetch to be deterministic in the number of commits used to create the churn summary.
Thanks to @Byron's optimizations, the impact on wall clock time has been reduced to nearly zero.

As you suggested @Byron, I added the number of commits actually used to create the churn summary. For a more deterministic approach, users can specify the churn-pool-size.

And thanks for the review @spenserblack
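For illustration, a deterministic pool size could be wired up roughly like this with clap's derive API - a hedged sketch rather than onefetch's actual CLI definition, with the flag name taken from the discussion above:

```rust
use clap::Parser;

/// Hypothetical, trimmed-down CLI showing how a `--churn-pool-size`
/// option could be declared; onefetch's real definition may differ.
#[derive(Parser)]
struct Cli {
    /// Minimum number of commits used to compute the churn summary.
    /// When unset, churn is computed from however many commits fit in
    /// the available time (non-deterministic, but essentially free).
    #[arg(long)]
    churn_pool_size: Option<usize>,
}

fn main() {
    let cli = Cli::parse();
    println!("churn pool size: {:?}", cli.churn_pool_size);
}
```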
…of its buffer.

The delta-processing happens by referring to a commit, and previously we could send the whole commit buffer (which is expensive) as the overall number of buffers in flight would be bounded. Now that the bound was removed, it's necessary to limit the cost of the commit, and we do this by referring to it by id instead. That way, on the linux kernel, we get these values for memory consumption:

* bounded: 960MB
* unbounded buffer: 2156MB
* unbounded id: 1033MB
I love the idea of being non-deterministic by default, using the available time for churn computation, while optionally offering a desired amount that should be met if set.
However, this comes at a great cost: running onefetch with these semantics on the linux kernel easily doubles its real-memory footprint.

The reason for this is that there is no longer a bound on how many detached objects are stuffed into the channel; previously such a bound existed, which is the only reason I deemed it correct to send the whole object buffer.
With these semantics, one has to reduce the cost of the commits sent by re-retrieving the commit in the thread. That itself comes at a cost, though it is probably negligible.
I have implemented the alternative that sends IDs instead of data, which does bring down overall memory usage (while definitely still being noticeable), and now the feature seems to cost ~70MB on the linux kernel. It's probably acceptable.
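A small illustration of the trade-off described here, with stand-in types rather than gix's real ones: the channel carries only fixed-size object ids, and the diffing thread re-retrieves each commit, so the queue can never hold an unbounded amount of large buffers.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for gix::ObjectId: a small, fixed-size value that is cheap to queue.
type ObjectId = [u8; 20];

// Hypothetical re-retrieval: read and inflate the commit from the object
// database again, paying a little CPU per commit instead of holding memory.
fn lookup_commit_buffer(_id: &ObjectId) -> Vec<u8> {
    Vec::new()
}

fn main() {
    let (tx, rx) = mpsc::channel::<ObjectId>();

    let differ = thread::spawn(move || {
        while let Ok(id) = rx.recv() {
            // Only one large buffer is alive at a time inside the worker,
            // no matter how many ids are queued in the channel.
            let _buffer = lookup_commit_buffer(&id);
            // ... diff the commit's tree against its parent tree here ...
        }
    });

    for _ in 0..100 {
        tx.send([0u8; 20]).ok();
    }
    drop(tx);
    differ.join().unwrap();
}
```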
I really can't wait for this commit to be merged, as I think I have a neat follow-up in the works: this PR over at …
This pull request, part of #1059, introduces a new info line to onefetch called "Churn." This info line displays the files with the most modifications (commits), providing valuable insights into code volatility and potential hotspots.
Calculating the churn is computationally expensive 😢, as it requires comparing each commit's tree with its parent's tree to obtain the diff and see which files were modified. Since git (and gitoxide) does not store deltas, this process becomes increasingly resource-intensive the more commits there are in the git history.
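In pseudocode-like Rust, the computation looks roughly like this; `commits_newest_first` and `changed_paths` are hypothetical placeholders for the commit walk and the tree diff done with gitoxide:

```rust
use std::collections::HashMap;

// Count, over the most recent `limit` commits, how often each file path was
// touched, and return the paths ranked by modification count.
fn churn(limit: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for commit in commits_newest_first().take(limit) {
        // Diff the commit's tree against its parent's tree and record every
        // path that was added, removed or modified.
        for path in changed_paths(&commit) {
            *counts.entry(path).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<_> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1)); // most-modified files first
    ranked
}

// Hypothetical stand-ins so the sketch compiles on its own.
struct Commit;
fn commits_newest_first() -> impl Iterator<Item = Commit> {
    std::iter::empty::<Commit>()
}
fn changed_paths(_c: &Commit) -> Vec<String> {
    Vec::new()
}

fn main() {
    for (path, n) in churn(100).into_iter().take(3) {
        println!("{path}: {n} commits");
    }
}
```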
To optimize performance, I added a limit on the number of commits used to compute the file churns. This limit can be configured via a CLI flag:
Here is what it looks like on onefetch: