Skip to content

Run git gc occasionally #10

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jyn514 opened this issue May 27, 2020 · 4 comments
Closed

Run git gc occasionally #10

jyn514 opened this issue May 27, 2020 · 4 comments
Labels

Comments

@jyn514
Copy link
Contributor

jyn514 commented May 27, 2020

In docs.rs, we just found we had very high CPU and IO usage because we had never run git gc. It would be great if crates-index-diff ran this automatically, the same way that git runs git gc automatically once in a while.

See also rust-lang/docs.rs#778.

@Byron
Copy link
Owner

Byron commented May 28, 2020

I think this would be the first suggestion from docs.rs I feel I should turn down. Even though git does do an automatic GC when it discovers too many loose objects, this functionality would have to be explicitly implemented here and seems like a very surprising side-effect.

If the gc is blocking, it would slow down the caller from time to time, and if it is asynchronous, it would have to fork off a thread to run the gc operation there. It could probably be implemented so that it won't launch multiple gc's concurrently, yet I am not sure I trust a git writer (gc) to not interfere in various ways with the reader.

However, I could imagine to have it implemented as a separate 'maintenance' function that can be handled by the caller themselves. Something like fn get_maintenance_handle() -> Option<impl Fn() -> Result>, and the caller can run it in the current thread (blocking) or spawn it off into their own thread and manage its runtime that way. Doing so is also a little dangerous as libgit2 will consume additional system resources, so the next time someone wants to read while the gc is running (in process) the reader might run out of file handles.
Alternatively, the same could be done by using git directly and run git gc in a separate process, resolving the issue with file handles at least, but the issue with a background writer in the repository remains.

This makes me think it's safer for docs.rs to implement it themselves, and run it on their own account while possibly blocking other readers while the gc is in progress. This could be as easy as spawning a git gc in a process and waiting for it to complete.

What do you think?

@Byron Byron added the question label May 28, 2020
@jyn514
Copy link
Contributor Author

jyn514 commented May 28, 2020

If the gc is blocking, it would slow down the caller from time to time, and if it is asynchronous, it would have to fork off a thread to run the gc operation there. It could probably be implemented so that it won't launch multiple gc's concurrently, yet I am not sure I trust a git writer (gc) to not interfere in various ways with the reader.

That's about what we thought too, yeah.

fn get_maintenance_handle() -> Option<impl Fn() -> Result>

Presumably this would be a function that's optional to call? In that case I think we'd be better off implementing it ourselves, it would be less complicated on both ends.

This makes me think it's safer for docs.rs to implement it themselves, and run it on their own account while possibly blocking other readers while the gc is in progress. This could be as easy as spawning a git gc in a process and waiting for it to complete.

That sounds reasonable. I do think you should document that crates-index-diff doesn't run git gc but otherwise all that makes sense, it could potentially use a lot of system resources and it's difficult to predict.

@jyn514 jyn514 closed this as completed May 28, 2020
Byron added a commit that referenced this issue May 28, 2020
@Byron
Copy link
Owner

Byron commented May 28, 2020

Presumably this would be a function that's optional to call? In that case I think we'd be better off implementing it ourselves, it would be less complicated on both ends.

This is intended as: this crate determines if cleanup should be performed, and if not, return None. Otherwise return a function to perform the cleanup, as controlled by the caller.
This could be simplified to: Always return a (blocking) function to call, and either call git gc if possible, or run a gc using libgit2.
I am absolutely open to provide that if you are interested.

@jyn514
Copy link
Contributor Author

jyn514 commented May 28, 2020

I think we wouldn't use that on our end, but I appreciate the offer! I also very much appreciate the documentation change :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants