Skip to content

Run git gc periodically on the crates.io index #778

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pietroalbini opened this issue May 27, 2020 · 12 comments
Closed

Run git gc periodically on the crates.io index #778

pietroalbini opened this issue May 27, 2020 · 12 comments
Labels
A-admin Area: Administration of the production docs.rs server C-bug Category: This is a bug E-easy Effort: Should be easy to implement and would make a good first PR P-low Low priority issues

Comments

@pietroalbini
Copy link
Member

We discovered that the clone on the index managed by docs.rs reached 2k+ packfiles, which caused spikes of thousands of open FDs and 20% CPU usage every time the repository was queried. We should add code that performs git gc periodically.

@jyn514 jyn514 added A-builds Area: Building the documentation for a crate C-bug Category: This is a bug E-easy Effort: Should be easy to implement and would make a good first PR P-low Low priority issues labels Jul 6, 2020
@ohaddahan
Copy link
Contributor

@pietroalbini @jyn514 the most relevant place I found to insert it is in one of the daemons such as:
https://github.com/rust-lang/docs.rs/blob/master/src/utils/github_updater.rs

If you think it's a fitting location, I'll start working on a PR.

@jyn514
Copy link
Member

jyn514 commented Aug 8, 2020

I think a better spot would be in the registry watcher: https://github.com/rust-lang/docs.rs/blob/master/src/utils/daemon.rs#L18, but it should be spaced out longer than every 30 seconds. Maybe you can add a counter and every time it rolls over to 1000 run git gc?

@ohaddahan
Copy link
Contributor

ohaddahan commented Aug 8, 2020

@jyn514 thanks for the quick reply.
Yeah we don't want to run every 30 sec.
I thought of here:

Duration::from_secs(60 * 60),

It runs every hour and the action before is update of repos which probably adds garbage for git gc to collect.
We can add it as an after action inside update_all_crates and add some counter to make sure the interval wasn't too close, or we want to control it better in the future.

Do you think it's a good spot?

@pietroalbini
Copy link
Member Author

I think a more reliable approach is to count the files in .git/objects/pack/ and run gc if that exceeds a configurable threshold (configured with Config).

@ohaddahan
Copy link
Contributor

ohaddahan commented Aug 8, 2020

@pietroalbini git gc will already check according to the limit used.

git-gc

--auto
With this option, git gc checks whether any housekeeping is required; if not, it exits without performing any work.

See the gc.auto option in the "CONFIGURATION" section below for how this heuristic works.

Once housekeeping is triggered by exceeding the limits of configuration options such as gc.auto and gc.autoPackLimit, all other housekeeping tasks (e.g. rerere, working trees, reflog…​) will be performed as well.

gc.auto
When there are approximately more than this many loose objects in the repository, git gc --auto will pack them. Some Porcelain commands use this command to perform a light-weight garbage collection from time to time. The default value is 6700.

Setting this to 0 disables not only automatic packing based on the number of loose objects, but any other heuristic git gc --auto will otherwise use to determine if there’s work to do, such as gc.autoPackLimit.

gc.autoPackLimit
When there are more than this many packs that are not marked with *.keep file in the repository, git gc --auto consolidates them into one larger pack. The default value is 50. Setting this to 0 disables it. Setting gc.auto to 0 will also disable this.

See the gc.bigPackThreshold configuration variable below. When in use, it’ll affect how the auto pack limit works.

Why would you want to redo the same logic it contains?
You can run it and if the limit won't be met it will skip this gc cycle.
We can actively monitor this directory and catch right when it's over the limit but I think it's a little of an overkill.

Anything that I'm missing?

@pietroalbini
Copy link
Member Author

Mostly to reduce the amount of times we shell out to git. To be fair git gc --auto is so quick when there is nothing to do we could run it on every iteration of the registry watcher.

@ohaddahan
Copy link
Contributor

Yeah obviously less forks but it comes with the cost of extra code to maintain which isn't really in doc.rs domain.
I think doing it once an hour after update_all_crates is a good compromise, once an hour isn't too frequent and I think the origin of the garbage files is update_all_crates.

(I don't mind implementing some watcher over the directory as you suggested, I need the practice 😄 )

@pietroalbini
Copy link
Member Author

I think doing it once an hour after update_all_crates is a good compromise, once an hour isn't too frequent and I think the origin of the garbage files is update_all_crates.

Let's not put it there: the origin of the extra files is in the registry watcher (update_all_crates just queries the GitHub API), and I'd really prefer all the relevant code to be in one place instead of being scattered around the codebase.

Yes, that thread loops every 30 seconds, but you can do something like this:

let mut last_gc = Instant::now();
loop {
    // Existing code

    if last_gc.elapsed() > Duration::from_secs(config.registry_gc_interval) {
        // Run `git gc`
        last_gc = Instant::now();
    }
}

@ohaddahan
Copy link
Contributor

ohaddahan commented Aug 11, 2020

@pietroalbini @jyn514
I did as suggested but I have a few questions:

  1. How would you recommend writing a test for it?
    I tested it locally and saw it works but I'm not seeing any test for start_registry_watcher.
    Is there any pointer you can share?

  2. I wanted to use rustwide since I saw you used it but I'm not seeing an option to simply launch a process without docker.
    The closest was new_workspaceless but it's a private method and I didn't find how to get rustwide just run a plain command.
    So I'm using std::process::Command is that acceptable?

@pietroalbini
Copy link
Member Author

How would you recommend writing a test for it? I tested it locally and saw it works but I'm not seeing any test for start_registry_watcher. Is there any pointer you can share?

I don't think there is any test for that at all. Until around a year ago docs.rs had no tests at all (one of the focuses of the past year was building a test suite), so it's not that surprising.

I wanted to use rustwide since I saw you used it but I'm not seeing an option to simply launch a process without docker. The closest was new_workspaceless but it's a private method and I didn't find how to get rustwide just run a plain command. So I'm using std::process::Command is that acceptable?

Yeah using std::process::Command is fine! Rustwide is used by the code that builds the documentation, but not by the rest of the codebase.

@ohaddahan
Copy link
Contributor

@pietroalbini thank you for the details.
I created a PR, please review 😄

@jyn514 jyn514 added A-admin Area: Administration of the production docs.rs server and removed A-builds Area: Building the documentation for a crate labels Aug 12, 2020
@jyn514
Copy link
Member

jyn514 commented Sep 19, 2020

This was implemented in #975.

@jyn514 jyn514 closed this as completed Sep 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-admin Area: Administration of the production docs.rs server C-bug Category: This is a bug E-easy Effort: Should be easy to implement and would make a good first PR P-low Low priority issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants