Interesting performance issue counting commits #617
Comments
Right now each partition is one repository, so increasing the number of cores does not help. Also, these many-core systems usually have cores that are slightly slower than those of top-of-the-line laptops.
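To illustrate why extra cores do not help when there is a single partition, here is a minimal sketch. It is not gitbase's actual scheduler: `busyWorkers` and the worker-pool layout are hypothetical, and the only assumption taken from the thread is that one repository maps to one unit of work.

```go
package main

import (
	"fmt"
	"sync"
)

// busyWorkers feeds one task per partition to a pool of `workers` goroutines
// and returns how many distinct workers processed at least one task.
// Hypothetical helper for illustration only; not part of gitbase.
func busyWorkers(partitions, workers int) int {
	tasks := make(chan int)
	used := make([]bool, workers) // each goroutine writes only its own slot
	var wg sync.WaitGroup

	for id := 0; id < workers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for range tasks {
				used[id] = true // stand-in for scanning one repository
			}
		}(id)
	}

	for p := 0; p < partitions; p++ {
		tasks <- p
	}
	close(tasks)
	wg.Wait()

	busy := 0
	for _, u := range used {
		if u {
			busy++
		}
	}
	return busy
}

func main() {
	// One repository means one partition: 95 of the 96 workers stay idle.
	fmt.Println(busyWorkers(1, 96)) // prints 1
	// With more partitions, at most one worker per partition can be busy.
	fmt.Println(busyWorkers(8, 96) <= 8) // prints true
}
```

Under this model, the number of useful cores is bounded by the number of partitions, so a single large repository runs at the speed of one (possibly slower) core.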
Not important, but with that command you are counting only the commits in the history of master.
Comparing the same script, but using go-git instead of git:

    package main

    import (
    	"fmt"
    	"io"

    	"gopkg.in/src-d/go-git.v4"
    	. "gopkg.in/src-d/go-git.v4/_examples" // provides CheckIfError
    )

    func main() {
    	// Open the repository stored in the .git directory.
    	r, err := git.PlainOpen(".git")
    	CheckIfError(err)

    	// Iterate over every commit object in the repository,
    	// not only those reachable from a reference.
    	iter, err := r.CommitObjects()
    	CheckIfError(err)
    	defer iter.Close()

    	for {
    		c, err := iter.Next()
    		if err == io.EOF {
    			break
    		}
    		CheckIfError(err)
    		fmt.Println(c.Hash.String())
    	}
    }

Result:
So we can say that we should focus our efforts on improving go-git more. Binary used to list hashes from a repository:
We're debugging a problem that might be making go-git abnormally slow when iterating objects on the kubernetes repository specifically, but everything @ajnavarro and @jfontan said still applies.
In a repo of this size, at least 50% of the time is spent reading object headers in the packfile to check whether they are commits. That means that after getting all commits, we still need to check more than 700k object headers for their type. Why is this faster with If we make the assumption that, when using gitbase, we're not interested in dangling commits or the reflog, which I think is fair to assume, we could try changing the implementation of the
It might be worth considering, yeah. Maybe under a flag, so the old behavior is still available for those who need it?
Actually, the current behavior is undesirable unless you are using gitbase to do some weird analysis on local development repositories rather than fresh clones. If we change it, I wouldn't expose it at all.
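The trade-off discussed above (checking the header of every object versus following history from the references) can be sketched with a toy in-memory object store. This is not go-git's implementation: the `object` type, `sampleStore`, `countByScan`, and `countByWalk` are all hypothetical names invented for illustration, and the store holds only seven objects.

```go
package main

import "fmt"

type objType int

const (
	commitObj objType = iota
	treeObj
	blobObj
)

// object is a toy stand-in for a packfile entry: a type plus,
// for commits, the hashes of its parents.
type object struct {
	typ     objType
	parents []string
}

// sampleStore returns three reachable commits, one dangling commit,
// one tree, two blobs, and a single reference tip pointing at c3.
func sampleStore() (map[string]object, []string) {
	store := map[string]object{
		"c1": {typ: commitObj},
		"c2": {typ: commitObj, parents: []string{"c1"}},
		"c3": {typ: commitObj, parents: []string{"c2"}},
		"d1": {typ: commitObj, parents: []string{"c1"}}, // dangling
		"t1": {typ: treeObj},
		"b1": {typ: blobObj},
		"b2": {typ: blobObj},
	}
	return store, []string{"c3"}
}

// countByScan inspects every object and keeps those typed as commits.
func countByScan(store map[string]object) (commits, inspected int) {
	for _, o := range store {
		inspected++
		if o.typ == commitObj {
			commits++
		}
	}
	return commits, inspected
}

// countByWalk starts at the reference tips and follows parent links,
// so it only touches objects reachable from a reference.
func countByWalk(store map[string]object, refs []string) (commits, inspected int) {
	seen := make(map[string]bool)
	stack := append([]string(nil), refs...)
	for len(stack) > 0 {
		h := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[h] {
			continue
		}
		seen[h] = true
		o, ok := store[h]
		if !ok {
			continue // reference or parent pointing to a missing object
		}
		inspected++
		if o.typ != commitObj {
			continue
		}
		commits++
		stack = append(stack, o.parents...)
	}
	return commits, inspected
}

func main() {
	store, refs := sampleStore()
	c, n := countByScan(store)
	fmt.Printf("scan: %d commits, %d headers inspected\n", c, n) // scan: 4 commits, 7 headers inspected
	c, n = countByWalk(store, refs)
	fmt.Printf("walk: %d commits, %d headers inspected\n", c, n) // walk: 3 commits, 3 headers inspected
}
```

In this toy model the walk inspects fewer headers but skips the dangling commit, which is the behavior change being weighed in the comments above.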
To make this possible, we need this merged into go-git so that gitbase works with siva files and with repositories whose references point to missing objects: src-d/go-git#1067

Numbers executing

Current master:

mysql> select count(*) from commits;
+----------+
| COUNT(*) |
+----------+
| 83167 |
+----------+
1 row in set (21,05 sec)

Using

+----------+
| COUNT(*) |
+----------+
| 78171 |
+----------+
1 row in set (3,92 sec)
I cloned https://github.com/kubernetes/kubernetes in order to count how many commits I can find.
Counting the commits accessible from HEAD in this way takes around 2 seconds on my MacBook Pro. The next step is doing the same thing with gitbase.
Lastly, I tried to see whether adding cores would help. Running on a GCP instance with 96 cores and way more RAM than we need, the analysis
It takes longer than before! I assumed it was because my laptop has an SSD, while this instance was using an HDD, so I tried storing the dataset (just Kubernetes) in RAM. The result was interesting, as in it took longer than before!
I have no idea why this is, but it goes completely against my expectations.