
Interesting performance issue counting commits #617


Closed
campoy opened this issue Nov 14, 2018 · 7 comments · Fixed by #698

@campoy
Contributor

campoy commented Nov 14, 2018

I cloned https://github.com/kubernetes/kubernetes in order to count how many commits I can find.

$ git clone https://github.com/kubernetes/kubernetes
$ cd kubernetes
$ git log | grep "^commit " |  wc -l
72071

Counting the commits reachable from HEAD in this way takes around 2 seconds on my MacBook Pro.
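(For what it's worth, `git rev-list --count HEAD` reports the same number directly, without formatting the full log and grepping it:)

```shell
# Count commits reachable from HEAD; equivalent to the
# `git log | grep "^commit " | wc -l` pipeline above.
git rev-list --count HEAD
```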

The next step is doing the same thing with gitbase.

$ time srcd sql "select count(*) from commits"
+----------+
| COUNT(*) |
+----------+
|    79991 |
+----------+
srcd sql "select count(*) from  commits;"  0.02s user 0.03s system 0% cpu 18.415 total

Lastly, I tried to see whether adding cores would help. Running on a GCP instance with 96 cores and far more RAM than we need, the analysis takes even longer:

+----------+
| COUNT(*) |
+----------+
|    79991 |
+----------+
1 row in set (22.90 sec)

It takes longer than before! I assumed it was because my laptop has an SSD while this instance was using an HDD ... so I tried storing the dataset (just Kubernetes) in RAM. The result was interesting ... as in it took even longer!

+----------+
| COUNT(*) |
+----------+
|    79991 |
+----------+
1 row in set (23.72 sec)

I have no idea why this is, but it goes completely against my expectations.

@campoy campoy added the performance Performance improvements label Nov 14, 2018
@jfontan
Contributor

jfontan commented Nov 15, 2018

Right now each partition is one repository, so increasing the number of cores does not help. Also, these many-core systems usually have cores that are slightly slower than those in a top-of-the-line laptop.

@ajnavarro
Contributor

ajnavarro commented Nov 15, 2018

Not important, but with that command you are counting only the commits in master's history.
The correct command is:

$ time git log --format=oneline --all | wc -l
80028

real	0m1.063s
user	0m1.064s
sys	0m0.200s

Running the same count, but using go-git instead of git:

package main

import (
    "fmt"
    "io"

    "gopkg.in/src-d/go-git.v4"
    . "gopkg.in/src-d/go-git.v4/_examples"
)

func main() {
    // Open the repository in the current directory.
    r, err := git.PlainOpen(".git")
    CheckIfError(err)

    // Iterate over every commit object in the object database,
    // not just those reachable from references.
    iter, err := r.CommitObjects()
    CheckIfError(err)
    defer iter.Close()

    for {
        c, err := iter.Next()
        if err == io.EOF {
            break
        }
        CheckIfError(err)

        fmt.Println(c.Hash.String())
    }
}

Result:

$ time /tmp/list-commits | wc -l
80028

real	0m20.113s
user	0m19.868s
sys	0m2.220s

So we can say that we should focus our efforts on improving go-git.

Binary used to list hashes from a repository:

list-commits.zip

@smola
Contributor

smola commented Nov 15, 2018

We're debugging a problem that might be making go-git abnormally slow when iterating objects on the kubernetes repository specifically, but everything @ajnavarro and @jfontan said still applies.

@smola
Contributor

smola commented Nov 15, 2018

In a repo of this size, at least 50% of the time is spent reading object headers in the packfile to check whether they are commits or not. That means that, on top of the actual commits, we still need to check more than 700k object headers for their type. Why is this faster with git log? Because git log, or even git log --all, does not iterate over all commit objects; it just walks history from the references, ignoring dangling commits or commits only referenced from the reflog.

If we assume that gitbase users are not interested in dangling commits or the reflog, which I think is a fair assumption, we could change the implementation of the commits table to return all commits reachable from all known references. I think this might buy us a big speedup on big repositories.
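The distinction is easy to reproduce with plain git: rev-list walks only from references, while `cat-file --batch-all-objects` enumerates the whole object database, dangling commits included. A throwaway-repo sketch (note `--batch-all-objects` needs git 2.6+):

```shell
# Create a throwaway repo with one reachable and one dangling commit.
repo=$(mktemp -d) && cd "$repo" && git init -q
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m first
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m second
git reset -q --hard HEAD~1   # "second" is now dangling (reflog only)

# Walking from references misses the dangling commit...
git rev-list --count --all                                         # -> 1

# ...while enumerating the object database still finds it.
git cat-file --batch-check --batch-all-objects | grep -c ' commit '   # -> 2
```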

@campoy
Contributor Author

campoy commented Nov 16, 2018

It might be worth considering, yeah. Maybe under a flag, so the old behavior is still available for those that need it?

@smola
Contributor

smola commented Nov 19, 2018

Actually, the current behavior is undesirable unless you are using gitbase to do some weird analysis on local development repositories rather than fresh clones. If we change it, I wouldn't expose the old behavior at all.

@ajnavarro ajnavarro added the blocked Some other issue is blocking this label Nov 19, 2018
@ajnavarro ajnavarro added blocked Some other issue is blocking this and removed blocked Some other issue is blocking this labels Nov 27, 2018
@kuba-- kuba-- assigned kuba-- and unassigned kuba-- Dec 17, 2018
@ajnavarro ajnavarro removed the blocked Some other issue is blocking this label Feb 12, 2019
@ajnavarro
Contributor

To make this possible we need this PR merged in go-git, so that gitbase works with siva files and with repositories whose references point to missing objects: src-d/go-git#1067

Numbers executing select count(*) from commits;:

Current master:

mysql> select count(*) from commits;
+----------+
| COUNT(*) |
+----------+
|    83167 |
+----------+
1 row in set (21,05 sec)

Using repo.LogAll:

+----------+
| COUNT(*) |
+----------+
|    78171 |
+----------+
1 row in set (3,92 sec)
