
gix corpus - an extendable way to run algorithms and record their results for comparison #858


Open
2 of 3 tasks
Byron opened this issue May 22, 2023 · 0 comments
Labels
C-tracking-issue: An issue to track the progress of multiple PRs or issues

Comments

@Byron
Member

Byron commented May 22, 2023

Generally, gix corpus maintains information about a corpus of git repositories and writes it into an SQLite database for later data analysis.

The corpus should contain as many of the top-by-stars GitHub repositories smaller than 5GB as fit on a disk, which was 80K repositories for a 4TB budget, leaving enough space for worktree checkouts as well. Be sure to also add one of the 100GB repositories by hand, for good measure.

Initialization
  • record information about the corpus as seen at one time, with some meta-data like pack-size and object size and other data by which to select which repos to run on.
  • assume append-only set of repositories, where removals are the exception that we don't care about
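
The initialization step above could be backed by a table like the following. This is a minimal sketch in Python with the standard sqlite3 module, not the actual gix implementation; the table name, the column names and the init_corpus helper are all assumptions made for illustration. It relies on SQLite's upsert syntax (SQLite 3.24 or newer) so that re-running the initialization updates statistics without deleting rows.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS repository (
    id INTEGER PRIMARY KEY,
    rela_path TEXT NOT NULL UNIQUE,   -- path relative to the corpus root
    pack_bytes INTEGER NOT NULL,      -- total size of all pack files
    object_bytes INTEGER NOT NULL,    -- total size of all objects
    seen_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_corpus(db, repos):
    """Record (rela_path, pack_bytes, object_bytes) tuples.

    Known repositories get refreshed statistics, new ones are appended,
    and nothing is ever deleted (append-only corpus).
    Returns the number of repositories now known.
    """
    db.executescript(SCHEMA)
    db.executemany(
        """
        INSERT INTO repository (rela_path, pack_bytes, object_bytes)
        VALUES (?, ?, ?)
        ON CONFLICT(rela_path) DO UPDATE SET
            pack_bytes = excluded.pack_bytes,
            object_bytes = excluded.object_bytes,
            seen_at = CURRENT_TIMESTAMP
        """,
        repos,
    )
    db.commit()
    return db.execute("SELECT COUNT(*) FROM repository").fetchone()[0]
```

Keeping the size columns in the same table as the path means a later filter can select repositories with a plain SQL condition on those columns.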

Run commands

  • dry-run mode which just shows what would be run.
  • inform the user when the corpus has changed. They can then re-run the initialization (non-destructively) to update statistics
  • offer a filter, ideally as an SQL statement, to be able to choose a subset of the corpus to run commands against
  • allow choosing the set of commands to run, or run all of them
  • each command can specify if it can run in parallel with other commands of its kind or not
  • if a command-type can be run in parallel with others, the runner will perform the parallelization. The number of threads can be configured.
  • keep information about each run along with its own version to be able to see what happened.
  • keep information about the result of each command along with timings (and maybe memory and CPU usage)
  • Each command can return a JSON Value with its own free-form information.
  • It's specifically useful for benchmarks that validate critical performance, like opening repositories, or resolving packs.
  • definitely store progress messages (see this method on the tree::Root)
  • commands can return timings for sub-tasks that they keep track of themselves, in a format that's usable for storage in the database. This way it's possible, for instance, to keep track of how long it takes to create an index file from a tree, and then how long it takes to perform an operation on the index.
  • Try to use tracing to record performance data about certain operations, akin to what git does, and store these spans in the database. These spans could be taken verbatim for analysis, ignoring their tree-structure at least at the beginning.
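
The runner described by the list above can be sketched as follows. This is an illustration in Python under stated assumptions, not the gix implementation: the run_commands function, the result table and its columns are hypothetical, and a repository table with a rela_path column is assumed to exist already. The sketch covers dry-run mode, an SQL condition as repository filter, per-command parallelism with a configurable thread count, and storing timings, errors and a free-form JSON value per run.

```python
import json
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor


def run_commands(db, commands, version, repo_filter="1", threads=4, dry_run=False):
    """Run each command against each selected repository.

    `commands` maps a name to (callable, can_run_in_parallel); the callable
    takes a repository path and returns a JSON-serializable value.
    `repo_filter` is an SQL condition and is trusted input from the user.
    Returns the plan as a list of (command, repo) pairs.
    """
    repos = [row[0] for row in db.execute(
        f"SELECT rela_path FROM repository WHERE {repo_filter} ORDER BY rela_path")]
    plan = [(name, repo) for name in commands for repo in repos]
    if dry_run:
        return plan  # only show what would be run, write nothing

    db.execute(
        "CREATE TABLE IF NOT EXISTS result ("
        "version TEXT, repo TEXT, command TEXT, seconds REAL, error TEXT, detail TEXT)"
    )

    def timed(name, repo):
        func = commands[name][0]
        start = time.perf_counter()
        try:
            detail, error = func(repo), None
        except Exception as err:  # a failing command is recorded, not fatal
            detail, error = None, str(err)
        seconds = time.perf_counter() - start
        return version, repo, name, seconds, error, json.dumps(detail)

    for name, (_, parallel) in commands.items():
        workers = threads if parallel else 1
        with ThreadPoolExecutor(max_workers=workers) as pool:
            rows = list(pool.map(lambda repo: timed(name, repo), repos))
        # All database writes happen on the calling thread.
        db.executemany("INSERT INTO result VALUES (?, ?, ?, ?, ?, ?)", rows)
    db.commit()
    return plan
```

Commands only compute and return values while the runner owns the database connection, so a command never has to know how results are stored. Sub-task timings could travel inside the returned JSON value in the same way.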

Analysis

A few very simple commands to answer questions like

  • did the performance of a command get better or worse (also for a subset of all available data)?
  • correlations between certain statistical datapoints, like size of pack, size of objects, and maybe how these affect the performance values (e.g. it got slower only for smaller objects)
  • make it easy to get access to the underlying data, maybe by emitting SQL statements to do so
  • make it easy to print all information about particular runs of commands
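
The first question above, whether a command got faster or slower, could be answered with a single query. The following is a minimal sketch in Python with the standard sqlite3 module; the result table, its columns (version, command, seconds, error) and the compare_versions helper are assumptions made for illustration, not an existing schema.

```python
import sqlite3

# Hypothetical query: averages the timings of successful runs per command
# for two versions side by side.
COMPARE = """
SELECT command,
       AVG(CASE WHEN version = :old THEN seconds END) AS old_avg,
       AVG(CASE WHEN version = :new THEN seconds END) AS new_avg
FROM result
WHERE error IS NULL AND version IN (:old, :new)
GROUP BY command
ORDER BY command
"""


def compare_versions(db, old, new):
    """Return {command: new_avg / old_avg}.

    A ratio above 1.0 means the command got slower, below 1.0 faster.
    Commands without successful timings for both versions are skipped.
    """
    return {
        command: new_avg / old_avg
        for command, old_avg, new_avg in db.execute(COMPARE, {"old": old, "new": new})
        if old_avg and new_avg
    }
```

Because the comparison is plain SQL, restricting it to a subset of the data only needs an extra condition in the WHERE clause, and the statement itself can be printed for users who want direct access to the underlying data.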

Ingestion Implementation

Analysis Implementation

Maybe at first we can limit the corpus run to specific repos that we check by hand in the corpus.db

  • TBD
Byron mentioned this issue Jun 13, 2023
10 tasks
Byron added the C-tracking-issue label (An issue to track the progress of multiple PRs or issues) Jun 16, 2023