
Reduce the sizes of the repository and release tarballs #2681

Open
4 of 5 tasks
seisman opened this issue Feb 7, 2020 · 10 comments
Labels
longterm Long standing issues that need to be resolved

Comments

@seisman
Member

seisman commented Feb 7, 2020

The GMT repository is getting bigger. The current size is ~1 GB, and the largest directories are:

  • .git: ~720 MB. It contains the whole git history. We can't reduce the size unless we rewrite the git history.
  • test: ~113 MB. It contains many large PS files (>1 MB) that could be made smaller.
  • doc: ~70 MB. Files in the fig directory can be optimized to smaller sizes.
  • share: 14 MB. Nothing we can do about this directory.
  • src: 15 MB. Nothing to do.
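
A breakdown like the one above can be reproduced with a short script. This is only a sketch: the function names `dir_size_bytes` and `largest_subdirs` are made up for illustration, and it sums apparent file sizes, so the numbers may differ somewhat from what `du` reports.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def largest_subdirs(repo_root, top=5):
    """Return (name, size_in_bytes) for the largest top-level directories."""
    sizes = []
    for entry in os.scandir(repo_root):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, dir_size_bytes(entry.path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]
```

Calling `largest_subdirs(".")` from the top of a clone should list `.git` first, followed by the other directories in decreasing order of size.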

The release tarball doesn't ship the GMT tests, and its size is ~150 MB once uncompressed. Its largest directories are:

  • doc: ~70 MB
  • doc_release: 50 MB

Here is what we can do to reduce the repository and tarball sizes:

  • Some PS files are large because they use high-resolution GSHHG or earth relief files. For example, test/genper/east_map_0.ps is used to test a map projection (maybe I'm wrong), but it plots all coastlines, rivers and national boundaries, and the PS file is ~5 MB. We could use lower-resolution data to greatly reduce their sizes.
  • Optimize the images in the fig directory.
  • Optimize images in the doc/rst/source/users_contrib_symbols directory
  • Reduce size of images before deploying to gh-pages branch.
  • The doc_release contains the HTML documentation and hundreds of images. These images can be optimized to much smaller sizes. These images are generated when building documentation, so we need to do the image optimization before each release.
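
To decide which PS files to tackle first, one could list the files above a size threshold. The following is a minimal sketch; `find_large_files` is a hypothetical helper, and the 1 MB default simply mirrors the ">1 MB" figure mentioned earlier.

```python
from pathlib import Path


def find_large_files(root, suffix=".ps", min_bytes=1_000_000):
    """Return (path, size) pairs for files under root that end in suffix
    and are at least min_bytes in size, largest first."""
    hits = []
    for path in Path(root).rglob("*" + suffix):
        if path.is_file():
            size = path.stat().st_size
            if size >= min_bytes:
                hits.append((path, size))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits
```

Running `find_large_files("test")` from the repository root would give the candidates for lower-resolution data; the same function with `suffix=".png"` could be pointed at the documentation images.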
@PaulWessel
Member

During AGU we discussed this briefly and I think @leouieda mentioned there was a way to reduce the git history? It would be nice to cut back to the GMT 5 release in 2013, I think. Surely there is a way to achieve that? For the tests, I think we may need to split them from the repo. Even with smaller PS files we will eventually get too big. If we were to decouple the tests from the repo we would want to remove all the git history related to that part, and we would want things to work more or less the same way: given a GMT installation, we could cd into the test repo and run make check, etc.

@joa-quim
Member

joa-quim commented Feb 7, 2020

We could store the test figs directly in png (and at 150 dpi). That would save future space and probably speed up the tests because the originals would be already rasterized. However, this won't save history space.
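
The conversion to PNG at 150 dpi could be done with `gmt psconvert`, whose `-Tg` option selects PNG output and `-E` sets the resolution. The sketch below only builds the command; the helper name `psconvert_png_command` is an assumption for illustration, and other options (cropping, output directory) are left out.

```python
def psconvert_png_command(ps_path, dpi=150):
    """Build a `gmt psconvert` command that rasterizes ps_path to PNG."""
    # -Tg selects PNG output; -E sets the resolution in dpi.
    return ["gmt", "psconvert", str(ps_path), "-Tg", "-E{}".format(dpi)]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where GMT is installed.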

@leouieda
Member

leouieda commented Feb 7, 2020

The history is a bit tricky. There is a way to prune any large files from the history. But that might mean that we can't run tests when checking out those commits because the files aren't there any more. The GMT history is so large that it's almost useless to have commits dating back that far.

What we can do is split the repository. One solution would be to prune the old history from the current one and save a backup gmt5 repo in the organization. There is no easy way to do that without overwriting the entire thing on github. Consequences of that would be:

  • Forks will all need to be deleted and re-forked
  • Clones need to be deleted and re-cloned

Another option is to start a new gmt6 repository with only the history that we want. But that would create different problems:

  • Issues would have to be migrated. Not sure how hard that is.
  • PRs would have to be reopened in the new repo.
  • Backlog of PRs would not be ported (should be fine since we have the commit history).
  • Clones and forks would still need to be redone.

@PaulWessel
Member

Wasn't there the option of doing a shallow git clone, so that users at least get a light .git directory instead of 800 MB? Perhaps we can determine the oldest file in the repo that has not changed for a long time and use that as the cutoff?
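
For reference, a shallow clone is requested with the `--depth` or `--shallow-since` options of `git clone`. The helper below is a hypothetical sketch that only assembles the command; note that a shallow clone shrinks the local copy but leaves the repository on GitHub unchanged.

```python
def shallow_clone_command(url, depth=None, since=None):
    """Build a `git clone` command that limits the downloaded history."""
    cmd = ["git", "clone"]
    if depth is not None:
        # Keep only the most recent `depth` commits of each branch tip.
        cmd.append("--depth={}".format(depth))
    if since is not None:
        # Keep only commits newer than the given date.
        cmd.append("--shallow-since={}".format(since))
    cmd.append(url)
    return cmd
```

For example, `shallow_clone_command(url, depth=1)` fetches only the latest commit, while `since="2013-01-01"` would correspond to a date-based cutoff like the one suggested above (the exact date is an assumption).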

@PaulWessel
Member

@leouieda
Member

leouieda commented Mar 2, 2020

I've done some of that in the past with another repository. It can reduce the size quite a bit, but remember that it deletes some files from the history, so they will be lost forever. This isn't a problem if these files shouldn't have been in the history to start with. But it will limit our ability to check out and test older commits. If that's not a problem, then this might be an option.

What would be large files that can be deleted from the history? They can't still be present in the current state for this to work.
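
One way to answer that question is to list the largest blobs in the history. A common recipe pipes `git rev-list --objects --all` into `git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'`; the sketch below parses that output. The function name `largest_blobs` is made up for illustration.

```python
def largest_blobs(batch_check_output, top=10):
    """Parse `git cat-file --batch-check` output and return the largest blobs.

    Each input line is expected to look like:
        <type> <sha> <size> <path>
    Returns a list of (size, path, sha) tuples, largest first.
    """
    blobs = []
    for line in batch_check_output.splitlines():
        # Split at most three times so paths containing spaces stay intact.
        parts = line.split(" ", 3)
        if len(parts) < 4 or parts[0] != "blob":
            continue  # ignore commits, trees, and malformed lines
        _type, sha, size, path = parts
        blobs.append((int(size), path, sha))
    blobs.sort(reverse=True)
    return blobs[:top]
```

Paths that show up in this list but no longer exist in the current checkout would be the candidates for removal from the history.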

Since this option rewrites history on master, all branches must be rebased onto the new master branch. Forks will also be out of date, and anyone making pull requests from outdated forks will cause huge conflicts. The easiest way is to instruct people to delete their forks and make new ones (assuming they have pending PRs and branches).

Again, any solution we choose will cause disruption to current development since it rewrites git history.

@PaulWessel
Member

Yes, not so simple due to the history. The big files we are discussing are PS originals for the tests and examples, and it would seem we would have to completely remove the test dir from the history (and create a separate gmt-test repo) in order to remove the entire test dir from the gmt history. Separately, we could modify test scripts that make large PS files as indicated above.
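
As a rough sketch of what such a split could involve, `git filter-branch` can either extract a directory into its own history or strip it from every commit. The helper names below are hypothetical, the commands are only built and not executed, and the git documentation now recommends the separate `git filter-repo` tool for this kind of rewrite, so treat this as an illustration to try on a throwaway clone only.

```python
import shlex


def extract_subdir_command(subdir):
    """Command that rewrites history so that subdir becomes the repository root."""
    return ["git", "filter-branch", "--prune-empty",
            "--subdirectory-filter", subdir, "--", "--all"]


def remove_subdir_command(subdir):
    """Command that rewrites history with subdir removed from every commit."""
    # The index filter runs once per commit and drops subdir from the index.
    index_filter = "git rm -r --cached --ignore-unmatch {}".format(
        shlex.quote(subdir))
    return ["git", "filter-branch", "--prune-empty",
            "--index-filter", index_filter, "--", "--all"]
```

Under these assumptions, `extract_subdir_command("test")` would be run in the clone that becomes the separate test repository, and `remove_subdir_command("test")` in the clone that stays as the main repository.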

@seisman
Member Author

seisman commented May 5, 2020

What's the purpose of the tests in the test/genper directory? Some of the scripts plot all rivers and boundaries using "pscoast -Ia -Na -W". The sizes of the PS files can be reduced if we don't plot them.

@PaulWessel
Member

Yes, nothing extra is learned by plotting the rivers and borders. I think just coastlines is fine, so please make changes to this.
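
To find the scripts that need this change, one could scan each command line for pscoast calls that still pass `-I` (rivers) or `-N` (borders). This is a small sketch with a hypothetical helper name; it assumes each command fits on a single line.

```python
import shlex


def river_and_border_options(command):
    """Return the -I (rivers) and -N (borders) options used by a pscoast call.

    Commands that do not call pscoast yield an empty list.
    """
    tokens = shlex.split(command)
    if "pscoast" not in tokens:
        return []
    return [t for t in tokens if t.startswith(("-I", "-N"))]
```

For the options quoted above, `river_and_border_options("pscoast -Ia -Na -W")` returns `["-Ia", "-Na"]`, which are the two options to drop.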

@stale

stale bot commented Aug 5, 2020

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions.

@stale stale bot added the stale This will not be worked on label Aug 5, 2020
@seisman seisman added the longterm Long standing issues that need to be resolved label Aug 5, 2020
@stale stale bot removed the stale This will not be worked on label Aug 5, 2020