Backup GitHub information #20
Versioned content history
Certainly! Additionally, GitHub issues are editable and deletable, so there is potential for abuse (e.g. deceptive revisions) that a diff-tracking backup system would help prevent. On the other hand, in the case of content that needs to be purged (e.g. copyrighted, malicious, or inappropriate material), persistence of old versions could be problematic. I recently created a backup solution for a text corpus (
There are several options according to https://help.github.com/articles/backing-up-a-repository/
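For illustration, a minimal sketch of the simplest option from that page, a mirror clone of the git data (the repository URL and destination path below are just examples; this covers the repository itself, not issues, PRs, or attachments):

```python
import subprocess

# Hypothetical example values; substitute the real repository and backup path.
REPO_URL = "https://github.com/python/cpython.git"
BACKUP_DIR = "cpython-backup.git"

# A mirror clone copies all refs (branches, tags, notes), which suits backups.
subprocess.run(["git", "clone", "--mirror", REPO_URL, BACKUP_DIR], check=True)

# Later runs can refresh the same mirror instead of re-cloning from scratch.
subprocess.run(["git", "--git-dir", BACKUP_DIR, "remote", "update", "--prune"], check=True)
```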
New GitHub Migrations API: https://developer.github.com/changes/2018-05-24-user-migration-api/
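Roughly, starting an export through that API looks something like the sketch below (based on the preview docs of the time; the token and repository name are placeholders):

```python
import requests

HEADERS = {
    "Authorization": "token <personal-access-token>",  # placeholder token
    # The Migrations API was in preview and required this media type at the time.
    "Accept": "application/vnd.github.wyandotte-preview+json",
}

# Kick off an export of one or more repositories owned by the authenticated user.
resp = requests.post(
    "https://api.github.com/user/migrations",
    headers=HEADERS,
    json={"repositories": ["Mariatta/black_out"]},  # example repository
)
resp.raise_for_status()
migration = resp.json()
print(migration["id"], migration["state"])  # e.g. 12345 "pending"
```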
I haven't tried the new Migrations API, but I've tried one of the backup mechanisms mentioned in https://help.github.com/articles/backing-up-a-repository/, using GitHub Records Archiver. I used my personal access token to run the script. It was able to back up these repos for me within the python organization before it ran into the API rate limit 😛 But for each of the projects that it did back up:
It was able to back up these projects before I used up all my available API calls.
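For anyone hitting the same wall, a quick way to see how much quota is left is the /rate_limit endpoint; a small sketch, with the token as a placeholder:

```python
import requests

resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": "token <personal-access-token>"},  # placeholder token
)
resp.raise_for_status()
core = resp.json()["resources"]["core"]
# Remaining calls out of the limit, plus the epoch time when the window resets.
print(core["remaining"], "/", core["limit"], "resets at", core["reset"])
```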
OK, just read this about the Migrations API:
This is as far as I can go, since I'm not a Python organization owner :)
I kicked off an archive for
Thanks @brettcannon and @ewdurbin :) I archived my own project (black_out); the output is not as huge as CPython, so I figured it might be easier to analyze.
Never mind that link above, it timed out 😅 Here is the downloaded content:
The result of the Migrations API dump appears to have everything and is well organized. Since the dump is intended for migrating from GitHub to GitHub Enterprise, and it is an official GitHub offering (although currently in preview), it seems to be the solution least likely to require regular maintenance beyond ensuring it's run and that we have collected and stored the tarball safely. Summary of what's there: on a cursory glance, these generally line up with GitHub API objects in JSON format:
Beyond that we get into the primitives that comprise what we see as a "Pull Request" or "Issue"; again, these appear to line up 1:1 with JSON objects from the GitHub API.
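To give a feel for sanity-checking an archive, a small sketch that lists the JSON files inside the tarball and counts the records in each (the file names inside the archive are whatever GitHub produced; nothing is assumed here beyond tar + JSON):

```python
import json
import tarfile

ARCHIVE = "migration_archive.tar.gz"  # placeholder path to the downloaded dump

with tarfile.open(ARCHIVE, "r:gz") as tar:
    for member in tar.getmembers():
        if not member.name.endswith(".json"):
            continue
        data = json.load(tar.extractfile(member))
        # Most of the JSON files hold a list of API-style objects; print a count,
        # or fall back to the top-level keys if it's a single object.
        if isinstance(data, list):
            print(f"{member.name}: {len(data)} records")
        else:
            print(f"{member.name}: {sorted(data)}")
```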
Thanks for the update, @ewdurbin! Will you be able to set up daily backups for the python GitHub org? (cpython is higher priority, I would think 😇) Thanks!
@Mariatta the backup took about 15 minutes to run, but it's asynchronous, so we can just kick one off and then poll for completion before pulling the archive. The result was 320 MB, so I'm curious whether weekly might suffice for now? If we stick with daily, what kind of retention would we want? Daily backups for the past week, weekly backups for the past month, monthly backups forever?
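For reference, the kick-off / poll / fetch flow @ewdurbin describes would look roughly like this against the org Migrations API (a sketch; the org name, token, repository, and output path are placeholders):

```python
import time
import requests

ORG = "python"  # placeholder org
HEADERS = {
    "Authorization": "token <personal-access-token>",
    "Accept": "application/vnd.github.wyandotte-preview+json",  # preview media type
}
API = f"https://api.github.com/orgs/{ORG}/migrations"

# 1. Kick off the export for the repositories we care about.
started = requests.post(API, headers=HEADERS, json={"repositories": ["python/cpython"]})
started.raise_for_status()
migration_id = started.json()["id"]

# 2. Poll until the export finishes; states go pending -> exporting -> exported.
while True:
    status = requests.get(f"{API}/{migration_id}", headers=HEADERS)
    status.raise_for_status()
    state = status.json()["state"]
    if state in ("exported", "failed"):
        break
    time.sleep(60)

# 3. Pull the archive; GitHub redirects to a short-lived download URL.
if state == "exported":
    archive = requests.get(f"{API}/{migration_id}/archive", headers=HEADERS)
    archive.raise_for_status()
    with open("python-migration.tar.gz", "wb") as f:
        f.write(archive.content)
```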
Hmm, I don't know what the usual good backup practice is. Open to suggestions.
How crazy would it be to stick everything into a git repo that would be hosted on GitHub but also mirrored somewhere else?
That's probably not completely out of the realm of reasonability. The biggest concern there would be attachments and the notorious "git + big/binary files" limitations.
For large binary files, I would suggest using Git LFS. GitHub supports LFS files up to two gigabytes in size. If your organization qualifies for GitHub Education, you can request a free LFS quota. It is also possible to use a GitHub repository with LFS assets stored on GitLab; however, the interface is less user-friendly that way.
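For reference, a minimal sketch of what tracking the attachment blobs with LFS could look like (the file patterns are hypothetical; git and git-lfs must already be installed):

```python
import subprocess

def run(*cmd: str) -> None:
    """Run a git command in the backup repository and fail loudly on errors."""
    subprocess.run(cmd, check=True)

# One-time setup in the repository that holds the backups.
run("git", "lfs", "install")

# Hypothetical patterns for issue/PR attachments pulled out of the archive;
# adjust to whatever the real attachment layout turns out to be.
for pattern in ("attachments/**", "*.tar.gz"):
    run("git", "lfs", "track", pattern)

# The patterns are recorded in .gitattributes, which must be committed too.
run("git", "add", ".gitattributes")
run("git", "commit", "-m", "Track backup attachments with Git LFS")
```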
Well, I assume there are limits on the attachments anyway. @ewdurbin, could you check what the biggest file there is?
Limitations on attachments are documented here.
That's not that big. I mean, of course versioning a 25 MB binary blob will eventually be crazy, but those attachments don't change over time IMHO.
I think Ernest's retention policy suggestion works.
Okay, for the initial pass I'll set up a task to kick off the "migration" and fetch it once complete each day. I think the archives can just be dropped into an S3 bucket with a little bit of structure and some retention policies to automatically clear out unnecessary archives. Will post back here with more information.
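As one way to express that retention in S3 itself, lifecycle rules can expire old archives automatically; a sketch with boto3, where the bucket name and prefix layout are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical layout: archives are uploaded under daily/, weekly/, and monthly/
# prefixes; S3 then expires each class of archive on its own schedule.
s3.put_bucket_lifecycle_configuration(
    Bucket="python-org-github-backups",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {"ID": "daily", "Filter": {"Prefix": "daily/"},
             "Status": "Enabled", "Expiration": {"Days": 7}},
            {"ID": "weekly", "Filter": {"Prefix": "weekly/"},
             "Status": "Enabled", "Expiration": {"Days": 31}},
            # monthly/ archives are kept forever, so no rule is needed for them.
        ]
    },
)
```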
I never came back and updated this. We ended up using Backhub; it keeps daily snapshots for the past month and pushes archives to S3 as well.
In that case, this issue can probably be closed! :)
Backup is set up, and there have been no comments in the past ~2 years. Closing! 💾💾
Better safe than sorry.