Implement n-level directory hashing for backend storage #22532
I don't quite understand how making everything much more complicated is going to help with anything.
My understanding is that currently you specify a single directory which is the root folder containing the repositories. This makes it hard to scale to multiple NFS servers, for example. Introducing directory hashing would allow you to have multiple NFS mount points. I also don't really see how going from root/folder.repo to root/hash/folder.repo is "much more complicated"?
Well, at the moment, we find repos based on the path.
Please try it yourself (without missing any edge cases; there will be a few here). Then tell me afterward whether it is really "not a lot of work".
In your example, the hash function would be passed the user or org name and return the extra path entry. E.g. if the user name was Fred, the function might return 2/6/Fred. This would (or should) only need a change to whichever method returns a home directory. How would that require searching anything? https://medium.com/eonian-technologies/file-name-hashing-creating-a-hashed-directory-structure-eabb03aa4091 is an example of the technique (although not one I'd recommend).
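The idea above can be sketched in a few lines of Go. This is a hypothetical helper, not Gitea's actual API: the function name `hashedUserPath`, the two single-character levels, and the use of FNV-1a (mentioned later in this issue as being in Go's standard library) are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"path/filepath"
)

// hashedUserPath is a hypothetical sketch (not Gitea's real code) of the
// scheme described above: hash the user/org name and use the first two hex
// digits of the hash as extra path levels, so "Fred" lands under something
// like root/x/y/Fred. The mapping is deterministic across runs.
func hashedUserPath(root, userName string) string {
	h := fnv.New32a()
	h.Write([]byte(userName)) // FNV-1a over the name; cheap and stable
	hex := fmt.Sprintf("%08x", h.Sum32())
	return filepath.Join(root, hex[0:1], hex[1:2], userName)
}

func main() {
	fmt.Println(hashedUserPath("/data/repos", "Fred"))
}
```

Because only the home-directory computation changes, lookups still go straight to the hashed path; nothing needs to be searched.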
This is a minimal proof-of-concept of what I'm talking about. I haven't ever programmed in Go before, but hopefully this explains better what I'm getting at. It worked for me for both SSH and HTTP (although it's hardly elegant). It is obviously hardcoded to support hashing over only 4 top-level folders.
The other interesting observation I have (which might be incorrect) is that the UserPath function isn't used for SSH-based commands, which seems a little odd, but I assume that's to do with how the pass-through to the git binary works.
I've made a draft pull request #22588 to see if there is any interest in this functionality to help Gitea scale storage.
Yeah, a sharded repo filestore based on a UUID or hash is something that would be beneficial: it would allow greater flexibility in repo names, reduce filesystem operations in repo renames/transfers, and possibly bring other benefits as well. I guess it would also prevent some confusion with users attempting to push directly to the internal directories. There is an issue somewhere around about creating an interface for git operations so that something like Gitaly could optionally be used, which allows for sharding and more.
I think I saw that issue. It felt very much like a major piece of work that would be cool (an AWS S3 backend plugin, etc.), but I felt something simple like this might help in some situations without being overly invasive, and it could be migrated to some future, as-yet-undefined plugin architecture.
Feature Description
It would be handy for scaling and other reasons if the repository location could involve a multi-level hash, e.g. instead of /repo.git you would have /123/456/789/repo.git or /2/repo.git. The hashing should be configurable in terms of both the unique values per level and the number of levels.
This could support simple scaling of NFS-backed storage, with perhaps the first layer being 4 (giving [0-3] as the top-level directory, each corresponding to one of 4 different mount points) and the second layer being 256 (00-ff), which may improve entries-per-directory performance.
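A minimal sketch of that 4 x 256 layout in Go, using FNV-1a since it ships in the standard library's hash/fnv package; the function name `repoPath` and the exact bucket arithmetic are assumptions for illustration, not Gitea code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// repoPath sketches the two-level scheme described above: the first level
// has 4 unique values (directories 0-3, one per NFS mount point) and the
// second has 256 (00-ff). Illustrative only, not Gitea's implementation.
func repoPath(name string) string {
	h := fnv.New64a()
	h.Write([]byte(name)) // FNV-1a: cheap, deterministic, in the stdlib
	sum := h.Sum64()
	top := sum % 4            // top level: 0-3
	second := (sum / 4) % 256 // second level: 00-ff
	return fmt.Sprintf("%d/%02x/%s.git", top, second, name)
}

func main() {
	fmt.Println(repoPath("repo"))
}
```

With each of 0/, 1/, 2/, 3/ being a separate mount point, repositories spread roughly evenly over four NFS servers, and the 256 second-level buckets keep any single directory from accumulating too many entries.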
The algorithm used should be exposed via some simple CLI tool or similar, such that re-hashing can be done via shell scripts or similar, e.g.
`mv $(gitea-hash 4 256 repo-name)/repo-name $(gitea-hash 8 256 repo-name)/repo-name`
to enable scaling. Go implements the FNV algorithm in its standard library (hash/fnv), which is pretty trivial, so low overhead.
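A sketch of what such a `gitea-hash` tool could look like; no such command exists in Gitea today, the argument convention simply follows the mv example above, and FNV-1a from Go's hash/fnv supplies the hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"strconv"
	"strings"
)

// hashPrefix computes the shard prefix for name under the given per-level
// sizes, e.g. sizes {4, 256} yields paths like "2/a7". A hypothetical
// sketch of the proposed gitea-hash tool, not an existing Gitea command.
func hashPrefix(sizes []uint64, name string) string {
	h := fnv.New64a()
	h.Write([]byte(name))
	sum := h.Sum64()
	parts := make([]string, len(sizes))
	for i, n := range sizes {
		// zero-pad each level to its full width, e.g. 256 -> "00".."ff"
		width := len(strconv.FormatUint(n-1, 16))
		parts[i] = fmt.Sprintf("%0*x", width, sum%n)
		sum /= n
	}
	return strings.Join(parts, "/")
}

func main() {
	// usage: gitea-hash <size>... <name>
	args := os.Args[1:]
	if len(args) < 2 {
		// no CLI args: fall back to a demo invocation
		fmt.Println(hashPrefix([]uint64{4, 256}, "repo-name"))
		return
	}
	sizes := make([]uint64, 0, len(args)-1)
	for _, a := range args[:len(args)-1] {
		n, err := strconv.ParseUint(a, 10, 64)
		if err != nil || n == 0 {
			fmt.Fprintf(os.Stderr, "bad level size %q\n", a)
			os.Exit(1)
		}
		sizes = append(sizes, n)
	}
	fmt.Println(hashPrefix(sizes, args[len(args)-1]))
}
```

Because the prefix is a pure function of the level sizes and the name, the same binary can compute both the old and new locations during a re-shard, exactly as in the mv one-liner above.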