Implement n-level directory hashing for backend storage #22532

Closed
juur opened this issue Jan 19, 2023 · 8 comments
Labels
type/feature Completely new functionality. Can only be merged if feature freeze is not active. type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@juur

juur commented Jan 19, 2023

Feature Description

It would be handy for scaling and other factors if the repository location could involve a multiple level hash, e.g. instead of /repo.git you have /123/456/789/repo.git or /2/repo.git. The hashing should be configurable in terms of both the unique-values-per-level and the number-of-levels.

This could support simple scaling of NFS-backed storage, with perhaps the first layer being 4 (giving [0-3] as the top-level directories, corresponding to 4 different mount points) and the second layer being 256 (00 - ff), which may improve performance by keeping the number of entries per directory down.

The algorithm used should be exposed via a simple CLI tool, so that re-hashing can be done from shell scripts, e.g. mv $(gitea-hash 4 256 repo-name)/repo-name $(gitea-hash 8 256 repo-name)/repo-name to enable scaling.

I believe Go's standard library implements the FNV algorithm (hash/fnv), which is pretty trivial, so the overhead would be low.

Screenshots

No response

@juur juur added the type/feature and type/proposal labels Jan 19, 2023
@delvh
Member

delvh commented Jan 19, 2023

I don't quite understand how making everything much more complicated is going to help with anything.
Could you please explain that a bit more?

@juur
Author

juur commented Jan 19, 2023

My understanding is that you currently specify a single directory, which is the root folder containing the repositories. This makes it hard to scale to multiple NFS servers (for example). Introducing directory hashing would allow you to have multiple NFS mount points.

I also don't really see how going from root/folder.repo to root/hash/folder.repo is "much more complicated"?

@delvh
Member

delvh commented Jan 19, 2023

Well, at the moment, we find repos based on the path <repository root>/<user or org>/<repo-name>.git.
It is much more difficult to find where a repository is when we have to search through 4*256 combinations where it could be.
Storing which repo is located where would still introduce its own slew of bugs.
All that for the small benefit to support multiple NFS servers?
To me, it seems unlikely that this will be implemented, as it has multiple drawbacks that you only notice once it has been implemented.
However, we plan to support clustered instances, which is something somewhat similar but also different.


> I also don't really see how going from root/folder.repo to root/hash/folder.repo is "much more complicated"?

Please, try it yourself (without missing any edge cases! There will be a few here). And tell me afterward if it is really "not a lot of work".
For example, I implemented the Viewed checkbox in PRs. That's something where you would think it doesn't have any edge cases either, right? Well… wrong. There were quite a few edge cases I didn't imagine either.

@juur
Author

juur commented Jan 19, 2023

In your example the hash function would be passed "user or org" and return the extra path entry. E.g. if the user name was Fred, the function might return 2/6/Fred. This should only need a change to whichever method returns a home directory.

How would that require searching anything?

https://medium.com/eonian-technologies/file-name-hashing-creating-a-hashed-directory-structure-eabb03aa4091 is an example (although not one I'd recommend)

@juur
Author

juur commented Jan 21, 2023

This is a minimal proof-of-concept of what I'm talking about. I haven't ever programmed in Go before, but hopefully this explains better what I'm getting at. It worked for me for SSH and HTTP (although it's hardly elegant). It is obviously hardcoded to only support hashing over 4 top-level folders.

```diff
diff --git a/cmd/serv.go b/cmd/serv.go
index 346c918b1..42d407c75 100644
--- a/cmd/serv.go
+++ b/cmd/serv.go
@@ -12,3 +12,5 @@ import (
        "os"
+       "path/filepath"
        "os/exec"
+       "hash/fnv"
        "regexp"
@@ -304,4 +306,7 @@ func runServ(c *cli.Context) error {

+       hash := fnv.New32()
+       hash.Write([]byte(strings.ToLower(results.OwnerName)))
+
        process.SetSysProcAttribute(gitcmd)
-       gitcmd.Dir = setting.RepoRootPath
+       gitcmd.Dir = filepath.Join(setting.RepoRootPath, fmt.Sprintf("%02x", hash.Sum32() & 0x3))
        gitcmd.Stdout = os.Stdout
diff --git a/models/user/user.go b/models/user/user.go
index a2c54a442..21d73b82f 100644
--- a/models/user/user.go
+++ b/models/user/user.go
@@ -17,2 +17,3 @@ import (
        "time"
+       "hash/fnv"

@@ -1000,3 +1001,5 @@ func GetInactiveUsers(ctx context.Context, olderThan time.Duration) ([]*User, er
 func UserPath(userName string) string { //revive:disable-line:exported
-       return filepath.Join(setting.RepoRootPath, strings.ToLower(userName))
+       hash := fnv.New32()
+       hash.Write([]byte(strings.ToLower(userName)))
+       return filepath.Join(setting.RepoRootPath, fmt.Sprintf("%02x", hash.Sum32() & 0x3), strings.ToLower(userName))
 }
```

The other interesting observation I have (which might be incorrect) is that the UserPath function isn't used for SSH based commands which seems a little odd, but I assume it's to do with how the pass-thru to the git binary works.

@juur
Author

juur commented Jan 23, 2023

I've made a draft pull request, #22588, to see if there is any interest in this functionality to help Gitea scale storage.

@techknowlogick
Member

Yeah, a sharded repo filestore based on uuid or hash is something that would be beneficial as it would allow greater flexibility in repo names, reduce fs operations in repo renames/transfers, and possibly other benefits as well. I guess it would also prevent some confusion with users attempting to push directly to the internal directories.

There is an issue somewhere around about creating an interface for git operations so that something like gitaly could optionally be used, which allows for sharding and more.

@juur
Author

juur commented Jan 23, 2023

> Yeah, a sharded repo filestore based on uuid or hash is something that would be beneficial as it would allow greater flexibility in repo names, reduce fs operations in repo renames/transfers, and possibly other benefits as well. I guess it would also prevent some confusion with users attempting to push directly to the internal directories.
>
> There is an issue somewhere around about creating an interface for git operations so that something like gitaly could optionally be used, which allows for sharding and more.

I think I saw that issue. That felt very much like a major piece of work that would be cool (AWS S3 backend plugin, etc.), but I felt something simple like this might help in some situations without being overly invasive - and could be migrated to some future as-yet-undefined plugin architecture.

@juur juur closed this as not planned Nov 11, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 27, 2023

3 participants