ATM `CachingFileSystem` has a single bool option, `same_names`, to switch the layout of files from `/hash` to `/url-filename`, and thus does not leave room for "improvement":

Under heavy use of the cache, having a flat tree of files (`/hash` or `/url-filename` based) could lead to a very large directory, so the file system could become inefficient at listing that directory, etc.
- A common workaround (look under `.git/objects`; the same approach is used by git-annex, girder, etc.) is to establish leading directories, e.g. for a `/hash` layout the path to the file could be `/hash[:2]/hash[2:4]/hash[4:]`, thus reducing the impact on the file system.
- For a URL-based path, it could simply be a path constructed from the URI components, e.g. for the URL `http://domain/p1/p2/filename` it could become `http/domain/p1/p2/filename`, thus allowing disambiguation between file systems, etc., and also avoiding conflicts for the same common filename (as I guess would happen now with `same_names=True`).
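To make the two path schemes above concrete, here is a rough sketch of how they could be derived. The function names `hash_tree_path` and `url_full_path` are made up for illustration, and using a SHA-256 hex digest of the URL as the hash is an assumption, not necessarily what the cache does today.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def hash_tree_path(url: str) -> str:
    """Hypothetical 'hashtree' layout: hash[:2]/hash[2:4]/hash[4:]."""
    # Assumption: a SHA-256 hex digest of the URL stands in for the cache hash.
    digest = hashlib.sha256(url.encode()).hexdigest()
    return str(PurePosixPath(digest[:2], digest[2:4], digest[4:]))


def url_full_path(url: str) -> str:
    """Hypothetical 'url_fullpath' layout: scheme/domain/path components."""
    parts = urlsplit(url)
    # Skip empty, "." and ".." segments so the result stays inside the cache dir.
    segments = [s for s in parts.path.split("/") if s not in ("", ".", "..")]
    return str(PurePosixPath(parts.scheme, parts.netloc, *segments))
```

With two leading levels of two hex characters each, files would be spread over up to 256 * 256 directories instead of one.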
With the above in mind, I think it would be nice if, instead of `same_names`, there were a `layout={hash,hashtree,url_filename,url_fullpath}` option or alike, thus allowing users to switch to the most appropriate layout for their use case.
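One way such an option could coexist with the existing bool is sketched below. `resolve_layout` and `LAYOUTS` are hypothetical names, and the mapping of `same_names=True` to `url_filename` (otherwise `hash`) is only my reading of the current behaviour described above.

```python
from typing import Optional

# Layout names taken from the proposal; nothing here exists in the library.
LAYOUTS = ("hash", "hashtree", "url_filename", "url_fullpath")


def resolve_layout(layout: Optional[str] = None,
                   same_names: Optional[bool] = None) -> str:
    """Hypothetical helper: pick a layout name, honoring the legacy bool."""
    if layout is not None and same_names is not None:
        raise ValueError("pass either layout or same_names, not both")
    if layout is None:
        # Legacy mapping: same_names=True -> /url-filename, otherwise /hash.
        return "url_filename" if same_names else "hash"
    if layout not in LAYOUTS:
        raise ValueError(f"unknown layout {layout!r}; expected one of {LAYOUTS}")
    return layout
```

Existing callers passing `same_names` would keep working, while new code could ask for `layout="hashtree"` or `layout="url_fullpath"` directly.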