feat(gist): fsspec file system for GitHub gists (resolves #888) #1791

Merged
merged 1 commit into from
May 23, 2025

Conversation

Contributor

@lmmx lmmx commented Feb 16, 2025

This PR introduces a new filesystem backend, GistFileSystem, which allows read-only access to files within a single GitHub Gist (as suggested in #888). I'd find this really useful in combination with Universal Pathlib (also an fsspec project)!

  • Gists are essentially flat collections of files, so there is no subdirectory concept. (Technically they are git repositories that can also store directories, but we only need to support them as flat file lists, which is all the website UI shows.)
  • The implementation is closely based on GithubFileSystem but simplified for a single gist.
  • Supports both public and private gists; the latter need a username and token (PAT).
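
The flat-listing idea above can be sketched without any network access. The snippet assumes the JSON shape returned by GitHub's "get a gist" REST endpoint (a `files` object keyed by filename, each entry carrying `size` and `raw_url`). The helper name `gist_listing` and the sample payload are hypothetical, and this is not the code in the PR.

```python
def gist_listing(payload):
    """Turn a gist API response into flat, fsspec-style file entries."""
    entries = []
    for name, meta in payload.get("files", {}).items():
        entries.append(
            {
                "name": name,  # gists are flat, so the name is the whole path
                "type": "file",  # never "directory" in this sketch
                "size": meta.get("size"),
                "raw_url": meta.get("raw_url"),  # extra key, an assumption
            }
        )
    # Sort by name so the listing order is deterministic
    return sorted(entries, key=lambda entry: entry["name"])


# Hypothetical payload, trimmed to the fields the sketch reads
sample = {
    "files": {
        "b.txt": {"size": 3, "raw_url": "https://example.invalid/b.txt"},
        "a.py": {"size": 10, "raw_url": "https://example.invalid/a.py"},
    }
}
listing = gist_listing(sample)
names = [entry["name"] for entry in listing]
```

Every entry has type "file", which is what lets `ls("")` stand in for the whole gist.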

Users can do:

import fsspec

# For a public gist
fs = fsspec.filesystem("gist", gist_id="729837f14264089288178a5f632221ab")
print(fs.ls(""))  # lists files
with fs.open("test1.txt", "rb") as f:
    print(f.read().decode())

For a private gist, do the same but also pass the username and token arguments.
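
A minimal sketch of the private-gist case, assuming the credentials live in environment variables. The variable names `GIST_USERNAME` and `GIST_TOKEN` and both helper functions are hypothetical; only the `gist_id`, `username`, and `token` keyword arguments come from this PR.

```python
import os


def gist_storage_options(gist_id, env=os.environ):
    """Build keyword arguments for fsspec.filesystem("gist", ...)."""
    options = {"gist_id": gist_id}
    username = env.get("GIST_USERNAME")  # hypothetical variable name
    token = env.get("GIST_TOKEN")  # hypothetical variable name
    # Only pass credentials when both are present (assumption of this sketch)
    if username and token:
        options.update(username=username, token=token)
    return options


def read_gist_file(gist_id, filename):
    """Read one file from a gist as text (needs fsspec and network access)."""
    import fsspec  # imported lazily so the helper above has no dependencies

    fs = fsspec.filesystem("gist", **gist_storage_options(gist_id))
    with fs.open(filename, "rb") as f:
        return f.read().decode()
```

With neither variable set, the options reduce to the public-gist call shown above.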

  • Implemented the filesystem (methods: ls, _open, cat, invalidate_cache) as a read-only implementation
  • Added to registry (alphabetically)
  • Added a test. I changed the gist ID to one owned by martindurant so the tests don't depend on someone else preserving their gist; it is also fairly small, so it shouldn't be slow to load.
  • Added documentation in docs/source/api.rst.
  • Verified that read-only operations (ls, cat, open) are working with public gists.

Example usage

Below is a short snippet showing how to retrieve files from a public gist:

import fsspec

gist_id = "16bee4256595d3b6814be139ab1bd54e"
print("Gist ID:", gist_id)
fs = fsspec.filesystem("gist", gist_id=gist_id)
file_list = fs.ls("")
print("Files in the Gist (via fsspec):", file_list)
contents = fs.cat(["gistfile1.txt"])
print(contents["gistfile1.txt"].decode()[:120] + "\n...")

Gist ID: 16bee4256595d3b6814be139ab1bd54e
Files in the Gist (via fsspec): ['gistfile1.txt']
import astropy.io.fits._tiled_compression as tiled
from astropy.io import fits
import numcodecs
import fsspec
import zar
...

@martindurant
Member

Thanks for providing! I haven't had a chance to look yet, but I will soon :)

@lmmx
Contributor Author

lmmx commented Feb 22, 2025

Most welcome, no worries! 😃

@martindurant
Member

Quick suggestion: it would be good to enable bundling the gist ID with the URL:

with fsspec.open("gist://16bee4256595d3b6814be139ab1bd54e@/test1.txt", "rb") as f:
    data = f.read()

like github: allows. It would require enabling extracting kwargs from the URL.

Member

@martindurant martindurant left a comment

As I read through, I found that you answered almost all of my questions elsewhere in the code :)

I don't think there's a good way to test this thoroughly, but at least we can reasonably expect gist to be available whenever GHA is running.

@martindurant
Member

Please ping me when I should have another look

@lmmx
Contributor Author

lmmx commented Feb 28, 2025

Thanks for reviewing, Martin. I got sidetracked in a CI-fixing rabbit hole this week, but I've thankfully emerged and can return to this!

Please ping me when I should have another look

Will do 🫡

@martindurant
Member

I just noticed this is still stalled. Please ask for help if you need it.

@lmmx
Contributor Author

lmmx commented May 22, 2025

Oh snap I’m sorry, yeah let me take a look…

@lmmx lmmx force-pushed the lmmx/gist-file-system branch 5 times, most recently from 6d5f82d to a10d426 Compare May 22, 2025 17:17
@lmmx
Contributor Author

lmmx commented May 22, 2025

Updated now with the ability to specify just a single file (and ran the linter, sorry I missed that last time)

Some TODOs:

  • Test for grabbing all files (changed the choice of gist to one of yours with multiple files so I can assert)
  • Test for grabbing a single file (and assert number of files is 1)
  • Test that a non-existent file raises FileNotFoundError
  • Tests for gist://... parsing with what we expect
  • Add the ability to specify a filename in the URL (to keep it conventional, I think the URI should specify just a single file; multi-file selection would go through the filenames kwarg)
  • Add the ability to specify a SHA (revision) in the URL
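
The single-file and missing-file behaviour in the list above could look roughly like this. It is a sketch only: `select_files` is a hypothetical helper, not a function in the PR, and it assumes the `filenames` kwarg is an optional list of names to keep.

```python
def select_files(available, filenames=None):
    """Restrict a gist's file names to the requested subset."""
    if filenames is None:
        # No filter requested: every file in the gist is visible
        return list(available)
    missing = [name for name in filenames if name not in available]
    if missing:
        # Same exception type a caller would expect from a local filesystem
        raise FileNotFoundError(f"not found in gist: {missing}")
    # Keep the gist's own ordering rather than the caller's
    return [name for name in available if name in filenames]


all_files = select_files(["a.txt", "b.txt"])
one_file = select_files(["a.txt", "b.txt"], filenames=["b.txt"])
```

A test for the single-file case would then assert that exactly one name comes back.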

@lmmx lmmx force-pushed the lmmx/gist-file-system branch from a10d426 to ba2b962 Compare May 22, 2025 17:25
@lmmx
Contributor Author

lmmx commented May 22, 2025

A little test that the URL is parsed as expected

import pytest

from fsspec.implementations.gist import GistFileSystem


@pytest.mark.parametrize(
    "gist_id,sha,file,token,user",
    [
        ("my-gist-id-12345", "sha_hash_a0b1", "a_file.txt", "secret_token", "my-user"),
        ("my-gist-id-12345", "sha_hash_a0b1", "a_file.txt", "secret_token", ""),
        ("my-gist-id-12345", None, "a_file.txt", "secret_token", "my-user"),  # No SHA
    ],
)
def test_gist_url_parse(gist_id, sha, file, token, user):
    if sha:
        fmt_str = f"gist://{user}:{token}@{gist_id}/{sha}/{file}"
    else:
        fmt_str = f"gist://{user}:{token}@{gist_id}/{file}"
    
    parsed = GistFileSystem._get_kwargs_from_urls(fmt_str)
    
    expected = {"gist_id": gist_id, "token": token}
    if user:  # Only include username if it's not empty
        expected["username"] = user
    if sha:  # Only include SHA if it's specified
        expected["sha"] = sha
    
    assert parsed == expected
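
For readers who want to see what such parsing involves, here is a standalone sketch that satisfies the three cases in the test above. It is not the `_get_kwargs_from_urls` implementation from the PR: `parse_gist_url` is a hypothetical name, and the sketch assumes the URL always takes the `gist://user:token@gist_id[/sha]/file` form used in the test.

```python
def parse_gist_url(url):
    """Extract gist_id, username, token and sha from a gist:// URL."""
    scheme, sep, rest = url.partition("://")
    if not sep or scheme != "gist":
        raise ValueError(f"not a gist:// URL: {url!r}")
    # Everything before the first "/" is "[user:token@]gist_id"
    netloc, _, path = rest.partition("/")
    userinfo, at, gist_id = netloc.rpartition("@")
    kwargs = {"gist_id": gist_id}
    if at:
        username, _, token = userinfo.partition(":")
        if username:  # an empty username is left out
            kwargs["username"] = username
        if token:
            kwargs["token"] = token
    # Two path segments mean "<sha>/<file>"; one means just "<file>"
    parts = [part for part in path.split("/") if part]
    if len(parts) == 2:
        kwargs["sha"] = parts[0]
    return kwargs


with_sha = parse_gist_url(
    "gist://my-user:secret_token@my-gist-id-12345/sha_hash_a0b1/a_file.txt"
)
without_sha = parse_gist_url("gist://:secret_token@my-gist-id-12345/a_file.txt")
```

The file name is deliberately not returned, matching the test, where it is the path rather than a constructor argument.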

@lmmx lmmx force-pushed the lmmx/gist-file-system branch from ba2b962 to e404148 Compare May 22, 2025 17:49
@lmmx lmmx force-pushed the lmmx/gist-file-system branch from e404148 to be6d20d Compare May 22, 2025 18:11
@lmmx
Contributor Author

lmmx commented May 22, 2025

Cool, all done. A "round trip" might be nice too

@lmmx
Contributor Author

lmmx commented May 22, 2025

Checks passed, except 3.10, which failed with an intermittent HTTP error from conda repodata (I don't have the ability to re-run it). LGTM

@martindurant
Member

I like it! Let's put it in, and see if the public has feedback once using it.

@martindurant martindurant merged commit 3e4fdce into fsspec:master May 23, 2025
19 of 20 checks passed
@lmmx lmmx deleted the lmmx/gist-file-system branch May 23, 2025 15:59