-
Notifications
You must be signed in to change notification settings - Fork 13
feat: Implement support for multiple storage backends #99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dataframely/collection.py
Outdated
| assert isinstance(fs, fsspec.AbstractFileSystem) | ||
| try: | ||
| fs.makedir(directory, create_parents=True) | ||
| except FileExistsError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fs.makedir has no exists_ok parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try fs.makedirs? It seems to provide that parameter.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #99 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 41 41
Lines 2401 2410 +9
=========================================
+ Hits 2401 2410 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
delsner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice progress, thanks!
dataframely/collection.py
Outdated
| fs, _ = fsspec.url_to_fs( | ||
| directory if isinstance(directory, str) else str(directory) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| fs, _ = fsspec.url_to_fs( | |
| directory if isinstance(directory, str) else str(directory) | |
| ) | |
| fs, _ = fsspec.url_to_fs(str(directory)) |
Isn't this the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for catching
dataframely/collection.py
Outdated
| assert isinstance(fs, fsspec.AbstractFileSystem) | ||
| try: | ||
| fs.makedir(directory, create_parents=True) | ||
| except FileExistsError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try fs.makedirs? It seems to provide that parameter.
| except FileExistsError: | ||
| pass | ||
|
|
||
| with fs.open(f"{directory}{fs.sep}schema.json", "w") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to ensure that directory doesn't already end with a separator here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with directory = directory.rstrip(fs.sep)
dataframely/collection.py
Outdated
| if isinstance(directory, Path): | ||
| directory = str(directory) | ||
| fs, _ = fsspec.url_to_fs(directory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if isinstance(directory, Path): | |
| directory = str(directory) | |
| fs, _ = fsspec.url_to_fs(directory) | |
| fs, _ = fsspec.url_to_fs(str(directory)) |
| # that the data adheres to the collection and we do not need to run validation. | ||
| if (json_serialization := directory / "schema.json").exists(): | ||
| metadata = json_serialization.read_text() | ||
| if fs.exists(json_serialization := f"{directory}{fs.sep}schema.json"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.a. I think we need to make sure that directory doesn't already have a trailing path separator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
| module = ["pyarrow.*", "pytest_benchmark.*", "sklearn.*"] | ||
|
|
||
| [[tool.mypy.overrides]] | ||
| module = ["fsspec.*"] | ||
| follow_untyped_imports = true | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| module = ["pyarrow.*", "pytest_benchmark.*", "sklearn.*"] | |
| [[tool.mypy.overrides]] | |
| module = ["fsspec.*"] | |
| follow_untyped_imports = true | |
| module = ["fsspec.*", "pyarrow.*", "pytest_benchmark.*", "sklearn.*"] | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to add fsspec in pyproject.toml as a dependency.
| os: [ubuntu-latest, windows-latest] | ||
| os: [ubuntu-latest] | ||
| environment: [py310, py311, py312, py313] | ||
| services: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Services are only supported on linux OS in Github CI. I didnt find a way to dry, so I just created two jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: did you try using docker compose up minio -d --wait instead of using GitHub actions services? This might also work on Windows.
|
|
||
| def check_platform(path_fixture: str) -> None: | ||
| if platform.system() == "Windows" and path_fixture == "s3_path": | ||
| pytest.skip("Skipping because Minio is not set up in Windows CI") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just skip s3 tests if no s3 envvars are set to be able to run tests locally without starting minio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes more sense, will adjust :)
delsner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just a few smaller comments.
|
|
||
| unit-tests: | ||
| unit-tests-linux: | ||
| name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }} | |
| name: Unit Tests (Linux) - ${{ matrix.environment }} |
| os: [ubuntu-latest, windows-latest] | ||
| os: [ubuntu-latest] | ||
| environment: [py310, py311, py312, py313] | ||
| services: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: did you try using docker compose up minio -d --wait instead of using GitHub actions services? This might also work on Windows.
| jobs: | ||
| unit-tests: | ||
| name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) | ||
| name: Unit Tests (Linuxs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| name: Unit Tests (Linuxs) | |
| name: Unit Tests (Linux) |
| unit-tests-linux: | ||
| name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }} | ||
| timeout-minutes: 30 | ||
| runs-on: ${{ matrix.os }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the matrix here at all?
| def s3_path() -> str: | ||
| """Fixture to provide a mock S3 path for testing.""" | ||
| s3 = boto3.resource("s3") | ||
| bucket_name = "".join(random.choice(string.ascii_lowercase) for _ in range(10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just use a uuid here, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please also test that everything works if a directory path with trailing / is passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does polars support anything else than s3:// paths beyond regular file paths? Should we test for those as well (e.g., file:// or hf://)?
|
Hey @FrithjofWinkelmannQC, I wanted to check in with you on how we should proceed with this PR. A little background info: @borchero and I have been working towards integrating dataframely with Anyway, my point is: If we have |
|
This PR has diverged significantly from |
Motivation
Closes #87.
This PR introduces
fsspecto support different storage backends for IO methods. For now, we assume all relevant access parameters are configured via environment variables.Changes
fsspecto dependencies and modified IO calls incollection.pyto use that instead of the native python library