Conversation

@AndreasAlbertQC
Collaborator

@AndreasAlbertQC AndreasAlbertQC commented Sep 2, 2025

Motivation

I would like to be able to serialize data to deltalake, including the dataframely metadata.

Changes

  • Refactored the tests from test_read_write_parquet.py such that they can easily be run for each storage backend.
    • Since different backends use different args / kwargs, support slightly different features (e.g. no truly lazy sink in deltalake at the moment), and need different code for mocking, I implemented a thin Tester wrapper interface that translates the calls needed for the tests into the specifics of each backend.
    • The tests are then simply parametrized with pytest.mark.parametrize over the different backend testers.
    • If we want to add another backend in the future, we only need to implement its tester class and can then fully reuse the existing tests.
  • Added implementation for reading from and writing to Delta Lake in Schema, Collection, and FailureInfo
    • Data is stored in Delta tables.
    • Metadata is stored in custom commit metadata. I initially considered table metadata, but switched to commit metadata because table metadata cannot be changed after the initial write and may therefore not reflect the current state of the table. With commit metadata, we can make sure to only trust the metadata if the last write came from dataframely.
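
To illustrate the reasoning behind commit metadata, here is a minimal, self-contained sketch of the trust rule. It does not use the deltalake API: `Commit`, `Table`, `trusted_schema`, and the key `dataframely_schema` are hypothetical names chosen for illustration, and the commit log is modeled as a plain Python list.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical key under which a serialized schema would be stored in a commit.
METADATA_KEY = "dataframely_schema"


@dataclass
class Commit:
    """One entry in a table's commit log (a stand-in for a Delta commit)."""

    operation: str
    custom_metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    """Toy table that only tracks its commit history, oldest commit first."""

    history: list[Commit] = field(default_factory=list)

    def write(self, custom_metadata: dict[str, str] | None = None) -> None:
        # Every write appends a commit; only some writers attach metadata.
        self.history.append(Commit("WRITE", dict(custom_metadata or {})))


def trusted_schema(table: Table) -> str | None:
    """Return the stored schema only if the most recent commit carries it."""
    if not table.history:
        return None
    return table.history[-1].custom_metadata.get(METADATA_KEY)


table = Table()
table.write({METADATA_KEY: '{"columns": ["a", "b"]}'})
schema_after_own_write = trusted_schema(table)

table.write()  # a later write by another tool, without the metadata
schema_after_foreign_write = trusted_schema(table)
```

After the first write, `schema_after_own_write` holds the stored schema string. After the second write, `schema_after_foreign_write` is `None`, because the latest commit no longer vouches for the table contents. Metadata fixed at table creation could not express this distinction.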

@github-actions github-actions bot added the enhancement New feature or request label Sep 2, 2025
@codecov

codecov bot commented Sep 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (4fd1b4d) to head (0785c2d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##              main      #134    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           45        49     +4     
  Lines         2577      2812   +235     
==========================================
+ Hits          2577      2812   +235     

@AndreasAlbertQC AndreasAlbertQC marked this pull request as ready for review September 3, 2025 13:53
@AndreasAlbertQC
Collaborator Author

@delsner @borchero this turned out bigger than I thought because I had to refactor how the storage testing is done (so that the tests are reusable across backends). Let me know if you'd like to chat about this synchronously to make it easier to review.
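
For readers skimming the thread, the tester refactoring can be sketched roughly as below. This is not the code from the PR: `StorageTester`, `JsonTester`, `PickleTester`, and `check_roundtrip` are hypothetical names, and JSON and pickle files stand in for the real Parquet and Delta backends so that the example runs with the standard library only.

```python
from __future__ import annotations

import json
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

Rows = list[dict[str, int]]


class StorageTester(ABC):
    """Hypothetical thin wrapper that hides backend-specific calls from the tests."""

    @abstractmethod
    def write(self, rows: Rows, directory: Path) -> None: ...

    @abstractmethod
    def read(self, directory: Path) -> Rows: ...


class JsonTester(StorageTester):
    """Stand-in backend that stores rows as a JSON file."""

    def write(self, rows: Rows, directory: Path) -> None:
        (directory / "data.json").write_text(json.dumps(rows))

    def read(self, directory: Path) -> Rows:
        return json.loads((directory / "data.json").read_text())


class PickleTester(StorageTester):
    """Stand-in backend that stores rows as a pickle file."""

    def write(self, rows: Rows, directory: Path) -> None:
        (directory / "data.pkl").write_bytes(pickle.dumps(rows))

    def read(self, directory: Path) -> Rows:
        return pickle.loads((directory / "data.pkl").read_bytes())


TESTERS: list[StorageTester] = [JsonTester(), PickleTester()]


def check_roundtrip(tester: StorageTester) -> bool:
    """Backend-agnostic test body: write rows, read them back, compare."""
    rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
    with tempfile.TemporaryDirectory() as tmp:
        tester.write(rows, Path(tmp))
        return tester.read(Path(tmp)) == rows


# With pytest, this loop would become
# `@pytest.mark.parametrize("tester", TESTERS)` on a test function.
results = {type(t).__name__: check_roundtrip(t) for t in TESTERS}
```

Under this design, supporting a new backend only means adding one more `StorageTester` subclass to `TESTERS`; the test bodies stay unchanged.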

Member

@borchero borchero left a comment

Thank you, this looks great! The abstraction really shows its worth here 😄

One additional request: can we add deltalake as an optional dependency to pyproject.toml?

@AndreasAlbertQC
Collaborator Author

One additional request: can we add deltalake as an optional dependency to pyproject.toml?

Good point, done. I also added sqlalchemy and pyarrow, since I don't see why they were not already listed.

@AndreasAlbertQC
Collaborator Author

Let's figure out the treatment of optional dependency testing in #135

Member

@borchero borchero left a comment

LGTM, just small suggestions! :)

@AndreasAlbertQC AndreasAlbertQC enabled auto-merge (squash) September 5, 2025 07:41
@AndreasAlbertQC AndreasAlbertQC merged commit 2c2b6f6 into main Sep 5, 2025
22 checks passed
@AndreasAlbertQC AndreasAlbertQC deleted the 2025-09-01_delta branch September 5, 2025 07:42

3 participants