@lhoestq lhoestq commented Oct 15, 2025

When pickling a HfFileSystem:

  • re-populate the instance cache (the fsspec cache that maps an instance's arguments to the instance itself)
  • re-populate the cache attributes of every instance (dircache and _repo_and_revision_exists_cache)

This is useful to keep the cache in multiprocessing instead of starting from scratch.
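
The two steps above can be sketched with a toy class. This is only an illustration of the idea, not the code in this PR: `ToyFileSystem`, `_Cached`, `_instance_cache`, `_rebuild`, and `roundtrip_as_fresh_worker` are hypothetical names, and the example assumes the cache state is shipped through `__reduce__`, which may differ from what `HfFileSystem` actually does.

```python
import pickle


class _Cached(type):
    """Hypothetical metaclass: one shared instance per set of constructor args."""

    def __call__(cls, *args):
        if args not in cls._instance_cache:
            cls._instance_cache[args] = super().__call__(*args)
        return cls._instance_cache[args]


class ToyFileSystem(metaclass=_Cached):
    """Hypothetical stand-in for a cached filesystem; not HfFileSystem itself."""

    _instance_cache = {}  # maps constructor args -> instance

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.dircache = {}  # path -> cached directory listing

    def __reduce__(self):
        # Ship the constructor args and a copy of the cache with the pickle.
        return _rebuild, (type(self), (self.endpoint,), dict(self.dircache))


def _rebuild(cls, args, dircache):
    # Calling the class registers (or reuses) the instance in the instance cache.
    fs = cls(*args)
    for path, listing in dircache.items():
        # Keep an entry that already exists locally; otherwise restore the pickled one.
        fs.dircache.setdefault(path, listing)
    return fs


def roundtrip_as_fresh_worker(fs):
    """Pickle `fs`, empty the instance cache to mimic a new process, then unpickle."""
    payload = pickle.dumps(fs)
    type(fs)._instance_cache.clear()
    return pickle.loads(payload)
```

In this sketch, an instance rebuilt in an empty process still carries the directory listings gathered by the parent, and a later `ToyFileSystem(endpoint)` call in that process returns the same rebuilt instance instead of a new, empty one.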

This is especially useful when streaming datasets, because DataLoader workers then don't have to re-populate the HfFileSystem cache. This PR ensures that each worker uses the cached list of files from the main process, which avoids unnecessary /api/.../tree calls.

In fact, this PR is needed for huggingface/datasets#7820, which ensures that DataLoaders don't make unnecessary requests when streaming datasets.

@lhoestq lhoestq requested a review from Wauplin October 15, 2025 16:14
codecov bot commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 36.36364% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.44%. Comparing base (ff79763) to head (f0701d0).
⚠️ Report is 6 commits behind head on main.

Files with missing lines                 Patch %   Lines
src/huggingface_hub/hf_file_system.py    36.36%    7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3443      +/-   ##
==========================================
+ Coverage   44.10%   52.44%   +8.34%     
==========================================
  Files         157      157              
  Lines       15527    15548      +21     
==========================================
+ Hits         6848     8154    +1306     
+ Misses       8679     7394    -1285     

@Wauplin Wauplin left a comment

Sorry for the delay, I wanted to carefully check what's going on in this PR 😄 Looks good to me like this 👍

Wauplin commented Oct 17, 2025

Failing CI is unrelated so feel free to merge.

@lhoestq lhoestq merged commit 5e77e04 into main Oct 17, 2025
21 of 25 checks passed
@lhoestq lhoestq deleted the hffs-keep-cache-on-pickle branch October 17, 2025 14:15