52 commits
4e18df1
Add parquet scan options and docs (#7801)
lhoestq Oct 9, 2025
cfcdfce
More Parquet streaming docs (#7803)
lhoestq Oct 9, 2025
02ee330
Less api calls when resolving data_files (#7805)
lhoestq Oct 9, 2025
5eec91a
Parquet: add `on_bad_file` argument to error/warn/skip bad files (#7806)
lhoestq Oct 9, 2025
fd8d287
typo (#7807)
lhoestq Oct 9, 2025
7e1350b
release: 4.2.0 (#7808)
lhoestq Oct 9, 2025
f25661f
Set dev version (#7809)
lhoestq Oct 9, 2025
88d53e2
fix conda deps (#7810)
lhoestq Oct 9, 2025
63c933a
Add pyarrow's binary view to features (#7795)
delta003 Oct 10, 2025
aa7f2a9
Fix polars cast column image (#7800)
CloseChoice Oct 13, 2025
3e13d30
Allow streaming hdf5 files (#7814)
lhoestq Oct 13, 2025
12f5aca
Retry open hf file (#7822)
lhoestq Oct 17, 2025
0b2a4c2
Keep hffs cache in workers when streaming (#7820)
lhoestq Oct 17, 2025
74c7154
Fix batch_size default description in to_polars docstrings (#7824)
albertvillanova Oct 20, 2025
fb445ff
docs: document_dataset PDFs & OCR (#7812)
ethanknights Oct 20, 2025
d10e846
Add custom fingerprint support to `from_generator` (#7533)
simonreise Oct 23, 2025
9332649
picklable batch_fn (#7826)
lhoestq Oct 23, 2025
41c0529
release: 4.3.0 (#7827)
lhoestq Oct 23, 2025
159a645
set dev version (#7828)
lhoestq Oct 23, 2025
5138876
Add nifti support (#7815)
CloseChoice Oct 24, 2025
a7600ac
Fix random seed on shuffle and interleave_datasets (#7823)
CloseChoice Oct 24, 2025
6d985d9
fix ci compressionfs (#7830)
lhoestq Oct 24, 2025
f7c8e46
fix: better args passthrough for `_batch_setitems()` (#7817)
sghng Oct 27, 2025
627ed2e
Fix: Properly render [!TIP] block in stream.shuffle documentation (#7…
art-test-stack Oct 28, 2025
9e5b0e6
resolves the ValueError: Unable to avoid copy while creating an array…
ArjunJagdale Oct 28, 2025
8b1bd4e
Python 3.14 (#7836)
lhoestq Oct 31, 2025
0e7c6ca
Add num channels to audio (#7840)
CloseChoice Nov 3, 2025
03c16ec
fix column with transform (#7843)
lhoestq Nov 3, 2025
fc7f97c
support fsspec 2025.10.0 (#7844)
lhoestq Nov 3, 2025
232cb10
Release: 4.4.0 (#7845)
lhoestq Nov 4, 2025
5cb2925
set dev version (#7846)
lhoestq Nov 4, 2025
f2f58b3
Better streaming retries (504 and 429) (#7847)
lhoestq Nov 4, 2025
d32a1f7
DOC: remove mode parameter in docstring of pdf and video feature (#7848)
CloseChoice Nov 5, 2025
6a6983a
release: 4.4.1 (#7849)
lhoestq Nov 5, 2025
91f96a0
dev version (#7850)
lhoestq Nov 5, 2025
3356d74
Fix embed storage nifti (#7853)
CloseChoice Nov 6, 2025
cf647ab
ArXiv -> HF Papers (#7855)
qgallouedec Nov 10, 2025
17f40a3
fix some broken links (#7859)
julien-c Nov 10, 2025
c97e757
Nifti visualization support (#7874)
CloseChoice Nov 21, 2025
004a5bf
Replace papaya with niivue (#7878)
CloseChoice Nov 27, 2025
b8291fc
feat(bids): add pybids optional dependency and config check
The-Obstacle-Is-The-Way Nov 29, 2025
ea34394
test(bids): add synthetic BIDS dataset fixtures
The-Obstacle-Is-The-Way Nov 29, 2025
f441822
feat(bids): implement basic BIDS loader module
The-Obstacle-Is-The-Way Nov 29, 2025
d06fcd0
fix(test): repair syntax in BIDS test fixture
The-Obstacle-Is-The-Way Nov 29, 2025
34be5a4
fix(test): handle Bids init exception
The-Obstacle-Is-The-Way Nov 29, 2025
67a6b6b
feat(bids): add subject/session/datatype filtering
The-Obstacle-Is-The-Way Nov 29, 2025
2305c2a
test(bids): add multi-subject filtering test
The-Obstacle-Is-The-Way Nov 29, 2025
962ee8b
feat(bids): add validation and error handling
The-Obstacle-Is-The-Way Nov 29, 2025
6425d93
docs(bids): add BIDS loading guide
The-Obstacle-Is-The-Way Nov 29, 2025
b748207
fix(bids): lint and format fixes, remove deprecated trust_remote_code
The-Obstacle-Is-The-Way Nov 29, 2025
bc5a3fd
fix(bids): apply CodeRabbit feedback
The-Obstacle-Is-The-Way Nov 29, 2025
fda30c3
chore: trigger CI
The-Obstacle-Is-The-Way Nov 29, 2025
6 changes: 4 additions & 2 deletions .github/conda/meta.yaml
@@ -20,11 +20,12 @@ requirements:
- dill
- pandas
- requests >=2.19.0
- httpx <1.0.0
- tqdm >=4.66.3
- dataclasses
- multiprocess
- fsspec
- huggingface_hub >=0.24.0,<1.0.0
- huggingface_hub >=0.25.0,<2.0.0
- packaging
run:
- python
@@ -35,11 +36,12 @@ requirements:
- dill
- pandas
- requests >=2.19.0
- httpx <1.0.0
- tqdm >=4.66.3
- dataclasses
- multiprocess
- fsspec
- huggingface_hub >=0.24.0,<1.0.0
- huggingface_hub >=0.25.0,<2.0.0
- packaging

test:
16 changes: 8 additions & 8 deletions .github/workflows/ci.yml
@@ -82,7 +82,7 @@ jobs:
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/

test_py312:
test_py314:
needs: check_code_quality
strategy:
matrix:
@@ -100,18 +100,18 @@
run: |
sudo apt update
sudo apt install -y ffmpeg
- name: Set up Python 3.12
- name: Set up Python 3.14
uses: actions/setup-python@v5
with:
python-version: "3.12"
python-version: "3.14"
- name: Setup conda env (windows)
if: ${{ matrix.os == 'windows-latest' }}
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
miniconda-version: "latest"
activate-environment: test
python-version: "3.12"
python-version: "3.14"
- name: Setup FFmpeg (windows)
if: ${{ matrix.os == 'windows-latest' }}
run: conda install "ffmpeg=7.0.1" -c conda-forge
@@ -127,7 +127,7 @@
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/

test_py312_future:
test_py314_future:
needs: check_code_quality
strategy:
matrix:
@@ -145,18 +145,18 @@
run: |
sudo apt update
sudo apt install -y ffmpeg
- name: Set up Python 3.12
- name: Set up Python 3.14
uses: actions/setup-python@v5
with:
python-version: "3.12"
python-version: "3.14"
- name: Setup conda env (windows)
if: ${{ matrix.os == 'windows-latest' }}
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
miniconda-version: "latest"
activate-environment: test
python-version: "3.12"
python-version: "3.14"
- name: Setup FFmpeg (windows)
if: ${{ matrix.os == 'windows-latest' }}
run: conda install "ffmpeg=7.0.1" -c conda-forge
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -120,7 +120,7 @@ If you are a **dataset author**... you know what to do, it is your dataset after

If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.

Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://arxiv.org/abs/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).
Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://huggingface.co/papers/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).

Thank you for your contribution!

2 changes: 1 addition & 1 deletion README.md
@@ -136,7 +136,7 @@ If you're a dataset owner and wish to update any part of it (description, citati

## BibTeX

If you want to cite our 🤗 Datasets library, you can use our [paper](https://arxiv.org/abs/2109.02846):
If you want to cite our 🤗 Datasets library, you can use our [paper](https://huggingface.co/papers/2109.02846):

```bibtex
@inproceedings{lhoest-etal-2021-datasets,
4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
@@ -88,6 +88,10 @@
title: Load document data
- local: document_dataset
title: Create a document dataset
- local: nifti_dataset
title: Create a medical imaging dataset
- local: bids_dataset
title: Load a BIDS dataset
title: "Vision"
- sections:
- local: nlp_load
63 changes: 63 additions & 0 deletions docs/source/bids_dataset.mdx
@@ -0,0 +1,63 @@
# BIDS Dataset

[BIDS (Brain Imaging Data Structure)](https://bids.neuroimaging.io/) is a standard for organizing and describing neuroimaging and behavioral data. The `datasets` library supports loading BIDS datasets directly, leveraging `pybids` for parsing and `nibabel` for handling NIfTI files.

<Tip>

To use the BIDS loader, you need to install the `bids` extra (which installs `pybids` and `nibabel`):

```bash
pip install datasets[bids]
```

</Tip>

## Loading a BIDS Dataset

You can load a BIDS dataset by pointing to its root directory (containing `dataset_description.json`):

```python
from datasets import load_dataset

# Load a local BIDS dataset
ds = load_dataset("bids", data_dir="/path/to/bids/dataset")

# Access the first example
print(ds["train"][0])
# {
# 'subject': '01',
# 'session': 'baseline',
# 'datatype': 'anat',
# 'suffix': 'T1w',
# 'nifti': <nibabel.nifti1.Nifti1Image>,
# ...
# }
```

The `nifti` column contains `nibabel` image objects, which can be visualized interactively in Jupyter notebooks.
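
A minimal sketch of working with a loaded image, assuming the `nifti` column holds standard `nibabel` image objects as described above:

```python
example = ds["train"][0]
img = example["nifti"]            # a nibabel image object (e.g. Nifti1Image)

data = img.get_fdata()            # voxel data as a NumPy array
print(data.shape)                 # e.g. (x, y, z) or (x, y, z, t)
print(img.affine)                 # 4x4 voxel-to-world affine
print(img.header.get_zooms())     # voxel sizes in mm (plus TR for 4D images)
```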

## Filtering

You can filter the dataset by BIDS entities like `subject`, `session`, and `datatype` when loading:

```python
# Load only specific subjects and datatypes
ds = load_dataset(
"bids",
data_dir="/path/to/bids/dataset",
subjects=["01", "05", "10"],
sessions=["pre", "post"],
datatypes=["func"],
)
```

## Metadata

BIDS datasets often include JSON sidecar files with metadata (e.g., scanner parameters). This metadata is loaded into the `metadata` column as a JSON string.

```python
import json

metadata = json.loads(ds["train"][0]["metadata"])
print(metadata["RepetitionTime"])
```
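
Once loaded, the usual `datasets` methods apply. For example, here is a small illustrative sketch (assuming the `datatype` and `suffix` columns shown above) that keeps only T1-weighted anatomical scans:

```python
# Illustrative only: relies on the `datatype` and `suffix` columns shown above.
t1w = ds["train"].filter(
    lambda example: example["datatype"] == "anat" and example["suffix"] == "T1w"
)
print(len(t1w))
```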
4 changes: 2 additions & 2 deletions docs/source/dataset_card.mdx
@@ -1,7 +1,7 @@
# Create a dataset card

Each dataset should have a dataset card to promote responsible usage and inform users of any potential biases within the dataset.
This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://arxiv.org/abs/1810.03993).
This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://huggingface.co/papers/1810.03993).
Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.

Creating a dataset card is easy and can be done in just a few steps:
@@ -24,4 +24,4 @@ Creating a dataset card is easy and can be done in just a few steps:

YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.

Feel free to take a look at the [SNLI](https://huggingface.co/datasets/snli), [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/allocine) dataset cards as examples to help you get started.
Feel free to take a look at the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli), [CNN/DailyMail](https://huggingface.co/datasets/abisee/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/tblard/allocine) dataset cards as examples to help you get started.
20 changes: 10 additions & 10 deletions docs/source/document_dataset.mdx
@@ -1,13 +1,13 @@
# Create a document dataset

This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.
This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand PDFs.

> [!TIP]
> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

## PdfFolder

The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand PDFs without requiring you to write any code.

> [!TIP]
> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
@@ -72,32 +72,32 @@ file_name,additional_feature
or using `metadata.jsonl`:

```jsonl
{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your pdfs"}
{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your pdfs"}
{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your pdfs"}
{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your PDFs"}
{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your PDFs"}
{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your PDFs"}
```

Here the `file_name` must be the name of the PDF file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the PDF file.

It's possible to point to more than one pdf in each row in your dataset, for example if both your input and output are pdfs:
It's possible to point to more than one PDF in each row in your dataset, for example if both your input and output are pdfs:

```jsonl
{"input_file_name": "0001.pdf", "output_file_name": "0001_output.pdf"}
{"input_file_name": "0002.pdf", "output_file_name": "0002_output.pdf"}
{"input_file_name": "0003.pdf", "output_file_name": "0003_output.pdf"}
```

You can also define lists of pdfs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:
You can also define lists of PDFs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:

```jsonl
{"pdfs_file_names": ["0001_part1.pdf", "0001_part2.pdf"], "label": "urgent"}
{"pdfs_file_names": ["0002_part1.pdf", "0002_part2.pdf"], "label": "urgent"}
{"pdfs_file_names": ["0003_part1.pdf", "0002_part2.pdf"], "label": "normal"}
```

### OCR (Optical character recognition)
### OCR (Optical Character Recognition)

OCR datasets have the text contained in a pdf. An example `metadata.csv` may look like:
OCR datasets have the text contained in a PDF. An example `metadata.csv` may look like:

```csv
file_name,text
@@ -106,7 +106,7 @@ file_name,text
0003.pdf,Attention is all you need. Abstract. The ...
```

Load the dataset with `PdfFolder`, and it will create a `text` column for the pdf captions:
Load the dataset with `PdfFolder`, and it will create a `text` column for the PDF captions:

```py
>>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")
4 changes: 2 additions & 2 deletions docs/source/faiss_es.mdx
@@ -22,7 +22,7 @@ FAISS retrieves documents based on the similarity of their vector representation

```py
>>> from datasets import load_dataset
>>> ds = load_dataset('crime_and_punish', split='train[:100]')
>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
>>> ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()})
```
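
A minimal sketch of the indexing and query steps that build on the embeddings computed above, assuming the standard [`Dataset.add_faiss_index`] and [`Dataset.get_nearest_examples`] APIs and a `question_embedding` produced with the same encoder setup:

```py
>>> # Attach a FAISS index to the new embeddings column (sketch, same dataset as above)
>>> ds_with_embeddings.add_faiss_index(column='embeddings')
>>> # Retrieve the closest passages for a query embedding of the same dimensionality
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', question_embedding, k=5)
```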

@@ -62,7 +62,7 @@ FAISS retrieves documents based on the similarity of their vector representation
7. Reload it at a later time with [`Dataset.load_faiss_index`]:

```py
>>> ds = load_dataset('crime_and_punish', split='train[:100]')
>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')
```

4 changes: 2 additions & 2 deletions docs/source/image_load.mdx
@@ -10,7 +10,7 @@ When you load an image dataset and call the image column, the images are decoded
```py
>>> from datasets import load_dataset, Image

>>> dataset = load_dataset("beans", split="train")
>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train")
>>> dataset[0]["image"]
```

@@ -33,7 +33,7 @@ You can load a dataset from the image path. Use the [`~Dataset.cast_column`] fun
If you only want to load the underlying path to the image dataset without decoding the image object, set `decode=False` in the [`Image`] feature:

```py
>>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False))
>>> dataset[0]["image"]
{'bytes': None,
'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/bean_rust/bean_rust_train.29.jpg'}
2 changes: 1 addition & 1 deletion docs/source/loading.mdx
@@ -327,7 +327,7 @@ Select specific rows of the `train` split:
```py
>>> train_10_20_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[10:20]")
===STRINGAPI-READINSTRUCTION-SPLIT===
>>> train_10_20_ds = datasets.load_dataset("bookcorpu", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))
>>> train_10_20_ds = datasets.load_dataset("rojagtap/bookcorpus", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))
```

Or select a percentage of a split with:
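
A minimal sketch of the percent form, assuming the same string-API syntax and dataset used above:

```py
>>> train_10pct_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[:10%]")
```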