52 commits
4e18df1
Add parquet scan options and docs (#7801)
lhoestq Oct 9, 2025
cfcdfce
More Parquet streaming docs (#7803)
lhoestq Oct 9, 2025
02ee330
Less api calls when resolving data_files (#7805)
lhoestq Oct 9, 2025
5eec91a
Parquet: add `on_bad_file` argument to error/warn/skip bad files (#7806)
lhoestq Oct 9, 2025
fd8d287
typo (#7807)
lhoestq Oct 9, 2025
7e1350b
release: 4.2.0 (#7808)
lhoestq Oct 9, 2025
f25661f
Set dev version (#7809)
lhoestq Oct 9, 2025
88d53e2
fix conda deps (#7810)
lhoestq Oct 9, 2025
63c933a
Add pyarrow's binary view to features (#7795)
delta003 Oct 10, 2025
aa7f2a9
Fix polars cast column image (#7800)
CloseChoice Oct 13, 2025
3e13d30
Allow streaming hdf5 files (#7814)
lhoestq Oct 13, 2025
12f5aca
Retry open hf file (#7822)
lhoestq Oct 17, 2025
0b2a4c2
Keep hffs cache in workers when streaming (#7820)
lhoestq Oct 17, 2025
74c7154
Fix batch_size default description in to_polars docstrings (#7824)
albertvillanova Oct 20, 2025
fb445ff
docs: document_dataset PDFs & OCR (#7812)
ethanknights Oct 20, 2025
d10e846
Add custom fingerprint support to `from_generator` (#7533)
simonreise Oct 23, 2025
9332649
picklable batch_fn (#7826)
lhoestq Oct 23, 2025
41c0529
release: 4.3.0 (#7827)
lhoestq Oct 23, 2025
159a645
set dev version (#7828)
lhoestq Oct 23, 2025
5138876
Add nifti support (#7815)
CloseChoice Oct 24, 2025
a7600ac
Fix random seed on shuffle and interleave_datasets (#7823)
CloseChoice Oct 24, 2025
6d985d9
fix ci compressionfs (#7830)
lhoestq Oct 24, 2025
f7c8e46
fix: better args passthrough for `_batch_setitems()` (#7817)
sghng Oct 27, 2025
627ed2e
Fix: Properly render [!TIP] block in stream.shuffle documentation (#7…
art-test-stack Oct 28, 2025
9e5b0e6
resolves the ValueError: Unable to avoid copy while creating an array…
ArjunJagdale Oct 28, 2025
8b1bd4e
Python 3.14 (#7836)
lhoestq Oct 31, 2025
0e7c6ca
Add num channels to audio (#7840)
CloseChoice Nov 3, 2025
03c16ec
fix column with transform (#7843)
lhoestq Nov 3, 2025
fc7f97c
support fsspec 2025.10.0 (#7844)
lhoestq Nov 3, 2025
232cb10
Release: 4.4.0 (#7845)
lhoestq Nov 4, 2025
5cb2925
set dev version (#7846)
lhoestq Nov 4, 2025
f2f58b3
Better streaming retries (504 and 429) (#7847)
lhoestq Nov 4, 2025
d32a1f7
DOC: remove mode parameter in docstring of pdf and video feature (#7848)
CloseChoice Nov 5, 2025
6a6983a
release: 4.4.1 (#7849)
lhoestq Nov 5, 2025
91f96a0
dev version (#7850)
lhoestq Nov 5, 2025
3356d74
Fix embed storage nifti (#7853)
CloseChoice Nov 6, 2025
cf647ab
ArXiv -> HF Papers (#7855)
qgallouedec Nov 10, 2025
17f40a3
fix some broken links (#7859)
julien-c Nov 10, 2025
c97e757
Nifti visualization support (#7874)
CloseChoice Nov 21, 2025
004a5bf
Replace papaya with niivue (#7878)
CloseChoice Nov 27, 2025
b8291fc
feat(bids): add pybids optional dependency and config check
The-Obstacle-Is-The-Way Nov 29, 2025
ea34394
test(bids): add synthetic BIDS dataset fixtures
The-Obstacle-Is-The-Way Nov 29, 2025
f441822
feat(bids): implement basic BIDS loader module
The-Obstacle-Is-The-Way Nov 29, 2025
d06fcd0
fix(test): repair syntax in BIDS test fixture
The-Obstacle-Is-The-Way Nov 29, 2025
34be5a4
fix(test): handle Bids init exception
The-Obstacle-Is-The-Way Nov 29, 2025
67a6b6b
feat(bids): add subject/session/datatype filtering
The-Obstacle-Is-The-Way Nov 29, 2025
2305c2a
test(bids): add multi-subject filtering test
The-Obstacle-Is-The-Way Nov 29, 2025
962ee8b
feat(bids): add validation and error handling
The-Obstacle-Is-The-Way Nov 29, 2025
6425d93
docs(bids): add BIDS loading guide
The-Obstacle-Is-The-Way Nov 29, 2025
b748207
fix(bids): lint and format fixes, remove deprecated trust_remote_code
The-Obstacle-Is-The-Way Nov 29, 2025
bc5a3fd
fix(bids): apply CodeRabbit feedback
The-Obstacle-Is-The-Way Nov 29, 2025
fda30c3
chore: trigger CI
The-Obstacle-Is-The-Way Nov 29, 2025
6 changes: 4 additions & 2 deletions .github/conda/meta.yaml
@@ -20,11 +20,12 @@ requirements:
- dill
- pandas
- requests >=2.19.0
- httpx <1.0.0
- tqdm >=4.66.3
- dataclasses
- multiprocess
- fsspec
- huggingface_hub >=0.24.0,<1.0.0
- huggingface_hub >=0.25.0,<2.0.0
- packaging
run:
- python
@@ -35,11 +36,12 @@ requirements:
- dill
- pandas
- requests >=2.19.0
- httpx <1.0.0
- tqdm >=4.66.3
- dataclasses
- multiprocess
- fsspec
- huggingface_hub >=0.24.0,<1.0.0
- huggingface_hub >=0.25.0,<2.0.0
- packaging

test:
16 changes: 8 additions & 8 deletions .github/workflows/ci.yml
@@ -82,7 +82,7 @@ jobs:
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/

test_py312:
test_py314:
needs: check_code_quality
strategy:
matrix:
@@ -100,18 +100,18 @@
run: |
sudo apt update
sudo apt install -y ffmpeg
- name: Set up Python 3.12
- name: Set up Python 3.14
uses: actions/setup-python@v5
with:
python-version: "3.12"
python-version: "3.14"
- name: Setup conda env (windows)
if: ${{ matrix.os == 'windows-latest' }}
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
miniconda-version: "latest"
activate-environment: test
python-version: "3.12"
python-version: "3.14"
- name: Setup FFmpeg (windows)
if: ${{ matrix.os == 'windows-latest' }}
run: conda install "ffmpeg=7.0.1" -c conda-forge
@@ -127,7 +127,7 @@
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/

test_py312_future:
test_py314_future:
needs: check_code_quality
strategy:
matrix:
@@ -145,18 +145,18 @@
run: |
sudo apt update
sudo apt install -y ffmpeg
- name: Set up Python 3.12
- name: Set up Python 3.14
uses: actions/setup-python@v5
with:
python-version: "3.12"
python-version: "3.14"
- name: Setup conda env (windows)
if: ${{ matrix.os == 'windows-latest' }}
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
miniconda-version: "latest"
activate-environment: test
python-version: "3.12"
python-version: "3.14"
- name: Setup FFmpeg (windows)
if: ${{ matrix.os == 'windows-latest' }}
run: conda install "ffmpeg=7.0.1" -c conda-forge
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -120,7 +120,7 @@ If you are a **dataset author**... you know what to do, it is your dataset after

If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.

Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://arxiv.org/abs/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).
Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://huggingface.co/papers/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).

Thank you for your contribution!

2 changes: 1 addition & 1 deletion README.md
@@ -136,7 +136,7 @@ If you're a dataset owner and wish to update any part of it (description, citati

## BibTeX

If you want to cite our 🤗 Datasets library, you can use our [paper](https://arxiv.org/abs/2109.02846):
If you want to cite our 🤗 Datasets library, you can use our [paper](https://huggingface.co/papers/2109.02846):

```bibtex
@inproceedings{lhoest-etal-2021-datasets,
4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
@@ -88,6 +88,10 @@
title: Load document data
- local: document_dataset
title: Create a document dataset
- local: nifti_dataset
title: Create a medical imaging dataset
- local: bids_dataset
title: Load a BIDS dataset
title: "Vision"
- sections:
- local: nlp_load
63 changes: 63 additions & 0 deletions docs/source/bids_dataset.mdx
@@ -0,0 +1,63 @@
# BIDS Dataset

[BIDS (Brain Imaging Data Structure)](https://bids.neuroimaging.io/) is a standard for organizing and describing neuroimaging and behavioral data. The `datasets` library supports loading BIDS datasets directly, leveraging `pybids` for parsing and `nibabel` for handling NIfTI files.

<Tip>

To use the BIDS loader, you need to install the `bids` extra (which installs `pybids` and `nibabel`):

```bash
pip install datasets[bids]
```

</Tip>

## Loading a BIDS Dataset

You can load a BIDS dataset by pointing to its root directory (containing `dataset_description.json`):

```python
from datasets import load_dataset

# Load a local BIDS dataset
ds = load_dataset("bids", data_dir="/path/to/bids/dataset")

# Access the first example
print(ds["train"][0])
# {
# 'subject': '01',
# 'session': 'baseline',
# 'datatype': 'anat',
# 'suffix': 'T1w',
# 'nifti': <nibabel.nifti1.Nifti1Image>,
# ...
# }
```

The `nifti` column contains `nibabel` image objects, which can be visualized interactively in Jupyter notebooks.
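
A minimal sketch of working with a loaded image, assuming the `nifti` column holds standard `nibabel` image objects as described above:

```python
example = ds["train"][0]
img = example["nifti"]            # a nibabel image object (e.g. Nifti1Image)

data = img.get_fdata()            # voxel data as a NumPy array
print(data.shape)                 # e.g. (x, y, z) or (x, y, z, t)
print(img.affine)                 # 4x4 voxel-to-world affine
print(img.header.get_zooms())     # voxel sizes in mm (plus TR for 4D images)
```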

## Filtering

You can filter the dataset by BIDS entities like `subject`, `session`, and `datatype` when loading:

```python
# Load only specific subjects and datatypes
ds = load_dataset(
"bids",
data_dir="/path/to/bids/dataset",
subjects=["01", "05", "10"],
sessions=["pre", "post"],
datatypes=["func"],
)
```

## Metadata

BIDS datasets often include JSON sidecar files with metadata (e.g., scanner parameters). This metadata is loaded into the `metadata` column as a JSON string.

```python
import json

metadata = json.loads(ds["train"][0]["metadata"])
print(metadata["RepetitionTime"])
```
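
Once loaded, the usual `datasets` methods apply. For example, here is a small illustrative sketch (assuming the `datatype` and `suffix` columns shown above) that keeps only T1-weighted anatomical scans:

```python
# Illustrative only: relies on the `datatype` and `suffix` columns shown above.
t1w = ds["train"].filter(
    lambda example: example["datatype"] == "anat" and example["suffix"] == "T1w"
)
print(len(t1w))
```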
4 changes: 2 additions & 2 deletions docs/source/dataset_card.mdx
@@ -1,7 +1,7 @@
# Create a dataset card

Each dataset should have a dataset card to promote responsible usage and inform users of any potential biases within the dataset.
This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://arxiv.org/abs/1810.03993).
This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://huggingface.co/papers/1810.03993).
Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.

Creating a dataset card is easy and can be done in just a few steps:
@@ -24,4 +24,4 @@ Creating a dataset card is easy and can be done in just a few steps:

YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.

Feel free to take a look at the [SNLI](https://huggingface.co/datasets/snli), [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/allocine) dataset cards as examples to help you get started.
Feel free to take a look at the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli), [CNN/DailyMail](https://huggingface.co/datasets/abisee/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/tblard/allocine) dataset cards as examples to help you get started.
20 changes: 10 additions & 10 deletions docs/source/document_dataset.mdx
@@ -1,13 +1,13 @@
# Create a document dataset

This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.
This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand PDFs.

> [!TIP]
> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

## PdfFolder

The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand PDFs without requiring you to write any code.

> [!TIP]
> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
@@ -72,32 +72,32 @@ file_name,additional_feature
or using `metadata.jsonl`:

```jsonl
{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your pdfs"}
{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your pdfs"}
{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your pdfs"}
{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your PDFs"}
{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your PDFs"}
{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your PDFs"}
```

Here the `file_name` must be the name of the PDF file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the PDF file.

It's possible to point to more than one pdf in each row in your dataset, for example if both your input and output are pdfs:
It's possible to point to more than one PDF in each row in your dataset, for example if both your input and output are pdfs:

```jsonl
{"input_file_name": "0001.pdf", "output_file_name": "0001_output.pdf"}
{"input_file_name": "0002.pdf", "output_file_name": "0002_output.pdf"}
{"input_file_name": "0003.pdf", "output_file_name": "0003_output.pdf"}
```

You can also define lists of pdfs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:
You can also define lists of PDFs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:

```jsonl
{"pdfs_file_names": ["0001_part1.pdf", "0001_part2.pdf"], "label": "urgent"}
{"pdfs_file_names": ["0002_part1.pdf", "0002_part2.pdf"], "label": "urgent"}
{"pdfs_file_names": ["0003_part1.pdf", "0002_part2.pdf"], "label": "normal"}
```

### OCR (Optical character recognition)
### OCR (Optical Character Recognition)

OCR datasets have the text contained in a pdf. An example `metadata.csv` may look like:
OCR datasets have the text contained in a PDF. An example `metadata.csv` may look like:

```csv
file_name,text
@@ -106,7 +106,7 @@ file_name,text
0003.pdf,Attention is all you need. Abstract. The ...
```

Load the dataset with `PdfFolder`, and it will create a `text` column for the pdf captions:
Load the dataset with `PdfFolder`, and it will create a `text` column for the PDF captions:

```py
>>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")
4 changes: 2 additions & 2 deletions docs/source/faiss_es.mdx
@@ -22,7 +22,7 @@ FAISS retrieves documents based on the similarity of their vector representation

```py
>>> from datasets import load_dataset
>>> ds = load_dataset('crime_and_punish', split='train[:100]')
>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
>>> ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()})
```
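
A minimal sketch of the indexing and query steps that build on the embeddings computed above, assuming the standard [`Dataset.add_faiss_index`] and [`Dataset.get_nearest_examples`] APIs and a `question_embedding` produced with the same encoder setup:

```py
>>> # Attach a FAISS index to the new embeddings column (sketch, same dataset as above)
>>> ds_with_embeddings.add_faiss_index(column='embeddings')
>>> # Retrieve the closest passages for a query embedding of the same dimensionality
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', question_embedding, k=5)
```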

@@ -62,7 +62,7 @@ FAISS retrieves documents based on the similarity of their vector representation
7. Reload it at a later time with [`Dataset.load_faiss_index`]:

```py
>>> ds = load_dataset('crime_and_punish', split='train[:100]')
>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')
```

4 changes: 2 additions & 2 deletions docs/source/image_load.mdx
@@ -10,7 +10,7 @@ When you load an image dataset and call the image column, the images are decoded
```py
>>> from datasets import load_dataset, Image

>>> dataset = load_dataset("beans", split="train")
>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train")
>>> dataset[0]["image"]
```

@@ -33,7 +33,7 @@ You can load a dataset from the image path. Use the [`~Dataset.cast_column`] fun
If you only want to load the underlying path to the image dataset without decoding the image object, set `decode=False` in the [`Image`] feature:

```py
>>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False))
>>> dataset[0]["image"]
{'bytes': None,
'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/bean_rust/bean_rust_train.29.jpg'}
2 changes: 1 addition & 1 deletion docs/source/loading.mdx
@@ -327,7 +327,7 @@ Select specific rows of the `train` split:
```py
>>> train_10_20_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[10:20]")
===STRINGAPI-READINSTRUCTION-SPLIT===
>>> train_10_20_ds = datasets.load_dataset("bookcorpu", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))
>>> train_10_20_ds = datasets.load_dataset("rojagtap/bookcorpus", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))
```

Or select a percentage of a split with:
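
A minimal sketch of the percent form, assuming the same string-API syntax and dataset used above:

```py
>>> train_10pct_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[:10%]")
```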