Spurious label column created when audiofolder/imagefolder directories match split names #7880

@neha222222

Describe the bug

When using audiofolder or imagefolder with directories for splits (train/test) rather than class labels, a spurious label column is incorrectly created.

Example: https://huggingface.co/datasets/datasets-examples/doc-audio-4

from datasets import load_dataset
ds = load_dataset("datasets-examples/doc-audio-4")
print(ds["train"].features)

The output shows a 'label' column with ClassLabel(names=['test', 'train']), which is incorrect.

Root cause

In folder_based_builder.py, the labels set is accumulated across ALL splits (line 77). When directories are train/ and test/:

  • labels = {"train", "test"} → len(labels) > 1 → add_labels = True
  • Spurious label column is created with split names as class labels
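
To illustrate the mechanism, here is a minimal standalone sketch. It is not the actual code from folder_based_builder.py: the helper name collect_labels and the file paths are hypothetical, and it assumes a label is taken from each file's parent directory name.

```python
import os

def collect_labels(files_per_split):
    """Hypothetical helper: gather the parent directory name of every file,
    accumulating into one set shared by all splits."""
    labels = set()
    for files in files_per_split.values():
        for path in files:
            labels.add(os.path.basename(os.path.dirname(path)))
    return labels

# Assumed layout: directories are used only for splits, not for classes.
files_per_split = {
    "train": ["train/a.wav", "train/b.wav"],
    "test": ["test/c.wav"],
}

labels = collect_labels(files_per_split)
# Each split contributes one directory name, so the combined set has two
# entries and the "more than one label" check passes.
add_labels = len(labels) > 1
```

Within a single split there is only one distinct directory name, so the check passes only because the set is shared across splits.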

Expected behavior

No label column should be added when directory names match split names.

Proposed fix

Skip label inference when inferred labels match split names.
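
A rough sketch of what that check could look like, written as a standalone function rather than a patch to the builder. The name should_add_labels and its signature are hypothetical, and "match" is interpreted here as "every inferred label is also a split name", which is an assumption.

```python
def should_add_labels(labels, split_names):
    """Hypothetical check: add a label column only if there is more than one
    inferred label and the labels are not just the split names."""
    labels = set(labels)
    if len(labels) <= 1:
        return False
    # Assumption: treat labels as spurious when all of them are split names.
    if labels <= set(split_names):
        return False
    return True

# Directories used only for splits: no label column.
splits_only = should_add_labels({"train", "test"}, ["train", "test"])
# Directories used for classes: label column is kept.
real_classes = should_add_labels({"cat", "dog"}, ["train", "test"])
```

One caveat with this interpretation: a dataset whose real class names happen to be "train" and "test" would also lose its label column, so the maintainers may prefer a narrower condition.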

cc @lhoestq
