Add Grain support to `image_dataset_from_directory` and `text_dataset_from_directory` #21593

james77777778 · 2025-08-17T09:02:15Z

This PR adds Grain support to `image_dataset_from_directory` and `text_dataset_from_directory` by adding `format="grain"` as an option.

Note that with `format="grain"` the text is returned as Python strings, since only TF supports string tensors.
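A minimal usage sketch of the new option, assuming a Keras build that includes this PR and an installed `grain` package. The helper name `load_grain_datasets`, the directory arguments, and the `image_size` value are hypothetical and only for illustration; the `format="grain"` argument is the one this PR adds.

```python
def load_grain_datasets(image_dir, text_dir, batch_size=32):
    """Build Grain-backed datasets from class-per-subfolder directories.

    `image_dir` and `text_dir` are hypothetical paths supplied by the caller.
    """
    # Imported lazily so this sketch can be imported without Keras installed.
    import keras

    images = keras.utils.image_dataset_from_directory(
        image_dir,
        image_size=(180, 180),  # illustrative value, not taken from the PR
        batch_size=batch_size,
        format="grain",
    )
    texts = keras.utils.text_dataset_from_directory(
        text_dir,
        batch_size=batch_size,
        format="grain",  # text samples come back as Python strings
    )
    return images, texts
```

Iterating over the returned text dataset should yield batches whose text elements are Python strings rather than string tensors, as noted above.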

Here is a demo colab:
https://colab.research.google.com/drive/13RNE_KTbc13Shp5c1uIKzCbo-Ztfe5Ik?usp=sharing

cc @divyashreepathihalli @mattdangerw

gemini-code-assist

Summary of Changes

Hello @james77777778, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data loading capabilities within Keras by integrating Grain support into the image_dataset_from_directory and text_dataset_from_directory utilities. Users can now specify format="grain" to leverage Grain's efficient data processing pipeline, offering an alternative to the default TensorFlow dataset format. This change involves refactoring core dataset utilities to be backend-agnostic, introducing new Grain-specific helpers, and updating file system handling to support diverse storage locations, ultimately providing greater flexibility and performance for data ingestion in Keras workflows.

Highlights

- Added `format` argument to dataset utilities: The `image_dataset_from_directory` and `text_dataset_from_directory` utilities now accept a new `format` argument. This allows users to choose between `tf.data.Dataset` (default) and `grain.IterDataset` for their data loading needs.
- New `grain_utils` module for Grain-specific utilities: A new module, `grain_utils.py`, has been introduced to house helper functions specifically for handling Grain datasets, including `make_batch` and `make_string_batch` for efficient data batching.
- Improved file system handling for local and remote paths: The internal logic for handling file paths in `dataset_utils.py` has been enhanced to dynamically use `os` or `tf.io.gfile` based on the path prefix. This enables seamless support for both local and remote file systems (e.g., Google Cloud Storage, HDFS).
- Refactored data loading and label processing for multi-backend support: The data loading and label processing functions within `image_dataset_utils.py` and `text_dataset_utils.py` have been refactored to support both TensorFlow and Grain backends, ensuring compatibility and flexibility.
- Expanded test coverage for new `format` argument: Test suites for both image and text dataset utilities have been updated to include parameterized tests, thoroughly validating the new `format` argument across different scenarios and ensuring correct behavior for both TensorFlow and Grain outputs.
- Grain text data returned as Python strings: When using `format="grain"` with `text_dataset_from_directory`, the text data will be returned as standard Python strings, as Grain does not natively support TensorFlow string tensors.
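To illustrate what batching Python strings (rather than string tensors) can look like, here is a small pure-Python sketch. It is not the `make_string_batch` implementation from `grain_utils.py`; the names `collate_text_batch` and `iter_batches` are hypothetical, and labels are kept as plain lists for simplicity.

```python
from itertools import islice


def collate_text_batch(samples):
    """Turn [(text, label), ...] into ([text, ...], [label, ...])."""
    texts = [text for text, _ in samples]
    labels = [label for _, label in samples]
    return texts, labels


def iter_batches(samples, batch_size):
    """Yield collated batches of at most `batch_size` samples."""
    iterator = iter(samples)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield collate_text_batch(chunk)


# Three samples with batch_size=2 give one full batch and one partial batch.
batches = list(iter_batches([("good", 1), ("bad", 0), ("fine", 1)], batch_size=2))
```

The text side of each batch stays a list of `str` objects, which is the behavior the last highlight describes for `format="grain"`.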

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	`@gemini-code-assist`	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request adds support for the grain data format to image_dataset_from_directory and text_dataset_from_directory by introducing a format='grain' option. The implementation is well-structured, creating parallel functions for grain and tensorflow data loading paths. The changes also include a refactoring in dataset_utils.py to handle different filesystems (local vs. GCS/HDFS) in a more generic way. My feedback includes a couple of suggestions for further refactoring to improve code clarity, performance, and maintainability by reducing code duplication and leveraging existing utilities.
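The filesystem handling mentioned in this review can be pictured with the following sketch, which picks `os` or `tf.io.gfile` from the path prefix. This is an illustration under assumptions, not the code in `dataset_utils.py`: the helper names and the exact prefix list (`gs://`, `hdfs://`) are hypothetical.

```python
import os

# Assumed set of remote prefixes; the real utility may recognize others.
REMOTE_PREFIXES = ("gs://", "hdfs://")


def is_remote_path(path):
    """Return True if `path` looks like a remote (non-local) location."""
    return str(path).startswith(REMOTE_PREFIXES)


def list_directory(path):
    """List entries of `path`, using tf.io.gfile only for remote paths."""
    if is_remote_path(path):
        # Imported lazily so local-only use does not require TensorFlow.
        import tensorflow as tf

        return sorted(tf.io.gfile.listdir(path))
    return sorted(os.listdir(path))
```

Keeping the TensorFlow import inside the remote branch means a local-only pipeline never needs TensorFlow at all.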

keras/src/utils/image_dataset_utils_test.py

keras/src/utils/dataset_utils.py

keras/src/utils/image_dataset_utils.py

codecov-commenter · 2025-08-17T14:13:20Z

Codecov Report

❌ Patch coverage is 73.26733% with 54 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@89a8676). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
keras/src/utils/image_dataset_utils.py	65.55%	17 Missing and 14 partials ⚠️
keras/src/utils/dataset_utils.py	65.85%	11 Missing and 3 partials ⚠️
keras/src/utils/text_dataset_utils.py	89.36%	1 Missing and 4 partials ⚠️
keras/src/utils/grain_utils.py	81.81%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #21593   +/-   ##
=========================================
  Coverage          ?   82.70%           
=========================================
  Files             ?      568           
  Lines             ?    56881           
  Branches          ?     8889           
=========================================
  Hits              ?    47045           
  Misses            ?     7642           
  Partials          ?     2194
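As a quick check on the numbers above, coverage can be recomputed as hits / (hits + misses + partials). The sketch below assumes percentages are truncated to two decimals, which is inferred from the reported figures rather than stated in the report; the function name is hypothetical.

```python
def coverage_percent(hits, misses, partials):
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    total = hits + misses + partials
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (hits * 10000 // total) / 100


# Project totals from the coverage diff.
project_lines = 47045 + 7642 + 2194
project = coverage_percent(hits=47045, misses=7642, partials=2194)

# Uncovered patch lines per file: "Missing" plus "partials" from the table.
uncovered_by_file = {
    "keras/src/utils/image_dataset_utils.py": 17 + 14,
    "keras/src/utils/dataset_utils.py": 11 + 3,
    "keras/src/utils/text_dataset_utils.py": 1 + 4,
    "keras/src/utils/grain_utils.py": 2 + 2,
}
patch_uncovered = sum(uncovered_by_file.values())
```

Here `project_lines` matches the 56881 lines in the diff, `project` matches the reported 82.70%, and `patch_uncovered` matches the 54 lines flagged as missing coverage.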

Flag	Coverage Δ
keras	`82.51% <71.78%> (?)`
keras-jax	`63.65% <69.30%> (?)`
keras-numpy	`58.24% <69.30%> (?)`
keras-openvino	`34.55% <13.86%> (?)`
keras-tensorflow	`64.21% <71.28%> (?)`
keras-torch	`63.80% <69.80%> (?)`

fchollet

Awesome work! The code looks good to me.

google-ml-butler bot added the size:XL label Aug 17, 2025

google-ml-butler bot assigned gbaned Aug 17, 2025

gemini-code-assist bot reviewed Aug 17, 2025

keras/src/utils/image_dataset_utils_test.py (resolved)

keras/src/utils/dataset_utils.py (outdated, resolved)

keras/src/utils/image_dataset_utils.py (outdated, resolved)

Add Grain support to `image_dataset_from_directory` and `text_dataset_from_directory`.

4b791ab

james77777778 force-pushed the add-grain-support-in-dataset-utils branch from 4196615 to 4b791ab Compare August 17, 2025 09:26

james77777778 added 2 commits August 17, 2025 18:03

Fix channels_first bug.

17802a0

Refine the docstrings.

3d3644c

gbaned requested a review from mattdangerw August 18, 2025 07:22

google-ml-butler bot added the awaiting review label Aug 18, 2025

gbaned added this to PR Queue Aug 18, 2025

github-project-automation bot moved this to Assigned Reviewer in PR Queue Aug 18, 2025

fchollet approved these changes Aug 18, 2025

google-ml-butler bot added the kokoro:force-run and ready to pull labels Aug 18, 2025

github-project-automation bot moved this from Assigned Reviewer to Approved by Reviewer in PR Queue Aug 18, 2025

kokoro-team removed the kokoro:force-run label Aug 18, 2025

fchollet merged commit 7da416d into keras-team:master Aug 19, 2025
11 checks passed

google-ml-butler bot removed the awaiting review and ready to pull labels Aug 19, 2025

github-project-automation bot moved this from Approved by Reviewer to Merged in PR Queue Aug 19, 2025

james77777778 deleted the add-grain-support-in-dataset-utils branch August 19, 2025 23:40

ryantqiu pushed a commit to snorkel-marlin-repos/keras_pr_21593_full that referenced this pull request Oct 1, 2025

Add Grain support to `image_dataset_from_directory` and `text_dataset_from_directory`

a50000d

Original PR #21593 by james77777778 Original: keras-team/keras#21593

ryantqiu mentioned this pull request Oct 1, 2025

Add Grain support to image_dataset_from_directory and text_dataset_from_directory snorkel-marlin-repos/keras_pr_21593_full#1

Merged

ryantqiu pushed a commit to snorkel-marlin-repos/keras_pr_21593_full that referenced this pull request Oct 1, 2025

Add Grain support to `image_dataset_from_directory` and `text_dataset_from_directory`

51c6d1e

Original PR #21593 by james77777778 Original: keras-team/keras#21593

ryantqiu mentioned this pull request Oct 1, 2025

Add Grain support to image_dataset_from_directory and text_dataset_from_directory snorkel-marlin-repos/keras_pr_21593_full#4

Merged

ryantqiu mentioned this pull request Oct 1, 2025

Add Grain support to image_dataset_from_directory and text_dataset_from_directory snorkel-marlin-repos/keras-team_keras_pr_pull_b81b4498-b708-464a-aafa-c17a5475abd5#1

Merged

ryantqiu mentioned this pull request Oct 1, 2025

Add Grain support to image_dataset_from_directory and text_dataset_from_directory snorkel-marlin-repos/keras-team_keras_pr_pull_8942b1a9-b298-4746-ba06-669f05a08223#1

Merged


5 participants
