Commit 9254492
[SPARK-22666][ML][SQL] Spark datasource for image format
## What changes were proposed in this pull request?
Implement an image schema datasource.
This image datasource supports:
- partition discovery (loading partitioned images)
- dropImageFailures (same behavior as `ImageSchema.readImage`)
- path wildcard matching (same behavior as `ImageSchema.readImage`)
- loading recursively from a directory (unlike `ImageSchema.readImage`, this is done with a path such as `/path/to/dir/**`)
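The supported features above can be sketched as a small PySpark helper. This is only an illustration under stated assumptions: the format name `"image"` and the option key `"dropImageFailures"` are inferred from this description, the helper name `load_images` is hypothetical, and the option key accepted by a given Spark release should be checked against its documentation.

```python
def load_images(spark, path, drop_image_failures=False):
    """Load images through the image datasource (hypothetical helper).

    `spark` is expected to be an active SparkSession. `path` may be a plain
    directory, a wildcard pattern, or a recursive pattern such as
    "/path/to/dir/**", as described in the list above.
    """
    # Assumption: the datasource is registered under the short name "image"
    # and reads a boolean-like "dropImageFailures" option.
    flag = "true" if drop_image_failures else "false"
    return (
        spark.read.format("image")
        .option("dropImageFailures", flag)
        .load(path)
    )
```

With an active session, `load_images(spark, "/path/to/dir/**", drop_image_failures=True)` would load a directory recursively and skip files that fail to decode. Pointing the same call at a partitioned layout such as `data/mllib/images/partitioned` should, through partition discovery, expose the `cls` and `date` directory keys as columns.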
This datasource does **NOT** support:
- specifying `numPartitions` (it is determined automatically by the datasource)
- sampling (you can call `df.sample` afterwards, but the sampling operator won't be pushed down to the datasource)
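Because neither feature is handled by the datasource itself, both would have to be applied to the loaded DataFrame. A minimal sketch, assuming `df` is the DataFrame returned by the image datasource; the helper name `sample_images` is hypothetical, and using `repartition` as a stand-in for `numPartitions` is a suggestion here, not something this PR prescribes.

```python
def sample_images(df, fraction, seed=None, num_partitions=None):
    """Sample an already-loaded image DataFrame (hypothetical helper).

    The sampling runs after the scan: all image files are still read,
    since the sampling operator is not pushed down to the datasource.
    """
    sampled = df.sample(fraction=fraction, seed=seed)
    if num_partitions is not None:
        # Assumption: an explicit partition count is only needed downstream,
        # so it is applied after loading rather than during the read.
        sampled = sampled.repartition(num_partitions)
    return sampled
```

For example, `sample_images(df, fraction=0.1, seed=42, num_partitions=16)` would keep roughly 10% of the rows and shuffle them into 16 partitions, at the cost of an extra shuffle.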
## How was this patch tested?
Unit tests.
## Benchmark
I benchmarked the old `ImageSchema.read` API against my image datasource and compared the elapsed time.
**cluster**: 4 nodes, each with 64GB memory and an 8-core CPU
**test dataset**: Flickr8k_Dataset (about 8091 images)
**time cost**:
- My image datasource time (automatically generate 258 partitions): 38.04s
- `ImageSchema.read` time (set 16 partitions): 68.4s
- `ImageSchema.read` time (set 258 partitions): 90.6s
**time cost when doubling the number of images (cloning Flickr8k_Dataset and loading twice as many images)**:
- My image datasource time (automatically generate 515 partitions): 95.4s
- `ImageSchema.read` (set 32 partitions): 109s
- `ImageSchema.read` (set 515 partitions): 105s
So we can see that my image datasource implementation (this PR) brings some performance improvement compared with the old `ImageSchema.read` API.
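To put a number on the improvement, the timings listed above can be turned into speedup ratios. The sketch below only does arithmetic on the figures reported in this description; the variable names are illustrative.

```python
# Reported wall-clock times in seconds, copied from the benchmark above.
datasource_1x = 38.04          # image datasource, 258 partitions
datasource_2x = 95.4           # image datasource, 515 partitions (doubled dataset)
old_api = {
    "1x, 16 partitions": (68.4, datasource_1x),
    "1x, 258 partitions": (90.6, datasource_1x),
    "2x, 32 partitions": (109.0, datasource_2x),
    "2x, 515 partitions": (105.0, datasource_2x),
}

# Speedup = old API time / datasource time on the same dataset size.
speedups = {
    label: round(old_time / new_time, 2)
    for label, (old_time, new_time) in old_api.items()
}

for label, ratio in speedups.items():
    print(f"ImageSchema.read ({label}): {ratio}x slower than the datasource")
```

On these figures the datasource is about 1.8x to 2.4x faster on the original dataset, while the gap narrows to roughly 1.1x when the dataset is doubled.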
Closes #22328 from WeichenXu123/image_datasource.
Authored-by: WeichenXu <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
1 parent c66eef8 commit 9254492
File tree
27 files changed, +323 -4 lines changed
- data/mllib/images
  - origin
    - kittens
    - multi-channel
  - partitioned
    - cls=kittens
      - date=2018-01
      - date=2018-02
    - cls=multichannel
      - date=2018-01
      - date=2018-02
- mllib/src
  - main
    - resources/META-INF/services
    - scala/org/apache/spark/ml/source/image
  - test/scala/org/apache/spark/ml
    - image
    - source/image
- python/pyspark/ml