[SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary #42850

bersprockets · 2023-09-07T15:47:07Z

What changes were proposed in this pull request?

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in OnHeapColumnVector and OffHeapColumnVector to use the dictionary, if present.

Why are the changes needed?

The following query gets incorrect results:

drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

The result should be:

{"f1":[1.0,2.0,3.0],"f2":[1,1,2]}

The cast operation copies the second array by calling ColumnarArray#copy, which in turn calls ColumnarArray#toIntArray, which in turn calls ColumnVector#getInts on the underlying column vector (which is either an OnHeapColumnVector or an OffHeapColumnVector). The implementation of getInts in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:

java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4 
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:       
.f1:         
..list:      
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:         
..list:      
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

The same bug also occurs when field f2 is a map. This PR fixes that case as well.
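
The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not Spark's actual code: `SketchIntVector`, its fields, and both bulk-read methods are hypothetical stand-ins, and the dictionary is modeled as a plain `int[]` rather than Spark's dictionary abstraction. It assumes, as the `[0,0,0]` result suggests, that the plain value buffer of a dictionary-encoded vector is never written. The real pre-fix `getInts` also asserted that no dictionary was present, which this sketch omits.

```java
import java.util.Arrays;

public class DictionaryVectorSketch {

    /** Hypothetical, simplified stand-in for an int column vector; not Spark's class. */
    static final class SketchIntVector {
        private final int[] data;          // plain values, meaningful only without a dictionary
        private final int[] dictionaryIds; // per-row index into the dictionary (null if plain)
        private final int[] dictionary;    // null when the column is not dictionary-encoded

        private SketchIntVector(int[] data, int[] dictionaryIds, int[] dictionary) {
            this.data = data;
            this.dictionaryIds = dictionaryIds;
            this.dictionary = dictionary;
        }

        static SketchIntVector plain(int[] values) {
            return new SketchIntVector(values.clone(), null, null);
        }

        static SketchIntVector dictionaryEncoded(int[] dictionary, int[] ids) {
            // The plain buffer is allocated but never written, so it stays all zeros.
            return new SketchIntVector(new int[ids.length], ids.clone(), dictionary.clone());
        }

        /** Single-value read: already dictionary-aware. */
        int getInt(int rowId) {
            return dictionary == null ? data[rowId] : dictionary[dictionaryIds[rowId]];
        }

        /** Models the pre-fix bulk read: copies the plain buffer, ignoring the dictionary. */
        int[] getIntsIgnoringDictionary(int rowId, int count) {
            return Arrays.copyOfRange(data, rowId, rowId + count);
        }

        /** Models the fixed bulk read: decodes through the dictionary when one is present. */
        int[] getInts(int rowId, int count) {
            if (dictionary == null) {
                return Arrays.copyOfRange(data, rowId, rowId + count);
            }
            int[] result = new int[count];
            for (int i = 0; i < count; i++) {
                result[i] = getInt(rowId + i);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        // f2 = array(1, 1, 2), encoded as dictionary {1, 2} with ids {0, 0, 1}.
        SketchIntVector f2 =
            SketchIntVector.dictionaryEncoded(new int[] {1, 2}, new int[] {0, 0, 1});
        System.out.println(Arrays.toString(f2.getIntsIgnoringDictionary(0, 3))); // [0, 0, 0]
        System.out.println(Arrays.toString(f2.getInts(0, 3)));                   // [1, 1, 2]
    }
}
```

In this model the single-value getter was always correct; only the bulk read bypassed the dictionary, which is why the bug surfaces on paths such as `ColumnarArray#copy` that read a whole array at once.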

Does this PR introduce any user-facing change?

No, except for fixing the correctness issue.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

HyukjinKwon

Making sense to me .. but cc @cloud-fan FYI

...re/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala

wangyum · 2023-09-08T14:31:54Z

cc @cloud-fan

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2023-09-08T19:54:39Z

cc @sunchao, @viirya , too

…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}
```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,1,2]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]
```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins
Signed-off-by: Dongjoon Hyun
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun

dongjoon-hyun · 2023-09-08T20:11:55Z

Merged to master/3.5/3.4/3.3.

viirya

Good catch!

sunchao · 2023-09-08T20:21:31Z

late LGTM, thanks @bersprockets !

…ector`

### What changes were proposed in this pull request?

This is a small followup of #42850. `getBytes` checks whether the `dictionary` is null, then calls `getByte`, which performs the same check again. This PR avoids the repeated if checks by copying one line of code from `getByte` into `getBytes`. The same applies to the other `getXXX` methods.

### Why are the changes needed?

Make the perf-critical path more efficient.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42903 from cloud-fan/vector.

Authored-by: Wenchen Fan
Signed-off-by: Dongjoon Hyun
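
The follow-up described in this commit message can be illustrated with a small model. This is a sketch under stated assumptions rather than the actual change in #42903: `SketchIntVector` and both bulk-read methods are hypothetical, and the dictionary is modeled as a plain `int[]`. The first bulk read tests for a dictionary and then calls `getInt`, which tests again for every element; the second copies the decode expression into the loop so the test happens once per call.

```java
import java.util.Arrays;

public class SingleCheckSketch {

    /** Hypothetical, simplified stand-in for an int column vector; not Spark's class. */
    static final class SketchIntVector {
        private final int[] data;          // plain values, meaningful only without a dictionary
        private final int[] dictionaryIds; // per-row index into the dictionary (null if plain)
        private final int[] dictionary;    // null when the column is not dictionary-encoded

        SketchIntVector(int[] data, int[] dictionaryIds, int[] dictionary) {
            this.data = data;
            this.dictionaryIds = dictionaryIds;
            this.dictionary = dictionary;
        }

        int getInt(int rowId) {
            if (dictionary == null) {
                return data[rowId];
            } else {
                return dictionary[dictionaryIds[rowId]];
            }
        }

        /** Checks the dictionary here, then again inside getInt for each element. */
        int[] getIntsViaGetInt(int rowId, int count) {
            if (dictionary == null) {
                return Arrays.copyOfRange(data, rowId, rowId + count);
            }
            int[] result = new int[count];
            for (int i = 0; i < count; i++) {
                result[i] = getInt(rowId + i);
            }
            return result;
        }

        /** Checks the dictionary once; the decode line from getInt is copied into the loop. */
        int[] getInts(int rowId, int count) {
            if (dictionary == null) {
                return Arrays.copyOfRange(data, rowId, rowId + count);
            }
            int[] result = new int[count];
            for (int i = 0; i < count; i++) {
                result[i] = dictionary[dictionaryIds[rowId + i]];
            }
            return result;
        }
    }

    public static void main(String[] args) {
        SketchIntVector v =
            new SketchIntVector(new int[4], new int[] {2, 0, 1, 0}, new int[] {10, 20, 30});
        // Both variants return the same values; only the number of null checks differs.
        System.out.println(Arrays.toString(v.getIntsViaGetInt(0, 4)));
        System.out.println(Arrays.toString(v.getInts(0, 4)));
    }
}
```

Whether the extra branch is measurable depends on the JIT and the workload; the sketch only shows the structural difference the commit message describes.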

… and double column type

### What changes were proposed in this pull request?

`CelebornColumnDictionary` supports dictionary of float and double column type.

### Why are the changes needed?

`CelebornColumnDictionary` only supports dictionary of int, long and string column type at present. It's recommended to support dictionary of float and double column type for columnar shuffle. Backport apache/spark#42850.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2607 from SteNicholas/CELEBORN-1495.

Authored-by: SteNicholas
Signed-off-by: mingji

… and double column type

### What changes were proposed in this pull request?

`CelebornColumnDictionary` supports dictionary of float and double column type.

### Why are the changes needed?

`CelebornColumnDictionary` only supports dictionary of int, long and string column type at present. It's recommended to support dictionary of float and double column type for columnar shuffle. Backport apache/spark#42850.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2607 from SteNicholas/CELEBORN-1495.

Authored-by: SteNicholas
Signed-off-by: mingji
(cherry picked from commit 70e3b24)
Signed-off-by: mingji

bersprockets added 5 commits September 6, 2023 17:54

testing

9199bad

Revert "testing"

e7ae684

This reverts commit 5a1f7259f0c524f6fc09585da42a9dc44d6e5639.

Add failing tests

fec0e0c

fix

7d0f1f0

Bug fix to test

a481bea

github-actions bot added the SQL label Sep 7, 2023

HyukjinKwon approved these changes Sep 7, 2023

wangyum reviewed Sep 8, 2023

...re/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala (outdated)

Review feedback

264b776

wangyum approved these changes Sep 8, 2023

dongjoon-hyun approved these changes Sep 8, 2023

dongjoon-hyun closed this in fac236e Sep 8, 2023

viirya reviewed Sep 8, 2023

cloud-fan mentioned this pull request Sep 13, 2023

[SPARK-45157][SQL] Avoid repeated if checks in [On|Off]HeapColumnVector #42903

Closed

bersprockets deleted the vector_oddity branch September 14, 2023 16:12

SteNicholas mentioned this pull request Jul 9, 2024

[CELEBORN-1495] CelebornColumnDictionary supports dictionary of float and double column type apache/celeborn#2607

Closed
