Skip to content

[SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary #42850

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

bersprockets
Copy link
Contributor

What changes were proposed in this pull request?

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in OnHeapColumnVector and OffHeapColumnVector to use the dictionary, if present.

Why are the changes needed?

The following query gets incorrect results:

drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

The result should be:

{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}

The cast operation copies the second array by calling ColumnarArray#copy, which in turn calls ColumnarArray#toIntArray, which in turn calls ColumnVector#getInts on the underlying column vector (which is either an OnHeapColumnVector or an OffHeapColumnVector). The implementation of getInts in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:

java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4 
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:       
.f1:         
..list:      
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:         
..list:      
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

The same bug also occurs when field f2 is a map. This PR fixes that case as well.

Does this PR introduce any user-facing change?

No, except for fixing the correctness issue.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

This reverts commit 5a1f7259f0c524f6fc09585da42a9dc44d6e5639.
@github-actions github-actions bot added the SQL label Sep 7, 2023
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sense to me .. but cc @cloud-fan FYI

@wangyum
Copy link
Member

wangyum commented Sep 8, 2023

cc @cloud-fan

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 8, 2023

cc @sunchao, @viirya , too

dongjoon-hyun pushed a commit that referenced this pull request Sep 8, 2023
…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Sep 8, 2023
…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Sep 8, 2023
…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to master/3.5/3.4/3.3.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@sunchao
Copy link
Member

sunchao commented Sep 8, 2023

late LGTM, thanks @bersprockets !

dongjoon-hyun pushed a commit that referenced this pull request Sep 13, 2023
…ector`

### What changes were proposed in this pull request?

This is a small followup of #42850. `getBytes` checks if the `dictionary` is null or not, then call `getByte` which also checks if the `dictionary` is null or not. This PR avoids the repeated if checks by copying one line code from `getByte` to `getBytes`. The same applies to other `getXXX` methods.

### Why are the changes needed?

Make the perf-critical path more efficient.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42903 from cloud-fan/vector.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@bersprockets bersprockets deleted the vector_oddity branch September 14, 2023 16:12
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes apache#42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <[email protected]>
FMX pushed a commit to apache/celeborn that referenced this pull request Jul 11, 2024
… and double column type

### What changes were proposed in this pull request?

`CelebornColumnDictionary` supports dictionary of float and double column type.

### Why are the changes needed?

`CelebornColumnDictionary` only supports dictionary of int, long and string column type at present. It's recommended to support dictionary of float and double column type for columnar shuffle.

Backport apache/spark#42850.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2607 from SteNicholas/CELEBORN-1495.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: mingji <[email protected]>
FMX pushed a commit to apache/celeborn that referenced this pull request Jul 11, 2024
… and double column type

### What changes were proposed in this pull request?

`CelebornColumnDictionary` supports dictionary of float and double column type.

### Why are the changes needed?

`CelebornColumnDictionary` only supports dictionary of int, long and string column type at present. It's recommended to support dictionary of float and double column type for columnar shuffle.

Backport apache/spark#42850.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2607 from SteNicholas/CELEBORN-1495.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: mingji <[email protected]>
(cherry picked from commit 70e3b24)
Signed-off-by: mingji <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants