-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Describe the bug
When using array_to_string
on a nested list, it doesn't properly check the element is null when doing conversion.
To Reproduce
Example SLT case:
datafusion/datafusion/sqllogictest/test_files/array.slt
Lines 4833 to 4836 in d55fb6d
query T | |
select array_to_string(arrow_cast([arrow_cast([NULL, 'a'], 'FixedSizeList(2, Utf8)'), NULL], 'FixedSizeList(2, FixedSizeList(2, Utf8))'), ',', '-'); | |
---- | |
-,a,-,- |
Here the input list looks like [[NULL, 'a'], NULL]
but the output string is -,a,-,-
, where -
is the replacement for nulls; we can see the output string has three nulls whereas our list only had two.
Another example are these Rust test cases:
// In datafusion/functions-nested/src/string.rs
#[cfg(test)]
mod tests {
use super::array_to_string_inner;
use arrow::array::{
Array, ArrayRef, AsArray, FixedSizeListArray, Int64Array, ListArray, StringArray,
};
use arrow::buffer::{NullBuffer, OffsetBuffer};
use arrow::datatypes::{DataType, Field};
use datafusion_common::Result;
use std::sync::Arc;
#[test]
fn test_list() -> Result<()> {
// Offsets + Values looks like this: [[1, NULL], [3, 4], [5, 6]]
// But with NullBuffer it's like this: [[1, NULL], NULL, [5, 6]]
let list_inner = Arc::new(ListArray::new(
Field::new_list_field(DataType::Int64, true).into(),
OffsetBuffer::new(vec![0, 2, 4, 6].into()),
Arc::new(Int64Array::from(vec![
Some(1),
None,
Some(3),
Some(4),
Some(5),
Some(6),
])),
Some(NullBuffer::from(vec![true, false, true])),
));
let list_outer = Arc::new(ListArray::new(
Field::new_list_field(list_inner.data_type().clone(), true).into(),
OffsetBuffer::new(vec![0, 3].into()),
Arc::clone(&list_inner) as ArrayRef,
None,
));
let delimiters = Arc::new(StringArray::from(vec!["-"]));
let null_strings = Arc::new(StringArray::from(vec!["[NULL]"]));
let output = array_to_string_inner(&[list_outer, delimiters, null_strings])?;
let value = output.as_string::<i32>().value(0);
assert_eq!(value, "1-[NULL]-3-4-5-6");
Ok(())
}
#[test]
fn test_fsl() -> Result<()> {
// Size + Values looks like this: [[1, NULL], [3, 4], [5, 6]]
// But with NullBuffer it's like this: [[1, NULL], NULL, [5, 6]]
let list_inner = Arc::new(FixedSizeListArray::new(
Field::new_list_field(DataType::Int64, true).into(),
2,
Arc::new(Int64Array::from(vec![
Some(1),
None,
Some(3),
Some(4),
Some(5),
Some(6),
])),
Some(NullBuffer::from(vec![true, false, true])),
));
let list_outer = Arc::new(ListArray::new(
Field::new_list_field(list_inner.data_type().clone(), true).into(),
OffsetBuffer::new(vec![0, 3].into()),
Arc::clone(&list_inner) as ArrayRef,
None,
));
let delimiters = Arc::new(StringArray::from(vec!["-"]));
let null_strings = Arc::new(StringArray::from(vec!["[NULL]"]));
let output = array_to_string_inner(&[list_outer, delimiters, null_strings])?;
let value = output.as_string::<i32>().value(0);
assert_eq!(value, "1-[NULL]-3-4-5-6");
Ok(())
}
}
We can see both output the 3-4
elements, even though in their original lists these are not valid since the List slot itself is null. These test cases should be failing as 3-4
elements don't appear in the original list.
Expected behavior
Null lists should not be accessed as if they are a valid list.
Root cause seems here:
datafusion/datafusion/functions-nested/src/string.rs
Lines 370 to 379 in d55fb6d
let list_array = as_list_array(&arr)?; | |
for i in 0..list_array.len() { | |
compute_array_to_string( | |
arg, | |
list_array.value(i), | |
delimiter.clone(), | |
null_string.clone(), | |
with_null_string, | |
)?; | |
} |
Specifically line 374, accessing list_array.value(i)
without a null check. The same goes for the other two arms below it.
Additional context
I'm concerned this similar issue may be in other functions 🤔