Skip to content

array_to_string doesn't null check for nested Lists #17692

@Jefffrey

Description

@Jefffrey

Describe the bug

When using array_to_string on a nested list, it doesn't properly check the element is null when doing conversion.

To Reproduce

Example SLT case:

query T
select array_to_string(arrow_cast([arrow_cast([NULL, 'a'], 'FixedSizeList(2, Utf8)'), NULL], 'FixedSizeList(2, FixedSizeList(2, Utf8))'), ',', '-');
----
-,a,-,-

Here the input list looks like [[NULL, 'a'], NULL] but the output string is -,a,-,-, where - is the replacement for nulls; we can see the output string has three nulls whereas our list only had two.

Another example are these Rust test cases:

// In datafusion/functions-nested/src/string.rs
#[cfg(test)]
mod tests {
    use super::array_to_string_inner;
    use arrow::array::{
        Array, ArrayRef, AsArray, FixedSizeListArray, Int64Array, ListArray, StringArray,
    };
    use arrow::buffer::{NullBuffer, OffsetBuffer};
    use arrow::datatypes::{DataType, Field};
    use datafusion_common::Result;
    use std::sync::Arc;

    #[test]
    fn test_list() -> Result<()> {
        // Offsets + Values looks like this: [[1, NULL], [3, 4], [5, 6]]
        // But with NullBuffer it's like this: [[1, NULL], NULL, [5, 6]]
        let list_inner = Arc::new(ListArray::new(
            Field::new_list_field(DataType::Int64, true).into(),
            OffsetBuffer::new(vec![0, 2, 4, 6].into()),
            Arc::new(Int64Array::from(vec![
                Some(1),
                None,
                Some(3),
                Some(4),
                Some(5),
                Some(6),
            ])),
            Some(NullBuffer::from(vec![true, false, true])),
        ));

        let list_outer = Arc::new(ListArray::new(
            Field::new_list_field(list_inner.data_type().clone(), true).into(),
            OffsetBuffer::new(vec![0, 3].into()),
            Arc::clone(&list_inner) as ArrayRef,
            None,
        ));

        let delimiters = Arc::new(StringArray::from(vec!["-"]));
        let null_strings = Arc::new(StringArray::from(vec!["[NULL]"]));

        let output = array_to_string_inner(&[list_outer, delimiters, null_strings])?;
        let value = output.as_string::<i32>().value(0);
        assert_eq!(value, "1-[NULL]-3-4-5-6");

        Ok(())
    }

    #[test]
    fn test_fsl() -> Result<()> {
        // Size + Values looks like this: [[1, NULL], [3, 4], [5, 6]]
        // But with NullBuffer it's like this: [[1, NULL], NULL, [5, 6]]
        let list_inner = Arc::new(FixedSizeListArray::new(
            Field::new_list_field(DataType::Int64, true).into(),
            2,
            Arc::new(Int64Array::from(vec![
                Some(1),
                None,
                Some(3),
                Some(4),
                Some(5),
                Some(6),
            ])),
            Some(NullBuffer::from(vec![true, false, true])),
        ));

        let list_outer = Arc::new(ListArray::new(
            Field::new_list_field(list_inner.data_type().clone(), true).into(),
            OffsetBuffer::new(vec![0, 3].into()),
            Arc::clone(&list_inner) as ArrayRef,
            None,
        ));

        let delimiters = Arc::new(StringArray::from(vec!["-"]));
        let null_strings = Arc::new(StringArray::from(vec!["[NULL]"]));

        let output = array_to_string_inner(&[list_outer, delimiters, null_strings])?;
        let value = output.as_string::<i32>().value(0);
        assert_eq!(value, "1-[NULL]-3-4-5-6");

        Ok(())
    }
}

We can see both output the 3-4 elements, even though in their original lists these are not valid since the List slot itself is null. These test cases should be failing as 3-4 elements don't appear in the original list.

Expected behavior

Null lists should not be accessed as if they are a valid list.

Root cause seems here:

let list_array = as_list_array(&arr)?;
for i in 0..list_array.len() {
compute_array_to_string(
arg,
list_array.value(i),
delimiter.clone(),
null_string.clone(),
with_null_string,
)?;
}

Specifically line 374, accessing list_array.value(i) without a null check. The same goes for the other two arms below it.

Additional context

I'm concerned this similar issue may be in other functions 🤔

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions