Closed
Is your feature request related to a problem or challenge?
This ticket is a follow-on to #10918, where we implemented enough initial support for StringView / BinaryView that we can show some pretty sweet ClickBench results.
Describe the solution you'd like
This epic tracks the remaining work to complete the "initial" StringView support, which I would like to define as "enable using StringView when reading Strings from Parquet by default".
I am sure there will be additional work to add StringView support to various other features of DataFusion, which we can track in another follow-on ticket.
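For readers new to the format, here is a minimal sketch of the 16-byte "view" layout that StringView / BinaryView arrays use in the Arrow columnar format. It uses only the standard library; `make_view`, `read_view`, and `le_u32` are hypothetical helpers written for illustration and are not arrow-rs or DataFusion APIs. It assumes a single data buffer per call and does no validation.

```rust
// Illustrative model of one Arrow "view" (16 bytes). Not the arrow-rs implementation.
// Layout: bytes 0..4 hold the length (little-endian u32). Strings of at most
// 12 bytes are stored inline in bytes 4..16. Longer strings store a 4-byte
// prefix, a 4-byte buffer index, and a 4-byte offset into that buffer.

const MAX_INLINE_LEN: usize = 12;

/// Hypothetical helper: read a little-endian u32 from the first 4 bytes of `b`.
fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Hypothetical helper: encode `s` as a view, appending long strings to `buffer`.
fn make_view(s: &str, buffer_index: u32, buffer: &mut Vec<u8>) -> [u8; 16] {
    let bytes = s.as_bytes();
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    if bytes.len() <= MAX_INLINE_LEN {
        // Short string: the data lives in the view itself, zero padded.
        view[4..4 + bytes.len()].copy_from_slice(bytes);
    } else {
        // Long string: the data lives in a separate buffer; the view points at it.
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        view[4..8].copy_from_slice(&bytes[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Hypothetical helper: decode a view back into a string slice.
fn read_view<'a>(view: &'a [u8; 16], buffers: &'a [Vec<u8>]) -> &'a str {
    let len = le_u32(&view[0..4]) as usize;
    let bytes: &[u8] = if len <= MAX_INLINE_LEN {
        &view[4..4 + len]
    } else {
        let index = le_u32(&view[8..12]) as usize;
        let offset = le_u32(&view[12..16]) as usize;
        &buffers[index][offset..offset + len]
    };
    std::str::from_utf8(bytes).expect("view should reference valid UTF-8")
}
```

Because every view has a fixed size and long strings are referenced rather than embedded, operations such as slicing or filtering can in principle rearrange views without copying string bytes. Several of the tasks in this epic are about making sure that property is actually exploited.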
Required for enabling StringView by default:
- Enable reading StringView from Parquet by default (`schema_force_string_view`) #11682
- Support string concat `||` for `StringViewArray` #11766
- COUNT(DISTINCT) on StringView panics: `unreachable code: Utf8/Binary should use ArrowBytesSet` #11767
- Support protobuf serialization for `ScalarValue::Utf8View` and `ScalarValue::BinaryView` #12117
- Support substrait serialization for `ScalarValue::Utf8View` and `ScalarValue::BinaryView` #12118
- Convert `Utf8View`/`BinaryView` --> `Utf8`/`Binary` at output #12119
- Parquet statistics missing when reading `Utf8` as `Utf8View` #12123
- Support StringView for binary operators like `~`, `!~`, etc #12180
- Support applying parquet bloom filters to StringView columns #12499
- Support Binary --> String coercion for StringView/BinaryView in `LIKE` #12500
- Casting from Binary --> Utf8 to evaluate `LIKE` slows down some ClickBench queries #12509
Could be worked around, but really should be fixed upstream:
- Support casting `BinaryView` --> `Utf8` and `LargeUtf8` arrow-rs#6162
- Add support for `StringView` and `BinaryView` statistics in `StatisticsConverter` arrow-rs#6164
- Make `StringViewArray::slice()` and `BinaryViewArray::slice()` faster / non allocating arrow-rs#6408
Additional "Nice to have" Features
- [Epic] Native `StringView` support for string functions #11790
- Reduce copying in `CoalesceBatchesExec` for StringViews #11628
- Cast Utf8 -> Utf8View (not the other way around) for binary operators #11881
- Improve performance of SUBSTR for StringViewArray #12031
- Improve performance of REPEAT functions #12015
- [Epic] Complete Initial `StringView` in DataFusion #11752
- Automate testing / ensuring that string functions get the same answer for String, LargeString, StringView, DictionaryString, etc #12415