
Commit 3463b12

Refactor decimal conversion in PyArrow tables to use direct casting (#544)
This PR replaces the previous implementation of `convert_decimals_in_arrow_table()` with a more efficient approach that uses PyArrow's native casting operation instead of going through pandas conversion and array creation.

- Remove the conversion to a pandas DataFrame via the `to_pandas()` and `apply()` methods
- Remove the intermediate steps of building an array from the decimal column and setting it back on the table
- Replace them with direct type casting using PyArrow's `cast()` method
- Build a new table from the transformed columns rather than modifying the original table in place
- Create a new schema based on the modified fields

The new approach is more performant because it avoids the pandas conversion overhead. The table below highlights substantial performance improvements when retrieving all rows from a table containing decimal columns, particularly when compression is disabled. Even greater gains were observed with compression enabled: approximately an 84% improvement (6 seconds compared to 39 seconds). Benchmarking was performed against e2-dogfood, with the client located in the us-west-2 region.

![image](https://github.com/user-attachments/assets/5407b651-8ab6-4c13-b525-cf912f503ba0)

Signed-off-by: Jayant Singh <[email protected]>
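To make the comparison concrete, here is a minimal sketch of the two conversion paths on a single column. The column contents, and the assumption that decimal values arrive as strings, are invented for this illustration; `to_pandas()`, `apply()`, and `cast()` are the real APIs the old and new code rely on:

```python
from decimal import Decimal

import pyarrow

# Illustrative column: decimal values represented as strings
# (an assumption for this sketch), with a null mixed in.
col = pyarrow.chunked_array([["1.10", "2.25", None]])
dtype = pyarrow.decimal128(10, 2)

# Old path: round-trip through pandas, then rebuild an Arrow array.
decimal_col = col.to_pandas().apply(lambda v: v if v is None else Decimal(v))
old_result = pyarrow.array(decimal_col, type=dtype)

# New path: a single native cast, no pandas involved.
new_result = col.cast(dtype)

# Both paths produce the same values.
assert old_result.to_pylist() == new_result.to_pylist()
```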
1 parent bdb5154 commit 3463b12

File tree

1 file changed: +19 −9 lines

src/databricks/sql/utils.py (+19 −9)
```diff
@@ -611,21 +611,31 @@ def convert_arrow_based_set_to_arrow_table(arrow_batches, lz4_compressed, schema
 
 
 def convert_decimals_in_arrow_table(table, description) -> "pyarrow.Table":
+    new_columns = []
+    new_fields = []
+
     for i, col in enumerate(table.itercolumns()):
+        field = table.field(i)
+
         if description[i][1] == "decimal":
-            decimal_col = col.to_pandas().apply(
-                lambda v: v if v is None else Decimal(v)
-            )
             precision, scale = description[i][4], description[i][5]
             assert scale is not None
             assert precision is not None
-            # Spark limits decimal to a maximum scale of 38,
-            # so 128 is guaranteed to be big enough
+            # create the target decimal type
             dtype = pyarrow.decimal128(precision, scale)
-            col_data = pyarrow.array(decimal_col, type=dtype)
-            field = table.field(i).with_type(dtype)
-            table = table.set_column(i, field, col_data)
-    return table
+
+            new_col = col.cast(dtype)
+            new_field = field.with_type(dtype)
+
+            new_columns.append(new_col)
+            new_fields.append(new_field)
+        else:
+            new_columns.append(col)
+            new_fields.append(field)
+
+    new_schema = pyarrow.schema(new_fields)
+
+    return pyarrow.Table.from_arrays(new_columns, schema=new_schema)
 
 
 def convert_to_assigned_datatypes_in_column_table(column_table, description):
```
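For readers who want to try the new code path, here is a self-contained sketch that reproduces the refactored function and exercises it on a toy table. The sample data and `description` tuples are invented for illustration; their index layout matches what the function actually reads (type name at index 1, precision at index 4, scale at index 5), following the DB-API cursor description shape:

```python
import pyarrow

# Copy of the refactored function from this commit, reproduced here so the
# sketch is self-contained.
def convert_decimals_in_arrow_table(table, description) -> "pyarrow.Table":
    new_columns = []
    new_fields = []

    for i, col in enumerate(table.itercolumns()):
        field = table.field(i)

        if description[i][1] == "decimal":
            precision, scale = description[i][4], description[i][5]
            assert scale is not None
            assert precision is not None
            # create the target decimal type
            dtype = pyarrow.decimal128(precision, scale)

            new_columns.append(col.cast(dtype))
            new_fields.append(field.with_type(dtype))
        else:
            new_columns.append(col)
            new_fields.append(field)

    new_schema = pyarrow.schema(new_fields)
    return pyarrow.Table.from_arrays(new_columns, schema=new_schema)

# Toy inputs (made up for the example): one decimal column sent as strings,
# one plain integer column. Each description row follows the DB-API layout:
# (name, type_code, display_size, internal_size, precision, scale, null_ok).
table = pyarrow.table({"amount": ["19.99", "0.50", None], "id": [1, 2, 3]})
description = [
    ("amount", "decimal", None, None, 10, 2, True),
    ("id", "int", None, None, None, None, True),
]

converted = convert_decimals_in_arrow_table(table, description)
print(converted.schema)
# amount: decimal128(10, 2)
# id: int64
```

Note the design choice the diff makes: instead of mutating the input table column by column with `set_column()`, the new code accumulates transformed columns and fields, then builds a fresh table in one call, which keeps the loop free of repeated table reconstruction.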
