Description
Iceberg has functionality to introspect tables. This is very useful, for example, to check that a column is properly sorted by looking at the lower/upper bounds of each data file:
https://iceberg.apache.org/docs/latest/spark-queries/#files
The query SELECT * FROM prod.db.table.files returns upper_bounds and lower_bounds columns, each a map of column ID to binary, encoded with Iceberg's single-value serialization:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
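For example, per that appendix an int bound is stored as 4 little-endian bytes, so 42 becomes b'\x2a\x00\x00\x00' (a plain-Python sketch of the round trip):

# per the single-value serialization spec, ints are 4-byte little-endian
encoded = (42).to_bytes(4, byteorder='little', signed=True)
assert encoded == b'\x2a\x00\x00\x00'
assert int.from_bytes(encoded, byteorder='little', signed=True) == 42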
To interpret the binary columns we register custom UDFs like this one to convert the little-endian bytes:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def _to_int(data):
    return int.from_bytes(data, byteorder='little', signed=True)

# register PySpark UDF
to_int = F.udf(_to_int, IntegerType())

# register SQL UDF
spark.udf.register("to_int", _to_int, IntegerType())

Then we can use this function to interpret the data and display it correctly:
-- Stored as 4-byte little-endian
SELECT
min(to_int(lower_bounds[1])) min_a,
max(to_int(upper_bounds[1])) max_a,
min(to_int(lower_bounds[2])) min_b,
max(to_int(upper_bounds[2])) max_b,
min(to_int(lower_bounds[3])) min_c,
max(to_int(upper_bounds[3])) max_c
FROM
prod.db.table.files

Does Iceberg come with utility functions like these? Is there an easier way to interpret the binary data than to write a custom UDF?
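For reference, the same lookup also works through the DataFrame API. A minimal sketch, assuming spark is an active SparkSession, prod.db.table exists, and to_int is the F.udf-registered function from above:

from pyspark.sql import functions as F

# read the files metadata table (assumes the prod catalog is configured)
files_df = spark.table("prod.db.table.files")

# decode the bounds for column ID 1 and aggregate across all data files
files_df.select(
    to_int(F.col("lower_bounds")[1]).alias("lower_a"),
    to_int(F.col("upper_bounds")[1]).alias("upper_a"),
).agg(
    F.min("lower_a").alias("min_a"),
    F.max("upper_a").alias("max_a"),
).show()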