Interpreting the upper/lower bounds column returned from querying the .files metadata #5370

@cccs-jc

Description

Iceberg has functionality to introspect tables. This is very useful, for example, to check that a column is properly sorted by inspecting the lower/upper bounds recorded for each data file.

https://iceberg.apache.org/docs/latest/spark-queries/#files

Running SELECT * FROM prod.db.table.files returns upper_bounds and lower_bounds columns, each a map of column ID to a binary value.

https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
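
As a concrete illustration of that appendix, an int is stored as its 4-byte little-endian encoding. A minimal sketch in plain Python (no Spark involved, value 1234 chosen arbitrarily) shows what the stored bytes look like:

# sketch: a 4-byte little-endian int as it appears in the bounds map
value = 1234
raw = value.to_bytes(4, byteorder='little', signed=True)
print(raw)                                                    # b'\xd2\x04\x00\x00'
print(int.from_bytes(raw, byteorder='little', signed=True))   # 1234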

To interpret the binary values we register custom UDFs like the one below, which converts the little-endian bytes back to an integer.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def _to_int(data):
    # Iceberg stores an int as 4 little-endian bytes (single-value serialization)
    return int.from_bytes(data, byteorder='little', signed=True)

# register PySpark UDF
to_int = F.udf(_to_int, IntegerType())
# register SQL UDF
spark.udf.register("to_int", _to_int, IntegerType())
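
The same pattern extends to the other primitive types the spec serializes as little-endian bytes (long as 8 bytes, float and double as IEEE 754). A rough sketch, with helper names of our own choosing rather than anything provided by Iceberg:

import struct
from pyspark.sql.types import LongType, FloatType, DoubleType

def _to_long(data):
    # 8-byte little-endian signed integer
    return int.from_bytes(data, byteorder='little', signed=True)

def _to_float(data):
    # 4-byte IEEE 754 single, little-endian
    return struct.unpack('<f', data)[0]

def _to_double(data):
    # 8-byte IEEE 754 double, little-endian
    return struct.unpack('<d', data)[0]

spark.udf.register("to_long", _to_long, LongType())
spark.udf.register("to_float", _to_float, FloatType())
spark.udf.register("to_double", _to_double, DoubleType())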

Then we can use the to_int function to interpret the data and display it correctly.

-- Stored as 4-byte little-endian
SELECT
    min(to_int(lower_bounds[1])) min_a,
    max(to_int(upper_bounds[1])) max_a,
    min(to_int(lower_bounds[2])) min_b,
    max(to_int(upper_bounds[2])) max_b,
    min(to_int(lower_bounds[3])) min_c,
    max(to_int(upper_bounds[3])) max_c
FROM
    prod.db.table.files
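
The to_int UDF registered through F.udf can be used the same way from the DataFrame API. A rough equivalent of the SQL above (assuming the catalog is configured so spark.table can resolve the metadata table, and carrying over the column IDs 1-2 and the min_a/max_a naming from that example):

# sketch: aggregate decoded bounds with the DataFrame API
files_df = spark.table("prod.db.table.files")
summary = files_df.agg(
    F.min(to_int(F.col("lower_bounds")[1])).alias("min_a"),
    F.max(to_int(F.col("upper_bounds")[1])).alias("max_a"),
    F.min(to_int(F.col("lower_bounds")[2])).alias("min_b"),
    F.max(to_int(F.col("upper_bounds")[2])).alias("max_b"),
)
summary.show()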

Does Iceberg come with utility functions like these? Is there an easier way to interpret the binary data than writing a custom UDF?
