Interpreting the upper/lower bounds column returned from querying the .files metadata #5370

@cccs-jc

Description

Iceberg has functionality to introspect tables. This is very useful, for example, to check that a column is properly sorted by inspecting the lower/upper bounds recorded for each data file.

https://iceberg.apache.org/docs/latest/spark-queries/#files

Running SELECT * FROM prod.db.table.files returns upper_bounds and lower_bounds columns, each a map of column ID to a binary value.

https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
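
As a concrete illustration of that appendix, an int is stored as its 4-byte little-endian encoding. A minimal sketch in plain Python (no Spark involved, value 1234 chosen arbitrarily) shows what the stored bytes look like:

# sketch: a 4-byte little-endian int as it appears in the bounds map
value = 1234
raw = value.to_bytes(4, byteorder='little', signed=True)
print(raw)                                                    # b'\xd2\x04\x00\x00'
print(int.from_bytes(raw, byteorder='little', signed=True))   # 1234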

To interpret the binary values we register custom UDFs like the one below, which converts the little-endian bytes back to an integer.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def _to_int(data):
    # Iceberg stores an int as 4 little-endian bytes (single-value serialization)
    return int.from_bytes(data, byteorder='little', signed=True)

# register PySpark UDF
to_int = F.udf(_to_int, IntegerType())
# register SQL UDF
spark.udf.register("to_int", _to_int, IntegerType())
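
The same pattern extends to the other primitive types the spec serializes as little-endian bytes (long as 8 bytes, float and double as IEEE 754). A rough sketch, with helper names of our own choosing rather than anything provided by Iceberg:

import struct
from pyspark.sql.types import LongType, FloatType, DoubleType

def _to_long(data):
    # 8-byte little-endian signed integer
    return int.from_bytes(data, byteorder='little', signed=True)

def _to_float(data):
    # 4-byte IEEE 754 single, little-endian
    return struct.unpack('<f', data)[0]

def _to_double(data):
    # 8-byte IEEE 754 double, little-endian
    return struct.unpack('<d', data)[0]

spark.udf.register("to_long", _to_long, LongType())
spark.udf.register("to_float", _to_float, FloatType())
spark.udf.register("to_double", _to_double, DoubleType())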

Then we can use the to_int function to interpret the data and display it correctly.

-- Stored as 4-byte little-endian
SELECT
    min(to_int(lower_bounds[1])) min_a,
    max(to_int(upper_bounds[1])) max_a,
    min(to_int(lower_bounds[2])) min_b,
    max(to_int(upper_bounds[2])) max_b,
    min(to_int(lower_bounds[3])) min_c,
    max(to_int(upper_bounds[3])) max_c
FROM
    prod.db.table.files
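
The to_int UDF registered through F.udf can be used the same way from the DataFrame API. A rough equivalent of the SQL above (assuming the catalog is configured so spark.table can resolve the metadata table, and carrying over the column IDs 1-2 and the min_a/max_a naming from that example):

# sketch: aggregate decoded bounds with the DataFrame API
files_df = spark.table("prod.db.table.files")
summary = files_df.agg(
    F.min(to_int(F.col("lower_bounds")[1])).alias("min_a"),
    F.max(to_int(F.col("upper_bounds")[1])).alias("max_a"),
    F.min(to_int(F.col("lower_bounds")[2])).alias("min_b"),
    F.max(to_int(F.col("upper_bounds")[2])).alias("max_b"),
)
summary.show()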

Does Iceberg come with utility functions like these? Is there an easier way to interpret the binary data than writing a custom UDF?
