Description
Hello,
In datacontract-cli 0.10.24 and previous versions, there is a type mismatch in the schema imported from BigQuery for REPEATED RECORD fields.
When executing
datacontract import --format bigquery --bigquery-project my_gcp_project --bigquery-dataset my_dataset --bigquery-table my_table >> dc_my_gcp_project_my_dataset_my_table.yml
the following is generated inside dc_my_gcp_project_my_dataset_my_table.yml for my_bq_col (which is a nested repeated field):
my_bq_col:
  type: object
  required: false
  fields:
    subfield1:
      type: string
      required: true
    subfield2:
      type: string
      required: false
    subfield3:
      type: string
      required: false
After adding the server information, a manual modification is needed before the following can be executed successfully:
datacontract test dc_my_gcp_project_my_dataset_my_table.yml --logs
The correct adaptation is as follows:
my_bq_col:
  type: array
  items:
    type: record
    fields:
      subfield1:
        type: string
        required: true
      subfield2:
        type: string
        required: false
      subfield3:
        type: string
        required: false
Below is the output of executing datacontract test dc_my_gcp_project_my_dataset_my_table.yml --logs
Before the modification:
data contract is invalid, found the following errors:
my_bq_col Check that field my_bq_col has type STRUCT<subfield1 STRING, subfield2 STRING, subfield3 STRING>: Type Mismatch, Expected Type: STRUCT<subfield1 STRING, subfield2 STRING, subfield3 STRING>;
Actual Type: ARRAY<STRUCT<subfield1 STRING, subfield2 STRING, subfield3 STRING>>
After the modification:
passed │ Check that field my_bq_col has type ARRAY<STRUCT<subfield1 STRING, subfield2 STRING, subfield3 STRING>>│ my_bq_col│ │
Explanation:
- Currently, datacontract import --format bigquery handles nested repeated fields as if they were plain nested fields:
Nested fields (STRUCT) are defined by: "type": "RECORD", "mode": "NULLABLE",
while nested repeated fields (ARRAY of STRUCT) are defined by: "type": "RECORD", "mode": "REPEATED".
The RECORD type seems to be handled as a single case, with no difference in behavior depending on the mode.
To resolve the issue, the importer should check the mode of the field and handle the mode "REPEATED" as described above, so that the generated data contract passes the schema check.
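To illustrate the suggested mode check, here is a minimal sketch in Python. It is not the actual datacontract-cli importer code: the function name convert_field, the SCALAR_TYPES table, and the shape of the returned dictionaries are assumptions made for illustration. The input is assumed to look like one entry of a BigQuery JSON table schema ("name", "type", "mode", "fields").

```python
from typing import Any, Dict

# Hypothetical, deliberately incomplete mapping of BigQuery scalar types
# to data contract types (illustration only).
SCALAR_TYPES = {
    "STRING": "string",
    "INTEGER": "int",
    "INT64": "int",
    "BOOLEAN": "boolean",
    "BOOL": "boolean",
}


def convert_field(bq_field: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one BigQuery schema field into a data contract field (sketch)."""
    mode = bq_field.get("mode", "NULLABLE")
    bq_type = bq_field["type"]

    if bq_type in ("RECORD", "STRUCT"):
        # Convert subfields recursively, keyed by their name.
        nested = {f["name"]: convert_field(f) for f in bq_field.get("fields", [])}
        if mode == "REPEATED":
            # ARRAY<STRUCT<...>>: wrap the record in an array instead of
            # emitting a plain object.
            return {"type": "array", "items": {"type": "record", "fields": nested}}
        # STRUCT<...>: a plain nested object.
        return {"type": "object", "required": mode == "REQUIRED", "fields": nested}

    scalar: Dict[str, Any] = {"type": SCALAR_TYPES.get(bq_type, "string")}
    if mode == "REPEATED":
        # Repeated scalar, e.g. ARRAY<STRING>.
        return {"type": "array", "items": scalar}
    scalar["required"] = mode == "REQUIRED"
    return scalar


# Example input modelled on my_bq_col from this report.
my_bq_col = {
    "name": "my_bq_col",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
        {"name": "subfield1", "type": "STRING", "mode": "REQUIRED"},
        {"name": "subfield2", "type": "STRING", "mode": "NULLABLE"},
        {"name": "subfield3", "type": "STRING", "mode": "NULLABLE"},
    ],
}
converted = convert_field(my_bq_col)
```

With this sketch, a field with "mode": "REPEATED" is converted to type array with a record in items, while the same field with "mode": "NULLABLE" is still converted to type object.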
Best regards.