Fix Bug: Differentiate between the Nested and Repeated fields for bigquery when executing datacontract import. #737

@gregoireAGBG

Hello,
In datacontract-cli 0.10.24 and previous versions, there is a type mismatch in the schema imported from BigQuery for REPEATED RECORD fields.

When executing
datacontract import --format bigquery --bigquery-project my_gcp_project --bigquery-dataset my_dataset --bigquery-table my_table >> dc_my_gcp_project_my_dataset_my_table.yml

the following is generated in dc_my_gcp_project_my_dataset_my_table.yml for my_bq_col (which is a nested repeated field):

  my_bq_col:
    type: object
    required: false
    fields:
      subfield1:
        type: string
        required: true
      subfield2:
        type: string
        required: false
      subfield3:
        type: string
        required: false

After adding the server information, a manual modification is needed before the following command can run successfully:
datacontract test dc_my_gcp_project_my_dataset_my_table.yml --logs

The correct adaptation is as follows:

  my_bq_col:
    type: array
    items:
      type: record
      fields:
        subfield1:
          type: string
          required: true
        subfield2:
          type: string
          required: false
        subfield3:
          type: string
          required: false

Below is the output of executing datacontract test dc_my_gcp_project_my_dataset_my_table.yml --logs
Before the modification:
data contract is invalid, found the following errors:
my_bq_col Check that field my_bq_col has type STRUCT<subfield1 STRING, subfield2 STRING, subfield3 STRING>: Type Mismatch, Expected Type: STRUCT<subfield1 STRING, subfield2 STRING, subfield3 STRING>;
Actual Type: ARRAY<STRUCT<subfield1 STRING, subfield2 STRING, subfield3 STRING>>

After the modification:
passed │ Check that field my_bq_col has type ARRAY<STRUCT<subfield1 STRING, subfield2 STRING, subfield3 STRING>>│ my_bq_col│ │
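
To make the mismatch concrete, here is a minimal Python sketch of how a contract field could be rendered as a BigQuery type string for comparison. The function name to_bigquery_type and the scalar mapping are hypothetical, chosen for illustration; this is not the actual datacontract-cli code, only a sketch consistent with the type strings shown in the output above.

```python
from typing import Any, Dict

# Hypothetical scalar mapping, limited to what the example needs.
SCALARS = {"string": "STRING"}


def to_bigquery_type(field: Dict[str, Any]) -> str:
    """Render a contract field (as a dict) as a BigQuery type string."""
    field_type = field["type"]
    if field_type == "array":
        # An array wraps the type of its items.
        return f"ARRAY<{to_bigquery_type(field['items'])}>"
    if field_type in ("object", "record", "struct"):
        # A struct lists its subfields as "name TYPE" pairs.
        inner = ", ".join(
            f"{name} {to_bigquery_type(sub)}"
            for name, sub in field["fields"].items()
        )
        return f"STRUCT<{inner}>"
    return SCALARS.get(field_type, field_type.upper())


subfields = {
    "subfield1": {"type": "string", "required": True},
    "subfield2": {"type": "string", "required": False},
    "subfield3": {"type": "string", "required": False},
}

# Schema as currently imported: a plain object.
imported = {"type": "object", "required": False, "fields": subfields}
# Schema after the manual modification: an array of records.
adapted = {"type": "array", "items": {"type": "record", "fields": subfields}}

print(to_bigquery_type(imported))
print(to_bigquery_type(adapted))
```

Under these assumptions, the imported schema yields the STRUCT<...> type from the failing check, while the adapted schema yields the ARRAY<STRUCT<...>> type that BigQuery actually reports.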

Explanation:

  • Currently, datacontract import --format bigquery handles nested repeated fields as plain nested fields:
    Nested fields (Struct) are defined by "type": "RECORD", "mode": "NULLABLE",
    while nested repeated fields (Array of Struct) are defined by "type": "RECORD", "mode": "REPEATED".
    The RECORD type seems to be handled as a single case, with no difference in behavior depending on the mode.

To resolve the issue, the import should check the mode of each field and map mode "REPEATED" to an array with record items, as described above, so that the generated data contract passes the schema check.
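
A minimal Python sketch of the proposed mode check follows. The function convert_field, the SCALAR_TYPES mapping, and the dict-based input are assumptions made for illustration (the input mirrors the JSON form of a BigQuery table schema); the actual importer in datacontract-cli may be structured differently.

```python
from typing import Any, Dict

# Hypothetical mapping of a few BigQuery scalar types to contract types.
SCALAR_TYPES = {"STRING": "string", "BOOLEAN": "boolean", "DATE": "date"}


def convert_field(bq_field: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one BigQuery schema field (JSON form) to a contract field."""
    bq_type = bq_field["type"].upper()
    mode = bq_field.get("mode", "NULLABLE")

    if bq_type in ("RECORD", "STRUCT"):
        # Build the record itself, recursing into its subfields.
        converted: Dict[str, Any] = {
            "type": "record",
            "fields": {
                sub["name"]: convert_field(sub)
                for sub in bq_field.get("fields", [])
            },
        }
    else:
        converted = {"type": SCALAR_TYPES.get(bq_type, bq_type.lower())}

    if mode == "REPEATED":
        # The step missing today: a repeated field becomes an array
        # whose items carry the converted type.
        return {"type": "array", "items": converted}

    converted["required"] = mode == "REQUIRED"
    return converted


my_bq_col = {
    "name": "my_bq_col",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
        {"name": "subfield1", "type": "STRING", "mode": "REQUIRED"},
        {"name": "subfield2", "type": "STRING", "mode": "NULLABLE"},
        {"name": "subfield3", "type": "STRING", "mode": "NULLABLE"},
    ],
}

result = convert_field(my_bq_col)
print(result["type"], result["items"]["type"])
```

The mode check is applied after the type conversion so that the same wrapping also covers repeated scalar fields, not only repeated records. The sketch uses "record" for both the repeated and non-repeated case, following the corrected YAML above.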

Best regards.
