BigQuery: Field 'bar' is specified as REPEATED in provided schema which does not match REQUIRED as specified in the file. #17

Closed
@timocb

Description

I cannot upload a pandas DataFrame with a repeated field to BigQuery. This is closely related to an issue I reported earlier: googleapis/google-cloud-python#8093

That issue was resolved by making it possible to specify the schema, so I've opened a separate one. I also couldn't find any existing issues about REPEATED fields.

Environment details

Mac OS X 10.14.5
Python 3.6.8

Packages:

google-api-core==1.14.2
google-auth==1.6.3
google-cloud-bigquery==1.19.0
google-cloud-core==1.0.3
google-cloud-iam==0.2.1
google-cloud-logging==1.12.1
google-resumable-media==0.3.3
googleapis-common-protos==1.6.0

Steps to reproduce

  1. Have a table with a REPEATED field
  2. Upload a Pandas DataFrame with a repeated field to this table
  3. The load job fails with a 400 BadRequest error (see the stack trace)

Also:

  • Getting the schema from BigQuery and using that in the JobConfig doesn't change the error.

Code example

import pandas as pd
from google.cloud import bigquery


PROJECT = "MY-PROJECT"
DATASET = "MY_DATASET"
TABLE = "MY_TABLE"


# My table schema
schema = [
    bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("bar", "FLOAT", mode="REPEATED"),
]


# Set everything up
client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)


# Delete the table if exists
print("Deleting table if exists...")
client.delete_table(table_ref, not_found_ok=True)


# Create the table
print("Creating table...")
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
table = client.create_table(table, exists_ok=True)

print("Table schema:")
print(table.schema)

print("Table partitioning:")
print(table.time_partitioning)

# Upload data to partition
table_partition = TABLE + "$20190522"
table_ref = dataset_ref.table(table_partition)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})

job_config = bigquery.LoadJobConfig(schema=schema)
client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()

Stack trace

Traceback (most recent call last):
  File "test.py", line 51, in <module>
    client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()
  File "google/cloud/bigquery/job.py", line 734, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Provided schema is not compatible with the file 'prod-scotty-******'. Field 'bar' is specified as REPEATED in provided schema which does not match REQUIRED as specified in the file.
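
A possible workaround sketch, not something confirmed in this report: the error refers to "the file", which suggests the mismatch comes from how the list column is encoded in the intermediate file that load_table_from_dataframe uploads. That cause is an assumption. Under it, one option is to serialize the rows as newline-delimited JSON and load that instead. The helper name rows_to_ndjson is hypothetical, and the upload step is left as a comment because it needs credentials and the client, schema, and table_ref from the code example above.

```python
import io
import json


def rows_to_ndjson(rows):
    """Serialize an iterable of dicts as newline-delimited JSON bytes."""
    lines = [json.dumps(row, sort_keys=True) for row in rows]
    return ("\n".join(lines) + "\n").encode("utf-8")


# Same data as the DataFrame in the report, written as plain records.
# With pandas, df.to_json(orient="records", lines=True) gives a similar payload.
rows = [
    {"foo": 1, "bar": [2.0, 3.0]},
    {"foo": 2, "bar": [3.0, 4.0]},
    {"foo": 3, "bar": [4.0, 5.0]},
]

payload = rows_to_ndjson(rows)
buffer = io.BytesIO(payload)  # file-like object suitable for an upload call
print(payload.decode("utf-8"), end="")

# Assumed upload step (not run here):
# job_config = bigquery.LoadJobConfig(
#     schema=schema,
#     source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
# )
# client.load_table_from_file(buffer, table_ref, job_config=job_config).result()
```

In JSON each "bar" value is a plain array, which is how a REPEATED field is normally represented for JSON loads. Whether this avoids the error on google-cloud-bigquery 1.19.0 has not been verified here.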

Metadata

Labels

🚨 This issue needs some love., api: bigquery, priority: p2, type: bug
