Description
It is not possible to upload a Pandas DataFrame with a repeated field to BigQuery. This is closely related to an issue I reported earlier: googleapis/google-cloud-python#8093
Since that one has been resolved (by making it possible to specify the schema), I've created a separate issue. I also couldn't find any existing issues about REPEATED fields.
Environment details
Mac OS X 10.14.5
Python 3.6.8
Packages:
google-api-core==1.14.2
google-auth==1.6.3
google-cloud-bigquery==1.19.0
google-cloud-core==1.0.3
google-cloud-iam==0.2.1
google-cloud-logging==1.12.1
google-resumable-media==0.3.3
googleapis-common-protos==1.6.0
Steps to reproduce
- Have a table with a REPEATED field
- Upload a Pandas DataFrame with a repeated field to this table
- The load job fails with a 400 BadRequest error (see the stack trace below)
Also:
- Getting the schema from BigQuery and using that in the JobConfig doesn't change the error.
Code example
import pandas as pd
from google.cloud import bigquery
PROJECT = "MY-PROJECT"
DATASET = "MY_DATASET"
TABLE = "MY_TABLE"
# My table schema
schema = [
    bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("bar", "FLOAT", mode="REPEATED"),
]
# Set everything up
client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)
# Delete the table if it exists
print("Deleting table if exists...")
client.delete_table(table_ref, not_found_ok=True)
# Create the table
print("Creating table...")
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
table = client.create_table(table, exists_ok=True)
print("Table schema:")
print(table.schema)
print("Table partitioning:")
print(table.time_partitioning)
# Upload data to partition
table_partition = TABLE + "$20190522"
table_ref = dataset_ref.table(table_partition)
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})
job_config = bigquery.LoadJobConfig(schema=schema)
client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()
Stack trace
Traceback (most recent call last):
  File "test.py", line 51, in <module>
    client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()
  File "google/cloud/bigquery/job.py", line 734, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message:
Provided schema is not compatible with the file 'prod-scotty-******'.
Field 'bar' is specified as REPEATED in provided schema
which does not match REQUIRED as specified in the file.
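The error message refers to the mode recorded "in the file", which suggests the mismatch comes from the intermediate file that load_table_from_dataframe uploads rather than from the table itself. One possible way to test that hypothesis is to serialize the same rows as newline-delimited JSON, where a REPEATED field is simply a JSON array. The sketch below uses only the standard library; the to_ndjson helper is a hypothetical name, and the commented-out upload step is an untested assumption, so it has not been confirmed to avoid the error on 1.19.0.

```python
import json

# Same rows as the DataFrame in the report, as plain records.
# With pandas, df.to_dict(orient="records") would produce this shape.
records = [
    {"foo": 1, "bar": [2.0, 3.0]},
    {"foo": 2, "bar": [3.0, 4.0]},
    {"foo": 3, "bar": [4.0, 5.0]},
]


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one object per line.

    Hypothetical helper for illustration; not part of google-cloud-bigquery.
    """
    return "\n".join(json.dumps(row) for row in rows) + "\n"


payload = to_ndjson(records)
print(payload, end="")

# Assumed next step (not run here, needs credentials and the objects
# defined in the code example above):
#
# import io
# job_config = bigquery.LoadJobConfig(
#     schema=schema,
#     source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
# )
# client.load_table_from_file(
#     io.BytesIO(payload.encode("utf-8")), table_ref, job_config=job_config
# ).result()
```

If the JSON load succeeds with the same schema, that would point at how the list column is encoded in the uploaded file rather than at the table definition.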