Version: `google-cloud-bigquery==1.25.0`
The `client.insert_rows()` function doesn't fail when inserting non-existent fields, whereas the BigQuery API does fail with a message like:
```json
{
  "kind": "bigquery#tableDataInsertAllResponse",
  "insertErrors": [
    {
      "index": 0,
      "errors": [
        {
          "reason": "invalid",
          "location": "zap",
          "debugInfo": "",
          "message": "no such field."
        }
      ]
    }
  ]
}
```
`insert_rows()` silently drops the additional columns instead.
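For illustration, here is a minimal reproduction sketch of the discrepancy; the project, dataset, table, and column names are hypothetical, and it assumes a table whose schema contains only `id` and `name`:

```python
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical table with schema: id INT64, name STRING
table = client.get_table("my-project.my_dataset.my_table")

rows = [{"id": 1, "name": "alice", "zap": "not in the schema"}]

# insert_rows() builds the payload from the table's schema, so "zap" is
# dropped before the request is sent and no error is reported.
print(client.insert_rows(table, rows))       # []

# The streaming API itself rejects the unknown field.
print(client.insert_rows_json(table, rows))  # [{'index': 0, 'errors': [{'reason': 'invalid', 'location': 'zap', ...}]}]
```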
This happens because `insert_rows()` uses `_record_field_to_json`, which only iterates over the list of fields it is given and ignores all other fields present in the data, and `insert_rows()` passes the table's schema as that list of fields to `_record_field_to_json`.
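The effect can be sketched in plain Python (this is only an illustration of the filtering behavior, not the library's actual implementation):

```python
# Building the JSON payload by iterating only over the schema's fields means
# any key absent from the schema never makes it into the request.
schema_fields = ["id", "name"]                    # fields taken from the table's schema
row = {"id": 1, "name": "alice", "zap": "extra"}  # incoming data with an extra column

payload = {name: row.get(name) for name in schema_fields}
print(payload)  # {'id': 1, 'name': 'alice'} -- "zap" is silently gone
```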
This behavior is the opposite of the BigQuery API's, and it means we cannot reliably insert data: because nothing fails, we are never made aware of changes to the incoming data.
IMHO this behavior is not correct. I think it would be acceptable if `selected_fields` were provided explicitly, but the schema should not silently be used to limit which fields of the input data are processed while the rest are ignored. I can imagine there are cases where one wants to be lenient and ignore all fields that are not part of the table, so this behavior might have to be an option, possibly combined with `selected_fields`.
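For what it's worth, the streaming API already treats this kind of leniency as an explicit opt-in (`ignoreUnknownValues`), which `insert_rows_json()` forwards via its `ignore_unknown_values` parameter. A rough sketch of that existing behavior, again with hypothetical table and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = [{"id": 1, "name": "alice", "zap": "extra"}]  # "zap" is not in the table's schema

# Default: the API reports the unknown field as an insert error.
errors = client.insert_rows_json("my-project.my_dataset.my_table", rows)

# Explicit opt-in: unknown fields are dropped, but only because we asked for it.
errors = client.insert_rows_json(
    "my-project.my_dataset.my_table", rows, ignore_unknown_values=True
)
```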
P.S. By extension this also applies to the `client.insert_rows_from_dataframe()` function, which uses `client.insert_rows()`.
P.P.S. We initially ran into this when using `insert_rows_from_dataframe()`, and it took some searching to find where things went wrong because the chain `insert_rows_from_dataframe` -> `insert_rows` -> `insert_rows_json` is somewhat indirect. Why was this longer route chosen instead of simply using `insert_rows_json(table, df.to_dict(orient="records"))`? It seems a lot simpler and will probably be the workaround we implement for now.
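For reference, a sketch of that workaround under the same assumptions (hypothetical table and column names); depending on the dataframe's dtypes, the records may first need converting to plain JSON-serializable Python types:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.DataFrame([{"id": 1, "name": "alice", "zap": "unexpected"}])

# Send the dataframe records straight to the streaming endpoint so that
# unknown fields surface as insert errors instead of being dropped.
errors = client.insert_rows_json(
    "my-project.my_dataset.my_table",
    df.to_dict(orient="records"),
)
if errors:
    raise RuntimeError(f"insert failed: {errors}")
```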