# Can't upload data with "2019-07-08 08:00:00" datetime format to Google BigQuery with pandas (#56)
I tried to change DATETIME to TIMESTAMP in the schema, but it always has the same error.

Also, I tried converting this column (`df['created']`) to milliseconds format, but that didn't work either. Please help me. Is this a bug?
@namnguyenbk Will investigate this shortly. In the meantime, would it be possible to get a minimal reproducible code sample, and the versions of the libraries used? That would make the investigation easier, thanks!
googleapis/google-cloud-python#9996
@plamut

```python
# get_sessions() returns a DataFrame whose "created" column has dtype
# datetime64[ns], e.g. "2020-01-08 08:00:00"
my_df = get_sessions()
my_df['created'] = pd.to_datetime(my_df['created'], format='%Y-%m-%d %H:%M:%S').astype('datetime64[ns]')
res = bigquery_client.client.load_table_from_dataframe(my_df, table_id)
res.result()
```

For example, my value "2020-01-08 08:00:00" is either rejected as INVALID or changed to a bogus value like "0013-03-01T03:05:00". @plamut please help.
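A sketch of one possible workaround (an assumption on my part, not a fix confirmed at this point in the thread): make the column timezone-aware, so the client uploads it as a TIMESTAMP rather than a DATETIME.

```python
import pandas as pd

# Hypothetical frame standing in for get_sessions(); "created" is tz-naive.
df = pd.DataFrame({"created": ["2020-01-08 08:00:00"]})

# Parsing and then localizing to UTC yields dtype datetime64[ns, UTC], which
# the client maps to TIMESTAMP rather than the problematic DATETIME type.
df["created"] = pd.to_datetime(df["created"]).dt.tz_localize("UTC")
print(df.dtypes)  # created    datetime64[ns, UTC]
```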
I'm using the newest version of google-cloud-python (1.24.0) @plamut.
@namnguyenbk Noted, thank you.
@plamut @tswast
There have been some updates on the internal issue 147108331 that I filed. It looks like DATETIME support might actually be added soon, but it needs some time to roll out to production. Hopefully @shollyman can let you know when this changes, and we can revert my "fix" to always use TIMESTAMP.
Thanks for the update! I marked the issue as external and also classified it as P2 - it's already been around for quite a while now with a relatively low bug report count, and there also exists a schema workaround (albeit not a perfect one).
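For reference, a minimal sketch of that schema workaround (table and column names are placeholders): declare the column as TIMESTAMP in an explicit schema, so the Parquet DATETIME limitation never comes into play.

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.sessions"  # placeholder

df = pd.DataFrame({"created": pd.to_datetime(["2020-01-08 08:00:00"])})

job_config = bigquery.LoadJobConfig(
    schema=[
        # TIMESTAMP instead of DATETIME sidesteps the Parquet limitation, at
        # the cost of a timezone-aware column in the destination table.
        bigquery.SchemaField("created", "TIMESTAMP"),
    ],
)
client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
```

The imperfection: naive values end up stored as UTC timestamps rather than civil DATETIME values.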
Not through the dataframe, unfortunately, at least not yet. Although, IIRC, it might be possible to get around this by using …
@plamut How about `insert_rows_from_dataframe()`? It worked fine for my problem, but it's a streaming-buffer API, while I want to insert all rows in a short time, like `load_table_from_dataframe()` does.
If your data is already in a DataFrame and you do not mind the conversion overhead, then yes, go ahead with `insert_rows_from_dataframe()`.
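For completeness, a minimal sketch of that streaming route (project, dataset, and table names are placeholders):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.sessions")  # placeholder

df = pd.DataFrame({"created": pd.to_datetime(["2020-01-08 08:00:00"])})

# Streams rows via the insertAll API instead of running a load job; rows land
# in the streaming buffer first, which is the trade-off discussed above.
errors = client.insert_rows_from_dataframe(table, df)
assert not any(errors), errors  # one (hopefully empty) error list per chunk
```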
@plamut Many thanks for your kind help.
@plamut How can I insert a null DATE into BigQuery? I tried, but it was forced to "0001-01-01".
@namnguyenbk Since this is about a DATE type, I suggest opening a separate issue for it; that will make tracking both of them easier. I don't know the answer off the top of my head, unfortunately, but I can have a look at it next week when I'm back from a short absence.
@plamut I fixed it anyway. I think the conversion in `load_table_from_dataframe` can be annoying sometimes, btw.
@namnguyenbk Good to hear that, and thanks for posting the link. No need to open a separate issue then.
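The linked fix isn't quoted in the thread; as an assumption on my part, one common way to get a real NULL is to replace `NaT` with `None` before the rows are serialized, e.g.:

```python
import pandas as pd

df = pd.DataFrame({"date_col": pd.to_datetime(["2020-01-08", None])})

# Convert to Python date objects, then turn NaT into None so serialization
# produces NULL instead of a sentinel such as "0001-01-01".
df["date_col"] = df["date_col"].dt.date
df["date_col"] = df["date_col"].where(df["date_col"].notna(), None)
```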
Hi @plamut, any plans for the production rollout?
The internal bug 147108331 to add DATETIME (microsecond-precision) support was closed as fixed on March 16, 2020. It's probably worth reverting part of this change (https://github.com/googleapis/google-cloud-python/pull/10028/files#diff-f7cb34ad7828ff0648d57694b2fc2aa4L55), or at least adding a test to check that DATETIME values can be uploaded if explicitly set in the schema.
I'll try to have a look at this again at the end of the week, maybe. In the best case, removing the hack that we had to add back then would be enough, meaning that we would only have to (re)add a test covering this specific case.
@tswast I tried uploading a DATETIME field, but no luck, I'm afraid. The error message is a bit surprising, as 1578470400000000 epoch microseconds is actually a perfectly valid value (2020-01-08 08:00:00). Posting more details below for a sanity check.

Modified type map to use DATETIME again:

```diff
diff --git google/cloud/bigquery/_pandas_helpers.py google/cloud/bigquery/_pandas_helpers.py
index 953b7d0..df66e76 100644
--- google/cloud/bigquery/_pandas_helpers.py
+++ google/cloud/bigquery/_pandas_helpers.py
@@ -55,7 +55,7 @@ _PANDAS_DTYPE_TO_BQ = {
     "datetime64[ns, UTC]": "TIMESTAMP",
     # BigQuery does not support uploading DATETIME values from Parquet files.
     # See: https://github.com/googleapis/google-cloud-python/issues/9996
-    "datetime64[ns]": "TIMESTAMP",
+    "datetime64[ns]": "DATETIME",
     "float32": "FLOAT",
     "float64": "FLOAT",
     "int8": "INTEGER",
```

The script used to test the behavior:

```python
import pandas as pd

from google.cloud import bigquery

PROJECT = "..."
DATASET = "..."
TABLE = "..."


def main():
    bigquery_client = bigquery.Client()
    table_name = f"{PROJECT}.{DATASET}.{TABLE}"

    df = pd.DataFrame({
        "float_col": [0.255, 0.55],
        "datetime_col": ["2020-01-08 08:00:00", "2112-07-22 15:56:00"],
    })
    df = df.astype(dtype={"datetime_col": "datetime64[ns]"})

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("float_col", "FLOAT"),
            bigquery.SchemaField("datetime_col", "DATETIME"),
        ],
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    load_job = bigquery_client.load_table_from_dataframe(
        df, table_name, job_config=job_config
    )
    load_job.result()


if __name__ == "__main__":
    main()
```

The relevant package versions were up to date.

The parquet file generated and uploaded in the process: …
Thanks. I've re-opened the internal bug 147108331. I can replicate this error with the …
Can we confirm that this is converting to microseconds in the Parquet file? Nanoseconds for DATETIME are not supported, and eng thinks that may be the cause.
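A sketch of how one might verify this with pyarrow (the file and column names are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Inspect the unit of the timestamp column in the generated file.
schema = pq.read_schema("dataframe.parquet")  # placeholder file name
print(schema.field("datetime_col").type)  # e.g. timestamp[us] vs. timestamp[ns]

# When writing a file by hand, timestamps can be coerced to microseconds.
table = pa.table(
    {"datetime_col": pa.array([1578470400000000], pa.timestamp("us"))}
)
pq.write_table(table, "coerced.parquet", coerce_timestamps="us")
```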
Sorry, I misunderstood the engineers in internal bug 147108331. It was marked fixed because BigQuery no longer loads nonsense DATETIME values when such a file is encountered. Now it raises a validation error. I've filed 166476249 as a feature request to properly add support for DATETIME values.
Given the above, what is the recommended way to insert a datetime from pandas into a BigQuery DATETIME column?
@jgadbois I suggest the following two workarounds: …
This library recently added the ability to serialize a DataFrame to CSV for upload. With that format, DATETIME columns are supported.
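A sketch of that CSV route (table and column names are placeholders; this assumes a client version with CSV serialization support):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.sessions"  # placeholder

df = pd.DataFrame({"datetime_col": pd.to_datetime(["2020-01-08 08:00:00"])})

job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("datetime_col", "DATETIME")],
    # Serialize the DataFrame as CSV instead of Parquet; CSV loads accept
    # DATETIME values directly.
    source_format=bigquery.SourceFormat.CSV,
)
client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
```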
That was a great workaround, thanks!
@tswast Your workaround works.
The backend bug with Parquet DATETIME loading (147108331) is reported to be fixed on the development server. We should be able to remove our client-side workarounds once the fix rolls out to production.
@jimfulton Could you add system tests to https://github.com/googleapis/python-bigquery/blob/master/tests/system/test_pandas.py for uploading DATETIME columns? While you're at it, could you add similar tests for TIME columns? I believe this will let us close https://issuetracker.google.com/169230812.

In both cases, there can be a difference between millisecond-precision values and microsecond-precision values, so we should test for both.

Re: #56 (comment), let's revert the change to the default dtype mapping in a PR to the …
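A rough sketch of what such a test could look like (the fixtures `bigquery_client` and `table_id` are assumptions, not the repository's actual scaffolding):

```python
import datetime

import pandas as pd
from google.cloud import bigquery


def test_load_dataframe_with_datetime_and_time_columns(bigquery_client, table_id):
    # Cover both millisecond- and microsecond-precision values.
    df = pd.DataFrame(
        {
            "datetime_col": [
                datetime.datetime(2020, 1, 8, 8, 0, 0, 123000),  # ms precision
                datetime.datetime(2020, 1, 8, 8, 0, 0, 123456),  # us precision
            ],
            "time_col": [
                datetime.time(8, 0, 0, 123000),
                datetime.time(8, 0, 0, 123456),
            ],
        }
    )
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("datetime_col", "DATETIME"),
            bigquery.SchemaField("time_col", "TIME"),
        ],
    )
    bigquery_client.load_table_from_dataframe(
        df, table_id, job_config=job_config
    ).result()

    # Row order is not guaranteed, so sort before comparing.
    rows = sorted(
        bigquery_client.list_rows(table_id), key=lambda row: row["datetime_col"]
    )
    assert [row["datetime_col"] for row in rows] == list(df["datetime_col"])
```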
**Environment details**

I'm using pandas with google-cloud-python.

**Steps to reproduce / code example**

I just updated my problem, here: …

Thanks!