
Can't upload data with "2019-07-08 08:00:00" datetime format to Google Bigquery with pandas #56


Closed
namnguyenbk opened this issue Mar 9, 2020 · 36 comments · Fixed by #895
Assignees: plamut
Labels:
- api: bigquery (Issues related to the googleapis/python-bigquery API.)
- external (This issue is blocked on a bug with the actual product.)
- priority: p2 (Moderately-important priority. Fix may not be included in next release.)
- type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

namnguyenbk commented Mar 9, 2020

Environment details

I'm using pandas with google-cloud-python

Steps to reproduce

  1. I have a dataframe with a datetime-formatted column, e.g. "2019-07-08 08:00:00", and my table schema has a column of DATETIME type.
  2. I tried converting it with pd.to_datetime().
  3. Then I used load_table_from_dataframe() to insert the data.

Code example

import pandas as pd

my_df = get_sessions()  # returns a dataframe with a "created" column of datetime64[ns] dtype, e.g. "2020-01-08 08:00:00"
my_df['created'] = pd.to_datetime(my_df['created'], format='%Y-%m-%d %H:%M:%S').astype('datetime64[ns]')
res = bigquery_client.client.load_table_from_dataframe(my_df, table_id)
res.result()

# observed: the value "2020-01-08 08:00:00" ends up as INVALID or a wrong value such as "0013-03-01T03:05:00"

@plamut please help

I just updated my problem description here.
Thanks!

namnguyenbk (Author) commented Mar 9, 2020

I tried changing DATETIME to TIMESTAMP in the schema, but the values always end with UTC. How should I fix this? Please!

@namnguyenbk namnguyenbk changed the title Can't upload data with "2019-07-08 08:00:00" datetime format to GBQ with pandas Can't upload data with "2019-07-08 08:00:00" datetime format to Google Bigquery with pandas Mar 9, 2020
@busunkim96 busunkim96 transferred this issue from googleapis/google-cloud-python Mar 9, 2020
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Mar 9, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Mar 10, 2020
@plamut plamut self-assigned this Mar 10, 2020
namnguyenbk (Author) commented:

Also, I tried converting this column (df['created']) to milliseconds format, but it didn't work. Please help me. Is this a bug?

plamut (Contributor) commented Mar 10, 2020

@namnguyenbk Will investigate this shortly.

In the meantime, would it be possible to get a minimal reproducible code sample, and the versions of the libraries used? That would make the investigation easier, thanks!

namnguyenbk (Author) commented:

googleapis/google-cloud-python#9996
tswast said to change DATETIME -> TIMESTAMP to fix this, but I don't think we should do that: TIMESTAMP in BigQuery has a different format from DATETIME, and my client doesn't accept it because the values are forced to UTC when they aren't actually UTC.
This is a bug, and it should be fixed so that DATETIME can be inserted into BigQuery via load_table_from_dataframe(). I have used insert_rows_from_dataframe(), but it is a streaming API.

namnguyenbk (Author) commented:

@plamut thanks for helping me; just wait a minute, I will get back to you soon.

@plamut plamut added type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Mar 10, 2020
plamut (Contributor) commented Mar 10, 2020

@tswast Bringing this to your attention.

You mentioned back then that the backend lacked support for DATETIME when using Parquet uploads. Are there any updates to your backend feature request?

namnguyenbk (Author) commented Mar 10, 2020

import pandas as pd

my_df = get_sessions()  # returns a dataframe with a "created" column of datetime64[ns] dtype, e.g. "2020-01-08 08:00:00"
my_df['created'] = pd.to_datetime(my_df['created'], format='%Y-%m-%d %H:%M:%S').astype('datetime64[ns]')
res = bigquery_client.client.load_table_from_dataframe(my_df, table_id)
res.result()

My value "2020-01-08 08:00:00" is being changed to INVALID or a wrong value such as "0013-03-01T03:05:00". @plamut please help

namnguyenbk (Author) commented Mar 10, 2020

@plamut I saw that @tswast's PR to fix this is not a correct solution: he changed "datetime64[ns]": "DATETIME" to "datetime64[ns]": "TIMESTAMP". I think this is the wrong fix because BigQuery's TIMESTAMP forces datetimes to UTC, so the values become incorrect.

namnguyenbk (Author) commented:

@plamut I'm using the newest version of google-cloud-python, 1.24.0.

plamut (Contributor) commented Mar 10, 2020

@namnguyenbk Noted, thank you.

namnguyenbk (Author) commented Mar 10, 2020

@plamut @tswast
I just updated my problem (#56 (comment)), please take a look.

tswast (Contributor) commented Mar 10, 2020

You mentioned back then that the backend lacked support for DATETIME when using Parquet uploads. Are there any updates to your backend feature request?

There have been some updates on the internal issue 147108331 that I filed. It looks like DATETIME support might actually be added soon, but it needs some time to roll out to production. Hopefully @shollyman can let you know when this changes, and we can revert my "fix" to always use TIMESTAMP.

@plamut plamut added external This issue is blocked on a bug with the actual product. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. Not an issue. labels Mar 10, 2020
plamut (Contributor) commented Mar 10, 2020

Thanks for the update!

I marked the issue as external and also classified it as P2: it's been around for quite a while now with a relatively low bug report count, and there is also a schema workaround (albeit not a perfect one).

namnguyenbk (Author) commented:

Thank you, guys.
So you mean I still can't insert DATETIME into BigQuery and must wait for the update? @plamut @tswast

plamut (Contributor) commented Mar 10, 2020

Not through a dataframe, unfortunately, at least not yet. Although, IIRC, it might be possible to get around this by using load_table_from_json() instead (with an explicit schema), although that is slower than using a dataframe (I would have to re-check).
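
A minimal sketch of that workaround, assuming a table with a single DATETIME column named "created" (the project, dataset, and table IDs are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"

rows = [
    {"created": "2020-01-08 08:00:00"},
    {"created": "2020-01-09 09:30:00"},
]
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("created", "DATETIME")],
)
job = client.load_table_from_json(rows, table_id, job_config=job_config)
job.result()  # wait for the load job to finish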

namnguyenbk (Author) commented:

@plamut What about insert_rows_from_dataframe()? It worked fine for my problem, but it's a streaming API, while I want to insert all rows in a short time like load_table_from_dataframe() does.

plamut (Contributor) commented Mar 10, 2020

insert_rows_from_dataframe() uses insert_rows_json() under the hood; it just converts dataframe rows to JSON data and uploads it through the streaming API:

insert_rows_from_dataframe() --> insert_rows() (for each chunk of rows) --> insert_rows_json()

If your data is already in a DataFrame and you do not mind the conversion overhead, then yes, go ahead with insert_rows_from_dataframe(), especially if your use case is not negatively affected by a few trade-offs the streaming API brings (documentation).
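
For illustration, a minimal sketch of that route (the table ID is a placeholder, and my_df is the dataframe from the examples above; the target table must already exist):

from google.cloud import bigquery

client = bigquery.Client()
# Fetch the existing table so its schema is available for row conversion.
table = client.get_table("my-project.my_dataset.my_table")

# Streams the dataframe in chunks; returns one list of errors per chunk
# (all-empty lists mean every row was inserted successfully).
errors = client.insert_rows_from_dataframe(table, my_df)
assert all(not chunk for chunk in errors)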

namnguyenbk (Author) commented:

@plamut many thanks for your kind help

namnguyenbk (Author) commented:

@plamut How do I insert a null DATE into BigQuery using load_table_from_dataframe()? I tried, but it was forced to "0001-01-01".

plamut (Contributor) commented Mar 12, 2020

@namnguyenbk Since this is about a DATE type, I suggest opening a separate issue for it; that will make tracking both of them easier.

I don't know the answer off the top of my head, unfortunately, but I can have a look at it next week when I'm back from a short absence.

namnguyenbk (Author) commented:

@plamut I fixed it anyway. I think the conversion in load_table_from_dataframe() can be annoying sometimes, btw.
This one saved my life: https://stackoverflow.com/questions/42818262/pandas-dataframe-replace-nat-with-none
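
For reference, the gist of that answer as a short sketch ("date_col" is a placeholder column name):

import pandas as pd

# Cast to object dtype first so None can be stored, then replace NaT with None.
df["date_col"] = df["date_col"].astype(object).where(df["date_col"].notnull(), None)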

plamut (Contributor) commented Mar 16, 2020

@namnguyenbk Good to hear that, and thanks for posting the link. No need to open a separate issue then.

PPan1215 commented Aug 24, 2020

Hi @plamut, any plan for the production rollout?
We are having the same issue converting '%Y-%m-%d %H:%M:%S' values in a dataframe to DATETIME in BQ.
We use Parquet to speed up our import task with Airflow, and the datetime conversion ends up being the blocker.
For the sake of backwards compatibility, converting the existing schema type from DATETIME to TIMESTAMP in BQ is unfortunately not an option.

tswast (Contributor) commented Aug 24, 2020

The internal bug 147108331 to add DATETIME (microsecond-precision) support was closed as fixed on March 16, 2020. It's probably worth reverting part of this change https://github.com/googleapis/google-cloud-python/pull/10028/files#diff-f7cb34ad7828ff0648d57694b2fc2aa4L55 or at least adding a test to check that DATETIME values can be uploaded if explicitly set in the schema.

plamut (Contributor) commented Aug 24, 2020

I'll try to have a look at this again, maybe at the end of the week. In the best case, removing the hack we had to add back then would be enough, meaning we would only have to (re)add a test covering this specific case.

plamut (Contributor) commented Aug 26, 2020

@tswast I tried uploading a DATETIME field, but no luck, I'm afraid:

google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Invalid datetime value 1578470400000000 for field 'datetime_col' of type 'INT64' (logical type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of 1578470400000000 microseconds relative to the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]

The error message is a bit surprising, as 1578470400000000 epoch microseconds is actually 2020-01-08 08:00:00, which falls within the range of valid values.
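
A quick check confirms that decoding:

import datetime

# 1578470400000000 microseconds after the Unix epoch:
print(datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=1578470400000000))
# -> 2020-01-08 08:00:00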


Posting more details below for a sanity check.

Modified type map to use DATETIME again:

diff --git google/cloud/bigquery/_pandas_helpers.py google/cloud/bigquery/_pandas_helpers.py
index 953b7d0..df66e76 100644
--- google/cloud/bigquery/_pandas_helpers.py
+++ google/cloud/bigquery/_pandas_helpers.py
@@ -55,7 +55,7 @@ _PANDAS_DTYPE_TO_BQ = {
     "datetime64[ns, UTC]": "TIMESTAMP",
     # BigQuery does not support uploading DATETIME values from Parquet files.
     # See: https://github.com/googleapis/google-cloud-python/issues/9996
-    "datetime64[ns]": "TIMESTAMP",
+    "datetime64[ns]": "DATETIME",
     "float32": "FLOAT",
     "float64": "FLOAT",
     "int8": "INTEGER",

The script used to test the behavior:

import pandas as pd
from google.cloud import bigquery


PROJECT = "..."
DATASET = "..."
TABLE = "..."


def main():
    bigquery_client = bigquery.Client()

    table_name = f"{PROJECT}.{DATASET}.{TABLE}"

    df = pd.DataFrame({
        "float_col": [0.255, 0.55],
        "datetime_col": ["2020-01-08 08:00:00", "2112-07-22 15:56:00"],
    })
    df = df.astype(dtype={"datetime_col": "datetime64[ns]"})

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("float_col", "FLOAT"),
            bigquery.SchemaField("datetime_col", "DATETIME"),
        ],
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = bigquery_client.load_table_from_dataframe(
        df, table_name, job_config=job_config
    )
    load_job.result()


if __name__ == "__main__":
    main()

The relevant package versions were up to date:

pandas==1.1.1
pyarrow==1.0.1
fastparquet==0.4.1

The parquet file generated and uploaded in the process:
aaa.parquet.gz

tswast (Contributor) commented Aug 26, 2020

Thanks. I've re-opened the internal bug 147108331. I can replicate this error with the bq command as well:

$ bq load --source_format=PARQUET swast-scratch:my_dataset.aaa ./aaa.parquet ./aaa.schema
Upload complete.
Waiting on bqjob_rcdae8f70e185b28_000001742b104318_1 ... (5s) Current status: DONE
BigQuery error in load operation: Error processing job 'swast-scratch:bqjob_rcdae8f70e185b28_000001742b104318_1': Error while
reading data, error message: Invalid datetime value 1578470400000000 for field 'datetime_col' of type 'INT64' (logical type
'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of 1578470400000000 microseconds relative to
the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]
Failure details:
- prod-scotty-183204df-1614-4661-92c4-727e36ddf960: Error while
reading data, error message: Invalid datetime value
1578470400000000 for field 'datetime_col' of type 'INT64' (logical
type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an
invalid datetime value of 1578470400000000 microseconds relative to
the Unix epoch. The range of valid datetime values is [0001-01-1
00:00:00, 9999-12-31 23:59:59.999999]

aaa.schema.zip

tswast (Contributor) commented Aug 26, 2020

Can we confirm that this is converting to microseconds in the Parquet file? Nanoseconds for DATETIME are not supported, and eng thinks that may be the cause.
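
One way to check, as a sketch using pyarrow (assuming the aaa.parquet file from the repro above):

import pyarrow.parquet as pq

schema = pq.read_schema("aaa.parquet")
print(schema.field("datetime_col").type)  # e.g. timestamp[us] vs. timestamp[ns]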

tswast (Contributor) commented Aug 26, 2020

Sorry, I misunderstood the engineers in internal bug 147108331. It was marked fixed because BigQuery no longer loads nonsense DATETIME values when such a file is encountered; it now raises a validation error instead.

I've filed 166476249 as a feature request to properly add support for DATETIME values.

jgadbois commented:

Given the above, what is the recommended way to insert a datetime from pandas into a BigQuery DATETIME column?

tswast (Contributor) commented Sep 15, 2020

@jgadbois I suggest the following two workarounds:

  1. Create the table ahead of time, and append to it using the streaming API client.insert_rows_from_dataframe.
  2. Use pandas-gbq, which currently serializes to CSV, not Parquet (see the sketch below): https://pandas-gbq.readthedocs.io/en/latest/writing.html
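
A minimal sketch of workaround 2 (the dataframe, dataset, table, and project identifiers are placeholders):

import pandas_gbq

pandas_gbq.to_gbq(
    dataframe,
    "my_dataset.my_table",
    project_id="my-project",
    if_exists="append",  # append to the table created ahead of time
)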

tswast (Contributor) commented Dec 22, 2020

This library recently added the ability to serialize a DataFrame to CSV for upload. With that format, DATETIME columns are supported.

from google.cloud import bigquery
from google.cloud.bigquery import SourceFormat

# table_schema, dataframe, and table_id as defined elsewhere in your code:
job_config = bigquery.LoadJobConfig(
    schema=table_schema, source_format=SourceFormat.CSV
)
load_job = bigquery_client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)
load_job.result()

VinceCabs commented:

That was a great workaround, thanks!

Lingesh7 commented:

@tswast your workaround works.

tswast (Contributor) commented Jun 30, 2021

The backend bug with Parquet DATETIME loading (147108331) is reported to be fixed on the development server. We should be able to remove our client-side workarounds once the fix rolls out to production.

tswast (Contributor) commented Aug 23, 2021

@jimfulton Could you add system tests to https://github.com/googleapis/python-bigquery/blob/master/tests/system/test_pandas.py for uploading DATETIME columns?

While you're at it, could you add similar tests for TIME columns? I believe this will let us close https://issuetracker.google.com/169230812

In both cases, there can be a difference between millisecond-precision values and microsecond-precision values, so we should test for both.
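
A rough sketch of what such a test could look like (pytest-style; the bigquery_client and table_id fixtures are assumptions, not the repo's actual test helpers):

import datetime

import pandas as pd
from google.cloud import bigquery


def test_load_table_from_dataframe_with_datetime(bigquery_client, table_id):
    df = pd.DataFrame({
        "dt_ms": [pd.Timestamp("2020-01-08 08:00:00.123")],     # millisecond precision
        "dt_us": [pd.Timestamp("2020-01-08 08:00:00.123456")],  # microsecond precision
    })
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("dt_ms", "DATETIME"),
            bigquery.SchemaField("dt_us", "DATETIME"),
        ]
    )
    bigquery_client.load_table_from_dataframe(
        df, table_id, job_config=job_config
    ).result()

    row = list(bigquery_client.list_rows(table_id))[0]
    assert row["dt_ms"] == datetime.datetime(2020, 1, 8, 8, 0, 0, 123000)
    assert row["dt_us"] == datetime.datetime(2020, 1, 8, 8, 0, 0, 123456)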

Re: #56 (comment), let's revert the change to the default dtype mapping in a PR to the v3 branch, since that could be considered a breaking change.
