Change to_gbq() from stream to batch loading #14670
Comments
cc @parthea
Hey, just want to pick up this discussion again. We have had some more confirmed cases of streamed data not appearing in a timely manner. However, the delay is not deterministic: it depends on Google, and speaking to Jordan Tigani, he said the only guarantee they provide is 90 minutes, though typical cases are much shorter, obviously.

I think most users would expect to_gbq() to be transactional. The ability to query directly after to_gbq() has executed is probably a common use case. For example, one may use pandas to make a scoring table, push it to BigQuery, and run a query over a huge dataset joined to the scoring table, where the huge dataset is too large to pull into local memory. At work, we've written a helper function that uses the non-streaming load, and most people are avoiding to_gbq() because of this issue.

I see two tacks: do a batch load via a direct HTTP upload, or stage the data in a Google Cloud Storage bucket and load from there. However, this latter option requires the user to have write access to a Google bucket, which is a more complicated config. Thoughts?
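For context, a minimal sketch of what such a non-streaming helper can look like with today's google-cloud-bigquery client (project, dataset and table names are hypothetical; serializing the DataFrame requires pyarrow):

```python
import pandas as pd
from google.cloud import bigquery


def to_gbq_batch(df, destination, project_id):
    """Load a DataFrame via a batch load job instead of streaming inserts."""
    client = bigquery.Client(project=project_id)
    # load_table_from_dataframe starts a load job, i.e. batch semantics
    job = client.load_table_from_dataframe(df, destination)
    # result() blocks until the job commits; rows are queryable after this
    job.result()


df = pd.DataFrame({'name': ['a', 'b'], 'score': [1.0, 2.0]})
to_gbq_batch(df, 'my_dataset.scores', 'my-project')
```

Because the load job is atomic, a query issued right after to_gbq_batch() returns sees either all of the rows or (on failure) none of them.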
Here's code to see the behaviour:
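A minimal sketch of that check, assuming the pandas 0.18-era pandas.io.gbq module and a hypothetical project 'my-project' with table 'my_dataset.delay_test':

```python
import pandas as pd
import pandas.io.gbq as gbq

df = pd.DataFrame({'ts': [pd.Timestamp.now()], 'value': [42]})

# to_gbq() streams rows via tabledata.insertAll under the hood
gbq.to_gbq(df, 'my_dataset.delay_test', project_id='my-project',
           if_exists='append')

# Query immediately afterwards: the freshly streamed rows may be missing,
# because streaming inserts are only eventually consistent (up to ~90 min).
out = gbq.read_gbq('SELECT COUNT(*) AS n FROM [my_dataset.delay_test]',
                   project_id='my-project')
print(out)  # often reports fewer rows than were just written
```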
Agree in principle that a batch upload is a much better fit for this than a streaming load
Where do you see that? While you could load into Google Storage and then from there into BigQuery, you can also do a single POST request: https://cloud.google.com/bigquery/loading-data-post-request
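For the record, a single-POST load along the lines of that doc can be sketched with google-api-python-client (hypothetical names; assumes application default credentials):

```python
import io

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload

service = build('bigquery', 'v2')

# Newline-delimited JSON rows carried directly in the POST body,
# with no intermediate Google Cloud Storage bucket involved.
payload = io.BytesIO(b'{"name": "a", "score": 1.0}\n')

job_body = {
    'configuration': {
        'load': {
            'destinationTable': {
                'projectId': 'my-project',  # hypothetical
                'datasetId': 'my_dataset',  # hypothetical
                'tableId': 'scores',        # hypothetical
            },
            'schema': {'fields': [
                {'name': 'name', 'type': 'STRING'},
                {'name': 'score', 'type': 'FLOAT'},
            ]},
            'sourceFormat': 'NEWLINE_DELIMITED_JSON',
            'writeDisposition': 'WRITE_APPEND',
        }
    }
}
media = MediaIoBaseUpload(payload, mimetype='application/octet-stream')
job = service.jobs().insert(projectId='my-project', body=job_body,
                            media_body=media).execute()
```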
Ah you're right. I misunderstood. If we agree that batch is better, does anyone object if I try to write this patch myself over the holidays? I've wanted to contribute to pandas in some way for years and this would be a nice easy one to get my feet wet.
moved to googleapis/python-bigquery-pandas#7 |
A small, complete example of the issue
Recently we have discovered that streaming loads into Google BigQuery tables are best-effort and can be delayed by up to 90 minutes; they carry no transactional guarantees. As such, to_gbq() returning is no guarantee of anything.

I propose that we switch the loading from tableData.insertAll() to a batch BigQuery load, either using an HTTP upload or pushing to a cloud bucket. The latter would suck because it requires bucket permissions too, but I'm not sure of the size bounds for the HTTP upload version.
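For reference, the streaming path this would replace looks roughly like this at the REST level (a sketch; project, dataset and table names are hypothetical):

```python
from googleapiclient.discovery import build

service = build('bigquery', 'v2')

# tabledata.insertAll returns as soon as the rows are buffered;
# there is no guarantee about when they become queryable.
service.tabledata().insertAll(
    projectId='my-project',
    datasetId='my_dataset',
    tableId='scores',
    body={'rows': [{'json': {'name': 'a', 'score': 1.0}}]},
).execute()
```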
However, atomicity seems important in the case where we next execute a BigQuery query against the recently uploaded to_gbq() results, and that is worth a tradeoff in terms of time.
At least we could include an option.
Thoughts?
(not sure how to label this as IO:google)
Output of pd.show_versions()
pandas: 0.18.1
nose: 1.3.7
pip: 9.0.0
setuptools: 25.2.0
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2015.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.4
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None