Description
Currently, for APIs that can use the BQ Storage client to fetch data, such as to_dataframe_iterable or to_arrow_iterable, the client library always uses the maximum number of read streams recommended by the BQ server.
```python
requested_streams = 1 if preserve_order else 0
```

python-bigquery/google/cloud/bigquery/_pandas_helpers.py (lines 854 to 858 in ef8e927):

```python
session = bqstorage_client.create_read_session(
    parent="projects/{}".format(project_id),
    read_session=requested_session,
    max_stream_count=requested_streams,
)
```
This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.
The BQ Storage API documentation also suggests capping max_stream_count when resources are constrained:

> Typically, clients should either leave this unset to let the system to determine an upper bound OR set this a size for the maximum "units of work" it can gracefully handle.
This problem has been encountered by others before and can be worked around by monkey-patching create_read_session on the BQ Storage client object: #1292
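For illustration, here is a minimal sketch of that kind of workaround (my paraphrase, not the exact code from #1292; the cap of 4 streams and the sample query are placeholders): wrap create_read_session so the requested max_stream_count is always bounded before the request is sent.

```python
from google.cloud import bigquery, bigquery_storage

MAX_STREAMS = 4  # hypothetical cap; choose based on how much memory you can spare

bqstorage_client = bigquery_storage.BigQueryReadClient()
original_create_read_session = bqstorage_client.create_read_session


def capped_create_read_session(*args, **kwargs):
    requested = kwargs.get("max_stream_count", 0)
    # 0 means "let the server decide", which is exactly what we want to avoid here.
    if requested == 0 or requested > MAX_STREAMS:
        kwargs["max_stream_count"] = MAX_STREAMS
    return original_create_read_session(*args, **kwargs)


# Monkey-patch this client instance only.
bqstorage_client.create_read_session = capped_create_read_session

bq_client = bigquery.Client()
rows = bq_client.query(
    "SELECT * FROM `bigquery-public-data.samples.shakespeare`"
).result()

# Pass the patched client explicitly so the capped session is the one used.
for frame in rows.to_dataframe_iterable(bqstorage_client=bqstorage_client):
    ...  # process each DataFrame chunk as it arrives
```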
However, it should really be fixed by allowing the max_stream_count parameter to be set through the public API.
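One possible shape for that API (the parameter name and placement are only a suggestion, not an existing interface; `sql`, `process`, and the clients are the placeholders from the sketch above) would be to accept the cap on the iterable methods and forward it to create_read_session:

```python
# Hypothetical API: max_stream_count is NOT currently accepted by these methods.
rows = bq_client.query(sql).result()

for frame in rows.to_dataframe_iterable(
    bqstorage_client=bqstorage_client,
    max_stream_count=4,  # forwarded to create_read_session instead of the current 0/1
):
    process(frame)
```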