Commit 5a22fca

Author: tworec
BUG: fix read_gbq lost numeric precision

fixes:
- lost precision for longs above 2^53
- and floats above 10k
1 parent 136a6fb commit 5a22fca
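Both failure modes trace back to 64-bit floats: a float64 mantissa holds only 53 bits, so it cannot represent every 64-bit integer, and an unwanted float-to-int downcast truncates fractional values. A quick plain-Python illustration of the arithmetic (values invented, not pandas code):

```python
# Longs above 2**53: round-tripping through float64 silently rounds.
big_id = 2**53 + 1                    # 9007199254740993, e.g. a row identifier
assert float(big_id) == 2**53         # the +1 is lost in the 53-bit mantissa
assert int(str(big_id)) == big_id     # parsing the value as int keeps it exact

# Floats: forcing a float column down to int64 drops fractional parts.
price = 10000.5                       # a FLOAT value above 10k
assert int(price) == 10000            # what an unwanted downcast would keep
```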

File tree: 5 files changed (+197, -61 lines)

doc/source/install.rst

Lines changed: 5 additions & 8 deletions
```diff
@@ -250,9 +250,9 @@ Optional Dependencies
 * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

-  - `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
-  - `pymysql <https://github.com/PyMySQL/PyMySQL>`__: for MySQL.
-  - `SQLite <https://docs.python.org/3.5/library/sqlite3.html>`__: for SQLite, this is included in Python's standard library by default.
+  * `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
+  * `pymysql <https://github.com/PyMySQL/PyMySQL>`__: for MySQL.
+  * `SQLite <https://docs.python.org/3.5/library/sqlite3.html>`__: for SQLite, this is included in Python's standard library by default.

 * `matplotlib <http://matplotlib.org/>`__: for plotting
 * For Excel I/O:
@@ -272,11 +272,8 @@ Optional Dependencies
   <http://www.vergenet.net/~conrad/software/xsel/>`__, or `xclip
   <https://github.com/astrand/xclip/>`__: necessary to use
   :func:`~pandas.read_clipboard`. Most package managers on Linux distributions will have ``xclip`` and/or ``xsel`` immediately available for installation.
-* Google's `python-gflags <https://github.com/google/python-gflags/>`__ ,
-  `oauth2client <https://github.com/google/oauth2client>`__ ,
-  `httplib2 <http://pypi.python.org/pypi/httplib2>`__
-  and `google-api-python-client <http://github.com/google/google-api-python-client>`__
-  : Needed for :mod:`~pandas.io.gbq`
+* For Google BigQuery I/O - see :ref:`here <io.bigquery_deps>`.
+
 * `Backports.lzma <https://pypi.python.org/pypi/backports.lzma/>`__: Only for Python 2, for writing to and/or reading from an xz compressed DataFrame in CSV; Python 3 support is built into the standard library.
 * One of the following combinations of libraries is needed to use the
   top-level :func:`~pandas.read_html` function:
```

doc/source/io.rst

Lines changed: 47 additions & 14 deletions
```diff
@@ -39,7 +39,7 @@ object.
 * :ref:`read_json<io.json_reader>`
 * :ref:`read_msgpack<io.msgpack>` (experimental)
 * :ref:`read_html<io.read_html>`
-* :ref:`read_gbq<io.bigquery_reader>` (experimental)
+* :ref:`read_gbq<io.bigquery>` (experimental)
 * :ref:`read_stata<io.stata_reader>`
 * :ref:`read_sas<io.sas_reader>`
 * :ref:`read_clipboard<io.clipboard>`
@@ -55,7 +55,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
 * :ref:`to_json<io.json_writer>`
 * :ref:`to_msgpack<io.msgpack>` (experimental)
 * :ref:`to_html<io.html>`
-* :ref:`to_gbq<io.bigquery_writer>` (experimental)
+* :ref:`to_gbq<io.bigquery>` (experimental)
 * :ref:`to_stata<io.stata_writer>`
 * :ref:`to_clipboard<io.clipboard>`
 * :ref:`to_pickle<io.pickle>`
```
* :ref:`to_pickle<io.pickle>`
```diff
@@ -4559,16 +4559,11 @@ DataFrame with a shape and data types derived from the source table.
 Additionally, DataFrames can be inserted into new BigQuery tables or appended
 to existing tables.

-You will need to install some additional dependencies:
-
-- Google's `python-gflags <https://github.com/google/python-gflags/>`__
-- `httplib2 <http://pypi.python.org/pypi/httplib2>`__
-- `google-api-python-client <http://github.com/google/google-api-python-client>`__
-
 .. warning::

    To use this module, you will need a valid BigQuery account. Refer to the
-   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__ for details on the service itself.
+   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
+   for details on the service itself.

 The key functions are:

```
```diff
@@ -4582,7 +4577,44 @@ The key functions are:

 .. currentmodule:: pandas

-.. _io.bigquery_reader:
+
+Supported Data Types
+++++++++++++++++++++
+
+Pandas supports all these `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
+``STRING``, ``INTEGER`` (64 bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
+``TIMESTAMP`` (microsecond precision). The data types ``BYTES`` and ``RECORD``
+are not supported.
+
+Integer and boolean ``NA`` handling
++++++++++++++++++++++++++++++++++++
+
+.. versionadded:: 0.19
+
+Since all columns in BigQuery queries are nullable, and NumPy lacks ``NA``
+support for integer and boolean types, this module will store ``INTEGER`` or
+``BOOLEAN`` columns with at least one ``NULL`` value as ``dtype=object``.
+Otherwise those columns will be stored as ``dtype=int64`` or ``dtype=bool``
+respectively.
+
+This is the opposite of the default pandas behaviour, which promotes integer
+types to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
+for a detailed explanation.
+
+While this trade-off works well for most cases, it breaks down for storing
+values greater than 2**53. Such values in BigQuery can represent identifiers,
+and unnoticed precision loss for an identifier is what we want to avoid.
+
+.. _io.bigquery_deps:
+
+Dependencies
+++++++++++++
+
+This module requires the following additional dependencies:
+
+- `httplib2 <https://github.com/httplib2/httplib2>`__: HTTP client
+- `google-api-python-client <http://github.com/google/google-api-python-client>`__: Google's API client
+- `oauth2client <https://github.com/google/oauth2client>`__: authentication and authorization for Google's API

 .. _io.bigquery_authentication:

```
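The ``NA`` trade-off described in the added docs is easy to reproduce with plain pandas objects, without any BigQuery round-trip. A small sketch (column contents invented):

```python
import pandas as pd

# Default pandas behaviour: an integer column with a missing value is
# promoted to float64, which silently rounds values above 2**53.
promoted = pd.Series([2**53 + 1, None])
assert promoted.dtype == 'float64'
assert promoted[0] == 2**53                    # 2**53 + 1 was rounded

# What read_gbq now does instead: keep the column as dtype=object,
# so exact Python ints survive alongside the missing value.
preserved = pd.Series([2**53 + 1, None], dtype=object)
assert preserved.dtype == object
assert preserved[0] == 2**53 + 1               # exact

# Columns without any NULLs can still be stored as int64 / bool.
clean = pd.Series([2**53 + 1, 5], dtype=object).astype('int64')
assert clean.dtype == 'int64'
```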
```diff
@@ -4597,7 +4629,7 @@ Is possible to authenticate with either user account credentials or service acco
 Authenticating with user account credentials is as simple as following the prompts in a browser window
 which will be automatically opened for you. You will be authenticated to the specified
 ``BigQuery`` account using the product name ``pandas GBQ``. It is only possible on local host.
-The remote authentication using user account credentials is not currently supported in Pandas.
+The remote authentication using user account credentials is not currently supported in pandas.
 Additional information on the authentication mechanism can be found
 `here <https://developers.google.com/identity/protocols/OAuth2#clientside/>`__.

@@ -4606,8 +4638,6 @@ is particularly useful when working on remote servers (eg. jupyter iPython noteb
 Additional information on service accounts can be found
 `here <https://developers.google.com/identity/protocols/OAuth2#serviceaccount>`__.

-You will need to install an additional dependency: `oauth2client <https://github.com/google/oauth2client>`__.
-
 Authentication via ``application default credentials`` is also possible. This is only valid
 if the parameter ``private_key`` is not provided. This method also requires that
 the credentials can be fetched from the environment the code is running in.
@@ -4627,6 +4657,7 @@ Additional information on
 A private key can be obtained from the Google developers console by clicking
 `here <https://console.developers.google.com/permissions/serviceaccounts>`__. Use JSON key type.

+.. _io.bigquery_reader:

 Querying
 ''''''''
@@ -4686,7 +4717,6 @@ For more information about query configuration parameters see

 .. _io.bigquery_writer:

-
 Writing DataFrames
 ''''''''''''''''''

@@ -4776,6 +4806,8 @@ For example:
 often as the service seems to be changing and evolving. BiqQuery is best for analyzing large
 sets of data quickly, but it is not a direct replacement for a transactional database.

+.. _io.bigquery_create_tables:
+
 Creating BigQuery Tables
 ''''''''''''''''''''''''

@@ -4805,6 +4837,7 @@ produce the dictionary representation schema of the specified pandas DataFrame.
 the new table with a different name. Refer to
 `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

+
 .. _io.stata:

 Stata Format
```
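The querying and writing sections these anchors point at boil down to two calls. A hedged usage sketch based on the documented signatures (the query, table and project names are placeholders; ``private_key`` is only needed for service-account authentication, and treating it as a path to a JSON key file is an assumption here):

```python
import pandas as pd

# Querying: run legacy SQL against a placeholder project/table and get
# a DataFrame back, with the dtype rules described above applied.
df = pd.read_gbq('SELECT id, price FROM my_dataset.my_table',
                 project_id='my-project',
                 private_key='path/to/service_account_key.json')

# Writing: append the DataFrame to a (possibly new) BigQuery table.
df.to_gbq('my_dataset.my_table2', project_id='my-project',
          if_exists='append')
```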

doc/source/whatsnew/v0.20.0.txt

Lines changed: 2 additions & 0 deletions
```diff
@@ -299,6 +299,8 @@ Bug Fixes
 - Bug in ``pd.to_numeric()`` in which float and unsigned integer elements were being improperly casted (:issue:`14941`, :issue:`15005`)
 - Bug in ``pd.read_csv()`` in which the ``dialect`` parameter was not being verified before processing (:issue:`14898`)

+- The :func:`pd.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision loss for integers greater than 2**53. Furthermore ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss (:issue:`14064`).
+


 - Bug in ``pd.read_msgpack()`` in which ``Series`` categoricals were being improperly processed (:issue:`14901`)
```
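The ``FLOAT`` half of this entry is the subtler one: the removed infer-downcast pass could turn a float column into ``int64``, dropping fractional parts. A toy illustration of what that lost (values invented, no BigQuery round-trip):

```python
import pandas as pd

# New behaviour: a FLOAT column stays float64 ...
prices = pd.Series([10000.5, 123456.25])
assert prices.dtype == 'float64'

# ... whereas a forced downcast to int64 would have truncated it.
assert prices.astype('int64')[0] == 10000
```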

pandas/io/gbq.py

Lines changed: 13 additions & 11 deletions
```diff
@@ -603,18 +603,14 @@ def _parse_data(schema, rows):
     # see:
     # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
     # #missing-data-casting-rules-and-indexing
-    dtype_map = {'INTEGER': np.dtype(float),
-                 'FLOAT': np.dtype(float),
-                 # This seems to be buggy without nanosecond indicator
+    dtype_map = {'FLOAT': np.dtype(float),
                  'TIMESTAMP': 'M8[ns]'}

     fields = schema['fields']
     col_types = [field['type'] for field in fields]
     col_names = [str(field['name']) for field in fields]
     col_dtypes = [dtype_map.get(field['type'], object) for field in fields]
-    page_array = np.zeros((len(rows),),
-                          dtype=lzip(col_names, col_dtypes))
-
+    page_array = np.zeros((len(rows),), dtype=lzip(col_names, col_dtypes))
     for row_num, raw_row in enumerate(rows):
         entries = raw_row.get('f', [])
         for col_num, field_type in enumerate(col_types):
```
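With ``INTEGER`` dropped from ``dtype_map``, integer columns now fall through to ``object`` dtype in the structured array, so Python ints are stored unrounded. A standalone sketch of that mechanism (``lzip`` in the real code is pandas' alias for ``list(zip(...))``; the two-column schema below is invented):

```python
import numpy as np

# Invented schema: INTEGER falls through to object dtype, FLOAT stays float.
col_names = ['id', 'price']
col_dtypes = [object, np.dtype(float)]

# One structured record per result row, as in _parse_data.
page_array = np.zeros((2,), dtype=list(zip(col_names, col_dtypes)))
page_array[0] = (2**53 + 1, 10000.5)   # Python int survives in object field
page_array[1] = (2**53 + 2, 123.25)

assert page_array['id'][0] == 2**53 + 1             # no float rounding
assert page_array['price'].dtype == np.dtype(float)
```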
```diff
@@ -628,7 +624,9 @@ def _parse_data(schema, rows):
 def _parse_entry(field_value, field_type):
     if field_value is None or field_value == 'null':
         return None
-    if field_type == 'INTEGER' or field_type == 'FLOAT':
+    if field_type == 'INTEGER':
+        return int(field_value)
+    elif field_type == 'FLOAT':
         return float(field_value)
     elif field_type == 'TIMESTAMP':
         timestamp = datetime.utcfromtimestamp(float(field_value))
```
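BigQuery's JSON API delivers cell values as strings, so this parse step is where precision is decided. A quick check of why splitting the ``INTEGER`` branch matters (the value is made up):

```python
raw = "9007199254740993"               # 2**53 + 1, e.g. a row identifier

# Old path: INTEGER parsed via float() -- the last digit is rounded away.
assert float(raw) == 9007199254740992.0

# New path: INTEGER parsed via int() -- exact.
assert int(raw) == 9007199254740993
```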
```diff
@@ -757,10 +755,14 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,
                 'Column order does not match this DataFrame.'
             )

-    # Downcast floats to integers and objects to booleans
-    # if there are no NaN's. This is presently due to a
-    # limitation of numpy in handling missing data.
-    final_df._data = final_df._data.downcast(dtypes='infer')
+    # cast BOOLEAN and INTEGER columns from object to bool/int
+    # if they don't have any nulls
+    type_map = {'BOOLEAN': bool, 'INTEGER': int}
+    for field in schema['fields']:
+        if field['type'] in type_map and \
+                final_df[field['name']].notnull().all():
+            final_df[field['name']] = \
+                final_df[field['name']].astype(type_map[field['type']])

     connector.print_elapsed_seconds(
         'Total time taken',
```
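This replaces the blanket infer-downcast with a targeted, schema-driven cast. The same notnull-then-astype pattern can be exercised on a toy frame (the ``schema`` dict below mimics the BigQuery schema structure the loop iterates over; column contents are invented):

```python
import pandas as pd

# Mimic of a BigQuery schema and a freshly parsed result frame.
schema = {'fields': [{'name': 'id', 'type': 'INTEGER'},
                     {'name': 'active', 'type': 'BOOLEAN'}]}
final_df = pd.DataFrame({'id': pd.Series([2**53 + 1, 7], dtype=object),
                         'active': pd.Series([True, False], dtype=object)})

# Cast object columns to int/bool only when they contain no nulls,
# exactly the condition the new read_gbq code checks.
type_map = {'BOOLEAN': bool, 'INTEGER': int}
for field in schema['fields']:
    if field['type'] in type_map and final_df[field['name']].notnull().all():
        final_df[field['name']] = \
            final_df[field['name']].astype(type_map[field['type']])

assert final_df['id'].dtype == 'int64'
assert final_df['active'].dtype == 'bool'
```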
