Commit 7aeacca

DOC: Document BigQuery to dtype translation for read_gbq
Adds a table documenting the current behavior, including that pandas 0.24.0+ stores TIMESTAMP columns as a time zone-aware dtype while earlier versions store them as a naive dtype. I could not figure out how to make 0.24.0+ store a naive dtype, nor could I figure out how to make earlier versions use a time zone-aware one.
1 parent 7edfc3e commit 7aeacca
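
For reference, a minimal sketch of the behavior the new table documents (the query and project ID below are placeholders for illustration, not part of this commit):

    import pandas_gbq

    # Hypothetical query returning a single TIMESTAMP column;
    # replace the placeholder project ID with your own.
    df = pandas_gbq.read_gbq(
        "SELECT CURRENT_TIMESTAMP() AS ts",
        project_id="my-project",  # placeholder
    )

    # pandas 0.24.0+ prints: datetime64[ns, UTC] (time zone aware)
    # earlier versions print: datetime64[ns]      (naive)
    print(df["ts"].dtype)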

3 files changed: +64 −19 lines changed


docs/source/changelog.rst

Lines changed: 6 additions & 0 deletions
@@ -8,6 +8,12 @@ Changelog
 
 - This fixes a bug where pandas-gbq could not upload an empty database. (:issue:`237`)
 
+Documentation
+~~~~~~~~~~~~~
+
+- Document :ref:`BigQuery data type to pandas dtype conversion
+  <reading-dtypes>` for ``read_gbq``. (:issue:`TBD`)
+
 Dependency updates
 ~~~~~~~~~~~~~~~~~~
 
docs/source/reading.rst

Lines changed: 55 additions & 19 deletions
@@ -9,21 +9,32 @@ Suppose you want to load all data from an existing BigQuery table
 
 .. code-block:: python
 
-   # Insert your BigQuery Project ID Here
-   # Can be found in the Google web console
+   import pandas_gbq
+
+   # TODO: Set your BigQuery Project ID.
    projectid = "xxxxxxxx"
 
-   data_frame = read_gbq('SELECT * FROM test_dataset.test_table', projectid)
+   data_frame = pandas_gbq.read_gbq(
+       'SELECT * FROM `test_dataset.test_table`',
+       project_id=projectid)
+
+.. note::
 
+   A project ID is sometimes optional if it can be inferred during
+   authentication, but it is required when authenticating with user
+   credentials. You can find your project ID in the `Google Cloud console
+   <https://console.cloud.google.com>`__.
 
 You can define which column from BigQuery to use as an index in the
 destination DataFrame as well as a preferred column order as follows:
 
 .. code-block:: python
 
-   data_frame = read_gbq('SELECT * FROM test_dataset.test_table',
-                         index_col='index_column_name',
-                         col_order=['col1', 'col2', 'col3'], projectid)
+   data_frame = pandas_gbq.read_gbq(
+       'SELECT * FROM `test_dataset.test_table`',
+       project_id=projectid,
+       index_col='index_column_name',
+       col_order=['col1', 'col2', 'col3'])
 
 
@@ -37,20 +48,45 @@ your job. For more information about query configuration parameters see `here
          "useQueryCache": False
      }
   }
-   data_frame = read_gbq('SELECT * FROM test_dataset.test_table',
-                         configuration=configuration, projectid)
-
+   data_frame = pandas_gbq.read_gbq(
+       'SELECT * FROM `test_dataset.test_table`',
+       project_id=projectid,
+       configuration=configuration)
 
-.. note::
-
-   You can find your project id in the `Google developers console
-   <https://console.developers.google.com>`__.
 
+The ``dialect`` argument can be used to indicate whether to use
+BigQuery's ``'legacy'`` SQL or BigQuery's ``'standard'`` SQL (beta). The
+default value is ``'standard'``. For more information on BigQuery's standard
+SQL, see `BigQuery SQL Reference
+<https://cloud.google.com/bigquery/docs/reference/standard-sql/>`__
 
-.. note::
+.. code-block:: python
 
-   The ``dialect`` argument can be used to indicate whether to use BigQuery's ``'legacy'`` SQL
-   or BigQuery's ``'standard'`` SQL (beta). The default value is ``'legacy'``, though this will change
-   in a subsequent release to ``'standard'``. For more information
-   on BigQuery's standard SQL, see `BigQuery SQL Reference
-   <https://cloud.google.com/bigquery/sql-reference/>`__
+   data_frame = pandas_gbq.read_gbq(
+       'SELECT * FROM [test_dataset.test_table]',
+       project_id=projectid,
+       dialect='legacy')
+
+
+.. _reading-dtypes:
+
+Inferring the DataFrame's dtypes
+--------------------------------
+
+The :func:`~pandas_gbq.read_gbq` method infers the pandas dtype for each column, based on the BigQuery table schema.
+
+================== =========================
+BigQuery Data Type dtype
+================== =========================
+FLOAT              float
+------------------ -------------------------
+TIMESTAMP          **pandas versions 0.24.0+**
+                   :class:`~pandas.DatetimeTZDtype` with
+                   ``unit='ns'`` and ``tz='UTC'``
+
+                   **Earlier versions**
+                   datetime64[ns]
+------------------ -------------------------
+DATETIME           datetime64[ns]
+TIME               datetime64[ns]
+DATE               datetime64[ns]
+================== =========================
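
As a hedged illustration of the new table (hypothetical column aliases and a placeholder project ID, not part of the commit), selecting one column per BigQuery type and printing the result's dtypes shows the inferred mapping:

    import pandas_gbq

    # One column per BigQuery type from the table, using standard SQL.
    df = pandas_gbq.read_gbq(
        """
        SELECT
          1.5 AS float_col,
          CURRENT_TIMESTAMP() AS timestamp_col,
          CURRENT_DATETIME() AS datetime_col,
          CURRENT_DATE() AS date_col,
          CURRENT_TIME() AS time_col
        """,
        project_id="my-project",  # placeholder
    )

    # Expected on pandas 0.24.0+:
    #   float_col                  float64
    #   timestamp_col  datetime64[ns, UTC]
    #   datetime_col        datetime64[ns]
    #   date_col            datetime64[ns]
    #   time_col            datetime64[ns]
    print(df.dtypes)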

pandas_gbq/gbq.py

Lines changed: 3 additions & 0 deletions
@@ -650,6 +650,9 @@ def _bqschema_to_nullsafe_dtypes(schema_fields):
     # See:
     # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
     # #missing-data-casting-rules-and-indexing
+    #
+    # If you update this mapping, also update the table at
+    # `docs/source/reading.rst`.
     dtype_map = {
         "FLOAT": np.dtype(float),
         # Even though TIMESTAMPs are timezone-aware in BigQuery, pandas doesn't
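
The rest of ``_bqschema_to_nullsafe_dtypes`` is truncated in this diff. As a rough sketch only, not the actual pandas-gbq implementation, a helper like this plausibly builds a column-name-to-dtype mapping from the schema, skipping types with no null-safe NumPy dtype:

    import numpy as np

    def bqschema_to_nullsafe_dtypes_sketch(schema_fields):
        # Sketch: map BigQuery types to dtypes that can represent NULLs.
        # Types left out (e.g. INTEGER) fall back to pandas' inference,
        # since NumPy integer dtypes cannot hold NaN.
        dtype_map = {
            "FLOAT": np.dtype(float),
            "DATETIME": "datetime64[ns]",
            "TIME": "datetime64[ns]",
            "DATE": "datetime64[ns]",
        }
        dtypes = {}
        for field in schema_fields:  # assumes dicts with 'name' and 'type'
            dtype = dtype_map.get(field["type"].upper())
            if dtype is not None:
                dtypes[field["name"]] = dtype
        return dtypes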
