
Commit 5d04093

Author: Artemy Kolchinsky (committed)

Merge remote-tracking branch 'upstream/master' into sparse_with_nancols

2 parents: e9ed3d8 + 0f899f4

29 files changed: +706 / -129 lines

doc/README.rst

Lines changed: 1 addition & 1 deletion
@@ -132,7 +132,7 @@ If you want to do a full clean build, do::

     python make.py build

-Staring with 0.13.1 you can tell ``make.py`` to compile only a single section
+Starting with 0.13.1 you can tell ``make.py`` to compile only a single section
 of the docs, greatly reducing the turn-around time for checking your changes.
 You will be prompted to delete `.rst` files that aren't required, since the
 last committed version can always be restored from git.

doc/source/categorical.rst

Lines changed: 6 additions & 3 deletions
@@ -541,8 +541,12 @@ The same applies to ``df.append(df_different)``.
 Getting Data In/Out
 -------------------

-Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype will currently
-raise ``NotImplementedError``.
+.. versionadded:: 0.15.2
+
+Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype was implemented
+in 0.15.2. See :ref:`here <io.hdf5-categorical>` for an example and caveats.
+
+Writing data to/from Stata format files was implemented in 0.15.2.

 Writing to a CSV file will convert the data, effectively removing any information about the
 categorical (categories and ordering). So if you read back the CSV file you have to convert the

@@ -805,4 +809,3 @@ Use ``copy=True`` to prevent such a behaviour or simply don't reuse `Categorical
 This also happens in some cases when you supply a `numpy` array instead of a `Categorical`:
 using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behaviour, while using
 a string array (e.g. ``np.array(["a","b","c","a"])``) will not.
-

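A minimal sketch of the CSV caveat quoted in this hunk, with an in-memory buffer standing in for a real file (the column name ``cats`` is illustrative):

.. code-block:: python

   # Category information is lost on to_csv(); restore it by hand after
   # reading back. StringIO stands in for a file on disk.
   from io import StringIO
   import pandas as pd

   df = pd.DataFrame({"cats": pd.Series(list("abba")).astype("category")})
   csv = df.to_csv(index=False)                   # categories/ordering dropped
   df2 = pd.read_csv(StringIO(csv))
   df2["cats"] = df2["cats"].astype("category")   # convert back manually
   print(df2.dtypes)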
doc/source/groupby.rst

Lines changed: 15 additions & 0 deletions
@@ -338,6 +338,21 @@ In the case of grouping by multiple keys, the group name will be a tuple:
 It's standard Python-fu but remember you can unpack the tuple in the for loop
 statement if you wish: ``for (k1, k2), group in grouped:``.

+Selecting a group
+-----------------
+
+A single group can be selected using ``GroupBy.get_group()``:
+
+.. ipython:: python
+
+   grouped.get_group('bar')
+
+Or for an object grouped on multiple columns:
+
+.. ipython:: python
+
+   df.groupby(['A', 'B']).get_group(('bar', 'one'))
+
 .. _groupby.aggregate:

 Aggregation
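The snippet above relies on ``grouped`` and ``df`` defined earlier on the docs page; a self-contained sketch, with a small stand-in frame assumed in place of the docs' running example:

.. code-block:: python

   # Stand-in for the running example in the groupby docs.
   import numpy as np
   import pandas as pd

   df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'two'],
                      'C': np.random.randn(4)})

   grouped = df.groupby('A')
   print(grouped.get_group('bar'))                           # single key
   print(df.groupby(['A', 'B']).get_group(('bar', 'one')))   # tuple of keys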

doc/source/io.rst

Lines changed: 70 additions & 0 deletions
@@ -1992,6 +1992,27 @@ indices to be parsed.

   read_excel('path_to_file.xls', 'Sheet1', parse_cols=[0, 2, 3])

+.. note::
+
+   It is possible to transform the contents of Excel cells via the `converters`
+   option. For instance, to convert a column to boolean:
+
+   .. code-block:: python
+
+      read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})
+
+   This option handles missing values and treats exceptions in the converters
+   as missing data. Transformations are applied cell by cell rather than to the
+   column as a whole, so the array dtype is not guaranteed. For instance, a
+   column of integers with missing values cannot be transformed to an array
+   with integer dtype, because NaN is strictly a float. You can manually mask
+   missing data to recover integer dtype:
+
+   .. code-block:: python
+
+      cfun = lambda x: int(x) if x else -1
+      read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})
+
 To write a DataFrame object to a sheet of an Excel file, you can use the
 ``to_excel`` instance method. The arguments are largely the same as ``to_csv``
 described above, the first argument being the name of the excel file, and the
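A runnable form of the converters note above, assuming a local spreadsheet exists; the file name and the ``MyBools``/``MyInts`` column names are placeholders:

.. code-block:: python

   import pandas as pd

   cfun = lambda x: int(x) if x else -1      # mask missing cells as -1
   df = pd.read_excel('path_to_file.xls', 'Sheet1',
                      converters={'MyBools': bool, 'MyInts': cfun})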
@@ -3070,6 +3091,53 @@ conversion may not be necessary in future versions of pandas)
   df
   df.dtypes

+.. _io.hdf5-categorical:
+
+Categorical Data
+~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.15.2
+
+Writing data to a ``HDFStore`` that contains a ``category`` dtype was implemented
+in 0.15.2. Queries work the same as if it were an object array, but the ``category``
+dtyped data is stored in a more efficient manner.
+
+.. ipython:: python
+
+   dfcat = DataFrame({ 'A' : Series(list('aabbcdba')).astype('category'),
+                       'B' : np.random.randn(8) })
+   dfcat
+   dfcat.dtypes
+   cstore = pd.HDFStore('cats.h5', mode='w')
+   cstore.append('dfcat', dfcat, format='table', data_columns=['A'])
+   result = cstore.select('dfcat', where="A in ['b','c']")
+   result
+   result.dtypes
+
+.. warning::
+
+   The format of the ``Categorical`` is readable by prior versions of pandas (< 0.15.2),
+   but those versions will retrieve the data as an integer-based column (i.e. the ``codes``).
+   The ``categories`` *can* still be retrieved, but the user must select them manually
+   using the explicit meta path.
+
+   The data is stored like so:
+
+   .. ipython:: python
+
+      cstore
+
+      # to get the categories
+      cstore.select('dfcat/meta/A/meta')
+
+.. ipython:: python
+   :suppress:
+   :okexcept:
+
+   cstore.close()
+   import os
+   os.remove('cats.h5')
+
 String Columns
 ~~~~~~~~~~~~~~

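A hedged sketch of the recovery path the warning describes, under the assumption that an older pandas has returned the integer codes for column ``A``; the recombination step is illustrative, not part of the commit:

.. code-block:: python

   import pandas as pd

   store = pd.HDFStore('cats.h5')
   codes = store.select('dfcat')['A']          # integer codes under pandas < 0.15.2
   cats = store.select('dfcat/meta/A/meta')    # the stored categories
   store.close()

   recombined = pd.Categorical.from_codes(codes.values, categories=cats.values)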
@@ -3639,6 +3707,8 @@ outside of this range, the data is cast to ``int16``.
 data frames containing categorical data will convert non-string categorical values
 to strings.

+Writing data to/from Stata format files with a ``category`` dtype was implemented in 0.15.2.
+
 .. _io.stata_reader:

 Reading from STATA format
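A minimal sketch of the Stata round-trip added here, assuming pandas >= 0.15.2; ``cats.dta`` is an illustrative file name, and ``read_stata``'s default ``convert_categoricals=True`` handles the conversion back:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'grade': pd.Series(['a', 'b', 'a', 'c']).astype('category')})
   df.to_stata('cats.dta')
   back = pd.read_stata('cats.dta')       # value labels -> category again
   print(back['grade'].dtype)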

doc/source/remote_data.rst

Lines changed: 62 additions & 7 deletions
@@ -27,14 +27,14 @@ Remote Data Access

 .. _remote_data.data_reader:

-Functions from :mod:`pandas.io.data` extract data from various Internet
-sources into a DataFrame. Currently the following sources are supported:
+Functions from :mod:`pandas.io.data` and :mod:`pandas.io.ga` extract data from
+various Internet sources into a DataFrame. Currently the following sources
+are supported:

-- Yahoo! Finance
-- Google Finance
-- St. Louis FED (FRED)
-- Kenneth French's data library
-- World Bank
+- :ref:`Yahoo! Finance<remote_data.yahoo>`
+- :ref:`Google Finance<remote_data.google>`
+- :ref:`St. Louis FED (FRED)<remote_data.fred>`
+- :ref:`Kenneth French's data library<remote_data.ff>`
+- :ref:`World Bank<remote_data.wb>`
+- :ref:`Google Analytics<remote_data.ga>`

 It should be noted that various sources support different kinds of data, so not all sources implement the same methods and the data elements returned might also differ.

@@ -330,7 +330,62 @@ indicators, or a single "bad" (#4 above) country code).

 See docstrings for more info.

+.. _remote_data.ga:
+
+Google Analytics
+----------------
+
+The :mod:`~pandas.io.ga` module provides a wrapper for the
+`Google Analytics API <https://developers.google.com/analytics/devguides>`__
+to simplify retrieving traffic data.
+Result sets are parsed into a pandas DataFrame with a shape and data types
+derived from the source table.
+
+Configuring Access to Google Analytics
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The first thing you need to do is to set up access to the Google Analytics API. Follow the steps below:
+
+#. In the `Google Developers Console <https://console.developers.google.com>`__
+
+   #. enable the Analytics API
+   #. create a new project
+   #. create a new Client ID for an "Installed Application" (in the "APIs & auth / Credentials" section of the newly created project)
+   #. download it (JSON file)
+
+#. On your machine
+
+   #. rename it to ``client_secrets.json``
+   #. move it to the ``pandas/io`` module directory
+
+The first time you use the :func:`read_ga` function, a browser window will open and ask you to authenticate to the Google API. Do proceed.
+
+Using the Google Analytics API
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following will fetch users and pageviews (metrics) data per day of the week, for the first semester of 2014, from a particular property.
+
+.. code-block:: python
+
+   import pandas.io.ga as ga
+   ga.read_ga(
+       account_id  = "2360420",
+       profile_id  = "19462946",
+       property_id = "UA-2360420-5",
+       metrics     = ['users', 'pageviews'],
+       dimensions  = ['dayOfWeek'],
+       start_date  = "2014-01-01",
+       end_date    = "2014-08-01",
+       index_col   = 0,
+       filters     = "pagePath=~aboutus;ga:country==France",
+   )
+
+The only mandatory arguments are ``metrics``, ``dimensions`` and ``start_date``. We strongly recommend that you always specify the ``account_id``, ``profile_id`` and ``property_id``, to avoid accessing the wrong data bucket in Google Analytics.
+
+The ``index_col`` argument indicates which dimension(s) should be used as the index.
+
+The ``filters`` argument indicates the filtering to apply to the query. In the above example, the page URL has to contain ``aboutus`` AND the visitor's country has to be France.
+
+More detailed information can be found in the following:
+
+* `pandas & google analytics, by yhat <http://blog.yhathq.com/posts/pandas-google-analytics.html>`__
+* `Google Analytics integration in pandas, by Chang She <http://quantabee.wordpress.com/2012/12/17/google-analytics-pandas/>`__
+* `Google Analytics Dimensions and Metrics Reference <https://developers.google.com/analytics/devguides/reporting/core/dimsmets>`_

doc/source/whatsnew/v0.15.2.txt

Lines changed: 10 additions & 2 deletions
@@ -42,6 +42,9 @@ Enhancements
 ~~~~~~~~~~~~

 - Added ability to export Categorical data to Stata (:issue:`8633`).
+- Added ability to export Categorical data to/from HDF5 (:issue:`7621`). Queries work the same as if it were an object array, but the ``category`` dtyped data is stored in a more efficient manner. See :ref:`here <io.hdf5-categorical>` for an example and caveats w.r.t. prior versions of pandas.
+- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on the `Timestamp` class (:issue:`5351`).
+- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`.

 .. _whatsnew_0152.performance:

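A quick look at the three `Timestamp` classmethods named above; note that ``fromtimestamp()`` uses local time, so its output depends on the machine's timezone:

.. code-block:: python

   import datetime
   import pandas as pd

   print(pd.Timestamp.utcfromtimestamp(0))     # 1970-01-01 00:00:00
   print(pd.Timestamp.fromtimestamp(0))        # the epoch, in local time
   print(pd.Timestamp.combine(datetime.date(2014, 1, 1),
                              datetime.time(9, 30)))   # 2014-01-01 09:30:00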
@@ -58,12 +61,14 @@ Experimental

 Bug Fixes
 ~~~~~~~~~
+- Bug in packaging pandas with ``py2app/cx_Freeze`` (:issue:`8602`, :issue:`8831`)
 - Bug in ``groupby`` signatures that didn't include \*args or \*\*kwargs (:issue:`8733`).
 - ``io.data.Options`` now raises ``RemoteDataError`` when no expiry dates are available from Yahoo and when it receives no data from Yahoo (:issue:`8761`), (:issue:`8783`).
+- Unclear error message in csv parsing when passing dtype and names and the parsed data is a different data type (:issue:`8833`)
 - Bug in slicing a multi-index with an empty list and at least one boolean indexer (:issue:`8781`)
 - ``io.data.Options`` now raises ``RemoteDataError`` when no expiry dates are available from Yahoo (:issue:`8761`).
 - ``Timedelta`` kwargs may now be numpy ints and floats (:issue:`8757`).
-
+- ``sql_schema`` now generates dialect appropriate ``CREATE TABLE`` statements (:issue:`8697`)

@@ -93,4 +98,7 @@ Bug Fixes

 - Bug in `pd.infer_freq`/`DataFrame.inferred_freq` that prevented proper sub-daily frequency inference
   when the index contained DST days (:issue:`8772`).
-- Regression in ``Timestamp`` does not parse 'Z' zone designator for UTC (:issue:`8771`)
+- Bug where index name was still used when plotting a series with ``use_index=False`` (:issue:`8558`).
+
+- Bugs when trying to stack multiple columns, when some (or all)
+  of the level names are numbers (:issue:`8584`).

pandas/computation/pytables.py

Lines changed: 16 additions & 1 deletion
@@ -147,7 +147,17 @@ def is_in_table(self):
     @property
     def kind(self):
         """ the kind of my field """
-        return self.queryables.get(self.lhs)
+        return getattr(self.queryables.get(self.lhs), 'kind', None)
+
+    @property
+    def meta(self):
+        """ the meta of my field """
+        return getattr(self.queryables.get(self.lhs), 'meta', None)
+
+    @property
+    def metadata(self):
+        """ the metadata of my field """
+        return getattr(self.queryables.get(self.lhs), 'metadata', None)

     def generate(self, v):
         """ create and return the op string for this TermValue """

@@ -167,6 +177,7 @@ def stringify(value):
             return encoder(value)

         kind = _ensure_decoded(self.kind)
+        meta = _ensure_decoded(self.meta)
         if kind == u('datetime64') or kind == u('datetime'):
             if isinstance(v, (int, float)):
                 v = stringify(v)

@@ -182,6 +193,10 @@ def stringify(value):
         elif kind == u('timedelta64') or kind == u('timedelta'):
             v = _coerce_scalar_to_timedelta_type(v, unit='s').value
             return TermValue(int(v), v, kind)
+        elif meta == u('category'):
+            metadata = com._values_from_object(self.metadata)
+            result = metadata.searchsorted(v, side='left')
+            return TermValue(result, result, u('integer'))
         elif kind == u('integer'):
             v = int(float(v))
             return TermValue(v, v, kind)
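A toy sketch of what the new ``category`` branch does: the stored categories act as per-column metadata, and ``searchsorted`` maps a query value to its integer code so the comparison can run against the codes column (names and values here are illustrative):

.. code-block:: python

   import numpy as np

   categories = np.array(['a', 'b', 'c'])    # stored under the column's meta path
   codes = np.array([0, 0, 1, 2, 1])         # what the table actually holds

   code = categories.searchsorted('b', side='left')   # -> 1
   mask = codes == code                               # rows where A == 'b'
   print(code, mask)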

pandas/core/categorical.py

Lines changed: 10 additions & 2 deletions
@@ -319,6 +319,15 @@ def ndim(self):
         """Number of dimensions of the Categorical """
         return self._codes.ndim

+    def reshape(self, new_shape, **kwargs):
+        """ compat with .reshape """
+        return self
+
+    @property
+    def base(self):
+        """ compat, we are always our own object """
+        return None
+
     @classmethod
     def from_array(cls, data, **kwargs):
         """

@@ -363,10 +372,9 @@ def from_codes(cls, codes, categories, ordered=False, name=None):

         categories = cls._validate_categories(categories)

-        if codes.max() >= len(categories) or codes.min() < -1:
+        if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
             raise ValueError("codes need to be between -1 and len(categories)-1")

-
         return Categorical(codes, categories=categories, ordered=ordered, name=name, fastpath=True)

     _codes = None
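The new ``len(codes)`` guard matters because ``max()``/``min()`` raise on an empty array; a quick check of the behavior the fix enables:

.. code-block:: python

   import numpy as np
   import pandas as pd

   empty = pd.Categorical.from_codes(np.array([], dtype='int64'),
                                     categories=['a', 'b'])
   print(len(empty))   # 0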

pandas/core/internals.py

Lines changed: 1 addition & 1 deletion
@@ -4381,7 +4381,7 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):
         else:
             fill_value = upcasted_na

-        if self.is_null:
+        if self.is_null and not getattr(self.block, 'is_categorical', None):
             missing_arr = np.empty(self.shape, dtype=empty_dtype)
             if np.prod(self.shape):
                 # NumPy 1.6 workaround: this statement gets strange if all
