DOC: updated dtypes docs #2947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged: 1 commit, Feb 28, 2013
157 changes: 126 additions & 31 deletions doc/source/basics.rst
@@ -10,7 +10,7 @@
np.set_printoptions(precision=4, suppress=True)

*****************************
Essential basic functionality
Essential Basic Functionality
*****************************

Here we discuss a lot of the essential functionality common to the pandas data
@@ -114,7 +114,7 @@ either match on the *index* or *columns* via the **axis** keyword:
d = {'one' : Series(randn(3), index=['a', 'b', 'c']),
'two' : Series(randn(4), index=['a', 'b', 'c', 'd']),
'three' : Series(randn(3), index=['b', 'c', 'd'])}
df = DataFrame(d)
df = df_orig = DataFrame(d)
df
row = df.ix[1]
column = df['two']
@@ -936,8 +936,8 @@ The ``by`` argument can take a list of column names, e.g.:

.. ipython:: python

df = DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
df[['one', 'two', 'three']].sort_index(by=['one','two'])
df1 = DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
df1[['one', 'two', 'three']].sort_index(by=['one','two'])

Series has the method ``order`` (analogous to `R's order function
<http://stat.ethz.ch/R-manual/R-patched/library/base/html/order.html>`__) which
@@ -959,10 +959,8 @@ Some other sorting notes / nuances:
method will likely be deprecated in a future release in favor of just using
``sort_index``.

.. _basics.cast:

Copying, type casting
---------------------
Copying
-------

The ``copy`` method on pandas objects copies the underlying data (though not
the axis indexes, since they are immutable) and returns a new object. Note that
@@ -978,36 +976,132 @@ To be clear, no pandas methods have the side effect of modifying your data;
almost all methods return new objects, leaving the original object
untouched. If data is modified, it is because you did so explicitly.
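
For example, a minimal sketch (assuming the usual doc setup where ``DataFrame``
is already imported): mutating a copy leaves the original object intact.

.. ipython:: python

   original = DataFrame({'x': [1, 2, 3]})
   copied = original.copy()
   copied['x'] = 99    # only the copy is modified
   original            # the original data is unchanged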

Data can be explicitly cast to a NumPy dtype by using the ``astype`` method or
alternately passing the ``dtype`` keyword argument to the object constructor.
.. _basics.dtypes:

DTypes
------

The main types stored in pandas objects are float, int, boolean, datetime64[ns],
and object. A convenient ``dtypes`` attribute for DataFrames returns a Series with
the data type of each column.

.. ipython:: python

df = DataFrame(np.arange(12).reshape((4, 3)))
df[0].dtype
df.astype(float)[0].dtype
df = DataFrame(np.arange(12).reshape((4, 3)), dtype=float)
df[0].dtype
dft = DataFrame(dict( A = np.random.rand(3), B = 1, C = 'foo', D = Timestamp('20010102'),
E = Series([1.0]*3).astype('float32'),
F = False,
G = Series([1]*3,dtype='int8')))
dft

If a column contains data of multiple dtypes, the dtype of the column will be
chosen to accommodate all of the data types (``object`` is the most general).
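
For instance (a small illustrative sketch, not part of the original example set,
assuming the standard doc namespace with ``Series`` imported), mixing types in a
single column falls back to ``object``:

.. ipython:: python

   Series([1, 2, 3]).dtype         # homogeneous integers
   Series([1, 2, 3.5]).dtype       # ints and floats upcast to float64
   Series([1, 2, 'foo']).dtype     # a string forces object dtype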

.. _basics.cast.infer:
The related method ``get_dtype_counts`` will return the number of columns of
each type:

Inferring better types for object columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. ipython:: python

The ``convert_objects`` DataFrame method will attempt to convert
``dtype=object`` columns to a better NumPy dtype. Occasionally (after
transposing multiple times, for example), a mixed-type DataFrame will end up
with everything as ``dtype=object``. This method attempts to fix that:
dft.get_dtype_counts()

Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.11.0).
If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``,
or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore,
different numeric dtypes will **NOT** be combined. The following example will give you a taste.

.. ipython:: python

df = DataFrame(randn(6, 3), columns=['a', 'b', 'c'])
df['d'] = 'foo'
df
df = df.T.T
df.dtypes
converted = df.convert_objects()
converted.dtypes
df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
df1
df1.dtypes
df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
B = Series(randn(8)),
C = Series(np.array(randn(8),dtype='uint8')) ))
df2
df2.dtypes

# here you get some upcasting
df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
df3
df3.dtypes

# this is lowest-common-denominator upcasting (meaning you get the dtype which can accommodate all of the types)
df3.values.dtype

.. _basics.cast:

Astype
~~~~~~

You can use the ``astype`` method to explicitly convert dtypes from one to another. This *always* returns a copy.
Upcasting always follows the **numpy** rules: if two different dtypes are involved in an operation,
then the more *general* one will be used as the result of the operation.

.. ipython:: python

df3
df3.dtypes

# conversion of dtypes
df3.astype('float32').dtypes
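
As a further illustration of the numpy promotion rules mentioned above (an
illustrative sketch, not part of the original example set), an operation between
two different numeric dtypes yields the more general one:

.. ipython:: python

   s1 = Series([1, 2, 3], dtype='int32')
   s2 = Series([0.5, 1.5, 2.5], dtype='float64')
   (s1 + s2).dtype    # int32 + float64 -> float64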

Object Conversion
~~~~~~~~~~~~~~~~~

To force numeric conversion, pass ``convert_numeric=True`` to ``convert_objects``.
This will force strings and numbers alike to be numbers if possible; otherwise
they will be set to ``np.nan``. To force conversion to ``datetime64[ns]``, pass
``convert_dates='coerce'``. This will convert any datetimelike object to dates,
forcing other values to ``NaT``.

In addition, ``convert_objects`` will attempt *soft* conversion of any *object*
dtypes, meaning that if all the objects in a Series are of the same type, the
Series will have that dtype.

.. ipython:: python

# mixed type conversions
df3['D'] = '1.'
df3['E'] = '1'
df3.convert_objects(convert_numeric=True).dtypes

# same, but specific dtype conversion
df3['D'] = df3['D'].astype('float16')
df3['E'] = df3['E'].astype('int32')
df3.dtypes

# forcing date coercion
s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, Timestamp('20010104'), '20010105'],dtype='O')
s
s.convert_objects(convert_dates='coerce')


Upcasting Gotchas
~~~~~~~~~~~~~~~~~

Performing indexing operations on ``integer`` type data can easily upcast the data to ``floating``.
The dtype of the input data will be preserved in cases where ``nans`` are not introduced (starting in v0.11.0).
See also :ref:`integer na gotchas <gotchas.intna>`.

.. ipython:: python

dfi = df3.astype('int32')
dfi['E'] = 1
dfi
dfi.dtypes

casted = dfi[dfi>0]
casted
casted.dtypes

Float dtypes, on the other hand, are unchanged.

.. ipython:: python

dfa = df3.copy()
dfa['A'] = dfa['A'].astype('float32')
dfa.dtypes

casted = dfa[dfa>0]
casted
casted.dtypes

.. _basics.serialize:

@@ -1157,8 +1251,9 @@ For instance:
.. ipython:: python

set_eng_float_format(accuracy=3, use_eng_prefix=True)
df['a']/1.e3
df['a']/1.e6
s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
s/1.e3
s/1.e6

.. ipython:: python
:suppress:
90 changes: 0 additions & 90 deletions doc/source/dsintro.rst
@@ -455,96 +455,6 @@ slicing, see the :ref:`section on indexing <indexing>`. We will address the
fundamentals of reindexing / conforming to new sets of labels in the
:ref:`section on reindexing <basics.reindexing>`.

DataTypes
~~~~~~~~~

.. _dsintro.column_types:

The main types stored in pandas objects are float, int, boolean, datetime64[ns],
and object. A convenient ``dtypes`` attribute return a Series with the data type of
each column.

.. ipython:: python

df['integer'] = 1
df['int32'] = df['integer'].astype('int32')
df['float32'] = Series([1.0]*len(df),dtype='float32')
df['timestamp'] = Timestamp('20010102')
df.dtypes

If a DataFrame contains columns of multiple dtypes, the dtype of the column
will be chosen to accommodate all of the data types (dtype=object is the most
general).

The related method ``get_dtype_counts`` will return the number of columns of
each type:

.. ipython:: python

df.get_dtype_counts()

Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.11.0).
If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``,
or a passed ``Series``, then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.

.. ipython:: python

df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
df1
df1.dtypes
df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
B = Series(randn(8)),
C = Series(np.array(randn(8),dtype='uint8')) ))
df2
df2.dtypes

# here you get some upcasting
df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
df3
df3.dtypes

# this is lower-common-denomicator upcasting (meaning you get the dtype which can accomodate all of the types)
df3.values.dtype

Upcasting is always according to the **numpy** rules. If two different dtypes are involved in an operation, then the more *general* one will be used as the result of the operation.

DataType Conversion
~~~~~~~~~~~~~~~~~~~

You can use the ``astype`` method to convert dtypes from one to another. These *always* return a copy.
In addition, ``convert_objects`` will attempt to *soft* conversion of any *object* dtypes, meaning that if all the objects in a Series are of the same type, the Series
will have that dtype.

.. ipython:: python

df3
df3.dtypes

# conversion of dtypes
df3.astype('float32').dtypes

To force conversion of specific types of number conversion, pass ``convert_numeric = True``.
This will force strings and numbers alike to be numbers if possible, otherwise the will be set to ``np.nan``.
To force conversion to ``datetime64[ns]``, pass ``convert_dates = 'coerce'``.
This will convert any datetimelike object to dates, forcing other values to ``NaT``.

.. ipython:: python

# mixed type conversions
df3['D'] = '1.'
df3['E'] = '1'
df3.convert_objects(convert_numeric=True).dtypes

# same, but specific dtype conversion
df3['D'] = df3['D'].astype('float16')
df3['E'] = df3['E'].astype('int32')
df3.dtypes

# forcing date coercion
s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, Timestamp('20010104'), '20010105'],dtype='O')
s
s.convert_objects(convert_dates='coerce')

Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2 changes: 2 additions & 0 deletions doc/source/gotchas.rst
@@ -38,6 +38,8 @@ detect NA values.
However, it comes with it a couple of trade-offs which I most certainly have
not ignored.

.. _gotchas.intna:

Support for integer ``NA``
~~~~~~~~~~~~~~~~~~~~~~~~~~

29 changes: 0 additions & 29 deletions doc/source/indexing.rst
@@ -304,35 +304,6 @@ so that the original data can be modified without creating a copy:

df.mask(df >= 0)

Upcasting Gotchas
~~~~~~~~~~~~~~~~~

Performing indexing operations on ``integer`` type data can easily upcast the data to ``floating``.
The dtype of the input data will be preserved in cases where ``nans`` are not introduced (coming soon).

.. ipython:: python

dfi = df.astype('int32')
dfi['E'] = 1
dfi
dfi.dtypes

casted = dfi[dfi>0]
casted
casted.dtypes

While float dtypes are unchanged.

.. ipython:: python

df2 = df.copy()
df2['A'] = df2['A'].astype('float32')
df2.dtypes

casted = df2[df2>0]
casted
casted.dtypes

Take Methods
~~~~~~~~~~~~

2 changes: 1 addition & 1 deletion doc/source/v0.4.x.txt
@@ -17,7 +17,7 @@ New Features
``MultiIndex`` (IS188_)
- :ref:`Set <indexing.mixed_type_setting>` values in mixed-type
``DataFrame`` objects via ``.ix`` indexing attribute (GH135_)
- Added new ``DataFrame`` :ref:`methods <dsintro.column_types>`
- Added new ``DataFrame`` :ref:`methods <basics.dtypes>`
``get_dtype_counts`` and property ``dtypes`` (ENHdc_)
- Added :ref:`ignore_index <merging.ignore_index>` option to
``DataFrame.append`` to stack DataFrames (ENH1b_)
2 changes: 1 addition & 1 deletion doc/source/v0.6.1.txt
@@ -23,7 +23,7 @@ New features
DataFrame, fast versions of scipy.stats.rankdata (GH428_)
- Implement :ref:`DataFrame.from_items <basics.dataframe.from_items>` alternate
constructor (GH444_)
- DataFrame.convert_objects method for :ref:`inferring better dtypes <basics.cast.infer>`
- DataFrame.convert_objects method for :ref:`inferring better dtypes <basics.cast>`
for object columns (GH302_)
- Add :ref:`rolling_corr_pairwise <stats.moments.corr_pairwise>` function for
computing Panel of correlation matrices (GH189_)