Commit 13ae597

Merge pull request #2947 from jreback/dtypes_doc
DOC: updated dtypes docs
2 parents af60d93 + 63240d7 commit 13ae597

6 files changed, with 130 additions and 152 deletions
doc/source/basics.rst

Lines changed: 126 additions & 31 deletions
@@ -10,7 +10,7 @@
 np.set_printoptions(precision=4, suppress=True)
 
 *****************************
-Essential basic functionality
+Essential Basic Functionality
 *****************************
 
 Here we discuss a lot of the essential functionality common to the pandas data
@@ -114,7 +114,7 @@ either match on the *index* or *columns* via the **axis** keyword:
 d = {'one' : Series(randn(3), index=['a', 'b', 'c']),
 'two' : Series(randn(4), index=['a', 'b', 'c', 'd']),
 'three' : Series(randn(3), index=['b', 'c', 'd'])}
-df = DataFrame(d)
+df = df_orig = DataFrame(d)
 df
 row = df.ix[1]
 column = df['two']
@@ -936,8 +936,8 @@ The ``by`` argument can take a list of column names, e.g.:
 
 .. ipython:: python
 
-df = DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
-df[['one', 'two', 'three']].sort_index(by=['one','two'])
+df1 = DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
+df1[['one', 'two', 'three']].sort_index(by=['one','two'])
 
 Series has the method ``order`` (analogous to `R's order function
 <http://stat.ethz.ch/R-manual/R-patched/library/base/html/order.html>`__) which
@@ -959,10 +959,8 @@ Some other sorting notes / nuances:
 method will likely be deprecated in a future release in favor of just using
 ``sort_index``.
 
-.. _basics.cast:
-
-Copying, type casting
----------------------
+Copying
++-------
 
 The ``copy`` method on pandas objects copies the underlying data (though not
 the axis indexes, since they are immutable) and returns a new object. Note that
@@ -978,36 +976,132 @@ To be clear, no pandas methods have the side effect of modifying your data;
 almost all methods return new objects, leaving the original object
 untouched. If data is modified, it is because you did so explicitly.
 
-Data can be explicitly cast to a NumPy dtype by using the ``astype`` method or
-alternately passing the ``dtype`` keyword argument to the object constructor.
+DTypes
++------
+
+.. _basics.dtypes:
+
+The main types stored in pandas objects are float, int, boolean, datetime64[ns],
+and object. A convenient ``dtypes`` attribute for DataFrames returns a Series with
+the data type of each column.
 
 .. ipython:: python
 
-df = DataFrame(np.arange(12).reshape((4, 3)))
-df[0].dtype
-df.astype(float)[0].dtype
-df = DataFrame(np.arange(12).reshape((4, 3)), dtype=float)
-df[0].dtype
+dft = DataFrame(dict( A = np.random.rand(3), B = 1, C = 'foo', D = Timestamp('20010102'),
+E = Series([1.0]*3).astype('float32'),
+F = False,
+G = Series([1]*3,dtype='int8')))
+dft
+
+If a DataFrame contains columns of multiple dtypes, the dtype of the column
+will be chosen to accommodate all of the data types (dtype=object is the most
+general).
 
-.. _basics.cast.infer:
+The related method ``get_dtype_counts`` will return the number of columns of
+each type:
 
-Inferring better types for object columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. ipython:: python
 
-The ``convert_objects`` DataFrame method will attempt to convert
-``dtype=object`` columns to a better NumPy dtype. Occasionally (after
-transposing multiple times, for example), a mixed-type DataFrame will end up
-with everything as ``dtype=object``. This method attempts to fix that:
+dft.get_dtype_counts()
+
+Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.11.0).
+If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``,
+or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.
 
 .. ipython:: python
 
-df = DataFrame(randn(6, 3), columns=['a', 'b', 'c'])
-df['d'] = 'foo'
-df
-df = df.T.T
-df.dtypes
-converted = df.convert_objects()
-converted.dtypes
+df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
+df1
+df1.dtypes
+df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
+B = Series(randn(8)),
+C = Series(np.array(randn(8),dtype='uint8')) ))
+df2
+df2.dtypes
+
+# here you get some upcasting
+df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
+df3
+df3.dtypes
+
+# this is lowest-common-denominator upcasting (meaning you get the dtype which can accommodate all of the types)
+df3.values.dtype
+
+Astype
+~~~~~~
+
+.. _basics.cast:
+
+You can use the ``astype`` method to convert dtypes from one to another. These *always* return a copy.
+Upcasting is always according to the **numpy** rules. If two different dtypes are involved in an operation,
+then the more *general* one will be used as the result of the operation.
+
+.. ipython:: python
+
+df3
+df3.dtypes
+
+# conversion of dtypes
+df3.astype('float32').dtypes
+
+Object Conversion
+~~~~~~~~~~~~~~~~~
+
+To force conversion to numeric types, pass ``convert_numeric = True``.
+This will force strings and numbers alike to be numbers if possible; otherwise they will be set to ``np.nan``.
+To force conversion to ``datetime64[ns]``, pass ``convert_dates = 'coerce'``.
+This will convert any datetimelike object to dates, forcing other values to ``NaT``.
+
+In addition, ``convert_objects`` will attempt *soft* conversion of any *object* dtypes, meaning that if all
+the objects in a Series are of the same type, the Series will have that dtype.
+
+.. ipython:: python
+
+# mixed type conversions
+df3['D'] = '1.'
+df3['E'] = '1'
+df3.convert_objects(convert_numeric=True).dtypes
+
+# same, but specific dtype conversion
+df3['D'] = df3['D'].astype('float16')
+df3['E'] = df3['E'].astype('int32')
+df3.dtypes
+
+# forcing date coercion
+s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, Timestamp('20010104'), '20010105'],dtype='O')
+s
+s.convert_objects(convert_dates='coerce')
+
+
+Upcasting Gotchas
+~~~~~~~~~~~~~~~~~
+
+Performing indexing operations on ``integer`` type data can easily upcast the data to ``floating``.
+The dtype of the input data will be preserved in cases where ``nans`` are not introduced (starting in 0.11.0).
+See also :ref:`integer na gotchas <gotchas.intna>`.
+
+.. ipython:: python
+
+dfi = df3.astype('int32')
+dfi['E'] = 1
+dfi
+dfi.dtypes
+
+casted = dfi[dfi>0]
+casted
+casted.dtypes
+
+Float dtypes, by contrast, are unchanged.
+
+.. ipython:: python
+
+dfa = df3.copy()
+dfa['A'] = dfa['A'].astype('float32')
+dfa.dtypes
+
+casted = dfa[df2>0]
+casted
+casted.dtypes
 
 .. _basics.serialize:
 
@@ -1157,8 +1251,9 @@ For instance:
 .. ipython:: python
 
 set_eng_float_format(accuracy=3, use_eng_prefix=True)
-df['a']/1.e3
-df['a']/1.e6
+s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
+s/1.e3
+s/1.e6
 
 .. ipython:: python
 :suppress:
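
Reviewer note (illustrative, not part of the commit): the dtype behaviour described in the new ``DTypes`` and ``Astype`` sections can be checked with a short standalone script. This is a minimal sketch that assumes a current pandas and NumPy install, so it uses ``pd.DataFrame`` and ``to_numpy()`` rather than the 0.11-era spellings in the diff; the column names and values are made up.

```python
import pandas as pd

# Hypothetical frame with three different numeric dtypes.
df = pd.DataFrame({
    "A": pd.Series([0.5, 1.5, 2.5], dtype="float32"),
    "B": pd.Series([1.0, 2.0, 3.0], dtype="float64"),
    "C": pd.Series([1, 2, 3], dtype="uint8"),
})

# Each column keeps the dtype it was created with.
column_dtypes = df.dtypes

# Collapsing the frame into one ndarray forces a single common dtype.
combined_dtype = df.to_numpy().dtype

# astype returns a new object; the original frame is left untouched.
as_float32 = df.astype("float32")

print(column_dtypes)
print(combined_dtype)
print(as_float32.dtypes)
```

The per-column dtypes stay distinct, while the combined array takes the most general of the three, which mirrors the ``df3.values.dtype`` step in the documented example.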

doc/source/dsintro.rst

Lines changed: 0 additions & 90 deletions
@@ -455,96 +455,6 @@ slicing, see the :ref:`section on indexing <indexing>`. We will address the
 fundamentals of reindexing / conforming to new sets of lables in the
 :ref:`section on reindexing <basics.reindexing>`.
 
-DataTypes
-~~~~~~~~~~
-
-.. _dsintro.column_types:
-
-The main types stored in pandas objects are float, int, boolean, datetime64[ns],
-and object. A convenient ``dtypes`` attribute return a Series with the data type of
-each column.
-
-.. ipython:: python
-
-df['integer'] = 1
-df['int32'] = df['integer'].astype('int32')
-df['float32'] = Series([1.0]*len(df),dtype='float32')
-df['timestamp'] = Timestamp('20010102')
-df.dtypes
-
-If a DataFrame contains columns of multiple dtypes, the dtype of the column
-will be chosen to accommodate all of the data types (dtype=object is the most
-general).
-
-The related method ``get_dtype_counts`` will return the number of columns of
-each type:
-
-.. ipython:: python
-
-df.get_dtype_counts()
-
-Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.11.0).
-If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``,
-or a passed ``Series``, then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.
-
-.. ipython:: python
-
-df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
-df1
-df1.dtypes
-df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
-B = Series(randn(8)),
-C = Series(np.array(randn(8),dtype='uint8')) ))
-df2
-df2.dtypes
-
-# here you get some upcasting
-df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
-df3
-df3.dtypes
-
-# this is lower-common-denomicator upcasting (meaning you get the dtype which can accomodate all of the types)
-df3.values.dtype
-
-Upcasting is always according to the **numpy** rules. If two different dtypes are involved in an operation, then the more *general* one will be used as the result of the operation.
-
-DataType Conversion
-~~~~~~~~~~~~~~~~~~~~
-
-You can use the ``astype`` method to convert dtypes from one to another. These *always* return a copy.
-In addition, ``convert_objects`` will attempt to *soft* conversion of any *object* dtypes, meaning that if all the objects in a Series are of the same type, the Series
-will have that dtype.
-
-.. ipython:: python
-
-df3
-df3.dtypes
-
-# conversion of dtypes
-df3.astype('float32').dtypes
-
-To force conversion of specific types of number conversion, pass ``convert_numeric = True``.
-This will force strings and numbers alike to be numbers if possible, otherwise the will be set to ``np.nan``.
-To force conversion to ``datetime64[ns]``, pass ``convert_dates = 'coerce'``.
-This will convert any datetimelike object to dates, forcing other values to ``NaT``.
-
-.. ipython:: python
-
-# mixed type conversions
-df3['D'] = '1.'
-df3['E'] = '1'
-df3.convert_objects(convert_numeric=True).dtypes
-
-# same, but specific dtype conversion
-df3['D'] = df3['D'].astype('float16')
-df3['E'] = df3['E'].astype('int32')
-df3.dtypes
-
-# forcing date coercion
-s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, Timestamp('20010104'), '20010105'],dtype='O')
-s
-s.convert_objects(convert_dates='coerce')
-
 Data alignment and arithmetic
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
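
Reviewer note (illustrative, not part of the commit): ``convert_objects``, which the relocated examples rely on, was deprecated in later pandas releases and is absent from current ones. A rough present-day equivalent of the forced conversions is sketched below, assuming a recent pandas; ``pd.to_numeric`` and ``pd.to_datetime`` with ``errors="coerce"`` stand in for ``convert_numeric=True`` and ``convert_dates='coerce'``, and the sample values are invented.

```python
import pandas as pd

# Hypothetical object-dtype data mixing parseable and unparseable strings.
numbers = pd.Series(["1.", "1", "foo"], dtype=object)
dates = pd.Series(["2001-01-04", "foo"], dtype=object)

# Values that cannot be parsed become NaN instead of raising.
as_numeric = pd.to_numeric(numbers, errors="coerce")

# Values that cannot be parsed become NaT instead of raising.
as_dates = pd.to_datetime(dates, errors="coerce")

print(as_numeric)
print(as_dates)
```

Unlike ``convert_objects``, these functions work on one Series at a time, so a whole frame would need them applied column by column.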

doc/source/gotchas.rst

Lines changed: 2 additions & 0 deletions
@@ -38,6 +38,8 @@ detect NA values.
 However, it comes with it a couple of trade-offs which I most certainly have
 not ignored.
 
+.. _gotchas.intna:
+
 Support for integer ``NA``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 

doc/source/indexing.rst

Lines changed: 0 additions & 29 deletions
@@ -304,35 +304,6 @@ so that the original data can be modified without creating a copy:
 
 df.mask(df >= 0)
 
-Upcasting Gotchas
-~~~~~~~~~~~~~~~~~~
-
-Performing indexing operations on ``integer`` type data can easily upcast the data to ``floating``.
-The dtype of the input data will be preserved in cases where ``nans`` are not introduced (coming soon).
-
-.. ipython:: python
-
-dfi = df.astype('int32')
-dfi['E'] = 1
-dfi
-dfi.dtypes
-
-casted = dfi[dfi>0]
-casted
-casted.dtypes
-
-While float dtypes are unchanged.
-
-.. ipython:: python
-
-df2 = df.copy()
-df2['A'] = df2['A'].astype('float32')
-df2.dtypes
-
-casted = df2[df2>0]
-casted
-casted.dtypes
-
 Take Methods
 ~~~~~~~~~~~~
 
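
Reviewer note (illustrative, not part of the commit): the upcasting gotcha in the section moved out of this file can be reproduced in a few lines. This is a minimal sketch assuming a recent pandas; the frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical single-column frames, one integer and one float.
ints = pd.DataFrame({"A": [1, -2, 3]}, dtype="int32")
floats = pd.DataFrame({"A": [1.0, -2.0, 3.0]}, dtype="float32")

# Masked-out entries become NaN, which an integer column cannot hold,
# so the column is upcast to a floating dtype.
masked_ints = ints[ints > 0]

# A float column can already hold NaN, so its dtype is preserved.
masked_floats = floats[floats > 0]

print(masked_ints.dtypes)
print(masked_floats.dtypes)
```

The upcast happens only because the mask introduces a missing value; the original ``ints`` frame keeps its integer dtype.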

doc/source/v0.4.x.txt

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ New Features
 ``MultiIndex`` (IS188_)
 - :ref:`Set <indexing.mixed_type_setting>` values in mixed-type
 ``DataFrame`` objects via ``.ix`` indexing attribute (GH135_)
-- Added new ``DataFrame`` :ref:`methods <dsintro.column_types>`
+- Added new ``DataFrame`` :ref:`methods <basics.dtypes>`
 ``get_dtype_counts`` and property ``dtypes`` (ENHdc_)
 - Added :ref:`ignore_index <merging.ignore_index>` option to
 ``DataFrame.append`` to stack DataFrames (ENH1b_)

doc/source/v0.6.1.txt

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ New features
 DataFrame, fast versions of scipy.stats.rankdata (GH428_)
 - Implement :ref:`DataFrame.from_items <basics.dataframe.from_items>` alternate
 constructor (GH444_)
-- DataFrame.convert_objects method for :ref:`inferring better dtypes <basics.cast.infer>`
+- DataFrame.convert_objects method for :ref:`inferring better dtypes <basics.cast>`
 for object columns (GH302_)
 - Add :ref:`rolling_corr_pairwise <stats.moments.corr_pairwise>` function for
 computing Panel of correlation matrices (GH189_)
