Commit 45fd5f6

Merge branch 'master' of https://github.com/pandas-dev/pandas into Tdtype
2 parents 739bd69 + 7deda21 commit 45fd5f6

52 files changed

+1028
-489
lines changed

asv_bench/benchmarks/replace.py

Lines changed: 17 additions & 0 deletions
@@ -36,6 +36,23 @@ def time_replace_series(self, inplace):
3636
self.s.replace(self.to_rep, inplace=inplace)
3737

3838

39+
class ReplaceList:
40+
    # GH#28099
41+
42+
    params = [(True, False)]
43+
    param_names = ["inplace"]
44+
45+
    def setup(self, inplace):
46+
        self.df = pd.DataFrame({"A": 0, "B": 0}, index=range(4 * 10 ** 7))
47+
48+
    def time_replace_list(self, inplace):
49+
        self.df.replace([np.inf, -np.inf], np.nan, inplace=inplace)
50+
51+
    def time_replace_list_one_match(self, inplace):
52+
        # the 1 can be held in self._df.blocks[0], while the inf and -inf can't
53+
        self.df.replace([np.inf, -np.inf, 1], np.nan, inplace=inplace)
54+
55+
3956
class Convert:
4057

4158
params = (["DataFrame", "Series"], ["Timestamp", "Timedelta"])
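The `ReplaceList` class above relies on the asv convention that `setup` and each `time_*` method receive one value per entry in `param_names`, taken from the combinations of `params`. A minimal sketch of that convention follows, assuming a hand-rolled runner (`run_benchmarks`) and a stand-in benchmark (`ReplaceListSketch`) that uses plain lists instead of a DataFrame. Both names are hypothetical, and this is not how asv itself is implemented.

```python
import itertools
import timeit


class ReplaceListSketch:
    # Hypothetical stand-in for ReplaceList: a plain list replaces the DataFrame.
    params = [(True, False)]
    param_names = ["inplace"]

    def setup(self, inplace):
        self.values = [0] * 10_000

    def time_replace_list(self, inplace):
        to_replace = {float("inf"), float("-inf")}
        replaced = [None if v in to_replace else v for v in self.values]
        if inplace:
            self.values[:] = replaced


def run_benchmarks(bench_cls, number=5):
    """Time every ``time_*`` method once per parameter combination."""
    results = {}
    method_names = [name for name in dir(bench_cls) if name.startswith("time_")]
    for combo in itertools.product(*bench_cls.params):
        for name in method_names:
            bench = bench_cls()
            bench.setup(*combo)  # fresh state for each combination
            method = getattr(bench, name)
            elapsed = timeit.timeit(lambda: method(*combo), number=number)
            results[(name, combo)] = elapsed / number
    return results


results = run_benchmarks(ReplaceListSketch)
for (name, combo), seconds in sorted(results.items()):
    print(name, dict(zip(ReplaceListSketch.param_names, combo)), f"{seconds:.6f}s")
```

With a single parameter, `params = [(True, False)]` yields two runs per method, one with `inplace=True` and one with `inplace=False`.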

doc/source/development/contributing.rst

Lines changed: 130 additions & 0 deletions
@@ -699,6 +699,136 @@ You'll also need to
699699

700700
See :ref:`contributing.warnings` for more.
701701

702+
.. _contributing.type_hints:
703+
704+
Type Hints
705+
----------
706+
707+
*pandas* strongly encourages the use of :pep:`484` style type hints. New development should contain type hints, and pull requests to annotate existing code are accepted as well!
708+
709+
Style Guidelines
710+
~~~~~~~~~~~~~~~~
711+
712+
Type imports should follow the ``from typing import ...`` convention. So rather than
713+
714+
.. code-block:: python
715+
716+
import typing
717+
718+
primes = [] # type: typing.List[int]
719+
720+
You should write
721+
722+
.. code-block:: python
723+
724+
from typing import List, Optional, Union
725+
726+
primes = [] # type: List[int]
727+
728+
``Optional`` should be used where applicable, so instead of
729+
730+
.. code-block:: python
731+
732+
maybe_primes = [] # type: List[Union[int, None]]
733+
734+
You should write
735+
736+
.. code-block:: python
737+
738+
maybe_primes = [] # type: List[Optional[int]]
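As a quick check of the guideline above, ``Optional[X]`` is shorthand for ``Union[X, None]``, so the two spellings describe the same type. The sketch below assumes a hypothetical helper, `first_known`, purely for illustration.

```python
from typing import List, Optional, Union

# Optional[int] and Union[int, None] are the same type at runtime.
same_type = Optional[int] == Union[int, None]
print(same_type)  # → True


def first_known(candidates: List[Optional[int]]) -> Optional[int]:
    """Return the first entry that is not None, or None if there is none."""
    for value in candidates:
        if value is not None:
            return value
    return None


print(first_known([None, 2, 3]))  # → 2
```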
739+
740+
In some cases, classes in the code base may define class variables that shadow builtins. This causes an issue as described in `Mypy 1775 <https://github.com/python/mypy/issues/1775#issuecomment-310969854>`_. The defensive solution here is to create an unambiguous alias of the builtin and use that within your annotation. For example, if you come across a definition like
741+
742+
.. code-block:: python
743+
744+
class SomeClass1:
745+
    str = None
746+
747+
The appropriate way to annotate this would be as follows
748+
749+
.. code-block:: python
750+
751+
str_type = str
752+
753+
class SomeClass2:
754+
    str = None # type: str_type
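The shadowing problem can also be observed at runtime, because annotations on methods are resolved in the class scope. The following sketch uses two hypothetical classes, `Shadowed` and `Aliased`, and assumes annotations are evaluated normally (no ``from __future__ import annotations``).

```python
str_type = str  # unambiguous alias, created before any class shadows ``str``


class Shadowed:
    str = None

    # Within the class body ``str`` now names the attribute above, so this
    # annotation resolves to None rather than to the builtin type.
    def describe(self, value: str):
        return value


class Aliased:
    str = None  # type: str_type

    # The alias still points at the builtin, so the annotation is unambiguous.
    def describe(self, value: str_type) -> str_type:
        return value.upper()


print(Shadowed.describe.__annotations__["value"])
print(Aliased.describe.__annotations__["value"])
```

The first annotation is the shadowing attribute's value, while the second is the builtin ``str`` type.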
755+
756+
In some cases you may be tempted to use ``cast`` from the typing module when you know better than the analyzer. This occurs particularly when using custom inference functions. For example
757+
758+
.. code-block:: python
759+
760+
from typing import Union, cast
761+
762+
from pandas.core.dtypes.common import is_number
763+
764+
def cannot_infer_bad(obj: Union[str, int, float]):
765+
766+
    if is_number(obj):
767+
        ...
768+
    else: # Reasonably only str objects would reach this but...
769+
        obj = cast(str, obj) # Mypy complains without this!
770+
        return obj.upper()
771+
772+
The limitation here is that while a human can reasonably understand that ``is_number`` would catch the ``int`` and ``float`` types, mypy cannot make that same inference just yet (see `mypy #5206 <https://github.com/python/mypy/issues/5206>`_). While the above works, the use of ``cast`` is **strongly discouraged**. Where applicable, a refactor of the code to appease static analysis is preferable
773+
774+
.. code-block:: python
775+
776+
def cannot_infer_good(obj: Union[str, int, float]):
776+
777+
    if isinstance(obj, str):
778+
        return obj.upper()
779+
    else:
780+
        ...
782+
783+
With custom types and inference this is not always possible so exceptions are made, but every effort should be exhausted to avoid ``cast`` before going down such paths.
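One reason to be wary of ``cast`` is that it performs no check at runtime: it returns its argument unchanged and only informs the type checker. A minimal, standard-library-only sketch follows. `looks_numeric` is a hypothetical stand-in for a custom inference function such as ``is_number``, and the two `shout_*` functions are illustrative names.

```python
from typing import Union, cast


def looks_numeric(obj: object) -> bool:
    # Hypothetical custom inference helper; a checker cannot narrow through it.
    return isinstance(obj, (int, float))


def shout_with_cast(obj: Union[str, int, float]) -> str:
    if looks_numeric(obj):
        return str(obj)
    # cast() silences the checker but verifies nothing at runtime.
    text = cast(str, obj)
    return text.upper()


def shout_with_isinstance(obj: Union[str, int, float]) -> str:
    # isinstance() narrows the type for the checker and is checked at runtime.
    if isinstance(obj, str):
        return obj.upper()
    return str(obj)


unchecked = cast(str, 3.5)
print(type(unchecked).__name__)  # → float
print(shout_with_isinstance("abc"))  # → ABC
```

Both functions behave the same for valid input, but only the ``isinstance`` version keeps the static and runtime views of the type in agreement.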
784+
785+
Syntax Requirements
786+
~~~~~~~~~~~~~~~~~~~
787+
788+
Because *pandas* still supports Python 3.5, :pep:`526` does not apply and variables **must** be annotated with type comments. Specifically, this is a valid annotation within pandas:
789+
790+
.. code-block:: python
791+
792+
primes = [] # type: List[int]
793+
794+
Whereas this is **NOT** allowed:
795+
796+
.. code-block:: python
797+
798+
primes: List[int] = [] # not supported in Python 3.5!
799+
800+
Note that function signatures can always be annotated per :pep:`3107`:
801+
802+
.. code-block:: python
803+
804+
def sum_of_primes(primes: List[int] = []) -> int:
805+
    ...
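To make the distinction concrete, the sketch below combines a :pep:`3107` signature annotation with a type comment on a local variable. The function body and the ``None`` default are assumptions made for this illustration and are not part of the guideline above.

```python
from typing import List, Optional


def sum_of_primes(primes: Optional[List[int]] = None) -> int:
    # A None default avoids sharing one mutable list between calls; this is a
    # choice made for this sketch only.
    total = 0  # type: int
    for prime in primes or []:
        total += prime
    return total


print(sum_of_primes([2, 3, 5]))  # → 10
# Signature annotations are stored on the function; type comments are not.
print(sorted(sum_of_primes.__annotations__))  # → ['primes', 'return']
```

The type comment on ``total`` is visible only to tools that parse comments, such as mypy, which is why it remains valid syntax on Python 3.5.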
806+
807+
808+
Pandas-specific Types
809+
~~~~~~~~~~~~~~~~~~~~~
810+
811+
Commonly used types specific to *pandas* will appear in `pandas._typing <https://github.com/pandas-dev/pandas/blob/master/pandas/_typing.py>`_ and you should use these where applicable. This module is private for now, but ultimately it should be exposed to third-party libraries that want to implement type checking against pandas.
812+
813+
For example, quite a few functions in *pandas* accept a ``dtype`` argument. This can be expressed as a string like ``"object"``, a ``numpy.dtype`` like ``np.int64`` or even a pandas ``ExtensionDtype`` like ``pd.CategoricalDtype``. Rather than burden the user with having to constantly annotate all of those options, this can simply be imported and reused from the ``pandas._typing`` module
814+
815+
.. code-block:: python
816+
817+
from pandas._typing import Dtype
818+
819+
def as_type(dtype: Dtype) -> ...:
820+
    ...
821+
822+
This module will ultimately house types for repeatedly used concepts like "path-like", "array-like", "numeric", etc. and can also hold aliases for commonly appearing parameters like ``axis``. Development of this module is active, so be sure to refer to the source for the most up-to-date list of available types.
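As an illustration of how a shared alias for a concept like "path-like" might look, here is a standard-library-only sketch. The alias name `FilePath` and the helper `normalize_path` are hypothetical and are not taken from ``pandas._typing``.

```python
import os
from pathlib import PurePosixPath
from typing import Union

# Hypothetical alias: defined once, then reused in every signature that
# accepts either a string or an os.PathLike object.
FilePath = Union[str, os.PathLike]


def normalize_path(path: FilePath) -> str:
    """Return the plain string form of a path-like argument."""
    return os.fspath(path)


print(normalize_path("data/prices.csv"))
print(normalize_path(PurePosixPath("data/prices.csv")))
```

Defining the union once keeps signatures short and means a later change to the accepted types only has to be made in one place.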
823+
824+
Validating Type Hints
825+
~~~~~~~~~~~~~~~~~~~~~
826+
827+
*pandas* uses `mypy <http://mypy-lang.org>`_ to statically analyze the code base and type hints. After making any change you can ensure your type hints are correct by running
828+
829+
.. code-block:: shell
830+
831+
mypy pandas
702832
703833
.. _contributing.ci:
704834

doc/source/getting_started/10min.rst

Lines changed: 1 addition & 1 deletion
@@ -278,7 +278,7 @@ Using a single column's values to select data.
278278

279279
.. ipython:: python
280280
281-
df[df.A > 0]
281+
df[df['A'] > 0]
282282
283283
Selecting values from a DataFrame where a boolean condition is met.
284284

doc/source/getting_started/basics.rst

Lines changed: 6 additions & 6 deletions
@@ -926,7 +926,7 @@ Single aggregations on a ``Series`` this will return a scalar value:
926926

927927
.. ipython:: python
928928
929-
tsdf.A.agg('sum')
929+
tsdf['A'].agg('sum')
930930
931931
932932
Aggregating with multiple functions
@@ -950,13 +950,13 @@ On a ``Series``, multiple functions return a ``Series``, indexed by the function
950950

951951
.. ipython:: python
952952
953-
tsdf.A.agg(['sum', 'mean'])
953+
tsdf['A'].agg(['sum', 'mean'])
954954
955955
Passing a ``lambda`` function will yield a ``<lambda>`` named row:
956956

957957
.. ipython:: python
958958
959-
tsdf.A.agg(['sum', lambda x: x.mean()])
959+
tsdf['A'].agg(['sum', lambda x: x.mean()])
960960
961961
Passing a named function will yield that name for the row:
962962

@@ -965,7 +965,7 @@ Passing a named function will yield that name for the row:
965965
def mymean(x):
966966
    return x.mean()
967967
968-
tsdf.A.agg(['sum', mymean])
968+
tsdf['A'].agg(['sum', mymean])
969969
970970
Aggregating with a dict
971971
+++++++++++++++++++++++
@@ -1065,7 +1065,7 @@ Passing a single function to ``.transform()`` with a ``Series`` will yield a sin
10651065

10661066
.. ipython:: python
10671067
1068-
tsdf.A.transform(np.abs)
1068+
tsdf['A'].transform(np.abs)
10691069
10701070
10711071
Transform with multiple functions
@@ -1084,7 +1084,7 @@ resulting column names will be the transforming functions.
10841084

10851085
.. ipython:: python
10861086
1087-
tsdf.A.transform([np.abs, lambda x: x + 1])
1087+
tsdf['A'].transform([np.abs, lambda x: x + 1])
10881088
10891089
10901090
Transforming with a dict

doc/source/getting_started/comparison/comparison_with_r.rst

Lines changed: 4 additions & 4 deletions
@@ -81,7 +81,7 @@ R pandas
8181
=========================================== ===========================================
8282
``select(df, col_one = col1)`` ``df.rename(columns={'col1': 'col_one'})['col_one']``
8383
``rename(df, col_one = col1)`` ``df.rename(columns={'col1': 'col_one'})``
84-
``mutate(df, c=a-b)`` ``df.assign(c=df.a-df.b)``
84+
``mutate(df, c=a-b)`` ``df.assign(c=df['a']-df['b'])``
8585
=========================================== ===========================================
8686

8787

@@ -258,8 +258,8 @@ index/slice as well as standard boolean indexing:
258258
259259
df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
260260
df.query('a <= b')
261-
df[df.a <= df.b]
262-
df.loc[df.a <= df.b]
261+
df[df['a'] <= df['b']]
262+
df.loc[df['a'] <= df['b']]
263263
264264
For more details and examples see :ref:`the query documentation
265265
<indexing.query>`.
@@ -284,7 +284,7 @@ In ``pandas`` the equivalent expression, using the
284284
285285
df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
286286
df.eval('a + b')
287-
df.a + df.b # same as the previous expression
287+
df['a'] + df['b'] # same as the previous expression
288288
289289
In certain cases :meth:`~pandas.DataFrame.eval` will be much faster than
290290
evaluation in pure Python. For more details and examples see :ref:`the eval

doc/source/user_guide/advanced.rst

Lines changed: 1 addition & 1 deletion
@@ -738,7 +738,7 @@ and allows efficient indexing and storage of an index with a large number of dup
738738
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
739739
df
740740
df.dtypes
741-
df.B.cat.categories
741+
df['B'].cat.categories
742742
743743
Setting the index will create a ``CategoricalIndex``.
744744

doc/source/user_guide/cookbook.rst

Lines changed: 3 additions & 3 deletions
@@ -592,8 +592,8 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
592592
.. ipython:: python
593593
594594
df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
595-
df.A.groupby((df.A != df.A.shift()).cumsum()).groups
596-
df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
595+
df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).groups
596+
df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()
597597
598598
Expanding data
599599
**************
@@ -719,7 +719,7 @@ Rolling Apply to multiple columns where function calculates a Series before a Sc
719719
df
720720
721721
def gm(df, const):
722-
    v = ((((df.A + df.B) + 1).cumprod()) - 1) * const
722+
    v = ((((df['A'] + df['B']) + 1).cumprod()) - 1) * const
723723
    return v.iloc[-1]
724724
725725
s = pd.Series({df.index[i]: gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)

doc/source/user_guide/enhancingperf.rst

Lines changed: 6 additions & 6 deletions
@@ -393,15 +393,15 @@ Consider the following toy example of doubling each observation:
393393
.. code-block:: ipython
394394
395395
# Custom function without numba
396-
In [5]: %timeit df['col1_doubled'] = df.a.apply(double_every_value_nonumba) # noqa E501
396+
In [5]: %timeit df['col1_doubled'] = df['a'].apply(double_every_value_nonumba) # noqa E501
397397
1000 loops, best of 3: 797 us per loop
398398
399399
# Standard implementation (faster than a custom function)
400-
In [6]: %timeit df['col1_doubled'] = df.a * 2
400+
In [6]: %timeit df['col1_doubled'] = df['a'] * 2
401401
1000 loops, best of 3: 233 us per loop
402402
403403
# Custom function with numba
404-
In [7]: %timeit (df['col1_doubled'] = double_every_value_withnumba(df.a.to_numpy())
404+
In [7]: %timeit (df['col1_doubled'] = double_every_value_withnumba(df['a'].to_numpy())
405405
1000 loops, best of 3: 145 us per loop
406406
407407
Caveats
@@ -643,8 +643,8 @@ The equivalent in standard Python would be
643643
.. ipython:: python
644644
645645
df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
646-
df['c'] = df.a + df.b
647-
df['d'] = df.a + df.b + df.c
646+
df['c'] = df['a'] + df['b']
647+
df['d'] = df['a'] + df['b'] + df['c']
648648
df['a'] = 1
649649
df
650650
@@ -688,7 +688,7 @@ name in an expression.
688688
689689
a = np.random.randn()
690690
df.query('@a < a')
691-
df.loc[a < df.a] # same as the previous expression
691+
df.loc[a < df['a']] # same as the previous expression
692692
693693
With :func:`pandas.eval` you cannot use the ``@`` prefix *at all*, because it
694694
isn't defined in that context. ``pandas`` will let you know this if you try to
