From 87ecb1665b8644e8cb9985744c9684bb404006fc Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Thu, 27 Jul 2023 15:08:29 -0700 Subject: [PATCH 01/12] DOC: Use more executed instead of static code blocks --- doc/source/user_guide/advanced.rst | 33 ++++----- doc/source/user_guide/basics.rst | 9 ++- doc/source/user_guide/categorical.rst | 11 ++- doc/source/user_guide/indexing.rst | 100 +++++--------------------- doc/source/user_guide/io.rst | 73 +++++++------------ doc/source/user_guide/merging.rst | 8 +-- doc/source/user_guide/timeseries.rst | 36 +++++----- doc/source/whatsnew/v0.21.0.rst | 1 - 8 files changed, 82 insertions(+), 189 deletions(-) diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst index 41b0c98e339da..852165b647032 100644 --- a/doc/source/user_guide/advanced.rst +++ b/doc/source/user_guide/advanced.rst @@ -620,31 +620,23 @@ inefficient (and show a ``PerformanceWarning``). It will also return a copy of the data rather than a view: .. ipython:: python + :okwarning: dfm = pd.DataFrame( {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)} ) dfm = dfm.set_index(["jim", "joe"]) dfm - -.. code-block:: ipython - - In [4]: dfm.loc[(1, 'z')] - PerformanceWarning: indexing past lexsort depth may impact performance. - - Out[4]: - jolie - jim joe - 1 z 0.64094 + dfm.loc[(1, 'z')] .. _advanced.unsorted: Furthermore, if you try to index something that is not fully lexsorted, this can raise: -.. code-block:: ipython +.. ipython:: python + :okwarning: - In [5]: dfm.loc[(0, 'y'):(1, 'z')] - UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)' + dfm.loc[(0, 'y'):(1, 'z')] The :meth:`~MultiIndex.is_monotonic_increasing` method on a ``MultiIndex`` shows if the index is sorted: @@ -836,10 +828,10 @@ values **not** in the categories, similarly to how you can reindex **any** panda df5 = df5.set_index("B") df5.index - .. code-block:: ipython + .. ipython:: python + :okexcept: - In [1]: pd.concat([df4, df5]) - TypeError: categories must match existing categories when appending + pd.concat([df4, df5]) .. _advanced.rangeindex: @@ -1062,15 +1054,14 @@ On the other hand, if the index is not monotonic, then both slice bounds must be # OK because 2 and 4 are in the index df.loc[2:4, :] -.. code-block:: ipython +.. ipython:: python + :okexcept: # 0 is not in the index - In [9]: df.loc[0:4, :] - KeyError: 0 + df.loc[0:4, :] # 3 is not a unique label - In [11]: df.loc[2:3, :] - KeyError: 'Cannot get right slice bound for non-unique label: 3' + df.loc[2:3, :] ``Index.is_monotonic_increasing`` and ``Index.is_monotonic_decreasing`` only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst index 06e52d8713409..96f0e5ae53513 100644 --- a/doc/source/user_guide/basics.rst +++ b/doc/source/user_guide/basics.rst @@ -404,13 +404,12 @@ objects of the same length: Trying to compare ``Index`` or ``Series`` objects of different lengths will raise a ValueError: -.. code-block:: ipython +.. 
ipython:: python + :okexcept: - In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar']) - ValueError: Series lengths must match to compare + pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar']) - In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo']) - ValueError: Series lengths must match to compare + pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo']) Note that this is different from the NumPy behavior where a comparison can be broadcast: diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst index 61ecbff96ac7d..9efa7df3ff669 100644 --- a/doc/source/user_guide/categorical.rst +++ b/doc/source/user_guide/categorical.rst @@ -873,13 +873,12 @@ categoricals of the same categories and order information The below raises ``TypeError`` because the categories are ordered and not identical. -.. code-block:: ipython +.. ipython:: python + :okexcept: - In [1]: a = pd.Categorical(["a", "b"], ordered=True) - In [2]: b = pd.Categorical(["a", "b", "c"], ordered=True) - In [3]: union_categoricals([a, b]) - Out[3]: - TypeError: to union ordered Categoricals, all categories must be the same + a = pd.Categorical(["a", "b"], ordered=True) + b = pd.Categorical(["a", "b", "c"], ordered=True) + union_categoricals([a, b]) Ordered categoricals with different categories or orderings can be combined by using the ``ignore_ordered=True`` argument. diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst index b574ae9cb12c7..64aff2d0447ee 100644 --- a/doc/source/user_guide/indexing.rst +++ b/doc/source/user_guide/indexing.rst @@ -244,17 +244,13 @@ You can use attribute access to modify an existing element of a Series or column if you try to use attribute access to create a new column, it creates a new attribute rather than a new column and will this raise a ``UserWarning``: -.. code-block:: ipython - - In [1]: df = pd.DataFrame({'one': [1., 2., 3.]}) - In [2]: df.two = [4, 5, 6] - UserWarning: Pandas doesn't allow Series to be assigned into nonexistent columns - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute_access - In [3]: df - Out[3]: - one - 0 1.0 - 1 2.0 - 2 3.0 +.. ipython:: python + :okwarning: + + df = pd.DataFrame({'one': [1., 2., 3.]}) + df.two = [4, 5, 6] + df + Slicing ranges -------------- @@ -304,17 +300,14 @@ Selection by label ``.loc`` is strict when you present slicers that are not compatible (or convertible) with the index type. For example using integers in a ``DatetimeIndex``. These will raise a ``TypeError``. - .. ipython:: python + .. ipython:: python + :okexcept: - dfl = pd.DataFrame(np.random.randn(5, 4), + dfl = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'), index=pd.date_range('20130101', periods=5)) - dfl - - .. code-block:: ipython - - In [4]: dfl.loc[2:3] - TypeError: cannot do slice indexing on with these indexers [2] of + dfl + dfl.loc[2:3] String likes in slicing *can* be convertible to the type of the index and lead to natural slicing. @@ -618,59 +611,6 @@ For getting *multiple* indexers, using ``.get_indexer``: dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])] -.. _deprecate_loc_reindex_listlike: -.. _indexing.deprecate_loc_reindex_listlike: - -Indexing with list with missing labels is deprecated ----------------------------------------------------- - -In prior versions, using ``.loc[list-of-labels]`` would work as long as *at least 1* of the keys was found (otherwise it -would raise a ``KeyError``). 
This behavior was changed and will now raise a ``KeyError`` if at least one label is missing. -The recommended alternative is to use ``.reindex()``. - -For example. - -.. ipython:: python - - s = pd.Series([1, 2, 3]) - s - -Selection with all keys found is unchanged. - -.. ipython:: python - - s.loc[[1, 2]] - -Previous behavior - -.. code-block:: ipython - - In [4]: s.loc[[1, 2, 3]] - Out[4]: - 1 2.0 - 2 3.0 - 3 NaN - dtype: float64 - - -Current behavior - -.. code-block:: ipython - - In [4]: s.loc[[1, 2, 3]] - Passing list-likes to .loc with any non-matching elements will raise - KeyError in the future, you can use .reindex() as an alternative. - - See the documentation here: - https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike - - Out[4]: - 1 2.0 - 2 3.0 - 3 NaN - dtype: float64 - - Reindexing ~~~~~~~~~~ @@ -690,14 +630,11 @@ Alternatively, if you want to select only *valid* keys, the following is idiomat Having a duplicated index will raise for a ``.reindex()``: .. ipython:: python + :okexcept: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c']) labels = ['c', 'd'] - -.. code-block:: ipython - - In [17]: s.reindex(labels) - ValueError: cannot reindex on an axis with duplicate labels + s.reindex(labels) Generally, you can intersect the desired labels with the current axis, and then reindex. @@ -708,12 +645,11 @@ axis, and then reindex. However, this would *still* raise if your resulting index is duplicated. -.. code-block:: ipython - - In [41]: labels = ['a', 'd'] +.. ipython:: python + :okexcept: - In [42]: s.loc[s.index.intersection(labels)].reindex(labels) - ValueError: cannot reindex on an axis with duplicate labels + labels = ['a', 'd'] + s.loc[s.index.intersection(labels)].reindex(labels) .. _indexing.basics.partial_setting: diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst index bb51124f10e54..15aadd83e6a7a 100644 --- a/doc/source/user_guide/io.rst +++ b/doc/source/user_guide/io.rst @@ -1212,36 +1212,23 @@ too many fields will raise an error by default: You can elect to skip bad lines: -.. code-block:: ipython - - In [29]: pd.read_csv(StringIO(data), on_bad_lines="warn") - Skipping line 3: expected 3 fields, saw 4 +.. ipython:: python + :okwarning: - Out[29]: - a b c - 0 1 2 3 - 1 8 9 10 + data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10" + pd.read_csv(StringIO(data), on_bad_lines="warn") Or pass a callable function to handle the bad line if ``engine="python"``. The bad line will be a list of strings that was split by the ``sep``: -.. code-block:: ipython - - In [29]: external_list = [] - - In [30]: def bad_lines_func(line): - ...: external_list.append(line) - ...: return line[-3:] - - In [31]: pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") - Out[31]: - a b c - 0 1 2 3 - 1 5 6 7 - 2 8 9 10 +.. ipython:: python - In [32]: external_list - Out[32]: [4, 5, 6, 7] + external_list = [] + def bad_lines_func(line): + external_list.append(line) + return line[-3:] + pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") + external_list .. versionadded:: 1.4.0 @@ -1250,7 +1237,7 @@ Bad lines caused by other errors will be silently skipped. For example: -.. code-block:: ipython +.. ipython:: python def bad_lines_func(line): print(line) @@ -1264,29 +1251,17 @@ The line was not processed in this case, as a "bad line" here is caused by an es You can also use the ``usecols`` parameter to eliminate extraneous column data that appear in some lines but not others: -.. 
code-block:: ipython - - In [33]: pd.read_csv(StringIO(data), usecols=[0, 1, 2]) +.. ipython:: python - Out[33]: - a b c - 0 1 2 3 - 1 4 5 6 - 2 8 9 10 + pd.read_csv(StringIO(data), usecols=[0, 1, 2]) In case you want to keep all data including the lines with too many fields, you can specify a sufficient number of ``names``. This ensures that lines with not enough fields are filled with ``NaN``. -.. code-block:: ipython - - In [34]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd']) +.. ipython:: python - Out[34]: - a b c d - 0 1 2 3 NaN - 1 4 5 6 7 - 2 8 9 10 NaN + pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd']) .. _io.dialect: @@ -4385,16 +4360,15 @@ will yield a tuple for each group key along with the relative keys of its conten Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for items stored under the root node. - .. code-block:: ipython + .. ipython:: python + :okexcept: - In [8]: store.foo.bar.bah - AttributeError: 'HDFStore' object has no attribute 'foo' + store.foo.bar.bah + + .. ipython:: python # you can directly access the actual PyTables node but using the root node - In [9]: store.root.foo.bar.bah - Out[9]: - /foo/bar/bah (Group) '' - children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)] + store.root.foo.bar.bah Instead, use explicit string based keys: @@ -4547,7 +4521,8 @@ The right-hand side of the sub-expression (after a comparison operator) can be: instead of this - .. code-block:: ipython + .. code-block:: python + :okexcept: string = "HolyMoly'" store.select('df', f'index == {string}') diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst index 962de385a08c5..65506feda6e15 100644 --- a/doc/source/user_guide/merging.rst +++ b/doc/source/user_guide/merging.rst @@ -734,15 +734,11 @@ In the following example, there are duplicate values of ``B`` in the right .. ipython:: python + :okexcept: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]}) right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]}) - -.. code-block:: ipython - - In [53]: result = pd.merge(left, right, on="B", how="outer", validate="one_to_one") - ... - MergeError: Merge keys are not unique in right dataset; not a one-to-one merge + result = pd.merge(left, right, on="B", how="outer", validate="one_to_one") If the user is aware of the duplicates in the right ``DataFrame`` but wants to ensure there are no duplicates in the left DataFrame, one can use the diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst index a0754ba0d2995..bc6a3926188f1 100644 --- a/doc/source/user_guide/timeseries.rst +++ b/doc/source/user_guide/timeseries.rst @@ -289,10 +289,10 @@ Invalid data The default behavior, ``errors='raise'``, is to raise when unparsable: -.. code-block:: ipython +.. ipython:: python + :okexcept: - In [2]: pd.to_datetime(['2009/07/31', 'asd'], errors='raise') - ValueError: Unknown datetime string format + pd.to_datetime(['2009/07/31', 'asd'], errors='raise') Pass ``errors='ignore'`` to return the original input when unparsable: @@ -2016,12 +2016,11 @@ If ``Period`` freq is daily or higher (``D``, ``H``, ``T``, ``S``, ``L``, ``U``, p + datetime.timedelta(minutes=120) p + np.timedelta64(7200, "s") -.. code-block:: ipython +.. ipython:: python + :okexcept: + + p + pd.offsets.Minute(5) - In [1]: p + pd.offsets.Minute(5) - Traceback - ... - ValueError: Input has different freq from Period(freq=H) If ``Period`` has other frequencies, only the same ``offsets`` can be added. 
Otherwise, ``ValueError`` will be raised. @@ -2030,12 +2029,11 @@ If ``Period`` has other frequencies, only the same ``offsets`` can be added. Oth p = pd.Period("2014-07", freq="M") p + pd.offsets.MonthEnd(3) -.. code-block:: ipython +.. ipython:: python + :okexcept: + + p + pd.offsets.MonthBegin(3) - In [1]: p + pd.offsets.MonthBegin(3) - Traceback - ... - ValueError: Input has different freq from Period(freq=M) Taking the difference of ``Period`` instances with the same frequency will return the number of frequency units between them: @@ -2564,10 +2562,10 @@ twice within one day ("clocks fall back"). The following options are available: This will fail as there are ambiguous times (``'11/06/2011 01:00'``) -.. code-block:: ipython +.. ipython:: python + :okexcept: - In [2]: rng_hourly.tz_localize('US/Eastern') - AmbiguousTimeError: Cannot infer dst time from Timestamp('2011-11-06 01:00:00'), try using the 'ambiguous' argument + rng_hourly.tz_localize('US/Eastern') Handle these ambiguous times by specifying the following. @@ -2599,10 +2597,10 @@ can be controlled by the ``nonexistent`` argument. The following options are ava Localization of nonexistent times will raise an error by default. -.. code-block:: ipython +.. ipython:: python + :okexcept: - In [2]: dti.tz_localize('Europe/Warsaw') - NonExistentTimeError: 2015-03-29 02:30:00 + dti.tz_localize('Europe/Warsaw') Transform nonexistent times to ``NaT`` or shift the times. diff --git a/doc/source/whatsnew/v0.21.0.rst b/doc/source/whatsnew/v0.21.0.rst index 1dae2e8463c27..b45ea8a2b522c 100644 --- a/doc/source/whatsnew/v0.21.0.rst +++ b/doc/source/whatsnew/v0.21.0.rst @@ -440,7 +440,6 @@ Indexing with a list with missing labels is deprecated Previously, selecting with a list of labels, where one or more labels were missing would always succeed, returning ``NaN`` for missing labels. This will now show a ``FutureWarning``. In the future this will raise a ``KeyError`` (:issue:`15747`). This warning will trigger on a ``DataFrame`` or a ``Series`` for using ``.loc[]`` or ``[[]]`` when passing a list-of-labels with at least 1 missing label. -See the :ref:`deprecation docs `. .. ipython:: python From dfc47c2b1ee237d56aaa757e379c7c41628fa1db Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Thu, 27 Jul 2023 15:47:38 -0700 Subject: [PATCH 02/12] change to except --- doc/source/user_guide/advanced.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst index 852165b647032..f07a7a54aaac1 100644 --- a/doc/source/user_guide/advanced.rst +++ b/doc/source/user_guide/advanced.rst @@ -634,7 +634,7 @@ return a copy of the data rather than a view: Furthermore, if you try to index something that is not fully lexsorted, this can raise: .. 
ipython:: python - :okwarning: + :okexcept: dfm.loc[(0, 'y'):(1, 'z')] From 0fca552fff23b6dbfd4f669879217fd182abd229 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Thu, 27 Jul 2023 16:08:30 -0700 Subject: [PATCH 03/12] Convert more code blocks --- doc/source/user_guide/advanced.rst | 6 ++---- doc/source/user_guide/basics.rst | 25 +++++++++++-------------- 2 files changed, 13 insertions(+), 18 deletions(-) diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst index f07a7a54aaac1..73d7a831b427e 100644 --- a/doc/source/user_guide/advanced.rst +++ b/doc/source/user_guide/advanced.rst @@ -913,11 +913,9 @@ Selecting using an ``Interval`` will only return exact matches. Trying to select an ``Interval`` that is not exactly contained in the ``IntervalIndex`` will raise a ``KeyError``. -.. code-block:: python +.. ipython:: python - In [7]: df.loc[pd.Interval(0.5, 2.5)] - --------------------------------------------------------------------------- - KeyError: Interval(0.5, 2.5, closed='right') + df.loc[pd.Interval(0.5, 2.5)] Selecting all ``Intervals`` that overlap a given ``Interval`` can be performed using the :meth:`~IntervalIndex.overlaps` method to create a boolean indexer. diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst index 96f0e5ae53513..5cac6ec11a705 100644 --- a/doc/source/user_guide/basics.rst +++ b/doc/source/user_guide/basics.rst @@ -322,24 +322,21 @@ You can test if a pandas object is empty, via the :attr:`~DataFrame.empty` prope .. warning:: - You might be tempted to do the following: + Asserting the truthiness of a pandas object will raise an error, as the testing of the emptiness + or values is ambiguous. - .. code-block:: python - - >>> if df: - ... pass - - Or - - .. code-block:: python + .. ipython:: python + :okexcept: - >>> df and df2 + if df: + print(True) - These will both raise errors, as you are trying to compare multiple values.:: + .. ipython:: python + :okexcept: - ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all(). + df and df2 -See :ref:`gotchas` for a more detailed discussion. + See :ref:`gotchas` for a more detailed discussion. .. _basics.equals: @@ -911,7 +908,7 @@ maximum value for each column occurred: You may also pass additional arguments and keyword arguments to the :meth:`~DataFrame.apply` method. For instance, consider the following function you would like to apply: -.. code-block:: python +.. ipython:: python def subtract_and_divide(x, sub, divide=1): return (x - sub) / divide From 7d3bfe38ae8e58d58cf8ee8425485a13789bb6b2 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Thu, 27 Jul 2023 16:16:00 -0700 Subject: [PATCH 04/12] convert even more --- doc/source/user_guide/basics.rst | 9 +++------ doc/source/user_guide/indexing.rst | 17 ++++++++++------- doc/source/user_guide/text.rst | 6 +++--- 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst index 5cac6ec11a705..2e299da5e5794 100644 --- a/doc/source/user_guide/basics.rst +++ b/doc/source/user_guide/basics.rst @@ -906,18 +906,15 @@ maximum value for each column occurred: tsdf.apply(lambda x: x.idxmax()) You may also pass additional arguments and keyword arguments to the :meth:`~DataFrame.apply` -method. For instance, consider the following function you would like to apply: +method. .. 
ipython:: python def subtract_and_divide(x, sub, divide=1): return (x - sub) / divide -You may then apply this function as follows: - -.. code-block:: python - - df.apply(subtract_and_divide, args=(5,), divide=3) + df_udf = pd.DataFrame(np.ones((2, 2))) + df_udf.apply(subtract_and_divide, args=(5,), divide=3) Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row: diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst index 64aff2d0447ee..ace32c3927a07 100644 --- a/doc/source/user_guide/indexing.rst +++ b/doc/source/user_guide/indexing.rst @@ -535,13 +535,15 @@ A single indexer that is out of bounds will raise an ``IndexError``. A list of indexers where any element is out of bounds will raise an ``IndexError``. -.. code-block:: python +.. ipython:: python + :okexcept: - >>> dfl.iloc[[4, 5, 6]] - IndexError: positional indexers are out-of-bounds + dfl.iloc[[4, 5, 6]] - >>> dfl.iloc[:, 4] - IndexError: single positional indexer is out-of-bounds +.. ipython:: python + :okexcept: + + dfl.iloc[:, 4] .. _indexing.callable: @@ -1695,9 +1697,10 @@ Adding an ad hoc index If you create an index yourself, you can just assign it to the ``index`` field: -.. code-block:: python +.. ipython:: python - data.index = index + data.index = pd.Index([10, 20, 30, 40], name="a") + data .. _indexing.view_versus_copy: diff --git a/doc/source/user_guide/text.rst b/doc/source/user_guide/text.rst index c193df5118926..cf27fc8385223 100644 --- a/doc/source/user_guide/text.rst +++ b/doc/source/user_guide/text.rst @@ -574,10 +574,10 @@ returns a ``DataFrame`` if ``expand=True``. It raises ``ValueError`` if ``expand=False``. -.. code-block:: python +.. ipython:: python + :okexcept: - >>> s.index.str.extract("(?P[a-zA-Z])([0-9]+)", expand=False) - ValueError: only one regex group is supported with Index + s.index.str.extract("(?P[a-zA-Z])([0-9]+)", expand=False) The table below summarizes the behavior of ``extract(expand=False)`` (input subject in first column, number of groups in regex in From 5c2ee8204d488147c483700907f219bc74dac573 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Thu, 27 Jul 2023 16:40:55 -0700 Subject: [PATCH 05/12] okexcept --- doc/source/user_guide/advanced.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst index 73d7a831b427e..1f2dc3b78e6f6 100644 --- a/doc/source/user_guide/advanced.rst +++ b/doc/source/user_guide/advanced.rst @@ -914,6 +914,7 @@ Selecting using an ``Interval`` will only return exact matches. Trying to select an ``Interval`` that is not exactly contained in the ``IntervalIndex`` will raise a ``KeyError``. .. ipython:: python + :okexcept: df.loc[pd.Interval(0.5, 2.5)] From 0b6b66bbc6d966c408d2eb084a11e3fccf5999b1 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Thu, 27 Jul 2023 17:16:10 -0700 Subject: [PATCH 06/12] More fixes --- doc/source/user_guide/indexing.rst | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst index ace32c3927a07..e7cd3c3e2a791 100644 --- a/doc/source/user_guide/indexing.rst +++ b/doc/source/user_guide/indexing.rst @@ -247,9 +247,9 @@ new column and will this raise a ``UserWarning``: .. 
ipython:: python :okwarning: - df = pd.DataFrame({'one': [1., 2., 3.]}) - df.two = [4, 5, 6] - df + df_new = pd.DataFrame({'one': [1., 2., 3.]}) + df_new.two = [4, 5, 6] + df_new Slicing ranges @@ -304,8 +304,8 @@ Selection by label :okexcept: dfl = pd.DataFrame(np.random.randn(5, 4), - columns=list('ABCD'), - index=pd.date_range('20130101', periods=5)) + columns=list('ABCD'), + index=pd.date_range('20130101', periods=5)) dfl dfl.loc[2:3] @@ -620,6 +620,7 @@ The idiomatic way to achieve selecting potentially not-found elements is via ``. .. ipython:: python + s = pd.Series([1, 2, 3]) s.reindex([1, 2, 3]) Alternatively, if you want to select only *valid* keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection. @@ -1695,12 +1696,13 @@ discards the index, instead of putting index values in the DataFrame's columns. Adding an ad hoc index ~~~~~~~~~~~~~~~~~~~~~~ -If you create an index yourself, you can just assign it to the ``index`` field: +You can assign a custom index to the ``index`` attribute: .. ipython:: python - data.index = pd.Index([10, 20, 30, 40], name="a") - data + df_idx = pd.DataFrame(range(4)) + df_idx.index = pd.Index([10, 20, 30, 40], name="a") + df_idx .. _indexing.view_versus_copy: From a048b01aedc6ed1151ad854512f6267394ff7b09 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Thu, 27 Jul 2023 17:35:12 -0700 Subject: [PATCH 07/12] Address more --- doc/source/user_guide/advanced.rst | 3 ++- doc/source/user_guide/dsintro.rst | 4 ++-- doc/source/user_guide/indexing.rst | 26 ++++++++++---------------- doc/source/user_guide/io.rst | 2 +- doc/source/user_guide/merging.rst | 6 +++--- 5 files changed, 18 insertions(+), 23 deletions(-) diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst index 1f2dc3b78e6f6..682fa4c9b4fcc 100644 --- a/doc/source/user_guide/advanced.rst +++ b/doc/source/user_guide/advanced.rst @@ -1099,7 +1099,8 @@ accomplished as such: However, if you only had ``c`` and ``e``, determining the next element in the index can be somewhat complicated. For example, the following does not work: -:: +.. ipython:: python + :okexcept: s.loc['c':'e' + 1] diff --git a/doc/source/user_guide/dsintro.rst b/doc/source/user_guide/dsintro.rst index 4b0829e4a23b9..d60532f5f4027 100644 --- a/doc/source/user_guide/dsintro.rst +++ b/doc/source/user_guide/dsintro.rst @@ -31,9 +31,9 @@ Series type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. The basic method to create a :class:`Series` is to call: -:: +.. code-block:: python - >>> s = pd.Series(data, index=index) + s = pd.Series(data, index=index) Here, ``data`` can be many different things: diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst index e7cd3c3e2a791..52bc43f52b1d3 100644 --- a/doc/source/user_guide/indexing.rst +++ b/doc/source/user_guide/indexing.rst @@ -1833,15 +1833,12 @@ chained indexing expression, you can set the :ref:`option ` This however is operating on a copy and will not work. -:: +.. ipython:: python + :okwarning: + :okexcept: - >>> pd.set_option('mode.chained_assignment','warn') - >>> dfb[dfb['a'].str.startswith('o')]['c'] = 42 - Traceback (most recent call last) - ... - SettingWithCopyWarning: - A value is trying to be set on a copy of a slice from a DataFrame. 
-    Try using .loc[row_index,col_indexer] = value instead
+   with pd.option_context('mode.chained_assignment','warn'):
+       dfb[dfb['a'].str.startswith('o')]['c'] = 42
 
 A chained assignment can also crop up in setting in a mixed dtype frame.
 
@@ -1878,15 +1875,12 @@ The following *can* work at times, but it is not guaranteed to, and therefore sh
 
 Last, the subsequent example will **not** work at all, and so should be avoided:
 
-::
+.. ipython:: python
+   :okwarning:
+   :okexcept:
 
-   >>> pd.set_option('mode.chained_assignment','raise')
-   >>> dfd.loc[0]['a'] = 1111
-   Traceback (most recent call last)
-   ...
-   SettingWithCopyError:
-   A value is trying to be set on a copy of a slice from a DataFrame.
-   Try using .loc[row_index,col_indexer] = value instead
+   with pd.option_context('mode.chained_assignment','raise'):
+       dfd.loc[0]['a'] = 1111
 
 .. warning::
 
diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index 15aadd83e6a7a..4e1ab19d08e57 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -1244,7 +1244,7 @@ For example:
 
     data = 'name,type\nname a,a is of type a\nname b,"b\" is of type b"'
     data
-    pd.read_csv(data, on_bad_lines=bad_lines_func, engine="python")
+    pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python")
 
 The line was not processed in this case, as a "bad line" here is caused by an escape character.
 
diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst
index 65506feda6e15..14d9b627bec89 100644
--- a/doc/source/user_guide/merging.rst
+++ b/doc/source/user_guide/merging.rst
@@ -155,10 +155,10 @@ functionality below.
    reusing this function can create a significant performance hit. If you need
    to use the operation over several datasets, use a list comprehension.
 
-::
+   .. code-block:: python
 
-  frames = [ process_your_file(f) for f in files ]
-  result = pd.concat(frames)
+      frames = [process_your_file(f) for f in files]
+      result = pd.concat(frames)
 
 .. note::
 
From d212ac1d9235698eab81540f539c89d4c6aa326b Mon Sep 17 00:00:00 2001
From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Date: Thu, 27 Jul 2023 17:49:38 -0700
Subject: [PATCH 08/12] another okexcept

---
 doc/source/user_guide/io.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index 4e1ab19d08e57..81e6a27529d3c 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -1252,6 +1252,7 @@ You can also use the ``usecols`` parameter to eliminate extraneous column
 data that appear in some lines but not others:
 
 .. ipython:: python
+   :okexcept:
 
    pd.read_csv(StringIO(data), usecols=[0, 1, 2])
 
From 30b23da63ec719c5200542cd18ee5c42b77da0d9 Mon Sep 17 00:00:00 2001
From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Date: Fri, 28 Jul 2023 11:22:57 -0700
Subject: [PATCH 09/12] Fix okexcept

---
 doc/source/user_guide/io.rst | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index 81e6a27529d3c..985ea1464c11b 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -4265,12 +4265,16 @@ This format is specified by default when using ``put`` or ``to_hdf`` or by ``for
 
    A ``fixed`` format will raise a ``TypeError`` if you try to retrieve using a ``where``:
 
-   .. code-block:: python
+   .. 
ipython:: python + :okexcept: - >>> pd.DataFrame(np.random.randn(10, 2)).to_hdf("test_fixed.h5", "df") - >>> pd.read_hdf("test_fixed.h5", "df", where="index>5") - TypeError: cannot pass a where specification when reading a fixed format. - this store must be selected in its entirety + pd.DataFrame(np.random.randn(10, 2)).to_hdf("test_fixed.h5", "df") + pd.read_hdf("test_fixed.h5", "df", where="index>5") + + .. ipython:: python + :suppress: + + os.remove("test_fixed.h5") .. _io.hdf5-table: @@ -4362,7 +4366,7 @@ will yield a tuple for each group key along with the relative keys of its conten Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for items stored under the root node. .. ipython:: python - :okexcept: + :okexcept: store.foo.bar.bah From 9e09fb92d61eed222bf542eac1acc17811852f56 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 28 Jul 2023 12:53:49 -0700 Subject: [PATCH 10/12] address again --- doc/source/user_guide/io.rst | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst index 985ea1464c11b..b2f3a3f9c786f 100644 --- a/doc/source/user_guide/io.rst +++ b/doc/source/user_guide/io.rst @@ -1213,10 +1213,9 @@ too many fields will raise an error by default: You can elect to skip bad lines: .. ipython:: python - :okwarning: data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10" - pd.read_csv(StringIO(data), on_bad_lines="warn") + pd.read_csv(StringIO(data), on_bad_lines="skip") Or pass a callable function to handle the bad line if ``engine="python"``. The bad line will be a list of strings that was split by the ``sep``: From 5e50e9871a390103845a455059f958b4d05b70a2 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 28 Jul 2023 13:29:33 -0700 Subject: [PATCH 11/12] more fixes --- doc/source/user_guide/io.rst | 56 +++++++++++++++++------------------- 1 file changed, 27 insertions(+), 29 deletions(-) diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst index b2f3a3f9c786f..290fe716f5a8a 100644 --- a/doc/source/user_guide/io.rst +++ b/doc/source/user_guide/io.rst @@ -849,8 +849,8 @@ column names: with open("tmp.csv", "w") as fh: fh.write(data) - df = pd.read_csv("tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]]) - df + df = pd.read_csv("tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]]) + df By default the parser removes the component date columns, but you can choose to retain them via the ``keep_date_col`` keyword: @@ -1103,10 +1103,10 @@ By default, numbers with a thousands separator will be parsed as strings: with open("tmp.csv", "w") as fh: fh.write(data) - df = pd.read_csv("tmp.csv", sep="|") - df + df = pd.read_csv("tmp.csv", sep="|") + df - df.level.dtype + df.level.dtype The ``thousands`` keyword allows integers to be parsed correctly: @@ -1217,6 +1217,8 @@ You can elect to skip bad lines: data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10" pd.read_csv(StringIO(data), on_bad_lines="skip") +.. versionadded:: 1.4.0 + Or pass a callable function to handle the bad line if ``engine="python"``. The bad line will be a list of strings that was split by the ``sep``: @@ -1229,23 +1231,20 @@ The bad line will be a list of strings that was split by the ``sep``: pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") external_list - .. versionadded:: 1.4.0 +.. note:: -Note that the callable function will handle only a line with too many fields. 
-Bad lines caused by other errors will be silently skipped. + The callable function will handle only a line with too many fields. + Bad lines caused by other errors will be silently skipped. -For example: - -.. ipython:: python + .. ipython:: python - def bad_lines_func(line): - print(line) + bad_lines_func = lambda line: print(line) - data = 'name,type\nname a,a is of type a\nname b,"b\" is of type b"' - data - pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") + data = 'name,type\nname a,a is of type a\nname b,"b\" is of type b"' + data + pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") -The line was not processed in this case, as a "bad line" here is caused by an escape character. + The line was not processed in this case, as a "bad line" here is caused by an escape character. You can also use the ``usecols`` parameter to eliminate extraneous column data that appear in some lines but not others: @@ -4432,19 +4431,19 @@ storing/selecting from homogeneous index ``DataFrames``. .. ipython:: python - index = pd.MultiIndex( - levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]], - codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], - names=["foo", "bar"], - ) - df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"]) - df_mi + index = pd.MultiIndex( + levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]], + codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=["foo", "bar"], + ) + df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"]) + df_mi - store.append("df_mi", df_mi) - store.select("df_mi") + store.append("df_mi", df_mi) + store.select("df_mi") - # the levels are automatically included as data columns - store.select("df_mi", "foo=bar") + # the levels are automatically included as data columns + store.select("df_mi", "foo=bar") .. note:: The ``index`` keyword is reserved and cannot be use as a level name. @@ -4526,7 +4525,6 @@ The right-hand side of the sub-expression (after a comparison operator) can be: instead of this .. code-block:: python - :okexcept: string = "HolyMoly'" store.select('df', f'index == {string}') From 603aa73076d16fe6512f3b01f4c640465cb68174 Mon Sep 17 00:00:00 2001 From: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com> Date: Fri, 28 Jul 2023 13:54:00 -0700 Subject: [PATCH 12/12] fix merging --- doc/source/user_guide/merging.rst | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst index 14d9b627bec89..10793a6973f8a 100644 --- a/doc/source/user_guide/merging.rst +++ b/doc/source/user_guide/merging.rst @@ -732,13 +732,12 @@ In the following example, there are duplicate values of ``B`` in the right ``DataFrame``. As this is not a one-to-one merge -- as specified in the ``validate`` argument -- an exception will be raised. - .. ipython:: python :okexcept: - left = pd.DataFrame({"A": [1, 2], "B": [1, 2]}) - right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]}) - result = pd.merge(left, right, on="B", how="outer", validate="one_to_one") + left = pd.DataFrame({"A": [1, 2], "B": [1, 2]}) + right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]}) + result = pd.merge(left, right, on="B", how="outer", validate="one_to_one") If the user is aware of the duplicates in the right ``DataFrame`` but wants to ensure there are no duplicates in the left DataFrame, one can use the
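Each commit above replaces a hand-maintained ``code-block`` transcript with an
executed ``.. ipython:: python`` directive, where the ``:okexcept:`` and
``:okwarning:`` options let the doc build run a snippet that raises or warns
and render the real message instead of failing the build. The following
standalone sketch illustrates that execute-and-render idea; it is not the
IPython Sphinx directive's actual implementation, and ``run_and_render`` is a
hypothetical helper name used only for illustration.

.. code-block:: python

   import traceback
   import warnings

   def run_and_render(source: str) -> str:
       """Run a doc snippet; render errors/warnings rather than failing."""
       namespace: dict = {}
       with warnings.catch_warnings(record=True) as caught:
           warnings.simplefilter("always")
           try:
               exec(source, namespace)
           except Exception:
               # Loose analogue of ``:okexcept:``: keep the exception text
               # as the block's rendered output instead of aborting.
               return traceback.format_exc(limit=0).strip()
       # Loose analogue of ``:okwarning:``: surface warning messages.
       return "\n".join(str(w.message) for w in caught)

   # The rendered output then tracks the *current* pandas error message,
   # rather than a hand-copied transcript that goes stale between releases.
   snippet = (
       "import pandas as pd\n"
       "pd.to_datetime(['2009/07/31', 'asd'], errors='raise')\n"
   )
   print(run_and_render(snippet))

This is why the series can also delete hard-coded outputs like the
``UnsortedIndexError`` and ``SettingWithCopy`` transcripts: once a block is
executed, the message shown in the docs can no longer drift from the message
pandas actually raises.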