Skip to content

ENH: Fix by in DataFrame.plot.hist and DataFrame.plot.box #28373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 178 commits into from
Jul 12, 2021
Merged
Show file tree
Hide file tree
Changes from 149 commits
Commits
Show all changes
178 commits
Select commit Hold shift + click to select a range
7e461a1
remove \n from docstring
charlesdong1991 Dec 3, 2018
1314059
fix conflicts
charlesdong1991 Jan 19, 2019
8bcb313
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jul 30, 2019
e36592c
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Sep 10, 2019
b2f45a6
fix by in hist
charlesdong1991 Sep 10, 2019
8b6e00a
make plot work
charlesdong1991 Sep 10, 2019
dc0c2ec
add _group_plot function
charlesdong1991 Sep 10, 2019
d803938
check function
charlesdong1991 Sep 10, 2019
33dd762
reformat
charlesdong1991 Sep 10, 2019
d59d642
put import up
charlesdong1991 Sep 10, 2019
66eb06c
add comments
charlesdong1991 Sep 10, 2019
ea267ad
Mimic group plot
charlesdong1991 Sep 10, 2019
8095224
fix import failure
charlesdong1991 Sep 10, 2019
31decc1
reformat
charlesdong1991 Sep 10, 2019
e4bdbd0
fix test
charlesdong1991 Sep 10, 2019
4033159
hacky fix
charlesdong1991 Sep 10, 2019
57a3bdf
fix isrot
charlesdong1991 Sep 10, 2019
8060223
fix tests
charlesdong1991 Sep 11, 2019
d666334
fix import failure
charlesdong1991 Sep 11, 2019
3216d59
fix import error
charlesdong1991 Sep 11, 2019
45f4b7f
Update imports
charlesdong1991 Sep 11, 2019
2b0785b
test imports
charlesdong1991 Sep 11, 2019
d79dba3
new change
charlesdong1991 Sep 12, 2019
eca597b
fix conflict
charlesdong1991 Sep 12, 2019
321fbd2
restore removed line
charlesdong1991 Sep 12, 2019
a7b9ae5
Remove unused line
charlesdong1991 Sep 12, 2019
d2d13fd
Disruptive change
charlesdong1991 Sep 12, 2019
5abedb6
should work this time
charlesdong1991 Sep 13, 2019
d73115a
Add in-code comments
charlesdong1991 Sep 13, 2019
d7998bb
remove print
charlesdong1991 Sep 13, 2019
1bbf7ea
reformat
charlesdong1991 Sep 13, 2019
a279f45
Dropna
charlesdong1991 Sep 13, 2019
2b793ea
Add isna for multi column
charlesdong1991 Sep 14, 2019
04de066
try to remove warning
charlesdong1991 Sep 15, 2019
4adc324
test if removing pd works
charlesdong1991 Sep 16, 2019
d0103a4
revert changes
charlesdong1991 Sep 16, 2019
f94dbb4
try if warning gone
charlesdong1991 Sep 16, 2019
0415cb0
try again
charlesdong1991 Sep 17, 2019
525200b
merge master and fix conflict
charlesdong1991 Dec 23, 2019
1ab4310
merge master and fix conflict
charlesdong1991 Jan 4, 2020
c005880
fix conflict and merge master
charlesdong1991 Jan 4, 2020
a1fabc5
Fix linting error
charlesdong1991 Jan 4, 2020
70453f1
Add test
charlesdong1991 Jan 4, 2020
b6579a5
remove unused code
charlesdong1991 Jan 4, 2020
e99f3dc
add test and make code more robust
charlesdong1991 Jan 4, 2020
99d6d67
remove comment
charlesdong1991 Jan 4, 2020
8e2fcf6
clean the code
charlesdong1991 Jan 4, 2020
d02f4ac
simplify code
charlesdong1991 Jan 4, 2020
947189c
simplify code
charlesdong1991 Jan 4, 2020
6b5203d
fix linting
charlesdong1991 Jan 4, 2020
27d0d21
Add doc for hist
charlesdong1991 Jan 4, 2020
48ff521
revert change
charlesdong1991 Jan 4, 2020
f39d948
fix warning
charlesdong1991 Jan 4, 2020
5d1705c
isort
charlesdong1991 Jan 4, 2020
90471aa
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jan 11, 2020
46a8031
simplify code
charlesdong1991 Jan 11, 2020
57a96e6
simpler python
charlesdong1991 Jan 11, 2020
29127f0
remove unused
charlesdong1991 Jan 11, 2020
61bb97f
restore blank lines
charlesdong1991 Jan 11, 2020
62fb9e6
Add extensive tests
charlesdong1991 Jan 11, 2020
638174b
fix seed
charlesdong1991 Jan 11, 2020
02de005
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jan 15, 2020
5adb25d
code change based on reviews
charlesdong1991 Jan 15, 2020
7051432
fix linting
charlesdong1991 Jan 15, 2020
adbde9f
update doc
charlesdong1991 Jan 15, 2020
5dfad18
merge master and fix conflict
charlesdong1991 Feb 10, 2020
abd10f3
code change based on reviews
charlesdong1991 Feb 10, 2020
c20d81a
fixup
charlesdong1991 Feb 10, 2020
07112c0
fixup
charlesdong1991 Feb 10, 2020
fb0b87c
code change on reviews
charlesdong1991 Feb 11, 2020
a6a8e57
fix isort
charlesdong1991 Feb 11, 2020
7f77f48
short code
charlesdong1991 Feb 12, 2020
c09bb19
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Feb 19, 2020
a120d27
simpler python
charlesdong1991 Feb 23, 2020
f87afee
add inline comment
charlesdong1991 Feb 23, 2020
82711ee
simplier pandas
charlesdong1991 Feb 23, 2020
60f7298
code change on JR review
charlesdong1991 Feb 23, 2020
071488b
fix linting
charlesdong1991 Feb 23, 2020
f2a0210
rebase and fix conflict
charlesdong1991 Feb 27, 2020
bb07e15
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Mar 8, 2020
867094a
code change on reviews
charlesdong1991 Mar 8, 2020
b0f06b2
Add docstring
charlesdong1991 Mar 8, 2020
111e89c
fix typo
charlesdong1991 Mar 15, 2020
6472053
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Mar 15, 2020
83ec868
remove blank
charlesdong1991 Mar 15, 2020
d6c8566
use more meaningful example
charlesdong1991 Mar 15, 2020
6a0ac8d
keep as is
charlesdong1991 Mar 15, 2020
49d0791
remove less useful comment
charlesdong1991 Mar 15, 2020
2bfbe78
change figsize
charlesdong1991 Mar 15, 2020
c5d7518
clean iter_data
charlesdong1991 Apr 4, 2020
03356ce
remove unused docs
charlesdong1991 Apr 4, 2020
7abc47d
cleaner pandas
charlesdong1991 Apr 4, 2020
db832b4
cleaner
charlesdong1991 Apr 4, 2020
be99a97
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Apr 4, 2020
9ae5987
fixup
charlesdong1991 Apr 4, 2020
10c2ad1
rename
charlesdong1991 Apr 4, 2020
ce8cfd4
code change on reviews
charlesdong1991 Apr 7, 2020
627cc02
fixup
charlesdong1991 Apr 7, 2020
4bfbf03
rebase and resolve conflicts
charlesdong1991 Apr 10, 2020
ee8972d
linting
charlesdong1991 Apr 10, 2020
12ff785
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Apr 10, 2020
163f920
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Apr 15, 2020
0839be2
annotation
charlesdong1991 May 1, 2020
142ee53
annotation
charlesdong1991 May 1, 2020
ef65137
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 May 1, 2020
d793703
reverse change on annotation
charlesdong1991 May 5, 2020
2710cf2
fixup
charlesdong1991 May 5, 2020
f76d2cb
remove
charlesdong1991 May 5, 2020
a5ecbd7
add missing annoatation
charlesdong1991 May 11, 2020
439be51
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 May 11, 2020
5fd420e
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 May 11, 2020
4b4832f
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 May 18, 2020
7425dff
code change on WA review
charlesdong1991 May 21, 2020
8ab4b90
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jun 17, 2020
b06e454
solve mypy
charlesdong1991 Jun 19, 2020
79294ed
fix typo
charlesdong1991 Jun 19, 2020
9523bb9
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jun 19, 2020
cd59370
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jun 20, 2020
050ba95
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jun 20, 2020
add406f
code change on reviews
charlesdong1991 Jun 23, 2020
bb22c53
fix linting
charlesdong1991 Jun 23, 2020
25214e6
rename
charlesdong1991 Jun 27, 2020
aaa5c95
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jun 27, 2020
77e46f4
modulize reformat_y for hist
charlesdong1991 Jun 27, 2020
9de9c61
better annotation
charlesdong1991 Jun 27, 2020
af68d2e
improve annotation
charlesdong1991 Jun 27, 2020
b75015a
fix linting
charlesdong1991 Jun 27, 2020
b90303d
improve docstring
charlesdong1991 Jun 27, 2020
898fa9b
rebase
charlesdong1991 May 21, 2021
2ac32f5
remove added test file
charlesdong1991 May 21, 2021
f7bcdb7
revert change
charlesdong1991 May 21, 2021
aeb32e5
rebase
charlesdong1991 May 21, 2021
dc17959
fixup
charlesdong1991 May 21, 2021
4aee3e0
black
charlesdong1991 May 21, 2021
4eb466f
fixup
charlesdong1991 May 21, 2021
5160224
fix mypy
charlesdong1991 May 22, 2021
e2de0d3
fix mypy
charlesdong1991 May 22, 2021
b2b33ac
fix mypy
charlesdong1991 May 22, 2021
1199a93
fix mypy
charlesdong1991 May 22, 2021
c4a5842
fix mypy
charlesdong1991 May 22, 2021
6556414
fix mypy
charlesdong1991 May 22, 2021
826f277
fix flake8
charlesdong1991 May 22, 2021
891dc55
add by support for boxplot
charlesdong1991 May 22, 2021
4c4a158
doc
charlesdong1991 May 22, 2021
ea7e5b1
Add tests
charlesdong1991 May 22, 2021
006588e
flake8
charlesdong1991 May 22, 2021
4f0a1dc
move file
charlesdong1991 May 22, 2021
e1579e2
pprint label
charlesdong1991 May 24, 2021
f2c141f
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 May 27, 2021
5f96abd
rebase and fix conflict
charlesdong1991 May 31, 2021
06483af
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jun 1, 2021
08f534d
rebase fix conflicts
charlesdong1991 Jun 3, 2021
f1c3a1f
Resolve conflicts and adopt all feedbacks from Marc
charlesdong1991 Jun 29, 2021
e6e96d3
parametrize tests
charlesdong1991 Jun 29, 2021
52e47f1
Fix test
charlesdong1991 Jun 29, 2021
bc2f282
Code changes based on Marc reviews
charlesdong1991 Jun 30, 2021
a43d3bb
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jun 30, 2021
ceeb3c5
update doc
charlesdong1991 Jun 30, 2021
4fea841
version change
charlesdong1991 Jun 30, 2021
3ea2603
Use self.by
charlesdong1991 Jun 30, 2021
9f48139
better code
charlesdong1991 Jun 30, 2021
b1094e3
better inline comment
charlesdong1991 Jun 30, 2021
97bde59
code changes based on Marc reviews
charlesdong1991 Jun 30, 2021
444a964
minor fix
charlesdong1991 Jun 30, 2021
b66dad0
mypy
charlesdong1991 Jun 30, 2021
982f562
add future annotation
charlesdong1991 Jun 30, 2021
c76ad67
fix pre commit
charlesdong1991 Jun 30, 2021
2c1aa33
minor experimental fix
charlesdong1991 Jun 30, 2021
6896546
better doc string
charlesdong1991 Jun 30, 2021
3c54302
fixup doc fail
charlesdong1991 Jul 1, 2021
2d20178
code change on Macro reviews
charlesdong1991 Jul 1, 2021
a169dfd
Add more tests
charlesdong1991 Jul 1, 2021
d0b56ff
fixup
charlesdong1991 Jul 1, 2021
dec313c
code change on reviews
charlesdong1991 Jul 10, 2021
143f286
changes based on Jeff review
charlesdong1991 Jul 12, 2021
283286f
doc
charlesdong1991 Jul 12, 2021
f2a0736
Merge remote-tracking branch 'upstream/master' into fix_by_plot
charlesdong1991 Jul 12, 2021
f1aeee0
fix flake8
charlesdong1991 Jul 12, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -233,6 +233,7 @@ Other enhancements
- Add keyword ``sort`` to :func:`pivot_table` to allow non-sorting of the result (:issue:`39143`)
- Add keyword ``dropna`` to :meth:`DataFrame.value_counts` to allow counting rows that include ``NA`` values (:issue:`41325`)
- :meth:`Series.replace` will now cast results to ``PeriodDtype`` where possible instead of ``object`` dtype (:issue:`41526`)
- Add support for assigning values to ``by`` argument in :meth:``DataFrame.plot.hist`` and :meth:``DataFrame.plot.box`` (:issue:`15079`)

.. ---------------------------------------------------------------------------

Expand Down
13 changes: 13 additions & 0 deletions pandas/plotting/_core.py
Original file line number Diff line number Diff line change
Expand Up @@ -1277,6 +1277,9 @@ def hist(self, by=None, bins=10, **kwargs):
----------
by : str or sequence, optional
Column in the DataFrame to group by.

.. versionadded:: 1.3.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged right? can you add a line on what is changing here

Copy link
Member Author

@charlesdong1991 charlesdong1991 Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was doubting about this. We do accept by right now but don't do anything about that. Now we do start supporting by to make plots for groups, so i put added here instead of changed.

I think then i will change to versionchanged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the release note, this should also be updated to the new expected release version.

Copy link
Member Author

@charlesdong1991 charlesdong1991 Jun 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, changed!


bins : int, default 10
Number of histogram bins to be used.
**kwargs
Expand Down Expand Up @@ -1308,6 +1311,16 @@ def hist(self, by=None, bins=10, **kwargs):
... columns = ['one'])
>>> df['two'] = df['one'] + np.random.randint(1, 7, 6000)
>>> ax = df.plot.hist(bins=12, alpha=0.5)

A grouped histogram can be generated by providing the parameter `by` (which
can be a column name, or a list of column names):

.. plot::
:context: close-figs

>>> age_list = [8, 10, 12, 14, 72, 74, 76, 78, 20, 25, 30, 35, 60, 85]
>>> df = pd.DataFrame({"gender": list("MMMMMMMMFFFFFF"), "age": age_list})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was by allowed before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do allow by, but don't do anything on that. i will also change to versionchanged

>>> ax = df.plot.hist(column=["age"], by="gender", figsize=(10, 8))
"""
return self(kind="hist", by=by, bins=bins, **kwargs)

Expand Down
15 changes: 14 additions & 1 deletion pandas/plotting/_matplotlib/boxplot.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
LinePlot,
MPLPlot,
)
from pandas.plotting._matplotlib.groupby import create_iter_data_given_by
from pandas.plotting._matplotlib.style import get_standard_colors
from pandas.plotting._matplotlib.tools import (
create_subplots,
Expand Down Expand Up @@ -135,17 +136,29 @@ def _make_plot(self):
if self.subplots:
self._return_obj = pd.Series(dtype=object)

for i, (label, y) in enumerate(self._iter_data()):
data = create_iter_data_given_by(self.data, self.by, self._kind)
for i, (label, y) in enumerate(self._iter_data(data=data)):
ax = self._get_ax(i)
kwds = self.kwds.copy()

# When by is applied, show title for subplots to know which group it is
# just like df.boxplot, and need to apply T on y to provide right input
if self.by is not None:
y = y.T
ax.set_title(pprint_thing(label))

ret, bp = self._plot(
ax, y, column_num=i, return_type=self.return_type, **kwds
)
self.maybe_color_bp(bp)
self._return_obj[label] = ret

label = [pprint_thing(label)]

# When `by` is assigned, the ticklabels will become unique grouped
# values, instead of label which is used as subtitle in this case.
if self.by is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we move the line with label = [pprint_thing(label)] at the beginning of the loop, then you can merge this if with the previous, and things will be a bit more readable and compact.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, indeed! moved!

label = [pprint_thing(col) for col in self.data.columns.levels[0]]
self._set_ticklabels(ax, label)
else:
y = self.data.values.T
Expand Down
40 changes: 36 additions & 4 deletions pandas/plotting/_matplotlib/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
from matplotlib.artist import Artist
import numpy as np

from pandas._typing import IndexLabel
from pandas.errors import AbstractMethodError
from pandas.util._decorators import cache_readonly

Expand Down Expand Up @@ -42,6 +43,7 @@
from pandas.io.formats.printing import pprint_thing
from pandas.plotting._matplotlib.compat import mpl_ge_3_0_0
from pandas.plotting._matplotlib.converter import register_pandas_matplotlib_converters
from pandas.plotting._matplotlib.groupby import reconstruct_data_with_by
from pandas.plotting._matplotlib.style import get_standard_colors
from pandas.plotting._matplotlib.timeseries import (
decorate_axes,
Expand Down Expand Up @@ -99,7 +101,7 @@ def __init__(
self,
data,
kind=None,
by=None,
by: IndexLabel | None = None,
subplots=False,
sharex=None,
sharey=False,
Expand All @@ -124,13 +126,34 @@ def __init__(
table=False,
layout=None,
include_bool=False,
column: IndexLabel | None = None,
**kwds,
):

import matplotlib.pyplot as plt

self.data = data
self.by = by
self.by = com.maybe_make_list(by)

# Assign the rest of columns into self.columns if by is explicitly defined
# while column is not, so as to keep the same behaviour with current df.hist
# or df.boxplot.
if self.by and column is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor thing, but since you've got two conditions with self.by, maybe a if self.by: that contains these two if. Not sure, up to you, but feels like it could group the code executed when by is present in a more readable way.

Copy link
Member Author

@charlesdong1991 charlesdong1991 Jun 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! And I also did small changes and realised that we could put column check part separately, because no matter if by input is None or not, column should be updated accordingly. I think it is now a bit clearer

self.columns = [
col
for col in data.columns
if col not in self.by and is_numeric_dtype(data[col])
]
else:
self.columns = com.maybe_make_list(column)

# When `by` is explicitly assigned, grouped data size will be defined, and
# this will determine number of subplots to have, aka `self.nseries`
if self.by:
if self._kind == "hist":
self._grouped_data_size = len(data.groupby(self.by))
elif self._kind == "box":
self._grouped_data_size = len(self.columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just a personal opinion, but this feels a bit unreliable. I mean, defining self._grouped_data_size only when self.by is not None, and in other methods only accessing it in an if self.by is not None.

I'd personally move all this block inside nseries. Something like:

def nseries(self) -> int:
    if self.data.ndim == 1:
        return 1
    elif self.by and self._kind == 'hist':
        return len(self.data.groupby(self.by))
    elif self.by and self._kind == 'box':
        return len(self.colmns)
    else:
        return self.data.shape[1]

Does this make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah! I moved here. Only that the self.data needs to be changed a bit, since self.data is updated later, and what we need here is the original grouped data length. Please take a look at the change here and see if it is okay for you?


self.kind = kind

Expand All @@ -139,7 +162,9 @@ def __init__(
self.subplots = subplots

if sharex is None:
if ax is None:

# if by is defined, subplots are used and sharex should be False
if ax is None and by is None:
self.sharex = True
else:
# if we get an axis, the users should do the visibility
Expand Down Expand Up @@ -275,8 +300,10 @@ def _iter_data(self, data=None, keep_index=False, fillna=None):
def nseries(self) -> int:
if self.data.ndim == 1:
return 1
else:
elif self.by is None:
return self.data.shape[1]
else:
return self._grouped_data_size

def draw(self):
self.plt.draw_if_interactive()
Expand Down Expand Up @@ -421,6 +448,11 @@ def _compute_plot_data(self):
label = "None"
data = data.to_frame(name=label)

# GH15079 reconstruct data if by is defined
if self.by is not None:
self.subplots = True
data = reconstruct_data_with_by(self.data, by=self.by, cols=self.columns)

# GH16953, _convert is needed as fallback, for ``Series``
# with ``dtype == object``
data = data._convert(datetime=True, timedelta=True)
Expand Down
136 changes: 136 additions & 0 deletions pandas/plotting/_matplotlib/groupby.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
from typing import (
Dict,
Optional,
Union,
)

import numpy as np

from pandas._typing import (
FrameOrSeriesUnion,
IndexLabel,
)

from pandas.core.dtypes.missing import remove_na_arraylike

from pandas import (
DataFrame,
MultiIndex,
Series,
concat,
)


def create_iter_data_given_by(
data: DataFrame, by: Optional[IndexLabel] = None, kind: str = "hist"
) -> Union[DataFrame, Dict[str, FrameOrSeriesUnion]]:
"""
Create data for iteration given `by` is assigned or not, and it is only
used in both hist and boxplot.

If `by` is assigned, return a dictionary of DataFrames in which the key of
dictionary is the values in groups.
If `by` is not assigned, return input as is, and this preserves current
status of iter_data.

Parameters
----------
data: reformatted grouped data from `_compute_plot_data` method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need a space before the colon, in all then, also in Returns.

Copy link
Member Author

@charlesdong1991 charlesdong1991 Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

by: list or None, value assigned to `by`.
kind: str, plot kind. This function is only used for `hist` and `box` plots.

Returns
-------
iter_data: DataFrame or Dictionary of DataFrames

Examples
--------
If `by` is assigned:

>>> import numpy as np
>>> tuples = [('h1', 'a'), ('h1', 'b'), ('h2', 'a'), ('h2', 'b')]
>>> mi = MultiIndex.from_tuples(tuples)
>>> value = [[1, 3, np.nan, np.nan],
... [3, 4, np.nan, np.nan], [np.nan, np.nan, 5, 6]]
>>> data = DataFrame(value, columns=mi)
>>> create_iter_data_given_by(data, by=["col1"])
{'h1': DataFrame({'a': [1, 3, np.nan], 'b': [3, 4, np.nan]}),
'h2': DataFrame({'a': [np.nan, np.nan, 5], 'b': [np.nan, np.nan, 6]})}
"""
if kind == "hist":
level = 0
elif kind == "box":
level = 1
else:
raise ValueError("This function is only used for hist and box plot")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing, but in general is better to provide the value that has been passed. Maybe this is called at some point in the middle of some tests, and this is raised, but the user/developer has no idea what has been passed to kind. Having something like `f'create_iter_data_given_by can only be used with kind "hist" and "box" plots, but used with "{kind}"'.

In some cases this may save some time to whoever is facing the exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updated accordingly!

This else branch won't be reached at all, because this function is only used hist and boxplot plot implementation and not being called anywhere else.


iter_data: Union[DataFrame, Dict[str, FrameOrSeriesUnion]]
if not by:
iter_data = data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think it'd be better it this function is only called with a by, and this case is managed in the caller. I think it'll make things much simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sure, updated accordingly!

else:
# Select sub-columns based on the value of first level of MI
assert isinstance(data.columns, MultiIndex)
cols = data.columns.levels[level]
iter_data = {
col: data.loc[:, data.columns.get_level_values(level) == col]
for col in cols
}
return iter_data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I'd return the comprehension directly, assigning to an intermediate variable seems to add verbosity, but not much value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, changed like above!



def reconstruct_data_with_by(
data: DataFrame, by: IndexLabel, cols: IndexLabel
) -> DataFrame:
"""
Internal function to group data, and reassign multiindex column names onto the
result in order to let grouped data be used in _compute_plot_data method.

Parameters
----------
data: Original DataFrame to plot
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same about space before colons.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

by: grouped `by` parameter selected by users
cols: columns of data set (excluding columns used in `by`)

Returns
-------
Output is the reconstructed DataFrame with MultiIndex columns. The first level
of MI is unique values of groups, and second level of MI is the columns
selected by users.

Examples
--------
>>> d = {'h': ['h1', 'h1', 'h2'], 'a': [1, 3, 5], 'b': [3, 4, 6]}
>>> df = DataFrame(d)
>>> reconstruct_data_with_by(df, by='h', cols=['a', 'b'])
h1 h2
a b a b
0 1 3 NaN NaN
1 3 4 NaN NaN
2 NaN NaN 5 6
"""
grouped = data.groupby(by)

data_list = []
for key, group in grouped:
columns = MultiIndex.from_product([[key], cols])
sub_group = group[cols]
sub_group.columns = columns
data_list.append(sub_group)

data = concat(data_list, axis=1)
return data


def reformat_hist_y_given_by(
y: Union[Series, np.ndarray], by: Optional[IndexLabel] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious. You've got IndexLabel | None in some places, and Optional[IndexLabel] here. Is there a reason or one preferred to the other depending on the case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, you are right. The reason was i think there was a pre-commit hook (saw in other PR) that was added that automatically change Optional[IndexLabel] to IndexLabel | None? I forgot to change this one actually to the latest preferred way in pandas which is IndexLabel | None.

Just changed to IndexLabel | None to align with new standard in pandas!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm, i am surprised that IndexLabel | None fails here, @MarcoGorelli could you please help me a bit? Maybe those two still have some nuances that I didn't notice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @charlesdong1991 - you need to have from __future__ import annotations at the top of a file for this new syntax to work

If you have this import, then there should be no need to change it manually, pyupgrade will do it for you as part of the suite of hooks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i changed from IndexLabel | None = None to IndexLabel | None then local mypy passed, however, if using IndexLabel | None = None, then I get this error: error: X | Y syntax for unions requires Python 3.10, pretty weird

Copy link
Member Author

@charlesdong1991 charlesdong1991 Jun 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, I see! Great thanks! I will add then! @MarcoGorelli

) -> Union[Series, np.ndarray]:
"""Internal function to reformat y given `by` is applied or not for hist plot.

If by is None, input y is 1-d with NaN removed; and if by is not None, groupby
will take place and input y is multi-dimensional array.
"""
if by is not None and len(y.shape) > 1:
y = np.array([remove_na_arraylike(col) for col in y.T]).T
else:
y = remove_na_arraylike(y)
return y
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing, and maybe just personal preference, but I'd write this as:

if by is not None and len(y.shape) > 1:
    return np.array([remove_na_arraylike(col) for col in y.T]).T
return remove_na_arraylike(y)

I find it a bit distracting being it so verbose for not much value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Loading