Grouped dataframe "name" attribute overrides column access / not well documented #25457

Open
alkasm opened this issue Feb 26, 2019 · 11 comments
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby

@alkasm
alkasm commented Feb 26, 2019

Code Sample

>>> import pandas as pd
>>> df = pd.DataFrame({'val': [9, 10, 3, 6, 2, 3], 'name': list('xxyxyy'), 'group': list('aaabbb')})
>>> df

	val	name	group
0	9	x	a
1	10	x	a
2	3	y	a
3	6	x	b
4	2	y	b
5	3	y	b

Works correctly:

>>> df.groupby('group').apply(lambda g: g[g['name'] == 'x'])

		val	name	group
group				
a	0	9	x	a
	1	10	x	a
b	3	6	x	b

Errors out:

>>> df.groupby('group').apply(lambda g: g[g.name == 'x'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

Rest of the traceback is here:

~/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2655             try:
-> 2656                 return self._engine.get_loc(key)
   2657             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/venv/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    688             try:
--> 689                 result = self._python_apply_general(f)
    690             except Exception:

~/venv/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 

~/venv/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):

<ipython-input-282-522c70a9fa21> in <lambda>(g)
----> 1 df.groupby('group').apply(lambda g: g[g.name == 'x'])

~/venv/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):

~/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657             except KeyError:
-> 2658                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2659         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2655             try:
-> 2656                 return self._engine.get_loc(key)
   2657             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-282-522c70a9fa21> in <module>
----> 1 df.groupby('group').apply(lambda g: g[g.name == 'x'])

~/venv/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    699 
    700                 with _group_selection_context(self):
--> 701                     return self._python_apply_general(f)
    702 
    703         return result

~/venv/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    705     def _python_apply_general(self, f):
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 
    709         return self._wrap_applied_output(

~/venv/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    188             # group might be modified
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):
    192                 mutated = True

<ipython-input-282-522c70a9fa21> in <lambda>(g)
----> 1 df.groupby('group').apply(lambda g: g[g.name == 'x'])

~/venv/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

~/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656                 return self._engine.get_loc(key)
   2657             except KeyError:
-> 2658                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2659         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2660         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

Problem description

When you call a method on a groupby that expects a function, the documentation states that the function should expect a DataFrame. Indeed, if you inspect the type, a DataFrame is passed. However, this DataFrame has a new attribute added, name, and it overrides dot access to a column called "name" if one exists. This is troubling because dotted access works outside the groupby but not inside the groupby function.

First, this needs to be documented more explicitly. I found it once in the documentation for one of the related functions that the .name attribute gets added, but cannot find it again, so I'm not sure which one it was. Edit: It was the transform() method's docstring, as shown in a comment below.

Related

#9545

Expected Output

The expected output is what happens when you use the [] indexer, instead of dot access.

Suggestions

Not necessarily mutually exclusive:

  • Give a better error message so that it's known .name is an attribute of the group DataFrame
  • Improve documentation on all related methods to know the .name attribute exists and will override the column dot access
  • Only add the .name attribute if a column of that name doesn't exist
  • Append an underscore to the attribute so there's a much lower chance of conflicts, e.g. groupdf.name_
  • Change the attribute name entirely for similar lower chance of conflicts, e.g. groupdf.grouped_value
  • Move into a method call instead, e.g. groupdf.get_group_name()
  • Add a kwarg to apply() / transform() which toggles whether to send a second argument into the function, that second argument being the group name

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.beta.4
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.18
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@gfyoung
Member

gfyoung commented Feb 27, 2019

Bug indeed, though finding the proper "name" for this column needs more robustness (e.g. an algorithm that keeps "building" the column name until there are no conflicts). Investigation and PR are welcome!
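
One way to read the "keep building the name" idea, as a minimal sketch: append underscores to a base name until it no longer collides with any existing label. The helper name pick_free_name and the underscore strategy are assumptions made for illustration, not anything pandas implements.

```python
def pick_free_name(base, taken):
    """Return base, extended with underscores until it is not in taken.

    Hypothetical helper for illustration only; not part of pandas.
    """
    candidate = base
    while candidate in taken:
        candidate += "_"
    return candidate


columns = ["val", "name", "group"]

# "name" is already taken, so one underscore is appended.
free = pick_free_name("name", columns)

# If that result is taken as well, the loop keeps extending it.
free_again = pick_free_name("name", columns + [free])
```

The same loop would apply whether the conflicting labels are column names or attribute names; only the contents of taken change.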

@alkasm
Author

alkasm commented Feb 27, 2019

Bug indeed, though finding the proper "name" for this column needs more robustness (e.g. an algorithm that keeps "building" the column name until there are no conflicts). Investigation and PR are welcome!

Presumably the coding isn't the hard part, but the API choices here. Also want to say it's not a column name, but an attribute name.

One other possibility that I didn't mention in the suggestions above is:

Add a kwarg to apply(), transform() and whatever other groupby methods that accept a function, which tells it whether to pass the group name as well as the group dataframe into the function. In other words, something like:

def groupedfunc(group, grouped_value):
    print(grouped_value)
    return group

df.groupby('col').apply(groupedfunc, pass_grouped_value=True)

That way apply() would pass both the group and the grouped-by value into groupedfunc. By default the kwarg would be False, so it works like normal, but without the name attribute.

I like this idea since it doesn't add attributes to a dataframe that aren't part of the documented dataframe interface. The biggest problem is that it's not as simple of a swap for anyone currently using the .name attribute, like it would be if the name were changed to .name_ or pushed into a method call, but I think it's a better design overall.
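
Until something like that exists, the same effect can be approximated in user code. The following is only a sketch: apply_with_key is a hypothetical helper, not a pandas function, and it assumes a single grouping column given as a string so that each key is a scalar. It iterates over the groupby, passes the key as an explicit second argument, and stacks the results with pd.concat.

```python
import pandas as pd


def apply_with_key(df, by, func):
    """Call func(group, key) for every group and stack the results.

    Hypothetical helper for illustration; not part of pandas.
    """
    pieces = {key: func(group, key) for key, group in df.groupby(by)}
    return pd.concat(pieces)  # dict keys become the outer index level


df = pd.DataFrame({"val": [9, 10, 3, 6, 2, 3],
                   "name": list("xxyxyy"),
                   "group": list("aaabbb")})

seen = []  # records the grouped-by values handed to the function


def keep_x(group, grouped_value):
    seen.append(grouped_value)
    # Bracket access reaches the "name" column regardless of any attribute.
    return group[group["name"] == "x"]


result = apply_with_key(df, "group", keep_x)
```

Because the key arrives as a normal argument, the function never needs to read group.name, so a column called "name" cannot collide with it.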

@WillAyd
Member

WillAyd commented Feb 27, 2019

Admitting that I have a personal bias against attribute access, do we really want to change anything here? It is a documented limitation of attribute access that things won't work if it conflicts with an existing attribute name:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#attribute-access

I think we'd be going down an endless rabbit hole if we tried to make guarantees around this

@TomAugspurger
Contributor

Agreed with @WillAyd. Attribute access is merely a convenience especially geared towards interactive use. If you want to select columns, use __getitem__.

@alkasm
Author

alkasm commented Feb 27, 2019

It is a documented limitation of attribute access that things won't work if it conflicts with an existing attribute name

The problem IMO is that it isn't an existing attribute name until it gets passed into the function. It's an undocumented attribute of a dataframe that gets "magically" added. It's not part of the dataframe interface and isn't something that exists with dir() or help() or whatever on pd.DataFrame.

I understand that dotted access is syntactic sugar for quicker interactive work, but I'm more concerned with something working outside of a groupby function and then not working inside the function that acts on each group. The API just suddenly changes in one part of a pipeline.

If this is something that happens often in Pandas (adding new attributes to dataframes when passing them around), then I guess this is something that just needs to be documented better. But if it's a one-off case, I believe it's worth re-thinking the necessity of adding/overwriting an attribute.

@alkasm
Author

alkasm commented Feb 27, 2019

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#attribute-access

In terms of additional documentation, perhaps here it should be noted that name shouldn't be used, either.

@TomAugspurger
Contributor

Completely agreed that we can improve the documentation here.

@WillAyd
Member

WillAyd commented Feb 27, 2019

The problem IMO is that it isn't an existing attribute name until it gets passed into the function. It's an undocumented attribute of a dataframe that gets "magically" added. It's not part of the dataframe interface and isn't something that exists with dir() or help() or whatever on pd.DataFrame.

I haven't stepped through to verify but .name is a documented attribute of a Series, and by-Series evaluation could certainly be the culprit here
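
For reference, a short check of where a name attribute is actually defined, assuming a reasonably recent pandas: Series carries name as a property, while the DataFrame class defines none, which is why df.name normally falls through to the column. The small frame here is made up for the check.

```python
import pandas as pd

df = pd.DataFrame({"val": [9, 10, 3], "name": list("xxy")})

# Series.name is part of the documented Series API.
series_has_name = hasattr(pd.Series, "name")

# The DataFrame class itself defines no such attribute.
frame_has_name = hasattr(pd.DataFrame, "name")

# A column pulled out as a Series is named after its label.
column_label = df["val"].name

# Outside a groupby, dot access reaches the "name" column.
dot_matches_brackets = df.name.equals(df["name"])
```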

@alkasm
Author

alkasm commented Feb 27, 2019

I haven't stepped through to verify but .name is a documented attribute of a Series, and by-Series evaluation could certainly be the culprit here

That's good to know! What does "by-Series evaluation" mean in this context? The object passed into the function is indeed a dataframe and has the .name attribute.

Also to have it documented here, running help(pd.core.groupby.DataFrameGroupBy.transform) gives:

Help on function transform in module pandas.core.groupby.groupby:

transform(self, func, *args, **kwargs)
    Call function producing a like-indexed DataFrame on each group and
    return a DataFrame having the same indexes as the original object
    filled with the transformed values

    Parameters
    ----------
    f : function
        Function to apply to each group

    Notes
    -----
    Each group is endowed the attribute 'name' in case you need to know
    which group you are working on.

...

OTOH help(pd.core.groupby.DataFrameGroupBy.apply) does not have a note about the 'name' attribute.

@WillAyd
Member

WillAyd commented Feb 27, 2019

Interestingly enough that docstring you mention isn't part of the API. I think it might actually be wrong and only part of the SeriesGroupBy implementation linked below:

for name, group in self:

Though the point remains that we can't make guarantees about every attribute name, perhaps there is something here to be cleaned up, given that this isn't part of the API and may just be an internal modification. Not sure what value this provides.

@alkasm
Author

alkasm commented Feb 27, 2019

Interestingly enough that docstring you mention isn't part of the API.

What do you mean? The df.groupby('column') object is of type pandas.core.groupby.groupby.DataFrameGroupBy so I pulled the docstring from that class's transform method.

There's a similar thing also at Line 481 in the NDFrameGroupBy (which is subclassed by DataFrameGroupBy):

for name, group in gen:
    object.__setattr__(group, 'name', name)
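
The effect of an object.__setattr__ call like this can be reproduced without groupby at all. This is a sketch on a hand-made slice of the frame from the original report, not the pandas internals themselves: the attribute lands on the instance, so ordinary attribute lookup finds it before the column fallback is consulted, while [] access is unaffected.

```python
import pandas as pd

df = pd.DataFrame({"val": [9, 10, 3, 6, 2, 3],
                   "name": list("xxyxyy"),
                   "group": list("aaabbb")})

# Stand-in for one group; built by hand rather than by groupby.
group = df[df["group"] == "a"]
before = group.name.tolist()            # dot access reaches the "name" column

# Pin the group key on the instance, bypassing DataFrame.__setattr__.
object.__setattr__(group, "name", "a")

after = group.name                      # now the pinned key, not the column
mask = group.name == "x"                # a plain bool instead of a boolean Series
by_brackets = group["name"].tolist()    # [] still reaches the column
```

With mask being a plain False, g[g.name == 'x'] turns into g[False], which is consistent with the KeyError: False in the traceback from the original report.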

@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 13, 2020

5 participants