Grouped dataframe "name" attribute overrides column access / not well documented #25457
Comments
Bug indeed, though finding the proper "name" for this column needs more robustness (e.g. an algorithm that keeps "building" the column name until there are no conflicts). Investigation and PR are welcome! |
Presumably the coding isn't the hard part, but the API choices here. Also want to say it's not a column name, but an attribute name. One other possibility that I didn't mention in the suggestions above is: Add a kwarg to
    def groupedfunc(group, grouped_value):
        print(grouped_value)
        return group

    df.groupby('col').apply(groupedfunc, pass_grouped_value=True)
So that I like this idea since it doesn't add attributes to a dataframe that aren't part of the documented dataframe interface. The biggest problem is that it's not as simple of a swap for anyone currently using the |
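The second-argument idea can be approximated today, without any new kwarg, by iterating over the groups explicitly. This is a minimal sketch, assuming pandas is installed. `apply_with_key` is a hypothetical helper name (not a pandas API), the sample data is made up for illustration, and `pass_grouped_value` above remains only a proposal.

```python
import pandas as pd


def apply_with_key(df, by, func):
    """Hypothetical helper: call func(group, key) per group and concatenate the results."""
    # Iterating a groupby yields (key, group) pairs; the key is passed explicitly
    # instead of being read from an attribute on the group.
    pieces = [func(group, key) for key, group in df.groupby(by)]
    return pd.concat(pieces)


def groupedfunc(group, grouped_value):
    # Bracket indexing always selects the column called "name".
    return group.assign(label=group["name"] + "-" + str(grouped_value))


# Illustrative data only: a grouping column plus a column literally named "name".
df = pd.DataFrame({"col": ["a", "a", "b"], "name": ["x", "y", "z"]})
result = apply_with_key(df, "col", groupedfunc)
print(result)
```

Because the key arrives as an ordinary argument, the function does not depend on anything attached to the group DataFrame, and `group["name"]` keeps referring to the column.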
Admitting that I have a personal bias against attribute access, do we really want to change anything here? It is a documented limitation of attribute access that things won't work if the name conflicts with an existing attribute name: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#attribute-access I think we'd be going down an endless rabbit hole if we tried to make guarantees around this |
Agreed with @WillAyd. Attribute access is merely a convenience especially geared towards interactive use. If you want to select columns, use |
The problem IMO is that it isn't an existing attribute name until it gets passed into the function. It's an undocumented attribute of a dataframe that gets "magically" added. It's not part of the dataframe interface and isn't something that exists with
I understand that dotted access is syntactic sugar for quicker interactive work, but I'm more concerned with something working outside of a groupby function and then not working inside the function that acts on each group. The API just suddenly changes in one part of a pipeline. If this is something that happens often in Pandas (adding new attributes to dataframes when passing them around), then I guess this is something that just needs to be documented better. But if it's a one-off case, I believe it's worth re-thinking the necessity of adding/overwriting an attribute. |
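The shadowing described here follows from ordinary Python attribute lookup: `__getattr__` is only a fallback, so any attribute set on the instance wins over it. Here is a toy sketch, assuming column dot access is implemented as such a fallback. `ToyFrame` is a made-up stand-in and not pandas code, and the `object.__setattr__` call only imitates an attribute being pinned on the group.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame (hypothetical, not pandas)."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        # Bracket access goes straight to the column store.
        return self._columns[key]

    def __getattr__(self, attr):
        # Called only when normal attribute lookup has already failed.
        columns = self.__dict__.get("_columns", {})
        if attr in columns:
            return columns[attr]
        raise AttributeError(attr)


frame = ToyFrame({"name": ["x", "y"], "value": [1, 2]})

before = frame.name  # no real attribute yet, so the fallback returns the column

# Imitate an attribute being attached to the object after construction.
object.__setattr__(frame, "name", "group_a")

after = frame.name            # the instance attribute now shadows the column
via_brackets = frame["name"]  # bracket access is unaffected

print(before, after, via_brackets)
```

The same object answers `frame.name` differently before and after the attribute is attached, while `frame["name"]` stays stable, which is the inconsistency this comment is pointing at.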
In terms of additional documentation, perhaps here it should be noted that |
Completely agreed that we can improve the documentation here. |
I haven't stepped through to verify but |
That's good to know! What does "by-Series evaluation" mean in this context? The object passed into the function is indeed a dataframe and has the Also to have it documented here, running
OTOH |
Interestingly enough that docstring you mention isn't part of the API. I think it might actually be wrong and only part of the SeriesGroupBy implementation linked below: pandas/pandas/core/groupby/generic.py Line 932 in fe1654f
Though the point remains that we can't make guarantees about all attribute names, perhaps there is something here to be cleaned up, given this isn't part of the API and may just be an internal modification. Not sure what value this provides |
What do you mean? The There's a similar thing also at Line 481 in the pandas/pandas/core/groupby/generic.py Lines 480 to 481 in fe1654f
Code Sample
Works correctly:
Errors out:
Rest of the traceback is here:
Problem description
When you call a method on a groupby that expects a function, the documentation states that the function should expect a DataFrame. Indeed, if you inspect the type, a DataFrame is passed. However, this DataFrame has a new attribute added (name), and this overwrites the dot accessor for a column "name" if it exists. This is a little troubling since dotted access will work outside of the groupby and not inside the groupby function.
First, this needs to be documented more explicitly. I found it once in the documentation for one of the related functions that the .name attribute gets added, but cannot find it again, so I'm not sure which one it was. Edit: It was the transform() method's docstring, as shown in a comment below.
Related
#9545
Expected Output
The expected output is what happens when you use the [] indexer, instead of dot access.
Suggestions
Not necessarily mutually exclusive:
.name is an attribute of the group DataFrame
.name attribute exists and will override the column dot access
.name attribute if a column of that name doesn't exist
groupdf.name_
groupdf.grouped_value
groupdf.get_group_name()
apply()/transform() which toggles whether to send a second argument into the function, that second argument being the group name
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.0.beta.4
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.18
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None