
No API reference for DataFrameGroupBy and "combining" step #2644


Closed
jankatins opened this issue Jan 5, 2013 · 36 comments

@jankatins
Contributor

At least by searching google, I found no references for the DataFrameGroupBy. The docs at http://pandas.pydata.org/pandas-docs/dev/groupby.html speak about a "GroupBy object"...

What are all the functions of that object? Using IPython, I get inline help, but this information seems to be missing in the online docs.

I'm also missing the "combine" step in the split-apply-combine doc: how can values be added to the original dataframe? I only found something about transform, which returns a complete dataframe. grouped.apply() returns a single Series. Did I miss something, or is the combine step really just "merge the new dataframe/series with the old one"?

@mhlr

mhlr commented Apr 23, 2014

+1. An undocumented GroupBy really limits the usefulness of Pandas.

@jreback
Contributor

jreback commented Apr 23, 2014

well the groupby docs are quite complete, what exactly is the problem?

if you read thru them, you will see exactly what a GroupBy object is:

http://pandas.pydata.org/pandas-docs/dev/groupby.html#splitting-an-object-into-groups

And the fact that it applies the functions directly

http://pandas.pydata.org/pandas-docs/dev/groupby.html#aggregation

If you think there is something missing, then pls be specific about it

"+1. An undocumented GroupBy really limits the usefulness of Pandas."

is a very odd statement

@cpcloud
Member

cpcloud commented Apr 23, 2014

There are some API docs missing for GroupBy objects, I think that is what @JanSchulz is talking about

  • filter: (even though that is documented elsewhere)
  • first/last: (these are different from the DataFrame and Series methods and should be documented)
  • name: not sure what the purpose of this is
  • nth: very useful when you want to try out an operation on a more expensive groupby apply (e.g., many levels)
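To make the list above concrete, here is a minimal sketch of those GroupBy-specific methods on a toy frame. The column names and data are made up for illustration, and since this thread dates from 2014, details such as how the `nth` result is indexed may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two groups, with missing values in "x".
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "x": [np.nan, 1.0, 2.0, 3.0, np.nan],
})
g = df.groupby("key")

# first/last on a groupby pick the first/last non-null value per group,
# which is not what the same-named DataFrame/Series methods do.
first_vals = g["x"].first()
last_vals = g["x"].last()

# nth(0) takes the row at position 0 of each group, nulls included.
nth_rows = g.nth(0)

# filter keeps only the rows belonging to groups that satisfy a predicate.
big_groups = g.filter(lambda grp: len(grp) > 2)
```

Note how `first` skips the leading NaN in group "a" while `nth(0)` does not.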

@jorisvandenbossche
Member

Yep, I also follow the initial poster. @jreback I think the groupby guide is already in very good shape, certainly. But I personally think it is important to see that the user guide and reference docs are complementary, and both important.
And there are indeed some things missing in the api docs (I was also working on a draft issue to list some of the missing docs, will try to finish that).

@mhlr It would indeed be a bit more constructive to say what you miss than just saying that it limits the usefulness. In fact, if you would want to do that, give some concrete things you miss in the docs about this, that would be very helpful for us! Really.
(a PR even more helpful .. :-) but already knowing what is missing also. If you are working a lot on this, it is sometimes difficult to see what information is missing for a new user).

@JanSchulz What do you exactly mean with the missing "combine" step? Because I think you misinterpret the combine step, as this is about combining the different groups together in one dataframe, not about combining the output of the groupby with the original array (for that you can just assign it to a column in the original array, or concat both, ..) That is not clear enough from the first paragraph in the docs?
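As a rough illustration of attaching groupby output back to the original frame, the sketch below shows two options: assigning a `transform` result to a new column, and joining an aggregated result on the group key. The frame and column names are invented for the example.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "val": [1.0, 2.0, 3.0, 4.0],
})

# Option 1: transform returns a result aligned with the original index,
# so it can be assigned directly as a new column.
df["group_mean"] = df.groupby("key")["val"].transform("mean")

# Option 2: aggregate to one value per group, then join it back on the key.
totals = df.groupby("key")["val"].sum().rename("group_total")
merged = df.join(totals, on="key")
```

Option 1 is usually the shorter route when the per-group value should be repeated on every row of the group.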

@jorisvandenbossche
Member

And something else, in the original issue @JanSchulz also mentions the fact that the docs only speak about the GroupBy object, while in practice you always get a DataFrameGroupBy or SeriesGroupBy object. For us it is maybe obvious, but not for a lot of users I think (but not much to do about that I suppose? apart from mentioning it in the docs).

@cpcloud
Member

cpcloud commented Apr 23, 2014

In my experience with pandas i've almost never cared about the distinction between the types of groupby objects, whereas with series vs frame i have to be more cognizant of any api differences (but even then not so much). IMO distinguishing between the different kinds of groupbys in the main docs is too much detail and is best left for api docs.

@mhlr

mhlr commented Apr 23, 2014

DataFrame.groupby is well documented. It returns a DataFrameGroupBy object which is not documented.
I see no document which gives a definitive list of DataFrameGroupBy methods with their signatures and assumptions/requirements/contracts the way that http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html does for DataFrame. There is no http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrameGroupBy.html
There is an introduction with examples at
http://pandas.pydata.org/pandas-docs/dev/groupby.html
but that does not constitute an API reference.

@jorisvandenbossche
Member

@cpcloud yes, I agree fully with that, but still the user sees DataFrameGroupBy, and maybe you know that you shouldn't care about it, but some other users don't

@jreback
Contributor

jreback commented Apr 23, 2014

I suppose it's easy enough to add the whitelisted methods in the groupby API section, but it seems kind of repetitive; maybe list them in the groupby docs (so we can auto-generate them)...that is the big issue here.

DataFrameGroupby/SeriesGroupby.sum() is de facto DataFrame.sum()

with the exception of certain methods which are slightly different (which is itself an API bug, they should be the same)

@cpcloud
Member

cpcloud commented Apr 23, 2014

i don't think listing whitelisted methods is appropriate ... why not just list the methods that are specific to groupby-type objects?

@jorisvandenbossche
Member

@mhlr Thanks for the elaboration. The GroupBy object is indeed missing in the API docs (and it is not much work to add it actually, just one line in api.rst). As said in the previous comments, the DataFrameGroupby is actually the 'same' (for a user).

@jorisvandenbossche
Member

@cpcloud Why isn't it appropriate? Why otherwise whitelist them? They are whitelisted because they are useful, so users should be able to know about that?

@jreback
Contributor

jreback commented Apr 23, 2014

well, can we auto-generate them in the API section? (want to avoid manually doing this, so they are only in 1 place, namely the whitelist)

@mhlr

mhlr commented Apr 23, 2014

I think it would be useful to have {DataFrame,Series}GroupBy API pages as well, even if they just link or redirect to the GroupBy page, just to avoid the initial confusion caused by not being able to find them. Definitely should have a GroupBy page.

@jreback
Contributor

jreback commented Apr 23, 2014

@mhlr did you see this:

Well some are already documented: http://pandas.pydata.org/pandas-docs/stable/api.html#groupby

@jorisvandenbossche
Member

But not all are that useful ... (aggregate and transform are just empty)

@jreback about listing the whitelisted methods automatically, I don't know if this is possible. Because, in a python session, they don't appear if you do pd.core.groupby.DataFrameGroupby.<TAB>, only on an instantiated groupby object: g = df.groupby(..); g.<TAB>. So I don't think sphinx (autosummary for the GroupBy class) will find them.

@jreback
Contributor

jreback commented Apr 23, 2014

@jorisvandenbossche ahh...I think Groupby needs to have a base class of StringMixIn and define __localdir__ and then it would work (as this is where the tab completion comes from)

@jorisvandenbossche
Member

@jreback actually it is numpydoc (not sphinx autodoc) that is doing this, and they use inspect.getmembers (https://github.com/numpy/numpydoc/blob/master/numpydoc/docscrape.py#L519), but I suppose this method will use __dir__ of a class?

@jreback
Contributor

jreback commented Apr 23, 2014

StringMixIn defines __dir__ (the __localdir__ is just to add to it)

umm....you know better than I how that works, but if it uses __dir__ then it should work

@jorisvandenbossche
Member

yep, inspect.getmembers is just a loop through dir(object)

@jreback
Contributor

jreback commented Apr 23, 2014

@jorisvandenbossche Groupby is subclassing PandasObject so its ok,
and __dir__ is OK; its the whitelist, so should work

@cpcloud
Member

cpcloud commented Apr 23, 2014

@jorisvandenbossche i was thinking that the whitelisted methods can be seen with tab completion and operate on each group in exactly the same way that they work without the groupby ... so was thinking that that's very repetitive ... not a huge deal

@jreback
Contributor

jreback commented Apr 23, 2014

dir(DataFrameGroupBy) actually looks ok, but in the docs we are using GroupBy, so maybe just add those 2 sections (and suppress the GroupBy one)

@jorisvandenbossche
Member

@cpcloud actually I agree it's maybe a bit repetitive to have separate generated api docstring pages for each of them, as it is indeed exactly the same method (so in that case they don't need to appear in dir(GroupBy)), but I think it is interesting to just list them somewhere (eg in the docstring of GroupBy) so it's not only by discovering through TAB that you can find out about it.

@cpcloud
Member

cpcloud commented Apr 23, 2014

@jorisvandenbossche fair enough :)

@jreback
Contributor

jreback commented Apr 23, 2014

you could generate the __doc__ easy enough

@jorisvandenbossche
Member

@jreback yep, that is actually true, and a good idea! Just inject a string with the list of methods in the groupby docstring from the _apply_whitelist (and format it a little bit). Something like that?
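A minimal sketch of what such docstring injection could look like. The class, the whitelist contents, and the helper name are hypothetical stand-ins, not the actual pandas internals.

```python
# Hypothetical whitelist; the real pandas one is not reproduced here.
_EXAMPLE_WHITELIST = frozenset(["cumsum", "rank", "shift"])


class GroupByLike:
    """Hypothetical stand-in for a GroupBy class."""


def _whitelist_doc(names):
    """Format the whitelisted method names as a small docstring section."""
    lines = ["", "", "Dispatched methods", "------------------"]
    lines += [f"- {name}" for name in sorted(names)]
    return "\n".join(lines)


# Append the generated section so the list lives in one place (the whitelist).
GroupByLike.__doc__ = (GroupByLike.__doc__ or "") + _whitelist_doc(_EXAMPLE_WHITELIST)
```

With this approach the method list shows up in `help(GroupByLike)` and in any generated API page that renders the class docstring, without maintaining a second copy by hand.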

@jreback strange, as the docstring of dir() says If the object supplies a method named __dir__, it will be used

@jreback
Contributor

jreback commented Apr 23, 2014

@jorisvandenbossche I think dir is fine, it's just not looking at DataFrameGroupBy / SeriesGroupBy (where the __dir__ has ALL the methods). Just need to add those classes in the auto-summary.

@jorisvandenbossche
Member

@jreback but I think only in class instances, not the class itself (because if I do pd.core.groupby.DataFrameGroupBy.<TAB> they also don't appear, and in g = df.groupby(..): g.<TAB> they do)
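The class-versus-instance difference described here follows from how Python looks up `__dir__`: a `__dir__` method defined in a class body customizes `dir()` for instances, while `dir()` on the class itself goes through the metaclass. A small plain-Python sketch (the class and whitelist are hypothetical, not pandas code):

```python
import inspect


class Grouped:
    """Hypothetical class that dispatches whitelisted names dynamically."""

    _whitelist = frozenset(["cumsum", "rank"])

    def __dir__(self):
        # Only consulted for dir(instance), not for dir(Grouped).
        return sorted(set(super().__dir__()) | self._whitelist)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name in self._whitelist:
            return lambda: f"dispatched {name}"
        raise AttributeError(name)


g = Grouped()

# inspect.getmembers builds its result from dir(obj), so it inherits
# the same class-vs-instance behaviour.
instance_names = {name for name, _ in inspect.getmembers(g)}
class_names = {name for name, _ in inspect.getmembers(Grouped)}
```

Here `"cumsum"` ends up in `instance_names` but not in `class_names`, which matches the tab-completion behaviour reported above and suggests why a documentation tool that inspects the class alone would miss such names.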

@jorisvandenbossche
Member

I dug up my draft notes about this and made a new issue (#6944) with an overview of what I think is missing in the reference docs (made a new issue because this is already getting long and was originally not that broad, and you know I like clear overview issues :-))

@jreback
Contributor

jreback commented Apr 23, 2014

ok....let's close this and use your new issue then

maybe _local_dir should be a staticmethod then...hmm

@jankatins
Contributor Author

@jorisvandenbossche My issue was simply that the doc has the big header "split-apply-combine" but afterwards only talks about "split" and "apply", and never combines them (only this is mentioned about combine: "Combining the results into a data structure"). As a newbie to that paradigm it was a bit confusing to miss that step in the docs.

@jorisvandenbossche
Member

@JanSchulz OK, that can indeed be made more clear in the intro. The reason that there is no separate "combine" section afterwards is that it happens at the same time as the "apply". And you cannot really adjust this as a user (there are no different ways to 'combine'); the groups are just concatenated to a dataframe/series (in contrast to e.g. plyr in R, where you as a user can choose between different functions to have different data type outputs for how to combine the different groups).
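To show the three steps explicitly, the sketch below does split, apply, and combine by hand and compares the result with the built-in groupby call, where the combine happens implicitly. The data and the de-meaning function are made up for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "val": [1.0, 2.0, 3.0, 4.0],
})

# Split: iterating a groupby yields (key, sub-frame) pairs.
# Apply: subtract each group's mean from its values.
pieces = [group["val"] - group["val"].mean() for _, group in df.groupby("key")]

# Combine: concatenate the per-group results and restore the original order.
manual = pd.concat(pieces).sort_index()

# The same computation, with the combine step done by pandas.
builtin = df.groupby("key")["val"].transform(lambda s: s - s.mean())
```

Both `manual` and `builtin` hold the same de-meaned values in the original row order.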

@cpcloud
Member

cpcloud commented Apr 25, 2014

@jorisvandenbossche is there a feature request in there somewhere? 😄

@jorisvandenbossche
Member

@cpcloud I don't know, is there a need? :-)
Personally I find it a plus of pandas that we do not have separate groupby functions to output lists, arrays or dataframes ..

@cpcloud
Member

cpcloud commented Apr 25, 2014

Oh I see what you mean. Yes, that is a very nice aspect of pandas.
