Skip to content

DOC/API: Discussion on whether __finalize__ is public #33338

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
TomAugspurger opened this issue Apr 6, 2020 · 2 comments
Open

DOC/API: Discussion on whether __finalize__ is public #33338

TomAugspurger opened this issue Apr 6, 2020 · 2 comments
Labels
Docs metadata _metadata, .attrs Needs Discussion Requires discussion from core team before further action Subclassing Subclassing pandas objects

Comments

@TomAugspurger
Copy link
Contributor

Currently NDFrame.__finalize__ is not documented. We only have a couple references to it in the whatsnew.

I have a few questions around its API

  1. Is it considered public? Can subclasses call self.__finalize__()?
  2. Are the values passed in method='...' considered public?
@TomAugspurger TomAugspurger added Docs Needs Discussion Requires discussion from core team before further action metadata _metadata, .attrs labels Apr 6, 2020
@jorisvandenbossche
Copy link
Member

That makes me hesitant to include method in some of the methods here, which in turn makes me think that's it's premature to say method is stable if it's been in a release.

I'd rather work with subclassers to find what they're relying on. But since we're in an API gray zone that may be hard.

To be clear, I only meant to say "those method that have been in a release for ages". So not implying that the new ones you added now will become "stable" once we release 1.1.

So for clarity, before your PR to add method arguments, there were only 2 places where we actually used this: "concat" and the different "merge" flavors.
And both are used by GeoPandas:

https://github.com/geopandas/geopandas/blob/078062d303e401aaa5e37e04c0e7c3ce188920fe/geopandas/geodataframe.py#L643-L656

    def __finalize__(self, other, method=None, **kwargs):
        """propagate metadata from other to self """
        # merge operation: using metadata of the left object
        if method == "merge":
            for name in self._metadata:
                object.__setattr__(self, name, getattr(other.left, name, None))
        # concat operation: using metadata of the first object
        elif method == "concat":
            for name in self._metadata:
                object.__setattr__(self, name, getattr(other.objs[0], name, None))
        else:
            for name in self._metadata:
                object.__setattr__(self, name, getattr(other, name, None))
        return self

To repeat what I said on the PR: I always assumed this is part of our sublcassing API. Although it is actually not documented in our subclassing docs. But probably since geopandas has always been using that, I assumed it be part of the sublcassing API.
Also note that we don't actually use methods internally ourselves. It was still added to the API, so I can only think this was done exactly with the purpose for subclasses to be able to override finalize to use methods (or do other custom handling of metadata).

That said, the "API" is quite brittle though. If you look at the GeoPandas snippet above, you will see that for "concat" it gets the _Concatenator object, where it relies on other.objs[0] being a DataFrame/Series, and for the "merge" method it gets the _MergeOperation object, where it relies on other.left being a DataFrame/Series.
So this way, through the actual object that is passed to __finalize__ (and so not just __finalize__ itself or the values passed to method=) it also reaches quite deep into the internals of pandas ...

For new APIs, if we decide to go through with finalize methods and regard this as public, we should maybe try to rationalize what object is passed. For example, for something like "concat", it could also have been a list of dataframe objects, instead of the _Concatenator object, if we assume that this is all the information that would be needed to decide on metadata propagation.

@TomAugspurger
Copy link
Contributor Author

That said, the "API" is quite brittle though.

Making this more robust is my main goal of any refactor. The current status of method is some arbitrary string, which may or may not inform you of what's in other (depending on whether we consider the type of other as part of the API) isn't really sustainable.

@jorisvandenbossche jorisvandenbossche added the Subclassing Subclassing pandas objects label Dec 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs metadata _metadata, .attrs Needs Discussion Requires discussion from core team before further action Subclassing Subclassing pandas objects
Projects
None yet
Development

No branches or pull requests

2 participants