-
Notifications
You must be signed in to change notification settings - Fork 118
Pandas 2 assign_in_place() fix with categoricals
#948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
activitysim/core/util.py
Outdated
| # when df and df2 column are both categorical, union categories | ||
| from pandas.api.types import union_categoricals | ||
|
|
||
| uc = union_categoricals([df[c], df2[c]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend setting sort_categories=True here, which will help make sure that categories populate in a stable order under multiprocessing, which will help with reproducibility/stability, especially when multiprocessing.
|
Has this been tested? Can we write a test for it? It looks like it should work, but it also may have unexpected consequences (e.g. from reordering categories). I see the current tests are passing, but they were also passing without this so it's clear our current tests are not exercising the problem being solved. |
|
@i-am-sijia, I just also saw your comment in #946 (comment) . Can we check if bringing "sort_categories=True" to the other existing usage[s] of union_categoricals will fail any of our existing tests? |
|
@jpn-- I added activitysim/activitysim/abm/models/mandatory_tour_frequency.py Lines 137 to 141 in ba4a12e
We need union_categoricals() when new category values are added. You are right that under multiprocessing, the sequence of how the new categories are being appended may be unstable. But if we set sort_categories=True during union, it will also change the original sequence of the old category values.
In terms of unit tests. I was thinking we can start with the following:
|
|
Maybe we are getting too far ahead of ourselves. I agree that most categoricals are fully enumerated at creation time, in some logical order that may be meaningful to modelers (e.g. driving, then transit, then nonmotorized). And, most categoricals in ActivitySim are logically unordered. For sharrow, any change in a categorical data type (adding categories, removing unused categories, reordering them, making them "ordered" when they were previously just nominal), literally any change at all will result in recompiling. As long as we get stable sort ordering most of the time, we can survive with corner cases that don't have stable order (e.g. the sample is too small and some unusual tour type doesn't always get included). So, maybe we just merge this to fix the bug encountered as reported at the top, and leave a more complete solution to efficiently handling categoricals until ActivitySim 2.x? |
ActivitySim models often call the
activitysim.core.util.assign_in_place(df, df2, ...)method to update existing values indfusing values fromdf2, or to add new columns fromdf2. When performing the former, it callsdf.update(df2)to update common columns in place.In Pandas 2.x,
df.update(df2)will raise aTypeErrorif the common columnCindfis of categorical dtype anddf2['C']contains category value(s) that are not present indf['C']'s categories.This issue was encountered by SANDAG in the trip preprocessor of their airport model, see discussion #946 . None of the existing ActivitySim example models exhibit this issue.
The solution in this PR first makes sure
df2['C']is a categorical type, and then unions the two categoricals before calling df.update(df2).