Tidyup 8 - Expanding the filter() family
#30
jennybc
left a comment
I like the proposal! Made a few comments as I reacted to a first reading.
filter() family
|
tidyups/008-dplyr-filter-family.md Line 948 in 5b76b43
FWIW, as a user I would much prefer the name Love the idea for this API btw! |
|
@wurli most of us felt that We also really appreciated how it feels like a "variant" of With |
|
With |
|
Love this implementation, I do think the |
|
Awesome proposal! My 2 cents - I think filter/filter_out is slightly unclear naming. I think filter_keep/filter_drop would be better with filter deprecated |
|
@davidhodge931 as stated in the tidyup at https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md#alternate-names-for-filter, we are not considering renaming |
|
Love the idea! How do you teach that
# Sequence with filter()
. |>
filter(x) |>
filter(y)
# Same as conjunction
. |>
filter(x, y)
# Sequence with filter_out()
. |>
filter_out(x) |>
filter_out(y)
# Same as alternation (!?!)
. |>
filter_out(x | y) |
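A quick way to check the equivalence in the snippet above is a base R sketch. `drop_rows()` is a hypothetical stand-in for `filter_out()`, not part of dplyr; it assumes `filter_out()` drops only rows where the condition is `TRUE` and keeps rows where it is `FALSE` or `NA`.

```r
# Hypothetical stand-in for filter_out(): drop rows where `cond` is TRUE,
# keep rows where it is FALSE or NA (an assumption about the proposal).
drop_rows <- function(df, cond) {
  df[!(cond %in% TRUE), , drop = FALSE]
}

# Every combination of TRUE / FALSE / NA for two conditions
df <- data.frame(
  id = 1:9,
  x = rep(c(TRUE, FALSE, NA), times = 3),
  y = rep(c(TRUE, FALSE, NA), each = 3)
)

# Sequence: drop rows where x is TRUE, then rows where y is TRUE
step1 <- drop_rows(df, df$x)
sequential <- drop_rows(step1, step1$y)

# Single call using alternation
combined <- drop_rows(df, df$x | df$y)

# Both approaches leave the same rows, including the NA combinations
identical(sequential$id, combined$id)
```

Under that assumption the two forms agree on all nine combinations, so chaining two drops behaves like `|` in the same way that chaining two `filter()` calls behaves like `&`.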
|
I think the best way to teach this is probably something like:
# Combining with `&`
df |> filter(x, y)
df |> filter_out(x, y)
# Combining with `|`
df |> filter(when_any(x, y))
df |> filter_out(when_any(x, y))
I think the fact that |
|
To me, the antisymmetry is not only theoretically pleasing. I'm reading
I'd never read it like:
Even stronger with
. |>
  filter_out(
    x,
    y
  )
To me, the |
|
Completely agree with @krlmlr here. I think this is a critical function of the api that makes learning the syntax much easier, especially for beginners. I would expect |
|
There are two competing worldviews at play here.
Both of these have their pros and cons. My theory is that the first of these is the most practically useful for dplyr users and is the easiest to learn.
As complements
If both
df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))
df |> filter_out(x, y)
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))
# ---
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))
df |> filter(x | y)
df |> filter(when_any(x, y))
Notice how everything above the line related to
I'd argue that an extremely important property of this table is that you only have to learn 1 rule - that
As a nice side effect this means you only need to worry about
This all means that if you are translating from a
patients <- tibble::tibble(
name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)
patients
With years of
patients |>
  filter(!(deceased & date < 2012))
But immediately get frustrated when it drops your
patients |>
  filter_out(deceased & date < 2012)
And boom that works as expected. And since there is only 1 rule that applies for both
patients |>
  filter_out(deceased, date < 2012)
You also get this nice result, i.e. they are complements of one another
# Equivalent up to row ordering
union(filter(df, x, y), filter_out(df, x, y)) ~= df
It is true that you can't break
df |> filter(x, y)
df |> filter(x & y)
df |> filter(x) |> filter(y)
df |> filter_out(x | y)
df |> filter_out(x) |> filter_out(y)
But I'd argue that was never a goal to begin with, and is not how I would teach them. For example, if I'm looking for "rows where
df |> filter(cyl == 5, disp > 20)
and it would not occur to me to write this, even though they are equivalent
df |> filter(cyl == 5) |> filter(disp > 20)
In other words, my problem statement of "rows where
This also means that I don't find Kirill's idea that
I think a more appropriate goal of
As chainable equivalents
If
df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))
df |> filter_out(x, y)
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))
# ---
df |> filter(x | y)
df |> filter(when_any(x, y))
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))
My argument is that this is actually much harder for people to learn.
And this is on top of having to think about
But most importantly, you can no longer easily translate a
patients |>
  filter(!(deceased & date < 2012))
then you have to translate to this
patients |>
  filter_out(when_all(deceased, date < 2012))
and I'd argue that is an increase in mental burden to translate to over the "just drop the
In my ideal world both
This approach does have this "chainable equivalence" property that has been discussed, but I'd again argue that this is not a design goal, and is not the way I'd encourage teaching
df |> filter(x, y)
df |> filter(x) |> filter(y)
df |> filter_out(x, y)
df |> filter_out(x) |> filter_out(y)
So why do
|
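For readers who want to try the `patients` example from the comment above without the proposed functions, here is a minimal base R sketch. `keep_rows()` and `drop_rows()` are hypothetical stand-ins for `filter()` and `filter_out()`; the sketch assumes `filter_out()` drops only rows where the condition is `TRUE`, so rows where it is `FALSE` or `NA` are kept.

```r
# Hypothetical stand-ins, not dplyr functions:
# keep_rows() mimics filter(): keep rows where `cond` is TRUE.
# drop_rows() mimics the assumed filter_out(): drop rows where `cond` is TRUE.
keep_rows <- function(df, cond) df[cond %in% TRUE, , drop = FALSE]
drop_rows <- function(df, cond) df[!(cond %in% TRUE), , drop = FALSE]

patients <- data.frame(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
  date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)

cond <- patients$deceased & patients$date < 2012

# Negating inside a keep: rows where the negated condition is NA are lost
negated <- keep_rows(patients, !cond)$name

# Dropping directly: only rows where the condition is TRUE are removed
dropped <- drop_rows(patients, cond)$name

# Complement check: kept and dropped rows together cover every row
kept <- keep_rows(patients, cond)$name
setequal(c(kept, dropped), patients$name)
```

With this data, `negated` contains only Anne, Davis and Derek, because the condition is `NA` for Sarah, Max and Tina. `dropped` contains everyone except Mark, which is the "works as expected" result described above.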
|
In ordinary English, when we talk about removing things, “X and Y” is almost always understood as “anything that is X or Y,” i.e. a union of categories to exclude, not a logical “and” inside a single condition. Examples:
And without “filter” language at all:
In all these cases “X and Y” is just a list of things to get rid of: “get rid of X, and also get rid of Y,” which is logically “X or Y” on the exclusion side. If you also think of |
Readable link
Most relevant issues
filter(.missing = ) option to optionally retain missing values dplyr#6560
filter(.missing = NULL, .how = c("keep", "drop")) dplyr#6891
We are open to feedback until Monday, November 24th.