Tidyup 8 - Expanding the `filter()` family #30

DavisVaughan · 2025-11-04T14:51:00Z

Readable link

https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md

Most relevant issues

We are open to feedback until Monday, November 24th.

jennybc

I like the proposal! Made a few comments as I reacted to a first reading.

wurli · 2025-11-07T16:58:23Z

tidyups/008-dplyr-filter-family.md

Line 948 in 5b76b43

- `exclude()`, as noted above, which would have been paired with

FWIW, as a user I would much prefer the name exclude() to filter_out(). IMO, to the uninitiated it would not be clear which of filter()/filter_out() retains and which excludes, but I think if filter() is paired with exclude() then the purpose of both becomes clearer. I also like retain() as an alias for filter(), but on balance I agree it's probably best not to add an alias since filter() is so well established.

Love the idea for this API btw!

DavisVaughan · 2025-11-07T17:17:42Z

@wurli most of us felt that filter_out() was very clear that it's removing the indicated rows, which helps you intuit that filter() must keep them.

We also really appreciated how it feels like a "variant" of filter() rather than another core verb. On the home page of dplyr we'd still just list filter(); that's the core verb. It's only when you come to filter()'s help docs that you'd also learn about filter_out() (or your teacher would tell you about it). This is similar to slice() being the core verb and slice_*() being the variants. I think there is something pretty powerful to this idea, and it also helps with autocompletion, i.e. filt<tab> brings up both, which is quite nice.

With exclude(), I'd feel the need to say "filter() / exclude() to keep or drop cases based on their values" on the dplyr home page, and that felt like a net negative in comparison: https://dplyr.tidyverse.org/#overview

jrosell · 2025-11-07T17:34:35Z

With filter_out(), would someone wonder whether filter_in() exists as an alias for filter()?

joeycouse · 2025-11-07T18:29:08Z

Love this implementation. I do think the filter() / filter_out() naming is slightly unclear, but I don't see this as a major hurdle in practice. IMO, keeping the filter() API should take priority over slightly clearer language that would introduce new core verbs and leave filter() stranded.

davidhodge931 · 2025-11-09T20:42:56Z

Awesome proposal!

My 2 cents: I think filter()/filter_out() is slightly unclear naming. I think filter_keep()/filter_drop() would be better, with filter() deprecated.

DavisVaughan · 2025-11-09T21:08:50Z

@davidhodge931 as stated in the tidyup at https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md#alternate-names-for-filter, we are not considering renaming filter(), so we are working within the constraints of that. Renaming filter() is likely just too disruptive to the whole community to be worth it.

krlmlr · 2025-11-10T19:15:30Z

Love the idea!

How do you teach that filter_out(x, y) is actually filter_out(x & y) and not filter_out(x | y)? I'd be confused about half the time. Would it be safer to allow just one predicate in filter_out()? I haven't followed the entire discussion, so please disregard if redundant.

# Sequence with filter()
. |>
  filter(x) |>
  filter(y)

# Same as conjunction
. |>
  filter(x, y)

# Sequence with filter_out()
. |>
  filter_out(x) |>
  filter_out(y)

# Same as alternation (!?!)
. |>
  filter_out(x | y)

DavisVaughan · 2025-11-11T17:01:59Z

I think the best way to teach this is probably something like:

- With filter(), target rows to keep
- With filter_out(), target rows to drop
- Both combine with & (consistent for both)
- If you want |, use when_any() (consistent for both)

# Combining with `&`
df |> filter(x, y)
df |> filter_out(x, y)

# Combining with `|`
df |> filter(when_any(x, y))
df |> filter_out(when_any(x, y))

I think the fact that df |> filter_out(x | y) is equivalent to df |> filter_out(x) |> filter_out(y), and df |> filter(x & y) is equivalent to df |> filter(x) |> filter(y) is theoretically pleasing, but isn't something I would harp on while teaching. Instead I'd focus on when_any(), which is used the same way no matter which verb you use.
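For readers who want to check the two equivalences themselves, here is a minimal base R sketch. `keep_rows()` and `drop_rows()` are hypothetical stand-ins written for this illustration (they are not dplyr functions); they assume that filter() keeps rows where the condition is `TRUE` and that filter_out() drops rows where the condition is `TRUE`, leaving `FALSE` and `NA` rows in place.

```r
# Hypothetical stand-ins for illustration only; not part of dplyr.
# keep_rows(): keep rows where `cond` is TRUE (NA counts as "not TRUE").
keep_rows <- function(df, cond) df[cond %in% TRUE, , drop = FALSE]
# drop_rows(): drop rows where `cond` is TRUE (FALSE and NA rows are kept).
drop_rows <- function(df, cond) df[!(cond %in% TRUE), , drop = FALSE]

# Every combination of TRUE / FALSE / NA for two conditions.
df <- data.frame(
  id = 1:9,
  x = rep(c(TRUE, FALSE, NA), each = 3),
  y = rep(c(TRUE, FALSE, NA), times = 3)
)

# Keeping in sequence matches keeping with `&`.
step <- keep_rows(df, df$x)
keep_chain <- keep_rows(step, step$y)
keep_and <- keep_rows(df, df$x & df$y)

# Dropping in sequence matches dropping with `|`.
step <- drop_rows(df, df$x)
drop_chain <- drop_rows(step, step$y)
drop_or <- drop_rows(df, df$x | df$y)

identical(keep_chain$id, keep_and$id) # TRUE
identical(drop_chain$id, drop_or$id)  # TRUE
```

Under these assumptions, both comparisons hold even when the conditions contain `NA`.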

krlmlr · 2025-11-20T11:26:07Z

To me, the antisymmetry is not only theoretically pleasing. I'm reading . |> filter_out(x, y) like:

- I'm taking the input
- I'm filtering out the entries that match x
- Then, I'm filtering out the entries that match y

I'd never read it like:

- I'm taking the input
- I'm filtering out the entries that match x and also match y

Even stronger with

. |>
  filter_out(
    x,
    y
  )

To me, the , translates to a "then" much better than to an "and". Is it only me? I don't know, but I'd like us to think a bit longer about the ambiguity here and the options that we have. The option most appealing to me is to implement an initial draft that accepts only one argument; there's much less ambiguity in filter_out(when_any(...)) and filter(when_all(...)). Then we can play with it and decide if and how we extend to multiple arguments.

joeycouse · 2025-11-20T17:42:31Z

Completely agree with @krlmlr here.

df |> filter(x) |> filter(y)

df |> filter(x, y)

I think this is a critical function of the api that makes learning the syntax much easier, especially for beginners. I would expect filter_out() to have the same behavior.

DavisVaughan · 2025-11-20T20:54:33Z

There are two competing worldviews at play here.

- filter() and filter_out() as complements of one another.
- filter(df, x, y) and filter_out(df, x, y) as equivalent to df |> filter(x) |> filter(y) and df |> filter_out(x) |> filter_out(y).

Both of these have their pros and cons. My theory is that the first of these is the most practically useful for dplyr users and is the easiest to learn.

As complements

If both filter() and filter_out() combine using &, then you get the following result table:

df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))

df |> filter_out(x, y)
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))

# ---

df |> filter_out(x | y)
df |> filter_out(when_any(x, y))

df |> filter(x | y)
df |> filter(when_any(x, y))

Notice how everything above the line related to & works the exact same regardless of whether it is filter() or filter_out(). Similarly, everything below the line works the same with |.

I'd argue that an extremely important property of this table is that you only have to learn 1 rule - that , separated conditions are combined with &. This exactly matches what people have been doing with filter() since day 1 of dplyr. There are no mental gymnastics required when swapping between filter() and filter_out() if you remember this 1 rule you've been using the whole time.

As a nice side effect this means you only need to worry about when_any() - if you find yourself using | in either filter() or filter_out(), you can immediately switch to when_any(), no extra thought required. filter() and filter_out() users should never need when_all() because conditions combine with & already, and that's perfectly fine, one less thing to learn, and when_all() is still useful on its own in other contexts.

This all means that if you are translating from a filter() to a filter_out() to simplify your conditions, then doing so is very easy by design. For example:

Filter out rows where the patient is deceased and the year of death was before 2012.

patients <- tibble::tibble(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
  date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)

patients

# A tibble: 7 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Mark  TRUE      2010
3 Sarah NA          NA
4 Davis TRUE      2020
5 Max   NA        2010
6 Derek FALSE       NA
7 Tina  TRUE        NA

With years of filter() muscle memory built up, you might start with this:

patients |>
  filter(!(deceased & date < 2012))

# A tibble: 3 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Davis TRUE      2020
3 Derek FALSE       NA

But you immediately get frustrated when it drops your NAs. Then you remember filter_out()! It is intentionally designed so that you can simply drop the ! and () to translate to:

patients |>
  filter_out(deceased & date < 2012)

# A tibble: 6 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Sarah NA          NA
3 Davis TRUE      2020
4 Max   NA        2010
5 Derek FALSE       NA
6 Tina  TRUE        NA

And boom that works as expected.
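The difference between the two results comes down to how `NA` propagates through the condition. Here is a minimal base R sketch of the same data, assuming (as the outputs above suggest) that filter() keeps only rows where its condition is `TRUE`, while filter_out() drops only rows where its condition is `TRUE`:

```r
name <- c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina")
deceased <- c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE)
date <- c(2005, 2010, NA, 2020, 2010, NA, NA)

# The condition is NA whenever it cannot be decided either way.
cond <- deceased & date < 2012
cond
# FALSE TRUE NA FALSE NA FALSE NA

# filter(!(cond)): negating NA is still NA, so undecided rows are lost.
kept_by_negated_filter <- name[which(!cond)]
kept_by_negated_filter
# "Anne" "Davis" "Derek"

# filter_out(cond): only rows where cond is TRUE are removed.
kept_by_filter_out <- name[!(cond %in% TRUE)]
kept_by_filter_out
# "Anne" "Sarah" "Davis" "Max" "Derek" "Tina"
```

Only Mark has a condition that is definitely `TRUE`, so he is the only row filter_out() removes.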

And since there is only 1 rule that applies for both filter() and filter_out() - that conditions are combined with &, you'll probably also remember that you can simplify further to:

patients |>
  filter_out(deceased, date < 2012)

You also get this nice result, i.e. they are complements of one another:

# Equivalent up to row ordering
union(filter(df, x, y), filter_out(df, x, y)) ~= df
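The invariant can be sketched in base R without the proposed functions. The variable names below are made up for illustration, and the sketch assumes that `,`-separated conditions combine with `&` in both verbs and that a row "matches" only when the combined condition is `TRUE`:

```r
df <- data.frame(
  id = 1:6,
  x = c(TRUE, TRUE, FALSE, NA, TRUE, NA),
  y = c(TRUE, FALSE, TRUE, TRUE, NA, NA)
)

# Combined condition, as if written filter(df, x, y) / filter_out(df, x, y).
is_match <- (df$x & df$y) %in% TRUE

filtered <- df[is_match, ]      # rows filter(df, x, y) would keep
filtered_out <- df[!is_match, ] # rows filter_out(df, x, y) would keep

# The two pieces do not overlap and together cover every row.
length(intersect(filtered$id, filtered_out$id)) # 0
identical(sort(c(filtered$id, filtered_out$id)), df$id) # TRUE
```

Because every row either matches or does not, the two results partition the data frame. Rows with an `NA` condition land on the filter_out() side.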

It is true that you can't break df |> filter_out(x, y) into df |> filter_out(x) |> filter_out(y) like you can with filter():

df |> filter(x, y)
df |> filter(x & y)
df |> filter(x) |> filter(y)

df |> filter_out(x | y)
df |> filter_out(x) |> filter_out(y)

But I'd argue that was never a goal to begin with, and is not how I would teach them. For example, if I'm looking for "rows where cyl == 5 and disp > 20" then I'd write:

df |> filter(cyl == 5, disp > 20)

and it would not occur to me to write this, even though they are equivalent

df |> filter(cyl == 5) |> filter(disp > 20)

In other words, my problem statement of "rows where cyl == 5 and disp > 20" is made up of two coupled conditions and I would never separate them across two filter() statements.

This also means that I don't find Kirill's idea that , is treated like a "then" very convincing. I very much read the , like an "and" that translates directly from my real-life problem statement of "rows where cyl == 5 and disp > 20".

I think a more appropriate goal of filter_out() is ease of translation from a "negated filter", which ends up resulting in this complement worldview.

As chainable equivalents

If filter() combines conditions with & and filter_out() combines conditions with |, you end up with this table:

df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))

df |> filter_out(x, y)
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))

# ---

df |> filter(x | y)
df |> filter(when_any(x, y))

df |> filter_out(x & y)
df |> filter_out(when_all(x, y))

My argument is that this is actually much harder for people to learn.

- You must remember that filter() combines with &, but filter_out() combines with |.
- You must remember to use when_any() in filter() but when_all() in filter_out().

And this is on top of having to think about NA handling! So that's 3 different aspects you have to think about all at once (filter for vs filter out, & vs |, and when_any vs when_all). With the complement approach I'd argue there is only 1 aspect to think about - filter for vs filter out, because everything else works the same.

But most importantly, you can no longer easily translate a filter() that you mistakenly started into a filter_out(). With the above example, when you realize that this is the wrong approach:

patients |>
  filter(!(deceased & date < 2012))

then you have to translate to this filter_out(),

patients |>
  filter_out(when_all(deceased, date < 2012))

and I'd argue that translating to this is a bigger mental burden than the "just drop the !" translation to filter_out(deceased & date < 2012).

In my ideal world both when_all() and when_any() are rarely required, and this holds true with the current "treat them as complements" worldview, where only when_any() is ever needed, which is also only in the rare case of needing to combine with |. This would not be the case if filter_out() combined conditions with |, because pretty much every time you'd reach for a filter_out() with >1 conditions, you'd also need when_all(), because combining conditions with & is the more common situation.

This approach does have the "chainable equivalence" property that has been discussed, but I'd again argue that this is not a design goal, and is not the way I'd encourage teaching filter() or filter_out(). As mentioned in the previous section, when you have a problem like "find rows where cyl == 5 and disp > 20", you would not want to split that over two filter() calls.

df |> filter(x, y)
df |> filter(x) |> filter(y)

df |> filter_out(x, y)
df |> filter_out(x) |> filter_out(y)

So why do `,` separated conditions combine with `&` at all?

Good question!

I think this is the heart of the problem. Deciding whether to combine , separated conditions with & or | is inherently ambiguous. But back in the origins of dplyr it must have been decided that combining with & was the more common case, and I do think that has held true.

I think Kirill nailed it by mentioning that in an ideal world there is only 1 expr allowed. This would force the explicit usage of either & / | or when_all() / when_any() (where there is no ambiguity about how ... combine). That would have been a pretty elegant way to solve all of this!

In fact, this is exactly how Stata's keep if and drop if work. Their specification is:

keep if expr
drop if expr

and you must use explicit & and | like drop if inlist(v1,88,99) | missing(v2). No ambiguity there!

But I think limiting filter_out() to just 1 condition would do the world a disservice and would just cause more confusion about why filter() and filter_out() aren't equivalent in this regard.

Instead, I'm arguing that we should just lean into the status quo. Rather than contribute to the ambiguity of how , separated conditions should be combined by changing its meaning between filter() and filter_out(), let's just have 1 consistent rule of "combine with &", which is already ambiguous enough but has years of muscle memory built up for most filter() users.

t-kalinowski · 2025-11-20T22:25:18Z

In ordinary English, when we talk about removing things, “X and Y” is almost always understood as “anything that is X or Y,” i.e. a union of categories to exclude, not a logical “and” inside a single condition.

Examples:

“Filter out spam and promotional emails from my inbox.”
→ Remove any email that is spam or promotional.
“Filter out missing values and zeros before plotting.”
→ Remove any row that is missing or zero.

And without “filter” language at all:

“Exclude France and Germany from the analysis.”
→ Drop any row where the country is France or Germany.
“Ignore students who failed and students who dropped the course.”
→ Ignore a student if they failed or dropped.
“Remove late submissions and plagiarized submissions.”
→ Remove a submission if it is late or plagiarized.

In all these cases “X and Y” is just a list of things to get rid of: “get rid of X, and also get rid of Y,” which is logically “X or Y” on the exclusion side.

If you also think of filter_out() as “the complement of filter(),” basic logic points the same way: the complement of “keep A and B” is “drop A or B” (De Morgan’s law: !(A & B) ≡ (!A | !B)). So both everyday English and the usual “complement of filter” story support reading “filter out X and Y” as “filter out X or Y.”
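As a side note, the De Morgan identity itself can be checked in base R, including for missing values. This sketch only verifies the logical identity over every combination of `TRUE`, `FALSE`, and `NA`; it makes no claim about how filter_out() should combine its arguments.

```r
# All nine combinations of TRUE / FALSE / NA for two conditions.
vals <- c(TRUE, FALSE, NA)
grid <- expand.grid(A = vals, B = vals)

lhs <- !(grid$A & grid$B)
rhs <- !grid$A | !grid$B

# The two sides agree element by element, NA cases included.
identical(lhs, rhs) # TRUE
```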

Tidyup 008 - Retaining and excluding rows

5d93144

Add tables

b0eeee0

jennybc reviewed Nov 4, 2025

hadley reviewed Nov 4, 2025

DavisVaughan commented Nov 4, 2025

DavisVaughan added 7 commits November 4, 2025 13:14

Use a logical for deceased

e5a3805

Add a name column for easier row tracking

f0891da

Tweak the SQL header name

b13a1a9

Change .combine to .when

dae8eed

Fix typo

97d0c15

Talk about %in%

732ad36

Drop .missing and talk about why it's a red herring

d75c835

Replace when_any/all()'s 3-valued missing = with binary na_rm =

cb54ec3

DavisVaughan commented Nov 5, 2025

mine-cetinkaya-rundel reviewed Nov 5, 2025

DavisVaughan added 4 commits November 5, 2025 14:38

Mention data |> exclude(this | that | those)

cff71bc

Revert "Mention data |> exclude(this | that | those)"

3b52800

This reverts commit cff71bc.

Rework %in% example

b54b63d

Mention union() invariant

5c06bcc

Rework around filter_out()

5b76b43

DavisVaughan changed the title ~~Tidyup 8 - Retaining and excluding rows~~ Tidyup 8 - Expanding the filter() family Nov 6, 2025

DavisVaughan marked this pull request as ready for review November 6, 2025 15:26

sierrajohnson mentioned this pull request Nov 7, 2025

yeet() and vibe_check() inconsistent hadley/genzplyr#10

Open

15 participants
