@DavisVaughan
Member

@DavisVaughan DavisVaughan commented Nov 4, 2025

Member

@jennybc jennybc left a comment

I like the proposal! Made a few comments as I reacted to a first reading.

@DavisVaughan DavisVaughan changed the title Tidyup 8 - Retaining and excluding rows Tidyup 8 - Expanding the filter() family Nov 6, 2025
@DavisVaughan DavisVaughan marked this pull request as ready for review November 6, 2025 15:26
@wurli

wurli commented Nov 7, 2025

- `exclude()`, as noted above, which would have been paired with

FWIW, as a user I would much prefer the name exclude() to filter_out(). IMO, to the uninitiated it would not be clear which of filter()/filter_out() retains and which excludes, but I think if filter() is paired with exclude() then the purpose of both becomes clearer. I also like retain() as an alias for filter(), but on balance I agree it's probably best not to add an alias since filter() is so well established.

Love the idea for this API btw!

@DavisVaughan
Member Author

DavisVaughan commented Nov 7, 2025

@wurli most of us felt that filter_out() was very clear that it's removing the indicated rows, which helps you intuit that filter() must keep them.

We also really appreciated how it feels like a "variant" of filter() rather than another core verb. On the home page of dplyr we'd still just list filter(), since that's the core verb. It's only when you come to filter()'s help docs that you'd also learn about filter_out() (or your teacher would tell you about it). This is similar to slice() being the core verb and slice_*() being the variants. I think there is something pretty powerful to this idea, and it also helps with autocompletions, i.e. filt<tab> brings up both, which is quite nice.

With exclude(), I'd feel the need to say "filter() / exclude() to keep or drop cases based on their values" on the dplyr home page and that felt like a net negative in comparison https://dplyr.tidyverse.org/#overview

@jrosell

jrosell commented Nov 7, 2025

With filter_out(), would someone wonder whether filter_in() also exists as an alias for filter()?

@joeycouse

Love this implementation. I do think the filter() / filter_out() naming is slightly unclear, but I don't see this as a major hurdle in practice. IMO, keeping the filter() API should take priority over slightly clearer language that requires introducing new core verbs and leaving filter() stranded.

@davidhodge931

Awesome proposal!

My 2 cents: I think filter()/filter_out() is slightly unclear naming. I think filter_keep()/filter_drop() would be better, with filter() deprecated.

@DavisVaughan
Member Author

DavisVaughan commented Nov 9, 2025

@davidhodge931 as stated in the tidyup at https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md#alternate-names-for-filter, we are not considering renaming filter(), so we are working within the constraints of that. Renaming filter() is likely just too disruptive to the whole community to be worth it.

@krlmlr
Member

krlmlr commented Nov 10, 2025

Love the idea!

How do you teach that filter_out(x, y) is actually filter_out(x & y) and not filter_out(x | y)? I'd be confused about half the time. Would it be safer to allow just one predicate in filter_out()? I haven't followed the entire discussion, so please disregard if redundant.

# Sequence with filter()
. |>
  filter(x) |>
  filter(y)

# Same as conjunction
. |>
  filter(x, y)

# Sequence with filter_out()
. |>
  filter_out(x) |>
  filter_out(y)

# Same as alternation (!?!)
. |>
  filter_out(x | y)
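
The equivalence being asked about can be checked with a small base R sketch. `drop_rows()` here is a hypothetical stand-in for the proposed filter_out() with a single condition (drop rows where the condition is TRUE, keep FALSE and NA); it is an assumption for illustration, not the dplyr implementation.

```r
# Hypothetical stand-in for filter_out() with one condition:
# drop rows where `cond` is TRUE, keep rows where it is FALSE or NA.
drop_rows <- function(df, cond) {
  df[!(cond %in% TRUE), , drop = FALSE]
}

df <- data.frame(
  id = 1:4,
  x = c(TRUE, TRUE, FALSE, FALSE),
  y = c(TRUE, FALSE, TRUE, FALSE)
)

# Dropping x, then dropping y ...
step1 <- drop_rows(df, df$x)
chained <- drop_rows(step1, step1$y)

# ... removes the same rows as dropping `x | y` in one step.
either <- drop_rows(df, df$x | df$y)

# Dropping `x & y` removes only the rows where both conditions hold.
both <- drop_rows(df, df$x & df$y)

chained$id  # 4
either$id   # 4
both$id     # 2 3 4
```

So two chained drops behave like a single drop on `x | y`, while a single drop on `x & y` removes fewer rows.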

@DavisVaughan
Member Author

I think the best way to teach this is probably something like:

  • With filter(), target rows to keep
  • With filter_out(), target rows to drop
  • Both combine with & (consistent for both)
  • If you want |, use when_any() (consistent for both)

# Combining with `&`
df |> filter(x, y)
df |> filter_out(x, y)

# Combining with `|`
df |> filter(when_any(x, y))
df |> filter_out(when_any(x, y))

I think the fact that df |> filter_out(x | y) is equivalent to df |> filter_out(x) |> filter_out(y), and df |> filter(x & y) is equivalent to df |> filter(x) |> filter(y) is theoretically pleasing, but isn't something I would harp on while teaching. Instead I'd focus on when_any(), which is used the same way no matter which verb you use.
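
For readers who have not seen the proposed helpers, here is a rough sketch of the idea of combining conditions elementwise. `combine_any()` and `combine_all()` are hypothetical names invented for this example; they are not the proposed when_any() / when_all(), whose argument handling and NA behavior may differ.

```r
# Hypothetical stand-ins, NOT the proposed dplyr helpers: combine any number
# of logical vectors elementwise with `|` or with `&`.
combine_any <- function(...) Reduce(`|`, list(...))
combine_all <- function(...) Reduce(`&`, list(...))

x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

combine_any(x, y)  # TRUE  TRUE  TRUE FALSE
combine_all(x, y)  # TRUE FALSE FALSE FALSE
```

The point of a helper like this is that the combination rule is spelled out in the call, so it reads the same way whether the result is used to keep rows or to drop them.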

@krlmlr
Member

krlmlr commented Nov 20, 2025

To me, the antisymmetry is not only theoretically pleasing. I'm reading . |> filter_out(x, y) like:

  • I'm taking the input
  • I'm filtering out the entries that match x
  • Then, I'm filtering out the entries that match y

I'd never read it like:

  • I'm taking the input
  • I'm filtering out the entries that match x and also match y

Even stronger with

. |>
  filter_out(
    x,
    y
  )

To me, the , translates to a "then" much better than to an "and". Is it only me? I don't know, but I'd like us to think a bit longer about the ambiguity here and the options that we have. The option most appealing to me is to implement an initial draft that accepts only one argument; there's much less ambiguity in filter_out(when_any(...)) and filter(when_all(...)). Then we can play with it and decide if and how we extend to multiple arguments.

@joeycouse

Completely agree with @krlmlr here.

df |> filter(x) |> filter(y)

df |> filter(x, y)

I think this is a critical function of the api that makes learning the syntax much easier, especially for beginners. I would expect filter_out() to have the same behavior.

@DavisVaughan
Member Author

There are two competing worldviews at play here.

  • filter() and filter_out() as complements of one another.

  • filter(df, x, y) and filter_out(df, x, y) as equivalent to df |> filter(x) |> filter(y) and df |> filter_out(x) |> filter_out(y).

Both of these have their pros and cons. My theory is that the first of these is the most practically useful for dplyr users and is the easiest to learn.

As complements

If both filter() and filter_out() combine using &, then you get the following result table:

df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))

df |> filter_out(x, y)
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))

# ---

df |> filter_out(x | y)
df |> filter_out(when_any(x, y))

df |> filter(x | y)
df |> filter(when_any(x, y))

Notice how everything above the line related to & works the exact same regardless of whether it is filter() or filter_out(). Similarly, everything below the line works the same with |.

I'd argue that an extremely important property of this table is that you only have to learn 1 rule - that , separated conditions are combined with &. This exactly matches what people have been doing with filter() since day 1 of dplyr. There are no mental gymnastics required when swapping between filter() and filter_out() if you remember this 1 rule you've been using the whole time.

As a nice side effect, this means you only need to worry about when_any(): if you find yourself using | in either filter() or filter_out(), you can immediately switch to when_any(), no extra thought required. filter() and filter_out() users should never need when_all(), because conditions already combine with &. That's perfectly fine: it's one less thing to learn, and when_all() is still useful on its own in other contexts.

This all means that if you are translating from a filter() to a filter_out() to simplify your conditions, then doing so is very easy by design. For example:

Filter out rows where the patient is deceased and the year of death was before 2012.

patients <- tibble::tibble(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
  date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)

patients
# A tibble: 7 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Mark  TRUE      2010
3 Sarah NA          NA
4 Davis TRUE      2020
5 Max   NA        2010
6 Derek FALSE       NA
7 Tina  TRUE        NA

With years of filter() muscle memory built up, you might start with this:

patients |>
  filter(!(deceased & date < 2012))
# A tibble: 3 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Davis TRUE      2020
3 Derek FALSE       NA

But you immediately get frustrated when it drops your NAs. Then you remember filter_out()! It is intentionally designed so that you can very easily drop the ! and () to translate to:

patients |>
  filter_out(deceased & date < 2012)
# A tibble: 6 × 3
  name  deceased  date
  <chr> <lgl>    <dbl>
1 Anne  FALSE     2005
2 Sarah NA          NA
3 Davis TRUE      2020
4 Max   NA        2010
5 Derek FALSE       NA
6 Tina  TRUE        NA

And boom that works as expected.
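
To see why the two calls above differ, it can help to look at the raw logical vectors in base R. This sketch assumes the semantics described in this thread (filter() keeps rows where the condition is TRUE; filter_out() drops rows where the condition is TRUE) and uses only base R, not the proposed functions.

```r
name <- c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina")
deceased <- c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE)
date <- c(2005, 2010, NA, 2020, 2010, NA, NA)

cond <- deceased & date < 2012
cond
# FALSE  TRUE    NA FALSE    NA FALSE    NA

# Negating an NA still gives NA, so a "keep where TRUE" rule loses those rows.
!cond
#  TRUE FALSE    NA  TRUE    NA  TRUE    NA

# Rows kept by "keep where !cond is TRUE" (the negated filter() idea).
kept_by_negated_filter <- name[(!cond) %in% TRUE]

# Rows kept by "drop where cond is TRUE" (the filter_out() idea).
kept_by_drop <- name[!(cond %in% TRUE)]

kept_by_negated_filter  # "Anne" "Davis" "Derek"
kept_by_drop            # "Anne" "Sarah" "Davis" "Max" "Derek" "Tina"
```

Only Mark has a condition that is definitely TRUE, so he is the only row removed under the "drop where TRUE" rule.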

And since there is only 1 rule that applies to both filter() and filter_out() (that conditions are combined with &), you'll probably also remember that you can simplify further to:

patients |>
  filter_out(deceased, date < 2012)

You also get this nice result, i.e. they are complements of one another:

# Equivalent up to row ordering
union(filter(df, x, y), filter_out(df, x, y)) ~= df
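
The complement property can be sketched in base R with the patients data from above. This is only an illustration of the semantics described in this thread (keep where TRUE vs. drop where TRUE), not the dplyr implementation; `is_match` is a name made up for the example.

```r
patients <- data.frame(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
  date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)

cond <- patients$deceased & patients$date < 2012
is_match <- cond %in% TRUE  # NA counts as "not a match"

kept <- patients[is_match, ]        # rows a "keep where TRUE" rule retains
remaining <- patients[!is_match, ]  # rows a "drop where TRUE" rule retains

# Every row lands in exactly one of the two pieces.
nrow(kept) + nrow(remaining) == nrow(patients)    # TRUE
setequal(c(kept$name, remaining$name), patients$name)  # TRUE

# By contrast, keeping `cond` and keeping `!cond` do not cover the data,
# because the NA rows appear in neither piece.
sum(cond %in% TRUE) + sum((!cond) %in% TRUE)      # 4, not 7
```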

It is true that you can't break df |> filter_out(x, y) into df |> filter_out(x) |> filter_out(y) like you can with filter():

df |> filter(x, y)
df |> filter(x & y)
df |> filter(x) |> filter(y)

df |> filter_out(x | y)
df |> filter_out(x) |> filter_out(y)

But I'd argue that was never a goal to begin with, and is not how I would teach them. For example, if I'm looking for "rows where cyl == 5 and disp > 20" then I'd write:

df |> filter(cyl == 5, disp > 20)

and it would not occur to me to write this, even though they are equivalent:

df |> filter(cyl == 5) |> filter(disp > 20)

In other words, my problem statement of "rows where cyl == 5 and disp > 20" is made up of two coupled conditions and I would never separate them across two filter() statements.

This also means that I don't find Kirill's idea that , is treated like a "then" very convincing. I very much read the , like an "and" that translates directly from my real-life problem statement of "rows where cyl == 5 and disp > 20".

I think a more appropriate goal of filter_out() is ease of translation from a "negated filter", which ends up resulting in this complement worldview.

As chainable equivalents

If filter() combines conditions with & and filter_out() combines conditions with |, you end up with this table:

df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))

df |> filter_out(x, y)
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))

# ---

df |> filter(x | y)
df |> filter(when_any(x, y))

df |> filter_out(x & y)
df |> filter_out(when_all(x, y))

My argument is that this is actually much harder for people to learn.

  • You must remember that filter() combines with &, but filter_out() combines with |.
  • You must remember to use when_any() in filter() but when_all() in filter_out().

And this is on top of having to think about NA handling! So that's 3 different aspects you have to think about all at once (filter for vs filter out, & vs |, and when_any vs when_all). With the complement approach I'd argue there is only 1 aspect to think about - filter for vs filter out, because everything else works the same.
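
To make the stakes concrete, here is a base R sketch of what filter_out(deceased, date < 2012) would leave behind under each combination rule, using the patients data from earlier. It assumes the "drop rows where the combined condition is TRUE" semantics discussed in this thread and does not call the proposed functions.

```r
name <- c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina")
deceased <- c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE)
date <- c(2005, 2010, NA, 2020, 2010, NA, NA)

# Rule 1: `,` combines with `&` (the complement worldview).
combined_and <- deceased & date < 2012
left_with_and <- name[!(combined_and %in% TRUE)]

# Rule 2: `,` combines with `|` (the chainable worldview).
combined_or <- deceased | date < 2012
left_with_or <- name[!(combined_or %in% TRUE)]

left_with_and  # "Anne" "Sarah" "Davis" "Max" "Derek" "Tina"
left_with_or   # "Sarah" "Derek"
```

The same call would therefore remove one row under the first rule and five rows under the second, which is why the choice of rule matters so much here.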

But most importantly, you can no longer easily translate a filter() that you mistakenly started into a filter_out(). With the above example, when you realize that this is the wrong approach:

patients |>
  filter(!(deceased & date < 2012))

then you have to translate to this filter_out(),

patients |>
  filter_out(when_all(deceased, date < 2012))

and I'd argue that this translation carries more mental burden than the "just drop the !" translation to filter_out(deceased & date < 2012).

In my ideal world both when_all() and when_any() are rarely required, and this holds true with the current "treat them as complements" worldview, where only when_any() is ever needed, which is also only in the rare case of needing to combine with |. This would not be the case if filter_out() combined conditions with |, because pretty much every time you'd reach for a filter_out() with >1 conditions, you'd also need when_all(), because combining conditions with & is the more common situation.

This approach does have the "chainable equivalence" property that has been discussed, but I'd again argue that this is not a design goal, and is not the way I'd encourage teaching filter() or filter_out(). As mentioned in the previous section, when you have a problem like "find rows where cyl == 5 and disp > 20", you would not want to split that over two filter() calls.

df |> filter(x, y)
df |> filter(x) |> filter(y)

df |> filter_out(x, y)
df |> filter_out(x) |> filter_out(y)

So why do , separated conditions combine with & at all?

Good question!

I think this is the heart of the problem. Deciding whether to combine , separated conditions with & or | is inherently ambiguous. But back in the origins of dplyr it must have been decided that combining with & was the more common case, and I do think that has held true.

I think Kirill nailed it by mentioning that in an ideal world there is only 1 expr allowed. This would force the explicit usage of either & / | or when_all() / when_any() (where there is no ambiguity about how ... combine). That would have been a pretty elegant way to solve all of this!

In fact, this is exactly how Stata's keep if and drop if work, their specification is:

keep if expr
drop if expr

and you must use explicit & and | like drop if inlist(v1,88,99) | missing(v2). No ambiguity there!

But I think limiting filter_out() to just 1 condition would do the world a disservice and would just cause more confusion about why filter() and filter_out() aren't equivalent in this regard.

Instead, I'm arguing that we should just lean into the status quo. Rather than contribute to the ambiguity of how , separated conditions should be combined by changing its meaning between filter() and filter_out(), let's just have 1 consistent rule of "combine with &", which is already ambiguous enough but has years of muscle memory built up for most filter() users.

@t-kalinowski

In ordinary English, when we talk about removing things, “X and Y” is almost always understood as “anything that is X or Y,” i.e. a union of categories to exclude, not a logical “and” inside a single condition.

Examples:

  • “Filter out spam and promotional emails from my inbox.”
    → Remove any email that is spam or promotional.

  • “Filter out missing values and zeros before plotting.”
    → Remove any row that is missing or zero.

And without “filter” language at all:

  • “Exclude France and Germany from the analysis.”
    → Drop any row where the country is France or Germany.

  • “Ignore students who failed and students who dropped the course.”
    → Ignore a student if they failed or dropped.

  • “Remove late submissions and plagiarized submissions.”
    → Remove a submission if it is late or plagiarized.

In all these cases “X and Y” is just a list of things to get rid of: “get rid of X, and also get rid of Y,” which is logically “X or Y” on the exclusion side.

If you also think of filter_out() as “the complement of filter(),” basic logic points the same way: the complement of “keep A and B” is “drop A or B” (De Morgan’s law: !(A & B) ≡ (!A | !B)). So both everyday English and the usual “complement of filter” story support reading “filter out X and Y” as “filter out X or Y.”
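
The De Morgan identity quoted above can be checked in base R, including for missing values, with a short sketch. It only confirms the logical identity over TRUE, FALSE, and NA; it says nothing by itself about which reading of filter_out(x, y) is preferable.

```r
# All nine combinations of TRUE, FALSE, and NA for two conditions.
vals <- c(TRUE, FALSE, NA)
grid <- expand.grid(a = vals, b = vals)

lhs <- !(grid$a & grid$b)
rhs <- (!grid$a) | (!grid$b)

# The two sides agree everywhere, NA cases included.
identical(lhs, rhs)  # TRUE
```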
