The case for `filter(.missing = NULL, .how = c("keep", "drop"))`

There have been quite a few requests in the past for an "anti filter", i.e. a way to specify a set of conditions that determine which rows to _drop_. It has also traditionally been somewhat difficult to explain that `filter()` is about specifying rows to _keep_; the verb name doesn't make that clear. We've also seen that it is mildly confusing that `select()` is about columns and `filter()` is about rows; again, nothing in the verb names describes the difference.

---

One thing we could consider doing is to add two new very explicit verbs:
- `keep_rows(data, ..., by = , missing = )`
- `drop_rows(data, ..., by = , missing = )`

Where `keep_rows()` is equivalent to `filter()`, and `drop_rows()` is the opposite.

To be very clear, `filter()` would _never_ disappear. We would, however, consider superseding it in favor of these if they prove to be successful, which really only means we'd start using them in docs and workshops instead of `filter()`. We might not even supersede `filter()`, since many people find superseding scary. Instead, we'd just alias `keep_rows()` as `filter()`.

The biggest annoyance when writing a "drop" style expression with `filter()` is that you first have to write a "keep" expression and then painfully invert it. For example:

"drop rows from `df` where `a` and `b` and `c` are `TRUE`"

``` r
library(dplyr)

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)
df
#> # A tibble: 3 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE 
#> 2     2 NA    TRUE  TRUE 
#> 3     3 TRUE  TRUE  TRUE

# "keep rows where a and b and c are TRUE"
# this nicely drops NAs because they don't match our specified criteria
filter(df, a, b, c)
#> # A tibble: 1 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     3 TRUE  TRUE  TRUE

# "drop rows where a and b and c are TRUE"
# this AWFULLY drops NAs because of how filter() works!
# NA doesn't match our criteria, so it shouldn't be seen as something to drop
filter(df, !(a & b & c))
#> # A tibble: 1 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE
```

Note that even the seemingly correct "drop" expression is actually wrong when it comes to handling missing values. It is fairly hard to get this right.
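
For reference, one way to get the NA handling right with today's `filter()` is to turn a missing result into `FALSE` before negating it. This is only a sketch of a workaround using `dplyr::coalesce()`, and there are other equivalent spellings:

```r
library(dplyr)

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)

# Treat a missing `a & b & c` as "not a match", then invert the whole thing
result <- filter(df, !coalesce(a & b & c, FALSE))
result$id
#> [1] 1 2
```

Row 2 survives because its `NA` is no longer propagated through the negation, but the intent is now buried under the inversion and the `coalesce()` call.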

The `drop_rows()` version would be:

```r
df %>% drop_rows(a, b, c)
#> # A tibble: 2 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE 
#> 2     2 NA    TRUE  TRUE 
```

Where `NA` isn't considered something you "drop" by default, but would be if `missing` was tweaked to whatever we decide means "treat a missing value like `TRUE`".
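
To make the intended semantics concrete, here is a rough sketch of how the default and the "treat `NA` like `TRUE`" behavior could differ. Everything in it is hypothetical: `drop_rows_sketch()` is not a dplyr function, and the `missing` values `"false"` and `"true"` are placeholders, since the real parameterization is undecided. It also ignores `by`, grouping, and `if_any()` / `if_all()`.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr. The `missing` values "false" and
# "true" are placeholders for whatever parameterization is eventually chosen.
drop_rows_sketch <- function(data, ..., missing = c("false", "true")) {
  missing <- match.arg(missing)
  conditions <- rlang::enquos(...)
  if (length(conditions) == 0) {
    return(data)
  }
  # Evaluate each condition against the data and combine them with `&`
  results <- lapply(conditions, rlang::eval_tidy, data = data)
  to_drop <- Reduce(`&`, results)
  # A missing result is either never dropped (default) or always dropped
  to_drop[is.na(to_drop)] <- missing == "true"
  data[!to_drop, ]
}

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)

default_ids <- drop_rows_sketch(df, a, b, c)$id
default_ids
#> [1] 1 2

strict_ids <- drop_rows_sketch(df, a, b, c, missing = "true")$id
strict_ids
#> [1] 1
```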

---

A few other notes:
- `missing` is from https://github.com/tidyverse/dplyr/issues/6560 and controls how missing values are treated. By default, both functions would treat an `NA` as `FALSE` (i.e. missing values are never kept or dropped), but could be made to treat them as `TRUE` or an error. Though I don't think `missing = c("keep", "drop", "error")` works uniformly for both verbs so we'd need to think of another parameterization.
- Both functions would support `if_all()` and `if_any()`, which I think form nice natural sentences. "drop rows if any are NA" sounds pretty good for `drop_rows(df, if_any(c(a, b), is.na))`. That is like `tidyr::drop_na()`.
- Neither would support `across()`, which we have been deprecating from `filter()` for a little while now.
- Both functions would combine multiple conditions using `&`, as that is typically the natural way to combine multiple conditions. You can always get `|` behavior with an explicit `|`, and with `drop_rows()` you can also get it by using multiple calls to the function, i.e. `df %>% drop_rows(x > 5 | y > 6)` is the same as `df %>% drop_rows(x > 5) %>% drop_rows(y > 6)` (and you can't do that split trick with `&`). `if_any()` can also work for `|` when you need to apply the same function to multiple columns.
- Both functions would support `by`.
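
For comparison, here is roughly how two of the notes above look with today's `filter()`, where every "drop" has to be written as a negated "keep". The data and column names are made up for illustration:

```r
library(dplyr)

df <- tibble(
  id = 1:4,
  x = c(1, 6, 3, 9),
  y = c(2, 1, 8, 7),
  z = c(NA, 1, 2, NA)
)

# "drop rows if any of y, z are NA", written as a negated keep
complete_ids <- filter(df, !if_any(c(y, z), is.na))$id
complete_ids
#> [1] 2 3

# Dropping on `x > 5 | y > 6` in one call ...
one_call <- df %>% filter(!(x > 5 | y > 6))
# ... gives the same rows as two successive "drop" steps
two_calls <- df %>% filter(!(x > 5)) %>% filter(!(y > 6))
identical(one_call$id, two_calls$id)
#> [1] TRUE
```

With the proposed verbs, the leading `!` would disappear from each of these calls.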

Some issues and questions related to this:
- https://github.com/tidyverse/dplyr/issues/6888
- https://github.com/tidyverse/dplyr/issues/1527
- https://github.com/tidyverse/dplyr/issues/741
- https://github.com/tidyverse/dplyr/issues/1797
- https://stackoverflow.com/questions/45661377/delete-rows-based-on-multiple-conditions-with-dplyr
