The case for filter(.missing = NULL, .how = c("keep", "drop")) #6891

@DavisVaughan

There have been quite a few requests in the past for an "anti filter", i.e. a way to specify a set of conditions that determine which rows to drop. Additionally, it has traditionally been somewhat difficult to explain that filter() is about specifying rows to keep; the verb name doesn't make that clear. We've also seen that it is mildly confusing that select() is about columns and filter() is about rows; again, nothing in the verb names describes the difference.


One thing we could consider doing is to add two new very explicit verbs:

  • keep_rows(data, ..., by = , missing = )
  • drop_rows(data, ..., by = , missing = )

Where keep_rows() is equivalent to filter(), and drop_rows() is the opposite.

To be very clear, filter() would never disappear. We would, however, consider superseding it in favor of these if they prove to be successful, which really only means we'd start using them in docs and workshops instead of filter(). We might not even supersede filter(), since many people find superseding scary. Instead, keep_rows() and filter() would simply be aliases of each other.

The biggest annoyance when writing a "drop" style expression with filter() is that you first have to write a "keep" expression and then painfully invert it. i.e.:

"drop rows from df where a and b and c are TRUE"

library(dplyr)

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)
df
#> # A tibble: 3 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE 
#> 2     2 NA    TRUE  TRUE 
#> 3     3 TRUE  TRUE  TRUE

# "keep rows where a and b and c are TRUE"
# this nicely drops NAs because they don't match our specified criteria
filter(df, a, b, c)
#> # A tibble: 1 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     3 TRUE  TRUE  TRUE

# "drop rows where a and b and c are TRUE"
# this AWFULLY drops NAs because of how filter() works!
# NA doesn't match our criteria, so it shouldn't be seen as something to drop
filter(df, !(a & b & c))
#> # A tibble: 1 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE

Note that even the seemingly correct "drop" expression is actually wrong when it comes to handling missing values. It is fairly hard to get this right.
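
For comparison, here is one way to spell the intended "drop" with today's filter(). This is only a sketch of a workaround, not something proposed in this issue: it wraps the condition in coalesce() so that an NA result counts as "does not match" before it is negated.

```r
library(dplyr)

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)

# "drop rows where a and b and c are TRUE", keeping rows where the
# combined condition is NA: coalesce() turns NA into FALSE first,
# and only then is the condition negated.
kept <- filter(df, !coalesce(a & b & c, FALSE))
kept$id
#> [1] 1 2
```

The extra coalesce() is exactly the kind of detail that is easy to forget when inverting a "keep" expression by hand.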

The drop_rows() version would be:

df %>% drop_rows(a, b, c)
#> # A tibble: 2 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE 
#> 2     2 NA    TRUE  TRUE 

Here NA isn't considered something you "drop" by default, but it would be if missing were set to whatever we decide means "treat a missing value like TRUE".
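
The default behavior described above can be emulated on top of the existing filter(). The drop_rows() below is a hypothetical stand-in written for illustration, not the proposed implementation: it ignores by and missing, ungroups its input, and uses a made-up helper column named .row_id.

```r
library(dplyr)

# Hypothetical sketch of drop_rows(); not part of dplyr.
drop_rows <- function(data, ...) {
  data <- ungroup(data)
  # Tag every row, then let filter() find the rows where all
  # conditions are TRUE (NA counts as "does not match").
  tagged <- mutate(data, .row_id = row_number())
  matched <- pull(filter(tagged, ...), .row_id)
  # Keep every row that was not matched.
  keep <- setdiff(seq_len(nrow(data)), matched)
  data[keep, ]
}

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)

dropped <- drop_rows(df, a, b, c)
dropped$id
#> [1] 1 2
```

Because the matching step reuses filter(), multiple conditions are combined with & and an NA never causes a row to be dropped.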


A few other notes:

  • missing comes from #6560 ("filter(.missing = ) option to optionally retain missing values") and controls how missing values are treated. By default, both functions would treat an NA as FALSE (i.e. keep_rows() never keeps a row because of an NA, and drop_rows() never drops one), but could be made to treat them as TRUE or as an error. That said, I don't think missing = c("keep", "drop", "error") works uniformly for both verbs, so we'd need to think of another parameterization.
  • Both functions would support if_all() and if_any(), which I think form nice natural sentences. "drop rows if any are NA" sounds pretty good for drop_rows(df, if_any(c(a, b), is.na)). That is like tidyr::drop_na().
  • Neither would support across(), which we have been deprecating from filter() for a little while now.
  • Both functions would combine multiple conditions using &, as that is typically the natural way to combine them, and you can always get | behavior with either an explicit | or multiple calls to the function. i.e. df %>% drop_rows(x > 5 | y > 6) is the same as df %>% drop_rows(x > 5) %>% drop_rows(y > 6) (and you can't do that split trick with & in drop_rows()). if_any() can also work for | when you need to apply the same function to multiple columns.
  • Both functions would support by.
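
Two of the notes above can be checked with today's filter(). This is a rough sketch under stated assumptions: drop_rows(df, cond) is emulated as filter(df, !coalesce(cond, FALSE)), and the data frames df_na and df_xy are made up for illustration.

```r
library(dplyr)

# "drop rows if any are NA", spelled with the existing filter().
# is.na() never returns NA, so a plain negation is safe here.
df_na <- tibble(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))
no_na <- filter(df_na, !if_any(c(a, b), is.na))
nrow(no_na)
#> [1] 2

# The "split trick": dropping on `x > 5 | y > 6` in one call gives the
# same rows as dropping on each condition in two consecutive calls.
df_xy <- tibble(
  id = 1:5,
  x = c(1, 7, NA, 2, NA),
  y = c(1, 1, 9, NA, NA)
)

one_call <- filter(df_xy, !coalesce(x > 5 | y > 6, FALSE))

two_calls <- df_xy %>%
  filter(!coalesce(x > 5, FALSE)) %>%
  filter(!coalesce(y > 6, FALSE))

identical(one_call, two_calls)
#> [1] TRUE
one_call$id
#> [1] 1 4 5
```

Rows 4 and 5 survive in both versions because their conditions evaluate to NA rather than TRUE, which matches the proposed default of never dropping on a missing value.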

Some issues and questions related to this:

Labels: feature (a feature request or enhancement), rows ↕️ (Operations on rows: filter(), slice(), arrange()), verbs 🏃‍♀️