Description
There have been quite a few requests in the past for an "anti filter", i.e. I want to specify a set of conditions that determine which rows to drop. Additionally, it has traditionally been somewhat difficult to explain that filter() is about specifying rows to keep; the verb name doesn't make that clear. We've also seen that it is mildly confusing that select() is about columns and filter() is about rows. Again, nothing in the verb names describes the difference.
One thing we could consider doing is to add two new very explicit verbs:
keep_rows(data, ..., by = , missing = )
drop_rows(data, ..., by = , missing = )
Where keep_rows() is equivalent to filter(), and drop_rows() is the opposite.
To be very clear, filter() would never disappear. We would, however, consider superseding it in favor of these if they prove to be successful, which really only means we'd start using them in docs and workshops instead of filter(). Since many people find superseding scary, we'd also consider not superseding filter() at all. Instead, keep_rows() would simply be an alias for filter().
The biggest annoyance when writing a "drop" style expression with filter() is that you first have to write a "keep" expression and then painfully invert it. For example:
"drop rows from df where a and b and c are TRUE"
library(dplyr)
df <- tibble(
id = c(1, 2, 3),
a = c(TRUE, NA, TRUE),
b = c(FALSE, TRUE, TRUE),
c = c(TRUE, TRUE, TRUE)
)
df
#> # A tibble: 3 × 4
#> id a b c
#> <dbl> <lgl> <lgl> <lgl>
#> 1 1 TRUE FALSE TRUE
#> 2 2 NA TRUE TRUE
#> 3 3 TRUE TRUE TRUE
# "keep rows where a and b and c are TRUE"
# this nicely drops NAs because they don't match our specified criteria
filter(df, a, b, c)
#> # A tibble: 1 × 4
#> id a b c
#> <dbl> <lgl> <lgl> <lgl>
#> 1 3 TRUE TRUE TRUE
# "drop rows where a and b and c are TRUE"
# this AWFULLY drops NAs because of how filter() works!
# NA doesn't match our criteria so shouldn't be seen as something to drop
filter(df, !(a & b & c))
#> # A tibble: 1 × 4
#> id a b c
#> <dbl> <lgl> <lgl> <lgl>
#> 1 1 TRUE FALSE TRUE
Note that even the seemingly correct "drop" expression is actually wrong when it comes to handling missing values. It is fairly hard to get this right.
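For comparison, here is a sketch of how the inversion can be made NA-aware with today's filter(), using the same df as above. This is just one possible workaround, not something proposed in this issue: the idea is to treat an NA condition as "not a match" before negating it, so that row 2 survives.

```r
library(dplyr)

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)

# Replace an NA condition with FALSE ("not a match") before inverting,
# so rows with a missing condition are kept rather than dropped
kept_coalesce <- filter(df, !coalesce(a & b & c, FALSE))

# Equivalent spelling: keep rows where the condition is FALSE or NA
kept_is_na <- filter(df, !(a & b & c) | is.na(a & b & c))

kept_coalesce$id
kept_is_na$id
```

Both calls keep rows 1 and 2, but each requires remembering to handle NA explicitly, which is the part that is easy to get wrong.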
The drop_rows() version would be:
df %>% drop_rows(a, b, c)
#> # A tibble: 2 × 4
#> id a b c
#> <dbl> <lgl> <lgl> <lgl>
#> 1 1 TRUE FALSE TRUE
#> 2 2 NA TRUE TRUE
Here NA isn't considered something you "drop" by default, but it would be if missing were set to whatever we decide means "treat a missing value like TRUE".
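To make the intended default semantics concrete, here is a rough sketch of how the behavior could be emulated on top of the existing filter(). The name drop_rows_sketch() is hypothetical and exists only for illustration. It ignores the by and missing arguments entirely and is not meant to suggest how a real implementation would look.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr: drop rows where every condition in
# `...` is TRUE. A condition that evaluates to NA counts as "not a match",
# so that row is kept.
drop_rows_sketch <- function(data, ...) {
  conds <- enquos(...)
  # Combine the conditions with `&`, map NA to FALSE, then invert
  filter(data, !coalesce(Reduce(`&`, list(!!!conds)), FALSE))
}

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)

dropped <- drop_rows_sketch(df, a, b, c)
dropped$id
```

With the example data, only row 3 has all three conditions TRUE, so rows 1 and 2 remain, matching the output shown above.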
A few other notes:
- missing is from "filter(.missing = ) option to optionally retain missing values" (#6560) and controls how missing values are treated. By default, both functions would treat an NA as FALSE (i.e. missing values are never kept or dropped), but could be made to treat them as TRUE or an error. Though I don't think missing = c("keep", "drop", "error") works uniformly for both verbs, so we'd need to think of another parameterization.
- Both functions would support if_all() and if_any(), which I think form nice natural sentences. "drop rows if any are NA" sounds pretty good for drop_rows(df, if_any(c(a, b), is.na)). That is like tidyr::drop_na().
- Neither would support across(), which we have been deprecating from filter() for a little while now.
- Both functions would combine multiple conditions using &, as that is typically the natural way to combine multiple conditions and you can always get | behavior with either an explicit | or by using multiple calls to the function. i.e. df %>% drop_rows(x > 5 | y > 6) is the same as df %>% drop_rows(x > 5) %>% drop_rows(y > 6) (and you can't do that split trick with &). if_any() can also work for | when you need to apply the same function to multiple columns.
- Both functions would support by
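The if_any() and | points above can be checked against today's filter(). The sketch below uses made-up data and writes each "drop" as an inverted "keep" with NA mapped to FALSE, which is an assumption about how the proposed default would behave rather than actual drop_rows() code.

```r
library(dplyr)

# Made-up example data with a missing value in each column
df <- tibble(
  x = c(1, 6, 2, NA, 7),
  y = c(7, 1, 3, 8, NA)
)

# "drop rows if any of x, y are NA", written as an inverted keep
no_missing <- filter(df, !if_any(c(x, y), is.na))

# Dropping on `x > 5 | y > 6` in a single step ...
one_step <- filter(df, !coalesce(x > 5 | y > 6, FALSE))

# ... gives the same rows as two successive drops
two_steps <- df %>%
  filter(!coalesce(x > 5, FALSE)) %>%
  filter(!coalesce(y > 6, FALSE))

identical(one_step$x, two_steps$x)
```

The split works for | because a row is dropped as soon as either condition is TRUE. With &, a row must satisfy both conditions at once, so sequential drops would remove too many rows.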
Some issues and questions related to this: