008-dplyr-retain-and-exclude.Rmd (72 additions & 12 deletions)
@@ -37,8 +37,8 @@ To address these issues, we propose two new families of dplyr verbs:
 
 ```r
 # Data frame functions
-retain(.data, ..., .by=NULL, .missing=FALSE)
-exclude(.data, ..., .by=NULL, .missing=FALSE)
+retain(.data, ..., .by=NULL)
+exclude(.data, ..., .by=NULL)
 
 # Vector functions
 when_any(..., missing=NULL)
@@ -53,10 +53,6 @@ For `retain()` and `exclude()`:
 
 - As we will see, having `exclude()` work in this way simplifies many cases of using `filter()` to exclude rows.
 
-- `.missing = TRUE` opts in to treating `NA` like `TRUE`.
-  For `retain()`, this retains missing values.
-  For `exclude()`, this excludes missing values.
-
 For `when_any()` and `when_all()`:
 
 - These are equivalents to `pmin()` and `pmax()`, but applied to `any()` and `all()`.
@@ -70,6 +66,9 @@ For `when_any()` and `when_all()`:
 - `missing = NULL` propagates `NA` through according to the typical `&` and `|` rules.
   Propagating missing values by default combines well with `retain()` and `exclude()`.
 
+- `missing = FALSE / TRUE` replaces `NA` with the specified `missing` value before combining.
+  This acts as a more flexible `na_rm`-style argument.
+
 - These functions can be used anywhere, not just in `retain()` and `exclude()`.
 
 - They do have the potential to be confused with `if_any()` and `if_all()`, which apply a function to a selection of columns but otherwise operate similarly.
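
Note on the `missing` argument described in this hunk: the following is a minimal base R sketch of the semantics as the text states them, not the proposed implementation. The names `sketch_when()`, `sketch_when_any()`, and `sketch_when_all()` are hypothetical stand-ins, and the sketch assumes every input is a logical vector of the same length.

```r
# Hypothetical stand-ins for the proposed when_any() / when_all().
# Names and internals are assumptions for illustration only.
sketch_when <- function(..., missing = NULL, combine) {
  inputs <- list(...)
  if (!is.null(missing)) {
    # missing = FALSE / TRUE: replace NA with that value before combining.
    inputs <- lapply(inputs, function(x) {
      x[is.na(x)] <- missing
      x
    })
  }
  # missing = NULL: NA propagates through the usual `&` / `|` rules.
  Reduce(combine, inputs)
}

sketch_when_any <- function(..., missing = NULL) {
  sketch_when(..., missing = missing, combine = `|`)
}

sketch_when_all <- function(..., missing = NULL) {
  sketch_when(..., missing = missing, combine = `&`)
}

a <- c(TRUE, FALSE, NA, NA)
b <- c(NA, NA, FALSE, NA)

sketch_when_any(a, b)                  # TRUE NA NA NA
sketch_when_any(a, b, missing = FALSE) # TRUE FALSE FALSE FALSE
sketch_when_all(a, b)                  # NA FALSE FALSE NA
sketch_when_all(a, b, missing = TRUE)  # TRUE FALSE FALSE TRUE
```

With `missing = NULL`, `TRUE | NA` is `TRUE` and `FALSE & NA` is `FALSE`, while the remaining combinations with `NA` stay `NA`; a row-filtering verb can then decide how to treat the leftover `NA`s.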
@@ -514,17 +513,12 @@ data |> exclude(this) |> exclude(that)
 
 This isn't the case for `retain()` and `|`, hence the added value of `when_any()`.
 
-## TODO: An example for `.missing`?
-
-Is `.missing = TRUE` ever useful in `retain()` and `exclude()`, or was it just a red herring that is resolved by the fact that we have `exclude()` now?
-I seem to remember there are cases in Sarah's examples where `.missing = TRUE` would still be useful.
-
 ## Backwards compatibility
 
 ### `filter()`
 
 `filter()` would alias to `retain()` and would never be superseded or deprecated.
-We would be very careful to retain all existing behavior of `filter()`, but we may decide not to give it new features that `retain()` and `exclude()`would gain, like `.missing`.
+We would be very careful to retain all existing behavior of `filter()`, but we may decide not to give it new features that `retain()` and `exclude()` gain.
 
 ## How to teach
 
@@ -579,6 +573,72 @@ With alternate names:
 
 The existing verbs are all single words, and the `_rows()` suffix here throws off the overall coherence.
 
+#### `retain(.missing =)` and `exclude(.missing =)`
+
+An earlier version of this tidyup considered adding `.missing = FALSE / TRUE` to `retain()` and `exclude()`, with `FALSE` being the default to "treat an `NA` like `FALSE`".
+This was in response to *many* requests for an argument like this on the dplyr Issues page.
+After gathering more examples and feedback, we've determined:
+
+- This argument is *highly* confusing to think about.
+
+- The argument is a red herring.
+  You actually wanted an `exclude()` all along.
+
+Here's the theoretical motivation for `.missing = TRUE`:
+
+> *Exclude* rows where `x` and `y` are equal.
+
+```{r}
+data <- tibble(
+  x = c(1, 1, 1, 2, 2, 2, NA, NA, NA),
+  y = c(1, 2, NA, 1, 2, NA, 1, 2, NA)
+)
+
+data
+```
+
+Because dplyr didn't have an "exclude rows" style function, you'd reach for `filter()` with `!=`:
+
+```{r}
+data |> filter(x != y)
+```
+
+But then you'd get frustrated that this didn't drop *only* the rows where `x` and `y` are equal; it also dropped the `NA` rows where the result is ambiguous.
+So you'd add `is.na()` calls:
+
+```{r}
+data |> filter(x != y | is.na(x) | is.na(y))
+```
+
+At that point, people reasonably thought that a `.missing = TRUE` argument might be useful.
+This would automatically treat `NA`s resulting from `x != y` as `TRUE` rather than the default of `FALSE`.
+In other words, they wanted to write:
+
+```{r, eval = FALSE}
+data |> filter(x != y, .missing = TRUE)
+```
+
+But this is both a red herring and a fairly unintuitive bit of code to come back and read a year from now.
+
+We've determined that what we were *actually* missing was `exclude()`, because this is just:
+
+```{r}
+data |> exclude(x == y)
+```
+
+This has the benefits of being short, intuitive, and clearly aligning with the intent of the original goal.
+
+Every issue / question below is actually a request for `exclude()` in disguise:
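
Note on the `x != y` example in this hunk: the claimed equivalence can be checked without the proposed verbs. The following is a minimal base R sketch, assuming only the semantics described in the text (retaining treats `NA` as `FALSE`, excluding drops only rows where the condition is `TRUE`). `sketch_retain()` and `sketch_exclude()` are hypothetical helpers, not dplyr functions, and a plain `data.frame` stands in for the tibble.

```r
# Same data as the example in the diff, as a base R data frame.
data <- data.frame(
  x = c(1, 1, 1, 2, 2, 2, NA, NA, NA),
  y = c(1, 2, NA, 1, 2, NA, 1, 2, NA)
)

# Hypothetical retain-like helper: keep rows where the condition is TRUE.
# An NA condition counts as FALSE, mirroring filter().
sketch_retain <- function(df, condition) {
  df[!is.na(condition) & condition, , drop = FALSE]
}

# Hypothetical exclude-like helper: drop rows where the condition is TRUE.
# Rows with an NA condition are kept.
sketch_exclude <- function(df, condition) {
  df[!(!is.na(condition) & condition), , drop = FALSE]
}

kept_ne <- sketch_retain(data, data$x != data$y)
kept_ne_na <- sketch_retain(
  data,
  data$x != data$y | is.na(data$x) | is.na(data$y)
)
kept_exclude <- sketch_exclude(data, data$x == data$y)

nrow(kept_ne)                       # 2: the NA rows were dropped too
nrow(kept_exclude)                  # 7: only the two equal rows were dropped
identical(kept_ne_na, kept_exclude) # TRUE
```

Under these assumptions, the `is.na()` workaround and the exclude-style call keep exactly the same seven rows, which is the point the added section makes.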