Commit d75c835

Drop .missing and talk about why it's a red herring
1 parent 732ad36 commit d75c835

2 files changed: 197 additions & 26 deletions

008-dplyr-retain-and-exclude.Rmd

Lines changed: 72 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -37,8 +37,8 @@ To address these issues, we propose two new families of dplyr verbs:
3737

3838
``` r
3939
# Data frame functions
40-
retain(.data, ..., .by = NULL, .missing = FALSE)
41-
exclude(.data, ..., .by = NULL, .missing = FALSE)
40+
retain(.data, ..., .by = NULL)
41+
exclude(.data, ..., .by = NULL)
4242

4343
# Vector functions
4444
when_any(..., missing = NULL)
@@ -53,10 +53,6 @@ For `retain()` and `exclude()`:
5353

5454
- As we will see, having `exclude()` work in this way simplifies many cases of using `filter()` to exclude rows.
5555

56-
- `.missing = TRUE` opts in to treating `NA` like `TRUE`.
57-
For `retain()`, this retains missing values.
58-
For `exclude()`, this excludes missing values.
59-
6056
For `when_any()` and `when_all()`:
6157

6258
- These are equivalents to `pmin()` and `pmax()`, but applied to `any()` and `all()`.
@@ -70,6 +66,9 @@ For `when_any()` and `when_all()`:
7066
- `missing = NULL` propagates `NA` through according to the typical `&` and `|` rules.
7167
Propagating missing values by default combines well with `retain()` and `exclude()`.
7268

69+
- `missing = FALSE / TRUE` replace `NA` with the specified `missing` value before combining.
70+
This acts as a more flexible `na_rm` style argument.
71+
7372
- These functions can be used anywhere, not just in `retain()` and `exclude()`.
7473

7574
- They do have the potential to be confused with `if_any()` and `if_all()`, which apply a function to a selection of columns but otherwise operate similarly.
@@ -514,17 +513,12 @@ data |> exclude(this) |> exclude(that)
514513

515514
This isn't the case for `retain()` and `|`, hence the added value of `when_any()`.
516515

517-
## TODO: An example for `.missing`?
518-
519-
Is `.missing = TRUE` ever useful in `retain()` and `exclude()`, or was it just a red herring that is resolved by the fact that we have `exclude()` now?
520-
I seem to remember there are cases in Sarah's examples where `.missing = TRUE` would still be useful.
521-
522516
## Backwards compatibility
523517

524518
### `filter()`
525519

526520
`filter()` would alias to `retain()` and would never be superseded or deprecated.
527-
We would be very careful to retain all existing behavior of `filter()`, but we may decide not to give it new features that `retain()` and `exclude()` would gain, like `.missing`.
521+
We would be very careful to retain all existing behavior of `filter()`, but we may decide not to give it new features that `retain()` and `exclude()` gain.
528522

529523
## How to teach
530524

@@ -579,6 +573,72 @@ With alternate names:
579573

580574
The existing verbs are all single words, and the `_rows()` suffix here throws off the overall coherence.
581575

576+
#### `retain(.missing =)` and `exclude(.missing =)`
577+
578+
An earlier version of this tidyup considered adding `.missing = FALSE / TRUE` to `retain()` and `exclude()`, with `FALSE` being the default to "treat an `NA` like `FALSE`".
579+
This was in response to *many* requests for an argument like this on the dplyr Issues page.
580+
After gathering more examples and feedback, we've determined:
581+
582+
- This argument is *highly* confusing to think about.
583+
584+
- The argument is a red herring.
585+
You actually wanted an `exclude()` all along.
586+
587+
Here's the theoretical motivation for `.missing = TRUE`:
588+
589+
> *Exclude* rows where `x` and `y` are equal.
590+
591+
```{r}
592+
data <- tibble(
593+
x = c(1, 1, 1, 2, 2, 2, NA, NA, NA),
594+
y = c(1, 2, NA, 1, 2, NA, 1, 2, NA)
595+
)
596+
597+
data
598+
```
599+
600+
Because dplyr didn't have an "exclude rows" style function, you'd reach for `filter()` with `!=`:
601+
602+
```{r}
603+
data |> filter(x != y)
604+
```
605+
606+
But then you'd get frustrated that this didn't drop *only* the rows where `x` and `y` are equal; it also dropped the `NA` rows where the result is ambiguous.
607+
So you'd add `is.na()` calls:
608+
609+
```{r}
610+
data |> filter(x != y | is.na(x) | is.na(y))
611+
```
612+
613+
At that point, people reasonably thought that a `.missing = TRUE` argument might be useful.
614+
This would automatically treat `NA`s resulting from `x != y` as `TRUE` rather than the default of `FALSE`.
615+
In other words, they wanted to write:
616+
617+
```{r, eval = FALSE}
618+
data |> filter(x != y, .missing = TRUE)
619+
```
620+
621+
But this is both a red herring and a fairly unintuitive bit of code to come back and read a year from now.
622+
623+
We've determined that what we were *actually* missing was `exclude()`, because this is just:
624+
625+
```{r}
626+
data |> exclude(x == y)
627+
```
628+
629+
This has the benefits of being short, intuitive, and clearly aligning with the intent of the original goal.
630+
631+
Every issue / question below is actually a request for `exclude()` in disguise:
632+
633+
- [`exclude(col1 == col2)`](https://github.com/tidyverse/dplyr/issues/6432)
634+
- [`exclude(Species == "virginica")`](https://github.com/tidyverse/dplyr/issues/6013)
635+
- [`exclude(y == "a")`](https://github.com/tidyverse/dplyr/issues/3196)
636+
- [`exclude(col == "str")`](https://stackoverflow.com/questions/46378437/how-to-filter-data-without-losing-na-rows-using-dplyr)
637+
- [`exclude(var1 == 1)`](https://stackoverflow.com/questions/32908589/why-does-dplyrs-filter-drop-na-values-from-a-factor-variable)
638+
639+
In the *extremely* rare cases where you might need `missing =`, you can use `when_any()` and `when_all()` inside of `retain()` and `exclude()`.
640+
These propagate missings by default but have a `missing` argument for you to control this behavior.
641+
582642
## Appendix
583643

584644
### References
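
The `.missing` discussion in this diff rests on how `NA` behaves in comparisons. The following base R sketch reproduces the row counts shown for `filter(x != y)` and `exclude(x == y)` without relying on the proposed verbs. `exclude_rows()` is a hypothetical helper written for illustration only; it is not dplyr's `exclude()` and only assumes the "drop rows where the condition is `TRUE`" behavior described above.

```r
# Base R sketch only; `exclude_rows()` is a hypothetical helper,
# not the proposed dplyr::exclude().
data <- data.frame(
  x = c(1, 1, 1, 2, 2, 2, NA, NA, NA),
  y = c(1, 2, NA, 1, 2, NA, 1, 2, NA)
)

# A comparison involving NA returns NA, not TRUE or FALSE.
neq <- data$x != data$y

# filter()-style: keep rows where the condition is TRUE, so NA rows are dropped.
kept_by_filter <- data[which(neq), , drop = FALSE]

# The manual workaround: explicitly keep rows where either side is NA.
kept_by_workaround <- data[which(neq | is.na(data$x) | is.na(data$y)), , drop = FALSE]

# exclude()-style: drop rows where the condition is TRUE, so NA rows are kept.
exclude_rows <- function(df, cond) {
  df[!(cond %in% TRUE), , drop = FALSE]
}
kept_by_exclude <- exclude_rows(data, data$x == data$y)

nrow(kept_by_filter)
nrow(kept_by_workaround)
nrow(kept_by_exclude)
```

Under these assumptions the workaround and the exclude-style helper keep the same rows, which is the sense in which `.missing = TRUE` turns out to be a request for `exclude()`.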

008-dplyr-retain-and-exclude.md

Lines changed: 125 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -38,8 +38,8 @@ To address these issues, we propose two new families of dplyr verbs:
3838

3939
``` r
4040
# Data frame functions
41-
retain(.data, ..., .by = NULL, .missing = FALSE)
42-
exclude(.data, ..., .by = NULL, .missing = FALSE)
41+
retain(.data, ..., .by = NULL)
42+
exclude(.data, ..., .by = NULL)
4343

4444
# Vector functions
4545
when_any(..., missing = NULL)
@@ -55,10 +55,6 @@ For `retain()` and `exclude()`:
5555
- As we will see, having `exclude()` work in this way simplifies many
5656
cases of using `filter()` to exclude rows.
5757

58-
- `.missing = TRUE` opts in to treating `NA` like `TRUE`. For
59-
`retain()`, this retains missing values. For `exclude()`, this
60-
excludes missing values.
61-
6258
For `when_any()` and `when_all()`:
6359

6460
- These are equivalents to `pmin()` and `pmax()`, but applied to `any()`
@@ -74,6 +70,10 @@ For `when_any()` and `when_all()`:
7470
and `|` rules. Propagating missing values by default combines well with
7571
`retain()` and `exclude()`.
7672

73+
- `missing = FALSE / TRUE` replace `NA` with the specified `missing`
74+
value before combining. This acts as a more flexible `na_rm` style
75+
argument.
76+
7777
- These functions can be used anywhere, not just in `retain()` and
7878
`exclude()`.
7979

@@ -872,21 +872,14 @@ data |> exclude(this) |> exclude(that)
872872
This isn’t the case for `retain()` and `|`, hence the added value of
873873
`when_any()`.
874874

875-
## TODO: An example for `.missing`?
876-
877-
Is `.missing = TRUE` ever useful in `retain()` and `exclude()`, or was
878-
it just a red herring that is resolved by the fact that we have
879-
`exclude()` now? I seem to remember there are cases in Sarah’s examples
880-
where `.missing = TRUE` would still be useful.
881-
882875
## Backwards compatibility
883876

884877
### `filter()`
885878

886879
`filter()` would alias to `retain()` and would never be superseded or
887880
deprecated. We would be very careful to retain all existing behavior of
888881
`filter()`, but we may decide not to give it new features that
889-
`retain()` and `exclude()` would gain, like `.missing`.
882+
`retain()` and `exclude()` gain.
890883

891884
## How to teach
892885

@@ -948,6 +941,124 @@ With alternate names:
948941
The existing verbs are all single words, and the `_rows()` suffix here
949942
throws off the overall coherence.
950943

944+
#### `retain(.missing =)` and `exclude(.missing =)`
945+
946+
An earlier version of this tidyup considered adding
947+
`.missing = FALSE / TRUE` to `retain()` and `exclude()`, with `FALSE`
948+
being the default to “treat an `NA` like `FALSE`”. This was in response
949+
to *many* requests for an argument like this on the dplyr Issues page.
950+
After gathering more examples and feedback, we’ve determined:
951+
952+
- This argument is *highly* confusing to think about.
953+
954+
- The argument is a red herring. You actually wanted an `exclude()` all
955+
along.
956+
957+
Here’s the theoretical motivation for `.missing = TRUE`:
958+
959+
> *Exclude* rows where `x` and `y` are equal.
960+
961+
``` r
962+
data <- tibble(
963+
x = c(1, 1, 1, 2, 2, 2, NA, NA, NA),
964+
y = c(1, 2, NA, 1, 2, NA, 1, 2, NA)
965+
)
966+
967+
data
968+
```
969+
970+
## # A tibble: 9 × 2
971+
## x y
972+
## <dbl> <dbl>
973+
## 1 1 1
974+
## 2 1 2
975+
## 3 1 NA
976+
## 4 2 1
977+
## 5 2 2
978+
## 6 2 NA
979+
## 7 NA 1
980+
## 8 NA 2
981+
## 9 NA NA
982+
983+
Because dplyr didn’t have an “exclude rows” style function, you’d reach
984+
for `filter()` with `!=`:
985+
986+
``` r
987+
data |> filter(x != y)
988+
```
989+
990+
## # A tibble: 2 × 2
991+
## x y
992+
## <dbl> <dbl>
993+
## 1 1 2
994+
## 2 2 1
995+
996+
But then you’d get frustrated that this didn’t drop *only* the rows
997+
where `x` and `y` are equal; it also dropped the `NA` rows where the
998+
result is ambiguous. So you’d add `is.na()` calls:
999+
1000+
``` r
1001+
data |> filter(x != y | is.na(x) | is.na(y))
1002+
```
1003+
1004+
## # A tibble: 7 × 2
1005+
## x y
1006+
## <dbl> <dbl>
1007+
## 1 1 2
1008+
## 2 1 NA
1009+
## 3 2 1
1010+
## 4 2 NA
1011+
## 5 NA 1
1012+
## 6 NA 2
1013+
## 7 NA NA
1014+
1015+
At that point, people reasonably thought that a `.missing = TRUE`
1016+
argument might be useful. This would automatically treat `NA`s resulting
1017+
from `x != y` as `TRUE` rather than the default of `FALSE`. In other
1018+
words, they wanted to write:
1019+
1020+
``` r
1021+
data |> filter(x != y, .missing = TRUE)
1022+
```
1023+
1024+
But this is both a red herring and a fairly unintuitive bit of code to
1025+
come back and read a year from now.
1026+
1027+
We’ve determined that what we were *actually* missing was `exclude()`,
1028+
because this is just:
1029+
1030+
``` r
1031+
data |> exclude(x == y)
1032+
```
1033+
1034+
## # A tibble: 7 × 2
1035+
## x y
1036+
## <dbl> <dbl>
1037+
## 1 1 2
1038+
## 2 1 NA
1039+
## 3 2 1
1040+
## 4 2 NA
1041+
## 5 NA 1
1042+
## 6 NA 2
1043+
## 7 NA NA
1044+
1045+
This has the benefits of being short, intuitive, and clearly aligning
1046+
with the intent of the original goal.
1047+
1048+
Every issue / question below is actually a request for `exclude()` in
1049+
disguise:
1050+
1051+
- [`exclude(col1 == col2)`](https://github.com/tidyverse/dplyr/issues/6432)
1052+
- [`exclude(Species == "virginica")`](https://github.com/tidyverse/dplyr/issues/6013)
1053+
- [`exclude(y == "a")`](https://github.com/tidyverse/dplyr/issues/3196)
1054+
- [`exclude(col == "str")`](https://stackoverflow.com/questions/46378437/how-to-filter-data-without-losing-na-rows-using-dplyr)
1055+
- [`exclude(var1 == 1)`](https://stackoverflow.com/questions/32908589/why-does-dplyrs-filter-drop-na-values-from-a-factor-variable)
1056+
1057+
In the *extremely* rare cases where you might need `missing =`, you can
1058+
use `when_any()` and `when_all()` inside of `retain()` and `exclude()`.
1059+
These propagate missings by default but have a `missing` argument for
1060+
you to control this behavior.
1061+
9511062
## Appendix
9521063

9531064
### References
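
The `missing` argument described for `when_any()` in this diff ("replace `NA` with the specified `missing` value before combining") can be illustrated with a small base R sketch. `when_any_sketch()` is a hypothetical stand-in written for this note; it is not the proposed implementation and only assumes the behavior stated in the text.

```r
# Hypothetical sketch, not the proposed when_any(): combine logical vectors
# elementwise with `|`, optionally replacing NA with `missing` first.
when_any_sketch <- function(..., missing = NULL) {
  args <- list(...)
  if (!is.null(missing)) {
    args <- lapply(args, function(x) {
      x[is.na(x)] <- missing
      x
    })
  }
  Reduce(`|`, args)
}

a <- c(TRUE, FALSE, NA, NA)
b <- c(FALSE, FALSE, FALSE, NA)

when_any_sketch(a, b)                   # NA propagates per the usual `|` rules
when_any_sketch(a, b, missing = FALSE)  # NA treated as FALSE before combining
when_any_sketch(a, b, missing = TRUE)   # NA treated as TRUE before combining
```

With `missing = NULL` the result still contains `NA` wherever the usual `|` rules leave the answer ambiguous, so an outer retain-style or exclude-style step decides what happens to those rows.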
