Commit cb54ec3

Replace when_any/all()'s 3-valued missing = with binary na_rm =
1 parent d75c835 commit cb54ec3

File tree

2 files changed

+128
-45
lines changed

008-dplyr-retain-and-exclude.Rmd

Lines changed: 57 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -41,8 +41,8 @@ retain(.data, ..., .by = NULL)
4141
exclude(.data, ..., .by = NULL)
4242

4343
# Vector functions
44-
when_any(..., missing = NULL)
45-
when_all(..., missing = NULL)
44+
when_any(..., na_rm = FALSE)
45+
when_all(..., na_rm = FALSE)
4646
```
4747

4848
For `retain()` and `exclude()`:
@@ -63,11 +63,9 @@ For `when_any()` and `when_all()`:
6363

6464
- `when_all()` combines conditions with `&`.
6565

66-
- `missing = NULL` propagates `NA` through according to the typical `&` and `|` rules.
66+
- `na_rm = FALSE` propagates `NA` through according to the typical `&` and `|` rules.
6767
Propagating missing values by default combines well with `retain()` and `exclude()`.
68-
69-
- `missing = FALSE / TRUE` replace `NA` with the specified `missing` value before combining.
70-
This acts as a more flexible `na_rm` style argument.
68+
`na_rm = TRUE` removes `NA`s "rowwise" from the computation, exactly like in `pmin()` and `pmax()`.
7169

7270
- These functions can be used anywhere, not just in `retain()` and `exclude()`.
7371

@@ -636,8 +634,9 @@ Every issue / question below is actually a request for `exclude()` in disguise:
636634
- [`exclude(col == "str")`](https://stackoverflow.com/questions/46378437/how-to-filter-data-without-losing-na-rows-using-dplyr)
637635
- [`exclude(var1 == 1)`](https://stackoverflow.com/questions/32908589/why-does-dplyrs-filter-drop-na-values-from-a-factor-variable)
638636

639-
In the *extremely* rare cases where you might need `missing =`, you can use `when_any()` and `when_all()` inside of `retain()` and `exclude()`.
640-
These propagate missings by default but have a `missing` argument for you to control this behavior.
637+
In the *extremely* rare cases where you might need `missing = TRUE`, you can nest `when_all(na_rm = TRUE)` inside of `retain()` and `exclude()`.
638+
This propagates missings by default but `na_rm = TRUE` removes missings from the computation.
639+
For an "all" style operation, that is equivalent to treating missings like `TRUE` (i.e. `all()` and `all(NA, na.rm = TRUE)` both return `TRUE`).
641640

642641
## Appendix
643642

@@ -658,23 +657,23 @@ Related issues and examples:
658657

659658
Tables like these help us ensure there aren't any holes in our designs.
660659

661-
Intent vs Combine
660+
#### Intent vs Combine
662661

663-
+------------+------------+-------------------------------------------------------+
664-
| Intent | Combine | Solution |
665-
+============+============+=======================================================+
666-
| Retain | And | `retain(a, b, c)` |
667-
+------------+------------+-------------------------------------------------------+
668-
| Retain | Or | `retain(when_any(a, b, c))` |
669-
+------------+------------+-------------------------------------------------------+
670-
| Exclude | And | `exclude(a, b, c)` |
671-
+------------+------------+-------------------------------------------------------+
672-
| Exclude | Or | `exclude(when_any(a, b, c))` |
673-
| | | |
674-
| | | In practice: `exclude(a) |> exclude(b) |> exclude(c)` |
675-
+------------+------------+-------------------------------------------------------+
662+
+---------+---------+----------------------+-----------+-------------------------------------------------------+
663+
| Intent | Combine | Hypothetical usage % | Currently | Solution |
664+
+=========+=========+======================+===========+=======================================================+
665+
| Retain | And | 50% | ✅ | `retain(a, b, c)` |
666+
+---------+---------+----------------------+-----------+-------------------------------------------------------+
667+
| Retain | Or | 5% | ❌ | `retain(when_any(a, b, c))` |
668+
+---------+---------+----------------------+-----------+-------------------------------------------------------+
669+
| Exclude | And | 35% | ❌ | `exclude(a, b, c)` |
670+
+---------+---------+----------------------+-----------+-------------------------------------------------------+
671+
| Exclude | Or | 10% | ❌ | `exclude(when_any(a, b, c))` |
672+
| | | | | |
673+
| | | | | In practice: `exclude(a) |> exclude(b) |> exclude(c)` |
674+
+---------+---------+----------------------+-----------+-------------------------------------------------------+
676675

677-
Intent vs Missings
676+
#### Intent vs Missings
678677

679678
+-----------+------------------+---------------------------------------------------------+----------------------------------------------------------+
680679
| Intent | Missings | Outcome | Usefulness |
@@ -683,7 +682,40 @@ Intent vs Missings
683682
+-----------+------------------+---------------------------------------------------------+----------------------------------------------------------+
684683
| Exclude | Treat as `FALSE` | Exclude rows where you *know* the conditions are `TRUE` | Very. Simplifies "treat `filter()` as an exclude" cases. |
685684
+-----------+------------------+---------------------------------------------------------+----------------------------------------------------------+
686-
| Retain | Treat as `TRUE` | Retain rows where conditions are `TRUE` or `NA` | Unconvinced. Often this is an `exclude()` in disguise. |
685+
| Retain | Treat as `TRUE` | Retain rows where conditions are `TRUE` or `NA` | Not. This is an `exclude()` in disguise. |
687686
+-----------+------------------+---------------------------------------------------------+----------------------------------------------------------+
688-
| Exclude | Treat as `TRUE` | Exclude rows where conditions are `TRUE` or `NA` | Unconvinced. |
687+
| Exclude | Treat as `TRUE` | Exclude rows where conditions are `TRUE` or `NA` | Not. Never seen an example of this. |
689688
+-----------+------------------+---------------------------------------------------------+----------------------------------------------------------+
689+
690+
#### Connection to vctrs
691+
692+
We purposefully don't expose `missing` directly on the dplyr side.
693+
The 3-valued argument is quite complicated to think about.
694+
Instead it bubbles up through `retain()` / `exclude()` using `missing = FALSE` and `when_all()` / `when_any()`'s `na_rm` argument.
695+
696+
Particularly confusing for the average consumer is that `when_all(na_rm = TRUE)` maps to `list_pall(missing = TRUE)` but `when_any(na_rm = TRUE)` maps to `list_pany(missing = FALSE)`.
697+
Exposing only `na_rm = TRUE` saves users from having to do these very hard mental gymnastics.
698+
699+
+------------------------------+--------------------------+---------------------------+
700+
| vctrs | Data frame | Vector |
701+
+==============================+==========================+===========================+
702+
| `list_pall(missing = NULL)` | | `when_all(na_rm = FALSE)` |
703+
+------------------------------+--------------------------+---------------------------+
704+
| `list_pall(missing = FALSE)` | `retain()` / `exclude()` | |
705+
+------------------------------+--------------------------+---------------------------+
706+
| `list_pall(missing = TRUE)` | | `when_all(na_rm = TRUE)` |
707+
+------------------------------+--------------------------+---------------------------+
708+
| `list_pany(missing = NULL)` | | `when_any(na_rm = FALSE)` |
709+
+------------------------------+--------------------------+---------------------------+
710+
| `list_pany(missing = FALSE)` | | `when_any(na_rm = TRUE)` |
711+
+------------------------------+--------------------------+---------------------------+
712+
| `list_pany(missing = TRUE)` | | |
713+
+------------------------------+--------------------------+---------------------------+
714+
715+
- `list_pall(missing = FALSE)`:
716+
717+
- Interesting how this is useful as the `retain()` / `exclude()` default behavior but becomes too confusing to try and expose in `when_all()` as `missing` vs the simpler `na_rm`. Keeping "the most flexible" vector function way in vctrs feels right since the `missing = FALSE` case here is less useful in a vector context. It doesn't prevent you from doing `retain(when_all())` because the default propagates `NA` and then `retain()` itself does the `missing = FALSE` part.
718+
719+
- `list_pany(missing = TRUE)`:
720+
721+
- Like `list_pall(missing = FALSE)`, this is not the useful variant to expose at the vector level. Also happens to not have an exposed data frame variant, so dplyr doesn't expose it at all, which feels fine. Not a single example needed it.
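
The `na_rm` to `missing` mapping in the "Connection to vctrs" table can be sketched in base R. This is only an illustration under stated assumptions: `when_all_sketch()`, `when_any_sketch()`, and `replace_na_with()` are hypothetical helpers written for this note, not the dplyr or vctrs implementations, and they assume at least one input and that every input is a logical vector of the same length.

```r
# Hypothetical helper: swap every NA in a logical vector for `value`.
replace_na_with <- function(x, value) {
  x[is.na(x)] <- value
  x
}

# Sketch of when_all(): combine with `&`.
# Removing NAs from an "all" is the same as treating them as TRUE,
# i.e. the `missing = TRUE` row of the table.
when_all_sketch <- function(..., na_rm = FALSE) {
  conditions <- list(...)
  if (na_rm) {
    conditions <- lapply(conditions, replace_na_with, value = TRUE)
  }
  Reduce(`&`, conditions)
}

# Sketch of when_any(): combine with `|`.
# Removing NAs from an "any" is the same as treating them as FALSE,
# i.e. the `missing = FALSE` row of the table.
when_any_sketch <- function(..., na_rm = FALSE) {
  conditions <- list(...)
  if (na_rm) {
    conditions <- lapply(conditions, replace_na_with, value = FALSE)
  }
  Reduce(`|`, conditions)
}

a <- c(TRUE, NA, FALSE, NA)
b <- c(TRUE, TRUE, NA, NA)

when_all_sketch(a, b)                # TRUE    NA FALSE    NA
when_all_sketch(a, b, na_rm = TRUE)  # TRUE  TRUE FALSE  TRUE
when_any_sketch(a, b)                # TRUE  TRUE    NA    NA
when_any_sketch(a, b, na_rm = TRUE)  # TRUE  TRUE FALSE FALSE
```

The last position, where both inputs are `NA`, mirrors the scalar identities `all(NA, na.rm = TRUE)` being `TRUE` and `any(NA, na.rm = TRUE)` being `FALSE`, which is why a single `na_rm = TRUE` has to map to opposite `missing` values for the two functions.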

008-dplyr-retain-and-exclude.md

Lines changed: 71 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -42,8 +42,8 @@ retain(.data, ..., .by = NULL)
4242
exclude(.data, ..., .by = NULL)
4343

4444
# Vector functions
45-
when_any(..., missing = NULL)
46-
when_all(..., missing = NULL)
45+
when_any(..., na_rm = FALSE)
46+
when_all(..., na_rm = FALSE)
4747
```
4848

4949
For `retain()` and `exclude()`:
@@ -66,13 +66,10 @@ For `when_any()` and `when_all()`:
6666

6767
- `when_all()` combines conditions with `&`.
6868

69-
- `missing = NULL` propagates `NA` through according to the typical `&`
69+
- `na_rm = FALSE` propagates `NA` through according to the typical `&`
7070
and `|` rules. Propagating missing values by default combines well with
71-
`retain()` and `exclude()`.
72-
73-
- `missing = FALSE / TRUE` replace `NA` with the specified `missing`
74-
value before combining. This acts as a more flexible `na_rm` style
75-
argument.
71+
`retain()` and `exclude()`. `na_rm = TRUE` removes `NA`s “rowwise”
72+
from the computation, exactly like in `pmin()` and `pmax()`.
7673

7774
- These functions can be used anywhere, not just in `retain()` and
7875
`exclude()`.
@@ -1054,10 +1051,12 @@ disguise:
10541051
- [`exclude(col == "str")`](https://stackoverflow.com/questions/46378437/how-to-filter-data-without-losing-na-rows-using-dplyr)
10551052
- [`exclude(var1 == 1)`](https://stackoverflow.com/questions/32908589/why-does-dplyrs-filter-drop-na-values-from-a-factor-variable)
10561053

1057-
In the *extremely* rare cases where you might need `missing =`, you can
1058-
use `when_any()` and `when_all()` inside of `retain()` and `exclude()`.
1059-
These propagate missings by default but have a `missing` argument for
1060-
you to control this behavior.
1054+
In the *extremely* rare cases where you might need `missing = TRUE`, you
1055+
can nest `when_all(na_rm = TRUE)` inside of `retain()` and `exclude()`.
1056+
This propagates missings by default but `na_rm = TRUE` removes missings
1057+
from the computation. For an “all” style operation, that is equivalent
1058+
to treating missings like `TRUE` (i.e. `all()` and
1059+
`all(NA, na.rm = TRUE)` both return `TRUE`).
10611060

10621061
## Appendix
10631062

@@ -1079,52 +1078,104 @@ Related issues and examples:
10791078

10801079
Tables like these help us ensure there aren’t any holes in our designs.
10811080

1082-
Intent vs Combine
1081+
#### Intent vs Combine
10831082

1084-
<table style="width:99%;">
1083+
<table style="width:97%;">
10851084
<colgroup>
1086-
<col style="width: 15%" />
1087-
<col style="width: 15%" />
1088-
<col style="width: 67%" />
1085+
<col style="width: 8%" />
1086+
<col style="width: 8%" />
1087+
<col style="width: 20%" />
1088+
<col style="width: 10%" />
1089+
<col style="width: 49%" />
10891090
</colgroup>
10901091
<thead>
10911092
<tr>
10921093
<th>Intent</th>
10931094
<th>Combine</th>
1095+
<th>Hypothetical usage %</th>
1096+
<th>Currently</th>
10941097
<th>Solution</th>
10951098
</tr>
10961099
</thead>
10971100
<tbody>
10981101
<tr>
10991102
<td>Retain</td>
11001103
<td>And</td>
1104+
<td>50%</td>
1105+
<td>✅</td>
11011106
<td><code>retain(a, b, c)</code></td>
11021107
</tr>
11031108
<tr>
11041109
<td>Retain</td>
11051110
<td>Or</td>
1111+
<td>5%</td>
1112+
<td>❌</td>
11061113
<td><code>retain(when_any(a, b, c))</code></td>
11071114
</tr>
11081115
<tr>
11091116
<td>Exclude</td>
11101117
<td>And</td>
1118+
<td>35%</td>
1119+
<td>❌</td>
11111120
<td><code>exclude(a, b, c)</code></td>
11121121
</tr>
11131122
<tr>
11141123
<td>Exclude</td>
11151124
<td>Or</td>
1125+
<td>10%</td>
1126+
<td>❌</td>
11161127
<td><p><code>exclude(when_any(a, b, c))</code></p>
11171128
<p>In practice:
11181129
<code>exclude(a) |&gt; exclude(b) |&gt; exclude(c)</code></p></td>
11191130
</tr>
11201131
</tbody>
11211132
</table>
11221133

1123-
Intent vs Missings
1134+
#### Intent vs Missings
11241135

11251136
| Intent | Missings | Outcome | Usefulness |
11261137
|----|----|----|----|
11271138
| Retain | Treat as `FALSE` | Retain rows where you *know* the conditions are `TRUE` | Very. Existing `filter()` behavior. |
11281139
| Exclude | Treat as `FALSE` | Exclude rows where you *know* the conditions are `TRUE` | Very. Simplifies “treat `filter()` as an exclude” cases. |
1129-
| Retain | Treat as `TRUE` | Retain rows where conditions are `TRUE` or `NA` | Unconvinced. Often this is an `exclude()` in disguise. |
1130-
| Exclude | Treat as `TRUE` | Exclude rows where conditions are `TRUE` or `NA` | Unconvinced. |
1140+
| Retain | Treat as `TRUE` | Retain rows where conditions are `TRUE` or `NA` | Not. This is an `exclude()` in disguise. |
1141+
| Exclude | Treat as `TRUE` | Exclude rows where conditions are `TRUE` or `NA` | Not. Never seen an example of this. |
1142+
1143+
#### Connection to vctrs
1144+
1145+
We purposefully don’t expose `missing` directly on the dplyr side. The
1146+
3-valued argument is quite complicated to think about. Instead it
1147+
bubbles up through `retain()` / `exclude()` using `missing = FALSE` and
1148+
`when_all()` / `when_any()`’s `na_rm` argument.
1149+
1150+
Particularly confusing for the average consumer is that
1151+
`when_all(na_rm = TRUE)` maps to `list_pall(missing = TRUE)` but
1152+
`when_any(na_rm = TRUE)` maps to `list_pany(missing = FALSE)`. Exposing
1153+
only `na_rm = TRUE` saves users from having to do these very hard mental
1154+
gymnastics.
1155+
1156+
| vctrs | Data frame | Vector |
1157+
|----|----|----|
1158+
| `list_pall(missing = NULL)` | | `when_all(na_rm = FALSE)` |
1159+
| `list_pall(missing = FALSE)` | `retain()` / `exclude()` | |
1160+
| `list_pall(missing = TRUE)` | | `when_all(na_rm = TRUE)` |
1161+
| `list_pany(missing = NULL)` | | `when_any(na_rm = FALSE)` |
1162+
| `list_pany(missing = FALSE)` | | `when_any(na_rm = TRUE)` |
1163+
| `list_pany(missing = TRUE)` | | |
1164+
1165+
- `list_pall(missing = FALSE)`:
1166+
1167+
- Interesting how this is useful as the `retain()` / `exclude()`
1168+
default behavior but becomes too confusing to try and expose in
1169+
`when_all()` as `missing` vs the simpler `na_rm`. Keeping “the most
1170+
flexible” vector function way in vctrs feels right since the
1171+
`missing = FALSE` case here is less useful in a vector context. It
1172+
doesn’t prevent you from doing `retain(when_all())` because the
1173+
default propagates `NA` and then `retain()` itself does the
1174+
`missing = FALSE` part.
1175+
1176+
- `list_pany(missing = TRUE)`:
1177+
1178+
- Like `list_pall(missing = FALSE)`, this is not the useful variant to
1179+
expose at the vector level. Also happens to not have an exposed data
1180+
frame variant, so dplyr doesn’t expose it at all, which feels fine.
1181+
Not a single example needed it.
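
The "treat missings as `FALSE`" rows of the "Intent vs Missings" table can be illustrated with a small base R sketch. The names `known_true()`, `retain_sketch()`, and `exclude_sketch()` are hypothetical and exist only for this note. They take an already-evaluated logical vector instead of the `...` expressions and `.by` argument of the proposed functions, so this shows the `NA` handling only, not the real interface.

```r
# Hypothetical helper: TRUE only where the condition is known to be TRUE.
# NA is treated as FALSE, matching `missing = FALSE`.
known_true <- function(cond) {
  !is.na(cond) & cond
}

# Keep rows where the condition is known to be TRUE.
retain_sketch <- function(data, cond) {
  data[known_true(cond), , drop = FALSE]
}

# Drop rows where the condition is known to be TRUE; NA rows are kept.
exclude_sketch <- function(data, cond) {
  data[!known_true(cond), , drop = FALSE]
}

df <- data.frame(x = c(1, NA, 3))
cond <- df$x == 1            # TRUE NA FALSE

retain_sketch(df, cond)$x    # 1
exclude_sketch(df, cond)$x   # NA 3
```

Because both sketches share `known_true()`, every row ends up in exactly one of the two results, which is the property that lets an "exclude in disguise" be written directly without losing `NA` rows.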
