Description
Problem
Our covidcast weekly signals pass through this acquisition code, which detects an epiweek time format (YYYYWW) in the received file name and records the issue at the same epiweek resolution. As far as I understand, if the source provides two or more issues in the same epiweek, we keep only the latest one. This is a problem for the forecasting team because it makes accurate backtesting impossible: if a source (like NHSN) publishes an update after the forecast date but within the same epiweek, our database will show data that wasn't available at forecast time.
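To make the overwrite concrete, here is a minimal Python sketch of the behavior as I understand it. It is not the actual acquisition code: `epiweek`, `ingest`, and the dict standing in for the database are hypothetical names, the values are made up, and the epiweek helper is a stdlib reimplementation of the MMWR week rule rather than the library the pipeline uses.

```python
from datetime import date, timedelta


def epiweek(d):
    """Return the MMWR epiweek of date `d` as an integer YYYYWW.

    MMWR weeks run Sunday through Saturday, and week 1 is the week
    that contains January 4.
    """
    for year in (d.year + 1, d.year, d.year - 1):
        jan4 = date(year, 1, 4)
        # isoweekday() is Mon=1..Sun=7, so `% 7` gives days since Sunday.
        week1_start = jan4 - timedelta(days=jan4.isoweekday() % 7)
        if d >= week1_start:
            return year * 100 + (d - week1_start).days // 7 + 1


def ingest(db, time_value, received_on, value):
    """Hypothetical stand-in for acquisition: the issue is the receipt epiweek."""
    db[(time_value, epiweek(received_on))] = value


db = {}
ingest(db, 202450, date(2024, 12, 18), 100)  # Wednesday update (made-up value)
ingest(db, 202450, date(2024, 12, 20), 120)  # Friday revision, same epiweek

# Both updates map to issue 202451, so only the Friday value survives.
print(db)
```

A backtest for the 2024-12-18 forecast date would read the Friday value from this store, even though only the Wednesday value existed at forecast time.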
Here is a plot showing NHSN update times on the x-axis and the epiweek that time would be assigned to on the y-axis. The red-dashed line is the forecast date. Points to the right of the forecast date but in the same epiweek will be in our db's historical record for that week, but weren't available at forecast time.
cc @dsweber2 @brookslogan @aysim319 @melange396
Plot generated with:
library(aws.s3)
library(lubridate)
library(MMWRweek)
library(tidyverse)

# Bucket file names look like
#   nhsn_data_raw_2024-12-18_11-01-08.124565_prelim.parquet
# where the timestamp is in UTC. ymd_hms() assumes UTC by default; we use
# with_tz() below to convert to Pacific time (which is more correct for
# determining day boundaries).
get_version_timestamp <- function(filename) {
  ymd_hms(str_extract(filename, "[0-9]{4}-..-.._..-..-..\\.[^._]*"))
}

get_epiweek_from_timestamp <- function(timestamp) {
  # Pass tz explicitly so the calendar day is taken in the timestamp's own time
  # zone; depending on the R version, as.Date() on a POSIXct may default to UTC.
  week <- MMWRweek::MMWRweek(as.Date(timestamp, tz = tz(timestamp)))
  paste0(week$MMWRyear, "-", str_pad(week$MMWRweek, 2, "left", "0"))
}

# Requires credentials to the forecasting-team-data bucket
update_times <- aws.s3::get_bucket_df(prefix = "nhsn_data_raw", bucket = "forecasting-team-data") %>%
  pull(Key) %>%
  get_version_timestamp() %>%
  with_tz(tzone = "America/Los_Angeles")
epiweeks <- update_times %>% get_epiweek_from_timestamp()

# These were the actual forecast dates this season (accounting for holidays and
# other delays)
forecast_dates <- c(
  as.Date(c("2024-11-22", "2024-11-27", "2024-12-04", "2024-12-11", "2024-12-18", "2024-12-26", "2025-01-02")),
  seq.Date(as.Date("2025-01-08"), Sys.Date(), by = 7L)
)

ggplot(data.frame(update_times, epiweeks), aes(x = as.Date(update_times, tz = tz(update_times)), y = epiweeks)) +
  geom_point() +
  geom_vline(xintercept = forecast_dates, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(x = "Update at", y = "Epiweek")
Possible Solutions
- Change our NHSN Cronicle schedule to not run after Wednesday. Last I heard, our Cronicle update schedule is "wednesday/friday @ 12:30pm [est]", so currently Thursday and Friday updates are overwriting the Wednesday updates. This is a simple solution that requires no code changes, but it's fragile and manual: forecast dates tend to get delayed by holidays and data outages, so we would need to be on call to modify the Cronicle schedule as needed. This is possibly the right choice for this season, which is winding down, but it would be a burden long-term.
- Since issue defaults to
  issue=(date.today(), epi.Week.fromdate(date.today()))
  in the code, just use date.today() in this case.
  - Sounds simple, but other parts of the code might unexpectedly depend on time_value and issue being the same format
  - Also, it's unclear how we could carve out an acquisition logic exception for NHSN
- Store both time_value and issue (for NHSN) as dates and use documentation to clarify that each value represents a weekly sum (the way we already do with 7dav signals)
  - Avoids the possible format-consistency issues of the date.today() approach
  - We have the raw data files for the whole season stored in an S3 bucket, so replaying them through an updated acquisition pipeline is possible
  - Requires NHSN indicator code changes
  - Legacy weekly signals have the same problem, and it's out of scope to fix those
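
As a rough illustration of what date-resolution issues would buy us for backtesting, here is a minimal Python sketch. Everything in it is hypothetical: the `versions` dict is a stand-in for the database, `as_of` is not an existing function, the values are made up, and I'm assuming for illustration that a week's time_value is stored as a single date.

```python
from datetime import date

# Hypothetical version store keyed by (time_value, issue), both as dates.
# With daily issues, two updates in the same epiweek are kept side by side.
versions = {
    (date(2024, 12, 14), date(2024, 12, 18)): 100,  # Wednesday update (made-up value)
    (date(2024, 12, 14), date(2024, 12, 20)): 120,  # Friday revision (made-up value)
}


def as_of(versions, time_value, forecast_date):
    """Return the latest value for `time_value` issued on or before `forecast_date`."""
    candidates = {
        issue: value
        for (tv, issue), value in versions.items()
        if tv == time_value and issue <= forecast_date
    }
    if not candidates:
        return None  # nothing had been published yet
    return candidates[max(candidates)]


# What the forecaster could actually see on the Wednesday forecast date.
print(as_of(versions, date(2024, 12, 14), date(2024, 12, 18)))
```

With epiweek-resolution issues, the two entries above would share one key and this lookup could not tell them apart.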