Production for Backfill Correction #1700
We could also publish the output to an AWS S3 bucket.
To start the conversation off on this, my understanding was that we'd train the models once a month on the first of the month at night (when we normally schedule longer-running processes).
We need to establish structured relationships between different signals, e.g. those related by JIT work. Backfill projection signals should be part of that.
Proposed AWS S3 file organization: Unique file per
Then a single-region timeseries is a single file fetch. Storage for this is going to get big pretty quickly, since we're saving the full correction history. To save space, consider:
Katie, can you expand on this? Are you talking about diffing the previous day's and the current day's data to find changes only? The compression and limiting precision points make sense.
The structure we are designing here for storing backfill projections is pretty much the same as that needed for storing forecasts, nowcasts, and backcasts. In fact, backfill projection is essentially a backcast of a particular indicator. So I'd like to bring @ryantibs, @brookslogan, and potentially other forecast-related folks into this discussion. I think the relevant dimensions are:
To put these in S3 files, I agree it makes sense to separate the data at least by:
and possibly also by:
This leaves a 2D table of {reference_date X as_of_datetime}. Since these data will be produced every as_of day, it makes sense to add as_of to the file name/identifier, and make the file consist of ~60 values for lags from 0 or 1 up to ~60, corresponding to reverse-successive reference dates. Since we are going to have all ~60 values, we don't need to store the actual lag (and we can always represent a missing value with an extra separator). Alternatively, we could store all the quantiles in the same file, but that can get messy if different quantile sets are produced at different times or for different indicators. @ryantibs, @brookslogan: how are quantile forecasts stored by us? By the Forecast Hub?
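To make that concrete, here is a minimal sketch of what the key scheme and per-file layout could look like under this proposal, assuming one object per {indicator, geo, as_of}; all bucket, prefix, and signal names below are placeholders rather than a settled convention:

```python
from datetime import date
import csv
import io

# Sketch only: one S3 object per {indicator, geo, as_of}, holding the ~60
# corrected values for lags 1..60 (reverse-successive reference dates).
# Bucket/prefix/signal names are placeholders.

def s3_key(indicator: str, geo_type: str, geo_value: str, as_of: date) -> str:
    # Everything constant for the whole file lives in the key, not in the rows.
    return (f"backfill_corrections/{indicator}/{geo_type}/{geo_value}/"
            f"as_of={as_of.isoformat()}.csv")

def serialize(corrected_values) -> bytes:
    # One value per line; the lag is implied by position and reference_date by
    # as_of - lag, so neither is stored. A missing value is an empty field.
    # Rounding limits precision, one of the space-saving ideas above.
    buf = io.StringIO()
    writer = csv.writer(buf)
    for value in corrected_values:
        writer.writerow(["" if value is None else round(value, 4)])
    return buf.getvalue().encode()

key = s3_key("chng-cli", "county", "42003", date(2022, 10, 6))
body = serialize([1.23456] * 60)
# The upload itself would then be something like:
# boto3.client("s3").put_object(Bucket="<bucket>", Key=key, Body=body)
```

Under this layout a single {region, as_of} snapshot is exactly one GET, while a full correction history for one region is one GET per as_of, which is the multi-file concern raised in the next comment.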
Does that mean we're okay with users needing to pull multiple files in order to build a time series covering the whole pandemic?
I do not mean diffing; that would be a last resort, since it would require much more complicated data-fetching capabilities than are easy to build on top of S3. It's possible, and we'd have help setting it up (it was recommended by one of the Amazon data teams), but it would take substantial effort. I mean pulling any data that's the same for all rows of the file out into the filename or some kind of header, and pulling any categorical data with long human-readable names into an index file and referring to it by a numeric id instead.
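A rough illustration of that normalization idea, with made-up names and values: constants move into the key or header, and long categorical labels are replaced by small integer ids resolved through a separate index object:

```python
import json

# Hypothetical index object, uploaded once and rarely changed: it maps small
# integer ids to long human-readable labels so data files only store the ids.
geo_index = {1: "Allegheny County, PA", 2: "Bexar County, TX"}
index_body = json.dumps(geo_index)
# e.g. stored once at s3://<bucket>/backfill_corrections/geo_index.json

# Data rows then carry only the id and the value; anything constant for the
# whole file (indicator, geo_type, as_of, ...) lives in the filename or header.
rows = [(1, 0.0421), (2, 0.0387)]
data_body = "\n".join(f"{geo_id},{value}" for geo_id, value in rows)
```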
This will create additional operational costs when source (covariate) data patches are applied, and the operational cost of data patches is already high. I'll follow up with you offline.
Trying to give some quick answers and think more about this later. We currently store forecasts in a couple of formats, with no real care for saving space:
The Hub, at least in their GitHub, stores:
Possibly. Backfill corrections don't matter beyond a 60-day lag, and for any reference_date older than 60 days they are not needed except for retrospective error analysis and training forecasting models. More generally, here are the use cases I can think of for backfill-projected signals:
So (1) is 1 file, and (3) and (4) can afford to take more time. My remaining concern is the map (2). To solve that, we could:
We store contingency tables for CTIS as static files, but that's the only one I'm aware of. @nmdefries can comment on their format and makeup. |
Upon reflection, I am partial to (B) or (C) above. Since it's only one of the O(10) quantiles, storing it a second time will increase space by only O(10%). Alternatively, we can ignore this problem until we see actual demand for such maps. It's good enough that we have a solution ready. |
Does "per forecast" mean per computation_time/as-of_time, but for all regions and forecasting targets?
Does "all forecasts" mean forecasts for all regions, and/or all targets? Or also all forecasting_times? |
Yes, "per forecast" means per model & forecast_date/as_of, but containing data for all regions, targets, & quantiles. We'd probably also make this per geo_type and time_type if we were really dealing with multiple of those. (We do calculate national from state, but as a fixed post-processing step; we don't do national-level analysis.) "all forecasts" is everything: all models, forecast_dates, regions, targets, & quantiles. (Again, if we had multiple geo&time types, it'd probably be one file per type combination.) But we've tried so many models that this is slow and uses up too much RAM; currently, we get by by loading only a subset of models of interest. Our primary/sole use case above --- pseudoprospective & prospective forecast evaluation --- looks like use case 3. And while we haven't really done this with covid & influenza hospitalization forecasting, we may also match use case 1 for post hoc investigation & debugging of bad forecasts or forecasts near data anomalies. For 2: what |
Thanks @brookslogan. I agree that for evaluation you would use case (3), and maybe also case (1), of the forecasts rather than of the backfilled covariates.
Good point. I wasn't thinking clearly. I was indeed thinking about real-time PH users. They would typically want the most up-to-date estimate for today, and possibly to scroll back to the most up-to-date estimates for any past reference_date. That means the
@krivard Do we know the space/time tradeoffs of large files vs. small files in S3? E.g. a linear time/space complexity model of file size?
Each CTIS contingency table contains all signals of interest, each as an additional column, for all geo values (Texas, California, etc.) of a given geo type (e.g. state) for a particular time period. So we have one file for each time period + geo level. The contingency tables aren't versioned, so if we have to regenerate data we overwrite the old file. Having each file contain multiple value columns is pretty inconvenient: if you need to regenerate a single signal, or backfill a new signal that you want the history for, you spend a lot of time computing data you already know. It's also slower to fetch data if you only want to process a single signal, e.g. for plotting.
In terms of download speed? Not precisely, but based on general principles I'd expect there to be some amount of per-file overhead. If you want to know for sure and are willing to wait for results, we can run an experiment. What factors are you thinking of?
I was thinking of files either consisting of 60 values (as per my proposal above), or else lumping together all regions in a geo-level (so 60 x ~50 for U.S. states, and 60 x ~3000 for counties). Use case (1) typically needs only a single region. If you have a clear sense of which is better, or another solution you prefer, I am fine just running with it. Estimating the per-file and per-byte access cost might be generally useful beyond this question. |
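If we do run that experiment, a simple model to fit is per-request overhead plus per-byte transfer time, i.e. total time ≈ a * n_files + b * total_bytes. A minimal sketch of the measurement, assuming boto3 with credentials configured; the bucket, region, and key pattern are placeholders:

```python
import time
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # placeholder region
BUCKET = "<bucket>"  # placeholder bucket name

def time_fetch(keys):
    """Fetch a list of objects; return (elapsed seconds, total bytes read)."""
    start = time.perf_counter()
    total_bytes = 0
    for key in keys:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        total_bytes += len(obj["Body"].read())
    return time.perf_counter() - start, total_bytes

# Compare e.g. ~50 small per-state files against one lumped per-geo-level file
# for the same as_of; repeating over a few sizes and fitting
# time ~= a * n_files + b * bytes separates per-file overhead (a) from
# per-byte transfer cost (b). Keys below follow the placeholder scheme above.
small_keys = [
    f"backfill_corrections/chng-cli/state/{geo}/as_of=2022-10-06.csv"
    for geo in ("pa", "tx", "ca")
]
# t_small, n_small = time_fetch(small_keys)
# t_big, n_big = time_fetch(["backfill_corrections/chng-cli/state/as_of=2022-10-06.csv"])
```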
I think we are aiming for a general approach for signals that have significant backfill, but only those where the backfill is dense enough that the statistical model Jingjing developed is reasonable. @brookslogan Can you please give concrete examples of the kind of signals, or the kind of hypothetical backfill, that you are concerned about?
I think I just misread "for signals" etc. to mean for the signals themselves rather than the backcasts. I think the proposed approach is fine for any sort of backcast, nowcast, or forecast that outputs a manageable set of behinds/aheads. What I was concerned about is taking this approach and applying it also to raw signal archiving, where changes don't necessarily occur in a manageable set of behinds/aheads; e.g., we don't have a hard guarantee that ILI, JHU-CSSE case count reporting, or the CHNG raw data for some day two years ago won't be revised tomorrow, so there might be some extra trouble when thinking about raw signal archiving. But this might be off topic. |
We already have most of the work done in covid-19 (the private repo). The main goal of this project is to provide backfill correction for the values that are reported every day.

Work left
- Move the code from the covid-19 repo to covidcast-indicators @nmdefries
- Run the correction for chng, quidel, etc. and load to midas @jingjtang

2022-10-06 - Engineering meeting notes