Description
There are several significant problems with the EpiData Database & API:
- There is a tremendous amount of signal duplication. For example, Cases from the same source are stored:
  - new vs. cumulative
  - counts vs. ratios (normalized by population)
  - daily vs. 7-day average
  - (also raw vs. smoothed?)
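Since each of these choices is stored independently, this multiplies storage by 8 (2 x 2 x 2), or by 16 if raw vs. smoothed is also stored separately, for Cases from EACH source (JHU, USAFacts, hybrid?).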
Same for deaths.
Same for covid tests of all types.
For some of the other signals, the multiplier is 4 or 2.
As we struggle with the growth of our DB (both more signals and longer time period), we can't afford this waste.
- The pre-processing we do (e.g. smoothing, averaging, and converting cumulative counts into 'new' counts) represents, in each case, just one of several reasonable choices. For example, some users may want 14-day averaging for some signals. Some may want to allocate "bumps" in cumulative counts differently than we have, e.g. by distributing them uniformly (or proportionately) over some past period, or by eliminating negative adjustments.
- As we add sources and signals, our naming sometimes needs to evolve to remain clear and accurate. Right now we are stuck indefinitely with naming decisions we made long ago. We need a more flexible way to evolve while remaining backward compatible. We need to have a process for gradually deprecating old names.
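To make the first two points concrete, here is a minimal Python sketch of how several of the stored variants could be derived on request from a single stored cumulative series. The function names, the per-100k scaling for "ratios", and the zero-clamping policy for negative adjustments are all assumptions for illustration, not what the current pipeline does.

```python
from typing import List, Optional


def cumulative_to_new(cumulative: List[float], clamp_negative: bool = False) -> List[float]:
    """Day-over-day differences of a cumulative series.

    The result starts on the second day, so it is one element shorter
    than the input. With clamp_negative=True, negative adjustments are
    replaced by zero (one of several possible policies).
    """
    new = []
    for prev, cur in zip(cumulative, cumulative[1:]):
        diff = cur - prev
        new.append(max(diff, 0) if clamp_negative else diff)
    return new


def trailing_average(values: List[float], window: int) -> List[Optional[float]]:
    """Trailing mean over `window` days; None until a full window exists."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def per_100k(values: List[Optional[float]], population: int) -> List[Optional[float]]:
    """Normalize by population (per 100,000 people is an assumed convention)."""
    return [None if v is None else v * 100_000 / population for v in values]


# Hypothetical cumulative counts with one downward revision on day 4.
cumulative = [10, 12, 15, 14, 20]
new_cases = cumulative_to_new(cumulative, clamp_negative=True)
avg_7day = trailing_average(new_cases, window=7)    # what we store today
avg_14day = trailing_average(new_cases, window=14)  # a variant a user might want
```

With helpers like these, the window length and the handling of negative adjustments become request parameters rather than separately stored signals.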
I think we can solve all of the above by introducing a layer of indirection at the highest level of the API calls. A new API call will have additional parameters, e.g.:
Source=JHU, Signal=Cases, Mode=Cumulative, Window=7day, Smoothing="XYZ".
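As a rough sketch of what such a parameterized request might look like on the server side, the parameters above could be parsed into a small immutable object. The class name `SignalQuery`, the function `parse_query`, the default values, and the "Nday" window format are hypothetical; only the parameter names come from the example above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SignalQuery:
    """One request in the proposed parameterized form (hypothetical)."""
    source: str
    signal: str
    mode: str = "new"                # "new" or "cumulative"
    window: int = 1                  # averaging window in days; 1 means daily
    smoothing: Optional[str] = None  # name of a smoothing method, if any

    def __post_init__(self) -> None:
        if self.mode not in ("new", "cumulative"):
            raise ValueError(f"unknown Mode: {self.mode!r}")
        if self.window < 1:
            raise ValueError("Window must be at least 1 day")


def parse_query(params: Mapping[str, str]) -> SignalQuery:
    """Build a SignalQuery from request parameters such as Window=7day."""
    window_text = params.get("Window", "1day")
    if window_text.endswith("day"):
        window_text = window_text[: -len("day")]
    return SignalQuery(
        source=params["Source"],
        signal=params["Signal"],
        mode=params.get("Mode", "new").lower(),
        window=int(window_text),
        smoothing=params.get("Smoothing"),
    )


query = parse_query({
    "Source": "JHU", "Signal": "Cases", "Mode": "Cumulative",
    "Window": "7day", "Smoothing": "XYZ",
})
```

Because the dataclass is frozen, instances are hashable and could double as keys for a result cache.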
This will give us the flexibility to:
(1) support a larger (unlimited) range of preprocessing options, which can be easily extended.
(2) continue to allow some popular combinations to be pre-computed and stored in the DB as 'pre-compiled' signals, while also allowing other combinations to be created on the fly. We can also cache results for commonly asked queries, which sits somewhere between "on the fly" and a permanent signal in the DB. The decision among the three options (pre-compiled, cached, or on-the-fly) can be made either manually or automatically based on frequency of use.
(3) enable naming evolution by using this layer to support backward compatibility, with some feedback about deprecation. The current API calls would still work but would be translated into the new calls, and old signal names would be mapped to new signal names for as long as we want, after which they would return a "please switch to the new name" error message.
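Points (2) and (3) could be sketched in Python roughly as follows. Everything here is hypothetical: the legacy names, the `Resolver` class, the request-count threshold for caching, and the tuple form of a query are assumptions chosen to keep the example short, not existing EpiData code.

```python
import warnings
from collections import Counter
from typing import Callable, Dict, List, Tuple

Query = Tuple[str, str, str, int]  # (source, signal, mode, window_days)

# Hypothetical legacy names -> (new-style query, retired?)
LEGACY_NAMES: Dict[str, Tuple[Query, bool]] = {
    "cases_7day_new": (("JHU", "Cases", "new", 7), False),
    "cases_old_name": (("JHU", "Cases", "cumulative", 1), True),
}


class RetiredNameError(LookupError):
    """Raised once a legacy name is past its deprecation period."""


def translate_legacy(name: str) -> Query:
    """Map an old signal name to a new-style query, with deprecation feedback."""
    query, retired = LEGACY_NAMES[name]
    if retired:
        raise RetiredNameError(
            f"{name!r} is retired; please switch to the new name: {query}")
    warnings.warn(f"{name!r} is deprecated; use {query} instead",
                  DeprecationWarning, stacklevel=2)
    return query


class Resolver:
    """Serve a query from the pre-compiled store, the cache, or on the fly."""

    def __init__(self, precompiled: Dict[Query, List[float]],
                 compute: Callable[[Query], List[float]],
                 cache_after: int = 3) -> None:
        self.precompiled = precompiled   # stands in for signals stored in the DB
        self.compute = compute           # derives a series from stored base data
        self.cache: Dict[Query, List[float]] = {}
        self.requests: Counter = Counter()
        self.cache_after = cache_after   # promote to cache after this many requests

    def resolve(self, query: Query) -> Tuple[str, List[float]]:
        if query in self.precompiled:
            return "pre-compiled", self.precompiled[query]
        if query in self.cache:
            return "cached", self.cache[query]
        self.requests[query] += 1
        result = self.compute(query)
        if self.requests[query] >= self.cache_after:
            self.cache[query] = result   # automatic promotion by frequency of use
        return "on-the-fly", result
```

A legacy call would first pass through `translate_legacy` and then through `Resolver.resolve`, so old clients keep working while the storage tier behind each query can change without affecting them.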