authors:
- george
heroImageThumb: blog-thumb.jpg
summary: |
  Epidata v0.4.0 ("v4" for short) launched on September 26, 2022, bringing about a major revision to how we store data served by the [Epidata API](https://cmu-delphi.github.io/delphi-epidata/). The changes prioritize fast access to the most up-to-date data while retaining the deep data revision history needed by researchers. The launch included a prototype of a modular data organization system intended to generalize across multiple pathogens, with stubs for more advanced and efficient timestamping and greater flexibility in data stratification.
output:
  blogdown::html_page:
    toc: true
---
Epidata v0.4.0 ("v4" for short) launched on September 26, 2022, bringing about a major revision to how we store data served by the [Epidata API](https://cmu-delphi.github.io/delphi-epidata/). The changes prioritize fast access to the most up-to-date data while retaining the deep data revision history needed by researchers. The launch included a prototype of a modular data organization system intended to generalize across multiple pathogens, with stubs for more advanced and efficient timestamping and greater flexibility in data stratification.
In this post, we will discuss why these changes were needed, describe the new data architecture, and give a brief summary of how v4 performance compares to the previous version.
Eventually we undertook the challenge of trying to improve our performance and reduce our resource usage – to come up with v4!
We chose to switch from [MariaDB](https://mariadb.org) to [Percona Server for MySQL](https://www.percona.com/software/mysql-database/percona-server) for our database engine and management system. Both are variants of the free open source [MySQL](https://www.mysql.com) database, and both provide additional support and features on top of the base product. Percona offered us access to a more consistently developed drop-in replacement for MySQL, along with expertise and knowledge concentrated in both the open source community and Percona the company. Additionally, Percona Server for MySQL is a reliable target for monitoring with Percona’s [PMM](https://docs.percona.com/percona-monitoring-and-management/).
To reduce our data footprint and speed lookups, we took steps to "[normalize](https://en.wikipedia.org/wiki/Database_normalization)" our database. This involved redefining our schema to move some columns out of the main table and into tables of their own, which we reference with "[foreign key](https://en.wikipedia.org/wiki/Foreign_key)" columns instead. Arranged this way, the long and often-repeated text strings that name our sources and signals, as well as the geographic locations we cover, are each stored only once and referred to by a number everywhere else. The relatively short numbers take up much less space on disk than the text they replace, and the database can locate records by these numbers more easily than it can by text. We employ two of these dimension tables: one for source and signal, and one for geographic type and value. Using dimension tables efficiently requires [JOIN](https://en.wikipedia.org/wiki/Join_(SQL)) operations when retrieving from the database. We avoided adding these slightly-more-complicated JOIN statements throughout our code base by hiding them behind [VIEWs](https://en.wikipedia.org/wiki/View_(SQL)) -- our VIEWs encapsulate the JOIN logic, making the new normalized tables look virtually indistinguishable from the old monolithic table to our web server code.
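
To make this concrete, here is a minimal sketch of what such a layout can look like. The table and column names below are illustrative stand-ins (not our actual schema), written as MySQL-style DDL:

```sql
-- Hypothetical, simplified schema illustrating the normalization described
-- above; names and column types are stand-ins, not the real Epidata DDL.

-- Dimension table: each (source, signal) text pair is stored exactly once.
CREATE TABLE signal_dim (
  signal_key BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  source     VARCHAR(32) NOT NULL,
  `signal`   VARCHAR(64) NOT NULL,   -- `signal` is a reserved word in MySQL
  UNIQUE KEY (source, `signal`)
);

-- Dimension table: each (geo_type, geo_value) pair is stored exactly once.
CREATE TABLE geo_dim (
  geo_key   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  geo_type  VARCHAR(12) NOT NULL,
  geo_value VARCHAR(12) NOT NULL,
  UNIQUE KEY (geo_type, geo_value)
);

-- Fact table: rows refer to the dimensions by compact integer foreign keys.
CREATE TABLE epimetric_full (
  signal_key BIGINT UNSIGNED NOT NULL,  -- references signal_dim.signal_key
  geo_key    BIGINT UNSIGNED NOT NULL,  -- references geo_dim.geo_key
  time_value INT NOT NULL,              -- reference date of the measurement
  issue      INT NOT NULL,              -- date this version was published
  `value`    DOUBLE,
  PRIMARY KEY (signal_key, geo_key, time_value, issue)
);

-- A VIEW hides the JOINs, so server code can keep selecting text columns
-- as if the monolithic table still existed.
CREATE VIEW epimetric_full_v AS
SELECT s.source, s.`signal`, g.geo_type, g.geo_value,
       f.time_value, f.issue, f.`value`
FROM epimetric_full f
JOIN signal_dim s USING (signal_key)
JOIN geo_dim g USING (geo_key);
```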
We also added a "latest" table that holds a copy of just the most recent issue of each of our data points, and removed the "latest issue" flag. Because we no longer have to search our gigantic collection of all issues for data flagged as the latest, query times improved significantly. This is the data that drives our main [website](https://delphi.cmu.edu/covidcast/), so it is important for these operations to be fast for those browsing, and with so many visitors, the savings add up to reduce our load significantly. The "latest table" also simplifies our data management responsibilities compared to the "latest flag": we don't have to clear a flag from an old row (a previously "latest" issue of data) when a new "latest" row is added -- instead we simply overwrite the old row in our latest table, which the database does with great ease.
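
As a sketch of the bookkeeping this buys us, continuing the hypothetical schema above, maintaining such a latest table reduces to a plain upsert. This assumes rows are ingested in issue order, so an incoming row always carries the newest issue:

```sql
-- Hypothetical companion to epimetric_full: one row per data point,
-- holding only the most recent issue.
CREATE TABLE epimetric_latest (
  signal_key BIGINT UNSIGNED NOT NULL,
  geo_key    BIGINT UNSIGNED NOT NULL,
  time_value INT NOT NULL,
  issue      INT NOT NULL,
  `value`    DOUBLE,
  PRIMARY KEY (signal_key, geo_key, time_value)  -- no issue column in the key
);

-- No "latest" flag to clear: a newer issue simply overwrites the old row.
-- Assumes issues arrive in chronological order.
INSERT INTO epimetric_latest (signal_key, geo_key, time_value, issue, `value`)
VALUES (1, 2, 20220926, 20221003, 3.14)
ON DUPLICATE KEY UPDATE
  issue   = VALUES(issue),
  `value` = VALUES(`value`);
```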
Additionally, the reduced size of the foreign keys compared to their text equivalents made it possible to include more [indexes](https://en.wikipedia.org/wiki/Database_index) in our database, letting it quickly find data based on different subsets of criteria. No matter how our users specify their requests, we have a way to look the data up fast.
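
For instance, still in the hypothetical schema above, secondary indexes over the small integer keys can support lookups that start from geography or from time rather than from signal; the index names and column choices here are illustrative:

```sql
-- Because geo_key and signal_key are small integers rather than repeated
-- text, extra secondary indexes like these stay cheap on disk.
CREATE INDEX by_geo  ON epimetric_latest (geo_key, time_value);
CREATE INDEX by_time ON epimetric_latest (time_value, signal_key);
```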
![v4 table design](https://github.com/cmu-delphi/www-main/blob/krivard/blog-tests-more/content/blog/images/v4-table-design.png)
All performance figures were collected in an isolated QA environment.
Overall, the v4 system improves Epidata performance for adding new data and substantially improves performance for querying the most up-to-date figures. The disk footprint is unchanged, although we suspect we could reduce it by removing an index or two without much impact on query performance. Queries that must touch the data revision history are slower, but not by much; performance should still be adequate for most research use cases (*please do get in touch if this is not the case for you*).
Metric | v4 performance relative to v3
--- | ---
Disk footprint | same
Adding new data | 30% faster
Querying latest data, i.e., the most recent issue of any reference date | 96% faster
Querying historical data, e.g., for as-of, issue, or lag queries | 120% slower
The largest factor in the query time improvements for latest data is a reduction in the frequency and severity of long-running queries. In v3, over seven hundred queries required over ten seconds to compute, and the longest-running query took twelve minutes; in v4 only three queries needed more than ten seconds and no query took more than twenty-one seconds.
For historical data, long-running queries were more frequent in v4 than in v3, but not more severe. In v3, only eight queries required more than one second to compute, and the longest-running query took seventeen seconds; in v4, over two hundred queries needed more than one second, but the maximum query time was sixteen seconds.
The empirical cumulative distribution functions for query times in the latest and historical datasets show both of these effects in greater detail.
0 commit comments