Commit 02d2533
Auto merge of #1800 - smarnach:dump-db, r=carols10cents
Prototype: Public database dumps

This is an unfinished prototype implementation of the design I proposed to implement #630 (see #630 (comment) and #630 (comment)). I am submitting this for review to gather some feedback on the basic approach before spending more time on it.

This PR adds a background task to create a database dump. The task can be triggered with the `enqueue-job` binary, so it is easy to schedule in production using Heroku Scheduler.

### Testing instructions

To create a dump:

1. Start the background worker:

   ```
   cargo run --bin background-worker
   ```

2. Trigger a database dump:

   ```
   cargo run --bin enqueue-job dump_db
   ```

The resulting tarball can be found in `./local_uploads/db-dump.tar.gz`.

To re-import the dump:

1. Unpack the tarball:

   ```
   tar xzf local_uploads/db-dump.tar.gz
   ```

2. Create a new database:

   ```
   createdb test_import_dump
   ```

3. Run the Diesel migrations for the new DB:

   ```
   diesel migration run --database-url=postgres:///test_import_dump
   ```

4. Import the dump:

   ```
   cd DUMP_DIRECTORY
   psql test_import_dump < import.sql
   ```

(Depending on your local PostgreSQL setup, in particular the permissions for your user account, you may need different commands and URIs than the ones given above.)

### Author's notes

* The background task executes `psql` in a subprocess to actually create the dump. One reason for this approach is its simplicity: the `\copy` convenience command issues a suitable `COPY TO STDOUT` SQL command and streams the result directly to a local file. Another reason is that I couldn't figure out how to do this at all in Rust with a Diesel `PgConnection`; there doesn't seem to be a way to run raw SQL with full access to the result. (A rough sketch of the subprocess approach follows the checklist below.)
* The unit test that verifies that the column visibility information in `dump_db.toml` is up to date compares the information in that file to the current schema of the test database. Diesel does not provide any schema reflection functionality, so we query the actual database instead (also sketched below). This test may spuriously fail or succeed locally if you still have migrations from unmerged branches applied to your test database. On Travis this shouldn't be a problem, since I believe we always start with a fresh database there. (My preferred solution would be for Diesel to provide some way to introspect the information in `schema.rs`.)

### Remaining work

* [x] Address the TODOs in the source code. The most significant one is to update the `Uploader` interface to accept streamed data instead of a `Vec<u8>`; currently the whole database dump needs to be loaded into memory at once.
* ~~Record the URLs of uploaded tarballs in the database, and provide an API endpoint to download them.~~ Decided to only store the latest dump at a known URL.
* [x] Devise a scheme for cleaning up old dumps from S3. The easiest option is to only keep the latest dump.
* [x] Somewhere in the tar file, note the date and time the dump was generated.
* [x] Verify that `dump-db.toml` is correct, i.e. that we don't leak any data we don't want to leak. Done via manual inspection. ~~One idea is to reconstruct dumps from the information available via the API and compare them to a test dump in the staging environment. This way we could verify what additional information would be made public.~~
* [x] The code needs some form of integration test. Idea from #1629: export some data, then try to re-import it into a clean database.
* [x] Implement and document a way of re-importing the dumps to the database, e.g. to allow local testing of crates.io with realistic data.
* [x] Rebase and remove the commits containing the first implementation.
* [x] Document the existence of this dump, how often it is regenerated, and that only the most recent dump is available (maybe in the crawler policy / crawler-blocked error message?).
* [x] Include the commit hash of the crates.io version that created the dump in the tarball.
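To make the subprocess approach concrete, here is a minimal sketch of driving `psql`'s `\copy` from Rust. This is not the task's actual code: the function name, the single-table scope, and the output path are illustrative assumptions.

```rust
use std::io::Write;
use std::process::{Command, Stdio};

// Hypothetical helper, not the commit's code: stream one table to a local CSV
// file by piping a `\copy` meta-command into psql. `\copy` issues
// `COPY ... TO STDOUT` on the server and writes the result client-side.
fn copy_table_to_csv(database_url: &str, table: &str) -> std::io::Result<()> {
    let mut psql = Command::new("psql")
        .arg(database_url)
        .stdin(Stdio::piped())
        .spawn()?;
    let script = format!("\\copy \"{0}\" TO '{0}.csv' WITH CSV HEADER\n", table);
    psql.stdin
        .as_mut()
        .expect("stdin was configured as piped above")
        .write_all(script.as_bytes())?;
    let status = psql.wait()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("psql exited with {}", status),
        ));
    }
    Ok(())
}
```

Similarly, since Diesel has no reflection API, the visibility test presumably compares `dump_db.toml` against PostgreSQL's own catalog. Below is a sketch of such a query with Diesel 1.x; the struct and function are assumptions, only the `information_schema.columns` view is standard PostgreSQL, and the derive requires `#[macro_use] extern crate diesel;` at the crate root.

```rust
use diesel::prelude::*;
use diesel::sql_types::Text;
use diesel::PgConnection;

// Assumed shape of the introspection query; not the commit's actual test code.
#[derive(QueryableByName)]
struct Column {
    #[sql_type = "Text"]
    table_name: String,
    #[sql_type = "Text"]
    column_name: String,
}

// Lists every column in the public schema so a test can compare the result
// against the visibility entries in `dump_db.toml`.
fn live_columns(conn: &PgConnection) -> QueryResult<Vec<Column>> {
    diesel::sql_query(
        "SELECT table_name, column_name \
         FROM information_schema.columns \
         WHERE table_schema = 'public' \
         ORDER BY table_name, ordinal_position",
    )
    .load(conn)
}
```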
2 parents 028e033 + 01a4e98 commit 02d2533

File tree

17 files changed (+920 -24 lines)

Cargo.lock (+57)

Some generated files are not rendered by default.

Cargo.toml (+2)

```diff
@@ -83,12 +83,14 @@ tokio = "0.1"
 hyper = "0.12"
 ctrlc = { version = "3.0", features = ["termination"] }
 indexmap = "1.0.2"
+handlebars = "2.0.1"

 [dev-dependencies]
 conduit-test = "0.8"
 hyper-tls = "0.3"
 lazy_static = "1.0"
 tokio-core = "0.1"
+diesel_migrations = { version = "1.3.0", features = ["postgres"] }

 [build-dependencies]
 dotenv = "0.11"
```

app/router.js (+1)

```diff
@@ -46,6 +46,7 @@ Router.map(function() {
   this.route('category-slugs', { path: 'category_slugs' });
   this.route('team', { path: '/teams/:team_id' });
   this.route('policies');
+  this.route('data-access');
   this.route('confirm', { path: '/confirm/:email_token' });

   this.route('catch-all', { path: '*path' });
```

app/templates/data-access.hbs (+34)

```diff
@@ -0,0 +1,34 @@
+<div id='crates-heading'>
+  {{svg-jar 'circle-with-i'}}
+  <h1>Accessing the Crates.io Data</h1>
+</div>
+
+<p>
+  There are several ways of accessing the Crates.io data. You should try the
+  options in the order listed.
+</p>
+
+<ol>
+  <li>
+    <b>
+      The <a href='https://github.com/rust-lang/crates.io-index'>crates.io index</a>.
+    </b>
+    This git repository is updated by crates.io, and it is used
+    by Cargo to speed up local dependency resolution. It contains the majority
+    of the data exposed by crates.io and is cheap to clone and get updates.
+  </li>
+  <li>
+    <b>The database dumps (experimental).</b> The dump contains all information
+    exposed by the API in a single download. It is updated every 24 hours.
+    The latest dump is available at the address
+    <a href='https://static.crates.io/db-dump.tar.gz'>https://static.crates.io/db-dump.tar.gz</a>.
+    Information on using the dump is contained in the tarball.
+  </li>
+  <li>
+    <b>Crawl the crates.io API.</b> This should be used as a last resort, and
+    doing so is subject to our {{#link-to 'policies'}}crawling policy{{/link-to}}.
+    If the index and the database dumps do not satisfy your needs, we're happy to
+    discuss solutions that don't require you to crawl the registry.
+    You can email us at <a href="mailto:[email protected]">[email protected]</a>.
+  </li>
+</ol>
```

app/templates/policies.hbs (+2 -9)

```diff
@@ -112,15 +112,8 @@
 <h2 id='crawlers'><a href='#crawlers'>Crawlers</a></h2>

 <p>
-  Before resorting to crawling crates.io, you should first see if you are able to
-  gather the information you need from the
-  <a href='https://github.com/rust-lang/crates.io-index'>crates.io index</a>,
-  which is a public git repository containing the majority
-  of the information availble through our API.
-
-  If the index does not have the information you need, we're also happy to
-  discuss solutions to your needs that don't require you to crawl the registry.
-  You can email us at <a href="mailto:[email protected]">[email protected]</a>.
+  Before resorting to crawling crates.io, please read
+  {{#link-to 'data-access'}}Accessing the Crates.io Data{{/link-to}}.
 </p>

 <p>
```

migrations/2017-10-08-193512_category_trees/up.sql (+1 -2)

```diff
@@ -1,5 +1,4 @@
--- Your SQL goes here
-CREATE EXTENSION ltree;
+CREATE EXTENSION IF NOT EXISTS ltree;

 -- Create the new column which will represent our category tree.
 -- Fill it with values from `slug` column and then set to non-null
```

(second migration file; path missing) (+1 -1)

```diff
@@ -1,2 +1,2 @@
-CREATE EXTENSION pg_trgm;
+CREATE EXTENSION IF NOT EXISTS pg_trgm;
 CREATE INDEX index_crates_name_tgrm ON crates USING gin (canon_crate_name(name) gin_trgm_ops);
```

src/bin/enqueue-job.rs (+24 -12)

```diff
@@ -1,17 +1,29 @@
-use cargo_registry::util::{CargoError, CargoResult};
-use cargo_registry::{db, tasks};
-use std::env::args;
-use swirl::Job;
+use cargo_registry::util::{human, CargoError, CargoResult};
+use cargo_registry::{db, env, tasks};
+use diesel::PgConnection;

 fn main() -> CargoResult<()> {
     let conn = db::connect_now()?;
+    let mut args = std::env::args().skip(1);
+    match &*args.next().unwrap_or_default() {
+        "update_downloads" => tasks::update_downloads().enqueue(&conn),
+        "dump_db" => {
+            let database_url = args.next().unwrap_or_else(|| env("DATABASE_URL"));
+            let target_name = args
+                .next()
+                .unwrap_or_else(|| String::from("db-dump.tar.gz"));
+            tasks::dump_db(database_url, target_name).enqueue(&conn)
+        }
+        other => Err(human(&format!("Unrecognized job type `{}`", other))),
+    }
+}

-    match &*args().nth(1).unwrap_or_default() {
-        "update_downloads" => tasks::update_downloads()
-            .enqueue(&conn)
-            .map_err(|e| CargoError::from_std_error(e))?,
-        other => panic!("Unrecognized job type `{}`", other),
-    };
-
-    Ok(())
+/// Helper to map the `PerformError` returned by `swirl::Job::enqueue()` to a
+/// `CargoError`. Can be removed once `map_err()` isn't needed any more.
+trait Enqueue: swirl::Job {
+    fn enqueue(self, conn: &PgConnection) -> CargoResult<()> {
+        <Self as swirl::Job>::enqueue(self, conn).map_err(|e| CargoError::from_std_error(e))
+    }
 }
+
+impl<J: swirl::Job> Enqueue for J {}
```
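One note on the design: because `use swirl::Job` was removed from the imports, method resolution on `tasks::update_downloads().enqueue(&conn)` can only find the `Enqueue` extension trait, so every call goes through the error-converting wrapper. The stand-alone sketch below (all names invented for illustration) shows the same extension-trait-with-blanket-impl shape:

```rust
// Illustrative only: a trait with a blanket impl adds a convenience method to
// every implementor of a base trait, mirroring the `Enqueue` helper above.
trait Job {
    fn run(&self) -> Result<(), String>;
}

trait JobExt: Job {
    // Wrapper that adapts the error type, as `Enqueue::enqueue` does.
    fn run_or_report(&self) -> Result<(), Box<dyn std::error::Error>> {
        self.run().map_err(|msg| msg.into())
    }
}

// The blanket impl makes the helper available on every `Job`.
impl<J: Job> JobExt for J {}

struct UpdateDownloads;

impl Job for UpdateDownloads {
    fn run(&self) -> Result<(), String> {
        Ok(())
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    UpdateDownloads.run_or_report()
}
```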

src/tasks.rs (+2)

```diff
@@ -1,3 +1,5 @@
+pub mod dump_db;
 mod update_downloads;

+pub use dump_db::dump_db;
 pub use update_downloads::update_downloads;
```
