Remove logging of dates from update-downloads #603

Merged (1 commit, Mar 9, 2017)

Conversation

@carols10cents
Member

commented Mar 8, 2017

A not-insignificant portion of our production logs is the update-downloads worker printing the date:

Mar 07 11:47:38 crates-io app/worker.1: Mon, 06 Mar 2017 19:47:37 GMT 
Mar 07 11:47:38 crates-io app/worker.1: Tue, 07 Mar 2017 00:00:00 GMT 
Mar 07 11:47:38 crates-io app/worker.1: Mon, 06 Mar 2017 19:47:37 GMT 
Mar 07 11:47:38 crates-io app/worker.1: Tue, 07 Mar 2017 00:00:00 GMT 
Mar 07 11:47:38 crates-io app/worker.1: Mon, 06 Mar 2017 19:47:37 GMT 

I think this is printing like... 1000 times every 30 seconds or so.

In addition to removing log noise, this should hopefully help our Papertrail quota stretch a bit farther.

@alexcrichton
Member

Er this I think indicates a bug?

These should update at most like 1k crates per minute or something like that. If we're printing thousands of messages, then something's going very wrong, I think.

@carols10cents
Member Author

Hm soooo it might not be as much as I guessed, but take a look at Papertrail for "worker updating versions" (to just get this log statement, not even the one I'm trying to get rid of).

I think the worker got behind when sgrif's STD crate was crashing it? It seems to have started attempting to process more and more rows on Sunday, and it looks like it's still increasing today even though I deleted the STD crate yesterday (using delete-crate.rs, which also gets rid of the version_downloads rows)...

It's hard to see if it's still erroring because of all these date lines though...

@carols10cents
Member Author

Ah yeah, the last result for "worker main" showing a crash was yesterday... investigating more...

@carols10cents
Member Author

It looks like the script sleeps for 5 minutes, then loops over batches of 1000, but it does try to process all rows where processed = false every 5 minutes. Right now, select count(*) from version_downloads where processed = false; returns about 29k.
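
As a sanity check on my reading, the outer loop is basically this shape. This is a rough sketch only: fetch_unprocessed is a made-up stand-in for the real query, and the only details taken from the script are the 5-minute sleep and the 1000-row batches described above.

```rust
use std::{thread, time::Duration};

// Rough sketch of the outer loop as described above. `fetch_unprocessed` is a
// made-up stand-in for the real query, not the script's actual code.
fn fetch_unprocessed(_after_id: u64, _limit: usize) -> Vec<u64> {
    // Would return up to `_limit` version_downloads ids with processed = false
    // and id > `_after_id`; stubbed out here.
    Vec::new()
}

fn worker_loop() {
    loop {
        let mut last_id = 0;
        loop {
            let batch = fetch_unprocessed(last_id, 1000);
            let Some(&max_id) = batch.last() else { break };
            // ... process this batch of rows (see the sketch further down) ...
            last_id = max_id;
        }
        // Sleep for 5 minutes, then sweep every unprocessed row again.
        thread::sleep(Duration::from_secs(5 * 60));
    }
}
```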

I think this script is querying for more rows than it needs, and doing an extra operation on both rows it needs and rows it does not need.

The query that happens every 5 minutes is querying for all rows where processed = false.

Then it skips over processing any rows whose date is after the cutoff and whose counted = downloads (since they have not gotten any more downloads since we last counted these rows). We could be excluding these rows from the initial query instead.

Then it gets the difference between downloads and counted, and unconditionally updates the row's processed and counted values. That update is important if we're changing processed from false to true (if this is a record from yesterday) OR if the counted value is changing (i.e. downloads - counted != 0, i.e. downloads != counted). If neither of those is true, we're doing an update that isn't changing anything.

We're then skipping the rest of the loop for rows whose downloads and counted are the same, which is another reason we really should be excluding from the query those records from today that haven't changed in the last 5 minutes.
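
Concretely, my mental model of the per-row logic is something like the sketch below, written over a plain struct rather than the actual database code. The field names come from this thread, and the cutoff handling is my guess at the intent, so treat it as a model rather than the real script.

```rust
// Sketch of the per-row logic described above, modeled over a plain struct
// instead of the real version_downloads rows. Not the actual script.
struct VersionDownload {
    date: i64,        // stand-in for the row's date (e.g. days since some epoch)
    downloads: i32,   // bumped on every download of this version
    counted: i32,     // how many of those downloads have already been propagated
    processed: bool,  // flipped to true once a past day's row is finalized
}

fn process_rows(rows: &mut [VersionDownload], cutoff: i64) -> i32 {
    let mut total_new_downloads = 0;

    for row in rows.iter_mut() {
        // Skipped entirely: a row from today that hasn't gained any downloads
        // since the last pass. These are the rows we could be excluding from
        // the initial query instead of fetching them every 5 minutes.
        if row.date > cutoff && row.counted == row.downloads {
            continue;
        }

        let diff = row.downloads - row.counted;

        // Unconditional update of counted/processed, even when neither value
        // actually changes. It only matters when processed flips to true
        // (date <= cutoff) or when counted moves (diff != 0).
        row.counted += diff;
        row.processed = row.date <= cutoff;

        // The rest of the loop is skipped when there's nothing new to propagate.
        if diff == 0 {
            continue;
        }

        // ... propagate `diff` upward (version -> crate -> site totals) ...
        total_new_downloads += diff;
    }

    total_new_downloads
}

// And the narrower query suggested above might look roughly like this
// (sketch only; column names are the ones discussed in this thread):
//
//   SELECT * FROM version_downloads
//   WHERE processed = FALSE
//     AND (date <= $cutoff OR downloads != counted)
```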

Am I missing anything about how this works?

@alexcrichton
Member

@carols10cents yeah that sounds about right.

The tl;dr is that I don't know how to implement download counts. I assumed that having every download update a global counter would be incredibly slow (and lossy), so every download updates a version-local counter and that's it. Those counters are then picked up by this batch process (serially), which propagates the download counts upward (from version to crate, from crate to website).

So the general gist is that this script looks at rows that need to be propagated upwards, but doesn't look at too many. I'd definitely believe there are more rows to be filtered out!
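
Roughly, "propagating upward" means adding each per-version diff into the running totals one level up, e.g. something like the following (hypothetical SQL, not the script's actual statements; table and column names are assumptions):

```rust
// Hypothetical sketch of the upward propagation: the diff computed from a
// version_downloads row is added to the version's total and then to the
// owning crate's total. Not the real SQL in update-downloads.
const BUMP_VERSION_DOWNLOADS: &str =
    "UPDATE versions SET downloads = downloads + $1 WHERE id = $2";
const BUMP_CRATE_DOWNLOADS: &str =
    "UPDATE crates SET downloads = downloads + $1 WHERE id = $2";
```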

@alexcrichton
Member

That being said, 29k rows per 5 minutes should be 29 messages per 5 minutes (one per batch of 1000), not 10k/5min...

@alexcrichton
Member

Urgh of course this is printing a line per row. I'm crazy.

@alexcrichton merged commit 35d2240 into rust-lang:master Mar 9, 2017
@carols10cents deleted the log-less branch March 11, 2017 18:13