-
Notifications
You must be signed in to change notification settings - Fork 645
Fix keywords that weren't lowercased before inserting #919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
What I'd love to know from reviewers:
|
@carols10cents - I'll give it a look over. Since I don't know the schema (at all), I'll look more toward the general SQL side of it. |
@carols10cents - Absolutely wonderful write up in the PR description and in the query comments! It made it so easy for somebody like me, who has never seen this database, to understand exactly what the problem is and how you intend to fix it. With that said, I believe the SQL will do what you want it to. There might have been a small opportunity for an optimization using a Hopefully that review helps. I'm going to follow this ticket. Let me know when you run it - I'm curious to hear how it goes! LGTM 💯 |
Thank you so much @srobbibaro!!! I miss having you and the other TTMers around to review my SQL all the time ❤️ ❤️ ❤️ ❤️ I'm definitely setting up some relevant data on staging and running this there before I run this on production!! |
It was fun to review - just like old times. Happy to help out! :) |
36a9db6
to
ffd1f48
Compare
This fixes invalid data that currently exists in the database. Between commits d5c5635 and 4c2bd42, it was possible for keywords to be created without lowercasing them first. Then, due to this code: ``` pub fn find_or_create_all(conn: &PgConnection, names: &[&str]) -> QueryResult<Vec<Keyword>> { // ...elided... keywords::table .filter(::lower(keywords::keyword).eq(any(&lowercase_names))) .load(conn) } ``` which selected any keywords that matched the given keywords *regardless* of case and associated these keywords with crates, many crates now have BOTH the uppercase and lowercase variants of a keyword associated with them. That is, the keywords table contains: | id | keyword | |----|---------| | 1 | foo | | 2 | Foo | <-- should not exist And a crate with id 5 that was published with any of: `keywords = ["foo"]` `keywords = ["Foo"]` `keywords = ["FOO"]` etc in their Cargo.toml metadata would get the following records created in the crates_keywords table: | crate_id | keyword_id | |----------|------------| | 5 | 1 | <-- correct | 5 | 2 | <-- incorrect, for all cases So this migration removes all the crates_keywords rows associated with keywords that aren't lowercased and then removes the keywords themselves. BUT! There are some crates that only have these associations currently with the uppercased keyword! That is, crate id 5 has this in crates_keywords: | crate_id | keyword_id | |----------|------------| | 5 | 2 | So BEFORE making the fix above, this migration inserts the corresponding lowercased keyword association for these crates (so 5, 1 in this case) so that these crates can be treated like all the rest that are associated with both. ALSO! There are some keywords that have only ever been used in their uppercase variant, so the corresponding lowercase keyword doesn't exist at all. Such as: | id | keyword | |----|---------| | 3 | BAR | In this case, the migration updates the keyword value, leaving the associated crates_keywords alone. WHEW!
Non-lowercased keywords shouldn't exist, but in case they do, don't associate crates with them. That'll at least keep the SHOUTING KEYWORDS out of the UI.
ffd1f48
to
772e009
Compare
I'm feeling good about this and I'd like to get the database cleaned up, so I'm going to merge this, test on staging, and deploy :) bors r+ |
919: Fix keywords that weren't lowercased before inserting r=carols10cents Background for anyone unfamiliar with crates.io's database schema: - We've got crates - We've got keywords - Crates can have many keywords and keywords can have many crates - There's a join table `crates_keywords` that contains records associating crate.ids and keyword.ids This migration fixes invalid data that currently exists in the database. Between commits d5c5635 and 4c2bd42, it was possible for keywords to be created without lowercasing them first. Then, due to this code: ``` pub fn find_or_create_all(conn: &PgConnection, names: &[&str]) -> QueryResult<Vec<Keyword>> { // ...elided... keywords::table .filter(::lower(keywords::keyword).eq(any(&lowercase_names))) .load(conn) } ``` which selected any keywords that matched the given keywords *regardless* of case and associated these keywords with crates, many crates now have BOTH the uppercase and lowercase variants of a keyword associated with them. That is, the keywords table contains: | id | keyword | | |----|---------|-| | 1 | foo | | | 2 | Foo | this record should not exist | And a crate with id 5 that was published with any of: `keywords = ["foo"]` `keywords = ["Foo"]` `keywords = ["FOO"]` etc in their Cargo.toml metadata would get the following records created in the crates_keywords table: | crate_id | keyword_id | | |----------|------------|-| | 5 | 1 | this record is correct, for all cases | | 5 | 2 | this record is incorrect, for all cases | So this migration removes all the crates_keywords rows associated with keywords that aren't lowercased and then removes the keywords themselves. BUT! There are some crates that only have these associations currently with the uppercased keyword! That is, crate id 5 has this in crates_keywords: | crate_id | keyword_id | |----------|------------| | 5 | 2 | So BEFORE making the fix above, this migration inserts the corresponding lowercased keyword association for these crates (so 5, 1 in this case) so that these crates can be treated like all the rest that are associated with both. ALSO! There are some keywords that have only ever been used in their uppercase variant, so the corresponding lowercase keyword doesn't exist at all. Such as: | id | keyword | |----|---------| | 3 | BAR | In this case, the migration updates the keyword value, leaving the associated crates_keywords alone. WHEW!
Build succeeded |
Deployed to staging where i had set up keywords and crates in each of these conditions, the migration did everything i expected it to and nothing I didn't expect it to so we're going to prod! |
prod is looking much better now!!! |
Background for anyone unfamiliar with crates.io's database schema:
crates_keywords
that contains records associating crate.ids and keyword.idsThis migration fixes invalid data that currently exists in the database. Between
commits d5c5635 and 4c2bd42, it was possible for keywords to be
created without lowercasing them first. Then, due to this code:
which selected any keywords that matched the given keywords regardless
of case and associated these keywords with crates, many crates now have
BOTH the uppercase and lowercase variants of a keyword associated with
them.
That is, the keywords table contains:
And a crate with id 5 that was published with any of:
keywords = ["foo"]
keywords = ["Foo"]
keywords = ["FOO"]
etcin their Cargo.toml metadata would get the following records created in
the crates_keywords table:
So this migration removes all the crates_keywords rows associated with
keywords that aren't lowercased and then removes the keywords
themselves.
BUT!
There are some crates that only have these associations currently with
the uppercased keyword! That is, crate id 5 has this in crates_keywords:
So BEFORE making the fix above, this migration inserts the corresponding
lowercased keyword association for these crates (so 5, 1 in this case)
so that these crates can be treated like all the rest that are
associated with both.
ALSO!
There are some keywords that have only ever been used in their uppercase
variant, so the corresponding lowercase keyword doesn't exist at all.
Such as:
In this case, the migration updates the keyword value, leaving the
associated crates_keywords alone.
WHEW!