Skip to content

Fix keywords that weren't lowercased before inserting #919

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 2, 2017

Conversation

carols10cents
Copy link
Member

@carols10cents carols10cents commented Jul 28, 2017

Background for anyone unfamiliar with crates.io's database schema:

  • We've got crates
  • We've got keywords
  • Crates can have many keywords and keywords can have many crates
  • There's a join table crates_keywords that contains records associating crate.ids and keyword.ids

This migration fixes invalid data that currently exists in the database. Between
commits d5c5635 and 4c2bd42, it was possible for keywords to be
created without lowercasing them first. Then, due to this code:

pub fn find_or_create_all(conn: &PgConnection, names: &[&str]) ->
QueryResult<Vec<Keyword>> {
    // ...elided...

    keywords::table
        .filter(::lower(keywords::keyword).eq(any(&lowercase_names)))
        .load(conn)
}

which selected any keywords that matched the given keywords regardless
of case and associated these keywords with crates, many crates now have
BOTH the uppercase and lowercase variants of a keyword associated with
them.

That is, the keywords table contains:

id keyword
1 foo
2 Foo this record should not exist

And a crate with id 5 that was published with any of:

keywords = ["foo"]
keywords = ["Foo"]
keywords = ["FOO"] etc

in their Cargo.toml metadata would get the following records created in
the crates_keywords table:

crate_id keyword_id
5 1 this record is correct, for all cases
5 2 this record is incorrect, for all cases

So this migration removes all the crates_keywords rows associated with
keywords that aren't lowercased and then removes the keywords
themselves.

BUT!

There are some crates that only have these associations currently with
the uppercased keyword! That is, crate id 5 has this in crates_keywords:

crate_id keyword_id
5 2

So BEFORE making the fix above, this migration inserts the corresponding
lowercased keyword association for these crates (so 5, 1 in this case)
so that these crates can be treated like all the rest that are
associated with both.

ALSO!

There are some keywords that have only ever been used in their uppercase
variant, so the corresponding lowercase keyword doesn't exist at all.
Such as:

id keyword
3 BAR

In this case, the migration updates the keyword value, leaving the
associated crates_keywords alone.

WHEW!

@carols10cents
Copy link
Member Author

What I'd love to know from reviewers:

  • Is this SQL doing what I want it to do?
  • Are there any unforeseen consequences of this SQL?
  • Am I forgetting any cases that might exist based on my description of the bug?

@srobbibaro
Copy link

@carols10cents - I'll give it a look over. Since I don't know the schema (at all), I'll look more toward the general SQL side of it.

@srobbibaro
Copy link

@carols10cents - Absolutely wonderful write up in the PR description and in the query comments! It made it so easy for somebody like me, who has never seen this database, to understand exactly what the problem is and how you intend to fix it.

With that said, I believe the SQL will do what you want it to. There might have been a small opportunity for an optimization using a NOT EXISTS (see: https://stackoverflow.com/questions/173041/not-in-vs-not-exists), but if you know the data set to be "reasonable," it's probably not worth even making the change--especially if you've already tested it out.

Hopefully that review helps. I'm going to follow this ticket. Let me know when you run it - I'm curious to hear how it goes!

LGTM 💯

@carols10cents
Copy link
Member Author

carols10cents commented Jul 28, 2017

Thank you so much @srobbibaro!!! I miss having you and the other TTMers around to review my SQL all the time ❤️ ❤️ ❤️ ❤️

I'm definitely setting up some relevant data on staging and running this there before I run this on production!!

@srobbibaro
Copy link

It was fun to review - just like old times. Happy to help out! :)

This fixes invalid data that currently exists in the database. Between
commits d5c5635 and 4c2bd42, it was possible for keywords to be
created without lowercasing them first. Then, due to this code:

```
pub fn find_or_create_all(conn: &PgConnection, names: &[&str]) ->
QueryResult<Vec<Keyword>> {
    // ...elided...

    keywords::table
        .filter(::lower(keywords::keyword).eq(any(&lowercase_names)))
        .load(conn)
}
```

which selected any keywords that matched the given keywords *regardless*
of case and associated these keywords with crates, many crates now have
BOTH the uppercase and lowercase variants of a keyword associated with
them.

That is, the keywords table contains:

| id | keyword |
|----|---------|
| 1  | foo     |
| 2  | Foo     | <-- should not exist

And a crate with id 5 that was published with any of:

`keywords = ["foo"]`
`keywords = ["Foo"]`
`keywords = ["FOO"]` etc

in their Cargo.toml metadata would get the following records created in
the crates_keywords table:

| crate_id | keyword_id |
|----------|------------|
| 5        | 1          | <-- correct
| 5        | 2          | <-- incorrect, for all cases

So this migration removes all the crates_keywords rows associated with
keywords that aren't lowercased and then removes the keywords
themselves.

BUT!

There are some crates that only have these associations currently with
the uppercased keyword! That is, crate id 5 has this in crates_keywords:

| crate_id | keyword_id |
|----------|------------|
| 5        | 2          |

So BEFORE making the fix above, this migration inserts the corresponding
lowercased keyword association for these crates (so 5, 1 in this case)
so that these crates can be treated like all the rest that are
associated with both.

ALSO!

There are some keywords that have only ever been used in their uppercase
variant, so the corresponding lowercase keyword doesn't exist at all.
Such as:

| id | keyword |
|----|---------|
| 3  | BAR     |

In this case, the migration updates the keyword value, leaving the
associated crates_keywords alone.

WHEW!
Non-lowercased keywords shouldn't exist, but in case they do, don't
associate crates with them. That'll at least keep the SHOUTING KEYWORDS
out of the UI.
@carols10cents
Copy link
Member Author

I'm feeling good about this and I'd like to get the database cleaned up, so I'm going to merge this, test on staging, and deploy :)

bors r+

bors-voyager bot added a commit that referenced this pull request Aug 2, 2017
919: Fix keywords that weren't lowercased before inserting r=carols10cents

Background for anyone unfamiliar with crates.io's database schema:

- We've got crates
- We've got keywords
- Crates can have many keywords and keywords can have many crates
- There's a join table `crates_keywords` that contains records associating crate.ids and keyword.ids

This migration fixes invalid data that currently exists in the database. Between
commits d5c5635 and 4c2bd42, it was possible for keywords to be
created without lowercasing them first. Then, due to this code:

```
pub fn find_or_create_all(conn: &PgConnection, names: &[&str]) ->
QueryResult<Vec<Keyword>> {
    // ...elided...

    keywords::table
        .filter(::lower(keywords::keyword).eq(any(&lowercase_names)))
        .load(conn)
}
```

which selected any keywords that matched the given keywords *regardless*
of case and associated these keywords with crates, many crates now have
BOTH the uppercase and lowercase variants of a keyword associated with
them.

That is, the keywords table contains:

| id | keyword | |
|----|---------|-|
| 1  | foo     | |
| 2  | Foo     | this record should not exist |

And a crate with id 5 that was published with any of:

`keywords = ["foo"]`
`keywords = ["Foo"]`
`keywords = ["FOO"]` etc

in their Cargo.toml metadata would get the following records created in
the crates_keywords table:

| crate_id | keyword_id | |
|----------|------------|-|
| 5        | 1          | this record is correct, for all cases |
| 5        | 2          | this record is incorrect, for all cases |

So this migration removes all the crates_keywords rows associated with
keywords that aren't lowercased and then removes the keywords
themselves.

BUT!

There are some crates that only have these associations currently with
the uppercased keyword! That is, crate id 5 has this in crates_keywords:

| crate_id | keyword_id |
|----------|------------|
| 5        | 2          |

So BEFORE making the fix above, this migration inserts the corresponding
lowercased keyword association for these crates (so 5, 1 in this case)
so that these crates can be treated like all the rest that are
associated with both.

ALSO!

There are some keywords that have only ever been used in their uppercase
variant, so the corresponding lowercase keyword doesn't exist at all.
Such as:

| id | keyword |
|----|---------|
| 3  | BAR     |

In this case, the migration updates the keyword value, leaving the
associated crates_keywords alone.

WHEW!
@bors-voyager
Copy link
Contributor

bors-voyager bot commented Aug 2, 2017

Build succeeded

@bors-voyager bors-voyager bot merged commit 772e009 into rust-lang:master Aug 2, 2017
@carols10cents carols10cents deleted the fix-keywords branch August 2, 2017 14:08
@carols10cents
Copy link
Member Author

Deployed to staging where i had set up keywords and crates in each of these conditions, the migration did everything i expected it to and nothing I didn't expect it to so we're going to prod!

@carols10cents
Copy link
Member Author

prod is looking much better now!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants