Get ~all users to verify email addresses #3632

Closed
5 tasks done
dstufft opened this issue Apr 11, 2018 · 33 comments

@dstufft
Member

dstufft commented Apr 11, 2018

We have a problem with a bit of our data, namely that due to historical reasons we have a fair number of users in the database that do not have a verified primary email address. The side effect of this is that we're currently sending emails to email addresses that we have not had verified. This is a bad situation to be in, because in order to keep our bounce/spam rate low, we should be confirming all email addresses before sending email to them. In addition, our bounce handling code works by un-verifying the email address, the intent being to stop sending email to it until the user has reverified their email address.

In total there are about 193k user accounts with an unverified email address for their primary address, and 44k that do have a verified email address for their primary address.

So we need to come up with a strategy to resolve this, because it's pretty important that we don't send email to unverified addresses.

Here's what I've come up with, but I'd like to see what other people think as well.

For background, the way activation worked on legacy PyPI was that when you registered, it added a one-time token (OTK) to a separate table that stored (username, OTK, datetime). When you verified your email with PyPI it would delete the entry from this other table, so effectively this table acts as a list of user accounts that legacy PyPI registered, but who never activated their account via legacy PyPI.

So that means we have accounts in 3 possible states:

  • They have a primary email address that is verified.
  • They have a primary email address that is unverified, and they exist in the OTK table.
  • They have a primary email address that is unverified, and they do not exist in the OTK table.

The first state is the happy state, and we currently have 44k accounts in that state. Looking at the OTK table, there are currently ~135k rows. If we assume that 100% of them are for accounts that did not end up verifying via Warehouse instead, that means that we have 135k accounts in the second state, and ~58k accounts in the third state. Just to correlate this, we also have ~135k users who are not in the is_active state.
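
As a rough illustration of the three states, here is a minimal sketch. The `AccountState` enum, the `classify` function, and the usernames are all made up for this example; none of it is Warehouse code.

```python
from enum import Enum


class AccountState(Enum):
    VERIFIED = "verified primary email"
    UNVERIFIED_IN_OTK = "unverified primary email, still in OTK table"
    UNVERIFIED_NOT_IN_OTK = "unverified primary email, not in OTK table"


def classify(primary_verified: bool, username: str, otk_usernames: set[str]) -> AccountState:
    """Map an account onto one of the three states listed above."""
    if primary_verified:
        return AccountState.VERIFIED
    if username in otk_usernames:
        return AccountState.UNVERIFIED_IN_OTK
    return AccountState.UNVERIFIED_NOT_IN_OTK


# Toy data only; these usernames are hypothetical.
otk = {"driveby"}
for verified, name in [(True, "alice"), (False, "driveby"), (False, "bob")]:
    print(name, "->", classify(verified, name, otk).value)
```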

Thus my plan of action is:

  • 1) Start displaying a flash-message like warning at the top of every page load for logged in users without a verified primary email address with a call to action to get a verified email address as their primary email address.
  • 2) Expand the limitations of not having a verified, primary address so that you cannot do much in the ways of project management without it. What exactly should be limited is on the table, but I think uploads in general should require a valid, verified email, and likely so should other actions like deletions, managing contributors, etc.
  • 3) Start a campaign of blogs, tweets, mailing list posts, etc to ask users to verify their email addresses with PyPI.
  • 4) Assume the ~135k are drive by accounts that have never been activated, and leave them marked unverified and inactive (if they haven't verified on Warehouse).
  • 5) Take the other 58k people, and start slowly sending emails to them asking them to verify the email address on file. Tell them that unless they verify their address, this will be the last email they get from us. Assuming steps 1-4 don't reduce the 58k number, if we sent to, say, 200 people a day, we'd be looking at processing the backlog in 8-9 months.

The end result then is that through (1) and (2) people are heavily incentivized to keep a working, verified email address hooked up to their account, through (3) we hopefully prompt some number of people to look at their accounts and verify, through (4) we reduce the size of the affected accounts considerably, and through (5) we give accounts one last notification to verify their email address.

I believe that once we get to (3), we should disable sending emails to unverified addresses (except for the email sent in (5)).

A few open questions left that I'm not sure of:

  1. Once we disable sending emails to unverified addresses, what emails should still be sent? Off hand I can think of:
    • Email verification email (this one is obvious)
    • MAYBE Password reset email? I'm not sure about this one, certainly we should allow it until (5) above is complete, but once that is complete I'm not sure! It's something that would only occur if a user is trying to reset a password for an account, but if they haven't verified their email address it is an avenue for malicious users to spam someone else with our system [1].
  2. There are about 73 users whose primary email address is unverified, but who have added a verified alternative email address. Do we want to do anything special with these users like automatically promote their verified email to primary? Or should we just let them work through the above plan naturally?
  3. Similar to the above, do we want to do anything special if a user's email address gets unverified due to delivery issues/spam complaint and they have other verified emails on their account?
    • I think certainly if they marked one of our emails as spam we shouldn't then pick another email address they had previously given us and start sending to that address instead. A spam complaint is a pretty heavy handed signal to stop sending them email.
    • I think that perhaps if we un-verify their primary email address, it wouldn't be unreasonable to send an email to an alternative email address to tell them we did. I'm not sure though, and if we do how do we pick which verified address to send to if they have multiple? Or would we send to all of them?

[1] Of course the email verification email is also such an email, but ideally that email should be adjusted to include some verbiage about how to contact the administrators if they're getting those emails and we can blacklist their email address from being used? If we do that, perhaps something automated too that would allow users to stop these emails from being sent to them by clicking on a link and confirming it?
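
To make open question 1 concrete, here is a minimal sketch of a send policy. The message-kind names and the `may_send` function are hypothetical, and the "deny password resets once step (5) is done" branch is only one possible answer to the open question, not a decision.

```python
# Hypothetical message kinds; the names are illustrative, not Warehouse's.
ALWAYS_ALLOWED = {"email-verification"}
ALLOWED_DURING_TRANSITION = {"password-reset"}


def may_send(kind: str, address_verified: bool, transition_complete: bool) -> bool:
    """Decide whether a message of `kind` may be sent to an address."""
    if address_verified:
        return True
    if kind in ALWAYS_ALLOWED:
        return True
    # Password resets stay allowed until the step (5) mailing is finished.
    # What happens afterwards is the open question; this sketch denies them.
    if kind in ALLOWED_DURING_TRANSITION:
        return not transition_complete
    return False


print(may_send("password-reset", address_verified=False, transition_complete=False))
print(may_send("new-collaborator", address_verified=False, transition_complete=False))
```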

@dstufft added the "needs discussion" label (a product management/policy issue maintainers and users should discuss) Apr 11, 2018
@reaperhulk
Contributor

This issue made me check my own account, where I found out I was not verified. However, I rarely log into pypi except via CLI tooling like twine to upload packages. I have no idea if I'm a typical user, but ideally there would be some way to communicate the need to confirm an email address to users using that path as well.

@dstufft
Member Author

dstufft commented Apr 12, 2018

@reaperhulk Yea, the step (2) would basically do that, although via making twine upload fail until you verified rather than by printing a nice message.

@dstufft
Member Author

dstufft commented Apr 12, 2018

To be clear, you'd get an error message that told you why it failed, but it wouldn't be "oh it worked, but also here's a thing you should do".
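
A minimal sketch of that behavior, assuming a hypothetical `check_upload_allowed` helper and `UploadRejected` exception (neither is Warehouse's actual code, and the message wording is made up): the upload fails outright, but the failure says why.

```python
class UploadRejected(Exception):
    """Raised when an upload is refused; the message explains why."""


def check_upload_allowed(username: str, primary_email_verified: bool) -> None:
    """Refuse the upload unless the account has a verified primary email."""
    if not primary_email_verified:
        raise UploadRejected(
            f"User {username!r} does not have a verified primary email address. "
            "Verify it in your account settings before uploading."
        )


try:
    check_upload_allowed("release-bot", primary_email_verified=False)
except UploadRejected as exc:
    # This is what a CLI client would surface to the person running it.
    print(f"upload failed: {exc}")
```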

@di
Member

di commented Apr 12, 2018

Start displaying a flash-message like warning at the top of every page load for logged in users without a verified primary email address with a call to action to get a verified email address as their primary email address.

Expand the limitations of not having a verified, primary address so that you cannot do much in the ways of project management without it. What exactly should be limited is on the table, but I think uploads in general should require a valid, verified email, and likely so should other actions like deletions, managing contributors, etc.

I think it'd make sense to just immediately redirect to a "verify your email address" screen after a successful login and prevent the user from doing anything in the UI until they do, skipping the flash message entirely (as well as preventing uploads, like we do for new project registrations at the moment).

Take the other 58k people, and start slowly sending emails to them asking them to verify the email address on file. Tell them that unless they verify their address, this will be the last email they get from us. Assuming steps 1-4 don't reduce the 58k number, if we sent to, say, 200 people a day, we'd be looking at processing the backlog in 8-9 months.

I think this is not unreasonable, though obviously it'd be better to get as many users verified as possible before resorting to this.

MAYBE Password reset email? I'm not sure about this one, certainly we should allow it until (5) above is complete, but once that is complete I'm not sure! It's something that would only occur if a user is trying to reset a password for an account, but if they haven't verified their email address it is an avenue for malicious users to spam someone else with our system [1].

Definitely should allow this until we get to (5). Afterwards probably as well? If a user has a single verified email, then accidentally marks it as spam (thus getting it unverified), and they've forgotten their password, they've essentially permanently locked themselves out of their account.

There are about 73 users whose primary email address is unverified, but who have added a verified alternative email address. Do we want to do anything special with these users like automatically promote their verified email to primary? Or should we just let them work through the above plan naturally?

I think they should just go through the regular flow, although I'm curious why they exist because (at least in Warehouse) you can't make a non-verified email your primary address. Perhaps worth looking into more...

Similar to the above, do we want to do anything special if a user's email address gets unverified due to delivery issues/spam complaint and they have other verified emails on their account?

Thinking on this again, I agree that just making it unverified should suffice. I think all the locks that come with not having a verified primary email should get activated, and they'll need to re-verify that address or manually switch primary emails to unlock.

I think certainly if they marked one of our emails as spam we shouldn't then pick another email address they had previously given us and start sending to that address instead. A spam complaint is a pretty heavy handed signal to stop sending them email.

I think that perhaps if we un-verify their primary email address, it wouldn't be unreasonable to send an email to an alternative email address to tell them we did. I'm not sure though, and if we do how do we pick which verified address to send to if they have multiple? Or would we send to all of them?

I think we probably should never attempt to contact non-primary email addresses. If they get locked out of uploading, etc. due to not having a verified primary email, they should get pretty obvious messages why they can't do what they're trying to do.

@dstufft
Member Author

dstufft commented Apr 13, 2018

MAYBE Password reset email? I'm not sure about this one, certainly we should allow it until (5) above is complete, but once that is complete I'm not sure! It's something that would only occur if a user is trying to reset a password for an account, but if they haven't verified their email address it is an avenue for malicious users to spam someone else with our system [1].

Definitely should allow this until we get to (5). Afterwards probably as well? If a user has a single verified email, then accidentally marks it as spam (thus getting it unverified), and they've forgotten their password, they've essentially permanently locked themselves out of their account.

Yea, they would have perma locked themselves out, but they could reach out to us and we could manually verify an email for them in the worst case.

@dstufft
Member Author

dstufft commented Apr 15, 2018

I think they should just go through the regular flow, although I'm curious why they exist because (at least in Warehouse) you can't make a non-verified email your primary address. Perhaps worth looking into more...

If you're one of the people who already have an unverified email as your primary address, and you add a verified email but you don't make it your primary.

@di
Member

di commented Apr 15, 2018

If you're one of the people who already have an unverified email as your primary address, and you add a verified email but you don't make it your primary.

Seems like folks might be missing the fact that adding a new email and verifying it does not make it your primary. Maybe we should do this automatically if the primary is unverified?
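
A minimal sketch of what that automatic promotion could look like. The `Email` dataclass and `promote_if_primary_unverified` function are hypothetical stand-ins, not Warehouse's models.

```python
from dataclasses import dataclass


@dataclass
class Email:
    address: str
    primary: bool
    verified: bool


def promote_if_primary_unverified(emails: list[Email], newly_verified: Email) -> bool:
    """Make `newly_verified` primary if the current primary is unverified or absent.

    Returns True if a promotion happened.
    """
    current = next((e for e in emails if e.primary), None)
    if current is not None and current.verified:
        return False  # a verified primary already exists; leave it alone
    if not newly_verified.verified:
        return False  # never promote an address that is itself unverified
    if current is not None:
        current.primary = False
    newly_verified.primary = True
    return True


old = Email("old@example.com", primary=True, verified=False)
new = Email("new@example.com", primary=False, verified=True)
print(promote_if_primary_unverified([old, new], new), old.primary, new.primary)
```

This would only fire when the primary is unverified, so accounts with a working verified primary keep it.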

@brainwane
Contributor

When should we start this whole process?

Start a campaign of blogs, tweets, mailing list posts, etc to ask users to verify their email addresses with PyPI.

PyCon would be a great time for us to spread this message. That's now under a month away.

@brainwane changed the title from "Email Address Verification" to "Get ~all users to verify email addresses" Apr 16, 2018
@dstufft
Member Author

dstufft commented Apr 16, 2018

There's no technical reason to start messaging on a particular date, so really whenever folks think is a good time is fine.

@brainwane
Contributor

We discussed this issue in our weekly meeting last week. It sounds like a call-to-action about this won't necessarily fit in Dustin's talk at PyCon.

Ernest noted that we had excellent results when we asked people to verify before releasing a new package -- a little confusion, but no grouchiness. He suggested that we might want to go back, summarize how those responses went in issues, and improve our messaging before doing further publicity.

@rspeer

rspeer commented Apr 24, 2018

May I suggest that you should increase the expiration time on the verification e-mail, which would probably increase the rate of people successfully completing the process?

There are things in my e-mail that are important enough that they have to be dealt with in 6 hours. PyPI is not one of them.

@KOLANICH

KOLANICH commented Apr 25, 2018

I suggest not to verify email addresses, but just stop collecting them and implement signon and signup without any email verification, phone account verification, bank card verification, ID verification, fingerprint verification or DNA profile verification or anything of this kind of shit. Email is overcentralised and we must get rid of it. If tomorrow the email provider closed my access to email I would have lost all my accounts and all my online identity. So I prefer my accounts not to be bound to email. Just use a crypto key for both signup, signon and packages signing.

The end result then is that through (1) and (2) people are heavily incentivized to keep a working, verified email address hooked up to their account

I guess it heavily incentivizes not to use this service at all.

@dstufft
Member Author

dstufft commented Jul 13, 2018

#4292 implements (2) of the published plan. I think that might be enough of a restriction in and of itself: uploading a package is the primary thing people with "important" user accounts tend to do with their PyPI account, so it's going to act as a pretty strong forcing function. Additionally, trying to turn UI items into errors is a lot harder than an API item, and the red banner at the top already acts as a guide to get people to verify their email.

@fungi

fungi commented Jul 17, 2018

Announcing solely via a banner on a WebUI assumes people use the WebUI with those accounts. It caught us by surprise since the account we use in our release automation (which ~nobody ever logs into the WebUI with) started getting its uploads rejected. The followup announcement to distutils-sig today was helpful, but would have been more helpful in advance of landing #4292.

Regardless, thanks for working on this--it's a great improvement!

@di
Member

di commented Jul 17, 2018

@fungi There should have also been an error message when the upload failed, was that not shown in your automation logs?

@fungi

fungi commented Jul 17, 2018

Yep, the error message worked fine but turned it into a reactive scenario rather than a proactive one. In our case the people driving the release automation aren't the same as the people with access to the credentials and inbox for the account used by said automation, so release management activities had to be paused while a solution was coordinated with systems administrators.

On a positive note, this has made it apparent to me that our project should use a different E-mail address for that PyPI account than the one which also catches the massive backscatter from our code review system. ;)

@brainwane
Contributor

@fungi Sorry for the late notice! I agree that we should have spread the word more.

Per @ewdurbin 👍

Ernest noted that we had excellent results when we asked people to verify before releasing a new package -- a little confusion, but no grouchiness. He suggested that we might want to go back, summarize how those responses went in issues, and improve our messaging before doing further publicity.

Anyone have any pointers to some of those responses? When did we make this change?

@dstufft
Member Author

dstufft commented Jul 17, 2018

5 days ago we moved from only blocking on new projects, to blocking on any attempt to upload anything. The banner warning at the top started on April 15, and I think that blocking on new projects happened prior to this ticket even being opened.

@fungi

fungi commented Jul 17, 2018

Also, to be clear, the account where this took us by surprise historically was not used to register/create new projects, but instead gets added to the access controls for projects created by other accounts. Perhaps a bit of a corner case, but explains why we wouldn't have noticed without explicit announcement.

@dmerejkowsky

dmerejkowsky commented Jul 17, 2018

For point 3) I've taken the liberty of writing a short article on my blog and dev.to.

Hope this helps!

@dstufft
Member Author

dstufft commented Jul 18, 2018

Ok, I've done some digging here, and my estimations before were off I think (trying to remember how I arrived at the numbers above, and the best I can think of is I estimated poorly). So here are the new numbers:

Number of Users: 232131
SELECT COUNT(DISTINCT accounts_user.id) FROM accounts_user, accounts_email
    WHERE accounts_user.id = accounts_email.user_id
        AND (accounts_user.date_joined < date '2018-02-18'
              OR accounts_user.date_joined IS NULL)
        AND accounts_email.primary = 't'
        AND accounts_email.unverify_reason IS NULL
Number of Users with Verified Primary Email Address: 45702
SELECT COUNT(DISTINCT accounts_user.id) FROM accounts_user, accounts_email
    WHERE accounts_user.id = accounts_email.user_id
        AND (accounts_user.date_joined < date '2018-02-18'
              OR accounts_user.date_joined IS NULL)
        AND accounts_email.primary = 't'
        AND accounts_email.verified = 't'
Number of Users with Unverified Primary Email Address: 186429
SELECT COUNT(DISTINCT accounts_user.id) FROM accounts_user, accounts_email
    WHERE accounts_user.id = accounts_email.user_id
        AND (accounts_user.date_joined < date '2018-02-18'
              OR accounts_user.date_joined IS NULL)
        AND accounts_email.primary = 't'
        AND accounts_email.verified = 'f'
        AND accounts_email.unverify_reason IS NULL
Number of Users with Unverified Primary Email Address NOT IN OTK table: 86187
SELECT COUNT(DISTINCT accounts_user.id) FROM accounts_user, accounts_email
  WHERE accounts_user.id = accounts_email.user_id
      AND (accounts_user.date_joined < date '2018-02-18'
              OR accounts_user.date_joined IS NULL)
      AND accounts_email.primary = 't'
      AND accounts_email.verified = 'f'
      AND accounts_email.unverify_reason IS NULL
      AND NOT EXISTS (
        SELECT 1 FROM rego_otk WHERE rego_otk.name = accounts_user.username
      )
Number of Users with Unverified Primary Email Address IN OTK table: 100242
SELECT COUNT(DISTINCT accounts_user.id) FROM accounts_user, accounts_email
  WHERE accounts_user.id = accounts_email.user_id
      AND (accounts_user.date_joined < date '2018-02-18'
              OR accounts_user.date_joined IS NULL)
      AND accounts_email.primary = 't'
      AND accounts_email.verified = 'f'
      AND accounts_email.unverify_reason IS NULL
      AND EXISTS (
        SELECT 1 FROM rego_otk WHERE rego_otk.name = accounts_user.username
      )

A few important notes:

  • All of these queries are limited to only users who have a primary email address that has not been unverified due to delivery issues.
  • All of these queries are limited only to users who have joined the site prior to 2018-02-18, because that is when new user registration got disabled on legacy PyPI, and any users registered after that would not appear in rego_otk no matter what.
    • This includes users with a NULL join date, because those come from before Warehouse was even a thing.
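
To show how the filters in these queries interact, here is a small self-contained sketch that runs the same shape of query against an in-memory SQLite database. The schema is cut down to the columns the queries touch, the five users are made up, booleans are stored as 0/1 instead of 't'/'f', and the 'bounce' value for unverify_reason is a placeholder; the production database differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema with only the columns the queries touch. "primary" is quoted
# because it is a keyword in SQLite.
conn.executescript(
    """
    CREATE TABLE accounts_user (id INTEGER PRIMARY KEY, username TEXT, date_joined TEXT);
    CREATE TABLE accounts_email (
        user_id INTEGER, "primary" INTEGER, verified INTEGER, unverify_reason TEXT
    );
    CREATE TABLE rego_otk (name TEXT);
    """
)
# Made-up users: alice is verified; driveby never activated on legacy PyPI;
# bob has a NULL join date; newbie joined after the cutoff; bounced was
# un-verified because of a delivery problem.
conn.executemany(
    "INSERT INTO accounts_user VALUES (?, ?, ?)",
    [(1, "alice", "2015-01-01"), (2, "driveby", "2016-06-01"), (3, "bob", None),
     (4, "newbie", "2018-03-01"), (5, "bounced", "2014-01-01")],
)
conn.executemany(
    "INSERT INTO accounts_email VALUES (?, 1, ?, ?)",
    [(1, 1, None), (2, 0, None), (3, 0, None), (4, 0, None), (5, 0, "bounce")],
)
conn.execute("INSERT INTO rego_otk VALUES ('driveby')")

QUERY = """
    SELECT COUNT(DISTINCT accounts_user.id) FROM accounts_user, accounts_email
    WHERE accounts_user.id = accounts_email.user_id
      AND (accounts_user.date_joined < '2018-02-18' OR accounts_user.date_joined IS NULL)
      AND accounts_email."primary" = 1
      AND accounts_email.verified = 0
      AND accounts_email.unverify_reason IS NULL
      AND {maybe_not} EXISTS (
        SELECT 1 FROM rego_otk WHERE rego_otk.name = accounts_user.username
      )
"""


def count_unverified(in_otk: bool) -> int:
    """Count unverified-primary users that are (or are not) in rego_otk."""
    sql = QUERY.format(maybe_not="" if in_otk else "NOT")
    return conn.execute(sql).fetchone()[0]


print("in OTK table:", count_unverified(True))
print("not in OTK table:", count_unverified(False))
```

In this toy data, newbie and bounced fall out of both counts: one because of the join-date cutoff, the other because of the unverify_reason filter.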

So my new numbers greatly expand upon the number of users that would be emailed in step 5 above, to the point that I'm concerned about the number of emails we would have to send. So I've been thinking about how we can modify the plan above, and I think that with two modifications, we should be back on track:

  • Modify step (2) to also include the limitation that a user without a verified primary email address cannot be added as a maintainer to an existing package, that will prevent anyone who doesn't have contact information on file from being newly given permissions to existing projects.
  • Modify step (5) to limit the emails we're sending to only people who are a maintainer/owner of an existing project.

The above changes eliminate the need to send email to someone who isn't currently capable of managing a project, which is expanded out from anyone who is currently uploading to a project. They also limit sending emails strictly to people who actually have projects under their control, which are the people we truly care about having an email address on file for anyway (all of our notifications have to do with project administration, currently at least).

All in all, these changes would mean that instead of sending an email to 86,187 people, we are instead going to be sending an email to 37,838 which brings our backlog down to 6-7 months instead of 8-9 months. This change would also catch more projects (if we had implemented it first before the upload restrictions, it would have caught the Openstack example above, assuming they had a new project at any time).

Thoughts?

Query for the ~37k Users
SELECT COUNT(DISTINCT accounts_user.id) FROM accounts_user, accounts_email
  WHERE accounts_user.id = accounts_email.user_id
      AND (accounts_user.date_joined < date '2018-02-18'
              OR accounts_user.date_joined IS NULL)
      AND accounts_email.primary = 't'
      AND accounts_email.verified = 'f'
      AND accounts_email.unverify_reason IS NULL
      AND NOT EXISTS (
        SELECT 1 FROM rego_otk WHERE rego_otk.name = accounts_user.username
      )
      AND EXISTS (
          SELECT 1 from roles WHERE roles.user_name = accounts_user.username
      )
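
A quick check of the backlog arithmetic, as a sketch: it assumes the 200 emails per day from step (5) and treats a month as 30 days (my assumption). The `days_to_clear` helper is made up for this example.

```python
import math


def days_to_clear(recipients: int, per_day: int = 200) -> int:
    """Days needed to email everyone once at a fixed daily rate."""
    return math.ceil(recipients / per_day)


for label, n in [
    ("original estimate", 58_000),
    ("unverified, not in OTK table", 86_187),
    ("unverified, not in OTK table, has a project role", 37_838),
]:
    days = days_to_clear(n)
    print(f"{label}: {n} recipients -> {days} days (~{days / 30:.1f} months)")
```

On these assumptions the 37,838 figure works out to a bit over six months, in line with the 6-7 month estimate. The 8-9 month figure is closer to the original ~58k estimate than to 86,187, which would take noticeably longer at the same rate.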

@fungi

fungi commented Jul 19, 2018

Sounds like an excellent next step.

I'd like to think we'd have noticed if direct E-mails were sent earlier, but to be honest we chose poorly on what address to associate with that automation account (years ago) and the direct notification would likely have been buried in a ton of noise (since remedied this week). The modification to step #2 would certainly have come to our attention though as that account is added to more projects at least weekly.

Regardless, given the point of having verified addresses for uploaders is to be able to reliably contact them, it's entirely reasonable for PyPI maintainers to consider a notice sent to those addresses as sufficient due diligence for such a behavior change.

@dstufft
Member Author

dstufft commented Jul 19, 2018

Ok, #4322 prevents users from being added to a project unless they have a verified primary email address.

@dstufft
Member Author

dstufft commented Jul 19, 2018

One further modification, if I remove the qualifications for date_joined and not existing in rego_otk from my query, then the ~37k people increases to ~38k people, so I think that it makes sense to not limit it to accounts from before the switch.

Number of Users to email: 38,824
SELECT COUNT(DISTINCT accounts_user.id) FROM accounts_user, accounts_email
  WHERE accounts_user.id = accounts_email.user_id
      AND accounts_email.primary = 't'
      AND accounts_email.verified = 'f'
      AND accounts_email.unverify_reason IS NULL
      AND EXISTS (
          SELECT 1 from roles WHERE roles.user_name = accounts_user.username
      )

@mlissner

mlissner commented Jul 20, 2018

I think I read through all of this, but I remain pretty annoyed that things just broke for my automated upload system.

Why can't we send emails before we start breaking things for people? Seems to me that we should do whatever we can to start emailing unverified accounts now before more people have to stop what they're doing in their otherwise productive day, to figure out what the hack is going on.

One other note: I pretty much never log into the website, so this is kind of a special opportunity for us, since a lot of the pypi users will suddenly be adjusting their accounts. I know that these days whenever I'm tinkering with my credentials, verifying accounts, etc, I make sure I have 2FA enabled while I'm there. It would be great to have 2FA ready to go before embarking on this ticket any further. If that's not far off, would it be crazy to do something like:

  1. Stop blocking uploads for a bit. This is an annoying and aggressive step to take without emailing first.
  2. Get 2FA enabled
  3. Send all the emails to people asking them to verify (note that 2FA is now available)
  4. Send emails to already-verified accounts telling them 2FA is now available
  5. THEN: Start breaking things for people again

I note that last week Node had a crisis because somebody didn't have 2FA enabled and their account got phished. Time to up this priority?

@brainwane
Contributor

Hi, @mlissner! I'm sorry that we broke things and didn't announce stuff first.

From December 2017 till the end of April 2018, PyPI had a paid project manager (me) who made sure that we gave lots of advance warning before breaking stuff. Then the grant ran out and we have, as far as I know, no one paid to work on PyPI; volunteers are improving the software and infrastructure sides of things and sometimes the communications side doesn't catch up as fast. The Packaging Working Group is seeking donations and applying for further grants to fund more design work, more and faster development, and better project management.

I'm interested in your idea regarding hooking this process into #996 and ask @mschwager to comment.

@mlissner

I feel ya, thanks @brainwane. Hopefully the outline I've got there isn't really much more work if we were planning to email at the end anyway. Mostly, it'd just be a re-ordering of things, I think. But I do understand resource constraints. I'm largely in the same boat.

@dstufft
Member Author

dstufft commented Jul 21, 2018

We're unlikely to revert the changes at this point, as the primary thing we're waiting on before sending out the email is approval from the WG on spending money on sending out nearly 40k emails via MailChimp. Once that has been approved, we'll be sending out the email fairly soon after.

Part of that is because we need a pretty clear line in the sand of who we're emailing, and "the set of people able to do things to PyPI" is a pretty reasonable set of people. However, we don't want to allow that set of people to grow, because then it gets much harder to campaign to get people to verify (because now we have to track who we have sent an email to in the past, and who we haven't).

Unfortunately, 2fa on PyPI is a non trivial amount of effort, because our uploading requires logging in with a username/password as well and doesn't have any mechanism in it to support 2fa. So we have a bit of a stack of yaks to shave before we're going to be able to meaningfully do that, and I don't think blocking this effort on that makes sense.

@dstufft
Member Author

dstufft commented Aug 3, 2018

Ok we've sent out the email to everyone, and stopped sending email to unverified email addresses, and it appears that people are indeed logging in and verifying their emails. We're down to 35k (from 37k) so far and that number is still dropping.

There's nothing else to be done on this issue, thanks everyone!

@dstufft closed this as completed Aug 3, 2018
@dstufft removed the "needs discussion" label (a product management/policy issue maintainers and users should discuss) Aug 3, 2018
@AkihiroSuda

@dstufft Sorry to be harsh, but your email really looked like a phishing mail 😢
Next time please consider removing all hyperlinks except https://pypi.org?

@terryjreedy

In particular, the email tells people to log in and give their credentials at https://pypi.us18.list-manage.com/track/... .

@di
Member

di commented Aug 6, 2018

Sorry folks, this was an oversight, we don't use MailChimp very often and didn't realize it would automatically wrap URLs in the message body with those tracking links.
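
One way to catch this before sending would be to scan the final rendered message for links that leave pypi.org. A minimal sketch, assuming a hypothetical `foreign_links` helper and a deliberately naive URL regex; it would have to run on the body after the mail service has rewritten it, not on the draft, and the example URLs below are made up.

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"pypi.org"}
# Naive pattern: good enough for a sanity check, not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


def foreign_links(body: str) -> list[str]:
    """Return every URL in `body` whose host is not in ALLOWED_HOSTS."""
    return [u for u in URL_RE.findall(body) if urlparse(u).hostname not in ALLOWED_HOSTS]


body = (
    "Please log in at https://pypi.org/manage/account/ to verify your email, "
    "or follow https://tracker.example.com/track/click?id=123 instead"
)
print(foreign_links(body))
```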

@brainwane
Contributor

And I'll belatedly mention that in mid-July I sent an announcement email to pypi-announce and tweeted as @ThePyPA asking people to verify their email addresses, so that is done.
