Skip to content

Normalize oauth email username #28561

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jan 4, 2024

Conversation

kdumontnu
Copy link
Contributor

@kdumontnu kdumontnu commented Dec 20, 2023

Closes #28461

I ran into a couple potential solutions here:

  • We could follow GitHub/Sourcegraph and replace any character outside of our supported character set with -.
    • The problem here is that there are still some invalid characters that would result in "broken" usernames (eg. here matthias.schöpfer would become matthias.sch-pfer)
    • This could silently hide mask some user configuration problems
    • The change will be breaking if we ever change it (eg. we are safe to replace more characters, but we will break user linking if we change character replacement in the future).
  • We could do the above, but go one step further and remove diacritics before replacement
    • This has the same downsides as above, but aims to solve one more reported issue
    • Still doesn't handle some cases, like Ægidius -> -gidius
  • [Implemented] Take a more selective approach that can be expanded in the future. Create a set of known characters that should be removed, set of characters that should be replaced (with -), and a set of custom replacements (eg Æ -> AE)
    • This doesn't have the downside of hiding potential user configuration errors
    • It's safe to add to the set in the future (or even move to one of the more aggressive approaches above)

There doesn't seem to be a well-supported go package that does unicode -> ascii normalization. There are some other solutions, like this, that build a relatively complete replacement map. That would probably be more complete and somewhat more performant. However, it's also way overkill for us, and as I mentioned previously, once we replace a character, changing that will be a breaking change for users. However, if illegal characters continue to be an issue and we have to revisit this, we can use a solution like that.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Dec 20, 2023
@GiteaBot GiteaBot added lgtm/blocked A maintainer has reservations with the PR and thus it cannot be merged and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Dec 20, 2023
testCases := [][]interface{}{
// input name, expected normalized name, is normalized name valid
{"test", "test", true},
{"Sinéad.O'Connor", "Sin-ad.OConnor", true}, // We should consider allowing custom replacement characters (eg. é -> e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's necessary to do so .... maybe the Go library also has ability to do so.

Otherwise Sin-ad doesn't seem good for a username.

Copy link
Contributor

@wxiaoguang wxiaoguang Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason why it's necessary: maybe nobody likes making "breaking" changes again and again.

Think about a case: if this PR doesn't have a complete solution, in the future, if someone decides to "consider allowing replacement characters", what they could do? They would needs to introduce "option_v2", "option_v3" .... it would be a mess.

Copy link
Contributor Author

@kdumontnu kdumontnu Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some discussion in discord about this too.

I'm not sure that we want to take on normalizing every possible user input the way that a user would want/expect. There are hundreds thousands of characters we'd have to transform from unicode to ascii. The most accessible method would be to extend this solution in the future with either another config setting, like NORMALIZE_CHARS, or add it directly to the auth source form.

However... if we want to take it on, there is a pretty good lookup table here we could transpose from javascript: https://web.archive.org/web/20120918093154/http://lehelk.com/2011/05/06/script-to-remove-diacritics/

I'm happy to go either way, but if we do the first method, we can follow up with a future, non-breaking PR.

I guess the problem with the first option is that admins could easily break user account external linking if they change the character replacements and then restart the server, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unicode standard has defined it.

https://go.dev/blog/normalization

Copy link
Contributor Author

@kdumontnu kdumontnu Dec 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That article is specifically for characters that can be decomposed with diacritics. So, it should work for characters with accents (though the code they provide is out of date and needs to be modified), but is not a general-solution nor a complete "normalization" imo.

I think there is a good reason other software services don't do this (eg.): we are now making some assumptions about how users' names should be transformed, and it's only one step in that direction, which could potentially be endless. For example, Ægidius will still become -gidius, which will fail user creation.

After putting some thought into it I agree that we should the change you suggest, as it solves an additional user issue. However, if we choose to go this direction I suggest we do NOT replace unknown characters with -. The reasons being:

  • As you say, that could result in breaking changes in the future if we decide to replace additional characters. Instead, we selectively add characters into the replacement set safely.
  • This solves all of the open issues, while providing significant, incremental improvement
  • It's less likely that we are silently hiding a user configuration error

@kdumontnu kdumontnu force-pushed the kd/normalize_oauth_email_username branch 2 times, most recently from 001a679 to 39cd778 Compare December 31, 2023 23:44
@kdumontnu kdumontnu requested a review from jolheiser January 2, 2024 14:54
@GiteaBot GiteaBot removed the lgtm/blocked A maintainer has reservations with the PR and thus it cannot be merged label Jan 4, 2024
@GiteaBot GiteaBot added the lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. label Jan 4, 2024
@jolheiser jolheiser merged commit 54acf7b into go-gitea:main Jan 4, 2024
@GiteaBot GiteaBot added this to the 1.22.0 milestone Jan 4, 2024
@kdumontnu kdumontnu deleted the kd/normalize_oauth_email_username branch January 4, 2024 01:36
zjjhot added a commit to zjjhot/gitea that referenced this pull request Jan 4, 2024
* giteaofficial/main:
  Normalize oauth email username (go-gitea#28561)
  Fix wrapping of label list (go-gitea#28684)
  Fix grammar in `actions.variables.id_not_exist` (en-US) (go-gitea#28680)
  Fix grammar issues on the repository Actions page (en-US) (go-gitea#28679)
  Fix tooltip of variable edit button (go-gitea#28681)
  Make cross-reference issue links work in markdown documents again (go-gitea#28682)
fuxiaohei pushed a commit to fuxiaohei/gitea that referenced this pull request Jan 17, 2024
silverwind pushed a commit to silverwind/gitea that referenced this pull request Feb 20, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OAuth Client username normalization
5 participants