Commit 04e8fc0

Respect non-ASCII identifiers in sanitization for clearer names
See also Rust RFC 2457: rust-lang/rfcs#2457
1 parent 4ad26e5 commit 04e8fc0

5 files changed: 92 additions, 29 deletions

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -6,6 +6,14 @@ Notable library changes are documented here in a format based on
 
 ## Unreleased
 
+### Changed
+
+- Respect Unicode identifiers in
+  [name sanitization](https://github.com/evolutics/iftree#name-sanitization).
+  If you only use ASCII file paths, then this change has no effect. Essentially,
+  non-ASCII characters that are valid in identifiers (from Rust 1.53.0) are
+  preserved instead of replaced by an underscore `"_"`.
+
 ## 0.1.1 – 2021-05-14
 
 ### Fixed

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ quote = "1.0"
 serde = { version = "1.0", features = ["derive"] }
 syn = { version = "1.0", features = ["default", "extra-traits"] }
 toml = "0.5"
+unicode-xid = "0.2"
 
 [dev-dependencies]
 actix-web = "3.3"

README.md

Lines changed: 19 additions & 8 deletions
@@ -163,20 +163,31 @@ See
 
 ### Name sanitization
 
-When generating identifiers based on paths, names are sanitized as follows to
-ensure they are
-[valid identifiers](https://doc.rust-lang.org/reference/identifiers.html):
-
-- Characters other than ASCII alphanumericals are replaced by `"_"`.
-- If the first character is numeric, then `"_"` is prepended.
-- If the name is `"_"`, `"crate"`, `"self"`, `"Self"`, or `"super"`, then `"_"`
-  is appended.
+When generating identifiers based on paths, names are sanitized. For example, a
+folder name `.my-assets` is sanitized to an identifier `_my_assets`.
+
+The sanitization process is designed to generate valid
+[Unicode identifiers](https://doc.rust-lang.org/nightly/reference/identifiers.html).
+Essentially, it replaces invalid identifier characters by underscores `"_"`.
+More precisely:
+
+1. Characters without the property `XID_Continue` are replaced by `"_"`. The set
+   of `XID_Continue` characters in ASCII is `[0-9A-Z_a-z]`.
+1. Next, if the first character does not have the property `XID_Start`, then
+   `"_"` is prepended unless the first character is already `"_"`. The set of
+   `XID_Start` characters in ASCII is `[A-Za-z]`.
+1. Finally, if the name is `"_"`, `"crate"`, `"self"`, `"Self"`, or `"super"`,
+   then `"_"` is appended.
 
 Names are further adjusted to respect naming conventions in the default case:
 
 - Lowercase for folders (because they map to module names).
 - Uppercase for filenames (because they map to static variables).
 
+Note that non-ASCII identifiers are only supported from Rust 1.53.0. For earlier
+versions, the sanitization here may generate invalid identifiers if you use
+non-ASCII paths, in which case you need to manually rename the affected files.
+
 ### Portable file paths
 
 To prevent issues when developing on different platforms, any paths in your
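
Reviewer sketch (not part of the commit): the three documented sanitization steps can be traced with the standard library alone. This is a minimal approximation under stated assumptions. The helper names `is_continue`, `is_start`, and `sanitize` are hypothetical, and `char::is_alphanumeric` / `char::is_alphabetic` only stand in for the `XID_Continue` / `XID_Start` properties. They match the ASCII sets quoted in the README but can differ for other Unicode characters, which is why the commit itself depends on `unicode-xid`.

```rust
// Sketch only: std character classes approximate the XID properties.
// The real change uses `unicode_xid::UnicodeXID` instead.

/// Approximates `XID_Continue` (assumption). In ASCII this is `[0-9A-Z_a-z]`.
fn is_continue(character: char) -> bool {
    character == '_' || character.is_alphanumeric()
}

/// Approximates `XID_Start` (assumption). In ASCII this is `[A-Za-z]`.
fn is_start(character: char) -> bool {
    character.is_alphabetic()
}

/// Applies the three documented steps in order.
fn sanitize(name: &str) -> String {
    // Step 1: replace characters that cannot continue an identifier.
    let replaced: String = name
        .chars()
        .map(|character| if is_continue(character) { character } else { '_' })
        .collect();

    // Step 2: prepend "_" unless the first character can start an identifier
    // or already is "_". An empty name also takes the prepending branch.
    let prefixed = match replaced.chars().next() {
        Some(first) if is_start(first) || first == '_' => replaced,
        _ => format!("_{}", replaced),
    };

    // Step 3: append "_" to the lone underscore and to the listed keywords.
    match prefixed.as_str() {
        "_" | "crate" | "self" | "Self" | "super" => format!("{}_", prefixed),
        _ => prefixed,
    }
}
```

Under these stand-ins, `sanitize(".my-assets")` yields `_my_assets`, matching the README example. An empty name becomes `_` in step 2 and then `__` in step 3, which is presumably why the explicit `""` case is removed in `sanitize_special_cases`.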

src/generate_view/sanitize_name.rs

Lines changed: 45 additions & 13 deletions
@@ -21,7 +21,7 @@ fn sanitize_by_convention(name: &str, convention: Convention) -> String {
 fn sanitize_special_characters(name: &str) -> String {
     name.chars()
         .map(|character| {
-            if character.is_ascii_alphanumeric() {
+            if unicode_xid::UnicodeXID::is_xid_continue(character) {
                 character
             } else {
                 '_'
@@ -32,14 +32,14 @@ fn sanitize_special_characters(name: &str) -> String {
 
 fn sanitize_first_character(name: String) -> String {
     match name.chars().next() {
-        Some(first_character) if first_character.is_numeric() => format!("_{}", name),
-        _ => name,
+        Some(first_character) if unicode_xid::UnicodeXID::is_xid_start(first_character) => name,
+        Some('_') => name,
+        _ => format!("_{}", name),
     }
 }
 
 fn sanitize_special_cases(name: String) -> String {
     match name.as_ref() {
-        "" => String::from("__"),
         "_" | "crate" | "self" | "Self" | "super" => format!("{}_", name),
         _ => name,
     }
@@ -60,33 +60,65 @@ mod tests {
 
     #[test]
     fn handles_convention_of_screaming_snake_case() {
-        let actual = main("README.md", Convention::ScreamingSnakeCase);
+        let actual = main("README_ß_ʼn.md", Convention::ScreamingSnakeCase);
 
-        let expected = quote::format_ident!("r#README_MD");
+        let expected = quote::format_ident!("r#README_SS_ʼN_MD");
         assert_eq!(actual, expected);
     }
 
     #[test]
     fn handles_convention_of_snake_case() {
-        let actual = main("README.md", Convention::SnakeCase);
+        let actual = main("README_ß_ʼn.md", Convention::SnakeCase);
 
-        let expected = quote::format_ident!("r#readme_md");
+        let expected = quote::format_ident!("r#readme_ß_ʼn_md");
         assert_eq!(actual, expected);
     }
 
     #[test]
     fn handles_special_characters() {
-        let actual = main("A B##C_D±EÅF𝟙G.H", Convention::ScreamingSnakeCase);
+        let actual = main("_0 1##2$3±4√5👽6.7", stubs::convention());
+
+        let expected = quote::format_ident!("r#_0_1__2_3_4_5_6_7");
+        assert_eq!(actual, expected);
+    }
+
+    #[test]
+    fn handles_non_ascii_identifiers() {
+        let actual = main("åb_π_𝟙", Convention::SnakeCase);
+
+        let expected = quote::format_ident!("r#åb_π_𝟙");
+        assert_eq!(actual, expected);
+    }
+
+    #[test]
+    fn handles_first_character_if_xid_start() {
+        let actual = main("a", Convention::SnakeCase);
+
+        let expected = quote::format_ident!("r#a");
+        assert_eq!(actual, expected);
+    }
+
+    #[test]
+    fn handles_first_character_if_underscore() {
+        let actual = main("_2", stubs::convention());
+
+        let expected = quote::format_ident!("r#_2");
+        assert_eq!(actual, expected);
+    }
+
+    #[test]
+    fn handles_first_character_if_xid_continue_but_not_xid_start() {
+        let actual = main("3", stubs::convention());
 
-        let expected = quote::format_ident!("r#A_B__C_D_E_F_G_H");
+        let expected = quote::format_ident!("r#_3");
         assert_eq!(actual, expected);
     }
 
     #[test]
-    fn handles_first_character() {
-        let actual = main("2a", Convention::SnakeCase);
+    fn handles_first_character_if_not_xid_continue() {
+        let actual = main(".4", stubs::convention());
 
-        let expected = quote::format_ident!("r#_2a");
+        let expected = quote::format_ident!("r#_4");
         assert_eq!(actual, expected);
     }
 
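
Reviewer sketch (not part of the commit): the updated convention tests expect `ß` to become `SS` in screaming snake case. This is consistent with Rust's standard full case mapping, where one character can uppercase to several. The helpers `upper_of` and `screaming` below are hypothetical and only illustrate that mapping. How the crate itself applies the convention is not visible in this diff, so treating it as a plain `to_uppercase` call is an assumption.

```rust
/// Full uppercase mapping of a single character; may yield several characters.
fn upper_of(character: char) -> String {
    character.to_uppercase().collect()
}

/// Hypothetical stand-in for the screaming snake case convention:
/// uppercases an already sanitized name (assumption, see lead-in).
fn screaming(name: &str) -> String {
    name.to_uppercase()
}
```

For instance, `upper_of('ß')` returns `"SS"`, so a sanitized name can grow in length once the convention is applied.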

src/lib.rs

Lines changed: 19 additions & 8 deletions
@@ -161,20 +161,31 @@
 //!
 //! ## Name sanitization
 //!
-//! When generating identifiers based on paths, names are sanitized as follows to
-//! ensure they are
-//! [valid identifiers](https://doc.rust-lang.org/reference/identifiers.html):
-//!
-//! - Characters other than ASCII alphanumericals are replaced by `"_"`.
-//! - If the first character is numeric, then `"_"` is prepended.
-//! - If the name is `"_"`, `"crate"`, `"self"`, `"Self"`, or `"super"`, then `"_"`
-//!   is appended.
+//! When generating identifiers based on paths, names are sanitized. For example, a
+//! folder name `.my-assets` is sanitized to an identifier `_my_assets`.
+//!
+//! The sanitization process is designed to generate valid
+//! [Unicode identifiers](https://doc.rust-lang.org/nightly/reference/identifiers.html).
+//! Essentially, it replaces invalid identifier characters by underscores `"_"`.
+//! More precisely:
+//!
+//! 1. Characters without the property `XID_Continue` are replaced by `"_"`. The set
+//!    of `XID_Continue` characters in ASCII is `[0-9A-Z_a-z]`.
+//! 1. Next, if the first character does not have the property `XID_Start`, then
+//!    `"_"` is prepended unless the first character is already `"_"`. The set of
+//!    `XID_Start` characters in ASCII is `[A-Za-z]`.
+//! 1. Finally, if the name is `"_"`, `"crate"`, `"self"`, `"Self"`, or `"super"`,
+//!    then `"_"` is appended.
 //!
 //! Names are further adjusted to respect naming conventions in the default case:
 //!
 //! - Lowercase for folders (because they map to module names).
 //! - Uppercase for filenames (because they map to static variables).
 //!
+//! Note that non-ASCII identifiers are only supported from Rust 1.53.0. For earlier
+//! versions, the sanitization here may generate invalid identifiers if you use
+//! non-ASCII paths, in which case you need to manually rename the affected files.
+//!
 //! ## Portable file paths
 //!
 //! To prevent issues when developing on different platforms, any paths in your
