Skip to content

Give tip for unicode chars which might not be visible when rendered #100439

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions compiler/rustc_parse/src/lexer/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -322,6 +322,8 @@ impl<'a> StringReader<'a> {
let token = unicode_chars::check_for_substitution(self, start, c, &mut err);
if c == '\x00' {
err.help("source files must contain UTF-8 encoded text, unexpected null bytes might occur when a different encoding is used");
} else if let Ok(v) = &err.suggestions && v.len() == 0 && !c.is_ascii() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a different heuristic for "is not visible when rendered", since this triggers on every non-ascii char.

Also, why are you checking err.suggestions here?

Copy link
Member Author

@chenyukang chenyukang Aug 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a different heuristic for "is not visible when rendered", since this triggers on every non-ascii char.

Also, why are you checking err.suggestions here?

Yes, I'm still thinking any better method for check a unicode...

I'm checking the err.suggestions here, because we already have a strategy to give suggestions for those possible substitution,

let token = unicode_chars::check_for_substitution(self, start, c, &mut err);
.

My idea is if we already have such kind of suggestion, we won't add more, since it's clear enough:

 --> /home/cat/code/rust/src/test/ui/parser/unicode-quote-chars.rs:2:14
  |
2 |     println!(“hello world”);
  |              ^
  |
= help: Unicode character \u{323} might not be visible when rendered
help: Unicode characters '“' (Left Double Quotation Mark) and '”' (Right Double Quotation Mark) look like '"' (Quotation Mark), but are not

I think = help: Unicode character \u{323} might not be visible when rendered here will be redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm checking the err.suggestions here, because we already have a strategy to give suggestions for those possible substitution,

I think if you change your PR to only note characters that have a rendered width of zero (i.e. combining characters, etc) then this check is redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found a better way to figure out "not visible when rendered" chars.
We have a test case in

fn is_es_whitespace(self) -> bool {

The unicode_space_separator, unicode_space_separator and combining_spacing_mark categories seem contain the chars which are hard to spot, but those are also not totaly 'not visible`, and I think introduce those big tables into compiler code base maybe too heavy ..

err.help(format!("Unicode character {} might not be visible when rendered", escaped_char(c)));
}
err.emit();
token?
Expand Down
2 changes: 2 additions & 0 deletions src/test/ui/fmt/incorrect-separator.stderr
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,8 @@ error: unknown start of token: \u{326}
|
LL | format!("A number: {}" ̦ iter::once(42).next().unwrap());
| ^
|
= help: Unicode character \u{326} might not be visible when rendered

error: expected `,`, found `.`
--> $DIR/incorrect-separator.rs:7:27
Expand Down
2 changes: 2 additions & 0 deletions src/test/ui/issues/issue-69130.stderr
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,8 @@ error: unknown start of token: \u{a7}
|
LL | M (§& u8)}
| ^
|
= help: Unicode character \u{a7} might not be visible when rendered

error[E0106]: missing lifetime specifier
--> $DIR/issue-69130.rs:4:5
Expand Down
2 changes: 2 additions & 0 deletions src/test/ui/lexer/lex-bad-token.stderr
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,8 @@ error: unknown start of token: \u{25cf}
|
LL | ●
| ^
|
= help: Unicode character \u{25cf} might not be visible when rendered

error: aborting due to previous error

4 changes: 4 additions & 0 deletions src/test/ui/lexer/lex-bad-unicode-char.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
fn main() {
println!("hello {}"̣, 1);
//~^ ERROR: unknown start of token: \u{323}
}
10 changes: 10 additions & 0 deletions src/test/ui/lexer/lex-bad-unicode-char.stderr
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
error: unknown start of token: \u{323}
--> $DIR/lex-bad-unicode-char.rs:2:24
|
LL | println!("hello {}"̣, 1);
| ^
|
= help: Unicode character \u{323} might not be visible when rendered

error: aborting due to previous error

Binary file modified src/test/ui/parser/issues/issue-66473.stderr
Binary file not shown.