Skip to content

Implement the "conversion to plain text" algorithm from DUTS #55 to protect from bidirectional spoofing #5815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Manishearth opened this issue Jul 5, 2023 · 14 comments

Comments

@Manishearth
Copy link
Member

Manishearth commented Jul 5, 2023

Rustc already mitigates against the potential spoofing effects (CVE-2021-42574) of bidirectional format characters.

Since that CVE was released, Unicode has been working on a specification, (Draft) Unicode Technical Standard #55 Unicode Source Code Handling to help programming language designers implement various mitigations to such issues. The goal of this specification is to protect source code from spoofing while retaining the ability to use bidirectional text in source. It contains advice for multiple levels of the stack: compiler warning engines, code formatters, and source code editors.

Essentially, the algorithm works at the level of what it calls "atoms" (basically, tokens, so strings and comments and identifiers are all atoms), and it allows for bidirectional stuff within an atom, but "resets" the text direction after them, using a format character called a left-to-right mark (and two others that "pop" directional formatting). This format character is treated as a space by Rust, and is not one of the characters Rust warns about, since in a left-to-right context all it can do is undo bidirectional stuff, it cannot create Exciting New Bidirectional stuff.

Basically the algorithm is "if the atom starts with an RTL character, insert an LRM after the atom", though it has some nuanced additional stuff as well.

This is better than the existing heuristics since it also ensures that code that legitimately contains bidirectional text will render correctly, i.e. the expression FOO - bar, where FOO is a variable name written in Arabic will render in the right order (instead of rendering as bar - FOO).

The algorithm does also insert some characters that Rust lints about (ones that "pop" directional formatting) in contexts where a comment has insufficiently terminated opening directional format character.

It also does recommend inserting FSI characters which Rust lints about, we should improve the linting algorithm using the official guidance. My proposal in rust-lang/rust#113363 would fix that.


Anyway, I propose we add a format config mode that runs this algorithm. I'm not yet proposing that this be on track to be "default" (I'm not sure if it should be), but it's worth considering.

I don't have familiarity with the rustfmt codebase but I do have familiarity with Unicode properties and the bidirectional algorithm and am happy to help someone implement this. (cc @crlf0710 who may also be interested; and perhaps we can implement some of this inside unicode-security)

@ytmimi
Copy link
Contributor

ytmimi commented Jul 21, 2023

@Manishearth is there a library that you know of that implements the "conversion to plain text" algorithm?

@Manishearth
Copy link
Member Author

No, it's a new spec, and it is tricky to write that algorithm in a reusable way. I'm happy to mentor anyone who wishes to implement it, the algorithm itself is not that complicated.

I would normally recommend putting it in https://github.com/unicode-rs/unicode-security but the algorithm is based on generic concepts like "atoms" and editing text that are tricky to make general APIs for, so it really makes sense as something built expressly for the purposes of rustfmt.

(Ideally, if I can get a signal that this would be accepted, I can even see if I can find someone with interest in implementing this, but I don't want to do that without getting a general +1 from the team first)

@calebcartwright
Copy link
Member

Ideally, if I can get a signal that this would be accepted, I can even see if I can find someone with interest in implementing this, but I don't want to do that without getting a general +1 from the team first

Sure, makes sense (and feel free to consider this comment the 👍).

I confess I've not been able to make time to read the full text of the spec, but based on your summary and a cursory glance of the linked content it seems like there's a lot of rationale for Rust to support this at the formatter level. I think that's especially true in our case given that rustfmt is already processing and manipulating strings in a Unicode-aware context e.g.

let graphemes = UnicodeSegmentation::graphemes(&*stripped_str, false).collect::<Vec<&str>>();

The opportunity for encapsulation and reuse, if any, was the one question I was going to ask too. I.e. what could the rustc lint future look like and is this a capability in full or part that could be a candidate for inclusion in rustfix at somepoint?

@ytmimi
Copy link
Contributor

ytmimi commented Jul 23, 2023

I'm also on board. My main focus for the project right now is to get style_edition config implemented but I'd definitely be interested in helping out with the "conversion to plain text" implementation.

@Manishearth
Copy link
Member Author

@ytmimi

Sounds great! Let me know if you get started.

@calebcartwright

I.e. what could the rustc lint future look like and is this a capability in full or part that could be a candidate for inclusion in rustfix at somepoint?

The lint is definitely something that could go into unicode-security. I'm just not sure there's a way to write reusable code over edits without the abstraction-over-edits being more code than the code we are trying to reuse.

@ytmimi
Copy link
Contributor

ytmimi commented Aug 3, 2023

@Manishearth from what I can tell the algorithm starts by parsing the content of a line into Atoms. I think I'll start by implementing that before taking on the rest of the algorithm

@Manishearth
Copy link
Member Author

Manishearth commented Aug 3, 2023

We should not parse a new notion of atoms, we should use rustfmt's internal concepts for this. Atom here is just stuff like comments and identifiers and literals and operators, rustfmt should already understand this.

(I suggest discussing with the rest of rustfmt team about what is the best abstraction for this; figuring out the right level of abstraction for this is the tricky part here)

@ytmimi
Copy link
Contributor

ytmimi commented Aug 4, 2023

We should not parse a new notion of atoms, we should use rustfmt's internal concepts for this. Atom here is just stuff like comments and identifiers and literals and operators, rustfmt should already understand this.

rustfmt's concept of Atom's then are AST nodes, but I'm struggling to think of a way to add the "conversion to plain text" algorithm directly into that rewrite process without things getting too messy / introducing a lot of code changes across the codebase. Furthermore, the AST doesn't encode whitespace or comments so that would also be tricky to deal with. rustfmt does its best to recover all comments when formatting and it completely rewrites nearly all whitespace in the file.

At a very high level, rustfmt formats each file by walking the AST it get from the rustc parser and produces a formatted string for each node it comes across. That all gets written to an output buffer before rustfmt decides to emit the results (diff output, overriding a file, outputting to stdout, etc.).

Given that rustfmt doesn't operate on lines of code, I feel the best place to implement this feature is somewhere after rustmft has written all formatted code into the output buffer, but before it emits that buffer. Then we can parse out atoms for each line like the algorithm suggests and go from there. Additionally, I feel that implementing the algorithm at this stage in rustfmt's process better centralize the "convert to plain text" logic.

For those reasons I think we'll need to start by implementing an Atom parser. @Manishearth please let me know if that explanation is clear. @calebcartwright let me know if you think there's an easier / cleaner approach to this that I'm overlooking.

@crlf0710
Copy link
Member

crlf0710 commented Aug 4, 2023

https://gist.github.com/crlf0710/87480a71ff9b9d35bd6da0054b0d0645

Here's a POC implementation of the algorithm itself. Wrote it at a weekend, though not too satisfied with the outcome. And the interaction with unicode-bidi crate is still todo!().

Feel free to reuse the code anywhere if anyone feel it's useful.

@Manishearth
Copy link
Member Author

Furthermore, the AST doesn't encode whitespace or comments so that would also be tricky to deal with. rustfmt does its best to recover all comments when formatting and it completely rewrites nearly all whitespace in the file.

This is actually great; you want to run the algorithm after this has already been done. The code that "recovers comments" is what is relevant for atoms; I feel like you should parse comment atoms there.

Parsing atoms at a line-by-line basis feels tricky because you have to know the preceding context. Bear in mind, atoms are not some special spec concept; the spec is defining a collection of Rust concepts. The spec doesn't give any guidance on how to parse atoms because that's up to the target language (here, Rust).

Given that rustfmt doesn't operate on lines of code

But you can walk the tree in order and keep track of when you're on a new line, yes? You're also keeping track of where the newlines are inserted anyway.

But yeah the more I think about this I think parsing atoms at a line by line basis may work, it'll be annoying but fine. I think it's fine to start experimenting with that and see how it goes.

@ytmimi
Copy link
Contributor

ytmimi commented Aug 4, 2023

@crlf0710 thank you for putting that code snippet together! I think it'll come in handy when we're a little further along!

@ytmimi
Copy link
Contributor

ytmimi commented Aug 4, 2023

This is actually great; you want to run the algorithm after this has already been done. The code that "recovers comments" is what is relevant for atoms; I feel like you should parse comment atoms there.

I think rustfmt's current design makes extracting atoms at that point more of a headache than it initially seems. Happy to elaborate further if you'd like some more context.

But you can walk the tree in order and keep track of when you're on a new line, yes? You're also keeping track of where the newlines are inserted anyway

rustfmt walks the tree in order while reformatting, but doesn't really keep track of when it's inserting a newline while it's reformatting. It's technically possible to get the line info from the current node's Span, but then again, that's technically the line number in the source text which isn't necessarily the line number after formatting.

But yeah the more I think about this I think parsing atoms at a line by line basis may work, it'll be annoying but fine. I think it's fine to start experimenting with that and see how it goes.

Great! let's explore this option and see what we can come up with. @Manishearth can I ask you for a review once I'm ready?

@Manishearth
Copy link
Member Author

Sure

@Manishearth
Copy link
Member Author

Also, for figuring out the bidi class of a character, unicode_bidi::bidi_class is probably the easiest for now. It might be better for rustc to switch to ICU4X for unicode dependencies (it already does for list formatting) but we should wait for ICU4X 1.3 so we don't have to muck with data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants