-
Notifications
You must be signed in to change notification settings - Fork 926
Implement the "conversion to plain text" algorithm from DUTS #55 to protect from bidirectional spoofing #5815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@Manishearth is there a library that you know of that implements the "conversion to plain text" algorithm? |
No, it's a new spec, and it is tricky to write that algorithm in a reusable way. I'm happy to mentor anyone who wishes to implement it, the algorithm itself is not that complicated. I would normally recommend putting it in https://github.com/unicode-rs/unicode-security but the algorithm is based on generic concepts like "atoms" and editing text that are tricky to make general APIs for, so it really makes sense as something built expressly for the purposes of rustfmt. (Ideally, if I can get a signal that this would be accepted, I can even see if I can find someone with interest in implementing this, but I don't want to do that without getting a general +1 from the team first) |
Sure, makes sense (and feel free to consider this comment the 👍). I confess I've not been able to make time to read the full text of the spec, but based on your summary and a cursory glance of the linked content it seems like there's a lot of rationale for Rust to support this at the formatter level. I think that's especially true in our case given that rustfmt is already processing and manipulating strings in a Unicode-aware context e.g. Line 79 in a9ae746
The opportunity for encapsulation and reuse, if any, was the one question I was going to ask too. I.e. what could the rustc lint future look like and is this a capability in full or part that could be a candidate for inclusion in rustfix at somepoint? |
I'm also on board. My main focus for the project right now is to get |
Sounds great! Let me know if you get started.
The lint is definitely something that could go into unicode-security. I'm just not sure there's a way to write reusable code over edits without the abstraction-over-edits being more code than the code we are trying to reuse. |
@Manishearth from what I can tell the algorithm starts by parsing the content of a line into Atoms. I think I'll start by implementing that before taking on the rest of the algorithm |
We should not parse a new notion of atoms, we should use rustfmt's internal concepts for this. Atom here is just stuff like comments and identifiers and literals and operators, rustfmt should already understand this. (I suggest discussing with the rest of rustfmt team about what is the best abstraction for this; figuring out the right level of abstraction for this is the tricky part here) |
rustfmt's concept of Atom's then are AST nodes, but I'm struggling to think of a way to add the "conversion to plain text" algorithm directly into that rewrite process without things getting too messy / introducing a lot of code changes across the codebase. Furthermore, the AST doesn't encode whitespace or comments so that would also be tricky to deal with. rustfmt does its best to recover all comments when formatting and it completely rewrites nearly all whitespace in the file. At a very high level, rustfmt formats each file by walking the AST it get from the rustc parser and produces a formatted string for each node it comes across. That all gets written to an output buffer before rustfmt decides to emit the results (diff output, overriding a file, outputting to stdout, etc.). Given that rustfmt doesn't operate on lines of code, I feel the best place to implement this feature is somewhere after rustmft has written all formatted code into the output buffer, but before it emits that buffer. Then we can parse out atoms for each line like the algorithm suggests and go from there. Additionally, I feel that implementing the algorithm at this stage in rustfmt's process better centralize the "convert to plain text" logic. For those reasons I think we'll need to start by implementing an Atom parser. @Manishearth please let me know if that explanation is clear. @calebcartwright let me know if you think there's an easier / cleaner approach to this that I'm overlooking. |
https://gist.github.com/crlf0710/87480a71ff9b9d35bd6da0054b0d0645 Here's a POC implementation of the algorithm itself. Wrote it at a weekend, though not too satisfied with the outcome. And the interaction with Feel free to reuse the code anywhere if anyone feel it's useful. |
This is actually great; you want to run the algorithm after this has already been done. The code that "recovers comments" is what is relevant for atoms; I feel like you should parse comment atoms there. Parsing atoms at a line-by-line basis feels tricky because you have to know the preceding context. Bear in mind, atoms are not some special spec concept; the spec is defining a collection of Rust concepts. The spec doesn't give any guidance on how to parse atoms because that's up to the target language (here, Rust).
But you can walk the tree in order and keep track of when you're on a new line, yes? You're also keeping track of where the newlines are inserted anyway. But yeah the more I think about this I think parsing atoms at a line by line basis may work, it'll be annoying but fine. I think it's fine to start experimenting with that and see how it goes. |
@crlf0710 thank you for putting that code snippet together! I think it'll come in handy when we're a little further along! |
I think rustfmt's current design makes extracting atoms at that point more of a headache than it initially seems. Happy to elaborate further if you'd like some more context.
rustfmt walks the tree in order while reformatting, but doesn't really keep track of when it's inserting a newline while it's reformatting. It's technically possible to get the line info from the current node's
Great! let's explore this option and see what we can come up with. @Manishearth can I ask you for a review once I'm ready? |
Sure |
Also, for figuring out the bidi class of a character, |
Rustc already mitigates against the potential spoofing effects (CVE-2021-42574) of bidirectional format characters.
Since that CVE was released, Unicode has been working on a specification, (Draft) Unicode Technical Standard #55 Unicode Source Code Handling to help programming language designers implement various mitigations to such issues. The goal of this specification is to protect source code from spoofing while retaining the ability to use bidirectional text in source. It contains advice for multiple levels of the stack: compiler warning engines, code formatters, and source code editors.
Essentially, the algorithm works at the level of what it calls "atoms" (basically, tokens, so strings and comments and identifiers are all atoms), and it allows for bidirectional stuff within an atom, but "resets" the text direction after them, using a format character called a left-to-right mark (and two others that "pop" directional formatting). This format character is treated as a space by Rust, and is not one of the characters Rust warns about, since in a left-to-right context all it can do is undo bidirectional stuff, it cannot create Exciting New Bidirectional stuff.
Basically the algorithm is "if the atom starts with an RTL character, insert an LRM after the atom", though it has some nuanced additional stuff as well.
This is better than the existing heuristics since it also ensures that code that legitimately contains bidirectional text will render correctly, i.e. the expression
FOO - bar
, where FOO is a variable name written in Arabic will render in the right order (instead of rendering asbar - FOO
).The algorithm does also insert some characters that Rust lints about (ones that "pop" directional formatting) in contexts where a comment has insufficiently terminated opening directional format character.
It also does recommend inserting FSI characters which Rust lints about, we should improve the linting algorithm using the official guidance. My proposal in rust-lang/rust#113363 would fix that.
Anyway, I propose we add a format config mode that runs this algorithm. I'm not yet proposing that this be on track to be "default" (I'm not sure if it should be), but it's worth considering.
I don't have familiarity with the rustfmt codebase but I do have familiarity with Unicode properties and the bidirectional algorithm and am happy to help someone implement this. (cc @crlf0710 who may also be interested; and perhaps we can implement some of this inside unicode-security)
The text was updated successfully, but these errors were encountered: