-
Notifications
You must be signed in to change notification settings - Fork 125
Sanitizing html after markdown and tool/inject-html directive rendering #2833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This generally seems like a good approach, and, well, what I was signed up to do but if you're interested I don't object. :-) Some of the rendering tests will need to be updated, of course. I am also curious if you have tested interaction with the tools feature and how it is used with Flutter. Also, sanitize_html will need to be incorporated into the Dart SDK so that's another lift should we decide to go this route. I am in favor of hiding the sanitize_html parameter and making it mandatory at some point. That will be easier to do with hide: true. |
Iterable<String> _addLinkRel(String uri) { | ||
final u = Uri.tryParse(uri); | ||
if (u == null || !['http', 'https', 'mailto'].contains(u.scheme)) { | ||
return ['nofollow']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. This seems to add nofollow to comment references, not just random links. I could see an argument for random links, but why comment references? Telling Google to ignore them is probably going to degrade search rather than improve it on most documentation sets.
@@ -20,16 +23,16 @@ class DocumentationRendererHtml implements DocumentationRenderer { | |||
DocumentationRenderResult render( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good place for this code. I had considered inserting this somewhere in the markdown rendering pipeline too to try to check embedded HTML, but perhaps users of the markdown output already have their own sanitization in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
package:markdown
explicitly says to do sanitization as a post-processing step:
https://pub.dev/packages/markdown#html-sanitization
They also reference Markdown's XSS Vulnerability (and how to mitigate it).
TL;DR: XSS mitigation is best done post-markdown rendering.
Maybe it's possible to write a custom implementation of HtmlRenderer which does sanitizing on-the-fly. But I'm not sure if Text
nodes in markdown can contain inline HTML, as is HtmlRenderer
does not escape <>
when rendering Text
nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick stab a it, and it doesn't look easy t make HtmlRenderer
produce safe HTML... Even just using htmlSerializeEscape
when outputting Text
broke a lot of tests that aimed to ensure that inline HTML keeps working.
My two cents is that maybe... but probably we have to sanitize after markdown rendering -- otherwise, we'll probably have to dramatically refactor our markdown rendering pipeline.
If we wanted to support {@tool}
directives which can side-step HTML sanitizing, then I suggest we extend the markdown parser with custom block syntax, and have it translate {@tool foo --my-flag}...{@end-tool}
into <pre data-tool="foo --my-flag">...</pre>
, then we could replace these <pre>
tags after HTML sanitizing, run tools at this point and decide if we want to sanitize the output of a tool on a tool by tool basis.
Played a bit around with package:markdown
, and we can do things like:
final _beginToolPattern = RegExp(r'^{@tool ([^}\n]+)}$');
final _endToolPattern = RegExp(r'^{@end-tool}');
class ToolBlockSyntax extends BlockSyntax {
@override
RegExp get pattern => _beginToolPattern;
const ToolBlockSyntax();
@override
bool canParse(BlockParser parser) {
return pattern.hasMatch(parser.current);
}
@override
List<String> parseChildLines(BlockParser parser) {
var childLines = <String>[];
parser.advance();
while (!parser.isDone && !_endToolPattern.hasMatch(parser.current)) {
childLines.add(parser.current);
parser.advance();
}
if (!parser.isDone) {
parser.advance();
}
return childLines;
}
@override
Node parse(BlockParser parser) {
var match = pattern.firstMatch(parser.current)!;
var toolCommand = match.group(1)!;
var childLines = parseChildLines(parser);
// We should always end with a line-break (just makes sense, I think)
childLines.add('');
var text = childLines.join('\n');
var pre = Element.text('pre', escapeHtml(text));
pre.attributes['data-tool'] = escapeHtmlAttribute(toolCommand);
return pre;
}
}
Then we just have find <pre data-tool="COMMAND">STDIN</pre>
afterwards, decode the HTML text and attribute text before running the tool. Anyways, maybe that's a larger refactor :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this probably involves a quite a bit of refactoring to make it work -- as we would effectively need to move execution of tools and rendering of markdown into the renderer. I think it makes some sense to put rendering of markdown in the renderer, but I can also see the argument that execution of tools might not need to happen there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. Moving tool execution to the back end rather than the front breaks all kinds of internal assumptions in dartdoc and runs into a sync vs. async problem as well IIRC from the last time I tried to change the pipeline.
When you say that execution of tools "might not need to happen here" do you mean that it literally doesn't need to run, or just that it doesn't have to be moved? I do wonder if we can get away with "not solving this" for pub for the moment -- if we just forbid tool execution when the post-processing flag is on and then leave me the cleanup to get this working in a more comprehensive manner. IIRC tool execution is forbidden anyhow on pub?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC tool execution is forbidden anyhow on pub?
Yes,
I'm thinking that -- if you want to do tool execution you have two options:
- Execute tools before sanitizing HTML (approach I took in this PR)
- Pros:
- This makes side-stepping the HTML sanitizer impossible.
- This is clearly the safest option.
- Tools that produce images and tables should still work perfectly fine.
- Cons:
- This puts limits on the kinds of HTML that tools can produce.
- You can't choose not to sanitize the HTML produced by tools.
- If tools are trusted, you might want to avoid trust the HTML produced.
- Pros:
- Execute tools after sanitizing HTML.
- Pros:
- This makes it possible to have trusted tools, that generate interactive experiences with embedded CSS / JS.
- It is still possible to have untrusted tools where we sanitize the HTML output of the tool.
- Cons:
- This isn't as safe as sanitizing all HTML that comes out of any tool.
- Pros:
It's not unreasonable that dartdoc simply stick with (1), and then if needed expand the allowlist for the sanitizer. In practice most tools probably just want to:
- Create images and include these with an
<img>
tag (allowed), - Embed an iframe from youtube or dartpad (could be allowed with a small extension to the sanitizer),
And maybe allowing embedding of sandboxed iframes could offer an escape hatch for remaining use-cases. So perhaps option (1) is fine, and we just always sanitize output from tools and {@inject-html}
directives.
Note on HTML sanitizing:
- A lot of the cons can be mitigated by expanding the allow-listed set of HTML tags and attributes.
- Thus, it's not hard to expand the sanitizer to allow:
- inline SVG,
- sandboxed iframes (maybe sandboxed iframes with dataurls could be used to allow arbitrary HTML including css/js),
- iframes from trusted domains.
- But it would be very hard to expand the sanitizer to support:
- CSS (both
<style>
or attributesstyle="..."
) - Javascript (both
<script>
and attributesonclick="..."
)
- CSS (both
- Expanding the sanitizer to allow new tags and attributes is not very hard. But since everything needs to be allowlisted we would always be playing catch-up. And we would probably want to support a CSS subset at some point. On the other hand, we can get very far with a fairly limited allowlist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed notes on all of this!
Thinking about this from dartdoc's perspective going with #1
seems reasonable for now. Being as safe as possible seems a good default choice in any event. But we're going to have to do something about Flutter as script tags are part of their existing tool infrastructure. Perhaps we can make the sanitizer able to be disabled in dartdoc_options.yaml so it can be scoped? Pub already will filter things from dartdoc_options files so that seems like it could be reasonable to do (as long as pub knows that dartdoc_options.yaml files can be located anywhere in the tree?).
Examples from Flutter docs:
Trusted domain iframe from a youtube
directive:
https://api.flutter.dev/flutter/material/material-library.html
Here is an example of a tool inserting script tags:
https://api.flutter.dev/flutter/material/AboutListTile-class.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trusted domain iframe from a youtube directive:
https://api.flutter.dev/flutter/material/material-library.html
I think it would be easy allow iframes from trusted domains. Or we could simply hardcode an exception for youtube, that seems reasonable given how many users want to embed youtube.
But in general, I suspect <iframe sandbox="allow-same-origin allow-scripts"
is reasonably safe, so long as the embedded document is, either: (i) from a separate domain, or (ii) trusted (if it's from the same domain). It seems like a reasonable expectation that if the document is from the same domain, then it's trusted.
Here is an example of a tool inserting script tags:
https://api.flutter.dev/flutter/material/AboutListTile-class.html
Embedding <script>
inline violates a lot of good practices, include security recommendations around content-security-policies (I think we strive to use something like strict CSP whenever possible).
In many cases, it would be better if the {@tool
directive generated something like:
<div data-snippet-buttons="longSnippet1"></div>
Then dartdoc
options to inject a <script src="static/snippet-buttons.js">
file in the header. Then have snippet-buttons.js
do something like:
window.addEventListener('DOMContentLoaded', function() {
document.querySelectorAll('[data-snippet-buttons]').forEach(function(div) {
// Do whatever you want to do with `div`, setup child elements and onclick handlers...
});
});
Hmm, looking at it Flutter docs already embed https://api.flutter.dev/assets/snippets.js
, so instead of embedding onclick
handlers in the HTML code, they just have use data-...
properties (or special CSS classes) to indicate that an HTML element should onclick
handlers, and set it up in an DOMContentLoaded
handler in snippets.js
.
It's similar to what we do with syntax highlighting, we just apply a special CSS class language-<language>
on a <pre>
tag, and then highlight.pack.js?v1
will do the rest in an DOMContentLoaded
handler.
So perhaps we should just allow custom data-
attributes, or we could limit it to only allow data-attributes on the form data-custom-*
(that way you can't mess with data-base-href
and data-using-base-href
which are already used in dartdoc generate output).
} else { | ||
// Never allow scripts! | ||
for (var s in asHtmlFragment.querySelectorAll('script')) { | ||
s.remove(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly we can't do this, see other comments. I'm slightly surprised that the tests still work (probably because this condition is Flutter specific), but browsing to generated flutter documentation will definitely break if we unconditionally forbid this.
The script tags we need are structured in a very, very specific way -- maybe we can pattern match here.
Tagging @gspencergoog , but it looks like maybe we could just get away with dumping any script that doesn't match the pattern, here:
https://github.com/flutter/flutter/blob/5d587f9528fbeb7f308d686b7e5751dd232640db/dev/snippets/config/skeletons/sample.html#L12
(where serial
is an integer)
I mean, it's kind of hacky, but if it works? Definitely open to better ideas here too. Making this a configuration option seems like it might be wise too, in case we try it and go "oh no it is using the wrong thing".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some onclick
s here too that maybe could be handled similarly when the sanitizer is on.
Maybe even some sort of dartdoc_options.yaml option that takes a set of regular expressions to match onClick or script tags. And yes, that scares me.... will continue scratching my head on this, and input very welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this because this is what happens on master
, currently:
dartdoc/lib/src/render/documentation_renderer.dart
Lines 29 to 31 in b4fa868
for (var s in asHtmlDocument.querySelectorAll('script')) { | |
s.remove(); | |
} |
But yeah, looking at the code it seems that <script>var visibleSnippet1 = "longSnippet1";</script>
is present when I look do view-source:https://api.flutter.dev/flutter/widgets/AbsorbPointer-class.html
I'm 100% okay removing it, since this is with HTML sanitizing not enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'm alright with this as is then, we can iterate to improve matters. I do think some end-to-end tests by hand would be good; the pipeline is complicated enough for doc generation that I want to be sure we're not missing something.
@jcollins-g, I can't merge, if you're happy feel free to merge :D Also when do you think this will be published? |
targeting publish and roll for this 10/25 |
I stole some of the logic for HTML sanitizing from package:sanitize_html which contributed under CLA (so I think that's okay). We can't use the package directly because:
<pre class=...
and we don't want to parse twice.It's also possible that we want to evolve the sanitizing rules in
dartdoc
to be different fromsanitize_html
which aims to be a very simple and reasonably foolproof package. The rules proposed in this PR aims to mirror rules for GFM on GithubI added an
--no-sanitize-html
option, but it's possible that we should just fold this into--inject-html
, since it's more or less the same thing. However, since sanitizing happens after{@inject-html}
directives it can be argued that--no-sanitize-html
is strictly more dangerous.We could also consider different HTML sanitation profiles:
unsafe
, no HTML sanitizing,gfm
, same sanitizing as happens on Github with GFM, and,safe
, allow most things that are unlikely to be dangerous.The different between
gfm
andsafe
is that there is lots of HTML tags and attributes which are not necessary dangerous, but could easily allow HTML to mess-up the layout of the page. Thegfm
sanitation profile aims to reduce the likelihood of this happening.