Sanitizing html after markdown and tool/inject-html directive rendering #2833

jonasfj · 2021-10-12T15:54:49Z

I stole some of the logic for HTML sanitizing from package:sanitize_html which contributed under CLA (so I think that's okay). We can't use the package directly because:

We need to handle a few special cases around <pre class=... and we don't want to parse twice.
The package is an unofficial Google project, not included in the Dart SDK.

It's also possible that we want to evolve the sanitizing rules in dartdoc to be different from sanitize_html which aims to be a very simple and reasonably foolproof package. The rules proposed in this PR aims to mirror rules for GFM on Github

I added an --no-sanitize-html option, but it's possible that we should just fold this into --inject-html, since it's more or less the same thing. However, since sanitizing happens after {@inject-html} directives it can be argued that --no-sanitize-html is strictly more dangerous.

We could also consider different HTML sanitation profiles:

unsafe, no HTML sanitizing,
gfm, same sanitizing as happens on Github with GFM, and,
safe, allow most things that are unlikely to be dangerous.

The different between gfm and safe is that there is lots of HTML tags and attributes which are not necessary dangerous, but could easily allow HTML to mess-up the layout of the page. The gfm sanitation profile aims to reduce the likelihood of this happening.

jcollins-g · 2021-10-12T16:12:44Z

This generally seems like a good approach, and, well, what I was signed up to do but if you're interested I don't object. :-) Some of the rendering tests will need to be updated, of course. I am also curious if you have tested interaction with the tools feature and how it is used with Flutter. Also, sanitize_html will need to be incorporated into the Dart SDK so that's another lift should we decide to go this route.

I am in favor of hiding the sanitize_html parameter and making it mandatory at some point. That will be easier to do with hide: true.

jcollins-g · 2021-10-12T16:33:27Z

lib/src/render/documentation_renderer.dart

+Iterable<String> _addLinkRel(String uri) {
+  final u = Uri.tryParse(uri);
+  if (u == null || !['http', 'https', 'mailto'].contains(u.scheme)) {
+    return ['nofollow'];


Hmm. This seems to add nofollow to comment references, not just random links. I could see an argument for random links, but why comment references? Telling Google to ignore them is probably going to degrade search rather than improve it on most documentation sets.

jcollins-g · 2021-10-12T16:36:27Z

lib/src/render/documentation_renderer.dart

  const DocumentationRendererHtml();

  @override
  DocumentationRenderResult render(


This is a good place for this code. I had considered inserting this somewhere in the markdown rendering pipeline too to try to check embedded HTML, but perhaps users of the markdown output already have their own sanitization in place.

package:markdown explicitly says to do sanitization as a post-processing step:
https://pub.dev/packages/markdown#html-sanitization

They also reference Markdown's XSS Vulnerability (and how to mitigate it).

TL;DR: XSS mitigation is best done post-markdown rendering.

Maybe it's possible to write a custom implementation of HtmlRenderer which does sanitizing on-the-fly. But I'm not sure if Text nodes in markdown can contain inline HTML, as is HtmlRenderer does not escape <> when rendering Text nodes.

I took a quick stab a it, and it doesn't look easy t make HtmlRenderer produce safe HTML... Even just using htmlSerializeEscape when outputting Text broke a lot of tests that aimed to ensure that inline HTML keeps working.

My two cents is that maybe... but probably we have to sanitize after markdown rendering -- otherwise, we'll probably have to dramatically refactor our markdown rendering pipeline.

If we wanted to support {@tool} directives which can side-step HTML sanitizing, then I suggest we extend the markdown parser with custom block syntax, and have it translate {@tool foo --my-flag}...{@end-tool} into <pre data-tool="foo --my-flag">...</pre>, then we could replace these <pre> tags after HTML sanitizing, run tools at this point and decide if we want to sanitize the output of a tool on a tool by tool basis.

Played a bit around with package:markdown, and we can do things like:

final _beginToolPattern = RegExp(r'^{@tool ([^}\n]+)}$'); final _endToolPattern = RegExp(r'^{@end-tool}'); class ToolBlockSyntax extends BlockSyntax { @override RegExp get pattern => _beginToolPattern; const ToolBlockSyntax(); @override bool canParse(BlockParser parser) { return pattern.hasMatch(parser.current); } @override List<String> parseChildLines(BlockParser parser) { var childLines = <String>[]; parser.advance(); while (!parser.isDone && !_endToolPattern.hasMatch(parser.current)) { childLines.add(parser.current); parser.advance(); } if (!parser.isDone) { parser.advance(); } return childLines; } @override Node parse(BlockParser parser) { var match = pattern.firstMatch(parser.current)!; var toolCommand = match.group(1)!; var childLines = parseChildLines(parser); // We should always end with a line-break (just makes sense, I think) childLines.add(''); var text = childLines.join('\n'); var pre = Element.text('pre', escapeHtml(text)); pre.attributes['data-tool'] = escapeHtmlAttribute(toolCommand); return pre; } }

Then we just have find <pre data-tool="COMMAND">STDIN</pre> afterwards, decode the HTML text and attribute text before running the tool. Anyways, maybe that's a larger refactor :D

Hmm, this probably involves a quite a bit of refactoring to make it work -- as we would effectively need to move execution of tools and rendering of markdown into the renderer. I think it makes some sense to put rendering of markdown in the renderer, but I can also see the argument that execution of tools might not need to happen there.

Ugh. Moving tool execution to the back end rather than the front breaks all kinds of internal assumptions in dartdoc and runs into a sync vs. async problem as well IIRC from the last time I tried to change the pipeline.

When you say that execution of tools "might not need to happen here" do you mean that it literally doesn't need to run, or just that it doesn't have to be moved? I do wonder if we can get away with "not solving this" for pub for the moment -- if we just forbid tool execution when the post-processing flag is on and then leave me the cleanup to get this working in a more comprehensive manner. IIRC tool execution is forbidden anyhow on pub?

IIRC tool execution is forbidden anyhow on pub?

Yes,

I'm thinking that -- if you want to do tool execution you have two options:

Execute tools before sanitizing HTML (approach I took in this PR)

Pros:

This makes side-stepping the HTML sanitizer impossible.

This is clearly the safest option.

Tools that produce images and tables should still work perfectly fine.

Cons:

This puts limits on the kinds of HTML that tools can produce.

You can't choose not to sanitize the HTML produced by tools.

If tools are trusted, you might want to avoid trust the HTML produced.

Execute tools after sanitizing HTML.

Pros:

This makes it possible to have trusted tools, that generate interactive experiences with embedded CSS / JS.

It is still possible to have untrusted tools where we sanitize the HTML output of the tool.

Cons:

This isn't as safe as sanitizing all HTML that comes out of any tool.

It's not unreasonable that dartdoc simply stick with (1), and then if needed expand the allowlist for the sanitizer. In practice most tools probably just want to:

Create images and include these with an <img> tag (allowed),

Embed an iframe from youtube or dartpad (could be allowed with a small extension to the sanitizer),

And maybe allowing embedding of sandboxed iframes could offer an escape hatch for remaining use-cases. So perhaps option (1) is fine, and we just always sanitize output from tools and {@inject-html} directives.

Note on HTML sanitizing:

A lot of the cons can be mitigated by expanding the allow-listed set of HTML tags and attributes.

Thus, it's not hard to expand the sanitizer to allow:

inline SVG,

sandboxed iframes (maybe sandboxed iframes with dataurls could be used to allow arbitrary HTML including css/js),

iframes from trusted domains.

But it would be very hard to expand the sanitizer to support:

CSS (both <style> or attributes style="...")

Javascript (both <script> and attributes onclick="...")

Expanding the sanitizer to allow new tags and attributes is not very hard. But since everything needs to be allowlisted we would always be playing catch-up. And we would probably want to support a CSS subset at some point. On the other hand, we can get very far with a fairly limited allowlist.

Thanks for the detailed notes on all of this!

Thinking about this from dartdoc's perspective going with #1 seems reasonable for now. Being as safe as possible seems a good default choice in any event. But we're going to have to do something about Flutter as script tags are part of their existing tool infrastructure. Perhaps we can make the sanitizer able to be disabled in dartdoc_options.yaml so it can be scoped? Pub already will filter things from dartdoc_options files so that seems like it could be reasonable to do (as long as pub knows that dartdoc_options.yaml files can be located anywhere in the tree?).

Examples from Flutter docs:

Trusted domain iframe from a youtube directive:
https://api.flutter.dev/flutter/material/material-library.html

Here is an example of a tool inserting script tags:
https://api.flutter.dev/flutter/material/AboutListTile-class.html

Trusted domain iframe from a youtube directive:
https://api.flutter.dev/flutter/material/material-library.html

I think it would be easy allow iframes from trusted domains. Or we could simply hardcode an exception for youtube, that seems reasonable given how many users want to embed youtube.

But in general, I suspect <iframe sandbox="allow-same-origin allow-scripts" is reasonably safe, so long as the embedded document is, either: (i) from a separate domain, or (ii) trusted (if it's from the same domain). It seems like a reasonable expectation that if the document is from the same domain, then it's trusted.

Here is an example of a tool inserting script tags:
https://api.flutter.dev/flutter/material/AboutListTile-class.html

Embedding <script> inline violates a lot of good practices, include security recommendations around content-security-policies (I think we strive to use something like strict CSP whenever possible).

In many cases, it would be better if the {@tool directive generated something like:

<div data-snippet-buttons="longSnippet1"></div>

Then dartdoc options to inject a <script src="static/snippet-buttons.js"> file in the header. Then have snippet-buttons.js do something like:

window.addEventListener('DOMContentLoaded', function() { document.querySelectorAll('[data-snippet-buttons]').forEach(function(div) { // Do whatever you want to do with `div`, setup child elements and onclick handlers... }); });

Hmm, looking at it Flutter docs already embed https://api.flutter.dev/assets/snippets.js, so instead of embedding onclick handlers in the HTML code, they just have use data-... properties (or special CSS classes) to indicate that an HTML element should onclick handlers, and set it up in an DOMContentLoaded handler in snippets.js.

It's similar to what we do with syntax highlighting, we just apply a special CSS class language-<language> on a <pre> tag, and then highlight.pack.js?v1 will do the rest in an DOMContentLoaded handler.

So perhaps we should just allow custom data- attributes, or we could limit it to only allow data-attributes on the form data-custom-* (that way you can't mess with data-base-href and data-using-base-href which are already used in dartdoc generate output).

jcollins-g · 2021-10-14T18:19:57Z

lib/src/render/documentation_renderer.dart

+    } else {
+      // Never allow scripts!
+      for (var s in asHtmlFragment.querySelectorAll('script')) {
+        s.remove();


Sadly we can't do this, see other comments. I'm slightly surprised that the tests still work (probably because this condition is Flutter specific), but browsing to generated flutter documentation will definitely break if we unconditionally forbid this.

The script tags we need are structured in a very, very specific way -- maybe we can pattern match here.

Tagging @gspencergoog , but it looks like maybe we could just get away with dumping any script that doesn't match the pattern, here:

https://github.com/flutter/flutter/blob/5d587f9528fbeb7f308d686b7e5751dd232640db/dev/snippets/config/skeletons/sample.html#L12
(where serial is an integer)

I mean, it's kind of hacky, but if it works? Definitely open to better ideas here too. Making this a configuration option seems like it might be wise too, in case we try it and go "oh no it is using the wrong thing".

There are some onclicks here too that maybe could be handled similarly when the sanitizer is on.

Maybe even some sort of dartdoc_options.yaml option that takes a set of regular expressions to match onClick or script tags. And yes, that scares me.... will continue scratching my head on this, and input very welcome.

I did this because this is what happens on master, currently:

dartdoc/lib/src/render/documentation_renderer.dart

Lines 29 to 31 in b4fa868

for (var s in asHtmlDocument.querySelectorAll('script')) {

s.remove();

}

But yeah, looking at the code it seems that <script>var visibleSnippet1 = "longSnippet1";</script> is present when I look do view-source:https://api.flutter.dev/flutter/widgets/AbsorbPointer-class.html

I'm 100% okay removing it, since this is with HTML sanitizing not enabled.

OK. I'm alright with this as is then, we can iterate to improve matters. I do think some end-to-end tests by hand would be good; the pipeline is complicated enough for doc generation that I want to be sure we're not missing something.

jonasfj · 2021-10-18T12:37:56Z

@jcollins-g, I can't merge, if you're happy feel free to merge :D

Also when do you think this will be published?

jcollins-g · 2021-10-20T15:23:15Z

targeting publish and roll for this 10/25

Sanitizing html after markdown and tool/inject-html directive rendering

fb47020

jonasfj requested a review from jcollins-g October 12, 2021 15:54

google-cla bot added the cla: yes Google CLA check succeeded. label Oct 12, 2021

jcollins-g reviewed Oct 12, 2021

View reviewed changes

jonasfj added 3 commits October 13, 2021 15:26

Making sanitizing hidden and off by deafult

7e9b609

dart run ginder build

5bf7549

Tests of HTML sanitizing

e014a45

jonasfj marked this pull request as ready for review October 13, 2021 14:31

jonasfj added 2 commits October 14, 2021 12:40

Fix number of classes

73ac1d4

Moved html sanitizer tests

2c29efa

jcollins-g suggested changes Oct 14, 2021

View reviewed changes

Removed the script removing hack

2ff8702

jcollins-g approved these changes Oct 15, 2021

View reviewed changes

jcollins-g merged commit f2eb530 into dart-lang:master Oct 20, 2021

isoos mentioned this pull request Oct 20, 2021

Sanitize dartdoc's HTML output. dart-lang/pub-dev#5181

Closed

	for (var s in asHtmlDocument.querySelectorAll('script')) {
	s.remove();
	}

Sanitizing html after markdown and tool/inject-html directive rendering #2833

Sanitizing html after markdown and tool/inject-html directive rendering #2833

Uh oh!

Conversation

jonasfj commented Oct 12, 2021

Uh oh!

jcollins-g commented Oct 12, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jcollins-g Oct 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonasfj commented Oct 18, 2021

Uh oh!

jcollins-g commented Oct 20, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jcollins-g Oct 14, 2021 •

edited

Loading