Skip to content

Sanitizing html after markdown and tool/inject-html directive rendering #2833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 20, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
## 4.1.0-dev
* Experimental feature: HTML output from markdown rendering, `{@tool}` and
`{@inject-html}` is sanitized when hidden option `--sanitize-html` is passed.

## 4.0.0
* BREAKING CHANGE: Refactors to support NNBD and adapt to new analyzer
changes are technically semver breaking. If you make extensive use of
Expand Down
2 changes: 1 addition & 1 deletion dartdoc_options.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
dartdoc:
linkToSource:
root: '.'
uriTemplate: 'https://github.com/dart-lang/dartdoc/blob/v4.0.0/%f%#L%l%'
uriTemplate: 'https://github.com/dart-lang/dartdoc/blob/v4.1.0-dev/%f%#L%l%'
6 changes: 6 additions & 0 deletions lib/src/dartdoc_options.dart
Original file line number Diff line number Diff line change
Expand Up @@ -1311,6 +1311,8 @@ class DartdocOptionContext extends DartdocOptionContextBase

bool get injectHtml => optionSet['injectHtml'].valueAt(context);

bool get sanitizeHtml => optionSet['sanitizeHtml'].valueAt(context);

bool get excludeFooterVersion =>
optionSet['excludeFooterVersion'].valueAt(context);

Expand Down Expand Up @@ -1460,6 +1462,10 @@ Future<List<DartdocOption>> createDartdocOptions(
DartdocOptionArgOnly<bool>('injectHtml', false, resourceProvider,
help: 'Allow the use of the {@inject-html} directive to inject raw '
'HTML into dartdoc output.'),
DartdocOptionArgOnly<bool>('sanitizeHtml', false, resourceProvider,
hide: true,
help: 'Sanitize HTML generated from markdown, {@tool} and '
'{@inject-html} directives.'),
DartdocOptionArgOnly<String>(
'input', resourceProvider.pathContext.current, resourceProvider,
optionIs: OptionKind.dir,
Expand Down
1 change: 1 addition & 0 deletions lib/src/generator/templates.runtime_renderers.dart
Original file line number Diff line number Diff line change
Expand Up @@ -15260,6 +15260,7 @@ const _invisibleGetters = {
'includeExternal',
'includeSource',
'injectHtml',
'sanitizeHtml',
'excludeFooterVersion',
'tools',
'inputDir',
Expand Down
5 changes: 3 additions & 2 deletions lib/src/model/documentation.dart
Original file line number Diff line number Diff line change
Expand Up @@ -51,8 +51,9 @@ class Documentation {
}
_hasExtendedDocs = parseResult.hasExtendedDocs;

var renderResult =
_renderer.render(parseResult.nodes, processFullDocs: processFullDocs);
var renderResult = _renderer.render(parseResult.nodes,
processFullDocs: processFullDocs,
sanitizeHtml: _element.config.sanitizeHtml);

if (processFullDocs) {
_asHtml = renderResult.asHtml;
Expand Down
276 changes: 267 additions & 9 deletions lib/src/render/documentation_renderer.dart
Original file line number Diff line number Diff line change
Expand Up @@ -2,14 +2,17 @@
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.

import 'package:html/parser.dart' show parse;
import 'package:html/dom.dart' as dom;
import 'package:html/parser.dart' show parseFragment;

import 'package:markdown/markdown.dart' as md;
import 'package:meta/meta.dart';

abstract class DocumentationRenderer {
DocumentationRenderResult render(
List<md.Node> nodes, {
@required bool processFullDocs,
@required bool sanitizeHtml,
});
}

Expand All @@ -20,16 +23,16 @@ class DocumentationRendererHtml implements DocumentationRenderer {
DocumentationRenderResult render(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good place for this code. I had considered inserting this somewhere in the markdown rendering pipeline too to try to check embedded HTML, but perhaps users of the markdown output already have their own sanitization in place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package:markdown explicitly says to do sanitization as a post-processing step:
https://pub.dev/packages/markdown#html-sanitization

They also reference Markdown's XSS Vulnerability (and how to mitigate it).

TL;DR: XSS mitigation is best done post-markdown rendering.

Maybe it's possible to write a custom implementation of HtmlRenderer which does sanitizing on-the-fly. But I'm not sure if Text nodes in markdown can contain inline HTML, as is HtmlRenderer does not escape <> when rendering Text nodes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick stab a it, and it doesn't look easy t make HtmlRenderer produce safe HTML... Even just using htmlSerializeEscape when outputting Text broke a lot of tests that aimed to ensure that inline HTML keeps working.

My two cents is that maybe... but probably we have to sanitize after markdown rendering -- otherwise, we'll probably have to dramatically refactor our markdown rendering pipeline.


If we wanted to support {@tool} directives which can side-step HTML sanitizing, then I suggest we extend the markdown parser with custom block syntax, and have it translate {@tool foo --my-flag}...{@end-tool} into <pre data-tool="foo --my-flag">...</pre>, then we could replace these <pre> tags after HTML sanitizing, run tools at this point and decide if we want to sanitize the output of a tool on a tool by tool basis.

Played a bit around with package:markdown, and we can do things like:

final _beginToolPattern = RegExp(r'^{@tool ([^}\n]+)}$');
final _endToolPattern = RegExp(r'^{@end-tool}');

class ToolBlockSyntax extends BlockSyntax {
  @override
  RegExp get pattern => _beginToolPattern;

  const ToolBlockSyntax();

  @override
  bool canParse(BlockParser parser) {
    return pattern.hasMatch(parser.current);
  }

  @override
  List<String> parseChildLines(BlockParser parser) {
    var childLines = <String>[];
    parser.advance();

    while (!parser.isDone && !_endToolPattern.hasMatch(parser.current)) {
      childLines.add(parser.current);
      parser.advance();
    }
    if (!parser.isDone) {
      parser.advance();
    }

    return childLines;
  }

  @override
  Node parse(BlockParser parser) {
    var match = pattern.firstMatch(parser.current)!;
    var toolCommand = match.group(1)!;

    var childLines = parseChildLines(parser);
    // We should always end with a line-break (just makes sense, I think)
    childLines.add('');

    var text = childLines.join('\n');
    var pre = Element.text('pre', escapeHtml(text));
    pre.attributes['data-tool'] = escapeHtmlAttribute(toolCommand);

    return pre;
  }
}

Then we just have find <pre data-tool="COMMAND">STDIN</pre> afterwards, decode the HTML text and attribute text before running the tool. Anyways, maybe that's a larger refactor :D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this probably involves a quite a bit of refactoring to make it work -- as we would effectively need to move execution of tools and rendering of markdown into the renderer. I think it makes some sense to put rendering of markdown in the renderer, but I can also see the argument that execution of tools might not need to happen there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. Moving tool execution to the back end rather than the front breaks all kinds of internal assumptions in dartdoc and runs into a sync vs. async problem as well IIRC from the last time I tried to change the pipeline.

When you say that execution of tools "might not need to happen here" do you mean that it literally doesn't need to run, or just that it doesn't have to be moved? I do wonder if we can get away with "not solving this" for pub for the moment -- if we just forbid tool execution when the post-processing flag is on and then leave me the cleanup to get this working in a more comprehensive manner. IIRC tool execution is forbidden anyhow on pub?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC tool execution is forbidden anyhow on pub?

Yes,

I'm thinking that -- if you want to do tool execution you have two options:

  1. Execute tools before sanitizing HTML (approach I took in this PR)
    • Pros:
      • This makes side-stepping the HTML sanitizer impossible.
      • This is clearly the safest option.
      • Tools that produce images and tables should still work perfectly fine.
    • Cons:
      • This puts limits on the kinds of HTML that tools can produce.
      • You can't choose not to sanitize the HTML produced by tools.
      • If tools are trusted, you might want to avoid trust the HTML produced.
  2. Execute tools after sanitizing HTML.
    • Pros:
      • This makes it possible to have trusted tools, that generate interactive experiences with embedded CSS / JS.
      • It is still possible to have untrusted tools where we sanitize the HTML output of the tool.
    • Cons:
      • This isn't as safe as sanitizing all HTML that comes out of any tool.

It's not unreasonable that dartdoc simply stick with (1), and then if needed expand the allowlist for the sanitizer. In practice most tools probably just want to:

  • Create images and include these with an <img> tag (allowed),
  • Embed an iframe from youtube or dartpad (could be allowed with a small extension to the sanitizer),

And maybe allowing embedding of sandboxed iframes could offer an escape hatch for remaining use-cases. So perhaps option (1) is fine, and we just always sanitize output from tools and {@inject-html} directives.


Note on HTML sanitizing:

  • A lot of the cons can be mitigated by expanding the allow-listed set of HTML tags and attributes.
  • Thus, it's not hard to expand the sanitizer to allow:
    • inline SVG,
    • sandboxed iframes (maybe sandboxed iframes with dataurls could be used to allow arbitrary HTML including css/js),
    • iframes from trusted domains.
  • But it would be very hard to expand the sanitizer to support:
    • CSS (both <style> or attributes style="...")
    • Javascript (both <script> and attributes onclick="...")
  • Expanding the sanitizer to allow new tags and attributes is not very hard. But since everything needs to be allowlisted we would always be playing catch-up. And we would probably want to support a CSS subset at some point. On the other hand, we can get very far with a fairly limited allowlist.

Copy link
Contributor

@jcollins-g jcollins-g Oct 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed notes on all of this!

Thinking about this from dartdoc's perspective going with #1 seems reasonable for now. Being as safe as possible seems a good default choice in any event. But we're going to have to do something about Flutter as script tags are part of their existing tool infrastructure. Perhaps we can make the sanitizer able to be disabled in dartdoc_options.yaml so it can be scoped? Pub already will filter things from dartdoc_options files so that seems like it could be reasonable to do (as long as pub knows that dartdoc_options.yaml files can be located anywhere in the tree?).

Examples from Flutter docs:

Trusted domain iframe from a youtube directive:
https://api.flutter.dev/flutter/material/material-library.html

Here is an example of a tool inserting script tags:
https://api.flutter.dev/flutter/material/AboutListTile-class.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trusted domain iframe from a youtube directive:
https://api.flutter.dev/flutter/material/material-library.html

I think it would be easy allow iframes from trusted domains. Or we could simply hardcode an exception for youtube, that seems reasonable given how many users want to embed youtube.

But in general, I suspect <iframe sandbox="allow-same-origin allow-scripts" is reasonably safe, so long as the embedded document is, either: (i) from a separate domain, or (ii) trusted (if it's from the same domain). It seems like a reasonable expectation that if the document is from the same domain, then it's trusted.

Here is an example of a tool inserting script tags:
https://api.flutter.dev/flutter/material/AboutListTile-class.html

Embedding <script> inline violates a lot of good practices, include security recommendations around content-security-policies (I think we strive to use something like strict CSP whenever possible).

In many cases, it would be better if the {@tool directive generated something like:

<div data-snippet-buttons="longSnippet1"></div>

Then dartdoc options to inject a <script src="static/snippet-buttons.js"> file in the header. Then have snippet-buttons.js do something like:

window.addEventListener('DOMContentLoaded', function() {
  document.querySelectorAll('[data-snippet-buttons]').forEach(function(div) {
    // Do whatever you want to do with `div`, setup child elements and onclick handlers...
  });
});

Hmm, looking at it Flutter docs already embed https://api.flutter.dev/assets/snippets.js, so instead of embedding onclick handlers in the HTML code, they just have use data-... properties (or special CSS classes) to indicate that an HTML element should onclick handlers, and set it up in an DOMContentLoaded handler in snippets.js.

It's similar to what we do with syntax highlighting, we just apply a special CSS class language-<language> on a <pre> tag, and then highlight.pack.js?v1 will do the rest in an DOMContentLoaded handler.


So perhaps we should just allow custom data- attributes, or we could limit it to only allow data-attributes on the form data-custom-* (that way you can't mess with data-base-href and data-using-base-href which are already used in dartdoc generate output).

List<md.Node> nodes, {
@required bool processFullDocs,
@required bool sanitizeHtml,
}) {
if (nodes.isEmpty) {
return DocumentationRenderResult.empty;
}

var rawHtml = md.HtmlRenderer().render(nodes);
var asHtmlDocument = parse(rawHtml);
for (var s in asHtmlDocument.querySelectorAll('script')) {
s.remove();
}
for (var pre in asHtmlDocument.querySelectorAll('pre')) {
var asHtmlFragment = parseFragment(rawHtml);

for (var pre in asHtmlFragment.querySelectorAll('pre')) {
if (pre.children.length > 1 && pre.children.first.localName != 'code') {
continue;
}
Expand All @@ -44,16 +47,21 @@ class DocumentationRendererHtml implements DocumentationRenderer {
// Assume the user intended Dart if there are no other classes present.
if (!specifiesLanguage) pre.classes.add('language-dart');
}

if (sanitizeHtml) {
_sanitize(asHtmlFragment);
}

var asHtml = '';

if (processFullDocs) {
// `trim` fixes an issue with line ending differences between Mac and
// Windows.
asHtml = asHtmlDocument.body.innerHtml?.trim();
asHtml = asHtmlFragment.outerHtml.trim();
}
var asOneLiner = asHtmlDocument.body.children.isEmpty
var asOneLiner = asHtmlFragment.children.isEmpty
? ''
: asHtmlDocument.body.children.first.innerHtml;
: asHtmlFragment.children.first.innerHtml;

return DocumentationRenderResult(asHtml: asHtml, asOneLiner: asOneLiner);
}
Expand All @@ -68,3 +76,253 @@ class DocumentationRenderResult {
const DocumentationRenderResult(
{@required this.asHtml, @required this.asOneLiner});
}

bool _allowClassName(String className) =>
className == 'deprecated' || className.startsWith('language-');

Iterable<String> _addLinkRel(String uri) {
final u = Uri.tryParse(uri);
if (u.host.isNotEmpty) {
// TODO(jonasfj): Consider allowing non-ugc links for trusted sites.
return ['ugc'];
}
return [];
}

void _sanitize(dom.Node node) {
if (node is dom.Element) {
final tagName = node.localName.toUpperCase();
if (!_allowedElements.contains(tagName)) {
node.remove();
return;
}
node.attributes.removeWhere((k, v) {
final attrName = k.toString();
if (attrName == 'class') {
node.classes.removeWhere((cn) => !_allowClassName(cn));
return node.classes.isEmpty;
}
return !_isAttributeAllowed(tagName, attrName, v);
});
if (tagName == 'A') {
final href = node.attributes['href'];
if (href != null) {
final rels = _addLinkRel(href);
if (rels != null && rels.isNotEmpty) {
node.attributes['rel'] = rels.join(' ');
}
}
}
}
if (node.hasChildNodes()) {
// doing it in reverse order, because we could otherwise skip one, when a
// node is removed...
for (var i = node.nodes.length - 1; i >= 0; i--) {
_sanitize(node.nodes[i]);
}
}
}

bool _isAttributeAllowed(String tagName, String attrName, String value) {
if (_alwaysAllowedAttributes.contains(attrName)) return true;

// Special validators for special attributes on special tags (href/src/cite)
final attributeValidators = _elementAttributeValidators[tagName];
if (attributeValidators == null) {
return false;
}

final validator = attributeValidators[attrName];
if (validator == null) {
return false;
}

return validator(value);
}

// Inspired by the set of HTML tags allowed in GFM.
final _allowedElements = <String>{
'H1',
'H2',
'H3',
'H4',
'H5',
'H6',
'H7',
'H8',
'BR',
'B',
'I',
'STRONG',
'EM',
'A',
'PRE',
'CODE',
'IMG',
'TT',
'DIV',
'INS',
'DEL',
'SUP',
'SUB',
'P',
'OL',
'UL',
'TABLE',
'THEAD',
'TBODY',
'TFOOT',
'BLOCKQUOTE',
'DL',
'DT',
'DD',
'KBD',
'Q',
'SAMP',
'VAR',
'HR',
'RUBY',
'RT',
'RP',
'LI',
'TR',
'TD',
'TH',
'S',
'STRIKE',
'SUMMARY',
'DETAILS',
'CAPTION',
'FIGURE',
'FIGCAPTION',
'ABBR',
'BDO',
'CITE',
'DFN',
'MARK',
'SMALL',
'SPAN',
'TIME',
'WBR',
};

// Inspired by the set of HTML attributes allowed in GFM.
final _alwaysAllowedAttributes = <String>{
'abbr',
'accept',
'accept-charset',
'accesskey',
'action',
'align',
'alt',
'aria-describedby',
'aria-hidden',
'aria-label',
'aria-labelledby',
'axis',
'border',
'cellpadding',
'cellspacing',
'char',
'charoff',
'charset',
'checked',
'clear',
'cols',
'colspan',
'color',
'compact',
'coords',
'datetime',
'dir',
'disabled',
'enctype',
'for',
'frame',
'headers',
'height',
'hreflang',
'hspace',
'ismap',
'label',
'lang',
'maxlength',
'media',
'method',
'multiple',
'name',
'nohref',
'noshade',
'nowrap',
'open',
'prompt',
'readonly',
'rel',
'rev',
'rows',
'rowspan',
'rules',
'scope',
'selected',
'shape',
'size',
'span',
'start',
'summary',
'tabindex',
'target',
'title',
'type',
'usemap',
'valign',
'value',
'vspace',
'width',
'itemprop',
};

bool _alwaysAllowed(String _) => true;

bool _validLink(String url) {
try {
final uri = Uri.parse(url);
return uri.isScheme('https') ||
uri.isScheme('http') ||
uri.isScheme('mailto') ||
!uri.hasScheme;
} on FormatException {
return false;
}
}

bool _validUrl(String url) {
try {
final uri = Uri.parse(url);
return uri.isScheme('https') || uri.isScheme('http') || !uri.hasScheme;
} on FormatException {
return false;
}
}

final _citeAttributeValidator = <String, bool Function(String)>{
'cite': _validUrl,
};

final _elementAttributeValidators =
<String, Map<String, bool Function(String)>>{
'A': {
'href': _validLink,
},
'IMG': {
'src': _validUrl,
'longdesc': _validUrl,
},
'DIV': {
'itemscope': _alwaysAllowed,
'itemtype': _alwaysAllowed,
},
'BLOCKQUOTE': _citeAttributeValidator,
'DEL': _citeAttributeValidator,
'INS': _citeAttributeValidator,
'Q': _citeAttributeValidator,
};
2 changes: 1 addition & 1 deletion lib/src/version.dart
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
// Generated code. Do not modify.
const packageVersion = '4.0.0';
const packageVersion = '4.1.0-dev';
4 changes: 2 additions & 2 deletions pubspec.yaml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
name: dartdoc
# Run `grind build` after updating.
version: 4.0.0
# Run `dart run grinder build` after updating.
version: 4.1.0-dev
description: A non-interactive HTML documentation generator for Dart source code.
homepage: https://github.com/dart-lang/dartdoc
environment:
Expand Down
Loading