HTML Sanitizer API

Draft Community Group Report,

This version:
https://wicg.github.io/sanitizer-api/
Issue Tracking:
GitHub
Inline In Spec
Editors:
Frederik Braun (Mozilla)
Mario Heiderich (Cure53)
Daniel Vogelheim (Google LLC)

Abstract

This document specifies a set of APIs which allow developers to take untrusted strings of HTML, and sanitize them for safe insertion into a document’s DOM.

Status of this document

This specification was published by the Web Platform Incubator Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

1. Introduction

This section is not normative.

Web applications often need to work with strings of HTML on the client side, perhaps as part of a client-side templating solution, perhaps as part of rendering user generated content, etc. It is difficult to do so in a safe way, however; the naive approach of joining strings together and stuffing them into an Element's innerHTML is fraught with risk, as that can and will cause JavaScript execution in a number of unexpected ways.

Libraries like [DOMPURIFY] attempt to manage this problem by carefully parsing and sanitizing strings before insertion by constructing a DOM and walking its members through an allowlist. This has proven to be a fragile approach, as the parsing APIs exposed to the web don’t always map in reasonable ways to the browser’s behavior when actually rendering a string as HTML in the "real" DOM. Moreover, the libraries need to keep on top of browsers' changing behavior over time; things that once were safe may turn into time-bombs based on new platform-level features.

The browser, on the other hand, has a fairly good idea of when it is going to execute code. We can improve upon the userspace libraries by teaching the browser how to render HTML from an arbitrary string in a safe manner, and do so in a way that is much more likely to be maintained and updated along with the browser’s own changing parser implementation. This document outlines an API which aims to do just that.

1.1. Goals

1.2. Examples

let userControlledInput = "<img src=x onerror=alert(1)//>";

// Create a DocumentFragment from unsanitized input:
let s = new Sanitizer();
let sanitizedFragment = s.sanitize(userControlledInput);

// Replace an element’s content from unsanitized input:
element.replaceChildren(s.sanitize(userControlledInput));

2. Framework

2.1. Sanitizer API

The core API is the Sanitizer object and the sanitize method. Sanitizers can be instantiated using an optional SanitizerConfig dictionary for options. The most common use case, preventing XSS, is handled by the built-in default lists, so that creating a Sanitizer with a custom config is necessary only to handle additional, application-specific use cases.

[
  Exposed=(Window),
  SecureContext
] interface Sanitizer {
  constructor(optional SanitizerConfig config = {});
  DOMString sanitizeToString(DOMString input);
  DocumentFragment sanitize(DOMString input);
};

Example:

  // Replace an element’s content from unsanitized input:
  element.replaceChildren(new Sanitizer().sanitize(userControlledInput));
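
Because the interface is still a draft and is exposed only in secure contexts, a page may want to check for it before use. The following is a minimal sketch, not part of this specification: the helper name setUntrustedHtml and the plain-text fallback are assumptions made for illustration.

```javascript
// Hypothetical helper, not part of the API described in this document.
// Returns true if the Sanitizer was used, false if the fallback ran.
function setUntrustedHtml(element, untrustedHtml) {
  if (typeof Sanitizer === "undefined") {
    // Assumed fallback: show the input as inert text instead of markup.
    element.textContent = untrustedHtml;
    return false;
  }
  element.replaceChildren(new Sanitizer().sanitize(untrustedHtml));
  return true;
}
```

The fallback deliberately loses formatting rather than risk script execution; an application could instead fall back to a userspace library.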

2.2. The Configuration Dictionary

The sanitizer’s configuration object is a dictionary which describes modifications to the sanitize operation.

dictionary SanitizerConfig {
  sequence<DOMString> allowElements;
  sequence<DOMString> blockElements;
  sequence<DOMString> dropElements;
  sequence<DOMString> allowAttributes;
  sequence<DOMString> dropAttributes;
};

allowElements

The element allow list is a sequence of strings with elements that the sanitizer should retain in the input.

blockElements

The element block list is a sequence of strings with elements where the sanitizer should remove the elements from the input, but retain their children.

dropElements

The element drop list is a sequence of strings with elements that the sanitizer should remove from the input, including its children.

allowAttributes

TODO: attribute allow list

dropAttributes

TODO: attribute drop list

Note: allowElements creates a sanitizer that defaults to dropping elements, while blockElements and dropElements default to keeping unknown elements. Using both types is possible, but is probably of little practical use. The same applies to allowAttributes and dropAttributes.

Examples:

  const sample = "Some text <b><i>with</i></b> <blink>tags</blink>.";

  // "Some text <b>with</b> tags."
  new Sanitizer({allowElements: [ "b" ]}).sanitizeToString(sample);

  // "Some text <i>with</i> <blink>tags</blink>."
  new Sanitizer({blockElements: [ "b" ]}).sanitizeToString(sample);

  // "Some text <blink>tags</blink>."
  new Sanitizer({dropElements: [ "b" ]}).sanitizeToString(sample);

  // Note: The default configuration handles XSS-relevant input:

  // Non-scripting input will be passed through:
  new Sanitizer().sanitizeToString(sample);  // Will output sample unmodified.

  // Scripts will be blocked: "abc alert(1) def"
  new Sanitizer().sanitizeToString("abc <script>alert(1)</script> def");

2.3. Algorithms

To sanitize a document fragment named fragment using sanitizer run these steps:

  1. let m be a map that maps nodes to {'keep', 'block', 'drop'}.

  2. let nodes be a list containing the inclusive descendants of fragment, in tree order.

  3. for each node in nodes:

    1. call sanitize a node and insert node and the result value into m

  4. for each node in nodes:

    1. if m[node] is 'drop', remove the node and all children from fragment.

    2. if m[node] is 'block', replace the node in fragment with all of its element and text node children.

    3. if m[node] is undefined or 'keep', do nothing.
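
The two passes above can be sketched on a simplified tree. This is an illustration under stated assumptions, not the normative algorithm: the plain-object tree model, the names sanitizeFragment, classifyElement and toHtml, and the non-escaping serializer are all hypothetical, and attributes are ignored.

```javascript
// Hypothetical tree model (not DOM): elements are { name, children },
// text nodes are { text }, and the fragment root is { children }.
function sanitizeFragment(fragment, classifyElement) {
  // Steps 1-3: record a verdict for every inclusive descendant, in tree order.
  const verdicts = new Map();
  const record = (node) => {
    verdicts.set(node, node.name ? classifyElement(node) : "keep");
    (node.children || []).forEach(record);
  };
  record(fragment);

  // Step 4: apply the verdicts, rebuilding each child list.
  const apply = (parent) => {
    if (!parent.children) return; // text nodes have nothing to rebuild
    const kept = [];
    for (const child of parent.children) {
      const verdict = verdicts.get(child);
      if (verdict === "drop") continue; // remove the node and its children
      apply(child); // sanitize the subtree before deciding what survives
      if (verdict === "block") kept.push(...(child.children || [])); // children only
      else kept.push(child);
    }
    parent.children = kept;
  };
  apply(fragment);
  return fragment;
}

// Simplified serializer for the demo; it does no escaping.
const toHtml = (node) => {
  if (node.text !== undefined) return node.text;
  const inner = node.children.map(toHtml).join("");
  return node.name ? `<${node.name}>${inner}</${node.name}>` : inner;
};

// The sample string from the configuration examples, built by hand as a tree.
const el = (name, ...children) => ({ name, children });
const text = (value) => ({ text: value });
const sample = {
  children: [
    text("Some text "),
    el("b", el("i", text("with"))),
    text(" "),
    el("blink", text("tags")),
    text("."),
  ],
};

const blockB = (element) => (element.name === "b" ? "block" : "keep");
console.log(toHtml(sanitizeFragment(sample, blockB)));
// prints: Some text <i>with</i> <blink>tags</blink>.
```

Computing all verdicts before mutating the tree mirrors the separation between steps 3 and 4, so a verdict never depends on an earlier removal.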

To sanitize a node named node run these steps:

  1. if node is an element node, call sanitize an element and return its result.

  2. if node is an attribute node, call sanitize an attribute and return its result.

  3. return 'keep'

To sanitize an element named element, run these steps:

  1. let config be the sanitizer’s configuration dictionary.

  2. let name be element’s tag name.

  3. if name is contained in the built-in default element drop list return 'drop'.

  4. if name is in config’s element drop list return 'drop'.

  5. if name is contained in the built-in default element block list return 'block'.

  6. if name is in config’s element block list return 'block'.

  7. if config has a non-empty element allow list and name is not in config’s element allow list return 'block'

  8. return 'keep'
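
The ordering of these checks can be sketched as a small function. This is a hedged illustration only: classifyElementName is a hypothetical name, tag names are assumed to be lowercase, and the default lists reuse just the "script" and "noscript" placeholder entries from the Default Configuration section.

```javascript
// Placeholder defaults, taken from the illustrative lists in this document.
const DEFAULT_DROP_ELEMENTS = ["script"];
const DEFAULT_BLOCK_ELEMENTS = ["noscript"];

// Hypothetical helper: maps a lowercase tag name to 'drop', 'block', or 'keep'.
function classifyElementName(name, config = {}) {
  const dropList = config.dropElements || [];
  const blockList = config.blockElements || [];
  const allowList = config.allowElements || [];
  if (DEFAULT_DROP_ELEMENTS.includes(name)) return "drop"; // step 3
  if (dropList.includes(name)) return "drop"; // step 4
  if (DEFAULT_BLOCK_ELEMENTS.includes(name)) return "block"; // step 5
  if (blockList.includes(name)) return "block"; // step 6
  if (allowList.length > 0 && !allowList.includes(name)) return "block"; // step 7
  return "keep"; // step 8
}

console.log(classifyElementName("i", { allowElements: ["b"] })); // prints: block
console.log(classifyElementName("script", { allowElements: ["script"] })); // prints: drop
```

The second call shows why the built-in lists are checked first: listing "script" in allowElements does not rescue it.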

To sanitize an attribute named attr, run these steps:

  1. let config be the sanitizer’s configuration dictionary.

  2. let element be attr’s parent element.

  3. let name be element’s tag name, followed by '.', followed by attr’s name.

  4. if name is contained in the built-in default attribute drop list return 'drop'.

  5. if name is in config’s attribute drop list return 'drop'.

  6. if config has a non-empty attribute allow list and name is not in config’s attribute allow list return 'drop'

  7. return 'keep'
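
A comparable sketch for attributes follows. It assumes that the separator in step 3 is a single period, giving keys such as "img.onerror", and that names are lowercase. classifyAttribute is a hypothetical name, and the empty default list mirrors the placeholder value given in the Default Configuration section.

```javascript
// The document's placeholder default attribute drop list is empty.
const DEFAULT_DROP_ATTRIBUTES = [];

// Hypothetical helper: returns 'drop' or 'keep' for one attribute of one element.
function classifyAttribute(elementName, attrName, config = {}) {
  const dropList = config.dropAttributes || [];
  const allowList = config.allowAttributes || [];
  const name = `${elementName}.${attrName}`; // step 3, e.g. "img.onerror"
  if (DEFAULT_DROP_ATTRIBUTES.includes(name)) return "drop"; // step 4
  if (dropList.includes(name)) return "drop"; // step 5
  if (allowList.length > 0 && !allowList.includes(name)) return "drop"; // step 6
  return "keep"; // step 7
}

console.log(classifyAttribute("img", "onerror", { dropAttributes: ["img.onerror"] })); // prints: drop
console.log(classifyAttribute("img", "src", { allowAttributes: ["a.href"] })); // prints: drop
console.log(classifyAttribute("a", "href", { allowAttributes: ["a.href"] })); // prints: keep
```

Unlike elements, attributes have no 'block' outcome, since an attribute has no children to retain.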

TODO: To create a document fragment ...

To sanitize a given input, run these steps:

  1. run create a document fragment algorithm on the input.

  2. run the sanitize document fragment algorithm on the resulting fragment.

  3. return the sanitized fragment.

To sanitizeToString a given input, run these steps:

  1. run the sanitize algorithm on input.

  2. run the steps of the HTML Fragment Serialization Algorithm with the fragment returned by step 1 as the node, and return the resulting string.

2.4. Default Configuration

The sanitizer defaults need to be carefully vetted, and are still under discussion. The values below are for illustrative purposes only.

The sanitizer has a built-in default configuration, which aims to eliminate any script-injection possibility. Note that the sanitize document fragment algorithm is defined so that these defaults are handled first and cannot be overridden by a custom configuration.

Default Drop Elements

The default element drop list has the following value:

[ "script", "this is just a placeholder" ]

Default Block Elements

The default element block list has the following value:

[ "noscript", "this is just a placeholder" ]

Default Drop Attributes

The default attribute drop list has the following value:

{}

3. Acknowledgements

Cure53’s [DOMPURIFY] is a clear inspiration for the API this document describes, as is Internet Explorer’s window.toStaticHTML().

References

Normative References

[DOM]
Anne van Kesteren. DOM Standard. Living Standard. URL: https://dom.spec.whatwg.org/
[DOM-Parsing]
Travis Leithead. DOM Parsing and Serialization. 17 May 2016. WD. URL: https://www.w3.org/TR/DOM-Parsing/
[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INFRA]
Anne van Kesteren; Domenic Denicola. Infra Standard. Living Standard. URL: https://infra.spec.whatwg.org/
[WebIDL]
Boris Zbarsky. Web IDL. 15 December 2016. ED. URL: https://heycam.github.io/webidl/

Informative References

[DOMPURIFY]
DOMPurify. URL: https://github.com/cure53/DOMPurify
