HTML Sanitizer API

Draft Community Group Report

This version:
https://wicg.github.io/sanitizer-api/
Issue Tracking:
GitHub
Inline In Spec
Editors:
Frederik Braun (Mozilla)
Mario Heiderich (Cure53)
Daniel Vogelheim (Google LLC)

Abstract

This document specifies a set of APIs which allow developers to take untrusted strings of HTML, and sanitize them for safe insertion into a document’s DOM.

Status of this document

This specification was published by the Web Platform Incubator Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

1. Introduction

This section is not normative.

Web applications often need to work with strings of HTML on the client side, perhaps as part of a client-side templating solution, perhaps as part of rendering user generated content, etc. It is difficult to do so in a safe way, however; the naive approach of joining strings together and stuffing them into an Element's innerHTML is fraught with risk, as that can and will cause JavaScript execution in a number of unexpected ways.

Libraries like [DOMPURIFY] attempt to manage this problem by carefully parsing and sanitizing strings before insertion: they construct a DOM and walk its members through an allow-list. This has proven to be a fragile approach, as the parsing APIs exposed to the web don’t always map in reasonable ways to the browser’s behavior when actually rendering a string as HTML in the "real" DOM. Moreover, the libraries need to keep on top of browsers' changing behavior over time; things that once were safe may turn into time-bombs based on new platform-level features.

The browser, on the other hand, has a fairly good idea of when it is going to execute code. We can improve upon the user-space libraries by teaching the browser how to render HTML from an arbitrary string in a safe manner, and do so in a way that is much more likely to be maintained and updated along with the browser’s own changing parser implementation. This document outlines an API which aims to do just that.

1.1. Goals

1.2. Examples

let userControlledInput = "<img src=x onerror=alert(1)//>";

// Create a DocumentFragment from unsanitized input:
let s = new Sanitizer();
let sanitizedFragment = s.sanitize(userControlledInput);

// Replace an element’s content from unsanitized input:
element.replaceChildren(s.sanitize(userControlledInput));

2. Framework

2.1. Sanitizer API {#sanitizer-api}

The core API is the Sanitizer object and the sanitize method. Sanitizers can be instantiated using an optional SanitizerConfig dictionary for options. The most common use-case - preventing XSS - is handled by the built-in default lists, so that creating a Sanitizer with a custom config is necessary only to handle additional, application-specific use cases.

[
  Exposed=(Window),
  SecureContext
] interface Sanitizer {
  constructor(optional SanitizerConfig config = {});
  DocumentFragment sanitize(SanitizerInput input);
  DOMString sanitizeToString(SanitizerInput input);
};

Example:

  // Replace an element’s content from unsanitized input:
  element.replaceChildren(new Sanitizer().sanitize(userControlledInput));

2.2. Input Types {#inputs}

The sanitization methods support three input types: DOMString, Document, and DocumentFragment. In all cases, the sanitization will work on a DocumentFragment internally, but the work-fragment will be created by parsing, cloning, or using the fragment as-is, respectively.

typedef (DOMString or DocumentFragment or Document) SanitizerInput;

Note: Sanitizing a string will use the HTML Parser to parse the input, which will perform some degree of normalization. So even if no sanitization steps are taken on a particular input, it cannot be guaranteed that the output of sanitizeToString will be character-for-character identical to the input. Examples would be character regularization ("&szlig;" to "ß"), or light processing for some elements ("<image>" to "<img>").

2.3. The Configuration Dictionary {#config}

The sanitizer’s configuration object is a dictionary which describes modifications to the sanitize operation.

dictionary SanitizerConfig {
  sequence<DOMString> allowElements;
  sequence<DOMString> blockElements;
  sequence<DOMString> dropElements;
  AttributeMatchList allowAttributes;
  AttributeMatchList dropAttributes;
  boolean allowCustomElements;
};

allowElements

The element allow list is a sequence of strings with elements that the sanitizer should retain in the input.

blockElements

The element block list is a sequence of strings with elements where the sanitizer should remove the elements from the input, but retain their children.

dropElements

The element drop list is a sequence of strings with elements that the sanitizer should remove from the input, including their children.

allowAttributes

The attribute allow list is an attribute match list, which determines whether an attribute (on a given element) should be allowed.

dropAttributes

The attribute drop list is an attribute match list, which determines whether an attribute (on a given element) should be dropped.

allowCustomElements

The allow custom elements option determines whether custom elements are to be considered. The default is to drop them. If this option is true, custom elements will still be checked against all other built-in or configured checks.

Note: allowElements creates a sanitizer that defaults to dropping elements, while blockElements and dropElements default to keeping unknown elements. Using both types is possible, but is probably of little practical use. The same applies to allowAttributes and dropAttributes.

Examples:

  const sample = "Some text <b><i>with</i></b> <blink>tags</blink>.";

  // "Some text <b>with</b> tags."
  new Sanitizer({allowElements: [ "b" ]}).sanitizeToString(sample);

  // "Some text <i>with</i> <blink>tags</blink>."
  new Sanitizer({blockElements: [ "b" ]}).sanitizeToString(sample);

  // "Some text  <blink>tags</blink>."
  new Sanitizer({dropElements: [ "b" ]}).sanitizeToString(sample);

  // Note: The default configuration handles XSS-relevant input:

  // Non-scripting input will be passed through:
  new Sanitizer().sanitizeToString(sample);  // Will output sample unmodified.

  // Scripts will be dropped, including their content: "abc  def"
  new Sanitizer().sanitizeToString("abc <script>alert(1)</script> def");

2.3.1. Attribute Match Lists {#attr-match-list}

An attribute match list is a map of attribute names to element names, where the special name "*" stands for all elements. A given attribute belonging to an element matches an attribute match list if the attribute’s local name is a key in the match list, and the element’s local name or "*" is found in the value list stored under that key.

typedef record<DOMString, sequence<DOMString>> AttributeMatchList;

Examples for attributes and attribute match lists:

  const sample = "<span id='span1' class='theclass' style='font-weight: bold'>hello</span>";

  // Allow only <span style>: "<span style='font-weight: bold'>...</span>"
  new Sanitizer({allowAttributes: {"style": ["span"]}}).sanitizeToString(sample);

  // Allow style, but not on span: "<span>...</span>"
  new Sanitizer({allowAttributes: {"style": ["div"]}}).sanitizeToString(sample);

  // Allow style on any elements: "<span style='font-weight: bold'>...</span>"
  new Sanitizer({allowAttributes: {"style": ["*"]}}).sanitizeToString(sample);

  // Drop <span id>: "<span class='theclass' style='font-weight: bold'>...</span>";
  new Sanitizer({dropAttributes: {"id": ["span"]}}).sanitizeToString(sample);

  // Drop id, everywhere: "<span class='theclass' style='font-weight: bold'>...</span>";
  new Sanitizer({dropAttributes: {"id": ["*"]}}).sanitizeToString(sample);

2.4. Algorithms {#algorithms}

To sanitize a document fragment named fragment using sanitizer run these steps:

  1. let m be a map that maps nodes to {'keep', 'block', 'drop'}.

  2. let nodes be a list containing the inclusive descendants of fragment, in tree order.

  3. for each node in nodes:

    1. call sanitize a node and insert node and the result value into m

  4. for each node in nodes:

    1. if m[node] is 'drop', remove the node and all children from fragment.

    2. if m[node] is 'block', replace the node with all of its element and text node children from fragment.

    3. if m[node] is undefined or 'keep', do nothing.
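The two-pass structure of these steps (decide for every node first, then mutate the tree) can be sketched without a browser. The following is a minimal sketch, assuming a toy node model of plain objects rather than real DOM nodes; the names text, el, inclusiveDescendants, sanitizeFragment, and serialize are hypothetical helpers for illustration and are not part of the API.

```javascript
// Toy stand-ins for DOM nodes (an assumption of this sketch, not the DOM API).
const text = (data) => ({ data, children: [] });
const el = (name, ...children) => ({ name, children });

// Inclusive descendants of a node, in tree order.
function inclusiveDescendants(node, out = []) {
  out.push(node);
  for (const child of node.children) inclusiveDescendants(child, out);
  return out;
}

// `decide` plays the role of "sanitize a node": it returns 'keep', 'block', or 'drop'.
function sanitizeFragment(fragment, decide) {
  const nodes = inclusiveDescendants(fragment);

  // Plain objects have no parent pointer, so track parents separately.
  const parentOf = new Map();
  for (const node of nodes) {
    for (const child of node.children) parentOf.set(child, node);
  }

  // Steps 1-3: record a decision for every node before mutating anything.
  const m = new Map();
  for (const node of nodes) m.set(node, decide(node));

  // Step 4: apply the decisions.
  for (const node of nodes) {
    const parent = parentOf.get(node);
    if (parent === undefined) continue; // the fragment root itself
    const index = parent.children.indexOf(node);
    if (m.get(node) === "drop") {
      parent.children.splice(index, 1); // removes the node and its subtree
    } else if (m.get(node) === "block") {
      parent.children.splice(index, 1, ...node.children); // hoist the children
      for (const child of node.children) parentOf.set(child, parent);
    }
  }
  return fragment;
}

// Serializer for the toy model, used only to display results.
function serialize(node) {
  if (node.data !== undefined) return node.data;
  const inner = node.children.map(serialize).join("");
  return node.name ? `<${node.name}>${inner}</${node.name}>` : inner;
}

const fragment = el(null, text("Some text "), el("b", el("i", text("with"))),
                    text(" "), el("blink", text("tags")), text("."));
// Mimic allowElements: ["b"]: every other element is blocked.
sanitizeFragment(fragment, (n) => (n.name && n.name !== "b" ? "block" : "keep"));
console.log(serialize(fragment)); // "Some text <b>with</b> tags."
```

Recording all decisions before applying any of them keeps each decision independent of earlier removals, which is why the sketch builds the map m in a separate loop.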

To sanitize a node named node run these steps:

  1. if node is an element node, call sanitize an element and return its result.

  2. return 'keep'

To sanitize an element named element, run these steps:

  1. let config be the sanitizer’s configuration dictionary.

  2. let name be element’s local name.

  3. if name is a valid custom element name and if config’s allow custom elements option is unset or set to anything other than true, return 'drop'.

  4. if name is contained in the built-in default element drop list return 'drop'.

  5. if name is in config’s element drop list return 'drop'.

  6. if name is contained in the built-in default element block list return 'block'.

  7. if name is in config’s element block list return 'block'.

  8. if config has a non-empty element allow list and name is not in config’s element allow list return 'block'

  9. for each attr in element’s attribute list:

    1. call sanitize an attribute with attr’s name and element’s local name.

    2. if the result is different from 'keep', remove attr from element.

  10. return 'keep'

This presently ignores all namespace info, making it impossible to support different actions for like-named elements from different namespaces.
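The precedence of the element checks can be sketched as a pure function over an element name. This is a minimal sketch under stated assumptions: the default lists reuse the placeholder values from the Default Configuration section, the custom element name test is a deliberate simplification of the definition in the HTML Standard, attribute handling (step 9) is left out, and decideElement and looksLikeCustomElementName are hypothetical names.

```javascript
// Illustrative placeholders mirroring the non-final defaults in section 2.5.
const DEFAULT_DROP_ELEMENTS = ["script"];
const DEFAULT_BLOCK_ELEMENTS = ["noscript"];

// Simplified assumption: a name that starts with a lowercase ASCII letter,
// contains a hyphen, and has no uppercase ASCII letters is treated as a custom
// element name. The HTML Standard's definition is stricter.
function looksLikeCustomElementName(name) {
  return /^[a-z][^A-Z]*-[^A-Z]*$/.test(name);
}

// Returns 'keep', 'block', or 'drop' for an element name (steps 3-8 and 10).
function decideElement(name, config = {}) {
  if (looksLikeCustomElementName(name) && config.allowCustomElements !== true) {
    return "drop";
  }
  // Built-in lists are consulted before the configured ones.
  if (DEFAULT_DROP_ELEMENTS.includes(name)) return "drop";
  if ((config.dropElements ?? []).includes(name)) return "drop";
  if (DEFAULT_BLOCK_ELEMENTS.includes(name)) return "block";
  if ((config.blockElements ?? []).includes(name)) return "block";
  // A non-empty allow list blocks everything that is not listed.
  const allow = config.allowElements ?? [];
  if (allow.length > 0 && !allow.includes(name)) return "block";
  return "keep";
}

console.log(decideElement("script", { allowElements: ["script"] })); // "drop"
console.log(decideElement("i", { allowElements: ["b"] }));           // "block"
```

The first call illustrates why a custom configuration cannot re-enable an element from the built-in drop list: the default check returns before the allow list is ever consulted.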

To sanitize an attribute named attr belonging to element, run these steps:

  1. let config be the sanitizer’s configuration dictionary.

  2. if attr and element attribute-match the built-in default attribute drop list return 'drop'.

  3. if attr and element attribute-match the config’s attribute drop list return 'drop'.

  4. if config has a non-empty attribute allow list and attr and element do not attribute-match the config’s attribute allow list return 'drop'.

  5. return 'keep'.

To determine whether an attribute and element attribute-match an attribute match list named list, run these steps:

  1. let attr-name be attribute’s local name.

  2. let elem-name be element’s local name.

  3. if list does not contain a key attr-name, return false.

  4. let matches be the value of list[attr-name].

  5. if matches contains the string elem-name, return true.

  6. if matches contains the string "*", return true.

  7. return false.
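The two attribute algorithms above can be sketched together as pure functions over plain strings and objects. This is a minimal sketch, assuming the attribute and element are identified by their local names only; attributeMatch and decideAttribute are hypothetical names, and the default drop list reuses the empty placeholder value from the Default Configuration section.

```javascript
// Placeholder mirroring the (currently empty) illustrative default in section 2.5.
const DEFAULT_DROP_ATTRIBUTES = {};

// attribute-match: the attribute name must be a key of the list, and the value
// stored under that key must contain the element name or the wildcard "*".
function attributeMatch(list, attrName, elemName) {
  if (!Object.prototype.hasOwnProperty.call(list, attrName)) return false;
  const matches = list[attrName];
  return matches.includes(elemName) || matches.includes("*");
}

// sanitize an attribute: returns 'keep' or 'drop'.
function decideAttribute(attrName, elemName, config = {}) {
  if (attributeMatch(DEFAULT_DROP_ATTRIBUTES, attrName, elemName)) return "drop";
  if (attributeMatch(config.dropAttributes ?? {}, attrName, elemName)) return "drop";
  // A non-empty allow list drops every attribute that does not match it.
  const allow = config.allowAttributes ?? {};
  if (Object.keys(allow).length > 0 && !attributeMatch(allow, attrName, elemName)) {
    return "drop";
  }
  return "keep";
}

console.log(decideAttribute("style", "span", { allowAttributes: { style: ["span"] } })); // "keep"
console.log(decideAttribute("id", "span", { dropAttributes: { id: ["*"] } }));           // "drop"
```

Using an own-property check rather than the in operator avoids accidental matches on inherited keys such as "constructor" when the list is a plain object.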

To create a document fragment named fragment from a Sanitizer input, run these steps:

  1. Switch based on input’s type:

    1. if input is of type DocumentFragment, then:

      1. let node refer to input.

    2. if input is of type Document, then:

      1. let node refer to input’s documentElement.

    3. if input is of type DOMString, then:

      1. let node be the result of the parseFromString algorithm with input as first parameter (string), and "text/html" as second parameter (type).

  2. Let clone be the result of running clone a node on node with the clone children flag set to true.

  3. Let f be the result of createDocumentFragment.

  4. Append the node clone to the parent f.

  5. Return f.

It’s unclear whether we can assume a generic context for parseFromString, or if we need to re-work the API to take the insertion context of the created fragment into account. <https://github.com/WICG/sanitizer-api/issues/42>

To sanitize a given input, run these steps:

  1. run the create a document fragment algorithm on the input.

  2. run the sanitize a document fragment algorithm on the resulting fragment.

  3. return the sanitized fragment.

To sanitizeToString a given input, run these steps:

  1. run the create a document fragment algorithm on the input.

  2. run the sanitize a document fragment algorithm on the resulting fragment.

  3. run the steps of the HTML Fragment Serialization Algorithm with the fragment root of step 1 as the node, and return the result string.

2.5. Default Configuration {#defaults}

The sanitizer defaults need to be carefully vetted, and are still under discussion. The values below are for illustrative purposes only.

The sanitizer has a built-in default configuration, which aims to eliminate any script-injection possibility. Note that the sanitize document fragment algorithm is defined so that these defaults are handled first and cannot be overridden by a custom configuration.

Default Drop Elements

The default element drop list has the following value:

[ "script", "this is just a placeholder" ]

Default Block Elements

The default element block list has the following value:

[ "noscript", "this is just a placeholder" ]

Default Drop Attributes

The default attribute drop list has the following value:

{}

3. Security Considerations {#security-considerations}

The Sanitizer API is intended to prevent DOM-based Cross-Site Scripting by traversing supplied HTML content and removing elements and attributes according to a configuration. The specified API must not support the construction of a Sanitizer object that leaves script-capable markup in place; being able to do so would be a bug in the threat model.

That being said, there are security issues which correct usage of the Sanitizer API cannot protect against. These scenarios are laid out in the following sections.

3.1. Server-Side Reflected and Stored XSS {#server-side-xss}

This section is not normative.

The Sanitizer API operates solely in the DOM and adds a capability to traverse and filter an existing DocumentFragment. The Sanitizer does not address server-side reflected or stored XSS.

3.2. DOM clobbering {#dom-clobbering}

This section is not normative.

DOM clobbering describes an attack in which malicious HTML confuses an application by naming elements through id or name attributes such that properties like children of an HTML element in the DOM are overshadowed by the malicious content.

The Sanitizer API does not protect against DOM clobbering attacks in its default state, but can be configured to remove id and name attributes.
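One possible configuration for this purpose is sketched here, using the dropAttributes option and the "*" wildcard described in the attribute match list section. The constant name antiClobberingConfig and the variable untrustedHtml are hypothetical, and the commented usage assumes a browser that implements this draft.

```javascript
// Drop id and name on every element to reduce the DOM clobbering surface.
const antiClobberingConfig = {
  dropAttributes: {
    id: ["*"],
    name: ["*"],
  },
};

// In a browser that implements this draft, usage might look like:
//   const sanitizer = new Sanitizer(antiClobberingConfig);
//   element.replaceChildren(sanitizer.sanitize(untrustedHtml));
```

Dropping both attributes everywhere is a blunt choice; an application that relies on id for in-page anchors in user content would need a narrower list.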

3.3. XSS with Script gadgets {#script-gadgets}

This section is not normative.

Script gadgets are a technique in which an attacker uses existing application code from popular JavaScript libraries to cause their own code to execute. This is often done by injecting innocent-looking code or seemingly inert DOM nodes that are only parsed and interpreted by a framework, which then executes JavaScript based on that input.

The Sanitizer API cannot prevent these attacks, but requires page authors to explicitly allow attributes and elements that are unknown to HTML and markup that is known to be widely used for templating and framework-specific code, like data- and slot attributes and elements like <slot> and <template>. We believe that these restrictions are not exhaustive and encourage page authors to examine their third party libraries for this behavior.

3.4. Mutated XSS {#mutated-xss}

This section is not normative.

Mutated XSS, or mXSS, describes an attack based on parser mismatches when an HTML snippet is parsed without the correct context. In particular, when a parsed HTML fragment has been serialized to a string, that string is not guaranteed to be parsed and interpreted in exactly the same way when inserted into a different parent element. Such an attack can, for example, rely on the change of parsing behavior for foreign content or misnested tags.

The Sanitizer API does not protect against mutated XSS. However, we encourage authors to use the sanitize() function of the API, which returns a DocumentFragment and avoids the risks that come with serialization and additional parsing. Directly operating on a fragment after sanitization also comes with a performance benefit, as the cost of additional serialization and parsing is avoided.
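The recommended pattern can be sketched as a small helper that never turns the sanitized result back into a string. This is a sketch only: renderUntrusted and its parameter names are hypothetical, and the helper assumes that target offers replaceChildren() and that sanitizer offers sanitize() as defined in this document.

```javascript
// Hypothetical helper: insert sanitized content as nodes, with no
// serialize-and-reparse round trip in between.
function renderUntrusted(target, sanitizer, untrustedHtml) {
  // sanitize() returns a DocumentFragment, which is inserted directly.
  target.replaceChildren(sanitizer.sanitize(untrustedHtml));
}

// Pattern to avoid: the string returned by sanitizeToString() is parsed a
// second time, in the context of `target`, which may differ from the first parse.
//   target.innerHTML = sanitizer.sanitizeToString(untrustedHtml);
```

Because the helper only depends on two method calls, it can be exercised with stub objects outside a browser.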

4. Acknowledgements

Cure53’s [DOMPURIFY] is a clear inspiration for the API this document describes, as is Internet Explorer’s window.toStaticHTML().

References

Normative References

[DOM]
Anne van Kesteren. DOM Standard. Living Standard. URL: https://dom.spec.whatwg.org/
[DOM-Parsing]
Travis Leithead. DOM Parsing and Serialization. 17 May 2016. WD. URL: https://www.w3.org/TR/DOM-Parsing/
[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INFRA]
Anne van Kesteren; Domenic Denicola. Infra Standard. Living Standard. URL: https://infra.spec.whatwg.org/
[WebIDL]
Boris Zbarsky. Web IDL. 15 December 2016. ED. URL: https://heycam.github.io/webidl/

Informative References

[DOMPURIFY]
DOMPurify. URL: https://github.com/cure53/DOMPurify

IDL Index

[
  Exposed=(Window),
  SecureContext
] interface Sanitizer {
  constructor(optional SanitizerConfig config = {});
  DocumentFragment sanitize(SanitizerInput input);
  DOMString sanitizeToString(SanitizerInput input);
};

typedef (DOMString or DocumentFragment or Document) SanitizerInput;

dictionary SanitizerConfig {
  sequence<DOMString> allowElements;
  sequence<DOMString> blockElements;
  sequence<DOMString> dropElements;
  AttributeMatchList allowAttributes;
  AttributeMatchList dropAttributes;
  boolean allowCustomElements;
};

typedef record<DOMString, sequence<DOMString>> AttributeMatchList;

Issues Index

This presently ignores all namespace info, making it impossible to support different actions for like-named elements from different namespaces.
It’s unclear whether we can assume a generic context for parseFromString, or if we need to re-work the API to take the insertion context of the created fragment into account. <https://github.com/WICG/sanitizer-api/issues/42>
The sanitizer defaults need to be carefully vetted, and are still under discussion. The values below are for illustrative purposes only.