1. Introduction
This section is not normative.
Web applications often need to work with strings of HTML on the client side,
perhaps as part of a client-side templating solution, perhaps as part of
rendering user generated content, etc. It is difficult to do so in a safe way.
The naive approach of joining strings together and stuffing them into an Element's innerHTML is fraught with risk, as it can cause JavaScript execution in a number of unexpected ways.
Libraries like [DOMPURIFY] attempt to manage this problem by carefully parsing and sanitizing strings before insertion, by constructing a DOM and filtering its members through an allow-list. This has proven to be a fragile approach, as the parsing APIs exposed to the web don’t always map in reasonable ways to the browser’s behavior when actually rendering a string as HTML in the "real" DOM. Moreover, the libraries need to keep on top of browsers' changing behavior over time; things that once were safe may turn into time-bombs based on new platform-level features.
The browser has a fairly good idea of when it is going to execute code. We can improve upon the user-space libraries by teaching the browser how to render HTML from an arbitrary string in a safe manner, and do so in a way that is much more likely to be maintained and updated along with the browser’s own changing parser implementation. This document outlines an API which aims to do just that.
1.1. Goals
- Mitigate the risk of DOM-based cross-site scripting attacks by providing developers with mechanisms for handling user-controlled HTML which prevent direct script execution upon injection.
- Make HTML output safe for use within the current user agent, taking into account its current understanding of HTML.
- Allow developers to override the default set of elements and attributes. Adding certain elements and attributes can prevent script gadget attacks.
1.2. API Summary
The Sanitizer API offers functionality to parse a string containing HTML into a DOM tree, and to filter the resulting tree according to a user-supplied configuration. The methods come in two by two flavours:
- Safe and unsafe: The "safe" methods will not generate any markup that executes script. That is, they should be safe from XSS. The "unsafe" methods parse and filter exactly as the supplied configuration directs, without that guarantee.
- Context: Methods are defined on Element and ShadowRoot and will replace these nodes' children, and are largely analogous to innerHTML. There are also static methods on the Document, which parse an entire document and are largely analogous to DOMParser.parseFromString().
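As a rough illustration of the safe/unsafe split, the sketch below routes untrusted markup through setHTML() when the target supports it and falls back to inert text otherwise. The helper name renderUntrusted and the textContent fallback are assumptions made for this example, not part of the API.

```javascript
// Hypothetical helper: insert untrusted markup through the safe method when available.
// `target` is expected to be an Element or ShadowRoot.
function renderUntrusted(target, html) {
  if (typeof target.setHTML === "function") {
    // Safe variant: parses, sanitizes, and replaces the children of target.
    target.setHTML(html);
    return "sanitized";
  }
  // Assumed fallback for user agents without the API: show the markup as inert text.
  target.textContent = html;
  return "text";
}
```

The return value exists only so that callers (and tests) can tell which path was taken.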
2. Framework
2.1. Sanitizer API
The Element interface defines two methods, setHTML() and setHTMLUnsafe(). Both of these take a DOMString with HTML markup, and an optional configuration.
partial interface Element {
  [CEReactions] undefined setHTMLUnsafe((TrustedHTML or DOMString) html, optional SetHTMLOptions options = {});
  [CEReactions] undefined setHTML(DOMString html, optional SetHTMLOptions options = {});
};
Element's setHTMLUnsafe(html, options) method steps are:
- Let compliantHTML be the result of invoking the Get Trusted Type compliant string algorithm with TrustedHTML, this's relevant global object, html, "Element setHTMLUnsafe", and "script".
- Let target be this's template contents if this is a template element; otherwise this.
- Set and filter HTML given target, this, compliantHTML, options, and false.
Element's setHTML(html, options) method steps are:
- Let target be this's template contents if this is a template; otherwise this.
- Set and filter HTML given target, this, html, options, and true.
These methods are mirrored on the ShadowRoot:
partial interface ShadowRoot {
  [CEReactions] undefined setHTMLUnsafe((TrustedHTML or DOMString) html, optional SetHTMLOptions options = {});
  [CEReactions] undefined setHTML(DOMString html, optional SetHTMLOptions options = {});
};
ShadowRoot's setHTMLUnsafe(html, options) method steps are:
- Let compliantHTML be the result of invoking the Get Trusted Type compliant string algorithm with TrustedHTML, this's relevant global object, html, "ShadowRoot setHTMLUnsafe", and "script".
- Set and filter HTML using this (as target), this's shadow host (as context element), compliantHTML, options, and false.
ShadowRoot's setHTML(html, options) method steps are:
- Set and filter HTML using this (as target), this's shadow host (as context element), html, options, and true.
The Document interface gains two new methods which parse an entire Document:
partial interface Document {
  static Document parseHTMLUnsafe((TrustedHTML or DOMString) html, optional SetHTMLOptions options = {});
  static Document parseHTML(DOMString html, optional SetHTMLOptions options = {});
};
The parseHTMLUnsafe(html, options) method steps are:
- Let compliantHTML be the result of invoking the Get Trusted Type compliant string algorithm with TrustedHTML, this's relevant global object, html, "Document parseHTMLUnsafe", and "script".
- Let document be a new Document, whose content type is "text/html".
  Note: Since document does not have a browsing context, scripting is disabled.
- Set document’s allow declarative shadow roots to true.
- Parse HTML from a string given document and compliantHTML.
- Let config be the result of calling get a sanitizer config from options with options and false.
- If config is not empty, then call sanitize on document’s root node with config.
- Return document.
The parseHTML(html, options) method steps are:
- Let document be a new Document, whose content type is "text/html".
  Note: Since document does not have a browsing context, scripting is disabled.
- Set document’s allow declarative shadow roots to true.
- Parse HTML from a string given document and html.
- Let config be the result of calling get a sanitizer config from options with options and true.
- Call sanitize on document’s root node with config.
- Return document.
2.2. SetHTML options and the configuration object.
The family of setHTML()-like methods all accept an options dictionary. Right now, only one member of this dictionary is defined:
dictionary SetHTMLOptions {
  (Sanitizer or SanitizerConfig) sanitizer = {};
};
The Sanitizer configuration object encapsulates a filter configuration. The same config can be used with both safe and unsafe methods. The intent is that one (or a few) configurations will be built up early on in a page’s lifetime, and can then be used whenever needed. This allows implementations to pre-process configurations.
The configuration object is also query-able and can return canonical configuration dictionaries, in both safe and unsafe variants. This allows a page to query and predict what effect a given configuration will have, or to build a new configuration based on an existing one.
[Exposed=(Window,Worker)]
interface Sanitizer {
  constructor(optional SanitizerConfig config = {});
  SanitizerConfig get();
  SanitizerConfig getUnsafe();
};
The constructor(config) method steps are:
- Store config in this's internal slot.
The get() method steps are:
- Return the result of canonicalize a configuration with the value of this's internal slot and true.
The getUnsafe() method steps are:
- Return the result of canonicalize a configuration with the value of this's internal slot and false.
2.3. The Configuration Dictionary
dictionary SanitizerElementNamespace {
  required DOMString name;
  DOMString? _namespace = "http://www.w3.org/1999/xhtml";
};

// Used by "elements"
dictionary SanitizerElementNamespaceWithAttributes : SanitizerElementNamespace {
  sequence<SanitizerAttribute> attributes;
  sequence<SanitizerAttribute> removeAttributes;
};

typedef (DOMString or SanitizerElementNamespace) SanitizerElement;
typedef (DOMString or SanitizerElementNamespaceWithAttributes) SanitizerElementWithAttributes;

dictionary SanitizerAttributeNamespace {
  required DOMString name;
  DOMString? _namespace = null;
};
typedef (DOMString or SanitizerAttributeNamespace) SanitizerAttribute;

dictionary SanitizerConfig {
  sequence<SanitizerElementWithAttributes> elements;
  sequence<SanitizerElement> removeElements;
  sequence<SanitizerElement> replaceWithChildrenElements;

  sequence<SanitizerAttribute> attributes;
  sequence<SanitizerAttribute> removeAttributes;

  boolean comments;
  boolean dataAttributes;
};
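For illustration, a configuration that mixes the string shorthand with the explicit dictionary form might look like the following. The particular element and attribute choices are assumptions made for the example, not recommended defaults.

```javascript
// Illustrative SanitizerConfig; every name below is an example value only.
const exampleConfig = {
  elements: [
    "p",                                  // string shorthand, HTML namespace implied
    { name: "a", attributes: ["href"] },  // per-element attribute allow-list
    { name: "svg", namespace: "http://www.w3.org/2000/svg" }, // explicit namespace
  ],
  replaceWithChildrenElements: ["span"],  // unwrap <span>, keep its children
  attributes: ["class", "href"],          // global attribute allow-list
  comments: false,
  dataAttributes: false,
};
```

Note that the IDL member `_namespace` is spelled `namespace` in script; the leading underscore is only WebIDL escaping for a reserved word.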
3. Algorithms
To set and filter HTML, given an Element or DocumentFragment target, an Element contextElement, a string html, a dictionary options, and a boolean safe:
- If safe and contextElement’s local name is "script" and contextElement’s namespace is the HTML namespace or the SVG namespace, then return.
- Let config be the result of calling get a sanitizer config from options with options and safe.
- Let newChildren be the result of the HTML fragment parsing algorithm steps given contextElement, html, and true.
- Let fragment be a new DocumentFragment whose node document is contextElement’s node document.
- For each node in newChildren, append node to fragment.
- If config is not empty, then run sanitize on fragment using config.
- Replace all with fragment within target.
To get a sanitizer config from options, given a dictionary options and a boolean safe:
- Assert: options is a dictionary.
- If options["sanitizer"] doesn’t exist, then return undefined.
- Assert: options["sanitizer"] is either a Sanitizer instance or a dictionary.
- If options["sanitizer"] is a Sanitizer instance:
  - Then let config be the value of options["sanitizer"]'s internal slot.
  - Otherwise let config be the value of options["sanitizer"].
- Return the result of calling canonicalize a configuration on config and safe.
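The option-unwrapping steps can be sketched in script as follows. SanitizerModel and its internalConfig getter are hypothetical stand-ins for the Sanitizer interface and its internal slot, and the canonicalization step is passed in as a function so the sketch does not depend on how it is implemented.

```javascript
// Stand-in for the Sanitizer interface: it only stores the config it was given.
class SanitizerModel {
  #config;
  constructor(config = {}) {
    this.#config = config;
  }
  // Hypothetical accessor for what the spec calls the internal slot.
  get internalConfig() {
    return this.#config;
  }
}

// Returns undefined when no sanitizer was supplied; otherwise the canonicalized config.
function getSanitizerConfigFromOptions(options, safe, canonicalize) {
  if (!("sanitizer" in options)) return undefined;
  const sanitizer = options.sanitizer;
  const config =
    sanitizer instanceof SanitizerModel ? sanitizer.internalConfig : sanitizer;
  return canonicalize(config, safe);
}
```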
3.1. Sanitization Algorithms
To sanitize, given a ParentNode node and a canonical SanitizerConfig config, run these steps:
- Let current be node.
- For each child in current’s children:
  - Assert: child implements Text, Comment, or Element.
    Note: Currently, this algorithm is only called on output of the HTML parser, for which this assertion should hold. If in the future this algorithm is used in different contexts, this assumption needs to be re-examined.
  - If child implements Text:
  - else if child implements Comment:
  - else:
    - Let elementName be a SanitizerElementNamespace with child’s local name and namespace.
    - If config["elements"] exists and config["elements"] does not contain elementName:
      - remove child.
    - else if config["removeElements"] exists and config["removeElements"] contains elementName:
      - remove child.
    - If config["replaceWithChildrenElements"] exists and config["replaceWithChildrenElements"] contains elementName:
      - Call sanitize on child with config.
      - Call replace all with child’s children within child.
    - If elementName equals «[ "name" → "template", "namespace" → HTML namespace ]»:
      - Then call sanitize on child’s template contents with config.
    - If child is a shadow host:
      - Then call sanitize on child’s shadow root with config.
    - For each attr in child’s attribute list:
      - Let attrName be a SanitizerAttributeNamespace with attr’s local name and namespace.
      - If config["attributes"] exists and config["attributes"] does not contain attrName:
        - If "data-" is a code unit prefix of local name and if namespace is null and if config["dataAttributes"] exists and is false:
          - Remove attr from child.
      - else if config["removeAttributes"] exists and config["removeAttributes"] contains attrName:
        - Remove attr from child.
      - If config["elements"][elementName] exists, and if config["elements"][elementName]["attributes"] exists, and if config["elements"][elementName]["attributes"] does not contain attrName:
        - Remove attr from child.
      - If config["elements"][elementName] exists, and if config["elements"][elementName]["removeAttributes"] exists, and if config["elements"][elementName]["removeAttributes"] contains attrName:
        - Remove attr from child.
      - If «[elementName, attrName]» matches an entry in the navigating URL attributes list, and if attr’s protocol is "javascript:":
        - Then remove attr from child.
    - Call sanitize on child’s shadow root with config.
  - else:
    - remove child.
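The element-level part of this walk can be sketched over a tree of plain objects. This is a simplified model under several assumptions that the steps above do not settle: nodes are plain objects with kind, name, namespace, and children fields; text nodes are kept; comments are kept only when config.comments is true; replaceWithChildrenElements is checked before the allow-list so that unwrapped children survive; kept elements are walked recursively; and attribute handling, templates, and shadow roots are left out entirely.

```javascript
// True if `list` has an entry whose name and namespace both match `item`.
function containsName(list, item) {
  return list.some((e) => e.name === item.name && e.namespace === item.namespace);
}

// Simplified model of the element walk over plain objects (not real DOM nodes).
function sanitizeTree(node, config) {
  const kept = [];
  for (const child of node.children) {
    if (child.kind === "text") {
      kept.push(child); // assumption: text passes through unchanged
      continue;
    }
    if (child.kind === "comment") {
      if (config.comments === true) kept.push(child); // assumption: gated on "comments"
      continue;
    }
    const elementName = { name: child.name, namespace: child.namespace };
    const unwrap = config.replaceWithChildrenElements || [];
    if (containsName(unwrap, elementName)) {
      sanitizeTree(child, config); // sanitize the subtree first, then unwrap it
      kept.push(...child.children);
      continue;
    }
    if (config.elements && !containsName(config.elements, elementName)) continue;
    if (config.removeElements && containsName(config.removeElements, elementName)) continue;
    sanitizeTree(child, config); // assumption: recurse into every kept element
    kept.push(child);
  }
  node.children = kept;
  return node;
}
```

Building a fresh `kept` array avoids mutating the child list while iterating over it, which is the main hazard when porting the "remove child" steps literally.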
3.2. Configuration Processing
A SanitizerConfig config is valid if all of the following conditions are met:
- config is a dictionary.
- config’s key set does not contain both "elements" and "removeElements".
- config’s key set does not contain both "removeAttributes" and "attributes".
- For any key of «[ "elements", "removeElements", "replaceWithChildrenElements", "attributes", "removeAttributes" ]» where config[key] exists:
  - config[key] is valid.
- If config["elements"] exists, then for any element in config["elements"] that is a dictionary:
  - element does not contain both "attributes" and "removeAttributes".
  - If either element["attributes"] or element["removeAttributes"] exists, then it is valid.
- Let tmp be a dictionary, and for any key of «[ "elements", "removeElements", "replaceWithChildrenElements", "attributes", "removeAttributes" ]», tmp[key] is set to the result of canonicalize a sanitizer element list called on config[key], with HTML namespace as default namespace for the element lists, and null as default namespace for the attributes lists.
  Note: The intent here is to assert about list elements, but without regard to whether the string shortcut syntax or the explicit dictionary syntax is used. For example, "img" in elements and { name: "img" } in removeElements name the same element. An implementation might well do this without explicitly canonicalizing the lists at this point.
  - Given these canonicalized name lists, all of the following conditions hold:
    - The intersection between tmp["elements"] and tmp["removeElements"] is empty.
    - The intersection between tmp["removeElements"] and tmp["replaceWithChildrenElements"] is empty.
    - The intersection between tmp["replaceWithChildrenElements"] and tmp["elements"] is empty.
    - The intersection between tmp["attributes"] and tmp["removeAttributes"] is empty.
  - Let tmpattrs be tmp["attributes"] if it exists, and otherwise built-in default config["attributes"].
  - For any item in tmp["elements"]:
    - If either item["attributes"] or item["removeAttributes"] exists:
      - Then the difference between it and tmpattrs is empty.
A list of names list is valid if all of the following conditions are met:
- list is a list.
- For all of its members name:
  - name is a string or a dictionary.
  - If name is a dictionary:
A SanitizerConfig config is canonical if all of the following conditions are met:
- config is valid.
- config’s key set is a subset of «[ "elements", "removeElements", "replaceWithChildrenElements", "attributes", "removeAttributes", "comments", "dataAttributes" ]».
- config’s key set contains either:
  - both "elements" and "attributes", but neither of "removeElements" or "removeAttributes".
  - or both "removeElements" and "removeAttributes", but neither of "elements" or "attributes".
- For any key of «[ "replaceWithChildrenElements", "removeElements", "attributes", "removeAttributes" ]» where config[key] exists:
  - config[key] is canonical.
- For any key of «[ "comments", "dataAttributes" ]»:
A list of names list is canonical if all of the following conditions are met:
- list is a list.
- For all of its members name:
  - name is a dictionary.
  - name’s key set is one of:
    - «[ "name", "namespace" ]»
    - «[ "name", "namespace", "attributes" ]»
    - «[ "name", "namespace", "removeAttributes" ]»
  - name["attributes"] and name["removeAttributes"] are canonical if they exist.
To canonicalize a configuration config with a boolean safe:
Note: The initial set of asserts checks properties of the built-in constants, like the defaults and the lists of known elements and attributes.
- Assert: built-in default config["elements"] is a subset of known elements.
- Assert: built-in default config["attributes"] is a subset of known attributes.
- Assert: «[ "elements" → known elements, "attributes" → known attributes ]» is canonical.
- If config is empty and not safe, then return «[]».
- Let result be a new dictionary.
- For each key of «[ "elements", "removeElements", "replaceWithChildrenElements" ]»:
  - If config[key] exists, set result[key] to the result of running canonicalize a sanitizer element list on config[key] with HTML namespace as the default namespace.
- For each key of «[ "attributes", "removeAttributes" ]»:
  - If config[key] exists, set result[key] to the result of running canonicalize a sanitizer element list on config[key] with null as the default namespace.
- Let default be the result of canonicalizing a configuration for the built-in default config.
- If safe:
  - If config["elements"] exists:
    - Let elementBlockList be the difference between known elements and default["elements"].
      Note: The "natural" way to enforce the default element list would be to intersect with it. But that would also eliminate any unknown element (i.e., one not defined by HTML, like <foo>). So we construct this helper to be able to use it to subtract any "unsafe" elements.
    - Set result["elements"] to the difference of result["elements"] and elementBlockList.
  - If config["removeElements"] exists:
    - Set result["elements"] to the difference of default["elements"] and result["removeElements"].
    - Remove "removeElements" from result.
  - If neither config["elements"] nor config["removeElements"] exist:
    - Set result["elements"] to default["elements"].
  - If config["attributes"] exists:
    - Let attributeBlockList be the difference between known attributes and default["attributes"].
    - Set result["attributes"] to the difference of result["attributes"] and attributeBlockList.
  - If config["removeAttributes"] exists:
    - Set result["attributes"] to the difference of default["attributes"] and result["removeAttributes"].
    - Remove "removeAttributes" from result.
  - If neither config["attributes"] nor config["removeAttributes"] exist:
    - Set result["attributes"] to default["attributes"].
- Else (if not safe):
  - If neither config["elements"] nor config["removeElements"] exist:
    - Set result["elements"] to default["elements"].
  - If neither config["attributes"] nor config["removeAttributes"] exist:
    - Set result["attributes"] to default["attributes"].
- Return result.
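The block-list construction in the safe branch can be illustrated with tiny stand-in lists. The three-entry knownElements and two-entry defaultElements below are made up for the example (the real lists are much longer), and the helper names are hypothetical.

```javascript
const HTML_NS = "http://www.w3.org/1999/xhtml";
const el = (name) => ({ name, namespace: HTML_NS });
const sameName = (a, b) => a.name === b.name && a.namespace === b.namespace;

// Members of `a` that do not occur in `b`, compared by name and namespace.
const difference = (a, b) => a.filter((x) => !b.some((y) => sameName(x, y)));

// Stand-in lists: "script" is known to the platform but not in the safe default.
const knownElements = [el("p"), el("b"), el("script")];
const defaultElements = [el("p"), el("b")];

// Safe-mode treatment of a caller-supplied "elements" allow-list.
function safeElements(requested) {
  const elementBlockList = difference(knownElements, defaultElements);
  return difference(requested, elementBlockList);
}
```

With these stand-ins, a request for p, script, and the unknown element foo keeps p and foo: subtracting the block list removes script, whereas intersecting with the default list would also have dropped foo.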
To canonicalize a sanitizer element list list, with a default namespace defaultNamespace:
- Let result be a new ordered set.
- For each name in list, call canonicalize a sanitizer name on name with defaultNamespace and append to result.
- Return result.
To canonicalize a sanitizer name name, with a default namespace defaultNamespace:
- Assert: name is either a DOMString or a dictionary.
- If name is a DOMString, then return «[ "name" → name, "namespace" → defaultNamespace ]».
- Assert: name is a dictionary and name["name"] exists.
- Return «[ "name" → name["name"], "namespace" → name["namespace"] if it exists, otherwise defaultNamespace ]».
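A minimal script sketch of these two canonicalization steps follows. The function names are hypothetical, plain objects stand in for ordered maps, and an array with a duplicate check stands in for an ordered set.

```javascript
const HTML_NS = "http://www.w3.org/1999/xhtml";

// Turns the string shorthand or a dictionary into a { name, namespace } pair.
function canonicalizeName(name, defaultNamespace) {
  if (typeof name === "string") {
    return { name, namespace: defaultNamespace };
  }
  if (name === null || typeof name !== "object" || !("name" in name)) {
    throw new TypeError("expected a string or a dictionary with a name");
  }
  return {
    name: name.name,
    // An explicitly supplied namespace (including null) wins over the default.
    namespace: "namespace" in name ? name.namespace : defaultNamespace,
  };
}

// Canonicalizes every entry; duplicates are skipped, as in an ordered set.
function canonicalizeNameList(list, defaultNamespace) {
  const result = [];
  for (const entry of list) {
    const canonical = canonicalizeName(entry, defaultNamespace);
    const seen = result.some(
      (e) => e.name === canonical.name && e.namespace === canonical.namespace
    );
    if (!seen) result.push(canonical);
  }
  return result;
}
```

Element lists would be canonicalized with HTML_NS as the default namespace and attribute lists with null, matching the steps in canonicalize a configuration.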
3.3. Supporting Algorithms
For the element and attribute name lists used in this spec, list membership is based on matching both "name" and "namespace" entries: a Sanitizer name list list contains an item if there exists an entry of list that is an ordered map, and where item["name"] equals entry["name"] and item["namespace"] equals entry["namespace"].
To compute the difference of two ordered sets A and B:
- Let set be a new ordered set.
- For each item of A:
  - If B does not contain item, append item to set.
- Return set.
3.4. Defaults
Note: The defaults should follow a certain form, which is checked for at the beginning of canonicalize a configuration.
The built-in default config is as follows:
{ elements: [....], attributes: [....], comments: true, }
The known elements are as follows:
[ { name: "div", namespace: "http://www.w3.org/1999/xhtml" }, ... ]
The known attributes are as follows:
[ { name: "class", namespace: null }, ... ]
Note: The known elements and known attributes should be derived from the HTML5 specification, rather than being explicitly listed here. Currently, there are no mechanics to do so.
The navigating URL attributes list, for which "javascript:" navigations are unsafe, is as follows:
«[
  [ { "name" → "a", "namespace" → "HTML namespace" }, { "name" → "href", "namespace" → null } ],
  [ { "name" → "area", "namespace" → "HTML namespace" }, { "name" → "href", "namespace" → null } ],
  [ { "name" → "form", "namespace" → "HTML namespace" }, { "name" → "action", "namespace" → null } ],
  [ { "name" → "input", "namespace" → "HTML namespace" }, { "name" → "formaction", "namespace" → null } ],
  [ { "name" → "button", "namespace" → "HTML namespace" }, { "name" → "formaction", "namespace" → null } ],
]»
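As a sketch of how this list might be consulted, the function below checks an element/attribute pair against the table and then tests the value's scheme. It assumes HTML-namespace elements and null-namespace attributes, uses the standard URL parser to read the protocol, and treats values that do not parse as absolute URLs as not matching; the function and constant names are hypothetical.

```javascript
// Element/attribute local-name pairs taken from the list above.
const NAVIGATING_URL_ATTRIBUTES = [
  ["a", "href"],
  ["area", "href"],
  ["form", "action"],
  ["input", "formaction"],
  ["button", "formaction"],
];

// True if the attribute is a listed navigation attribute carrying a javascript: URL.
function isUnsafeNavigation(elementLocalName, attrLocalName, value) {
  const listed = NAVIGATING_URL_ATTRIBUTES.some(
    ([element, attr]) => element === elementLocalName && attr === attrLocalName
  );
  if (!listed) return false;
  try {
    // The URL parser lowercases the scheme, so "JavaScript:" is caught as well.
    return new URL(value).protocol === "javascript:";
  } catch {
    // Relative or unparsable value: there is no scheme to match (assumption of this sketch).
    return false;
  }
}
```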
4. Security Considerations
The Sanitizer API is intended to prevent DOM-based Cross-Site Scripting by traversing supplied HTML content and removing elements and attributes according to a configuration. The specified API must not support the construction of a Sanitizer object that leaves script-capable markup in place; any configuration that did so would be a bug in the threat model.
That being said, there are security issues which correct usage of the Sanitizer API cannot protect against. These scenarios are laid out in the following sections.
4.1. Server-Side Reflected and Stored XSS
This section is not normative.
The Sanitizer API operates solely in the DOM and adds a capability to traverse and filter an existing DocumentFragment. The Sanitizer does not address server-side reflected or stored XSS.
4.2. DOM clobbering
This section is not normative.
DOM clobbering describes an attack in which malicious HTML confuses an application by naming elements through id or name attributes such that properties like children of an HTML element in the DOM are overshadowed by the malicious content.
The Sanitizer API does not protect against DOM clobbering attacks in its default state, but can be configured to remove id and name attributes.
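A configuration along the following lines would strip both attributes. It is only a sketch: the constant name is hypothetical, and whether dropping id and name is acceptable depends on what the page's own scripts and styles rely on.

```javascript
// Illustrative configuration that removes the attributes used for DOM clobbering.
const noClobberingConfig = {
  removeAttributes: ["id", "name"],
};

// Intended use with the API described earlier (requires a supporting user agent):
//   element.setHTML(untrustedMarkup, { sanitizer: noClobberingConfig });
```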
4.3. XSS with Script gadgets
This section is not normative.
Script gadgets are a technique in which an attacker uses existing application code from popular JavaScript libraries to cause their own code to execute. This is often done by injecting innocent-looking code or seemingly inert DOM nodes that are only parsed and interpreted by a framework, which then executes JavaScript based on that input.
The Sanitizer API cannot prevent these attacks, but requires page authors to explicitly allow unknown elements in general, and authors must additionally explicitly configure unknown attributes and elements and markup that is known to be widely used for templating and framework-specific code, like data- and slot attributes and elements like <slot> and <template>.
We believe that these restrictions are not exhaustive and encourage page
authors to examine their third party libraries for this behavior.
4.4. Mutated XSS
This section is not normative.
Mutated XSS or mXSS describes an attack based on parser context mismatches when parsing an HTML snippet without the correct context. In particular, when a parsed HTML fragment has been serialized to a string, the string is not guaranteed to be parsed and interpreted exactly the same when inserted into a different parent element. One way to carry out such an attack is to rely on the change of parsing behavior for foreign content or mis-nested tags.
The Sanitizer API offers only functions that turn a string into a node tree. The context is supplied implicitly by all sanitizer functions: Element.setHTML() uses the current element; Document.parseHTML() creates a new document. Therefore the Sanitizer API is not directly affected by mutated XSS.
If a developer were to retrieve a sanitized node tree as a string, e.g. via .innerHTML, and then parse it again, mutated XSS may occur.
We discourage this practice. If processing or passing of HTML as a string should be necessary after all, then any string should be considered untrusted and should be sanitized (again) when inserting it into the DOM. In other words, a sanitized and then serialized HTML tree can no longer be considered as sanitized.
A more complete treatment of mXSS can be found in [MXSS].
5. Acknowledgements
Cure53’s [DOMPURIFY] is a clear inspiration for the API this document describes, as is Internet Explorer’s window.toStaticHTML().