HTMLVideoElement.requestVideoFrameCallback()

Draft Community Group Report

This version:
https://wicg.github.io/video-rvfc/
Issue Tracking:
GitHub
Inline In Spec
Editor:
Thomas Guilbert (Google Inc.)
Participate:
Git Repository.
File an issue.
Version History:
https://github.com/wicg/video-rvfc/commits

Abstract

<video>.requestVideoFrameCallback() allows web authors to be notified when a frame has been presented for composition.

Status of this document

This specification was published by the Web Platform Incubator Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

1. Introduction

This section is non-normative.

This is a proposal to add a requestVideoFrameCallback() method to the HTMLVideoElement.

This method allows web authors to register a callback which runs in the rendering steps, when a new video frame is sent to the compositor. The new callbacks are executed immediately before existing window.requestAnimationFrame() callbacks. Changes made from within both callback types during the same turn of the event loop will become visible on screen at the same time, with the next v-sync.

Drawing operations (e.g. drawing a video frame to a canvas via drawImage()) made through this API will be synchronized as a best effort with the video playing on screen. Best effort in this case means that, even with a normal work load, a callback can occasionally be fired one v-sync late, relative to when the new video frame was presented. This means that drawing operations might occasionally appear on screen one v-sync after the video frame does. Additionally, if there is a heavy load on the main thread, we might not get a callback for every frame (as measured by a discontinuity in the presentedFrames).

Note: A web author could know if a callback is late by checking whether expectedDisplayTime is equal to now, as opposed to roughly one v-sync in the future.
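The check described in the note above can be sketched as a small helper; the function name and tolerance are illustrative, not part of this specification:

```javascript
// Illustrative helper: a callback is considered "late" when the frame's
// expectedDisplayTime is roughly equal to now, rather than lying about
// one v-sync in the future. Both values are DOMHighResTimeStamps in ms.
function isLateCallback(now, expectedDisplayTime, toleranceMs = 1) {
  return expectedDisplayTime - now <= toleranceMs;
}

// Browser-only wiring (sketch):
// video.requestVideoFrameCallback((now, metadata) => {
//   if (isLateCallback(now, metadata.expectedDisplayTime)) {
//     // The frame was likely presented one v-sync before this callback ran.
//   }
// });
```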

The VideoFrameRequestCallback also provides useful metadata about the video frame that was most recently presented for composition, which can be used for automated metrics analysis.

2. VideoFrameMetadata

dictionary VideoFrameMetadata {
  required DOMHighResTimeStamp presentationTime;
  required DOMHighResTimeStamp expectedDisplayTime;

  required unsigned long width;
  required unsigned long height;
  required double mediaTime;

  required unsigned long presentedFrames;
  double processingDuration;

  DOMHighResTimeStamp captureTime;
  DOMHighResTimeStamp receiveTime;
  unsigned long rtpTimestamp;
};

2.1. Definitions

media pixels are defined as a media resource’s visible decoded pixels, without pixel aspect ratio adjustments. They are different from CSS pixels, which account for pixel aspect ratio adjustments.

2.2. Attributes

presentationTime, of type DOMHighResTimeStamp

The time at which the user agent submitted the frame for composition.

expectedDisplayTime, of type DOMHighResTimeStamp

The time at which the user agent expects the frame to be visible.

width, of type unsigned long

The width of the video frame, in media pixels.

height, of type unsigned long

The height of the video frame, in media pixels.

Note: width and height might differ from videoWidth and videoHeight in certain cases (e.g., an anamorphic video might have rectangular pixels). When calling texImage2D(), width and height are the dimensions used to copy the video’s media pixels to the texture, while videoWidth and videoHeight can be used to determine the aspect ratio to use when drawing with the texture.
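As a sketch of the distinction in the note above: a WebGL texture upload copies width × height media pixels, while the drawn quad should be sized from videoWidth and videoHeight. The gl wiring shown in comments is illustrative and browser-only:

```javascript
// Display aspect ratio comes from videoWidth/videoHeight (CSS pixels),
// not from the media-pixel dimensions reported in the metadata.
function displayAspectRatio(videoWidth, videoHeight) {
  return videoWidth / videoHeight;
}

// Browser-only sketch:
// video.requestVideoFrameCallback((now, metadata) => {
//   // Copies metadata.width x metadata.height media pixels into the texture:
//   gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);
//   // Size the drawn quad using the display aspect ratio instead:
//   const aspect = displayAspectRatio(video.videoWidth, video.videoHeight);
// });
```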

mediaTime, of type double

The media presentation timestamp (PTS) in seconds of the frame presented (e.g. its timestamp on the video.currentTime timeline). MAY have a zero value for live-streams or WebRTC applications.

presentedFrames, of type unsigned long

A count of the number of frames submitted for composition. Allows clients to determine if frames were missed between VideoFrameRequestCallbacks. MUST be monotonically increasing.
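Missed frames can be detected by looking for gaps in presentedFrames between successive callbacks; the helper name below is hypothetical:

```javascript
// Consecutive callbacks normally see presentedFrames increase by exactly 1;
// a larger jump means frames were presented without a callback firing.
function framesMissedSince(lastPresentedFrames, presentedFrames) {
  return Math.max(0, presentedFrames - lastPresentedFrames - 1);
}

// Browser-only wiring (sketch):
// let last = 0;
// video.requestVideoFrameCallback(function tick(now, metadata) {
//   if (last !== 0 && framesMissedSince(last, metadata.presentedFrames) > 0)
//     console.warn("missed video frame callback(s)");
//   last = metadata.presentedFrames;
//   video.requestVideoFrameCallback(tick);
// });
```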

processingDuration, of type double

The elapsed duration in seconds from when the encoded packet with the same presentation timestamp (PTS) as this frame (i.e., the same as mediaTime) was submitted to the decoder, until the decoded frame was ready for presentation.

In addition to decoding time, this duration may include processing time, e.g., YUV conversion and/or staging into GPU-backed memory.

SHOULD be present. In some cases, user agents might not be able to surface this information, since portions of the media pipeline might be owned by the OS.

captureTime, of type DOMHighResTimeStamp

For video frames coming from either a local or remote source, this is the time at which the frame was captured by the camera. For a remote source, the capture time is estimated using clock synchronization and RTCP sender reports to convert RTP timestamps to capture time as specified in RFC 3550 Section 6.4.1.

SHOULD be present for WebRTC applications, and absent otherwise.

receiveTime, of type DOMHighResTimeStamp

For video frames coming from a remote source, this is the time the encoded frame was received by the platform, i.e., the time at which the last packet belonging to this frame was received over the network.

SHOULD be present for WebRTC applications that receive data from a remote source, and absent otherwise.

rtpTimestamp, of type unsigned long

The RTP timestamp associated with this video frame.

SHOULD be present for WebRTC applications, and absent otherwise.
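Since captureTime and receiveTime are DOMHighResTimeStamps on the same timeline as the callback's now argument, they can be combined to estimate per-frame delays. The helpers below are illustrative, and guard against the fields being absent (they are only SHOULD-level):

```javascript
// Estimated time this frame spent on the network (remote sources only).
function networkDelay(metadata) {
  if (metadata.captureTime === undefined || metadata.receiveTime === undefined)
    return undefined;
  return metadata.receiveTime - metadata.captureTime;
}

// Estimated capture-to-callback delay for this frame.
function endToEndDelay(now, metadata) {
  if (metadata.captureTime === undefined) return undefined;
  return now - metadata.captureTime;
}
```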

3. VideoFrameRequestCallback

callback VideoFrameRequestCallback = void(DOMHighResTimeStamp now, VideoFrameMetadata metadata);

Each VideoFrameRequestCallback object has a canceled boolean initially set to false.

4. HTMLVideoElement.requestVideoFrameCallback()

partial interface HTMLVideoElement {
    unsigned long requestVideoFrameCallback(VideoFrameRequestCallback callback);
    void cancelVideoFrameCallback(unsigned long handle);
};

4.1. Methods

Each HTMLVideoElement has a list of video frame request callbacks, which is initially empty. It also has a last presented frame identifier and a video frame request callback identifier, which are both numbers initially set to zero.

requestVideoFrameCallback(callback)

Registers a callback to be fired the next time a frame is presented to the compositor.

When requestVideoFrameCallback is called, the user agent MUST run the following steps:

  1. Let video be the HTMLVideoElement on which requestVideoFrameCallback is invoked.

  2. Increment video’s video frame request callback identifier by one.

  3. Let callbackId be video’s video frame request callback identifier.

  4. Append callback to video’s list of video frame request callbacks, associated with callbackId.

  5. Return callbackId.

cancelVideoFrameCallback(handle)

Cancels an existing video frame request callback given its handle.

When cancelVideoFrameCallback is called, the user agent MUST run the following steps:

  1. Let video be the target HTMLVideoElement object on which cancelVideoFrameCallback is invoked.

  2. Find the entry in video’s list of video frame request callbacks that is associated with the value handle.

  3. If there is such an entry, set its canceled boolean to true and remove it from video’s list of video frame request callbacks.

4.2. Procedures

An HTMLVideoElement is considered to be an associated video element of a Document doc if its ownerDocument attribute is the same as doc.

This spec should eventually be merged into the HTML spec, at which point run the video frame request callbacks would be invoked directly from the update the rendering steps. This procedure describes where and how to invoke the algorithm in the meantime.

When the update the rendering algorithm is invoked, run this new step:

  For each fully active Document in docs, for each associated video element of that Document, run the video frame request callbacks, passing now as the timestamp.

immediately before this existing step:

  For each fully active Document in docs, run the animation frame callbacks for that Document, passing in now as the timestamp.

using the definitions for docs and now described in the update the rendering algorithm.

Note: The effective rate at which callbacks are run is the lesser of the video’s rate and the browser’s rate. When the video rate is lower than the browser rate, the callbacks' rate is limited by the frequency at which new frames are presented. When the video rate is greater than the browser rate, the callbacks' rate is limited by the frequency of the update the rendering steps. This means that a 25fps video playing in a browser that paints at 60Hz would fire callbacks at 25Hz, while a 120fps video in that same 60Hz browser would fire callbacks at 60Hz.
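The note above amounts to taking the minimum of the two rates (a simplification that ignores jitter and missed v-syncs):

```javascript
// Effective callback rate: limited by whichever of the video's frame
// rate and the browser's paint rate is lower.
function effectiveCallbackRate(videoFps, browserHz) {
  return Math.min(videoFps, browserHz);
}
```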

To run the video frame request callbacks for an HTMLVideoElement video with a timestamp now, run the following steps:

  1. If video’s list of video frame request callbacks is empty, abort these steps.

  2. Let metadata be the VideoFrameMetadata dictionary built from video’s latest presented frame.

  3. Let presentedFrames be the value of metadata’s presentedFrames field.

  4. If the last presented frame identifier is equal to presentedFrames, abort these steps.

  5. Set the last presented frame identifier to presentedFrames.

  6. Let callbacks be the list of video frame request callbacks.

  7. Set video’s list of video frame request callbacks to be empty.

  8. For each entry in callbacks:

    1. If the entry’s canceled boolean is true, continue to the next entry.

    2. Invoke the callback, passing now and metadata as arguments.

    3. If an exception is thrown, report the exception.

Note: There are no strict timing guarantees when it comes to how soon callbacks are run after a new video frame has been presented. Consider the following scenario: a new frame is presented on the compositor thread, just as the user agent aborts the algorithm above, when it confirms that there are no new frames. We therefore won’t run the callbacks in the current rendering steps, and have to wait until the next rendering steps, one v-sync later. In that case, visual changes to a web page made from within the delayed callbacks will appear on-screen one v-sync after the video frame does.

Offering stricter guarantees would likely force implementers to add cross-thread synchronization, which might be detrimental to video playback performance.

5. Security and Privacy Considerations

This specification does not expose any new privacy-sensitive information. However, the location correlation opportunities outlined in the Privacy and Security section of [webrtc-stats] also hold true for this spec: captureTime, receiveTime, and rtpTimestamp expose network-layer information which can be correlated to location information. For example, captureTime and receiveTime can be used to estimate network end-to-end travel time, which can give an indication of how far apart the peers are, and can reveal some location information about one peer if the location of the other is known. Since this information is already available via RTCStats, this specification doesn’t introduce any novel privacy considerations.

This specification might introduce some new GPU fingerprinting opportunities. processingDuration exposes some under-the-hood performance information about the video pipeline, which is otherwise inaccessible to web developers. Using this information, one could correlate the performance of various codecs and video sizes to a known GPU’s profile. We therefore propose a resolution of 100μs, which is still useful for automated quality analysis, but doesn’t offer any new sources of high resolution information. Still, despite a coarse clock, one could exploit the significant performance differences between hardware and software decoders to infer information about a GPU’s features. For example, this would make it easier to fingerprint the newest GPUs, which have hardware decoders for the latest codecs before such support is widespread. However, rather than measuring decoder performance directly, one could obtain equivalent information from MediaCapabilitiesInfo.

This specification also introduces some new timing information. presentationTime and expectedDisplayTime expose compositor timing information; captureTime and receiveTime expose network timing information. The clock resolution of these fields should therefore be coarse enough not to facilitate timing attacks.

6. Examples

6.1. Drawing frames at the video rate

This section is non-normative.

Drawing video frames onto a canvas at the video rate (instead of the browser’s animation rate) can be done by using video.requestVideoFrameCallback() instead of window.requestAnimationFrame().

<body>
  <video controls></video>
  <canvas width="640" height="360"></canvas>
  <span id="fps_text"></span>
</body>

<script>
  function startDrawing() {
    var video = document.querySelector('video');
    var canvas = document.querySelector('canvas');
    var ctx = canvas.getContext('2d');

    var paint_count = 0;
    var start_time = 0.0;

    var updateCanvas = function(now, metadata) {
      if (start_time === 0.0)
        start_time = now;

      ctx.drawImage(video, 0, 0, canvas.width, canvas.height);

      var elapsed = (now - start_time) / 1000.0;
      var fps = (++paint_count / elapsed).toFixed(3);
      document.querySelector('#fps_text').innerText = 'video fps: ' + fps;

      video.requestVideoFrameCallback(updateCanvas);
    };

    video.requestVideoFrameCallback(updateCanvas);

    video.src = "http://example.com/foo.webm";
    video.play();
  }
</script>

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Index

Terms defined by this specification

Terms defined by reference

References

Normative References

[CSS-VALUES-3]
Tab Atkins Jr.; Elika Etemad. CSS Values and Units Module Level 3. 6 June 2019. CR. URL: https://www.w3.org/TR/css-values-3/
[DOM]
Anne van Kesteren. DOM Standard. Living Standard. URL: https://dom.spec.whatwg.org/
[HR-TIME-2]
Ilya Grigorik. High Resolution Time Level 2. 21 November 2019. REC. URL: https://www.w3.org/TR/hr-time-2/
[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[MEDIA-CAPABILITIES]
Mounir Lamouri. Media Capabilities. ED. URL: https://w3c.github.io/media-capabilities/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[WebIDL]
Boris Zbarsky. Web IDL. 15 December 2016. ED. URL: https://heycam.github.io/webidl/

Informative References

[WEBRTC-STATS]
Harald Alvestrand; Varun Singh. Identifiers for WebRTC's Statistics API. 3 July 2018. CR. URL: https://www.w3.org/TR/webrtc-stats/

IDL Index

dictionary VideoFrameMetadata {
  required DOMHighResTimeStamp presentationTime;
  required DOMHighResTimeStamp expectedDisplayTime;

  required unsigned long width;
  required unsigned long height;
  required double mediaTime;

  required unsigned long presentedFrames;
  double processingDuration;

  DOMHighResTimeStamp captureTime;
  DOMHighResTimeStamp receiveTime;
  unsigned long rtpTimestamp;
};

callback VideoFrameRequestCallback = void(DOMHighResTimeStamp now, VideoFrameMetadata metadata);

partial interface HTMLVideoElement {
    unsigned long requestVideoFrameCallback(VideoFrameRequestCallback callback);
    void cancelVideoFrameCallback(unsigned long handle);
};

Issues Index

This spec should eventually be merged into the HTML spec, at which point run the video frame request callbacks would be invoked directly from the update the rendering steps. This procedure describes where and how to invoke the algorithm in the meantime.