AudioFocus API Explained

Objectives

People consume a lot of media (audio/video), and the Web is one of the primary means of consuming this type of content. One problem with Web media is that it is usually badly mixed. The AudioFocus API helps improve the audio mixing of Web media, so that different sources can play on top of each other or play exclusively. The API also improves audio mixing between Web media and native applications.

Motivation

The behavior when multiple tabs play sound at the same time, and how they interact with each other and with native applications on the platform, is barely defined, which usually leads to a bad user experience. It is annoying when two tabs play video at the same time, but it is usually acceptable to let a transient ping play on top of music playback.

For example, if the user plays a video on Facebook while listening to music in another tab, the Facebook video should pause the music playback. However, if Facebook wants to play a Messenger notification, it would annoy the user if the ping interrupted the music playback. Lowering the music volume and letting the Facebook ping play over the music gives a better experience.

API design

The AudioFocus API design is still a sketch. Comments and suggestions are welcome.

First, audio focus means that an audio-producing object is allowed to play sound. By convention, there are several audio focus types for different purposes:

- playback: long-running, exclusive playback, such as music or video.
- transient: short sounds, such as a notification ping, which should duck (lower the volume of) other audio.
- transient-solo: short sounds that should pause all other audio while they play.
- ambient: audio that simply mixes with everything else and does not need exclusive focus.

The audio focus types above may differ from the audio focus types provided by a given platform, so there may be compatibility issues between the types in the API and those on the platform. For example, iOS does not explicitly distinguish transient from non-transient focus, and Android does not explicitly have an ambient type. The audio focus types in the API are therefore not strict, and the user agent should adapt them to platform conventions.

The possible implementations on Android and iOS for the four audio focus types in the API would be:

| Audio focus type | Android | iOS |
| --- | --- | --- |
| playback | MUSIC x GAIN | Playback |
| transient | MUSIC/NOTIFICATION x TRANSIENT_MAY_DUCK | Playback/Ambient x DuckOthers |
| transient-solo | MUSIC/NOTIFICATION x TRANSIENT | Playback/SoloAmbient |
| ambient | Not requesting or responding to audio focus at all (may be bad)? | Playback/Ambient x MixWithOthers |

The model for handling audio focus

AudioFocusEntry is the minimum unit for handling audio focus. An AudioFocusEntry has a type indicating whether it is playback, transient, transient-solo or ambient. Here we only consider media elements when describing the model; other audio-producing objects such as Flash, AudioContext (WebAudio) and WebRTC will be discussed later.

The page creates an AudioFocusEntry with an audio focus type and can associate media elements with it. When the page wants to play audio, it needs to request audio focus through the AudioFocusEntry, either by calling play() on an associated media element or by explicitly requesting audio focus from JavaScript (even without associating any media elements). The user agent decides whether the request succeeds and tells the AudioFocusEntry. The element can then play once audio focus is granted; otherwise, the play request is rejected. An AudioFocusEntry may optionally tell the user agent when it no longer wants to play (abandon audio focus).

A sample snippet is as follows:

// Suppose |audio| is an <audio> element.
var focusEntry = new AudioFocusEntry("playback");
audio.focusEntry = focusEntry;
audio.play()
    .then(function () {
        console.log("play() success");
    }).catch(function () {
        console.log("failed to request audio focus");
    });

AudioFocusEntrys can have the following states:

- active: the entry holds audio focus and its associated objects may play sound.
- suspended: the entry has temporarily lost audio focus and its associated objects should pause.
- ducking: the entry keeps playing, but at a reduced volume.
- inactive: the entry does not hold audio focus.

Note: we may want to expose only some of the states to the page, perhaps only active, suspended and inactive. Ducking should be handled internally by the user agent.
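
If such states were exposed, the page could react to focus changes. Below is a minimal sketch; the statechange event and state attribute are hypothetical, not part of the proposal:

// Hypothetical: |statechange| and |state| are not specified anywhere yet.
focusEntry.addEventListener("statechange", function () {
  if (focusEntry.state === "suspended") {
    // Something else took audio focus; reflect the pause in the UI.
    console.log("playback suspended");
  } else if (focusEntry.state === "active") {
    console.log("playback resumed");
  }
});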

The user agent can perform the following operations on AudioFocusEntrys:

- grant or reject an audio focus request;
- suspend an entry that temporarily loses audio focus;
- resume a suspended entry when it regains audio focus;
- duck an entry, keeping it playing at a reduced volume.

The user agent should keep track of all active, suspended and ducking AudioFocusEntrys, which form its managed AudioFocusEntry set. There is at most one active, suspended or ducking AudioFocusEntry of playback type. When an AudioFocusEntry requests or abandons audio focus, the user agent must decide whether to grant the play request and how to change the states of the entries in the managed set. The detailed implementation is up to the user agent. For example, activating a playback entry could suspend other playback entries (starting a video pauses music in another tab), while activating a transient entry could merely duck them.

Besides, if the platform has an audio focus handling mechanism, the user agent should act as a proxy and forward AudioFocusEntry requests to the platform. The user agent should also listen to audio-focus-related signals from the platform and update the states of its managed AudioFocusEntrys accordingly. For example, if a native music app starts playback and the platform tells the user agent to suspend, the user agent should suspend all of its AudioFocusEntrys.

To summarize the processing model: the page associates audio-producing objects with an AudioFocusEntry and requests audio focus through it; the user agent (consulting the platform where applicable) decides whether to grant the request; and whenever audio focus changes, the user agent updates the states of the entries in its managed set, suspending, resuming or ducking them accordingly.
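
To make the model concrete, here is a minimal sketch of the arbitration a user agent might perform. The AudioFocusManager class and its helper methods are hypothetical and internal to the user agent, not part of the proposed API:

// Hypothetical user-agent-internal logic, not part of the API.
class AudioFocusManager {
  constructor() {
    this.managedEntries = new Set(); // active/suspended/ducking entries
  }

  // Decide whether |entry| may gain audio focus and update the others.
  requestFocus(entry) {
    switch (entry.type) {
      case "playback":
      case "transient-solo":
        // Exclusive playback: suspend every other managed entry.
        for (const other of this.managedEntries)
          this.suspend(other);
        break;
      case "transient":
        // Short sound: duck playback entries instead of pausing them.
        for (const other of this.managedEntries)
          if (other.type === "playback")
            this.duck(other);
        break;
      case "ambient":
        // Ambient audio mixes with everything; nothing to arbitrate.
        break;
    }
    this.managedEntries.add(entry);
    return true; // A real user agent may also consult the platform here.
  }

  abandonFocus(entry) {
    this.managedEntries.delete(entry);
    // Restore entries that were ducked or suspended on its behalf.
    for (const other of this.managedEntries)
      this.resume(other);
  }

  suspend(entry) { /* pause associated elements, set state "suspended" */ }
  duck(entry) { /* lower the volume, set state "ducking" */ }
  resume(entry) { /* restore playback/volume, set state "active" */ }
}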

Handling WebAudio, Flash (and maybe WebRTC)

The topic discussed in this section is still an open question.

WebAudio, Flash and WebRTC are not like media elements; they need to be handled differently.

WebAudio

To let WebAudio participate in audio focus management, we can add a focusEntry attribute to AudioContext so that WebAudio can join an AudioFocusEntry:

var focusEntry = new AudioFocusEntry("playback");
var audioContext = new AudioContext();
audioContext.focusEntry = focusEntry;

When WebAudio wants to start playback, since we cannot really know when WebAudio starts, the page is responsible for activating the AudioFocusEntry explicitly:

focusEntry.activate()
  .then(function() {
    // Start WebAudio playback.
  }).catch(function(e) {
    console.log("cannot request audio focus for WebAudio");
  });

When responding to audio focus changes, the user agent should call AudioContext.suspend() or AudioContext.resume() accordingly.
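
For illustration, the mapping could look like the sketch below. AudioContext.suspend() and AudioContext.resume() are existing Web Audio methods; the statechange event is the same hypothetical one used earlier:

// AudioContext.suspend()/resume() exist today; |statechange| is hypothetical.
focusEntry.addEventListener("statechange", function () {
  if (focusEntry.state === "suspended") {
    audioContext.suspend(); // Stop audio processing while focus is lost.
  } else if (focusEntry.state === "active") {
    audioContext.resume(); // Continue processing once focus returns.
  }
});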

Flash

Flash is similar to WebAudio in that its playback cannot be controlled. Unlike WebAudio, though, it is harder to let Flash join an AudioFocusEntry, since we might need to modify elements. Maybe we could have a default AudioFocusEntry per page and let Flash join the default one.

WebRTC

WebRTC is more complex, since it usually requires voice-call focus, and on some platforms the platform has to change the audio routing and trigger other complex behaviors. Besides, there are different use cases for which we need to define the desired behavior.

So maybe a one-shot audio focus type is needed for this case.
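
If such a type existed, a call might request focus as sketched below; the "voice-call" type string is purely hypothetical:

// Hypothetical: the "voice-call" type string is not specified anywhere.
var callEntry = new AudioFocusEntry("voice-call");
callEntry.activate()
    .then(function () {
      // Focus (and platform audio routing) granted; acquire the microphone.
      return navigator.mediaDevices.getUserMedia({ audio: true });
    })
    .then(function (stream) {
      // Attach |stream| to an RTCPeerConnection here.
    })
    .catch(function () {
      console.log("cannot request audio focus for the call");
    });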

Fallback behavior when the page does not use the AudioFocus API

We need to define the behavior when the page does not use the AudioFocus API, and that behavior should be compatible with our model. There are several possible approaches; one is sketched below.
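
For example, the user agent could behave as if every media element not associated with an entry were attached to a per-page default AudioFocusEntry of type playback, echoing the default-entry idea from the Flash section. This is an assumption about one possible fallback, not a decision:

// Assumption: a user agent could implicitly treat unassociated media
// elements as if the page had run something like this.
var defaultEntry = new AudioFocusEntry("playback");
for (const media of document.querySelectorAll("audio, video")) {
  if (!media.focusEntry)
    media.focusEntry = defaultEntry;
}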