“Probably, Maybe, No”: The State of HTML5 Audio

A bare woofer, loudspeaker

With the hype around HTML5 and CSS3 exceeding levels not seen since 2005’s Ajax era, it’s worth noting that the excitement comes with good reason: the two specifications render many years of feature hacks redundant by replacing them with native features. For fun, consider how many CSS2-based rounded corners hacks you’ve probably glossed over, looking for a magic solution. These days, with CSS3, the magic is border-radius (and perhaps some vendor prefixes) followed by a coffee break.

CSS3’s border-radius, box-shadow, text-shadow and gradients, and HTML5’s <canvas>, <audio> and <video> are some of the most anticipated features we’ll see put to creative (ab)use as adoption of the ‘new shiny’ grows. Developers jumping on the cutting edge are using subsets of these features to little detriment, in most cases. The more popular CSS features are design flourishes that can degrade nicely, but the current audio and video implementations in particular suffer from a number of annoyances.

The new shiny: how we got here

Sound involves one of the five senses, a key part of daily life for most – and yet it has been strangely absent from HTML and much of the web by default. From a simplistic perspective, it seems odd that HTML did not include support for the full multimedia experience earlier, despite the CD-ROM-based craze of the early 1990s. In truth, standards like HTML can take much longer to bake, but eventually deliver the promise of a lowered barrier to entry, consistent implementations and shiny new features now possible ‘for free’ just about everywhere.

<img> was introduced early and naturally to HTML, despite having some opponents at the time. Perhaps <audio> and <video> were avoided, given the added technical complexity of decoding various multi-frame formats, plus the hardware and bandwidth limitations of the era. Perhaps there were quarrels about choosing a standard format or – more simply – maybe these elements just weren’t considered to be applicable to the HTML-based web at the time. In any event, browser plugins from programs like RealPlayer and QuickTime eventually helped to fill the in-page audio/video gap, handling <object> and <embed> markup which pointed to .wav, .avi, .rm or .mov files. Suffice it to say, the experience was inconsistent at best and, on the standards side of the fence right now, so is HTML5 in terms of audio and video.

As far as HTML goes, the code for <audio> is simple and logical. Just as with <img>, a src attribute specifies the file to load. Pretty straightforward – sounds easy, right?

<audio src="mysong.ogg" controls>
	<!-- alternate content for unsupported case -->
	Download <a href="mysong.ogg">mysong.ogg</a>;
</audio>

Ah, if only it were that simple. The first problem is that the OGG audio format, while ‘free’, is not supported by some browsers. Conversely, nor is MP3, despite being a de facto standard used in all kinds of desktop software (and hardware). In fact, as of November 2010, no single audio format is commonly supported across all major HTML5-enabled browsers.

What you end up writing, then, is something like this:

<audio controls>
	<source src="mysong.mp3" />
	<source src="mysong.ogg" />
	<!-- alternate content for unsupported case, maybe Flash, etc. -->
	Download <a href="mysong.ogg">mysong.ogg</a> or <a href="mysong.mp3">mysong.mp3</a>
</audio>

Keep in mind, this is only a ‘first class’ experience for the HTML5 case; also, for non-supported browsers, you may want to look at another inline player (object/embed, or a JavaScript plus Flash API) to have inline audio. You can imagine the added code complexity in the case of supporting ‘first class’ experiences for older browsers, too.

With <img>, you typically don’t have to worry about format support – it just works – and that’s part of what makes a standard wonderful. JPEG, PNG, BMP, GIF, even TIFF images all render just fine if for no better reason, perhaps, than being implemented during the ‘wild west’ days of the web. The situation with <audio> today reflects a very different – read: business-aware – environment in 2010. (Further subtext: There’s a lot of [potential] money involved.) Regrettably, this is a collision of free and commercial interests, where the casualty is ultimately the user. Second up in the casualty list is you, the developer, who has to write additional code around this fragmented support.

The HTML5 audio API as implemented in JavaScript has one of the most un-computer-like responses I’ve ever seen, and inspired the title of this post. Calling new Audio().canPlayType('audio/mp3'), which queries the system for format support according to a MIME type, is supposed to return one of “probably”, “maybe”, or “no”. Sometimes, you’ll just get a null or empty string, which is also fun. A “maybe” response does not guarantee that a format will be supported; sometimes audio/mp3 gives “maybe,” but then audio/mpeg; codecs="mp3" will give a more-solid “probably” response. This can vary by browser or platform, too, depending on native support – and finally, the user may also be able to install codecs, extending support to include other formats. (Are you excited yet?)

Damn you, warring formats!

New market and business opportunities go hand-in-hand with technology developments. What we have here is certainly not failure to communicate; rather, we have competing parties shouting loudly in public in attempts to influence mindshare towards a de facto standard for audio and video. Unfortunately, the current situation means that at least two formats are effectively required to serve the majority of users correctly.

As it currently stands, we have the free and open source software camp of OGG Vorbis/WebM and its proponents (notably, Mozilla, Google and Opera in terms of browser makers), up against the non-free, proprietary and ‘closed’ camp of MP3 and MPEG4/HE-AAC/H.264 – which is where you’ll find commitments from Apple and Microsoft, among others. Apple is likely in with H.264 for the long haul, given its use of the format for its iTunes music store and video offerings.

It is generally held that H.264 is a technically superior format in terms of file size versus quality, but it involves intellectual property and, in many use cases, requires licensing fees. To be fair, there is a business model with H.264 and much has been invested in its development, but this approach is not often the kind that wins over the web. On that front, OGG/WebM may eventually win for being a ‘free’ format that does not involve a licensing scheme.

Closed software and tools ideologically clash with the open nature of the web, which exists largely thanks to free and open technology. Because of philosophical and business reasons, support for audio and video is fragmented across browsers adopting HTML5 features. It does not help that a large amount of audio and video currently exists in non-free MP3 and MPEG-4 formats. Adoption of <audio> and <video> may be slowed, since it is more complex than <img> and may feel ‘broken’ to developers when edge cases are encountered. Furthermore, the HTML5 spec does not mandate a single required format. The end result is that, as a developer, you must currently provide at least both MP3 and OGG, for example, to serve most existing HTML5-based user agents.

Transitioning to

A small circular

There will be some growing pains as developers start to pick up the new HTML5 shiny, while balancing the needs of current and older agents that don’t support either <audio> or the preferred format you may choose (for example, MP3). In either event, Flash or other plugins can be used as done traditionally within HTML4 documents to embed and play the relevant audio.

A screenshot of a Muxtape.com-styled UI with title, time, progress bars and a VU meter, part of SoundManager 2The SoundManager 2 page player demo in action.

Ideally, HTML5 audio should be used whenever possible with Flash as the backup option. A few JavaScript/Flash-based audio player projects exist which balance the two; in attempting to tackle this problem, I develop and maintain SoundManager 2, a JavaScript sound API which transparently uses HTML5 Audio() and, if needed, Flash for playing audio files. The internals can get somewhat ugly, but the transition between HTML4 and HTML5 is going to be just that – and even with HTML5, you will need some form of format fall-back in addition to graceful degradation.

It may be safest to fall back to MP3/MP4 formats for inline playback at this time, given wide support via Flash, some HTML5-based browsers and mobile devices. Considering the amount of MP3/MP4 media currently available, it is wiser to try these before falling through to a traditional file download process.

Early findings

Here is a brief list of behavioural notes, annoyances, bugs, quirks and general weirdness I have found while playing with HTML5-based audio at time of writing (November 2010):

Apple iPad/iPhone (iOS 4, iPad 3.2+)

  • Only one sound can be played at a time. If a second sound starts, the first is stopped.
  • No auto-play allowed. Sounds follow the pop-up window security model and can only be started from within a user event handler such as onclick/touch, and so on. Otherwise, playback attempts silently fail.
  • Once started, a sequence of sounds can be created or played via the ‘finish’ event of the previous sound (for example, advancing through a playlist without interaction after first track starts).
  • iPad, iOS 3.2: Occasional ‘infinite loop’ bug seen where audio does not complete and stop at a sound’s logical end – instead, it plays again from the beginning. Might be specific to example file format (HE-AAC) encoded from iTunes.

Apple Safari, OS X Snow Leopard 10.6.5

  • Critical bug: Safari 4 and 5 intermittently fail to load or play HTML5 audio on Snow Leopard due to bug(s) in QuickTime X and/or other underlying frameworks. Known Apple ‘radar’ bug: bugs.webkit.org #32159 (see also, test case.) Amusing side note: Safari on Windows is fine.

Apple Safari, Windows

  • Food for thought: if you download “Safari” alone on Windows, you will not get HTML5 audio/video support (tested in WinXP). You need to download “Safari + QuickTime” to get HTML5 audio/video support within Safari. (As far as I’m aware, Chrome, Firefox and Opera either include decoders or use system libraries accordingly. Presumably IE 9 will use OS-level APIs.)

General Quirks

  • Seeking and loading, ‘progress’ events, and calculating bytes loaded versus bytes total should not be expected to be linear, as users can arbitrarily seek within a sound. It appears that some support for HTTP ranges exists, which adds a bit of logic to UI code. Browsers seem to vary slightly in their current implementations of these features.
  • The onload event of a sound may be of little relevance, if non-linear loading is involved (see above note re: seeking).
  • Interestingly (perhaps I missed it), the current spec does not seem to specify a panning or left/right channel mix option.
  • The preload attribute values may vary slightly between browsers at this time.

Upcoming shiny: HTML5 Audio Data API

A screenshot of a circular audio UI design showing waveforms being drawn from audio data. With access to audio data, you can incorporate waveform and spectrum elements that make your designs react to music.

The HTML5 audio spec does a good job covering the basics of playback, but did not initially get into manipulation or generation of audio on-the-fly, something Flash has had for a number of years now. What if JavaScript could create, monitor and change audio dynamically, like a sort of audio <canvas> element? With that kind of capability, many dynamic audio processing features become feasible and, when combined with other media, can make for some impressive demos.

What started as a small idea among a small group of audio and programming enthusiasts grew to inspire a W3C audio incubator group, and continued to establish the Mozilla Audio Data API. Contributors wrote a patch for Firefox which was reviewed and revised, and is now slated to be in the public release of Firefox 4. Some background and demos are also detailed in an article from the BBC R&D blog.

There are plenty of live demos to see, which give an impression of the new creative ideas this API enables. Many concepts are not new in themselves, but it is exciting to see this sort of thing happening within the native browser context.

Mozilla is not alone in this effort; the WebKit folks are also working on a JavaScriptAudioNode interface, which implements similar audio buffering and sample elements.

The future?

It is my hope that we’ll see a common format emerge in terms of support across the major browsers for both audio and video; otherwise, support will continue to be fragmented and mildly frustrating to develop for, and that can impede growth of the feature. It’s a big call, but if <img> had lacked a common format back in the wild west era, I doubt the web would have grown to where it is today.

Complaints and nitpicks aside, HTML5 brings excellent progress on the browser multimedia front, and the first signs of native support are a welcome improvement given all audio and video previously relied on plugins. There is good reason to be excited. While there is room for more, support could certainly be much worse – and as tends to happen with specifications, the implementations targeting them should improve over time.

Note: Thanks to Nate Koechley, who suggested the Audio().canPlayType() response be part of the article title.

About the author

Scott Schiller has been enjoying building web things since 1995. He also has a fondness for Creedence and the occasional White Russian.

Scott’s personal site is probably best known for its DHTML Arkanoid remake (2002), Snowstorm and holiday christmas light-smashing distractions and other random JavaScript + CSS-based experiments.

In his spare time, Scott tinkers with side projects like SoundManager 2, a JavaScript sound API used by some nifty sites to drive audio features. By day, he works for Yahoo! building shiny things at Flickr.

More articles by Scott

Comments