TABLE OF CONTENTS
Chapter 1: A Short Market Review
Chapter 2: Maximum Tolerated Delay Times in Musical Collaboration
Chapter 3: How to reach Minimum Delay
3.1: End-System Delay
3.1.1: Audio Signal Processing
3.1.2: Real Time Audio Compression
3.1.3: Minimum Bandwidth
3.2: Network Delay
3.2.1: Casting Concepts
3.2.2: Transport Protocols
3.2.4: Audio Optimizing
Chapter 4: Conclusion
List of Tables
List of Figures
- The attached research paper is my own, original work undertaken in partial fulfillment of my degree.
- I have made no use of sources, materials or assistance other than those which have been openly and fully acknowledged in the text. If any part of another person’s work has been quoted, this either appears in inverted commas or (if beyond a few lines) is indented.
- Any direct quotation or source of ideas has been identified in the text by author, date, and page number(s) immediately after such an item, and full details are provided in a reference list at the end of the text.
- I understand that any breach of the fair practice regulations may result in a mark of zero for this research paper and that it could also involve other repercussions.
- I understand also that too great a reliance on the work of others may lead to a low mark.
Winkler, Stephan 2
High bandwidth and the rapid development of social networks introduce a new way of making music together. The users of this new kind of community are ambitious musicians who can log into online jam sessions and play together, overcoming the distance between them. Remote collaborative musical performances over the Internet are already possible, and numerous solutions exist. But all of these solutions struggle with one problem: latency. The audio signal of one user reaches the other users not in real time but with a lag, which makes playing music together impractical. Latency cannot be eliminated, but it can be reduced. In this work I define tolerable latency, outline the problems that create latency along the audio transmission path, and recommend solutions for keeping latency as low as possible.
This includes client-side difficulties such as bandwidth, real-time audio compression, and buffer times, as well as problems in the server-side streaming architecture such as transport protocols, synchronization, packet loss, and delay times.
“Hey Sean, let’s meet in the Online Jam Room at 7 p.m. Please invite a drummer to join our session.” - “OK, I’ll be there, see you then!” Thanks to the current widespread availability of high-speed Internet, ardent musicians are nowadays able to join virtual jam sessions and make music together over this medium. The one problematic requirement is called real time: there is no real time on the Internet, there is only delay time.
It takes a certain amount of time for an audio signal to travel from one location to another. Some of that delay is created by the distance itself, some by technological constraints in the architecture of computer hardware and the Internet. In a conversation over the Internet, this is not a big problem. With a heavily delayed audio signal from another user, however, it is almost impossible to play along in a multi-party musical performance. This paper deals with these technological challenges and recommends solutions for keeping delay time as low as possible.
In the first chapter, existing providers of networked collaborative musical performances are introduced by outlining some of their technological advantages and disadvantages. In the following chapters, I return to some of these providers where they are relevant to the respective topic.
The second chapter deals with maximum tolerated delay times in musical collaboration, depending on the physical distance between two musicians.
The third chapter is split into two subdivisions. In the first, I detail client-side difficulties such as bandwidth, real-time audio compression, and buffer times, all of which cause delay. In the second, problems in the server-side streaming architecture such as transport protocols, synchronization, packet loss, and delay times are discussed.
The fourth chapter concludes this paper with a personal review of the recommended technologies that could be used to achieve minimum delay times.
Before launching an online community in which musicians make music together, the technological problems have to be understood and possible solutions worked out. This paper serves as a guide for planning an online platform capable of networked collaborative musical performances.
CHAPTER 1: A SHORT MARKET REVIEW
There are various providers on the Internet offering software or hardware that brings musicians together in a virtual jam room and transfers audio data from each participant to the others. Some statements in this chapter are taken from an online forum where users post their personal experience with these providers; such comments should not be taken as fully reliable.
NINJAM (Novel Intervallic Network Jamming Architecture for Music) is a software-based client-server platform, available for download at http://www.ninjam.com. It was one of the first online products offering musical collaboration over the Internet. Users can record voice or play any instrument except MIDI instruments. Since NINJAM regards latency as an inherent part of the Internet, and thus as an unavoidable part of the experiment, it uses OGG Vorbis compressed audio streams, which create additional latency. When a jam session starts, NINJAM deliberately waits for the audio signal from each client, splits it into intervals, and then sends a lagged but synchronized stream of all participants to all clients. This technique also generates extra latency. All in all, NINJAM works with more than 65msec of latency, made up of hardware, codec, and network latency. (cf. NINJAM, 2009, [http://www.ninjam.com])
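The interval-based synchronization described above can be illustrated with a minimal sketch. The class and method names below are my own invention, not NINJAM's actual implementation: the server buffers each client's audio per interval and only broadcasts an interval's mix once every client has delivered it, which is where the deliberate lag comes from.

```python
# Hypothetical sketch of NINJAM-style interval synchronization.
# All names are assumptions for illustration, not NINJAM's real code.

class IntervalServer:
    def __init__(self, clients):
        self.clients = set(clients)
        self.pending = {}        # interval index -> {client: audio chunk}
        self.broadcast_log = []  # intervals already mixed and sent to everyone

    def receive(self, client, interval, chunk):
        self.pending.setdefault(interval, {})[client] = chunk
        # Broadcast only once all clients have contributed this interval,
        # so the stream is synchronized but lagged by at least one interval.
        if set(self.pending[interval]) == self.clients:
            chunks = self.pending.pop(interval)
            mix = b"".join(chunks[c] for c in sorted(self.clients))
            self.broadcast_log.append((interval, mix))

server = IntervalServer(["alice", "bob"])
server.receive("alice", 0, b"A0")
# Nothing is broadcast yet: bob's interval 0 is still missing.
server.receive("bob", 0, b"B0")
print(server.broadcast_log)   # [(0, b'A0B0')]
```

The one-interval wait is exactly the trade-off NINJAM accepts: synchronization in exchange for added latency.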
According to online forum comments, it is a very popular solution for online jamming since it is free of charge. It works with good sound quality regardless of the distance between the participants. The only flaw is a very high delay time, so only skilled musicians can work with it. (cf. COCKOS, 2010, [http://forum.cockos.com/showthread.php?p=461954])
eJamming AUDiiO is a peer-to-peer software tool comparable to NINJAM, but it charges its users a monthly subscription fee. On the other hand, there is no server between the participants, which saves latency. In the new version of eJamming AUDiiO it is also possible to join sessions with MIDI instruments. To keep hardware latency down, eJamming works only with ASIO-compliant hardware. Participants can choose between Jam Mode, Sync Mode, and VRS Mode. Jam Mode is an unsynchronized mode that produces very low latency, but the participants have to listen carefully to each other so as not to drop off the beat. Sync Mode is similar to NINJAM, deliberately lagging each musician's instrument in order to gather one tight, synchronized stream for all participants, which implies latency. Figure 1.1 illustrates the difference between Jam Mode and Sync Mode.
illustration not visible in this excerpt
Figure 1.1: The difference between Jam Mode and Sync Mode
Users choose VRS Mode to play in real time along a previously recorded track while the other participants listen. (cf. eJamming, 2010, [http://www.ejamming.com/faq/]) Users of eJamming like the fact that they play together simultaneously, although they are not synchronized and use no click track. The only requirement is high bandwidth. (cf. COCKOS, 2010, [http://forum.cockos.com/showthread.php?p=461954])
jamLink is a hardware tool for networked collaborative musical performances offering ultra-low delay and very high audio quality. The web browser merely controls the jamLink device; it has nothing to do with the transmission of the audio signal over the Internet. An Ethernet port sends the incoming audio signal straight to the other client, bypassing audio interface, computer hardware, and software, and thus avoiding a lot of latency. However, a 1,000 kbps upstream and 2,000 kbps downstream bandwidth is required, and the device itself costs $149.99. (cf. MusicianLink, 2010, [http://www.musicianlink.com/content/about/us])
Online Jam Sessions
Online Jam Sessions is a web-browser-based social network offering musical collaboration along with video streaming, blogs, galleries, and many more gadgets. It is probably the easiest way to get into an online jam, because you just have to sign in and log onto a jam session. There is no software, no hardware, and no monthly charge. Nevertheless, users report a strong delay while jamming online. (cf. COCKOS, 2010, [http://forum.cockos.com/showthread.php?p=461954]) A comparable online community is “D-Live Entertainment” at http://www.dlive-entertainment.com.
CHAPTER 2: MAXIMUM TOLERATED DELAY TIMES IN MUSICAL COLLABORATION
In order to understand the basic principles of delay times in musical collaboration, we first have to define the term “audio delay”. Delay, or latency, is the time lag between the occurrence of a sound and the moment it actually reaches our ears. (cf. Raffaseder, 2002, p. 218) The sound of a sound event spreads out as sound waves in the air, and this dissemination takes a defined amount of time. Sound waves in air travel at a speed of 340m/s. Consequently, a sound wave covers 1m in 1/340s, or about 0.0029s, which is roughly 3msec. Lightning and thunder provide a practical model of delay time: thunder is created by lightning, yet you see the lightning and only seconds later hear the thunder. In other words, sound is much slower than light.
The impact of delay times on musical collaboration is exemplified in the following paragraph. In a large orchestral performance, musicians are positioned at a certain distance from each other, usually based on the instrumental arrangement. A violinist and a trumpeter, for instance, may be located 15 meters apart; they hear each other's sound with a time lag of 15m x 3msec, or about 45msec. Consequently, musicians in an orchestral performance have to be experienced enough to overcome relatively high delay times, and a conductor is absolutely indispensable for a synchronized performance at the right tempo. (cf. San Segundo, 2008, [http://www.delamar.de/musikproduktion/die-latenz-in-der-musikproduktion-2838/])
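The acoustic delay figures above follow directly from the speed of sound; a short calculation makes the violinist/trumpeter example concrete (the exact value for 15m is about 44msec, which the text rounds to 45msec):

```python
# Acoustic propagation delay in air (speed of sound ~340 m/s, as in the text).
SPEED_OF_SOUND = 340.0  # m/s

def acoustic_delay_ms(distance_m):
    """Milliseconds for sound to travel distance_m meters through air."""
    return distance_m / SPEED_OF_SOUND * 1000.0

print(round(acoustic_delay_ms(1), 1))    # 2.9 -> roughly 3msec per meter
print(round(acoustic_delay_ms(15), 1))   # 44.1 -> the orchestra example
```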
Since there is no conductor in a networked musical performance, a way must be found to keep audio transmission latency over the Internet as low as possible, as even several milliseconds of end-to-end delay can disturb the participants. But how many milliseconds are tolerable in a musical collaboration? Before answering this question, we have to consider the following essentials. The maximum tolerance of latency is directly influenced by two values: tempo and attack time. Barbosa et al. found in June 2004 at the Sound and Image Department of the Portuguese Catholic University that there is a direct correlation between tempo and latency. They supported their thesis with experiments showing that “of the instrumental skills or the music instrument, all musicians were able to tolerate more feedback-delay for slower Tempos.” (Barbosa / Cardoso / Geiger, 2005, p. 185) Thus, the faster the tempo, the lower the tolerated latency. The second value, the attack time of sound events, also contributes to the tolerated latency: sound events with a longer attack time can be played with more latency, whereas sound events with short attack times allow only a very low latency tolerance. (cf. San Segundo, 2008, [http://www.delamar.de/musikproduktion/die-latenz-in-der-musikproduktion-2838/])
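One intuitive way to read the tempo finding is that a fixed delay consumes a larger fraction of the beat at faster tempos. This is my own illustration, not Barbosa et al.'s methodology:

```python
# A fixed delay eats a larger share of each beat as the tempo rises,
# which is one way to interpret the tempo/latency correlation.
def beat_period_ms(tempo_bpm):
    """Duration of one beat in milliseconds at the given tempo."""
    return 60000.0 / tempo_bpm

for tempo in (60, 120, 180):
    share = 50.0 / beat_period_ms(tempo) * 100
    print(f"{tempo} bpm: beat = {beat_period_ms(tempo):.0f} ms, "
          f"a 50 ms delay is {share:.0f}% of a beat")
```

At 60 bpm a 50msec delay is only 5% of a beat, while at 180 bpm it already amounts to 15%, consistent with musicians tolerating more delay at slower tempos.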
Many opinions about a maximum tolerated latency are stated in various articles. Some say that an exact number of milliseconds cannot be defined, because every human brain has a different tolerance range and is actually very adaptable. From a delay of 11msec upwards, the brain is capable of distinguishing between two sound events. (cf. ibid.) Kurtisi et al. mention in their article “Enabling Network-Centric Music Performance in Wide-Area Networks” that “30msec is a widely recognized bound.” (Kurtisi / Gu / Wolf, 2006, p. 52)
Zimmermann et al. conducted audio delay experiments with musicians for their research paper “Distributed Musical Performances: Architecture and Stream Management”. They devised two different forms of delay experiment and ran them separately. “The goal of these tests was to explore the delay bounds above which a tight, expressive musical performance was no longer possible. […] A: The musicians would hear their own sound with zero latency […] while at the same time the sound from their partner was delayed by a fixed delay.” (Zimmermann et al., 2008, p. 13) The results of experiment A showed that “latencies up to 50msec were generally tolerable for a musical performance”. (ibid., p. 13) In experiment B, the musician's own signal was delayed by the same lag time with which they heard the other participant. Here it was possible to play along with the synchronized audio stream at a delay of up to 75msec. (cf. ibid., p. 13) Consequently, the stream management in a networked audio environment can follow two different architectures. In the first, the musician hears himself playing with zero delay and a delayed signal stream from the other participants. In the second, one collected and synchronized audio stream is used, in which each participant hears himself with the same delay as the others. As mentioned in the first chapter, eJamming combines the two architectures, allowing the participants to decide between them. Figure 1.1 illustrates the difference between the two architectures.
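The two architectures can be condensed into a toy model (the function and architecture labels are my own naming, echoing Zimmermann et al.'s experiments A and B): what differs is only whose signal a musician hears delayed.

```python
# Toy model of the two stream-management architectures; labels follow
# Zimmermann et al.'s experiments A and B, function naming is my own.
def perceived_delays(network_delay_ms, architecture):
    """Return (own_signal_delay, partner_signal_delay) as heard by one musician."""
    if architecture == "A":    # hear yourself live, partner delayed (tolerable up to ~50 ms)
        return (0, network_delay_ms)
    if architecture == "B":    # one synchronized stream, everyone delayed alike (~75 ms)
        return (network_delay_ms, network_delay_ms)
    raise ValueError(f"unknown architecture: {architecture}")

print(perceived_delays(40, "A"))   # (0, 40)
print(perceived_delays(40, "B"))   # (40, 40)
```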
Considering the diverse statements about maximum tolerable delay times, the results found through experiments are the ones worth working with. Additionally, Stan Vonog, founder of Musigy, another virtual environment for networked musical collaboration, affirms that “you can manage to talk around a lag as high as 200 to 300 milliseconds, […] but for multiplayer jam sessions, anything above 50msec renders a piece of music practically unplayable.” (Anderson, 2007, [http://spectrum.ieee.org/computing/software/virtual-jamming]) The maximum tolerable delay time of 50msec in a non-synchronized environment is the basis of further investigation in the following chapter, where the occurring delay time is split into two parts.
CHAPTER 3: HOW TO REACH MINIMUM DELAY
As mentioned in the last paragraph of chapter 2, the maximum tolerable delay is considered to be 50msec. There are two major areas where delay time arises: the audio signal processing in the client's computer induces unavoidable latency, and the signal transmission over the Internet adds more. Summed up, both parts must stay within a maximum of 50msec for an acceptable networked collaborative musical performance.
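This latency budget can be expressed as a simple check. The component values used below are made-up examples, not measurements:

```python
# Latency budget check against the 50 ms bound derived in chapter 2.
# End-system and network delay share a single budget; the example
# component values are illustrative, not measured.
MAX_TOLERABLE_MS = 50.0

def within_budget(end_system_ms, network_ms):
    """True if the summed delay stays within the tolerable bound."""
    return end_system_ms + network_ms <= MAX_TOLERABLE_MS

print(within_budget(end_system_ms=12.0, network_ms=30.0))  # True  (42 ms total)
print(within_budget(end_system_ms=25.0, network_ms=30.0))  # False (55 ms total)
```

Every millisecond saved in end-system processing directly enlarges the distance (and thus network delay) over which a session remains playable.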
3.1: END-SYSTEM DELAY
The path of an audio signal processed in the client's computer runs from the soundcard via the CPU to the network card, and back again. This includes AD/DA converters, the soundcard's hardware buffer, microprocessor queuing, and audio compression. Let's take a closer look at the points in audio signal processing where audio delay actually occurs.
3.1.1: AUDIO SIGNAL PROCESSING
Since every participant in a networked performance has a different set of computer hardware and software, it is not feasible to define the ultimate hardware setup and software settings for keeping latency as low as possible. This chapter presents basic theory about latency in audio signal processing, along with some proposals.
First, we look at the hardware-based audio input of a participant's (participant A's) computer and analyze the different obstacles that create latency. When an analog audio signal from a microphone or a guitar reaches the input of a soundcard, the analog signal is converted into a digital signal.
“Converters usually add a delay of about 1ms”. (Walker, 2005, [http://www.soundonsound.com/sos/jan05/articles/pcmusician.htm]) Also, when the audio signal is conducted through the soundcard, an optional sample-rate conversion, for instance from 44,100Hz to 32,000Hz, can cause a further delay of about 1msec. (cf. ibid.) At this point we have a digitized audio signal with a delay of about 1-2msec, ready to be buffered. Any data stream has to be buffered in order to guarantee a continuous flow for subsequent processing. Every DVD burning process, for instance, needs a certain buffer to ensure nonstop recording without gaps or overflows. As there must not be gaps in an audio signal either, a hardware buffer is required here as well. (cf. ibid.) This hardware buffer resides on the soundcard itself and is controllable via software-based control panels.
If the buffers are too small and the data runs out before Windows can get back to […] empty them (recording), you'll get a gap in the audio stream that sounds like a click or pop in the waveform. […] Making the buffers a lot bigger […] has an unfortunate side effect: any change that you make to the audio from your audio software doesn't take effect until the next buffer is accessed. This is latency […]. (Walker, 2005, [http://www.soundonsound.com/sos/jan05/articles/pcmusician.htm])
Every soundcard has its individual buffer settings for reaching optimally low latency. Two suggestions can be offered here. To achieve the lowest latency at optimal buffer settings, a high-quality audio interface and an appropriate driver have to be used. High-quality audio interfaces, such as the M-Audio® Fast Track® series, assure high-speed audio processing. The most appropriate driver is the ASIO audio protocol, specifically the 'ASIO DirectX Full Duplex Driver' for all Windows systems, as such drivers “generally offer the lowest latencies and CPU overheads, and are also supported by a wide range of host applications”. (ibid.)
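The buffer-size trade-off described by Walker reduces to a simple relation: buffer latency is the buffer size in sample frames divided by the sample rate. The buffer sizes below are common driver settings used for illustration, not vendor data:

```python
# Latency contributed by a soundcard buffer: buffer size in sample frames
# divided by the sample rate. Smaller buffers mean less latency but a
# higher risk of clicks and pops when the CPU cannot refill them in time.
def buffer_latency_ms(buffer_frames, sample_rate_hz):
    """Latency in milliseconds added by one buffer of the given size."""
    return buffer_frames / sample_rate_hz * 1000.0

for frames in (64, 128, 256, 512):
    print(f"{frames:>4} frames @ 44100 Hz -> {buffer_latency_ms(frames, 44100):.1f} ms")
```

At 44,100Hz, a 512-frame buffer alone already costs about 11.6msec, a substantial share of the 50msec budget, which is why buffer tuning matters so much for networked performance.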