Latency and microSD Cards
I've read quite a few posts recently discussing the relationship between microSD card speeds and audio latency, and would like to set the record straight. Contrary to popular belief, most of the measured latency in Robertsonics audio players has nothing to do with the microSD card. In fact, there's no reason that an audio player that streams audio from microSD cards can't have the same latency as one that plays audio from RAM.
Let me explain.
First, a definition: Latency is the delay between the activation of a trigger - a digital input, a MIDI Note On message or a serial command - and when sound appears at the analog audio output of the device.
Once a sound has started playing, latency is not really the issue. If the sound is streaming from microSD media, either the card is fast enough or it isn't. If not, the audio DAC will be starved for data and you'll hear a glitch. So let's agree that if a player doesn't glitch, then the microSD card must be fast enough to keep up with the demand for audio, whatever that demand is. Otherwise, you'd hear it.
Back to latency. I made the claim that a player that streams audio from microSD need not have any greater latency than one that plays from RAM. If you know ahead of time what files you want to stream, then you can preload the first block of audio data into RAM for each file. That way, when a trigger comes along, you already have the first part of the file in RAM. Once the file has started, we've already said that either the microSD card is fast enough to keep up or it isn't. If it isn't, the result is not a delay in the start time (it's already started), but a glitch. Not latency.
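To make the preloading idea concrete, here is a minimal Python sketch. It is only an illustration of the general technique, not code from any Robertsonics product, and the names preload, start_track and fake_read_first_block are made up for this example.

```python
def preload(read_first_block, tracks):
    """Read the first block of each known track into RAM ahead of time."""
    return {track: read_first_block(track) for track in tracks}

def start_track(cache, read_first_block, track):
    """Return (first_block, came_from_ram) for a triggered track."""
    if track in cache:
        return cache[track], True            # no card access at trigger time
    return read_first_block(track), False    # fall back to reading on demand

# Demo with a fake card reader that counts how often it is called.
card_reads = []

def fake_read_first_block(track):
    card_reads.append(track)
    return bytes([track]) * 4096

cache = preload(fake_read_first_block, [1, 2, 3])
reads_after_preload = len(card_reads)
block, from_ram = start_track(cache, fake_read_first_block, 2)
```

All of the card reads happen during preload, so starting a preloaded track costs no card access at trigger time.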
The WAV Trigger and Tsunami provide random access to over 4000 tracks, and can't know ahead of time which one will play next. Because there's not enough RAM to preload the first block of audio data for all those tracks, they have to wait and read the first block when the trigger occurs. This is the first (and only, as we'll see) source of latency that can be attributed to the microSD card.
(It should be noted here that Robertsonics players avoid having to search the FAT file system directory each time a file is started because they pre-index all of the sound files and therefore already know the starting sector for each file. When a trigger occurs, all that's required is to read the first block of data.)
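The pre-indexing idea can be sketched in a few lines of Python. This is a simplified model rather than the actual firmware: FakeCard, build_index and on_trigger are hypothetical names, and the 512-byte sector size is an assumption made for the example.

```python
BLOCK_SIZE = 4096   # bytes fetched on a trigger (the 4K read discussed here)
SECTOR_SIZE = 512   # typical SD sector size; an assumption for this sketch

class FakeCard:
    """Stand-in for a microSD card: a flat byte image addressed by sector."""
    def __init__(self, image):
        self.image = image
        self.reads = 0

    def read(self, start_sector, length):
        self.reads += 1
        offset = start_sector * SECTOR_SIZE
        return self.image[offset:offset + length]

def build_index(directory):
    """Walk the directory once at power-up and remember each start sector."""
    return {track: start_sector for track, start_sector in directory}

def on_trigger(card, index, track):
    """Fetch the first audio block with a single read, no directory search."""
    return card.read(index[track], BLOCK_SIZE)

# Two fake tracks stored at sectors 8 and 16 of a small card image.
image = bytes(SECTOR_SIZE * 8) + b"\x01" * BLOCK_SIZE + b"\x02" * BLOCK_SIZE
card = FakeCard(image)
index = build_index([(1, 8), (2, 16)])
first_block = on_trigger(card, index, 2)
```

The trigger path touches the card exactly once, which is why the only card-related cost at trigger time is that single block read.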
It turns out that with a decent microSD card, a random 4K read operation takes less than 1ms. In the scope image below, the top trace is the onset of the digital input trigger, and the bottom trace brackets the microSD read operation that takes place to fetch the first block of audio data. (All measurements were made with a Tsunami and the latest firmware.)
You can see that the latency between the trigger onset and the start of the read operation is negligible, and that the read operation itself takes approximately 700 microseconds. This is quite consistent, since it's just another 4K random read and elsewhere I've documented that a good card will never take more than 1ms for such a read operation.
Polyphony
The next component of latency in the WAV Trigger and Tsunami has nothing to do with the microSD card and everything to do with the fact that they're polyphonic.
Polyphonic means that multiple sound files can be played and digitally mixed asynchronously and independently to the same DAC. In a polyphonic playback system, the sound engine is always running - data is always being clocked out to the DAC. It pretty much has to be this way, since when you want to start a sound, there may already be another sound playing. Even when no files are being streamed, zeroes (silence) are being sent to the DAC.
A typical audio player utilizes DMA to move blocks of data from memory to a DAC, thus relieving the CPU of the intensive task of keeping up with the sample-by-sample demand. Once a block of audio data is given over to the DMA, it can't be modified, so a typical setup is to have two buffers. One has been handed off to the DMA transfer, and the other is used to mix the next block of audio from all the sounds that happen to be playing at that moment. The CPU has the length of time that it takes to clock out the first buffer to prepare the second and do any DSP. At the end of that period the second buffer is handed off to the DMA and the first buffer is used to mix the next block of data. This is exactly analogous to computer DAW audio buffers.
The key point here is that once a block of data has been handed off to the DAC, it's gone. It can't be modified, and if you want to start a sound at that moment, you have to mix it into the next block.
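The two-buffer scheme described above can be modeled roughly like this in Python. It is a toy sketch, not the real engine: the Voice and Engine classes are invented for illustration, samples are plain integers, and there is no clipping or DSP. Each call to buffer_period stands for one full buffer period, collapsing the mix and the hand-off into a single step.

```python
FRAMES = 128  # frames per buffer, as in the WAV Trigger and Tsunami

class Voice:
    """One playing sound: a list of samples and a read position."""
    def __init__(self, samples):
        self.samples = samples
        self.pos = 0

    def next_block(self, n):
        block = self.samples[self.pos:self.pos + n]
        self.pos += n
        return block + [0] * (n - len(block))  # pad with silence at the end

class Engine:
    def __init__(self):
        self.buffers = [[0] * FRAMES, [0] * FRAMES]
        self.dma_index = 0   # buffer currently owned by the DMA
        self.voices = []

    def trigger(self, samples):
        # A new voice only joins the list; it is heard once it has been mixed.
        self.voices.append(Voice(samples))

    def buffer_period(self):
        """Mix every active voice into the free buffer, then hand it off."""
        free = 1 - self.dma_index
        mix = [0] * FRAMES
        for voice in self.voices:
            for i, sample in enumerate(voice.next_block(FRAMES)):
                mix[i] += sample
        self.buffers[free] = mix
        self.voices = [v for v in self.voices if v.pos < len(v.samples)]
        self.dma_index = free
        return self.buffers[free]  # the block the DAC will play next

engine = Engine()
silent = engine.buffer_period()   # nothing playing: zeroes go to the DAC
engine.trigger([1] * 200)
engine.trigger([2] * 100)
first = engine.buffer_period()
second = engine.buffer_period()
```

Note that trigger never touches a buffer directly. A new sound can only enter at the next mix, which is where the buffer-related latency comes from.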
Both the WAV Trigger and Tsunami use two 128-frame audio buffers. At a 44.1kHz sample rate, 128 sample frames represent about 3ms of audio.
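The arithmetic behind that figure is just the frame count divided by the sample rate. A quick check in Python (the variable names are only for illustration):

```python
SAMPLE_RATE_HZ = 44_100
FRAMES_PER_BUFFER = 128

# Time it takes the DAC to clock out one buffer, in milliseconds.
buffer_period_ms = FRAMES_PER_BUFFER / SAMPLE_RATE_HZ * 1000
print(f"{buffer_period_ms:.2f} ms")  # prints "2.90 ms"
```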
This scope image shows the relationship between a digital input trigger and the analog audio output of a 440Hz sine wave sound file.
The above represents the best case scenario for Tsunami. The trigger causes the first block of data to be read from the microSD card in about 1ms. It's immediately mixed into the audio buffer not currently being serviced by the DMA, and 3ms later, it starts to appear at the output of the DAC. So 4ms latency.
The worst case scenario is that the trigger comes along just after the buffer mixing process has happened at the start of the buffer period, which is possible because the trigger and the audio engine are asynchronous. In this case, we must wait until the next buffer hand-off to start mixing the new file, and then another buffer period for it to reach the DAC, for a total of 7ms latency. The average is therefore somewhere in between - about 5.5ms.
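The best, worst and average cases can be put into a small Python model. It uses the rounded figures from the text (1ms read, 3ms buffer period) and assumes triggers arrive uniformly across the buffer period, so the average is simply the midpoint. The function name latency_ms is made up for this sketch.

```python
READ_MS = 1.0     # first-block read, rounded up as in the text
PERIOD_MS = 3.0   # one 128-frame buffer at 44.1kHz, rounded as in the text

def latency_ms(wait_for_mix_ms):
    """Trigger-to-output latency given the wait for the next buffer mix.

    wait_for_mix_ms ranges from 0 (data ready just before a mix)
    up to one full period (data ready just after a mix).
    """
    return READ_MS + wait_for_mix_ms + PERIOD_MS

best = latency_ms(0.0)
worst = latency_ms(PERIOD_MS)
average = (best + worst) / 2   # assumes uniformly distributed trigger timing
print(best, worst, average)    # prints "4.0 7.0 5.5"
```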
The point here is that only the first 1ms has anything to do with the microSD card. The rest is a result of polyphony, our audio engine and the need for audio buffers, and would be the same if we were streaming from RAM with no microSD card.
You can of course reduce latency by making the buffers smaller, but this reduces the amount of time for mixing, DSP and necessary background tasks like reading blocks of data from the microSD card. If any of that ever takes longer than one buffer time, you'll get an audio glitch. Or, if polyphony is not a requirement and you're only playing one sound at a time, you could literally stop the audio engine, and start it when you need to play a sound.
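To get a feel for that trade-off, here is a rough Python sketch of how much time is left in one buffer period after a single microSD read. The 0.7ms read time is the scope measurement from earlier. The buffer sizes other than 128 are hypothetical, and real mixing and DSP time would come out of the same budget.

```python
SAMPLE_RATE_HZ = 44_100
SD_READ_MS = 0.7  # one 4K read, from the scope measurement in the text

def buffer_period_ms(frames, sample_rate_hz=SAMPLE_RATE_HZ):
    """Time it takes the DAC to clock out one buffer, in milliseconds."""
    return frames / sample_rate_hz * 1000

def headroom_ms(frames, work_ms):
    """Time left in one buffer period after the work; negative means a glitch."""
    return buffer_period_ms(frames) - work_ms

for frames in (16, 32, 64, 128, 256):
    period = buffer_period_ms(frames)
    spare = headroom_ms(frames, SD_READ_MS)
    print(f"{frames:4d} frames: period {period:.2f} ms, headroom {spare:+.2f} ms")
```

With 128 frames there are more than 2ms to spare after the read. At 32 frames the read alone uses nearly the whole period, and at 16 frames it no longer fits at all.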
I hope this helps clarify things. Please use the comment section here to ask questions if you have them.