Wednesday, July 7, 2010

The Fastest Loop in Actionscript?

We learned a lot about Flash while optimizing the signal processing code in the StandingWave audio framework. Audio code consists almost entirely of endless tight loops, where each instruction could be executed hundreds of thousands of times per second. Native audio hackers traditionally optimize these types of loops in assembler, but in a sluggish VM like the Flash Player we don't have that option.

A lot of performance issues come from how the whole audio system interoperates...but at the lowest level, we're dead if we can't get all our simple low-level loops as fast as possible. Mixing is a core process, so let's look at the seemingly simple task of mixing two signals into a third, and see how we do.

A mixing loop would run many types in each sample event handler. Since we're measuring such small intervals, we'll have to mix *a lot* of stuff to get accurate time measurements. So our test condition works like this: 5 timer handlers trigger a function that each time, runs ten times, and mixes two 32k sample buffers. So in each test, we mix 5x 10x 32768 frames, or about 1.6 million sample frames.

What if these signals were represented as Arrays of Number in ActionScript? We might mix them like this:

for (var i:int=0; i<32768; i++) {
  testArray3[i] = testArray1[i] + testArray2[i];
}

In our test suite, this takes about 385ms. That is not fast.

StandingWave 2 modeled each signal as a Vector of Numbers. What if we mix two Vectors like this?

for (var i:int=0; i<32768; i++) {
    testVector3[i] = testVector1[i] + testVector2[i];
}

In our tests, this clocks in at about 68ms. It's possible to squeeze a little more by tweaking the loop, but this is pretty much the hard limit for a pure AS3-based approach. You could work with this, but we can do better.

Now, let's try working with the same loops in Adobe Alchemy. In StandingWave3 all the audio processing happens in C code that is compiled into a swc through the hopefully now-well-known Alchemy process. 

In C, we have 3 arrays of floats. What happens if we write the same loop in C?

for (i=0; i<32768; i++) { 
    scratch3[i] = scratch2[i] + scratch1[i];   
}

It comes in at about 41ms. We've already beat AS3 Vector performance, and we haven't broken a sweat yet. But now we're also in territory that is well mapped by the many explorers of C optimizations, so let's try some of those out.

You see a lot of audio code in C that looks like this, and it's usually pretty fast. But how does it do for us?

int i=32768
while(i--) {
    *buffer3++ = *buffer2++ + *buffer1++;
}

It's even faster, at about 32ms.

The pointer crawling is certainly nice and concise code, but there's still some inefficiencies hiding here. Doesn't it seem like there's some redundant incrementing going on? And do we really need to check to see if we're at the end of the buffer thirty-two thousand types? So, two things make this better. We only need 1 counter, and we're going to unroll the loop.

Joe laughed at me when I showed him this, but the old loop-unrolling trick works wonders here. In practical tests, 8x seems like the best compromise between speed-gain and code-bloat, so we'll go with that.

In this example, if "i" is the number of frames you need to process, we get the 8 factor and the remainder, and proceed like this:


int i=32768;
int count8 = i / 8;
int count1 = i % 8;

while(count8) { 
    count8--;
    scratch3[count8] = scratch2[count8] + scratch1[count8];
    count8--;
    scratch3[count8] = scratch2[count8] + scratch1[count8];
    count8--;
    scratch3[count8] = scratch2[count8] + scratch1[count8];
    count8--;  
    scratch3[count8] = scratch2[count8] + scratch1[count8];
    count8--;
    scratch3[count8] = scratch2[count8] + scratch1[count8];
    count8--;
    scratch3[count8] = scratch2[count8] + scratch1[count8];
    count8--;
    scratch3[count8] = scratch2[count8] + scratch1[count8];
    count8--;
    scratch3[count8] = scratch2[count8] + scratch1[count8];
}
while (count1--) {
    scratch3[count1] = scratch2[count1] + scratch1[count1];
}


It's hard to even measure this, since Flash's getTimer() function has millisecond-level accuracy, but this is on the order of 10ms. Wow!

And that's about the best I can do so far. So the final results are:

AS3 Array for loop: 385ms
AS3 Vector for loop: 68ms
for C loop: 41ms
Pointer crawling for C loop: 32ms
Unrolled 8x C while loop: 10 ms

And its hard to believe but in the grand scheme of things, 10ms for mixing 1.6 million samples is still kind of slow, but it's enough for us to be able to build something interesting.

All tests done on a 2.4 GHz MacBook with Flash Player 10.1 release version.

1 comment: