Sunday, 30 November 2014

Dynamic lip sync using AS3 and Flash with mp3 and Audacity

Sometimes it's nice to try flexing your coding muscles on something small and achievable in a single sitting, rather than having to load an entire, complex, multi-module application into your head, just to edit a few lines of code.

This weekend, we spent some time thinking about the kinds of games we'd like to create for our electronic board game system. Not just vague genres, but how to actually implement some of these ideas into a game.

For example, we're already decided on a fantasy football game. That's a given.
But wouldn't it be great to have some online commentary as you play? As a playing piece is moved down the pitch, and tackles attempted, the ball fumbled and so on, some disembodied radio-commentary-style voice could read out the action as it unfolds.

We're pretty much decided that such a thing would be really cool. But then - as often happens with good ideas - someone went at took it a stage further. Why not have an animated commentator, in the top of the screen somewhere? A John-Moston looky-likey maybe- delivering his audio commentary into a microphone, on the screen?

That would be cool. But could prove quite tricky to implement.
After all, it was a long time ago that a few of us put together a perfectly lip-synched "Nothing Compares 2U" animation with a couple of 3d characters (using Hash Animation Master if you're interested!). That was only three minutes long and the animation took about three weeks, of scrubbing, listening, manually setting key frames and so on. That would be a massive task!

Michael (who had a hand in organising the phoneme shapes for our original animation) suggested an alternative, slightly-cheating-but-good-enough-for-a-bit-of-fun approach to lip-synching, using on-the-fly processing: moving the mouth to match the amplitude of the sound coming out of it.

This seemed a bit of a cop-out, but it does ensure that the most important parts are at least in sync (i.e. the mouth is closed at the start, and after completing a sentence). The bit inbetween, where words are spewing out of the mouth will just be more like a "flappy-gob" style of animation. But it's better than hand-crafting the keyframes for every phrase.

Michael also suggested that if the lip sync is being done on-the-fly, then there's no reason why we couldn't make the audio interchangeable: simply load an mp3 file at runtime and play that, to make the mouth animate on-screen. Load a different mp3 file, get a different set of sounds, but re-use the same lip-sync routines to allow for custom, user-generated content.

This is already getting way more complicated that anything we wanted to put together. But there's also something about being given a seemingly impossible challenge and a tight deadline to complete it by!

So we set about planning how things might work, in our fantasy football game.
To begin with, we'd have one, long, single mp3 file, containing loads of stock footballing phrases and cliches. These would be "chopped up" at runtime, and different sections of the mp3 played at different points during the game.

While the mp3 is playing (the commentary is being "said" by our onscreen avatar) the actual sound byte data would be analysed, an average amplitude of the section calculated, and an appropriate mouth shape displayed on the screen.

So far so easy....

For the sake of testing, we mashed a few Homer Simpson phrases together into a single, larger, mp3 file in Audacity.  This great, free, audio editor allows you to create a "label layer" and place labels at various points during the audio file.



It's worth noting that as you select a position in the label track, the "playhead" in the audio track is updated to the same point in the audio data: this helps line up the labels with the actual audio to be played.

We placed some simple titles at specific points in our audio file, to indicate which phrase was being said at which point in the file.


With all our labels in place, we exported the label track as a text file.


It is possible to stretch each label, to define a start and an end point. We didn't consider this necessary, since our mp3 track consists of lots of small phrases, all following each other with just a short break between them. For our use, we can say that the start point of the next phrase in the file is the end point of the previous one. So long as these points occur during the short silences between phrases, there should be no problems!

In our Flash file, we created a single movieclip, consisting of five frames, each with a different mouth shape


Then we added a few buttons to simply set the playhead position of the audio playback.
Then a bit of actionscript does the rest....


import flash.events.Event;
import flash.events.MouseEvent;
import flash.media.SoundChannel;

var left:Number;
var playerObject:Object = new Object();

var mySound:Sound = new Sound();
var myChannel:SoundChannel = new SoundChannel();
mySound.addEventListener(Event.COMPLETE, onSoundLoaded);
var stopAudioAt:Number=0;
var playHeadPos:Number=0;
var position:Number=0;
var samplesPerSecond:Number=0;

function processSound(event:SampleDataEvent):void{
var bytes:ByteArray = new ByteArray();
     
     // grab some audio data
     playerObject.sourceSnd.extract(bytes, 4096, playHeadPos);
           
     if(playHeadPos>stopAudioAt+(4096*2)){            
           scale=0;
           playerObject.outputSnd.removeEventListener(SampleDataEvent.SAMPLE_DATA, processSound);
           myChannel.stop();
           
     }else{
           
           bytes.position = 0;
           while(bytes.bytesAvailable > 0) {
                 left = Math.abs(bytes.readFloat()*128);            
                 var scale:Number = left*2;
           }
                 
           // send the audio data to the speaker
           event.data.writeBytes(bytes);
           
     }
     
     playHeadPos+=4096;
     
     // display the appropriate mouth shape, based on amplitude
     if(scale<0.04){
           mouth.gotoAndStop(5);
     }else if (scale<1) {
           mouth.gotoAndStop(1);
     }else if (scale<10) {
           mouth.gotoAndStop(2);
     }else if (scale<25) {
           mouth.gotoAndStop(3);
     }else if (scale<50) {
           mouth.gotoAndStop(4);
     }else {
           mouth.gotoAndStop(5);
     }
}

function playSound(){
     trace("playing sound from "+playHeadPos);
     playerObject.outputSnd.addEventListener(SampleDataEvent.SAMPLE_DATA, processSound);      
     myChannel=playerObject.outputSnd.play(); //playHeadPos);
}

function onSoundLoaded(e:Event){      
     playerObject.sourceSnd = mySound;
     playerObject.outputSnd = new Sound();
     
     // get the sample rate from the mp3 if possible
     // there are 8 bytes per stereo sample (4 bytes left channel, 4 bytes right channel)      
     // so if the audio is recorded at 44.1khz, samples per second is 44100*8
     // but we recorded at 11025hz in mono to keep the file size down, so samples per
     // second is actually 11025*4
     
     var sampleRate:Number=11025;
     samplesPerSecond=sampleRate*4;      
     
     btnBlah.addEventListener(MouseEvent.MOUSE_DOWN, playBlah);
     btnBozo.addEventListener(MouseEvent.MOUSE_DOWN, playBozo);
     btnTrust.addEventListener(MouseEvent.MOUSE_DOWN, playTrust);
     btnNews.addEventListener(MouseEvent.MOUSE_DOWN, playNews);
}


// load the external mp3 file
function loadSound(){
     mySound.load(new URLRequest("homer.mp3"));
}

// load the labels from a file

function playBlah(e:MouseEvent){
     playHeadPos=0.123875*samplesPerSecond;
     stopAudioAt=1.907678*samplesPerSecond;      
     playSound();
}

function playBozo(e:MouseEvent){      
     playHeadPos=1.907678*samplesPerSecond;      
     stopAudioAt=3.815355*samplesPerSecond;
     playSound();
}

function playTrust(e:MouseEvent){      
     playHeadPos=3.815355*samplesPerSecond;      
     stopAudioAt=9.765493*samplesPerSecond;
     playSound();
}

function playNews(e:MouseEvent){      
     playHeadPos=9.765493*samplesPerSecond;
     stopAudioAt=12.781134*samplesPerSecond;
     playSound();
}

mouth.gotoAndStop(5);
loadSound();
stop();

The AS3 above uses two "sound" objects. One holds the audio data of our mp3 file, which is loaded from an external file at runtime. The other sound object is the one that actually plays sounds. There are few comments which explain basically what's going on, but it's important to understand a few core concepts:

We use an eventlistener to populate the "outgoing" sound object with wavdata, just before it gets played (sent to the sound card). It's also this data that we interrogate, just before sending it to the soundcard, to decide which mouth shape to display.

To play a sound, we simply say from which point in the loaded sound data we want to start extracting the wavdata bytes. When we've reached the end of the sample section, we stop the audio from playing.

The trickiest part in all this is to convert the label data from Audacity into "sample data offset" values, to get the audio to start/stop at the correct place. The labels in Audacity are placed according to how many seconds have elapsed. But Flash works with "samples" not time-based data.

Knowing that Flash uses four bytes for each sample, we can work out that one second of audio data, recorded at CD-quality 44.1khz should contain 44100 * 8 = 352800 bytes. So if we start extracting the sample data from our loaded mp3 file, 352800 bytes in, we effectively start playback one second into the audio. Similarly, to start playback from 5 seconds in, we need to start extracting the audio sample data from byte 44100 * 8 * 5 = 1,764,000.

To reduce our file sizes (and decrease the overhead in manipulating files) we recorded our mp3 samples in mono, at 11025hz. This means that instead of 8 bytes per sample, we're only using 4 (since there's no left/right track, just 4-bytes per sample for a single, mono track). So we need to calculate our byte offset as:

startByte= 11025 * 4 * timeInSeconds


Here's the result: http://www.nerdclub.co.uk/lipsync_demo.htm



Still to do - load the "cue points" from an external file (rather than embed them into the AS3 code) but this should be a relatively trivial matter; we've managed to demonstrate the idea works in principle. And once we can load "cue points" from an external file, as well as the actual mp3 audio, there's no reason why we can't include user-editable commentary in our fantasy football game. Now that would be exciting!