Wednesday, 17 October 2012

The Magic Behind the Phase Vocoder Part A

Hopefully you've seen my post on my phase vocoder project.  Otherwise there won't be much motivation for this slog at the theory; when you have this much maths and theory there better be a good reason for it.  Fortunately, the phase vocoder can do some extremely cool stuff, like shift the pitch of some audio while keeping it at the same speed. If you haven't read about Fourier's theory, or the Fourier transform, look back to simpler times.  Without understanding this you really won't understand the next bit.

A FFT is pretty useful for breaking down the frequencies in a snapshot of audio.  However, a snapshot is just a snapshot - capturing an instance in time.  Generally we'd be talking a transform size of 1024, up to maybe 4096 samples.  You might wonder why you can't just take a transform of an entire 3-4 minutes of audio, and then use that for all your spectral needs.  While that transform will have astounding number of spectral points - i.e. really super "spectral resolution" it will have awful "time resolution" - i.e. you have no way of seeing how the frequencies change over time.  You wouldn't be able to pick out the distinct thumps of a bass drum or the squeal of a 30 second guitar solo, because it's squished down into a generic spectral mush over the whole song.

What would the intuitive next step be, if we want to see how the frequencies change in time?  What about taking two transforms, with the first transform of the first half of the audio and the second transform the second half.  Now we have half the number of spectral points but we do have some time resolution i.e. can see how the spectrum changes over the audio.  Moving on, the natural progression is to make the transforms smaller and have more of them.  You could make the transform size 1024 samples big and then just split the audio into tiny sections.  After taking loads of Fourier transforms, you end up with a load of spectral data.  You will now be able to see how the frequencies change over time, with good enough time and spectral resolution.  This is the beginnings of the STFT - the Short Term Fourier Transform.  A good visual way to imagine this process, is as a spectrogram (often a set of bars that light up, indicating frequency levels) on a CD player or in a Digital workstation.

It would be nice if the STFT was as simple as slicing up audio and shifting the magical FFT along, but there are catches.  Because the FFT abruptly slices audio, but then assumes that snapshot is periodic, you get some nasty spectral leakage.  The phenomona is illustrated below.

How the Fourier transform mauls your signal
Spectral leakage is the bane of DFT processing.   Discontinuities are pretty much guaranteed when transforming an arbitrary signal.  There is no perfect solution to this, but there is a very good one.  Often you'll here mention of the "transform window"; the window is the samples that the DFT is currently “looking at”.  Leaving the samples alone, this window is called a rectangular window.  Ultimately we want to get rid of those jumps at the sides of the window.  To do this we use a windowing function, a function designed to smooth out discontinuities while not affecting the frequencies too much.

To apply the window you simply multiply the input sample with the corresponding sample of the windowing function.  There are many other windowing functions e.g. Blackman, triangular, Gaussian.  The two most popular windowing functions are the Hann window and the Hamming window.  They’re almost identical, apart from a slightly different constant.  The formula and pictures of what they look like are given nicely on Wikipedia

Windowing however ruins the audio a bit, and on reconstruction the audio will sound like it's coming in and out really quickly (amplitude modulation), not really what you wanted.  The solution is to overlap the snapshots enough, so on reconstruction the original signal can be rebuilt fine.  For most windows the overlap required is an overlap of 75%.  Often, for the STFT we refer to the "hop size", i.e. how many samples you "hop" the window over each time.  If N is the transform size, a common hop size is N/4 or maybe, if you can affort the extra data, N/8. 

The STFT, pictorially (drawn badly)
The STFT is useful for some spectral things, for example the cross synthesis project I did earlier, but not capable of shifting pitch or stretching time without distortion.  However, with only a small modification of the spectral data, we can – this operation is the heart of the phase vocoder algorithm.   All will be revealed in the next installment.

No comments:

Post a Comment