A FFT is pretty useful for breaking down the frequencies in a snapshot of audio. However, a snapshot is just a snapshot - capturing an instance in time. Generally we'd be talking a transform size of 1024, up to maybe 4096 samples. You might wonder why you can't just take a transform of an entire 3-4 minutes of audio, and then use that for all your spectral needs. While that transform will have astounding number of spectral points - i.e. really super "spectral resolution" it will have awful "time resolution" - i.e. you have no way of seeing how the frequencies change over time. You wouldn't be able to pick out the distinct thumps of a bass drum or the squeal of a 30 second guitar solo, because it's squished down into a generic spectral mush over the whole song.

What would the intuitive next step be, if we want to see how the frequencies change in time? What about taking two transforms, with the first transform of the first half of the audio and the second transform the second half. Now we have half the number of spectral points but we do have some time resolution i.e. can see how the spectrum changes over the audio. Moving on, the natural progression is to make the transforms smaller and have more of them. You could make the transform size 1024 samples big and then just split the audio into tiny sections. After taking loads of Fourier transforms, you end up with a load of spectral data. You will now be able to see how the frequencies change over time, with good enough time and spectral resolution. This is the beginnings of the STFT - the Short Term Fourier Transform. A good visual way to imagine this process, is as a spectrogram (often a set of bars that light up, indicating frequency levels) on a CD player or in a Digital workstation.

It would be nice if the STFT was as simple as slicing up audio and shifting the magical FFT along, but there are catches. Because the FFT abruptly slices audio, but then assumes that snapshot is periodic, you get some nasty spectral leakage. The phenomona is illustrated below.

How the Fourier transform mauls your signal |

To apply the window you simply multiply the input sample with the corresponding sample of the windowing function. There are many other windowing functions e.g. Blackman, triangular, Gaussian. The two most popular windowing functions are the Hann window and the Hamming window. They’re almost identical, apart from a slightly different constant. The formula and pictures of what they look like are given nicely on Wikipedia.

Windowing however ruins the audio a bit, and on reconstruction the audio will sound like it's coming in and out really quickly (amplitude modulation), not really what you wanted. The solution is to overlap the snapshots enough, so on reconstruction the original signal can be rebuilt fine. For most windows the overlap required is an overlap of 75%. Often, for the STFT we refer to the "hop size", i.e. how many samples you "hop" the window over each time. If N is the transform size, a common hop size is N/4 or maybe, if you can affort the extra data, N/8.

The STFT, pictorially (drawn badly) |

## No comments:

## Post a Comment