Saturday, 5 November 2011


Since this blog is called programming notes, I figure there should be a bit of stuff more focused on programming.  If you're only interested the audio side of the blog, then this article is probably a skip.

Another current little project I've been working on is a program that uses Huffman compression to compress and decompress a text file.  The topic of compression is interesting, because it's actually something that has a major implication in audio programming.  The project isn't majorly complicated, but getting it working fully has taken a couple of weeks.  I guess school-work has got in the way, most weekdays I can get 3-4 hours of programming in, but still, usually these kind of projects take 3-4 days.

The big difference with the Huffman compressor from other data structure things I've done is it works almost entirely at the bit-level.  C(++ for this project)  is great because it allows you to work with bits and binary input / output with relative ease.  However, I soon realised my task of writing a compressor would mean solving 2 problems:
1) Streaming binary data i.e. creating a stream of binary digits rather than characters / numbers etc.
2) Writing a Huffman compressor / decompressor

Almost all my grief has come from the first problem.  This is for several reasons.  First, working at the bit level is something I don't tend to do a lot, and all the bit-shifting, bit-masking business is easy to get subtly wrong.  The thing is, if it goes subtly wrong, there will be a tiny bug that won't be picked up until the very last stage of decompression, which is when I open the file and realise I'm definitely not looking at my original text file. Any tiny problem with the compressed file can result in the output being complete rubbish.

I'm definitely no expert, but hours and hours of frustration have taught me things about debugging.  Every time I write more code, I learn a bit more about it. Here I'm going to give a few of my observations when it comes this often overlooked, but majorly important, topic.

Compiling != finished: Actually, the sooner you can compile, even if you've done very little code, the better.  This is so you can fix problems individually, as they appear, rather than hundreds simultaneously.

Small is better: Splitting a program into small manageable modules can make the difference between hours of work and weeks of work, or even never being able to finish something.  Split the problem into sub problems, and solve each one individually.  Relating it back to debugging - by splitting a mega mess of code into nice, manageable classes and functions, you can thoroughly test each little bit on its own and ensure each little bit works before combining.  Neglecting this has cost me days and a huge amount of unnecessary of effort, because a little function I assumed worked was messing up stuff, and the larger the bulk of code you're testing, the harder it is to locate a bug.

Work through on paper: Often, the best way to see a problem is going and doing each step yourself.  Working it through, how you think it should go, allows you to step through code and gives you a point of comparison.  If the program suddenly does something different, you might be able to pinpoint the problem and see what went wrong.  In my compressor, one thing that caused problems was bit-shifting in the wrong direction.  This became really obvious when I worked it through myself and compared my values with the variables in my program.  Also, drawing out diagrams was especially useful when I implemented more complicated data structures.

Debug cleverly: Debugging cleverly is essential to solving problems.  Debugging cleverly means looking for output patterns, putting breakpoints in good places, feeding strategic input into your program, and in general, thinking about what you're doing.  For compression - things I did included testing certain lengths of input, certain characters (e.g. extended ASCII ones) and certain frequencies of characters.  Another thing I did was find the smallest input which worked, and then kept increasing the size by a character / line, until something went wrong.

Make it visual: Although a good IDE (I use Visual Studio 2008) will allow you to look at values of variables and get a really good insight into the state of your program at any point, I still find printing out to the console can really help me debug my programs.  Things like printing out the states of streams, or the value of a variable in a loop that iterates a lot often help.  Another thing is spending a little time and effort to make a class printable, so you can then see an overview of its objects more easily.  A really good example of this is a little recursive function I made to print a binary tree to the console.  The function took almost no work to make, but it makes it a million times easier to see what's happening inside the structure. 

Finally, test program to the limit:  Debugging essentially is testing your program to find bugs.  To find as many bugs as possible, if your program takes any form of input, then you'll have to try a wide range of inputs.  I've often made a program and debugged it, but only fixed it to work with one specific input.  When I debugged my compressor late on, it worked perfectly for a smallish file with >100 lines, but when the file got big, a little bug crept in, where some characters were missed at the end.  Another bug occurred when extended ASCII codes (128-256) were in the file (this bug was actually more serious).

I always find it really tempting to leave a project after getting it to work the first time, because I don't want to find more bugs.  If it ain't broke, don't fix it surely?  The problem is, you have to assume the program's broke until you can thoroughly prove it isn't.  Try to force errors.  If your program takes only positive numbers, put in some cheeky negative numbers.  See if you can get some really ugly errors - segmentation faults and the like.  Assume you'll run out of memory and calls to new will fail.  My last test was to run my compressor for a text file with over 500,000 characters (251KB) and then for a WAV file (21,941KB).  Because the files were really long, and files of that length would test most possibilities, after compressing the files and then decompressing the output back to the original successfully I can be fairly sure my program will work reliably.

That concludes my programming ramblings.  Again, by no means am I an expert, but I just wanted to share a few opinions on a topic that has the potential to make life easier, and code better.

No comments:

Post a Comment