jueves, 4 de abril de 2013

[IT] Homework 3: Text compression

For this homwork, we had to implement the tree-based Huffman's encoding for unicode strings.
Also, we had to design, document, and analyze an experiment to determine the worst-case and typical-case complexity and compression ratio.

Huffman's coding

The Huffman encoding algorithm is an optimal compression algorithm. In this algorithm only the frequency of each individual letters are used to compress the data.
The main idea is that if the text have some letters that are more frequent than others, then that letters can be represented by fewer bits to encode those letters than to encode the less frequent letters.

For instance, if you have the string:

"I ate an apple"

We can get the frequencies of each letter and build a hash table:

LetterFrequencyEncoding representation

Originally, each characters is represented by a byte, so, if our string have 14 characters then the size is 14 bytes or 112 bits.
Using the new encoding representation, the string is represented in that form:


|0101|00   |100|011|110|00   |100|1111|00   |100|101|101|1110|110|
i    space a   t   e   space a   n    space a   p   p   l    e

Requiring only 46 bits, so, we can calculate a compress ratio of 38% or 1 to 2.6 , with a space saving of 62%.

In fact, each binary ecoding representation is a path in a binary tree, each bit represents the direction where we must to search to find a letter (left or right, 0 or 1) and the length of the binary representation means which deeper we must to search to find the letter.

If you want to know more of the background of Huffman's encoding algorithm, I recommend the following links:


Basically what I did was to program two Python's scripts.
The first one performs the Huffman's encoding algorithm. That scripts follows these steps for compression:
  1. Receive the path to the text file.
  2. Reads the file and calculate the frequencies of each character.
  3. Build the binary tree.
  4. Get the binary representation of each character walking through the paths of the tree.
  5. Compress the text using the new binary representation of each characters.
  6. Calculates:
    1. the time of the compression process
    2. the text size before and after compression
    3. the memory used during the compressio (tree size)
    4. the compress ratio and the space savings
And these steps for decompression:
  1. Receive the path to the compressed file and the dictionary file.
  2. Loads the dictionary on memory.
  3. Reads the compressed file char by char searching on the dictionary for the binary-like string representation.
  4. Changes the binary-like string representation for the corresponding letter.
  5. Calculates:
    1. the time of the decompression process
    2. the text size before and after decompression
    3. the memory used during the decompression (doctionary size)
    4. the compress ratio and the space savings

The second script builds the experiment environment, passing the text files to the first script and receiving all the calculated data from the first script.

Each text is compressed/decompressed 30 times and an AWK script calculates the average time, memory and compress ratio for each case.

The results are saved on plain text file separated by spaces, in order to be plotted later using Gnuplot.

For the experiment I ran two cases
  • The typical one where I used twenty (20) chapters of "Don Quixote de la Mancha" in 3 different sizes
  • The worst one where I generated twenty (20) texts with an uniform frecuency of characters, using random functions, in 3 different sizes.
  • The sizes of the files was:
    • Small = 10000 characters
    • Medium = 50000 characters
    • Big = 100000 characters


A) Compression case

Typical / Time

Worst / Time

As we can see in the pictures, the typical case is faster than the worst time, but only by few miliseconds.
Also, in this case, we can see that the size of the texts doesn't affect the performance of the algorithm because the time grows linearly with respect the size of the text.
Also, there are some fluctuations over the compression of each text in each size case, maybe this happens because the computer has more processes in memory, you can see in the graphs that the fulctuations has some pattern due the computer usage while the algorithm was running.

Typical / Memory

Worst / Memory

The memory aparently doesn't change too much in every case, only a a few bytes in every size change.
In the typical case, the size of the tree it's being plotted, in the worst case the size of the dictionary used to decompress the text it's being plotted.
While the size of the dictionary it's the same in each text size for the worst cases, the size of the tree grows in the typical case, but doesnt have a signifficant growth.

Compression Ratio
Typical / Ratio

Worst / Ratio

In this graphs, the data could be a little tricky, as greater than the compression ratio is, the space savings is lower. The compression ratio is measured in a scale from 0 to 1.
In the typical case, the compression ratio is almost the same in all the cases, arround 0.5 or 0.6, so, the --correct interpretation of the compression ratio is write a scale such as 1:2 aproximately. That means arround a 40% - 50% of space savings.

B) Decompression case

For the decompression case, I only plot the time of the process and the used memory.

Typical / Time

Worst / Time

In this case, the time aparently behaves similar to the compression case, but now, the decompression time of the bigger texts is greater than in the compression case.
But in this cases, the time appears to grow also linearly when the text size grow.


Typical / Memory

Worst / Memory

Also, the memory appears to be consistent in each case, and doesn't fluctuates very much in each repetition. The more significant change is in te typical case, in the jump from 10000 characters to 50000 characters. Apparently, the text size could have a more remarkable effect in the decompression process than the compression process.


In conclusion, we can say that:

  • In the time, as great is the compression/decompression text, the time is also greater, but the time grows linearly.
  • The memory appears to not grow to much, the most significant growth is when we jump from small texts to medium texts. 
  • The compression ratio doesn't change to much in each text, but, we espect that when the size text grows, the compression ratio is lower and the space savings are greater.
  • The performance of the algorithm is only affected in time, when the text size grows.



Experiment builder

Awk postprocessor

1 comentario:

  1. Bien :) No tengas miedo a manipulación binaria. Y pon en tus scripts de awk además del promedio la deviación estándar (y luego usas yerrorbars en tus plots). 7+8 pts.