Data compression: Huffman encoding of text in MATLAB

1. Introduction to Huffman coding

Huffman coding is a variable-length coding (VLC) method. Compared with fixed-length ASCII coding, Huffman coding can save a great deal of space, because characters do not all occur with the same frequency: frequently used characters receive short codes, and rare characters receive long ones.

It is an entropy coding algorithm for lossless data compression, usually used to compress character data with a high repetition rate.

If we transmit the string BCAADDDCCACACAC by converting each character into its corresponding ASCII code, each character takes 8 binary bits, so the 15 characters need 120 bits in total; with Huffman encoding, the same string can be compressed to 28 bits.

2. Encoding method:

1. Initial string: BCAADDDCCACACAC

2. Calculate character frequencies: C appears 6 times, A appears 5 times, D appears 3 times, and B appears once.

3. Sort the characters by frequency of occurrence to form a queue Q.

The characters with the lowest frequency are at the front, and those with the highest frequency are at the back: B(1), D(3), A(5), C(6).
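
As a quick check, the counting and sorting can be sketched in a few lines of MATLAB (the variable names s, u, cnt, order are just for this illustration):

s = 'BCAADDDCCACACAC';
u = unique(s);                    % distinct characters: 'ABCD'
cnt = zeros(1, numel(u));
for k = 1:numel(u)
    cnt(k) = sum(s == u(k));      % occurrences of each character
end
[cnt_sorted, order] = sort(cnt);  % ascending frequency
disp(u(order))                    % BDAC -- queue Q, lowest frequency first
disp(cnt_sorted)                  % 1  3  5  6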

4. Use these characters as leaf nodes to start building a Huffman tree

The Huffman tree, also known as the optimal binary tree, is the binary tree with the shortest weighted path length. The so-called Huffman coding can be viewed as an inverted Huffman tree.

(1) First create an empty node z. Assign the character with the smallest frequency (B, frequency 1) to the left of z and the character with the second-smallest frequency (D, frequency 3) to the right of z, then set z to the sum of the two frequencies. B and D are removed from queue Q, and their sum (4) is inserted into the queue in sorted order.

(2) Next, create another empty node z. Take the node of weight 4 as its left child and A (frequency 5) as its right child, make the sum 4 + 5 = 9 the parent node, and insert this sum into the queue in order. The tree is thus built from small to large by frequency, with the smaller weight on the left.

(3) Continue building the tree in the same way until every character appears in a node of the tree; the Huffman tree is then complete.
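
For the example string, the merging therefore proceeds as follows:

Q: B(1) D(3) A(5) C(6)  ->  merge B and D into a node of weight 4
Q: 4 A(5) C(6)          ->  merge 4 and A into a node of weight 9
Q: C(6) 9               ->  merge C and 9 into the root, weight 15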

The weighted path length of a node is the product of the path length from the root node to the node and the node weight.

The weighted path length of this binary tree is WPL = 6*1 + 5*2 + 3*3 + 1*3 = 28.
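
Note that this WPL is exactly the length of the encoded message: each character contributes (its count) * (its code length) bits, which is the same sum.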

5. Encode characters:

The Huffman tree is now constructed, and each letter must be encoded. The encoding rule is: for each non-leaf node, label the edge to its left child 0 and the edge to its right child 1. The final code for a character is the sequence of 0s and 1s on the path from the root node to that character's leaf.

Therefore, the codes for each letter are:

A     B     C     D
11    100   0     101

Before Huffman encoding, the binary (ASCII) representation of the string “BCAADDDCCACACAC” is:

010000100100001101000001010000010100010001000100010001000100001101000011010000010100001101000001010000110100000101000011

That is, it takes up 120 bits;

After encoding:

1000111110110110100110110110

Occupies 28 bits.
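
The 28-bit result can be verified with a short MATLAB sketch (the code table above is hard-coded into a containers.Map purely for this check; it is not part of the script below):

s = 'BCAADDDCCACACAC';
dict = containers.Map({'A','B','C','D'}, {'11','100','0','101'});
bits = '';
for k = 1:length(s)
    bits = [bits dict(s(k))];   % append this character's code
end
disp(bits)           % 1000111110110110100110110110
disp(length(bits))   % 28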

3. Reading and Huffman encoding a text file

1. Text input
clc
clear
% Read the text file to be compressed
filename = 'input.txt';
text = fileread(filename);

inData = text; % the input text
disp(inData)
2. Create the Huffman tree
% Count the occurrences of each distinct character
uniqueCharacter = unique(text);               % the distinct characters, sorted
n = length(uniqueCharacter);
uniqueCharacter_num = zeros(1, n);
for i = 1:n
    uniqueCharacter_num(i) = sum(text == uniqueCharacter(i));
end
uniqueCharacter_p = uniqueCharacter_num / length(text); % occurrence probabilities
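
For the demo string BCAADDDCCACACAC these variables would come out as (shown for orientation):

% uniqueCharacter     = 'ABCD'
% uniqueCharacter_num = [5 1 6 3]
% uniqueCharacter_p   = [0.3333 0.0667 0.4000 0.2000]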


%% Create the Huffman tree
% Sort the characters from low to high by their probability of occurrence
p = uniqueCharacter_p;
[a, b] = sort(p);                % sort the probability vector
m(1,:) = b;
for i = 1:length(a)
    c(i) = uniqueCharacter(b(i)); % reorder the character queue to match
end
q = sort(p);                     % probabilities in ascending order
n = length(uniqueCharacter_p);
for i = 2:n
    matrix(i-1,2) = c(1);        % record the left child in matrix
    matrix(i-1,3) = c(2);        % record the right child in matrix
    matrix(i-1,1) = double(i-1); % record the new internal node's id in matrix
    c(2) = double(i-1);          % replace the merged pair with internal node i-1
    q(2) = q(1) + q(2);          % the new node's weight is the sum of the two smallest
    q(1) = 1;                    % weight 1 makes this slot sort last, so it is never processed again
    % Re-sort the new probability combination
    [a, b] = sort(q);
    [a1, b1] = sort(b);
    m(i,:) = b1;                 % two sorts record the position corresponding to each probability
    temp_c = c;                  % intermediate copies
    temp_q = q;
    for j = 1:length(a1)         % use a fresh loop variable, not the outer i
        c(j) = temp_c(b(j));     % update the order of the character queue
        q(j) = temp_q(b(j));
    end
end
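
If the counting code above is run on the demo string (uniqueCharacter = 'ABCD'), the loop should produce a three-row matrix. Character children are stored as their ASCII codes and internal nodes as their ids; letters are shown below for readability:

% matrix = [ 1  'B'  'D'      % node 1: B(1) + D(3), weight 4
%            2   1   'A'      % node 2: node 1 + A(5), weight 9
%            3  'C'   2  ]    % node 3 (the root): C(6) + node 2, weight 15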
3. Implement the encoding and file output
% Specify the file names for saving the encoding results
file_name = 'huffman.txt';
file_name1 = 'code.dat';


% Build and display the Huffman code table
disp('Huffman coding table:')
code = uniqueCharacter'; % transpose to a column so each row can hold one character and its code
fileID = fopen(file_name, 'w');
fileID1 = fopen(file_name1, 'w');
disp(n)
for i = 1:n
    [temp_code, code_len] = Coding(matrix, uniqueCharacter(i)); % code_len, not n: n must keep the character count
    code(i, 3:code_len + 2) = temp_code;
    len(i) = code_len;

    % Store to huffman.txt
    fprintf(fileID, 'Character is:');
    fprintf(fileID, '%s', uniqueCharacter(i));
    fprintf(fileID, ' ');
    fprintf(fileID, 'Encoded as:');
    fprintf(fileID, '%s', code(i, 3:code_len + 2));
    fprintf(fileID, ' ');
    fprintf(fileID, 'The number of occurrences of the character is:');
    fprintf(fileID, '%d\n', uniqueCharacter_num(i));
end

% Store the encoded bitstream to code.dat
for i = 1:length(text)
    for j = 1:n
        if text(i) == uniqueCharacter(j)
            fprintf(fileID1, '%s', code(j, 3:len(j) + 2)); % write this character's code as text
        end
    end
end
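
Note: fprintf stores the bits as the text characters '0' and '1', one byte per bit; for the demo string, code.dat would contain the 28-character string 1000111110110110100110110110. True bit-level packing would require fwrite with the 'ubit1' precision instead.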



disp(uniqueCharacter_num) % output the character counts
disp(code)                % output the character codes

fclose(fileID);
fclose(fileID1);

% Encoding function: climb from a character's leaf up to the root,
% recording 0 for a left branch and 1 for a right branch, then reverse
function [code, n] = Coding(matrix, character)
    a = size(matrix, 1);
    for i = 1:a
        [row, col] = find(matrix(:, 2:3) == character);
        character = matrix(row, 1);  % move up to the parent node
        if col == 1
            code(i) = '0';           % found as a left child
        else
            code(i) = '1';           % found as a right child
        end
        if row == a                  % reached the root row
            break
        end
    end
    % The bits were collected leaf-to-root, so reverse them
    code = fliplr(code);
    n = length(code);
end
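
Against the demo matrix, the function can be sanity-checked as follows (a hypothetical session; the expected values follow from the code table in section 2):

% [c, m] = Coding(matrix, 'B')   % expected: c = '100', m = 3
% [c, m] = Coding(matrix, 'C')   % expected: c = '0',   m = 1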
4. Coding efficiency
% Compression ratio: original size in bits divided by encoded size in bits
disp('Encoding compression efficiency:')
e = (sum(uniqueCharacter_num) * 8) / sum(len .* uniqueCharacter_num)
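
For the demo string this gives e = (15*8) / (5*2 + 1*3 + 6*1 + 3*3) = 120/28 ≈ 4.29, i.e. the encoded text needs less than a quarter of the original bits.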
5. Implement the decoding part
% Decode the Huffman code read back from code.dat

% Read the code.dat file
filename = 'code.dat';
fid = fopen(filename, 'r');
if fid == -1
    error('Cannot open file');
end

% Read the file content line by line
codeData = '';
tline = fgetl(fid);
while ischar(tline)
    codeData = [codeData tline];
    tline = fgetl(fid);
end

% close file
fclose(fid);

% Print the content read
disp(codeData);

% Decode by walking the Huffman tree built earlier
decodedText = Decoding(matrix, codeData, uniqueCharacter);
disp("The decoding result is: ")
disp(decodedText)

% Huffman decoding: walk down the tree from the root, one bit at a time,
% and emit a character whenever a leaf is reached
function decoded_string = Decoding(matrix, code, uniqueCharacter)
    a = size(matrix, 1);                % the root is the last internal node
    decoded_string = '';
    node = a;                           % start at the root
    for i = 1:length(code)
        if code(i) == '0'
            next = matrix(node, 2);     % 0: follow the left branch
        else
            next = matrix(node, 3);     % 1: follow the right branch
        end
        % A child is a leaf if it is one of the text's characters
        % (assumes no character code collides with an internal node id)
        if any(double(uniqueCharacter) == next)
            decoded_string = [decoded_string char(next)]; % emit the character
            node = a;                   % restart at the root
        else
            node = next;                % an internal node id: keep walking
        end
    end
end
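
For the demo data, decoding the 28-bit string should recover the original text (a hypothetical session, assuming the demo matrix and uniqueCharacter = 'ABCD'):

% Decoding(matrix, '1000111110110110100110110110', 'ABCD')
% expected: 'BCAADDDCCACACAC'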
6. Test

Test with a small text first.

input.txt:

Output results:

Files stored as:

huffman.txt:

Large file test:

input.txt:

huffman.txt:

code.dat:

The explanation of the Huffman principle here draws in part on this blog post:

Detailed explanation of the principle of Huffman Coding-CSDN Blog
