1. Introduction to Huffman coding
Huffman coding is a variable-length coding (VLC) method. Compared with fixed-length ASCII coding, Huffman coding can save a lot of space because characters do not all occur with the same frequency.
It is an entropy coding algorithm used for lossless data compression, typically applied to character data with a high repetition rate.
If we transmit the string BCAADDDCCACACAC as binary data by converting each character to its ASCII code, each character takes 8 bits, so 120 bits are needed in total; with Huffman coding, the same string can be compressed to 28 bits.
2. Encoding method:
1. Initial string:
2. Calculate character frequency:
3. Sort the characters by frequency of occurrence to form a queue Q
The characters with the lowest frequency are at the front of the queue, and those with the highest frequency are at the back.
4. Use these characters as leaf nodes to build a Huffman tree
The Huffman tree, also known as the optimal binary tree, is the binary tree with the shortest weighted path length. Huffman codes can be read off an inverted Huffman tree.
(1) First create an empty node z. Assign the character with the smallest frequency to the left of z and the character with the second smallest frequency to the right of z, then set z's weight to the sum of the two frequencies. B and D are then removed from queue Q, and their sum is inserted into the queue at the position marked * in the figure above.
(2) Next, create another empty node z: take the node with weight 4 as the left child and A (frequency 5) as the right child, make the sum of 4 and 5 the parent node, and insert this sum into the queue in order. The tree is thus always built from small to large by frequency, with the smaller weight on the left.
(3) Continue building in the same way until every character appears in a node of the tree; the Huffman tree is then complete.
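The merge steps above can be sketched language-agnostically. Here is a minimal Python version using a min-heap, separate from the MATLAB implementation later in this article; note that when two frequencies tie, the resulting tree (and hence the exact codes) depends on the tie-breaking order. For the example string all frequencies differ, so the result is deterministic.

```python
import heapq
from collections import Counter

def build_huffman_tree(text):
    # Heap entries are (frequency, tie_breaker, node); a node is either a
    # single character or a (left, right) pair of child nodes.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # smallest frequency -> left child
        f2, _, right = heapq.heappop(heap)   # next smallest -> right child
        heapq.heappush(heap, (f1 + f2, tie, (left, right)))
        tie += 1
    return heap[0][2]   # root of the Huffman tree

tree = build_huffman_tree("BCAADDDCCACACAC")
print(tree)   # ('C', (('B', 'D'), 'A'))
```

Reading the tree with 0 for a left edge and 1 for a right edge reproduces the code table derived later: C = 0, A = 11, B = 100, D = 101.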
The weighted path length of a node is the product of its weight and the length of the path from the root to that node.
The weighted path length of this binary tree is WPL = 6 * 1 + 5 * 2 + 3 * 3 + 1 * 3 = 28.
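As a quick check, the WPL can be recomputed from the example's frequencies and depths (C: weight 6 at depth 1; A: 5 at depth 2; D: 3 at depth 3; B: 1 at depth 3):

```python
# (weight, depth) per character in the example tree
weights_depths = {"C": (6, 1), "A": (5, 2), "D": (3, 3), "B": (1, 3)}
wpl = sum(w * d for w, d in weights_depths.values())
print(wpl)   # 28
```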
5. Encode the characters:
The Huffman tree is now built, so each letter can be encoded. The encoding rule is: for every non-leaf node, label the edge to its left child 0 and the edge to its right child 1. A character's code is then the sequence of 0s and 1s along the path from the root to that character's leaf.
Therefore, the codes for each letter are:
| A | B | C | D |
|---|---|---|---|
| 11 | 100 | 0 | 101 |
Before Huffman encoding, the binary representation of the string "BCAADDDCCACACAC" is:
010000100100001101000001010000010100010001000100010001000100001101000011010000010100001101000001010000110100000101000011
That is, it takes up 120 bits;
After encoding:
1000111110110110100110110110
Occupies 28 bits.
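Using the code table above, the compression of the example string can be reproduced with a few lines of Python (a sketch independent of the MATLAB code below):

```python
code_table = {"A": "11", "B": "100", "C": "0", "D": "101"}
text = "BCAADDDCCACACAC"

# Concatenate the per-character codes to form the compressed bit string
encoded = "".join(code_table[ch] for ch in text)
print(encoded)        # 1000111110110110100110110110
print(len(text) * 8)  # 120 bits as 8-bit ASCII
print(len(encoded))   # 28 bits after Huffman coding
```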
3. Huffman encoding of text read from a file
1. Text input
```matlab
clc;
clear all;

% Read the text file
filename = 'input.txt';
text = fileread(filename);
inData = text;   % input text
disp(inData)
```
2. Create the Huffman tree
```matlab
% Count the occurrences of each distinct character
uniqueCharacter = unique(text);
n = length(uniqueCharacter);
uniqueCharacter_num = zeros(1, n);
for i = 1:n
    uniqueCharacter_num(i) = sum(text == uniqueCharacter(i));
end
uniqueCharacter_p = uniqueCharacter_num / length(text);   % occurrence probabilities

%% Create the Huffman tree
% Sort the characters from low to high probability of occurrence
p = uniqueCharacter_p;
[a, b] = sort(p);               % sort the probability vector
m(1, :) = b;
for i = 1:length(a)
    c(i) = uniqueCharacter(b(i));   % reorder the character queue
end
q = sort(p);                    % probabilities in sorted order
for i = 2:n
    matrix(i-1, 2) = c(1);          % record the left child in matrix
    matrix(i-1, 3) = c(2);          % record the right child in matrix
    matrix(i-1, 1) = double(i-1);   % record the internal node id in matrix
    c(2) = double(i-1);             % the merged node replaces the second entry
    q(2) = q(1) + q(2);             % probability of the merged node
    q(1) = 1;                       % value 1 always sorts last and is never processed again
    % Re-sort the updated probability combination
    [a, b] = sort(q);
    [a1, b1] = sort(b);
    m(i, :) = b1;                   % record the positions after the double sort
    temp_c = c;                     % temporaries for the reordering
    temp_q = q;
    for j = 1:length(a1)
        c(j) = temp_c(b(j));        % update the order of the character queue
        q(j) = temp_q(b(j));
    end
end
```
3. Implement the encoding and file output
```matlab
% File names for the encoding results
file_name  = 'huffman.txt';
file_name1 = 'code.dat';

disp('Huffman coding table:')
code = uniqueCharacter';   % transpose into a column; each row will hold one character's code
fileID  = fopen(file_name, 'w');
fileID1 = fopen(file_name1, 'w');
disp(n)
for i = 1:n
    [temp_code, code_len] = Coding(matrix, uniqueCharacter(i));
    code(i, 3:code_len + 2) = temp_code;
    len(i) = code_len;     % remember each code's length
    % Write the coding table to huffman.txt
    fprintf(fileID, 'Character is: %s ', uniqueCharacter(i));
    fprintf(fileID, 'Encoded as: %s ', code(i, 3:code_len + 2));
    fprintf(fileID, 'Number of occurrences: %d\n', uniqueCharacter_num(i));
end
% Write the encoded text to code.dat
for i = 1:length(text)
    for j = 1:n
        if text(i) == uniqueCharacter(j)
            fprintf(fileID1, '%s', code(j, 3:len(j) + 2));
        end
    end
end
disp(uniqueCharacter_num)   % output the character counts
disp(code)                  % output the character codes
fclose(fileID);
fclose(fileID1);

% Encoding function: walk up from the character's leaf to the root,
% recording 0 for a left edge and 1 for a right edge, then reverse.
function [code, n] = Coding(matrix, character)
    a = size(matrix, 1);
    for i = 1:a
        [row, col] = find(matrix(:, 2:3) == character);
        character = matrix(row, 1);   % move up to the parent node
        if col == 1
            code(i) = '0';            % left child
        else
            code(i) = '1';            % right child
        end
        if row == a
            break
        end
    end
    % The path was collected leaf-to-root, so reverse it
    code = fliplr(code);
    n = length(code);
end
```
4. Coding efficiency
```matlab
% Compression ratio: original bits / encoded bits
disp('Encoding compression ratio:')
e = (sum(uniqueCharacter_num) * 8) / sum(len .* uniqueCharacter_num)
```
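For the worked example, this ratio is 15 characters x 8 bits = 120 bits before coding versus 28 bits after, about 4.29. A quick Python check of the same formula:

```python
counts   = {"A": 5, "B": 1, "C": 6, "D": 3}   # character occurrence counts
code_len = {"A": 2, "B": 3, "C": 1, "D": 3}   # Huffman code lengths

original_bits = sum(counts.values()) * 8                      # 8-bit ASCII
encoded_bits  = sum(counts[c] * code_len[c] for c in counts)  # Huffman bits
e = original_bits / encoded_bits
print(original_bits, encoded_bits)   # 120 28
```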
5. Implement the decoding
```matlab
% Decode the Huffman code
% Read the code.dat file
filename = 'code.dat';
fid = fopen(filename, 'r');
if fid == -1
    error('Cannot open file');
end
% Read the file content line by line
codeData = '';
tline = fgetl(fid);
while ischar(tline)
    codeData = [codeData tline];
    tline = fgetl(fid);
end
fclose(fid);
disp(codeData);   % print the bit string that was read

decodedText = Decoding(matrix, codeData);
disp("The decoding result is: ")
disp(decodedText)

% Huffman decoding function: walk down from the root, taking the left
% child on '0' and the right child on '1'; emit a character whenever a
% leaf is reached, then restart at the root. This assumes the ASCII
% codes of the characters are larger than the number of internal nodes,
% so leaf values and internal node ids can be told apart.
function decoded_string = Decoding(matrix, code)
    a = size(matrix, 1);   % the root is the last row of matrix
    decoded_string = '';
    node = a;              % start at the root
    for i = 1:length(code)
        if code(i) == '0'
            child = matrix(node, 2);   % left child
        else
            child = matrix(node, 3);   % right child
        end
        if child <= a
            node = child;              % internal node id: keep walking down
        else
            decoded_string = [decoded_string char(child)];   % leaf: emit character
            node = a;                  % restart at the root
        end
    end
end
```
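The decoding idea, reading bits from the root, going left on 0 and right on 1, and emitting a character at each leaf before restarting, can be sketched in a few lines of Python. The tree literal below matches the worked example; it is an illustration, not the MATLAB data structure used above.

```python
# Tree for the example: leaves are characters, internal nodes are (left, right) pairs
tree = ("C", (("B", "D"), "A"))

def decode(bits, tree):
    out, node = [], tree
    for b in bits:
        node = node[0] if b == "0" else node[1]   # 0 -> left, 1 -> right
        if isinstance(node, str):                 # reached a leaf
            out.append(node)
            node = tree                           # restart at the root
    return "".join(out)

print(decode("1000111110110110100110110110", tree))   # BCAADDDCCACACAC
```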
6. Test
Test with small text first
input.txt:
Output results:
Files stored as:
Huffman.txt:
Large file test:
input.txt:
Huffman.txt:
code.dat:
The explanation of the Huffman principle draws in part on this blog post:
Detailed explanation of the principle of Huffman Coding-CSDN Blog