Huffman tree and compression using Huffman coding

Huffman tree

Huffman Tree is a tree structure used to encode characters. Its characteristic is that the higher the frequency of characters, the closer they are to the root node, that is, the shorter the encoding length. This feature makes Huffman trees widely used in the field of data compression.

The process of constructing a Huffman tree is to first sort the characters according to their frequency of occurrence from small to large, then select the two characters with the smallest frequency of occurrence each time as leaf nodes, add their frequency of occurrence to get a new node, and then add This node is added to the sorted list. Repeat this process until there is only one node left in the list, which is the root node of the Huffman tree.

In the Huffman tree, each character has a unique code. Going left means the code is 0, and going right means the code is 1. Due to the characteristics of the Huffman tree, the encoding of each character is unique, and the encoding length is also the shortest.

Assuming there is such a list, the construction of a Huffman tree is to take out the smallest two elements from the list each time and store them as a new node as the left and right nodes.

Here you can take out 1 and 2 as the left and right nodes, and add their values and store them in the new node, so that there is a new node with a value of 3.

Take out the two smallest nodes again, this time 3 (child nodes 1 and 2) and 3 to form a new tree.

After repeating the operation, the new node obtained is stored in the original array again, that is, 6 is added. Repeating this operation, you can finally get a Huffman tree.

The following is the code of the Huffman tree node. Each node mainly has several attributes. The first one is the weight value. The tree is constructed through value, the left and right sub-nodes, the characters stored in the node and the subsequent Huffman coding. The code required.

class HNode implements Comparable<HNode>{

    int value;

    HNode left;

    HNode right;

    String chr;

    String code;

    public HNode(int value, HNode left, HNode right, String chr, String code) {
        this.value = value;
        this.left = left;
        this.right = right;
        this.chr = chr;
        this.code = code;
    }

    public HNode(int data){
        this.value = value;
    }

    public HNode(){}

//This part is the implementations Comparable interface, which allows nodes to compare values and sort them, so that the two nodes with the smallest value can be taken out each time
    @Override
    public int compareTo(HNode o) {
        //Judge based on weight value
        int qz=this.value;//Current weight

        int qzz=o.value;//
        int res=qz - qzz;

        return res;
    }
}

The following code is the construction of Huffman tree. The specific implementation principle is as described above.

public class HuffmanTree {
    Map<String,String> hmap = new HashMap<>();
    public static HNode createHuffmanTree(Map<Character,Integer> map){
        List<HNode> nodes = new ArrayList<HNode>();
        for (Character s : map.keySet()) {//Traverse the map collection
            HNode node=new HNode();//Create an empty node
            node.chr= String.valueOf(s);
            node.value = map.get(s);

            nodes.add(node);
        }

        while(nodes.size()!=1){//Sort the list according to the weight
            Collections.sort(nodes); //Sort the node weights from small to large
            HNode left = nodes.remove(0);
            left.code="0";

            HNode right=nodes.remove(0);
            right.code="1";
            //The above operations constitute a small tree branch
            HNode node=new HNode();//Create a node

            //Calculate the node weight of the previous layer
            int rootvalue=Integer.valueOf(left.value) + Integer.valueOf(right.value);//Get the node weight of the previous layer
            node.value = rootvalue;
            node.left = left;
            node.right = right;

            nodes.add(node);

        }
        return nodes.get(0);
    }
}

Huffman coding

Huffman coding is a variable length coding (VLC) technology that can encode information symbols of different lengths so that symbols with high frequency use shorter codes, while symbols that appear more frequently use shorter codes. Symbols with low frequency use longer codes, resulting in the shortest average length of the coded binary sequence. Huffman coding is widely used in data compression, communication transmission, encryption and decryption and other fields.

After we build a Huffman tree, we can use traversal to add Huffman coding to each node. The specific implementation method is: when creating the node, we set the code attribute of the left child node to “0 “, the child to the right is set to “1”. Then perform a traversal to splice the codes of the nodes, and you can get the codes of the leaf nodes.

public static HNode buildHuffmanCode(HNode node){

        if(node.code==null){
            node.code="";
        }

        if(node.left!=null){
            node.left.code = node.code + node.left.code;//Reconstruct the code of the left child
            buildHuffmanCode(node.left);
        }

        if(node.right!=null){
            node.right.code=node.code + node.right.code;//Reconstruct the code of the left child
            buildHuffmanCode(node.right);
        }

        return node;
    }
public Map<String,String> printHuffmanCode(HNode node){
        if (node == null) return null;

        //Go left to the end
        if(node.left==null & amp; & amp;node.right==null){
            hmap.put(node.chr, node.code);
            System.out.println("node" + node.chr + ",coding" + node.code);
        }else{
            printHuffmanCode(node.left);
            printHuffmanCode(node.right);
        }
        return hmap;
    }

Use Huffman coding for file compression

Huffman coding is a compression algorithm based on the frequency of character occurrences. Its basic idea is to use shorter codes to represent characters that appear more frequently, and to use longer codes to represent characters that appear less frequently, thereby reducing the total code length and achieving the purpose of compressing file size.

The specific compression process is as follows:

  1. Count the frequency of occurrence of each character in the file;
  2. Generate a Huffman tree based on the frequency of character occurrences;
  3. Encode each character according to a Huffman tree;
  4. Write the encoded data to a file.

Compressed files can be restored to original files through corresponding decompression algorithms. The Huffman coding algorithm has the advantages of high compression rate and fast compression and decompression speed, and is widely used in multimedia fields such as images, audio, and video.

 public static void main(String[] args) throws Exception {
// int[] arr = {6,8,4,1,3,5,9};
// HuffmanTree htree = new HuffmanTree();
// HNode root = createHuffmanTree(arr);
// htree.preorder(root);
// htree.preorder(root);
        String st = readTxtFile("resource/hfmtree.txt");

        System.out.println(st);
        Map<Character,Integer> map = countNumbers(st);

        HuffmanTree htree = new HuffmanTree();
        HNode root = createHuffmanTree(map);
        htree.preorder(root);
        System.out.println("");
        htree.inorder2(root);
        System.out.println("");
        buildHuffmanCode(root);
        
        Map<String,String> hmap = htree.printHuffmanCode(root);
        System.out.println(hmap);
        Set<Map.Entry<String,String>> entries = hmap.entrySet();
        Iterator<Map.Entry<String, String>> iterator = entries.iterator();
        Map.Entry<String,String> ite;
        while (iterator.hasNext()){
            ite=iterator.next();
            System.out.println(ite.getKey() + ":" + ite.getValue());
        }

        String changedSt = new String();
        for(int i =0;i<st.length();i + + ){
            Character c = st.charAt(i);
            changedSt = changedSt + hmap.get(String.valueOf(c));
        }
        System.out.println(changedSt);

        writeFile(changedSt,hmap,"resource/hfmtree.txt");
        readhfmFile("resource/hfmtree.txt");

    }
}

File compression can be implemented using Huffman coding. The file is first read through the IO stream, and then the Huffman tree corresponding to the file is constructed to obtain the code table. Next, each corresponding character in the file is converted into Huffman code in the code table, and then they are stored in the file using the output stream.

public static String readTxtFile(String filePath){
        try{
            String encoding = "GBK";
            File file = new File(filePath);
            if(file.isFile() & amp; & amp; file.exists()){
                InputStreamReader reader = new InputStreamReader(new FileInputStream(file),encoding);
                BufferedReader bufferedReader = new BufferedReader(reader);
                char[] tt = new char[1024];
                int b;
                while((b=bufferedReader.read(tt))!=-1){
                    String s = new String(tt,0,b);
                    System.out.println(s);
                    return s;
                }
                reader.close();
            }else {
                System.out.println("Cannot find the correct file");
                return null;
            }
        }catch (Exception e){
            System.out.println(e);
        }
        return null;
    }
public static void writeFile(String st,Map<String,String> hmap,String filename) throws IOException {
        FileOutputStream fos = new FileOutputStream(filename);
        ObjectOutputStream oos = new ObjectOutputStream(fos);
        String paddedString = StringPadding(st);
        
        oos.writeObject(hmap);
        int stlength = paddedString.length();
        int x = stlength/8;
        byte[] bits = new byte[x];
        for(int i = 0;i<x;i + + ){
            String sub = paddedString.substring(i*8,(i + 1)*8);
            System.out.println(sub);
            int bit =Integer.parseInt(sub,2);
            System.out.println("|" + bit);

            fos.write(bit);
        }
        fos.write(buling);

        fos.close();
    }

When writing a file, because every 8 binary bits form a byte, the original length may not be a multiple of 8, and the last bit needs to be padded with 0, so we need to add an extra byte to record the number of padded 0s, and add the hash The code table of Furman encoding is stored in the file for easy use during decoding.

Decoding using Huffman coding

During decoding, we read the code table and the byte stream of the compressed file from the file, first process the 0-padded part to restore it to the original byte stream, and then restore the data through code table comparison.

public static void readhfmFile(String filename) throws Exception {
        FileInputStream fis = new FileInputStream(filename);
        ObjectInputStream ois = new ObjectInputStream(fis);
        Map<String,String> hfmCodes = (Map<String, String>) ois.readObject();
        System.out.println(hfmCodes);
        ArrayList<Integer> hfmbyteList = new ArrayList<Integer>();
        while(true){
            int hfmByte = fis.read();
            if(hfmByte==-1){
                break;
            }else {
                hfmbyteList.add(hfmByte);
            }
        }
        fis.close();
        System.out.println(hfmbyteList);
        int last = hfmbyteList.get(hfmbyteList.size()-1);

        String st = "";
        for(int i=0;i<hfmbyteList.size()-1;i + + ){
            String temp = Integer.toBinaryString(hfmbyteList.get(i));
            int n = temp.length();
            for(int j =0;j<8-n;j + + ){
                temp="0" + temp;
            }
            st = st + temp;
            System.out.println(temp);
        }
        String decodeSt = st.substring(0,st.length()-last);
        System.out.println(decodeSt);
        String afterdecode = "";
        for(int i =0,index =i;i<decodeSt.length();){
            index + + ;
            for(Map.Entry<String,String> entry:hfmCodes.entrySet()){
                String key = entry.getKey();
                String value = hfmCodes.get(key);
                if(decodeSt.substring(i,index).equals(value)){
                    afterdecode = afterdecode + key;
                    i=index;
                    break;
                }
            }
        }

The order in which compressed files are read needs to be the same as the order in which the output streams are compressed.

In the code table comparison, you can intercept the string through substring, and then compare it with each one in the code table. If there is no match, the string intercepted by substring will be extended by one bit until the comparison is completed.

In summary, this article introduces the principles and code implementation of Huffman trees, Huffman coding, and data compression using Huffman coding.

Huffman tree is a binary tree. Its characteristic is that nodes with smaller weights are located closer to the root, and nodes with larger weights are located farther from the root. The construction of the Huffman tree requires the use of a greedy algorithm. Specifically, the two nodes with the smallest weights are merged into a new node. The weight of the new node is the sum of the weights of the two nodes. This process is repeated until all nodes are merged into one root node.

Huffman encoding is a prefix encoding method that maps each character to a binary encoding, making the entire encoding process unambiguous, that is, the encoding of any character is not the prefix of another character encoding. The implementation of this encoding method requires the use of a Huffman tree. Specifically, the character node starts from the leaf node of the Huffman tree and goes up to the root node. If the path traveled turns left, it will be encoded as 0, and if it turns right, The code is 1, and the final code is the Huffman code of this character.

The principle of using Huffman coding for data compression is to map each character in the data to be compressed to the corresponding Huffman code, and then splice these codes together to form a new binary sequence, and finally output this sequence. The length of this sequence will usually be shorter than the original data, because after using Huffman coding, characters that appear more frequently will be assigned shorter codes, while characters that appear less frequently will be assigned longer codes. The advantage of this is that it can effectively reduce the data storage space.

In terms of code implementation, the article gives code examples of Huffman trees, Huffman coding and compression. It should be noted that in actual use, some details may need to be considered, such as how to store Huffman trees, how to convert binary sequences into byte sequences, etc.