[PDFBox] Working with PDF text in PDFBox: reading the text of all pages, reading the text of specified pages, and generating a PDF document from a template file

This article shows how to use PDFBox to work with the text of a PDF document: reading the text of all pages, reading the text of specified pages, and generating a PDF document from a template file.

Directory

1. PDFBox operation text

1.1. Read all page text content

1.2. Read the text content of the specified page

1.3. Write text content

1.4. Replace text content

(1) Custom PDFTextStripper class

(2) Create a KeyWordEntity entity class

(3) Download the font file

(4) Create PDFUtil tool class

(5) Running effect

(6) Deficiencies


1. PDFBox operation text

PDFBox works with text content through its text extractor class, PDFTextStripper. This class provides the methods for handling text, such as getText(), which extracts text, and writeString(), which handles each extracted string. The following sections cover several common cases.

1.1. Read all page text content

A PDF document consists of multiple pages, any of which may contain text. The [getText()] method of the PDFTextStripper class returns the text content of the entire PDF document. The example code is as follows:

package pdfbox.demo.text;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

/**
 * @version 1.0.0
 * @Date: 2023/7/18 9:03
 * @Author ZhuYouBin
 * @Description: Read all plain text content in PDF documents
 */
public class ReadAllText {
    public static void main(String[] args) throws IOException {
        // 1. Load the specified PDF document
        PDDocument document = PDDocument.load(new File("D:\\demo.pdf"));
        // 2. Create a text extraction object
        PDFTextStripper stripper = new PDFTextStripper();
        // 3. Get the text content of all pages
        String text = stripper.getText(document);
        System.out.println("Get text content: " + text);
        // 4. Close
        document.close();
    }
}

1.2. Read the text content of the specified page

In some cases we only need the text of certain pages. PDFTextStripper lets us set the page range to extract by calling [setStartPage()] and [setEndPage()]. Both page numbers are 1-based and inclusive, so the first page is page 1. The example code is as follows:

package pdfbox.demo.text;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

/**
 * @version 1.0.0
 * @Date: 2023/7/18 9:03
 * @Author ZhuYouBin
 * @Description: Read the plain text content of the specified page in a PDF document
 */
public class ReadPageText {
    public static void main(String[] args) throws IOException {
        // 1. Load the specified PDF document
        PDDocument document = PDDocument.load(new File("D:\\demo.pdf"));
        // 2. Create a text extraction object
        PDFTextStripper stripper = new PDFTextStripper();
        // Specify the pages to read; page numbers are 1-based
        stripper.setStartPage(1); // Start page: 1 means the first page
        stripper.setEndPage(1); // End page (inclusive): 1 means only the first page is read
        // 3. Get the text content of the specified page
        String text = stripper.getText(document);
        System.out.println("Get text content: " + text);
        // 4. Close
        document.close();
    }
}
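
Because the stripper counts pages from 1 while PDDocument.getPage(int) takes a 0-based index, it is easy to mix the two up. The following is a minimal sketch of a conversion helper, written without PDFBox so it can be run on its own. The class and method names (PageRangeSketch, toStripperPage, clamp) are hypothetical and not part of PDFBox.

```java
/** Hypothetical helper, not part of PDFBox: prepares page numbers for PDFTextStripper. */
final class PageRangeSketch {

    /** Converts a 0-based page index (as used by PDDocument.getPage) to a 1-based page number. */
    static int toStripperPage(int zeroBasedIndex) {
        if (zeroBasedIndex < 0) {
            throw new IllegalArgumentException("page index must not be negative");
        }
        return zeroBasedIndex + 1;
    }

    /** Clamps a 1-based inclusive range to [1, pageCount] and returns {start, end}. */
    static int[] clamp(int startPage, int endPage, int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("document has no pages");
        }
        int start = Math.max(1, Math.min(startPage, pageCount));
        // The end page is never allowed to fall before the start page
        int end = Math.max(start, Math.min(endPage, pageCount));
        return new int[] {start, end};
    }
}
```

The two values returned by clamp() could then be passed to setStartPage() and setEndPage(), with the page count taken from document.getNumberOfPages().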

1.3. Write text content

An earlier article already covered how to write plain text into a PDF document with PDFBox, either as a single line or as multiple lines. See:

[[PDFBox] PDFBox operates PDF documents to create PDF documents, load PDF documents, add blank pages, delete pages, get total number of pages, add text content, PDFBox coordinate system].

1.4. Replace text content

PDFBox does not provide a method for replacing text content, so here I use a workaround to achieve it. The general idea:

  • First read the text content and obtain the page coordinates of the text to be replaced.
  • Then draw a filled rectangle with a white background over that area, so that the original text is covered.
  • Finally, write the replacement text inside the white rectangle.
  • With this approach, replacing a specified piece of text works reasonably well.

(1) Custom PDFTextStripper class

To get the coordinate information of the text, define a custom class that extends PDFTextStripper and override its [writeString()] method, which takes two parameters:

  • The first parameter, text: the text content currently being read.
  • The second parameter, a List<TextPosition>: the coordinate information of each character in the current text content.

package pdfbox.demo.text.keyword;

import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/**
 * @version 1.0.0
 * @Date: 2023/7/18 10:18
 * @Author ZhuYouBin
 * @Description: Customize the text extractor, get the coordinate position of the searched text
 */
public class KeyWordPositionStripper extends PDFTextStripper {

    /**
     * Set of keywords to find
     */
    private final List<String> keyWordList;
    /**
     * Find the set of successful keyword entity objects
     */
    private final List<KeyWordEntity> keyWordEntityList = new ArrayList<>();

    public KeyWordPositionStripper(List<String> keyWordList) throws IOException {
        this.keyWordList = keyWordList;
    }

    @Override
    protected void writeString(String text, List<TextPosition> positions) {
        int size = positions.size();
        for (String keyWord : keyWordList) {
            char[] chars = keyWord.toCharArray();
            for (int i = 0; i < size; i++) {
                // Get the currently read character
                String currentChar = positions.get(i).getUnicode();
                // Match the current character with the first character of the keyword
                if (!Objects.equals(currentChar, String.valueOf(chars[0]))) {
                    continue;
                }
                int count = 1;
                int j;
                for (j = 1; j < chars.length && i + j < size; j++) {
                    currentChar = positions.get(i + j).getUnicode();
                    if (!Objects.equals(currentChar, String.valueOf(chars[j]))) {
                        break;
                    }
                    count++;
                }
                // If the match is successful, record the coordinate position of the text
                if (count == chars.length) {
                    TextPosition startPosition = positions.get(i);
                    TextPosition endPosition = positions.get(i + j < size ? i + j : i + j - 1);
                    // create entity object
                    KeyWordEntity entity = new KeyWordEntity();
                    entity.setKeyWord(keyWord);
                    // Get the starting character coordinates
                    Matrix matrix = startPosition.getTextMatrix();
                    float x = matrix.getTranslateX();
                    float y = matrix.getTranslateY();
                    // Get end character coordinates
                    Matrix endMatrix = endPosition.getTextMatrix();
                    float x2 = endMatrix.getTranslateX();
                    // Get font size
                    float fontSizeInPt = startPosition.getFontSizeInPt();
                    entity.setX(x);
                    entity.setY(y - fontSizeInPt / 5);
                    float width = i + j < size ? x2 - x : x2 - x + fontSizeInPt;
                    entity.setWidth(width);
                    entity.setHeight(fontSizeInPt);
                    keyWordEntityList.add(entity);
                }
            }
        }
    }

    public List<KeyWordEntity> getKeyWordEntityList() {
        return keyWordEntityList;
    }
}
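
The nested loops in writeString() are easier to check when separated from PDFBox. The following is a minimal sketch of the same glyph-by-glyph scan, taking a plain list of strings in place of the values returned by TextPosition.getUnicode(). The names GlyphMatcherSketch and findAll are hypothetical. Like the listing above, it assumes each glyph corresponds to exactly one character of the keyword, so ligatures and keywords split across separate writeString() calls are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone version of the keyword scan; not part of PDFBox. */
final class GlyphMatcherSketch {

    /** Returns the start index of every occurrence of keyword in the glyph list. */
    static List<Integer> findAll(List<String> glyphs, String keyword) {
        List<Integer> starts = new ArrayList<>();
        if (keyword.isEmpty()) {
            return starts; // nothing to search for
        }
        int n = glyphs.size();
        int m = keyword.length();
        // Only positions with at least m glyphs remaining can hold a full match
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m && String.valueOf(keyword.charAt(j)).equals(glyphs.get(i + j))) {
                j++;
            }
            if (j == m) {
                starts.add(i);
            }
        }
        return starts;
    }
}
```

Each returned index plays the role of i in the listing above, and index + keyword.length() plays the role of i + j.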

(2) Create a KeyWordEntity entity class

Create a KeyWordEntity entity class to represent a keyword that was found. A keyword is the text we want to replace. In real projects it usually corresponds to a placeholder in the template. The entity class holds the keyword itself and the coordinate information of its text (x, y, width, and height).

package pdfbox.demo.text.keyword;

import java.io.Serializable;

/**
 * @version 1.0.0
 * @Date: 2023/7/18 11:22
 * @Author ZhuYouBin
 * @Description: The keyword to find
 */
public class KeyWordEntity implements Serializable {
    private String keyWord;

    private float x;
    private float y;
    private float width;
    private float height;

    public String getKeyWord() {
        return keyWord;
    }

    public void setKeyWord(String keyWord) {
        this.keyWord = keyWord;
    }

    public float getX() {
        return x;
    }

    public void setX(float x) {
        this.x = x;
    }

    public float getY() {
        return y;
    }

    public void setY(float y) {
        this.y = y;
    }

    public float getWidth() {
        return width;
    }

    public void setWidth(float width) {
        this.width = width;
    }

    public float getHeight() {
        return height;
    }

    public void setHeight(float height) {
        this.height = height;
    }
}
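
The values stored in this entity come from the heuristics in KeyWordPositionStripper: the rectangle starts one fifth of the font size below the baseline, is one font size tall, and spans from the first glyph of the keyword to the glyph that follows it, or one extra font size when the keyword ends the run. The following minimal sketch reproduces that arithmetic in isolation. The names CoverBoxSketch and coverBox are hypothetical, and the one-fifth offset is the author's approximation rather than a font metric.

```java
/** Hypothetical stand-alone version of the rectangle arithmetic; not part of PDFBox. */
final class CoverBoxSketch {

    /**
     * Returns {x, y, width, height} of the rectangle that covers a keyword.
     *
     * @param startX       x coordinate of the keyword's first glyph
     * @param baselineY    y coordinate of the text baseline
     * @param endX         x of the glyph after the keyword, or of its last glyph if none follows
     * @param fontSize     font size in points
     * @param glyphFollows whether a glyph follows the keyword in the same run
     */
    static float[] coverBox(float startX, float baselineY, float endX,
                            float fontSize, boolean glyphFollows) {
        // Shift down slightly so descenders are covered as well
        float y = baselineY - fontSize / 5f;
        // Without a following glyph, endX is the last glyph's origin, so pad by one font size
        float width = glyphFollows ? endX - startX : endX - startX + fontSize;
        return new float[] {startX, y, width, fontSize};
    }
}
```

For a 10 pt keyword that starts at x = 100 on a baseline at y = 700 and is followed by a glyph at x = 160, this gives a rectangle at (100, 698) that is 60 wide and 10 tall.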

(3) Download the font file

If you don't want to use the fonts bundled with PDFBox, you can use an external font file, which can be downloaded from the [Classic Song Type Simplified|Classic|Font Download] website.

(4) Create PDFUtil tool class

package pdfbox.demo.text.keyword;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

import java.awt.Color;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * @version 1.0.0
 * @Date: 2023/7/18 16:01
 * @Author ZhuYouBin
 * @Description: Tool class based on PDFBox
 */
public class PDFUtil {

    /**
     * Read the PDF template file and replace the data of the specified keyword
     * @param keyWordMap The keyword data to be replaced, key represents the placeholder, value represents the replaced content
     * @param pdfPath The path of the PDF template file
     * @param destPdf generated target PDF file
     */
    public static void replaceText(Map<String, String> keyWordMap, String pdfPath, String destPdf) throws IOException {
        if (keyWordMap == null || keyWordMap.isEmpty()) {
            return;
        }
        Set<String> keyWordSet = keyWordMap.keySet();
        // 1. Read the PDF template file
        PDDocument document = PDDocument.load(new File(pdfPath));
        // 2. Create a custom text extractor
        KeyWordPositionStripper stripper = new KeyWordPositionStripper(new ArrayList<>(keyWordSet));
        stripper.setSortByPosition(true);
        // Note: writeString() only runs while getText() executes, so getText() must be called before reading the results
        stripper.getText(document);
        // 3. Get the keyword entity objects
        List<KeyWordEntity> keyWordEntityList = stripper.getKeyWordEntityList();
        // 4. Open a content stream that appends to the first page
        PDPageContentStream stream = new PDPageContentStream(document, document.getPage(0), PDPageContentStream.AppendMode.APPEND, true);
        // 5. Load an external font file. It is loaded from a File here; in a Spring Boot project it can be loaded from an input stream instead
        PDType0Font font = PDType0Font.load(document, new File("D:\\simsun.ttf"));
        // 6. Loop to replace the text content
        for (KeyWordEntity keyWord : keyWordEntityList) {
            stream.setNonStrokingColor(Color.WHITE);
            stream.addRect(keyWord.getX(), keyWord.getY(), keyWord.getWidth(), keyWord.getHeight());
            stream.fill();
            // Set the fill color used for the text
            stream.setNonStrokingColor(Color.BLACK);
            // Replace keyword text content
            stream.beginText();
            stream.setFont(font, 14);
            stream.newLineAtOffset(keyWord.getX(), keyWord.getY());
            stream.showText(keyWordMap.get(keyWord.getKeyWord()));
            stream.endText();
        }
        // Close the content stream
        stream.close();
        // Save the replaced document
        document.save(destPdf);
        // Close the document
        document.close();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> keyWordMap = new HashMap<>();
        keyWordMap.put("{{name}}", "Zhang San");
        keyWordMap.put("{{age}}", "25");
        keyWordMap.put("{{sex}}", "Male");
        keyWordMap.put("{{address}}", "Xiamen City, Fujian Province");
        // mock test
        PDFUtil.replaceText(keyWordMap, "D:\\pdfbox-template.pdf", "D:\\new-document.pdf");
    }
}
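
The replaceText() method above always draws on document.getPage(0). One possible way to lift that restriction is to record a page index with each match (for example from getCurrentPageNo() inside writeString()), group the matches by page, and open one content stream per page. The sketch below shows only the grouping step and runs without PDFBox. The Match class is a hypothetical stand-in for KeyWordEntity with an added pageIndex field, which the original entity does not have.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical grouping step for multi-page replacement; not part of PDFBox. */
final class PageGroupingSketch {

    /** Stand-in for KeyWordEntity, extended with an assumed 0-based page index. */
    static final class Match {
        final int pageIndex;
        final String keyWord;

        Match(int pageIndex, String keyWord) {
            this.pageIndex = pageIndex;
            this.keyWord = keyWord;
        }
    }

    /** Groups matches by page index; the TreeMap keeps pages in ascending order. */
    static Map<Integer, List<Match>> groupByPage(List<Match> matches) {
        Map<Integer, List<Match>> byPage = new TreeMap<>();
        for (Match match : matches) {
            byPage.computeIfAbsent(match.pageIndex, k -> new ArrayList<>()).add(match);
        }
        return byPage;
    }
}
```

Each map entry would then drive one pass of the white-rectangle-and-text loop, with the content stream opened on document.getPage(entry.getKey()).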

(5) Running effect

The PDF template file here is shown below:

After using PDFBox to replace the content of the template file, the running results are as follows:

(6) Deficiencies

Although the replacement text content can be realized here, there are still some deficiencies in this code, as follows:

  • 1. The replacement text is not guaranteed to line up with the original text. You need to adjust the coordinates to suit the actual template.
  • 2. When the replacement text is longer than the placeholder, it overwrites the text that follows.
  • 3. Currently only the first page is modified, because the content stream is opened on document.getPage(0).
  • 4. Other deficiencies...
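
For the second deficiency, one possible mitigation is to shrink the font until the replacement fits the covered width. In PDFBox, PDFont.getStringWidth(String) returns the width in 1/1000 text-space units, so the rendered width is that value divided by 1000 and multiplied by the font size. The following minimal sketch takes the unit width as a plain number so that it runs without PDFBox. The names FontFitSketch and fitFontSize are hypothetical.

```java
/** Hypothetical font-size fitting helper; not part of PDFBox. */
final class FontFitSketch {

    /**
     * Returns the largest font size, capped at preferredSize, at which the text fits boxWidth.
     *
     * @param widthUnits    text width in 1/1000 text-space units (as from PDFont.getStringWidth)
     * @param boxWidth      available width in points
     * @param preferredSize font size to use when the text already fits
     */
    static float fitFontSize(float widthUnits, float boxWidth, float preferredSize) {
        if (widthUnits <= 0f) {
            return preferredSize; // empty text always fits
        }
        // Rendered width = widthUnits / 1000 * size, solved for size at boxWidth
        float maxSize = boxWidth * 1000f / widthUnits;
        return Math.min(preferredSize, maxSize);
    }
}
```

With a replacement that measures 5000 units and a 60 pt wide rectangle, the preferred size of 14 is reduced to 12. The result could be passed to stream.setFont(font, size) in place of the fixed value 14.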

That concludes this introduction to working with text in PDFBox: reading the text of all pages, reading the text of specified pages, and generating a PDF document from a template file.