Manipulate PDF files with Apache PDFBox

Introduction

The Apache PDFBox library is an open source Java tool for manipulating PDF documents. This project allows creating new PDF documents, manipulating existing PDF documents, and extracting content from PDF documents. Apache PDFBox also includes several command line utilities.

The main functions of Apache PDFBox are as follows:

  • Extract Unicode text from PDF files.
  • Split a single PDF into multiple files or merge multiple PDF files.
  • Extract data from PDF forms or fill out PDF forms.
  • Verify that a PDF file is compliant with the PDF/A-1b standard.
  • Print PDF files using the standard Java printing API.
  • Save PDF as an image file, such as PNG or JPEG.
  • Create PDFs from scratch, including embedding fonts and images.
  • Digitally sign PDF files.

Import

First, we need to make sure the PDFBox library has been added to the Java project. If you are using Maven, add the following dependency to pom.xml:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.28</version>
</dependency>

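The version used here is 2.0.28, and the examples below target the 2.x API. PDFBox 3.x changed some entry points (for example, documents are loaded with Loader.loadPDF() instead of PDDocument.load()), so the code needs adjusting on that version.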

Talk is cheap. Show me the code.

Create a PDF document

We can create a simple PDF document with the following code:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class CreatePDF {
    public static void main(String[] args) {
        // create an empty document and add one blank page
        PDDocument document = new PDDocument();
        PDPage page = new PDPage();
        document.addPage(page);

        PDType1Font font = PDType1Font.HELVETICA_BOLD;

        try {
            // write the text onto the page
            PDPageContentStream contentStream = new PDPageContentStream(document, page);
            contentStream.beginText();
            contentStream.setFont(font, 12);
            contentStream.newLineAtOffset(100, 700);
            contentStream.showText("Hello, World!");
            contentStream.endText();
            contentStream.close();

            // save and close the document
            document.save(new File("one-more.pdf"));
            document.close();

            System.out.println("PDF created successfully.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This code snippet creates a new PDF document and writes “Hello, World!” on its first page. It uses the Helvetica Bold font at size 12 and starts the text at the offset (100, 700).
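The offset (100, 700) is easier to reason about once the coordinate system is clear: PDF user space is measured in points (72 per inch) from the bottom-left corner of the page. The following is a minimal sketch, not part of PDFBox, of how a position measured from the top edge could be converted. The class and method names are hypothetical, and the page height of 792 points assumes a US Letter page.

```java
// Sketch: converting "distance from the top edge" and millimetres into the
// bottom-left based point coordinates used by newLineAtOffset().
final class PdfCoordinates {

    // PDF user space defaults to points: 72 points per inch, 25.4 mm per inch.
    static final double POINTS_PER_MM = 72.0 / 25.4;

    // Assumption: a US Letter page, 612 x 792 points.
    static final double LETTER_HEIGHT = 792.0;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    // A distance measured down from the top edge becomes a y value
    // measured up from the bottom edge.
    static double yFromTop(double pageHeight, double distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // The baseline at y = 700 sits 92 points below the top of a Letter page.
        System.out.println(yFromTop(LETTER_HEIGHT, 92.0)); // prints 700.0

        // A 20 mm left margin expressed in points.
        System.out.println(mmToPoints(20.0));
    }
}
```

The resulting values would be cast to float and passed to newLineAtOffset(x, y).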

Next, showText() draws the text on the page, and contentStream.close() closes the PDPageContentStream object. The stream must be closed before the document is saved.

Finally, we save the document as “one-more.pdf” and close the PDDocument object.


Read PDF files

We can read the entire contents of a PDF file using the following code:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ReadPDFExample {
    public static void main(String[] args) {
        // create file object
        File file = new File("one-more.pdf");

        try {
            // create PDF document object
            PDDocument document = PDDocument.load(file);

            // create PDF text stripper
            PDFTextStripper stripper = new PDFTextStripper();

            // get the entire content of the PDF file
            String text = stripper.getText(document);

            // output the entire content of the PDF file
            System.out.println(text);

            // close the PDF document object
            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

First, we create a File object, then use the static load() method of the PDDocument class to load the PDF file and create a PDF document object.

Then, we create a PDFTextStripper object and use its getText() method to get the entire content of the PDF file.

Finally, we output the entire contents of the PDF file and close the PDF document object.

The output is what we wrote earlier:

Hello, World!

Insert image

We can use the following code to insert an image in a PDF file:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class InsertImageInPDF {
    public static void main(String[] args) {
        try {
            // load the PDF file
            PDDocument document = PDDocument.load(new File("one-more.pdf"));

            // get the first page
            PDPage page = document.getPage(0);

            // load the image file
            PDImageXObject image = PDImageXObject.createFromFile("one-more.jpg", document);

            // insert the image at the specified position, appending to the existing page content
            PDPageContentStream contentStream = new PDPageContentStream(document, page, AppendMode.APPEND, true, true);
            contentStream.drawImage(image, 200, 500, image.getWidth(), image.getHeight());

            // close the stream
            contentStream.close();

            // save the modified PDF file
            document.save("one-more-jpg.pdf");

            // close document
            document.close();
            System.out.println("PDF created successfully.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In this example, we loaded a PDF file called “one-more.pdf”, fetched the first page, and loaded an image file called “one-more.jpg”.

Then, we inserted the image at the specified position in the PDF document using the drawImage() method.
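Note that the example passes the image's pixel dimensions to drawImage() as point sizes, so a large image can run off the page. One way to avoid this is to scale the width and height to fit a bounding box first. The following is a minimal sketch with a hypothetical helper (ImageFit.fitWithin is not a PDFBox API); it keeps the aspect ratio and never upscales.

```java
// Sketch: compute a drawing size that fits inside a bounding box.
final class ImageFit {

    // Returns {width, height} scaled to fit inside maxWidth x maxHeight.
    // The aspect ratio is preserved and small images are left at their
    // original size (scale is capped at 1).
    static float[] fitWithin(float imageWidth, float imageHeight,
                             float maxWidth, float maxHeight) {
        float scale = Math.min(1f,
                Math.min(maxWidth / imageWidth, maxHeight / imageHeight));
        return new float[] { imageWidth * scale, imageHeight * scale };
    }

    public static void main(String[] args) {
        // A 1200 x 800 pixel image squeezed into a 400 x 400 point box.
        float[] size = fitWithin(1200f, 800f, 400f, 400f);
        System.out.println(size[0] + " x " + size[1]);
    }
}
```

In the example above, the two values would replace image.getWidth() and image.getHeight() in the drawImage() call.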

Finally, we save the modified document to a new file called “one-more-jpg.pdf” and close the document.

Load image

We can read the images embedded in a PDF file and save them to disk using the following code:

import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class ReadPDFImagesExample {

    public static void main(String[] args) {
        try {
            // load the PDF file
            PDDocument document = PDDocument.load(new File("one-more-jpg.pdf"));

            PDPageTree pageTree = document.getPages();

            // loop through each page
            for (PDPage page : pageTree) {
                int pageNum = pageTree.indexOf(page) + 1;
                int count = 1;
                System.out.println("Page " + pageNum + ":");
                // loop through the XObjects in the page resources
                for (COSName xObjectName : page.getResources().getXObjectNames()) {

                    PDXObject pdxObject = page.getResources().getXObject(xObjectName);
                    if (pdxObject instanceof PDImageXObject) {
                        PDImageXObject image = (PDImageXObject) pdxObject;
                        System.out.println("Found image with width "
                                 + image.getWidth()
                                 + "px and height "
                                 + image.getHeight()
                                 + "px.");
                        // save the image as one-more-<page>-<index>.jpg
                        String fileName = "one-more-" + pageNum + "-" + count + ".jpg";
                        ImageIO.write(image.getImage(), "jpg", new File(fileName));
                        count++;
                    }
                }
            }

            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In this example, we use the PDDocument class to load the document from the specified PDF file and iterate through each page to find the images within it.

For each page, we fetch its resources and iterate over the XObjects they contain.

For every XObject that is a PDImageXObject, we read its properties, such as width and height.

Then, we use ImageIO to save the image to the local file system.
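The example always writes "jpg", which suits the JPEG inserted earlier. JPEG has no alpha channel, though, so an extracted image with transparency may not be written correctly in that format, and ImageIO.write() returns false when it finds no suitable writer. The following is a minimal sketch of choosing the format from the BufferedImage itself; ImageFormatChooser is a hypothetical helper, not part of PDFBox or the JDK.

```java
import java.awt.image.BufferedImage;

// Sketch: pick an ImageIO format name based on whether the image has alpha.
final class ImageFormatChooser {

    // "png" keeps transparency; "jpg" is used for opaque images.
    static String formatFor(BufferedImage image) {
        return image.getColorModel().hasAlpha() ? "png" : "jpg";
    }

    public static void main(String[] args) {
        BufferedImage opaque = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        BufferedImage transparent = new BufferedImage(1, 1, BufferedImage.TYPE_INT_ARGB);
        System.out.println(formatFor(opaque));      // prints jpg
        System.out.println(formatFor(transparent)); // prints png
    }
}
```

In the extraction loop, the returned name could be used both as the file extension and as the format argument of ImageIO.write(), and the boolean result of that call is worth checking.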

The output is as follows:

Page 1:
Found image with width 150px and height 150px.

End

Apache PDFBox is a powerful tool. Beyond the features shown above, it has many others worth exploring. If you have any questions about Apache PDFBox or want to learn about more of its features, feel free to ask in the comments, or visit the official website directly: https://pdfbox.apache.org/.