Introduction
The Apache PDFBox library is an open source Java tool for manipulating PDF documents. This project allows creating new PDF documents, manipulating existing PDF documents, and extracting content from PDF documents. Apache PDFBox also includes several command line utilities.
The main functions of Apache PDFBox are as follows:
- Extract Unicode text from PDF files.
- Split a single PDF into multiple files or merge multiple PDF files.
- Extract data from PDF forms or fill out PDF forms.
- Verify that a PDF file is compliant with the PDF/A-1b standard.
- Print PDF files using the standard Java printing API.
- Save PDF as an image file, such as PNG or JPEG.
- Create PDFs from scratch, including embedding fonts and images.
- Digitally sign PDF files.
Import
First, make sure the PDFBox library has been added to your Java project. If you are using Maven, add the following dependency to pom.xml:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.28</version>
</dependency>
This article uses version 2.0.28, and the examples below target the 2.x API.
Talk is cheap. Show me the code.
Create a PDF document
We can create a simple PDF document with the following code:
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class CreatePDF {
    public static void main(String[] args) {
        // create an empty document and add one page to it
        PDDocument document = new PDDocument();
        PDPage page = new PDPage();
        document.addPage(page);
        PDType1Font font = PDType1Font.HELVETICA_BOLD;
        try {
            // write the text onto the page
            PDPageContentStream contentStream = new PDPageContentStream(document, page);
            contentStream.beginText();
            contentStream.setFont(font, 12);
            contentStream.newLineAtOffset(100, 700);
            contentStream.showText("Hello, World!");
            contentStream.endText();
            contentStream.close();
            // save and close the document
            document.save(new File("one-more.pdf"));
            document.close();
            System.out.println("PDF created successfully.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This code snippet creates a new PDF document and writes “Hello, World!” on its first page, using the Helvetica Bold font at size 12.
Next, we show the text on the PDF page and close the PDPageContentStream object using the contentStream.close() method.
Finally, we save the document as a “one-more.pdf” file, then close the PDDocument object. The result is a single page with “Hello, World!” near the top, 100 points from the left edge.
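The position passed to newLineAtOffset(100, 700) is measured in points (1/72 inch) from the bottom-left corner of the page, which is easy to get wrong if you think in "distance from the top". Here is a minimal sketch of that arithmetic in plain Java. It assumes a US Letter page of 612 x 792 points, which is what new PDPage() creates by default; the class and method names are hypothetical helpers, not part of PDFBox.

```java
final class PageCoordinates {
    // US Letter size in points (assumed page size for this sketch)
    static final float LETTER_WIDTH = 612f;
    static final float LETTER_HEIGHT = 792f;

    /** Converts a distance from the top edge into a PDF y coordinate (origin at the bottom). */
    static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    /** Converts inches into points; one inch is 72 points. */
    static float inchesToPoints(float inches) {
        return inches * 72f;
    }

    public static void main(String[] args) {
        // Text placed 92 points below the top edge of a Letter page
        float y = yFromTop(LETTER_HEIGHT, 92f);
        System.out.println(y); // prints 700.0
    }
}
```

So the y value of 700 used above places the text baseline 92 points, roughly 1.3 inches, below the top of the page.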
Read PDF files
We can read the entire contents of a PDF file using the following code:
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ReadPDFExample {
    public static void main(String[] args) {
        // create file object
        File file = new File("one-more.pdf");
        try {
            // create PDF document object
            PDDocument document = PDDocument.load(file);
            // create PDF text stripper
            PDFTextStripper stripper = new PDFTextStripper();
            // get the entire text content of the PDF file
            String text = stripper.getText(document);
            // output the entire text content of the PDF file
            System.out.println(text);
            // close the PDF document object
            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
First, we create a File object, then use the static method load() of the PDDocument class to load the PDF file and create a PDF document object.
Then, we create a PDFTextStripper object and use its getText() method to get the entire text content of the PDF file.
Finally, we output the entire contents of the PDF file and close the PDF document object.
The output is what we wrote earlier:
Hello, World!
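The string returned by getText() is one block of text that typically ends with a line break, so you will often want to split it into lines before doing anything else with it. Here is a minimal sketch in plain Java; the class and method names are hypothetical, and the sample string stands in for real stripper output.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractedText {
    /** Splits extracted text into trimmed lines and drops the blank ones. */
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, ...)
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by stripper.getText(document)
        String text = "Hello, World!" + System.lineSeparator();
        System.out.println(nonBlankLines(text)); // prints [Hello, World!]
    }
}
```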
Insert image
We can use the following code to insert an image in a PDF file:
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class InsertImageInPDF {
    public static void main(String[] args) {
        try {
            // load the PDF file
            PDDocument document = PDDocument.load(new File("one-more.pdf"));
            // get the first page
            PDPage page = document.getPage(0);
            // load the image file
            PDImageXObject image = PDImageXObject.createFromFile("one-more.jpg", document);
            // append to the existing page content and insert the image at the specified position
            PDPageContentStream contentStream =
                    new PDPageContentStream(document, page, AppendMode.APPEND, true, true);
            contentStream.drawImage(image, 200, 500, image.getWidth(), image.getHeight());
            // close the stream
            contentStream.close();
            // save the modified PDF file
            document.save("one-more-jpg.pdf");
            // close document
            document.close();
            System.out.println("PDF created successfully.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In this example, we load a PDF file called “one-more.pdf”, fetch the first page, and load an image file called “one-more.jpg”.
Then, we insert the image at the specified position in the PDF document using the drawImage() method. AppendMode.APPEND keeps the existing page content, so the “Hello, World!” text is preserved.
Finally, we save the modified document to a new file called “one-more-jpg.pdf” and close the document. The resulting page shows the original text with the image added below it.
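The drawImage() call uses the image's pixel dimensions as its size in points, so a large image can run past the edge of the page. Here is a minimal sketch of scaling an image to fit inside a bounding box while keeping its aspect ratio; the resulting width and height are what you would pass as the last two arguments of drawImage(). The helper name and the sample sizes are assumptions for illustration only.

```java
final class ImageFit {
    /**
     * Returns {width, height} scaled down to fit within maxWidth x maxHeight.
     * The aspect ratio is preserved and small images are never enlarged.
     */
    static float[] fitWithin(float imageWidth, float imageHeight,
                             float maxWidth, float maxHeight) {
        float scale = Math.min(1f, Math.min(maxWidth / imageWidth, maxHeight / imageHeight));
        return new float[] { imageWidth * scale, imageHeight * scale };
    }

    public static void main(String[] args) {
        // A 1024 x 512 image squeezed into a 256 x 256 box
        float[] size = fitWithin(1024f, 512f, 256f, 256f);
        System.out.println(size[0] + " x " + size[1]); // prints 256.0 x 128.0
    }
}
```

A 150 x 150 image, like the one used in this article, already fits in such a box and comes back unchanged.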
Extract images
We can extract the images embedded in a PDF file using the following code:
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class ReadPDFImagesExample {
    public static void main(String[] args) {
        try {
            // load the PDF file
            PDDocument document = PDDocument.load(new File("one-more-jpg.pdf"));
            PDPageTree pageTree = document.getPages();
            // loop through each page
            for (PDPage page : pageTree) {
                int pageNum = pageTree.indexOf(page) + 1;
                int count = 1;
                System.out.println("Page " + pageNum + ":");
                // check every XObject in the page resources
                for (COSName xObjectName : page.getResources().getXObjectNames()) {
                    PDXObject pdxObject = page.getResources().getXObject(xObjectName);
                    if (pdxObject instanceof PDImageXObject) {
                        PDImageXObject image = (PDImageXObject) pdxObject;
                        System.out.println("Found image with width " + image.getWidth()
                                + "px and height " + image.getHeight() + "px.");
                        // save the image to the local file system
                        String fileName = "one-more-" + pageNum + "-" + count + ".jpg";
                        ImageIO.write(image.getImage(), "jpg", new File(fileName));
                        count++;
                    }
                }
            }
            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In this example, we use the PDDocument class to load the document from the specified PDF file and iterate through each page to find the images within it.
For each page, we fetch its resources and check each XObject to see whether it is an image.
For every image found, we use the PDImageXObject object to get its properties, such as width and height.
Then, we use ImageIO to save the image to the local file system.
The output is as follows:
Page 1:
Found image with width 150px and height 150px.
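One thing to watch for when saving: ImageIO.write() returns false instead of throwing when no writer can handle the image, and the BufferedImage returned by getImage() may carry an alpha channel, which the JPEG format has no place for. Here is a minimal sketch of a more defensive save step in plain Java. The class and method names are hypothetical, and the blank image in main() stands in for one extracted from a PDF.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import javax.imageio.ImageIO;

final class ImageSaver {
    /** Picks PNG for images with transparency and JPEG otherwise. */
    static String formatFor(BufferedImage image) {
        return image.getColorModel().hasAlpha() ? "png" : "jpg";
    }

    /** Writes the image into the directory and fails loudly if no writer accepts it. */
    static File save(BufferedImage image, File directory, String baseName) throws IOException {
        String format = formatFor(image);
        File target = new File(directory, baseName + "." + format);
        if (!ImageIO.write(image, format, target)) {
            throw new IOException("No ImageIO writer found for format " + format);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for image.getImage(): a blank 150 x 150 image with an alpha channel
        BufferedImage withAlpha = new BufferedImage(150, 150, BufferedImage.TYPE_INT_ARGB);
        File dir = Files.createTempDirectory("pdf-images").toFile();
        System.out.println(save(withAlpha, dir, "one-more-1-1").getName()); // prints one-more-1-1.png
    }
}
```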
End
Apache PDFBox is a powerful tool, and it has many features beyond the ones covered here that are worth exploring. If you have any questions about Apache PDFBox or want to learn about other features, feel free to ask in the comments, or visit the official website directly: https://pdfbox.apache.org/.